Reposted from AI Studio. Original link: High-Resolution Real-Time Matting - RobustVideoMatting (RVM) PaddlePaddle Reproduction - PaddlePaddle AI Studio

Reproducing the RobustVideoMatting (RVM) video matting model with PaddlePaddle

This is author Lin Shanchuan's second paper on video matting; his first paper received a CVPR 2021 Best Student Paper nomination (only three papers were nominated). Paper code link

Lin Shanchuan's first paper was published while he was still a student; the second, RobustVideoMatting, was published at ByteDance. I look forward to his third paper at Microsoft.

After weighing the options, I decided to reproduce his second paper, Robust High-Resolution Video Matting with Temporal Guidance, with PaddlePaddle. Paper homepage, GitHub code, Gitee mirror

PaddlePaddle reproduction: GitHub, Gitee

The paper Robust High-Resolution Video Matting with Temporal Guidance, RVM for short, is designed specifically for robust human video matting. Unlike existing neural networks that process each frame as an independent image, RVM uses a recurrent neural network and keeps temporal memory while processing a video stream. RVM can perform real-time high-resolution matting on arbitrary videos, reaching 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti. This research project comes from ByteDance.

Basic approach and process of the reproduction

I still remember the first time I saw the paper's results: I stumbled on the demo video on Bilibili and was stunned. So this is what video matting can do! It takes the green/blue-screen technique that The Matrix made famous and puts it within everyone's reach; anyone can now play at being a big-shot director. On the spot I quietly set myself a small goal: reproduce it in the PaddlePaddle framework. In the blink of an eye, twenty-one weeks have passed since then. No rainbow without the storm: as a complete beginner in video matting, reproducing the new work of a CVPR Best Paper (nominee) author from scratch was full of twists and turns, even if it was "only" porting the code from PyTorch to PaddlePaddle.

Thanks to the Peng Cheng Laboratory AI platform for providing the compute that supported my debugging!

Overview of the reproduction steps

First attempt: converting the code directly with X2Paddle

X2Paddle can convert a project and its model files in one shot, which is simple and fast. But sometimes too many operators cannot be converted; here the higher-order functions in PyTorch could not be converted, so I eventually abandoned this approach. I later learned that X2Paddle ships ready-made implementations of common models such as ResNet50, already aligned with PyTorch, so for future reproductions it saves a lot of time to check X2Paddle's model zoo first.

X2Paddle conversion is convenient, but the converted code has to carry the X2Paddle library as a dependency, which is sometimes inconvenient.
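For reference, a one-shot conversion with X2Paddle's PyTorch front end looks roughly like this (a sketch based on X2Paddle's documented pytorch2paddle entry point; argument details may differ across versions, and this has to run on the PyTorch side, not on AI Studio):

In [ ]

# Sketch of X2Paddle's trace-based conversion; not used in the final port.
import torch
import torchvision
from x2paddle.convert import pytorch2paddle

torch_model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn([1, 3, 224, 224])
pytorch2paddle(torch_model,
               save_dir="pd_model",           # emits Paddle code + weights here
               jit_type="trace",              # trace the module with the example input
               input_examples=[dummy_input])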

As a fallback, rewrite all the code by hand.

The overall flow: get the ResNet50 model working → the ResNet50 MattingNetwork module → the inference part → the MobileNetV3 model → the MobileNetV3 MattingNetwork → its inference part.

1 Getting the ResNet50 backbone working

Starting with ResNet50 turned out to be the right call: aligning MobileNetV3 later proved far harder than aligning the ResNet50 model.

2 Getting the MattingNetwork working

With the ResNet backbone working, get the MattingNetwork working: take the network apart into its modules, then port and test them one by one. This ipynb mainly demonstrates, on AI Studio, that ResNet50, MattingNetwork, and the inference part all work.

3 Getting the inference module working

The least time-consuming module.

4 Getting the MobileNetV3 backbone and its MattingNetwork working

The most time-consuming module by far; a long story that needs its own project to explain: Aligning the PaddlePaddle MobileNetV3 classification model with PyTorch (the effort-saving version)

RobustVideoMatting in PaddlePaddle

Most of the early steps were debugged on the OpenI Qizhi AI platform, mainly converting the PyTorch code to PaddlePaddle; this AI Studio project mostly handles the later precision alignment, since I'm used to running PaddlePaddle on AI Studio.

Package installation and file downloads

First clone the code: git clone https://github.com/PeterL1n/RobustVideoMatting then enter the directory and install the dependencies: pip install -r requirements_inference.txt

av and PyTorch 1.9 are required. Note that the OpenI Qizhi platform mentioned above ships an older PyTorch by default, so install or upgrade to the versions pinned in requirements_inference.txt.

In [ ]

# Install required packages; ipywidgets still errors after installation and requires a kernel restart
!pip install av tqdm pims ipywidgets

Let the reproduction begin! Though really it's mostly rewriting.

1. Rewriting ResNet50

As mentioned above, X2Paddle has ready-made ResNet50 model code, already aligned with the PyTorch model. But the X2Paddle code could not be debugged smoothly inside AI Studio, and I had reproduced a ResNet50 model myself before, so in the end I chose to port it by hand.

Another reason was that the X2Paddle ResNet50 model appeared misaligned in this project, which prompted the rewrite. Later I went back to the X2Paddle ResNet50 code after all, but replaced everything X2Paddle-specific with pure PaddlePaddle code so it runs smoothly on AI Studio.

When the reproduction was finally done, I at last understood the initial misalignment: it came down to a single parameter setting. Both the ready-made X2Paddle version and my own earlier ResNet50 were in fact aligned with PyTorch; I simply hadn't paid attention to that parameter at the time.

1.1 ReLU and constant_init_()

In [ ]

# Hand-patch a few bits of x2paddle code so they run on AI Studio
import paddle
class ReLU(paddle.nn.ReLU):
    def __init__(self, inplace=False):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        if self.inplace:
            out = paddle.nn.functional.relu_(x)
        else:
            out = super().forward(x)
        return out

def constant_init_(param, val):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=paddle.nn.initializer.Assign(
            paddle.full(param.shape, val, param.dtype)))
    paddle.assign(replaced_param, param)
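A quick sanity check of the two patches (my own snippet, not from the original project):

In [ ]

conv = paddle.nn.Conv2D(3, 8, kernel_size=3)
relu = ReLU(inplace=True)                 # exercise the in-place branch
y = relu(conv(paddle.randn([1, 3, 16, 16])))
print(float(y.min()) >= 0)                # True: ReLU output is non-negative
constant_init_(conv.bias, 0.5)            # overwrite the bias with a constant
print(conv.bias.numpy()[:3])              # [0.5 0.5 0.5]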

1.2 PaddleDtypes()

In [ ]

# Copied verbatim from x2paddle's paddle_dtypes, because x2paddle sometimes acts up
# -*- coding:UTF-8 -*-
# Copyright (c) 2021  PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle


def string(param):
    """ 生成字符串。
    """
    return "\'{}\'".format(param)


def check_version():
    version = paddle.__version__
    v0, v1, v2 = version.split('.')
    if not ((v0 == '0' and v1 == '0' and v2 == '0') or
            (int(v0) >= 2 and int(v1) >= 1)):
        return False
    else:
        return True


class PaddleDtypes():
    def __init__(self, is_new_version=True):
        if is_new_version:
            self.t_float16 = paddle.float16
            self.t_float32 = paddle.float32
            self.t_float64 = paddle.float64
            self.t_uint8 = paddle.uint8
            self.t_int8 = paddle.int8
            self.t_int16 = paddle.int16
            self.t_int32 = paddle.int32
            self.t_int64 = paddle.int64
            self.t_bool = paddle.bool
        else:
            self.t_float16 = "paddle.fluid.core.VarDesc.VarType.FP16"
            self.t_float32 = "paddle.fluid.core.VarDesc.VarType.FP32"
            self.t_float64 = "paddle.fluid.core.VarDesc.VarType.FP64"
            self.t_uint8 = "paddle.fluid.core.VarDesc.VarType.UINT8"
            self.t_int8 = "paddle.fluid.core.VarDesc.VarType.INT8"
            self.t_int16 = "paddle.fluid.core.VarDesc.VarType.INT16"
            self.t_int32 = "paddle.fluid.core.VarDesc.VarType.INT32"
            self.t_int64 = "paddle.fluid.core.VarDesc.VarType.INT64"
            self.t_bool = "paddle.fluid.core.VarDesc.VarType.BOOL"


is_new_version = check_version()
paddle_dtypes = PaddleDtypes(is_new_version)
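A quick check of the helper on the current environment (my own snippet):

In [ ]

print(check_version())           # True on paddle >= 2.1, per the check above
print(paddle_dtypes.t_float32)   # paddle.float32 on a new-version install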

1.3 Kaiming initialization

In [ ]

# Standalone Kaiming initialization
# Copyright (c) 2021  PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
from functools import reduce
import paddle
from paddle.fluid import framework
from paddle.fluid import unique_name  # used by the fp16 branches below
from paddle.fluid.core import VarDesc
from paddle.fluid.initializer import XavierInitializer, MSRAInitializer
from paddle.fluid.data_feeder import check_variable_and_dtype
# from x2paddle.utils import paddle_dtypes


def _calculate_fan_in_and_fan_out(var):
    dimensions = var.dim()
    if dimensions < 2:
        raise ValueError(
            "Fan in and fan out can not be computed for var with fewer than 2 dimensions"
        )
    num_input_fmaps = var.shape[0]
    num_output_fmaps = var.shape[1]
    receptive_field_size = 1
    if var.dim() > 2:
        receptive_field_size = reduce(lambda x, y: x * y, var.shape[2:])
    fan_in = num_input_fmaps * receptive_field_size
    fan_out = num_output_fmaps * receptive_field_size
    return fan_in, fan_out


def _calculate_correct_fan(var, mode):
    mode = mode.lower()
    valid_modes = ['fan_in', 'fan_out']
    if mode not in valid_modes:
        raise ValueError("Mode {} not supported, please use one of {}".format(
            mode, valid_modes))
    fan_in, fan_out = _calculate_fan_in_and_fan_out(var)
    return fan_in if mode == 'fan_in' else fan_out


def _calculate_gain(nonlinearity, param=None):
    linear_fns = [
        'linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d',
        'conv_transpose2d', 'conv_transpose3d'
    ]
    if nonlinearity in linear_fns or nonlinearity == 'sigmoid':
        return 1
    elif nonlinearity == 'tanh':
        return 5.0 / 3
    elif nonlinearity == 'relu':
        return math.sqrt(2.0)
    elif nonlinearity == 'leaky_relu':
        if param is None:
            negative_slope = 0.01
        elif not isinstance(param, bool) and isinstance(
                param, int) or isinstance(param, float):
            # True/False are instances of int, hence check above
            negative_slope = param
        else:
            raise ValueError("negative_slope {} not a valid number".format(
                param))
        return math.sqrt(2.0 / (1 + negative_slope**2))
    elif nonlinearity == 'selu':
        return 3.0 / 4  # Value found empirically (https://github.com/pytorch/pytorch/pull/50664)
    else:
        raise ValueError("Unsupported nonlinearity {}".format(nonlinearity))


class KaimingNormal(MSRAInitializer):
    def __init__(self, a=0, mode='fan_in', nonlinearity='leaky_relu'):
        super(KaimingNormal, self).__init__(uniform=False, fan_in=None, seed=0)
        self.a = a
        self.mode = mode
        self.nonlinearity = nonlinearity

    def __call__(self, var, block=None):
        """Initialize the input tensor with MSRA initialization.
        Args:
            var(Tensor): Tensor that needs to be initialized.
            block(Block, optional): The block in which initialization ops
                   should be added. Used in static graph only, default None.
        Returns:
            The initialization op
        """
        block = self._check_block(block)

        assert isinstance(var, framework.Variable)
        assert isinstance(block, framework.Block)
        f_in, f_out = self._compute_fans(var)

        if self._seed == 0:
            self._seed = block.program.random_seed

        # to be compatible with fp16 initializers
        if var.dtype == paddle_dtypes.t_float16:
            out_dtype = paddle_dtypes.t_float32
            out_var = block.create_var(
                name=unique_name.generate(".".join(
                    ['masra_init', var.name, 'tmp'])),
                shape=var.shape,
                dtype=out_dtype,
                type=VarDesc.VarType.LOD_TENSOR,
                persistable=False)
        else:
            out_dtype = var.dtype
            out_var = var

        fan = _calculate_correct_fan(var, self.mode)
        gain = _calculate_gain(self.nonlinearity, self.a)
        std = gain / math.sqrt(fan)
        op = block._prepend_op(
            type="gaussian_random",
            outputs={"Out": out_var},
            attrs={
                "shape": out_var.shape,
                "dtype": int(out_dtype),
                "mean": 0.0,
                "std": std,
                "seed": self._seed
            },
            stop_gradient=True)

        if var.dtype == VarDesc.VarType.FP16:
            block.append_op(
                type="cast",
                inputs={"X": out_var},
                outputs={"Out": var},
                attrs={"in_dtype": out_var.dtype,
                       "out_dtype": var.dtype})

        if not framework.in_dygraph_mode():
            var.op = op
        return op


def kaiming_normal_(param, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=KaimingNormal(
            a=a, mode=mode, nonlinearity=nonlinearity))
    paddle.assign(replaced_param, param)


class XavierNormal(XavierInitializer):
    def __init__(self, gain=1.0):
        super(XavierNormal, self).__init__(
            uniform=True, fan_in=None, fan_out=None, seed=0)
        self._gain = gain

    def __call__(self, var, block=None):
        block = self._check_block(block)
        assert isinstance(block, framework.Block)
        check_variable_and_dtype(var, "Out", ["float16", "float32", "float64"],
                                 "xavier_init")

        fan_in, fan_out = _calculate_fan_in_and_fan_out(var)

        if self._seed == 0:
            self._seed = block.program.random_seed

        # to be compatible with fp16 initializers
        if var.dtype == paddle_dtypes.t_float16:
            out_dtype = paddle_dtypes.t_float32
            out_var = block.create_var(
                name=unique_name.generate(".".join(
                    ['xavier_init', var.name, 'tmp'])),
                shape=var.shape,
                dtype=out_dtype,
                type=VarDesc.VarType.LOD_TENSOR,
                persistable=False)
        else:
            out_dtype = var.dtype
            out_var = var

        std = self._gain * math.sqrt(2.0 / float(fan_in + fan_out))
        op = block._prepend_op(
            type="uniform_random",
            inputs={},
            outputs={"Out": out_var},
            attrs={
                "shape": out_var.shape,
                "dtype": out_dtype,
                "min": 0,
                "max": std,
                "seed": self._seed
            },
            stop_gradient=True)
        if var.dtype == paddle_dtypes.t_float16:
            block.append_op(
                type="cast",
                inputs={"X": out_var},
                outputs={"Out": var},
                attrs={"in_dtype": out_var.dtype,
                       "out_dtype": var.dtype})
        if not framework.in_dygraph_mode():
            var.op = op
        return op


def xavier_normal_(param, gain=1.0):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=XavierNormal(gain=gain))
    paddle.assign(replaced_param, param)


class XavierUniform(XavierInitializer):
    def __init__(self, gain=1.0):
        super(XavierUniform, self).__init__(
            uniform=True, fan_in=None, fan_out=None, seed=0)
        self._gain = gain

    def __call__(self, var, block=None):
        block = self._check_block(block)
        assert isinstance(block, framework.Block)
        check_variable_and_dtype(var, "Out", ["float16", "float32", "float64"],
                                 "xavier_init")

        fan_in, fan_out = _calculate_fan_in_and_fan_out(var)

        if self._seed == 0:
            self._seed = block.program.random_seed

        # to be compatible with fp16 initializers
        if var.dtype == paddle_dtypes.t_float16:
            out_dtype = paddle_dtypes.t_float32
            out_var = block.create_var(
                name=unique_name.generate(".".join(
                    ['xavier_init', var.name, 'tmp'])),
                shape=var.shape,
                dtype=out_dtype,
                type=VarDesc.VarType.LOD_TENSOR,
                persistable=False)
        else:
            out_dtype = var.dtype
            out_var = var

        std = self._gain * math.sqrt(2.0 / float(fan_in + fan_out))
        limit = math.sqrt(3.0) * std
        op = block._prepend_op(
            type="uniform_random",
            inputs={},
            outputs={"Out": out_var},
            attrs={
                "shape": out_var.shape,
                "dtype": out_dtype,
                "min": -limit,
                "max": limit,
                "seed": self._seed
            },
            stop_gradient=True)
        if var.dtype == paddle_dtypes.t_float16:
            block.append_op(
                type="cast",
                inputs={"X": out_var},
                outputs={"Out": var},
                attrs={"in_dtype": out_var.dtype,
                       "out_dtype": var.dtype})
        if not framework.in_dygraph_mode():
            var.op = op
        return op


def xavier_uniform_(param, gain=1.0):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=XavierUniform(gain=gain))
    paddle.assign(replaced_param, param)


def constant_init_(param, val):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=paddle.nn.initializer.Assign(
            paddle.full(param.shape, val, param.dtype)))
    paddle.assign(replaced_param, param)


def normal_init_(param, mean=0.0, std=1.0):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=paddle.nn.initializer.Assign(
            paddle.normal(
                mean=mean, std=std, shape=param.shape)))
    paddle.assign(replaced_param, param)


def ones_init_(param):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=paddle.nn.initializer.Assign(
            paddle.ones(param.shape, param.dtype)))
    paddle.assign(replaced_param, param)


def zeros_init_(param):
    replaced_param = paddle.create_parameter(
        shape=param.shape,
        dtype=param.dtype,
        default_initializer=paddle.nn.initializer.Assign(
            paddle.zeros(param.shape, param.dtype)))
    paddle.assign(replaced_param, param)
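In isolation, the helpers are used the same way the ResNet code below uses them (my own snippet):

In [ ]

conv = paddle.nn.Conv2D(3, 16, kernel_size=3)
kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
bn = paddle.nn.BatchNorm2D(16)
ones_init_(bn.weight)                 # BN scale set to 1
zeros_init_(bn.bias)                  # BN shift set to 0
print(float(conv.weight.std()))       # std close to sqrt(2 / fan_out)
print(float(bn.weight.mean()), float(bn.bias.mean()))  # 1.0 0.0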

1.5 ResNet main code

In [ ]

# Use the x2paddle code, a faithful copy of torch's ResNet, so the parameters line up
import paddle
import paddle.nn as nn
from paddle import Tensor
from paddle.utils.download import get_weights_path_from_url
from typing import Type, Any, Callable, Union, List, Optional
# from x2paddle import storch2paddle

__all__ = [
    'ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152',
    'resnext50_32x4d', 'resnext101_32x8d', 'wide_resnet50_2', 'wide_resnet101_2'
]

model_urls = {
    'resnet18':
    'https://x2paddle.bj.bcebos.com/vision/models/resnet18-pt.pdparams',
    'resnet34':
    'https://x2paddle.bj.bcebos.com/vision/models/resnet34-pt.pdparams',
    'resnet50':
    'https://x2paddle.bj.bcebos.com/vision/models/resnet50-pt.pdparams',
    'resnet101':
    'https://x2paddle.bj.bcebos.com/vision/models/resnet101-pt.pdparams',
    'resnet152':
    'https://x2paddle.bj.bcebos.com/vision/models/resnet152-pt.pdparams',
    'resnext50_32x4d':
    'https://x2paddle.bj.bcebos.com/vision/models/resnext50_32x4d-pt.pdparams',
    'resnext101_32x8d':
    'https://x2paddle.bj.bcebos.com/vision/models/resnext101_32x8d-pt.pdparams',
    'wide_resnet50_2':
    'https://x2paddle.bj.bcebos.com/vision/models/wide_resnet50_2-pt.pdparams',
    'wide_resnet101_2':
    'https://x2paddle.bj.bcebos.com/vision/models/wide_resnet101_2-pt.pdparams',
}


def conv3x3(in_planes: int,
            out_planes: int,
            stride: int=1,
            groups: int=1,
            dilation: int=1) -> nn.Conv2D:
    """3x3 convolution with padding"""
    return nn.Conv2D(
        in_planes,
        out_planes,
        kernel_size=3,
        stride=stride,
        padding=dilation,
        groups=groups,
        bias_attr=False,
        dilation=dilation)


def conv1x1(in_planes: int, out_planes: int, stride: int=1) -> nn.Conv2D:
    """1x1 convolution"""
    return nn.Conv2D(
        in_planes, out_planes, kernel_size=1, stride=stride, bias_attr=False)


class BasicBlock(nn.Layer):
    expansion: int = 1

    def __init__(self,
                 inplanes: int,
                 planes: int,
                 stride: int=1,
                 downsample: Optional[nn.Layer]=None,
                 groups: int=1,
                 base_width: int=64,
                 dilation: int=1,
                 norm_layer: Optional[Callable[..., nn.Layer]]=None) -> None:
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D
        if groups != 1 or base_width != 64:
            raise ValueError(
                'BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError(
                "Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = ReLU(True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class Bottleneck(nn.Layer):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion: int = 4

    def __init__(self,
                 inplanes: int,
                 planes: int,
                 stride: int=1,
                 downsample: Optional[nn.Layer]=None,
                 groups: int=1,
                 base_width: int=64,
                 dilation: int=1,
                 norm_layer: Optional[Callable[..., nn.Layer]]=None) -> None:
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = ReLU(True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Layer):
    def __init__(self,
                 block: Type[Union[BasicBlock, Bottleneck]],
                 layers: List[int],
                 num_classes: int=1000,
                 zero_init_residual: bool=False,
                 groups: int=1,
                 width_per_group: int=64,
                 replace_stride_with_dilation: Optional[List[bool]]=None,
                 norm_layer: Optional[Callable[..., nn.Layer]]=None) -> None:
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(
                                 replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2D(
            3,
            self.inplanes,
            kernel_size=7,
            stride=2,
            padding=3,
            bias_attr=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = ReLU(True)
        self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(
            block,
            128,
            layers[1],
            stride=2,
            dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(
            block,
            256,
            layers[2],
            stride=2,
            dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(
            block,
            512,
            layers[3],
            stride=2,
            dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2D((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.sublayers():
            if isinstance(m, nn.Conv2D):
                kaiming_normal_(
                    m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2D, nn.GroupNorm)):
                constant_init_(m.weight, 1)
                constant_init_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.sublayers():
                if isinstance(m, Bottleneck):
                    constant_init_(m.bn3.weight, 0)  # type: ignore[arg-type]
                elif isinstance(m, BasicBlock):
                    constant_init_(m.bn2.weight, 0)  # type: ignore[arg-type]

    def _make_layer(self,
                    block: Type[Union[BasicBlock, Bottleneck]],
                    planes: int,
                    blocks: int,
                    stride: int=1,
                    dilate: bool=False) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion), )

        layers = []
        layers.append(
            block(self.inplanes, planes, stride, downsample, self.groups,
                  self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(
                block(
                    self.inplanes,
                    planes,
                    groups=self.groups,
                    base_width=self.base_width,
                    dilation=self.dilation,
                    norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = paddle.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def _resnet(arch: str,
            block: Type[Union[BasicBlock, Bottleneck]],
            layers: List[int],
            pretrained: bool,
            **kwargs: Any) -> ResNet:
    model = ResNet(block, layers, **kwargs)
    if pretrained:
        state_dict = paddle.load(get_weights_path_from_url(model_urls[arch]))
        model.load_dict(state_dict)
    return model


def resnet18(pretrained: bool=False, progress: bool=True,
             **kwargs: Any) -> ResNet:
    r"""ResNet-18 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, **kwargs)


def resnet34(pretrained: bool=False, progress: bool=True,
             **kwargs: Any) -> ResNet:
    r"""ResNet-34 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet34', BasicBlock, [3, 4, 6, 3], pretrained, **kwargs)


def resnet50(pretrained: bool=False, progress: bool=True,
             **kwargs: Any) -> ResNet:
    r"""ResNet-50 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, **kwargs)


def resnet101(pretrained: bool=False, progress: bool=True,
              **kwargs: Any) -> ResNet:
    r"""ResNet-101 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet101', Bottleneck, [3, 4, 23, 3], pretrained, **kwargs)


def resnet152(pretrained: bool=False, progress: bool=True,
              **kwargs: Any) -> ResNet:
    r"""ResNet-152 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet152', Bottleneck, [3, 8, 36, 3], pretrained, **kwargs)


def resnext50_32x4d(pretrained: bool=False, progress: bool=True,
                    **kwargs: Any) -> ResNet:
    r"""ResNeXt-50 32x4d model from
    `"Aggregated Residual Transformation for Deep Neural Networks" <https://arxiv.org/pdf/1611.05431.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 4
    return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3], pretrained,
                   **kwargs)


def resnext101_32x8d(pretrained: bool=False, progress: bool=True,
                     **kwargs: Any) -> ResNet:
    r"""ResNeXt-101 32x8d model from
    `"Aggregated Residual Transformation for Deep Neural Networks" <https://arxiv.org/pdf/1611.05431.pdf>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 8
    return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3], pretrained,
                   **kwargs)


def wide_resnet50_2(pretrained: bool=False, progress: bool=True,
                    **kwargs: Any) -> ResNet:
    r"""Wide ResNet-50-2 model from
    `"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.
    The model is the same as ResNet except for the bottleneck number of channels
    which is twice larger in every block. The number of channels in outer 1x1
    convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048
    channels, and in Wide ResNet-50-2 has 2048-1024-2048.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3], pretrained,
                   **kwargs)


def wide_resnet101_2(pretrained: bool=False, progress: bool=True,
                     **kwargs: Any) -> ResNet:
    r"""Wide ResNet-101-2 model from
    `"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.
    The model is the same as ResNet except for the bottleneck number of channels
    which is twice larger in every block. The number of channels in outer 1x1
    convolutions is the same, e.g. last block in ResNet-50 has 2048-512-2048
    channels, and in Wide ResNet-50-2 has 2048-1024-2048.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet101_2', Bottleneck, [3, 4, 23, 3], pretrained,
                   **kwargs)

Because at one point it would not align with the original torch ResNet50, I also wrote a hand-crafted version to try.

In [ ]

# Hand-rewritten resnet; in the end the x2paddle version was used instead.
import paddle
import paddle.nn as nn
from typing import Type, Any, Callable, Union, List, Optional

from paddle import Tensor
from paddle.nn import functional as F
from typing import Tuple, Optional
from paddle.vision.models.resnet import BottleneckBlock as Bottleneck  # if this fails to align, rewrite it following torch, or use the x2paddle code



# basicblock
class Identity(nn.Layer):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x


class BasicBlock(nn.Layer):
    def __init__(self, in_dim, out_dim, stride):
        super().__init__()
        ## added code
        self.conv1 = nn.Conv2D(in_channels=in_dim, 
            out_channels=out_dim,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias_attr=False
            )

        self.bn1 = nn.BatchNorm2D(out_dim)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2D(in_channels=out_dim, 
            out_channels=out_dim,
            kernel_size=3,
            stride=1,
            padding=1,
            bias_attr=False
            )

        self.bn2 = nn.BatchNorm2D(out_dim)

        if stride != 1 or in_dim != out_dim:
            self.downsample = nn.Sequential(*[
                nn.Conv2D(in_dim,out_dim,1,stride=stride),
                nn.BatchNorm2D(out_dim)
            ])
        else:
            self.downsample=Identity()


    def forward(self, x):
        ## added code
        h = x
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)

        identity = self.downsample(h)
        x = x+identity
        x = self.relu(x)

        return x


# Bottleneck = BottleneckBlock
class ResNet(nn.Layer):

    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        groups: int = 1,
        width_per_group: int = 64,
        replace_stride_with_dilation: Optional[List[bool]] = None,
        norm_layer: Optional[Callable[..., nn.Layer]] = None
    ) -> None:
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        # self.conv1 = nn.Conv2D(in_channels=3, 
        #     out_channels=in_dim,
        #     kernel_size=3,
        #     stride=1,
        #     padding=1,
        #     bias_attr=False
        #     )

        self.conv1 = nn.Conv2D(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias_attr=False)
        self.bn1 = norm_layer(self.inplanes)
        # self.relu = nn.ReLU(inplace=True)
        self.relu = nn.ReLU()
        
        self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        # self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.avgpool = nn.AdaptiveAvgPool2D(1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # initialization skipped for now
        # for m in self.children():
        #     if isinstance(m, nn.Conv2D):
        #         nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        #     elif isinstance(m, (nn.BatchNorm2D, nn.GroupNorm)):
        #         nn.init.constant_(m.weight, 1)
        #         nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        # initialization skipped for now
        # if zero_init_residual:
        #     for m in self.modules():
        #         if isinstance(m, Bottleneck):
        #             nn.init.constant_(m.bn3.weight, 0)  # type: ignore[arg-type]
        #         elif isinstance(m, BasicBlock):
        #             nn.init.constant_(m.bn2.weight, 0)  # type: ignore[arg-type]

    def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int,
                    stride: int = 1, dilate: bool = False) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        # if stride != 1 or self.inplanes != planes * block.expansion:
        #     downsample = nn.Sequential(
        #         conv1x1(self.inplanes, planes * block.expansion, stride),
        #         norm_layer(planes * block.expansion),
        #     )
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(*[
                nn.Conv2D(self.inplanes,planes * block.expansion,1,stride=stride),
                nn.BatchNorm2D(planes * block.expansion)])
            
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = paddle.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)

1.6 Testing that PaddlePaddle's ResNet50 aligns with PyTorch

Load the identical saved weights into both, compare the outputs, and confirm they match (error below 1e-4). The PyTorch side of the test cannot run on AI Studio, so use a local machine or another AI platform.

First, the output shape was compared against PyTorch's ResNet model and passed. There's a trap here: since ResNet50 always outputs [batch_size, 1000], a model whose internals were not aligned still slipped through the shape check. Once the problem surfaced, I went back and fixed the ResNet50 code.

In [ ]

a = paddle.randn([2,3,224,224])
model = ResNet(block=Bottleneck, layers=[3, 4, 6, 3], replace_stride_with_dilation=[False, False, True])
tmp = model(a)
print(a.shape, tmp.shape)

Structural alignment against PyTorch's ResNet model: passed.

At first the model structure refused to align no matter what; it finally turned out to be the replace_stride_with_dilation parameter, which the paper's code sets to a non-default value. This detour cost real time; only tracing back through the code revealed it was a parameter issue.

In [ ]

# model structure alignment
import numpy as np
paddlemodel = ResNet(block=Bottleneck, layers=[3, 4, 6, 3], replace_stride_with_dilation=[False, False, True])
img = np.ones([1,3,224,224]).astype('float32')
img = paddle.to_tensor(img)
model = paddlemodel
paddle.summary(model, input=img)

Testing ResNet50 forward alignment: with identical inputs, the outputs should match. Passed.

In [ ]

# test resnet50 alignment; passed!
import numpy as np
from collections import OrderedDict
paddlemodel = ResNet(block=Bottleneck, layers=[3, 4, 6, 3], replace_stride_with_dilation=[False, False, True])
img = np.ones([1,3,224,224]).astype('float32')
img = paddle.to_tensor(img)
model = paddlemodel

modelpath = "work/resnet50.pdparams"
def copyStateDict(state_dict):
    if list(state_dict.keys())[0].startswith('module'):
        start_idx = 1
    else:
        start_idx = 0
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = '.'.join(k.split('.')[start_idx:])
        new_state_dict[name] = v
    return new_state_dict

# model.load_state_dict(copyStateDict(paddle.load(modelpath)))
model.set_state_dict(paddle.load(modelpath))


# model.set_state_dict(para_state_dict)
model.eval()

out = model(img)

print (out)
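To go beyond a shape check, compare the numbers against a reference dumped on the PyTorch side. A minimal sketch (my own; torch_resnet50_out.npy is a hypothetical reference file saved from the PyTorch run with the same all-ones input):

In [ ]

# On the PyTorch side (local machine), something like:
#   np.save("torch_resnet50_out.npy", torch_model(torch_img).detach().numpy())
ref = np.load("torch_resnet50_out.npy")   # hypothetical reference dump
diff = np.abs(out.numpy() - ref).max()
print("max abs diff:", diff)              # the criterion above: below 1e-4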

1.7 Testing that PaddlePaddle's ResNet50Encoder aligns with PyTorch

Here mainly the output shapes were compared.

In [ ]

import paddle
from paddle import nn
# from paddle.vision.models.resnet import BottleneckBlock
# from paddle.vision.models import ResNet
# from torchvision.models.resnet import ResNet, Bottleneck

class ResNet50Encoder(ResNet):
    def __init__(self, pretrained: bool = False):
        super().__init__(
            block=Bottleneck,
            layers=[3, 4, 6, 3],
            replace_stride_with_dilation=[False, False, True],
            norm_layer=None)
        
        if pretrained:
            # self.load_state_dict(torch.hub.load_state_dict_from_url(
            #     'https://download.pytorch.org/models/resnet50-0676ba61.pth'))
            load_weight = paddle.load("rvm_resnet50.pdparams")
            self.set_state_dict(load_weight)
            
        del self.avgpool
        del self.fc
        
    def forward_single_frame(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        f1 = x  # 1/2
        x = self.maxpool(x)
        x = self.layer1(x)
        f2 = x  # 1/4
        x = self.layer2(x)
        f3 = x  # 1/8
        x = self.layer3(x)
        x = self.layer4(x)
        f4 = x  # 1/16
        return [f1, f2, f3, f4]
    
    def forward_time_series(self, x):
        B, T = x.shape[:2]
        features = self.forward_single_frame(x.flatten(0, 1))
        
        # print("==before unflatten features", B, T, len(features), features[0].shape, features[0].shape)
        # for i,j in enumerate(features):
        #     print(i,j.shape)
        # tmpshape = [B, T] + features[0].shape[1:]
        # features = [f.unflatten(0, (B, T)) for f in features]
        features = [f.reshape([B, T] + f.shape[1:]) for f in features]
        # print("==after unflatten features", len(features), features[0].shape, features[0].shape)
        return features
    
    def forward(self, x):
        if x.ndim == 5:
            return self.forward_time_series(x)
        else:
            return self.forward_single_frame(x)
# a = paddle.randn((2, 3, 224, 224))
# testmodel = ResNet50Encoder()
# tmp = testmodel(a)
# print(len(tmp))

In [ ]

a = paddle.randn((2, 3, 3, 244, 244))
testmodel = ResNet50Encoder()
tmp = testmodel(a)
print(len(tmp))

In [ ]

for i in tmp :
    print(i.shape)

ResNet50Encoder passed verification with matching output shapes. Along the way, 5-D input initially failed with this error:

/tmp/ipykernel_101/852320671.py in <listcomp>(.0)
     38         B, T = x.shape[:2]
     39         features = self.forward_single_frame(x.flatten(0, 1))
---> 40         features = [f.unflatten(0, (B, T)) for f in features]
     41         return features
     42 
AttributeError: 'Tensor' object has no attribute 'unflatten'

After rewriting the unflatten statement as below, 5-D input passes too:

# features = [f.unflatten(0, (B, T)) for f in features]
features = [f.reshape([B, T] + f.shape[1:]) for f in features]
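A quick check (my own snippet) that this reshape is equivalent to torch's unflatten(0, (B, T)), i.e. it splits the fused batch-time dimension back into [B, T, ...]:

In [ ]

B, T = 2, 3
f = paddle.randn([B * T, 64, 14, 14])         # fused batch-time layout
g = f.reshape([B, T] + f.shape[1:])           # the unflatten replacement
print(g.shape)                                # [2, 3, 64, 14, 14]
print(bool((g[1, 2] == f[1 * T + 2]).all()))  # True: same element order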

2. Getting the MattingNetwork working

The paper's MattingNetwork is taken apart into its modules, each reproduced and tested separately.

2.1 Rewriting LRASPP

By reading the source and setting breakpoints in the PyTorch program, I captured the shapes of LRASPP's input and output tensors for comparison:

f1, f2, f3, f4 = self.backbone(src_sm) torch.Size([2, 64, 112, 112]) torch.Size([2, 256, 56, 56]) torch.Size([2, 512, 28, 28]) torch.Size([2, 2048, 14, 14])

aspp = LRASPP(960, 128)

f4 = self.aspp(f4) torch.Size([2, 256, 14, 14])

In [ ]

class LRASPP(nn.Layer):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.aspp1 = nn.Sequential(
            nn.Conv2D(in_channels, out_channels, 1, bias_attr=False),
            nn.BatchNorm2D(out_channels),
            nn.ReLU()
        )
        self.aspp2 = nn.Sequential(
            nn.AdaptiveAvgPool2D(1),
            nn.Conv2D(in_channels, out_channels, 1, bias_attr=False),
            nn.Sigmoid()
        )
        
    def forward_single_frame(self, x):
        return self.aspp1(x) * self.aspp2(x)
    
    def forward_time_series(self, x):
        B, T = x.shape[:2]
        # x = self.forward_single_frame(x.flatten(0, 1)).unflatten(0, (B, T))
        x = self.forward_single_frame(x.flatten(0, 1))
        x = x.reshape([B, T]+x.shape[1:])
        
        return x
    
    def forward(self, x):
        # print("x.ndim =", x.ndim)
        if x.ndim == 5:
            # print("self.forward_time_series(x)", self.forward_time_series(x).shape)
            return self.forward_time_series(x)
        else:
            return self.forward_single_frame(x)

In [ ]

a = paddle.randn((2, 2048, 14, 14))
print(a.max())
testmodel = LRASPP(2048, 256)
tmp = testmodel(a)
print(len(tmp))
for i in tmp:
    print(i.shape)
    # print(i.max(2))

LRASPP done.

Same pattern as before: the 4-D path worked first while 5-D input failed, mainly for lack of "unflatten"; again a reshape stands in for unflatten and solves it.

2.2 Rewriting RecurrentDecoder and Projection

self.decoder = RecurrentDecoder([64, 256, 512, 256], [128, 64, 32, 16])

The decoder's input: f4 = self.aspp(f4) torch.Size([2, 256, 14, 14])

The call: hid, *rec = self.decoder(src_sm, f1, f2, f3, f4, r1, r2, r3, r4)

The outputs are approximately x0, r1, r2, r3, r4: torch.Size([2, 16, 224, 224]) torch.Size([2, 16, 112, 112]) torch.Size([2, 32, 56, 56]) torch.Size([2, 64, 28, 28]) torch.Size([2, 128, 14, 14])

In [ ]

class RecurrentDecoder(nn.Layer):
    def __init__(self, feature_channels, decoder_channels):
        super().__init__()
        self.avgpool = AvgPool()
        self.decode4 = BottleneckBlock(feature_channels[3])
        self.decode3 = UpsamplingBlock(feature_channels[3], feature_channels[2], 3, decoder_channels[0])
        self.decode2 = UpsamplingBlock(decoder_channels[0], feature_channels[1], 3, decoder_channels[1])
        self.decode1 = UpsamplingBlock(decoder_channels[1], feature_channels[0], 3, decoder_channels[2])
        self.decode0 = OutputBlock(decoder_channels[2], 3, decoder_channels[3])

    def forward(self,
                s0: Tensor, f1: Tensor, f2: Tensor, f3: Tensor, f4: Tensor,
                r1: Optional[Tensor], r2: Optional[Tensor],
                r3: Optional[Tensor], r4: Optional[Tensor]):
        s1, s2, s3 = self.avgpool(s0)
        x4, r4 = self.decode4(f4, r4)
        x3, r3 = self.decode3(x4, f3, s3, r3)
        x2, r2 = self.decode2(x3, f2, s2, r2)
        x1, r1 = self.decode1(x2, f1, s1, r1)
        x0 = self.decode0(x1, s0)
        # print("x0, r1, r2, r3, r4", x0.shape, r1.shape,r2.shape,r3.shape, r4.shape)
        return x0, r1, r2, r3, r4
    

class AvgPool(nn.Layer):
    def __init__(self):
        super().__init__()
        self.avgpool = nn.AvgPool2D(2, 2, exclusive=False, ceil_mode=True)  # count_include_pad exclusive
        
    def forward_single_frame(self, s0):
        s1 = self.avgpool(s0)
        s2 = self.avgpool(s1)
        s3 = self.avgpool(s2)
        return s1, s2, s3
    
    def forward_time_series(self, s0):
        B, T = s0.shape[:2]
        s0 = s0.flatten(0, 1)
        s1, s2, s3 = self.forward_single_frame(s0)
        # s1 = s1.unflatten(0, (B, T))
        # s2 = s2.unflatten(0, (B, T))
        # s3 = s3.unflatten(0, (B, T))
        s1 = s1.reshape([B, T] + s1.shape[1:])
        s2 = s2.reshape([B, T] + s2.shape[1:])
        s3 = s3.reshape([B, T] + s3.shape[1:])
        return s1, s2, s3
    
    def forward(self, s0):
        if s0.ndim == 5:
            return self.forward_time_series(s0)
        else:
            return self.forward_single_frame(s0)


class BottleneckBlock(nn.Layer):
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.gru = ConvGRU(channels // 2)
        
    def forward(self, x, r: Optional[Tensor]):
        # print("a, b = x.split(self.channels // 2, axis=-3)", self.channels // 2)
        a = []
        # a, b = x.split(self.channels // 2, axis=-3) # dim
        a, b = x.split(2, axis=-3) 
        # print(len(a))
        b, r = self.gru(b, r)
        x = paddle.concat([a, b], axis=-3) # cat concat
        return x, r

    
class UpsamplingBlock(nn.Layer):
    def __init__(self, in_channels, skip_channels, src_channels, out_channels):
        super().__init__()
        self.out_channels = out_channels
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2D(in_channels + skip_channels + src_channels, out_channels, 3, 1, 1, bias_attr=False),
            nn.BatchNorm2D(out_channels),
            nn.ReLU(),
        )
        self.gru = ConvGRU(out_channels // 2)

    def forward_single_frame(self, x, f, s, r: Optional[Tensor]):
        x = self.upsample(x)
        x = x[:, :, :s.shape[2], :s.shape[3]]
        x = paddle.concat([x, f, s], axis=1) #torch.cat paddle.concat dim - axis

        x = self.conv(x)
        # a, b = x.split(self.out_channels // 2, axis=1)
        a, b = x.split(2, axis=1)
        
        b, r = self.gru(b, r)
        x = paddle.concat([a, b], axis=1)
        return x, r
    
    def forward_time_series(self, x, f, s, r: Optional[Tensor]):
        B, T, _, H, W = s.shape
        x = x.flatten(0, 1)
        f = f.flatten(0, 1)
        s = s.flatten(0, 1)
        x = self.upsample(x)
        x = x[:, :, :H, :W]
        x = paddle.concat([x, f, s], axis=1)
        x = self.conv(x)
        # x = x.unflatten(0, (B, T))
        x = x.reshape([B, T] + x.shape[1:])

        # a, b = x.split(self.out_channels // 2, axis=2)
        a, b = x.split(2, axis=2)
        b, r = self.gru(b, r)
        x = paddle.concat([a, b], axis=2)
        return x, r
    
    def forward(self, x, f, s, r: Optional[Tensor]):
        if x.ndim == 5:
            return self.forward_time_series(x, f, s, r)
        else:
            return self.forward_single_frame(x, f, s, r)


class OutputBlock(nn.Layer):
    def __init__(self, in_channels, src_channels, out_channels):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2D(in_channels + src_channels, out_channels, 3, 1, 1, bias_attr=False),
            nn.BatchNorm2D(out_channels),
            nn.ReLU(),
            nn.Conv2D(out_channels, out_channels, 3, 1, 1, bias_attr=False),
            nn.BatchNorm2D(out_channels),
            nn.ReLU(),
        )
        
    def forward_single_frame(self, x, s):
        x = self.upsample(x)
        x = x[:, :, :s.shape[2], :s.shape[3]]
        x = paddle.concat([x, s], axis=1)
        x = self.conv(x)
        return x
    
    def forward_time_series(self, x, s):
        B, T, _, H, W = s.shape
        x = x.flatten(0, 1)
        s = s.flatten(0, 1)
        x = self.upsample(x)
        x = x[:, :, :H, :W]
        x = paddle.concat([x, s], axis=1)
        x = self.conv(x)
        # x = x.unflatten(0, (B, T))
        x = x.reshape([B, T] + x.shape[1:])
        return x
    
    def forward(self, x, s):
        if x.ndim == 5:
            return self.forward_time_series(x, s)
        else:
            return self.forward_single_frame(x, s)


class ConvGRU(nn.Layer):
    def __init__(self,
                 channels: int,
                 kernel_size: int = 3,
                 padding: int = 1):
        super().__init__()
        self.channels = channels
        self.ih = nn.Sequential(
            nn.Conv2D(channels * 2, channels * 2, kernel_size, padding=padding),
            nn.Sigmoid()
        )
        self.hh = nn.Sequential(
            nn.Conv2D(channels * 2, channels, kernel_size, padding=padding),
            nn.Tanh()
        )
        
    def forward_single_frame(self, x, h):
        # print("forward_single_frame split(self.channels, axis=1)", self.channels)
        # r, z = self.ih(paddle.concat([x, h], axis=1)).split(self.channels, axis=1)
        r, z = self.ih(paddle.concat([x, h], axis=1)).split(2, axis=1)
        
        # print(r,z)
        c = self.hh(paddle.concat([x, r * h], axis=1))
        h = (1 - z) * h + z * c
        return h, h
    
    def forward_time_series(self, x, h):
        o = []
        for xt in x.unbind(axis=1): # dim to axis
            ot, h = self.forward_single_frame(xt, h)
            o.append(ot)
        o = paddle.stack(o, axis=1) # torch.stack dim-axis
        return o, h
        
    def forward(self, x, h: Optional[Tensor]):
        if h is None:
            h = paddle.zeros((x.shape[0], x.shape[-3], x.shape[-2], x.shape[-1]),
                             dtype=x.dtype)
        
        if x.ndim == 5:
            return self.forward_time_series(x, h)
        else:
            return self.forward_single_frame(x, h)


class Projection(nn.Layer):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2D(in_channels, out_channels, 1)
    
    def forward_single_frame(self, x):
        return self.conv(x)
    
    def forward_time_series(self, x):
        B, T = x.shape[:2]
        # return self.conv(x.flatten(0, 1)).unflatten(0, (B, T))
        x = self.conv(x.flatten(0, 1))
        return x.reshape([B, T] + x.shape[1:])
        
    def forward(self, x):
        if x.ndim == 5:
            return self.forward_time_series(x)
        else:
            return self.forward_single_frame(x)
    

In [ ]

# Check RecurrentDecoder's input signature so we can verify it
RecurrentDecoder([64, 256, 512, 256], [128, 64, 32, 16])

s0 = paddle.randn([2, 2, 3, 224, 224])
f1=paddle.randn([2, 2, 64, 112, 112])
f2=paddle.randn([2, 2, 256, 56, 56])
f3=paddle.randn([2, 2, 512, 28, 28])
f4=paddle.randn([2, 2, 256, 14, 14])
r1=r2=r3=r4=None

testmodel = RecurrentDecoder([64, 256, 512, 256], [128, 64, 32, 16])
tmp = testmodel(s0, f1, f2, f3, f4, r1, r2, r3, r4)
print(len(tmp))
for i in tmp:
    print(i.shape)

RecurrentDecoder passed verification; the shapes align with torch.
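RVM's temporal memory lives in the ConvGRU above: every call returns both the output and the updated hidden state, which the caller can feed back in for the next frames. A small standalone check of the two calling modes (my own snippet):

In [ ]

gru = ConvGRU(8)
x4d = paddle.randn([2, 8, 14, 14])        # a single frame per sample
h_out, h = gru(x4d, None)                 # hidden state initialized to zeros
print(h_out.shape)                        # [2, 8, 14, 14]
x5d = paddle.randn([2, 3, 8, 14, 14])     # 3-frame clips
o, h = gru(x5d, None)
print(o.shape, h.shape)                   # [2, 3, 8, 14, 14] [2, 8, 14, 14]
o2, h = gru(x5d, h)                       # carry the memory into the next clip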

2.3 Starting on FastGuidedFilterRefiner

This one doesn't seem to be exercised on the torch side?

It later became clear why: the program has a default parameter, and by default FastGuidedFilterRefiner is not used.
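If memory serves, in the upstream RVM code this is a constructor argument of MattingNetwork that defaults to the deep variant, so the fast one never runs unless requested. A sketch of that selection (paraphrased from memory, not the original code; both refiner classes are defined in the cells below):

In [ ]

class MattingNetworkRefinerSketch(nn.Layer):
    # Illustrative only: a default argument decides which refiner is built.
    def __init__(self, refiner: str = 'deep_guided_filter'):
        super().__init__()
        if refiner == 'deep_guided_filter':
            self.refiner = DeepGuidedFilterRefiner()    # default path
        else:
            self.refiner = FastGuidedFilterRefiner()    # only when requested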

In [ ]

"""
Adopted from <https://github.com/wuhuikai/DeepGuidedFilter/>
"""

class FastGuidedFilterRefiner(nn.Layer):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.guilded_filter = FastGuidedFilter(1)
    
    def forward_single_frame(self, fine_src, base_src, base_fgr, base_pha):
        fine_src_gray = fine_src.mean(1, keepdim=True)
        base_src_gray = base_src.mean(1, keepdim=True)
        
        fgr, pha = self.guilded_filter(
            # torch.cat([base_src, base_src_gray], dim=1),
            # torch.cat([base_fgr, base_pha], dim=1),
            # torch.cat([fine_src, fine_src_gray], dim=1)).split([3, 1], dim=1)
            paddle.concat([base_src, base_src_gray], axis=1),
            paddle.concat([base_fgr, base_pha], axis=1),
            paddle.concat([fine_src, fine_src_gray], axis=1)).split([3, 1], axis=1)
#         print("FastGuidedFilterRefiner forward_single_frame fgr, pha", fgr.shape, pha.shape)
        return fgr, pha
    
    def forward_time_series(self, fine_src, base_src, base_fgr, base_pha):
#         print("==FastGuidedFilterRefiner fine_src, base_src, base_fgr, base_pha", fine_src, base_src, base_fgr, base_pha)
        B, T = fine_src.shape[:2]
        fgr, pha = self.forward_single_frame(
            fine_src.flatten(0, 1),
            base_src.flatten(0, 1),
            base_fgr.flatten(0, 1),
            base_pha.flatten(0, 1))
        # fgr = fgr.unflatten(0, (B, T))
        fgr = fgr.reshape([B, T] + fgr.shape[1:])
        # pha = pha.unflatten(0, (B, T))
        pha = pha.reshape([B, T] + pha.shape[1:])
#         print("FastGuidedFilterRefiner forward_time_series fgr, pha", fgr.shape, pha.shape)
        return fgr, pha
    
    def forward(self, fine_src, base_src, base_fgr, base_pha, base_hid):
        # print("fine_src.ndim=", fine_src.ndim)
        if fine_src.ndim == 5:
            return self.forward_time_series(fine_src, base_src, base_fgr, base_pha)
        else:
            return self.forward_single_frame(fine_src, base_src, base_fgr, base_pha)


class FastGuidedFilter(nn.Layer):
    def __init__(self, r: int, eps: float = 1e-5):
        super().__init__()
        self.r = r
        self.eps = eps
        self.boxfilter = BoxFilter(r)

    def forward(self, lr_x, lr_y, hr_x):
        mean_x = self.boxfilter(lr_x)
        mean_y = self.boxfilter(lr_y)
        cov_xy = self.boxfilter(lr_x * lr_y) - mean_x * mean_y
        var_x = self.boxfilter(lr_x * lr_x) - mean_x * mean_x
        A = cov_xy / (var_x + self.eps)
        b = mean_y - A * mean_x
        A = F.interpolate(A, hr_x.shape[2:], mode='bilinear', align_corners=False)
        b = F.interpolate(b, hr_x.shape[2:], mode='bilinear', align_corners=False)
        return A * hr_x + b


class BoxFilter(nn.Layer):
    def __init__(self, r):
        super(BoxFilter, self).__init__()
        self.r = r

    def forward(self, x):
        # Note: The original implementation at <https://github.com/wuhuikai/DeepGuidedFilter/>
        #       uses faster box blur. However, it may not be friendly for ONNX export.
        #       We are switching to use simple convolution for box blur.
        kernel_size = 2 * self.r + 1
        # kernel_x = torch.full((x.data.shape[1], 1, 1, kernel_size), 1 / kernel_size, device=x.device, dtype=x.dtype)
        # kernel_y = torch.full((x.data.shape[1], 1, kernel_size, 1), 1 / kernel_size, device=x.device, dtype=x.dtype)
        kernel_x = paddle.full((x.shape[1], 1, 1, kernel_size), 1 / kernel_size, dtype=x.dtype)
        kernel_y = paddle.full((x.shape[1], 1, kernel_size, 1), 1 / kernel_size, dtype=x.dtype)
        x = F.conv2d(x, kernel_x, padding=(0, self.r), groups=x.shape[1])
        x = F.conv2d(x, kernel_y, padding=(self.r, 0), groups=x.shape[1])
#         print(x.shape)
        return x
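
A quick sanity check of this Paddle BoxFilter (a sketch I'm adding, not from the original repo): with zero padding on both paths, the two depthwise convolutions should match avg_pool2d over the same (2r+1)x(2r+1) window when the padded zeros are counted in the average (exclusive=False).

In [ ]

import paddle
import paddle.nn.functional as F

x = paddle.rand([1, 3, 8, 8])
y1 = BoxFilter(r=1)(x)  # 3x3 separable box blur via two depthwise convs
# exclusive=False divides by the full window size, counting padded zeros,
# which matches the zero-padded convolutions above
y2 = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1, exclusive=False)
print(float((y1 - y2).abs().max()))  # should be ~1e-7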

2.4 Rewriting DeepGuidedFilterRefiner

In [ ]

"""
Adopted from <https://github.com/wuhuikai/DeepGuidedFilter/>
"""

class DeepGuidedFilterRefiner(nn.Layer):
    def __init__(self, hid_channels=16):
        super().__init__()
        self.box_filter = nn.Conv2D(4, 4, kernel_size=3, padding=1, bias_attr=False, groups=4)  # torch bias=False -> paddle bias_attr=False
        # print("box_filter", type(self.box_filter), self.box_filter.weight)
        # self.box_filter.weight.data[...] = 1 / 9
        
        # self.box_filter.weight =1/9
        # x = paddle.to_tensor(1/9, dtype="float32")
        # print("==self.box_filter.weight.shape", self.box_filter.weight.shape)
        x = paddle.full(self.box_filter.weight.shape, 1/9, dtype="float32")  # weight shape: [4, 1, 3, 3]
        self.box_filter.weight = paddle.create_parameter(shape=x.shape,
                        dtype=str(x.numpy().dtype),
                        default_initializer=paddle.nn.initializer.Assign(x))
        self.conv = nn.Sequential(
            nn.Conv2D(4 * 2 + hid_channels, hid_channels, kernel_size=1, bias_attr=False),
            nn.BatchNorm2D(hid_channels),
            nn.ReLU(),  # torch used ReLU(True) (in-place); paddle's nn.ReLU takes no such flag
            nn.Conv2D(hid_channels, hid_channels, kernel_size=1, bias_attr=False),
            nn.BatchNorm2D(hid_channels),
            nn.ReLU(),
            nn.Conv2D(hid_channels, 4, kernel_size=1, bias_attr=True)
        )
        
    def forward_single_frame(self, fine_src, base_src, base_fgr, base_pha, base_hid):
        # fine_x = torch.cat([fine_src, fine_src.mean(1, keepdim=True)], dim=1) # axis
        # base_x = torch.cat([base_src, base_src.mean(1, keepdim=True)], dim=1)
        # base_y = torch.cat([base_fgr, base_pha], dim=1)
        fine_x = paddle.concat([fine_src, fine_src.mean(1, keepdim=True)], axis=1) # axis
        base_x = paddle.concat([base_src, base_src.mean(1, keepdim=True)], axis=1)
        base_y = paddle.concat([base_fgr, base_pha], axis=1)

        mean_x = self.box_filter(base_x)
        mean_y = self.box_filter(base_y)
        cov_xy = self.box_filter(base_x * base_y) - mean_x * mean_y
        var_x  = self.box_filter(base_x * base_x) - mean_x * mean_x
        
        # A = self.conv(torch.cat([cov_xy, var_x, base_hid], dim=1))
        A = self.conv(paddle.concat([cov_xy, var_x, base_hid], axis=1))
        b = mean_y - A * mean_x
        
        H, W = fine_src.shape[2:]
        A = F.interpolate(A, (H, W), mode='bilinear', align_corners=False)
        b = F.interpolate(b, (H, W), mode='bilinear', align_corners=False)
        
        out = A * fine_x + b
        # fgr, pha = out.split([3, 1], dim=1) 
        fgr = out[:, :3, ]
        pha = out[:, 3:, ]
#         print("DeepGuidedFilterRefiner forward_single_frame fgr, pha", fgr.shape, pha.shape)
        return fgr, pha
    
    def forward_time_series(self, fine_src, base_src, base_fgr, base_pha, base_hid):
        B, T = fine_src.shape[:2]
        fgr, pha = self.forward_single_frame(
            fine_src.flatten(0, 1),
            base_src.flatten(0, 1),
            base_fgr.flatten(0, 1),
            base_pha.flatten(0, 1),
            base_hid.flatten(0, 1))
        # fgr = fgr.unflatten(0, (B, T))
        fgr = fgr.reshape([B, T] + fgr.shape[1:])
        # pha = pha.unflatten(0, (B, T))
        pha = pha.reshape([B, T] + pha.shape[1:])
#         print("DeepGuidedFilterRefiner forward_time_series fgr, pha", fgr.shape, pha.shape)
        return fgr, pha
    
    def forward(self, fine_src, base_src, base_fgr, base_pha, base_hid):
        if fine_src.ndim == 5:
            # print("if fine_src.ndim == 5:")
            return self.forward_time_series(fine_src, base_src, base_fgr, base_pha, base_hid)
        else:
            # print("if fine_src.ndim != 5:")
            return self.forward_single_frame(fine_src, base_src, base_fgr, base_pha, base_hid)

2.5 The final sprint: MattingNetwork

At this point the light at the end of the tunnel appeared, and the sprint feeling kicked in! Every component of the model had been reproduced; only the final assembly battle remained.

In [ ]

class MattingNetwork(nn.Layer):
    def __init__(self,
                 variant: str = 'mobilenetv3',
                 refiner: str = 'deep_guided_filter',
                 pretrained_backbone: bool = False):
        super().__init__()
        assert variant in ['mobilenetv3', 'resnet50']
        assert refiner in ['fast_guided_filter', 'deep_guided_filter']
#         print(variant, refiner)
        if variant == 'mobilenetv3':
            self.backbone = MobileNetV3LargeEncoder(pretrained_backbone)
            self.aspp = LRASPP(960, 128)
            self.decoder = RecurrentDecoder([16, 24, 40, 128], [80, 40, 32, 16])
        else:
            self.backbone = ResNet50Encoder(pretrained_backbone)
            self.aspp = LRASPP(2048, 256)
            self.decoder = RecurrentDecoder([64, 256, 512, 256], [128, 64, 32, 16])
            
        self.project_mat = Projection(16, 4)
        self.project_seg = Projection(16, 1)

        if refiner == 'deep_guided_filter':
            self.refiner = DeepGuidedFilterRefiner()
        else:
            self.refiner = FastGuidedFilterRefiner()
        
    def forward(self,
                src: Tensor,
                r1: Optional[Tensor] = None,
                r2: Optional[Tensor] = None,
                r3: Optional[Tensor] = None,
                r4: Optional[Tensor] = None,
                downsample_ratio: float = 1,
                segmentation_pass: bool = False):
        
        if downsample_ratio != 1:
            src_sm = self._interpolate(src, scale_factor=downsample_ratio)
        else:
            src_sm = src
#         print("src_sm=", src_sm.shape)
        f1, f2, f3, f4 = self.backbone(src_sm)
#         print("====f1, f2, f3, f4 = self.backbone(src_sm)", f1.shape, f2.shape, f3.shape, f4.shape)
        f4 = self.aspp(f4)
#         print("f4 = self.aspp(f4)", f4.shape)
        hid, *rec = self.decoder(src_sm, f1, f2, f3, f4, r1, r2, r3, r4)
        
        if not segmentation_pass:
            # fgr_residual, pha = self.project_mat(hid).split([3, 1], dim=-3)
#             print("self.project_mat(hid).split", self.project_mat(hid).shape)
            fgr_residual, pha = self.project_mat(hid).split([3, 1], axis=-3)
            # fgr_residual = self.project_mat(hid)[:, :3, ]
            # pha = self.project_mat(hid)[:, 3:, ]

            if downsample_ratio != 1:
                fgr_residual, pha = self.refiner(src, src_sm, fgr_residual, pha, hid)
            fgr = fgr_residual + src
            fgr = fgr.clip(0., 1.)
            pha = pha.clip(0., 1.)
            return [fgr, pha, *rec]
        else:
            seg = self.project_seg(hid)
            return [seg, *rec]

    def _interpolate(self, x: Tensor, scale_factor: float):
        if x.ndim == 5:
            B, T = x.shape[:2]
            # x = F.interpolate(x.flatten(0, 1), scale_factor=scale_factor,
            #     mode='bilinear', align_corners=False, recompute_scale_factor=False)
            x = F.interpolate(x.flatten(0, 1), scale_factor=scale_factor,
                mode='bilinear', align_corners=False)
            # x = x.unflatten(0, (B, T))
            x = x.reshape([B, T] + x.shape[1:])
        else:
            # x = F.interpolate(x, scale_factor=scale_factor,
            #     mode='bilinear', align_corners=False, recompute_scale_factor=False)
            x = F.interpolate(x, scale_factor=scale_factor,
                mode='bilinear', align_corners=False)
        return x

MattingNetwork's outputs were compared against torch and matched.

It was actually at this step of the reproduction that the structural misalignment was discovered: the model converted with X2Paddle raised errors and its forward pass disagreed with torch. Tracing backwards from there revealed that the ResNet50 model was not aligned.

In [ ]

# Verification passed
model = MattingNetwork('resnet50')
# a = paddle.randn((2, 24, 3, 224, 224))
import numpy as np
np.random.seed(1)
a = np.random.randn(3,3,244,244).astype('float32')
a = paddle.to_tensor(a)
tmp = model(a)
print(len(tmp))
for i in tmp:
    print(i.max())


Inspect the model structure and compare it against the paper; getting the structures to line up took considerable effort.

In [ ]

model.parameters()  # list the parameters to eyeball the structure

3. Breaking down the inference module

3.1 Implementing to_pil_image in Paddle

In [ ]

# Paddle implementation of to_pil_image
import paddle
import PIL
import numbers
import numpy as np
from PIL import Image
from paddle.vision.transforms import BaseTransform
from paddle.vision.transforms import functional as F


class ToPILImage(BaseTransform):
    def __init__(self, mode=None, keys=None):
        super(ToPILImage, self).__init__(keys)  # bug fix: was super(ToTensor, ...)
        self.mode = mode  # store mode; _apply_image reads it

    def _apply_image(self, pic):
        """
        Args:
            pic (Tensor|np.ndarray): Image to be converted to PIL Image.
        Returns:
            PIL: Converted image.
        """
        mode = self.mode  # bug fix: mode was referenced below but never stored
        if not (isinstance(pic, paddle.Tensor) or isinstance(pic, np.ndarray)):
            raise TypeError('pic should be Tensor or ndarray. Got {}.'.format(
                type(pic)))

        elif isinstance(pic, paddle.Tensor):
            if pic.ndimension() not in {2, 3}:
                raise ValueError(
                    'pic should be 2/3 dimensional. Got {} dimensions.'.format(
                        pic.ndimension()))

            elif pic.ndimension() == 2:
                # if 2D image, add channel dimension (CHW)
                pic = pic.unsqueeze(0)

        elif isinstance(pic, np.ndarray):
            if pic.ndim not in {2, 3}:
                raise ValueError(
                    'pic should be 2/3 dimensional. Got {} dimensions.'.format(
                        pic.ndim))

            elif pic.ndim == 2:
                # if 2D image, add channel dimension (HWC)
                pic = np.expand_dims(pic, 2)

        npimg = pic
        if isinstance(pic, paddle.Tensor) and "float" in str(
                pic.numpy().dtype) and mode != 'F':
            pic = (pic * 255).astype('uint8')  # paddle tensors have no .mul()/.byte(); scale and cast instead
        if isinstance(pic, paddle.Tensor):
            npimg = np.transpose(pic.numpy(), (1, 2, 0))

        if not isinstance(npimg, np.ndarray):
            raise TypeError(
                'Input pic must be a paddle.Tensor or NumPy ndarray, ' +
                'not {}'.format(type(npimg)))

        if npimg.shape[2] == 1:
            expected_mode = None
            npimg = npimg[:, :, 0]
            if npimg.dtype == np.uint8:
                expected_mode = 'L'
            elif npimg.dtype == np.int16:
                expected_mode = 'I;16'
            elif npimg.dtype == np.int32:
                expected_mode = 'I'
            elif npimg.dtype == np.float32:
                expected_mode = 'F'
            if mode is not None and mode != expected_mode:
                raise ValueError(
                    "Incorrect mode ({}) supplied for input type {}. Should be {}"
                    .format(mode, npimg.dtype, expected_mode))
            mode = expected_mode

        elif npimg.shape[2] == 2:
            permitted_2_channel_modes = ['LA']
            if mode is not None and mode not in permitted_2_channel_modes:
                raise ValueError("Only modes {} are supported for 2D inputs".
                                 format(permitted_2_channel_modes))

            if mode is None and npimg.dtype == np.uint8:
                mode = 'LA'

        elif npimg.shape[2] == 4:
            permitted_4_channel_modes = ['RGBA', 'CMYK', 'RGBX']
            if mode is not None and mode not in permitted_4_channel_modes:
                raise ValueError("Only modes {} are supported for 4D inputs".
                                 format(permitted_4_channel_modes))

            if mode is None and npimg.dtype == np.uint8:
                mode = 'RGBA'
        else:
            permitted_3_channel_modes = ['RGB', 'YCbCr', 'HSV']
            if mode is not None and mode not in permitted_3_channel_modes:
                raise ValueError("Only modes {} are supported for 3D inputs".
                                 format(permitted_3_channel_modes))
            if mode is None and npimg.dtype == np.uint8:
                mode = 'RGB'

        if mode is None:
            raise TypeError('Input type {} is not supported'.format(
                npimg.dtype))

        return Image.fromarray(npimg, mode=mode)
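
A quick smoke test of the transform (a sketch; it assumes pixel values already lie in [0, 1]):

In [ ]

t = ToPILImage()
img = t(paddle.rand([3, 32, 32]))  # CHW float tensor -> PIL image
print(img.size, img.mode)          # expected: (32, 32) RGB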

3.2 Inference utilities

In [ ]

# RobustVideoMatting/inference_utils.py
# These four classes are used later: VideoReader, VideoWriter, ImageSequenceReader, ImageSequenceWriter
import av
import os
import pims
import numpy as np
# from torch.utils.data import Dataset
from paddle.io import Dataset  # said to match torch's Dataset in behavior
# from torchvision.transforms.functional import to_pil_image
# from paddle.vision.transforms.functional import to_pil_image
to_pil_image = ToPILImage()  # instantiate the transform; calling it invokes _apply_image
from PIL import Image

# @property creates a read-only attribute

class VideoReader(Dataset):
    def __init__(self, path, transform=None):
        self.video = pims.PyAVVideoReader(path)
        self.rate = self.video.frame_rate
        self.transform = transform
        
    @property
    def frame_rate(self):
        return self.rate
        
    def __len__(self):
        return len(self.video)
        
    def __getitem__(self, idx):
        frame = self.video[idx]
        frame = Image.fromarray(np.asarray(frame))
        if self.transform is not None:
            frame = self.transform(frame)
        return frame


class VideoWriter:
    def __init__(self, path, frame_rate, bit_rate=1000000):
        self.container = av.open(path, mode='w')
        self.stream = self.container.add_stream('h264', rate=round(frame_rate))
        self.stream.pix_fmt = 'yuv420p'
        self.stream.bit_rate = bit_rate
    
    def write(self, frames):
        # frames: [T, C, H, W]
#         print("==frames: [T, C, H, W]", frames.shape, frames[0,0,0,0])
        self.stream.width = frames.shape[3]  # torch frames.size(3) -> paddle frames.shape[3]
        self.stream.height = frames.shape[2]
        if frames.shape[1] == 1:
#             print("==write frames before repeat", frames.shape)
#             frames = frames.repeat(1, 3, 1, 1) # convert grayscale to RGB; torch repeat -> paddle tile
            frames = frames.tile([1, 3, 1, 1])
#             print("==write frames after repeat", frames.shape)
        # break the long chained expression into steps for easier single-stepping and op replacement
        x=frames*255
#         print("==x=frames*255", x[0,0,0,0])
        x=x.transpose([0,2,3,1])
        x=x.astype('uint8')
#         print("==x.astype('uint8')", x)
        # print("==write", x.shape)
        x=x.numpy()
        # frames = frames.mul(255).byte().cpu().permute(0, 2, 3, 1).numpy()
        frames = x
#         print("==frames", frames.shape, frames[0,0,0,0])
        for t in range(frames.shape[0]):
            frame = frames[t]
#             print('=frame', frame.shape, type(frame), frame)
            frame = av.VideoFrame.from_ndarray(frame, format='rgb24')
            self.container.mux(self.stream.encode(frame))
                
    def close(self):
        self.container.mux(self.stream.encode())
        self.container.close()


class ImageSequenceReader(Dataset):
    def __init__(self, path, transform=None):
        self.path = path
        self.files = sorted(os.listdir(path))
        self.transform = transform
        
    def __len__(self):
        return len(self.files)
    
    def __getitem__(self, idx):
        with Image.open(os.path.join(self.path, self.files[idx])) as img:
            img.load()
        if self.transform is not None:
            return self.transform(img)
        return img


class ImageSequenceWriter:
    def __init__(self, path, extension='jpg'):
        self.path = path
        self.extension = extension
        self.counter = 0
        os.makedirs(path, exist_ok=True)
    
    def write(self, frames):
        # frames: [T, C, H, W]
        for t in range(frames.shape[0]):
            to_pil_image(frames[t]).save(os.path.join(
                self.path, str(self.counter).zfill(4) + '.' + self.extension))
            self.counter += 1
            
    def close(self):
        pass
        

3.3 Reproducing the main inference script

In [ ]

"""
python inference.py \
    --variant mobilenetv3 \
    --checkpoint "CHECKPOINT" \
    --device cuda \
    --input-source "input.mp4" \
    --output-type video \
    --output-composition "composition.mp4" \
    --output-alpha "alpha.mp4" \
    --output-foreground "foreground.mp4" \
    --output-video-mbps 4 \
    --seq-chunk 1
"""

import numpy as np

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset, BatchSampler, DataLoader

import os
# from paddle.vision.transforms import functional as F
from paddle.vision import transforms
from typing import Optional, Tuple
from tqdm.auto import tqdm

# from inference_utils import VideoReader, VideoWriter, ImageSequenceReader, ImageSequenceWriter

def convert_video(model,
                  input_source: str,
                  input_resize: Optional[Tuple[int, int]] = None,
                  downsample_ratio: Optional[float] = None,
                  output_type: str = 'video',
                  output_composition: Optional[str] = None,
                  output_alpha: Optional[str] = None,
                  output_foreground: Optional[str] = None,
                  output_video_mbps: Optional[float] = None,
                  seq_chunk: int = 1,
                  num_workers: int = 0,
                  progress: bool = True,
                  device: Optional[str] = None,
                  dtype: Optional[paddle.dtype] = None): # torch.dtype
    
    """
    Args:
        input_source: A video file, or an image sequence directory. Images must be sorted in ascending order; png and jpg are supported.
        input_resize: If provided, the input is first resized to (w, h).
        downsample_ratio: The model's downsample_ratio hyperparameter. If not provided, the model sets one automatically.
        output_type: Options: ["video", "png_sequence"].
        output_composition:
            The composition output path. File path if output_type == 'video'; directory path if output_type == 'png_sequence'.
            If output_type == 'video', the composition has a green-screen background.
            If output_type == 'png_sequence', the composition is RGBA png images.
        output_alpha: The alpha output from the model.
        output_foreground: The foreground output from the model.
        seq_chunk: Number of frames to process at once. Increase it for better parallelism.
        num_workers: DataLoader workers. Only use >0 for image input.
        progress: Show progress bar.
        device: Only need to manually provide if model is a TorchScript freezed model.
        dtype: Only need to manually provide if model is a TorchScript freezed model.
    """
    
    assert downsample_ratio is None or (downsample_ratio > 0 and downsample_ratio <= 1), 'Downsample ratio must be between 0 (exclusive) and 1 (inclusive).'
    assert any([output_composition, output_alpha, output_foreground]), 'Must provide at least one output.'
    assert output_type in ['video', 'png_sequence'], 'Only support "video" and "png_sequence" output modes.'
    assert seq_chunk >= 1, 'Sequence chunk must be >= 1'
    assert num_workers >= 0, 'Number of workers must be >= 0'
    
    # Initialize transform
    if input_resize is not None:
        transform = transforms.Compose([
            transforms.Resize(input_resize[::-1]),
            transforms.ToTensor()
        ])
    else:
        transform = transforms.ToTensor()

    # Initialize reader
    if os.path.isfile(input_source):
        source = VideoReader(input_source, transform)
    else:
        source = ImageSequenceReader(input_source, transform)
#     print("source.shape", source.shape)
    # reader = DataLoader(source, batch_size=seq_chunk, pin_memory=True, num_workers=num_workers)
    reader = DataLoader(source, batch_size=seq_chunk, num_workers=num_workers)
    
    
    # Initialize writers
    if output_type == 'video':
        frame_rate = source.frame_rate if isinstance(source, VideoReader) else 30
        output_video_mbps = 1 if output_video_mbps is None else output_video_mbps
        if output_composition is not None:
            writer_com = VideoWriter(
                path=output_composition,
                frame_rate=frame_rate,
                bit_rate=int(output_video_mbps * 1000000))
        if output_alpha is not None:
            writer_pha = VideoWriter(
                path=output_alpha,
                frame_rate=frame_rate,
                bit_rate=int(output_video_mbps * 1000000))
        if output_foreground is not None:
            writer_fgr = VideoWriter(
                path=output_foreground,
                frame_rate=frame_rate,
                bit_rate=int(output_video_mbps * 1000000))
    else:
        if output_composition is not None:
            writer_com = ImageSequenceWriter(output_composition, 'png')
        if output_alpha is not None:
            writer_pha = ImageSequenceWriter(output_alpha, 'png')
        if output_foreground is not None:
            writer_fgr = ImageSequenceWriter(output_foreground, 'png')

    # Inference
    # model = model.eval() 
    model.eval()
    # if device is None or dtype is None:  # temporarily commented out, to see if it's needed
    #     param = next(model.parameters())
    #     dtype = param.dtype
    #     device = param.device
    
    if (output_composition is not None) and (output_type == 'video'):
        # bgr = torch.tensor([120, 255, 155], device=device, dtype=dtype).div(255).view(1, 1, 3, 1, 1)
        bgr = (paddle.to_tensor([120, 255, 155], dtype="float32") / paddle.to_tensor(255.0, dtype='float32')).reshape([1,1,3,1,1])
#         print ("==bgr", bgr.shape, bgr[0,0,0])
    
    try:
        with paddle.no_grad():
            # bar = tqdm(total=len(source), disable=not progress, dynamic_ncols=True)
            bar = tqdm(total=len(source))
            rec = [None] * 4
            for src in reader:

                if downsample_ratio is None:
                    downsample_ratio = auto_downsample_ratio(*src.shape[2:])

                # src = src.to(device, dtype, non_blocking=True).unsqueeze(0) # [B, T, C, H, W]
                src = src.unsqueeze(0)  # [B, T, C, H, W]
                fgr, pha, *rec = model(src, *rec, downsample_ratio)

                if output_foreground is not None:
                    writer_fgr.write(fgr[0])
                if output_alpha is not None:
                    writer_pha.write(pha[0])
                if output_composition is not None:
                    if output_type == 'video':
                        com = fgr * pha + bgr * (1 - pha)
                    else:
                        fgr = fgr * pha.gt(0)
                        # com = torch.cat([fgr, pha], dim=-3)
                        com = paddle.concat([fgr, pha], axis=-3)
                    writer_com.write(com[0])
                
#                 bar.update(src.size(1))
                bar.update(src.shape[1])

    finally:
        # Clean up
        if output_composition is not None:
            writer_com.close()
        if output_alpha is not None:
            writer_pha.close()
        if output_foreground is not None:
            writer_fgr.close()


def auto_downsample_ratio(h, w):
    """
    Automatically choose a downsample ratio so that the longest side of the resolution is at most 512 px.
    """
    return min(512 / max(h, w), 1)


class Converter:
    # Note: this class is kept verbatim from the original PyTorch inference.py
    # (TorchScript freezing has no counterpart here) and is NOT used in the
    # Paddle reproduction; the Paddle path calls convert_video directly.
    def __init__(self, variant: str, checkpoint: str, device: str):
        self.model = MattingNetwork(variant).eval().to(device)
        self.model.load_state_dict(torch.load(checkpoint, map_location=device))
        self.model = torch.jit.script(self.model)
        self.model = torch.jit.freeze(self.model)
        self.device = device
    
    def convert(self, *args, **kwargs):
        convert_video(self.model, device=self.device, dtype=torch.float32, *args, **kwargs)
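
A couple of illustrative values for auto_downsample_ratio (a quick sanity sketch, numbers only for illustration):

In [ ]

print(auto_downsample_ratio(1080, 1920))  # 512/1920 ≈ 0.2667: the long side 1920 shrinks to 512
print(auto_downsample_ratio(256, 256))    # 1: already within 512 px, no downsampling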

3.4 Converting PyTorch weights to PaddlePaddle weights

Because this cannot run on AIStudio, the cell below is kept in Markdown mode; on the 启智 (OpenI) AI platform or any other environment with torch installed, switch it back to a code cell.

import paddle
import torch
def export_weight_names(net):
    print(net.state_dict().keys())
    with open('paddle.txt', 'w') as f:
        for key in net.state_dict().keys():
            f.write(key + '\n')
            
# Convert PyTorch weights to PaddlePaddle weights
def transfer(paddlemodel=None):
#     res2net_paddle_implement = paddle.vision.models.resnet50(pretrained=False)
    res2net_paddle_implement = paddlemodel
    export_weight_names(res2net_paddle_implement)  # dump the paddle model's keys to a txt file
    paddle_list = open('paddle.txt')  # the paddle keys
#     state_dict = torch.load('resnet50-0676ba61.pth')
    state_dict = torch.load("rvm_resnet50.pth")

    paddle_state_dict = OrderedDict()
    paddle_list = paddle_list.readlines()
    torch_list = state_dict.keys()
    for p in paddle_list:
        p = p.strip()
        t = p
        if "mean" in p:
            t = p.replace("_mean", "running_mean")
        if "variance" in p:
            t = p.replace("_variance", "running_var")
        if t in torch_list:
            if 'fc' not in p:
                paddle_state_dict[p] = state_dict[t].detach().cpu().numpy()
            else:
                paddle_state_dict[p] = state_dict[t].detach().cpu().numpy().T
        else:
            print(p)

#     f = open('resnet50.pdparams', 'wb')
#     f = open("rvm_resnet50.pdparams")
#     import pickle
#     pickle.dump(paddle_state_dict, f)
#     f.close()
    with open("rvm_resnet50.pdparams", 'wb') as f:
        import pickle
        pickle.dump(paddle_state_dict, f)
        
        
import paddle
import numpy as np
# from paddle.vision.models import ResNet
# from paddle.vision.models.resnet import BottleneckBlock, BasicBlock
from collections import OrderedDict
# paddleresnet50 = ResNet(BottleneckBlock, 50)
model = MattingNetwork('resnet50') 
transfer(paddlemodel=model)        

3.5 Final inference

Run the inference to check that it executes cleanly and the output files are correct.

In [ ]

# For plain video matting, a simple API is provided:
# from inference import convert_video 
model = MattingNetwork('resnet50') 
model.set_state_dict(paddle.load("rvm_resnet50.pdparams"))
convert_video(
    model,                           # the model, loadable on any device (cpu or cuda)
    input_source='video.mp4',        # a video file or an image-sequence directory
    output_type='video',             # "video" or "png_sequence"
    output_composition='com.mp4',    # file path for video output, directory path for a PNG sequence
    output_alpha="pha.mp4",          # [optional] output the alpha prediction
    output_foreground="fgr.mp4",     # [optional] output the foreground prediction
    output_video_mbps=4,             # video bitrate, for video output
    downsample_ratio=None,           # downsample ratio; tune per video, or None for auto
    seq_chunk=1                      # number of frames processed in parallel at once
)

Download the output com.mp4 to your machine. Is it a handsome guy in front of a green screen? That means the whole reproduction succeeded! That handsome guy's project lives at: https://gitee.com/roy-kwok/robust-video-matting-master

Due to space limits, the MobileNetV3 backbone part is not covered here; please head over to 飞桨源码MobileNetV3分类模型对齐Pytorch-省事版

Lessons learned

This reproduction takes some shortcuts and falls short of the paper-reproduction contest requirements in a few respects:

  • The training part was not reproduced.
  • The backbone was precision-aligned, but the full network never went through a precision-alignment test; the final output merely looked good, so the task was called done.
  • The code was not cleaned up; many test statements are merely commented out instead of deleted.

Even so, finishing an entire project end to end, from model to inference, feels thrilling. After taking part in roughly four rounds of the paper-reproduction contest, this is my best reproduction yet. My own verdict: satisfied.

Keeping those test statements around is actually fine. After only a week or two the code already reads like a stranger's, and the test lines help it all come back faster.

Common reproduction pitfalls have been archived in this project for future work: 飞桨与Pytorch算子对照表-论文复现小助手

There is also a spin-off project, 没有绿幕,AI给我们造! 超强的稳定视频抠像 (RVM), which wraps this reproduction into a dead-simple tool that anyone can use. Give it a try!

Debugging notes

Error: too many values to unpack (expected 2)

     58     def forward(self, x, r: Optional[Tensor]):
---> 59         a, b = x.split(self.channels // 2, axis=-3) # dim
     60         b, r = self.gru(b, r)
     61         x = paddle.concat([a, b], axis=-3) # cat concat
ValueError: too many values to unpack (expected 2)

Paddle distributes the split outputs according to that second argument, which torch interprets differently:

  • The second argument differs. PyTorch: split_size_or_sections is an int or list(int), and an int means the size of each chunk. PaddlePaddle: num_or_sections is an int, list(int) or tuple(int), and an int means the number of chunks. The sketch below shows the difference.
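
A minimal sketch of the two behaviors (illustrative shapes only):

import paddle
x = paddle.arange(8, dtype='float32').reshape([1, 4, 2])
a, b = x.split([3, 1], axis=1)   # a list gives section sizes: shapes [1, 3, 2] and [1, 1, 2]
c, d = x.split(2, axis=1)        # an int is the NUMBER of sections in Paddle: each [1, 2, 2]
# torch.split(x, 2, dim=1) would instead return chunks of SIZE 2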

self.box_filter.weight = 1/9

TypeError: assignment to parameter 'weight' should be of type Parameter or None, but got 'float' — in other words, how do you assign a value to a layer's weight in Paddle?

import paddle

weight_attr = paddle.ParamAttr(name="weight",
                               learning_rate=0.5,
                               regularizer=paddle.regularizer.L2Decay(1.0),
                               trainable=True)
print(weight_attr.name) # "weight"
paddle.nn.Linear(3, 4, weight_attr=weight_attr)

or
# PaddlePaddle example:
import paddle
x = paddle.zeros([2, 3], dtype="float32")
param = paddle.create_parameter(shape=x.shape,
                        dtype=str(x.numpy().dtype),
                        default_initializer=paddle.nn.initializer.Assign(x))
param.stop_gradient = True

# Output
# Parameter containing:
# Tensor(shape=[2, 3], dtype=float32, place=CPUPlace, stop_gradient=True,
#        [[0., 0., 0.],
#         [0., 0., 0.]])

split takes a list of section sizes

split([3,1], dim=-3) — first attempt: replace it with plain slicing,

[:, :3, ] and [:, 3:, ] — but slicing turned out not to work, because the number of dimensions is unknown (4D here, 5D there)! split itself is fine; only dim needs to become axis:

            # fgr_residual, pha = self.project_mat(hid).split([3, 1], dim=-3)
            print("self.project_mat(hid).split", self.project_mat(hid).shape)
            fgr_residual, pha = self.project_mat(hid).split([3, 1], axis=-3)

'Tensor' object has no attribute 'clamp'

---> 57             fgr = fgr.clamp(0., 1.)
     58             pha = pha.clamp(0., 1.)
     59             return [fgr, pha, *rec]
AttributeError: 'Tensor' object has no attribute 'clamp'

Just use clip instead (paddle.clip / Tensor.clip).
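
A one-line sketch:

import paddle
x = paddle.to_tensor([-0.5, 0.3, 1.2])
print(x.clip(0., 1.))  # [0., 0.3, 1.], same as torch's clamp(0., 1.)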

Errors when loading the trained weights

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1441: UserWarning: Skip loading for refiner.conv.4._mean. refiner.conv.4._mean is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1441: UserWarning: Skip loading for refiner.conv.4._variance. refiner.conv.4._variance is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

It was exactly this flood of skip-loading warnings that prompted the decision to rewrite ResNet.

In fact, any such warning is proof that the model is not aligned.

unflatten error

/tmp/ipykernel_166/3011955335.py in forward_time_series(self, x)
     40         B, T = x.shape[:2]
     41         features = self.forward_single_frame(x.flatten(0, 1))
---> 42         features = [f.unflatten(0, (B, T)) for f in features]
     43         return features
     44 
/tmp/ipykernel_166/3011955335.py in <listcomp>(.0)
     40         B, T = x.shape[:2]
     41         features = self.forward_single_frame(x.flatten(0, 1))
---> 42         features = [f.unflatten(0, (B, T)) for f in features]
     43         return features
     44 
AttributeError: 'Tensor' object has no attribute 'unflatten'

Paddle tensors have no unflatten function. torch's works like this:

>>> torch.randn(3, 4, 1).unflatten(1, (2, 2)).shape
torch.Size([3, 2, 2, 1])
>>> torch.randn(3, 4, 1).unflatten(1, (-1, 2)).shape # the size -1 is inferred from the size of dimension 1
torch.Size([3, 2, 2, 1])
>>> torch.randn(2, 4, names=('A', 'B')).unflatten('B', (('B1', 2), ('B2', 2)))
tensor([[[-1.1772,  0.0180],
        [ 0.2412,  0.1431]],
        [[-1.1819, -0.8899],
        [ 1.5813,  0.2274]]], names=('A', 'B1', 'B2'))
>>> torch.randn(2, names=('A',)).unflatten('A', (('B1', -1), ('B2', 1)))
tensor([[-0.8591],
        [ 0.3100]], names=('B1', 'B2'))

split looked like it could handle this, but a quick experiment showed it cannot. Reading torch's source for unflatten didn't surface the key line either, so a reshape solves it:

# features = [f.unflatten(0, (B, T)) for f in features]
features = [f.reshape([B, T] + f.shape[1:]) for f in features]
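
A small sketch of the flatten/reshape round trip that replaces unflatten:

import paddle
B, T = 2, 3
x = paddle.rand([B, T, 4, 5])
flat = x.flatten(0, 1)                        # [6, 4, 5], same as torch's flatten(0, 1)
back = flat.reshape([B, T] + flat.shape[1:])  # [2, 3, 4, 5], same as unflatten(0, (B, T))
print(back.shape)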

Broadcast dimension mismatch error

ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [6, 3, 4, 224, 224] and the shape of Y = [6, 8, 3, 224, 224]. Received [3] in X is not equal to [8] in Y at i:1.
  [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/fluid/operators/elementwise/elementwise_op_function.h:240)
  [operator < elementwise_add > error]

Why doesn't it match? A blind guess: is the earlier slicing hack to blame?

Testing with an adjusted shape shows line 56 is where it fails:

     54             if downsample_ratio != 1:
     55                 fgr_residual, pha = self.refiner(src, src_sm, fgr_residual, pha, hid)
---> 56             fgr = fgr_residual + src
     57             fgr = fgr.clip(0., 1.)
     58             pha = pha.clip(0., 1.)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py in __impl__(self, other_var)
    262             axis = -1
    263             math_op = getattr(_C_ops, op_type)
--> 264             return math_op(self, other_var, 'axis', axis)
    265 
    266         comment = OpProtoHolder.instance().get_op_proto(op_type).comment
ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [2, 3, 4, 224, 224] and the shape of Y = [2, 24, 3, 224, 224]. Received [3] in X is not equal to [24] in Y at i:1.
  [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/fluid/operators/elementwise/elementwise_op_function.h:240)
  [operator < elementwise_add > error]

Switching from slicing back to split fixed it!

            # fgr_residual, pha = self.project_mat(hid).split([3, 1], dim=-3)
            print("self.project_mat(hid).split", self.project_mat(hid).shape)
            fgr_residual, pha = self.project_mat(hid).split([3, 1], axis=-3)

After loading the weights, forward results differ

Compare against torch's model.named_parameters; below is Paddle's tmpmodel.sublayers:

Tensor(shape=[64, 3, 7, 7], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
        [[[[ 0.00920054,  0.00598424, -0.02342818, ..., -0.05485044,
            -0.05662441, -0.08892745],
           [ 0.00141057,  0.00058585,  0.00649579, ..., -0.00672991,
            -0.02917050, -0.05985972],
           [ 0.02777172,  0.03086680,  0.01981828, ...,  0.10167034,
             0.06547850,  0.04509544],

torch

OrderedDict([('backbone.conv1.weight',
              tensor([[[[ 9.2005e-03,  5.9842e-03, -2.3428e-02,  ..., -5.4850e-02,
                         -5.6624e-02, -8.8927e-02],
                        [ 1.4106e-03,  5.8585e-04,  6.4958e-03,  ..., -6.7299e-03,
                         -2.9171e-02, -5.9860e-02],
                        [ 2.7772e-02,  3.0867e-02,  1.9818e-02,  ...,  1.0167e-01,
                          6.5478e-02,  4.5095e-02],

So the first layer is aligned after all.

Now look at the last entries; the Paddle dump of the tail did not make it into these notes:

paddle

torch
('refiner.conv.4.weight',
              tensor([1.1835, 1.1862, 1.3103, 1.0038, 1.2307, 1.5397, 1.0028, 1.1493, 1.2051,
                      1.1066, 0.8704, 1.1728, 1.0673, 1.1416, 1.0163, 1.1113],
                     device='cuda:0')),
             ('refiner.conv.4.bias',
              tensor([-0.0158,  0.0759, -0.0215, -0.1072, -0.2232,  0.0079,  0.0211, -0.0166,
                      -0.0709, -0.0674, -0.0459,  0.1081,  0.0335,  0.0550,  0.0274,  0.1692],
                     device='cuda:0')),
             ('refiner.conv.4.running_mean',
              tensor([-0.0813, -0.1583, -0.2181,  0.2204,  0.0940,  0.0842, -0.1248, -0.0449,
                       0.1911,  0.1713, -0.1602,  0.0618, -0.0110, -0.0663,  0.2190,  0.0388],
                     device='cuda:0')),
             ('refiner.conv.4.running_var',
              tensor([0.1016, 0.0304, 0.0449, 0.0309, 0.0496, 0.0117, 0.4346, 0.0580, 0.0196,
                      0.0543, 0.2745, 0.0461, 0.0263, 0.0308, 0.1099, 0.0121],
                     device='cuda:0')),
             ('refiner.conv.4.num_batches_tracked',
              tensor(11863, device='cuda:0')),
             ('refiner.conv.6.weight', tensor([[[[-7.8937e-02]],
              
                       [[-1.7649e-01]],
              
                       [[ 2.6084e-01]],
              
                       [[ 6.5176e-02]],
              
                       [[ 1.6884e-01]],
              
                       [[ 1.6580e-03]],
              
                      ........
              
                       [[ 1.5224e-01]]]], device='cuda:0')),
             ('refiner.conv.6.bias',
              tensor([ 0.1447,  0.1293,  0.0782, -0.0261], device='cuda:0'))])

TypeError: 'numpy.int64' object is not callable

/tmp/ipykernel_1229/1406632871.py in write(self, frames)
     44     def write(self, frames):
     45         # frames: [T, C, H, W]
---> 46         self.stream.width = frames.size(3)
     47         self.stream.height = frames.size(2)
     48         if frames.size(1) == 1:

TypeError: 'numpy.int64' object is not callable

In torch, size(3) is a method call that returns the dimension size; in Paddle, Tensor.size is a plain int (the total element count), so calling it fails. Use frames.shape[3] instead.
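
A two-line illustration:

import paddle
x = paddle.zeros([2, 3, 4, 5])
print(x.shape[3])  # 5: shape is a plain list of ints
print(x.size)      # 120: total element count, an attribute rather than a method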

transpose() takes from 2 to 3 positional arguments but 5 were given

---> 54         x=x.transpose(0,2,3,1)
     55         x=x.numpy()
     56         # frames = frames.mul(255).byte().cpu().permute(0, 2, 3, 1).numpy()

TypeError: transpose() takes from 2 to 3 positional arguments but 5 were given

Odd at first glance: where did 5 arguments come from? Oh, the call was simply written wrong; Paddle takes a single list: transpose([0, 2, 3, 1]).
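
For example:

import paddle
x = paddle.rand([2, 3, 4, 5])
print(x.transpose([0, 2, 3, 1]).shape)  # [2, 4, 5, 3]; Paddle takes one perm list, torch's permute takes separate ints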

Error: Operator transpose2 does not have kernel for data_type[int8_t]

RuntimeError: (NotFound) Operator transpose2 does not have kernel for data_type[int8_t]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN].
  [Hint: Expected kernel_iter != kernels.end(), but received kernel_iter == kernels.end().] (at /paddle/paddle/fluid/imperative/prepared_operator.cc:159)
  [operator < transpose2 > error]

Easy: just move the dtype cast after the transpose.

An error with no further hints

/tmp/ipykernel_1229/4196880029.py in write(self, frames)
     62         for t in range(frames.shape[0]):
     63             frame = frames[t]
---> 64             frame = av.VideoFrame.from_ndarray(frame, format='rgb24')
     65             self.container.mux(self.stream.encode(frame))
     66 

av/video/frame.pyx in av.video.frame.VideoFrame.from_ndarray()

AssertionError: 

Adding debug prints revealed the cause: the earlier cast used 'int8', which produced negative pixel values that av rejects. Casting to 'uint8' solves it.
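
A tiny illustration of why 'int8' breaks (a sketch; as observed above, int8 yields negatives for large values):

import paddle
x = paddle.to_tensor([0.5, 1.0]) * 255   # 127.5, 255.0
print(x.astype('uint8'))  # [127, 255], valid pixel values
print(x.astype('int8'))   # 255 does not fit in int8 and comes out negative, which av rejects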

Error: 'Tensor' object has no attribute 'repeat' — the repeat problem still has to be solved.

<ipython-input-72-dcd21ea5364d> in write(self, frames)
     48         self.stream.height = frames.shape[2]
     49         if frames.shape[1] == 1:
---> 50             frames = frames.repeat(1, 3, 1, 1) # convert grayscale to RGB; what is Paddle's equivalent of repeat?
     51         # 拆分下面的长句,以便单步执行和代码替换
     52         x=frames*255

AttributeError: 'Tensor' object has no attribute 'repeat'

Change repeat to tile:

#             frames = frames.repeat(1, 3, 1, 1) # convert grayscale to RGB; torch repeat -> paddle tile
            frames = frames.tile([1, 3, 1, 1])
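
For instance, expanding grayscale frames [T, 1, H, W] to RGB:

import paddle
g = paddle.rand([4, 1, 8, 8])
rgb = g.tile([1, 3, 1, 1])  # paddle tile repeats along each axis, like torch's repeat
print(rgb.shape)            # [4, 3, 8, 8]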

TypeError: 'numpy.int64' object is not callable, again

<ipython-input-86-db69f39ec30d> in convert_video(model, input_source, input_resize, downsample_ratio, output_type, output_composition, output_alpha, output_foreground, output_video_mbps, seq_chunk, num_workers, progress, device, dtype)
    157                     writer_com.write(com[0])
    158 
--> 159                 bar.update(src.size(1))
    160 
    161     finally:

TypeError: 'numpy.int64' object is not callable

Caused by size once more; change it to shape:

#                 bar.update(src.size(1))
                bar.update(src.shape[1])

Version info

Version 1.0, 2022-04-21

Closing words

With PaddlePaddle, start a new epoch! Let's row our twin paddles and ride the wind and waves across the ocean of AI!

PaddlePaddle official site: https://www.paddlepaddle.org.cn

My skills are limited, so shortcomings are inevitable; all help and feedback are welcome.

Author: 段春华, online handle skywalk / 天马行空, AI architect at 济宁市极快软件科技有限公司 and a Baidu PaddlePaddle PPDE.

I have reached the Supreme level on AI Studio with 11 badges lit. Come follow me: https://aistudio.baidu.com/aistudio/personalcenter/thirdview/141218
