MobileFormer: An Efficient Parallel Architecture Bridging Global & Local Features
(Tested on MNIST and Cifar10.) A hands-on reproduction of MobileFormer, with a companion video tutorial that walks through the model's structural design and shows its results on several datasets.
The Pain Points of the Vision Transformer
Some implementation details were slightly adjusted; updated model code will later be merged into the PaddleViT repository. (The implementation matches the video; some normalization layers and parameters differ slightly.)
The structure was tuned repeatedly because the paper's code is not open source, so some parameters had to be pinned down by testing on different datasets (ImageNet not yet tested).
Datasets tested so far: MNIST, Cifar10, Cifar100.
Over the past year, large ViT models have appeared one after another: stronger capabilities, higher recognition accuracy, better detection results, and marked transferability, with new SOTA models emerging constantly.
Yet however well these ViT models perform, the performance usually comes tied to compute and parameter cost. Tied how? Simply put, most of today's top-performing ViT models exceed 1G FLOPs, which limits their real-time applicability.
Researchers have therefore explored ViT designs, both pure ViT and conv/ViT hybrids, meant to occupy the niche that the MobileNet series holds among convolutional architectures: carrying high-performing ViTs onto mobile devices.
Out of this motivation, lightweight ViT architectures such as MobileViT and MobileFormer emerged over the past year. This project takes MobileFormer as the subject of a hands-on paper reproduction and walks through the latest MobileFormer architecture design.
MobileFormer highlights

- A parallel hybrid architecture (conv + ViT); most earlier models are organized serially.
- Split-head independent attention projection (split into heads before projecting): compared with projecting q/k/v directly, this reduces the parameter count as the number of heads grows.
- A bridge structure that lets local and global information interact through attention, projecting only a small number of tokens; the feature map itself never undergoes a projection.
- A small number of tokens to embed/learn global information, cutting both compute and parameters.

About the split-head attention projection: it is introduced later, and was settled while aligning parameter counts. The paper only writes "multi-head attention", but a conventional multi-head projection would cost noticeably more parameters, hence this change.
Feel free to discuss the structural details in the comments.
How MobileFormer Exchanges Global and Local Information

- On the left, $X_i$ is the input: a set of feature maps. On the right, $Z_i$ is the input: a set of tokens.
- The left input $X_i$ is first passed into the yellow region on the right, fusing the feature map's local information into $Z_i$. This step is done via attention, but only the input tokens are projected, and only into the Query.
- In that step the feature map is not projected; it is used directly as Key and Value in the attention computation, and a final residual connection merges the local information into the tokens' global information. This is the Mobile->Former step, renamed ToFormer_Bridge here.
- After bridging through ToFormer_Bridge, the enriched $Z_i$ is passed into the Former structure, the green region: a plain Transformer block that extracts features and outputs $Z_{i+1}$.
- With $Z_{i+1}$ in hand as the stage's complete global-information tokens, it is passed back to the Mobile structure, the blue region; only now is $X_i$ fed into Mobile, realizing the fusion/interaction that adds global information to the local features.
- This interaction mainly works by handing the tokens to the dynamic ReLU, which generates dynamic parameters that filter the local information/features produced by the convolutions.
- After $X_i$ passes through Mobile, its information still lacks a truly global view, so one more attention computation against the global-information tokens is needed.
- At this point $Z_{i+1}$ is projected twice to obtain Key and Value, while the Mobile output $X_i$ keeps its original features and serves directly as the Query that guides the infusion of global information; a final residual performs the feature fusion. This is the Former->Mobile step, renamed ToMobile_Bridge here.
- Throughout this exchange the feature map never undergoes any projection, so its local information stays untouched and merges all the more cleanly into the tokens, while the tokens' global information is continually updated across many such blocks and fused back into the feature map's features.

All of the above constitutes one MobileFormer_Block.
In each bridge, ToFormer_Bridge and ToMobile_Bridge, it is always the tokens that are projected for attention, and both use the split-head independent projection mentioned earlier.
In short:
- The side projected into the query always filters the information of the side acting as key and value; for example, when the tokens are the query they filter local information, and a residual then fuses the selected information into the features. The sketch below previews this data flow.
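Before moving to the code, here is a minimal runnable sketch of the two bridges, single head, using raw paddle ops. All sizes and the random projection matrices are hypothetical, and the channel count C is kept equal to the token dimension D so the residuals line up; the real modules below add multiple heads, normalization, dropout and learned projections.
# a minimal sketch of the two bridges (hypothetical sizes; C == D for simplicity)
import paddle
B, C, H, W, M, D = 2, 16, 8, 8, 6, 16
fm = paddle.rand([B, C, H, W])                        # feature map X_i
tokens = paddle.rand([B, M, D])                       # tokens Z_i
x = fm.reshape([B, C, H * W]).transpose([0, 2, 1])    # B, L, C -- raw key/value, never projected
# Mobile->Former (ToFormer_Bridge): only the tokens are projected, into the query
q = paddle.matmul(tokens, paddle.rand([D, C]))        # stand-in for a learned projection
attn = paddle.nn.functional.softmax(paddle.matmul(q, x, transpose_y=True) * C ** -0.5, axis=-1)
tokens = tokens + paddle.matmul(attn, x)              # residual: local info flows into the tokens
# ... Former (a plain Transformer block) would update the tokens here ...
# Former->Mobile (ToMobile_Bridge): tokens are projected into key/value, the feature map is the raw query
k = paddle.matmul(tokens, paddle.rand([D, C]))
v = paddle.matmul(tokens, paddle.rand([D, C]))
attn = paddle.nn.functional.softmax(paddle.matmul(x, k, transpose_y=True) * C ** -0.5, axis=-1)
x = x + paddle.matmul(attn, v)                        # residual: global info flows into the features
fm = x.transpose([0, 2, 1]).reshape([B, C, H, W])
print(fm.shape)  # [2, 16, 8, 8]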
Reproducing the MobileFormer Model (Structure)
The walkthrough follows the MobileFormer294M structure; the other model sizes differ only in parameters.
The implementation is organized into the following files:
- baseconv.py: Stem + PW + DW + BottleNeck
- attention.py: MLP + Attention
- dyrelu.py: DyReLU
- droppath.py: DropPath
- mobileformer.py: Classifier_Head + Mobile + ToFormer_Bridge + Former + ToMobile_Bridge + MFBlock + MobileFormer + build_mformer
# import the basic dependencies
import paddle
from paddle import nn
import numpy as np
一、Stem: the stem layer
(The token part is created during final model assembly, so we skip it for now.)
Calling this the "stem" layer is a personal naming habit: it denotes the first convolution or feature-extraction module of a network, the idea being that features start flowing gradually into the model. A convolutional stem usually downsamples, and it should (in experiments, changing MobileFormer's first layer to stride 1 performed worse than stride-2 downsampling).
Structure:
- one 3x3 conv layer + BatchNorm2D + Hardswish
- the _conv_init helper produces the selected parameter-initialization method, which is wired in through weight_attr and friends
class Stem(nn.Layer):
"""Stem
"""
def __init__(self,
in_channels,
out_channels,
kernel_size=3,
stride=1,
padding=0,
act=nn.Hardswish,
init_type='kn'):
super(Stem, self).__init__(
name_scope="Stem")
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
self.bn = nn.BatchNorm2D(out_channels)
self.act = act()
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
x = self.bn(x)
x = self.act(x)
return x
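A quick shape check (hypothetical sizes; runs after the import cell above): with stride 2 the stem halves the spatial size.
stem = Stem(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = paddle.rand([1, 3, 224, 224])
print(stem(x).shape)  # [1, 8, 112, 112]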
二、Lite-BottleNeck implementation
MobileFormer follows the linear bottleneck designed in MobileNetV3, so we first implement a general BottleNeck and then obtain the lite variant by adjusting parameters.
First, the BottleNeck layout: 1x1 pw-conv + DY-ReLU activation + 3x3 dw-conv + DY-ReLU activation + 1x1 pw-conv
pw (pointwise): a 1x1 convolution that compresses and expands channels; the key knob controlling the bottleneck width (channels).
dw (depthwise): a 3x3 grouped convolution that performs the feature extraction.
DyReLU: dynamic ReLU; generates dynamic parameters from the tokens passed in, then selects features via max.
2.1 (PW) pointwise convolution
To keep things decoupled, only the convolution is built here, without normalization or activation.
Structure:
- 1x1 conv, with grouped-convolution support
Group support exists because in the smallest model every 1x1 pointwise conv is implemented with 4 groups to reduce parameters.
Since the 1x1 conv is fixed and used only for channel control, kernel size, stride, etc. are not exposed as constructor arguments.
class PointWiseConv(nn.Layer):
"""PointWise 1x1Conv -- support group conv
Params Info:
groups: the number of groups
"""
def __init__(self,
in_channels,
out_channels,
groups=1,
init_type='kn'):
super(PointWiseConv, self).__init__(
name_scope="PointWiseConv")
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=out_channels,
kernel_size=1,
stride=1,
padding=0,
groups=groups,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
return x
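A small sanity check of the grouping's saving (hypothetical sizes; np comes from the import cell above): with 4 groups the 1x1 weight matrix shrinks by a factor of 4, while the bias count stays the same.
pw = PointWiseConv(in_channels=16, out_channels=32, groups=1)
pw_g4 = PointWiseConv(in_channels=16, out_channels=32, groups=4)
count = lambda m: sum(int(np.prod(p.shape)) for p in m.parameters())
print(count(pw), count(pw_g4))  # 544 vs 160: 16*32 + 32 vs 16*32/4 + 32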
2.2 (DW) depthwise convolution
To keep things decoupled, only the convolution is built here, without normalization or activation.
Structure: a 3x3 grouped conv, or the lite 3x3 grouped conv
The group count always equals the input channel count: within the bottleneck, DW only extracts features and the channel count stays unchanged.
Kernel size and the like remain configurable in case adjustments are needed, but no output-channel argument is exposed, since the output always keeps the input channel count.
The lite 3x3 grouped conv is built from two conv layers, with kernels [3, 1] and [1, 3].
In particular, for the lite structure the given stride and padding must also be split into two different values, one for each of the two conv layers.
class DepthWiseConv(nn.Layer):
"""DepthWise Conv -- support lite weight dw_conv
Params Info:
is_lite: use lite weight dw_conv
"""
def __init__(self,
in_channels,
kernel_size=3,
stride=1,
padding=0,
is_lite=False,
init_type='kn'):
super(DepthWiseConv, self).__init__(
name_scope="DepthWiseConv")
self.is_lite = is_lite
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
if is_lite is False:
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
groups=in_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
else: # lite 结构
self.conv = nn.Sequential(
# kernel_size -- [3, 1]
nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
kernel_size=[kernel_size, 1],
stride=[stride, 1],
padding=[padding, 0],
groups=in_channels,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr),
nn.BatchNorm2D(in_channels),
# kernel_size -- [1, 3]
nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
kernel_size=[1, kernel_size],
stride=[1, stride],
padding=[0, padding],
groups=in_channels,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
)
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
return x
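A quick shape check (hypothetical sizes): both variants keep the channel count, and the lite variant reaches the same output shape through the factorized [3, 1] + [1, 3] pair.
dw = DepthWiseConv(in_channels=8, kernel_size=3, stride=2, padding=1)
dw_lite = DepthWiseConv(in_channels=8, kernel_size=3, stride=2, padding=1, is_lite=True)
x = paddle.rand([1, 8, 32, 32])
print(dw(x).shape, dw_lite(x).shape)  # [1, 8, 16, 16] for both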
2.3 Dynamic ReLU implementation
We implement the type-B dynamic ReLU, which applies the ReLU-like activation per channel, including the selection of channel features.
1. Dynamic ReLU needs the parameter-group size, i.e. the value K, fixed in advance; K determines how many a and b parameters are generated:
- the left-hand side of the equation is the resulting dynamic parameter groups generated from the initial values: $a_k$ and $b_k$
- the right-hand side (note that each channel generates k values of a and k of b):
  - $\lambda_a$ is the coefficient for the dynamic group $a_k$, and $\lambda_b$ the coefficient for $b_k$
  - $\alpha_k$ is the constant term for $a_k$, and $\beta_k$ the constant term for $b_k$
  - the $\Delta$ terms are the projection of the input; the code makes this concrete
2. With the parameters above, the parameter groups follow.
3. This yields the a and b parameters on each channel: every channel gets k dynamic coefficients a and k dynamic biases b.
4. The final activation output is obtained through max, with a as coefficient and b as bias/constant term (see the reconstructed formulas below):
max runs over the last dimension, i.e. over the 2*k values dynamically generated per channel.
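For reference, the type-B per-channel definition (reconstructed here from the DyReLU paper as an assumption matching the code below, where $x_c$ is the feature value on channel $c$ and the $\Delta$ terms come from projecting the first token):

$$a^k_c(x) = \alpha^k + \lambda_a \, \Delta a^k_c(x), \qquad b^k_c(x) = \beta^k + \lambda_b \, \Delta b^k_c(x)$$

$$y_c = \max_{1 \le k \le K}\left(a^k_c(x)\, x_c + b^k_c(x)\right)$$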
Special note: the paper mentions replacing the linear projection inside this structure with an MLP, so the MLP has to exist first; the MLP implementation therefore comes next, followed by the dynamic ReLU.
2.3.1 MLP implementation
class MLP(nn.Layer):
"""Multi Layer Perceptron
Params Info:
in_features: input token feature size
out_features: output token feature size
mlp_ratio: the scale of hidden feature size
mlp_dropout_rate: the dropout rate of mlp layer output
"""
def __init__(self,
in_features,
out_features=None,
mlp_ratio=2,
mlp_dropout_rate=0.,
act=nn.GELU,
init_type='kn'):
super(MLP, self).__init__(name_scope="MLP")
self.out_features = in_features if out_features is None else \
out_features
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.fc1 = nn.Linear(in_features=in_features,
out_features=int(mlp_ratio*in_features),
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.fc2 = nn.Linear(in_features=int(mlp_ratio*in_features),
out_features=self.out_features,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.act = act()
self.dropout = nn.Dropout(mlp_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.fc1(inputs)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
x = self.dropout(x)
return x
2.3.2 DyReLU实现
部分参数说明:
-
k=2: 动态参数基个数
-
coefs=[1.0, 0.5]: 初始系数值
-
consts=[1.0, 0.0]: 初始常数项
-
self.mid_channels: 总的参数个数,每个通道生成2xk个结果,所以需要映射出2xkxchannels的动态参数,然后计算出动态参数组
- 见上面的
1.
- 见上面的
-
self.coef: 利用初始系数值进行广播,得到[ a 1 a_1 a1, a 2 a_2 a2, b 1 b_1 b1, b 2 b_2 b2]的
系数
,由于 a k a_k ak系数与 b k b_k bk系数总是各自唯一的,所以将初始值依次复制 k k k份即可 -
self.const: 利用初始化常数项,依次生成 a k a_k ak与 b k b_k bk的
常数项
,这里将输入的consts[0]作为 a 1 a_1 a1的常数项,其余皆为consts[2],但仍然需要保持shape为2xk
-
self.project: 将输入映射到指定的参数数量——注意,这里由于是模型中表示全局信息的Token生成动态参数,因此输入维度为
embed_dims
class DyReLU(nn.Layer):
"""Dynamic ReLU activation function -- use one MLP
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
"""
def __init__(self,
in_channels,
embed_dims,
k=2, # a_1, a_2 coef, b_1, b_2 bias
coefs=[1.0, 0.5], # coef init value
consts=[1.0, 0.0], # const init value
reduce=4,
init_type='kn'):
super(DyReLU, self).__init__(
name_scope="DyReLU")
self.embed_dims = embed_dims
self.in_channels = in_channels
self.k = k
self.mid_channels = 2*k*in_channels
# 4 values
# a_k = alpha_k + coef_k*x, 2
        # b_k = beta_k + coef_k*x, 2
self.coef = paddle.to_tensor([coefs[0]]*k + [coefs[1]]*k)
self.const = paddle.to_tensor([consts[0]] + [consts[1]]*(2*k-1))
self.project = nn.Sequential(
# nn.LayerNorm(embed_dims),
MLP(in_features=embed_dims,
out_features=self.mid_channels,
mlp_ratio=1/reduce,
act=nn.GELU,
init_type=init_type),
# nn.BatchNorm(self.mid_channels)
nn.LayerNorm(self.mid_channels)
)
def forward(self, feature_map, tokens):
B, M, D = tokens.shape
dy_params = self.project(tokens[:, 0]) # B, mid_channels
        # B, IN_CHANNELS, 2*k: raw per-channel dynamic parameters for the a and b groups
dy_params = dy_params.reshape(shape=[B, self.in_channels, 2*self.k])
        # B, IN_CHANNELS, 2*k -- a_1, a_2, b_1, b_2: the final per-channel dynamic parameter groups
dy_init_params = dy_params * self.coef + self.const
        f = feature_map.transpose(perm=[2, 3, 0, 1]).unsqueeze(axis=-1) # H, W, B, C, 1: rearrange so f(x) runs along channels and broadcasts with the params above
# output shape: H, W, B, C, k
output = f * dy_init_params[:, :, :self.k] + dy_init_params[:, :, self.k:]
output = paddle.max(output, axis=-1) # H, W, B, C
        output = output.transpose(perm=[2, 3, 0, 1]) # B, C, H, W: restore the original layout
return output
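A quick shape check (hypothetical sizes): the first token drives the per-channel dynamic parameters, and the output keeps the feature-map shape.
act = DyReLU(in_channels=8, embed_dims=16, k=2)
fm = paddle.rand([2, 8, 14, 14])
tokens = paddle.rand([2, 3, 16])
print(act(fm, tokens).shape)  # [2, 8, 14, 14]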
2.4 Lite-BottleNeck implementation
is_lite controls whether the lite structure is enabled, i.e. whether the Lite-DW layer is used when building the network.
use_dyrelu controls whether dynamic ReLU is used as the activation.
You can see that the BottleNeck is essentially the core of the Mobile structure: it basically matches the Mobile part of the bridge diagram above, though it is not quite the whole story yet.
Note that because the tokens are needed to produce the dynamic parameters, forward takes two inputs, the feature map and then the tokens.
class BottleNeck(nn.Layer):
"""BottleNeck
Params Info:
groups: the number of groups, by 1x1conv
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
is_lite: whether use lite dw_conv
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
groups=1,
kernel_size=3,
stride=1,
padding=0,
embed_dims=None,
k=2, # the number of dyrelu-params
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
is_lite=False,
init_type='kn'):
super(BottleNeck, self).__init__(
name_scope="BottleNeck")
self.is_lite = is_lite
self.use_dyrelu = use_dyrelu
assert use_dyrelu==False or (use_dyrelu==True and embed_dims is not None), \
"Error: Please make sure while the use_dyrelu==True,"+\
" embed_dims(now:{0})>0.".format(embed_dims)
self.in_pw = PointWiseConv(in_channels=in_channels,
out_channels=hidden_channels,
groups=groups,
init_type=init_type)
self.in_pw_bn = nn.BatchNorm2D(hidden_channels)
self.dw = DepthWiseConv(in_channels=hidden_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
is_lite=is_lite,
init_type=init_type)
self.dw_bn = nn.BatchNorm2D(hidden_channels)
self.out_pw = PointWiseConv(in_channels=hidden_channels,
out_channels=out_channels,
groups=groups,
init_type=init_type)
self.out_pw_bn = nn.BatchNorm2D(out_channels)
if use_dyrelu == False:
self.act = nn.ReLU()
else:
self.act = DyReLU(in_channels=hidden_channels,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
init_type=init_type)
    def forward(self, feature_map, tokens):
        x = self.in_pw(feature_map)
        x = self.in_pw_bn(x)
        # dynamic ReLU additionally needs the tokens; otherwise apply the plain ReLU
        x = self.act(x, tokens) if self.use_dyrelu else self.act(x)
        x = self.dw(x)
        x = self.dw_bn(x)
        x = self.act(x, tokens) if self.use_dyrelu else self.act(x)
        x = self.out_pw(x)
        x = self.out_pw_bn(x)
        return x
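A quick shape check (hypothetical sizes) with dynamic ReLU enabled; stride 1 and padding 1 keep the spatial size.
bneck = BottleNeck(in_channels=8, hidden_channels=16, out_channels=8,
                   kernel_size=3, stride=1, padding=1,
                   embed_dims=16, use_dyrelu=True)
fm = paddle.rand([2, 8, 14, 14])
tokens = paddle.rand([2, 3, 16])
print(bneck(fm, tokens).shape)  # [2, 8, 14, 14]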
三、Mobile implementation
The BottleNeck above already implements most of the Mobile structure, but in practice downsampling is occasionally needed, so a downsampling 3x3 DW conv is added in front of the BottleNeck.
Structure:
- if downsample: 3x3 DW + BN + activation (whether an activation belongs here is still under consideration: a classic BottleNeck outputs low-dimensional data, and MobileNetV2's work showed that nonlinear activation in low dimensions loses information; since the paper leaves many details unstated, a standard conv+bn+act downsampling block is used for now)
- BottleNeck: never the lite variant here, but with dynamic ReLU enabled
Inputs are the feature map and the tokens.
class Mobile(nn.Layer):
"""Mobile Sub-block
Params Info:
in_channels: input feature map channels
hidden_channels: the dw layer hidden channel size
groups: the number of groups, by 1x1conv
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
kernel_size=3,
stride=1,
padding=0,
groups=1,
embed_dims=None,
k=2,
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
init_type='kn'):
super(Mobile, self).__init__(
name_scope="Mobile")
self.add_dw = True if stride==2 else False
self.bneck = BottleNeck(in_channels=in_channels,
hidden_channels=hidden_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=1,
padding=1,
groups=groups,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
init_type=init_type)
if self.add_dw: # stride==2
self.downsample_dw = nn.Sequential(
DepthWiseConv(in_channels=in_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
init_type=init_type),
nn.BatchNorm2D(in_channels),
nn.ReLU() # maybe other act
)
def forward(self, feature_map, tokens):
if self.add_dw:
feature_map = self.downsample_dw(feature_map)
        # dynamic ReLU needs the tokens
x = self.bneck(feature_map, tokens)
return x
四、ToFormer_Bridge implementation
This structure fuses the feature map's local information into the tokens:
- the tokens are projected into the query, while the feature map serves directly as key and value (only shape rearrangement is applied), saving the projection cost of regular attention
- the query selectively absorbs the feature map's local information, producing the attention result
- finally, a residual connection with the original input tokens fuses the local and global information
Note the split-head independent projection here: the tokens are first split into heads, each head's query is then projected, and the results are concatenated back into the usual multi-head layout; every other step is standard attention computation.
Pros and cons of the split-head independent projection (a quick parameter comparison follows this list):
- Pro: the projection's parameter count and compute shrink as the head count grows, since a fully connected layer costs the square of the feature size (when input and output sizes are equal)
- Con: the feature-fusing projection becomes partial; each head's projection is independent of the others
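The saving is easy to verify with a little arithmetic: a full projection costs $D \times D$ weights, while $h$ independent per-head projections cost $h \cdot (D/h)^2 = D^2/h$.
# weight count of a full projection vs. the split-head projection (hypothetical D)
D = 128
for h in [1, 2, 4, 8]:
    full, split = D * D, h * (D // h) ** 2
    print(f"heads={h}: full={full}, split={split}")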
To mirror the attention of a standard transformer and keep the bridge modules decoupled, the LN is applied as the first step of forward.
Inputs are the feature map and the tokens.
class ToFormer_Bridge(nn.Layer):
"""Mobile to Former Bridge
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
"""
def __init__(self,
embed_dims,
in_channels,
num_head=1,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
qkv_bias=True,
norm=nn.LayerNorm,
init_type='kn'):
super(ToFormer_Bridge, self).__init__(
name_scope="ToFormer_Bridge")
self.num_head = num_head
self.head_dims = in_channels // num_head
self.scale = self.head_dims ** -0.5
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
        # input normalization
self.input_norm = norm(embed_dims)
        # split heads, then project: split-head independent projection layers
self.heads_q_proj = []
for i in range(num_head): # n linear
self.heads_q_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_q_proj = nn.LayerList(self.heads_q_proj)
self.output = nn.Linear(in_features=self.num_head*self.head_dims,
out_features=embed_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.droppath = DropPath(droppath_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, feature_map, tokens):
B, C, H, W = feature_map.shape
assert C % self.num_head == 0, \
"Erorr: Please make sure feature_map.channels % "+\
"num_head == 0(now:{0}).".format(C % self.num_head)
fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d
B, M, D = tokens.shape
h_token = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
h_token = h_token.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
return fm, h_token
def _multi_head_q_forward(self, token, B, M):
q_list = []
for i in range(self.num_head):
q_list.append(
# B, 1, M, head_dims
self.heads_q_proj[i](token[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
q = paddle.concat(q_list, axis=1) # B, num_head, M, head_dims
return q
def forward(self, feature_map, tokens):
B, M, D = tokens.shape
tokens_ = self.input_norm(tokens)
# fm(key/value) to shape: B, n_h, L, h_d
# token to shape: B, n_h, M, D // n_h
        fm, token = self.transfer_shape(feature_map, tokens_) # split into heads first
        q = self._multi_head_q_forward(token, B, M) # then project each head's query
# attention distribution
attn = paddle.matmul(q, fm, transpose_y=True) # B, n_h, M, L
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
# attention result
z = paddle.matmul(attn, fm) # B, n_h, M, h_d
z = z.transpose(perm=[0, 2, 1, 3])
z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
z = self.output(z) # B, M, D
z = self.dropout(z)
z = self.droppath(z)
z = z + tokens
return z
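A quick shape check (hypothetical sizes; note this cell also needs the DropPath class from section 5.2 below, matching the file split listed earlier): enriched tokens come out with the token shape unchanged.
bridge = ToFormer_Bridge(embed_dims=16, in_channels=16, num_head=2)
fm = paddle.rand([2, 16, 7, 7])
tokens = paddle.rand([2, 3, 16])
print(bridge(fm, tokens).shape)  # [2, 3, 16]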
五、Former implementation
A plain Transformer structure.
Structure: LN + ATTENTION + LN + MLP
5.1 Attention implementation
class Attention(nn.Layer):
"""Multi Head Attention
Params Info:
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
embed_dims,
num_head=1,
dropout_rate=0.,
attn_dropout_rate=0.,
qkv_bias=True,
init_type='kn'):
super(Attention, self).__init__(
name_scope="Attention")
self.num_head = num_head
self.head_dims = embed_dims // num_head
self.scale = self.head_dims ** -0.5
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.qkv_proj = nn.Linear(in_features=embed_dims,
out_features=3*self.num_head*self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
self.output = nn.Linear(in_features=self.num_head*self.head_dims,
out_features=embed_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, q, k, v):
B, M, _ = q.shape
q = q.reshape(shape=[B, M, self.num_head, self.head_dims])
q = q.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
k = k.reshape(shape=[B, M, self.num_head, self.head_dims])
k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
v = v.reshape(shape=[B, M, self.num_head, self.head_dims])
v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
return q, k, v
def forward(self, inputs):
B, M, D = inputs.shape
assert D % self.num_head == 0, \
"Erorr: Please make sure Token.D % "+\
"num_head == 0(now:{0}).".format(D % self.num_head)
qkv= self.qkv_proj(inputs)
q, k, v = qkv.chunk(3, axis=-1)
# B, n_h, M, h_d
q, k, v = self.transfer_shape(q, k, v)
attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, M, M
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
z = paddle.matmul(attn, v) # B, n_h, M, h_d
z = z.transpose(perm=[0, 2, 1, 3]) # B, M, n_h, h_d
z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
z = self.output(z)
        z = self.dropout(z)
return z
5.2 DropPath implementation: multi-branch dropout
Randomly zeroes whole samples along the batch dimension.
class DropPath(nn.Layer):
"""Multi-branch dropout layer -- Along the axis of Batch
Params Info:
p: droppath rate
"""
def __init__(self,
p=0.):
super(DropPath, self).__init__(
name_scope="DropPath")
self.p = p
def forward(self, inputs):
if self.p > 0. and self.training:
keep_p = np.asarray([1 - self.p], dtype='float32')
keep_p = paddle.to_tensor(keep_p)
# B, 1, 1....
shape = [inputs.shape[0]] + [1] * (inputs.ndim-1)
random_dr = paddle.rand(shape=shape, dtype='float32')
random_sample = paddle.add(keep_p, random_dr).floor() # floor to int--B
output = paddle.divide(inputs, keep_p) * random_sample
return output
return inputs
5.3 Former implementation
class Former(nn.Layer):
"""Former Sub-block
Params Info:
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
embed_dims,
num_head=1,
mlp_ratio=2,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
mlp_dropout_rate=0.,
norm=nn.LayerNorm,
act=nn.GELU,
qkv_bias=True,
init_type='kn'):
super(Former, self).__init__(name_scope="Former")
self.attn = Attention(embed_dims=embed_dims,
num_head=num_head,
dropout_rate=dropout_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
self.attn_ln = norm(embed_dims)
self.attn_droppath = DropPath(droppath_rate)
self.mlp = MLP(in_features=embed_dims,
mlp_ratio=mlp_ratio,
mlp_dropout_rate=mlp_dropout_rate,
act=act,
init_type=init_type)
self.mlp_ln = norm(embed_dims)
self.mlp_droppath = DropPath(droppath_rate)
def forward(self, inputs):
res = inputs
x = self.attn_ln(inputs)
x = self.attn(x)
x = self.attn_droppath(x)
x = x + res
res = x
x = self.mlp_ln(x)
x = self.mlp(x)
x = self.mlp_droppath(x)
x = x + res
return x
六、ToMobile_Bridge implementation
The last structure in the MobileFormer block.
It takes the current stage's token output as the bridge's data source:
- the tokens are projected into Key and Value, while the feature map output by Mobile acts as the raw Query
- the query inspects the composition of the values and extracts the global information that needs to be added
- a residual connection fuses the extracted global information with the original local information
Each part mirrors ToFormer_Bridge: projection is still the tokens' job, the query always pulls out the information it needs, and a residual fuses the features.
All projections use the split-head independent projection.
To mirror the attention of a standard transformer and keep the bridge modules decoupled, the LN is applied as the first step of forward.
Inputs are the feature map and the tokens.
class ToMobile_Bridge(nn.Layer):
"""Former to Mobile Bridge
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
"""
def __init__(self,
embed_dims,
in_channels,
num_head=1,
dropout_rate=0.,
droppath_rate=0.0,
attn_dropout_rate=0.,
qkv_bias=True,
norm=nn.LayerNorm,
init_type='kn'):
super(ToMobile_Bridge, self).__init__(
name_scope="ToMobile_Bridge")
self.num_head = num_head
self.head_dims = in_channels // num_head
self.scale = self.head_dims ** -0.5
self.input_token_norm = norm(embed_dims)
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.heads_k_proj = []
self.heads_v_proj = []
for i in range(num_head): # n linear
self.heads_k_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_v_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_k_proj = nn.LayerList(self.heads_k_proj)
self.heads_v_proj = nn.LayerList(self.heads_v_proj)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.droppath = DropPath(droppath_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, feature_map, tokens):
B, C, H, W = feature_map.shape
assert C % self.num_head == 0, \
"Erorr: Please make sure feature_map.channels % "+\
"num_head == 0(now:{0}).".format(C % self.num_head)
fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d
B, M, D = tokens.shape
k = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
v = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
return fm, k, v
def _multi_head_kv_forward(self, k_, v_, B, M):
k_list = []
v_list = []
for i in range(self.num_head):
k_list.append(
# B, 1, M, head_dims
self.heads_k_proj[i](k_[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
v_list.append(
# B, 1, M, head_dims
self.heads_v_proj[i](v_[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
k = paddle.concat(k_list, axis=1) # B, num_head, M, head_dims
v = paddle.concat(v_list, axis=1) # B, num_head, M, head_dims
return k, v
def forward(self, feature_map, tokens):
B, C, H, W = feature_map.shape
B, M, D = tokens.shape
tokens = self.input_token_norm(tokens)
# fm(q) to shape: B, n_h, L, h_d
# k/v to shape: B, n_h, M, D // n_h
q, k_, v_ = self.transfer_shape(feature_map, tokens)
k, v = self._multi_head_kv_forward(k_, v_, B, M)
# attention distribution
attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, L, M
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
# attention result
z = paddle.matmul(attn, v) # B, n_h, L, h_d
z = z.transpose(perm=[0, 1, 3, 2]) # B, n_h, h_d, L
# B, n_h*h_d, H, W
z = z.reshape(shape=[B, self.num_head*self.head_dims, H, W])
z = self.dropout(z)
z = self.droppath(z)
z = z + feature_map
return z
七、MFBlock implementation
Assembling the MobileFormer building units above gives the MobileFormer block, MFBlock for short.
Structure:
- ToFormer_Bridge
- Former
- Mobile
- ToMobile_Bridge
Inputs are the feature map and the tokens.
class MFBlock(nn.Layer):
"""MobileFormer Basic Block
Params Info:
in_channels: the number of input feature map channel
hidden_channels: the number of hidden(dw_conv) feature map channel
out_channels: the number of output feature map channel
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
groups: the number of groups in 1x1 conv
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
embed_dims,
kernel_size=3,
stride=1,
padding=0,
groups=1,
k=2,
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
num_head=1,
mlp_ratio=2,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
mlp_dropout_rate=0.,
norm=nn.LayerNorm,
act=nn.GELU,
qkv_bias=True,
init_type='kn'):
super(MFBlock, self).__init__(
name_scope="MFBlock")
self.mobile = Mobile(in_channels=in_channels,
hidden_channels=hidden_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
groups=groups,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
init_type=init_type)
self.toformer_bridge = ToFormer_Bridge(embed_dims=embed_dims,
in_channels=in_channels,
num_head=num_head,
dropout_rate=dropout_rate,
droppath_rate=droppath_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
self.former = Former(embed_dims=embed_dims,
num_head=num_head,
mlp_ratio=mlp_ratio,
                             dropout_rate=dropout_rate,
mlp_dropout_rate=mlp_dropout_rate,
attn_dropout_rate=attn_dropout_rate,
droppath_rate=droppath_rate,
qkv_bias=qkv_bias,
norm=norm,
act=act,
init_type=init_type)
self.tomobile_bridge = ToMobile_Bridge(in_channels=out_channels,
embed_dims=embed_dims,
num_head=num_head,
dropout_rate=dropout_rate,
droppath_rate=droppath_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
def forward(self, feature_map, tokens):
z_h = self.toformer_bridge(feature_map, tokens)
z_out = self.former(z_h)
f_h = self.mobile(feature_map, z_out)
f_out = self.tomobile_bridge(f_h, z_out)
return f_out, z_out
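A quick end-to-end shape check for one block (hypothetical sizes): with stride 1 both the feature map and the tokens keep their shapes.
block = MFBlock(in_channels=16, hidden_channels=32, out_channels=16,
                embed_dims=16, kernel_size=3, stride=1, padding=1,
                num_head=2, use_dyrelu=True)
fm = paddle.rand([2, 16, 14, 14])
tokens = paddle.rand([2, 3, 16])
f_out, z_out = block(fm, tokens)
print(f_out.shape, z_out.shape)  # [2, 16, 14, 14] [2, 3, 16]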
八、Classifier head implementation
For classification, the first token is concatenated onto the pooled feature map, and the result feeds the classifier.
class Classifier_Head(nn.Layer):
"""Classifier Head
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
hidden_features: the fc layer hidden feature size
num_classes: the number of classes
"""
def __init__(self,
in_channels,
embed_dims,
hidden_features,
num_classes=1000,
dropout=0.0,
act=nn.Hardswish,
init_type='kn'):
super(Classifier_Head, self).__init__(
name_scope="Classifier_Head")
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.avg_pool = nn.AdaptiveAvgPool2D(output_size=1)
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(in_features=in_channels+embed_dims,
out_features=hidden_features,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.dropout = nn.Dropout(dropout)
self.fc2 = nn.Linear(in_features=hidden_features,
out_features=num_classes,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.act = act()
self.softmax = nn.Softmax()
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, feature_map, tokens):
x = self.avg_pool(feature_map) # B, C, 1, 1
x = self.flatten(x) # B, C
        z = tokens[:, 0] # B, D
x = paddle.concat([x, z], axis=-1)
x = self.fc1(x)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
if self.training:
return x
return self.softmax(x)
Assembling the MobileFormer model
Assembling all the structures above yields the MobileFormer architecture, here following the MobileFormer294M layout.
Because there are many layers and the parameters are complex, they are passed in via a yaml file together with config.py; the rest of the assembly proceeds normally.
Structure:
- Stem -- the stem layer
- Lite-BottleNeck -- lightweight bottleneck
- Mobile-Former Block -- the basic block
- End_ToFormer_Bridge -- final global-information enhancement before output
- Channel Conv -- channel expansion before the final classification
- Classifier Head -- classification head
Builder methods:
- _create_token: creates the learnable tokens
- _create_stem: creates the Stem layer
- _create_lite_bneck: creates the Lite-BottleNeck layer
- _create_mf_blocks: creates the MobileFormer block layers
- _create_former_end_bridge: creates the end ToFormer bridge
- _create_channel_conv: creates the channel conv layer
- _create_head: creates the head
- _create_model: assembles the network
Only the assembly class is shown here; the build_mformer function that builds a model from a config is defined at the end of mobileformer.py.
class MobileFormer(nn.Layer):
"""MobileFormer
Params Info:
num_classes: the number of classes
in_channels: the number of input feature map channel
tokens: the shape of former token
num_head: the number of head is in multi head attention
groups: the number of groups in 1x1 conv
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
alpha: the scale of model size
qkv_bias: whether use the bias in qkv matrix
config: total model config
init_type: init params kind
"""
def __init__(self, num_classes=1000, in_channels=3,
tokens=[3, 128], num_head=4, mlp_ratio=2,
use_dyrelu=True, k=2, reduce=4.0,
coefs=[1.0, 0.5], consts=[1.0, 0.0],
dropout_rate=0.0, droppath_rate=0.0,
attn_dropout_rate=0.0, mlp_dropout_rate=0.0,
norm=nn.LayerNorm, act=nn.GELU,
alpha=1.0, qkv_bias=True,
config=None, init_type='kn'):
super(MobileFormer, self).__init__()
self.num_token, self.embed_dims = tokens[0], tokens[1]
self.num_head = num_head
self.num_classes = num_classes
self.in_channels = in_channels
self.mlp_ratio = mlp_ratio
self.alpha = alpha
self.qkv_bias = qkv_bias
self.dropout_rate = dropout_rate
self.droppath_rate = droppath_rate
self.attn_dropout_rate = attn_dropout_rate
self.mlp_dropout_rate = mlp_dropout_rate
self.init_type = init_type
assert init_type in ['xu', 'ku', 'kn'], \
"Error: Please choice the init type in ['xu', 'ku', 'kn']"+\
", but now it is {0}.".format(init_type)
assert config is not None, \
"Error: Please enter the config(now: {0})".format(config)+\
" in the __init__."
# create learnable tokens: self.tokens
self._create_token(num_token=self.num_token,
embed_dims=self.embed_dims)
# create total model
self._create_model(use_dyrelu=use_dyrelu,
reduce=reduce, dyrelu_k=k,
coefs=coefs, consts=consts,
alpha=alpha, norm=norm, act=act,
config=config)
def _create_token(self, num_token, embed_dims):
# B(1), token_size, embed_dims
shape = [1] + [num_token, embed_dims]
self.tokens = self.create_parameter(shape=shape, dtype='float32')
def _create_stem(self,
in_channels,
out_channels,
kernel_size,
stride, padding,
alpha):
self.stem = Stem(in_channels=in_channels,
out_channels=int(alpha * out_channels),
kernel_size=kernel_size,
stride=stride,
padding=padding,
init_type=self.init_type)
def _create_lite_bneck(self,
in_channels,
hidden_channels,
out_channels,
kernel_size,
stride,
padding,
alpha,
pointwiseconv_groups):
self.bneck_lite = BottleNeck(in_channels=int(alpha * in_channels),
hidden_channels=int(alpha * hidden_channels),
out_channels=int(alpha * out_channels),
groups=pointwiseconv_groups,
kernel_size=kernel_size,
stride=stride,
padding=padding,
use_dyrelu=False,
is_lite=True,
init_type=self.init_type)
def _create_mf_blocks(self,
in_channel_list,
hidden_channel_list,
out_channel_list,
kernel_list,
stride_list,
padding_list,
alpha,
use_dyrelu,
reduce,
dyrelu_k,
coefs,
consts,
norm,
act,
pointwiseconv_groups):
self.blocks = []
for i in range(0, len(in_channel_list)):
self.blocks.append(
MFBlock(
in_channels=int(alpha * in_channel_list[i]),
hidden_channels=int(alpha * hidden_channel_list[i]),
out_channels=int(alpha * out_channel_list[i]),
embed_dims=self.embed_dims,
kernel_size=kernel_list[i],
stride=stride_list[i],
padding=padding_list[i],
groups=pointwiseconv_groups,
k=dyrelu_k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
num_head=self.num_head,
mlp_ratio=self.mlp_ratio,
dropout_rate=self.dropout_rate,
droppath_rate=self.droppath_rate,
attn_dropout_rate=self.attn_dropout_rate,
mlp_dropout_rate=self.mlp_dropout_rate,
norm=norm,
act=act,
init_type=self.init_type
)
)
self.blocks = nn.LayerList(self.blocks)
def _create_former_end_bridge(self,
in_channels,
norm,
alpha):
self.end_toformer_bridge = ToFormer_Bridge(embed_dims=self.embed_dims,
in_channels=int(alpha * in_channels),
num_head=self.num_head,
dropout_rate=self.dropout_rate,
droppath_rate=self.droppath_rate,
attn_dropout_rate=self.attn_dropout_rate,
init_type=self.init_type)
def _create_channel_conv(self,
in_channels,
out_channels,
alpha,
pointwiseconv_groups):
self.channel_conv = nn.Sequential(
PointWiseConv(in_channels=int(alpha * in_channels),
out_channels=out_channels,
groups=pointwiseconv_groups,
init_type=self.init_type),
nn.BatchNorm2D(out_channels),
nn.ReLU()
)
def _create_head(self,
in_channels,
hidden_features):
self.head = Classifier_Head(in_channels=in_channels,
embed_dims=self.embed_dims,
hidden_features=hidden_features,
num_classes=self.num_classes,
dropout=self.dropout_rate,
init_type=self.init_type)
def _create_model(self,
use_dyrelu,
reduce,
dyrelu_k,
coefs,
consts,
norm,
act,
alpha,
config):
# create stem: self.stem
self._create_stem(in_channels=self.in_channels,
out_channels=config.MODEL.MF.STEM.OUT_CHANNELS,
kernel_size=config.MODEL.MF.STEM.KERNELS,
stride=config.MODEL.MF.STEM.STRIEDS,
padding=config.MODEL.MF.STEM.PADDINGS,
alpha=alpha)
# create lite-bottleneck: self.bneck_lite
self._create_lite_bneck(in_channels=config.MODEL.MF.LITE_BNECK.IN_CHANNEL,
hidden_channels=config.MODEL.MF.LITE_BNECK.HIDDEN_CHANNEL,
out_channels=config.MODEL.MF.LITE_BNECK.OUT_CHANNEL,
kernel_size=config.MODEL.MF.LITE_BNECK.KERNEL,
stride=config.MODEL.MF.LITE_BNECK.STRIED,
padding=config.MODEL.MF.LITE_BNECK.PADDING,
alpha=alpha,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create mobileformer blocks: self.blocks
self._create_mf_blocks(in_channel_list=config.MODEL.MF.BLOCK.IN_CHANNELS,
hidden_channel_list=config.MODEL.MF.BLOCK.HIDDEN_CHANNELS,
out_channel_list=config.MODEL.MF.BLOCK.OUT_CHANNELS,
kernel_list=config.MODEL.MF.BLOCK.KERNELS,
stride_list=config.MODEL.MF.BLOCK.STRIEDS,
padding_list=config.MODEL.MF.BLOCK.PADDINGS,
alpha=alpha,
use_dyrelu=use_dyrelu,
reduce=reduce,
dyrelu_k=dyrelu_k,
coefs=coefs,
consts=consts,
norm=norm,
act=act,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create final toformer_bridge: self.toformer_bridge
self._create_former_end_bridge(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
norm=norm,
alpha=alpha)
# create channel 1x1 conv: self.channel_conv
self._create_channel_conv(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
out_channels=config.MODEL.MF.CHANNEL_CONV.OUT_CHANNEL,
alpha=alpha,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create classifier head: self.head
self._create_head(in_channels=config.MODEL.MF.HEAD.IN_CHANNEL,
hidden_features=config.MODEL.MF.HEAD.HIDDEN_FEATURE)
def _to_batch_tokens(self, batch_size):
# B, token_size, embed_dims
return paddle.concat([self.tokens]*batch_size, axis=0)
# @paddle.jit.to_static
def forward(self, inputs):
B, _, _, _ = inputs.shape
f = self.stem(inputs)
# create batch tokens
tokens = self._to_batch_tokens(B) # B, token_size, embed_dims
f = self.bneck_lite(f, tokens)
for b in self.blocks:
f, tokens = b(f, tokens)
tokens = self.end_toformer_bridge(f, tokens)
f = self.channel_conv(f)
output = self.head(f, tokens)
return output
How to use MobileFormer
Model parameters are loaded and configured through mobileformer_26m.yaml, relying on the parameter layout in config.py together with the settings in the yaml file.
Brief notes on the yaml parameters:
- MF: configuration of the entire MobileFormer network architecture
- BLOCK: configuration of all MFBlocks, including input, hidden and output channel counts, etc.
- DYRELU: configuration of the dynamic ReLU
- The config loaded via get_config defaults to 1000 classes; one parameter must be added to the yaml to build a model with a custom number of classes
Original:
MODEL:
TYPE: MobileFormer
NAME: MobileFormer_26M
DROPPATH: 0.1
DROPOUT: 0.1
MLP_DROPOUT: 0.1
ATTENTION_DROPOUT: 0.1
After the addition (10 classes):
MODEL:
NUM_CLASSES: 10
TYPE: MobileFormer
NAME: MobileFormer_26M
DROPPATH: 0.1
DROPOUT: 0.1
MLP_DROPOUT: 0.1
ATTENTION_DROPOUT: 0.1
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('mobileformer_26m.yaml')
model = build_model(config)
test_data = paddle.rand((1, 3, 224, 224))
y_pred = model(test_data)
print('model output: ', y_pred.shape)
=> merge config from mobileformer_26m.yaml
model output: [1, 10]
Experiment notes
All experiments below use MobileFormer26M, the smallest MobileFormer model.
1. MNIST (99.3)
- Batch size 256, 4 GPUs
- Learning rate and the other settings follow the config; no mixup-style augmentation, only random crop, resize, ToTensor and normalize
- Input images upscaled to 48
- In the model, the Lite-BottleNeck stride is reduced to 1; everything else unchanged
- best_model.pdparams == 8.85M
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from paddle import nn
from MNIST.mnist_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('./MNIST/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./MNIST/best_model.pdparams'))
model = paddle.Model(model)
model.prepare(
optimizer=None,
loss=nn.CrossEntropyLoss(),
metrics=paddle.metric.Accuracy(topk=(1, 5))
)
model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./MNIST/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.5501 - acc_top1: 0.9910 - acc_top5: 0.9996 - 237ms/step
step 20/40 - loss: 1.5416 - acc_top1: 0.9916 - acc_top5: 0.9996 - 227ms/step
step 30/40 - loss: 1.5380 - acc_top1: 0.9934 - acc_top5: 0.9997 - 227ms/step
step 40/40 - loss: 1.5374 - acc_top1: 0.9942 - acc_top5: 0.9998 - 221ms/step
Eval samples: 10000
{'loss': [1.5374461], 'acc_top1': 0.9942, 'acc_top5': 0.9998}
2. Cifar10 (89.6)
Accuracy could still improve; it was held back by a batch size and other settings that were not quite ideal.
- Batch size 256, 4 GPUs
- Learning rate and the other settings follow the config; mixup augmentation used, plus random crop, resize, ToTensor and normalize
- Input images upscaled to 64
- In the model, the Lite-BottleNeck stride is reduced to 1; everything else unchanged
- best_model.pdparams == 8.85M
Other notes
A follow-up guide will cover using MobileFormer in practice, plus a simple tutorial on adapting the PaddleViT source to custom datasets.
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from paddle import nn
from Cifar10.cifar10_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('./Cifar10/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./Cifar10/best_model.pdparams'))
model = paddle.Model(model)
model.prepare(
optimizer=None,
loss=nn.CrossEntropyLoss(),
metrics=paddle.metric.Accuracy(topk=(1, 5))
)
model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./Cifar10/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.7457 - acc_top1: 0.8949 - acc_top5: 0.9926 - 246ms/step
step 20/40 - loss: 1.7452 - acc_top1: 0.8961 - acc_top5: 0.9906 - 243ms/step
step 30/40 - loss: 1.7186 - acc_top1: 0.8988 - acc_top5: 0.9926 - 244ms/step
step 40/40 - loss: 1.7148 - acc_top1: 0.8980 - acc_top5: 0.9933 - 242ms/step
Eval samples: 10000
{'loss': [1.7147579], 'acc_top1': 0.898, 'acc_top5': 0.9933}
For the preprocessing involved (e.g. Mixup), see: BR-IDL/PaddleViT
Feel free to follow PaddleViT; it offers many more ViT model implementations to learn from and use.
Discussion in the comments is welcome, too.
Name: Cai Jinghui
Education: fourth-year undergraduate
Interests: competitions large and small, not limited to computer vision; like-minded readers are welcome to follow
Main areas: object detection, image segmentation, image recognition
Contact: QQ 3020889729, WeChat cjh3020889729
School: Southwest University of Science and Technology