Pain Points of the Vision Transformer

Full model implementation and training code: BR-IDL/PaddleViT
Reproduction tutorial video: (first release online) Reproducing the MobileFormer model step by step with Paddle

Parts of the implementation have been adjusted slightly; updated model code will be pushed to the PaddleViT repository later. (The main implementation matches the video; some normalization layers and parameters differ slightly.)

The model structure went through repeated adjustment because the paper's code is not open source, so some parameter details had to be tuned by testing on different datasets (ImageNet not yet tested).

Datasets tested so far: MNIST, Cifar10, Cifar100.

Over the past year, ViT models have emerged one after another: stronger task performance, higher recognition accuracy, better detection results, and notable transfer ability. SOTA models of every kind keep appearing.

Yet however well these ViT models perform, their performance is usually tied to compute and parameter count.

Tied how? Simply put, most of today's high-performing ViT models cost more than 1 GFLOPs, which limits their real-time applicability.

Researchers have therefore studied ViT structures, both pure ViT and hybrids that mix in ViT, that could occupy the same niche among ViTs that the MobileNet series occupies among convolutional architectures, bringing high ViT performance to mobile devices.

With that motivation, the past year has produced lightweight ViT architectures such as MobileViT and MobileFormer.

This project takes MobileFormer as the subject of this hands-on paper reproduction and walks through the latest MobileFormer architecture design.

(Figures from the MobileFormer paper: the parallel bridge structure, the model parameter/FLOPs table, and the overall architecture.)

MobileFormer Feature Highlights

  • A parallel hybrid architecture (conv + ViT); most previous models are organized serially.

  • Split-head independent attention projection (split into heads before projecting); compared with projecting q/k/v directly, this reduces the parameter count as the number of heads grows.

  • A bridge structure for attention-based interaction between local and global information; only a handful of tokens are projected for attention, and the feature map is never projected.

  • A small number of tokens embed/learn the global information, cutting compute and parameters.

The split-head attention projection is introduced later. This detail was settled while aligning parameter counts: the paper only says multi-head attention, but conventional multi-head attention would use noticeably more parameters, hence this change.

Feel free to discuss the structural details in the comments.

How MobileFormer Exchanges Global and Local Information

  1. On the left, the input $X_i$ is a feature map; on the right, the input $Z_i$ is a set of tokens.
  2. The left input $X_i$ first enters the yellow region on the right, fusing the feature map's local information into $Z_i$. This is done with attention, but only the input tokens are projected to queries.
  3. In this step the feature map is not projected: it is used directly as key and value in the attention computation, and a residual connection merges the local information into the tokens' global information. This path is Mobile->Former, renamed ToFormer_Bridge here.
  4. After the ToFormer_Bridge, the enriched $Z_i$ enters the Former structure (the green region), a plain Transformer block, which outputs $Z_{i+1}$.
  5. $Z_{i+1}$, now the stage's complete global-information tokens, is passed back to the Mobile structure (the blue region); only then is $X_i$ fed into Mobile. This supplements the local information with global information.
  6. This exchange works by feeding the tokens to the dynamic ReLU, which generates dynamic parameters that filter the local features produced by the convolutions.
  7. After Mobile, $X_i$ still lacks a full global view, so one more attention step against the global tokens is needed.
  8. This time $Z_{i+1}$ is projected twice to produce key and value, while the Mobile output $X_i$ keeps its original features and acts directly as the query guiding the fusion of global information; a residual connection completes the fusion. This path is Former->Mobile, renamed ToMobile_Bridge here.
  9. Because the feature map is never projected during this exchange, its local information stays untouched and merges into the tokens in a purer form, while the tokens' global information is updated across many such blocks and fused back into the feature map.

The above constitutes one MobileFormer_Block.
In both bridges, ToFormer_Bridge and ToMobile_Bridge, only the tokens undergo attention projection, and both use the split-head independent projection mentioned earlier.

In short:

  • The side projected to queries always filters the information on the key/value side: when the tokens are the query they filter local information, and the residual then fuses the selected information back in (see the sketch below).
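Below is a minimal sketch of that "project only the query side" idea shared by both bridges. It is a toy illustration in plain Paddle (single head, LN/dropout omitted, all sizes and names illustrative), not the implementation that follows later.

import paddle
from paddle import nn

B, C, H, W, M, D = 2, 12, 8, 8, 6, 12            # toy sizes; here C == D for simplicity
fm = paddle.rand([B, C, H, W])                   # local features  X_i
tokens = paddle.rand([B, M, D])                  # global tokens   Z_i

q_proj = nn.Linear(D, C)                         # only the tokens get projected
q = q_proj(tokens)                               # B, M, C
kv = fm.reshape([B, C, H * W]).transpose([0, 2, 1])  # B, L, C -- feature map used as-is

attn = nn.functional.softmax(paddle.matmul(q, kv, transpose_y=True) * C ** -0.5, axis=-1)
z_next = tokens + paddle.matmul(attn, kv)        # residual fuses local info into the tokens
print(z_next.shape)                              # [2, 6, 12]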

Reproducing the MobileFormer Model (Structure)

The walkthrough is based on the MobileFormer-294M structure; the other variants differ only in some parameters.

The implementation of each part is organized as follows:

  1. baseconv.py: Stem + PW + DW + BottleNeck

  2. attention.py: MLP + Attention

  3. dyrelu.py: DyReLU

  4. droppath.py: DropPath

  5. mobileformer.py: Classifier_Head + Mobile + ToFormer_Bridge + Former + ToMobile_Bridge + MFBlock + MobileFormer + build_mformer

# import the basic dependencies
import paddle
from paddle import nn

import numpy as np

1. Stem: the Gradual-Entry Layer

The token part is created during final model assembly, so we skip it for now.

Calling the first convolution/feature-extraction module the "gradual-entry layer" (Stem) is a personal naming habit.

The idea is that features start flowing gradually into the model. A convolutional stem usually downsamples, and it should: in experiments, changing MobileFormer's first layer to stride 1 performed worse than stride-2 downsampling.

Structure:

  1. One 3x3 conv layer + BatchNorm2D + Hardswish

  2. The _conv_init function produces the chosen parameter initializer, wired in through weight_attr and bias_attr

class Stem(nn.Layer):
    """Stem
    """
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=3,
                 stride=1,
                 padding=0,
                 act=nn.Hardswish,
                 init_type='kn'):
        super(Stem, self).__init__(
                 name_scope="Stem")
        conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
        self.conv = nn.Conv2D(in_channels=in_channels,
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding,
                              weight_attr=conv_weight_attr,
                              bias_attr=conv_bias_attr)
        self.bn = nn.BatchNorm2D(out_channels)
        self.act = act()

    def _conv_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def forward(self, inputs):
        x = self.conv(inputs)
        x = self.bn(x)
        x = self.act(x)
        return x
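A quick sanity check of the Stem (hypothetical usage; the argument values are illustrative): a stride-2 stem halves the spatial resolution.

stem = Stem(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = paddle.rand([1, 3, 224, 224])
print(stem(x).shape)  # [1, 8, 112, 112]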

2. Lite-BottleNeck

MobileFormer reuses the linear bottleneck designed in MobileNetV3, so we first implement a regular BottleNeck and then obtain the lite variant by changing parameters.

First, the composition of a BottleNeck: 1x1 pw-conv + activation (DyReLU) + 3x3 dw-conv + activation (DyReLU) + 1x1 pw-conv

pw (pointwise): a 1x1 convolution that compresses/expands channels; it is the key structure controlling the bottleneck width.

dw (depthwise): a 3x3 grouped convolution that performs the feature extraction.

DyReLU: a dynamic ReLU that generates dynamic parameters from an extra input and filters features via max.

2.1 PointWise Convolution (PW)

For decoupling, this layer contains only the convolution, without normalization or activation.

Structure:

  1. 1x1 conv, with group-convolution support

Group convolution is supported because in the smallest model every 1x1 pointwise conv uses 4 groups to cut parameters (see the count sketch after the class below).

Since this is a fixed 1x1 conv that only manages channels, kernel size, stride, and similar arguments are not exposed.

class PointWiseConv(nn.Layer):
    """PointWise 1x1Conv -- support group conv
        Params Info:
            groups: the number of groups
    """
    def __init__(self,
                 in_channels,
                 out_channels,
                 groups=1,
                 init_type='kn'):
        super(PointWiseConv, self).__init__(
                 name_scope="PointWiseConv")
        conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
        self.conv = nn.Conv2D(in_channels=in_channels,
                              out_channels=out_channels,
                              kernel_size=1,
                              stride=1,
                              padding=0,
                              groups=groups,
                            weight_attr=conv_weight_attr,
                            bias_attr=conv_bias_attr)

    def _conv_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def forward(self, inputs):
        x = self.conv(inputs)
        return x
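A hypothetical parameter count showing why groups=4 helps on the smallest model: the 1x1 conv weight tensor has shape [out, in/groups, 1, 1], so 4 groups cut the weights to a quarter.

pw = PointWiseConv(in_channels=16, out_channels=32, groups=1)
pw_g4 = PointWiseConv(in_channels=16, out_channels=32, groups=4)
count = lambda m: sum(int(np.prod(p.shape)) for p in m.parameters())
print(count(pw), count(pw_g4))  # 544 vs 160 (weights 512 vs 128, plus 32 biases)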

2.2 DepthWise Convolution (DW)

For decoupling, this layer contains only the convolution, without normalization or activation.

Structure:

  1. 3x3 grouped conv or lite 3x3 grouped conv

The number of groups always equals the input channel count: inside the bottleneck, DW only extracts features and leaves the channel count unchanged.

Kernel size and similar arguments are exposed to allow adjustment, but the output channel count is not, since it always equals the input channel count.

The lite 3x3 grouped conv is built from two conv layers with kernels [3, 1] and [1, 3].

Note that for the lite structure, the given stride and padding must likewise be split into two different values and passed to the two conv layers (a quick check follows the class below).

class DepthWiseConv(nn.Layer):
    """DepthWise Conv -- support lite weight dw_conv
        Params Info:
            is_lite: use lite weight dw_conv
    """
    def __init__(self,
                 in_channels,
                 kernel_size=3,
                 stride=1,
                 padding=0,
                 is_lite=False,
                 init_type='kn'):
        super(DepthWiseConv, self).__init__(
                 name_scope="DepthWiseConv")
        self.is_lite = is_lite

        conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
        if is_lite is False:
            self.conv = nn.Conv2D(in_channels=in_channels,
                                out_channels=in_channels,
                                groups=in_channels,
                                kernel_size=kernel_size,
                                stride=stride,
                                padding=padding,
                                weight_attr=conv_weight_attr,
                                bias_attr=conv_bias_attr)
        else: # lite structure
            self.conv = nn.Sequential(
                # kernel_size -- [3, 1]
                nn.Conv2D(in_channels=in_channels,
                          out_channels=in_channels,
                          kernel_size=[kernel_size, 1],
                          stride=[stride, 1],
                          padding=[padding, 0],
                          groups=in_channels,
                        weight_attr=conv_weight_attr,
                        bias_attr=conv_bias_attr),
                nn.BatchNorm2D(in_channels),
                # kernel_size -- [1, 3]
                nn.Conv2D(in_channels=in_channels,
                          out_channels=in_channels,
                          kernel_size=[1, kernel_size],
                          stride=[1, stride],
                          padding=[0, padding],
                          groups=in_channels,
                        weight_attr=conv_weight_attr,
                        bias_attr=conv_bias_attr)
            )

    def _conv_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def forward(self, inputs):
        x = self.conv(inputs)
        return x
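A hypothetical shape check of both variants: per channel, the lite version stores 3+3=6 kernel weights instead of 9, at the price of an extra BN between the two convs, and both downsample identically.

dw = DepthWiseConv(in_channels=16, kernel_size=3, stride=2, padding=1)
dw_lite = DepthWiseConv(in_channels=16, kernel_size=3, stride=2, padding=1, is_lite=True)
x = paddle.rand([1, 16, 32, 32])
print(dw(x).shape, dw_lite(x).shape)  # both [1, 16, 16, 16]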

2.3 Dynamic ReLU

We implement the B-variant of dynamic ReLU (DyReLU-B), a channel-wise ReLU-style activation that also filters channel features.

1. Dynamic ReLU needs a preset parameter-group size K; K determines how many a and b parameters are generated per channel:

$$a_k^c(x) = \alpha_k + \lambda_a \Delta a_k^c(x), \qquad b_k^c(x) = \beta_k + \lambda_b \Delta b_k^c(x)$$

  • The left-hand sides are the resulting dynamic parameter groups, $a_k^c$ and $b_k^c$ respectively

  • On the right-hand side (note that every channel gets its own k values of a and b):

    • $\lambda_a$ is the coefficient for generating the group $a_k$; $\lambda_b$ is the coefficient for generating the group $b_k$
    • $\alpha_k$ is the constant term for $a_k$; $\beta_k$ is the constant term for $b_k$
    • the $\Delta$ terms come from projecting the input; the code below makes this concrete

2. With the quantities above, the parameter groups per channel follow directly from the two formulas.

3. Each channel thus gets k dynamic $a$ coefficients and k dynamic $b$ coefficients.

4. The final activation output is a max, with $a$ as the slope and $b$ as the bias/constant term:

$$y_c = \max_{1 \le k \le K}\left(a_k^c\, x_c + b_k^c\right)$$

The max runs over the last axis, covering the K candidate results generated per channel.

Note: since the paper replaces the linear projection in this structure with an MLP, the MLP must be implemented first; it is shown next, followed by dynamic ReLU.

2.3.1 MLP
class MLP(nn.Layer):
    """Multi Layer Perceptron
        Params Info:
            in_features: input token feature size
            out_features: output token feature size
            mlp_ratio: the scale of hidden feature size
            mlp_dropout_rate: the dropout rate of mlp layer output
    """
    def __init__(self,
                 in_features,
                 out_features=None,
                 mlp_ratio=2,
                 mlp_dropout_rate=0.,
                 act=nn.GELU,
                 init_type='kn'):
        super(MLP, self).__init__(name_scope="MLP")
        self.out_features = in_features if out_features is None else \
                            out_features
        linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
        self.fc1 = nn.Linear(in_features=in_features,
                             out_features=int(mlp_ratio*in_features),
                             weight_attr=linear_weight_attr,
                             bias_attr=linear_bias_attr)
        self.fc2 = nn.Linear(in_features=int(mlp_ratio*in_features),
                             out_features=self.out_features,
                             weight_attr=linear_weight_attr,
                             bias_attr=linear_bias_attr)

        self.act = act()
        self.dropout = nn.Dropout(mlp_dropout_rate)

    def _linear_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def forward(self, inputs):
        x = self.fc1(inputs)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.dropout(x)
        return x
2.3.2 DyReLU

Parameter notes:

  • k=2: the number of dynamic parameter bases

  • coefs=[1.0, 0.5]: initial coefficient values

  • consts=[1.0, 0.0]: initial constant terms

  • self.mid_channels: the total parameter count; each channel produces 2*k values, so the projection must output 2*k*channels dynamic parameters, from which the parameter groups are computed

    • see point 1 above
  • self.coef: broadcasts the initial coefficients into the [$a_1$, $a_2$, $b_1$, $b_2$] coefficients; since the $a_k$ coefficient and the $b_k$ coefficient are each shared within their group, the initial values are simply repeated k times each

  • self.const: builds the $a_k$ and $b_k$ constant terms; consts[0] is the constant for $a_1$ and all remaining entries use consts[1], keeping the shape 2*k

  • self.project: maps the input to the required number of parameters; note that the dynamic parameters are generated from the token carrying the model's global information, so the input dimension is embed_dims

class DyReLU(nn.Layer):
    """Dynamic ReLU activation function -- use one MLP
        Params Info:
            in_channels: input feature map channels
            embed_dims: input token embed_dims
            k: the number of parameters is in Dynamic ReLU
            coefs: the init value of coefficient parameters
            consts: the init value of constant parameters
            reduce: the mlp hidden scale,
                    means 1/reduce = mlp_ratio
    """
    def __init__(self,
                 in_channels,
                 embed_dims,
                 k=2, # a_1, a_2 coef, b_1, b_2 bias
                 coefs=[1.0, 0.5], # coef init value
                 consts=[1.0, 0.0], # const init value
                 reduce=4,
                 init_type='kn'):
        super(DyReLU, self).__init__(
                 name_scope="DyReLU")
        self.embed_dims = embed_dims
        self.in_channels = in_channels
        self.k = k

        self.mid_channels = 2*k*in_channels

        # 4 values
        # a_k = alpha_k + coef_k*x, 2
        # b_k = beta_k + coef_k*x, 2
        self.coef = paddle.to_tensor([coefs[0]]*k + [coefs[1]]*k)
        self.const = paddle.to_tensor([consts[0]] + [consts[1]]*(2*k-1))

        self.project = nn.Sequential(
            # nn.LayerNorm(embed_dims),
            MLP(in_features=embed_dims,
                out_features=self.mid_channels,
                mlp_ratio=1/reduce,
                act=nn.GELU,
                init_type=init_type),
            # nn.BatchNorm(self.mid_channels)
            nn.LayerNorm(self.mid_channels)
        )

    def forward(self, feature_map, tokens):
        B, M, D = tokens.shape
        dy_params = self.project(tokens[:, 0]) # B, mid_channels
        # B, IN_CHANNELS, 2*k: raw dynamic parameters for each channel's a and b groups
        dy_params = dy_params.reshape(shape=[B, self.in_channels, 2*self.k])

        # B, IN_CHANNELS, 2*k -- a_1, a_2, b_1, b_2: final dynamic parameter groups per channel
        dy_init_params = dy_params * self.coef + self.const
        f = feature_map.transpose(perm=[2, 3, 0, 1]).unsqueeze(axis=-1) # H, W, B, C, 1: move channels last so f(x) runs per channel and broadcasts against the params

        # output shape: H, W, B, C, k
        output = f * dy_init_params[:, :, :self.k] + dy_init_params[:, :, self.k:]
        output = paddle.max(output, axis=-1) # H, W, B, C
        output = output.transpose(perm=[2, 3, 0, 1]) # B, C, H, W: restore the original layout

        return output
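A hypothetical shape check: the dynamic parameters come from the first token, and the activation is applied channel-wise to the feature map, which keeps its shape.

act = DyReLU(in_channels=8, embed_dims=16)
fm = paddle.rand([2, 8, 14, 14])
tok = paddle.rand([2, 3, 16])
print(act(fm, tok).shape)  # [2, 8, 14, 14]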

2.4 Lite-BottleNeck

is_lite controls whether the lite structure is enabled, i.e. whether the layers are built with the lite DW conv.

use_dyrelu controls whether dynamic ReLU is used as the activation.

The BottleNeck is essentially the core of the Mobile structure, basically matching the Mobile part of the bridge figure shown earlier, though not quite complete yet.

Note: since the token generates the dynamic parameters, forward takes two arguments, the feature map followed by the tokens.

class BottleNeck(nn.Layer):
    """BottleNeck
        Params Info:
            groups: the number of groups, by 1x1conv
            embed_dims: input token embed_dims
            k: the number of parameters is in Dynamic ReLU
            coefs: the init value of coefficient parameters
            consts: the init value of constant parameters
            reduce: the mlp hidden scale,
                    means 1/reduce = mlp_ratio
            use_dyrelu: whether use dyrelu
            is_lite: whether use lite dw_conv
    """
    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 groups=1,
                 kernel_size=3,
                 stride=1,
                 padding=0,
                 embed_dims=None,
                 k=2, # the number of dyrelu-params
                 coefs=[1.0, 0.5],
                 consts=[1.0, 0.0],
                 reduce=4,
                 use_dyrelu=False,
                 is_lite=False,
                 init_type='kn'):
        super(BottleNeck, self).__init__(
                 name_scope="BottleNeck")
        self.is_lite = is_lite
        self.use_dyrelu = use_dyrelu

        assert use_dyrelu==False or (use_dyrelu==True and embed_dims is not None), \
               "Error: Please make sure while the use_dyrelu==True,"+\
               " embed_dims(now:{0})>0.".format(embed_dims)

        self.in_pw = PointWiseConv(in_channels=in_channels,
                                   out_channels=hidden_channels,
                                   groups=groups,
                                   init_type=init_type)
        self.in_pw_bn = nn.BatchNorm2D(hidden_channels)

        self.dw = DepthWiseConv(in_channels=hidden_channels,
                                kernel_size=kernel_size,
                                stride=stride,
                                padding=padding,
                                is_lite=is_lite,
                                init_type=init_type)
        self.dw_bn = nn.BatchNorm2D(hidden_channels)

        self.out_pw = PointWiseConv(in_channels=hidden_channels,
                                    out_channels=out_channels,
                                    groups=groups,
                                    init_type=init_type)
        self.out_pw_bn = nn.BatchNorm2D(out_channels)

        if use_dyrelu == False:
            self.act = nn.ReLU()
        else:
            self.act = DyReLU(in_channels=hidden_channels,
                                embed_dims=embed_dims,
                                k=k,
                                coefs=coefs,
                                consts=consts,
                                reduce=reduce,
                                init_type=init_type)

    def forward(self, feature_map, tokens):
        x = self.in_pw(feature_map)
        x = self.in_pw_bn(x)
        if self.use_dyrelu:
            x = self.act(x, tokens)

        x = self.dw(x)
        x = self.dw_bn(x)
        if self.use_dyrelu:
            x = self.act(x, tokens)

        x = self.out_pw(x)
        x = self.out_pw_bn(x)

        return x

3. Mobile

The BottleNeck above already covers most of the Mobile structure, but in practice downsampling is occasionally needed, so a downsampling 3x3 DW conv is added before the BottleNeck.

Structure:

  • if downsample: 3x3 DW + BN + activation (whether an activation belongs here is debatable: a classic BottleNeck outputs low-dimensional data, and MobileNetV2's work showed that nonlinear activation in low dimensions loses information; since the paper leaves many details unspecified, a standard downsampling block of conv+bn+act is used for now)

  • BottleNeck: never uses the lite structure, but does use dynamic ReLU

Inputs: feature map and tokens

class Mobile(nn.Layer):
    """Mobile Sub-block
        Params Info:
            in_channels: input feature map channels
            hidden_channels: the dw layer hidden channel size
            groups: the number of groups, by 1x1conv
            embed_dims: input token embed_dims
            k: the number of parameters is in Dynamic ReLU
            coefs: the init value of coefficient parameters
            consts: the init value of constant parameters
            reduce: the mlp hidden scale,
                    means 1/reduce = mlp_ratio
            use_dyrelu: whether use dyrelu
    """
    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 kernel_size=3,
                 stride=1,
                 padding=0,
                 groups=1,
                 embed_dims=None,
                 k=2,
                 coefs=[1.0, 0.5],
                 consts=[1.0, 0.0],
                 reduce=4,
                 use_dyrelu=False,
                 init_type='kn'):
        super(Mobile, self).__init__(
                 name_scope="Mobile")
        self.add_dw = True if stride==2 else False
        self.bneck = BottleNeck(in_channels=in_channels,
                                hidden_channels=hidden_channels,
                                out_channels=out_channels,
                                kernel_size=kernel_size,
                                stride=1,
                                padding=1,
                                groups=groups,
                                embed_dims=embed_dims,
                                k=k,
                                coefs=coefs,
                                consts=consts,
                                reduce=reduce,
                                use_dyrelu=use_dyrelu,
                                init_type=init_type)

        if self.add_dw: # stride==2
            self.downsample_dw = nn.Sequential(
                DepthWiseConv(in_channels=in_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding,
                              init_type=init_type),
                nn.BatchNorm2D(in_channels),
                nn.ReLU() # maybe other act
            )

    def forward(self, feature_map, tokens):
        if self.add_dw:
            feature_map = self.downsample_dw(feature_map)
        
        # the dynamic ReLU needs the tokens
        x = self.bneck(feature_map, tokens)
        return x

4. ToFormer_Bridge

This structure fuses the feature map's local information into the tokens:

  • The tokens are projected to queries, while the feature map serves directly as key and value (only reshaped), saving the projection parameter cost of regular attention

  • The query selectively absorbs the feature map's local information, producing the attention result

  • Finally a residual connection with the original input tokens fuses local and global information

Note the split-head independent projection: the tokens are first split into heads, each head is projected to its own query, and the heads are concatenated back into the usual multi-head layout. All other steps are standard attention computation.

Pros and cons of split-head independent projection (see the count sketch after this list):

  • Pro: the projection's parameter and compute cost shrinks as the head count grows, since a fully connected layer's parameter count is quadratic in the feature size (when input and output sizes are equal)

  • Con: features are only mixed within each head; the heads' projections are independent of one another
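A quick back-of-the-envelope count of the claim above (illustrative numbers): one full query projection of width D costs D^2 weights, while h independent per-head projections cost h*(D/h)^2 = D^2/h.

D, h = 128, 4
full = D * D                      # one D x D query projection
split = h * (D // h) * (D // h)   # h independent (D/h) x (D/h) projections
print(full, split)                # 16384 4096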

To mirror the attention of a standard Transformer while keeping the bridge modules decoupled, the LN is applied as the first step of forward.

Inputs: feature map and tokens

class ToFormer_Bridge(nn.Layer):
    """Mobile to Former Bridge
        Params Info:
            in_channels: input feature map channels
            embed_dims: input token embed_dims
            num_head: the number of head is in multi head attention
            dropout_rate: the dropout rate of attention result
            attn_dropout_rate: the dropout rate of attention distribution
    """
    def __init__(self,
                 embed_dims,
                 in_channels,
                 num_head=1,
                 dropout_rate=0.,
                 droppath_rate=0.,
                 attn_dropout_rate=0.,
                 qkv_bias=True,
                 norm=nn.LayerNorm,
                 init_type='kn'):
        super(ToFormer_Bridge, self).__init__(
                 name_scope="ToFormer_Bridge")
        self.num_head = num_head
        self.head_dims = in_channels // num_head
        self.scale = self.head_dims ** -0.5
        
        linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
        # input normlization
        self.input_norm = norm(embed_dims)

        # split head to project: independent projection per head
        self.heads_q_proj = []
        for i in range(num_head): # n linear
            self.heads_q_proj.append(
                nn.Linear(in_features=embed_dims // num_head,
                          out_features=self.head_dims,
                          weight_attr=linear_weight_attr,
                          bias_attr=linear_bias_attr if qkv_bias else False)
            )
        self.heads_q_proj = nn.LayerList(self.heads_q_proj)

        self.output = nn.Linear(in_features=self.num_head*self.head_dims,
                                out_features=embed_dims,
                                weight_attr=linear_weight_attr,
                                bias_attr=linear_bias_attr)
        self.softmax = nn.Softmax()
        self.dropout = nn.Dropout(dropout_rate)
        self.droppath = DropPath(droppath_rate)
        self.attn_dropout= nn.Dropout(attn_dropout_rate)

    def _linear_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def transfer_shape(self, feature_map, tokens):
        B, C, H, W = feature_map.shape
        assert C % self.num_head == 0, \
            "Erorr: Please make sure feature_map.channels % "+\
            "num_head == 0(now:{0}).".format(C % self.num_head)
        fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
        fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
        fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
        fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d

        B, M, D = tokens.shape
        h_token = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
        h_token = h_token.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h

        return fm, h_token

    def _multi_head_q_forward(self, token, B, M):
        q_list = []
        for i in range(self.num_head):
            q_list.append(
                # B, 1, M, head_dims
                self.heads_q_proj[i](token[:, i, :, :]).reshape(
                                shape=[B, 1, M, self.head_dims])
            )
        q = paddle.concat(q_list, axis=1) # B, num_head, M, head_dims
        return q

    def forward(self, feature_map, tokens):
        B, M, D = tokens.shape
        tokens_ = self.input_norm(tokens)

        # fm(key/value) to shape: B, n_h, L, h_d
        # token to shape: B, n_h, M, D // n_h
        fm, token = self.transfer_shape(feature_map, tokens_) # split into heads first
        q = self._multi_head_q_forward(token, B, M) # then project

        # attention distribution
        attn = paddle.matmul(q, fm, transpose_y=True) # B, n_h, M, L
        attn = attn * self.scale
        attn = self.softmax(attn)
        attn = self.attn_dropout(attn)

        # attention result
        z = paddle.matmul(attn, fm) # B, n_h, M, h_d
        z = z.transpose(perm=[0, 2, 1, 3])
        z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
        z = self.output(z) # B, M, D
        z = self.dropout(z)

        z = self.droppath(z)
        z = z + tokens

        return z
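A hypothetical shape check (it needs the DropPath class from section 5.2 below to already be defined): the bridge consumes a feature map plus tokens and returns updated tokens.

bridge = ToFormer_Bridge(embed_dims=16, in_channels=8, num_head=2)
fm = paddle.rand([2, 8, 14, 14])
tok = paddle.rand([2, 3, 16])
print(bridge(fm, tok).shape)  # [2, 3, 16]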

5. Former

A plain Transformer block.

Structure:

  • LN + ATTENTION + LN + MLP

5.1 Attention

class Attention(nn.Layer):
    """Multi Head Attention
        Params Info:
            embed_dims: input token embed_dims
            num_head: the number of head is in multi head attention
            dropout_rate: the dropout rate of attention result
            attn_dropout_rate: the dropout rate of attention distribution
            qkv_bias: whether use the bias in qkv matrix
    """
    def __init__(self,
                 embed_dims,
                 num_head=1,
                 dropout_rate=0.,
                 attn_dropout_rate=0.,
                 qkv_bias=True,
                 init_type='kn'):
        super(Attention, self).__init__(
                 name_scope="Attention")
        self.num_head = num_head
        self.head_dims = embed_dims // num_head
        self.scale = self.head_dims ** -0.5

        linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
        self.qkv_proj = nn.Linear(in_features=embed_dims,
                                  out_features=3*self.num_head*self.head_dims,
                                  weight_attr=linear_weight_attr,
                                  bias_attr=linear_bias_attr if qkv_bias else False)
        self.output = nn.Linear(in_features=self.num_head*self.head_dims,
                                out_features=embed_dims,
                                weight_attr=linear_weight_attr,
                                bias_attr=linear_bias_attr)

        self.softmax = nn.Softmax()
        self.dropout = nn.Dropout(dropout_rate)
        self.attn_dropout= nn.Dropout(attn_dropout_rate)

    def _linear_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def transfer_shape(self, q, k, v):
        B, M, _ = q.shape
        q = q.reshape(shape=[B, M, self.num_head, self.head_dims])
        q = q.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
        k = k.reshape(shape=[B, M, self.num_head, self.head_dims])
        k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
        v = v.reshape(shape=[B, M, self.num_head, self.head_dims])
        v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d

        return q, k, v

    def forward(self, inputs):
        B, M, D = inputs.shape
        assert D % self.num_head == 0, \
            "Erorr: Please make sure Token.D % "+\
            "num_head == 0(now:{0}).".format(D % self.num_head)

        qkv= self.qkv_proj(inputs)
        q, k, v = qkv.chunk(3, axis=-1)
        # B, n_h, M, h_d
        q, k, v = self.transfer_shape(q, k, v)

        attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, M, M
        attn = attn * self.scale
        attn = self.softmax(attn)
        attn = self.attn_dropout(attn)

        z = paddle.matmul(attn, v) # B, n_h, M, h_d
        z = z.transpose(perm=[0, 2, 1, 3]) # B, M, n_h, h_d
        z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
        z = self.output(z)
        z = self.dropout(z)

        return z

5.2 DropPath: multi-branch dropout

Randomly zeroes whole samples along the batch axis.

class DropPath(nn.Layer):
    """Multi-branch dropout layer -- Along the axis of Batch
        Params Info:
            p: droppath rate
    """
    def __init__(self,
                 p=0.):
        super(DropPath, self).__init__(
                 name_scope="DropPath")
        self.p = p
    
    def forward(self, inputs):
        if self.p > 0. and self.training:
            keep_p = np.asarray([1 - self.p], dtype='float32')
            keep_p = paddle.to_tensor(keep_p)
            # B, 1, 1....
            shape = [inputs.shape[0]] + [1] * (inputs.ndim-1)
            random_dr = paddle.rand(shape=shape, dtype='float32')
            random_sample = paddle.add(keep_p, random_dr).floor() # floor to int--B
            output = paddle.divide(inputs, keep_p) * random_sample
            return output

        return inputs
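A small hypothetical demo of the behavior: in training mode whole samples are zeroed with probability p and survivors are rescaled by 1/(1-p); in eval mode the input passes through unchanged.

dp = DropPath(p=0.5)
dp.train()
x = paddle.ones([4, 2, 3])
print(dp(x)[:, 0, 0])  # each sample is 0.0 or 2.0
dp.eval()
print(dp(x)[:, 0, 0])  # all 1.0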

5.3 Former

class Former(nn.Layer):
    """Former Sub-block
        Params Info:
            embed_dims: input token embed_dims
            num_head: the number of head is in multi head attention
            mlp_ratio: the scale of hidden feature size
            dropout_rate: the dropout rate of attention result
            droppath_rate: the droppath rate of attention output
            attn_dropout_rate: the dropout rate of attention distribution
            mlp_dropout_rate: the dropout rate of mlp layer output
            qkv_bias: whether use the bias in qkv matrix
    """
    def __init__(self,
                 embed_dims,
                 num_head=1,
                 mlp_ratio=2,
                 dropout_rate=0.,
                 droppath_rate=0.,
                 attn_dropout_rate=0.,
                 mlp_dropout_rate=0.,
                 norm=nn.LayerNorm,
                 act=nn.GELU,
                 qkv_bias=True,
                 init_type='kn'):
        super(Former, self).__init__(name_scope="Former")

        self.attn = Attention(embed_dims=embed_dims,
                                num_head=num_head,
                                dropout_rate=dropout_rate,
                                attn_dropout_rate=attn_dropout_rate,
                                qkv_bias=qkv_bias,
                                init_type=init_type)
        self.attn_ln = norm(embed_dims)
        self.attn_droppath = DropPath(droppath_rate)

        self.mlp = MLP(in_features=embed_dims,
                        mlp_ratio=mlp_ratio,
                        mlp_dropout_rate=mlp_dropout_rate,
                        act=act,
                        init_type=init_type)
        self.mlp_ln = norm(embed_dims)
        self.mlp_droppath = DropPath(droppath_rate)

    def forward(self, inputs):
        res = inputs
        x = self.attn_ln(inputs)
        x = self.attn(x)
        x = self.attn_droppath(x)
        x = x + res

        res = x
        x = self.mlp_ln(x)
        x = self.mlp(x)
        x = self.mlp_droppath(x)
        x = x + res

        return x

6. ToMobile_Bridge

The last structure in a MobileFormer block.

It uses the current stage's token output as the bridge's data source:

  1. The tokens are projected to key and value; the feature map output by Mobile serves as the raw query

  2. The query selects among the values, yielding the filtered global information that needs to be merged in

  3. A residual connection fuses the selected global information with the original local information

Each part mirrors ToFormer_Bridge: projection is still the tokens' job, the query always extracts the needed information, and the residual fuses the features.

All projections use the split-head independent projection.

To mirror the attention of a standard Transformer while keeping the bridge modules decoupled, the LN is applied as the first step of forward.

Inputs: feature map and tokens

class ToMobile_Bridge(nn.Layer):
    """Former to Mobile Bridge
        Params Info:
            in_channels: input feature map channels
            embed_dims: input token embed_dims
            num_head: the number of head is in multi head attention
            dropout_rate: the dropout rate of attention result
            attn_dropout_rate: the dropout rate of attention distribution
    """
    def __init__(self,
                 embed_dims,
                 in_channels,
                 num_head=1,
                 dropout_rate=0.,
                 droppath_rate=0.0,
                 attn_dropout_rate=0.,
                 qkv_bias=True,
                 norm=nn.LayerNorm,
                 init_type='kn'):
        super(ToMobile_Bridge, self).__init__(
                 name_scope="ToMobile_Bridge")
        self.num_head = num_head
        self.head_dims = in_channels // num_head
        self.scale = self.head_dims ** -0.5

        self.input_token_norm = norm(embed_dims)

        linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)

        self.heads_k_proj = []
        self.heads_v_proj = []
        for i in range(num_head): # n linear
            self.heads_k_proj.append(
                nn.Linear(in_features=embed_dims // num_head,
                          out_features=self.head_dims,
                          weight_attr=linear_weight_attr,
                          bias_attr=linear_bias_attr if qkv_bias else False)
            )
            self.heads_v_proj.append(
                nn.Linear(in_features=embed_dims // num_head,
                          out_features=self.head_dims,
                          weight_attr=linear_weight_attr,
                          bias_attr=linear_bias_attr if qkv_bias else False)
            )
        self.heads_k_proj = nn.LayerList(self.heads_k_proj)
        self.heads_v_proj = nn.LayerList(self.heads_v_proj)

        self.softmax = nn.Softmax()
        self.dropout = nn.Dropout(dropout_rate)
        self.droppath = DropPath(droppath_rate)
        self.attn_dropout= nn.Dropout(attn_dropout_rate)

    def _linear_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def transfer_shape(self, feature_map, tokens):
        B, C, H, W = feature_map.shape
        assert C % self.num_head == 0, \
            "Erorr: Please make sure feature_map.channels % "+\
            "num_head == 0(now:{0}).".format(C % self.num_head)

        fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
        fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
        fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
        fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d

        B, M, D = tokens.shape
        k = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
        k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
        v = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
        v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h

        return fm, k, v

    def _multi_head_kv_forward(self, k_, v_, B, M):
        k_list = []
        v_list = []
        for i in range(self.num_head):
            k_list.append(
                # B, 1, M, head_dims
                self.heads_k_proj[i](k_[:, i, :, :]).reshape(
                                shape=[B, 1, M, self.head_dims])
            )
            v_list.append(
                # B, 1, M, head_dims
                self.heads_v_proj[i](v_[:, i, :, :]).reshape(
                                shape=[B, 1, M, self.head_dims])
            )
        k = paddle.concat(k_list, axis=1) # B, num_head, M, head_dims
        v = paddle.concat(v_list, axis=1) # B, num_head, M, head_dims
        return k, v

    def forward(self, feature_map, tokens):
        B, C, H, W = feature_map.shape
        B, M, D = tokens.shape

        tokens = self.input_token_norm(tokens)
        # fm(q) to shape: B, n_h, L, h_d
        # k/v to shape: B, n_h, M, D // n_h
        q, k_, v_ = self.transfer_shape(feature_map, tokens)
        k, v = self._multi_head_kv_forward(k_, v_, B, M)

        # attention distribution
        attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, L, M
        attn = attn * self.scale
        attn = self.softmax(attn)
        attn = self.attn_dropout(attn)

        # attention result
        z = paddle.matmul(attn, v) # B, n_h, L, h_d
        z = z.transpose(perm=[0, 1, 3, 2]) # B, n_h, h_d, L
        # B, n_h*h_d, H, W
        z = z.reshape(shape=[B, self.num_head*self.head_dims, H, W])
        z = self.dropout(z)

        z = self.droppath(z)
        z = z + feature_map

        return z
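The mirror-image shape check (hypothetical): this bridge consumes the same pair of inputs but returns an updated feature map.

bridge = ToMobile_Bridge(embed_dims=16, in_channels=8, num_head=2)
fm = paddle.rand([2, 8, 14, 14])
tok = paddle.rand([2, 3, 16])
print(bridge(fm, tok).shape)  # [2, 8, 14, 14]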

7. MFBlock

Chaining the MobileFormer building units above yields the MobileFormer block, abbreviated MFBlock.

Structure:

  1. ToFormer_Bridge

  2. Former

  3. Mobile

  4. ToMobile_Bridge

Inputs: feature map and tokens

class MFBlock(nn.Layer):
    """MobileFormer Basic Block
        Params Info:
            in_channels: the number of input feature map channel
            hidden_channels: the number of hidden(dw_conv) feature map channel
            out_channels: the number of output feature map channel
            embed_dims: input token embed_dims
            num_head: the number of head is in multi head attention
            groups: the number of groups in 1x1 conv
            k: the number of parameters is in Dynamic ReLU
            coefs: the init value of coefficient parameters
            consts: the init value of constant parameters
            reduce: the mlp hidden scale,
                    means 1/reduce = mlp_ratio
            use_dyrelu: whether use dyrelu
            mlp_ratio: the scale of hidden feature size
            dropout_rate: the dropout rate of attention result
            droppath_rate: the droppath rate of attention output
            attn_dropout_rate: the dropout rate of attention distribution
            mlp_dropout_rate: the dropout rate of mlp layer output
            qkv_bias: whether use the bias in qkv matrix
    """
    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 embed_dims,
                 kernel_size=3,
                 stride=1,
                 padding=0,
                 groups=1,
                 k=2,
                 coefs=[1.0, 0.5],
                 consts=[1.0, 0.0],
                 reduce=4,
                 use_dyrelu=False,
                 num_head=1,
                 mlp_ratio=2,
                 dropout_rate=0.,
                 droppath_rate=0.,
                 attn_dropout_rate=0.,
                 mlp_dropout_rate=0.,
                 norm=nn.LayerNorm,
                 act=nn.GELU,
                 qkv_bias=True,
                 init_type='kn'):
        super(MFBlock, self).__init__(
                 name_scope="MFBlock")
        self.mobile = Mobile(in_channels=in_channels,
                             hidden_channels=hidden_channels,
                             out_channels=out_channels,
                             kernel_size=kernel_size,
                             stride=stride,
                             padding=padding,
                             groups=groups,
                             embed_dims=embed_dims,
                             k=k,
                             coefs=coefs,
                             consts=consts,
                             reduce=reduce,
                             use_dyrelu=use_dyrelu,
                             init_type=init_type)

        self.toformer_bridge = ToFormer_Bridge(embed_dims=embed_dims,
                                               in_channels=in_channels,
                                               num_head=num_head,
                                               dropout_rate=dropout_rate,
                                               droppath_rate=droppath_rate,
                                               attn_dropout_rate=attn_dropout_rate,
                                               qkv_bias=qkv_bias,
                                               init_type=init_type)

        self.former = Former(embed_dims=embed_dims,
                             num_head=num_head,
                             mlp_ratio=mlp_ratio,
                             dropout_rate=dropout_rate,
                             mlp_dropout_rate=mlp_dropout_rate,
                             attn_dropout_rate=attn_dropout_rate,
                             droppath_rate=droppath_rate,
                             qkv_bias=qkv_bias,
                             norm=norm,
                             act=act,
                             init_type=init_type)

        self.tomobile_bridge = ToMobile_Bridge(in_channels=out_channels,
                                               embed_dims=embed_dims,
                                               num_head=num_head,
                                               dropout_rate=dropout_rate,
                                               droppath_rate=droppath_rate,
                                               attn_dropout_rate=attn_dropout_rate,
                                               qkv_bias=qkv_bias,
                                               init_type=init_type)

    def forward(self, feature_map, tokens):
        z_h = self.toformer_bridge(feature_map, tokens)
        z_out = self.former(z_h)

        f_h = self.mobile(feature_map, z_out)
        f_out = self.tomobile_bridge(f_h, z_out)
        return f_out, z_out
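A hypothetical end-to-end check of one block: with stride=2 the Mobile branch halves the spatial size, while the token shape is preserved.

block = MFBlock(in_channels=8, hidden_channels=16, out_channels=12,
                embed_dims=16, stride=2, padding=1, use_dyrelu=True, num_head=2)
fm = paddle.rand([2, 8, 28, 28])
tok = paddle.rand([2, 3, 16])
f_out, z_out = block(fm, tok)
print(f_out.shape, z_out.shape)  # [2, 12, 14, 14] [2, 3, 16]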

8. Classifier Head

For classification, the first token is concatenated to the pooled feature map, and the result goes through the classifier.

class Classifier_Head(nn.Layer):
    """Classifier Head
        Params Info:
            in_channels: input feature map channels
            embed_dims: input token embed_dims
            hidden_features: the fc layer hidden feature size
            num_classes: the number of classes
    """
    def __init__(self,
                 in_channels,
                 embed_dims,
                 hidden_features,
                 num_classes=1000,
                 dropout=0.0,
                 act=nn.Hardswish,
                 init_type='kn'):
        super(Classifier_Head, self).__init__(
                 name_scope="Classifier_Head")
        linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)

        self.avg_pool = nn.AdaptiveAvgPool2D(output_size=1)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(in_features=in_channels+embed_dims,
                             out_features=hidden_features,
                             weight_attr=linear_weight_attr,
                             bias_attr=linear_bias_attr)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(in_features=hidden_features,
                             out_features=num_classes,
                             weight_attr=linear_weight_attr,
                             bias_attr=linear_bias_attr)

        self.act = act()
        self.softmax = nn.Softmax()

    def _linear_init(self, init_type='kn'):
        if init_type == 'xu':
            weight_attr = nn.initializer.XavierUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'ku':
            weight_attr = nn.initializer.KaimingUniform()
            bias_attr = nn.initializer.Constant(value=0.0)
        elif init_type == 'kn':
            weight_attr = nn.initializer.KaimingNormal()
            bias_attr = nn.initializer.Constant(value=0.0)
        return weight_attr, bias_attr

    def forward(self, feature_map, tokens):
        x = self.avg_pool(feature_map) # B, C, 1, 1
        x = self.flatten(x) # B, C

        z = tokens[:, 0] # B, D
        x = paddle.concat([x, z], axis=-1)

        x = self.fc1(x)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)

        if self.training:
            return x
        return self.softmax(x)

Assembling the MobileFormer Model

Wiring up all the structures above yields the MobileFormer architecture, similar in shape to MobileFormer-294M:

Because there are many layers with complex parameters, they are passed in via a yaml file and config.py; the rest of the assembly proceeds as usual.

Structure:

  • Stem – gradual-entry layer

  • Lite-BottleNeck – lightweight bottleneck

  • Mobile-Former Block – basic block

  • End_ToFormer_Bridge – final global-information enhancement before output

  • Channel Conv – channel expansion before classification

  • Classifier Head – classification head

Builder methods:

  • _create_token: creates the learnable tokens

  • _create_stem: creates the Stem layer

  • _create_lite_bneck: creates the Lite-BottleNeck layer

  • _create_mf_blocks: creates the MobileFormer block layers

  • _create_former_end_bridge: creates the end ToFormer_Bridge layer

  • _create_channel_conv: creates the Channel Conv layer

  • _create_head: creates the head layer

  • _create_model: assembles the network

Only the network assembly is presented here; the build_mformer function that builds a model from the config is not shown and is defined at the end of mobileformer.py.

class MobileFormer(nn.Layer):
    """MobileFormer
        Params Info:
            num_classes: the number of classes
            in_channels: the number of input feature map channel
            tokens: the shape of former token
            num_head: the number of head is in multi head attention
            groups: the number of groups in 1x1 conv
            k: the number of parameters is in Dynamic ReLU
            coefs: the init value of coefficient parameters
            consts: the init value of constant parameters
            reduce: the mlp hidden scale,
                    means 1/reduce = mlp_ratio
            use_dyrelu: whether use dyrelu
            mlp_ratio: the scale of hidden feature size
            dropout_rate: the dropout rate of attention result
            droppath_rate: the droppath rate of attention output
            attn_dropout_rate: the dropout rate of attention distribution
            mlp_dropout_rate: the dropout rate of mlp layer output
            alpha: the scale of model size
            qkv_bias: whether use the bias in qkv matrix
            config: total model config
            init_type: init params kind
    """
    def __init__(self, num_classes=1000, in_channels=3,
                 tokens=[3, 128], num_head=4, mlp_ratio=2,
                 use_dyrelu=True, k=2, reduce=4.0,
                 coefs=[1.0, 0.5], consts=[1.0, 0.0],
                 dropout_rate=0.0, droppath_rate=0.0,
                 attn_dropout_rate=0.0, mlp_dropout_rate=0.0,
                 norm=nn.LayerNorm, act=nn.GELU,
                 alpha=1.0, qkv_bias=True,
                 config=None, init_type='kn'):
        super(MobileFormer, self).__init__()
        self.num_token, self.embed_dims = tokens[0], tokens[1]
        self.num_head = num_head
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.mlp_ratio = mlp_ratio
        self.alpha = alpha
        self.qkv_bias = qkv_bias
        self.dropout_rate = dropout_rate
        self.droppath_rate = droppath_rate
        self.attn_dropout_rate = attn_dropout_rate
        self.mlp_dropout_rate = mlp_dropout_rate
        self.init_type = init_type

        assert init_type in ['xu', 'ku', 'kn'], \
            "Error: Please choice the init type in ['xu', 'ku', 'kn']"+\
            ", but now it is {0}.".format(init_type)
        assert config is not None, \
            "Error: Please enter the config(now: {0})".format(config)+\
            " in the __init__."

        # create learnable tokens: self.tokens
        self._create_token(num_token=self.num_token,
                           embed_dims=self.embed_dims)

        # create total model
        self._create_model(use_dyrelu=use_dyrelu,
                           reduce=reduce, dyrelu_k=k,
                           coefs=coefs, consts=consts,
                           alpha=alpha, norm=norm, act=act,
                           config=config)

    def _create_token(self, num_token, embed_dims):
        # B(1), token_size, embed_dims
        shape = [1] + [num_token, embed_dims]
        self.tokens = self.create_parameter(shape=shape, dtype='float32')

    def _create_stem(self,
                     in_channels,
                     out_channels,
                     kernel_size,
                     stride, padding,
                     alpha):
        self.stem = Stem(in_channels=in_channels,
                         out_channels=int(alpha * out_channels),
                         kernel_size=kernel_size,
                         stride=stride,
                         padding=padding,
                         init_type=self.init_type)

    def _create_lite_bneck(self,
                           in_channels,
                           hidden_channels,
                           out_channels,
                           kernel_size,
                           stride,
                           padding,
                           alpha,
                           pointwiseconv_groups):
        self.bneck_lite = BottleNeck(in_channels=int(alpha * in_channels),
                                     hidden_channels=int(alpha * hidden_channels),
                                     out_channels=int(alpha * out_channels),
                                     groups=pointwiseconv_groups,
                                     kernel_size=kernel_size,
                                     stride=stride,
                                     padding=padding,
                                     use_dyrelu=False,
                                     is_lite=True,
                                     init_type=self.init_type)

    def _create_mf_blocks(self,
                          in_channel_list,
                          hidden_channel_list,
                          out_channel_list,
                          kernel_list,
                          stride_list,
                          padding_list,
                          alpha,
                          use_dyrelu,
                          reduce,
                          dyrelu_k,
                          coefs,
                          consts,
                          norm,
                          act,
                          pointwiseconv_groups):
        self.blocks = []
        for i in range(0, len(in_channel_list)):
            self.blocks.append(
                MFBlock(
                    in_channels=int(alpha * in_channel_list[i]),
                    hidden_channels=int(alpha * hidden_channel_list[i]),
                    out_channels=int(alpha * out_channel_list[i]),
                    embed_dims=self.embed_dims,
                    kernel_size=kernel_list[i],
                    stride=stride_list[i],
                    padding=padding_list[i],
                    groups=pointwiseconv_groups,
                    k=dyrelu_k,
                    coefs=coefs,
                    consts=consts,
                    reduce=reduce,
                    use_dyrelu=use_dyrelu,
                    num_head=self.num_head,
                    mlp_ratio=self.mlp_ratio,
                    dropout_rate=self.dropout_rate,
                    droppath_rate=self.droppath_rate,
                    attn_dropout_rate=self.attn_dropout_rate,
                    mlp_dropout_rate=self.mlp_dropout_rate,
                    norm=norm,
                    act=act,
                    init_type=self.init_type
                )
            )
        self.blocks = nn.LayerList(self.blocks)

    def _create_former_end_bridge(self,
                                  in_channels,
                                  norm,
                                  alpha):
        self.end_toformer_bridge = ToFormer_Bridge(embed_dims=self.embed_dims,
                                                    in_channels=int(alpha * in_channels),
                                                    num_head=self.num_head,
                                                    dropout_rate=self.dropout_rate,
                                                    droppath_rate=self.droppath_rate,
                                                    attn_dropout_rate=self.attn_dropout_rate,
                                                    init_type=self.init_type)

    def _create_channel_conv(self,
                             in_channels,
                             out_channels,
                             alpha,
                             pointwiseconv_groups):
        self.channel_conv = nn.Sequential(
            PointWiseConv(in_channels=int(alpha * in_channels),
                          out_channels=out_channels,
                          groups=pointwiseconv_groups,
                          init_type=self.init_type),
            nn.BatchNorm2D(out_channels),
            nn.ReLU()
        )

    def _create_head(self,
                     in_channels,
                     hidden_features):
        self.head = Classifier_Head(in_channels=in_channels,
                                    embed_dims=self.embed_dims,
                                    hidden_features=hidden_features,
                                    num_classes=self.num_classes,
                                    dropout=self.dropout_rate,
                                    init_type=self.init_type)

    def _create_model(self,
                      use_dyrelu,
                      reduce,
                      dyrelu_k,
                      coefs,
                      consts,
                      norm,
                      act,
                      alpha,
                      config):
        # create stem: self.stem
        self._create_stem(in_channels=self.in_channels,
                          out_channels=config.MODEL.MF.STEM.OUT_CHANNELS,
                          kernel_size=config.MODEL.MF.STEM.KERNELS,
                          stride=config.MODEL.MF.STEM.STRIEDS,
                          padding=config.MODEL.MF.STEM.PADDINGS,
                          alpha=alpha)
        # create lite-bottleneck: self.bneck_lite
        self._create_lite_bneck(in_channels=config.MODEL.MF.LITE_BNECK.IN_CHANNEL,
                                hidden_channels=config.MODEL.MF.LITE_BNECK.HIDDEN_CHANNEL,
                                out_channels=config.MODEL.MF.LITE_BNECK.OUT_CHANNEL,
                                kernel_size=config.MODEL.MF.LITE_BNECK.KERNEL,
                                stride=config.MODEL.MF.LITE_BNECK.STRIED,
                                padding=config.MODEL.MF.LITE_BNECK.PADDING,
                                alpha=alpha,
                                pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
        # create mobileformer blocks: self.blocks
        self._create_mf_blocks(in_channel_list=config.MODEL.MF.BLOCK.IN_CHANNELS,
                               hidden_channel_list=config.MODEL.MF.BLOCK.HIDDEN_CHANNELS,
                               out_channel_list=config.MODEL.MF.BLOCK.OUT_CHANNELS,
                               kernel_list=config.MODEL.MF.BLOCK.KERNELS,
                               stride_list=config.MODEL.MF.BLOCK.STRIEDS,
                               padding_list=config.MODEL.MF.BLOCK.PADDINGS,
                               alpha=alpha,
                               use_dyrelu=use_dyrelu,
                               reduce=reduce,
                               dyrelu_k=dyrelu_k,
                               coefs=coefs,
                               consts=consts,
                               norm=norm,
                               act=act,
                               pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
        # create final toformer_bridge: self.toformer_bridge
        self._create_former_end_bridge(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
                                       norm=norm,
                                       alpha=alpha)
        # create channel 1x1 conv: self.channel_conv
        self._create_channel_conv(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
                                  out_channels=config.MODEL.MF.CHANNEL_CONV.OUT_CHANNEL,
                                  alpha=alpha,
                                  pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
        # create classifier head: self.head
        self._create_head(in_channels=config.MODEL.MF.HEAD.IN_CHANNEL,
                          hidden_features=config.MODEL.MF.HEAD.HIDDEN_FEATURE)

    def _to_batch_tokens(self, batch_size):
        # B, token_size, embed_dims
        return paddle.concat([self.tokens]*batch_size,  axis=0)

    # @paddle.jit.to_static
    def forward(self, inputs):
        B, _, _, _ = inputs.shape
        f = self.stem(inputs)
        # create batch tokens
        tokens = self._to_batch_tokens(B) # B, token_size, embed_dims
        f = self.bneck_lite(f, tokens)

        for b in self.blocks:
            f, tokens = b(f, tokens)

        tokens = self.end_toformer_bridge(f, tokens)
        f = self.channel_conv(f)
        output = self.head(f, tokens)

        return output

Using MobileFormer

Model parameters are loaded and configured from mobileformer_26m.yaml, relying on the parameter layout in config.py and the settings in the yaml file.

A brief note on the yaml parameters:

  • MF – configuration for the entire MobileFormer architecture

    • BLOCK: configuration for all MFBlocks, including input, hidden, and output channel counts

    • DYRELU: configuration for dynamic ReLU

By default the config loaded by get_config uses 1000 classes; one parameter must be added to the yaml to build a model with a custom number of classes.

Original:

MODEL:
    TYPE: MobileFormer
    NAME: MobileFormer_26M
    DROPPATH: 0.1
    DROPOUT: 0.1
    MLP_DROPOUT: 0.1
    ATTENTION_DROPOUT: 0.1

After adding (number of classes = 10):

MODEL:
    NUM_CLASSES: 10
    TYPE: MobileFormer
    NAME: MobileFormer_26M
    DROPPATH: 0.1
    DROPOUT: 0.1
    MLP_DROPOUT: 0.1
    ATTENTION_DROPOUT: 0.1
# yacs is required for the config parsing
!pip install yacs
import paddle

from config import get_config
from mobileformer import build_mformer as build_model

config = get_config('mobileformer_26m.yaml')
model = build_model(config)

test_data = paddle.rand((1, 3, 224, 224))
y_pred = model(test_data)

print('model output: ', y_pred.shape)
=> merge config from mobileformer_26m.yaml
model output:  [1, 10]

Experiment Notes

All experiments below use MobileFormer-26M, the smallest MobileFormer variant.

1. MNIST (99.3)

  • Batch size 256, 4 GPUs

  • Learning rate and other settings follow the config; no mixup or similar preprocessing, only random crop, resize, to-tensor, and normalize

  • Input images upscaled to 48

  • In the model, the Lite-BottleNeck stride is reduced to 1; everything else is unchanged

  • best_model.pdparams == 8.85M

# yacs is required for the config parsing
!pip install yacs
import paddle
from paddle import nn
from MNIST.mnist_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model

config = get_config('./MNIST/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./MNIST/best_model.pdparams'))

model = paddle.Model(model)
model.prepare(
    optimizer=None,
    loss=nn.CrossEntropyLoss(),
    metrics=paddle.metric.Accuracy(topk=(1, 5))
)

model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./MNIST/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.5501 - acc_top1: 0.9910 - acc_top5: 0.9996 - 237ms/step
step 20/40 - loss: 1.5416 - acc_top1: 0.9916 - acc_top5: 0.9996 - 227ms/step
step 30/40 - loss: 1.5380 - acc_top1: 0.9934 - acc_top5: 0.9997 - 227ms/step
step 40/40 - loss: 1.5374 - acc_top1: 0.9942 - acc_top5: 0.9998 - 221ms/step
Eval samples: 10000

{'loss': [1.5374461], 'acc_top1': 0.9942, 'acc_top5': 0.9998}

2. Cifar10 (89.6)

Accuracy could still improve; it was limited by a less-than-ideal batch size and related settings.

  • Batch size 256, 4 GPUs

  • Learning rate and other settings follow the config; mixup preprocessing is used, along with random crop, resize, to-tensor, and normalize

  • Input images upscaled to 64

  • In the model, the Lite-BottleNeck stride is reduced to 1; everything else is unchanged

  • best_model.pdparams == 8.85M

Other Notes

A follow-up will cover using MobileFormer in practice, plus a short tutorial on training custom datasets with the PaddleViT source code.

# yacs is required for the config parsing
!pip install yacs
import paddle
from paddle import nn
from Cifar10.cifar10_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model

config = get_config('./Cifar10/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./Cifar10/best_model.pdparams'))

model = paddle.Model(model)
model.prepare(
    optimizer=None,
    loss=nn.CrossEntropyLoss(),
    metrics=paddle.metric.Accuracy(topk=(1, 5))
)

model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./Cifar10/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.7457 - acc_top1: 0.8949 - acc_top5: 0.9926 - 246ms/step
step 20/40 - loss: 1.7452 - acc_top1: 0.8961 - acc_top5: 0.9906 - 243ms/step
step 30/40 - loss: 1.7186 - acc_top1: 0.8988 - acc_top5: 0.9926 - 244ms/step
step 40/40 - loss: 1.7148 - acc_top1: 0.8980 - acc_top5: 0.9933 - 242ms/step
Eval samples: 10000

{'loss': [1.7147579], 'acc_top1': 0.898, 'acc_top5': 0.9933}

For the related preprocessing, such as Mixup, see: BR-IDL/PaddleViT

Please follow PaddleViT, which offers many more ViT model implementations to learn from and use.

Discussion in the comments is welcome.

Name: 蔡敬辉

Education: fourth-year undergraduate

School: Southwest University of Science and Technology (西南科技大学)

Interests: competitions large and small, not limited to computer vision; feel free to follow if we share interests

Main directions: object detection, image segmentation, and image recognition

Contact: QQ 3020889729, WeChat cjh3020889729
