MobileFormer: An Efficient Parallel Architecture Bridging Global & Local Features
(Tested on MNIST and Cifar10.) A hands-on reproduction of MobileFormer, with a companion video tutorial that walks through the model's structural design and shows its results on several datasets.
The Pain Points of the Vision Transformer
Some implementation details were slightly adjusted; updated model code will later be merged into the PaddleViT repository. (The implementation matches the video; some normalization layers and parameters differ slightly.)
The structure was tuned repeatedly because the paper's code is not open source, so some parameters had to be pinned down by testing on different datasets (ImageNet not yet tested).
Datasets tested so far: MNIST, Cifar10, Cifar100.
Over the past year, large ViT models have appeared one after another: stronger capabilities, higher recognition accuracy, better detection results, and marked transferability, with new SOTA models emerging constantly.
Yet however well these ViT models perform, the performance usually comes tied to compute and parameter cost. Tied how? Simply put, most of today's top-performing ViT models exceed 1G FLOPs, which limits their real-time applicability.
Researchers have therefore explored ViT designs, both pure ViT and conv/ViT hybrids, meant to occupy the niche that the MobileNet series holds among convolutional architectures: carrying high-performing ViTs onto mobile devices.
Out of this motivation, lightweight ViT architectures such as MobileViT and MobileFormer emerged over the past year. This project takes MobileFormer as the subject of a hands-on paper reproduction and walks through the latest MobileFormer architecture design.
MobileFormer highlights

- A parallel hybrid architecture (conv + ViT); most earlier models are organized serially.
- Split-head independent attention projection (split into heads before projecting): compared with projecting q/k/v directly, this reduces the parameter count as the number of heads grows.
- A bridge structure that lets local and global information interact through attention, projecting only a small number of tokens; the feature map itself never undergoes a projection.
- A small number of tokens to embed/learn global information, cutting both compute and parameters.

About the split-head attention projection: it is introduced later, and was settled while aligning parameter counts. The paper only writes "multi-head attention", but a conventional multi-head projection would cost noticeably more parameters, hence this change.
Feel free to discuss the structural details in the comments.
How MobileFormer Exchanges Global and Local Information

- On the left, $X_i$ is the input: a set of feature maps. On the right, $Z_i$ is the input: a set of tokens.
- The left input $X_i$ is first passed into the yellow region on the right, fusing the feature map's local information into $Z_i$. This step is done via attention, but only the input tokens are projected, and only into the Query.
- In that step the feature map is not projected; it is used directly as Key and Value in the attention computation, and a final residual connection merges the local information into the tokens' global information. This is the Mobile->Former step, renamed ToFormer_Bridge here.
- After bridging through ToFormer_Bridge, the enriched $Z_i$ is passed into the Former structure, the green region: a plain Transformer block that extracts features and outputs $Z_{i+1}$.
- With $Z_{i+1}$ in hand as the stage's complete global-information tokens, it is passed back to the Mobile structure, the blue region; only now is $X_i$ fed into Mobile, realizing the fusion/interaction that adds global information to the local features.
- This interaction mainly works by handing the tokens to the dynamic ReLU, which generates dynamic parameters that filter the local information/features produced by the convolutions.
- After $X_i$ passes through Mobile, its information still lacks a truly global view, so one more attention computation against the global-information tokens is needed.
- At this point $Z_{i+1}$ is projected twice to obtain Key and Value, while the Mobile output $X_i$ keeps its original features and serves directly as the Query that guides the infusion of global information; a final residual performs the feature fusion. This is the Former->Mobile step, renamed ToMobile_Bridge here.
- Throughout this exchange the feature map never undergoes any projection, so its local information stays untouched and merges all the more cleanly into the tokens, while the tokens' global information is continually updated across many such blocks and fused back into the feature map's features.

All of the above constitutes one MobileFormer_Block.
In each bridge, ToFormer_Bridge and ToMobile_Bridge, it is always the tokens that are projected for attention, and both use the split-head independent projection mentioned earlier.
In short:
- The side projected into the query always filters the information of the side acting as key and value; for example, when the tokens are the query they filter local information, and a residual then fuses the selected information into the features. The sketch below previews this data flow.
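Before moving to the code, here is a minimal runnable sketch of the two bridges, single head, using raw paddle ops. All sizes and the random projection matrices are hypothetical, and the channel count C is kept equal to the token dimension D so the residuals line up; the real modules below add multiple heads, normalization, dropout and learned projections.
# a minimal sketch of the two bridges (hypothetical sizes; C == D for simplicity)
import paddle
B, C, H, W, M, D = 2, 16, 8, 8, 6, 16
fm = paddle.rand([B, C, H, W])                        # feature map X_i
tokens = paddle.rand([B, M, D])                       # tokens Z_i
x = fm.reshape([B, C, H * W]).transpose([0, 2, 1])    # B, L, C -- raw key/value, never projected
# Mobile->Former (ToFormer_Bridge): only the tokens are projected, into the query
q = paddle.matmul(tokens, paddle.rand([D, C]))        # stand-in for a learned projection
attn = paddle.nn.functional.softmax(paddle.matmul(q, x, transpose_y=True) * C ** -0.5, axis=-1)
tokens = tokens + paddle.matmul(attn, x)              # residual: local info flows into the tokens
# ... Former (a plain Transformer block) would update the tokens here ...
# Former->Mobile (ToMobile_Bridge): tokens are projected into key/value, the feature map is the raw query
k = paddle.matmul(tokens, paddle.rand([D, C]))
v = paddle.matmul(tokens, paddle.rand([D, C]))
attn = paddle.nn.functional.softmax(paddle.matmul(x, k, transpose_y=True) * C ** -0.5, axis=-1)
x = x + paddle.matmul(attn, v)                        # residual: global info flows into the features
fm = x.transpose([0, 2, 1]).reshape([B, C, H, W])
print(fm.shape)  # [2, 16, 8, 8]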
Reproducing the MobileFormer Model (Structure)
The walkthrough follows the MobileFormer294M structure; the other model sizes differ only in parameters.
The implementation is organized into the following files:
- baseconv.py: Stem + PW + DW + BottleNeck
- attention.py: MLP + Attention
- dyrelu.py: DyReLU
- droppath.py: DropPath
- mobileformer.py: Classifier_Head + Mobile + ToFormer_Bridge + Former + ToMobile_Bridge + MFBlock + MobileFormer + build_mformer
# import the basic dependencies
import paddle
from paddle import nn
import numpy as np
一、Stem: the stem layer
(The token part is created during final model assembly, so we skip it for now.)
Calling this the "stem" layer is a personal naming habit: it denotes the first convolution or feature-extraction module of a network, the idea being that features start flowing gradually into the model. A convolutional stem usually downsamples, and it should (in experiments, changing MobileFormer's first layer to stride 1 performed worse than stride-2 downsampling).
Structure:
- one 3x3 conv layer + BatchNorm2D + Hardswish
- the _conv_init helper produces the selected parameter-initialization method, which is wired in through weight_attr and friends
class Stem(nn.Layer):
"""Stem
"""
def __init__(self,
in_channels,
out_channels,
kernel_size=3,
stride=1,
padding=0,
act=nn.Hardswish,
init_type='kn'):
super(Stem, self).__init__(
name_scope="Stem")
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
self.bn = nn.BatchNorm2D(out_channels)
self.act = act()
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
x = self.bn(x)
x = self.act(x)
return x
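A quick shape check (hypothetical sizes; runs after the import cell above): with stride 2 the stem halves the spatial size.
stem = Stem(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = paddle.rand([1, 3, 224, 224])
print(stem(x).shape)  # [1, 8, 112, 112]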
二、Lite-BottleNeck implementation
MobileFormer follows the linear bottleneck designed in MobileNetV3, so we first implement a general BottleNeck and then obtain the lite variant by adjusting parameters.
First, the BottleNeck layout: 1x1 pw-conv + DY-ReLU activation + 3x3 dw-conv + DY-ReLU activation + 1x1 pw-conv
pw (pointwise): a 1x1 convolution that compresses and expands channels; the key knob controlling the bottleneck width (channels).
dw (depthwise): a 3x3 grouped convolution that performs the feature extraction.
DyReLU: dynamic ReLU; generates dynamic parameters from the tokens passed in, then selects features via max.
2.1 (PW) pointwise convolution
To keep things decoupled, only the convolution is built here, without normalization or activation.
Structure:
- 1x1 conv, with grouped-convolution support
Group support exists because in the smallest model every 1x1 pointwise conv is implemented with 4 groups to reduce parameters.
Since the 1x1 conv is fixed and used only for channel control, kernel size, stride, etc. are not exposed as constructor arguments.
class PointWiseConv(nn.Layer):
"""PointWise 1x1Conv -- support group conv
Params Info:
groups: the number of groups
"""
def __init__(self,
in_channels,
out_channels,
groups=1,
init_type='kn'):
super(PointWiseConv, self).__init__(
name_scope="PointWiseConv")
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=out_channels,
kernel_size=1,
stride=1,
padding=0,
groups=groups,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
return x
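A small sanity check of the grouping's saving (hypothetical sizes; np comes from the import cell above): with 4 groups the 1x1 weight matrix shrinks by a factor of 4, while the bias count stays the same.
pw = PointWiseConv(in_channels=16, out_channels=32, groups=1)
pw_g4 = PointWiseConv(in_channels=16, out_channels=32, groups=4)
count = lambda m: sum(int(np.prod(p.shape)) for p in m.parameters())
print(count(pw), count(pw_g4))  # 544 vs 160: 16*32 + 32 vs 16*32/4 + 32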
2.2 (DW) depthwise convolution
To keep things decoupled, only the convolution is built here, without normalization or activation.
Structure: a 3x3 grouped conv, or the lite 3x3 grouped conv
The group count always equals the input channel count: within the bottleneck, DW only extracts features and the channel count stays unchanged.
Kernel size and the like remain configurable in case adjustments are needed, but no output-channel argument is exposed, since the output always keeps the input channel count.
The lite 3x3 grouped conv is built from two conv layers, with kernels [3, 1] and [1, 3].
In particular, for the lite structure the given stride and padding must also be split into two different values, one for each of the two conv layers.
class DepthWiseConv(nn.Layer):
"""DepthWise Conv -- support lite weight dw_conv
Params Info:
is_lite: use lite weight dw_conv
"""
def __init__(self,
in_channels,
kernel_size=3,
stride=1,
padding=0,
is_lite=False,
init_type='kn'):
super(DepthWiseConv, self).__init__(
name_scope="DepthWiseConv")
self.is_lite = is_lite
conv_weight_attr, conv_bias_attr = self._conv_init(init_type=init_type)
if is_lite is False:
self.conv = nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
groups=in_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
else: # lite 结构
self.conv = nn.Sequential(
# kernel_size -- [3, 1]
nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
kernel_size=[kernel_size, 1],
stride=[stride, 1],
padding=[padding, 0],
groups=in_channels,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr),
nn.BatchNorm2D(in_channels),
# kernel_size -- [1, 3]
nn.Conv2D(in_channels=in_channels,
out_channels=in_channels,
kernel_size=[1, kernel_size],
stride=[1, stride],
padding=[0, padding],
groups=in_channels,
weight_attr=conv_weight_attr,
bias_attr=conv_bias_attr)
)
def _conv_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.conv(inputs)
return x
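A quick shape check (hypothetical sizes): both variants keep the channel count, and the lite variant reaches the same output shape through the factorized [3, 1] + [1, 3] pair.
dw = DepthWiseConv(in_channels=8, kernel_size=3, stride=2, padding=1)
dw_lite = DepthWiseConv(in_channels=8, kernel_size=3, stride=2, padding=1, is_lite=True)
x = paddle.rand([1, 8, 32, 32])
print(dw(x).shape, dw_lite(x).shape)  # [1, 8, 16, 16] for both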
2.3 Dynamic ReLU implementation
We implement the type-B dynamic ReLU, which applies the ReLU-like activation per channel, including the selection of channel features.
1. Dynamic ReLU needs the parameter-group size, i.e. the value K, fixed in advance; K determines how many a and b parameters are generated:
- the left-hand side of the equation is the resulting dynamic parameter groups generated from the initial values: $a_k$ and $b_k$
- the right-hand side (note that each channel generates k values of a and k of b):
  - $\lambda_a$ is the coefficient for the dynamic group $a_k$, and $\lambda_b$ the coefficient for $b_k$
  - $\alpha_k$ is the constant term for $a_k$, and $\beta_k$ the constant term for $b_k$
  - the $\Delta$ terms are the projection of the input; the code makes this concrete
2. With the parameters above, the parameter groups follow.
3. This yields the a and b parameters on each channel: every channel gets k dynamic coefficients a and k dynamic biases b.
4. The final activation output is obtained through max, with a as coefficient and b as bias/constant term (see the reconstructed formulas below):
max runs over the last dimension, i.e. over the 2*k values dynamically generated per channel.
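For reference, the type-B per-channel definition (reconstructed here from the DyReLU paper as an assumption matching the code below, where $x_c$ is the feature value on channel $c$ and the $\Delta$ terms come from projecting the first token):

$$a^k_c(x) = \alpha^k + \lambda_a \, \Delta a^k_c(x), \qquad b^k_c(x) = \beta^k + \lambda_b \, \Delta b^k_c(x)$$

$$y_c = \max_{1 \le k \le K}\left(a^k_c(x)\, x_c + b^k_c(x)\right)$$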
Special note: the paper mentions replacing the linear projection inside this structure with an MLP, so the MLP has to exist first; the MLP implementation therefore comes next, followed by the dynamic ReLU.
2.3.1 MLP implementation
class MLP(nn.Layer):
"""Multi Layer Perceptron
Params Info:
in_features: input token feature size
out_features: output token feature size
mlp_ratio: the scale of hidden feature size
mlp_dropout_rate: the dropout rate of mlp layer output
"""
def __init__(self,
in_features,
out_features=None,
mlp_ratio=2,
mlp_dropout_rate=0.,
act=nn.GELU,
init_type='kn'):
super(MLP, self).__init__(name_scope="MLP")
self.out_features = in_features if out_features is None else \
out_features
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.fc1 = nn.Linear(in_features=in_features,
out_features=int(mlp_ratio*in_features),
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.fc2 = nn.Linear(in_features=int(mlp_ratio*in_features),
out_features=self.out_features,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.act = act()
self.dropout = nn.Dropout(mlp_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, inputs):
x = self.fc1(inputs)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
x = self.dropout(x)
return x
2.3.2 DyReLU实现
部分参数说明:
-
k=2: 动态参数基个数
-
coefs=[1.0, 0.5]: 初始系数值
-
consts=[1.0, 0.0]: 初始常数项
-
self.mid_channels: 总的参数个数,每个通道生成2xk个结果,所以需要映射出2xkxchannels的动态参数,然后计算出动态参数组
- 见上面的
1.
- 见上面的
-
self.coef: 利用初始系数值进行广播,得到[ a 1 a_1 a1, a 2 a_2 a2, b 1 b_1 b1, b 2 b_2 b2]的
系数
,由于 a k a_k ak系数与 b k b_k bk系数总是各自唯一的,所以将初始值依次复制 k k k份即可 -
self.const: 利用初始化常数项,依次生成 a k a_k ak与 b k b_k bk的
常数项
,这里将输入的consts[0]作为 a 1 a_1 a1的常数项,其余皆为consts[2],但仍然需要保持shape为2xk
-
self.project: 将输入映射到指定的参数数量——注意,这里由于是模型中表示全局信息的Token生成动态参数,因此输入维度为
embed_dims
class DyReLU(nn.Layer):
"""Dynamic ReLU activation function -- use one MLP
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
"""
def __init__(self,
in_channels,
embed_dims,
k=2, # a_1, a_2 coef, b_1, b_2 bias
coefs=[1.0, 0.5], # coef init value
consts=[1.0, 0.0], # const init value
reduce=4,
init_type='kn'):
super(DyReLU, self).__init__(
name_scope="DyReLU")
self.embed_dims = embed_dims
self.in_channels = in_channels
self.k = k
self.mid_channels = 2*k*in_channels
# 4 values
# a_k = alpha_k + coef_k*x, 2
        # b_k = beta_k + coef_k*x, 2
self.coef = paddle.to_tensor([coefs[0]]*k + [coefs[1]]*k)
self.const = paddle.to_tensor([consts[0]] + [consts[1]]*(2*k-1))
self.project = nn.Sequential(
# nn.LayerNorm(embed_dims),
MLP(in_features=embed_dims,
out_features=self.mid_channels,
mlp_ratio=1/reduce,
act=nn.GELU,
init_type=init_type),
# nn.BatchNorm(self.mid_channels)
nn.LayerNorm(self.mid_channels)
)
def forward(self, feature_map, tokens):
B, M, D = tokens.shape
dy_params = self.project(tokens[:, 0]) # B, mid_channels
        # B, IN_CHANNELS, 2*k: raw per-channel dynamic parameters for the a and b groups
dy_params = dy_params.reshape(shape=[B, self.in_channels, 2*self.k])
        # B, IN_CHANNELS, 2*k -- a_1, a_2, b_1, b_2: the final per-channel dynamic parameter groups
dy_init_params = dy_params * self.coef + self.const
        f = feature_map.transpose(perm=[2, 3, 0, 1]).unsqueeze(axis=-1) # H, W, B, C, 1: rearrange so f(x) runs along channels and broadcasts with the params above
# output shape: H, W, B, C, k
output = f * dy_init_params[:, :, :self.k] + dy_init_params[:, :, self.k:]
output = paddle.max(output, axis=-1) # H, W, B, C
        output = output.transpose(perm=[2, 3, 0, 1]) # B, C, H, W: restore the original layout
return output
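A quick shape check (hypothetical sizes): the first token drives the per-channel dynamic parameters, and the output keeps the feature-map shape.
act = DyReLU(in_channels=8, embed_dims=16, k=2)
fm = paddle.rand([2, 8, 14, 14])
tokens = paddle.rand([2, 3, 16])
print(act(fm, tokens).shape)  # [2, 8, 14, 14]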
2.4 Lite-BottleNeck implementation
is_lite controls whether the lite structure is enabled, i.e. whether the Lite-DW layer is used when building the network.
use_dyrelu controls whether dynamic ReLU is used as the activation.
You can see that the BottleNeck is essentially the core of the Mobile structure: it basically matches the Mobile part of the bridge diagram above, though it is not quite the whole story yet.
Note that because the tokens are needed to produce the dynamic parameters, forward takes two inputs, the feature map and then the tokens.
class BottleNeck(nn.Layer):
"""BottleNeck
Params Info:
groups: the number of groups, by 1x1conv
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
is_lite: whether use lite dw_conv
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
groups=1,
kernel_size=3,
stride=1,
padding=0,
embed_dims=None,
k=2, # the number of dyrelu-params
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
is_lite=False,
init_type='kn'):
super(BottleNeck, self).__init__(
name_scope="BottleNeck")
self.is_lite = is_lite
self.use_dyrelu = use_dyrelu
assert use_dyrelu==False or (use_dyrelu==True and embed_dims is not None), \
"Error: Please make sure while the use_dyrelu==True,"+\
" embed_dims(now:{0})>0.".format(embed_dims)
self.in_pw = PointWiseConv(in_channels=in_channels,
out_channels=hidden_channels,
groups=groups,
init_type=init_type)
self.in_pw_bn = nn.BatchNorm2D(hidden_channels)
self.dw = DepthWiseConv(in_channels=hidden_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
is_lite=is_lite,
init_type=init_type)
self.dw_bn = nn.BatchNorm2D(hidden_channels)
self.out_pw = PointWiseConv(in_channels=hidden_channels,
out_channels=out_channels,
groups=groups,
init_type=init_type)
self.out_pw_bn = nn.BatchNorm2D(out_channels)
if use_dyrelu == False:
self.act = nn.ReLU()
else:
self.act = DyReLU(in_channels=hidden_channels,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
init_type=init_type)
    def forward(self, feature_map, tokens):
        x = self.in_pw(feature_map)
        x = self.in_pw_bn(x)
        # dynamic ReLU additionally needs the tokens; otherwise apply the plain ReLU
        x = self.act(x, tokens) if self.use_dyrelu else self.act(x)
        x = self.dw(x)
        x = self.dw_bn(x)
        x = self.act(x, tokens) if self.use_dyrelu else self.act(x)
        x = self.out_pw(x)
        x = self.out_pw_bn(x)
        return x
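A quick shape check (hypothetical sizes) with dynamic ReLU enabled; stride 1 and padding 1 keep the spatial size.
bneck = BottleNeck(in_channels=8, hidden_channels=16, out_channels=8,
                   kernel_size=3, stride=1, padding=1,
                   embed_dims=16, use_dyrelu=True)
fm = paddle.rand([2, 8, 14, 14])
tokens = paddle.rand([2, 3, 16])
print(bneck(fm, tokens).shape)  # [2, 8, 14, 14]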
三、Mobile implementation
The BottleNeck above already implements most of the Mobile structure, but in practice downsampling is occasionally needed, so a downsampling 3x3 DW conv is added in front of the BottleNeck.
Structure:
- if downsample: 3x3 DW + BN + activation (whether an activation belongs here is still under consideration: a classic BottleNeck outputs low-dimensional data, and MobileNetV2's work showed that nonlinear activation in low dimensions loses information; since the paper leaves many details unstated, a standard conv+bn+act downsampling block is used for now)
- BottleNeck: never the lite variant here, but with dynamic ReLU enabled
Inputs are the feature map and the tokens.
class Mobile(nn.Layer):
"""Mobile Sub-block
Params Info:
in_channels: input feature map channels
hidden_channels: the dw layer hidden channel size
groups: the number of groups, by 1x1conv
embed_dims: input token embed_dims
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
kernel_size=3,
stride=1,
padding=0,
groups=1,
embed_dims=None,
k=2,
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
init_type='kn'):
super(Mobile, self).__init__(
name_scope="Mobile")
self.add_dw = True if stride==2 else False
self.bneck = BottleNeck(in_channels=in_channels,
hidden_channels=hidden_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=1,
padding=1,
groups=groups,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
init_type=init_type)
if self.add_dw: # stride==2
self.downsample_dw = nn.Sequential(
DepthWiseConv(in_channels=in_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
init_type=init_type),
nn.BatchNorm2D(in_channels),
nn.ReLU() # maybe other act
)
def forward(self, feature_map, tokens):
if self.add_dw:
feature_map = self.downsample_dw(feature_map)
        # dynamic ReLU needs the tokens
x = self.bneck(feature_map, tokens)
return x
四、ToFormer_Bridge implementation
This structure fuses the feature map's local information into the tokens:
- the tokens are projected into the query, while the feature map serves directly as key and value (only shape rearrangement is applied), saving the projection cost of regular attention
- the query selectively absorbs the feature map's local information, producing the attention result
- finally, a residual connection with the original input tokens fuses the local and global information
Note the split-head independent projection here: the tokens are first split into heads, each head's query is then projected, and the results are concatenated back into the usual multi-head layout; every other step is standard attention computation.
Pros and cons of the split-head independent projection (a quick parameter comparison follows this list):
- Pro: the projection's parameter count and compute shrink as the head count grows, since a fully connected layer costs the square of the feature size (when input and output sizes are equal)
- Con: the feature-fusing projection becomes partial; each head's projection is independent of the others
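The saving is easy to verify with a little arithmetic: a full projection costs $D \times D$ weights, while $h$ independent per-head projections cost $h \cdot (D/h)^2 = D^2/h$.
# weight count of a full projection vs. the split-head projection (hypothetical D)
D = 128
for h in [1, 2, 4, 8]:
    full, split = D * D, h * (D // h) ** 2
    print(f"heads={h}: full={full}, split={split}")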
To mirror the attention of a standard transformer and keep the bridge modules decoupled, the LN is applied as the first step of forward.
Inputs are the feature map and the tokens.
class ToFormer_Bridge(nn.Layer):
"""Mobile to Former Bridge
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
"""
def __init__(self,
embed_dims,
in_channels,
num_head=1,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
qkv_bias=True,
norm=nn.LayerNorm,
init_type='kn'):
super(ToFormer_Bridge, self).__init__(
name_scope="ToFormer_Bridge")
self.num_head = num_head
self.head_dims = in_channels // num_head
self.scale = self.head_dims ** -0.5
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
        # input normalization
self.input_norm = norm(embed_dims)
        # split heads, then project: split-head independent projection layers
self.heads_q_proj = []
for i in range(num_head): # n linear
self.heads_q_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_q_proj = nn.LayerList(self.heads_q_proj)
self.output = nn.Linear(in_features=self.num_head*self.head_dims,
out_features=embed_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.droppath = DropPath(droppath_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, feature_map, tokens):
B, C, H, W = feature_map.shape
assert C % self.num_head == 0, \
"Erorr: Please make sure feature_map.channels % "+\
"num_head == 0(now:{0}).".format(C % self.num_head)
fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d
B, M, D = tokens.shape
h_token = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
h_token = h_token.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
return fm, h_token
def _multi_head_q_forward(self, token, B, M):
q_list = []
for i in range(self.num_head):
q_list.append(
# B, 1, M, head_dims
self.heads_q_proj[i](token[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
q = paddle.concat(q_list, axis=1) # B, num_head, M, head_dims
return q
def forward(self, feature_map, tokens):
B, M, D = tokens.shape
tokens_ = self.input_norm(tokens)
# fm(key/value) to shape: B, n_h, L, h_d
# token to shape: B, n_h, M, D // n_h
        fm, token = self.transfer_shape(feature_map, tokens_) # split into heads first
        q = self._multi_head_q_forward(token, B, M) # then project each head's query
# attention distribution
attn = paddle.matmul(q, fm, transpose_y=True) # B, n_h, M, L
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
# attention result
z = paddle.matmul(attn, fm) # B, n_h, M, h_d
z = z.transpose(perm=[0, 2, 1, 3])
z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
z = self.output(z) # B, M, D
z = self.dropout(z)
z = self.droppath(z)
z = z + tokens
return z
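A quick shape check (hypothetical sizes; note this cell also needs the DropPath class from section 5.2 below, matching the file split listed earlier): enriched tokens come out with the token shape unchanged.
bridge = ToFormer_Bridge(embed_dims=16, in_channels=16, num_head=2)
fm = paddle.rand([2, 16, 7, 7])
tokens = paddle.rand([2, 3, 16])
print(bridge(fm, tokens).shape)  # [2, 3, 16]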
五、Former implementation
A plain Transformer structure.
Structure: LN + ATTENTION + LN + MLP
5.1 Attention implementation
class Attention(nn.Layer):
"""Multi Head Attention
Params Info:
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
embed_dims,
num_head=1,
dropout_rate=0.,
attn_dropout_rate=0.,
qkv_bias=True,
init_type='kn'):
super(Attention, self).__init__(
name_scope="Attention")
self.num_head = num_head
self.head_dims = embed_dims // num_head
self.scale = self.head_dims ** -0.5
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.qkv_proj = nn.Linear(in_features=embed_dims,
out_features=3*self.num_head*self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
self.output = nn.Linear(in_features=self.num_head*self.head_dims,
out_features=embed_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, q, k, v):
B, M, _ = q.shape
q = q.reshape(shape=[B, M, self.num_head, self.head_dims])
q = q.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
k = k.reshape(shape=[B, M, self.num_head, self.head_dims])
k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
v = v.reshape(shape=[B, M, self.num_head, self.head_dims])
v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d
return q, k, v
def forward(self, inputs):
B, M, D = inputs.shape
assert D % self.num_head == 0, \
"Erorr: Please make sure Token.D % "+\
"num_head == 0(now:{0}).".format(D % self.num_head)
qkv= self.qkv_proj(inputs)
q, k, v = qkv.chunk(3, axis=-1)
# B, n_h, M, h_d
q, k, v = self.transfer_shape(q, k, v)
attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, M, M
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
z = paddle.matmul(attn, v) # B, n_h, M, h_d
z = z.transpose(perm=[0, 2, 1, 3]) # B, M, n_h, h_d
z = z.reshape(shape=[B, M, self.num_head*self.head_dims])
z = self.output(z)
        z = self.dropout(z)
return z
5.2 DropPath implementation: multi-branch dropout
Randomly zeroes whole samples along the batch dimension.
class DropPath(nn.Layer):
"""Multi-branch dropout layer -- Along the axis of Batch
Params Info:
p: droppath rate
"""
def __init__(self,
p=0.):
super(DropPath, self).__init__(
name_scope="DropPath")
self.p = p
def forward(self, inputs):
if self.p > 0. and self.training:
keep_p = np.asarray([1 - self.p], dtype='float32')
keep_p = paddle.to_tensor(keep_p)
# B, 1, 1....
shape = [inputs.shape[0]] + [1] * (inputs.ndim-1)
random_dr = paddle.rand(shape=shape, dtype='float32')
random_sample = paddle.add(keep_p, random_dr).floor() # floor to int--B
output = paddle.divide(inputs, keep_p) * random_sample
return output
return inputs
5.3 Former implementation
class Former(nn.Layer):
"""Former Sub-block
Params Info:
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
embed_dims,
num_head=1,
mlp_ratio=2,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
mlp_dropout_rate=0.,
norm=nn.LayerNorm,
act=nn.GELU,
qkv_bias=True,
init_type='kn'):
super(Former, self).__init__(name_scope="Former")
self.attn = Attention(embed_dims=embed_dims,
num_head=num_head,
dropout_rate=dropout_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
self.attn_ln = norm(embed_dims)
self.attn_droppath = DropPath(droppath_rate)
self.mlp = MLP(in_features=embed_dims,
mlp_ratio=mlp_ratio,
mlp_dropout_rate=mlp_dropout_rate,
act=act,
init_type=init_type)
self.mlp_ln = norm(embed_dims)
self.mlp_droppath = DropPath(droppath_rate)
def forward(self, inputs):
res = inputs
x = self.attn_ln(inputs)
x = self.attn(x)
x = self.attn_droppath(x)
x = x + res
res = x
x = self.mlp_ln(x)
x = self.mlp(x)
x = self.mlp_droppath(x)
x = x + res
return x
六、ToMobile_Bridge implementation
The last structure in the MobileFormer block.
It takes the current stage's token output as the bridge's data source:
- the tokens are projected into Key and Value, while the feature map output by Mobile acts as the raw Query
- the query inspects the composition of the values and extracts the global information that needs to be added
- a residual connection fuses the extracted global information with the original local information
Each part mirrors ToFormer_Bridge: projection is still the tokens' job, the query always pulls out the information it needs, and a residual fuses the features.
All projections use the split-head independent projection.
To mirror the attention of a standard transformer and keep the bridge modules decoupled, the LN is applied as the first step of forward.
Inputs are the feature map and the tokens.
class ToMobile_Bridge(nn.Layer):
"""Former to Mobile Bridge
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
dropout_rate: the dropout rate of attention result
attn_dropout_rate: the dropout rate of attention distribution
"""
def __init__(self,
embed_dims,
in_channels,
num_head=1,
dropout_rate=0.,
droppath_rate=0.0,
attn_dropout_rate=0.,
qkv_bias=True,
norm=nn.LayerNorm,
init_type='kn'):
super(ToMobile_Bridge, self).__init__(
name_scope="ToMobile_Bridge")
self.num_head = num_head
self.head_dims = in_channels // num_head
self.scale = self.head_dims ** -0.5
self.input_token_norm = norm(embed_dims)
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.heads_k_proj = []
self.heads_v_proj = []
for i in range(num_head): # n linear
self.heads_k_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_v_proj.append(
nn.Linear(in_features=embed_dims // num_head,
out_features=self.head_dims,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr if qkv_bias else False)
)
self.heads_k_proj = nn.LayerList(self.heads_k_proj)
self.heads_v_proj = nn.LayerList(self.heads_v_proj)
self.softmax = nn.Softmax()
self.dropout = nn.Dropout(dropout_rate)
self.droppath = DropPath(droppath_rate)
self.attn_dropout= nn.Dropout(attn_dropout_rate)
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def transfer_shape(self, feature_map, tokens):
B, C, H, W = feature_map.shape
assert C % self.num_head == 0, \
"Erorr: Please make sure feature_map.channels % "+\
"num_head == 0(now:{0}).".format(C % self.num_head)
fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L
fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims
fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims])
fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d
B, M, D = tokens.shape
k = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
v = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head])
v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h
return fm, k, v
def _multi_head_kv_forward(self, k_, v_, B, M):
k_list = []
v_list = []
for i in range(self.num_head):
k_list.append(
# B, 1, M, head_dims
self.heads_k_proj[i](k_[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
v_list.append(
# B, 1, M, head_dims
self.heads_v_proj[i](v_[:, i, :, :]).reshape(
shape=[B, 1, M, self.head_dims])
)
k = paddle.concat(k_list, axis=1) # B, num_head, M, head_dims
v = paddle.concat(v_list, axis=1) # B, num_head, M, head_dims
return k, v
def forward(self, feature_map, tokens):
B, C, H, W = feature_map.shape
B, M, D = tokens.shape
tokens = self.input_token_norm(tokens)
# fm(q) to shape: B, n_h, L, h_d
# k/v to shape: B, n_h, M, D // n_h
q, k_, v_ = self.transfer_shape(feature_map, tokens)
k, v = self._multi_head_kv_forward(k_, v_, B, M)
# attention distribution
attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, L, M
attn = attn * self.scale
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
# attention result
z = paddle.matmul(attn, v) # B, n_h, L, h_d
z = z.transpose(perm=[0, 1, 3, 2]) # B, n_h, h_d, L
# B, n_h*h_d, H, W
z = z.reshape(shape=[B, self.num_head*self.head_dims, H, W])
z = self.dropout(z)
z = self.droppath(z)
z = z + feature_map
return z
七、MFBlock implementation
Assembling the MobileFormer building units above gives the MobileFormer block, MFBlock for short.
Structure:
- ToFormer_Bridge
- Former
- Mobile
- ToMobile_Bridge
Inputs are the feature map and the tokens.
class MFBlock(nn.Layer):
"""MobileFormer Basic Block
Params Info:
in_channels: the number of input feature map channel
hidden_channels: the number of hidden(dw_conv) feature map channel
out_channels: the number of output feature map channel
embed_dims: input token embed_dims
num_head: the number of head is in multi head attention
groups: the number of groups in 1x1 conv
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
qkv_bias: whether use the bias in qkv matrix
"""
def __init__(self,
in_channels,
hidden_channels,
out_channels,
embed_dims,
kernel_size=3,
stride=1,
padding=0,
groups=1,
k=2,
coefs=[1.0, 0.5],
consts=[1.0, 0.0],
reduce=4,
use_dyrelu=False,
num_head=1,
mlp_ratio=2,
dropout_rate=0.,
droppath_rate=0.,
attn_dropout_rate=0.,
mlp_dropout_rate=0.,
norm=nn.LayerNorm,
act=nn.GELU,
qkv_bias=True,
init_type='kn'):
super(MFBlock, self).__init__(
name_scope="MFBlock")
self.mobile = Mobile(in_channels=in_channels,
hidden_channels=hidden_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
groups=groups,
embed_dims=embed_dims,
k=k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
init_type=init_type)
self.toformer_bridge = ToFormer_Bridge(embed_dims=embed_dims,
in_channels=in_channels,
num_head=num_head,
dropout_rate=dropout_rate,
droppath_rate=droppath_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
self.former = Former(embed_dims=embed_dims,
num_head=num_head,
mlp_ratio=mlp_ratio,
                             dropout_rate=dropout_rate,
mlp_dropout_rate=mlp_dropout_rate,
attn_dropout_rate=attn_dropout_rate,
droppath_rate=droppath_rate,
qkv_bias=qkv_bias,
norm=norm,
act=act,
init_type=init_type)
self.tomobile_bridge = ToMobile_Bridge(in_channels=out_channels,
embed_dims=embed_dims,
num_head=num_head,
dropout_rate=dropout_rate,
droppath_rate=droppath_rate,
attn_dropout_rate=attn_dropout_rate,
qkv_bias=qkv_bias,
init_type=init_type)
def forward(self, feature_map, tokens):
z_h = self.toformer_bridge(feature_map, tokens)
z_out = self.former(z_h)
f_h = self.mobile(feature_map, z_out)
f_out = self.tomobile_bridge(f_h, z_out)
return f_out, z_out
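A quick end-to-end shape check for one block (hypothetical sizes): with stride 1 both the feature map and the tokens keep their shapes.
block = MFBlock(in_channels=16, hidden_channels=32, out_channels=16,
                embed_dims=16, kernel_size=3, stride=1, padding=1,
                num_head=2, use_dyrelu=True)
fm = paddle.rand([2, 16, 14, 14])
tokens = paddle.rand([2, 3, 16])
f_out, z_out = block(fm, tokens)
print(f_out.shape, z_out.shape)  # [2, 16, 14, 14] [2, 3, 16]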
八、Classifier head implementation
For classification, the first token is concatenated onto the pooled feature map, and the result feeds the classifier.
class Classifier_Head(nn.Layer):
"""Classifier Head
Params Info:
in_channels: input feature map channels
embed_dims: input token embed_dims
hidden_features: the fc layer hidden feature size
num_classes: the number of classes
"""
def __init__(self,
in_channels,
embed_dims,
hidden_features,
num_classes=1000,
dropout=0.0,
act=nn.Hardswish,
init_type='kn'):
super(Classifier_Head, self).__init__(
name_scope="Classifier_Head")
linear_weight_attr, linear_bias_attr = self._linear_init(init_type=init_type)
self.avg_pool = nn.AdaptiveAvgPool2D(output_size=1)
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(in_features=in_channels+embed_dims,
out_features=hidden_features,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.dropout = nn.Dropout(dropout)
self.fc2 = nn.Linear(in_features=hidden_features,
out_features=num_classes,
weight_attr=linear_weight_attr,
bias_attr=linear_bias_attr)
self.act = act()
self.softmax = nn.Softmax()
def _linear_init(self, init_type='kn'):
if init_type == 'xu':
weight_attr = nn.initializer.XavierUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'ku':
weight_attr = nn.initializer.KaimingUniform()
bias_attr = nn.initializer.Constant(value=0.0)
elif init_type == 'kn':
weight_attr = nn.initializer.KaimingNormal()
bias_attr = nn.initializer.Constant(value=0.0)
return weight_attr, bias_attr
def forward(self, feature_map, tokens):
x = self.avg_pool(feature_map) # B, C, 1, 1
x = self.flatten(x) # B, C
        z = tokens[:, 0] # B, D
x = paddle.concat([x, z], axis=-1)
x = self.fc1(x)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
if self.training:
return x
return self.softmax(x)
Assembling the MobileFormer model
Assembling all the structures above yields the MobileFormer architecture, here following the MobileFormer294M layout.
Because there are many layers and the parameters are complex, they are passed in via a yaml file together with config.py; the rest of the assembly proceeds normally.
Structure:
- Stem -- the stem layer
- Lite-BottleNeck -- lightweight bottleneck
- Mobile-Former Block -- the basic block
- End_ToFormer_Bridge -- final global-information enhancement before output
- Channel Conv -- channel expansion before the final classification
- Classifier Head -- classification head
Builder methods:
- _create_token: creates the learnable tokens
- _create_stem: creates the Stem layer
- _create_lite_bneck: creates the Lite-BottleNeck layer
- _create_mf_blocks: creates the MobileFormer block layers
- _create_former_end_bridge: creates the end ToFormer bridge
- _create_channel_conv: creates the channel conv layer
- _create_head: creates the head
- _create_model: assembles the network
Only the assembly class is shown here; the build_mformer function that builds a model from a config is defined at the end of mobileformer.py.
class MobileFormer(nn.Layer):
"""MobileFormer
Params Info:
num_classes: the number of classes
in_channels: the number of input feature map channel
tokens: the shape of former token
num_head: the number of head is in multi head attention
groups: the number of groups in 1x1 conv
k: the number of parameters is in Dynamic ReLU
coefs: the init value of coefficient parameters
consts: the init value of constant parameters
reduce: the mlp hidden scale,
means 1/reduce = mlp_ratio
use_dyrelu: whether use dyrelu
mlp_ratio: the scale of hidden feature size
dropout_rate: the dropout rate of attention result
droppath_rate: the droppath rate of attention output
attn_dropout_rate: the dropout rate of attention distribution
mlp_dropout_rate: the dropout rate of mlp layer output
alpha: the scale of model size
qkv_bias: whether use the bias in qkv matrix
config: total model config
init_type: init params kind
"""
def __init__(self, num_classes=1000, in_channels=3,
tokens=[3, 128], num_head=4, mlp_ratio=2,
use_dyrelu=True, k=2, reduce=4.0,
coefs=[1.0, 0.5], consts=[1.0, 0.0],
dropout_rate=0.0, droppath_rate=0.0,
attn_dropout_rate=0.0, mlp_dropout_rate=0.0,
norm=nn.LayerNorm, act=nn.GELU,
alpha=1.0, qkv_bias=True,
config=None, init_type='kn'):
super(MobileFormer, self).__init__()
self.num_token, self.embed_dims = tokens[0], tokens[1]
self.num_head = num_head
self.num_classes = num_classes
self.in_channels = in_channels
self.mlp_ratio = mlp_ratio
self.alpha = alpha
self.qkv_bias = qkv_bias
self.dropout_rate = dropout_rate
self.droppath_rate = droppath_rate
self.attn_dropout_rate = attn_dropout_rate
self.mlp_dropout_rate = mlp_dropout_rate
self.init_type = init_type
assert init_type in ['xu', 'ku', 'kn'], \
"Error: Please choice the init type in ['xu', 'ku', 'kn']"+\
", but now it is {0}.".format(init_type)
assert config is not None, \
"Error: Please enter the config(now: {0})".format(config)+\
" in the __init__."
# create learnable tokens: self.tokens
self._create_token(num_token=self.num_token,
embed_dims=self.embed_dims)
# create total model
self._create_model(use_dyrelu=use_dyrelu,
reduce=reduce, dyrelu_k=k,
coefs=coefs, consts=consts,
alpha=alpha, norm=norm, act=act,
config=config)
def _create_token(self, num_token, embed_dims):
# B(1), token_size, embed_dims
shape = [1] + [num_token, embed_dims]
self.tokens = self.create_parameter(shape=shape, dtype='float32')
def _create_stem(self,
in_channels,
out_channels,
kernel_size,
stride, padding,
alpha):
self.stem = Stem(in_channels=in_channels,
out_channels=int(alpha * out_channels),
kernel_size=kernel_size,
stride=stride,
padding=padding,
init_type=self.init_type)
def _create_lite_bneck(self,
in_channels,
hidden_channels,
out_channels,
kernel_size,
stride,
padding,
alpha,
pointwiseconv_groups):
self.bneck_lite = BottleNeck(in_channels=int(alpha * in_channels),
hidden_channels=int(alpha * hidden_channels),
out_channels=int(alpha * out_channels),
groups=pointwiseconv_groups,
kernel_size=kernel_size,
stride=stride,
padding=padding,
use_dyrelu=False,
is_lite=True,
init_type=self.init_type)
def _create_mf_blocks(self,
in_channel_list,
hidden_channel_list,
out_channel_list,
kernel_list,
stride_list,
padding_list,
alpha,
use_dyrelu,
reduce,
dyrelu_k,
coefs,
consts,
norm,
act,
pointwiseconv_groups):
self.blocks = []
for i in range(0, len(in_channel_list)):
self.blocks.append(
MFBlock(
in_channels=int(alpha * in_channel_list[i]),
hidden_channels=int(alpha * hidden_channel_list[i]),
out_channels=int(alpha * out_channel_list[i]),
embed_dims=self.embed_dims,
kernel_size=kernel_list[i],
stride=stride_list[i],
padding=padding_list[i],
groups=pointwiseconv_groups,
k=dyrelu_k,
coefs=coefs,
consts=consts,
reduce=reduce,
use_dyrelu=use_dyrelu,
num_head=self.num_head,
mlp_ratio=self.mlp_ratio,
dropout_rate=self.dropout_rate,
droppath_rate=self.droppath_rate,
attn_dropout_rate=self.attn_dropout_rate,
mlp_dropout_rate=self.mlp_dropout_rate,
norm=norm,
act=act,
init_type=self.init_type
)
)
self.blocks = nn.LayerList(self.blocks)
def _create_former_end_bridge(self,
in_channels,
norm,
alpha):
self.end_toformer_bridge = ToFormer_Bridge(embed_dims=self.embed_dims,
in_channels=int(alpha * in_channels),
num_head=self.num_head,
dropout_rate=self.dropout_rate,
droppath_rate=self.droppath_rate,
attn_dropout_rate=self.attn_dropout_rate,
init_type=self.init_type)
def _create_channel_conv(self,
in_channels,
out_channels,
alpha,
pointwiseconv_groups):
self.channel_conv = nn.Sequential(
PointWiseConv(in_channels=int(alpha * in_channels),
out_channels=out_channels,
groups=pointwiseconv_groups,
init_type=self.init_type),
nn.BatchNorm2D(out_channels),
nn.ReLU()
)
def _create_head(self,
in_channels,
hidden_features):
self.head = Classifier_Head(in_channels=in_channels,
embed_dims=self.embed_dims,
hidden_features=hidden_features,
num_classes=self.num_classes,
dropout=self.dropout_rate,
init_type=self.init_type)
def _create_model(self,
use_dyrelu,
reduce,
dyrelu_k,
coefs,
consts,
norm,
act,
alpha,
config):
# create stem: self.stem
self._create_stem(in_channels=self.in_channels,
out_channels=config.MODEL.MF.STEM.OUT_CHANNELS,
kernel_size=config.MODEL.MF.STEM.KERNELS,
stride=config.MODEL.MF.STEM.STRIEDS,
padding=config.MODEL.MF.STEM.PADDINGS,
alpha=alpha)
# create lite-bottleneck: self.bneck_lite
self._create_lite_bneck(in_channels=config.MODEL.MF.LITE_BNECK.IN_CHANNEL,
hidden_channels=config.MODEL.MF.LITE_BNECK.HIDDEN_CHANNEL,
out_channels=config.MODEL.MF.LITE_BNECK.OUT_CHANNEL,
kernel_size=config.MODEL.MF.LITE_BNECK.KERNEL,
stride=config.MODEL.MF.LITE_BNECK.STRIED,
padding=config.MODEL.MF.LITE_BNECK.PADDING,
alpha=alpha,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create mobileformer blocks: self.blocks
self._create_mf_blocks(in_channel_list=config.MODEL.MF.BLOCK.IN_CHANNELS,
hidden_channel_list=config.MODEL.MF.BLOCK.HIDDEN_CHANNELS,
out_channel_list=config.MODEL.MF.BLOCK.OUT_CHANNELS,
kernel_list=config.MODEL.MF.BLOCK.KERNELS,
stride_list=config.MODEL.MF.BLOCK.STRIEDS,
padding_list=config.MODEL.MF.BLOCK.PADDINGS,
alpha=alpha,
use_dyrelu=use_dyrelu,
reduce=reduce,
dyrelu_k=dyrelu_k,
coefs=coefs,
consts=consts,
norm=norm,
act=act,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create final toformer_bridge: self.toformer_bridge
self._create_former_end_bridge(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
norm=norm,
alpha=alpha)
# create channel 1x1 conv: self.channel_conv
self._create_channel_conv(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL,
out_channels=config.MODEL.MF.CHANNEL_CONV.OUT_CHANNEL,
alpha=alpha,
pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS)
# create classifier head: self.head
self._create_head(in_channels=config.MODEL.MF.HEAD.IN_CHANNEL,
hidden_features=config.MODEL.MF.HEAD.HIDDEN_FEATURE)
def _to_batch_tokens(self, batch_size):
# B, token_size, embed_dims
return paddle.concat([self.tokens]*batch_size, axis=0)
# @paddle.jit.to_static
def forward(self, inputs):
B, _, _, _ = inputs.shape
f = self.stem(inputs)
# create batch tokens
tokens = self._to_batch_tokens(B) # B, token_size, embed_dims
f = self.bneck_lite(f, tokens)
for b in self.blocks:
f, tokens = b(f, tokens)
tokens = self.end_toformer_bridge(f, tokens)
f = self.channel_conv(f)
output = self.head(f, tokens)
return output
How to use MobileFormer
Model parameters are loaded and configured through mobileformer_26m.yaml, relying on the parameter layout in config.py together with the settings in the yaml file.
Brief notes on the yaml parameters:
- MF: configuration of the entire MobileFormer network architecture
- BLOCK: configuration of all MFBlocks, including input, hidden and output channel counts, etc.
- DYRELU: configuration of the dynamic ReLU
- The config loaded via get_config defaults to 1000 classes; one parameter must be added to the yaml to build a model with a custom number of classes
Original:
MODEL:
TYPE: MobileFormer
NAME: MobileFormer_26M
DROPPATH: 0.1
DROPOUT: 0.1
MLP_DROPOUT: 0.1
ATTENTION_DROPOUT: 0.1
After the addition (10 classes):
MODEL:
NUM_CLASSES: 10
TYPE: MobileFormer
NAME: MobileFormer_26M
DROPPATH: 0.1
DROPOUT: 0.1
MLP_DROPOUT: 0.1
ATTENTION_DROPOUT: 0.1
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('mobileformer_26m.yaml')
model = build_model(config)
test_data = paddle.rand((1, 3, 224, 224))
y_pred = model(test_data)
print('model output: ', y_pred.shape)
=> merge config from mobileformer_26m.yaml
model output: [1, 10]
Experiment notes
All experiments below use MobileFormer26M, the smallest MobileFormer model.
1. MNIST (99.3)
- Batch size 256, 4 GPUs
- Learning rate and the other settings follow the config; no mixup-style augmentation, only random crop, resize, ToTensor and normalize
- Input images upscaled to 48
- In the model, the Lite-BottleNeck stride is reduced to 1; everything else unchanged
- best_model.pdparams == 8.85M
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from paddle import nn
from MNIST.mnist_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('./MNIST/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./MNIST/best_model.pdparams'))
model = paddle.Model(model)
model.prepare(
optimizer=None,
loss=nn.CrossEntropyLoss(),
metrics=paddle.metric.Accuracy(topk=(1, 5))
)
model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./MNIST/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.5501 - acc_top1: 0.9910 - acc_top5: 0.9996 - 237ms/step
step 20/40 - loss: 1.5416 - acc_top1: 0.9916 - acc_top5: 0.9996 - 227ms/step
step 30/40 - loss: 1.5380 - acc_top1: 0.9934 - acc_top5: 0.9997 - 227ms/step
step 40/40 - loss: 1.5374 - acc_top1: 0.9942 - acc_top5: 0.9998 - 221ms/step
Eval samples: 10000
{'loss': [1.5374461], 'acc_top1': 0.9942, 'acc_top5': 0.9998}
2. Cifar10 (89.6)
Accuracy could still improve; it was held back by a batch size and other settings that were not quite ideal.
- Batch size 256, 4 GPUs
- Learning rate and the other settings follow the config; mixup augmentation used, plus random crop, resize, ToTensor and normalize
- Input images upscaled to 64
- In the model, the Lite-BottleNeck stride is reduced to 1; everything else unchanged
- best_model.pdparams == 8.85M
Other notes
A follow-up guide will cover using MobileFormer in practice, plus a simple tutorial on adapting the PaddleViT source to custom datasets.
# yacs is needed by the config handling in config.py
!pip install yacs
import paddle
from paddle import nn
from Cifar10.cifar10_transforms import test_datasets
from config import get_config
from mobileformer import build_mformer as build_model
config = get_config('./Cifar10/mobileformer_26m.yaml')
model = build_model(config)
model.set_state_dict(paddle.load('./Cifar10/best_model.pdparams'))
model = paddle.Model(model)
model.prepare(
optimizer=None,
loss=nn.CrossEntropyLoss(),
metrics=paddle.metric.Accuracy(topk=(1, 5))
)
model.evaluate(eval_data=test_datasets, batch_size=256)
=> merge config from ./Cifar10/mobileformer_26m.yaml
Eval begin...
step 10/40 - loss: 1.7457 - acc_top1: 0.8949 - acc_top5: 0.9926 - 246ms/step
step 20/40 - loss: 1.7452 - acc_top1: 0.8961 - acc_top5: 0.9906 - 243ms/step
step 30/40 - loss: 1.7186 - acc_top1: 0.8988 - acc_top5: 0.9926 - 244ms/step
step 40/40 - loss: 1.7148 - acc_top1: 0.8980 - acc_top5: 0.9933 - 242ms/step
Eval samples: 10000
{'loss': [1.7147579], 'acc_top1': 0.898, 'acc_top5': 0.9933}
For the preprocessing involved (e.g. Mixup), see: BR-IDL/PaddleViT
Feel free to follow PaddleViT; it offers many more ViT model implementations to learn from and use.
Discussion in the comments is welcome, too.
Name: Cai Jinghui
Education: fourth-year undergraduate
Interests: competitions large and small, not limited to computer vision; like-minded readers are welcome to follow
Main areas: object detection, image segmentation, image recognition
Contact: QQ 3020889729, WeChat cjh3020889729
School: Southwest University of Science and Technology