How Can Convolutional Attention Beat ViT? Reproducing SegNeXt with PaddleSeg
Reproducing SegNeXt, a network that beat SegFormer and a host of other Transformer-based models in both accuracy and speed when it was released.
How can a convolutional attention network beat Transformer-based models?
Author: 还没恋爱所以来打比赛
Date: 2022.10.27
Abstract: Reproducing SegNeXt with PaddleSeg (no pretrained weights, so accuracy is limited)
1. Introduction
Since ViT was introduced, it has come to dominate every sub-task in computer vision, and semantic segmentation is no exception. Recently, however, in the paper SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation, Meng-Hao Guo et al. surveyed past SOTA models in semantic segmentation and distilled four keys to their success:
- a strong backbone network as the encoder;
- multi-scale information interaction;
- spatial attention;
- low computational complexity.
Building on these four points, the authors propose a new encoder-decoder model, SegNeXt. The encoder introduces a multi-scale convolutional attention (MSCA) module, and the decoder uses a Hamburger module to further extract global features. Notably, because the encoder is convolutional, the first-stage features carry too much low-level noise, so the decoder only consumes the feature maps from stages 2-4. SegNeXt also beats SegFormer and similar models on computational cost; the figure below compares their performance, and a minimal sketch of the overall data flow follows.
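To make the pipeline concrete before the full implementation in section 4, here is a minimal, non-official sketch; `encoder` and `decoder` are placeholders standing in for the real MSCAN backbone and Hamburger-based head:

# Conceptual sketch only, not the paper's exact code.
def segnext_forward(image, encoder, decoder):
    feats = encoder(image)       # one feature map per stage (strides 4/8/16/32)
    logits = decoder(feats[1:])  # stage-1 features are dropped as too noisy
    return logits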
2. Environment Setup
To reproduce SegNeXt quickly, I use the PaddleSeg toolkit. Before training, run the environment setup cells below. Note that extracting the Cityscapes dataset takes quite a while (around two hours), so it is advisable to do the extraction on a CPU instance first and then switch to a GPU instance.
# Unpack and install PaddleSeg, then make it importable in this session
!unzip PaddleSeg
!pip install -e /home/aistudio/PaddleSeg > /dev/null
import sys
sys.path.append('/home/aistudio/PaddleSeg')

# Prepare the data directory; uncomment the tar line to extract Cityscapes
!mkdir -p PaddleSeg/data/
# !tar -xvf data/data64550/cityscapes.tar -C PaddleSeg/data/
3. Dataset Overview
We train on the Cityscapes dataset, using a preprocessed copy kindly shared by another user: https://aistudio.baidu.com/aistudio/datasetdetail/64550/0 Once extracted into PaddleSeg it can be used directly, which saves the preprocessing time.
The dataset holds roughly 3K train/val images, all 1024×1024 street scenes. The leftImg8bit folder contains the raw images and gtFine the labels. A sample looks like the figure below.
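For a quick look at the data yourself, the snippet below displays one image/label pair. The directory layout and file-name patterns here follow the standard Cityscapes convention and are assumptions; verify them against your extracted copy:

# Peek at one Cityscapes sample (paths and file patterns are assumptions)
import glob
import matplotlib.pyplot as plt
from PIL import Image

img_path = sorted(glob.glob(
    'PaddleSeg/data/cityscapes/leftImg8bit/train/*/*_leftImg8bit.png'))[0]
lbl_path = (img_path.replace('leftImg8bit/', 'gtFine/')
                    .replace('_leftImg8bit.png', '_gtFine_labelTrainIds.png'))

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].imshow(Image.open(img_path)); axes[0].set_title('image')
axes[1].imshow(Image.open(lbl_path)); axes[1].set_title('label')
plt.show()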
4. Model Construction
In this reproduction the model splits into two parts, the encoder and the decoder, implemented in paddleseg/models/backbones/mscan.py and paddleseg/models/segnext.py respectively.
4.1 Encoder Implementation
The encoder module is sketched in the figure below. Instead of self-attention, the authors design a multi-scale convolutional attention module (MSCA), essentially a multi-scale variant of VAN. MSCA consists of three parts:
- a depth-wise convolution;
- multi-scale convolutions;
- an ordinary (1×1) convolution.
(Figure: MSCA module structure; image: https://ai-studio-static-online.cdn.bcebos.com/809cee575d2747b286cdecc4153ed2fe40b4b70a8a854c238b1b0c99b23b0a5f)
The core of the paper's encoder is the AttentionModule layer:
# Encoder
import paddle
import paddle.nn as nn
from paddleseg.cvlibs import manager
class DWConv(nn.Layer):
    """3x3 depth-wise convolution (groups == channels)."""
    def __init__(self, dim=768):
        super(DWConv, self).__init__()
        self.dwconv = nn.Conv2D(dim, dim, 3, 1, 1, bias_attr=True, groups=dim)

    def forward(self, inputs):
        return self.dwconv(inputs)
class Mlp(nn.Layer):
    """Feed-forward block: 1x1 conv -> depth-wise conv -> activation -> 1x1 conv."""
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.):
        super(Mlp, self).__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
self.dwconv = DWConv(hidden_features)
self.act = act_layer()
self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.dwconv(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class StemConv(nn.Layer):
    """Stem: two stride-2 3x3 convs, downsampling the input by 4x."""
    def __init__(self, in_channels, out_channels, norm_type='BatchNorm'):
        super(StemConv, self).__init__()
        if norm_type != 'BatchNorm':
            raise ValueError('StemConv only supports BatchNorm')
self.proj = nn.Sequential(
nn.Conv2D(in_channels, out_channels//2, kernel_size=(3, 3), stride=(2, 2), padding=(1,1)),
nn.BatchNorm2D(out_channels // 2),
nn.GELU(),
nn.Conv2D(out_channels // 2, out_channels, kernel_size=(3, 3), stride=(2, 2), padding=(1,1)),
nn.BatchNorm2D(out_channels)
)
    def forward(self, x):
        x = self.proj(x)
        _, _, H, W = x.shape
        x = paddle.flatten(x, 2)            # (B, C, H*W)
        x = paddle.transpose(x, (0, 2, 1))  # (B, N, C) token layout
        return x, H, W
class AttentionModule(nn.Layer):
    """MSCA: a 5x5 depth-wise conv plus three multi-scale branches of
    depth-wise strip convolutions (7, 11, 21), fused by a 1x1 conv that
    produces the attention map."""
    def __init__(self, dim):
        super(AttentionModule, self).__init__()
        self.conv0 = nn.Conv2D(dim, dim, 5, padding=2, groups=dim)
        # 7x7 branch, decomposed into 1x7 and 7x1 strip convolutions
        self.conv0_1 = nn.Conv2D(dim, dim, (1, 7), padding=(0, 3), groups=dim)
        self.conv0_2 = nn.Conv2D(dim, dim, (7, 1), padding=(3, 0), groups=dim)
        # 11x11 branch
        self.conv1_1 = nn.Conv2D(dim, dim, (1, 11), padding=(0, 5), groups=dim)
        self.conv1_2 = nn.Conv2D(dim, dim, (11, 1), padding=(5, 0), groups=dim)
        # 21x21 branch
        self.conv2_1 = nn.Conv2D(dim, dim, (1, 21), padding=(0, 10), groups=dim)
        self.conv2_2 = nn.Conv2D(dim, dim, (21, 1), padding=(10, 0), groups=dim)
        # 1x1 conv mixes channels to form the attention map
        self.conv3 = nn.Conv2D(dim, dim, 1)

    def forward(self, x):
        u = x.clone()                    # keep the input as the value branch
        attn = self.conv0(x)
        attn_0 = self.conv0_1(attn)
        attn_0 = self.conv0_2(attn_0)
        attn_1 = self.conv1_1(attn)
        attn_1 = self.conv1_2(attn_1)
        attn_2 = self.conv2_1(attn)
        attn_2 = self.conv2_2(attn_2)
        attn = attn_0 + attn_1 + attn_2 + attn   # fuse multi-scale branches
        attn = self.conv3(attn)
        return attn * u                  # attention map re-weights the input
class SpatialAttention(nn.Layer):
    """Wraps MSCA: 1x1 conv -> GELU -> AttentionModule -> 1x1 conv, plus residual."""
    def __init__(self, d_model):
        super().__init__()
self.d_model = d_model
self.proj1 = nn.Conv2D(d_model, d_model, 1)
self.act1 = nn.GELU()
self.spatial = AttentionModule(d_model)
self.proj2 = nn.Conv2D(d_model, d_model, 1)
def forward(self, x):
s = x.clone()
x = self.proj1(x)
x = self.act1(x)
x = self.spatial(x)
x = self.proj2(x)
x = x + s
return x
def drop_path(x, drop_prob, training):
    """Stochastic depth: randomly zero whole samples, rescaling the survivors."""
    if drop_prob == 0. or not training:
        return x
    keep_prob = paddle.to_tensor(1. - drop_prob)
    # one random value per sample, broadcast over all remaining dims
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
    random_tensor = paddle.floor(random_tensor)  # binarize to 0 / 1
    return x.divide(keep_prob) * random_tensor


class DropPath(nn.Layer):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
class Block(nn.Layer):
def __init__(
self,
dim,
mlp_ratio=4.,
drop=0.,
drop_path=0.,
act_layer=nn.GELU,
norm_type='BatchNorm'
):
        super(Block, self).__init__()
        if norm_type != 'BatchNorm':
            raise ValueError('Block only supports BatchNorm')
        self.norm1 = nn.BatchNorm2D(dim)
        self.attn = SpatialAttention(dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = nn.BatchNorm2D(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(dim, mlp_hidden_dim, act_layer=act_layer, drop=drop)
        # learnable per-channel scaling (layer scale) for both residual branches
        layer_scale_init_value = 1e-2
        attr = paddle.ParamAttr(initializer=nn.initializer.Constant(layer_scale_init_value))
        self.layer_scale_1 = self.create_parameter(shape=[dim], attr=attr)
        self.layer_scale_2 = self.create_parameter(shape=[dim], attr=attr)

    def forward(self, x, H, W):
        B, N, C = x.shape
        # tokens (B, N, C) -> feature map (B, C, H, W)
        x = paddle.transpose(x, perm=[0, 2, 1])
        x = paddle.reshape(x, [B, C, H, W])
        x = x + self.drop_path(self.layer_scale_1.unsqueeze(-1).unsqueeze(-1) * self.attn(self.norm1(x)))
        x = x + self.drop_path(self.layer_scale_2.unsqueeze(-1).unsqueeze(-1) * self.mlp(self.norm2(x)))
        # back to token layout for the next block
        x = paddle.reshape(x, [B, C, N])
        x = paddle.transpose(x, perm=[0, 2, 1])
        return x
class OverlapPatchEmbed(nn.Layer):
""" Image to Patch Embedding
"""
def __init__(self, patch_size=7, stride=4, in_chans=3, embed_dim=768, norm_type='BatchNorm'):
super().__init__()
        if norm_type != 'BatchNorm':
            raise ValueError('OverlapPatchEmbed only supports BatchNorm')
patch_size = (patch_size, patch_size)
self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=stride,
padding=(patch_size[0] // 2, patch_size[1] // 2))
self.norm = nn.BatchNorm2D(embed_dim)
def forward(self, x):
x = self.proj(x)
_, _, H, W = x.shape
x = self.norm(x)
x = x.flatten(2).transpose([0, 2, 1])
return x, H, W
# TODO: verify that all parameters receive gradients
@manager.BACKBONES.add_component
class MSCAN(nn.Layer):
def __init__(self,
in_chans=3,
embed_dims=[64, 128, 256, 512],
mlp_ratios=[4, 4, 4, 4],
drop_rate=0.,
drop_path_rate=0.,
depths=[3, 4, 6, 3],
num_stages=4,
norm_type='BatchNorm'):
super(MSCAN, self).__init__()
        if norm_type != 'BatchNorm':
            raise ValueError('MSCAN only supports BatchNorm')
self.depths = depths
self.num_stages = num_stages
        self.feat_channels = embed_dims  # per-stage channel list consumed by the decoder
dpr = [x.item() for x in paddle.linspace(0, drop_path_rate,
sum(depths))] # stochastic depth decay rule
cur = 0
        for i in range(num_stages):
            if i == 0:
                # stage 1 uses the stem (4x downsampling)
                patch_embed = StemConv(in_chans, embed_dims[0], norm_type=norm_type)
            else:
                # later stages use overlapping patch embedding (2x downsampling)
                patch_embed = OverlapPatchEmbed(patch_size=3,
                                                stride=2,
                                                in_chans=embed_dims[i - 1],
                                                embed_dim=embed_dims[i],
                                                norm_type=norm_type)
block = paddle.nn.LayerList([Block(dim=embed_dims[i], mlp_ratio=mlp_ratios[i],
drop=drop_rate, drop_path=dpr[cur + j],
norm_type=norm_type)
for j in range(depths[i])])
norm = nn.LayerNorm(embed_dims[i])
cur += depths[i]
setattr(self, f"patch_embed{i + 1}", patch_embed)
setattr(self, f"block{i + 1}", block)
setattr(self, f"norm{i + 1}", norm)
    def forward(self, x):
        B = x.shape[0]
        outs = []
        for i in range(self.num_stages):
            patch_embed = getattr(self, f"patch_embed{i + 1}")
            block = getattr(self, f"block{i + 1}")
            norm = getattr(self, f"norm{i + 1}")
            x, H, W = patch_embed(x)   # downsample and flatten to tokens
            for blk in block:
                x = blk(x, H, W)
            x = norm(x)
            # tokens back to a feature map for the next stage / the decoder
            x = x.reshape([B, H, W, -1]).transpose([0, 3, 1, 2])
            outs.append(x)
        return tuple(outs)             # one feature map per stage
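Before moving on to the decoder, a quick smoke test confirms the backbone produces the expected feature pyramid. This is a sketch; the embed_dims/depths below follow the tiny configuration reported in the paper and should be treated as an assumption:

# Sanity check: feed a dummy image and print the per-stage feature shapes
model = MSCAN(embed_dims=[32, 64, 160, 256], depths=[3, 3, 5, 2])
model.eval()
x = paddle.rand([1, 3, 512, 512])
for f in model(x):
    print(f.shape)   # expect strides 4/8/16/32 -> sizes 128, 64, 32, 16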
4.2 Decoder Implementation
For the decoder, the authors explore three simple architectures:
- Figure 3-a: the decoder from SegFormer, a pure-MLP architecture;
- Figure 3-b: the design commonly used by CNN methods such as ASPP, PSP, and DANet;
- Figure 3-c: the decoder adopted in this paper, which aggregates the features of the last three stages with a lightweight Hamburger module for global context modeling.
Note that SegFormer's decoder aggregates the features of stages 1-4, while this design aggregates only stages 2-4. The reason is that SegNeXt's encoder is convolutional, so the stage-1 features contain too much low-level information that hurts segmentation performance, and processing them would also add considerable computational cost.
The Hamburger architecture used in the paper is shown below; it models global spatial information through matrix decomposition (the paper does not elaborate much on this part, and I did not fully understand it either, so I will not go into detail). Still, a rough sketch of the idea follows.
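For readers who want a concrete picture anyway, here is a heavily simplified, non-official sketch: the "ham" performs a low-rank reconstruction of the features via a few NMF multiplicative-update steps, sandwiched between two 1x1-conv "bread" layers, and the head first fuses the stage 2-4 feature maps. All class names, channel numbers, and hyper-parameters below are illustrative assumptions, not the paper's exact implementation:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class SimpleNMF2D(nn.Layer):
    """Illustrative 'ham': low-rank reconstruction via NMF multiplicative updates."""
    def __init__(self, rank=64, steps=6):
        super().__init__()
        self.rank, self.steps = rank, steps

    def forward(self, x):
        B, C, H, W = x.shape
        mat = F.relu(x).reshape([B, C, H * W])     # non-negative matrix X: (B, C, N)
        bases = paddle.rand([B, C, self.rank])     # dictionary D: (B, C, r)
        coef = paddle.rand([B, self.rank, H * W])  # coefficients: (B, r, N)
        for _ in range(self.steps):
            # Lee-Seung multiplicative updates keep all factors non-negative
            coef = coef * paddle.bmm(bases.transpose([0, 2, 1]), mat) / (
                paddle.bmm(paddle.bmm(bases.transpose([0, 2, 1]), bases), coef) + 1e-6)
            bases = bases * paddle.bmm(mat, coef.transpose([0, 2, 1])) / (
                paddle.bmm(bases, paddle.bmm(coef, coef.transpose([0, 2, 1]))) + 1e-6)
        recon = paddle.bmm(bases, coef)            # low-rank global context D @ C
        return recon.reshape([B, C, H, W])

class LightHamHeadSketch(nn.Layer):
    """Illustrative decoder head: fuse stage 2-4 features, then Hamburger."""
    def __init__(self, in_channels=(128, 256, 512), ham_channels=256, num_classes=19):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2D(sum(in_channels), ham_channels, 1), nn.ReLU())
        self.ham = SimpleNMF2D()
        self.align = nn.Sequential(nn.Conv2D(ham_channels, ham_channels, 1), nn.ReLU())
        self.cls_head = nn.Conv2D(ham_channels, num_classes, 1)

    def forward(self, feats):
        # feats: MSCAN outputs for stages 2-4; stage 1 is dropped on purpose
        size = feats[0].shape[2:]
        x = paddle.concat([F.interpolate(f, size=size, mode='bilinear') for f in feats], axis=1)
        x = self.squeeze(x)           # lower bread
        x = x + F.relu(self.ham(x))   # ham, with a residual connection
        x = self.align(x)             # upper bread
        return self.cls_head(x)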
5. Model Training
To validate the model quickly, I use the tiny version of SegNeXt. Because my weight-conversion skills are not yet up to the task, I could not obtain the ImageNet-pretrained weights, so the accuracy shown here is low; I will redo the experiments once I find time to convert the pretrained model (no time at the moment!). For comparison, I also train SegFormer-B0 without pretraining. The corresponding config files are configs/segnext/segnext_tiny.yml and configs/segformer/segformer_b0_cityscapes_1024x1024_160k.yml. Note that I did not modify the training hyper-parameters.
5.1 SegNeXt Training
%cd PaddleSeg
!python train.py \
--config configs/segnext/segnext_tiny.yml \
--do_eval \
--use_vdl \
--save_interval 5000 \
    --save_dir output/segnext_tiny
5.2 SegFormer Training
!python train.py \
--config configs/segformer/segformer_b0_cityscapes_1024x1024_160k.yml \
--do_eval \
--use_vdl \
--save_interval 5000 \
--save_dir output/segformer
5.3 Training Results Comparison
The training curves are shown below: green is SegFormer, blue is SegNeXt. With neither model pretrained, SegNeXt converges faster than SegFormer.
6. Summary and Outlook
This was my first time converting torch code to paddle. I ran into quite a few problems along the way, but the process made me much more familiar with paddle's APIs. The reproduction results, however, are far from satisfying to me, let alone to readers.
Future plans:
- Optimize the model code
- Convert the torch pretrained weights to paddle weights
This article is a repost.
Original project link