卷积注意力网络如何打败VIT？基于PaddleSeg复现SegNext

复现SegNext，该网络提出之后在精度和速度上均击败了SegFormer一众Transformer Base的模型。

AI Studio

938人浏览 · 2022-11-26 19:18:18

AI Studio · 2022-11-26 19:18:18 发布

卷积注意力网络如何打败Transformer Base模型？

作者: 还没恋爱所以来打比赛

日期: 2022.10.27

摘要: 基于PaddleSeg复现SegNext(无预训练版本，效果较差)

一、介绍

自从VIT提出之后，CV领域中的各项子任务均被VIT主导，语义分割同样不例外。但是在最近MenghaoGuo等人再论文SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation中总结了历届语义分割领域的SOTA模型，总结除了以下四点成功的关键：

采用强骨干网络作为编码器；
多尺度信息交互；
空域注意力；
低计算复杂度；

基于以上四点，作者提出了新的基于Encoder-Decoder模型：SegNext。作者在Encoder模块引入了多尺度卷积注意力模块，在Decoder模块采用了Hamberger进一步提取全局特征，值得注意的是，由于Encoder采用的是卷积模块，为了避免第一层卷积引入过多噪声，作者在Decoder阶段只使用了2-4阶段的特征图。并且在计算复杂度上SegNext也胜过了SegFormer等模型，下图是SegNext的性能对比图：

在这里插入图片描述

二、环境配置

为了快速复现SegNext，在这里我采用了PaddleSeg套件作为复现工具，在开始训练前，需要运行下面的环境配置代码块。其中Citiscapes数据集的解压速度需要较长时间(2小时左右)，建议先使用Cpu版本进行解压在转换为GPU版本。

!unzip PaddleSeg
!pip install -e /home/aistudio/PaddleSeg > /dev/null
import sys
sys.path.append('/home/aistudio/PaddleSeg')
! mkdir -p PaddleSeg/data/
# ! tar -xvf data/data64550/cityscapes.tar -C PaddleSeg/data/

三、数据集概览

数据集上采用Cityscapes数据集，在这里我们使用大佬提供的处理后的数据：https://aistudio.baidu.com/aistudio/datasetdetail/64550/0 该数据集在解压到Paddleseg之后可以直接使用，省去处理的时间。
在这里插入图片描述

在这里插入图片描述

该数据集内有约3K张train和val的照片，图片尺寸为1024*1024，均为街景照片。数据集中leftimg8bit为原始数据，gtFine为label。数据大概长下面这个样子：
在这里插入图片描述

四、模型组网

在本次复现中，模型分为两大模块：Encoder和Decoder。paddleseg/models/backbones/mscan.py和PaddleSeg/paddleseg/models/segnext.py

4.1 Encoder实现

encoder的模块大概长下面的样子，论文作者并没有使用Self-Attention，而是设计一种多尺度卷积注意力模块MSCA，其实就是一种多尺度版VAN。在MSCA中包含三大模块

Depth-wish卷积
多尺度卷积
普通卷积

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Xf1CeW6r-1669461318888)(https://ai-studio-static-online.cdn.bcebos.com/809cee575d2747b286cdecc4153ed2fe40b4b70a8a854c238b1b0c99b23b0a5f)]

其中论文中Encoder的核心代码在AttentionModule层

# Encoder
import paddle
import paddle.nn as nn

from paddleseg.cvlibs import manager

class DWConv(nn.Layer):
    def __init__(self, dim=768):
        super(DWConv, self).__init__()
        self.dwconb = nn.Conv2D(dim, dim, 3, 1, 1, bias_attr=True, groups=dim)
    
    def forward(self, inputs):
        x = self.dwconb(inputs)
        return x

class Mlp(nn.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super(Mlp, self).__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
        self.dwconv = DWConv(hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.dwconv(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)

        return x

class StemConv(nn.Layer):
    def __init__(self, in_channels, out_channels, norm_type='BatchNorm'):
        super(StemConv, self).__init__()
        if norm_type != 'BatchNorm':
            raise Exception('StemConv model only can use BatchNorm')

        self.proj = nn.Sequential(
            nn.Conv2D(in_channels, out_channels//2, kernel_size=(3, 3), stride=(2, 2), padding=(1,1)),
            nn.BatchNorm2D(out_channels // 2),
            nn.GELU(),
            nn.Conv2D(out_channels // 2, out_channels, kernel_size=(3, 3), stride=(2, 2), padding=(1,1)),
            nn.BatchNorm2D(out_channels)
        )
    
    def forward(self, x):
        x = self.proj(x)
        _, _, H, W = x.shape
        x = paddle.flatten(x, 2)
        x = paddle.transpose(x, (0, 2, 1))
        return x, H, W

class AttentionModule(nn.Layer):
    def __init__(self, dim):
        super(AttentionModule, self).__init__()
        self.conv0 = nn.Conv2D(dim, dim, 5, padding=2, groups=dim)

        self.conv0_1 = nn.Conv2D(dim, dim, (1, 7), padding=(0, 3), groups=dim)
        self.conv0_2 = nn.Conv2D(dim, dim, (7, 1), padding=(3, 0), groups=dim)

        self.conv1_1 = nn.Conv2D(dim, dim, (1, 11), padding=(0, 5), groups=dim)
        self.conv1_2 = nn.Conv2D(dim, dim, (11, 1), padding=(5, 0), groups=dim)

        self.conv2_1 = nn.Conv2D(dim, dim, (1, 21), padding=(0, 10), groups=dim)
        self.conv2_2 = nn.Conv2D(dim, dim, (21, 1), padding=(10, 0), groups=dim)

        self.conv3 = nn.Conv2D(dim, dim, 1)

    def forward(self, x):
        u = x.clone()
        attn = self.conv0(x)

        attn_0 = self.conv0_1(attn)
        attn_0 = self.conv0_2(attn_0)

        attn_1 = self.conv1_1(attn)
        attn_1 = self.conv1_2(attn_1)

        attn_2 = self.conv2_1(attn)
        attn_2 = self.conv2_2(attn_2)

        attn = attn_0 + attn_1 + attn_2 + attn

        attn = self.conv3(attn)

        return attn * u


class SpatialAttention(nn.Layer):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.proj1 = nn.Conv2D(d_model, d_model, 1)
        self.act1 = nn.GELU()
        self.spatial = AttentionModule(d_model)
        self.proj2 = nn.Conv2D(d_model, d_model, 1)

    def forward(self, x):
        s = x.clone()
        x = self.proj1(x)
        x = self.act1(x)
        x = self.spatial(x)
        x = self.proj2(x)
        x = x + s
        return x


def drop_path(x, drop_prob, training):
    if drop_prob == 0. or not training:
        return x
    keep_prob = paddle.to_tensor(1. - drop_prob)
    # keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
    random_tensor = paddle.floor(random_tensor)
    output = x.divide(keep_prob) * random_tensor
    return output

class DropPath(nn.Layer):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)

class Block(nn.Layer):
    def __init__(
        self,
        dim, 
        mlp_ratio=4.,
        drop=0., 
        drop_path=0.,
        act_layer=nn.GELU,
        norm_type='BatchNorm'
    ):
        if norm_type != 'BatchNorm':
            raise Exception('Block model only can use BatchNorm')
        super(Block, self).__init__()

        self.norm1 = nn.BatchNorm2D(dim)
        self.attn = SpatialAttention(dim)
        # TODO: drop path
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        # self.drop_path = nn.Dropout(drop_path)

        self.norm2 = nn.BatchNorm2D(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(dim, mlp_hidden_dim, act_layer=act_layer, drop=drop)
        layer_scale_init_value = 1e-2
        attr = paddle.ParamAttr(initializer=nn.initializer.Constant(layer_scale_init_value))
        self.layer1_scale_1 = self.create_parameter(shape=[dim], attr=attr)
        self.layer1_scale_2 = self.create_parameter(shape=[dim], attr=attr)

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = paddle.transpose(x, perm=[0, 2, 1])
        x = paddle.reshape(x, [B, C, H, W])
        x = x + self.drop_path(self.layer1_scale_1.unsqueeze(-1).unsqueeze(-1) * self.attn(self.norm1(x)))
        x = x + self.drop_path(self.layer1_scale_2.unsqueeze(-1).unsqueeze(-1) * self.mlp(self.norm2(x)))

        x = paddle.reshape(x, [B, C, N])
        x = paddle.transpose(x, perm=[0, 2, 1])

        return x

class OverlapPatchEmbed(nn.Layer):
    """ Image to Patch Embedding
    """

    def __init__(self, patch_size=7, stride=4, in_chans=3, embed_dim=768, norm_type='BatchNorm'):
        super().__init__()
        if norm_type != 'BatchNorm':
            raise Exception('Block model only can use BatchNorm')

        patch_size = (patch_size, patch_size)

        self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=stride,
                              padding=(patch_size[0] // 2, patch_size[1] // 2))
        self.norm = nn.BatchNorm2D(embed_dim)

    def forward(self, x):
        x = self.proj(x)
        _, _, H, W = x.shape
        x = self.norm(x)

        x = x.flatten(2).transpose([0, 2, 1])

        return x, H, W

# TODO : Loking for all the parameters has gradient
@manager.BACKBONES.add_component
class MSCAN(nn.Layer):
    def __init__(self,
                 in_chans=3,
                 embed_dims=[64, 128, 256, 512],
                 mlp_ratios=[4, 4, 4, 4],
                 drop_rate=0.,
                 drop_path_rate=0.,
                 depths=[3, 4, 6, 3],
                 num_stages=4,
                 norm_type='BatchNorm'):
        super(MSCAN, self).__init__()
        if norm_type != 'BatchNorm':
            raise Exception('Block model only can use BatchNorm')

        self.depths = depths
        self.num_stages = num_stages
        self.feat_channels=embed_dims

        dpr = [x.item() for x in paddle.linspace(0, drop_path_rate,
                                                sum(depths))]  # stochastic depth decay rule
        cur = 0

        for i in range(num_stages):
            if i == 0:
                patch_embed = StemConv(3, embed_dims[0], norm_type=norm_type)
            else:
                patch_embed = OverlapPatchEmbed(patch_size=7 if i == 0 else 3,
                                                stride=4 if i == 0 else 2,
                                                in_chans=in_chans if i == 0 else embed_dims[i - 1],
                                                embed_dim=embed_dims[i],
                                                norm_type=norm_type)

            block = paddle.nn.LayerList([Block(dim=embed_dims[i], mlp_ratio=mlp_ratios[i],
                                         drop=drop_rate, drop_path=dpr[cur + j],
                                         norm_type=norm_type)
                                   for j in range(depths[i])])
            norm = nn.LayerNorm(embed_dims[i])
            cur += depths[i]

            setattr(self, f"patch_embed{i + 1}", patch_embed)
            setattr(self, f"block{i + 1}", block)
            setattr(self, f"norm{i + 1}", norm)

    def forward(self, x):
        B = x.shape[0]
        outs = []

        for i in range(self.num_stages):
            patch_embed = getattr(self, f"patch_embed{i + 1}")
            block = getattr(self, f"block{i + 1}")
            norm = getattr(self, f"norm{i + 1}")
            x, H, W = patch_embed(x)
            for blk in block:
                x = blk(x, H, W)
            x = norm(x)
            x = x.reshape([B, H, W, -1]).transpose([0, 3, 1, 2])
            outs.append(x)

        outs = tuple(outs)
        return outs

4.2 Decoder实现

在这里插入图片描述

作者则对三种简单Decoder架构进行了探索：

Figure3-a，源自SegFormer的解码器，它是一种纯MLP架构；
Figure3-b，常被CNN方案使用，如ASPP、PSP、DANet等；
Figure3-c，本文采用的解码器，它采用轻量型Hamberger模块对后三个阶段的特性进行聚合以进行全局上下文建模。

需要注意的是，SegFormer的解码器对Stage1到Stage4的特征进行聚合，而本文方案则仅对Stage2-Stage4的特征进行聚合。这是因为：SegNeXt的Encoder采用了卷积架构，使得Stage1部分特征包含过多底层信息，进而导致其会影响语义分割性能，同时对Stage1部分特征进行处理会带来过多的计算负载。

文章中使用的Hamberger架构图如下，它采用矩阵分解方式进行全局空域信息建模(这一块作者没有做太多的介绍，本人也没看懂这一块，因此在这里不做多介绍)
在这里插入图片描述

五、模型训练

为了快速验证模型性能，本人在此使用的是Segnext的tiny版本，并且由于代码转换能力不过关，无法获得imagenet上的预训练模型，因此在这里会所展示的模型精度较低。等待过段时间修改代码获取预训练模型后再重新做实验(这段时间没空啦啦啦啦)。同时为了对比，本人在此同样使用了SegFormer-b0的无预训练版本。对应的配置文件分别在configs/segnext/segnext_tiny.yml和configs/segformer/segformer_b0_cityscapes_1024x1024_160k.yml。同时值得注意的是，在训练参数上没有本人并没有做修改。

5.1 SegNext训练

%cd PaddleSeg
!python train.py \
       --config configs/segnext/segnext_tiny.yml \
       --do_eval \
       --use_vdl \
       --save_interval 5000 \
       --save_dir output/segnetx_tiny

5.2 SegFormer训练

!python train.py \
       --config configs/segformer/segformer_b0_cityscapes_1024x1024_160k.yml \
       --do_eval \
       --use_vdl \
       --save_interval 5000 \
       --save_dir output/segformer