[Paper Reproduction Challenge] ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

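The paper proposes the Efficient Spatial Pyramid (ESP) module. Based on the convolution factorization principle, the module decomposes a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions, which markedly improves efficiency in terms of computation, memory, and power. The paper's model reaches 60.30% mIoU on the Cityscapes validation set; this reproduction reaches 61.82%, and the implementation has been merged into PaddleSeg.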

Reference code: https://github.com/sacmehta/ESPNet
This project: https://github.com/simuler/ESPNet

1. Model Structure


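As shown in the figure above, the paper presents four network variants; the first three output a mask at 1/8 of the input size. The paper introduces one hyperparameter, the number of stacked ESP modules. Because the feature maps in the first two stages are large, and therefore expensive in both computation and memory, ESP modules are stacked only in the later stages.
ESPNet-A is the base network. It takes an RGB image as input, uses ESP modules to learn features at different spatial levels, and produces the mask with a final 1x1 convolution. ESPNet-B improves information flow by sharing the feature maps between the previous strided ESP module and the previous ESP module. ESPNet-C reinforces the input image inside ESPNet-B, which improves information flow further. The masks produced by these three networks are 1/8 the size of the input image. Finally, ESPNet adds a lightweight decoder on top of ESPNet-C so that the output mask has the same resolution as the input image.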

2. ESP (Efficient Spatial Pyramid) Module


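As shown in figure (a), the ESP module uses the convolution factorization principle to decompose a standard convolution into a point-wise convolution and dilated convolutions.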
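To make the saving from this factorization concrete, here is a minimal sketch that counts weights for a standard convolution and for the factorized form (one 1x1 reduction followed by several parallel dilated convolutions). The function names and the example sizes (100 input channels, 100 output channels, 5 branches) are assumptions chosen for illustration, and biases are ignored.

```python
def standard_conv_params(m: int, n_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution mapping m -> n_out channels (no bias)."""
    return k * k * m * n_out


def esp_params(m: int, n_out: int, branches: int, k: int = 3) -> int:
    """Weights in the factorized form: a 1x1 reduction plus `branches` dilated k x k convs."""
    d = n_out // branches               # channels per branch after the 1x1 reduction
    pointwise = m * d                   # 1x1 convolution: m -> d channels
    dilated = branches * k * k * d * d  # each branch maps d -> d channels; dilation adds no weights
    return pointwise + dilated


# Illustrative sizes only (assumed, not taken from the paper's tables).
std = standard_conv_params(100, 100)
esp = esp_params(100, 100, branches=5)
print(std, esp, std / esp)  # 90000 20000 4.5
```

Dilation spreads the kernel taps apart without adding weights, so the parallel branches enlarge the receptive field at no extra parameter cost.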

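Step 1: Given an input with M channels and a factor K (the number of parallel branches), a 1x1 convolution reduces the input to a lower-dimensional feature map.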

Step 2: Convolve the low-dimensional features with several kernels that use different dilation rates. The Paddle code is as follows:

        self.d_conv1 = nn.Conv2D(branch_channels, remain_channels, 3, padding=1, bias_attr=False)
        self.d_conv2 = nn.Conv2D(branch_channels, branch_channels, 3, padding=2, dilation=2, bias_attr=False)
        self.d_conv4 = nn.Conv2D(branch_channels, branch_channels, 3, padding=4, dilation=4, bias_attr=False)
        self.d_conv8 = nn.Conv2D(branch_channels, branch_channels, 3, padding=8, dilation=8, bias_attr=False)
        self.d_conv16 = nn.Conv2D(branch_channels, branch_channels, 3, padding=16, dilation=16, bias_attr=False)
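In the layers above, the padding always equals the dilation rate. The sketch below uses the standard convolution output-size formula to show why: a 3x3 kernel with dilation d covers 2d + 1 pixels, so a padding of d keeps the spatial size unchanged in every branch, which is what allows the branch outputs to be added and concatenated later. The helper name `conv_out_size` and the 64-pixel input are assumptions made for illustration.

```python
def conv_out_size(size: int, kernel: int, padding: int, dilation: int, stride: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    effective_kernel = dilation * (kernel - 1) + 1  # span covered by the dilated kernel
    return (size + 2 * padding - effective_kernel) // stride + 1


for d in (1, 2, 4, 8, 16):
    effective = d * (3 - 1) + 1
    out = conv_out_size(64, kernel=3, padding=d, dilation=d)
    print(f"dilation={d:2d}  effective kernel={effective:2d}  output size={out}")
```

The effective kernel grows from 3 to 33 pixels across the five branches while the output size stays at 64.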

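Figure (b) shows the block diagram of the ESP module. Combining the outputs of convolutions with large dilation rates tends to produce gridding artifacts, so the paper applies hierarchical feature fusion (HFF) to remove them. A skip connection between the input and the output is also added to improve information flow.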
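The fusion step can be sketched as a running sum over the branch outputs before concatenation. The toy example below uses plain Python lists in place of feature tensors; the function names are hypothetical, and the ordering (the dilation-1 branch is kept as is, and the remaining branches are summed hierarchically) is an assumption based on how HFF is usually described, not a copy of the project code.

```python
from itertools import accumulate


def add_maps(a, b):
    """Element-wise sum of two equally sized (flattened) feature maps."""
    return [x + y for x, y in zip(a, b)]


def hff_concat(d1, dilated_outputs):
    """Fuse branch outputs hierarchically, then concatenate them.

    d1: output of the dilation-1 branch.
    dilated_outputs: outputs of the dilation 2, 4, 8, 16 branches, in that order.
    """
    fused = list(accumulate(dilated_outputs, add_maps))  # d2, d2+d4, d2+d4+d8, ...
    out = list(d1)
    for feature in fused:
        out.extend(feature)  # stands in for concatenation along the channel axis
    return out


print(hff_concat([1.0], [[1.0], [2.0], [3.0], [4.0]]))  # [1.0, 1.0, 3.0, 6.0, 10.0]
```

Because each fused map already contains the output of the smaller dilation rates, holes left by a sparse, large-dilation kernel are filled in by the denser ones.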

3. HFF Comparison


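The figure above shows the output of an ESP module with a 3x3 dilated kernel and dilation rate r=2. The ESP module with hierarchical feature fusion (HFF) removes the artifacts and gives a better segmentation result.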

4. Experimental Results

The figure shows ESPNet's results on the Cityscapes dataset, with an mIoU of 60.3%.

6. Core Code

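import paddle
import paddle.nn as nn

from paddleseg.models import layers
from paddleseg.utils import utils

# ESPNetEncoder, BNPReLU and DilatedResidualBlock are helper layers from the
# same project and are not shown here.


class ESPNetV1(nn.Layer):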
    """
    The ESPNetV1 implementation based on PaddlePaddle.

    The original article refers to
      Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. "ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation"
      (https://arxiv.org/abs/1803.06815).

    Args:
        num_classes (int): The unique number of target classes.
        in_channels (int, optional): Number of input channels. Default: 3.
        level2_depth (int, optional): Depth of DilatedResidualBlock. Default: 2.
        level3_depth (int, optional): Depth of DilatedResidualBlock. Default: 3.
        pretrained (str, optional): The path or url of pretrained model. Default: None.
    """
    def __init__(self,
                 num_classes,
                 in_channels=3,
                 level2_depth=2,
                 level3_depth=3,
                 pretrained=None):
        super().__init__()
        self.encoder = ESPNetEncoder(num_classes, in_channels, level2_depth,
                                     level3_depth)

        self.level3_up = nn.Conv2DTranspose(num_classes,
                                            num_classes,
                                            2,
                                            stride=2,
                                            padding=0,
                                            output_padding=0,
                                            bias_attr=False)
        self.br3 = layers.SyncBatchNorm(num_classes)
        self.level2_proj = nn.Conv2D(in_channels + 128,
                                     num_classes,
                                     1,
                                     bias_attr=False)
        self.combine_l2_l3 = nn.Sequential(
            BNPReLU(2 * num_classes),
            DilatedResidualBlock(2 * num_classes, num_classes, residual=False),
        )
        self.level2_up = nn.Sequential(
            nn.Conv2DTranspose(num_classes,
                               num_classes,
                               2,
                               stride=2,
                               padding=0,
                               output_padding=0,
                               bias_attr=False),
            BNPReLU(num_classes),
        )
        self.out_proj = layers.ConvBNPReLU(16 + in_channels + num_classes,
                                           num_classes,
                                           3,
                                           padding='same',
                                           stride=1)
        self.out_up = nn.Conv2DTranspose(num_classes,
                                         num_classes,
                                         2,
                                         stride=2,
                                         padding=0,
                                         output_padding=0,
                                         bias_attr=False)
        self.pretrained = pretrained

    def init_weight(self):
        if self.pretrained is not None:
            utils.load_entire_model(self, self.pretrained)

    def forward(self, x):
        p1, p2, p3 = self.encoder(x)
        up_p3 = self.level3_up(p3)

        combine = self.combine_l2_l3(paddle.concat([up_p3, p2], axis=1))
        up_p2 = self.level2_up(combine)

        combine = self.out_proj(paddle.concat([up_p2, p1], axis=1))
        out = self.out_up(combine)
        return [out]
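The decoder above upsamples three times with a kernel size of 2 and a stride of 2. A minimal sketch of the resulting resolutions follows, using the standard transposed-convolution output-size formula (dilation 1). It assumes that the encoder output `p3` is at 1/8 of the input resolution, which is inferred from the three x2 upsampling steps rather than stated in the code; the helper name and the 512x1024 input are likewise illustrative.

```python
def conv_transpose_out_size(size: int, kernel: int = 2, stride: int = 2,
                            padding: int = 0, output_padding: int = 0) -> int:
    """Output length of a transposed convolution along one spatial axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


height, width = 512, 1024              # assumed input size for illustration
shape = (height // 8, width // 8)      # assumed resolution of p3 (1/8 scale)
trace = [shape]
for _ in ("level3_up", "level2_up", "out_up"):
    shape = tuple(conv_transpose_out_size(s) for s in shape)
    trace.append(shape)

print(trace)  # [(64, 128), (128, 256), (256, 512), (512, 1024)]
```

With kernel 2, stride 2 and no padding, each step exactly doubles the spatial size, so the final output matches the input resolution.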

7. Try ESPNet Online

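Run the code below to try training, validating, and predicting with ESPNet.
Step 1: Extract the Cityscapes dataset.
Step 2: Train ESPNet.
Step 3: Evaluate ESPNet on the validation set (the result shown here uses the best weights obtained during training; the corresponding log and VDL visualization files can be downloaded from the links in the Reproduction Results section).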
# step 1: unzip data
%cd ~/data/data64550/
!tar -xf cityscapes.tar
# step 2: train
%cd ~/ESPNet/
!python train.py --config /home/aistudio/ESPNet/configs/espnetv1/espnetv1_cityscapes_1024x512_120k.yml --do_eval --use_vdl --log_iter 10 --save_interval 2000 --save_dir output
# step 3: val
%cd /home/aistudio/ESPNet/
!python val.py --config /home/aistudio/ESPNet/configs/espnetv1/espnetv1_cityscapes_1024x512_120k.yml --model_path output/best_model/model.pdparams
/home/aistudio/ESPNet
2022-01-07 10:32:36 [INFO]	
---------------Config Information---------------
batch_size: 4
iters: 120000
loss:
  coef:
  - 1
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
    weight:
    - 2.79834108
    - 6.92945723
    - 3.84068512
    - 9.94349362
    - 9.77098823
    - 9.51484
    - 10.30981624
    - 9.94307377
    - 4.64933892
    - 9.55759938
    - 7.86692178
    - 9.53126629
    - 10.3496365
    - 6.67234062
    - 10.26054204
    - 10.28785275
    - 10.28988296
    - 10.40546021
    - 10.13848367
lr_scheduler:
  end_lr: 0.0
  learning_rate: 0.001
  power: 0.9
  type: PolynomialDecay
model:
  in_channels: 3
  level2_depth: 2
  level3_depth: 8
  num_classes: 19
  type: ESPNetV1
optimizer:
  type: adam
  weight_decay: 0.0002
train_dataset:
  dataset_root: /home/aistudio/data/data64550/cityscapes
  mode: train
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1024
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Cityscapes
val_dataset:
  dataset_root: /home/aistudio/data/data64550/cityscapes
  mode: val
  transforms:
  - type: Normalize
  type: Cityscapes
------------------------------------------------
W0107 10:32:36.203390  9343 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0107 10:32:36.203438  9343 device_context.cc:465] device: 0, cuDNN Version: 7.6.
2022-01-07 10:32:40 [INFO]	Loading pretrained model from output/best_model/model.pdparams
2022-01-07 10:32:40 [INFO]	There are 211/211 variables loaded into ESPNetV1.
2022-01-07 10:32:40 [INFO]	Loaded trained params of model successfully
2022-01-07 10:32:40 [INFO]	Start evaluating (total_samples: 500, total_iters: 500)...
500/500 [==============================] - 81s 162ms/step - batch_cost: 0.1618 - reader cost: 0.1204
2022-01-07 10:34:01 [INFO]	[EVAL] #Images: 500 mIoU: 0.6182 Acc: 0.9341 Kappa: 0.9148 
2022-01-07 10:34:01 [INFO]	[EVAL] Class IoU: 
[0.9667 0.768  0.8798 0.4199 0.4632 0.5244 0.4507 0.6023 0.8938 0.5562
 0.9045 0.6813 0.4034 0.901  0.4674 0.5768 0.3732 0.2815 0.6312]
2022-01-07 10:34:01 [INFO]	[EVAL] Class Acc: 
[0.9911 0.8374 0.9423 0.7118 0.6159 0.6614 0.6543 0.7398 0.9456 0.7007
 0.9254 0.7594 0.6243 0.9352 0.7434 0.6963 0.5144 0.495  0.7439]
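As a quick consistency check on the log above, the reported mIoU should be the unweighted mean of the 19 per-class IoU values. The sketch below copies those values from the log and averages them; since the logged values are already rounded to four decimals, agreement in the last digit is not guaranteed in general, but here it matches.

```python
# Per-class IoU values copied from the [EVAL] Class IoU line in the log.
class_iou = [
    0.9667, 0.768, 0.8798, 0.4199, 0.4632, 0.5244, 0.4507, 0.6023, 0.8938, 0.5562,
    0.9045, 0.6813, 0.4034, 0.901, 0.4674, 0.5768, 0.3732, 0.2815, 0.6312,
]

miou = sum(class_iou) / len(class_iou)  # unweighted mean over the 19 classes
print(f"{miou:.4f}")  # 0.6182
```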

8. Reproduction Results

The challenge requires an mIoU of 60.3% on the Cityscapes validation set; this reproduction reaches an mIoU of 61.82%.
Environment:
paddlepaddle==2.2.0
Tesla V100 x 4

| Model | Backbone | Resolution | Training Iters | mIoU | mIoU (flip) | mIoU (ms+flip) | Links |
|---|---|---|---|---|---|---|---|
| ESPNetV1 | - | 1024x512 | 120000 | 61.82% | 62.20% | 62.89% | model / log / vdl |

9. Lessons Learned

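1. When reproducing a model, consult the Paddle-to-Torch API mapping table often and compare the differences between corresponding APIs carefully. PaddleSeg also provides some more complete ready-made APIs, for example those in layers.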

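2. If the reproduced accuracy differs noticeably from the paper, carefully compare your parameters against those in the original paper.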

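10. Acknowledgments

Many thanks to the AI Studio platform for the compute and prize support, and to the Paddle team for their hard work.

Many thanks to dudu for bringing me into this competition and saving me a lot of detours.

Finally, I hope the Paper Reproduction Challenge keeps getting better.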

About the Author

Name: Ning Wenbin (宁文彬)
University: Northeastern University (东北大学)
Year: second-year master's student
GitHub: [https://github.com/simuler](https://github.com/simuler)
