


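1. Introduction to YOLOX

YOLOX is an improved detector that Megvii built on top of YOLOv3. Its main improvements are the Decoupled Head, the Anchor Free design, SimOTA label assignment, and data augmentation (Data Aug). In addition, to allow a comparison with YOLOv5, the backbone adopts YOLOv5's Focus, CSPNet, PAN Head, and SiLU activation.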

1.1 Decoupled Head

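Decoupled heads are standard in one-stage detectors in the academic literature. Earlier versions of YOLO, however, used a coupled prediction head in which classification and regression are produced by a single 1x1 convolution.

In their experiments the authors found that the End2End version of YOLOX consistently scored 4-5 points lower than standard YOLOX. Replacing the original YOLO Head with a Decoupled Head, which they tried by chance, narrowed the gap significantly, so they concluded that the YOLO Head may lack expressive power.

In YOLOX, the head implements classification and regression in separate branches and merges them only at prediction time. After weighing speed against accuracy, the final design first uses one 1x1 convolution for channel reduction and then two 3x3 convolutions in each of the classification and regression branches.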
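The cost of the decoupled design can be estimated with a rough weight count. The following is a minimal sketch, assuming a 256-channel input feature, 80 classes, and weights only (biases and batch norm ignored); the helper names are hypothetical and not part of YOLOX.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def coupled_head_weights(c_in, num_classes):
    # One 1x1 conv predicts box (4) + objectness (1) + classes together.
    return conv_weights(c_in, 4 + 1 + num_classes, 1)

def decoupled_head_weights(c_in, num_classes, hidden=256):
    stem = conv_weights(c_in, hidden, 1)            # 1x1 channel reduction
    branch = 2 * conv_weights(hidden, hidden, 3)    # two 3x3 convs per branch
    cls_pred = conv_weights(hidden, num_classes, 1) # class scores
    reg_pred = conv_weights(hidden, 4, 1)           # box coordinates
    obj_pred = conv_weights(hidden, 1, 1)           # objectness
    return stem + 2 * branch + cls_pred + reg_pred + obj_pred

print(coupled_head_weights(256, 80))
print(decoupled_head_weights(256, 80))
```

Under these assumptions the decoupled head is roughly two orders of magnitude larger per level, which is why the 1x1 reduction and the limit of two 3x3 convolutions per branch matter for speed.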

1.2 Anchor Free

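Anchor Free has the following advantages:

  1. Lower time cost

    To reach their best performance, Anchor Based detectors require a clustering analysis of the anchor boxes, which adds time cost.
  2. Simpler detection head and fewer predictions

    Anchor Based detectors increase the complexity of the detection head and the number of predictions. Moving a large number of predictions from the GPU to the CPU is unacceptable on edge devices.
  3. Simpler, more readable code

    The decoding logic of an Anchor Free detector is simpler and easier to read.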
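Points 2 and 3 can be made concrete with a small sketch. It assumes a 640x640 input, strides of 8, 16, and 32 (matching the 80/40/20 feature maps used later in this document), and 3 anchors per cell for the Anchor Based case. The decode rule (grid offset plus prediction, scaled by stride, with an exponential for width and height) is the commonly used YOLOX-style form and is stated here as an assumption, not taken from this text.

```python
import math

def num_predictions(input_size=640, strides=(8, 16, 32), anchors_per_cell=1):
    """Count the predictions produced over all feature-map levels."""
    total = 0
    for s in strides:
        cells = (input_size // s) ** 2
        total += cells * anchors_per_cell
    return total

def decode_anchor_free(pred, grid_x, grid_y, stride):
    """Map raw (dx, dy, dw, dh) at one grid cell to a box in input pixels."""
    dx, dy, dw, dh = pred
    cx = (dx + grid_x) * stride
    cy = (dy + grid_y) * stride
    w = math.exp(dw) * stride
    h = math.exp(dh) * stride
    return cx, cy, w, h

print(num_predictions(anchors_per_cell=3))  # Anchor Based, 3 anchors per cell
print(num_predictions(anchors_per_cell=1))  # Anchor Free, 1 prediction per cell
print(decode_anchor_free((0.5, 0.5, 0.0, 0.0), grid_x=3, grid_y=2, stride=8))
```

With one prediction per cell there is no anchor table to look up during decoding, and the number of results to transfer is one third of the 3-anchor case.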

Anchor Free can now be applied to YOLO with accuracy improving rather than dropping, which is closely tied to the label assignment strategy.

1.3 Label Assignment: SimOTA



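  1. Loss/Quality/Prediction Aware

    The matching between anchor boxes or anchor points and the Ground Truth is computed from the network's own predictions. This takes into account that models of different structure or complexity may behave differently, which makes it a dynamic form of label assignment.

    In contrast, assignment based on an IoU threshold, In Grid (YOLOv1), or In Box or Center (FCOS) relies on hand-crafted geometric priors and is a suboptimal choice.
  2. Center prior

  3. Dynamic k


    The key to Dynamic k is determining k, which can be estimated in a prediction-aware way. Specifically, the authors first take the 10 predictions closest to each target, then sum the IoUs between these 10 predictions and the Ground Truth to obtain the final k.

    The value 10 is also not very sensitive: adjusting it between 5 and 15 has almost no effect.
  4. Global information

    Some anchor boxes/points lie on the boundary between positive samples, or between positive and negative samples. Whether such an anchor box/point is positive or negative, and which positive sample it belongs to, should be decided using global information.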
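The Dynamic k estimate in item 3 can be sketched in a few lines. This is a minimal illustration, assuming "closest" means highest IoU, that the summed IoU is truncated to an integer, and that k is at least 1; the function name `dynamic_k` is hypothetical.

```python
def dynamic_k(ious, top_n=10):
    """Estimate k for one ground-truth box from its candidate IoUs."""
    # Keep the top_n candidates with the highest IoU.
    top = sorted(ious, reverse=True)[:top_n]
    # Sum their IoUs; truncate to an integer and keep at least one positive.
    return max(int(sum(top)), 1)

# A well-localized target gets several positives, a poorly covered one gets one.
print(dynamic_k([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 0.0]))
print(dynamic_k([0.2, 0.1]))
```

Because k is driven by how well the current predictions already overlap the target, easy targets receive more positive samples than hard ones without any hand-set threshold.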

1.4 Data Augmentation


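The Mosaic method was proposed in YOLOv4. Its main idea is to randomly crop four images and stitch them into a single image that is used as training data. This enriches the image backgrounds, and because four images are stitched together it effectively increases the batch_size: batch normalization computes statistics over four images at once, so training depends less on the batch_size itself.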

For details, see the paper: YOLOv4: Optimal Speed and Accuracy of Object Detection
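The stitching idea can be sketched without any image library. The following is a simplified illustration, assuming single-channel images stored as lists of rows, each at least as large as the output canvas; it omits the rescaling and bounding-box adjustment that a real Mosaic pipeline needs. The function name `mosaic` and its parameters are hypothetical.

```python
import random

def mosaic(images, out_size, rng=random):
    """Stitch random crops of four 2D images into one out_size x out_size canvas."""
    assert len(images) == 4
    # Random split point; each quadrant receives a crop from one image.
    cx = rng.randint(out_size // 4, 3 * out_size // 4)
    cy = rng.randint(out_size // 4, 3 * out_size // 4)
    canvas = [[0] * out_size for _ in range(out_size)]
    # (row_start, row_end, col_start, col_end) of the four quadrants
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (r0, r1, c0, c1) in zip(images, regions):
        h, w = r1 - r0, c1 - c0
        # Random h x w crop from the source image
        top = rng.randint(0, len(img) - h)
        left = rng.randint(0, len(img[0]) - w)
        for r in range(h):
            canvas[r0 + r][c0:c1] = img[top + r][left:left + w]
    return canvas

# Four constant images (values 1..4) make the quadrants easy to see.
imgs = [[[v] * 8 for _ in range(8)] for v in (1, 2, 3, 4)]
result = mosaic(imgs, 8, random.Random(0))
for row in result:
    print(row)
```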


For details, see the paper: mixup: Beyond Empirical Risk Minimization

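2. Network Architecture Analysis

Following the network diagram drawn by the Bilibili uploader Bubbliiiing, the network as a whole can be divided into three parts: the CSPDarknet backbone, the feature-enhancing PAN Head, and the YOLO Head detection head.

  • The main structures in the backbone include ConvBlock (Conv, Batch Norm, SiLU), FOCUS, CSPLayer, and SPPBottleneck.
  • The main structures in the feature-enhancement part include CSPLayer, UpSampling, and DownSampling.
  • The YOLO Head part mainly consists of ConvBlock structures.
# Import libraries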
import paddle
from paddle import nn

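2.1 Backbone: CSPDarknet

2.1.1 ConvBlock

The basic convolution block consists of a convolution, batch normalization, and an activation function. It uses Same Padding and comes in two types: regular convolution (BaseConv) and depthwise separable convolution (DWConv).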
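Same Padding can be checked with the standard convolution output-size formula. This is a minimal sketch, assuming dilation 1 and odd kernel sizes; the padding rule `(kernel_size - 1) // 2` is the one used by BaseConv in this section, while the helper names are hypothetical.

```python
def same_padding(kernel_size):
    """Padding that preserves the spatial size at stride 1 (odd kernels)."""
    return (kernel_size - 1) // 2

def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

for k in (1, 3, 5):
    print(k, conv_out_size(640, k, 1, same_padding(k)))   # size is preserved
print(conv_out_size(640, 3, 2, same_padding(3)))          # stride 2 halves it
```

With this padding a stride-1 block never changes the height or width, and a stride-2 block halves them, which is what the backbone relies on for downsampling.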


[Figure: ConvBlock and Bottleneck structure. Recoverable labels: Batch Norm; BaseConv 1x1, BaseConv 3x3, Identity]
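## Build the basic convolution block
class BaseConv(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_size, stride, groups=1, act='silu'):
        super().__init__()
        # Same Padding for odd kernel sizes
        padding = (kernel_size-1)//2
        self.conv = nn.Conv2D(in_channels, out_channels, kernel_size, stride, padding, groups=groups)
        # NOTE: in Paddle, momentum weights the running statistics (default 0.9),
        # the opposite of the PyTorch convention; the value 0.03 is kept from the source.
        self.bn = nn.BatchNorm2D(out_channels,momentum=0.03, epsilon=0.001)
        if act == 'silu':
            self.act = nn.Silu()
        elif act == 'relu':
            self.act = nn.ReLU()
        elif act == 'lrelu':
            self.act = nn.LeakyReLU(0.1)
        else:
            raise ValueError(f"Unsupported activation: {act}")
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))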
## Build the depthwise separable convolution
class DWConv(nn.Layer):
    # Some Problem
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, act='silu'):
        super().__init__()
        # Depthwise conv: one filter per input channel (groups=in_channels)
        self.dconv = BaseConv(in_channels, in_channels, kernel_size, stride, groups=in_channels, act=act)
        # Pointwise 1x1 conv: mixes information across channels
        self.pconv = BaseConv(in_channels, out_channels, 1, 1, groups=1, act=act)

    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)
## Build the residual block
class Bottleneck(nn.Layer):
    def __init__(self, in_channels, out_channels, shortcut=True, expansion=0.5, depthwise=False, act="silu"):
        super().__init__()
        hidden_channels = int(out_channels * expansion)
        Conv = DWConv if depthwise else BaseConv
        # 1x1 conv reduces the channel count (by 50% by default)
        self.conv1 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # 3x3 conv expands the channels again (feature extraction)
        self.conv2 = Conv(hidden_channels, out_channels, 3, stride=1, act=act)
        # The shortcut is only usable when input and output shapes match
        self.use_add = shortcut and in_channels == out_channels

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        if self.use_add:
            y = y + x
        return y
## Test the convolution modules
x = paddle.ones([1, 3, 640, 640])
conv1 = BaseConv(3, 64, 3, 1)
conv2 = DWConv(3, 64, 3, 1)
block1 = Bottleneck(3, 64)
print(conv1(x).shape)
print(conv2(x).shape)
print(block1(x).shape)
[1, 64, 640, 640]
[1, 64, 640, 640]
[1, 64, 640, 640]
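The motivation for DWConv is its lower weight count compared with BaseConv. The sketch below compares the two, assuming weights only (no biases or batch norm) and a 3x3 kernel with 256 input and output channels; the helper names are hypothetical.

```python
def base_conv_weights(c_in, c_out, k):
    """Regular k x k convolution: every output channel sees every input channel."""
    return c_in * c_out * k * k

def dw_conv_weights(c_in, c_out, k):
    """Depthwise separable convolution: depthwise k x k plus pointwise 1x1."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise

regular = base_conv_weights(256, 256, 3)
separable = dw_conv_weights(256, 256, 3)
print(regular, separable, round(regular / separable, 2))
```

For these sizes the separable version needs well under one eighth of the weights, which is consistent with reserving it for the smallest model variant.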

2.1.2 Focus




## Focus layer
class Focus(nn.Layer):
    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # Take four 2x-downsampled views of the input (every other pixel)
        patch_1 = x[...,  ::2,  ::2]
        patch_2 = x[..., 1::2,  ::2]
        patch_3 = x[...,  ::2, 1::2]
        patch_4 = x[..., 1::2, 1::2]
        # Concatenate the four views along the channel axis
        x = paddle.concat((patch_1, patch_2, patch_3, patch_4), axis=1)
        # Apply the convolution to the concatenated result
        out = self.conv(x)
        return out
## Test the Focus module
x = paddle.ones([1, 3, 640, 640])
layer = Focus(3, 64)
print(layer(x).shape)
[1, 64, 320, 320]
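The four strided slices in Focus move spatial detail into the channel axis without discarding any pixel. The following sketch reproduces the same slicing on a single 4x4 plane using plain Python lists; the function name `focus_slices` is hypothetical, and the slice order follows patch_1 to patch_4 in the Focus code above.

```python
def focus_slices(plane):
    """Split one H x W plane (list of rows) into four H/2 x W/2 planes."""
    patch_1 = [row[0::2] for row in plane[0::2]]  # even rows, even columns
    patch_2 = [row[0::2] for row in plane[1::2]]  # odd rows, even columns
    patch_3 = [row[1::2] for row in plane[0::2]]  # even rows, odd columns
    patch_4 = [row[1::2] for row in plane[1::2]]  # odd rows, odd columns
    return [patch_1, patch_2, patch_3, patch_4]

plane = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
for patch in focus_slices(plane):
    print(patch)
```

Every input value appears in exactly one of the four patches, so a [3, 640, 640] image becomes [12, 320, 320] before the convolution with no information lost.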

2.1.3 CSPLayer




[Figure: CSPLayer structure. Two parallel BaseConv 1x1 branches, Bottleneck (x N) on one branch, then a final BaseConv 1x1]
## CSPLayer
class CSPLayer(nn.Layer):
    def __init__(self, in_channels, out_channels, n=1, shortcut=True, expansion=0.5, depthwise=False, act="silu",):
        super().__init__()
        hidden_channels = int(out_channels * expansion)  
        # Basic conv block of the main branch
        self.conv1  = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # Basic conv block of the residual (shortcut) branch
        self.conv2  = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # Basic conv block applied after concatenating both branches
        self.conv3  = BaseConv(2 * hidden_channels, out_channels, 1, stride=1, act=act)

        # Stack n Bottleneck residual blocks
        res_block = [Bottleneck(hidden_channels, hidden_channels, shortcut, 1.0, depthwise, act=act) for _ in range(n)]
        self.res_block = nn.Sequential(*res_block)

    def forward(self, x):
        # Main branch
        x_main = self.conv1(x)
        x_main = self.res_block(x_main)
        # Residual branch
        x_res = self.conv2(x)

        # Concatenate the main and residual branches along the channel axis
        x = paddle.concat((x_main, x_res), axis=1)
        # Fuse the concatenated result with a convolution
        out = self.conv3(x)
        return out
## Test the CSPLayer module
x = paddle.ones([1, 3, 640, 640])
layer = CSPLayer(3, 64, 5)
print(layer(x).shape)
[1, 64, 640, 640]

2.1.4 SPPBottleneck




[Figure: SPPBottleneck structure. BaseConv 1x1, then parallel Identity, MaxPool 5x5, MaxPool 9x9, MaxPool 13x13, then BaseConv 1x1]
## SPPBottleneck
class SPPBottleneck(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_sizes=(5, 9, 13), activation="silu"):
        super().__init__()
        hidden_channels = in_channels // 2
        self.conv1      = BaseConv(in_channels, hidden_channels, 1, stride=1, act=activation)
        # Parallel max-pool branches; stride 1 with padding ks // 2 keeps the spatial size.
        # nn.LayerList is used because the pools run in parallel rather than in sequence.
        self.pool_block = nn.LayerList([nn.MaxPool2D(kernel_size=ks, stride=1, padding=ks // 2) for ks in kernel_sizes])
        conv2_channels  = hidden_channels * (len(kernel_sizes) + 1)
        self.conv2      = BaseConv(conv2_channels, out_channels, 1, stride=1, act=activation)

    def forward(self, x):
        x = self.conv1(x)
        # Concatenate the identity branch with each pooled branch
        x = paddle.concat([x] + [pool(x) for pool in self.pool_block], axis=1)
        x = self.conv2(x)
        return x
## Test the SPPBottleneck module
x = paddle.ones([1, 3, 640, 640])
layer = SPPBottleneck(3, 64)
print(layer(x).shape)
[1, 64, 640, 640]
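Two properties of SPPBottleneck can be verified with simple arithmetic: every pooling branch preserves the spatial size, and the concatenation multiplies the hidden channels by the number of branches. This is a minimal sketch using the same `padding = ks // 2` and `in_channels // 2` rules as the code above; the helper names are hypothetical.

```python
def pool_out_size(size, kernel_size, stride=1):
    """Output size of a max pool with padding = kernel_size // 2."""
    padding = kernel_size // 2
    return (size + 2 * padding - kernel_size) // stride + 1

def spp_concat_channels(in_channels, kernel_sizes=(5, 9, 13)):
    """Channels after concatenating the identity branch and all pooled branches."""
    hidden = in_channels // 2
    return hidden * (len(kernel_sizes) + 1)

print([pool_out_size(20, ks) for ks in (5, 9, 13)])
print(spp_concat_channels(1024))
```

Since all branches share the same height and width, they can be concatenated directly, and the final 1x1 convolution maps the widened tensor back to the requested output channels.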

2.1.5 CSPDarknet

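CSPDarknet is the backbone of YOLOX and performs feature extraction. It outputs three feature maps (for an input of [3, 640, 640], their shapes are [256, 80, 80], [512, 40, 40], and [1024, 20, 20]). Its main structure is shown in the figure below. The blocks it relies on (Focus, BaseConv, CSPLayer, SPPBottleneck) were all implemented above, and the following code assembles them: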


Stage       Layer           Output shape            Output feature
-           Input           (-1, 3, 640, 640)       -
Stem        Focus           (-1, 12, 320, 320)      -
            Conv            (-1, 64, 320, 320)      -
ResBlock1   Conv            (-1, 128, 160, 160)     -
            CSPLayer        (-1, 128, 160, 160)     -
ResBlock2   Conv            (-1, 256, 80, 80)       -
            CSPLayer        (-1, 256, 80, 80)       dark3
ResBlock3   Conv            (-1, 512, 40, 40)       -
            CSPLayer        (-1, 512, 40, 40)       dark4
ResBlock4   Conv            (-1, 1024, 20, 20)      -
            SPPBottleneck   (-1, 1024, 20, 20)      -
            CSPLayer        (-1, 1024, 20, 20)      dark5
## CSPDarknet
class CSPDarknet(nn.Layer):
    def __init__(self, dep_mul, wid_mul, out_features=("dark3", "dark4", "dark5"), depthwise=False, act="silu",):
        super().__init__()
        assert out_features, "please provide output features of Darknet"
        self.out_features = out_features
        Conv = DWConv if depthwise else BaseConv

        # Image Size : [3, 640, 640]
        base_channels   = int(wid_mul * 64)  # 64 when wid_mul = 1
        base_depth      = max(round(dep_mul * 3), 1)  # 3 when dep_mul = 1
        # Stem: Focus layer for the initial feature extraction
        # [-1, 3, 640, 640] -> [-1, 64, 320, 320]
        self.stem = Focus(3, base_channels, ksize=3, act=act)

        # Resblock1[dark2]
        # [-1, 64, 320, 320] -> [-1, 128, 160, 160]
        self.dark2 = nn.Sequential(
            Conv(base_channels, base_channels * 2, 3, 2, act=act),
            CSPLayer(base_channels * 2, base_channels * 2, n=base_depth, depthwise=depthwise, act=act),
        )

        # Resblock2[dark3]
        # [-1, 128, 160, 160] -> [-1, 256, 80, 80]
        self.dark3 = nn.Sequential(
            Conv(base_channels * 2, base_channels * 4, 3, 2, act=act),
            CSPLayer(base_channels * 4, base_channels * 4, n=base_depth * 3, depthwise=depthwise, act=act),
        )

        # Resblock3[dark4]
        # [-1, 256, 80, 80] -> [-1, 512, 40, 40]
        self.dark4 = nn.Sequential(
            Conv(base_channels * 4, base_channels * 8, 3, 2, act=act),
            CSPLayer(base_channels * 8, base_channels * 8, n=base_depth * 3, depthwise=depthwise, act=act),
        )

        # Resblock4[dark5]
        # [-1, 512, 40, 40] -> [-1, 1024, 20, 20]
        self.dark5 = nn.Sequential(
            Conv(base_channels * 8, base_channels * 16, 3, 2, act=act),
            SPPBottleneck(base_channels * 16, base_channels * 16, activation=act),
            CSPLayer(base_channels * 16, base_channels * 16, n=base_depth, shortcut=False, depthwise=depthwise, act=act),
        )

    def forward(self, x):
        outputs = {}
        x = self.stem(x)
        outputs["stem"] = x
        x = self.dark2(x)
        outputs["dark2"] = x
        # dark3 output feature map: [256, 80, 80]
        x = self.dark3(x)
        outputs["dark3"] = x
        # dark4 output feature map: [512, 40, 40]
        x = self.dark4(x)
        outputs["dark4"] = x
        # dark5 output feature map: [1024, 20, 20]
        x = self.dark5(x)
        outputs["dark5"] = x
        # Return only the requested feature maps
        return {k: v for k, v in outputs.items() if k in self.out_features}
## Test the CSPDarknet module
x = paddle.ones([1, 3, 640, 640])
net1 = CSPDarknet(1, 1)
# Run the forward pass once and reuse the result
outs = net1(x)
print(outs['dark3'].shape, outs['dark4'].shape, outs['dark5'].shape)
[1, 256, 80, 80] [1, 512, 40, 40] [1, 1024, 20, 20]
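The shapes printed above follow a simple rule: Focus halves the height and width once, and each of the four stages halves them again while doubling the channels. The sketch below reproduces this bookkeeping without Paddle, assuming a square input whose side is divisible by 32; `cspdarknet_shapes` is a hypothetical helper, not part of the model code.

```python
def cspdarknet_shapes(input_size=640, wid_mul=1.0):
    """Return (channels, height, width) for the stem and each stage."""
    channels = int(wid_mul * 64)      # base_channels, as in CSPDarknet
    size = input_size // 2            # Focus halves H and W
    shapes = {"stem": (channels, size, size)}
    for name in ("dark2", "dark3", "dark4", "dark5"):
        channels *= 2                 # each stage doubles the channels
        size //= 2                    # stride-2 3x3 conv halves H and W
        shapes[name] = (channels, size, size)
    return shapes

for name, shape in cspdarknet_shapes().items():
    print(name, shape)
print(cspdarknet_shapes(wid_mul=0.5)["dark5"])
```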

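2.2 Feature Pyramid: YOLOPAFPN


  • The deepest feature [1024, 20, 20] passes through one 1x1 convolution that adjusts the channels, giving the P5 feature [512, 20, 20]. P5 is upsampled and combined with the mid-level feature [512, 40, 40], and a CSPLayer then extracts features to give P5_upsample [512, 40, 40].
  • P5_upsample [512, 40, 40] passes through one 1x1 convolution that adjusts the channels, giving the P4 feature [256, 40, 40]. P4 is upsampled and combined with the shallow feature [256, 80, 80], and a CSPLayer then extracts features to give P3_out [256, 80, 80].
  • P3_out [256, 80, 80] is downsampled by one 3x3 convolution and stacked with P4, and a CSPLayer then extracts features to give P4_out [512, 40, 40].
  • P4_out [512, 40, 40] is downsampled by one 3x3 convolution and stacked with P5, and a CSPLayer then extracts features to give P5_out [1024, 20, 20].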
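The four steps above can be traced as pure shape bookkeeping. This is a sketch under the assumption of width 1.0 and the backbone shapes listed in this section; `pafpn_shapes` is a hypothetical helper that only tracks (channels, height, width) and performs no convolution.

```python
def pafpn_shapes(c3=256, c4=512, c5=1024, s3=80):
    """Trace tensor shapes through the top-down and bottom-up paths."""
    s4, s5 = s3 // 2, s3 // 4
    trace = {}
    # Top-down path
    trace["P5"] = (c4, s5, s5)                # 1x1 conv: c5 -> c4
    trace["P5_concat"] = (c4 + c4, s4, s4)    # upsample x2, concat with dark4
    trace["P5_upsample"] = (c4, s4, s4)       # CSPLayer
    trace["P4"] = (c3, s4, s4)                # 1x1 conv: c4 -> c3
    trace["P4_concat"] = (c3 + c3, s3, s3)    # upsample x2, concat with dark3
    trace["P3_out"] = (c3, s3, s3)            # CSPLayer
    # Bottom-up path
    trace["P3_concat"] = (c3 + c3, s4, s4)    # 3x3 stride-2 conv, concat with P4
    trace["P4_out"] = (c4, s4, s4)            # CSPLayer
    trace["P4_down_concat"] = (c4 + c4, s5, s5)  # 3x3 stride-2 conv, concat with P5
    trace["P5_out"] = (c5, s5, s5)            # CSPLayer
    return trace

for name, shape in pafpn_shapes().items():
    print(name, shape)
```

Each concatenation doubles the channels and the following CSPLayer brings them back, so the three outputs keep the same shapes as the three backbone inputs.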



class YOLOPAFPN(nn.Layer):
    def __init__(self, depth = 1.0, width = 1.0, in_features = ("dark3", "dark4", "dark5"), in_channels = [256, 512, 1024], depthwise = False, act = "silu"):
        super().__init__()
        Conv                = DWConv if depthwise else BaseConv
        self.backbone       = CSPDarknet(depth, width, depthwise = depthwise, act = act)
        self.in_features    = in_features

        self.upsample       = nn.Upsample(scale_factor=2, mode='nearest')

        # [-1, 1024, 20, 20] -> [-1, 512, 20, 20]
        self.lateral_conv0  = BaseConv(int(in_channels[2] * width), int(in_channels[1] * width), 1, 1, act=act)
        # [-1, 1024, 40, 40] -> [-1, 512, 40, 40]
        self.C3_p4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            depthwise = depthwise,
            act = act
        )

        # [-1, 512, 40, 40] -> [-1, 256, 40, 40]
        self.reduce_conv1   = BaseConv(int(in_channels[1] * width), int(in_channels[0] * width), 1, 1, act=act)
        # [-1, 512, 80, 80] -> [-1, 256, 80, 80]
        self.C3_p3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[0] * width),
            round(3 * depth),
            depthwise = depthwise,
            act = act
        )

        # Bottom-Up Conv
        # [-1, 256, 80, 80] -> [-1, 256, 40, 40]
        self.bu_conv2       = Conv(int(in_channels[0] * width), int(in_channels[0] * width), 3, 2, act=act)
        # [-1, 512, 40, 40] -> [-1, 512, 40, 40]
        self.C3_n3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            depthwise = depthwise,
            act = act
        )

        # [-1, 512, 40, 40] -> [-1, 512, 20, 20]
        self.bu_conv1       = Conv(int(in_channels[1] * width), int(in_channels[1] * width), 3, 2, act=act)
        # [-1, 1024, 20, 20] -> [-1, 1024, 20, 20]
        self.C3_n4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[2] * width),
            round(3 * depth),
            depthwise = depthwise,
            act = act
        )

    def forward(self, input):
        out_features            = self.backbone(input)
        [feat1, feat2, feat3]   = [out_features[f] for f in self.in_features]

        # [-1, 1024, 20, 20] -> [-1, 512, 20, 20]
        P5          = self.lateral_conv0(feat3)
        # [-1, 512, 20, 20] -> [-1, 512, 40, 40] 
        P5_upsample = self.upsample(P5)
        # [-1, 512, 40, 40] + [-1, 512, 40, 40] -> [-1, 1024, 40, 40]
        P5_upsample = paddle.concat([P5_upsample, feat2], axis=1)
        # [-1, 1024, 40, 40] -> [-1, 512, 40, 40]
        P5_upsample = self.C3_p4(P5_upsample)

        # [-1, 512, 40, 40] -> [-1, 256, 40, 40]
        P4          = self.reduce_conv1(P5_upsample) 
        # [-1, 256, 40, 40] -> [-1, 256, 80, 80]
        P4_upsample = self.upsample(P4) 
        # [-1, 256, 80, 80] + [-1, 256, 80, 80] -> [-1, 512, 80, 80]
        P4_upsample = paddle.concat([P4_upsample, feat1], axis=1) 
        # [-1, 512, 80, 80] -> [-1, 256, 80, 80]
        P3_out      = self.C3_p3(P4_upsample) 

        # [-1, 256, 80, 80] -> [-1, 256, 40, 40]
        P3_downsample   = self.bu_conv2(P3_out) 
        # [-1, 256, 40, 40] + [-1, 256, 40, 40] -> [-1, 512, 40, 40]
        P3_downsample   = paddle.concat([P3_downsample, P4], axis=1) 
        # [-1, 512, 40, 40] -> [-1, 512, 40, 40]
        P4_out          = self.C3_n3(P3_downsample) 

        # [-1, 512, 40, 40] -> [-1, 512, 20, 20]
        P4_downsample   = self.bu_conv1(P4_out)
        # [-1, 512, 20, 20] + [-1, 512, 20, 20] -> [-1, 1024, 20, 20]
        P4_downsample   = paddle.concat([P4_downsample, P5], axis=1)
        # [-1, 1024, 20, 20] -> [-1, 1024, 20, 20]
        P5_out          = self.C3_n4(P4_downsample)
        return (P3_out, P4_out, P5_out)
## Test the YOLOPAFPN module
x = paddle.ones([1, 3, 640, 640])
net2 = YOLOPAFPN()
# Run the forward pass once and reuse the result
outs = net2(x)
print(outs[0].shape, outs[1].shape, outs[2].shape)
[1, 256, 80, 80] [1, 512, 40, 40] [1, 1024, 20, 20]

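2.3 Detection Head: YOLOX Head

YOLOX Head is the detection head of the YOLOX network and acts as both classifier and regressor. Unlike the traditional YOLO detection head, the YOLOX Head is decoupled: classification and regression are handled in two separate branches and merged only at prediction time, which strengthens the network's recognition ability.

[Figure: YOLOX Head structure]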


## YOLOX Head
class YOLOXHead(nn.Layer):
    def __init__(self, num_classes, width = 1.0, in_channels = [256, 512, 1024], act = "silu", depthwise = False,):
        super().__init__()
        Conv            = DWConv if depthwise else BaseConv
        # nn.LayerList (rather than a plain Python list) registers the sublayers and their parameters
        self.cls_convs  = nn.LayerList()
        self.reg_convs  = nn.LayerList()
        self.cls_preds  = nn.LayerList()
        self.reg_preds  = nn.LayerList()
        self.obj_preds  = nn.LayerList()
        self.stems      = nn.LayerList()

        for i in range(len(in_channels)):
            # Stem: one 1x1 conv
            self.stems.append(BaseConv(in_channels = int(in_channels[i] * width), out_channels = int(256 * width), kernel_size = 1, stride = 1, act = act))
            # Classification feature extraction: two 3x3 convs
            self.cls_convs.append(nn.Sequential(
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size= 3, stride = 1, act = act), 
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size= 3, stride = 1, act = act), 
            ))
            # Classification prediction: one 1x1 conv
            self.cls_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = num_classes, kernel_size = 1, stride = 1, padding = 0)
            )
            # Regression feature extraction: two 3x3 convs
            self.reg_convs.append(nn.Sequential(
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act), 
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act)
            ))
            # Regression prediction (box position): one 1x1 conv
            self.reg_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = 4, kernel_size = 1, stride = 1, padding = 0)
            )
            # Regression prediction (objectness): one 1x1 conv
            self.obj_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = 1, kernel_size = 1, stride = 1, padding = 0)
            )

    def forward(self, inputs):
        # Inputs: [P3_out, P4_out, P5_out]
        # P3_out: [-1, 256, 80, 80]
        # P4_out: [-1, 512, 40, 40]
        # P5_out: [-1, 1024, 20, 20]
        outputs = []
        for k, x in enumerate(inputs):
            # 1x1 conv to unify the channels
            x           = self.stems[k](x)

            # Two 3x3 convs for feature extraction
            cls_feat    = self.cls_convs[k](x)
            # One 1x1 conv predicts the classes
            # Outputs: [-1, num_classes, 80, 80], [-1, num_classes, 40, 40], [-1, num_classes, 20, 20]
            cls_output  = self.cls_preds[k](cls_feat)

            # Two 3x3 convs for feature extraction
            reg_feat    = self.reg_convs[k](x)
            # One 1x1 conv predicts the box position
            # Outputs: [-1, 4, 80, 80], [-1, 4, 40, 40], [-1, 4, 20, 20]
            reg_output  = self.reg_preds[k](reg_feat)
            # One 1x1 conv predicts whether an object is present
            # Outputs: [-1, 1, 80, 80], [-1, 1, 40, 40], [-1, 1, 20, 20]
            obj_output  = self.obj_preds[k](reg_feat)

            # Merge the results
            # Outputs: [-1, num_classes+5, 80, 80], [-1, num_classes+5, 40, 40], [-1, num_classes+5, 20, 20]
            output      = paddle.concat([reg_output, obj_output, cls_output], 1)
            outputs.append(output)
        return outputs
## Test the YOLOX Head module
features = paddle.ones([1, 256, 80, 80]), paddle.ones([1, 512, 40, 40]), paddle.ones([1, 1024, 20, 20])
net3 = YOLOXHead(10)
# Run the forward pass once and reuse the result
outs = net3(features)
print(outs[0].shape, outs[1].shape, outs[2].shape)
[1, 15, 80, 80] [1, 15, 40, 40] [1, 15, 20, 20]
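Each location in the head output is a vector of length num_classes + 5, ordered as box (4), objectness (1), classes (num_classes), following the concat order in `forward`. The sketch below splits one such vector and computes a confidence score. The scoring rule (sigmoid of objectness times sigmoid of the best class logit) is a common post-processing choice and is an assumption here, not something defined in this document; all function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def split_prediction(vec, num_classes):
    """Split one location's output into box, objectness, and class logits."""
    assert len(vec) == 5 + num_classes
    box, obj, cls = vec[:4], vec[4], vec[5:]
    return box, obj, cls

def best_class_score(vec, num_classes):
    """Return (class index, confidence) for one location."""
    _, obj, cls = split_prediction(vec, num_classes)
    cls_id = max(range(num_classes), key=lambda i: cls[i])
    return cls_id, sigmoid(obj) * sigmoid(cls[cls_id])

# One location with 3 classes: box, objectness logit, then class logits.
vec = [0.1, 0.2, 0.3, 0.4, 0.0, -1.0, 2.0, 0.5]
print(split_prediction(vec, 3))
print(best_class_score(vec, 3))
```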

2.4 Putting It Together: YOLO Body

class YoloBody(nn.Layer):
    def __init__(self, num_classes, kind):
        super().__init__()
        # Depth and width multipliers of each model variant
        depth_dict = {'nano': 0.33, 'tiny': 0.33, 's' : 0.33, 'm' : 0.67, 'l' : 1.00, 'x' : 1.33,}
        width_dict = {'nano': 0.25, 'tiny': 0.375, 's' : 0.50, 'm' : 0.75, 'l' : 1.00, 'x' : 1.25,}
        depth, width    = depth_dict[kind], width_dict[kind]
        # Only the nano variant uses depthwise separable convolutions
        depthwise       = kind == 'nano'

        self.backbone   = YOLOPAFPN(depth, width, depthwise=depthwise)
        self.head       = YOLOXHead(num_classes, width, depthwise=depthwise)

    def forward(self, x):
        # Call the layers directly rather than .forward() so Paddle's layer hooks run
        fpn_outs    = self.backbone(x)
        outputs     = self.head(fpn_outs)
        return outputs
## Test the YOLO Body module
x = paddle.ones([1, 3, 640, 640])
net4 = YoloBody(10, 'x')
# Run the forward pass once and reuse the result
outs = net4(x)
print(outs[0].shape, outs[1].shape, outs[2].shape)
[1, 15, 80, 80] [1, 15, 40, 40] [1, 15, 20, 20]
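The effect of the depth and width multipliers on each variant can be tabulated without building the network. This sketch reuses the two dictionaries from YoloBody and the formulas `int(wid_mul * 64)`, `max(round(dep_mul * 3), 1)`, and `int(256 * width)` from the code above; `variant_summary` is a hypothetical helper.

```python
DEPTH = {'nano': 0.33, 'tiny': 0.33, 's': 0.33, 'm': 0.67, 'l': 1.00, 'x': 1.33}
WIDTH = {'nano': 0.25, 'tiny': 0.375, 's': 0.50, 'm': 0.75, 'l': 1.00, 'x': 1.25}

def variant_summary(kind):
    """Derive the main channel and depth settings of one model variant."""
    depth, width = DEPTH[kind], WIDTH[kind]
    return {
        "base_channels": int(width * 64),               # stem output channels
        "base_depth": max(round(depth * 3), 1),         # Bottlenecks in dark2/dark5
        "backbone_out": [int(c * width) for c in (256, 512, 1024)],
        "head_hidden": int(256 * width),                # channels inside the head
    }

for kind in DEPTH:
    print(kind, variant_summary(kind))
```

The output shapes of the head do not depend on the variant, which is why the 'x' test above prints the same shapes as the width 1.0 head test: only the internal channel counts and block counts change.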


