Building YOLOX with the Paddle Framework

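This project analyzes the network structure of YOLOX and builds it with the PaddlePaddle framework. Note: this project covers only the construction of the network; training and prediction will be discussed in follow-up projects.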

AI Studio · published 2022-02-12 16:03:38

YOLOX Structure Analysis and Network Construction with Paddle

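1. Introduction to YOLOX

YOLOX was developed by Megvii as an improvement on YOLOv3. The main changes are the Decoupled Head, Anchor Free detection, SimOTA, and data augmentation. In addition, to allow a comparison with YOLOv5, the network adopts YOLOv5's Focus, CSPNet, PAN Head, and SiLU activation.

1.1 Decoupled Head

A decoupled head is standard in one-stage detectors in the academic literature. Earlier versions of YOLO, however, used a coupled prediction head in which classification and regression were produced by a single 1x1 convolution.

The authors found experimentally that the End2End version of YOLOX was consistently 4-5 points below standard YOLOX. When they happened to replace the original YOLO Head with a Decoupled Head, the gap narrowed significantly, which led them to conclude that the YOLO Head may lack expressive power.

In YOLOX, the head handles classification and regression in separate branches and merges them only at prediction time. After weighing speed against accuracy, the final design uses one 1x1 convolution to reduce the channels first, followed by two 3x3 convolutions in each of the classification and regression branches.
Figure: YOLOX head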

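1.2 Anchor Free

Going anchor free has several benefits:

Lower time cost

To reach their best performance, anchor-based detectors need a clustering analysis of the anchor boxes, which adds time cost.

Simpler detection head and fewer outputs

Anchor-based detectors increase both the complexity of the detection head and the number of predictions. Moving a large number of detection results from the GPU to the CPU is unacceptable on edge devices.

Simpler, more readable code

The decoding logic of an anchor-free detector is simpler and easier to read.

Anchor-free techniques can now be applied to YOLO with a gain rather than a loss in accuracy, and this is closely tied to label assignment.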
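
To illustrate why anchor-free decoding is simple, here is a minimal sketch that decodes one raw prediction at a single grid cell. It assumes that the center offsets are added to the grid index and scaled by the stride, and that width and height are exponentiated and scaled by the stride. The function name decode_cell and its arguments are hypothetical and not part of this notebook.

```python
import math

def decode_cell(tx, ty, tw, th, grid_x, grid_y, stride):
    """Decode one anchor-free prediction into (cx, cy, w, h) in input pixels.

    Assumed convention: offsets are relative to the grid cell, sizes are log-scaled.
    """
    cx = (grid_x + tx) * stride
    cy = (grid_y + ty) * stride
    w = math.exp(tw) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h

# Cell (3, 2) on a stride-8 feature map with raw outputs (0.5, 0.5, 0.0, 0.0)
print(decode_cell(0.5, 0.5, 0.0, 0.0, grid_x=3, grid_y=2, stride=8))
```

No anchor sizes and no per-anchor loop are involved, which is the simplification described in the list above.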

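1.3 Label Assignment: SimOTA

A good label assignment algorithm naturally alleviates several problems: detection in crowded scenes, poor detection of objects with extreme aspect ratios, imbalance of positive samples for objects of extreme sizes, and poor detection of rotated objects.

The authors consider four factors important in label assignment:

Loss/Quality/Prediction Aware

The matching between anchor boxes or anchor points and the Ground Truth is computed from the network's own predictions. This takes into account that models of different structure or complexity may behave differently, so the assignment is dynamic.

In contrast, assignment based on an IoU threshold, In Grid (YOLOv1), or In Box or Center (FCOS) relies on hand-defined geometric priors and is a suboptimal solution.

Center prior

In most scenes the center of mass of an object is related to its geometric center. Restricting positive samples to a region around the object center effectively resolves unstable convergence.

Dynamic k

Objects of different sizes should receive different numbers of positive samples. Using the same number for all objects gives small objects many low-quality positives or leaves large objects with only a few positives.

The key to Dynamic k is determining k, which can be estimated in a prediction-aware way. Specifically, the authors first find the 10 predictions closest to each object and then sum the IoU of these 10 predictions with the Ground Truth to obtain the final k.

The number 10 is not very sensitive; adjusting it between 5 and 15 has almost no effect.

Global information

Some anchor boxes or points lie on the boundary between positive samples, or between positive and negative samples. Deciding whether such an anchor is positive or negative, and which positive sample it belongs to, should take global information into account.

In the end, to balance speed, the authors kept only the first three factors and removed the optimal-solution step, which turns OTA into SimOTA.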
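
The Dynamic k estimate described above can be sketched in a few lines. This is a minimal sketch for a single ground-truth box, assuming the summed IoU is truncated to an integer and clamped to at least 1; the function name dynamic_k and the sample IoU values are made up for illustration.

```python
def dynamic_k(ious, n_candidates=10):
    """Estimate the number of positive samples for one ground-truth box.

    ious: IoU between this ground-truth box and every candidate prediction.
    Assumption: the summed IoU is truncated to an integer and clamped to >= 1.
    """
    top = sorted(ious, reverse=True)[:n_candidates]
    return max(int(sum(top)), 1)

# A large, well-predicted object: many candidates overlap it strongly
print(dynamic_k([0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]))
# A small object: only a few weak overlaps
print(dynamic_k([0.3, 0.2, 0.05]))
```

The first object receives 5 positives and the second receives 1, so the number of positives follows how well the object is currently predicted.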

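1.4 Data Augmentation

For data augmentation, YOLOX continues to use Mosaic and Mixup. Mosaic stitches four images together, which enriches the backgrounds of the detected objects.

Mosaic was proposed in YOLOv4. The main idea is to randomly crop four images and stitch them into one training image. This enriches the image backgrounds, and because four images are combined it effectively increases the batch size: batch normalization computes statistics over four images at once, so training depends less on the actual batch_size.

For details, see the paper: YOLOv4: Optimal Speed and Accuracy of Object Detection

Mixup produces new augmented data with plain linear interpolation.

For details, see the paper: mixup: Beyond Empirical Risk Minimization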
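
The linear interpolation behind Mixup can be sketched as follows. This shows only the basic formulation for a pair of samples with one-hot labels; how a detector mixes box labels is not covered in this notebook. The function name mixup and the fixed lam value are illustrative assumptions.

```python
def mixup(x1, y1, x2, y2, lam):
    """Blend two samples and their labels with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two tiny "images" (flattened pixel lists) with one-hot labels
x, y = mixup([0.0, 1.0, 2.0], [1.0, 0.0], [4.0, 3.0, 2.0], [0.0, 1.0], lam=0.25)
print(x, y)
```

In practice lam is usually drawn at random for each pair rather than fixed.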

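2. Network Structure Analysis

Following the network structure diagram drawn by Bilibili uploader Bubbliiiing, the network can be divided into three parts: the CSPDarknet backbone, the PAN Head for feature enhancement, and the YOLO Head for detection.
Figure: YOLOX structure

The backbone mainly consists of ConvBlock (Conv, Batch Norm, SiLU), Focus, CSPLayer, and SPPBottleneck.
The feature enhancement part mainly consists of CSPLayer, UpSampling, and DownSampling.
The YOLO Head mainly consists of ConvBlock.
These parts are built one by one below.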

# Import libraries
import paddle
from paddle import nn

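2.1 Backbone: CSPDarknet

2.1.1 ConvBlock

The basic convolution block consists of a convolution, batch normalization, and an activation function. It uses Same Padding and comes in two types: standard convolution (BaseConv) and depthwise separable convolution (DWConv).

BaseConv structure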

Input
Conv2D
Batch Norm
Act
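
The effect of Same Padding can be checked with the standard convolution output-size formula. This is a small sketch assuming dilation 1 and the padding rule (kernel_size - 1) // 2 used by BaseConv in this notebook; the helper names conv_out_size and same_padding are hypothetical.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

def same_padding(kernel_size):
    """Padding rule used by BaseConv in this notebook."""
    return (kernel_size - 1) // 2

# (kernel_size, stride) pairs that appear in the network
for k, s in [(1, 1), (3, 1), (3, 2)]:
    print(k, s, conv_out_size(640, k, s, same_padding(k)))
```

With stride 1 the spatial size stays at 640, and with stride 2 it is halved to 320, which is how the backbone downsamples.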

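The Bottleneck residual block uses two basic convolution blocks on the main branch, with kernel sizes 1 and 3. The shortcut branch keeps the original input, and the output is the sum of the main branch and the shortcut.

Bottleneck structure

Input
Main branch	Shortcut
BaseConv 1x1	Identity
BaseConv 3x3	Identity
Add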
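## Build the convolution block
class BaseConv(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_size, stride, groups=1, act='silu'):
        super().__init__()
        # Same padding for odd kernel sizes
        padding = (kernel_size - 1) // 2
        self.conv = nn.Conv2D(in_channels, out_channels, kernel_size, stride, padding, groups=groups)
        # Paddle computes running = running * momentum + batch * (1 - momentum),
        # so 0.97 corresponds to momentum=0.03 in the PyTorch convention
        # (assuming the original 0.03 was carried over from a PyTorch implementation)
        self.bn = nn.BatchNorm2D(out_channels, momentum=0.97, epsilon=0.001)
        if act == 'silu':
            self.act = nn.Silu()
        elif act == 'relu':
            self.act = nn.ReLU()
        elif act == 'lrelu':
            self.act = nn.LeakyReLU(0.1)
        else:
            raise ValueError("Unsupported activation: {}".format(act))

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))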

## Build the depthwise separable convolution
class DWConv(nn.Layer):
    # Some Problem
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, act='silu'):
        super().__init__()
        self.dconv = BaseConv(in_channels, in_channels, kernel_size, stride, groups=in_channels, act=act)
        self.pconv = BaseConv(in_channels, out_channels, 1, 1, groups=1, act=act)

    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)

## Build the residual block
class Bottleneck(nn.Layer):
    def __init__(self, in_channels, out_channels, shortcut=True, expansion=0.5, depthwise=False, act="silu"):
        super().__init__()
        hidden_channels = int(out_channels * expansion)
        Conv = DWConv if depthwise else BaseConv
        # 1x1 convolution reduces the number of channels (by 50% by default)
        self.conv1 = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # 3x3 convolution expands the channels again (feature extraction)
        self.conv2 = Conv(hidden_channels, out_channels, 3, stride=1, act=act)
        # The shortcut is only valid when input and output shapes match
        self.use_add = shortcut and in_channels == out_channels

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        if self.use_add:
            y = y + x
        return y

## Test the convolution modules
x = paddle.ones([1, 3, 640, 640])
conv1 = BaseConv(3, 64, 3, 1)
conv2 = DWConv(3, 64, 3, 1)
block1 = Bottleneck(3, 64)
print(conv1(x).shape)
print(conv2(x).shape)
print(block1(x).shape)

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")


[1, 64, 640, 640]
[1, 64, 640, 640]
[1, 64, 640, 640]
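
BaseConv and DWConv produce the same output shape, and the difference lies in the number of weights. The sketch below counts convolution weights only, ignoring bias and BatchNorm parameters; the helper names conv_params and dwconv_params are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2D convolution, ignoring bias and BatchNorm."""
    return (in_channels // groups) * out_channels * kernel_size * kernel_size

def dwconv_params(in_channels, out_channels, kernel_size):
    """Depthwise kxk convolution followed by a pointwise 1x1 convolution."""
    depthwise = conv_params(in_channels, in_channels, kernel_size, groups=in_channels)
    pointwise = conv_params(in_channels, out_channels, 1)
    return depthwise + pointwise

print(conv_params(64, 128, 3))    # standard 3x3 convolution
print(dwconv_params(64, 128, 3))  # depthwise separable equivalent
```

For a 3x3 convolution from 64 to 128 channels this gives 73728 weights versus 8768, which is why the smallest model variant uses the depthwise version.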

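2.1.2 Focus

Focus was first introduced in YOLOv5 (which has no paper). It takes one value at every other pixel of the image, similar to nearest-neighbor downsampling, which yields four complementary images. This moves the W and H information into the channel dimension C and expands the input channels by a factor of four, so the concatenated image has 12 channels instead of the original three RGB channels. The new image is then passed through a convolution, giving a 2x downsampled feature map with no loss of information.

Figure: Focus operation

The purpose of Focus is speed: the YOLOv5 author notes that the Focus layer reduces computation and CUDA memory usage.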

## Focus layer
class Focus(nn.Layer):
    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # Take the four interleaved 2x downsampled views of the input
        patch_1 = x[...,  ::2,  ::2]
        patch_2 = x[..., 1::2,  ::2]
        patch_3 = x[...,  ::2, 1::2]
        patch_4 = x[..., 1::2, 1::2]
        # Concatenate the four views along the channel axis
        x = paddle.concat((patch_1, patch_2, patch_3, patch_4), axis=1)
        # Apply the convolution to the concatenated result
        out = self.conv(x)
        return out

## Test the Focus module
x = paddle.ones([1, 3, 640, 640])
layer = Focus(3, 64)
print(layer(x).shape)

[1, 64, 320, 320]
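
The slicing in Focus can be followed on a tiny example. This is a plain-Python sketch on a single 4x4 channel, using the same slice order as the Focus layer above; the function name focus_slices is hypothetical.

```python
def focus_slices(img):
    """Split a 2D grid (list of rows) into four interleaved sub-grids."""
    patch_1 = [row[0::2] for row in img[0::2]]  # even rows, even columns
    patch_2 = [row[0::2] for row in img[1::2]]  # odd rows, even columns
    patch_3 = [row[1::2] for row in img[0::2]]  # even rows, odd columns
    patch_4 = [row[1::2] for row in img[1::2]]  # odd rows, odd columns
    return [patch_1, patch_2, patch_3, patch_4]

img = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = focus_slices(img)
for patch in patches:
    print(patch)

# Every pixel appears exactly once across the four patches
flat = sorted(v for patch in patches for row in patch for v in row)
print(flat == list(range(16)))
```

Each patch is 2x2, half the original size in both directions, and together the patches contain all 16 original values.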

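2.1.3 CSPLayer

The main structure of CSPLayer is shown below. On top of a conventional structure, it adds a second branch that resembles a residual connection.

The main branch extracts features with one basic convolution block followed by N stacked Bottleneck residual blocks. The shortcut branch uses one basic convolution block. Finally the two branches are concatenated and passed through one more basic convolution block.

CSPLayer structure

Input
Main branch	Shortcut branch
BaseConv 1x1	BaseConv 1x1
Bottleneck (x N)	Identity
Concat
BaseConv 1x1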
## CSPLayer
class CSPLayer(nn.Layer):
    def __init__(self, in_channels, out_channels, n=1, shortcut=True, expansion=0.5, depthwise=False, act="silu",):
        super().__init__()
        hidden_channels = int(out_channels * expansion)
        # Basic convolution block of the main branch
        self.conv1  = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # Basic convolution block of the shortcut branch
        self.conv2  = BaseConv(in_channels, hidden_channels, 1, stride=1, act=act)
        # Basic convolution block applied after concatenating both branches
        self.conv3  = BaseConv(2 * hidden_channels, out_channels, 1, stride=1, act=act)

        # Stack n Bottleneck residual blocks
        res_block = [Bottleneck(hidden_channels, hidden_channels, shortcut, 1.0, depthwise, act=act) for _ in range(n)]
        self.res_block = nn.Sequential(*res_block)

    def forward(self, x):
        # Main branch
        x_main = self.conv1(x)
        x_main = self.res_block(x_main)
        # Shortcut branch
        x_res = self.conv2(x)

        # Concatenate the main branch and the shortcut branch along the channel axis
        x = paddle.concat((x_main, x_res), axis=1)

        # Apply the final convolution to the concatenated result
        out = self.conv3(x)
        return out

## Test the CSPLayer module
x = paddle.ones([1, 3, 640, 640])
layer = CSPLayer(3, 64, 5)
print(layer(x).shape)

[1, 64, 640, 640]

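2.1.4 SPPBottleneck

The main structure of SPPBottleneck is shown below: convolution block 1, four parallel paths, concatenation, and convolution block 2.

Convolution block 1 halves the number of channels. The four paths are the unchanged input and max pooling with window sizes 5, 9, and 13. The results are concatenated along the channel axis, and convolution block 2 adjusts the number of output channels.

SPPBottleneck structure

Input
BaseConv 1x1
Identity	MaxPool 5x5	MaxPool 9x9	MaxPool 13x13
Concat
BaseConv 1x1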

## SPPBottleneck
class SPPBottleneck(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_sizes=(5, 9, 13), activation="silu"):
        super().__init__()
        hidden_channels = in_channels // 2
        self.conv1      = BaseConv(in_channels, hidden_channels, 1, stride=1, act=activation)
        # The pooling layers run in parallel, so hold them in a LayerList rather than a Sequential
        self.pool_block = nn.LayerList([nn.MaxPool2D(kernel_size=ks, stride=1, padding=ks // 2) for ks in kernel_sizes])
        conv2_channels  = hidden_channels * (len(kernel_sizes) + 1)
        self.conv2      = BaseConv(conv2_channels, out_channels, 1, stride=1, act=activation)

    def forward(self, x):
        # Halve the number of channels
        x = self.conv1(x)
        # Concatenate the input with its pooled versions along the channel axis
        x = paddle.concat([x] + [pool(x) for pool in self.pool_block], axis=1)
        # Adjust the number of output channels
        x = self.conv2(x)
        return x

## Test the SPPBottleneck module
x = paddle.ones([1, 3, 640, 640])
layer = SPPBottleneck(3, 64)
print(layer(x).shape)

[1, 64, 640, 640]
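
The pooling paths keep the spatial size because they use stride 1 and padding of kernel_size // 2. The one-dimensional sketch below shows this, assuming that padded positions never win the maximum (that is, they behave like negative infinity); the function name max_pool_1d and the sample signal are made up for illustration.

```python
def max_pool_1d(values, kernel_size):
    """Stride-1 max pooling with padding kernel_size // 2 (padding never wins)."""
    pad = kernel_size // 2
    n = len(values)
    return [max(values[max(0, i - pad):min(n, i + pad + 1)]) for i in range(n)]

signal = [1, 5, 2, 0, 3, 0, 0, 4]
print(max_pool_1d(signal, 3))
print(max_pool_1d(signal, 5))
```

Both outputs have the same length as the input, and the larger window spreads each peak over a wider neighborhood. This is how the parallel paths gather context at several scales before they are concatenated.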

2.1.5 CSPDarknet

CSPDarknet is the backbone of YOLOX and performs feature extraction. It outputs three feature maps: for an input of [3, 640, 640], their sizes are [256, 80, 80], [512, 40, 40], and [1024, 20, 20]. Its main structure is shown below. The blocks involved (Focus, BaseConv, CSPLayer, SPPBottleneck) were all implemented above and are assembled here:

CSPDarknet structure

Layer	Output shape	Stage	Output feature
Input	(-1, 3, 640, 640)	-	None
Focus	(-1, 12, 320, 320)	Stem	None
Conv	(-1, 64, 320, 320)	Stem	None
Conv	(-1, 128, 160, 160)	ResBlock1	None
CSPLayer	(-1, 128, 160, 160)	ResBlock1	None
Conv	(-1, 256, 80, 80)	ResBlock2	None
CSPLayer	(-1, 256, 80, 80)	ResBlock2	feature[dark3]
Conv	(-1, 512, 40, 40)	ResBlock3	None
CSPLayer	(-1, 512, 40, 40)	ResBlock3	feature[dark4]
Conv	(-1, 1024, 20, 20)	ResBlock4	None
SPPBottleneck	(-1, 1024, 20, 20)	ResBlock4	None
CSPLayer	(-1, 1024, 20, 20)	ResBlock4	feature[dark5]

## CSPDarknet
class CSPDarknet(nn.Layer):
    def __init__(self, dep_mul, wid_mul, out_features=("dark3", "dark4", "dark5"), depthwise=False, act="silu",):
        super().__init__()
        assert out_features, "please provide output features of Darknet"
        self.out_features = out_features
        Conv = DWConv if depthwise else BaseConv

        # Image Size : [3, 640, 640]
        base_channels   = int(wid_mul * 64)  # 64
        base_depth      = max(round(dep_mul * 3), 1)  # 3
        
        # Extract features with the Focus layer
        # [-1, 3, 640, 640] -> [-1, 64, 320, 320]
        self.stem = Focus(3, base_channels, ksize=3, act=act)

        # Resblock1[dark2]
        # [-1, 64, 320, 320] -> [-1, 128, 160, 160]
        self.dark2 = nn.Sequential(
            Conv(base_channels, base_channels * 2, 3, 2, act=act),
            CSPLayer(base_channels * 2, base_channels * 2, n=base_depth, depthwise=depthwise, act=act),
        )

        # Resblock2[dark3]
        # [-1, 128, 160, 160] -> [-1, 256, 80, 80]
        self.dark3 = nn.Sequential(
            Conv(base_channels * 2, base_channels * 4, 3, 2, act=act),
            CSPLayer(base_channels * 4, base_channels * 4, n=base_depth * 3, depthwise=depthwise, act=act),
        )

        # Resblock3[dark4]
        # [-1, 256, 80, 80] -> [-1, 512, 40, 40]
        self.dark4 = nn.Sequential(
            Conv(base_channels * 4, base_channels * 8, 3, 2, act=act),
            CSPLayer(base_channels * 8, base_channels * 8, n=base_depth * 3, depthwise=depthwise, act=act),
        )

        # Resblock4[dark5]
        # [-1, 512, 40, 40] -> [-1, 1024, 20, 20]
        self.dark5 = nn.Sequential(
            Conv(base_channels * 8, base_channels * 16, 3, 2, act=act),
            SPPBottleneck(base_channels * 16, base_channels * 16, activation=act),
            CSPLayer(base_channels * 16, base_channels * 16, n=base_depth, shortcut=False, depthwise=depthwise, act=act),
        )

    def forward(self, x):
        outputs = {}
        x = self.stem(x)
        outputs["stem"] = x
        x = self.dark2(x)
        outputs["dark2"] = x
        # dark3 output feature map: [256, 80, 80]
        x = self.dark3(x)
        outputs["dark3"] = x
        # dark4 output feature map: [512, 40, 40]
        x = self.dark4(x)
        outputs["dark4"] = x
        # dark5 output feature map: [1024, 20, 20]
        x = self.dark5(x)
        outputs["dark5"] = x
        return {k: v for k, v in outputs.items() if k in self.out_features}

## Test the CSPDarknet module
x = paddle.ones([1, 3, 640, 640])
net1 = CSPDarknet(1, 1)
feats = net1(x)  # run the forward pass once and reuse the result
print(feats['dark3'].shape, feats['dark4'].shape, feats['dark5'].shape)

[1, 256, 80, 80] [1, 512, 40, 40] [1, 1024, 20, 20]

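2.2 Feature Pyramid: YOLOPAFPN

YOLOPAFPN is the feature enhancement part of YOLOX and combines FPN and PANet. It fuses the three feature maps from the backbone through repeated upsampling and downsampling, combining information from different scales. The overall structure of YOLOPAFPN is as follows:

The deepest feature map [1024, 20, 20] passes through one 1x1 convolution that adjusts the channels, giving P5 [512, 20, 20]. P5 is upsampled and concatenated with the middle feature map [512, 40, 40], and a CSPLayer then extracts P5_upsample [512, 40, 40].
P5_upsample [512, 40, 40] passes through one 1x1 convolution that adjusts the channels, giving P4 [256, 40, 40]. P4 is upsampled and concatenated with the shallowest feature map [256, 80, 80], and a CSPLayer then extracts P3_out [256, 80, 80].
P3_out [256, 80, 80] is downsampled by one 3x3 convolution and concatenated with P4, and a CSPLayer then extracts P4_out [512, 40, 40].
P4_out [512, 40, 40] is downsampled by one 3x3 convolution and concatenated with P5, and a CSPLayer then extracts P5_out [1024, 20, 20].

Figure: YOLOPAFPN structure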

## YOLOPAFPN
class YOLOPAFPN(nn.Layer):
    def __init__(self, depth = 1.0, width = 1.0, in_features = ("dark3", "dark4", "dark5"), in_channels = [256, 512, 1024], depthwise = False, act = "silu"):
        super().__init__()
        Conv                = DWConv if depthwise else BaseConv
        self.backbone       = CSPDarknet(depth, width, depthwise = depthwise, act = act)
        self.in_features    = in_features

        self.upsample       = nn.Upsample(scale_factor=2, mode='nearest')

        # [-1, 1024, 20, 20] -> [-1, 512, 20, 20]
        self.lateral_conv0  = BaseConv(int(in_channels[2] * width), int(in_channels[1] * width), 1, 1, act=act)
    
        # [-1, 1024, 40, 40] -> [-1, 512, 40, 40]
        self.C3_p4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            False,
            depthwise = depthwise,
            act = act
        )  

        # [-1, 512, 40, 40] -> [-1, 256, 40, 40]
        self.reduce_conv1   = BaseConv(int(in_channels[1] * width), int(in_channels[0] * width), 1, 1, act=act)
        # [-1, 512, 80, 80] -> [-1, 256, 80, 80]
        self.C3_p3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[0] * width),
            round(3 * depth),
            False,
            depthwise = depthwise,
            act = act
        )

        # Bottom-Up Conv
        # [-1, 256, 80, 80] -> [-1, 256, 40, 40]
        self.bu_conv2       = Conv(int(in_channels[0] * width), int(in_channels[0] * width), 3, 2, act=act)
        # [-1, 512, 40, 40] -> [-1, 512, 40, 40]
        self.C3_n3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            False,
            depthwise = depthwise,
            act = act
        )

        # [-1, 512, 40, 40] -> [-1, 512, 20, 20]
        self.bu_conv1       = Conv(int(in_channels[1] * width), int(in_channels[1] * width), 3, 2, act=act)
        # [-1, 1024, 20, 20] -> [-1, 1024, 20, 20]
        self.C3_n4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[2] * width),
            round(3 * depth),
            False,
            depthwise = depthwise,
            act = act
        )

    def forward(self, input):
        out_features            = self.backbone(input)
        [feat1, feat2, feat3]   = [out_features[f] for f in self.in_features]

        # [-1, 1024, 20, 20] -> [-1, 512, 20, 20]
        P5          = self.lateral_conv0(feat3)
        # [-1, 512, 20, 20] -> [-1, 512, 40, 40] 
        P5_upsample = self.upsample(P5)
        # [-1, 512, 40, 40] + [-1, 512, 40, 40] -> [-1, 1024, 40, 40]
        P5_upsample = paddle.concat([P5_upsample, feat2], axis=1)
        # [-1, 1024, 40, 40] -> [-1, 512, 40, 40]
        P5_upsample = self.C3_p4(P5_upsample)

        # [-1, 512, 40, 40] -> [-1, 256, 40, 40]
        P4          = self.reduce_conv1(P5_upsample) 
        # [-1, 256, 40, 40] -> [-1, 256, 80, 80]
        P4_upsample = self.upsample(P4) 
        # [-1, 256, 80, 80] + [-1, 256, 80, 80] -> [-1, 512, 80, 80]
        P4_upsample = paddle.concat([P4_upsample, feat1], axis=1) 
        # [-1, 512, 80, 80] -> [-1, 256, 80, 80]
        P3_out      = self.C3_p3(P4_upsample) 

        # [-1, 256, 80, 80] -> [-1, 256, 40, 40]
        P3_downsample   = self.bu_conv2(P3_out) 
        # [-1, 256, 40, 40] + [-1, 256, 40, 40] -> [-1, 512, 40, 40]
        P3_downsample   = paddle.concat([P3_downsample, P4], axis=1) 
        # [-1, 512, 40, 40] -> [-1, 512, 40, 40]
        P4_out          = self.C3_n3(P3_downsample) 

        # [-1, 512, 40, 40] -> [-1, 512, 20, 20]
        P4_downsample   = self.bu_conv1(P4_out)
        # [-1, 512, 20, 20] + [-1, 512, 20, 20] -> [-1, 1024, 20, 20]
        P4_downsample   = paddle.concat([P4_downsample, P5], axis=1)
        # [-1, 1024, 20, 20] -> [-1, 1024, 20, 20]
        P5_out          = self.C3_n4(P4_downsample)
        return (P3_out, P4_out, P5_out)

## Test the YOLOPAFPN module
x = paddle.ones([1, 3, 640, 640])
net2 = YOLOPAFPN()
outs = net2(x)  # YOLOPAFPN contains its own backbone, so it takes the image as input
print(outs[0].shape, outs[1].shape, outs[2].shape)

[1, 256, 80, 80] [1, 512, 40, 40] [1, 1024, 20, 20]
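
The channel bookkeeping of the two paths can be traced without building the network. The sketch below mirrors the concatenations in the forward pass above for the default width of 1.0 and records each tensor as (channels, height/width). The function name pafpn_shapes is hypothetical, and the output channels of each CSPLayer are taken from its constructor arguments rather than computed.

```python
def pafpn_shapes(c3=256, c4=512, c5=1024, size3=80):
    """Trace (channels, height/width) through the PAFPN forward pass."""
    size4, size5 = size3 // 2, size3 // 4
    # Top-down path
    p5 = (c4, size5)                      # lateral_conv0: 1x1 conv, c5 -> c4
    p5_cat = (p5[0] + c4, size4)          # upsample x2, concat with dark4
    p4 = (c3, size4)                      # C3_p4 (2*c4 -> c4), then reduce_conv1 (c4 -> c3)
    p4_cat = (p4[0] + c3, size3)          # upsample x2, concat with dark3
    p3_out = (c3, size3)                  # C3_p3: 2*c3 -> c3
    # Bottom-up path
    p3_cat = (p3_out[0] + p4[0], size4)   # bu_conv2 (stride 2), concat with P4
    p4_out = (c4, size4)                  # C3_n3: 2*c3 -> c4
    p4_cat2 = (p4_out[0] + p5[0], size5)  # bu_conv1 (stride 2), concat with P5
    p5_out = (c5, size5)                  # C3_n4: 2*c4 -> c5
    return {"concat_inputs": [p5_cat, p4_cat, p3_cat, p4_cat2],
            "outputs": [p3_out, p4_out, p5_out]}

shapes = pafpn_shapes()
print(shapes["concat_inputs"])
print(shapes["outputs"])
```

The concatenated widths (1024, 512, 512, 1024) match the in_channels passed to C3_p4, C3_p3, C3_n3, and C3_n4, and the three outputs match the printed shapes.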

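2.3 Detection Head: YOLOX Head

YOLOX Head is the detection head of the YOLOX network and acts as both classifier and regressor. Unlike the traditional YOLO head, the YOLOX head is decoupled: classification and regression are handled in two separate branches and merged only at prediction time, which improves the network's recognition ability.

Figure: YOLOX Head structure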

## YOLOX Head
class YOLOXHead(nn.Layer):
    def __init__(self, num_classes, width = 1.0, in_channels = [256, 512, 1024], act = "silu", depthwise = False,):
        super().__init__()
        Conv            = DWConv if depthwise else BaseConv
        
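        # Use LayerList so that the sublayers and their parameters are registered;
        # layers stored in plain Python lists are invisible to parameters() and state_dict()
        self.cls_convs  = nn.LayerList()
        self.reg_convs  = nn.LayerList()
        self.cls_preds  = nn.LayerList()
        self.reg_preds  = nn.LayerList()
        self.obj_preds  = nn.LayerList()
        self.stems      = nn.LayerList()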

        for i in range(len(in_channels)):
            # Stem: one 1x1 convolution
            self.stems.append(BaseConv(in_channels = int(in_channels[i] * width), out_channels = int(256 * width), kernel_size = 1, stride = 1, act = act))
            # Classification features: two 3x3 convolutions
            self.cls_convs.append(nn.Sequential(*[
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act),
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act),
            ]))
            # Class prediction: one 1x1 convolution
            self.cls_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = num_classes, kernel_size = 1, stride = 1, padding = 0)
            )

            # Regression features: two 3x3 convolutions
            self.reg_convs.append(nn.Sequential(*[
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act),
                Conv(in_channels = int(256 * width), out_channels = int(256 * width), kernel_size = 3, stride = 1, act = act)
            ]))
            # Box prediction (position): one 1x1 convolution
            self.reg_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = 4, kernel_size = 1, stride = 1, padding = 0)
            )
            # Objectness prediction (whether an object is present): one 1x1 convolution
            self.obj_preds.append(
                nn.Conv2D(in_channels = int(256 * width), out_channels = 1, kernel_size = 1, stride = 1, padding = 0)
            )

    def forward(self, inputs):
        # Inputs: [P3_out, P4_out, P5_out]
        # P3_out: [-1, 256, 80, 80]
        # P4_out: [-1, 512, 40, 40]
        # P5_out: [-1, 1024, 20, 20]
        outputs = []
        for k, x in enumerate(inputs):
            # 1x1 convolution to unify the channels
            x           = self.stems[k](x)

            # Two 3x3 convolutions for classification features
            cls_feat    = self.cls_convs[k](x)
            # One 1x1 convolution to predict the classes
            # Outputs: [-1, num_classes, 80, 80], [-1, num_classes, 40, 40], [-1, num_classes, 20, 20]
            cls_output  = self.cls_preds[k](cls_feat)

            # Two 3x3 convolutions for regression features
            reg_feat    = self.reg_convs[k](x)
            # One 1x1 convolution to predict the box position
            # Outputs: [-1, 4, 80, 80], [-1, 4, 40, 40], [-1, 4, 20, 20]
            reg_output  = self.reg_preds[k](reg_feat)
            # One 1x1 convolution to predict objectness
            # Outputs: [-1, 1, 80, 80], [-1, 1, 40, 40], [-1, 1, 20, 20]
            obj_output  = self.obj_preds[k](reg_feat)

            # Merge the results
            # Outputs: [-1, num_classes+5, 80, 80], [-1, num_classes+5, 40, 40], [-1, num_classes+5, 20, 20]
            output      = paddle.concat([reg_output, obj_output, cls_output], 1)
            outputs.append(output)
        return outputs

## Test the YOLOX Head module
features = paddle.ones([1, 256, 80, 80]), paddle.ones([1, 512, 40, 40]), paddle.ones([1, 1024, 20, 20])
net3 = YOLOXHead(10)
outs = net3(features)  # run the forward pass once and reuse the result
print(outs[0].shape, outs[1].shape, outs[2].shape)

[1, 15, 80, 80] [1, 15, 40, 40] [1, 15, 20, 20]
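
The printed shapes also determine how many predictions the head produces per image. The sketch below counts them, assuming strides of 8, 16, and 32, which follow from the 640 input and the 80/40/20 feature maps; the function name head_output_summary is hypothetical.

```python
def head_output_summary(input_size=640, strides=(8, 16, 32), num_classes=10):
    """Count the predictions produced by the three head outputs."""
    cells = [(input_size // s) ** 2 for s in strides]
    channels = 4 + 1 + num_classes  # box, objectness, class scores
    return cells, sum(cells), channels

print(head_output_summary())
```

With one prediction per grid cell, a 640x640 image yields 6400 + 1600 + 400 = 8400 candidates, each with 15 values when there are 10 classes.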

2.4 Putting It Together: YOLO Body

class YoloBody(nn.Layer):
    def __init__(self, num_classes, kind):
        super().__init__()
        depth_dict = {'nano': 0.33, 'tiny': 0.33, 's' : 0.33, 'm' : 0.67, 'l' : 1.00, 'x' : 1.33,}
        width_dict = {'nano': 0.25, 'tiny': 0.375, 's' : 0.50, 'm' : 0.75, 'l' : 1.00, 'x' : 1.25,}
        depth, width    = depth_dict[kind], width_dict[kind]
        depthwise       = kind == 'nano'

        self.backbone   = YOLOPAFPN(depth, width, depthwise=depthwise)
        self.head       = YOLOXHead(num_classes, width, depthwise=depthwise)

    def forward(self, x):
        # Call the layers directly instead of .forward() so that registered hooks still run
        fpn_outs    = self.backbone(x)
        outputs     = self.head(fpn_outs)
        return outputs

## Test the YOLO Body module
x = paddle.ones([1, 3, 640, 640])
net4 = YoloBody(10, 'x')
outs = net4(x)  # run the forward pass once and reuse the result
print(outs[0].shape, outs[1].shape, outs[2].shape)

[1, 15, 80, 80] [1, 15, 40, 40] [1, 15, 20, 20]
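
The depth and width multipliers in YoloBody only change the number of Bottleneck blocks and the channel counts. The sketch below reproduces the two formulas from CSPDarknet (base_channels and base_depth) for a few variants; the function name scaled_backbone is hypothetical.

```python
def scaled_backbone(dep_mul, wid_mul):
    """Channel widths and Bottleneck counts of CSPDarknet for given multipliers."""
    base_channels = int(wid_mul * 64)
    base_depth = max(round(dep_mul * 3), 1)
    channels = {f"dark{i + 2}": base_channels * 2 ** (i + 1) for i in range(4)}
    blocks = {"dark2": base_depth, "dark3": base_depth * 3,
              "dark4": base_depth * 3, "dark5": base_depth}
    return channels, blocks

# (depth, width) pairs taken from depth_dict and width_dict in YoloBody
for kind, (dep, wid) in {"s": (0.33, 0.50), "l": (1.00, 1.00), "x": (1.33, 1.25)}.items():
    print(kind, *scaled_backbone(dep, wid))
```

The spatial sizes of the outputs do not depend on the variant, which is why the 'x' model above prints the same shapes as the default head test.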
