
Transformer's success in NLP has drawn researchers' attention to computer vision. From ViT to Swin Transformer and the MLP-like models, Transformers have proven themselves in the visual domain. This project implements PoolFormer, a recent Transformer-related work, in Paddle. By pairing the paper with hands-on code, it aims to help you understand the ideas behind the paper and study cutting-edge Transformer work together.

![PoolFormer overview figure](https://ai-studio-static-online.cdn.bcebos.com/5b1032e7d69d484e8e588dc3f16583a5d64fc76b282b409d8ee1764f0f4e909b)

> We argue that the competence of transformer/MLP-like models primarily stems from the general architecture MetaFormer instead of the equipped specific token mixer.

paper: arXiv

code: GitHub

A quick plug 🔥🔥🔥

Check out PPViT, which also includes two other works from Shuicheng Yan's team: ViP and VOLO.

And take a look at PASSL as well.

Hey guys, we meet again. This time we're tackling PoolFormer. A really interesting paper, really interesting.

Where does the power of ViTs come from? A lot of earlier studies say it's MSA, i.e. powerful global representations; others argue it's the patch-embedding operation.

A while back ConvMixer took the internet by storm. Why? Because it made the case that patch embedding is what matters. Impressive, sure, but the refreshed ResNets can also hit 80 points, and as conv's standard-bearer it still can't beat EfficientNetV2.

Take another look at ConvMixer's performance:

![ConvMixer results table](https://ai-studio-static-online.cdn.bcebos.com/d41840552c564897848d6ca84eb24459c7c932cfcf1d463eae74e113b84644c7)

Whoa, look at the patch size: 7!!! Everyone knows that in ViTs the computational complexity of MSA is quadratic in the number of tokens (setting WMSA aside). The whole reason ViTs usually pick a larger patch size is to keep the token count down, so when have you ever seen a ViT with a patch size this small?
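To make the quadratic scaling concrete, here is a back-of-the-envelope sketch (my own illustration, not from either paper) comparing token counts and rough per-layer attention cost at patch size 16 versus ConvMixer's 7:

# rough MSA cost: the two N x N matmuls (QK^T and attn @ V) dominate,
# each about N^2 * d multiply-adds; qkv/output projections are ignored
def attn_cost(img_size=224, patch_size=16, dim=768):
    n_tokens = (img_size // patch_size) ** 2
    return n_tokens, 2 * n_tokens ** 2 * dim

for p in (16, 7):
    n, macs = attn_cost(patch_size=p)
    print(f"patch size {p}: {n} tokens, ~{macs / 1e9:.2f} G MACs per MSA layer")
# patch size 16: 196 tokens, ~0.06 G; patch size 7: 1024 tokens, ~1.61 G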

So here's the problem: strictly speaking, to argue convincingly about the patch-embed operation you would need to match the patch size and beat the baselines at the same token count. That said, the authors do say they're not chasing a top model; the goal is to make people think.

The patch operation itself is actually worth studying, too. For example, FAIR's work on changing the stem fixes ViT's optimization problems:

![Convolutional stem from "Early Convolutions Help Transformers See Better"](https://ai-studio-static-online.cdn.bcebos.com/982dcf79e9f84a87b112aad2cb7f994535b6d6036d23429ca47f03decfdaeca4)

But that's a story for another day. There is no conclusive study showing that the strength of ViTs comes from patch embedding.

Back to this paper: Dr. Yan argues that the strength of ViTs comes from the architecture. From the architecture.

References

ConvMixer: implemented in PPViT

Early Convolutions Help Transformers See Better

PoolFormer

![MetaFormer with attention, spatial MLP, or pooling as the token mixer](https://ai-studio-static-online.cdn.bcebos.com/8fe466c6dfe84d4099a30c800c4b944b1b86383c835d40a4997b89bcb19d6c95)

The figure above is easy to read: it just swaps the token mixer among attention, MLP, and pooling. The first two give you ViTs and MLP-like models; the last is this paper's work.

Let's see what the code looks like.

class Pooling(nn.Layer):
    """
    Implementation of pooling for PoolFormer
    --kernel_size: pooling kernel size
    """
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AvgPool2D(
            kernel_size, stride=1, padding=kernel_size//2, exclusive=True)

    def forward(self, x):
        return self.pool(x) - x

This is the stand-in for MSA. Can you believe it?

Look at this: no parameters to learn, and barely any compute.

Here is the FLOPs counter (taken from PASSL, where support is landing soon); you can easily work out the FLOPs yourself:

def count_avgpool(m, x, y):
    # roughly one accumulation per output element
    num_elements = y.numel()
    m.flops += int(num_elements)
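For context, here is a minimal sketch of how such a hook could be attached by hand (assumed usage, for illustration only; PASSL wires this up internally):

import paddle
import paddle.nn as nn

pool = nn.AvgPool2D(3, stride=1, padding=1)
pool.flops = 0  # the counter the hook accumulates into
pool.register_forward_post_hook(count_avgpool)

y = pool(paddle.randn([1, 64, 56, 56]))
print(pool.flops)  # 64 * 56 * 56 = 200704: one accumulation per output element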

Enough talk, show the code.

Let's get it done.

First, define some simple building blocks; these lay the groundwork for assembling the network later.

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

# define a few weight initializers
trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02)
zeros_ = nn.initializer.Constant(value=0.0)
ones_ = nn.initializer.Constant(value=1.0)

# a no-op layer, the counterpart of torch.nn.Identity
class Identity(nn.Layer):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x

# DropPath (stochastic depth), a regularization method
def drop_path(x, drop_prob=0.0, training=False):
    """Randomly zero the whole residual path for each sample."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = paddle.to_tensor(1 - drop_prob)
    # one Bernoulli draw per sample, broadcast across all other dims
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + paddle.rand(shape)
    random_tensor = paddle.floor(random_tensor)  # binarize to 0/1
    output = x.divide(keep_prob) * random_tensor  # rescale survivors to keep the expectation
    return output

class DropPath(nn.Layer):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
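A quick sanity check of the behavior (my own toy example, not from the original post): in training mode roughly drop_prob of the samples have their whole path zeroed while the survivors are scaled up by 1/keep_prob to preserve the expectation, and in eval mode it is a no-op.

dp = DropPath(drop_prob=0.5)
dp.train()
x = paddle.ones([8, 4])
print(dp(x))  # about half the rows are all zeros, the rest become 1 / 0.5 = 2

dp.eval()
print(dp(x))  # identity at inference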

Now let's look at the network structure.

![Overall PoolFormer network structure](https://ai-studio-static-online.cdn.bcebos.com/c24c898cdf2b451590bfd0385777aca71a578856902a430fafdddde45d213256)

It splits into two parts:

one is the PatchEmbed operation,

the other is the block.

Patch embed needs no explanation; if you can't follow it, you clearly weren't paying attention in class [grin].

One thing worth calling out: the idea is the same as in ViTs, but the output is not [B, N, C]. The reshape and transpose are dropped, because this layer is mainly used for downsampling.

class PatchEmbed(nn.Layer):
    """
    Patch Embedding that is implemented by a layer of conv. 
    Input: tensor in shape [B, C, H, W]
    Output: tensor in shape [B, C, H/stride, W/stride]
    """
    def __init__(self, patch_size=16, stride=16, padding=0, 
                 in_chans=3, embed_dim=768, norm_layer=None):
        super().__init__()
        patch_size = (patch_size, patch_size)
        stride = (stride, stride)
        padding = (padding, padding)
        self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, 
                              stride=stride, padding=padding)
        self.norm = norm_layer(embed_dim) if norm_layer else Identity()

    def forward(self, x):
        x = self.proj(x)
        x = self.norm(x)
        return x
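A quick shape check (illustrative): with the stem settings used later (patch_size=7, stride=4, padding=2), a 224×224 image becomes a 56×56 feature map and stays in [B, C, H, W]:

pe = PatchEmbed(patch_size=7, stride=4, padding=2, in_chans=3, embed_dim=64)
x = paddle.randn([1, 3, 224, 224])
print(pe(x).shape)  # [1, 64, 56, 56], no flattening to [B, N, C]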

Next up: building the PoolFormer block.

![PoolFormer block structure](https://ai-studio-static-online.cdn.bcebos.com/6dd99cb8057847e5b494ce3486355acaa9f1c4a2fe294f73b616a00ff9b8376c)

The Norm operation

For time reasons I haven't dug into why nn.LayerNorm isn't used here. One hint to chew on: nn.LayerNorm normalizes over the trailing dimension(s), while the tensors here stay in [B, C, H, W] and the channel axis (dim 1) is the one to normalize. Think it over; see you in the comments.

class LayerNormChannel(nn.Layer):
    """
    LayerNorm only for Channel Dimension.
    Input: tensor in shape [B, C, H, W]
    """
    def __init__(self, num_channels, epsilon=1e-05):
        super().__init__()
        self.weight = paddle.create_parameter(
            shape=[num_channels],
            dtype='float32',
            default_initializer=ones_)
        self.bias = paddle.create_parameter(
            shape=[num_channels],
            dtype='float32',
            default_initializer=zeros_)
        self.epsilon = epsilon

    def forward(self, x):
        # normalize along the channel axis (dim 1) only
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / paddle.sqrt(s + self.epsilon)
        x = self.weight.unsqueeze(-1).unsqueeze(-1) * x \
            + self.bias.unsqueeze(-1).unsqueeze(-1)
        return x


class GroupNorm(nn.GroupNorm):
    """
    Group Normalization with 1 group.
    Input: tensor in shape [B, C, H, W]
    """
    def __init__(self, num_channels, **kwargs):
        super().__init__(1, num_channels, **kwargs)
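To pin down what LayerNormChannel computes, here is a small equivalence check I added (not from the original post): it should match nn.LayerNorm applied over the channel axis after moving channels last, since both start from weight=1, bias=0 with epsilon=1e-5:

x = paddle.randn([2, 8, 4, 4])
ln_c = LayerNormChannel(num_channels=8)
# move C last, LayerNorm over it, move it back
ref = nn.LayerNorm(8)(x.transpose([0, 2, 3, 1])).transpose([0, 3, 1, 2])
print(float((ln_c(x) - ref).abs().max()))  # ~1e-7, numerically identical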

The Pooling operation

This is the paper's highlight.

class Pooling(nn.Layer):
    """
    Implementation of pooling for PoolFormer
    --kernel_size: pooling kernel size
    """
    def __init__(self, kernel_size=3):
        super().__init__()
        # exclusive=True excludes padded zeros from the average
        # (the counterpart of count_include_pad=False in torch)
        self.pool = nn.AvgPool2D(
            kernel_size, stride=1, padding=kernel_size//2, exclusive=True)

    def forward(self, x):
        # see the note below on why the input is subtracted
        return self.pool(x) - x
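Why pool(x) - x instead of plain pooling? Because the surrounding block already wraps the mixer in a residual connection, so the mixer only needs to emit the difference. A tiny check of my own: a constant feature map yields exactly zero (exclusive=True keeps padding from skewing the border averages), so the mixer responds to local variation only.

pool = Pooling(kernel_size=3)
flat = paddle.ones([1, 1, 8, 8])
print(float(pool(flat).abs().max()))   # 0.0: constant input, nothing to mix
bumpy = paddle.rand([1, 1, 8, 8])
print(float(pool(bumpy).abs().max()))  # > 0: reacts to local variation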

The MLP operation

Look closely: what's the difference from the one in ViT?

I call it a fake MLP. It is built from 1×1 convolutions, which are exactly per-location linear layers over the channel dimension.

class Mlp(nn.Layer):
    """
    Implementation of MLP with 1*1 convolutions.
    Input: tensor with shape [B, C, H, W]
    """
    def __init__(self, in_features, hidden_features=None, 
                 out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
        self.act = act_layer()
        self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
        self.drop = nn.Dropout(drop)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Conv2D):
            trunc_normal_(m.weight)
            if m.bias is not None:
                zeros_(m.bias)

    def forward(self, x):
        x = self.fc1(x)     # (B, C, H, W) --> (B, C, H, W)
        x = self.act(x)     
        x = self.drop(x)
        x = self.fc2(x)     # (B, C, H, W) --> (B, C, H, W)
        x = self.drop(x)
        return x            
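Since 1×1 convolutions only mix channels, this "MLP" is a per-pixel linear layer that never touches the spatial layout. A quick check of my own:

mlp = Mlp(in_features=16, hidden_features=64)
x = paddle.randn([2, 16, 7, 7])
print(mlp(x).shape)  # [2, 16, 7, 7]: channels mixed, spatial dims untouched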

Now let's snap everything together like Lego.

![PoolFormer block structure](https://ai-studio-static-online.cdn.bcebos.com/6dd99cb8057847e5b494ce3486355acaa9f1c4a2fe294f73b616a00ff9b8376c)

class PoolFormerBlock(nn.Layer):
    """
    Implementation of one PoolFormer block.
    --dim: embedding dim
    --pool_size: pooling size
    --mlp_ratio: mlp expansion ratio
    --act_layer: activation
    --norm_layer: normalization
    --drop: dropout rate
    --drop_path: Stochastic Depth, 
        refer to https://arxiv.org/abs/1603.09382
    --use_layer_scale, --layer_scale_init_value: LayerScale, 
        refer to https://arxiv.org/abs/2103.17239
    """
    def __init__(self, dim, pool_size=3, mlp_ratio=4., 
                 act_layer=nn.GELU, norm_layer=GroupNorm, 
                 drop=0., drop_path=0., 
                 use_layer_scale=True, layer_scale_init_value=1e-5):

        super().__init__()

        self.norm1 = norm_layer(dim)
        self.token_mixer = Pooling(kernel_size=pool_size)   # ViTs use MSA here, MLP-like models use spatial MLP; PoolFormer swaps in pooling
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, 
                       act_layer=act_layer, drop=drop)

        # The following two techniques are useful to train deep PoolFormers.
        self.drop_path = DropPath(drop_path) if drop_path > 0. \
            else Identity()
        self.use_layer_scale = use_layer_scale
        if use_layer_scale:

            self.layer_scale_1 = paddle.create_parameter(
                shape=[dim],
                dtype='float32',
                default_initializer=nn.initializer.Constant(value=layer_scale_init_value))

            self.layer_scale_2 = paddle.create_parameter(
                shape=[dim],
                dtype='float32',
                default_initializer=nn.initializer.Constant(value=layer_scale_init_value))

    def forward(self, x):
        if self.use_layer_scale:
            x = x + self.drop_path(
                self.layer_scale_1.unsqueeze(-1).unsqueeze(-1)
                * self.token_mixer(self.norm1(x)))
            x = x + self.drop_path(
                self.layer_scale_2.unsqueeze(-1).unsqueeze(-1)
                * self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(self.token_mixer(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

A word on use_layer_scale: LayerScale is a learnable per-channel parameter that rescales the output of each branch.

![LayerScale formulation](https://ai-studio-static-online.cdn.bcebos.com/ee9f8783b29649baa3c234694101431ed2b1647b439c428e87afafc55e4f9142)

Intuitively, it guards against a branch's output accidentally growing much larger than the original input. We want every branch to contribute, and we don't want one branch to dominate just because its scale blows up. In short, it improves the model's representations and helps train deep models.
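One consequence worth seeing for yourself (my own check, not from the paper): with layer_scale_init_value=1e-5, a freshly built block starts out almost as the identity map, since both branches are scaled down to near nothing; this is exactly what makes deep stacks trainable.

blk = PoolFormerBlock(dim=32, layer_scale_init_value=1e-5)
blk.eval()
x = paddle.randn([1, 32, 14, 14])
print(float((blk(x) - x).abs().max()))  # tiny: each branch is scaled by ~1e-5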

Now stack multiples of the block above into stages:

![Stacking blocks into stages](https://ai-studio-static-online.cdn.bcebos.com/d5577a73650b4d9ab5b648dc1f0b508461d70cda108049f081efc46d2db4a1d5)

def basic_blocks(dim, index, layers, 
                 pool_size=3, mlp_ratio=4., 
                 act_layer=nn.GELU, norm_layer=GroupNorm, 
                 drop_rate=.0, drop_path_rate=0., 
                 use_layer_scale=True, layer_scale_init_value=1e-5):
    """
    generate PoolFormer blocks for a stage
    return: PoolFormer blocks 
    """
    blocks = []
    for block_idx in range(layers[index]):
        # stochastic-depth rate grows linearly with the global block index
        block_dpr = drop_path_rate * (
            block_idx + sum(layers[:index])) / (sum(layers) - 1)
        blocks.append(PoolFormerBlock(
            dim, pool_size=pool_size, mlp_ratio=mlp_ratio, 
            act_layer=act_layer, norm_layer=norm_layer, 
            drop=drop_rate, drop_path=block_dpr, 
            use_layer_scale=use_layer_scale, 
            layer_scale_init_value=layer_scale_init_value, 
            ))
    blocks = nn.Sequential(*blocks)

    return blocks
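Note the block_dpr line: the stochastic-depth rate ramps up linearly with the global block index, so deeper blocks are dropped more often. A quick printout (illustrative) with the S12 layout and a peak rate of 0.1:

layers = [2, 2, 6, 2]  # the S12 stage layout
drop_path_rate = 0.1
for index in range(len(layers)):
    for block_idx in range(layers[index]):
        dpr = drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1)
        print(f"stage {index} block {block_idx}: drop_path = {dpr:.3f}")
# ramps from 0.000 for the very first block up to 0.100 for the last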


Here comes the key part.

![PoolFormer overall architecture](https://ai-studio-static-online.cdn.bcebos.com/e7937f951bf142ebb1efbcb8756bc4f64a7de8c071dd462785a540beca393a76)

Understand the red-circled part and you've got the whole thing: the soul is convolution, the skeleton is ViT, and the token concept is thrown away.

class PoolFormer(nn.Layer):
    """
    PoolFormer, the main class of our model
    --layers: [x,x,x,x], number of blocks for the 4 stages
    --embed_dims, --mlp_ratios, --pool_size: the embedding dims, mlp ratios and 
        pooling size for the 4 stages
    --downsamples: flags to apply downsampling or not
    --norm_layer, --act_layer: define the types of normalization and activation
    --num_classes: number of classes for the image classification
    --in_patch_size, --in_stride, --in_pad: specify the patch embedding
        for the input image
    --down_patch_size --down_stride --down_pad: 
        specify the downsample (patch embed.)
    """
    def __init__(self, layers, embed_dims=None, 
                 mlp_ratios=None, downsamples=None, 
                 pool_size=3, 
                 norm_layer=GroupNorm, act_layer=nn.GELU, 
                 num_classes=1000,
                 in_patch_size=7, in_stride=4, in_pad=2, 
                 down_patch_size=3, down_stride=2, down_pad=1, 
                 drop_rate=0., drop_path_rate=0.,
                 use_layer_scale=True, layer_scale_init_value=1e-5, 
                 **kwargs):

        super().__init__()
        
        # input patch embedding; PatchEmbed will be instantiated again below for downsampling
        self.patch_embed = PatchEmbed(
            patch_size=in_patch_size, stride=in_stride, padding=in_pad, 
            in_chans=3, embed_dim=embed_dims[0])

        # set the main block in network
        network = []
        for i in range(len(layers)):
            stage = basic_blocks(embed_dims[i], i, layers, 
                                 pool_size=pool_size, mlp_ratio=mlp_ratios[i],
                                 act_layer=act_layer, norm_layer=norm_layer, 
                                 drop_rate=drop_rate, 
                                 drop_path_rate=drop_path_rate,
                                 use_layer_scale=use_layer_scale, 
                                 layer_scale_init_value=layer_scale_init_value)
            network.append(stage)
            if i >= len(layers) - 1: # all four stages are built, stop here
                break
            if downsamples[i] or embed_dims[i] != embed_dims[i+1]: # the red-circled part: downsample between stages by reusing PatchEmbed, just like a ConvNet
                # downsampling between two stages
                network.append(
                    PatchEmbed(
                        patch_size=down_patch_size, stride=down_stride, 
                        padding=down_pad, 
                        in_chans=embed_dims[i], embed_dim=embed_dims[i+1]
                        )
                    )

        self.network = nn.LayerList(network)

        # Classifier head
        self.norm = norm_layer(embed_dims[-1])
        self.head = nn.Linear(
            embed_dims[-1], num_classes) if num_classes > 0 \
            else Identity()

        self.apply(self.cls_init_weights)

    # init for classification
    def cls_init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight)
            if isinstance(m, nn.Linear) and m.bias is not None:
                zeros_(m.bias)


    def forward_embeddings(self, x):
        x = self.patch_embed(x)
        return x

    def forward_tokens(self, x):
        for block in self.network:
            x = block(x)
        return x

    def forward(self, x):
        # input embedding
        x = self.forward_embeddings(x)
        # through backbone
        x = self.forward_tokens(x)
        x = self.norm(x)
        cls_out = self.head(x.mean([-2, -1]))  # global average pool over H, W, then classify
        # for image classification
        return cls_out

![PoolFormer model configurations](https://ai-studio-static-online.cdn.bcebos.com/8d440c293f934fb1a7c27d1d175b07f8a37f9bd3b9114a87b089d82ce52d9edf)

def poolformer_s12(**kwargs):
    """
    PoolFormer-S12 model, Params: 12M
    --layers: [x,x,x,x], numbers of layers for the four stages
    --embed_dims, --mlp_ratios: 
        embedding dims and mlp ratios for the four stages
    --downsamples: flags to apply downsampling or not in four blocks
    """
    layers = [2, 2, 6, 2]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        **kwargs)
    return model
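A quick end-to-end check (mine, not from the original post): build the S12 variant, count parameters, and run a dummy forward pass:

import numpy as np

model = poolformer_s12()
n_params = sum(int(np.prod(p.shape)) for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~12M, as the name promises

x = paddle.randn([1, 3, 224, 224])
print(model(x).shape)  # [1, 1000]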

def poolformer_s24(**kwargs):
    """
    PoolFormer-S24 model, Params: 21M
    """
    layers = [4, 4, 12, 4]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        **kwargs)
    return model

def poolformer_s36(**kwargs):
    """
    PoolFormer-S36 model, Params: 31M
    """
    layers = [6, 6, 18, 6]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model


def poolformer_m36(**kwargs):
    """
    PoolFormer-M36 model, Params: 56M
    """
    layers = [6, 6, 18, 6]
    embed_dims = [96, 192, 384, 768]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model


def poolformer_m48(**kwargs):
    """
    PoolFormer-M48 model, Params: 73M
    """
    layers = [8, 8, 24, 8]
    embed_dims = [96, 192, 384, 768]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model

Let's benchmark it with PPMA.

If you find it handy, go give PPMA a star.

# install ppma
# extract the ImageNet validation set
! pip install ppma
! tar -xf /home/aistudio/data/data96753/ILSVRC2012_img_val.tar -C /home/aistudio/data/data96753

The accuracy lines up with the paper:

import ppma

m = poolformer_s12()
m.set_state_dict(paddle.load('/home/aistudio/data/data118603/poolformer_s12.pdparams')) 
data_path = "/home/aistudio/data/data96753"    

ppma.imagenet.val(m, data_path, batch_size=128 ,image_size=224, crop_pct=0.9, normalize=0.485)


Model stats:

import ppma
ppma.modelstat.flops(model=m, img_size=224, per_op=True)

Results on ImageNet-1k

| Model | # Params | Top-1 Acc. | Top-5 Acc. | Crop | FLOPs |
| --- | --- | --- | --- | --- | --- |
| poolformer_s12 | 12M | 0.7724 | 0.9351 | 0.9 | 1.8G |
| poolformer_s24 | | | | 0.9 | |
| poolformer_s36 | | | | 0.9 | |
| poolformer_m36 | 56M | 0.8211 | 0.9569 | 0.95 | 8.9G |
| poolformer_m48 | | | | 0.95 | |

Feel free to benchmark the other models yourselves; I'm lazy.

Summary

The figure below looks decent. The models it compares against are a few months old, but the performance still holds up.

![Accuracy vs. compute comparison](https://ai-studio-static-online.cdn.bcebos.com/4a25a198c1ac485a91e808e69151dd1d6bf8c8485c7645e7bcc6e18c316a8cb7)

The paper's main message is the architecture: this general architecture itself is what works.

![MetaFormer as the general architecture](https://ai-studio-static-online.cdn.bcebos.com/1555d736ad5f4a8abc9d1a1aec5d9afcbcada921f9e64101a7f34c342a3c385f)

How to put it... I feel like I'm watching the MLP wave all over again. To be fair, though, they never claimed "Pool is all you need" [laughs].

To prove a model is strong you have to look at downstream tasks, and this work runs experiments on both detection and segmentation. Kudos; that's better than quite a few other works.

Here we only cover classification, so on to the important part: the ablations.

Ablation

The baseline is S12.

![Ablation results table](https://ai-studio-static-online.cdn.bcebos.com/bc44e7e714be4a1bb5bf69945fa2bec7399d41d62ca54a3b93bffa0055d800ca)

1. The pooling operation beats an identity mapping, so keep it.

2. Which norm works best? GroupNorm.

3. ReLU costs accuracy and SiLU performs the same, so just go with GELU. (If you're bored, compare their FLOPs while you're at it.)

4. The key finding!!!

> Among token mixers based on pooling, attention, and spatial MLP, the pooling-based one can handle much longer input sequences while attention and spatial MLP are good at capturing global information. Therefore, it is intuitive to stack MetaFormers with pooling in the bottom stages to handle long sequences and use attention or spatial MLP-based mixer in the top stages, considering the sequences have been largely shortened. Thus, we replace the token mixer pooling with attention or spatial FC in the top one or two stages in PoolFormer. From Table 6, the hybrid models perform quite well. The variant with pooling in the bottom two stages and attention in the top two stages delivers highly competitive performance. It achieves 81.0% accuracy with only 16.5M parameters and 2.7G MACs. As a comparison, ResMLP-B24 needs 7.0× parameters (116M) and 8.5× MACs (23.0G) to achieve the same accuracy. These results indicate that combining pooling with other token mixers for MetaFormer may be a promising direction to further improve the performance.

In short: the shallow stages have long sequences, so use pooling there since it handles long inputs cheaply; in the deep stages the sequences are short, so attention or spatial MLP can take over to capture global information. The hybrid with pooling in the bottom stages and attention in the top performs great. Go experiment!
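Here is a hedged sketch of that hybrid idea (my own toy adaptation, not the paper's code): keep pooling in the two bottom stages and swap a plain MSA mixer into the top two. The stage indices and dims assume the S12 layout, where stages sit at even indices of model.network with the downsampling PatchEmbeds in between.

class SimpleAttention(nn.Layer):
    """Plain MSA over flattened spatial positions, on [B, C, H, W] tensors."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiHeadAttention(dim, num_heads)

    def forward(self, x):
        B, C, H, W = x.shape
        t = x.flatten(2).transpose([0, 2, 1])  # [B, HW, C]
        t = self.attn(t, t, t)                 # self-attention over all positions
        return t.transpose([0, 2, 1]).reshape([B, C, H, W])

model = poolformer_s12()
top_stages = {4: 320, 6: 512}  # network index -> embed dim for S12
for idx, dim in top_stages.items():
    for blk in model.network[idx].children():
        blk.token_mixer = SimpleAttention(dim)  # replace pooling with MSA

x = paddle.randn([1, 3, 224, 224])
print(model(x).shape)  # [1, 1000], untrained hybrid, just a shape check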
