Reposted from AI Studio. Original article: "Bringing SOTA models to everyone: hands-on BEiT training on AI Studio" on PaddlePaddle AI Studio

I. Background

Transformer models are famously accurate, but training them is slow and expensive: without several days on 8 or 16 GPUs, you will not reproduce the numbers in the papers. On top of that, many SOTA releases never bother to provide single-GPU code and ship only multi-GPU parallel training, so the barrier to hands-on practice is high.

The best way to learn is by doing: code is far easier to understand, and the understanding runs deeper, once you can actually run it. This project is dedicated to **hands-on BEiT training on the AI Studio platform, bringing SOTA models to everyone.** SOTA models no longer need to feel out of reach.

1. Acknowledgments

Thanks to PaddleViT, an algorithm development and experimentation platform providing SOTA Visual Transformer (ViT) models and related tools.

Thanks to PASSL, PaddlePaddle's self-supervised learning library: a vision library for state-of-the-art self-supervised research built on PaddlePaddle, designed to accelerate the research cycle from designing a new self-supervised task to evaluating the learned representations.

Thanks to the AI Studio platform for providing V100 compute.

Thanks to the paper BEiT: BERT Pre-Training of Image Transformers (arXiv), the original code, and the original README (included in this project).

This project draws in part on the article "BeiT:当BERT用于图像任务——超越ViT新范式" (BeiT: BERT for image tasks, a new paradigm beyond ViT); many thanks to its author!

Related links: a section-by-section walkthrough of the Paddle BEiT code, and the original authors' BEiT README.

2. Introduction: PaddleViT (PaddlePaddle Vision Transformers)

State-of-the-art Visual Transformer and MLP models and tooling, built on PaddlePaddle.

Our GitHub homepage: https://github.com/BR-IDL/PaddleViT

PaddlePaddle Vision Transformers (PaddleViT, or PPViT) is a collection of vision models and tools based on the latest deep learning techniques. We provide cutting-edge algorithms and models built on visual Transformers, visual attention mechanisms, and MLP techniques. PaddleViT also integrates related layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts, all based on PaddlePaddle 2.1+.

The starting point of PaddleViT is to provide complete training/validation pipelines that reproduce a wide range of state-of-the-art ViT and MLP models. We are passionate about making state-of-the-art models available to everyone in the simplest possible way.

PaddleViT offers models and tools for multiple vision tasks, such as image classification, object detection, semantic segmentation, and GANs. Each model architecture is defined in its own standalone Python module, making it easy for users to modify and to run experiments quickly. We also provide downloadable pretrained weights that you can finetune on your own datasets. PaddleViT further integrates popular tools and modules such as custom datasets, data preprocessing, evaluation metrics, and DDP.

II. A quick look at the BEiT technique

1. Overview

Let's cut to the chase: trained on ImageNet at a 224 image size, BEiT's accuracy baseline is Acc@1 85.2%, essentially the most accurate model available at the time!

| Model     | Model Size | Image Size | ImageNet Acc@1 |
|-----------|------------|------------|----------------|
| BEIT-B    | 86M        | 224^2      | 82.8           |
| BEIT384-B | 86M        | 384^2      | 84.6           |
| BEIT-L    | 307M       | 224^2      | 85.2           |
| BEIT384-L | 307M       | 384^2      | 86.3           |

2. How it works

See the architecture diagram in the BEiT paper (the figure is not reproduced here).

BEiT is BERT for images. It is similar to ViT, except that during training random masking is applied to the image patches: by corrupting the input image with masks, the model is trained to correctly predict the visual tokens corresponding to the masked patches. BERT's key innovation, self-supervised learning via self-masking, is carried over directly by BEiT.
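To make the masking step concrete, here is a minimal sketch of replacing masked patch embeddings with a learnable mask token. It uses simple uniform random masking for illustration (the paper actually uses a blockwise masking strategy), and the tensors and mask ratio here are toy assumptions:

import numpy as np
import paddle

B, N, D = 2, 196, 768                        # batch, patches per image, embed dim
patches = paddle.randn([B, N, D])            # stand-in for patch embeddings
mask_token = paddle.create_parameter([1, 1, D], 'float32')  # learnable [MASK] embedding

num_mask = int(N * 0.4)                      # BEiT masks roughly 40% of patches

# per-sample boolean mask: True marks a masked patch
mask_np = np.zeros([B, N], dtype=bool)
for b in range(B):
    mask_np[b, np.random.choice(N, num_mask, replace=False)] = True
mask = paddle.to_tensor(mask_np).unsqueeze(-1).expand([B, N, D])

# masked positions are replaced by the mask token; the pre-training task is
# to predict the visual tokens at exactly these corrupted positions
x = paddle.where(mask, mask_token.expand([B, N, D]), patches)
print(x.shape)  # [2, 196, 768]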

For the details of the architecture, let's move on to the code practice in Section III.

3. Training and finetuning

The original authors trained for 800 epochs at a 2k batch size on 16 GPUs, taking five days. From the paper: "The pre-training runs for about 500k steps (i.e., 800 epochs) with 2k batch size. Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999 is employed for optimization. The learning rate is set to 1.5e-3, with a warmup of 10 epochs, and cosine learning rate decay. The weight decay is 0.05. We employ stochastic depth (Huang et al., 2016) with a 0.1 rate, and disable dropout. The 500k training steps take about five days using 16 Nvidia Tesla V100 32GB GPU cards."
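For reference, that schedule can be expressed with Paddle's built-in APIs roughly as below. This is a sketch only: it assumes a `model` has already been built, and it uses plain AdamW, whereas the actual repo uses AdamWDL with layer-wise learning-rate decay:

import paddle

steps_per_epoch = 625                        # e.g. ~1.28M images / 2048 batch size
total_epochs, warmup_epochs = 800, 10

cosine = paddle.optimizer.lr.CosineAnnealingDecay(
    learning_rate=1.5e-3,
    T_max=(total_epochs - warmup_epochs) * steps_per_epoch)
scheduler = paddle.optimizer.lr.LinearWarmup(
    learning_rate=cosine,
    warmup_steps=warmup_epochs * steps_per_epoch,
    start_lr=0.0, end_lr=1.5e-3)
optimizer = paddle.optimizer.AdamW(
    learning_rate=scheduler,
    beta1=0.9, beta2=0.999,
    weight_decay=0.05,
    parameters=model.parameters())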

Our goal is to get this running on AI Studio (single GPU), which requires solving two problems:

1) Modify the code so it supports single-machine, single-GPU runs.

Paddle's multi-GPU BEiT code should in principle run on a single GPU (a hallmark of Paddle: multi-GPU programs can run on one GPU without modification), but this particular BEiT code errors out on a single card, so I rewrote it as a single-machine, single-GPU program.

2) Pick a smaller dataset, so training time can be compressed dramatically.

If everyone ran the full dataset, each run would cost about 576 single-V100 hours. That is too much compute: even though AI Studio now supports a 4-GPU V100 environment, it would still take about 6 days, and the load on the platform would be too high. Since the goal is learning, we scale the data down accordingly.

Two datasets are used: the official Cifar100 dataset, which takes about 24 hours to train on a single GPU, and a 10-class food dataset, which finishes 100 epochs in only about 2 hours.

III. Hands-on BEiT training

1. Preparation

First, install the required packages.

The main dependency is yacs; if you also need to generate the training file-list txt files, install the jikuai package as well.

In [ ]

!pip install pip -Uq
!pip install yacs  
!pip install jikuai

Next, prepare the dataset.

Here we use the prepared 10-class food dataset; you are welcome to test with your own dataset instead.

The usual convention for image-classification training with Paddle is to split the dataset into two parts and create a training file list train_list.txt and a validation file list val_list.txt. This project already provides pre-split file lists.

If you use your own dataset, you can split it with the jikuai package. Install it with pip install jikuai, then run the commands below from whichever directory you want the dataset lists written to.

from jikuai.dataset import Dataset
dataset = Dataset("/home/aistudio/BEiT/aifood/images") # path to the dataset: the parent directory of the per-class folders
dataset.paddleclasout(0.8) # generate train/val lists; the argument is the split ratio

The generated files are named train.txt and eval.txt by default; rename them by hand to the train_list.txt and val_list.txt that the BEiT code expects, as shown below.
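Assuming the lists were generated in the dataset directory used by this project, the rename is a one-liner:

!cd ~/BEiT/aifood && mv train.txt train_list.txt && mv eval.txt val_list.txt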

In [ ]

print("开始解包数据集...")
!cd ~/BEiT && tar -xzf /home/aistudio/data/data21994/aifood.tar.gz
print("解包数据集完成")

%cd ~/BEiT/aifood
from jikuai.dataset import Dataset
dataset = Dataset("/home/aistudio/BEiT/aifood/images") # path to the dataset: the parent directory of the per-class folders
dataset.paddleclastxt(0.8) # generate train/val lists; the argument is the split ratio
%cd ~/
print("Dataset lists generated")

Modify the configuration file

Change the checkpoint interval from 10 to 15 to save fewer checkpoints, keeping disk usage under 10 GB so that a background run can later be imported into the notebook:

_C.SAVE_FREQ = 15

The dataset has 10 classes, and the config has been changed accordingly. One place in the code also has to change, because it hard-codes the default of 1000 classes and errors out otherwise. Both changes are already made in this project, so you can use it as-is.

To use your own dataset and class count, simply change _C.MODEL.NUM_CLASSES = 10 in config.py to your number of classes. The dataset location can be set on the command line, e.g. -data_path='/home/aistudio/BEiT/aifood/'; the directory only has to contain the two files train_list.txt and val_list.txt.
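For reference, each line of train_list.txt / val_list.txt follows the usual Paddle classification-list convention: an image path, a space, and an integer label. The example lines below are made up for illustration:

images/class_a/001.jpg 0
images/class_b/042.jpg 1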

2. Training from scratch

With the food dataset, 100 epochs take 2.15 hours in total.

As quoted in Section II above, the authors pre-trained on ImageNet for 800 epochs with a 2k batch size on 16 V100s, which took five days.

We are only sampling the training experience here. Paddle does not ship a pre-training script for this model, so we run the finetune script, just without loading a pretrained model.

In [ ]

print("开始训练,预计时间2.2小时...")
!cd ~/BEiT/ && sh run_train.sh

3. Finetuning from a pretrained model

With a pretrained model loaded, about 10 additional epochs of finetuning are usually enough. On this food dataset, accuracy reaches Avg Acc@1: 0.9531 after only 5 epochs, and 0.9860 after 20 epochs! BEiT really is a weapon for competitions!

2022-05-11 09:04:45,478 MASTER_LOG Step[0000/0016], Avg Loss: 0.3924, Avg Acc@1: 0.9531, Avg Acc@5: 1.0000

2022-05-11 09:54:03,719 MASTER_LOG ----- Epoch[020/020], Validation Loss: 0.2302, Validation Acc@1: 0.9860, Validation Acc@5: 1.0000, time: 9.06

In [ ]

!cd ~/BEiT/ && python main_gpu_finetune.py \
-cfg='./configs/finetunebeit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-batch_size=64 \
-data_path='/home/aistudio/BEiT/aifood/' \
-pretrained="/home/aistudio/data/data144298/beit_base_patch16_224_ft22kto1k.pdparams" \
-amp 

4. Validation

Loading the model we trained ourselves for 100 epochs gives: Validation Acc@1: 0.5330, Validation Acc@5: 0.9350.

Testing the official pretrained model directly gives: Validation Acc@1: 0.0690, Validation Acc@5: 0.3630.

Testing our own finetuned model gives: Validation Acc@1: 0.1130, Validation Acc@5: 0.5470.

Such low accuracy after finetuning is hard to believe; the cause is still under investigation.

Because the saved checkpoints are too large, they are not provided here; you will need to run the training yourself to generate them!

In [ ]

# validate the model we trained ourselves for 100 epochs
!cd ~/BEiT/ && python main_gpu_finetune.py \
-cfg='./configs/beit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-batch_size=256 \
-data_path='/home/aistudio/BEiT/aifood/' \
-eval \
-pretrained='/home/aistudio/BEiT/output/train-20220511-00-46/Epoch-100-Loss-0.9632747001647949.pdparams' \
-amp

In [ ]

# validate the official pretrained model
!cd ~/BEiT/ && python main_gpu_finetune.py \
-cfg='./configs/beit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-batch_size=256 \
-data_path='/home/aistudio/BEiT/aifood/' \
-eval \
-pretrained='/home/aistudio/data/data144298/beit_base_patch16_224_ft22kto1k.pdparams' \
-amp

In [ ]

# validate the finetuned model
!cd ~/BEiT/ && python main_gpu_finetune.py \
-cfg='./configs/beit_base_patch16_224.yaml' \
-dataset='imagenet2012' \
-batch_size=256 \
-data_path='/home/aistudio/BEiT/aifood/' \
-eval \
-pretrained='/home/aistudio/BEiT/output/train-20220511-09-34/Epoch-15-Loss-0.2563522930145264.pdparams' \
-amp

IV. The BEiT code, block by block

If you've read this far, you truly love BEiT!

In PaddleViT, or in Paddle's self-supervised library PASSL, running BEiT training from the terminal feels like glimpsing the dragon's head but never its tail: what the model really is, and how the paper's ideas are realized in Paddle code, remains out of sight and out of reach.

To make the code easy to browse and study, the BEiT code is broken into notebook cells block by block, each followed by a small verification snippet. Watching the output shapes deepens our understanding of the code.

1. DropPath and MLP

In [ ]

import numpy as np
np.random.seed(42)

In [ ]

#   Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth
"""
import paddle
import paddle.nn as nn


class DropPath(nn.Layer):
    """DropPath class"""
    def __init__(self, drop_prob=None):
        super().__init__()
        self.drop_prob = drop_prob

    def drop_path(self, inputs):
        """drop path op
        Args:
            input: tensor with arbitrary shape
            drop_prob: float number of drop path probability, default: 0.0
            training: bool, if current mode is training, default: False
        Returns:
            output: output tensor after drop path
        """
        # if prob is 0 or eval mode, return original input
        if self.drop_prob == 0. or not self.training:
            return inputs
        keep_prob = 1 - self.drop_prob
        keep_prob = paddle.to_tensor(keep_prob, dtype='float32')
        shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1)  # shape=(N, 1, 1, 1)
        random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype)
        random_tensor = random_tensor.floor() # mask
        output = inputs.divide(keep_prob) * random_tensor # divide to keep same output expectation
        return output

    def forward(self, inputs):
        return self.drop_path(inputs)


def main():
    tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32')
    dp = DropPath(0.5)
    out = dp(tmp)
    print(out.shape)

if __name__ == "__main__":
    main()
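A quick sanity check on the divide-by-keep_prob trick: because surviving samples are scaled by 1/keep_prob, the output mean should stay close to the input mean over a large batch. A small sketch, reusing the classes above:

tmp = paddle.to_tensor(np.random.rand(512, 16, 8, 8), dtype='float32')
dp = DropPath(0.5)
dp.train()                                   # drop path only fires in training mode
out = dp(tmp)
print(float(tmp.mean()), float(out.mean()))  # both close to 0.5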

In [ ]

# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
BEiT in Paddle

A Paddle Implementation of BEiT as described in:

"BEiT: BERT Pre-Training of Image Transformers"
    - Paper Link: https://arxiv.org/abs/2106.08254 
"""
import math
import copy
from functools import partial
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
# from droppath import DropPath

trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02)
zeros_ = nn.initializer.Constant(value=0.0)
ones_ = nn.initializer.Constant(value=1.0)


class Mlp(nn.Layer):
    """MLP module

    MLP using nn.Linear and activation is GELU, dropout is applied.
    Ops: fc1 -> act -> dropout -> fc2 -> dropout

    """

    def __init__(self,
                 in_features,
                 hidden_features=None,
                 out_features=None,
                 act_layer=nn.GELU,
                 drop=0.0):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

def main():
    tmp = paddle.to_tensor(np.random.rand(8, 16), dtype='float32')
    mlp = Mlp(16, 32, 512)
    out = mlp(tmp)
    print(out.shape)

if __name__ == "__main__":
    main()

2. PatchEmbed

In [ ]


class PatchEmbed(nn.Layer):
    """2D Image to Patch Embedding

    Apply patch embeddings on input images. Embeddings is implemented using a Conv2D op.

    """
    def __init__(self,
                 img_size=224,
                 patch_size=16,
                 in_chans=3,
                 embed_dim=768,
                 norm_layer=None,
                 flatten=True):
        super().__init__()
        img_size = (img_size, img_size)
        patch_size = (patch_size, patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten

        self.proj = nn.Conv2D(
            in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
        )
        self.norm = norm_layer(embed_dim) if norm_layer else Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert (
            H == self.img_size[0] and W == self.img_size[1]
        ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})"
        x = self.proj(x)
        # print(x.shape)
        if self.flatten:
            x = x.flatten(2).transpose((0, 2, 1))  # BCHW -> BNC
        # print(x.shape)
        x = self.norm(x)
        return x


class Identity(nn.Layer):
    """Identity layer

    The output of this layer is the input without any change.
    Use this layer to avoid if condition in some forward methods

    """
    def forward(self, inputs):
        return inputs

def main():
    import numpy as np
    tmp = paddle.to_tensor(np.random.rand(16, 3, 224, 224), dtype=paddle.float32)
    # print(tmp.shape, tmp.size)
    patchembed = PatchEmbed(flatten=True)
    out = patchembed(tmp)
    print(out.shape)
    
if __name__ == "__main__":
    main()
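The printed shape [16, 196, 768] follows directly from the patch arithmetic: 224 / 16 = 14 patches per side, and 14 * 14 = 196 tokens, each projected to 768 dimensions by the Conv2D:

img_size, patch_size, embed_dim = 224, 16, 768
grid = img_size // patch_size     # 14 patches per side
print(grid * grid, embed_dim)     # 196 768, i.e. an output of [B, 196, 768]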

3. The Attention module

In [ ]


class Attention(nn.Layer):
    """Attention Layer"""
    def __init__(self,
                 dim,
                 num_heads=8,
                 qkv_bias=False,
                 attn_drop=0.0,
                 proj_drop=0.0,
                 window_size=None,
                 attn_head_dim=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, all_head_dim * 3, bias_attr=False)
        if qkv_bias:
            self.q_bias = paddle.create_parameter(
                shape=[all_head_dim], dtype="float32", default_initializer=zeros_
            )

            self.v_bias = paddle.create_parameter(
                shape=[all_head_dim], dtype="float32", default_initializer=zeros_
            )
        else:
            self.q_bias = None
            self.v_bias = None

        if window_size:
            self.window_size = window_size
            self.num_relative_distance = (2 * window_size[0] - 1) * (
                2 * window_size[1] - 1
            ) + 3

            self.relative_position_bias_table = paddle.create_parameter(
                shape=[self.num_relative_distance, num_heads],
                dtype="float32",
                default_initializer=zeros_,
            )  # 2*Wh-1 * 2*Ww-1, nH
            # cls to token & token 2 cls & cls to cls

            # get pair-wise relative position index for each token inside the window
            coords_h = paddle.arange(window_size[0])
            coords_w = paddle.arange(window_size[1])
            coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
            coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
            relative_coords = coords_flatten.unsqueeze(
                axis=2
            ) - coords_flatten.unsqueeze(
                axis=1
            )  # 2, Wh*Ww, Wh*Ww #??
            relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
            # print(f"relative_coords[:, :, 0] relative_coords.shape{relative_coords.shape}window_size[0] - 1{window_size[0] - 1}")
            # print(f"==relative_coords type:{relative_coords.dtype}")
            relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
            relative_coords[:, :, 1] += window_size[1] - 1
            relative_coords[:, :, 0] *= 2 * window_size[1] - 1
            relative_position_index = paddle.zeros(
                [
                    window_size[0] * window_size[1] + 1,
                    window_size[0] * window_size[1] + 1,
                ],
                dtype=relative_coords.dtype,
            )
            # Wh*Ww, Wh*Ww
            relative_position_index[1:, 1:] = relative_coords.sum(-1)
            relative_position_index[0, 0:] = self.num_relative_distance - 3
            relative_position_index[0:, 0] = self.num_relative_distance - 2
            relative_position_index[0, 0] = self.num_relative_distance - 1
            # print(f"==relative_position_index .stop_gradient:{relative_position_index.stop_gradient}")
            self.register_buffer("relative_position_index", relative_position_index)
            # print(f"==relative_position_index .stop_gradient:{relative_position_index.stop_gradient}")

        else:
            self.window_size = None
            self.relative_position_bias_table = None
            self.relative_position_index = None

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(all_head_dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, rel_pos_bias):
        B, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            # print(f"==concat {self.q_bias.shape, paddle.zeros_like(self.v_bias).shape, self.v_bias.shape}")
            qkv_bias = paddle.concat(
                (self.q_bias, paddle.zeros_like(self.v_bias), self.v_bias)
            )
        # print(f"==qkv = mslinear {x.shape, self.qkv.weight.shape}")
        qkv = F.linear(x=x, weight=self.qkv.weight, bias=qkv_bias)
        # print(f"==paddle.shape(x)[0]{paddle.shape(x), paddle.shape(x)[0]}")
        qkv = qkv.reshape([paddle.shape(x)[0], paddle.shape(x)[1], 3, self.num_heads, -1]).transpose([2, 0, 3, 1, 4])
        #qkv = qkv.reshape([B, N, 3, self.num_heads, -1]).transpose([2, 0, 3, 1, 4])
        # make torchscript happy (cannot use tensor as tuple)
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * self.scale
        # print("==q k:", q.shape, k.shape)
        attn = q @ k.transpose([0, 1, 3, 2])

        if self.relative_position_bias_table is not None:
            relative_position_bias = self.relative_position_bias_table[
                self.relative_position_index.reshape([-1])
            ].reshape(
                [
                    self.window_size[0] * self.window_size[1] + 1,
                    self.window_size[0] * self.window_size[1] + 1,
                    -1,
                ]
            )  # Wh*Ww,Wh*Ww,nH
            relative_position_bias = relative_position_bias.transpose(
                [2, 0, 1]
            )  # nH, Wh*Ww, Wh*Ww

            attn = attn + relative_position_bias.unsqueeze(axis=0)

        if rel_pos_bias is not None:
            attn = attn + rel_pos_bias

        attn = F.softmax(attn, axis=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose([0, 2, 1, 3]).reshape([paddle.shape(x)[0], paddle.shape(x)[1], -1])
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

def main():
    import numpy as np
    tmp = paddle.to_tensor(np.random.rand(196, 16, 768), dtype=paddle.float32)
    # print(tmp.shape, tmp.size)
    attention = Attention(dim=768 )
    out = attention(tmp, rel_pos_bias=0.1)
    print(out.shape)
    
if __name__ == "__main__":
    main()
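The size of the relative-position table used above is easy to verify by hand: a 14x14 patch grid (224/16) allows (2*14-1)^2 = 729 distinct in-window offsets, plus 3 special entries for cls-to-token, token-to-cls, and cls-to-cls:

window_size = (14, 14)
num_relative_distance = (2 * window_size[0] - 1) * (2 * window_size[1] - 1) + 3
print(num_relative_distance)  # 732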

4. The Block class

In [ ]

class Block(nn.Layer):
    def __init__(self,
                 dim,
                 num_heads,
                 mlp_ratio=4.0,
                 qkv_bias=False,
                 drop=0.0,
                 attn_drop=0.0,
                 drop_path=0.0,
                 init_values=None,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm,
                 window_size=None,
                 attn_head_dim=None):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            attn_drop=attn_drop,
            proj_drop=drop,
            window_size=window_size,
            attn_head_dim=attn_head_dim,
        )
        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(
            in_features=dim,
            hidden_features=mlp_hidden_dim,
            act_layer=act_layer,
            drop=drop,
        )

        if init_values:
            self.gamma_1 = paddle.create_parameter(
                shape=[dim],
                dtype="float32",
                default_initializer=nn.initializer.Constant(value=init_values),
            )
            self.gamma_2 = paddle.create_parameter(
                shape=[dim],
                dtype="float32",
                default_initializer=nn.initializer.Constant(value=init_values),
            )
        else:
            self.gamma_1, self.gamma_2 = None, None

    def forward(self, x, rel_pos_bias):
        if self.gamma_1 is None:
            x = x + self.drop_path(self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(
                self.gamma_1 * self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias)
            )
            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
        return x

def main():
    import numpy as np
    tmp = paddle.to_tensor(np.random.rand(196, 16, 768), dtype=paddle.float32)
    # print(tmp.shape, tmp.size)
    block = Block(dim=768, num_heads=12 )
    out = block(tmp, rel_pos_bias=0.1)
    print(out.shape)
    
if __name__ == "__main__":
    main()

5. RelativePositionBias

This class is not invoked in this project (it is only used when use_shared_rel_pos_bias=True).

In [ ]


class RelativePositionBias(nn.Layer):
    def __init__(self, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.num_relative_distance = (2 * window_size[0] - 1) * (
            2 * window_size[1] - 1
        ) + 3

        self.relative_position_bias_table = paddle.create_parameter(
            shape=[self.num_relative_distance, num_heads],
            dtype="float32",
            default_initializer=zeros_,
        )  # 2*Wh-1 * 2*Ww-1, nH
        # cls to token & token 2 cls & cls to cls

        # get pair-wise relative position index for each token inside the window
        coords_h = paddle.arange(window_size[0])
        coords_w = paddle.arange(window_size[1])
        coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
        coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze(
            axis=1
        )  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * window_size[1] - 1
        relative_position_index = paddle.zeros(
            [window_size[0] * window_size[1] + 1, window_size[0] * window_size[1] + 1],
            dtype='int64',  # must be an integer dtype to be usable as an index
        )
        relative_position_index[1:, 1:] = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        relative_position_index[0, 0:] = self.num_relative_distance - 3
        relative_position_index[0:, 0] = self.num_relative_distance - 2
        relative_position_index[0, 0] = self.num_relative_distance - 1

        self.register_buffer("relative_position_index", relative_position_index)

        # trunc_normal_(self.relative_position_bias_table, std=.02)

    def forward(self):
        # note: reshape expects a single list of dims
        relative_position_bias = self.relative_position_bias_table[
            self.relative_position_index.reshape([-1])].reshape(
                [self.window_size[0] * self.window_size[1] + 1,
                 self.window_size[0] * self.window_size[1] + 1, -1])  # Wh*Ww,Wh*Ww,nH
        return relative_position_bias.transpose([2, 0, 1])  # nH, Wh*Ww, Wh*Ww

6. The Beit class

In [ ]

class Beit(nn.Layer):
    """Beit Layer"""
    def __init__(self,
                 img_size=224,
                 patch_size=16,
                 in_chans=3,
                 num_classes=1000,
                 embed_dim=768,
                 depth=12,
                 num_heads=12,
                 mlp_ratio=4.0,
                 qkv_bias=True,
                 drop_rate=0.0,
                 attn_drop_rate=0.0,
                 drop_path_rate=0.0,
                 norm_layer=partial(nn.LayerNorm, epsilon=1e-6),
                 init_values=None,
                 use_abs_pos_emb=True,
                 use_rel_pos_bias=False,
                 use_shared_rel_pos_bias=False,
                 use_mean_pooling=True,
                 init_scale=0.001):
        super().__init__()
        self.num_classes = num_classes
        # num_features for consistency with other models
        self.num_features = self.embed_dim = embed_dim

        self.patch_embed = PatchEmbed(
            img_size=img_size,
            patch_size=patch_size,
            in_chans=in_chans,
            embed_dim=embed_dim,
        )
        num_patches = self.patch_embed.num_patches

        self.cls_token = paddle.create_parameter(
            shape=[1, 1, embed_dim],
            dtype="float32",
            default_initializer=trunc_normal_,
        )

        if use_abs_pos_emb:
            self.pos_embed = paddle.create_parameter(
                shape=[1, num_patches + 1, embed_dim],
                dtype="float32",
                default_initializer=trunc_normal_,
            )
        else:
            self.pos_embed = None
        self.pos_drop = nn.Dropout(p=drop_rate)

        if use_shared_rel_pos_bias:
            self.rel_pos_bias = RelativePositionBias(
                window_size=self.patch_embed.grid_size, num_heads=num_heads
            )
        else:
            self.rel_pos_bias = None

        # stochastic depth decay rule
        dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)]
        self.use_rel_pos_bias = use_rel_pos_bias
        self.blocks = nn.LayerList(
            [
                Block(
                    dim=embed_dim,
                    num_heads=num_heads,
                    mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias,
                    drop=drop_rate,
                    attn_drop=attn_drop_rate,
                    drop_path=dpr[i],
                    norm_layer=norm_layer,
                    init_values=init_values,
                    window_size=self.patch_embed.grid_size if use_rel_pos_bias else None,
                )
                for i in range(depth)
            ]
        )
        self.norm = Identity() if use_mean_pooling else norm_layer(embed_dim)
        self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity()

        self.apply(self._init_weights)
        self.fix_init_weight()
        if isinstance(self.head, nn.Linear):
            trunc_normal_(self.head.weight)
            self.head.weight.set_value(
                self.head.weight.multiply(paddle.to_tensor(init_scale))
            )
            self.head.bias.set_value(
                self.head.bias.multiply(paddle.to_tensor(init_scale))
            )

    def fix_init_weight(self):
        def rescale(param, layer_id):
            param.set_value(param.divide(paddle.to_tensor(math.sqrt(2.0 * layer_id))))

        for layer_id, layer in enumerate(self.blocks):
            rescale(layer.attn.proj.weight, layer_id + 1)
            rescale(layer.mlp.fc2.weight, layer_id + 1)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight)
            if isinstance(m, nn.Linear) and m.bias is not None:
                zeros_(m.bias)
        elif isinstance(m, nn.LayerNorm):
            zeros_(m.bias)
            ones_(m.weight)

    def get_num_layers(self):
        return len(self.blocks)

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes):
        self.num_classes = num_classes
        self.head = (
            nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else Identity()
        )

    def forward_features(self, x):
        x = self.patch_embed(x)
        batch_size, seq_len, _ = x.shape

        #cls_tokens = self.cls_token.expand([batch_size, 1, self.embed_dim])
        cls_tokens = self.cls_token.expand([paddle.shape(x)[0], 1, self.embed_dim])
        #cls_tokens = self.cls_token.expand([batch_size, -1, -1])

        x = paddle.concat((cls_tokens, x), axis=1)

        if self.pos_embed is not None:
            x = x + self.pos_embed
        x = self.pos_drop(x)

        rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
        for blk in self.blocks:
            x = blk(x, rel_pos_bias=rel_pos_bias)

        x = self.norm(x)
        if self.fc_norm is not None:
            t = x[:, 1:, :]
            return self.fc_norm(t.mean(1))

        return x[:, 0]

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x)
        return x


def build_beit(config):
    """ build beit from config"""
    model = Beit(
        img_size=config.DATA.IMAGE_SIZE,
        num_classes=config.MODEL.NUM_CLASSES,
        patch_size=config.MODEL.PATCH_SIZE,
        embed_dim=config.MODEL.EMBED_DIM,
        depth=config.MODEL.DEPTH,
        num_heads=config.MODEL.NUM_HEADS,
        mlp_ratio=config.MODEL.MLP_RATIO,
        use_abs_pos_emb=config.MODEL.USE_ABS_POS_EMB,
        use_rel_pos_bias=config.MODEL.USE_REL_POS_BIAS,
        init_values=config.MODEL.INIT_VALUES,
        qkv_bias=config.MODEL.QKV_BIAS,
    )
    return model
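As a quick sanity check (a sketch only, using the classes defined in the cells above with the BEiT-Base defaults of this Beit class), the model can also be built directly and its parameters counted:

import numpy as np

model = Beit(num_classes=10,
             use_abs_pos_emb=False,
             use_rel_pos_bias=True,
             init_values=1e-4)
num_params = sum(np.prod(p.shape) for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 86M for BEiT-Base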

7. Loading the configuration

In [ ]

!pip install yacs -q

In [ ]

# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Configuration
Configurations for (1) data processing, (2) model architecture, and (3) training settings, etc.
Config can be set by .yaml file or by argparser
"""
import os
from yacs.config import CfgNode as CN
import yaml

_C = CN()
_C.BASE = ['']

# data settings
_C.DATA = CN()
_C.DATA.BATCH_SIZE = 256  # train batch_size on single GPU
_C.DATA.BATCH_SIZE_EVAL = None  # (disabled in update_config) val batch_size on single GPU
_C.DATA.DATA_PATH = '/dataset/imagenet/'  # path to dataset
_C.DATA.DATASET = 'imagenet2012'  # dataset name, currently only support imagenet2012
_C.DATA.IMAGE_SIZE = 224  # input image size e.g., 224
_C.DATA.SECOND_IMAGE_SIZE = 112  # 2nd input image size e.g., 112
_C.DATA.IMAGE_CHANNELS = 3  # input image channels: e.g., 3
_C.DATA.CROP_PCT = 0.875  # input image scale ratio, scale is applied before centercrop in eval mode
_C.DATA.NUM_WORKERS = 1  # number of data loading threads
_C.DATA.IMAGENET_MEAN = [0.5, 0.5, 0.5]  # [0.485, 0.456, 0.406]  # imagenet mean values
_C.DATA.IMAGENET_STD = [0.5, 0.5, 0.5]  # [0.229, 0.224, 0.225]  # imagenet std values

# model general settings
_C.MODEL = CN()
_C.MODEL.TYPE = 'beit'
_C.MODEL.VAE_TYPE = 'dall-e'
_C.MODEL.NAME = 'beit'
_C.MODEL.RESUME = None  # full model path for resume training
_C.MODEL.PRETRAINED = None  # full model path for finetuning
_C.MODEL.NUM_CLASSES = 10  # num of classes for classifier # 1000
_C.MODEL.DROPOUT = 0.0
_C.MODEL.ATTENTION_DROPOUT = 0.0
_C.MODEL.DROPPATH = 0.1
# model transformer settings
_C.MODEL.PATCH_SIZE = 16
_C.MODEL.EMBED_DIM = 768
_C.MODEL.NUM_HEADS = 12
_C.MODEL.ATTN_HEAD_SIZE = None  # if None, use embed_dim // num_heads as head dim
_C.MODEL.DEPTH = 12
_C.MODEL.QK_SCALE = None
_C.MODEL.QKV_BIAS = True
_C.MODEL.MLP_RATIO = 4.0  # for cait class_token ratio also set to MLP_RATIO
_C.MODEL.USE_ABS_POS_EMB = False
_C.MODEL.USE_REL_POS_BIAS = True
_C.MODEL.INIT_VALUES = 1e-4


# training settings
_C.TRAIN = CN()
_C.TRAIN.LAST_EPOCH = 0
_C.TRAIN.NUM_EPOCHS = 100
_C.TRAIN.WARMUP_EPOCHS = 20
_C.TRAIN.WEIGHT_DECAY = 0.05
_C.TRAIN.LAYER_DECAY = 0.65
_C.TRAIN.BASE_LR = 4e-3
_C.TRAIN.WARMUP_START_LR = 0.0
_C.TRAIN.END_LR = 1e-6
_C.TRAIN.GRAD_CLIP = None
_C.TRAIN.ACCUM_ITER = 1
_C.TRAIN.LINEAR_SCALED_LR = 512

# optimizer
_C.TRAIN.OPTIMIZER = CN()
_C.TRAIN.OPTIMIZER.NAME = 'AdamWDL'
_C.TRAIN.OPTIMIZER.EPS = 1e-8
_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999)

# model ema
_C.TRAIN.MODEL_EMA = True
_C.TRAIN.MODEL_EMA_DECAY = 0.9999
_C.TRAIN.MODEL_EMA_FORCE_CPU = False

# data augmentation (optional, check datasets.py)
_C.TRAIN.SMOOTHING = 0.1
_C.TRAIN.COLOR_JITTER = 0.4  # if both auto augment and rand augment are False, use color jitter
_C.TRAIN.AUTO_AUGMENT = False  # rand augment is used if both rand and auto augment are set True
_C.TRAIN.RAND_AUGMENT = True
_C.TRAIN.RAND_AUGMENT_LAYERS = 2
_C.TRAIN.RAND_AUGMENT_MAGNITUDE = 9  # scale from 0 to 9
# mixup params (optional, check datasets.py)
_C.TRAIN.MIXUP_ALPHA = 0.8
_C.TRAIN.MIXUP_PROB = 1.0
_C.TRAIN.MIXUP_SWITCH_PROB = 0.5
_C.TRAIN.MIXUP_MODE = 'batch'
_C.TRAIN.CUTMIX_ALPHA = 1.0
_C.TRAIN.CUTMIX_MINMAX = None
# random erase params (optional, check datasets.py)
_C.TRAIN.RANDOM_ERASE_PROB = 0.25
_C.TRAIN.RANDOM_ERASE_MODE = 'pixel'
_C.TRAIN.RANDOM_ERASE_COUNT = 1
_C.TRAIN.RANDOM_ERASE_SPLIT = False

# misc
_C.SAVE = "./output"  # output folder, saves logs and weights
_C.SAVE_FREQ = 15  # freq to save chpt
_C.REPORT_FREQ = 20  # freq to logging info
_C.VALIDATE_FREQ = 1  # freq to do validation
_C.SEED = 0  # random seed
_C.EVAL = False  # run evaluation only
_C.AMP = False  # auto mix precision training


def _update_config_from_file(config, cfg_file):
    """Load cfg file (.yaml) and update config object

    Args:
        config: config object
        cfg_file: config file (.yaml)
    Return:
        None
    """
    config.defrost()
    with open(cfg_file, 'r') as infile:
        yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader)
    for cfg in yaml_cfg.setdefault('BASE', ['']):
        if cfg:
            _update_config_from_file(
                config, os.path.join(os.path.dirname(cfg_file), cfg)
            )
    config.merge_from_file(cfg_file)
    config.freeze()


def update_config(config, args):
    """Update config by ArgumentParser
    Configs that are often used can be updated from arguments
    Args:
        args: ArgumentParser contains options
    Return:
        config: updated config
    """
    if args.cfg:
        _update_config_from_file(config, args.cfg)
    config.defrost()
    if args.dataset:
        config.DATA.DATASET = args.dataset
    if args.batch_size:
        config.DATA.BATCH_SIZE = args.batch_size
        config.DATA.BATCH_SIZE_EVAL = args.batch_size
    if args.batch_size_eval:
        config.DATA.BATCH_SIZE_EVAL = args.batch_size_eval
    if args.image_size:
        config.DATA.IMAGE_SIZE = args.image_size
    if args.accum_iter:
        config.TRAIN.ACCUM_ITER = args.accum_iter
    if args.data_path:
        config.DATA.DATA_PATH = args.data_path
    if args.output:
        config.SAVE = args.output
    if args.eval:
        config.EVAL = True
    if args.pretrained:
        config.MODEL.PRETRAINED = args.pretrained
    if args.resume:
        config.MODEL.RESUME = args.resume
    if args.last_epoch:
        config.TRAIN.LAST_EPOCH = args.last_epoch
    if args.amp:  # only for training
        config.AMP = not config.EVAL
    # config.freeze()
    return config


def get_config(cfg_file=None):
    """Return a clone of config and optionally overwrite it from yaml file"""
    config = _C.clone()
    if cfg_file:
        _update_config_from_file(config, cfg_file)
    return config
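Typical usage looks like the sketch below; since the config is a yacs CfgNode, individual keys can also be overridden programmatically:

config = get_config()                    # a clone of the defaults above
print(config.MODEL.NUM_CLASSES)          # 10
config.defrost()
config.merge_from_list(['MODEL.NUM_CLASSES', 100])
print(config.MODEL.NUM_CLASSES)          # 100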

8. Building the model

Build the model from the args; the argparse code is modified so it can run inside a notebook.

The change: give the argument-parsing call at least one argument, i.e. arguments = parser.parse_args(['-cfg', "beit_base_patch16_224.yaml"]).

In [ ]

import argparse
def get_arguments():
    """return argumeents, this will overwrite the config by (1) yaml file (2) argument values"""
    parser = argparse.ArgumentParser('BEiT finetune')
    parser.add_argument('-cfg', type=str, default=None)
    parser.add_argument('-dataset', type=str, default=None)
    parser.add_argument('-data_path', type=str, default=None)
    parser.add_argument('-output', type=str, default=None)
    parser.add_argument('-batch_size', type=int, default=None)
    parser.add_argument('-batch_size_eval', type=int, default=None)
    parser.add_argument('-image_size', type=int, default=None)
    parser.add_argument('-accum_iter', type=int, default=None)
    parser.add_argument('-pretrained', type=str, default=None)
    parser.add_argument('-resume', type=str, default=None)
    parser.add_argument('-last_epoch', type=int, default=None)
    parser.add_argument('-eval', action='store_true')
    parser.add_argument('-amp', action='store_true')
    arguments = parser.parse_args(['-cfg', "BEiT/beit_base_patch16_224.yaml"])
    return arguments

config = update_config(get_config(), get_arguments())
# config = args[0]
build_model = build_beit
model = build_model(config)

9. Final step: test the model's forward pass

Feed a random Tensor through the model. The output shape is [8, 1000]: 8 is the batch_size and 1000 the number of classes (with NUM_CLASSES = 10 from the config above, you would see [8, 10] instead).

At this point, our walk through the code is complete!

In [ ]

images = paddle.randn([8, 3, 224, 224])
label = 2

output = model(images)
print(output.shape)

And that wraps up our study of the BEiT code!

Thanks for your hard work!

10. A few small Paddle functions worth learning

1) paddle.linspace

This op returns a Tensor whose values are num evenly spaced points over the interval from start to stop; the output Tensor has length num.

In [ ]

drop_path_rate=0.5
depth = 8
tmp = paddle.linspace(0, drop_path_rate, depth)
print(tmp)

2) The linear function

The linear function is defined as paddle.matmul(x, weight) + bias.

The code below shows that the two formulations produce equal results.

In [ ]

import paddle

x = paddle.ones([3, 2]) * 2
weight = paddle.full(shape=[2, 4], fill_value="0.5", dtype="float32", name="weight")
weight = weight * 4
bias = paddle.ones(shape=[4], dtype="float32", name="bias")
bias = bias + 0.88
y = paddle.nn.functional.linear(x, weight, bias)
# every element of y is 2*2*2 + 1.88 = 9.88
print(x.shape, y.shape)
print(y == paddle.matmul(x, weight) + bias)

3) Building a special coordinate tensor

Use meshgrid and stack to generate window-sized coordinate tensors from index ranges, stack them together, and then flatten the trailing two dimensions with flatten.

In [ ]

window_size = [3, 4]
coords_h = paddle.arange(window_size[0])
coords_w = paddle.arange(window_size[1])
# print(coords_h, coords_w)
coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
print(coords)
coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
print(coords_flatten)

The coordinate tensor is unsqueezed at axis=2 and axis=1 respectively, and the difference is taken; after broadcasting, this yields a 3D tensor of relative coordinates.

In [ ]

relative_coords = coords_flatten.unsqueeze(
                axis=2
            ) - coords_flatten.unsqueeze(
                axis=1
            ) 
# relative_coords = coords_flatten.unsqueeze(axis=2 )
relative_coords

4) Weight initialization in Paddle

In [ ]

import paddle
import paddle.nn as nn

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))

def init_weights(layer):
    if type(layer) == nn.Linear:
        print('before init weight:', layer.weight.numpy())
        new_weight = paddle.full(shape=layer.weight.shape, dtype=layer.weight.dtype, fill_value=0.9)
        layer.weight.set_value(new_weight)
        print('after init weight:', layer.weight.numpy())

net.apply(init_weights)

print(net.state_dict())

5) paddle.expand

Expands x to the shape specified by shape; after expansion, x's shape equals the specified shape.

In [ ]

import paddle

data = paddle.to_tensor([1, 2, 3], dtype='int32')
out = paddle.expand(data, shape=[2, 3])
print(out)
# [[1, 2, 3], [1, 2, 3]]

V. Debugging and troubleshooting

Error: module 'paddlenlp.ops.optimizer' has no attribute 'AdamWDL'

The first instinct is to upgrade PaddleNLP to the latest version, which indeed has AdamWDL, but that triggers the error below.

Error: cannot import name 'load_dataset' from 'datasets'

[2022-05-05 22:35:44,247] [ WARNING] - Detected that datasets module was imported before paddlenlp. This may cause PaddleNLP datasets to be unavalible in intranetPlease import paddlenlp before datasets module to avoid download issues
...
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/datasets/dataset.py", line 48, in <module>
    from datasets import load_dataset as origin_load_dataset
ImportError: cannot import name 'load_dataset' from 'datasets' (/home/aistudio/BEiT/datasets.py)

Unable to untangle this, I stopped depending on paddlenlp altogether: the needed functions were copied out into a tmpadam directory, imported with import tmpadam, and used during training as optimizer = tmpadam.AdamWDL.
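The workaround boils down to an import fallback along these lines (a sketch; tmpadam is the local copy described above):

try:
    from paddlenlp.ops.optimizer import AdamWDL  # present in newer PaddleNLP releases
except ImportError:
    from tmpadam import AdamWDL                  # local copy extracted from PaddleNLP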

In background-task mode, the log stops showing output once it reaches the tar unpacking step

Not sure whether it was just a display problem or an actual hang. Swapping in the unzip command also got stuck; frustrating. So I gave up on background-task mode and ran everything in the notebook instead; since the whole run only takes about 2 hours anyway, losing background mode is no big deal.

Could not set 10 classes

Training errored out complaining of mismatched shapes. A careful check of the config turned up nothing. It turned out the Mixup function defaults to num_classes=1000; adding num_classes=config.TRAIN.NUM_CLASSES to the call fixed the problem.

    if (config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or
        config.TRAIN.CUTMIX_MINMAX is not None):
        mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA,
                         cutmix_alpha=config.TRAIN.CUTMIX_ALPHA,
                         cutmix_minmax=config.TRAIN.CUTMIX_MINMAX,
                         prob=config.TRAIN.MIXUP_PROB,
                         switch_prob=config.TRAIN.MIXUP_SWITCH_PROB,
                         mode=config.TRAIN.MIXUP_MODE,
                         label_smoothing=config.TRAIN.SMOOTHING,
                         num_classes=config.TRAIN.NUM_CLASSES)

Reference

@article{beit,
      title={{BEiT}: {BERT} Pre-Training of Image Transformers}, 
      author={Hangbo Bao and Li Dong and Furu Wei},
      year={2021},
      eprint={2106.08254},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Closing words

Use PaddlePaddle and make waves! Let's paddle our oars and ride the wind across the ocean of AI!

PaddlePaddle homepage: https://www.paddlepaddle.org.cn

My abilities are limited, so shortcomings are inevitable; corrections and help are very welcome.

Author: Duan Chunhua, online handle skywalk (天马行空), AI architect at Jining Jikuai Software Technology Co., Ltd. and a Baidu PaddlePaddle PPDE.

I've reached the top tier on AI Studio with 11 badges; come follow me! https://aistudio.baidu.com/aistudio/personalcenter/thirdview/141218
