🔥🔥🔥 Welcome to PASSL, PaddlePaddle's self-supervised learning library

PASSL includes contrastive self-supervised algorithms such as SimCLR, MoCo, BYOL, and CLIP, as well as vision Transformer models such as Vision Transformer (ViT), Swin Transformer, BEiT, CvT, T2T-ViT, and MLP-Mixer

BEiT (arXiv, code)

Hi guys, we meet again. This time we're tackling the BEiT model; let's walk through its architecture.

BEiT is BERT for images. It is similar to ViT, except that during training random patches of the input image are masked; through this masking scheme, the model learns to predict the correct visual tokens of an image even when the input is corrupted.

The best BEiT model currently reaches 88.60% top-1 accuracy, possibly the highest among open-source Paddle models to date.

# Starting from BERT

The authors named the model BEiT; coming after BERT, the intent is clearly to replicate BERT's success in CV.

So what is BERT?

BERT is an NLP model proposed by Google in 2018.

BERT posted astonishing results on SQuAD 1.1, the top benchmark for machine reading comprehension, surpassing human performance on both of its metrics, and it set new state-of-the-art results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (a 7.6% absolute improvement) and reaching 86.7% accuracy on MultiNLI (a 5.6% absolute improvement). It was clear that BERT would be a milestone for NLP, and it remains one of the field's most important advances of that period.

===> Takeaways: deep and narrow models work well, and MLM (masked language modeling) came to dominate NLP

🔥 BERT --> BEiT: the evolution from NLP to CV

In NLP, BERT converts each word into tokens (with its WordPiece tokenizer), randomly masks some of those tokens, and trains the model to predict the masked tokens from the incomplete sentence. For example, given "Paris is the [MASK] of France", the model should predict "capital".

So how do we do the same thing in computer vision?

The vision community has long had a similar mask-then-predict idea; image inpainting, for example, is a classic self-supervised task.

Inpainting masks part of an image and asks the network to fill the missing region back in.

So how does BEiT do it?

Let's walk through BEiT's pipeline, which roughly splits into two parts:

* Masked Image

First, exactly as in the familiar ViT family, the original image is split into image patches (the PPViT open course covers this step if you want a refresher), as sketched below.
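
As a quick illustration, here is patchifying in raw tensor ops (a minimal sketch; the PatchEmbed module defined later in this article does the same thing with a strided Conv2D):

```python
import paddle

# a 224x224 image with 16x16 patches yields 14 * 14 = 196 patches
x = paddle.rand([1, 3, 224, 224])
patches = x.reshape([1, 3, 14, 16, 14, 16])        # B, C, gh, ph, gw, pw
patches = patches.transpose([0, 2, 4, 3, 5, 1])    # B, gh, gw, ph, pw, C
patches = patches.reshape([1, 196, 16 * 16 * 3])   # B, N, patch_dim
print(patches.shape)  # [1, 196, 768]
```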

Next comes BEiT's clever trick.

A subset of the image patches is chosen for masking, and the result is flattened.

The choice of patches is not sampled patch-by-patch at random, though: BEiT uses a blockwise masking scheme, which masks contiguous blocks of patches, as sketched below.
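
Here is a simplified sketch of the blockwise masking idea (a hypothetical helper, not PASSL's exact MaskingGenerator): rectangular blocks of patches are masked until the budget is reached.

```python
import math
import random
import numpy as np

def blockwise_masking(h, w, num_mask, min_block=16, max_aspect=3.0):
    """Mask rectangular blocks of patches until ~num_mask patches are covered."""
    mask = np.zeros((h, w), dtype=bool)
    while mask.sum() < num_mask:
        # sample a block area and a log-uniform aspect ratio
        target = random.uniform(min_block, num_mask - mask.sum() + min_block)
        log_aspect = random.uniform(-math.log(max_aspect), math.log(max_aspect))
        aspect = math.exp(log_aspect)
        bh = int(round(math.sqrt(target * aspect)))
        bw = int(round(math.sqrt(target / aspect)))
        if 0 < bh <= h and 0 < bw <= w:
            top = random.randint(0, h - bh)
            left = random.randint(0, w - bw)
            mask[top:top + bh, left:left + bw] = True
    return mask

bool_masked_pos = blockwise_masking(14, 14, num_mask=75).reshape(-1)  # ~40% of 196 patches
```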


The flattened tokens then go through a few more transformations so they can be fed into the encoder.

After passing through the encoder, we get a sequence of h vectors, one for each masked image patch. For example, if patches 2, 3, 6, 7, and 14 are masked, the BEiT encoder produces the corresponding output vectors h2, h3, h6, h7, and h14, which are fed into an fc layer that predicts a series of numbers.


We'll explain what those numbers mean in a moment; first, here is the forward pass described above (excerpted from PASSL):

class VisionTransformerForMaskedImageModeling(nn.Layer):
        
    ...
    
    def forward_features(self, x, bool_masked_pos):
        x = self.patch_embed(x, bool_masked_pos=bool_masked_pos)
        batch_size, seq_len, _ = x.shape  # .shape is an attribute in Paddle, not a method

        cls_tokens = self.cls_token.expand([batch_size, -1, -1])
        mask_token = self.mask_token.expand([batch_size, seq_len, -1])

        # replace the embeddings of masked patches with the learnable mask_token
        w = bool_masked_pos.unsqueeze(-1).astype(mask_token.dtype)
        x = x * (1 - w) + mask_token * w

        x = paddle.concat((cls_tokens, x), axis=1)              
        if self.pos_embed is not None:
            x = x + self.pos_embed                              
        x = self.pos_drop(x)

        rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
        for blk in self.blocks:
            x = blk(x, rel_pos_bias=rel_pos_bias)

        return self.norm(x)

    def forward(self, x, bool_masked_pos, return_all_tokens=False):
        x = self.forward_features(x, bool_masked_pos=bool_masked_pos)
        x = x[:, 1:]
        if return_all_tokens:
            return self.lm_head(x)
        else:
            # return the masked tokens
            return self.lm_head(x[bool_masked_pos])

* dVAE

Now, what do those predicted numbers mean?

They are called visual tokens.

Where do they come from? The original image is passed through a Tokenizer that produces a sequence of visual tokens. This is where the legendary (discrete) variational autoencoder comes in (readers familiar with GANs will get it immediately). In one sentence: visual tokens are latent variables of the image and can be viewed as an alternative representation of it. Space is limited, so we won't dig into VAEs here.

Visual tokens carry the image's information. Paired with the Tokenizer is a Decoder, which is responsible for reconstructing the original image from the visual tokens, as sketched below.
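
To make the shapes concrete, here is a purely conceptual sketch; BEiT reuses the pre-trained dVAE from DALL-E as its Tokenizer, and the randomly initialized layer below is only a hypothetical stand-in:

```python
import paddle
import paddle.nn as nn

dvae_encoder = nn.Conv2D(3, 8192, kernel_size=16, stride=16)  # image -> code logits (stand-in)

image = paddle.rand([1, 3, 224, 224])
logits = dvae_encoder(image)           # [1, 8192, 14, 14]
visual_tokens = logits.argmax(axis=1)  # [1, 14, 14] discrete ids in [0, 8192)
print(visual_tokens.flatten(1).shape)  # [1, 196]: one visual token per patch
```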

That wraps up the model architecture.

🔥 Self-supervised training of BEiT

=== Congrats, friend, you've been patient enough to make it this far

With the architecture covered, let's look at how the self-supervised training works.

First, the original image is run through the forward pass described above.

The visual tokens produced from the original image are the prediction targets: we train the network so that the visual tokens predicted for the masked patches match the original image's visual tokens. In the self-supervised world, this training style is called generative learning.

During training, the dVAE is optimized first, with what we call a reconstruction loss: optimizing the Tokenizer and Decoder lets the dVAE learn good latent variables while still reconstructing the original image well. The Encoder and the Masked Image Modeling head are then optimized so that they predict the correct visual tokens, as sketched below.
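
To make the second stage concrete, here is a minimal sketch of the MIM objective (illustrative names; it matches the VisionTransformerForMaskedImageModeling.forward shown earlier, which returns lm_head logits for the masked positions only):

```python
import paddle.nn.functional as F

def mim_loss(model, images, bool_masked_pos, visual_tokens):
    logits = model(images, bool_masked_pos)             # [num_masked, vocab_size]
    labels = visual_tokens.flatten(1)[bool_masked_pos]  # tokenizer ids at masked positions
    return F.cross_entropy(logits, labels)
```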

Cumbersome, right? That is exactly why MAE is so impressive; MAE truly embodies "less is more". Still, BEiT is the cornerstone of many later methods built on the MIM (Masked Image Modeling) idea, and it brought a fresh current to a self-supervised field dominated by contrastive learning.

It also showed that NLP's MLM paradigm can indeed transfer to CV.

👍 BEiT Fine-Tuning

Once self-supervised training is done, we have the BEiT Encoder weights; we then attach a classification head for downstream classification tasks.

Concretely, the Encoder weights obtained in the previous step are frozen, and only the weights of the linear classifier are trained, as sketched below.
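
Here is a minimal sketch of that freezing step (it uses the beit_base_patch16_224 factory we define later in this article; variable names are illustrative):

```python
import paddle

m = beit_base_patch16_224()
for name, p in m.named_parameters():
    if not name.startswith("head."):
        p.stop_gradient = True  # frozen: excluded from gradient updates
# only the linear classifier's parameters are given to the optimizer
opt = paddle.optimizer.AdamW(parameters=m.head.parameters(), learning_rate=1e-3)
```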

The official performance numbers are reproduced in the results table near the end of this article.

This project walks you through implementing the BEiT fine-tuning architecture, reproducing the official accuracy, and transferring the model to your own tasks.

Basic modules

  • MLP module
  • DropPath module
  • PatchEmbed module

import math
from functools import partial

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02)
zeros_ = nn.initializer.Constant(value=0.0)
ones_ = nn.initializer.Constant(value=1.0)

def drop_path(inputs, drop_prob=0.0, training=False):
    """drop path op
    Args:
        inputs: tensor with arbitrary shape
        drop_prob: float, drop path probability, default: 0.0
        training: bool, whether the current mode is training, default: False
    Returns:
        output: output tensor after drop path
    """

    if drop_prob == 0.0 or not training:
        return inputs
    keep_prob = 1 - drop_prob
    # one random value per sample, broadcast over all remaining dims
    shape = (inputs.shape[0],) + (1,) * (inputs.ndim - 1)
    random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype)
    random_tensor = random_tensor.floor()  # binarize: 1 with probability keep_prob
    # divide by keep_prob so the expected activation stays unchanged
    # (plain `/` works with a Python float; Tensor.divide expects a Tensor)
    output = inputs / keep_prob * random_tensor
    return output


class DropPath(nn.Layer):
    """DropPath class"""

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, inputs):
        return drop_path(inputs, self.drop_prob, self.training)
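
A quick behavior check for DropPath (relying on the definitions above): in training mode it zeroes whole samples with probability drop_prob and rescales the survivors by 1 / keep_prob, so the expected output equals the input.

```python
layer = DropPath(drop_prob=0.5)
layer.train()
x = paddle.ones([4, 3])
print(layer(x))  # each row is either all zeros or all 2.0
```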

class Mlp(nn.Layer):
    """MLP module

    MLP using nn.Linear and activation is GELU, dropout is applied.
    Ops: fc1 -> act -> dropout -> fc2 -> dropout

    """

    def __init__(
        self,
        in_features,
        hidden_features=None,
        out_features=None,
        act_layer=nn.GELU,
        drop=0.0,
    ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
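
A quick shape check for Mlp (the hidden dim is typically 4x the embedding dim):

```python
mlp = Mlp(in_features=768, hidden_features=3072)
x = paddle.rand([1, 197, 768])
print(mlp(x).shape)  # [1, 197, 768]
```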


class PatchEmbed(nn.Layer):
    """2D Image to Patch Embedding

    Apply patch embeddings on input images. Embeddings is implemented using a Conv2D op.

    """

    def __init__(
        self,
        img_size=224,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        norm_layer=None,
        flatten=True,
    ):
        super().__init__()
        img_size = (img_size, img_size)
        patch_size = (patch_size, patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten

        self.proj = nn.Conv2D(
            in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
        )
        self.norm = norm_layer(embed_dim) if norm_layer else Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert (
            H == self.img_size[0] and W == self.img_size[1]
        ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        x = self.proj(x)
        if self.flatten:
            x = x.flatten(2).transpose((0, 2, 1))  # BCHW -> BNC
        x = self.norm(x)
        return x


class Identity(nn.Layer):
    """Identity layer

    The output of this layer is the input without any change.
    Use this layer to avoid if condition in some forward methods

    """

    def __init__(self):
        super().__init__()

    def forward(self, input):
        return input
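
With Identity available, we can sanity-check PatchEmbed: a 224x224 image with 16x16 patches becomes a sequence of 196 embeddings.

```python
pe = PatchEmbed(img_size=224, patch_size=16, embed_dim=768)
x = paddle.rand([2, 3, 224, 224])
print(pe(x).shape)  # [2, 196, 768]
```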

Network construction

class Attention(nn.Layer):
    """Attention Layer"""

    def __init__(
        self,
        dim,
        num_heads=8,
        qkv_bias=False,
        attn_drop=0.0,
        proj_drop=0.0,
        window_size=None,
        attn_head_dim=None,
    ):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, all_head_dim * 3, bias_attr=False)
        if qkv_bias:

            self.q_bias = paddle.create_parameter(
                shape=[all_head_dim], dtype="float32", default_initializer=zeros_
            )

            self.v_bias = paddle.create_parameter(
                shape=[all_head_dim], dtype="float32", default_initializer=zeros_
            )
        else:
            self.q_bias = None
            self.v_bias = None

        if window_size:
            self.window_size = window_size
            self.num_relative_distance = (2 * window_size[0] - 1) * (
                2 * window_size[1] - 1
            ) + 3

            self.relative_position_bias_table = paddle.create_parameter(
                shape=[self.num_relative_distance, num_heads],
                dtype="float32",
                default_initializer=zeros_,
            )  # 2*Wh-1 * 2*Ww-1, nH

            coords_h = paddle.arange(window_size[0])
            coords_w = paddle.arange(window_size[1])
            coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
            coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
            relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze(axis=1)  # 2, Wh*Ww, Wh*Ww
            relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
            relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
            relative_coords[:, :, 1] += window_size[1] - 1
            relative_coords[:, :, 0] *= 2 * window_size[1] - 1
            relative_position_index = paddle.zeros(
                [
                    window_size[0] * window_size[1] + 1,
                    window_size[0] * window_size[1] + 1,
                ],
                dtype=relative_coords.dtype,
            )
            # Wh*Ww, Wh*Ww
            relative_position_index[1:, 1:] = relative_coords.sum(-1)
            relative_position_index[0, 0:] = self.num_relative_distance - 3
            relative_position_index[0:, 0] = self.num_relative_distance - 2
            relative_position_index[0, 0] = self.num_relative_distance - 1

            self.register_buffer("relative_position_index", relative_position_index)
        else:
            self.window_size = None
            self.relative_position_bias_table = None
            self.relative_position_index = None

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(all_head_dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, rel_pos_bias):
        B, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            qkv_bias = paddle.concat(
                (self.q_bias, paddle.zeros_like(self.v_bias), self.v_bias)
            )

        qkv = F.linear(x=x, weight=self.qkv.weight, bias=qkv_bias)

        qkv = qkv.reshape([B, N, 3, self.num_heads, -1]).transpose([2, 0, 3, 1, 4])
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * self.scale

        attn = q @ k.transpose([0, 1, 3, 2])

        if self.relative_position_bias_table is not None:
            relative_position_bias = self.relative_position_bias_table[
                self.relative_position_index.reshape([-1])
            ].reshape(
                [
                    self.window_size[0] * self.window_size[1] + 1,
                    self.window_size[0] * self.window_size[1] + 1,
                    -1,
                ]
            )  # Wh*Ww,Wh*Ww,nH
            relative_position_bias = relative_position_bias.transpose(
                [2, 0, 1]
            )  # nH, Wh*Ww, Wh*Ww

            attn = attn + relative_position_bias.unsqueeze(axis=0)

        if rel_pos_bias is not None:
            attn = attn + rel_pos_bias

        attn = F.softmax(attn, axis=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose([0, 2, 1, 3]).reshape([B, N, -1])
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class Block(nn.Layer):
    def __init__(
        self,
        dim,
        num_heads,
        mlp_ratio=4.0,
        qkv_bias=False,
        drop=0.0,
        attn_drop=0.0,
        drop_path=0.0,
        init_values=None,
        act_layer=nn.GELU,
        norm_layer=nn.LayerNorm,
        window_size=None,
        attn_head_dim=None,
    ):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            attn_drop=attn_drop,
            proj_drop=drop,
            window_size=window_size,
            attn_head_dim=attn_head_dim,
        )
        
        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(
            in_features=dim,
            hidden_features=mlp_hidden_dim,
            act_layer=act_layer,
            drop=drop,
        )

        if init_values:

            self.gamma_1 = paddle.create_parameter(
                shape=[dim],
                dtype="float32",
                default_initializer=nn.initializer.Constant(value=init_values),
            )

            self.gamma_2 = paddle.create_parameter(
                shape=[dim],
                dtype="float32",
                default_initializer=nn.initializer.Constant(value=init_values),
            )

        else:
            self.gamma_1, self.gamma_2 = None, None

    def forward(self, x, rel_pos_bias):
        if self.gamma_1 is None:
            x = x + self.drop_path(self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(
                self.gamma_1 * self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias)
            )
            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
        return x
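
A quick shape check for a single block (197 = cls token + 196 patches; with the default window_size=None, no relative position bias is used):

```python
blk = Block(dim=768, num_heads=12, mlp_ratio=4.0, qkv_bias=True)
x = paddle.rand([1, 197, 768])
print(blk(x, rel_pos_bias=None).shape)  # [1, 197, 768]
```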


class RelativePositionBias(nn.Layer):
    def __init__(self, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.num_relative_distance = (2 * window_size[0] - 1) * (
            2 * window_size[1] - 1
        ) + 3

        self.relative_position_bias_table = paddle.create_parameter(
            shape=[self.num_relative_distance, num_heads],
            dtype="float32",
            default_initializer=zeros_,
        )  # 2*Wh-1 * 2*Ww-1, nH

        coords_h = paddle.arange(window_size[0])
        coords_w = paddle.arange(window_size[1])
        coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
        coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze(
            axis=1
        )  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * window_size[1] - 1
        relative_position_index = paddle.zeros(
            [window_size[0] * window_size[1] + 1, window_size[0] * window_size[1] + 1],
            dtype=relative_coords.dtype,  # int64, so it can be used for indexing
        )
        relative_position_index[1:, 1:] = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        relative_position_index[0, 0:] = self.num_relative_distance - 3
        relative_position_index[0:, 0] = self.num_relative_distance - 2
        relative_position_index[0, 0] = self.num_relative_distance - 1

        self.register_buffer("relative_position_index", relative_position_index)

    def forward(self):
        # look up the bias table with the precomputed index, then put the head
        # dim first so the bias can be added directly to the attention map
        relative_position_bias = self.relative_position_bias_table[
            self.relative_position_index.reshape([-1])
        ].reshape(
            [
                self.window_size[0] * self.window_size[1] + 1,
                self.window_size[0] * self.window_size[1] + 1,
                -1,
            ]
        )  # Wh*Ww+1, Wh*Ww+1, nH
        return relative_position_bias.transpose([2, 0, 1])  # nH, Wh*Ww+1, Wh*Ww+1


class Beit(nn.Layer):
    """Beit Layer"""

    def __init__(
        self,
        img_size=224,
        patch_size=16,
        in_chans=3,
        num_classes=1000,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        qkv_bias=True,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.0,
        norm_layer=partial(nn.LayerNorm, epsilon=1e-6),
        init_values=None,
        use_abs_pos_emb=True,
        use_rel_pos_bias=False,
        use_shared_rel_pos_bias=False,
        use_mean_pooling=True,
        init_scale=0.001,
    ):
        super().__init__()
        self.num_classes = num_classes
        self.num_features = self.embed_dim = embed_dim

        self.patch_embed = PatchEmbed(
            img_size=img_size,
            patch_size=patch_size,
            in_chans=in_chans,
            embed_dim=embed_dim,
        )
        num_patches = self.patch_embed.num_patches

        self.cls_token = paddle.create_parameter(
            shape=[1, 1, embed_dim],
            dtype="float32",
            default_initializer=trunc_normal_,
        )

        if use_abs_pos_emb:

            self.pos_embed = paddle.create_parameter(
                shape=[1, num_patches + 1, embed_dim],
                dtype="float32",
                default_initializer=trunc_normal_,
            )
        else:
            self.pos_embed = None
        self.pos_drop = nn.Dropout(p=drop_rate)

        if use_shared_rel_pos_bias:
            self.rel_pos_bias = RelativePositionBias(
                window_size=self.patch_embed.grid_size, num_heads=num_heads
            )
        else:
            self.rel_pos_bias = None

        dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)]
        self.use_rel_pos_bias = use_rel_pos_bias
        self.blocks = nn.LayerList(
            [
                Block(
                    dim=embed_dim,
                    num_heads=num_heads,
                    mlp_ratio=mlp_ratio,
                    qkv_bias=qkv_bias,
                    drop=drop_rate,
                    attn_drop=attn_drop_rate,
                    drop_path=dpr[i],
                    norm_layer=norm_layer,
                    init_values=init_values,
                    window_size=self.patch_embed.grid_size
                    if use_rel_pos_bias
                    else None,
                )
                for i in range(depth)
            ]
        )
        self.norm = Identity() if use_mean_pooling else norm_layer(embed_dim)
        self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity()

        self.apply(self._init_weights)
        self.fix_init_weight()
        if isinstance(self.head, nn.Linear):
            trunc_normal_(self.head.weight)
            self.head.weight.set_value(
                self.head.weight.multiply(paddle.to_tensor(init_scale))
            )
            self.head.bias.set_value(
                self.head.bias.multiply(paddle.to_tensor(init_scale))
            )

    def fix_init_weight(self):
        def rescale(param, layer_id):

            param.set_value(param.divide(paddle.to_tensor(math.sqrt(2.0 * layer_id))))

        for layer_id, layer in enumerate(self.blocks):
            rescale(layer.attn.proj.weight, layer_id + 1)
            rescale(layer.mlp.fc2.weight, layer_id + 1)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight)
            if isinstance(m, nn.Linear) and m.bias is not None:
                zeros_(m.bias)
        elif isinstance(m, nn.LayerNorm):
            zeros_(m.bias)
            ones_(m.weight)

    def forward_features(self, x):
        x = self.patch_embed(x)
        batch_size, seq_len, _ = x.shape

        cls_tokens = self.cls_token.expand([batch_size, -1, -1])

        x = paddle.concat((cls_tokens, x), axis=1)

        if self.pos_embed is not None:
            x = x + self.pos_embed
        x = self.pos_drop(x)

        rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
        for blk in self.blocks:

            x = blk(x, rel_pos_bias=rel_pos_bias)

        x = self.norm(x)
        if self.fc_norm is not None:
            t = x[:, 1:, :]

            return self.fc_norm(t.mean(1))
        else:
            return x[:, 0]

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x)
        return x

Model definitions

Note that crop_pct in each factory records the evaluation crop ratio for that configuration (compare the Crop ratio column in the results table); it is kept for reference and is not used by the model itself.

def beit_base_patch16_224(**kwargs):
    crop_pct = 0.9
    model = Beit(
        img_size=224,
        patch_size=16,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=0.1,
        **kwargs
    )
    return model
    
def beit_base_patch16_384(**kwargs):
    crop_pct = 1.0
    model = Beit(
        img_size=384,
        patch_size=16,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=0.1,
        **kwargs
    )
    return model

def beit_large_patch16_224(**kwargs):
    crop_pct = 0.9
    model = Beit(
        img_size=224,
        patch_size=16,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4.0,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=1e-5,
        **kwargs
    )
    return model

def beit_large_patch16_384(**kwargs):
    crop_pct = 1.0
    model = Beit(
        img_size=384,
        patch_size=16,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4.0,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=1e-5,
        **kwargs
    )
    return model

def beit_large_patch16_512(**kwargs):
    crop_pct = 1.0
    model = Beit(
        img_size=512,
        patch_size=16,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4.0,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        init_values=1e-5,
        **kwargs
    )
    return model

Loading model weights

# beit base 224
m = beit_base_patch16_224()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_base_patch16_224_ft22kto1k.pdparams'))
# beit base 384
m = beit_base_patch16_384()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_base_patch16_384_ft22kto1k.pdparams'))
# beit large 224
m = beit_large_patch16_224()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_large_patch16_224_ft22kto1k.pdparams'))
# beit large 384
m = beit_large_patch16_384()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_large_patch16_384_ft22kto1k.pdparams'))
# beit large 512
m = beit_large_patch16_512()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_large_patch16_512_ft22kto1k.pdparams'))
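
As a quick sanity check (assuming the weight file above loaded successfully), a forward pass should produce ImageNet-1k logits:

```python
import paddle

m.eval()
x = paddle.rand([1, 3, 512, 512])
with paddle.no_grad():
    out = m(x)
print(out.shape)  # [1, 1000]
```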

Using BEiT via PASSL

PASSL repository: https://github.com/PaddlePaddle/PASSL

# install dependencies
! pip install ftfy
# clone the PASSL repo
# !git clone https://github.com/PaddlePaddle/PASSL.git

# if cloning is too slow, unzip the pre-downloaded PASSL archive instead
!unzip -oq /home/aistudio/PASSL-main.zip

# enter the project directory
%cd PASSL-main/
import paddle.nn as nn
from passl.modeling.backbones import build_backbone
from passl.modeling.heads import build_head
from passl.utils.config import get_config


class Model(nn.Layer):
    def __init__(self, cfg_file):
        super().__init__()
        cfg = get_config(cfg_file)
        self.backbone = build_backbone(cfg.model.architecture)
        self.head = build_head(cfg.model.head)

    def forward(self, x):

        x = self.backbone(x)
        x = self.head(x)
        return x


cfg_file = "configs/beit/beit_base_p16_224.yaml"
m = Model(cfg_file)

| Arch | Weight | Top-1 Acc | Top-5 Acc | Crop ratio | # Params |
| --- | --- | --- | --- | --- | --- |
| beit_base_p16_224 | ft 22k to 1k | 85.21 | 97.66 | 0.9 | 87M |
| beit_base_p16_384 | ft 22k to 1k | 86.81 | 98.14 | 1.0 | 87M |
| beit_large_p16_224 | ft 22k to 1k | 87.48 | 98.30 | 0.9 | 304M |
| beit_large_p16_384 | ft 22k to 1k | 88.40 | 98.60 | 1.0 | 304M |
| beit_large_p16_512 | [ft 22k to 1k](https://passl.bj.bcebos.com/vision_transformers/beit/beit_large_p16_512_ft.pdparams) | 88.60 | 98.66 | 1.0 | 304M |

Validating model accuracy with [ppma](https://github.com/lmk123568/Paddle_Model_Analysis)

! pip install ppma
# extract the ImageNet validation set
! tar -xf /home/aistudio/data/data96753/ILSVRC2012_img_val.tar -C /home/aistudio/data/data96753

import ppma
import paddle

m = beit_large_patch16_512()
m.set_state_dict(paddle.load('/home/aistudio/data/data110564/beit_large_patch16_512_ft22kto1k.pdparams'))

data_path = "/home/aistudio/data/data96753"

ppma.imagenet.val(m, data_path, batch_size=32, img_size=512, crop_pct=1.0, normalize='inception')

更多推荐