
EfficientFormerV2: Rethinking ViTs from a MobileNet Perspective
Abstract
With the success of Vision Transformers (ViTs) in computer vision tasks, recent work has sought to optimize the performance and complexity of ViTs for efficient deployment on mobile devices. Various approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViTs and their variants still suffer from higher latency or more parameters than lightweight CNNs, even compared with MobileNet, which has been around for years. In practice, both latency and size are essential for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can Transformer models run as fast as MobileNet while keeping a similar size? We revisit the design choices of ViTs and propose an improved supernet with low latency and high parameter efficiency. We further introduce a fine-grained joint search strategy that finds efficient architectures by optimizing latency and parameter count simultaneously. On ImageNet-1K, the proposed EfficientFormerV2 achieves about 4% higher top-1 accuracy than MobileNetV2 and MobileNetV2×1.4 at the same latency and parameter count. We demonstrate that properly designed and optimized vision Transformers can achieve high performance at MobileNet-level size and speed.
1. EfficientFormerV2
As shown in Figure 2, the main improvements of EfficientFormerV2 over EfficientFormer are (see Table 1 for the ablation study):
- Replace the early local modules with a MobileNet-style inverted bottleneck structure
- Improve the MSA: downsample before generating Q, K, and V; inject locality into V; and apply the Talking Head technique to strengthen information exchange across heads (a minimal sketch follows this list)
- For the resolution-downsampling module, propose an attention-based downsampling technique; notably, to obtain a better downsampled query, the paper uses pooling and convolution to produce static and learnable downsampling branches respectively, then sums and projects them to form the final downsampled query
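To make the Talking Head idea concrete before the full code, here is a minimal, self-contained sketch (our illustration with hypothetical sizes, not the Attention4D module implemented below): the per-head attention maps are mixed with 1×1 convolutions across the head dimension, once before and once after the softmax.
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

B, num_heads, N = 2, 4, 49                           # batch, heads, tokens (illustrative)
attn = paddle.randn((B, num_heads, N, N))            # raw attention logits
talking_head1 = nn.Conv2D(num_heads, num_heads, 1)   # mix heads before softmax
talking_head2 = nn.Conv2D(num_heads, num_heads, 1)   # mix heads after softmax
attn = talking_head2(F.softmax(talking_head1(attn), axis=-1))
print(attn.shape)  # [2, 4, 49, 49]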
2. Code Reproduction
2.1 Install and Import the Required Libraries
!pip install paddlex
%matplotlib inline
import paddle
import numpy as np
import matplotlib.pyplot as plt
from paddle.vision.datasets import Cifar10
from paddle.vision.transforms import Transpose
from paddle.io import Dataset, DataLoader
from paddle import nn
import paddle.nn.functional as F
import paddle.vision.transforms as transforms
import os
from matplotlib.pyplot import figure
import paddlex
import math
import itertools
2.2 Create the Dataset
train_tfm = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
transforms.ColorJitter(brightness=0.2,contrast=0.2, saturation=0.2),
transforms.RandomHorizontalFlip(0.5),
transforms.RandomRotation(20),
paddlex.transforms.MixupImage(),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
test_tfm = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
paddle.vision.set_image_backend('cv2')
# Use the Cifar10 dataset
train_dataset = Cifar10(data_file='data/data152754/cifar-10-python.tar.gz', mode='train', transform = train_tfm, )
val_dataset = Cifar10(data_file='data/data152754/cifar-10-python.tar.gz', mode='test',transform = test_tfm)
print("train_dataset: %d" % len(train_dataset))
print("val_dataset: %d" % len(val_dataset))
train_dataset: 50000
val_dataset: 10000
batch_size=256
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=False, num_workers=4)
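As an optional sanity check, fetch one batch and confirm the shapes the model will receive; the label shape here is what the training loop below assumes.
# One batch from the training loader: images are NCHW, labels are a 1-D tensor.
x, y = next(iter(train_loader))
print(x.shape)  # [256, 3, 224, 224]
print(y.shape)  # [256]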
2.3 Building the Model
2.3.1 Label Smoothing
class LabelSmoothingCrossEntropy(nn.Layer):
def __init__(self, smoothing=0.1):
super().__init__()
self.smoothing = smoothing
def forward(self, pred, target):
confidence = 1. - self.smoothing
log_probs = F.log_softmax(pred, axis=-1)
idx = paddle.stack([paddle.arange(log_probs.shape[0]), target], axis=1)
nll_loss = paddle.gather_nd(-log_probs, index=idx)
smooth_loss = paddle.mean(-log_probs, axis=-1)
loss = confidence * nll_loss + self.smoothing * smooth_loss
return loss.mean()
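An optional quick check of the implementation: with smoothing=0 the loss reduces to the ordinary cross entropy, so it should match F.cross_entropy exactly.
# Hypothetical random inputs, for verification only.
logits = paddle.randn((4, 10))
target = paddle.randint(0, 10, [4])
print(float(LabelSmoothingCrossEntropy(smoothing=0.0)(logits, target)))
print(float(F.cross_entropy(logits, target)))  # same value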
2.3.2 DropPath
def drop_path(x, drop_prob=0.0, training=False):
"""
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ...
"""
if drop_prob == 0.0 or not training:
return x
keep_prob = paddle.to_tensor(1 - drop_prob)
shape = (paddle.shape(x)[0],) + (1,) * (x.ndim - 1)
random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
random_tensor = paddle.floor(random_tensor) # binarize
output = x.divide(keep_prob) * random_tensor
return output
class DropPath(nn.Layer):
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
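An optional behavioral check: DropPath zeros out whole samples but rescales the survivors by 1/keep_prob, so the expected activation magnitude is preserved.
dp = DropPath(drop_prob=0.2)
dp.train()  # drop_path is active only in training mode
out = dp(paddle.ones((1000, 8)))
print(float(out.mean()))  # ~1.0; each row is either all 0 or all 1/0.8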
2.3.3 Building the EfficientFormerV2 Model
2.3.3.1 Attention Module
class Attention4D(nn.Layer):
def __init__(self, dim=384, key_dim=32, num_heads=8, attn_ratio=4, resolution=7, act_layer=nn.ReLU, stride=None):
super().__init__()
self.num_heads = num_heads
self.scale = key_dim ** -0.5
self.key_dim = key_dim
self.nh_kd = key_dim * num_heads
if stride is not None:
self.resolution = math.ceil(resolution / stride)
self.stride_conv = nn.Sequential(nn.Conv2D(dim, dim, kernel_size=3, stride=stride, padding=1, groups=dim),
nn.BatchNorm2D(dim), )
self.upsample = nn.Upsample(scale_factor=stride, mode='bilinear')
else:
self.resolution = resolution
self.stride_conv = None
self.upsample = None
self.N = self.resolution ** 2
self.N2 = self.N
self.d = int(attn_ratio * key_dim)
self.dh = int(attn_ratio * key_dim) * num_heads
self.attn_ratio = attn_ratio
h = self.dh + self.nh_kd * 2
self.q = nn.Sequential(
nn.Conv2D(dim, self.nh_kd, 1),
nn.BatchNorm2D(self.nh_kd)
)
self.k = nn.Sequential(
nn.Conv2D(dim, self.nh_kd, 1),
nn.BatchNorm2D(self.nh_kd)
)
self.v = nn.Sequential(
nn.Conv2D(dim, self.dh, 1),
nn.BatchNorm2D(self.dh)
)
self.local_v = nn.Sequential(
nn.Conv2D(self.dh, self.dh, 3, padding=1, groups=self.dh),
nn.BatchNorm2D(self.dh)
)
self.talking_head1 = nn.Conv2D(self.num_heads, self.num_heads, 1)
self.talking_head2 = nn.Conv2D(self.num_heads, self.num_heads, 1)
self.proj = nn.Sequential(
act_layer(),
nn.Conv2D(self.dh, dim, 1),
nn.BatchNorm2D(dim)
)
points = list(itertools.product(range(self.resolution), range(self.resolution)))
N = len(points)
self.N = N
attention_offsets = {}
idxs = []
for p1 in points:
for p2 in points:
offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
if offset not in attention_offsets:
attention_offsets[offset] = len(attention_offsets)
idxs.append(attention_offsets[offset])
self.attention_biases = self.create_parameter((len(attention_offsets), num_heads), default_initializer=nn.initializer.Constant(0.0))
self.attention_bias_idxs = idxs
def forward(self, x):
B, C, H, W = x.shape
if self.stride_conv is not None:
x = self.stride_conv(x)
q = self.q(x).flatten(2).reshape((B, self.num_heads, -1, self.N)).transpose((0, 1, 3, 2))
k = self.k(x).flatten(2).reshape((B, self.num_heads, -1, self.N)).transpose((0, 1, 3, 2))
v = self.v(x)
v_local = self.local_v(v)
v = v.flatten(2).reshape((B, self.num_heads, -1, self.N)).transpose((0, 1, 3, 2))
attn = q @ k.transpose((0, 1, 3, 2)) * self.scale
attn = attn + self.attention_biases[self.attention_bias_idxs].transpose((1, 0)).reshape((1, self.num_heads, self.N, self.N))
attn = self.talking_head1(attn)
attn = F.softmax(attn, axis=-1)
attn = self.talking_head2(attn)
x = attn @ v
out = x.transpose((0, 1, 3, 2)).reshape((B, self.dh, self.resolution, self.resolution)) + v_local
if self.upsample is not None:
out = self.upsample(out)
out = self.proj(out)
return out
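An optional shape check: Attention4D preserves the input resolution; with stride=2 it attends at half resolution (downsampling before generating QKV, as described in Sec. 1) and restores the size with bilinear upsampling.
attn = Attention4D(dim=96, resolution=7)
print(attn(paddle.randn((1, 96, 7, 7))).shape)       # [1, 96, 7, 7]
attn_s2 = Attention4D(dim=96, resolution=14, stride=2)
print(attn_s2(paddle.randn((1, 96, 14, 14))).shape)  # [1, 96, 14, 14]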
def stem(in_chs, out_chs, act_layer=nn.ReLU):
return nn.Sequential(
nn.Conv2D(in_chs, out_chs // 2, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2D(out_chs // 2),
act_layer(),
nn.Conv2D(out_chs // 2, out_chs, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2D(out_chs),
act_layer())
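An optional check: the stem downsamples by 4× via two stride-2 convolutions and widens the channels to out_chs.
s = stem(3, 32)
print(s(paddle.randn((1, 3, 224, 224))).shape)  # [1, 32, 56, 56]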
2.3.3.2 Attention Downsampling
class LGQuery(nn.Layer):
def __init__(self, in_dim, out_dim):
super().__init__()
self.pool = nn.AvgPool2D(1, 2, 0)
self.local = nn.Conv2D(in_dim, in_dim, kernel_size=3, stride=2, padding=1, groups=in_dim)
self.proj = nn.Sequential(
nn.Conv2D(in_dim, out_dim, 1),
nn.BatchNorm2D(out_dim)
)
def forward(self, x):
local_q = self.local(x)
pool_q = self.pool(x)
q = local_q + pool_q
q = self.proj(q)
return q
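An optional check: both LGQuery branches halve the spatial size, so the learnable path (strided depthwise conv) and the static path (average pooling) can be summed before the 1×1 projection; this is the combined downsampled query described in Sec. 1.
lgq = LGQuery(32, 64)
print(lgq(paddle.randn((1, 32, 14, 14))).shape)  # [1, 64, 7, 7]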
class Attention4DDownsample(nn.Layer):
def __init__(self, dim=384, key_dim=16, num_heads=8, attn_ratio=4,
resolution=7, out_dim=None, act_layer=nn.GELU):
super().__init__()
self.num_heads = num_heads
self.scale = key_dim ** -0.5
self.key_dim = key_dim
self.nh_kd = nh_kd = key_dim * num_heads
self.resolution = resolution
self.d = int(attn_ratio * key_dim)
self.dh = int(attn_ratio * key_dim) * num_heads
self.attn_ratio = attn_ratio
h = self.dh + nh_kd * 2
if out_dim is not None:
self.out_dim = out_dim
else:
self.out_dim = dim
self.resolution2 = math.ceil(self.resolution / 2)
self.q = LGQuery(dim, self.nh_kd)
self.N = self.resolution ** 2
self.N2 = self.resolution2 ** 2
self.k = nn.Sequential(
nn.Conv2D(dim, self.num_heads * self.key_dim, 1),
nn.BatchNorm2D(self.num_heads * self.key_dim)
)
self.v = nn.Sequential(
nn.Conv2D(dim, self.num_heads * self.d, 1),
nn.BatchNorm2D(self.num_heads * self.d),
)
self.v_local = nn.Sequential(
nn.Conv2D(self.dh, self.dh, kernel_size=3, stride=2, padding=1, groups=self.num_heads * self.d),
nn.BatchNorm2D(self.num_heads * self.d)
)
self.proj = nn.Sequential(
act_layer(),
nn.Conv2D(self.dh, self.out_dim, 1),
nn.BatchNorm2D(self.out_dim)
)
points = list(itertools.product(range(self.resolution), range(self.resolution)))
points_ = list(itertools.product(range(self.resolution2), range(self.resolution2)))
N = len(points)
N_ = len(points_)
self.N = N
self.N_ = N_
attention_offsets = {}
idxs = []
for p1 in points_:
for p2 in points:
size = 1
offset = (
abs(p1[0] * math.ceil(self.resolution / self.resolution2) - p2[0] + (size - 1) / 2),
abs(p1[1] * math.ceil(self.resolution / self.resolution2) - p2[1] + (size - 1) / 2))
if offset not in attention_offsets:
attention_offsets[offset] = len(attention_offsets)
idxs.append(attention_offsets[offset])
self.attention_biases = self.create_parameter((len(attention_offsets), num_heads), default_initializer=nn.initializer.Constant(0.0))
self.attention_bias_idxs = idxs
def forward(self, x):
B, C, H, W = x.shape
q = self.q(x).flatten(2).reshape((B, self.num_heads, -1, self.N2)).transpose((0, 1, 3, 2))
k = self.k(x).flatten(2).reshape((B, self.num_heads, -1, self.N))
v = self.v(x)
v_local = self.v_local(v)
v = v.flatten(2).reshape((B, self.num_heads, -1, self.N)).transpose((0, 1, 3, 2))
attn = q @ k * self.scale
attn = attn + self.attention_biases[self.attention_bias_idxs].transpose((1, 0)).reshape((1, self.num_heads, self.N_, self.N))
attn = F.softmax(attn, axis=-1)
x = (attn @ v).transpose((0, 1, 3, 2))
out = x.reshape((B, self.dh, self.resolution2, self.resolution2)) + v_local
out = self.proj(out)
return out
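An optional shape check: Attention4DDownsample halves the resolution while mapping dim to out_dim.
ad = Attention4DDownsample(dim=96, out_dim=176, resolution=14)
print(ad(paddle.randn((1, 96, 14, 14))).shape)  # [1, 176, 7, 7]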
def to_2tuple(x):
return (x, x)
class Embedding(nn.Layer):
def __init__(self, patch_size=3, stride=2, padding=1,
in_chans=3, embed_dim=768, norm_layer=nn.BatchNorm2D,
light=False, asub=False, resolution=None, act_layer=nn.ReLU, attn_block=Attention4DDownsample):
super().__init__()
self.light = light
self.asub = asub
if self.light:
self.new_proj = nn.Sequential(
nn.Conv2D(in_chans, in_chans, kernel_size=3, stride=2, padding=1, groups=in_chans),
nn.BatchNorm2D(in_chans),
nn.Hardswish(),
nn.Conv2D(in_chans, embed_dim, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2D(embed_dim),
)
self.skip = nn.Sequential(
nn.Conv2D(in_chans, embed_dim, kernel_size=1, stride=2, padding=0),
nn.BatchNorm2D(embed_dim)
)
elif self.asub:
self.attn = attn_block(dim=in_chans, out_dim=embed_dim,
resolution=resolution, act_layer=act_layer)
patch_size = to_2tuple(patch_size)
stride = to_2tuple(stride)
padding = to_2tuple(padding)
self.conv = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size,
stride=stride, padding=padding)
self.bn = norm_layer(embed_dim) if norm_layer else nn.Identity()
else:
patch_size = to_2tuple(patch_size)
stride = to_2tuple(stride)
padding = to_2tuple(padding)
self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size,
stride=stride, padding=padding)
self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
def forward(self, x):
if self.light:
out = self.new_proj(x) + self.skip(x)
elif self.asub:
out_conv = self.conv(x)
out_conv = self.bn(out_conv)
out = self.attn(x) + out_conv
else:
x = self.proj(x)
out = self.norm(x)
return out
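An optional check: with asub=True the Embedding adds the attention-downsampling path to a strided-conv shortcut; both halve the resolution, so they can be summed.
emb = Embedding(in_chans=96, embed_dim=176, asub=True, resolution=14)
print(emb(paddle.randn((1, 96, 14, 14))).shape)  # [1, 176, 7, 7]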
class Mlp(nn.Layer):
"""
Implementation of MLP with 1*1 convolutions.
Input: tensor with shape [B, C, H, W]
"""
def __init__(self, in_features, hidden_features=None,
out_features=None, act_layer=nn.GELU, drop=0., mid_conv=False):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.mid_conv = mid_conv
self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
self.act = act_layer()
self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
self.drop = nn.Dropout(drop)
if self.mid_conv:
self.mid = nn.Conv2D(hidden_features, hidden_features, kernel_size=3, stride=1, padding=1,
groups=hidden_features)
self.mid_norm = nn.BatchNorm2D(hidden_features)
self.norm1 = nn.BatchNorm2D(hidden_features)
self.norm2 = nn.BatchNorm2D(out_features)
def forward(self, x):
x = self.fc1(x)
x = self.norm1(x)
x = self.act(x)
if self.mid_conv:
x_mid = self.mid(x)
x_mid = self.mid_norm(x_mid)
x = self.act(x_mid)
x = self.drop(x)
x = self.fc2(x)
x = self.norm2(x)
x = self.drop(x)
return x
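An optional check: with mid_conv=True this Mlp is exactly the MobileNet-style inverted bottleneck from Sec. 1 (1×1 expansion, 3×3 depthwise convolution, 1×1 projection), so it preserves the input shape.
mlp = Mlp(in_features=32, hidden_features=128, mid_conv=True)
print(mlp(paddle.randn((1, 32, 56, 56))).shape)  # [1, 32, 56, 56]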
2.3.3.3 Local-Global Block
class AttnFFN(nn.Layer):
def __init__(self, dim, mlp_ratio=4.,
act_layer=nn.ReLU, norm_layer=nn.LayerNorm,
drop=0., drop_path=0.,
use_layer_scale=True, layer_scale_init_value=1e-5,
resolution=7, stride=None):
super().__init__()
self.token_mixer = Attention4D(dim, resolution=resolution, act_layer=act_layer, stride=stride)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim,
act_layer=act_layer, drop=drop, mid_conv=True)
self.drop_path = DropPath(drop_path) if drop_path > 0. \
else nn.Identity()
self.use_layer_scale = use_layer_scale
if use_layer_scale:
self.layer_scale_1 = self.create_parameter((1, dim, 1, 1), default_initializer=nn.initializer.Constant(layer_scale_init_value))
self.layer_scale_2 = self.create_parameter((1, dim, 1, 1), default_initializer=nn.initializer.Constant(layer_scale_init_value))
def forward(self, x):
if self.use_layer_scale:
x = x + self.drop_path(self.layer_scale_1 * self.token_mixer(x))
x = x + self.drop_path(self.layer_scale_2 * self.mlp(x))
else:
x = x + self.drop_path(self.token_mixer(x))
x = x + self.drop_path(self.mlp(x))
return x
2.3.3.4 Local Block
class FFN(nn.Layer):
def __init__(self, dim, pool_size=3, mlp_ratio=4.,
act_layer=nn.GELU,
drop=0., drop_path=0.,
use_layer_scale=True, layer_scale_init_value=1e-5):
super().__init__()
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim,
act_layer=act_layer, drop=drop, mid_conv=True)
self.drop_path = DropPath(drop_path) if drop_path > 0. \
else nn.Identity()
self.use_layer_scale = use_layer_scale
if use_layer_scale:
self.layer_scale_2 = self.create_parameter((1, dim, 1, 1), default_initializer=nn.initializer.Constant(layer_scale_init_value))
def forward(self, x):
if self.use_layer_scale:
x = x + self.drop_path(self.layer_scale_2 * self.mlp(x))
else:
x = x + self.drop_path(self.mlp(x))
return x
def eformer_block(dim, index, layers,
pool_size=3, mlp_ratio=4.,
act_layer=nn.GELU, norm_layer=nn.LayerNorm,
drop_rate=.0, drop_path_rate=0.,
use_layer_scale=True, layer_scale_init_value=1e-5, vit_num=1, resolution=7, e_ratios=None):
blocks = []
for block_idx in range(layers[index]):
block_dpr = drop_path_rate * (
block_idx + sum(layers[:index])) / (sum(layers) - 1)
mlp_ratio = e_ratios[str(index)][block_idx]
if index >= 2 and block_idx > layers[index] - 1 - vit_num:
if index == 2:
stride = 2
else:
stride = None
blocks.append(AttnFFN(
dim, mlp_ratio=mlp_ratio,
act_layer=act_layer, norm_layer=norm_layer,
drop=drop_rate, drop_path=block_dpr,
use_layer_scale=use_layer_scale,
layer_scale_init_value=layer_scale_init_value,
resolution=resolution,
stride=stride,
))
else:
blocks.append(FFN(
dim, pool_size=pool_size, mlp_ratio=mlp_ratio,
act_layer=act_layer,
drop=drop_rate, drop_path=block_dpr,
use_layer_scale=use_layer_scale,
layer_scale_init_value=layer_scale_init_value,
))
blocks = nn.Sequential(*blocks)
return blocks
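An optional check: within a stage, only the last vit_num blocks use attention (AttnFFN); the earlier blocks are pure local FFNs. Here we pass a hand-written ratio table with the same values as the S0 stage-2 ratios defined below.
ratios = {'2': [4, 3, 3, 3, 4, 4]}  # same values as expansion_ratios_S0['2']
blocks = eformer_block(96, index=2, layers=[2, 2, 6, 4], vit_num=2,
                       resolution=28, e_ratios=ratios)
print([type(blocks[i]).__name__ for i in range(6)])
# ['FFN', 'FFN', 'FFN', 'FFN', 'AttnFFN', 'AttnFFN']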
class EfficientFormerV2(nn.Layer):
def __init__(self, layers, embed_dims=None,
mlp_ratios=4, downsamples=None,
pool_size=3,
norm_layer=nn.BatchNorm2D, act_layer=nn.GELU,
num_classes=1000,
down_patch_size=3, down_stride=2, down_pad=1,
drop_rate=0., drop_path_rate=0.,
use_layer_scale=True, layer_scale_init_value=1e-5,
vit_num=0,
distillation=True,
resolution=224,
e_ratios=None):
super().__init__()
self.num_classes = num_classes
self.patch_embed = stem(3, embed_dims[0], act_layer=act_layer)
network = []
for i in range(len(layers)):
stage = eformer_block(embed_dims[i], i, layers,
pool_size=pool_size, mlp_ratio=mlp_ratios,
act_layer=act_layer, norm_layer=norm_layer,
drop_rate=drop_rate,
drop_path_rate=drop_path_rate,
use_layer_scale=use_layer_scale,
layer_scale_init_value=layer_scale_init_value,
resolution=math.ceil(resolution / (2 ** (i + 2))),
vit_num=vit_num,
e_ratios=e_ratios)
network.append(stage)
if i >= len(layers) - 1:
break
if downsamples[i] or embed_dims[i] != embed_dims[i + 1]:
# downsampling between two stages
if i >= 2:
asub = True
else:
asub = False
network.append(
Embedding(
patch_size=down_patch_size, stride=down_stride,
padding=down_pad,
in_chans=embed_dims[i], embed_dim=embed_dims[i + 1],
resolution=math.ceil(resolution / (2 ** (i + 2))),
asub=asub,
act_layer=act_layer, norm_layer=norm_layer,
)
)
self.network = nn.LayerList(network)
# Classifier head
self.norm = norm_layer(embed_dims[-1])
self.head = nn.Linear(
embed_dims[-1], num_classes) if num_classes > 0 \
else nn.Identity()
self.dist = distillation
if self.dist:
self.dist_head = nn.Linear(
embed_dims[-1], num_classes) if num_classes > 0 \
else nn.Identity()
self.apply(self.cls_init_weights)
# init for classification
def cls_init_weights(self, m):
tn = nn.initializer.TruncatedNormal(std=.02)
zero = nn.initializer.Constant(0.)
one = nn.initializer.Constant(1.)
if isinstance(m, (nn.Linear, nn.Conv2D)):
tn(m.weight)
if isinstance(m, nn.Linear) and m.bias is not None:
zero(m.bias)
if isinstance(m, (nn.BatchNorm2D, nn.LayerNorm)):
one(m.weight)
zero(m.bias)
def forward_tokens(self, x):
for idx, block in enumerate(self.network):
x = block(x)
return x
def forward(self, x):
x = self.patch_embed(x)
x = self.forward_tokens(x)
x = self.norm(x)
if self.dist:
cls_out = self.head(x.flatten(2).mean(-1)), self.dist_head(x.flatten(2).mean(-1))
if not self.training:
cls_out = (cls_out[0] + cls_out[1]) / 2
else:
cls_out = self.head(x.flatten(2).mean(-1))
# for image classification
return cls_out
2.3.4 Model Configurations
EfficientFormer_width = {
'L': [40, 80, 192, 384], # 26m 83.3% 6attn
'S2': [32, 64, 144, 288], # 12m 81.6% 4attn dp0.02
'S1': [32, 48, 120, 224], # 6.1m 79.0
'S0': [32, 48, 96, 176], # 75.0 75.7
}
EfficientFormer_depth = {
'L': [5, 5, 15, 10], # 26m 83.3%
'S2': [4, 4, 12, 8], # 12m
'S1': [3, 3, 9, 6], # 79.0
'S0': [2, 2, 6, 4], # 75.7
}
# 26m
expansion_ratios_L = {
'0': [4, 4, 4, 4, 4],
'1': [4, 4, 4, 4, 4],
'2': [4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4],
'3': [4, 4, 4, 3, 3, 3, 3, 4, 4, 4],
}
# 12m
expansion_ratios_S2 = {
'0': [4, 4, 4, 4],
'1': [4, 4, 4, 4],
'2': [4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4],
'3': [4, 4, 3, 3, 3, 3, 4, 4],
}
# 6.1m
expansion_ratios_S1 = {
'0': [4, 4, 4],
'1': [4, 4, 4],
'2': [4, 4, 3, 3, 3, 3, 4, 4, 4],
'3': [4, 4, 3, 3, 4, 4],
}
# 3.5m
expansion_ratios_S0 = {
'0': [4, 4],
'1': [4, 4],
'2': [4, 3, 3, 3, 4, 4],
'3': [4, 3, 3, 4],
}
def efficientformerv2_s0(pretrained=False):
model = EfficientFormerV2(
layers=EfficientFormer_depth['S0'],
embed_dims=EfficientFormer_width['S0'],
downsamples=[True, True, True, True],
vit_num=2,
drop_path_rate=0.0,
num_classes=10,
distillation=False,
e_ratios=expansion_ratios_S0)
return model
def efficientformerv2_s1(pretrained=False):
model = EfficientFormerV2(
layers=EfficientFormer_depth['S1'],
embed_dims=EfficientFormer_width['S1'],
downsamples=[True, True, True, True],
vit_num=2,
drop_path_rate=0.0,
num_classes=10,
distillation=False,
e_ratios=expansion_ratios_S1)
return model
def efficientformerv2_s2(pretrained=False):
model = EfficientFormerV2(
layers=EfficientFormer_depth['S2'],
embed_dims=EfficientFormer_width['S2'],
downsamples=[True, True, True, True],
vit_num=4,
drop_path_rate=0.02,
num_classes=10,
distillation=False,
e_ratios=expansion_ratios_S2)
return model
def efficientformerv2_l(pretrained=False):
model = EfficientFormerV2(
layers=EfficientFormer_depth['L'],
embed_dims=EfficientFormer_width['L'],
downsamples=[True, True, True, True],
vit_num=6,
drop_path_rate=0.1,
num_classes=10,
distillation=False,
e_ratios=expansion_ratios_L)
return model
model = efficientformerv2_s0()
paddle.summary(model, (1, 3, 224, 224))
model = efficientformerv2_s1()
paddle.summary(model, (1, 3, 224, 224))
model = efficientformerv2_s2()
paddle.summary(model, (1, 3, 224, 224))
model = efficientformerv2_l()
paddle.summary(model, (1, 3, 224, 224))
2.4 Training
learning_rate = 0.001
n_epochs = 100
paddle.seed(42)
np.random.seed(42)
work_path = 'work/model'
# EfficientFormerV2-S0
model = efficientformerv2_s0()
criterion = LabelSmoothingCrossEntropy()
scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=learning_rate, T_max=50000 // batch_size * n_epochs, verbose=False)
optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=scheduler, weight_decay=1e-5)
best_acc = 0.0
val_acc = 0.0
loss_record = {'train': {'loss': [], 'iter': []}, 'val': {'loss': [], 'iter': []}} # for recording loss
acc_record = {'train': {'acc': [], 'iter': []}, 'val': {'acc': [], 'iter': []}} # for recording accuracy
loss_iter = 0
acc_iter = 0
for epoch in range(n_epochs):
# ---------- Training ----------
model.train()
train_num = 0.0
train_loss = 0.0
val_num = 0.0
val_loss = 0.0
accuracy_manager = paddle.metric.Accuracy()
val_accuracy_manager = paddle.metric.Accuracy()
print("#===epoch: {}, lr={:.10f}===#".format(epoch, optimizer.get_lr()))
for batch_id, data in enumerate(train_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
logits = model(x_data)
loss = criterion(logits, y_data)
acc = paddle.metric.accuracy(logits, labels)
accuracy_manager.update(acc)
if batch_id % 10 == 0:
loss_record['train']['loss'].append(loss.numpy())
loss_record['train']['iter'].append(loss_iter)
loss_iter += 1
loss.backward()
optimizer.step()
scheduler.step()
optimizer.clear_grad()
train_loss += loss
train_num += len(y_data)
total_train_loss = (train_loss / train_num) * batch_size
train_acc = accuracy_manager.accumulate()
acc_record['train']['acc'].append(train_acc)
acc_record['train']['iter'].append(acc_iter)
acc_iter += 1
# Print the information.
print("#===epoch: {}, train loss is: {}, train acc is: {:2.2f}%===#".format(epoch, total_train_loss.numpy(), train_acc*100))
# ---------- Validation ----------
model.eval()
for batch_id, data in enumerate(val_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
with paddle.no_grad():
logits = model(x_data)
loss = criterion(logits, y_data)
acc = paddle.metric.accuracy(logits, labels)
val_accuracy_manager.update(acc)
val_loss += loss
val_num += len(y_data)
total_val_loss = (val_loss / val_num) * batch_size
loss_record['val']['loss'].append(total_val_loss.numpy())
loss_record['val']['iter'].append(loss_iter)
val_acc = val_accuracy_manager.accumulate()
acc_record['val']['acc'].append(val_acc)
acc_record['val']['iter'].append(acc_iter)
print("#===epoch: {}, val loss is: {}, val acc is: {:2.2f}%===#".format(epoch, total_val_loss.numpy(), val_acc*100))
# ===================save====================
if val_acc > best_acc:
best_acc = val_acc
paddle.save(model.state_dict(), os.path.join(work_path, 'best_model.pdparams'))
paddle.save(optimizer.state_dict(), os.path.join(work_path, 'best_optimizer.pdopt'))
print(best_acc)
paddle.save(model.state_dict(), os.path.join(work_path, 'final_model.pdparams'))
paddle.save(optimizer.state_dict(), os.path.join(work_path, 'final_optimizer.pdopt'))
2.5 Result Analysis
def plot_learning_curve(record, title='loss', ylabel='CE Loss'):
''' Plot learning curve of your CNN '''
maxtrain = max(map(float, record['train'][title]))
maxval = max(map(float, record['val'][title]))
ymax = max(maxtrain, maxval) * 1.1
mintrain = min(map(float, record['train'][title]))
minval = min(map(float, record['val'][title]))
ymin = min(mintrain, minval) * 0.9
total_steps = len(record['train'][title])
x_1 = list(map(int, record['train']['iter']))
x_2 = list(map(int, record['val']['iter']))
figure(figsize=(10, 6))
plt.plot(x_1, record['train'][title], c='tab:red', label='train')
plt.plot(x_2, record['val'][title], c='tab:cyan', label='val')
plt.ylim(ymin, ymax)
plt.xlabel('Training steps')
plt.ylabel(ylabel)
plt.title('Learning curve of {}'.format(title))
plt.legend()
plt.show()
plot_learning_curve(loss_record, title='loss', ylabel='CE Loss')
plot_learning_curve(acc_record, title='acc', ylabel='Accuracy')
import time
work_path = 'work/model'
model = efficientformerv2_s0()
model_state_dict = paddle.load(os.path.join(work_path, 'best_model.pdparams'))
model.set_state_dict(model_state_dict)
model.eval()
aa = time.time()
for batch_id, data in enumerate(val_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
with paddle.no_grad():
logits = model(x_data)
bb = time.time()
print("Throughout:{}".format(int(len(val_dataset)//(bb - aa))))
Throughout:813
def get_cifar10_labels(labels):
"""返回CIFAR10数据集的文本标签。"""
text_labels = [
'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog',
'horse', 'ship', 'truck']
return [text_labels[int(i)] for i in labels]
def show_images(imgs, num_rows, num_cols, pred=None, gt=None, scale=1.5):
"""Plot a list of images."""
figsize = (num_cols * scale, num_rows * scale)
_, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
axes = axes.flatten()
for i, (ax, img) in enumerate(zip(axes, imgs)):
if paddle.is_tensor(img):
ax.imshow(img.numpy())
else:
ax.imshow(img)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)
if pred or gt:
ax.set_title("pt: " + pred[i] + "\ngt: " + gt[i])
return axes
work_path = 'work/model'
X, y = next(iter(DataLoader(val_dataset, batch_size=18)))
model = efficientformerv2_s0()
model_state_dict = paddle.load(os.path.join(work_path, 'best_model.pdparams'))
model.set_state_dict(model_state_dict)
model.eval()
logits = model(X)
y_pred = paddle.argmax(logits, -1)
X = paddle.transpose(X, [0, 2, 3, 1])
axes = show_images(X.reshape((18, 224, 224, 3)), 1, 18, pred=get_cifar10_labels(y_pred), gt=get_cifar10_labels(y))
plt.show()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers). (repeated 18 times, once per image)
!pip install interpretdl
import interpretdl as it
work_path = 'work/model'
model = efficientformerv2_s0()
model_state_dict = paddle.load(os.path.join(work_path, 'best_model.pdparams'))
model.set_state_dict(model_state_dict)
X, y = next(iter(DataLoader(val_dataset, batch_size=18)))
lime = it.LIMECVInterpreter(model)
lime_weights = lime.interpret(X.numpy()[3], interpret_class=y.numpy()[3], batch_size=100, num_samples=10000, visual=True)
100%|██████████| 10000/10000 [00:56<00:00, 175.85it/s]
56<00:00, 175.85it/s]
3. Comparison Experiments
model | Parameters | Throughput (images/s) | Test Acc |
---|---|---|---|
EfficientFormer | 11.4M | 856 | 90.9% |
EfficientFormerV2 | 3.3M | 813 | 94.3% |
Summary
EfficientFormerV2 makes a series of improvements over EfficientFormer, achieving a clear performance gain and offering a practical recipe for lightweight Transformers.