Rethinking Activation Functions in Neural Networks: Reproducing Dynamic ReLU

1. Paper Overview


ReLU is an important milestone in deep learning: it is simple yet powerful and dramatically improves the performance of neural networks. Many improved variants exist, such as Leaky ReLU and PReLU, but in these variants, as in the original, the final parameters are fixed for every input. The paper therefore asks a natural question: could ReLU work better if its parameters were adapted to the input features?


The Dynamic ReLU (DY-ReLU) proposed in the paper is a piecewise linear function whose coefficients are generated from the input. It adds only a small amount of extra computation while significantly improving the representational power of the network. The paper also tabulates how DY-ReLU differs from other activation functions.
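Concretely, the paper defines the output on channel c as the maximum of K input-dependent linear pieces, whose coefficients are produced as bounded residuals by a lightweight hyper-function θ(x) (global average pooling, two fully connected layers, and a sigmoid rescaled to [-1, 1]):

$$
y_c = \max_{1 \le k \le K}\bigl(a_c^k(x)\,x_c + b_c^k(x)\bigr),\qquad
a_c^k(x) = \alpha^k + \lambda_a\,\Delta a_c^k(x),\quad
b_c^k(x) = \beta^k + \lambda_b\,\Delta b_c^k(x)
$$

With the paper's defaults K = 2, α¹ = 1, the remaining αᵏ and all βᵏ equal to 0, λ_a = 1.0 and λ_b = 0.5, the function reduces to the ordinary ReLU when the residuals are zero. These constants correspond to `init_v` and `lambdas` in the implementation below.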


2. Reproduction Details

The paper provides three variants of DY-ReLU, which share the coefficients in different ways across spatial positions and channels (Figure 2 of the paper):


2.1 DY-ReLU-A

Coefficients are shared across both spatial positions and channels (spatial- and channel-shared), computed as in Figure 2(a) of the paper. The hyper-function only has to output 2K parameters, so this variant is the cheapest to compute but also the least expressive.

2.2 DY-ReLU-B

Coefficients are shared across spatial positions but computed per channel (spatial-shared and channel-wise); the hyper-function outputs 2KC parameters.

2.3 DY-ReLU-C

Coefficients are shared neither across spatial positions nor across channels (spatial- and channel-wise): every element of every channel gets its own activation. This is the most expressive variant, but the 2KCHW parameters it requires are far too many to produce directly with a fully connected layer as in the previous two variants, so the paper instead factorizes the computation with an additional spatial attention branch.
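To get a sense of scale, a small back-of-the-envelope calculation (the feature-map size below is an arbitrary example, not taken from the paper) shows how quickly the hyper-function output grows across the three variants:

# Number of coefficients the hyper-function must output for one feature map.
# K = linear pieces, C = channels, H x W = spatial size (example values only).
K, C, H, W = 2, 64, 56, 56
print('DY-ReLU-A:', 2 * K)               # 4       shared over channels and positions
print('DY-ReLU-B:', 2 * K * C)           # 256     one set per channel
print('DY-ReLU-C:', 2 * K * C * H * W)   # 802816  one set per channel and position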

The paper reports image-classification results comparing the three variants.


3. Code Implementation

Note that only the DY-ReLU-A and DY-ReLU-B variants are implemented here.

import paddle
import paddle.nn as nn

class DyReLU(nn.Layer):
    def __init__(self, channels, reduction=4, k=2, conv_type='2d'):
        super(DyReLU, self).__init__()
        self.channels = channels
        self.k = k
        self.conv_type = conv_type
        assert self.conv_type in ['1d', '2d']

        self.fc1 = nn.Linear(channels, channels // reduction)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(channels // reduction, 2*k)
        self.sigmoid = nn.Sigmoid()

        # λ_a = 1.0, λ_b = 0.5 and the initialization α^1 = 1, other α^k = β^k = 0 (as in the paper)
        self.register_buffer('lambdas', paddle.to_tensor([1.]*k + [0.5]*k))
        self.register_buffer('init_v', paddle.to_tensor([1.] + [0.]*(2*k - 1)))

    def get_relu_coefs(self, x):
        # Global average pooling over the spatial dimension(s)
        theta = paddle.mean(x, axis=-1)
        if self.conv_type == '2d':
            theta = paddle.mean(theta, axis=-1)
        # Two FC layers, then rescale the sigmoid output to [-1, 1]
        theta = self.fc1(theta)
        theta = self.relu(theta)
        theta = self.fc2(theta)
        theta = 2 * self.sigmoid(theta) - 1
        return theta

    def forward(self, x):
        raise NotImplementedError

class DyReLUA(DyReLU):
    def __init__(self, channels, reduction=4, k=2, conv_type='2d'):
        super(DyReLUA, self).__init__(channels, reduction, k, conv_type)
        # One set of 2k coefficients per sample, shared by all channels and positions
        self.fc2 = nn.Linear(channels // reduction, 2*k)

    def forward(self, x):
        assert x.shape[1] == self.channels
        theta = self.get_relu_coefs(x)

        # [B, 2k] coefficients
        relu_coefs = theta.reshape([-1, 2*self.k]) * self.lambdas + self.init_v

        if self.conv_type == '1d':
            # BxCxL -> LxCxBx1
            x_perm = x.transpose([2, 1, 0]).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :self.k] + relu_coefs[:, self.k:]
            # LxCxBxk -> max over k -> BxCxL
            result = paddle.max(output, axis=-1).transpose([2, 1, 0])

        elif self.conv_type == '2d':
            # BxCxHxW -> HxWxCxBx1
            x_perm = x.transpose([2, 3, 1, 0]).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :self.k] + relu_coefs[:, self.k:]
            # HxWxCxBxk -> max over k -> BxCxHxW
            result = paddle.max(output, axis=-1).transpose([3, 2, 0, 1])

        return result

class DyReLUB(DyReLU):
    def __init__(self, channels, reduction=4, k=2, conv_type='2d'):
        super(DyReLUB, self).__init__(channels, reduction, k, conv_type)
        # 2k coefficients per channel and per sample
        self.fc2 = nn.Linear(channels // reduction, 2*k*channels)

    def forward(self, x):
        assert x.shape[1] == self.channels
        theta = self.get_relu_coefs(x)

        # [B, C, 2k] coefficients, shared over spatial positions only
        relu_coefs = theta.reshape([-1, self.channels, 2*self.k]) * self.lambdas + self.init_v

        if self.conv_type == '1d':
            # BxCxL -> LxBxCx1
            x_perm = x.transpose([2, 0, 1]).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :, :self.k] + relu_coefs[:, :, self.k:]
            # LxBxCxk -> max over k -> BxCxL
            result = paddle.max(output, axis=-1).transpose([1, 2, 0])

        elif self.conv_type == '2d':
            # BxCxHxW -> HxWxBxCx1
            x_perm = x.transpose([2, 3, 0, 1]).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :, :self.k] + relu_coefs[:, :, self.k:]
            # HxWxBxCxk -> max over k -> BxCxHxW
            result = paddle.max(output, axis=-1).transpose([2, 3, 0, 1])

        return result
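
A quick shape check (a minimal sketch with arbitrary tensor sizes, not part of the original project) confirms that both variants leave the input shape unchanged:

# Sanity check: DY-ReLU is applied element-wise, so output shape == input shape.
x2d = paddle.randn([4, 16, 32, 32])             # NCHW feature map
x1d = paddle.randn([4, 16, 100])                # NCL feature map
print(DyReLUA(16, conv_type='2d')(x2d).shape)   # [4, 16, 32, 32]
print(DyReLUB(16, conv_type='2d')(x2d).shape)   # [4, 16, 32, 32]
print(DyReLUB(16, conv_type='1d')(x1d).shape)   # [4, 16, 100]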

4. Comparison Experiments

For an easy comparison with a network structure that is simple to inspect, this project uses AlexNet as the backbone and CIFAR-10 as the dataset, and runs the comparison by swapping the activation functions inside the network.

import math
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

from paddle.nn import Linear, Dropout, ReLU
from paddle.nn import Conv2D, MaxPool2D
from paddle.nn.initializer import Uniform
from paddle import ParamAttr
from paddle.utils.download import get_weights_path_from_url

model_urls = {
    "alexnet": (
        "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/AlexNet_pretrained.pdparams",
        "7f0f9f737132e02732d75a1459d98a43", )
}

__all__ = []


class ConvPoolLayer(nn.Layer):
    def __init__(self,
                 input_channels,
                 output_channels,
                 filter_size,
                 stride,
                 padding,
                 stdv,
                 groups=1,
                 act=None):
        super(ConvPoolLayer, self).__init__()

        # Replace the original ReLU with DY-ReLU-B (the `act` argument is kept for
        # interface compatibility but is no longer used)
        # self.relu = ReLU() if act == "relu" else None
        self.relu = DyReLUB(output_channels, conv_type='2d')

        self._conv = Conv2D(
            in_channels=input_channels,
            out_channels=output_channels,
            kernel_size=filter_size,
            stride=stride,
            padding=padding,
            groups=groups,
            weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
            bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
        self._pool = MaxPool2D(kernel_size=3, stride=2, padding=0)

    def forward(self, inputs):
        x = self._conv(inputs)
        if self.relu is not None:
            x = self.relu(x)
        x = self._pool(x)
        return x


class AlexNet_dyr(nn.Layer):
    """AlexNet with DY-ReLU-B activations in its ConvPoolLayer blocks, adapted from
    `"ImageNet Classification with Deep Convolutional Neural Networks"
    <https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf>`_
    Args:
        num_classes (int): Output dim of last fc layer. Default: 1000.
    Examples:
        .. code-block:: python
            model = AlexNet_dyr(num_classes=10)
    """

    def __init__(self, num_classes=1000):
        super(AlexNet_dyr, self).__init__()
        self.num_classes = num_classes

        stdv = 1.0 / math.sqrt(3 * 11 * 11)
        self._conv1 = ConvPoolLayer(3, 64, 11, 4, 2, stdv, act="relu")
        stdv = 1.0 / math.sqrt(64 * 5 * 5)
        self._conv2 = ConvPoolLayer(64, 192, 5, 1, 2, stdv, act="relu")
        stdv = 1.0 / math.sqrt(192 * 3 * 3)
        self._conv3 = Conv2D(
            192,
            384,
            3,
            stride=1,
            padding=1,
            weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
            bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
        stdv = 1.0 / math.sqrt(384 * 3 * 3)
        self._conv4 = Conv2D(
            384,
            256,
            3,
            stride=1,
            padding=1,
            weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
            bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
        stdv = 1.0 / math.sqrt(256 * 3 * 3)
        self._conv5 = ConvPoolLayer(256, 256, 3, 1, 1, stdv, act="relu")

        if self.num_classes > 0:
            stdv = 1.0 / math.sqrt(256 * 6 * 6)
            self._drop1 = Dropout(p=0.5, mode="downscale_in_infer")
            self._fc6 = Linear(
                in_features=256 * 6 * 6,
                out_features=4096,
                weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
                bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))

            self._drop2 = Dropout(p=0.5, mode="downscale_in_infer")
            self._fc7 = Linear(
                in_features=4096,
                out_features=4096,
                weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
                bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
            self._fc8 = Linear(
                in_features=4096,
                out_features=num_classes,
                weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)),
                bias_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))

    def forward(self, inputs):
        x = self._conv1(inputs)
        x = self._conv2(x)
        x = self._conv3(x)
        x = F.relu(x)  # conv3/conv4 keep the plain ReLU; only the ConvPoolLayer blocks use DY-ReLU-B
        x = self._conv4(x)
        x = F.relu(x)
        x = self._conv5(x)

        if self.num_classes > 0:
            x = paddle.flatten(x, start_axis=1, stop_axis=-1)
            x = self._drop1(x)
            x = self._fc6(x)
            x = F.relu(x)
            x = self._drop2(x)
            x = self._fc7(x)
            x = F.relu(x)
            x = self._fc8(x)

        return x

# Instantiate the model and print a layer-by-layer summary
alexdyr = AlexNet_dyr(num_classes=10)
paddle.summary(alexdyr, (1, 3, 224, 224))
import paddle

from paddle.metric import Accuracy
from paddle.vision.transforms import Compose, Normalize, Resize, Transpose, ToTensor

callback = paddle.callbacks.VisualDL(log_dir='visualdl_log_dir_alex_dyrelu')

# Resize to 224x224, convert to a CHW tensor in [0, 1], then normalize to [-1, 1]
transform = Compose([
    Resize(size=(224, 224)),
    ToTensor(),
    Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], data_format='CHW'),
])

cifar10_train = paddle.vision.datasets.Cifar10(mode='train',
                                               transform=transform)
cifar10_test = paddle.vision.datasets.Cifar10(mode='test',
                                              transform=transform)

# Training-set data loader
train_loader = paddle.io.DataLoader(cifar10_train, batch_size=512, shuffle=True, drop_last=True)

# Test-set data loader (no shuffling, keep every sample for evaluation)
test_loader = paddle.io.DataLoader(cifar10_test, batch_size=512, shuffle=False, drop_last=False)

alexdyr = paddle.Model(AlexNet_dyr(num_classes=10))
optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=alexdyr.parameters())

alexdyr.prepare(
    optim,
    paddle.nn.CrossEntropyLoss(),
    Accuracy()
    )

alexdyr.fit(train_data=train_loader,
        eval_data=test_loader,
        epochs=12,
        callbacks=callback,
        verbose=1
        )
import paddle
from paddle.vision.models import AlexNet
from paddle.metric import Accuracy
from paddle.vision.transforms import Compose, Normalize, Resize, Transpose, ToTensor

callback = paddle.callbacks.VisualDL(log_dir='visualdl_log_dir_alexnet')

# Resize to 224x224, convert to a CHW tensor in [0, 1], then normalize to [-1, 1]
transform = Compose([
    Resize(size=(224, 224)),
    ToTensor(),
    Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], data_format='CHW'),
])

cifar10_train = paddle.vision.datasets.Cifar10(mode='train',
                                               transform=transform)
cifar10_test = paddle.vision.datasets.Cifar10(mode='test',
                                              transform=transform)

# Training-set data loader
train_loader = paddle.io.DataLoader(cifar10_train, batch_size=512, shuffle=True, drop_last=True)

# Test-set data loader (no shuffling, keep every sample for evaluation)
test_loader = paddle.io.DataLoader(cifar10_test, batch_size=512, shuffle=False, drop_last=False)

alex = paddle.Model(AlexNet(num_classes=10))
optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=alex.parameters())

alex.prepare(
    optim,
    paddle.nn.CrossEntropyLoss(),
    Accuracy()
    )

alex.fit(train_data=train_loader,
        eval_data=test_loader,
        epochs=12,
        callbacks=callback,
        verbose=1
        )
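
After both runs finish, the final test-set metrics can be read back with `Model.evaluate` (a minimal sketch reusing the `test_loader` defined above):

# Evaluate both trained models on the test set and print the final loss / accuracy.
dyrelu_metrics = alexdyr.evaluate(test_loader, verbose=1)
baseline_metrics = alex.evaluate(test_loader, verbose=1)
print('AlexNet + DY-ReLU-B:', dyrelu_metrics)
print('AlexNet baseline   :', baseline_metrics)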

5. Experimental Results

(Figure: training loss and accuracy curves of the two models, logged with VisualDL.)

6. Summary

This project reproduces Dynamic ReLU, the activation function proposed by Microsoft. It is a piecewise linear function whose coefficients are generated from the input, and it significantly improves the representational power of the network at a small extra computational cost. The paper proposes three variants; only variants A and B are reproduced here. Comparing the loss and accuracy curves of the two experiments shows that Dynamic ReLU effectively improves the model's performance.

This article is a mirror of the original project: https://aistudio.baidu.com/aistudio/projectdetail/4287121
