Inference for a Binary Neural Network (BNN) with Custom Operators

0 Project introduction

2 Review of BNN principles and implementation

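  • A binarized network only binarizes the network's weights and activations. It does not change the network structure, and the probability outputs of the output layer are still floating point.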

2.1 Binarization method

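  • Binarization uses the deterministic sign function:

$$x^{b} = \operatorname{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$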

2.2 Forward propagation

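  • First, the real-valued weights are binarized to obtain binary weights, i.e. $W^{b}_{k} = \operatorname{sign}(W_{k})$. The binarized weights are then used to compute a real-valued intermediate vector, which passes through Batch Normalization to give the real-valued hidden-layer activation vector. If the layer is not the output layer, this vector is then binarized by the activation. The complete computation for layer $k$ is:

$$
\begin{aligned}
W^{b}_{k} &= \operatorname{sign}(W_{k}) \\
s_{k} &= x^{b}_{k-1} W^{b}_{k} + b_{k} \\
x_{k} &= \operatorname{BatchNorm}(s_{k}) \\
x^{b}_{k} &= \operatorname{sign}(x_{k}) \quad \text{(hidden layers only)}
\end{aligned}
$$

  • In the equations above, $b_{k}$ is the real-valued bias of layer $k$, and $x^{b}_{k}$ denotes the output of layer $k$ after the binarized activation.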
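As a sketch, the forward computation of one hidden layer described above can be written in plain C++ with float arithmetic. The struct `BNParams` and the function `bnn_layer_forward` are hypothetical names; BatchNorm is shown in its inference form (running mean and variance, gamma, beta) with an assumed `eps = 1e-5`, and the float bias is included because the layers in this project are built with `bias_attr=True`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Deterministic sign binarization: v >= 0 -> +1, v < 0 -> -1.
inline float sign_binarize(float v) { return v < 0.0f ? -1.0f : 1.0f; }

// Inference-time BatchNorm parameters of one output feature (hypothetical struct).
struct BNParams {
    float mean, var, gamma, beta;
};

// One hidden layer of the forward pass, in plain float arithmetic:
//   s_k = x_{k-1}^b * sign(W_k) + b_k,  x_k = BatchNorm(s_k),  x_k^b = sign(x_k)
// x_prev: [in] binarized input (+1/-1); W: [in x out] row-major real-valued weights;
// b: [out] float bias; bn: [out] BatchNorm parameters. Returns x_k^b.
std::vector<float> bnn_layer_forward(const std::vector<float>& x_prev,
                                     const std::vector<float>& W,
                                     const std::vector<float>& b,
                                     const std::vector<BNParams>& bn,
                                     float eps = 1e-5f) {
    const std::size_t in = x_prev.size();
    const std::size_t out = b.size();
    assert(W.size() == in * out && bn.size() == out);
    std::vector<float> x_b(out);
    for (std::size_t k = 0; k < out; ++k) {
        float s = b[k];
        for (std::size_t j = 0; j < in; ++j)
            s += x_prev[j] * sign_binarize(W[j * out + k]);  // binarize the weight first
        const BNParams& p = bn[k];
        const float a = p.gamma * (s - p.mean) / std::sqrt(p.var + eps) + p.beta;
        x_b[k] = sign_binarize(a);  // binarized activation for the next layer
    }
    return x_b;
}
```

The weights are binarized on the fly here to mirror the training-time forward pass; at inference time they would normally be binarized once and stored, which is what `convert_weight` does later in this project.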

2.3 Backward propagation

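  • First, note that during training the weights are stored and updated in full-precision float32. In the forward pass, however, the weights are binarized before being used, and the activation function re-binarizes the floating-point features produced by batch norm, so that the input to the next layer is a binarized feature map.
  • Because the derivative of sign(x) is zero (almost) everywhere, the gradient that backpropagation produces for $W_{k}$ is zero and cannot be used directly to update the weights. To solve this, a straight-through estimator (STE) is used: during backpropagation the gradient bypasses the sign() operation, so the zero derivative is not propagated.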
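A minimal sketch of the straight-through estimator on plain float vectors; the function names are hypothetical. Besides the plain bypass described above (`clip = false`), it shows a commonly used clipped variant that zeroes the gradient where |x| > 1. Which variant this project's training script uses is not stated in the text, so treat the default here as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward: deterministic sign binarization (x >= 0 -> +1, x < 0 -> -1).
std::vector<float> sign_forward(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] < 0.0f ? -1.0f : 1.0f;
    return y;
}

// Backward with a straight-through estimator. The true derivative of sign() is 0
// almost everywhere, so the upstream gradient is passed straight through instead.
// clip = false: plain bypass (gradient copied unchanged).
// clip = true:  gradient kept only where -1 <= x <= 1, zeroed elsewhere.
std::vector<float> sign_backward_ste(const std::vector<float>& x,
                                     const std::vector<float>& grad_out,
                                     bool clip = true) {
    assert(x.size() == grad_out.size());
    std::vector<float> grad_in(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const bool pass = !clip || (x[i] >= -1.0f && x[i] <= 1.0f);
        grad_in[i] = pass ? grad_out[i] : 0.0f;
    }
    return grad_in;
}
```

The gradient computed this way updates the full-precision float32 weights, which is why they are kept in float during training.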
# Train the binary neural network (use a GPU environment)
%cd /home/aistudio/work/
!python ./LeNet_MNIST_train.py
# Predict with the trained parameters
%cd /home/aistudio/work/
!python ./LeNet_MNIST_predict.py
download testing data and load testing data
load finished
Evaluation accuracy: 97.2460925579071%

3 Implementing the BNN with C++ custom operators

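  • In the prediction program above, the weights and input feature maps of the fully connected layers are binarized, but the computation is done with Paddle's Linear layer, so it still uses floating-point multiplication. The advantage of a BNN is that the computationally expensive fully connected (or convolution) operations can use bitwise operations instead of multiplications. The following sections show how to implement a binary fully connected layer with bitwise operations using C++ custom operators.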
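For context on where the bitwise savings ultimately come from: a common technique, not implemented in this project, packs 64 binary values into one 64-bit word so that a single XOR computes 64 products and a popcount adds them up. With -1 encoded as a 1 bit and +1 as a 0 bit, two values differ exactly when their product is -1, so a dot product of length n equals n - 2 × popcount(a XOR b). The sketch below assumes that encoding; `pack_signs` and `packed_dot` are hypothetical names.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a +1/-1 vector into 64-bit words: -1 -> bit 1, +1 -> bit 0.
// Unused tail bits stay 0 in every vector, so they never contribute to the XOR.
std::vector<std::uint64_t> pack_signs(const std::vector<int>& v) {
    std::vector<std::uint64_t> words((v.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] < 0) words[i / 64] |= std::uint64_t{1} << (i % 64);
    return words;
}

// Dot product of two packed +1/-1 vectors of length n.
// Each differing bit is a product of -1 and each matching bit a product of +1,
// so dot = (n - d) - d = n - 2 * d, where d = popcount(a XOR b).
int packed_dot(const std::vector<std::uint64_t>& a,
               const std::vector<std::uint64_t>& b, std::size_t n) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t w = 0; w < a.size(); ++w)
        d += std::bitset<64>(a[w] ^ b[w]).count();  // number of -1 products in this word
    return static_cast<int>(n) - 2 * static_cast<int>(d);
}
```

For example, {1, -1, -1, 1, 1} · {1, 1, -1, -1, 1} has two differing positions, so the result is 5 - 2 × 2 = 1. The operators in this project instead store one bool per value and process one product per loop iteration.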

3.1 How the binary operations are implemented

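  • A multiplication in a binary neural network has only the four cases shown in the table below:

| Feature map | Weight | Product |
|---|---|---|
| -1 | -1 | 1 |
| -1 | 1 | -1 |
| 1 | -1 | -1 |
| 1 | 1 | 1 |

  • The truth table of XOR is:

| In1 | In2 | Out |
|---|---|---|
| True | True | False |
| True | False | True |
| False | True | True |
| False | False | False |

  • Comparing the two tables shows that if -1 is mapped to True and 1 to False, the BNN multiplication can be implemented with XOR.
  • The ±1 weights are therefore mapped this way in advance and stored, and the original binary activation (positive values → +1, negative values → -1) is changed so that it outputs False for non-negative values and True for negative values.
  • Because the fully connected layer accumulates these products, the XOR result has to be mapped back: True to -1 and False to 1. Since True = 1 and False = 0, this inverse mapping is simply f(x) = 1 - 2x.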
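The mapping can be checked exhaustively with a few lines of C++. This is a sketch; `encode`, `decode` and `xor_multiply` are hypothetical helper names. It verifies that decode(encode(a) XOR encode(w)) equals a × w for all four rows of the first table.

```cpp
#include <cassert>

// Encoding from the tables above: -1 -> true, +1 -> false.
inline bool encode(int v) { return v < 0; }

// Map an XOR result back to +1/-1 with f(x) = 1 - 2x (true = 1, false = 0).
inline int decode(bool b) { return 1 - 2 * static_cast<int>(b); }

// Product of two +1/-1 values computed with XOR instead of multiplication.
inline int xor_multiply(int a, int w) { return decode(encode(a) ^ encode(w)); }

// Returns true if xor_multiply agrees with a * w on all four cases.
bool check_all_cases() {
    const int vals[2] = {-1, 1};
    for (int a : vals)
        for (int w : vals)
            if (xor_multiply(a, w) != a * w) return false;
    return true;
}
```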

3.2 Code for the binary activation function

// Binary activation: x < 0 -> true (encodes -1), x >= 0 -> false (encodes +1).
std::vector<paddle::Tensor> custom_binary_act_forward(const paddle::Tensor& x) {
    if (x.place() == paddle::PlaceType::kCPU) {
        // Output tensor on the CPU with the same shape as x
        auto out = paddle::Tensor(paddle::PlaceType::kCPU);
        out.reshape(x.shape());

        auto x_numel = x.size();
        auto* x_data = x.data<float>();
        // The output is stored as bool so it can be XORed directly
        auto* out_data = out.mutable_data<bool>(x.place());

        for (int64_t i = 0; i < x_numel; ++i) {
            out_data[i] = x_data[i] < 0;
        }
        return {out};
    } else {
        PD_THROW("Not implemented.");  // only a CPU kernel is provided
    }
}
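  • The code above is the core C++ code of the binary activation function; the complete code is in work/custom_binary_act.cc.
  • Note that the output out_data must be of type bool, so that the activations carry the True/False encoding from Section 3.1 and can be XORed directly.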

3.3 Code for the binary fully connected layer

// Binary fully connected layer: x [M, K] bool, w [K, N] bool, b [N] float -> out [M, N] float.
std::vector<paddle::Tensor> custom_bnn_linear_forward(const paddle::Tensor& x, const paddle::Tensor& w, const paddle::Tensor& b) {
    if (x.place() == paddle::PlaceType::kCPU) {
        const int64_t M = x.shape()[0];  // batch size
        const int64_t K = x.shape()[1];  // input features
        const int64_t N = w.shape()[1];  // output features

        auto out = paddle::Tensor(paddle::PlaceType::kCPU);
        out.reshape({M, N});

        auto* x_data = x.data<bool>();
        auto* w_data = w.data<bool>();
        auto* b_data = b.data<float>();
        auto* out_data = out.mutable_data<float>(x.place());

        for (int64_t i = 0; i < M * N; ++i)
            out_data[i] = 0;

        for (int64_t i = 0; i < M; ++i) {
            for (int64_t k = 0; k < N; ++k) {
                for (int64_t j = 0; j < K; ++j) {
                    // XOR the encoded bits, then map back: false -> +1, true -> -1
                    out_data[i*N+k] += 1-2*(float)(x_data[i*K+j]^w_data[j*N+k]);
                }
                out_data[i*N+k] += b_data[k];  // the bias stays a float
            }
        }
        return {out};
    } else {
        PD_THROW("Not implemented.");  // only a CPU kernel is provided
    }
}
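  • The code above is the core C++ code of the binary fully connected layer; the complete code is in work/custom_bnn_linear.cc.
  • Note that the bias of a BNN fully connected layer is still a floating-point value, so the output out_data is of type float.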
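To check the layer logic without Paddle, the loop above can be mirrored on `std::vector` and compared with an ordinary float matrix product on the corresponding +1/-1 values. This is a sketch; `bnn_linear_ref` and `float_linear_ref` are hypothetical names, and the layout ([M, K] input, [K, N] weight, row-major) follows the indexing in `custom_bnn_linear_forward`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-C++ mirror of the custom_bnn_linear_forward loop (no Paddle types).
// x: [M x K] bool (true = -1), w: [K x N] bool (true = -1), b: [N] float.
std::vector<float> bnn_linear_ref(const std::vector<bool>& x, const std::vector<bool>& w,
                                  const std::vector<float>& b,
                                  std::size_t M, std::size_t K, std::size_t N) {
    assert(x.size() == M * K && w.size() == K * N && b.size() == N);
    std::vector<float> out(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < K; ++j)  // XOR, then map false -> +1, true -> -1
                acc += 1.0f - 2.0f * static_cast<float>(x[i * K + j] ^ w[j * N + k]);
            out[i * N + k] = acc + b[k];  // the bias stays a float
        }
    return out;
}

// The same layer with ordinary float multiplication on +1/-1 values, for comparison.
std::vector<float> float_linear_ref(const std::vector<float>& x, const std::vector<float>& w,
                                    const std::vector<float>& b,
                                    std::size_t M, std::size_t K, std::size_t N) {
    assert(x.size() == M * K && w.size() == K * N && b.size() == N);
    std::vector<float> out(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            float acc = b[k];
            for (std::size_t j = 0; j < K; ++j) acc += x[i * K + j] * w[j * N + k];
            out[i * N + k] = acc;
        }
    return out;
}
```

With input {1, -1, 1} (bool {false, true, false}), weights [[1, -1], [-1, -1], [1, 1]] and bias {0.5, -0.25}, both functions return {3.5, 0.75}.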

3.4 Complete workflow

  • Step 1: import the custom C++ operators with just-in-time (JIT) compilation.

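Note: the following cell, which registers the custom operators, must be run. A CPU environment is recommended, since the kernels above only implement the CPU path and throw "Not implemented." on other devices.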

%cd /home/aistudio/
import paddle
import paddle.nn as nn
from paddle.utils.cpp_extension import load
custom_ops = load(
    name="custom_jit_ops",
    sources=[
             "work/custom_bnn_linear.cc",
             "work/custom_binary_act.cc"
            ]
)
/home/aistudio
Compiling user custom op, it will cost a few seconds.....


  • Step 2: build the network with the custom C++ operators.
import paddle
import paddle.nn as nn
from work.LeNet_MNIST_train import *
class LeNet_inference(paddle.nn.Layer):  # Not a strict LeNet structure; the name is used for convenience only
                               # Binarization loses more information than float32, so the hidden layers can be made somewhat wider
    def __init__(self):
        super(LeNet_inference, self).__init__()
        self.infl_ratio=1
        self.fc1 = BinarizeLinear(784, 2048*self.infl_ratio, bias_attr=True )
        self.bn1 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc2 = BinarizeLinear(2048*self.infl_ratio, 2048*self.infl_ratio, bias_attr=True )
        self.bn2 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc3 = BinarizeLinear(2048*self.infl_ratio, 2048*self.infl_ratio, bias_attr=True )
        self.bn3 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc4 = BinarizeLinear(2048*self.infl_ratio, 10, bias_attr=True )
        self.act = Binary_act()

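    def convert_weight(self):
        # Copy each trained weight and encode it as bool (true = -1, false = +1) with the custom activation.
        # weight1 is not used in forward(), because fc1 runs as a float linear layer.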
        self.weight1 = paddle.Tensor(self.fc1.weight.numpy())
        self.weight2 = paddle.Tensor(self.fc2.weight.numpy())
        self.weight3 = paddle.Tensor(self.fc3.weight.numpy())
        self.weight4 = paddle.Tensor(self.fc4.weight.numpy())
        self.weight1 = custom_ops.custom_binary_act(self.weight1)
        self.weight2 = custom_ops.custom_binary_act(self.weight2)
        self.weight3 = custom_ops.custom_binary_act(self.weight3)
        self.weight4 = custom_ops.custom_binary_act(self.weight4)

    def forward(self, x):
        x = paddle.reshape(x, [-1, 28*28])
        # The input image is real-valued, so the first layer uses the regular float linear op
        x = nn.functional.linear(x, self.fc1.weight, self.fc1.bias)
        x = self.bn1(x)
        x = custom_ops.custom_binary_act(x)
        x = custom_ops.custom_bnn_linear(x, self.weight2, self.fc2.bias)
        x = self.bn2(x)
        x = custom_ops.custom_binary_act(x)
        x = custom_ops.custom_bnn_linear(x, self.weight3, self.fc3.bias)
        x = self.bn3(x)
        x = custom_ops.custom_binary_act(x)
        x = custom_ops.custom_bnn_linear(x, self.weight4, self.fc4.bias)
        return x
  • Step 3: evaluate the accuracy.
import numpy as np
import paddle
from paddle.vision.transforms import Compose, Resize, Transpose, Normalize
def main():
    state_dict = paddle.load('work/best_lenet_model.pdparams')

    paddle.seed(42)
    np.random.seed(42)
    transform = Compose([Normalize(mean=[127.5],
                                std=[127.5],
                                data_format='CHW')])
    # Normalize the dataset with transform
    print('download testing data and load testing data')
    batch_size = 512
    test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=transform)
    valid_loader = paddle.io.DataLoader(test_dataset, batch_size=batch_size)
    print('load finished')

    model = LeNet_inference()
    model.eval()
    model.set_state_dict(state_dict)

    model.convert_weight()

    accuracies = []
    for batch_id, data in enumerate(valid_loader()):
        x_data = paddle.cast(data[0], 'float32')
        y_data = paddle.cast(data[1], 'int64')
        y_data = paddle.reshape(y_data, (-1, 1))
        y_predict = model(x_data)
        acc = paddle.metric.accuracy(y_predict, y_data)
        accuracies.append(np.mean(acc.numpy()))
        print("{}/{} batch acc is: {}".format(batch_id, len(valid_loader), acc.numpy()))

    avg_acc = np.mean(accuracies)
    print("Evaluation accuracy: {}%".format(avg_acc*100))
main()
download testing data and load testing data
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz 
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz 
Begin to download
Download finished
load finished

W1222 08:17:49.397006   168 place.cc:147] The `paddle::PlaceType::kCPU/kGPU` is deprecated since version 2.3, and will be removed in version 2.4! Please use `Tensor::is_cpu()/is_gpu()` method to determine the type of place.
W1222 08:17:49.397060   168 place.cc:136] The `paddle::PlaceType::kCPU/kGPU` is deprecated since version 2.3, and will be removed in version 2.4! Please use `paddle::CPUPlace()/DefaultGPUPlace()` to represent the place type.
W1222 08:17:49.397073   168 tensor.cc:54] The Tensor(place) constructor is deprecated since version 2.3, and will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor instead. Reason: A legal tensor cannot be constructed only based on the `place`, and datatype, shape, layout, etc. is also required.
W1222 08:17:49.397099   168 tensor.cc:104] The function of resetting the shape of the uninitialized Tensor of the `reshape` method is deprecated since version 2.3, and will be removed in version 2.4, please use `paddle::empty/full` method to create a new Tensor instead. reason: `reshape` means changing the tensor shape without touching underlying data, this requires the total size of the tensor to remain constant.
W1222 08:17:49.397119   168 tensor.cc:199] Allocating memory through `mutable_data` method is deprecated since version 2.3, and `mutable_data` method will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor with allocated memory, and use data<T>() method to get the memory pointer of tensor instead. Reason: When calling `mutable_data` to allocate memory, the datatype, and data layout of tensor may be in an illegal state.
W1222 08:17:49.522763   168 tensor.cc:199] Allocating memory through `mutable_data` method is deprecated since version 2.3, and `mutable_data` method will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor with allocated memory, and use data<T>() method to get the memory pointer of tensor instead. Reason: When calling `mutable_data` to allocate memory, the datatype, and data layout of tensor may be in an illegal state.


0/20 batch acc is: [0.9746094]
1/20 batch acc is: [0.9609375]
2/20 batch acc is: [0.95703125]
3/20 batch acc is: [0.9628906]
4/20 batch acc is: [0.9453125]
5/20 batch acc is: [0.9589844]
6/20 batch acc is: [0.9765625]


4 Closing remarks

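  • This project uses custom operators to replace the multiplications in the binarized fully connected layers with XOR operations. Compared with simulating a BNN with float multiplications, this avoids multiplying weights by activations and stores the binarized weights and activations as bool (one byte per value) instead of float32 (four bytes per value).
  • Compared with the ReLU activation example on the official website, this project provides a more complex example of a C++ custom operator.
  • In the future, I will consider building a simple inference framework in pure C so that it can run directly on microcontrollers and embedded devices.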
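To make the memory claim concrete, the weight storage of the binarized layers can be estimated with simple arithmetic. This sketch assumes `infl_ratio = 1`, counts only fc2 to fc4 (fc1 runs as a float layer in `LeNet_inference.forward`), ignores biases and BatchNorm parameters, assumes `sizeof(bool) == 1`, and adds a hypothetical 1-bit packed layout for comparison; the constant names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Weight counts of the binarized layers fc2, fc3 and fc4 in LeNet_inference
// (infl_ratio = 1); fc1 is excluded because it runs as a float linear layer.
constexpr std::size_t kBinaryWeights = 2048 * 2048 + 2048 * 2048 + 2048 * 10;

// Bytes needed to store n weights at a given number of bits per weight, rounded up.
constexpr std::size_t weight_bytes(std::size_t n, std::size_t bits_per_weight) {
    return (n * bits_per_weight + 7) / 8;
}

constexpr std::size_t kFloat32Bytes = weight_bytes(kBinaryWeights, 32);  // float32 storage
constexpr std::size_t kBoolBytes = weight_bytes(kBinaryWeights, 8);      // one bool per byte (assumed)
constexpr std::size_t kPackedBytes = weight_bytes(kBinaryWeights, 1);    // hypothetical 1-bit packing
```

Under these assumptions, the 8,409,088 binarized weights take about 33.6 MB as float32, 8.4 MB as bool and 1.05 MB if packed to one bit each.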

This article is a repost.
[Original project link](https://aistudio.baidu.com/aistudio/projectdetail/5330884)
