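Binarized Neural Networks, Part 2: Efficient Inference with Paddle Custom Operators
Building on the earlier project "The Platform's First Binarized Neural Network: Training and Prediction" (《全平台首个二值神经网络的训练和预测》), this project uses paddle custom operators to implement efficient binarized neural network inference. Compared with simulating the forward pass with ordinary multiplications, this better reflects the advantages of binarized neural networks.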
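Inference for Binarized Neural Networks (BNN) with Custom Operators
0 Project Overview
- Building on the previous project (the platform's first binarized neural network), this project refactors the training and prediction code on paddle 2.3.2 to make it more readable.
- This project uses a C++ custom operator to replace the multiplications in the fully connected layers with XOR operations, and verifies that the result is functionally correct.
- References: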
- Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
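2 Review of Binarized Neural Network Principles and Implementation
- Binarization only binarizes the parameters and activations of the network. It does not change the network structure, and the probability output of the output layer remains floating point.
2.1 Binarization Method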
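- The binarization function is the sign function: $x^{b} = \mathrm{sign}(x)$, which is $+1$ for $x \ge 0$ and $-1$ otherwise.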
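2.2 Forward Propagation
- First, the real-valued weights are binarized to obtain the binary weights, i.e. $W^{b}_{k} = \mathrm{sign}(W_{k})$. The binarized weights are then used to compute a real-valued intermediate vector, which passes through Batch Normalization to produce the real-valued hidden-layer activation vector. For every layer except the output layer, this vector is then binarized by the activation function.
- Throughout, $x^{b}_{k}$ denotes the output of layer $k$ after binarized activation.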
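The forward pass described in this section can be summarized as follows. This is a sketch consistent with the prose rather than the original figure: $s_{k}$ is an assumed name for the real-valued intermediate vector, $L$ is an assumed name for the index of the output layer, and the bias term is omitted for brevity.

```latex
\begin{aligned}
W^{b}_{k} &= \mathrm{sign}(W_{k}) \\
s_{k} &= x^{b}_{k-1}\, W^{b}_{k} \\
x_{k} &= \mathrm{BatchNorm}(s_{k}) \\
x^{b}_{k} &= \mathrm{sign}(x_{k}) \qquad \text{if } k < L
\end{aligned}
```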
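2.3 Backpropagation
- It is worth noting first that, during training, the weights are stored and updated in full-precision float32. During forward propagation, however, the weights are binarized before they are used, and the activation function re-binarizes the floating-point features coming out of batch_norm, so that the input to the next (convolutional or fully connected) layer is a binarized feature map.
- Because the derivative of sign(x) is zero (almost) everywhere, the gradient of $W_{k}$ obtained by the BP algorithm would be zero and cannot be used directly to update the weights. To solve this, the straight-through estimator is used: during gradient propagation the sign() operation is bypassed, which avoids propagating a zero derivative.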
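A minimal NumPy sketch of the straight-through estimator may make this concrete. The function names are hypothetical, and the clipping of the gradient to the region |x| <= 1 follows the variant in the referenced paper; whether this project's training code clips in the same way is an assumption.

```python
import numpy as np

def binarize_forward(x):
    """Forward pass: sign function with zero mapped to +1."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x, grad_out, clip=1.0):
    """Straight-through estimator (assumed clipped variant).

    The upstream gradient is passed through unchanged where |x| <= clip
    and set to zero elsewhere, instead of using the true derivative of
    sign(), which is zero almost everywhere.
    """
    return grad_out * (np.abs(x) <= clip)

x = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
grad_out = np.ones_like(x)

y = binarize_forward(x)                   # values: -1, -1, 1, 1, 1
grad_x = binarize_backward(x, grad_out)   # values:  0,  1, 1, 1, 0
```

The full-precision weights receive `grad_x`-style gradients and are updated as usual; only the forward pass sees the binarized values.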
# Train the binarized neural network (use a GPU environment)
%cd /home/aistudio/work/
!python ./LeNet_MNIST_train.py
# Run prediction with the trained parameters
%cd /home/aistudio/work/
!python ./LeNet_MNIST_predict.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
download testing data and load testing data
load finished
评估准确度为:97.2460925579071%
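3 C++ Custom Operator Implementation of the Binarized Neural Network
- In the prediction program above, the weights and input feature maps of the fully connected layers are binarized, but the computation is done by paddle's Linear layer, so it still uses multiplications. The advantage of a binarized neural network is that the computation-heavy fully connected (or convolution) operations can use bit operations instead of multiplications. The following sections therefore show how to implement the binary fully connected layer with bit operations in a C++ custom operator.
3.1 How the Implementation Works
- Multiplication in a binarized neural network has only the 4 cases shown in the table below:
Feature map | Weight | Result |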
---|---|---|
-1 | -1 | 1 |
-1 | 1 | -1 |
1 | -1 | -1 |
1 | 1 | 1 |
- The truth table of the XOR operation is as follows:
In1 | In2 | Out |
---|---|---|
True | True | False |
True | False | True |
False | True | True |
False | False | False |
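- Comparing the two tables shows that mapping -1 to True and 1 to False lets the XOR operation implement the multiplication of a binarized neural network.
- The weights in +1/-1 form therefore need to be mapped this way in advance and stored. Likewise, the original binarized activation function (+1 for positive values, -1 for negative values) is changed to output False for positive values and True for negative values.
- Because the fully connected layer accumulates the products, each product must also be mapped back: True to -1 and False to 1. Since True=1 and False=0, this inverse mapping is simply f(x)=1-2*x.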
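The mapping above can be checked with a short NumPy sketch. The helper names `encode` and `decode` are hypothetical and only serve this illustration; the last part also shows that the accumulated result equals n - 2 * (number of True bits after XOR), which follows directly from f(x)=1-2*x.

```python
import numpy as np

def encode(v):
    """Map +1/-1 values to bools: -1 -> True, +1 -> False."""
    return np.asarray(v) < 0

def decode(b):
    """Map bools back to +1/-1 via f(x) = 1 - 2x (True=1, False=0)."""
    return 1 - 2 * np.asarray(b, dtype=np.int64)

# All four sign combinations from the multiplication table.
feat = np.array([-1, -1, 1, 1])
weight = np.array([-1, 1, -1, 1])
xor_product = decode(encode(feat) ^ encode(weight))
assert np.array_equal(xor_product, feat * weight)

# Accumulation: a dot product of n values equals n - 2 * (count of True bits).
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
mismatches = int(np.count_nonzero(encode(a) ^ encode(b)))
dot_via_xor = a.size - 2 * mismatches
assert dot_via_xor == int(a @ b)
```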
3.2 Implementation of the Binary Activation Function
std::vector<paddle::Tensor> custom_binary_act_forward(const paddle::Tensor& x) {
  if (x.place() == paddle::PlaceType::kCPU) {
    // Allocate a bool output tensor with the same shape as the input.
    auto out = paddle::Tensor(paddle::PlaceType::kCPU);
    out.reshape(x.shape());
    auto x_numel = x.size();
    auto* x_data = x.data<float>();
    auto* out_data = out.mutable_data<bool>(x.place());
    // Negative values map to true (-1); all other values map to false (+1).
    for (int64_t i = 0; i < x_numel; ++i) {
      out_data[i] = x_data[i] < 0;
    }
    return {out};
  } else {
    PD_THROW("Not implemented.");
  }
}
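- The above is the core C++ code of the binary activation function. See work/custom_binary_act.cc for the complete code.
- Note that the data type of the output out_data must be set to bool.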
3.3 Implementation of the Binary Fully Connected Layer
std::vector<paddle::Tensor> custom_bnn_linear_forward(const paddle::Tensor& x,
                                                      const paddle::Tensor& w,
                                                      const paddle::Tensor& b) {
  if (x.place() == paddle::PlaceType::kCPU) {
    // x: [batch, in_features] (bool), w: [in_features, out_features] (bool),
    // b: [out_features] (float).
    const int64_t batch = x.shape()[0];
    const int64_t in_features = x.shape()[1];
    const int64_t out_features = w.shape()[1];

    auto out = paddle::Tensor(paddle::PlaceType::kCPU);
    out.reshape({batch, out_features});
    auto* x_data = x.data<bool>();
    auto* w_data = w.data<bool>();
    auto* b_data = b.data<float>();
    auto* out_data = out.mutable_data<float>(x.place());

    for (int64_t i = 0; i < batch; ++i) {
      for (int64_t k = 0; k < out_features; ++k) {
        float acc = 0.0f;
        for (int64_t j = 0; j < in_features; ++j) {
          // XOR replaces the multiplication; 1 - 2 * bit maps true/false to -1/+1.
          acc += 1 - 2 * static_cast<float>(x_data[i * in_features + j] ^
                                            w_data[j * out_features + k]);
        }
        // The bias stays in floating point.
        out_data[i * out_features + k] = acc + b_data[k];
      }
    }
    return {out};
  } else {
    PD_THROW("Not implemented.");
  }
}
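- The above is the core C++ code of the binary fully connected layer. See work/custom_bnn_linear.cc for the complete code.
- Note that the bias of the fully connected layer in a binarized neural network is still floating point, so the data type of the output out_data is set to float.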
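When developing operators like the two above, a small NumPy reference can help confirm that the XOR formulation matches an ordinary multiplication on +1/-1 values. The sketch below is not part of the project; `binary_act_ref`, `bnn_linear_ref`, and `sign_pm1` are hypothetical names, and the shapes are chosen arbitrarily.

```python
import numpy as np

def binary_act_ref(x):
    """Reference for the activation: negative -> True, otherwise False."""
    return np.asarray(x, dtype=np.float32) < 0

def bnn_linear_ref(x_bits, w_bits, bias):
    """Reference for the XOR linear layer.

    x_bits: bool array [batch, in_features]
    w_bits: bool array [in_features, out_features]
    bias:   float array [out_features]
    """
    # XOR every input bit with every weight bit: [batch, in, out].
    xor = x_bits[:, :, None] ^ w_bits[None, :, :]
    # Map True/False back to -1/+1 and accumulate over in_features.
    return (1.0 - 2.0 * xor).sum(axis=1).astype(np.float32) + bias

def sign_pm1(t):
    """Float +1/-1 version of the same binarization (zero maps to +1)."""
    return np.where(t < 0, -1.0, 1.0).astype(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)
bias = rng.standard_normal(8).astype(np.float32)

out_xor = bnn_linear_ref(binary_act_ref(x), binary_act_ref(w), bias)
# The same layer computed with ordinary multiplications on +1/-1 values.
out_mul = sign_pm1(x) @ sign_pm1(w) + bias
assert np.allclose(out_xor, out_mul)
```

Comparing the output of the compiled operator against such a reference on random inputs is one simple way to carry out the functional verification.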
3.4 Complete Workflow
- Step 1: import the custom C++ operators with just-in-time compilation (JIT Compile).
Note: the following code block, which registers the custom operators, must be run (a CPU environment is recommended).
%cd /home/aistudio/
import paddle
import paddle.nn as nn
from paddle.utils.cpp_extension import load

custom_ops = load(
    name="custom_jit_ops",
    sources=[
        "work/custom_bnn_linear.cc",
        "work/custom_binary_act.cc"
    ]
)
/home/aistudio
Compiling user custom op, it will cost a few seconds.....
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
- Step 2: build the network with the custom C++ operators.
import paddle
import paddle.nn as nn
from work.LeNet_MNIST_train import *


class LeNet_inference(paddle.nn.Layer):  # Not strictly LeNet; the name is kept for convenience
    # Binarization loses more information than float32, so the number of
    # hidden units can be increased somewhat to compensate.
    def __init__(self):
        super(LeNet_inference, self).__init__()
        self.infl_ratio = 1
        self.fc1 = BinarizeLinear(784, 2048*self.infl_ratio, bias_attr=True)
        self.bn1 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc2 = BinarizeLinear(2048*self.infl_ratio, 2048*self.infl_ratio, bias_attr=True)
        self.bn2 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc3 = BinarizeLinear(2048*self.infl_ratio, 2048*self.infl_ratio, bias_attr=True)
        self.bn3 = paddle.nn.BatchNorm1D(2048*self.infl_ratio)
        self.fc4 = BinarizeLinear(2048*self.infl_ratio, 10, bias_attr=True)
        self.act = Binary_act()

    def convert_weight(self):
        # Copy the float weights, then map them to bool (negative -> True).
        self.weight1 = paddle.Tensor(self.fc1.weight.numpy())
        self.weight2 = paddle.Tensor(self.fc2.weight.numpy())
        self.weight3 = paddle.Tensor(self.fc3.weight.numpy())
        self.weight4 = paddle.Tensor(self.fc4.weight.numpy())
        self.weight1 = custom_ops.custom_binary_act(self.weight1)
        self.weight2 = custom_ops.custom_binary_act(self.weight2)
        self.weight3 = custom_ops.custom_binary_act(self.weight3)
        self.weight4 = custom_ops.custom_binary_act(self.weight4)

    def forward(self, x):
        x = paddle.reshape(x, [-1, 28*28])
        # The first layer takes the real-valued image, so it uses a float linear layer.
        x = nn.functional.linear(x, self.fc1.weight, self.fc1.bias)
        x = self.bn1(x)
        x = custom_ops.custom_binary_act(x)
        # The remaining layers use the XOR-based custom operator.
        x = custom_ops.custom_bnn_linear(x, self.weight2, self.fc2.bias)
        x = self.bn2(x)
        x = custom_ops.custom_binary_act(x)
        x = custom_ops.custom_bnn_linear(x, self.weight3, self.fc3.bias)
        x = self.bn3(x)
        x = custom_ops.custom_binary_act(x)
        x = custom_ops.custom_bnn_linear(x, self.weight4, self.fc4.bias)
        return x
- Step 3: evaluate the accuracy.
import numpy as np
import paddle
from paddle.vision.transforms import Compose, Resize, Transpose, Normalize


def main():
    state_dict = paddle.load('work/best_lenet_model.pdparams')
    paddle.seed(42)
    np.random.seed(42)
    # Normalize the dataset with the transform
    transform = Compose([Normalize(mean=[127.5],
                                   std=[127.5],
                                   data_format='CHW')])
    print('download testing data and load testing data')
    batch_size = 512
    test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=transform)
    valid_loader = paddle.io.DataLoader(test_dataset, batch_size=batch_size)
    print('load finished')

    model = LeNet_inference()
    model.eval()
    model.set_state_dict(state_dict)
    # Map the loaded weights to bool before running inference.
    model.convert_weight()

    accuracies = []
    for batch_id, data in enumerate(valid_loader()):
        x_data = paddle.cast(data[0], 'float32')
        y_data = paddle.cast(data[1], 'int64')
        y_data = paddle.reshape(y_data, (-1, 1))
        y_predict = model(x_data)
        acc = paddle.metric.accuracy(y_predict, y_data)
        accuracies.append(np.mean(acc.numpy()))
        print("{}/{} batch acc is: {}".format(batch_id, len(valid_loader), acc.numpy()))
    avg_acc = np.mean(accuracies)
    print("Evaluation accuracy: {}%".format(avg_acc*100))


main()
download testing data and load testing data
item 96/403 [======>.......................] - ETA: 0s - 1ms/item
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz
Begin to download
item 2/2 [===========================>..] - ETA: 0s - 2ms/item
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz
Begin to download
Download finished
load finished
W1222 08:17:49.397006 168 place.cc:147] The `paddle::PlaceType::kCPU/kGPU` is deprecated since version 2.3, and will be removed in version 2.4! Please use `Tensor::is_cpu()/is_gpu()` method to determine the type of place.
W1222 08:17:49.397060 168 place.cc:136] The `paddle::PlaceType::kCPU/kGPU` is deprecated since version 2.3, and will be removed in version 2.4! Please use `paddle::CPUPlace()/DefaultGPUPlace()` to represent the place type.
W1222 08:17:49.397073 168 tensor.cc:54] The Tensor(place) constructor is deprecated since version 2.3, and will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor instead. Reason: A legal tensor cannot be constructed only based on the `place`, and datatype, shape, layout, etc. is also required.
W1222 08:17:49.397099 168 tensor.cc:104] The function of resetting the shape of the uninitialized Tensor of the `reshape` method is deprecated since version 2.3, and will be removed in version 2.4, please use `paddle::empty/full` method to create a new Tensor instead. reason: `reshape` means changing the tensor shape without touching underlying data, this requires the total size of the tensor to remain constant.
W1222 08:17:49.397119 168 tensor.cc:199] Allocating memory through `mutable_data` method is deprecated since version 2.3, and `mutable_data` method will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor with allocated memory, and use data<T>() method to get the memory pointer of tensor instead. Reason: When calling `mutable_data` to allocate memory, the datatype, and data layout of tensor may be in an illegal state.
W1222 08:17:49.522763 168 tensor.cc:199] Allocating memory through `mutable_data` method is deprecated since version 2.3, and `mutable_data` method will be removed in version 2.4! Please use `paddle::empty/full` method to create a new Tensor with allocated memory, and use data<T>() method to get the memory pointer of tensor instead. Reason: When calling `mutable_data` to allocate memory, the datatype, and data layout of tensor may be in an illegal state.
0/20 batch acc is: [0.9746094]
1/20 batch acc is: [0.9609375]
2/20 batch acc is: [0.95703125]
3/20 batch acc is: [0.9628906]
4/20 batch acc is: [0.9453125]
5/20 batch acc is: [0.9589844]
6/20 batch acc is: [0.9765625]
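4 Closing Remarks
- This project uses custom operators to replace the multiplications in the fully connected layers with XOR operations. Compared with simulating a binarized neural network with multiplications, this removes the multiplications from those layers and stores the weights as bool instead of float32, which reduces computation and memory overhead.
- This project provides a more complex C++ custom operator example than the ReLU activation function example given on the official website.
- In the future, a simple inference framework written in pure C is also under consideration, so that the network can run directly on microcontrollers and embedded devices.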
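The bool tensors used in this project still occupy one byte per element. As a possible direction for further memory savings, the bits could be packed eight to a byte and the accumulation done with a population count. The NumPy sketch below only illustrates the idea; `packed_dot` and the lookup table are hypothetical and are not part of the project's C++ code.

```python
import numpy as np

# Popcount lookup table for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int64)

def packed_dot(a_bits, b_bits):
    """Dot product of two +1/-1 vectors given as bool arrays (True = -1)."""
    n = a_bits.size
    # packbits pads the last byte with zeros in both operands, so the
    # padding XORs to zero and does not affect the count.
    xor = np.packbits(a_bits) ^ np.packbits(b_bits)
    return n - 2 * int(POPCOUNT[xor].sum())

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=2048)
b = rng.choice([-1, 1], size=2048)
a_bits, b_bits = a < 0, b < 0

assert packed_dot(a_bits, b_bits) == int(a @ b)
packed_bytes = np.packbits(a_bits).nbytes     # 2048 bits -> 256 bytes
float_bytes = a.astype(np.float32).nbytes     # 2048 floats -> 8192 bytes
```

For a 2048-element vector this is 256 bytes instead of 8192 bytes in float32, a 32x reduction.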
This article is a repost.
[Original project link](https://aistudio.baidu.com/aistudio/projectdetail/5330884)