FasterERNIE: End-to-End Optimization of Pre-trained Models

Reposted from AI Studio

Project link: https://aistudio.baidu.com/aistudio/projectdetail/3200308

In recent years, Transformer-based NLP models have advanced rapidly, and the core techniques behind both fundamental NLP capabilities and key applications have largely been replaced by Transformer-based ones. In research, the SOTA results on most NLP tasks are now set by Transformer models, yet putting them into production still faces deployment hurdles. For the text preprocessing stage of Transformer models in particular, two factors hinder a unified training-to-inference deployment:

  • The text preprocessing logic is complex and must be re-implemented in C++, which is costly. Training-side code is mostly written in Python, so production deployment requires migrating and aligning it, and the industry currently lacks a general, efficient C++ reference implementation.
  • The Python implementation of text preprocessing is an order of magnitude slower than a C++ implementation, so closing this gap is highly valuable in production scenarios. At serving time the tokenizer is also a performance bottleneck for NLP models; for small models such as ERNIE-Tiny, text preprocessing can take up to 30% of the total prediction time.

For these two reasons, we built the text preprocessing of common pre-trained models into a low-level Paddle operator: FasterTokenizer. FasterTokenizer is implemented in C++ and also exposes a Python API. It converts raw text into the model's numerical inputs, and it can be exported as part of the model and used directly for inference deployment, giving pre-trained models a unified training-to-inference development experience with high-performance text processing built in.

To push this unification further, we built FasterTokenizer into the pre-trained ERNIE model, so that the model's computation graph contains the high-performance text processing operator. This provides a simpler, unified training-to-inference development experience for text tasks, along with faster inference when deploying from Python. In addition, FasterERNIE leverages the fused Transformer encoder APIs of Paddle 2.2 to deliver faster training and inference on NVIDIA GPUs.
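To make the idea of "the tokenizer lives inside the model graph" concrete, here is a minimal conceptual sketch (our own illustration, not the FasterERNIE source; the class name TextInGraphErnie and the 768-dimensional classifier head are assumptions): a paddle.nn.Layer whose forward takes raw strings, so exporting the model also exports tokenization. The actual optimized implementation is FasterErnieForSequenceClassification, which we use later in this project.

# Conceptual sketch only; the real FasterERNIE fuses these pieces far more efficiently.
import paddle
from paddlenlp.experimental import FasterTokenizer, to_tensor
from paddlenlp.transformers import ErnieModel

class TextInGraphErnie(paddle.nn.Layer):
    def __init__(self, num_classes=2, max_seq_len=128):
        super().__init__()
        # tokenization runs as a C++ operator inside the graph
        self.tokenizer = FasterTokenizer.from_pretrained("ernie-1.0")
        self.ernie = ErnieModel.from_pretrained("ernie-1.0")
        # ernie-1.0 uses a 768-dimensional hidden state
        self.classifier = paddle.nn.Linear(768, num_classes)
        self.max_seq_len = max_seq_len

    def forward(self, texts):
        # texts is a string tensor; no Python-side preprocessing is required
        input_ids, token_type_ids = self.tokenizer(
            texts, max_seq_len=self.max_seq_len)
        _, pooled_output = self.ernie(input_ids, token_type_ids)
        return self.classifier(pooled_output)

model = TextInGraphErnie()
logits = model(to_tensor(["这家酒店环境不错,服务也很好。"]))
print(logits.shape)  # [1, 2]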


(Figure: Tokenizer performance comparison)

All of the source code for this project is open-sourced in PaddleNLP.

If you find it helpful, please give it a ⭐️ star ⭐️ so it doesn't get lost! Link: https://github.com/PaddlePaddle/PaddleNLP


The text processing operator: FasterTokenizer

The text processing operator FasterTokenizer is implemented in PaddleNLP 2.2. First, install paddlenlp version 2.2 or later.

!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
# !pip install --upgrade numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
import numpy
numpy.__version__
'1.19.5'

PaddleNLP 2.2 provides a Python API for FasterTokenizer.

Supported model | FasterTokenizer API usage
ERNIE, Chinese | FasterTokenizer.from_pretrained("ernie-1.0")
ERNIE 2.0 Base, English | FasterTokenizer.from_pretrained("ernie-2.0-en")
ERNIE 2.0 Large, English | FasterTokenizer.from_pretrained("ernie-2.0-large-en")
BERT-Base, Uncased | FasterTokenizer.from_pretrained("bert-base-uncased")
BERT-Large, Uncased | FasterTokenizer.from_pretrained("bert-large-uncased")
BERT-Base, Cased | FasterTokenizer.from_pretrained("bert-base-cased")
BERT-Large, Cased | FasterTokenizer.from_pretrained("bert-large-cased")
BERT-Base, Multilingual Cased | FasterTokenizer.from_pretrained("bert-base-multilingual-cased")
BERT-Base, Chinese | FasterTokenizer.from_pretrained("bert-base-chinese")
BERT-Base (Whole Word Masking), Chinese | FasterTokenizer.from_pretrained("bert-wwm-chinese")
BERT-Base (Whole Word Masking, EXT Data), Chinese | FasterTokenizer.from_pretrained("bert-wwm-ext-chinese")
RoBERTa-Base (Whole Word Masking, EXT Data), Chinese | FasterTokenizer.from_pretrained("roberta-wwm-ext")
RoBERTa-Large (Whole Word Masking, EXT Data), Chinese | FasterTokenizer.from_pretrained("roberta-wwm-ext-large")

import time

import numpy as np
import paddlenlp
from paddlenlp.transformers import ErnieTokenizer
from paddlenlp.experimental import FasterTokenizer
from paddlenlp.experimental import to_tensor


# ERNIE Tokenizer using PaddleNLP 2.2 FasterTokenizer
faster_tokenizer = FasterTokenizer.from_pretrained("ernie-1.0")
# PaddleNLP 2.1 ErnieTokenizer
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")


text = '在世界几大古代文明中,中华文明源远流长、从未中断,至今仍充满蓬勃生机与旺盛生命力,这在人类历史上是了不起的奇迹。'   \
       '本固根深、一脉相承的历史文化是铸就这一奇迹的重要基础。先秦时期是中华文化的创生期,奠定了此后几千年中华文化发展的'   \
       '基础。'

length = len(text)
print(f"length:{length}")
f_input_ids, f_token_type_ids = faster_tokenizer(to_tensor([text]), max_seq_len=length)
f_input_ids, f_token_type_ids = f_input_ids.numpy()[0], f_token_type_ids.numpy()[0]
print(f"ernie faster tokenizer, f_input_ids: {f_input_ids}")
print(f"ernie faster tokenizer, f_token_type_ids: {f_token_type_ids}")

encoded_inputs = tokenizer(text, max_seq_len=length)
input_ids, token_type_ids = encoded_inputs["input_ids"], encoded_inputs["token_type_ids"]
print(f"ernie tokenizer, input_ids: {input_ids}")
print(f"ernie tokenizer, token_type_ids: {token_type_ids}")

print(f"f_input_ids == input_ids : {np.array_equal(f_input_ids, input_ids)}")
print(f"f_token_type_ids == token_type_ids : {np.array_equal(f_token_type_ids, token_type_ids)}")

As the output above shows, FasterTokenizer and ErnieTokenizer produce identical results.

FasterTokenizer achieves high-performance text preprocessing through the following three techniques.

  • Aggressive operator-level memory optimization that reduces data allocation and copying
  • A cache of (BPE) tokenization results, which speeds up preprocessing of frequent characters (see the toy sketch after this list)
  • OpenMP multi-threading over batched data
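As a rough illustration of the caching idea, here is a toy Python sketch (our own illustration under the assumption of a simple per-word cache; FasterTokenizer implements its cache in C++):

# Toy sketch: repeated words reuse the cached subword result instead of
# re-running segmentation. Illustrative only, not the FasterTokenizer source.
from functools import lru_cache

from paddlenlp.transformers import ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

@lru_cache(maxsize=65536)
def word_to_ids(word):
    # the expensive step: subword segmentation plus vocabulary lookup
    return tuple(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)))

def encode(words):
    ids = []
    for w in words:
        ids.extend(word_to_ids(w))  # frequent words are cache hits
    return ids

print(encode(["中华", "文明", "中华"]))  # the second "中华" hits the cache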

So how much faster is the 2.2 FasterTokenizer compared with the 2.1 implementation?

Let's look at a simple benchmark.

Using the ERNIE tokenizer on 1000 samples with batch size 32, we compare throughput over 10 epochs of repeated processing.

batch_size = 32
data = [text] * 1000
epochs = 10

batches = [
    to_tensor(data[idx:idx + batch_size])
    for idx in range(0, len(data), batch_size)
]

start = time.time()
for _ in range(epochs):
    for batch_data in batches:
        input_ids, token_type_ids = faster_tokenizer(
            batch_data, max_seq_len=length)
end = time.time()

total_tokens = epochs * len(data) * length
print("The throughput of FasterTokenizer: {:,.2f} tokens/s".format((
    total_tokens / (end - start))))
The throughput of FasterTokenizer: 9,016,956.58 tokens/s
batches = [
    data[idx:idx + batch_size]
    for idx in range(0, len(data), batch_size)
]

start = time.time()
for _ in range(epochs):
    for batch_data in batches:
        encoded_inputs = tokenizer(batch_data, max_seq_len=length)
end = time.time()

total_tokens = epochs * len(data) * length
print("The throughput of ErnieTokenizer: {:,.2f} tokens/s".format((
    total_tokens / (end - start))))
The throughput of ErnieTokenizer: 136,282.50 tokens/s

As the results above show, the performance gain of FasterTokenizer over ErnieTokenizer is substantial.

Furthermore, for a better inference deployment experience, PaddleNLP 2.2 builds the high-performance FasterTokenizer into the ERNIE model to form the FasterERNIE model.

The figure below shows the text processing throughput of different frameworks, measured on an Intel® Xeon® Gold 6271C CPU @ 2.60GHz with 16 threads and a text length of 128.

FasterERNIE

Using the public Chinese sentiment classification dataset ChnSentiCorp as the example, we show how to train, predict with, and deploy FasterERNIE.

Loading the dataset and model

FasterErnieForSequenceClassification is the commonly used network structure for fine-tuning ERNIE on text classification tasks.

FasterErnieForSequenceClassification supports the following pre-trained models:

Supported model | FasterErnieForSequenceClassification API usage
ERNIE, Chinese | FasterErnieForSequenceClassification.from_pretrained("ernie-1.0")
ERNIE 2.0 Base, English | FasterErnieForSequenceClassification.from_pretrained("ernie-2.0-en")
ERNIE 2.0 Large, English | FasterErnieForSequenceClassification.from_pretrained("ernie-2.0-large-en")

import os

import numpy as np
import paddle
import paddle.nn.functional as F
import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.experimental import FasterErnieForSequenceClassification, to_tensor


def create_dataloader(dataset, mode='train', batch_size=1):
    def trans_fn(example):
        return {
            # raw text
            "text": example["text"],
            # label as int64 for the loss/metric
            "label": np.array(
                example["label"], dtype="int64")
        }

    dataset.map(trans_fn)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler)


train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])

max_seq_len = 128
batch_size = 32
# load the pre-trained model
model = FasterErnieForSequenceClassification.from_pretrained(
    'ernie-1.0',
    num_classes=len(train_ds.label_list),
    max_seq_len=max_seq_len)

# build the training data loader
train_data_loader = create_dataloader(
    train_ds, mode='train', batch_size=batch_size)
# build the validation data loader
dev_data_loader = create_dataloader(
    dev_ds, mode='dev', batch_size=batch_size)

Model training and evaluation

After defining the loss function, optimizer, and evaluation metric, we can start training.

epochs = 3
learning_rate = 5e-5
warmup_proportion = 0.1
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs

# learning rate schedule: linear warmup followed by linear decay
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# bias and LayerNorm parameters are excluded from weight decay
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)

# cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# accuracy metric
metric = paddle.metric.Accuracy()
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # evaluation loop: compute loss and accuracy on the dev set
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        texts, labels = batch['text'], batch['label']
        logits, predictions = model(texts)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accuracy: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
# start model training
global_step = 0
tic_train = time.time()
total_train_time = 0

# directory for saving model checkpoints
ckpt_dir = "./ckpt"
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        texts, labels = batch["text"], batch["label"]
        logits, predictions = model(texts)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(logits, labels)
        metric.update(correct)
        acc = metric.accumulate()
        # backpropagate the loss
        loss.backward()
        # update parameters, step the LR scheduler, and clear gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        global_step += 1
        if global_step % 10 == 0:
            time_diff = time.time() - tic_train
            total_train_time += time_diff
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accuracy: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / time_diff))
            tic_train = time.time()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tic_train = time.time()

Prediction

Prediction results can be obtained by calling the predict function directly on raw text.

data = [
    "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般",
    "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片",
    "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。"]
label_map = {0: 'negative', 1: 'positive'}
def predict(model, data, label_map, batch_size=1):
    # split the data into batches
    batches = [
        data[idx:idx + batch_size] for idx in range(0, len(data), batch_size)
    ]

    results = []
    model.eval()
    for texts in batches:
        # run the model on raw text
        logits, preds = model(texts)
        preds = preds.numpy()
        labels = [label_map[i] for i in preds]
        results.extend(labels)
    return results

# print the prediction for each input
results = predict(model, data, label_map, batch_size)
for idx, text in enumerate(data):
    print(text, " : ", results[idx])
这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般  :  negative
怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片  :  negative
作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。  :  positive

Exporting the model

The model saved during training can be exported to a static computation graph for inference deployment.

The model can be exported in one step via the to_static() interface.

save_path = os.path.join("export", "inference")
model.to_static(save_path)
[2022-02-11 11:32:41,988] [    INFO] - Already save the static model to the path export/inference
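The export produces the static graph and weight files that the deployment code in the next section loads; a quick sanity check (assuming the save_path used above):

import os

# After to_static(), the graph and weights are written next to save_path.
print(os.path.exists(save_path + ".pdmodel"))    # True: the static graph
print(os.path.exists(save_path + ".pdiparams"))  # True: the model weights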

Deployment

Once the model is exported, we can complete inference deployment with the Paddle Inference library.

For Python deployment, the inference implementation is as follows:

class Predictor(object):
    def __init__(self,
                 save_path,
                 batch_size=32):
        self.batch_size = batch_size

        model_file = save_path + ".pdmodel"
        params_file = save_path + ".pdiparams"
        if not os.path.exists(model_file):
            raise ValueError("The model file {} is not found.".format(
                model_file))
        if not os.path.exists(params_file):
            raise ValueError("The params file {} is not found.".format(
                params_file))
        # build the inference config from the exported static graph
        config = paddle.inference.Config(model_file, params_file)
        config.enable_use_gpu(100, 0)
        config.switch_use_feed_fetch_ops(False)
        config.delete_pass("embedding_eltwise_layernorm_fuse_pass")
        # create the predictor and bind the input/output handles
        self.predictor = paddle.inference.create_predictor(config)
        self.input_handle = self.predictor.get_input_handle(
            self.predictor.get_input_names()[0])
        self.output_handles = [
            self.predictor.get_output_handle(name)
            for name in self.predictor.get_output_names()
        ]

    def predict(self, data, label_map):
        # feed the raw text directly into the graph
        self.input_handle.copy_from_cpu(data)
        # run inference
        self.predictor.run()
        # fetch logits and predicted class ids
        logits = self.output_handles[0].copy_to_cpu()
        preds = self.output_handles[1].copy_to_cpu()
        labels = [label_map[pred] for pred in preds]
        return labels

batch_size = 1

predictor = Predictor(save_path, batch_size)

data = [
    "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般",
    "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片",
    "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。"]
label_map = {0: "negative", 1: "positive"}
batches = [
    data[idx:idx + batch_size]
    for idx in range(0, len(data), batch_size)
]
results = []
for batch in batches:
    labels = predictor.predict(batch, label_map=label_map)
    results.extend(labels)

for idx, text in enumerate(data):
    print(text, " : ", results[idx])
这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般  :  negative
怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片  :  negative
作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。  :  positive



As the implementation above shows, inference no longer needs a separate text preprocessing step: feeding in raw text directly yields the prediction results.

An example C++ deployment script is shown below:


#include <gflags/gflags.h>

#include <iostream>
#include <numeric>

#include "paddle/include/paddle_inference_api.h"

DEFINE_string(model_file, "", "Path of the inference model file (*.pdmodel).");
DEFINE_string(params_file, "", "Path of the inference params file (*.pdiparams).");
DEFINE_bool(use_gpu, true, "enable gpu");

template <typename T>
void GetOutput(paddle_infer::Predictor* predictor,
               std::string output_name,
               std::vector<T>* out_data) {
  auto output = predictor->GetOutputHandle(output_name);
  std::vector<int> output_shape = output->shape();
  int out_num = std::accumulate(
      output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
  out_data->resize(out_num);
  output->CopyToCpu(out_data->data());
}

int main(int argc, char* argv[]) {
  google::ParseCommandLineFlags(&argc, &argv, true);

  paddle_infer::Config config;
  config.SetModel(FLAGS_model_file, FLAGS_params_file);
  if (FLAGS_use_gpu) {
    config.EnableUseGpu(100, 0);
  }
  auto pass_builder = config.pass_builder();
  pass_builder->DeletePass("embedding_eltwise_layernorm_fuse_pass");
  auto predictor = paddle_infer::CreatePredictor(config);

  std::vector<std::string> data{
      "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般",
      "请问:有些字打错了,我怎么样才可以回去编辑一下啊?",
      "本次入住酒店的网络不是很稳定,断断续续,希望能够改进。"};
  auto input_names = predictor->GetInputNames();
  auto text = predictor->GetInputHandle(input_names[0]);
  text->ReshapeStrings(data.size());
  text->CopyStringsFromCpu(&data);
  predictor->Run();

  std::vector<float> logits;
  std::vector<int64_t> preds;
  auto output_names = predictor->GetOutputNames();
  GetOutput(predictor.get(), output_names[0], &logits);
  GetOutput(predictor.get(), output_names[1], &preds);
  for (size_t i = 0; i < data.size(); i++) {
    std::cout << data[i] << " : " << preds[i] << std::endl;
  }
  return 0;
}

As the C++ inference code above shows, the implementation no longer requires a complex C++ tokenizer, which greatly reduces development effort. By our count, the ERNIE/BERT inference deployment code in PaddleNLP 2.2 is about 94% smaller than in version 2.1.

For the detailed C++ deployment script, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/faster/faster_ernie/seq_cls

Join the WeChat group and learn together

Join the PaddleNLP technical discussion group on WeChat now and exchange ideas about NLP!
