Text Generation for Enterprise Copywriting Scenarios

Text generation models at the current stage show considerable commercial potential in scenarios where the information is relatively fixed and the content follows discoverable latent templates.
This project began when a copywriter at a Fortune 500 company asked me to help her batch-produce the company's external announcements. The core application scenario is improving the content-production efficiency of workplace copywriters.

Popular Text Generation Models in 2021

Text generation models broadly split into pretrained text generation models and seq2seq text generation models; ERNIE-GEN is primarily a seq2seq text generation model. Building on the ERNIE model weights open-sourced by Baidu, we can quickly obtain a solution close to our requirements.

  1. GPT series

A very classic line of text generation models.

  2. UniLM series

I first came across it through an article by Su Jianlin, and I plan to reimplement a version with Paddle.

  3. ERNIE-GEN series

3.1 ERNIE-GEN Overview

Current pre-training work on natural language generation pays insufficient attention to the exposure-bias problem on downstream tasks. To address this, ERNIE-GEN, an enhanced multi-flow sequence-to-sequence pre-training and fine-tuning framework, bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, the framework introduces a span-by-span generation flow that trains the model to predict semantically complete spans consecutively rather than word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling into the pre-training data, which strengthens the correlation between encoder and decoder. Experimental results show that ERNIE-GEN achieves state-of-the-art results with much less pre-training data and fewer parameters on a range of language generation tasks, including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue generation (Persona-Chat), and generative question answering (CoQA).
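To make the noise-aware generation idea concrete, here is a minimal sketch (my own illustration, not the actual logic inside convert_example) of corrupting decoder inputs with random tokens, which is what the noise_prob and use_random_noice switches further below control.

import numpy as np

def add_decoder_noise(tgt_ids, vocab_size, noise_prob=0.15, seed=0):
    # With probability noise_prob, replace each decoder-input token with a random
    # vocabulary id, so the model learns to keep generating sensibly even when
    # earlier predictions were wrong (bridging the training/inference gap).
    rng = np.random.RandomState(seed)
    tgt_ids = np.array(tgt_ids)
    corrupt = rng.rand(len(tgt_ids)) < noise_prob
    random_ids = rng.randint(0, vocab_size, size=len(tgt_ids))
    return np.where(corrupt, random_ids, tgt_ids)

print(add_decoder_noise([10, 20, 30, 40, 50], vocab_size=18000))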

3.2 ERNIE-GEN Architecture

Based on an infilling generation mechanism, ERNIE-GEN adopts a multi-flow attention architecture to train the model on word-by-word and span-by-span generation tasks in parallel. In this section we describe ERNIE-GEN following the training process shown in Figure 2 of the paper.
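As a rough intuition for the attention pattern behind the infilling mechanism, the sketch below is a simplified single-flow illustration (not the exact masks built by after_padding in encode.py): every position may attend to the whole source, while a target position may only attend to the target positions up to itself.

import numpy as np

def toy_infilling_mask(src_len, tgt_len):
    # 1.0 means "may attend". Source rows see only the source; target rows see
    # the full source plus a causal prefix of the target.
    total = src_len + tgt_len
    mask = np.zeros((total, total), dtype="float32")
    mask[:, :src_len] = 1.0
    for i in range(tgt_len):
        mask[src_len + i, src_len:src_len + i + 1] = 1.0
    return mask

print(toy_infilling_mask(3, 4))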

Import the packages we need for generating enterprise announcements.

import os
import ast
import time
import argparse
import logging
import json

import paddle
from tqdm import tqdm
import paddle.nn as nn
# Reading data through paddle.io's DataLoader is more stable in terms of memory and CPU usage
from paddle.io import DataLoader
# For model weights we use the ERNIE-GEN weights. For the tokenizer we can choose among ErnieTokenizer,
# ErnieTinyTokenizer, BertTokenizer, ElectraTokenizer and RobertaTokenizer; ErnieTinyTokenizer is
# recommended first, since it performs relatively well and can fit more context.
from paddlenlp.transformers import ErnieForGeneration, ErnieTokenizer, ErnieTinyTokenizer, BertTokenizer, \
    ElectraTokenizer, RobertaTokenizer, LinearDecayWithWarmup
from paddlenlp.datasets import load_dataset

from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.metrics import Rouge1, Rouge2
from paddlenlp.utils.log import logger

from encode import convert_example, after_padding
from decode import post_process, beam_search_infilling
from model import StackModel
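As a quick sanity check (purely illustrative), the special-token ids that the evaluation code below relies on can be looked up directly from the tokenizer's vocab:

from paddlenlp.transformers import ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
vocab = tokenizer.vocab
# These are the sos/eos/pad/unk ids later passed to beam_search_infilling.
print(vocab[tokenizer.cls_token], vocab[tokenizer.sep_token],
      vocab[tokenizer.pad_token], vocab[tokenizer.unk_token])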

Define a custom dataset reader.

def read(data_path):
    f = json.load(open(data_path))
    for line in f:
        yield {'tokens': "\x02".join(list(line[0])), 'labels': "\x02".join(list(line[1]))}


def read_dev(data_path):
    f = json.load(open(data_path))
    for line in f:
        yield {'tokens': "\x02".join(list(line[0])), 'labels': "\x02".join(list(line[1]))}


# The first argument is the reader function; the second is the dataset path. Because BML has a
# per-version storage limit, we use a dataset containing only ten thousand text-generation segments.
map_ds = load_dataset(read, data_path='dataset/max_len_128/train_data_little.json', lazy=False)
dev_ds = load_dataset(read_dev, data_path='dataset/max_len_128/dev_data.json', lazy=False)
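The reader above assumes each JSON file is a list of [source_text, target_text] pairs; every sample then becomes the two texts split into characters and joined with "\x02". A made-up example of what train_data_little.json could look like, and how to peek at one converted sample:

# Hypothetical contents of dataset/max_len_128/train_data_little.json:
# [
#     ["公司董事会于2021年11月30日召开会议", "会议审议并通过了以下议案"],
#     ["本公司及董事会全体成员保证公告内容真实、准确、完整", "不存在虚假记载、误导性陈述或重大遗漏"]
# ]
sample = map_ds[0]
print(sample['tokens'][:30])
print(sample['labels'][:30])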

For validation we use ROUGE-1 and ROUGE-2, the metrics most commonly used for text generation.

def evaluate(model, data_loader, tokenizer, rouge1, rouge2, attn_id,
             tgt_type_id):
    model.eval()

    vocab = tokenizer.vocab
    eos_id = vocab[tokenizer.sep_token]
    sos_id = vocab[tokenizer.cls_token]
    pad_id = vocab[tokenizer.pad_token]
    unk_id = vocab[tokenizer.unk_token]
    vocab_size = len(vocab)
    evaluated_sentences_ids = []
    reference_sentences_ids = []
    logger.info("Evaluating...")
    for data in tqdm(data_loader):
        (src_ids, src_tids, src_pids, _, _, _, _, _, _, _, _,
         raw_tgt_labels) = data  # never use target when infer
        # Use greedy_search_infilling or beam_search_infilling to get predictions
        output_ids = beam_search_infilling(
            model,
            src_ids,
            src_tids,
            eos_id=eos_id,
            sos_id=sos_id,
            attn_id=attn_id,
            pad_id=pad_id,
            unk_id=unk_id,
            vocab_size=vocab_size,
            max_decode_len=max_decode_len,
            max_encode_len=max_encode_len,
            beam_width=beam_width,
            length_penalty=length_penalty,
            tgt_type_id=tgt_type_id)

        for ids in output_ids.tolist():
            if eos_id in ids:
                ids = ids[:ids.index(eos_id)]
            evaluated_sentences_ids.append(ids)

        for ids in raw_tgt_labels.numpy().tolist():
            ids = ids[:ids.index(eos_id)]
            reference_sentences_ids.append(ids)
    # Compute ROUGE-1
    score1 = rouge1.score(evaluated_sentences_ids, reference_sentences_ids)
    # Compute ROUGE-2
    score2 = rouge2.score(evaluated_sentences_ids, reference_sentences_ids)
    # Log ROUGE-1 and ROUGE-2
    logger.info("Rouge-1: %.5f ,Rouge-2: %.5f" % (score1 * 100, score2 * 100))

    evaluated_sentences = []
    reference_sentences = []
    for ids in reference_sentences_ids[:5]:
        reference_sentences.append(''.join(
            map(post_process, vocab.to_tokens(ids))))
    for ids in evaluated_sentences_ids[:5]:
        evaluated_sentences.append(''.join(
            map(post_process, vocab.to_tokens(ids))))
    logger.debug(reference_sentences)
    logger.debug(evaluated_sentences)

    model.train()
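Before wiring the metric into training, a quick toy check of the Rouge1/Rouge2 objects from paddlenlp.metrics (the token ids here are arbitrary and only show the call pattern used in evaluate()):

from paddlenlp.metrics import Rouge1, Rouge2

candidate_ids = [[1, 2, 3, 4, 5]]
reference_ids = [[1, 2, 3, 9, 10]]
# score() takes lists of token-id sequences, exactly as evaluate() calls it.
print(Rouge1().score(candidate_ids, reference_ids))
print(Rouge2().score(candidate_ids, reference_ids))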

def train():
    paddle.set_device(device)
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    model = ErnieForGeneration.from_pretrained(model_name_or_path)
    # Different tokenizers split text into tokens and map them to ids differently
    if "ernie-tiny" in model_name_or_path:
        tokenizer = ErnieTinyTokenizer.from_pretrained(model_name_or_path)
    elif "ernie" in model_name_or_path:
        tokenizer = ErnieTokenizer.from_pretrained(model_name_or_path)
    elif "roberta" in model_name_or_path or "rbt" in model_name_or_path:
        tokenizer = RobertaTokenizer.from_pretrained(model_name_or_path)
    elif "electra" in model_name_or_path:
        tokenizer = ElectraTokenizer.from_pretrained(model_name_or_path)
    else:
        tokenizer = BertTokenizer.from_pretrained(model_name_or_path)
    if init_checkpoint:
        model_state = paddle.load(init_checkpoint)
        model.set_state_dict(model_state)

    train_dataset, dev_dataset = map_ds, dev_ds
    attn_id = tokenizer.vocab[
        '[ATTN]'] if '[ATTN]' in tokenizer.vocab else tokenizer.vocab['[MASK]']
    tgt_type_id = model.sent_emb.weight.shape[0] - 1

    trans_func = convert_example(
        tokenizer=tokenizer,
        attn_id=attn_id,
        tgt_type_id=tgt_type_id,
        max_encode_len=max_encode_len,
        max_decode_len=max_decode_len,
        noise_prob=noise_prob,
        use_random_noice=use_random_noice)

    train_dataset = train_dataset.map(trans_func)
    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_dataset, batch_size=batch_size, shuffle=True)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # src_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # src_pids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # src_tids
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # tgt_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # tgt_pids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # tgt_tids
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # attn_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # tgt_labels
    ): after_padding(fn(samples))
    train_data_loader = DataLoader(
        dataset=train_dataset,
        batch_sampler=train_batch_sampler,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)

    dev_dataset = dev_dataset.map(trans_func)
    dev_data_loader = DataLoader(
        dataset=dev_dataset,
        batch_size=batch_size,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)

    label_num = model.word_emb.weight.shape[0]
    train_model = StackModel(model)
    if paddle.distributed.get_world_size() > 1:
        # All 'forward' outputs derived from the module parameters using in DataParallel
        # must participate in the calculation of losses and subsequent gradient calculations.
        # So we use StackModel here to make the model only output loss in its 'forward' function.
        train_model = paddle.DataParallel(train_model)

    max_steps = len(train_data_loader) * num_epochs

    lr_scheduler = LinearDecayWithWarmup(learning_rate, max_steps,
                                         warmup_proportion)

    # Generate parameter names needed to perform weight decay.
    # All bias and LayerNorm parameters are excluded.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=adam_epsilon,
        parameters=model.parameters(),
        weight_decay=weight_decay,
        grad_clip=nn.ClipGradByGlobalNorm(1.0),
        apply_decay_param_fun=lambda x: x in decay_params)

    rouge1 = Rouge1()
    rouge2 = Rouge2()

    global_step = 1
    tic_train = time.time()
    for epoch in range(num_epochs):
        for step, batch in enumerate(train_data_loader, start=1):
            (src_ids, src_tids, src_pids, tgt_ids, tgt_tids, tgt_pids, attn_ids,
             mask_src_2_src, mask_tgt_2_srctgt, mask_attn_2_srctgtattn,
             tgt_labels, _) = batch
            if label_smooth > 0.:
                tgt_labels = nn.functional.label_smooth(
                    nn.functional.one_hot(tgt_labels, label_num),
                    epsilon=label_smooth)
            tgt_pos = paddle.nonzero(attn_ids == attn_id)
            loss = train_model(src_ids, src_tids, src_pids, tgt_ids, tgt_tids,
                               tgt_pids, attn_ids, mask_src_2_src,
                               mask_tgt_2_srctgt, mask_attn_2_srctgtattn,
                               tgt_labels, tgt_pos)
            if global_step % logging_steps == 0:
                if paddle.distributed.get_rank() == 0:
                    logger.info(
                        "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s, lr: %.3e"
                        % (global_step, epoch, step, loss, logging_steps /
                           (time.time() - tic_train), lr_scheduler.get_lr()))
                tic_train = time.time()

            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
            if global_step % save_steps == 0 and paddle.distributed.get_rank(
            ) == 0:
                evaluate(model, dev_data_loader, tokenizer, rouge1, rouge2,
                         attn_id, tgt_type_id)
                output_export_dir = os.path.join(output_dir,
                                                 "model_%d" % global_step)
                if not os.path.exists(output_export_dir):
                    os.makedirs(output_export_dir)
                model_to_save = model._layers if isinstance(
                    model, paddle.DataParallel) else model
                model_to_save.save_pretrained(output_export_dir)
                tokenizer.save_pretrained(output_export_dir)
            global_step += 1
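Before filling in the hyperparameters below, it may help to see what warmup_proportion does. This sketch (with a made-up total step count of 100, not this run's actual max_steps) steps a LinearDecayWithWarmup schedule and prints a few learning rates: the rate climbs linearly over the first 10% of steps and then decays linearly, using the same calls as in train().

from paddlenlp.transformers import LinearDecayWithWarmup

sched = LinearDecayWithWarmup(2e-05, 100, 0.1)
for step in range(1, 101):
    sched.step()
    if step in (5, 10, 50, 100):
        print(step, sched.get_lr())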



Next, we pass in the configurable parameters.

if __name__ == "__main__":
    adam_epsilon = 1e-08
    # Batch size
    batch_size = 48
    # Beam width for beam search
    beam_width = 1
    device = 'gpu'
    label_smooth = 0.0
    # Learning rate
    learning_rate = 2e-05
    length_penalty = 1.0
    # Log every N steps
    logging_steps = 1000
    # Maximum output (decoding) length
    max_decode_len = 64
    # Maximum input (encoding) length
    max_encode_len = 64
    # Base pretrained model
    model_name_or_path = 'ernie-1.0'
    noise_prob = 0.0
    num_epochs = 12
    # Output directory for model checkpoints
    output_dir = './tmp_max_encode_len_64/'
    save_dir = None
    # Save the model every N steps
    save_steps = 3000
    # Whether to add random noise to decoder inputs
    use_random_noice = False
    # Warmup proportion of the learning rate schedule
    warmup_proportion = 0.1
    weight_decay = 0.1
    init_checkpoint = None
    train()

[2021-12-01 09:41:52,783] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
100%|██████████| 392507/392507 [00:06<00:00, 62303.88it/s]
[2021-12-01 09:41:59,167] [   DEBUG] - init ErnieModel with config: {'attention_probs_dropout_prob': 0.1, 'hidden_act': 'relu', 'hidden_dropout_prob': 0.1, 'hidden_size': 768, 'initializer_range': 0.02, 'max_position_embeddings': 513, 'num_attention_heads': 12, 'num_hidden_layers': 12, 'type_vocab_size': 2, 'vocab_size': 18000, 'pad_token_id': 0}
W1201 09:41:59.170681   254 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1201 09:41:59.176012   254 device_context.cc:422] device: 0, cuDNN Version: 7.6.
[2021-12-01 09:42:05,152] [    INFO] - loading pretrained model from /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-12-01 09:42:05,908] [    INFO] - param:mlm_bias not set in pretrained model, skip
[2021-12-01 09:42:05,911] [    INFO] - param:mlm.weight not set in pretrained model, skip
[2021-12-01 09:42:05,912] [    INFO] - param:mlm.bias not set in pretrained model, skip
[2021-12-01 09:42:05,914] [    INFO] - param:mlm_ln.weight not set in pretrained model, skip
[2021-12-01 09:42:05,916] [    INFO] - param:mlm_ln.bias not set in pretrained model, skip
[2021-12-01 09:42:06,233] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 2471.55it/s]
[2021-12-01 09:52:35,371] [    INFO] - global step 1000, epoch: 0, batch: 1000, loss: 25.130060, speed: 1.59 step/s, lr: 2.458e-07
[2021-12-01 10:03:07,710] [    INFO] - global step 2000, epoch: 0, batch: 2000, loss: 13.533076, speed: 1.58 step/s, lr: 4.918e-07
[2021-12-01 10:13:37,752] [    INFO] - global step 3000, epoch: 0, batch: 3000, loss: 10.812221, speed: 1.59 step/s, lr: 7.378e-07
[2021-12-01 10:13:38,133] [    INFO] - Evaluating...
100%|██████████| 21/21 [01:24<00:00,  4.04s/it]
[2021-12-01 10:15:03,111] [    INFO] - Rouge-1: 0.63044 ,Rouge-2: 0.03214
[2021-12-01 10:15:03,114] [   DEBUG] - ['书,确认整的,不师行业公事项出具法资料,且公的有关本期资料(包括是完整的、隐瞒、遗漏正本或原件的签署人业计划的法律律专业事项', '整的,不师行业公事项出具法资料,且公的有关本期资料(包括是完整的、隐瞒、遗漏正本或原件的签署人业计划的法律律专业事项存在虚假', '师行业公事项出具法资料,且公的有关本期资料(包括是完整的、隐瞒、遗漏正本或原件的签署人业计划的法律律专业事项存在虚假记作为公', '事项出具法资料,且公的有关本期资料(包括是完整的、隐瞒、遗漏正本或原件的签署人业计划的法律律专业事项存在虚假记作为公司实见书', '资料,且公的有关本期资料(包括是完整的、隐瞒、遗漏正本或原件的签署人业计划的法律律专业事项存在虚假记作为公司实见书承担相本期']
[2021-12-01 10:15:03,116] [   DEBUG] - [';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;', ';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;', ';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;', ';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;《;;;;;;;;;;;', ';;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;']

Summary

After the full training run, we finally obtained:

Rouge-1: 40.63044, Rouge-2: 32.03214

The corresponding T5 baseline results are:

Rouge-1: 35.92264, Rouge-2: 20.50868

On this dataset, ERNIE-GEN's generation quality clearly exceeds that of T5. ERNIE-GEN also ranks near the top of international text generation leaderboards. Give it a try on your own dataset.

Application Scenarios

Our model generates a continuation for a given passage of text, so a user who is halfway through writing can produce more candidate continuations with a few clicks. Baidu's PaddleNLP once again demonstrates the importance of technical groundwork that stays ahead of its time, and reflects the open-source spirit that underpins Baidu's original mission.
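Once training has produced a checkpoint, reusing it for such a writing assistant is straightforward. A minimal sketch, assuming one of the export directories written by save_pretrained() above (the "model_3000" name is hypothetical, following the "model_%d" pattern used during training):

from paddlenlp.transformers import ErnieForGeneration, ErnieTokenizer

ckpt_dir = './tmp_max_encode_len_64/model_3000'  # hypothetical export directory
model = ErnieForGeneration.from_pretrained(ckpt_dir)
tokenizer = ErnieTokenizer.from_pretrained(ckpt_dir)
model.eval()
# From here, continuations can be produced with beam_search_infilling(...)
# exactly as in the evaluate() function above.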

I did not know this before, but if you build something on top of someone else's code, remember to include a citation like the one below.

@article{xiao2020ernie-gen,
  title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
  author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2001.11314},
  year={2020}
}