Transformer-based Machine Translation

Machine translation is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language).

This project is a PaddlePaddle implementation of Transformer, the mainstream model in machine translation. It covers model training, prediction, and the use of custom data. Users can build their own translation models on top of the released content.

Transformer is a novel network architecture proposed in the paper Attention Is All You Need for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation. It relies entirely on the attention mechanism to model the mapping from an input sequence to an output sequence.


Figure 1: Transformer network architecture

Compared with the recurrent neural networks (RNNs) that were widely used in earlier Seq2Seq models, using Self-Attention to transform an input sequence into an output sequence has the following main advantages:

  • Lower computational complexity
    • For a sequence of length n with feature dimension d, the per-layer cost in an RNN is O(n * d * d) (n time steps, each computing a d-dimensional matrix-vector product), while in Self-Attention it is O(n * n * d) (d-dimensional dot products, or other similarity functions, computed between every pair of the n positions). Since n is usually smaller than d, Self-Attention is typically cheaper (see the quick calculation after this list).
  • Higher parallelism
    • In an RNN, the computation at the current time step depends on the result of the previous time step; in Self-Attention, each position depends only on the input, not on the outputs of earlier positions, so all positions can be computed fully in parallel.
  • Easier learning of long-range dependencies
    • In an RNN, relating two positions that are n steps apart takes n computation steps; in Self-Attention, any two positions are directly connected, and the shorter the path, the easier it is for the signal to propagate.
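To make the complexity comparison above concrete, here is a quick back-of-the-envelope calculation. The sequence length of 50 is purely illustrative; d = 512 matches the d_model of the base configuration used later in this notebook.

n, d = 50, 512                      # sequence length, feature dimension (illustrative)
rnn_ops = n * d * d                 # n time steps, each a d x d matrix-vector product
self_attn_ops = n * n * d           # n * n pairwise d-dimensional dot products
print(rnn_ops, self_attn_ops)       # 13107200 vs 1280000: Self-Attention is cheaper when n < d
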
The Self-Attention-based sequence-modeling blocks introduced by Transformer have since been widely adopted in semantic representation models such as BERT, with remarkable results.
A more detailed walkthrough of Transformer

Environment

  • PaddlePaddle framework: the latest version, 2.1, is installed by default on the AI Studio platform.

  • PaddleNLP: deeply compatible with framework 2.1, it is the best practice for NLP with PaddlePaddle 2.1.

!unzip -o transformer_mt.zip
%cd transformer_mt/
/home/aistudio/transformer_mt
# Install dependencies
!pip install --upgrade paddlenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Pipeline


Figure 2: Pipeline
import os
import time
import yaml
import logging
import argparse
import numpy as np
from pprint import pprint
from attrdict import AttrDict
import jieba

from functools import partial

import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, BatchSampler
from paddlenlp.data import Vocab, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import TransformerModel, InferTransformerModel, CrossEntropyCriterion, position_encoding_init
from paddlenlp.utils.log import logger

from utils import post_process_seq

1. Data Preprocessing

This tutorial uses the Chinese-English data from the CWMT dataset as the training corpus.
The CWMT dataset contains more than 9 million sentence pairs of relatively high quality, which makes it well suited for training a Transformer translation model.
The Chinese side requires Jieba word segmentation followed by BPE; the English side only requires BPE.

BPE (Byte Pair Encoding)

Advantages of BPE:

  • Compresses the vocabulary;
  • Alleviates the OOV (out-of-vocabulary) problem to some extent.

Figure 3: Learn BPE


Figure 4: Apply BPE


Figure 5: Jieba + BPE
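The shell script below runs the whole preprocessing pipeline, but conceptually the per-sentence processing of the Chinese side looks like the following sketch. The codes file name bpe.ch is hypothetical; the real codes file is produced by subword-nmt learn-bpe inside preprocess.sh.

import jieba
from subword_nmt.apply_bpe import BPE

sentence = "机器翻译是计算语言学的一个分支"
segmented = " ".join(jieba.cut(sentence))       # Jieba word segmentation
with open("bpe.ch", encoding="utf8") as codes:  # hypothetical BPE codes file
    bpe = BPE(codes)
print(bpe.segment(segmented))                   # subword units; word-internal pieces end with "@@"
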
# Data preprocessing: Jieba word segmentation, BPE segmentation, and vocabulary building.
!bash preprocess.sh
jieba tokenize...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.705 seconds.
Prefix dict has been built successfully.
source learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
target learn-bpe and apply-bpe...
no pair has frequency >= 2. Stopping
source get-vocab. if loading pretrained model, use its vocab.
target get-vocab. if loading pretrained model, use its vocab.
Over.
# Download the pretrained model
!bash get_data_and_model.sh
Over.

2. Building the DataLoader

The create_data_loader function below creates the DataLoader objects needed for the training and validation sets, and create_infer_loader creates the DataLoader object needed for the prediction set. A DataLoader yields the data batch by batch. A brief description of the built-in paddlenlp utilities called inside these functions:

  • paddlenlp.data.Vocab.load_vocabulary: the Vocab class bundles the methods for mapping between text tokens and ids, and supports building a vocabulary from a file, a dict, JSON, and other sources.
  • paddlenlp.datasets.load_dataset: when creating a dataset from local files, the recommended approach is to write a read function that matches the local data format and pass it to load_dataset().
  • paddlenlp.data.Pad: the padding operation (a short usage sketch follows Figure 7 below).
    See the PaddleNLP documentation for details.

Figure 6: Workflow for building the DataLoader

Figure 7: DataLoader details
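As a minimal, hedged illustration of paddlenlp.data.Pad, the padding primitive used by the collate functions below, the snippet pads three id sequences to a common length:

from paddlenlp.data import Pad

pad = Pad(pad_val=0)                 # pad shorter sequences with id 0
print(pad([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]
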
# Custom function for reading local data
def read(src_path, tgt_path, is_predict=False):
    if is_predict:
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src':src_line, 'tgt':''}
    else:
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src':src_line, 'tgt':tgt_line}

# Keep only samples whose source/target lengths (plus 1 for the special token) lie within [min_len, max_len]
def min_max_filer(data, max_len, min_len=0):
    # 1 for special tokens.
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)

# Create the DataLoaders for the training and validation sets
def create_data_loader(args):
    train_dataset = load_dataset(read, src_path=args.training_file.split(',')[0], tgt_path=args.training_file.split(',')[1], lazy=False)
    dev_dataset = load_dataset(read, src_path=args.validation_file.split(',')[0], tgt_path=args.validation_file.split(',')[1], lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()

        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)

        return source, target

    # Build the training and validation DataLoaders
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))

        # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
        batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)
        
        # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)

    return data_loaders


def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by training into a list.
    """
    word_pad = Pad(pad_idx)
    # Encoder input: source tokens followed by <e> (eos)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    # Decoder input: <s> (bos) followed by target tokens (teacher forcing)
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    # Label: target tokens followed by <e>, i.e. the decoder input shifted left by one
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)

    data_inputs = [src_word, trg_word, lbl_word]

    return data_inputs
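
# To make the padding and the teacher-forcing shift concrete, a tiny hedged
# example of what prepare_train_input returns for two toy samples
# (the token ids are made up; bos_idx=0 and eos_idx=1 follow transformer.base.yaml):
# batch = [([5, 6, 7], [8, 9]), ([5, 6], [8, 9, 10, 11])]
# src, trg, lbl = prepare_train_input(batch, bos_idx=0, eos_idx=1, pad_idx=0)
# src == [[5, 6, 7, 1], [5, 6, 1, 0]]          source + <e>, padded with pad_idx
# trg == [[0, 8, 9, 0, 0], [0, 8, 9, 10, 11]]  <s> + target, the decoder input
# lbl.shape == (2, 5, 1)                       target + <e>, i.e. trg shifted left by one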

# Create the DataLoader for the test set; the steps mirror those for the training and validation sets above
def create_infer_loader(args):
    dataset = load_dataset(read, src_path=args.predict_file, tgt_path=None, is_predict=True, lazy=False)

    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()

        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)

        return source, target

    dataset = dataset.map(convert_samples, lazy=False)

    # BatchSampler: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/BatchSampler_cn.html
    batch_sampler = BatchSampler(dataset, batch_size=args.infer_batch_size, drop_last=False)
    
    # DataLoader: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html
    data_loader = DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=partial(
            prepare_infer_input,
            bos_idx=args.bos_idx,
            eos_idx=args.eos_idx,
            pad_idx=args.bos_idx),
        num_workers=0,
        return_list=True)
    return data_loader, trg_vocab.to_tokens

def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx):
    """
    Put all padded data needed by beam search decoder into a list.
    """
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])

    return [src_word, ]

3. Building the Model

PaddleNLP provides a Transformer API that can be called directly:


Figure 8: Building the model

Figure 9: Example
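As a minimal sketch (separate from the training script below), the TransformerModel used for training can be built and run on a fake batch as follows; the hyperparameter values mirror transformer.base.yaml, and the random token ids are purely illustrative.

import paddle
from paddlenlp.transformers import TransformerModel

model = TransformerModel(
    src_vocab_size=10000, trg_vocab_size=10000, max_length=257,
    num_encoder_layers=6, num_decoder_layers=6, n_head=8,
    d_model=512, d_inner_hid=2048, dropout=0.1,
    weight_sharing=False, bos_id=0, eos_id=1)

src = paddle.randint(low=4, high=10000, shape=[2, 8], dtype='int64')  # fake source batch
trg = paddle.randint(low=4, high=10000, shape=[2, 8], dtype='int64')  # fake decoder input
logits = model(src_word=src, trg_word=trg)
print(logits.shape)   # [2, 8, 10000]: scores over the target vocabulary at every position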

4. Training the Model

Run the do_train function. Inside do_train, the optimizer, the loss function, and the evaluation metric, perplexity, are configured.

Perplexity is a standard measure of how good a language model is, i.e. how fluent its sentences are, and is widely used in machine translation and text generation. The lower the perplexity, the more fluent the sentences and the better the model.
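In other words, perplexity is the exponential of the average per-token cross-entropy. The training loop below reports it via np.exp(min(loss, 100)), where the min is only an overflow guard; for example:

import numpy as np

avg_token_loss = 10.5                    # illustrative loss, close to the first training log line below
ppl = np.exp(min(avg_token_loss, 100))   # perplexity = exp(average cross-entropy per token)
print(round(float(ppl), 1))              # 36315.5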


Figure 10: Training the model
def do_train(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)
    # Set the random seed for reproducibility (the yaml stores it as a string,
    # e.g. 'None', hence the eval)
    random_seed = eval(str(args.random_seed))
    if random_seed is not None:
        paddle.seed(random_seed)

    # Define data loader
    (train_loader), (eval_loader) = create_data_loader(args)

    # Define model
    transformer = TransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)

    # Noam learning-rate schedule: linear warmup for warmup_steps steps,
    # then decay proportional to the inverse square root of the step number
    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0

    # Train loop
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:

            (src_word, trg_word, lbl_word) = input_data

            logits = transformer(src_word=src_word, trg_word=trg_word)

            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
            
            # Compute gradients
            avg_cost.backward()
            # Update parameters
            optimizer.step()
            # Clear gradients
            optimizer.clear_grad()

            if (step_idx + 1) % args.print_step == 0 or step_idx == 0:
                total_avg_cost = avg_cost.numpy()
                logger.info(
                    "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, "
                    " ppl: %f " %
                    (step_idx, pass_id, batch_id, total_avg_cost,
                        np.exp([min(total_avg_cost, 100)])))

            if (step_idx + 1) % args.save_step == 0:
                # Validation
                transformer.eval()
                total_sum_cost = 0
                total_token_num = 0
                with paddle.no_grad():
                    for input_data in eval_loader:
                        (src_word, trg_word, lbl_word) = input_data
                        logits = transformer(
                            src_word=src_word, trg_word=trg_word)
                        sum_cost, avg_cost, token_num = criterion(logits,
                                                                  lbl_word)
                        total_sum_cost += sum_cost.numpy()
                        total_token_num += token_num.numpy()
                        total_avg_cost = total_sum_cost / total_token_num
                    logger.info("validation, step_idx: %d, avg loss: %f, "
                                " ppl: %f" %
                                (step_idx, total_avg_cost,
                                 np.exp([min(total_avg_cost, 100)])))
                transformer.train()

                if args.save_model:
                    model_dir = os.path.join(args.save_model,
                                             "step_" + str(step_idx))
                    if not os.path.exists(model_dir):
                        os.makedirs(model_dir)
                    paddle.save(transformer.state_dict(),
                                os.path.join(model_dir, "transformer.pdparams"))
                    paddle.save(optimizer.state_dict(),
                                os.path.join(model_dir, "transformer.pdopt"))
            batch_id += 1
            step_idx += 1
            scheduler.step()


    if args.save_model:
        model_dir = os.path.join(args.save_model, "step_final")
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        paddle.save(transformer.state_dict(),
                    os.path.join(model_dir, "transformer.pdparams"))
        paddle.save(optimizer.state_dict(),
                    os.path.join(model_dir, "transformer.pdopt"))
# Load the training configuration from the yaml file
yaml_file = 'transformer.base.yaml'
with open(yaml_file, 'rt') as f:
    args = AttrDict(yaml.safe_load(f))
    pprint(args)
{'batch_size': 50,
 'beam_size': 5,
 'beta1': 0.9,
 'beta2': 0.997,
 'bos_idx': 0,
 'd_inner_hid': 2048,
 'd_model': 512,
 'dropout': 0.1,
 'eos_idx': 1,
 'epoch': 1,
 'eps': '1e-9',
 'infer_batch_size': 50,
 'init_from_params': 'trained_models/CWMT2021_step_345000/',
 'label_smooth_eps': 0.1,
 'learning_rate': 2.0,
 'max_length': 256,
 'max_out_len': 256,
 'n_best': 1,
 'n_head': 8,
 'n_layer': 6,
 'output_file': 'train_dev_test/predict.txt',
 'pad_factor': 8,
 'predict_file': 'train_dev_test/ccmt2019-news.zh2en.source_bpe',
 'print_step': 10,
 'random_seed': 'None',
 'save_model': 'trained_models',
 'save_step': 20,
 'special_token': ['<s>', '<e>', '<unk>'],
 'src_vocab_fpath': 'train_dev_test/vocab.ch.src',
 'src_vocab_size': 10000,
 'training_file': 'train_dev_test/train.ch.bpe,train_dev_test/train.en.bpe',
 'trg_vocab_fpath': 'train_dev_test/vocab.en.tgt',
 'trg_vocab_size': 10000,
 'unk_idx': 2,
 'use_gpu': True,
 'validation_file': 'train_dev_test/dev.ch.bpe,train_dev_test/dev.en.bpe',
 'warmup_steps': 8000,
 'weight_sharing': False}
do_train(args)
[2021-10-20 18:45:23,800] [    INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.526473,  ppl: 37289.726562 
[2021-10-20 18:45:24,991] [    INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.517828,  ppl: 36968.742188 
[2021-10-20 18:45:26,296] [    INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.475711,  ppl: 35444.054688 
[2021-10-20 18:45:26,404] [    INFO] - validation, step_idx: 19, avg loss: 10.480215,  ppl: 35604.062500

5. Prediction and Evaluation

The quality of the trained model is usually measured on a test set; in machine translation, the standard metric is the BLEU score.


Figure 11: Prediction and evaluation
def do_predict(args):
    if args.use_gpu:
        place = "gpu"
    else:
        place = "cpu"
    paddle.set_device(place)

    # Define data loader
    test_loader, to_tokens = create_infer_loader(args)

    # Define model
    transformer = InferTransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        num_encoder_layers=args.n_layer,
        num_decoder_layers=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx,
        beam_size=args.beam_size,
        max_out_len=args.max_out_len)

    # Load the trained model
    assert args.init_from_params, (
        "Please set init_from_params to load the infer model.")

    model_dict = paddle.load(
        os.path.join(args.init_from_params, "transformer.pdparams"))

    # To avoid a longer length than training, reset the size of position
    # encoding to max_length
    model_dict["encoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    model_dict["decoder.pos_encoder.weight"] = position_encoding_init(
        args.max_length + 1, args.d_model)
    transformer.load_dict(model_dict)

    # Set evaluate mode
    transformer.eval()

    f = open(args.output_file, "w")
    with paddle.no_grad():
        for (src_word, ) in test_loader:
            finished_seq = transformer(src_word=src_word)
            finished_seq = finished_seq.numpy().transpose([0, 2, 1])
            for ins in finished_seq:
                for beam_idx, beam in enumerate(ins):
                    if beam_idx >= args.n_best:
                        break
                    id_list = post_process_seq(beam, args.bos_idx, args.eos_idx)
                    word_list = to_tokens(id_list)
                    sequence = " ".join(word_list) + "\n"
                    f.write(sequence)
    f.close()
do_predict(args)

Model Evaluation

Each line of the prediction output is the highest-scoring translation of the corresponding input line. Since the data uses BPE, the predictions are also in BPE form and must be restored to the original (i.e. tokenized) form before they can be evaluated correctly.

# Restore the predictions in predict.txt to tokenized text (undo the BPE segmentation)
! sed -r 's/(@@ )|(@@ ?$)//g' train_dev_test/predict.txt > train_dev_test/predict.tok.txt
# The BLEU evaluation tool comes from https://github.com/moses-smt/mosesdecoder.git
! tar -zxf mosesdecoder.tar.gz
# Compute multi-bleu
! perl mosesdecoder/scripts/generic/multi-bleu.perl train_dev_test/ccmt2019-news.zh2en.ref*.txt < train_dev_test/predict.tok.txt
BLEU = 38.11, 74.5/49.1/32.5/21.7 (BP=0.951, ratio=0.952, hyp_len=22252, ref_len=23371)
It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.
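
As a hedged sanity check of how multi-bleu.perl assembles the score printed above: BLEU is the geometric mean of the 1- to 4-gram precisions multiplied by the brevity penalty.

import math

precisions = [0.745, 0.491, 0.325, 0.217]      # 74.5/49.1/32.5/21.7 from the output above
hyp_len, ref_len = 22252, 23371
bp = math.exp(1 - ref_len / hyp_len) if hyp_len < ref_len else 1.0
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / 4)
print(round(bp, 3), round(100 * bleu, 2))      # 0.951 38.11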