『2022 CCF BDCI』- Reading Comprehension Interpretability Evaluation

1. Project Introduction

Deep learning models have achieved great success on many NLP tasks, but they are usually used as black boxes whose internal prediction mechanisms are opaque to users. As a result, their outputs are not fully trusted, which makes deployment harder, especially in sensitive domains such as healthcare and law. Moreover, when a model performs poorly or lacks robustness, the absence of insight into its inner workings makes it hard to diagnose and improve. The interpretability of deep learning models has therefore received growing attention recently, yet interpretability evaluation is still immature. This baseline provides evaluation data and metrics for the reading comprehension task, aiming to assess model interpretability.

2. Running the Baseline

Installing Dependencies

Install the required dependencies.

!pip3 install -U paddlepaddle-gpu==2.2.2
!pip3 install -U paddlenlp==2.3.0

Data Preparation

1) Model training data

We recommend training the Chinese reading comprehension model on the DuReader-robust dataset. The PaddleNLP framework downloads and caches the training dataset automatically; the default cache path is "~/.paddlenlp/datasets". To change the training data, modify DATASET_NAME as described in the "Initialization" section.
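If you want to inspect the training data first, a minimal sketch like the following loads a single split and prints one raw example (the exact fields depend on the dataset reader):

from paddlenlp.datasets import load_dataset

# Load only the training split of DuReader-robust (downloaded and cached automatically)
train_ds = load_dataset('dureader_robust', splits='train')
print(len(train_ds))    # number of training examples
print(train_ds[0])      # one raw example (question, context, answer annotations, ...)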

2) Download the pre-trained model

The baseline uses the ERNIE-3.0-base pre-trained model. The PaddleNLP framework caches the model files automatically; the default cache path is "~/.paddlenlp/models". To use a different pre-trained model, modify MODEL_NAME in the "Initialization" section.

Initialization

Initialization covers model selection and loading, the choice of the training dataset, the model save path, and the ratio of the extracted rationale length to the original text length. Adjust these settings as needed.

import sys
import json
import numpy as np
import paddle
import paddlenlp
from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer
sys.path.append("./TrustAI/")
sys.path.append("./TrustAI/tutorials/")
from mrc_utils import *

# Select pre-trained model
MODEL_NAME = "ernie-3.0-base-zh" # choose from ["ernie-1.0", "ernie-1.0-base-zh", "ernie-1.0-large-zh-cw", "ernie-2.0-base-zh", "ernie-2.0-large-zh", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-mini-zh", "ernie-3.0-micro-zh", "ernie-3.0-nano-zh"]
# Select dataset for model training
DATASET_NAME = 'dureader_robust'
# Set the path to save the trained model
MODEL_SAVE_PATH = f'save_model/{DATASET_NAME}-{MODEL_NAME}'
# Set the rationale length ratio which determines the length of the extracted rationales.
RATIONALE_RATIO = 0.096 # 0.096 for Chinese dataset, 0.102 for English dataset

# Init model and tokenizer
model = ErnieForQuestionAnswering.from_pretrained(MODEL_NAME, num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained(MODEL_NAME)
[2022-07-23 18:22:41,523] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-07-23 18:22:41,526] [    INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams
100%|██████████| 452M/452M [00:15<00:00, 29.9MB/s] 
W0723 18:22:57.509322   160 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.2
W0723 18:22:57.514458   160 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-07-23 18:23:03,314] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-07-23 18:23:03,318] [    INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt
100%|██████████| 182k/182k [00:00<00:00, 2.33MB/s]

Model Training

Here we train a reading comprehension model, using ERNIE-3.0 as an example.

from paddlenlp.datasets import load_dataset
# Hyperparameters
batch_size = 12
max_seq_length = 512
epochs = 3
warmup_proportion = 0.1
weight_decay = 0.01
doc_stride = 512
learning_rate = 1e-5

# Load dataset
train_ds, dev_ds, test_ds = load_dataset(DATASET_NAME, splits=["train", "dev", "test"])

# Start training
training_mrc_model(model, 
                tokenizer,
                train_ds, 
                dev_ds,
                batch_size=batch_size,
                epochs=epochs,
                learning_rate=learning_rate,
                warmup_proportion=warmup_proportion,
                max_seq_length=max_seq_length,
                doc_stride=doc_stride, 
                weight_decay=weight_decay,
                save_dir=MODEL_SAVE_PATH)
Training Starts:
global step 100, epoch: 1, batch: 100, loss: 5.88395
global step 200, epoch: 1, batch: 200, loss: 4.16975
global step 300, epoch: 1, batch: 300, loss: 2.80451
global step 400, epoch: 1, batch: 400, loss: 1.17513
global step 500, epoch: 1, batch: 500, loss: 1.09871
global step 600, epoch: 1, batch: 600, loss: 1.36005
global step 700, epoch: 1, batch: 700, loss: 1.37606
global step 800, epoch: 1, batch: 800, loss: 1.11144
global step 900, epoch: 1, batch: 900, loss: 1.21026
global step 1000, epoch: 1, batch: 1000, loss: 1.25298
global step 1100, epoch: 1, batch: 1100, loss: 1.61863
global step 1200, epoch: 1, batch: 1200, loss: 1.44563
global step 1300, epoch: 1, batch: 1300, loss: 1.41836
Processing example: 1000
time per 1000: 10.05942416191101
F1 on eval dataset: 79.23507158787295


[2022-07-23 17:05:42,326] [    INFO] - tokenizer config file saved in save_model/dureader_robust-ernie-3.0-base-zh/tokenizer_config.json
[2022-07-23 17:05:42,328] [    INFO] - Special tokens file saved in save_model/dureader_robust-ernie-3.0-base-zh/special_tokens_map.json


global step 1400, epoch: 2, batch: 54, loss: 0.99590
global step 1500, epoch: 2, batch: 154, loss: 1.24470
global step 1600, epoch: 2, batch: 254, loss: 1.34970
global step 1700, epoch: 2, batch: 354, loss: 1.29006
global step 1800, epoch: 2, batch: 454, loss: 1.58766
global step 1900, epoch: 2, batch: 554, loss: 1.04610
global step 2000, epoch: 2, batch: 654, loss: 1.55092
global step 2100, epoch: 2, batch: 754, loss: 0.50011
global step 2200, epoch: 2, batch: 854, loss: 1.57696
global step 2300, epoch: 2, batch: 954, loss: 1.18998
global step 2400, epoch: 2, batch: 1054, loss: 1.23870
global step 2500, epoch: 2, batch: 1154, loss: 1.45537
global step 2600, epoch: 2, batch: 1254, loss: 1.39633
Processing example: 1000
time per 1000: 10.052157878875732
F1 on eval dataset: 80.17279741185808


[2022-07-23 17:13:47,615] [    INFO] - tokenizer config file saved in save_model/dureader_robust-ernie-3.0-base-zh/tokenizer_config.json
[2022-07-23 17:13:47,618] [    INFO] - Special tokens file saved in save_model/dureader_robust-ernie-3.0-base-zh/special_tokens_map.json


global step 2700, epoch: 3, batch: 8, loss: 1.38731
global step 2800, epoch: 3, batch: 108, loss: 0.59570
global step 2900, epoch: 3, batch: 208, loss: 1.45104
global step 3000, epoch: 3, batch: 308, loss: 1.51796
global step 3100, epoch: 3, batch: 408, loss: 1.36980
global step 3200, epoch: 3, batch: 508, loss: 1.27991
global step 3300, epoch: 3, batch: 608, loss: 1.46063
global step 3400, epoch: 3, batch: 708, loss: 1.75899
global step 3500, epoch: 3, batch: 808, loss: 1.42824
global step 3600, epoch: 3, batch: 908, loss: 1.23470
global step 3700, epoch: 3, batch: 1008, loss: 1.40445
global step 3800, epoch: 3, batch: 1108, loss: 0.77757
global step 3900, epoch: 3, batch: 1208, loss: 0.38147
global step 4000, epoch: 3, batch: 1308, loss: 0.61202
Processing example: 1000
time per 1000: 10.036355972290039
F1 on eval dataset: 80.83847203293388


[2022-07-23 17:21:50,187] [    INFO] - tokenizer config file saved in save_model/dureader_robust-ernie-3.0-base-zh/tokenizer_config.json
[2022-07-23 17:21:50,191] [    INFO] - Special tokens file saved in save_model/dureader_robust-ernie-3.0-base-zh/special_tokens_map.json


best F1-score is 80.838472!
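
As an optional sanity check, not part of the baseline, you can run a quick prediction with the freshly trained model. The sketch below uses a hypothetical question/context pair and relies on ErnieForQuestionAnswering returning start and end logits:

# Optional sanity check: predict an answer span for a hypothetical example
question = "中国的首都是哪里"    # hypothetical question
context = "中国的首都是北京。"    # hypothetical context

encoded = tokenizer(question, context)
input_ids = paddle.to_tensor([encoded["input_ids"]])
token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])

model.eval()
with paddle.no_grad():
    start_logits, end_logits = model(input_ids=input_ids, token_type_ids=token_type_ids)
start = paddle.argmax(start_logits).item()
end = paddle.argmax(end_logits).item()
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print("".join(tokens[start:end + 1]))    # predicted answer span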

Obtaining Importance Scores

This step assigns an importance score to every token in the input, indicating how much that token influences the prediction. It consists of three sub-steps.

1) Load the model and the evaluation dataset

Update the model and evaluation data paths (MODEL_PATH and DATA_PATH), then load the model and the evaluation dataset. The stage-1 data contains 1,855 examples and the stage-2 data contains 4,366; please make sure the evaluation dataset is complete.

from assets.utils import load_data
from functools import partial

# Correct MODEL_PATH and DATA_PATH before executing
MODEL_PATH = MODEL_SAVE_PATH + '/model_state.pdparams'
DATA_PATH = 'mrc_interpretation.txt'

# Load the trained parameters
state_dict = paddle.load(MODEL_PATH)
model.set_dict(state_dict)

# Load test data
data_ds = DuReader().read(DATA_PATH)
data = load_data(DATA_PATH)
print("Num of data:", len(data))
Num of data: 4366
2) Data preprocessing

a) Input formatting: combine the two input texts into the format the model expects for prediction; for the ERNIE-3.0-base model the input takes the form [CLS]question[SEP]context[SEP] (see the sketch below)

b) Token offset indexing: compute, for every token, its character offsets in the original text; tokenization here covers both the model's tokenizer and the standard segmentation
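
To make the input format concrete, here is a minimal sketch with hypothetical question/context strings; it shows the [CLS]question[SEP]context[SEP] layout produced by the tokenizer, where token_type_ids distinguish the question segment (0) from the context segment (1):

# Minimal illustration of the model input format (hypothetical strings)
question = "爬行垫什么材质的好"                 # hypothetical question
context = "爬行垫一般分为XPE材质和EPE材质。"     # hypothetical context

encoded = tokenizer(question, context)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)                     # [CLS] question tokens [SEP] context tokens [SEP]
print(encoded["token_type_ids"])  # 0 for the question segment, 1 for the context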

from mrc_utils import *
from paddlenlp.data import Dict, Pad

# Hyperparameters
batch_size = 12
max_seq_length = 512
doc_stride = 512

# Prepare dataloader
test_trans_func = partial(prepare_validation_features, 
                            max_seq_length=max_seq_length, 
                            doc_stride=doc_stride,
                            tokenizer=tokenizer)
                            
data_ds.map(test_trans_func, batched=True, num_workers=4)
test_batch_sampler = paddle.io.DistributedBatchSampler(
        data_ds, batch_size=batch_size, shuffle=False)

test_batchify_fn = lambda samples, fn=Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
}): fn(samples)
test_data_loader = paddle.io.DataLoader(
    dataset=data_ds,
    batch_sampler=test_batch_sampler,
    collate_fn=test_batchify_fn,
    return_list=True)

# Get offset maps which will be used for score alignment
contexts, standard_split, ori_offset_maps, standard_split_offset_maps = pre_process(data, data_ds, tokenizer)
3) Obtain the importance scores

We provide two interpretation methods, attention-based and IG (integrated gradients); choose whichever proves more effective in your experiments.

a) Attention-based Interpreter
from trustai.interpretation.token_level import AttentionInterpreter
from assets.utils import create_dataloader_from_scratch
import paddle
# Hyperparameters
BATCH_SIZE = 64

# Init an attention interpreter and get the importance scores
att = AttentionInterpreter(model, device="gpu", predict_fn=attention_predict_fn)

# Use attention interpreter to get the importance scores for all data
interp_results = None
for batch in test_data_loader:
    if interp_results:
        interp_results += att(batch)
    else:
        interp_results = att(batch)

# Trim the output to get scores only for context
interp_results = trim_output(interp_results, data_ds, tokenizer)

# Align the results back to the standard split tokens so that they can be evaluated correctly later
align_res = att.alignment(interp_results, contexts, standard_split, standard_split_offset_maps, ori_offset_maps, special_tokens=["[CLS]", '[SEP]'])
b) IG-based Interpreter
from trustai.interpretation.token_level import IntGradInterpreter
from assets.utils import create_dataloader_from_scratch
# Hyperparameters
IG_STEP = 100
BATCH_SIZE = 64

# Init an IG interpreter
ig = IntGradInterpreter(model, predict_fn=IG_predict_fn, device="gpu")

# Use IG interpreter to get the importance scores for all data
interp_results = None
for batch in test_data_loader:
    if interp_results:
        interp_results += ig(batch, steps=IG_STEP)
    else:
        interp_results = ig(batch, steps=IG_STEP)

# Trim the output to get scores only for context
interp_results = trim_output(interp_results, data_ds, tokenizer)

# Align the results back to the standard split tokens so that they can be evaluated correctly later
align_res = ig.alignment(interp_results, contexts, standard_split, standard_split_offset_maps, ori_offset_maps, special_tokens=["[CLS]", '[SEP]'])
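
For intuition about what IG_STEP controls, here is a toy numerical sketch that is independent of TrustAI: integrated gradients averages the gradient of the output along a straight-line path from a baseline to the input, sampled at a given number of steps, and scales it by the input-baseline difference. The toy_model below is a made-up stand-in for the real model.

import numpy as np
import paddle

def toy_model(x):
    # Made-up linear "model output" used only for this illustration
    return (x * paddle.to_tensor([2.0, -1.0, 0.5])).sum()

x_np = np.array([1.0, 2.0, 3.0], dtype="float32")
baseline_np = np.zeros_like(x_np)    # all-zero baseline
steps = 100                          # plays the role of IG_STEP above

grads = []
for i in range(1, steps + 1):
    alpha = i / steps
    point = paddle.to_tensor(baseline_np + alpha * (x_np - baseline_np), stop_gradient=False)
    y = toy_model(point)
    grads.append(paddle.grad(outputs=y, inputs=point)[0].numpy())

attribution = (x_np - baseline_np) * np.mean(grads, axis=0)
print(attribution)   # per-dimension importance; sums to toy_model(x) - toy_model(baseline)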

Generating Data for Evaluation

The evaluation file must contain three tab-separated columns: id\tpredicted answer\trationale. The script below converts the model output into this format.

import math

# Re-sort the token index according to their importance scores
def resort(index_array, importance_score):
    res = sorted([[idx, importance_score[idx]] for idx in index_array], key=lambda x:x[1], reverse=True)
    res = [n[0] for n in res]
    return res

# Post-prepare the result data so that it can be used for the evaluation directly
def prepare_eval_data(data, results, paddle_model):
    res = {}
    idx = 0
    for data_id, inter_res in zip(data, results):
        
        # Importance scores (drop the first and last entries of inter_res.word_attributions)
        importance_score = np.array(inter_res.word_attributions[1:-1])
        # Extract topK importance scores
        topk = math.ceil(len(data[data_id]['sent_token'])*RATIONALE_RATIO)
        
        eval_data = {}        
        eval_data['id'] = data_id
        label = list(inter_res.pred_label)
        if label[0] >= label[1] + 1:
            # An invalid span (start after end) yields an empty answer
            eval_data['pred_label'] = ''
        else:
            eval_data['pred_label'] = ''.join(tokenizer.convert_ids_to_tokens(data_ds[idx]['input_ids'][label[0]:label[1] + 1]))
        # Find the token index of the topK importance scores
        eval_data['rationale'] = np.argpartition(importance_score, -topk)[-topk:]
        # Re-sort the token index according to their importance scores
        eval_data['rationale'] = resort(eval_data['rationale'], importance_score)

        res[data_id] = eval_data
        idx += 1
    return res

# Generate results for evaluation
predicts = prepare_eval_data(data, align_res, model)
with open('./mrc_rationale.txt', 'w') as out_file:
    for key in predicts:
        rationale = ','.join(str(idx) for idx in predicts[key]['rationale'])
        out_file.write(str(predicts[key]['id']) + '\t' + str(predicts[key]['pred_label']) + '\t' + rationale + '\n')
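
As a final optional check before submission (not required by the baseline), you can verify that every line of the generated file has the expected three tab-separated columns:

# Optional format check on the generated submission file
with open('./mrc_rationale.txt') as f:
    for line_no, line in enumerate(f, 1):
        fields = line.rstrip('\n').split('\t')
        assert len(fields) == 3, f"line {line_no}: expected 3 columns, got {len(fields)}"
print("Format check passed.")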
