Ping An Healthcare Technology Disease Question Answering Transfer Learning (CHIP-STS)

1. Introduction

This task is Evaluation Task 2 of CHIP 2019, organized by Ping An Healthcare Technology.

Transfer learning is an important part of natural language processing. Its main goal is to improve learning on a new task by transferring knowledge from related tasks that have already been learned, thereby improving the model's ability to generalize.

The goal of this evaluation task is cross-disease transfer learning on Chinese disease question-answering data. Concretely, given question pairs drawn from 5 different diseases, the task is to decide whether the two sentences are semantically identical or similar. All of the corpus comes from real patient questions posted on the internet, which were filtered and manually annotated for intent matching.

# Install/upgrade to the latest PaddleNLP
!pip install --upgrade pip
!pip install paddlenlp --upgrade
import paddlenlp
import paddle

2. Dataset

Data description

The data consists of three files: train.csv, dev.csv and test.csv:

train.csv is the training set, containing 20,000 manually labelled disease question pairs covering 5 diseases: 10,000 pairs for diabetes and 2,500 pairs each for hypertension, hepatitis, aids and breast_cancer;

dev.csv is the validation set, containing 10,000 unlabelled disease question pairs covering the same 5 diseases, with 2,000 pairs each for diabetes, hypertension, hepatitis, aids and breast_cancer;

test.csv is the test set, containing 50,000 manually labelled disease question pairs, of which only a portion is used for evaluation.

category is the disease name of the question pair, corresponding to:

diabetes (糖尿病), hypertension (高血压), hepatitis (乙肝, hepatitis B), aids (艾滋病), breast_cancer (乳腺癌).

label indicates whether the two questions have the same meaning: 1 if they do, 0 if they do not.

Labels are provided for the training set only; the validation and test set labels are withheld.

!cd /home/aistudio/data/data107842/ && unzip 平安医疗疾病问答比赛数据集.zip 
Archive:  平安医疗疾病问答比赛数据集.zip
  inflating: train.csv               
  inflating: dev_id.csv              
  inflating: test_final.csv          

2.1 Inspect the dataset

# Read the datasets
import pandas

train = pandas.read_csv('/home/aistudio/data/data107842/train.csv', sep=',')  # labelled training data
dev = pandas.read_csv('/home/aistudio/data/data107842/dev_id.csv', sep=',')    # data to be predicted

# This project only judges whether two questions have the same meaning, so the disease category is not needed
# Drop the category column (and the dev set's id column)
del train['category']
del dev['category']
del dev['id']

# Show the first 5 training examples
train.head()
# Show training data file info
# train.info()
# Check the training label distribution; the classes are well balanced, as expected for a competition dataset
# train['label'].value_counts()
|   | question1 | question2 | label |
|---|-----------|-----------|-------|
| 0 | 艾滋病窗口期会出现腹泻症状吗 | 头疼腹泻四肢无力是不是艾滋病 | 0 |
| 1 | 由于糖尿病引起末梢神经炎,怎么根治? | 糖尿病末梢神经炎的治疗方法 | 1 |
| 2 | H型高血压,是通所说的高血脂? | 高血压引起脑出血怎么抢救治疗 | 0 |
| 3 | 糖尿病跟尿毒症有什么区别? | 糖尿病人,尿酸只有4.6是什么原因造成的? | 0 |
| 4 | 你好,我60岁,患高血压,80135,爱喝酸奶可以吗? | 高血压糖尿病人可以喝牛奶吗? | 1 |
# Text-length statistics, useful for choosing the pre-trained model's max_seq_length later
print(train['question1'].map(len).describe())
print(train['question2'].map(len).describe())
print(dev['question1'].map(len).describe())
print(dev['question2'].map(len).describe())
count    20000.000000
mean        13.052450
std          4.702489
min          2.000000
25%         10.000000
50%         12.000000
75%         15.000000
max         57.000000
Name: question1, dtype: float64
count    20000.000000
mean        13.916000
std          5.251421
min          2.000000
25%         10.000000
50%         13.000000
75%         16.000000
max         73.000000
Name: question2, dtype: float64
count    10000.000000
mean        13.352600
std          5.031442
min          2.000000
25%         10.000000
50%         12.000000
75%         15.000000
max         50.000000
Name: question1, dtype: float64
count    10000.000000
mean        14.493700
std          5.570787
min          3.000000
25%         11.000000
50%         13.000000
75%         17.000000
max         52.000000
Name: question2, dtype: float64
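Since the two questions are fed to the model as a single sequence, the combined length of a pair is what max_seq_length has to cover. A quick check of this (a sketch added here, not part of the original run; character count is used as a rough proxy for the token count of a Chinese BERT tokenizer):

# Combined pair length as a proxy for the tokenized sequence length
pair_len = train['question1'].map(len) + train['question2'].map(len)
print(pair_len.quantile([0.50, 0.95, 0.99, 1.00]))
# If even the maximum stays well below 128, the max_seq_length = 128 used later is a safe choice.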

2.2 Split into training and validation sets

# Split the data 9:1 into training and validation sets, stratified by label, so both splits keep the same
# label distribution. Splitting directly by index would also work.
from sklearn.utils import shuffle

new_train = pandas.DataFrame()  # training split
new_valid = pandas.DataFrame()  # validation split

tags = list(train.label.unique())  # all label values
# Sample within each label value
for tag in tags:
    data = train[(train['label'] == tag)]
    # Draw 10% as the validation split
    valid_sample = data.sample(int(0.1 * len(data)))
    valid_index = valid_sample.index
    # The remaining 90% becomes the training split
    all_index = data.index
    residue_index = all_index.difference(valid_index)
    residue = data.loc[residue_index]
    # Collect the sampled rows
    new_valid = pandas.concat([new_valid, valid_sample], ignore_index=True)
    new_train = pandas.concat([new_train, residue], ignore_index=True)

# Shuffle both splits
new_train = shuffle(new_train)
new_valid = shuffle(new_valid)

# Save the training and validation files; '\t' is used as the separator because some questions contain ','
new_train.to_csv('train_data.csv', sep='\t', index=False)
new_valid.to_csv('valid_data.csv', sep='\t', index=False)
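For reference, the same stratified 9:1 split can be produced more compactly with scikit-learn's train_test_split (a sketch; the loop above is what this notebook actually uses):

from sklearn.model_selection import train_test_split

# Stratify by label so both splits keep the same label distribution
alt_train, alt_valid = train_test_split(
    train, test_size=0.1, stratify=train['label'], random_state=42)
print(len(alt_train), len(alt_valid))  # expected: 18000 2000 for the 20,000-pair training file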

2.3 Load the dataset

from paddlenlp.datasets import DatasetBuilder
# Map the train/dev splits to their files and define how they are parsed
class QueryData(DatasetBuilder):
    SPLITS = {
        'train': 'train_data.csv',
        'dev': 'valid_data.csv',
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read the data file."""
        with open(filename, 'r', encoding='utf-8') as f:
            head = None
            for line in f:
                data = line.strip().split("\t")    # columns are tab-separated
                if not head:
                    # the first line is the header; skip it
                    head = data
                else:
                    question1, question2, label = data
                    yield {"question1": question1, "question2": question2, "label": label}  # format of one example

    def get_labels(self):
        return ["0", "1"]   # class labels: 0 and 1
# Dataset loading helper
def load_dataset(name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):
   
    reader_cls = QueryData
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)

    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets
# Load the training and validation splits
train_ds, dev_ds = load_dataset(splits=["train", "dev"])
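A quick sanity check of what was loaded (a sketch; it assumes the MapDataset indexing and label_list attribute of recent PaddleNLP releases):

print(len(train_ds), len(dev_ds))   # roughly 18000 and 2000 examples
print(train_ds.label_list)          # expected: ['0', '1']
print(train_ds[0])                  # a dict with question1, question2 and label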

3. Building a baseline model with PaddleNLP

3.1 Define the pre-trained model

For an introduction to BERT, see the Zhihu article 什么是bert.

More BERT pre-trained weights can be found via the BERT link.

Other pre-trained models can of course be used as well; see PaddleNLP Transformer预训练模型 for the full list (a small sketch of swapping in a different model follows the model-loading logs below).

# The pre-trained model to fine-tune
MODEL_NAME='bert-wwm-ext-chinese'
# Loading the bare BertModel is optional here; BertForSequenceClassification below loads the same backbone weights itself
bert_model = paddlenlp.transformers.BertModel.from_pretrained(MODEL_NAME)
model = paddlenlp.transformers.BertForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=2)
[2022-12-20 23:08:07,837] [    INFO] - Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

[2022-12-20 23:08:07,842] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/config.json
W1220 23:08:07.847621   261 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1220 23:08:07.851722   261 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-12-20 23:08:09,544] [    INFO] - Downloading bert-wwm-ext-chinese.pdparams from http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese.pdparams
100%|██████████| 390M/390M [00:14<00:00, 28.5MB/s] 
[2022-12-20 23:08:24,973] [    INFO] - All model checkpoint weights were used when initializing BertModel.

[2022-12-20 23:08:24,976] [    INFO] - All the weights of BertModel were initialized from the model checkpoint at bert-wwm-ext-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training.
[2022-12-20 23:08:24,981] [    INFO] - Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

[2022-12-20 23:08:24,984] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/config.json
[2022-12-20 23:08:26,234] [    INFO] - All model checkpoint weights were used when initializing BertForSequenceClassification.

[2022-12-20 23:08:26,237] [ WARNING] - Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-wwm-ext-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
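As mentioned in 3.1, other pre-trained models can be swapped in. A minimal sketch, assuming a PaddleNLP version that provides the Auto* classes and the built-in 'ernie-3.0-medium-zh' weights:

from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

# Only the model name changes; the rest of the pipeline stays the same
alt_name = 'ernie-3.0-medium-zh'
alt_model = AutoModelForSequenceClassification.from_pretrained(alt_name, num_classes=2)
alt_tokenizer = AutoTokenizer.from_pretrained(alt_name)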

3.2 Define the model's tokenizer for data processing

PaddleNLP provides a matching tokenizer for every built-in pre-trained model; passing the model name loads the corresponding tokenizer. The tokenizer converts raw input text into the numerical inputs the model can accept. (A short example of its output follows the download logs below.)

tokenizer = paddlenlp.transformers.BertTokenizer.from_pretrained(MODEL_NAME)
[2022-12-20 23:08:26,245] [    INFO] - Downloading http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese-vocab.txt and saved to /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese
[2022-12-20 23:08:26,284] [    INFO] - Downloading bert-wwm-ext-chinese-vocab.txt from http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese-vocab.txt
100%|██████████| 107k/107k [00:00<00:00, 6.18MB/s]
[2022-12-20 23:08:26,364] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/tokenizer_config.json
[2022-12-20 23:08:26,367] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/special_tokens_map.json
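As an illustration of what the tokenizer produces (a sketch using one question pair from the training data shown earlier):

sample = tokenizer(text="糖尿病跟尿毒症有什么区别?",
                   text_pair="糖尿病人,尿酸只有4.6是什么原因造成的?",
                   max_seq_len=128)
print(sample["input_ids"])       # [CLS] question1 [SEP] question2 [SEP] as vocabulary ids
print(sample["token_type_ids"])  # 0 for the first segment, 1 for the second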

3.3 Data preprocessing

import numpy as np

# Convert one example: concatenate question1 and question2 into a single string and tokenize it
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    qtconcat = example["question1"] + example["question2"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
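Note that convert_example joins the two questions into one string, so every token gets token_type_id 0. A common alternative for sentence-pair tasks is to pass the questions as a text pair, letting the segment embeddings distinguish them; a sketch (not the approach used in the rest of this notebook, and it reuses numpy imported above):

def convert_example_pair(example, tokenizer, max_seq_length=512, is_test=False):
    # Encode question1 and question2 as a text pair instead of concatenating them
    encoded_inputs = tokenizer(text=example["question1"],
                               text_pair=example["question2"],
                               max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    return input_ids, token_type_ids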

# Build a DataLoader: map the example-conversion function over the dataset and batch the results
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad

batch_size = 50
max_seq_length = 128

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
    
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
# Build the training and validation data loaders
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
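A quick check that the pipeline produces what the model expects (a sketch; it just pulls one batch and prints the tensor shapes):

for input_ids, token_type_ids, labels in train_data_loader:
    print(input_ids.shape)       # [batch_size, padded_seq_len]
    print(token_type_ids.shape)  # [batch_size, padded_seq_len]
    print(labels.shape)          # [batch_size, 1]
    break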

3.4 Set up the fine-tuning optimization strategy and the evaluation metric

LinearDecayWithWarmup creates the learning-rate schedule: during warm-up the learning rate rises linearly from 0 to the given peak value, and after the warm-up phase it decays linearly from that peak back to 0. (A short illustration of the schedule follows the optimizer setup below.)

# Hyper-parameters, loss and optimizer
from paddlenlp.transformers import LinearDecayWithWarmup

# Peak learning rate during training
learning_rate = 4e-5
# Number of training epochs
epochs = 5
# Proportion of steps used for learning-rate warm-up
warmup_proportion = 0.1
# Weight-decay coefficient, a regularization-style measure to reduce overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# Apply weight decay to every parameter except biases and LayerNorm weights
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
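To see the warm-up/decay behaviour described above, the schedule can be probed on a separate instance so the optimizer's own scheduler is left untouched (a sketch; it assumes the scheduler exposes last_lr and step() as in paddle's LRScheduler):

# Probe a copy of the schedule; last_lr holds the learning rate for the current step
probe = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
lrs = []
for _ in range(num_training_steps):
    lrs.append(probe.last_lr)
    probe.step()
print(lrs[0], max(lrs), lrs[-1])  # ~0 at the start, 4e-5 at the end of warm-up, back towards 0 at the end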

3.5 Model training and evaluation

import numpy as np
# Evaluation helper used during training
@paddle.no_grad()  # no backward pass or optimizer update happens here, so gradient tracking is disabled
def evaluate(model, criterion, metric, data_loader):
    """
    Evaluate the model on the given dataset and compute the metric.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        criterion(obj:`paddle.nn.Layer`): It can compute the loss.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
    """
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
    return np.mean(losses), accu
# Training loop
import paddle.nn.functional as F
best_loss = float('inf')  # positive infinity
best_accu=0
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0 :
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    # Evaluate on the validation set after each epoch
    loss, accu=evaluate(model, criterion, metric, dev_data_loader)
    if(best_loss>loss):
        print('best loss from {} to {}'.format(best_loss,loss))
        best_loss=loss
        # Saving uses the PaddleNLP save_pretrained API:
        # https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.transformers.model_utils.html?highlight=save_pretrained()#paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained_v2
        model.save_pretrained('./output/best_loss') 
        tokenizer.save_pretrained('./output/best_loss')

    if(best_accu<accu):
        print('best accuracy from {} to {}'.format(best_accu,accu))
        best_accu=accu
        model.save_pretrained('./output/best_accu')
        tokenizer.save_pretrained('./output/best_accu')
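After training, both saved checkpoints can be reloaded and compared on the validation split before picking one for prediction (a sketch reusing the evaluate function defined above):

for ckpt in ['./output/best_loss', './output/best_accu']:
    ckpt_model = paddlenlp.transformers.BertForSequenceClassification.from_pretrained(ckpt)
    print(ckpt)
    evaluate(ckpt_model, criterion, metric, dev_data_loader)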

4. Model prediction

# Prediction helper: convert and batch the examples, run the model, then map predicted ids to labels
def predict(model, data, tokenizer, label_map, batch_size=1):
    
    examples = []
    for text in data:
        input_ids, segment_ids = convert_example(
            text,
            tokenizer,
            max_seq_length=128,
            is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    ): fn(samples)

    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
model = paddlenlp.transformers.BertForSequenceClassification.from_pretrained('./output/best_accu')
[2022-12-20 23:14:52,001] [    INFO] - loading configuration file ./output/best_accu/config.json
[2022-12-20 23:14:52,004] [    INFO] - Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

[2022-12-20 23:14:52,007] [    INFO] - Configuration saved in ./output/best_accu/config.json
[2022-12-20 23:14:53,373] [    INFO] - All model checkpoint weights were used when initializing BertForSequenceClassification.

[2022-12-20 23:14:53,376] [    INFO] - All the weights of BertForSequenceClassification were initialized from the model checkpoint at ./output/best_accu.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
# Run prediction
label_map = {0:'0', 1:'1'}

# Convert each (question1, question2) row into the dict format expected by predict
def preprocess_prediction_data(data):
    examples = []
    for question1, question2 in data:
        examples.append({"question1": question1, "question2": question2})
    return examples

# Format the data to be predicted
data1 = list(dev.values)
examples = preprocess_prediction_data(data1)

# Predict on the dev_id data
results = predict(model, examples, tokenizer, label_map, batch_size=batch_size)

# Save the prediction results: the predicted label plus the two questions
results = pandas.DataFrame(results)
results.columns = ["label"]
results['question1'] = dev['question1']
results['question2'] = dev['question2']

results.to_csv('result.csv',index=False)
results.head()
|   | label | question1 | question2 |
|---|-------|-----------|-----------|
| 0 | 0 | 乳腺癌晚期治疗费用要多少 | 乳腺癌症做放疗一次费用要多少钱呢 |
| 1 | 1 | 得了艾滋病会被隔离么 | 艾滋病为什么不被隔离,如果病毒携带 |
| 2 | 1 | 口腔溃疡接吻会不会传染乙肝呀 | 接吻可以传染乙肝吗 |
| 3 | 1 | 高血压脑病的个案护理?我想知道高血压脑病的个案护理。高血压脑病的护理问题有哪些 | 老年人患高血压,生活中应注意哪些方面? |
| 4 | 1 | 乳腺癌晚期有什么症状 | 晚期的乳腺癌的症状是哪些呢? |

5. Summary

  • This project uses the PaddleNLP high-level API to build a text-matching (question-pair classification) model
  • It is worth trying a different pre-trained model, although the one used here already performs reasonably well

Thanks to the project mentor 张宏理 for the guidance.

This article is a repost.
Link to the original project
