
I. Imaging NLP: Generating Medical Imaging Diagnostic Reports

Competition page: https://www.heywhale.com/org/gaiic2023/competition/area/63fef766b4422ee27402289d/content


1. Task Background

Medical images (such as CT and MRI scans) are an important basis for diagnosis, and producing a diagnostic report from them is a key step in the diagnostic process as well as a frontier topic in medical AI research. In this track, teams are asked to generate diagnostic report text from physicians' textual descriptions of CT images (i.e., descriptions of the imaging findings).

Unlike conventional text generation tasks, medical imaging diagnostic reports are professional, precise, and discrete in content, so they call for dedicated algorithm and model design. Generated reports are evaluated and ranked according to the specified metric (see the submission & review guidelines); the best score wins.

2. Competition Data

2.1 Sample Data

This is plain-text data: physicians' imaging descriptions for several CT scans together with the corresponding diagnostic reports, with a sample size of 1, so that teams can get an intuitive feel for the competition data (the sample is only meant to build intuition about imaging descriptions and diagnostic reports; the actual training and test data do not necessarily share its characteristics or distribution). In the second round, clinical information is additionally provided as auxiliary modeling input.

Column  Type  Example
report_ID  int  1
description (input)  string, imaging description  左侧顶骨局部骨质缺如;两侧侧脑室旁见点状密度减低。右侧额部颅板下见弧形脑脊液密度影。脑室系统扩大,脑沟、裂、池增宽。中线结构无移位。双侧乳突气化差,内见密度增高。
diagnosis (output)  string, diagnostic report  左侧顶骨局部缺如,考虑术后改变;脑内散发缺血灶;右侧额部少量硬膜下积液;双侧乳突炎。
clinical (input, added in the second round)  string, clinical information  左侧硬膜下血肿术后

2.2 Training Data

De-identified imaging descriptions with the corresponding diagnostic reports. The text is de-identified at the character level, with tokens separated by spaces. The training data is for participants' model training and validation. The preliminary round uses only the imaging description to generate the diagnostic report; the second round additionally provides clinical information to enrich the modeling. Specifically:

The preliminary-round training set contains 20,000 samples;
the second-round training set contains 80,000 samples.
Training data format (columns separated by ","):

Column  Type  Example
report_ID  int  1
description  de-identified imaging description, character-level tokens separated by spaces  101 47 12 66 74 90 0 411 234 79 175
diagnosis  de-identified diagnostic report, character-level tokens separated by spaces  122 83 65 74 2 232 18 44 95
clinical (second round)  de-identified clinical information, character-level tokens separated by spaces  88 29 17 55 72

2.3 Test Data

De-identified imaging descriptions and (in the second round) clinical information, de-identified in the same way as the training data. The test data is used for model evaluation and ranking. Specifically:

The preliminary-round test set is split into A/B leaderboards of 3,000 samples each;
the second-round test set is split into A/B leaderboards of 7,500 samples each.
Test data format (columns separated by ","):

Column  Type  Example
report_ID  int  1
description  de-identified imaging description, character-level tokens separated by spaces  101 47 12 66 74 90 0 411 234 79 175
clinical (second round)  de-identified clinical information, character-level tokens separated by spaces  88 29 17 55 72

3. Data and Reference Material Downloads

Material  Link
Preliminary-round training set  Download: https://www.heywhale.com/mw-org/gaiic2023/dataset/6412e32536218140148faccb
Preliminary-round A-leaderboard test set  Download: https://www.heywhale.com/mw-org/gaiic2023/dataset/6412e32536218140148faccb
Preliminary-round B-leaderboard test set  To be announced
Second-round training set  To be announced
Second-round A-leaderboard test set  Not released to participants; the system provides the data path only during model inference.
Second-round B-leaderboard test set  Not released to participants; the system provides the data path only during model inference.
baseline  Download: https://www.heywhale.com/u/775890

II. Generating Diagnostic Reports with Google's PEGASUS Model

0. Introduction

0.1 A Brief Note on Competition Experience

The NLP task in this competition is to generate a diagnostic report from an imaging description; like machine translation and text summarization, it is a seq2seq task. The official baseline hand-builds a Transformer with an encoder and a decoder and scores about 1.8 online. Using an already pretrained model such as BART, T5, or PEGASUS improves the online result: this baseline scores roughly 2.65, and hopefully someone will publish a topline above 3.0.

Further score-boosting tricks can be found in the weekly winners' experience-sharing posts; the main points are summarized below:

  1. Pretraining. The de-identified data is effectively a brand-new language, so pretraining lets the model get familiar with it. For the rationale, see the weekly winners' post https://www.heywhale.com/org/gaiic2023/competition/forum/64266c731973e8997818034b . My MLM pretraining code for PEGASUS is in the mlm-pretrain notebook: run mlm-pretrain first to generate the pre folder, then load the model from it (a minimal masking sketch is shown after this list).

  2. Fine-tuning hyperparameters, mainly including

  • Basic parameters

    • Maximum imaging-description length max_source_length = 160
    • Maximum diagnostic-report length max_target_length = 90
    • Minimum diagnostic-report length min_target_length = 0
    • Number of training epochs num_epochs = 5
    • batch_size = 16
  • Optimizer (AdamW) parameters plus the warmup and decay schedule

    • Learning-rate warmup proportion warmup_proportion = 0.02
    • Learning rate learning_rate = 1e-4
    • Total training steps num_training_steps = len(train_data_loader) * num_epochs
    • AdamW epsilon adam_epsilon = 1e-6
    • AdamW weight decay weight_decay = 0.01
  • Generation parameters (model.generate):

    • Decoding strategy decode_strategy="beam_search" (options: "greedy_search", "sampling", "beam_search")
    • Number of beams num_beams = 5
    • Length penalty length_penalty = 0.7
    • Early stopping early_stopping = True

  3. Extra tricks: pseudo-labels, FGM, EMA, SWA, data augmentation, mixed-precision training, etc. (sometimes helpful, but not every time).
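
Below is a minimal sketch of BERT-style MLM masking on the de-identified token sequences, to illustrate the pretraining idea in tip 1. The MASK_ID value and the 80/10/10 masking split are assumptions for illustration only; the actual task is implemented in the mlm-pretrain notebook.

import random

MASK_ID = 3          # hypothetical id reserved for [MASK]; the real mapping lives in mlm-pretrain
MASK_PROB = 0.15     # BERT-style masking ratio

def make_mlm_example(token_ids, vocab_size=1500):
    # Mask ~15% of the tokens. Labels keep the original id at masked positions
    # and -100 (ignored by the loss) everywhere else.
    input_ids, labels = [], []
    for tid in token_ids:
        if random.random() < MASK_PROB:
            labels.append(tid)
            r = random.random()
            if r < 0.8:                # 80%: replace with [MASK]
                input_ids.append(MASK_ID)
            elif r < 0.9:              # 10%: replace with a random token
                input_ids.append(random.randrange(10, vocab_size))
            else:                      # 10%: keep the original token
                input_ids.append(tid)
        else:
            input_ids.append(tid)
            labels.append(-100)
    return input_ids, labels

print(make_mlm_example([101, 47, 12, 66, 74, 90]))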

References:

Weekly winners' experience-sharing posts for this competition: https://www.heywhale.com/org/gaiic2023/competition/forumlist/63fef766b4422ee27402289d

[NLP] [Text generation] Text summarization with BART from Transformers: https://blog.csdn.net/bqw18744018044/article/details/127181072

PEGASUS in Paddle for Chinese text summarization, all you need: https://aistudio.baidu.com/aistudio/projectdetail/4903667?channelType=0&channel=0

FGM adversarial training in Paddle: https://aistudio.baidu.com/aistudio/projectdetail/4327353?channelType=0&channel=0

EMA (exponential moving average) in Paddle: https://aistudio.baidu.com/aistudio/projectdetail/1840154?channelType=0&channel=0

0.2 Common Architectures for seq2seq Tasks

A seq2seq task takes an input string of length N and produces an output string of length M (N -> M). It is commonly used for machine translation, text summarization, and similar tasks. Basic RNNs or LSTMs can be used, but the most popular choice today is the encoder-decoder architecture, also called the seq2seq model. A simple encoder-decoder structure is shown below, with the input and encoder on the left and the output and decoder on the right:

[Figure: a basic encoder-decoder structure]

The encoder output can also be fed into every decoder step, as shown below:

[Figure: encoder output fed into every decoder step]

In addition, seq2seq tasks usually adopt the following training and inference schemes (a minimal beam-search sketch follows this list):

  1. Training uses teacher forcing. For example, for the translation "欢迎 来到 北京 welcome to beijing", training directly splits it into the following three feature/label pairs, which are trained together:
  • 欢迎 来到 北京 -> welcome

  • 欢迎 来到 北京 welcome -> to

  • 欢迎 来到 北京 welcome to -> beijing

  2. Inference can use beam search: keep the Top-k predictions as candidates, feed each of the k candidates into the decoder to produce k times as many continuations, select the Top-k among all decoded results as the next decoder inputs, and at the final step keep only the Top-1 as the output.
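
The following is a minimal, model-free sketch of the Top-k expansion described above. Here step_fn is a hypothetical stand-in for one decoder step that returns (token, log-probability) candidates, and the ids in the toy example are arbitrary.

import math

def beam_search(step_fn, bos_id, eos_id, num_beams=5, max_len=10):
    # Keep the num_beams best partial sequences by cumulative log-probability,
    # expand each with its top candidates, then re-select the best num_beams.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:            # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq)[:num_beams]:
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:num_beams]
    return beams[0][0]                       # best full sequence at the end

# toy step function: always prefers token 7, then the end token (id 2)
toy_step = lambda seq: [(7, math.log(0.6)), (2, math.log(0.3)), (5, math.log(0.1))]
print(beam_search(toy_step, bos_id=1, eos_id=2))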

0.3 Pretrained Models and Pretraining Tasks

Bahdanau et al. (2015) introduced an attention mechanism into the encoder-decoder structure, i.e., giving each word a different weight to mark how important it is, and the Transformer (Vaswani et al., 2017) later built the architecture entirely around attention; a minimal attention sketch is given below. Since then, companies have trained many large pretrained models with different pretraining tasks and network structures, and most of them can be fine-tuned directly. Commonly used pretrained models for seq2seq tasks (machine translation, text summarization, etc.) include BART, T5, and PEGASUS.
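
As a quick illustration of what "giving each word a different weight" means, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V; the shapes and random inputs are made up for the example.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query position receives a weighted sum of the values,
    # with weights given by a softmax over query-key similarity.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

np.random.seed(0)
Q, K, V = np.random.randn(2, 4), np.random.randn(3, 4), np.random.randn(3, 4)
out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(2))   # each row sums to 1: the weight assigned to every input position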

However, this competition uses de-identified data, which amounts to a brand-new language, so running a pretraining stage before fine-tuning should help. Common pretraining tasks include MLM, DAE, and others; for a summary of the common pretraining tasks and the corresponding models, see this Zhihu paper note: https://zhuanlan.zhihu.com/p/139015428 . The main pretraining tasks are summarized in the figure below:

[Figure: summary of common pretraining tasks, from the Zhihu note above]

  1. LM (Language Modeling) is the most common pretraining task, e.g.:
  • 欢迎 来到 北京 -> welcome

  • 欢迎 来到 北京 welcome -> to

  • 欢迎 来到 北京 welcome to -> beijing


  2. MLM (Masked Language Modeling) is the pretraining task used by BERT: [MASK] tokens are inserted at random positions and predicted, so sentence information can be gathered in both directions, e.g.:
  • 欢迎 来到 [MASK] welcome to beijing -> 北京

  • 欢迎 来到 北京 [MASK] to beijing -> welcome

  • [MASK] 来到 北京 welcome to beijing -> 欢迎


  3. PLM (Permuted Language Modeling): the input sequence is randomly permuted and the [MASK] token is predicted, e.g., with welcome masked:
  • 欢迎 来到 北京 [MASK] -> welcome

  • 欢迎 来到 北京 [MASK] to -> welcome

  • 欢迎 来到 [MASK] to -> welcome


  4. DAE (Denoising Autoencoder) is the pretraining task used by BART: noise is added to the input by inserting, deleting, and reordering tokens, and the original sequence is reconstructed, e.g.:
  • 欢迎 [MASK] 北京 to welcome [MASK] -> 欢迎 来到 北京 welcome to beijing

  5. CTL (Contrastive Learning). Taking NSP (Next Sentence Prediction) as an example, positive and negative sentence pairs are sampled to build a classification task, e.g.:
  • 欢迎 来到 北京 welcome to beijing -> correct pair

  • 欢迎 来到 北京 welcome to shanghai -> incorrect

  • 欢迎 来到 北京 go to beijing -> incorrect

PEGASUS uses the following pretraining tasks (a minimal gap-sentence masking sketch follows this list):

  1. GSG (Gap Sentences Generation): the important sentences in the text are masked and then reconstructed by the decoder.
  2. A BERT-style MLM task.
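
For intuition, here is a minimal sketch of the GSG idea, not the PEGASUS implementation: score each sentence by a simple word-overlap proxy (the paper uses ROUGE against the rest of the document), mask the top-scoring sentences in the encoder input, and use them as the decoder target. The [MASK1] token and the 30% gap ratio are illustrative assumptions.

def gsg_example(sentences, gap_ratio=0.3, mask_token="[MASK1]"):
    # score a sentence by how much of it also appears in the rest of the document
    def overlap(i):
        rest = set(w for j, s in enumerate(sentences) if j != i for w in s.split())
        words = sentences[i].split()
        return sum(w in rest for w in words) / max(len(words), 1)

    n_gap = max(1, int(len(sentences) * gap_ratio))
    gap_idx = sorted(range(len(sentences)), key=overlap, reverse=True)[:n_gap]

    source = " ".join(mask_token if i in gap_idx else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gap_idx))
    return source, target

doc = ["101 47 12 66", "74 90 0 411", "101 47 234 79"]
print(gsg_example(doc))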

For details on PEGASUS, see the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization: https://arxiv.org/abs/1912.08777

1. Data Exploration

import pandas as pd
test = pd.read_csv('data/data201684/preliminary_a_test.csv',header=None,names=['yingxiang'])
train = pd.read_csv('data/data201684/train.csv',header=None,names=['yingxiang', 'zhenduan'])
print('train:',train.shape)
print('test:',test.shape)
train: (20000, 2)
test: (3000, 1)
train.head(5)
yingxiang zhenduan
0 14 108 28 30 15 13 294 29 20 18 23 21 25 32 16... 22 12 38 41 17 81 10
1 22 12 1137 41 17 16 96 17 16 34 48 17 30 40 13... 66 75 80 116 17 81 16 33 81 16 33 24 122 370 1...
2 14 108 333 30 15 13 31 29 20 829 891 21 25 11 ... 35 48 49 150 167 308 282 10
3 22 12 135 269 205 24 267 27 12 376 32 94 109 2... 14 49 123 55 86 57 54 40 138 124 26 105 133 13...
4 34 12 48 63 109 28 30 40 13 1038 52 43 23 21 5... 34 12 48 1064 86 57 54 138 10 22 12 38 41 17 8...
test.head(5)
yingxiang
0 22 12 74 71 64 56 16 248 14 40 13 83 52 43 44 ...
1 22 12 48 63 16 135 24 267 13 66 146 112 43 23 ...
2 22 12 74 71 64 11 279 288 285 56 40 13 123 55 ...
3 22 12 48 85 63 16 22 12 12 14 32 94 109 28 40 ...
4 34 12 935 1136 13 52 247 153 44 23 1006 25 11 ...

1.1 Sentence Lengths

test['len'] = test['yingxiang'].apply(lambda x:len(x.split()))
train['len'] = train['yingxiang'].apply(lambda x:len(x.split()))
train['len2'] = train['zhenduan'].apply(lambda x:len(x.split()))

Length distribution of the imaging descriptions; the train and test distributions look very similar:

train['len'].hist(),test['len'].hist()
[Histograms: description length distribution for train and test]

The imaging descriptions range from 9 to 148 tokens:

train['len'].describe()
count    20000.000000
mean        81.201050
std         24.815447
min          9.000000
25%         62.000000
50%         76.000000
75%         97.000000
max        148.000000
Name: len, dtype: float64
test['len'].describe()
count    3000.000000
mean       81.072667
std        24.596539
min        10.000000
25%        63.000000
50%        76.000000
75%        97.000000
max       146.000000
Name: len, dtype: float64

Length distribution of the diagnostic reports:

train['len2'].hist()
[Histogram: diagnostic report length distribution]

The diagnostic reports range from 2 to 79 tokens:

train['len2'].describe()
count    20000.000000
mean        25.336800
std         13.013068
min          2.000000
25%         16.000000
50%         23.000000
75%         32.000000
max         79.000000
Name: len2, dtype: float64

1.2 Token Statistics

The imaging-description token ids run from 9 to 1299, with the low ids occurring most frequently; train and test look very similar.

from collections import Counter
l = train['yingxiang'].apply(lambda x:x.split()).tolist()
l = [i for j in l for i in j]

c = Counter(l)

df = pd.DataFrame({'key':list(c.keys()),'value':list(c.values())})
df['key'] = df['key'].astype('int')
df.sort_values('key')
key value
1221 9 38
27 10 67536
21 11 63318
34 12 53630
5 13 60406
... ... ...
545 1295 74
1165 1296 88
786 1297 68
1225 1298 63
519 1299 76

1291 rows × 2 columns

from collections import Counter
l = test['yingxiang'].apply(lambda x:x.split()).tolist()
l = [i for j in l for i in j]

c = Counter(l)

df = pd.DataFrame({'key':list(c.keys()),'value':list(c.values())})
df['key'] = df['key'].astype('int')
df.sort_values('key')
key value
908 9 9
23 10 10118
18 11 9405
1 12 8066
10 13 9184
... ... ...
149 1295 14
471 1296 9
1210 1297 7
973 1298 10
1155 1299 7

1291 rows × 2 columns

The diagnostic-report token ids also run from 9 to 1299, again with the low ids occurring most frequently, much like the imaging descriptions.

from collections import Counter
l = train['zhenduan'].apply(lambda x:x.split()).tolist()
l = [i for j in l for i in j]

c = Counter(l)

df = pd.DataFrame({'key':list(c.keys()),'value':list(c.values())})
df['key'] = df['key'].astype('int')
df.sort_values('key')
key value
1002 9 27
6 10 28945
61 11 15307
1 12 18788
97 13 3455
... ... ...
1022 1295 12
1005 1296 18
1203 1297 17
1280 1298 18
1218 1299 24

1291 rows × 2 columns
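
As a quick sanity check on the claims above (a minimal sketch reusing the train and test DataFrames loaded earlier), we can measure how much of the test vocabulary and of the diagnosis vocabulary is covered by the training descriptions:

train_desc_vocab = set(w for s in train['yingxiang'] for w in s.split())
train_diag_vocab = set(w for s in train['zhenduan'] for w in s.split())
test_vocab = set(w for s in test['yingxiang'] for w in s.split())

print('train description vocab size:', len(train_desc_vocab))
print('share of test tokens seen in train descriptions:',
      len(test_vocab & train_desc_vocab) / len(test_vocab))
print('share of diagnosis tokens seen in train descriptions:',
      len(train_diag_vocab & train_desc_vocab) / len(train_diag_vocab))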

2. Load the Model and Define Parameters

  1. [SOS], [BOS], [GO]: mark the start of a sequence.
  2. [EOS]: marks the end of a sequence and is used as the stopping signal.
  3. [MASK]: used to mask some tokens in a sentence.
  4. [UNK]: unknown token, for words that are not in the vocabulary.
  5. [SEP]: separates two input sentences; e.g., for input sentences A and B, a [SEP] marker is appended after A and after B.
  6. [CLS]: placed at the beginning of a sentence to mark its start (short for classification); common in BERT-style models.
  7. [PAD]: padding token; to bring sentences to a fixed length, [PAD] tokens are appended (a minimal padding sketch follows this list).
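
Since the competition data is already de-identified into integer ids, only [PAD], [BOS] and [EOS] matter here. A minimal sketch of how a sequence is wrapped and padded (ids 0/1/2 match the values defined in the parameter cell below):

PAD, BOS, EOS = 0, 1, 2

def wrap_and_pad(token_ids, max_len):
    # add start/end markers, then pad to max_len; the attention mask is 1 on
    # real tokens and 0 on padding
    ids = [BOS] + token_ids + [EOS]
    attention_mask = [1] * len(ids)
    pad_len = max_len - len(ids)
    return ids + [PAD] * pad_len, attention_mask + [0] * pad_len

print(wrap_and_pad([101, 47, 12, 66], max_len=8))
# -> ([1, 101, 47, 12, 66, 2, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0])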

Define the relevant parameters

import time
import os
import numpy as np
from tqdm import tqdm
from functools import partial
import pandas as pd
# CiderD, the evaluation metric from the official baseline
from eval import CiderD
from visualdl import LogWriter

from datasets import load_dataset

from paddlenlp.transformers import AutoTokenizer,AutoModelForConditionalGeneration,LinearDecayWithWarmup
from paddlenlp.utils.log import logger
from paddlenlp.data import DataCollatorForSeq2Seq
import paddle
from paddle.io import BatchSampler, DistributedBatchSampler, DataLoader
from paddle import nn

# start-of-sequence token id
bos_token_id = 1
# end-of-sequence token id
eos_token_id = 2
# padding token id
pad_token_id = 0

# directory where trained model checkpoints are saved
save_dir = 'checkpoints2'
# best validation score seen so far
best_score = 0

# maximum length of the imaging description
max_source_length = 160
# maximum length of the diagnostic report
max_target_length = 90
# minimum length of the diagnostic report
min_target_length = 0

# number of training epochs
num_epochs = 5
# print a log line every log_steps steps during training
log_steps = 50
# evaluate the model every eval_steps steps during training
eval_steps = 300

# batch sizes
train_batch_size = 16 
dev_batch_size = 64
test_batch_size = 64
[2023-04-27 22:25:54,586] [ WARNING] - Detected that datasets module was imported before paddlenlp. This may cause PaddleNLP datasets to be unavalible in intranet. Please import paddlenlp before datasets module to avoid download issues
# helper that converts an id array back to a space-separated string
def array2str(arr):
    out = ''
    for i in range(len(arr)):
        # stop at the end-of-sequence or padding token
        if arr[i]==eos_token_id or arr[i]==pad_token_id:
            break
        # skip the start-of-sequence token
        if arr[i]==bos_token_id:
            continue
        out += str(int(arr[i])) + ' '
    if len(out.strip())==0:
        out = '0'
    return out.strip()

array2str([1,10,11,12,13,2,0,0,0,0,0,0,0])
'10 11 12 13'
# define the loss function
def CE(output, target):
    '''
    output: (B, L, C) logits before softmax
    target: (B, L)
    '''
    # print(target)
    # flatten the batch and length dimensions
    output = output.reshape((-1, output.shape[-1]))  # (*, C)
    target = target.reshape((-1,))  # (*)
    return nn.CrossEntropyLoss()(output, target)  # by default the loss is averaged over all B*L tokens
# load the model, tokenizer, and batchify_fn
model_name = 'IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForConditionalGeneration.from_pretrained(model_name, # if you ran pretraining, replace this with the saved folder name 'pre'
                                                            vocab_size=1500, # the vocabulary differs from the original, so the model's vocab size must be changed
                                                            bos_token_id = bos_token_id,# redefined for the de-identified data
                                                            eos_token_id = eos_token_id,# redefined
                                                            decoder_start_token_id = eos_token_id,# redefined
                                                            forced_eos_token_id = eos_token_id,# redefined
                                                            pad_token_id = pad_token_id)
batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
[2023-04-27 22:25:57,444] [    INFO] - Downloading tokenizer_config.json from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/tokenizer_config.json
100%|██████████| 2.00/2.00 [00:00<00:00, 1.38kB/s]


We use pattern recognition to recognize the Tokenizer class.


[2023-04-27 22:25:57,570] [    INFO] - We are using <class 'paddlenlp.transformers.pegasus.tokenizer.PegasusChineseTokenizer'> to load 'IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese'.
[2023-04-27 22:25:57,573] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/vocab.txt and saved to /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese
[2023-04-27 22:25:57,576] [    INFO] - Downloading vocab.txt from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/vocab.txt
100%|██████████| 365k/365k [00:00<00:00, 1.02MB/s]
[2023-04-27 22:25:58,268] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/added_tokens.json and saved to /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese
[2023-04-27 22:25:58,271] [    INFO] - Downloading added_tokens.json from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/added_tokens.json
100%|██████████| 2.00/2.00 [00:00<00:00, 1.11kB/s]
[2023-04-27 22:25:58,371] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/special_tokens_map.json and saved to /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese
[2023-04-27 22:25:58,374] [    INFO] - Downloading special_tokens_map.json from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/special_tokens_map.json
100%|██████████| 65.0/65.0 [00:00<00:00, 60.4kB/s]
[2023-04-27 22:25:58,514] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/tokenizer_config.json
[2023-04-27 22:25:58,572] [    INFO] - Downloading model_config.json from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/model_config.json
100%|██████████| 731/731 [00:00<00:00, 455kB/s]
[2023-04-27 22:25:58,738] [    INFO] - We are using <class 'paddlenlp.transformers.pegasus.modeling.PegasusForConditionalGeneration'> to load 'IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese'.
[2023-04-27 22:25:58,740] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/model_state.pdparams and saved to /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese
[2023-04-27 22:25:58,742] [    INFO] - Downloading model_state.pdparams from https://bj.bcebos.com/paddlenlp/models/community/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/model_state.pdparams
100%|██████████| 675M/675M [00:52<00:00, 13.3MB/s] 
[2023-04-27 22:26:51,878] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese/model_config.json
W0427 22:26:51.884132   188 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0427 22:26:51.888605   188 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.

3. Load the Data

The training set has 20,000 samples in total. Using roughly a 9:1 split, the last 2,000 samples are taken directly as the validation set; after a few submissions, the validation score turned out to track the online score closely.

test = pd.read_csv('data/data201684/preliminary_a_test.csv',header=None)
test[2] = '0 0'
test[[1,2]].to_csv('test.csv',header = None,index=False)

train = pd.read_csv('data/data201684/train.csv',header=None)
train[[1,2]].head(18000).to_csv('train.csv',header = None,index=False)
train[[1,2]].tail(2000).to_csv('valid.csv',header = None,index=False)

train_dataset = load_dataset("csv", data_files='train.csv',names=['yingxiang', 'zhenduan'], split="train")
dev_dataset = load_dataset("csv", data_files='valid.csv',names=['yingxiang', 'zhenduan'], split="train")
test_dataset = load_dataset("csv", data_files='test.csv',names=['yingxiang', 'zhenduan'], split="train")
Using custom data configuration default-17b80d32585c117e
Downloading and preparing dataset csv/default to /home/aistudio/.cache/huggingface/datasets/csv/default-17b80d32585c117e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Dataset csv downloaded and prepared to /home/aistudio/.cache/huggingface/datasets/csv/default-17b80d32585c117e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.
Using custom data configuration default-762584762b1087a2
Downloading and preparing dataset csv/default to /home/aistudio/.cache/huggingface/datasets/csv/default-762584762b1087a2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Dataset csv downloaded and prepared to /home/aistudio/.cache/huggingface/datasets/csv/default-762584762b1087a2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.
Using custom data configuration default-bd109b044935a003
Downloading and preparing dataset csv/default to /home/aistudio/.cache/huggingface/datasets/csv/default-bd109b044935a003/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Dataset csv downloaded and prepared to /home/aistudio/.cache/huggingface/datasets/csv/default-bd109b044935a003/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.

4. Process the Data

# data-processing function: the de-identified data is already numeric ids, so no tokenizer step is needed; we simply attach the start and end tokens to build the model inputs
def convert_example(example, text_column, summary_column):
    """
    Build the model inputs.
    """
    inputs = example[text_column].split()
    inputs = [bos_token_id]+[int(i) for i in inputs]
    inputs.append(eos_token_id)
    
    targets = example[summary_column].split()
    targets = [bos_token_id]+[int(i) for i in targets]
    targets.append(eos_token_id)
    model_inputs = {}

    model_inputs["input_ids"] = inputs
    model_inputs["attention_mask"] = [1]*len(inputs)
    model_inputs["labels"] = targets
    return model_inputs


# the original text columns need to be removed after mapping
remove_columns = ['yingxiang', 'zhenduan']
# define the conversion function
trans_func = partial(convert_example,
                     text_column='yingxiang',
                     summary_column='zhenduan')

# map the conversion over the train/dev/test datasets
train_dataset = train_dataset.map(trans_func,batched=False,# process each example individually
                                  load_from_cache_file=True,remove_columns=remove_columns)
dev_dataset = dev_dataset.map(trans_func,batched=False,
                              load_from_cache_file=True,remove_columns=remove_columns)
test_dataset = test_dataset.map(trans_func,batched=False,
                              load_from_cache_file=True,remove_columns=remove_columns)

5. Load into DataLoaders

# distributed batch sampler, used for multi-GPU distributed training
train_batch_sampler = DistributedBatchSampler(train_dataset, batch_size=train_batch_size, shuffle=True)
# build the training DataLoader
train_data_loader = DataLoader(dataset=train_dataset,batch_sampler=train_batch_sampler,
                               num_workers=0,collate_fn=batchify_fn,return_list=True)


dev_batch_sampler = BatchSampler(dev_dataset,batch_size=dev_batch_size,shuffle=False)
dev_data_loader = DataLoader(dataset=dev_dataset,batch_sampler=dev_batch_sampler,
                             num_workers=0,collate_fn=batchify_fn,return_list=True)

test_batch_sampler = BatchSampler(test_dataset,batch_size=test_batch_size,shuffle=False)
test_data_loader = DataLoader(dataset=test_dataset,batch_sampler=test_batch_sampler,
                            num_workers=0,collate_fn=batchify_fn,return_list=True)

for idx, example in enumerate(dev_data_loader):
    print(example)
    break
{'input_ids': Tensor(shape=[64, 134], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [[1  , 185, 185, ..., 0  , 0  , 0  ],
        [1  , 14 , 281, ..., 0  , 0  , 0  ],
        [1  , 34 , 12 , ..., 0  , 0  , 0  ],
        ...,
        [1  , 22 , 12 , ..., 0  , 0  , 0  ],
        [1  , 12 , 62 , ..., 0  , 0  , 0  ],
        [1  , 22 , 12 , ..., 0  , 0  , 0  ]]), 'attention_mask': Tensor(shape=[64, 134], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]]), 'labels': Tensor(shape=[64, 69], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [[ 1  ,  22 ,  12 , ..., -100, -100, -100],
        [ 1  ,  34 ,  12 , ..., -100, -100, -100],
        [ 1  ,  34 ,  12 , ...,  282,  10 ,  2  ],
        ...,
        [ 1  ,  14 ,  30 , ..., -100, -100, -100],
        [ 1  ,  66 ,  19 , ..., -100, -100, -100],
        [ 1  ,  75 ,  80 , ..., -100, -100, -100]]), 'decoder_input_ids': Tensor(shape=[64, 69], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [[2  , 1  , 22 , ..., 0  , 0  , 0  ],
        [2  , 1  , 34 , ..., 0  , 0  , 0  ],
        [2  , 1  , 34 , ..., 90 , 282, 10 ],
        ...,
        [2  , 1  , 14 , ..., 0  , 0  , 0  ],
        [2  , 1  , 66 , ..., 0  , 0  , 0  ],
        [2  , 1  , 75 , ..., 0  , 0  , 0  ]])}


Define the optimizer

# learning-rate warmup proportion
warmup_proportion = 0.02
# learning rate
learning_rate = 1e-4

# total number of training steps
num_training_steps = len(train_data_loader) * num_epochs
# AdamW epsilon
adam_epsilon = 1e-6
# AdamW weight decay
weight_decay=0.01


# VisualDL logger for visualization
log_writer = LogWriter('visualdl_log_dir')
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# exclude bias and LayerNorm parameters from weight decay
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    beta1=0.9,
    beta2=0.999,
    epsilon=adam_epsilon,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)

Define the model evaluation function

# model evaluation function
@paddle.no_grad()
def evaluate(model, data_loader, tokenizer, min_target_length,max_target_length):
    model.eval()
    model = model._layers if isinstance(model, paddle.DataParallel) else model

    res, gts = [], {}
    tot = 0
    for batch in tqdm(data_loader):
        targets = batch['labels']
        pred = model.generate(input_ids=batch['input_ids'],
                               attention_mask=batch['attention_mask'],
                               min_length=min_target_length,
                               max_length=max_target_length,
                               use_cache=True,
                               length_penalty=0.7,
                               decode_strategy='beam_search',
                               num_beams=5,
                               early_stopping=True)[0]
        pred = pred.cpu().numpy()
        #print(pred.shape)
        for i in range(pred.shape[0]):
            res.append({'image_id':tot, 'caption': [array2str(pred[i])]})
            gts[tot] = [array2str(targets[i])]
            tot += 1

    CiderD_scorer = CiderD(df='corpus', sigma=15)
    cider_score, cider_scores = CiderD_scorer.compute_score(gts, res)
    print('cid',cider_score)

    return cider_score
# FGM adversarial training
class FGM:
    def __init__(self, model, eps=1.):
        self.model = (model.module if hasattr(model, "module") else model)
        self.eps = eps
        self.backup = {}

    # only attack the embedding parameters
    def attack(self, emb_name='embedding'):
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = paddle.norm(param.grad)
                if norm and not paddle.isnan(norm):
                    r_at = self.eps * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='embedding'):
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]

        self.backup = {}

# initialize FGM
fgm = FGM(model)
# EMA definition; tried in this competition but it did not help
# class ExponentialMovingAverage():
#     def __init__(self, model, decay, thres_steps=True):
#         self._model = model
#         self._decay = decay
#         self._thres_steps = thres_steps
#         self._shadow = {}
#         self._backup = {}

#     def register(self):
#         self._update_step = 0
#         for name, param in self._model.named_parameters():
#             if param.stop_gradient is False:   # only track trainable parameters; BN running mean/variance have stop_gradient=True by default, so they are not recorded
#                 self._shadow[name] = param.numpy().copy()

#     def update(self):
#         for name, param in self._model.named_parameters():
#             if param.stop_gradient is False:
#                 assert name in self._shadow
#                 new_val = np.array(param.numpy().copy())
#                 old_val = np.array(self._shadow[name])
#                 decay = min(self._decay, (1 + self._update_step) / (10 + self._update_step)) if self._thres_steps else self._decay
#                 new_average = decay * old_val + (1 - decay) * new_val
#                 self._shadow[name] = new_average
#         self._update_step += 1
#         return decay

#     def apply(self):
#         for name, param in self._model.named_parameters():
#             if param.stop_gradient is False:
#                 assert name in self._shadow
#                 self._backup[name] = np.array(param.numpy().copy())
#                 param.set_value(np.array(self._shadow[name]))

#     def restore(self):
#         for name, param in self._model.named_parameters():
#             if param.stop_gradient is False:
#                 assert name in self._backup
#                 param.set_value(self._backup[name])
#         self._backup = {}

# ema = ExponentialMovingAverage(model, 0.9998)
# ema.register()
# if you already have trained model weights, you can load them directly
# state_dict = paddle.load('checkpoints2/model_state.pdparams')
# model.set_dict(state_dict)

6. Model Training

global_step = 0

tic_train0 = time.time()
tic_train = time.time()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_data_loader):
        global_step += 1
        # forward pass, compute the loss
        lm_logits, _, loss = model(**batch)
        loss.backward()

        fgm.attack() # add the adversarial perturbation to the embeddings
        lm_logits, _, loss = model(**batch)
        loss.backward()
        fgm.restore() # restore the original embedding parameters

        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        # ema.update()
        
        if global_step % log_steps == 0:
            logger.info('global step {}/{}, epoch: {}, loss: {}, lr: {}, speed: {} s/step, already: {}min, remain: {}min'.format(
                        global_step,
                        num_training_steps,
                        epoch,
                        round(loss.item(),10), # current loss
                        round(optimizer.get_lr(),10), # current learning rate
                        round((time.time() - tic_train)/log_steps,3), # seconds per step
                        round((time.time() - tic_train0)/60,3), # elapsed minutes
                        round((num_training_steps-global_step)/log_steps*(time.time() - tic_train)/60,3))) # estimated remaining minutes
            tic_train = time.time()
        if global_step % eval_steps== 0 and global_step >= 0:
            tic_eval = time.time()
            score = evaluate(model, dev_data_loader, tokenizer,min_target_length, max_target_length)
            logger.info("eval done total : %s s" % (time.time() - tic_eval))
            if best_score < score:
                best_score = score
                if paddle.distributed.get_rank() == 0:
                    if not os.path.exists(save_dir):
                        os.makedirs(save_dir)
                    # Need better way to get inner model of DataParallel
                    model_to_save = model._layers if isinstance(
                        model, paddle.DataParallel) else model
                    model_to_save.save_pretrained(save_dir)
                    tokenizer.save_pretrained(save_dir)

7. Model Validation

# model evaluation function
@paddle.no_grad()
def evaluate(model, data_loader, tokenizer, min_target_length,max_target_length):
    model.eval()
    model = model._layers if isinstance(model, paddle.DataParallel) else model

    res, gts = [], {}
    tot = 0
    for batch in tqdm(data_loader):
        targets = batch['labels']
        pred = model.generate(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'],
                            min_length=min_target_length,
                            max_length=max_target_length,
                            use_cache=True,
                            length_penalty=0.7,
                            decode_strategy='beam_search',
                            num_beams=5,
                            early_stopping=True)[0]
        pred = pred.cpu().numpy()
        for i in range(pred.shape[0]):
            res.append({'image_id':tot, 'caption': [array2str(pred[i])]})
            gts[tot] = [array2str(targets[i])]
            tot += 1

    CiderD_scorer = CiderD(df='corpus', sigma=15)
    cider_score, cider_scores = CiderD_scorer.compute_score(gts, res)
    print('cid',cider_score)

    return cider_score
# ema.apply()
evaluate(model, dev_data_loader, tokenizer,min_target_length, max_target_length)
# state_dict = model.state_dict()
# paddle.save(state_dict, "paddle_dy_ema.pdparams")

8. Model Prediction

# state_dict = paddle.load('checkpoints/model_state.pdparams')
# model.set_dict(state_dict)
# model prediction function
@paddle.no_grad()
def pre(model, data_loader, tokenizer, min_target_length,max_target_length):
    model.eval()
    all_preds = []
    model = model._layers if isinstance(model, paddle.DataParallel) else model
    for batch in tqdm(data_loader, total=len(data_loader), desc="Eval step"):
        labels = batch.pop('labels').numpy()
        # generate with the model
        preds = model.generate(input_ids=batch['input_ids'],
                               attention_mask=batch['attention_mask'],
                               min_length=min_target_length,
                               max_length=max_target_length,
                               use_cache=True,
                               length_penalty=0.7,
                               decode_strategy='beam_search',
                               num_beams=5)[0]
        # convert each generated id sequence back to a string
        for i in range(len(preds)):
            all_preds.append(array2str(preds[i]))
        # break
    # print(all_preds, all_labels)
    # CiderD_scorer = CiderD(df='corpus', sigma=15)
    # cider_score = CiderD_scorer.compute_score(all_preds, all_labels)
    # model.train()
    return all_preds
pred = pre(model, test_data_loader, tokenizer, min_target_length, max_target_length)
df = pd.DataFrame(pred)
df[1] = df.index
df.columns = ['prediction','report_ID']
df[['report_ID','prediction']].to_csv('pre.csv',index=False,header=None)
# write the prediction file, then download and submit it

This article is reposted from the original project.
