[PaddleNLP] 千言 Datasets: Sentiment Analysis with SKEP
This project uses the pretrained model SKEP to tackle the 千言 sentiment analysis competition, which covers three subtasks: sentence-level sentiment classification, target-level (aspect-level) sentiment classification, and opinion target extraction.
Likes, forks, and follows are all welcome!
Major update!
2021/6/24 V1.5: added 👉 an ensemble of Ernie_gram and SKEP predictions for the opinion target extraction task 👈 with a clear boost in score!
For the Ernie_gram opinion extraction tutorial, see 江流: 【paddlenlp】千言数据集:情感分析
Don't forget to give that project a star and a fork too!
Preface
Hi everyone! This competition is the capstone assignment of the NLP camp. Below is the code I wrote for the sentiment analysis tasks using what the course covered.
Results first (the leaderboard screenshot from the original post is omitted here):
I am a beginner with no machine learning background, so there is still plenty of room for tuning and better data handling.
I strongly recommend these references:
- the SKEP example (very important!)
- load_dataset
- Consider giving the PaddleNLP repo linked above a star!
A promising find I haven't used yet: VisualDL 2.2 supports hyperparameter visualization.
Dataset preparation
Put all the dataset zip archives under the work folder.
# Unzip the datasets into the data folder. Note: rerun this each time the environment is reopened; it is not needed after a mere kernel restart.
!unzip -q work/ChnSentiCorp.zip -d data
!unzip -q work/NLPCC14-SC.zip -d data
!unzip -q work/SE-ABSA16_CAME.zip -d data
!unzip -q work/SE-ABSA16_PHNS.zip -d data
!unzip -q work/COTE-BD.zip -d data
!unzip -q work/COTE-DP.zip -d data
!unzip -q work/COTE-MFW.zip -d data
# upgrade paddlenlp
!pip install --upgrade paddlenlp -i https://pypi.org/simple
Part 1: Sentence-Level Sentiment Analysis
Classify the sentiment polarity of a given piece of text. Typical applications include movie review analysis and forum opinion monitoring.
Data loading: sentence-level datasets
Two helpers:
- load_ds: loads the train/dev sets, and can split a dev set off the train set
- load_test: loads the test set
import os
import random
from paddlenlp.datasets import MapDataset
# for train and dev sets
def load_ds(datafiles, split_train=False, dev_size=0):
    '''
    input:
        datafiles -- str or list[str] -- path(s) of the train/dev sets
        split_train -- bool -- whether to split a dev set off the train set
        dev_size -- int -- how many samples to move from train to dev
    output:
        MapDataset, or a list of MapDataset
    '''
    datas = []
    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            next(fp)  # skip the header line
            for line in fp:
                data = line.rstrip('\n').split('\t')
                if len(data) == 2:
                    yield {'text': data[1], 'label': int(data[0])}
                elif len(data) == 3:
                    yield {'text': data[2], 'label': int(data[1])}
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)
    # carve a dev set out of the train set
    def split_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for line in f:
                datas.append(line)
        datas_tmp = datas[1:]  # the header line must not be shuffled
        random.shuffle(datas_tmp)
        if not os.path.exists(os.path.dirname(train_ds) + '/tem'):
            os.mkdir(os.path.dirname(train_ds) + '/tem')
        # keep the header line in both output files
        write_tsv(os.path.dirname(train_ds) + '/tem/train.tsv', datas[0:1] + datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds) + '/tem/dev.tsv', datas[0:1] + datas_tmp[-dev_size:])
    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train set, make sure that 'datafiles' is a single train-set path (str).")
            return None
        if dev_size == 0:
            print("Please set the dev-set size, e.g. dev_size=1000.")
            return None
        split_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles) + '/tem/train.tsv', os.path.dirname(datafiles) + '/tem/dev.tsv']
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]
def load_test(datafile):
    '''
    input:
        datafile -- str -- path of the test set
    output:
        MapDataset
    '''
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            next(f)  # skip the header line
            for line in f:
                data = line.rstrip('\n').split('\t')
                yield {'text': data[1], 'label': '', 'qid': data[0]}
    return MapDataset(list(read(datafile)))
1. ChnSentiCorp
Run only one dataset at a time!
Hyperparameters used for this dataset: batch_size=32, max_seq_length=256, learning_rate=2e-5, epochs=6.
# The official load_dataset would be more convenient here,
# but this post splits dev sets off the train sets itself, so a custom load function is used for a uniform format.
# from paddlenlp.datasets import load_dataset
# train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
train_ds, dev_ds = load_ds(datafiles=['./data/ChnSentiCorp/train.tsv', './data/ChnSentiCorp/dev.tsv'])
print(train_ds[0])
print(dev_ds[0])
print(type(train_ds[0]))
test_ds = load_test(datafile='./data/ChnSentiCorp/test.tsv')
print(test_ds[0])
2. NLPCC14-SC
train_ds, dev_ds = load_ds(datafiles='./data/NLPCC14-SC/train.tsv', split_train=True, dev_size=1000)
print(train_ds[0])
print(dev_ds[0])
test_ds = load_test(datafile='./data/NLPCC14-SC/test.tsv')
print(test_ds[0])
SKEP model construction
PaddleNLP already provides the pretrained SKEP model, so loading it takes a single line of code.
The sentence-level model is SKEP fine-tuned with SkepForSequenceClassification, the standard text classification head: SKEP extracts the semantic features of the sentence, and a classifier runs on top of them.
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two classes, 0 and 1, meaning negative and positive
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=2)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
SkepForSequenceClassification can be used for both sentence-level and target-level sentiment analysis. It obtains a representation of the input text from the pretrained SKEP model, then classifies on top of that representation.
- pretrained_model_name_or_path: the model name; "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en" are supported.
  - "skep_ernie_1.0_large_ch": the Chinese model, obtained by continuing SKEP pretraining on large-scale Chinese data on top of ernie_1.0_large_ch;
  - "skep_ernie_2.0_large_en": the English model, obtained by continuing SKEP pretraining on large-scale English data on top of ernie_2.0_large_en.
- num_classes: the number of classes in the dataset.
For details of the SKEP implementation, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep
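As a quick sanity check (illustrative only: the classification head is still randomly initialized before fine-tuning, so the predicted class is meaningless at this point), a single sentence can be run through the freshly loaded model like this:
import paddle
# Hypothetical smoke test: encode one sentence and do a forward pass.
inputs = tokenizer(text="这家酒店很干净", max_seq_len=64)
input_ids = paddle.to_tensor([inputs["input_ids"]])
token_type_ids = paddle.to_tensor([inputs["token_type_ids"]])
logits = model(input_ids, token_type_ids)  # shape [1, 2]: one score per class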
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens, and
    creates a mask from the two sequences passed for sequence-pair
    classification tasks.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence       | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing the text and, if present, the label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated; sequences shorter will be padded.
        is_test(obj:`bool`, defaults to `False`): Whether the example lacks a label (test data).
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
batch_size = 32
max_seq_length = 256
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()                                            # labels
): [data for data in fn(samples)]
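To make the lambda above less opaque: Tuple applies one collate op per field, Pad pads each field to the longest sample in the batch, and Stack stacks the labels into one array. A toy illustration, assuming the pad token id is 0 (which I believe it is for this vocabulary):
# Toy check (not part of the pipeline): batch two samples of unequal length.
sample_a = ([1, 2, 3], [0, 0, 0], [1])  # (input_ids, token_type_ids, label)
sample_b = ([4, 5], [0, 0], [0])
ids, tt, labels = batchify_fn([sample_a, sample_b])
print(ids)     # [[1 2 3] [4 5 0]] -- padded to the batch max length
print(labels)  # [[1] [0]] -- stacked into one array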
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
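The create_dataloader helper lives in utils.py, which this post does not include. Here is a minimal sketch consistent with the PaddleNLP SKEP example (my reconstruction under that assumption, not necessarily the exact file used here):
import paddle
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    # apply the example-to-feature transform to every sample
    if trans_fn:
        dataset = dataset.map(trans_fn)
    # shuffle only during training
    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)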
Model training and evaluation (save each dataset's results separately!)
After defining the loss function, optimizer, and evaluation metric, training can begin.
Recommended hyperparameters:
max_seq_length=256
batch_size=32
learning_rate=2e-5
epochs=10
In practice, scale batch_size and max_seq_length to fit your GPU memory.
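One thing worth noting: the code below computes num_training_steps but never uses it. It could drive a warmup-plus-linear-decay schedule, a common fine-tuning choice; a sketch using paddlenlp's LinearDecayWithWarmup (optional, and not what the results above were produced with):
from paddlenlp.transformers import LinearDecayWithWarmup
# warm up over the first 10% of steps, then decay linearly to zero
lr_scheduler = LinearDecayWithWarmup(2e-5, num_training_steps, 0.1)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters())
# inside the training loop, call lr_scheduler.step() right after optimizer.step()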
import time
from utils import evaluate
epochs = 10
ckpt_dir = "skep_sentence"
num_training_steps = len(train_data_loader) * epochs
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-5,
#     parameters=model.parameters())
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-6,
    parameters=model.parameters(),
    weight_decay=0.01,                                 # experimenting with weight_decay
    apply_decay_param_fun=lambda x: x in decay_params  # experimenting with weight_decay
)
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)
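Like create_dataloader, the evaluate helper used in the loop above comes from utils.py, which is not shown in this post. A minimal version in the spirit of the PaddleNLP SKEP example (a reconstruction under that assumption):
import numpy as np
import paddle
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    # compute mean loss and accuracy over the dev set
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        losses.append(criterion(logits, labels).numpy())
        metric.update(metric.compute(logits, labels))
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), metric.accumulate()))
    model.train()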
Prediction (remember to update the checkpoint path!)
The fine-tuned model can now predict the sentiment of unseen text.
import numpy as np
import paddle
batch_size = 24
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack()                                            # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the checkpoint path according to your actual runs
params_path = 'skep_sentence2t_weight_2e-6/model_1800_83700/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
label_map = {0: '0', 1: '1'}
results = []
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    logits = model(input_ids, token_type_ids)
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))
res_dir = "./results/2_weight_2e-6/1800_83700"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
with open(os.path.join(res_dir, "NLPCC14-SC.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0]) + "\t" + label + "\n")
Part 2: Target-Level (Aspect-Level) Sentiment Analysis
In e-commerce product analysis, besides the overall polarity of a review, we often drill down to specific "aspects" of the product as the unit of analysis (aspect-level sentiment analysis). For example:
- 这个薯片口味有点咸,太辣了,不过口感很脆。("These chips taste a bit salty and too spicy, but the texture is nice and crisp.")
  The taste aspect gets a negative evaluation (salty, too spicy), while the texture aspect gets a positive one (very crisp).
- 我很喜欢夏威夷,就是这边的海鲜太贵了。("I really like Hawaii, but the seafood here is too expensive.")
  Hawaii overall gets a positive evaluation (like), while Hawaii's seafood gets a negative one (too expensive).
Data loading: target-level datasets
Much like the sentence-level loaders.
import os
import random
from paddlenlp.datasets import MapDataset
def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []
    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            next(fp)  # skip the header line
            for line in fp:
                data = line.rstrip('\n').split('\t')
                yield {'text': data[1], 'text_pair': data[2], 'label': int(data[0])}
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)
    # carve a dev set out of the train set
    def split_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for line in f:
                datas.append(line)
        datas_tmp = datas[1:]  # the header line must not be shuffled
        random.shuffle(datas_tmp)
        if not os.path.exists(os.path.dirname(train_ds) + '/tem'):
            os.mkdir(os.path.dirname(train_ds) + '/tem')
        # keep the header line in both output files
        write_tsv(os.path.dirname(train_ds) + '/tem/train.tsv', datas[0:1] + datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds) + '/tem/dev.tsv', datas[0:1] + datas_tmp[-dev_size:])
    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train set, make sure that 'datafiles' is a single train-set path (str).")
            return None
        if dev_size == 0:
            print("Please set the dev-set size, e.g. dev_size=100.")
            return None
        split_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles) + '/tem/train.tsv', os.path.dirname(datafiles) + '/tem/dev.tsv']
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]
def load_test(datafile):
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            next(f)  # skip the header line
            for line in f:
                data = line.rstrip('\n').split('\t')
                yield {'text': data[1], 'text_pair': data[2]}
    return MapDataset(list(read(datafile)))
3. SE-ABSA16_PHNS
Run only one dataset at a time!
train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_PHNS/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])
test_ds = load_test(datafile='./data/SE-ABSA16_PHNS/test.tsv')
print(test_ds[0])
4. SE-ABSA16_CAME
train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_CAME/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])
test_ds = load_test(datafile='./data/SE-ABSA16_CAME/test.tsv')
print(test_ds[0])
SKEP model construction
Target-level sentiment analysis also uses the SkepForSequenceClassification model, but its input is not a single sentence: it is a sentence pair, where one text names the evaluated aspect and the other carries the comment on that aspect. (The illustration from the original post is omitted here.)
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two classes, 0 and 1, meaning negative and positive
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
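To see the pair encoding concretely, here is a small illustrative check (the exact token ids depend on the vocabulary, so the numbers in the comment are only indicative):
# Illustrative: encode an (aspect, comment) pair. In token_type_ids, the
# first segment ([CLS] aspect [SEP]) is marked 0 and the second
# segment (comment [SEP]) is marked 1.
enc = tokenizer(text="口味", text_pair="这个薯片口味有点咸", max_seq_len=64)
print(enc["input_ids"])
print(enc["token_type_ids"])  # e.g. [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]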
from functools import partial
import os
import time
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens, and
    creates a mask from the two sequences passed for sequence-pair
    classification tasks.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence       | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Note: the skep_roberta_large_ch model does not need token type ids.
    Args:
        example(obj:`list[str]`): List of input data, containing the text and, if present, the label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated; sequences shorter will be padded.
        is_test(obj:`bool`, defaults to `False`): Whether the example lacks a label (test data).
        dataset_name(obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
max_seq_length = 512
batch_size = 16
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")                               # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
Model training and evaluation
After defining the loss function, optimizer, and evaluation metric, training can begin.
from utils import evaluate
epochs = 12
num_training_steps = len(train_data_loader) * epochs
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-6,
#     parameters=model.parameters())
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
    weight_decay=0.01,                                 # experimenting with weight_decay
    apply_decay_param_fun=lambda x: x in decay_params  # experimenting with weight_decay
)
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 50 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # this evaluate is copied verbatim from the sentence-level part; it may not be exactly right here
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)
Prediction
The fine-tuned model can now predict the sentiment toward each evaluation target.
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]
batch_size = 16
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the checkpoint path according to your actual runs
params_path = 'skep_aspect4_weight_1e-6/model_500_73000/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
results = predict(model, test_data_loader, label_map)
res_dir = "./results/4_weight_1e-6/500_73000"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
with open(os.path.join(res_dir, "SE-ABSA16_CAME.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx) + "\t" + label + "\n")
Part 3: Opinion Target Extraction
Data loading: target-extraction datasets
对象 = [ ] 😦
import os
import random
from paddlenlp.datasets import MapDataset
def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []
    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            next(fp)  # skip the header line
            for line in fp:
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue
                # some rows in the dataset have the entity and the text split across lines...
                try:
                    entity, text = line_stripped[0], line_stripped[1]
                    start_idx = text.index(entity)
                except (IndexError, ValueError):
                    # drop the dirty data
                    continue
                # BIO tags: B -> 0, I -> 1, O -> 2
                labels = [2] * len(text)
                labels[start_idx] = 0
                for idx in range(start_idx + 1, start_idx + len(entity)):
                    labels[idx] = 1
                yield {
                    "tokens": list(text),
                    "labels": labels,
                    "entity": entity
                }
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)
    # carve a dev set out of the train set
    def split_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for line in f:
                datas.append(line)
        datas_tmp = datas[1:]  # the header line must not be shuffled
        random.shuffle(datas_tmp)
        if not os.path.exists(os.path.dirname(train_ds) + '/tem'):
            os.mkdir(os.path.dirname(train_ds) + '/tem')
        # keep the header line in both output files
        write_tsv(os.path.dirname(train_ds) + '/tem/train.tsv', datas[0:1] + datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds) + '/tem/dev.tsv', datas[0:1] + datas_tmp[-dev_size:])
    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train set, make sure that 'datafiles' is a single train-set path (str).")
            return None
        if dev_size == 0:
            print("Please set the dev-set size, e.g. dev_size=100.")
            return None
        split_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles) + '/tem/train.tsv', os.path.dirname(datafiles) + '/tem/dev.tsv']
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]
def load_test(datafile):
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as fp:
            next(fp)  # skip the header line
            for line in fp:
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue
                # some rows have the entity and the text split across lines...
                try:
                    entity, text = line_stripped[0], line_stripped[1]
                except IndexError:
                    # drop the dirty data
                    continue
                yield {"tokens": list(text)}
    return MapDataset(list(read(datafile)))
5. COTE-DP
train_ds = load_ds(datafiles='./data/COTE-DP/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-DP/test.tsv')
print(test_ds[0])
6. COTE-BD
train_ds = load_ds(datafiles='./data/COTE-BD/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-BD/test.tsv')
print(test_ds[0])
7. COTE-MFW
train_ds = load_ds(datafiles='./data/COTE-MFW/train.tsv')
print(train_ds[1])
test_ds = load_test(datafile='./data/COTE-MFW/test.tsv')
print(test_ds[0])
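For clarity, load_ds above encodes each character with a BIO tag (B=0, I=1, O=2). A worked toy example (an illustrative sentence, not taken from the datasets):
text, entity = "我爱天安门", "天安门"
start_idx = text.index(entity)        # 2
labels = [2] * len(text)              # all "O"
labels[start_idx] = 0                 # "B" at the entity start
for i in range(start_idx + 1, start_idx + len(entity)):
    labels[i] = 1                     # "I" inside the entity
print(list(text))  # ['我', '爱', '天', '安', '门']
print(labels)      # [2, 2, 0, 1, 1]  ->  O O B I I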
SKEP model construction
This one differs a bit from the models above: it is token classification with a CRF layer on top of SKEP, rather than sequence classification.
from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer, SkepModel
# num_classes: three classes, B/I/O
skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch')
model = SkepCrfForTokenClassification(skep, num_classes=3)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
# The COTE datasets are labeled with the "BIO" scheme;
# `no_entity_label_idx` is the index of the "O" (non-entity) tag.
no_entity_label_idx = 2
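A note on how this model is called later in this post (based on the code below, not on any claim beyond it): in the training loop it is invoked with labels and the return value is treated as the CRF loss, while in predict it is invoked without labels and the return value is treated as decoded label ids. In sketch form:
# During training (see the loop below): returns the CRF loss.
# loss = model(input_ids, token_type_ids, seq_lens=seq_lens, labels=labels)
# During prediction (see predict below): returns decoded label ids per token.
# preds = model(input_ids, token_type_ids, seq_lens=seq_lens)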
from functools import partial
import os
import time
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader
def convert_example_to_feature(example,
                               tokenizer,
                               max_seq_len=512,
                               no_entity_label="O",
                               is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens, and
    creates a mask from the two sequences passed for sequence-pair
    classification tasks.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence       | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing the text and, if present, the labels.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated; sequences shorter will be padded.
        no_entity_label(obj:`str`, defaults to "O"): The label for tokens that are not part of an entity.
        is_test(obj:`bool`, defaults to `False`): Whether the example lacks labels (test data).
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
        label(obj:`list[int]`, optional): The input labels if not test data.
    """
    tokens = example['tokens']
    tokenized_input = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)
    input_ids = tokenized_input['input_ids']
    token_type_ids = tokenized_input['token_type_ids']
    seq_len = tokenized_input['seq_len']
    if is_test:
        return input_ids, token_type_ids, seq_len
    else:
        # truncate to fit, then pad with the no-entity label for [CLS] and [SEP]
        labels = example['labels'][:(max_seq_len - 2)]
        encoded_label = np.array([no_entity_label] + labels + [no_entity_label], dtype="int64")
        return input_ids, token_type_ids, seq_len, encoded_label
max_seq_length = 256
batch_size = 40
trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_len=max_seq_length,
    no_entity_label=no_entity_label_idx,
    is_test=False)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),                                       # sequence lens
    Pad(axis=0, pad_val=no_entity_label_idx)                    # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# dev_data_loader = create_dataloader(
#     dev_ds,
#     mode='dev',
#     batch_size=batch_size,
#     batchify_fn=batchify_fn,
#     trans_fn=trans_func)
Model training
import time
from utils import evaluate
from paddlenlp.metrics import ChunkEvaluator
epochs = 10
num_training_steps = len(train_data_loader) * epochs
# # experimenting with weight_decay
# decay_params = [
#     p.name for n, p in model.named_parameters()
#     if not any(nd in n for nd in ["bias", "norm"])
# ]
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=1e-5,
#     parameters=model.parameters(),
#     weight_decay=0.01,
#     apply_decay_param_fun=lambda x: x in decay_params)
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
)
# metric = ChunkEvaluator(label_list=train_ds.label_list, suffix=True)
metric = ChunkEvaluator(label_list=['B', 'I', 'O'], suffix=True)
ckpt_dir = "skep_opinion7_1e-6"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, seq_lens, labels = batch
        loss = model(
            input_ids, token_type_ids, seq_lens=seq_lens, labels=labels)
        avg_loss = paddle.mean(loss)
        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, avg_loss,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 200 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            file_name = os.path.join(save_dir, "model_state.pdparam")
            # Need a better way to get the inner model out of DataParallel
            paddle.save(model.state_dict(), file_name)
Prediction
import re
def convert_example_to_feature(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens, and
    creates a mask from the two sequences passed for sequence-pair
    classification tasks.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence       | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing the text and, if present, the label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated; sequences shorter will be padded.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
    """
    tokens = example["tokens"]
    encoded_inputs = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    seq_len = encoded_inputs["seq_len"]
    return input_ids, token_type_ids, seq_len
def parse_predict_result(predictions, seq_lens, label_map):
    """
    Parses the prediction results to the label tag.
    """
    pred_tag = []
    for idx, pred in enumerate(predictions):
        seq_len = seq_lens[idx]
        # drop the "[CLS]" and "[SEP]" tokens
        tag = [label_map[i] for i in pred[1:seq_len - 1]]
        pred_tag.append(tag)
    return pred_tag
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for input_ids, token_type_ids, seq_lens in data_loader:
        preds = model(input_ids, token_type_ids, seq_lens=seq_lens)
        tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map)
        results.extend(tags)
    return results
# The COTE datasets use the "BIO" labeling scheme.
label_map = {0: "B", 1: "I", 2: "O"}
# `no_entity_label_idx` is the index of the "O" (non-entity) tag.
no_entity_label_idx = 2
batch_size = 96
trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),                                       # sequence lens
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the checkpoint path according to your actual runs, and make sure the load actually succeeds!
params_path = 'skep_opinion7_1e-6/model_10200_10/model_state.pdparam'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
else:
    # bare `exit` (without parentheses) is a no-op, so raise instead
    raise SystemExit("MODEL LOAD FAILURE")
results = predict(model, test_data_loader, label_map)
# strip punctuation from the extracted entities
punc = '~`!#$%^&*()_+-=|\';":/.,?><~·!@#¥%……&*()——+-=“:’;、。,?》《{}'
res_dir = "./results/7_1e-6/10200_10"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
with open(os.path.join(res_dir, "COTE_MFW.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, example in enumerate(test_ds.data):
        tag = []
        # find every "B I ... I" span in the predicted tags
        for idx1, letter in enumerate(results[idx]):
            if letter == 'B':
                i = idx1
                # extend the span while the following tags are "I"
                while i + 1 < len(results[idx]) and results[idx][i + 1] == 'I':
                    i += 1
                tag.append(re.sub(r"[%s]+" % punc, "", "".join(example['tokens'][idx1:i + 1])))
        if tag == []:
            # if no entity was found, predict 无 ("none"); otherwise the
            # competition scoring script chokes on the empty prediction
            tag.append('无')
        f.write(str(idx) + "\t" + "\x01".join(tag) + "\n")
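To make the span decoding concrete, here is a small illustrative trace with made-up tokens and tags (not real model output):
tokens = list("大众点评很好用")
tags = ['B', 'I', 'I', 'I', 'O', 'O', 'O']
spans = []
i = 0
while i < len(tags):
    if tags[i] == 'B':
        j = i
        # extend the span while the following tags are "I"
        while j + 1 < len(tags) and tags[j + 1] == 'I':
            j += 1
        spans.append("".join(tokens[i:j + 1]))
        i = j
    i += 1
print(spans)  # ['大众点评']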
Submission
Compress the prediction files into a zip archive and submit it on the 千言 competition site.
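For example, something along these lines (illustrative; adjust the archive name and paths to match your actual result files and the competition's submission requirements):
# collect the prediction .tsv files into one archive
!zip -r submission.zip results/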
Summary
This project was the capstone assignment for the PaddleNLP camp, and it has the following shortcomings:
- no cross-validation: a dev set is simply carved out of the training set
- no real data analysis or preprocessing
- no tuning experience (I'm new to this)
This is my first project built with Paddle, and my first deep learning project at all. I learned a lot by doing it, and discovered how much more there is to learn. If you spot problems, please point them out in the comments.
Likes, forks, and follows are all welcome!