【PaddleNLP】Qianyan Datasets: Sentiment Analysis with SKEP

This project uses the pretrained SKEP model to tackle the Qianyan sentiment analysis competition,
which covers three subtasks: sentence-level sentiment classification, aspect-level (target-level) sentiment classification, and opinion target extraction.
Likes, forks, and follows are welcome!

Major update!!!

2021/6/24   V1.5   Added 👉 fusion of Ernie_gram and SKEP predictions (opinion target extraction task) 👈   with a noticeable improvement!!!

If you need the Ernie_gram opinion-extraction tutorial, see 江流's project 【paddlenlp】千言数据集:情感分析.
Don't forget to give that project a star and a fork as well!!!

Preface

Hi everyone!
This competition is the final assignment of the PaddleNLP camp. The code below uses what was covered in the course to implement the sentiment analysis tasks.




Results first:
(leaderboard screenshot)




I'm a complete beginner with no machine-learning background, so there is still plenty of room for better tuning and data processing.

I strongly recommend studying the reference source code (see the SKEP implementation link further below).

A nice find that I have not used yet: VisualDL 2.2 supports hyperparameter visualization.
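
Purely as a pointer (not used in this project), here is a minimal sketch of logging a run's hyperparameters and a metric, assuming VisualDL >= 2.2 and its LogWriter.add_hparams / add_scalar API; the values and tag names are made up:

# Minimal VisualDL sketch (assumption: VisualDL >= 2.2 installed; values and tags are made up)
from visualdl import LogWriter

with LogWriter(logdir="./visualdl_log") as writer:
    # record this run's hyperparameter combination
    writer.add_hparams(
        hparams_dict={"learning_rate": 2e-5, "batch_size": 32, "max_seq_length": 256},
        metrics_list=["hparam/dev_acc"])
    # log the metric under the same tag from inside the training/eval loop
    writer.add_scalar(tag="hparam/dev_acc", step=100, value=0.93)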

Dataset preparation

Put all the dataset zip archives in the work folder.

# Unzip the datasets into the data folder. Note: run this every time the project is opened; no need to re-run after a kernel restart.
!unzip -q work/ChnSentiCorp.zip -d data
!unzip -q work/NLPCC14-SC.zip -d data
!unzip -q work/SE-ABSA16_CAME.zip -d data
!unzip -q work/SE-ABSA16_PHNS.zip -d data
!unzip -q work/COTE-BD.zip -d data
!unzip -q work/COTE-DP.zip -d data
!unzip -q work/COTE-MFW.zip -d data
# upgrade paddlenlp
!pip install --upgrade paddlenlp -i https://pypi.org/simple 

I. Sentence-Level Sentiment Analysis

Classify the sentiment polarity of a given piece of text; commonly used for movie-review analysis, forum opinion monitoring, and similar scenarios.

Data loading: sentence-level datasets

Two helpers are defined:

  • load_ds: loads the train/dev sets, and can optionally split a dev set off the training set
  • load_test: loads the test set
import os
import random
from paddlenlp.datasets import MapDataset

# for train and dev sets
def load_ds(datafiles, split_train=False, dev_size=0):
    '''
    input:
        datafiles -- str or list[str] -- the path of train or dev sets
        split_train -- Boolean -- split from train or not
        dev_size -- int -- split how much data from train 

    output:
        MapDataset
    '''

    datas = []

    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            next(fp)  # Skip header
            for line in fp.readlines():
                data = line[:-1].split('\t')
                if len(data)==2:
                    yield ({'text':data[1], 'label':int(data[0])})
                elif len(data)==3:
                    yield ({'text':data[2], 'label':int(data[1])})
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)
    
    # split a dev set out of the training set
    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        

    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]



def load_test(datafile):
    '''
    input:
        datafile -- str -- the path of test set 

    output:
        MapDataset
    '''
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                if i==0:
                    continue
                data = line[:-1].split('\t')
                yield {'text':data[1], 'label':'', 'qid':data[0]}

    return MapDataset(list(read(datafile)))

1. ChnSentiCorp

Run only one dataset at a time!!!
Hyperparameters used here: batch_size 32, max_seq_length 256, learning_rate 2e-5, 6 epochs.

# Using the official load_dataset directly would be more convenient,
# but this project splits dev from train itself and keeps a uniform format, so it uses its own load functions.

# from paddlenlp.datasets import load_dataset
# train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
train_ds, dev_ds= load_ds(datafiles=['./data/ChnSentiCorp/train.tsv', './data/ChnSentiCorp/dev.tsv'])
print(train_ds[0])
print(dev_ds[0])
print(type(train_ds[0]))

test_ds = load_test(datafile='./data/ChnSentiCorp/test.tsv')
print(test_ds[0])

2. NLPCC14-SC

train_ds, dev_ds = load_ds(datafiles='./data/NLPCC14-SC/train.tsv', split_train=True, dev_size=1000)
print(train_ds[0])
print(dev_ds[0])
test_ds = load_test(datafile='./data/NLPCC14-SC/test.tsv')
print(test_ds[0])

Building the SKEP model

PaddleNLP already ships the pretrained SKEP model, which can be loaded with a single line of code.

The sentence-level model fine-tunes SKEP with the standard text-classification head SkepForSequenceClassification: SKEP first extracts the semantic features of the sentence, and a classifier then predicts the sentiment from those features.

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two classes, 0 (negative) and 1 (positive)
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=2)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

SkepForSequenceClassification can be used for both sentence-level and target-level sentiment analysis. It obtains a representation of the input text from the pretrained SKEP model and then classifies that representation.

  • pretrained_model_name_or_path: the model name. Supports "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en".

    • "skep_ernie_1.0_large_ch": the Chinese SKEP model, obtained by continuing pretraining on large-scale Chinese data on top of ernie_1.0_large_ch;
    • "skep_ernie_2.0_large_en": the English SKEP model, obtained by continuing pretraining on large-scale English data on top of ernie_2.0_large_en;
  • num_classes: the number of classes in the dataset.

For details of the SKEP implementation, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep

import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid


batch_size = 32
max_seq_length = 256
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Model training and evaluation (save each dataset's results separately!!!)

After defining the loss function, optimizer, and evaluation metric, training can begin.

Recommended hyperparameters:

  • max_seq_length=256
  • batch_size=32
  • learning_rate=2e-5
  • epochs=10

In practice, adjust batch_size and max_seq_length to fit your GPU memory.
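
The code in this project imports create_dataloader and evaluate from a local utils.py that is not shown in this post. Below is a minimal sketch of what these helpers typically look like, modeled on the official PaddleNLP SKEP example (an assumption; the project's actual utils.py may differ):

# Minimal sketch of the helpers imported from utils.py (an assumption; your utils.py may differ).
import numpy as np
import paddle
import paddle.nn.functional as F


def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    """Wrap a MapDataset into a paddle.io.DataLoader; shuffle only in train mode."""
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = (mode == 'train')
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset, batch_sampler=batch_sampler,
        collate_fn=batchify_fn, return_list=True)


@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    """Compute average loss and accuracy over the dev set, then switch back to train mode."""
    model.eval()
    metric.reset()
    losses = []
    for input_ids, token_type_ids, labels in data_loader:
        logits = model(input_ids, token_type_ids)
        losses.append(criterion(logits, labels).numpy())
        correct = metric.compute(F.softmax(logits, axis=1), labels)
        metric.update(correct)
    print("eval loss: %.5f, accu: %.5f" % (float(np.mean(losses)), metric.accumulate()))
    model.train()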

import time

from utils import evaluate


epochs = 10
ckpt_dir = "skep_sentence"
num_training_steps = len(train_data_loader) * epochs

# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-5,
#     parameters=model.parameters())

decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-6,
    parameters=model.parameters(),
    weight_decay=0.01, # test weight_decay
    apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
    )

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader) 
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)

Prediction (remember to update the model checkpoint path)

The trained model can then be used to predict the sentiment of new text.

import numpy as np
import paddle

batch_size=24

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# change this to the checkpoint path from your own training run
params_path = 'skep_sentence2t_weight_2e-6/model_1800_83700/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

label_map = {0: '0', 1: '1'}
results = []
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    logits = model(input_ids, token_type_ids)
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))

res_dir = "./results/2_weight_2e-6/1800_83700"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "NLPCC14-SC.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0])+"\t"+label+"\n")

II. Target-Level (Aspect-Level) Sentiment Analysis

In e-commerce product analysis, besides the overall sentiment toward a product, we often drill down to specific "aspects" of the product as the subject of sentiment analysis (aspect-level), for example:

  • 这个薯片口味有点咸,太辣了,不过口感很脆。 (These chips taste a bit salty and too spicy, but the texture is nice and crispy.)

The taste aspect of the chips receives a negative evaluation (salty, too spicy), while the texture aspect receives a positive one (crispy).

  • 我很喜欢夏威夷,就是这边的海鲜太贵了。 (I really like Hawaii, but the seafood here is too expensive.)

Hawaii itself receives a positive evaluation (like), while Hawaii's seafood receives a negative one (too expensive).
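
To make the pair input used in this part concrete, here is a tiny sketch (assuming the SkepTokenizer from Part I is already loaded; the aspect string is a made-up SE-ABSA16-style example) of how an aspect and its sentence are packed into one sequence pair:

# Tiny sketch: aspect + sentence become "[CLS] aspect [SEP] sentence [SEP]" (aspect string is made up)
aspect = "薯片#口味"
sentence = "这个薯片口味有点咸,太辣了,不过口感很脆。"
encoded = tokenizer(text=aspect, text_pair=sentence, max_seq_len=128)
print(encoded["input_ids"])       # one id sequence covering both texts
print(encoded["token_type_ids"])  # 0s for the aspect segment, 1s for the sentence segment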

Data loading: target-level datasets

Similar to the sentence-level loaders.

import os
import random

from paddlenlp.datasets import MapDataset


def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []
    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                data = line[:-1].split('\t')
                yield ({'text':data[1], 'text_pair':data[2], 'label':int(data[0])})
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)

    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        

    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

def load_test(datafile):
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                if i==0:
                    continue
                data = line[:-1].split('\t')
                yield {'text':data[1], 'text_pair':data[2]}

    return MapDataset(list(read(datafile)))

3. SE-ABSA16_PHNS

Run only one dataset at a time!!!

train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_PHNS/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])

test_ds = load_test(datafile='./data/SE-ABSA16_PHNS/test.tsv')
print(test_ds[0])

4. SE-ABSA16_CAME

train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_CAME/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])

test_ds = load_test(datafile='./data/SE-ABSA16_CAME/test.tsv')
print(test_ds[0])

Building the SKEP model

The target-level model also uses SkepForSequenceClassification, but its input is no longer a single sentence: it is a sentence pair, where one text describes the aspect of the evaluation target and the other contains the comment about that aspect.

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two classes, 0 (negative) and 1 (positive)
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
from functools import partial
import os
import time

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader


def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    
    note: token type ids are not needed for the skep_roberta_large_ch model.


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids


max_seq_length=512
batch_size=16
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Model training and evaluation

After defining the loss function, optimizer, and evaluation metric, training can begin.

from utils import evaluate


epochs = 12
num_training_steps = len(train_data_loader) * epochs

# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-6,
#     parameters=model.parameters())

decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
    weight_decay=0.01, # test weight_decay
    apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
    )

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 50 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # this evaluate is copied from the sentence-level task above and may not be quite right here
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)

Prediction

The trained model can then be used to predict the sentiment toward each evaluation target.

@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results

label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]

batch_size=16

test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# change this to the checkpoint path from your own training run
params_path = 'skep_aspect4_weight_1e-6/model_500_73000/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

results = predict(model, test_data_loader, label_map)

res_dir = "./results/4_weight_1e-6/500_73000"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "SE-ABSA16_CAME.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx)+"\t"+label+"\n")

III. Opinion Target Extraction

Data loading: opinion-target datasets

target = [ ]  😦
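
The loader below converts each (entity, text) row into character-level BIO labels (0 = B, 1 = I, 2 = O) and drops dirty rows where the entity cannot be found in the text. A rough worked example of the resulting sample format (the review text and entity here are made up):

# Rough worked example of the format produced by load_ds below (text and entity are made up)
text = "这家店的烤鸭很好吃"
entity = "烤鸭"
start = text.index(entity)            # 4
labels = [2] * len(text)              # O everywhere
labels[start] = 0                     # B on the first character of the entity
for i in range(start + 1, start + len(entity)):
    labels[i] = 1                     # I on the remaining characters
print({"tokens": list(text), "labels": labels, "entity": entity})
# labels -> [2, 2, 2, 2, 0, 1, 2, 2, 2]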

import os
import random
from paddlenlp.datasets import MapDataset


def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []

    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                # print('1\n')
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue

                # some rows in the dataset are dirty: the entity and the text are not on the same line
                try:
                    example = [line_stripped[indice] for indice in (0,1)]
                    entity, text = example[0], example[1]
                    start_idx = text.index(entity)
                except:
                    # drop the dirty data
                    continue

                labels = [2] * len(text)
                labels[start_idx] = 0
                for idx in range(start_idx + 1, start_idx + len(entity)):
                        labels[idx] = 1 
                yield {
                        "tokens": list(text),
                        "labels": labels,
                        "entity": entity
                    }
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)


    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        
        
    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]


def load_test(datafile):
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as fp:
             # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                # print('1\n')
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue
                # example = [line_stripped[indice] for indice in field_indices]
                # example = [line_stripped[indice] for indice in (0,1)]

                # some rows in the dataset are dirty: the entity and the text are not on the same line
                try:
                    example = [line_stripped[indice] for indice in (0,1)]
                    entity, text = example[0], example[1]
                except:
                    # drop the dirty data
                    continue

                yield {"tokens": list(text)}

    return MapDataset(list(read(datafile)))

5. COTE-DP

train_ds = load_ds(datafiles='./data/COTE-DP/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-DP/test.tsv')
print(test_ds[0])

6. COTE-BD

train_ds = load_ds(datafiles='./data/COTE-BD/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-BD/test.tsv')
print(test_ds[0])

7. COTE-MFW

train_ds = load_ds(datafiles='./data/COTE-MFW/train.tsv')
print(train_ds[1])
test_ds = load_test(datafile='./data/COTE-MFW/test.tsv')
print(test_ds[0])

Building the SKEP model

Slightly different from the models above: opinion target extraction is treated as sequence labeling, so SKEP is combined with a CRF token-classification head and three BIO labels.

from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer, SkepModel
# num_classes: three classes, B/I/O
skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch')
model = SkepCrfForTokenClassification(skep, num_classes=3)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
# # The COTE_DP dataset labels with "BIO" schema.
# label_map = {label: idx for idx, label in enumerate(train_ds.label_list)}
# # `no_entity_label` represents that the token isn't an entity.
# # print(type(no_entity_label_idx))
no_entity_label_idx = 2

from functools import partial
import os
import time

import numpy as np
import paddle

import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader


def convert_example_to_feature(example,
                               tokenizer,
                               max_seq_len=512,
                               no_entity_label="O",
                               is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        no_entity_label(obj:`str`, defaults to "O"): The label represents that the token isn't an entity. 
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`list[int]`, optional): The input label if not test data.
    """
    
    tokens = example['tokens']
    labels = example['labels']
    tokenized_input = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    input_ids = tokenized_input['input_ids']
    token_type_ids = tokenized_input['token_type_ids']
    seq_len = tokenized_input['seq_len']

    if is_test:
        return input_ids, token_type_ids, seq_len
    else:
        labels = labels[:(max_seq_len - 2)]
        encoded_label = np.array([no_entity_label] + labels + [no_entity_label], dtype="int64")
        return input_ids, token_type_ids, seq_len, encoded_label


max_seq_length=256
batch_size=40
trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_len=max_seq_length,
    no_entity_label=no_entity_label_idx,
    is_test=False)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),  # sequence lens
    Pad(axis=0, pad_val=no_entity_label_idx)  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# dev_data_loader = create_dataloader(
#     dev_ds,
#     mode='dev',
#     batch_size=batch_size,
#     batchify_fn=batchify_fn,
#     trans_fn=trans_func)

Model training

import time
from utils import evaluate
from paddlenlp.metrics import ChunkEvaluator

epochs = 10
num_training_steps = len(train_data_loader) * epochs

# # test weight_decay
# decay_params = [
#         p.name for n, p in model.named_parameters()
#         if not any(nd in n for nd in ["bias", "norm"])
#     ]
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=1e-5,
#     parameters=model.parameters(),
#     weight_decay=0.01, # test weight_decay
#     apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
#     )

optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
    )


# metric = ChunkEvaluator(label_list=train_ds.label_list, suffix=True)
metric = ChunkEvaluator(label_list=['B', 'I', 'O'], suffix=True)

ckpt_dir = "skep_opinion7_1e-6"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, seq_lens, labels = batch
        loss = model(
                input_ids, token_type_ids, seq_lens=seq_lens, labels=labels)
        avg_loss = paddle.mean(loss)

        global_step += 1
        if global_step % 10 == 0:
            print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, avg_loss,
                       10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 200 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            
            file_name = os.path.join(save_dir, "model_state.pdparam")
            # Need better way to get inner model of DataParallel
            paddle.save(model.state_dict(), file_name)

Prediction

import re

def convert_example_to_feature(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask. 
    """
    tokens = example["tokens"]
    encoded_inputs = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    seq_len = encoded_inputs["seq_len"]

    return input_ids, token_type_ids, seq_len


def parse_predict_result(predictions, seq_lens, label_map):
    """
    Parses the prediction results to the label tag.
    """
    pred_tag = []
    for idx, pred in enumerate(predictions):
        seq_len = seq_lens[idx]
        # drop the "[CLS]" and "[SEP]" token
        tag = [label_map[i] for i in pred[1:seq_len - 1]]
        pred_tag.append(tag)
    return pred_tag

@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for input_ids, token_type_ids, seq_lens in data_loader:
        preds = model(input_ids, token_type_ids, seq_lens=seq_lens)
        tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map)
        results.extend(tags)
    return results

# The COTE_DP dataset labels with "BIO" schema.
label_map = {0: "B", 1: "I", 2: "O"}
# `no_entity_label` represents that the token isn't an entity. 
no_entity_label_idx = 2

batch_size=96

trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),  # sequence lens
): [data for data in fn(samples)]

test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# change this to the checkpoint path from your own training run, and make sure the parameters actually load!!!
params_path = 'skep_opinion7_1e-6/model_10200_10/model_state.pdparam'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
else:
    print("MODEL LOAD FAILURE")
    raise SystemExit("model parameters not found: " + params_path)

results = predict(model, test_data_loader, label_map)

# punctuation to strip from the extracted entities
punc = '~`!#$%^&*()_+-=|\';":/.,?><~·!@#¥%……&*()——+-=“:’;、。,?》《{}'

res_dir = "./results/7_1e-6/10200_10"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "COTE_MFW.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, example in enumerate(test_ds.data):
        tag = []
        # to find "B...I...I"
        for idx1, letter in enumerate(results[idx]):
            if letter == 'B':
                i = idx1
                try:
                    while(results[idx][i+1]=='I'):
                        i = i+1
                except:
                    pass
                tag.append(re.sub(r"[%s]+" %punc, "", "".join(example['tokens'][idx1:i+1])))
        if tag == []:
            # if no entity is found, predict '无' (none), otherwise the competition scoring breaks
            tag.append('无')
        f.write(str(idx)+"\t"+"\x01".join(tag)+"\n")

Submission

Compress the prediction files into a zip archive and submit it on the Qianyan competition website.
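
A minimal packaging sketch (the folder layout and file names here are assumptions; arrange the prediction .tsv files the way the leaderboard expects, e.g. NLPCC14-SC.tsv, SE-ABSA16_CAME.tsv, COTE_MFW.tsv, and so on):

# Minimal sketch: zip every .tsv in ./submission (a hypothetical folder you copy the final files into)
import os
import zipfile

sub_dir = "./submission"
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in sorted(os.listdir(sub_dir)):
        if name.endswith(".tsv"):
            zf.write(os.path.join(sub_dir, name), arcname=name)
print("wrote submission.zip")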

Summary

This project is the final assignment of the PaddleNLP camp. Its main shortcomings:
  • no cross-validation; a slice of the training set is simply held out as the dev set
  • no real exploratory analysis or preprocessing of the data
  • little hyperparameter tuning (no experience yet)

This is my first project built with Paddle, and also my first deep-learning project.
I learned a lot from the hands-on work and discovered how much there still is to learn. If anything in the project falls short, please point it out in the comments.
Likes, forks, and follows are welcome!

I've reached Silver level on AI Studio and lit up 4 badges; follow me and I'll follow back~ https://aistudio.baidu.com/aistudio/personalcenter/thirdview/815060
