『行远见大』 SE ABSA16 Aspect-level Sentiment Classification

Project Overview

Aspect-level sentiment classification identifies and extracts subjective information from text (tendencies, stances, evaluations, opinions) and performs binary classification of the sentiment toward a given aspect: 1 for positive, 0 for negative. This project offers one solution approach; you are welcome to take it as a starting point and tune it further.

Competition Registration


Registration link: https://aistudio.baidu.com/aistudio/competition/detail/50

Team Introduction

Team name: 肝就完事了 ("just grind it out")

Team description: the five of us have no grind left to give

Note: this project further tunes team leader iterhui's project (基于[实践课5-情感分析baseline]优化的情感分析).


A Tribute to Open Source

Hello everyone, I am 行远见大. You are welcome to join me in building the PaddlePaddle open-source community. Sharing knowledge is a virtue; let's salute open source!

Prerequisites

Sentiment Analysis

Sentiment analysis aims to automatically identify and extract subjective information from text, such as tendencies, stances, evaluations, and opinions. It covers a variety of tasks, including sentence-level sentiment classification, aspect-level sentiment classification, opinion extraction, and emotion classification. Sentiment analysis is an important research direction in artificial intelligence with high academic value; it also has important applications, and high commercial value, in areas such as consumer decision-making, public-opinion analysis, and personalized recommendation.


Paper: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

Dataset Overview

SE ABSA16 aspect-level sentiment classification: given a text d and an aspect a described in that text, output the sentiment class s toward aspect a. The class s is usually binary (positive or negative), although some datasets use a finer-grained scheme.


Example:

Input text (d): D4外形设计真的很成功不知道楼主摸没摸过D4真机非常成功的设计本以为D3系列很难超越了但是D4的流线风格显然不比D3差在整体感上还更胜一筹。 (roughly: "The D4's exterior design really is a success; its flowing style is no worse than the D3's and even better in overall feel.")

Aspect (a): 相机外形设计 (the camera's exterior design).

Sentiment class (s): 1 for positive, 0 for negative.
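To make the input/output contract concrete, here is a minimal plain-Python sketch (with a hypothetical `make_example` helper) of how one (text, aspect) pair maps to a binary label. Note that in the TSV files below the aspect is stored as `text_a` and the review as `text_b`.

```python
# A single ABSA example: the model reads the aspect together with the full
# review text and outputs 1 (positive) or 0 (negative) for that aspect.
def make_example(text, aspect, label):
    assert label in (0, 1), "labels are binary: 1 = positive, 0 = negative"
    return {"text_a": aspect, "text_b": text, "label": label}

example = make_example(
    text="D4外形设计真的很成功...",    # the review text d (truncated here)
    aspect="camera#design_features",   # the aspect a, in the dataset's category format
    label=1,                           # the sentiment s toward that aspect
)
print(example["label"])  # 1
```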

Dataset           Train size   Test size
SE-ABSA16_CAME    1,317        505
SE-ABSA16_PHNS    1,336        529

Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/76829

Aspect-level Sentiment Classification

Environment Setup

# Import the required libraries
import paddle
import numpy as np
import pandas as pd
from paddle.static import InputSpec
from paddle.io import Dataset, DataLoader
print("This project is built on Paddle version: " + paddle.__version__)
This project is built on Paddle version: 2.1.0
# The PaddleNLP version preinstalled on AI Studio is outdated, so upgrade it first
!pip install --upgrade paddlenlp -i https://pypi.org/simple 
# Load the pretrained SKEP model
import paddlenlp
from paddlenlp.data import Pad, Stack, Tuple
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
print("This project is built on PaddleNLP version: " + paddlenlp.__version__)
This project is built on PaddleNLP version: 2.0.3

Load the Data

# Unzip the data
!unzip -o data/data76829/SE-ABSA16_CAME
!unzip -o data/data76829/SE-ABSA16_PHNS
Archive:  data/data76829/SE-ABSA16_CAME.zip
   creating: SE-ABSA16_CAME/
  inflating: SE-ABSA16_CAME/train.tsv  
   creating: __MACOSX/
   creating: __MACOSX/SE-ABSA16_CAME/
  inflating: __MACOSX/SE-ABSA16_CAME/._train.tsv  
  inflating: SE-ABSA16_CAME/License.pdf  
  inflating: __MACOSX/SE-ABSA16_CAME/._License.pdf  
  inflating: SE-ABSA16_CAME/test.tsv  
  inflating: __MACOSX/SE-ABSA16_CAME/._test.tsv  
  inflating: __MACOSX/._SE-ABSA16_CAME  
Archive:  data/data76829/SE-ABSA16_PHNS.zip
   creating: SE-ABSA16_PHNS/
  inflating: SE-ABSA16_PHNS/train.tsv  
   creating: __MACOSX/SE-ABSA16_PHNS/
  inflating: __MACOSX/SE-ABSA16_PHNS/._train.tsv  
  inflating: SE-ABSA16_PHNS/License.pdf  
  inflating: __MACOSX/SE-ABSA16_PHNS/._License.pdf  
  inflating: SE-ABSA16_PHNS/test.tsv  
  inflating: __MACOSX/SE-ABSA16_PHNS/._test.tsv  
  inflating: __MACOSX/._SE-ABSA16_PHNS  
# Remove the junk files left over from unzipping
!rm -r __MACOSX

Inspect the Data

train_data = "./SE-ABSA16_CAME/train.tsv"
train_data = pd.read_csv(train_data, header=0, sep='\t')
train_data.head(10)
   label  text_a                        text_b
0      0  camera#design_features        千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
1      0  camera#operation_performance  千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
2      0  hardware#usability            千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
3      0  software#design_features      千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
4      1  lens#general                  千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
5      0  lens#operation_performance    千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
6      0  battery#quality               千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
7      1  camera#portability            千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康...
8      1  photo#quality                 进xe之前知道有一部m8.2才万元左右,十分心动。都拿在手中了才突然醒悟,色彩fuji未必比...
9      1  photo#quality                 二选一:富士X-Pro1+18mmF2和徕卡X?玩过的小机器里面感觉成像最好的还是适马的DP...
train_data = "./SE-ABSA16_PHNS/train.tsv"
train_data = pd.read_csv(train_data, header=0, sep='\t')
train_data.head(10)
   label  text_a                              text_b
0      1  phone#design_features               今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/...
1      1  display#quality                     今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/...
2      1  ports#connectivity                  今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/...
3      1  cpu#operation_performance           今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/...
4      1  display#quality                     1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...
5      1  phone#design_features               1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...
6      1  multimedia_devices#design_features  1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...
7      1  multimedia_devices#quality          1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...
8      1  multimedia_devices#design_features  1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...
9      1  power_supply#quality                1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏...

Model Construction

# Build a dictionary of dataset splits
def open_func(file_path):
    # Skip the header row and keep only lines with at least two tab-separated fields
    with open(file_path, 'r', encoding='utf8') as f:
        return [line.strip() for line in f.readlines()[1:] if len(line.strip().split('\t')) >= 2]

data_dict = {'seabsa16phns': {'test': open_func('SE-ABSA16_PHNS/test.tsv'),
                              'train': open_func('SE-ABSA16_PHNS/train.tsv')},
             'seabsa16came': {'test': open_func('SE-ABSA16_CAME/test.tsv'),
                              'train': open_func('SE-ABSA16_CAME/train.tsv')}}
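As a quick sanity check on the row format `open_func` retains, here is a sketch parsing one hypothetical train line; the fields are recovered by index from the end of the list, exactly as the Dataset class below does.

```python
# Each retained train line is "label\ttext_a\ttext_b" (test lines start with a
# qid instead of a label); splitting on tabs recovers the three fields.
line = "1\tcamera#design_features\t这台相机的外形设计非常成功"  # hypothetical sample
samples = line.split('\t')
label, text_a, text_b = int(samples[-3]), samples[-2], samples[-1]
print(label, text_a)  # 1 camera#design_features
```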

# Define the dataset
label_list = [0, 1]

# The inputs are sentence pairs, so token_type_ids are needed
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=256, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, idx):
        # Train lines are "label\ttext_a\ttext_b"; test lines are "qid\ttext_a\ttext_b"
        samples = self._data[idx].split('\t')
        text_a = samples[-2]
        text_b = samples[-1]
        encoder_out = self._tokenizer.encode(text_a, text_b, max_seq_len=self._max_len)
        text = encoder_out['input_ids']
        token_type = encoder_out['token_type_ids']
        if self._for_test:
            return np.array(text, dtype='int64'), np.array(token_type, dtype='int64')
        label = int(samples[-3])
        return np.array(text, dtype='int64'), np.array(token_type, dtype='int64'), np.array(label, dtype='int64')

def batchify_fn(for_test=False):
    # Pad input_ids and token_type_ids to the batch max length; stack labels when training
    if for_test:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id)): [data for data in fn(samples)]
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
                                        Stack()): [data for data in fn(samples)]


def get_data_loader(data, tokenizer, batch_size=32, max_len=256, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    shuffle = not for_test
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, collate_fn=batchify_fn(for_test), shuffle=shuffle)
    return data_loader
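The `Pad`/`Stack`/`Tuple` collators above right-pad each variable-length id sequence to the longest one in the batch and stack the labels. A pure-Python sketch of the same idea (assuming pad id 0 and a simplified `(ids, label)` sample shape):

```python
# Pad every id sequence to the batch maximum, then collect the labels.
def collate(batch, pad_id=0):
    ids = [sample[0] for sample in batch]
    labels = [sample[1] for sample in batch]
    max_len = max(len(seq) for seq in ids)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in ids]
    return padded, labels

batch = [([1, 5, 7], 1), ([1, 9], 0)]
padded, labels = collate(batch)
print(padded)  # [[1, 5, 7], [1, 9, 0]]
print(labels)  # [1, 0]
```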

SE ABSA16 CAME

Training Configuration

# Model and tokenizer
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# Select the dataset
data_name = 'seabsa16came'

# Hyperparameters
epochs = 30
learning_rate = 3e-6
batch_size = 8
max_len = 256

# Prepare the model
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
label = InputSpec((-1, 1), dtype='int64', name='label')
model = paddle.Model(model, [input, token_type], [label])

# L2 regularization to mitigate overfitting
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate,
                                  weight_decay=paddle.regularizer.L2Decay(coeff=1e-5),
                                  parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])
[2021-06-23 22:25:47,645] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-23 22:25:51,477] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
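The `L2Decay(coeff=1e-5)` regularizer adds `coeff * w` to each parameter's gradient before the update, shrinking weights toward zero. A one-step sketch of that rule with plain SGD (Adam layers moment estimates on top, omitted here):

```python
# One parameter update with L2 weight decay: w <- w - lr * (grad + coeff * w)
def sgd_step_with_l2(w, grad, lr=0.1, coeff=1e-5):
    return w - lr * (grad + coeff * w)

w = 2.0
w_new = sgd_step_with_l2(w, grad=0.5)
# The decay term makes the step slightly larger than plain SGD's 2.0 - 0.1*0.5 = 1.95
print(w_new)
```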

Model Training

model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=5, save_dir='./checkpoints', log_freq=200)
step 165/165 - loss: 0.0651 - acc: 0.8481 - 434ms/step
save checkpoint at /home/aistudio/checkpoints/25
Epoch 27/30
step 165/165 - loss: 0.0401 - acc: 0.8573 - 432ms/step
Epoch 28/30
step 165/165 - loss: 0.7666 - acc: 0.8550 - 434ms/step
Epoch 29/30
step 165/165 - loss: 0.2764 - acc: 0.8610 - 433ms/step
Epoch 30/30
step 165/165 - loss: 0.1352 - acc: 0.8633 - 433ms/step
save checkpoint at /home/aistudio/checkpoints/final

Model Saving and Prediction

# Path to the trained checkpoint
checkpoint_path = './checkpoints/final'
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
model = paddle.Model(model, [input, token_type])
model.load(checkpoint_path)

# Load the test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)

# Predict and save the results
save_file = {'seabsa16phns': './submission/SE-ABSA16_PHNS.tsv', 'seabsa16came': './submission/SE-ABSA16_CAME.tsv'}
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

# The with statement closes the file automatically; no explicit close() is needed
with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
[2021-06-23 23:07:18,381] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
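`predict_batch` returns a per-batch array of class logits, and `argmax(axis=-1)` picks the higher-scoring class. The same decision rule in plain Python, on hypothetical scores:

```python
# For binary classification, the predicted label is the index of the larger logit.
def logits_to_labels(batch_logits):
    return [max(range(len(row)), key=lambda i: row[i]) for row in batch_logits]

batch_logits = [[-1.2, 0.7], [2.1, -0.3]]  # hypothetical per-class scores
print(logits_to_labels(batch_logits))  # [1, 0]
```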

Inspect the Predictions

test_data = "./submission/SE-ABSA16_CAME.tsv"
test_data = pd.read_csv(test_data, header=0, sep='\t')
test_data.head(10)
   index  prediction
0      0           1
1      1           0
2      2           1
3      3           1
4      4           1
5      5           0
6      6           0
7      7           1
8      8           1
9      9           1

SE ABSA16 PHNS

Training Configuration

# Model and tokenizer
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# Select the dataset
data_name = 'seabsa16phns'

# Hyperparameters
epochs = 30
learning_rate = 5e-6
batch_size = 8
max_len = 256

# Prepare the model
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
label = InputSpec((-1, 1), dtype='int64', name='label')
model = paddle.Model(model, [input, token_type], [label])

# L2 regularization to mitigate overfitting
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate,
                                  weight_decay=paddle.regularizer.L2Decay(coeff=1e-5),
                                  parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])
[2021-06-23 21:41:56,594] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-23 21:42:00,396] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt

Model Training

model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=5, save_dir='./checkpoints', log_freq=200)
step 167/167 - loss: 0.0093 - acc: 0.8825 - 426ms/step
Epoch 25/30
step 167/167 - loss: 0.4433 - acc: 0.8870 - 428ms/step
Epoch 26/30
step 167/167 - loss: 0.1899 - acc: 0.8735 - 427ms/step
save checkpoint at /home/aistudio/checkpoints/25
Epoch 27/30
step 167/167 - loss: 0.0471 - acc: 0.8780 - 424ms/step
Epoch 28/30
step 167/167 - loss: 0.3930 - acc: 0.8862 - 427ms/step
Epoch 29/30
step 167/167 - loss: 0.0853 - acc: 0.9004 - 427ms/step
Epoch 30/30
step 167/167 - loss: 0.2919 - acc: 0.8877 - 427ms/step
save checkpoint at /home/aistudio/checkpoints/final

Model Saving and Prediction

# Path to the trained checkpoint
checkpoint_path = './checkpoints/final'
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
model = paddle.Model(model, [input, token_type])
model.load(checkpoint_path)

# Load the test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)

# Predict and save the results
save_file = {'seabsa16phns': './submission/SE-ABSA16_PHNS.tsv', 'seabsa16came': './submission/SE-ABSA16_CAME.tsv'}
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

# The with statement closes the file automatically; no explicit close() is needed
with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
[2021-06-23 22:24:17,942] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams

Inspect the Predictions

test_data = "./submission/SE-ABSA16_PHNS.tsv"
test_data = pd.read_csv(test_data, header=0, sep='\t')
test_data.head(10)
   index  prediction
0      0           1
1      1           0
2      2           1
3      3           1
4      4           1
5      5           1
6      6           1
7      7           1
8      8           1
9      9           1

About the Author

  • PaddlePaddle homepage: 行远见大
  • Background: member of the Shanghai Open Source Information Technology Association
  • Motto: salute open source and build the PaddlePaddle open-source community together
  • Usual haunts: the AI Studio platform and various PaddlePaddle groups
  • QQ: 1206313185 (please note why you are adding me and your AI Studio ID)
  • Thanks for the likes (♡), forks, and follows; the more ♡, the faster the updates~

