『行远见大』 SE-ABSA16 Aspect-Level Sentiment Classification
Project Overview
Aspect-level sentiment classification identifies and extracts subjective information in text (tendencies, stances, evaluations, opinions) and performs binary classification on the sentiment toward a given aspect: 1 for positive, 0 for negative. This project offers one solution approach; you are welcome to take it as a starting point and tune it further.
Competition Registration

Registration link: https://aistudio.baidu.com/aistudio/competition/detail/50
Team Introduction
Team name: 肝就完事了 (roughly, "Just Grind It Out")
Team description: the five of us have no grind left
Note: this project further tunes team leader iterhui's project 基于[实践课5-情感分析baseline]优化的情感分析 (a sentiment-analysis project optimized from the Practical Lesson 5 sentiment-analysis baseline).

A Salute to Open Source
Hello, I'm 行远见大. Join me in building the PaddlePaddle open-source community. Sharing knowledge is a virtue, so let's salute open source!
Prerequisites
Sentiment Analysis
Sentiment analysis aims to automatically identify and extract subjective information from text, such as tendencies, stances, evaluations, and opinions. It spans a variety of tasks, including sentence-level sentiment classification, aspect-level sentiment classification, opinion extraction, and emotion classification. Sentiment analysis is an important research direction in artificial intelligence with high academic value, and its applications in consumer decision-making, public-opinion analysis, and personalized recommendation give it high commercial value as well.

Paper link: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
Dataset Description
SE-ABSA16 aspect-level sentiment classification (Aspect-level Sentiment Classification): given a text d and an aspect a described in it, predict the sentiment category s toward that aspect. s generally takes only two values, positive and negative, although some datasets use a finer-grained scheme.
An example:
Input text (d): D4外形设计真的很成功不知道楼主摸没摸过D4真机非常成功的设计本以为D3系列很难超越了但是D4的流线风格显然不比D3差在整体感上还更胜一筹。(roughly: "The D4's exterior design really is a success; I thought the D3 series would be hard to surpass, but the D4's streamlined style is clearly no worse, and it even wins on overall feel.")
Aspect (a): camera exterior design.
Sentiment category (s): 1 means positive, 0 means negative.
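The (a, d) pair above is exactly what the model will consume as a sentence-pair input, with the aspect as text_a and the document as text_b. A minimal sketch of that packing (the helper name and sample strings are made up for illustration):

```python
def make_pair_example(aspect, text, label=None):
    # Pack an aspect (text_a) and its document (text_b) into the
    # sentence-pair layout used by the TSV files in this project.
    example = {"text_a": aspect, "text_b": text}
    if label is not None:
        example["label"] = int(label)  # 1 = positive, 0 = negative
    return example

train_example = make_pair_example("camera#design_features", "D4外形设计真的很成功", label=1)
test_example = make_pair_example("camera#design_features", "D4外形设计真的很成功")
print(train_example["label"])  # 1
```

Test examples carry no label field, which is why the dataset class later branches on a `for_test` flag.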
Dataset | Train size | Test size |
---|---|---|
SE-ABSA16_CAME | 1,317 | 505 |
SE-ABSA16_PHNS | 1,336 | 529 |
Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/76829
Aspect-Level Sentiment Classification
Environment Setup
# Import the required libraries
import paddle
import numpy as np
import pandas as pd
from paddle.static import InputSpec
from paddle.io import Dataset, DataLoader
print("This project is based on Paddle version: " + paddle.__version__)
This project is based on Paddle version: 2.1.0
# The PaddleNLP preinstalled on AI Studio is outdated, so upgrade it first
!pip install --upgrade paddlenlp -i https://pypi.org/simple
# Load the pretrained model SKEP
import paddlenlp
from paddlenlp.data import Pad, Stack, Tuple
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
print("This project is based on PaddleNLP version: " + paddlenlp.__version__)
This project is based on PaddleNLP version: 2.0.3
Loading the Data
# Unzip the datasets
!unzip -o data/data76829/SE-ABSA16_CAME
!unzip -o data/data76829/SE-ABSA16_PHNS
Archive: data/data76829/SE-ABSA16_CAME.zip
creating: SE-ABSA16_CAME/
inflating: SE-ABSA16_CAME/train.tsv
creating: __MACOSX/
creating: __MACOSX/SE-ABSA16_CAME/
inflating: __MACOSX/SE-ABSA16_CAME/._train.tsv
inflating: SE-ABSA16_CAME/License.pdf
inflating: __MACOSX/SE-ABSA16_CAME/._License.pdf
inflating: SE-ABSA16_CAME/test.tsv
inflating: __MACOSX/SE-ABSA16_CAME/._test.tsv
inflating: __MACOSX/._SE-ABSA16_CAME
Archive: data/data76829/SE-ABSA16_PHNS.zip
creating: SE-ABSA16_PHNS/
inflating: SE-ABSA16_PHNS/train.tsv
creating: __MACOSX/SE-ABSA16_PHNS/
inflating: __MACOSX/SE-ABSA16_PHNS/._train.tsv
inflating: SE-ABSA16_PHNS/License.pdf
inflating: __MACOSX/SE-ABSA16_PHNS/._License.pdf
inflating: SE-ABSA16_PHNS/test.tsv
inflating: __MACOSX/SE-ABSA16_PHNS/._test.tsv
inflating: __MACOSX/._SE-ABSA16_PHNS
# Remove the junk files left over from unzipping
!rm -r __MACOSX
Inspecting the Data
train_data = "./SE-ABSA16_CAME/train.tsv"
train_data = pd.read_csv(train_data, header=0, sep='\t')
train_data.head(10)
 | label | text_a | text_b |
---|---|---|---|
0 | 0 | camera#design_features | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
1 | 0 | camera#operation_performance | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
2 | 0 | hardware#usability | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
3 | 0 | software#design_features | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
4 | 1 | lens#general | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
5 | 0 | lens#operation_performance | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
6 | 0 | battery#quality | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
7 | 1 | camera#portability | 千呼万唤始出来,尼康的APSC小相机终于发布了,COOLPIX A. 你怎么看呢?我看,尼康... |
8 | 1 | photo#quality | 进xe之前知道有一部m8.2才万元左右,十分心动。都拿在手中了才突然醒悟,色彩fuji未必比... |
9 | 1 | photo#quality | 二选一:富士X-Pro1+18mmF2和徕卡X?玩过的小机器里面感觉成像最好的还是适马的DP... |
train_data = "./SE-ABSA16_PHNS/train.tsv"
train_data = pd.read_csv(train_data, header=0, sep='\t')
train_data.head(10)
 | label | text_a | text_b |
---|---|---|---|
0 | 1 | phone#design_features | 今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/... |
1 | 1 | display#quality | 今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/... |
2 | 1 | ports#connectivity | 今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/... |
3 | 1 | cpu#operation_performance | 今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/... |
4 | 1 | display#quality | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
5 | 1 | phone#design_features | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
6 | 1 | multimedia_devices#design_features | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
7 | 1 | multimedia_devices#quality | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
8 | 1 | multimedia_devices#design_features | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
9 | 1 | power_supply#quality | 1,外观上,iPhone5不是用来ZB的,所以别人看不看得出来,并不是我关心的问题。4寸的屏... |
Model Construction
# Build a dictionary holding the raw lines of each dataset split
def open_func(file_path):
    # Skip the header line and keep only rows with at least two tab-separated fields
    return [line.strip() for line in open(file_path, 'r', encoding='utf8').readlines()[1:]
            if len(line.strip().split('\t')) >= 2]

data_dict = {'seabsa16phns': {'test': open_func('SE-ABSA16_PHNS/test.tsv'),
                              'train': open_func('SE-ABSA16_PHNS/train.tsv')},
             'seabsa16came': {'test': open_func('SE-ABSA16_CAME/test.tsv'),
                              'train': open_func('SE-ABSA16_CAME/train.tsv')}}
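Each kept line is a raw tab-separated record; downstream, the dataset class splits it from the right so that the last three fields are label, text_a, and text_b. A quick sketch with a made-up line in the same layout:

```python
# A hypothetical training line in the SE-ABSA16 TSV layout: label \t aspect \t text
line = "1\tcamera#design_features\t这台相机的外形设计很成功"

samples = line.split('\t')
label = int(samples[-3])  # sentiment label (1 positive, 0 negative)
text_a = samples[-2]      # the aspect being evaluated
text_b = samples[-1]      # the full review text
print(label, text_a)  # 1 camera#design_features
```

Splitting from the right (negative indices) makes the parsing tolerant of an extra leading column, such as the qid column in the test files.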
# Define the dataset
label_list = [0, 1]

# Keep token_type_ids so the model can distinguish text_a from text_b
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=256, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        samples = self._data[idx].split('\t')
        text_a = samples[-2]
        text_b = samples[-1]
        encoder_out = self._tokenizer.encode(text_a, text_b, max_seq_len=self._max_len)
        text = encoder_out['input_ids']
        token_type = encoder_out['token_type_ids']
        if self._for_test:
            return np.array(text, dtype='int64'), np.array(token_type, dtype='int64')
        else:
            label = int(samples[-3])
            return np.array(text, dtype='int64'), np.array(token_type, dtype='int64'), np.array(label, dtype='int64')

def batchify_fn(for_test=False):
    # Pad input ids and token types to the batch maximum; stack the scalar labels
    if for_test:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id)): [data for data in fn(samples)]
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
                                        Stack()): [data for data in fn(samples)]

def get_data_loader(data, tokenizer, batch_size=32, max_len=256, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    shuffle = not for_test  # shuffle only during training
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size,
                             collate_fn=batchify_fn(for_test), shuffle=shuffle)
    return data_loader
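`Pad` right-pads every sequence in a batch to the length of the longest one (using the tokenizer's pad id), and `Stack` simply stacks the scalar labels. The padding behavior can be sketched with plain lists, without the PaddleNLP API:

```python
def pad_batch(seqs, pad_val=0):
    # Right-pad each sequence to the longest length in the batch,
    # mirroring what paddlenlp.data.Pad(axis=0, pad_val=...) does.
    max_len = max(len(s) for s in seqs)
    return [s + [pad_val] * (max_len - len(s)) for s in seqs]

token_batch = [[101, 5, 6, 102], [101, 7, 102]]
print(pad_batch(token_batch))  # [[101, 5, 6, 102], [101, 7, 102, 0]]
```

Padding per batch (rather than to a fixed global length) keeps short batches cheap while still producing the rectangular tensors the model expects.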
SE-ABSA16_CAME
Training Configuration
# Model and tokenizer
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# Choose the dataset
data_name = 'seabsa16came'

# Hyperparameters
epochs = 30
learning_rate = 3e-6
batch_size = 8
max_len = 256

# Prepare the model
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
label = InputSpec((-1, 1), dtype='int64', name='label')  # class indices, not one-hot
model = paddle.Model(model, [input, token_type], [label])

# L2 regularization to curb overfitting
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate,
                                  weight_decay=paddle.regularizer.L2Decay(coeff=1e-5),
                                  parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])
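`L2Decay(coeff=1e-5)` effectively adds `coeff * w` to each parameter's gradient before the optimizer step, nudging weights toward zero. The idea can be sketched with a plain SGD update for clarity (the run above uses Adam, whose step is more involved):

```python
def sgd_step_with_l2(w, grad, lr=3e-6, coeff=1e-5):
    # L2 weight decay: the effective gradient is grad + coeff * w,
    # so larger weights are penalized more strongly.
    return w - lr * (grad + coeff * w)

w = 0.5
w = sgd_step_with_l2(w, grad=0.1)
```

With only ~1,300 training examples per dataset and a large pretrained model, some form of regularization is sensible; the small coefficient keeps the penalty from overwhelming the task loss.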
[2021-06-23 22:25:47,645] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-23 22:25:51,477] [ INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
Model Training
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=5, save_dir='./checkpoints', log_freq=200)
step 165/165 - loss: 0.0651 - acc: 0.8481 - 434ms/step
save checkpoint at /home/aistudio/checkpoints/25
Epoch 27/30
step 165/165 - loss: 0.0401 - acc: 0.8573 - 432ms/step
Epoch 28/30
step 165/165 - loss: 0.7666 - acc: 0.8550 - 434ms/step
Epoch 29/30
step 165/165 - loss: 0.2764 - acc: 0.8610 - 433ms/step
Epoch 30/30
step 165/165 - loss: 0.1352 - acc: 0.8633 - 433ms/step
save checkpoint at /home/aistudio/checkpoints/final
Model Saving and Prediction
# Path to the trained checkpoint
checkpoint_path = './checkpoints/final'
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
model = paddle.Model(model, [input, token_type])
model.load(checkpoint_path)

# Load the test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)

# Predict and write the submission file
import os
os.makedirs('./submission', exist_ok=True)  # make sure the output directory exists
save_file = {'seabsa16phns': './submission/SE-ABSA16_PHNS.tsv', 'seabsa16came': './submission/SE-ABSA16_CAME.tsv'}
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
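The model emits two logits per sample, and `argmax(axis=-1)` picks the larger one as the predicted class. A minimal numpy sketch with made-up logits:

```python
import numpy as np

# Hypothetical logits for three test samples (columns: class 0, class 1)
logits = np.array([[0.2, 1.3],
                   [2.1, -0.4],
                   [0.1, 0.9]])
predictions = logits.argmax(axis=-1).tolist()
print(predictions)  # [1, 0, 1]
```

These integer class ids are exactly what gets written to the `prediction` column of the submission TSV.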
[2021-06-23 23:07:18,381] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
Inspecting the Predictions
test_data = "./submission/SE-ABSA16_CAME.tsv"
test_data = pd.read_csv(test_data, header=0, sep='\t')
test_data.head(10)
 | index | prediction |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 2 | 1 |
3 | 3 | 1 |
4 | 4 | 1 |
5 | 5 | 0 |
6 | 6 | 0 |
7 | 7 | 1 |
8 | 8 | 1 |
9 | 9 | 1 |
SE-ABSA16_PHNS
Training Configuration
# Model and tokenizer
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# Choose the dataset
data_name = 'seabsa16phns'

# Hyperparameters
epochs = 30
learning_rate = 5e-6
batch_size = 8
max_len = 256

# Prepare the model
train_dataloader = get_data_loader(data_dict[data_name]['train'], tokenizer, batch_size, max_len, for_test=False)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
label = InputSpec((-1, 1), dtype='int64', name='label')  # class indices, not one-hot
model = paddle.Model(model, [input, token_type], [label])

# L2 regularization to curb overfitting
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate,
                                  weight_decay=paddle.regularizer.L2Decay(coeff=1e-5),
                                  parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[paddle.metric.Accuracy()])
[2021-06-23 21:41:56,594] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-23 21:42:00,396] [ INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
Model Training
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=5, save_dir='./checkpoints', log_freq=200)
step 167/167 - loss: 0.0093 - acc: 0.8825 - 426ms/step
Epoch 25/30
step 167/167 - loss: 0.4433 - acc: 0.8870 - 428ms/step
Epoch 26/30
step 167/167 - loss: 0.1899 - acc: 0.8735 - 427ms/step
save checkpoint at /home/aistudio/checkpoints/25
Epoch 27/30
step 167/167 - loss: 0.0471 - acc: 0.8780 - 424ms/step
Epoch 28/30
step 167/167 - loss: 0.3930 - acc: 0.8862 - 427ms/step
Epoch 29/30
step 167/167 - loss: 0.0853 - acc: 0.9004 - 427ms/step
Epoch 30/30
step 167/167 - loss: 0.2919 - acc: 0.8877 - 427ms/step
save checkpoint at /home/aistudio/checkpoints/final
Model Saving and Prediction
# Path to the trained checkpoint
checkpoint_path = './checkpoints/final'
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
input = InputSpec((-1, -1), dtype='int64', name='input')
token_type = InputSpec((-1, -1), dtype='int64', name='token_type')
model = paddle.Model(model, [input, token_type])
model.load(checkpoint_path)

# Load the test set
test_dataloader = get_data_loader(data_dict[data_name]['test'], tokenizer, batch_size, max_len, for_test=True)

# Predict and write the submission file
import os
os.makedirs('./submission', exist_ok=True)  # make sure the output directory exists
save_file = {'seabsa16phns': './submission/SE-ABSA16_PHNS.tsv', 'seabsa16came': './submission/SE-ABSA16_CAME.tsv'}
predicts = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()

with open(save_file[data_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_name]['test']):
        qid = sample.split('\t')[0]
        f.write(qid + '\t' + str(predicts[idx]) + '\n')
[2021-06-23 22:24:17,942] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
Inspecting the Predictions
test_data = "./submission/SE-ABSA16_PHNS.tsv"
test_data = pd.read_csv(test_data, header=0, sep='\t')
test_data.head(10)
 | index | prediction |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 2 | 1 |
3 | 3 | 1 |
4 | 4 | 1 |
5 | 5 | 1 |
6 | 6 | 1 |
7 | 7 | 1 |
8 | 8 | 1 |
9 | 9 | 1 |
About the Author
- PaddlePaddle homepage: 行远见大
- Background: member of the Shanghai Open Source Information Technology Association
- My motto: salute open source and build the PaddlePaddle open-source community together
- Usual haunts: the AI Studio platform and assorted PaddlePaddle groups
- QQ: 1206313185 (please note why you are adding me and your AI Studio ID)
- Thanks for your likes (♡), forks, and follows; the more ♡, the faster the updates~