DataJoy Practice Competition: Twitter Text Sentiment Classification Baseline
A baseline that fine-tunes a pretrained model with PaddleHub to train and optimize a Twitter sentiment classification model.
1. Competition Overview
DataJoy® practice competitions are the Heywhale community's own practice-competition brand, open to everyone in data science who wants to improve their skills and meet peers through practice. They aim to provide a beginner-friendly environment for free exchange and open-source sharing. Practice competitions stay open long-term, with each cycle lasting two calendar months; at the end of each cycle the leaderboard is automatically reset, and top contestants are tallied and rewarded.
This competition's task is Twitter text sentiment classification, an entry-level hands-on project for newcomers to sentiment analysis. Contestants are given 30,000 training samples and 10,000 test samples. You train a model on the labeled training set, predict on the test set, and submit the predictions as a CSV file; the evaluation metric is Accuracy.
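The leaderboard metric, Accuracy, is simply the fraction of correctly predicted labels. As a minimal sketch, using scikit-learn's `accuracy_score` on invented labels (the two arrays below are made-up examples, not competition data):

```python
from sklearn.metrics import accuracy_score

# Invented ground-truth and predicted labels, for illustration only
y_true = [3, 1, 4, 4, 9, 0]
y_pred = [3, 1, 4, 3, 9, 0]

# Accuracy = correct predictions / total predictions
print(accuracy_score(y_true, y_pred))  # 5 of 6 correct -> 0.8333...
```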
Competition link: DataJoy Practice Competition: Twitter Text Sentiment Classification
Baseline Approach:
The task is multi-class classification of short English texts. This project fine-tunes a pretrained model with PaddleHub to train and optimize a Twitter sentiment classification model, then uses the trained model to predict on the test set and produce the submission file.
For text classification competitions, fine-tuning a pretrained model has become a common baseline, and the free GPU compute provided by Baidu AI Studio makes further exploration convenient. PaddleHub bundles a large number of pretrained models, so a competition baseline can be built quickly without verbose code. This project requires the premium GPU environment; if you run out of GPU memory, reduce the batch size.
2. Twitter Sentiment Dataset Processing
2.1 Unzip and Inspect the Data
# Unzip the competition dataset
%cd /home/aistudio/data/data106998/
!unzip datasets.zip
/home/aistudio/data/data106998
Archive: datasets.zip
inflating: submission.csv
inflating: test.csv
inflating: train.csv
# Read the datasets
import pandas as pd
train = pd.read_csv('train.csv')  # labeled training data
test = pd.read_csv('test.csv')  # test data to predict on
sub = pd.read_csv('submission.csv')  # sample submission file
# Inspect the training data: columns are tweet_id, content, label; 30,000 rows
train
 | tweet_id | content | label
---|---|---|---
0 | tweet_0 | @tiffanylue i know i was listenin to bad habi... | 0 |
1 | tweet_1 | Layin n bed with a headache ughhhh...waitin o... | 1 |
2 | tweet_2 | Funeral ceremony...gloomy friday... | 1 |
3 | tweet_3 | wants to hang out with friends SOON! | 2 |
4 | tweet_4 | @dannycastillo We want to trade with someone w... | 3 |
... | ... | ... | ... |
29995 | tweet_29995 | @shonali I think the lesson of the day is not ... | 3 |
29996 | tweet_29996 | @lovelylisaj can you give me the link for the ... | 3 |
29997 | tweet_29997 | @sendsome2me haha, yeah. Twitter has many uses... | 3 |
29998 | tweet_29998 | Happy Mother's Day to all the mommies out ther... | 6 |
29999 | tweet_29999 | @niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE... | 9 |
30000 rows × 3 columns
# Inspect the test data: columns are tweet_id, content; 10,000 rows
test
 | tweet_id | content
---|---|---
0 | tweet_0 | Re-pinging @ghostridah14: why didn't you go to... |
1 | tweet_1 | @kelcouch I'm sorry at least it's Friday? |
2 | tweet_2 | The storm is here and the electricity is gone |
3 | tweet_3 | So sleepy again and it's not even that late. I... |
4 | tweet_4 | Wondering why I'm awake at 7am,writing a new s... |
... | ... | ... |
9995 | tweet_9995 | @jasimmo Ooo showing of your French skills!! l... |
9996 | tweet_9996 | Succesfully following Tayla!! |
9997 | tweet_9997 | @JohnLloydTaylor |
9998 | tweet_9998 | Happy Mothers Day All my love |
9999 | tweet_9999 | @mopedronin bullet train from tokyo the gf ... |
10000 rows × 2 columns
# Inspect the sample submission file: columns are tweet_id, label
sub
 | tweet_id | label
---|---|---
0 | tweet_0 | 0 |
1 | tweet_1 | 0 |
2 | tweet_2 | 0 |
3 | tweet_3 | 0 |
4 | tweet_4 | 0 |
... | ... | ... |
9995 | tweet_9995 | 0 |
9996 | tweet_9996 | 0 |
9997 | tweet_9997 | 0 |
9998 | tweet_9998 | 0 |
9999 | tweet_9999 | 0 |
10000 rows × 2 columns
2.2 Text Cleaning
import re
# Define a text-cleaning function
def clean_text(text):
    """
    Clean text
    :param text: the string of text
    :return: text string after cleaning
    """
    # Strip URLs
    text = re.sub(r'https?://[a-zA-Z0-9.?/&=:]*', '', text)
    # Lowercase (kept disabled)
    # text = text.lower()
    # Expand contractions
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"\bcant\b", "can not", text)
    text = re.sub(r"cannot", "can not ", text)
    text = re.sub(r"what\'s", "what is", text)
    text = re.sub(r"What\'s", "what is", text)
    text = re.sub(r"\'ve ", " have ", text)
    text = re.sub(r"n\'t", " not ", text)
    text = re.sub(r"i\'m", "i am ", text)
    text = re.sub(r"I\'m", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r" e mail ", " email ", text)
    text = re.sub(r" e \- mail ", " email ", text)
    text = re.sub(r" e\-mail ", " email ", text)
    # Replace symbols with words
    text = re.sub(r"&", " and ", text)
    text = re.sub(r"\|", " or ", text)
    text = re.sub(r"=", " equal ", text)
    text = re.sub(r"\+", " plus ", text)
    text = re.sub(r"\$", " dollar ", text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text
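As a quick sanity check of the cleaning rules, here is a minimal self-contained sketch that applies URL-stripping and contraction expansion in the spirit of `clean_text` to an invented sample tweet:

```python
import re

def demo_clean(text):
    # Strip URLs (http or https)
    text = re.sub(r'https?://[a-zA-Z0-9.?/&=:]*', '', text)
    # Expand a couple of contractions, as in clean_text
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not ", text)
    # Collapse repeated whitespace
    return ' '.join(text.split())

print(demo_clean("I can't open https://example.com today"))
# -> I can not open today
```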
# Clean the text
train['content'] = train['content'].apply(clean_text)
test['content'] = test['content'].apply(clean_text)
2.3 Data Analysis and Visualization
# Check the training set for missing values
train.isnull().sum()
tweet_id 0
content 0
label 0
dtype: int64
# Check the class distribution of the training data
train['label'].value_counts()
3 6478
4 6344
9 3907
1 3873
6 2881
5 1641
7 1332
11 1145
8 992
0 621
2 570
10 134
12 82
Name: label, dtype: int64
# Visualize the label distribution
import matplotlib.pyplot as plt
%matplotlib inline
train['label'].value_counts(normalize=True).plot(kind='bar');
# Drop the unused id column
train = train[['content', 'label']]
test = test[['content']]
# Text length statistics; the maximum length is 190
print(train['content'].map(len).describe())
print(test['content'].map(len).describe())
count 30000.000000
mean 72.684633
std 36.857207
min 1.000000
25% 42.000000
50% 68.000000
75% 102.000000
max 188.000000
Name: content, dtype: float64
count 10000.000000
mean 72.444800
std 36.859886
min 1.000000
25% 42.000000
50% 67.000000
75% 101.000000
max 190.000000
Name: content, dtype: float64
# Visualize the training-set length distribution
train['content'].map(len).plot(kind='kde');
# Visualize the test-set length distribution; it closely matches the training set
test['content'].map(len).plot(kind='kde');
2.4 Split into Training and Validation Sets
# Split 9:1 within each label class so the training and validation sets stay close to the same distribution.
from sklearn.utils import shuffle
new_train = pd.DataFrame()  # training set
new_valid = pd.DataFrame()  # validation set
tags = list(train.label.unique())  # all classes
# Sample within each class
for tag in tags:
    data = train[(train['label'] == tag)]
    # Draw 10% as the validation set
    valid_sample = data.sample(int(0.1 * len(data)))
    valid_index = valid_sample.index
    # The remaining 90% becomes the training set
    all_index = data.index
    residue_index = all_index.difference(valid_index)
    residue = data.loc[residue_index]
    # Accumulate the sampled rows
    new_valid = pd.concat([new_valid, valid_sample], ignore_index=True)
    new_train = pd.concat([new_train, residue], ignore_index=True)
# Shuffle both splits
new_train = shuffle(new_train)
new_valid = shuffle(new_valid)
# Save the splits to CSV
new_train.to_csv('train_data.csv', index=False)  # training set
new_valid.to_csv('valid_data.csv', index=False)  # validation set
test.to_csv('test_data.csv', index=False)  # test set
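The per-class sampling loop above can also be expressed with scikit-learn's `train_test_split` and its `stratify` argument, which keeps the label distribution consistent between the two splits. A sketch on invented toy data (the DataFrame below is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real train DataFrame: 100 rows, 5 classes of 20 each
df = pd.DataFrame({
    'content': ['tweet {}'.format(i) for i in range(100)],
    'label': [i % 5 for i in range(100)],
})

# 9:1 split, stratified on label so both splits share the class distribution
tr, va = train_test_split(df, test_size=0.1, stratify=df['label'], random_state=1024)
print(len(tr), len(va))  # -> 90 10
```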
3. Building a Twitter Sentiment Baseline with PaddleHub
PaddleHub provides convenient access to pretrained models from the PaddlePaddle ecosystem, together with model management and one-line prediction. Combined with its Fine-tune API, transfer learning on top of large pretrained models can be completed quickly, letting those models better serve user-specific applications.
PaddleHub on GitHub (file an issue if you run into problems): https://github.com/PaddlePaddle/PaddleHub
3.1 PaddleHub Fine-tuning Workflow
The PaddleHub fine-tuning workflow is:
1) Load the pretrained model to use
2) Load and convert the data into a format the model accepts
3) Configure the fine-tuning and optimization strategy
4) Train and validate the model
5) Run prediction with the model
Each step takes only a few lines of code, so a competition baseline can be put together quickly.
3.2 前置环境准备
# 下载最新版本的paddlehub
!pip install -U paddlehub
!pip install onnx
# Import paddlehub and paddle
import paddlehub as hub
# Fix the random seed for reproducibility
seed = 1024
import random
import numpy as np
import paddle
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
<paddle.fluid.core_avx.Generator at 0x7f5250082eb0>
3.3 Load the Pretrained Model
# Set the classes to classify
label_list=list(train.label.unique().astype(str))  # note: the numeric labels must be converted to strings here
print(label_list)
label_map = {
idx: label_text for idx, label_text in enumerate(label_list)
}
print(label_map)
['0', '1', '2', '3', '4', '5', '6', '9', '7', '8', '10', '11', '12']
{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5', 6: '6', 7: '9', 8: '7', 9: '8', 10: '10', 11: '11', 12: '12'}
# The line below initializes model as a text classification model: the chosen pretrained model with a fully connected layer appended
# Since the data is English, the ernie_v2_eng_large pretrained model is chosen
model = hub.Module(name="ernie_v2_eng_large", task='seq-cls', num_classes=len(train.label.unique()), label_map=label_map)
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/ernie_v2_eng_large_2.0.2.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmpdmmdt2z2/ernie_v2_eng_large_2.0.2.tar.gz
[##################################################] 100.00%
[2021-09-19 21:04:00,294] [ INFO] - Successfully installed ernie_v2_eng_large-2.0.2
[2021-09-19 21:04:00,298] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-2.0-large-en
[2021-09-19 21:04:00,300] [ INFO] - Downloading ernie_v2_eng_large.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams
100%|██████████| 1309198/1309198 [00:17<00:00, 73082.66it/s]
The parameters of `hub.Module` are:

  * `name`: the model name; options include `ernie`, `ernie_tiny`, `bert-base-cased`, `bert-base-chinese`, `roberta-wwm-ext`, `roberta-wwm-ext-large`, and so on.
  * `task`: the fine-tuning task; `seq-cls` here, meaning text classification.
  * `num_classes`: the number of classes for the text classification task, determined by the dataset in use; default 2.
More models supported by PaddleHub are listed at: https://www.paddlepaddle.org.cn/hublist
Loading examples for the models that currently support text classification in PaddleHub:
Model | PaddleHub Module
---|---
ERNIE, Chinese | hub.Module(name='ernie') |
ERNIE tiny, Chinese | hub.Module(name='ernie_tiny') |
ERNIE 2.0 Base, English | hub.Module(name='ernie_v2_eng_base') |
ERNIE 2.0 Large, English | hub.Module(name='ernie_v2_eng_large') |
BERT-Base, English Cased | hub.Module(name='bert-base-cased') |
BERT-Base, English Uncased | hub.Module(name='bert-base-uncased') |
BERT-Large, English Cased | hub.Module(name='bert-large-cased') |
BERT-Large, English Uncased | hub.Module(name='bert-large-uncased') |
BERT-Base, Multilingual Cased | hub.Module(name='bert-base-multilingual-cased') |
BERT-Base, Multilingual Uncased | hub.Module(name='bert-base-multilingual-uncased') |
BERT-Base, Chinese | hub.Module(name='bert-base-chinese') |
BERT-wwm, Chinese | hub.Module(name='chinese-bert-wwm') |
BERT-wwm-ext, Chinese | hub.Module(name='chinese-bert-wwm-ext') |
RoBERTa-wwm-ext, Chinese | hub.Module(name='roberta-wwm-ext') |
RoBERTa-wwm-ext-large, Chinese | hub.Module(name='roberta-wwm-ext-large') |
RBT3, Chinese | hub.Module(name='rbt3') |
RBTL3, Chinese | hub.Module(name='rbtl3') |
ELECTRA-Small, English | hub.Module(name='electra-small') |
ELECTRA-Base, English | hub.Module(name='electra-base') |
ELECTRA-Large, English | hub.Module(name='electra-large') |
ELECTRA-Base, Chinese | hub.Module(name='chinese-electra-base') |
ELECTRA-Small, Chinese | hub.Module(name='chinese-electra-small') |
3.4 Load and Convert the Data for the Model
import os, io, csv
from paddlehub.datasets.base_nlp_dataset import InputExample, TextClassificationDataset
# Dataset location
DATA_DIR="/home/aistudio/data/data106998/"
# Wrap the data in a format the model accepts
class Twitter(TextClassificationDataset):
    def __init__(self, tokenizer, mode='train', max_seq_len=128):
        if mode == 'train':
            data_file = 'train_data.csv'
        elif mode == 'dev':
            data_file = 'valid_data.csv'
        super(Twitter, self).__init__(
            base_path=DATA_DIR,
            data_file=data_file,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            is_file_with_header=True,
            label_list=label_list
        )

    # Parse the samples in the CSV file
    def _read_file(self, input_file, is_file_with_header: bool = False):
        if not os.path.exists(input_file):
            raise RuntimeError("The file {} is not found.".format(input_file))
        else:
            with io.open(input_file, "r", encoding="UTF-8") as f:
                reader = csv.reader(f, delimiter=",")  # comma-separated data
                examples = []
                seq_id = 0
                header = next(reader) if is_file_with_header else None
                for line in reader:
                    example = InputExample(guid=seq_id, text_a=line[0], label=line[1])
                    seq_id += 1
                    examples.append(example)
                return examples
# The recommended max_seq_len is 128; adjust it to the task's text length, up to at most 512.
train_dataset = Twitter(model.get_tokenizer(), mode='train', max_seq_len=190)
dev_dataset = Twitter(model.get_tokenizer(), mode='dev', max_seq_len=190)
# Inspect the first 3 processed samples
for e in train_dataset.examples[:3]:
print(e)
for e in dev_dataset.examples[:3]:
print(e)
[2021-09-19 21:04:55,975] [ INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/vocab.txt
100%|██████████| 227/227 [00:00<00:00, 4688.03it/s]
[2021-09-19 21:05:11,416] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-2.0-large-en/vocab.txt
text=@cocoy @mike_online That's ! the sad ; thing about it. label=4
text=, one more final label=3
text=at wk wishin I was at Annies grad ceremony but coodnt becuz of sumthin I dnt care to . mention... sooo sad label=1
text=still needs : another 6 hours ; of sleep label=3
text=@Saraa_xD Si pero parece que las personas de facebook no , estan de acuerdo (that was probably all wrong, but ? I tried...) label=4
text=; Hmmm.... My whole body feels sore! label=4
3.5 Optimization Strategy and Runtime Configuration
# Choose the optimizer
optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
# Runtime configuration
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=True, use_vdl=True)  # the executor of the fine-tune task
[2021-09-19 21:05:22,291] [ WARNING] - PaddleHub model checkpoint not found, start from scratch...
Note: while the model trains, you can run nvidia-smi in a terminal or use the performance-monitoring panel at the bottom to watch GPU memory usage, and adjust batch_size and similar settings dynamically to avoid running out of memory.
3.6 Model Training and Validation
`trainer.train` controls the training process and takes the following parameters:

  * `train_dataset`: the dataset used for training;
  * `epochs`: the number of training epochs;
  * `batch_size`: the training batch size; when using a GPU, adjust it to the actual hardware;
  * `num_workers`: the number of workers, default 0;
  * `eval_dataset`: the validation set;
  * `log_interval`: the logging interval, measured in training batches;
  * `save_interval`: the model-saving interval, measured in training epochs.
trainer.train(train_dataset, epochs=3, batch_size=50, eval_dataset=dev_dataset, save_interval=1)  # Configure the training parameters, start training, and evaluate on the validation set
3.7 Predict and Generate the Submission File
# Predict on the test set
import numpy as np
# Convert the input data to a list
new = pd.DataFrame(columns=['text'])
new['text'] = test["content"]
# First convert the pandas data to a NumPy array
data_array = np.array(new)
# Then convert the array to a list
data_list = data_array.tolist()
# Load the trained model
model = hub.Module(
name="ernie_v2_eng_large",
task='seq-cls',
load_checkpoint='./ckpt/best_model/model.pdparams',
num_classes=len(train.label.unique()),
label_map=label_map)
# Run prediction on the test data
predictions = model.predict(data_list, max_seq_len=198, batch_size=4, use_gpu=True)
[2021-09-19 23:00:14,095] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-2.0-large-en/ernie_v2_eng_large.pdparams
[2021-09-19 23:00:22,627] [ INFO] - Loaded parameters from /home/aistudio/data/data106998/ckpt/best_model/model.pdparams
[2021-09-19 23:00:22,637] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-2.0-large-en/vocab.txt
# Generate the submission file
sub = pd.read_csv('./submission.csv')
sub['label'] = predictions
sub.to_csv('submission.csv', index=False)
# Move the file to the home directory for persistence
!cp /home/aistudio/data/data106998/submission.csv /home/aistudio
4. Directions for Improvement
1. Clean the noisy samples in the data, and use effective data augmentation to compensate for the limited data size.
2. Tune hyperparameters and add 5-fold cross-validation.
3. Try different pretrained models, tune each, and ensemble the best results across models.
4. Beyond fine-tuning pretrained models, try classical neural models such as a BiLSTM, or tree models such as CatBoost; they can bring decent gains.
5. Learn from top solutions shared in similar competitions, or reproduce recent papers.
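For the 5-fold cross-validation mentioned in point 2, here is a minimal sketch of producing stratified folds with scikit-learn (the labels are invented; in practice each fold would fine-tune its own model, and the test-set predictions would then be averaged or voted):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented labels standing in for train['label']: 30 samples, 3 classes of 10
labels = np.array([i % 3 for i in range(30)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1024)
for fold, (train_idx, valid_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    # 30 samples in 5 folds -> 24 train / 6 validation indices per fold
    print(fold, len(train_idx), len(valid_idx))
```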
5. About the Author
Nickname: 炼丹师233
Current focus: development, mainly NLP and data-mining competitions and projects
https://aistudio.baidu.com/aistudio/personalcenter/thirdview/330406 Follow me for more projects to come!