Reposted from AI Studio
Project link: https://aistudio.baidu.com/aistudio/projectdetail/3162632

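Competition Background

As artificial intelligence technology advances, a large number of papers are published every week. Classifying them has become a practical problem that researchers and research institutions face daily. Participants are asked to build a paper classification model.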

https://challenge.xfyun.cn/topic/info?type=academic-paper-classification

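Competition Task

Participants are asked to use each paper's information (paper id, title, and abstract) to assign it to a specific category.

Sample record (fields separated by \t):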

paperid:9821

title:Calculation of prompt diphoton production cross sections at Tevatron and LHC energies

abstract:A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.

categories:hep-ph

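Data Description

The training and test sets are provided as csv files:

  • The training set contains 50,000 papers. Each paper has four fields: paper id, title, abstract, and category.

  • The test set contains 10,000 papers. Each paper has the paper id, title, and abstract, but no category field.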
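
Although the files carry a csv extension, the sample record indicates that fields are tab-separated. The following is a minimal sketch of reading such a file with the standard library. It assumes a header row with the column names `paperid`, `title`, `abstract`, and `categories`; the title and abstract in the inline sample are abbreviated for illustration.

```python
import csv
import io

# Inline stand-in for train.csv; the title and abstract are abbreviated.
sample = (
    "paperid\ttitle\tabstract\tcategories\n"
    "9821\tCalculation of prompt diphoton production cross sections\t"
    "A fully differential calculation in perturbative QCD is presented.\thep-ph\n"
)

# delimiter="\t" is the important part: the default comma would not split the fields.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

print(rows[0]["paperid"], rows[0]["categories"])
```

The same delimiter choice is why `sep='\t'` is needed when the files are loaded with pandas.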

Evaluation Metric

Submissions are scored by accuracy; the maximum score is 1.

For the computation, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html. Reference evaluation code:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy_score(y_true, y_pred))  # 2 of 4 predictions match, so this prints 0.5
!pip install paddle-ernie > log.log
import numpy as np
import paddle as P

# Import the ERNIE model
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')    # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())     
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P

from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
train_df = pd.read_csv('train.csv', sep='\t')
train_df['title'] = train_df['title'] + ' ' + train_df['abstract']

train_df = train_df.sample(frac=1.0)
train_df.head()
train_df.shape
train_df['categories'].nunique()
train_df['categories'], lbl_list = pd.factorize(train_df['categories'])
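
The `pd.factorize` call replaces each category string with an integer id and returns the distinct categories, which are needed later to turn predictions back into names. A small sketch on toy data (the category values here are illustrative only) shows the round trip, and also suggests that the number of classes could be read from the data rather than typed in by hand.

```python
import pandas as pd

# Toy category column; the values are illustrative only.
categories = pd.Series(['hep-ph', 'cs.CL', 'hep-ph', 'math.CO'])

# codes: one integer id per row; uniques: category names in order of first appearance
codes, uniques = pd.factorize(categories)

num_labels = len(uniques)   # could be used to size the classifier head
decoded = uniques[codes]    # map integer ids back to category names

print(codes.tolist(), list(uniques), num_labels)
```

Because the ids depend on the order in which categories first appear, the same `uniques` object must be kept and reused at prediction time.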
# Model hyperparameters
BATCH=32
MAX_SEQLEN=300
LR=5e-5
EPOCH=10

# Define the ERNIE classification model; the number of classes is taken from the factorized labels
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-2.0-en', num_labels=len(lbl_list))
optimizer = P.optimizer.Adam(LR,parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-2.0-en')
train_df.iterrows()
# Convert the dataset; the main step is encoding the text into token ids
def make_data(df):
    data = []
    for _, row in df.iterrows():
        text, label = row.title, row.categories
        text_id, _ = tokenizer.encode(text)  # ErnieTokenizer adds the special tokens ERNIE needs, such as [CLS] and [SEP]
        text_id = text_id[:MAX_SEQLEN]  # Truncate long inputs (note: this can cut off the trailing [SEP])
        text_id = np.pad(text_id, [0, MAX_SEQLEN-len(text_id)], mode='constant')  # Right-pad short inputs with 0
        data.append((text_id, label))
    return data

train_data = make_data(train_df.iloc[:-5000])
val_data = make_data(train_df.iloc[-5000:])
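
Plain slicing to `MAX_SEQLEN` drops the closing `[SEP]` token whenever a text is longer than the limit. The sketch below shows one possible alternative that keeps the final separator. The helper name `truncate_and_pad` and the token ids in the usage lines are hypothetical placeholders, not values from the ERNIE vocabulary; in practice the separator and padding ids would come from the tokenizer.

```python
def truncate_and_pad(token_ids, max_len, sep_id, pad_id=0):
    """Truncate to max_len while keeping a trailing separator, then right-pad.

    Returns the fixed-length id list and the count of non-padding tokens.
    """
    ids = list(token_ids)
    if len(ids) > max_len:
        # Keep the first max_len - 1 tokens and re-append the separator.
        ids = ids[:max_len - 1] + [sep_id]
    real_len = len(ids)
    ids = ids + [pad_id] * (max_len - real_len)
    return ids, real_len


# Placeholder ids: 1 stands in for [CLS], 2 for [SEP], 0 for padding.
long_ids, long_len = truncate_and_pad([1, 5, 6, 7, 8, 2], max_len=4, sep_id=2)
short_ids, short_len = truncate_and_pad([1, 5, 2], max_len=5, sep_id=2)
print(long_ids, long_len)
print(short_ids, short_len)
```

The returned length could also be used to build an attention mask, so that padding positions are ignored by the model.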
# Fetch the i-th batch of data
def get_batch_data(data, i):
    d = data[i*BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # Stack the BATCH samples into a single numpy.array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # Convert the numpy.array to a paddle tensor with to_tensor
    label = P.to_tensor(label)
    return feature, label
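
Index-based batching needs care at the tail of the data: floor division skips the last partial batch, while adding one to the floor yields an empty batch whenever the data length is an exact multiple of the batch size. The following is a minimal sketch of a slicing-based alternative; `iter_batches` and `num_batches` are hypothetical helper names, not part of the original project.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of data, including a final partial batch."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def num_batches(n_items, batch_size):
    """Ceiling division: number of batches needed to cover n_items."""
    return (n_items + batch_size - 1) // batch_size


batches = list(iter_batches(list(range(7)), 3))
print(batches)
print(num_batches(7, 3), num_batches(6, 3))
```

With 7 items and a batch size of 3, the last batch holds the single leftover item; with 6 items, exactly two full batches are produced and no empty one.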
EPOCH=1  # Overrides the earlier EPOCH=10 so this run trains for a single epoch

# Model training
for i in range(EPOCH):
    np.random.shuffle(train_data)  # Reshuffle every epoch for better training
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy()))

        # Model validation (uses k so the training batch index j is not shadowed)
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                for k in range(len(val_data) // BATCH):
                    feature, label = get_batch_data(val_data, k)
                    _, logits = ernie(feature, labels=label)

                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)
test_df = pd.read_csv('test.csv', sep='\t')
test_df['title'] = test_df['title'] + ' ' + test_df['abstract']
test_df['categories'] = 0

test_data = make_data(test_df.iloc[:])
all_pred = []

# Model prediction
with P.no_grad():
    ernie.eval()
    # Ceiling division covers the final partial batch without producing an empty one
    for j in range((len(test_data) + BATCH - 1) // BATCH):
        feature, label = get_batch_data(test_data, j)
        _, logits = ernie(feature, labels=label)  # Labels are dummy zeros; the loss is ignored
        all_pred.extend(logits.argmax(-1).numpy())

# Map predicted label ids back to category names and write the submission file
pd.DataFrame({
    'paperid': test_df['paperid'],
    'categories': lbl_list[all_pred]
}).to_csv('submit.csv', index=False)

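Summary and Outlook

  1. The project uses ERNIE for paper classification, truncating each text to a fixed length before training.
  2. Using pretrain + finetune during training may yield better accuracy.
  3. Multi-sentence attention could be considered for modeling.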
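
As a rough illustration of the multi-sentence attention idea, the sketch below pools several sentence (or chunk) vectors into a single document vector using softmax attention weights. This is only one possible reading of that suggestion, not the project's method: the function name `attention_pool` is hypothetical, the vectors are toy values, and in a real model the query vector would be a learned parameter and the sentence vectors would come from the encoder.

```python
import numpy as np


def attention_pool(sentence_vecs, query):
    """Combine sentence vectors into one vector with softmax attention.

    sentence_vecs: array of shape (n_sentences, dim)
    query:         array of shape (dim,), assumed to be learned in a real model
    """
    scores = sentence_vecs @ query            # one relevance score per sentence
    scores = scores - scores.max()            # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ sentence_vecs          # weighted average of the sentence vectors
    return pooled, weights


# Toy inputs: two 2-dimensional sentence vectors and a zero query.
vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
pooled, weights = attention_pool(vecs, np.zeros(2))
print(weights, pooled)
```

With a zero query every sentence receives the same weight, so the result is a plain average; a trained query would shift weight toward the sentences most useful for classification. This would let the model see text beyond the truncation limit by encoding it in several pieces.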
