iFLYTEK Academic Paper Classification Challenge: ERNIE, Accuracy 0.79

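As artificial intelligence advances, a large number of papers are published every week. Classifying them has become a practical problem that researchers and research institutions face every day. Participants are asked to build a paper classification model.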

Published by AI Studio · 2022-03-07 08:03:01

Reposted from AI Studio
Project link: https://aistudio.baidu.com/aistudio/projectdetail/3162632

Competition Background

https://challenge.xfyun.cn/topic/info?type=academic-paper-classification

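Competition Task

Participants are asked to predict each paper's category from the available paper information: paper id, title, and abstract.

Sample record (fields are separated by \t in the data files):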

paperid: 9821

title: Calculation of prompt diphoton production cross sections at Tevatron and LHC energies

abstract: A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.

categories: hep-ph
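
A tab-separated record like the sample above can be parsed with Python's standard csv module. The sketch below is only an illustration: the header row, the shortened abstract, and the helper name `read_records` are assumptions made for the example, not part of the competition files.

```python
import csv
import io

# Hypothetical two-line file: a header row plus one record, tab-separated.
# The abstract is shortened here for readability.
sample = (
    "paperid\ttitle\tabstract\tcategories\n"
    "9821\tCalculation of prompt diphoton production cross sections at "
    "Tevatron and LHC energies\tA fully differential calculation in "
    "perturbative quantum chromodynamics is presented.\thep-ph\n"
)


def read_records(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


records = read_records(sample)
print(records[0]["paperid"], records[0]["categories"])
```

Every field is read as a string, so the label column can be passed directly to a label encoder later.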

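Data Description

The training and test sets are provided as csv files:

Training set: 50,000 papers. Each paper has four fields: paper id, title, abstract, and category.
Test set: 10,000 papers. Each paper has the paper id, title, and abstract, but no category field.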

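Evaluation Metric

Submissions are scored by accuracy; the maximum score is 1.

For the calculation, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html. Reference evaluation code:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy_score(y_true, y_pred))  # 0.5: two of the four predictions match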
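
For reference, accuracy is simply the fraction of predictions that equal the true label. A minimal sketch without scikit-learn follows; the function name `accuracy` is chosen for this example only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy(y_true, y_pred))  # positions 0 and 3 match, so 2 of 4
```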

!pip install paddle-ernie > log.log

import numpy as np
import paddle as P

# Import the ERNIE model
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')    # Try to get pretrained model from server, make sure you have network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())

import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P

from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification

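# Load the training set (tab-separated)
train_df = pd.read_csv('train.csv', sep='\t')
# Concatenate title and abstract into one input text, stored in the 'title' column
train_df['title'] = train_df['title'] + ' ' + train_df['abstract']

# Shuffle the rows
train_df = train_df.sample(frac=1.0)
train_df.head()

train_df.shape

train_df['categories'].nunique()

# Encode category names as integer ids; lbl_list maps ids back to names
train_df['categories'], lbl_list = pd.factorize(train_df['categories'])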
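
To make the label encoding step concrete, here is a plain-Python sketch of what `pd.factorize` does with its default settings: each distinct label gets an integer code in order of first appearance, and the list of unique labels allows decoding later. The category names other than hep-ph are illustrative values, not taken from the dataset.

```python
def factorize(labels):
    """Return (codes, uniques): integer codes in order of first appearance."""
    index_of = {}
    uniques = []
    codes = []
    for label in labels:
        if label not in index_of:
            index_of[label] = len(uniques)
            uniques.append(label)
        codes.append(index_of[label])
    return codes, uniques


codes, uniques = factorize(["hep-ph", "cs.CV", "hep-ph", "math.AP"])
print(codes)    # [0, 1, 0, 2]
print(uniques)  # ['hep-ph', 'cs.CV', 'math.AP']

# Decoding: index the unique labels with the codes, as done for the submission.
decoded = [uniques[c] for c in codes]
```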

# Model hyperparameters
BATCH=32
MAX_SEQLEN=300
LR=5e-5
EPOCH=10

# Define the ERNIE classification model
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-2.0-en', num_labels=39)
optimizer = P.optimizer.Adam(LR,parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-2.0-en')

train_df.iterrows()

# Convert the dataset; the main step is encoding the text as token ids
def make_data(df):
    data = []
    for _, row in df.iterrows():
        text, label = row['title'], row['categories']
        text_id, _ = tokenizer.encode(text)  # ErnieTokenizer automatically adds the special tokens ERNIE needs, such as [CLS] and [SEP]
        text_id = text_id[:MAX_SEQLEN]  # Truncate to at most MAX_SEQLEN tokens
        text_id = np.pad(text_id, [0, MAX_SEQLEN-len(text_id)], mode='constant')  # Right-pad with zeros
        data.append((text_id, label))
    return data
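
The truncate-then-pad step in make_data can be sketched in plain Python as follows. The token ids are made-up values, `pad_or_truncate` is a name chosen for this example, and the padding id 0 is an assumption that mirrors the zero fill used by np.pad with mode='constant'.

```python
MAX_SEQLEN = 8  # small value for illustration; the project uses 300
PAD_ID = 0      # assumed padding id, matching the zero fill of np.pad


def pad_or_truncate(token_ids, max_len=MAX_SEQLEN, pad_id=PAD_ID):
    """Clip a token id list to max_len, then right-pad it with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short = pad_or_truncate([1, 17, 42, 2])        # shorter than max_len: padded
long = pad_or_truncate(list(range(1, 13)))     # longer than max_len: clipped
print(short)
print(long)
```

Note that plain truncation also cuts off whatever token ends the sequence, so a long input loses its trailing special token.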

train_data = make_data(train_df.iloc[:-5000])
val_data = make_data(train_df.iloc[-5000:])

# Fetch the i-th batch
def get_batch_data(data, i):
    d = data[i*BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # Stack the BATCH samples into one numpy.array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # Convert the numpy.array to a paddle tensor with to_tensor
    label = P.to_tensor(label)
    return feature, label
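
Because get_batch_data slices by batch index, the number of batches matters: `len(data) // BATCH` skips a trailing partial batch, while `len(data) // BATCH + 1` produces an empty final slice whenever the length is an exact multiple of BATCH. A hedged sketch of batching that avoids both cases is shown below; `iter_batches` and `num_batches` are hypothetical helper names, not part of the project.

```python
def num_batches(n_items, batch_size):
    """Ceiling division: number of batches needed to cover n_items."""
    return (n_items + batch_size - 1) // batch_size


def iter_batches(data, batch_size):
    """Yield consecutive slices of data; the last one may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


batches = list(iter_batches(list(range(10)), 4))
print(batches)
print(num_batches(10000, 32), num_batches(64, 32))
```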

EPOCH=1  # Overrides the EPOCH=10 set above; only one epoch is run here

# Model training
for i in range(EPOCH):
    np.random.shuffle(train_data)  # Reshuffle the data every epoch for better training
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy().item()))

        # Model validation, run every 100 steps
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                # Use k so the outer step counter j is not shadowed;
                # ceiling division also covers the last partial batch
                for k in range((len(val_data) + BATCH - 1) // BATCH):
                    feature, label = get_batch_data(val_data, k)
                    loss, logits = ernie(feature, labels=label)

                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)

test_df = pd.read_csv('test.csv', sep='\t')
test_df['title'] = test_df['title'] + ' ' + test_df['abstract']
test_df['categories'] = 0  # Dummy label so that make_data can be reused

test_data = make_data(test_df.iloc[:])

all_pred, all_label = [], []

# Model prediction
with P.no_grad():
    ernie.eval()
    # Ceiling division: covers the last partial batch without an empty one
    for j in range((len(test_data) + BATCH - 1) // BATCH):
        feature, label = get_batch_data(test_data, j)
        loss, logits = ernie(feature, labels=label)

        all_pred.extend(logits.argmax(-1).numpy())
        all_label.extend(label.numpy())

# Map the predicted ids back to category names and write the submission file
pd.DataFrame({
    'paperid': test_df['paperid'],
    'categories': lbl_list[all_pred]
}).to_csv('submit.csv', index=False)
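
The final step, turning predicted ids back into category names and writing a two-column file, can be sketched with the standard library alone. This is an illustration under assumptions: `build_submission` is a hypothetical helper, and the ids and label names below are made-up values rather than model output.

```python
import csv
import io


def build_submission(paper_ids, pred_codes, label_names):
    """Return CSV text with one 'paperid,categories' row per prediction."""
    if len(paper_ids) != len(pred_codes):
        raise ValueError("paper_ids and pred_codes must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["paperid", "categories"])
    for paper_id, code in zip(paper_ids, pred_codes):
        # Decode the integer prediction with the list of unique label names
        writer.writerow([paper_id, label_names[code]])
    return buffer.getvalue()


label_names = ["hep-ph", "cs.CV"]   # illustrative, as returned by a factorize step
submission = build_submission([9821, 9822], [0, 1], label_names)
print(submission)
```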

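Summary and Outlook

This project uses ERNIE for paper classification, truncating the text before training.
Using pretraining followed by fine-tuning during training could yield better accuracy.
Multi-sentence attention could also be considered for the modeling.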
