I. Academic Paper Classification Challenge

Competition link: http://challenge.xfyun.cn/h5/invite?invitaCode=8zCBfV

1. Background

With the continuing development of artificial intelligence, a very large number of papers are released every week. Classifying them has become a real, everyday problem for researchers and research institutions. The goal of this competition is to build a paper classification model.

2. Task

Participants are asked to use the paper information provided (paper id, title, and abstract) to assign each paper to its specific category.

Sample record (fields separated by \t):

paperid:9821

title:Calculation of prompt diphoton production cross sections at Tevatron and LHC energies

abstract:A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.

categories:hep-ph

3. Submission file format:

  1. Submit a CSV file encoded in UTF-8, with a header row on the first line;

  2. Before submitting, make sure the predictions match the format of sample_submit.csv, shown below (a minimal pandas sketch for writing such a file follows the sample):

paperid,categories

test_00000,cs.CV

test_00001,cs.DC

test_00002,cs.AI

test_00003,cs.NI

test_00004,cs.SE
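For reference, a minimal way to produce a file in this format with pandas (the predictions DataFrame, its values, and the output path are illustrative assumptions, not part of the competition kit):

import pandas as pd

# Hypothetical predictions: one category string per test paper, in test-set order
predictions = pd.DataFrame({
    "paperid": ["test_00000", "test_00001"],
    "categories": ["cs.CV", "cs.DC"],
})
# UTF-8 with a header row, matching sample_submit.csv
predictions.to_csv("submit.csv", index=False, encoding="utf-8")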

II. Data Processing

1. Upgrade paddlenlp

!pip install -U paddlenlp
Found existing installation: paddlenlp 2.0.1
    Uninstalling paddlenlp-2.0.1:
      Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.5
import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import  convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F
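After the upgrade, it is worth confirming that the imported package is really the new version (in a running notebook the kernel usually has to be restarted before an upgraded package is picked up); a quick check:

# Confirm which versions are actually in use after the upgrade above
print(ppnlp.__version__)    # expected: 2.0.5
print(paddle.__version__)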

2. Unzip the dataset

# Unzip the dataset archives
# !unzip -oq /home/aistudio/data/data100202/Datawhale_学术论文分类_数据集.zip -d dataset
# !rm dataset/__MACOSX/ -rf
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/test.csv.zip -d dataset/
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/train.csv.zip -d dataset/

3. Inspect the data

# Submission format
!head dataset/Datawhale_学术论文分类_数据集/sample_submit.csv
paperid,categories
test_00000,cs.CV
test_00001,cs.CV
test_00002,cs.CV
test_00003,cs.CV
test_00004,cs.CV
test_00005,cs.CV
test_00006,cs.CV
test_00007,cs.CV
test_00008,cs.CV
# Train data format
!head -n20 dataset/train.csv
paperid	title	abstract	categories
train_00000	"Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in
  Neural Machine Translation"	"  Neural machine translation (NMT) typically adopts the encoder-decoder
framework. A good understanding of the characteristics and functionalities of
the encoder and decoder can help to explain the pros and cons of the framework,
and design better models for NMT. In this work, we conduct an empirical study
on the encoder and the decoder in NMT, taking Transformer as an example. We
find that 1) the decoder handles an easier task than the encoder in NMT, 2) the
decoder is more sensitive to the input noise than the encoder, and 3) the
preceding words/tokens in the decoder provide strong conditional information,
which accounts for the two observations above. We hope those observations can
shed light on the characteristics of the encoder and decoder and inspire future
research on NMT.
"	cs.CL
train_00001	An Easy-to-use Real-world Multi-objective Optimization Problem Suite	"  Although synthetic test problems are widely used for the performance
assessment of evolutionary multi-objective optimization algorithms, they are
likely to include unrealistic properties which may lead to
overestimation/underestimation. To address this issue, we present a
multi-objective optimization problem suite consisting of 16 bound-constrained
real-world problems. The problem suite includes various problems in terms of
# Test data format
!head dataset/test.csv
paperid	title	abstract
test_00000	"Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in
  APIs"	"  This paper addresses the following question: does a small, essential, core
set of API members emerges from the actual usage of the API by client
applications? To investigate this question, we study the 99 most popular
libraries available in Maven Central and the 865,560 client programs that
declare dependencies towards them, summing up to 2.3M dependencies. Our key
findings are as follows: 43.5% of the dependencies declared by the clients are
not used in the bytecode; all APIs contain a large part of rarely used types
and a few frequently used types, and the ratio varies according to the nature

4. Custom read() function

import pandas as pd

train = pd.read_csv('dataset/train.csv', sep='\t')
test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
# Concatenate title and abstract
train['text'] = train['title'] + ' ' + train['abstract']

label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}
train['label'] = train['categories'].map(label_cate2id)
train = train[['text', 'label', 'paperid']]
train_y = train["label"]
train_df = train[['text', 'label', 'paperid']][:40000]
eval_df = train[['text', 'label', 'paperid']][40000:]
print(label_id2cate)
{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
print(label_cate2id)
{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
train_df.describe  # note: without parentheses this shows the bound method (with a DataFrame preview) rather than calling describe()
<bound method NDFrame.describe of                                                     text  label      paperid
0      Hard but Robust, Easy but Sensitive: How Encod...      0  train_00000
1      An Easy-to-use Real-world Multi-objective Opti...      1  train_00001
2      Exploration of reproducibility issues in scien...      2  train_00002
3      Scheduled Sampling for Transformers   Schedule...      0  train_00003
4      Hybrid Forests for Left Ventricle Segmentation...      3  train_00004
...                                                  ...    ...          ...
39995  EyeDoc: Documentation Navigation with Eye Trac...     13  train_39995
39996  Design of an Ultra-Efficient Reversible Full A...     24  train_39996
39997  Hybrid FPMS: A New Fairness Protocol Managemen...     11  train_39997
39998  Conditional Rap Lyrics Generation with Denoisi...      0  train_39998
39999  Cross-Lingual Syntactic Transfer with Limited ...      0  train_39999

[40000 rows x 3 columns]>
from paddlenlp.datasets import load_dataset

# read train data
def read(pd_data):
    for index, item in pd_data.iterrows():       
        yield {'text': item['text'], 'label': item['label'], 'qid': item['paperid'].strip('train_')}
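Note that str.strip removes any of the listed characters from both ends rather than a literal prefix; strip('train_') works here only because the numeric part of paperid contains none of the characters t, r, a, i, n, _. A slightly safer variant is sketched below (read_safe is a hypothetical helper, not used elsewhere in this notebook):

# Same generator as read(), but remove the prefix by slicing instead of str.strip
def read_safe(pd_data, prefix='train_'):
    for _, item in pd_data.iterrows():
        pid = item['paperid']
        qid = pid[len(prefix):] if pid.startswith(prefix) else pid
        yield {'text': item['text'], 'label': item['label'], 'qid': qid}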

5. Load the datasets

# pd_data here is forwarded to the read() function defined above
train_ds = load_dataset(read, pd_data=train_df, lazy=False)
dev_ds = load_dataset(read, pd_data=eval_df, lazy=False)
for i in range(5):
    print(train_ds[i])
{'text': 'Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in\n  Neural Machine Translation   Neural machine translation (NMT) typically adopts the encoder-decoder\nframework. A good understanding of the characteristics and functionalities of\nthe encoder and decoder can help to explain the pros and cons of the framework,\nand design better models for NMT. In this work, we conduct an empirical study\non the encoder and the decoder in NMT, taking Transformer as an example. We\nfind that 1) the decoder handles an easier task than the encoder in NMT, 2) the\ndecoder is more sensitive to the input noise than the encoder, and 3) the\npreceding words/tokens in the decoder provide strong conditional information,\nwhich accounts for the two observations above. We hope those observations can\nshed light on the characteristics of the encoder and decoder and inspire future\nresearch on NMT.\n', 'label': 0, 'qid': '00000'}
{'text': 'An Easy-to-use Real-world Multi-objective Optimization Problem Suite   Although synthetic test problems are widely used for the performance\nassessment of evolutionary multi-objective optimization algorithms, they are\nlikely to include unrealistic properties which may lead to\noverestimation/underestimation. To address this issue, we present a\nmulti-objective optimization problem suite consisting of 16 bound-constrained\nreal-world problems. The problem suite includes various problems in terms of\nthe number of objectives, the shape of the Pareto front, and the type of design\nvariables. 4 out of the 16 problems are multi-objective mixed-integer\noptimization problems. We provide Java, C, and Matlab source codes of the 16\nproblems so that they are available in an off-the-shelf manner. We examine an\napproximated Pareto front of each test problem. We also analyze the performance\nof six representative evolutionary multi-objective optimization algorithms on\nthe 16 problems. In addition to the 16 problems, we present 8 constrained\nmulti-objective real-world problems.\n', 'label': 1, 'qid': '00001'}
{'text': 'Exploration of reproducibility issues in scientometric research Part 1:\n  Direct reproducibility   This is the first part of a small-scale explorative study in an effort to\nstart assessing reproducibility issues specific to scientometrics research.\nThis effort is motivated by the desire to generate empirical data to inform\ndebates about reproducibility in scientometrics. Rather than attempt to\nreproduce studies, we explore how we might assess "in principle"\nreproducibility based on a critical review of the content of published papers.\nThe first part of the study focuses on direct reproducibility - that is the\nability to reproduce the specific evidence produced by an original study using\nthe same data, methods, and procedures. The second part (Velden et al. 2018) is\ndedicated to conceptual reproducibility - that is the robustness of knowledge\nclaims towards verification by an alternative approach using different data,\nmethods and procedures. The study is exploratory: it investigates only a very\nlimited number of publications and serves us to develop instruments for\nidentifying potential reproducibility issues of published studies: These are a\ncategorization of study types and a taxonomy of threats to reproducibility. We\nwork with a select sample of five publications in scientometrics covering a\nvariation of study types of theoretical, methodological, and empirical nature.\nBased on observations made during our exploratory review, we conclude this\npaper with open questions on how to approach and assess the status of direct\nreproducibility in scientometrics, intended for discussion at the special track\non "Reproducibility in Scientometrics" at STI2018 in Leiden.\n', 'label': 2, 'qid': '00002'}
{'text': 'Scheduled Sampling for Transformers   Scheduled sampling is a technique for avoiding one of the known problems in\nsequence-to-sequence generation: exposure bias. It consists of feeding the\nmodel a mix of the teacher forced embeddings and the model predictions from the\nprevious step in training time. The technique has been used for improving the\nmodel performance with recurrent neural networks (RNN). In the Transformer\nmodel, unlike the RNN, the generation of a new word attends to the full\nsentence generated so far, not only to the last word, and it is not\nstraightforward to apply the scheduled sampling technique. We propose some\nstructural changes to allow scheduled sampling to be applied to Transformer\narchitecture, via a two-pass decoding strategy. Experiments on two language\npairs achieve performance close to a teacher-forcing baseline and show that\nthis technique is promising for further exploration.\n', 'label': 0, 'qid': '00003'}
{'text': "Hybrid Forests for Left Ventricle Segmentation using only the first\n  slice label   Machine learning models produce state-of-the-art results in many MRI images\nsegmentation. However, most of these models are trained on very large datasets\nwhich come from experts manual labeling. This labeling process is very time\nconsuming and costs experts work. Therefore finding a way to reduce this cost\nis on high demand. In this paper, we propose a segmentation method which\nexploits MRI images sequential structure to nearly drop out this labeling task.\nOnly the first slice needs to be manually labeled to train the model which then\ninfers the next slice's segmentation. Inference result is another datum used to\ntrain the model again. The updated model then infers the third slice and the\nsame process is carried out until the last slice. The proposed model is an\ncombination of two Random Forest algorithms: the classical one and a recent one\nnamely Mondrian Forests. We applied our method on human left ventricle\nsegmentation and results are very promising. This method can also be used to\ngenerate labels.\n", 'label': 3, 'qid': '00004'}

III. Using a Pre-trained Model

1. Choosing a pre-trained model

SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

SKEP was proposed to improve sentiment analysis. As the title "Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis" already indicates, it improves a pre-trained language model by injecting sentiment prior knowledge and then performs pre-training with that knowledge.

The paper argues that in general-purpose pre-trained language models, sentiment-bearing words are partly neglected during pre-training; since the target task is sentiment analysis, the textual features of sentiment words need to be learned more carefully. The pre-training algorithm itself is described in the SKEP paper and is not reproduced here.

The same recipe carries over to other classification tasks: fine-tuning on top of the SKEP pre-trained model, with the paper title and abstract as input, also gives good results for paper classification.

import paddlenlp as ppnlp

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# Load the model by specifying its name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=len(train.label.unique()))
# Similarly, load the matching tokenizer by name; it handles text processing such as tokenization and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-25 23:03:23,255] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 23:03:34,982] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt
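As a quick illustration of what the tokenizer produces (a sketch; the exact ids depend on the skep_ernie_2.0_large_en vocabulary):

# Encode a short text; the result is a dict with input_ids and token_type_ids,
# which is exactly what convert_example() below relies on
encoded = tokenizer(text="Scheduled Sampling for Transformers", max_seq_len=16)
print(list(encoded.keys()))   # ['input_ids', 'token_type_ids']
print(encoded["input_ids"])   # token ids, including the special [CLS]/[SEP] tokens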

2. Data loading

Data is loaded asynchronously with multiple workers through the paddle.io.DataLoader interface.
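create_dataloader (together with convert_example) comes from the project's utils.py, which is not shown in this notebook. A rough sketch of what such a helper typically does, assuming it simply wraps paddle.io.DataLoader with a batch sampler and a collate function (create_dataloader_sketch is an illustrative name, not the real utils.py code):

import paddle

def create_dataloader_sketch(dataset, mode='train', batch_size=1,
                             batchify_fn=None, trans_fn=None):
    # Apply the per-example transform (tokenization) to the MapDataset
    if trans_fn:
        dataset = dataset.map(trans_fn)
    # Shuffle only during training
    shuffle = (mode == 'train')
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    # collate_fn pads/stacks the per-example tuples into batch arrays
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)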

import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
   
    # Convert the raw example into model-readable inputs; encoded_inputs is a dict with fields such as input_ids and token_type_ids
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the token ids in the vocabulary after tokenizing the text
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether each token belongs to sentence 1 or sentence 2, i.e. the segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: the paper's category id
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: the numeric id of the example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
# Batch size
batch_size = 10
# Maximum text sequence length
max_seq_length = 450

# Transform examples into the format the model expects
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Collate examples into batches:
# pad text sequences of different lengths to the longest length in the batch,
# and stack the per-example labels together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
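To see what these two steps produce, one can run them by hand on a couple of training examples (a sketch; the printed shapes depend on the actual text lengths):

# Tokenize two examples, then collate them into padded numpy batches
samples = [trans_func(train_ds[0]), trans_func(train_ds[1])]
input_ids, token_type_ids, labels = batchify_fn(samples)
print(input_ids.shape)        # (2, longest sequence in this mini-batch)
print(token_type_ids.shape)   # same shape as input_ids
print(labels.shape)           # (2, 1): one label per example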

3. Fine-tuning strategy and evaluation metric
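The optimizer, loss function, and accuracy metric are configured in section IV.1 below with a fixed learning rate of 2e-5. For reference, a warmup-plus-linear-decay schedule (which the otherwise unused LinearDecayWithWarmup import in IV.1 hints at) would look roughly like the sketch below; it is illustrative and is not the configuration the training log in this notebook was produced with:

from paddlenlp.transformers import LinearDecayWithWarmup
import paddle

# Sketch only: warm up over the first 10% of steps, then decay linearly to 0.
# model and num_training_steps are defined in sections III.1 and IV.1.
lr_scheduler = LinearDecayWithWarmup(2e-5, num_training_steps, 0.1)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=0.01)
# With a scheduler, call lr_scheduler.step() right after optimizer.step() in the training loop.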

IV. Model Training and Evaluation

Model training usually proceeds as follows:

  1. Take a batch of data from the dataloader.
  2. Feed the batch to the model for a forward pass.
  3. Pass the forward outputs to the loss function to compute the loss, and to the metric to compute the evaluation score.
  4. Back-propagate the loss and update the parameters. Repeat the steps above.

During training the model is evaluated periodically (every 400 steps in the loop below) to track how well it is doing.

1. Training configuration

from paddlenlp.transformers import LinearDecayWithWarmup
import paddle

# Number of training epochs
epochs = 3

# len(train_data_loader) is the number of steps per epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer (LinearDecayWithWarmup is imported above but a fixed learning rate is used here)
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()

2. Add VisualDL logging

# Set up VisualDL logging (view the logs with: visualdl --logdir ./log)
from visualdl import LogWriter

writer = LogWriter("./log")

3. The evaluate() function

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):

    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:      
        input_ids, token_type_ids,  labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    # Log eval metrics to VisualDL (global_step is the counter defined in the training loop below)
    writer.add_scalar(tag="eval/loss", step=global_step, value=np.mean(losses))
    writer.add_scalar(tag="eval/acc", step=global_step, value=accu)  
    model.train()
    metric.reset()
    return accu

4. Start training


save_dir = "checkpoint"
if not  os.path.exists(save_dir):
    os.makedirs(save_dir)
global_step = 0
pre_accu=0
accu=0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids)  # note: segment_ids are unpacked above but not passed to the model here, unlike in evaluate()
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0 :
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        # Evaluate on the dev set every 400 steps
        if global_step % 400 == 0:
            accu=evaluate(model, criterion, metric, dev_data_loader)
            # Log training metrics to VisualDL (only every 400 steps, since this sits inside the same if-block)
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)       
        if accu>pre_accu:
            # Save the model parameters whenever dev accuracy improves
            save_param_path = os.path.join(save_dir, 'model_state.pdparams')
            paddle.save(model.state_dict(), save_param_path)
            pre_accu=accu
tokenizer.save_pretrained(save_dir)

5. Training log

eval loss: 1.22675, accu: 0.77660
global step 6810, epoch: 10, batch: 474, loss: 0.01697, acc: 0.99219
global step 6820, epoch: 10, batch: 484, loss: 0.04531, acc: 0.98984
global step 6830, epoch: 10, batch: 494, loss: 0.03325, acc: 0.98854
global step 6840, epoch: 10, batch: 504, loss: 0.04574, acc: 0.98672
global step 6850, epoch: 10, batch: 514, loss: 0.02137, acc: 0.98625
global step 6860, epoch: 10, batch: 524, loss: 0.19356, acc: 0.98516
global step 6870, epoch: 10, batch: 534, loss: 0.03456, acc: 0.98482
global step 6880, epoch: 10, batch: 544, loss: 0.09647, acc: 0.98438
global step 6890, epoch: 10, batch: 554, loss: 0.11611, acc: 0.98351
global step 6900, epoch: 10, batch: 564, loss: 0.05723, acc: 0.98344
global step 6910, epoch: 10, batch: 574, loss: 0.00518, acc: 0.98310
global step 6920, epoch: 10, batch: 584, loss: 0.01201, acc: 0.98281
global step 6930, epoch: 10, batch: 594, loss: 0.07870, acc: 0.98221
global step 6940, epoch: 10, batch: 604, loss: 0.01748, acc: 0.98237
global step 6950, epoch: 10, batch: 614, loss: 0.01542, acc: 0.98208
global step 6960, epoch: 10, batch: 624, loss: 0.01469, acc: 0.98184
global step 6970, epoch: 10, batch: 634, loss: 0.07767, acc: 0.98189
global step 6980, epoch: 10, batch: 644, loss: 0.01516, acc: 0.98186
global step 6990, epoch: 10, batch: 654, loss: 0.02567, acc: 0.98125
global step 7000, epoch: 10, batch: 664, loss: 0.09072, acc: 0.98102
global step 7010, epoch: 10, batch: 674, loss: 0.07557, acc: 0.98080
global step 7020, epoch: 10, batch: 684, loss: 0.13695, acc: 0.98047
global step 7030, epoch: 10, batch: 694, loss: 0.09411, acc: 0.98016
global step 7040, epoch: 10, batch: 704, loss: 0.10656, acc: 0.98007


V. Prediction

1. Test data processing

import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import  convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F
test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')


train = pd.read_csv('dataset/train.csv', sep='\t')
label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}
# Concatenate title and abstract
test['text'] = test['title'] + ' ' + test['abstract']
print(label_id2cate)
print(label_cate2id)
print(len(label_cate2id))
{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
39
# read test data
def read_test(pd_data):
    for index, item in pd_data.iterrows():       
        yield {'text': item['text'], 'label': 0, 'qid': item['paperid'].strip('test_')}
test_ds =  load_dataset(read_test, pd_data=test,lazy=False)
for i in range(5):
    print(test_ds[i])
print(len(test_ds))
import paddlenlp as ppnlp

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# Load the model by specifying its name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=39)
# Similarly, load the matching tokenizer by name; it handles text processing such as tokenization and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-25 23:03:58,725] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 23:04:04,013] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt
max_seq_length = 300  # note: 450 was used during training

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # qid (with is_test=True the third field is the example id, not a label)
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=10,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

2. Load the trained weights

# Change the parameter path to match your own run
import os
import paddle

params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
Loaded parameters from checkpoint/model_state.pdparams

3. Prediction

import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader

results = []
# Switch the model to eval mode, disabling dropout and other stochastic behaviour
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Forward pass
    logits = model(input_ids, token_type_ids)
    # Predicted class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_id2cate[i] for i in idx]
    qids = qids.numpy().tolist()  # collected but not used below; the submission relies on row order (see the qid-keyed sketch after this cell)
    results.extend( labels)
print(results[:5])
print(len(results))
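The submission step in the next cell assigns results to sample_submit.csv in row order, which works because the test DataLoader does not shuffle in 'test' mode. A variant that keys predictions by qid is more robust to any reordering; the sketch below reuses the objects defined above and is not the code the submitted result was produced with:

# Re-run prediction, but keep a qid -> category mapping instead of a plain list
id_to_label = {}
model.eval()
with paddle.no_grad():
    for batch in test_data_loader:
        input_ids, token_type_ids, qids = batch
        logits = model(input_ids, token_type_ids)
        preds = paddle.argmax(logits, axis=1).numpy().tolist()
        for qid, pred in zip(qids.numpy().reshape(-1).tolist(), preds):
            id_to_label[qid] = label_id2cate[pred]

# Map each paperid ('test_00000' -> 0) back to its predicted category
sub['categories'] = sub['paperid'].apply(lambda x: id_to_label[int(x[len('test_'):])])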

4. Save and submit

sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
sub['categories'] = results
sub.to_csv('submit.csv', index=False)
!zip -qr result.zip submit.csv

VI. Submission Result

The submission ranked 14th.
