2022 CCF BDCI Few-Shot Data Classification Task Baseline (Comprehensive Edition)

Competition page: https://www.datafountain.cn/competitions/582

Competition partner: PatSnap (智慧芽信息科技(苏州)有限公司)

Background

Innovation is an inexhaustible driving force of national development. In recent years, with policy support and rising levels of industrialization and education, China's patent filings have grown rapidly, and so has the demand for patent search, novelty checking, and portfolio management. Meeting these needs and improving the quality of patent services usually requires patent classification systems along multiple dimensions. Common schemes include the International Patent Classification (IPC), the Cooperative Patent Classification (CPC), and the European Classification (ECLA), but these systems are complex and highly specialized, which makes them hard for non-IP professionals to use. As a leading international intellectual-property SaaS platform, PatSnap has built a new patent classification system based on users' search habits and related factors to improve the user experience.

Task

The organizers release 958 patent records, each containing the assignee, title, abstract, and a classification label; the labels are anonymized and cover 36 classes. Contestants are asked to design an algorithm that classifies the test data.

The training set released for this competition is small, so this is a classification task with few-shot training data. Few-shot classification has been a hot research topic in recent years, and the community has proposed approaches such as distant supervision, data augmentation, pretrained models, and the PET paradigm. Contestants are encouraged to apply these research results creatively to this task.

Main References

2022 CCF BDCI 小样本数据分类任务Baseline: overall structure and the R-Drop method (data augmentation); needs hyperparameter tuning

讯飞非标准化疾病诉求的简单分诊挑战赛2.0_baseline: Focal loss with a fix (for class imbalance); no clear improvement observed so far

【上图开放数据竞赛】中文期刊论文自动分类Baseline: structure of the training, evaluation, and prediction sections

【Paddle版本】阐述深度学习对抗训练科学工具包: adversarial training with FGSM, FGM, etc. (data augmentation); needs hyperparameter tuning

Approach

!pip install -U paddlenlp
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parl 1.4.1 requires pyzmq==18.1.1, but you have pyzmq 23.2.1 which is incompatible.
Successfully installed aiohttp-3.8.3 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 charset-normalizer-2.1.1 datasets-2.5.1 frozenlist-1.3.1 fsspec-2022.8.2 huggingface-hub-0.10.0 multidict-6.0.2 paddlefsl-1.1.0 paddlenlp-2.4.0 pyarrow-9.0.0 responses-0.18.0 tqdm-4.64.1 urllib3-1.25.11 xxhash-3.0.0 yarl-1.8.1
import os
import json
import random
import time
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import paddle
import paddlenlp
import paddle.nn.functional as F
from functools import partial
from paddlenlp.data import Stack, Dict, Pad
from paddlenlp.datasets import load_dataset
import paddle.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from paddlenlp.transformers import *

seed = 80471
init_from_ckpt = None

# Choose the language model; the pretrained weights are loaded by name

# ernie-3.0-xbase-zh
# ernie-3.0-base-zh
# ernie-3.0-medium-zh
# ernie-3.0-mini-zh
# ernie-3.0-micro-zh
# ernie-3.0-nano-zh

# hfl/roberta-wwm-ext
# hfl/roberta-wwm-ext-large
# hfl/rbt3
# hfl/rbtl3

# bert-base-chinese
# bert-wwm-chinese
# bert-wwm-ext-chinese

MODEL_NAME = 'ernie-3.0-base-zh'

max_seq_length = 352
train_batch_size = 16
valid_batch_size = 16
test_batch_size = 16

# Peak learning rate during training
learning_rate = 5e-5
# Number of training epochs
epochs = 30

# Proportion of steps used for learning-rate warmup
warmup_proportion = 0.1
# Proportion of steps used for learning-rate decay
decay_proportion = 0.2

# Weight-decay coefficient, a regularization strategy that helps prevent overfitting
weight_decay = 0.01

# Gradient clipping: if the L2 norm of the gradient vector exceeds max_grad_norm, it is scaled down proportionally
max_grad_norm = 1.0

# Whether to use data augmentation
enable_dataaug = False

# Whether to enable adversarial training
enable_adversarial = False

# R-Drop loss hyperparameter; if greater than 0, the R-Drop loss is added with this weight
rdrop_coef = 0.2

# Loss function setting
unbalance = 'Focal_loss' #  None , Focal_loss
focalloss_alpha = 0.5
focalloss_gamma = 2

# Directory for saving model parameters after training
save_dir = "checkpoint/{}-{}".format(MODEL_NAME.replace('/','-'),int(time.time()))

1 Data Loading and EDA

1.1 Load the data and unify the format

def read_jsonfile(file_name):
    data = []
    with open(file_name) as f:
        for i in f.readlines():
            data.append(json.loads(i))
    return data

train = pd.DataFrame(read_jsonfile("./data/data170933/train.json"))
test = pd.DataFrame(read_jsonfile("./data/data170933/testA.json"))

print("train size: {} \ntest size {}".format(len(train),len(test)))
train size: 958 
test size 20839
train['text'] = [row['title'] + '[SEP]' + row['assignee'] + '[SEP]' + row['abstract'] for idx,row in train.iterrows()]
test['text'] = [row['title'] + '[SEP]' + row['assignee'] + '[SEP]' + row['abstract'] for idx,row in test.iterrows()]
train['concat_len'] = [len(row) for row in train['text']]
test['concat_len'] = [len(row) for row in test['text']]
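
For reference, each line of train.json is a standalone JSON object. A hypothetical record is shown below; the field names follow the code above, but the values are invented for illustration:

import json

# Hypothetical train.json record: field names match those used above, values are invented
sample = {"id": "0001", "title": "a patent title", "assignee": "a patent assignee",
          "abstract": "a patent abstract", "label_id": 3}
line = json.dumps(sample, ensure_ascii=False)
print(json.loads(line)["label_id"])  # 3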

1.2 Quick data analysis

# Analyze the lengths of the concatenated texts
for rate in [0.5,0.75,0.9,0.95,0.99]:
    print("{:.0f}% of training texts have length <= {:.2f}".format(rate*100,train['concat_len'].quantile(rate)))
plt.title("text length")
sns.distplot(train['concat_len'],bins=10,color='r')
sns.distplot(test['concat_len'],bins=10,color='g')
plt.show()
50% of training texts have length <= 307.00
75% of training texts have length <= 327.00
90% of training texts have length <= 339.00
95% of training texts have length <= 346.00
99% of training texts have length <= 381.29

(Figure: distribution of concatenated text lengths; train in red, test in green)

train_label = train["label_id"].unique()
# Inspect the label distribution
plt.figure(figsize=(16,8))
plt.title("label distribution")
sns.countplot(y='label_id',data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x7f347c012110>

(Figure: count plot of the label_id distribution)

1.3 Findings

  • The training set has 958 examples; the test set has 20,839.
  • After concatenating all fields, most texts are around 300 characters long; only a handful exceed 400.
  • There are 36 labels with an extremely imbalanced distribution; most classes have no more than 50 labeled examples.

2 Data Processing

2.1 Data splitting

# Create data generators
def read(df,istrain=True):
    if istrain:
        for _,data in df.iterrows():
            yield {
                "words":data['text'],
                "labels":data['label_id']
                }
    else:
        for _,data in df.iterrows():
            yield {
                "words":data['text'],
                }

# Pass the generators to load_dataset
train,valid = train_test_split(train,test_size=0.2,random_state=5)
train_ds = load_dataset(read, df=train, lazy=False)
valid_ds = load_dataset(read, df=valid, lazy=False)
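
Note that train_test_split without stratify can leave some of the 36 rare classes entirely out of the 20% validation split. A stratified split is a common alternative, sketched below; it is not what the run above used, and it assumes every label_id occurs at least twice (otherwise sklearn raises an error):

# Sketch (assumption: every label_id occurs at least twice, otherwise stratify raises)
# train, valid = train_test_split(train, test_size=0.2, random_state=5, stratify=train["label_id"])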

2.2 Data encoding

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
[2022-10-03 17:00:14,855] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-base-zh'.
[2022-10-03 17:00:14,857] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-10-03 17:00:14,859] [    INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt
100%|██████████| 182k/182k [00:00<00:00, 4.12MB/s]
[2022-10-03 17:00:15,014] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/tokenizer_config.json
[2022-10-03 17:00:15,016] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/special_tokens_map.json
# Encoding
def convert_example(example, tokenizer, max_seq_len=512, mode='train'):
    # Use the tokenizer to convert the text into token ids
    tokenized_input = tokenizer(example['words'],is_split_into_words=True,max_seq_len=max_seq_len)
    if mode == "test":
        return tokenized_input
    # Convert the classification label into a numeric id
    tokenized_input['labels'] = [example['labels']]
    return tokenized_input # a dict containing input_ids, token_type_ids and labels

train_trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        mode='train',
        max_seq_len=max_seq_length)

valid_trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        mode='dev',
        max_seq_len=max_seq_length)

# Apply the encoding to both datasets
train_ds.map(train_trans_func, lazy=False)
valid_ds.map(valid_trans_func, lazy=False)
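
A quick sanity check of the mapping (illustrative):

# Each mapped example is now a dict of ids, truncated to max_seq_length
example = train_ds[0]
print(list(example.keys()))       # expected: ['input_ids', 'token_type_ids', 'labels']
print(len(example['input_ids']))  # <= max_seq_length (352)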

# Initialize the BatchSamplers
np.random.seed(seed)
train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=train_batch_size, shuffle=True)
valid_batch_sampler = paddle.io.BatchSampler(valid_ds, batch_size=valid_batch_size, shuffle=False)

# Define batchify_fn
batchify_fn = lambda samples, fn = Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), 
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
    "labels": Stack(dtype="int32"),
}): fn(samples)
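
To see what batchify_fn produces, here is a tiny illustrative batch of two unequal-length samples (the token ids are made up):

# Illustrative only: Pad right-pads each field to the longest sample in the batch, Stack stacks the labels
demo_fn = Dict({
    "input_ids": Pad(axis=0, pad_val=0),
    "token_type_ids": Pad(axis=0, pad_val=0),
    "labels": Stack(dtype="int32"),
})
demo_batch = [
    {"input_ids": [1, 5, 2], "token_type_ids": [0, 0, 0], "labels": [7]},
    {"input_ids": [1, 9, 8, 2], "token_type_ids": [0, 0, 0, 0], "labels": [3]},
]
ids, types, label_arr = demo_fn(demo_batch)
print(ids)        # [[1 5 2 0] [1 9 8 2]] -- the shorter sample is padded with 0
print(label_arr)  # [[7] [3]]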

# Initialize the DataLoaders
def _init_fn(worker_id):
    np.random.seed(int(seed)+worker_id)

train_data_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_sampler=train_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True,
    worker_init_fn=_init_fn)
valid_data_loader = paddle.io.DataLoader(
    dataset=valid_ds,
    batch_sampler=valid_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True,
    worker_init_fn=_init_fn)

# Build the test set the same way
test_ds = load_dataset(read,df=test, istrain=False, lazy=False)

test_trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        mode='test',
        max_seq_len=max_seq_length)

test_ds.map(test_trans_func, lazy=False)

test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=test_batch_size, shuffle=False)

test_batchify_fn = lambda samples, fn = Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), 
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
}): fn(samples)

test_data_loader = paddle.io.DataLoader(
    dataset=test_ds,
    batch_sampler=test_batch_sampler,
    collate_fn=test_batchify_fn,
    return_list=True)
labels = train['label_id'].unique()  # note: computed on the train split, which assumes all 36 classes appear in it
pretrained_model = AutoModel.from_pretrained(MODEL_NAME)
[2022-10-03 17:01:07,117] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieModel'> to load 'ernie-3.0-base-zh'.
[2022-10-03 17:01:07,120] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-10-03 17:01:07,122] [    INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams
100%|██████████| 452M/452M [00:21<00:00, 22.2MB/s] 
W1003 17:01:28.636518   291 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1003 17:01:28.644773   291 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-10-03 17:01:30,574] [    INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.transform.weight', 'cls.predictions.layer_norm.weight', 'cls.predictions.transform.bias', 'cls.predictions.layer_norm.bias', 'cls.predictions.decoder_bias']

3 Model Architecture

Pretrained model + a fully connected classification layer with dropout

class TextClassification(nn.Layer):
    def __init__(self, pretrained_model,num_classes,dropout=None):
        super().__init__()
        self.ptm = pretrained_model
        self.num_classes = num_classes
        self.dropout = nn.Dropout(dropout if dropout is not None else
                                  self.ptm.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ptm.config["hidden_size"],
                                    self.num_classes)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        _, pooled_output = self.ptm(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits
model = TextClassification(pretrained_model, num_classes=len(labels))
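
A quick smoke test with random token ids (illustrative) confirms that the head produces one logit per class:

# Illustrative smoke test: output shape should be [batch_size, num_classes]
fake_ids = paddle.randint(low=1, high=1000, shape=[2, 16])
print(model(fake_ids).shape)  # [2, 36]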

4 Model Configuration

# If a checkpoint is provided, load its weights
if init_from_ckpt and os.path.isfile(init_from_ckpt):
    state_dict = paddle.load(init_from_ckpt)
    model.set_dict(state_dict)

# Total number of training steps
max_steps = len(train_data_loader) * epochs

# Learning-rate schedules available in paddlenlp.transformers:
'''
__all__ = [
    'LinearDecayWithWarmup',
    'ConstScheduleWithWarmup',
    'CosineDecayWithWarmup',
    'PolyDecayWithWarmup',
    'CosineAnnealingWithWarmupDecay',
]
'''

# Learning-rate decay strategy
# lr_scheduler = paddlenlp.transformers.LinearDecayWithWarmup(learning_rate=learning_rate, total_steps=max_steps, warmup=warmup_proportion)
lr_scheduler = paddlenlp.transformers.CosineDecayWithWarmup(learning_rate=learning_rate, total_steps=max_steps, warmup=warmup_proportion)
# lr_scheduler = paddlenlp.transformers.CosineDecayWithWarmup(
#     learning_rate=learning_rate, total_steps=max_steps, warmup=warmup_proportion,
#      with_hard_restarts=False, num_cycles=3, last_epoch=- 1, verbose=False)

# warmup_step=max_steps*warmup_proportion
# decay_step=max_steps*(1-decay_proportion)
# lr_scheduler = paddlenlp.transformers.CosineAnnealingWithWarmupDecay(max_lr=learning_rate, min_lr=1e-7, warmup_step=warmup_step, decay_step=decay_step)
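
The run above uses CosineDecayWithWarmup. To get a feel for its warmup-then-cosine shape, a standalone scheduler can be stepped by hand; a small illustrative sketch (the numbers are arbitrary, not the training configuration):

# Illustrative: the LR ramps up linearly during warmup, then cosine-decays toward 0
demo_sched = paddlenlp.transformers.CosineDecayWithWarmup(
    learning_rate=5e-5, total_steps=100, warmup=0.1)
lrs = []
for _ in range(100):
    lrs.append(demo_sched.last_lr)
    demo_sched.step()
print(lrs[0], lrs[10], lrs[50], lrs[99])  # small at step 0, peak 5e-5 after warmup, decaying afterwards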

decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params,
    grad_clip=paddle.nn.ClipGradByGlobalNorm(max_grad_norm))

Extra: Adversarial Training with FGSM and FGM

class FGSM:
    def __init__(self, model, epsilon=0.1, emb_name='word_embeddings'):
        self.model = (model.module if hasattr(model, "module") else model)
        self.eps = epsilon
        self.emb_name = emb_name
        self.backup = {}

    # only attack word embedding
    def attack(self):
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and self.emb_name in name:  # trainable (not frozen) embedding weights only
                self.backup[name] = param.data.clone()
                r_at = self.eps * param.grad.sign()
                param.data.add_(r_at)

    def restore(self):
        for name, para in self.model.named_parameters():
            if not para.stop_gradient and self.emb_name in name:
                assert name in self.backup
                para.data = self.backup[name]

        self.backup = {}


class FGM:
    def __init__(self, model, epsilon=1., emb_name='word_embeddings'):
        self.model = (model.module if hasattr(model, "module") else model)
        self.eps = epsilon
        self.emb_name = emb_name
        self.backup = {}

    # only attack embedding
    def attack(self):
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and self.emb_name in name:  # trainable (not frozen) embedding weights only
                self.backup[name] = param.data.clone()
                norm = paddle.norm(param.grad)
                if norm and not paddle.isnan(norm):
                    r_at = self.eps * param.grad / norm
                    param.data.add_(r_at)

    def restore(self):
        for name, para in self.model.named_parameters():
            if not para.stop_gradient and self.emb_name in name:
                assert name in self.backup
                para.data = self.backup[name]

        self.backup = {}
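
The perturbation FGM adds is eps * g / ||g||2, i.e. the gradient direction normalized to unit L2 norm and scaled by eps. On a toy gradient (illustrative):

# Illustrative: FGM's perturbation is the L2-normalized gradient scaled by eps
g = paddle.to_tensor([3.0, 4.0])
r_at = 1.0 * g / paddle.norm(g)
print(r_at.numpy())  # [0.6 0.8]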


# Adversarial training setup
if enable_adversarial:
    # adv = FGSM(model=model,epsilon=1e-6,emb_name='word_embeddings')
    adv = FGM(model=model,epsilon=1e-6,emb_name='word_embeddings')

Extra: FocalLoss loss function

import paddle
import numpy as np

class FocalLoss(paddle.nn.Layer):
    def __init__(self, alpha=0.5, gamma=2, num_classes=3, weight=None, ignore_index=-100):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.weight = weight if weight is not None else paddle.to_tensor(np.array([1.] * num_classes), dtype='float32')
        self.ce_fn = paddle.nn.CrossEntropyLoss(
            weight=self.weight, soft_label=False, ignore_index=ignore_index)

    def forward(self, preds, labels):
        logpt = -self.ce_fn(preds, labels)
        pt = paddle.exp(logpt)
        loss = -((1 - pt) ** self.gamma) * self.alpha * logpt
        return loss
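
A quick numeric check (illustrative) shows the focal term down-weighting an easy, confidently correct example relative to an uncertain one:

# Illustrative: with gamma=2, a confident correct prediction contributes almost no loss
fl_demo = FocalLoss(alpha=0.5, gamma=2, num_classes=3)
y = paddle.to_tensor([0])
easy = paddle.to_tensor([[6.0, 0.0, 0.0]])  # high-confidence, correct
hard = paddle.to_tensor([[0.5, 0.4, 0.3]])  # low-confidence
print(float(fl_demo(easy, y)), float(fl_demo(hard, y)))  # roughly 1e-7 vs 0.2
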
# Choose the loss function
if unbalance == "Focal_loss":
    criterion = FocalLoss(
        alpha=focalloss_alpha,
        gamma=focalloss_gamma,
        num_classes=len(labels))
else:
    # Cross-entropy loss
    criterion = paddle.nn.loss.CrossEntropyLoss()

5 Model Training

# Define the evaluation function used for validation during training
@paddle.no_grad()
def evaluate(model, data_loader):
    """
    Given a dataset, it evals model and computes the metric.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        criterion(obj:`paddle.nn.Layer`): It can compute the loss.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
    """
    model.eval()
    
    real_s = []
    pred_s = []
    for batch in tqdm(data_loader):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)

        probs = F.softmax(logits,axis=1)
        pred_s.extend(probs.argmax(axis=1).numpy())
        real_s.extend(labels.reshape([-1]).numpy())
    score =  f1_score(y_pred=pred_s, y_true=real_s, average='macro')

    return score  # F1-score
def do_train(model, train_data_loader, valid_data_loader, criterion, optimizer, lr_scheduler, rdrop_coef, enable_adversarial=False):
    model.train()
    max_f1_score=0

    if rdrop_coef > 0:
        rdrop_loss = paddlenlp.losses.RDropLoss()   
     
    for epoch in range(1, epochs + 1):
        model.train()  # evaluate() puts the model in eval mode, so re-enable dropout each epoch
        with tqdm(total=len(train_data_loader)) as pbar:
            for step, batch in enumerate(train_data_loader, start=1):
                input_ids, token_type_ids, labels = batch
                logits = model(input_ids, token_type_ids)
                        
                # Use R-Drop
                if rdrop_coef > 0:
                    logits_2 = model(input_ids=input_ids, token_type_ids=token_type_ids)
                    ce_loss = (criterion(logits,labels).mean() + criterion(logits_2,labels).mean()) * 0.5
                    kl_loss = rdrop_loss(logits, logits_2)
                    loss = ce_loss + kl_loss * rdrop_coef
                else:
                    loss = criterion(logits,labels).mean()    
                loss.backward()
                # Adversarial training
                if enable_adversarial:
                    adv.attack()  # add an adversarial perturbation to the embeddings
                    adv_logits = model(input_ids, token_type_ids)
                    adv_loss = criterion(adv_logits,labels).mean()
                    adv_loss.backward()  # backprop, accumulating the adversarial gradient on top of the normal one
                    adv.restore()  # restore the embedding parameters
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_grad()
                pbar.set_postfix({'loss' : '%.5f' % (loss.numpy())})
                pbar.update(1)
        eval_f1_score = evaluate(model, valid_data_loader)
        print("Epoch: %d, eval_f1_score: %.5f" % (epoch, eval_f1_score))
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)  # create directories recursively

        print("Epoch: %d, eval_f1_score: %.5f" % (epoch, eval_f1_score), file=open(save_dir +'/best_model_log.txt', 'a'))
            
        if eval_f1_score >= max_f1_score:
            max_f1_score = eval_f1_score 
            save_param_path = os.path.join(save_dir, 'best_model.pdparams')
            paddle.save(model.state_dict(), save_param_path)
            tokenizer.save_pretrained(save_dir)                
    save_param_path = os.path.join(save_dir, 'last_model.pdparams')
    paddle.save(model.state_dict(), save_param_path)      
do_train(model, train_data_loader, valid_data_loader, criterion, optimizer, lr_scheduler, rdrop_coef, enable_adversarial)
Epoch: 1, eval_f1_score: 0.00729
Epoch: 2, eval_f1_score: 0.10252
Epoch: 3, eval_f1_score: 0.33786
Epoch: 4, eval_f1_score: 0.45612
Epoch: 5, eval_f1_score: 0.43612
Epoch: 6, eval_f1_score: 0.45665
Epoch: 7, eval_f1_score: 0.49663
Epoch: 8, eval_f1_score: 0.47932
Epoch: 9, eval_f1_score: 0.46649
Epoch: 10, eval_f1_score: 0.48867
Epoch: 11, eval_f1_score: 0.48034
Epoch: 12, eval_f1_score: 0.49100
Epoch: 13, eval_f1_score: 0.50401
Epoch: 14, eval_f1_score: 0.49681
Epoch: 15, eval_f1_score: 0.49681
Epoch: 16, eval_f1_score: 0.49681
Epoch: 17, eval_f1_score: 0.49681
Epoch: 18, eval_f1_score: 0.49681
Epoch: 19, eval_f1_score: 0.49681
Epoch: 20, eval_f1_score: 0.49681
Epoch: 21, eval_f1_score: 0.49681
Epoch: 22, eval_f1_score: 0.49681
Epoch: 23, eval_f1_score: 0.49681
Epoch: 24, eval_f1_score: 0.50466
Epoch: 25, eval_f1_score: 0.50241
Epoch: 26, eval_f1_score: 0.50281
Epoch: 27, eval_f1_score: 0.50281
Epoch: 28, eval_f1_score: 0.48898
Epoch: 29, eval_f1_score: 0.50178
Epoch: 30, eval_f1_score: 0.48836

6 Prediction

# Prediction stage
def do_sample_predict(model,data_loader,is_prob=False):
    model.eval()
    preds = []
    probs_all = []
    for batch in tqdm(data_loader):
        input_ids, token_type_ids= batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits,axis=1)
        preds.extend(probs.argmax(axis=1).numpy())
        probs_all.append(probs.numpy())
    if is_prob:
        # Accumulate probabilities across batches (the original returned only the last batch's probs)
        return np.concatenate(probs_all, axis=0)
    return preds

# Load the best model
state_dict = paddle.load(os.path.join(save_dir,'best_model.pdparams'))
model.load_dict(state_dict)

# Predict
pred_score = do_sample_predict(model,test_data_loader)
100%|██████████| 1303/1303 [00:37<00:00, 35.11it/s]

7 Generate the Submission File

# Generate the submission file
submit = pd.DataFrame({"id":test["id"]})
submit["label"] = pred_score
file_name = "submit_{}.csv".format(save_dir.split("/")[1])
submit.to_csv(file_name,index=False)
print("Submission file created: {}".format(file_name))
Submission file created: submit_ernie-3.0-base-zh-1664787611.csv
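
A quick format check before uploading (illustrative):

# Illustrative check: two columns (id, label) and one row per test example
check = pd.read_csv(file_name)
assert list(check.columns) == ["id", "label"]
assert len(check) == len(test)  # 20839 rows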

This article is a repost.
Original project link
