Resources

⭐ ⭐ ⭐ If this project helps you, please give it a small Star! ⭐ ⭐ ⭐

Open source takes effort; your support is much appreciated!

1. Competition Background

Urban grid-based management divides a city's administrative area into unit grids according to a defined standard. Strengthened patrols of each unit grid support a management and service model in which supervision and handling are separated, with the aims of proactively discovering problems, resolving them promptly, and improving both the city's management capability and its speed of response.
By deeply mining various resource data and applying big-data thinking and tools, centered on information discovery, analysis, and utilization, the value of the massive data held in databases can be extracted, turning passive service into proactive service and building a solid data foundation for actual business operations. Starting from real scenarios and practical applications, this year's competition adds more challenging and more pragmatic tasks, and participants are encouraged to learn from one another and improve together.

Competition website: http://data.sd.gov.cn/cmpt/cmptDetail.html?id=67

2. Data Preparation

Data preparation consists of reading the competition dataset, converting the text into token IDs, and wrapping the result in a DataLoader iterator, which then feeds the model during training.

First, import the required packages:

import pandas as pd
from functools import partial
import os
import time
import numpy as np

# Paddle core libraries
import paddle
import paddle.nn.functional as F
import paddle.nn as nn
from paddle.io import DataLoader
from paddle.dataset.common import md5file
# PaddleNLP libraries
import paddlenlp as ppnlp
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.datasets import DatasetBuilder, get_path_from_url
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from sklearn.model_selection import train_test_split

Unzip the dataset:

!unzip -o 东营-算法赛-智能网格分类-A榜.zip
Archive:  东营-算法赛-智能网格分类-A榜.zip
  inflating: __MACOSX/._东营-算法赛-智能网格分类-A榜  
  inflating: 东营-算法赛-智能网格分类-A榜/testa_nolabel.csv  
  inflating: __MACOSX/东营-算法赛-智能网格分类-A榜/._testa_nolabel.csv  
  inflating: 东营-算法赛-智能网格分类-A榜/submita.csv  
  inflating: __MACOSX/东营-算法赛-智能网格分类-A榜/._submita.csv  
  inflating: 东营-算法赛-智能网格分类-A榜/train.csv  
  inflating: __MACOSX/东营-算法赛-智能网格分类-A榜/._train.csv  
file_name='东营-算法赛-智能网格分类-A榜/train.csv'
train_data=pd.read_csv(file_name)
train_data = train_data.dropna()
train_data.head()

id | name | content | label
0 | 渠內有草 | 9月9日上午8点40分***村网格员***在辖区内巡查中当巡查到我村村北首时发现排水渠内长满... | 0
1 | 清理楼道杂物 | 辖区内清理楼道杂物 | 0
2 | 路灯报修 | 2020年9月8日,***村网格员***,在三十八户村进行晚间摸排时,发现我们村村***大院... | 0
3 | 商铺排查 | 2021年2月1日***第七网格员***在***小区内商铺排查有无安全隐患。 | 0
4 | 清理***4***北侧粪便 | 2020年9月7日8时10分***社区居委会***第一网格***,第二网格***,第三网格*... | 0

Split the data into training and validation sets:

train,dev=train_test_split(train_data,test_size=0.3,random_state=2021)
train.to_csv('train.csv',index=False)
dev.to_csv('dev.csv',index=False)

Check the label distribution:

num_label=train_data['label'].nunique() 
counts=train_data['label'].value_counts()
print(num_label)
print(counts)
25
0     7036
1     1640
2     1110
3      810
4      721
5      590
6      566
7      296
8      252
9      235
11     230
10     226
12     226
13     179
14     154
15     149
16     136
17     132
18     103
19      98
20      80
21      80
22      69
24      66
23      62
Name: label, dtype: int64

The distribution shows that classes 0, 1, and 2 dominate, i.e. the data is fairly imbalanced. Loss functions designed for class imbalance, such as focal loss, are worth trying as an optimization.
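
As a concrete starting point, the snippet below (a sketch, not part of the original baseline) turns the counts above into normalized inverse-frequency class weights, which could be fed into a weighted cross-entropy or the FocalLoss defined in Section 3.

# Sketch: inverse-frequency class weights from the label counts above.
# Rarer classes get larger weights; normalize so the mean weight is 1.
counts = train_data['label'].value_counts().sort_index()
inv_freq = 1.0 / counts
class_weights = (inv_freq / inv_freq.mean()).tolist()
print([round(w, 3) for w in class_weights])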


def read(data_path):
    df=pd.read_csv(data_path)
    for idx,row in df.iterrows():
        name=row['name']
        content=row['content']
        label=row['label']
        yield {'name': str(name),'content':str(content), 'label': label}

# data_path is the argument passed through to read()
train_ds = load_dataset(read, data_path='train.csv',lazy=False)
dev_ds = load_dataset(read, data_path='dev.csv',lazy=False)
print(train_ds[:4])
[{'name': '鞭炮皮垃圾', 'content': '9.10日,上午8点30分,***网格员***,在小区巡查,29号楼1单元绿化带附近存在燃放的鞭炮皮未清理。已联系保洁人员及时打扫。', 'label': 3}, {'name': '规范住户私拉电线充电现象', 'content': '2020年9月9日下午二点半,***社区第四网格网格员***巡查至***二十五号楼二单元时,发现有住户从车库内私拉长电线给电动三轮车充电,影响他人出行,存在安全隐患。', 'label': 8}, {'name': '清理散落垃圾', 'content': '9月10日,网格员巡检时发现小区东头花池附近有散落的垃圾,影响小区环境卫生', 'label': 3}, {'name': '鞭炮皮清理!', 'content': '2020年9月10日上午*xx社区***第五网格内巡查,发现53号**路中有鞭炮皮,我马上通知物业保洁人员及时进行清理!', 'label': 3}]

Construct the DataLoader, which packs examples into batches for training.

# Convert a raw example into token IDs
def convert_example(example, tokenizer):
    encoded_inputs = tokenizer(text=example['content'],text_pair=example["name"], max_seq_len=256, pad_to_max_seq_len=True)
    # encoded_inputs = tokenizer(text=example["name"]+','+example['content'], max_seq_len=256, pad_to_max_seq_len=True)
    return tuple([np.array(x, dtype="int64") for x in [
            encoded_inputs["input_ids"], encoded_inputs["token_type_ids"], [example["label"]]]])

tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')

# Convert the training and dev sets to IDs
train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer))

dev_ds = dev_ds.map(partial(convert_example, tokenizer=tokenizer))

# Build the DataLoader for the training set
train_batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=32, shuffle=True)
train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, return_list=True)

dev_batch_sampler = paddle.io.BatchSampler(dataset=dev_ds, batch_size=32, shuffle=False)
dev_data_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, return_list=True)
[2021-12-29 10:23:39,385] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt
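
As an aside, the Pad, Stack, and Tuple helpers imported earlier go unused here because the tokenizer pads every example to max_seq_len up front. The sketch below shows the alternative, dynamic per-batch padding (assuming tokenization without pad_to_max_seq_len):

# Hypothetical alternative: pad each batch to its own longest sequence
# instead of a fixed 256 tokens, using the helpers imported above.
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64"),                              # labels
): fn(samples)
# Pass it via: paddle.io.DataLoader(..., collate_fn=batchify_fn)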

The cell below prints a few batches from the DataLoader; the text has been converted into ID form.

for step, batch in enumerate(train_data_loader, start=1):
    input_ids, token_type_ids, labels = batch
    print(input_ids)
    print(token_type_ids)
    print(labels)
    if step > 2:
        break
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 36  , 143 , ..., 0   , 0   , 0   ],
        [1   , 177 , 540 , ..., 0   , 0   , 0   ],
        ...,
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
Tensor(shape=[32, 1], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[8 ],
        [18],
        [12],
        [0 ],
        [11],
        [0 ],
        [0 ],
        [0 ],
        [0 ],
        [0 ],
        [12],
        [9 ],
        [2 ],
        [0 ],
        [18],
        [7 ],
        [0 ],
        [1 ],
        [0 ],
        [1 ],
        [4 ],
        [0 ],
        [6 ],
        [3 ],
        [0 ],
        [2 ],
        [0 ],
        [4 ],
        [0 ],
        [0 ],
        [9 ],
        [2 ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[1    , 17963, 17963, ..., 0    , 0    , 0    ],
        [1    , 139  , 223  , ..., 0    , 0    , 0    ],
        [1    , 3615 , 17   , ..., 0    , 0    , 0    ],
        ...,
        [1    , 249  , 136  , ..., 0    , 0    , 0    ],
        [1    , 3615 , 17   , ..., 0    , 0    , 0    ],
        [1    , 782  , 73   , ..., 0    , 0    , 0    ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
Tensor(shape=[32, 1], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[15],
        [13],
        [3 ],
        [2 ],
        [0 ],
        [0 ],
        [0 ],
        [0 ],
        [1 ],
        [12],
        [3 ],
        [6 ],
        [1 ],
        [0 ],
        [3 ],
        [17],
        [0 ],
        [1 ],
        [1 ],
        [6 ],
        [1 ],
        [0 ],
        [0 ],
        [0 ],
        [0 ],
        [23],
        [21],
        [2 ],
        [1 ],
        [0 ],
        [5 ],
        [9 ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[1    , 3615 , 17   , ..., 0    , 0    , 0    ],
        [1    , 701  , 136  , ..., 0    , 0    , 0    ],
        [1    , 10235, 17   , ..., 0    , 0    , 0    ],
        ...,
        [1    , 3615 , 17   , ..., 827  , 256  , 2    ],
        [1    , 11   , 305  , ..., 0    , 0    , 0    ],
        [1    , 1724 , 459  , ..., 0    , 0    , 0    ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 1, 1, 1],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
Tensor(shape=[32, 1], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[1 ],
        [0 ],
        [0 ],
        [11],
        [0 ],
        [0 ],
        [5 ],
        [5 ],
        [5 ],
        [0 ],
        [4 ],
        [2 ],
        [2 ],
        [7 ],
        [17],
        [0 ],
        [2 ],
        [0 ],
        [4 ],
        [12],
        [0 ],
        [0 ],
        [0 ],
        [10],
        [0 ],
        [1 ],
        [9 ],
        [0 ],
        [4 ],
        [0 ],
        [0 ],
        [13]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        ...,
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ],
        [1   , 3615, 17  , ..., 0   , 0   , 0   ]])
Tensor(shape=[32, 256], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
Tensor(shape=[32, 1], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
       [[0 ],
        [9 ],
        [2 ],
        [2 ],
        [14],
        [11],
        [0 ],
        [6 ],
        [3 ],
        [0 ],
        [22],
        [0 ],
        [0 ],
        [0 ],
        [0 ],
        [19],
        [5 ],
        [16],
        [0 ],
        [0 ],
        [2 ],
        [2 ],
        [9 ],
        [2 ],
        [0 ],
        [4 ],
        [5 ],
        [0 ],
        [13],
        [7 ],
        [1 ],
        [17]])

3. Model Construction

This baseline uses the ERNIE 1.0 model; other pretrained models can be substituted. See the PaddleNLP documentation for the full list: https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html

num_classes=25

model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(
        'ernie-1.0', num_classes=num_classes)



# roberta
# model = ppnlp.transformers.RobertaForSequenceClassification.from_pretrained(
#         'rbt3', num_classes=num_classes)
# tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('rbt3')

# BERT
# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=num_classes)
[2021-12-29 10:23:41,313] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
W1229 10:23:41.316828  1927 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1229 10:23:41.322613  1927 device_context.cc:465] device: 0, cuDNN Version: 7.6.

Below is an implementation of FocalLoss; readers can try adjusting its weight parameter to tackle the class imbalance encountered in this competition.

class FocalLoss(nn.Layer):
    def __init__(self, alpha=0.5, gamma=2, weight=None, ignore_index=255):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        if weight is None:
            # Tunable per-class weights: down-weight the three majority classes (0, 1, 2)
            weight = [0.01, 0.03, 0.03] + [0.8] * 22
        self.weight = paddle.to_tensor(np.array(weight), dtype="float32")
        self.ignore_index = ignore_index
        self.ce_fn = nn.CrossEntropyLoss(weight=self.weight, soft_label=False)

    def forward(self, preds, labels):
        logpt = -self.ce_fn(preds, labels)
        pt = paddle.exp(logpt)
        loss = -((1 - pt) ** self.gamma) * self.alpha * logpt
        return loss
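
A quick smoke test on random logits (illustrative only, not from the original notebook) confirms the loss produces a scalar:

# Smoke test: 4 random examples over the competition's 25 classes.
dummy_logits = paddle.randn([4, 25])
dummy_labels = paddle.to_tensor([[0], [1], [3], [24]], dtype="int64")
print(FocalLoss()(dummy_logits, dummy_labels))  # expect a single scalar loss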

4. Training Configuration

Configure the training hyperparameters, optimizer, loss function, and evaluation metric. Since this is a classification competition, cross-entropy is the natural loss and accuracy the natural metric, and the commonly used AdamW optimizer is a solid default.

# Training for more epochs may give better results
num_train_epochs=5
num_training_steps = len(train_data_loader) * num_train_epochs
logging_steps=200

save_steps=200
output_dir='checkpoint'
os.makedirs(output_dir,exist_ok=True)
last_step = num_train_epochs * len(train_data_loader)
# Define the learning-rate scheduler that adjusts the lr during training
lr_scheduler = LinearDecayWithWarmup(5E-5, num_training_steps, 0.0)

# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=0.0,  # 0.0 disables decay; set e.g. 0.01 for decay_params to take effect
    apply_decay_param_fun=lambda x: x in decay_params)

# optimizer = paddle.optimizer.Adam(learning_rate=5e-5,parameters=model.parameters())


# Cross-entropy loss and FocalLoss; the two are interchangeable here
criterion = paddle.nn.loss.CrossEntropyLoss()
# criterion=FocalLoss()
# Use accuracy as the evaluation metric
metric = paddle.metric.Accuracy()

5. Model Training

Below is the training procedure. Because evaluation is run during training, we first implement the evaluate function.

@paddle.no_grad()
def evaluate(model, data_loader, phase="dev"):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
        # Compute the loss
        loss = criterion(probs, labels)
        losses.append(loss.numpy())
        # Compute the number of correct predictions
        correct = metric.compute(probs, labels)
        # Update the running accuracy
        metric.update(correct)
        accu = metric.accumulate()
    print("eval {} loss: {:.5}, accu: {:.5}".format(phase,
                                                    np.mean(losses), accu))
    model.train()
    metric.reset()

The main body of the training loop is below: it draws batches from the DataLoader, runs the forward pass, then computes the loss and gradients.

# Next, start the actual training; it takes quite a while, so this part can be commented out
def do_train(model,train_data_loader,dev_data_loader):
    global_step = 0
    tic_train = time.time()

    for epoch in range(1, num_train_epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):

            input_ids, token_type_ids, labels = batch
            probs = model(input_ids=input_ids, token_type_ids=token_type_ids)

            # probs=paddle.to_tensor(probs, dtype="float64")
            loss = criterion(probs, labels)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()

            global_step += 1
            
            # Log training metrics every 100 steps
            if global_step % 100 == 0:
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss, acc,
                        100 / (time.time() - tic_train)))
                tic_train = time.time()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

            if global_step % save_steps == 0 or global_step == last_step:
                model_path=os.path.join(output_dir,"model_%d.pdparams" % global_step)
                paddle.save(model.state_dict(),model_path)
        evaluate(model, dev_data_loader, phase="dev")
# Run training
do_train(model, train_data_loader, dev_data_loader)
global step 100, epoch: 1, batch: 100, loss: 1.15801, accu: 0.49312, speed: 0.25 step/s
global step 200, epoch: 1, batch: 200, loss: 1.41770, accu: 0.53641, speed: 0.25 step/s
global step 300, epoch: 1, batch: 300, loss: 1.61155, accu: 0.55990, speed: 0.24 step/s
eval dev loss: 1.1241, accu: 0.64058
global step 400, epoch: 2, batch: 66, loss: 1.21426, accu: 0.64962, speed: 0.16 step/s
global step 500, epoch: 2, batch: 166, loss: 1.12272, accu: 0.65983, speed: 0.24 step/s
global step 600, epoch: 2, batch: 266, loss: 1.07059, accu: 0.66718, speed: 0.25 step/s
eval dev loss: 1.0269, accu: 0.67534
global step 700, epoch: 3, batch: 32, loss: 1.07614, accu: 0.72461, speed: 0.16 step/s
global step 800, epoch: 3, batch: 132, loss: 1.07723, accu: 0.73509, speed: 0.25 step/s
global step 900, epoch: 3, batch: 232, loss: 1.21629, accu: 0.74515, speed: 0.24 step/s
global step 1000, epoch: 3, batch: 332, loss: 0.79858, accu: 0.74285, speed: 0.25 step/s
eval dev loss: 1.0182, accu: 0.68999
global step 1100, epoch: 4, batch: 98, loss: 0.37761, accu: 0.81920, speed: 0.16 step/s
global step 1200, epoch: 4, batch: 198, loss: 0.59994, accu: 0.82023, speed: 0.25 step/s
global step 1300, epoch: 4, batch: 298, loss: 0.34567, accu: 0.81397, speed: 0.24 step/s
eval dev loss: 1.0545, accu: 0.69611
global step 1400, epoch: 5, batch: 64, loss: 0.20613, accu: 0.87354, speed: 0.16 step/s
global step 1500, epoch: 5, batch: 164, loss: 0.38463, accu: 0.87176, speed: 0.24 step/s
global step 1600, epoch: 5, batch: 264, loss: 0.51297, accu: 0.86707, speed: 0.25 step/s
eval dev loss: 1.0973, accu: 0.69676

The logs show the model converging: dev accuracy climbs from 0.641 to about 0.697 over the five epochs.
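
Since a checkpoint is saved every 200 steps, one reasonable follow-up (a sketch, assuming the checkpoint files written above) is to reload each checkpoint and keep whichever scores best on the dev set:

import glob

best_acc, best_path = 0.0, None
for path in sorted(glob.glob(os.path.join(output_dir, 'model_*.pdparams'))):
    model.load_dict(paddle.load(path))  # restore this checkpoint's weights
    model.eval()
    metric.reset()
    with paddle.no_grad():
        for input_ids, token_type_ids, labels in dev_data_loader:
            probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
            metric.update(metric.compute(probs, labels))
    acc = metric.accumulate()
    if acc > best_acc:
        best_acc, best_path = acc, path
# Reload best_path before predicting; the loop leaves the last checkpoint loaded.
print('best checkpoint:', best_path, 'dev accu: %.5f' % best_acc)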

6. Prediction

For prediction, load the test set 东营-算法赛-智能网格分类-A榜/testa_nolabel.csv, run the model to produce predictions, and write them to submission.csv, which can then be submitted.

def read_text(data_path):
    df = pd.read_csv(data_path)
    for idx, row in df.iterrows():
        name = row['name']
        content = row['content']
        # The test set has no labels; fill in a dummy 0 so convert_example still works
        yield {'name': str(name), 'content': str(content), 'label': 0}

test_ds = load_dataset(read_text, data_path='东营-算法赛-智能网格分类-A榜/testa_nolabel.csv',lazy=False)
print(test_ds[:4])
test_ds = test_ds.map(partial(convert_example, tokenizer=tokenizer))
test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=8, shuffle=False)

test_data_loader = paddle.io.DataLoader(
        dataset=test_ds,
        batch_sampler=test_batch_sampler,
        return_list=True) 
[{'name': '清理玉米堆', 'content': '9月9号在网格内巡查看到***门口处,主要路上有几堆玉米棒在路上堆着,对过往行人及车辆交通存在安全隐患,通知本户清理以免给群众交通造成危害。', 'label': 0}, {'name': '***交通', 'content': '2020年09月10日上午,***第一网格***,在日常走访巡查过程中,发现***村农贸市场附近停有一辆大型危险品运输车。', 'label': 0}, {'name': '***村前街路口西侧路南有杂草树枝等在路上,影响市容市貌和交通安全,需要清理干净。', 'content': '2020.9.2日,***村前街路口西侧路南有杂草树枝等杂物,影响市容市容和交通安全,需清理干净。', 'label': 0}, {'name': '共享单车乱停乱放', 'content': '网格巡查时发现共享单车乱停乱放,影响市容', 'label': 0}]
@paddle.no_grad()
def predict(model, test_data_loader):
    model.eval()
    result = []
    for step, batch in enumerate(test_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
        # Take the highest-scoring class as the predicted label
        preds = paddle.argmax(probs, axis=1)
        result.extend(preds.numpy().tolist())
    return result

# Optionally restore a saved checkpoint before predicting:
# state_dict = paddle.load('checkpoint/model_1670.pdparams')
# model.load_dict(state_dict)
result=predict(model,test_data_loader)

result_data=[]
for idx,label in enumerate(result):
    result_data.append([idx,label])
submit=pd.DataFrame(result_data,columns=['id','label'])
submit.to_csv('submission.csv',index=False)
submit.head(10)
id | label
0 | 1
1 | 19
2 | 0
3 | 2
4 | 4
5 | 1
6 | 1
7 | 0
8 | 0
9 | 0

The generated submission.csv can now be submitted on the competition site.
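
Before uploading, a quick sanity check (hypothetical, not in the original notebook) can guard against a row-count mismatch:

# The submission should contain exactly one prediction per test example.
test_df = pd.read_csv('东营-算法赛-智能网格分类-A榜/testa_nolabel.csv')
assert len(submit) == len(test_df), 'submission row count does not match the test set'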

Best model score:

Model | Score
ERNIE 1.0 | 0.704505

7. Ideas for Model Improvement

1. Data augmentation: Chinese data-augmentation tools, back-translation, etc.

2. Try different pretrained models, tune hyperparameters, and so on.

3. 5-fold cross-validation, ensembling the results of multiple models, etc. (see the sketch after this list).

4. Stronger participants can try further pretraining on the competition data or modifying the network architecture.
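
For idea 3, here is a minimal sketch of the 5-fold split, assuming the train_data DataFrame from Section 2; training one model per fold and averaging or voting over their predictions then gives the ensemble.

from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the imbalanced label distribution stable per fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
for fold, (tr_idx, dev_idx) in enumerate(skf.split(train_data, train_data['label'])):
    train_data.iloc[tr_idx].to_csv(f'train_fold{fold}.csv', index=False)
    train_data.iloc[dev_idx].to_csv(f'dev_fold{fold}.csv', index=False)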

About PaddleNLP: consult the relevant documentation as you go (PaddleNLP docs).

PaddleNLP on GitHub: https://github.com/PaddlePaddle/PaddleNLP

8. More PaddleEdu Resources

1. The PaddleEdu one-stop online deep learning encyclopedia awesome-DeepLearning offers other capabilities as well; stay tuned:

  • Introductory deep learning courses
  • Deep learning Q&A (100 questions)
  • Featured courses
  • Industry practice

If you run into any problems while using PaddleEdu, feel free to open an issue on awesome-DeepLearning. For more deep learning materials, see the PaddlePaddle deep learning platform.

Remember to give it a Star ⭐ and bookmark it!

2. PaddlePaddle PaddleEdu technical exchange QQ group

The QQ group already has 2000+ members learning together; scan the QR code to join.

