Shandong Innovation and Entrepreneurship Competition: Intelligent Classification of Grid Events (ERNIE, online accuracy 0.74)
Based on grid-event data, this project extracts and analyzes the content of events within each grid and classifies them: given an event description, it predicts the government-affairs category the event belongs to.
Reposted from AI Studio
Project link: https://aistudio.baidu.com/aistudio/projectdetail/3352265
Competition Background
Competition link: http://data.sd.gov.cn/cmpt/cmptDetail.html?id=67
City grid management divides a city's administrative area into unit grids according to fixed criteria. By strengthening the inspection of each unit grid, it establishes a management and service model in which supervision and handling are separated, with the aim of discovering problems proactively, handling them promptly, and strengthening the city's management capacity and response speed.
By deeply mining all kinds of data resources and applying big-data methods and tools, the competition centers on the discovery, analysis, and use of information to extract the value of the massive data held in government databases, turning passive service into proactive service and building a solid data foundation for real business operations. Starting from real scenarios and practical applications, this edition adds more challenging and more pragmatic tasks, on which contestants are expected to learn from one another and improve together.
Task
Based on the grid-event data, extract and analyze the content of events within each grid and classify them: given an event description, predict the government-affairs category it belongs to.
No external data may be used for this task. The competition uses A/B leaderboards: the A board runs from the opening of submissions until January 18, 2022, and the B board from January 19 to January 21, 2022.
Data
The data is provided for download; contestants debug their algorithms locally and submit results on the competition page. At most 28,000 samples are provided, covering both the training and test sets; the data actually released is authoritative. Sample rows from the training set are shown in the df.head() output below.
The test-set samples do not contain the label field. To ensure fairness, only the officially released data and annotations may be used; otherwise the results will be ruled invalid.
The evaluation metric is classification accuracy: the number of correctly classified tickets divided by the total number of tickets.
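As a quick sketch of the metric, with made-up predictions and labels for five tickets:

```python
import numpy as np

# Made-up predictions and ground-truth labels for five tickets
pred = np.array([0, 3, 3, 1, 2])
label = np.array([0, 3, 2, 1, 2])

# Accuracy = correctly classified tickets / all tickets
acc = (pred == label).mean()
print(acc)  # 0.8
```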
# Install the ERNIE package
!pip install paddle-ernie > log.log
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel, ErnieModelForSequenceClassification
# Read the datasets and prepend the event title to the event content
train_df = pd.read_csv('train.csv', sep=',')
train_df = train_df.dropna()
train_df['content'] = train_df['name'] + ',' + train_df['content']
test_df = pd.read_csv('testa_nolabel.csv', sep=',')
test_df['content'] = test_df['name'] + ',' + test_df['content']
train_df.head()
|   | id | name | content | label |
|---|----|------|---------|-------|
| 0 | 0 | 渠內有草 | 渠內有草,9月9日上午8点40分***村网格员***在辖区内巡查中当巡查到我村村北首时发现排... | 0 |
| 1 | 1 | 清理楼道杂物 | 清理楼道杂物,辖区内清理楼道杂物 | 0 |
| 2 | 2 | 路灯报修 | 路灯报修,2020年9月8日,***村网格员***,在三十八户村进行晚间摸排时,发现我们村村... | 0 |
| 3 | 3 | 商铺排查 | 商铺排查,2021年2月1日***第七网格员***在***小区内商铺排查有无安全隐患。 | 0 |
| 4 | 4 | 清理***4***北侧粪便 | 清理***4***北侧粪便,2020年9月7日8时10分***社区居委会***第一网格***... | 0 |
train_df['content'].apply(len).mean()
80.11235733963007
# Encode the dataset text into fixed-length id sequences.
# Relies on the globals `tokenizer` and `MAX_SEQLEN` defined further below.
def make_data(df):
    data = []
    for _, row in df.iterrows():
        text, label = row.content, row.label
        # ErnieTokenizer automatically adds the special tokens ERNIE needs, e.g. [CLS] and [SEP]
        text_id, _ = tokenizer.encode(text)
        text_id = text_id[:MAX_SEQLEN]  # truncate long sequences
        text_id = np.pad(text_id, [0, MAX_SEQLEN - len(text_id)], mode='constant')  # right-pad with 0
        data.append((text_id, label))
    return data
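The truncate-then-pad step can be illustrated in isolation; here `MAX_LEN` and `pad_or_truncate` are hypothetical stand-ins (the notebook uses MAX_SEQLEN = 300):

```python
import numpy as np

MAX_LEN = 8  # shortened stand-in for the notebook's MAX_SEQLEN = 300

def pad_or_truncate(ids):
    ids = ids[:MAX_LEN]  # truncate sequences longer than MAX_LEN
    return np.pad(ids, [0, MAX_LEN - len(ids)], mode='constant')  # right-pad with 0

print(pad_or_truncate(np.arange(1, 4)))   # [1 2 3 0 0 0 0 0]
print(pad_or_truncate(np.arange(1, 12)))  # [1 2 3 4 5 6 7 8]
```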
# Assemble the i-th batch of samples
def get_batch_data(data, i):
    d = data[i * BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)     # stack the BATCH samples into one numpy array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # convert the numpy arrays to paddle tensors
    label = P.to_tensor(label)
    return feature, label
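The unzip-and-stack step can be seen on a two-sample toy dataset (the pairs below are made up):

```python
import numpy as np

# A mini "dataset": (feature, label) pairs, as produced by make_data
data = [(np.array([1, 2]), 0), (np.array([3, 4]), 1)]

feature, label = zip(*data)   # unzip into a tuple of features and a tuple of labels
feature = np.stack(feature)   # shape (2, 2): one row per sample
label = np.stack(label)
print(feature.shape, label)   # (2, 2) [0 1]
```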
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=6)
# Model hyperparameters
BATCH = 32
MAX_SEQLEN = 300
LR = 5e-5
EPOCH = 13
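The 6-fold StratifiedKFold keeps each validation fold's label distribution close to the full set's, which matters for a 25-class problem with imbalanced classes. A small sketch on toy labels (`skf_demo` and `y` are hypothetical; the notebook likewise passes the label column as both arguments of `split`, since X's content is ignored):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced label vector: 12 samples of class 0, 6 of class 1
y = np.array([0] * 12 + [1] * 6)
skf_demo = StratifiedKFold(n_splits=6)

# Every validation fold keeps the full set's 2:1 class ratio
for _, val_idx in skf_demo.split(y, y):
    print(np.bincount(y[val_idx]))  # [2 1] in each of the 6 folds
```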
# Multi-fold model training
fold_idx = 0
for train_idx, val_idx in skf.split(train_df['label'], train_df['label']):
    print(train_idx, val_idx)
    # ERNIE classification model
    ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=25)
    # Gradient clipping: clamp every gradient element into [-0.1, 0.1]
    clip = P.nn.ClipGradByValue(min=-0.1, max=0.1)
    # Optimizer
    optimizer = P.optimizer.Adam(LR, parameters=ernie.parameters(), grad_clip=clip)
    tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
    train_data = make_data(train_df.iloc[train_idx])
    val_data = make_data(train_df.iloc[val_idx])

    # Track the best validation accuracy across all epochs of this fold
    best_val_acc = 0
    # Per-epoch training and validation
    for i in range(EPOCH):
        np.random.shuffle(train_data)
        ernie.train()
        for j in range(len(train_data) // BATCH):
            feature, label = get_batch_data(train_data, j)
            loss, _ = ernie(feature, labels=label)
            loss.backward()
            optimizer.minimize(loss)
            ernie.clear_gradients()
            if j % 50 == 0:
                print('Train %d/%d: loss %.5f' % (j, len(train_data) // BATCH, loss.numpy()))
            # Evaluate every 100 training steps; use a separate loop variable
            # so the training counter j is not overwritten
            if j % 100 == 0:
                all_pred, all_label = [], []
                with P.no_grad():
                    ernie.eval()
                    for k in range(len(val_data) // BATCH):
                        feature, label = get_batch_data(val_data, k)
                        loss, logits = ernie(feature, labels=label)
                        all_pred.extend(logits.argmax(-1).numpy())
                        all_label.extend(label.numpy())
                    ernie.train()
                acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
                # Keep only the best checkpoint of this fold
                if acc > best_val_acc:
                    best_val_acc = acc
                    P.save(ernie.state_dict(), str(fold_idx) + '.bin')
                print('Val acc %.5f' % acc)
    fold_idx += 1
[ 2653 2679 2722 ... 15243 15244 15245] [ 0 1 2 ... 3958 4443 4471]
downloading https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz: 788478KB [00:13, 59626.31KB/s]
W1229 14:33:03.098490 102 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1229 14:33:03.103865 102 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ernie/modeling_ernie.py:296: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
log.warn('param:%s not set in pretrained model, skip' % k)
[WARNING] 2021-12-29 14:33:11,718 [modeling_ernie.py: 296]: param:classifier.weight not set in pretrained model, skip
[WARNING] 2021-12-29 14:33:11,719 [modeling_ernie.py: 296]: param:classifier.bias not set in pretrained model, skip
Train 0/381: loss 3.55544
Val acc 0.02072
Train 50/381: loss 1.88277
Train 100/381: loss 1.69001
Val acc 0.58717
Train 150/381: loss 1.36154
Train 200/381: loss 1.21011
Val acc 0.62204
Train 250/381: loss 1.08967
Train 300/381: loss 1.02198
Val acc 0.61711
Train 350/381: loss 1.18056
Train 0/381: loss 1.07412
Val acc 0.64671
Train 50/381: loss 1.13818
Train 100/381: loss 1.13416
Val acc 0.64539
Train 150/381: loss 1.46287
Train 200/381: loss 0.87446
Val acc 0.65658
Train 250/381: loss 0.71587
Train 300/381: loss 0.92604
Val acc 0.64803
Train 350/381: loss 1.28124
Train 0/381: loss 0.72489
Val acc 0.66546
Train 50/381: loss 0.74182
Train 100/381: loss 1.04708
Val acc 0.68125
Train 150/381: loss 0.99416
Train 200/381: loss 0.53880
Val acc 0.67599
Train 250/381: loss 1.10793
Train 300/381: loss 0.98753
Val acc 0.68257
Train 350/381: loss 0.94299
Train 0/381: loss 0.50831
Val acc 0.68322
Train 50/381: loss 0.66269
Train 100/381: loss 0.57772
Val acc 0.68684
Train 150/381: loss 0.43587
Train 200/381: loss 0.70022
Val acc 0.67862
Train 250/381: loss 0.65823
Train 300/381: loss 0.63757
Val acc 0.68191
Train 350/381: loss 0.33489
Train 0/381: loss 0.57369
Val acc 0.68980
Train 50/381: loss 0.48750
Train 100/381: loss 0.60838
Val acc 0.68717
Train 150/381: loss 0.22357
Train 200/381: loss 0.73708
Val acc 0.67928
Train 250/381: loss 0.51203
Train 300/381: loss 0.63198
Val acc 0.68520
Train 350/381: loss 0.27435
Train 0/381: loss 0.49044
Val acc 0.66579
Train 50/381: loss 0.29290
Train 100/381: loss 0.35392
Val acc 0.69112
Train 150/381: loss 0.50911
Train 200/381: loss 0.50242
Val acc 0.68520
Train 250/381: loss 0.34875
Train 300/381: loss 0.50026
Val acc 0.69342
Train 350/381: loss 0.60847
Train 0/381: loss 1.38546
Val acc 0.59408
Train 50/381: loss 0.66777
Train 100/381: loss 0.29971
Val acc 0.69605
Train 150/381: loss 0.10559
Train 200/381: loss 0.13398
Val acc 0.69276
Train 250/381: loss 0.22911
Train 300/381: loss 0.47197
Val acc 0.68224
Train 350/381: loss 0.58273
Train 0/381: loss 0.31320
Val acc 0.67072
Train 50/381: loss 0.27460
Train 100/381: loss 0.59036
Val acc 0.65921
Train 150/381: loss 0.22631
Train 200/381: loss 0.37249
Val acc 0.67467
Train 250/381: loss 0.50558
Train 300/381: loss 0.12548
Val acc 0.68355
Train 350/381: loss 0.46684
Train 0/381: loss 0.27444
Val acc 0.68289
Train 50/381: loss 0.26523
Train 100/381: loss 0.25900
Val acc 0.68059
Train 150/381: loss 0.32721
Train 200/381: loss 0.12822
Val acc 0.68520
Train 250/381: loss 0.24605
Train 300/381: loss 0.59573
Val acc 0.65822
Train 350/381: loss 0.33798
Train 0/381: loss 0.13392
Val acc 0.69211
Train 50/381: loss 0.04590
Train 100/381: loss 0.06209
Val acc 0.68849
Train 150/381: loss 0.17020
Train 200/381: loss 0.05582
Val acc 0.68224
Train 250/381: loss 0.21385
Train 300/381: loss 0.11703
Val acc 0.68224
Train 350/381: loss 0.08684
[ 0 1 2 ... 15243 15244 15245] [2653 2679 2722 ... 7321 7606 7650]
[WARNING] 2021-12-29 15:19:22,368 [modeling_ernie.py: 296]: param:classifier.weight not set in pretrained model, skip
[WARNING] 2021-12-29 15:19:22,369 [modeling_ernie.py: 296]: param:classifier.bias not set in pretrained model, skip
Train 0/381: loss 3.17975
Val acc 0.07105
Train 50/381: loss 1.97960
Train 100/381: loss 1.57657
Val acc 0.54638
Train 150/381: loss 1.33006
Train 200/381: loss 1.07533
Val acc 0.59605
Train 250/381: loss 0.64593
Train 300/381: loss 1.17791
Val acc 0.63026
Train 350/381: loss 1.05878
Train 0/381: loss 1.05816
Val acc 0.63586
Train 50/381: loss 1.06569
Train 100/381: loss 1.12862
Val acc 0.63586
Train 150/381: loss 0.92812
Train 200/381: loss 0.75556
Val acc 0.64836
Train 250/381: loss 0.98756
Train 300/381: loss 0.96614
Val acc 0.65691
Train 350/381: loss 1.00483
Train 0/381: loss 0.66806
Val acc 0.64342
Train 50/381: loss 0.95530
Train 100/381: loss 1.01876
Val acc 0.66480
Train 150/381: loss 0.84060
Train 200/381: loss 1.15432
Val acc 0.65987
Train 250/381: loss 1.05941
Train 300/381: loss 0.98219
Val acc 0.67467
Train 350/381: loss 0.90607
Train 0/381: loss 0.71226
Val acc 0.67566
Train 50/381: loss 0.85803
Train 100/381: loss 0.56040
Val acc 0.67829
Train 150/381: loss 0.66480
Train 200/381: loss 0.80573
Val acc 0.68322
Train 250/381: loss 0.53809
Train 300/381: loss 0.67504
Val acc 0.67993
Train 350/381: loss 0.98953
Train 0/381: loss 0.58755
Val acc 0.68816
Train 50/381: loss 0.73488
Train 100/381: loss 0.84386
Val acc 0.68191
Train 150/381: loss 0.98317
Train 200/381: loss 1.47266
Val acc 0.67434
Train 250/381: loss 0.93344
Train 300/381: loss 0.57652
Val acc 0.68914
Train 350/381: loss 0.50576
Train 0/381: loss 0.50948
Val acc 0.69638
Train 50/381: loss 0.84004
Train 100/381: loss 0.56445
Val acc 0.68586
Train 150/381: loss 0.35402
Train 200/381: loss 0.33610
Val acc 0.68421
Train 250/381: loss 0.77689
Train 300/381: loss 0.34096
Val acc 0.69112
Train 350/381: loss 0.93121
Train 0/381: loss 0.21750
Val acc 0.68355
Train 50/381: loss 0.57992
Train 100/381: loss 0.38603
Val acc 0.68388
Train 150/381: loss 0.15227
Train 200/381: loss 0.52117
Val acc 0.68092
Train 250/381: loss 0.32740
Train 300/381: loss 0.51278
Train 350/381: loss 0.24217
Train 0/381: loss 0.44215
Val acc 0.67961
Train 50/381: loss 0.16680
Train 100/381: loss 0.15252
Val acc 0.68355
Train 150/381: loss 0.31004
Train 200/381: loss 0.18911
Val acc 0.68651
Train 250/381: loss 0.11511
Train 300/381: loss 1.61287
Val acc 0.59211
Train 350/381: loss 0.61160
Train 0/381: loss 0.51912
Val acc 0.64408
Train 50/381: loss 0.36144
Train 100/381: loss 0.18402
Val acc 0.67007
Train 150/381: loss 0.32190
Train 200/381: loss 0.33891
Val acc 0.68158
Train 250/381: loss 0.26074
Train 300/381: loss 0.13387
Val acc 0.68092
Train 350/381: loss 0.26407
Train 0/381: loss 0.31835
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
/tmp/ipykernel_102/2210071886.py in <module>
39 loss, logits = ernie(feature, labels=label)
40
---> 41 all_pred.extend(logits.argmax(-1).numpy())
42 all_label.extend(label.numpy())
43 ernie.train()
KeyboardInterrupt:
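A note on the `ClipGradByValue(min=-0.1, max=0.1)` used in the training cell: it clamps every gradient element into [-0.1, 0.1], which is equivalent to an elementwise numpy clip (the `grad` values below are made up):

```python
import numpy as np

# ClipGradByValue(min=-0.1, max=0.1) clamps each gradient element into [-0.1, 0.1]
grad = np.array([-0.5, -0.05, 0.0, 0.08, 2.0])
clipped = np.clip(grad, -0.1, 0.1)
print(clipped)  # clipped == [-0.1, -0.05, 0.0, 0.08, 0.1]
```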
import glob

test_df['content'] = test_df['content'].fillna('')
test_df['label'] = 0  # dummy labels so make_data and the model's loss path still work
test_data = make_data(test_df.iloc[:])

# Multi-fold model ensemble: sum the logits predicted by each fold's checkpoint
test_pred_tta = None
for path in glob.glob('./*.bin'):
    ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=25)
    ernie.load_dict(P.load(path))

    test_pred = []
    with P.no_grad():
        ernie.eval()
        # Ceil division so the final partial batch is included without producing
        # an empty batch when len(test_data) divides BATCH evenly
        for j in range((len(test_data) + BATCH - 1) // BATCH):
            feature, label = get_batch_data(test_data, j)
            loss, logits = ernie(feature, labels=label)
            test_pred.append(logits.numpy())

    if test_pred_tta is None:
        test_pred_tta = np.vstack(test_pred)
    else:
        test_pred_tta += np.vstack(test_pred)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ernie/modeling_ernie.py:296: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
log.warn('param:%s not set in pretrained model, skip' % k)
[WARNING] 2021-12-29 16:04:32,465 [modeling_ernie.py: 296]: param:classifier.weight not set in pretrained model, skip
[WARNING] 2021-12-29 16:04:32,467 [modeling_ernie.py: 296]: param:classifier.bias not set in pretrained model, skip
[WARNING] 2021-12-29 16:05:08,231 [modeling_ernie.py: 296]: param:classifier.weight not set in pretrained model, skip
[WARNING] 2021-12-29 16:05:08,233 [modeling_ernie.py: 296]: param:classifier.bias not set in pretrained model, skip
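Summing each fold's logits and taking the argmax is a simple soft-voting ensemble. With hypothetical logits for two samples and three classes:

```python
import numpy as np

# Hypothetical per-fold logits for 2 samples x 3 classes
fold1 = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
fold2 = np.array([[0.0, 3.0, 0.0], [0.0, 0.5, 1.0]])

total = fold1 + fold2   # accumulate logits across folds, as test_pred_tta does
pred = total.argmax(1)  # the class with the highest summed score wins
print(pred)             # [1 2]
```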
# Generate the submission file
pd.DataFrame({
    'id': test_df['id'],
    'label': test_pred_tta.argmax(1)
}).to_csv('submit.csv', index=False)
Directions for Improvement
- Consider extending the model's original token vocabulary.
- Adversarial training can be used to improve accuracy.
- Multiple models can be ensembled.
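The adversarial-training suggestion is commonly realized with FGM (Fast Gradient Method). The notebook does not implement it; the sketch below shows only the core perturbation step on a toy logistic-regression model, with all names hypothetical (for ERNIE one would perturb the word embeddings rather than the raw input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x):
    # binary cross-entropy for a positive example (y = 1)
    return -np.log(sigmoid(w @ x))

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # toy model weights
x = rng.normal(size=3)  # clean input
eps = 0.1               # perturbation radius

# Gradient of the loss with respect to the input (y = 1)
grad_x = (sigmoid(w @ x) - 1.0) * w

# FGM step: move the input along the normalized gradient, i.e. in the
# direction that increases the loss the most
x_adv = x + eps * grad_x / (np.linalg.norm(grad_x) + 1e-12)

print(loss(w, x_adv) > loss(w, x))  # True: the adversarial input is harder
```

In full FGM training, the model would be trained on the loss of both the clean and the perturbed example, then the perturbation restored before the next step.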