LIC 2022 Video Semantic Understanding Baseline (Quick-Start Edition)
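Reposted from AI Studio. Original link:
LIC 2022 Video Semantic Understanding Baseline (Quick-Start Edition) - PaddlePaddle AI Studio
LIC2022 Video Semantic Understanding Baseline
❗️ This is the quick-start edition. It trains on a subset of the competition training set to shorten the loading time of the online project, so that participants can run the whole pipeline end to end: one-click training plus prediction, producing a result file that can be submitted and scored.
This baseline scores slightly lower than the full-data baseline (🚩 link to be provided later), e.g., 0.33 vs. 0.38.
❗️ Please start a GPU environment. Model training and prediction cannot run in the CPU environment, and switching from a CPU environment to a GPU environment is very slow.
❗️ Each time the environment is opened, loading and syncing files takes about 15 minutes. Please wait until loading finishes.
😃 AI Studio provides 8 free compute points per day, good for up to 16 hours of GPU time. Participants can also apply for the 100-hour free V100 GPU compute code offered by AI Studio.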
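1. Background
This project is the quick-start baseline for the 👉 LIC2022 Video Semantic Understanding evaluation task 👈. It is adapted from the GitHub baseline and provides participants with storage and compute. The model has two parts: 1) the video classification tag model, paddle-video-classify-tag; 2) the video semantic tag model, paddle-video-semantic-tag.
The video classification tag model classifies a video by its content within a closed two-level tag taxonomy, yielding classification tags that describe the video. The video semantic tag model extracts entity semantic tags from the video's text information (participants can upgrade it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tag understanding, or generating tags). The tags produced by the two models correspond to the classification tags and the semantic tags of the evaluation, respectively.
2. Quick Start
2.1 Environment Setup
Run the following commands to add the library path.
Note: the dependencies in the commented-out lines are already installed persistently in the project directory and do not need to be reinstalled.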
In [ ]
# !mkdir /home/aistudio/external-libraries
# !pip install opencv-python -i https://mirror.baidu.com/pypi/simple -t /home/aistudio/external-libraries
# !pip install paddlenlp==2.0.1 -i https://mirror.baidu.com/pypi/simple -t /home/aistudio/external-libraries
# !pip install tqdm wget -t /home/aistudio/external-libraries
!tar -xvf external-libraries.tar
!rm external-libraries.tar
import os
import sys
sys.path.append('/home/aistudio/external-libraries')
!mkdir paddle-video-classify-tag && cd paddle-video-classify-tag && mkdir data
!mkdir paddle-video-semantic-tag && cd paddle-video-semantic-tag && mkdir data
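The `!mkdir` commands above report an error if the cell is rerun after the directories already exist. As a minimal sketch, assuming the same two directory names relative to the working directory, the layout can also be created idempotently from Python. The helper name `ensure_project_dirs` is hypothetical and not part of the baseline.

```python
import os


def ensure_project_dirs(root="."):
    """Create <root>/<project>/data for both baseline projects; safe to rerun."""
    created = []
    for project in ("paddle-video-classify-tag", "paddle-video-semantic-tag"):
        data_dir = os.path.join(root, project, "data")
        # exist_ok=True makes the call a no-op when the directory is already there.
        os.makedirs(data_dir, exist_ok=True)
        created.append(data_dir)
    return created


if __name__ == "__main__":
    for path in ensure_project_dirs("."):
        print("ready:", path)
```

Calling the helper a second time returns the same paths without raising, which is convenient when the notebook kernel is restarted.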
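2.2 Data Loading
The data includes:
- The sample training set (a sample of the training set used in the competition): video information plus the officially provided TSN visual features
- Test set A (the full leaderboard-A test set used in the competition): video information plus the officially provided TSN visual features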
In [ ]
!tar -zxvf /home/aistudio/data/data142559/dataset_sample.tar.gz
!cd paddle-video-classify-tag && unzip /home/aistudio/data/data142559/tsn_features_test_a.zip
!cd paddle-video-classify-tag && unzip /home/aistudio/data/data142559/tsn_features_train_sample.zip
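After extraction it can be useful to confirm that the feature archives were unpacked where the later cells expect them. The sketch below counts `.npy` files in the two feature directories; the directory names and the `.npy` suffix are taken from the data-preparation code in Section 2.3.1, while the helper name `count_feature_files` is hypothetical.

```python
import os


def count_feature_files(feature_dir, suffix=".npy"):
    """Return the number of files in feature_dir whose name ends with suffix."""
    if not os.path.isdir(feature_dir):
        return 0  # directory missing: nothing was extracted here
    return sum(1 for name in os.listdir(feature_dir) if name.endswith(suffix))


if __name__ == "__main__":
    for name in ("tsn_features_train_sample", "tsn_features_test_a"):
        path = os.path.join("paddle-video-classify-tag", name)
        print(name, count_feature_files(path))
```

A count of 0 for either directory suggests the corresponding `unzip` command did not run or ran in a different working directory.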
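2.3 Video Classification Tag Baseline
This baseline builds on VideoTag, PaddlePaddle's large-scale video classification model. It classifies a video by its content within a closed two-level tag taxonomy and outputs classification tags that describe the video.
2.3.1 Data Preparation
The video classification tag model uses a TSN network to extract vector representations of the raw videos. Because this step is time-consuming, we provide the TSN features for all videos; running the commands in Section 2.2 (Data Loading) is enough.
The dataset has two levels of labels, and we run classification experiments separately for level-1 (level1) and level-2 (level2) labels. Each setting needs its own train/validation/test split. The preparation step yields the following output:
paddle-video-classify-tag
   |-- weights
      |-- attention_lstm.pdmodel
      |-- attention_lstm.pdopt
      |-- attention_lstm.pdparams
Run the following code to prepare the label set of the video semantic understanding dataset and the sample lists for training, validation, and testing.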
In [ ]
import os
import os.path as osp
import requests
import time
import codecs
import json
import argparse
import random


def create_splits_indice(n_samples, SPLITS):
    assert sum([v for k, v in SPLITS]) == 1.0
    indices = list(range(n_samples))
    random.shuffle(indices)
    split2indice = {}
    r_offset = 0
    for idx, (split, ratio) in enumerate(SPLITS):
        l_offset = r_offset
        if idx == len(SPLITS) - 1:
            r_offset = n_samples
        else:
            r_offset = int(n_samples * ratio) + l_offset
        split2indice[split] = indices[l_offset:r_offset]
    return split2indice


def prepare_split(data, split_name, test_only=False, gather_labels=False, classify_tag_dir='/home/aistudio/paddle-video-classify-tag'):
    '''
    1. Prepare ALL (unique) labels for classification from trainval-set.
    2. For each split, generate sample list for level1 & level2 classification.
    '''
    trainval_tsn_feature_dir = '/home/aistudio/paddle-video-classify-tag/tsn_features_train_sample'
    test_tsn_feature_dir = '/home/aistudio/paddle-video-classify-tag/tsn_features_test_a'
    cls_data_dir = os.path.join(classify_tag_dir, 'data')
    if not os.path.exists(cls_data_dir):
        os.mkdir(cls_data_dir)

    sample_nids = [sample["@id"] for sample in data]
    level1_labels = []
    level2_labels = []
    if not test_only:
        for sample in data:
            category = {
                each["@meta"]["type"]: each["@value"]
                for each in sample["category"]
            }
            level1_labels.append(category["level1"])
            level2_labels.append(category["level2"])

    def create_sample_list(sample_labels, level_name):
        save_label_file = cls_data_dir + "/{}_label.txt".format(level_name)
        if gather_labels:
            # For trainval set:
            # Gather candidate labels and dump to {level1,level2}_label.txt
            labels = sorted([str(label) for label in list(set(sample_labels))])
            with codecs.open(save_label_file, "w", encoding="utf-8") as ouf:
                ouf.writelines([label + "\n" for label in labels])
            print("Saved " + save_label_file)
        else:
            # For test set: load existing labels.
            with codecs.open(save_label_file, "r", encoding="utf-8") as inf:
                labels = [line.strip() for line in inf.readlines()]
        label2idx = {label: idx for idx, label in enumerate(labels)}
        sample_lines = []
        # Generate sample list: one sample per line (feature_path -> label)
        for i in range(len(sample_nids)):
            label_indice = label2idx[str(sample_labels[i])] if not test_only \
                else -1
            if split_name in ["train", "val", "trainval"]:
                tsn_feature_dir = trainval_tsn_feature_dir
            elif split_name in ["test"]:
                tsn_feature_dir = test_tsn_feature_dir
            feature_path = osp.join(tsn_feature_dir,
                                    "{}.npy".format(sample_nids[i]))
            if osp.exists(feature_path):
                line = "{} {}\n".format(feature_path, str(label_indice))
                sample_lines.append(line)
        save_split_file = cls_data_dir + "/{}_{}.list".format(level_name, split_name)
        with codecs.open(save_split_file, "w", encoding="utf-8") as ouf:
            ouf.writelines(sample_lines)
        print("Saved {}, size={}".format(save_split_file,
                                         len(sample_lines)))

    create_sample_list(level1_labels, "level1")
    create_sample_list(level2_labels, "level2")


random.seed(6666)
classify_tag_dir = '/home/aistudio/paddle-video-classify-tag'
if not os.path.exists(classify_tag_dir):
    os.mkdir(classify_tag_dir)
trainval_path = '/home/aistudio/dataset_sample/train.sample.json'
test_path = '/home/aistudio/dataset_sample/test_a.json'

# load data for train & validation (have labels).
with codecs.open(trainval_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(trainval_path))
    lines = inf.readlines()
    trainval_data = [json.loads(line) for line in lines]

# load data for test (no labels).
with codecs.open(test_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(test_path))
    lines = inf.readlines()
    test_data = [json.loads(line) for line in lines]

# split the trainval data into train-set(80%) and validation-set(20%).
split2indice = create_splits_indice(
    len(trainval_data), [
        ("train", 4.0 / 5.0),
        ("val", 1.0 / 5.0),
    ])
train_data = [trainval_data[idx] for idx in split2indice["train"]]
val_data = [trainval_data[idx] for idx in split2indice["val"]]

prepare_split(trainval_data, "trainval", gather_labels=True)
prepare_split(train_data, "train")
prepare_split(val_data, "val")
prepare_split(test_data, "test", test_only=True)
Loading /home/aistudio/dataset_sample/train.sample.json...
Loading /home/aistudio/dataset_sample/test_a.json...
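The preparation code writes one sample per line to each `.list` file, in the form `<feature_path> <label_index>`, with `-1` as the label index for test samples. As a rough sketch for checking a generated split, the helpers below parse such a file and count samples per label index. The function names are hypothetical, and the path in the usage example assumes the files written by the cell above.

```python
import os
from collections import Counter


def read_sample_list(list_path):
    """Parse a '<feature_path> <label_index>' list file into (path, label) pairs."""
    samples = []
    with open(list_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the last space so a path containing spaces still parses.
            path, label = line.rsplit(" ", 1)
            samples.append((path, int(label)))
    return samples


def label_histogram(samples):
    """Count how many samples carry each label index."""
    return Counter(label for _, label in samples)


if __name__ == "__main__":
    list_file = "/home/aistudio/paddle-video-classify-tag/data/level1_train.list"
    if os.path.exists(list_file):
        hist = label_histogram(read_sample_list(list_file))
        print("classes seen:", len(hist), "samples:", sum(hist.values()))
```

A strongly skewed histogram is one possible explanation for the per-class average precision values of 0 that appear in the training log for rare classes.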
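2.3.2 Training and Validation
Run the following code to train the classification models.
Participants can refer to the model fine-tuning guide in the original repository for more information.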
In [4]
import os
import sys
import argparse
import ast
import logging

import paddle
paddle.enable_static()

sys.path.append('/home/aistudio/external-libraries')
sys.path.append('/home/aistudio/work/paddle-video-classify-tag')
from utils.train_utils import train_with_dataloader
import models
from utils.config_utils import *
from reader import get_reader
from metrics import get_metrics
from utils.utility import check_cuda
from utils.utility import check_version

logging.root.handlers = []
FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)


class Args():
    model_name = 'AttentionLSTM'
    config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
    batch_size = None
    learning_rate = None
    pretrain = '/home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm'
    use_gpu = True
    no_memory_optimize = False
    epoch = None
    valid_interval = 1
    save_dir = os.path.join('paddle-video-classify-tag', 'data', 'checkpoints', 'level1')
    log_interval = 50
    fix_random_seed = False


def train(args):
    # parse config
    config = parse_config(args.config)
    train_config = merge_configs(config, 'train', vars(args))
    valid_config = merge_configs(config, 'valid', vars(args))
    # print_configs(train_config, 'Train')
    train_model = models.get_model(args.model_name, train_config, mode='train')
    valid_model = models.get_model(args.model_name, valid_config, mode='valid')

    # build model
    startup = paddle.static.Program()
    train_prog = paddle.static.Program()
    if args.fix_random_seed:
        startup.random_seed = 1000
        train_prog.random_seed = 1000
    with paddle.static.program_guard(train_prog, startup):
        with paddle.utils.unique_name.guard():
            train_model.build_input(use_dataloader=True)
            train_model.build_model()
            # for the input, has the form [data1, data2,..., label], so train_feeds[-1] is label
            train_feeds = train_model.feeds()
            train_fetch_list = train_model.fetches()
            train_loss = train_fetch_list[0]
            optimizer = train_model.optimizer()
            optimizer.minimize(train_loss)
            train_dataloader = train_model.dataloader()

    valid_prog = paddle.static.Program()
    with paddle.static.program_guard(valid_prog, startup):
        with paddle.utils.unique_name.guard():
            valid_model.build_input(use_dataloader=True)
            valid_model.build_model()
            valid_feeds = valid_model.feeds()
            valid_fetch_list = valid_model.fetches()
            valid_dataloader = valid_model.dataloader()

    place = paddle.CUDAPlace(0) if args.use_gpu else paddle.CPUPlace()
    exe = paddle.static.Executor(place)
    exe.run(startup)

    if args.pretrain:
        train_model.load_pretrain_params(exe, args.pretrain, train_prog)

    build_strategy = paddle.fluid.compiler.BuildStrategy()
    build_strategy.enable_inplace = True
    exec_strategy = paddle.static.ExecutionStrategy()

    compiled_train_prog = paddle.static.CompiledProgram(
        train_prog).with_data_parallel(
            loss_name=train_loss.name,
            build_strategy=build_strategy,
            exec_strategy=exec_strategy)
    compiled_valid_prog = paddle.static.CompiledProgram(
        valid_prog).with_data_parallel(
            share_vars_from=compiled_train_prog,
            build_strategy=build_strategy,
            exec_strategy=exec_strategy)

    # get reader
    bs_denominator = 1
    if args.use_gpu:
        # check number of GPUs
        gpus = os.getenv("CUDA_VISIBLE_DEVICES", "")
        if gpus == "":
            pass
        else:
            gpus = gpus.split(",")
            num_gpus = len(gpus)
            assert num_gpus == train_config.TRAIN.num_gpus, \
                "num_gpus({}) set by CUDA_VISIBLE_DEVICES " \
                "should be the same as that " \
                "set in {}({})".format(
                    num_gpus, args.config, train_config.TRAIN.num_gpus)
        bs_denominator = train_config.TRAIN.num_gpus

    train_config.TRAIN.batch_size = int(train_config.TRAIN.batch_size /
                                        bs_denominator)
    valid_config.VALID.batch_size = int(valid_config.VALID.batch_size /
                                        bs_denominator)
    train_reader = get_reader(args.model_name.upper(), 'train', train_config)
    valid_reader = get_reader(args.model_name.upper(), 'valid', valid_config)

    # get metrics
    train_metrics = get_metrics(args.model_name.upper(), 'train', train_config)
    valid_metrics = get_metrics(args.model_name.upper(), 'valid', valid_config)

    epochs = args.epoch or train_model.epoch_num()

    exe_places = paddle.static.cuda_places(
    ) if args.use_gpu else paddle.static.cpu_places()
    train_dataloader.set_sample_list_generator(train_reader, places=exe_places)
    valid_dataloader.set_sample_list_generator(valid_reader, places=exe_places)

    train_with_dataloader(
        exe,
        train_prog,
        compiled_train_prog,
        train_dataloader,
        train_fetch_list,
        train_metrics,
        epochs=epochs,
        log_interval=args.log_interval,
        valid_interval=args.valid_interval,
        save_dir=args.save_dir,
        save_model_name=args.model_name,
        fix_random_seed=args.fix_random_seed,
        compiled_test_prog=compiled_valid_prog,
        test_dataloader=valid_dataloader,
        test_fetch_list=valid_fetch_list,
        test_metrics=valid_metrics)


args = Args()
# check whether the installed paddle is compiled with GPU
check_cuda(args.use_gpu)
check_version()
args.model_name = 'AttentionLSTM'
args.log_interval = 50

# level-1 labels
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
args.save_dir = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1'
if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)
train(args)

# level-2 labels
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level2.yaml'
args.save_dir = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level2'
if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)
train(args)
Training output (the long per-class aps lists are abbreviated as [...]):
[INFO: regularizer.py: 101]: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000800] in Optimizer will not take effect, and it will only be applied to other Parameters!
[INFO: attention_lstm.py: 164]: Load pretrain weights from /home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm, exclude fc layer.
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:05:55] Epoch 0, iter 0, time 2.767573356628418, , loss = 2923.538330, Hit@1 = 0.02, PERR = 0.02, GAP = 0.02
[INFO: train_utils.py: 122]: [TRAIN] Epoch 0 training finished, average time: 1.7017907901686065
share_vars_from is set, scope is ignored.
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 302.009247, Hit@1 = 0.53, PERR = 0.53, GAP = 0.60
[INFO: metrics_util.py: 124]: [TEST] Epoch0 Finish avg_hit_at_one: 0.505859375, avg_perr: 0.505859375, avg_loss :329.6025136311849, aps: [...], gap:0.5390077300563219
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:08:10] Epoch 1, iter 0, time 2.6242427825927734, , loss = 364.424683, Hit@1 = 0.48, PERR = 0.48, GAP = 0.44
[INFO: train_utils.py: 122]: [TRAIN] Epoch 1 training finished, average time: 1.7430822508675712
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 248.946411, Hit@1 = 0.62, PERR = 0.62, GAP = 0.70
[INFO: metrics_util.py: 124]: [TEST] Epoch1 Finish avg_hit_at_one: 0.5944010416666666, avg_perr: 0.5944010416666666, avg_loss :282.882386525472, aps: [...], gap:0.6346135299340665
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:10:24] Epoch 2, iter 0, time 1.708108901977539, , loss = 315.439453, Hit@1 = 0.55, PERR = 0.55, GAP = 0.56
[INFO: train_utils.py: 122]: [TRAIN] Epoch 2 training finished, average time: 1.795787981578282
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 235.408524, Hit@1 = 0.63, PERR = 0.63, GAP = 0.72
[INFO: metrics_util.py: 124]: [TEST] Epoch2 Finish avg_hit_at_one: 0.6002604166666666, avg_perr: 0.6002604166666666, avg_loss :271.76447041829425, aps: [...], gap:0.6512923596529407
[INFO: regularizer.py: 101]: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000800] in Optimizer will not take effect, and it will only be applied to other Parameters!
[INFO: attention_lstm.py: 164]: Load pretrain weights from /home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm, exclude fc layer.
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:13:03] Epoch 0, iter 0, time 2.3079347610473633, , loss = 24682.550781, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
[INFO: train_utils.py: 122]: [TRAIN] Epoch 0 training finished, average time: 1.7917025527175592
share_vars_from is set, scope is ignored.
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 600.804932, Hit@1 = 0.30, PERR = 0.30, GAP = 0.24
[INFO: metrics_util.py: 124]: [TEST] Epoch0 Finish avg_hit_at_one: 0.21549479166666666, avg_perr: 0.21549479166666666, avg_loss :640.4301401774088, aps: [...], gap:0.1536302281405276
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:15:21] Epoch 1, iter 0, time 2.288398265838623, , loss = 643.896606, Hit@1 = 0.26, PERR = 0.26, GAP = 0.17
[INFO: train_utils.py: 122]: [TRAIN] Epoch 1 training finished, average time: 1.7937103728858792
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 506.925232, Hit@1 = 0.42, PERR = 0.42, GAP = 0.35
[INFO: metrics_util.py: 124]: [TEST] Epoch1 Finish avg_hit_at_one: 0.322265625, avg_perr: 0.322265625, avg_loss :543.9810791015625, aps: [...], gap:0.2629719332516419
[INFO: train_utils.py: 45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py: 80]: [TRAIN 2022-04-28 22:17:41] Epoch 2, iter 0, time 2.2935118675231934, , loss = 554.811279, Hit@1 = 0.31, PERR = 0.31, GAP = 0.25
[INFO: train_utils.py: 122]: [TRAIN] Epoch 2 training finished, average time: 1.8651155549652723
[INFO: metrics_util.py: 80]: [TEST] test_iter 0 , loss = 457.075806, Hit@1 = 0.42, PERR = 0.42, GAP = 0.43
[INFO: metrics_util.py: 124]: [TEST] Epoch2 Finish avg_hit_at_one: 0.3645833333333333, avg_perr: 0.3645833333333333, avg_loss :490.911376953125, aps: [...], gap:0.35114517815351143
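The training log prints one `[TEST] EpochN Finish ...` summary per epoch, and the inference cell later loads the `epoch2` checkpoint. As a small sketch for choosing a checkpoint from a saved log, the function below extracts the per-epoch `avg_hit_at_one` and `gap` values and returns the epoch with the highest validation GAP. It assumes the log text of a single training run (level 1 or level 2, not both mixed) in the format shown above; the names `EPOCH_SUMMARY` and `best_epoch_by_gap` are hypothetical.

```python
import re

# Matches e.g. "[TEST] Epoch2 Finish avg_hit_at_one: 0.60, ..., gap:0.65".
# DOTALL lets the match span the long per-class "aps" list even if it wraps lines.
EPOCH_SUMMARY = re.compile(
    r"\[TEST\] Epoch(\d+) Finish avg_hit_at_one: ([\d.]+),.*?gap:([\d.]+)",
    re.DOTALL,
)


def best_epoch_by_gap(log_text):
    """Return (epoch, avg_hit_at_one, gap) for the epoch with the highest gap."""
    rows = [
        (int(epoch), float(hit), float(gap))
        for epoch, hit, gap in EPOCH_SUMMARY.findall(log_text)
    ]
    if not rows:
        return None  # no epoch summaries found in the text
    return max(rows, key=lambda row: row[2])


if __name__ == "__main__":
    sample = (
        "[TEST] Epoch0 Finish avg_hit_at_one: 0.505859375, avg_perr: 0.505859375, "
        "avg_loss :329.6025136311849, aps: [0.2, 0.7], gap:0.5390077300563219\n"
        "[TEST] Epoch1 Finish avg_hit_at_one: 0.5944010416666666, avg_perr: 0.59, "
        "avg_loss :282.882386525472, aps: [0.26, 0.85], gap:0.6346135299340665\n"
    )
    print(best_epoch_by_gap(sample))
```

In the run shown above, validation GAP rises every epoch for both label levels, so the last epoch is also the best one.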
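2.3.3 Generating Classification Tag Results
Run the following code block to generate the tag predictions.
The generated results are stored in ./predict_results/level{1, 2}_top1.json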
In [1]
import os
import sys
import time
import logging
import argparse
import ast
import numpy as np
import paddle
try:
    import cPickle as pickle
except ImportError:
    import pickle

sys.path.append('/home/aistudio/external-libraries')
sys.path.append('/home/aistudio/work/paddle-video-classify-tag')
from utils.config_utils import *
import models
from reader import get_reader
from metrics import get_metrics
from utils.utility import check_cuda
from utils.utility import check_version

logging.root.handlers = []
FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)
logger = logging.getLogger(__name__)


class Args():
    model_name = 'AttentionLSTM'
    config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
    use_gpu = True
    batch_size = None
    weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1/AttentionLSTM_epoch2.pdparams'
    filelist = None
    log_interval = 200
    infer_topk = 10
    save_dir = './predict_results'
    save_file = "top1.json"
    label_file = '/home/aistudio/paddle-video-classify-tag/data/level1_label.txt'
    video_path = None


def infer(args):
    # parse config
    config = parse_config(args.config)
    infer_config = merge_configs(config, 'infer', vars(args))
    print_configs(infer_config, "Infer")
    infer_model = models.get_model(args.model_name, infer_config, mode='infer')
    infer_model.build_input(use_dataloader=False)
    infer_model.build_model()
    infer_feeds = infer_model.feeds()
    infer_outputs = infer_model.outputs()

    place = paddle.CUDAPlace(0) if args.use_gpu else paddle.CPUPlace()
    exe = paddle.static.Executor(place)
    exe.run(paddle.static.default_startup_program())

    filelist = args.filelist or infer_config.INFER.filelist
    filepath = args.video_path or infer_config.INFER.get('filepath', '')
    if filepath != '':
        assert os.path.exists(filepath), "{} not exist.".format(filepath)
    else:
        assert os.path.exists(filelist), "{} not exist.".format(filelist)

    # get infer reader
    infer_reader = get_reader(args.model_name.upper(), 'infer', infer_config)

    if args.weights:
        assert os.path.exists(
            args.weights), "Given weight dir {} not exist.".format(
                args.weights)
    # if no weight files specified, download weights from paddle
    weights = args.weights or infer_model.get_weights()

    infer_model.load_test_weights(exe, weights,
                                  paddle.static.default_main_program())

    infer_feeder = paddle.fluid.DataFeeder(place=place, feed_list=infer_feeds)
    fetch_list = infer_model.fetches()

    infer_metrics = get_metrics(args.model_name.upper(), 'infer', infer_config)
    infer_metrics.reset()

    periods = []
    cur_time = time.time()
    for infer_iter, data in enumerate(infer_reader()):
        # the last item of each sample is the video id; the rest are model inputs
        data_feed_in = [items[:-1] for items in data]
        video_id = [items[-1] for items in data]
        infer_outs = exe.run(fetch_list=fetch_list,
                             feed=infer_feeder.feed(data_feed_in))
        infer_result_list = [item for item in infer_outs] + [video_id]

        prev_time = cur_time
        cur_time = time.time()
        period = cur_time - prev_time
        periods.append(period)

        infer_metrics.accumulate(infer_result_list)

        if args.log_interval > 0 and infer_iter % args.log_interval == 0:
            logger.info('Processed {} samples'.format((infer_iter + 1) * len(
                video_id)))

    logger.info('[INFER] infer finished. average time: {}'.format(
        np.mean(periods)))

    if not os.path.isdir(args.save_dir):
        os.makedirs(args.save_dir)

    infer_metrics.finalize_and_log_out(
        savedir=args.save_dir,
        savefile=args.save_file,
        label_file=args.label_file)


args = Args()

# level-1 labels
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
args.weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1/AttentionLSTM_epoch2.pdparams'
args.label_file = '/home/aistudio/paddle-video-classify-tag/data/level1_label.txt'
args.save_file = 'level1_top1.json'
infer(args)

# level-2 labels
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level2.yaml'
args.weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level2/AttentionLSTM_epoch2.pdparams'
args.label_file = '/home/aistudio/paddle-video-classify-tag/data/level2_label.txt'
args.save_file = 'level2_top1.json'
infer(args)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
[INFO: config_utils.py: 70]: ---------------- Infer Arguments ----------------
[INFO: config_utils.py: 72]: MODEL:
[INFO: config_utils.py: 74]: name:AttentionLSTM
[INFO: config_utils.py: 74]: dataset:YouTube-8M
[INFO: config_utils.py: 74]: bone_nework:None
[INFO: config_utils.py: 74]: drop_rate:0.5
[INFO: config_utils.py: 74]: feature_names:['rgb']
[INFO: config_utils.py: 74]: feature_dims:[2048]
[INFO: config_utils.py: 74]: embedding_size:1024
[INFO: config_utils.py: 74]: lstm_size:512
[INFO: config_utils.py: 74]: num_classes:278
[INFO: config_utils.py: 74]: topk:20
[INFO: config_utils.py: 72]: TRAIN:
[INFO: config_utils.py: 74]: epoch:3
[INFO: config_utils.py: 74]: learning_rate:0.000125
[INFO: config_utils.py: 74]: decay_epochs:[5]
[INFO: config_utils.py: 74]: decay_gamma:0.1
[INFO: config_utils.py: 74]: weight_decay:0.0008
[INFO: config_utils.py: 74]: num_samples:35952
[INFO: config_utils.py: 74]: pretrain_base:None
[INFO: config_utils.py: 74]: batch_size:128
[INFO: config_utils.py: 74]: use_gpu:True
[INFO: config_utils.py: 74]: num_gpus:1
[INFO: config_utils.py: 74]: filelist:/home/aistudio/paddle-video-classify-tag/data/level2_train.list
[INFO: config_utils.py: 72]: VALID:
[INFO: config_utils.py: 74]: batch_size:128
[INFO: config_utils.py: 74]: filelist:/home/aistudio/paddle-video-classify-tag/data/level2_val.list
[INFO: config_utils.py: 72]: TEST:
[INFO: config_utils.py: 74]: batch_size:128
[INFO: config_utils.py: 74]: filelist:/home/aistudio/paddle-video-classify-tag/data/level2_val.list
[INFO: config_utils.py: 72]: INFER:
[INFO: config_utils.py: 74]: batch_size:1
[INFO: config_utils.py: 74]: filelist:/home/aistudio/paddle-video-classify-tag/data/level2_test.list
[INFO: config_utils.py: 75]: -------------------------------------------------
W0428 23:33:54.649835 8407 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0428 23:33:54.655040 8407 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[INFO: 3560385804.py: 106]: Processed 1 samples
[INFO: 3560385804.py: 106]: Processed 201 samples
[INFO: 3560385804.py: 106]: Processed 401 samples
[INFO: 3560385804.py: 106]: Processed 601 samples
[INFO: 3560385804.py: 106]: Processed 801 samples
[INFO: 3560385804.py: 106]: Processed 1001 samples
[INFO: 3560385804.py: 106]: Processed 1201 samples
[INFO: 3560385804.py: 106]: Processed 1401 samples
[INFO: 3560385804.py: 106]: Processed 1601 samples
[INFO: 3560385804.py: 106]: Processed 1801 samples
[INFO: 3560385804.py: 106]: Processed 2001 samples
[INFO: 3560385804.py: 106]: Processed 2201 samples
[INFO: 3560385804.py: 106]: Processed 2401 samples
[INFO: 3560385804.py: 106]: Processed 2601 samples
[INFO: 3560385804.py: 106]: Processed 2801 samples
[INFO: 3560385804.py: 106]: Processed 3001 samples
[INFO: 3560385804.py: 106]: Processed 3201 samples
[INFO: 3560385804.py: 106]: Processed 3401 samples
[INFO: 3560385804.py: 106]: Processed 3601 samples
[INFO: 3560385804.py: 106]: Processed 3801 samples
[INFO: 3560385804.py: 106]: Processed 4001 samples
[INFO: 3560385804.py: 106]: Processed 4201 samples
[INFO: 3560385804.py: 106]: Processed 4401 samples
[INFO: 3560385804.py: 106]: Processed 4601 samples
[INFO: 3560385804.py: 106]: Processed 4801 samples
[INFO: 3560385804.py: 106]: Processed 5001 samples
[INFO: 3560385804.py: 106]: Processed 5201 samples
[INFO: 3560385804.py: 106]: Processed 5401 samples
[INFO: 3560385804.py: 106]: Processed 5601 samples
[INFO: 3560385804.py: 106]: Processed 5801 samples
[INFO: 3560385804.py: 106]: Processed 6001 samples
[INFO: 3560385804.py: 106]: Processed 6201 samples
[INFO: 3560385804.py: 106]: Processed 6401 samples
[INFO: 3560385804.py: 106]: Processed 6601 samples
[INFO: 3560385804.py: 106]: Processed 6801 samples
[INFO: 3560385804.py: 106]: Processed 7001 samples
[INFO: 3560385804.py: 106]: Processed 7201 samples
[INFO: 3560385804.py: 106]: Processed 7401 samples
[INFO: 3560385804.py: 106]: Processed 7601 samples
[INFO: 3560385804.py: 106]: Processed 7801 samples
[INFO: 3560385804.py: 106]: Processed 8001 samples
[INFO: 3560385804.py: 106]: Processed 8201 samples
[INFO: 3560385804.py: 106]: Processed 8401 samples
[INFO: 3560385804.py: 106]: Processed 8601 samples
[INFO: 3560385804.py: 106]: Processed 8801 samples
[INFO: 3560385804.py: 106]: Processed 9001 samples
[INFO: 3560385804.py: 106]: Processed 9201 samples
[INFO: 3560385804.py: 106]: Processed 9401 samples
[INFO: 3560385804.py: 106]: Processed 9601 samples
[INFO: 3560385804.py: 106]: Processed 9801 samples
[INFO: 3560385804.py: 109]: [INFER] infer finished. average time: 0.03070716172327805
[INFO: metrics_util.py: 119]: Saved ./predict_results/level2_top1.json
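2.4 Video semantic tag baseline
This baseline provides semantic tag understanding for videos: it extracts, from a video's text information, the semantic tags that capture the gist of the video content. Participants can improve on it, for example by reasoning over the provided knowledge base, fusing multimodal information to improve tag understanding, or generating tags.
2.4.1 Data processing
First, convert the data into the format required by a named entity recognition model and split it into training, validation, and test sets. See the PaddleNLP Chinese named entity recognition project for reference.
Note: during data processing we drop semantic tags that do not appear in the title.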
In [8]
import os
import sys
import pandas as pd
import json
import codecs
import argparse
import random

sys.path.append('/home/aistudio/external-libraries')

TAG_NAMES = ["B-ENT", "I-ENT", "O"]


class Args():
    trainval_path = '/home/aistudio/dataset_sample/train.sample.json'
    test_path = '/home/aistudio/dataset_sample/test_a.json'


def gather_text_and_tags(sample, test_only=False):
    def fill_tags(surf):
        '''
        For entities that appear in text, replace their tags with 'B-ENT/I-ENT'.
        '''
        s_idx = text.find(surf)
        if s_idx != -1:
            tags[s_idx] = TAG_NAMES[0]
            for i in range(s_idx + 1, s_idx + len(surf)):
                tags[i] = TAG_NAMES[1]
            return 1
        return 0

    text = sample["title"].replace(" ", "").replace("\t", "")
    # init tag sequence with all 'O's.
    tags = [TAG_NAMES[2] for i in range(len(text))]
    entities = []
    if not test_only:
        entities = [each["@value"] for each in sample["tag"]]
    # annotate 'B-ENT' and 'I-ENT' tags.
    n_bingo_entities = sum(
        [fill_tags(surf) for surf in entities if len(surf) > 0])
    # statistics
    stats = {
        "txt_length": len(text),
        "n_entities": len(entities),
        "n_bingo_entities": n_bingo_entities,
    }
    return text, tags, stats


def stat_numberic_list(li, name="default"):
    assert isinstance(li, list)
    stat = {}
    stat["size"] = len(li)
    if all(isinstance(x, int) for x in li):
        stat["max"] = max(li)
        stat["min"] = min(li)
        stat["sum"] = sum(li)
        stat["avr"] = stat["sum"] / float(len(li))
    print("list-%s:\n\t%s" % (name, str(stat)))


def analyze_annots(stats_list):
    for key in ["txt_length", "n_entities", "n_bingo_entities"]:
        numbers = [stats[key] for stats in stats_list]
        stat_numberic_list(numbers, name=key)


def prepare_split(data, split_name, test_only=False):
    sample_lines = []
    nid_lines = []
    stats_list = []
    for idx in range(len(data)):
        text, tags, stats = gather_text_and_tags(
            data[idx], test_only=test_only)
        if len(text) == 0:
            continue
        # proper data format.
        text = '\002'.join([ch for ch in text])
        tags = '\002'.join(tags)
        sample_lines.append('\t'.join([text, tags]) + "\n")
        nid_lines.append(data[idx]["@id"] + "\n")
        stats_list.append(stats)
    if split_name == "trainval":
        # print statistics.
        analyze_annots(stats_list)
    save_split_file = "/home/aistudio/paddle-video-semantic-tag/data/{}.tsv".format(split_name)
    with codecs.open(save_split_file, "w", encoding="utf-8") as ouf:
        ouf.writelines(sample_lines)
    print("Saved {}, size={}".format(save_split_file, len(sample_lines)))
    with codecs.open("/home/aistudio/paddle-video-semantic-tag/data/nids.txt", "w", encoding="utf-8") as ouf:
        ouf.writelines(nid_lines)


def create_splits_indice(n_samples, SPLITS):
    assert sum([v for k, v in SPLITS]) == 1.0
    indices = list(range(n_samples))
    random.shuffle(indices)
    split2indice = {}
    r_offset = 0
    for idx, (split, ratio) in enumerate(SPLITS):
        l_offset = r_offset
        if idx == len(SPLITS) - 1:
            r_offset = n_samples
        else:
            r_offset = int(n_samples * ratio) + l_offset
        split2indice[split] = indices[l_offset:r_offset]
    return split2indice


args = Args()
random.seed(6666)

# load data for train & validation (have labels).
with codecs.open(args.trainval_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(args.trainval_path))
    lines = inf.readlines()
    trainval_data = [json.loads(line) for line in lines]

# load data for test (no labels).
with codecs.open(args.test_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(args.test_path))
    lines = inf.readlines()
    test_data = [json.loads(line) for line in lines]

# split the trainval data into train-set(75%) and validation-set(25%).
split2indice = create_splits_indice(
    len(trainval_data), [
        ("train", 3.0 / 4.0),
        ("val", 1.0 / 4.0),
    ])
train_data = [trainval_data[idx] for idx in split2indice["train"]]
val_data = [trainval_data[idx] for idx in split2indice["val"]]

label_map_file = "/home/aistudio/paddle-video-semantic-tag/data/label_map.json"
with open(label_map_file, "w") as ouf:
    json.dump({tag: idx for idx, tag in enumerate(TAG_NAMES)}, ouf)
    print("Saved " + label_map_file)

# each call overwrites nids.txt, so the "test" split must be prepared last.
prepare_split(trainval_data, "trainval")
prepare_split(train_data, "train")
prepare_split(val_data, "val")
prepare_split(test_data, "test", test_only=True)
Loading /home/aistudio/dataset_sample/train.sample.json...
Loading /home/aistudio/dataset_sample/test_a.json...
Saved /home/aistudio/paddle-video-semantic-tag/data/label_map.json
list-txt_length:
	{'size': 8079, 'max': 67, 'min': 1, 'sum': 118072, 'avr': 14.614680034657754}
list-n_entities:
	{'size': 8079, 'max': 38, 'min': 0, 'sum': 21425, 'avr': 2.6519371209308082}
list-n_bingo_entities:
	{'size': 8079, 'max': 8, 'min': 0, 'sum': 8292, 'avr': 1.0263646490902338}
Saved /home/aistudio/paddle-video-semantic-tag/data/trainval.tsv, size=8079
Saved /home/aistudio/paddle-video-semantic-tag/data/train.tsv, size=6061
Saved /home/aistudio/paddle-video-semantic-tag/data/val.tsv, size=2018
Saved /home/aistudio/paddle-video-semantic-tag/data/test.tsv, size=9938
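To make the data format concrete, here is a minimal sketch of the tagging and serialization performed above, applied to one made-up sample. The title and entity strings are hypothetical and are not taken from the dataset; the helper names `tag_title`, `to_tsv_line` and `from_tsv_line` are introduced only for this illustration.

```python
TAG_NAMES = ["B-ENT", "I-ENT", "O"]
SEP = "\002"  # separator placed between characters and between tags


def tag_title(title, entities):
    """Mark the first occurrence of each entity in the title with B-ENT/I-ENT."""
    text = title.replace(" ", "").replace("\t", "")
    tags = ["O"] * len(text)
    for surf in entities:
        start = text.find(surf) if surf else -1
        if start == -1:
            continue  # entity does not occur in the title, so it is dropped
        tags[start] = "B-ENT"
        for i in range(start + 1, start + len(surf)):
            tags[i] = "I-ENT"
    return text, tags


def to_tsv_line(text, tags):
    """Serialize one sample as '<chars>\\t<tags>', both joined by \\002."""
    return SEP.join(text) + "\t" + SEP.join(tags)


def from_tsv_line(line):
    """Inverse of to_tsv_line: recover the character list and the tag list."""
    chars, tags = line.rstrip("\n").split("\t")
    return chars.split(SEP), tags.split(SEP)


# Hypothetical sample: the second entity is absent from the title.
text, tags = tag_title("周杰伦 演唱会", ["周杰伦", "音乐"])
line = to_tsv_line(text, tags)
print(text, tags)
```

Only "周杰伦" receives B-ENT/I-ENT tags here; "音乐" is silently skipped, which is why `n_bingo_entities` in the statistics above is smaller than `n_entities`.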
2.4.2 Training and validation
This model uses the bert-wwm-ext-chinese model from the PaddleNLP model library; for more models, see the PaddleNLP Transformer API.
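During training, the validation split is scored with chunk-level precision, recall and F1. The sketch below shows, in plain Python, how such numbers can be derived from BIO tag sequences. It is an illustration under stated assumptions, not the PaddleNLP `ChunkEvaluator` implementation: the helper names are hypothetical, there is a single entity type, and an I-ENT tag without a preceding B-ENT is ignored (a library may handle that case differently).

```python
def extract_chunks(tags):
    """Return the set of (start, end) spans of B-ENT/I-ENT runs; end is exclusive."""
    chunks, start = set(), None
    # A trailing "O" sentinel closes a chunk that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag != "I-ENT":
            chunks.add((start, i))
            start = None
        if tag == "B-ENT":
            start = i
    return chunks


def chunk_prf(gold_seqs, pred_seqs):
    """Chunk-level precision, recall and F1 over parallel lists of tag sequences."""
    n_pred = n_gold = n_correct = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_chunks(gold), extract_chunks(pred)
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)  # a chunk counts only if both boundaries match
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


# Toy example: the prediction finds one of the two gold chunks.
gold = [["B-ENT", "I-ENT", "O", "B-ENT"]]
pred = [["B-ENT", "I-ENT", "O", "O"]]
p, r, f1 = chunk_prf(gold, pred)
print(p, r, f1)
```

Because a chunk is counted as correct only on an exact span match, a prediction that gets one boundary wrong lowers both precision and recall.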
In [ ]
import argparse
import os
import sys
import random
import time
import math
from functools import partial
import json
import numpy as np
import paddle
from paddle.io import DataLoader
import paddlenlp as ppnlp
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.metrics import ChunkEvaluator
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import BertForTokenClassification, BertTokenizer
from paddlenlp.data import Stack, Tuple, Pad, Dict
sys.path.append('/home/aistudio/external-libraries')


class Args():
    model_name_or_path = None
    output_dir = None
    max_seq_length = 128
    batch_size = 8
    learning_rate = 5e-5
    weight_decay = 0.0
    adam_epsilon = 1e-8
    max_grad_norm = 1.0
    num_train_epochs = 3
    max_steps = -1
    warmup_steps = 0
    logging_steps = 1
    save_steps = 100
    seed = 42
    device = 'gpu'


def evaluate(model, loss_fct, metric, data_loader, label_num):
    model.eval()
    metric.reset()
    avg_loss, precision, recall, f1_score = 0, 0, 0, 0
    for batch in data_loader:
        input_ids, token_type_ids, length, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = loss_fct(logits, labels)
        avg_loss = paddle.mean(loss)
        preds = logits.argmax(axis=2)
        num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(
            None, length, preds, labels)
        metric.update(num_infer_chunks.numpy(),
                      num_label_chunks.numpy(), num_correct_chunks.numpy())
        precision, recall, f1_score = metric.accumulate()
    print("eval loss: %f, precision: %f, recall: %f, f1: %f" %
          (avg_loss, precision, recall, f1_score))
    model.train()


def tokenize_and_align_labels(example,
                              tokenizer,
                              no_entity_id,
                              max_seq_len=512):
    labels = example['labels']
    example = example['tokens']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return tokenized_input


def _read(data_file, label_map_file):
    with open(label_map_file, "r") as inf:
        tag2label = json.load(inf)
    with open(data_file, 'r', encoding='utf-8') as inf:
        for line in inf:
            line_stripped = line.strip().split('\t')
            assert len(line_stripped) == 2
            tokens = line_stripped[0].split("\002")
            tags = line_stripped[1].split("\002")
            labels = [tag2label[tag] for tag in tags]
            yield {"tokens": tokens, "labels": labels}


def do_train(args):
    paddle.set_device(args.device)
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    # Create dataset, tokenizer and dataloader.
    train_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/train.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    test_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/val.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    train_ds.label_list = test_ds.label_list = ["B-ENT", "I-ENT", "O"]

    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)

    label_list = train_ds.label_list
    label_num = len(label_list)
    no_entity_id = label_num - 1

    trans_func = partial(
        tokenize_and_align_labels,
        tokenizer=tokenizer,
        no_entity_id=no_entity_id,
        max_seq_len=args.max_seq_length)

    train_ds = train_ds.map(trans_func)

    ignore_label = -100

    batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
        'seq_len': Stack(),  # seq_len
        'labels': Pad(axis=0, pad_val=ignore_label)  # label
    }): fn(samples)

    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True)

    train_data_loader = DataLoader(
        dataset=train_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_sampler=train_batch_sampler,
        return_list=True)

    test_ds = test_ds.map(trans_func)

    test_data_loader = DataLoader(
        dataset=test_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_size=args.batch_size,
        return_list=True)

    # Define the model network and its loss
    model = BertForTokenClassification.from_pretrained(
        args.model_name_or_path, num_classes=label_num)
    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    num_training_steps = args.max_steps if args.max_steps > 0 else len(
        train_data_loader) * args.num_train_epochs

    lr_scheduler = LinearDecayWithWarmup(args.learning_rate,
                                         num_training_steps, args.warmup_steps)

    # Generate parameter names needed to perform weight decay.
    # All bias and LayerNorm parameters are excluded.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params)

    loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label)

    metric = ChunkEvaluator(label_list=label_list)

    global_step = 0
    last_step = args.num_train_epochs * len(train_data_loader)
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        for step, batch in enumerate(train_data_loader):
            global_step += 1
            input_ids, token_type_ids, _, labels = batch
            logits = model(input_ids, token_type_ids)
            loss = loss_fct(logits, labels)
            avg_loss = paddle.mean(loss)
            if global_step % args.logging_steps == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                    % (global_step, epoch, step, avg_loss,
                       args.logging_steps / (time.time() - tic_train)))
                tic_train = time.time()
            avg_loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
            # evaluate and save a checkpoint every save_steps steps and at the last step.
            if global_step % args.save_steps == 0 or global_step == last_step:
                if paddle.distributed.get_rank() == 0:
                    evaluate(model, loss_fct, metric, test_data_loader,
                             label_num)
                    paddle.save(
                        model.state_dict(),
                        os.path.join(args.output_dir,
                                     "model_%d.pdparams" % global_step))


args = Args()
args.model_name_or_path = 'bert-wwm-ext-chinese'
args.max_seq_length = 128
args.batch_size = 32
args.learning_rate = 2e-5
args.num_train_epochs = 3
args.logging_steps = 1
args.save_steps = 500
args.output_dir = '/home/aistudio/paddle-video-semantic-tag/data/checkpoints/semantic_tag'
args.device = 'gpu'
do_train(args)
In [4]
!ls /home/aistudio/paddle-video-semantic-tag/data/checkpoints/se*
model_500.pdparams model_567.pdparams
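The two checkpoint names follow from the settings above. As a small worked check, assuming a single GPU, the 6061 training samples reported in section 2.4.1, and `drop_last=True` as in the training code, the number of optimizer steps can be computed as follows. The function name `training_schedule` is hypothetical.

```python
def training_schedule(n_samples, batch_size, epochs, save_steps, drop_last=True):
    """Return (steps per epoch, last step, steps at which a checkpoint is saved)."""
    if drop_last:
        steps_per_epoch = n_samples // batch_size  # incomplete final batch is dropped
    else:
        steps_per_epoch = -(-n_samples // batch_size)  # ceiling division
    last_step = steps_per_epoch * epochs
    saved = list(range(save_steps, last_step + 1, save_steps))
    if last_step not in saved:
        saved.append(last_step)  # the final step is always saved
    return steps_per_epoch, last_step, saved


# Values from this notebook: 6061 samples, batch size 32, 3 epochs, save every 500 steps.
steps_per_epoch, last_step, saved = training_schedule(6061, 32, 3, 500)
print(steps_per_epoch, last_step, saved)
```

With these numbers there are 189 steps per epoch and 567 steps in total, so checkpoints are written at steps 500 and 567, matching the listing above.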
2.4.3 Generating semantic tag results
The recognition results are saved to ./predict_results/ents_results.json.
In [5]
import argparse
import os
import sys
import ast
import random
import time
import math
from functools import partial
import json
import codecs
from tqdm import tqdm
import numpy as np
import paddle
from paddle.io import DataLoader
import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.transformers import BertForTokenClassification, BertTokenizer
sys.path.append('/home/aistudio/external-libraries')


class Args():
    model_name_or_path = None
    output_dir = None
    max_seq_length = 128
    batch_size = 8
    learning_rate = 5e-5
    weight_decay = 0.0
    adam_epsilon = 1e-8
    max_grad_norm = 1.0
    num_train_epochs = 3
    max_steps = -1
    warmup_steps = 0
    logging_steps = 1
    save_steps = 100
    seed = 42
    device = 'gpu'


def tokenize_and_align_labels(example,
                              tokenizer,
                              no_entity_id,
                              max_seq_len=512):
    labels = example['labels']
    example = example['tokens']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return tokenized_input


def parse_decodes(input_words, id2label, decodes, lens):
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]

    outputs = []
    entities_list = []
    for idx, end in enumerate(lens):
        sent = "".join(input_words[idx]['tokens'])
        tags = [id2label[x] for x in decodes[idx][1:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            if t.startswith('B-') or t == 'O':
                if len(words):
                    sent_out.append(words)
                if t.startswith('B-'):
                    tags_out.append(t.split('-')[1])
                else:
                    tags_out.append(t)
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join(
            [str((s, t)) for s, t in zip(sent_out, tags_out)]))
        entities_list.append(
            [s for s, t in zip(sent_out, tags_out) if t == "ENT"])
    return outputs, entities_list


def _read(data_file, label_map_file):
    with open(label_map_file, "r") as inf:
        tag2label = json.load(inf)
    with open(data_file, 'r', encoding='utf-8') as inf:
        for line in inf:
            line_stripped = line.strip().split('\t')
            assert len(line_stripped) == 2
            tokens = line_stripped[0].split("\002")
            tags = line_stripped[1].split("\002")
            labels = [tag2label[tag] for tag in tags]
            yield {"tokens": tokens, "labels": labels}


def do_predict(args):
    paddle.set_device(args.device)

    # Create dataset, tokenizer and dataloader.
    train_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/train.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    predict_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/test.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    train_ds.label_list = predict_ds.label_list = ["B-ENT", "I-ENT", "O"]

    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)

    label_list = train_ds.label_list
    label_num = len(label_list)
    no_entity_id = label_num - 1

    trans_func = partial(
        tokenize_and_align_labels,
        tokenizer=tokenizer,
        no_entity_id=no_entity_id,
        max_seq_len=args.max_seq_length)

    ignore_label = -100

    batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
        'seq_len': Stack(),
        'labels': Pad(axis=0, pad_val=ignore_label)  # label
    }): fn(samples)

    # keep the untokenized samples so that predictions can be mapped back to characters.
    raw_data = predict_ds.data

    id2label = dict(enumerate(predict_ds.label_list))

    predict_ds = predict_ds.map(trans_func)
    predict_data_loader = DataLoader(
        dataset=predict_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_size=args.batch_size,
        return_list=True)

    # Define the model network
    model = BertForTokenClassification.from_pretrained(
        args.model_name_or_path, num_classes=label_num)
    if args.init_checkpoint_path:
        model_dict = paddle.load(args.init_checkpoint_path)
        model.set_dict(model_dict)

    model.eval()

    pred_list = []
    len_list = []
    for step, batch in tqdm(enumerate(predict_data_loader)):
        input_ids, token_type_ids, length, labels = batch
        logits = model(input_ids, token_type_ids)
        pred = paddle.argmax(logits, axis=-1)
        pred_list.append(pred.numpy())
        len_list.append(length.numpy())

    preds, entities_list = parse_decodes(raw_data, id2label, pred_list,
                                         len_list)

    save_dir = "predict_results"
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)

    file_path = os.path.join(save_dir, "ner_results.txt")
    with open(file_path, "w", encoding="utf8") as fout:
        fout.write("\n".join(preds))
    # Print some examples
    # print(
    #     "The results have been saved in the file: %s, some examples are shown below: "
    #     % file_path)
    # print("\n".join(preds[:10]))

    # nids.txt lists the test-set ids in the same order as test.tsv.
    with open("/home/aistudio/paddle-video-semantic-tag/data/nids.txt", "r") as inf:
        lines = inf.readlines()
    nid2ents = {}
    for entities, nid in zip(entities_list, lines):
        nid2ents[nid.strip()] = entities

    save_json = os.path.join(save_dir, "ents_results.json")
    with codecs.open(save_json, "w", encoding="utf-8") as ouf:
        json.dump(nid2ents, ouf, ensure_ascii=False)
    print("Saved " + save_json)


args = Args()
args.model_name_or_path = 'bert-wwm-ext-chinese'
args.init_checkpoint_path = '/home/aistudio/paddle-video-semantic-tag/data/checkpoints/semantic_tag/model_567.pdparams'
args.max_seq_length = 128
args.batch_size = 32
args.device = 'gpu'
do_predict(args)
[2022-04-29 02:13:12,761] [ INFO] - Found /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese-vocab.txt
[2022-04-29 02:13:12,777] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese.pdparams
311it [00:13, 23.29it/s]
Saved predict_results/ents_results.json
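The step that turns predicted tag ids into entity strings can be sketched in a simplified form. The version below is an illustration, not a copy of `parse_decodes`: it assumes a single entity type and drops an I-ENT tag that has no preceding B-ENT, whereas `parse_decodes` appends such characters to the previous segment. The function name `decode_entities`, the id `nid-0001` and the sample text are hypothetical.

```python
def decode_entities(chars, tags):
    """Collect the character runs tagged B-ENT followed by zero or more I-ENT."""
    entities, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B-ENT":
            if current:
                entities.append(current)  # a new entity starts right after another
            current = ch
        elif tag == "I-ENT" and current:
            current += ch
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)  # entity that runs to the end of the text
    return entities


# Hypothetical prediction for one test sample.
chars = list("周杰伦在北京开演唱会")
tags = ["B-ENT", "I-ENT", "I-ENT", "O", "B-ENT", "I-ENT",
        "O", "B-ENT", "I-ENT", "I-ENT"]
nid2ents = {"nid-0001": decode_entities(chars, tags)}
print(nid2ents)
```

The resulting mapping from id to entity list has the same shape as the contents of `ents_results.json`.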
3. Generating and submitting the result file
After running the code below, submit the result.txt file on the competition's result submission page.
In [6]
import os
import os.path as osp
import codecs
import json


class Args():
    test_path = '/home/aistudio/dataset_sample/test_a.json'
    category_level1_result = '/home/aistudio/predict_results/level1_top1.json'
    category_level2_result = '/home/aistudio/predict_results/level2_top1.json'
    tag_result = '/home/aistudio/predict_results/ents_results.json'


if __name__ == "__main__":
    args = Args()
    with codecs.open(args.test_path, "r", encoding="utf-8") as inf:
        print("Loading {}...".format(args.test_path))
        lines = inf.readlines()
        nids = [json.loads(line)["@id"] for line in lines]

    # load the prediction results of 'paddle-video-classify-tag' model on test-set
    with codecs.open(
            args.category_level1_result, "r", encoding="utf-8") as inf:
        pred_level1 = json.load(inf)
    with codecs.open(
            args.category_level2_result, "r", encoding="utf-8") as inf:
        pred_level2 = json.load(inf)
    # load the prediction results of 'paddle-video-semantic-tag' model on test-set
    with codecs.open(args.tag_result, "r", encoding="utf-8") as inf:
        pred_tags = json.load(inf)

    # merge results and generate an entry for each nid.
    submission_lines = []
    for nid in nids:
        level1_category = pred_level1[nid]["class_name"] \
            if nid in pred_level1 else ""
        level2_category = pred_level2[nid]["class_name"] \
            if nid in pred_level2 else ""
        tags = pred_tags[nid] if nid in pred_tags else []
        result = {
            "@id": nid,
            "category": [
                {
                    "@meta": {
                        "type": "level1"
                    },
                    "@value": level1_category
                },
                {
                    "@meta": {
                        "type": "level2"
                    },
                    "@value": level2_category
                },
            ],
            "tag": [{
                "@value": tag
            } for tag in tags],
        }
        submission_lines.append(json.dumps(result, ensure_ascii=False) + "\n")

    with codecs.open("result.txt", "w", encoding="utf-8") as ouf:
        ouf.writelines(submission_lines)
    print("Saved result.txt")
Loading /home/aistudio/dataset_sample/test_a.json...
Saved result.txt
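Before uploading, it can be useful to confirm that each line of result.txt is valid JSON with the structure built above. The following is a minimal sketch of such a check. The expected shape is inferred only from the code in this section, not from an official specification; the helper names, the id `nid-0001` and the category strings are hypothetical.

```python
import json


def build_entry(nid, level1, level2, tags):
    """Build one submission record with the same keys as the cell above."""
    return {
        "@id": nid,
        "category": [
            {"@meta": {"type": "level1"}, "@value": level1},
            {"@meta": {"type": "level2"}, "@value": level2},
        ],
        "tag": [{"@value": tag} for tag in tags],
    }


def is_well_formed(line):
    """Return True if a line parses as JSON and has the expected structure."""
    try:
        obj = json.loads(line)
    except ValueError:
        return False
    if not isinstance(obj, dict) or set(obj) != {"@id", "category", "tag"}:
        return False
    cats, tags = obj["category"], obj["tag"]
    if not isinstance(cats, list) or not isinstance(tags, list):
        return False
    try:
        levels = [c["@meta"]["type"] for c in cats]
    except (KeyError, TypeError):
        return False
    tags_ok = all(isinstance(t, dict) and "@value" in t for t in tags)
    return levels == ["level1", "level2"] and tags_ok


# Hypothetical record serialized the same way as in the cell above.
entry = build_entry("nid-0001", "level1-name", "level2-name", ["周杰伦"])
line = json.dumps(entry, ensure_ascii=False)
print(is_well_formed(line))
```

To check a real file, one could apply `is_well_formed` to every line read from result.txt and report the line numbers that fail.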
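Be sure to run this before exiting
Run the command below to delete the large temporary files; they are regenerated on the next launch. Leaving them in place may make the online project load slowly next time.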
In [7]
!rm -rf dataset_sample paddle-video-classify-tag paddle-video-semantic-tag predict_results tsn_features_test_a tsn_features_train_sample