
LIC2022 Video Semantic Understanding Baseline

❗️ This is the quick-start version: its training set is a subset of the official competition training set, which cuts the online project's loading time and lets participants quickly run through the whole pipeline with one-click training + prediction, producing a submittable result file that receives a score.

Its score is slightly lower than that of the full-data baseline (🚩 link to follow), e.g., 0.33 vs. 0.38.

❗️ Please choose a GPU environment when launching (model training and prediction cannot run in a CPU environment, and switching from CPU to GPU is very slow).

❗️ Each time the environment is opened, file loading and synchronization take about 15 minutes; please wait for them to finish.

😃 AI Studio provides 8 free compute points per day, enough for up to 16 hours of GPU time. In addition, participants can apply for AI Studio's 100-hour free V100 GPU compute voucher.

1. Background

This project is the quick-start baseline for the 👉 LIC2022 Video Semantic Understanding Evaluation 👈. It is adapted from the GitHub baseline and provides participants with storage and compute support. It consists of two models: 1) the video classification tag model, paddle-video-classify-tag; 2) the video semantic tag model, paddle-video-semantic-tag.

The classification tag model classifies video content over a closed two-level label taxonomy, yielding classification tags that describe the video. The semantic tag model extracts entity-level semantic tags from the video's text (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information for better tag understanding, or generating tags). The outputs of the two models correspond to the classification tags and semantic tags of the evaluation, respectively.

2. Quick Start

2.1 Environment Setup

Run the commands below to set up the paths and directory layout.

Note: The dependencies in the commented-out lines have been persisted into the project directory and do not need to be reinstalled.

In [ ]

# !mkdir /home/aistudio/external-libraries
# !pip install opencv-python -i https://mirror.baidu.com/pypi/simple -t /home/aistudio/external-libraries
# !pip install paddlenlp==2.0.1 -i https://mirror.baidu.com/pypi/simple -t /home/aistudio/external-libraries
# !pip install tqdm wget -t /home/aistudio/external-libraries
!tar -xvf external-libraries.tar
!rm external-libraries.tar
import os
import sys 
sys.path.append('/home/aistudio/external-libraries')

!mkdir paddle-video-classify-tag && cd paddle-video-classify-tag && mkdir data
!mkdir paddle-video-semantic-tag && cd paddle-video-semantic-tag && mkdir data
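
As a quick optional check, the sketch below verifies that the persisted dependencies are importable from external-libraries (a minimal sketch; the package names follow the pinned installs in the commented-out lines above).

In [ ]

import sys
sys.path.append('/home/aistudio/external-libraries')

# These come from the persisted installs above; versions are printed for reference.
import cv2
import paddlenlp
import tqdm
print(paddlenlp.__version__)  # expected: 2.0.1, matching the pinned install
print(cv2.__version__)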

2.2 Data Loading

The data includes:

  • Video metadata for the sample training set (a subsample of the competition training set), plus the officially provided TSN visual features

  • Video metadata for the A-list test set (the complete A-list test set used in the competition), plus the officially provided TSN visual features

In [ ]

!tar -zxvf /home/aistudio/data/data142559/dataset_sample.tar.gz
!cd paddle-video-classify-tag && unzip /home/aistudio/data/data142559/tsn_features_test_a.zip
!cd paddle-video-classify-tag && unzip /home/aistudio/data/data142559/tsn_features_train_sample.zip
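
To get a feel for the data format, the sketch below prints one training record and loads its TSN feature (a minimal sketch; the field names follow the preparation code in section 2.3.1, and the feature shape is not documented here, though the model config uses feature_dims=[2048]).

In [ ]

import os
import json
import numpy as np

with open('/home/aistudio/dataset_sample/train.sample.json', encoding='utf-8') as f:
    sample = json.loads(f.readline())  # one JSON object per line
print(sample['@id'], sample['title'])
print(sample['category'])  # two-level labels: [{'@meta': {'type': 'level1'}, '@value': ...}, ...]
print(sample['tag'])       # semantic tags, each with an '@value' field

feat_path = '/home/aistudio/paddle-video-classify-tag/tsn_features_train_sample/{}.npy'.format(sample['@id'])
if os.path.exists(feat_path):  # not every id necessarily has a feature file
    print(np.load(feat_path).shape)  # per-frame TSN features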

2.3 Video Classification Tag Baseline

This baseline builds on VideoTag, PaddlePaddle's large-scale video classification model. It classifies video content over a closed two-level label taxonomy and produces classification tags that describe the video.

2.3.1 Data Preparation

The classification model consumes vector representations of the raw videos extracted with a TSN network. Since this step is time-consuming, we provide precomputed TSN features for all videos; just run the commands in section 2.2 Data Loading.

The dataset carries two levels of labels, so we run classification experiments under both the level-1 and level-2 settings. Each setting needs its own train/validation/test splits. The preparation step produces files laid out as follows:

paddle-video-classify-tag
   |-- data
      |-- level1_label.txt
      |-- level2_label.txt
      |-- level1_{trainval,train,val,test}.list
      |-- level2_{trainval,train,val,test}.list

Run the code below to build the label sets for the video semantic understanding dataset, along with the train/validation/test sample lists.

In [ ]

import os
import os.path as osp
import requests
import time
import codecs
import json
import argparse
import random

def create_splits_indice(n_samples, SPLITS):
    assert sum([v for k, v in SPLITS]) == 1.0
    indices = list(range(n_samples))
    random.shuffle(indices)
    split2indice = {}
    r_offset = 0
    for idx, (split, ratio) in enumerate(SPLITS):
        l_offset = r_offset
        if idx == len(SPLITS) - 1:
            r_offset = n_samples
        else:
            r_offset = int(n_samples * ratio) + l_offset
        split2indice[split] = indices[l_offset:r_offset]
    return split2indice


def prepare_split(data, split_name, test_only=False, gather_labels=False, classify_tag_dir='/home/aistudio/paddle-video-classify-tag'):
    '''
      1. Prepare ALL (unique) labels for classification from trainval-set.
      2. For each split, generate sample list for level1 & level2 classification.
    '''
    trainval_tsn_feature_dir = '/home/aistudio/paddle-video-classify-tag/tsn_features_train_sample'
    test_tsn_feature_dir = '/home/aistudio/paddle-video-classify-tag/tsn_features_test_a'
    cls_data_dir = os.path.join(classify_tag_dir, 'data')
    if not os.path.exists(cls_data_dir):
        os.mkdir(cls_data_dir)

    sample_nids = [sample["@id"] for sample in data]
    level1_labels = []
    level2_labels = []
    if not test_only:
        for sample in data:
            category = {
                each["@meta"]["type"]: each["@value"]
                for each in sample["category"]
            }
            level1_labels.append(category["level1"])
            level2_labels.append(category["level2"])

    def create_sample_list(sample_labels, level_name):
        save_label_file = cls_data_dir + "/{}_label.txt".format(level_name)
        if gather_labels:
            # For trainval set:
            # Gather candidate labels and dump to {level1,level2}_label.txt
            labels = sorted([str(label) for label in list(set(sample_labels))])
            with codecs.open(save_label_file, "w", encoding="utf-8") as ouf:
                ouf.writelines([label + "\n" for label in labels])
                print("Saved " + save_label_file)
        else:
            # For test set: load existing labels.
            with codecs.open(save_label_file, "r", encoding="utf-8") as inf:
                labels = [line.strip() for line in inf.readlines()]
        label2idx = {label: idx for idx, label in enumerate(labels)}
        sample_lines = []
        # Generate sample list: one sample per line (feature_path -> label)
        for i in range(len(sample_nids)):
            label_indice = label2idx[str(sample_labels[i])] if not test_only \
                           else -1
            if split_name in ["train", "val", "trainval"]:
                tsn_feature_dir = trainval_tsn_feature_dir
            elif split_name in ["test"]:
                tsn_feature_dir = test_tsn_feature_dir
            feature_path = osp.join(tsn_feature_dir,
                                    "{}.npy".format(sample_nids[i]))
            if osp.exists(feature_path):
                line = "{} {}\n".format(feature_path, str(label_indice))
                sample_lines.append(line)
        save_split_file = cls_data_dir + "/{}_{}.list".format(level_name, split_name)
        with codecs.open(save_split_file, "w", encoding="utf-8") as ouf:
            ouf.writelines(sample_lines)
            print("Saved {}, size={}".format(save_split_file,
                                             len(sample_lines)))

    create_sample_list(level1_labels, "level1")
    create_sample_list(level2_labels, "level2")


random.seed(6666)
classify_tag_dir = '/home/aistudio/paddle-video-classify-tag'
if not os.path.exists(classify_tag_dir):
    os.mkdir(classify_tag_dir)
trainval_path = '/home/aistudio/dataset_sample/train.sample.json'
test_path = '/home/aistudio/dataset_sample/test_a.json'

# load data for train & validation (have labels).
with codecs.open(trainval_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(trainval_path))
    lines = inf.readlines()
    trainval_data = [json.loads(line) for line in lines]

# load data for test (no labels).
with codecs.open(test_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(test_path))
    lines = inf.readlines()
    test_data = [json.loads(line) for line in lines]

# split the trainval data into train-set(80%) and validation-set(20%).
split2indice = create_splits_indice(
    len(trainval_data), [
        ("train", 4.0 / 5.0),
        ("val", 1.0 / 5.0),
    ])
train_data = [trainval_data[idx] for idx in split2indice["train"]]
val_data = [trainval_data[idx] for idx in split2indice["val"]]

prepare_split(trainval_data, "trainval", gather_labels=True)
prepare_split(train_data, "train")
prepare_split(val_data, "val")
prepare_split(test_data, "test", test_only=True)
Loading /home/aistudio/dataset_sample/train.sample.json...
Loading /home/aistudio/dataset_sample/test_a.json...
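
Each generated .list file holds one sample per line: the TSN feature path followed by the label index (-1 for the test split). A quick peek (a minimal sketch; the printed line is illustrative):

In [ ]

with open('/home/aistudio/paddle-video-classify-tag/data/level1_train.list') as f:
    print(f.readline().strip())
# e.g.: /home/aistudio/paddle-video-classify-tag/tsn_features_train_sample/<nid>.npy 7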

2.3.2 Training and Validation

Run the following code to train the classification models.

Participants can consult the fine-tuning guide in the original repository for more details.

In [4]

import os
import sys
import argparse
import ast
import logging
import paddle
paddle.enable_static()
sys.path.append('/home/aistudio/external-libraries')
sys.path.append('/home/aistudio/work/paddle-video-classify-tag')

from utils.train_utils import train_with_dataloader
import models
from utils.config_utils import *
from reader import get_reader
from metrics import get_metrics
from utils.utility import check_cuda
from utils.utility import check_version

logging.root.handlers = []
FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)


class Args():
    model_name = 'AttentionLSTM'
    config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
    batch_size = None
    learning_rate = None
    pretrain = '/home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm'
    use_gpu = True
    no_memory_optimize = False
    epoch = None
    valid_interval = 1
    save_dir = os.path.join('paddle-video-classify-tag', 'data', 'checkpoints', 'level1')
    log_interval = 50
    fix_random_seed = False

def train(args):
    # parse config
    config = parse_config(args.config)
    train_config = merge_configs(config, 'train', vars(args))
    valid_config = merge_configs(config, 'valid', vars(args))
    # print_configs(train_config, 'Train')
    train_model = models.get_model(args.model_name, train_config, mode='train')
    valid_model = models.get_model(args.model_name, valid_config, mode='valid')

    # build model
    startup = paddle.static.Program()
    train_prog = paddle.static.Program()
    if args.fix_random_seed:
        startup.random_seed = 1000
        train_prog.random_seed = 1000
    with paddle.static.program_guard(train_prog, startup):
        with paddle.utils.unique_name.guard():
            train_model.build_input(use_dataloader=True)
            train_model.build_model()
            # the input has the form [data1, data2, ..., label], so train_feeds[-1] is the label
            train_feeds = train_model.feeds()
            train_fetch_list = train_model.fetches()
            train_loss = train_fetch_list[0]
            optimizer = train_model.optimizer()
            optimizer.minimize(train_loss)
            train_dataloader = train_model.dataloader()

    valid_prog = paddle.static.Program()
    with paddle.static.program_guard(valid_prog, startup):
        with paddle.utils.unique_name.guard():
            valid_model.build_input(use_dataloader=True)
            valid_model.build_model()
            valid_feeds = valid_model.feeds()
            valid_fetch_list = valid_model.fetches()
            valid_dataloader = valid_model.dataloader()

    place = paddle.CUDAPlace(0) if args.use_gpu else paddle.CPUPlace()
    exe = paddle.static.Executor(place)
    exe.run(startup)

    if args.pretrain:
        train_model.load_pretrain_params(exe, args.pretrain, train_prog)

    build_strategy = paddle.fluid.compiler.BuildStrategy()
    build_strategy.enable_inplace = True

    exec_strategy = paddle.static.ExecutionStrategy()

    compiled_train_prog = paddle.static.CompiledProgram(
        train_prog).with_data_parallel(
            loss_name=train_loss.name,
            build_strategy=build_strategy,
            exec_strategy=exec_strategy)
    compiled_valid_prog = paddle.static.CompiledProgram(
        valid_prog).with_data_parallel(
            share_vars_from=compiled_train_prog,
            build_strategy=build_strategy,
            exec_strategy=exec_strategy)

    # get reader
    bs_denominator = 1
    if args.use_gpu:
        # check number of GPUs
        gpus = os.getenv("CUDA_VISIBLE_DEVICES", "")
        if gpus == "":
            pass
        else:
            gpus = gpus.split(",")
            num_gpus = len(gpus)
            assert num_gpus == train_config.TRAIN.num_gpus, \
                   "num_gpus({}) set by CUDA_VISIBLE_DEVICES " \
                   "shoud be the same as that " \
                   "set in {}({})".format(
                   num_gpus, args.config, train_config.TRAIN.num_gpus)
        bs_denominator = train_config.TRAIN.num_gpus

    train_config.TRAIN.batch_size = int(train_config.TRAIN.batch_size /
                                        bs_denominator)
    valid_config.VALID.batch_size = int(valid_config.VALID.batch_size /
                                        bs_denominator)
    train_reader = get_reader(args.model_name.upper(), 'train', train_config)
    valid_reader = get_reader(args.model_name.upper(), 'valid', valid_config)

    # get metrics
    train_metrics = get_metrics(args.model_name.upper(), 'train', train_config)
    valid_metrics = get_metrics(args.model_name.upper(), 'valid', valid_config)

    epochs = args.epoch or train_model.epoch_num()

    exe_places = paddle.static.cuda_places(
    ) if args.use_gpu else paddle.static.cpu_places()
    train_dataloader.set_sample_list_generator(train_reader, places=exe_places)
    valid_dataloader.set_sample_list_generator(valid_reader, places=exe_places)

    train_with_dataloader(
        exe,
        train_prog,
        compiled_train_prog,
        train_dataloader,
        train_fetch_list,
        train_metrics,
        epochs=epochs,
        log_interval=args.log_interval,
        valid_interval=args.valid_interval,
        save_dir=args.save_dir,
        save_model_name=args.model_name,
        fix_random_seed=args.fix_random_seed,
        compiled_test_prog=compiled_valid_prog,
        test_dataloader=valid_dataloader,
        test_fetch_list=valid_fetch_list,
        test_metrics=valid_metrics)


args = Args()

# check whether the installed paddle is compiled with GPU
check_cuda(args.use_gpu)
check_version()

args.model_name = 'AttentionLSTM'
args.log_interval = 50

# first layer
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
args.save_dir = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1'
if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)
train(args)

# second layer
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level2.yaml'
args.save_dir = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level2'
if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)
train(args)
[INFO: regularizer.py:  101]: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000800] in Optimizer will not take effect, and it will only be applied to other Parameters!
[INFO: attention_lstm.py:  164]: Load pretrain weights from /home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm, exclude fc layer.
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:05:55] Epoch 0, iter 0, time 2.767573356628418,  , loss = 2923.538330, Hit@1 = 0.02, PERR = 0.02, GAP = 0.02
[INFO: train_utils.py:  122]: [TRAIN] Epoch 0 training finished, average time: 1.7017907901686065
share_vars_from is set, scope is ignored.
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 302.009247, Hit@1 = 0.53, PERR = 0.53, GAP = 0.60
[INFO: metrics_util.py:  124]: [TEST] Epoch0 Finish  avg_hit_at_one: 0.505859375,  avg_perr: 0.505859375,  avg_loss: 329.6025136311849,  aps: [... per-class APs omitted ...],  gap: 0.5390077300563219
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:08:10] Epoch 1, iter 0, time 2.6242427825927734,  , loss = 364.424683, Hit@1 = 0.48, PERR = 0.48, GAP = 0.44
[INFO: train_utils.py:  122]: [TRAIN] Epoch 1 training finished, average time: 1.7430822508675712
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 248.946411, Hit@1 = 0.62, PERR = 0.62, GAP = 0.70
[INFO: metrics_util.py:  124]: [TEST] Epoch1 Finish  avg_hit_at_one: 0.5944010416666666,  avg_perr: 0.5944010416666666,  avg_loss: 282.882386525472,  aps: [... per-class APs omitted ...],  gap: 0.6346135299340665
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:10:24] Epoch 2, iter 0, time 1.708108901977539,  , loss = 315.439453, Hit@1 = 0.55, PERR = 0.55, GAP = 0.56
[INFO: train_utils.py:  122]: [TRAIN] Epoch 2 training finished, average time: 1.795787981578282
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 235.408524, Hit@1 = 0.63, PERR = 0.63, GAP = 0.72
[INFO: metrics_util.py:  124]: [TEST] Epoch2 Finish  avg_hit_at_one: 0.6002604166666666,  avg_perr: 0.6002604166666666,  avg_loss: 271.76447041829425,  aps: [... per-class APs omitted ...],  gap: 0.6512923596529407
[INFO: regularizer.py:  101]: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000800] in Optimizer will not take effect, and it will only be applied to other Parameters!
[INFO: attention_lstm.py:  164]: Load pretrain weights from /home/aistudio/work/paddle-video-classify-tag/weights/attention_lstm, exclude fc layer.
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:13:03] Epoch 0, iter 0, time 2.3079347610473633,  , loss = 24682.550781, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
[INFO: train_utils.py:  122]: [TRAIN] Epoch 0 training finished, average time: 1.7917025527175592
share_vars_from is set, scope is ignored.
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 600.804932, Hit@1 = 0.30, PERR = 0.30, GAP = 0.24
[INFO: metrics_util.py:  124]: [TEST] Epoch0 Finish  avg_hit_at_one: 0.21549479166666666,  avg_perr: 0.21549479166666666,  avg_loss: 640.4301401774088,  aps: [... per-class APs omitted ...],  gap: 0.1536302281405276
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:15:21] Epoch 1, iter 0, time 2.288398265838623,  , loss = 643.896606, Hit@1 = 0.26, PERR = 0.26, GAP = 0.17
[INFO: train_utils.py:  122]: [TRAIN] Epoch 1 training finished, average time: 1.7937103728858792
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 506.925232, Hit@1 = 0.42, PERR = 0.42, GAP = 0.35
[INFO: metrics_util.py:  124]: [TEST] Epoch1 Finish  avg_hit_at_one: 0.322265625,  avg_perr: 0.322265625,  avg_loss: 543.9810791015625,  aps: [... per-class APs omitted ...],  gap: 0.2629719332516419
[INFO: train_utils.py:   45]: ------- learning rate [0.000125], learning rate counter [-] -----
[INFO: metrics_util.py:   80]: [TRAIN 2022-04-28 22:17:41] Epoch 2, iter 0, time 2.2935118675231934,  , loss = 554.811279, Hit@1 = 0.31, PERR = 0.31, GAP = 0.25
[INFO: train_utils.py:  122]: [TRAIN] Epoch 2 training finished, average time: 1.8651155549652723
[INFO: metrics_util.py:   80]: [TEST] test_iter 0  , loss = 457.075806, Hit@1 = 0.42, PERR = 0.42, GAP = 0.43
[INFO: metrics_util.py:  124]: [TEST] Epoch2 Finish  avg_hit_at_one: 0.3645833333333333,  avg_perr: 0.3645833333333333,  avg_loss: 490.911376953125,  aps: [... per-class APs omitted ...],  gap: 0.35114517815351143

2.3.3 Generating Classification Tag Results

Run the code block below to generate the tag predictions.

The generated results are stored in ./predict_results/level{1,2}_top1.json.

In [1]

import os
import sys
import time
import logging
import argparse
import ast
import numpy as np
import paddle
try:
    import cPickle as pickle
except:
    import pickle

sys.path.append('/home/aistudio/external-libraries')
sys.path.append('/home/aistudio/work/paddle-video-classify-tag')

from utils.config_utils import *
import models
from reader import get_reader
from metrics import get_metrics
from utils.utility import check_cuda
from utils.utility import check_version

logging.root.handlers = []
FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)
logger = logging.getLogger(__name__)


class Args():
    model_name = 'AttentionLSTM'
    config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
    use_gpu = True
    batch_size = None
    weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1/AttentionLSTM_epoch2.pdparams'
    filelist = None
    log_interval = 200
    infer_topk = 10
    save_dir = './predict_results'
    save_file = "top1.json"
    label_file = '/home/aistudio/paddle-video-classify-tag/data/level1_label.txt'
    video_path = None


def infer(args):
    # parse config
    config = parse_config(args.config)
    infer_config = merge_configs(config, 'infer', vars(args))
    print_configs(infer_config, "Infer")
    infer_model = models.get_model(args.model_name, infer_config, mode='infer')
    infer_model.build_input(use_dataloader=False)
    infer_model.build_model()
    infer_feeds = infer_model.feeds()
    infer_outputs = infer_model.outputs()

    place = paddle.CUDAPlace(0) if args.use_gpu else paddle.CPUPlace()
    exe = paddle.static.Executor(place)

    exe.run(paddle.static.default_startup_program())

    filelist = args.filelist or infer_config.INFER.filelist
    filepath = args.video_path or infer_config.INFER.get('filepath', '')
    if filepath != '':
        assert os.path.exists(filepath), "{} not exist.".format(filepath)
    else:
        assert os.path.exists(filelist), "{} not exist.".format(filelist)

    # get infer reader
    infer_reader = get_reader(args.model_name.upper(), 'infer', infer_config)

    if args.weights:
        assert os.path.exists(
            args.weights), "Given weight dir {} not exist.".format(
                args.weights)
    # if no weight files specified, download weights from paddle
    weights = args.weights or infer_model.get_weights()

    infer_model.load_test_weights(exe, weights,
                                  paddle.static.default_main_program())

    infer_feeder = paddle.fluid.DataFeeder(place=place, feed_list=infer_feeds)
    fetch_list = infer_model.fetches()

    infer_metrics = get_metrics(args.model_name.upper(), 'infer', infer_config)
    infer_metrics.reset()

    periods = []
    cur_time = time.time()
    for infer_iter, data in enumerate(infer_reader()):
        data_feed_in = [items[:-1] for items in data]
        video_id = [items[-1] for items in data]
        infer_outs = exe.run(fetch_list=fetch_list,
                             feed=infer_feeder.feed(data_feed_in))
        infer_result_list = [item for item in infer_outs] + [video_id]

        prev_time = cur_time
        cur_time = time.time()
        period = cur_time - prev_time
        periods.append(period)

        infer_metrics.accumulate(infer_result_list)

        if args.log_interval > 0 and infer_iter % args.log_interval == 0:
            logger.info('Processed {} samples'.format((infer_iter + 1) * len(
                video_id)))

    logger.info('[INFER] infer finished. average time: {}'.format(
        np.mean(periods)))

    if not os.path.isdir(args.save_dir):
        os.makedirs(args.save_dir)

    infer_metrics.finalize_and_log_out(
        savedir=args.save_dir,
        savefile=args.save_file,
        label_file=args.label_file)


args = Args()

# level-1 tags
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level1.yaml'
args.weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level1/AttentionLSTM_epoch2.pdparams'
args.label_file = '/home/aistudio/paddle-video-classify-tag/data/level1_label.txt'
args.save_file = 'level1_top1.json'
infer(args)

# level-2 tags
args.config = '/home/aistudio/work/paddle-video-classify-tag/configs/attention_lstm-single-level2.yaml'
args.weights = '/home/aistudio/paddle-video-classify-tag/data/checkpoints/level2/AttentionLSTM_epoch2.pdparams'
args.label_file = '/home/aistudio/paddle-video-classify-tag/data/level2_label.txt'
args.save_file = 'level2_top1.json'
infer(args)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
[INFO: config_utils.py:   70]: ---------------- Infer Arguments ----------------
[INFO: config_utils.py:   72]: MODEL:
[INFO: config_utils.py:   74]:     name:AttentionLSTM
[INFO: config_utils.py:   74]:     dataset:YouTube-8M
[INFO: config_utils.py:   74]:     bone_nework:None
[INFO: config_utils.py:   74]:     drop_rate:0.5
[INFO: config_utils.py:   74]:     feature_names:['rgb']
[INFO: config_utils.py:   74]:     feature_dims:[2048]
[INFO: config_utils.py:   74]:     embedding_size:1024
[INFO: config_utils.py:   74]:     lstm_size:512
[INFO: config_utils.py:   74]:     num_classes:278
[INFO: config_utils.py:   74]:     topk:20
[INFO: config_utils.py:   72]: TRAIN:
[INFO: config_utils.py:   74]:     epoch:3
[INFO: config_utils.py:   74]:     learning_rate:0.000125
[INFO: config_utils.py:   74]:     decay_epochs:[5]
[INFO: config_utils.py:   74]:     decay_gamma:0.1
[INFO: config_utils.py:   74]:     weight_decay:0.0008
[INFO: config_utils.py:   74]:     num_samples:35952
[INFO: config_utils.py:   74]:     pretrain_base:None
[INFO: config_utils.py:   74]:     batch_size:128
[INFO: config_utils.py:   74]:     use_gpu:True
[INFO: config_utils.py:   74]:     num_gpus:1
[INFO: config_utils.py:   74]:     filelist:/home/aistudio/paddle-video-classify-tag/data/level2_train.list
[INFO: config_utils.py:   72]: VALID:
[INFO: config_utils.py:   74]:     batch_size:128
[INFO: config_utils.py:   74]:     filelist:/home/aistudio/paddle-video-classify-tag/data/level2_val.list
[INFO: config_utils.py:   72]: TEST:
[INFO: config_utils.py:   74]:     batch_size:128
[INFO: config_utils.py:   74]:     filelist:/home/aistudio/paddle-video-classify-tag/data/level2_val.list
[INFO: config_utils.py:   72]: INFER:
[INFO: config_utils.py:   74]:     batch_size:1
[INFO: config_utils.py:   74]:     filelist:/home/aistudio/paddle-video-classify-tag/data/level2_test.list
[INFO: config_utils.py:   75]: -------------------------------------------------
W0428 23:33:54.649835  8407 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0428 23:33:54.655040  8407 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[INFO: 3560385804.py:  106]: Processed 1 samples
[INFO: 3560385804.py:  106]: Processed 201 samples
...
[INFO: 3560385804.py:  106]: Processed 9801 samples
[INFO: 3560385804.py:  109]: [INFER] infer finished. average time: 0.03070716172327805
[INFO: metrics_util.py:  119]: Saved ./predict_results/level2_top1.json
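
Each prediction file maps a video id to its top-1 class. A quick check (a minimal sketch; only the class_name field is relied on by the merging code in section 3):

In [ ]

import json

with open('./predict_results/level2_top1.json', encoding='utf-8') as f:
    pred = json.load(f)
nid, result = next(iter(pred.items()))
print(nid, result['class_name'])  # top-1 level-2 label for this video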

2.4 Video Semantic Tag Baseline

This baseline provides the video semantic tag capability: it extracts semantic tags that capture the gist of a video from the video's text (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tag understanding, or generating tags).

2.4.1 Data Processing

First, convert the data into the format required by the named entity recognition model and create the train/validation splits. See the PaddleNLP Chinese named entity recognition project for reference.

Note: During data processing we discard semantic tags that do not appear in the title.

In [8]

import os
import sys
import pandas as pd
import json
import codecs
import argparse
import random

sys.path.append('/home/aistudio/external-libraries')
TAG_NAMES = ["B-ENT", "I-ENT", "O"]


class Args():
    trainval_path = '/home/aistudio/dataset_sample/train.sample.json'
    test_path = '/home/aistudio/dataset_sample/test_a.json'

def gather_text_and_tags(sample, test_only=False):
    def fill_tags(surf):
        '''
        For entities that appear in text, replace their tags with 'B-ENT/I-ENT'.
        '''
        s_idx = text.find(surf)
        if s_idx != -1:
            tags[s_idx] = TAG_NAMES[0]
            for i in range(s_idx + 1, s_idx + len(surf)):
                tags[i] = TAG_NAMES[1]
            return 1
        return 0

    text = sample["title"].replace(" ", "").replace("\t", "")
    # init tag sequence with all 'O's.
    tags = [TAG_NAMES[2] for i in range(len(text))]
    entities = []
    if not test_only:
        entities = [each["@value"] for each in sample["tag"]]
    # annotate 'B-ENT' and 'I-ENT' tags.
    n_bingo_entities = sum(
        [fill_tags(surf) for surf in entities if len(surf) > 0])
    # statistics
    stats = {
        "txt_length": len(text),
        "n_entities": len(entities),
        "n_bingo_entities": n_bingo_entities,
    }
    return text, tags, stats


def stat_numberic_list(li, name="default"):
    assert isinstance(li, list)
    stat = {}
    stat["size"] = len(li)
    if all(isinstance(x, int) for x in li):
        stat["max"] = max(li)
        stat["min"] = min(li)
        stat["sum"] = sum(li)
        stat["avr"] = stat["sum"] / float(len(li))
    print("list-%s:\n\t%s" % (name, str(stat)))


def analyze_annots(stats_list):
    for key in ["txt_length", "n_entities", "n_bingo_entities"]:
        numbers = [stats[key] for stats in stats_list]
        stat_numberic_list(numbers, name=key)


def prepare_split(data, split_name, test_only=False):
    sample_lines = []
    nid_lines = []
    stats_list = []
    for idx in range(len(data)):
        text, tags, stats = gather_text_and_tags(
            data[idx], test_only=test_only)
        if len(text) == 0:
            continue
        # proper data format.
        text = '\002'.join([ch for ch in text])
        tags = '\002'.join(tags)
        sample_lines.append('\t'.join([text, tags]) + "\n")
        nid_lines.append(data[idx]["@id"] + "\n")
        stats_list.append(stats)
    if split_name == "trainval":
        # print statistics.
        analyze_annots(stats_list)

    save_split_file = "/home/aistudio/paddle-video-semantic-tag/data/{}.tsv".format(split_name)
    with codecs.open(save_split_file, "w", encoding="utf-8") as ouf:
        ouf.writelines(sample_lines)
        print("Saved {}, size={}".format(save_split_file, len(sample_lines)))
    with codecs.open("/home/aistudio/paddle-video-semantic-tag/data/nids.txt", "w", encoding="utf-8") as ouf:
        ouf.writelines(nid_lines)


def create_splits_indice(n_samples, SPLITS):
    assert sum([v for k, v in SPLITS]) == 1.0
    indices = list(range(n_samples))
    random.shuffle(indices)
    split2indice = {}
    r_offset = 0
    for idx, (split, ratio) in enumerate(SPLITS):
        l_offset = r_offset
        if idx == len(SPLITS) - 1:
            r_offset = n_samples
        else:
            r_offset = int(n_samples * ratio) + l_offset
        split2indice[split] = indices[l_offset:r_offset]
    return split2indice


args = Args()
random.seed(6666)

# load data for train & validation (have labels).
with codecs.open(args.trainval_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(args.trainval_path))
    lines = inf.readlines()
    trainval_data = [json.loads(line) for line in lines]

# load data for test (no labels).
with codecs.open(args.test_path, "r", encoding="utf-8") as inf:
    print("Loading {}...".format(args.test_path))
    lines = inf.readlines()
    test_data = [json.loads(line) for line in lines]

# split the trainval data into train-set(75%) and validation-set(25%).
split2indice = create_splits_indice(
    len(trainval_data), [
        ("train", 3.0 / 4.0),
        ("val", 1.0 / 4.0),
    ])
train_data = [trainval_data[idx] for idx in split2indice["train"]]
val_data = [trainval_data[idx] for idx in split2indice["val"]]

label_map_file = "/home/aistudio/paddle-video-semantic-tag/data/label_map.json"
with open(label_map_file, "w") as ouf:
    json.dump({tag: idx for idx, tag in enumerate(TAG_NAMES)}, ouf)
print("Saved " + label_map_file)

prepare_split(trainval_data, "trainval")
prepare_split(train_data, "train")
prepare_split(val_data, "val")
prepare_split(test_data, "test", test_only=True)
Loading /home/aistudio/dataset_sample/train.sample.json...
Loading /home/aistudio/dataset_sample/test_a.json...
Saved /home/aistudio/paddle-video-semantic-tag/data/label_map.json
list-txt_length:
	{'size': 8079, 'max': 67, 'min': 1, 'sum': 118072, 'avr': 14.614680034657754}
list-n_entities:
	{'size': 8079, 'max': 38, 'min': 0, 'sum': 21425, 'avr': 2.6519371209308082}
list-n_bingo_entities:
	{'size': 8079, 'max': 8, 'min': 0, 'sum': 8292, 'avr': 1.0263646490902338}
Saved /home/aistudio/paddle-video-semantic-tag/data/trainval.tsv, size=8079
Saved /home/aistudio/paddle-video-semantic-tag/data/train.tsv, size=6061
Saved /home/aistudio/paddle-video-semantic-tag/data/val.tsv, size=2018
Saved /home/aistudio/paddle-video-semantic-tag/data/test.tsv, size=9938
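
Each line of the generated .tsv files stores a character sequence and its BIO tag sequence, both joined with the '\002' separator (as written by prepare_split above). A quick decode (a minimal sketch):

In [ ]

with open('/home/aistudio/paddle-video-semantic-tag/data/train.tsv', encoding='utf-8') as f:
    text, tags = f.readline().rstrip('\n').split('\t')
for ch, tag in zip(text.split('\002'), tags.split('\002')):
    print(ch, tag)  # each title character paired with B-ENT / I-ENT / O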

2.4.2 Training and Validation

This model uses bert-wwm-ext-chinese from the PaddleNLP model zoo. For more models, see the PaddleNLP Transformer API.

In [ ]

import argparse
import os
import sys
import random
import time
import math
from functools import partial
import json

import numpy as np
import paddle
from paddle.io import DataLoader

import paddlenlp as ppnlp
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.metrics import ChunkEvaluator
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import BertForTokenClassification, BertTokenizer
from paddlenlp.data import Stack, Tuple, Pad, Dict

sys.path.append('/home/aistudio/external-libraries')

class Args():
    model_name_or_path = None
    output_dir = None
    max_seq_length = 128
    batch_size = 8
    learning_rate = 5e-5
    weight_decay = 0.0
    adam_epsilon = 1e-8
    max_grad_norm = 1.0
    num_train_epochs = 3
    max_steps = -1
    warmup_steps = 0
    logging_steps = 1
    save_steps = 100
    seed = 42
    device = 'gpu'

def evaluate(model, loss_fct, metric, data_loader, label_num):
    model.eval()
    metric.reset()
    avg_loss, precision, recall, f1_score = 0, 0, 0, 0
    for batch in data_loader:
        input_ids, token_type_ids, length, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = loss_fct(logits, labels)
        avg_loss = paddle.mean(loss)
        preds = logits.argmax(axis=2)
        num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(
            None, length, preds, labels)
        metric.update(num_infer_chunks.numpy(),
                      num_label_chunks.numpy(), num_correct_chunks.numpy())
        precision, recall, f1_score = metric.accumulate()
    print("eval loss: %f, precision: %f, recall: %f, f1: %f" %
          (avg_loss, precision, recall, f1_score))
    model.train()


def tokenize_and_align_labels(example,
                              tokenizer,
                              no_entity_id,
                              max_seq_len=512):
    labels = example['labels']
    example = example['tokens']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return tokenized_input


def _read(data_file, label_map_file):
    with open(label_map_file, "r") as inf:
        tag2label = json.load(inf)
    with open(data_file, 'r', encoding='utf-8') as inf:
        for line in inf:
            line_stripped = line.strip().split('\t')
            assert len(line_stripped) == 2
            tokens = line_stripped[0].split("\002")
            tags = line_stripped[1].split("\002")
            labels = [tag2label[tag] for tag in tags]
            yield {"tokens": tokens, "labels": labels}


def do_train(args):
    paddle.set_device(args.device)
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    # Create dataset, tokenizer and dataloader.
    train_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/train.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    test_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/val.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    train_ds.label_list = test_ds.label_list = ["B-ENT", "I-ENT", "O"]

    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)

    label_list = train_ds.label_list
    label_num = len(label_list)
    no_entity_id = label_num - 1

    trans_func = partial(
        tokenize_and_align_labels,
        tokenizer=tokenizer,
        no_entity_id=no_entity_id,
        max_seq_len=args.max_seq_length)

    train_ds = train_ds.map(trans_func)

    ignore_label = -100

    batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
        'seq_len': Stack(),  # seq_len
        'labels': Pad(axis=0, pad_val=ignore_label)  # label
    }): fn(samples)

    train_batch_sampler = paddle.io.DistributedBatchSampler(
        train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True)

    train_data_loader = DataLoader(
        dataset=train_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_sampler=train_batch_sampler,
        return_list=True)

    test_ds = test_ds.map(trans_func)

    test_data_loader = DataLoader(
        dataset=test_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_size=args.batch_size,
        return_list=True)

    # Define the model network and its loss
    model = BertForTokenClassification.from_pretrained(
        args.model_name_or_path, num_classes=label_num)

    if paddle.distributed.get_world_size() > 1:
        model = paddle.DataParallel(model)

    num_training_steps = args.max_steps if args.max_steps > 0 else len(
        train_data_loader) * args.num_train_epochs

    lr_scheduler = LinearDecayWithWarmup(args.learning_rate,
                                         num_training_steps, args.warmup_steps)

    # Generate parameter names needed to perform weight decay.
    # All bias and LayerNorm parameters are excluded.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params)

    loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label)

    metric = ChunkEvaluator(label_list=label_list)

    global_step = 0
    last_step = args.num_train_epochs * len(train_data_loader)
    tic_train = time.time()
    for epoch in range(args.num_train_epochs):
        for step, batch in enumerate(train_data_loader):
            global_step += 1
            input_ids, token_type_ids, _, labels = batch
            logits = model(input_ids, token_type_ids)
            loss = loss_fct(logits, labels)
            avg_loss = paddle.mean(loss)
            if global_step % args.logging_steps == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                    % (global_step, epoch, step, avg_loss,
                       args.logging_steps / (time.time() - tic_train)))
                tic_train = time.time()
            avg_loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
            if global_step % args.save_steps == 0 or global_step == last_step:
                if paddle.distributed.get_rank() == 0:
                    evaluate(model, loss_fct, metric, test_data_loader,
                             label_num)
                    paddle.save(
                        model.state_dict(),
                        os.path.join(args.output_dir,
                                     "model_%d.pdparams" % global_step))


args = Args()
args.model_name_or_path = 'bert-wwm-ext-chinese'
args.max_seq_length = 128
args.batch_size = 32
args.learning_rate = 2e-5
args.num_train_epochs = 3
args.logging_steps = 1
args.save_steps = 500
args.output_dir = '/home/aistudio/paddle-video-semantic-tag/data/checkpoints/semantic_tag'
args.device = 'gpu'
do_train(args)

In [4]

!ls /home/aistudio/paddle-video-semantic-tag/data/checkpoints/se*
model_500.pdparams  model_567.pdparams

2.4.3 Generating Semantic Tag Results

The extraction results are stored in ./predict_results/ents_results.json.

In [5]

import argparse
import os
import sys
import ast
import random
import time
import math
from functools import partial
import json
import codecs
from tqdm import tqdm

import numpy as np
import paddle
from paddle.io import DataLoader

import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.transformers import BertForTokenClassification, BertTokenizer

sys.path.append('/home/aistudio/external-libraries')


class Args():
    model_name_or_path = None
    output_dir = None
    max_seq_length = 128
    batch_size = 8
    learning_rate = 5e-5
    weight_decay = 0.0
    adam_epsilon = 1e-8
    max_grad_norm = 1.0
    num_train_epochs = 3
    max_steps = -1
    warmup_steps = 0
    logging_steps = 1
    save_steps = 100
    seed = 42
    device = 'gpu'

def tokenize_and_align_labels(example,
                              tokenizer,
                              no_entity_id,
                              max_seq_len=512):
    labels = example['labels']
    example = example['tokens']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))

    return tokenized_input


def parse_decodes(input_words, id2label, decodes, lens):
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]

    outputs = []
    entities_list = []
    for idx, end in enumerate(lens):
        sent = "".join(input_words[idx]['tokens'])
        tags = [id2label[x] for x in decodes[idx][1:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            if t.startswith('B-') or t == 'O':
                if len(words):
                    sent_out.append(words)
                if t.startswith('B-'):
                    tags_out.append(t.split('-')[1])
                else:
                    tags_out.append(t)
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join(
            [str((s, t)) for s, t in zip(sent_out, tags_out)]))
        entities_list.append(
            [s for s, t in zip(sent_out, tags_out) if t == "ENT"])
    return outputs, entities_list


def _read(data_file, label_map_file):
    with open(label_map_file, "r") as inf:
        tag2label = json.load(inf)
    with open(data_file, 'r', encoding='utf-8') as inf:
        for line in inf:
            line_stripped = line.strip().split('\t')
            assert len(line_stripped) == 2
            tokens = line_stripped[0].split("\002")
            tags = line_stripped[1].split("\002")
            labels = [tag2label[tag] for tag in tags]
            yield {"tokens": tokens, "labels": labels}


def do_predict(args):
    paddle.set_device(args.device)

    # Create dataset, tokenizer and dataloader.
    train_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/train.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    predict_ds = load_dataset(
        _read,
        data_file="/home/aistudio/paddle-video-semantic-tag/data/test.tsv",
        label_map_file="/home/aistudio/paddle-video-semantic-tag/data/label_map.json",
        lazy=False)
    train_ds.label_list = predict_ds.label_list = ["B-ENT", "I-ENT", "O"]

    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)

    label_list = train_ds.label_list
    label_num = len(label_list)
    no_entity_id = label_num - 1
    trans_func = partial(
        tokenize_and_align_labels,
        tokenizer=tokenizer,
        no_entity_id=no_entity_id,
        max_seq_len=args.max_seq_length)

    ignore_label = -100
    batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
        'seq_len': Stack(),
        'labels': Pad(axis=0, pad_val=ignore_label)  # label
    }): fn(samples)
    raw_data = predict_ds.data

    id2label = dict(enumerate(predict_ds.label_list))

    predict_ds = predict_ds.map(trans_func)
    predict_data_loader = DataLoader(
        dataset=predict_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_size=args.batch_size,
        return_list=True)

    # Define the model network
    model = BertForTokenClassification.from_pretrained(
        args.model_name_or_path, num_classes=label_num)
    if args.init_checkpoint_path:
        model_dict = paddle.load(args.init_checkpoint_path)
        model.set_dict(model_dict)

    model.eval()
    pred_list = []
    len_list = []
    for step, batch in tqdm(enumerate(predict_data_loader)):
        input_ids, token_type_ids, length, labels = batch
        logits = model(input_ids, token_type_ids)
        pred = paddle.argmax(logits, axis=-1)
        pred_list.append(pred.numpy())
        len_list.append(length.numpy())

    preds, entities_list = parse_decodes(raw_data, id2label, pred_list,
                                         len_list)

    save_dir = "predict_results"
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    file_path = os.path.join(save_dir, "ner_results.txt")
    with open(file_path, "w", encoding="utf8") as fout:
        fout.write("\n".join(preds))
    # Print some examples
    # print(
    #     "The results have been saved in the file: %s, some examples are shown below: "
    #     % file_path)
    # print("\n".join(preds[:10]))

    with open("/home/aistudio/paddle-video-semantic-tag/data/nids.txt", "r") as inf:
        lines = inf.readlines()
    nid2ents = {}
    for entities, nid in zip(entities_list, lines):
        nid2ents[nid.strip()] = entities
    save_json = os.path.join(save_dir, "ents_results.json")
    with codecs.open(save_json, "w", encoding="utf-8") as ouf:
        json.dump(nid2ents, ouf, ensure_ascii=False)
    print("Saved " + save_json)


args = Args()
args.model_name_or_path = 'bert-wwm-ext-chinese'
args.init_checkpoint_path = '/home/aistudio/paddle-video-semantic-tag/data/checkpoints/semantic_tag/model_567.pdparams'
args.max_seq_length = 128
args.batch_size = 32
args.device = 'gpu'

do_predict(args)
[2022-04-29 02:13:12,761] [    INFO] - Found /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese-vocab.txt
[2022-04-29 02:13:12,777] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese.pdparams
311it [00:13, 23.29it/s]
Saved predict_results/ents_results.json
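
ents_results.json maps each test video id to the list of entity strings extracted from its title. A quick peek (a minimal sketch):

In [ ]

import json

with open('predict_results/ents_results.json', encoding='utf-8') as f:
    nid2ents = json.load(f)
nid, ents = next(iter(nid2ents.items()))
print(nid, ents)  # a video id and its extracted entities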

3. Generating and Submitting the Result File

After running the code below, submit the generated result.txt on the competition's submission page.

In [6]

import os
import os.path as osp
import codecs
import json

class Args():
    test_path = '/home/aistudio/dataset_sample/test_a.json'
    category_level1_result = '/home/aistudio/predict_results/level1_top1.json'
    category_level2_result = '/home/aistudio/predict_results/level2_top1.json'
    tag_result = '/home/aistudio/predict_results/ents_results.json'

if __name__ == "__main__":
    args = Args()

    with codecs.open(args.test_path, "r", encoding="utf-8") as inf:
        print("Loading {}...".format(args.test_path))
        lines = inf.readlines()
        nids = [json.loads(line)["@id"] for line in lines]

    # load the prediction results of 'paddle-video-classify-tag' model on test-set
    with codecs.open(
            args.category_level1_result, "r", encoding="utf-8") as inf:
        pred_level1 = json.load(inf)
    with codecs.open(
            args.category_level2_result, "r", encoding="utf-8") as inf:
        pred_level2 = json.load(inf)
    # load the prediction results of 'paddle-video-semantic-tag' model on test-set
    with codecs.open(args.tag_result, "r", encoding="utf-8") as inf:
        pred_tags = json.load(inf)

    # merge results and generate an entry for each nid.
    submission_lines = []
    for nid in nids:
        level1_category = pred_level1[nid]["class_name"] \
                          if nid in pred_level1 else ""
        level2_category = pred_level2[nid]["class_name"] \
                          if nid in pred_level2 else ""
        tags = pred_tags[nid] if nid in pred_tags else []
        result = {
            "@id": nid,
            "category": [
                {
                    "@meta": {
                        "type": "level1"
                    },
                    "@value": level1_category
                },
                {
                    "@meta": {
                        "type": "level2"
                    },
                    "@value": level2_category
                },
            ],
            "tag": [{
                "@value": tag
            } for tag in tags],
        }
        submission_lines.append(json.dumps(result, ensure_ascii=False) + "\n")

    with codecs.open("result.txt", "w", encoding="utf-8") as ouf:
        ouf.writelines(submission_lines)
    print("Saved result.txt")
Loading /home/aistudio/dataset_sample/test_a.json...
Saved result.txt
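
Before submitting, it can help to sanity-check result.txt: one JSON object per line, each with an @id, two category entries, and a tag list (a minimal sketch):

In [ ]

import json

with open('result.txt', encoding='utf-8') as f:
    lines = f.readlines()
print(len(lines), 'entries')
entry = json.loads(lines[0])
assert {'@id', 'category', 'tag'} <= set(entry)  # required top-level fields
print(entry['category'], entry['tag'][:3])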

Be sure to run this before exiting

Run the command below to delete large temporary files (they are regenerated on the next launch); otherwise the project may be slow to load next time.

In [7]

!rm -rf dataset_sample paddle-video-classify-tag paddle-video-semantic-tag predict_results tsn_features_test_a tsn_features_train_sample
