1 Overview


Semantic video understanding plays an integral role in many widely used mobile applications. In recent years, perception-based video understanding methods have proven inadequate in cases where semantic knowledge or multi-modal information is an essential cue for understanding.


Knowledge-enhanced video understanding addresses this issue by introducing multi-modal learning and knowledge reasoning techniques. In this competition, you are challenged to develop video classification and video tagging algorithms that accurately assign video-level labels using the provided datasets and knowledge resources. To keep the focus on semantic understanding, the videos' perception information, including face recognition results, OCR results, ASR results, and visual feature vectors, is provided alongside the videos.


2 Task Description


The input is a knowledge base together with each video's perception information, including face recognition results, OCR results, ASR results, and visual feature vectors. The first target is to classify each video into a category and a corresponding sub-category within a predefined two-level video classification system. The second target is to generate semantic tags for the video; the tags may be entities, concepts, events, or facets of an entity.



  • Video data

  • Video perception information, including face recognition results, OCR results, ASR results, and visual feature vectors

  • Semantic video tags, including:

    • Classification tags, drawn from a closed set defined by a two-level video classification system (see the dataset description). Every video is assigned exactly two classification tags: a first-level tag and a corresponding second-level tag.

    • Semantic tags, which may be entities, concepts, events, or facets of entities that are important for an accurate understanding of the video (see the dataset description).

3 Datasets


The dataset includes a training set, test set A, test set B, and a knowledge base.

  • Training set: Constructed by sampling micro videos from the Haokan Video app and the Quanmin Micro Video app. It contains about 45k videos for model training, together with their perception information (face recognition results, OCR results, ASR results, and visual feature vectors). All videos come from real application data, and the target tags are human-labeled.

  • Test set A: Constructed in the same way as the training set. It contains about 10k micro videos and is used for online evaluation.

  • Test set B: Constructed in the same way as the training set. It contains about 10k micro videos and is used for the final evaluation.

  • Knowledge base: Extracted from Baidu Baike (Baidu Encyclopedia). It contains about 300 thousand knowledge entries, which serve as the basis for knowledge reasoning.

The training set can be used for model training and hyperparameter tuning, and will be available for download after the registration deadline. The test sets are released in two stages: test set A in the first stage, for online evaluation, and test set B in the second stage. The final result is determined by the score on test set B.

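4 Quick Start

This project is the baseline for the "Knowledge-Enhanced Video Semantic Understanding" evaluation task. It consists of two parts: 1) the video classification tag model, paddle-video-classify-tag; 2) the video semantic tag model, paddle-video-semantic-tag. The classification tag model classifies a video by its content within the closed two-level label system, producing classification tags that describe the video. The semantic tag model extracts entity semantic tags from the video's text information (participants may upgrade it, for example by reasoning over the provided knowledge base, fusing multi-modal information to improve tag understanding, or generating tags). The tags produced by the two models correspond to the classification tags and semantic tags provided in the evaluation dataset, respectively. The project streamlines the training, prediction, and submission workflow for the evaluation task, so that it can be run and submitted with one click. [Reference project link]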


4.1 Install the environment and dependencies

!pip install opencv-python -i https://mirror.baidu.com/pypi/simple
!pip install paddlenlp==2.0.1 -i https://mirror.baidu.com/pypi/simple
!pip install tqdm wget

4.2 Dataset preprocessing

!mkdir -p /home/aistudio/project
!unzip /home/aistudio/project.zip
!tar -zxvf /home/aistudio/data/data217134/dataset_sample.tar.gz
!cd /home/aistudio/project/dataset && unzip /home/aistudio/data/data217134/tsn_features_test_a.zip
!cd /home/aistudio/project/dataset && unzip /home/aistudio/data/data217134/tsn_features_train_sample.zip
!mv /home/aistudio/dataset_sample/* /home/aistudio/project/dataset/

4.3 Split the dataset

%cd ~/project
!python spilt_dataset.py
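
The internals of `spilt_dataset.py` are not shown here. As a rough illustration of what a train/validation split step typically does, the sketch below shuffles a list of video IDs with a fixed seed and holds out a fraction for validation. The function name `split_ids`, the 10% ratio, and the ID format are assumptions made for illustration, not the baseline's actual behaviour.

```python
import random


def split_ids(video_ids, val_ratio=0.1, seed=42):
    """Shuffle IDs deterministically and split them into (train, val) lists."""
    if not 0.0 < val_ratio < 1.0:
        raise ValueError("val_ratio must be between 0 and 1")
    ids = sorted(video_ids)           # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)  # local RNG: global random state is untouched
    n_val = max(1, int(len(ids) * val_ratio))
    return ids[n_val:], ids[:n_val]


# Hypothetical IDs, purely for demonstration.
train_ids, val_ids = split_ids([f"video_{i:04d}" for i in range(100)])
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the validation set stable across runs, which makes scores from different experiments comparable.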

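5 Video Classification Tags

This part is based on [VideoTag, the PaddlePaddle large-scale video classification model]. The video classification tag model classifies a video by its content within the closed two-level label system, producing classification tags that describe the video.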
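
The commands in this section use a model named AttentionLSTM, which operates on per-frame visual features. The sketch below illustrates only the general idea of attention pooling over frames in plain Python: each frame gets a relevance score, the scores are normalized with a softmax, and the frame vectors are averaged with those weights. It is not the VideoTag implementation; `attention_pool`, the hard-coded weight vector, and the toy features are all assumptions for illustration.

```python
import math


def attention_pool(frame_features, weight):
    """Collapse T frame vectors into one vector via softmax-weighted averaging."""
    # One scalar relevance score per frame: dot product with a weight vector.
    scores = [sum(f * w for f, w in zip(frame, weight)) for frame in frame_features]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the frame vectors, one output value per feature dimension.
    dim = len(frame_features[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(alphas, frame_features))
        for d in range(dim)
    ]
    return pooled, alphas


# Toy input: three frames with two-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, alphas = attention_pool(frames, weight=[0.5, 0.5])
print(alphas)
```

In a trained model the weight vector is learned, so frames that matter for the label end up with larger attention weights.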

5.1 Training the level-1 classification model


%cd ~/project/paddle-video-classify-tag
!python train.py \
    --model_name=AttentionLSTM \
    --config=./configs/attention_lstm-single-level1.yaml \
    --pretrain=./weights/attention_lstm \
    --save_dir=./data/checkpoints/level1

5.2 Training the level-2 classification model

!python train.py \
    --model_name=AttentionLSTM \
    --config=./configs/attention_lstm-single-level2.yaml \
    --pretrain=./weights/attention_lstm \
    --save_dir=./data/checkpoints/level2

5.3 Predicting level-1 classification labels

# Level-1 labels
!python predict.py \
    --model_name=AttentionLSTM \
    --config=./configs/attention_lstm-single-level1.yaml \
    --weights=./data/checkpoints/level1/AttentionLSTM_epoch2.pdparams \
    --label_file=/home/aistudio/project/dataset/data/level1_label.txt \
    --save_dir=./predict_results \
    --save_file=level1_top1.json

5.4 Predicting level-2 classification labels

# Level-2 labels
!python predict.py \
    --model_name=AttentionLSTM \
    --config=./configs/attention_lstm-single-level2.yaml \
    --weights=./data/checkpoints/level2/AttentionLSTM_epoch2.pdparams \
    --label_file=/home/aistudio/project/dataset/data/level2_label.txt \
    --save_dir=./predict_results \
    --save_file=level2_top1.json
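
Because the level-1 and level-2 models are trained and run separately, their top-1 predictions are not guaranteed to agree with the two-level hierarchy. One possible post-processing step, sketched below, picks the (level-1, level-2) pair with the highest joint score among valid parent-child pairs. The function `pick_consistent_pair`, the taxonomy, and the probabilities are hypothetical; the actual contents of the prediction JSON files are not shown in this document.

```python
def pick_consistent_pair(level1_probs, level2_probs, parent_of):
    """Return the (level1, level2) pair with the highest joint score.

    level1_probs / level2_probs: dict mapping label -> probability.
    parent_of: dict mapping each level-2 label -> its level-1 parent.
    """
    best_pair, best_score = None, float("-inf")
    for child, p_child in level2_probs.items():
        parent = parent_of[child]
        # Joint score: parent probability times child probability.
        score = level1_probs.get(parent, 0.0) * p_child
        if score > best_score:
            best_pair, best_score = (parent, child), score
    return best_pair


# Hypothetical taxonomy and scores, for illustration only.
parent_of = {"basketball": "sports", "football": "sports", "pop": "music"}
level1 = {"sports": 0.7, "music": 0.3}
level2 = {"basketball": 0.35, "football": 0.2, "pop": 0.45}
print(pick_consistent_pair(level1, level2, parent_of))
```

In this toy case the independent top-1 picks would be "sports" and "pop", which do not form a valid pair; the joint score selects "sports" with "basketball" instead.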


6 Video Semantic Tags


6.1 Data processing

%cd /home/aistudio/project/paddle-video-semantic-tag
!python prepare_semantic_tag.py

6.2 Training and validation

!python train_semantic_tag.py \
    --model_name_or_path bert-wwm-ext-chinese \
    --max_seq_length 128 \
    --batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --logging_steps 1 \
    --save_steps 500 \
    --output_dir ./data/checkpoints/semantic_tag/ \
    --device gpu

6.3 Generating semantic tag results

!python predict_semantic_tag.py \
    --model_name_or_path bert-wwm-ext-chinese \
    --max_seq_length 128 \
    --batch_size 32 \
    --device gpu \
    --init_checkpoint_path data/checkpoints/semantic_tag/model_500.pdparams
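
The semantic tag model extracts entity tags from the video's text. A common way to do this with a BERT-style model is per-token sequence labeling followed by span decoding. The sketch below shows only the decoding step, assuming a simple B/I/O tagging scheme; the document does not state which scheme `predict_semantic_tag.py` uses, and `decode_bio` and the sample sentence are hypothetical.

```python
def decode_bio(tokens, tags):
    """Collect entity strings from parallel token and B/I/O tag sequences."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; close any entity that is still open.
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open entity: close and reset.
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


# Character-level tokens, as is usual for Chinese BERT models (made-up example).
tokens = list("周杰伦演唱青花瓷")
tags = ["B", "I", "I", "O", "O", "B", "I", "I"]
print(decode_bio(tokens, tags))
```

Tokens are joined without a separator because the example is character-level Chinese; text in a space-delimited language would need a different join.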


7 Generating the submission file

%cd /home/aistudio/project
!python submit.py
Loading /home/aistudio/project/dataset/test_a.json...
Saved result.txt



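8 Tips for improving your score

  • Train the video classification tag model by fine-tuning: fine-tune the VideoTag pretrained model on your own training data to obtain a model that performs better.

  • Improve and optimize the AttentionLSTM model, or improve the model framework.

  • The baseline semantic tag model uses bert-wwm-ext-chinese from the PaddleNLP model library; other models can be substituted.

  • Upgrade the inference process, for example by reasoning over the provided knowledge base, fusing multi-modal information to improve tag understanding, or generating tags.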
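
As one possible reading of the last tip, the sketch below fuses candidate tags from several modalities by weighted voting and keeps only candidates that appear in the knowledge base. Everything here is an assumption for illustration: the function `fuse_tags`, the source names, the weights, and the representation of the knowledge base as a plain set of entity names. The real knowledge base format is not described in this document.

```python
from collections import defaultdict


def fuse_tags(candidates_by_source, source_weights, kb_entities, top_k=3):
    """Rank candidate tags by weighted votes across sources, keeping KB entries only."""
    scores = defaultdict(float)
    for source, tags in candidates_by_source.items():
        weight = source_weights.get(source, 0.0)
        for tag in set(tags):        # count each tag at most once per source
            if tag in kb_entities:   # drop candidates unknown to the knowledge base
                scores[tag] += weight
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_k]]


# All values below are made up for illustration.
candidates = {
    "ocr": ["周杰伦", "演唱会", "abc123"],
    "asr": ["周杰伦", "青花瓷"],
    "face": ["周杰伦"],
}
weights = {"ocr": 1.0, "asr": 0.8, "face": 1.5}
kb = {"周杰伦", "青花瓷", "演唱会"}
print(fuse_tags(candidates, weights, kb))
```

The per-source weights would normally be tuned on a validation split rather than set by hand.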


