Slot Filling and Intent Recognition with Pretrained Models

Intent recognition, explained

Intent recognition means analyzing a user's core need and returning the information most relevant to the query. In search, for instance, users may want to find a movie, track a parcel, or reach a government office; these needs require very different retrieval strategies under the hood, and a misrecognized intent almost guarantees that nothing returned will satisfy the user, resulting in a very poor experience. In dialogue, accurately understanding what the other party means is an equally challenging task.
For example, when a user searches for "仙剑奇侠传" (Chinese Paladin), the query may refer to the game, the TV series, news, images, and more. If intent recognition tells us the user wants to watch the TV series, we can return the series directly, saving clicks, shortening the search, and greatly improving the experience. Likewise, if someone in a conversation says "my Apple never lags", intent recognition lets us conclude that "Apple" here refers to an electronic device rather than a fruit, so the dialogue can proceed smoothly.

In short, the accuracy of intent recognition largely determines how accurate search results are and how intelligent a dialogue system feels.

This example shows how to use an ERNIE pretrained model to perform slot filling and intent recognition for task-oriented dialogue; these two tasks are the cornerstone of a pipeline-style task-oriented dialogue system.

The dataset used in this example is the CrossWOZ Chinese dialogue dataset, which covers multiple domains, including attractions, restaurants, hotels, and transportation.

For the raw data and preprocessing scripts, see CrossWOZ.

Requirements

  • PaddlePaddle

    This project requires PaddlePaddle 2.3 or later; please refer to the installation guide to install it.

  • PaddleNLP

    pip install --upgrade paddlenlp -i https://pypi.org/simple

  • Python

    Python 3.7 or later is required.

AI Studio comes with Paddle and PaddleNLP preinstalled and updates them periodically. To update Paddle manually, refer to the PaddlePaddle installation instructions and install the latest framework for your environment.

Run the following commands to install the GPU build of PaddlePaddle and make sure the latest PaddleNLP is installed:

!python -m pip install paddlepaddle-gpu==2.3.0.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
!pip install --upgrade paddlenlp

Workflow

As with most NLP tasks, this example proceeds in the following four steps:

We start with data preparation.

[Figure](https://ai-studio-static-online.cdn.bcebos.com/dd30e17318fb48fabb5701fd8a97be8176a1e372dd134cc0826e58cb5401933d)

Data preparation

The data preparation workflow is as follows:

[Figure: data preparation workflow](https://ai-studio-static-online.cdn.bcebos.com/89ba02da6f914297ae2fc438d1c9f773556f226652134fb684ac0186bfa5bb7d)

1. Create a custom dataset with load_dataset()

The dataset, already preprocessed with the official scripts, has been uploaded to the project root directory. After inspecting its format, we can write a reader function that matches it and pass the function to load_dataset() to create the dataset.

from paddlenlp.datasets import load_dataset
import json

# Read a label file and build the id<->label maps
def get_label_map(filename):
    with open(filename, "r", encoding="utf8") as f:
        label_list = json.load(f)
    
    id2label = dict([(idx, label) for idx, label in enumerate(label_list)])
    label2id = dict([(label, idx) for idx, label in enumerate(label_list)])
    return id2label, label2id

id2slot, slot2id = get_label_map('slot_labels.json')
id2intent, intent2id = get_label_map('intent_labels.json')

intent_weight = [1] * len(intent2id)

# Define a data-reading generator that matches the local file format
def read(filename):
    with open(filename, "r", encoding="utf8") as f:
        data = json.load(f)
    for entry in data:
        yield {
            'words': entry[0],
            'slots': entry[1],
            'intents': entry[2],
            'history': entry[4],
        }

# Pass the generator to load_dataset
train_ds = load_dataset(read, filename='train.json', lazy=False)
dev_ds = load_dataset(read, filename='test.json', lazy=False)

for idx in range(2):
    print(train_ds[idx]['words'])
    print(train_ds[idx]['slots'])
    print(train_ds[idx]['intents'])
    print(train_ds[idx]['history'])
    print()
['你', '好', ',', '麻', '烦', '帮', '我', '推', '荐', '一', '个', '门', '票', '免', '费', '的', '景', '点', '。']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B+Inform+景点+门票', 'I+Inform+景点+门票', 'O', 'O', 'O', 'O']
['General+greet+none+none', 'Request+景点+名称+']
[]

['你', '好', ',', '您', '可', '以', '选', '择', '故', '宫', ',', '八', '达', '岭', '长', '城', ',', '颐', '和', '园', '或', '者', '红', '砖', '美', '术', '馆', '。']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B+Inform+景点+名称', 'I+Inform+景点+名称', 'I+Inform+景点+名称', 'I+Inform+景点+名称', 'I+Inform+景点+名称', 'O']
['General+greet+none+none']
['你好,麻烦帮我推荐一个门票免费的景点。']

For more on custom datasets, please refer to How to Customize a Dataset.

2. Load paddlenlp.transformers.AutoTokenizer for data processing

Before text can be fed into the ERNIE pretrained model, it must be processed into features. This usually involves tokenization, converting tokens to ids, adding special tokens, and similar steps.

PaddleNLP ships with a matching tokenizer for every built-in pretrained model; specifying the model name is enough to load the corresponding tokenizer.

The data processing above can be done simply by calling the tokenizer's methods.

import paddlenlp

# Set the model name
MODEL_NAME = 'ernie-3.0-medium-zh'
tokenizer = paddlenlp.transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
[2022-06-29 15:11:04,512] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-medium-zh'.
[2022-06-29 15:11:04,515] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-06-29 15:11:04,575] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-06-29 15:11:04,659] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
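
As a quick sanity check (purely illustrative), we can call the freshly loaded tokenizer on a pre-split character list, the same way convert_example will later; it returns the input_ids and token_type_ids that ERNIE expects:

# Illustrative only: tokenize a pre-split character sequence, as convert_example does later
sample = tokenizer(['你', '好'], is_split_into_words=True, max_seq_len=512)
print(sample['input_ids'])       # [1, 226, 170, 2]  ->  [CLS] 你 好 [SEP]
print(sample['token_type_ids'])  # [0, 0, 0, 0], all zeros for a single sentence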

3. Process the data in batches with map()

Because we passed lazy=False, the custom dataset created with load_dataset() is a MapDataset object.

MapDataset is an enhanced version of paddle.io.Dataset, and its built-in map() method is well suited to processing the whole dataset at once.

The map() method takes a data-processing function, which pairs naturally with the tokenizer.

Here is how it is used in this example:

from functools import partial
import numpy as np

def convert_example(example, tokenizer, use_history=False, no_entity_id=0, max_seq_len=512, mode='train'):
    
    # Use the tokenizer to convert the text to ids
    tokenized_input = tokenizer(
        example['words'],
        is_split_into_words=True,
        max_seq_len=max_seq_len)
    
    # Truncate the slot labels so they stay the same length as input_ids
    slot_labels = example['slots']
    if len(tokenized_input['input_ids']) - 2 < len(slot_labels):
        slot_labels = slot_labels[:len(tokenized_input['input_ids']) - 2]
    
    # Convert the slot labels to ids with the label map
    tokenized_input['slot_labels'] = [no_entity_id] + [slot2id[label] for label in slot_labels] + [no_entity_id]
    
    # Intent recognition here is multi-label classification, so encode the intent labels as a multi-hot vector
    intent_labels = np.zeros(len(intent2id), dtype='int64')
    for l in example['intents']:
        intent_labels[intent2id[l]] = 1
        if mode == 'train':
            # Count the positive samples of each intent in the training set
            intent_weight[intent2id[l]] += 1
    tokenized_input['intent_labels'] = intent_labels
    
    # Join the dialogue history with [SEP] and convert it to ids
    if use_history:
        tokenized_history = tokenizer(
            tokenizer.cls_token+tokenizer.sep_token.join(example['history']),
            max_seq_len=max_seq_len)
        tokenized_input['history_ids'] = tokenized_history['input_ids']
    else:
        tokenized_input['history_ids'] = []
    return tokenized_input

use_history = False
max_seq_length = 512

train_trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        mode='train',
        use_history=use_history,
        max_seq_len=max_seq_length)

dev_trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        mode='dev',
        use_history=use_history,
        max_seq_len=max_seq_length)

train_ds.map(train_trans_func, lazy=False)   
dev_ds.map(dev_trans_func, lazy=False)

# Weight each intent according to its positive and total sample counts
for intent, intent_id in intent2id.items():
    neg_pos = (len(train_ds) - intent_weight[intent_id]) / intent_weight[intent_id]
    intent_weight[intent_id] = np.log10(neg_pos)

for idx in range(2):
    print(train_ds[idx]['input_ids'])
    print(train_ds[idx]['token_type_ids'])
    print(train_ds[idx]['slot_labels'])
    print(train_ds[idx]['intent_labels'])
    print()
[1, 226, 170, 4, 1298, 1934, 836, 75, 426, 1645, 7, 27, 232, 1039, 783, 453, 5, 561, 180, 12043, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0]
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0]

[1, 226, 170, 4, 892, 48, 22, 352, 790, 470, 915, 4, 643, 302, 1560, 84, 257, 4, 3540, 14, 509, 172, 123, 536, 1520, 188, 133, 774, 12043, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 0, 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0]

As the output above shows, the examples in the dataset have been converted into features the model can consume, namely input_ids, token_type_ids, slot_labels and intent_labels.
Where:

  • input_ids: the token ids of the input text.
  • token_type_ids: which sentence each token belongs to (Transformer-style pretrained models support both single-sentence and sentence-pair inputs).
  • slot_labels: the slot labels, one per token of the input text.
  • intent_labels: the intent labels, a list whose length equals the total number of intent labels, with 1 at the positions of the active intents and 0 elsewhere.
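
To double-check the alignment (illustrative only), we can map the ids back to tokens and slot tags with the tokenizer and the id2slot map built earlier:

# Illustrative check: map ids back to tokens and slot tags for the first example
tokens = tokenizer.convert_ids_to_tokens(train_ds[0]['input_ids'])
tags = [id2slot[i] for i in train_ds[0]['slot_labels']]
for token, tag in zip(tokens, tags):
    print(token, tag)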

For more on data processing, please refer to the data processing documentation.

4. Batchify and data loading

Use paddle.io.BatchSampler together with the utilities provided in paddlenlp.data to group the data into batches.

Then use the paddle.io.DataLoader API to load the data asynchronously with multiple workers.

How batchify_fn works:

[Figure: how batchify_fn assembles a batch](https://ai-studio-static-online.cdn.bcebos.com/30e43d4659384375a2a2c1b890ca5a995c4324d7168e49cebf1d2a1e99161f7d)
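
A tiny demo (illustrative only) of what the Pad and Stack collators used below do to a list of samples:

from paddlenlp.data import Pad, Stack

# Pad right-pads variable-length sequences to the batch's maximum length
print(Pad(axis=0, pad_val=0)([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]

# Stack simply stacks equally sized samples into one batch array
print(Stack()([[1, 0], [0, 1], [1, 1]]))
# [[1 0]
#  [0 1]
#  [1 1]]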

import paddle
from paddlenlp.data import Stack, Dict, Pad
from paddlenlp.data import DataCollatorWithPadding

batch_size = 20

# Initialize the BatchSampler
train_batch_sampler = paddle.io.BatchSampler(
    train_ds, batch_size=batch_size, shuffle=True)

dev_batch_sampler = paddle.io.BatchSampler(
    dev_ds, batch_size=batch_size, shuffle=False)

# Define batchify_fn
batchify_fn = lambda samples, fn=Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), 
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
    "slot_labels": Pad(axis=0, pad_val=0, dtype="int64"),
    "intent_labels": Stack(dtype="float32"),
    "history_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id)
}): fn(samples)

# batchify_fn = DataCollatorWithPadding(tokenizer)

# Initialize the DataLoader
train_data_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_sampler=train_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True)

dev_data_loader = paddle.io.DataLoader(
    dataset=dev_ds,
    batch_sampler=dev_batch_sampler,
    collate_fn=batchify_fn,
    return_list=True)

For more batchify-related APIs built into PaddleNLP, please refer to collate.

That completes dataset preparation. Next we need to build the network and design the loss function.

[Figure](https://ai-studio-static-online.cdn.bcebos.com/fdcb44a00ede4ce08ae2652931556fb58cc903f686bf491792489353d2800e7d)

Model architecture

Load a pretrained model with PaddleNLP in one line

Using ERNIE as an example, this project shows how to fine-tune a pretrained model with multi-task learning so that it handles intent recognition and slot filling at the same time.

In this example, intent recognition is essentially a sentence classification task and slot filling a sequence labeling task; combining their losses gives us multi-task learning.

[Figure](https://ai-studio-static-online.cdn.bcebos.com/d9ff881921d74602acb6eb27c8523cb50285f07a7beb4a3cbfa1edbd9b3f9c5c)

from src.models import JointErnie

model = JointErnie.from_pretrained(MODEL_NAME, 
                                   intent_dim=len(intent2id), 
                                   slot_dim=len(slot2id), 
                                   dropout=0.1, 
                                   use_history=use_history)
[2022-06-29 15:11:15,483] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0629 15:11:15.487289  4452 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0629 15:11:15.491648  4452 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.

Design the loss function

The JointErnie model takes the sequence_output of ErnieModel and feeds it into a linear layer whose output dimension equals the number of slot classes to obtain slot_logits, and feeds the pooled_output into a linear layer whose output dimension equals the number of intent classes to obtain intent_logits.
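
JointErnie itself lives in src.models and is not shown in this example. Purely as a hypothetical sketch (not the actual implementation, and with the dialogue-history branch omitted), a joint model of this kind could look roughly like this:

import paddle.nn as nn
from paddlenlp.transformers import ErnieModel

# Hypothetical, simplified stand-in for src.models.JointErnie (history branch omitted)
class SimpleJointErnie(nn.Layer):
    def __init__(self, model_name, intent_dim, slot_dim, dropout=0.1):
        super().__init__()
        self.ernie = ErnieModel.from_pretrained(model_name)
        hidden_size = self.ernie.config["hidden_size"]
        self.dropout = nn.Dropout(dropout)
        self.intent_classifier = nn.Linear(hidden_size, intent_dim)  # pooled_output -> intent logits
        self.slot_classifier = nn.Linear(hidden_size, slot_dim)      # sequence_output -> slot logits

    def forward(self, input_ids, token_type_ids=None):
        sequence_output, pooled_output = self.ernie(input_ids, token_type_ids=token_type_ids)
        slot_logits = self.slot_classifier(self.dropout(sequence_output))
        intent_logits = self.intent_classifier(self.dropout(pooled_output))
        return slot_logits, intent_logits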

The loss in this example therefore consists of two parts, slot_loss and intent_loss, and we need to define the loss function ourselves.

Slot filling amounts to a multi-class classification at every token position, while intent recognition is a multi-label classification over the whole utterance, so the loss function is designed as follows:

class NLULoss(paddle.nn.Layer):
    def __init__(self, pos_weight):
        super(NLULoss, self).__init__()

        self.intent_loss_fn = paddle.nn.BCEWithLogitsLoss(pos_weight=paddle.to_tensor(pos_weight))
        self.slot_loss_fct = paddle.nn.CrossEntropyLoss()

    def forward(self, logits, slot_labels, intent_labels):
        slot_logits, intent_logits = logits

        slot_loss = self.slot_loss_fct(slot_logits, slot_labels)
        intent_loss = self.intent_loss_fn(intent_logits, intent_labels)

        return slot_loss + intent_loss

Having chosen the network structure, we need to set up the fine-tuning optimization strategy.

[Figure](https://ai-studio-static-online.cdn.bcebos.com/7eca6595f338409498149cb586c077ba4933739810cf436080a2292be7e0a92d)

Set up the fine-tuning optimization strategy

The learning rate schedule commonly used for Transformer models such as ERNIE/BERT is a dynamic one with warmup.


Figure 3: Dynamic learning rate schedule
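
With linear warmup followed by linear decay, which is what LinearDecayWithWarmup implements, the learning rate at training step t roughly follows:

$$
lr(t)=\begin{cases}
lr_{max}\cdot \frac{t}{T_{warmup}}, & t < T_{warmup}\\
lr_{max}\cdot \frac{T_{total}-t}{T_{total}-T_{warmup}}, & t \ge T_{warmup}
\end{cases}
$$

where T_warmup = warmup_proportion × T_total. With warmup_proportion = 0.0, as configured below, the schedule degenerates into a pure linear decay from the peak learning rate down to 0.
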
# Peak learning rate during training
learning_rate = 3e-5 

# Number of training epochs
epochs = 1

# Warmup proportion for the learning rate
warmup_proportion = 0.0

# Weight decay coefficient, a regularization strategy that helps prevent overfitting
weight_decay = 0.0

max_grad_norm = 1.0

num_training_steps = len(train_data_loader) * epochs

# Learning rate decay schedule
lr_scheduler = paddlenlp.transformers.LinearDecayWithWarmup(learning_rate, num_training_steps,
                                    warmup_proportion)

decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params,
    grad_clip=paddle.nn.ClipGradByGlobalNorm(max_grad_norm))

Everything is now in place and we can start training the model.

[Figure](https://ai-studio-static-online.cdn.bcebos.com/6975542d488f4f75b385fe75d574a3aaa8e208f5e99f4acd8a8e8aea3b85c058)

Model training and evaluation

Model training usually involves the following steps:

  1. Take a batch of data from the dataloader.
  2. Feed the batch to the model for a forward pass.
  3. Pass the forward results to the loss function and compute the loss.
  4. Backpropagate the loss and update the parameters. Repeat the steps above.

After each training epoch, the program calls the evaluation() method to compute the F1 score of each of the two tasks.
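
The evaluation() helper is imported from src.utils and is not shown here. Purely as a rough, hypothetical sketch (not the actual implementation), such a routine could compute the two scores along the following lines, using micro-averaged F1 from scikit-learn:

import numpy as np
from sklearn.metrics import f1_score

# Hypothetical, simplified stand-in for src.utils.evaluation
@paddle.no_grad()
def simple_evaluation(model, data_loader, threshold=0.5):
    model.eval()
    intent_true, intent_pred, slot_true, slot_pred = [], [], [], []
    for input_ids, token_type_ids, slot_labels, intent_labels, history_ids in data_loader:
        slot_logits, intent_logits = model(input_ids, token_type_ids, history_ids=history_ids)
        # Intents are multi-label: sigmoid + threshold
        intent_pred.append((paddle.nn.functional.sigmoid(intent_logits).numpy() > threshold).astype("int64"))
        intent_true.append(intent_labels.numpy().astype("int64"))
        # Slots are per-token multi-class: argmax over the slot classes
        slot_pred.append(paddle.argmax(slot_logits, axis=-1).numpy().reshape(-1))
        slot_true.append(slot_labels.numpy().reshape(-1))
    intent_f1 = f1_score(np.vstack(intent_true), np.vstack(intent_pred), average="micro")
    slot_f1 = f1_score(np.concatenate(slot_true), np.concatenate(slot_pred), average="micro")
    model.train()
    print("Slot F1: %.4f, Intent F1: %.4f" % (slot_f1, intent_f1))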

from src.utils import evaluation

criterion = NLULoss(intent_weight)
global_step = 0

for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        global_step += 1
        input_ids, token_type_ids, slot_labels, intent_labels, history_ids = batch
        logits = model(input_ids, token_type_ids, history_ids=history_ids)
        loss = criterion(logits, slot_labels, intent_labels)

        if global_step % 1000 == 0 :
            print("global step %d, epoch: %d, batch: %d, loss: %.5f" % (global_step, epoch, step, loss))

        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    
    print('\nEval begin...')
    evaluation(model, dev_data_loader)
global step 1000, epoch: 1, batch: 1000, loss: 0.17512
global step 2000, epoch: 1, batch: 2000, loss: 0.09479
global step 3000, epoch: 1, batch: 3000, loss: 0.02178
global step 4000, epoch: 1, batch: 4000, loss: 0.07578

Eval begin...
Total samples: 8476
Slot F1: 0.9397
Intent F1: 0.6043
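
Model export and inference

After training, we switch the model to eval mode, export it to a static graph with paddle.jit.to_static and paddle.jit.save, and serve it through the Paddle Inference API:
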
model.eval()
# convert to static graph with specific input description
model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(
            shape=[None, None], dtype="int64"),  # input_ids
        paddle.static.InputSpec(
            shape=[None, None], dtype="int64")  # segment_ids
    ])
# save converted static graph model
paddle.jit.save(model, "infer_model")
config = paddle.inference.Config("infer_model" + ".pdmodel",
                                         "infer_model" + ".pdiparams")
config.disable_gpu()
config.switch_use_feed_fetch_ops(False)

predictor = paddle.inference.create_predictor(config)

input_handles = [
            predictor.get_input_handle(name)
            for name in predictor.get_input_names()
        ]
output_handles = [
            predictor.get_output_handle(name)
            for name in predictor.get_output_names()
        ]

infer_batchify_fn = lambda samples, fn=Dict(
        {
            "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
            "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
        }): fn(samples)

infer_data_loader = paddle.io.DataLoader(
            dataset=dev_ds,
            batch_sampler=dev_batch_sampler,
            collate_fn=infer_batchify_fn,
            num_workers=0,
            return_list=True)

for data in infer_data_loader:
    for input_field, input_handle in zip(data, input_handles):
        input_handle.copy_from_cpu(input_field.numpy() if isinstance(
            input_field, paddle.Tensor) else input_field)
    predictor.run()
    output = [
        output_handle.copy_to_cpu() for output_handle in output_handles
    ]
    print(output)


[Output truncated: each batch's output is a list of two arrays, token-level slot logits with shape (batch_size, seq_len, num_slot_labels) and utterance-level intent logits with shape (batch_size, num_intent_labels).]
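
To turn these raw logits into readable predictions (illustrative post-processing, assuming the first output is the slot logits and the second the intent logits, matching JointErnie's forward):

# Illustrative post-processing of one batch of predictor outputs
slot_logits, intent_logits = output

# Slots: argmax over classes at every token, then map ids back to tags
slot_ids = slot_logits.argmax(axis=-1)                      # (batch_size, seq_len)
slot_tags = [[id2slot[i] for i in seq] for seq in slot_ids.tolist()]

# Intents: multi-label, so apply a sigmoid and keep every label above 0.5
probs = 1 / (1 + np.exp(-intent_logits))                    # (batch_size, num_intents)
intents = [[id2intent[i] for i, p in enumerate(row) if p > 0.5] for row in probs]

print(slot_tags[0][:20])
print(intents[0])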

Incorporating dialogue history

The original paper for this dataset mentions a way to improve the model: passing features of the dialogue history into the model can strengthen its performance.

Here is an example:

  • Current utterance: “好,玩累了周边有酒店可以休息吗?” (OK, is there a hotel nearby where I can rest once I'm tired of sightseeing?)

  • Intent: “Request+景点+周边酒店”

  • History: [“你好,帮我找一个20到50元之间的景点,谢谢。”, “八达岭长城怎么样?门票35元。”] (Hi, please find me an attraction with tickets between 20 and 50 yuan, thanks. / How about the Badaling Great Wall? Tickets are 35 yuan.)

In this example, the use_history argument of the data processing function controls whether this strategy is used.
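
As a quick illustration of what this does, the snippet below runs convert_example with use_history=True on the example above (the intent label is just a placeholder taken from an earlier training sample, since the exact label string for this turn depends on the preprocessed data):

# Illustrative only: build features for the example above with history enabled
words = list('好,玩累了周边有酒店可以休息吗?')
example = {
    'words': words,
    'slots': ['O'] * len(words),
    'intents': ['General+greet+none+none'],   # placeholder label known to exist in intent2id
    'history': ['你好,帮我找一个20到50元之间的景点,谢谢。',
                '八达岭长城怎么样?门票35元。'],
}
features = convert_example(example, tokenizer, use_history=True, mode='dev')
print(features['history_ids'])  # token ids of the concatenated history turns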

For details of this strategy and the original baseline models, please refer to CrossWOZ.

More tasks

For more task-oriented dialogue examples and datasets, see DGU in PaddleNLP.

The implementation above is based on PaddleNLP. Open source isn't easy, so please show your support~
Remember to give PaddleNLP a little Star ⭐

GitHub: https://github.com/PaddlePaddle/PaddleNLP

For more usage, please refer to the PaddleNLP tutorials.

Join the discussion group and learn together

Join the PaddleNLP QQ tech group now and discuss NLP with us!

The author only mirrored this project to AI Studio; the original project is at: https://aistudio.baidu.com/aistudio/projectdetail/2017202?contributionType=1
