[NLP Check-in Camp] Hands-on Lesson 11: Building a Chinese Chitchat Bot

Human-machine interaction with the plato-mini model and the generation API


Published by AI Studio · 2021-11-05 09:41:08

The Past and Present of Chatbots

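Between 1964 and 1966, Joseph Weizenbaum, a German-American computer scientist at the MIT Artificial Intelligence Laboratory, developed Eliza, the first chatbot in history.
Eliza is named after a character in Pygmalion, the play by Irish playwright George Bernard Shaw. In the play, Eliza, a flower girl from a poor background, learns to speak the way high society does and becomes the "Hungarian royal princess" admired by everyone at the embassy ball. As the world's first chatbot, Eliza was given a name with distinctly theatrical connotations by its author.
Although some basic digital language generators (programs that could output coherent text) already existed at the time, Eliza was the first program explicitly designed to interact with people. Users could type natural language on a typewriter and receive a response from the machine. As Weizenbaum explained, Eliza made "conversation between man and computer possible".
As deep learning has advanced, chatbots have become increasingly capable. We can hand routine question answering to a bot, or chat with one in our spare time, and their arrival has made life more colorful. Today, combining PaddlePaddle with Wechaty is enough to build a simple chatbot.
One example is a WeChat chitchat bot demo built on PaddleHub + Wechaty. Wechaty receives the incoming WeChat messages, PaddleHub's plato-mini model generates a new reply from the dialogue context, and the reply is sent back as a WeChat message, completing the chitchat interaction.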

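Another example is a WeChat sentiment recognition bot demo built on PaddleNLP + Wechaty. Wechaty receives the incoming WeChat messages, PaddleNLP's TextCNN model classifies the sentiment of the input text, and the result is returned as a WeChat message.

If you are interested, you can follow this demo to build a sentiment recognition bot on your own WeChat account.

Demo link: https://github.com/mawenjie8731/paddlenlp-wechaty-demo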

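Building a Chatbot with the PaddleNLP Generation API

In recent years, human-machine dialogue systems have attracted wide attention from both academia and industry and have made good progress. An open-domain dialogue system aims to support multi-turn conversation on any topic, so that a machine can interact with people fluently and naturally. It should handle everyday small talk such as greetings as well as specific tasks, which is what gives such a system practical value.

This example focuses on the features and usage of the generation API built into PaddleNLP, and uses PaddleNLP's built-in plato-mini model together with that API to implement a simple chitchat bot.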

Requirements

PaddlePaddle

This project depends on PaddlePaddle 2.0 or later; please follow the installation guide to install it.

PaddleNLP

pip install --upgrade paddlenlp -i https://pypi.org/simple

sentencepiece

pip install --upgrade sentencepiece -i https://pypi.org/simple

Python

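Python 3.6 or later is required.

The AI Studio platform comes with Paddle and PaddleNLP preinstalled and updates them regularly. To update Paddle manually, refer to the PaddlePaddle installation instructions and install the latest framework release for your environment.

Use the following command to make sure the latest PaddleNLP is installed: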

!pip install --upgrade paddlenlp -i https://pypi.org/simple

Requirement already satisfied: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.0.8)
Requirement already satisfied: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)
Requirement already satisfied: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)
Requirement already satisfied: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)
Requirement already satisfied: multiprocess in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.70.11.1)
Requirement already satisfied: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)
Requirement already satisfied: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.15.0)
Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.20.3)
Requirement already satisfied: dill>=0.3.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from multiprocess->paddlenlp) (0.3.3)
Requirement already satisfied: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.24.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.6.3)

!pip install --upgrade pip
!pip install --upgrade sentencepiece

Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: pip in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (21.2.4)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: sentencepiece in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.1.96)

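Generation API

For generation tasks, PaddleNLP provides the generate() function, which is built into all of PaddleNLP's generative models. It supports the Greedy Search, Beam Search, and Sampling decoding strategies. Users only need to specify a decoding strategy and its parameters to run decoding and obtain the token ids of the generated sequence together with its probability score.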
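
To make the difference between the decoding strategies more concrete, here is a minimal NumPy sketch of a single decoding step. It is independent of PaddleNLP: the logits are made up, and the names `softmax`, `greedy_step`, and `sampling_step` are illustrative assumptions, not library functions.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def greedy_step(logits):
    """Greedy search: always pick the single most likely token."""
    return int(np.argmax(logits))

def sampling_step(logits, rng):
    """Sampling: draw a token at random according to its probability."""
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical scores over a 5-token vocabulary for one decoding step.
toy_logits = np.array([0.1, 2.5, 0.3, 1.9, -1.0])
rng = np.random.default_rng(0)

greedy_token = greedy_step(toy_logits)
sampled_tokens = [sampling_step(toy_logits, rng) for _ in range(10)]
print(greedy_token)    # index of the highest logit
print(sampled_tokens)  # varies from draw to draw
```

Greedy search is deterministic, while sampling trades some likelihood for variety. Beam search sits between the two by keeping several candidate prefixes alive at each step instead of only one.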

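The following example shows a GPT model using the generation API:

1. Load `paddlenlp.transformers.GPTChineseTokenizer` for data processing

Before text is fed into a pretrained model, it must be processed into features. This usually involves tokenization, token-to-id mapping, and adding special tokens.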
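
The three steps can be sketched with a toy, character-level encoder. This is only an illustration under assumptions: the vocabulary, the special token names, and the `toy_encode` helper are hypothetical and do not reflect how any PaddleNLP tokenizer is implemented.

```python
# Hypothetical toy vocabulary; real tokenizers ship with their own vocab files.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "你": 3, "好": 4, "啊": 5}

def toy_encode(text, vocab):
    """Character-level tokenization, id lookup, then special tokens."""
    tokens = list(text)                                        # 1. tokenize
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]   # 2. token to id
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]           # 3. add special tokens

toy_ids = toy_encode("你好吗", toy_vocab)
print(toy_ids)  # "吗" is not in the toy vocabulary, so it maps to [UNK]
```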

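PaddleNLP ships a matching tokenizer for each pretrained model; specifying the name of the model you want to use loads the corresponding tokenizer.

Calling the __call__ method of GPTChineseTokenizer converts what we say into input the model can accept.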

from paddlenlp.transformers import GPTChineseTokenizer

# Name of the model to use
model_name = 'gpt-cpm-small-cn-distill'
tokenizer = GPTChineseTokenizer.from_pretrained(model_name)

[2021-08-30 23:05:44,716] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/gpt-cpm-small-cn-distill/gpt-cpm-cn-sentencepiece.model

import paddle

user_input = "花间一壶酒，独酌无相亲。举杯邀明月，"

# Convert the text to ids
input_ids = tokenizer(user_input)['input_ids']
print(input_ids)

# Convert the ids to a tensor with a leading batch dimension
input_ids = paddle.to_tensor(input_ids, dtype='int64').unsqueeze(0)
print(input_ids)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.708 seconds.
Prefix dict has been built successfully.


[587, 205, 44, 6110, 1215, 8, 9, 2598, 20779, 255, 6629, 8, 12, 3473, 2475, 8, 2316, 11653, 8, 9]
Tensor(shape=[1, 20], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
       [[587  , 205  , 44   , 6110 , 1215 , 8    , 9    , 2598 , 20779, 255  , 6629 , 8    , 12   , 3473 , 2475 , 8    , 2316 , 11653, 8    , 9    ]])

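2. Load a pretrained model in one line with PaddleNLP

PaddleNLP provides Chinese pretrained models such as GPT and UnifiedTransformer, which can be loaded in one line by model name.

GPT uses the Transformer decoder as its basic building block and a unidirectional attention mechanism, which makes it suitable for long-text generation tasks.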
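
Unidirectional attention means each position can only attend to itself and earlier positions. A minimal NumPy sketch of such a mask follows; the `causal_mask` helper is a hypothetical name used for illustration, not a PaddleNLP function.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))  # row i has ones in columns 0..i and zeros afterwards
```

Because no position sees the tokens to its right, the model can generate text one token at a time, feeding each new token back in as input.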

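PaddleNLP currently provides several Chinese and English GPT pretrained models; here we use a small Chinese one. For other pretrained models, refer to the model list.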

from paddlenlp.transformers import GPTLMHeadModel

# Load the Chinese GPT model in one line
model = GPTLMHeadModel.from_pretrained(model_name)

[2021-08-30 23:05:50,845] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/gpt-cpm-small-cn-distill/gpt-cpm-small-cn-distill.pdparams
[2021-08-30 23:05:51,712] [    INFO] - Weights of GPTLMHeadModel not initialized from pretrained model: ['lm_head.decoder_weight']

# Call the generation API to generate text
ids, scores = model.generate(
                input_ids=input_ids,
                max_length=16,
                min_length=1,
                decode_strategy='greedy_search')

print(ids)
print(scores)

Tensor(shape=[1, 16], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
       [[39  , 8   , 1181, 211 , 8913, 8   , 12  , 8   , 10  , 8   , 10  , 8   , 10  , 8   , 10  , 8   ]])
Tensor(shape=[1, 1], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[-0.33355120]])

generated_ids = ids[0].numpy().tolist()

# Use the tokenizer to convert the generated ids back to text
generated_text = tokenizer.convert_ids_to_string(generated_ids)
print(generated_text)

对影成三人。

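The generated result looks quite good, and the generation API is very easy to use.

Next, we show how to use the UnifiedTransformer model and the generation API for chitchat dialogue.

1. Load `paddlenlp.transformers.UnifiedTransformerTokenizer` for data processing

UnifiedTransformerTokenizer is loaded the same way as the GPT tokenizer, but its data processing API is slightly different.

Calling the dialogue_encode method of UnifiedTransformerTokenizer converts what we say into input the model can accept.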

from paddlenlp.transformers import UnifiedTransformerTokenizer

# Name of the model to use
model_name = 'plato-mini'
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name)

[2021-08-30 23:05:52,637] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/plato-mini/plato-mini-vocab.txt
[2021-08-30 23:05:52,639] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/plato-mini/plato-mini-spm.model

user_input = ['你好啊，你今年多大了']

# Call dialogue_encode to build the model input
encoded_input = tokenizer.dialogue_encode(
                    user_input,
                    add_start_token_as_response=True,
                    return_tensors=True,
                    is_split_into_words=False)

print(encoded_input.keys())

dict_keys(['input_ids', 'token_type_ids', 'position_ids', 'attention_mask'])

For detailed usage of dialogue_encode, refer to the dialogue_encode documentation.
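
The example above passes a list containing a single utterance. Assuming that a multi-turn conversation is passed the same way, as a longer list of utterances, a bot needs to keep track of recent turns. The sketch below shows one possible way to do that in plain Python; the `DialogueHistory` class and the `max_turns` limit are hypothetical and not part of PaddleNLP.

```python
from collections import deque

class DialogueHistory:
    """Keep only the most recent utterances of a conversation."""

    def __init__(self, max_turns=4):
        # A deque with maxlen silently drops the oldest item when full.
        self.utterances = deque(maxlen=max_turns)

    def add(self, utterance):
        self.utterances.append(utterance)

    def as_list(self):
        return list(self.utterances)

history = DialogueHistory(max_turns=3)
for text in ["你好啊", "你好，很高兴认识你", "你今年多大了", "我今年23岁了"]:
    history.add(text)
print(history.as_list())  # the first utterance has been dropped
```

Bounding the history keeps the encoded input from growing without limit as the conversation goes on.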

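2. Load a pretrained model in one line with PaddleNLP

As with GPT, the UnifiedTransformer pretrained model can be loaded in one line.

UnifiedTransformer uses the Transformer encoder as its basic building block and a flexible attention mechanism, which makes it well suited to text generation. It also adds special tokens to the model input that identify different dialogue skills, so the model can support chitchat, recommendation dialogue, and knowledge-grounded dialogue at the same time.

PaddleNLP currently provides three Chinese pretrained models for UnifiedTransformer:

unified_transformer-12L-cn: pretrained on a large-scale Chinese conversation dataset.
unified_transformer-12L-cn-luge: obtained by fine-tuning unified_transformer-12L-cn on the Qianyan (LUGE) dialogue dataset.
plato-mini: pretrained on Chinese chitchat dialogue data at the scale of billions of samples.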

from paddlenlp.transformers import UnifiedTransformerLMHeadModel

model = UnifiedTransformerLMHeadModel.from_pretrained(model_name)

[2021-08-30 23:05:52,717] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/plato-mini/plato-mini.pdparams

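Next, we pass the processed input to the generate function and configure the decoding strategy.

Here we use a top-k sampling decoding strategy, that is, sampling by probability from the k most likely candidates.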
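
As a rough illustration of what top-k sampling does at each step, here is a small NumPy sketch. It is not PaddleNLP's implementation: the `top_k_sample` helper and the logits are assumptions made for the example.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Keep the k highest-scoring tokens, renormalize, and sample one."""
    top_indices = np.argsort(logits)[-k:]          # indices of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())  # softmax over the kept tokens only
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))

# Hypothetical scores over a 6-token vocabulary.
toy_logits = np.array([0.1, 2.5, 0.3, 1.9, -1.0, 0.7])
rng = np.random.default_rng(42)
draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(sorted(set(draws)))  # only tokens from the top 2 can ever appear
```

A small k keeps replies close to the most likely wording, while a larger k allows more varied but riskier replies.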

ids, scores = model.generate(
                input_ids=encoded_input['input_ids'],
                token_type_ids=encoded_input['token_type_ids'],
                position_ids=encoded_input['position_ids'],
                attention_mask=encoded_input['attention_mask'],
                max_length=64,
                min_length=1,
                decode_strategy='sampling',
                top_k=5,
                num_return_sequences=20)

print(ids)
print(scores)

Tensor(shape=[20, 18], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
       [[6   , 763 , 1164, 7   , 3   , 67  , 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1585, 7   , 3   , 9   , 94  , 42  , 25375, 7   , 28  , 2   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [912 , 3   , 6   , 763 , 14381, 26028, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1850, 26028, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1850, 26028, 3   , 6   , 87  , 37  , 713 , 3   , 10  , 11  , 25620, 4355, 2   , 0   , 0   , 0   ],
        [6   , 28  , 3   , 763 , 1164, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1164, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 215 , 449 , 26028, 7   , 3   , 9   , 94  , 16  , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1585, 7   , 3   , 9   , 763 , 42  , 25375, 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1585, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1164, 26028, 7   , 3   , 9   , 94  , 16  , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 215 , 1850, 26203, 26028, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [912 , 28  , 3   , 6   , 763 , 1164, 7   , 3   , 6   , 87  , 10  , 11  , 25354, 1243, 94  , 2   , 0   , 0   ],
        [6   , 763 , 1850, 7   , 3   , 215 , 878 , 7   , 3   , 6   , 87  , 30  , 11  , 25408, 769 , 2347, 94  , 2   ],
        [6   , 763 , 1164, 7   , 3   , 9   , 94  , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [912 , 28  , 3   , 6   , 763 , 1585, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1164, 26028, 7   , 3   , 9   , 42  , 25375, 7   , 16  , 2   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 28  , 3   , 6   , 763 , 1164, 7   , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ],
        [912 , 6   , 215 , 2697, 26028, 7   , 9   , 42  , 25375, 7   , 28  , 16  , 2   , 0   , 0   , 0   , 0   , 0   ],
        [6   , 763 , 1850, 7   , 3   , 9   , 94  , 16  , 2   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   , 0   ]])
Tensor(shape=[20, 1], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[-1.24522018],
        [-1.14729643],
        [-0.83712775],
        [-0.94538420],
        [-1.66557407],
        [-1.62348187],
        [-1.07532704],
        [-0.75960904],
        [-1.13562894],
        [-1.15903497],
        [-0.75655520],
        [-1.04702473],
        [-1.38941920],
        [-1.43422914],
        [-0.93517542],
        [-0.79756647],
        [-0.73397017],
        [-1.36063826],
        [-1.57881129],
        [-0.89794350]])

from utils import select_response

# Pick the best reply simply by its probability score
result = select_response(ids, scores, tokenizer, keep_space=False, num_return_sequences=20)
print(result)
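
The source of `select_response` is not shown here, and the real helper may do more than this. As a rough sketch of the idea of choosing by score, the hypothetical `pick_best` function below takes the candidate with the highest score and cuts it at the first end-of-sequence id. It assumes that id 2 marks the end of a reply and id 0 is padding, which is inferred from the printed tensor above rather than from documentation. The toy data reuses three rows of that output, truncated to 12 columns.

```python
import numpy as np

def pick_best(ids, scores, eos_id=2):
    """Return the index and token ids of the highest-scoring candidate."""
    best = int(np.argmax(np.asarray(scores).reshape(-1)))
    tokens = [int(t) for t in np.asarray(ids)[best]]
    if eos_id in tokens:
        tokens = tokens[:tokens.index(eos_id)]  # drop EOS and the padding after it
    return best, tokens

# Three candidates copied from the generate() output above.
toy_ids = [
    [6, 763, 1164, 7, 3, 67, 7, 2, 0, 0, 0, 0],
    [6, 763, 1164, 26028, 7, 3, 9, 42, 25375, 7, 16, 2],
    [6, 763, 1164, 26028, 7, 3, 9, 94, 16, 2, 0, 0],
]
toy_scores = [[-1.24522018], [-0.73397017], [-0.75655520]]

best_index, best_tokens = pick_best(toy_ids, toy_scores)
print(best_index, best_tokens)
```

The trimmed ids would then be converted back to text with the tokenizer before being sent as the reply.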