A First Look at Knowledge Base Question Answering (KBQA)

A KBQA system built on the Paddle framework that uses a knowledge graph to answer simple single-hop questions about the real world.

AI Studio · Published 2021-12-26 18:37:19

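1. Project Overview

1.1 What is KBQA?

This project implements a simple knowledge-base question answering (KBQA) system. Given an existing knowledge graph, the system parses the meaning of a simple question, retrieves the relevant knowledge automatically, and returns the answer.

A traditional search-engine-based QA system can only return a set of time-sensitive web pages, presented as a collection of documents; the user still has to read and analyze them to find the answer. A knowledge-graph-based QA system instead locates the exact answer in the graph and returns it directly, which meets the user's need for precise information and supports personalized knowledge services.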

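1.2 Method Overview

This project splits KBQA into four core algorithm modules:

Topic word recognition: identify the topic entity that the asker cares about in the question
Candidate triple retrieval: use an index to return the triples related to the topic entity
Candidate triple classification: run binary classification over the retrieved candidates to filter out the many triples that do not match the intent of the question
Candidate answer ranking: rank the remaining triples by their semantic similarity to the question and pick the triple that best answers it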

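1.3 Reading Guide

You can run this Notebook step by step in order.

Alternatively, after running the first 3 sections you can jump straight to Section 8 and run the complete KBQA prediction pipeline.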

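2. Environment Setup

[Important] This project uses about 10 GB of memory. To make sure it runs correctly, use the Advanced (高级版) or Premium (至尊版) GPU environment on AI Studio.

Taking the AI Studio Advanced environment as an example, the configuration is:

CPU: 2
RAM: 16 GB
GPU: Tesla V100, 16 GB
Python version: Python 3.7
Framework version: PaddlePaddle 2.2.1

In addition, run the following commands to install the gensim library, used to load word2vec, and the python_Levenshtein library, used to compute edit distance.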

! pip install gensim==3.8.1
! pip install python_Levenshtein==0.12.2

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting gensim==3.8.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/93/c6011037f24e3106d13f3be55297bf84ece2bf15b278cc4776339dc52db5/gensim-3.8.1-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)
Requirement already satisfied: numpy>=1.11.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.20.3)
Collecting smart-open>=1.8.1 (from gensim==3.8.1)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cd/11/05f68ea934c24ade38e95ac30a38407767787c4e3db1776eae4886ad8c95/smart_open-5.2.1-py3-none-any.whl (58kB)
Requirement already satisfied: scipy>=0.18.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.6.3)
Requirement already satisfied: six>=1.5.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from gensim==3.8.1) (1.15.0)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.1 smart-open-5.2.1
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting python_Levenshtein==0.12.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/dc/97f2b63ef0fa1fd78dcb7195aca577804f6b2b51e712516cc0e902a9a201/python-Levenshtein-0.12.2.tar.gz (50kB)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python_Levenshtein==0.12.2) (56.2.0)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=171661 sha256=eaed10e8e8ab0610268a8ef9dd73e762c77c734f12c051235404710bdda40b36
  Stored in directory: /home/aistudio/.cache/pip/wheels/7b/43/95/25e2d396067496519edc4426d846bf3905f53c24e4e42b0e71
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2

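3. Data Loading

3.1 About the Knowledge Base Dataset

The QA dataset and the knowledge base used in this project both come from Task 7 of the NLPCC2018 competition, Open Domain Question Answering knowledge-based question (KBQA). The knowledge base consists of two files:

A Chinese encyclopedia knowledge base file with about 20 million entities and 60 million triples, 3.37 GB in size
A mention-to-entity mapping table, which maps common real-world mentions to the entities in this knowledge base

All of this data is public. You can download the original dataset from the competition website, or get a lightly preprocessed version from the dataset the author uploaded to AI Studio.

3.2 Loading the Knowledge Base

This project mounts the preprocessed knowledge base dataset, so it can be read straight from file. The mention-to-entity table is loaded into memory, while the knowledge base file itself is only opened as a binary file handle and read on demand.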

from work.TopicWordRecognization.run_ner import predict as ner_predict
from work.CandidateTriplesSelection.run_cls import predict as cls_predict
from work.CandidateTriplesLookup.knowledge_retrieval import entity_linking, search_triples_by_index
from work.AnswerRanking.ranking import span_question, score_similarity
from work.config import KGConfig, CLSConfig, NERConfig
import jieba
import gensim
import datetime
import json
import re
from functools import partial
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer, ErnieModel
from paddlenlp.data import Stack, Pad, Tuple

KGconfig = KGConfig()
mention2entity_clean_path = KGconfig.mention2entity_clean_path
knowledge_graph_path = KGconfig.knowledge_graph_path

print('正在加载mention2entity表', datetime.datetime.now())
with open(mention2entity_clean_path, 'r', encoding='utf-8') as f:
    mention2entity_dict = json.loads(f.read())

print('正在加载知识库', datetime.datetime.now())
forward_KG_f = open(knowledge_graph_path, 'rb')
print('知识库加载完毕', datetime.datetime.now())

/opt/conda/envs/python35-paddle120-env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)


正在加载mention2entity表 2021-12-22 17:30:39.294507
正在加载知识库 2021-12-22 17:31:10.639150
知识库加载完毕 2021-12-22 17:31:10.641600

# Print a few entries of the mention2entity mapping table
for idx, (key, value) in enumerate(list(mention2entity_dict.items())[80:90]):
    print(key, '-->', value)

揠苗助长 --> ['揠苗助长(米切尔胡尔维茨导演美国电影)', '揠苗助长(成语)', '揠苗助长']
胸神经前支 --> ['胸神经前支']
anteriorbranchofthoracicnerves --> ['胸神经前支']
闭环 --> ['闭环']
htcmytouch4gslide --> ['HTC myTouch 4G Slide']
三级甲等医院 --> ['三级甲等医院']
不列颠尼亚(罗马帝国行省) --> ['不列颠尼亚(罗马帝国行省)']
不列颠尼亚 --> ['不列颠尼亚(罗马帝国行省)', '不列颠尼亚(日本动漫反叛的鲁路修中虚拟帝国)', '不列颠尼亚(英女神)', '不列颠尼亚(罗马行省)']
britannia --> ['不列颠尼亚(罗马帝国行省)', '《大不列颠》', '不列颠尼亚(英女神)']
白领 --> ['白领', '白领一族', '白领(工作人员的一种通称)', '白领(醉卧美人膝创造的网络小说)', '白领(1962年的韩国电影)']

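3.3 Building and Loading the Knowledge Base Index

In the candidate triple retrieval stage, the system looks up triples in the knowledge base by the topic entity of the question. Scanning the whole knowledge base for every query would be very expensive, so we build an index over the knowledge base file, which cuts the lookup time dramatically.

The knowledge base file mounted in this project has already been sorted by head entity. Given a query entity, the index must locate that entity's position in the knowledge base quickly and return all triples that have it as head entity.

In the implementation, we open the knowledge base file as a byte stream, use Python's tell() method to find where each entity starts in the file, and record the total length in bytes of all triples that have it as head entity.

The finished index is a hash-table structure (a Python dict), which we save to the specified path.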

def make_KG_index(knowledge_graph_path, forward_index_path):
    """
    Read the KG file and build a one-way index keyed by head entity.
    Index format: {entity: {'start_pos': int, 'length': int}, ...}
    Reading the KG through the index:
    with open(knowledge_graph_path, 'rb') as f:
        f.seek(223)
        readresult = f.read(448).decode('utf-8')
    """
    def make_index(graph_path, index_path):
        print('begin to read KG', datetime.datetime.now())
        index_dict = dict()
        # Binary mode, so that tell() returns true byte offsets usable with seek()/read()
        with open(graph_path, 'rb') as f:
            previous_entity = ''
            previous_start = 0
            while True:
                start_pos = f.tell()
                line = f.readline()
                if not line:
                    break
                entity = line.decode('utf-8').split(' ||| ')[0]
                if entity != previous_entity and previous_entity:
                    tmp_dict = dict()
                    tmp_dict['start_pos'] = previous_start
                    tmp_dict['length'] = start_pos - previous_start
                    index_dict[previous_entity] = tmp_dict
                    previous_start = start_pos
                previous_entity = entity
            # The loop only writes an entity when the next one starts, so add the last entity here
            if previous_entity:
                index_dict[previous_entity] = {'start_pos': previous_start,
                                               'length': start_pos - previous_start}
        print('finish reading KG, begin to write', datetime.datetime.now())
        with open(index_path, 'w', encoding='utf-8') as f:
            f.write(json.dumps(index_dict, ensure_ascii=False))
        print('finish writing', datetime.datetime.now())
    make_index(knowledge_graph_path, forward_index_path)
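
To make the byte-offset idea concrete, here is a minimal, self-contained sketch on a three-line toy file. The helper names `build_offset_index` and `read_triples` and the toy triples are hypothetical and only serve the illustration; the real knowledge base is far larger, but the seek/read pattern is the same one used with the index in this section.

```python
import os
import tempfile


def build_offset_index(graph_path):
    """Map each head entity to the byte range of its consecutive lines."""
    index = {}
    with open(graph_path, 'rb') as f:  # binary mode: tell() is a byte offset
        prev_entity, prev_start = None, 0
        while True:
            pos = f.tell()
            line = f.readline()
            entity = line.decode('utf-8').split(' ||| ')[0] if line else None
            # A new head entity (or end of file) closes the previous entity's range
            if prev_entity is not None and entity != prev_entity:
                index[prev_entity] = {'start_pos': prev_start, 'length': pos - prev_start}
                prev_start = pos
            if not line:
                break
            prev_entity = entity
    return index


def read_triples(graph_path, index, entity):
    """Seek to the entity's byte range and parse its triples."""
    entry = index.get(entity)
    if entry is None:
        return []
    with open(graph_path, 'rb') as f:
        f.seek(entry['start_pos'])
        block = f.read(entry['length']).decode('utf-8')
    return [line.split(' ||| ') for line in block.strip().split('\n')]


def demo():
    # Toy knowledge base, already sorted by head entity (hypothetical data)
    toy_lines = [
        '姚明 ||| 别名 ||| 姚明',
        '姚明 ||| 职业 ||| 篮球运动员',
        '雨果 ||| 国籍 ||| 法国',
    ]
    with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
        tmp.write(('\n'.join(toy_lines) + '\n').encode('utf-8'))
        path = tmp.name
    try:
        index = build_offset_index(path)
        return index, read_triples(path, index, '姚明'), read_triples(path, index, '雨果')
    finally:
        os.remove(path)


index, yao_triples, hugo_triples = demo()
print(index)
print(hugo_triples)
```

Because the offsets are counted in bytes, a lookup reads only the few bytes that belong to one entity instead of scanning the whole file. This only works if the file is opened in binary mode both when the index is built and when it is used.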

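In this project the index for the knowledge base file has been built in advance and mounted with the project's dataset, so it can be loaded and used directly.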


print('正在加载索引表', datetime.datetime.now())
forward_index_path = KGconfig.forward_index_path
with open(forward_index_path, 'r', encoding='utf-8') as f:
    forward_index = json.loads(f.read())
print('索引表加载完毕', datetime.datetime.now())

# Look up an entity in the index, read its triples from the knowledge base, and print the first 20 characters
entity = '姚明'
read_index, read_size = forward_index[entity]['start_pos'], forward_index[entity]['length']
print(read_index, read_size)
forward_KG_f.seek(read_index)
readresult = forward_KG_f.read(read_size).decode('utf-8')
print(readresult[:20])

正在加载索引表 2021-12-22 17:31:31.591060
索引表加载完毕 2021-12-22 17:31:49.429286
2231973201 2598
姚明 ||| 别名 ||| 姚明
姚明

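3.4 Loading the word2vec Model

Besides the knowledge base dataset above, the answer ranking module needs pretrained word2vec vectors, which can be preloaded before the main program runs.
The word2vec model used here comes from the open-source GitHub repository word2vec. We chose the 300-dimensional vectors trained on a Baidu Baike corpus with Chinese characters plus words as context features.

The vector file has been renamed to sgns.target.word-character and mounted under the project's data/data122049 directory. The code below loads it.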

from work.config import Word2VecConfig
from gensim.models import KeyedVectors

def load_word2vec():
    word2vec_model_path = Word2VecConfig().model_path  # 词向量文件的位置

    print('正在预加载word2vec词向量，预计2min', datetime.datetime.now())
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False, unicode_errors='ignore')
    print('word2vec词向量加载完毕', datetime.datetime.now())
    return word2vec_model
word2vec_model = load_word2vec()

正在预加载word2vec词向量，预计2min 2021-12-22 17:31:53.311305
word2vec词向量加载完毕 2021-12-22 17:33:19.258214

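4. Topic Word Recognition

Given a question, the system must determine what the question is fundamentally asking about. The topic word is the core object of the question; it lets us find the corresponding entities in the knowledge graph and, from there, the answer. For example, in the question “微软的创始人是谁？” (Who is the founder of Microsoft?), the topic word is “微软” (Microsoft).

4.1 Model Architecture

The topic word recognition module is an entity recognition model built on Baidu's pretrained ERNIE model. The question is encoded with ERNIE and a BIO label is predicted for each token: B marks the first character of a topic word, I marks a subsequent character inside a topic word, and O marks a token that belongs to no topic word. For each token that represents a text character, the model takes the output of ERNIE's last hidden layer, passes it through a fully connected layer, and projects it to a three-class output layer.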

import paddle
from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel


class ErnieNER(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieNER, self).__init__()
        self.label_num = label_dim

        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else
                                  self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self,
                words_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None,
                history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)

        sequence_output = nn.functional.relu(self.hidden(self.dropout(sequence_output)))

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        return logits
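
As a small illustration of the BIO scheme described in this section, the sketch below turns a predicted label sequence back into topic-word strings. The function name `decode_bio` is hypothetical and is not taken from the project's run_ner.py; it only shows how B/I/O tags map to spans.

```python
def decode_bio(tokens, labels):
    """Collect each run tagged B, I, I, ... into one topic-word string."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == 'B':
            # A new topic word starts; close the one in progress, if any
            if current:
                spans.append(''.join(current))
            current = [token]
        elif label == 'I' and current:
            current.append(token)
        else:
            # 'O', or an 'I' with no preceding 'B', ends the current span
            if current:
                spans.append(''.join(current))
            current = []
    if current:
        spans.append(''.join(current))
    return spans


tokens = list('微软的创始人是谁？')
labels = ['B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
print(decode_bio(tokens, labels))  # ['微软']
```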

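4.2 Building Numeric Features

The training data is topic-word NER data obtained by annotating the NLPCC2018 KBQA question answering dataset.

For each question, the topic word is marked with BIO labels. We store the data as (token, label) pairs, split into a training set and a validation set and saved to files; the prepared data is in the project's work/TopicWordRecognization/data directory.

Before training, the text data is read from file and each sentence is converted by the ERNIE tokenizer into the numeric features the model takes as input, with the pretrained model's special tokens added. The label sequence of each sentence is converted into numeric ids as well, and padded at the positions of the special tokens.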

def read(data_path):
    all_sample_words, all_sample_labels = [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        tmp_sample_words, tmp_sample_labels = [], []
        for line in f:
            if not line.strip():
                # A blank line closes the current sample
                if tmp_sample_words and tmp_sample_labels:
                    all_sample_words.append(tmp_sample_words)
                    all_sample_labels.append(tmp_sample_labels)
                    tmp_sample_words, tmp_sample_labels = [], []
                continue
            fields = line.strip().split(' ')
            tmp_sample_words.append(fields[0])
            tmp_sample_labels.append(fields[1])
        # Keep the last sample when the file does not end with a blank line
        if tmp_sample_words and tmp_sample_labels:
            all_sample_words.append(tmp_sample_words)
            all_sample_labels.append(tmp_sample_labels)
    for idx in range(len(all_sample_words)):
        yield {"words": all_sample_words[idx], "labels": all_sample_labels[idx]}


def convert_example_to_feature(example, tokenizer, label2id, pad_default_tag="O", max_seq_len=512):
    features = tokenizer(example["words"], is_split_into_words=True, max_seq_len=max_seq_len)
    label_ids = [label2id[label] for label in example["labels"][:max_seq_len-2]]
    label_ids = [label2id[pad_default_tag]] + label_ids + [label2id[pad_default_tag]]
    assert len(features["input_ids"]) == len(label_ids)
    return features["input_ids"], features["token_type_ids"], label_ids
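
The label alignment step can be tried without a tokenizer. The sketch below mirrors the truncate-then-pad logic on a toy label sequence; the `label2id` mapping shown here is an assumption for illustration, and the project's actual mapping may differ.

```python
def align_labels(labels, label2id, max_seq_len, pad_tag='O'):
    """Truncate labels to leave room for [CLS] and [SEP], then pad both ends."""
    body = [label2id[tag] for tag in labels[:max_seq_len - 2]]
    return [label2id[pad_tag]] + body + [label2id[pad_tag]]


label2id = {'O': 0, 'B': 1, 'I': 2}  # hypothetical mapping

# No truncation: one extra id on each side for [CLS] and [SEP]
print(align_labels(['B', 'I', 'O', 'O'], label2id, max_seq_len=512))  # [0, 1, 2, 0, 0, 0]

# With max_seq_len=5 only 3 real labels fit between the two special tokens
print(align_labels(['B', 'I', 'O', 'O'], label2id, max_seq_len=5))    # [0, 1, 2, 0, 0]
```

The resulting list has the same length as the tokenizer's input_ids as long as each element of the word list maps to exactly one token, which is what the assert in convert_example_to_feature checks.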

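4.3 Model Training

Only the core training code is shown here. To reproduce the full training run, run run_ner.py in the work/TopicWordRecognization directory.

The trained model is already mounted at ~/data/data122049/ernie_ner_best_pdparams and can be used directly for prediction in the KBQA pipeline later on.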

def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # 文件->example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)

    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id,
                         pad_default_tag="O", max_seq_len=max_seq_len)

    train_ds = train_ds.map(trans_func, lazy=False)  # example->feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Pad(axis=0, pad_val=label2id["O"], dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn,
                                        return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn,
                                      return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieNER(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)

    loss_model = paddle.nn.CrossEntropyLoss()
    ner_metric = SeqEntityScore(id2label)

    global_step, ner_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)

            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(
                    f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:

                ner_results = evaluate(model, dev_loader, ner_metric)
                ner_result = ner_results["Total"]
                model.train()
                ner_f1 = ner_result["F1"]
                # if ner_f1 > ner_best_f1:
                #     paddle.save(model.state_dict(), f"{save_path}/ernie_ner_best.pdparams")
                if ner_f1 > ner_best_f1:
                    print(f"\nner best F1 performance has been updated: {ner_best_f1:.5f} --> {ner_f1:.5f}")
                    ner_best_f1 = ner_f1
                print(
                    f'\nner evaluation result: precision: {ner_result["Precision"]:.5f}, recall: {ner_result["Recall"]:.5f},  F1: {ner_result["F1"]:.5f}, current best {ner_best_f1:.5f}\n')

            global_step += 1

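5. Candidate Triple Retrieval

5.1 Entity Linking with Fuzzy Matching

Once the topic word of the question is known, it has to be mapped to the related entity nodes in the knowledge graph. In most cases it maps to several entities, which we call candidate entities. For example, after recognizing the topic word “马云” in a question, we need to link it to related entities such as “马云-阿里巴巴创始人” and “马云-云南民族大学副教授”.

The knowledge base dataset provides the mention2entity mapping table, but because natural language is so varied, the topic word in a question does not always match a mention in the table exactly. To improve the accuracy of entity lookup, we add the following rule-based and edit-distance-based fuzzy matching on top of exact matching.

For a topic word produced by entity recognition, we first try exact matching and look for an identical mention in the mapping table. If exact matching finds nothing, we go through every mention in the table, using regular expressions and similar techniques to bring the topic word and each mention into the same character format. Instead of exact matching, we compute the edit distance between each mention and the topic word and record the result; the smaller the edit distance, the closer the mention is to the topic word. If a mention is a compound (it contains a comma, an enumeration comma “、”, the character “或”, or another possible separator), it is first split on the separators into several mentions, each of which is compared with the topic word. Finally, over all recorded results, the mentions with the smallest edit distance are returned as the lookup result.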

import Levenshtein
import re
import unicodedata


def entity_linking(mention2entity_dict, input_mention):
    """
    Given the NER result input_mention of a question, find the mentions in mention2entity_dict
    that are closest to it and return their entities. A few rules are applied to match more mentions.
    :param mention2entity_dict: dict mapping each mention to a list of entities
    :param input_mention: topic word produced by the NER model
    :return: list of candidate entities
    """
    if input_mention == 'NONE':         # nothing recognized: return an empty candidate list
        return []

    input_mention = input_mention.replace(" ", "")      # mentions in mention2entity contain no spaces, so strip them here too
    # Copy the list so that the append at the end does not modify the mapping table itself
    relative_entities = list(mention2entity_dict.get(input_mention, []))    # try an exact lookup first
    if not relative_entities:                                   # exact lookup failed, fall back to fuzzy matching
        # mention -> smallest edit distance to input_mention; fuzzy matching has to scan the whole table
        fuzzy_query_relative_entities = dict()
        input_mention = unify_char_format(input_mention)
        # Longer topic words tolerate a larger edit distance
        if len(input_mention) >= 20:
            max_distance = 10
        elif len(input_mention) >= 6:
            max_distance = 4
        else:
            max_distance = 1

        for prim_mention in mention2entity_dict.keys():
            # Normalize the character format first
            mention_key = unify_char_format(prim_mention)
            if len(mention_key) == 0:
                continue
            if '\\' == mention_key[-1]:
                mention_key = mention_key[:-1] + '"'

            # Compound mention: split it and compare every part
            if ',' in mention_key or '、' in mention_key or '\\\\' in mention_key or ';' in mention_key or ('或' in mention_key and '或' not in input_mention):
                mention_splits = re.split(r'[,;、或]|\\\\', mention_key)
            else:
                mention_splits = [mention_key]

            for _mention in mention_splits:
                distance = Levenshtein.distance(input_mention, _mention)
                if distance <= max_distance:
                    # Keep the smallest distance over all parts of this mention
                    previous = fuzzy_query_relative_entities.get(prim_mention, distance)
                    fuzzy_query_relative_entities[prim_mention] = min(previous, distance)

        if fuzzy_query_relative_entities:               # fuzzy matching found something
            min_similar_score = min(fuzzy_query_relative_entities.values())
            for prim_mention, distance in fuzzy_query_relative_entities.items():
                if distance == min_similar_score:       # all mentions tied at the smallest edit distance
                    relative_entities.extend(mention2entity_dict[prim_mention])
    if input_mention not in relative_entities:          # common words missing from mention2entity are added as well
        relative_entities.append(input_mention)
    return relative_entities

def unify_char_format(string):
    """
    用于将两个字符串做对比之前，先把字符串做规范化
    :param string:
    :return:
    """
    string = unicodedata.normalize('NFKC', string)             
    string = string.replace('【', '[').replace('】', ']')      
    string = string.lower()                               
    return string

input_mention = '史蒂芬霍金'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('匹配到知识库中的候选实体：', rela_ents)

匹配到知识库中的候选实体： ['斯蒂芬·金', '史蒂芬·金', '史蒂芬·霍金', '史蒂芬霍金']
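
To see how the length-dependent threshold plays out on this example, here is a small pure-Python sketch. `edit_distance` is a textbook dynamic-programming Levenshtein distance written out for illustration (the notebook itself calls `Levenshtein.distance` from python_Levenshtein), and `max_allowed_distance` is a hypothetical helper that restates the thresholds used in entity_linking.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def max_allowed_distance(mention):
    """Thresholds as in entity_linking: longer mentions tolerate more edits."""
    if len(mention) >= 20:
        return 10
    if len(mention) >= 6:
        return 4
    return 1


topic = '史蒂芬霍金'
for mention in ['史蒂芬·霍金', '史蒂芬金', '斯蒂芬·金']:
    d = edit_distance(topic, mention)
    print(mention, d, d <= max_allowed_distance(topic))
```

With a five-character topic word the threshold is 1, so “史蒂芬·霍金” (one inserted character) and “史蒂芬金” (one deleted character) are accepted, while “斯蒂芬·金” at distance 2 is not accepted as a mention on its own.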

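5.2 Index-Based Candidate Triple Retrieval

As described in Section 3.3, the prebuilt knowledge base index returns the triples in the knowledge base that have a given entity as head entity.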

def search_triples_by_index(relative_entitis, index, raw_graph_f):
    """
    :param relative_entitis: list
    :param index: dict
    :param raw_graph: the file-pointer of the raw graph file, and the content need to be post-process
    :return: list of all the triples relative to the input_triples entitis  双重列表
    """
    relative_triples = []
    for entity in relative_entitis:
        index_entity = index.get(entity, None)
        if index_entity:
            read_index, read_size = index[entity]['start_pos'], index[entity]['length']
            raw_graph_f.seek(read_index)
            readresult = raw_graph_f.read(read_size).decode('utf-8')
            for line in readresult.strip().split('\n'):
                triple = line.strip().split(' ||| ')
                relative_triples.append(triple)
    return relative_triples

input_mention = '史蒂芬霍金'
rela_ents = entity_linking(mention2entity_dict, input_mention)
print('匹配到知识库中的候选实体：', rela_ents)
rel_triples= search_triples_by_index(rela_ents, forward_index, forward_KG_f)
print('共检索到{}条三元组'.format(len(rel_triples)))
print('打印20条以内的三元组：')
print('\n'.join(map(str, rel_triples[:20])))

匹配到知识库中的候选实体： ['斯蒂芬·金', '史蒂芬·金', '史蒂芬·霍金', '史蒂芬霍金']
共检索到68条三元组
打印20条以内的三元组：
['斯蒂芬·金', '别名', '斯蒂芬·金']
['斯蒂芬·金', '中文名', '史蒂芬·金']
['斯蒂芬·金', '别 名', '约翰·史怀森/ 理查德.巴克曼']
['斯蒂芬·金', '出生地', '美国缅因州']
['斯蒂芬·金', '职 业', '作家']
['斯蒂芬·金', '主要成就', '文艺界富豪榜']
['斯蒂芬·金', '配 偶', '塔比莎·金/ 娜欧蜜·瑞秋·金']
['斯蒂芬·金', '儿 子', '乔·希尔·金/欧文·菲利普·金']
['斯蒂芬·金', '女 儿', '娜欧蜜·瑞秋·金']
['斯蒂芬·金', '民 族', '美利坚民族']
['斯蒂芬·金', '外文名', 'Stephen Edwin King']
['斯蒂芬·金', '国 籍', '美国']
['斯蒂芬·金', '出生日期', '1947年']
['斯蒂芬·金', '毕业院校', '缅因大学英文系']
['斯蒂芬·金', '代表作品', '《肖申克的救赎》']
['斯蒂芬·金', '子 女', '乔·希尔·金/欧文·菲利普·金']
['斯蒂芬·金', '斯蒂芬·金。2007年2月', '斯蒂芬·金。2007年2月']
['斯蒂芬·金', '笔名', 'Richard Bachman John Swithen']
['斯蒂芬·金', '出生', '1947年9月21日（67岁） 美国缅因州波特兰']
['斯蒂芬·金', '职业', '作家']

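6. Candidate Triple Classification

6.1 Model Architecture

The triple classification module is a sentence-pair classification model built on Baidu's pretrained ERNIE model. Each triple is paired with the question and encoded with ERNIE; the hidden output vector of the [CLS] token goes through a fully connected layer and is projected to a binary output layer.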

from paddle import nn
from paddlenlp.transformers import ErniePretrainedModel


class ErnieCLS(ErniePretrainedModel):
    def __init__(self, ernie, label_dim, dropout=None):
        super(ErnieCLS, self).__init__()
        self.label_num = label_dim

        self.ernie = ernie  # allow ernie to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else
                                  self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], self.label_num)
        self.hidden = nn.Linear(self.ernie.config['hidden_size'], self.ernie.config['hidden_size'])

    def forward(self,
                words_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None,
                history_ids=None):
        sequence_output, pooled_output = self.ernie(
            words_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)

        pooled_output = nn.functional.relu(self.hidden(self.dropout(pooled_output)))

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits

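6.2 Building Numeric Features

We use the question as sentence A and the concatenation of a candidate triple's head entity and relation name as sentence B. The label of the pair (A, B) is positive if that triple is the annotated answer to the question and negative otherwise. This gives the training data for the candidate triple classification model.
The prepared data is in the project's work/CandidateTriplesSelection/data directory.

In the feature construction stage, the ERNIE tokenizer can encode the sentence pair directly and adds the special tokens automatically.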

def read(data_path):
    all_sample_text1, all_sample_text2, all_sample_labels = [], [], []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            text1, text2, label = line.strip().split('\t')
            all_sample_text1.append(text1)
            all_sample_text2.append(text2)
            all_sample_labels.append(label)
    for idx in range(len(all_sample_labels)):
        yield {"text1": all_sample_text1[idx], "text2": all_sample_text2[idx], "label": all_sample_labels[idx]}


def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512):
    features = tokenizer(example["text1"], example["text2"], max_seq_len=max_seq_len)
    label_ids = label2id[example["label"]]
    return features["input_ids"], features["token_type_ids"], label_ids
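
The sentence-pair construction described in this section can be sketched as follows. `build_pairs` is a hypothetical helper, the question is made up for the illustration, and using '0' for the negative label is an assumption (the pipeline only shows that '1' marks a positive). Each row matches the tab-separated text1, text2, label layout that read() expects.

```python
def build_pairs(question, candidate_triples, gold_triple):
    """Return (sentence A, sentence B, label) rows for one question."""
    gold = tuple(gold_triple)
    rows = []
    for head, relation, tail in candidate_triples:
        # Positive only when the triple is the annotated answer triple
        label = '1' if (head, relation, tail) == gold else '0'
        # Sentence B is the head entity concatenated with the relation name
        rows.append((question, head + relation, label))
    return rows


question = '斯蒂芬·金的出生地是哪里？'  # hypothetical question
candidates = [
    ['斯蒂芬·金', '出生地', '美国缅因州'],
    ['斯蒂芬·金', '职业', '作家'],
]
rows = build_pairs(question, candidates, gold_triple=candidates[0])
for row in rows:
    print('\t'.join(row))
```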

6.3 Model Training

As in Section 4.3, only the core training code is shown here. To reproduce the full training run, run run_cls.py in the work/CandidateTriplesSelection directory.

def train():
    train_ds = load_dataset(read, data_path=train_path, lazy=False)  # 文件->example
    dev_ds = load_dataset(read, data_path=dev_path, lazy=False)

    tokenizer = ErnieTokenizer.from_pretrained(model_name)
    trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=max_seq_len)

    train_ds = train_ds.map(trans_func, lazy=False)  # example->feature
    dev_ds = dev_ds.map(trans_func, lazy=False)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        Stack(axis=0, dtype='int64'),
    ): fn(samples)

    train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=batch_size, shuffle=True)
    dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False)
    train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn,
                                        return_list=True)
    dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn,
                                      return_list=True)

    ernie = ErnieModel.from_pretrained(model_name)
    model = ErnieCLS(ernie, len(label2id), dropout=0.1)

    num_training_steps = len(train_loader) * num_epoch
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
    decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
    grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
    optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(),
                                       weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in decay_params,
                                       grad_clip=grad_clip)

    loss_model = paddle.nn.CrossEntropyLoss()
    cls_metric = ClassificationScore(id2label)

    global_step, cls_best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch + 1):
        for batch_data in train_loader:
            input_ids, token_type_ids, labels = batch_data
            logits = model(input_ids, token_type_ids=token_type_ids)

            loss = loss_model(logits, labels)

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(
                    f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:

                cls_results = evaluate(model, dev_loader, cls_metric)
                cls_result = cls_results["1"]
                model.train()
                cls_f1 = cls_result["F1"]
                if cls_f1 > cls_best_f1:
                    paddle.save(model.state_dict(), f"{save_path}/ernie_cls_best.pdparams")
                    print(f"\ncls best F1 performance has been updated: {cls_best_f1:.5f} --> {cls_f1:.5f}")
                    cls_best_f1 = cls_f1
                print(
                    f'\ncls evaluation result: precision: {cls_result["Precision"]:.5f}, recall: {cls_result["Recall"]:.5f},  F1: {cls_result["F1"]:.5f}, current best {cls_best_f1:.5f}\n')

            global_step += 1

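7. Candidate Answer Ranking

The ERNIE classifier from Section 6 may predict one or more triples as positive, whereas this project aims to return exactly one correct answer. To improve the accuracy of that single answer, the candidates have to be ranked. The ranking criterion is the similarity between the attribute word asked about in the question and the relation name of each candidate triple.

7.1 Extracting Attribute Information from the Question and the Triples

The question attribute word states which attribute of the topic word the question asks about. For example, in the question “襄城公主的老公是谁？” (Who is the husband of Princess Xiangcheng?), the topic word is “襄城公主” and the attribute asked about is “老公” (husband).

The question attribute word can be extracted with a rule-based method, as follows.

Remove the topic word found during entity recognition. For the question “襄城公主的老公是谁？”, this removes the topic word “襄城公主”.
Remove stop words, interrogative words, and punctuation from the question; interrogative words include “哪个”, “多少”, “怎么”, and so on. Common question openers such as “我想知道”, “请问你”, and “我很好奇” are removed as well.

The relation name of a triple, also called its attribute name, expresses how two entities in the knowledge base are related. In the triple “小川原湖-周长-67.4km”, for example, “周长” (perimeter) is the relation name.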

def span_question(question, ner_result):
    """
    Used in answer ranking: strip everything irrelevant to ranking from the question, such as the topic word and interrogative words
    """
    question = question.replace(ner_result, '').replace('《', '').replace('》', '')
    # Longer phrases come before their substrings (e.g. '被谁' before '谁') so that they are removed whole
    for delete_word in ['我想知道','我想请问','请问你','请问','你知道','谁知道','知道','谁清楚','我很好奇','你帮我问问','有没有人看过','有没有人',
                        '怎么','这个','有多少个','有哪些','哪些','哪个','多少','几个','被谁','谁','还有',
                        '吗','呀','啊','吧','着','的','是','呢','了','？','?','什么']:
        question = question.replace(delete_word, '')

    return question

span_res = span_question('襄城公主的老公是谁？', '襄城公主')
print(span_res)

老公

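7.2 Similarity between the Question Attribute and the Triple Relation Name

With the question attribute word and the triple relation name in hand, we compute their similarity. Specifically, we compute the character-level Jaccard similarity and the word2vec similarity of the two strings and add them up as the overall similarity score.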

def score_similarity(word2vec_model, string1, string2):
    """
    Compare two strings by combining character overlap and word2vec similarity;
    used in answer ranking to compare the question with a triple's relation name
    :return: similarity score
    """
    return char_overlap(string1, string2) + word2vec_sim(word2vec_model, string1, string2)


def char_overlap(string1, string2):
    # Character-level Jaccard similarity
    char_intersection = set(string1) & set(string2)
    char_union = set(string1) | set(string2)
    if not char_union:      # both strings are empty
        return 0
    return len(char_intersection) / len(char_union)


def word2vec_sim(word2vec_model, string1, string2):
    # Per its source code, n_similarity averages each group of word vectors, L2-normalizes the two means, and takes their inner product
    words1 = jieba.cut(string1)
    words2 = jieba.cut(string2)

    de_seg1 = []
    de_seg2 = []
    for seg in words1:
        if seg not in word2vec_model.vocab:
            # Out-of-vocabulary word: fall back to its in-vocabulary characters
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg1.extend(_ws)
        else:
            de_seg1.append(seg)
    for seg in words2:
        if seg not in word2vec_model.vocab:
            _ws = [_w for _w in seg if _w in word2vec_model.vocab]
            de_seg2.extend(_ws)
        else:
            de_seg2.append(seg)
    if de_seg1 and de_seg2:
        score = word2vec_model.n_similarity(de_seg1, de_seg2)
    else:
        score = 0
    return score
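
The combined score can be worked through by hand on toy data. The sketch below is a dependency-free approximation: the two-dimensional vectors are made up for illustration, the strings are passed in already tokenized instead of going through jieba, and `similarity_score` is a hypothetical name. The cosine of the mean vectors stands in for gensim's `n_similarity`.

```python
import math


def jaccard(s1, s2):
    """Character-level Jaccard similarity of two strings."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def mean_vector(words, vectors):
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_score(s1, s2, tokens1, tokens2, vectors):
    """Jaccard overlap plus cosine similarity of the averaged word vectors."""
    known1 = [w for w in tokens1 if w in vectors]
    known2 = [w for w in tokens2 if w in vectors]
    w2v = cosine(mean_vector(known1, vectors), mean_vector(known2, vectors)) if known1 and known2 else 0.0
    return jaccard(s1, s2) + w2v


# Toy 2-dimensional vectors (made up for illustration)
toy_vectors = {'老公': [1.0, 0.0], '丈夫': [0.8, 0.6], '周长': [0.0, 1.0]}

print(jaccard('出生日期', '出生地'))  # 2 shared characters out of 5 distinct ones
print(similarity_score('老公', '丈夫', ['老公'], ['丈夫'], toy_vectors))
print(similarity_score('老公', '周长', ['老公'], ['周长'], toy_vectors))
```

The two terms complement each other: “老公” and “丈夫” share no character, so the Jaccard term is 0 and only the vector term can signal that they are synonyms, while near-identical relation names that differ in spacing or one character score high on the Jaccard term even when a word is out of vocabulary.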

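8. The Complete KBQA Pipeline

The complete KBQA pipeline works as follows.

A question entered on the console first goes through the topic word recognition model, which predicts what the question is about. If the model predicts nothing, a rule is tried to extract the topic word. The topic word is printed to the console.
The entity linking module maps the topic word to a set of candidate entities, which are printed to the console.
Using the prebuilt index, all triples with a candidate entity as head entity are retrieved from the knowledge base; these are the candidate triples. Because there can be many of them, only the first 20 are printed to the console.
The coarse triple classification model makes a binary prediction for every candidate triple, and only those predicted positive are kept.
For the kept triples, the answer ranking module compares each relation with the question, scores each triple, and sorts them.
The highest-scoring triple from the previous step is taken as the best triple, and its tail entity is returned as the best answer.

Notes:

Once the libraries and data from the first 3 sections of this Notebook have been loaded, the pipeline code below can be run directly. Enter a simple question in the input box and observe the answer the KBQA system predicts along with the intermediate result of each step.
The fine-tuned models of this KBQA system are all mounted under data/data122049, and the pipeline code below loads them and predicts directly. The code in Section 4 and later sections of this Notebook therefore does not need to be executed.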

def pipeline_predict(question):
    ner_results = ner_predict(NERConfig().best_model_path, question)
    ner_results = set([_result.replace("《", "").replace("》", "") for _result in ner_results])
    # ner_results is a set with zero, one, or more elements. If it is empty, try a rule to extract the topic word
    if not ner_results:
        # Rule-based fallback: take the text before "的...是" as the topic word
        rule_match = re.search(r'(.*)的.*是.*', question)
        if rule_match:
            ner_results = {rule_match.group(1).replace("《", "").replace("》", "")}
        else:
            print('没有提取出主题词！')
            return

    print('■识别到的主题词：', ner_results, datetime.datetime.now())

    candidate_entities = []
    for mention in ner_results:
        candidate_entities.extend(entity_linking(mention2entity_dict, mention))
    print('■找到的候选实体：', candidate_entities, datetime.datetime.now())

    forward_candidate_triples = search_triples_by_index(candidate_entities, forward_index, forward_KG_f)
    candidate_triples = forward_candidate_triples
    candidate_triples = list(filter(lambda x: len(x) == 3, candidate_triples))
    candidate_triples_num = len(candidate_triples)
    print('■候选三元组共{}条'.format(candidate_triples_num), datetime.datetime.now())
    show_num = min(20, candidate_triples_num)
    print('■展示前{}条候选三元组：{}'.format(show_num, candidate_triples[:show_num]))

    candidate_triples_labels = cls_predict(CLSConfig().best_model_path, [question]*len(candidate_triples), [triple[0]+triple[1] for triple in candidate_triples])
    predict_triples = [candidate_triples[i] for i in range(len(candidate_triples)) if candidate_triples_labels[i] == '1']
    print('■三元组粗分类结果，保留以下三元组：', predict_triples)

    predict_answers = [_triple[2] for _triple in predict_triples]
    if len(predict_answers) == 0:
        print('■知识库中没有检索到相关知识，请换一个问题试试......')
        return
    elif len(set(predict_answers)) == 1:  # Only one distinct answer, although several triples may supply it
        print('■预测答案唯一，直接输出......')
        best_triple = predict_triples[0]
        best_answer = predict_answers[0]
        print('■最佳答案：', best_answer)
    else:  # Several distinct answers were predicted, so they must be ranked
        print('■检测到多个答案，正在进行答案排序......')
        max_ner = ''  # Split the question on the longest of the NER results
        for _ner in ner_results:
            if len(_ner) > len(max_ner):
                max_ner = _ner
        fine_question = span_question(question, max_ner)
        rel_scores = [score_similarity(word2vec_model, _triple[1].replace(' ', ''), fine_question) for _triple in
                      predict_triples]
        triples_with_score = list(zip(map(tuple, predict_triples), rel_scores))
        triples_with_score.sort(key=lambda x: x[1], reverse=True)
        print('■三元组排序结果：\n{}'.format("\n".join([str(pair[0]) + '-->' + str(pair[1]) for pair in triples_with_score])))
        best_answer = triples_with_score[0][0][-1]
        print('■最佳答案：', best_answer)

input_question = input('■请输入问题：')
# input_question = '哈尔滨工业大学的校长是谁？'
# input_question = '雨果是哪国人？'
# input_question = '百晓生出自于哪部作品？'
print('■已输入问句：', input_question)

pipeline_predict(input_question)

■请输入问题：■已输入问句： 雨果是哪国人？


[2021-12-22 17:33:35,001] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:35,003] [    INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 392507/392507 [00:09<00:00, 41759.84it/s]
W1222 17:33:44.524525   101 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W1222 17:33:44.528751   101 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2021-12-22 17:33:47,166] [    INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:47,783] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-12-22 17:33:47,786] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 3720.30it/s]


■识别到的主题词： {'雨果'} 2021-12-22 17:33:47.919246
■找到的候选实体： ['雨果的秘密', '雨果(2011年奥斯卡获奖电影)', '雨果(2011年马丁·斯科塞斯执导美国电影)', '雨果(2011年Martin Scorsese导演电影)', '维克多(法国作家)', '雨果(格斗游戏《街头霸王》人物)', '雨果'] 2021-12-22 17:33:47.921878
■候选三元组共109条 2021-12-22 17:33:47.925449
■展示前20条候选三元组：[['雨果的秘密', '别名', '雨果的秘密'], ['雨果的秘密', '中文名称', '雨果'], ['雨果的秘密', '外文名称', 'Hugo'], ['雨果的秘密', '其它译名', '雨果的巴黎奇幻历险'], ['雨果的秘密', '出品公司', '派拉蒙影业公司'], ['雨果的秘密', '导演', '马丁·斯科塞斯'], ['雨果的秘密', '制片成本', '1.7亿美元'], ['雨果的秘密', '拍摄日期', '2010年'], ['雨果的秘密', '片长', '126分钟'], ['雨果的秘密', '分级', 'USA：PG'], ['雨果的秘密', '色彩', '彩色'], ['雨果的秘密', '主演', '阿沙·巴特菲尔德，科洛·莫瑞兹'], ['雨果的秘密', '类型', '剧情，科幻，传记'], ['雨果的秘密', '出品时间', '2011年11月23日'], ['雨果的秘密', '制片地区', '美国'], ['雨果的秘密', '编剧', '约翰·罗根'], ['雨果的秘密', '拍摄地点', '美国'], ['雨果的秘密', 'IMDB评分', '7.6'], ['雨果的秘密', '上映时间', '2012年5月31日（中国）'], ['雨果的秘密', '对白语言', '英语']]


[2021-12-22 17:33:48,496] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-12-22 17:33:49,823] [    INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.layer_norm.weight', 'cls.predictions.decoder_bias', 'cls.predictions.transform.bias', 'cls.predictions.transform.weight', 'cls.predictions.layer_norm.bias']
[2021-12-22 17:33:50,497] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt


■三元组粗分类结果，保留以下三元组： [['雨果的秘密', '各地片名翻译', '雨果'], ['雨果(2011年奥斯卡获奖电影)', '制片地区', '美国'], ['雨果(2011年马丁·斯科塞斯执导美国电影)', '制片地区', '美国'], ['雨果(2011年Martin Scorsese导演电影)', '制片地区', '美国'], ['维克多(法国作家)', '国籍', '法国']]
■检测到多个答案，正在进行答案排序......


Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.828 seconds.
Prefix dict has been built successfully.


■三元组排序结果：
('维克多(法国作家)', '国籍', '法国')-->0.49243852496147156
('雨果的秘密', '各地片名翻译', '雨果')-->0.39723604917526245
('雨果(2011年奥斯卡获奖电影)', '制片地区', '美国')-->0.22878356277942657
('雨果(2011年马丁·斯科塞斯执导美国电影)', '制片地区', '美国')-->0.22878356277942657
('雨果(2011年Martin Scorsese导演电影)', '制片地区', '美国')-->0.22878356277942657
■最佳答案： 法国
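
The ranking in the output above can be reproduced from the printed scores alone. This is a minimal sketch that uses the five retained triples and their similarity scores exactly as they appear in the output; the variable names are illustrative.

```python
# (triple, similarity score) pairs, in the order returned by the coarse classifier
scored_triples = [
    (('雨果的秘密', '各地片名翻译', '雨果'), 0.39723604917526245),
    (('雨果(2011年奥斯卡获奖电影)', '制片地区', '美国'), 0.22878356277942657),
    (('雨果(2011年马丁·斯科塞斯执导美国电影)', '制片地区', '美国'), 0.22878356277942657),
    (('雨果(2011年Martin Scorsese导演电影)', '制片地区', '美国'), 0.22878356277942657),
    (('维克多(法国作家)', '国籍', '法国'), 0.49243852496147156),
]

# Sort by score, highest first; Python's sort is stable, so tied triples keep their original order
ranked = sorted(scored_triples, key=lambda pair: pair[1], reverse=True)

best_triple, best_score = ranked[0]
best_answer = best_triple[-1]  # the tail entity of the best triple is the answer
print(best_answer)
```

The three film triples share the relation 制片地区 and therefore receive identical scores, while 国籍 is the relation closest to the remainder of the question, which is why the writer's triple ranks first.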

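9. Evaluation Results

Before being migrated to AI Studio, this project was implemented in PyTorch. The topic-mention recognition module and the triple classification module used the PyTorch versions of the Chinese pretrained SpanBERT and BERT-base models, respectively.

Data analysis showed that answer entities in the original QA dataset differ systematically in surface form from the entities in the knowledge base: letter case in English text, decimal precision, date and time formats, missing or omitted plus/minus signs, and so on. Rejecting an entity as the answer merely because of such formatting differences would introduce error into the system, so the author corrected these issues to produce a revised QA dataset, then trained and tested on it with the same method.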
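
To make the kinds of formatting differences concrete, here is a minimal sketch of a normalizer that maps superficially different strings to one canonical form before comparison. It is not the author's correction procedure, which is not shown in this section; the function names, the regular expressions, and the chosen canonical forms are all assumptions for illustration.

```python
import re
from decimal import Decimal

# Assumed date shapes: 2011年11月23日, 2011-11-23, 2011/11/23
DATE_RE = re.compile(r'(\d{4})[年/-](\d{1,2})[月/-](\d{1,2})日?')
# Assumed number shape: optional sign, digits, optional decimal part
NUMBER_RE = re.compile(r'[+-]?\d+(\.\d+)?')

def normalize_answer(text):
    """Hypothetical normalizer covering case, dates, decimal precision, and explicit signs."""
    s = text.strip().lower()  # letter case
    date = DATE_RE.fullmatch(s)
    if date:
        year, month, day = (int(part) for part in date.groups())
        return f'{year:04d}-{month:02d}-{day:02d}'  # one date format
    if NUMBER_RE.fullmatch(s):
        # Drops an explicit plus sign and trailing zeros: '+3' -> '3', '7.60' -> '7.6'
        return format(Decimal(s).normalize(), 'f')
    return s

def answers_match(predicted, gold):
    """Compare two answers after normalization instead of by raw string equality."""
    return normalize_answer(predicted) == normalize_answer(gold)

print(answers_match('2011年11月23日', '2011-11-23'))
print(answers_match('7.60', '7.6'))
```

Comparing normalized strings keeps the evaluation from counting a correct answer as wrong only because the dataset and the knowledge base write it differently.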

On the NLPCC2018 test set, with answer accuracy as the evaluation metric, the results are as follows: