Reposted from AI Studio.
Project link: https://aistudio.baidu.com/aistudio/projectdetail/3678873?channelType=0&channel=0

FAQ-Based Government Q&A
1. Project Description
Government staff spend a great deal of time and effort on policy interpretation and similar tasks. Although agencies accumulate many question-answer pairs internally, staff often do not know how to build a question-answering system from them to raise day-to-day efficiency and streamline workflows.

This project is built on the PaddleNLP FAQ System.

The project's source code is fully open-sourced in PaddleNLP.

If you find it helpful, please give the repo a star so it's easy to find again: https://github.com/PaddlePaddle/PaddleNLP

If you have questions, join the PaddleNLP technical discussion group on WeChat to learn and discuss NLP together: add the assistant's WeChat account and reply "NLP".

1.1 System Features
Low barrier to entry

A step-by-step guide to building a retrieval-based FAQ system
No labeled similar Query-Query pairs are needed
Strong results

An industry-leading pre-trained retrieval model: the RocketQA DualEncoder
A leading solution for the unlabeled-data scenario: pre-trained retrieval model + enhanced unsupervised semantic-index fine-tuning
Fast

Fast vector extraction with Paddle Inference
Fast querying and high-performance index building with Milvus
High-performance deployment with Paddle Serving
Application Areas
FAQ systems apply to domains such as:

1. Smart cities: epidemic-related inquiries and one-stop service hall Q&A.

2. Insurance: Q&A about claims, auto insurance, life insurance, and other products.

3. Telecom: everyday customer Q&A about phones, plans, and other services.

4. Legal: civil litigation and insurance claims Q&A.

5. Finance: Q&A about bank credit cards and other services.

2. Installation
The AI Studio platform ships with Paddle and PaddleNLP pre-installed and updates them regularly. To upgrade manually:

In [1]

After the first upgrade, restart the kernel for it to take effect.

!pip install --upgrade paddlenlp
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddlenlp
  Downloading paddlenlp-2.2.5-py3-none-any.whl (1.2 MB)
  ... (dependency download output omitted) ...
Installing collected packages: xxhash, tqdm, pyarrow, multidict, fsspec, frozenlist, asynctest, async-timeout, yarl, responses, huggingface-hub, aiosignal, aiohttp, datasets, paddlenlp
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
paddlefsl 1.0.0 requires tqdm~=4.27.0, but you have tqdm 4.63.1 which is incompatible.
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 datasets-2.0.0 frozenlist-1.3.0 fsspec-2022.3.0 huggingface-hub-0.4.0 multidict-6.0.2 paddlenlp-2.2.5 pyarrow-7.0.0 responses-0.18.0 tqdm-4.63.1 xxhash-3.0.0 yarl-1.7.2
Install the project's other dependencies:

Note: if a file is reported as missing, refresh with the button in the upper-left corner.

In [2]
!pip install -r requirements.txt
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddle-serving-app>=0.7.0
Collecting paddle-serving-client>=0.7.0
Collecting paddle-serving-server-gpu>=0.7.0.post102
Collecting hnswlib>=0.5.2
  ... (dependency download and build output omitted) ...
Successfully built hnswlib func-timeout
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parl 1.4.1 requires pyzmq==18.1.1, but you have pyzmq 22.3.0 which is incompatible.
nbconvert 6.4.5 requires MarkupSafe>=2.0, but you have markupsafe 1.1.1 which is incompatible.
Successfully installed Jinja2-2.11.3 MarkupSafe-1.1.1 Werkzeug-1.0.1 click-7.1.2 func-timeout-4.3.5 grpcio-1.33.2 grpcio-tools-1.33.2 hnswlib-0.6.2 iniconfig-1.1.1 opencv-python-3.4.17.61 paddle-serving-app-0.8.3 paddle-serving-client-0.8.3 paddle-serving-server-gpu-0.8.3.post1028 py-1.11.0 pyclipper-1.3.0.post2 pytest-7.1.1 shapely-1.8.1.post1 tomli-2.0.1
First, import the third-party libraries the project needs:

In [3]
import abc
import sys
from functools import partial
import argparse
import os
import random
import time
import numpy as np

# Load the Paddle APIs

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

import paddlenlp as ppnlp
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset, MapDataset
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.utils.downloader import get_path_from_url
from paddle import inference
3. Data Preparation
From the public pandemic government Q&A dataset we prepare three parts: a training set, an evaluation set, and a retrieval corpus. The raw government data contains articles, questions, and answers. We take the questions and build the evaluation set by Chinese-English round-trip translation plus model-based paraphrase generation; the training set uses the questions from the government data directly. Training set samples:

宁夏针对哪些人员开通工伤保障绿色通道?
四川金堂税务网上申报率为多少?
普陀如何实时跟踪返沪人员健康信息?
闽政通上的跨省异地就医业务可办理哪些业务子项?
国家卫健委要求什么人员禁止乘坐交通工具?
中国科学院上海有机化学研究所与广西中医药大学联合哪些机构结合中医药特点研发抗新型冠状病毒的药剂?
河北省的单位想要审批投资项目,可以在哪里办理呢?
重庆市使用财政性资金采购进口疫情防控物资如何审批?
The evaluation set consists of question pairs, for example:

南昌市出台了哪些企业稳岗就业政策? 南昌市政府出台了哪些企业的稳岗就业政策?
上海宣布什么时间开始开展中小学在线教育,学生不到校? 中小学在线教育什么时候开始,上海开始开始的?
北京市对于中小微企业经贸交流平台展会项目的支持标准是怎样的? 中小微企业经贸交流平台的发展目标
在疫情防控期间怎样灵活安排工作时间? 怎样在防控期间灵活安排工作时间?
湖北省为什么鼓励缴费人通过线上缴费渠道缴费? 为什么要鼓励缴费人通过线上缴费渠道缴费
In [4]
if not os.path.exists('faq_data.zip'):
    get_path_from_url('https://paddlenlp.bj.bcebos.com/applications/faq_data.zip', root_dir='.')
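The archive unpacks to faq_data/data/, whose three files all have simple layouts. A quick peek (a sketch only; the paths match those used later in this notebook):

# Preview the first line of each data file:
# - train.csv:     one question per line (unsupervised SimCSE training)
# - test_pair.csv: "query<TAB>similar_query" pairs for evaluation
# - corpus.csv:    one FAQ question per line, used to build the ANN index
for name in ['train.csv', 'test_pair.csv', 'corpus.csv']:
    with open('faq_data/data/' + name, encoding='utf-8') as f:
        print(name, '->', f.readline().rstrip())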
3.1 Load the Data
In [5]
def read_simcse_text(data_path):
    """Reads data."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.rstrip()
            yield {'text_a': data, 'text_b': data}

# Load the training set
train_set_file = 'faq_data/data/train.csv'
train_ds = load_dataset(
    read_simcse_text, data_path=train_set_file, lazy=False)

# Print three samples
for i in range(3):
    print(train_ds[i])
{'text_a': '宁夏针对哪些人员开通工伤保障绿色通道?', 'text_b': '宁夏针对哪些人员开通工伤保障绿色通道?'}
{'text_a': '四川金堂税务网上申报率为多少?', 'text_b': '四川金堂税务网上申报率为多少?'}
{'text_a': '普陀如何实时跟踪返沪人员健康信息?', 'text_b': '普陀如何实时跟踪返沪人员健康信息?'}
3.2 Build the DataLoader
In [6]
def convert_example(example, tokenizer, max_seq_length=512, do_evaluate=False):
    # Convert the text into token ids
    result = []

    for key, text in example.items():
        if 'label' in key:
            # do_evaluate
            result += [example['label']]
        else:
            # do_train
            encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length)
            input_ids = encoded_inputs["input_ids"]
            token_type_ids = encoded_inputs["token_type_ids"]
            result += [input_ids, token_type_ids]
    return result

max_seq_length = 64
batch_size = 32
model_name_or_path = 'rocketqa-zh-dureader-query-encoder'
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(model_name_or_path)

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),        # query_input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),   # query_segment
    Pad(axis=0, pad_val=tokenizer.pad_token_id),        # title_input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),   # title_segment
): [data for data in fn(samples)]

[2022-04-02 17:02:56,032] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa-zh-dureader-vocab.txt and saved to /home/aistudio/.paddlenlp/models/rocketqa-zh-dureader-query-encoder
[2022-04-02 17:02:56,034] [ INFO] - Downloading rocketqa-zh-dureader-vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa-zh-dureader-vocab.txt
100%|██████████| 89.0k/89.0k [00:00<00:00, 2.16MB/s]
In [7]

# Plain-text data -> token-id training sequences

def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

for idx, batch in enumerate(train_data_loader):
    if idx == 0:
        print(batch)
        break
[Tensor(shape=[32, 43], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[1 , 102 , 1862, …, 0 , 0 , 0 ],
[1 , 632 , 644 , …, 0 , 0 , 0 ],
[1 , 296 , 242 , …, 0 , 0 , 0 ],
…,
[1 , 296 , 213 , …, 0 , 0 , 0 ],
[1 , 102 , 1862, …, 0 , 0 , 0 ],
[1 , 88 , 898 , …, 0 , 0 , 0 ]]), Tensor(shape=[32, 43], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
…,
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0]]), Tensor(shape=[32, 43], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[1 , 102 , 1862, …, 0 , 0 , 0 ],
[1 , 632 , 644 , …, 0 , 0 , 0 ],
[1 , 296 , 242 , …, 0 , 0 , 0 ],
…,
[1 , 296 , 213 , …, 0 , 0 , 0 ],
[1 , 102 , 1862, …, 0 , 0 , 0 ],
[1 , 88 , 898 , …, 0 , 0 , 0 ]]), Tensor(shape=[32, 43], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
…,
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0]])]
4. Model Selection
In the government Q&A scenario the data consists only of question-answer pairs; there is nothing else. A supervised approach would need question-question pairs, which means collecting questions and labeling them by hand. We therefore turn to SimCSE, an unsupervised semantic-indexing model.

Unsupervised methods are generally weaker than supervised ones, so to boost SimCSE we initialize from RocketQA, a language model pre-trained for open-domain QA retrieval, and further optimize SimCSE with the WR (word repetition) strategy.

The whole pipeline needs no manual labeling, making it a fully unsupervised solution.

5. Building the Model
5.1 The SimCSE Model

SimCSE turns a sentence into its own positive sample via dropout: the same sentence is fed through the model twice, and because dropout is stochastic the two forward passes yield different vectors even though they represent the same sentence. The other sentences in the batch serve as negative samples.
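In loss terms, this is the scaled-cosine InfoNCE objective that the forward method below implements (the margin subtracted on the diagonal is omitted here for brevity). For sentence i with second-pass embedding h_i^+, batch size N, and scale s:

\ell_i = -\log \frac{\exp\big(s \cdot \cos(h_i, h_i^{+})\big)}{\sum_{j=1}^{N} \exp\big(s \cdot \cos(h_i, h_j^{+})\big)}

Minimizing the average of \ell_i pulls the two views of each sentence together while pushing apart the other in-batch sentences.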

The SimCSE network is built as follows:

In [8]
class SimCSE(nn.Layer):
    def __init__(self,
                 pretrained_model,
                 dropout=None,
                 margin=0.0,
                 scale=20,
                 output_emb_size=None):

        super().__init__()

        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

        # If output_emb_size is greater than 0, add a Linear layer to reduce
        # the embedding size. We recommend output_emb_size = 256, a good
        # trade-off between recall performance and efficiency.
        self.output_emb_size = output_emb_size
        if output_emb_size > 0:
            weight_attr = paddle.ParamAttr(
                initializer=paddle.nn.initializer.TruncatedNormal(std=0.02))
            self.emb_reduce_linear = paddle.nn.Linear(
                768, output_emb_size, weight_attr=weight_attr)

        self.margin = margin
        # Scale the cosine similarity to ease convergence
        self.scale = scale

    @paddle.jit.to_static(input_spec=[
        paddle.static.InputSpec(
            shape=[None, None], dtype='int64'), paddle.static.InputSpec(
                shape=[None, None], dtype='int64')
    ])
    def get_pooled_embedding(self,
                             input_ids,
                             token_type_ids=None,
                             position_ids=None,
                             attention_mask=None,
                             with_pooler=True):

        # Note: cls_embedding is the pooled embedding with tanh activation
        sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids,
                                                  position_ids, attention_mask)

        if with_pooler == False:
            cls_embedding = sequence_output[:, 0, :]

        if self.output_emb_size > 0:
            cls_embedding = self.emb_reduce_linear(cls_embedding)

        cls_embedding = self.dropout(cls_embedding)
        cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)

        return cls_embedding

    def get_semantic_embedding(self, data_loader):
        self.eval()
        with paddle.no_grad():
            for batch_data in data_loader:
                input_ids, token_type_ids = batch_data
                input_ids = paddle.to_tensor(input_ids)
                token_type_ids = paddle.to_tensor(token_type_ids)

                text_embeddings = self.get_pooled_embedding(
                    input_ids, token_type_ids=token_type_ids)

                yield text_embeddings

    def cosine_sim(self,
                   query_input_ids,
                   title_input_ids,
                   query_token_type_ids=None,
                   query_position_ids=None,
                   query_attention_mask=None,
                   title_token_type_ids=None,
                   title_position_ids=None,
                   title_attention_mask=None,
                   with_pooler=True):

        query_cls_embedding = self.get_pooled_embedding(
            query_input_ids,
            query_token_type_ids,
            query_position_ids,
            query_attention_mask,
            with_pooler=with_pooler)

        title_cls_embedding = self.get_pooled_embedding(
            title_input_ids,
            title_token_type_ids,
            title_position_ids,
            title_attention_mask,
            with_pooler=with_pooler)

        cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding,
                                axis=-1)
        return cosine_sim

    def forward(self,
                query_input_ids,
                title_input_ids,
                query_token_type_ids=None,
                query_position_ids=None,
                query_attention_mask=None,
                title_token_type_ids=None,
                title_position_ids=None,
                title_attention_mask=None):

        query_cls_embedding = self.get_pooled_embedding(
            query_input_ids, query_token_type_ids, query_position_ids,
            query_attention_mask)

        title_cls_embedding = self.get_pooled_embedding(
            title_input_ids, title_token_type_ids, title_position_ids,
            title_attention_mask)

        cosine_sim = paddle.matmul(
            query_cls_embedding, title_cls_embedding, transpose_y=True)

        # Subtract the margin from all positive samples' cosine_sim
        margin_diag = paddle.full(
            shape=[query_cls_embedding.shape[0]],
            fill_value=self.margin,
            dtype=paddle.get_default_dtype())

        cosine_sim = cosine_sim - paddle.diag(margin_diag)

        # Scale the cosine similarity to ease training convergence
        cosine_sim *= self.scale

        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
        labels = paddle.reshape(labels, shape=[-1, 1])

        loss = F.cross_entropy(input=cosine_sim, label=labels)

        return loss

5.2 The WR Strategy

Compared with other augmentation strategies, WR (word repetition) changes a sentence's semantics very little, but it does change the sentence's length, breaking SimCSE's implicit assumption that the two views of a sentence have equal length. WR originates from the ESimCSE paper; see that paper for the underlying rationale.

In [9]
def word_repetition(input_ids, token_type_ids, dup_rate=0.32):
    """Word Repetition strategy."""
    input_ids = input_ids.numpy().tolist()
    token_type_ids = token_type_ids.numpy().tolist()

    batch_size, seq_len = len(input_ids), len(input_ids[0])
    repetitied_input_ids = []
    repetitied_token_type_ids = []
    rep_seq_len = seq_len
    for batch_id in range(batch_size):
        cur_input_id = input_ids[batch_id]
        actual_len = np.count_nonzero(cur_input_id)
        dup_word_index = []
        # If the sequence length is no more than 5, skip duplication
        if actual_len > 5:
            dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len)))
            # Skip the cls and sep positions
            dup_word_index = random.sample(
                list(range(1, actual_len - 1)), k=dup_len)

        r_input_id = []
        r_token_type_id = []
        for idx, word_id in enumerate(cur_input_id):
            # Insert a duplicate word
            if idx in dup_word_index:
                r_input_id.append(word_id)
                r_token_type_id.append(token_type_ids[batch_id][idx])
            r_input_id.append(word_id)
            r_token_type_id.append(token_type_ids[batch_id][idx])
        after_dup_len = len(r_input_id)
        repetitied_input_ids.append(r_input_id)
        repetitied_token_type_ids.append(r_token_type_id)

        if after_dup_len > rep_seq_len:
            rep_seq_len = after_dup_len
    # Pad the batch back to a uniform length
    for batch_id in range(batch_size):
        after_dup_len = len(repetitied_input_ids[batch_id])
        pad_len = rep_seq_len - after_dup_len
        repetitied_input_ids[batch_id] += [0] * pad_len
        repetitied_token_type_ids[batch_id] += [0] * pad_len

    return paddle.to_tensor(repetitied_input_ids), paddle.to_tensor(
        repetitied_token_type_ids)
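As a quick sanity check (illustrative only, reusing the dataloader built above), WR lengthens sequences by duplicating random non-special tokens:

for batch in train_data_loader:
    query_input_ids, query_token_type_ids, _, _ = batch
    dup_ids, dup_segments = word_repetition(query_input_ids, query_token_type_ids, dup_rate=0.32)
    print('before:', query_input_ids.shape, 'after:', dup_ids.shape)  # seq_len grows
    break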

6. Training Configuration
In [10]

# Key hyperparameters
scale = 20    # recommended range: 10 ~ 30
margin = 0.1  # recommended range: 0.0 ~ 0.2

max_seq_length = 64
epochs = 1
learning_rate = 5E-5
warmup_proportion = 0.0
weight_decay = 0.0
save_steps = 10
batch_size = 64
output_emb_size = 256
dup_rate = 0.3
save_dir = 'checkpoints'
In [11]
pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(model_name_or_path)
model = SimCSE(
    pretrained_model,
    margin=margin,
    scale=scale,
    output_emb_size=output_emb_size)

num_training_steps = len(train_data_loader) * epochs

lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps,
                                     warmup_proportion)

# Generate the parameter names that need weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)
[2022-04-02 17:02:58,915] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_query_encoder.pdparams and saved to /home/aistudio/.paddlenlp/models/rocketqa-zh-dureader-query-encoder
[2022-04-02 17:02:58,918] [ INFO] - Downloading rocketqa_zh_dureader_query_encoder.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_query_encoder.pdparams
100%|██████████| 379M/379M [00:13<00:00, 29.2MB/s]
W0402 17:03:12.593480 1701 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0402 17:03:12.598835 1701 device_context.cc:465] device: 0, cuDNN Version: 7.6.
7. Model Training
In [12]
def do_train(model, train_data_loader, **kwargs):
    save_dir = kwargs['save_dir']
    global_step = 0
    tic_train = time.time()
    for epoch in range(1, epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):
            query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch
            if dup_rate > 0.0:
                # Apply the WR strategy to both views of the batch
                query_input_ids, query_token_type_ids = word_repetition(
                    query_input_ids, query_token_type_ids, dup_rate)
                title_input_ids, title_token_type_ids = word_repetition(
                    title_input_ids, title_token_type_ids, dup_rate)
            loss = model(
                query_input_ids=query_input_ids,
                title_input_ids=title_input_ids,
                query_token_type_ids=query_token_type_ids,
                title_token_type_ids=title_token_type_ids)

            global_step += 1
            if global_step % 5 == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss,
                       5 / (time.time() - tic_train)))
                tic_train = time.time()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
            if global_step % save_steps == 0:
                save_path = os.path.join(save_dir, "model_%d" % global_step)
                if not os.path.exists(save_path):
                    os.makedirs(save_path)
                save_param_path = os.path.join(save_path, 'model_state.pdparams')
                paddle.save(model.state_dict(), save_param_path)
                tokenizer.save_pretrained(save_path)
    # Save the model from the last batch
    save_path = os.path.join(save_dir, "model_%d" % global_step)
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    save_param_path = os.path.join(save_path, 'model_state.pdparams')
    paddle.save(model.state_dict(), save_param_path)
    tokenizer.save_pretrained(save_path)

# Train the model
do_train(model, train_data_loader, save_dir=save_dir)
global step 5, epoch: 1, batch: 5, loss: 0.11371, speed: 1.94 step/s
global step 10, epoch: 1, batch: 10, loss: 0.06113, speed: 9.94 step/s
global step 15, epoch: 1, batch: 15, loss: 0.03259, speed: 3.66 step/s
global step 20, epoch: 1, batch: 20, loss: 0.03326, speed: 8.03 step/s
global step 25, epoch: 1, batch: 25, loss: 0.01415, speed: 3.91 step/s
global step 30, epoch: 1, batch: 30, loss: 0.01532, speed: 8.88 step/s
global step 35, epoch: 1, batch: 35, loss: 0.00748, speed: 4.11 step/s
global step 40, epoch: 1, batch: 40, loss: 0.00522, speed: 8.67 step/s
global step 45, epoch: 1, batch: 45, loss: 0.01428, speed: 4.10 step/s
global step 50, epoch: 1, batch: 50, loss: 0.01331, speed: 8.70 step/s
global step 55, epoch: 1, batch: 55, loss: 0.00476, speed: 3.98 step/s
global step 60, epoch: 1, batch: 60, loss: 0.00748, speed: 9.09 step/s
global step 65, epoch: 1, batch: 65, loss: 0.00501, speed: 4.04 step/s
global step 70, epoch: 1, batch: 70, loss: 0.00428, speed: 8.78 step/s
global step 75, epoch: 1, batch: 75, loss: 0.01702, speed: 3.85 step/s
global step 80, epoch: 1, batch: 80, loss: 0.00421, speed: 8.41 step/s
global step 85, epoch: 1, batch: 85, loss: 0.00393, speed: 3.98 step/s
global step 90, epoch: 1, batch: 90, loss: 0.00386, speed: 8.37 step/s
global step 95, epoch: 1, batch: 95, loss: 0.00645, speed: 3.93 step/s
global step 100, epoch: 1, batch: 100, loss: 0.00821, speed: 8.89 step/s
global step 105, epoch: 1, batch: 105, loss: 0.02699, speed: 4.11 step/s
global step 110, epoch: 1, batch: 110, loss: 0.00471, speed: 8.16 step/s
global step 115, epoch: 1, batch: 115, loss: 0.00460, speed: 4.11 step/s
global step 120, epoch: 1, batch: 120, loss: 0.00480, speed: 8.73 step/s
global step 125, epoch: 1, batch: 125, loss: 0.00441, speed: 4.26 step/s
8. Evaluation
In [13]
def gen_id2corpus(corpus_file):
    id2corpus = {}
    with open(corpus_file, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            id2corpus[idx] = line.rstrip()
    return id2corpus
In [14]
corpus_file = 'faq_data/data/corpus.csv'
id2corpus = gen_id2corpus(corpus_file)

# convert_example expects a dict as input
corpus_list = [{idx: text} for idx, text in id2corpus.items()]
print(corpus_list[:4])
[{0: '南昌市政府出台了哪些企业的稳岗就业政策?'}, {1: '中小学在线教育什么时候开始,上海开始开始的?'}, {2: '中小微企业经贸交流平台的发展目标'}, {3: '怎样在防控期间灵活安排工作时间?'}]
In [15]
def convert_example_test(example,
                         tokenizer,
                         max_seq_length=512,
                         pad_to_max_seq_len=False):
    result = []
    for key, text in example.items():
        encoded_inputs = tokenizer(
            text=text,
            max_seq_len=max_seq_length,
            pad_to_max_seq_len=pad_to_max_seq_len)
        input_ids = encoded_inputs["input_ids"]
        token_type_ids = encoded_inputs["token_type_ids"]
        result += [input_ids, token_type_ids]
    return result
In [16]
trans_func_corpus = partial(
    convert_example_test,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

batchify_fn_corpus = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),        # text_input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),   # text_segment
): [data for data in fn(samples)]

corpus_ds = MapDataset(corpus_list)
corpus_data_loader = create_dataloader(
    corpus_ds,
    mode='predict',
    batch_size=batch_size,
    batchify_fn=batchify_fn_corpus,
    trans_fn=trans_func_corpus)

for item in corpus_data_loader:
    print(item)
    break
[Tensor(shape=[64, 32], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[1 , 219 , 1374, …, 0 , 0 , 0 ],
[1 , 12 , 96 , …, 0 , 0 , 0 ],
[1 , 12 , 96 , …, 0 , 0 , 0 ],
…,
[1 , 102 , 289 , …, 0 , 0 , 0 ],
[1 , 272 , 478 , …, 0 , 0 , 0 ],
[1 , 1367, 1937, …, 0 , 0 , 0 ]]), Tensor(shape=[64, 32], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,
[[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
…,
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0]])]
In [17]
from ann_util import build_index

hnsw_max_elements = 1000000
hnsw_ef = 100
hnsw_m = 100
final_index = build_index(corpus_data_loader, model,
                          output_emb_size=output_emb_size,
                          hnsw_max_elements=hnsw_max_elements,
                          hnsw_ef=hnsw_ef,
                          hnsw_m=hnsw_m)
[2022-04-02 17:04:06,429] [ INFO] - start build index…
[2022-04-02 17:04:11,781] [ INFO] - Total index number:5000
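For reference, build_index comes from the project's ann_util.py. A minimal sketch of what such a helper does, assuming the hnswlib API installed above (the project's actual implementation may differ in details):

import hnswlib

def build_index(data_loader, model, output_emb_size=256,
                hnsw_max_elements=1000000, hnsw_ef=100, hnsw_m=100):
    # Inner-product space on L2-normalized embeddings equals cosine similarity;
    # hnswlib then reports distance = 1 - inner product, so smaller is closer.
    index = hnswlib.Index(space='ip', dim=output_emb_size)
    index.init_index(max_elements=hnsw_max_elements,
                     ef_construction=hnsw_ef, M=hnsw_m)
    index.set_ef(hnsw_ef)
    index.set_num_threads(16)
    for text_embeddings in model.get_semantic_embedding(data_loader):
        index.add_items(text_embeddings.numpy())
    return index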
In [18]
def gen_text_file(similar_text_pair_file):
    text2similar_text = {}
    texts = []
    with open(similar_text_pair_file, 'r', encoding='utf-8') as f:
        for line in f:
            splited_line = line.rstrip().split("\t")
            if len(splited_line) != 2:
                continue

            text, similar_text = splited_line

            if not text or not similar_text:
                continue

            text2similar_text[text] = similar_text
            texts.append({"text": text})
    return texts, text2similar_text

similar_text_pair_file = 'faq_data/data/test_pair.csv'
text_list, text2similar_text = gen_text_file(similar_text_pair_file)

print(text_list[:2])

print(text2similar_text)

In [19]
import os

query_ds = MapDataset(text_list)
query_data_loader = create_dataloader(
    query_ds,
    mode='predict',
    batch_size=batch_size,
    batchify_fn=batchify_fn_corpus,
    trans_fn=trans_func_corpus)
query_embedding = model.get_semantic_embedding(query_data_loader)

recall_result_dir = 'recall_result_dir'
os.makedirs(recall_result_dir, exist_ok=True)
In [20]
recall_num = 10
recall_result_file = 'recall_result.txt'
recall_result_file = os.path.join(recall_result_dir, recall_result_file)
with open(recall_result_file, 'w', encoding='utf-8') as f:
    for batch_index, batch_query_embedding in enumerate(query_embedding):
        recalled_idx, cosine_sims = final_index.knn_query(
            batch_query_embedding.numpy(), recall_num)
        batch_size = len(cosine_sims)
        for row_index in range(batch_size):
            text_index = batch_size * batch_index + row_index
            for idx, doc_idx in enumerate(recalled_idx[row_index]):
                # knn_query returns distances (1 - inner product), so convert
                # back to cosine similarity before writing
                f.write("{}\t{}\t{}\n".format(
                    text_list[text_index]["text"], id2corpus[doc_idx],
                    1.0 - cosine_sims[row_index][idx]))
In [21]
recall_N = []
from evaluate import recall
from data import get_rs

similar_text_pair = "faq_data/data/test_pair.csv"
rs = get_rs(similar_text_pair, recall_result_file, 10)
recall_num = [1, 5, 10]
for topN in recall_num:
    R = round(100 * recall(rs, N=topN), 3)
    recall_N.append(str(R))
for key, val in zip(recall_num, recall_N):
    print('recall@{}={}'.format(key, val))
recall@1=79.479
recall@5=91.091
recall@10=93.594
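Here recall and get_rs come from the project's evaluate.py and data.py. Under the assumption that rs holds one 0/1 relevance list per query (1 where a recalled candidate matches the labeled similar question), recall@N reduces to the sketch below:

import numpy as np

def recall(rs, N=10):
    # Fraction of queries whose labeled similar question
    # appears among the top-N recalled candidates
    return np.mean([float(np.sum(r[:N]) > 0) for r in rs])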
9. Model Inference
In [22]
example = "南昌市出台了哪些企业稳岗就业政策?"
print('Input text: {}'.format(example))
encoded_inputs = tokenizer(
    text=[example],
    max_seq_len=max_seq_length)
input_ids = encoded_inputs["input_ids"]
token_type_ids = encoded_inputs["token_type_ids"]
input_ids = paddle.to_tensor(input_ids)
token_type_ids = paddle.to_tensor(token_type_ids)
cls_embedding = model.get_pooled_embedding(
    input_ids=input_ids, token_type_ids=token_type_ids)

print('Extracted feature: {}'.format(cls_embedding))

# knn_query returns distances (1 - inner product): smaller means more similar
recalled_idx, cosine_sims = final_index.knn_query(cls_embedding.numpy(), 10)
print('Recalled results:')
for doc_idx, cosine_sim in zip(recalled_idx[0], cosine_sims[0]):
    print(id2corpus[doc_idx], cosine_sim)
Input text: 南昌市出台了哪些企业稳岗就业政策?
Recalled results:
南昌市政府出台了哪些企业的稳岗就业政策? 0.025437057
南昌市出台了哪些政策推进实体企业有序经营? 0.3430903
怀化市为中稳定小企业职工队伍,出台了什么财务政策? 0.43941408
泰州市对稳岗用工政策提出了哪些要求? 0.4403869
南昌高新区出台了几项稳定经济增长的措施? 0.462421
省发展和改革委员会印发的通知中在实施援企稳岗政策上有什么规定? 0.46484655
人社部对中小微企业出台了什么政策? 0.46674693
九江市对有创业意愿的农民工提供了什么政策? 0.48485285
奉贤区针对中小企业出台了什么帮企业减负的扶持政策? 0.49053097
宁国市政府对新增就业岗位的企业有什么补贴政策? 0.49932688
10. Deployment
10.1 Dynamic-to-Static Export
In [23]
output_path = 'output'
model.eval()

# Convert to a static graph with a specific input description
model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(
            shape=[None, None], dtype="int64"),  # input_ids
        paddle.static.InputSpec(
            shape=[None, None], dtype="int64")   # segment_ids
    ])

# Save the static graph model
save_path = os.path.join(output_path, "inference")
paddle.jit.save(model, save_path)
10.2 Q&A Retrieval Engine
With the model ready, the next step is a Milvus-based semantic retrieval engine for fast vector search. This project uses the open-source Milvus for vector retrieval; see the official Milvus installation guide for setup. This example uses the Milvus 1.1.1 CPU version, and the official Docker installation is recommended as the simplest and fastest route.
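For orientation, a minimal sketch of building and querying a collection with the Milvus 1.1.x Python SDK (pymilvus 1.x); the collection name and the corpus_embeddings/query_emb arrays are placeholders, and the repo linked at the end holds the project's actual scripts:

# pip install pymilvus==1.1.2  (client matching a Milvus 1.1.x server)
from milvus import Milvus, IndexType, MetricType

client = Milvus(host='localhost', port='19530')

collection_param = {
    'collection_name': 'faq_system',   # hypothetical collection name
    'dimension': output_emb_size,      # 256, matching the model above
    'index_file_size': 1024,
    'metric_type': MetricType.IP       # inner product on normalized vectors = cosine
}
client.create_collection(collection_param)

# Insert the corpus embeddings (corpus_embeddings: an N x 256 numpy array)
status, ids = client.insert(collection_name='faq_system',
                            records=corpus_embeddings.tolist())
client.create_index('faq_system', IndexType.IVF_FLAT, {'nlist': 1000})

# Retrieve the top-10 most similar FAQ questions for a query embedding
status, results = client.search(collection_name='faq_system',
                                query_records=query_emb.tolist(),
                                top_k=10,
                                params={'nprobe': 16})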

10.3 Paddle Serving Deployment
In [24]
import paddle_serving_client.io as serving_io

dirname = "output"
model_filename = "inference.get_pooled_embedding.pdmodel"
params_filename = "inference.get_pooled_embedding.pdiparams"
server_path = "serving_server"
client_path = "serving_client"
feed_alias_names = None
fetch_alias_names = "output_embedding"
show_proto = False
serving_io.inference_model_to_serving(
    dirname=dirname,
    serving_server=server_path,
    serving_client=client_path,
    model_filename=model_filename,
    params_filename=params_filename,
    show_proto=show_proto,
    feed_alias_names=feed_alias_names,
    fetch_alias_names=fetch_alias_names)
(dict_keys(['input_ids', 'token_type_ids']), dict_keys(['elementwise_div_0']))
Once conversion finishes, start the server to deploy the service and query it from the client, as sketched below. For the full details, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/faq_system
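A minimal sketch of starting the server and querying it, assuming the standard Paddle Serving CLI and client API (the port and feed shapes are illustrative; the repo above holds the exact scripts):

# Server side (run in a shell; port is arbitrary):
#   python3 -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_ids 0
from paddle_serving_client import Client

client = Client()
client.load_client_config('serving_client/serving_client_conf.prototxt')
client.connect(['127.0.0.1:9393'])

# Feed the tokenized ids of one query and fetch its 256-dim embedding
feed = {'input_ids': input_ids.numpy()[0], 'token_type_ids': token_type_ids.numpy()[0]}
result = client.predict(feed=feed, fetch=['output_embedding'], batch=False)
print(result['output_embedding'])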
