PaddleNLP in Practice: Entity Extraction with the UIE Model (Taxi Data, Express Waybills)

0. Background

This project demonstrates how to fine-tune a model on a small number of labeled samples to quickly and accurately extract fields such as destination, origin, time, and taxi fare from waybills and receipts, producing structured information. It helps logistics practitioners pull out the information that matters and lowers the cost of form filling for customers.

Dataset overview:
The waybill.jsonl file is the express waybill dataset:

{"id": 57, "text": "昌胜远黑龙江省哈尔滨市南岗区宽桥街28号18618391296", "relations": [], "entities": [{"id": 111, "start_offset": 0, "end_offset": 3, "label": "姓名"}, {"id": 112, "start_offset": 3, "end_offset": 7, "label": "省份"}, {"id": 113, "start_offset": 7, "end_offset": 11, "label": "城市"}, {"id": 114, "start_offset": 11, "end_offset": 14, "label": "县区"}, {"id": 115, "start_offset": 14, "end_offset": 20, "label": "详细地址"}, {"id": 116, "start_offset": 20, "end_offset": 31, "label": "电话"}]}
{"id": 58, "text": "易颖18500308469山东省烟台市莱阳市富水南路1号", "relations": [], "entities": [{"id": 118, "start_offset": 0, "end_offset": 2, "label": "姓名"}, {"id": 119, "start_offset": 2, "end_offset": 13, "label": "电话"}, {"id": 120, "start_offset": 13, "end_offset": 16, "label": "省份"}, {"id": 121, "start_offset": 16, "end_offset": 19, "label": "城市"}, {"id": 122, "start_offset": 19, "end_offset": 22, "label": "县区"}, {"id": 123, "start_offset": 22, "end_offset": 28, "label": "详细地址"}]}

doccano_ext.jsonl is the taxi-ride dataset:

{"id": 1, "text": "昨天晚上十点加班打车回家58元", "relations": [], "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "时间"}, {"id": 1, "start_offset": 11, "end_offset": 12, "label": "目的地"}, {"id": 2, "start_offset": 12, "end_offset": 14, "label": "费用"}]}
{"id": 2, "text": "三月三号早上12点46加班,到公司54", "relations": [], "entities": [{"id": 3, "start_offset": 0, "end_offset": 11, "label": "时间"}, {"id": 4, "start_offset": 15, "end_offset": 17, "label": "目的地"}, {"id": 5, "start_offset": 17, "end_offset": 19, "label": "费用"}]}
{"id": 3, "text": "8月31号十一点零四工作加班五十块钱", "relations": [], "entities": [{"id": 6, "start_offset": 0, "end_offset": 10, "label": "时间"}, {"id": 7, "start_offset": 14, "end_offset": 16, "label": "费用"}]}
{"id": 4, "text": "5月17号晚上10点35分加班打车回家,36块五", "relations": [], "entities": [{"id": 8, "start_offset": 0, "end_offset": 13, "label": "时间"}, {"id": 1, "start_offset": 18, "end_offset": 19, "label": "目的地"}, {"id": 9, "start_offset": 20, "end_offset": 24, "label": "费用"}]}
{"id": 5, "text": "2009年1月份通讯费一百元", "relations": [], "entities": [{"id": 10, "start_offset": 0, "end_offset": 7, "label": "时间"}, {"id": 11, "start_offset": 11, "end_offset": 13, "label": "费用"}]}
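
To sanity-check an annotation file like the ones above, you can load it line by line and verify that each entity's offsets really point at the labeled span. A minimal sketch (assuming the file is saved at ./data/doccano_ext.jsonl):

import json

with open("./data/doccano_ext.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for ent in record["entities"]:
            # Recover the labeled span by slicing the text with the offsets.
            span = record["text"][ent["start_offset"]:ent["end_offset"]]
            print(record["id"], ent["label"], "->", span)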

Preview of the results

Input:

城市内交通费7月5日金额114广州至佛山
从百度大厦到龙泽苑东区打车费二十元
上海虹桥高铁到杭州时间是9月24日费用是73元
上周末坐动车从北京到上海花费五十块五毛
昨天北京飞上海话费一百元

Output:

{"出发地": [{"text": "广州", "start": 15, "end": 17, "probability": 0.9073772252165782}], "目的地": [{"text": "佛山", "start": 18, "end": 20, "probability": 0.9927365183877761}], "时间": [{"text": "7月5日", "start": 6, "end": 10, "probability": 0.9978010396512218}]}
{"出发地": [{"text": "百度大厦", "start": 1, "end": 5, "probability": 0.968825147409472}], "目的地": [{"text": "龙泽苑东区", "start": 6, "end": 11, "probability": 0.9877913072493669}]}
{"目的地": [{"text": "杭州", "start": 7, "end": 9, "probability": 0.9929172180094881}], "时间": [{"text": "9月24日", "start": 12, "end": 17, "probability": 0.9953342057701597}]}
{"出发地": [{"text": "北京", "start": 7, "end": 9, "probability": 0.973048366717471}], "目的地": [{"text": "上海", "start": 10, "end": 12, "probability": 0.988486130309397}], "时间": [{"text": "上周末", "start": 0, "end": 3, "probability": 0.9977407699595275}]}
{"出发地": [{"text": "北京", "start": 2, "end": 4, "probability": 0.974188953533556}], "目的地": [{"text": "上海", "start": 5, "end": 7, "probability": 0.9928200521486445}], "时间": [{"text": "昨天", "start": 0, "end": 2, "probability": 0.9731559534465504}]}
! pip install --upgrade paddlenlp
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.1.1)
Collecting paddlenlp
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8e/e1/94cdbaca400a57687a8529213776468f003b64b6e35a6f4acf6b6539f543/paddlenlp-2.3.4-py3-none-any.whl (1.4 MB)
Requirement already satisfied: paddlefsl in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.0.0)
Requirement already satisfied: multiprocess<=0.70.12.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.70.11.1)
Collecting datasets>=2.0.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/98/29/f381f8a633fed2c4f41c191498c3bc43d91a8e44c5202a8b0b2bd8b1acf3/datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting paddle2onnx
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/be/62/bd17eeccd7cfd3601ae8607371673046158d67f48d095c3055edac2e5250/paddle2onnx-0.9.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.9 MB)
Requirement already satisfied: sentencepiece in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.1.96)
Requirement already satisfied: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)
Requirement already satisfied: protobuf<=3.20.0,>=3.1.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (3.14.0)
Requirement already satisfied: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)
Requirement already satisfied: tqdm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.27.0)
Requirement already satisfied: dill<0.3.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.3.3)
Requirement already satisfied: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)
Collecting pyarrow>=6.0.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/af/7d/94064bf0b44cf9e3cf67bdb90735f01d1d53a3973f8d59da99dc9ed58303/pyarrow-8.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.3 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bd/4c/166d788feff5c739b833342945bbba406581095fb6c4a056113fae646b5c/fsspec-2022.5.0-py3-none-any.whl (140 kB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from datasets>=2.0.0->paddlenlp) (1.19.5)
Requirement already satisfied: pandas in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from datasets>=2.0.0->paddlenlp) (1.1.5)
Collecting responses<0.19
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/f3/2b3a6dc5986303b3dd1bbbcf482022acb2583c428cd23f0b6d37b1a1a519/responses-0.18.0-py3-none-any.whl (38 kB)
Collecting tqdm
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8a/c4/d15f1e627fff25443ded77ea70a7b5532d6371498f9285d44d62587e209c/tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Requirement already satisfied: packaging in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from datasets>=2.0.0->paddlenlp) (21.3)
Collecting aiohttp
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/dd/fe/80c594d62a7ff07730fd2cfc3a058498087436d8c938243e0610d1928f0e/aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d8/2c/9af8451ab780598e3b26a84d4f0e3844841456657401eb6843fdb622bb41/huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from datasets>=2.0.0->paddlenlp) (2.24.0)
Requirement already satisfied: importlib-metadata in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from datasets>=2.0.0->paddlenlp) (4.2.0)
Collecting xxhash
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ef/ac/0eb796cf052c392f1ae586452c89ca66164f7d4f655b039ca3d06e2291af/xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Requirement already satisfied: six>=1.9 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from protobuf<=3.20.0,>=3.1.0->paddlenlp) (1.16.0)
Requirement already satisfied: pillow==8.2.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlefsl->paddlenlp) (8.2.0)
Collecting paddlefsl
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/fb/4a/25d1959a8f1fe5ee400f32fc9fc8b56d4fd6fc25315e23c0171f6e705e2a/paddlefsl-1.1.0-py3-none-any.whl (101 kB)
Requirement already satisfied: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.24.2)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets>=2.0.0->paddlenlp) (5.1.2)
Requirement already satisfied: filelock in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets>=2.0.0->paddlenlp) (3.0.12)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets>=2.0.0->paddlenlp) (4.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from packaging->datasets>=2.0.0->paddlenlp) (3.0.9)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests>=2.19.0->datasets>=2.0.0->paddlenlp) (2.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests>=2.19.0->datasets>=2.0.0->paddlenlp) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests>=2.19.0->datasets>=2.0.0->paddlenlp) (2019.9.11)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from requests>=2.19.0->datasets>=2.0.0->paddlenlp) (1.25.6)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/56/aa/4ef5aa67a9a62505db124a5cb5262332d1d4153462eb8fd89c9fa41e5d92/urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.6.3)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)
Collecting frozenlist>=1.1.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d1/ae/e4437fe5b5ba0fbccdaf8ecde8e3b6e8903793ca638c4706d034c0969ce1/frozenlist-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
Collecting multidict<7.0,>=4.5
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/db/3f/1c876ed190e8fcd1a2faef3085427e5465076e28813a2499502633f7eed3/multidict-6.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (94 kB)
Collecting charset-normalizer<3.0,>=2.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/94/69/64b11e8c2fb21f08634468caef885112e682b0ebe2908e74d3616eb1c113/charset_normalizer-2.1.0-py3-none-any.whl (39 kB)
Collecting asynctest==0.13.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e8/b6/8d17e169d577ca7678b11cd0d3ceebb0a6089a7f4a2de4b945fe4b1c86db/asynctest-0.13.0-py3-none-any.whl (26 kB)
Collecting aiosignal>=1.1.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/3b/87/fe94898f2d44a93a35d5aa74671ed28094d80753a1113d68b799fab6dc22/aiosignal-1.2.0-py3-none-any.whl (8.2 kB)
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d6/c1/8991e7c5385b897b8c020cdaad718c5b087a6626d1d11a23e1ea87e325a7/async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/80/7f/af3ecdf87e8e41da7b133f1d61f82745f8c862bdade3b56addee3ad23956/yarl-1.7.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (271 kB)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from aiohttp->datasets>=2.0.0->paddlenlp) (21.4.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from importlib-metadata->datasets>=2.0.0->paddlenlp) (3.8.0)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas->datasets>=2.0.0->paddlenlp) (2019.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas->datasets>=2.0.0->paddlenlp) (2.8.2)
Installing collected packages: paddle2onnx, xxhash, urllib3, tqdm, pyarrow, multidict, fsspec, frozenlist, charset-normalizer, asynctest, async-timeout, yarl, aiosignal, responses, paddlefsl, huggingface-hub, aiohttp, datasets, paddlenlp
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.6
    Uninstalling urllib3-1.25.6:
      Successfully uninstalled urllib3-1.25.6
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.27.0
    Uninstalling tqdm-4.27.0:
      Successfully uninstalled tqdm-4.27.0
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 2.0.0
    Uninstalling pyarrow-2.0.0:
      Successfully uninstalled pyarrow-2.0.0
  Attempting uninstall: paddlefsl
    Found existing installation: paddlefsl 1.0.0
    Uninstalling paddlefsl-1.0.0:
      Successfully uninstalled paddlefsl-1.0.0
  Attempting uninstall: paddlenlp
    Found existing installation: paddlenlp 2.1.1
    Uninstalling paddlenlp-2.1.1:
      Successfully uninstalled paddlenlp-2.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parl 1.4.1 requires pyzmq==18.1.1, but you have pyzmq 23.1.0 which is incompatible.
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 charset-normalizer-2.1.0 datasets-2.3.2 frozenlist-1.3.0 fsspec-2022.5.0 huggingface-hub-0.8.1 multidict-6.0.2 paddle2onnx-0.9.8 paddlefsl-1.1.0 paddlenlp-2.3.4 pyarrow-8.0.0 responses-0.18.0 tqdm-4.64.0 urllib3-1.25.11 xxhash-3.0.0 yarl-1.7.2

1. Loading the Datasets (Waybill Data, Taxi Data)

# ! wget https://paddlenlp.bj.bcebos.com/model_zoo/uie/waybill.jsonl
# ! mv waybill.jsonl ./data/
# ! mv doccano_ext.jsonl ./data/
# ! mv dev_test.jsonl ./data/

doccano_file: the data annotation file exported from doccano.

save_dir: directory where the training data is saved; defaults to the data directory.

negative_ratio: maximum ratio of negative examples; only effective for extraction tasks. Constructing a moderate number of negatives can improve model performance. The number of negatives is tied to the actual number of labels: maximum negatives = negative_ratio * number of positives. This parameter only affects the training set and defaults to 5. To keep evaluation metrics accurate, the dev and test sets are constructed with all negatives by default.

splits: proportions of the training, dev, and test sets when splitting the data. The default [0.8, 0.1, 0.1] splits the data 8:1:1 into train, dev, and test sets (see the sketch after this list).

task_type: task type; extraction and classification tasks are supported.

options: category labels for a classification task; only effective for classification tasks. Defaults to ["正向", "负向"].

prompt_prefix: prompt prefix for a classification task; only effective for classification tasks. Defaults to "情感倾向".

is_shuffle: whether to shuffle the dataset; defaults to True.

seed: random seed; defaults to 1000.

separator: separator between the entity category/aspect and the classification label; only effective for entity-level/aspect-level classification tasks. Defaults to "##".
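
The splitting behavior controlled by splits, is_shuffle, and seed can be pictured with a short sketch; this illustrates the logic only and is not the actual doccano.py source:

import json
import random

def split_examples(path, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    with open(path, "r", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    if is_shuffle:
        random.seed(seed)          # a fixed seed keeps the splits reproducible
        random.shuffle(examples)
    n_train = int(len(examples) * splits[0])
    n_dev = int(len(examples) * splits[1])
    return (examples[:n_train],                 # train
            examples[n_train:n_train + n_dev],  # dev
            examples[n_train + n_dev:])         # test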

!python doccano.py \
    --doccano_file ./data/doccano_ext.jsonl \
    --task_type 'ext' \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --negative_ratio 5
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
[2022-07-14 11:34:26,474] [    INFO] - Converting doccano data...
100%|████████████████████████████████████████| 40/40 [00:00<00:00, 42560.16it/s]
[2022-07-14 11:34:26,477] [    INFO] - Adding negative samples for first stage prompt...
100%|███████████████████████████████████████| 40/40 [00:00<00:00, 161009.75it/s]
[2022-07-14 11:34:26,478] [    INFO] - Converting doccano data...
100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 21754.69it/s]
[2022-07-14 11:34:26,479] [    INFO] - Adding negative samples for first stage prompt...
100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 44057.82it/s]
[2022-07-14 11:34:26,479] [    INFO] - Converting doccano data...
100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 26181.67it/s]
[2022-07-14 11:34:26,480] [    INFO] - Adding negative samples for first stage prompt...
100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 45689.59it/s]
[2022-07-14 11:34:26,482] [    INFO] - Save 160 examples to ./data/train.txt.
[2022-07-14 11:34:26,482] [    INFO] - Save 20 examples to ./data/dev.txt.
[2022-07-14 11:34:26,482] [    INFO] - Save 20 examples to ./data/test.txt.
[2022-07-14 11:34:26,482] [    INFO] - Finished! It takes 0.01 seconds

A sample of the converted output:

{"content": "上海到北京机票1320元", "result_list": [{"text": "上海", "start": 0, "end": 2}], "prompt": "出发地"}
{"content": "上海到北京机票1320元", "result_list": [{"text": "北京", "start": 3, "end": 5}], "prompt": "目的地"}
{"content": "上海到北京机票1320元", "result_list": [{"text": "1320", "start": 7, "end": 11}], "prompt": "费用"}
{"content": "上海虹桥到杭州东站高铁g7555共73元时间是10月14日", "result_list": [{"text": "上海虹桥", "start": 0, "end": 4}], "prompt": "出发地"}
{"content": "上海虹桥到杭州东站高铁g7555共73元时间是10月14日", "result_list": [{"text": "杭州东站", "start": 5, "end": 9}], "prompt": "目的地"}
{"content": "上海虹桥到杭州东站高铁g7555共73元时间是10月14日", "result_list": [{"text": "73", "start": 17, "end": 19}], "prompt": "费用"}
{"content": "上海虹桥到杭州东站高铁g7555共73元时间是10月14日", "result_list": [{"text": "10月14日", "start": 23, "end": 29}], "prompt": "时间"}
{"content": "昨天晚上十点加班打车回家58元", "result_list": [{"text": "昨天晚上十点", "start": 0, "end": 6}], "prompt": "时间"}
{"content": "昨天晚上十点加班打车回家58元", "result_list": [{"text": "家", "start": 11, "end": 12}], "prompt": "目的地"}
{"content": "昨天晚上十点加班打车回家58元", "result_list": [{"text": "58", "start": 12, "end": 14}], "prompt": "费用"}
{"content": "2月20号从南山到光明二十元", "result_list": [{"text": "2月20号", "start": 0, "end": 5}], "prompt": "时间"}
!python doccano.py \
    --doccano_file ./data/doccano_ext.jsonl \
    --task_type "ext" \
    --save_dir ./data \
    --negative_ratio 5

2. Model Training

!python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --learning_rate 1e-5 \
    --batch_size 8 \
    --max_seq_len 512 \
    --num_epochs 100 \
    --model "uie-base" \
    --seed 1000 \
    --logging_steps 10 \
    --valid_steps 50 \
    --device "gpu"
[2022-07-12 15:08:32,211] [    INFO] - Downloading resource files...
[2022-07-12 15:08:32,212] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-base'.
W0712 15:08:32.240545   821 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0712 15:08:32.244833   821 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022-07-12 15:08:40,748] [    INFO] - global step 10, epoch: 1, loss: 0.00342, speed: 3.77 step/s
[2022-07-12 15:08:43,215] [    INFO] - global step 20, epoch: 1, loss: 0.00256, speed: 4.05 step/s
[2022-07-12 15:08:45,766] [    INFO] - global step 30, epoch: 2, loss: 0.00223, speed: 3.92 step/s
[2022-07-12 15:08:48,246] [    INFO] - global step 40, epoch: 2, loss: 0.00195, speed: 4.03 step/s
[2022-07-12 15:08:50,790] [    INFO] - global step 50, epoch: 3, loss: 0.00171, speed: 3.93 step/s
[2022-07-12 15:08:51,060] [    INFO] - Evaluation precision: 1.00000, recall: 0.93750, F1: 0.96774
[2022-07-12 15:08:51,061] [    INFO] - best F1 performence has been updated: 0.00000 --> 0.96774
[2022-07-12 15:08:52,256] [    INFO] - tokenizer config file saved in ./checkpoint/model_best/tokenizer_config.json
[2022-07-12 15:08:52,256] [    INFO] - Special tokens file saved in ./checkpoint/model_best/special_tokens_map.json
[2022-07-12 15:08:54,734] [    INFO] - global step 60, epoch: 3, loss: 0.00151, speed: 4.04 step/s
[2022-07-12 15:08:57,280] [    INFO] - global step 70, epoch: 4, loss: 0.00137, speed: 3.93 step/s
[2022-07-12 15:08:59,753] [    INFO] - global step 80, epoch: 4, loss: 0.00122, speed: 4.04 step/s
[2022-07-12 15:09:02,309] [    INFO] - global step 90, epoch: 5, loss: 0.00110, speed: 3.91 step/s
[2022-07-12 15:09:04,786] [    INFO] - global step 100, epoch: 5, loss: 0.00102, speed: 4.04 step/s
[2022-07-12 15:09:05,052] [    INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-12 15:09:05,052] [    INFO] - best F1 performence has been updated: 0.96774 --> 1.00000
[2022-07-12 15:09:09,244] [    INFO] - tokenizer config file saved in ./checkpoint/model_best/tokenizer_config.json
[2022-07-12 15:09:09,244] [    INFO] - Special tokens file saved in ./checkpoint/model_best/special_tokens_map.json
[2022-07-12 15:09:11,809] [    INFO] - global step 110, epoch: 6, loss: 0.00094, speed: 3.90 step/s
[2022-07-12 15:09:14,298] [    INFO] - global step 120, epoch: 6, loss: 0.00087, speed: 4.02 step/s
[2022-07-12 15:09:16,850] [    INFO] - global step 130, epoch: 7, loss: 0.00082, speed: 3.92 step/s
[2022-07-12 15:09:19,343] [    INFO] - global step 140, epoch: 7, loss: 0.00077, speed: 4.01 step/s
[2022-07-12 15:09:21,905] [    INFO] - global step 150, epoch: 8, loss: 0.00072, speed: 3.90 step/s
[2022-07-12 15:09:22,176] [    INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-12 15:09:24,659] [    INFO] - global step 160, epoch: 8, loss: 0.00068, speed: 4.03 step/s
[2022-07-12 15:09:27,213] [    INFO] - global step 170, epoch: 9, loss: 0.00065, speed: 3.92 step/s
[2022-07-12 15:09:29,692] [    INFO] - global step 180, epoch: 9, loss: 0.00061, speed: 4.03 step/s
[2022-07-12 15:09:32,256] [    INFO] - global step 190, epoch: 10, loss: 0.00058, speed: 3.90 step/s
[2022-07-12 15:09:34,741] [    INFO] - global step 200, epoch: 10, loss: 0.00056, speed: 4.02 step/s
[2022-07-12 15:09:35,006] [    INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-12 15:09:37,561] [    INFO] - global step 210, epoch: 11, loss: 0.00053, speed: 3.91 step/s
[2022-07-12 15:09:40,038] [    INFO] - global step 220, epoch: 11, loss: 0.00051, speed: 4.04 step/s
[2022-07-12 15:09:42,596] [    INFO] - global step 230, epoch: 12, loss: 0.00049, speed: 3.91 step/s
[2022-07-12 15:09:45,080] [    INFO] - global step 240, epoch: 12, loss: 0.00047, speed: 4.03 step/s
[2022-07-12 15:09:47,643] [    INFO] - global step 250, epoch: 13, loss: 0.00045, speed: 3.90 step/s
[2022-07-12 15:09:47,910] [    INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-12 15:09:50,399] [    INFO] - global step 260, epoch: 13, loss: 0.00043, speed: 4.02 step/s
[2022-07-12 15:09:52,966] [    INFO] - global step 270, epoch: 14, loss: 0.00042, speed: 3.90 step/s
[2022-07-12 15:09:55,464] [    INFO] - global step 280, epoch: 14, loss: 0.00040, speed: 4.00 step/s
[2022-07-12 15:09:58,028] [    INFO] - global step 290, epoch: 15, loss: 0.00039, speed: 3.90 step/s
[2022-07-12 15:10:00,516] [    INFO] - global step 300, epoch: 15, loss: 0.00038, speed: 4.02 step/s
[2022-07-12 15:10:00,781] [    INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-12 15:10:03,348] [    INFO] - global step 310, epoch: 16, loss: 0.00036, speed: 3.90 step/s
[2022-07-12 15:10:05,836] [    INFO] - global step 320, epoch: 16, loss: 0.00035, speed: 4.02 step/s
[2022-07-12 15:10:08,393] [    INFO] - global step 330, epoch: 17, loss: 0.00034, speed: 3.91 step/s
[2022-07-12 15:10:10,888] [    INFO] - global step 340, epoch: 17, loss: 0.00033, speed: 4.01 step/s
^C
Traceback (most recent call last):
  File "finetune.py", line 181, in <module>
    do_train()
  File "finetune.py", line 108, in do_train
    pos_ids)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/model.py", line 35, in forward
    attention_mask=att_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/ernie/modeling.py", line 666, in forward
    encoder_outputs = self.encoder(embedding_output, attention_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 697, in forward
    output = mod(output, src_mask=src_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 581, in forward
    src = self.self_attn(src, src, src, src_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 400, in forward
    q, k, v = self._prepare_qkv(query, key, value, cache)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 234, in _prepare_qkv
    k, v = self.compute_kv(key, value)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 269, in compute_kv
    v = self.v_proj(value)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/common.py", line 172, in forward
    x=input, weight=self.weight, bias=self.bias, name=self.name)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/functional/common.py", line 1547, in linear
    return _C_ops.elementwise_add(pre_bias, bias)
KeyboardInterrupt

A portion of the training output is shown above; training was interrupted manually (the KeyboardInterrupt traceback) once the evaluation F1 had converged, and the rest of the log is collapsed.

A GPU environment is recommended; otherwise you may run out of memory. In a CPU environment, switch model to uie-tiny and lower batch_size appropriately, as in the sketch below.
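
A CPU-friendly variant of the training command might look like this (the flags mirror the GPU command above; a batch size of 4 is an illustrative choice, not a prescribed value):

!python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --model "uie-tiny" \
    --batch_size 4 \
    --device "cpu"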

To raise accuracy, set --num_epochs higher and train for longer.

Configurable parameters:

train_path: path to the training set file.

dev_path: path to the dev set file.

save_dir: directory where model checkpoints are saved; defaults to ./checkpoint.

learning_rate: learning rate; defaults to 1e-5.

batch_size: batch size; adjust it to the available GPU memory and lower it if you run out of memory; defaults to 16.

max_seq_len: maximum text length; inputs longer than this are split automatically; defaults to 512.

num_epochs: number of training epochs; defaults to 100.

model: model to fine-tune; uie-base and uie-tiny are available; defaults to uie-base.

seed: random seed; defaults to 1000.

logging_steps: number of steps between log messages; defaults to 10.

valid_steps: number of steps between evaluations; defaults to 100.

device: device to train on; cpu or gpu.

3. Model Evaluation

!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.txt \
    --batch_size 16 \
    --max_seq_len 512
[2022-07-11 13:41:14,918] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W0711 13:41:14.951239 30950 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0711 13:41:14.954762 30950 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022-07-11 13:41:23,831] [    INFO] - -----------------------------
[2022-07-11 13:41:23,831] [    INFO] - Class Name: all_classes
[2022-07-11 13:41:23,832] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.txt \
    --debug
[2022-07-11 13:41:28,303] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W0711 13:41:28.341869 30986 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0711 13:41:28.345369 30986 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022-07-11 13:41:35,024] [    INFO] - -----------------------------
[2022-07-11 13:41:35,024] [    INFO] - Class Name: 出发地
[2022-07-11 13:41:35,024] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-11 13:41:35,139] [    INFO] - -----------------------------
[2022-07-11 13:41:35,139] [    INFO] - Class Name: 目的地
[2022-07-11 13:41:35,139] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-11 13:41:35,246] [    INFO] - -----------------------------
[2022-07-11 13:41:35,246] [    INFO] - Class Name: 费用
[2022-07-11 13:41:35,246] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-11 13:41:35,313] [    INFO] - -----------------------------
[2022-07-11 13:41:35,313] [    INFO] - Class Name: 时间
[2022-07-11 13:41:35,313] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000

model_path: path to the model directory to evaluate; it must contain the weights file model_state.pdparams and the config file model_config.json.

test_path: test set file to evaluate on.

batch_size: batch size; adjust it to your machine; defaults to 16.

max_seq_len: maximum text length; inputs longer than this are split automatically; defaults to 512.

model: model to use; uie-base, uie-medium, uie-mini, uie-micro, and uie-nano are available; defaults to uie-base.

debug: whether to enable debug mode, which evaluates each positive class separately; intended only for model debugging and off by default (the span-level metric itself is sketched below).
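
For reference, the precision/recall/F1 numbers printed above are standard exact-match span metrics. A rough sketch of the computation (not the evaluate.py source), treating each span as a (start, end, label) triple:

def span_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                        # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1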


4. Prediction

import json
from paddlenlp import Taskflow

def openreadtxt(file_name):
    # Read every line of the input file into a list.
    data = []
    with open(file_name, 'r', encoding='UTF-8') as file:
        for row in file.readlines():
            data.append(row)
    return data

data_input = openreadtxt('./input/nlp.txt')

schema = ['出发地', '目的地', '时间']
few_ie = Taskflow('information_extraction', schema=schema, batch_size=1, task_path='./checkpoint/model_best')

results = few_ie(data_input)

# Mode "w+" creates the file if it does not exist and overwrites it otherwise.
with open("./output/test.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        # json.dumps escapes non-ASCII characters by default; ensure_ascii=False
        # keeps the Chinese text readable in the output file.
        line = json.dumps(result, ensure_ascii=False)
        f.write(line + "\n")

print("数据结果已导出")  # "results exported"
[2022-07-11 13:42:32,312] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.


数据结果已导出

Input file:

城市内交通费7月5日金额114广州至佛山
从百度大厦到龙泽苑东区打车费二十元
上海虹桥高铁到杭州时间是9月24日费用是73元
上周末坐动车从北京到上海花费五十块五毛
昨天北京飞上海话费一百元

Output:

{"出发地": [{"text": "广州", "start": 15, "end": 17, "probability": 0.9073772252165782}], "目的地": [{"text": "佛山", "start": 18, "end": 20, "probability": 0.9927365183877761}], "时间": [{"text": "7月5日", "start": 6, "end": 10, "probability": 0.9978010396512218}]}
{"出发地": [{"text": "百度大厦", "start": 1, "end": 5, "probability": 0.968825147409472}], "目的地": [{"text": "龙泽苑东区", "start": 6, "end": 11, "probability": 0.9877913072493669}]}
{"目的地": [{"text": "杭州", "start": 7, "end": 9, "probability": 0.9929172180094881}], "时间": [{"text": "9月24日", "start": 12, "end": 17, "probability": 0.9953342057701597}]}
{"出发地": [{"text": "北京", "start": 7, "end": 9, "probability": 0.973048366717471}], "目的地": [{"text": "上海", "start": 10, "end": 12, "probability": 0.988486130309397}], "时间": [{"text": "上周末", "start": 0, "end": 3, "probability": 0.9977407699595275}]}
{"出发地": [{"text": "北京", "start": 2, "end": 4, "probability": 0.974188953533556}], "目的地": [{"text": "上海", "start": 5, "end": 7, "probability": 0.9928200521486445}], "时间": [{"text": "昨天", "start": 0, "end": 2, "probability": 0.9731559534465504}]}

5. Visualization with VisualDL

For details, see:
https://aistudio.baidu.com/aistudio/projectdetail/1739945?contributionType=1
which explains the approach in depth; refer to its code for the full implementation.

The key step is to create and initialize a log writer (recorder) and log metrics to it during training.
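
A minimal sketch of that recorder with VisualDL's LogWriter (the logdir and tags here are illustrative; the scalar values are taken from the training log above):

from visualdl import LogWriter

with LogWriter(logdir="./log/uie_train") as writer:
    # Inside the training loop you would call these once per logging step.
    writer.add_scalar(tag="train/loss", step=10, value=0.00342)
    writer.add_scalar(tag="eval/F1", step=50, value=0.96774)

The curves can then be viewed with: visualdl --logdir ./log/uie_train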

The resulting training curves are shown in the original project (screenshots omitted here).

6. Tip: Loading Paddle Open-Source Datasets

**Dataset list:** https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_list.html#id2

| Dataset | Description | How to load |
| --- | --- | --- |
| CoLA | Single-sentence binary classification: is the sentence grammatical? | paddlenlp.datasets.load_dataset('glue', 'cola') |
| SST-2 | Single-sentence binary classification: sentiment polarity | paddlenlp.datasets.load_dataset('glue', 'sst-2') |
| MRPC | Sentence-pair binary classification: do the two sentences have the same meaning? | paddlenlp.datasets.load_dataset('glue', 'mrpc') |
| STSB | Sentence-pair similarity, scored 1-5 | paddlenlp.datasets.load_dataset('glue', 'sts-b') |
| QQP | Sentence-pair binary classification: are the two questions equivalent? | paddlenlp.datasets.load_dataset('glue', 'qqp') |
| MNLI | Sentence-pair three-way classification of a premise and a hypothesis: entailment, contradiction, or neutral | paddlenlp.datasets.load_dataset('glue', 'mnli') |
| QNLI | Binary classification: does the sentence entail (answer) the question? | paddlenlp.datasets.load_dataset('glue', 'qnli') |
| RTE | Sentence-pair binary classification: does sentence 1 entail sentence 2? | paddlenlp.datasets.load_dataset('glue', 'rte') |
| WNLI | Sentence-pair binary classification: are the two sentences related? | paddlenlp.datasets.load_dataset('glue', 'wnli') |
| LCQMC | A Large-scale Chinese Question Matching Corpus (semantic matching) | paddlenlp.datasets.load_dataset('lcqmc') |

These datasets can be loaded conveniently through the paddlenlp API. If you want to save a copy locally, the snippet below shows how to dump a split to disk.

# Load the ChnSentiCorp Chinese sentiment analysis corpus.
import json

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

# "w+" creates the file if needed and overwrites any existing content;
# ensure_ascii=False keeps the Chinese text readable in the output file.
with open("./output/test2.txt", "w+", encoding='UTF-8') as f:
    for result in test_ds:
        line = json.dumps(result, ensure_ascii=False)
        f.write(line + "\n")

7. Summary

UIE (Universal Information Extraction): Yaojie Lu et al. proposed the unified framework UIE for universal information extraction at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks in a unified way, giving them good transfer and generalization across tasks. Drawing on that paper, PaddleNLP trained and open-sourced the first Chinese universal information extraction model, UIE, on top of the knowledge-enhanced pre-trained model ERNIE 3.0. The model supports key information extraction with no restriction on industry domain or extraction target, enables zero-shot rapid cold start, and offers strong few-shot fine-tuning to quickly adapt to specific extraction targets.

Advantages of UIE

Easy to use: users can define extraction targets in natural language and extract the corresponding information from input text without any training, giving out-of-the-box coverage of a wide range of information extraction needs.

Lower cost, higher efficiency: traditional information extraction needs large amounts of labeled data to perform well. Open-domain information extraction supports zero-shot and few-shot extraction, which greatly reduces the dependence on labeled data, cutting costs while improving results.

Leading performance: open-domain information extraction performs strongly across many scenarios and tasks.

This article walks through the entity extraction use case, refining the open-source PaddleNLP demo with result visualization and richer input/output handling to make the demo project more complete.

Annotation remains the main pain point; see my blog for ways to tackle it.

My blog: https://blog.csdn.net/sinat_39620217?type=blog

This article is a repost; the original is at https://aistudio.baidu.com/aistudio/projectdetail/4321385
