pycorrector: an open-source text correction tool

GitHub project home page: https://github.com/shibing624/pycorrector

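pycorrector mainly corrects phonetically similar and visually similar character errors. It can be used to fix text produced by input methods, OCR, and ASR, and it supports correction with a KenLM language model as well as with deep models, including Seq2Seq, BERT, MacBERT, ELECTRA, and ERNIE.

At the time of writing, the pycorrector project has 2165 stars, 565 forks, and 70 watchers on GitHub, with 10 code contributors and 14 projects referencing it.

This post looks at how to use the ERNIE model for Chinese text correction.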

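Problem

Common error types in Chinese text correction include:

  • Homophone errors, e.g. 配副眼睛 → 配副眼镜
  • Confusable-sound errors, e.g. 流浪织女 → 牛郎织女
  • Reversed word order, e.g. 伍迪艾伦 → 艾伦伍迪
  • Missing words to complete, e.g. 爱有天意 → 假如爱有天意
  • Visually similar characters, e.g. 高梁 → 高粱
  • Full pinyin, e.g. xingfu → 幸福
  • Abbreviated pinyin, e.g. sz → 深圳
  • Grammatical errors, e.g. 想象难以 → 难以想象

Not every business scenario has all of these problems. For example, an input method needs to handle the first four types, a search engine needs to handle all of them, and correction of speech recognition output only needs to handle the first two. Visually similar character errors mainly arise with Wubi or stroke-based handwriting input. This project focuses on errors caused by homophones, confusable sounds, visually similar characters, full pinyin, and grammar.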

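Approach

Rule-based approach

Chinese text correction takes two steps: error detection, then error correction. Word segmentation features are used to build a set of suspected errors and their candidates, and the candidates are then ranked by language-model perplexity (PPL) to pick the top-1 result as the correction. See the project's GitHub page for the detailed method; it is not covered here.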
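To make the ranking step concrete, here is a minimal sketch of perplexity-based candidate ranking. It is not pycorrector's implementation: a tiny add-one smoothed character bigram model stands in for KenLM, and the corpus, function names, and candidates are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for a real language model such as KenLM.
CORPUS = ["配副眼镜", "眼镜很贵", "我的眼睛很大", "眼睛疼"]


def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under an add-one smoothed character bigram model."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def rank_candidates(candidates, unigrams, bigrams):
    """Sort candidate sentences from lowest (best) to highest perplexity."""
    return sorted(candidates, key=lambda s: perplexity(s, unigrams, bigrams))


unigrams, bigrams = train_bigram(CORPUS)
ranked = rank_candidates(["配副眼睛", "配副眼镜"], unigrams, bigrams)
print(ranked[0])
```

The candidate with the lowest perplexity is taken as the top-1 correction. A real system would swap the toy model for a language model trained on a large corpus.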

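Deep-model approach

This post mainly introduces the ERNIE pre-trained model implemented in PaddlePaddle. Because ERNIE is pre-trained with [MASK] tokens, its Fill-Mask task can be adapted for text correction.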

Installation

  • Automatic install: pip install pycorrector
  • Semi-automatic install:
git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
python setup.py install

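Either of the two methods above works. If you would rather not install the package, download the source from GitHub and install the dependencies below before using it.

Prerequisites

  • Install kenlm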
pip install https://github.com/kpu/kenlm/archive/master.zip

Wiki: installing kenlm

  • Install the other packages
pip install -r requirements.txt

!pip install pycorrector
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: pycorrector in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.3.1)
Requirement already satisfied: jieba==0.39 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (0.39)
Requirement already satisfied: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (1.15.0)
Requirement already satisfied: pypinyin in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (0.41.0)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (1.16.4)
import pycorrector
pycorrector.en_correct("helio")

'hello'

Usage

ERNIE model reference: ERNIE GitHub

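ERNIE is a knowledge-enhanced, continual-learning semantic understanding framework pioneered by Baidu. It combines large-scale pre-training with rich knowledge from multiple sources and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive text data so that the model keeps improving. ERNIE is reported to significantly outperform the leading approaches on 16 public datasets covering sentiment analysis, text matching, natural language inference, lexical analysis, reading comprehension, and question answering, and to be the first to score above 90 on the GLUE general language understanding benchmark, ranking first worldwide.

  • Text correction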
import sys

sys.path.append("..")
from pycorrector.ernie.ernie_corrector import ErnieCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]

    m = ErnieCorrector()
    for line in error_sentences:
        correct_sent, err = m.ernie_correct(line)
        print("original sentence:{} => {}, err:{}".format(line, correct_sent, err))

output:

original sentence:真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好的跳舞, err:[['无', '舞', 14, 15]]
original sentence:少先队员因该为老人让坐 => 少先队员应该为老人让座, err:[['因', '应', 4, 5], ['坐', '座', 10, 11]]
original sentence:机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分支, err:[['七', '器', 1, 2], ['遇', '域', 10, 11], ['知', '支', 21, 22]]
original sentence:一只小鱼船浮在平净的河面上 => 一只小渔船浮在平静的河面上, err:[['鱼', '渔', 3, 4], ['净', '静', 8, 9]]
original sentence:我的家乡是有明的渔米之乡 => 我的家乡是有名的渔米之乡, err:[['明', '名', 6, 7]]
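Each err entry has the form [error_word, correct_word, begin_pos, end_pos]. As a small sketch of how that list can be consumed, the hypothetical helper below (apply_details is not part of pycorrector) rebuilds the corrected sentence from the original text and the err list.

```python
def apply_details(text, details):
    """Rebuild the corrected sentence from [wrong, right, begin, end] entries."""
    chars = list(text)
    # Apply from right to left so earlier offsets stay valid even if lengths differ.
    for wrong, right, begin, end in sorted(details, key=lambda d: d[2], reverse=True):
        # Sanity check: the reported span must contain the reported wrong token.
        if text[begin:end] != wrong:
            raise ValueError("span {}-{} does not match {!r}".format(begin, end, wrong))
        chars[begin:end] = list(right)
    return "".join(chars)


original = "少先队员因该为老人让坐"
err = [["因", "应", 4, 5], ["坐", "座", 10, 11]]
print(apply_details(original, err))
```

The positions are character offsets into the original sentence, with end_pos exclusive, which is why text[begin:end] recovers the wrong token.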

How it works

The Fill-Mask task

Here is how to implement the cloze (Fill-Mask) task on top of ERNIE:


import numpy as np
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L

from pycorrector.ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
from pycorrector.ernie.tokenizing_ernie import ErnieTokenizer

model_dir = 'ernie-1.0'

D.guard().__enter__()  # activate paddle `dygraph` mode
tokenizer = ErnieTokenizer.from_pretrained(model_dir)
rev_dict = {v: k for k, v in tokenizer.vocab.items()}
rev_dict[tokenizer.pad_id] = ''  # replace [PAD]
rev_dict[tokenizer.sep_id] = ''  # replace [SEP]
rev_dict[tokenizer.unk_id] = ''  # replace [UNK]


class ErnieCloze(ErnieModelForPretraining):
    def __init__(self, *args, **kwargs):
        super(ErnieCloze, self).__init__(*args, **kwargs)
        del self.pooler_heads

    def forward(self, src_ids, *args, **kwargs):
        pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
        # mask_id is a module-level variable; assign it before the first forward call
        encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
        encoded_2d = self.mlm(encoded_2d)
        encoded_2d = self.mlm_ln(encoded_2d)
        logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
        return logits_2d


@np.vectorize
def rev_lookup(i):
    return rev_dict[i]

[   INFO 20210506 11:38:50 tokenizing_ernie:  86] get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[  DEBUG 20210506 11:38:50 file_utils:  56] https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz cached in /home/aistudio/.paddle-ernie-cache
[  DEBUG 20210506 11:38:50 tokenizing_ernie: 131] regex: (\[UNK\])|(\[SEP\])|(\[PAD\])|(\[CLS\])|(\[MASK\])|([a-zA-Z0-9]+|\S)
ernie = ErnieCloze.from_pretrained(model_dir)
ernie.eval()

ids, _ = tokenizer.encode('戊[MASK]变法,又称百日维新,是 [MASK] [MASK] [MASK] 、梁启超等维新派人士通过光绪帝进行 的一场资产阶级改良。')
mask_id = tokenizer.mask_id
print(ids)
ids = np.expand_dims(ids, 0)
ids = D.to_variable(ids)
logits = ernie(ids).numpy()
output_ids = np.argmax(logits, -1)
seg_txt = rev_lookup(output_ids)

print(seg_txt)
[   INFO 20210506 11:40:04 modeling_ernie: 198] get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[  DEBUG 20210506 11:40:04 file_utils:  56] https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz cached in /home/aistudio/.paddle-ernie-cache
[  DEBUG 20210506 11:40:04 modeling_ernie: 239] init ErnieModel with config: {'attention_probs_dropout_prob': 0.1, 'hidden_act': 'relu', 'hidden_dropout_prob': 0.1, 'hidden_size': 768, 'initializer_range': 0.02, 'max_position_embeddings': 513, 'num_attention_heads': 12, 'num_hidden_layers': 12, 'type_vocab_size': 2, 'vocab_size': 18000}
[   INFO 20210506 11:40:09 modeling_ernie: 216] loading pretrained model from /home/aistudio/.paddle-ernie-cache/466eabcffd6d6a83ae9cb97dd1a167bd


[    1  3299     3   282    72     4   311   351   502   139   534   102
     4    10     3     3     3     6  1164  1087   634    43   534   102
   809     8   391   124    93   325  1659   757    71    40     5     7
   191   138    66   976   222   285   698 12043     2]
['戌' '康' '有' '为']
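The forward pass above boils down to three steps: gather the hidden states at the [MASK] positions, project them onto the vocabulary by multiplying with the transposed word embedding matrix, and take the argmax. The NumPy sketch below illustrates just those steps with random toy tensors. All shapes and values are hypothetical, and the mlm dense layer, layer norm, and bias from the real model are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_size, seq_len = 10, 4, 6
mask_id = 3  # the [MASK] id printed by the tokenizer in this post

# Hypothetical stand-ins for the word embedding table and the encoder output.
word_emb = rng.normal(size=(vocab_size, hidden_size))
src_ids = np.array([[1, 7, 3, 5, 3, 2]])  # one sentence containing two masks
encoded = rng.normal(size=(1, seq_len, hidden_size))

# 1. Keep only the hidden states at [MASK] positions (the role of where + gather_nd).
mask_positions = np.argwhere(src_ids == mask_id)  # shape (num_masks, 2)
encoded_2d = encoded[mask_positions[:, 0], mask_positions[:, 1]]

# 2. Project onto the vocabulary by reusing the embedding matrix (weight tying).
logits_2d = encoded_2d @ word_emb.T  # shape (num_masks, vocab_size)

# 3. Pick the highest-scoring vocabulary id for every mask.
output_ids = np.argmax(logits_2d, axis=-1)
print(mask_positions.tolist(), logits_2d.shape, output_ids.shape)
```

Because only the masked rows are gathered, the output has one row of logits per [MASK], which is why the example above prints exactly four characters for four masks.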

def predict_mask(sentence_with_mask):
    """
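    Predict the top 5 tokens for every [MASK] in the sentence.
    :param sentence_with_mask: sentence containing one or more [MASK] tokens
    :return: list of 5 rows; row r holds the rank-r token for each mask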
    """
    ids, id_types = tokenizer.encode(sentence_with_mask)
    mask_id = tokenizer.mask_id
    print(ids, id_types, mask_id)
    ids = np.expand_dims(ids, 0)
    ids = D.to_variable(ids)
    logits = ernie(ids).numpy()
    output_ids = np.argsort(logits, -1)
    j_ret = []
    for i in output_ids:
        i_ret = []
        for j in i[::-1][:5]:
            i_ret.append(rev_dict[j])
        j_ret.append(i_ret)
    out = []
    for i in range(len(j_ret)):
        temp = []
        for j in range(len(j_ret[i])):
            temp.append(j_ret[i][j])
        out.append(temp)
    out = np.array(out)
    out = np.transpose(out).tolist()
    return out
i = predict_mask('机器学习是人工[MASK]能领遇最能体现智能的一个分[MASK]。')
print(i)
[    1    98   361    18   489    10     8    35     3    52   350   825
   134    52    82    87   826    52     5     7    27    59     3 12043
     2] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 3
[['智', '支'], ['技', '域'], ['赋', '类'], ['学', '子'], ['能', '科']]
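The shape of that result can be confusing at first: each inner list is one rank, not one mask. The sketch below reproduces the argsort, reverse, and transpose steps on a tiny hand-made logits matrix. The vocabulary and scores are invented for illustration and do not come from the model.

```python
import numpy as np

# Hypothetical toy vocabulary and logits for two [MASK] positions.
vocab = ["智", "技", "能", "支", "域", "类"]
logits = np.array([
    [9.0, 7.0, 5.0, 1.0, 2.0, 3.0],  # scores for the first mask
    [1.0, 2.0, 3.0, 9.0, 7.0, 5.0],  # scores for the second mask
])
top_k = 3

# argsort is ascending, so reverse each row and keep the first top_k ids.
top_ids = np.argsort(logits, axis=-1)[:, ::-1][:, :top_k]
per_mask = [[vocab[i] for i in row] for row in top_ids]

# Transpose so that row r holds the rank-r prediction for every mask.
per_rank = np.transpose(np.array(per_mask)).tolist()
print(per_mask)
print(per_rank)
```

Reading the first row of the transposed result gives the model's best guess for every mask at once, which is convenient when a masked word spans several characters.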

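Correction

The correction logic is roughly as follows:

  1. Normalize the text encoding
  2. Split long text into shorter sentences
  3. Split the Chinese text into characters
  4. Replace each Chinese character in turn with [MASK], then use Fill-Mask to predict the top 5 tokens for that position
  5. Generate the phonetically similar and visually similar candidates for that character
  6. If a candidate appears in the top 5, replace the character with it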
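The steps above can be sketched without loading any model. In the sketch below, fake_fill_mask is a hard-coded stand-in for the ERNIE Fill-Mask predictor and CONFUSIONS is a made-up confusion set. Both are assumptions for illustration only, and the encoding normalization, sentence splitting, and Chinese-character check are omitted.

```python
MASK = "[MASK]"

# Hypothetical confusion set: characters that sound or look like the key.
CONFUSIONS = {"因": {"应", "音", "困"}, "坐": {"座", "做"}}


def fake_fill_mask(masked_chars, idx):
    """Stand-in for the model: return top predictions for the masked position."""
    expected = "少先队员应该为老人让座"
    if len(masked_chars) == len(expected):
        return [expected[idx], "的", "了"]
    return []


def correct(text, fill_mask=fake_fill_mask, confusions=CONFUSIONS):
    chars = list(text)
    corrected = chars[:]
    details = []
    for idx, ch in enumerate(chars):
        # Step 4: mask one character at a time and ask for the top predictions.
        masked = chars[:idx] + [MASK] + chars[idx + 1:]
        top_tokens = fill_mask(masked, idx)
        if not top_tokens or ch in top_tokens:
            continue  # the predictions already contain the original character
        # Step 5: similar-sounding or similar-looking candidates for this character.
        candidates = confusions.get(ch, set())
        # Step 6: accept the best-ranked prediction that is also a candidate.
        for token in top_tokens:
            if token in candidates:
                details.append([ch, token, idx, idx + 1])
                corrected[idx] = token
                break
    return "".join(corrected), details


print(correct("少先队员因该为老人让坐"))
```

Requiring a prediction to also appear in the confusion set is what keeps the language model from rewriting characters that are merely unlikely rather than wrong.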

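GitHub implementation: see the pycorrector.ernie.ernie_corrector module used below.

AI Studio does not provide sudo access. If kenlm cannot be installed there, install and test it on a local Linux machine.

  • Install kenlm
pip install https://github.com/kpu/kenlm/archive/master.zip

Wiki: installing kenlm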

!pip install kenlm
from pycorrector.ernie.ernie_corrector import ErnieCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]

    m = ErnieCorrector()
    for line in error_sentences:
        correct_sent, err = m.ernie_correct(line)
        print("original sentence:{} => {}, err:{}".format(line, correct_sent, err))

Implementation logic, for demonstration:
# -*- coding: utf-8 -*-
import operator
import os
import sys
import time

import numpy as np
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L

sys.path.append('../..')
from pycorrector.utils.text_utils import is_chinese_string, convert_to_unicode
from pycorrector.utils.logger import logger
from pycorrector.corrector import Corrector
from pycorrector.ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
from pycorrector.ernie.tokenizing_ernie import ErnieTokenizer
from pycorrector.utils.tokenizer import segment

pwd_path = os.path.abspath(os.path.dirname(__file__))

D.guard().__enter__()  # activate paddle `dygraph` mode


class ErnieCloze(ErnieModelForPretraining):
    def __init__(self, *args, **kwargs):
        super(ErnieCloze, self).__init__(*args, **kwargs)
        del self.pooler_heads

    def forward(self, src_ids, *args, **kwargs):
        pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
        # paddle ernie mask_id = 3, mask_token = [MASK]
        mask_id = 3
        encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
        encoded_2d = self.mlm(encoded_2d)
        encoded_2d = self.mlm_ln(encoded_2d)
        logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
        return logits_2d


class ErnieCorrector(Corrector):
    def __init__(self, model_dir='ernie-1.0', topN=5):
        super(ErnieCorrector, self).__init__()
        self.name = 'ernie_corrector'
        t1 = time.time()
        self.ernie_tokenizer = ErnieTokenizer.from_pretrained(model_dir)
        self.rev_dict = {v: k for k, v in self.ernie_tokenizer.vocab.items()}
        self.rev_dict[self.ernie_tokenizer.pad_id] = ''  # replace [PAD]
        self.rev_dict[self.ernie_tokenizer.sep_id] = ''  # replace [SEP]
        self.rev_dict[self.ernie_tokenizer.unk_id] = ''  # replace [UNK]
        self.cloze = ErnieCloze.from_pretrained(model_dir)
        self.cloze.eval()
        logger.debug('Loaded ernie model: %s, spend: %.3f s.' % (model_dir, time.time() - t1))
        self.mask_id = self.ernie_tokenizer.mask_id  # 3
        self.mask_token = self.rev_dict[self.mask_id]  # "[MASK]"
        logger.debug('ernie mask_id :{}, mask_token: {}'.format(self.mask_id, self.mask_token))
        self.topN = topN

    def predict_mask(self, sentence_with_mask):
        ids, id_types = self.ernie_tokenizer.encode(sentence_with_mask)
        ids = np.expand_dims(ids, 0)
        ids = D.to_variable(ids)
        logits = self.cloze(ids).numpy()
        output_ids = np.argsort(logits, -1)
        masks_ret = []
        # argsort is ascending: reverse it and keep the topN most likely tokens
        for masks in output_ids:
            temp = []
            for mask in masks[::-1][:self.topN]:
                temp.append(self.rev_dict[mask])
            masks_ret.append(temp)
        # handle the case of multiple masks
        out = []
        for i in range(len(masks_ret)):
            temp = []
            for j in range(len(masks_ret[i])):
                temp.append(masks_ret[i][j])
            out.append(temp)
        # transpose out data
        # [['智', '学', '能', '技', '互'], ['能', '慧', '习', '智', '商']] => 智能 学慧 能习 技智 互商
        out = np.transpose(np.array(out))
        out = [''.join(i) for i in out.tolist()]
        ret = []
        for i in out:
            ret.append({'token': i})
        return ret

    def ernie_correct(self, text, ernie_cut_type='char'):
        """
        Correct a sentence.
        :param text: sentence text
        :param ernie_cut_type: segmentation type (char/word)
        :return: corrected_text, list[list], [error_word, correct_word, begin_pos, end_pos]
        """
        text_new = ''
        details = []
        self.check_corrector_initialized()
        # normalize the encoding, utf-8 to unicode
        text = convert_to_unicode(text)
        # split long text into shorter blocks
        blocks = self.split_text_by_maxlen(text, maxlen=512)
        for blk, start_idx in blocks:
            blk_new = ''
            blk = segment(blk, cut_type=ernie_cut_type, pos=False)
            for idx, s in enumerate(blk):
                # handle Chinese errors
                if is_chinese_string(s):
                    # copy the token list so that blk itself is not modified
                    sentence_lst = blk[:idx] + blk[idx:]
                    sentence_lst[idx] = self.mask_token * len(s)
                    sentence_new = ' '.join(sentence_lst)
                    # predict; top 5 by default
                    predicts = self.predict_mask(sentence_new)
                    top_tokens = []
                    for p in predicts:
                        top_tokens.append(p.get('token', ''))

                    if top_tokens and (s not in top_tokens):
                        # collect all candidate corrections for this token
                        candidates = self.generate_items(s)
                        if candidates:
                            for token_str in top_tokens:
                                if token_str in candidates:
                                    details.append([s, token_str, start_idx + idx, start_idx + idx + 1])
                                    s = token_str
                                    break
                blk_new += s
            text_new += blk_new
        details = sorted(details, key=operator.itemgetter(2))
        return text_new, details


if __name__ == "__main__":
    d = ErnieCorrector()
    error_sentences = [
        '我对于宠物出租得事非常认同,因为其实很多人喜欢宠物',
        '疝気医院那好 ,疝気专科百科问答',
        '少先队员因该为老人让坐',
        '少 先  队 员 因 该 为 老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '今天心情很好',
    ]
    for sent in error_sentences:
        corrected_sent, err = d.ernie_correct(sent)
        print("original sentence:{} => {}, err:{}".format(sent, corrected_sent, err))

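Summary

The ERNIE model's correction performance is evaluated as follows:

Sighan15:
Accuracy: 29.70%
Recall: 28.13%

Corpus500:
Accuracy: 59.80%
Recall: 41.33%

ERNIE clearly has strong text representation ability. Relying only on the pre-trained MLM model, its correction performance is only slightly behind the expert rule-based method that has been tuned for a long time, and a closer look at the results shows that some of its corrections are quite reliable.

Choosing a good model and the right direction really does matter. All I can hope for here is to make the rule-based method as extensible as possible, and to survey the deep models as broadly and as deeply as possible.

Of course, text correction is not only about low-level spelling errors. Deeper problems include semantic and grammatical correction. This post is only meant as a starting point, and I hope more people will pay attention to this area, study it, and contribute.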

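Discussion and feedback

There is a WeChat discussion group. Anyone interested is welcome to join and discuss NLP text correction techniques. If issues are not answered promptly, you can also ask in the group.

If the WeChat group is full, add me on WeChat (ID: xuming624) with the note "your name-NLP纠错" and I will add you to the group.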
