Chinese Text Correction with ERNIE
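pycorrector: an open-source text correction tool
GitHub project home page: https://github.com/shibing624/pycorrector
pycorrector mainly corrects phonetically similar and visually similar character errors. It can be used to correct text produced by input methods, OCR, and ASR, and it supports both Kenlm language-model correction and deep-model correction, including Seq2Seq, BERT, MacBERT, ELECTRA, and ERNIE.
At the time of writing, the pycorrector project has 2165 stars, 565 forks, and 70 watchers on GitHub, with 10 code contributors and 14 projects that reference it.
This post looks at how to use the ERNIE model for Chinese text correction.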
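Problem
Common error types in Chinese text correction include:
- Homophone errors, e.g. 配副眼睛 -> 配副眼镜
- Confusable-sound errors, e.g. 流浪织女 -> 牛郎织女
- Transposed word order, e.g. 伍迪艾伦 -> 艾伦伍迪
- Missing words, e.g. 爱有天意 -> 假如爱有天意
- Visually similar characters, e.g. 高梁 -> 高粱
- Full pinyin, e.g. xingfu -> 幸福
- Abbreviated pinyin, e.g. sz -> 深圳
- Grammatical errors, e.g. 想象难以 -> 难以想象
Not every business scenario has all of these problems. An input method needs to handle the first four types, a search engine needs to handle all of them, and correction of speech-recognition output only needs to handle the first two. Visually similar character errors mainly arise with Wubi or stroke-based handwriting input. This project focuses on the correction tasks caused by homophones, confusable sounds, visually similar characters, full pinyin, and grammatical errors.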
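Approach
Rule-based approach
Chinese text correction proceeds in two steps: error detection, then error correction. Word-segmentation features are used to build a candidate set for each suspected error, and the candidates are then ranked by language-model perplexity (PPL); the top-1 candidate is taken as the correction. See the project's GitHub home page for the detailed method, which is not covered here.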
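As a rough illustration of the ranking step, the sketch below scores candidate sentences with a toy character-bigram language model and keeps the one with the lowest perplexity. The four-sentence corpus, the add-one smoothing, and the names train_bigram, perplexity, and rank_by_ppl are assumptions made for this example only; the rule-based method described here relies on a Kenlm language model rather than anything this small.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ['<s>'] + list(sent) + ['</s>']
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def perplexity(sent, unigrams, bigrams):
    """Perplexity of `sent` under an add-one smoothed bigram model."""
    chars = ['<s>'] + list(sent) + ['</s>']
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def rank_by_ppl(candidates, unigrams, bigrams):
    """Sort candidates from most to least fluent (lowest perplexity first)."""
    return sorted(candidates, key=lambda s: perplexity(s, unigrams, bigrams))


# Toy training corpus (hypothetical); a real model needs far more text.
corpus = ['配副眼镜', '他戴着眼镜', '眼镜店配眼镜', '她的眼睛很大']
unigrams, bigrams = train_bigram(corpus)

candidates = ['配副眼睛', '配副眼镜']
best = rank_by_ppl(candidates, unigrams, bigrams)[0]
print(best)
```

With this toy corpus the bigram 眼镜 is seen more often than 眼睛, so the candidate containing 眼镜 gets the lower perplexity and is ranked first.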
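Deep-model approach
This post focuses on the ERNIE pre-trained model implemented in PaddlePaddle. Because ERNIE is pre-trained with [MASK] tokens, its Fill-Mask task can be adapted for text correction.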
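Installation
- Fully automatic: pip install pycorrector
- Semi-automatic:
git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
python setup.py install
Either method works. If you prefer not to install the package, download the source from GitHub and install the dependencies below before using it.
Prerequisites
- Install kenlm
pip install https://github.com/kpu/kenlm/archive/master.zip
- Install the other packages
pip install -r requirements.txt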
!pip install pycorrector
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: pycorrector in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.3.1)
Requirement already satisfied: jieba==0.39 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (0.39)
Requirement already satisfied: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (1.15.0)
Requirement already satisfied: pypinyin in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (0.41.0)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pycorrector) (1.16.4)
import pycorrector
pycorrector.en_correct("helio")
'hello'
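Usage
ERNIE model reference: ERNIE GitHub
ERNIE is a knowledge-enhanced, continual-learning framework for semantic understanding proposed by Baidu. It combines large-scale pre-training with rich knowledge from multiple sources and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive text corpora so that model quality keeps improving. ERNIE is reported to have significantly outperformed the previous state of the art on 16 public datasets covering sentiment analysis, text matching, natural language inference, lexical analysis, reading comprehension, and question answering, and to have been the first model to score above 90 on the GLUE general language understanding benchmark, ranking first.
- Text correction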
import sys

sys.path.append("..")
from pycorrector.ernie.ernie_corrector import ErnieCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]
    m = ErnieCorrector()
    for line in error_sentences:
        correct_sent, err = m.ernie_correct(line)
        print("original sentence:{} => {}, err:{}".format(line, correct_sent, err))
output:
original sentence:真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好的跳舞, err:[['无', '舞', 14, 15]]
original sentence:少先队员因该为老人让坐 => 少先队员应该为老人让座, err:[['因', '应', 4, 5], ['坐', '座', 10, 11]]
original sentence:机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分支, err:[['七', '器', 1, 2], ['遇', '域', 10, 11], ['知', '支', 21, 22]]
original sentence:一只小鱼船浮在平净的河面上 => 一只小渔船浮在平静的河面上, err:[['鱼', '渔', 3, 4], ['净', '静', 8, 9]]
original sentence:我的家乡是有明的渔米之乡 => 我的家乡是有名的渔米之乡, err:[['明', '名', 6, 7]]
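Each entry in err has the form [wrong, right, begin, end], where begin and end are character offsets into the original sentence and end is exclusive, as the printed output above shows. The helper below is a hypothetical utility, not part of pycorrector, that replays such a list on the original text; it can be handy for checking that the reported offsets line up with the corrected sentence.

```python
def apply_corrections(text, details):
    """Replay [wrong, right, begin, end] edits on `text` (end is exclusive)."""
    chars = list(text)
    # Apply from right to left so earlier offsets stay valid even if an
    # edit were to change the length of the text.
    for wrong, right, begin, end in sorted(details, key=lambda d: d[2], reverse=True):
        if ''.join(chars[begin:end]) != wrong:
            raise ValueError('span %d:%d does not contain %r' % (begin, end, wrong))
        chars[begin:end] = list(right)
    return ''.join(chars)


original = '少先队员因该为老人让坐'
err = [['因', '应', 4, 5], ['坐', '座', 10, 11]]
print(apply_corrections(original, err))  # 少先队员应该为老人让座
```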
How it works
The Fill-Mask task
Here is how the cloze (Fill-Mask) task can be implemented with ERNIE:
import numpy as np
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L
from pycorrector.ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
from pycorrector.ernie.tokenizing_ernie import ErnieTokenizer

model_dir = 'ernie-1.0'
D.guard().__enter__()  # activate paddle `dygraph` mode
tokenizer = ErnieTokenizer.from_pretrained(model_dir)
# Reverse vocabulary: token id -> token string
rev_dict = {v: k for k, v in tokenizer.vocab.items()}
rev_dict[tokenizer.pad_id] = ''  # replace [PAD]
rev_dict[tokenizer.sep_id] = ''  # replace [SEP]
rev_dict[tokenizer.unk_id] = ''  # replace [UNK]


class ErnieCloze(ErnieModelForPretraining):
    def __init__(self, *args, **kwargs):
        super(ErnieCloze, self).__init__(*args, **kwargs)
        del self.pooler_heads  # drop the pooled-output heads, which cloze prediction does not use

    def forward(self, src_ids, *args, **kwargs):
        pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
        # Keep only the hidden states at [MASK] positions.
        # `mask_id` is a module-level variable that is set before the model is called.
        encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
        encoded_2d = self.mlm(encoded_2d)
        encoded_2d = self.mlm_ln(encoded_2d)
        # Project back onto the vocabulary using the word-embedding weights.
        logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
        return logits_2d


@np.vectorize
def rev_lookup(i):
    return rev_dict[i]
[ INFO 20210506 11:38:50 tokenizing_ernie: 86] get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[ DEBUG 20210506 11:38:50 file_utils: 56] https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz cached in /home/aistudio/.paddle-ernie-cache
[ DEBUG 20210506 11:38:50 tokenizing_ernie: 131] regex: (\[UNK\])|(\[SEP\])|(\[PAD\])|(\[CLS\])|(\[MASK\])|([a-zA-Z0-9]+|\S)
ernie = ErnieCloze.from_pretrained(model_dir)
ernie.eval()
ids, _ = tokenizer.encode('戊[MASK]变法,又称百日维新,是 [MASK] [MASK] [MASK] 、梁启超等维新派人士通过光绪帝进行 的一场资产阶级改良。')
mask_id = tokenizer.mask_id
print(ids)
ids = np.expand_dims(ids, 0)
ids = D.to_variable(ids)
logits = ernie(ids).numpy()
output_ids = np.argmax(logits, -1)
seg_txt = rev_lookup(output_ids)
print(seg_txt)
[ INFO 20210506 11:40:04 modeling_ernie: 198] get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[ DEBUG 20210506 11:40:04 file_utils: 56] https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz cached in /home/aistudio/.paddle-ernie-cache
[ DEBUG 20210506 11:40:04 modeling_ernie: 239] init ErnieModel with config: {'attention_probs_dropout_prob': 0.1, 'hidden_act': 'relu', 'hidden_dropout_prob': 0.1, 'hidden_size': 768, 'initializer_range': 0.02, 'max_position_embeddings': 513, 'num_attention_heads': 12, 'num_hidden_layers': 12, 'type_vocab_size': 2, 'vocab_size': 18000}
[ INFO 20210506 11:40:09 modeling_ernie: 216] loading pretrained model from /home/aistudio/.paddle-ernie-cache/466eabcffd6d6a83ae9cb97dd1a167bd
[ 1 3299 3 282 72 4 311 351 502 139 534 102
4 10 3 3 3 6 1164 1087 634 43 534 102
809 8 391 124 93 325 1659 757 71 40 5 7
191 138 66 976 222 285 698 12043 2]
['戌' '康' '有' '为']
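ErnieCloze.forward keeps only the hidden states at [MASK] positions and projects them back onto the vocabulary by reusing the word-embedding matrix. The NumPy sketch below mimics that data flow with random numbers. The shapes, the toy vocabulary size, and the omission of the mlm and mlm_ln transforms are simplifications assumed for this illustration; it is not the Paddle implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, seq_len = 10, 4, 6
mask_id = 3

src_ids = np.array([[1, 5, 3, 7, 3, 2]])           # one sentence containing two [MASK] ids
encoded = rng.normal(size=(1, seq_len, hidden))    # stand-in for the encoder output
word_emb = rng.normal(size=(vocab_size, hidden))   # stand-in for the embedding table
mlm_bias = np.zeros(vocab_size)

# Select the hidden states where the input id equals mask_id -> (num_masks, hidden)
mask_positions = np.argwhere(src_ids == mask_id)
encoded_2d = encoded[mask_positions[:, 0], mask_positions[:, 1]]

# Project onto the vocabulary with the transposed embedding matrix -> (num_masks, vocab_size)
logits_2d = encoded_2d @ word_emb.T + mlm_bias
print(logits_2d.shape)  # (2, 10)
```

Each row of logits_2d scores every vocabulary entry for one masked position, which is why taking argmax over the last axis yields one predicted token per [MASK].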
def predict_mask(sentence_with_mask):
    """
    Predict every [MASK] in the sentence and return the top-5 tokens for each.
    :param sentence_with_mask: sentence containing one or more [MASK] tokens
    :return: list of 5 lists; the k-th list holds the k-th best token for each mask
    """
    ids, id_types = tokenizer.encode(sentence_with_mask)
    mask_id = tokenizer.mask_id
    print(ids, id_types, mask_id)
    ids = np.expand_dims(ids, 0)
    ids = D.to_variable(ids)
    logits = ernie(ids).numpy()
    output_ids = np.argsort(logits, -1)
    j_ret = []
    for i in output_ids:
        i_ret = []
        # argsort is ascending, so reverse to get the most likely tokens first
        for j in i[::-1][:5]:
            i_ret.append(rev_dict[j])
        j_ret.append(i_ret)
    # Transpose from one row per mask to one row per rank
    out = np.transpose(np.array(j_ret)).tolist()
    return out


i = predict_mask('机器学习是人工[MASK]能领遇最能体现智能的一个分[MASK]。')
print(i)
[ 1 98 361 18 489 10 8 35 3 52 350 825
134 52 82 87 826 52 5 7 27 59 3 12043
2] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 3
[['智', '支'], ['技', '域'], ['赋', '类'], ['学', '子'], ['能', '科']]
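The printed result is organised by rank rather than by mask: the first inner list holds the top-1 prediction for every mask, the second the top-2, and so on. The sketch below reproduces that argsort-and-transpose step on made-up logits; the five-token vocabulary and the scores are invented for the example.

```python
import numpy as np

vocab = ['智', '技', '能', '支', '域']
# One row of scores per [MASK]; higher means more likely (made-up numbers).
logits = np.array([
    [0.9, 0.5, 0.1, 0.2, 0.3],   # first mask
    [0.1, 0.2, 0.3, 0.8, 0.6],   # second mask
])

top_k = 2
# argsort is ascending, so reverse each row and keep the first top_k columns.
top_ids = np.argsort(logits, axis=-1)[:, ::-1][:, :top_k]
per_mask = [[vocab[i] for i in row] for row in top_ids]   # one list per mask
per_rank = [list(col) for col in zip(*per_mask)]          # one list per rank
print(per_mask)  # [['智', '技'], ['支', '域']]
print(per_rank)  # [['智', '支'], ['技', '域']]
```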
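Correction
The correction logic is roughly as follows:
- Normalize the text encoding
- Split long text into shorter sentences
- Split the Chinese text into characters
- Replace each Chinese character in turn with [MASK], then use Fill-Mask to predict the top 5 tokens for that position
- Generate the phonetically similar and visually similar candidate set for the character
- If the original character is not in the top 5 but one of its candidates is, replace the character with the highest-ranked such candidate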
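GitHub implementation: the pycorrector.ernie.ernie_corrector module in the pycorrector repository (https://github.com/shibing624/pycorrector).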
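The steps above can be sketched without ERNIE by plugging in stand-ins. In the sketch below, fake_predict_mask plays the role of the Fill-Mask model and CONFUSION plays the role of the phonetic/visual candidate generator; both are hypothetical and hard-coded for a single demo sentence, so this only illustrates the control flow, not the real model behaviour.

```python
MASK = '[MASK]'

# Hypothetical confusion sets (sound-alike / look-alike characters).
CONFUSION = {
    '因': {'应', '音', '困'},
    '坐': {'座', '做', '作'},
}


def is_chinese(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return '\u4e00' <= ch <= '\u9fff'


def fake_predict_mask(tokens):
    """Stand-in for the Fill-Mask model, hard-coded for the demo sentence."""
    canned = {4: ['应', '不', '都'], 10: ['座', '路', '位']}
    return canned.get(tokens.index(MASK), [])


def correct(text, predict=fake_predict_mask):
    out, details = [], []
    for idx, ch in enumerate(text):
        fixed = ch
        if is_chinese(ch):
            tokens = list(text)
            tokens[idx] = MASK                      # mask one character at a time
            top_tokens = predict(tokens)
            if top_tokens and ch not in top_tokens:
                candidates = CONFUSION.get(ch, set())
                for token in top_tokens:            # best-ranked prediction first
                    if token in candidates:
                        details.append([ch, token, idx, idx + 1])
                        fixed = token
                        break
        out.append(fixed)
    return ''.join(out), details


result = correct('少先队员因该为老人让坐')
print(result)
```

Requiring a prediction to also appear in the confusion set is what keeps the language model from rewriting a character into something fluent but unrelated to the original.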
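Installing kenlm
AI Studio does not grant sudo privileges. If kenlm fails to install there, you can install and test it on a local Linux machine instead.
- Install kenlm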
pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install kenlm
from pycorrector.ernie.ernie_corrector import ErnieCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]
    m = ErnieCorrector()
    for line in error_sentences:
        correct_sent, err = m.ernie_correct(line)
        print("original sentence:{} => {}, err:{}".format(line, correct_sent, err))
Implementation walkthrough:
# -*- coding: utf-8 -*-
import operator
import os
import sys
import time

import numpy as np
import paddle.fluid.dygraph as D
import paddle.fluid.layers as L

sys.path.append('../..')
from pycorrector.utils.text_utils import is_chinese_string, convert_to_unicode
from pycorrector.utils.logger import logger
from pycorrector.corrector import Corrector
from pycorrector.ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
from pycorrector.ernie.tokenizing_ernie import ErnieTokenizer
from pycorrector.utils.tokenizer import segment

pwd_path = os.path.abspath(os.path.dirname(__file__))
D.guard().__enter__()  # activate paddle `dygraph` mode
class ErnieCloze(ErnieModelForPretraining):
    def __init__(self, *args, **kwargs):
        super(ErnieCloze, self).__init__(*args, **kwargs)
        del self.pooler_heads

    def forward(self, src_ids, *args, **kwargs):
        pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
        # paddle ernie mask_id = 3, mask_token = [MASK]
        mask_id = 3
        encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
        encoded_2d = self.mlm(encoded_2d)
        encoded_2d = self.mlm_ln(encoded_2d)
        logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
        return logits_2d
class ErnieCorrector(Corrector):
    def __init__(self, model_dir='ernie-1.0', topN=5):
        super(ErnieCorrector, self).__init__()
        self.name = 'ernie_corrector'
        t1 = time.time()
        self.ernie_tokenizer = ErnieTokenizer.from_pretrained(model_dir)
        # Reverse vocabulary: token id -> token string
        self.rev_dict = {v: k for k, v in self.ernie_tokenizer.vocab.items()}
        self.rev_dict[self.ernie_tokenizer.pad_id] = ''  # replace [PAD]
        self.rev_dict[self.ernie_tokenizer.sep_id] = ''  # replace [SEP]
        self.rev_dict[self.ernie_tokenizer.unk_id] = ''  # replace [UNK]
        self.cloze = ErnieCloze.from_pretrained(model_dir)
        self.cloze.eval()
        logger.debug('Loaded ernie model: %s, spend: %.3f s.' % (model_dir, time.time() - t1))
        self.mask_id = self.ernie_tokenizer.mask_id  # 3
        self.mask_token = self.rev_dict[self.mask_id]  # "[MASK]"
        logger.debug('ernie mask_id :{}, mask_token: {}'.format(self.mask_id, self.mask_token))
        self.topN = topN
    def predict_mask(self, sentence_with_mask):
        ids, id_types = self.ernie_tokenizer.encode(sentence_with_mask)
        ids = np.expand_dims(ids, 0)
        ids = D.to_variable(ids)
        logits = self.cloze(ids).numpy()
        output_ids = np.argsort(logits, -1)
        masks_ret = []
        # Reverse the ascending order and keep the topN most likely tokens
        for masks in output_ids:
            temp = []
            for mask in masks[::-1][:self.topN]:
                temp.append(self.rev_dict[mask])
            masks_ret.append(temp)
        # Handle the case of multiple masks
        out = []
        for i in range(len(masks_ret)):
            temp = []
            for j in range(len(masks_ret[i])):
                temp.append(masks_ret[i][j])
            out.append(temp)
        # transpose out data
        # [['智', '学', '能', '技', '互'], ['能', '慧', '习', '智', '商']] => 智能 学慧 能习 技智 互商
        out = np.transpose(np.array(out))
        out = [''.join(i) for i in out.tolist()]
        ret = []
        for i in out:
            ret.append({'token': i})
        return ret
    def ernie_correct(self, text, ernie_cut_type='char'):
        """
        Correct a sentence.
        :param text: sentence text
        :param ernie_cut_type: segmentation type (char/word)
        :return: corrected_text, list[list], [error_word, correct_word, begin_pos, end_pos]
        """
        text_new = ''
        details = []
        self.check_corrector_initialized()
        # Normalize the encoding: utf-8 to unicode
        text = convert_to_unicode(text)
        # Split long text into shorter blocks
        blocks = self.split_text_by_maxlen(text, maxlen=512)
        for blk, start_idx in blocks:
            blk_new = ''
            blk = segment(blk, cut_type=ernie_cut_type, pos=False)
            for idx, s in enumerate(blk):
                # Only Chinese tokens are checked for errors
                if is_chinese_string(s):
                    # Copy of blk, so the original tokens stay untouched
                    sentence_lst = blk[:idx] + blk[idx:]
                    sentence_lst[idx] = self.mask_token * len(s)
                    sentence_new = ' '.join(sentence_lst)
                    # Predict; top 5 by default
                    predicts = self.predict_mask(sentence_new)
                    top_tokens = []
                    for p in predicts:
                        top_tokens.append(p.get('token', ''))
                    if top_tokens and (s not in top_tokens):
                        # Collect all plausible replacements for this token
                        candidates = self.generate_items(s)
                        if candidates:
                            for token_str in top_tokens:
                                if token_str in candidates:
                                    details.append([s, token_str, start_idx + idx, start_idx + idx + 1])
                                    s = token_str
                                    break
                blk_new += s
            text_new += blk_new
        details = sorted(details, key=operator.itemgetter(2))
        return text_new, details
if __name__ == "__main__":
    d = ErnieCorrector()
    error_sentences = [
        '我对于宠物出租得事非常认同,因为其实很多人喜欢宠物',
        '疝気医院那好 ,疝気专科百科问答',
        '少先队员因该为老人让坐',
        '少 先 队 员 因 该 为 老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '今天心情很好',
    ]
    for sent in error_sentences:
        corrected_sent, err = d.ernie_correct(sent)
        print("original sentence:{} => {}, err:{}".format(sent, corrected_sent, err))
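Summary
Evaluation results for ERNIE-based correction:
Sighan15:
Accuracy: 29.70%
Recall: 28.13%
Corpus500:
Accuracy: 59.80%
Recall: 41.33%
These results suggest that ERNIE has strong representational power for text: relying only on the pre-trained MLM, its correction quality is just slightly behind the long-optimized expert rule-based method, and several of the individual corrections look quite reasonable.
Choosing a good model and the right direction clearly matters a great deal. My aim for this project is to keep the rule-based method as extensible as possible, and to survey deep models as broadly and as deeply as possible.
Of course, text correction is more than low-level spelling correction; deeper problems include semantic and grammatical errors. This post is only a starting point, and I hope more people will follow the field, study it, and contribute.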
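Discussion and feedback
There is a WeChat group for discussing NLP text-correction techniques. Anyone interested is welcome to join, and if replies to GitHub issues are slow you can also ask in the group.
If the WeChat group is full, add my WeChat ID xuming624 with the note "<your name>-NLP纠错" to be invited.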
