Medical Examination Report Recognition Based on PaddleOCR

1. Project Background and Significance

In the rapidly developing era of internet healthcare, medical informatization has become a clear trend for the industry. Surveys suggest that roughly 80% of medical records exist in unstructured form, which makes them hard to use directly and wastes a large amount of medical resources. The sheer volume of semi-structured and unstructured text in medical data, the specialized terminology, and the diversity of phrasing all make structured information extraction difficult. Recognizing, extracting, and structuring the information in electronic medical records and reports is therefore of real value for clinical diagnosis, disease prevention, and medical research.

Automatic recognition and entry of the information in medical examination reports saves staff time, improves service efficiency, and reduces cost. Building on PaddleOCR's strong results in text recognition, this project uses PaddleOCR to detect and recognize medical examination reports and to structure the extracted data, combining CV and NLP techniques to reach a usable level of accuracy. Looking ahead, the recognized information could feed personalized disease prediction and health recommendation.

2. Project Workflow

PaddleOCR is Baidu's open-source ultra-lightweight OCR toolkit. This project uses it to recognize medical examination reports. The workflow is:

    1. PaddleOCR environment setup and quick inference
    2. Training the report detection model (det)
    3. Training the report recognition model (rec)
    4. PP-Structure layout analysis and recognition of the report

3. Technical Overview

Among the algorithms provided by PaddleOCR, the baseline models are used here for report detection and recognition. The pipeline is shown below:

[Figure: overall pipeline (https://ai-studio-static-online.cdn.bcebos.com/f426fbc8b34a47a18968ba90a75d045119980c741dc54f978a311a53e9362693)]

1. Detection: DB algorithm

[Figure: DB text detection algorithm (https://ai-studio-static-online.cdn.bcebos.com/e69477e69e4744dc9e4397b80df7db6ce26294f403574870a5b7387960414d68)]

2. Recognition: CRNN + CTC

[Figure: CRNN + CTC recognition pipeline (https://ai-studio-static-online.cdn.bcebos.com/f5766b58a1ef4f8aa91703e6e160f92288df15b5aff34ec19a8c0a2595079377)]

4. Dataset Overview

Data location: data/data159696/report_ex.tar

Extraction command: !tar -xf /home/aistudio/data/data159696/report_ex.tar

Dataset structure:

/home/aistudio/report_ex
  └─ pngs: report images (.png files)
  └─ txts: annotation files with box coordinates and the corresponding text
  └─ json: the same annotations in JSON format


The txt annotation format is:

Rect (182.0, 1078.03125, 266.0, 1064.03125) 姓名:张某某

Rect (356.0, 1078.03125, 412.0, 1064.03125) 性别:男

Rect (516.0, 1078.03125, 572.0, 1064.03125) 年龄:40
  • Note: the coordinates in this data use a bottom-left origin. For detection with PaddleOCR they must be converted to a top-left origin, and both x and y must be multiplied by 4. A minimal conversion sketch is shown below.
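A rough sketch of this coordinate conversion, assuming the height img_h of the rendered PNG and the x4 scale described above (the helper name is illustrative; the values actually written by the project's conversion script govern):

def rect_to_paddle_box(x1, y1, x2, y2, img_h):
    # Convert a bottom-left-origin Rect into a top-left-origin PaddleOCR quad.
    xs = sorted([x1 * 4, x2 * 4])
    # flip the y axis: new_y = image_height - 4 * old_y
    ys = sorted([img_h - y1 * 4, img_h - y2 * 4])
    # four corner points, clockwise from the top-left corner
    return [[xs[0], ys[0]], [xs[1], ys[0]], [xs[1], ys[1]], [xs[0], ys[1]]]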

Sample image:

[Figure: sample report image (https://ai-studio-static-online.cdn.bcebos.com/b3064d270ecd4495ad9c0db7356fa671f2b5b2a9ffee45f39d763d175efdf830)]

5. Model Training

1.1 Environment Setup

Install PaddleOCR and its dependencies:

%cd ~ 
!git clone -b release/2.1 https://github.com/PaddlePaddle/PaddleOCR.git
# Install dependencies
%cd ~/PaddleOCR
!pip install -r requirements.txt -i https://mirror.baidu.com/pypi/simple
1.2 Download the Inference Models and Test Them

Download the PaddleOCR lightweight Chinese OCR models to test recognition on a few images. The models are stored under PaddleOCR/inference.

! mkdir inference
# Download and extract the detection model of the ultra-lightweight Chinese OCR system
! cd inference && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar && tar xf ch_ppocr_mobile_v2.0_det_infer.tar && rm ch_ppocr_mobile_v2.0_det_infer.tar
# Download and extract the recognition model of the ultra-lightweight Chinese OCR system
! cd inference && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar && tar xf ch_ppocr_mobile_v2.0_rec_infer.tar && rm ch_ppocr_mobile_v2.0_rec_infer.tar
# Download and extract the text direction classifier of the ultra-lightweight Chinese OCR system
! cd inference && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar && tar xf ch_ppocr_mobile_v2.0_cls_infer.tar && rm ch_ppocr_mobile_v2.0_cls_infer.tar
1.3 Test a Single Report Image and Visualize It
import matplotlib.pyplot as plt
from PIL import Image
%pylab inline

def show_img(img_path,figsize=(10,10)):
    ## display the original image at img_path
    img = Image.open(img_path)
    plt.figure("test_img", figsize=figsize)
    plt.imshow(img)
    plt.show()
show_img("../20220623110401-0.png")
Populating the interactive namespace from numpy and matplotlib

[Output image: output_6_1.png]

1.4 Test a Single Image

Call tools/infer/predict_system.py to run report recognition. The following arguments are passed:

  • image_dir: the image(s) to test
  • det_model_dir: the inference model of the lightweight detector
  • rec_model_dir: the inference model of the lightweight recognizer
  • cls_model_dir: the inference model of the lightweight direction classifier
# Quick run with the downloaded models
!python3 ./tools/infer/predict_system.py --image_dir="../20220623110401-0.png" \
--det_model_dir="./inference/ch_ppocr_mobile_v2.0_det_infer"  \
--rec_model_dir="./inference/ch_ppocr_mobile_v2.0_rec_infer" \
--cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer"
# Results with the models trained later in this project
!python3 ./tools/infer/predict_system.py --image_dir="../20220623110401-0.png" \
--det_model_dir="./outputall/db_mv3/best_accuracy"  \
--rec_model_dir="./output/rec/best_accuracy" \
--cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer"

Notes on the output: each result line has two fields. The first is the text recognized by PaddleOCR, the second is the confidence of that recognition. Confidence ranges over [0, 1]; the closer to 1, the more certain the model is about the text.
The visualized results are also saved under ./inference_results; they can be opened from the file tree on the left,
or displayed with the code below to inspect the OCR quality.

The detection result for ./inference_results/20220623110401-0.png is shown below:

show_img("./inference_results/20220623110401-0.png",figsize=(20,20))

[Output image: output_11_0.png]
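For reference, the same end-to-end detection + recognition can also be run through the paddleocr Python package instead of the command-line script (a minimal sketch, assuming pip install paddleocr; with paddleocr 2.x the call returns a list of [box, (text, score)] pairs, while newer releases nest it one level deeper):

from paddleocr import PaddleOCR

# use_angle_cls enables the direction classifier; lang="ch" loads the Chinese models
ocr = PaddleOCR(use_angle_cls=True, lang="ch")
result = ocr.ocr("../20220623110401-0.png", cls=True)

for box, (text, score) in result:   # on newer releases iterate over result[0] instead
    print(text, round(score, 3))    # recognized string and its confidence in [0, 1]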

2. Train the Text Detection Model

The official PaddleOCR detection example uses the icdar15 dataset; this project follows its annotation format for training, evaluating, and testing the detection model. MobileNetV3 is used as the example backbone and can be swapped for another network.

Note: the official icdar15 dataset is stored at ~/data/data34815/icdar2015.tar and can serve as a reference if data-format issues come up later. The official data under ~/train_data/icdar2015/text_localization contains two folders and two files:

~/train_data/icdar2015/text_localization 
  └─ icdar_c4_train_imgs/         training images of the icdar dataset
  └─ ch4_test_images/             test images of the icdar dataset
  └─ train_icdar2015_label.txt    training labels of the icdar dataset
  └─ test_icdar2015_label.txt     test labels of the icdar dataset

The official annotation file format is:

" 图像文件名                    json.dumps编码的图像标注信息"
ch4_test_images/img_61.jpg    [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]], ...}]

The annotation (before json.dumps encoding) is a list of dictionaries. In each dictionary, points holds the (x, y) coordinates of the four corners of the text box, starting from the top-left corner and going clockwise.
transcription is the text inside the box; it is not needed for the detection task.

2.1 Data Preparation

# First extract the report dataset into the current directory
!tar -xf /home/aistudio/data/data159696/report_ex.tar
# Count the images in the folder
%cd /home/aistudio/report_ex/pngs
!ls -l | grep "^-" | wc -l   # 20011 images in total
/home/aistudio/report_ex/pngs
20011

The txt format of this report dataset is:

Rect (182.0, 1078.03125, 266.0, 1064.03125) 姓名:张某某

Rect (356.0, 1078.03125, 412.0, 1064.03125) 性别:男

Rect (516.0, 1078.03125, 572.0, 1064.03125) 年龄:40

...

Because the formats differ, a conversion script is needed to build labels in the PaddleOCR annotation format. Due to time constraints the conversion code is fairly rough; readers can refine it as needed. A simplified sketch of the conversion is given after the file listing below.

Files used for training (taking option 1, the partial dataset, as an example):

/home/aistudio/report_ex/
  └─ train_det_new1_hebing/       training images of the report_ex dataset
  └─ test_det_new1_hebing/        test images of the report_ex dataset
/home/aistudio/
  └─ train_det_new1_hebing.txt    training labels of the report_ex dataset
  └─ test_det_new1_hebing.txt     test labels of the report_ex dataset
  └─ gen_data_det_reg.py          format-conversion script
  └─ hebing.py                    merges annotations per image
  └─ split_data.py                splits the training and test sets
  └─ file.py                      copies train/test images into their folders
/home/aistudio/PaddleOCR
  └─ tools/train.py               training entry point
  └─ tools/infer_det.py           inference script
  └─ configs/det/det_mv3_db_all.yml  config file
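A simplified, self-contained sketch of what gen_data_det_reg.py does for one report (the real script in the project handles more cases and its output governs; the function and argument names here are illustrative):

import json
import os
from PIL import Image

def convert_report_txt(txt_path, png_dir, det_label_file):
    # Parse one Rect-format txt file and append a PaddleOCR detection label line
    # ("image_name\t" + json.dumps(list of {"transcription", "points"})).
    img_name = os.path.basename(txt_path).replace(".txt", ".png")
    img_h = Image.open(os.path.join(png_dir, img_name)).height
    items = []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip().startswith("Rect"):
                continue
            coords, text = line.split(")", 1)
            x1, y1, x2, y2 = [float(v) for v in coords.split("(")[1].split(",")]
            # scale by 4 and flip the y axis to a top-left origin, as noted in section 4
            xs = sorted([x1 * 4, x2 * 4])
            ys = sorted([img_h - y1 * 4, img_h - y2 * 4])
            points = [[xs[0], ys[0]], [xs[1], ys[0]], [xs[1], ys[1]], [xs[0], ys[1]]]
            items.append({"transcription": text.strip(), "points": points})
    with open(det_label_file, "a", encoding="utf-8") as f:
        f.write(img_name + "\t" + json.dumps(items, ensure_ascii=False) + "\n")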

2.2 Quick Start Training

Download the two mainstream detection backbones used by PaddleOCR, MobileNetV3 and ResNet50_vd. The backbone can later be replaced with any model from PaddleClas.

# Download the MobileNetV3 pretrained model
!pwd
!wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV3_large_x0_5_pretrained.tar
! cd pretrain_models/ && tar xf MobileNetV3_large_x0_5_pretrained.tar
# Download the ResNet50_vd pretrained model
!wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
! cd pretrain_models/ && tar xf ResNet50_vd_ssld_pretrained.tar

2.3 Data Conversion and Dataset Splitting

Run the conversion script gen_data_det_reg.py to produce the detection (det) and recognition (reg) label files in txt format. The coordinates shown below are illustrative; the actual script output governs.

Detection label file det.txt:

20220623110401-0.png [{"transcription":"姓名:张某某","points":[[182.0,4256.125],[266.0,4256.125],[182.0,4312.125],[266.0,4312.125]]}]

20220623110401-0.png [{"transcription":"性别:男","points":[[356.0,4256.125],[412.0,4256.125],[356.0,4312.125],[412.0,4312.125]]}]

20220623110401-0.png [{"transcription":"年龄:40","points":[[516.0,4256.125],[572.0,4256.125],[516.0,4312.125],[572.0,4312.125]]}]

Recognition label file reg.txt:

20220623110401-0.png 姓名:张某某

20220623110401-0.png 性别:男

20220623110401-0.png 年龄:40

Note: because the full dataset is large and training takes a long time, the steps below come in two variants, 1. partial dataset and 2. full dataset, so readers can check results and debug conveniently. The script names are given in the comments; run whichever variant you need and uncomment accordingly.

#1. Partial-dataset conversion script: generates det1.txt (a bit over a hundred images after merging). Coordinates: x = x*4, y = image height - y*4; reg.txt is not used at this stage.
#   It may raise IndexError: list index out of range and only produce ~20k lines, but this does not block the rest of the pipeline.
# %cd /home/aistudio/
# !python ./gen_data_det_reg.py
#2. Full-dataset conversion script: generates det_all.txt. It runs, but the full dataset is large, so it takes a while.
%cd /home/aistudio/
!python ./gen_data_all.py
# Merge the txt generated above so that each image corresponds to one line containing all of its boxes, producing a new merged txt
#1. Partial data: merge det1.txt into det_new_hebing.txt
# !python hebing.py
#2. Full data: merge det_all.txt into det_new_hebing_all.txt
!python hebing_all.py
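What the merge step amounts to, as a small self-contained sketch (assuming the per-line label format produced above; hebing.py / hebing_all.py in the project may differ in details):

import json
from collections import OrderedDict

def merge_det_labels(src_txt, dst_txt):
    # Group all annotation entries of the same image into one label line,
    # which is the layout PaddleOCR's SimpleDataSet expects.
    merged = OrderedDict()
    with open(src_txt, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            img_name, anno = line.rstrip("\n").split("\t", 1)
            merged.setdefault(img_name, []).extend(json.loads(anno))
    with open(dst_txt, "w", encoding="utf-8") as f:
        for img_name, items in merged.items():
            f.write(img_name + "\t" + json.dumps(items, ensure_ascii=False) + "\n")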
Dataset splitting:

Split the detection labels det.txt and recognition labels reg.txt into training and validation sets, producing train_det.txt, test_det.txt, train_reg.txt and test_reg.txt.

#1. Split the partial dataset: det.txt into train_det_new1_hebing.txt and test_det_new1_hebing.txt; the generated file names must match the ones used for training
# !python split_data.py
#2. Split the full dataset: det_new_hebing_all.txt into train_det_hebing_all.txt and test_det_hebing_all.txt
!python split_data_all.py
20011
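A minimal sketch of such a split (split_data.py / split_data_all.py in the project govern; the ratio and seed here are illustrative):

import random

def split_label_file(label_txt, train_txt, test_txt, test_ratio=0.2, seed=0):
    # Randomly split a merged label file into train/test label files.
    with open(label_txt, encoding="utf-8") as f:
        lines = [l for l in f if l.strip()]
    random.Random(seed).shuffle(lines)
    n_test = int(len(lines) * test_ratio)
    with open(test_txt, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_test])
    with open(train_txt, "w", encoding="utf-8") as f:
        f.writelines(lines[n_test:])
    print(len(lines))  # total number of labeled images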

Copy the dataset images

# Note: copy the training and validation images into the corresponding folders

# Edit file.py, enable the relevant block, and run it twice: once for train and once for test, to create the image folders matching the txt files above

#1. Partial data: copy the images listed in train_det_new1.txt and test_det_new1.txt into ./report_ex/train_det_new1 and ./report_ex/test_det_new1 for testing
# !python file.py
#2. Full data: copy the images listed in train_det_hebing_all.txt and test_det_hebing_all.txt into ./report_ex/train_det_hebing_all and ./report_ex/test_det_hebing_all
!python file_all.py
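The copy step itself is simple; a sketch of what file.py / file_all.py roughly do (run once per split, as noted above):

import os
import shutil

def copy_split_images(label_txt, src_dir, dst_dir):
    # Copy every image referenced in a label file into its own folder, so that
    # Train/Eval.dataset.data_dir plus the image name in the label file resolves.
    os.makedirs(dst_dir, exist_ok=True)
    with open(label_txt, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                img_name = line.split("\t", 1)[0]
                shutil.copy(os.path.join(src_dir, img_name), dst_dir)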
Detection model training

The detection model uses the DB algorithm with a MobileNetV3 or ResNet50 backbone. The config file is selected with -c (configs/det/det_mv3_db.yml), and the -o option overrides training parameters without editing the yml file.

# Official example: train a DB detection model with a MobileNetV3 backbone on icdar15. For reference only; no need to run!
# !python3 PaddleOCR/tools/train.py -c PaddleOCR/configs/det/det_mv3_db.yml -o \
# Global.eval_batch_step="[0,500]" \
# Global.load_static_weights=true \
# Global.pretrained_model='PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained' \
# Train.dataset.data_dir='PaddleOCR/train_data/text_localization/' \
# Train.dataset.label_file_list=['PaddleOCR/train_data/text_localization/train_icdar2015_label.txt'] \
# Eval.dataset.data_dir='PaddleOCR/train_data/text_localization/' \
# Eval.dataset.label_file_list=['PaddleOCR/train_data/text_localization/test_icdar2015_label.txt']
# If a run complains about missing packages, install them with the commands below
!pip install lmdb
!pip install pyclipper
!pip install Levenshtein
!pip install imgaug

Full-dataset training

Because the dataset is large and training takes a long time, training commands are given for both the full and the partial dataset; pick whichever suits your needs.

#1. Merged full dataset + MobileNetV3 detection model
%cd /home/aistudio/
!python3 PaddleOCR/tools/train.py -c PaddleOCR/configs/det/det_mv3_db_all.yml -o \
Global.eval_batch_step="[0,300]" \
Global.load_static_weights=true \
Global.checkpoints='./outputall/db_mv3/best_accuracy' \
Global.pretrained_model='PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained' \
Train.loader.batch_size_per_card=32 \
Train.dataset.data_dir='./report_ex/train_det_hebing_all' \
Train.dataset.label_file_list=['./train_det_hebing_all.txt'] \
Eval.dataset.data_dir='./report_ex/test_det_hebing_all' \
Eval.dataset.label_file_list=['./test_det_hebing_all.txt']
#2. Merged full dataset + ResNet50 detection model
%cd /home/aistudio/
!python3 PaddleOCR/tools/train.py -c PaddleOCR/configs/det/det_mv3_db_all_resnet.yml -o \
Global.eval_batch_step="[0,500]" \
Global.load_static_weights=true \
Global.checkpoints='/home/aistudio/outputall/db_resnet/best_accuracy' \
Global.pretrained_model='PaddleOCR/pretrain_models/ResNet50_vd_ssld_pretrained' \
Train.loader.batch_size_per_card=16 \
Train.dataset.data_dir='./report_ex/train_det_hebing_all' \
Train.dataset.label_file_list=['./train_det_hebing_all.txt'] \
Eval.dataset.data_dir='./report_ex/test_det_hebing_all' \
Eval.dataset.label_file_list=['./test_det_hebing_all.txt']

Partial-dataset training

#3. Merged partial-dataset training template; the log shown below comes from this run
%cd /home/aistudio/
!python3 PaddleOCR/tools/train.py -c PaddleOCR/configs/det/det_mv3_db.yml -o \
Global.eval_batch_step="[0,50]" \
Global.load_static_weights=true \
Global.pretrained_model='PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained' \
Train.loader.batch_size_per_card=16 \
Train.dataset.data_dir='./report_ex/train_det_new1_hebing' \
Train.dataset.label_file_list=['./train_det_new1_hebing.txt'] \
Eval.dataset.data_dir='./report_ex/test_det_new1_hebing' \
Eval.dataset.label_file_list=['./test_det_new1_hebing.txt']
#3. Merged full-dataset training template; batch_size_per_card may need tuning. To run it, uncomment this block and comment out the others.
# %cd /home/aistudio/
# !python3 PaddleOCR/tools/train.py -c PaddleOCR/configs/det/det_mv3_db.yml -o \
# Global.eval_batch_step="[0,10]" \
# Global.load_static_weights=true \
# Global.pretrained_model='PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained' \
# Train.loader.batch_size_per_card=32 \
# Train.dataset.data_dir='./report_ex/train_det_hebing_all' \
# Train.dataset.label_file_list=['./train_det_hebing_all.txt'] \
# Eval.dataset.data_dir='./report_ex/test_det_hebing_all' \
# Eval.dataset.label_file_list=['./test_det_hebing_all.txt']

/home/aistudio
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
[2022/08/09 09:38:47] root INFO: Architecture : 
[2022/08/09 09:38:47] root INFO:     Backbone : 
[2022/08/09 09:38:47] root INFO:         model_name : large
[2022/08/09 09:38:47] root INFO:         name : MobileNetV3
[2022/08/09 09:38:47] root INFO:         scale : 0.5
[2022/08/09 09:38:47] root INFO:     Head : 
[2022/08/09 09:38:47] root INFO:         k : 50
[2022/08/09 09:38:47] root INFO:         name : DBHead
[2022/08/09 09:38:47] root INFO:     Neck : 
[2022/08/09 09:38:47] root INFO:         name : DBFPN
[2022/08/09 09:38:47] root INFO:         out_channels : 256
[2022/08/09 09:38:47] root INFO:     Transform : None
[2022/08/09 09:38:47] root INFO:     algorithm : DB
[2022/08/09 09:38:47] root INFO:     model_type : det
[2022/08/09 09:38:47] root INFO: Eval : 
[2022/08/09 09:38:47] root INFO:     dataset : 
[2022/08/09 09:38:47] root INFO:         data_dir : ./report_ex/test_det_new1_hebing
[2022/08/09 09:38:47] root INFO:         label_file_list : ['./test_det_new1_hebing.txt']
[2022/08/09 09:38:47] root INFO:         name : SimpleDataSet
[2022/08/09 09:38:47] root INFO:         transforms : 
[2022/08/09 09:38:47] root INFO:             DecodeImage : 
[2022/08/09 09:38:47] root INFO:                 channel_first : False
[2022/08/09 09:38:47] root INFO:                 img_mode : BGR
[2022/08/09 09:38:47] root INFO:             DetLabelEncode : None
[2022/08/09 09:38:47] root INFO:             DetResizeForTest : 
[2022/08/09 09:38:47] root INFO:                 image_shape : [736, 1280]
[2022/08/09 09:38:47] root INFO:             NormalizeImage : 
[2022/08/09 09:38:47] root INFO:                 mean : [0.485, 0.456, 0.406]
[2022/08/09 09:38:47] root INFO:                 order : hwc
[2022/08/09 09:38:47] root INFO:                 scale : 1./255.
[2022/08/09 09:38:47] root INFO:                 std : [0.229, 0.224, 0.225]
[2022/08/09 09:38:47] root INFO:             ToCHWImage : None
[2022/08/09 09:38:47] root INFO:             KeepKeys : 
[2022/08/09 09:38:47] root INFO:                 keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2022/08/09 09:38:47] root INFO:     loader : 
[2022/08/09 09:38:47] root INFO:         batch_size_per_card : 1
[2022/08/09 09:38:47] root INFO:         drop_last : False
[2022/08/09 09:38:47] root INFO:         num_workers : 8
[2022/08/09 09:38:47] root INFO:         shuffle : False
[2022/08/09 09:38:47] root INFO:         use_shared_memory : False
[2022/08/09 09:38:47] root INFO: Global : 
[2022/08/09 09:38:47] root INFO:     cal_metric_during_train : False
[2022/08/09 09:38:47] root INFO:     checkpoints : None
[2022/08/09 09:38:47] root INFO:     debug : False
[2022/08/09 09:38:47] root INFO:     distributed : False
[2022/08/09 09:38:47] root INFO:     epoch_num : 1200
[2022/08/09 09:38:47] root INFO:     eval_batch_step : [0, 50]
[2022/08/09 09:38:47] root INFO:     infer_img : ./20220623110401-0.png
[2022/08/09 09:38:47] root INFO:     load_static_weights : True
[2022/08/09 09:38:47] root INFO:     log_smooth_window : 20
[2022/08/09 09:38:47] root INFO:     pretrained_model : PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained
[2022/08/09 09:38:47] root INFO:     print_batch_step : 10
[2022/08/09 09:38:47] root INFO:     save_epoch_step : 1200
[2022/08/09 09:38:47] root INFO:     save_inference_dir : None
[2022/08/09 09:38:47] root INFO:     save_model_dir : ./output1/db_mv3/
[2022/08/09 09:38:47] root INFO:     save_res_path : ./output1/det_db/predicts_db.txt
[2022/08/09 09:38:47] root INFO:     use_gpu : True
[2022/08/09 09:38:47] root INFO:     use_visualdl : False
[2022/08/09 09:38:47] root INFO: Loss : 
[2022/08/09 09:38:47] root INFO:     alpha : 5
[2022/08/09 09:38:47] root INFO:     balance_loss : True
[2022/08/09 09:38:47] root INFO:     beta : 10
[2022/08/09 09:38:47] root INFO:     main_loss_type : DiceLoss
[2022/08/09 09:38:47] root INFO:     name : DBLoss
[2022/08/09 09:38:47] root INFO:     ohem_ratio : 3
[2022/08/09 09:38:47] root INFO: Metric : 
[2022/08/09 09:38:47] root INFO:     main_indicator : hmean
[2022/08/09 09:38:47] root INFO:     name : DetMetric
[2022/08/09 09:38:47] root INFO: Optimizer : 
[2022/08/09 09:38:47] root INFO:     beta1 : 0.9
[2022/08/09 09:38:47] root INFO:     beta2 : 0.999
[2022/08/09 09:38:47] root INFO:     lr : 
[2022/08/09 09:38:47] root INFO:         learning_rate : 0.001
[2022/08/09 09:38:47] root INFO:     name : Adam
[2022/08/09 09:38:47] root INFO:     regularizer : 
[2022/08/09 09:38:47] root INFO:         factor : 0
[2022/08/09 09:38:47] root INFO:         name : L2
[2022/08/09 09:38:47] root INFO: PostProcess : 
[2022/08/09 09:38:47] root INFO:     box_thresh : 0.6
[2022/08/09 09:38:47] root INFO:     max_candidates : 1000
[2022/08/09 09:38:47] root INFO:     name : DBPostProcess
[2022/08/09 09:38:47] root INFO:     thresh : 0.3
[2022/08/09 09:38:47] root INFO:     unclip_ratio : 1.5
[2022/08/09 09:38:47] root INFO: Train : 
[2022/08/09 09:38:47] root INFO:     dataset : 
[2022/08/09 09:38:47] root INFO:         data_dir : ./report_ex/train_det_new1_hebing
[2022/08/09 09:38:47] root INFO:         label_file_list : ['./train_det_new1_hebing.txt']
[2022/08/09 09:38:47] root INFO:         name : SimpleDataSet
[2022/08/09 09:38:47] root INFO:         ratio_list : [1.0]
[2022/08/09 09:38:47] root INFO:         transforms : 
[2022/08/09 09:38:47] root INFO:             DecodeImage : 
[2022/08/09 09:38:47] root INFO:                 channel_first : False
[2022/08/09 09:38:47] root INFO:                 img_mode : BGR
[2022/08/09 09:38:47] root INFO:             DetLabelEncode : None
[2022/08/09 09:38:47] root INFO:             IaaAugment : 
[2022/08/09 09:38:47] root INFO:                 augmenter_args : 
[2022/08/09 09:38:47] root INFO:                     args : 
[2022/08/09 09:38:47] root INFO:                         p : 0.5
[2022/08/09 09:38:47] root INFO:                     type : Fliplr
[2022/08/09 09:38:47] root INFO:                     args : 
[2022/08/09 09:38:47] root INFO:                         rotate : [-10, 10]
[2022/08/09 09:38:47] root INFO:                     type : Affine
[2022/08/09 09:38:47] root INFO:                     args : 
[2022/08/09 09:38:47] root INFO:                         size : [0.5, 3]
[2022/08/09 09:38:47] root INFO:                     type : Resize
[2022/08/09 09:38:47] root INFO:             EastRandomCropData : 
[2022/08/09 09:38:47] root INFO:                 keep_ratio : True
[2022/08/09 09:38:47] root INFO:                 max_tries : 50
[2022/08/09 09:38:47] root INFO:                 size : [640, 640]
[2022/08/09 09:38:47] root INFO:             MakeBorderMap : 
[2022/08/09 09:38:47] root INFO:                 shrink_ratio : 0.4
[2022/08/09 09:38:47] root INFO:                 thresh_max : 0.7
[2022/08/09 09:38:47] root INFO:                 thresh_min : 0.3
[2022/08/09 09:38:47] root INFO:             MakeShrinkMap : 
[2022/08/09 09:38:47] root INFO:                 min_text_size : 8
[2022/08/09 09:38:47] root INFO:                 shrink_ratio : 0.4
[2022/08/09 09:38:47] root INFO:             NormalizeImage : 
[2022/08/09 09:38:47] root INFO:                 mean : [0.485, 0.456, 0.406]
[2022/08/09 09:38:47] root INFO:                 order : hwc
[2022/08/09 09:38:47] root INFO:                 scale : 1./255.
[2022/08/09 09:38:47] root INFO:                 std : [0.229, 0.224, 0.225]
[2022/08/09 09:38:47] root INFO:             ToCHWImage : None
[2022/08/09 09:38:47] root INFO:             KeepKeys : 
[2022/08/09 09:38:47] root INFO:                 keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2022/08/09 09:38:47] root INFO:     loader : 
[2022/08/09 09:38:47] root INFO:         batch_size_per_card : 16
[2022/08/09 09:38:47] root INFO:         drop_last : False
[2022/08/09 09:38:47] root INFO:         num_workers : 8
[2022/08/09 09:38:47] root INFO:         shuffle : True
[2022/08/09 09:38:47] root INFO:         use_shared_memory : False
[2022/08/09 09:38:47] root INFO: train with paddle 2.0.2 and device CUDAPlace(0)
[2022/08/09 09:38:47] root INFO: Initialize indexs of datasets:['./train_det_new1_hebing.txt']
[2022/08/09 09:38:47] root INFO: Initialize indexs of datasets:['./test_det_new1_hebing.txt']
W0809 09:38:47.257441 10327 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0809 09:38:47.261169 10327 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2022/08/09 09:38:50] root INFO: load pretrained model from ['PaddleOCR/pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2022/08/09 09:38:50] root INFO: train dataloader has 9 iters
[2022/08/09 09:38:50] root INFO: valid dataloader has 16 iters
[2022/08/09 09:38:50] root INFO: During the training process, after the 0th iteration, an evaluation is run every 50 iterations
[2022/08/09 09:38:50] root INFO: Initialize indexs of datasets:['./train_det_new1_hebing.txt']
[2022/08/09 09:39:51] root INFO: epoch: [1/1200], iter: 8, lr: 0.001000, loss: 7.751804, loss_shrink_maps: 4.606901, loss_threshold_maps: 2.225027, loss_binary_maps: 0.919876, reader_cost: 5.55701 s, batch_cost: 6.10987 s, samples: 140, ips: 2.29137
[2022/08/09 09:39:52] root INFO: save model in ./output1/db_mv3/latest
[2022/08/09 09:39:52] root INFO: Initialize indexs of datasets:['./train_det_new1_hebing.txt']
[2022/08/09 09:40:55] root INFO: epoch: [2/1200], iter: 10, lr: 0.001000, loss: 7.377272, loss_shrink_maps: 4.554792, loss_threshold_maps: 1.911745, loss_binary_maps: 0.910735, reader_cost: 6.11302 s, batch_cost: 6.33029 s, samples: 32, ips: 0.50551
[2022/08/09 09:40:58] root INFO: epoch: [2/1200], iter: 17, lr: 0.001000, loss: 6.610305, loss_shrink_maps: 4.466334, loss_threshold_maps: 1.255741, loss_binary_maps: 0.897122, reader_cost: 0.07724 s, batch_cost: 0.30940 s, samples: 108, ips: 34.90611
[2022/08/09 09:40:59] root INFO: save model in ./output1/db_mv3/latest
[2022/08/09 09:40:59] root INFO: Initialize indexs of datasets:['./train_det_new1_hebing.txt']
^C
main proc 11724 exit, kill process group 10327
main proc 11723 exit, kill process group 10327

2.4 Test the Detection Results

The detection checkpoints produced during training are saved under the directory given by Global.save_model_dir in the yml config (./output1/db_mv3/ in the partial-data run above).

Use the trained model to test detection on a single image.

# %cd PaddleOCR
# Result with the partial-dataset model
# !python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.infer_img="../20220623110401-0.png" Global.pretrained_model="/home/aistudio/output1/db_mv3/best_accuracy"
# Result with the full-dataset model (the full set was only trained for one epoch)
!python3 tools/infer_det.py -c configs/det/det_mv3_db_all.yml -o Global.infer_img="../20220623110401-0.png" Global.pretrained_model="/home/aistudio/outputall/db_mv3/best_accuracy"
# %cd PaddleOCR/
# !python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.infer_img="../20220623110401-0.png"  Global.checkpoints="./output/db_mv3/best_accuracy"

Use the trained model to test detection on every image in a folder; adjust the paths as needed (same below).

#!python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.infer_img="./doc/imgs_en/"  Global.checkpoints="./output/db_mv3/best_accuracy"

3. Train the Text Recognition Model

3.1 Data Preparation

First put the training images into one folder (e.g. train_images) and record the image paths and labels in a txt file (e.g. rec_gt_train.txt).

The recognition label files used in this project are train_reg.txt and test_reg.txt. Note: by default the image path and the label must be separated by \t.

  • txt training set
" image filename                 text label "

  20220623110401-0.png   姓名:张某某

The file structure of the training and test sets is as follows:

    |- train_reg.txt
    |- report_ex/
    	|- train_reg
           |- word_001.png
           |- word_002.jpg
           | ...
    	|- test_reg
           |- word_001.png
           |- word_002.jpg
           | ...

3.2 Quick Start Training

The recognition network uses CRNN as the example; the two mainstream PaddleOCR recognition backbones, MobileNetV3 and ResNet50_vd, are available:

# Download the pretrained CRNN recognition model (MobileNetV3 backbone)
%cd PaddleOCR/
!wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar
! cd pretrain_models/ && tar xf rec_mv3_none_bilstm_ctc_v2.0_train.tar
#rec.py crops each annotated box out of its page image, producing the new_pngs image folder and the matching rec.txt for the recognition model
%cd /home/aistudio/
!python ./rec.py
# Count the cropped images
%cd ./new_pngs
!ls -l | grep "^-" | wc -l   # roughly 1.49 million cropped images in total
/home/aistudio/new_pngs
1492727
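A self-contained sketch of the cropping step (the project's rec.py governs; file naming here is illustrative):

import json
import os
from PIL import Image

def crop_rec_samples(det_label_txt, png_dir, out_dir, rec_label_txt):
    # Crop every annotated box out of its page image and write a recognition
    # label file with "crop_name\ttext" lines.
    os.makedirs(out_dir, exist_ok=True)
    with open(det_label_txt, encoding="utf-8") as fin, \
         open(rec_label_txt, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue
            img_name, anno = line.rstrip("\n").split("\t", 1)
            page = Image.open(os.path.join(png_dir, img_name))
            for i, item in enumerate(json.loads(anno)):
                xs = [p[0] for p in item["points"]]
                ys = [p[1] for p in item["points"]]
                box = tuple(int(v) for v in (min(xs), min(ys), max(xs), max(ys)))
                crop_name = "{}_{:04d}.png".format(os.path.splitext(img_name)[0], i)
                page.crop(box).save(os.path.join(out_dir, crop_name))
                fout.write(crop_name + "\t" + item["transcription"] + "\n")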
# Split the recognition data into training and test sets
%cd /home/aistudio/
!python ./rec_split_data.py
/home/aistudio
2150
# Copy the train/test recognition images into their folders, same procedure as before: run twice, once for train and once for test
!python rec_file.py

Text recognition training

Because of server memory limits, recognition training here only uses a small subset of the data as a basic demonstration; parameters or the network can be changed later for a full run.

%cd PaddleOCR/
!python3 ./tools/train.py -c ./configs/rec/rec_icdar15_train.yml -o \
Global.eval_batch_step="[0,100]" \
Global.save_epoch_step=500 \
Global.pretrained_model='./pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train/best_accuracy' \
Train.dataset.data_dir='../report_ex/train_rec' \
Train.dataset.label_file_list=['../train_rec.txt'] \
Eval.dataset.data_dir='../report_ex/test_rec' \
Eval.dataset.label_file_list=['../test_rec.txt'] \
Optimizer.lr.learning_rate=0.001

3.3 Test the Recognition Results

Test the recognition result on a single image:

# !python3 tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=./pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train/best_accuracy Global.infer_img=../20220623110401-0.png
!python3 tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=./output/rec/best_accuracy Global.infer_img=../report_ex/test_rec/20220623110401-1001_0119.png
import matplotlib.pyplot as plt
from PIL import Image
%pylab inline

def show_img(img_path,figsize=(10,10)):
    ## display the original image at img_path
    img = Image.open(img_path)
    plt.figure("test_img", figsize=figsize)
    plt.imshow(img)
    plt.show()
show_img("../report_ex/test_rec/20220623110401-1001_0119.png")
Populating the interactive namespace from numpy and matplotlib

[Output image: output_49_1.png]

Test text recognition on every image in a folder:

!python3 tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=./output/rec_CRNN/best_accuracy Global.infer_img=./doc/imgs_words_en/

Export the inference models

!cp ./data/data164761/best_accuracy.pdopt ./outputall/rec
!cp ./data/data164761/best_accuracy.pdparams ./outputall/rec
#1. Export the detection model
%cd PaddleOCR/
!python tools/export_model.py -c configs/det/det_mv3_db_all.yml \
-o Global.pretrained_model="../outputall/db_mv3/best_accuracy" \
Global.save_inference_dir="./my_det_model"
[Errno 2] No such file or directory: 'PaddleOCR/'
/home/aistudio/PaddleOCR
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0817 22:11:30.547513  1528 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0817 22:11:30.552330  1528 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2022/08/17 22:11:33] root INFO: load pretrained model from ['../outputall/db_mv3/best_accuracy']
[2022/08/17 22:11:34] root INFO: inference model is saved to ./my_det_model/inference
#2. Export the full-dataset recognition model
%cd PaddleOCR/
!python tools/export_model.py -c configs/rec/ch_PP-OCRv3_rec_distillation.yml \
-o Global.pretrained_model="../outputall/rec/best_accuracy" \
Global.save_inference_dir="./my_rec_model"
#3. Export the recognition model trained by the commands above
%cd PaddleOCR/
!python tools/export_model.py -c configs/rec/rec_icdar15_train.yml \
-o Global.pretrained_model="./output/rec/best_accuracy" \
Global.save_inference_dir="./my_zj_rec_model"
# Detection + recognition
#!python3 ./tools/infer/predict_system.py --image_dir="../report_ex/test_rec/20220623110401-1006_0123.png"
!pwd
%cd PaddleOCR
# --det_model_dir points to the detection model exported above; to use the exported recognition
# models instead, swap --rec_model_dir for "./my_zj_rec_model/" or "./my_rec_model/"
!python3 ./tools/infer/predict_system.py --image_dir="../report_ex/pngs/20220623110401-1001.png" \
--det_model_dir="./my_det_model/" \
--rec_model_dir="./inference/ch_ppocr_mobile_v2.0_rec_infer/"

Detection + recognition prediction results

%cd PaddleOCR
!python3 ./tools/infer/predict_system.py --image_dir="../report_ex/pngs/20220623110401-1001.png" \
--det_model_dir="./my_det_model/" \
--rec_model_dir="./inference/ch_ppocr_mobile_v2.0_rec_infer"
/home/aistudio/PaddleOCR
[Errno 2] No such file or directory: 'PaddleOCR'
/home/aistudio/PaddleOCR
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/setuptools/depends.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
E0823 12:04:17.329943  1246 analysis_config.cc:80] Please compile with gpu to EnableGpu()
E0823 12:04:17.623020  1246 analysis_config.cc:80] Please compile with gpu to EnableGpu()
[2022/08/23 12:04:19] root INFO: dt_boxes num : 165, elapse : 1.5276000499725342
[2022/08/23 12:04:21] root INFO: rec_res num  : 165, elapse : 1.9995708465576172
[2022/08/23 12:04:21] root INFO: Predict time of ../report_ex/pngs/20220623110401-1001.png: 3.705s
[2022/08/23 12:04:21] root INFO: 姓名:张某某, 0.982
[2022/08/23 12:04:21] root INFO: 性别:男, 0.982
[2022/08/23 12:04:21] root INFO: 年龄:40, 0.997
[2022/08/23 12:04:21] root INFO: 浆膜腔漏出液及渗出液检查(胸腹水常规), 0.982
[2022/08/23 12:04:21] root INFO: 检查日期:2022-6-1, 0.997
[2022/08/23 12:04:21] root INFO: 检查医生:陆静, 0.990
[2022/08/23 12:04:21] root INFO: 透明度, 0.992
[2022/08/23 12:04:21] root INFO: 细菌, 0.884
[2022/08/23 12:04:21] root INFO: 清洗或, 0.995
[2022/08/23 12:04:21] root INFO: 清洗或微混, 0.963
[2022/08/23 12:04:21] root INFO: 无, 0.987
[2022/08/23 12:04:21] root INFO: 无, 0.998
[2022/08/23 12:04:21] root INFO: 比重, 0.996
[2022/08/23 12:04:21] root INFO: 0.3, 0.983
[2022/08/23 12:04:21] root INFO: 凝固, 0.963
[2022/08/23 12:04:21] root INFO: 不易凝, 0.998
[2022/08/23 12:04:21] root INFO: 不易凝固, 0.995
[2022/08/23 12:04:21] root INFO: <1.018, 0.988
[2022/08/23 12:04:21] root INFO: 糖定性, 0.998
[2022/08/23 12:04:21] root INFO: 0.1, 0.977
[2022/08/23 12:04:21] root INFO: 与血糖, 0.995
[2022/08/23 12:04:21] root INFO: 与血糖类似, 0.927
[2022/08/23 12:04:21] root INFO: 细胞计数, 0.996
[2022/08/23 12:04:21] root INFO: 109/L, 0.970
[2022/08/23 12:04:21] root INFO: <0.1*1, 0.953
[2022/08/23 12:04:21] root INFO: g/dl, 0.940
[2022/08/23 12:04:21] root INFO: 22.58, 0.889
[2022/08/23 12:04:21] root INFO: 细胞分类, 0.997
[2022/08/23 12:04:21] root INFO: 以淋巴, 0.961
[2022/08/23 12:04:21] root INFO: 蛋白定量, 0.994
[2022/08/23 12:04:21] root INFO: 2.0, 0.985
[2022/08/23 12:04:21] root INFO: 以淋巴细胞为, 0.848
[2022/08/23 12:04:21] root INFO: 阳性, 0.999
[2022/08/23 12:04:21] root INFO: 颜色, 0.999
[2022/08/23 12:04:21] root INFO: 淡黄色或黄绿, 0.997
[2022/08/23 12:04:21] root INFO: 蛋白定性, 0.992
[2022/08/23 12:04:21] root INFO: 淡黄色, 0.994
[2022/08/23 12:04:21] root INFO: 阴性, 0.996
[2022/08/23 12:04:21] root INFO: 检查小结, 0.995
[2022/08/23 12:04:21] root INFO: 本站网络现在东西精华东西, 0.999
[2022/08/23 12:04:21] root INFO: 检查日期:2022-6-1, 0.995
[2022/08/23 12:04:21] root INFO: 检查医生:盛浩, 0.989
[2022/08/23 12:04:21] root INFO: 前列腺液常规检查, 0.943
[2022/08/23 12:04:21] root INFO: 单位, 0.998
[2022/08/23 12:04:21] root INFO: 项目名称, 0.998
[2022/08/23 12:04:21] root INFO: 检查结果, 0.993
[2022/08/23 12:04:21] root INFO: 参考标识, 0.996
[2022/08/23 12:04:21] root INFO: 参考范围, 0.990
[2022/08/23 12:04:21] root INFO: 偶尔可见, 0.991
[2022/08/23 12:04:21] root INFO: 精子, 0.995
[2022/08/23 12:04:21] root INFO: 偶尔可见, 0.991
[2022/08/23 12:04:21] root INFO: 颜色, 0.997
[2022/08/23 12:04:21] root INFO: 淡乳白色稀薄液体, 0.990
[2022/08/23 12:04:21] root INFO: 淡乳白色稀薄液体, 0.993
[2022/08/23 12:04:21] root INFO: 少见,老年易见到, 0.997
[2022/08/23 12:04:21] root INFO: 少见,老年易见到, 0.995
[2022/08/23 12:04:21] root INFO: 滨粉样本, 0.828
[2022/08/23 12:04:21] root INFO: 白细胞, 0.971
[2022/08/23 12:04:21] root INFO: <10个/HP, 0.963
[2022/08/23 12:04:21] root INFO: <10个/HP, 0.972
[2022/08/23 12:04:21] root INFO: 上皮细胞, 0.990
[2022/08/23 12:04:21] root INFO: 少量, 0.989
[2022/08/23 12:04:21] root INFO: 少量, 0.986
[2022/08/23 12:04:21] root INFO: 偶尔可见, 0.991
[2022/08/23 12:04:21] root INFO: 颗粒细胞, 0.991
[2022/08/23 12:04:21] root INFO: 偶尔可见, 0.991
[2022/08/23 12:04:21] root INFO: 偶见, 0.965
[2022/08/23 12:04:21] root INFO: 红细胞, 0.992
[2022/08/23 12:04:21] root INFO: 偶见, 0.860
[2022/08/23 12:04:21] root INFO: 卵磷脂小体, 0.972
[2022/08/23 12:04:21] root INFO: 多量,均匀分布满视野(/HP), 0.956
[2022/08/23 12:04:21] root INFO: 多量,均匀分布满视野(/HP), 0.971
[2022/08/23 12:04:21] root INFO: 滴虫, 0.992
[2022/08/23 12:04:21] root INFO: 无, 0.997
[2022/08/23 12:04:21] root INFO: 无, 0.996
[2022/08/23 12:04:21] root INFO: 量, 0.998
[2022/08/23 12:04:21] root INFO: 数滴一1ml, 0.876
[2022/08/23 12:04:21] root INFO: 数滴-1ml, 0.895
[2022/08/23 12:04:21] root INFO: 检查小结, 0.996
[2022/08/23 12:04:21] root INFO: 工作城市这种分析上海, 0.998
[2022/08/23 12:04:21] root INFO: 检查日期:2022-6-1, 0.997
[2022/08/23 12:04:21] root INFO: 尿液常规, 0.978
[2022/08/23 12:04:21] root INFO: 检查医生:柯春梅, 0.952
[2022/08/23 12:04:21] root INFO: 检查结果, 0.997
[2022/08/23 12:04:21] root INFO: 单位, 0.998
[2022/08/23 12:04:21] root INFO: 项目名称, 0.998
[2022/08/23 12:04:21] root INFO: 参考标识, 0.989
[2022/08/23 12:04:21] root INFO: 参考范围, 0.995
[2022/08/23 12:04:21] root INFO: 阳性, 0.997
[2022/08/23 12:04:21] root INFO: 亚硝酸盐定性, 0.969
[2022/08/23 12:04:21] root INFO: 阴性, 0.967
[2022/08/23 12:04:21] root INFO: 5.5, 0.995
[2022/08/23 12:04:21] root INFO: 均值约6.5, 0.998
[2022/08/23 12:04:21] root INFO: 酸碱度测定, 0.985
[2022/08/23 12:04:21] root INFO: 1.1, 0.981
[2022/08/23 12:04:21] root INFO: 原比重, 0.863
[2022/08/23 12:04:21] root INFO: 1.002-1.030, 0.982
[2022/08/23 12:04:21] root INFO: 阴性, 0.991
[2022/08/23 12:04:21] root INFO: 尿潜血, 0.966
[2022/08/23 12:04:21] root INFO: 阴性, 0.992
[2022/08/23 12:04:21] root INFO: 白细胞, 0.969
[2022/08/23 12:04:21] root INFO: 阳性, 0.998
[2022/08/23 12:04:21] root INFO: 阴性, 0.996
[2022/08/23 12:04:21] root INFO: 阴性, 0.988
[2022/08/23 12:04:21] root INFO: 蛋白测定, 0.998
[2022/08/23 12:04:21] root INFO: 阴性, 0.994
[2022/08/23 12:04:21] root INFO: 阴性, 0.994
[2022/08/23 12:04:21] root INFO: 阴性, 0.992
[2022/08/23 12:04:21] root INFO: 酮体测定, 0.793
[2022/08/23 12:04:21] root INFO: 阴性, 0.991
[2022/08/23 12:04:21] root INFO: 尿胆原定性, 0.986
[2022/08/23 12:04:21] root INFO: 阴性, 0.992
[2022/08/23 12:04:21] root INFO: 阳性, 0.998
[2022/08/23 12:04:21] root INFO: 胆红素定性, 0.995
[2022/08/23 12:04:21] root INFO: 阴性, 0.993
[2022/08/23 12:04:21] root INFO: 阳性, 0.997
[2022/08/23 12:04:21] root INFO: 葡萄糖测定, 0.990
[2022/08/23 12:04:21] root INFO: 阴性, 0.967
[2022/08/23 12:04:21] root INFO: 应该一起规定发展没有喜欢, 0.998
[2022/08/23 12:04:21] root INFO: 检查小结, 0.993
[2022/08/23 12:04:21] root INFO: 生化室项目一览表, 0.981
[2022/08/23 12:04:21] root INFO: 检查日期:2022-6-1, 0.996
[2022/08/23 12:04:21] root INFO: 检查医生:王秀珍, 0.986
[2022/08/23 12:04:21] root INFO: 10., 0.815
[2022/08/23 12:04:21] root INFO: UL, 0.946
[2022/08/23 12:04:21] root INFO: 79., 0.903
[2022/08/23 12:04:21] root INFO: 快速谷草转氨, 0.959
[2022/08/23 12:04:21] root INFO: 0.00~4, 0.972
[2022/08/23 12:04:21] root INFO: a-经丁酸脱, 0.924
[2022/08/23 12:04:21] root INFO: U/L, 0.732
[2022/08/23 12:04:21] root INFO: 95.00~, 0.919
[2022/08/23 12:04:21] root INFO: 会, 0.886
[2022/08/23 12:04:21] root INFO: 111, 0.997
[2022/08/23 12:04:21] root INFO: mmoVL, 0.867
[2022/08/23 12:04:21] root INFO: 96.00~, 0.949
[2022/08/23 12:04:21] root INFO: 血清蛋白电泳, 0.993
[2022/08/23 12:04:21] root INFO: 0.1, 0.945
[2022/08/23 12:04:21] root INFO: Y:0.10, 0.877
[2022/08/23 12:04:21] root INFO: 69., 0.895
[2022/08/23 12:04:21] root INFO: 4.8, 0.962
[2022/08/23 12:04:21] root INFO: 总铁结合力, 0.998
[2022/08/23 12:04:21] root INFO: 45.00~, 0.917
[2022/08/23 12:04:21] root INFO: 总胆固醇, 0.996
[2022/08/23 12:04:21] root INFO: mmolL, 0.877
[2022/08/23 12:04:21] root INFO: umovL, 0.791
[2022/08/23 12:04:21] root INFO: 3.10~5, 0.985
[2022/08/23 12:04:21] root INFO: 119, 0.997
[2022/08/23 12:04:21] root INFO: 钠, 0.894
[2022/08/23 12:04:21] root INFO: 136.00, 0.981
[2022/08/23 12:04:21] root INFO: 肌钙蛋白!, 0.888
[2022/08/23 12:04:21] root INFO: 阳性, 0.988
[2022/08/23 12:04:21] root INFO: mmoyL, 0.919
[2022/08/23 12:04:21] root INFO: 阴性, 0.994
[2022/08/23 12:04:21] root INFO: UIL, 0.760
[2022/08/23 12:04:21] root INFO: 载脂蛋白AI, 0.895
[2022/08/23 12:04:21] root INFO: 0.90~2, 0.974
[2022/08/23 12:04:21] root INFO: 快速尿膜淀粉, 0.889
[2022/08/23 12:04:21] root INFO: 308, 0.998
[2022/08/23 12:04:21] root INFO: 1.3, 0.978
[2022/08/23 12:04:21] root INFO: 0.00~1, 0.906
[2022/08/23 12:04:21] root INFO: 检查小结, 0.992
[2022/08/23 12:04:21] root INFO: 些业务觉得以上, 0.924
[2022/08/23 12:04:21] root INFO: XX医院体检中心体检报告, 0.979
[2022/08/23 12:04:23] root INFO: The visualized image saved in ./inference_results/20220623110401-1001.png
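Since the end goal is structured data, recognized lines such as 姓名:张某某 still need to be parsed into fields. A purely illustrative post-processing sketch over (text, confidence) pairs like the ones above (the field handling and the confidence threshold are assumptions, not part of the project code):

import re

def structure_report(rec_lines, min_score=0.8):
    # Turn recognized "key:value" strings into a dict, dropping low-confidence lines.
    record = {}
    for text, score in rec_lines:
        if score < min_score:
            continue
        m = re.match(r"(.+?)[::](.+)", text)  # both full-width and ASCII colons occur
        if m:
            record[m.group(1).strip()] = m.group(2).strip()
    return record

print(structure_report([("姓名:张某某", 0.982), ("性别:男", 0.982), ("年龄:40", 0.997)]))
# -> {'姓名': '张某某', '性别': '男', '年龄': '40'}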

PP-Structure Layout Analysis and Recognition of the Report

PP-Structure is an OCR toolkit for analyzing and processing documents with complex structure. Its main features:

  • Layout analysis of document images, dividing them into five region types: text, title, table, figure and list (together with Layout-Parser)
  • Extraction of text, title, figure and list regions as text fields (together with PP-OCR)
  • Structured analysis of table regions, with the result exported to an Excel file
  • Both a Python whl package and a command-line interface, easy to use
  • Custom training for both layout analysis and table structure recognition

Environment setup

1. Install Layout-Parser

!pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl
!pip install "paddleocr>=2.2" --no-deps -r requirements.txt
!pip install PyMuPDF

2. Import the libraries

import datetime
import os
import fitz  # fitz comes from pip install PyMuPDF
import cv2
import shutil
from paddleocr import PPStructure,draw_structure_result,save_structure_res
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:241: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=np.bool)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:256: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.bool)

Layout analysis

Layout analysis classifies the regions of a document image. This part covers using the layout tool from a Python script, extracting detection boxes of a given category, performance, and custom training of the layout model.

!pwd
import cv2
import layoutparser as lp
image = cv2.imread('20220623110401-0.png')
image = image[..., ::-1]

# Load the layout model
model = lp.PaddleDetectionLayoutModel(config_path="lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config",
                                threshold=0.5,
                                label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"},
                                enforce_cpu=False,
                                enable_mkldnn=True)
# Run layout detection
layout = model.detect(image)

# Visualize the result
show_img = lp.draw_box(image, layout, box_width=3, show_element_type=True)
/home/aistudio


download https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar to /home/aistudio/.paddledet/inference_model/ppyolov2_r50vd_dcn_365e_publaynet/ppyolov2_r50vd_dcn_365e_publaynet_infer/ppyolov2_r50vd_dcn_365e_publaynet.tar
100%|██████████| 221M/221M [00:03<00:00, 71.7MiB/s]
# Final layout analysis result; check the Table and Figure regions
show_img

[Output image: output_67_0.png]
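To pull out only specific regions, for example the table areas that feed the table-recognition step below, the layout result can be filtered by block type (a sketch assuming the layoutparser TextBlock API with .type, .score and .coordinates):

# keep only the blocks classified as tables
table_blocks = [b for b in layout if b.type == "Table"]

for block in table_blocks:
    # the block coordinates can be used to crop the region out of `image`
    x1, y1, x2, y2 = map(int, block.coordinates)
    print("table region:", (x1, y1, x2, y2), "score:", round(block.score, 3))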

Table recognition

After layout analysis, table recognition converts a table image into an Excel document. It involves detecting and recognizing the table text and predicting the table structure and cell coordinates.

!python -m pip install paddlepaddle==2.1.2
# Convert prediction results into documents (testing the images under a single folder)
!pwd
table_engine = PPStructure(show_log=True)
save_folder = './result'
img_dir = './imgs'

files = os.listdir(img_dir)  
for fi in files:
    # iterate over the sub-directory for each report
    # print(fi)
    fi_d = os.path.join(img_dir,fi)  
    # print(fi_d)  
    for img in os.listdir(fi_d):
        img_path = os.path.join(fi_d,img)
        img = cv2.imread(img_path)
        result = table_engine(img)
        # save results under the sub-directory for this image
        save_structure_res(result, os.path.join(save_folder,fi),os.path.basename(img_path).split('.')[0])
/home/aistudio
[2022/08/28 22:57:53] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir=None, cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_fce_box_type='poly', det_limit_side_len=960, det_limit_type='max', det_model_dir='/home/aistudio/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_box_type='quad', det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, image_orientation=False, ir_optim=True, kie_algorithm='LayoutXLM', label_list=['0', '180'], lang='ch', layout=True, layout_dict_path='/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddleocr/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt', layout_model_dir='/home/aistudio/.paddleocr/whl/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer', layout_nms_threshold=0.5, layout_score_threshold=0.5, max_batch_size=10, max_text_length=25, merge_no_span_structure=True, min_subgraph_size=15, mode='structure', ocr=True, ocr_order_method=None, ocr_version='PP-OCRv3', output='./output', precision='fp32', process_id=0, rec=True, rec_algorithm='SVTR_LCNet', rec_batch_num=6, rec_char_dict_path='/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', rec_image_shape='3, 48, 320', rec_model_dir='/home/aistudio/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer', recovery=False, save_crop_res=False, save_log_path='./log_output/', save_pdf=False, scales=[8, 16, 32], ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ser_model_dir=None, shape_info_filename=None, show_log=True, sr_batch_num=1, sr_image_shape='3, 32, 128', sr_model_dir=None, structure_version='PP-Structurev2', table=True, table_algorithm='TableAttn', table_char_dict_path='/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddleocr/ppocr/utils/dict/table_structure_dict_ch.txt', table_max_len=488, table_model_dir='/home/aistudio/.paddleocr/whl/table/ch_ppstructure_mobile_v2.0_SLANet_infer', total_process_num=1, type='ocr', use_angle_cls=False, use_dilation=False, use_gpu=True, use_mp=False, use_onnx=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, use_xpu=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)
[2022/08/28 22:57:56] ppocr DEBUG: dt_boxes num : 28, elapse : 0.02681589126586914
[2022/08/28 22:57:56] ppocr DEBUG: rec_res num  : 28, elapse : 0.048107147216796875
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 47, elapse : 0.0371546745300293
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 47, elapse : 0.08329129219055176
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 62, elapse : 0.045961618423461914
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 62, elapse : 0.10823178291320801
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.005411624908447266
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.0036361217498779297
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.0047397613525390625
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.003546476364135742
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.006383657455444336
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.0035703182220458984
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.010457992553710938
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.004181385040283203
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.011874675750732422
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.009016752243041992
[2022/08/28 22:57:57] ppocr DEBUG: dt_boxes num : 1, elapse : 0.004208803176879883
[2022/08/28 22:57:57] ppocr DEBUG: rec_res num  : 1, elapse : 0.003774404525756836
# View the recognition results
!tree result
result
└── 1
    └── 20220623110401-0
        ├── [136, 1142, 3033, 2449]_0.xlsx
        ├── [138, 2167, 3040, 3259]_0.xlsx
        ├── [140, 392, 3032, 1056]_0.xlsx
        └── res_0.txt

2 directories, 4 files

One of the exported Excel files looks like this:

[Figure: exported Excel file (https://ai-studio-static-online.cdn.bcebos.com/0caa3d4f0da543d99b0ec58246d040435e1beae61c654f95b8c1efbf7a7c4ad0)]

!cat result/1/20220623110401-0/res_0.txt
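The exported spreadsheets can be loaded back into Python for downstream structuring, for example with pandas (the path is one of the files listed above; openpyxl must be installed for xlsx support):

import pandas as pd

xlsx_path = "result/1/20220623110401-0/[136, 1142, 3033, 2449]_0.xlsx"
df = pd.read_excel(xlsx_path)   # requires openpyxl
print(df.head())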

Model Deployment

(1) Paddle Serving deployment

# Install Paddle Serving (CPU)
!pip install paddle-serving-client==0.4.0
!pip install paddle-serving-server==0.4.0 #CPU
!pip install paddle-serving-app==0.2.0
# Save the exported Paddle inference model as a Serving model
import paddle
import paddle_serving_client.io as serving_io
paddle.enable_static()
serving_io.inference_model_to_serving("./my_det_model/inference.pdmodel", serving_server="serving_server", serving_client="serving_client", model_filename=None, params_filename=None)
# Start the Serving service
!python -m paddle_serving_server.serve --model serving_server --thread 10 --port 9292
Prediction. Note: the client code below preprocesses the input as a 28x28 grayscale tensor, following a generic demo; the preprocessing and the feed/fetch names must be adapted to the exported detection model.

from paddle_serving_client import Client
import numpy as np
from PIL import Image

# Connect the client
client = Client()
client.load_client_config("serving_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
def load_image(file):
    im = Image.open(file).convert('L')                        # convert RGB to grayscale ('L'); pixel values in 0~255
    im = im.resize((28, 28), Image.ANTIALIAS)                 # resize to 28x28 (demo size; adapt to the real model input)
    im = np.array(im).reshape(1, 28, 28).astype(np.float32)   # reshape into a numpy array matching the feed format
    # print(im)
    im = im / 255.0 * 2.0 - 1.0                               # normalize to [-1, 1]
    return im

img = load_image('imgs/1/20220623110401-0.png')
fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0.tmp_0"])
fetch_map = np.argsort(fetch_map['save_infer_model/scale_0.tmp_0'][0])
print(fetch_map)

(2) ONNX model export

Reference: https://aistudio.baidu.com/aistudio/projectdetail/2250546?channelType=0&channel=0

# Install the environment
!pip install onnx==1.10.1 onnxruntime-gpu==1.10 paddle2onnx
!paddle2onnx --model_dir ./stac --model_filename model.pdmodel --params_filename model.pdiparams --opset_version 11 --save_file result.onnx
# Model conversion with Paddle2ONNX; replace the directories as needed
%cd ~
!paddle2onnx \
    --model_dir my_rec_model \
    --model_filename __model__ \
    --params_filename __params__ \
    --save_file inference_model/inference.onnx \
    --opset_version 12 \
    --enable_onnx_checker True
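Once the ONNX file exists, it can be sanity-checked with onnxruntime (a sketch; the input name and shape are read from the session, and the dummy 3x48x320 tensor is only an assumed recognition-style input to verify that the graph runs):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("inference_model/inference.onnx")
inp = sess.get_inputs()[0]
print("input:", inp.name, inp.shape)

dummy = np.random.rand(1, 3, 48, 320).astype("float32")  # replace with real preprocessed crops
outputs = sess.run(None, {inp.name: dummy})
print([o.shape for o in outputs])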


 

Author homepage: https://blog.csdn.net/qq_36816848. Feedback and questions about the project and its methods are welcome!
