0 Background

A streaming speech synthesis model is a TTS model that can return synthesized audio in real time while synthesis is still in progress. By contrast, a non-streaming model only returns its result after the whole sentence has been synthesized.

Clearly, for a point-and-read scenario, streaming speech synthesis greatly improves the user experience.

In the earlier electric-meter and point-and-read series of projects we always used non-streaming models, which is still quite a distance from a real deployment.

The reason was simply that streaming synthesis had not been implemented yet. Now, using the powerful streaming speech synthesis capability provided by the PaddleSpeech project, the streaming model is deployed through ONNXRuntime.

For the detailed steps, see the example project from the Baidu engineering team:
【PaddleSpeech】语音合成-onnx模型使用

This article walks through the example project above, explains how to adapt and simplify OpenVINO's official PaddleOCR deployment example code, and finally combines it with the point-and-read API built earlier into a complete system for reading meter values and serial numbers from a video stream.

Without further ado, here are the result screenshot and demo video.


from IPython.display import Video
Video('2022-07-12 14-17-19.mkv')

0.1 References

Electric-meter series

(Main line)

(Community applications)

  • 【AI达人特训营】PPOCR:电表检测模型的Windows端部署实现
  • 【AI达人特训营】电表检测WEB部署方案
  • 【AI达人特训营】多类别电表读数识别的Windows客户端Web端部署

(Extras)

  • PPOCR+PPDET电表读数和编号检测

Point-and-read series

ONNX deployment

1 Environment setup

Following the AI Studio steps below, install the PaddleSpeech environment locally and prepare the ONNX models for deployment.

# Install PaddleSpeech
!git clone https://gitee.com/paddlepaddle/PaddleSpeech.git
# Enter the PaddleSpeech directory
%cd PaddleSpeech
# Install dependencies
!pip install pytest-runner
# Needs to be installed locally
# !pip install openvino==2022.1.0
!pip install .
# Download the models and unpack them
%cd /home/aistudio/work
!wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip
!wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip
!unzip fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip
!unzip mb_melgan_csmsc_onnx_0.2.0.zip
# Download the nltk data package (skip if the project already contains it)
%cd /home/aistudio
!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz
!tar zxvf data/nltk_data.tar.gz


2 Model overview

2.1 Speech synthesis basics


A speech synthesis pipeline consists of three main modules: the text frontend, the acoustic model, and the vocoder:

  • The text frontend converts raw text into characters/phonemes.
  • The acoustic model converts characters/phonemes into acoustic features such as linear spectrograms, mel spectrograms, or LPC features.
  • The vocoder converts acoustic features into waveforms.

The basic flow is therefore: raw text → text frontend → phonemes → acoustic model → acoustic features → vocoder → waveform.

This article uses a Chinese streaming speech synthesis system built on the FastSpeech2 acoustic model and the Multi-Band MelGAN vocoder:

  • Text frontend: a rule-based Chinese text frontend, optimized for text normalization, polyphones, tone sandhi and other Chinese-specific cases.
  • Acoustic model: the Decoder of FastSpeech2 is modified so the model can synthesize in a streaming fashion.
  • Vocoder: streaming synthesis with GAN vocoders is supported.
  • Inference engine: ONNXRuntime is used to optimize inference performance, so the system reaches RTF < 1 even on low-power CPUs, meeting the requirement for streaming synthesis (a small RTF sketch follows this list).
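
As a minimal sketch of the RTF figure mentioned above (the rtf helper and the numbers in the example call are illustrative, and the 24 kHz default is assumed to match the CSMSC models used below):

# RTF (Real-Time Factor) = time spent synthesizing / duration of the audio produced.
# RTF < 1 means audio is generated faster than it plays back, which is the basic
# requirement for streaming synthesis.
def rtf(synthesis_seconds, num_samples, sample_rate=24000):
    return synthesis_seconds / (num_samples / sample_rate)

print(rtf(0.5, 24000))  # 0.5 s to synthesize 1 s of 24 kHz audio -> RTF = 0.5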

For the detailed, step-by-step model execution, see the project 【PaddleSpeech】语音合成-onnx模型使用; it is not repeated here.

2.2 Text Frontend

A text frontend module mainly consists of:

  • Text segmentation
  • Text normalization (TN)
  • Word segmentation (mainly for Chinese)
  • Part-of-speech tagging (PoS)
  • Prosody prediction
  • Grapheme-to-phoneme conversion (G2P) (grapheme: the smallest meaningful unit of a writing system; phoneme: the smallest unit of sound that distinguishes words)
    • Polyphone disambiguation
    • Tone sandhi
      • "一" / "不" tone sandhi
      • Third-tone sandhi
      • Neutral-tone sandhi
      • Erhua (r-coloring)
      • Dialects

(Before being fed to the acoustic model, the phoneme sequence also has to be converted to ids.)

The most important parts are the text normalization module and the grapheme-to-phoneme module (more commonly referred to as G2P in TTS).

PaddleSpeech Text-to-Speech's text frontend solution:

  • Text normalization
  • G2P (a minimal sketch follows this list)
    • Polyphone module: pypinyin / g2pM
    • Tone sandhi module: word segmentation + rules
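
The following is only an illustration of the G2P step using pypinyin (one of the polyphone backends listed above); it is not PaddleSpeech's actual frontend. The real frontend (Frontend.get_input_ids, used in the streaming script below) additionally maps the phonemes to ids.

# Minimal G2P sketch with pypinyin; Style.TONE3 appends the tone number to each syllable.
from pypinyin import lazy_pinyin, Style

text = "欢迎使用飞桨语音合成系统"
phones = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
print(phones)  # e.g. ['huan1', 'ying2', 'shi3', 'yong4', 'fei1', 'jiang3', ...]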

2.3 Acoustic Model

The acoustic model converts characters/phonemes into acoustic features such as linear spectrograms, mel spectrograms, or LPC features. Acoustic features are produced frame by frame; one frame is typically about 10 ms, and a phoneme usually corresponds to roughly 5–20 frames. What the acoustic model has to solve is a mapping between sequences of unequal length: the same speaker holds different phonemes for different durations, speaks the same sentence at different speeds at different times, and different speakers have their own speaking styles, so the duration of each phoneme varies. This makes it a difficult one-to-many problem.

Acoustic models are broadly divided into autoregressive and non-autoregressive models. An autoregressive model's prediction at time t depends on its output at time t-1, so inference is slow but the audio quality is relatively high; a non-autoregressive model has no such dependency, so inference is fast but the quality is somewhat lower.

The evolution of mainstream acoustic models:

  • Autoregressive models:
    • Tacotron
    • Tacotron2
    • Transformer TTS
  • Non-autoregressive models:
    • FastSpeech
    • SpeedySpeech
    • FastPitch
    • FastSpeech2

In this article we use FastSpeech2 as the acoustic model.


The FastSpeech2 implemented in PaddleSpeech TTS differs from the paper in that it uses phone-level pitch and energy (similar to FastPitch), which makes the synthesized speech more stable.


2.4 Vocoder

The vocoder converts acoustic features into waveforms. What it has to solve is the problem of filling in missing information: phase information is lost when the waveform is turned into a spectrogram, and more information is lost to frequency-domain compression when the spectrogram is turned into a mel spectrogram. Suppose the sample rate is 16 kHz and one frame covers 10 ms: 1 s of audio then contains 16,000 samples and 100 frames, so each frame corresponds to 160 samples. The vocoder's job is to turn one spectral frame into those 160 waveform samples, which is why vocoders usually contain an upsampling module (a quick check of these numbers follows below).
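
A quick check of these numbers; the 24 kHz / 12.5 ms figures for the CSMSC models are an assumption inferred from the voc_upsample = 300 setting in the streaming script later, not stated in the original:

# Samples produced per mel frame = sample_rate * frame_shift.
print(int(16000 * 0.010))   # 160 -> the 16 kHz / 10 ms example above
print(int(24000 * 0.0125))  # 300 -> matches voc_upsample = 300 in the streaming script (assumption)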

Like acoustic models, vocoders are divided into autoregressive and non-autoregressive models; a more detailed taxonomy:

  • Autoregressive
    • WaveNet
    • WaveRNN
    • LPCNet
  • Flow
    • WaveFlow
    • WaveGlow
    • FloWaveNet
    • Parallel WaveNet
  • GAN
    • WaveGAN
    • Parallel WaveGAN
    • MelGAN
    • Style MelGAN
    • Multi Band MelGAN
    • HiFi GAN
  • VAE
    • Wave-VAE
  • Diffusion
    • WaveGrad
    • DiffWave

2.5 PaddleSpeech model list

When choosing one of the speech synthesis models provided by PaddleSpeech, pay attention to the dataset it was trained on: models trained on the CSMSC dataset are Chinese TTS models, while models trained on LJSpeech are English TTS models.

2.5.1 Acoustic Models

| Model Type | Dataset | Example Link | Pretrained Models | Static/ONNX Models | Size (static) |
|---|---|---|---|---|---|
| Tacotron2 | LJSpeech | tacotron2-ljspeech | tacotron2_ljspeech_ckpt_0.2.0.zip | | |
| Tacotron2 | CSMSC | tacotron2-csmsc | tacotron2_csmsc_ckpt_0.2.0.zip | tacotron2_csmsc_static_0.2.0.zip | 103MB |
| TransformerTTS | LJSpeech | transformer-ljspeech | transformer_tts_ljspeech_ckpt_0.4.zip | | |
| SpeedySpeech | CSMSC | speedyspeech-csmsc | speedyspeech_csmsc_ckpt_0.2.0.zip | speedyspeech_csmsc_static_0.2.0.zip, speedyspeech_csmsc_onnx_0.2.0.zip | 13MB |
| FastSpeech2 | CSMSC | fastspeech2-csmsc | fastspeech2_nosil_baker_ckpt_0.4.zip | fastspeech2_csmsc_static_0.2.0.zip, fastspeech2_csmsc_onnx_0.2.0.zip | 157MB |
| FastSpeech2-Conformer | CSMSC | fastspeech2-csmsc | fastspeech2_conformer_baker_ckpt_0.5.zip | | |
| FastSpeech2-CNNDecoder | CSMSC | fastspeech2-csmsc | fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip | fastspeech2_cnndecoder_csmsc_static_1.0.0.zip, fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip, fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip, fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip | 84MB |
| FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 | fastspeech2_nosil_aishell3_ckpt_0.4.zip | fastspeech2_aishell3_static_1.1.0.zip, fastspeech2_aishell3_onnx_1.1.0.zip | 147MB |
| FastSpeech2 | LJSpeech | fastspeech2-ljspeech | fastspeech2_nosil_ljspeech_ckpt_0.5.zip | fastspeech2_ljspeech_static_1.1.0.zip, fastspeech2_ljspeech_onnx_1.1.0.zip | 145MB |
| FastSpeech2 | VCTK | fastspeech2-vctk | fastspeech2_nosil_vctk_ckpt_0.5.zip | fastspeech2_vctk_static_1.1.0.zip, fastspeech2_vctk_onnx_1.1.0.zip | 145MB |

2.5.2 Vocoders

| Model Type | Dataset | Example Link | Pretrained Models | Static/ONNX Models | Size (static) |
|---|---|---|---|---|---|
| WaveFlow | LJSpeech | waveflow-ljspeech | waveflow_ljspeech_ckpt_0.3.zip | | |
| Parallel WaveGAN | CSMSC | PWGAN-csmsc | pwg_baker_ckpt_0.4.zip | pwg_baker_static_0.4.zip, pwgan_csmsc_onnx_0.2.0.zip | 4.8MB |
| Parallel WaveGAN | LJSpeech | PWGAN-ljspeech | pwg_ljspeech_ckpt_0.5.zip | pwgan_ljspeech_static_1.1.0.zip, pwgan_ljspeech_onnx_1.1.0.zip | 4.8MB |
| Parallel WaveGAN | AISHELL-3 | PWGAN-aishell3 | pwg_aishell3_ckpt_0.5.zip | pwgan_aishell3_static_1.1.0.zip, pwgan_aishell3_onnx_1.1.0.zip | 4.8MB |
| Parallel WaveGAN | VCTK | PWGAN-vctk | pwg_vctk_ckpt_0.5.zip | pwgan_vctk_static_1.1.0.zip, pwgan_vctk_onnx_1.1.0.zip | 4.8MB |
| Multi Band MelGAN | CSMSC | MB MelGAN-csmsc | mb_melgan_csmsc_ckpt_0.1.1.zip, mb_melgan_baker_finetune_ckpt_0.5.zip | mb_melgan_csmsc_static_0.1.1.zip, mb_melgan_csmsc_onnx_0.2.0.zip | 7.6MB |
| Style MelGAN | CSMSC | Style MelGAN-csmsc | style_melgan_csmsc_ckpt_0.1.1.zip | | |
| HiFiGAN | CSMSC | HiFiGAN-csmsc | hifigan_csmsc_ckpt_0.1.1.zip | hifigan_csmsc_static_0.1.1.zip, hifigan_csmsc_onnx_0.2.0.zip | 46MB |
| HiFiGAN | LJSpeech | HiFiGAN-ljspeech | hifigan_ljspeech_ckpt_0.2.0.zip | hifigan_ljspeech_static_1.1.0.zip, hifigan_ljspeech_onnx_1.1.0.zip | 49MB |
| HiFiGAN | AISHELL-3 | HiFiGAN-aishell3 | hifigan_aishell3_ckpt_0.2.0.zip | hifigan_aishell3_static_1.1.0.zip, hifigan_aishell3_onnx_1.1.0.zip | 46MB |
| HiFiGAN | VCTK | HiFiGAN-vctk | hifigan_vctk_ckpt_0.2.0.zip | hifigan_vctk_static_1.1.0.zip, hifigan_vctk_onnx_1.1.0.zip | 46MB |
| WaveRNN | CSMSC | WaveRNN-csmsc | wavernn_csmsc_ckpt_0.2.0.zip | wavernn_csmsc_static_0.2.0.zip | 18MB |

With the pretrained models in place, run the streaming synthesis script streaming_tts.py provided with this project on your local machine; if your local sound card is supported, you will hear the speech synthesized by PaddleSpeech.

# This code needs a local environment to run
import onnxruntime as ort
import math
import time
import numpy as np
import pyaudio
import soundfile as sf

from paddlespeech.server.utils.util import denorm, get_chunks
from paddlespeech.server.utils.audio_process import float2pcm
from paddlespeech.t2s.frontend.zh_frontend import Frontend


voc_block = 36
voc_pad = 14
am_block = 72
am_pad = 12
voc_upsample = 300
# Configure the text frontend
phones_dict = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt"
frontend = Frontend(
                phone_vocab_path=phones_dict,
                tone_vocab_path=None)

am_stat_path = r"fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy"
am_mu, am_std = np.load(am_stat_path)

# Model paths
onnx_am_encoder = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx"
onnx_am_decoder = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx"
onnx_am_postnet = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx"
onnx_voc_melgan = "mb_melgan_csmsc_onnx_0.2.0/mb_melgan_csmsc.onnx"

# Run inference on the CPU
providers = ['CPUExecutionProvider']

# Configure the ORT session options
sess_options = ort.SessionOptions()

# Load the models and create ONNXRuntime sessions
am_encoder_infer_sess = ort.InferenceSession(onnx_am_encoder, providers=providers, sess_options=sess_options)
am_decoder_sess = ort.InferenceSession(onnx_am_decoder, providers=providers, sess_options=sess_options)
am_postnet_sess = ort.InferenceSession(onnx_am_postnet, providers=providers, sess_options=sess_options)
voc_melgan_sess = ort.InferenceSession(onnx_voc_melgan, providers=providers, sess_options=sess_options)


def depadding(data, chunk_num, chunk_id, block, pad, upsample):
    """ 
    Streaming inference removes the result of pad inference
    """
    front_pad = min(chunk_id * block, pad)
    # first chunk
    if chunk_id == 0:
        data = data[:block * upsample]
    # last chunk
    elif chunk_id == chunk_num - 1:
        data = data[front_pad * upsample:]
    # middle chunk
    else:
        data = data[front_pad * upsample:(front_pad + block) * upsample]

    return data


def inference_stream(text):
    input_ids = frontend.get_input_ids(
                text,
                merge_sentences=False,
                get_tone_ids=False)
    phone_ids = input_ids["phone_ids"]
    for i in range(len(phone_ids)):
        part_phone_ids = phone_ids[i].numpy()
        voc_chunk_id = 0
        
        orig_hs = am_encoder_infer_sess.run(
            None, input_feed={'text': part_phone_ids})
        orig_hs = orig_hs[0]

        # streaming voc chunk info
        mel_len = orig_hs.shape[1]
        voc_chunk_num = math.ceil(mel_len / voc_block)
        start = 0
        end = min(voc_block + voc_pad, mel_len)

        # streaming am
        hss = get_chunks(orig_hs, am_block, am_pad, "am")
        am_chunk_num = len(hss)
        for i, hs in enumerate(hss):
            am_decoder_output = am_decoder_sess.run(
                None, input_feed={'xs': hs})
            am_postnet_output = am_postnet_sess.run(
                None,
                input_feed={
                    'xs': np.transpose(am_decoder_output[0], (0, 2, 1))
                })
            am_output_data = am_decoder_output + np.transpose(
                am_postnet_output[0], (0, 2, 1))
            normalized_mel = am_output_data[0][0]

            sub_mel = denorm(normalized_mel, am_mu,
                                am_std)
            sub_mel = depadding(sub_mel, am_chunk_num, i,
                                        am_block, am_pad, 1)

            if i == 0:
                mel_streaming = sub_mel
            else:
                mel_streaming = np.concatenate(
                    (mel_streaming, sub_mel), axis=0)

            # streaming voc
            # Once the streaming AM has produced more mel frames than the vocoder chunk size, start streaming vocoder inference
            while (mel_streaming.shape[0] >= end and
                    voc_chunk_id < voc_chunk_num):
                voc_chunk = mel_streaming[start:end, :]

                sub_wav = voc_melgan_sess.run(
                    output_names=None, input_feed={'logmel': voc_chunk})
                sub_wav = depadding(
                    sub_wav[0], voc_chunk_num, voc_chunk_id,
                    voc_block, voc_pad, voc_upsample)

                yield sub_wav

                voc_chunk_id += 1
                start = max(
                    0, voc_chunk_id * voc_block - voc_pad)
                end = min(
                    (voc_chunk_id + 1) * voc_block + voc_pad,
                    mel_len)




if __name__ == '__main__':

    text = "欢迎使用飞桨语音合成系统,测试一下合成效果。"
    # warm up
    # the first onnxruntime call takes longer, so warm up first
    for sub_wav in inference_stream(text="哈哈哈哈"):
        continue

    # playback with pyaudio
    p = pyaudio.PyAudio()
    stream  = p.open(format = p.get_format_from_width(2), # int16
                    channels = 1,
                    rate = 24000,
                    output = True)
    
    # timing
    wavs = []
    t1 = time.time()
    for sub_wav in inference_stream(text):
        print("响应时间:", time.time() - t1)
        t1 = time.time()
        wavs.append(sub_wav.flatten())
        wav = float2pcm(sub_wav)  # float32 to int16
        wav_bytes = wav.tobytes()  # to bytes
        stream.write(wav_bytes)
    
    # close the pyaudio player
    stream.stop_stream()
    stream.close()
    p.terminate()
    
    # export the streaming synthesis result
    wav = np.concatenate(wavs)
    print(wav.shape)
    sf.write("demo_stream.wav",data=wav, samplerate=24000)

3 Electric-meter reading recognition with streaming point-and-read

This system relies heavily on the earlier projects: the electric-meter detection and recognition models (PP-OCRv2) come from the project 【PaddlePaddle+OpenVINO】电表检测识别模型的部署, whose OCR pipeline is greatly simplified here; the GUI front end comes from the project 手把手教你快速打造一个AI识物点读机, from which the camera and video-loading functionality is inherited directly, with its non-streaming speech synthesis replaced by streaming synthesis.

A few key pieces of code are described below.

3.1 Single-image recognition with PaddleOCR + OpenVINO

Among OpenVINO's official example tutorials, the PaddleOCR-with-OpenVINO notebook is 405-paddle-ocr-webcam. It takes a video stream as input, and the OCR logic is tightly coupled to the rest of that project. Since we need to design our own GUI front end, a good approach is to first strip the code down so that only single-image OCR capability remains.

Concretely, remove the unnecessary nesting such as the try: blocks and replace the incoming video frame frame with a single image. The core code is as follows:

def run_paddle_ocr(img_path):
    text_list = ['测试效果']
    processing_times = collections.deque()
    det_request = det_compiled_model.create_infer_request()
    test_pic = cv2.imread(img_path)
    test_image = image_preprocess(test_pic, 1600)
        
    # measure processing time for text detection
    start_time = time.time()
    # perform the inference step
    det_request.infer(inputs={det_input_layer.any_name: test_image})
    det_results = det_request.get_tensor(det_output_layer).data
    stop_time = time.time()

    # Postprocessing for Paddle Detection
    ori_im = test_pic.copy()
    data = {'image': test_pic}
    data_resize = processing.DetResizeForTest(data)
    data_list = []
    keep_keys = ['image', 'shape']
    for key in keep_keys:
        data_list.append(data_resize[key])
    img, shape_list = data_list

    shape_list = np.expand_dims(shape_list, axis=0) 
    pred = det_results[0]    
    if isinstance(pred, paddle.Tensor):
        pred = pred.numpy()
    segmentation = pred > 0.3

    boxes_batch = []
    for batch_index in range(pred.shape[0]):
        src_h, src_w, ratio_h, ratio_w = shape_list[batch_index]
        mask = segmentation[batch_index]
        boxes, scores = processing.boxes_from_bitmap(pred[batch_index], mask,src_w, src_h)
        boxes_batch.append({'points': boxes})
    post_result = boxes_batch
    dt_boxes = post_result[0]['points']
    dt_boxes = processing.filter_tag_det_res(dt_boxes, ori_im.shape)

    processing_times.append(stop_time - start_time)
    # use processing times from the last 400 measurements
    if len(processing_times) > 400:
        processing_times.popleft()
    processing_time_det = np.mean(processing_times) * 1000

    # Preprocess detection results for recognition
    dt_boxes = processing.sorted_boxes(dt_boxes)
    img_crop_list = []   
    if dt_boxes != []:
        for bno in range(len(dt_boxes)):
            tmp_box = copy.deepcopy(dt_boxes[bno])
            img_crop = processing.get_rotate_crop_image(ori_im, tmp_box)
            img_crop_list.append(img_crop)

        # Recognition starts from here
        img_num = len(img_crop_list)
        # Calculate the aspect ratio of all text bars
        width_list = []
        for img in img_crop_list:
            width_list.append(img.shape[1] / float(img.shape[0]))
        # Sorting can speed up the recognition process
        indices = np.argsort(np.array(width_list))
        rec_res = [['', 0.0]] * img_num
        batch_num = 4
        
        # For each detected text box, run inference for text recognition
        for beg_img_no in range(0, img_num, batch_num):
            end_img_no = min(img_num, beg_img_no + batch_num)

            norm_img_batch = []
            max_wh_ratio = 0
            for ino in range(beg_img_no, end_img_no):
                h, w = img_crop_list[indices[ino]].shape[0:2]
                wh_ratio = w * 1.0 / h
                max_wh_ratio = max(max_wh_ratio, wh_ratio)
            for ino in range(beg_img_no, end_img_no):
                norm_img = resize_norm_img(img_crop_list[indices[ino]],max_wh_ratio)
                norm_img = norm_img[np.newaxis, :]
                norm_img_batch.append(norm_img)

            norm_img_batch = np.concatenate(norm_img_batch)
            norm_img_batch = norm_img_batch.copy()

            # Run inference for text recognition 
            rec_request = rec_compiled_model.create_infer_request()
            rec_request.infer(inputs={rec_input_layer.any_name: norm_img_batch})
            rec_results = rec_request.get_tensor(rec_output_layer).data

            # Postprocessing recognition results
            postprocess_op = processing.build_post_process(processing.postprocess_params)
            rec_result = postprocess_op(rec_results)
            for rno in range(len(rec_result)):
                rec_res[indices[beg_img_no + rno]] = rec_result[rno]

        # Text recognition results, rec_res, include two parts:
        # txts are the recognized text results, scores are the recognition confidence level                   
        if rec_res != []:
            image = Image.fromarray(cv2.cvtColor(test_pic, cv2.COLOR_BGR2RGB))
            boxes = dt_boxes
            txts = [rec_res[i][0] for i in range(len(rec_res)) if rec_res[i][1]>0.3] 
            scores = [rec_res[i][1] for i in range(len(rec_res))] 

            # draw text recognition results beside the image
            draw_img = processing.draw_ocr_box_txt(
                image,
                boxes,
                txts,
                scores,
                drop_score=0.3)

            # Visualize PPOCR results
            _, f_width = draw_img.shape[:2]
            fps = 1000 / processing_time_det
            cv2.putText(img=draw_img, text=f"OpenVINO Inference time: {processing_time_det:.1f}ms ({fps:.1f} FPS)", 
                        org=(20, 40),fontFace=cv2.FONT_HERSHEY_COMPLEX, fontScale=f_width / 1000,
                        color=(0, 0, 255), thickness=1, lineType=cv2.LINE_AA)
                        
            draw_img = cv2.cvtColor(draw_img, cv2.COLOR_RGB2BGR)
            cv2.imwrite('test.jpg', draw_img)
            for i in range(len(txts)):
                if len(txts[i])>8:
                    txts[i] = '电表编号是' + str(txts[i])
                else:
                    txts[i] = '电表读数是' + str(txts[i])
            print(txts)
            if len(txts) > 0:
                text_list = txts  
    return text_list

In the point-and-read system this core function needs one more change: besides the recognized text it also has to return the rendered visualization image:

def run_paddle_ocr(frame):
    text_list = ['测试效果']
    processing_times = collections.deque()
    det_request = det_compiled_model.create_infer_request()
    test_image = image_preprocess(frame, 1600)
        
    # measure processing time for text detection
    start_time = time.time()
    # perform the inference step
    det_request.infer(inputs={det_input_layer.any_name: test_image})
    det_results = det_request.get_tensor(det_output_layer).data
    stop_time = time.time()

    # Postprocessing for Paddle Detection
    ori_im = frame.copy()
    data = {'image': frame}
    data_resize = processing.DetResizeForTest(data)
    data_list = []
    keep_keys = ['image', 'shape']
    for key in keep_keys:
        data_list.append(data_resize[key])
    img, shape_list = data_list

    shape_list = np.expand_dims(shape_list, axis=0) 
    pred = det_results[0]    
    if isinstance(pred, paddle.Tensor):
        pred = pred.numpy()
    segmentation = pred > 0.3

    boxes_batch = []
    for batch_index in range(pred.shape[0]):
        src_h, src_w, ratio_h, ratio_w = shape_list[batch_index]
        mask = segmentation[batch_index]
        boxes, scores = processing.boxes_from_bitmap(pred[batch_index], mask,src_w, src_h)
        boxes_batch.append({'points': boxes})
    post_result = boxes_batch
    dt_boxes = post_result[0]['points']
    dt_boxes = processing.filter_tag_det_res(dt_boxes, ori_im.shape)

    processing_times.append(stop_time - start_time)
    # use processing times from the last 400 measurements
    if len(processing_times) > 400:
        processing_times.popleft()
    processing_time_det = np.mean(processing_times) * 1000

    # Preprocess detection results for recognition
    dt_boxes = processing.sorted_boxes(dt_boxes)
    img_crop_list = []   
    if dt_boxes != []:
        for bno in range(len(dt_boxes)):
            tmp_box = copy.deepcopy(dt_boxes[bno])
            img_crop = processing.get_rotate_crop_image(ori_im, tmp_box)
            img_crop_list.append(img_crop)

        # Recognition starts from here
        img_num = len(img_crop_list)
        # Calculate the aspect ratio of all text bars
        width_list = []
        for img in img_crop_list:
            width_list.append(img.shape[1] / float(img.shape[0]))
        # Sorting can speed up the recognition process
        indices = np.argsort(np.array(width_list))
        rec_res = [['', 0.0]] * img_num
        batch_num = 4
        
        # For each detected text box, run inference for text recognition
        for beg_img_no in range(0, img_num, batch_num):
            end_img_no = min(img_num, beg_img_no + batch_num)

            norm_img_batch = []
            max_wh_ratio = 0
            for ino in range(beg_img_no, end_img_no):
                h, w = img_crop_list[indices[ino]].shape[0:2]
                wh_ratio = w * 1.0 / h
                max_wh_ratio = max(max_wh_ratio, wh_ratio)
            for ino in range(beg_img_no, end_img_no):
                norm_img = resize_norm_img(img_crop_list[indices[ino]],max_wh_ratio)
                norm_img = norm_img[np.newaxis, :]
                norm_img_batch.append(norm_img)

            norm_img_batch = np.concatenate(norm_img_batch)
            norm_img_batch = norm_img_batch.copy()

            # Run inference for text recognition 
            rec_request = rec_compiled_model.create_infer_request()
            rec_request.infer(inputs={rec_input_layer.any_name: norm_img_batch})
            rec_results = rec_request.get_tensor(rec_output_layer).data

            # Postprocessing recognition results
            postprocess_op = processing.build_post_process(processing.postprocess_params)
            rec_result = postprocess_op(rec_results)
            for rno in range(len(rec_result)):
                rec_res[indices[beg_img_no + rno]] = rec_result[rno]

        # Text recognition results, rec_res, include two parts:
        # txts are the recognized text results, scores are the recognition confidence level                   
        if rec_res != []:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            boxes = dt_boxes
            txts = [rec_res[i][0] for i in range(len(rec_res)) if rec_res[i][1]>0.3] 
            scores = [rec_res[i][1] for i in range(len(rec_res))] 

            # draw text recognition results beside the image
            draw_img = processing.draw_ocr_box_txt(
                image,
                boxes,
                txts,
                scores,
                drop_score=0.3)

            # Visualize PPOCR results
            _, f_width = draw_img.shape[:2]
            fps = 1000 / processing_time_det            
            draw_img = cv2.cvtColor(draw_img, cv2.COLOR_RGB2BGR)
            draw_img = cv2.resize(draw_img, (800, 600))
            draw_img = cv2.putText(img=draw_img, text=f"OpenVINO Infer time: {processing_time_det:.1f}ms ({fps:.1f} FPS)", 
                        org=(10, 100),fontFace=cv2.FONT_HERSHEY_COMPLEX, fontScale=1.0,
                        color=(0, 0, 255), thickness=1, lineType=cv2.LINE_AA)
            cv2.imwrite('test.jpg', draw_img)
            for i in range(len(txts)):
                if len(txts[i])>8:
                    txts[i] = '电表编号是' + str(txts[i])
                else:
                    txts[i] = '电表读数是' + str(txts[i])
            print(txts)
            if len(txts) > 0:
                text_list = txts 
    return text_list, draw_img


3.2 Connecting streaming synthesis to the PaddleOCR results

The streaming TTS takes a string of text as input, but note that PaddleOCR returns its results as a list. The simple fix is to walk through the list and join its items with punctuation into one string:

text_list = run_paddle_ocr('data/IMG_20210727_091835.jpg')
text = ','.join(text_list)

3.3 Wiring streaming synthesis into the point-and-read app

The key code is as follows:

 # Check whether the read-aloud event has been triggered
if True == self.readEvent.is_set():
    save_time = str(int(time.time()))
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cv2.imwrite('output_%s.jpg' % save_time, frame)

    # text_list = run_paddle_ocr('data/IMG_20210727_091835.jpg')
    a = ','
    text = a.join(text_list)
    for sub_wav in inference_stream(text="哈哈哈哈"):
        continue

    # playback with pyaudio
    p = pyaudio.PyAudio()
    stream  = p.open(format = p.get_format_from_width(2), # int16
                    channels = 1,
                    rate = 24000,
                    output = True)
    
    # timing
    wavs = []
    t1 = time.time()
    for sub_wav in inference_stream(text):
        print("响应时间:", time.time() - t1)
        t1 = time.time()
        wavs.append(sub_wav.flatten())
        wav = float2pcm(sub_wav)  # float32 to int16
        wav_bytes = wav.tobytes()  # to bytes
        stream.write(wav_bytes)
    
    # close the pyaudio player
    stream.stop_stream()
    stream.close()
    p.terminate()
    
    # export the streaming synthesis result
    wav = np.concatenate(wavs)
    print(wav.shape)
    sf.write("output_%s.wav" % save_time,data=wav, samplerate=24000)

4 Summary

This project presented an electric-meter detection, recognition, and point-and-read system built on an ONNX deployment of PaddleSpeech's streaming speech synthesis model. Readers can run it locally on Windows; the code is in the project's SpotReads-OCR.zip file.

Imagine running this system on an Intel AIBOX mounted on an inspection robot: an operator could direct the robot remotely, trigger the read-aloud (or saving) of the meter reading and serial-number recognition result once the camera is lined up, record the target meter's information, and then send the robot on to the next meter, achieving an unattended meter-reading workflow where the whole job can be done from an air-conditioned office.


At this point, the core optimization for the electric-meter series is back to model quality, so it is time to label another batch of data.

Combined with PaddleOCR-v3, readers can look forward to the next upgrade of the point-and-read system.

This article is a repost; original link: https://aistudio.baidu.com/aistudio/projectdetail/4324411
