[Competing with Paddle] Speech Synthesis

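Baseline for the 2021 XW Bank Intelligent Speech Competition. As of 2021-11-17, this solution ranked first on the overall leaderboard and second on the works leaderboard.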

AI Studio · Published 2021-11-28 10:08:22

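1. Competition Overview

The 2021 "Chuang Qingchun · Jiaozi Cup" XW Bank FinTech Challenge is open to university students worldwide and to developers within three years of graduation. The innovation track, "Let the Voice of Text Be Heard", focuses on speech synthesis: participants must use TTS (text-to-speech) technology to convert the provided text data into natural, clear, and fluent speech, so that the world can hear the "voice of text".

Participants train a model that takes text as input and produces speech as output. Text-to-speech has many applications in finance and is used in intelligent outbound-call scenarios such as customer service, debt collection, and case investigation. High-quality speech is fluent, places stress well, and pauses in the right places. The task requires participants to train a model that converts text into natural, fluent, high-quality speech; the text to be converted is limited to Chinese.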

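For details, see: https://mp.weixin.qq.com/s/YJyfffufkSTEHrtISVR7Bg

This practice provides a baseline for the competition based on PaddleSpeech. PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of key speech tasks, featuring state-of-the-art and influential models.

If you find this example helpful and would like to learn more about deep learning, you are welcome to follow the one-stop online deep learning encyclopedia published by PaddlePaddle and star the repository so that it is easy to find again: https://github.com/PaddlePaddle/awesome-DeepLearning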

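2. Implementation

This example implements speech synthesis with two models released in the PaddleSpeech toolkit: the FastSpeech2 acoustic model and the Parallel WaveGAN vocoder.

2.1 Environment Setup

Unpack the PaddleSpeech source: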

!unzip /home/aistudio/data/data117155/PaddleSpeech-develop.zip -d /home/aistudio/work/

Install:

%cd /home/aistudio/work/PaddleSpeech-develop/
!pip install -e .

%cd /home/aistudio/work/PaddleSpeech-develop/
!python setup.py install
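
Before moving on, it can help to confirm that the installation made the package visible to Python. The following is only a minimal sketch: the helper name is_importable is hypothetical, and it assumes the top-level package name is paddlespeech, matching the directory name used in the paths later in this article.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python can locate the top-level package (without importing it)."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "paddlespeech" is assumed to be the importable package name.
    print("paddlespeech importable:", is_importable("paddlespeech"))
```

If this prints False, rerunning the install cell in the same environment is a reasonable first thing to try.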

2.2 Training FastSpeech2 on CSMSC

Reference recipe: https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3

2.3 Training Parallel WaveGAN on CSMSC

Reference recipe: https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1

2.4 End-to-End Inference

Download the FastSpeech2 and Parallel WaveGAN models pretrained on the CSMSC dataset:

%cd /home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3
!wget https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip
!unzip pwg_baker_ckpt_0.4.zip
!wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip
!unzip fastspeech2_nosil_baker_ckpt_0.4.zip
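
After unzipping, a quick check that the expected files are present can save a failed synthesis run. The sketch below is an illustration rather than part of PaddleSpeech: the helper name missing_files is hypothetical, and the file list is copied from the arguments of the synthesis command later in this section.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the arguments of the synthesis command in this section.
EXPECTED_FILES = [
    "fastspeech2_nosil_baker_ckpt_0.4/default.yaml",
    "fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz",
    "fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy",
    "fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt",
    "pwg_baker_ckpt_0.4/pwg_default.yaml",
    "pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz",
    "pwg_baker_ckpt_0.4/pwg_stats.npy",
]


def missing_files(root: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files under root."""
    root_path = Path(root)
    return [p for p in relative_paths if not (root_path / p).is_file()]


if __name__ == "__main__":
    tts3_dir = "/home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3"
    print("missing:", missing_files(tts3_dir, EXPECTED_FILES))
```

An empty list means every file referenced by the command was found.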

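End-to-end speech synthesis:

# source the environment
# (each "!" command runs in its own subshell, so variables set here do not persist to later commands)
!source /home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3/path.sh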

# run end-to-end synthesize
%cd /home/aistudio/work/PaddleSpeech-develop
!FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 paddlespeech/t2s/exps/fastspeech2/synthesize_e2e.py \
  --fastspeech2-config=examples/csmsc/tts3/fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=examples/csmsc/tts3/fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
  --fastspeech2-stat=examples/csmsc/tts3/fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
  --pwg-config=examples/csmsc/tts3/pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=examples/csmsc/tts3/pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=examples/csmsc/tts3/pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=paddlespeech/t2s/exps/fastspeech2/../sentences.txt \
  --output-dir=examples/csmsc/tts3/exp/default/test_e2e \
  --inference-dir=examples/csmsc/tts3/exp/default/inference \
  --phones-dict=examples/csmsc/tts3/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt

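The final results are stored in the directory /home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3/exp/default/test_e2e.

For this competition, simply replace the contents of /home/aistudio/work/PaddleSpeech-develop/paddlespeech/t2s/exps/sentences.txt with the contents of the competition's dev.txt.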
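
The replacement can also be scripted. The following is a minimal sketch under stated assumptions: the function name replace_sentences is hypothetical, the location of dev.txt is a placeholder that depends on where the competition data was stored, and the backup step is an addition that the original workflow does not include.

```python
import shutil
from pathlib import Path


def replace_sentences(dev_path: str, sentences_path: str) -> Path:
    """Overwrite sentences.txt with dev.txt, keeping a one-time backup of the original."""
    src = Path(dev_path)
    dst = Path(sentences_path)
    backup = dst.with_name(dst.name + ".bak")
    if dst.exists() and not backup.exists():
        shutil.copyfile(dst, backup)  # preserve the original demo sentences
    shutil.copyfile(src, dst)
    return backup


# Example call (the dev.txt path is hypothetical; adjust it to your data location):
# replace_sentences(
#     "/path/to/dev.txt",
#     "/home/aistudio/work/PaddleSpeech-develop/paddlespeech/t2s/exps/sentences.txt",
# )
```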

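Note that some entries raise errors during inference. To handle them, modify the sentence-loading part of the evaluate function in /home/aistudio/work/PaddleSpeech-develop/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e.py as follows: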

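def evaluate(args, fastspeech2_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            # a line may split into more than 2 items when the sentence contains whitespace
            items = line.strip().split()
            if not items:
                continue  # skip blank lines instead of failing on items[0]
            utt_id = items[0]
            # rejoin the remaining pieces with commas so that no text is dropped
            sentence = ",".join(items[1:])

            sentences.append((utt_id, sentence))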
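
To see what this parsing does in isolation, here is a small standalone sketch. The function names parse_line and load_sentences are hypothetical and not part of PaddleSpeech. The logic mirrors the loop above: the first whitespace-separated token is the utterance ID and the remaining tokens are rejoined with commas. Blank lines are skipped so that an empty line does not raise an IndexError.

```python
from typing import Iterable, List, Tuple


def parse_line(line: str) -> Tuple[str, str]:
    """Split one line into (utt_id, sentence), joining extra pieces with commas."""
    items = line.strip().split()
    if not items:
        raise ValueError("empty line")
    return items[0], ",".join(items[1:])


def load_sentences(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse all non-blank lines."""
    return [parse_line(line) for line in lines if line.strip()]


if __name__ == "__main__":
    print(parse_line("001 你好 世界"))  # ('001', '你好,世界')
```

Joining with a comma turns an internal space into a pause mark rather than discarding the text after it, which appears to be the intent of the modification.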

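Text containing "……" and similar sequences crashes the program during processing. Modify the _merge_erhua function of the Frontend class in paddlespeech/t2s/frontend/zh_frontend.py as follows: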

    def _merge_erhua(self,
                     initials: List[str],
                     finals: List[str],
                     word: str,
                     pos: str) -> List[List[str]]:
        if word not in self.must_erhua and (word in self.not_erhua or
                                            pos in {"a", "j", "nr"}):
            return initials, finals
        new_initials = []
        new_finals = []

        # Length mismatch (e.g. for "……"): print the inputs and return empty lists
        if len(finals) != len(word):
            print(type(finals), finals)
            print(type(word), word)
            return new_initials, new_finals

        assert len(finals) == len(word)
        for i, phn in enumerate(finals):
            if i == len(finals) - 1 and word[i] == "儿" and phn in {
                    "er2", "er5"
            } and word[-2:] not in self.not_erhua and new_finals:
                # fold the trailing "儿" into the previous final, before its tone digit
                new_finals[-1] = new_finals[-1][:-1] + "r" + new_finals[-1][-1]
            else:
                new_finals.append(phn)
                new_initials.append(initials[i])
        return new_initials, new_finals
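
The effect of the length guard and the merge step can be tried outside the frontend with a simplified standalone version. This is only a sketch: the function name merge_erhua_simplified is hypothetical, the must_erhua / not_erhua word lists and the part-of-speech check are deliberately left out, and the sample inputs are hand-written illustrations rather than actual frontend output.

```python
from typing import List, Tuple


def merge_erhua_simplified(initials: List[str], finals: List[str],
                           word: str) -> Tuple[List[str], List[str]]:
    """Merge a trailing "儿" into the previous final; return empty lists on length mismatch."""
    if len(finals) != len(word):
        # Same guard as above: give up instead of failing the assertion.
        return [], []
    new_initials: List[str] = []
    new_finals: List[str] = []
    for i, phn in enumerate(finals):
        is_last = i == len(finals) - 1
        if is_last and word[i] == "儿" and phn in {"er2", "er5"} and new_finals:
            prev = new_finals[-1]
            # Insert "r" before the tone digit of the previous final.
            new_finals[-1] = prev[:-1] + "r" + prev[-1]
        else:
            new_finals.append(phn)
            new_initials.append(initials[i])
    return new_initials, new_finals


if __name__ == "__main__":
    print(merge_erhua_simplified(["x", "h", ""], ["iao3", "ai2", "er5"], "小孩儿"))
    print(merge_erhua_simplified([], [], "……"))
```

The first call merges the final "er5" into "ai2", giving "air2". The second call hits the length guard and returns two empty lists.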

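The competition requires uploaded audio to have a 16 kHz sample rate, whereas this example generates audio at 24 kHz. Convert the sample rate with the following commands: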

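%%bash
# Run the whole loop in one bash cell; separate "!" lines would each start their own subshell.
cd /home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3/exp/default/
# sox does not create the output directory, so make sure it exists
mkdir -p ./HaveATry
for x in ./test_e2e/*.wav
do
  b=${x##*/}
  sox "./test_e2e/$b" -r 16000 "./HaveATry/$b"
done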
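
Before uploading, it may be worth confirming that every converted file really is 16 kHz. The sketch below uses only the Python standard library. The helper names wav_sample_rate and find_wrong_rate are hypothetical, and it assumes the files are PCM-encoded WAV, since the wave module cannot read other encodings.

```python
import wave
from pathlib import Path
from typing import Dict


def wav_sample_rate(path) -> int:
    """Read the sample rate from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getframerate()


def find_wrong_rate(directory: str, expected_rate: int = 16000) -> Dict[str, int]:
    """Map file name to sample rate for every .wav whose rate differs from expected_rate."""
    wrong = {}
    for wav_path in sorted(Path(directory).glob("*.wav")):
        rate = wav_sample_rate(wav_path)
        if rate != expected_rate:
            wrong[wav_path.name] = rate
    return wrong


if __name__ == "__main__":
    out_dir = ("/home/aistudio/work/PaddleSpeech-develop/examples/csmsc/tts3/"
               "exp/default/HaveATry")
    print(find_wrong_rate(out_dir))
```

An empty dictionary means all files in the directory report the expected rate.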