Experiments on Forced Phonetic Alignment for Brazilian Portuguese Using Kaldi Tools
Forced alignment. Speech segmentation. Acoustic modeling. Kaldi. Brazilian Portuguese.
Phonetic analysis of speech, in general, requires the alignment of audio samples to its phonetic transcription. This could be done manually for a couple of files, but as the corpus grows large it becomes infeasibly time-consuming. This paper describes the evolution process towards creating free resources for phonetic alignment in Brazilian Portuguese (BP) using Kaldi, a toolkit that achieves state of the art for open-source speech recognition, within a toolkit we call UFPAlign. The contributions of this work are then twofold: developing resources to perform forced alignment in BP, including the release of scripts to train acoustic models via Kaldi, as well as the resources themselves under open licenses; and bringing forth a comparison to other two phonetic aligners that provide resources for BP, namely EasyAlign and Montreal Forced Aligner (MFA), the latter being also Kaldi-based. Evaluation took place in terms of phone boundary and intersection over union metrics over a dataset of 385 hand-aligned utterances, and results show that Kaldi-based aligners perform better overall, and that UFPAlign models are more accurate than MFA’s. Furthermore, complex deep-learning-based approaches still do not improve performance compared to simpler models.