ASHA journals

Tunable forced alignment system with deep learning (Kadambi et al., 2025)

Figure posted on 2025-03-31, 19:13, authored by Prad Kadambi, Tristan J. Mahr, Katherine C. Hustad, Visar Berisha
<p dir="ltr"><b>Purpose: </b>Phonetic forced alignment has many applications in the automated analysis of speech, particularly in the study of nonstandard speech such as children’s speech. Manual alignment is tedious but remains the gold standard for clinical-grade alignment, and current tools do not support direct training on manual alignments. Thus, a trainable, speaker-adaptive phonetic forced alignment system, Wav2TextGrid, was developed for children’s speech. The source code for the method, along with a graphical user interface, is publicly available at <a href="https://github.com/pkadambi/Wav2TextGrid" target="_blank">https://github.com/pkadambi/Wav2TextGrid</a>.</p>
<p dir="ltr"><b>Method: </b>We propose a trainable, speaker-adaptive, neural forced aligner developed using a corpus of speech from 42 neurotypical children aged 3 to 6 years. The aligner was evaluated on both child speech and the TIMIT corpus to demonstrate its performance across age and dialectal variations.</p>
<p dir="ltr"><b>Results: </b>The trainable alignment tool markedly improved accuracy over baseline on several alignment quality metrics and for all phoneme categories. Accuracy for plosives and affricates in children’s speech improved by more than 40% over baseline. Performance matched that of existing methods with approximately 13 min of labeled data, while approximately 45–60 min of labeled alignments yielded significant improvement.</p>
<p dir="ltr"><b>Conclusion: </b>The Wav2TextGrid tool enables alternative alignment workflows in which the forced alignments are tailored, through training, to match clinical-grade, manually provided alignments.</p>
<p dir="ltr"><b>Supplemental Material S1.</b> Plot of percentage onset/offset error (empirical CDF).</p>
<p dir="ltr"><b>Supplemental Material S2.</b> Phoneme duration histograms.</p>
<p dir="ltr">Kadambi, P., Mahr, T. J., Hustad, K. C., & Berisha, V. (2025). A tunable forced alignment system based on deep learning: Applications to child speech. 
<i>Journal of Speech, Language, and Hearing Research</i>, <i>68</i>(7S), 3583–3601. <a href="https://doi.org/10.1044/2024_JSLHR-24-00347" target="_blank">https://doi.org/10.1044/2024_JSLHR-24-00347</a></p><p dir="ltr"><b>Publisher Note:</b> This article is part of the Special Issue: Select Papers From the 2024 Conference on Motor Speech—Basic Science and Clinical Innovation.</p>
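
The abstract refers to alignment quality metrics, and Supplemental Material S1 plots onset/offset error as an empirical CDF. As a rough illustration of what such a comparison between a manual and a forced alignment can look like, here is a minimal Python sketch. It is not the Wav2TextGrid API and not necessarily the metric definition used in the article: the `(start, end, label)` tuple layout, the function names, the example timings, and the 20 ms tolerance are all assumptions made for illustration.

```python
from bisect import bisect_right


def boundary_errors(manual, predicted):
    """Return absolute onset and offset errors (in seconds) per phone.

    Both inputs are lists of (start, end, label) tuples covering the same
    phone sequence. This one-to-one pairing is an assumption of the sketch.
    """
    if len(manual) != len(predicted):
        raise ValueError("alignments must contain the same number of phones")
    onset_errors, offset_errors = [], []
    for (m_start, m_end, m_label), (p_start, p_end, p_label) in zip(manual, predicted):
        if m_label != p_label:
            raise ValueError(f"label mismatch: {m_label!r} vs {p_label!r}")
        onset_errors.append(abs(m_start - p_start))
        offset_errors.append(abs(m_end - p_end))
    return onset_errors, offset_errors


def fraction_within(errors, tolerance):
    """Empirical CDF at `tolerance`: the share of errors <= tolerance."""
    if not errors:
        raise ValueError("no errors to summarize")
    ordered = sorted(errors)
    return bisect_right(ordered, tolerance) / len(ordered)


# Hypothetical alignments of one short word (times in seconds).
manual = [(0.00, 0.10, "k"), (0.10, 0.25, "ae"), (0.25, 0.40, "t")]
predicted = [(0.00, 0.11, "k"), (0.11, 0.30, "ae"), (0.30, 0.40, "t")]

onset, offset = boundary_errors(manual, predicted)
tolerance = 0.02  # 20 ms, an illustrative threshold
print(f"onsets within {tolerance * 1000:.0f} ms: {fraction_within(onset, tolerance):.2f}")
print(f"offsets within {tolerance * 1000:.0f} ms: {fraction_within(offset, tolerance):.2f}")
```

Evaluating `fraction_within` over a range of tolerances traces out the empirical CDF; reading the intervals from TextGrid files would require a parser, which is omitted here.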

Funding

This work was funded by National Institute on Deafness and Other Communication Disorders Grant R01DC019645-03, awarded to Katherine C. Hustad.

History