
Tunable forced alignment system with deep learning (Kadambi et al., 2025)

Posted on 2025-03-31; authored by Prad Kadambi, Tristan J. Mahr, Katherine C. Hustad, and Visar Berisha

Purpose: Phonetic forced alignment has many applications in the automated analysis of speech, particularly in studying nonstandard speech such as children’s speech. Manual alignment is tedious but serves as the gold standard for clinical-grade alignment, and current tools do not support direct training on manual alignments. Thus, a trainable, speaker-adaptive phonetic forced alignment system, Wav2TextGrid, was developed for children’s speech. The source code for the method, along with a graphical user interface, is publicly available at https://github.com/pkadambi/Wav2TextGrid.

Method: We propose a trainable, speaker-adaptive, neural forced aligner developed using a corpus of 42 neurotypical children aged 3 to 6 years. Evaluation on both child speech and the TIMIT corpus demonstrated aligner performance across age and dialectal variations.

Results: The trainable alignment tool markedly improved accuracy over baseline on several alignment quality metrics, across all phoneme categories. Accuracy for plosives and affricates in children’s speech improved by more than 40% over baseline. With approximately 13 min of labeled data, performance matched that of existing methods, while approximately 45–60 min of labeled alignments yielded significant improvement.
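As a rough illustration of the kind of alignment quality metric reported here and in Supplemental Material S1 (percentage onset/offset error), the sketch below compares manually labeled and automatically generated phoneme intervals in Python. The Interval format, the one-to-one pairing of phoneme sequences, and the 20 ms tolerance threshold are assumptions for illustration; the paper's exact metric definitions may differ.

    from dataclasses import dataclass

    @dataclass
    class Interval:
        phoneme: str
        onset: float   # seconds
        offset: float  # seconds

    def boundary_errors(manual, auto, tolerance=0.020):
        """Compare paired manual and automatic phoneme intervals.

        Returns mean absolute onset/offset error (seconds) and the
        percentage of boundaries within `tolerance` (20 ms is a common
        threshold in the alignment literature; assumed here).
        """
        assert len(manual) == len(auto), "expects one-to-one phoneme sequences"
        onset_errs = [abs(m.onset - a.onset) for m, a in zip(manual, auto)]
        offset_errs = [abs(m.offset - a.offset) for m, a in zip(manual, auto)]
        all_errs = onset_errs + offset_errs
        within = sum(e <= tolerance for e in all_errs) / len(all_errs)
        return {
            "mean_onset_error": sum(onset_errs) / len(onset_errs),
            "mean_offset_error": sum(offset_errs) / len(offset_errs),
            "pct_within_tolerance": 100.0 * within,
        }

    # Example: a manual gold alignment vs. a hypothetical aligner output.
    manual = [Interval("b", 0.10, 0.18), Interval("aa", 0.18, 0.40)]
    auto = [Interval("b", 0.11, 0.19), Interval("aa", 0.19, 0.38)]
    print(boundary_errors(manual, auto))

Averaging such boundary errors per phoneme category (e.g., plosives, affricates) is one way to arrive at category-level accuracy comparisons like those summarized above.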

Conclusion: The Wav2TextGrid tool enables alternative alignment workflows in which the forced alignments are, through training, directly tailored to match clinical-grade, manually provided alignments.

Supplemental Material S1. Plot of percentage onset/offset error (empirical CDF).

Supplemental Material S2. Phoneme duration histograms.

Kadambi, P., Mahr, T. J., Hustad, K. C., & Berisha, V. (2025). A tunable forced alignment system based on deep learning: Applications to child speech. Journal of Speech, Language, and Hearing Research. Advance online publication. https://doi.org/10.1044/2024_JSLHR-24-00347

Publisher Note: This article is part of the Special Issue: Select Papers From the 2024 Conference on Motor Speech—Basic Science and Clinical Innovation.

Funding

This work was funded by National Institute on Deafness and Other Communication Disorders Grant R01DC019645-03, awarded to Katherine C. Hustad.
