ASHA journals
JSLHR-21-00540naghibolhosseini_SuppS1.mp4 (10.19 MB)

Deep learning for quantifying vocal fold dynamics (Yousef et al., 2022)

posted on 2022-05-23, 22:37 authored by Ahmed M. Yousef, Dimitar D. Deliyski, Stephanie R. C. Zacharias, Alessandro de Alarcon, Robert F. Orlikoff, Maryam Naghibolhosseini

Purpose: Voice disorders are best assessed by examining vocal fold dynamics in connected speech. This can be achieved using flexible laryngeal high-speed videoendoscopy (HSV), which makes it possible to study vocal fold mechanics in high temporal detail. Analysis of vocal fold vibration using HSV requires accurate segmentation of the vocal fold edges. This article presents an automated deep-learning scheme that segments the glottal area in HSV recorded during connected speech; the glottal edges are then derived from the segmented area.

Method: Using a custom-built HSV system, data were obtained from a vocally healthy participant reciting the “Rainbow Passage.” A deep neural network was designed for glottal area segmentation in the HSV data. A hybrid approach recently introduced by the authors was used as an automated labeling tool to train the network on a set of HSV frames in which the glottis region was automatically annotated during vocal fold vibration. The network was then tested against manually segmented frames using two metrics, intersection over union (IoU) and boundary F1 (BF) score, and its performance was assessed on various phonatory events in the HSV sequence.
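To make the two evaluation metrics concrete, the following is a minimal sketch of how IoU and a BF score could be computed for a pair of binary glottis masks. It is not the authors' implementation: the function names, the 4-neighbor boundary definition, and the pixel tolerance `tol` are assumptions chosen for illustration, and the record above does not state the tolerance used in the study.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return np.logical_and(pred, truth).sum() / union

def boundary(mask):
    """Foreground pixels with at least one background 4-neighbor (assumed definition)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def bf_score(pred, truth, tol=2.0):
    """Boundary F1: harmonic mean of boundary precision and recall.

    A boundary pixel counts as matched when a boundary pixel of the other
    mask lies within `tol` pixels (Euclidean distance). `tol` is hypothetical.
    """
    bp = np.argwhere(boundary(pred))
    bt = np.argwhere(boundary(truth))
    if len(bp) == 0 or len(bt) == 0:
        return 1.0 if len(bp) == len(bt) else 0.0
    # Pairwise distances between predicted and reference boundary pixels.
    d = np.linalg.norm(bp[:, None, :] - bt[None, :, :], axis=2)
    precision = (d.min(axis=1) <= tol).mean()
    recall = (d.min(axis=0) <= tol).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: a 4x4 reference region and a prediction shifted one pixel right.
truth = np.zeros((6, 6), dtype=bool)
truth[1:5, 1:5] = True
pred = np.zeros((6, 6), dtype=bool)
pred[1:5, 2:6] = True
print(iou(pred, truth), bf_score(pred, truth))
```

In the toy example the one-pixel shift lowers IoU noticeably while the BF score is unaffected, because every boundary pixel still has a counterpart within the tolerance. This illustrates why the two metrics can differ, as they do in the reported means.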

Results: The designed network was successfully trained using the hybrid approach, without the need for manual labeling, and tested on the manually labeled data. The performance metrics showed a mean IoU of 0.82 and a mean BF score of 0.96. In addition, the evaluation of the network’s performance demonstrated accurate segmentation of the glottal edges/area even during complex nonstationary phonatory events and when the vocal folds were not vibrating, thus overcoming the limitation of the previous hybrid approach, which could only be applied to vibrating vocal folds.

Conclusions: The introduced automated scheme guarantees accurate glottis representation in challenging color HSV data with lower image quality and excessive laryngeal maneuvers during all instances of connected speech. This facilitates the future development of HSV-based measures to assess the running vibratory characteristics of the vocal folds in speakers with and without voice disorders.

Supplemental Material S1. A video displaying the performance of the introduced network during multiple, consecutive vocalized segments in the HSV data during running speech.

Yousef, A. M., Deliyski, D. D., Zacharias, S. R. C., de Alarcon, A., Orlikoff, R. F., & Naghibolhosseini, M. (2022). A deep learning approach for quantifying vocal fold dynamics during connected speech using laryngeal high-speed videoendoscopy. Journal of Speech, Language, and Hearing Research. Advance online publication.


The authors would like to acknowledge support from the National Institute on Deafness and Other Communication Disorders through Grant K01DC017751 (PI: Naghibolhosseini, Maryam), “Studying the Laryngeal Mechanisms Underlying Dysphonia in Connected Speech,” and Grant R01DC007640 (PI: Deliyski, Dimitar), “Efficacy of Laryngeal High-Speed Videoendoscopy.”