Purpose: To overcome methodological limitations in studying auditory development in young children, we recently developed an observer-based procedure that uses a conditioned, play-based motor response (see Bonino & Leibold, 2017). The purpose of this article was to examine interrater reliability for the method.

Method: Video recordings of test sessions of 2- to 4-year-old children (n = 17) were examined. Detection of a 1000-Hz warble tone was measured with the Play Observer-Based, Two-Interval (PlayO2I) method in each of two conditions: a fixed-intensity-level signal (30 dB SPL) or a variable-intensity-level signal (0–30 dB SPL). All test sessions were scored independently by three observers (one real-time, two offline). Observer consensus was evaluated with Fleiss’ kappa statistic. To determine whether summary data were similar across the observers of each test session, the proportion of correct trials (fixed-level condition) or threshold (variable-level condition) was computed.
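For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss’ kappa can be computed for three observers who each judge which interval contained the signal. It is an illustration only, not the analysis code used in the study; the function name `fleiss_kappa`, the labels "1" and "2", and the example trials are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of trials.

    Each trial is a list of category labels, one per observer.
    Every trial must be rated by the same number of observers (>= 2).
    """
    n_trials = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(trial) != n_raters for trial in ratings):
        raise ValueError("each trial needs the same number (>= 2) of ratings")

    categories = sorted({label for trial in ratings for label in trial})
    counts = [Counter(trial) for trial in ratings]

    # Observed agreement: proportion of agreeing observer pairs on each trial.
    per_trial = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_trial) / n_trials

    # Chance agreement: based on how often each category was used overall.
    shares = [
        sum(c[cat] for c in counts) / (n_trials * n_raters) for cat in categories
    ]
    p_chance = sum(s ** 2 for s in shares)

    if p_chance == 1.0:
        return float("nan")  # undefined when only one category was ever used
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical data: interval ("1" or "2") chosen by three observers on four trials.
example_trials = [
    ["1", "1", "2"],
    ["2", "2", "2"],
    ["1", "1", "1"],
    ["1", "2", "2"],
]
example_kappa = fleiss_kappa(example_trials)
print(round(example_kappa, 3))
```

A value of 1 indicates complete agreement among observers, and a value near 0 indicates agreement no better than expected by chance.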

Results: The strength of observer consensus was classified as “almost perfect” and “substantial” for the fixed-level and variable-level conditions, respectively. Follow-up analysis of the variable-level data indicated that observer consensus differed based on the signal level, the type of response behavior provided by the child, and the confidence level of the real-time observer. Resulting summary data were similar across the three observers of each test session: There were no significant differences in estimates of the proportion of correct trials or threshold.
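The labels “almost perfect” and “substantial” match the descriptive benchmarks commonly attributed to Landis and Koch (1977). Assuming that scale is the one intended, a kappa value can be mapped to a label as sketched below; the function name `landis_koch_label` and the example values are hypothetical and are not results from the study.

```python
def landis_koch_label(kappa):
    """Map a kappa value to a Landis and Koch (1977) descriptive label.

    Assumption: this benchmark scale is the one behind the labels in the text.
    """
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in increasing order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical kappa values, for illustration only.
for value in (0.85, 0.70):
    print(value, landis_koch_label(value))
```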

Conclusions: Results from this study confirm strong interrater reliability for the method. The PlayO2I method is a powerful tool for measuring detection and discrimination abilities in young children.

Supplemental Material S1. (Table) Age and number of trials completed in each condition are provided for individual children. The total number of trials completed by a child is shown in parentheses, and the number outside the parentheses indicates the number of trials that were included in the analyses. Excluded trials were due to equipment malfunction or missing offline coding data.

Supplemental Material S2. (Figure) Based on data from the offline observers, we computed the proportion of trials in which each of the eight auditory behaviors was coded for the fixed-level condition. The pattern of behavior is shown for individual children, rank ordered by age. For reference, the average pattern of performance is provided in the last column (AVG = average). Trials in which no behavior was observed were excluded (< 4% of trials).

Supplemental Material S3. (Table) Fleiss’ kappa statistics for individual test sessions are provided for the variable-level condition. In addition to the overall values computed from all trials, three comparisons were computed by classifying trials according to the signal intensity level relative to the estimated threshold corresponding to the 71%-correct point (at or above versus below), the type of child behavior (full play-based response versus all other behaviors), and the real-time observer’s confidence level (“confident” versus “not confident”).

Supplemental Material S4. (Figure) Based on data from the offline observers, we computed the proportion of trials in which each of the eight auditory behaviors was coded for the variable-level condition. The pattern of behavior is shown for individual children, rank ordered by age. For reference, the average pattern of performance is provided in the last column (AVG = average). Trials in which no behavior was observed were excluded (11.2% of trials).

Supplemental Material S5. (Table) The three-way cross tabulation for the intensity level of the signal, the child response behavior, and the observer’s confidence in the judgement is shown for the variable-level data.

Supplemental Material S6. (Table) Coefficient estimates and corresponding p-values from the Poisson generalized linear model (GLM) are shown. The outcome is the number of trials, and the explanatory variables are the intensity level of the signal, the child response behavior, and the observer’s confidence in the judgement. An asterisk indicates significance at the .05 level, and three asterisks indicate a p-value of less than .001.

Supplemental Material S7. (Figure) Evaluation of interval bias was completed by fitting psychometric functions (lines) to pooled data (circles) from real-time observers (left panel) or offline observers (right panel) by signal position: interval 1 (red) or interval 2 (blue). Symbol size indicates the number of observations. For comparison, data and fits when both intervals are included are provided in black. The appearance of “finger bias” for the real-time observer data indicates a bias towards interval 2 when guessing.

Bonino, A. Y., Wiens, A., Nightengale, E. C., & Vance, E. A. (2020). Interrater reliability for a two-interval, observer-based procedure for measuring hearing in young children. American Journal of Audiology. Advance online publication. https://doi.org/10.1044/2020_AJA-20-00022