Pronunciation assessment

Automatic pronunciation assessment uses computer speech recognition to determine how accurately speech has been pronounced,^[1]^[2] instead of relying on a human instructor or proctor.^[3] It is also called speech verification, pronunciation evaluation, and pronunciation scoring.^[4] This technology is used to grade speech quality, for computer-aided pronunciation teaching (CAPT) in computer-assisted language learning (CALL), for speaking skill remediation, and for accent reduction.^[4]

Pronunciation assessment differs from dictation or automatic transcription: instead of determining unknown speech, it verifies learners' pronunciation of known words, often from prior transcription of the same utterance, ideally scoring the intelligibility of the learners' speech.^[5]^[6] Sometimes pronunciation assessment also evaluates the prosody of the learners' speech, such as intonation, pitch, tempo, rhythm, and syllable and word stress, although those are usually not essential for being understood in most languages.^[7] Pronunciation assessment is also used in reading tutoring, for example in products from Google,^[8] Microsoft,^[9]^[10] and Amira Learning.^[11] Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia.^[12]

Intelligibility

The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility,^[13] a shortcoming corrected in 2011 at the Toyohashi University of Technology.^[14] Intelligibility measurement has since been included in the Versant high-stakes English fluency assessment from Pearson^[15] and in mobile apps from 17zuoye Education & Technology,^[16] but as of 2023 it was still missing from products from Google Search,^[17] Microsoft,^[18] Educational Testing Service,^[19] Speechace,^[20] and ELSA.^[21] Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias,^[5] especially in high-stakes assessments;^[22]^[23]^[24] from words with multiple correct pronunciations;^[25] and from phoneme coding errors in machine-readable pronunciation dictionaries.^[26] In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels.^[27]

In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores (aggregated from the logits of 10–25 ms audio frames) that correlate closely with genuine listener intelligibility.^[28] In 2023, others assessed intelligibility using dynamic time warping distances between Wav2Vec2 representations of learner speech and of known good speech.^[29]^[30] Further work through 2025 has focused specifically on measuring intelligibility.^[31]^[32]
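The dynamic time warping comparison mentioned above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the toy two-dimensional frames stand in for real Wav2Vec2 embeddings, and the Euclidean frame distance, the length normalization, and the names `dtw_distance`, `reference`, `slow_copy`, and `different` are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(learner, reference):
    """Length-normalized DTW alignment cost between two frame sequences."""
    n, m = len(learner), len(reference)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] is the cheapest cumulative cost of aligning
    # learner[:i] with reference[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(learner[i - 1], reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # learner frame repeats a reference frame
                cost[i][j - 1],      # reference frame repeats a learner frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m] / (n + m)


# Toy frames (assumed data): a reference, a slower rendition of the same
# trajectory, and an unrelated trajectory.
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
slow_copy = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
different = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]

print(dtw_distance(slow_copy, reference))
print(dtw_distance(different, reference))
```

Because the warping path may repeat frames, the slower rendition aligns to the reference at no cost, while the unrelated trajectory yields a positive distance. A lower distance would be read as closer to the exemplar, under the assumption that the embeddings capture pronunciation rather than speaker identity.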

A 2025 study of 42 pronunciation and speech coaching apps (32 mobile and 10 web) found that none offered intelligibility assessment. Instead, most provided only segmental and accent-focused scoring. About two-thirds of the apps provided some form of specific pronunciation feedback, usually with phonetic transcriptions, but visual cues (such as animations of the vocal tract or of the lips and tongue from the front) accompanied that feedback in only about 5% of the apps. Fewer than a third provided feedback on learner perception of exemplar speech.^[33]

Evaluation

Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use for improving assessment quality.^[34]^[35]^[36]^[37] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility evident from blinded listener transcriptions.^[6] As of mid-2025, state-of-the-art approaches for automatically transcribing phonemes typically achieve an error rate of about 10% on known good speech.^[38]^[39]^[40]^[41]
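A phoneme error rate such as the one quoted above is commonly computed as the edit distance between the recognized and reference phoneme sequences, divided by the reference length. The following Python sketch assumes that definition (the cited works may define the rate differently), and the ARPAbet-style phoneme strings are made-up illustrative data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of an extra token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumed example: "speech scoring" with one vowel substituted (IY -> IH).
ref_phonemes = "S P IY CH S K AO R IH NG".split()
hyp_phonemes = "S P IH CH S K AO R IH NG".split()

per = phoneme_error_rate(ref_phonemes, hyp_phonemes)
print(f"{per:.0%}")  # one error in ten reference phonemes
```

Normalizing by the reference length means the rate can exceed 100% when the recognizer inserts many spurious phonemes.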

Ethical issues in pronunciation assessment are present in both human and automatic methods. Authentic validity, fairness, and mitigating bias in evaluation are all crucial. Diverse speech data should be included in automatic pronunciation assessment models. Combining human judgments, especially listener transcriptions, with automated feedback can improve accuracy and fairness.^[42]

Second language learners benefit substantially from their use of common speech recognition systems for dictation, virtual assistants, and AI chatbots.^[43] In such systems, users naturally try to correct their own errors when they notice them in the speech recognition results. Such use improves their grammar and vocabulary development along with their pronunciation skills. The extent to which explicit pronunciation assessment and remediation approaches improve on such self-directed interactions remains an open question.^[43]

Recent developments

During 2021–22, a smartphone-based CAPT system was used to sense articulation through both audible and inaudible signals, providing feedback at the phoneme level.^[44]^[45]

Some promising areas for improvement which were being developed in 2024 include articulatory feature extraction^[46]^[47]^[48] and transfer learning to suppress unnecessary corrections.^[49] Other interesting advances under development include "augmented reality" interfaces for mobile devices using optical character recognition to provide pronunciation training on text found in user environments.^[50]^[51]

In 2024, audio multimodal large language models were first described as assessing pronunciation.^[52] That work has been carried forward by other researchers in 2025 who report positive results.^[53]^[54] Subsequently, researchers demonstrated pronunciation scoring by providing a language model with textual descriptions of speech, including the speech-to-text transcript, phoneme sequences, pauses, and phoneme sequence matching; this approach can achieve performance similar to multimodal LLMs that analyze raw audio while avoiding their higher computational cost.^[55]
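To make the text-based approach concrete, the sketch below builds a plain-text description of an utterance that could be handed to a text-only language model. It is a hypothetical illustration and not the cited system: the function name `describe_utterance`, the field wording, and the example data are all assumptions, and the phoneme sequence matching is done with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def describe_utterance(transcript, expected, recognized, pauses_sec):
    """Build a plain-text description of an utterance (illustrative format)."""
    matcher = SequenceMatcher(a=expected, b=recognized, autojunk=False)
    lines = [
        f"Transcript: {transcript}",
        f"Expected phonemes: {' '.join(expected)}",
        f"Recognized phonemes: {' '.join(recognized)}",
    ]
    # Report every span where the recognized phonemes depart from the
    # expected ones; matching spans are skipped.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        want = " ".join(expected[i1:i2]) or "(nothing)"
        got = " ".join(recognized[j1:j2]) or "(nothing)"
        lines.append(f"Mismatch ({tag}): expected {want}, heard {got}")
    pauses = ", ".join(f"{p:.2f}s" for p in pauses_sec) or "none"
    lines.append(f"Pauses: {pauses}")
    lines.append(f"Phoneme match ratio: {matcher.ratio():.2f}")
    return "\n".join(lines)


# Assumed example: the learner says "sink" where "think" was expected.
description = describe_utterance(
    transcript="sink",
    expected="TH IH NG K".split(),
    recognized="S IH NG K".split(),
    pauses_sec=[0.4],
)
print(description)
```

The resulting text would be placed in a prompt alongside scoring instructions. How the model is prompted and how its answer is turned into a score are outside the scope of this sketch.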

In 2025, the Duolingo English Test authors published a description of their pronunciation assessment method, purportedly built to measure intelligibility rather than accent imitation.^[56] While achieving a correlation of 0.82 with expert human ratings, very close to inter-rater agreement and outperforming alternative methods, the method is nonetheless based on experts' scores along the six-point CEFR common reference levels scale, instead of actual blinded listener transcriptions.^[56]

Further promising work in 2025 includes assessment feedback aligning learner speech to synthetic utterances using interpretable features, identifying continuous spans of words for remediation feedback;^[57] synthesizing corrected speech matching learners' self-perceived voices, which they prefer and imitate more accurately as corrections;^[58] and streaming such interactions.^[59]

Fiction

The 2012 horror film Prometheus shows android character David 8 learning to pronounce Proto-Indo-European phrases from a holographic virtual tutor.^[60]

References

^ El Kheir, Yassine; et al. (October 2023), Automatic Pronunciation Assessment — A Review, Conference on Empirical Methods in Natural Language Processing, arXiv:2310.13974, S2CID 264426545
^ Lounis, Meriem; Dendani, Bilal; Bahi, Halima (January 2024). "Mispronunciation detection and diagnosis using deep neural networks: a systematic review". Multimedia Tools and Applications. 83 (23): 62793–62827. doi:10.1007/s11042-023-17899-x. Retrieved 12 July 2025.
^ Isaacs, Talia; Harding, Luke (July 2017). "Pronunciation assessment". Language Teaching. 50 (3): 347–366. doi:10.1017/S0261444817000118. ISSN 0261-4448. S2CID 209353525.
^ Ehsani, Farzad; Knodt, Eva (July 1998). "Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm". Language Learning & Technology. 2 (1). University of Hawaii National Foreign Language Resource Center; Michigan State University Center for Language Education and Research: 54–73. doi:10.64152/10125/25032. Retrieved 11 February 2023.
^ Loukina, Anastassia; et al. (September 2015). Pronunciation accuracy and intelligibility of non-native speech. Interspeech 2015. Dresden, Germany: ISCA. pp. 1917–1921. only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations.
^ O’Brien, Mary Grantham; et al. (December 2018). "Directions for the future of technology in pronunciation research and teaching". Journal of Second Language Pronunciation. 4 (2): 182–207. doi:10.1075/jslp.17001.obr. hdl:2066/199273. ISSN 2215-1931. S2CID 86440885. pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility.
^ Eskenazi, Maxine (January 1999). "Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype". Language Learning & Technology. 2 (2): 62–76. doi:10.64152/10125/25043. Retrieved 11 February 2023.
^ "Read Along". Google.com. August 2022. Retrieved 29 September 2025.
^ "Reading Coach". Microsoft.com. March 2025. Retrieved 29 September 2025.
^ Tholfsen, Mike (February 2023). "Reading Coach in Immersive Reader plus new features coming to Reading Progress in Microsoft Teams". Techcommunity Education Blog. Microsoft. Retrieved 12 February 2023.
^ Banerji, Olina (March 2023). "Schools Are Using Voice Technology to Teach Reading. Is It Helping?". EdSurge News. Retrieved 7 March 2023.
^ Hair, Adam; Monroe, Penelope (June 2018). Apraxia world: A speech therapy game for children with speech sound disorders (PDF). Proceedings of the 17th ACM Conference on Interaction Design and Children. pp. 119–131. doi:10.1145/3202185.3202733. ISBN 9781450351522. S2CID 13790002.
^ Bernstein, Jared; et al. (November 1990), "Automatic Evaluation and Training in English Pronunciation", First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, retrieved 11 February 2023, listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score. (Section 2.2.2.)
^ Hiroshi, Kibishi; Nakagawa, Seiichi (August 2011). New feature parameters for pronunciation evaluation in English presentations at international conferences. Interspeech 2011. Florence, Italy: ISCA. pp. 1149–1152. Retrieved 11 February 2023. we investigated the relationship between pronunciation score / intelligibility and various acoustic measures, and then combined these measures.... As far as we know, the automatic estimation of intelligibility has not yet been studied.
^ Bonk, Bill (August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Resources for English Language Learners and Teachers. Pearson English. Archived from the original on 2023-01-27. Retrieved 11 February 2023. you don't need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners.
^ Gao, Yuan; et al. (May 2018). "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art". 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018). pp. 924–927. arXiv:1709.01713. doi:10.1109/IMCEC.2018.8469649. ISBN 978-1-5386-1803-5. S2CID 31125681.
^ Snir, Tal (November 2019). "How do you pronounce quokka? Practice with Search". The Keyword. Google. Retrieved 11 February 2023.
^ "Pronunciation assessment tool". Azure Cognitive Services Speech Studio. Microsoft. Retrieved 11 February 2023.
^ Chen, Lei; et al. (December 2018). Automated Scoring of Nonnative Speech: Using the SpeechRater v. 5.0 Engine. ETS Research Report Series. Vol. 2018. Princeton, NJ: Educational Testing Service. pp. 1–31. doi:10.1002/ets2.12198. ISSN 2330-8516. S2CID 69925114. Retrieved 11 February 2023.
^ Alnafisah, Mutleb (September 2022), "Technology Review: Speechace", Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, vol. 12, St. Catharines, Ontario: Iowa State University Digital Press, ISSN 2380-9566, retrieved 14 February 2023
^ Gorham, Jon; et al. (March 2022). Speech Recognition for English Language Learning (video). Technology in Language Teaching and Learning. Education Solutions. Retrieved 2023-02-14.
^ "Computer says no: Irish vet fails oral English test needed to stay in Australia". The Guardian. Australian Associated Press. 8 August 2017. Retrieved 12 February 2023.
^ Ferrier, Tracey (9 August 2017). "Australian ex-news reader with English degree fails robot's English test". The Sydney Morning Herald. Retrieved 12 February 2023.
^ Main, Ed; Watson, Richard (February 2022). "The English test that ruined thousands of lives". BBC News. Retrieved 12 February 2023.
^ Joyce, Katy Spratte (January 2023). "13 Words That Can Be Pronounced Two Ways". Reader's Digest. Retrieved 23 February 2023.
^ E.g., CMUDICT, "The CMU Pronouncing Dictionary". www.speech.cs.cmu.edu. Retrieved 15 February 2023. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat." This mistake is due to the "horse–hoarse merger," often called the "north–force merger."
^ Common European framework of reference for languages learning, teaching, assessment: Companion volume with new descriptors. Language Policy Programme, Education Policy Division, Education Department, Council of Europe. February 2018. p. 136. OCLC 1090351600.
^ Tu, Zehai; Ma, Ning; Barker, Jon (2022). Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction. Interspeech 2022. ISCA. pp. 3493–3497. doi:10.21437/Interspeech.2022-10408. Retrieved 17 December 2023.
^ Anand, Nayan; Sirigiraju, Meenakshi; Yarra, Chiranjeevi (June 2023). "Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations". arXiv:2306.08845 [cs.SD].
^ Shahin, Mostafa; Epps, Julien; Ahmed, Beena (September 2025). "Phonological level wav2vec2-based Mispronunciation Detection and Diagnosis method". Speech Communication. 173 103249. doi:10.1016/j.specom.2025.103249. ISSN 0167-6393. Retrieved 28 September 2025.
^ Geng, Haopeng; Saito, Daisuke; Minematsu, Nobuaki (August 2025). A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 2420–2424.
^ Phukon, Bornali; Zheng, Xiuwen; Hasegawa-Johnson, Mark (August 2025). Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 5708–5712.
^ Walesiak, Beata; Talley, Jim (15 September 2025). "Feedback Mechanisms in Pronunciation and Speech Coaching Apps". Pronunciation in Second Language Learning and Teaching Proceedings. 15 (1). doi:10.31274/psllt.18444. ISSN 2380-9566. Retrieved 25 September 2025.
^ Zhang, Junbo; et al. (August 2021). speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment. Interspeech 2021. ISCA. pp. 3710–3714. arXiv:2104.01378. doi:10.21437/Interspeech.2021-1259. S2CID 233025050. Retrieved 19 February 2023.; GitHub corpus repository.
^ Vidal, Jazmín; et al. (September 2019). EpaDB: A Database for Development of Pronunciation Assessment Systems. Interspeech 2019. ISCA. pp. 589–593. doi:10.21437/Interspeech.2019-1839. hdl:11336/161618. S2CID 202742421. Retrieved 19 February 2023.; database .zip file.
^ Menzel, Wolfgang; et al. (May 2000). The ISLE Corpus of Non-Native Spoken English. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). Athens, Greece: European Language Resources Association. Retrieved 13 August 2025.
^ Zhao, Guanlong; et al. (2018). L2-ARCTIC: A Non-native English Speech Corpus. Interspeech 2018. ISCA. pp. 2783–2787. doi:10.21437/Interspeech.2018-1110. Retrieved 13 August 2025.
^ Zhou, Xuanru; et al. (August 2025). Towards Accurate Phonetic Error Detection Through Phoneme Similarity Modeling. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 4738–4742.
^ Alon, Yonatan (March 2021). "Real-time low-resource phoneme recognition on edge devices". arXiv:2103.13997 [cs.CL].
^ Yeo, Eunjung (October 2022). "wav2vec2-large-english-TIMIT-phoneme_v3". huggingface.co. Seoul National University Spoken Language Processing Lab. Retrieved 19 August 2025.
^ Lee, Jooyoung (June 2024). "wav2vec2-large-lv60_phoneme-timit_english_timit-4k". huggingface.co. Seoul National University Spoken Language Processing Lab. Retrieved 19 August 2025.
^ Babaeian, Ali (2023). "Pronunciation Assessment: Traditional vs Modern Modes". Journal of Education for Sustainable Innovation. 1 (1): 61–68. doi:10.56916/jesi.v1i1.530. Retrieved 2024-12-31.
^ Akhter, Elmoon (June 2025). "The Impact of Human-Machine Interaction on English Pronunciation and Fluency: Case Studies Using AI Speech Assistants". Review of Applied Science and Technology. 4 (2): 473–500. doi:10.63125/1wyj3p84.
^ Wong, Aslan B.; Chen, Xia; Liao, Qianru; Wu, Kaishun (July 2021). "Articulation Motion Sensing for Pronunciation Training". 2021 18th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). pp. 1–2. doi:10.1109/SECON52354.2021.9491610. ISBN 978-1-6654-4108-7.
^ B. Wong, Aslan; Huang, ZiQi; Wu, Kaishun (1 October 2022). "Leveraging audible and inaudible signals for pronunciation training by sensing articulation through a smartphone". Speech Communication. 144: 42–56. doi:10.1016/j.specom.2022.08.002. ISSN 0167-6393.
^ Wu, Peter; et al. (14 February 2023), "Speaker-Independent Acoustic-to-Articulatory Speech Inversion", arXiv:2302.06774 [eess.AS]
^ Cho, Cheol Jun; et al. (January 2024). "Self-Supervised Models of Speech Infer Universal Articulatory Kinematics". arXiv:2310.10788 [eess.AS].
^ Mallela, Jhansi; Aluru, Sai Harshitha; Yarra, Chiranjeevi (February 2024). Exploring the Use of Self-Supervised Representations for Automatic Syllable Stress Detection. National Conference on Communications. Chennai, India. pp. 1–6. doi:10.1109/NCC60321.2024.10486028.
^ Sancinetti, Marcelo; et al. (May 2022). "A Transfer Learning Approach for Pronunciation Scoring". ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6812–6816. arXiv:2111.00976. doi:10.1109/ICASSP43922.2022.9747727. ISBN 978-1-6654-0540-9. S2CID 249437375.
^ Che Dalim, Che Samihah; et al. (February 2020). "Using augmented reality with speech input for non-native children's language learning" (PDF). International Journal of Human-Computer Studies. 134: 44–64. doi:10.1016/j.ijhcs.2019.10.002. S2CID 208098513. Retrieved 28 February 2023.
^ Tolba, Rahma M.; et al. (2023). "Mobile Augmented Reality for Learning Phonetics: A Review (2012–2022)". Extended Reality and Metaverse. Springer Proceedings in Business and Economics. Springer International Publishing. pp. 87–98. doi:10.1007/978-3-031-25390-4_7. ISBN 978-3-031-25389-8. Retrieved 28 February 2023.
^ Fu, Kaiqi; et al. (July 2024). "Pronunciation Assessment with Multi-modal Large Language Models". arXiv:2407.09209 [cs.CL]. Note that Speak.com produced an earlier commercial system that they had not described in technical detail.
^ Ma, Rao; et al. (May 2025). "Assessment of L2 Oral Proficiency using Speech Large Language Models". arXiv:2505.21148 [cs.CL].
^ Shankar, Natarajan Balaji; et al. (August 2025). Leveraging ASR and LLMs for Automated Scoring and Feedback in Children's Spoken Language Assessments. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 1–5.
^ Chen, Hongjie; et al. (September 2025). "TextPA: Pronunciation Assessment through Texts". arXiv:2509.14187 [eess.AS].
^ Cai, Danwei; et al. (July 2025). "Developing an Automatic Pronunciation Scorer: Aligning Speech Evaluation Models and Applied Linguistics Constructs". Language Learning. 75: 170–203. doi:10.1111/lang.70000. Proficiency [is] estimated by an ML classifier trained to predict the human CEFR rating of a speaking response
^ McGhee, Charles; Gales, Mark J. F.; Knill, Kate M. (August 2025). Comparative Pronunciation Assessment and Feedback with Interpretable Speech Features. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 36–40.
^ Yamanaka, Ryoga; et al. (August 2025). Synthesizing True Golden Voices to Enhance Pronunciation Training for Individual Language Learners. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 209–213.
^ Nguyen, Tuan-Nam; et al. (August 2025). Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 4163–4167.
^ Recitation of Schleicher's Fable in Proto-Indo-European from "Prometheus" [subtitled & translated]. June 2012. Event occurs at 0:44. Retrieved 28 September 2025.
^ Mathad, Vikram C.; et al. (2021). The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation. Interspeech 2021. ISCA. pp. 176–180. doi:10.21437/interspeech.2021-1403. Retrieved 10 March 2023.

External links

International Speech Communication Association (ISCA) Special Interest Group on Speech and Language Technologies in Education (SLaTE)
ISCA SLaTE 2025 Workshop's Speak & Improve Challenge: Spoken Language Assessment and Feedback

[1] El Kheir, Yassine; et al. (October 2023), Automatic Pronunciation Assessment — A Review, Conference on Empirical Methods in Natural Language Processing, arXiv:2310.13974, S2CID 264426545

[2] Lounis, Meriem; Dendani, Bilal; Bahi, Halima (January 2024). "Mispronunciation detection and diagnosis using deep neural networks: a systematic review". Multimedia Tools and Applications. 83 (23): 62793–62827. doi:10.1007/s11042-023-17899-x. Retrieved 12 July 2025.

[3] Isaacs, Talia; Harding, Luke (July 2017). "Pronunciation assessment". Language Teaching. 50 (3): 347–366. doi:10.1017/S0261444817000118. ISSN 0261-4448. S2CID 209353525.

[EhsaniKnodt1998-4] Ehsani, Farzad; Knodt, Eva (July 1998). "Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm". Language Learning & Technology. 2 (1). University of Hawaii National Foreign Language Resource Center; Michigan State University Center for Language Education and Research: 54–73. doi:10.64152/10125/25032. Retrieved 11 February 2023.

[Loukina2015-5] Loukina, Anastassia; et al. (September 2015). Pronunciation accuracy and intelligibility of non-native speech. Interspeech 2015. Dresden, Germany: ISCA. pp. 1917–1921. only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations.

[obrien-6] O’Brien, Mary Grantham; et al. (December 2018). "Directions for the future of technology in pronunciation research and teaching". Journal of Second Language Pronunciation. 4 (2): 182–207. doi:10.1075/jslp.17001.obr. hdl:2066/199273. ISSN 2215-1931. S2CID 86440885. pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility.

[7] Eskenazi, Maxine (January 1999). "Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype". Language Learning & Technology. 2 (2): 62–76. doi:10.64152/10125/25043. Retrieved 11 February 2023.

[8] "Read Along". Google.com. August 2022. Retrieved 29 September 2025.

[9] "Reading Coach". Microsoft.com. March 2025. Retrieved 29 September 2025.

[10] Tholfsen, Mike (February 2023). "Reading Coach in Immersive Reader plus new features coming to Reading Progress in Microsoft Teams". Techcommunity Education Blog. Microsoft. Retrieved 12 February 2023.

[11] Banerji, Olina (March 2023). "Schools Are Using Voice Technology to Teach Reading. Is It Helping?". EdSurge News. Retrieved 7 March 2023.

[12] Hair, Adam; Monroe, Penelope (June 2018). Apraxia world: A speech therapy game for children with speech sound disorders (PDF). Proceedings of the 17th ACM Conference on Interaction Design and Children. pp. 119–131. doi:10.1145/3202185.3202733. ISBN 9781450351522. S2CID 13790002.

[13] Bernstein, Jared; et al. (November 1990), "Automatic Evaluation and Training in English Pronunciation", First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, retrieved 11 February 2023, listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score. (Section 2.2.2.)

[14] Hiroshi, Kibishi; Nakagawa, Seiichi (August 2011). New feature parameters for pronunciation evaluation in English presentations at international conferences. Interspeech 2011. Florence, Italy: ISCA. pp. 1149–1152. Retrieved 11 February 2023. we investigated the relationship between pronunciation score / intelligibility and various acoustic measures, and then combined these measures.... As far as we know, the automatic estimation of intelligibility has not yet been studied.

[15] Bonk, Bill (August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Resources for English Language Learners and Teachers. Pearson English. Archived from the original on 2023-01-27. Retrieved 11 February 2023. you don't need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners.

[16] Gao, Yuan; et al. (May 2018). "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art". 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018). pp. 924–927. arXiv:1709.01713. doi:10.1109/IMCEC.2018.8469649. ISBN 978-1-5386-1803-5. S2CID 31125681.

[17] Snir, Tal (November 2019). "How do you pronounce quokka? Practice with Search". The Keyword. Google. Retrieved 11 February 2023.

[18] "Pronunciation assessment tool". Azure Cognitive Services Speech Studio. Microsoft. Retrieved 11 February 2023.

[19] Chen, Lei; et al. (December 2018). Automated Scoring of Nonnative Speech: Using the SpeechRater v. 5.0 Engine. ETS Research Report Series. Vol. 2018. Princeton, NJ: Educational Testing Service. pp. 1–31. doi:10.1002/ets2.12198. ISSN 2330-8516. S2CID 69925114. Retrieved 11 February 2023.

[20] Alnafisah, Mutleb (September 2022), "Technology Review: Speechace", Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, vol. 12, St. Catharines, Ontario: Iowa State University Digital Press, ISSN 2380-9566, retrieved 14 February 2023

[21] Gorham, Jon; et al. (March 2022). Speech Recognition for English Language Learning (video). Technology in Language Teaching and Learning. Education Solutions. Retrieved 2023-02-14.

[22] "Computer says no: Irish vet fails oral English test needed to stay in Australia". The Guardian. Australian Associated Press. 8 August 2017. Retrieved 12 February 2023.

[23] Ferrier, Tracey (9 August 2017). "Australian ex-news reader with English degree fails robot's English test". The Sydney Morning Herald. Retrieved 12 February 2023.

[24] Main, Ed; Watson, Richard (February 2022). "The English test that ruined thousands of lives". BBC News. Retrieved 12 February 2023.

[25] Joyce, Katy Spratte (January 2023). "13 Words That Can Be Pronounced Two Ways". Reader's Digest. Retrieved 23 February 2023.

[26] E.g., CMUDICT, "The CMU Pronouncing Dictionary". www.speech.cs.cmu.edu. Retrieved 15 February 2023. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat." This mistake is due to the "horse–hoarse merger," often called the "north–force merger."

[27] Common European framework of reference for languages learning, teaching, assessment: Companion volume with new descriptors. Language Policy Programme, Education Policy Division, Education Department, Council of Europe. February 2018. p. 136. OCLC 1090351600.

[28] Tu, Zehai; Ma, Ning; Barker, Jon (2022). Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction. Interspeech 2022. ISCA. pp. 3493–3497. doi:10.21437/Interspeech.2022-10408. Retrieved 17 December 2023.

[29] Anand, Nayan; Sirigiraju, Meenakshi; Yarra, Chiranjeevi (June 2023). "Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations". arXiv:2306.08845 [cs.SD].

[30] Shahin, Mostafa; Epps, Julien; Ahmed, Beena (September 2025). "Phonological level wav2vec2-based Mispronunciation Detection and Diagnosis method". Speech Communication. 173 103249. doi:10.1016/j.specom.2025.103249. ISSN 0167-6393. Retrieved 28 September 2025.

[31] Geng, Haopeng; Saito, Daisuke; Minematsu, Nobuaki (August 2025). A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 2420–2424.

[32] Phukon, Bornali; Zheng, Xiuwen; Hasegawa-Johnson, Mark (August 2025). Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 5708–5712.

[33] Walesiak, Beata; Talley, Jim (15 September 2025). "Feedback Mechanisms in Pronunciation and Speech Coaching Apps". Pronunciation in Second Language Learning and Teaching Proceedings. 15 (1). doi:10.31274/psllt.18444. ISSN 2380-9566. Retrieved 25 September 2025.

[34] Zhang, Junbo; et al. (August 2021). speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment. Interspeech 2021. ISCA. pp. 3710–3714. arXiv:2104.01378. doi:10.21437/Interspeech.2021-1259. S2CID 233025050. Retrieved 19 February 2023.; GitHub corpus repository.

[35] Vidal, Jazmín; et al. (September 2019). EpaDB: A Database for Development of Pronunciation Assessment Systems. Interspeech 2019. ISCA. pp. 589–593. doi:10.21437/Interspeech.2019-1839. hdl:11336/161618. S2CID 202742421. Retrieved 19 February 2023.; database .zip file.

[36] Menzel, Wolfgang; et al. (May 2000). The ISLE Corpus of Non-Native Spoken English. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). Athens, Greece: European Language Resources Association. Retrieved 13 August 2025.

[37] Zhao, Guanlong; et al. (2018). L2-ARCTIC: A Non-native English Speech Corpus. Interspeech 2018. ISCA. pp. 2783–2787. doi:10.21437/Interspeech.2018-1110. Retrieved 13 August 2025.

[38] Zhou, Xuanru; et al. (August 2025). Towards Accurate Phonetic Error Detection Through Phoneme Similarity Modeling. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 4738–4742.

[39] Alon, Yonatan (March 2021). "Real-time low-resource phoneme recognition on edge devices". arXiv:2103.13997 [cs.CL].

[40] Yeo, Eunjung (October 2022). "wav2vec2-large-english-TIMIT-phoneme_v3". huggingface.co. Seoul National University Spoken Language Processing Lab. Retrieved 19 August 2025.

[41] Lee, Jooyoung (June 2024). "wav2vec2-large-lv60_phoneme-timit_english_timit-4k". huggingface.co. Seoul National University Spoken Language Processing Lab. Retrieved 19 August 2025.

[42] Babaeian, Ali (2023). "Pronunciation Assessment: Traditional vs Modern Modes". Journal of Education for Sustainable Innovation. 1 (1): 61–68. doi:10.56916/jesi.v1i1.530. Retrieved 31 December 2024.

[43] Akhter, Elmoon (June 2025). "The Impact of Human-Machine Interaction on English Pronunciation and Fluency: Case Studies Using AI Speech Assistants". Review of Applied Science and Technology. 4 (2): 473–500. doi:10.63125/1wyj3p84.

[44] Wong, Aslan B.; Chen, Xia; Liao, Qianru; Wu, Kaishun (July 2021). "Articulation Motion Sensing for Pronunciation Training". 2021 18th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). pp. 1–2. doi:10.1109/SECON52354.2021.9491610. ISBN 978-1-6654-4108-7.

[45] Wong, Aslan B.; Huang, ZiQi; Wu, Kaishun (1 October 2022). "Leveraging audible and inaudible signals for pronunciation training by sensing articulation through a smartphone". Speech Communication. 144: 42–56. doi:10.1016/j.specom.2022.08.002. ISSN 0167-6393.

[46] Wu, Peter; et al. (14 February 2023). "Speaker-Independent Acoustic-to-Articulatory Speech Inversion". arXiv:2302.06774 [eess.AS].

[47] Cho, Cheol Jun; et al. (January 2024). "Self-Supervised Models of Speech Infer Universal Articulatory Kinematics". arXiv:2310.10788 [eess.AS].

[48] Mallela, Jhansi; Aluru, Sai Harshitha; Yarra, Chiranjeevi (February 2024). Exploring the Use of Self-Supervised Representations for Automatic Syllable Stress Detection. National Conference on Communications. Chennai, India. pp. 1–6. doi:10.1109/NCC60321.2024.10486028.

[49] Sancinetti, Marcelo; et al. (May 2022). "A Transfer Learning Approach for Pronunciation Scoring". ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6812–6816. arXiv:2111.00976. doi:10.1109/ICASSP43922.2022.9747727. ISBN 978-1-6654-0540-9. S2CID 249437375.

[50] Che Dalim, Che Samihah; et al. (February 2020). "Using augmented reality with speech input for non-native children's language learning" (PDF). International Journal of Human-Computer Studies. 134: 44–64. doi:10.1016/j.ijhcs.2019.10.002. S2CID 208098513. Retrieved 28 February 2023.

[51] Tolba, Rahma M.; et al. (2023). "Mobile Augmented Reality for Learning Phonetics: A Review (2012–2022)". Extended Reality and Metaverse. Springer Proceedings in Business and Economics. Springer International Publishing. pp. 87–98. doi:10.1007/978-3-031-25390-4_7. ISBN 978-3-031-25389-8. Retrieved 28 February 2023.

[52] Fu, Kaiqi; et al. (July 2024). "Pronunciation Assessment with Multi-modal Large Language Models". arXiv:2407.09209 [cs.CL]. Note that Speak.com produced an earlier commercial system that they had not described in technical detail.

[53] Ma, Rao; et al. (May 2025). "Assessment of L2 Oral Proficiency using Speech Large Language Models". arXiv:2505.21148 [cs.CL].

[54] Shankar, Natarajan Balaji; et al. (August 2025). Leveraging ASR and LLMs for Automated Scoring and Feedback in Children's Spoken Language Assessments. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 1–5.

[55] Chen, Hongjie; et al. (September 2025). "TextPA: Pronunciation Assessment through Texts". arXiv:2509.14187 [eess.AS].

[56] Cai, Danwei; et al. (July 2025). "Developing an Automatic Pronunciation Scorer: Aligning Speech Evaluation Models and Applied Linguistics Constructs". Language Learning. 75: 170–203. doi:10.1111/lang.70000. "Proficiency [is] estimated by an ML classifier trained to predict the human CEFR rating of a speaking response."

[57] McGhee, Charles; Gales, Mark J. F.; Knill, Kate M. (August 2025). Comparative Pronunciation Assessment and Feedback with Interpretable Speech Features. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 36–40.

[58] Yamanaka, Ryoga; et al. (August 2025). Synthesizing True Golden Voices to Enhance Pronunciation Training for Individual Language Learners. 10th Workshop on Speech and Language Technology in Education (SLaTE). Nijmegen, Netherlands: ISCA. pp. 209–213.

[59] Nguyen, Tuan-Nam; et al. (August 2025). Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement. Interspeech 2025. Rotterdam, The Netherlands: ISCA. pp. 4163–4167.

[60] Recitation of Schleicher's Fable in Proto-Indo-European from "Prometheus" [subtitled & translated]. June 2012. Event occurs at 0:44. Retrieved 28 September 2025.

[61] Mathad, Vikram C.; et al. (2021). The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation. Interspeech 2021. ISCA. pp. 176–180. doi:10.21437/Interspeech.2021-1403. Retrieved 10 March 2023.
