NEURAL TEXT-TO-SPEECH FOR UZBEK WITH PROSODY TRANSFER AND SPEAKER ADAPTATION
Publisher: Bright Mind Publishing
Abstract
We present an open, data-efficient Uzbek text-to-speech (TTS) system that combines a non-autoregressive acoustic model with a prosody encoder and few-shot speaker adaptation. Rule-based text normalization and grapheme-to-phoneme conversion handle the challenges of Uzbek orthography (Latin and Cyrillic scripts), agglutinative morphology, and interrogative clitics. On 55 hours of speech, the proposed model improves mean opinion score (MOS), reduces ASR-based character error rate (CER), and transfers reference prosody across voices with minimal data. We also release recipes, tokenizers, and evaluation metrics to support reproducible benchmarking and rapid local adaptation.
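
For readers unfamiliar with the ASR-based CER metric mentioned in the abstract, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the input text and the ASR transcript of the synthesized audio, divided by the reference length. The function names `edit_distance` and `cer` are illustrative assumptions, not part of the released recipes, and the article's own evaluation code may apply additional text normalization before scoring.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the input text versus an ASR transcript
    # of the synthesized audio, differing in one character.
    print(cer("salom", "salon"))
```

In a TTS evaluation, `reference` would be the normalized input text and `hypothesis` the transcript produced by an ASR system from the synthesized audio; a lower value suggests more intelligible speech. Both strings should be written in the same script (Latin or Cyrillic) before comparison, since otherwise every character counts as an error.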