NEURAL TEXT-TO-SPEECH FOR UZBEK WITH PROSODY TRANSFER AND SPEAKER ADAPTATION

Publisher

Bright Mind Publishing

Abstract

In this article, we present an open, data-efficient Uzbek text-to-speech (TTS) system that integrates a non-autoregressive acoustic model with a prosody encoder and few-shot speaker adaptation. Rule-based text normalization and grapheme-to-phoneme conversion handle the challenges of Uzbek's dual Latin/Cyrillic orthography, agglutinative morphology, and interrogative clitics. Trained on 55 hours of speech, the proposed model improves the mean opinion score (MOS), reduces the ASR-based character error rate (CER), and transfers reference prosody across voices with minimal data. We also release recipes, tokenizers, and evaluation metrics to support reproducible benchmarking and rapid local adaptation.
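
As an illustration of the ASR-based CER metric mentioned in the abstract, the following is a minimal sketch of how a character error rate can be computed: the Levenshtein edit distance between a reference text and an ASR transcript of the synthesized audio, divided by the reference length. The function names `edit_distance` and `cer` and the sample strings are assumptions made for this example; the abstract does not describe the released evaluation code, which may normalize text or aggregate over utterances differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted character out of five.
    print(cer("kitob", "kitab"))
```

In this sketch a lower value means the ASR system recovered the input text more faithfully, which is why CER is commonly used as a proxy for the intelligibility of synthesized speech.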
