SpeakStream: Streaming Text-to-Speech with Interleaved Data

With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem because they need entire utterances to generate stylistic audio. In this paper we present a ‘streaming’ TTS that can generate audio from…
Apple Machine Learning Research

Author: admin
