With the increasing integration of speech front-ends and large language models (LLM), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem because they need entire utterances to generate sytlistic audio. In this paper we present a ‘streaming’ TTS that can generate audio from…
Apple Machine Learning Research