VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Recent AIGC systems can generate digital multimedia content, such as text, images and video, from human language instructions. However, when it comes to speech, existing methods for human instruction-to-speech generation exhibit two limitations. First, they require the input to be divided into a content prompt (transcript) and a description prompt (style and speaker), instead of directly supporting human instructions. This division is less natural in form and does not align with other AIGC models. Second, using an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthen how closely the generated speech follows human instructions. Furthermore, our model architecture and training strategies allow a speech prompt to be combined with a descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt.
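For readers unfamiliar with Classifier-Free Guidance, the sketch below shows the generic way CFG is commonly applied to next-token logits in an autoregressive model. It is not the VoxInstruct implementation: the abstract only states that multiple CFG strategies are used, and the function names, the guidance scale, and the four-token vocabulary here are all assumptions made for illustration.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits with a guidance scale.

    scale = 1.0 returns the conditional logits unchanged; larger values
    push the distribution further toward the instruction-conditioned one.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a toy 4-token codec vocabulary (assumed values).
cond_logits = [2.0, 1.0, 0.5, -1.0]   # model run with the instruction
uncond_logits = [1.0, 1.0, 1.0, 1.0]  # model run with the instruction dropped

guided = cfg_logits(cond_logits, uncond_logits, scale=2.0)
probs = softmax(guided)
```

With a scale above 1, tokens that the instruction makes more likely gain probability mass relative to plain conditional sampling, which is the general sense in which guidance makes generated output follow the condition more closely.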

The capabilities of the proposed expressive human instruction-to-speech generation model.

Overview: Expressive Human Instruction-to-Speech Generation

To demonstrate the diverse and powerful generation capabilities of VoxInstruct, we present various samples in this section.

Instruction | Speech Prompt | VoxInstruct
Supporting English Human Instruction
The man adopted a high-pitched voice and maintained a moderate energy level in his amazed slow speech."What an impetuous boy he is!" -
In a fast-paced conversation, a sad adult male with low pitch is reflecting on something someone said: "You know one of my most viewed videos so it's a definite issue that a lot of people go through but i guess you know just going on to facebook groups searching up paper whole there's just a bunch of solutions." -
a low pitched female voice, "Delaney had read one or two works on psychic phenomena and understood from them that spirit projection was not only quite feasible but far from uncommon." -
Discussing science and technology, a sad adult male with high pitch and normal energy speaks rapidly. "there are no three hydrogen, because here." -
While whitewashing the fence, a fearful adult male with normal pitch reminisces: "one time when I was whitewashing the fence." this scene belongs to the category of film and animation. -
Speaking calmly, an elderly male with normal pitch and high energy discusses the potential name change of a nfl team, sharing: "in the end, they said they still would have gone on the show had they known there would be a debate but at least one of them wouldn't have worn his red skins jacket which forces the question if they change the team name." -
Describing the news and politics scene, a happy youth with low pitch joyfully mentions:"to lori and benjie sagarin, rabbi michael weinberg, bruce crane and beth sair at temple beth israel." he appreciates their contributions. -
[CaseStudy] A happy old man with low pitch and high energy, speaking quickly, happily recounts his recent activities: “But don’t you always want to be happy, Bruno?" -
[CaseStudy] Engaging in a dialogue, a youthful male with normal pitch saying “But don’t you always want to be happy, Bruno?", drawing attention to “always" by stressing it significantly. -
Supporting Chinese Human Instruction

(trans: The voice is tearful, filled with intense sorrow, somewhat warm yet slightly pleading, and somewhat resolute, saying: "Oh God, save me!")


(trans: Old Lu issued instructions in a tone full of deep affection, with a slight earnest plea, calm yet with a certain expectation. He gently said "Understand", and his tone became heavier when mentioning "eldest senior brother", saying, "Good if you understand, child, put away the donation book, and go wash the horse! Your eldest senior brother is coming soon, right?")


(trans: Filled with joy and a profound sense of happiness, he expresses himself very warmly and contentedly, without hesitation stating that Lina lives with happy memories. His tone slows slightly when he describes "true": "All these make me very happy; this is true happiness. These are my memories, and this is my life!")


(trans: A young man speaks with determination, his voice tinged with anger and a slightly elevated pitch: "Those who cannot let go of anything will ultimately change nothing.")


(trans: A mother gently tells a bedtime story, speaking in a warm and soothing tone, with a slow pace: "Those who cannot let go of anything will ultimately change nothing.")


(trans: A comedian shouted on stage with a smug look, "Guess what? This guy is really good!")


(trans: A young man speaks with a low, steady voice as he says, "As the bronze figure moved, it suddenly struck the 'Feiyu point' under Zhang Yuhu's ribs." His voice maintains a moderate volume and a calm pace, with a hint of subtle surprise in his tone. Overall, he conveys a somewhat serious and resolute impression.)

Supporting Multilingual Human Instruction
“Donald Trump thinks he's making America great again. ”一个老年男主持人在电视上轻松愉快地调侃道,语速较慢,表现出震惊的情绪。

(trans: "Donald Trump thinks he's making America great again," quips a senior male host on television, in a light-hearted and cheerful tone, with a slower pace and a hint of astonishment in his expression.)

以中年男性的声音,声音中充满沉思,带有递进的情绪,较为亲切却又略有些坚定,"To be or not to be, that is the question."

(trans: In the voice of a middle-aged man, filled with contemplation and a progressive emotional tone, somewhat warm yet slightly resolute, he says, "To be or not to be, that is the question.")

An old female is performing on a talk show, “中国有句古话,叫做识时务者为俊杰,但我不能理解是什么意思。”, speaking at a rapid pace.

(trans: An old female is performing on a talk show, “There's an old Chinese saying, 'A wise man adapts himself to circumstances, as water shapes itself to the vessel,' but I can't understand what it means.”, speaking at a rapid pace.)

男声以较低的音高,较低的音量,充满喜悦并感到深刻的幸福,表现得非常亲切与满足,“I have a dream, 世界会更好。”

(trans: A male voice, with a lower pitch and volume, filled with joy and a profound sense of happiness, speaks very warmly and contentedly, saying, "I have a dream, the world will be better.")

With a sense of sadness, a young boy with high pitch and low volume speaks at a normal speed, expressing his feelings, saying, "我有一种爱。那种当你准备好一盆Epsom盐浴时的感觉"

(trans: With a sense of sadness, a young boy with a high pitch and low volume speaks at a normal speed, expressing his feelings, saying, "I have a kind of love. It's the kind you feel when you're preparing an Epsom salt bath.")


(trans: In a calm tone, a girl with a high-pitched voice tinged with a hint of fragility slowly expresses her desire: "It would be amazing to just quickly see that as a chart and just get a quick visualization." She gradually emphasizes her thoughts and expectations.)

Supporting Voice Cloning

(trans: "April's charm fades in the mortal world, yet peach blossoms just begin to flourish in the mountain temple.")


(trans: "Memory is a river long dried up, leaving only scattered pebbles in its lifeless riverbed.")

“Tom did not like to steal, but he had no one to teach him to be honest.”
“To be or not to be, that is the question, To be or not to be, that is the question.”
“These moments when we dare to aim higher, to break barriers, to reach for the stars, to make the unknown known. We count these moments as our proudest achievements.”
Attempts on Voice Style Modification
[CaseStudy] Stated sadly with a heavy heart and spoken very slowly: “For it is very hard, my LORD. To carry on, to persist without yielding."
[CaseStudy] In the television series, a general said in a calm tone, “For it is very hard, my LORD. To carry on, to persist without yielding", emphasizing the word “yielding".

Comparison: Speech Attribute Control Based on Human Instructions

In this section, we compare VoxInstruct with PromptTTS and Salle to demonstrate that our proposed approach converts human instructions into more natural and expressive speech. Note that PromptTTS and Salle use separate content and description prompts, while VoxInstruct uses human instructions directly.
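The difference between the two input formats can be sketched as data structures. This is only an illustration: the class names, field names, and the `to_unified` helper are hypothetical and do not come from any of the compared systems, and real instructions are free-form rather than built from a fixed template.

```python
from dataclasses import dataclass


@dataclass
class SplitPrompt:
    """Two separate fields: a description prompt and a content prompt."""
    description: str  # style and speaker
    content: str      # transcript to be spoken


@dataclass
class UnifiedInstruction:
    """A single free-form instruction that embeds the transcript."""
    text: str


def to_unified(prompt: SplitPrompt) -> UnifiedInstruction:
    """Fold a split prompt into one instruction by quoting the transcript."""
    return UnifiedInstruction(text=f'{prompt.description}, "{prompt.content}"')


split = SplitPrompt(
    description="a low pitched female voice",
    content="Very little could be seen through the windows.",
)
unified = to_unified(split)
```

In the unified form the model has to locate the transcript inside the instruction itself, and the description can refer to specific words of the transcript, for example to request stress on one word.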

Instruction | PromptTTS | Salle | VoxInstruct (w/o pre-training) | VoxInstruct
A sad young female with low pitch and normal energy sadly expresses:"and saw that the kylix on the floor." in a gaming context, she reflects on a visual observation.
In the realm of News and Politics, an old, female speaker with normal pitch and high energy speaks slowly and angrily, proclaiming: "All the single ladies, and good and mad the revolutionary power of women's anger avi klein is a psychotherapist and licensed clinical social worker."
In the context of Nonprofits and Activism, a fearful youth with low pitch and normal energy speaks rapidly, expressing uncertainty: "Who knows what's going to happen to it?"
"I dropped the gun and he got it." Shares an old male with low pitch and normal energy in the field of Education, conveying a sense of angry while recounting an incident.
In the category of an audiobook, a sad adult female with a low pitch and fast speed shares: "Very little could be seen through the windows." Despite limited visibility, her emotions are expressed deeply.

(trans: With intense anger, a teenage boy speaks loudly and quickly, saying, "Boycotting Japanese goods is nothing compared to boycotting stupid people.")


(trans: With a tone filled with happiness, a young woman says, "Tomorrow is Monday again, I really don't want to go to work, I miss my bed." Her speech is of moderate speed, with increased volume and stable tone. She particularly emphasizes the word "Monday," afraid that everyone might not hear it clearly.)


(trans: The young woman speaks in a natural tone with raised volume, slowly saying, "The more I think about it, the angrier I get, and the angrier I get, the less control I have.")


(trans: With an appreciative tone, the young man lowers his voice and speaks rapidly, saying, "The most satisfying thing is someone like you who listens so obediently!")


(trans: In a low, hoarse voice, the young man speaks slowly and reservedly, delivering his words calmly. The politeness in his voice lacks warmth as he says, "The usually very polite Old Zhang unexpectedly refused his request.")

Comparison: Speech Stress Control through Fine-Grained Human Instructions

In this section, we compare VoxInstruct with PromptTTS and Salle to demonstrate that our unified instruction-based speech generation approach has superior fine-grained control over speech. We fine-tuned all these models on an artificial fine-grained instruction dataset.

Instruction | PromptTTS | Salle | VoxInstruct
Focusing on the words "BY JOHN KENDRICK BANGS" a natural-sounding female youth with high pitch and low volume speaks slowly, piquing the curiosity of the audience with her unique voice, uttering "KENDRICK" with particular stress.
“"That's a lie, Lucius," answered Lucille steadily.” answered a confident adult female with high pitch and normal volume, speaking slowly and thoughtfully, emphasizing "Lucille" strongly in the dialogue.
With a fast pace and high pitch, a natural youth male expresses his thoughts:"I don't propose to respect your little fancies." This dialogue reveals a stance of independence and philosophical openness, perhaps discussing a debate topic, or sharing an opinion, giving "little" a pronounced stress.
In a natural tone, a normal-pitched adult male speaks quickly:"The thing is a mere accident of words.", stressing "mere" above all other words.
Expressing natural emotions in a normal pace, an adult female with a high pitch and low volume remarks: "A roar that seemed to rend the heavens followed.", making "heavens" stand out through emphasis.

Comparison: Voice Cloning Ability Based on Speech Prompt

In this section, we compare VoxInstruct with the zero-shot TTS models VALL-E and VALL-E X, which focus on monolingual and cross-lingual scenarios, respectively. These samples are from the official demo pages of VALL-E and VALL-E X, and all speakers are unseen during training.

Instruction (just transcript) | Speech Prompt | VALL-E | VALL-E X | VoxInstruct | Ground Truth (if provided)
"And lay me down in thy cold bed and leave my shining lot." -
"We have to reduce the number of plastic bags." -
"So what is the campaign about?" -
"Um we have to pay have this security fee just in case she would damage something but um." -
"As friends thing I definitely I've got more male friends" -

(trans: "I believe you touched on a key point there.")

(trans: "Persist in the real estate regulation policy without wavering.")

- -

(trans: "twenty-six million, four hundred eighty-two thousand, five hundred forty-six.")

"One dark night at the head of a score of his tribe, he fell upon Wabigoon’s camp, his object being the abduction of the princess." - -
"He honours whatever he recognizes in himself, such morality equals self-glorification." - -