VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Anonymous Author(s)

A 2-minute video introducing VoxInstruct.

Abstract

Recent AIGC systems can generate digital multimedia content, such as text, images, and video, from human language instructions. However, when it comes to speech, existing methods for human instruction-to-speech generation exhibit two limitations. First, they require the input to be divided into a content prompt (transcript) and a description prompt (style and speaker), instead of directly supporting human instructions. This division is less natural in form and does not align with other AIGC models. Second, modeling speech style with an independent description prompt, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of the synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthen how closely the generated speech follows human instructions. Furthermore, our model architecture and training strategies allow a speech prompt and a descriptive human instruction to be combined for expressive speech synthesis, which is a first-of-its-kind attempt.
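As a rough illustration of the Classifier-Free Guidance idea mentioned in the abstract, the sketch below shows the standard single-condition formulation applied to next-token logits. The abstract does not specify which CFG variants VoxInstruct uses, so the function name `cfg_logits`, the guidance scale, and the toy numbers are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits with a guidance scale.

    Standard CFG formulation (an assumption here, not VoxInstruct's exact variant):
    scale = 1.0 returns the conditional logits unchanged, and
    scale > 1.0 pushes the distribution further toward the instruction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]    # model conditioned on the instruction
uncond = [1.0, 1.0, 1.0]  # same model with the instruction dropped
guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

With a scale above 1, the token favoured by the instruction-conditioned pass receives more probability mass than it would from the conditional logits alone, which is the sense in which guidance makes the output follow the instruction more closely.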


The capabilities of the proposed expressive human instruction-to-speech generation model.

Overview: Expressive Human Instruction-to-Speech Generation

To demonstrate the diverse and powerful generation capabilities of VoxInstruct, we present various samples in this section.

Instruction | Speech Prompt | VoxInstruct
Supporting English Human Instruction
The man adopted a high-pitched voice and maintained a moderate energy level in his amazed slow speech."What an impetuous boy he is!" -
In a fast-paced conversation, a sad adult male with low pitch is reflecting on something someone said: "You know one of my most viewed videos so it's a definite issue that a lot of people go through but i guess you know just going on to facebook groups searching up paper whole there's just a bunch of solutions." -
a low pitched female voice, "Delaney had read one or two works on psychic phenomena and understood from them that spirit projection was not only quite feasible but far from uncommon." -
Discussing science and technology, a sad adult male with high pitch and normal energy speaks rapidly. "there are no three hydrogen, because here." -
While whitewashing the fence, a fearful adult male with normal pitch reminisces: "one time when I was whitewashing the fence." this scene belongs to the category of film and animation. -
Speaking calmly, an elderly male with normal pitch and high energy discusses the potential name change of a nfl team, sharing: "in the end, they said they still would have gone on the show had they known there would be a debate but at least one of them wouldn't have worn his red skins jacket which forces the question if they change the team name." -
Describing the news and politics scene, a happy youth with low pitch joyfully mentions:"to lori and benjie sagarin, rabbi michael weinberg, bruce crane and beth sair at temple beth israel." he appreciates their contributions. -
[CaseStudy] A happy old man with low pitch and high energy, speaking quickly, happily recounts his recent activities: “But don’t you always want to be happy, Bruno?" -
[CaseStudy] Engaging in a dialogue, a youthful male with normal pitch saying “But don’t you always want to be happy, Bruno?", drawing attention to “always" by stressing it significantly. -
Supporting Chinese Human Instruction
声音中泪流满面,带有强烈的悲哀,较为亲切却又略显哀求,有些坚定,说:“上帝呀,救我吧!”

(trans: The voice is tearful, filled with intense sorrow, somewhat warm yet slightly pleading, and somewhat resolute, saying: "Oh God, save me!")

-
老陆以一种充满深情的语调,带着轻微的诚恳请求,平和又带有一定的期望下达指示,轻轻地说“明白”,而在提到“大师兄”时语气加重,“明白就好,孩子,收好捐册,牵马去洗吧!你大师兄就快来了?”

(trans: Old Lu issued instructions in a tone full of deep affection, with a slight earnest plea, calm yet with a certain expectation. He gently said "Understand", and his tone became heavier when mentioning "eldest senior brother", saying, "Good if you understand, child, put away the donation book, and go wash the horse! Your eldest senior brother is coming soon, right?")

-
充满喜悦并感到深刻的幸福,表现得非常亲切与满足,没有犹豫地表示小莉娜带着幸福的记忆与生活,语调在描述“真正的”时稍显缓慢:"这些都让我非常高兴,这是真正的幸福。这是我的回忆,又是我的生活!”

(trans: Filled with joy and a profound sense of happiness, he expresses himself very warmly and contentedly, without hesitation stating that Lina lives with happy memories. His tone slows slightly when he describes "true": "All these make me very happy; this is true happiness. These are my memories, and this is my life!")

-
一位青年男性语气坚定,声音中透漏着愤怒的情绪,音调较高,“什么都无法舍弃的人,最终将什么都无法改变。”

(trans: A young man speaks with determination, his voice tinged with anger and a slightly elevated pitch: "Those who cannot let go of anything will ultimately change nothing.")

-
一位妈妈以床边故事的方式轻柔讲述,语气亲切,语速缓慢:“什么都无法舍弃的人,最终将什么都无法改变。”

(trans: A mother gently tells a bedtime story, speaking in a warm and soothing tone, with a slow pace: "Those who cannot let go of anything will ultimately change nothing.")

-
一个相声演员在台上高声吆喝道,带着洋洋得意的神情,“您猜怎么着,诶这人呐他就盖了帽儿了。”

(trans: A comedian shouted on stage with a smug look, "Guess what? This guy is really good!")

-
青年男子在说出:“铜人一摆,突然将张玉虎胁下的‘肺俞穴’一撞。”时,声音低沉,音量保持中等,语速平和,语气中透露出一丝不太明显的惊讶,整体给人一种较为严肃且坚定的印象。

(trans: A young man speaks with a low, steady voice as he says, "As the bronze figure moved, it suddenly struck the 'Feiyu point' under Zhang Yuhu's ribs." His voice maintains a moderate volume and a calm pace, with a hint of subtle surprise in his tone. Overall, he conveys a somewhat serious and resolute impression.)

-
Supporting Multilingual Human Instruction
“Donald Trump thinks he's making America great again. ”一个老年男主持人在电视上轻松愉快地调侃道,语速较慢,表现出震惊的情绪。

(trans: "Donald Trump thinks he's making America great again," quips a senior male host on television, in a light-hearted and cheerful tone, with a slower pace and a hint of astonishment in his expression.)

-
以中年男性的声音,声音中充满沉思,带有递进的情绪,较为亲切却又略有些坚定,"To be or not to be, that is the question."

(trans: In the voice of a middle-aged man, filled with contemplation and a progressive emotional tone, somewhat warm yet slightly resolute, he says, "To be or not to be, that is the question.")

-
An old female is performing on a talk show, “中国有句古话,叫做识时务者为俊杰,但我不能理解是什么意思。”, speaking at a rapid pace.

(trans: An old female is performing on a talk show, “There's an old Chinese saying, 'A wise man adapts himself to circumstances, as water shapes itself to the vessel,' but I can't understand what it means.”, speaking at a rapid pace.)

-
男声以较低的音高,较低的音量,充满喜悦并感到深刻的幸福,表现得非常亲切与满足,“I have a dream, 世界会更好。”

(trans: A male voice, with a lower pitch and volume, filled with joy and a profound sense of happiness, speaks very warmly and contentedly, saying, "I have a dream, the world will be better.")

-
With a sense of sadness, a young boy with high pitch and low volume speaks at a normal speed, expressing his feelings, saying, "我有一种爱。那种当你准备好一盆Epsom盐浴时的感觉"

(trans: With a sense of sadness, a young boy with a high pitch and low volume speaks at a normal speed, expressing his feelings, saying, "I have a kind of love. It's the kind you feel when you're preparing an Epsom salt bath.")

-
以镇定的语气,一位声音高亢且略带脆弱感的少女缓慢地表达她的愿望:“IT WOULD BE AMAZING TO JUST QUICKLY SEE THAT AS A CHART AND JUST GET A QUICK VISUALIZATION.” 她慢慢地突出她的思考和期望。

(trans: In a calm tone, a girl with a high-pitched voice tinged with a hint of fragility slowly expresses her desire: "It would be amazing to just quickly see that as a chart and just get a quick visualization." She gradually emphasizes her thoughts and expectations.)

-
Supporting Voice Cloning
“人间四月芳菲尽,山寺桃花始盛开。”

(trans: "April's charm fades in the mortal world, yet peach blossoms just begin to flourish in the mountain temple.")

“记忆是一条早已干涸的河流,只在毫无生气的河床中剩下零落的砾石。”

(trans: "Memory is a river long dried up, leaving only scattered pebbles in its lifeless riverbed.")

“Tom did not like to steal, but he had no one to teach him to be honest.”
“To be or not to be, that is the question, To be or not to be, that is the question.”
“These moments when we dare to aim higher, to break barriers, to reach for the stars, to make the unknown known. We count these moments as our proudest achievements.”
Attempts on Voice Style Modification
[CaseStudy] Stated sadly with a heavy heart and spoken very slowly: “For it is very hard, my LORD. To carry on, to persist without yielding."
[CaseStudy] In the television series, a general said in a calm tone, “For it is very hard, my LORD. To carry on, to persist without yielding", emphasizing the word “yielding".

Comparison: Speech Attribute Control Based on Human Instructions

In this section, we compare VoxInstruct with PromptTTS and Salle to demonstrate that our proposed approach converts human instructions into more natural and expressive speech. Note that PromptTTS and Salle use separate content and description prompts, while VoxInstruct uses human instructions directly.
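The difference between the two input formats can be sketched as follows. This is a minimal illustration under stated assumptions: the `SplitPrompt` class and the `to_instruction` helper are hypothetical names, not part of PromptTTS, Salle, or VoxInstruct, and the example text is taken from one of the instructions in this section.

```python
from dataclasses import dataclass


@dataclass
class SplitPrompt:
    """Hypothetical container for the two-part input used by split-prompt systems."""
    content: str      # transcript to be spoken
    description: str  # style and speaker description


def to_instruction(prompt: SplitPrompt) -> str:
    """Compose one free-form instruction from the two parts (illustrative template only)."""
    return f'{prompt.description}: "{prompt.content}"'


split = SplitPrompt(
    content="Who knows what's going to happen to it?",
    description=(
        "In the context of Nonprofits and Activism, a fearful youth with low pitch "
        "and normal energy speaks rapidly, expressing uncertainty"
    ),
)

# A unified instruction is a single string; the model must locate the transcript itself.
instruction = to_instruction(split)
```

Real instructions are free-form, so the quoted transcript may appear at the start, middle, or end of the sentence; the fixed template above covers only one such arrangement.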

Instruction | PromptTTS | Salle | VoxInstruct (w/o pre-training) | VoxInstruct
A sad young female with low pitch and normal energy sadly expresses:"and saw that the kylix on the floor." in a gaming context, she reflects on a visual observation.
In the realm of News and Politics, an old, female speaker with normal pitch and high energy speaks slowly and angrily, proclaiming: "All the single ladies, and good and mad the revolutionary power of women's anger avi klein is a psychotherapist and licensed clinical social worker."
In the context of Nonprofits and Activism, a fearful youth with low pitch and normal energy speaks rapidly, expressing uncertainty: "Who knows what's going to happen to it?"
"I dropped the gun and he got it." Shares an old male with low pitch and normal energy in the field of Education, conveying a sense of angry while recounting an incident.
In the category of an audiobook, a sad adult female with a low pitch and fast speed shares: "Very little could be seen through the windows." Despite limited visibility, her emotions are expressed deeply.
少年男孩满怀愤怒,用大声的音量和快速的语速,说:“抵制日货真不如抵制蠢货。”

(trans: With intense anger, a teenage boy speaks loudly and quickly, saying, "Boycotting Japanese goods is nothing compared to boycotting stupid people.")

青年女性充满高兴的语气说:“明天又是周一了真的不想上班想念我的床。”她的语速中等,音量提高,音调稳定,特别强调“周一”二字,生怕大家没听清楚。

(trans: With a tone filled with happiness, a young woman says, "Tomorrow is Monday again, I really don't want to go to work, I miss my bed." Her speech is of moderate speed, with increased volume and stable tone. She particularly emphasizes the word "Monday," afraid that everyone might not hear it clearly.)

青年女子用自然的语调并音量高昂地,语速慢慢地说:“越想越气,越气我就越控制不住。”

(trans: The young woman speaks in a natural tone with raised volume, slowly saying, "The more I think about it, the angrier I get, and the angrier I get, the less control I have.")

青年男子用充满激赏的说话方式,将音量调低,语速飞快,说:“最满意的就是你这样听话的人!”

(trans: With an appreciative tone, the young man lowers his voice and speaks rapidly, saying, "The most satisfying thing is someone like you who listens so obediently!")

青年男子低沉沙哑的声音,语速缓慢而矜持,平静地讲话,声音中的客套不带温柔:“平时十分客气的老张,竟然拒绝了他的要求。”

(trans: In a low, hoarse voice, the young man speaks slowly and reservedly, delivering his words calmly. The politeness in his voice lacks warmth as he says, "The usually very polite Old Zhang unexpectedly refused his request.")

Comparison: Speech Stress Control through Fine-Grained Human Instructions

In this section, we compare VoxInstruct with PromptTTS and Salle to demonstrate that our unified instruction-based speech generation approach offers finer-grained control over speech. We fine-tuned all three models on an artificial fine-grained instruction dataset.

Instruction | PromptTTS | Salle | VoxInstruct
Focusing on the words "BY JOHN KENDRICK BANGS" a natural-sounding female youth with high pitch and low volume speaks slowly, piquing the curiosity of the audience with her unique voice, uttering "KENDRICK" with particular stress.
“"That's a lie, Lucius," answered Lucille steadily.” answered a confident adult female with high pitch and normal volume, speaking slowly and thoughtfully, emphasizing "Lucille" strongly in the dialogue.
With a fast pace and high pitch, a natural youth male expresses his thoughts:"I don't propose to respect your little fancies." This dialogue reveals a stance of independence and philosophical openness, perhaps discussing a debate topic, or sharing an opinion, giving "little" a pronounced stress.
In a natural tone, a normal-pitched adult male speaks quickly:"The thing is a mere accident of words.", stressing "mere" above all other words.
Expressing natural emotions in a normal pace, an adult female with a high pitch and low volume remarks: "A roar that seemed to rend the heavens followed.", making "heavens" stand out through emphasis.

Comparison: Voice Cloning Ability Based on Speech Prompt

In this section, we compare VoxInstruct with the zero-shot TTS models VALL-E and VALL-E X, which focus on monolingual and cross-lingual scenarios, respectively. These samples are taken from the official demo pages of VALL-E and VALL-E X, and all speakers are unseen during training.

Instruction (just transcript) | Speech Prompt | VALL-E | VALL-E X | VoxInstruct | Ground Truth (if provided)
"And lay me down in thy cold bed and leave my shining lot." -
"We have to reduce the number of plastic bags." -
"So what is the campaign about?" -
"Um we have to pay have this security fee just in case she would damage something but um." -
"As friends thing I definitely I've got more male friends" -
"我相信你在那里涉及到了某个要点"

(trans: "I believe you touched on a key point there.")

- -
"坚持房地产调控政策不动摇"

(trans: "Persist in the real estate regulation policy without wavering.")

- -
"两千六百四十八万二千五百四十六"

(trans: "twenty-six million, four hundred eighty-two thousand, five hundred forty-six.")

- -
"One dark night at the head of a score of his tribe, he fell upon Wabigoon’s camp, his object being the abduction of the princess." - -
"He honours whatever he recognizes in himself, such morality equals self-glorification." - -