How to Set Up VibeVoice Podcast TTS: Microsoft AI Voice Generator

*For unknown reasons, Microsoft pulled its GitHub repo and the weights for the large model from Hugging Face, as I found out on 9/4/2025. I updated the links below to reflect this. Microsoft gave no reason, which is not a good idea: it can fuel all sorts of speculation and it looks pretty unprofessional. That said, it was released under the MIT license and people had already forked and backed it up.

VibeVoice is fun. However, it is probably not yet ready for production, especially in the 1.5B version that you can get from Huggingface.co. The 7B likely has much better quality, artifact control and speaker consistency.

Background

VibeVoice is a long-form, multi-speaker, multilingual TTS engine that can produce 90 minutes of audio in the 1.5B version and 45 minutes in the 7B version. The 1.5B runs on just an 8GB GPU. It is a Python application that comes bundled with a Gradio interface, and it is MIT licensed, which is great. The authors warn against treating it as production ready, and after a bit of use I certainly concur, at least for the 1.5B.

Follow-along with the video:

How To Install VibeVoice

Here are some notes to get you up and running fast with a VibeVoice LXC container, based on our current build guide series. That means you need to have followed two of the prior guides so that you already have the base LXC backup or template to restore from before proceeding.

Ultimate Local Ai Setup Guides for Proxmox and LXC - VibeVoice

Follow those guides here:

Step 1 Ollama and OpenWEBUI setup guide

Step 2 Llama.cpp and Unsloth setup guide

Step 3 HERE

Restore a backup, rename the container to vibevoice, and give it at least 2 CPU cores and 16GB RAM.

Install VibeVoice TTS

Start your container up now and enter the shell.

Next we need to install python3 in our LXC container.

apt install python3-full python3-pip pipenv
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice

After the base python3 install is done, the repo is cloned and you have entered the directory, we need to run:

You can enter this folder in the future and run this cmd again to get back into this env. Typing exit will leave the shell, FYI. Now, in the shell, install the reqs.

The install will take a bit of time. You need to install Flash Attn next and can use this cmd:

Now install ffmpeg.

You now have the software package installed.

Configure Gradio

Before we move on, we will alter the Gradio interface options so the service runs on the local network and does not create a publicly accessible share. From the cmd line run:

nano demo/gradio_demo.py

Use Page Down to scroll through the file. Change the “127.0.0.1” to “0.0.0.0”, then press Ctrl-X and Y to save.

Now you can run your local VibeVoice as a service accessible from any machine on your LAN.
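If you would rather script this edit than do it by hand in nano, here is a minimal Python sketch. It assumes only what the step above says: that the file contains the literal 127.0.0.1 and that it should become 0.0.0.0. The function names rebind_host and patch_file are made up for this example, and nothing here is part of VibeVoice itself.

```python
from pathlib import Path


def rebind_host(source: str, old: str = "127.0.0.1", new: str = "0.0.0.0") -> tuple[str, int]:
    """Replace every occurrence of `old` with `new`.

    Returns the patched text and the number of replacements made.
    """
    count = source.count(old)
    return source.replace(old, new), count


def patch_file(path: Path) -> int:
    """Patch the file in place and return how many replacements were made.

    The file is only rewritten when at least one occurrence was found.
    """
    text = path.read_text(encoding="utf-8")
    patched, count = rebind_host(text)
    if count:
        path.write_text(patched, encoding="utf-8")
    return count


# Example (run from inside the cloned repo; path taken from the run cmds in this guide):
# patch_file(Path("demo/gradio_demo.py"))
```

A return value of 0 means the address was not found, so open the file and check it by hand. Keep a copy of the original file before patching.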

Run Microsoft VibeVoice

These cmds will pull either the 1.5B or the 7B model. It is highly recommended you use the 7B if you can; it needs 19GB of VRAM. Here is an updated step to download the weights from ModelScope.

git clone https://www.modelscope.cn/microsoft/VibeVoice-Large.git

python demo/gradio_demo.py --model_path microsoft/VibeVoice-7B

Here is the cmd to run the smaller 1.5B, which can fit in a 6GB VRAM GPU. Quality is lower, and the random background music and artifacts increase significantly.

python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B
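As a quick sanity check before launching, the VRAM figures quoted in this section (19GB for the 7B, 6GB for the 1.5B) can be turned into a small helper that suggests which --model_path to use. This is only a sketch: the numbers are this guide's rough figures, not official requirements, and pick_model is a made-up name.

```python
from typing import Optional

# VRAM figures (in GB) as quoted in this guide; treat them as rough guidance only.
VRAM_NEEDED_GB = {
    "microsoft/VibeVoice-7B": 19,
    "microsoft/VibeVoice-1.5B": 6,
}


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest model whose quoted VRAM need fits, or None if neither fits."""
    candidates = [
        (need, name) for name, need in VRAM_NEEDED_GB.items() if need <= vram_gb
    ]
    if not candidates:
        return None
    # max() compares the VRAM need first, so the biggest model that fits wins.
    return max(candidates)[1]


print(pick_model(24))  # a 24GB card fits the 7B
print(pick_model(8))   # an 8GB card falls back to the 1.5B
```

Pass the returned string as the --model_path value in the cmds above.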

Access the Gradio interface on your local network at http://CONTAINER-IP:7860. Don’t forget that you need to have changed the IP address of your newly launched LXC container!
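If the page does not load, it helps to check from another machine whether anything is listening on port 7860 at all. Here is a minimal sketch using only the Python standard library. The address 192.168.1.50 is a placeholder for your container IP, and gradio_url and is_reachable are names made up for this example.

```python
import socket


def gradio_url(host: str, port: int = 7860) -> str:
    """Build the URL to open in a browser."""
    return f"http://{host}:{port}"


def is_reachable(host: str, port: int = 7860, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    host = "192.168.1.50"  # placeholder: replace with your container IP
    state = "reachable" if is_reachable(host) else "not reachable"
    print(f"{gradio_url(host)} is {state}")
```

If this reports not reachable while the demo is running, the usual suspects are the host setting from the Configure Gradio step or a duplicate IP left over from the restored backup.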

VibeVoice 1.5B Quality Checks

The model performs fairly well if you give it clear, commonly used phrases, formatted exactly as Speaker X: text with one speaker turn per line. 45 mins of great or 90 mins of slop is the real big news here. The voices are okay but lack dynamics, and odd chimes and sounds are often added that are not background music. Background music does also generate spontaneously. This one has some oddly timed kazoo noises.
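Since the formatting matters this much, it can be worth checking a script before pasting it into the interface. The sketch below assumes the format is exactly "Speaker", a number, a colon and then the text, one turn per line, as described above. It is not VibeVoice's own parser, and parse_script is a made-up name.

```python
import re

# Assumed line format: "Speaker <number>: <text>", one turn per line.
SPEAKER_LINE = re.compile(r"^Speaker (\d+): (\S.*)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Return (speaker_number, text) pairs; raise ValueError on a malformed line."""
    turns = []
    for line_no, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are skipped
        match = SPEAKER_LINE.match(line)
        if match is None:
            raise ValueError(f"line {line_no} is not 'Speaker X: text': {line!r}")
        turns.append((int(match.group(1)), match.group(2)))
    return turns


script = """Speaker 1: Hey this is Digital Spaceport.
Speaker 2: Are you going to show us how to set it up?"""
for speaker, text in parse_script(script):
    print(speaker, text)
```

A line such as "Host: hello" raises a ValueError that names the offending line number, so you can fix the script before spending GPU time on it.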

Input Text:

Speaker 1: Hey this is Digital Spaceport and today we are covering VibeVoice from Microsoft.
Speaker 2: Are you going to show us how to set it up?
Speaker 1: Yes its pretty easy if you have followed my prior guides you will be running in about 10 minutes.
Speaker 3: Are you going to demonstrate it?
Speaker 1: You are kinda watching a demonstration right now.
Speaker 4: How much VRAM do I need to run this model?
Speaker 1: An 8 gigabyte GPU will get you there for the 1.5B model.
Speaker 2: Is this super high quality?
Speaker 1: I do not think it's super high quality but the length it can generate is pretty impressive. I think it is the longest available.

Output Audio:

Speaker consistency, even within the same output stream, is frankly not great, and that is likely more than I can stand. It gets confusing in a podcast host role if Speaker 1 takes over other speakers' roles or if some altered new speaker emerges. The next clip, which had a few phrases for a comical Flippyblock commercial I was making, has some artifacts, clips, and even cases where the model decides to spell a word out instead of pronouncing it. Nonstandard words are not well handled. CFG was set to the standard 1.3.

Input Text:

Speaker 1: Flippyblock Extreme
Speaker 2: yea yea yeah flippyblock extreme
Speaker 1: did you say free flippyblock? FREE THE FLAPPY BIRD!
Speaker 2: FLAPPYBIRD!
Speaker 1: WOHOO!!!
Speaker 2: Buuurrps
Speaker 1: oh that’s totally cool

Output Audio:

If you have nonstandard phrases and words, too many ! marks, or other quirks, it will devolve into something... unnatural. I am calling this Exorcist Mode. CFG was 1.8 and the prompt was the same.

I was cracking up hard when I heard this, so maybe it can be used somehow (TBD on that), but it is likely not what you are hoping for in day-to-day use. A tad unhinged. CFG was set to 2.0.

My Next Steps

Hope for a dramatic improvement in quality in the 7B. (UPDATE: IT IS BETTER! A LOT BETTER!) The 1.5B is funny and a good start, but it is for sure not ready for integration into any automated podcast production tasks that strangers may hear. It may diverge too much even for your own uses. Having it go into Exorcist Mode on me was freaky (I had the CFG all the way up to 2).

LINKS

https://huggingface.co/microsoft/VibeVoice-Large

https://arxiv.org/pdf/2508.19205

https://github.com/microsoft/VibeVoice