How DramaBox Enables Expressive TTS with Local Voice Cloning

To install and test Drama Box locally, clone the repo, install the requirements, and launch the Gradio demo. The first run downloads the model and serves the UI at localhost:7860. On an NVIDIA RTX 6000 with 48 GB of VRAM, I saw just over 16 GB of VRAM in use during generation.

Drama Box is a fine-tune of Lyra X LTX 2, a 3.3 billion parameter audio-only diffusion transformer using flow matching conditioned on Gemma 3 12B text embeddings.

The architecture combines a diffusion transformer backbone with an audio variational autoencoder (VAE), which decodes the model's latent representation back into audio you can actually listen to.

It also uses a vocoder for pauses and other timing.

Environment

I used Ubuntu with an Nvidia RTX 6000 48 GB. The Gradio app ran locally without issues. First launch triggered the model download automatically.

Install and run

Clone and requirements

Clone the Drama Box repository to your machine.

Install the Python requirements in a fresh environment.

Confirm CUDA and the correct GPU drivers are available before launching.
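A quick way to do that driver check is to ask `nvidia-smi` what it sees. The sketch below is only an illustration and is not part of the Drama Box repo: the helper names are hypothetical, and the 16 GiB floor is an assumption taken from the usage I observed rather than a documented requirement.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse 'name, memory.total' rows from nvidia-smi CSV output (MiB, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas survive.
        name, total_mib = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append({"name": name, "total_gib": int(total_mib) / 1024})
    return gpus


def query_gpus():
    """Return the visible GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No NVIDIA GPU detected via nvidia-smi; check the driver install.")
    for gpu in gpus:
        # Assumption: 16 GiB is a rough floor based on observed usage only.
        verdict = "ok" if gpu["total_gib"] >= 16 else "likely too small"
        print(f"{gpu['name']}: {gpu['total_gib']:.1f} GiB ({verdict})")
```

If the script reports no GPU, fix the driver setup before launching the app.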

Launch the app

Launch the Gradio demo from the project directory.

The interface comes up at http://localhost:7860. On first run, the model files download and initialize.
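Because the first launch spends time downloading model files, it can help to check whether the UI is actually listening before opening the browser. This is a minimal sketch using only the standard library. The function name is hypothetical, and it assumes the app binds to the loopback address on port 7860 as it did for me.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "up" if is_port_open() else "not reachable yet"
    print(f"Gradio UI on port 7860 is {state}")
```

A successful connection only shows that a server is listening, not that the model has finished loading.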

For teams coordinating multiple tools and services around this workflow, see our agent manager guide to keep projects organized.

What is DramaBox?

Drama Box is a fine-tune of Lyra X LTX 2 using flow matching and Gemma 3 12B text embeddings for conditioning. The backbone is a diffusion transformer paired with an audio VAE. A vocoder handles pauses and prosody details.
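To give a feel for what flow matching means at sampling time, here is a toy sketch, not the model's actual sampler. A flow-matching model learns a velocity field and generates by integrating it from noise at t=0 to data at t=1. In the example, the learned network is replaced by the closed-form velocity of a straight-line path toward a known target, and the function name and step count are illustrative assumptions.

```python
def euler_flow_sample(start, target, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    The velocity here is the closed-form field of a straight-line path toward
    a known target. A real model replaces it with a learned network.
    """
    x = list(start)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Straight-line velocity: remaining distance divided by remaining time.
        velocity = [(goal - value) / (1.0 - t) for value, goal in zip(x, target)]
        x = [value + dt * v for value, v in zip(x, velocity)]
    return x


if __name__ == "__main__":
    noise = [0.3, -1.2]   # stands in for a random latent
    clean = [1.0, 0.5]    # stands in for the latent the audio VAE would decode
    print(euler_flow_sample(noise, clean))
```

In this toy setup the sample lands on the target up to floating-point error. In the real system the endpoint would be an audio latent that the VAE then decodes.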

Prompting method

Dialogue and directions

Treat the prompt as a full performance script. Dialogue goes inside double quotes and the model speaks it literally.

Everything outside the quotes is stage direction that shapes delivery without being spoken.

You can layer emotional transitions mid-sentence, shift from a shout to a whisper, or drop in a laugh or a sigh, all through how you write the prompt.

You are not just writing text; you are directing a performance. Hence the name Drama Box.
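The quoting convention is easy to check before you submit a prompt. The sketch below splits a script into spoken dialogue and unspoken stage directions. It is a hypothetical helper for previewing your own prompts, not the model's parser, and both the example script and the assumption of straight double quotes are mine.

```python
import re

# Matches text between straight double quotes; curly quotes are not handled.
QUOTED = re.compile(r'"([^"]*)"')


def split_script(prompt):
    """Split a prompt into ('direction', text) and ('dialogue', text) segments."""
    segments = []
    cursor = 0
    for match in QUOTED.finditer(prompt):
        direction = prompt[cursor:match.start()].strip()
        if direction:
            segments.append(("direction", direction))
        segments.append(("dialogue", match.group(1)))
        cursor = match.end()
    tail = prompt[cursor:].strip()
    if tail:
        segments.append(("direction", tail))
    return segments


if __name__ == "__main__":
    example = ('She leans in, whispering. "Did you hear that?" '
               'A nervous laugh. "It was nothing."')
    for kind, text in split_script(example):
        print(f"{kind:>9}: {text}")
```

Reading the split output is a quick way to catch an unbalanced quote that would turn dialogue into direction or the reverse.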

Voice reference

You can drop in an optional 10-second voice reference clip and the model clones that timbre. I do not think the voice cloning is as good as it should be yet. Hopefully a future version brings the same expressive control on top of any voice you feed it.
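If you want to confirm a reference clip is roughly the right length before uploading it, the standard library can read WAV headers. This is a minimal sketch under assumptions: it handles only PCM WAV files, the helper names are hypothetical, and the two-second tolerance is an arbitrary choice of mine rather than anything the model specifies.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_reasonable_reference(path, target=10.0, tolerance=2.0):
    """True if the clip length is within `tolerance` seconds of `target`."""
    return abs(wav_duration_seconds(path) - target) <= tolerance
```

For example, `is_reasonable_reference("my_voice.wav")` would flag a 30-second recording that should be trimmed first, where `my_voice.wav` is a placeholder file name.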

If you are building tooling around prompts and scripts, check our workflow tips for Claude Code to keep your prompt engineering and code glue tidy.

Tests and observations

In a scene with rich stage directions and no reference audio, the delivery carried clear expressions and timing.

I think it still sounds a bit robotic and plasticky, but as far as expressions are concerned, there is a lot of improvement, no doubt about that. The expressive control is the standout.

In a female voice test with a reference clip, expressions stayed strong but the voice cloning fell short.

Timbre resemblance was partial and inconsistent. The emotional beats, pauses, and emphasis were convincing.

In a longer scripted scene with complex structure, I saw hallucinations and rushed lines.

Some sentences were spoken correctly while others were mangled or skipped.

The cause might be sensitivity to prompt length or an issue elsewhere in the pipeline.
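One way to test the length hypothesis is to split a long script into shorter prompts and generate them one at a time. The sketch below is an untested idea rather than a confirmed fix: the helper name is hypothetical, the 400-character budget is an arbitrary starting point, and it assumes scenes are separated by blank lines.

```python
def chunk_script(script, max_chars=400):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence.
    """
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget, so start a new chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

If shorter chunks come out clean while the full script does not, length is the likely culprit. Delivery and voice may not stay consistent across separately generated chunks.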

Resource notes

VRAM usage was just over 16 GB during generation. The Gradio UI ran on localhost:7860.

The model download happened automatically on first run.

If you compare your tooling stack across projects, see this comparison of two coding approaches to plan your integration path.

Limitations

Voice cloning does not yet produce a strong timbre match. Longer and denser scripts can trigger hallucinations or pacing issues. Expressions and delivery control, by contrast, are already compelling.

Final thoughts

Drama Box installs cleanly, runs locally, and delivers expressive speech from script-like prompts with strong control. Voice cloning needs work, and longer prompts can wobble, but the progress on performance-style prompting is clear. Good start, with plenty of room for refinement as the model matures.
