Text-to-speech
3D Outside the Box includes a text-to-speech synthesizer that speaks text that's typed into it. You can use it to generate speech for your animations.
Here are some sample wave files generated by the text-to-speech engine. (Technical information: The TTS voice is built from recordings of my own voice. I recorded 4000 sentences and haven't done any hand-correcting of the units.)
Comparison of voice quality using different voice-file sizes - When you create a voice file from your recordings you can specify how large a voice file to create.
Transplanted prosody quality - This shows a comparison of different levels of synthesized prosody, including the original wave from which the prosody came. (The TTS voice and the original recording have the same voice because they're both recorded by me. Unless you create your own TTS voice, your transplanted prosody voice will be different than the original recording.)
Modified voice - These voices are easy-to-control modifications of the original voice.
Other voices - Here are voices created from the CMU Blizzard challenge databases.
The text-to-speech engine in 3DOB differs from other text-to-speech engines on the market because it supports:
Transplanted prosody - Much of the reason text-to-speech sounds bad is that it doesn't get the "prosody" (pitch, timing, and volume) of a sentence correct. TTS fails so miserably at this because, to do prosody correctly, the computer must understand what it's saying. Unfortunately, computers are a long way from this. However, a tool in the mXac Natural Language Editor lets you "transplant" the prosody from a recorded voice onto a text-to-speech voice, greatly improving the quality. It basically turns text-to-speech into a very efficient audio compression scheme, with bit rates only a few times higher than raw text. As a comparison: the best voice-audio compression techniques produce a bit rate of about 36 kBytes/minute. Raw text is about 0.3 kByte/minute compressed. Transplanted prosody is around 2 kBytes/minute compressed.
Customize a voice - The included tools let you customize an existing text-to-speech voice, changing the way it sounds and how it emphasizes words (prosody), and even adding an accent.
Make your own voice - You can use the included tools to record your own voice and make a text-to-speech voice that sounds (mostly) like you. This is a fair amount of work, requiring 40-80 hours for a quality voice, although you can have it speaking haltingly in only an hour or two. Unfortunately, I only have a lexicon for English, so if you wish to create a non-English voice you need to find or create a lexicon for your language, which is even more work... just imagine entering pronunciations for the 10,000 most common words in your language. If you do make your own voice or lexicon, E-mail me so I can link to it from this page.
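To put the bit-rate figures quoted under "Transplanted prosody" side by side, here is a small Python sketch that converts them to bits per second and computes the ratios between them. It only uses the three numbers given above; the function names are made up for illustration, and it assumes 1 kByte = 1000 bytes, which the figures don't specify.

```python
# Figures quoted in the text, in kBytes per minute of speech.
RATES_KBYTES_PER_MIN = {
    "compressed voice audio": 36.0,
    "transplanted prosody": 2.0,
    "raw text": 0.3,
}

# Assumption: 1 kByte = 1000 bytes (the text does not say 1000 or 1024).
BYTES_PER_KBYTE = 1000


def bits_per_second(kbytes_per_min: float) -> float:
    """Convert a rate in kBytes/minute to bits/second."""
    return kbytes_per_min * BYTES_PER_KBYTE * 8 / 60


def ratio(a: str, b: str) -> float:
    """Return how many times larger rate `a` is than rate `b`."""
    return RATES_KBYTES_PER_MIN[a] / RATES_KBYTES_PER_MIN[b]


if __name__ == "__main__":
    for name, rate in RATES_KBYTES_PER_MIN.items():
        print(f"{name}: {rate} kBytes/min = {bits_per_second(rate):.0f} bits/s")
    print(f"prosody vs. text:  {ratio('transplanted prosody', 'raw text'):.1f}x")
    print(f"audio vs. prosody: {ratio('compressed voice audio', 'transplanted prosody'):.1f}x")
```

With these figures, transplanted prosody works out to a little under 7 times the size of compressed raw text, and compressed voice audio is 18 times larger than transplanted prosody.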
To use text-to-speech, you need to:
Download 3DOB from here and install it.
Download and install the large text-to-speech voices from here. The voices will (normally) be installed into "c:\program files\mXac\mXac Interactive Fiction".
Run the wave editor provided in 3DOB.
Under the "FX" menu, select "Text-to-speech" followed by "Text-to-Speech".
In the "Text-to-speech" dialog, press the "Load in a different text-to-speech voice" button and select one of the installed voices.
Type the text you want spoken and press the "Do text-to-speech" button.
It will fill the wave with the synthesized voice; press "Play" to hear it.
You can experiment with transplanted prosody using the "Transplanted prosody" menu under "FX".
If you wish to create your own text-to-speech voice, run the "mXac Natural Language Editor" installed with 3DOB.
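If the voice-selection step doesn't show anything to pick, it may help to check that the voices actually landed in the install folder. The following is only a rough Python sketch of that check: it uses the default path mentioned in the steps above (yours may differ), the function name is hypothetical, and it makes no assumption about what the voice files are called, so it just lists every file it finds.

```python
from pathlib import Path

# Default install location quoted in the steps above; adjust if you
# installed 3DOB or the voices somewhere else.
DEFAULT_VOICE_DIR = Path(r"c:\program files\mXac\mXac Interactive Fiction")


def list_installed_files(voice_dir: Path = DEFAULT_VOICE_DIR) -> list[str]:
    """Return the sorted names of the files directly inside voice_dir.

    Returns an empty list if the directory does not exist.
    Subdirectories are ignored.
    """
    if not voice_dir.is_dir():
        return []
    return sorted(p.name for p in voice_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    files = list_installed_files()
    if files:
        print(f"Files in {DEFAULT_VOICE_DIR}:")
        for name in files:
            print("  " + name)
    else:
        print(f"Nothing found in {DEFAULT_VOICE_DIR}; check the install path.")
```

An empty result usually means the voices were installed to a different folder, in which case you can pass that folder to list_installed_files instead.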
Other text-to-speech voices and lexicons
At the moment I don't have any other text-to-speech voices or lexicons, but if you happen to create one, E-mail me your web page so I can add the link here.