I made one big mistake with that introduction when it was recorded onto video tape for playback at the conference: I didn't realize how much people were going to laugh at the end, and I went straight into my presentation. I got to hear the laughter because the conference was live streamed using RealPlayer.
I would like to see more modern text-to-speech that sounds good, but also inhuman: on one hand very easy to understand, but on the other obviously constructed and not something a living person could generate. Like the sci-fi where robots always look like robots and it's illegal/immoral/taboo to have a robot that's indistinguishable from a person.
Maybe just an ordinary human-sounding TTS that gets put through a mild vocoder of some kind.
That's a real problem with the vast majority of current TTS. It is terrible at things like consistent intonation, proper pronunciation, and believable pauses, while sounding human all the same, and the result is deep in the uncanny valley.
The gaming and movie industries understand this very well: they use human voice actors who can nail all of that, and then make the recording sound more metallic or compressed or whatnot. Otherwise it does not fit.
This is why it pisses me off when a techy person makes a YouTube video and just uses TTS instead of recording a voiceover. I know some people don’t have a good recording situation, but I get the sense that a lot of people just do it because they either think people can’t tell or they think it’s a clever hack or “the way of the future” or something.
It isn’t. Instead I find myself watching videos and getting a weird creepy feeling when I suddenly hear the voiceover mispronounce a word or put an emphasis in the wrong place. Part of it is the uncanny valley for sure, but the more pernicious thing is this: once I realize that the voice is AI-generated, I start to worry that the script might be too. Now I’m trying to figure out “is this guy just an amateur writer taking a while to get to his point, or is this an LLM-authored script that is never going to go beyond surface-level statements about the topic?”
I don’t think this is about a “good recording situation”. It’s likely people who think they suck at speaking/narrating or think they have a horrible accent or want to remain anonymous or just find the process annoying, and find it less embarrassing/more privacy-preserving/less of a hassle to use an AI voice.
Another factor, less common, is when you want or have to speak a non-native language you're not used to pronouncing, in which case you're usually afraid of not being understood.
PS: I think all text-to-speech systems sound horrible; the latest generations are even irritating, as the parent commenter said.
And on the high end, it makes everyone sound like a professional public speaker. The machine treats mistakes as errors, when in fact every non-Hollywood speech contains multiple mistakes.
The main thing I realize is just how amazing, good, and realistic we thought those machine voices were, and how bad they sound now compared to what we have.
As a kid I thought the announcer voice for Blades of Steel was excellent. Despite being distorted, it gave the feeling that you were watching an actual hockey game. Of course, most of the games I played then didn't have much in the way of human voice.
What a coincidence, I searched for that very video a few days ago. It's astonishing how much of a time capsule it was. A really small slice of online life. I remember people used to send emails around with Word docs containing GIFs and silly images, doctored up. Everything pixelated. Nostalgia.
It was so huge at the time. There weren't so many memes or silly videos around back then (maybe some things from https://b3ta.com/ like Weebl's Badger Badger?), so everyone was aware of it.