That's a real problem with the vast majority of current TTS. It's terrible at things like consistent intonation, proper pronunciation, and believable pauses, while sounding human all the same, and the result is super uncanny valley.
The gaming and movie industries understand this very well: they use human voice actors who can nail all of that, and then make the result sound more metallic or compressed or whatnot. Otherwise it does not fit.
This is why it pisses me off when a techy person makes a YouTube video and just uses TTS instead of recording a voiceover. I know some people don’t have a good recording situation, but I get the sense that a lot of people just do it because they either think people can’t tell or they think it’s a clever hack or “the way of the future” or something.
It isn’t. Instead I find myself watching videos and getting a weird creepy feeling when I suddenly hear the voiceover mispronounce a word or put an emphasis in the wrong place. Part of it is the uncanny valley for sure, but the more pernicious thing is this: once I realize that the voice is AI-generated, I start to worry that the script might be too. Now I’m trying to figure out “is this guy just an amateur writer taking a while to get to his point, or is this an LLM-authored script that is never going to go beyond surface-level statements about the topic.”
I don’t think this is about a “good recording situation”. It’s likely people who think they suck at speaking/narrating or think they have a horrible accent or want to remain anonymous or just find the process annoying, and find it less embarrassing/more privacy-preserving/less of a hassle to use an AI voice.
Another factor, less common, is when you want or have to speak a non-native language you're not used to pronouncing, in which case you're usually afraid of not being understood.
PS: I think all text-to-speech systems sound horrible, and the latest generations are even irritating, as the parent commenter said.
And on the high end, it makes everyone sound like a professional public speaker. The machine sees mistakes as errors when in fact every non-Hollywood speech contains multiple mistakes.