This might be a dumb question to ask, but what exactly is this useful for? B-Roll for YouTube videos? I'm not sure why so much effort is being put into something like this when the applications are so limited.
If you want to train a model to have a general understanding of the physical world, one way is to show it videos and ask it to predict what comes next, and then evaluate it on how close it was to what actually came next.
To really do well on this task, the model basically has to understand physics, and human anatomy, and all sorts of cultural things. So you're forcing the model to learn all these things about the world, but it's relatively easy to train because you can just collect a lot of videos and show the model parts of them -- you know what the next frame is, but the model doesn't.
Along the way, this also creates a video generation model - but you can think of this as more of a nice side effect rather than the ultimate goal.
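The training signal described above can be sketched minimally. Here a trivial "repeat the last frame" baseline stands in for the model, and mean squared error scores the prediction; the array shapes and the baseline predictor are illustrative assumptions, not anyone's actual training code:

```python
import numpy as np

def next_frame_loss(video: np.ndarray, context_len: int) -> float:
    """Score a trivial predictor on next-frame prediction.

    video: array of shape (frames, height, width, channels), values in [0, 1].
    The "model" here just repeats the last context frame -- a stand-in
    baseline, since the point is the training signal, not the architecture.
    """
    context = video[:context_len]   # frames the model is shown
    target = video[context_len]     # held-out next frame (known to the trainer)
    prediction = context[-1]        # baseline: repeat the last frame
    return float(np.mean((prediction - target) ** 2))

# A static video is perfectly predicted by the baseline (loss 0.0);
# a video where something moves is not, so the loss is positive.
static = np.zeros((4, 8, 8, 3))
moving = np.zeros((4, 8, 8, 3))
moving[3, 2:5, 2:5, :] = 1.0  # something appears in the final frame

print(next_frame_loss(static, 3))  # → 0.0
print(next_frame_loss(moving, 3))  # → positive
```

A real model replaces the repeat-last-frame predictor with a learned network, but the evaluation is the same: the trainer knows the next frame, the model doesn't, so any video collection becomes labeled training data for free.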
It doesn’t have to understand anything; none of these demonstrate reasoning or understanding.
All these models have just “seen” enough videos of those things to build a probability distribution for predicting the next step.
That isn’t bad, nor does it make them inherently dumb; a major component of human intelligence is built on similar strategies.
I couldn’t tell you what grammatical rules are broken in a text, or what physical rules are broken in a photograph, but I can tell something is wrong using the same methods.
Inference can take you far with large enough data sets, but sooner or later, without reasoning, you hit a ceiling.
This is true for humans as well: plenty of people go far in life on just memorization and replication, and do a lot of jobs fairly competently, but not everything.
Reasoning is essential for higher-order functions, and transformers are not the path to that.
That's like saying that your brain doesn't understand anything, it just analyzes the visual data coming in via your eyes and predicts the next step of reality
The brain also does that. It doesn’t do it exclusively, but we do it an awful lot.
We do an extensive amount of pattern matching and drop an enormous amount of sensory input very quickly, because we expect patterns and assume a lot about our surroundings.
Unlearning this is a hard skill to pick up. There are many forms of training, from martial arts to meditation, that attempt to achieve this.
The point is that this alone is not sufficient; the other core component is reasoning and understanding, and transformers learning on data are insufficient for that.
Parrots and a few other animals can imitate human speech very well; that doesn’t mean they understand the speech or are constructing it themselves.
Don’t get me wrong, I am not saying it is not useful (it is), but attributing reasoning and understanding to models that foundationally have no such building block is just being impressed by a speaking parrot.
I think people are just fundamentally not willing to attribute intelligence to things that can't have conversations. This is why it was possible for people to hold the incredible belief that babies or dogs don't feel pain. Once AI is given some long-term memory, all of these ideas that AI is just a parrot will suddenly be gone, and I personally think it will probably be pretty easy to give robots memories and their own personal motivations. All you have to achieve is to train them in real time; the rest is optimization — you want the training to make sense and have it not store/believe every single thing it is told, etc.
There is also the corollary: we tend to attribute intelligence to things merely because they can have conversations. That has been the case since the first golden era of AI in the 1960s.
Mimicking more patterns, like emotion and motivation, may make for a better user experience, but it doesn't make the machine any smarter, just a better mime.
Your thesis is that as we mimic reality more and more, the differences will not matter; this is an idea romanticized by popular media like Blade Runner.
I believe there are classes of applications, particularly if the goal is singularity or better-than-human superintelligence, where emulating human responses, no matter how sophisticated, won't take you there. Proponents may hand-wave this away as moving the goalposts, but it is only refining the tests to reflect the models of the era.
If the proponents of AI were serious about their claims of intelligence, then they should also be pushing for AI rights. There is no such serious discourse happening, only issues related to human data-privacy rights: what can be used by AI models for learning, or where the models can be allowed to work.
> If the proponents of AI were serious about their claims of intelligence, then they should also be pushing for AI rights. There is no such serious discourse happening
It's beginning to happen. Anthropic hired their first AI welfare researcher from Eleos AI, which is an organization specifically dedicated to investigating this question: https://eleosai.org/
Back when computers took up a whole room, you'd also have asked: "but what exactly is this useful for? Some simple calculations that anybody can do with a piece of paper and a pen?"
Think 5-10 years into the future; this is a stepping stone.
That's comparing apples to oranges though isn't it? Generating videos is the output of the technology, not the tech itself. It would be like someone asking "this computer that takes up a whole room printed out ascii art, what is this useful for?"
All the "creative" gen AI does things worse and more annoyingly than what exists now. The first computers did calculations faster and faster, with immediate utility (mostly for defense).
This is kind of an unfair comparison. What's the endpoint of generating AI videos? What can this do that is useful, contributes something to society, has artistic value, etc.? We can make educational videos with a script, but it's also pretty easy for motivated parties to do that already, and it's getting easier as cameras get better and smaller. I think asking "what's the point of this?" is at least fair.
The end point is enabling people to put into video what is in their mind. Like a word processor for video. When you remove the need to have a room full of VFX artists to make a movie, then anyone can make a movie. Whether this is beneficial is dubious, but that's an end goal if you are looking for one.
We're preparing to use video generation (specifically image+text => video so we can also include an initial screenshot of the current game state for style control) for generating in-game cutscenes at our video game studio. Specifically, we're generating them at play-time in a sandbox-like game where the game plays differently each time, and therefore we don't want to prerecord any cutscenes.
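As a rough illustration of that image+text => video flow (the field names, model id, and duration below are hypothetical placeholders, not any particular vendor's API), the game would snapshot the current frame and send it alongside the scene description:

```python
import base64
import json

def build_cutscene_request(scene_text: str, screenshot_png: bytes) -> str:
    """Assemble an image+text => video request payload for a cutscene.

    A screenshot of the current game state rides along as a style and
    continuity anchor, so the generated clip matches what the player
    currently sees. All field names here are hypothetical placeholders.
    """
    payload = {
        "model": "video-gen-model",  # placeholder model id
        "prompt": scene_text,
        "init_image": base64.b64encode(screenshot_png).decode("ascii"),
        "duration_seconds": 8,       # short in-game cutscene
    }
    return json.dumps(payload)

# At play-time: capture the framebuffer, describe the scene, fire the request.
req = build_cutscene_request(
    "The hero enters the ruined temple at dusk.",
    b"\x89PNG fake bytes for illustration",
)
print(json.loads(req)["prompt"])  # → The hero enters the ruined temple at dusk.
```

The interesting design choice is the init image: conditioning on a real frame is what keeps procedurally varying game states from producing cutscenes that clash visually with the session.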
Okay, so is the aim to run this locally on a client's computer or served from a cloud? How does the math work out where it's not just easier at that point to render it in game?
In its current state, it's already useful for B-roll, video backgrounds for websites, and any other sort of "generic" application where the point of the shot is just to establish mood and fill time.
But more than anything, it's useful as a stepping stone to more full-featured video generation that can maintain characters and story across multiple scenes. It seems clear that at some point tools like this will be able to generate full videos, not just shots.
This is a first step towards "the holodeck". You describe a scene and it exists. Imagine you could jump in and interact with it. That seems like something that could happen in 10-20 years.
You and your friends gather around the TV to watch a video about the time that you all traveled abroad and met a mysterious stranger. In the film, you witness each other take incredible risks, have intimate private conversations, and change in profound ways. Of course none of it actually happened; your voices and likenesses were fed into the movie generator. And did I mention in the film you’re driving expensive cars and wearing designer clothes?
Are they that limited? It's a machine that can make videos from user input: it can ostensibly be used wherever you need video, including for creative, technical and professional applications.
Now, it may not be the best fit for those yet due to its limitations, but you've gotta walk before you can run: compare Stable Diffusion 1.x to FLUX.1 with ControlNet to see where quality and controllability could head in the future.
Because it's pretty cool to be able to imagine any kind of scene in your head, put it into words, then see it be made into a video file that you can actually see and share and refine.
It's got a lot of potential as a way for Google to get paid for other people's skills and hard work, instead of the people who made all of that "data".
It’s kind of hilarious that anybody considers this “democratizing” media creation. How many people who need a video clip are going to be capable of running an open version of this themselves? The wonky “open” models aren’t even close. How much do you think these services are going to cost once the introductory period financed by race-to-the-bottom money stops? OpenAI already charges $200/mo if you want to be guaranteed more than 30-60 minutes of Advanced Voice. The introductory period exists solely to get people engaged enough to push through the blatant theft of millions of artists’ creative output, so they can have a beautiful tool to sell to Hollywood for a whole lot of money that’s still less than traditional VFX. Meanwhile, everyone else gets to dink around in the useless free models or the too-expensive-for-most prosumer tools; people with expensive video-card arrays or the functional equivalent will still be niche tinkering hobbyists with inferior tooling and models; and the skilled commercial artists still employed are being paid shit because of market forces. Great job, SV. Making the world a better place.