I seem to recall that there used to be a video showing this approach in action. As input it took a video panning across a shelf full of books where the resolution was so low that the titles were illegible. And as output it produced a video with higher resolution and all the titles easily readable. Unfortunately I can't find that video any longer.
Yes, it all boils down to point spread functions. In the mosaic case, the PSF varies both spatially (per pixel) and temporally (from frame to frame of the video). The paper you linked similarly details how they estimate the PSF. In theory you can also do the whole thing without knowing the PSF, which is called blind deconvolution: https://en.wikipedia.org/wiki/Blind_deconvolution
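For a feel of what the non-blind case looks like when the PSF *is* known (and, unlike the mosaic case, spatially invariant), here's a minimal Richardson-Lucy deconvolution sketch in Python. The Gaussian PSF and the toy impulse image are illustrative assumptions, not anything from the linked paper:

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(blurred, psf, num_iter=30):
    """Richardson-Lucy deconvolution, assuming a known, shift-invariant PSF."""
    estimate = np.full(blurred.shape, 0.5)  # flat initial guess
    psf_mirror = psf[::-1, ::-1]            # flipped PSF for the correction step
    for _ in range(num_iter):
        denom = fftconvolve(estimate, psf, mode="same")
        relative = blurred / np.maximum(denom, 1e-12)  # guard against divide-by-zero
        estimate *= fftconvolve(relative, psf_mirror, mode="same")
    return estimate

# Toy demo: blur an impulse with a Gaussian PSF, then try to recover it.
x, y = np.mgrid[-3:4, -3:4]
psf = np.exp(-(x**2 + y**2) / 2.0)
psf /= psf.sum()

sharp = np.zeros((64, 64))
sharp[32, 32] = 1.0
blurred = fftconvolve(sharp, psf, mode="same")

restored = richardson_lucy(blurred, psf)
print("peak before:", blurred.max(), "peak after:", restored.max())
```

Blind deconvolution works on the same principle but alternates between estimating the image and estimating the PSF itself, which is much harder and less stable.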
- http://www.eyetap.org/papers/docs/mann94virtual.pdf
- http://wearcam.org/orbits/index.html