I seem to recall that there used to be a video showing this approach in action. As input it took a video panning across a shelf full of books where the resolution was so low that the titles were illegible. And as output it produced a video with higher resolution and all the titles easily readable. Unfortunately I can't find that video any longer.
Yes, it all boils down to point spread functions. In the mosaic case, the PSF varies both spatially (per pixel) and temporally (from frame to frame of the video). The paper you linked similarly details how they estimate the PSF. In theory you can also do the whole thing without knowing the PSF, which is called blind deconvolution: https://en.wikipedia.org/wiki/Blind_deconvolution
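For a feel of what the non-blind case looks like when the PSF *is* known (and, unlike the mosaic case, spatially invariant), here's a minimal Richardson-Lucy deconvolution sketch in Python. The Gaussian PSF and the toy impulse image are illustrative assumptions, not anything from the linked paper:

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(blurred, psf, num_iter=30):
    """Richardson-Lucy deconvolution, assuming a known, shift-invariant PSF."""
    estimate = np.full(blurred.shape, 0.5)  # flat initial guess
    psf_mirror = psf[::-1, ::-1]            # flipped PSF for the correction step
    for _ in range(num_iter):
        denom = fftconvolve(estimate, psf, mode="same")
        relative = blurred / np.maximum(denom, 1e-12)  # guard against divide-by-zero
        estimate *= fftconvolve(relative, psf_mirror, mode="same")
    return estimate

# Toy demo: blur an impulse with a Gaussian PSF, then try to recover it.
x, y = np.mgrid[-3:4, -3:4]
psf = np.exp(-(x**2 + y**2) / 2.0)
psf /= psf.sum()

sharp = np.zeros((64, 64))
sharp[32, 32] = 1.0
blurred = fftconvolve(sharp, psf, mode="same")

restored = richardson_lucy(blurred, psf)
print("peak before:", blurred.max(), "peak after:", restored.max())
```

Blind deconvolution works on the same principle but alternates between estimating the image and estimating the PSF itself, which is much harder and less stable.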
- http://www.eyetap.org/papers/docs/mann94virtual.pdf
- http://wearcam.org/orbits/index.html