Adapting SynSin for Enhanced Video Frame Interpolation via Temporal Feature Fusion and Depth Consistency
Abstract
Frame interpolation seeks to synthesize temporally consistent in-between frames that enhance video smoothness and visual continuity. We revisit SynSin, originally designed for single-image novel view synthesis, and reformulate it for interpolation by introducing (i) dual-frame input handling, (ii) temporal feature fusion with an explicit temporal encoding, (iii) a dedicated interpolation module, (iv) depth-consistency constraints across the inputs and the synthesized frame, and (v) a refinement stage to suppress artifacts. Training uses a reconstruction-oriented loss, while evaluation reports MSE, SSIM, and PSNR. We assess the approach on the indoor 7Scenes benchmark. The modified model yields a validation MSE of 0.0011 and an SSIM of 0.943, improving over an unmodified baseline (MSE = 0.0033, SSIM = 0.9327) by a 66.7% reduction in error and a +0.0103 absolute SSIM gain. These results indicate that a view-synthesis backbone can be effectively adapted to temporal synthesis, offering a simple, data-efficient route to competitive interpolation quality for video editing, animation, and streaming. Beyond performance, our study highlights a practical pathway for repurposing view-synthesis architectures for broader video tasks, encouraging unified designs that share geometry-aware depth reasoning and temporal modeling.
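To make the dual-frame fusion and depth-consistency ideas concrete, the sketch below shows one plausible PyTorch realization. It is a minimal illustration, not the paper's implementation: the `TemporalFusion` module, the channel counts, the constant timestamp-plane encoding, and the linear-in-time `depth_consistency_loss` formulation are all assumptions introduced here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFusion(nn.Module):
    """Fuse features from two input frames, conditioned on the target time t.

    Hypothetical sketch: channel sizes and the scalar temporal encoding are
    placeholders, since the abstract does not specify these details.
    """
    def __init__(self, channels: int = 64):
        super().__init__()
        # 2 * channels for the concatenated per-frame features,
        # + 1 for the broadcast temporal-encoding channel.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat0: torch.Tensor, feat1: torch.Tensor, t: float) -> torch.Tensor:
        # Explicit temporal encoding: a constant plane holding the target
        # timestamp t in [0, 1] between the two input frames.
        b, _, h, w = feat0.shape
        t_plane = torch.full((b, 1, h, w), t, device=feat0.device, dtype=feat0.dtype)
        return self.fuse(torch.cat([feat0, feat1, t_plane], dim=1))


def depth_consistency_loss(d0: torch.Tensor, d1: torch.Tensor,
                           d_mid: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Assumed formulation: penalize the synthesized frame's depth map for
    straying from the time-interpolated depths of the two input frames."""
    d_expected = (1.0 - t) * d0 + t * d1
    return F.l1_loss(d_mid, d_expected)


if __name__ == "__main__":
    # Smoke test with random feature maps and depth maps.
    fusion = TemporalFusion(channels=64)
    f0, f1 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    fused = fusion(f0, f1, t=0.5)            # -> shape (1, 64, 32, 32)
    d0, d1, d_mid = (torch.rand(1, 1, 32, 32) for _ in range(3))
    print(fused.shape, depth_consistency_loss(d0, d1, d_mid).item())
```

Under this reading, the temporal encoding lets a single network render any intermediate timestamp rather than only the midpoint, and the depth term ties the synthesized geometry to both inputs; the actual loss weighting and network depth would follow the paper's training setup.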
