Generalization to Novel Camera Trajectories


To showcase generalization to novel camera movements, we generate each video as follows: the first part of the video is rendered using the camera and dynamics control latents inferred by the latent pose estimators on each video frame (as shown in Fig. 4 and described in Sec. 4.1). Then, we freeze the dynamics control latent and apply a series of manipulations to the camera control latent that result in panning, zooming, and circular camera motions while the scene content remains frozen. We note that the rotating motion is novel and does not exist in either of the training datasets, DySO or SSv2. We obtain these manipulations from several synthetic scenes similar to DySO that were rendered with the corresponding camera paths set up by hand. We obtain the camera control latents from those synthetic scenes using the camera estimator and add them to the control latent of the real video, thereby transferring the camera motion from the synthetic dataset to the real video. Finally, we resume playback of the rendered video by using the camera and dynamics control latents from the original video again.
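A minimal sketch of this splicing procedure is given below. The callables `estimate_camera`, `estimate_dynamics`, and `render_frame` are hypothetical stand-ins for the latent pose estimators and the DyST decoder, and "adding" the synthetic camera latents is interpreted here as offsetting the real video's camera latent by the synthetic path relative to its first frame; this is one plausible reading, not necessarily the exact rule used in our pipeline.

```python
import numpy as np

def splice_camera_motion(scene_repr, real_frames, synthetic_frames, t_freeze,
                         estimate_camera, estimate_dynamics, render_frame):
    """Render a video whose camera motion is swapped in mid-way.

    Frames [0, t_freeze) replay the real video's own control latents. From
    t_freeze on, the dynamics latent is frozen and the camera latent follows
    a hand-designed synthetic camera path, after which playback resumes.
    The three callables are hypothetical stand-ins, not a released API.
    """
    cam_real = estimate_camera(real_frames)        # [T, d_cam]
    dyn_real = estimate_dynamics(real_frames)      # [T, d_dyn]
    cam_synth = estimate_camera(synthetic_frames)  # [S, d_cam]

    frames_out = []

    # Part 1: replay the original camera and dynamics control latents.
    for t in range(t_freeze):
        frames_out.append(render_frame(scene_repr, cam_real[t], dyn_real[t]))

    # Part 2: freeze the dynamics latent and transfer the synthetic camera
    # motion, applied as an offset relative to the synthetic path's first frame
    # (an assumption made for this sketch).
    frozen_dyn = dyn_real[t_freeze - 1]
    base_cam = cam_real[t_freeze - 1]
    for cam_offset in cam_synth - cam_synth[0]:
        frames_out.append(render_frame(scene_repr, base_cam + cam_offset, frozen_dyn))

    # Part 3: resume playback with the original control latents.
    for t in range(t_freeze, len(real_frames)):
        frames_out.append(render_frame(scene_repr, cam_real[t], dyn_real[t]))

    return np.stack(frames_out)
```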

Latent Distances on Real-World Videos

We extend Figure 5 (right) with more examples, plotting frame-to-frame L2 distances for the camera control latents (left) and the dynamics control latents (right).
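The quantity plotted is straightforward; the short sketch below shows the intended computation, with placeholder latent shapes that are not the paper's actual dimensions.

```python
import numpy as np

def frame_to_frame_l2(latents):
    """L2 distance between control latents of consecutive frames.

    latents: array of shape [T, d] holding one control latent per frame;
    returns an array of shape [T - 1].
    """
    return np.linalg.norm(latents[1:] - latents[:-1], axis=-1)

# Toy usage with placeholder shapes (not the paper's latent dimensions).
cam_latents = np.random.randn(32, 8)
dyn_latents = np.random.randn(32, 8)
cam_dist = frame_to_frame_l2(cam_latents)   # plotted on the left
dyn_dist = frame_to_frame_l2(dyn_latents)   # plotted on the right
```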

Comparison to Robust DynRF on SSv2

We compare DyST to Robust DynRF, a state-of-the-art NeRF method for dynamic videos without camera poses. For DyST, we apply the trained model to the first, middle, and last frames of the video to compute a scene representation, and generate all frames using control latents estimated from those frames. For Robust DynRF, we show renderings after training the model on the full video. Note that in contrast to Robust DynRF, DyST has not seen any of the videos during training. On these videos, Robust DynRF achieves an average PSNR of 26.1 and LPIPS of 0.34. In contrast, DyST achieves a significantly better PSNR of 27.9 and LPIPS of 0.18 (lower is better).
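A rough sketch of this per-video protocol is shown below, using a standard PSNR computation. The `model` interface (`encode_scene`, `estimate_controls`, `render`) is hypothetical, and control latents are assumed here to be estimated from each target frame; the LPIPS metric, which requires a learned perceptual network, is omitted.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 20.0 * np.log10(max_val) - 10.0 * np.log10(mse)

def evaluate_video(model, frames):
    """Per-video evaluation sketch.

    `model` is assumed to expose `encode_scene`, `estimate_controls`, and
    `render` (hypothetical names, not a released API); `frames` is an array
    of shape [T, H, W, 3] with values in [0, 1].
    """
    T = len(frames)
    # Scene representation from the first, middle, and last frames only.
    scene = model.encode_scene(frames[[0, T // 2, T - 1]])
    scores = []
    for frame in frames:
        # Control latents are assumed to be estimated per target frame.
        cam, dyn = model.estimate_controls(frame)
        pred = model.render(scene, cam, dyn)
        scores.append(psnr(pred, frame))
    return float(np.mean(scores))  # LPIPS would be averaged analogously
```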


Figure columns, left to right: Robust DynRF, DyST, original video.

Novel View Synthesis on DySO

We showcase several more examples of novel view synthesis on the test set of the DySO dataset, following the format of Figure 3. The columns show, from left to right: the 3 input views used to estimate the scene representation, the input to the camera estimator, the input to the dynamics estimator, the image generated by DyST, and the ground truth for that particular camera and dynamics combination. Note that for several scenes, the object or parts of the background are not visible in the input views from the requested pose or viewing direction. DyST resolves the resulting uncertainty by blurring the respective parts of the generated image.


Figure columns, left to right: Input Views, Cam. Est., Dyn. Est., Pred, GT.