TL;DR: We present GeoStream, a streaming video generation system with precise metric-scale camera control via a self-refreshing 3D cache, enabling accurate viewpoint manipulation under severe and extreme camera motions. More examples below!
🎯 Our goal is precise camera control under both severe camera motion and extreme motion magnitude. How do we achieve this?
[Videos: two comparisons, each showing the Condition and GeoStream (Ours) next to a baseline: CameraCtrl II in the first, GEN3C in the second.]
🧑‍🏫 Stage 1: Bidirectional Teacher Training.
We train a teacher model on ground-truth frames with explicit geometric conditioning: depth from a frame is unprojected to a 3D point cloud and reprojected into the target view to produce point-rendering conditioning for the diffusion transformer.
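The unproject-then-reproject step can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K`, metric depth, and 4x4 pose matrices, and it uses a nearest-pixel z-buffered splat. The function names `unproject_depth` and `render_points` are hypothetical, and GeoStream's actual renderer may differ.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) metric depth map to an (H*W, 3) world-space point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, row-major order
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)           # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def render_points(points, colors, K, world_to_cam, h, w):
    """Splat coloured points into a target view, keeping the nearest point per pixel."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6                    # discard points behind the camera
    pts_cam, cols = pts_cam[front], colors[front]
    proj = pts_cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    flat = v[inside] * w + u[inside]                # flattened target pixel index
    z, cols = pts_cam[inside, 2], cols[inside]
    order = np.lexsort((z, flat))                   # sort by pixel, then near to far
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1] # first entry per pixel is nearest
    keep = order[first]
    image = np.zeros((h, w, 3), dtype=colors.dtype)
    mask = np.zeros((h, w), dtype=bool)
    image.reshape(-1, 3)[flat[keep]] = cols[keep]
    mask.reshape(-1)[flat[keep]] = True
    return image, mask
```

The returned `mask` marks pixels that received a point. The empty regions are the disocclusions that the diffusion model has to fill in.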
🧑‍🎓 Stage 2: On-Policy Causal Distillation.
We distill a causal student for autoregressive streaming. After a brief teacher-forced warm-up, the student is distilled on-policy via Distribution Matching Distillation (DMD). The student rolls out its own frames, and the 3D cache used as conditioning is re-rendered from depth estimated on those self-generated frames, matching the self-refresh mechanism used at inference. This aligns the training and inference distributions for both the frame stream and the geometric conditioning, which addresses standard autoregressive drift together with the second-order feedback loop induced by the cache.
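The self-refresh rollout described above can be sketched as a short loop. This is an illustration under stated assumptions, not the released implementation: `generate`, `estimate_depth`, and `render_cache` are hypothetical callables standing in for the causal student, a depth estimator, and the point renderer, and the sketch refreshes the cache from the most recent frame only. The DMD loss is omitted.

```python
from typing import Any, Callable, List, Sequence

def self_refresh_rollout(
    first_frame: Any,
    target_poses: Sequence[Any],
    generate: Callable[[List[Any], Any], Any],
    estimate_depth: Callable[[Any], Any],
    render_cache: Callable[[Any, Any, Any], Any],
) -> List[Any]:
    """Roll out frames one by one, rebuilding the conditioning from the model's own outputs."""
    frames = [first_frame]
    for pose in target_poses:
        latest = frames[-1]
        depth = estimate_depth(latest)            # depth on a self-generated frame
        cond = render_cache(latest, depth, pose)  # re-render the cache into the target view
        frames.append(generate(frames, cond))     # causal step: conditions on past frames only
    return frames
```

Running the same loop during distillation and at inference is what keeps the conditioning distribution identical in both settings.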
GeoStream (Ours) vs baselines on the RealEstate10K test set.
[Video grid: nine scenes, each comparing Ground Truth, GeoStream (Ours), MotionCtrl, SEVA, CameraCtrl II, FlexWorld + MoGe-2, ViewCrafter + MoGe-2, VMem + MoGe-2, GEN3C + MoGe-2, and Spatia + MoGe-2.]
GeoStream (Ours) vs baselines under severe camera motion. The camera trajectories of generated videos are estimated by MapAnything.
[Video grid: six scenes, each showing Condition / Generation for GeoStream (Ours) and for CameraCtrl II.]
GeoStream (Ours) vs baselines under extreme camera motion. The camera trajectories of generated videos are estimated by MapAnything.
[Video grid: six scenes, each showing Condition / Generation for GeoStream (Ours) and for a baseline: CameraCtrl II in the first three, GEN3C in the last three.]
Long-term rollouts on the RealEstate10K test set. The camera trajectory of each generated video is estimated by MapAnything.
[Video grid: nine scenes, each showing Condition / Generation, Ground Truth, and GeoStream (Ours).]