GeoStream: Toward Precise Camera Controlled Streaming Video Generation

Anonymous Authors

TL;DR: We present GeoStream, a streaming video generation system with precise metric-scale camera control via a self-refreshing 3D cache, enabling accurate viewpoint manipulation under severe and extreme camera motions. More examples below!

Top-right (first frame) + ➡️ 🔄

Motivation

🎯 We aim for precise camera control under both severe and extreme camera motion. How do we achieve this?

Condition

GeoStream (Ours)

CameraCtrl II

Condition

GeoStream (Ours)

GEN3C

Method Overview

🧑‍🏫 Stage 1: Bidirectional Teacher Training.

We train a teacher model on ground-truth frames with explicit geometric conditioning: depth from a frame is unprojected to a 3D point cloud and reprojected into the target view to produce point-rendering conditioning for the diffusion transformer.
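The unproject-then-reproject step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' renderer: it assumes a pinhole camera with intrinsics `K`, 4x4 `cam_to_world` / `world_to_cam` matrices, and nearest-pixel splatting with a z-buffer. The function names `unproject` and `render_points` are hypothetical.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift a metric depth map (H, W) to world-space points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u = column, v = row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T           # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def render_points(points, colors, K, world_to_cam, h, w):
    """Splat coloured points into a target view; nearest point wins per pixel."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6              # drop points behind the camera
    pts_cam, colors = pts_cam[front], colors[front]
    proj = pts_cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv, z, colors = uv[inside], pts_cam[inside, 2], colors[inside]
    order = np.argsort(z, kind="stable")      # near to far
    flat = uv[order, 1] * w + uv[order, 0]
    # np.unique returns the first occurrence, i.e. the nearest point per pixel.
    pixels, first = np.unique(flat, return_index=True)
    image = np.zeros((h * w, 3))
    mask = np.zeros(h * w, dtype=bool)
    image[pixels] = colors[order][first]
    mask[pixels] = True
    return image.reshape(h, w, 3), mask.reshape(h, w)

# Toy usage: render a source frame into a camera shifted along +x (assumed values).
h, w = 6, 8
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 3.0], [0.0, 0.0, 1.0]])
source_frame = np.random.default_rng(0).uniform(size=(h, w, 3))
source_depth = np.full((h, w), 2.0)
target_w2c = np.eye(4)
target_w2c[0, 3] = -0.2                       # world-to-camera of the target view
cloud = unproject(source_depth, K, np.eye(4))
cond_image, cond_mask = render_points(cloud, source_frame.reshape(-1, 3), K, target_w2c, h, w)
```

The rendered image and its validity mask together play the role of the point-rendering conditioning: pixels where the mask is false are regions the source frame never observed, which the generator has to fill in.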

🧑‍🎓 Stage 2: On-Policy Causal Distillation.

We distill a causal student for autoregressive streaming. After a brief teacher-forced warm-up, the student is distilled on-policy via DMD. The student rolls out its own frames, and the 3D cache used as conditioning is re-rendered from depth estimated on those self-generated frames, matching the self-refresh mechanism used at inference. This aligns the train and inference distributions on both the frame stream and the geometric conditioning, jointly mitigating standard autoregressive drift and the second-order feedback loop induced by the cache.
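The data flow of the on-policy rollout can be sketched as follows. This is only an illustration of the self-refresh loop under stated assumptions: `student`, `estimate_depth`, and `render_cache` are hypothetical callables standing in for the causal generator, the depth estimator, and the point renderer, and the DMD loss that would be applied to the rolled-out frames is omitted. Whether the real cache accumulates or replaces entries is not specified here; the sketch simply appends.

```python
import numpy as np

def rollout_with_self_refresh(student, estimate_depth, render_cache, first_frame, poses):
    """Autoregressive rollout in which the 3D cache is built from the student's own frames."""
    # Seed the cache with the given first frame and its estimated depth.
    cache = [(first_frame, estimate_depth(first_frame), poses[0])]
    frames, conds = [first_frame], []
    for pose in poses[1:]:
        cond = render_cache(cache, pose)   # conditioning rendered for the target view
        frame = student(frames, cond)      # causal: only past frames are visible
        # Self-refresh: depth comes from the generated frame, not from ground truth.
        cache.append((frame, estimate_depth(frame), pose))
        frames.append(frame)
        conds.append(cond)
    return frames, conds

# Toy stand-ins (assumptions for illustration only) that make the loop runnable.
def toy_depth(frame):
    return np.ones(frame.shape[:2])

def toy_render(cache, pose):
    return cache[-1][0]                    # pretend the render is the latest cached frame

def toy_student(frames, cond):
    return cond + 1.0                      # pretend generation is a simple update

frames, conds = rollout_with_self_refresh(
    toy_student, toy_depth, toy_render,
    first_frame=np.zeros((4, 4, 3)), poses=[np.eye(4)] * 4,
)
```

Because every conditioning image after the first is rendered from frames the student itself produced, any error in a generated frame feeds back into later conditioning. That is the feedback loop that training on such rollouts is meant to expose.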

Qualitative Results

GeoStream (Ours) vs baselines on the RealEstate10K Test set.

0387ef3895b1393c

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

1aaeb7f0aee2f9e4

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

3e07add8413f8157

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

40f92f1e65a5e1dd

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

0f6206df8a8e440a

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

59bfa3dceffc42b6

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

5a15212752d1659a

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

6a3fc7c0aee227b9

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

0d46043105cf3185

Ground Truth

GeoStream (Ours)

MotionCtrl

SEVA

CameraCtrl II

FlexWorld + MoGe-2

ViewCrafter + MoGe-2

VMem + MoGe-2

GEN3C + MoGe-2

Spatia + MoGe-2

GeoStream (Ours) vs baselines under severe camera motion. The camera trajectories of generated videos are estimated by MapAnything.

b40142d9233e7825

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

d942e48c948b3546

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

86dd81ed2ece0c4e

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

565459b7bb0cbbf9

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

21e794f71e31becb

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

91e469c1698f1da4

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

GeoStream (Ours) vs baselines under extreme camera motion. The camera trajectories of generated videos are estimated by MapAnything.

fd48a65a5e252855

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

5c9274b41b9510f7

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

4905bc8817511dd2

Condition / Generation

GeoStream (Ours)

Condition / Generation

CameraCtrl II

01cf55ae3e378faf

Condition / Generation

GeoStream (Ours)

Condition / Generation

GEN3C

b1ee9f10fc740b0d

Condition / Generation

GeoStream (Ours)

Condition / Generation

GEN3C

cb16176621b1f3a7

Condition / Generation

GeoStream (Ours)

Condition / Generation

GEN3C

Long-term rollouts on the RealEstate10K Test set. The camera trajectory of the generated video is estimated by MapAnything.

666be06e68dcb7c5

Condition / Generation

Ground Truth

GeoStream (Ours)

03906f66d3bca71a

Condition / Generation

Ground Truth

GeoStream (Ours)

a7f0e6b19b27514d

Condition / Generation

Ground Truth

GeoStream (Ours)

4b86587ecd3325f4

Condition / Generation

Ground Truth

GeoStream (Ours)

1e1e13de4ebea05a

Condition / Generation

Ground Truth

GeoStream (Ours)

bc3e794a104aa602

Condition / Generation

Ground Truth

GeoStream (Ours)

1214f2a11a9fc1ed

Condition / Generation

Ground Truth

GeoStream (Ours)

f43c67fe4e7e28cf

Condition / Generation

Ground Truth

GeoStream (Ours)

8c8ed2adba3dad7a

Condition / Generation

Ground Truth

GeoStream (Ours)
