We evaluate our approach using simulated intersection scenarios on the nuScenes dataset, where the ego car (green) remains stationary. In the bird's-eye-view map, we manually place new agents (red) that cross the intersection from left to right or right to left. These agents' bounding boxes are then projected into the ego car's camera perspective, and the resulting bounding-box frames are used as input to the Box2Video network, which generates realistic video sequences.
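To make the projection step concrete, below is a minimal sketch of how a BEV-placed agent box can be projected into the ego camera view. It assumes a pinhole camera with known intrinsics K and a world-to-camera extrinsic matrix; the function names and matrix conventions are illustrative, not taken from the Ctrl-V codebase.

```python
import numpy as np

def project_bev_box_to_image(corners_world, T_world_to_cam, K):
    """Project the 8 corners of a BEV-placed 3D box into the ego camera.

    corners_world: (8, 3) box corners in world coordinates.
    T_world_to_cam: (4, 4) world-to-camera extrinsic matrix (assumed known).
    K: (3, 3) camera intrinsic matrix.
    Returns (8, 2) pixel coordinates (corners behind the camera are invalid).
    """
    # Homogeneous world coordinates -> camera frame.
    pts_h = np.hstack([corners_world, np.ones((8, 1))])   # (8, 4)
    pts_cam = (T_world_to_cam @ pts_h.T)[:3]              # (3, 8)
    # Pinhole projection: divide by depth.
    pix = K @ pts_cam
    pix = pix[:2] / pix[2:3]
    return pix.T

def bbox2d_from_corners(pix, width, height):
    """Axis-aligned 2D box from projected corners, clipped to the image."""
    x0, y0 = pix.min(axis=0)
    x1, y1 = pix.max(axis=0)
    return (max(0.0, x0), max(0.0, y0), min(width, x1), min(height, y1))
```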
Our model was not trained on car-crash examples. To evaluate its generative capabilities on out-of-distribution scenarios, we test it with manually crafted car-crash prompts. In this scenario, the ego car (green in the bird's-eye-view map) remains stationary, while two new agents (red) are positioned to cross the intersection from opposite directions, leading to a collision between them. As before, the agents' bounding boxes are projected into the ego camera view and the resulting bounding-box frames are fed to Box2Video. Upon collision, the two cars merge into one in the output video, illustrating the model's response to an out-of-distribution input.
We showcase a range of BDD100K generation results produced by our model across diverse scenes: city and urban streets, highways, busy intersections, and nighttime driving. Each visualization shows a 25-frame clip at 5 fps. The left column displays the ground-truth clip; the right column shows Ctrl-V generations, produced by Box2Video conditioned on the BBox predictor's predictions. Inputs: ONE initial GT frame + THREE initial GT 2D-bounding-box frames + ONE last GT 2D-bounding-box frame.
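For clarity, the conditioning pattern above can be written down as a mask over the clip. The sketch below is a hypothetical illustration of which bounding-box frames are supplied as conditions versus left for the BBox predictor to fill in; it is not the released implementation.

```python
import numpy as np

def conditioning_mask(clip_len=25, n_init_boxes=3):
    """Boolean mask over a clip's bounding-box frames: True where a GT
    bounding-box frame is given as a condition, False where the BBox
    predictor must predict it."""
    mask = np.zeros(clip_len, dtype=bool)
    mask[:n_init_boxes] = True   # THREE initial GT bounding-box frames
    mask[-1] = True              # ONE last GT bounding-box frame
    return mask
```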
Our model can predict and condition on both 2D and 3D bounding boxes. Below are our 3D-bounding-box frame predictions and the video generations conditioned on them, on the KITTI and vKITTI2 datasets. Inputs: ONE initial GT frame + THREE initial GT 3D-bounding-box frames + ONE last GT 3D-bounding-box frame. Left: bounding-box frame predictions. Right: videos generated from the predicted bounding-box frames. These examples are resized from 375x1242 to 320x512.
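As a concrete reference for the 3D case, the sketch below projects a KITTI-style 3D box (dimensions h, w, l; bottom-center location (x, y, z); yaw ry about the camera Y axis) into the image using KITTI's 3x4 projection matrix P2. This follows the standard KITTI annotation convention; the helper names are illustrative, not from our code.

```python
import numpy as np

def kitti_box_corners(h, w, l, x, y, z, ry):
    """8 corners of a KITTI 3D box in camera coordinates.
    (x, y, z) is the bottom-center of the box; y points down (KITTI)."""
    xs = [ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2]
    ys = [ 0.0,  0.0,  0.0,  0.0,  -h,   -h,   -h,   -h ]
    zs = [ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2]
    # Rotation about the camera Y axis.
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [ 0,          1, 0         ],
                  [-np.sin(ry), 0, np.cos(ry)]])
    return R @ np.array([xs, ys, zs]) + np.array([[x], [y], [z]])  # (3, 8)

def project_to_image(corners, P2):
    """Project camera-frame corners with KITTI's 3x4 matrix P2."""
    pts = P2 @ np.vstack([corners, np.ones((1, 8))])
    return (pts[:2] / pts[2]).T  # (8, 2) pixel coordinates
```

Since the frames are resized from 375x1242 to 320x512, the projected pixel coordinates would additionally be scaled by 320/375 vertically and 512/1242 horizontally.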
A frame-by-frame visualization of Box2Video generation, conditioned on the ground-truth 2D-bounding-box frame sequence from the BDD100K dataset. The ground-truth bounding boxes are overlaid on the frames.
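A minimal sketch of this overlay step, assuming frames are HxWx3 uint8 arrays and boxes are per-frame lists in pixel coordinates; drawing uses OpenCV, and the helper is hypothetical rather than the paper's visualization code.

```python
import cv2

def overlay_boxes(frames, boxes_per_frame, color=(0, 255, 0)):
    """Draw ground-truth 2D boxes on each generated frame.

    frames: list of HxWx3 uint8 arrays (generated clip).
    boxes_per_frame: list of [(x0, y0, x1, y1), ...] per frame, in pixels.
    Returns the annotated frames.
    """
    out = []
    for frame, boxes in zip(frames, boxes_per_frame):
        vis = frame.copy()
        for (x0, y0, x1, y1) in boxes:
            cv2.rectangle(vis, (int(x0), int(y0)), (int(x1), int(y1)),
                          color, thickness=2)
        out.append(vis)
    return out
```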
Generation visualization on BDD100K when our BBox predictor receives a trajectory frame as the final conditioning frame instead of a 2D-bounding-box frame.
Generation comparison on BDD100K: Fine-tuned Stable Video Diffusion (SVD) (left) vs. Ctrl-V (right).