ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Abstract

This work aims to achieve more precise and versatile object control in image-to-video (I2V) generation. Current methods typically represent the spatial movement of target objects with 2D trajectories, which often fail to capture user intent and frequently produce unnatural results. To enhance control, this paper presents ObjCtrl-2.5D, a training-free object control approach that uses a 3D trajectory, extended from a 2D trajectory with depth information, as the control signal. By modeling object movement as camera movement, ObjCtrl-2.5D represents the 3D trajectory as a sequence of camera poses, enabling object motion control with an existing camera-motion-control I2V generation model (CMC-I2V) without any training. To adapt the CMC-I2V model, originally designed for global motion control, to local object motion, we introduce a module that isolates the target object from the background, enabling independent local control. In addition, we devise an effective way to achieve more accurate object control by sharing the low-frequency components of the warped latent within the object region across frames. Extensive experiments demonstrate that ObjCtrl-2.5D significantly improves object control accuracy compared to training-free methods and offers more diverse control capabilities than training-based approaches using 2D trajectories, enabling complex effects such as object rotation.
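To make the core idea concrete — modeling object movement as camera movement — the following is a minimal sketch, not the authors' released code, of lifting a user-drawn 2D trajectory to 3D with a monocular depth map (e.g., ZoeDepth [8]) and turning it into per-frame camera poses. The pinhole intrinsics K, the metric depth map, and the choice of a pure-translation, negated camera displacement are simplifying assumptions made here for illustration.

```python
# Sketch: lift a 2D trajectory to 3D with depth, then express the object's
# motion as a sequence of camera poses (assumptions noted above).
import numpy as np

def lift_to_3d(traj_2d, depth, K):
    """Back-project 2D trajectory points (u, v) into 3D camera coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts_3d = []
    for (u, v) in traj_2d:
        z = depth[int(v), int(u)]          # depth sampled at the trajectory point
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        pts_3d.append(np.array([x, y, z]))
    return np.stack(pts_3d)                # (T, 3)

def poses_from_traj(pts_3d):
    """Turn the 3D trajectory into per-frame camera-to-world matrices.
    Moving the camera by -delta makes a static object appear to move by +delta."""
    poses = []
    for p in pts_3d:
        delta = p - pts_3d[0]              # object displacement w.r.t. the first frame
        pose = np.eye(4)
        pose[:3, 3] = -delta               # opposite camera translation, no rotation
        poses.append(pose)
    return np.stack(poses)                 # (T, 4, 4)
```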


Methods

ObjCtrl-2.5D first extends the provided 2D trajectory to a 3D trajectory using depth information from the conditioning image. This 3D trajectory is then transformed into a sequence of camera poses via a triangulation algorithm. To achieve object motion control within a frozen camera motion control module, ObjCtrl-2.5D integrates a Layer Control Module (LCM) that separates the object and background and assigns each a distinct camera pose. After extracting camera pose features via a Camera Encoder, the LCM spatially combines these features using a series of scale-wise masks. Additionally, ObjCtrl-2.5D introduces a Shared Warping Latent (SWL) technique, which enhances control by sharing the low-frequency components of the initial noise across frames within the object's warped regions.
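The sketch below illustrates, under stated assumptions rather than as the released implementation, how the LCM's scale-wise mask blending and the SWL's low-frequency sharing could look in PyTorch. Function names, tensor shapes, and the frequency cutoff are illustrative assumptions; the object mask would come from a segmentation model such as SAM 2 [7], and the warped regions from warping the object latent along the trajectory. Sharing only the low-frequency band is intended to propagate the object's coarse appearance and position across frames while leaving high-frequency details free to vary.

```python
# Sketch of (1) scale-wise mask blending of camera-pose features and
# (2) sharing low-frequency initial noise inside warped object regions.
import torch
import torch.nn.functional as F

def combine_pose_features(obj_feats, bg_feats, obj_mask):
    """Blend per-scale camera-pose features: object pose inside the mask,
    background pose outside.
    obj_feats / bg_feats: lists of tensors (B, C, H_s, W_s); obj_mask: (B, 1, H, W)."""
    combined = []
    for f_obj, f_bg in zip(obj_feats, bg_feats):
        h, w = f_obj.shape[-2:]
        m = F.interpolate(obj_mask, size=(h, w), mode="nearest")  # scale-wise mask
        combined.append(m * f_obj + (1.0 - m) * f_bg)
    return combined

def shared_warping_latent(latents, warp_masks, cutoff=0.25):
    """Copy the low-frequency content of the first frame's latent into the
    warped object region of every later frame.
    latents: (T, C, H, W); warp_masks: (T, 1, H, W)."""
    T, C, H, W = latents.shape
    freq = torch.fft.fftshift(torch.fft.fft2(latents), dim=(-2, -1))
    # low-pass mask centred on the spectrum
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    low = ((yy.abs() <= cutoff) & (xx.abs() <= cutoff)).to(latents.dtype)
    freq_shared = low * freq[0:1] + (1.0 - low) * freq            # share frame 0's low freqs
    shared = torch.fft.ifft2(torch.fft.ifftshift(freq_shared, dim=(-2, -1))).real
    return warp_masks * shared + (1.0 - warp_masks) * latents     # only inside warped areas
```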

Results: Control with Trajectory


Results: Control with Camera Pose

[Video grids: each example shows the Input frame alongside results generated with different camera poses — ZoomIn, ZoomOut, PanLeft, PanRight (at two speeds), ZoomIn-Out, Anti-Clockwise, and Rotate.]

Flexible Background Movements

[Video grid: Input | Free | Fixed | Reverse — background movements combined with object movement.]

Comparisons with Previous Methods

(A) Comparison with Training-free Methods

[Video grid: Input | PEEKABOO | FreeTraj | ObjCtrl-2.5D]

(B) Comparison with Training-based Methods

[Video grid: Input | DragAnything | DragNUWA | ObjCtrl-2.5D]

(C) More Comparisons

[Video grid: Input | PEEKABOO | FreeTraj | DragNUWA | DragAnything | ObjCtrl-2.5D]

Ablation Studies

[Video grid: Input | (a) w/o LCM | (b) w/o Scale-wise Mask | (c) w/o SWL | (d) ObjCtrl-2.5D]

[Video grid: Input | Copy-pasting | SWL]

Failure Cases

Due to the limitations of SVD [1] (the base I2V generation model of ObjCtrl-2.5D) in handling large motions, applying ObjCtrl-2.5D with high-speed camera poses can cause the object to fade out of the scene, leaving only the background. Interestingly, this outcome suggests potential for image inpainting applications, as seen in the last frames of the generated videos.



References

[1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[2] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.

[3] Yash Jain, Anshul Nasery, Vibhav Vineet, and Harkirat Behl. PEEKABOO: Interactive video generation via masked-diffusion. In CVPR, 2024.

[4] Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. FreeTraj: Tuning-free trajectory control in video diffusion models. arXiv preprint arXiv:2406.16863, 2024.

[5] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.

[6] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. DragAnything: Motion control for anything using entity representation. In ECCV, 2024.

[7] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[8] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

LICENSE

This project is licensed under the S-Lab License 1.0. Redistribution and use for non-commercial purposes should follow this license.

BibTex

@article{objctrl2.5d,
  title={{ObjCtrl-2.5D}: Training-free Object Control with Camera Poses},
  author={Wang, Zhouxia and Lan, Yushi and Zhou, Shangchen and Loy, Chen Change},
  journal={arXiv preprint arXiv:2412.07721},
  year={2024}
}