Spectral Motion Alignment
for Video Motion Transfer using Diffusion Models


Geon Yeong Park* Hyeonho Jeong* Sang Wan Lee Jong Chul Ye
* Equal Contribution Corresponding Authors
KAIST
AAAI 2025

Spectral Motion Alignment significantly facilitates the capture of long-range and complex motion patterns within videos.

Input Video of Motion M1
 
Shark + M1 + Under Water
Spaceship + M1 + In Space

Input Video of Motion M2
 
Bear + M2
Astronaut + M2 + On Snow

Abstract

The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, here we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.

Overview



The SMA framework distills motion information in the frequency domain. Treating the (latent) frame residuals as motion vectors, we first derive denoised motion vector estimates. The motion vector $\delta v_0^n$ and its estimate $\delta \hat{v}_0^n$ are then aligned in both the pixel domain and the frequency domain. Our regularization includes (1) global motion alignment based on a 1D wavelet transform and (2) local motion refinement based on a 2D Fourier transform.
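To make the two regularizers concrete, here is a minimal NumPy sketch, not the authors' implementation. It works on plain arrays rather than diffusion latents, and every function name in it is hypothetical. The single-level Haar wavelet, the L1-style distances, and the equal treatment of all frequencies are illustrative assumptions; this page does not specify the exact loss forms.

```python
import numpy as np

def frame_residuals(frames):
    """Motion vectors as consecutive frame differences: (N, H, W) -> (N-1, H, W)."""
    return frames[1:] - frames[:-1]

def haar_dwt_1d(x):
    """Single-level 1D Haar DWT along the time axis (axis 0, even length).

    Assumption: Haar is used only because it is the simplest wavelet.
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: slow, long-range motion
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: fast temporal changes
    return approx, detail

def global_alignment_loss(dv, dv_hat):
    """Mean absolute gap between temporal wavelet coefficients (assumed L1 form)."""
    loss = 0.0
    for coeff, coeff_hat in zip(haar_dwt_1d(dv), haar_dwt_1d(dv_hat)):
        loss += np.abs(coeff - coeff_hat).mean()
    return loss

def local_refinement_loss(dv, dv_hat):
    """Mean magnitude of the gap between per-frame 2D Fourier spectra (assumed form)."""
    spec = np.fft.fft2(dv, axes=(-2, -1))
    spec_hat = np.fft.fft2(dv_hat, axes=(-2, -1))
    return np.abs(spec - spec_hat).mean()

# Toy usage: 5 random "frames" give 4 residuals (an even count, as Haar needs).
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8, 8))
dv = frame_residuals(frames)
dv_hat = dv + 0.1 * rng.normal(size=dv.shape)  # stand-in for a denoised estimate
print("global:", global_alignment_loss(dv, dv_hat))
print("local: ", local_refinement_loss(dv, dv_hat))
```

In this sketch the wavelet transform runs along the frame axis, so each coefficient summarizes motion across several frames, while the 2D Fourier transform runs within each residual frame and compares spatial frequency content.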

1. Comparison within VMC (on Show-1)

 
Input Video of Motion M
    
VMC with SMA
Tanks + M + In Desert
VMC
Tanks + M + In Desert
 
    

VMC with SMA
Turtles + M + On The Sand

VMC
Turtles + M + On The Sand

 
Input Video of Motion M
    
VMC with SMA
Raccoon + M + Nuts
VMC
Raccoon + M + Nuts
 
    

VMC with SMA
Squirrel + M + Cherries

VMC
Squirrel + M + Cherries

 
Input Video of Motion M
    
VMC with SMA
Bear + M
VMC
Bear + M
 
    

VMC with SMA
Astronaut + M + On Snow

VMC
Astronaut + M + On Snow

 
Input Video of Motion M
    
VMC with SMA
Spaceships + M + In Space
VMC
Spaceships + M + In Space

2. Comparison within VMC (on Zeroscope)


 
Input Video of Motion M
    

VMC with SMA
Lamborghini + M + In Desert

VMC
Lamborghini + M + In Desert

 
Input Video of Motion M
    
VMC with SMA
Iron Man + M
VMC
Iron Man + M

3. Comparison within MotionDirector

 
Input Video of Motion M
MotionDirector with SMA
Chicken + M
MotionDirector
Chicken + M

 
 
Input Video of Motion M

MotionDirector with SMA
Eagle + M + On Edge

MotionDirector
Eagle + M + On Edge

 
 
Input Video of Motion M

MotionDirector with SMA
Goldfish + M

MotionDirector
Goldfish + M
 
 

MotionDirector with SMA
Airplanes + M + In the sky

MotionDirector
Airplanes + M + In the sky

 
Input Video of Motion M
MotionDirector with SMA
Tank + M + In Desert
MotionDirector
Tank + M + In Desert

4. Comparison within Diffusion-Motion-Transfer (DMT)

 
Input Video of Motion M
    
DMT with SMA
Goat + M + Into a river
DMT
Goat + M + Into a river

 
Input Video of Motion M
    
DMT with SMA
Flamingo + M + On the grass
DMT
Flamingo + M + On the grass

5. Comparison within Tune-A-Video

 
Input Video of Motion M
    
Tune-A-Video with SMA
Monkey + M + On The Water
Tune-A-Video
Monkey + M + On The Water

 
Input Video of Motion M
  
Tune-A-Video with SMA
Eagle + M + Above Trees
Tune-A-Video
Eagle + M + Above Trees

6. Comparison within ControlVideo

 
Input Video of Motion M
 
Depth Condition
  
ControlVideo with SMA
White Puppy + M + On The Lawn
ControlVideo
White Puppy + M + On The Lawn

 
Input Video of Motion M
 
Depth Condition
  
ControlVideo with SMA
Fennec Fox + M + Sausage, In Desert
ControlVideo
Fennec Fox + M + Sausage, In Desert

 
Input Video of Motion M
 
Depth Condition
  
ControlVideo with SMA
Fox + M
ControlVideo
Fox + M

7. Ablation of Global and Local Motion Alignment


Input Video
A man is skating.
VMC (Show-1)
An astronaut is skating on the dune.
VMC (Show-1) + $\ell_{\text{local}}$
An astronaut is skating on the dune.
VMC (Show-1) + $\ell_{\text{local}}$ + $\ell_{\text{global}}$
An astronaut is skating on the dune.

Input Video
A rabbit is eating strawberries.
  
VMC (Show-1) + $\ell_{\text{local}}$
A hamster is eating strawberries.
VMC (Show-1) + $\ell_{\text{local}}$ + $\ell_{\text{global}}$
A hamster is eating strawberries.

Input Video
A car is driving on the road.
  
VMC (Zeroscope) + $\ell_{\text{local}}$
A schoolbus is driving on the grass.
VMC (Zeroscope) + $\ell_{\text{local}}$ + $\ell_{\text{global}}$
A schoolbus is driving on the grass.

8. Additional Results

(a) Training Progress Visualization

A rabbit is eating strawberries.   →   A raccoon is eating nuts.
Input Video
VMC 20 steps
VMC 200 steps
VMC 500 steps
 
Input Video
 
VMC w/ SMA 20 steps
 
VMC w/ SMA 200 steps
 
VMC w/ SMA 500 steps



(b) Comparisons of Local Motion Refinement with 2D DWT and 2D DFT

A man is skateboarding.   →   An astronaut is skateboarding on the dune.
Input Video
DWT refinement
DFT refinement

A rabbit is eating strawberries.   →   A squirrel is eating nuts.
Input Video
DWT refinement
DFT refinement

References

• Zhao, Rui, et al. "Motiondirector: Motion customization of text-to-video diffusion models." arXiv preprint arXiv:2310.08465 (2023).

• Jeong, Hyeonho, et al. "VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models." arXiv preprint arXiv:2312.00845 (2023).

• Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

• Zhao, Min, et al. "Controlvideo: Adding conditional control for one shot text-to-video editing." arXiv preprint arXiv:2305.17098 (2023).

• Sterling, Spencer. Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w (2023).

• Zhang, David Junhao, et al. "Show-1: Marrying pixel and latent diffusion models for text-to-video generation." arXiv preprint arXiv:2309.15818 (2023).