Supplementary Material
Training-free Latent Inter-Frame Pruning with Attention Recovery
Sections: Comparisons to Existing Real-Time (Low Latency) V2V Models; Comparisons with Training-free Pruning Methods; Effectiveness of Attention Recovery; Visualizations with Time-to-Move Integration; References
Our method significantly increases the throughput of the base model (Self-Forcing [1]) for real-time video editing while maintaining the visual quality and temporal consistency of edited videos.
Three corgi puppies sharing a meal together on a kitchen floor.
| Input Video | LIPAR (Ours) - 33.8% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
Three majestic lions huddled together feasting on a meal.
| Input Video | LIPAR (Ours) - 33.8% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
A beautiful blonde woman with blue eyes wearing is performing the moonwalk. Simple dark background.
| Input Video | LIPAR (Ours) - 21.3% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
Two cute, fluffy penguins wearing winter scarves waddling across a frozen ice path in Antarctica.
| Input Video | LIPAR (Ours) - 52.9% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
A old man with white beard is holding and interacting with mysterious rock that has a small tree growing on it. Natural lighting, domestic interior background.
| Input Video | LIPAR (Ours) - 19.1% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
A woman wearing a black leather jacket riding a motorcycle while stretching her arms out joyfully. Realistic cinematic style, wind blowing through hair, blurred asphalt road beneath to imply speed.
| Input Video | LIPAR (Ours) - 20.7% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
A young anime protagonist riding a sleek bike with arms outstretched.
| Input Video | LIPAR (Ours) - 20.7% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
Anime style animation of a frog dancing and performing acrobatic side somersaults. Vibrant cel-shaded colors.
| Input Video | LIPAR (Ours) - 16.8% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
A majestic lion is turning its head to look around in a field.
| Input Video | LIPAR (Ours) - 11.9% Pruned | Self-Forcing [1] |
|---|---|---|
| StreamV2V [2] | StreamDiffusion [3] | ControlVideo [4] |
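The "% Pruned" figures in the captions above vary widely between videos (from 11.9% to 52.9%). This page does not describe how LIPAR selects tokens, so the following is only an illustrative sketch of one way an inter-frame pruning ratio could be measured: a latent token is counted as pruned when it barely changes from the previous frame. The function names, the mean-absolute-difference criterion, and the threshold are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence

Token = Sequence[float]  # one latent token (feature vector); hypothetical layout
Frame = Sequence[Token]  # one latent frame = sequence of tokens


def pruned_mask(prev: Frame, curr: Frame, threshold: float) -> List[bool]:
    """Mark tokens of `curr` whose mean absolute change from `prev` is below `threshold`.

    Assumed criterion for illustration only; not taken from the paper.
    """
    mask = []
    for p, c in zip(prev, curr):
        change = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        mask.append(change < threshold)
    return mask


def pruned_percentage(frames: Sequence[Frame], threshold: float) -> float:
    """Percentage of tokens marked as pruned, over all frames after the first."""
    pruned = total = 0
    for prev, curr in zip(frames, frames[1:]):
        mask = pruned_mask(prev, curr, threshold)
        pruned += sum(mask)
        total += len(mask)
    return 100.0 * pruned / total if total else 0.0


# Toy example: 3 latent frames, 4 tokens each, 2 features per token.
frames = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[0.0, 0.0], [1.0, 1.0], [2.5, 2.5], [3.0, 3.0]],  # one token changes
    [[0.0, 0.0], [1.4, 1.4], [2.5, 2.5], [3.9, 3.9]],  # two tokens change
]
ratio = pruned_percentage(frames, threshold=0.1)
print(f"{ratio:.1f}% pruned")  # 62.5% pruned
```

Under a criterion of this kind, videos with large static regions would reach a higher pruning percentage than videos with motion everywhere, which is one plausible reading of why the reported percentages differ per clip.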
We compare LIPAR against representative training-free pruning methods, namely ToMe [5], Importance-based [6], and IDM [7], each implemented on the Self-Forcing model [1], as shown below.
Three majestic lions huddled together feasting on a meal.
| Input Video | LIPAR (Ours) - 33.8% Pruned | Self-Forcing [1] - No Pruning |
|---|---|---|
| ToMe [5] - 32% Pruned | Importance-based [6] - 32% Pruned | IDM [7] - 32% Pruned |
A beautiful blonde woman with blue eyes wearing is performing the moonwalk. Simple dark background.
| Input Video | LIPAR (Ours) - 31.6% Pruned | Self-Forcing [1] - No Pruning |
|---|---|---|
| ToMe [5] - 32% Pruned | Importance-based [6] - 32% Pruned | IDM [7] - 32% Pruned |
Two cute, fluffy penguins wearing winter scarves waddling across a frozen ice path in Antarctica.
| Input Video | LIPAR (Ours) - 52.9% Pruned | Self-Forcing [1] - No Pruning |
|---|---|---|
| ToMe [5] - 32% Pruned | Importance-based [6] - 32% Pruned | IDM [7] - 32% Pruned |
Direct pruning leads to visual artifacts, and M-degree approximation alone produces noisy patterns. In contrast, full Attention Recovery effectively mitigates these artifacts and restores visual quality.
| Direct Pruning | M-Degree Approx. | M-Degree Approx. + Noise-aware Duplication |
|---|---|---|
We further integrate LIPAR into the Time-to-Move (TTM) model [8] to demonstrate that it generalizes beyond Self-Forcing.
Gardening
| Motion Prompt | TTM [8] | LIPAR (Ours) - 47% Pruned |
|---|---|---|
Owl
| Motion Prompt | TTM [8] | LIPAR (Ours) - 47% Pruned |
|---|---|---|
Cocktail
| Motion Prompt | TTM [8] | LIPAR (Ours) - 66% Pruned |
|---|---|---|
[1] Huang, Xun, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion." In Advances in Neural Information Processing Systems. 2025.
[2] Liang, Feng, et al. "Looking Backward: Streaming Video-to-Video Translation with Feature Banks." In The Thirteenth International Conference on Learning Representations. 2025.
[3] Kodaira, Akio, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, et al. "StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation." arXiv preprint. 2023.
[4] Zhang, Yabo, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. "ControlVideo: Training-free Controllable Text-to-Video Generation." In The Twelfth International Conference on Learning Representations. 2024.
[5] Bolya, Daniel, and Judy Hoffman. "Token merging for fast stable diffusion." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
[6] Wu, Haoyu, et al. "Importance-based token merging for efficient image and video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
[7] Fang, Haipeng, et al. "Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[8] Singer, Assaf, et al. "Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising." arXiv preprint arXiv:2511.08633 (2025).