Training-free Latent Inter-Frame Pruning with Attention Recovery

Supplementary Material

Comparisons to Existing Real-Time (Low Latency) V2V Models

Our method significantly increases the throughput of the base model (Self-Forcing [1]) for real-time video editing while maintaining the visual quality and temporal consistency of edited videos.

Three corgi puppies sharing a meal together on a kitchen floor.

Input Video	LIPAR (Ours) - 33.8% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

Three majestic lions huddled together feasting on a meal.

Input Video	LIPAR (Ours) - 33.8% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

A beautiful blonde woman with blue eyes wearing is performing the moonwalk. Simple dark background.

Input Video	LIPAR (Ours) - 21.3% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

Two cute, fluffy penguins wearing winter scarves waddling across a frozen ice path in Antarctica.

Input Video	LIPAR (Ours) - 52.9% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

A old man with white beard is holding and interacting with mysterious rock that has a small tree growing on it. Natural lighting, domestic interior background.

Input Video	LIPAR (Ours) - 19.1% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

A woman wearing a black leather jacket riding a motorcycle while stretching her arms out joyfully. Realistic cinematic style, wind blowing through hair, blurred asphalt road beneath to imply speed.

Input Video	LIPAR (Ours) - 20.7% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

A young anime protagonist riding a sleek bike with arms outstretched.

Input Video	LIPAR (Ours) - 20.7% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

Anime style animation of a frog dancing and performing acrobatic side somersaults. Vibrant cel-shaded colors.

Input Video	LIPAR (Ours) - 16.8% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

A majestic lion is turning its head to look around in a field.

Input Video	LIPAR (Ours) - 11.9% Pruned	Self-Forcing [1]

StreamV2V [2]	StreamDiffusion [3]	ControlVideo [4]

Comparisons with Other Training-free Pruning Methods

We compare our LIPAR pruning method against representative training-free pruning methods, including ToMe [5], Importance-based [6], and IDM [7], implemented on the Self-Forcing model, as shown below.
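In the captions, the baselines are fixed at 32% pruned while the LIPAR ratio differs from video to video. The sketch below illustrates how a content-dependent pruning percentage can arise from a simple inter-frame criterion. It is only an illustration under stated assumptions, not the LIPAR algorithm: the function names, the mean-absolute-difference criterion, and the threshold value are all hypothetical, and Attention Recovery is not modeled.

```python
import numpy as np

def interframe_keep_mask(latents, threshold):
    """Hypothetical criterion: keep a latent token only if it changed
    enough relative to the same position in the previous frame.

    latents: array of shape (T, N, C) = (frames, tokens per frame, channels).
    Returns a boolean array of shape (T, N); True means the token is kept.
    """
    # Mean absolute change of each token between consecutive frames.
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=-1)  # shape (T-1, N)
    keep = np.ones(latents.shape[:2], dtype=bool)  # first frame is never pruned
    keep[1:] = diff > threshold
    return keep

def pruned_percentage(keep):
    """Share of tokens removed, in percent, over the whole clip."""
    return 100.0 * (1.0 - keep.mean())

# Toy clip: 4 frames, 6 tokens per frame, 8 channels.
# Tokens 0..2 change by 1.0 every frame; tokens 3..5 stay static.
T, N, C = 4, 6, 8
latents = np.zeros((T, N, C))
latents[:, :3] = np.arange(T, dtype=float)[:, None, None]

keep = interframe_keep_mask(latents, threshold=0.5)
print(pruned_percentage(keep))  # 9 of 24 tokens pruned -> 37.5
```

With a criterion of this kind, a clip with a largely static background yields a higher pruning percentage than one with motion everywhere, which is one way per-video ratios can differ under a single fixed threshold.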

Three majestic lions huddled together feasting on a meal.

Input Video	LIPAR (Ours) - 33.8% Pruned	Self-Forcing [1] - No Pruning

ToMe [5] - 32% Pruned	Importance-based [6] - 32% Pruned	IDM [7] - 32% Pruned

A beautiful blonde woman with blue eyes wearing is performing the moonwalk. Simple dark background.

Input Video	LIPAR (Ours) - 31.6% Pruned	Self-Forcing [1] - No Pruning

ToMe [5] - 32% Pruned	Importance-based [6] - 32% Pruned	IDM [7] - 32% Pruned

Two cute, fluffy penguins wearing winter scarves waddling across a frozen ice path in Antarctica.

Input Video	LIPAR (Ours) - 52.9% Pruned	Self-Forcing [1] - No Pruning

ToMe [5] - 32% Pruned	Importance-based [6] - 32% Pruned	IDM [7] - 32% Pruned

Effectiveness of Attention Recovery

Direct pruning leads to visual artifacts, and M-degree approximation alone produces noisy patterns. In contrast, full Attention Recovery (M-degree approximation combined with noise-aware duplication) mitigates these artifacts and restores visual quality.

Direct Pruning	M-Degree Approx.	M-Degree Approx. + Noise-aware Duplication

Visualizations with Time-to-move Integration

We further integrate LIPAR into the Time-to-Move (TTM) model [8] to demonstrate its generalizability.

Gardening

Motion Prompt	TTM [8]	LIPAR (Ours) - 47% Pruned

Owl

Motion Prompt	TTM [8]	LIPAR (Ours) - 47% Pruned

Cocktail

Motion Prompt	TTM [8]	LIPAR (Ours) - 66% Pruned

LIPAR performs well on all provided TTM examples when using random seed 0 on an NVIDIA A6000. We later observed that generating with other random seeds may sometimes reduce the motion smoothness of the video. We have observed this reduction in smoothness only with TTM.

References

[1] Huang, Xun, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion." In Advances in Neural Information Processing Systems. 2025.
[2] Liang, Feng, et al. "Looking Backward: Streaming Video-to-Video Translation with Feature Banks." In The Thirteenth International Conference on Learning Representations. 2025.
[3] Kodaira, Akio, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, et al. "StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation." arXiv preprint. 2023.
[4] Zhang, Yabo, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. "ControlVideo: Training-free Controllable Text-to-Video Generation." In The Twelfth International Conference on Learning Representations. 2024.
[5] Bolya, Daniel, and Judy Hoffman. "Token merging for fast stable diffusion." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
[6] Wu, Haoyu, et al. "Importance-based token merging for efficient image and video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
[7] Fang, Haipeng, et al. "Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[8] Singer, Assaf, et al. "Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising." arXiv preprint arXiv:2511.08633 (2025).