Training-Free Latent Inter-frame Pruning with Attention Recovery

1The University of Texas at Austin, 2Meta GenAI
Teaser comparison — Self-forcing: 20.7 GB GPU memory; LIPAR (Ours): 16.6 GB GPU memory.

Throughput/memory evaluated on an RTX 4090.

Abstract

Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects duplicated latent patches and skips recomputing them. We further introduce a novel Attention Recovery that approximates the attention values of pruned tokens, removing the visual artifacts that arise from naively applying the pruning method. Empirically, our method increases video editing throughput by 1.53× on average, achieving 19.3 FPS on an NVIDIA RTX 4090 versus the baseline's 12.6 FPS. The proposed method does not compromise generation quality and integrates seamlessly with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.

Motivation

Videos have high temporal and spatial redundancy. Traditional video processing methods save compute and memory by not re-processing similar patches. Can we use the same techniques in the modern video generation pipeline (latent diffusion model)?

In the following:

1. We first verify that redundancy exists in the latent space.
2. We design the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework to fit within the latent diffusion model.
3. We propose Attention Recovery to mitigate the train-inference discrepancy incurred by pruning.

Observations

Experiment 1: Correlation between pixels and latents change

Goal: Calculate the Pearson correlation between the changes of pixel and latent patches at the same spatial location.
$$ \text{Corr}\left(\|p_{\text{pixel}}^{(t,x,y)} - p_{\text{pixel}}^{(t+1,x,y)}\|, \ \|p_{\text{latent}}^{(t,x,y)} - p_{\text{latent}}^{(t+1,x,y)}\|\right) $$ Results: WAN2.1 VAE: 0.69, WAN2.2 VAE: 0.77 (evaluated on the DAVIS 2017 train/val sets)
Meaning: A strong linear relationship implies that if a patch does not change (is redundant) in pixel space, it also does not change in latent space.
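This correlation check can be reproduced in miniature. The sketch below uses a synthetic stand-in for the VAE (latents built as a scaled, lightly noised copy of the pixels), so the exact value is illustrative rather than the paper's 0.69/0.77:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 6, 3, 8, 8
pixels = rng.normal(size=(T, C, H, W))
# Stand-in "latents": a scaled copy plus small noise, so the two change maps
# should correlate (real numbers come from a VAE encoder, not this toy).
latents = 0.5 * pixels + 0.05 * rng.normal(size=(T, C, H, W))

def patch_change(x):
    """Per-location L2 change between consecutive frames: (T,C,H,W) -> (T-1,H,W)."""
    return np.linalg.norm(x[1:] - x[:-1], axis=1)

dp = patch_change(pixels).ravel()
dl = patch_change(latents).ravel()
corr = np.corrcoef(dp, dl)[0, 1]
print(f"Pearson correlation of pixel vs. latent changes: {corr:.2f}")
```

With a real VAE one would replace the synthetic `latents` with the encoder's output on consecutive video frames.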

Experiment 2: Decoder's Sensitivity to Compression Error

Question: Is the decoder sensitive to compression error?

Experiment:

  1. Compress the latent by substituting the latent patch with the previous patch if similar:
    $$\hat{p}_{t+1}^{x,y} = \begin{cases} p_t^{x,y} & \text{if } \|p_{t+1}^{x,y} - p_t^{x,y}\| < \theta \\ p_{t+1}^{x,y} & \text{otherwise} \end{cases}$$
  2. Test whether the decoded results are similar for the compressed vs. the original latents, i.e., \(\text{Sim}\left(\text{Dec}(\hat{p}), \text{Dec}(p)\right) > \tau\)

Results: 46% of patches compressed, with LPIPS < 0.05 between the decoded compressed and original latents

Meaning: The VAE decoder is insensitive to the compression error.
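The substitution rule above can be sketched as follows; `theta` and the latent shapes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def compress_latents(latents, theta):
    """latents: (T, H, W, C). Replace a patch with the previous (already
    compressed) frame's patch when its change is below theta; return the
    compressed latents and the fraction of patches replaced."""
    out = latents.copy()
    replaced, total = 0, 0
    for t in range(1, latents.shape[0]):
        diff = np.linalg.norm(latents[t] - out[t - 1], axis=-1)  # (H, W)
        mask = diff < theta
        out[t][mask] = out[t - 1][mask]   # reuse the previous patch
        replaced += int(mask.sum())
        total += mask.size
    return out, replaced / total

rng = np.random.default_rng(0)
# Slowly drifting latents: consecutive frames differ by small steps.
lat = np.cumsum(0.1 * rng.normal(size=(8, 4, 4, 16)), axis=0)
comp, frac = compress_latents(lat, theta=0.5)
print(f"fraction of patches reused: {frac:.2f}")
```

The compressed latents `comp` would then be decoded and compared against `Dec(p)` with a perceptual metric such as LPIPS.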

Visual comparison of decoded results at 0% vs. 34% compression.

These experiments show that (1) latents contain redundancy and (2) the decoder is insensitive to the compression error, which motivates us to adapt pixel-space video compression algorithms to the latent space.

Method

Overview

We propose LIPAR, a training-free method built on top of the Diffusion Transformer (Self-forcing), that accelerates video generation and reduces memory usage by exploiting the temporal redundancy in video latent patches across three stages:

Method Overview
(1) Latent Inter-frame Pruning: Removes redundant latent patches before they are sent to the DiT, using the criterion \(\|p_{t+1}^{x,y} - p_t^{x,y}\| < \theta\) (the practical criterion is more involved than this simple threshold).
(2) Attention Recovery: Since the model is trained on unpruned latents, we propose Attention Recovery to mitigate the train-inference discrepancy caused by pruning, enabling training-free integration.
(3) Restoration: We restore the latent dimensions by copying the pruned positions from the corresponding patches in the denoised latents.
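The three stages can be sketched at a high level; here `dit` is a stand-in for the Diffusion Transformer forward pass, and in the real method Attention Recovery operates inside the attention layers rather than on flat token arrays:

```python
import numpy as np

def lipar_step(cur_latent, prev_latent, prev_denoised, dit, theta):
    """cur/prev latent, prev_denoised: (N, C) token sequences for one frame."""
    # (1) Latent Inter-frame Pruning: drop tokens whose change is below theta.
    change = np.linalg.norm(cur_latent - prev_latent, axis=-1)
    keep = change >= theta
    # (2) Denoise only the kept tokens (Attention Recovery approximates the
    # pruned tokens' contributions inside the attention blocks).
    denoised_kept = dit(cur_latent[keep])
    # (3) Restoration: scatter kept outputs back; copy pruned positions from
    # the previous frame's denoised latents.
    out = np.empty_like(cur_latent)
    out[keep] = denoised_kept
    out[~keep] = prev_denoised[~keep]
    return out

rng = np.random.default_rng(0)
prev = rng.normal(size=(16, 4))
cur = prev.copy()
cur[:4] += 1.0                      # only the first 4 tokens actually change
prev_den = rng.normal(size=(16, 4))
# Dummy denoiser (halves its input) just to exercise the control flow.
out = lipar_step(cur, prev, prev_den, dit=lambda x: 0.5 * x, theta=0.1)
```

Only the 4 changed tokens pass through `dit`; the remaining 12 positions are filled from the previous frame's denoised latents.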

Attention Recovery

Naively pruning latents results in visual artifacts because the DiT sees only full-dimensional latents during training. Our goal is to approximate the original output values using the pruned input. In the following example, \(x_2\), \(x_3\), and \(x_5\) are pruned along the temporal axis.

Train-Inference Difference

Attention Recovery Overview

Instead of approximating all attention blocks in the DiT simultaneously, we first approximate a single attention block. Since the FFN and Cross-attention operate token-wise and are unaffected by pruning, we only need to approximate the output from the Self-attention layer.

We separate the approximation into two components: (1) signal values (M-degree approximation), and (2) noise distributions (Noise-aware duplication).

1. M-degree approximation: In the following equation, we list the original self-attention calculation on the left-hand side. For LIF pruning, pruning \(x_2\) and \(x_3\) implies that their signal components are similar (\(x_1 \approx x_2 \approx x_3\)). Therefore, on the right-hand side, we can approximate \(x_2\) and \(x_3\) using \(x_1\).

$$\frac{e^{q^T k_1} v_1 + e^{q^T e^{\theta j} k_2} v_2 + e^{q^T e^{2\theta j} k_3} v_3 + A}{e^{q^T k_1} + e^{q^T e^{\theta j} k_2} + e^{q^T e^{2\theta j} k_3} + B} \approx \frac{(e^{q^T k_1} + e^{q^T e^{\theta j} k_1} + e^{q^T e^{2\theta j} k_1}) v_1 + A}{e^{q^T k_1} + e^{q^T e^{\theta j} k_1} + e^{q^T e^{2\theta j} k_1} + B}$$

Here, \(e^{\theta j}\) represents the RoPE positional encoding. The sum of the \(M\) largest terms suffices to approximate the sums of exponentials in the parentheses, which gives the M-degree approximation its name.
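A toy numerical check of this approximation, assuming a standard 1D RoPE and a single extra unpruned token standing in for the \(A\)/\(B\) terms (the M-largest-term truncation is omitted here; all rotated terms are summed):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """1D RoPE: rotate consecutive (even, odd) channel pairs by pos-dependent
    angles; this plays the role of e^{theta j} in the equation above."""
    half = x.shape[0] // 2
    ang = pos * base ** (-np.arange(half) / half)
    pairs = x.reshape(half, 2)
    cos, sin = np.cos(ang), np.sin(ang)
    return np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                     pairs[:, 0] * sin + pairs[:, 1] * cos], axis=1).reshape(-1)

def attn(q, ks, vs):
    """Single-query softmax attention over lists of (rotated) keys/values."""
    logits = np.array([q @ k for k in ks])
    w = np.exp(logits - logits.max())
    return (w[:, None] * np.stack(vs)).sum(0) / w.sum()

rng = np.random.default_rng(0)
d, eps = 16, 0.001
q, k1, v1 = (rng.normal(size=d) for _ in range(3))
# x2, x3 were pruned because their signal is close to x1's.
k2, v2 = k1 + eps * rng.normal(size=d), v1 + eps * rng.normal(size=d)
k3, v3 = k1 + eps * rng.normal(size=d), v1 + eps * rng.normal(size=d)
k4, v4 = rng.normal(size=d), rng.normal(size=d)  # an unpruned token (the A/B terms)

# Left-hand side: exact attention over tokens at positions 0, 1, 2, 3.
exact = attn(q, [rope(k1, 0), rope(k2, 1), rope(k3, 2), rope(k4, 3)],
             [v1, v2, v3, v4])
# Right-hand side: reuse k1/v1 (with their RoPE rotations) at pruned positions.
recov = attn(q, [rope(k1, 0), rope(k1, 1), rope(k1, 2), rope(k4, 3)],
             [v1, v1, v1, v4])
print(np.abs(exact - recov).max())
```

The closer the pruned tokens' signals are to \(x_1\)'s (small `eps`), the smaller the approximation error.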

2. Noise-aware duplication: The noise components of \(x_1\), \(x_2\), and \(x_3\) are i.i.d., hence we cannot assume \(x_1 \approx x_2 \approx x_3\) for them. Instead, we must obtain a noise-free \(x_1\) (from the KV cache or by noise filtering) and decide whether to add fresh i.i.d. noise to match the distribution.

The acceleration mostly comes from generating fewer tokens, as \(x_2\) and \(x_3\) do not need to be regenerated throughout the entire DiT.
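The noise-aware duplication above can be sketched as follows, where `clean` stands in for the noise-free estimate of \(x_1\) and `sigma` is an illustrative noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 256, 0.3
clean = rng.normal(size=d)                 # shared signal of x1 ~ x2 ~ x3
x1 = clean + sigma * rng.normal(size=d)    # the kept, noisy token

naive_x2 = x1.copy()                         # naive copy: noise is duplicated
dup_x2 = clean + sigma * rng.normal(size=d)  # noise-aware: fresh i.i.d. noise

# Correlation between each duplicated token's noise and x1's noise: a naive
# copy is perfectly correlated, violating the i.i.d. assumption the DiT was
# trained under; fresh noise restores near-zero correlation.
corr_naive = np.corrcoef(naive_x2 - clean, x1 - clean)[0, 1]
corr_aware = np.corrcoef(dup_x2 - clean, x1 - clean)[0, 1]
print(f"naive copy: {corr_naive:.2f}, noise-aware: {corr_aware:.2f}")
```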

LIPAR Highlights

Drag your cursor left/right on each video to compare. The edited keyword is marked in red. You may also want to check comparisons with other models, other pruning methods, the ablation study, and the extension to TTM.

Input videos, left: original video, right: pruned latents (highlighted in gray).

Edited results, left: Self-forcing, right: LIPAR.


Quantitative Evaluations

We evaluate our method on 51 video-prompt pairs. For the full evaluation results, please see this link. The human evaluation is performed by 14 participants; throughput and memory usage are evaluated on the entire dataset on an A6000 GPU. Compared with the baseline (Self-forcing), LIPAR achieves a 1.45× speedup and a 20% memory reduction while maintaining an 86.7% win-tie rate.

Quality Evaluation

Throughput Evaluation

Memory Evaluation

SD: StreamDiffusion, SF: Self-Forcing, SV2V: StreamV2V, Control: ControlVideo.

Limitations

For videos with camera motion, LIPAR achieves less acceleration because each patch is compared directly with the patch at the same spatial location in the previous frame, so global motion leaves fewer patches unchanged.

BibTeX

@article{menn2026trainingfreelatentinterframepruning,
  title={Training-free Latent Inter-Frame Pruning with Attention Recovery},
  author={Dennis Menn and Yuedong Yang and Bokun Wang and Xiwen Wei and Mustafa Munir and Feng Liang and Radu Marculescu and Chenfeng Xu and Diana Marculescu},
  journal={arXiv preprint arXiv:2603.05811},
  year={2026}
}