Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

CVPR 2026
1Stanford University   2ByteDance

Spectrum applied to HunyuanVideo: samples generated with only 14 network evaluations, a 3.5x speedup without quality degradation.

Spectrum applied to FLUX.1: samples generated with only 14 network evaluations, a 3.5x speedup without quality degradation.

Abstract

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet their inference speed remains a critical bottleneck due to the many iterative passes of Diffusion Transformers. To reduce this heavy compute, recent works adopt a feature caching-and-reuse scheme that skips network evaluations at selected diffusion steps by reusing features cached at earlier steps. However, these designs rely solely on local approximation, so errors grow rapidly with large skips, degrading sample quality at high speedups.

In this work, we propose the spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. We view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficients of each basis function via ridge regression and then leverage them to forecast features at multiple future diffusion steps. We theoretically show that our approach admits more favorable long-horizon behavior, with an error bound that does not compound with the step size.

Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79x speedup on FLUX.1 and 4.67x speedup on Wan2.1-14B, while preserving markedly higher sample quality than prior caching baselines.

Method

Diffusion feature forecaster. Feature-caching approaches have shown great promise in substantially reducing sampling time while maintaining desirable sample quality, all without additional training. Concretely, these methods cache latent features produced by certain blocks at selected timesteps and use them to synthesize features at future timesteps via lightweight predictors, thereby avoiding expensive denoiser evaluations. In particular, the naïve copy strategy directly reuses the most recent cached feature, while more recent work performs a discrete Taylor expansion using a few of the nearest cached points. We summarize the algorithmic flow in the figure.
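
The two baseline reuse strategies above can be sketched as follows. This is an illustrative toy version, not the exact implementation: `cache` is a hypothetical map from timestep to a feature array produced by a denoiser block.

```python
import numpy as np

def naive_copy(cache, t_query):
    """Zeroth-order hold: reuse the most recently cached feature."""
    t_latest = max(t for t in cache if t <= t_query)
    return cache[t_latest]

def taylor_forecast(cache, t_query):
    """First-order Taylor extrapolation from the two nearest cached points."""
    ts = sorted(t for t in cache if t <= t_query)[-2:]
    if len(ts) < 2:                        # not enough history: fall back to copy
        return naive_copy(cache, t_query)
    t0, t1 = ts
    f0, f1 = cache[t0], cache[t1]
    slope = (f1 - f0) / (t1 - t0)          # finite-difference derivative estimate
    return f1 + slope * (t_query - t1)     # linear extrapolation to t_query
```

Both predictors are purely local: they only see the nearest cached points, which is exactly why their error compounds as the forecasting horizon grows.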

From local to global. Despite this promise, our error analysis (see Theorem 3.1) and empirical verification show that existing predictors (e.g., TaylorSeer) are all local predictors whose approximation errors compound rapidly as the forecasting horizon grows, leading to severe quality degradation at high speedup ratios. We introduce spectral feature forecasters, which address these pitfalls by switching to the spectral domain and modeling the sampling trajectory globally.

Chebyshev polynomials as the spectral bases. We propose to leverage Chebyshev polynomials as a set of bases to model the evolution of diffusion features. An important property of Chebyshev polynomials is that they form an orthogonal basis (under the Chebyshev weight) for the function space, so any sufficiently smooth function can be represented as a weighted sum of Chebyshev polynomials.
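
As a minimal illustration of this property, a smooth scalar function on [-1, 1] can be approximated to high accuracy by a truncated Chebyshev expansion; the function, degree, and sample points below are illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

K = 8                                                     # truncation degree
ts = np.cos(np.pi * (np.arange(K + 1) + 0.5) / (K + 1))   # Chebyshev nodes in [-1, 1]
f = np.exp(ts)                                            # example smooth target

coeffs = C.chebfit(ts, f, deg=K)     # least-squares fit of expansion coefficients
approx = C.chebval(0.3, coeffs)      # evaluate the weighted sum at a new point

# Truncation error is tiny for smooth functions (spectral accuracy).
print(abs(approx - np.exp(0.3)))
```

Because the fit is global over the whole interval, accuracy does not hinge on being close to a particular sample point, in contrast to local Taylor extrapolation.
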
Overall approach. Spectrum features two core operations, online coefficient fitting and feature forecasting, as illustrated below.

Method Overview

Specifically, we fit the coefficient matrix corresponding to the Chebyshev polynomials to the current diffusion features via ridge regression. The fitted coefficients are then used to forecast features at multiple future diffusion steps, avoiding the expensive denoiser forward pass. The approach is training-free and easy to implement, with strong empirical performance. Moreover, we theoretically show that our approach admits more favorable long-horizon behavior, with an error bound that does not compound with the step size, in stark contrast to local predictors.
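
The two core operations can be sketched as below. This is a hedged sketch under simplifying assumptions, not the paper's implementation: timesteps are assumed to be rescaled to [-1, 1], features are flattened to vectors, and the function names, degree, and ridge strength `lam` are illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def fit_coefficients(ts_obs, feats, deg=4, lam=1e-3):
    """Online coefficient fitting: ridge-regress Chebyshev coefficients
    onto cached features.

    ts_obs: (n,) observed timesteps rescaled to [-1, 1]
    feats:  (n, d) cached feature vectors
    Returns a (deg+1, d) coefficient matrix, one expansion per channel.
    """
    Phi = C.chebvander(ts_obs, deg)            # (n, deg+1) Chebyshev design matrix
    A = Phi.T @ Phi + lam * np.eye(deg + 1)    # ridge-regularized normal equations
    return np.linalg.solve(A, Phi.T @ feats)

def forecast(ts_future, coeffs):
    """Feature forecasting: evaluate the fitted expansion at future timesteps,
    skipping the denoiser forward pass."""
    Phi = C.chebvander(np.atleast_1d(ts_future), coeffs.shape[0] - 1)
    return Phi @ coeffs                        # (m, d) forecasted features
```

In this sketch, one ridge fit over the cached steps yields coefficients that can forecast several future steps at once, which is what enables the global, long-range reuse described above.
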

Main Results: Text-to-Image


Spectrum consistently achieves higher speedup while maintaining much better sample quality compared with all baselines, demonstrating the efficacy of our spectral feature forecaster.

Main Results: Text-to-Video


Qualitative comparisons on HunyuanVideo with the same initial latent.
Left: Original 50-step oracle;     Middle: Spectrum (14 steps);     Right: TaylorSeer (16 steps).

Qualitative comparisons on Wan2.1-14B with the same initial latent.
Left: Original 50-step oracle;     Middle: Spectrum (14 steps);     Right: TaylorSeer (16 steps).

Our Spectrum achieves the best performance with even fewer steps, demonstrating superior sample quality and better temporal consistency. In contrast, TaylorSeer exhibits unnatural color shifts and temporal jittering, which are likely caused by the compounding errors from local approximation.

BibTeX

@article{han2026adaptive,
  title={Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration},
  author={Han, Jiaqi and Shi, Juntong and Li, Puheng and Ye, Haotian and Guo, Qiushan and Ermon, Stefano},
  journal={arXiv preprint arXiv:2603.01623},
  year={2026}
}