💡 Introduction
We introduce DiverseVAR, a simple yet highly effective approach that restores the generative diversity of Visual Autoregressive (VAR) models without any additional training.
Despite their advantages in inference efficiency and image quality, VAR models frequently suffer from "diversity collapse": a marked reduction in output variability analogous to that observed in few-step distilled diffusion models. Through a thorough analysis of pre-trained VAR models, we find that:
- Structure formation predominantly occurs in the early scales.
- Diversity is primarily governed by a "pivotal component" within these early scales.
DiverseVAR leverages these findings by strategically intervening on the pivotal component during inference to unlock the inherent generative potential of VAR models.
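To make the second finding concrete, here is a minimal sketch, assuming PyTorch and illustrative tensor shapes, of inspecting the singular-value spectrum of an early-scale feature map; a sharply dominant top singular value is the proxy for the pivotal component. The shapes and variable names are assumptions for illustration, not the released code.

```python
import torch

# Hypothetical early-scale feature map from a VAR forward pass,
# shaped (tokens, channels); the 4x4 scale and width are assumptions.
feat = torch.randn(16, 1024)

# Singular-value spectrum of the token-by-channel matrix. A sharply
# dominant top value is the proxy for the "pivotal component".
_, s, _ = torch.linalg.svd(feat, full_matrices=False)
spectrum = s / s.sum()
print(f"top-1 singular value carries {spectrum[0].item():.1%} of the spectrum")
```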
🛠️ Method Overview
The DiverseVAR framework introduces two complementary, training-free regularization steps during inference, both focusing on the manipulation of the pivotal components:
- Soft-Suppression Regularization (SSR):
  - Applied to the model's input feature map at early scales.
  - Mitigates diversity collapse by suppressing the dominant singular values (our proxy for the pivotal component).
- Soft-Amplification Regularization (SAR):
  - Applied to the model's output feature map.
  - Aims to further promote controlled diversity and improve image-text alignment, especially for numerical attributes.
This training-free framework effectively boosts generative diversity while maintaining high-fidelity synthesis and faithful semantic alignment.
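As a rough illustration of the shared core operation, the sketch below rescales only the dominant singular value of a feature map: a factor below 1 mimics soft suppression (SSR) and a factor above 1 mimics soft amplification (SAR). The function name, the single-value rescaling rule, and the gamma values are illustrative assumptions; the paper's exact regularizers and schedules are not reproduced here.

```python
import torch

def soft_rescale_pivotal(feat: torch.Tensor, gamma: float) -> torch.Tensor:
    """Rescale the top singular value of a (tokens, channels) feature map.

    gamma < 1 gives SSR-style soft suppression (on input features at early
    scales); gamma > 1 gives SAR-style soft amplification (on output
    features). This single-value rule is an illustrative assumption, not
    the paper's exact formulation.
    """
    U, S, Vh = torch.linalg.svd(feat, full_matrices=False)
    S = S.clone()
    S[0] = S[0] * gamma  # intervene only on the dominant (pivotal) component
    return U @ torch.diag(S) @ Vh

# Usage sketch: suppress on the way in, amplify on the way out.
x = torch.randn(16, 1024)                    # hypothetical early-scale input features
x_in = soft_rescale_pivotal(x, gamma=0.8)    # SSR: soften the pivotal component
y = torch.randn(16, 1024)                    # hypothetical output features
y_out = soft_rescale_pivotal(y, gamma=1.2)   # SAR: gently amplify it
```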
🖼️ Qualitative Results
The figure below illustrates the enhanced generative diversity achieved by DiverseVAR (2nd and 4th rows) compared to the vanilla VAR models (1st and 3rd rows). Our method produces a wider variety of realistic images while preserving high image quality and strong text-image alignment.
Prompts used in Figure 2: “A man in a clown mask eating a donut”, “A cat wearing a Halloween costume”, “Golden Gate Bridge at sunset, glowing sky, ...”, “A palace under the sunset”, “A cool astronaut floating in space”, and “A cat riding a skateboard down a hill”.
Prompts used in Figure 3: “A very cute cat near a bunch of birds”, “A cat standing on a hill”, “A photo of a cute rabbit holding a cup of coffee in a cafe”, “A cinematic shot of a little pig priest wearing sunglasses”, “A dog covered in vines”, and “Cute grey cat, digital oil painting by Monet”.
Prompts used in Figure 4: “Editorial photoshoot of an old woman, high fashion 2000s fashion”, “An astronaut riding a horse on the moon, oil painting by Van Gogh”, “Full body shot, a French woman, photography, French streets”, “A boy and a girl fall in love”, “An abstract portrait of a pensive face, rendered in cool shades of blues, purples, and grays”, and “Cute boy, hair looking up to the stars, snow, beautiful lighting, painting style by Abe Toshiyuki”.
Prompts used in Figure 5: “A table with a light on over it”, “A library filled with warm yellow light”, “A villa standing on a hill”, “A train crossing a bridge over a canyon”, “A bridge stretching over a calm river”, and “A temple surrounded by flowers”.
📊 Quantitative Results (COCO Benchmarks)
The table below demonstrates that our DiverseVAR significantly improves diversity metrics (Recall ↑, Cov. ↑, FID ↓) while maintaining comparable CLIPScore (CLIP ↑) on the COCO2014-30K and COCO2017-5K benchmarks.
| Dataset | Method | Recall ↑ | Cov. ↑ | FID ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| COCO2014-30K | Infinity-2B | 0.316 | 0.651 | 28.48 | 0.313 |
| | +Ours (DiverseVAR) | 0.385 | 0.690 | 22.96 | 0.313 |
| | Infinity-8B | 0.451 | 0.740 | 18.79 | 0.319 |
| | +Ours (DiverseVAR) | 0.497 | 0.748 | 14.26 | 0.315 |
| COCO2017-5K | Infinity-2B | 0.408 | 0.832 | 39.01 | 0.313 |
| | +Ours (DiverseVAR) | 0.480 | 0.860 | 33.39 | 0.313 |
| | Infinity-8B | 0.563 | 0.892 | 29.47 | 0.319 |
| | +Ours (DiverseVAR) | 0.585 | 0.892 | 25.01 | 0.316 |
↑: Higher is better. ↓: Lower is better.
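For context, Recall and Coverage as reported above are standard manifold metrics (Naeem et al., 2020) computed over image features. Below is a minimal sketch using the third-party prdc package with placeholder features; this reflects the general recipe, as an assumption, and not necessarily the paper's exact evaluation pipeline.

```python
# Minimal sketch of the table's Recall/Coverage metrics, assuming
# precomputed Inception features; `pip install prdc` (a third-party
# choice, not necessarily the paper's evaluation code).
import numpy as np
from prdc import compute_prdc

real_feats = np.random.randn(5000, 2048)  # placeholder real-image features
fake_feats = np.random.randn(5000, 2048)  # placeholder generated features

metrics = compute_prdc(real_features=real_feats,
                       fake_features=fake_feats,
                       nearest_k=5)
print(f"Recall: {metrics['recall']:.3f}  Coverage: {metrics['coverage']:.3f}")
```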
📄 Citation
Please cite our paper if you find this work useful for your research:
```bibtex
@article{wang2025diversity,
  title   = {Diversity Has Always Been There in Your Visual Autoregressive Models},
  author  = {Wang, Tong and Yang, Guanyu and Liu, Nian and Wang, Kai and Wang, Yaxing and Shaker, Abdelrahman M and Khan, Salman and Khan, Fahad Shahbaz and Li, Senmao},
  journal = {arXiv preprint arXiv:2511.17074},
  year    = {2025}
}
```