💡 Introduction
We introduce DiverseVAR, a simple yet highly effective approach that restores the generative diversity of Visual Autoregressive (VAR) models without any additional training.
Despite their advantages in inference efficiency and image quality, VAR models frequently suffer from "diversity collapse": a marked reduction in output variability analogous to that observed in few-step distilled diffusion models. Through a thorough analysis of pre-trained VAR models, we find that:
- Structure formation predominantly occurs in the early scales.
- Diversity is primarily governed by a "pivotal component" within these early scales.
DiverseVAR leverages these findings by strategically intervening on the pivotal component during inference to unlock the inherent generative potential of VAR models.
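To make the second finding concrete, here is a minimal sketch, assuming PyTorch and illustrative tensor shapes, of inspecting the singular-value spectrum of an early-scale feature map; a sharply dominant top singular value is the proxy for the pivotal component. The shapes and variable names are assumptions for illustration, not the released code.

```python
import torch

# Hypothetical early-scale feature map from a VAR forward pass,
# shaped (tokens, channels); the 4x4 scale and width are assumptions.
feat = torch.randn(16, 1024)

# Singular-value spectrum of the token-by-channel matrix. A sharply
# dominant top value is the proxy for the "pivotal component".
_, s, _ = torch.linalg.svd(feat, full_matrices=False)
spectrum = s / s.sum()
print(f"top-1 singular value carries {spectrum[0].item():.1%} of the spectrum")
```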
🛠️ Method Overview
The DiverseVAR framework introduces two complementary, training-free regularization steps during inference, both focusing on the manipulation of the pivotal components:
- Soft-Suppression Regularization (SSR):
  - Applied to the model's input feature map at early scales.
  - Mitigates diversity collapse by suppressing the dominant singular values (our proxy for the pivotal component).
- Soft-Amplification Regularization (SAR):
  - Applied to the model's output feature map.
  - Aims to further promote controlled diversity and improve image-text alignment, especially for numerical attributes.
This training-free framework effectively boosts generative diversity while maintaining high-fidelity synthesis and faithful semantic alignment.
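As a rough illustration of the shared core operation, the sketch below rescales only the dominant singular value of a feature map: a factor below 1 mimics soft suppression (SSR) and a factor above 1 mimics soft amplification (SAR). The function name, the single-value rescaling rule, and the gamma values are illustrative assumptions; the paper's exact regularizers and schedules are not reproduced here.

```python
import torch

def soft_rescale_pivotal(feat: torch.Tensor, gamma: float) -> torch.Tensor:
    """Rescale the top singular value of a (tokens, channels) feature map.

    gamma < 1 gives SSR-style soft suppression (on input features at early
    scales); gamma > 1 gives SAR-style soft amplification (on output
    features). This single-value rule is an illustrative assumption, not
    the paper's exact formulation.
    """
    U, S, Vh = torch.linalg.svd(feat, full_matrices=False)
    S = S.clone()
    S[0] = S[0] * gamma  # intervene only on the dominant (pivotal) component
    return U @ torch.diag(S) @ Vh

# Usage sketch: suppress on the way in, amplify on the way out.
x = torch.randn(16, 1024)                    # hypothetical early-scale input features
x_in = soft_rescale_pivotal(x, gamma=0.8)    # SSR: soften the pivotal component
y = torch.randn(16, 1024)                    # hypothetical output features
y_out = soft_rescale_pivotal(y, gamma=1.2)   # SAR: gently amplify it
```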
🖼️ Qualitative Results
The figure below illustrates the enhanced generative diversity achieved by DiverseVAR (2nd and 4th rows) compared to the vanilla VAR models (1st and 3rd rows). Our method produces a wider variety of realistic images while preserving high image quality and strong text-image alignment.
Prompts used in Figure 2: “A man in a clown mask eating a donut”, “A cat wearing a Halloween costume”, “Golden Gate Bridge at sunset, glowing sky, ...”, “A palace under the sunset”, “A cool astronaut floating in space”, and “A cat riding a skateboard down a hill”.
Prompts used in Figure 3: “A very cute cat near a bunch of birds”, “A cat standing on a hill”, “A photo of a cute rabbit holding a cup of coffee in a cafe”, “A cinematic shot of a little pig priest wearing sunglasses”, “A dog covered in vines”, and “Cute grey cat, digital oil painting by Monet”.
Prompts used in Figure 4: “Editorial photoshoot of an old woman, high fashion 2000s fashion”, “An astronaut riding a horse on the moon, oil painting by Van Gogh”, “Full body shot, a French woman, photography, French streets”, “A boy and a girl fall in love”, “An abstract portrait of a pensive face, rendered in cool shades of blues, purples, and grays”, and “Cute boy, hair looking up to the stars, snow, beautiful lighting, painting style by Abe Toshiyuki”.
Prompts used in Figure 5: “A table with a light on over it”, “A library filled with warm yellow light”, “A villa standing on a hill”, “A train crossing a bridge over a canyon”, “A bridge stretching over a calm river”, and “A temple surrounded by flowers”.
📊 Quantitative Results (COCO Benchmarks)
The table below demonstrates that our DiverseVAR significantly improves diversity metrics (Recall ↑, Cov. ↑, FID ↓) while maintaining comparable CLIPScore (CLIP ↑) on the COCO2014-30K and COCO2017-5K benchmarks.
| Dataset | Method | Recall ↑ | Cov. ↑ | FID ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| COCO2014-30K | Infinity-2B | 0.316 | 0.651 | 28.48 | 0.313 |
| | +Ours (DiverseVAR) | 0.385 | 0.690 | 22.96 | 0.313 |
| | Infinity-8B | 0.451 | 0.740 | 18.79 | 0.319 |
| | +Ours (DiverseVAR) | 0.497 | 0.748 | 14.26 | 0.315 |
| COCO2017-5K | Infinity-2B | 0.408 | 0.832 | 39.01 | 0.313 |
| | +Ours (DiverseVAR) | 0.480 | 0.860 | 33.39 | 0.313 |
| | Infinity-8B | 0.563 | 0.892 | 29.47 | 0.319 |
| | +Ours (DiverseVAR) | 0.585 | 0.892 | 25.01 | 0.316 |
↑: Higher is better. ↓: Lower is better.
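For context, Recall and Coverage as reported above are standard manifold metrics (Naeem et al., 2020) computed over image features. Below is a minimal sketch using the third-party prdc package with placeholder features; this reflects the general recipe, as an assumption, and not necessarily the paper's exact evaluation pipeline.

```python
# Minimal sketch of the table's Recall/Coverage metrics, assuming
# precomputed Inception features; `pip install prdc` (a third-party
# choice, not necessarily the paper's evaluation code).
import numpy as np
from prdc import compute_prdc

real_feats = np.random.randn(5000, 2048)  # placeholder real-image features
fake_feats = np.random.randn(5000, 2048)  # placeholder generated features

metrics = compute_prdc(real_features=real_feats,
                       fake_features=fake_feats,
                       nearest_k=5)
print(f"Recall: {metrics['recall']:.3f}  Coverage: {metrics['coverage']:.3f}")
```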
📄 Citation
Please cite our paper if you find this work useful for your research:
```bibtex
@article{wang2025diversity,
  title   = {Diversity Has Always Been There in Your Visual Autoregressive Models},
  author  = {Wang, Tong and Yang, Guanyu and Liu, Nian and Wang, Kai and Wang, Yaxing and Shaker, Abdelrahman M and Khan, Salman and Khan, Fahad Shahbaz and Li, Senmao},
  journal = {arXiv preprint arXiv:2511.17074},
  year    = {2025}
}
```