A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Department of Electrical and Computer Engineering, University of Michigan
School of Mathematics, Georgia Institute of Technology
NeurIPS 2025 (Spotlight)
ICML 2025 Workshop (Oral)

*Indicates Equal Contribution
High-level depiction of the self-consuming pipeline. Top: Collapse iteration represents the replace paradigm where models are trained solely on synthetic images generated by the previous diffusion model. Middle: In the mitigated iteration, original real data and previously generated data are added to train the next-generation model. Our proposed selection methods construct a training subset and can further mitigate collapse. Bottom Right: Evolution of the generated images.

Abstract

The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse, a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

Model Collapses from Generalization to Memorization


The generalization-to-memorization transition. Left: visualization of the generated images (𝒢n) and their nearest neighbors in the training dataset (𝒴n). As the iteration proceeds, the model can only copy images from the training dataset. Right: quantitative results of the generalization score of models over successive iterations. We use different colors to represent different dataset sizes. We use “iteration” to denote a full cycle of training and generation, rather than a gradient update.
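A common way to quantify this kind of memorization is to check how close each generated sample lies to its nearest neighbor in the training set. The sketch below is an illustrative proxy under that assumption (Euclidean distance on flattened images, a hand-picked threshold), not the paper's exact metric:

```python
import numpy as np

def generalization_score(generated, train, threshold=0.1):
    """Fraction of generated samples that are NOT near-copies of the training set.

    A sample counts as memorized when its nearest-neighbor distance to the
    training data falls below `threshold`. Both the distance metric and the
    threshold are illustrative assumptions, not the paper's exact definition.
    """
    gen = np.asarray(generated, dtype=float).reshape(len(generated), -1)
    tr = np.asarray(train, dtype=float).reshape(len(train), -1)
    # Pairwise Euclidean distances between every generated/training pair.
    d = np.linalg.norm(gen[:, None, :] - tr[None, :, :], axis=-1)
    nn_dist = d.min(axis=1)  # distance from each generated sample to its nearest training image
    return float((nn_dist > threshold).mean())
```

Under this proxy, a score near 1 means most generations are novel, while a score collapsing toward 0 means the model is mostly copying its training data, matching the qualitative trend in the figure.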

The Relationship between Generalization Score and Entropy


Scatter plots of the generalization score and properties of the training dataset, i.e., entropy and variance. Each point denotes one iteration of training in the self-consuming loop. We use different colors to represent the results of different dataset sizes.
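Entropy of an image dataset has no closed form, so it is typically estimated from nearest-neighbor distances. The snippet below sketches one such proxy, the core term of the Kozachenko-Leonenko k-NN estimator with its additive constants dropped; it is an illustrative stand-in, not necessarily the estimator used in the paper:

```python
import numpy as np

def knn_entropy(x, k=3):
    """k-NN entropy proxy: dimension-weighted mean log distance to the k-th neighbor.

    This keeps only the data-dependent term of the Kozachenko-Leonenko
    estimator (additive constants dropped), so it is useful for comparing
    datasets rather than as an absolute entropy value.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-distances
    eps_k = np.sort(d, axis=1)[:, k - 1]   # distance to the k-th nearest neighbor
    return float(x.shape[1] * np.mean(np.log(eps_k)))
```

A more spread-out dataset has larger neighbor distances and thus a higher score, which is the monotone relationship the scatter plots rely on.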

Mitigating Collapse via Data Selection Methods


Since we have shown that decaying entropy drives the collapse in generalization performance, we propose two data selection methods that mitigate model collapse by approximately maximizing the entropy of the training dataset.
Pseudo-code for the two methods:
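As a minimal illustration of entropy-maximizing selection, the sketch below uses farthest-point sampling: greedily add the candidate whose minimum distance to the already-selected subset is largest, which spreads the subset out and raises its k-NN entropy. This is a simple stand-in for the two methods above, whose exact objectives are given in the pseudo-code figures:

```python
import numpy as np

def select_max_entropy_subset(pool, m):
    """Greedily pick m indices from `pool` to approximately maximize spread.

    Farthest-point sampling: start from the first point, then repeatedly add
    the candidate farthest from everything selected so far. An illustrative
    heuristic, not the paper's exact selection algorithm.
    """
    pool = np.asarray(pool, dtype=float).reshape(len(pool), -1)
    selected = [0]
    # Distance from every candidate to the current selected set.
    min_dist = np.linalg.norm(pool - pool[0], axis=1)
    while len(selected) < m:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pool - pool[nxt], axis=1))
    return selected
```

At each training cycle, such a rule would build the next training subset from the union of real and previously generated images, preferring samples that keep the subset diverse.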

Experiments


Generalization Score of the trained model over iterations. We indicate the settings on top of the subfigures. In each subfigure, three different lines are used to represent the vanilla paradigm and its variants augmented with the proposed selection methods.

FID of the generated images over iterations. We indicate the settings on top of the subfigures. In each subfigure, three different lines are used to represent the vanilla paradigm and its variants augmented with the proposed selection methods.

Estimated entropy of the training dataset over iterations. We indicate the settings on top of the subfigures. In each subfigure, three different lines are used to represent the vanilla paradigm and its variants augmented with the proposed selection methods.

Proportion of the selected images from previous iterations or the real dataset. We use different colors to represent different sources. In particular, the blue bars denote the proportion of the real images. The red line represents the 1/n curve, i.e., the proportion of real images if we selected the data subset evenly from all available images (accumulate-subsample). We indicate the settings on top of the subfigures.

BibTeX

@article{shi2025modelcollapse,
  title={A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective},
  author={Shi, Lianghe and Wu, Meng and Zhang, Huijie and Zhang, Zekai and Tao, Molei and Qu, Qing},
  journal={arXiv preprint arXiv:2509.16499},
  year={2025}
}