Nikolas Adaloglou, Tim Kaiser, Felix Michels, and Markus Kollmann
Heinrich Heine University of Dusseldorf, Germany
{adaloglo,tikai103,felix.michels,markus.kollmann}@hhu.de

Abstract: We present a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We elucidate how individual components regarding image clustering impact image synthesis across three datasets. By combining recent advancements from image clustering and diffusion models, we show that, given the optimal cluster granularity with respect to image synthesis (visual groups), cluster-conditioning can achieve state-of-the-art FID (i.e. 1.67, 2.17 on CIFAR10 and CIFAR100 respectively), while attaining a strong training sample efficiency. Finally, we propose a novel method to derive an upper cluster bound that reduces the search space of the visual groups using solely feature-based clustering. Unlike existing approaches, we find no significant connection between clustering and cluster-conditional image generation. The code and cluster assignments will be released.
Measuring the confidence of the generated samples using TEMI. In Fig. 7, we show the generated examples with the lowest and highest maximum softmax probability (MSP) of the TEMI classification head as a measure of confidence. For comparison, we show unconditional samples that were generated using the same initial noise in the denoising process. Visual inspection shows that low-confidence C-EDM samples do not have coherent semantics compared to the unconditional ones, leading to inferior image quality. We hypothesize that the sampled condition for the low-confidence C-EDM samples is in conflict with the existing patterns in the initial noise. This condition-noise conflict requires further investigation. By contrast, highly confident C-EDM samples show more clearly defined semantics than unconditional ones. When the low frequencies, such as the object's shape, remain intact, cluster conditioning aids in refining local pixel patterns. Finally, we observe that highly confident C-EDM samples consist of simple pixel patterns that are easy to generate, such as a white background. More samples are provided in the supplementary material.
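As an illustration of the confidence measure, the following minimal sketch computes the MSP from a vector of classification-head logits and ranks a batch of samples by it. The function names and the plain-list representation of logits are assumptions made for this example; it is not the released code.

```python
import math


def max_softmax_probability(logits):
    """Return the maximum softmax probability (MSP) of one logit vector."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def rank_by_confidence(batch_logits, k):
    """Return the indices of the k least and the k most confident samples.

    `batch_logits` is a list of logit vectors, one per generated sample
    (hypothetical stand-in for the outputs of a cluster classification head).
    """
    scores = [max_softmax_probability(logits) for logits in batch_logits]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    lowest = order[:k]
    highest = order[-k:][::-1]  # most confident first
    return lowest, highest
```

Calling `rank_by_confidence(batch_logits, k)` on the head's outputs for a batch of generated images yields the two index lists from which low- and high-confidence examples could be drawn for visual inspection.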
To conclude, confidence can be leveraged in future work in various ways, such as rejection/acceptance sampling or internal guidance methods [36, 37].
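One way such an acceptance step could look is sketched below: generated samples are kept only if the MSP of a classifier's logits reaches a threshold. The callable `classify`, the function names, and the threshold value of 0.9 are hypothetical choices for illustration; the text does not specify or evaluate such a procedure.

```python
import math


def msp(logits):
    """Maximum softmax probability; the largest exponent is exp(0) = 1."""
    m = max(logits)
    return 1.0 / sum(math.exp(z - m) for z in logits)


def accept_confident(samples, classify, threshold=0.9):
    """Keep the samples whose MSP under `classify` is at least `threshold`.

    `classify` is an assumed callable mapping a sample to a list of logits,
    for example a wrapper around a cluster classification head.
    The default threshold is arbitrary and would need tuning.
    """
    return [x for x in samples if msp(classify(x)) >= threshold]
```

Rejected samples could be regenerated from fresh noise or with a newly sampled condition; which option works better is an open question here.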
What about a lower cluster bound? Starting with a high overestimation of the number of clusters (e.g. 1K for CIFAR10), we find that TEMI clustering with = 1 utilizes only a subset of clusters, which could be used as a lower cluster bound. More precisely, we find a maximum standard deviation of 6.4 for C u ( = 1) for multiple cluster values in the range of [200, 10K] across datasets and backbones (see Supp.). Intuitively, C u is the minimum number of clusters TEMI (with = 1) uses to group all image pairs while remaining discriminative. This behavior is analogous to cluster-based self-supervised learning (using image augmentations) [18,95] and was recently termed partial prototype collapse. Nonetheless, we consider the lower bound more applicable to large scales, as the measured standard deviation might exclude the optimal granularity for small, highly curated datasets. Due to the above limitation and since the lower bound of C=2 adds minimal overhead using binary search (cluster