Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre, and Frank Keller

Abstract: Existing work has observed that current text-to-image systems do not accurately reflect explicit spatial relations between objects, such as left of or below. We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models. We propose an automatic method that, given existing images, generates synthetic captions that contain 14 explicit spatial relations. We introduce the Spatial Relation for Generation (SR4G) dataset, which contains 9.9 million image-caption pairs for training, and more than 60 thousand captions for evaluation. In order to test generalization we also provide an unseen split, where the sets of objects in the train and test captions are disjoint. SR4G is the first dataset that can be used to spatially fine-tune text-to-image systems. We show that fine-tuning two different Stable Diffusion models (denoted as SDSR4G) yields up to 9 points of improvement in the VISOR metric. The improvement holds in the unseen split, showing that SDSR4G is able to generalize to unseen objects. SDSR4G improves the state of the art with fewer parameters and avoids complex architectures. Our analysis shows that the improvement is consistent for all relations. The dataset and the code are publicly available.
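The caption-generation step is not detailed in this excerpt; the following is a minimal sketch of how an explicit spatial relation could be read off two annotated bounding boxes and turned into a synthetic caption. The centre-based rule, the phrasing templates and the function names are illustrative assumptions, not the exact SR4G procedure.

# Hedged sketch: derive one projective spatial relation from two COCO-style
# bounding boxes and phrase it as a synthetic caption. The centre-based rule
# below is an illustrative assumption, not the exact SR4G heuristic.

def centre(box):
    """COCO-style box (x, y, width, height) -> centre point."""
    x, y, w, h = box
    return x + w / 2, y + h / 2

def projective_relation(box_a, box_b):
    """Pick the dominant projective relation of box_a with respect to box_b."""
    (ax, ay), (bx, by) = centre(box_a), centre(box_b)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "to the left of" if dx > 0 else "to the right of"
    # Image coordinates grow downwards, so a smaller y means "above".
    return "above" if dy > 0 else "below"

def synthetic_caption(obj_a, box_a, obj_b, box_b):
    return f"a {obj_a} {projective_relation(box_a, box_b)} a {obj_b}"

print(synthetic_caption("dog", (10, 50, 40, 40), "cat", (120, 55, 40, 40)))
# -> "a dog to the left of a cat"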
The table also shows the two auxiliary metrics, with VPGen obtaining the best results for object accuracy and VISOR. That is expected, since VPGen has been trained specifically for object generation, and VISOR is calculated over all the recognised objects. In fact, its better VISOR results are due solely to better object accuracy: once object accuracy is factored out of VISOR (VISORCond), our method produces better spatial configurations. Also note the contamination issue for the unseen split: the text-to-layout step of VPGen has been fine-tuned on COCO, so it has seen text-layout pairs covering the entire set of objects, including all the objects in our unseen test set.
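As a reminder of how the metrics interact, the text above reads VISORCond as VISOR with object accuracy (OA) factored out, i.e. VISOR = OA x VISORCond. Under that reading, higher OA alone can raise VISOR even when the conditional spatial score drops. A minimal sketch with made-up numbers:

# Hedged sketch of the VISOR / VISORCond relationship as described above;
# the exact metric definitions are in the VISOR benchmark, not reproduced here.

def visor_cond(visor: float, object_accuracy: float) -> float:
    """Conditional VISOR: spatial correctness given that both objects were generated.
    Inputs and output are fractions in [0, 1]."""
    return visor / object_accuracy if object_accuracy > 0 else 0.0

# Illustrative, made-up numbers: model A wins on VISOR purely through higher
# object accuracy, while model B actually places objects better (higher VISORCond).
print(visor_cond(visor=0.375, object_accuracy=0.75))  # model A -> 0.5
print(visor_cond(visor=0.3,   object_accuracy=0.5))   # model B -> 0.6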
5 Analysis

We present an extensive analysis of the consequences of fine-tuning on SR4G, covering performance per relation, biases for opposite relations, performance by frequency of triplets, and qualitative examples.

Figure 2: The horizontal axis depicts the difference in VISORCond values between relation pairs with opposing meanings, defined on each side of the vertical axis. Results for SD and SDSR4G v2.1 on the unseen split.

5.1 Analysing performance per relation

In Table 4 we show VISORCond values per spatial relation for SDSR4G v2.1 (our best model), both in the main and unseen splits. First, we observe that all projective relations improve significantly for both splits. The improvement is larger for left of and right of. This might be due to the random horizontal flips applied only to the images during the training of SD models, which are expected to damage the models' ability to correctly learn those relations.
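The horizontal-flip hypothesis can be made concrete: if an image is flipped but its caption is left untouched, captions with left of or right of become inconsistent with the image. The sketch below shows a relation-aware flip that swaps the two phrases; it illustrates the hypothesis and is not training code from Stable Diffusion or SR4G.

# Hedged sketch: a horizontal flip that keeps left/right captions consistent.

import random

def flip_caption(caption: str) -> str:
    """Swap 'left of' and 'right of' so the caption stays valid after a horizontal flip."""
    sentinel = "<LEFT_OF>"
    return (caption.replace("left of", sentinel)
                   .replace("right of", "left of")
                   .replace(sentinel, "right of"))

def relation_aware_hflip(image, caption, p=0.5):
    """Randomly flip an HWC/HW array (e.g. NumPy) and adjust the caption accordingly."""
    if random.random() < p:
        image = image[:, ::-1]
        caption = flip_caption(caption)
    return image, caption

print(flip_caption("a dog to the left of a cat"))  # -> "a dog to the right of a cat"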
Topological relations show a more variable behaviour. In the case of separated, the only topological relation that does not involve generating overlapping objects, SDSR4G improves its performance by up to 18.5 points in VISORCond. However, for overlapping, fine-tuning is not helpful: SD v2.1 already knows how to generate images with the overlapping relation, achieving VISORCond values of 91.8 and 89.2 on the two test splits. On the other hand, surrounding and inside seem to be especially hard. The VISORCond values are low for the base SD model, and fine-tuning makes them even worse (especially for inside). This is a limitation of our current approach, and different training strategies must be explored to tackle this issue.
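For reference, the four topological relations discussed here can be assigned from two bounding boxes with simple containment and intersection tests. The sketch below is one plausible rule set; the exact definitions and thresholds used in SR4G may differ.

# Hedged sketch of the four topological relations from two (x, y, w, h) boxes.

def to_xyxy(box):
    x, y, w, h = box
    return x, y, x + w, y + h

def contains(outer, inner):
    ox1, oy1, ox2, oy2 = to_xyxy(outer)
    ix1, iy1, ix2, iy2 = to_xyxy(inner)
    return ox1 <= ix1 and oy1 <= iy1 and ox2 >= ix2 and oy2 >= iy2

def intersects(a, b):
    ax1, ay1, ax2, ay2 = to_xyxy(a)
    bx1, by1, bx2, by2 = to_xyxy(b)
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def topological_relation(box_a, box_b):
    """Topological relation of box_a with respect to box_b."""
    if contains(box_b, box_a):
        return "inside"
    if contains(box_a, box_b):
        return "surrounding"
    if intersects(box_a, box_b):
        return "overlapping"
    return "separated"

print(topological_relation((0, 0, 10, 10), (2, 2, 3, 3)))    # surrounding
print(topological_relation((2, 2, 3, 3), (0, 0, 10, 10)))    # inside
print(topological_relation((0, 0, 10, 10), (5, 5, 10, 10)))  # overlapping
print(topological_relation((0, 0, 4, 4), (10, 10, 4, 4)))    # separated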
Finally, SDSR4G improves for all scale relations. Interestingly, taller, wider and larger perform better than their opposites, even though their improvements over the base SD model are more modest. This suggests that the base SD model might be biased towards those spatial relations.

Figure 3: Correlation between the frequency of SR4G triplets in COCO training instances (logarithmic horizontal axis) and their respective VISORCond results for SD v2.1 and SDSR4G v2.1, on (a) the main splits and (b) the unseen splits. Triplets are grouped by frequency for visibility.
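The scale relations compared above (taller, wider, larger and their opposites) can likewise be read off the two bounding boxes. The sketch below uses a simple ratio threshold as a margin; this is an assumption for illustration, not the SR4G specification.

# Hedged sketch: scale relations of box_a with respect to box_b, from (x, y, w, h) boxes.

def scale_relations(box_a, box_b, ratio=1.0):
    """Return the scale relations that hold for box_a relative to box_b."""
    _, _, wa, ha = box_a
    _, _, wb, hb = box_b
    rels = []
    if ha > ratio * hb:
        rels.append("taller")
    elif hb > ratio * ha:
        rels.append("shorter")
    if wa > ratio * wb:
        rels.append("wider")
    elif wb > ratio * wa:
        rels.append("narrower")
    if wa * ha > ratio * wb * hb:
        rels.append("larger")
    elif wb * hb > ratio * wa * ha:
        rels.append("smaller")
    return rels

print(scale_relations((0, 0, 30, 60), (50, 0, 20, 20)))  # ['taller', 'wider', 'larger']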