Created at 11pm, Apr 30
buazizi · Software Development
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Contract ID: WyNk3E6UrMBO0o7cnXVfe78RqTPyKVgtYwnYamPns2U
File Type: PDF
Entry Count: 73
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from a parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals.
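For intuition, below is a minimal PyTorch-style sketch of the two fusion paths the abstract describes: garment image-encoder tokens entering the cross-attention layers (high-level semantics) and features from a parallel garment UNet concatenated into the self-attention layers (low-level details). This is not the authors' implementation; all module and tensor names (FusedAttentionBlock, garment_unet_tokens, garment_image_tokens, the toy shapes) are illustrative assumptions.

import torch
import torch.nn as nn

class FusedAttentionBlock(nn.Module):
    """One attention block of the base UNet with the two garment fusion paths."""
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                                batch_first=True)

    def forward(self, person_tokens, garment_unet_tokens, text_tokens, garment_image_tokens):
        # 1) Self-attention fusion: concatenate low-level features from the parallel
        #    (garment) UNet with the person features; queries stay on the person tokens.
        fused = torch.cat([person_tokens, garment_unet_tokens], dim=1)
        out, _ = self.self_attn(person_tokens, fused, fused)
        x = person_tokens + out

        # 2) Cross-attention fusion: condition on text embeddings plus high-level garment
        #    semantics extracted by an image encoder.
        ctx = torch.cat([text_tokens, garment_image_tokens], dim=1)
        out, _ = self.cross_attn(x, ctx, ctx)
        return x + out

# Toy shapes only, to show how the pieces plug together.
block = FusedAttentionBlock(dim=320, ctx_dim=768)
person = torch.randn(1, 1024, 320)      # latent tokens of the person image
garment_lo = torch.randn(1, 1024, 320)  # same-resolution features from the parallel UNet
text = torch.randn(1, 77, 768)          # text-prompt embeddings
garment_hi = torch.randn(1, 4, 768)     # image-encoder tokens of the garment
print(block(person, garment_lo, text, garment_hi).shape)  # torch.Size([1, 1024, 320])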

Table 2: Quantitative results on the In-the-Wild dataset. We compare IDM-VTON (ours) with other methods on the In-the-Wild dataset to assess generalization capabilities. We report LPIPS, SSIM, and CLIP image similarity (CLIP-I) scores. For IDM-VTON and StableVITON, we further customize the models using a pair of person-garment images (denoted by †). We see that IDM-VTON outperforms the other methods, and the customized IDM-VTON (i.e., IDM-VTON†) performs best.

Method             LPIPS ↓   SSIM ↑   CLIP-I ↑
HR-VITON           0.330     0.741    0.701
LaDI-VTON          0.303     0.768    0.819
DCI-VTON           0.283     0.735    0.752
StableVITON        0.260     0.736    0.836
IDM-VTON (ours)    0.164     0.795    0.901
StableVITON†       0.259     0.733    0.858
IDM-VTON† (ours)   0.150     0.797    0.909

4.2 Results on Public Dataset
id: fcf2e507c0ef14c475363c1630705ad8 - page: 11
Qualitative results. Fig. 4 shows visual comparisons of IDM-VTON with other methods on the VITON-HD and DressCode test datasets. We see that IDM-VTON preserves both low-level details and high-level semantics, while the other methods struggle to capture them. While GAN-based methods (i.e., HR-VITON and GP-VTON) show comparable performance in capturing fine details of the garment, they fall short in generating natural human images compared to diffusion-based methods. On the other hand, prior diffusion-based methods (i.e., LaDI-VTON, DCI-VTON, and StableVITON) fail to preserve the details of garments (e.g., the off-shoulder t-shirts in the first row, or the details of the sweatshirts in the second row of the VITON-HD dataset). We also observe that IDM-VTON significantly outperforms other methods in generalizing to the DressCode dataset. In particular, GAN-based methods show inferior performance on the DressCode dataset, indicating poor generalization ability. While prior diffusion-based methods show better images in generat…
id: 220c5615c3e9aef2e8509819fce8298e - page: 11
Quantitative results. Tab. 1 shows the quantitative comparisons between IDM-VTON (ours) and other methods on the VITON-HD and DressCode test datasets. Compared to GAN-based methods, IDM-VTON shows comparable performance in reconstruction scores (LPIPS and SSIM) and outperforms them in FID and CLIP image similarity score on the VITON-HD test dataset. However, the performance of GAN-based methods degrades significantly when tested on the DressCode dataset, as also observed in Fig. 4. We see that IDM-VTON consistently outperforms prior diffusion-based methods on both the VITON-HD and DressCode datasets.

Fig. 6: Effect of GarmentNet (panels: Input, w/o GarmentNet, with GarmentNet). We compare the generated virtual try-on images without GarmentNet (left) and with GarmentNet (right). We observe that using GarmentNet significantly improves retention of the fine-grained details of the garment (e.g., the graphics on the t-shirts).
id: f122f1c7eb5d9460df4942d431e148b1 - page: 11
4.3 Results on In-the-Wild Dataset. Here, we evaluate our method on the challenging In-the-Wild dataset, where we compare with other diffusion-based VTON methods. In addition, we show the results of customization using a single pair of garment-person images. We also test customization on StableVITON. Qualitative results. Fig. 5 shows the qualitative results of IDM-VTON (ours) compared to other baselines. We see that IDM-VTON generates more authentic images than LaDI-VTON, DCI-VTON, and StableVITON, yet it struggles to preserve the intricate patterns of garments. By customizing IDM-VTON, we obtain virtual try-on images with a high degree of consistency with the garment (e.g., the logos and the text renderings), but we find that customization does not work well for StableVITON.
id: dc01e5b0dc27e843652b4831c8e53fc2 - page: 12
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "WyNk3E6UrMBO0o7cnXVfe78RqTPyKVgtYwnYamPns2U", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "WyNk3E6UrMBO0o7cnXVfe78RqTPyKVgtYwnYamPns2U", "level": 2}'