Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7× acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality.
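To make the reframing concrete, below is a minimal PyTorch sketch (not the authors' implementation) contrasting the two objectives: a CLIP-style contrastive loss that must materialize an N x N image-text similarity matrix per batch, versus a classification objective that, assuming each image is assigned a multi-hot label vector derived from its caption, needs only a per-image binary cross-entropy over a fixed label vocabulary. All function and variable names are illustrative.

# Minimal sketch (not the authors' code). Names such as img_emb, txt_emb,
# logits, targets, and num_classes are illustrative placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style objective: requires an N x N pairwise similarity matrix
    # over the batch, which is what makes it expensive at web scale.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (N, N)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def classification_pretraining_loss(logits, targets):
    # Classification reframing: per-image multi-label binary cross-entropy
    # over a fixed vocabulary (logits: (N, num_classes), targets: multi-hot).
    # No pairwise image-text similarity term is needed.
    return F.binary_cross_entropy_with_logits(logits, targets)

The second objective involves no all-to-all similarity computation across the batch, which is consistent with the training-speed gains reported above.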
4.3. Comparison with existing pre-training methods

Our weakly supervised method, CatLIP, is compared with state-of-the-art methods in Table 2. We compare pre-training parameter count, pre-training data size, image resolution during pre-training and fine-tuning, and top-1 accuracy on two standard datasets: ImageNet-1k and Places365. We group these methods into two categories: supervised and weakly supervised. We classify models pre-trained on JFT as supervised because limited information is available about these datasets and their annotation process. (Zhai et al., 2022a) briefly mentions that JFT-3B has 30k classes and is collected using a semi-automatic annotation pipeline, implying a manual curation and annotation process. Therefore, we consider the JFT datasets as supervised, similar to (Singh et al., 2022). Table 2 shows that CatLIP delivers competitive performance compared to existing pre-training methods. For instance, ViT B/16 pre-trained with CatLIP and CLIP on
¹Initialized with random embeddings. ²Initialized with an average embedding.

Table 2. Transfer learning accuracies of ViT models pre-trained on different datasets using supervised and weakly supervised methods. Transfer learning is achieved through fine-tuning the entire model on downstream classification tasks. Our weakly supervised models achieve the best performance on both ImageNet-1k and Places365 classification tasks. Here, we include models pre-trained on JFT under supervised pre-training as limited information is available about their semi-automatic labeling method, similar to (Singh et al., 2022). ViT-22B uses a frozen image encoder. Here, WIT means web-crawled image-text dataset. CoCa is a hybrid approach that uses labels from JFT-3B to create captions for image-text training along with ALIGN data. Therefore, it is not directly comparable to other approaches.
Model                                              Params   Pre-training data   Res. (Pre. / Fine.)   ImageNet-1k   Places365

Supervised pre-training
ViT B/16 (Dosovitskiy et al., 2020)                87 M     ImageNet-21k        224 / 384             84.0          58.2
ViT L/16 (Dosovitskiy et al., 2020)                305 M    ImageNet-21k        224 / 384             85.2          59.0
ViT L/16 (Dosovitskiy et al., 2020)                305 M    JFT-300M            224 / 512             87.8          –
ViT H/14 (Dosovitskiy et al., 2020)                634 M    JFT-300M            224 / 512             88.6          –
ViT L/16 (Zhai et al., 2022a)                      305 M    JFT-3B              224 / 384             88.5          –
ViT G/14 (Zhai et al., 2022a)                      1.9 B    JFT-3B              224 / 518             90.5          –
ViT-22B (Dehghani et al., 2023)                    22 B     JFT-4B              224 / 224             89.5          –

Weakly supervised pre-training
ViT B/16 (Singh et al., 2022)
ViT L/16 (Singh et al., 2022)
ViT H/16 (Singh et al., 2022)
ALIGN EfficientNet-L2 (Jia et al., 2021)
FLIP ViT-B (Li et al., 2023)
FLIP ViT-H (Li et al., 2023)
OpenAI CLIP ViT B/16 (Radford et al., 2021)
RangeAugment CLIP ViT B/16 (Mehta et al., 2022)
RangeAugment CLIP ViT H/16 (Mehta et al., 2022)
OpenCLIP ViT B/16 (Cherti et al., 2023)
OpenCLIP ViT L/14 (Cherti et al., 2023)
OpenCLIP ViT H/14 (Cherti et al., 2023)
CoCa (Yu et al., 2022)
CLIP ViT B/16 (Our repro.)
CatLIP ViT B/16 (Ours)
CatLIP ViT L/16 (Ours)
CatLIP ViT H/16 (Ours)
CatLIP ViT B/16 (Ours)
CatLIP ViT L/16 (Ours)
CatLIP ViT H/16 (Ours)
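The caption of Table 2 notes that transfer learning is performed by fine-tuning the entire model on a downstream classification task. A hedged sketch of that protocol follows; the torchvision model constructor, the downstream head size (Places365), and the optimizer hyperparameters are illustrative assumptions rather than the paper's recipe.

# Hedged sketch of full-model fine-tuning for a downstream classification task.
# Model choice, head size, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)   # backbone; pre-trained (e.g., CatLIP) weights would be loaded here
model.heads.head = nn.Linear(model.heads.head.in_features, 365)   # e.g., a Places365 head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # One fine-tuning step: every backbone parameter receives gradients.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

The key point is that all backbone parameters are updated, in contrast to linear probing or the frozen-encoder setting noted for ViT-22B in the caption.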