Created at 9am, Mar 5
Ms-RAG · Artificial Intelligence
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
b8TcqM2-54MdPZ_8dxl4qUogJukb-ikTuVt2nk_U85w
File Type: PDF
Entry Count: 74
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

Abstract: Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data, especially data acquired online, has led to a surge in human motion research. Prior works have mainly concentrated on dual-modality learning, such as text-and-motion tasks, while three-modality learning has rarely been explored. Intuitively, an additional modality can enrich a model's application scenarios and, more importantly, an adequate choice of the extra modality can act as an intermediary that strengthens the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster enhanced alignment and synergistic effects among the text, video, and motion modalities. Empirically, our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion and motion-to-video.
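The core idea, aligning text, video, and motion in one joint embedding space, can be illustrated with pairwise contrastive objectives. The sketch below is a minimal illustration and not the authors' implementation: the encoder outputs z_text, z_video, z_motion, the temperature value, and the equal weighting of the three pairs are all assumptions.

# Minimal sketch of tri-modal contrastive alignment (assumed; not LAVIMO's actual code).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of embeddings of shape (B, D).
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def tri_modal_loss(z_text, z_video, z_motion):
    # Video acts as the bridge: it is contrasted against both text and motion,
    # in addition to the direct text-motion term. Equal weights are assumed.
    return (info_nce(z_text, z_motion) +
            info_nce(z_video, z_motion) +
            info_nce(z_text, z_video))

Here z_text, z_video, and z_motion are (batch, dim) embeddings produced by the respective modality encoders.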

5 Protocol Methods Text-motion retrieval Motion-text retrieval R1 R2 R3 R5 R10 MedR R1 R2 R3 R5 R10 MedR (a) All 7.11 TEMOS Guo et al. 3.37 MotionCLIP 4.87 7.23 TMR 8.59 Ours(2-modal) 10.16 Ours(3-modal) 13.25 6.99 9.31 13.98 15.04 19.92 17.59 10.84 14.36 20.36 21.09 24.61 24.10 16.87 20.09 28.31 32.23 34.57 35.66 27.71 31.57 40.12 46.09 49.80 24.00 28.00 26.00 17.00 13.00 11.00 11.69 4.94 6.55 11.20 11.72 15.43 15.30 6.51 11.28 13.86 17.19 20.12 20.12 10.72 17.12 20.12 23.63 26.95 26.63 16.14 25.48 28.07 32.81 34.57 36.39 25.30 34.97 38.55 48.83 53.32
id: 90edc129bc093cbaacd6bda7b330a7b6 - page: 5
(b) All with threshold TEMOS 18.55 13.25 Guo et al. MotionCLIP 13.79 24.58 TMR 24.02 Ours(2-modal) 30.86 Ours(3-modal) 24.34 22.65 23.08 30.24 30.86 41.80 30.84 29.76 31.45 41.93 42.73 48.63 42.29 39.04 42.93 50.48 54.69 59.96 56.37 49.52 53.01 60.36 70.09 74.22 7.00 11.00 9.00 5.00 5.00 4.00 17.71 10.48 13.24 19.64 21.68 25.98 22.41 13.98 22.11 23.73 27.93 31.25 28.80 20.48 29.53 32.53 34.18 38.28 35.42 27.95 38.06 41.20 42.77 45.70 47.11 38.55 50.23 53.01 57.42 63.09 (c) Dissimilar subset TEMOS 24.00 Guo et al. ] 16.00 MotionCLIP 19.00 26.00 TMR 29.00 Ours(2-modal) 30.00 Ours(3-modal) 40.00 29.00 33.00 46.00 45.00 49.00 46.00 36.00 41.00 60.00 60.00 63.00 54.00 48.00 50.00 70.00 71.00 73.00 70.00 66.00 69.00 83.00 81.00 84.00 5.00 6.00 6.00 3.00 2.00 3.00 33.00 24.00 28.00 34.00 43.00 48.00 39.00 29.00 36.00 45.00 59.00 60.00 45.00 36.00 43.00 60.00 67.00 66.00 49.00 46.00 48.00 69.00 73.00 76.00
id: f35c0ac87100b82c7778f3308ba86b08 - page: 6
64.00 66.00 65.00 82.00 83.00 82.00 (d) Small batches TEMOS 43.88 42.25 Guo et al. MotionCLIP 41.29 49.25 TMR 53.96 Ours(2-modal) 58.10 Ours(3-modal) 58.25 62.62 55.38 69.75 76.42 77.80 67.00 75.12 69.50 78.25 82.05 86.34 74.00 87.50 78.83 87.88 89.66 93.08 84.75 96.12 90.12 95.00 95.22 96.47 2.06 1.88 1.73 1.50 1.38 1.08 41.88 39.75 39.55 50.12 58.58 60.23 55.88 62.75 52.07 67.12 75.05 77.52 65.62 73.62 68.13 76.88 81.84 86.44 75.25 86.88 77.94 88.88 89.68 93.22 85.75 95.88 90.85 94.75 93.86 95.87 Table 2. Text-to-motion Retrieval on KIT-ML. We conduct further evaluations of both our 2-modal and 3-modal approaches using the KIT-ML dataset. The findings reveal that our 2-modal version significantly surpasses previous methodologies in performance. Moreover, our 3-modal version demonstrates an even greater extent of superiority over other existing methods. The most notable results are emphasized in bold.
id: 5360dd0cba07f0d4a33f9c26deb8184e - page: 6
8 and assign 0.1 to recon. Training is conducted with a batch size B of 64 over 400 epochs. We use the AdamW optimizer with a learning rate of 1e-4, linearly decayed to 1e-5 after the first 100 epochs. For data augmentation, an image is first randomly resized and a 256×256-pixel crop is extracted from it. The crop is then subjected to a series of transformations: random color jittering, random conversion to grayscale, Gaussian blur, and random horizontal flipping, following the implementation of RandAugment. Our 2-modal version shares the same settings as the 3-modal version, the difference being that contrastive learning and modality fusion are applied between text and motion only. Evaluation Metrics. Our evaluation of retrieval performance utilizes standard metrics, including
id: 4a2eecdc5771290601d350ac81c2ab7c - page: 6
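The chunk above specifies the optimizer, learning-rate schedule, batch size, and frame augmentations. The following is a hedged reconstruction of that recipe in PyTorch/torchvision; crop scale, jitter strength, blur kernel size, and application probabilities are not stated in the text and are assumptions here.

# Hedged reconstruction of the training setup described above; unstated values are assumptions.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

# Frame augmentation: random resized 256x256 crop, random color jitter,
# random grayscale, Gaussian blur, random horizontal flip, then RandAugment.
augment = transforms.Compose([
    transforms.RandomResizedCrop(256),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandAugment(),
])

model = torch.nn.Linear(512, 256)  # placeholder for the actual tri-modal model
optimizer = AdamW(model.parameters(), lr=1e-4)

# Hold the learning rate at 1e-4 for the first 100 epochs, then decay it
# linearly to 1e-5 by epoch 400 (batch size 64, 400 epochs total).
def lr_lambda(epoch):
    if epoch < 100:
        return 1.0
    return 1.0 - 0.9 * (epoch - 100) / (400 - 100)

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)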
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "b8TcqM2-54MdPZ_8dxl4qUogJukb-ikTuVt2nk_U85w", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "b8TcqM2-54MdPZ_8dxl4qUogJukb-ikTuVt2nk_U85w", "level": 2}'