Quentin Garrido 1,2, Mahmoud Assran 1, Nicolas Ballas 1, Adrien Bardes 1,3, Laurent Najman 2, Yann LeCun 1,4,5

1- FAIR at Meta, 2- Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France, 3- INRIA, 4- Courant Institute, New York University, 5- Center for Data Science, New York University

Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling (MIM) and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a finetuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations as contrastive methods do, or equivariant representations as masked image modeling does.

Correspondence: Quentin Garrido at garridoq@meta.com
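As a rough illustration of the prediction task described in the abstract (not the authors' implementation), here is a minimal sketch of one IWM-style training step, assuming a PyTorch-style API. The module names, the loss choice, the EMA target update, and the omission of source-view masking are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def iwm_training_step(encoder, target_encoder, predictor, optimizer,
                      image, augment, ema=0.996):
    """Hypothetical IWM-style update: predict the latents of the clean view
    from an augmented view, conditioned on the transformation parameters.
    Masking of the source view is omitted here for brevity."""
    src_img, params = augment(image)           # params: (B, P) describes the transform
    z_src = encoder(src_img)                   # (B, N, D) latents of the corrupted view
    with torch.no_grad():
        z_tgt = target_encoder(image)          # (B, N, D) clean-view latents, no grad
    z_pred = predictor(z_src, params)          # world model acts in latent space
    loss = F.smooth_l1_loss(z_pred, z_tgt)     # regression loss in representation space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # EMA update of the target encoder, as is standard in JEPA-style training.
    with torch.no_grad():
        for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
            p_t.mul_(ema).add_(p, alpha=1.0 - ema)
    return loss.item()
```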
By varying the equivariance of the world model, IWM is able to occupy the spectrum between contrastive approaches and MIM, as we can see in Figure 4 and Table 8, with IWM^{Inv}_{12,384} and IWM^{Equi}_{18,384} being the two extremes of the IWM spectrum. This spectrum can be summarized by the SSL ethos of "learning what is predictable". Learning with a weak world model means that it cannot model the world properly, so the encoder removes the information that cannot be predicted. On the other hand, if the world model is very powerful, the representation does not need to be as abstract or semantic, since the predictor can find a way to map representations in any situation. This means that learning a world model offers a measurable way to control the level of abstraction of the representations.
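To make this knob concrete, here is a hypothetical predictor sketch in PyTorch. Depth and transformation-conditioning are the two levers described above: a shallow or unconditioned predictor pushes the encoder toward invariance, while a deep conditioned one pushes it toward equivariance. The class name, argument names, and the token-concatenation conditioning scheme are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionedPredictor(nn.Module):
    """Illustrative latent world model: transformer blocks that can attend to
    an embedded 'action' token describing the applied transformation."""
    def __init__(self, dim=384, depth=18, num_params=6, conditioned=True):
        super().__init__()
        self.conditioned = conditioned
        self.param_embed = nn.Linear(num_params, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z_src, params):
        if self.conditioned:
            # Append the action embedding as an extra token the blocks attend to.
            a = self.param_embed(params).unsqueeze(1)   # (B, 1, D)
            z = torch.cat([z_src, a], dim=1)
            return self.blocks(z)[:, :-1]               # drop the action token
        return self.blocks(z_src)                       # invariant-style setup

# Roughly in the spirit of the two extremes discussed above:
#   equivariant-leaning: ConditionedPredictor(depth=18, conditioned=True)
#   invariant-leaning:   ConditionedPredictor(depth=12, conditioned=False)
```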
7 Conclusion and future perspectives

We introduced IWM, an approach to learn self-supervised visual representations with world models. With an in-depth study, we provided guidelines and key components for learning a good image world model. Conditioning the world model on the image transformation is crucial to avoid collapsing to classical SSL behavior. Using strong transformations is also key to ensure that the world model learns to model the effect of the transformations rather than ignoring them. Finally, enough capacity is needed for modeling complex behaviors.
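As an illustration of the "strong transformations" guideline, here is a hypothetical parameterized photometric augmentation that returns the sampled parameters so the world model can be conditioned on them. It operates on a single image tensor, and all ranges and parameter choices are assumptions rather than the paper's settings.

```python
import random
import torch
from torchvision.transforms import functional as TF

def strong_photometric_augment(img):
    """Sample and apply a strong photometric transformation, returning both
    the augmented image and the parameters describing the 'action'."""
    p = [random.uniform(0.0, 1.0) for _ in range(6)]
    out = TF.adjust_brightness(img, 0.5 + p[0])        # factor in [0.5, 1.5]
    out = TF.adjust_contrast(out, 0.5 + p[1])
    out = TF.adjust_saturation(out, 0.5 + p[2])
    out = TF.adjust_hue(out, p[3] * 0.2 - 0.1)         # hue shift in [-0.1, 0.1]
    if p[4] > 0.8:                                     # occasional grayscale
        out = TF.rgb_to_grayscale(out, num_output_channels=3)
    out = TF.gaussian_blur(out, kernel_size=9, sigma=0.1 + p[5] * 1.9)
    return out, torch.tensor(p, dtype=torch.float32)
```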
We showed that only a capable world model can be reused for discriminative tasks. This led to our predictor finetuning protocol, which matches encoder finetuning at a fraction of the cost, showing that world models are versatile evaluation heads. We further adapted it to solve multiple tasks at once without losing performance. Finally, we studied how learning a world model impacts representation quality. A capable world model learns rich representations that improve performance on downstream tasks such as image classification and semantic segmentation. Additionally, learning an invariant world model led to better representations for linear evaluation. While MIM and contrastive approaches are two ends of a spectrum of representation abstraction, Image World Models allow us to interpolate between them. As such, we believe that learning image world models is a very promising direction for self-supervised learning.
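As a sketch of the predictor-finetuning idea (not the paper's exact protocol), one could freeze the pretrained encoder and train only the pretrained world model plus a linear readout. Conditioning on a null action, the action dimensionality, and mean pooling over tokens are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PredictorFinetuner(nn.Module):
    """Hypothetical predictor-finetuning head: frozen encoder features are
    passed through the (trainable) pretrained world model, then classified."""
    def __init__(self, encoder, predictor, dim=384, num_classes=1000, num_params=6):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)              # encoder stays frozen
        self.predictor = predictor               # finetuned at a fraction of full cost
        self.classifier = nn.Linear(dim, num_classes)
        self.num_params = num_params

    def forward(self, img):
        with torch.no_grad():
            z = self.encoder(img)                # (B, N, D) frozen features
        # Condition on a null/identity action (an assumption of this sketch).
        null_action = torch.zeros(img.size(0), self.num_params, device=img.device)
        h = self.predictor(z, null_action)       # reuse the world model as a head
        return self.classifier(h.mean(dim=1))    # pool over tokens, then classify
```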
8 Broader impact statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.