1FAIR, Meta, 2HuggingFace, 3AutoGPT, 4GenAI, Meta

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's ability to exhibit the same robustness as the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard, accessible at the link below.

Date: November 23, 2023
Correspondence: {gmialon,tscialom}@meta.com, clementine@huggingface.co
Code: https://huggingface.co/gaia-benchmark
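For readers who want to experiment with the benchmark, the sketch below shows one plausible way to fetch the released questions with the Hugging Face datasets library. The repository id, configuration name, and column name are assumptions inferred from the Code URL above, not details given in the paper; consult the hub page for the exact identifiers and any access conditions.

```python
# Minimal sketch: downloading the public GAIA questions from the Hugging Face Hub.
# The repository id ("gaia-benchmark/GAIA"), configuration name ("2023_all"),
# and the "Question" column are assumptions based on the URL in the paper.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

# The questions are public, but the answers to the leaderboard split are withheld,
# so only the validation split is expected to carry reference answers.
for example in gaia["validation"].select(range(3)):
    print(example["Question"])
```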
Lack of linguistic and cultural diversity. A major limitation of GAIA is its lack of language diversity: all questions are asked in standard English only, and many of them rely mostly on English web pages. The benchmark therefore does not validate the usefulness of assistants for non-English speakers (about 80% of the world's population), on the non-English-speaking web (about half of its content), or on any dialectal variation of English. As such, GAIA is only a first step toward estimating the potential of AI assistants and should not be seen as an absolute proof of their success. We hope to fill this gap in future work or through community involvement.
7 Acknowledgements The authors would like to thank Nicolas Usunier for suggesting the web search baseline, Edwin Chen for helping us improve our unusual protocol for annotators, Yacine Jernite for sharing his insights on diversity when benchmark building, and Sasha Luccioni for taking the time to proofread some sections where proper English was eluding us.