Challenging conventional wisdom, this paper critically examines the supposed emergent abilities of large-scale language models. The authors argue that these abilities may not be inherent to the scaling of AI models but might stem from the metrics used in their evaluation. This provocative stance sparks a reevaluation of our understanding of large language models and underscores the need for more robust metrics to assess AI capabilities accurately.
Figure 5: Emergent abilities appear only for specific metrics, not task-model families. (A) Possible emergent abilities appear with at most 5 out of 39 BIG-Bench metrics. (B) Hand-annotated data reveals emergent abilities appear only under 4 preferred metrics. (C) > 92% of emergent abilities appear under one of two metrics: Multiple Choice Grade and Exact String Match.

Figure 6: Changing the metric when evaluating task-model family pairs causes emergent abilities to disappear. Left: The LaMDA model family displays emergent abilities when measured under the discontinuous Multiple Choice Grade. Right: The LaMDA model family's emergent abilities disappear when measured under a continuous BIG-Bench metric: Brier Score.
Brier Score is a strictly proper scoring rule for predictions of mutually exclusive outcomes; for a binary outcome, the Brier Score simplifies to the mean squared error between the outcome and its predicted probability mass. LaMDA's emergent abilities on the discontinuous Multiple Choice Grade disappeared when we changed the metric to the continuous Brier Score (Fig. 6). These results support our alternative explanation that emergent abilities are induced by the chosen metric.
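The binary-outcome simplification described above can be sketched in a few lines. This is an illustrative NumPy implementation, not the paper's evaluation code; the function name and example values are our own.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Binary Brier Score: mean squared error between outcomes (0/1)
    and the predicted probability mass assigned to the positive outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# A confident, correct prediction scores near 0; a confident, wrong one near 1.
print(brier_score([1, 0, 1], [0.9, 0.1, 0.8]))  # 0.02
```

Because the score degrades smoothly as predicted probabilities drift from the true outcomes, small per-token improvements in a model register as small, continuous metric improvements, unlike an all-or-nothing grade.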
Inducing Emergent Abilities in Networks on Vision Tasks To demonstrate how emergent abilities can be induced by the researcher's choice of metric, we show how to produce emergent abilities in deep networks of various architectures: fully connected, convolutional, and self-attentional. We focus on vision tasks because abrupt transitions in vision models' capabilities have not been observed, to the best of our knowledge; this is one reason why emergence in large language models is considered so interesting. For the convolutional example, see App. B.
Emergent Reconstruction of CIFAR100 Natural Images by Nonlinear Autoencoders We first induce an emergent ability to reconstruct images in shallow (i.e., single hidden layer) nonlinear autoencoders trained on CIFAR100 natural images. To emphasize that the sharpness of the metric is responsible for emergent abilities, and to show that sharpness extends to metrics beyond Accuracy, we intentionally define a discontinuous metric that measures a network's ability to reconstruct images. We compare the continuous test mean squared error against a thresholded reconstruction metric:

Test Mean Squared Error $= \frac{1}{N}\sum_{n=1}^{N} \|x_n - \hat{x}_n\|^2$

Test Reconstruction Ability $= \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}\big[\|x_n - \hat{x}_n\|^2 < c\big]$

[Figure 7: Shallow autoencoder model parameters vs. test metric. Under the continuous Test Mean Squared Error, no emergent ability appears; under the discontinuous thresholded Test Reconstruction Ability, a metric-induced emergent ability appears.]