Analysing Results from AI Benchmarks: Key Indicators and How to Obtain Them - Dria


Created at 10pm, Mar 10

Artificial Intelligence



cj4Zy2vrTvrtZrGKMCXioOCFy_fIyj9xD7g1PuKXXSA

File Type: PDF
Entry Count: 106
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Item response theory (IRT) can be applied to the analysis of the evaluation of results from AI benchmarks. The two-parameter IRT model provides two indicators (difficulty and discrimination) on the side of the item (or AI problem) while only one indicator (ability) on the side of the respondent (or AI agent). In this paper we analyse how to make this set of indicators dual, by adding a fourth indicator, generality, on the side of the respondent. Generality is meant to be dual to discrimination, and it is based on difficulty. Namely, generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. With the addition of generality, we see that this set of four key indicators can give us more insight on the results of AI benchmarks. In particular, we explore two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition. We provide some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.

Once the data is ready, a 2-parameter IRT logistic model (2PL) is learned for each ALE and GVGAI game. We adopt MLE to estimate all the model parameters for all instances and the classifier abilities simultaneously, as usual in IRT. In particular, for generating the IRT models, we used the ltm R package, using Birnbaum's method, as explained in Section II-A. The package ltm (as many other IRT libraries) outputs indicators about the goodness of fit, which can be used to quantify the discrepancy between the values observed in the data (items) and the values expected under the statistical IRT model. Item-fit statistics may be used to test the hypothesis of whether the fitted model could truly be the data-generating model or, conversely, we expect the item parameter estimates to be biased. In practice, an IRT model may be rejected on the basis of bad item-fit statistics, as we would not be reasonably confident about the validity of the inferences drawn from it [Maydeu-Olivares, 2013]. Apart from the g

id: 6c04f28856d6ccc53b191189129d4ab0 - page: 13

In the present case, none of the estimated models were discarded because of bad item-fit statistics or inconsistency in their results.

id: ce5261ae08668f711c3ac2092596527b - page: 13
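To make the 2PL model discussed above concrete, the following is a minimal Python sketch of its item characteristic curve and of the log-likelihood that a maximum-likelihood fit would try to maximise. It is only an illustration under the standard 2PL definition, not the estimation procedure of the ltm package; the function names `icc_2pl` and `log_likelihood` and the toy data layout are assumptions introduced here.

```python
import math


def icc_2pl(theta, a, b):
    """Success probability under the 2PL model.

    theta: ability of the respondent (AI agent)
    a:     discrimination of the item (game)
    b:     difficulty of the item (game)
    """
    z = a * (theta - b)
    # Numerically stable logistic, so large |z| does not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(responses, abilities, items):
    """Joint log-likelihood of a binary response matrix.

    responses[j][i] is 1 if agent j solved item i, else 0.
    abilities[j] is the ability of agent j.
    items[i] is a (discrimination, difficulty) pair for item i.
    """
    total = 0.0
    for j, theta in enumerate(abilities):
        for i, (a, b) in enumerate(items):
            p = icc_2pl(theta, a, b)
            # Clamp to avoid log(0) for saturated probabilities.
            p = min(max(p, 1e-12), 1.0 - 1e-12)
            x = responses[j][i]
            total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total
```

An optimiser would search over `abilities` and `items` for the values that maximise `log_likelihood`; comparing the observed responses with the probabilities given by `icc_2pl` at the fitted values is the basic idea behind goodness-of-fit checks.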

Regarding the results, for the ALE games, difficulties range from -10.51 to 8.22, while discriminations range from -0.64 to 58.27. For the GVGAI games, difficulties range from 29.58 to 84.53, while discriminations range from 0.19 to 123.22.

id: f8df5f14f7bb70be95ebdaf6f09a6b1f - page: 13

The item parameter that is easiest to understand is difficulty. Because of the MLE estimation method, the value is not equal to, but is well correlated with, the percentage of AI techniques that are successful for the game. Intuitively, easy games are solved by almost all techniques, and difficult games are those that are only solved by very able techniques. Fig. 4 shows the ICCs of the three most (and least) difficult ALE (left) and GVGAI (right) games with positive discrimination. Among those games, the most difficult ALE game seems to be H.E.R.O, and iceandfire.1 for GVGAI. However, we see cases such as Tennis (ALE), which has the highest difficulty (8.22) but negative discrimination (-0.13, Fig. 6 left). According to [Bellemare et al., 2015], it is a challenging game that requires fairly elaborate behaviour before observing any positive reward, but simple behaviour can avoid high negative rewards by not ever serving, which possibly explains the negative discrimination. Something similar happens with t

id: 18f4884271255111b65e53ff2a3b5401 - page: 13
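The effect of a negative discrimination on an ICC can be illustrated with a short Python sketch. It assumes the standard 2PL logistic form and uses the Tennis values quoted above, reading the discrimination as -0.13 since the text describes it as negative. The second item and the ability grid are hypothetical, and `icc_2pl` and `curve` are names introduced here for illustration.

```python
import math


def icc_2pl(theta, a, b):
    """2PL success probability for ability theta, discrimination a, difficulty b."""
    z = a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def curve(a, b, thetas):
    """Evaluate the ICC on a grid of abilities."""
    return [icc_2pl(t, a, b) for t in thetas]


# Hypothetical ability grid, chosen only for illustration.
thetas = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# Tennis (ALE): difficulty 8.22, discrimination assumed to be -0.13.
tennis = curve(-0.13, 8.22, thetas)

# A hypothetical item with positive discrimination, for contrast.
regular = curve(1.0, 0.0, thetas)

for t, p_neg, p_pos in zip(thetas, tennis, regular):
    print(f"theta={t:+.1f}  negative-a item: {p_neg:.3f}  positive-a item: {p_pos:.3f}")
```

With positive discrimination the success probability rises with ability, so difficulty marks the ability at which success becomes more likely than failure. With negative discrimination the curve falls instead: more able techniques are predicted to do worse, which is why difficulty is harder to interpret for such games.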

How to Retrieve?

# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "cj4Zy2vrTvrtZrGKMCXioOCFy_fIyj9xD7g1PuKXXSA", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "cj4Zy2vrTvrtZrGKMCXioOCFy_fIyj9xD7g1PuKXXSA", "level": 2}'
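# Search from Python

The search call above can also be prepared from Python. This is a minimal sketch that only mirrors the curl command: the URL, header names, and JSON fields are copied from it, while the helper name `build_search_request` and the example query are assumptions introduced here. The response format is not described on this page, so the sketch builds the request and leaves sending it to the caller.

```python
import json
import urllib.request

SEARCH_URL = "https://search.dria.co/hnsw/search"
CONTRACT_ID = "cj4Zy2vrTvrtZrGKMCXioOCFy_fIyj9xD7g1PuKXXSA"


def build_search_request(query, api_key, top_n=10, rerank=True):
    """Build (but do not send) a POST request equivalent to the curl search call."""
    payload = {
        "rerank": rerank,
        "top_n": top_n,
        "contract_id": CONTRACT_ID,
        "query": query,
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical query text; replace the placeholder key with a real one.
request = build_search_request("How is item difficulty estimated?", "<YOUR_API_KEY>")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```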