Created at 6am, Apr 5
Ms-RAG · Artificial Intelligence
Laser Learning Environment: A new environment for coordination-critical multi-agent tasks
wVvql3TqsrDg8vs_uF6vSzi39YXtYFOD8SMB0nXxu1o
File Type: PDF
Entry Count: 62
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Yannick Molinghen1,2, Raphaël Avalos2, Mark Van Achter4, Ann Nowé2, and Tom Lenaerts1,2,3
1 Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
2 AI Lab, Vrije Universiteit Brussel, Brussels, Belgium
3 Center for Human-Compatible AI, UC Berkeley, USA
4 KU Leuven, Leuven, Belgium

Abstract. We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritised experience replay and n-step return hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.

Level 6 (Figure 1) is of size 12 × 13 and has 4 agents and 4 gems. The maximal score is hence 4 + 4 + 1 = 9, as explained in Section 3.5. The optimal policy in level 6 is the following: i) Agent green should collect the gem in the top left corner; ii) Agent red should block the red laser and wait for every other agent to cross; iii) Agent yellow should cross the red laser and collect the gem near the yellow source that only it can collect; iv) Agent yellow should block the laser so that every agent can cross; v) Agents should collect the remaining gems on the bottom half; vi) Agents should go to the exit tiles. The length of such an episode is 30 time steps, well below the time limit of ⌈(12 × 13) / 2⌉ = 78 steps.
id: ed2cdbe326c480674c779fcf6daf79e4 - page: 9
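As a quick check of those numbers, the minimal Python sketch below recomputes the maximal score and the time limit from the level description above. It assumes the breakdown suggested by the text (one point per gem, one point per agent that exits, plus one final bonus point) and a limit of ⌈width × height / 2⌉ steps; the function names are illustrative and not part of the LLE API.

import math

def max_score(n_agents: int, n_gems: int) -> int:
    # Assumed breakdown from the 4 + 4 + 1 = 9 example:
    # one point per gem, one per agent that exits, plus a final bonus point.
    return n_gems + n_agents + 1

def time_limit(width: int, height: int) -> int:
    # Assumed rule matching the stated limit: ceil(width * height / 2).
    return math.ceil(width * height / 2)

assert max_score(n_agents=4, n_gems=4) == 9      # level 6 maximal score
assert time_limit(width=12, height=13) == 78     # level 6 time limit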
4.1 Baseline results Figure 3 shows the mean score and exit rate over the course of training on level 6 (Figure 1). VDN performs best on this map. That being said, none of the algorithms ever reaches the highest possible score of 9, and at most half of the agents ever reach the exit tiles.
id: 2df90481a0a341981da4aa613d4a3776 - page: 9
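Since VDN serves as the reference algorithm throughout this section, a short reminder of its factorisation may help: VDN learns one utility function per agent and sums them into a joint action value, so a single TD error on the shared team reward trains all agents at once. The sketch below illustrates that computation; the discount factor, example values and function names are illustrative, not taken from the paper.

def vdn_joint_q(per_agent_qs):
    # VDN's additive factorisation: Q_tot(s, a) = sum_i Q_i(s_i, a_i).
    return sum(per_agent_qs)

def vdn_td_error(team_reward, chosen_qs, next_max_qs, gamma=0.99, done=False):
    # One TD error on the shared team reward; gamma = 0.99 is an assumed value.
    target = team_reward + (0.0 if done else gamma * vdn_joint_q(next_max_qs))
    return target - vdn_joint_q(chosen_qs)

# Hypothetical example with the 4 agents of level 6
delta = vdn_td_error(team_reward=1.0,
                     chosen_qs=[0.2, 0.1, 0.4, 0.3],
                     next_max_qs=[0.5, 0.2, 0.6, 0.4])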
Fig. 4: Training score and exit rate over training time for VDN, VDN with PER, VDN with RND and VDN with 3-step return on level 6. The maximal score that agents can reach in an episode of level 6 is 9. Results are averaged over 20 different seeds and shown with 95% confidence intervals. Looking into the results, the best policy learned only completes items i), iii) and v). Agents red and yellow escape the top half of the map, collect gems on that side and reach the exit, while agents green and blue are not waited for. This policy yields a score of 6 and an exit rate of 0.5. We could introduce reward shaping in order to drive the agents towards a better solution more easily. However, reward shaping is notoriously difficult to get right and can drive agents towards unexpected (and likely undesired) behaviours [Amodei et al., 2016].
id: 233f2b7be4a60ad28003b64fd002d4d3 - page: 10
4.2 Results with Q-learning extensions When a policy is not successful enough, there are a few common approaches to try to improve its learning. We take VDN as our baseline since it provides the best results in our experiments, combine it with Prioritised Experience Replay [Schaul et al., 2016, PER], n-step return [Watkins, 1989] and intrinsic curiosity [Schmidhuber, 1991], and analyse their impact on the learning process. Prioritised Experience Replay is a technique used in off-policy reinforcement learning to enhance learning efficiency by prioritising experiences and sampling them based on their informativeness. The intuition is to sample more often those past experiences whose Q-values are poorly estimated, in the hope that when agents discover a better policy than their current one, the corresponding experiences are prioritised. In our setting, we hope that if agents ever complete the level, this experience will be prioritised.
id: 194624487fc1e1708920b0390279f76b - page: 10
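For intuition, here is a minimal sketch of proportional prioritised replay in the spirit of Schaul et al. [2016]; the hyperparameters (alpha, beta, eps) and the class interface are illustrative and not taken from the paper or from any particular library.

import random

class PrioritisedReplay:
    """Minimal proportional prioritised replay sketch (illustrative only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities skew the sampling distribution
        self.eps = eps          # keeps every priority strictly positive
        self.buffer = []
        self.priorities = []

    def add(self, transition):
        # New transitions get the current maximum priority so they are
        # replayed at least once before their TD error is known.
        priority = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        # P(i) is proportional to p_i^alpha; importance weights correct the induced bias.
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        indices = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        n = len(self.buffer)
        weights = [(n * probs[i]) ** (-beta) for i in indices]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return indices, [self.buffer[i] for i in indices], weights

    def update_priorities(self, indices, td_errors):
        # Transitions with poorly estimated Q-values (large |TD error|) get sampled more often.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps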
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "wVvql3TqsrDg8vs_uF6vSzi39YXtYFOD8SMB0nXxu1o", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "wVvql3TqsrDg8vs_uF6vSzi39YXtYFOD8SMB0nXxu1o", "level": 2}'