Created at 11am, Mar 4
Ms-RAG · Artificial Intelligence
Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents
Contract ID: vtvcJpDmEhweWXGX--eSHr-IxnJwDxohgt0svKRtWOo
File Type: PDF
Entry Count: 58
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Dominik Jeurissen, Diego Perez-Liebana, Jeremy Gow
Queen Mary University of London
{d.jeurissen, diego.perez, jeremy.gow}@qmul.ac.uk

Duygu Çakmak, James Kwan
Creative Assembly
{duygu.cakmak, james.kwan}@creative-assembly.com

Abstract: Large Language Models (LLMs) have shown great success as high-level planners for zero-shot game-playing agents. However, these agents are primarily evaluated on Minecraft, where long-term planning is relatively straightforward. In contrast, agents tested in dynamic robot environments face limitations due to simplistic environments with only a few objects and interactions. To fill this gap in the literature, we present NetPlay, the first LLM-powered zero-shot agent for the challenging roguelike NetHack. NetHack is a particularly challenging environment due to its diverse set of items and monsters, complex interactions, and many ways to die. NetPlay uses an architecture designed for dynamic robot environments, modified for NetHack. Like previous approaches, it prompts the LLM to choose from predefined skills and tracks past interactions to enhance decision-making. Given NetHack's unpredictable nature, NetPlay detects important game events to interrupt running skills, enabling it to react to unforeseen circumstances. While NetPlay demonstrates considerable flexibility and proficiency in interacting with NetHack's mechanics, it struggles with ambiguous task descriptions and a lack of explicit feedback. Our findings demonstrate that NetPlay performs best with detailed context information, indicating the necessity for dynamic methods of supplying context information for complex games such as NetHack.

A skill continues to run until completion or interruption. Skills are interrupted when specific events occur, such as changing the level, teleporting, discovering new objects, or reaching low health. In addition to events, many skills are interrupted when a menu shows up, because they cannot handle menus. Regardless of why a skill stopped, the agent then prompts the LLM to select the next skill (sketched below). The sole exception is when the finish task skill is selected or the game has ended, at which point the agent stops until it receives a new task.

D. Handcrafted Agent

To assess the impact of the LLM in contrast to the predefined skills, we implemented a handcrafted agent that aims to replicate the behavior of NetPlay with the task set to Win the Game. The following list shows a breakdown of the agent's decision-making process.
id: e494a19452ea3ef0a9cb163ece9aecae - page: 5
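The skill loop described in this chunk can be summarized in a short Python sketch. This is illustrative only, not the authors' implementation; the Agent/Skill interfaces, event names, and helper attributes are hypothetical placeholders.

# Illustrative sketch of the skill loop described above; not the authors' code.
# The Agent/Skill interfaces and event names are hypothetical placeholders.
from dataclasses import dataclass

INTERRUPT_EVENTS = {"level_changed", "teleported", "new_object_discovered", "low_health"}

@dataclass
class Outcome:
    finished: bool = False
    game_over: bool = False
    menu_opened: bool = False
    event: str | None = None

def run_agent(agent, task):
    skill = agent.choose_skill(task)                 # LLM selects from predefined skills
    while True:
        outcome: Outcome = skill.step()              # advance the running skill
        interrupted = outcome.menu_opened or outcome.event in INTERRUPT_EVENTS
        if not (outcome.finished or interrupted):
            continue                                 # skill keeps running
        if skill.name == "finish_task" or outcome.game_over:
            break                                    # stop until a new task arrives
        skill = agent.choose_skill(task)             # prompt the LLM for the next skill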
A. Setup

All of our experiments used OpenAI's GPT-4-Turbo (gpt-4-1106-preview) API as the LLM, with the temperature set to 0 and the response format set to JSON. Other models were not considered, as initial tests revealed that models like GPT-3.5 and a 70B-parameter instruct version of LLaMA 2 could not correctly utilize our skills. The agent's memory size was set to 500 tokens. The agent had access to most commands that interact with the game directly, except for some rarely relevant commands, like turning undead or using a monster's special ability. All control and system commands, like opening the help menu or hiding icons on the map, were excluded. We also implemented a time limit of 10 LLM calls, at which point the experiment would terminate if the in-game time did not advance.
id: d3eaab474eff778f32a6d20ea0360cc4 - page: 5
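The model settings above map onto a straightforward API call. The sketch below shows how such a call could look with the openai Python client; the prompt contents are placeholders, and the memory and time-limit handling described in the paper is omitted.

# Minimal sketch of the LLM call configuration described above,
# using the openai Python client (>= 1.0). Prompt contents are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",              # GPT-4-Turbo preview used in the paper
    temperature=0,                           # deterministic decoding
    response_format={"type": "json_object"}, # force JSON-formatted responses
    messages=[
        {"role": "system", "content": "You control a NetHack agent. Reply in JSON."},
        {"role": "user", "content": "<observation, memory, and available skills>"},
    ],
)
skill_choice = response.choices[0].message.content  # JSON string naming the chosen skill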
B. Full Runs

1) Abort any open menu, as we did not implement a way to navigate them.
2) If there are hostile monsters nearby, fight them.
3) If health is below 60%, try healing with potions or by praying.
4) Eat food from the inventory when hungry.
5) Pick up items, which in this case are potions and food.
6) If there is nothing left to explore, move to the next level if possible.
7) If there is nothing else to do, explore the level and try kicking open doors.

The conditions are evaluated in sequence; once a condition is met, the corresponding skill is executed (see the sketch below). The selected skill is interrupted in the same way as in NetPlay. Once a skill is interrupted, the agent chooses the next skill by again checking all conditions in order, starting from the first.
id: 7c80035afae8272b6d224a919774d2b8 - page: 5
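A minimal sketch of this priority-rule selection, assuming a hypothetical game-state object whose attributes mirror the conditions in the list above:

# Illustrative priority-rule loop for the handcrafted agent (names are hypothetical).
RULES = [
    (lambda s: s.menu_open,                 "abort_menu"),
    (lambda s: s.hostile_monsters_nearby,   "fight"),
    (lambda s: s.health_fraction < 0.60,    "heal_or_pray"),
    (lambda s: s.hungry,                    "eat_from_inventory"),
    (lambda s: s.useful_items_here,         "pick_up_items"),       # potions and food
    (lambda s: s.level_fully_explored,      "descend_stairs"),
    (lambda s: True,                        "explore_and_kick_doors"),
]

def choose_skill(state):
    # Conditions are checked in order; the first one that holds selects the skill.
    for condition, skill in RULES:
        if condition(state):
            return skill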
We started evaluating NetPlay by letting it play NetHack without any constraints, tasking it to win the game. We will refer to this agent as the unguided agent. Although the task was to play the entire game, the agent occasionally confused its own objectives with the assigned task, resulting in the agent marking the task as done too early. To address this issue, we disabled the finish task skill for this experiment. Due to budget limitations, we evaluated all agents using only the Valkyrie role, as most agents performed best with this class during the NetHack 2021 challenge. We conducted 20 runs with the unguided agent. Additionally, we performed 100 runs each with autoascend and the handcrafted agent for comparison. After evaluating the unguided agent, we carried out an additional 10 runs employing a guided agent that was informed about how to play better. A detailed description of
id: 6b5010c6a08261b33efed9e7b4af84a9 - page: 5
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "vtvcJpDmEhweWXGX--eSHr-IxnJwDxohgt0svKRtWOo", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "vtvcJpDmEhweWXGX--eSHr-IxnJwDxohgt0svKRtWOo", "level": 2}'