This research introduces a groundbreaking approach in artificial intelligence, combining deep learning and reinforcement learning to create an advanced AI agent. This agent, termed a deep Q-network (DQN), is designed to perform tasks that require human-level control, particularly in complex, unpredictable environments.

Reinforcement learning is a type of machine learning in which an agent learns to make decisions by performing actions and receiving feedback from those actions. It is similar to training a pet: the pet learns to perform tricks in exchange for treats. In the case of AI, the 'treat' is a positive signal, or reward, given when the AI makes a correct decision. However, applying this learning method to real-world situations is challenging because the AI must interpret vast amounts of data from its environment and learn from it, just as a human or animal would.

The DQN agent addresses this challenge by using deep neural networks, which are highly effective at processing and learning from large-scale, high-dimensional sensory data (such as images and sounds). This enables the AI to understand and interact with complex environments directly from raw sensory inputs, such as pixel data from images.

The most impressive aspect of this research is its application to playing Atari 2600 video games. The DQN was given only the game pixels (the images on the screen) and the game score as inputs. Despite this, it learned to play these games successfully, reaching performance levels comparable to those of professional human game testers. It achieved this across a diverse set of 49 games, using the same underlying algorithm and settings.

This achievement is significant because it showcases the DQN's ability to learn and adapt to a wide range of tasks, bridging the gap between understanding high-dimensional sensory data and performing complex actions. It marks a major step forward in developing AI that can operate in diverse, real-world environments, much as humans and animals do.
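To make the reward-feedback loop concrete, here is a minimal sketch of tabular Q-learning in Python. It is illustrative only: the environment interface (`env.reset()`, `env.step()`) and the hyperparameter values are assumptions, and the DQN described in the Methods below replaces the lookup table with a deep neural network plus additional machinery such as experience replay.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch: states must be hashable, `env` is hypothetical."""
    q = defaultdict(float)  # q[(state, action)] -> estimated long-term reward
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)  # reward is the "treat"
            # Move the estimate toward the reward plus the discounted value
            # of the best action available in the next state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```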
METHODS
Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites the Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
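A minimal sketch of this preprocessing in Python, assuming NumPy and OpenCV are available; the luminance weights and the resizing interpolation are assumptions, since the text does not specify them.

```python
import numpy as np
import cv2

LUMA = np.array([0.299, 0.587, 0.114])  # standard RGB-to-Y weights (assumed)

def preprocess(frame, prev_frame):
    """Encode one 210x160x3 RGB frame as an 84x84 luminance image."""
    merged = np.maximum(frame, prev_frame)           # max over two frames removes flicker
    luminance = (merged @ LUMA).astype(np.float32)   # extract the Y channel
    return cv2.resize(luminance, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

def phi(recent_frames):
    """Stack the m = 4 most recent preprocessed frames into the Q-network input.

    `recent_frames` is assumed to hold the last five raw RGB frames, so that
    each encoded frame can be max-pooled with its predecessor.
    """
    processed = [preprocess(f, p) for f, p in zip(recent_frames[1:], recent_frames[:-1])]
    return np.stack(processed[-4:], axis=-1)         # shape (84, 84, 4)
```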
Code availability. The source code can be accessed at deepmind.com/dqn for non-commercial uses only.
Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history-action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches24,26. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state.
The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network. The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
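A sketch of this network in PyTorch, following the layer sizes described above; weight initialization and the training loop are omitted, and the original implementation was not written in PyTorch.

```python
import torch
import torch.nn as nn

class DQNetwork(nn.Module):
    """Convolutional Q-network with one output unit per valid action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),  # 84x84x4 input -> 32 feature maps
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),                 # 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, n_actions),                  # linear output per action
        )

    def forward(self, x):
        # One forward pass yields the Q-value of every action for the input state.
        return self.head(self.features(x))

# A batch of states of shape (batch, 4, 84, 84) maps to (batch, n_actions) Q-values.
q_values = DQNetwork(n_actions=18)(torch.zeros(1, 4, 84, 84))
```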
Training details. We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods12,15. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which was then used to mark the end of an episode during training.
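The clipping described above amounts to keeping only the sign of each raw score change; a small helper makes this explicit.

```python
def clip_reward(raw_reward):
    """Map any raw score change to -1, 0 or +1, as described above, so the
    error derivatives stay on a similar scale across games."""
    return float((raw_reward > 0) - (raw_reward < 0))
```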