Original Paper: Playing Atari with Deep Reinforcement Learning
What "Q-learning" is: for true end-to-end learning, ideally we want the robot (=neural network) to simply pick the best action at a given time in the game. But instead of directly choosing the act...ion, you let the network assign “values” to each of the 18 possible joystick actions. So put simply, the value V for any action A represents the robot’s expectation of the future reward it will get if it performs that action A. So Q-learning is about training this "value" network.
I think it's worth discussing "high-dimensional sensory inputs" and "hand-crafted features" in this context. Essentially, instead of feeding the deep learning model carefully processed inputs, they directly input a raw image of the screen, and let a convolutional neural network learn what's happening on the screen and make decisions. The authors seem to suggest that the whole purpose of this paper is not to see whether reinforcement learning works, but whether it can work end-to-end (from raw inputs). I'd say this paper successfully accomplishes that!
Since the whole point of the paper is to prove that deep learning works for RL, it makes sense to outline why it's a challenge in the first place. So in this paragraph they're basically doing that.
I'd like to bring attention to the last sentence, which talks about the fixed underlying distribution. As the model learns to play the game, it will learn new strategies/behaviours, so obviously the kind of data it produces after 1 hour is very different from the previous hour of gameplay. This is what they mean by "the data distribution keeps changing". When they say that some DL methods *assume* a fixed underlying distribution, they mean that standard supervised training treats samples as if they were drawn independently from one fixed distribution; here the distribution shifts as the policy improves, so the model may keep "forgetting" what it learned from old gameplay samples and repeat old mistakes. This is exactly what the "experience replay" they mention in the next paragraph is meant to smooth over.
It's obvious that with supervised learning there is lots of labeled data to teach the model, but it's less obvious why it should be easier for unsupervised learning as well. In unsupervised learning, even though there's no labeled data, the cost function (analogous to the "reward signal") is still dense and computed directly from every data sample, so it's easy to train on. But in RL, the sparsity, noise, delay etc. come into the picture as constraints that make it even harder.
Independence of data samples: in DL, the input data samples are usually assumed to be unrelated to each other. For example, in an image recognition network, the training data will have huge numbers of randomly organized and unrelated images. But while learning how to play a *game*, the strategy of moves usually doesn’t depend just on the current state of the screen but also on some previous states and moves. In a video game, one frame of the screen is definitely related to the next frame. If it takes 10 frames for a laser beam to destroy your spaceship, the 9th frame is a pretty good indication that the 10th frame is going to be painful! You don’t want to treat two frames just a few milliseconds apart as totally separate experiences while learning, because they obviously carry valuable information about each other. They’re part of the same experience — that of a laser beam striking your spaceship. Essentially, you want to be able to learn from this correlation instead of ignoring it.
Just saying how they set up the system to be fair, and reassuring us that the AI agents didn't "cheat". :) This means no extra information besides what is shown on the screen! They do tweak the reward signal a little for ease of training, as they show in their methods section. That's very fair though.
The interesting thing to note here is that they used the same neural network architecture and hyperparameters for all games. I read the full paper and they don't mention in their methods section *how* they arrived at the architecture they used or how they tuned the hyperparameters. I would have loved to see this as a useful insight for other researchers.
Sharing some terminology etc. about how they set up the training environment. 'E', referring to the environment, is obviously stochastic from the perspective of the player, as games are unpredictable!
They show some notation and underline some assumptions, like the environment being stochastic, etc. The reward is in the form of the CHANGE in game score (as opposed to the end result of victory/loss, which applies to non-video games like Chess and Go??). I want to point out that this is a classic case of very delayed feedback: since the frame rate is 60 Hz, there may be no change in reward for many data samples even though the state on the screen keeps changing.
This is an important note about the algorithm they use. "Perceptual aliasing" means that two different states/places can be perceived as the same. For example, in a building, it is nearly impossible to determine your location solely from visual information, because all the corridors may look the same. This matters here: in video games the state of the game doesn’t change much from one frame to the next, nor is a player capable of making a decision every millisecond. So if you treat each 60 Hz frame as a separate state, most of the training data samples will look exactly the same! It’s better to keep a longer horizon for what a “state” looks like. Multiple consecutive frames also contain valuable information about *movement*, so the sequence of frames obviously carries more information. Moreover, when you move the joystick it usually stays in the same position for several frames, so that is incorporated into the state too: the same action is continued in each of the frames. Each sequence (which includes several frames and the action taken between them) is an individual state, and treating these sequences as states means the problem still meets the criteria of a Markov Decision Process (MDP).
For any beginners, I should give a quick overview of how "Q-learning" works. You already know that it means learning how "valuable" a state is. Imagine a table with a row for every possible state (s) of the game and a column for every possible joystick move (a): each cell in the table represents the maximum total *future value* possible if you take that particular action and *play your best* from then on. This means you now have a “cheat sheet” of what to expect from any action at any state. The value in each cell is called the Q-star value, Q*(s,a). For any state s, if you take action a, the maximum total future value is Q*(s,a) as seen in that table. In the last line, pi (π) is the ‘policy’. The policy is simply the strategy about *what action to pick* when you are in a particular state. Note that here we're not training the policy directly; we're just learning the value function and will derive the policy from it.
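To make the "cheat sheet" idea concrete, here's a toy sketch in Python. The states, actions and values below are completely made up just for illustration; a real Atari game has far too many screen states to ever write such a table down, which is exactly the problem the paper tackles later.

```python
# Toy Q* "cheat sheet": a couple of made-up states, 3 of the 18 joystick
# actions, and made-up values -- purely to illustrate the lookup idea.
Q = {
    "ball_approaching_left":  {"LEFT": 5.0, "RIGHT": 1.0, "NOOP": 2.0},
    "ball_approaching_right": {"LEFT": 1.0, "RIGHT": 5.0, "NOOP": 2.0},
}

def greedy_policy(state):
    """The policy pi derived from the table: pick the action whose Q-value
    (expected total future reward) is highest."""
    return max(Q[state], key=Q[state].get)

print(greedy_policy("ball_approaching_left"))   # -> "LEFT"
```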
Let's dive into the Bellman equation here (this is not a full primer, so you should learn more about it if you haven't already). Say you are in a state S1. You look up the Q* value for all possible actions in the table (explained above), and choose A1 because its Q value is the highest. You get an immediate reward R1, and the game moves into a different state S2. For S2, the table says the maximum future value comes from taking (say) action A2. Now, the initial Q value Q*(S1,A1) is the max value you could get if you played optimally from then on, right? This means that Q*(S1,A1) should be equal to the sum of the reward R1 AND the max future value of the next state, Q*(S2,A2)! Does this make sense? But we want to reduce the influence of the next state, so we multiply it by a number gamma between 0 and 1. This is called discounting Q*(S2,A2). Therefore Q*(S1,A1) = R1 + [gamma x Q*(S2,A2)]
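If you'd like to see that relation as an actual update rule, here's a minimal tabular Q-learning sketch. The learning rate alpha is an extra knob used to turn the Bellman relation into an incremental update; it isn't part of the equation itself, and the state/action names are hypothetical.

```python
import collections

GAMMA = 0.99   # the discount factor "gamma" from the equation above
ALPHA = 0.1    # learning rate for the incremental update (not in the equation)

# Q[(state, action)] -> current estimate of Q*(state, action), default 0
Q = collections.defaultdict(float)

def q_learning_update(s1, a1, r1, s2, actions):
    """One step of tabular Q-learning: nudge Q(S1, A1) toward the
    Bellman target R1 + gamma * max_a Q(S2, a)."""
    target = r1 + GAMMA * max(Q[(s2, a)] for a in actions)
    Q[(s1, a1)] += ALPHA * (target - Q[(s1, a1)])

# e.g. after observing S1 -> (action "LEFT", reward 1.0) -> S2:
q_learning_update("S1", "LEFT", 1.0, "S2", actions=["LEFT", "RIGHT", "NOOP"])
```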
As mentioned before by Aman, the strategy/policy is simply to choose the best action suggested by the Q* value.
So, let me remind you that we’ve been assuming that for any state, and for any future action, we *know* the optimal value function already, and can use it to pick the best action at the current state (because by iterating over all the possible Q values we can literally look ahead into the future). But such a Q-function doesn’t really exist in the real world! The best we can do is *approximate* the Q function with another function, and update that approximate function little by little by testing it in the real world again and again. This approximate function can be a simple linear function, but we can also use non-linear functions. So we choose a neural network to be our “approximate Q function.”
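As a rough illustration of "approximating Q with a function", here's a sketch that uses a simple linear function of some hand-made state features. The feature size, learning rate and so on are made up, and the paper of course uses a convolutional network rather than a linear model; this is just to show what "update the approximate function little by little" looks like.

```python
import numpy as np

N_FEATURES, N_ACTIONS = 8, 18   # made-up feature size; 18 joystick actions
GAMMA, LR = 0.99, 1e-3

# theta parameterizes the approximate Q function: Q(s) = features(s) @ theta,
# with one column of weights per joystick action.
theta = np.zeros((N_FEATURES, N_ACTIONS))

def q_values(features):
    return features @ theta                      # shape: (N_ACTIONS,)

def td_update(feat_s1, a1, r1, feat_s2):
    """Nudge theta so Q(S1, A1) moves toward the Bellman target
    R1 + gamma * max_a Q(S2, a): a gradient step on the squared error."""
    target = r1 + GAMMA * np.max(q_values(feat_s2))
    error = target - q_values(feat_s1)[a1]
    theta[:, a1] += LR * error * feat_s1         # gradient of the squared error
```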
This paragraph puts the decisions so far into mathematical notation, and shows how a deep neural network is brought into the mix. Now you see why you’re reading this paper in the first place — DeepMind uses a neural network to approximate a Q function. Hah!
I'll explain what "e-greedy" means here. You’re at state s and have several actions to choose from. We have an approximate Q value function, so we calculate what the Q value will be for each of those actions. Now, while choosing the action, you have two options. The “greedy” option is to simply choose the action that has the highest Q value. You always pick the action which seems best to you *right now*, given your current approximation of the Q function — which means, given your current strategy. But there lies the problem — when you start out, you don’t really have an accurate value approximation. And you want your AI to check out other possible strategies and see where they lead. This is why a “greedy” strategy isn’t always the best when you’re learning. While learning, you don’t want to just keep trying what you believe will work — you want to try other things which seem less likely, just so you can get experience. This is the classic exploration vs. exploitation tradeoff (and it's also why Q-learning is called off-policy: the agent learns about the greedy policy while actually behaving more randomly than that). Here they vary the 'greed' based on how good the model is. They pick the greedy action with a probability of (1-e), where e is a variable that represents how random the choice is (the ε in "ε-greedy"). So e=1 means the choice is completely random, and e=0 means we always pick the greedy action. At first, when the network has just begun learning and the value function is not accurate enough to lean on, they pick e to be very close to 1, because they want the AI to explore as many strategies as possible. As time goes by and the AI learns more and more, they reduce the value of e towards 0 so that the AI sticks with a particular strategy.
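Here's roughly what that looks like in code: a small sketch of e-greedy action selection plus the linear annealing schedule the paper describes (e annealed from 1.0 down to 0.1 over the first million frames). The function names are mine, not the paper's.

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """q_values: one approximate Q-value per joystick action.
    With probability epsilon explore (pick a random action);
    otherwise exploit (pick the greedy action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def annealed_epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Lots of exploration early on, mostly greedy behaviour later."""
    frac = min(frame / anneal_frames, 1.0)
    return start + frac * (end - start)
```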
In model-based RL, the rules and physics of the game are defined in terms of a ‘Transition Matrix’ which gives the next state for a given current state and action, plus a separate ‘Reward Function’ which computes the reward for a given state and action. In a “model-free” approach, we assume that if you train the Q function well enough, it will automatically incorporate the rules. It is also off-policy: the agent learns about the greedy policy while the behaviour it actually follows during training (the ε-greedy exploration) is chosen by the researchers.
It is expected that using simpler linear function approximators has a better chance of convergence than using deep neural networks.
We all only see further by standing on the shoulders of giants, and I want to point out to readers that the DeepMind team was no exception! The innovation here, though, is to reduce the computational load of prior techniques and to make the approach end-to-end, starting directly from raw inputs. :)
Should mention here that RPROP (resilient backpropagation) is often faster than regular backpropagation. It also doesn't require you to specify free parameters like the learning rate and momentum. I'm not sure why they're talking about the computational cost of batch updates, given that this research project clearly involved plenty of compute (millions of frames of gameplay, end-to-end)? Edit: I guess the point is that batch methods recompute updates over the entire dataset in each iteration, so the cost grows with the amount of data, whereas stochastic minibatch updates have a low constant cost per step and scale to huge datasets.
More places where this paper shines over previous ones. Here the algorithm learns a strategy for seven Atari 2600 games without any adjustment of the architecture, while projects like HyperNEAT involved evolving a separate network for each game, which allowed those models to exploit flaws in the individual games. It sounds like the HyperNEAT paper would be a cool read!
The claim that "with enough data, deep learning can learn better features than hand-crafted features" is interesting but makes me feel uneasy. Maybe we don't need to learn such great features and, with some hand-crafting, can get the job done with less data? But I guess if your goal is to progress towards general artificial intelligence, you care more about the theoretical limits than a cost-benefit analysis.
These are the concrete advantages of using experience replay. Firstly, just like in regular deep learning, where each data sample can be reused multiple times to update the weights, we can use the same experience multiple times while training. This is a more efficient use of data.
The second and third advantages are very related. Because each state is very closely related to the next (as it is while playing a video game), training the weights on consecutive states will lead to the program only ever following one single way of playing the game. You predict a move (say, "move left") based on the Q function, you make that move, and then you update the weights so that next time you will again be likely to move left. By breaking this pattern and drawing randomly from past experiences, you can avoid these feedback loops.
It’s good to draw random samples from experience replay, but sometimes in a game there are important transitions that you would like the agent to learn from. This is a limitation of the current approach in the paper. A suggestion they give is to pick recent or more important transitions with greater probability when sampling from experience replay. Maybe there is a way to attach importance to samples that cause a large change in reward, but that would bring its own complications. It would be an interesting research project to improve experience replay, but one that's probably very hard to do!!!
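For the curious, here's a minimal sketch of what such a replay memory might look like, with plain uniform sampling as in the paper. The capacity and batch size are just representative numbers, and the class is my own illustration rather than DeepMind's code.

```python
import random
from collections import deque

class ReplayMemory:
    """Store the last N transitions and sample training minibatches at
    random, which breaks up the frame-to-frame correlations discussed above."""
    def __init__(self, capacity=1_000_000):
        self.memory = deque(maxlen=capacity)     # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling; the prioritised sampling suggested above would
        # instead weight "important" transitions more heavily.
        return random.sample(self.memory, batch_size)
```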
The state s is built from 4 consecutive frames, each preprocessed into grayscale and then resized and cropped to an 84x84 square. I think this is because, given that the game runs at 60 frames per second and humans can’t react fast enough to make a move on every single frame, it makes sense to consider 4 consecutive frames as belonging to the same state.
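A rough sketch of that preprocessing in plain numpy, with a crude resize purely for illustration (the paper's exact downsampling/cropping details differ a bit, but the idea of "grayscale, shrink to 84x84, stack 4 frames" is the same):

```python
import numpy as np

def preprocess_frame(rgb_frame):
    """Convert a raw 210x160 RGB Atari frame to grayscale and shrink it to
    84x84. (Crude nearest-neighbour resize, just to illustrate the shapes.)"""
    gray = rgb_frame.mean(axis=2)                        # (210, 160)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.float32)   # (84, 84)

def stack_frames(last_four_frames):
    """The agent's 'state': the last 4 preprocessed frames stacked together."""
    return np.stack([preprocess_frame(f) for f in last_four_frames])   # (4, 84, 84)

frames = [np.random.randint(0, 256, (210, 160, 3)) for _ in range(4)]
state = stack_frames(frames)          # shape: (4, 84, 84)
```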
Although they downsample the images, I'll still mostly give them credit for "end-to-end" learning, because I feel that even the human brain does pre-processing on images, doesn't it? :)
When designing the network architecture, you could make it a Q function which takes both the state S1 and an action A1 as input and outputs the Q-value for that combination. But this means that you’d have to run the network once for each of the 18 possible joystick actions at each step, and compare the outputs of all 18 runs. Instead, you can have an architecture which takes only S1 as input and has 18 outputs, each corresponding to the Q-value of a given joystick action. It’s much more efficient to compare the Q-values this way: one forward pass scores every action.
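Here's a sketch in PyTorch of that "one forward pass, 18 outputs" design, with layer sizes roughly following what the paper describes (two convolutional layers, a 256-unit hidden layer, one output per action). Treat it as an illustration rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    """Input: the stacked 4x84x84 state. Output: one Q-value per action,
    so a single forward pass scores all 18 joystick actions at once."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)    # -> 16 x 20 x 20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)   # -> 32 x 9 x 9
        self.fc1 = nn.Linear(32 * 9 * 9, 256)
        self.out = nn.Linear(256, n_actions)                      # one output per action

    def forward(self, x):                       # x: (batch, 4, 84, 84)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.out(x)                      # (batch, n_actions) Q-values

q_values = DQN()(torch.zeros(1, 4, 84, 84))     # one forward pass, 18 Q-values
```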
A figure / graphical description of the model architecture would have been very helpful here, including some documentation of the process by which they decided upon the exact specifics. But good job explaining the overall advantages over the architectures used in prior work.
A figure would have been helpful.
Very important!!! The nature of the rewards input to the agent was modified. So, *any* positive reward was input as +1, negative rewards were input as -1, and no change was input as 0. This is of course very different from how real games work — rewards vary in size, and some accomplishments have higher rewards than others. But it's impressive that in spite of this, the agent performed better than humans in some games!
So they're limiting all positive reward changes to +1 and all negative changes to -1, which is a noteworthy simplification! I'm sensing throughout the paper that their priority was to make the algorithm work for multiple games rather than to truly explore the potential of each individual game. I guess it makes sense, because later they came out with more research on general game-playing DRL algorithms like MuZero. :)
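The clipping itself is a one-liner; a tiny sketch of what's being done to the raw score changes:

```python
import numpy as np

def clip_reward(raw_score_change):
    """Any positive change in game score becomes +1, any negative change
    becomes -1, and no change stays 0."""
    return float(np.sign(raw_score_change))

print(clip_reward(250), clip_reward(-40), clip_reward(0))   # 1.0 -1.0 0.0
```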
The tradeoff between making the whole setup generalizable to multiple games vs. being able to play each game exceedingly well keeps forcing them to make small tweaks to achieve a delicate balance. See here that once again they make an exception for Space Invaders, using a shorter frame skip so that the blinking lasers remain visible.
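A minimal sketch of frame skipping, assuming a classic gym-style environment whose step() returns (obs, reward, done, info); that interface is my assumption for illustration, not something specified in the paper:

```python
def skip_step(env, action, k=4):
    """Pick an action only every k-th frame and repeat it in between.
    The paper uses k = 4 for every game except Space Invaders, where k = 3
    (at k = 4 the blinking lasers become invisible)."""
    total_reward, done = 0.0, False
    for _ in range(k):
        obs, reward, done, info = env.step(action)   # repeat the same action
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info
```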
Let's talk about the evaluation metric used while training; I think there are some valuable lessons here for ML practitioners. Usually in supervised learning you have something like validation accuracy, but here you don't have any validation dataset. So what else can we use to check whether our network is converging, or whether the weights are just dancing around? Hmmm, the purpose of this paper is to create an AI agent which gets a high score on the game, so why not just use the total score as our evaluation metric? We can play several games and take the average overall score. Well, it turns out that this metric doesn't work well *in practice*, as it happens to be very noisy. Let's think about some other metric. Well, another thing we're doing in this experiment is finding a 'policy' which the AI will follow to ensure the highest score (it's off-policy learning, as explained previously). And the Q-value at any particular moment represents the total reward expected by the AI in the future. So if the AI finds a great policy, the Q-values under that policy will be higher, right? Let's see if we can use the Q-value itself as our evaluation metric: collect a fixed set of states before training starts, and track the average of the maximum predicted Q-value over that held-out set. And voila, it turns out to be much more stable than the averaged total reward. Now, as you can see, there's no rigorous theoretical explanation for why one is smoother than the other; it was just an idea which happened to work. (Actually that's what happens in deep learning all the time. Some things just work, and other things which seem like common sense just don't. Another example of this is Dropout, which is a batshit crazy technique but works amazingly well.)
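A tiny sketch of that evaluation metric: collect a fixed set of held-out states once, then after each training epoch report the average maximum predicted Q over them. The q_net stub and the random "held-out states" below are made-up stand-ins, just to show the computation.

```python
import numpy as np

def q_net(states):
    """Stand-in for the trained Q-network: maps a batch of states to
    per-action Q-values. In the real setup this is the conv net."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(states.shape[0], 18))        # (batch, 18)

# 1. Before training, collect a fixed set of held-out states by running a
#    random policy (here: random tensors, purely for illustration).
held_out_states = np.random.rand(100, 4, 84, 84).astype(np.float32)

# 2. After each training epoch, report the average maximum predicted Q over
#    that fixed set -- the smoother training curve discussed above.
avg_max_q = q_net(held_out_states).max(axis=1).mean()
print(f"average max Q on held-out states: {avg_max_q:.3f}")
```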
Just for reference, the 100 on the x-axis represents approximately 50 hours (>2 days) of training time.
THIS IS SO COOL!!!
Good that they explain what's happening in the game, because I never played Atari! xD
This is an excellent insight into how the model 'thinks'! Thanks DeepMind for sharing such examples.
It would be helpful to look into the differences between SARSA and Q-learning (in short: SARSA is on-policy and updates toward the value of the action the agent actually takes next, while Q-learning is off-policy and updates toward the best possible next action): https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
Wow, these earlier projects literally treated each of the 128 colors as a separate channel. I agree with the authors that this greatly simplified the problem.
I smell some well-deserved bragging :)
For Pong, a random model achieves similar results to DQN! Isn't that incredible? I wonder what they would have to say about this.
For references and cited articles, please visit the original publication.