In a previous post we built a framework for running learning agents against PyGame. Now we'll try to build something in it that can learn to play Pong.
We will be aided in this quest by two trusty friends: TensorFlow, Google's recently released numerical computation library, and this paper on reinforcement learning for Atari games by DeepMind. I'm going to assume some knowledge of TensorFlow here; if you don't know much about it, it's quite similar to Theano, and here is a good starting point for learning.
- You will need Python 2 or 3 installed.
- You will need to install PyGame which can be obtained here.
- You will need to install TensorFlow which can be grabbed here.
- You will need PyGamePlayer which can be cloned from GitHub here.
The function Q* represents the abstract notion of the ideal Q function. In most complex cases it will be impossible to calculate that exactly, so we use a function approximator Q(s, a; θ). When a machine learning paper references a function approximator it is (almost always) talking about a neural net; in Q-learning these nets are often referred to as Q-nets. The θ symbol in the Q function represents the parameters (weights and biases) of our net. In order to train our network we will need a loss function, which is defined as:

L_i(θ_i) = E[(y_i − Q(s, a; θ_i))^2]

y_i here is the expected reward of the state, calculated using the parameters of our Q-net from iteration i−1: y_i = E[r + γ max_a' Q(s', a'; θ_{i−1})]. Here is an example of running a Q-function in TensorFlow. In this example we use the simplest state space possible: just an array of states, each with a reward, where the agent's actions are moving to adjacent states:
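To make the idea concrete, here is a minimal sketch of that setup as tabular Q-learning in plain Python rather than TensorFlow; the state count, learning rate and discount factor are illustrative, not the post's exact values:

```python
import random

# A toy 1D world: 5 states in a row, reward only in the last one.
# Actions: 0 = move left, 1 = move right.
NUM_STATES = 5
REWARDS = [0, 0, 0, 0, 1]
GAMMA = 0.9   # discount factor on future rewards
ALPHA = 0.5   # learning rate

q = [[0.0, 0.0] for _ in range(NUM_STATES)]

random.seed(0)
for _ in range(2000):
    state = random.randrange(NUM_STATES)
    action = random.randrange(2)
    next_state = max(0, state - 1) if action == 0 else min(NUM_STATES - 1, state + 1)
    # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
    target = REWARDS[next_state] + GAMMA * max(q[next_state])
    q[state][action] += ALPHA * (target - q[state][action])

# After training, moving right (towards the reward) should score higher
# than moving left in every state.
print([1 if right > left else 0 for left, right in q])
```

Even though the estimates start at zero, each update drags a state's value towards its neighbour's, so the reward at the far end gradually propagates back along the whole line.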
Setting up the agent in PyGamePlayer
If you run this you will see the player moving to the bottom of the screen as the Pong AI mercilessly destroys him. More intelligence is needed, so we will override the get_keys_pressed method to do some real work. As a first step, because the Pong screen is quite big and I'm guessing none of us has a supercomputer, let's compress the screen image so it's not quite so tough on our GPU.
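As a sketch of that compression step, assuming the frame arrives as a plain list of grayscale rows (real code would use an image library for speed, and the 80 × 80 target size is carried over from the discussion below):

```python
def downsample(screen, out_w=80, out_h=80, threshold=1):
    """Shrink a 2D grayscale screen to out_w x out_h and binarise it.

    screen: list of rows, each a list of pixel intensities (0-255).
    Returns rows of 0/1 values, via nearest-neighbour sampling plus a
    threshold. A plain-Python stand-in for a library resize call.
    """
    h, w = len(screen), len(screen[0])
    return [[1 if screen[y * h // out_h][x * w // out_w] > threshold else 0
             for x in range(out_w)]
            for y in range(out_h)]

# A fake 480x640 frame: black everywhere except a white "ball" region.
frame = [[0] * 640 for _ in range(480)]
for y in range(230, 250):
    for x in range(310, 330):
        frame[y][x] = 255

small = downsample(frame)
print(len(small), len(small[0]), sum(map(sum, small)))  # 80 80 9
```

The ball survives the shrink as a small cluster of white pixels, which is all the network needs to locate it.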
How do we apply Q-Learning to Pong?
We will want our input to be not just the current frame but the last few frames, say 4. 80 × 80 pixels is 6,400 per frame; times 4 frames that's 25,600 data points, and each can be in one of 2 states (black or white), meaning there are 2^25600 possible screen states. Slightly too many for any computer to reasonably deal with.
This is where the deep bit of deep Q-learning comes in. We will use deep convolutional nets (for a good write-up of these, try here) to compress that huge screen space into a smaller space of just 512 floats, and then learn our Q-function from that output.
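To put numbers on both claims, here is a quick back-of-the-envelope script; the conv strides of 4, 2 and 1 are assumptions borrowed from the DeepMind architecture, not necessarily this post's exact network:

```python
def conv_out(size, stride):
    # Output width of a stride-s convolution with TensorFlow "SAME"
    # padding: ceil(size / stride), written as negative floor division.
    return -(-size // stride)

# State-space size from the text: 80x80 binary pixels, 4 stacked frames.
pixels_per_state = 80 * 80 * 4
print(pixels_per_state)                 # 25600
print(len(str(2 ** pixels_per_state)))  # digit count of 2**25600

# How three conv layers (assumed strides 4, 2, 1) shrink the 80x80 plane
# before everything is flattened into the 512-unit fully connected layer:
size = 80
for stride in (4, 2, 1):
    size = conv_out(size, stride)
print(size)  # 10
```

2^25600 is a number thousands of digits long, while the conv stack reduces each 80 × 80 plane to a 10 × 10 feature map, which is what makes learning tractable.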
So first let's create our convolutional network with TensorFlow:
Now we will use the exact same technique we used for the simple Q-Learning example above, but this time the state will be a collection of the last 4 frames of the game and there will be 3 possible actions.
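A sketch of how those last 4 frames can be maintained as the state, using a bounded deque; the helper name update_state is illustrative, not the post's gist:

```python
from collections import deque

STATE_FRAMES = 4
ACTIONS = ["up", "nothing", "down"]  # the 3 possible Pong actions

frames = deque(maxlen=STATE_FRAMES)  # oldest frame drops off automatically

def update_state(new_frame):
    """Push the newest (downsampled) frame; the state is the last 4 frames."""
    frames.append(new_frame)
    return list(frames)

# Using small integers to stand in for 80x80 frames:
for t in range(6):
    state = update_state(t)
print(state)  # [2, 3, 4, 5]
```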
This is how you train the network:
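The heart of the training step is computing the target y for each sampled transition; a plain-Python sketch of just that part (the transition-tuple layout and the constant name are assumptions, not the post's gist):

```python
FUTURE_REWARD_DISCOUNT = 0.99  # gamma in the loss function above

def training_targets(mini_batch, predicted_next_q):
    """y for each transition: just the reward if the game ended there,
    otherwise reward + discounted best predicted future value."""
    targets = []
    for (state, action, reward, next_state, terminal), next_q in zip(
            mini_batch, predicted_next_q):
        if terminal:
            targets.append(reward)
        else:
            targets.append(reward + FUTURE_REWARD_DISCOUNT * max(next_q))
    return targets

# Two toy transitions: one terminal, one not.
batch = [((0,), 0, 1.0, (1,), True), ((1,), 1, 0.0, (2,), False)]
print(training_targets(batch, [[0.0, 0.0], [2.0, 1.0]]))  # [1.0, 1.98]
```

The network is then fitted so that Q(s, a) for the taken action moves towards these targets, which is exactly the squared-error loss given earlier.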
And getting the chosen action looks like this:
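In outline, action selection is epsilon-greedy over the network's output; a minimal sketch in which the function and argument names are illustrative:

```python
import random

def choose_action(q_values, probability_of_random_action):
    """Pick an action index from the network's Q-value outputs.

    With the given probability take a random action (exploration),
    otherwise the action with the highest predicted Q value.
    """
    if random.random() < probability_of_random_action:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# With exploration off, the best-scoring action is always chosen.
print(choose_action([0.1, 0.7, 0.2], probability_of_random_action=0.0))  # 1
```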
So get_keys_pressed needs to be changed to store these observations:
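A sketch of the observation store this implies: a bounded buffer of transitions sampled in random mini-batches for experience replay (the capacity and batch size here are illustrative constants):

```python
import random
from collections import deque

MEMORY_SIZE = 500000   # cap on stored transitions (assumed)
MINI_BATCH_SIZE = 100  # transitions per training step (assumed)

observations = deque(maxlen=MEMORY_SIZE)  # oldest entries fall off the front

def store_observation(state, action, reward, next_state, terminal):
    # One transition tuple per frame.
    observations.append((state, action, reward, next_state, terminal))

def sample_mini_batch():
    # Train on a random sample so consecutive, highly correlated frames
    # don't dominate a single update.
    return random.sample(observations, min(MINI_BATCH_SIZE, len(observations)))

for i in range(10):
    store_observation(i, 0, 0.0, i + 1, False)
print(len(observations), len(sample_mini_batch()))  # 10 10
```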
The normal training time for something like this, even with a good GPU, is on the order of days. But even if you were to train the current agent for days it would still perform pretty poorly. The reason is that if we start using the Q-function to determine our actions, it will initially be exploring the space with very poor weights. It is very likely to find some simple action that leads to a small improvement and get stuck in a local minimum doing that.
What we want is to delay using our weights until the agent has a good understanding of the space in which it exists. A good way to initially explore the space is to move randomly, then over time slowly mix in more and more moves chosen by the agent until eventually the agent is in full control.
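One simple way to implement that hand-over is to anneal the probability of a random action linearly over a fixed number of steps; a sketch with assumed constants:

```python
INITIAL_RANDOM_ACTION_PROB = 1.0  # start fully random
FINAL_RANDOM_ACTION_PROB = 0.05   # keep a little exploration forever
EXPLORE_STEPS = 500000            # frames over which to anneal (assumed)

def annealed_probability(step):
    """Linearly decay the chance of a random action as training proceeds."""
    if step >= EXPLORE_STEPS:
        return FINAL_RANDOM_ACTION_PROB
    span = INITIAL_RANDOM_ACTION_PROB - FINAL_RANDOM_ACTION_PROB
    return INITIAL_RANDOM_ACTION_PROB - span * step / EXPLORE_STEPS

print(annealed_probability(0))       # 1.0
print(annealed_probability(250000))  # 0.525
print(annealed_probability(10**9))   # 0.05
```

Keeping a small floor on the probability means the agent never stops exploring entirely, which guards against settling into a narrow policy.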
Add this to the get_keys_pressed method:
https://gist.github.com/DanielSlater/030e747a918abe8cff5f

And then make the choose_next_action method this:
https://gist.github.com/DanielSlater/82a8209652bc593695e1

And so now, huzzah, we have a Pong AI!
The PyGamePlayer project: https://github.com/DanielSlater/PyGamePlayer
The complete code for this example is here
Also I’ve now added the games mini-pong and half-pong which should be quicker to train against if you want to try out modifications.
And further, here is a video of a talk I gave on this subject:
15 thoughts on “Deep-Q learning Pong with Tensorflow and PyGame”
Does your method utilize difference frames as a preprocessing step to encourage motion detection, as mentioned in Karpathy's blog?
Source (search preprocessing): http://karpathy.github.io/2016/05/31/rl/
No, in this version we just use the last 4 frames as the state, not the difference. It could easily be modified to use the difference by changing the first few lines of the _train method. I don't know which method is better; it would probably take some experimentation, unless someone knows of a good research paper on this.
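For reference, the difference-frame preprocessing being discussed is just an element-wise subtraction of consecutive frames; a toy sketch, with the frame layout (lists of 0/1 rows) assumed:

```python
def difference_frame(current_frame, last_frame):
    """Subtract the previous binary frame from the current one, so the
    state encodes motion directly: +1 where a pixel turned on, -1 where
    one turned off, 0 where nothing changed."""
    return [[c - l for c, l in zip(cur_row, last_row)]
            for cur_row, last_row in zip(current_frame, last_frame)]

last = [[0, 1], [0, 0]]
cur = [[0, 0], [1, 0]]
print(difference_frame(cur, last))  # [[0, -1], [1, 0]]
```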
Thinking of porting this to a Raspberry Pi, to play Pong in the real world on an air hockey table.
I have a couple of questions, can you reach out to jaime.e.Espinosa@gmail.com
Always happy to help, if you want to email me there is a contact me button in the side bar up and over there ->
At the beginning of this video David (DeepMind, re: the DQN from the Nature paper) explains the use of the reward clipping method. Before the end of the video he describes a new scaling method which might benefit this post. It reminds me of your Net2Net work, as he mentions automatically altering layers on the fly to accomplish this.
Video starts at POI:
Hi Derek, thanks for sharing :). The batch normalization of rewards that David Silver describes does seem to lead to better learning. I might change my example here to use it in the future. Not sure exactly how it would work with Net2Net though?
Thanks for your nice post! Your first example of a 1D gridworld is concise and helpful. However, I am not very clear on how, after obtaining the best fitted values, we successively get an optimal policy from them. Or should we be able to get one during the process of value iteration?
Glad you like it 🙂 The idea is that over time we converge towards the optimal policy. Initially the policy is very sub-optimal, but as the estimation of one state improves, so that improves the estimation of the states next to it, and so on until convergence. Does that help explain it?
I see. I think I got it, thanks for your clear answer!
Being inspired by your examples, I attempted to run the code, but got impatient with the time needed to train the convolutional network to perform. So against a simple Pong game I used explicit paddle and ball X,Y coordinates as features, and it converges much faster [obviously it does not generalise to other games]. My example code is at:
Hi Jumper, thanks for sharing, glad I was able to help in some way 🙂
I don't know what is wrong with this…
2017-12-10 07:52:04.118494: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
Traceback (most recent call last):
File "/home/mrnivla/Documents/pong-ai-master/dqn/main.py", line 69, in
File "/home/mrnivla/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/mrnivla/Documents/pong-ai-master/dqn/main.py", line 61, in main
agent = Agent(config, env, sess)
File "/home/mrnivla/Documents/pong-ai-master/dqn/dqn/agent.py", line 27, in __init__
self.memory = ReplayMemory(self.config, self.model_dir)
File "/home/mrnivla/Documents/pong-ai-master/dqn/dqn/replay_memory.py", line 18, in __init__
self.screens = np.empty((self.memory_size, config.screen_height, config.screen_width), dtype = np.float16)