How to run learning agents against PyGame


PyGame is the best-supported library for programming games in Python, and thousands of open source games have been built with it.
I wanted to experiment with reinforcement learning neural networks in games, and PyGame seemed the best choice.
But I was a bit disappointed that most examples involved hacking the original game files.

I wanted to see if I could write a general framework for running learning agents in PyGame that would require zero changes to the game's files.
If you are interested in writing your own implementation then read on; if you just want to dive straight in and start playing with your own learning agents, I've uploaded a project here which contains examples of setting up Pong and Tetris.

Getting started

  • You will need Python 2 or 3 installed.
  • You will need to install PyGame, which can be obtained here. You can also install it via pip, but this can be tricky on Windows.
  • You will need numpy, downloadable from here, or again it can be installed via pip.
  • You will also need a game of some sort to try to set up. Here is a good, simple Pong game.

The plan

There are three things a learning agent needs to be able to play a game:
1. An array of all the pixels on the screen whenever it is updated.
2. The ability to output a set of key presses back into the game.
3. A feedback value to tell it when it is doing well/badly.
We will tackle them in order.

Grabbing the screen buffer

In PyGame there is a handy method which will give us a 3-dimensional numpy array containing the colour of every pixel on the screen.

    screen = pygame.surfarray.array3d(pygame.display.get_surface())

We could write our learning agent by hacking the game file and inserting this line into the main loop after the screen has been updated. But a better way to do it (and a way that allows us to have zero touches of the game file) is to intercept any calls to the pygame.display.update method and grab the buffer then, like so:

    import pygame
    import pygame.surfarray

    # function that we can give two functions to and will return us a new function that calls both
    def function_combine(screen_update_func, our_intercepting_func):
        def wrap(*args, **kwargs):
            # call the screen update func we intercepted so the screen buffer is updated
            screen_update_func(*args, **kwargs)
            # call our own function to get the screen buffer
            our_intercepting_func()
        return wrap

    def on_screen_update():
        surface_array = pygame.surfarray.array3d(pygame.display.get_surface())
        print("We got the screen array")
        print(surface_array)

    # set our on_screen_update function to always get called whenever the screen is updated
    pygame.display.update = function_combine(pygame.display.update, on_screen_update)
    # FYI the screen can also be modified via flip, so this might be needed for some games
    pygame.display.flip = function_combine(pygame.display.flip, on_screen_update)

You can try this out by inserting this code before you start your game, and it will print out the screen buffer as it comes in.

Intercepting key presses

The normal way of detecting key presses in PyGame is via this method:

    events = pygame.event.get()

So we can intercept it and have it return our learning agent's key presses:

    import pygame
    from pygame.constants import K_DOWN, KEYDOWN

    def function_intercept(intercepted_func, intercepting_func):
        def wrap(*args, **kwargs):
            # call the function we are intercepting and get its result
            real_results = intercepted_func(*args, **kwargs)
            # call our own function and return our new results
            new_results = intercepting_func(real_results, *args, **kwargs)
            return new_results
        return wrap

    def just_press_down_key(actual_events, *args, **kwargs):
        return [pygame.event.Event(KEYDOWN, {"key": K_DOWN})]

    pygame.event.get = function_intercept(pygame.event.get, just_press_down_key)

I should also warn you that the pygame.event.get method can be called with arguments to filter which events are returned. If you're running a game that uses these you will either need to handle them or just use my complete implementation here.
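If you do want to handle those filter arguments yourself, one possible approach (a rough sketch, not the version in the linked project) is to inspect whatever filter was passed through the wrapper and only substitute the agent's events when key events were actually requested. The exact form of the filter argument differs between pygame versions, so treat this as illustrative:

    import pygame
    from pygame.constants import KEYDOWN, K_DOWN

    def just_press_down_key(actual_events, *args, **kwargs):
        # pygame.event.get may have been called with a filter, e.g. pygame.event.get(KEYDOWN),
        # a list of event types, or (in newer pygame versions) an eventtype keyword argument
        requested = args[0] if args else kwargs.get("eventtype")
        wants_key_events = (requested is None
                            or requested == KEYDOWN
                            or (hasattr(requested, "__iter__") and KEYDOWN in requested))
        if wants_key_events:
            # only substitute our agent's key press when key events were actually asked for
            return [pygame.event.Event(KEYDOWN, {"key": K_DOWN})]
        # otherwise pass the real events straight through
        return actual_events

    # function_intercept is the helper defined in the snippet above
    pygame.event.get = function_intercept(pygame.event.get, just_press_down_key)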

Getting feedback to the player

The final piece of the puzzle is handling the feedback/reward from the game. Unfortunately there is no standard way of doing scoring in PyGame, so this will always require some amount of going through the game code, but it can still be done with zero touches.

For the pong game the scores are stored in two global variables, bar1_score and bar2_score, which can be imported. Our reward is when the score changes in our favor.

    last_bar1_score = last_bar2_score = 0

    def get_feedback():
        global last_bar1_score, last_bar2_score

        # import must be done inside the method because otherwise importing would cause the game to start playing
        from games.pong import bar1_score, bar2_score
        # get the difference in score between this and the last run
        score_change = (bar1_score - last_bar1_score) - (bar2_score - last_bar2_score)
        last_bar1_score = bar1_score
        last_bar2_score = bar2_score
        return score_change

But for other games, such as Tetris, there may not be a globally scoped score variable we can grab. There may instead be a method, or a set of methods, that we know are good/bad, such as player_takes_damage, level_up or kill_enemy. We can use our function_intercept code from before to grab these. Here is an example in Tetris using the result of removeCompleteLines to reward our agent:

    import tetris

    new_reward = 0.0

    def add_removed_lines_to_reward(lines_removed, *args, **kwargs):
        global new_reward
        new_reward += lines_removed
        return lines_removed

    tetris.removeCompleteLines = function_intercept(tetris.removeCompleteLines,
                                                    add_removed_lines_to_reward)

    def get_reward():
        global new_reward
        temp = new_reward
        new_reward = 0.0
        return temp

Dealing with frame rates

One final issue that you may need to consider is that the learning agent will significantly impact the execution speed of the game. In a lot of games the physics is scaled by the elapsed time since the last update. If your agent takes 1 second to process a single frame in Pong, then by the next update loop the ball will have already passed off the screen. The agent may also struggle to learn if there is significant variance in the movement between frames.
This can be handled by intercepting the pygame.time.get_ticks method and pygame.time.Clock in the same way as we have the other functions. See this file for details.
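The simplest way to picture the idea is to replace get_ticks with a counter that only advances by a fixed amount per frame, so the game always sees the same frame time no matter how long the agent takes. This is only a minimal sketch; PyGamePlayer itself does this more carefully, and pygame.time.Clock needs the same treatment:

    import pygame

    FORCE_GAME_FPS = 10      # how fast we pretend the game is running
    _fake_ticks_ms = [0]

    def fixed_get_ticks():
        # the game sees a clock that only advances when we say so,
        # no matter how long our agent spends thinking about a frame
        return _fake_ticks_ms[0]

    def advance_fake_clock():
        # call this once per frame, e.g. from the on_screen_update hook above
        _fake_ticks_ms[0] += int(1000 / FORCE_GAME_FPS)

    pygame.time.get_ticks = fixed_get_ticks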

Pong in PyGamePlayer

Now all that remains is to stitch all those parts together and plug in the learning agent. In my project I've chosen to do this in a class, but it would be fine as a script.
Below is an example of the full thing set up to learn against Pong using the PyGamePlayer module. The PongPlayer simply needs to inherit from the PyGamePlayer class and implement the get_keys_pressed and get_feedback methods; the framework handles everything else.

    from pygame.constants import K_DOWN
    from pygame_player import PyGamePlayer

    class PongPlayer(PyGamePlayer):
        def __init__(self):
            """
            Example class for playing Pong
            """
            super(PongPlayer, self).__init__(force_game_fps=10)  # we want to run at 10 frames per second
            self.last_bar1_score = 0.0
            self.last_bar2_score = 0.0

        def get_keys_pressed(self, screen_array, feedback):
            # The code for running the actual learning agent would go here, with the screen_array and feedback
            # as the inputs and an output for each key in the game, activating them if they are over some threshold.
            return [K_DOWN]

        def get_feedback(self):
            # import must be done here because otherwise importing would cause the game to start playing
            from games.pong import bar1_score, bar2_score

            # get the difference in score between this and the last run
            score_change = (bar1_score - self.last_bar1_score) - (bar2_score - self.last_bar2_score)
            self.last_bar1_score = bar1_score
            self.last_bar2_score = bar2_score
            return score_change

    if __name__ == '__main__':
        player = PongPlayer()
        player.start()
        # importing pong will start the game playing
        import games.pong

So huzzah! We now have the world's worst Pong AI.

In my next post I'll go through writing a good reinforcement learning agent for this.
If you have any questions/corrections please don't hesitate to contact me.

Full source code here.

Particle Swarm Optimization in Python

Here is a short and sweet particle swarm optimization implementation in Python. For more information on particle swarm optimization check out Particle swarm optimization in F#. A rough sketch of what such an implementation can look like is included at the end of this section.

Example usage is like so:

    def simple_error_function(args):
        return args[0] + args[1]

    number_of_parameters = 2
    max_iterations = 100
    best_parameters, best_error_score = particle_swarm_optimize(simple_error_function,
                                                                number_of_parameters, max_iterations)

The result will be the parameters that resulted in the lowest value of the simple_error_function.

You can add custom parameter initialization:

    import random

    def simple_error_function(args):
        return args[0] + args[1]

    def parameter_init():
        return random.random() * 0.5 - 10

    number_of_parameters = 2
    max_iterations = 100
    best_parameters, best_error_score = particle_swarm_optimize(simple_error_function, number_of_parameters,
                                                                max_iterations, parameter_init=parameter_init)

Other arguments

  • stopping_error: float = 0.001 # stop when we have a particle with this score
  • num_particles: int = 10 # number of particles to run
  • max_iterations_without_improvement: int = 10 # stop when we have this many consecutive turns without improvement
  • c1: float = 2.0 # c1 PSO parameter
  • c2: float = 2.0 # c2 PSO parameter
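For reference, here is a minimal sketch of a particle_swarm_optimize function that is consistent with the usage and arguments above. It is only an illustration of the overall shape; the real implementation may differ in details such as inertia weighting and velocity clamping:

    import random

    def particle_swarm_optimize(error_function, number_of_parameters, max_iterations,
                                stopping_error=0.001, num_particles=10,
                                max_iterations_without_improvement=10,
                                c1=2.0, c2=2.0, parameter_init=random.random):
        # initialise particle positions; velocities start at zero
        positions = [[parameter_init() for _ in range(number_of_parameters)]
                     for _ in range(num_particles)]
        velocities = [[0.0] * number_of_parameters for _ in range(num_particles)]

        personal_best = [list(p) for p in positions]
        personal_best_error = [error_function(p) for p in positions]

        best_index = min(range(num_particles), key=lambda i: personal_best_error[i])
        global_best = list(personal_best[best_index])
        global_best_error = personal_best_error[best_index]

        iterations_without_improvement = 0
        for _ in range(max_iterations):
            improved = False
            for i in range(num_particles):
                for d in range(number_of_parameters):
                    # standard PSO velocity update: pull towards personal best (c1) and global best (c2)
                    velocities[i][d] += (c1 * random.random() * (personal_best[i][d] - positions[i][d]) +
                                         c2 * random.random() * (global_best[d] - positions[i][d]))
                    positions[i][d] += velocities[i][d]

                error = error_function(positions[i])
                if error < personal_best_error[i]:
                    personal_best[i] = list(positions[i])
                    personal_best_error[i] = error
                if error < global_best_error:
                    global_best = list(positions[i])
                    global_best_error = error
                    improved = True

            if global_best_error <= stopping_error:
                break
            iterations_without_improvement = 0 if improved else iterations_without_improvement + 1
            if iterations_without_improvement >= max_iterations_without_improvement:
                break

        return global_best, global_best_error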

Quick summaries of research papers around dynamically generating network structure

I've been reading a lot of research papers on how the structure of ANNs can be generated/detected. Here are some quick summaries of interesting papers in that area.

Dynamic Node Creation in Backpropagation Networks

  • Looks at attempting to find a good number of hidden nodes for a network by starting small and growing to the correct number of hidden nodes.
  • Claims that the approach of training the network, then freezing existing nodes before adding new unfrozen nodes, does not work well.
  • Adding new nodes and then retraining the whole network appears to work better.
  • The technique (a rough sketch follows this list):
    • A one-hidden-layer network is created with a predefined number of hidden nodes.
    • The network is trained; when the error rate stops decreasing over a number of iterations, a new hidden node is added.
    • Nodes stop being added either when a certain precision is reached or when the error starts increasing on a validation set.
  • Results: Seems not to take significantly longer to train than fixed networks and tends to find near-optimal solutions.
  • Remaining questions they want to investigate are how big the node-adding window should be and where nodes should be added in multi-layer networks.
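In rough Python, assuming a hypothetical one-hidden-layer network class with train_epoch, add_hidden_node and error methods (not from any particular library), the growing loop described above might look something like this:

    def grow_network(network, training_data, validation_data, max_epochs=10000,
                     target_error=0.01, window=5, min_improvement=1e-4):
        """Train, and add a hidden node whenever training error plateaus (hypothetical API)."""
        best_validation_error = float("inf")
        recent_errors = []
        for _ in range(max_epochs):
            network.train_epoch(training_data)
            recent_errors.append(network.error(training_data))
            recent_errors = recent_errors[-window:]

            # stop when we hit the desired precision
            if recent_errors[-1] <= target_error:
                break

            # training error has plateaued over the window: consider growing
            if len(recent_errors) == window and recent_errors[0] - recent_errors[-1] < min_improvement:
                validation_error = network.error(validation_data)
                if validation_error > best_validation_error:
                    break  # stop growing once validation error starts increasing
                best_validation_error = validation_error
                network.add_hidden_node()
                recent_errors = []
        return network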

Optimal Brain Damage

  • After training a standard ANN we can potentially improve speed and generalization performance if we delete some of the nodes/connections.
  • Optimal Brain Damage deletes parameters (sets them to 0 and freezes them from future use) based on their impact on the error of the model (a sketch of the saliency calculation follows this list).
  • Results: Against the MNIST data it was shown that up to a fifth of parameters could be deleted with no degradation in the training performance and some improvement in generalization.
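The saliency measure at the heart of the method is simple to sketch: each parameter's saliency is approximated as ½·h_kk·w_k², using only the diagonal of the Hessian, and the lowest-saliency parameters are the ones deleted. A rough numpy sketch, assuming the Hessian diagonal is available from the training procedure:

    import numpy as np

    def prune_by_saliency(weights, hessian_diagonal, fraction_to_delete=0.2):
        """Zero out the lowest-saliency fraction of parameters, Optimal Brain Damage style."""
        # OBD saliency of parameter k: 0.5 * h_kk * w_k^2
        saliency = 0.5 * hessian_diagonal * weights ** 2
        threshold = np.percentile(saliency, fraction_to_delete * 100)
        pruned = weights.copy()
        pruned[saliency <= threshold] = 0.0  # deleted parameters are set to 0 (and should stay frozen)
        return pruned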

The Cascade Correlation Algorithms

  • A different approach to evolving structure for neural networks.
  • Starts with a network with just fully connected input and output nodes. These are trained using quickprop until the error rate stops decreasing.
  • Then multiple candidate nodes are created, connected to every node except the output nodes (eventually hidden nodes will be added), with different random weights.
  • These are then all trained to try to maximize the correlation with the output error.
  • The candidate with the best correlation is selected.
  • Repeat training and adding in candidate nodes until a stopping point.
  • Results: Appears to do very well, producing results for things like XOR and various other functions that are difficult for normal ANNs.
  • It has 2 problems:
    • Overfitting
    • Because the eventual network is a cascade of neurons (later neurons are really deep because they depend on every previous neuron), it runs a lot slower than a normal ANN.
  • Can also be extended to work on recurrent networks.

Cascade-Correlation Neural Networks: A Survey

  • Presents some approaches to reducing the problems with cascade correlation.
  • Sibling/descendant cascade attempts to reduce the problem of networks going too deep.
    • There are 2 pools of candidate neurons: siblings (only connected to neurons in previous layers) and descendants (same as for normal cascade).
    • Often siblings do get selected ahead of descendants, reducing the depth of the network.
  • Says Optimal Brain Damage is effective at pruning the network after the structure is discovered, which helps with overfitting.
  • Knowledge-based CCNNs are another approach, where candidate nodes can be normal functions or fully fledged ANNs of their own. This approach sounds like a lot of fun; I would love more info on how successful it is.
  • Rule-based CCNNs are similar to the above, but things like OR and AND operators are used. Again it sounds like a lot of fun; I would love to know how it performs.

Even more quick summaries of research papers

Following on from part 1 and part 2, here are even more quick summaries of research papers.

Generative NeuroEvolution for Deep Learning

  • Applies HyperNEAT to deep learning.
  • Uses the MNIST hand-drawn digits dataset; trains both a normal deep network and a convolutional network.
  • Results:
    • HyperNEAT on its own performs very badly.
    • Using HyperNEAT to generate a number of layers and then backprop on the final layer achieves 58.4% for normal ANNs.
    • Using HyperNEAT to generate a number of layers and then backprop on the final layer of a convolutional neural net achieves 92.1%.
  • This paper seems to miss that one of the advantages of HyperNEAT is the ability to scale to different sizes of input, so it would be nice to see how it performs when given images with different dimensions vs a traditional approach which has to do a standard Photoshop resize and then work off that image.
  • Also, I would love to see some details of exactly how the algorithms were implemented.
  • A big problem with HyperNEAT is that though it can express a potentially infinite number of nodes and connections, the number of hidden nodes must be decided in advance.
  • ES-HyperNEAT is an attempt to extend HyperNEAT to be able to determine how many hidden nodes it should have.
  • ES-HyperNEAT uses the connection weight values themselves to determine node placement: areas with more complex (higher-variance) weight patterns should be given more nodes.
  • There is an addition called Link Expression Output (LEO) for HyperNEAT, where whether or not there is a connection between 2 nodes is calculated not from the magnitude of the weights but from another evolved parameter. This also works with ES-HyperNEAT.
  • The algorithm is (a rough sketch follows this list):
    • When creating a connection between 2 nodes (e.g. an input and an output), rather than just calculating a single weight value we create 4 hidden nodes (quad-tree style) and calculate the weights for each.
    • If the variance of the weights is above some threshold we create all four nodes.
    • Otherwise we just have the single connection.
    • When doing the connections for the 4 sub-nodes we may do more subdivisions (up to some max).
  • Ran experiments against maze navigation, multi-modal problems and the retina problem.
    • Results: Outperformed HyperNEAT.
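A very rough sketch of the subdivision idea (not the full ES-HyperNEAT algorithm), assuming a hypothetical cppn_weight(x1, y1, x2, y2) function that queries the CPPN for a connection weight between two points on the substrate:

    import numpy as np

    def connect_region(source, centre, size, cppn_weight,
                       variance_threshold=0.03, depth=0, max_depth=3):
        """Decide how to connect `source` to the square region around `centre`."""
        x, y = centre
        quarter = size / 4.0
        # query the CPPN at the centres of the four quadrants (quad-tree style)
        quadrants = [(x - quarter, y - quarter), (x - quarter, y + quarter),
                     (x + quarter, y - quarter), (x + quarter, y + quarter)]
        weights = [cppn_weight(source[0], source[1], qx, qy) for qx, qy in quadrants]

        if np.var(weights) > variance_threshold and depth < max_depth:
            # complex weight pattern here: place nodes in all four quadrants, possibly subdividing further
            connections = []
            for quadrant in quadrants:
                connections += connect_region(source, quadrant, size / 2.0, cppn_weight,
                                              variance_threshold, depth + 1, max_depth)
            return connections

        # simple pattern: a single connection to the centre of this region is enough
        return [(source, centre, cppn_weight(source[0], source[1], x, y))]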

Deep Learning using Genetic Algorithms

  • Tests using GAs to encode and then decode features of an image.
  • Seems to have some success.
  • They encounter some problems that may be better solved using NEAT/HyperNEAT.

Novelty Search and the Problem with Objectives

  • Normally genetic algorithms are trained with an objective (fitness) function, and the agents that perform best against it are selected for.
  • This leads to problems with local minima and will often result in lots of similar behaviors with small tweaks being selected for, not allowing for the more complex sets of interactions required.
  • Novelty search instead adds a measure of the distinctness, novelty or difference of the result of a behavior to the fitness function.
  • Gives the example of teaching a robot to walk: a fitness function would start off just selecting the robot that fell furthest towards the goal, which would likely never lead to complex, interesting behavior.
  • But if you reward different kinds of falls, some may arise that balance for a bit in any direction; over time these can adapt to move towards the actual objective.
  • Recommends the average distance to the k nearest neighbors as a good measure of novelty (a sketch follows this list).
  • Novelty search shows very good results vs objective search in maze problems and bipedal walking, among others.
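The novelty measure itself is easy to sketch: the average distance from a behavior's descriptor vector to its k nearest neighbors among previously seen behaviors. A minimal numpy sketch (not from the paper's code):

    import numpy as np

    def novelty(behavior, archive, k=15):
        """Average distance from a behavior vector to its k nearest neighbors in the archive."""
        if not archive:
            return float("inf")  # the very first behavior is maximally novel
        distances = sorted(np.linalg.norm(np.asarray(behavior) - np.asarray(other))
                           for other in archive)
        return float(np.mean(distances[:k]))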

Unsupervised Feature Learning through Divergent Discriminative Feature Accumulation

  • Uses HyperNEAT with novelty search as the unsupervised layer in deep learning.
  • The number of hidden nodes is not set in advance; rather, successive features are learned based on their difference from previous features until a certain number is reached.
  • Only a 2-layer network is trained, learning 1500-3000 features.
  • Results: Against the MNIST data it does amazingly well; incredibly impressive.
  • Looks like this is a good alternative to restricted Boltzmann machines.
  • Would love to see how it might do with more layers / a convolutional architecture, and scaling to different image sizes.