Using Net2Net to speed up network training

When training neural networks there are 2 things that combine to make life frustrating:

  1. Neural networks can take an insane amount of time to train.
  2. How well a network is able to learn can be hugely affected by the choice of hyperparameters (here that mainly means the number of layers and the number of nodes per layer, but it can also include learning rate, activation functions, etc.), and without training a network in full you can only guess at which choices are better.
If a network could be trained quickly, number 2 wouldn’t really matter: we could just do a grid search (or even particle swarm optimization, or maybe Bayesian optimization) to run through lots of different possibilities and select the hyperparameters with the best results. But for something like reinforcement learning in computer games the training time is counted in days, so you had better hope your first guess was good…

My current research is around ways to get neural networks to adjust their size automatically, so that if there isn’t sufficient capacity in a network it will in some way determine this and resize itself. So far my success has been (very) limited, but while working on that I thought I would share this paper: Net2Net: Accelerating Learning via Knowledge Transfer, which has a good, simple approach to resizing networks manually while keeping their activations unchanged.

I have posted a numpy implementation of it here on GitHub.

Being able to manually resize a trained network can give big savings in training time, because when searching through hyperparameter options you can start off with a small, partially trained network and see how adding extra hidden nodes or layers affects test results.

Net2Net comprises two algorithms: Net2WiderNet, which adds nodes to a layer, and Net2DeeperNet, which adds a new layer. The code for Net2WiderNet in numpy looks like this:
https://gist.github.com/DanielSlater/edbecd61527aa4e833d947c4110c31b8
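If you just want the shape of the algorithm, here is a minimal numpy sketch of the idea (illustrative function and argument names, not the exact code from the gist):

```python
import numpy as np


def net_2_wider_net(weights, biases, next_weights, noise_std=0.01):
    """Widen a layer by one node while (approximately) preserving the
    network's output.

    weights:      (inputs, n) weights into the layer being widened
    biases:       (n,) biases of the layer being widened
    next_weights: (n, outputs) weights from that layer into the next one
    """
    n = weights.shape[1]
    # pick a random existing node and clone it
    index = np.random.randint(n)

    # the clone gets the same incoming weights and bias, plus a little noise
    # so the two copies can drift apart during later training
    new_in = weights[:, index] + np.random.normal(scale=noise_std,
                                                  size=weights.shape[0])
    wider_weights = np.hstack([weights, new_in[:, np.newaxis]])
    wider_biases = np.append(biases, biases[index])

    # halve the outgoing weights of the original node and give the clone the
    # same halved weights, so the input to the next layer is unchanged
    wider_next_weights = np.vstack([next_weights, next_weights[index]])
    wider_next_weights[index, :] *= 0.5
    wider_next_weights[-1, :] *= 0.5

    return wider_weights, wider_biases, wider_next_weights
```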

This creates the weights and biases for a layer one node wider than the existing one. To increase the size by more nodes, simply do this multiple times (note the finished library on GitHub has the parameter new_layer_size to set exactly how big you want it). The new node is a clone of a random node from the same layer. The original node and its copy then have their outputs to the next layer halved, so that the overall output from the network is unchanged.

How Net2WiderNet extends a layer with 2 hidden nodes to have 3

 

Unfortunately, if two nodes in the same layer have exactly the same parameters then their activations will always be identical, which means their back-propagated errors will always be identical, so they will update in the same way and their activations will stay the same: you gain nothing by adding the new node. To stop this happening, a small amount of noise is injected into the new node, which gives the two copies the potential to move further and further apart as training continues.

Net2DeeperNet is quite simple: it creates an identity layer, then adds a small amount of noise. This means the network’s activation is only unchanged if the new layer is a linear layer, because otherwise the activation function’s non-linearity will alter the output. So bear in mind that if you have an activation function on your new layer (and you almost certainly will) then the network output will be changed and will have worse performance until it has gone through some amount of training.
Here is the code:

https://gist.github.com/DanielSlater/75df407b8f422e2b2c3d60bc65aeae14
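Again as a rough numpy sketch (illustrative names, not the exact gist code), the core of Net2DeeperNet is just a noisy identity matrix:

```python
import numpy as np


def net_2_deeper_net(layer_size, noise_std=0.01):
    """Create weights and biases for a new layer to insert after an existing
    layer with `layer_size` nodes. The weights start as an identity matrix
    plus a little noise, so (with a linear activation) the new layer
    initially passes its input straight through."""
    weights = np.eye(layer_size) + np.random.normal(scale=noise_std,
                                                    size=(layer_size, layer_size))
    biases = np.zeros(layer_size)
    return weights, biases
```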

Usage in TensorFlow

This technique could be used in any neural network library/framework, but here is how you might use it in TensorFlow.

In this example we first train a minimal network, with 100 hidden nodes in each of the first and second layers, for 75 epochs. Then we do a grid search over different numbers of hidden nodes, training each for 50 epochs, to see which leads to the best test accuracy.

https://gist.github.com/DanielSlater/96a4ccd17c2853026de8ad85856f1cc0
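The gist has the full MNIST example; the rough shape of it, simplified to a single hidden layer and with the training loops omitted (TF1-style API, illustrative variable names), is:

```python
import tensorflow as tf

# net_2_wider_net is the numpy sketch from earlier in the post

# build the small network
x = tf.placeholder(tf.float32, [None, 784])
w1 = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1))
b1 = tf.Variable(tf.zeros([100]))
w2 = tf.Variable(tf.truncated_normal([100, 10], stddev=0.1))

hidden = tf.nn.relu(tf.matmul(x, w1) + b1)
logits = tf.matmul(hidden, w2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train the small network here ...

    # pull the trained values out of the graph and widen the hidden layer
    # in numpy, e.g. growing it from 100 to 150 nodes
    w1_val, b1_val, w2_val = sess.run([w1, b1, w2])
    for _ in range(50):
        w1_val, b1_val, w2_val = net_2_wider_net(w1_val, b1_val, w2_val)

# rebuild the graph with bigger variables initialised from the widened
# arrays, then continue training from where the small network left off
tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, 784])
w1 = tf.Variable(w1_val, dtype=tf.float32)
b1 = tf.Variable(b1_val, dtype=tf.float32)
w2 = tf.Variable(w2_val, dtype=tf.float32)
```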
Here are the final results for the different numbers of hidden nodes:

1st layer   2nd layer   Train accuracy   Test accuracy
100         100         99.04%           93.47%
150         100         99.29%           93.37%
150         150         99.01%           93.58%
200         100         99.31%           93.69%
200         150         98.99%           93.63%
200         200         99.17%           93.54%
Note: don’t take this as the best choice for MNIST; it could still be improved by longer training, dropout to stop overfitting, batch normalization, etc.

PyDataLondon 2016

Last week I gave a talk at PyDataLondon 2016, hosted at the Bloomberg offices in central London. If you don’t know anything about PyData, it is a community of Python data science enthusiasts that runs various meetups and conferences across the world. If you’re interested in that sort of thing and they are running something near you, I would highly recommend checking it out.

Below is the YouTube video for my talk, and this is the associated GitHub repository, which includes all the example code.

The complete collection of talks from the conference is here. The standard across the board was very high, but if you only have time to watch a few, here are some of those I saw that you might find interesting.

 

Vincent D Warmerdam – The Duct Tape of Heroes Bayesian statistics

Bayesian statistics is a fascinating subject with many applications. If you’re trying to understand deep learning, at a certain point research papers such as Auto-Encoding Variational Bayes and Auxiliary Deep Generative Models will stop making any kind of sense unless you have a good understanding of Bayesian statistics (and even if you do, it can still be a struggle). This video works as a good introduction to the subject. His blog is also quite good.

Geoffrey French & Calvin Giles – Deep learning tutorial – advanced techniques

This has a good overview of useful techniques, mostly around computer vision (though they could be applied in other areas), such as computing the saliency of inputs in determining a classification and getting good classifications when there is only limited labelled data.

Ricardo Pio Monti – Modelling a text corpus using Deep Boltzmann Machines in python

This gives a good explanation of how a Restricted/Deep Boltzmann Machine works and then shows an interesting application where a Deep Boltzmann Machine was used to cluster groups of research papers.

Mini-Pong and Half-Pong

I’m going to be giving a talk/tutorial at PyDataLondon 2016 on Friday the 6th of May. If you’re in London that weekend I would recommend going: there are going to be lots of interesting talks, and if you do go, please say hi.

My talk is going to be a hands-on, step-by-step guide to building a Pong-playing AI using Q-learning. Unfortunately, training the agents even for very simple games still takes ages, and I really wanted to have something training while I give the talk, so I’ve built two little games that I hope should train a bit faster.

Mini-Pong

This is a version of Pong with some of the visual noise stripped out: no on-screen score and no lines around the board. Also, when you start it you can pass args for the screen width and height, and the gameplay should scale with these. This means you can run it with an 80×80 screen (or even 40×40) and save having to downsize the image when processing.

Half-Pong

This is an even kinder game than Pong. There is only the player’s paddle, and you get points just for hitting the other side of the screen. I’ve found that if you fiddle with the parameters you can start to see reasonable performance in the game within an hour of training (results may vary, massively). That said, even after significant training the kinds of results I see are some way off how well Google DeepMind report doing. Possibly they are using other tricks not reported in the paper, or just lots of hyperparameter tuning, or there are still more bugs in my implementation (entirely possible; if anyone finds any, please submit).

I’ve also checked in some checkpoints of a trained Half-Pong player, if anyone just wants to quickly see it running. Simply run this from the examples directory.
It performs significantly better than random, though still looks pretty bad compared to a human.
Distance from building our future robot overlords, still significant.