Using Net2Net to speed up network training

When training neural networks there are 2 things that combine to make life frustrating:

  1. Neural networks can take an insane amount of time to train.
  2. How well a network is able to learn can be hugely affected by the choice of hyperparameters (here that mainly means the number of layers and the number of nodes per layer, but it can also include learning rate, activation functions, etc.), and without training a network in full you can only guess at which choices are better.
If a network could be trained quickly, number 2 wouldn't really matter: we could just do a grid search (or even particle swarm optimization, or maybe Bayesian optimization) to run through lots of different possibilities and select the hyperparameters with the best results. But for something like reinforcement learning in computer games the training time is counted in days, so you had better hope your first guess was good…

My current research is around ways to get neural networks to adjust their size automatically, so that if there isn't sufficient capacity in a network it will in some way determine this and resize itself. So far my success has been (very) limited, but while working on that I thought I would share this paper: Net2Net: Accelerating Learning via Knowledge Transfer, which has a good, simple approach to resizing networks manually while keeping their activations unchanged.

I have posted a numpy implementation of it here on Github.

Being able to manually resize a trained network can give big savings in network training time, because when searching through hyperparameter options you can start off with a small, partially trained network and see how adding extra hidden nodes or layers affects test results.

Net2Net comprises 2 algorithms: Net2WiderNet, which adds nodes to a layer, and Net2DeeperNet, which adds a new layer. The code for Net2WiderNet in numpy looks like this:

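A minimal sketch of the widening step, assuming dense layers whose weight matrices are stored as (inputs, outputs); the function name net2wider and the noise_std argument are illustrative here, and the full implementation (including the new_layer_size parameter mentioned below) is in the Github repo linked above:

```python
import numpy as np

def net2wider(weights1, bias1, weights2, noise_std=0.01):
    """Widen a hidden layer by one node, approximately preserving the network output.

    weights1: (n_in, n_hidden)  weights into the layer being widened
    bias1:    (n_hidden,)       biases of that layer
    weights2: (n_hidden, n_out) weights from that layer into the next one
    """
    n_hidden = weights1.shape[1]

    # clone a random existing node's incoming weights and bias,
    # plus a little noise so the two nodes can diverge during training
    idx = np.random.randint(n_hidden)
    new_in = weights1[:, idx] + np.random.normal(0.0, noise_std, size=weights1.shape[0])
    new_bias = bias1[idx]

    # the clone takes half of the original node's outgoing weights...
    new_out = weights2[idx, :] * 0.5

    wider_w1 = np.hstack([weights1, new_in[:, np.newaxis]])
    wider_b1 = np.append(bias1, new_bias)
    wider_w2 = np.vstack([weights2, new_out[np.newaxis, :]])
    # ...and the original node's outgoing weights are halved too,
    # so the input to the next layer is (almost) unchanged
    wider_w2[idx, :] *= 0.5

    return wider_w1, wider_b1, wider_w2
```
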
This creates the weights and biases for a layer 1 wider than the existing one. To increase the size by more nodes, simply do this multiple times (note the finished library on Github has the parameter new_layer_size to set exactly how big you want it). The new node is a clone of a random node from the same layer. The original node and its copy then have their outputs to the next layer halved so that the overall output of the network is unchanged.

How Net2WiderNet extends a layer with 2 hidden nodes to have 3


Unfortunately, if 2 nodes in the same layer have exactly the same parameters then their activations will always be identical, which means their back-propagated errors will always be identical, so they will update in the same way and their activations will stay the same; you gain nothing by adding the new node… To stop this happening a small amount of noise is injected into the new node, which gives the two nodes the potential to move further and further apart as they train.

Net2DeeperNet is quite simple: it creates an identity layer, then adds a small amount of noise. This means the network activation is only unchanged if the new layer is linear, because otherwise the activation function's non-linearity will alter the output. So bear in mind that if you have an activation function on your new layer (and you almost certainly will) the network output will be changed and will have worse performance until it has gone through some amount of training.
Here is the code:
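
A minimal sketch along the same lines (again the function name and noise_std argument are illustrative; the full version is in the Github repo):

```python
import numpy as np

def net2deeper(layer_size, noise_std=0.01):
    """Create weights/biases for a new layer that starts out as (roughly) the identity."""
    # identity weights plus a little noise; with a linear activation the
    # network's output is unchanged, with a non-linear one it is only close
    weights = np.eye(layer_size) + np.random.normal(0.0, noise_std, (layer_size, layer_size))
    bias = np.zeros(layer_size)
    return weights, bias
```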


Usage in TensorFlow

This technique could be used in any neural network library/framework, but here is how you might use it in TensorFlow.

In this example we first train a minimal network, with 100 hidden nodes in each of the first and second layers, for 75 epochs. Then we do a grid search over different numbers of hidden nodes, training for 50 epochs, to see which leads to the best test accuracy.
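
Roughly, the resizing step might look like this, assuming the small network's parameters live in TF variables w1, b1 and w2 and training happens inside a tf.Session called sess (these names, and the TF 1.x style API, are assumptions of this sketch rather than part of the example itself):

```python
import tensorflow as tf  # assuming the TF 1.x style API

# pull the trained parameters of the small network out as numpy arrays
w1_val, b1_val, w2_val = sess.run([w1, b1, w2])

# widen the first hidden layer with the numpy helper sketched earlier
wider_w1, wider_b1, wider_w2 = net2wider(w1_val, b1_val, w2_val)

# build the bigger network, initialising its new variables from the widened
# arrays, then continue training from where the small network left off
new_w1 = tf.Variable(wider_w1.astype('float32'))
new_b1 = tf.Variable(wider_b1.astype('float32'))
new_w2 = tf.Variable(wider_w2.astype('float32'))
```
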
Here are the final results for the different numbers of hidden nodes:

1st layer   2nd layer   Train accuracy   Test accuracy
100         100         99.04%           93.47%
150         100         99.29%           93.37%
150         150         99.01%           93.58%
200         100         99.31%           93.69%
200         150         98.99%           93.63%
200         200         99.17%           93.54%
Note: don’t take this as the best approach for MNIST; the results could still be improved by longer training, dropout to reduce overfitting, batch normalization, etc.

Quick summaries of research papers around dynamically generating network structure

I’ve been reading a lot of research papers on how the structure of an ANN can be generated/detected automatically. Here are some quick summaries of interesting papers in that area.

Dynamic Node creation in backpropagation networks

  • Looks at finding a good number of hidden nodes for a network by starting small and growing until the correct number of hidden nodes is reached
  • Claims that the approach of training the network then freezing existing nodes before adding new unfrozen nodes does not work well.
  • Adding new nodes then retraining the whole network appears to work better
  • The technique (sketched in code after this list):
    • A one hidden layer network is created with a predefined number of hidden nodes
    • The network is trained; when the error rate stops decreasing over a number of iterations, a new hidden node is added
    • Nodes stop being added either when a certain precision is reached or when the error starts increasing on a validation set
  • Results: Training doesn’t seem to take significantly longer than for fixed networks, and it tends to find near-optimal solutions
  • Remaining questions they want to investigate are how big the window for adding new nodes should be, and where nodes should be added in multi-layer networks
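
A rough sketch of that growing loop, using a hypothetical network object with train_epoch, evaluate and add_hidden_node methods (the thresholds and method names are illustrative, not from the paper):

```python
def grow_network(net, train_data, val_data, max_hidden=50, patience=5, target_error=0.01):
    """Keep training; add a hidden node whenever the training error plateaus."""
    best_val_error = float("inf")
    while net.num_hidden < max_hidden:
        errors = []
        while True:
            errors.append(net.train_epoch(train_data))  # returns training error
            # plateau: no improvement over the last `patience` epochs
            if len(errors) > patience and min(errors[-patience:]) >= min(errors[:-patience]):
                break
        val_error = net.evaluate(val_data)
        if errors[-1] < target_error or val_error > best_val_error:
            break  # precise enough, or validation error has started rising
        best_val_error = val_error
        net.add_hidden_node()  # then the whole network keeps training (no freezing)
    return net
```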

Optimal Brain Damage

  • After training a standard ANN we can potentially improve speed and generalization performance if we delete some of the nodes/connections
  • Optimal brain damage deletes parameters (sets them to 0 and freezes them from future use) based on their impact on the error of the model (sketched in code after this list).
  • Results: Against the MNIST data it was shown that up to a fifth of the parameters could be deleted with no degradation in training performance and some improvement in generalization.
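
A rough sketch of that pruning step, using the paper's diagonal-Hessian saliency of roughly 0.5 * h_kk * w_k^2 (the function name and the 20% default are assumptions of this sketch; obtaining the Hessian diagonal needs an extra backprop-like pass that isn't shown):

```python
import numpy as np

def obd_prune(weights, hessian_diag, fraction=0.2):
    """Zero out the lowest-saliency fraction of parameters.

    Saliency of parameter k is approximately 0.5 * h_kk * w_k**2, using only
    the diagonal of the Hessian, as in the Optimal Brain Damage paper.
    """
    w = weights.ravel()
    saliency = 0.5 * hessian_diag.ravel() * w ** 2
    n_delete = int(fraction * w.size)
    keep = np.ones(w.size, dtype=bool)
    keep[np.argsort(saliency)[:n_delete]] = False  # delete the least important
    pruned = np.where(keep, w, 0.0).reshape(weights.shape)
    # `keep` doubles as a mask for freezing the deleted parameters afterwards
    return pruned, keep.reshape(weights.shape)
```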

The Cascade Correlation Algorithms

  • A different approach to evolving structure for neural networks. 
  • Starts with a network with just fully connected input and output nodes. These are trained using quickprop until the error rate stops decreasing.
  • Then multiple candidate nodes are created, each with different random weights, connected to every node except the output nodes (as hidden nodes are added, candidates connect to those too)
  • Each candidate is then trained to maximize the correlation between its activation and the network’s residual output error (see the sketch after this list)
  • The candidate with the best correlation is selected
  • Repeat training and adding candidate nodes until a stopping point is reached
  • Results: Appears to do very well, producing good results for things like XOR and various other functions that are difficult for normal ANNs.
  • It has 2 problems
    • Overfitting
    • Because the eventual network is a cascade of neurons (later neurons are effectively very deep because they depend on every previous neuron), it runs a lot slower than a normal ANN
  • Can also be extended to work on recurrent networks
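
A rough sketch of the correlation score each candidate is trained to maximize, in the covariance form from the cascade-correlation paper (the function name is illustrative):

```python
import numpy as np

def candidate_correlation(candidate_values, residual_errors):
    """Cascade-correlation candidate score.

    candidate_values: (n_patterns,)           candidate activation over the training set
    residual_errors:  (n_patterns, n_outputs) current network error at each output unit
    Returns S = sum over outputs o of | sum over patterns p of (V_p - mean V)(E_p,o - mean E_o) |
    """
    v = candidate_values - candidate_values.mean()
    e = residual_errors - residual_errors.mean(axis=0)
    return np.abs(v @ e).sum()
```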

Cascade-Correlation Neural Networks: A Survey

  • Presents some approaches to reducing the problems with cascade correlation.
  • Sibling/descendant cascade attempts to reduce the problem of networks going too deep.
    • There are 2 pools of candidate neurons: sibling (only connected to neurons in previous layers) and descendant (same as for normal cascade).
    • Often siblings do get selected ahead of descendants, reducing the depth of the network
  • Says Optimal Brain Damage is effective at pruning the network after the structure is discovered, which helps with overfitting.
  • Knowledge-based CCNNs are another approach, where candidate nodes can be normal functions or fully fledged ANNs of their own. This approach sounds like a lot of fun, would love more info on how successful it is.
  • Rule-based CCNNs are similar to the above, but operators like OR and AND are used. Again it sounds like a lot of fun, would love to know how it performs.