DANIEL SLATER'S BLOG

Machine Learning, Network Structure, Neural Networks

Quick summaries of research papers around dynamically generating network structure

I’ve been reading a lot of research papers on how the structure of ANN can be generated/detected. Here are some quick summaries of interesting papers in that area.

Dynamic Node creation in backpropagation networks

Looks at attempting to find a good number of hidden nodes for a network by starting small and growing the correct number of hidden nodes
Claims that the approach of training the network then freezing existing nodes before adding new unfrozen nodes does not work well.
Adding new nodes then retraining the whole network appears to work better
The technique:

A one hidden layer network is created with predefined number of hidden nodes
The network is trained, when the error rate stops decreasing over a number of iterations a new hidden node is added
Nodes stop being added either when a certain precision is reached or when the error starts increasing on a validation set

Results: Seems to train in not a significantly longer amount of time than fixed networks and tends to find near optimal solutions
Remaining questions they want to investigate is how big the new node adding window should be and where nodes should be added in multi layer networks?

Optimal Brain Damage

After training a standard ANN we can potentially improve speed and generalization performance if we delete some of the nodes/connections
Optimal brain damage deletes parameters (sets them to 0 and freezes them from future use) based on there impact on the error of the model.
Results: Against the MNIST data it was shown that up to a 5th of parameters could be deleted with no degradation in the training performance and some improvement in the generalization.

The Cascade Correlation Algorithms

A different approach to evolving structure for neural networks.
Starts with a network with just fully connected input and output nodes. These are trained using quickprop until error rate stops decreasing.
Then multiple candidate nodes are created connected to every node except the output nodes(eventually hidden nodes will be added) with different random weights
These are then all trained to try and maximize the correlation with the output error.
The candidate with the best correlation is selected
Repeat training and adding in candidate nodes until stopping point
Results: Appears to do very well and producing results for things like xor and various other functions that are difficult for normal ANN’s to do.
It has 2 problems

Over fitting
Because the eventual network is a cascade of neurons(later neurons are really deep because they depend on every previous neuron) it runs a lot slower than a normal ANN

Can also be extended to work on recurrent networks

Cascade-Correlation Neural Networks: A Survey

Presents some approaches to reducing the problems with cascade correlation.
Sibling/descendant cascade attempts to reduce the problem with networks going too deep.

There are 2 pools of candidate neurons, sibling(only connected to neurons in previous layers) and descendant(same as for normal cascade).
Often siblings do get selected ahead of descendants, reducing the depth of the network

Says Optimal Brain Damage is effective at pruning the network after structure is discovered this helps with over fitting.
Knowledge based CCNN are another approach where candidates nodes can be normal functions or fully fledged ANN of there own. This approach sounds a lot of fun, would love more info on how successful it is.
Rule based CCNN, similar to the above but things like OR and AND operators are used. Again sounds alot of fun would love to know how it performs.

HyperNEAT, NEAT

Even more quick summaries of research papers

Following on from part 1 and part 2 here are even more quick summaries of research papers,

Generative NeuroEvolution for DeepLearning

Applies HyperNEAT to deep learning
Uses the MNIST hand-drawn digits dataset, trains both a normal deep network and a convolutional network
Results:

HyperNEAT on it’s own performs very badly.
Using HyperNEAT to generate a number of layers and then backprop on the final layer is achieves 58.4% for normal ANNs
Using HyperNEAT to generate a number of layers and then backprop on the final layer for a Convolution Neural net achieved performance of 92.1%

This paper seems to miss that one of the advantages of HyperNEAT is the ability to scale it to different sizes of input so it would be nice to see how it performans when given images with different dimensions vs a traditional approach which has to do a standard Photoshop resize and then work off that image.
Also would love to see some details of exactly how the algorithms were implemented

An Enhanced Hypercube-Based Encoding for Evolving thePlacement, Density and Connectivity of Neurons

A big problem with HyperNeat is that though it can express a potentially infinite number of nodes and connection, the numbers of hidden nodes must be decided in advance
ES-HyperNeat is an attempt to extend HyperNeat to be able to determine how many hidden nodes it should have.
ES-HyperNeat uses the connection weights values themselves to determine node placement, areas with more complex(higher variance) weight patterns should be given more nodes.
There is an addition called Link Expression Output (LEO) for HyperNEAT where weather or not there is a connection between 2 nodes node is not calculated from the magnitude of weights but from another evolved parameter. This also works with ES-HyperNEAT
Algorithm is:

When creating a connection between 2 nodes (e.g an input and an output) rather than just calculate a single weight value we create 4 hidden nodes(quad tree style) and calculate the weights for each
if the variance of the weight is above some threshold we create all for nodes
otherwise just have the single connection
when doing the connection for the 4 sub nodes we may do more subdivisions(up to some max)

Run experiments against maze navigation, multi model problems and retina problem

Results: Outperformed HyperNeat

Deep Learning using Genetic Algorithms

Tests using ga’s to encode and then decode features of an image
Seems to have some success
They encounter some problems that may be better solved using NEAT/HyperNEAT

Novelty Search and the Problem with Objectives

Normally Genetic algorithms are trained with an objective(fitness) function and those agents that perform best against it are the selected for
This leads to problems of local minima and will often result in lots of similar behaviors with small tweaks being selected for not allowing for the more complex sets of interactions required.
Novelty search instead adds a measure of the distinctness or novelty or difference in the result of a behavior to the fitness function.
Gives the example of teaching a robot to walk, a fitness function would start off with just selecting the robot that fell the furthest to the goal. Which would likely never lead to complex interesting behavior.
But if you reward different kinds of falls some may arise that balance for a bit in nay direction over time these can adapt to move to the actual objective
Recommends average distance to k nearest neighbors as a good measure of novelty.
Novelty search shows very good results vs objective search in maze problems and bi-pedal walking among others

Unsupervised Feature Learning through Divergent Discriminative Feature Accumulation

Uses HyperNEAT with novelty search as the unsupervised layer in deep learning
Number of hidden nodes not set in advance, rather successive features are learned based on there difference from prevous features until a certain number reached
Only 2 layer network trained, learning 1500->3000 features
Results: Against MNIST data does amazingly well, incredibly impressive
Looks like this is a good alternative to restricted boltzman machine
Would love to see how it might do with more layers/convolutional architecture, scaling to different image sizes

HyperNEAT, NEAT

Quick summaries of research papers around NEAT Part 2

Continuing on from my previous post here are some more quick summaries of research papers.

Evolving Reusable Neural Modules

Attempts to improve on NEAT by dividing the evolved nets into modules.
These modules are in themselves smaller neural nets with input, output and hidden nodes.
Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of modules to be used and mappings from the real inputs/outputs and the module input/outputs
Both blueprints and module are evolved and speciated.
The idea is having modules reduces the number of dimensions in the search space. It is somewhat analogous to the modules being chromosomes and blueprints being arrangements of chromosomes.
Experiment: NEAT and modular NEAT were run on a board game(very roughly like go).
Results: Modular NEAT was seen to evolve better solutions and for those better solutions to appear about 4 times faster. Though this level of improvement is possibly quite tied to the kind of task being learned.

Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles

Transfer learning is applying learning in one domain to a related but different domain, e.g. speech recognition on male voices, to speech recognition on female voices.
4 challenges of transfer learning:

Successfully learning related tasks from source tasks
Determining task relatedness
Avoiding negative transfer
More closely imitate human learning

Approach steps:

Have a set of potentially related tasks, choose one as the source, we aim to learn them all
Have all the parameters for a neural nets encoded for a genetic algorithm, including number of hidden nodes and learning rate.
Create a population of x pairs of identical neural nets and x pairs of neural nets where 50% of genes are shared(between pairs)
unique training sets are created for each individual in the population by randomly filtering out a subset of the training data.
Train every individual on the source task and then each other task independently.
After training measure the results and calculate how much of the performance was down to genes and how much down to environment by comparing the performance of identical and non-identical twins.
Select from the identical twin population taking into account how much of there performance was down to genes rather than environment.
Select until convergence

Tasks where

Learning past tenses of English words
Mapping patterns to identical patterns
Categorizing patterns into
Patterns with errors
Arbitrary patterns, since random should be no generalization

Results: The networks were able to use direction of change in heritablity(performance resulting from genes), to indicate task relatedness.
Related tasks were learned better than using standard methods
Would be interesting to see how NEAT would work with this method?

Transfer learning approach for financial applications

This uses the approach from the above paper on 3 pieces of financial data.

Statlog – Australian credit approval
Statlog – German credit data
Banknote authentication

Results: Seems reasonably successful

How Important is Weight Symmetry in Backpropagation?

When they say weight symmetry they are referring to the weights used in a network feeding forward vs the weights used when doing back propagation.
Interesting food for thought is if weight symmetry is not important this could mitigate the vanishing gradient problem in deep neural nets…
They run 15 different data sets in the experiment all of which may be worth looking at for other experiments.
Results: Seemed pretty convincing that weight symmetry was not important, in particular an update rule they called Batch-Manhattan actually outperformed standard SGD.
Batch-Manhattan update rule i:

mini_batch = [x for x in order(datasetsamples, lambda x : rand.Next()][:mini_batch_size] #select a random set of samples to be our mini-batch

update_magnitude = -sign(sum([weight_derivate(x) for x in mini_batch]))*momentum * previous_update_magnitude – decay * current_weight
new_weight = current_weight + learning_rate * update_magnitude

previous_update_magnitude = update_magnitude

One thing to node about the above is that the function weight_derivative above potentially does use the weights in the back propagation step. This is where I would love to see the actual source used to generate this results.
Though the magnitude of update was not found to be that important the sign (unsurprisingly)was.
Would love to see more analysis of how remove the weights in back prop might affect very deep networks.

NEAT

Quick summaries of research papers around NEAT

I’m starting on a project around neural networks and as preparation I have been reading a lot of research papers and making quick summaries in case I need to go back to anything in them. I present them here, mostly for myself, but they may also be useful to others. Apologies in advance if you were involved in any of these and I have horribly butchered your work…

Efficient Evolution of Neural Networks through Complexification

Presents the NEAT, Neural Networks through Augmenting Topologies, approach to training neural networks.
Main website for neat is http://nn.cs.utexas.edu/?neat here is the C# source example http://sharpneat.sourceforge.net/
Weights for the network are learned using evolutionary algorithms the 3 main things that make NEAT special are:

Uses an approach that keeps track of the ids of connections and nodes so that there is more efficient competing of genes against equivalent genes(hard to explain this exactly quickly but if you read it, it is a good approach).
Uses speciation to protect innovation, the evolving agents that compete are divided into subgroups and only need to compete against other agent in there “species” in this way innovations are given time to improve before being wiped which allows them to develop for longer to the point where they may be useful.
Reduces dimensions through complexification. We start we simpler nets with fewer hidden nodes then over time evolve more, this makes early evolution quicker and allows for greater complexity over time.

Had very good results for a range of tasks.
The networks that evolve can have many layers and also be very nonstandard, e.g. connections from input layer direct to 3rd layer.
Worked with recurrent networks as well.
Here is a good blog post on using NeatSharp http://www.nashcoding.com/2010/07/17/tutorial-evolving-neural-networks-with-sharpneat-2-part-1/

Compositional Pattern Producing Networks

Rather than creating a neural network directly, we create a function which takes a set of inputs for 2 nodes and outputs the weight of a connection between those 2 nodes. We then call that function for every point of a space to generate a neural net. Here is the python pseudo code for the function:

def createNodeWeight(node1coordinates, node2coordinates):

weightOfConnectionBetween2Nodes = #result of some chain of composed functions

return weightOfConnectionBetween2Nodes

Why do this?

In nature we see that often that x^10 synapse connections are generated from x genes so we should look for similar mechanisms
In our task is related to images or learning anything with physical dimensions we can use the real world coordinates of the inputs to influence are results, e.g. with images we can place the inputs from each pixel of an image at that coordinate.
This allows to very easily scale to images of different resolutions.
Reduces the number of dimension you need to search through to find a solution. I don’t need to find the correct weight for every connection, just the smaller number of correct values to generate the connections.

The createNodeWeight function can itself use a neural network. This works very well generate the values for that network using NEAT, this is called hyperNEAT
Can create really interesting images
Here is the github for hyperNeatSharp and here is a good tutorial on using it.

Autonomous Evolution of Topographic Regularities in ArtificialNeural Networks

Applies hyperNEAT to learning checkers, the inputs exist across 2 dimension(checkers board)
Training is done by checking fitness against a standard min-max search algorithms with evaluation of the result with some randomness thrown in so the game is not to deterministic.
Results: hyperNeat was compared against Neat and also a NEAT-EI(Augmented version of NEAT) hyperNeat was able to evolve to beat the min-max with depth 4 and a lot quicker than NEAT-EI.
Generalization appears better than NEAT-EI as hyperNEAT is able to beat it in a direct game
Quite interesting stuff on why hyperNEAT performs well
Would love to see this applied to chess.

Evolving Adaptive Neural Networks with and without Adaptive Synapses

Experiment: A standard food foraging exercise with the twist that all the food may either be normal giving +points or be poison in which case -points. If the food is poison then when eaten a pain output neuron is stimulated.
If the food is all poison then the optimum strategy is to then stop searching for food. There is no way a fixed optimum strategy can be evolved because if the food is poison or not is randomized so the agent must be able to adapt.
2 sets of agents where evolved. Both used NEAT including the ability to generate recurrent connections.
1 set also was able to evolve local learning rules. For either strengthening or weakening connections between nodes. The idea being that the agent would evolve to have a rule that would stop it foraging for food once it encountered the poison.
Results: Interestingly it turned out that the first agent could actually learn to stop foraging for the poison on it’s own through recurrent connections and learned to do this faster than the agent with adaptive rules could learn it, because of it’s lower number of dimensions.
It would be interesting to run a similar experiment on a game where the environment where both poison and food where present in the environment will some sensor for the agent to distinguish between, so once it had found poison it got extra points for still picking out the food, could a recurrent neural net still preform as well there?