HyperNEAT – DANIEL SLATER'S BLOG

Generative NeuroEvolution for DeepLearning

Applies HyperNEAT to deep learning
Uses the MNIST hand-drawn digits dataset, trains both a normal deep network and a convolutional network
Results:

HyperNEAT on it’s own performs very badly.
Using HyperNEAT to generate a number of layers and then backprop on the final layer is achieves 58.4% for normal ANNs
Using HyperNEAT to generate a number of layers and then backprop on the final layer for a Convolution Neural net achieved performance of 92.1%

This paper seems to miss that one of the advantages of HyperNEAT is the ability to scale it to different sizes of input so it would be nice to see how it performans when given images with different dimensions vs a traditional approach which has to do a standard Photoshop resize and then work off that image.
Also would love to see some details of exactly how the algorithms were implemented

An Enhanced Hypercube-Based Encoding for Evolving thePlacement, Density and Connectivity of Neurons

A big problem with HyperNeat is that though it can express a potentially infinite number of nodes and connection, the numbers of hidden nodes must be decided in advance
ES-HyperNeat is an attempt to extend HyperNeat to be able to determine how many hidden nodes it should have.
ES-HyperNeat uses the connection weights values themselves to determine node placement, areas with more complex(higher variance) weight patterns should be given more nodes.
There is an addition called Link Expression Output (LEO) for HyperNEAT where weather or not there is a connection between 2 nodes node is not calculated from the magnitude of weights but from another evolved parameter. This also works with ES-HyperNEAT
Algorithm is:

When creating a connection between 2 nodes (e.g an input and an output) rather than just calculate a single weight value we create 4 hidden nodes(quad tree style) and calculate the weights for each
if the variance of the weight is above some threshold we create all for nodes
otherwise just have the single connection
when doing the connection for the 4 sub nodes we may do more subdivisions(up to some max)

Run experiments against maze navigation, multi model problems and retina problem

Results: Outperformed HyperNeat

Deep Learning using Genetic Algorithms

Tests using ga’s to encode and then decode features of an image
Seems to have some success
They encounter some problems that may be better solved using NEAT/HyperNEAT

Novelty Search and the Problem with Objectives

Normally Genetic algorithms are trained with an objective(fitness) function and those agents that perform best against it are the selected for
This leads to problems of local minima and will often result in lots of similar behaviors with small tweaks being selected for not allowing for the more complex sets of interactions required.
Novelty search instead adds a measure of the distinctness or novelty or difference in the result of a behavior to the fitness function.
Gives the example of teaching a robot to walk, a fitness function would start off with just selecting the robot that fell the furthest to the goal. Which would likely never lead to complex interesting behavior.
But if you reward different kinds of falls some may arise that balance for a bit in nay direction over time these can adapt to move to the actual objective
Recommends average distance to k nearest neighbors as a good measure of novelty.
Novelty search shows very good results vs objective search in maze problems and bi-pedal walking among others

Unsupervised Feature Learning through Divergent Discriminative Feature Accumulation

Uses HyperNEAT with novelty search as the unsupervised layer in deep learning
Number of hidden nodes not set in advance, rather successive features are learned based on there difference from prevous features until a certain number reached
Only 2 layer network trained, learning 1500->3000 features
Results: Against MNIST data does amazingly well, incredibly impressive
Looks like this is a good alternative to restricted boltzman machine
Would love to see how it might do with more layers/convolutional architecture, scaling to different image sizes

Continuing on from my previous post here are some more quick summaries of research papers.

Evolving Reusable Neural Modules

Attempts to improve on NEAT by dividing the evolved nets into modules.
These modules are in themselves smaller neural nets with input, output and hidden nodes.
Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of modules to be used and mappings from the real inputs/outputs and the module input/outputs
Both blueprints and module are evolved and speciated.
The idea is having modules reduces the number of dimensions in the search space. It is somewhat analogous to the modules being chromosomes and blueprints being arrangements of chromosomes.
Experiment: NEAT and modular NEAT were run on a board game(very roughly like go).
Results: Modular NEAT was seen to evolve better solutions and for those better solutions to appear about 4 times faster. Though this level of improvement is possibly quite tied to the kind of task being learned.

Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles

Transfer learning is applying learning in one domain to a related but different domain, e.g. speech recognition on male voices, to speech recognition on female voices.
4 challenges of transfer learning:

Successfully learning related tasks from source tasks
Determining task relatedness
Avoiding negative transfer
More closely imitate human learning

Approach steps:

Have a set of potentially related tasks, choose one as the source, we aim to learn them all
Have all the parameters for a neural nets encoded for a genetic algorithm, including number of hidden nodes and learning rate.
Create a population of x pairs of identical neural nets and x pairs of neural nets where 50% of genes are shared(between pairs)
unique training sets are created for each individual in the population by randomly filtering out a subset of the training data.
Train every individual on the source task and then each other task independently.
After training measure the results and calculate how much of the performance was down to genes and how much down to environment by comparing the performance of identical and non-identical twins.
Select from the identical twin population taking into account how much of there performance was down to genes rather than environment.
Select until convergence

Tasks where

Learning past tenses of English words
Mapping patterns to identical patterns
Categorizing patterns into
Patterns with errors
Arbitrary patterns, since random should be no generalization

Results: The networks were able to use direction of change in heritablity(performance resulting from genes), to indicate task relatedness.
Related tasks were learned better than using standard methods
Would be interesting to see how NEAT would work with this method?

Transfer learning approach for financial applications

This uses the approach from the above paper on 3 pieces of financial data.

Statlog – Australian credit approval
Statlog – German credit data
Banknote authentication

Results: Seems reasonably successful

How Important is Weight Symmetry in Backpropagation?

When they say weight symmetry they are referring to the weights used in a network feeding forward vs the weights used when doing back propagation.
Interesting food for thought is if weight symmetry is not important this could mitigate the vanishing gradient problem in deep neural nets…
They run 15 different data sets in the experiment all of which may be worth looking at for other experiments.
Results: Seemed pretty convincing that weight symmetry was not important, in particular an update rule they called Batch-Manhattan actually outperformed standard SGD.
Batch-Manhattan update rule i:

mini_batch = [x for x in order(datasetsamples, lambda x : rand.Next()][:mini_batch_size] #select a random set of samples to be our mini-batch

update_magnitude = -sign(sum([weight_derivate(x) for x in mini_batch]))*momentum * previous_update_magnitude – decay * current_weight
new_weight = current_weight + learning_rate * update_magnitude

previous_update_magnitude = update_magnitude

One thing to node about the above is that the function weight_derivative above potentially does use the weights in the back propagation step. This is where I would love to see the actual source used to generate this results.
Though the magnitude of update was not found to be that important the sign (unsurprisingly)was.
Would love to see more analysis of how remove the weights in back prop might affect very deep networks.

Category: HyperNEAT

Even more quick summaries of research papers