Even more quick summaries of research papers

Following on from part 1 and part 2 here are even more quick summaries of research papers,

Generative NeuroEvolution for DeepLearning

  • Applies HyperNEAT to deep learning
  • Uses the MNIST hand-drawn digits dataset, trains both a normal deep network and a convolutional network
  • Results: 
    • HyperNEAT on it’s own performs very badly. 
    • Using HyperNEAT to generate a number of layers and then backprop on the final layer is achieves 58.4% for normal ANNs
    • Using HyperNEAT to generate a number of layers and then backprop on the final layer for a Convolution Neural net achieved performance of 92.1%
  • This paper seems to miss that one of the advantages of HyperNEAT is the ability to scale it to different sizes of input so it would be nice to see how it performans when given images with different dimensions vs a traditional approach which has to do a standard Photoshop resize and then work off that image.
  • Also would love to see some details of exactly how the algorithms were implemented
  • A big problem with HyperNeat is that though it can express a potentially infinite number of nodes and connection, the numbers of hidden nodes must be decided in advance
  • ES-HyperNeat is an attempt to extend HyperNeat to be able to determine how many hidden nodes it should have.
  • ES-HyperNeat uses the connection weights values themselves to determine node placement, areas with more complex(higher variance) weight patterns should be given more nodes.
  • There is an addition called Link Expression Output (LEO) for HyperNEAT where weather or not there is a connection between 2 nodes node is not calculated from the magnitude of weights but from another evolved parameter. This also works with ES-HyperNEAT
  • Algorithm is:
    • When creating a connection between 2 nodes (e.g an input and an output) rather than just calculate a single weight value we create 4 hidden nodes(quad tree style) and calculate the weights for each
    • if the variance of the weight is above some threshold we create all for nodes 
    • otherwise just have the single connection
    • when doing the connection for the 4 sub nodes we may do more subdivisions(up to some max)
  • Run experiments against maze navigation, multi model problems and retina problem
    • Results: Outperformed HyperNeat

    Deep Learning using Genetic Algorithms

    • Tests using ga’s to encode and then decode features of an image
    • Seems to have some success
    • They encounter some problems that may be better solved using NEAT/HyperNEAT

    Novelty Search and the Problem with Objectives

    • Normally Genetic algorithms are trained with an objective(fitness) function and those agents that perform best against it are the selected for
    • This leads to problems of local minima and will often result in lots of similar behaviors with small tweaks being selected for not allowing for the more complex sets of interactions required.
    • Novelty search instead adds a measure of the distinctness or novelty or difference in the result of a behavior to the fitness function. 
    • Gives the example of teaching a robot to walk, a fitness function would start off with just selecting the robot that fell the furthest to the goal. Which would likely never lead to complex interesting behavior. 
    • But if you reward different kinds of falls some may arise that balance for a bit in nay direction over time these can adapt to move to the actual objective
    • Recommends average distance to k nearest neighbors as a good measure of novelty.
    • Novelty search shows very good results vs objective search in maze problems and bi-pedal walking among others

    Unsupervised Feature Learning through Divergent Discriminative Feature Accumulation

    • Uses HyperNEAT with novelty search as the unsupervised layer in deep learning
    • Number of hidden nodes not set in advance, rather successive features are learned based on there difference from prevous features until a certain number reached
    • Only 2 layer network trained, learning 1500->3000 features
    • Results: Against MNIST data does amazingly well, incredibly impressive
    • Looks like this is a good alternative to restricted boltzman machine
    • Would love to see how it might do with more layers/convolutional architecture, scaling to different image sizes

    Quick summaries of research papers around NEAT Part 2

    Continuing on from my previous post here are some more quick summaries of research papers.

    Evolving Reusable Neural Modules

    • Attempts to improve on NEAT by dividing the evolved nets into modules. 
    • These modules are in themselves smaller neural nets with input, output and hidden nodes.
    • Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of modules to be used and mappings from the real inputs/outputs and the module input/outputs
    • Both blueprints and module are evolved and speciated.
    • The idea is having modules reduces the number of dimensions in the search space. It is somewhat analogous to the modules being chromosomes and blueprints being arrangements of chromosomes.
    • Experiment: NEAT and modular NEAT were run on a board game(very roughly like go).
    • Results: Modular NEAT was seen to evolve better solutions and for those better solutions to appear about 4 times faster. Though this level of improvement is possibly quite tied to the kind of task being learned.

    Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles

    • Transfer learning is applying learning in one domain to a related but different domain, e.g. speech recognition on male voices, to speech recognition on female voices.
    • 4 challenges of transfer learning:
      • Successfully learning related tasks from source tasks
      • Determining task relatedness
      • Avoiding negative transfer
      • More closely imitate human learning
    • Approach steps:
      • Have a set of potentially related tasks, choose one as the source, we aim to learn them all
      • Have all the parameters for a neural nets encoded for a genetic algorithm, including number of hidden nodes and learning rate.
      • Create a population of x pairs of identical neural nets and x pairs of neural nets where 50% of genes are shared(between pairs)
      • unique training sets are created for each individual in the population by randomly filtering out a subset of the training data.
      • Train every individual on the source task and then each other task independently.
      • After training measure the results and calculate how much of the performance was down to genes and how much down to environment by comparing the performance of identical and non-identical twins.
      • Select from the identical twin population taking into account how much of there performance was down to genes rather than environment.
      • Select until convergence
    • Tasks where
      • Learning past tenses of English words
      • Mapping patterns to identical patterns
      • Categorizing patterns into
      • Patterns with errors
      • Arbitrary patterns, since random should be no generalization
    • Results: The networks were able to use direction of change in heritablity(performance resulting from genes), to indicate task relatedness.
    • Related tasks were learned better than using standard methods
    • Would be interesting to see how NEAT would work with this method?
    • This uses the approach from the above paper on 3 pieces of financial data.
      • Statlog – Australian credit approval
      • Statlog – German credit data
      • Banknote authentication
    • Results: Seems reasonably successful
    • When they say weight symmetry they are referring to the weights used in a network feeding forward vs the weights used when doing back propagation.
    • Interesting food for thought is if weight symmetry is not important this could mitigate the vanishing gradient problem in deep neural nets…
    • They run 15 different data sets in the experiment all of which may be worth looking at for other experiments.
    • Results: Seemed pretty convincing that weight symmetry was not important, in particular an update rule they called Batch-Manhattan actually outperformed standard SGD.
    • Batch-Manhattan update rule i: 
    mini_batch = [x for x in order(datasetsamples, lambda x : rand.Next()][:mini_batch_size] #select a random set of samples to be our mini-batch
    update_magnitude = -sign(sum([weight_derivate(x) for x in mini_batch]))*momentum * previous_update_magnitude – decay * current_weight
    new_weight = current_weight + learning_rate * update_magnitude
    previous_update_magnitude = update_magnitude
    • One thing to node about the above is that the function weight_derivative above potentially does use the weights in the back propagation step. This is where I would love to see the actual source used to generate this results.
    • Though the magnitude of update was not found to be that important the sign (unsurprisingly)was.
    • Would love to see more analysis of how remove the weights in back prop might affect very deep networks.

    Quick summaries of research papers around NEAT

    I’m starting on a project around neural networks and as preparation I have been reading a lot of research papers and making quick summaries in case I need to go back to anything in them. I present them here, mostly for myself, but they may also be useful to others. Apologies in advance if you were involved in any of these and I have horribly butchered your work…

    Efficient Evolution of Neural Networks through Complexification

    • Presents the NEAT, Neural Networks through Augmenting Topologies, approach to training neural networks.
    • Main website for neat is http://nn.cs.utexas.edu/?neat here is the C# source example http://sharpneat.sourceforge.net/
    • Weights for the network are learned using evolutionary algorithms the 3 main things that make NEAT special are:
      • Uses an approach that keeps track of the ids of connections and nodes so that there is more efficient competing of genes against equivalent genes(hard to explain this exactly quickly but if you read it, it is a good approach).
      • Uses speciation to protect innovation, the evolving agents that compete are divided into subgroups and only need to compete against other agent in there “species” in this way innovations are given time to improve before being wiped which allows them to develop for longer to the point where they may be useful.
      • Reduces dimensions through complexification. We start we simpler nets with fewer hidden nodes then over time evolve more, this makes early evolution quicker and allows for greater complexity over time.
    • Had very good results for a range of tasks.
    • The networks that evolve can have many layers and also be very nonstandard, e.g. connections from input layer direct to 3rd layer.
    • Worked with recurrent networks as well.
    • Here is a good blog post on using NeatSharp http://www.nashcoding.com/2010/07/17/tutorial-evolving-neural-networks-with-sharpneat-2-part-1/

    Compositional Pattern Producing Networks

    • Rather than creating a neural network directly, we create a function which takes a set of inputs for 2 nodes and outputs the weight of a connection between those 2 nodes. We then call that function for every point of a space to generate a neural net. Here is the python pseudo code for the function:
    def createNodeWeight(node1coordinates, node2coordinates):
         weightOfConnectionBetween2Nodes = #result of some chain of composed functions
         return weightOfConnectionBetween2Nodes
    • Why do this?
      • In nature we see that often that x^10 synapse connections are generated from x genes so we should look for similar mechanisms
      • In our task is related to images or learning anything with physical dimensions we can use the real world coordinates of the inputs to influence are results, e.g. with images we can place the inputs from each pixel of an image at that coordinate.
      • This allows to very easily scale to images of different resolutions.
      • Reduces the number of dimension you need to search through to find a solution. I don’t need to find the correct weight for every connection, just the smaller number of correct values to generate the connections.
    • The createNodeWeight function can itself use a neural network. This works very well generate the values for that network using NEAT, this is called hyperNEAT
    • Can create really interesting images
    • Here is the github for hyperNeatSharp and here is a good tutorial on using it.

    Autonomous Evolution of Topographic Regularities in ArtificialNeural Networks

    • Applies hyperNEAT to learning checkers, the inputs exist across 2 dimension(checkers board)
    • Training is done by checking fitness against a standard min-max search algorithms with evaluation of the result with some randomness thrown in so the game is not to deterministic.
    • Results: hyperNeat was compared against Neat and also a NEAT-EI(Augmented version of NEAT) hyperNeat was able to evolve to beat the min-max with depth 4 and a lot quicker than NEAT-EI.
    • Generalization appears better than NEAT-EI as hyperNEAT is able to beat it in a direct game
    • Quite interesting stuff on why hyperNEAT performs well
    • Would love to see this applied to chess.

    Evolving Adaptive Neural Networks with and without Adaptive Synapses

    • Experiment: A standard food foraging exercise with the twist that all the food may either be normal giving +points or be poison in which case -points. If the food is poison then when eaten a pain output neuron is stimulated. 
    • If the food is all poison then the optimum strategy is to then stop searching for food. There is no way a fixed optimum strategy can be evolved because if the food is poison or not is randomized so the agent must be able to adapt.
    • 2 sets of agents where evolved. Both used NEAT including the ability to generate recurrent connections.
    • 1 set also was able to evolve local learning rules. For either strengthening or weakening connections between nodes. The idea being that the agent would evolve to have a rule that would stop it foraging for food once it encountered the poison.
    • Results: Interestingly it turned out that the first agent could actually learn to stop foraging for the poison on it’s own through recurrent connections and learned to do this faster than the agent with adaptive rules could learn it, because of it’s lower number of dimensions.
    • It would be interesting to run a similar experiment on a game where the environment where both poison and food where present in the environment will some sensor for the agent to distinguish between, so once it had found poison it got extra points for still picking out the food, could a recurrent neural net still preform as well there?
    More in part 2