Even more quick summaries of research papers

Following on from part 1 and part 2, here are even more quick summaries of research papers.

Generative NeuroEvolution for Deep Learning

  • Applies HyperNEAT to deep learning
  • Uses the MNIST handwritten digits dataset, training both a normal deep network and a convolutional network
  • Results: 
    • HyperNEAT on its own performs very badly.
    • Using HyperNEAT to generate a number of layers and then running backprop on the final layer achieves 58.4% for normal ANNs
    • Using HyperNEAT to generate a number of layers and then running backprop on the final layer of a convolutional neural net achieves 92.1%
  • This paper seems to miss that one of the advantages of HyperNEAT is its ability to scale to different input sizes, so it would be nice to see how it performs when given images of different dimensions versus a traditional approach, which has to do a standard Photoshop-style resize and then work off the resized image.
  • Would also love to see some details of exactly how the algorithms were implemented.
  • A big problem with HyperNEAT is that, though it can express a potentially infinite number of nodes and connections, the number of hidden nodes must be decided in advance.
  • ES-HyperNEAT is an attempt to extend HyperNEAT so that it can determine how many hidden nodes it should have.
  • ES-HyperNEAT uses the connection weight values themselves to determine node placement: areas with more complex (higher variance) weight patterns should be given more nodes.
  • There is an addition to HyperNEAT called Link Expression Output (LEO), where whether or not there is a connection between 2 nodes is determined not by the magnitude of the weight but by a separate evolved output. This also works with ES-HyperNEAT.
  • The algorithm is (a rough sketch follows after this list):
    • When creating a connection between 2 nodes (e.g. an input and an output), rather than just calculating a single weight value we create 4 candidate hidden nodes (quadtree style) and calculate the weights for each
    • If the variance of those weights is above some threshold we create all four nodes
    • Otherwise we just keep the single connection
    • When creating the connections for the 4 sub-nodes we may do further subdivisions (up to some maximum depth)
  • Experiments were run on maze navigation, multi-modal problems and the retina problem
    • Results: Outperformed HyperNEAT
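
A rough sketch of the quadtree-style division step described above, assuming a hypothetical cppn_weight(x, y) function that queries the evolved CPPN for the weight at a candidate hidden-node position; the variance threshold, maximum depth and return format are illustrative rather than taken from the paper.

  import statistics

  def divide(cppn_weight, x1, y1, x2, y2, depth=0, max_depth=3, variance_threshold=0.3):
      # centre of the current region and its four quadrants (quadtree style)
      mx, my = (x1 + x2) / 2, (y1 + y2) / 2
      quadrants = [(x1, y1, mx, my), (mx, y1, x2, my), (x1, my, mx, y2), (mx, my, x2, y2)]
      centres = [((a + c) / 2, (b + d) / 2) for a, b, c, d in quadrants]
      weights = [cppn_weight(cx, cy) for cx, cy in centres]

      # low variance: the weight pattern here is simple, so keep a single node/connection
      if depth >= max_depth or statistics.pvariance(weights) < variance_threshold:
          return [((mx, my), cppn_weight(mx, my))]

      # high variance: create all four candidate hidden nodes and possibly subdivide further
      nodes = []
      for quadrant in quadrants:
          nodes.extend(divide(cppn_weight, *quadrant, depth + 1, max_depth, variance_threshold))
      return nodes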

    Deep Learning using Genetic Algorithms

    • Tests using GAs to encode and then decode features of an image (a sketch of one possible interpretation follows after this list)
    • Seems to have some success
    • They encounter some problems that may be better solved using NEAT/HyperNEAT
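
    The post doesn't give details of the encoding, but one rough interpretation is a GA whose individuals are sets of feature vectors scored by how well an image can be reconstructed from them; everything in this sketch (the least-squares decode, the fitness function, the array shapes) is an assumption for illustration only.

    import numpy as np

    def reconstruction_fitness(features, image):
        # features: an (n_features, n_pixels) matrix evolved by the GA; image: flattened pixels
        # encode the image as least-squares coefficients over the evolved features...
        coefficients, *_ = np.linalg.lstsq(features.T, image, rcond=None)
        # ...decode it back and score by (negative) reconstruction error
        reconstruction = features.T @ coefficients
        return -np.mean((image - reconstruction) ** 2)

    # a GA would then select, crossover and mutate the feature sets to maximise this fitness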

    Novelty Search and the Problem with Objectives

    • Normally genetic algorithms are trained with an objective (fitness) function, and the agents that perform best against it are selected.
    • This leads to problems with local minima and will often result in lots of similar behaviors with small tweaks being selected, never allowing for the more complex sets of interactions required.
    • Novelty search instead adds a measure of the distinctness (novelty) of a behavior's outcome to the fitness function.
    • Gives the example of teaching a robot to walk: an objective-based fitness function would start off just selecting the robots that fell furthest towards the goal, which would likely never lead to complex, interesting behavior.
    • But if you reward different kinds of falls, some may arise that balance for a bit in any direction, and over time these can adapt to move towards the actual objective.
    • Recommends the average distance to the k nearest neighbors in behavior space as a good measure of novelty (a sketch follows after this list).
    • Novelty search shows very good results vs objective-based search on maze problems and bipedal walking, among others.
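
    A minimal sketch of that novelty measure, assuming each behavior is described by a small numeric vector (e.g. a robot's final position); the value of k and the use of Euclidean distance are illustrative choices.

    import numpy as np

    def novelty_score(behavior, archive, k=15):
        # behavior: descriptor of the individual being scored (e.g. its final (x, y) position)
        # archive: descriptors of previously evaluated individuals
        if not archive:
            return float("inf")  # nothing to compare against yet, so maximally novel
        distances = sorted(np.linalg.norm(np.asarray(behavior) - np.asarray(other)) for other in archive)
        return float(np.mean(distances[:k]))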

    Unsupervised Feature Learning through Divergent Discriminative Feature Accumulation

    • Uses HyperNEAT with novelty search as the unsupervised layer in deep learning
    • The number of hidden nodes is not set in advance; rather, successive features are learned based on their difference from previous features until a certain number is reached (a rough sketch of this accumulation idea follows after this list)
    • Only a 2-layer network is trained, learning 1500 to 3000 features
    • Results: Against the MNIST data it does amazingly well; incredibly impressive
    • Looks like this is a good alternative to restricted Boltzmann machines
    • Would love to see how it might do with more layers or a convolutional architecture, and how it scales to different image sizes
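
    A very rough sketch of that accumulation idea, assuming candidates is an iterator yielding candidate feature weight vectors (e.g. produced by evolved CPPNs); the batch size, k and the greedy acceptance rule are illustrative, not the paper's exact procedure.

    import numpy as np

    def accumulate_features(candidates, n_features, batch_size=20, k=5):
        features = [next(candidates)]  # seed the accumulated set with the first candidate
        while len(features) < n_features:
            batch = [next(candidates) for _ in range(batch_size)]
            # novelty of a candidate = mean distance to its k nearest accumulated features
            def novelty(f):
                distances = sorted(np.linalg.norm(f - g) for g in features)
                return np.mean(distances[:k])
            features.append(max(batch, key=novelty))  # keep the most divergent candidate
        return features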

    Quick summaries of research papers around NEAT Part 2

    Continuing on from my previous post, here are some more quick summaries of research papers.

    Evolving Reusable Neural Modules

    • Attempts to improve on NEAT by dividing the evolved nets into modules. 
    • These modules are in themselves smaller neural nets with input, output and hidden nodes.
    • Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of the modules to be used and mappings between the real inputs/outputs and the module inputs/outputs (a rough sketch of these structures follows after this list).
    • Both blueprints and modules are evolved and speciated.
    • The idea is that having modules reduces the number of dimensions in the search space; it is somewhat analogous to the modules being chromosomes and the blueprints being arrangements of chromosomes.
    • Experiment: NEAT and modular NEAT were run on a board game (very roughly like Go).
    • Results: Modular NEAT evolved better solutions, and those better solutions appeared about 4 times faster, though this level of improvement is possibly quite tied to the kind of task being learned.
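
    A sketch of what the module and blueprint structures might look like; the field names and the index-based mappings are assumptions for illustration, not the paper's actual representation.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Module:
        # a small evolved sub-network (in the paper this would itself be a NEAT genome)
        genome_id: int
        n_inputs: int
        n_outputs: int

    @dataclass
    class Blueprint:
        # which modules this network is assembled from
        module_ids: List[int]
        # task input index -> (position of module in module_ids, module input index)
        input_map: Dict[int, Tuple[int, int]] = field(default_factory=dict)
        # (position of module in module_ids, module output index) -> task output index
        output_map: Dict[Tuple[int, int], int] = field(default_factory=dict)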

    Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles

    • Transfer learning is applying learning from one domain to a related but different domain, e.g. from speech recognition on male voices to speech recognition on female voices.
    • 4 challenges of transfer learning:
      • Successfully learning related tasks from source tasks
      • Determining task relatedness
      • Avoiding negative transfer
      • More closely imitating human learning
    • Approach steps:
      • Have a set of potentially related tasks and choose one as the source; the aim is to learn them all.
      • Encode all the parameters of a neural net for a genetic algorithm, including the number of hidden nodes and the learning rate.
      • Create a population of x pairs of identical neural nets and x pairs of neural nets where 50% of the genes are shared within each pair.
      • Unique training sets are created for each individual in the population by randomly filtering out a subset of the training data.
      • Train every individual on the source task and then each other task independently.
      • After training, measure the results and calculate how much of the performance was down to genes and how much down to environment by comparing the performance of identical and non-identical twins (a sketch of one way to estimate this follows after this section's bullets).
      • Select from the identical twin population, taking into account how much of their performance was down to genes rather than environment.
      • Select until convergence
    • The tasks were:
      • Learning past tenses of English words
      • Mapping patterns to identical patterns
      • Categorizing patterns into categories
      • Patterns with errors
      • Arbitrary (random) patterns, for which there should be no generalization
    • Results: The networks were able to use the direction of change in heritability (the proportion of performance resulting from genes) to indicate task relatedness.
    • Related tasks were learned better than with standard methods.
    • It would be interesting to see how NEAT would work with this method.
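
    A sketch of one standard way to estimate how much of the performance is down to genes (Falconer's formula from behavioural genetics, h^2 = 2 * (r_identical - r_fraternal)); whether the paper computes heritability exactly this way is an assumption.

    import numpy as np

    def heritability_estimate(identical_pairs, fraternal_pairs):
        # each argument is a list of (twin_a_performance, twin_b_performance) tuples
        r_identical = np.corrcoef(*zip(*identical_pairs))[0, 1]  # within-pair correlation, 100% shared genes
        r_fraternal = np.corrcoef(*zip(*fraternal_pairs))[0, 1]  # within-pair correlation, 50% shared genes
        return 2.0 * (r_identical - r_fraternal)  # Falconer's estimate of heritability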
    • Another paper applies the approach from the paper above to 3 pieces of financial data:
      • Statlog – Australian credit approval
      • Statlog – German credit data
      • Banknote authentication
    • Results: Seems reasonably successful
    • The next paper looks at weight symmetry: whether the weights used when feeding forward through a network need to be the same weights used when doing back propagation.
    • Interesting food for thought: if weight symmetry is not important, this could mitigate the vanishing gradient problem in deep neural nets…
    • They run the experiment on 15 different data sets, all of which may be worth looking at for other experiments.
    • Results: Seemed pretty convincing that weight symmetry was not important; in particular, an update rule they called Batch-Manhattan actually outperformed standard SGD.
    • The Batch-Manhattan update rule is roughly (in Python-style pseudocode):
    import random
    import numpy as np

    mini_batch = random.sample(dataset_samples, mini_batch_size)  # select a random set of samples to be our mini-batch
    grad_sum = sum(weight_derivative(x) for x in mini_batch)      # summed gradient of the error w.r.t. this weight
    update_magnitude = -np.sign(grad_sum) + momentum * previous_update_magnitude - decay * current_weight
    new_weight = current_weight + learning_rate * update_magnitude
    previous_update_magnitude = update_magnitude
    • One thing to note about the above is that the weight_derivative function potentially still uses the weights in the back propagation step; this is where I would love to see the actual source used to generate these results.
    • Though the magnitude of the update was not found to be that important, the sign (unsurprisingly) was.
    • Would love to see more analysis of how removing the weights from back prop might affect very deep networks.