Quick summaries of research papers around NEAT Part 2

Continuing on from my previous post, here are some more quick summaries of research papers.

Evolving Reusable Neural Modules

  • Attempts to improve on NEAT by dividing the evolved nets into modules. 
  • These modules are themselves smaller neural nets with input, output and hidden nodes.
  • Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of the modules to be used and mappings between the real inputs/outputs and the module inputs/outputs.
  • Both blueprints and modules are evolved and speciated.
  • The idea is that having modules reduces the number of dimensions in the search space. It is somewhat analogous to the modules being chromosomes and blueprints being arrangements of chromosomes.
  • Experiment: NEAT and modular NEAT were run on a board game (very roughly like Go).
  • Results: Modular NEAT was seen to evolve better solutions, and those better solutions appeared about 4 times faster, though this level of improvement is possibly quite tied to the kind of task being learned.
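To make the blueprint/module split concrete, here is a minimal sketch of how a blueprint might wire reusable modules onto the real inputs and outputs. The class names (Module, Placement, Blueprint) and the single-linear-layer module are my own simplifications for illustration, not the paper's representation; real modules have hidden nodes and evolved topologies.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """A small sub-network, reduced here to one linear layer (an assumption)."""
    weights: list  # weights[o][i]: weight from module input i to module output o

    def activate(self, inputs):
        return [sum(w * x for w, x in zip(row, inputs)) for row in self.weights]


@dataclass
class Placement:
    """One use of a module inside a blueprint (hypothetical name)."""
    module_id: int
    input_map: list   # real input index feeding each module input
    output_map: list  # real output index receiving each module output


@dataclass
class Blueprint:
    placements: list

    def activate(self, modules, inputs, n_outputs):
        outputs = [0.0] * n_outputs
        for p in self.placements:
            module_inputs = [inputs[i] for i in p.input_map]
            module_outputs = modules[p.module_id].activate(module_inputs)
            # Module outputs that map to the same real output are summed.
            for out_idx, value in zip(p.output_map, module_outputs):
                outputs[out_idx] += value
        return outputs


# The same module is reused twice on different parts of the input.
modules = {0: Module(weights=[[1.0, -1.0]])}
blueprint = Blueprint(placements=[
    Placement(module_id=0, input_map=[0, 1], output_map=[0]),
    Placement(module_id=0, input_map=[2, 3], output_map=[1]),
])
result = blueprint.activate(modules, [5.0, 2.0, 1.0, 4.0], n_outputs=2)
```

The point of the sketch is the reuse: evolution only has to find the module once, and the blueprint decides where it gets applied.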

Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles

  • Transfer learning is applying what was learned in one domain to a related but different domain, e.g. going from speech recognition on male voices to speech recognition on female voices.
  • 4 challenges of transfer learning:
    • Successfully learning related tasks from source tasks
    • Determining task relatedness
    • Avoiding negative transfer
    • More closely imitating human learning
  • Approach steps:
    • Start with a set of potentially related tasks and choose one as the source; the aim is to learn them all.
    • Encode all the parameters of a neural net for a genetic algorithm, including the number of hidden nodes and the learning rate.
    • Create a population of x pairs of identical neural nets and x pairs of neural nets in which each pair shares 50% of its genes.
    • Create a unique training set for each individual in the population by randomly filtering out a subset of the training data.
    • Train every individual on the source task and then on each other task independently.
    • After training, measure the results and calculate how much of the performance was down to genes and how much was down to environment by comparing the performance of identical and non-identical twins.
    • Select from the identical twin population, taking into account how much of their performance was down to genes rather than environment.
    • Keep selecting until convergence.
  • Tasks were:
    • Learning past tenses of English words
    • Mapping patterns to identical patterns
    • Categorizing patterns into
    • Patterns with errors
    • Arbitrary patterns; since these are random, there should be no generalization
  • Results: The networks were able to use the direction of change in heritability (the share of performance resulting from genes) to indicate task relatedness.
  • Related tasks were learned better than with standard methods.
  • It would be interesting to see how NEAT would work with this method.
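As a rough illustration of the twin comparison step, here is a sketch that estimates heritability from the performance of identical and non-identical pairs. It assumes Falconer's formula from behavioural genetics, h² = 2(r_identical − r_fraternal), where r is the correlation of scores within pairs; I have not confirmed that the paper computes it exactly this way, and the function names and scores are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def falconer_heritability(identical_pairs, fraternal_pairs):
    """Estimate heritability as 2 * (r_identical - r_fraternal)."""
    r_identical = pearson([a for a, _ in identical_pairs],
                          [b for _, b in identical_pairs])
    r_fraternal = pearson([a for a, _ in fraternal_pairs],
                          [b for _, b in fraternal_pairs])
    return 2 * (r_identical - r_fraternal)


# Made-up task scores for each twin in a pair (not data from the paper).
identical_pairs = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
fraternal_pairs = [(0.1, 0.5), (0.5, 0.1), (0.9, 0.9)]
heritability = falconer_heritability(identical_pairs, fraternal_pairs)
```

If identical twins score much more alike than non-identical twins, the estimate is high, meaning performance is mostly down to genes rather than the randomly filtered training set each individual received.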
  • This uses the approach from the above paper on 3 pieces of financial data.
    • Statlog – Australian credit approval
    • Statlog – German credit data
    • Banknote authentication
  • Results: Seems reasonably successful
  • When they say weight symmetry they are referring to the weights used in a network feeding forward vs the weights used when doing back propagation.
  • Interesting food for thought: if weight symmetry is not important, this could mitigate the vanishing gradient problem in deep neural nets…
  • They run 15 different data sets in the experiment, all of which may be worth looking at for other experiments.
  • Results: It seemed pretty convincing that weight symmetry was not important; in particular, an update rule they called Batch-Manhattan actually outperformed standard SGD.
  • Batch-Manhattan update rule:
import random
mini_batch = random.sample(dataset_samples, mini_batch_size)  # select a random set of samples to be our mini-batch
gradient_sign = sign(sum(weight_derivative(x) for x in mini_batch))  # sign() gives -1, 0 or +1; only the direction of the summed gradient is kept
update_magnitude = -gradient_sign + momentum * previous_update_magnitude - decay * current_weight
new_weight = current_weight + learning_rate * update_magnitude
previous_update_magnitude = update_magnitude  # remembered for the momentum term on the next step
  • One thing to note about the above is that the function weight_derivative potentially does use the weights in the back propagation step. This is where I would love to see the actual source used to generate these results.
  • Though the magnitude of the update was not found to be that important, the sign (unsurprisingly) was.
  • Would love to see more analysis of how removing the weights in back prop might affect very deep networks.
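To get a feel for a sign-only update, here is a small self-contained sketch that applies a Batch-Manhattan style rule to a single weight fitting y = 3x. This is my own toy setup, not the paper's experiment: it assumes the momentum term is added to the negated gradient sign, the function name and hyperparameters are hypothetical, and with only one weight there is no back propagation step, so it says nothing about weight symmetry itself.

```python
import random


def sign(value):
    """Return -1, 0 or +1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def batch_manhattan_fit(samples, steps=500, mini_batch_size=8,
                        learning_rate=0.01, momentum=0.5, decay=0.0, seed=0):
    """Fit y = weight * x using only the sign of the summed mini-batch gradient."""
    rng = random.Random(seed)
    weight = 0.0
    previous_update = 0.0
    for _ in range(steps):
        mini_batch = rng.sample(samples, mini_batch_size)
        # Derivative of 0.5 * (weight * x - y) ** 2 with respect to weight.
        gradient_sum = sum((weight * x - y) * x for x, y in mini_batch)
        # Assumed form: signed step plus momentum minus weight decay.
        update = -sign(gradient_sum) + momentum * previous_update - decay * weight
        weight += learning_rate * update
        previous_update = update
    return weight


# Toy data on the line y = 3x (illustrative only).
samples = [(x / 10, 3.0 * x / 10) for x in range(1, 21)]
fitted = batch_manhattan_fit(samples)
```

Because every step has a fixed size regardless of how large the gradient is, the weight walks towards 3 at a steady pace and then hovers around it, which matches the observation that the sign matters far more than the magnitude.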
