Continuing on from my previous post here are some more quick summaries of research papers.
Evolving Reusable Neural Modules
- Attempts to improve on NEAT by dividing the evolved nets into modules.
- These modules are in themselves smaller neural nets with input, output and hidden nodes.
- Blueprints are used to combine the modules into the final neural nets. Blueprints contain lists of modules to be used and mappings from the real inputs/outputs and the module input/outputs
- Both blueprints and module are evolved and speciated.
- The idea is having modules reduces the number of dimensions in the search space. It is somewhat analogous to the modules being chromosomes and blueprints being arrangements of chromosomes.
- Experiment: NEAT and modular NEAT were run on a board game(very roughly like go).
- Results: Modular NEAT was seen to evolve better solutions and for those better solutions to appear about 4 times faster. Though this level of improvement is possibly quite tied to the kind of task being learned.
Transfer Learning Across Heterogeneous Tasks Using Behavioural Genetic Principles
- Transfer learning is applying learning in one domain to a related but different domain, e.g. speech recognition on male voices, to speech recognition on female voices.
- 4 challenges of transfer learning:
- Successfully learning related tasks from source tasks
- Determining task relatedness
- Avoiding negative transfer
- More closely imitate human learning
- Approach steps:
- Have a set of potentially related tasks, choose one as the source, we aim to learn them all
- Have all the parameters for a neural nets encoded for a genetic algorithm, including number of hidden nodes and learning rate.
- Create a population of x pairs of identical neural nets and x pairs of neural nets where 50% of genes are shared(between pairs)
- unique training sets are created for each individual in the population by randomly filtering out a subset of the training data.
- Train every individual on the source task and then each other task independently.
- After training measure the results and calculate how much of the performance was down to genes and how much down to environment by comparing the performance of identical and non-identical twins.
- Select from the identical twin population taking into account how much of there performance was down to genes rather than environment.
- Select until convergence
- Tasks where
- Learning past tenses of English words
- Mapping patterns to identical patterns
- Categorizing patterns into
- Patterns with errors
- Arbitrary patterns, since random should be no generalization
- Results: The networks were able to use direction of change in heritablity(performance resulting from genes), to indicate task relatedness.
- Related tasks were learned better than using standard methods
- Would be interesting to see how NEAT would work with this method?
- This uses the approach from the above paper on 3 pieces of financial data.
- Statlog – Australian credit approval
- Statlog – German credit data
- Banknote authentication
- Results: Seems reasonably successful
- When they say weight symmetry they are referring to the weights used in a network feeding forward vs the weights used when doing back propagation.
- Interesting food for thought is if weight symmetry is not important this could mitigate the vanishing gradient problem in deep neural nets…
- They run 15 different data sets in the experiment all of which may be worth looking at for other experiments.
- Results: Seemed pretty convincing that weight symmetry was not important, in particular an update rule they called Batch-Manhattan actually outperformed standard SGD.
- Batch-Manhattan update rule i:
mini_batch = [x for x in order(datasetsamples, lambda x : rand.Next()][:mini_batch_size] #select a random set of samples to be our mini-batch
update_magnitude = -sign(sum([weight_derivate(x) for x in mini_batch]))*momentum * previous_update_magnitude – decay * current_weight
new_weight = current_weight + learning_rate * update_magnitude
new_weight = current_weight + learning_rate * update_magnitude
previous_update_magnitude = update_magnitude
- One thing to node about the above is that the function weight_derivative above potentially does use the weights in the back propagation step. This is where I would love to see the actual source used to generate this results.
- Though the magnitude of update was not found to be that important the sign (unsurprisingly)was.
- Would love to see more analysis of how remove the weights in back prop might affect very deep networks.