Quick summaries of research papers on dynamically generating network structure

I’ve been reading a lot of research papers on how the structure of ANNs can be generated or discovered. Here are some quick summaries of interesting papers in that area.

Dynamic Node Creation in Backpropagation Networks

  • Looks at finding a good number of hidden nodes for a network by starting small and growing nodes until there are enough.
  • Claims that the approach of training the network, then freezing the existing nodes before adding new unfrozen ones, does not work well.
  • Adding new nodes and then retraining the whole network appears to work better.
  • The technique (a minimal sketch follows this list):
    • A one-hidden-layer network is created with a predefined number of hidden nodes.
    • The network is trained; when the error rate stops decreasing over a number of iterations, a new hidden node is added.
    • Nodes stop being added either when a certain precision is reached or when the error starts increasing on a validation set.
  • Results: Training appears to take not significantly longer than for fixed networks, and it tends to find near-optimal solutions.
  • Remaining questions they want to investigate are how big the window for adding new nodes should be, and where nodes should be added in multi-layer networks.
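To make the procedure concrete, here is a minimal Python sketch of the growing loop. It is only an illustration of the paper's idea, not the paper's code: train_epoch, validation_error, add_hidden_node and the hidden_count attribute are all hypothetical helpers standing in for a real single-hidden-layer network implementation.

def grow_network(net, training_data, validation_data, patience=5, min_improvement=1e-4, max_nodes=50):
    best_validation_error = float("inf")
    while net.hidden_count < max_nodes:
        # train the whole network until the error stops decreasing
        stalled, previous_error = 0, float("inf")
        while stalled < patience:
            error = train_epoch(net, training_data)  # hypothetical: one backprop pass over the data
            stalled = stalled + 1 if previous_error - error < min_improvement else 0
            previous_error = error
        # stop adding nodes once the validation error starts increasing
        current = validation_error(net, validation_data)  # hypothetical helper
        if current > best_validation_error:
            break
        best_validation_error = current
        add_hidden_node(net)  # hypothetical: add one unfrozen node, then keep training everything
    return net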

Optimal Brain Damage

  • After training a standard ANN we can potentially improve speed and generalization performance by deleting some of the nodes/connections.
  • Optimal Brain Damage deletes parameters (sets them to 0 and freezes them from future use) based on their impact on the error of the model (a sketch of one pruning pass follows this list).
  • Results: On handwritten digit data it was shown that up to a fifth of the parameters could be deleted with no degradation in training performance and some improvement in generalization.
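Roughly, the deletion order comes from a second-order "saliency" for each parameter: an estimate of how much the error would grow if that weight were set to zero. Here is a hedged Python sketch of one pruning pass; it assumes the diagonal Hessian terms are already available (the paper computes them with a backprop-like pass), and all the names are illustrative.

def obd_prune(weights, hessian_diagonal, fraction=0.2):
    # saliency of parameter k (second-order estimate): s_k = h_kk * w_k^2 / 2
    saliencies = sorted((h * w * w / 2.0, k)
                        for k, (w, h) in enumerate(zip(weights, hessian_diagonal)))
    number_to_delete = int(len(weights) * fraction)
    deleted = {k for _, k in saliencies[:number_to_delete]}
    # deleted parameters are set to 0 and frozen out of future training
    return [0.0 if k in deleted else w for k, w in enumerate(weights)], deleted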

The Cascade-Correlation Algorithms

  • A different approach to evolving structure for neural networks. 
  • Starts with a network of just fully connected input and output nodes. These are trained using Quickprop until the error rate stops decreasing.
  • Then multiple candidate nodes are created, each with different random weights, connected to every input node and every existing hidden node (but not to the output nodes).
  • These are all trained to maximize the correlation between their activation and the network’s residual output error (a sketch of this scoring follows the list).
  • The candidate with the best correlation is selected and added to the network.
  • Training and adding candidate nodes repeats until a stopping point is reached.
  • Results: Appears to do very well, producing good results on problems like XOR and various other functions that are difficult for normal ANNs.
  • It has 2 problems:
    • Overfitting
    • Because the eventual network is a cascade of neurons (later neurons are effectively very deep, since each depends on every previous neuron), it runs a lot slower than a normal ANN.
  • Can also be extended to work on recurrent networks.
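Here is a hedged Python sketch of the scoring step for a single candidate: a covariance-style correlation between the candidate's activation and the network's residual errors, summed over the output nodes. The names are illustrative; the real algorithm trains the candidate's incoming weights to maximize this quantity.

def candidate_score(candidate_activations, residual_errors):
    # candidate_activations: one activation per training pattern
    # residual_errors: per training pattern, one residual error per output node
    n = len(candidate_activations)
    mean_activation = sum(candidate_activations) / n
    score = 0.0
    for o in range(len(residual_errors[0])):
        mean_error = sum(errors[o] for errors in residual_errors) / n
        score += abs(sum((v - mean_activation) * (errors[o] - mean_error)
                         for v, errors in zip(candidate_activations, residual_errors)))
    return score  # the candidate with the highest score gets installed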

Cascade-Correlation Neural Networks: A Survey

  • Presents some approaches to reducing the problems with cascade correlation.
  • Sibling/descendant cascade attempts to stop networks growing too deep (a hypothetical sketch of the selection follows this list).
    • There are 2 pools of candidate neurons: siblings (only connected to neurons in previous layers) and descendants (connected as in normal cascade correlation).
    • Siblings often do get selected ahead of descendants, reducing the depth of the network.
  • Says Optimal Brain Damage is effective at pruning the network after the structure has been discovered, which helps with overfitting.
  • Knowledge-based CCNNs are another approach, where candidate nodes can be ordinary functions or fully fledged ANNs of their own. This approach sounds like a lot of fun; I would love more info on how successful it is.
  • Rule-based CCNNs are similar to the above, but operators like OR and AND are used as candidates. Again, it sounds like a lot of fun; I would love to know how it performs.
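For the sibling/descendant idea, the selection step might look something like this hypothetical Python sketch. The survey's actual rule for trading off the two pools may differ (for example, it could penalize descendants to favor shallower networks), so treat this purely as an illustration.

def pick_candidate(sibling_pool, descendant_pool, score):
    # siblings connect only to previous layers; descendants also connect to the current deepest layer
    best_sibling = max(sibling_pool, key=score)
    best_descendant = max(descendant_pool, key=score)
    if score(best_sibling) >= score(best_descendant):
        return best_sibling, "added to the current deepest layer"  # network stays shallow
    return best_descendant, "starts a new layer"  # network grows deeper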

Why there is a big unsupervised-learning-shaped hole in the universe

The problem I had when I first started reading about unsupervised learning was that I didn’t understand why it needed to exist. Supervised learning makes sense: you give a neural net an input and an output and tell it “find a way to get from one to the other”. Unsupervised learning doesn’t have that. You are just giving it some data and saying “do… something”. Even the diagrams were confusing: I see input nodes, I see hidden nodes, but what’s the output?
The best way to get across the need for unsupervised learning is to talk about the end goal. Across the first year of a baby’s life it learns a huge number of things: how to focus its eyes, how to coordinate its limbs, how to distinguish its parents from other adults, that objects are permanent. Until it learns to talk it gets almost no feedback about any of this, yet it still learns. Getting good training data is hard, so the idea is: wouldn’t it be great if we could set up a machine and it would just go off and learn on its own?
Unfortunately we are still a long way from that, and the technique shown here is trivial compared to that goal. But the goal is interesting enough to be worth pursuing. The answer to the question “what am I asking an unsupervised network to do?” is “learn the data”. The output will be a representation of the data that is simpler than the original. If the input is 10,000 pixels of an image, the output can be any smaller number of values. What a lot of the simpler unsupervised nets do is transform each input into a single number identifying a group of similar inputs. These groups are called clusters.

An example competitive learning neural net

A competitive learning neural net attempts to group its inputs into clusters. The code for it is really very simple. Here is all that is needed in the constructor (if you don’t like Python, it is also available in C#, Java and F#):
from random import uniform

class UnsupervisedNN(object):
    def __init__(self, size_of_input_arrays, number_of_clusters_to_group_data_into):
        # we have 1 hidden node for each cluster
        self.__connections = [[uniform(-0.5, 0.5) for j in range(number_of_clusters_to_group_data_into)]
                              for i in range(size_of_input_arrays)]
        self.__hidden_nodes = [0.0]*number_of_clusters_to_group_data_into
When we give it an input, it activates each hidden node with the weighted sum of the connections between the inputs and that hidden node. It makes more sense in code, like so:
def feed_forward(self, inputs):
    # We expect inputs to be an array of floats of length size_of_input_arrays.
    for hidden_node_index in range(len(self.__hidden_nodes)):
        activation = 0.0
        # each hidden node is activated from the inputs
        for input_index in range(len(self.__connections)):
            activation += inputs[input_index]*self.__connections[input_index][hidden_node_index]
        self.__hidden_nodes[hidden_node_index] = activation

    # now that we have activated all the hidden nodes, we check which has the highest activation
    # this node is the winner, and so the cluster we think this input belongs to
    return self.__hidden_nodes.index(max(self.__hidden_nodes))

So as it stands we have a method for randomly assigning data to clusters. To make it useful we need to improve the connections. There are many ways this can be done; in competitive learning, after you have selected a winner you make its connections more like that input. A good analogy: imagine we have 3 inputs, one for each color red, green and blue. If we get the color yellow, the red and green inputs fire. So after a winning node is selected, its connections to those inputs are strengthened, making future red-and-green items more likely to be considered part of the same cluster. But because there is no blue, the connection to the blue input is weakened:
def Train(self, inputs):
    winning_cluster_index = self.feed_forward(inputs)
    learning_rate = 0.1
    for input_index in range(len(self.__connections)):
        weight = self.__connections[input_index][winning_cluster_index]
        self.__connections[input_index][winning_cluster_index] = weight + learning_rate*(inputs[input_index]-weight)
A problem we can have here, though, is that a cluster can be initialized with terrible weights, such that nothing is ever assigned to it. To fix this, a penalty is added to each hidden node: whenever a hidden node is selected, its penalty grows, so that if one node keeps winning, the other nodes will eventually start getting selected. This penalty is also known as a conscience or bias.
To add a bias we just need to initialize an array in the constructor, one entry per cluster, plus a learning rate for it:

     self.__conscience = [0.0]*number_of_clusters_to_group_data_into
     self.__conscience_learning_rate = 0.05  # how fast the penalty grows; pick a small constant to taste
Then change our feed_forward to:

def feed_forward(self, inputs):
    for hidden_node_index in range(len(self.__hidden_nodes)):
        # start from the conscience so frequent winners are handicapped
        activation = self.__conscience[hidden_node_index]
        for input_index in range(len(self.__connections)):
            activation += inputs[input_index]*self.__connections[input_index][hidden_node_index]
        self.__hidden_nodes[hidden_node_index] = activation

    return self.__hidden_nodes.index(max(self.__hidden_nodes))

Then in training, every time a cluster wins, we just decrease its conscience:

     self.__conscience[winning_cluster_index] -= self.__conscience_learning_rate
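Putting it all together, here is a small usage example (assuming the pieces above are assembled into one class): three color channels grouped into two clusters.

net = UnsupervisedNN(size_of_input_arrays=3, number_of_clusters_to_group_data_into=2)
samples = [[1.0, 0.9, 0.0],  # yellow-ish: red + green
           [0.9, 1.0, 0.1],
           [0.0, 0.1, 1.0],  # blue-ish
           [0.1, 0.0, 0.9]]
for _ in range(50):
    for sample in samples:
        net.Train(sample)
print([net.feed_forward(s) for s in samples])  # e.g. [0, 0, 1, 1]: yellows together, blues together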
Competitive learning nets are nice, but they are still a long way from the goal of full unsupervised learning. In a future post I’m going to look at the Restricted Boltzmann Machine, which deep learning uses in the shallow layers to give us a simpler representation of an image to work with.
Full code is available on GitHub in Python, C#, Java and F#.

Particle Swarm Optimization in F# part 2 – Parallelizing

In the last post I gave an example of the particle swarm optimization algorithm in F#. F# has a few nice features, but the main reason I wanted to use it is that it makes writing multi-threaded applications so easy.

Multi-threaded PSO version 1

If I want my algorithm to run multi-threaded, all I have to do is take this line in the update_particles function:

let updated_particles = particles |> List.map (fun x -> update_particle args loss_func x global_best_params)   

And change it to:

 let updated_particles = particles |> List.map (fun x -> async { return update_particle args loss_func x global_best_params })
                                   |> Async.Parallel
                                   |> Async.RunSynchronously
                                   |> Array.toList

Like magic the whole application now runs in parallel, and because I have no mutable types I can guarantee there are no issues with cross-thread calls. There are a few things I don’t like about this though: it is 3 lines when really it should be one, and it creates an array that I then have to map back into a list. It is an annoying quirk of F# that it is much easier to run arrays in parallel (via Array.Parallel.map) than lists, but arrays are mutable, losing the guaranteed thread safety…
A bit of help from Stack Overflow yielded PSeq, a library for running parallel operations against F# sequences. With it the above can be rewritten as:

 let updated_particles = particles |> PSeq.map (fun x -> update_particle args loss_func x global_best_params)
                                   |> PSeq.toList

Multi-threaded PSO version 2

Now that we are running in parallel our speed is improved, but we still don’t use all cores as efficiently as possible. Each iteration of updating the particles waits for every particle to complete. This is reasonable if they all take the same amount of time, but let’s say the function could execute in 1 second or in 100 seconds: we always have to wait for the slowest particle to complete, and once only one is left we are running on a single thread.
A better alternative is to run the whole lifetime of each particle in parallel. The only piece of data that needs to travel between the particles is the global best parameters. This can be handled by passing it as a ref, with a getter so each particle reads the current global best at each step and a setter to update it whenever we have a new best value.
The changes we need to make for this are:
Remove the update_particles and run_until_stop_condition methods and replace them with this:

let rec private run_particle (args : Args) (particle : Particle) (get_global_best : unit -> Particle) check_particle_against_global_best loss_func iterations_to_run : Particle =
    // assumes stop_condition from the previous post is now defined at module level
    let updated_particle = update_particle args loss_func particle (get_global_best()).Parameters
    check_particle_against_global_best updated_particle

    let new_iterations_to_run = iterations_to_run - 1
    if stop_condition args new_iterations_to_run (get_global_best()).Local_best_loss then
        updated_particle
    else
        run_particle args updated_particle get_global_best check_particle_against_global_best loss_func new_iterations_to_run

The execute method needs to be modified to run run_particle in parallel:

 let execute (args : Args) (loss_func : float list -> float) (initial_weights : seq<float list>) =
    let particles = initial_weights |> Seq.take args.particles
                                    |> PSeq.map (fun w -> Particle(w, [for _ in 1 .. w.Length -> 0.0], w, loss_func w))
                                    |> Seq.toList

    let global_best = ref (particles |> List.minBy (fun x -> x.Local_best_loss))
    let monitor = new System.Object()
    let check_particle_against_global_best (particle : Particle) =
        lock monitor (fun () -> if particle.Local_best_loss < global_best.Value.Local_best_loss then
                                    global_best.contents <- particle)

    let final_particles = particles |> PSeq.map (fun x -> run_particle args x (fun () -> global_best.Value) check_particle_against_global_best loss_func args.iterations)
                                    |> PSeq.toList
    (global_best.Value.Local_best, global_best.Value.Local_best_loss, final_particles)

Now this looks a bit more like traditional C# multi-threaded code, and we now have the possibility of screwing it up, but hopefully we have contained the problem enough to be confident we haven’t. We keep a ref to the particle that is our global best, a monitor to lock against when updating it, and finally check_particle_against_global_best, to which we pass each new particle we create to see if it is an improvement.

Speed tests

Here is a method to write out the execution speed of an action:

 let speed_test action title =
    let iterations = 20
    let sw = System.Diagnostics.Stopwatch()
    let results =
        [for i in 0 .. iterations do
            System.GC.Collect()
            sw.Restart()
            action()
            sw.Stop()
            yield sw.Elapsed] |> List.sort
    let median_result = List.nth results (iterations/2)
    printfn "%s : %A" title median_result

The GC.Collect() forces the .NET framework to do a full garbage collection. If this isn’t done, the speed of one test can be affected by the memory used in the previous test. I’m taking the median time here, which I think is a better measure than the mean: whenever you run a set of speed tests there will always be a few that take ages because of spooky OS behavior, and the middle time ignores these occasional outliers.

Results

From my fairly old 4-core desktop:

                                               Single-threaded   Multi-threaded 1   Multi-threaded 2
  Fixed length function (time in seconds)           2.45              1.56               1.33
  Variable length function (time in seconds)        4.45              1.44               0.71

So it looks like Multi-threaded 2 gives a pretty decent improvement. Full code is here on GitHub.

Particle Swarm Optimization in F#

Why is particle swarm optimization good?

Let’s say you have a function that takes an array of inputs and produces a single output, and you have an objective: find the input that produces the lowest possible output. If the function were differentiable you could just compute the gradient and follow it down, but in this hypothetical scenario it’s not. If the range of inputs were quite small you could search the entire input space for the best output, but again, not in this scenario. If you’ve reached this point, particle swarm optimization (referred to as PSO from here on) might be a good idea.

When would you use it?

One example might be a function that runs a simulation of how an aircraft flies. The inputs are the wing span, tail length, the kind of engine used, etc. The output is how much fuel the aircraft uses in the simulation. Or perhaps you could have functions simulating an economic model.

What is it?

In PSO you have many “particles”, each of which executes the function with its own set of parameters. After each execution, each particle looks at the result it got and attempts to find better input parameters. It does this by looking both at its own best result, attempting to move towards that, and at the best result of all the other particles. In this way the particles search the input space while prioritizing successful areas. And with many particles searching, each with a different set of parameters, you hope to avoid the problem of local minima.

 velocity[time] = inertia_weight * velocity[time-1]
                + constant1 * random1 * (local_best_parameters - parameters[time-1])
                + constant2 * random2 * (global_best_parameters - parameters[time-1])
 parameters[time] = parameters[time-1] + velocity[time]

Here inertia_weight is just an array of numbers, most likely all between 0 and 1, used to dampen the previous velocity and stop the particles heading off into the middle of nowhere. Velocity is also often constrained by a maximum value. constant1 and constant2 control how much the particle is affected by the local and global bests; normally they are set to between 1 and 2, and random1 and random2 are fresh random draws from [0, 1).
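To make the update concrete, here is a one-dimensional worked example in Python (all the numbers are made up for illustration):

# one dimension of the velocity update above, with made-up numbers
inertia_weight = 0.8
c1, c2 = 2.0, 2.0
r1, r2 = 0.5, 0.25              # random draws from [0, 1)
previous_velocity = 1.0
previous_position = 4.0
local_best = 6.0                # best position this particle has seen
global_best = 10.0              # best position any particle has seen

velocity = (inertia_weight * previous_velocity
            + c1 * r1 * (local_best - previous_position)      # pull towards own best
            + c2 * r2 * (global_best - previous_position))    # pull towards swarm best
position = previous_position + velocity
print(velocity, position)       # 5.8 9.8: moved towards both bests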

Is there any intuitive way of understanding this?

The thing that works for me is imagining a collection of bugs, one for each particle. Each bug is trying to find some food in an area. They can’t see the food, but their sense of smell tells them how close to it they are: the smell is the loss function and their position is the function’s parameters. Imagine all the bugs spread out across the area, moving around it. Each iteration they reconsider their direction. Based on the smell, each asks: is my current position better than the best position I have ever been? If it isn’t, it adjusts its direction to move towards that best position. But these bugs are also social creatures with a hive mind, so each also asks how its current position compares to the best position any of the bugs has ever reached, and adjusts to move towards that too. Over time each bug’s personal best gets better and better, and the global best slowly improves too. Imagine them swarming across the area, gradually grouping more and more tightly around the successful points. Hopefully that makes things less, not more, confusing…

Why not use a genetic algorithm?

There is some overlap. Genetic algorithms more naturally fit discrete variables (e.g. booleans) while PSO fits better with continuous ones, though either can be modified to take the full range. I think when working with continuous variables PSO will turn out to be more efficient at finding good solutions, but this probably varies massively depending on the data. This is one of those annoying situations in machine learning where you have to try things and just see what looks like it’s working.

Why F#?

The best things about F# are how easy it is to write readable parallel code and the lightweight syntax. PSO can achieve massive benefits from running in parallel. Also, my day job is mostly working in C#, Java and Python, so F# makes a nice change of pace.

Show me the code

I wanted to write all the code in a pure functional style. This means aiming for all functions being deterministic: the output is solely dependent on the input, with no side effects or state. The PSO algorithm has a lot of settings, such as constant1, constant2, the inertia weight, etc. These could all be class variables, but because I wanted this to be pure they should all be passed as arguments to the functions that use them. That would mean a lot of messy, long signatures, so I made an Args type to hold all these settings.

 type Args(inertia_weight : float list, ?particles : int, ?iterations : int, ?max_velocity : float, ?success_threshold : float, ?c1 : float, ?c2 : float) =
    member this.inertia_weight = inertia_weight
    member this.particles = defaultArg particles 10
    member this.iterations = defaultArg iterations 100
    member this.max_velocity = defaultArg max_velocity 10.0
    member this.success_threshold = defaultArg success_threshold 0.0001
    member this.c1 = defaultArg c1 2.0
    member this.c2 = defaultArg c2 2.0

Next we have the Particle type. Note I went for lists as the type for all the fields. I didn’t want to use arrays because they are not immutable; sequences are the most functionally pure choice, but in reality they are quite a lot slower than lists. Also, StructuredFormatDisplay is used so particles print nicely.

 [<StructuredFormatDisplay("Particle {Parameters}")>]
 type Particle =
    val Parameters : float list
    val Velocity : float list
    val Local_best : float list
    val Local_best_loss : float
    new(parameters, velocity, local_best, local_best_loss) =
        { Parameters = parameters; Velocity = velocity; Local_best = local_best; Local_best_loss = local_best_loss }

Here is the main algorithm:

 let update_particle (args : Args) loss_func (particle : Particle) (global_best : float list) : Particle =
    let limit_velocity velocity max_velocity =
        velocity |> List.map (fun x -> if x > 0.0 then
                                           min x max_velocity
                                       else
                                           max x (-max_velocity))
    let random = new System.Random()
    let r1 = random.NextDouble()*args.c1
    let r2 = random.NextDouble()*args.c2

    let velocity = (List.map2 (fun w v -> w*v) args.inertia_weight particle.Velocity, // dampen the previous velocity by the inertia weight
                    List.map2 (fun l p -> r1*(l-p)) particle.Local_best particle.Parameters, // attraction of the local best
                    List.map2 (fun g p -> r2*(g-p)) global_best particle.Parameters) // attraction of the global best
                   |||> List.map3 (fun x y z -> x+y+z) // add the three components together
                   |> limit_velocity <| args.max_velocity // clamp to the maximum velocity

    let new_parameters = (particle.Parameters, velocity) ||> List.map2 (fun x y -> x + y)
    let new_loss = loss_func new_parameters

    if new_loss < particle.Local_best_loss then
        Particle(new_parameters, velocity, new_parameters, new_loss)
    else
        Particle(new_parameters, velocity, particle.Local_best, particle.Local_best_loss)

For PSO we need to run this over the full collection of particles, like so:

 let update_particles (args : Args) (particles : Particle list) (global_best_params : float list) (global_best_loss : float) loss_func : Particle list * float list * float =
    let updated_particles = particles |> List.map (fun x -> update_particle args loss_func x global_best_params)

    let best_from_this_iteration = updated_particles |> List.minBy (fun x -> x.Local_best_loss)

    if global_best_loss < best_from_this_iteration.Local_best_loss then
        (updated_particles, global_best_params, global_best_loss)
    else
        (updated_particles, best_from_this_iteration.Local_best, best_from_this_iteration.Local_best_loss)

And a method for running this loop until our stop condition is reached:

 let rec private run_until_stop_condition (args : Args) (particles : Particle list) (global_best_params : float list) (global_best_loss : float) loss_func iterations_to_run =
    let stop_condition (args : Args) iterations global_best_loss =
        iterations <= 0 || global_best_loss <= args.success_threshold

    let (new_particles, new_global_best_params, new_global_best_loss) = update_particles args particles global_best_params global_best_loss loss_func
    let new_iterations_to_run = iterations_to_run - 1

    if stop_condition args new_iterations_to_run new_global_best_loss then
        (new_global_best_params, new_global_best_loss, new_particles)
    else
        run_until_stop_condition args new_particles new_global_best_params new_global_best_loss loss_func new_iterations_to_run

Now example usage looks like this.

 let target_point = [23.0; 54.0]
 let loss_func (x : float list) = x |> List.map2 (fun x y -> sin 1.0/x - sin 1.0/y) target_point |> List.sum |> abs
 let random = new System.Random()
 let initial_weights = seq { while true do yield [random.NextDouble()*1000.0; random.NextDouble()*1000.0] }

 let args = Args(inertia_weight=[0.8; 0.8], particles=10, success_threshold=0.01, iterations=10)
 let (global_best_params, global_best_loss, particles) = execute args loss_func initial_weights;;

Here is a link to the full code on GitHub. In the next post I look at options for parallelizing the code.