Presenting WikiDataDotNet – Client API for WikiData

WikiData

WikiData is one of those things that sets the mind boggling at the possibilities of the internet. It’s a project, started by the Wikimedia Foundation, to collect structured data on everything. If you are doing anything related to machine learning, it is the best source of data I have found so far.
It aims to contain an item on everything, and for each item a collection of statements describing aspects of it and its relationships to other items. Everything makes more sense with an example, so here is its record on the item Italy, which can be fetched from the API like so:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q38
This will return a JSON file with sections like:
       "id": "Q38",
"labels": {
"en": {
"language": "en",
"value": "Italy"
},
Here we see the id of the item, in this case Q38, which is used for looking Italy up. Then labels contains the name of Italy in each language. Further down there is also an aliases section that contains alternate names for Italy in every language.
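If you wanted to pull values like these out of the raw response yourself, here is a minimal sketch, assuming the Newtonsoft.Json package and hard-coding the "Q38" id purely for illustration (WikiDataDotNet wraps this sort of plumbing for you):

open Newtonsoft.Json.Linq

// Read the English label out of a wbgetentities response.
let english_label (json : string) =
    let root = JObject.Parse(json)
    root.["entities"].["Q38"].["labels"].["en"].["value"].ToString()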
Further down we get to the really interesting stuff: claims.
          "P36": [  
{
"mainsnak": {
"snaktype": "value",
"property": "P36",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 220
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
},
"type": "statement",
"qualifiers": {
"P580": [
These are a series of statements about different aspects of the item. For example, the P36 entry above is a claim about what the capital of Italy is. Properties such as P36 are themselves entities in the API, so they can also be looked up like so: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=P36
mainsnak is the main statement associated with this claim (a snak in WikiData is any basic assertion that can be made about an item). These all have a value and a type. In this case, the claim about Italy’s capital, the value is a reference to another wiki entity, which can again be looked up from WikiData if you append a Q to the beginning of the numeric id. You may have already worked out what the entity here is: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q220
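Building that lookup URL from a claim’s numeric id is purely mechanical. A tiny sketch (entity_url is a hypothetical helper, not part of WikiDataDotNet):

// Turn the numeric id from a claim's datavalue into a wbgetentities URL.
let entity_url (numeric_id : int) =
    sprintf "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q%d" numeric_id

let italys_capital_url = entity_url 220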
Other claims on Italy include its location, who it shares a border with, public holidays, provinces, basic form of government, head of state, population (across history), head of government... the list is endless (no wait, actually it’s 64 entries long).

Presenting WikiDataDotNet

I’ve been working on a project that needed to query WikiData from .NET. The only existing .NET API for this I could find is Wikibase.NET, intended for writing wiki bots. It hasn’t been updated in a while, and unfortunately a quick test reveals it no longer works. At a future date I may fix it up, but in the meantime I’ve created this quick query-only API: WikiDataDotNet

Usage

It currently provides the ability to request entities:
F#
 let italy = WikiDataDotNet.Request.request_entity "Q38"   
C#
 var italy = WikiDataDotNet.Request.request_entity("Q38");  
and do a text search against WikiData:
F#
 let search_result = WikiDataDotNet.Request.search "en" "Headquarters of the U.N"
C#
 var searchResult = WikiDataDotNet.Request.search("en", "Headquarters of the U.N");  
That’s it for functionality so far. My next plans are to make it easier to look up claims against items and to cache claims. Some kind of LINQ-style querying interface would also be nice.

Particle Swarm Optimization in F# part 2 – Parallelizing

In the last post I gave an example of the particle swarm optimization algorithm in F#. F# has a few nice features, but the main reason I wanted to use it is that it is so easy to write multi-threaded applications with it.

Multi-threaded PSO version 1

If I want my algorithm to run multi-threaded, all I have to do is take this line in the update_particles function:

let updated_particles = particles |> List.map (fun x -> update_particle args loss_func x global_best_params)   

And change it to:

let updated_particles = particles |> List.map (fun x -> async { return update_particle args loss_func x global_best_params })
                        |> Async.Parallel
                        |> Async.RunSynchronously
                        |> Array.toList

Like magic, the whole application now runs in parallel, and because I have no mutable types I can guarantee there are no issues with cross-thread calls. There are a few things I don’t like about this though: it is 3 lines when really it should be one, and it creates an array that I then have to map back into a list. It is an annoying quirk of F# that it is much easier to run arrays in parallel, via the Array.Parallel.map function, than lists, but arrays are mutable, losing the guaranteed thread safety…
A bit of help from Stack Overflow yielded PSeq, a library allowing you to run parallel operations against F# sequences. So the above can be rewritten as:

let updated_particles = particles |> PSeq.map (fun x -> update_particle args loss_func x global_best_params)
                        |> PSeq.toList
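For reference, PSeq isn’t part of the core F# library; these days it ships in the FSharp.Collections.ParallelSeq NuGet package, so the snippet above assumes:

open FSharp.Collections.ParallelSeq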

Multi-threaded PSO version 2

Now running in parallel, our speed is improved, but we still don’t use all cores as efficiently as possible. Each iteration of updating the particles waits for every particle to complete. This is reasonable if they all take the same amount of time, but let’s say the function is something that could execute in 1 second or in 100 seconds. We always have to wait for the longest one to complete, and once only one particle is left we are running on a single thread.
A better alternative is to run the whole lifetime of each particle in parallel. The only piece of data that needs to travel between the particles is the global best parameters. This can be handled by holding these in a ref, with getter and setter functions, so each particle always reads the current global best when it starts an iteration and updates it whenever it finds a new best value.
The changes we need to make for this are:
Remove the update_particles and run_until_stop_condition methods and replace them with this:

// stop_condition is the same check as in the last post, now at module level.
let private stop_condition (args : Args) iterations global_best_loss =
    iterations <= 0 || global_best_loss <= args.success_threshold

let rec private run_particle (args : Args) (particle : Particle) (get_global_best : unit -> Particle) check_particle_against_global_best loss_func iterations_to_run : Particle =
    let updated_particle = update_particle args loss_func particle (get_global_best()).Parameters
    check_particle_against_global_best updated_particle

    let new_iterations_to_run = iterations_to_run - 1
    if stop_condition args iterations_to_run (get_global_best()).Local_best_loss then
        updated_particle
    else
        run_particle args updated_particle get_global_best check_particle_against_global_best loss_func new_iterations_to_run

The execute method needs to be modified to run run_particle in parallel:

let execute (args : Args) (loss_func : list<float> -> float) (initial_weights : seq<list<float>>) =
    let particles = initial_weights |> Seq.take args.particles
                    |> PSeq.map (fun w -> Particle(w, [for _ in 1 .. w.Length -> 0.0], w, loss_func w))
                    |> Seq.toList

    let global_best = ref (particles |> List.minBy (fun x -> x.Local_best_loss))
    let get_global_best () = global_best.Value
    let monitor = new System.Object()
    let check_particle_against_global_best (particle : Particle) =
        lock monitor (fun () -> if particle.Local_best_loss < global_best.Value.Local_best_loss then
                                    global_best.contents <- particle)

    let final_particles = particles |> PSeq.map (fun x -> run_particle args x get_global_best check_particle_against_global_best loss_func args.iterations)
                          |> PSeq.toList
    (global_best.Value.Local_best, global_best.Value.Local_best_loss, final_particles)

Now this looks a bit more like traditional C# multi-threaded code, and we now have the possibility of screwing it up. But hopefully we have contained the problem enough to be confident we haven’t. We keep a ref to the particle that is our global best, a monitor to lock against when updating, and finally our check_particle_against_global_best function, to which we pass each new particle we create to see if it is an improvement.

Speed tests

Here is a method to write out the execution speed of an action:

let speed_test action title =
    let iterations = 20
    let sw = System.Diagnostics.Stopwatch()
    let results =
        [for i in 0 .. iterations do
            System.GC.Collect()
            sw.Restart()
            action()
            sw.Stop()
            yield sw.Elapsed] |> List.sort
    let median_result = List.nth results (iterations/2)
    printfn "%s : %A" title median_result

The GC.Collect() forces the .NET framework to do a full garbage collection. If this isn’t done, the speed of one test can be affected by the memory used in the previous test. I’m taking the median time here, which I think is a better measure than the mean. Whenever you run a set of speed tests there will always be a few that take ages because of spooky OS behavior. The middle time ignores these occasional outliers.
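As an example of calling it, assuming hypothetical run_single_threaded and run_multi_threaded wrappers that execute each version of the optimizer with fixed arguments:

speed_test (fun () -> run_single_threaded() |> ignore) "Single threaded"
speed_test (fun () -> run_multi_threaded() |> ignore) "Multi-threaded 2"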

Results

From my fairly old 4-core desktop:

                                            Single-threaded   Multi-threaded 1   Multi-threaded 2
Fixed-length function, time in seconds      2.45              1.56               1.33
Variable-length function, time in seconds   4.45              1.44               0.71

So it looks like Multi-threaded 2 gives a pretty decent improvement. Full code is here on GitHub.

Particle Swarm Optimization in F#

Why is particle swarm optimization good?

Let’s say you have a function that takes an array of inputs and produces a single output. You have an objective: you want to find which input results in the lowest possible output for this function. If the function were differentiable you could just compute the gradient and follow it back to reach your objective, but in this hypothetical scenario it’s not. If the range of inputs were quite small you could just search the entire input space to find the best output, but again, not in this scenario. If you’ve reached this point, this is when particle swarm optimization (referred to as PSO from here on) might be a good idea.

When would you use it?

One example might be a function that runs a simulation of how an aircraft flies. The inputs are the wing span, tail length, the different kinds of engines that can be used, etc. The output is how much fuel the aircraft uses in the simulation. Or perhaps you could have functions simulating an economic model.

What is it?

In PSO you have many “particles”, each of which executes the function with its own set of parameters. After each execution, each particle looks at the result it got and attempts to find better input parameters. It does this by looking both at its own best result, attempting to move towards that, and at the best result of all the other particles. In this way the particles search through the input space, but in a way that prioritizes successful areas. Also, with many particles searching, each with a different set of parameters, you hope to avoid the problem of local minima.

velocity[time] = inertia_weight * velocity[time-1]
                 + constant1 * random1 * (local_best_parameters - parameters[time-1])
                 + constant2 * random2 * (global_best_parameters - parameters[time-1])
parameters[time] = parameters[time-1] + velocity[time]

Here inertia_weight is just an array of numbers, most likely all between 0 and 1, used to dampen the previous velocity and stop the particles heading off into the middle of nowhere. The velocity is also often constrained by a maximum value. constant1 and constant2 are used to control how much the particle is affected by the local and global bests; normally they are set to between 1 and 2.
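To make the update concrete, here is a one-dimensional worked example with made-up values (in the real algorithm random1 and random2 are fresh random draws on every step):

let inertia_weight = 0.8
let constant1, constant2 = 2.0, 2.0
let random1, random2 = 0.5, 0.25

let previous_velocity = 1.0
let previous_position = 10.0
let local_best = 12.0  // best position this particle has seen
let global_best = 8.0  // best position any particle has seen

let velocity =
    inertia_weight * previous_velocity                        // 0.8
    + constant1 * random1 * (local_best - previous_position)  // +2.0, pulled towards its own best
    + constant2 * random2 * (global_best - previous_position) // -1.0, pulled towards the swarm's best
let new_position = previous_position + velocity               // 10.0 + 1.8 = 11.8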

Is there any intuitive way of understanding this?

The thing that works for me is imagining a collection of bugs, one for each particle. Each bug is trying to find some food in an area. They can’t see the food, but their sense of smell tells them how close to it they are: the smell is the loss function, and their position is the set of parameters for it. Imagine all the bugs spread out across the area, all moving around it. Each iteration they reconsider what direction they are going in. Based on their smell, each asks: is my current position better than the best position I have ever been in? If it isn’t, it adjusts its direction to move back towards that best. But these bugs are also social creatures with a hive mind, so each also asks: how does my current position compare to the best position any of the bugs has ever reached? And it adjusts to move towards that too. Over time each bug’s personal best gets better and better, and the global best is also slowly improving. Imagine them swarming across the area, gradually grouping more and more tightly around the successful points. Hopefully that makes things less, not more, confusing…

Why not use a genetic algorithm?

There is some overlap. Genetic algorithms fit more naturally with discrete variables (e.g. bools), while PSO fits better with continuous ones, but either can be modified to handle the full range. I think when working with continuous variables PSO will work out to be more efficient at finding good solutions, but this probably varies massively depending on the data. This is one of those annoying situations in machine learning where you have to try things and just see what looks like it’s working.

Why F#?

The best things about F# are how easy it is to write readable parallel code and the lightweight syntax. PSO can achieve massive benefits from running in parallel. Also, my day job is mostly working in C#, Java and Python, so F# makes a nice change of pace.

Show me the code

I wanted to write all the code in a pure functional style. This means aiming for all functions being deterministic: the output is solely dependent on the input, with no side effects or state. The PSO algorithm has a lot of variables, such as constant1, constant2, the inertia weight, etc. These could all be set as class variables, but because I wanted this to be pure they should all be passed as arguments to the functions that use them. That would mean a lot of messy, long function signatures, so I made an Args type to hold all these variables.

type Args(inertia_weight : list<float>, ?particles : int, ?iterations : int, ?max_velocity : float, ?success_threshold : float, ?c1 : float, ?c2 : float) =
    member this.inertia_weight = inertia_weight
    member this.particles = defaultArg particles 10
    member this.iterations = defaultArg iterations 100
    member this.max_velocity = defaultArg max_velocity 10.0
    member this.success_threshold = defaultArg success_threshold 0.0001
    member this.c1 = defaultArg c1 2.0
    member this.c2 = defaultArg c2 2.0
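Because most of the arguments are optional, a caller only supplies what it cares about, for example:

let args = Args(inertia_weight=[0.8; 0.8], particles=20) // iterations, max_velocity, etc. fall back to their defaults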

Next we have the particle class. Note I went with lists as the type for all the variables. I didn’t want to use arrays because they are not immutable. Sequences are the most functionally pure choice, but in reality they are quite a lot slower than lists. Also, StructuredFormatDisplay is used so the particles print nicely.

[<StructuredFormatDisplay("{Parameters}")>] // the format string was lost from the original post; "{Parameters}" is an assumed stand-in
type Particle =
    val Parameters : list<float>
    val Velocity : list<float>
    val Local_best : list<float>
    val Local_best_loss : float
    new(parameters, velocity, local_best, local_best_loss) =
        { Parameters = parameters; Velocity = velocity; Local_best = local_best; Local_best_loss = local_best_loss }
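As an illustration, here is a particle starting at position [1.0; 2.0] with zero velocity, whose starting point is also its best so far (loss_func here is whatever function is being minimized):

let p = Particle([1.0; 2.0], [0.0; 0.0], [1.0; 2.0], loss_func [1.0; 2.0])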

Here is the main algorithm:

let update_particle (args : Args) loss_func (particle : Particle) (global_best : list<float>) : Particle =
    let limit_velocity velocity max_velocity =
        velocity |> List.map (fun x -> if x > 0.0 then
                                           min x max_velocity
                                       else
                                           max x (-max_velocity))
    let random = new System.Random()
    let r1 = random.NextDouble()*args.c1
    let r2 = random.NextDouble()*args.c2

    let velocity = (List.map2 (fun w v -> w*v) args.inertia_weight particle.Velocity, // multiply the last velocity by the inertia weight
                    List.map2 (fun l p -> r1*(l-p)) particle.Local_best particle.Parameters, // get the attraction of the local best
                    List.map2 (fun g p -> r2*(g-p)) global_best particle.Parameters) // get the attraction of the global best
                   |||> List.map3 (fun x y z -> x+y+z) // add the results of these 3 calculations together
                   |> limit_velocity <| args.max_velocity // limit velocity by the max

    let new_parameters = (particle.Parameters, velocity) ||> List.map2 (fun x y -> x + y)
    let new_loss = loss_func new_parameters

    if new_loss < particle.Local_best_loss then
        Particle(new_parameters, velocity, new_parameters, new_loss)
    else
        Particle(new_parameters, velocity, particle.Local_best, particle.Local_best_loss)

For PSO we need to run this on a full collection of particles, like so:

let update_particles (args : Args) (particles : list<Particle>) (global_best_params : list<float>) (global_best_loss : float) loss_func : list<Particle> * list<float> * float =
    let updated_particles = particles |> List.map (fun x -> update_particle args loss_func x global_best_params)

    let best_from_this_iteration = updated_particles |> List.minBy (fun x -> x.Local_best_loss)

    if global_best_loss < best_from_this_iteration.Local_best_loss then
        (updated_particles, global_best_params, global_best_loss)
    else
        (updated_particles, best_from_this_iteration.Local_best, best_from_this_iteration.Local_best_loss)

And a method for running this loop until our stop condition is reached:

let rec private run_until_stop_condition (args : Args) (particles : list<Particle>) (global_best_params : list<float>) (global_best_loss : float) loss_func iterations_to_run =
    let stop_condition (args : Args) iterations global_best_loss =
        iterations <= 0 || global_best_loss <= args.success_threshold

    let (new_particles, new_global_best_params, new_global_best_loss) = update_particles args particles global_best_params global_best_loss loss_func
    let new_iterations_to_run = iterations_to_run - 1

    if stop_condition args iterations_to_run new_global_best_loss then
        (new_global_best_params, new_global_best_loss, new_particles)
    else
        run_until_stop_condition args new_particles new_global_best_params new_global_best_loss loss_func new_iterations_to_run

Now, example usage looks like this:

let target_point = [23.0; 54.0]
let loss_func (x : list<float>) = x |> List.map2 (fun x y -> sin 1.0/x - sin 1.0/y) target_point |> List.sum |> abs
let random = new System.Random()
let initial_weights = seq { while true do yield [random.NextDouble()*1000.0; random.NextDouble()*1000.0] }

let args = Args(inertia_weight=[0.8; 0.8], particles=10, success_threshold=0.01, iterations=10)
let (global_best_params, global_best_loss, particles) = execute args loss_func initial_weights;;

Here is a link to the full code on GitHub. In the next post I look at options for parallelizing the code.