DANIEL SLATER'S BLOG

Presenting WikiDataDotNet – Client API for WikiData

WikiData

WikiData is one of those things that sets the mind boggling at the possibilities of the internet. It’s a project, started by the Wikimedia Foundation, to collect structured data on everything. If you are doing anything related to machine learning, it is the best source of data I have found so far.

It aims to contain an item on everything and, for each item, a collection of statements describing aspects of it and its relationship to other items. Everything makes more sense with an example, so here is its record on the item Italy, which can be found in the API like so:

https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q38

This will return a JSON file with sections like:

       "id": "Q38",
       "labels": {  
          "en": {  
           "language": "en",
           "value": "Italy"
         },

Here we see the id of the item, in this case Q38, which is used for looking Italy up. Then labels contains the name of Italy in each language. Further down there is also a section, aliases, that contains alternate names for Italy in every language.
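
As a rough sketch of reading that record from code (in Python; the helper names entity_url and label_for are made up for illustration, and format=json is added to the query so the API returns plain JSON), pulling the English label out could look something like this:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def entity_url(entity_id):
    """Build the wbgetentities URL for one item or property id."""
    query = urlencode({"action": "wbgetentities", "ids": entity_id, "format": "json"})
    return API_ENDPOINT + "?" + query


def label_for(entity, language="en"):
    """Return the label of an entity record in one language, or None if absent."""
    label = entity.get("labels", {}).get(language)
    return label["value"] if label else None


# The fragment shown above, trimmed to the parts used here.
italy = {"id": "Q38", "labels": {"en": {"language": "en", "value": "Italy"}}}

print(entity_url("Q38"))
print(label_for(italy))  # Italy
```

The sample record is typed in by hand rather than downloaded, so the snippet runs without network access.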

Further down we get to the really interesting stuff, claims.

          "P36": [  
           {  
             "mainsnak": {  
               "snaktype": "value",  
               "property": "P36",  
               "datavalue": {  
                 "value": {  
                   "entity-type": "item",  
                   "numeric-id": 220  
                 },  
                 "type": "wikibase-entityid"  
               },  
               "datatype": "wikibase-item"  
             },  
             "type": "statement",  
             "qualifiers": {  
               "P580": [

These are a series of statements about the different aspects of the item. For example, the above P36 is a claim about what the capital of Italy is. The properties that claims are made about are also entities in the API, so they can be looked up in the same way: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=P36

mainsnak is the main statement associated with this claim (a Snak in WikiData is any basic assertion that can be made about an item). These all have a value and a type. In this case, the claim about Italy’s capital, the value is a reference to a wiki entry, which can again be looked up from WikiData if you put a Q in front of the numeric id. You may have already worked out what the entity here is: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q220
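
To make the "put a Q in front of the numeric id" step concrete, here is a small Python sketch that walks the claims section and collects the item ids a property points at. The function name claim_targets is hypothetical, and the sample data is just the P36 fragment from above with the qualifiers left out:

```python
def claim_targets(claims, property_id):
    """Return the Q ids that an item's claims for one property point at."""
    targets = []
    for claim in claims.get(property_id, []):
        snak = claim["mainsnak"]
        # Snaks of type "novalue" or "somevalue" carry no datavalue, so skip them.
        if snak.get("snaktype") != "value":
            continue
        datavalue = snak["datavalue"]
        if datavalue["type"] == "wikibase-entityid":
            targets.append("Q" + str(datavalue["value"]["numeric-id"]))
    return targets


claims = {
    "P36": [
        {
            "mainsnak": {
                "snaktype": "value",
                "property": "P36",
                "datavalue": {
                    "value": {"entity-type": "item", "numeric-id": 220},
                    "type": "wikibase-entityid",
                },
                "datatype": "wikibase-item",
            },
            "type": "statement",
        }
    ]
}

print(claim_targets(claims, "P36"))  # ['Q220']
```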

Other claims on Italy include location, who it shares a border with, public holidays, provinces, basic form of government, head of state, population (across history), head of government; the list is endless (no wait, actually it’s 64 entries long).

Presenting WikiDataDotNet

I’ve been working on a project that needed to query against WikiData from .Net. The only existing .Net API for this I could find is Wikibase.NET, which is for writing wiki bots. It hasn’t been updated in a while and unfortunately a quick test reveals it no longer works. At a future date I may fix it up, but in the meantime I’ve created this quick query-only API: WikiDataDotNet

Usage

It currently provides the ability to request entities:

F#

 let italy = WikiDataDotNet.Request.request_entity "Q38"

C#

 var italy = WikiDataDotNet.Request.request_entity("Q38");

and do a text search against WikiData:

F#

 let search_result = WikiDataDotNet.Request.search "Headquarters of the U.N"

C#

 var searchResult = WikiDataDotNet.Request.search("en", "Headquarters of the U.N");
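
For readers who want to try the same kind of text search against the raw API, the WikiData API has a wbsearchentities action that takes a search string and a language. Whether WikiDataDotNet calls that action internally is an assumption on my part, and search_url below is a made-up helper name. A minimal Python sketch of building the request:

```python
from urllib.parse import urlencode


def search_url(text, language="en", limit=10):
    """Build a wbsearchentities URL for a free-text search in one language."""
    query = urlencode({
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "limit": limit,
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + query


print(search_url("Headquarters of the U.N"))
```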

That’s it for functionality so far. My next plans are to make it easier to look up Claims against items and do caching of Claims. Also maybe some kind of LINQ style querying interface would be nice.

Unit Testing

How do I get my team to start unit testing

A team lead recently asked me (this genuinely happened, this isn’t just a rhetorical trick), “How do I get my team to start unit testing?” Which sounds like a great title for a blog post…

In my opinion, the task of getting a team to write unit tests is really the task of getting each programmer to believe it is in their best interests to write unit tests. There are plenty of tools, such as SonarQube, that give technical feedback on unit test coverage, but without the team buying in they won’t achieve much. It is very easy, and of little benefit, to do unit testing badly. So, like a good salesman, you need to sell them on why they will benefit from taking the extra time to unit test their already perfectly acceptable code (as they see it; if they don’t believe the code they are currently writing is acceptable then there are other problems).

There are many reasons a person should unit test. Some reasons are noble and good, to do with doing the best job you can, for your company and your fellow professional. But that doesn’t work for everyone, so for those less nobly inclined there are also selfish reasons that are still valid.

The noble reasons:

Next level success: 1st level success, someone reports a bug, you fix it. Next level success, someone reports a bug, you fix it and you write tests that mean no one in the future can reintroduce this bug. Unit testing allows you to future-proof your code.
Quick feedback: One of the biggest factors in your ability to improve in any activity is your feedback loop. If you want to get good at chess and you are playing against a good player, they can tell you “that move was bad” immediately after you make a bad move. Otherwise you may have to play the rest of the game, and then lose a number of similar games, before you work out it was that particular type of move that was the mistake. Unit testing allows you to get much quicker feedback. When you make a change to an application, you don’t have to run it, set up the scenario by hand, check for correct behavior, and then repeat for each different scenario. With unit testing you can get feedback on multiple scenarios across the application in under 10 seconds.
Encourages good design: There have been loads of articles written by better writers than me on this subject. Good application design goes hand in hand with designing for unit testing. Separation of concerns, single responsibility principle, dependency injection, etc.

The less than noble reasons:

Plausible deniability: If something goes wrong anywhere near their code they can point to the unit tests and say “well I know my code works, it must be someone else’s problem”. This has happened to me. I was asked to write some code that displayed the number of business days old a certain item was. When it started displaying -1 days old in prod, I could take their inputs, put them into my unit test and show that my code was correct (the problem turned out to be that we were being sent items from the future due to incorrect date conversion further upstream).
Your future employability: Nowadays unit testing is so widespread that you will be asked a question about it in most interviews. You may not care so much about how you do in this job, but don’t you want to be applying best practice so you can get that shiny new future job?
Holding on to requirements: User A asks for a feature to work in a particular way. You make the change and put it into prod, then User B comes to you to complain: he had asked for the feature to work the way it did before, and now it doesn’t work for him. Unit tests can remove a lot of these kinds of problems because you can mark each unit test with who requested the functionality. Unit tests are a great way to capture requirements permanently and raise these kinds of conflicting requests earlier.
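
To give a flavor of what such tests look like, here is a small Python sketch loosely based on the business days story. It is not the code from that project: the function business_days_old, its rules (weekends skipped, holidays ignored) and the dates are all made up for illustration. The second test pins down what happens with an item dated in the future, and the comment records who asked for the behavior.

```python
from datetime import date, timedelta


def business_days_old(created, today):
    """Count weekdays between created and today (negative if created is in the future)."""
    if created > today:
        return -business_days_old(today, created)
    count = 0
    current = created
    while current < today:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, so 5 and 6 are the weekend
            count += 1
    return count


def test_item_created_a_week_ago_is_five_business_days_old():
    # Requested by: User A (hypothetical). 1 June 2015 and 8 June 2015 are both Mondays.
    assert business_days_old(date(2015, 6, 1), date(2015, 6, 8)) == 5


def test_item_from_the_future_has_negative_age():
    # Regression test: documents what the code does when upstream sends a future date.
    assert business_days_old(date(2015, 6, 2), date(2015, 6, 1)) == -1
```

Tests written as plain functions like these can be picked up by a runner such as pytest, or called directly.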

So now the team are fully behind the plan and raring to go. Well, probably not immediately. In teams I’ve been involved with it takes a good few months of pushing these points, including making sure that unit test coverage percentages are reviewed, committed code is reviewed and unit tests are always required as a part of it. People need to see the benefit from doing increased testing and this may take time and energy. But over time it will happen if you’re persistent.

Machine Learning, Python

Why there is a big unsupervised learning shaped hole in the universe

The problem I had when I first started reading about unsupervised learning was that I didn’t understand why it needed to exist. Supervised learning makes sense: you give a neural net an input and an output and tell it “find a way to get from one to the other”. Unsupervised learning doesn’t have that. You are just giving it some data and saying “do… something”. Even the diagrams were confusing: I see input nodes, I see hidden nodes, what’s the output?

The best way to get across the need for unsupervised learning is to talk about the end goal. Across the first year of a baby’s life it learns a huge number of things: how to focus its eyes, how to coordinate its limbs, how to distinguish its parents from other adults, that objects are permanent things. Until it learns to talk it is getting almost no feedback about any of this, yet it still learns. Getting good training data is hard, so the idea is: wouldn’t it be great if we could set up a machine and it would just go off and learn on its own.

Unfortunately, so far we are a long way from that and the technique shown here seems trivial compared to that goal. But the goal is interesting enough that it is worth pursuing. The answer to the question “what am I asking an unsupervised network to do?” is “learn the data”. The output will be a representation of the data that is simpler than the original. If the input is 10,000 pixels of an image, the output can be any smaller number of values. What a lot of the simpler unsupervised nets do is transform the input into a single number that identifies a group of similar inputs. These groups are called clusters.

An example competitive learning neural net

A competitive learning neural net attempts to group its inputs into clusters. The code for it is really very simple. Here is all that is needed in the constructor (if you don’t like Python it is also available in C#, Java and F#):

from random import uniform

class UnsupervisedNN(object):
   def __init__(self, size_of_input_arrays, number_of_clusters_to_group_data_into):
     #we have 1 hidden node for each cluster
     self.__connections = [[uniform(-0.5, 0.5) for j in range(number_of_clusters_to_group_data_into)] for i in range(size_of_input_arrays)]  
     self.__hidden_nodes = [0.0]*number_of_clusters_to_group_data_into

When we give it an input, it will activate the hidden nodes based on the sum of the connections between that input and each hidden node. It makes more sense in code, like so:

def feed_forward(self, inputs):  
     #We expect inputs to be an array of floats of length size_of_input_arrays.
     for hidden_node_index in range(len(self.__hidden_nodes)):  
       activation = 0.0
       #each hidden node will be activated from the inputs.  
       for input_index in range(len(self.__connections)):  
         activation += inputs[input_index]*self.__connections[input_index][hidden_node_index]  

       self.__hidden_nodes[hidden_node_index] = activation

     #now we have activated all the hidden nodes we check which has the highest activation
     #this node is the winner and so the cluster we think this input belongs to
     return self.__hidden_nodes.index(max(self.__hidden_nodes))

So as it stands we have a method for randomly assigning data to clusters. To make it something useful we need to improve the connections. There are many ways this can be done; in competitive learning, after you have selected a winner you make its connections more like that input. A good analogy: imagine we have 3 inputs, one for each of the colors red, green and blue. If we get the color yellow, the active inputs are red and green. So after a winning node is selected its connections to those colors are increased, so future red and green items are more likely to be considered a part of the same cluster. But because there is no blue, the connection to this is weakened:

def Train(self, inputs):
     winning_cluster_index = self.feed_forward(inputs)
     learning_rate = 0.1
     for input_index in range(len(self.__connections)):
       weight = self.__connections[input_index][winning_cluster_index]
       #move the winner's weight a little closer to this input
       self.__connections[input_index][winning_cluster_index] = weight + learning_rate*(inputs[input_index]-weight)
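
To see the update rule with actual numbers, here is a tiny standalone Python sketch of the yellow example. The starting weights of 0.2 are picked arbitrarily for illustration and move_towards is a made-up helper name. Each weight moves 10% of the way towards its input, so the red and green weights rise and the blue one falls:

```python
def move_towards(weights, inputs, learning_rate=0.1):
    """Move each weight a fraction of the way towards the matching input."""
    return [w + learning_rate * (x - w) for w, x in zip(weights, inputs)]


yellow = [1.0, 1.0, 0.0]          # red and green on, blue off
winner_weights = [0.2, 0.2, 0.2]  # the winning node's connections before training

updated = move_towards(winner_weights, yellow)
print([round(w, 2) for w in updated])  # [0.28, 0.28, 0.18]
```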

A problem we can have here, though, is that a cluster can be initialized with terrible weights, such that nothing is ever assigned to it. In order to fix this, a penalty is added to each hidden node. Whenever a hidden node is selected its penalty is increased, so that over time, if a node keeps winning, the other nodes will eventually start getting selected. This penalty is also known as a conscience or bias.

To add a bias we just need to initialize an array in the constructor for each cluster

     self.__conscience = [0.0]*number_of_clusters_to_group_data_into

Change our feed forward to


def feed_forward(self, inputs):
     for hidden_node_index in range(len(self.__hidden_nodes)):  
       activation = self.__conscience[hidden_node_index]
       for input_index in range(len(self.__connections)):
         activation += inputs[input_index]*self.__connections[input_index][hidden_node_index]

       self.__hidden_nodes[hidden_node_index] = activation

     return self.__hidden_nodes.index(max(self.__hidden_nodes))

Then in training we just make a small subtraction every time a cluster wins

     self.__conscience[winning_cluster_index] -= self.conscience_learning_rate
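
Putting the pieces together, here is a compact Python sketch of the whole net with the conscience term. It is a condensed rewrite for illustration rather than the code from the repository: the class name CompetitiveNet, the seed parameter and the default conscience_learning_rate of 0.01 are my own assumptions. The usage at the bottom feeds the same input over and over to show the conscience at work: whichever node wins first eventually gets penalized enough that the other node takes over.

```python
from random import Random


class CompetitiveNet:
    """Competitive learning net with a conscience term (illustrative sketch)."""

    def __init__(self, input_size, cluster_count, learning_rate=0.1,
                 conscience_learning_rate=0.01, seed=None):
        rng = Random(seed)
        # connections[input_index][cluster_index]; one hidden node per cluster
        self.connections = [[rng.uniform(-0.5, 0.5) for _ in range(cluster_count)]
                            for _ in range(input_size)]
        self.conscience = [0.0] * cluster_count
        self.learning_rate = learning_rate
        self.conscience_learning_rate = conscience_learning_rate

    def feed_forward(self, inputs):
        """Return the index of the cluster with the highest activation."""
        activations = []
        for cluster in range(len(self.conscience)):
            # start from the (zero or negative) conscience penalty
            activation = self.conscience[cluster]
            for input_index, value in enumerate(inputs):
                activation += value * self.connections[input_index][cluster]
            activations.append(activation)
        return activations.index(max(activations))

    def train(self, inputs):
        """Pull the winner's weights towards the input and penalize the winner."""
        winner = self.feed_forward(inputs)
        for input_index, value in enumerate(inputs):
            weight = self.connections[input_index][winner]
            self.connections[input_index][winner] = (
                weight + self.learning_rate * (value - weight))
        self.conscience[winner] -= self.conscience_learning_rate
        return winner


net = CompetitiveNet(input_size=3, cluster_count=2, seed=1)
yellow = [1.0, 1.0, 0.0]
winners = [net.train(yellow) for _ in range(1000)]
print(sorted(set(winners)))  # [0, 1]
```

With real data you would call train on each input for a number of passes and then use feed_forward to ask which cluster a new input belongs to.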

Competitive learning nets are nice but are a long way from the goal of full unsupervised learning. In a future post I’m going to do a Restricted Boltzmann Machine, which is used in deep learning for the shallow layers to give us a simpler representation of an image to work with.

Full code is available on GitHub in Python, C#, Java and F#

CSharp

Programming a programming computer game – .Net run time type creator

A while ago a friend and I had an idea for a computer game. You would control a collection of bacteria, all of which needed to feed and would die of old age given enough time. They could also reproduce in order to keep their population going, and fight an enemy bacteria population controlled by another player. The aim of the game was for your bacteria to out-compete the other player’s bacteria on the map. So far so unoriginal. Our new idea was that, rather than controlling the creatures by, say, using a mouse and keyboard to give them orders, you would instead control them by writing the code for how they behaved. It would be a real-time competitive programming game.

The game interface would be a map with a text panel on the right where the user would enter code that the creatures would execute to make their decisions. There were commands for where to move, what to eat, when to breed, etc. This was also a really nice space from which to play with algorithms like neural nets, evolutionary algorithms, clustering, A*, etc. We played around with it a bit and had a fair amount of fun, but we eventually realized that even for us who had built it, the game was too complicated for anyone to actually play. At least not in real time. So we abandoned it as a fun experiment.

But I recently saw this post on Stack Overflow that reminded me of that game. So I thought I would share some of the code for how to do in-application code compilation in .Net. Hopefully it will be of use to some people, and if I get enough interest I may try to clean up the rest of the code and release it as an open source project. Because despite being painfully complicated, when it did work it was fun, at least for uber nerds like us.

RunTimeTypeCreator

Here is the one and only method in the lib:

 public static T CreateType<T>(string source,
                          IEnumerable<string> assemblies,
                          out List<string> compilationErrors)
                             where T : class

It will attempt to create an instance of the type T from the source passed in. The source will be compiled with references to all the assemblies in the assemblies parameter. So for example you could do this with it.

 namespace RunTimeTypeCreator.Tests
 {
    public interface ITestType
    {
      bool Success { get; }
    }
    public class RunTimeTypeCreatorTests
    {
       public static bool TestVariable = false;
       public void Example()
       {
          const string source = @"
 using RunTimeTypeCreator.Tests;
 public class TestTypeClass : ITestType
 {
    public bool Success { get { return RunTimeTypeCreatorTests.TestVariable; } }
 }";
          List<string> compilationErrors;
          var type = RunTimeTypeCreator.CreateType<ITestType>(source,
                     new[] { "RunTimeTypeCreator.Tests.dll" }, //the name of this assembly
                     out compilationErrors);
          TestVariable = false;
          //will print False
          Console.WriteLine(type.Success);
          TestVariable = true;
          //will print True
          Console.WriteLine(type.Success);
       }
    }
 }

Which is kind of cool, I think. Here’s a quick run through of how it works. This just shows the code minus bits of validation and error reporting, so if you want the full thing I would recommend getting it from GitHub.

 var csc = new CSharpCodeProvider();

 var parameters = new CompilerParameters
     {
      //we don't want a physical executable in this case
      GenerateExecutable = false,
      //this is what allows it to access variables in our domain
      GenerateInMemory = true
     };

 //add all the assemblies we care about
 foreach (var assembly in assemblies)
     parameters.ReferencedAssemblies.Add(assembly);

 //compile away, will load the class into memory
 var result = csc.CompileAssemblyFromSource(parameters, source);

 //we compiled successfully so now just use reflection to get the type we want
 var types = result.CompiledAssembly.GetTypes()
              .Where(x => typeof(T).IsAssignableFrom(x)).ToList();

 //create the type and return
 return (T)Activator.CreateInstance(types.First());
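
The same compile, find the type, instantiate pattern is not specific to .Net. As a loose analogy only (this is Python, not the library above, and create_type, TestType, Host and UserType are all made-up names), here is a sketch that uses the built-in compile and exec functions to turn a source string into an instance that can still see a variable owned by the host program:

```python
class TestType:
    """Stand-in for ITestType: the base type that user code must derive from."""

    @property
    def success(self):
        raise NotImplementedError


class Host:
    """Stand-in for the host program's static TestVariable."""
    test_variable = False


def create_type(source, base_type, shared_names):
    """Compile source and instantiate the first subclass of base_type it defines.

    Returns (instance, errors); instance is None when errors is non-empty.
    """
    try:
        code = compile(source, "<user code>", "exec")
    except SyntaxError as error:
        return None, ["line {}: {}".format(error.lineno, error.msg)]

    # The compiled code sees only the names we hand it (plus the builtins).
    namespace = dict(shared_names)
    exec(code, namespace)

    candidates = [value for value in namespace.values()
                  if isinstance(value, type)
                  and issubclass(value, base_type)
                  and value is not base_type]
    if not candidates:
        return None, ["no subclass of {} found".format(base_type.__name__)]
    return candidates[0](), []


SOURCE = '''
class UserType(TestType):
    @property
    def success(self):
        return Host.test_variable
'''

instance, errors = create_type(SOURCE, TestType, {"TestType": TestType, "Host": Host})
Host.test_variable = False
print(instance.success)  # False
Host.test_variable = True
print(instance.success)  # True
```

Note that exec runs whatever it is given with the full power of the host process, so this sketch does nothing to restrict what the supplied code can do.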

I also had some other code for validating what the user was doing: making sure they weren’t trying to access the file system, open ports or create memory leaks/recursive loops. I’ll try to clean this up and post it at a future date.

Full code is available here on github.