As a nod to this piece having some sort of back story, it's probably worth telling you that my Dad was a farmer. I grew up in rural England and spent much of my childhood covered in various flavours of mud, dust, and other less desirable substances. The upshot of this (and of the amount of Stargate, Voyager, and TNG I was watching at the time) is that I now have a penchant for hydroponics, a slightly unhealthy fascination with fertilizer, and a lasting interest in all things agricultural.
So you can imagine my joy when, as I was perusing the World Bank dataset, I found a shed load of indicators to play with. I'm going to do my best to gather various data, manipulate it into a respectable format (sometimes easier said than done), and squeeze it until it tells me something interesting.
This Blog¶
While I've been using FSharp and Jupyter separately for a while, this is my first outing with an IFSharp Jupyter notebook, and I'm mostly here to see what's possible. Either way, I suspect I'm going to learn something interesting from this, and I hope you will too.
Accessing the World Bank Data API¶
I think anyone who's searched for "fsharp something example" has probably seen something approaching the next few lines of code, so I'll not dwell on them too long. Somewhere along the way, the great people who maintain the FSharp.Data library saw fit to add a type provider for the World Bank's open dataset API, and, well, it's great. With a few lines of code, you can gather economic, demographic, geographic or geological data series for almost any country you like. If you have a chance to visit their website and look around, you will be well rewarded.
The array of fields available is so vastly, hugely, mind-bogglingly big, that you will likely not know where to start. Just like me.
It turns out though, after a swift Google, that this is probably your best bet:
#load "Paket.fsx"
Paket.Package ["FSharp.Data"]
#load "Paket.Generated.Refs.fsx"
open FSharp.Data
let wb = WorldBankData.GetDataContext()
Good stuff. The code has run without any issues and we've got an object to fiddle with.
Take a look in the mirror¶
The API has some documentation, but as with most open source libraries, you're largely left on your own to figure out what's what from limited examples or old blogposts. With this in mind there are a couple of options available to us.
First and foremost is reflection. If you've heard of it already then bully for you; if not, it's the part of .NET (the System.Reflection namespace) that allows you to inspect and manipulate the metadata of your code's types and classes at runtime. In certain cases it can be incredibly useful, but it also looks a lot like a silver bullet for many programming problems, and as such can be used and abused in horrible ways.
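To make that a little more concrete, here's a minimal sketch of reflection on a type I've made up purely for illustration (the Field record and its values are hypothetical, nothing to do with the World Bank library). It asks an ordinary object for its type, then asks that type for its public properties:

```fsharp
// Hypothetical record type, purely for illustration
type Field = { Crop : string; Hectares : float }

let field = { Crop = "Barley"; Hectares = 12.5 }

// GetType() is available on every .NET object
let typeName = field.GetType().Name

// GetProperties() lives on System.Type and lists the public properties
let propertyNames =
    field.GetType().GetProperties()
    |> Array.map (fun p -> p.Name)
    |> Array.sort

printfn "%s has properties %A" typeName propertyNames
```

For a plain record like this, the property names are simply the field names, which is the sort of answer I'm hoping to get from the World Bank object too.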
An obligatory note from someone who's had to deal with applications that use it to access properties "dynamically": stop it, right now. It might seem like fun, but think about a) the performance costs when you have to scale up to bigger datasets and b) the poor sod who then has to 'fix' it.
For now though, I'm going to make three simple calls, GetType() on the object itself and then GetProperties() and GetMethods() on the type it returns, to tell me more about what I've got:
wb.GetType()
wb.GetType().GetProperties()
|> Util.Table
wb.GetType().GetMethods()
|> Util.Table
Huh...
So these three methods should have taken the result of our WorldBankData.GetDataContext() call and told me the name of the type that was returned...
FSharp.Data.Runtime.WorldBank.WorldBankData
... good, but also a list of the public properties and the methods of that type, i.e. ways in which I can access the data, of which there are none.
Hmmm... So much for Reflection, time to try something else.
Use the Source, Luke¶
So the outputs from the last attempt didn't seem so helpful, but, at the very least, it has told me where to look. My targeting computer's off and I'm going to use my gut.
A small amount of hunting around the FSharp.Data GitHub site yields a promisingly named 'WorldBank' folder, and within it, two helpful looking files:
- https://github.com/fsharp/FSharp.Data/blob/master/src/WorldBank/WorldBankProvider.fs
- https://github.com/fsharp/FSharp.Data/blob/master/src/WorldBank/WorldBankRuntime.fs
Bingo. It's usually not worth digging around in the source code, but sometimes you're not left with much of a choice. I'm guessing from the fact that the WorldBank.WorldBankData class was a part of the FSharp.Data.Runtime namespace that the WorldBankRuntime.fs file is the place to look.
And we have it:
/// [omit]
type IWorldBankData =
    abstract GetCountries<'T when 'T :> Country> : unit -> seq<'T>
    abstract GetRegions<'T when 'T :> Region> : unit -> seq<'T>
    abstract GetTopics<'T when 'T :> Topic> : unit -> seq<'T>

/// [omit]
type WorldBankData(serviceUrl:string, sources:string) =
    let sources = sources.Split([| ';' |], StringSplitOptions.RemoveEmptyEntries) |> Array.toList
    let restCache = createInternetFileCache "WorldBankRuntime" (TimeSpan.FromDays 30.0)
    let connection = new ServiceConnection(restCache, serviceUrl, sources)
    interface IWorldBankData with
        member x.GetCountries() = CountryCollection(connection, None) :> seq<_>
        member x.GetRegions() = RegionCollection(connection) :> seq<_>
        member x.GetTopics() = TopicCollection(connection) :> seq<_>
Right at the bottom we have the WorldBankData class and the IWorldBankData interface it implements. For now the interface is irrelevant, but if you're interested in coding for applications rather than just for data analysis, I recommend reading about interfaces and why they're useful.
The class itself has 3 member methods that I'm interested in: GetCountries(), GetRegions() and GetTopics(). Looking at the names used, it looks rather like each generates a collection of objects from a connection to the WorldBank database.
Let's find out:
let topics = wb.GetTopics()
An error? Hmmm... curious, but at least IFSharp knows where I'm supposed to be looking: Topics or get_Topics.
To recap: we looked at the type output by the WorldBankData.GetDataContext() method, from that looked up the relevant class in the FSharp.Data GitHub repository, and tried to call a method defined in it... but it wasn't there.
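Part of the explanation may be plain F# rather than anything clever. In F#, interface implementations are explicit: a member defined inside an `interface ... with` block can only be called through the interface type, not directly on the class. Here's a small sketch using made-up names (ICropSource and Farm are hypothetical, chosen only to mirror the shape of IWorldBankData and WorldBankData):

```fsharp
// Hypothetical interface and class mirroring the shape of the library code
type ICropSource =
    abstract GetCrops : unit -> seq<string>

type Farm() =
    interface ICropSource with
        member x.GetCrops() = seq [ "wheat"; "barley" ]

let farm = Farm()

// farm.GetCrops() does not compile: the member belongs to the interface.
// Upcasting to the interface makes it reachable.
let crops = (farm :> ICropSource).GetCrops() |> Seq.toList

printfn "%A" crops
```

So even without a type provider in the mix, a GetTopics() defined this way wouldn't be directly callable on the object. That doesn't explain where Topics comes from, though.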
Given that the type provider is going to be relatively dynamic and connection dependent, I suspect something clever is going on under the hood. I'm going to search for Topics and get_Topics in the WorldBankProvider.fs file, and see what falls out.
Sure enough, nestled deep inside the WorldBankProvider type provider object there is a ProvidedTypeDefinition call that contains the property names I can see (see below). This makes a WorldBankDataService object, which handles the WorldBankData class internally.
let createTypesForSources(sources, worldBankTypeName, asynchronous) =
    ...
    let worldBankDataServiceType =
        let t = ProvidedTypeDefinition("WorldBankDataService", Some typeof<WorldBankData>, hideObjectMethods = true, nonNullable = true)
        t.AddMembersDelayed (fun () ->
            [ yield ProvidedProperty("Countries", countriesType, getterCode = (fun (Singleton arg) -> <@@ ((%%arg : WorldBankData) :> IWorldBankData).GetCountries() @@>))
              yield ProvidedProperty("Regions", regionsType, getterCode = (fun (Singleton arg) -> <@@ ((%%arg : WorldBankData) :> IWorldBankData).GetRegions() @@>))
              yield ProvidedProperty("Topics", topicsType, getterCode = (fun (Singleton arg) -> <@@ ((%%arg : WorldBankData) :> IWorldBankData).GetTopics() @@>)) ])
        serviceTypesType.AddMember t
        t
All of the classes and properties that I want to access are defined in this file. They seem to be provided dynamically in some way, perhaps that's part of the "magic of TypeProviders" that I hear about, but this is something I'll have to investigate further in the future.
For now, I can infer a couple of things:
- The WorldBankData class generates data on the fly, lazy loading them as they are requested by the caller
- It effectively exposes 3 properties:
  - Countries
  - Regions
  - Topics
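If that first inference about lazy loading is right, it's the same behaviour you get from any F# seq expression, which does no work until something enumerates it. A minimal sketch, assuming a made-up collection (fakeTopics and the fetched counter are hypothetical stand-ins for a remote call, not part of the library):

```fsharp
// A counter so we can see how much work has actually been done
let fetched = ref 0

// Hypothetical stand-in for a remote collection: each item "costs" one fetch
let fakeTopics =
    seq {
        for id in 1 .. 100 do
            fetched.Value <- fetched.Value + 1
            yield sprintf "Topic %d" id
    }

// Defining the sequence has not triggered any fetches yet
let fetchedBefore = fetched.Value

// Only the items we actually ask for get produced
let firstThree = fakeTopics |> Seq.take 3 |> Seq.toList
let fetchedAfter = fetched.Value

printfn "before: %d, after: %d, items: %A" fetchedBefore fetchedAfter firstThree
```

The counter stays at zero until the sequence is enumerated, and taking a few items only pays for a few items rather than all one hundred.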
Digging Around¶
Now we can have a root around in something. From the two files I can see that Topics exposes a TopicCollection to the caller, which is basically a sequence of Topic objects:
type TopicCollection<'T when 'T :> Topic> internal (connection: ServiceConnection) =
    let items = seq { for topic in connection.Topics -> Topic(connection, topic.Id) :?> 'T }
    interface seq<'T> with member x.GetEnumerator() = items.GetEnumerator()
    interface IEnumerable with member x.GetEnumerator() = (items :> IEnumerable).GetEnumerator()
    interface ITopicCollection with member x.GetTopic(topicCode) = Topic(connection, topicCode)
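The "basically a sequence" part comes from those two GetEnumerator implementations. As a sketch of the same pattern in isolation, here's a made-up collection (CropCollection is hypothetical, not from the library) that wraps an inner seq and hands out its enumerator, which is enough for the Seq module to work on it:

```fsharp
open System.Collections

// Hypothetical collection that wraps an inner sequence, as TopicCollection does
type CropCollection(names : string list) =
    let items = seq { for name in names -> name.ToUpperInvariant() }
    // The generic interface is what the Seq functions use
    interface seq<string> with
        member x.GetEnumerator() = items.GetEnumerator()
    // The non-generic interface must be implemented alongside it
    interface IEnumerable with
        member x.GetEnumerator() = (items :> IEnumerable).GetEnumerator()

let crops = CropCollection [ "wheat"; "barley"; "oats" ]

// Because the type implements seq<string>, it can be piped into Seq functions
let firstTwo = crops |> Seq.take 2 |> Seq.toList

printfn "%A" firstTwo
```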
Further, each Topic object should contain Code, Name and Description properties, as well as a collection of IndicatorDescription objects (more on those later).
type Topic internal (connection:ServiceConnection, topicCode:string) =
    let indicatorsDescriptions = new IndicatorsDescriptions(connection, topicCode)
    /// Get the WorldBank code of the topic
    member x.Code = topicCode
    /// Get the name of the topic
    member x.Name = connection.TopicsIndexed.[topicCode].Name
    /// Get the description of the topic
    member x.Description = connection.TopicsIndexed.[topicCode].Description
    interface ITopic with member x.GetIndicators() = indicatorsDescriptions
For now, I'm going to examine the contents of the object with the Util.Table method, which presents very pretty tables like this:
let topics = wb.Topics
topics
|> Seq.take(10)
|> Util.Table
How neat is that? I've gone from knowing next to nothing about this library to being able to pull out a list of information (all of the 'topics' of data available within the World Bank dataset), and I know exactly where to dig next; pick a topic, see what it holds, repeat...
As I've already said, I've got a thing about farming, so "Agriculture & Rural Development" it is.
Now, I'd be lying if I said I hadn't already seen an example of getting a specific country from the Countries collection. I know that the API uses named properties with the full spaces-and-everything name of the country in it, so I can take a punt that the same will be true with everything else.
I would guess that the following will return the data I'm after:
topics.``Agriculture & Rural Development``
However, before I try that I'm going to look into how this works, as if I didn't have that example to hand. Feel free to skip ahead to the Ploughing Onwards section.
[Extra] Deeper digging for the very keen¶
In the TopicCollection class there is a method that looks promising: GetTopic(topicCode). However, I know something strange is happening to the types in this library, so I'm going to take a look at the WorldBankDataService object in the WorldBankProvider.fs file again, specifically at the topicsType property.
Here it is:
let topicsType =
    let topicCollectionType = ProvidedTypeBuilder.MakeGenericType(typedefof<TopicCollection<_>>, [ topicType ])
    let t = ProvidedTypeDefinition("Topics", Some topicCollectionType, hideObjectMethods = true, nonNullable = true)
    t.AddMembersDelayed (fun () ->
        [ for topic in connection.Topics do
            let topicIdVal = topic.Id
            let prop =
                ProvidedProperty
                  ( topic.Name, topicType,
                    getterCode = (fun (Singleton arg) -> <@@ ((%%arg : TopicCollection<Topic>) :> ITopicCollection).GetTopic(topicIdVal) @@>))
            if not (String.IsNullOrEmpty topic.Description) then prop.AddXmlDoc(topic.Description)
            yield prop ])
    serviceTypesType.AddMember t
    t
As you can see, there is a call to AddMembersDelayed on the dynamic ProvidedTypeDefinition that we're clearly encountering here in the real world. 'Delayed' presumably means that the data is loaded on request, and not before (sensible in an online API), but more importantly we see that the method is yielding a collection of dynamic ProvidedProperty objects, each named after Topic.Name (e.g. "Agriculture & Rural Development"), and with something called a getterCode that takes the Topic.Id and passes it to the TopicCollection.GetTopic() method.
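And the fact that all of these properties are supplied by the type provider, rather than being real members of the runtime WorldBankData class (as far as I can tell, the provided types are erased at compile time), explains why my attempt at GetProperties() didn't work.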
Great! So presumably, I look for properties with the name of the Topic...
... but what about the spaces and ampersand? Perhaps I should start with something simple:
topics.Trade.GetType()
Excellent! We're on the right track! Just for the hell of it I'll try the full name for Agriculture:
topics.Agriculture & RuralDevelopment.GetType()
As expected, but helpfully (or not), it tells me that the property does exist.
After some thinking and searching, I learn something new about F# (hooray). Double back-quotes are used to make phrases into valid identifiers:
https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/symbol-and-operator-reference/
``...`` Delimits an identifier that would otherwise not be a legal identifier, such as a language keyword.
Now this admittedly wasn't an easy leap, but once you know it, you know it. Similar tools are used in other languages: in R you can definitely name columns of dataframes in such a way, and clearly that's the case in F# as well.
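Here's a quick sketch of the back-quote trick on its own, away from the type provider. The names and values are made up for illustration; the point is only that spaces, ampersands, brackets and even keywords become legal once wrapped in double back-quotes:

```fsharp
// Spaces and an ampersand in a value name
let ``Agriculture & Rural Development`` = "a topic placeholder"

// Brackets work too
let ``Cereal yield (kg per hectare)`` = 1234.5

// Even a language keyword can be used as an identifier
let ``type`` = "a keyword used as a name"

let summary =
    sprintf "%s | %.1f | %s"
        ``Agriculture & Rural Development``
        ``Cereal yield (kg per hectare)``
        ``type``

printfn "%s" summary
```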
So there we are, all that remains is to try it out!
Ploughing Onwards¶
let agricultureTopic = topics.``Agriculture & Rural Development``
agricultureTopic.GetType()
agricultureTopic.Description
This is good to see!
Now I'm going to apply these same tricks to the Topic.Indicators property and see what's available:
let agricultureIndicators = agricultureTopic.Indicators
agricultureIndicators
|> Seq.take(10)
|> Util.Table
let cerealYield = agricultureIndicators.``Cereal yield (kg per hectare)``
cerealYield.GetType()
Great, so if I take a look at the Indicator type, I see there's an Item property, which is familiar from dictionaries in .NET, that takes a year as an id and looks it up in an internal dictionary. Note as well that the Indicator class implements seq<int * float>, which suggests we should be able to bring the mighty Seq tools to bear on it later... Good job! We're nearly there!
/// Indicator data
type Indicator internal (connection:ServiceConnection, countryOrRegionCode:string, indicatorCode:string) =
    let data = connection.GetData(countryOrRegionCode, indicatorCode) |> Seq.cache
    let dataDict = lazy (dict data)
    ...
    member x.Item
        with get year =
            match dataDict.Force().TryGetValue year with
            | true, value -> value
            | _ -> Double.NaN
    ...
    interface seq<int * float> with member x.GetEnumerator() = data.GetEnumerator()
    interface IEnumerable with member x.GetEnumerator() = (data.GetEnumerator() :> _)
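The lookup inside that Item property can be tried out in isolation. This is a minimal sketch of the same pattern on made-up numbers (the data list and the lookup function are hypothetical, not real World Bank values): year and value pairs go into a lazily built dictionary, and a missing year comes back as NaN rather than throwing.

```fsharp
open System

// Made-up (year, value) pairs, purely for illustration
let data = [ 2014, 7100.0; 2015, 7400.0; 2016, 6900.0 ]

// Built on first use, as in the library code
let dataDict = lazy (dict data)

// Same shape as the Item getter: a value if the year exists, NaN otherwise
let lookup (year : int) =
    match dataDict.Force().TryGetValue year with
    | true, value -> value
    | _ -> Double.NaN

printfn "2016: %f, 1900: %f" (lookup 2016) (lookup 1900)
```

Note that the key here is an int, which matches the seq<int * float> that the Indicator class exposes.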
let cerealYield = agricultureIndicators.``Cereal yield (kg per hectare)``.["2016"]
Oh no!
We got as far as trying to pull out the data for a specific year, but there wasn't any data in it?
This is because I didn't read the name of the class properly: my cerealYield object isn't an Indicator, it's an IndicatorDescription (metadata only). In order to access the actual data we need to get the Indicator collection of a country.
let countries = wb.Countries
countries
|> Seq.take(10)
|> Util.Table
Well, there's a lot (see "vastly, hugely, mind-bogglingly big" from the intro), so how about the UK?
let ukCerealYields = wb.Countries.``United Kingdom``.Indicators.``Cereal yield (kg per hectare)``
ukCerealYields.GetType()
ukCerealYields
|> Seq.take(10)
|> Util.Table
Reaping the rewards¶
Grand. Now I have a set of data to access. I know that it behaves like a sequence, but I can get specific points like a dictionary, it loads data from the World Bank data set as I request it, ... I should be good to go!
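As a taste of what "behaves like a sequence" buys us, here's a sketch of a few Seq functions applied to year and value pairs. The numbers are made up so the example runs without a connection (they are not real UK yields); with the real thing, ukCerealYields should slot in where the hypothetical yields value is.

```fsharp
// Made-up (year, value) pairs standing in for an indicator series
let yields : seq<int * float> =
    seq [ 2012, 6000.0; 2013, 6500.0; 2014, 7000.0; 2015, 8000.0; 2016, 6000.0 ]

// Keep only the more recent years
let recent = yields |> Seq.filter (fun (year, _) -> year >= 2014)

// Average the values of what is left
let recentAverage = recent |> Seq.averageBy snd

// Find the year with the highest value across the whole series
let bestYear, bestValue = yields |> Seq.maxBy snd

printfn "average since 2014: %.1f, best year: %d (%.1f)" recentAverage bestYear bestValue
```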
There are lots of things to do next, but I think I'll save them for another post. I hope you've found this helpful. I know I've learnt a lot, but above all I wanted to get across the tools I use to solve problems when there's little else to go on. The free access to all of this source code on GitHub is an amazing, beautiful thing, and it shouldn't be taken for granted. The same goes for tools like NuGet and Paket; I've been able to take a random file of text from a globally distributed storage architecture, plug it into a system on my own PC, download data from somewhere else in the world and print it out for you here... if you believe Arthur C Clarke, this is literally magic. It's awesome, and I will never stop loving it. I hope you get some of that feeling too.
Anyway, thanks for reading, feel free to download the notebook and run it yourself, if it helps. Otherwise, good luck, happy number crunching, and keep your eye out for my next post!