
Agricultural Approaches to Data Science - Visualising Simple Data Sets


Following on from my previous post on the World Bank open data API, I wanted to take a look at some simple 'visualizations' from it in F#.

'Visualization' (noun) is a strange word to me. It hasn't been popular for long, it has all the hallmarks of jargonism, and, on the face of it, appears to be completely interchangeable with 'graph'. The reason I think it exists, though, apart from to make analysts sound more impressive, is the implication that it should impart more than just the barest of facts. If a graph is a basic visual representation of data, then a Visualization should be a graphic that best represents a kernel of information. Probably it will just be a graph of the data, but at the very least it should be an interesting one.

Perhaps I'm setting myself up for a fall here, but in this post, hopefully, I'm going to dig into the World Bank data sets, and produce some visualizations that will give an intuition about the world based on what we find there.

I also have to confess I've never really used Deedle (and I'm not yet enamoured with it, as I'll explain later), so please tolerate my blunderings in that respect.

Let's get Cereal

No, I'm serious. Previously, I'd inferred the correct syntax of the API, downloaded a small data set for an agricultural indicator, and decided that I was a literal wizard (let's say 'technomancer'). I'd been looking at Cereal Yield in Kg per Hectare (that's a unit of area 100m * 100m or ~2.5 acres, to the uninitiated), and we'd gathered a time series for the UK that started in the 1960s.

As a starter, I'm curious to see how the UK's Cereal Yield has varied over time, and whether or not it has a relationship with something else I spotted in the indicators list, Fertilizer Consumption in Kg per Hectare. I suspect it does (otherwise why would farmers buy so much Nitrogen?), but it may not be clear in the data we have available due to exogenous factors (rainfall, temperature, technology, Brexit - just kidding), that could add noise to the signal. It might also be the case that not all indicators have the same coverage, and the number of valid data points (where both variables have a real value) may be too low to shine a light on any relationship.

Either way, I'm going to plough in and see what happens.

A Simple Model

Before that, though, I want to highlight a very important point that applies to any numerical or statistical analysis:

First Rule of Statistics:

Keep a model in mind

If you are so inclined, it's very easy to make 2+2=5. This is especially true when you have a client who 'wants to see proof' of a relationship for their own purposes. It's important to only look at things that make sense, start simple, and go from there. There are a very few exceptions to this rule, which I will cover if I ever get to post about a machine learning approach, but they are few and far between, and come with significant caveats.

In all other cases, it makes much more statistical sense to have a theoretical basis for a relationship (a model) before you attempt to test for one. In this example I believe three important things:

  • Cereal Yield should increase gradually over time
    • Improving Technology since the 1960s should have an effect on yield
    • Increasing competition from around the world should have pressured farmers towards using fertilizer to gain higher yields
  • Fertilizer Consumption should increase gradually over time
    • See increasing competition for Cereal Yield
  • Cereal Yield should be directly related to Fertilizer Consumption
    • The very point of applying fertilizer is to promote growth
    • Fertilizer is expensive, and there may be a dampening effect where it is only applied when yield is expected to be low

So there you have it: that's a rough model, a general guess as to what I think will happen. Even in just writing out this list I've noticed something potentially interesting: improving technology may be a much greater driver of increasing yield than fertilizer consumption, so I may have to control for it in some way.
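If I do end up needing to control for a steady technology trend, the simplest approach is to remove a linear time trend from each series and compare what's left over. The sketch below is my own illustration with made-up numbers, not anything taken from the World Bank data:

```fsharp
// Remove a linear time trend from a (year, value) series by ordinary
// least squares, leaving residuals that can be compared trend-free.
let detrend (series: (int * float) list) =
    let ts = series |> List.map (fun (t, _) -> float t)
    let vs = series |> List.map snd
    let meanT = List.average ts
    let meanV = List.average vs
    // OLS slope = covariance of (t, v) divided by variance of t
    let slope =
        (List.map2 (fun t v -> (t - meanT) * (v - meanV)) ts vs |> List.sum)
        / (ts |> List.sumBy (fun t -> (t - meanT) ** 2.0))
    let intercept = meanV - slope * meanT
    series |> List.map (fun (t, v) -> (t, v - (intercept + slope * float t)))

// A perfectly linear series detrends to all-zero residuals:
detrend [ (1961, 100.0); (1962, 110.0); (1963, 120.0) ]
```

On real data, the residuals are the year-to-year variation around the trend, which is what I'd want to correlate against fertilizer use if technology turns out to dominate.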

Either way, I hope you see the merits of that rule. Lesson done, let's get stuck in.

Harvesting the Data

You'll recognise the lines below or something similar from examples in the API documentation or my World Bank API post:

In [1]:
#load "Paket.fsx"
Paket.Package [
    "FSharp.Data"
    "XPlot.Plotly"
    ]
#load "Paket.Generated.Refs.fsx"
In [2]:
open FSharp.Data
In [3]:
let wb = WorldBankData.GetDataContext()
Indicators
In [4]:
let agriculturalIndicators = wb.Topics.``Agriculture & Rural Development``.Indicators

I'm not worried about calling the same query multiple times, as it is cached in the API, but keeping these to hand will save me a lot of typing time when I'm figuring out what to type next. To get a complete list of them, I can just run the code below.

agriculturalIndicators
    |> Util.Table

Here are some that I intend to take a look at:

  • Cereal yield (kg per hectare)
  • Agricultural machinery, tractors per 100 sq. km of arable land
  • Annual freshwater withdrawals, agriculture (% of total freshwater withdrawal)
  • Average precipitation in depth (mm per year)

UK Cereal Yields
In [5]:
#load "XPlot.Plotly.fsx"
open XPlot.Plotly
In [6]:
let ukCerealYields =
    wb.Countries.``United Kingdom``
        .Indicators.``Cereal yield (kg per hectare)``
In [7]:
ukCerealYields
    |> Chart.Line
    |> Chart.WithLayout(Layout(title = "UK Cereal Yields",
                                showlegend = false,
                                xaxis=Xaxis(title="Year"),
                                yaxis=Yaxis(title="Cereal Yields (kg per hectare)")))
Out[7]:

Doesn't that look pretty? Even to the naked eye there is a clear upward trend, which is what we'd expect, right?

The chart above was generated using the XPlot library, which contains a wrapper for Plotly. Anyone who's used Python for data analysis will have at least heard of Plotly. It's a cloud-based service as much as it is a library, and it generates professional, interactive graphs based on the data you send to it. The library itself sends your data to a server in order to process the graph more efficiently, but if that puts you off, I understand there's an offline mode that you can use as well.

In the case above I've taken my Indicator (essentially a sequence of (int * float) tuples), plotted it in a line chart, and applied a layout to that line chart (it's also possible to convert it to an SVG or PNG file, although it wasn't necessary here). I recommend you have a play with this code if you haven't done so before, and step through every line of it to see the effect it has. Be warned, sometimes Plotly and Jupyter Notebooks can take their time in giving you your well-earned results, but it is the single best way to learn.

UK Fertilizer Consumption
In [8]:
let ukFertilizerConsumption =
    wb.Countries.``United Kingdom``
        .Indicators.``Fertilizer consumption (kilograms per hectare of arable land)``
In [9]:
ukFertilizerConsumption
    |> Chart.Line
    |> Chart.WithLayout(Layout(title = "UK Fertilizer Consumption",
                                showlegend = false,
                                xaxis=Xaxis(title="Year"),
                                yaxis=Yaxis(title="Fertilizer consumption (kg per hectare)")))
Out[9]:

Interestingly, Fertilizer Consumption has decreased rather than increased, as I'd thought it would. This could mean that my impressions about increased demand were incorrect, or that there is something else interfering with the relationship. I've already suggested that fertilizer might be used more when it's more necessary, so maybe there's been a meteorological effect over the years in this time series, or perhaps there has been a shift away from the crops that require more intense nurturing. These seem like slim possibilities; perhaps more likely is the idea that regulation of chemical use in the food chain has restricted fertilizer's use.

Either way, now I've plotted the Cereal Yields and Fertilizer Consumption, I probably want to combine them. Even though I've looked at their individual trends, I want to be able to plot yearly points against one another, so how best to do that in F#?

Combining Time Series

Often, when performing some sort of real-world analysis, you are looking for trends and associations between variables, rather than just how a variable has performed over time (it's rarely the time of a measurement, but other variables' values at that time, that affect the result). This might seem a self-evident comment, but it encapsulates an important point when managing incomplete data series: they need to be properly aligned.

I could just zip my Cereal Yield and Fertilizer Consumption series together, but in some cases, such as this one, the two are either not the same length, or they have different starting points, or both. Luckily, although for reasons I will never understand it's not always the case, the World Bank API returns data with the appropriate date.
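Before reaching for a library, it's worth noting that aligning two dated series is straightforward in plain F#: build a lookup from one and keep only the years present in both. This is just an illustrative sketch with made-up numbers, not real API output:

```fsharp
// Align two (year, value) series on their common years, dropping
// any year that appears in only one of the two.
let alignByYear (xs: seq<int * float>) (ys: seq<int * float>) =
    let yMap = Map.ofSeq ys
    xs
    |> Seq.choose (fun (year, x) ->
        yMap.TryFind year |> Option.map (fun y -> (year, x, y)))

// Series with different starting points and a gap:
let yields = [ (1961, 3000.0); (1962, 3100.0); (1964, 3300.0) ]
let fertilizer = [ (1962, 250.0); (1963, 255.0); (1964, 260.0) ]

alignByYear yields fertilizer
// Only 1962 and 1964 appear in both series, so only they survive.
```

This is fine for two series, but it gets tedious as the number of variables grows, which is exactly the gap a tabular library is meant to fill.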

This brings me onto Deedle. I've not used the library loads, although I've used similar packages in other languages, but it's widely recognised and has been included as part of the FsLab data science package. It's got some impressive tools and is clearly a competitor for managing tabular data in cases where matrices won't do, defining types is not agreeable (although I live in hope of F# anonymous types), or you don't yet know what you're going to do with your data, and need to slice and dice quickly in lots of different ways.

If you've not used it before, it has two main classes: a Series to contain indexed data for a variable, and a Frame to combine the variables into a table using their common index. They have lots of helper functions to make our lives easier and, most importantly, know how to stitch themselves together. This is perfect for my needs; I just need to convert both sets of (Year, Value) pairs into Series, and then merge the two into a Frame.

As with the World Bank API though, I've found that the documentation is widespread but quite thin, and the examples don't cover some quite reasonable scenarios. This can be very frustrating, and while the code below is neat, concise, and intuitive, it took me nigh on half an hour to figure out why the Series constructor wouldn't process a sequence of tuples... but more on that later.

In [10]:
Paket.Package ["Deedle"]
#load "Paket.Generated.Refs.fsx"
In [11]:
open Deedle
In [12]:
let ukFertilizerEffectFrame =
    frame [
            "ukCerealYield" => Series(ukCerealYields |> dict);
            "ukFertilizerConsumption" => Series(ukFertilizerConsumption |> dict)
            ]

ukFertilizerEffectFrame
Out[12]:
Deedle.Frame`2[System.Int32,System.String]

The ukFertilizerEffectFrame has done its job, and now I have a stitched set of features to investigate.

In [13]:
open System
ukFertilizerEffectFrame
    |> Frame.toArray2D
    |> fun f -> seq {for i in 0..(Array2D.length1(f)-1) do yield (f.[i,1], f.[i,0])}
    |> Seq.filter(fun (x,y) -> not (Double.IsNaN(x) || Double.IsNaN(y)))
    |> Chart.Scatter
    |> Chart.WithLayout(Layout(title = "UK Fertilizer Effect",
                                showlegend = false,
                                xaxis=Xaxis(title="Fertilizer Consumption (kg per hectare)"),
                                yaxis=Yaxis(title="Cereal Yield (kg per hectare)")))

Out[13]:

And there you have it, judging by eye, no relationship whatsoever. There are a number of possible reasons for this that we've already discussed, not least of all the low number of data points to compare.
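'Judging by eye' can be made concrete with a correlation coefficient. Pearson's r isn't something this notebook computes, but a minimal version is easy to write, and the same filtered (fertilizer, yield) pairs could be piped straight into it:

```fsharp
// Pearson's correlation coefficient for a sequence of (x, y) pairs:
// close to +1 for a strong positive linear relationship,
// close to 0 when there is no linear relationship at all.
let pearson (pairs: seq<float * float>) =
    let xs, ys = pairs |> Seq.toArray |> Array.unzip
    let meanX = Array.average xs
    let meanY = Array.average ys
    // The 1/n factors in covariance and variance cancel in the ratio,
    // so plain sums of deviations are enough.
    let cov = Array.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> Array.sum
    let ssX = xs |> Array.sumBy (fun x -> (x - meanX) ** 2.0)
    let ssY = ys |> Array.sumBy (fun y -> (y - meanY) ** 2.0)
    cov / sqrt (ssX * ssY)

// A perfectly linear relationship gives r = 1.0:
pearson [ (1.0, 2.0); (2.0, 4.0); (3.0, 6.0) ]
```

For the chart above I'd expect a value near zero, matching the eyeball verdict, though with so few valid points the number wouldn't mean much anyway.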

This lack of signal can be a little disheartening, but remember that we are looking at a longitudinal data set here, which carries some problems of its own. I'm going to try to describe what I mean by that.

Mind the Gap, Alan

This is a distinction that comes up regularly in quantitative work, and is often not given the attention that it deserves. I'm not going to launch into any complicated statistical analysis just yet (trust that I will eventually - mwahahahaha!), but that isn't necessary for anyone who wants an intuition about these ideas.

Longitudinal and Cross-Sectional Measurements

A Longitudinal set of data is usually one that spans a length of time, but most importantly applies to a specific subject. In this case we have been looking at how different variables have changed for the UK over time.

By comparison, a Cross-Sectional set of data is a series of measurements across a variety of subjects, and is usually a snapshot of a single point in time. An equivalent example for this article would be looking at how variables differed between some countries at a single point in time.

Note that I have highlighted the main distinguishing flags in these paragraphs: changed for vs differed between.

Confusingly, you can also have longitudinal data that applies to multiple subjects, however, by definition these subjects mustn't swap around, so you can think of it as a time series where the variable has changed for a fixed bucket of subjects instead. A nice analogy is measuring the price of stocks on a financial index such as the FTSE 100. In this case what you're interested in is the state of all of the stocks in combination (as the FTSE is a combination of the 100 largest companies in the UK, this is a pretty good measure for the UK stock market as a whole).

Even more confusingly, you can have Cross-Sectional time series, where measurements of different sets of samples are taken over time. A (less nice) analogy for this is if you're measuring, say, how body temperature and happiness have differed between people on an underground (subway) carriage - people get on and off the train, but what you're interested in is how the state of passengers in general changes over time, not individuals specifically (spoiler - they're all unhappy because the tube at rush hour is hell on earth).

Where we went wrong

These might seem pedantic distinctions, but that's mostly because the labels don't do the concept justice. The underlying point is still really important, and we need look no further than our UK Fertilizer Effect Chart to prove it.

We believe, based on sound logical arguments, that there's a relationship between Fertilizer Consumption and Cereal Yield generally. However, we tried to look at the relationship in the UK specifically, over time, and found that there wasn't a signal, counter to our original expectations. We then suggested a number of reasons why this would be the case, namely collinearities with other variables (maybe the UK uses more fertilizer because it has more tractors to move it).

Perhaps a cross-sectional data set would have been more appropriate for this sort of test, as it can help diversify away the idiosyncrasies of a specific example. Maybe the UK's farmers have seen strong improvements in the technology at their disposal, but surely this will be the case for a lot of farmers around the world? If we compare between different countries at a single point in time, are we more likely to remove this trend?

Sigh... the answer is... maybe. Yes, we may be able to separate technology from the equation, if level of technology is relatively equal across all countries. Given the rapid rise of globalisation in the 20th Century I'm banking that the UK today is more similar to all other countries in our dataset today than to the UK in the '60s.

So which to use?

In short, it depends on what you're interested in, and the choice about which to use can be summed up in one line:

Second Rule of Statistics

You can only prove the relationship that you test for

We should look for longitudinal data sets if we're interested in:

  • a specific subject
  • changes in that subject over time

There are some problems associated with this sort of analysis though:

  • higher(?) risk of collinearities

We should look for cross-sectional data sets if we're interested in:

  • a general effect
  • diversifying away the effects of collinearities (but remember that there may be others)
  • independence from changes over time

But remember that there are other problems that might arise:

  • risk of collinearities with different factors (soil fertility?)

I'm looking for a direct relationship between two variables, independent of the state of a specific country, so instead I should be taking measurements across all countries in general (a cross-sectional data set). Is this going to help? We'll have to see...

Going Global

In [14]:
let cerealYield_fertilizerConsumption_Global =
    seq {
        for country in wb.Countries do
            yield ( country.Name,
                    country.Indicators.``Cereal yield (kg per hectare)``.[2015],
                    country.Indicators.``Fertilizer consumption (kilograms per hectare of arable land)``.[2015] )
                    }

cerealYield_fertilizerConsumption_Global
Out[14]:
seq
  [("Aruba", nan, nan); ("Afghanistan", 2133.2, 12.12582099);
   ("Angola", 981.8, 8.052614286); ("Albania", 4873.8, 90.32437002); ...]

There's the raw data by country, nans included. Filtering those out and plotting the rest:

In [15]:
cerealYield_fertilizerConsumption_Global
    |> Seq.filter(fun (c,y,f) -> not (Double.IsNaN(y) || Double.IsNaN(f)) && (y < 10000.0) && (f < 1000.))
    |> Seq.map(fun (c,y,f) -> (f,y))
    |> Chart.Scatter
    |> Chart.WithLayout(Layout(title = "Global 2015 Fertilizer Effect",
                                showlegend = false,
                                xaxis=Xaxis(title="Fertilizer Consumption (kg per hectare)"),
                                yaxis=Yaxis(title="Cereal Yield (kg per hectare)")))
Out[15]:

This looks very promising! Taking a cross-section of all countries' Cereal Yields and Fertilizer Consumption, we seem to find quite a strong relationship between the two!

Note that there were some outliers in this data that have been removed to better present the graph; these were the points filtered out using (y < 10000.0) && (f < 1000.), and are listed below:

In [16]:
cerealYield_fertilizerConsumption_Global
    |> Seq.filter(fun (c,y,f) -> not (Double.IsNaN(y) || Double.IsNaN(f)) && not ((y < 10000.0) && (f < 1000.)))
    |> Util.Table
Out[16]:
Country                  Cereal yield (kg per hectare)   Fertilizer consumption (kg per hectare)
United Arab Emirates     27268.6                         853.066666666667
Hong Kong SAR, China     2000                            2334.32258064516
Jordan                   1518.8                          1079.48808063103
Kuwait                   6542.5                          1277.36753246753
Malaysia                 2909.2                          1539.29875262055
New Zealand              8026.6                          1716.99322033898
Qatar                    5845.8                          7511.45038167939

Now that's pretty.

Most importantly, this and the UK-specific graph now tell a story. There appears to be an effect in general that doesn't appear at all in a specific case. From this we have inferred that there is a relationship, but that it's being masked by something else in some specific time series.

The collinearities in the UK's time series are diversified away in the global data, as we'd hoped. There is still plenty of noise, so other factors are probably important in the relationship, but we've been able to build a very simple visual explanation of it. Hooray!

I think I've talked this through step by step, so it's been quite a slow process (sorry about that). Hopefully, I won't have to repeat this again, and as I get deeper into analysing components of this dataset, the topics I'll spend time on will become less basic. That said, while I haven't looked at any 'real' statistics yet, the important concepts are coming to the front already. This isn't supposed to be a textbook of any kind, but hopefully anyone coming into this fresh will be able to see the way that I think about the world, rather than just another set of code or formulae. Whether you throw it straight out of the window is up to you!

I hope this section hasn't been too long. I really wanted to work the example of UK vs the world data sets, and ended up not focussing on more complicated visual representations as much as simple graphs as a result. Hopefully I'll get there eventually...

The next section is also entirely optional. If you want to hear me ramble about the perils of over-customisation in open source, go ahead, otherwise I hope you read my next post, and not too long in the future.

[Extra] The Trouble with Tibbles

I will have to cover this in more detail elsewhere, because, frankly, this title's too good to let go, but... I wanted to expand a tiny bit more on why I'm not raving about Deedle.

There is a seemingly low-level problem with open source libraries that I've never been happy with: compatibility.

Everyone has their own way of doing things, which in itself is not so great a problem (in fact it's exactly this tendency that brings us new ideas and projects), but unless there's a motivation to make things play well together, it can end up being more 'my way or the highway'.

I've worked a fair amount in R, and there are a number of packages that provide the same sort of functions as Deedle. One of them is called Tibble, and the two suffer from a similar problem: it can be harder work converting data into and out of them than it is to replicate the functions you need manually. There are cases where the work required warrants the extra effort, but I've found these are rare, and every time I want to do something simple, like in the example above, I'm left exasperated.

You'll note that in order to remove the N/As from the frame above I had to first convert it into data that was not a frame. I would be embarrassed by this, but as far as I can tell, this is the most concise way of managing the step. Deedle's Frame class has a filter, but, well, it's not very pretty (see below), and in the end Plotly can't manage the frame anyway.

ukFertilizerEffectFrame
    |> Frame.filterRowValues (fun row -> not (Double.IsNaN(row.GetAs<Double>("ukCerealYield")) ||
                                                Double.IsNaN(row.GetAs<Double>("ukFertilizerConsumption"))))
    |> Frame.toArray2D
    |> fun f -> seq {for i in 0..(Array2D.length1(f)-1) do yield (f.[i,0], f.[i,1])}
    |> Chart.Scatter

In this case, I wanted to construct a Frame from two seq<'a * 'b> objects. In my mind a sequence of tuples is equivalent to a matrix, a CSV, or a database table, and this step should be easy. Yes, it's only implied that the key is the first of my tuple's values, but this seems pretty intuitive; yes, there's no guarantee that the keys are unique, but that's true of constructing a dictionary from a sequence too. Certainly, neither of these affects the ability to dump the data back out into a sequence.

I'm not sure if the syntax of this exact example holds true, but hopefully you follow my reasoning. If the Frame object inherited from seq<'T when 'T :> Tuple<_>>, and contained a simple constructor and get to aid it, then my problem would have been as simple as:

let ukFertilizerEffectFrame =
    frame [
            "ukCerealYield" => Series(ukCerealYields);
            "ukFertilizerConsumption" => Series(ukFertilizerConsumption)
            ]

ukFertilizerEffectFrame
    |> Frame.filter(fun (a,b) -> not (Double.IsNaN(a) || Double.IsNaN(b)))
    |> Chart.Scatter

More importantly, it would take only a minute to work out from scratch.

I hope you don't mind my slightly negative bleatings on this point. I think it's worth saying, and I hope you agree with me, but as far as frustrations go, it's a fairly minor one. Overall the library appears to be very powerful and I'm looking forward to learning it better; I suspect that on balance I'll be much better off with it than without. If only I could have both...