Sunday 3 January 2016

From Anscombe's quartet to choropleth classification

There's a good chance that if you're into data then you'll know about Anscombe's quartet - a set of four different datasets, each with the same mean, variance and correlation coefficient (to name a few properties). The example comes from Frank Anscombe's 1973 paper, 'Graphs in Statistical Analysis', and it's often used to emphasise the importance of visualising data. Anscombe said in the paper that the purpose was 'merely to suggest that graphical procedures are useful' and I couldn't agree more. Over 40 years later, this still applies, and it got me thinking about visual representations in maps and whether there is a spatial equivalent to Anscombe's quartet. I'm not sure there is, but I think the topic of choropleth classification forms a kind of opposite case, of which more below...

The original quartet, from Anscombe, 1973

The case of Anscombe's quartet illustrates that very different datasets can have the same overall properties. In that case, it was a mean of 9, a variance of 11 and so on. With maps, there is a kind of opposite situation, whereby a single dataset can be displayed differently using various data classification systems (as any spatial data analyst will tell you). This brings to mind another Anscombe quote from the 1973 paper where he says that 'unfortunately, most persons who have recourse to a computer for statistical analysis of data are not much interested either in computer programming or in statistical method, being primarily concerned with their own proper business'.

This could lead to a 'different data, same conclusions' situation, which is not ideal. The map example is the opposite. To illustrate the 'same data, different conclusions' kind of thing I'm talking about, I took the example of London house prices - a vastly under-discussed topic I'm sure you'll agree. I used the postcode sector geography and in the maps below have displayed average values for March 2015. The dataset is the same throughout, but the data classification method is different in each case, leading to different patterns and, possibly, different interpretations of the same data.

Equal interval - best used for familiar data ranges (e.g. %)

If you wanted to make the case that house prices in London weren't that much of a problem, you might use the classification above. In the case of London house prices in 2015 this might seem like a silly example, but imagine we were mapping a less familiar variable. Attention here is drawn only to a few places owing to the skewed distribution of the data and the equal numerical intervals between the choropleth class breaks. The next map shows the same data, using the 'natural breaks' classification.
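
The equal interval idea is simple enough to sketch in a few lines of Python. The prices below are made-up stand-ins for the real postcode sector averages, just to show how a skewed distribution interacts with evenly spaced breaks:

```python
def equal_interval_breaks(values, n_classes):
    """Break points at equal numeric steps between the min and max value."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_classes
    return [lo + step * k for k in range(1, n_classes)]

# With heavily skewed prices, most observations fall in the first class:
prices = [150_000, 200_000, 250_000, 300_000, 350_000, 2_150_000]  # hypothetical
print(equal_interval_breaks(prices, 4))  # → [650000.0, 1150000.0, 1650000.0]
```

Note that all but one of these hypothetical areas would end up in the bottom class - exactly the 'only a few places stand out' effect described above.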

The 'natural breaks' (Jenks) algorithm is specific to each dataset

Natural breaks certainly does a better job, but it's still not great. The algorithm looks for natural groupings in the data distribution, yet we still don't see much of the variety of prices across London. We do now see some higher-price areas in outer London, but most of the city still appears rather humbly priced according to this map. One way round this would be to use a quantile classification (below), which puts the same number of areas in each coloured category.
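
For the curious, the optimisation behind natural breaks can be sketched as a small dynamic program that minimises each class's squared deviation from its mean. Real GIS packages use optimised implementations of the Fisher-Jenks algorithm; this is just a toy version on made-up numbers:

```python
def jenks_breaks(values, n_classes):
    """Toy Fisher-Jenks: split sorted values into n_classes groups,
    minimising the within-class sum of squared deviations."""
    xs = sorted(values)
    n = len(xs)
    # prefix sums let us compute each class's squared deviation in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i, j):  # squared deviation of xs[i..j] (inclusive) from its mean
        m = j - i + 1
        s = s1[j + 1] - s1[i]
        return (s2[j + 1] - s2[i]) - s * s / m

    INF = float("inf")
    # dp[m][j] = best cost of splitting xs[0..j] into m classes
    dp = [[INF] * n for _ in range(n_classes + 1)]
    cut = [[0] * n for _ in range(n_classes + 1)]
    for j in range(n):
        dp[1][j] = ssd(0, j)
    for m in range(2, n_classes + 1):
        for j in range(m - 1, n):
            for i in range(m - 1, j + 1):
                c = dp[m - 1][i - 1] + ssd(i, j)
                if c < dp[m][j]:
                    dp[m][j], cut[m][j] = c, i
    # walk the cuts back to recover the upper bound of each class but the last
    breaks, j = [], n - 1
    for m in range(n_classes, 1, -1):
        i = cut[m][j]
        breaks.append(xs[i - 1])
        j = i - 1
    return breaks[::-1]
```

With the toy values [1, 2, 3, 10, 11, 12, 100] and three classes, this finds the intuitive grouping [1, 2, 3], [10, 11, 12], [100], i.e. internal breaks at 3 and 12 - hence 'specific to each dataset'.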

Quantile puts the same number of observations in each category

The quantile classification is much more like what we might expect to see. There is greater variation across the map, with each class well represented across all London Boroughs, but of course the price breaks themselves are rather arbitrary. People just don't think about house prices in this way, so it's less meaningful. If I were doing this for real I'd probably tweak it a bit - and quantile classification is normally best for linearly distributed data anyway. So, let's try a manual classification next...
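
A quantile scheme just slices the sorted data into equal-sized chunks, so the breaks land wherever the observation counts dictate - which is exactly why they end up at odd-looking prices. A minimal sketch, again on hypothetical values:

```python
def quantile_breaks(values, n_classes):
    """Break points so each class holds (roughly) the same number of areas."""
    xs = sorted(values)
    n = len(xs)
    # upper bound of each class except the last
    return [xs[(n * k) // n_classes - 1] for k in range(1, n_classes)]

# hypothetical sector averages: eight areas, four classes, two areas per class
prices = [150_000, 200_000, 250_000, 300_000, 350_000, 500_000, 900_000, 2_150_000]
print(quantile_breaks(prices, 4))  # → [200000, 300000, 500000]
```

The breaks fall at whatever values happen to split the observations evenly, not at thresholds anyone actually uses when thinking about prices.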

A manual classification, with counts for each category

House prices and housing search are a couple of areas I've researched in recent times and I know from the data that people tend to segment the market by price at different cut-off points, such as £250,000, £500,000 and so on. In this case, I think it makes most sense to adopt a manual classification and also to show how many areas are in each class. I like this additional information as it helps me make sense of the underlying data distribution and also serves as a sense check on my own visual perception of the map. There are many problems with choropleth maps, but I think that in the legends we can communicate more information than is currently the case, as I wrote previously on my old blog.
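
Producing the counts for the legend is straightforward once you have your manual breaks. A sketch using the £250k/£500k/£1m style cut-offs mentioned above (the prices themselves are hypothetical):

```python
from bisect import bisect_right

def classify_with_counts(values, breaks):
    """Assign each value to a class given manual upper bounds, and count
    how many observations land in each class (for the map legend)."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect_right(breaks, v)] += 1
    return counts

# hypothetical sector averages; breaks at the £250k, £500k and £1m cut-offs
prices = [180_000, 240_000, 310_000, 480_000, 520_000, 900_000, 1_500_000]
print(classify_with_counts(prices, [250_000, 500_000, 1_000_000]))  # → [2, 2, 2, 1]
```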

One final map using the same data is shown below. This time, I have used only two classes but it's still the same data. What I've done here is use the March 2015 average UK house price (£273,000) to show which areas are above and which are below this point. This is an example of using data classification to draw attention to a particular issue (in this case, relative high prices in a UK context). Not everyone who does this kind of thing is open and honest about it of course, but sometimes it can be an effective way to make a point.
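
The two-class version is the simplest scheme of all - a single threshold at the national average. Sketched with hypothetical sector averages:

```python
# Two-class scheme: above or below the March 2015 UK average (£273,000).
# The prices are hypothetical stand-ins for the postcode sector averages.
uk_average = 273_000
prices = [180_000, 310_000, 900_000, 250_000]
labels = ["above" if p > uk_average else "below" for p in prices]
print(labels)  # → ['below', 'above', 'above', 'below']
```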

Let's all move to Barking and Dagenham

Which of the above maps is the most accurate and truthful? Well, they all use the same data and none of them are wrong, so surely it's all just a matter of interpretation and personal preference? I don't think so. For me, the manual classification showing area counts is the most appropriate for my purposes - the class breaks have some logic in relation to the topic, the map reader can tell (rather than infer) how the data are distributed across classes, and there is a reasonable level of variation within and across London Boroughs - unlike the natural breaks or equal interval versions.

Mind you, I quite like the potency of the last map, but I just did that to make a point I was interested in. The real message here is that you can draw different conclusions from a single dataset depending upon how you classify and map it. This is a really obvious point but is quite often overlooked by analysts who are, unsurprisingly, 'primarily concerned with their own proper business', as Anscombe said.

Data: house price data from HM Land Registry, Price Paid Data. Boundary data from GeoLytix.

Software: QGIS 2.10.

Citation: Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.