Stats, Maps n Pix: January 2016

Sunday, 24 January 2016

Megalopolis revisited: commuting in the Northeastern United States

Back in the summer I did some work on mapping the American commute, which was picked up by WIRED and a few others online. This just proves how much some people like animated gifs but it also demonstrates that it can be a good way to show otherwise complex commuting relations between big cities and their commutersheds. I always meant to come back to it and look at patterns in the Northeastern US more closely, so I decided to finish it off this weekend whilst that part of the world is under a giant blanket of snow and most people aren't going anywhere fast. The real inspiration for looking at all of this, of course, comes from Jean Gottmann's Megalopolis, which describes the massive metropolis stretching roughly from Boston in the north to Washington in the south and taking in over 50 million people. That's enough words already, so take a look at the first animated gif and read on... (click any image to see it full size).

Just 2.2 million people going to work in Manhattan

Now, I love a good animated gif as much as the next person, but there's more to it than that - I promise. I'm really interested here in the extent to which the cities of Gottmann's Megalopolis are actually connected from a functional point of view. I've already written a working paper on it, which you can read if you're really keen, or a related blog post if you prefer. Of course we can simply look at some Census data and see who goes where, but that's harder to fathom at such a vast scale. That's where this Census tract-level data from the American Community Survey comes in useful. I had to patch it all together myself but if you're interested I also made it available for download because I like to share. There are all kinds of crazy commutes in the original dataset, including people who appear to live on the other side of the country but work in New York. You can see this in the next gif, and even more clearly in one still I extracted from it where the shape of the continental US comes out clearly.

Some super-long distance commutes - but not on a daily basis!

If you pause the animation just a frame or two before all points collapse in on Manhattan, you can see the shape of the US from the dots alone. This is an artefact of the way I've created the animations but it's also an interesting insight into the data - and potentially its validity. But also bear in mind the increasingly complex and long-distance live-work patterns now in existence, as documented in this Quartz piece from October 2015. For personal insights on bicoastal commutes, see this thread on flyertalk!

America in one city? Well, not quite, but there is something to this.
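A minimal sketch of why the outline survives, assuming each dot moves in a straight line from home to work and every dot arrives on the same final frame (an assumption for illustration, not a description of how the animation was actually built). Under that assumption, the dots on any frame are a shrunken copy of the starting pattern, centred on the destination. The coordinates and the function name are made up.

```python
def dot_position(origin, destination, t):
    """Position of a dot at progress t (0 = at home, 1 = at work)."""
    ox, oy = origin
    dx, dy = destination
    return (ox + t * (dx - ox), oy + t * (dy - oy))

# Hypothetical (longitude, latitude) pairs: one workplace and three distant homes.
manhattan = (-74.0, 40.8)
homes = [(-122.4, 37.8), (-87.6, 41.9), (-80.2, 25.8)]

t = 0.9  # a frame shortly before arrival
frame = [dot_position(home, manhattan, t) for home in homes]

# Each dot's offset from Manhattan is its original offset scaled by (1 - t),
# so the relative arrangement of the homes is preserved at a tenth of the size.
for home, dot in zip(homes, frame):
    original_offset = (home[0] - manhattan[0], home[1] - manhattan[1])
    current_offset = (dot[0] - manhattan[0], dot[1] - manhattan[1])
    print(original_offset, current_offset)
```

If the dots instead moved at a constant speed and arrived at different times, the shape would not be preserved in this way.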

There are loads of good resources online about this kind of stuff, but not much that demonstrates what the patterns look like, hence my post today. However, three resources worth a closer look are the Federal Highway Administration's Megaregions piece, a Working Paper on Mega Commuters in the US by Melanie A. Rapino and Alison K. Fields of the US Census Bureau, and the Regional Plan Association's America 2050 megaregions project and maps. But I'm guessing it's already time for another animated gif so here goes... This time I decided to focus in on Boston, New York, Philadelphia and Washington, DC and look at commutes ranging from 50 to 100 miles, as there is a good bit of overlap and this fits the 'long-distance' threshold used by many, including Rapino and Fields.

Hypnotised yet? If not, keep looking.
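For anyone wanting to pull out a 50 to 100 mile band from their own flow data, here is a minimal sketch. It assumes distance is measured as the great-circle distance between the home and work points, which may not be how the lengths in the original dataset were calculated. The flows, field names and coordinates below are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def in_band(flow, low=50, high=100):
    """True if the flow's straight-line length falls inside the band."""
    return low <= haversine_miles(*flow["home"], *flow["work"]) <= high

# Hypothetical flows: (lon, lat) of home and work, plus a commuter count.
flows = [
    {"home": (-75.16, 39.95), "work": (-73.97, 40.78), "commuters": 120},  # near Philadelphia
    {"home": (-74.17, 40.74), "work": (-73.97, 40.78), "commuters": 900},  # near Newark
    {"home": (-71.06, 42.36), "work": (-73.97, 40.78), "commuters": 15},   # near Boston
]

long_distance = [flow for flow in flows if in_band(flow)]
print(len(long_distance), "of", len(flows), "flows are between 50 and 100 miles")
```

Only the first flow falls inside the band here: the second is too short and the third too long.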

Most people, of course, don't commute these kinds of distances but I was particularly interested in the relationships over longer distances as it says a lot about the pull of individual cities. In the case of the above, it's actually the county-level I focus on (so New York County (Manhattan) for NYC, Suffolk County for Boston and so on). The animation is deliberately simplified so there are no place names for smaller settlements but you can see these on the still below. Perhaps you can even pick out your own commute. If you're really into this kind of thing, take a look at this Reddit about people who live in Philly but work in NYC.

Think of it as a super-commuter Venn diagram

I should also say a word or two about the dots. The dots go from point to point because a) this is a simplified model of commuting flows; b) even if I did a network version I wouldn't know exactly which way people go; and c) they tell me what I need to know - where people start and where they end up. If we extend the distance out to 200 miles, using flow lines, then the patterns for Boston, New York, Philadelphia and Washington, DC look like the image below, which you can see encompasses a good bit of the so-called Megalopolis.

The BosWash area - now home to about 50 million people

I'm going to write a proper post in future on Gottmann's original ideas, with extracts from his almost 800 page 1961 book on the subject. The precursor to the book was a 1957 paper in Economic Geography on the same topic, which I'll also come back to at a later date. In the meantime, here's a little peek at the front cover and contents page that relate to today's topic. It's interesting to note that, despite its length, Gottmann says 'it may seem bulky to the reader, but the author feels it provides just a brief summary' (p. ix).

Wednesday, 20 January 2016

The spatial dataviz web

So far on this blog I've covered visualising data and how lots of what we see today has a long history. Today I thought I'd write about the 'spatial dataviz web' or, more accurately, the sites and resources that I find useful when doing spatial data analysis and visualisation. So, it's really my spatial dataviz web but that didn't sound as good. This is really just the tip of the iceberg. If you're reading this then the chances are you might know all of it already but hopefully there's something new for someone. I've organised this around websites, data and software with a good bit of overlap between the categories. If there are any great tools I've missed I'd love to hear about it (@undertheraedar on Twitter). I'll add these at the end of the post.

An example of a map created using open data and open source GIS

Websites

QGIS Tutorials and Tips - this website, from Ujaval Gandhi, provides a really good introduction to learning QGIS and includes stuff which total beginners, intermediate and advanced users alike will find useful. I really like the fact that it's all screenshots and words rather than screencasts (though for this, Steve Bernard at the Financial Times has you covered with this QGIS Uncovered video series).

Note the drop-down menu for selecting your language

Andy Kirk's 'Resources' - Andy Kirk is one of the world's most famous dataviz experts and the Resources section of his website is a real treasure trove of dataviz insight. Quite a few of the examples here are relevant for spatial dataviz, so it's well worth a look as he has created a really comprehensive list of examples here.

Go on, try to find something that isn't here

The Spatial Blog - I also really like Nicholas Duggan's The Spatial Blog as it's regularly updated, packed with useful information and easy to read.

CartoGroup at Oregon State University - this group's website is full of technically impressive, beautiful mapping. It's broken down into different sections (Atlases, Demos, Research) and the people behind it have won a ton of awards.

Some lovely work here

Spatial.ly - James Cheshire's site is on most people's radar and his pages are packed with useful resources and great visualisations.

ESRI data dictionary - don't know your TIN from your Tobler? Confused about the difference between topography and topology? Have an exam question on run-length encoding? The ESRI data dictionary has you covered. A real gem but not all that widely known about in some quarters.

Ordnance Survey - cartographic design, via their blog. Great Britain's mapping agency has recently added a series of posts on 'cartographic design principles' to their blog, written by experts such as Charley Glynn.

Some great tips from Ordnance Survey

Data

Natural Earth - a very obvious choice, but if you haven't heard of it or used it, it's basically the place to go for global geodata. If you're looking for a shapefile of US states, or of the sub-national regions of Botswana, this is the place to go. They have cultural, physical and raster data covering the whole world and it's all free to download. In fact, the extremely generous Terms of Use state that 'No permission is needed to use Natural Earth. Crediting the authors is unnecessary'.

The brilliant Natural Earth

GEOFABRIK - we all know that a good chunk of the world has now been mapped by OpenStreetMap but sometimes it can be difficult to know where to turn to get the data in Shapefile format. That's where GEOFABRIK from Karlsruhe in Germany come in. They have compiled, archived and regularly update Shapefiles for all areas of the world, downloadable in national subsets.

A free service, from Geofabrik GmbH

Mapshaper - if you're making interactive maps for the web, there's a good chance that you'll want to reduce the level of detail (and file size) in any spatial data you use. For example, if you download data from somewhere like OS OpenData in the UK, you'll get really high quality data but also big file sizes. There are loads of ways to simplify geodata and reduce the file size, but mapshaper's online interface and simplification algorithms are so simple to use. Created by Matthew Bloch of the New York Times.

Still my go-to tool for most simplification tasks in GIS

NASA - they have a ton of data, but unless you know exactly where to look it can be a bit bewildering for some users. The NASA link here is actually to Derek Watkins' 'SRTM Tile Grabber' which allows you to easily select an area of the earth and download elevation data for almost anywhere on earth. There are also loads of socioeconomic datasets available via NASA's SEDAC centre at Columbia University. NASA also have the amazing Visible Earth image series, which I used to create the little animated globe below.

Project Linework - this is less well known and is a 'library of handcrafted linework for cartography, each designed in an aesthetic style', as it says on the website. Downloads come in ai, geojson, shp and topojson formats. It's really cool.

A really nice initiative, via Daniel P. Huffman

These are the ones I use most often or like the best. But the best GIS data list on the internet is curated by Robin Wilson at the University of Southampton in the UK. It's fantastic and includes everything from glacier outlines to global terrorism.

Software

I've tweeted and blogged a lot about QGIS over the past few years, and with good reason. It's free, open source and constantly improving. It's not the only GIS I use (I still dabble in MapInfo from time to time and use ArcGIS for some geoprocessing) but most of the time I use QGIS now so this is my go-to tool for 80% of tasks. But this section is about free software, so here's an overview of the main free software I use.

QGIS - it might sound mysterious and/or odd to outsiders, and possibly a little threatening to some non-open source advocates but QGIS has over the past five years in particular established itself as a leading geographic information system. Certainly - in my opinion - the best of the open source GIS tools and many see it as a serious 'rival' to proprietary packages. But I prefer not to think of it in terms of rivalry. Instead, I am just in awe of what we can now do with it and grateful to people like Nyall Dawson, Nathan Woodrow, Anita Graser and Tim Sutton.

GIMP - a very dodgy name, for obvious reasons, but a great, free image manipulation programme. I often use this for post-GIS image processing but somewhat less these days because QGIS can now do quite a lot in the Print Composer on its own.

Blender - I'm a real novice at this but am posting it here because I've seen what it can do. For example, Steve Bernard's recent experiments with 3D animated globes.

IrfanView - this is not a spatial tool at all, but a free image viewer and manipulator, developed by Irfan Skiljan, a Bosnian graduate of the Vienna University of Technology, originally from Jajce. It's simple but powerful and really great for things like re-sizing large batches of image files or converting from one image format to another. I also use it a lot for basic editing of screenshots and photos. There are many more powerful tools but I love this for its simplicity and efficiency: two attributes I value highly.

Okay, so not exactly the 'spatial dataviz web' in its entirety but it's my version. If anyone wants to suggest an addition - beyond the more obvious stuff like ColorBrewer and ESRI's GIS Tools for Hadoop - I'd love to share it. Also, you'll find that pretty much everything you could want in terms of data is covered by Robin Wilson's list and for tools and software by Andy Kirk's Resources. I've also had a few suggestions for more, so see below.

Addendum

CartoDB - didn't put this in at first because most people are aware of it but now added at the suggestion of Maarten Lambrechts. I've used CartoDB quite a bit and have, more recently, experimented with different projections.

Mapbox - another one I filed under 'too obvious' but it's made the cut after Maarten's suggestion. Given that Prof Alex Singleton said the following, I think it's worth putting in here: "The new @Mapbox Studio is one of the slickest outputs I has seen from the GI industry in a while! Stunning work and a great barrier remover!"

Any more you want on the list? Let me know.

Sunday, 17 January 2016

Children living in deprived households in England

Not a particularly upbeat post title today, but it's an important topic too often overlooked. I wanted to shed some light on the matter because there are copious amounts of data on the issue, including those released as part of the 2015 English Indices of Deprivation, which I've explored in depth in the past through a series of maps. A recent Twitter message from freelance writer and HuffPost blogger Shumailla Dar prompted me to re-visit this topic (thanks) and since I had most of the data set up, I thought I'd do another map series - this time with the 'Income Deprivation Affecting Children Index' (IDACI) from 2015. See below for an example of what this looks like. In all maps, I've added a little inset to show the pattern from the overall Indices of Deprivation 2015, for comparison. I also show the percentage of each local authority's small areas in each decile on the IDACI measure nationally.

Quite a north-south divide in relation to income deprivation and children

Okay, so what is 'income deprivation affecting children'? It's the proportion of all children aged 0 to 15 living in 'income deprived' families. 'Income deprived' is defined as 'families that either receive Income Support or income-based Jobseekers Allowance or income-based Employment and Support Allowance or Pension Credit (Guarantee) or families not in receipt of these benefits but in receipt of Working Tax Credit or Child Tax Credit with an equivalised income (excluding housing benefit) below 60 per cent of the national median before housing costs'. Again, not the most exhilarating topic, but given the impact this can have on young people's lives, it's such an important one. If you want more information on the details, see the Indices of Deprivation Technical Report. Beyond the technicalities, here's how it looks on the ground in Middlesbrough, the local authority with the second highest proportion of children living in income deprived households (35.7% overall).

Second only to Tower Hamlets on this index

The Indices of Deprivation 2015 Research Report found, through a user survey, that whilst 99% of respondents had used the Index of Multiple Deprivation, the figure for IDACI was 69%. Still high, but it suggests a lot of people haven't looked at it, particularly since most respondents were people already working with the data in local authorities, universities, central government and charities. Most 'normal' people have very little idea that the data exists or what the patterns are like in their area - hence today's post. There is also a similar index which reports income deprivation affecting older people, but that's one for another day. Before showing any more maps, here's the top 20 local authorities across England in terms of the percentage of children living in income deprived households.

Source: DCLG, 2015, p. 23

Now for some more maps, before I provide a link to the folder with a map of every local authority in England. First of all, here's Tower Hamlets. Remember that the bars show the percentage of small areas (LSOAs) in each local authority that are ranked in each decile within England - which sounds a bit confusing, I admit. To give an example instead - for the map below, 54.2% of Tower Hamlets LSOAs are ranked within the most deprived decile in England on the IDACI measure. Just bear in mind that the maps present a relative picture for England as a whole and that, broadly speaking, red = bad and blue = good.

This issue is very well known, but persistent, in Tower Hamlets

Chiltern's at the other end of the scale - but note the single red area

A real mix of areas in Bury

Liverpool is a city of contrasts on this measure
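Since the decile bars can sound confusing, here is a minimal sketch of the arithmetic behind them. It assumes rank 1 is the most deprived area nationally and that deciles are cut by simple integer division of the rank, which may differ slightly from the official method (the published data may already carry a decile column, in which case the first step is unnecessary). The ranks, area count and function names are hypothetical.

```python
from collections import Counter

def national_decile(rank, n_areas):
    """Decile 1 holds the most deprived tenth of areas (rank 1 = most deprived)."""
    return (rank - 1) * 10 // n_areas + 1

def decile_shares(ranks, n_areas):
    """Percentage of one authority's areas falling in each national decile."""
    counts = Counter(national_decile(rank, n_areas) for rank in ranks)
    return {decile: 100 * counts.get(decile, 0) / len(ranks) for decile in range(1, 11)}

# Hypothetical example: a country of 1,000 small areas, and one local
# authority containing eight of them with these national ranks.
N_AREAS = 1000
authority_ranks = [12, 45, 99, 100, 101, 350, 720, 990]

shares = decile_shares(authority_ranks, N_AREAS)
for decile, share in shares.items():
    print(f"Decile {decile}: {share:.1f}% of this authority's areas")
```

In this made-up authority, half of the areas sit in the most deprived national decile, which is the kind of figure the bars report.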

What does any of this tell us? Of course, we know that some places are rich and some are poor and that this will inevitably have an impact upon children in those areas but these maps reveal nothing of cause and effect. Rather, I hope they will provide local agencies, analysts and residents with an opportunity to explore patterns related to income deprivation affecting children in their area and perhaps to think about a topic they hadn't before. It's certainly not a new issue but one that, I think, we could do a lot more about tackling. But that's a step too far for today because I just wanted to share these maps after running off a new batch after being prompted to think about it.

Click here to go to the big Google Drive folder with all the maps

Files are ordered alphabetically, by local authority name

If you want to see comparable maps for the 2015 Indices of Deprivation overall, the 20% most and least deprived and other varieties of deprivation map, see my main IMD 2015 page. If you find any of these maps useful, feel free to use them and share them.

Notes: you can find full details of the data and method on the government's English Indices of Deprivation 2015 web page, including the IDACI spreadsheet. I'm not a member of, or affiliated with, any political party in case anyone asks about the red and blue colour scheme! The labelling is a bit wonky in places owing to the variable coverage of OSM data. I did a version with Ordnance Survey labelling too but this had too many points. Hopefully some of the labels help identify key areas you are interested in.

Sunday, 10 January 2016

Running QGIS on a Chromebook

I recently got a Chromebook because I wanted a lightweight, lightning fast computer that I could use for writing, e-mailing and other less computationally intensive tasks. But I also wanted something with a decent level of power so that I could install Ubuntu and run QGIS on the move (my main laptop weighs about 100kg). I just returned from doing a QGIS/dataviz training course for PhD students in Glasgow where someone was running QGIS on a Chromebook so I thought it was about time I got my act together and did it myself. I'm blogging it here in case anyone wants to have a go, or is considering doing it themselves. But let's start with a nice screenshot of QGIS on the Chromebook first.

Running QGIS in Ubuntu on a Dell Chromebook 13

I didn't want to wipe Chrome OS from my machine so I followed Matt Elias' instructions on how to get Ubuntu on my Chromebook using Crouton. He's tested it for a range of different machines and I think it's the best installation guide out there. I'm pretty new to Chrome OS and Ubuntu but it was really easy. The only problem came when I couldn't tell the difference between a 0 and a O. Once you get it all sorted, you start Ubuntu from a Chrome tab (see below) and then you can switch back and forth between Ubuntu and Chrome in an instant using CTRL+ALT+SHIFT+forward key. It really is super fast. My Chromebook has 4GB of RAM and an i3 processor so it's not top-spec by any means but the performance is great.

Looks a bit scary, is actually pretty easy

I then installed QGIS 2.8 because I'm used to it and it's very stable. This was really quick and easy as well. I decided to see what kind of damage I could do by throwing a massive dataset into the mix, just to test the capabilities of my new machine. For this, I decided to do a bit of experimental mapping with my American commute dataset of about 4 million commuter flows between 74,000 or so Census Tracts in the US. I've made the data available for anyone to use if you like what you see here: it's on my Dropbox. The total size of all parts of the shapefile is 1.6GB, so I thought it might break QGIS in Ubuntu but not at all. Here's the data loading up.

The 4 million lines loaded very quickly, the table a little slowly

I then decided to do a bit of filtering on the dataset to see how this went. Just like on my more powerful machine there was a short delay but it worked without a hitch. The only thing you'll notice is that the QGIS window goes into dimmed mode, as you can see below.

Screen dims while QGIS thinks about it

From here it was a case of experimenting with different filtering options on volume and length of commuting links, and then adding in some place labels and changing the projection, just to see what kind of performance I got. In short, it was great. No crashes, no long hangs and only quite small delays, despite the very large size of the dataset. Here are a few more images from my test session.

A bit of the Midwest without labels

Similar area, reprojected and with labels added

Slightly re-styled and focused on the Bay Area

I was very pleasantly surprised by how slick it is. The only strange thing I noticed was that when I went back to the Layers panel and right-clicked a layer name QGIS sometimes zoomed out to full extent, but this wasn't a major issue for me. I also did a little video of me switching from Chrome OS to Ubuntu and back again - it's not edited and is shown in real time just to give you a little idea of how quick it is. The video is not very good quality because as far as I'm aware there's no screen recorder that will allow me to capture this OS switch so I did it on my Android smartphone. Sorry about the blurriness but you can get the idea and also see a bit of Alt-Tabbing here.

So, my overall verdict... The fact that my Chromebook boots up in 2-3 seconds every time and wakes up in less time than that is a major plus. I can switch between Chrome OS and Ubuntu instantaneously. QGIS works very well and is so fast. It's still better on my dual quad-core, i7, 32GB RAM Dell MegaMonster (I forget the actual name) but for working on the move with a machine that is lighter and much less pricey than a MacBook Pro, I like what I see. It's not a top-end machine, obviously, but for most daily tasks it is more than sufficient.

If you're thinking about getting a Chromebook, then the Dell 13 gets great reviews and I think it's really nice. I know there are already a lot of QGIS users on Linux but I haven't heard too much about QGIS on the Chromebook, probably because Crouton is still relatively new and until recently quite difficult to figure out for many users (or at least people like me with limited knowledge of Chrome OS and Linux).

I love the fact that I can run both operating systems and that they are so fast. Contrary to what some critics say, I can confirm that the Chromebook isn't just a very expensive web browser; they are actually pretty powerful Linux laptops. I'll definitely continue to use my more powerful Windows machine but I really like what I see with this 'dual-boot' set up. If you do have a Chromebook already and want to use QGIS but are not brave enough to attempt the Linux install, you can just run it in a web browser in Chrome, since rollApp have made QGIS available online now.

Why am I not just using a Mac anyway? Good question. I think it all stems from the time when everyone had a BMX but I had a Grifter.

Sunday, 3 January 2016

From Anscombe's quartet to choropleth classification

There's a good chance that if you're into data then you'll know about Anscombe's quartet - a set of four different datasets, each with the same mean, variance and correlation coefficient (to name a few properties). The example comes from Frank Anscombe's 1973 paper, entitled 'Graphs in Statistical Analysis' and it's often used to emphasise the importance of visualising data. Anscombe said in the paper that the purpose was 'merely to suggest that graphical procedures are useful' and I couldn't agree more. Over 40 years later, this still applies and it got me thinking about visual representations in maps, and whether there is a spatial equivalent to Anscombe's quartet. I'm not sure there is but I think the topic of choropleth classification forms a kind of opposite case, of which more below...

The original quartet, from Anscombe, 1973

The case of Anscombe's quartet illustrates that very different datasets can have the same overall properties. In that case, it was a mean of 9, a variance of 11 and so on. With maps, there is a kind of opposite situation, whereby a single dataset can be displayed differently using various data classification systems (as any spatial data analyst will tell you). This brings to mind another Anscombe quote from the 1973 paper where he says that 'unfortunately, most persons who have recourse to a computer for statistical analysis of data are not much interested either in computer programming or in statistical method, being primarily concerned with their own proper business'.
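To check the 'same properties' claim without a chart, here is a short sketch using two of the four datasets (I and IV) from Anscombe (1973). The `pearson_r` helper is written out by hand for illustration; it is not taken from the paper.

```python
from statistics import mean, variance

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Datasets I and IV of the quartet.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for label, xs, ys in (("I", x1, y1), ("IV", x4, y4)):
    print(
        label,
        "mean x:", mean(xs),
        "variance x:", variance(xs),
        "mean y:", round(mean(ys), 2),
        "r:", round(pearson_r(xs, ys), 2),
    )
```

The summary numbers match, yet dataset IV has ten points stacked at x = 8 and a single point at x = 19, which only a plot makes obvious.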

This could lead to a 'different data, same conclusions' situation, which is not ideal. The map example is the opposite. To illustrate the 'same data, different conclusions' kind of thing I'm talking about, I took the example of London house prices - a vastly under-discussed topic I'm sure you'll agree. I used the postcode sector geography and in the maps below have displayed average values for March 2015. The dataset is the same throughout, but the data classification method is different in each case, leading to different patterns and, possibly, different interpretations of the same data.

Equal interval - best used for familiar data ranges (e.g. %)

If you wanted to make the case that house prices in London weren't that much of a problem, you might use the classification above. In the case of London house prices in 2015 this might seem like a silly example, but imagine we were mapping a less familiar variable. Attention here is drawn only to a few places owing to the skewed distribution of the data and the equal numerical intervals between the choropleth class breaks. The next map shows the same data, using the 'natural breaks' classification.
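A minimal sketch of how equal interval breaks behave on skewed data. The prices are invented for illustration (they are not the Land Registry figures), and the function names are hypothetical rather than anything from QGIS.

```python
from bisect import bisect_left

def equal_interval_breaks(values, n_classes):
    """Upper bound of each class when the data range is cut into equal-width bands."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    return [low + width * i for i in range(1, n_classes)] + [high]

def class_counts(values, breaks):
    """Number of values per class; a value equal to a break joins the lower class."""
    counts = [0] * len(breaks)
    for value in values:
        counts[bisect_left(breaks, value)] += 1
    return counts

# Hypothetical, deliberately skewed average prices in thousands of pounds.
prices = [180, 210, 240, 260, 300, 320, 350, 400, 450, 520, 900, 2180]

breaks = equal_interval_breaks(prices, 5)
print("Class upper bounds:", breaks)
print("Areas per class:", class_counts(prices, breaks))
```

With one very expensive area stretching the range, ten of the twelve values land in the bottom class and two classes are empty, so the map would look almost uniform.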

The 'natural breaks' (Jenks) algorithm is specific to each dataset

Natural breaks certainly does a better job, but again is not great. The algorithm looks for natural breaks in a data distribution but again we don't see much of the variety of prices across London. We do now see some higher price areas in the outer areas of London but most of the city still appears rather humbly priced according to this map. One way round this would be to use a quantile classification (below), which puts the same number of areas in each coloured category.
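For the curious, natural breaks can be sketched as a small optimisation: choose the class boundaries that minimise the squared deviation of values from their class means. The version below is a plain dynamic programme written for illustration. It is not the QGIS implementation, which may use a different algorithm or sampling, and the function name and sample values are made up.

```python
def natural_breaks(values, n_classes):
    """Upper bound of each class, minimising total within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    # Prefix sums let us get the squared deviation of any slice in constant time.
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: lowest cost of splitting data[:j] into k classes.
    # cut[k][j]: index where the last of those k classes starts.
    best = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    cut = [[0] * (n + 1) for _ in range(n_classes + 1)]
    best[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + ssd(i, j)
                if cost < best[k][j]:
                    best[k][j], cut[k][j] = cost, i

    # Walk back through the cuts to recover the largest value in each class.
    breaks, j = [], n
    for k in range(n_classes, 0, -1):
        breaks.append(data[j - 1])
        j = cut[k][j]
    return breaks[::-1]

# Hypothetical values with three obvious clusters.
sample = [1, 2, 3, 10, 11, 12, 50, 51]
print(natural_breaks(sample, 3))  # → [3, 12, 51]
```

Because the breaks depend entirely on where the gaps fall in a particular dataset, two maps classified this way are not directly comparable.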

Quantile puts the same number of observations in each category

The quantile classification is much more like what we might expect to see. There is a greater variety in terms of the number of areas displayed in each class on the map across all London Boroughs, but of course the actual price classification is rather arbitrary. People just don't think about house prices in this way, so it's less meaningful. If I were doing this for real I'd probably tweak it a bit - and quantile classification is normally best for linearly distributed data anyway. So, let's try a manual classification next...
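A minimal sketch of quantile breaks, assuming the simplest rule of dealing the sorted values into equally sized groups. GIS packages may interpolate between values or handle ties differently. The prices are invented and the function names are hypothetical.

```python
from bisect import bisect_left

def quantile_breaks(values, n_classes):
    """Upper bound of each class when sorted values are dealt into equal-sized groups."""
    data = sorted(values)
    n = len(data)
    # Class k ends at the value k/n_classes of the way through the sorted list (rounded up).
    return [data[(k * n + n_classes - 1) // n_classes - 1] for k in range(1, n_classes + 1)]

def class_counts(values, breaks):
    """Number of values per class; a value equal to a break joins the lower class."""
    counts = [0] * len(breaks)
    for value in values:
        counts[bisect_left(breaks, value)] += 1
    return counts

# Hypothetical average prices in thousands of pounds.
prices = [180, 210, 240, 260, 300, 320, 350, 400, 450, 520, 900, 2180]

breaks = quantile_breaks(prices, 4)
print("Class upper bounds:", breaks)
print("Areas per class:", class_counts(prices, breaks))
```

Every class gets three areas, but the top class runs from just above 450 all the way to 2,180, which illustrates why the price bands themselves end up rather arbitrary.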

A manual classification, with counts for each category

House prices and housing search are a couple of areas I've researched in recent times and I know from the data that people tend to segment the market by price at different cut-off points, such as £250,000, £500,000 and so on. In this case, I think it makes most sense to adopt a manual classification and also to show how many areas are in each class. I like this additional information as it helps me make sense of the underlying data distribution and also serves as a sense check on my own visual perception of the map. There are many problems with choropleth maps, but I think that in the legends we can communicate more information than is currently the case, as I wrote previously on my old blog.
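A minimal sketch of a manual classification that reports the count alongside each legend entry. The £250,000 and £500,000 cut-offs come from the text above; the £1,000,000 break, the prices and the function name are assumptions for illustration.

```python
from bisect import bisect_left

def legend_with_counts(values, breaks, unit="£"):
    """Legend labels for hand-picked upper bounds plus an open-ended top class, with counts."""
    counts = [0] * (len(breaks) + 1)
    for value in values:
        # A value equal to a break joins the lower class.
        counts[bisect_left(breaks, value)] += 1
    labels = [f"Up to {unit}{breaks[0]:,}"]
    labels += [f"{unit}{low:,} to {unit}{high:,}" for low, high in zip(breaks, breaks[1:])]
    labels.append(f"Over {unit}{breaks[-1]:,}")
    return [f"{label} ({count})" for label, count in zip(labels, counts)]

# Hypothetical average prices in pounds.
prices = [180000, 210000, 240000, 260000, 300000, 320000,
          350000, 400000, 450000, 520000, 900000, 2180000]

for entry in legend_with_counts(prices, [250000, 500000, 1000000]):
    print(entry)
```

Passing a single break, for example `legend_with_counts(prices, [273000])`, splits the areas into just two classes either side of that value.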

One final map using the same data is shown below. This time, I have used only two classes but it's still the same data. What I've done here is use the March 2015 average UK house price (£273,000) to show which areas are above and which are below this point. This is an example of using data classification to draw attention to a particular issue (in this case, relative high prices in a UK context). Not everyone who does this kind of thing is open and honest about it of course, but sometimes it can be an effective way to make a point.

Let's all move to Barking and Dagenham

Which of the above maps are most accurate and truthful? Well, they all use the same data and none of them are wrong so surely it's all just a matter of interpretation and personal preference? I don't think so. For me, the manual classification showing area counts is the most appropriate for my purposes - the class breaks have some logic to them in relation to the topic they relate to, the map reader can tell (rather than infer) how the data are distributed across classes, and there is a reasonable level of variation within and across London Boroughs - unlike the natural breaks or equal interval versions.

Mind you, I quite like the potency of the last map, but I just did that to make a point I was interested in. The real message here is that you can draw different conclusions from a single dataset depending upon how you classify and map it. This is a really obvious point but is quite often overlooked by analysts who are, unsurprisingly, 'primarily concerned with their own proper business', as Anscombe said.

Data: house price data from HM Land Registry, Price Paid Data. Boundary data from GeoLytix.

Software: QGIS 2.10.

Citation: Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.