Sunday 23 October 2016

The Global Human Settlement Layer: an amazing new global population dataset

At the recent Habitat III conference on housing and sustainable development in Quito, Ecuador (17-20 October 2016) the European Commission launched a new Global Human Settlement (GHS) dataset, and that's what this post is about. First, some basic details. Landsat data from 1975, 1990, 2000 and 2014 were processed and analysed to produce three different GHS products: one on population (GHS-POP), one on built-up areas (GHS-BUILT) and one city model dataset (GHS-SMOD). But this is already getting too technical, so let's look at some maps - I've created a few in 3D for fun to give you an idea of what the population dataset looks like. Each image below uses the 250 metre resolution version (there is also a 1km cell version for population).

I've added a few place names and extruded cells by population

The London example is pretty interesting and I think it provides a nice overview of settlement patterns, both in terms of distribution and density. As you can tell, this is just a small chunk of the data - it covers the whole world so it's a pretty big file - but more on that below. Because of the file size, I took a smaller extract to explore further. I exported the United States as a separate file and looked at four metro areas I thought would be interesting from a population density point of view: San Francisco, Los Angeles, Houston and New York. The highest population value in any single 250m cell in the London example above was 1,595, so I thought it would be interesting to see how the US cities compare. Let's take a look, starting with the Bay Area around San Francisco.

That big spike in the north? That's San Quentin State Prison

This was also just a little extract, but again it gives an interesting view of population density. The population spike in the north of the Bay Area surprised me but then I looked at it more closely. The data showed a figure of 4,856 people in that 250m cell, which seemed pretty high so I dug a bit deeper. The Wikipedia page for San Quentin State Prison tells us there are 4,223 prisoners (137% of capacity) and another quick search tells me there is employee housing there too, so this figure stacks up. The next highest value is in San Francisco, so this makes sense too. But what about Los Angeles - how did that compare?

This is just a part of the wider Los Angeles metro area

The highest population value in any of the 250 metre cells in Los Angeles was 2,285, which surprised me a little because I didn't think it would be much higher than London's. This was just a quick and dirty extract, so no labels here (or scale bars, sorry) but you do get a sense of the urban density and distribution of settlements. One place I did expect to show much lower density was Houston, and I was proved right, as you can see below.

The sprawling metropolis of Houston

The highest population figure in any one 250 metre cell in Houston was 813, according to the GHS dataset. This is of course not surprising but I found it quite interesting to see it like this. Finally, I wanted to see what New York and its wider metro area looked like. I thought it would beat San Francisco for density, and it did.

An obvious spike in population density in most New York City boroughs

The highest population figure for New York City per 250 metre cell was 6,189. This makes sense when you think that a tall residential apartment building can easily fit within one such cell - and in fact multiple buildings can. Mind you, it's still a pretty high figure.
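To compare these per-cell peaks with more familiar density figures, the counts can be converted to people per square kilometre. This is a rough sketch, assuming each cell really covers 250 m by 250 m on the ground (Mollweide is an equal-area projection, so that is a reasonable approximation). The dictionary and function names are just for illustration.

```python
# Peak population per 250 m cell, as quoted in this post (2014 GHS-POP data).
PEAK_PER_CELL = {
    "London": 1595,
    "San Quentin (Bay Area)": 4856,
    "Los Angeles": 2285,
    "Houston": 813,
    "New York": 6189,
}

CELL_SIZE_M = 250  # nominal cell edge length in metres


def density_per_km2(people_in_cell, cell_size_m=CELL_SIZE_M):
    """Convert a per-cell head count to people per square kilometre."""
    cell_area_km2 = (cell_size_m / 1000) ** 2  # 0.0625 km2 for a 250 m cell
    return people_in_cell / cell_area_km2


# Print the places from the highest peak to the lowest.
for place, count in sorted(PEAK_PER_CELL.items(), key=lambda kv: -kv[1]):
    print(f"{place:<24} {count:>6,} per cell  ~{density_per_km2(count):>9,.0f} per km2")
```

Since sixteen 250 m cells fit into one square kilometre, each figure is simply multiplied by 16. Bear in mind these are peaks for single cells, not averages for a whole city.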

These examples are from the 2014 dataset but there is so much else to see, if you have the time and skills to explore it. I'm at risk of becoming addicted to it, so I'll have to restrain myself. For now, I recommend that you check out the European Commission web pages on the data.

The rest of this post includes more technical information, possibly of interest to only a few data/GIS nerds with nothing better to do with their lives.

About the data (and yes, it's open and free)
The most important thing is to know where to get the data (once you've read about what it is) but this can involve endless clicking so here's an FTP link to the downloads. I've focused on the population part of the dataset here and it comes in TIF format. The 250 metre resolution file is about 626MB in size, so you need a decent machine to work with it. In terms of map projection, it comes as World Mollweide (usually shown as EPSG:54009, although strictly speaking 54009 is an Esri code rather than an EPSG one).

Quite a big file, but not too bad considering it's global
Here's the Copyright text file for the datasets

You can then open the file in your chosen GIS - I've shown a couple of examples of this below, one with ArcGIS and one with QGIS (the dataset notes file specifically mentions both of these). I have found it easier to work with in QGIS so far. When you first open the file you won't be very impressed - some further styling is needed. Also, in ArcGIS the reported high value is an impossible figure and in QGIS the values go from zero to zero - again, this just needs some tweaking in order to display something meaningful.

Notice the strange high values - that's not right!

Yep, nobody lives on earth (0 to 0 in the values on the left of the image)

Once you get the data on screen, you can start to style it and get something meaningful in front of you. Here's an example from ArcGIS, where you can see that there is some 'blockiness' in the data in some areas - it's not perfect at 250m resolution, so at times the 1km product may be better on this front.

The high value of 7368 seems more reasonable here

Now let's take a look at a more cleanly styled view, this time for England. As you can see below, this now gives us quite a nice overview of the settlement pattern for the country.

This is just the original raster dataset, zoomed in

Since I extracted the data for just the United States, I also have a nice separate 250m cell version of that. I actually converted this to a vector layer in QGIS (and it's about 850MB) so here's what that looks like for the lower 48 states. I think this is quite pleasing to the eye. Click to make it bigger - it's a good approximation of the settlement pattern of the United States.

This is a vector version of the 250 metre population dataset for the US

One thing I haven't yet got to the bottom of is what the maximum population of any single 250m cell is. In both ArcGIS and QGIS, the maximum seems to be 634,492 - which isn't right. You definitely can't fit that many people in a 250 metre square! Hopefully someone will get to the bottom of this. I think this figure might come from aggregated blocks of cells in the data but so far I haven't had time to figure it out.

How to work with the data
Working with the data is quite tricky so here are a few tips on how I dealt with it in QGIS. This last part describes how I extracted a subset of the original massive 626MB TIF so that I could work with smaller chunks and then convert them to vector format for doing some 3D maps. All I did was load up the full 250 metre resolution population dataset and then go through the steps you can see in the screenshots below.

This is the original dataset, zoomed to Liverpool and Manchester

You can then just select an area of the TIF to extract by clicking and dragging

Using the new TIF, I then converted it to a vector layer

This is the new vector layer, zoomed and symbolised
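The click-and-drag extract can also be thought of as a pixel window worked out from a bounding box. The sketch below shows that arithmetic for a north-up raster with square cells. The origin and bounding-box coordinates are made-up placeholder values in metres (not read from the GHS file), and `bbox_to_window` is a hypothetical helper for illustration, not a QGIS function.

```python
import math


def bbox_to_window(origin_x, origin_y, cell_size, bbox):
    """Return (col_off, row_off, width, height) in pixels for a bounding box.

    origin_x, origin_y: map coordinates of the raster's top-left corner.
    cell_size: cell edge length in map units (250 for this dataset).
    bbox: (min_x, min_y, max_x, max_y) in the same coordinate system.
    Assumes a north-up raster, so rows count downwards from origin_y.
    """
    min_x, min_y, max_x, max_y = bbox
    # Snap outwards to whole cells so the window fully covers the box.
    col_start = math.floor((min_x - origin_x) / cell_size)
    col_stop = math.ceil((max_x - origin_x) / cell_size)
    row_start = math.floor((origin_y - max_y) / cell_size)
    row_stop = math.ceil((origin_y - min_y) / cell_size)
    return col_start, row_start, col_stop - col_start, row_stop - row_start


# Placeholder values for illustration only; real ones come from the TIF's metadata.
EXAMPLE_ORIGIN = (-1_000_000.0, 7_000_000.0)
EXAMPLE_BBOX = (-300_100.0, 6_300_000.0, -200_000.0, 6_400_100.0)

window = bbox_to_window(*EXAMPLE_ORIGIN, 250, EXAMPLE_BBOX)
print(window)
```

A window like this is the sort of thing raster tools generally work with when reading part of a large file rather than the whole thing, which is what keeps an extract manageable.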

Finally, I decided to do a little bit of experimenting with the 2.5D symbology options in QGIS (available from version 2.14 onwards). The images at the top of the post were done in ArcScene (part of ArcGIS). Ideally I'd have done this in Blender instead, but that would have taken too much time. Also, I'm waiting to see what Steve Bernard and others might do with this dataset - there are so many possibilities and so far my Blender skills are really limited.

This might break your computer if you try too big an extract (e.g. an entire country)

Finally, a zoomed in version of the above viz

There's so much that you could do with this data for research purposes, or just for fun, but the first hurdle is getting your head round the data and how to work with it. This post is just intended as a small contribution in that vein. I hope some find it helpful.

Notes: the GHS population dataset is a giant raster (TIF format) of 626MB when compressed. I created an uncompressed version (by mistake) and it was 33GB! There are 141,969 columns and 60,829 rows in the full raster - this multiplies out to about 8.6 billion cells, so I don't recommend trying to convert the whole thing to a vector layer because it won't work and is not a good idea anyway. The creation of the dataset was supported by the Joint Research Centre (JRC) and the DG for Regional Development (DG REGIO) of the European Commission, together with the international partnership GEO Human Planet Initiative. Lots of very clever individuals contributed to the project, and you can find out more about the team on the GHSL people pages.
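The cell count and the size of the uncompressed file can be sanity-checked with a little arithmetic. This is only a back-of-envelope sketch: the 4 bytes per cell is an assumption (one 32-bit value per cell), not something taken from the dataset documentation.

```python
COLUMNS = 141_969
ROWS = 60_829
BYTES_PER_CELL = 4  # assumption: one 32-bit value per cell

cells = COLUMNS * ROWS
raw_bytes = cells * BYTES_PER_CELL

print(f"cells: {cells:,}")
print(f"uncompressed: {raw_bytes / 1e9:.1f} GB ({raw_bytes / 2**30:.1f} GiB)")
```

Under that assumption the raw size comes out in the low-to-mid 30s of gigabytes, which is in the same ballpark as the 33GB file I ended up with.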