
Hexmaps in QGIS

Hexes! If you like boardgames you probably love those awesome maps where terrain has been transformed into a grid of hexagons (popularly known as hexes). Beyond this geeky interest, hex-based maps can be used to create interesting visualizations where you want to colour the map based on a specific variable.

These visualizations are technically known as choropleth maps and they divide the space into a set of polygons that could be anything: country boundaries, Voronoi diagrams or regular tiles.

A choropleth map by Toby Hudson visualizing the fraction of Australians who identified as Anglican in the 2011 census

Regular tiles are interesting if you don’t have a relevant distribution of pre-existing polygons. But what type of tile should you use? The typical approach is a grid of squares but, as boardgamers already know, this is far from perfect. The issue is that the distance between a square’s centre and those of its neighbours depends on their configuration: as Pythagoras knew some centuries ago, the diagonal neighbours are √2 times farther away than the ones directly left, right, above or below. Hexagons better capture the spatial relation between tiles because the 6 neighbours of each hexagon are all positioned at the same distance from its centre. Also, did I mention that hexmaps look awesome?

Ok, let’s see how we can create a hexagon-based map with QGIS.

Roman camps in Scotland

We know that the Roman legions ventured beyond the Antonine Wall; the sources talk about military campaigns, and the archaeological evidence supports this idea because several temporary marching camps have been found in Scotland. But where do we find these camps? To answer this question let’s create a hexmap where the hexes are coloured based on the number of temporary camps they contain.

Load the dataset

This zip file contains 2 vector files in Shapefile format:
– scotland_boundaries.shp has the boundaries of the region
– roman_camps.shp is the set of identified Roman temporary camps compiled by Canmore.

Roman temporary camps in Scotland

Install the mmqgis plugin

Go to Plugins -> Manage and Install Plugins and install mmqgis. This plugin extends the functionality of QGIS for vector map layers.

Install mmqgis using the plugin manager

Create a hexagonal grid

Once mmqgis is enabled you can use its functionality to create the hexes; go to MMQGIS -> Create -> Create Grid Layer and select Hexagons as the layer type.

You should set the extent to the scotland_boundaries layer because we want the grid to cover the entire region. Finally, define the size of the hexes; I chose 25km here because it is roughly the distance a legion could cover in a day.

Parameters for a 25km-based hexagonal grid of Scotland
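If you prefer to script this step, here is a minimal sketch in R with the sf package; this is an alternative to the mmqgis GUI route, not part of the original workflow, and it assumes the two shapefiles from the zip use a metric projected CRS (such as the British National Grid):

library(sf)

# Load the two layers described above
boundaries <- st_read("scotland_boundaries.shp")
camps <- st_read("roman_camps.shp")

# Hexagonal grid covering the extent of the boundaries;
# cellsize is expressed in map units, so 25000 means 25km hexes
hex_grid <- st_sf(geometry = st_make_grid(boundaries, cellsize = 25000, square = FALSE))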

Intersect the grid with the boundaries

You probably got a hexagonal grid covering a large rectangle; it is kind of useless because, to my knowledge, the Romans did not have submarines, so we should remove from the grid everything that is not land. In essence we want to remove everything that falls outside the scotland_boundaries layer. You can do this using the Intersection command from Vector -> Geoprocessing Tools.

You need to specify the hex grid as the Input layer and the boundaries as the Intersect layer. Please be aware that this process will take a while, especially if you defined a small hex size.
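In the sf sketch started above, the same clipping is a single call:

# Clip the grid to the landmass: hexes entirely at sea disappear
# and coastal hexes are trimmed to the coastline
hexes <- st_intersection(hex_grid, boundaries)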

Count the number of camps per hex

The last step is to create a new hexagonal layer with an attribute holding the number of camps per hex. This algorithm is not in the menus, so open the Toolbox from the Processing menu. Go to QGIS -> Vector analysis tools and select Count points in polygon.

The algorithm is hidden in the toolbox

The input parameters for the algorithm are straightforward; fill them in and create the new layer.

Not much to explain here…
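Continuing the sf sketch, the count can also be computed with a spatial predicate; the attribute name NUMPOINTS mirrors the one created by the QGIS algorithm:

# st_intersects() returns, for each hex, the indices of the camps
# that fall inside it; the count is the length of each index list
hexes$NUMPOINTS <- lengths(st_intersects(hexes, camps))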

Visualize the result

Double-click on the new layer and set the type of Style to Graduated based on the attribute NUMPOINTS. Classify using a decent color ramp and you are done!

Standard Deviation is a good color mode for this type of data

Looks like an 80s-Avalon Hill-style wargame!
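For completeness, the layer produced by the sf sketch can be drawn directly in R with ggplot2; this is a rough equivalent of the graduated style, not a reproduction of the QGIS symbology:

library(ggplot2)

# Colour each hexagon by the number of camps it contains
ggplot(hexes) +
  geom_sf(aes(fill = NUMPOINTS)) +
  scale_fill_viridis_c(name = "Camps") +
  theme_bw()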

Discussion

You can use this method to overlay several layers of information:

Blending hexes with the Stamen Terrain layer

What can we say from this visualization? Some interesting spatial patterns are clearly visible:
1. The Firth of Forth seems to concentrate the majority of camps
2. The Romans definitely did not like Western Scotland. They probably did not want to move far from the coast where their fleet supported them
3. The route followed by the armies was probably used as the basis for the Gask Ridge fortification line

Acknowledgements

This post was heavily inspired by an entry written by Anita Graser on her blog.

Identifying gaps in your data

One of the first things you want to do when you explore a new dataset is to identify possible gaps. Sample size and the number of variables are relevant, but… how many observations do you have for each variable? This distinction is even more relevant for archaeologists because (if we are being honest…) most of our data has huge gaps.

Just to make the post clear:
– The Sample is the set of entities you collected.
– Variables are measures and properties of this sample.
– Observations are the values of the variables for each item in your sample.

The identification of variables with a decent number of observations is crucial for several processes. Let’s say that you have a bunch of archaeological sites and you want to create a map where the size of each dot (i.e. site) is proportional to the area of the site. This would be a bad idea if 90% of your sample does not have an assigned area, because all of those sites would simply be missing from the map.

This is even more relevant if you want to do some modelling (e.g. a linear regression). Many statistical models silently drop any observation that has a missing value in one of the variables used, so you have to be very careful about it. Let’s see how we can explore this issue.
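A minimal illustration of this behaviour with R’s lm(), on toy data rather than the arrowheads dataset used below:

# lm() silently drops incomplete rows (the default na.action is na.omit)
d <- data.frame(x = c(1, 2, 3, NA, 5),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1))
m <- lm(y ~ x, data = d)
nobs(m) # 4, not 5: the row with the missing x was dropped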

Example: Arrowheads in the UK

Loading the dataset

We downloaded a dataset of arrowheads collected by the Portable Antiquities Scheme:

arrowheads <- read.csv("https://git.io/v9JJd", na.strings = "")

As you can see, I specified that empty strings of text should be read as NA (i.e. Not Available). If you don’t do that, they will be read as empty strings, which is different from not having a value.
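A tiny self-contained example of the difference, using read.csv() on an inline string:

csv <- "id,classification\n1,Arrowhead\n2,"
is.na(read.csv(text = csv)$classification) # FALSE FALSE: "" is read as a value
is.na(read.csv(text = csv, na.strings = "")$classification) # FALSE TRUE: "" becomes NA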

If we take a look at the newly created arrowheads data frame we will see a bunch of interesting metrics:

str(arrowheads)

You should get something like:

'data.frame': 1079 obs. of 13 variables:
 $ id               : int 522443 179174 233283 199204 106547 45059 508485 646649 401936 133638 ...
 $ classification   : Factor w/ 81 levels "Arrowhead","Barb and Tanged",..: NA NA 10 NA NA NA NA NA NA NA ...
 $ subClassification: Factor w/ 19 levels "barbed and tanged",..: NA NA NA NA NA NA NA NA NA NA ...
 $ length           : num 28 55.2 45.3 100.3 39 ...
 $ width            : num 18 11 25 7.6 11 ...
 $ thickness        : num 3 2 5.11 6.7 NA 3.54 3.47 6.5 1 NA ...
 $ diameter         : num 2.2 4.5 4.58 5.1 6 6.82 6.96 7.5 8 8 ...
 $ weight           : num NA 6.1 2.78 16.69 4.76 ...
 $ quantity         : int 1 1 1 1 1 1 1 1 1 1 ...
 $ broadperiod      : Factor w/ 12 levels "BRONZE AGE","EARLY MEDIEVAL",..: 1 4 1 4 1 4 4 4 4 4 ...
 $ fromdate         : int -2150 NA -2150 1250 -2150 1066 1066 1200 1066 1200 ...
 $ todate           : int -1500 NA -1500 1499 -800 1350 1400 1300 1500 1499 ...
 $ district         : Factor w/ 188 levels "Arun","Ashford",..: 45 181 74 NA 26 93 177 137 186 53 ...

See all these NA values? These are the gaps in our data. We can suspect that diameter or subClassification will not be of much use, but with over 1000 arrowheads it is difficult to know which variables should be used in the analysis.
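Before plotting anything, a quick tally in base R already gives a rough answer; this is a minimal check on the arrowheads data frame created above:

# Number of missing observations per variable
colSums(is.na(arrowheads))

# The same as a fraction of the 1079 records
round(colMeans(is.na(arrowheads)), 2)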

Visualizing gaps

How can you identify these gaps? My preferred method is to visualize them using the Amelia package (yes, awesome name for an R package on missing data…). Its use is straightforward:

install.packages("Amelia")
library(Amelia)
missmap(arrowheads)

Missingness map

The structure is classic R: rows are sample units while columns are variables. Red cells contain observed values, while the pale ones are missing.

Interpretation

The map of missing values allows us to make informed decisions on how to proceed with the analysis. In this case:
– We should not use diameter for analysis because it is missing for most of the sample.
– We have almost complete information on broad spatial and temporal coordinates (broadperiod and district).
– classification and subClassification are quite useless here.
– The measures that can be used are weight, thickness, width and length (see the sketch below).
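One way to act on these decisions is to keep only the records that are complete for the four usable measures; a minimal sketch with base R’s complete.cases() (the variable names measures and usable are mine):

# Rows with no NAs in any of the four usable measures
measures <- c("weight", "thickness", "width", "length")
usable <- arrowheads[complete.cases(arrowheads[, measures]), ]
nrow(usable) # how many arrowheads survive the filter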

Impact

You can easily visualize the impact by creating one plot that uses diameter and another one that does not:
library(ggplot2)
ggplot(arrowheads, aes(x = width, y = diameter, col = broadperiod)) +
  geom_point() + theme_bw() + facet_wrap(~broadperiod)

Scatterplot width vs diameter

Not looking good… R even warns you that 1051 points were removed from the plot, and most of the periods are gone entirely. Compare it with:

ggplot(arrowheads, aes(x = width, y = length, col = broadperiod)) +
  geom_point() + theme_bw() + facet_wrap(~broadperiod)

Scatterplot width vs length

Only 84 rows contained missing values this time. Much better!