So I'm modelling the H1N1 Swine Flu outbreak in Glasgow. One of the questions is how different regions (specifically 'datazones') within the study area interact with each other. You'd expect zones next to each other to interact with each other more than those on opposite sides of the city, for example. So our first idea is to use the distance between centroids as an explanatory variable and a distance-decay parameterisation for the effect. That works quite nicely.

But the real world isn't quite like that. Glasgow has a river running through it, splitting our study region into north and south parts. I wanted a new distance metric to reflect this partition.

So I came up with this idea - that the distance between two datazones is the minimum distance between the centroids of the graph formed by adjacent datazones.

First problem of that is computing the adjacencies between datazones. In R, I can just use poly2nb from the spdep package and get that. But I found that the graph created wasn't connected over the study region. Further investigation showed that polygons that clearly looked adjacent weren't being considered as adjacent. Here's a screenshot that shows three regions that look adjacent:

Now why wasn't poly2nb saying they were connected? I zoomed in for the answer:

Check the scale bar. These polygons are about a hundredth of a millimetre apart. That's a very tiny no-man's land. More like a "no-fly zone".

I figured I needed some kind of buffer on the polygons, and started looking into the rgeos package, recently released onto CRAN. It'll do that, and all sorts of things, but then when I tried working out adjacencies using its gIntersects function I discovered it worked to those tolerances already. By some magic the apparently adjacent regions were adjacent again.

However the graph was still not fully connected - it was composed of two subgraphs - the river is wide enough, and runs fully from east to west, that no fuzzy tolerances could really overlap the polygons. And that would probably be the wrong thing to do anyway.

So I loaded the regions into Quantum GIS and then loaded some Google Maps as underlays using the marvellous Google Maps plugin. I created a new vector layer of lines and started adding extra connections where bridges cross the Clyde:

These lines, shown in red above, were saved as a shapefile, then loaded into R and their start and endpoints overlaid on the datazones. That gave me some extra edges to add to the adjacency graph, all the graph stuff being handled by R's igraph package (including the distance-weighted minimum distance algorithm along adjacent datazones).

So now I have a completely-connected graph of datazones, which I can convert into a full distance matrix that reflects the geometry of the study region. The university's High-End Cluster is currently churning out the number for the model fit, and the results will hopefully be published in a journal sometime...

Thanks to: R, Quantum GIS, Google Maps plugin authors Ivan Mincik and Nungyao Lin, R package developers for sp, spdep, rgeos, igraph.