Overview

This is a worked example using spatial modelling techniques to investigate the uptake of broadband in Ireland. The data are obtained from the 2016 Irish Census, and each observation applies to a single Electoral Division (ED) - these are geographical areas – there are 3,440 of them in Ireland. A number of variables are supplied here, but the main ones to be used here are

In addition, the actual house count, and number of houses with broadband (both integers) that were used to compute Broadband are available as Broadband_Count and Broadband_Denom respectively. Thus, Broadband was computed using:

Broadband <- Broadband_Count / Broadband_Denom

A first look at the data

The data can be obtained through this link and is in an R data file named ireland_ct_ed.RData. Assuming you have downloaded this and put in into the R folder you are working with, load it using:

load('ireland_ct_ed.RData')

Note that there are 4 data items in this file. All of these are simplefeature R objects - these are similar to data frames, but also contain geographical information. At this point it may be useful to load up some libraries that will help to manipulate and analyse these.

library(tidyverse)
library(sf)
library(tmap)

tidyverse is a library you will likely have already used. sf implements the simplefeatures geographical data structure, and provides some functions to manipulate this. tmap provides map drawing tools, and works well together with sf.

We can take a first look at the data. The data you have loaded contains four objects ED_2016, CTY_2016 and ED_ll, CTY_ll. All of these are sf objects. We can use tmap to draw CTY_2016:

tm_shape(CTY_2016) + tm_polygons()

This give all of the counties in the Republic of Ireland. Looking at the object demonstrates the structure:

CTY_2016
## Simple feature collection with 26 features and 1 field
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 17499.98 ymin: 19589.95 xmax: 334553.3 ymax: 465657.1
## epsg (SRID):    29902
## proj4string:    +proj=tmerc +lat_0=53.5 +lon_0=-8 +k=1.000035 +x_0=200000 +y_0=250000 +ellps=mod_airy +towgs84=482.5,-130.6,564.6,-1.042,-0.214,-0.631,8.15 +units=m +no_defs
## First 10 features:
##      COUNTY                           geom
## 1    CARLOW POLYGON ((279281.7 141187.2...
## 2     CAVAN POLYGON ((243213 283400.7, ...
## 3     CLARE MULTIPOLYGON (((70129.57 14...
## 4      CORK MULTIPOLYGON (((50217.46 38...
## 5   DONEGAL MULTIPOLYGON (((161286 3922...
## 6    DUBLIN MULTIPOLYGON (((332267.2 25...
## 7    GALWAY MULTIPOLYGON (((55329.46 25...
## 8     KERRY MULTIPOLYGON (((37227.54 99...
## 9   KILDARE POLYGON ((276420.7 180518.6...
## 10 KILKENNY POLYGON ((263667.1 112038.8...

There are two columns. COUNTY is a factor variable - this just gives the county name. The second column contains the ‘geometry’ of the county - basically information about the shape of the county in a ‘join the dots’ style, as lists of \((x,y)\) pairs. Although the structure is a list of 2 dimensional vectors, it appears as a single variable in an sf object. This makes life simpler when manipulating these objects. For most operations, you can apply dplyr tools as you would for ordinary data frames.

The other potentially new R feature is the use of tmap tools. Generally these are functions that begin with tm_<something> and tmap works similarly to ggplot. Initialy tm_shape specifies which sf object will be used to create the map, and tm_polygons specifies how the map is actually drawn. Here the default is just to draw some grey polygons, one for each county. When a polygon is drawn, the borders are drawn in black around the filled interior.

The other sf object to investigate now is ED_2016 - first have a look at it:

ED_2016
## Simple feature collection with 3409 features and 59 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 17499.98 ymin: 19589.95 xmax: 334553.3 ymax: 465657.1
## epsg (SRID):    29902
## proj4string:    +proj=tmerc +lat_0=53.5 +lon_0=-8 +k=1.000035 +x_0=200000 +y_0=250000 +ellps=mod_airy +towgs84=482.5,-130.6,564.6,-1.042,-0.214,-0.631,8.15 +units=m +no_defs
## First 10 features:
##    OBJECTID ED_ID    ED_ENGLISH ED_GAEILGE COUNTY      CONTAE PROVINCE
## 1      1025 17011  CARLOW URBAN       <NA> CARLOW Ceatharlach Leinster
## 2       452 17054 GRAIGUE URBAN       <NA> CARLOW Ceatharlach Leinster
## 3      1583 17014      CLONMORE       <NA> CARLOW Ceatharlach Leinster
## 4      2658 17022   HACKETSTOWN       <NA> CARLOW Ceatharlach Leinster
## 5      1086 17023   HAROLDSTOWN       <NA> CARLOW Ceatharlach Leinster
## 6      2745 17029       KINEAGH       <NA> CARLOW Ceatharlach Leinster
## 7      2806 17038        RAHILL       <NA> CARLOW Ceatharlach Leinster
## 8       513 17042     RATHVILLY       <NA> CARLOW Ceatharlach Leinster
## 9      2322 17048       TIKNOCK       <NA> CARLOW Ceatharlach Leinster
## 10     2103 17053  WILLIAMSTOWN       <NA> CARLOW Ceatharlach Leinster
##          CENTROID_X       CENTROID_Y                            GUID_
## 1  672246.595709023 676789.987603116 2AE19629185813A3E055000000000001
## 2  671472.173875166 676847.705808632 2AE196291A5913A3E055000000000001
## 3  695589.189341577 676293.184349241 2AE19629186413A3E055000000000001
## 4   696913.19617028 680551.731852523 2AE19629187F13A3E055000000000001
## 5  692511.415381392 678563.943859296 2AE19629188713A3E055000000000001
## 6   683946.89657837  680230.47376206 2AE19629189713A3E055000000000001
## 7  686357.696901933 683231.311557536 2AE1962918B813A3E055000000000001
## 8   689449.06958151 682615.715972625 2AE1962918C413A3E055000000000001
## 9  692847.174469574 682699.254100846 2AE19629182213A3E055000000000001
## 10 688957.230882521 678916.395756327 2AE196291A5813A3E055000000000001
##    CSOED_3409 OSIED_3441    CSOED_34_1   Shape__Are Shape__Len
## 1       01001      17011  Carlow Urban 2.440754e-04 0.06934973
## 2       01002      17054 Graigue Urban 6.959563e-05 0.04573356
## 3       01003      17014      Clonmore 3.748310e-03 0.36718650
## 4       01004      17022   Hacketstown 2.943561e-03 0.35413291
## 5       01005      17023   Haroldstown 1.540402e-03 0.20685603
## 6       01006      17029       Kineagh 2.402616e-03 0.25118324
## 7       01007      17038        Rahill 2.194834e-03 0.25819529
## 8       01008      17042     Rathvilly 1.696658e-03 0.18785879
## 9       01009      17048       Tiknock 1.435624e-03 0.20036499
## 10      01010      17053  Williamstown 2.072515e-03 0.28916014
##                                GUID GEOGID     Age0_4    Age5_14  Age25_44
## 1  2AE19629185813A3E055000000000001      1 0.05065789 0.09144737 0.3162281
## 2  2AE196291A5913A3E055000000000001      2 0.05338078 0.09822064 0.3402135
## 3  2AE19629186413A3E055000000000001      3 0.04372624 0.17110266 0.2471483
## 4  2AE19629187F13A3E055000000000001      4 0.07609669 0.16472695 0.2882722
## 5  2AE19629188713A3E055000000000001      5 0.08108108 0.12500000 0.2060811
## 6  2AE19629189713A3E055000000000001      6 0.06997085 0.10204082 0.2478134
## 7  2AE1962918B813A3E055000000000001      7 0.07818930 0.15775034 0.2716049
## 8  2AE1962918C413A3E055000000000001      8 0.09028571 0.16342857 0.2628571
## 9  2AE19629182213A3E055000000000001      9 0.05722892 0.18674699 0.2560241
## 10 2AE196291A5813A3E055000000000001     10 0.06007067 0.16607774 0.2791519
##     Age45_64 Age65over EU_National ROW_National Born_outside_Ireland
## 1  0.2160088 0.1567982  0.12368362  0.050862648           0.22227201
## 2  0.2298932 0.1665480  0.07137759  0.008565310           0.10421128
## 3  0.2889734 0.1273764  0.03816794  0.005725191           0.06679389
## 4  0.2300806 0.1333930  0.03783784  0.003603604           0.07567568
## 5  0.2837838 0.1486486  0.04421769  0.000000000           0.06462585
## 6  0.2769679 0.1603499  0.03216374  0.000000000           0.08771930
## 7  0.2592593 0.1289438  0.05241379  0.005517241           0.09793103
## 8  0.2422857 0.1142857  0.04013761  0.009174312           0.08944954
## 9  0.2530120 0.1445783  0.01510574  0.012084592           0.05438066
## 10 0.2826855 0.1307420  0.02508961  0.003584229           0.06451613
##     Separated SinglePerson  Pensioner LoneParent       DINK
## 1  0.07236842  0.125173531 0.08392603 0.26135217 0.13218971
## 2  0.05836299  0.103746398 0.07949791 0.22658610 0.12386707
## 3  0.01711027  0.031423290 0.04979253 0.09859155 0.05633803
## 4  0.05013429  0.013368984 0.06626506 0.23961661 0.06389776
## 5  0.04729730  0.050505051 0.04633205 0.14864865 0.01351351
## 6  0.03498542  0.041176471 0.06338028 0.07058824 0.04705882
## 7  0.03703704  0.017735334 0.05900621 0.12121212 0.08080808
## 8  0.06057143  0.051487414 0.04479578 0.21397380 0.06113537
## 9  0.04819277 -0.012048193 0.07308970 0.09890110 0.04395604
## 10 0.03533569  0.007067138 0.08906883 0.10666667 0.05333333
##    NonDependentKids  RentPublic RentPrivate      Flats   NoCenHeat
## 1         0.2701863 0.173566085  0.31072319 0.27816550 0.017456359
## 2         0.2654028 0.119868637  0.14121511 0.21147541 0.008210181
## 3         0.2815534 0.010869565  0.04891304 0.00000000 0.021739130
## 4         0.2510823 0.169620253  0.08101266 0.01758794 0.010126582
## 5         0.3114754 0.010000000  0.02000000 0.01000000 0.010000000
## 6         0.2698413 0.008333333  0.05833333 0.00000000 0.033333333
## 7         0.2907801 0.023715415  0.07905138 0.01185771 0.023715415
## 8         0.1952663 0.135048232  0.11575563 0.01286174 0.022508039
## 9         0.2615385 0.027027027  0.03603604 0.00000000 0.018018018
## 10        0.2553191 0.039603960  0.03960396 0.00000000 0.009900990
##       RoomsHH  PeopleRoom  SepticTank    HEQual  Employed   TwoCars
## 1  0.03942145 0.005769231 0.004987531 0.1780867 0.3941718 0.1589689
## 2  0.03697865 0.006238899 0.001642036 0.1017442 0.4555369 0.1747212
## 3  0.05798913 0.004929709 0.788043478 0.1908832 0.6004843 0.6440678
## 4  0.04936709 0.005728205 0.367088608 0.1393103 0.4870283 0.4629156
## 5  0.05440000 0.005441176 0.800000000 0.1818182 0.4510638 0.5257732
## 6  0.05908333 0.004837800 0.825000000 0.1781377 0.5246479 0.5948276
## 7  0.05913043 0.004872995 0.470355731 0.1941545 0.5134650 0.5742972
## 8  0.05122186 0.005492781 0.263665595 0.1319703 0.4655436 0.3941368
## 9  0.05747748 0.005203762 0.837837838 0.2018349 0.5418327 0.6363636
## 10 0.05673267 0.004938918 0.821782178 0.2666667 0.5570776 0.5454545
##      JTWPublic   HomeWork        LLTI UnpaidCare   Students Unemployed
## 1  0.027027027 0.03100416 0.028070175 0.03223684 0.14417178 0.12448875
## 2  0.009225092 0.04610951 0.023487544 0.03772242 0.06879195 0.11073826
## 3  0.012195122 0.08687616 0.009505703 0.04372624 0.09927361 0.05084746
## 4  0.017073171 0.06417112 0.019695613 0.04118174 0.08608491 0.14740566
## 5  0.000000000 0.02693603 0.016891892 0.04391892 0.16595745 0.10638298
## 6  0.000000000 0.05882353 0.023323615 0.06122449 0.11267606 0.07394366
## 7  0.014084507 0.07912688 0.016460905 0.03978052 0.08438061 0.09515260
## 8  0.016556291 0.03089245 0.019428571 0.02742857 0.11026034 0.11944870
## 9  0.000000000 0.07530120 0.018072289 0.04518072 0.10358566 0.09960159
## 10 0.008403361 0.04946996 0.010600707 0.03180212 0.10045662 0.04566210
##    EconInactFam       Agric Construction Manufacturing  Commerce
## 1    0.06160532 0.010376135   0.04280156    0.09533074 0.2114137
## 2    0.09731544 0.005524862   0.05524862    0.07734807 0.1602210
## 3    0.09200969 0.177419355   0.06048387    0.12903226 0.2177419
## 4    0.09433962 0.096852300   0.08716707    0.14285714 0.2251816
## 5    0.08510638 0.179245283   0.04716981    0.12264151 0.2735849
## 6    0.10211268 0.181208054   0.07382550    0.10738255 0.2550336
## 7    0.12028725 0.132867133   0.06293706    0.11188811 0.2377622
## 8    0.09494640 0.082236842   0.08881579    0.13486842 0.1875000
## 9    0.10358566 0.154411765   0.11029412    0.13970588 0.2058824
## 10   0.09589041 0.090163934   0.04918033    0.14754098 0.2213115
##      Transport     Public Professional  Internet Broadband Broadband_Count
## 1  0.052529183 0.03501946    0.2107652 0.6882793 0.9094203            1255
## 2  0.051565378 0.02394107    0.1491713 0.6256158 0.9317585             355
## 3  0.028225806 0.04838710    0.2217742 0.6141304 0.6902655              78
## 4  0.060532688 0.03389831    0.1646489 0.6683544 0.7537879             199
## 5  0.009433962 0.05660377    0.1320755 0.6700000 0.7910448              53
## 6  0.046979866 0.04697987    0.1543624 0.7500000 0.8333333              75
## 7  0.073426573 0.05244755    0.1993007 0.8181818 0.8985507             186
## 8  0.023026316 0.06907895    0.2401316 0.7299035 0.9603524             218
## 9  0.051470588 0.05882353    0.1764706 0.7207207 0.8000000              64
## 10 0.065573770 0.03278689    0.1885246 0.7524752 0.8026316              61
##    Broadband_Denom                       geometry
## 1             1380 POLYGON ((273069.2 176635.1...
## 2              381 POLYGON ((271682.5 175783, ...
## 3              113 POLYGON ((299568.4 177477.2...
## 4              264 POLYGON ((299262.9 182985.3...
## 5               67 POLYGON ((295206.7 178276.2...
## 6               90 POLYGON ((284142.1 182690.6...
## 7              207 POLYGON ((288513.9 184447.6...
## 8              227 POLYGON ((291189.5 183621.2...
## 9               80 POLYGON ((293737.8 184173.4...
## 10              76 POLYGON ((291224.3 181156.9...

It is like CTY_2016 but with a lot more columns, and 3440 rows, one for each ED. The geometry information is there, as are a number of other variables. As with normal tables, we can use dplyr to pull out the variables of interest:

ED_2016 %>% select(COUNTY,Broadband,Professional,Age65over)
## Simple feature collection with 3409 features and 4 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 17499.98 ymin: 19589.95 xmax: 334553.3 ymax: 465657.1
## epsg (SRID):    29902
## proj4string:    +proj=tmerc +lat_0=53.5 +lon_0=-8 +k=1.000035 +x_0=200000 +y_0=250000 +ellps=mod_airy +towgs84=482.5,-130.6,564.6,-1.042,-0.214,-0.631,8.15 +units=m +no_defs
## First 10 features:
##    COUNTY Broadband Professional Age65over                       geometry
## 1  CARLOW 0.9094203    0.2107652 0.1567982 POLYGON ((273069.2 176635.1...
## 2  CARLOW 0.9317585    0.1491713 0.1665480 POLYGON ((271682.5 175783, ...
## 3  CARLOW 0.6902655    0.2217742 0.1273764 POLYGON ((299568.4 177477.2...
## 4  CARLOW 0.7537879    0.1646489 0.1333930 POLYGON ((299262.9 182985.3...
## 5  CARLOW 0.7910448    0.1320755 0.1486486 POLYGON ((295206.7 178276.2...
## 6  CARLOW 0.8333333    0.1543624 0.1603499 POLYGON ((284142.1 182690.6...
## 7  CARLOW 0.8985507    0.1993007 0.1289438 POLYGON ((288513.9 184447.6...
## 8  CARLOW 0.9603524    0.2401316 0.1142857 POLYGON ((291189.5 183621.2...
## 9  CARLOW 0.8000000    0.1764706 0.1445783 POLYGON ((293737.8 184173.4...
## 10 CARLOW 0.8026316    0.1885246 0.1307420 POLYGON ((291224.3 181156.9...

We can also use tmap to create a map of a variabe of interest:

tm_shape(ED_2016) + tm_polygons(col='Broadband')

This gives a shaded (choropleth) map showing uptake levels for broadband. Unfortunately the polygon boundaries here get in the way (mainly because there are so many of them). An alternative approach is to use tm_fill which just fills in interiors without drawin boundaries:

tm_shape(ED_2016) + tm_fill(col='Broadband')

Another useful tool with tmap is the possibility to work with more than one sf object - here we add the borders for counties to the map just created:

tm_shape(ED_2016) + tm_fill(col='Broadband') + tm_shape(CTY_2016) + tm_borders()

As a final operation, we can change tmap‘s mode to ’view’ - this allows a web based map to be created:

tmap_mode('view')
tm_shape(ED_2016) + tm_fill(col='Broadband',alpha=0.2) 

Entering

tmap_mode('plot')

puts tmap back to ordinary plotting mode.

Much of this section has focussed on mapping and listing spatial data. Although this is a diversion from the modelling, it is useful to understand these ideas in order to explore spatial structure.

Modelling the data

The main modelling idea here is to consider the variables Professional and Age65over as predictors of broadband uptake. A number of approaches to the prediction can be taken.

A simple model

Possible the simplest approach would be a linear regression model:

\[ y_i = \beta_0 + \beta_i x_{1i} + \beta_2 x_{2i} + \varepsilon_i \] where \[ \varepsilon_i \sim N(0,\sigma^2) \] and

  • \(x_{1i}\) is the \(i\)th Professional observation
  • \(x_{2i}\) is the \(i\)th Age65over observation
  • \(y_i\) is the \(i\)th Broadband observation

The model is fitted here:

model1 <- lm(Broadband~Professional+Age65over, data=ED_2016)
summary(model1)
## 
## Call:
## lm(formula = Broadband ~ Professional + Age65over, data = ED_2016)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46557 -0.06123  0.03061  0.08586  0.23636 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.79160    0.01250  63.303  < 2e-16 ***
## Professional  0.30223    0.04550   6.643 3.57e-11 ***
## Age65over    -0.26826    0.04232  -6.339 2.61e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1168 on 3406 degrees of freedom
## Multiple R-squared:  0.02349,    Adjusted R-squared:  0.02292 
## F-statistic: 40.97 on 2 and 3406 DF,  p-value: < 2.2e-16

However, given the population in each ED differs, the proportions of houses with broadbands will have different standard errors. A more appropriate model might be a binomial model:

\[ M_i \sim \textrm{Binom}(N_i,\textrm{logistic}(\beta_0 + \beta_i x_{1i} + \beta_2 x_{2i})) \] where \(M_i\) is the total number of households with broadband in an ED, \(N_i\) is the total number of households in the ED, and other quantities are as before. The logistic function is defined as

\[ \textrm{logistic}(l) = \frac{\exp(l)}{1 + \exp(l)} \] This is fitted below:

model2 <- glm(Broadband~Professional+Age65over, 
              data=ED_2016, family=binomial, weight=Broadband_Denom)
summary(model2)
## 
## Call:
## glm(formula = Broadband ~ Professional + Age65over, family = binomial, 
##     data = ED_2016, weights = Broadband_Denom)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.8585   -3.9518   -1.5857    0.6838   21.5883  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.91176    0.01869  102.26   <2e-16 ***
## Professional  3.20766    0.07715   41.58   <2e-16 ***
## Age65over    -3.33202    0.05476  -60.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79820  on 3408  degrees of freedom
## Residual deviance: 75125  on 3406  degrees of freedom
## AIC: 91971
## 
## Number of Fisher Scoring iterations: 5

Both models tell a similar story - generally that ED’s with higher proportions of people working in professional occupations, and lower proportions of 0ver-65s have a higher uptake on broadband.

To predict the actual proportion, use the predict function. For the binomial model, type='response' causes predict to return the response itselef (ie \(\textrm{logistic}(\beta_0 + \beta_i x_{1i} + \beta_2 x_{2i})\))

However, looking at the earlier maps, and generally thinking about broadband rollout, some parts of Ireland will be better served be the required fibre optic infrastructure than others. The models here only consider the people who might sign up to broadband (in terms of their age and employment type) but nothing about potential availability or quality of broadbast services offered. One way yo do this might be to add a term for the county in Ireland for each ED. Here a mixed effects model is considered -

\[ M_i \sim \textrm{Binom}(N_i,\textrm{logistic}(\beta_0 + \beta_i x_{1i} + \beta_2 x_{2i} + \gamma_j)) \]

Where \(\gamma_j\) is a random effect associated with county \(j\), and \(\gamma_j \sim \textrm{N}(0,\sigma_\gamma^2)\).

Note that the EDs are nested within counties, so each ED \(i\) is uniquely associated to a county \(j\). A model of this kind can be fitted via the lme4 package. Firstly, load the package:

library(lme4)

Next, fit the model:

model3 <- glmer(Broadband~Professional+Age65over + (1|COUNTY), 
              data=ED_2016, family=binomial, weight=Broadband_Denom)
summary(model3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Broadband ~ Professional + Age65over + (1 | COUNTY)
##    Data: ED_2016
## Weights: Broadband_Denom
## 
##      AIC      BIC   logLik deviance df.resid 
##  66073.7  66098.2 -33032.8  66065.7     3405 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -27.1576  -3.3776  -0.7924   1.2755  17.8448 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  COUNTY (Intercept) 0.09961  0.3156  
## Number of obs: 3409, groups:  COUNTY, 26
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.49103    0.06515   22.89   <2e-16 ***
## Professional  3.05157    0.07931   38.48   <2e-16 ***
## Age65over    -1.82409    0.06357  -28.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Prfssn
## Professionl -0.274       
## Age65over   -0.120 -0.065

Note the AIC of this model is notably smaller than the one without a COUNTY level random effect term. We can obtain an estimate of each \(\gamma_j\) via the coef function:

coef(model3)
## $COUNTY
##           (Intercept) Professional Age65over
## CARLOW       1.553758      3.05157 -1.824095
## CAVAN        1.455126      3.05157 -1.824095
## CLARE        1.369435      3.05157 -1.824095
## CORK         1.726720      3.05157 -1.824095
## DONEGAL      1.319728      3.05157 -1.824095
## DUBLIN       2.600898      3.05157 -1.824095
## GALWAY       1.371666      3.05157 -1.824095
## KERRY        1.376857      3.05157 -1.824095
## KILDARE      1.949811      3.05157 -1.824095
## KILKENNY     1.279233      3.05157 -1.824095
## LAOIS        1.287136      3.05157 -1.824095
## LEITRIM      1.129365      3.05157 -1.824095
## LIMERICK     1.501339      3.05157 -1.824095
## LONGFORD     1.203292      3.05157 -1.824095
## LOUTH        1.883647      3.05157 -1.824095
## MAYO         1.309856      3.05157 -1.824095
## MEATH        1.682377      3.05157 -1.824095
## MONAGHAN     1.475138      3.05157 -1.824095
## OFFALY       1.164156      3.05157 -1.824095
## ROSCOMMON    1.216310      3.05157 -1.824095
## SLIGO        1.430056      3.05157 -1.824095
## TIPPERARY    1.236341      3.05157 -1.824095
## WATERFORD    1.620801      3.05157 -1.824095
## WESTMEATH    1.310943      3.05157 -1.824095
## WEXFORD      1.400396      3.05157 -1.824095
## WICKLOW      1.909264      3.05157 -1.824095
## 
## attr(,"class")
## [1] "coef.mer"

This gives, for each of the grouping levels (ie counties) estimates of each of the terms in the models. Since \(\beta_1\) and \(\beta_2\) are fixed effects terms, they are identical for each county. However the intercept for each county is \(\beta_0+\gamma_j\) - here shown in the (Intercept) column.

Note that the result of the coef function is a list of data frames - here as there is only one grouping term (COUNTY) the list has just one element. We can pick this out using coef(model3)$COUNTY or coef(model3)[['COUNTY]]. This is useful for mapping the estimates of the \(\beta_0 + \gamma_j\) values - a possible dplyr flow to do this is below. In short, this carries out the following tasks:

  • select the data frame of interest from the coef function
  • make the row names into a ‘proper’ column
  • select out the (Intercept) and COUNTY columns. Rename (Intercept) to Gamma
  • join these to the CTY_2016 sf object
  • use tmap to map it
CTY_coef <- 
  coef(model3)[["COUNTY"]] %>% 
  rownames_to_column(var='COUNTY') %>% 
  select(Gamma=`(Intercept)`,COUNTY) %>% 
  left_join(CTY_2016,.)
tm_shape(CTY_coef) + tm_polygons(col='Gamma',style='cont')

There is a pretty clear geographical pattern here.

A further model could be to allow for inter-county variability for the other regression coefficients:

\[ M_i \sim \textrm{Binom}(N_i,\textrm{logistic}(\gamma_{j0} + \gamma_{j1} x_{1i} + \gamma_{j2}x_{2i})) \]

where \[ \left[ \begin{array}{c} \gamma_{0j} \\ \gamma_{1j} \\ \gamma_{2j} \end{array} \right] \sim \textrm{MVN}\left( \left[ \begin{array}{c} \beta_{0} \\ \beta_{1} \\ \beta_{2} \end{array} \right], \Sigma_\gamma \right) \]

where \(\Sigma_\gamma\) is a variance-covariance matrix for \(\gamma_{0j}\), \(\gamma_{1j}\), and \(\gamma_{2j}\).

library(lme4)
model4 <- glmer(Broadband~ Professional + Age65over + (1 + Professional + Age65over|COUNTY), 
              data=ED_2016, family=binomial, weight=Broadband_Denom)
summary(model4)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## Broadband ~ Professional + Age65over + (1 + Professional + Age65over |  
##     COUNTY)
##    Data: ED_2016
## Weights: Broadband_Denom
## 
##      AIC      BIC   logLik deviance df.resid 
##  62101.1  62156.3 -31041.5  62083.1     3400 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -34.458  -3.134  -0.733   1.232  14.904 
## 
## Random effects:
##  Groups Name         Variance Std.Dev. Corr       
##  COUNTY (Intercept)   0.6222  0.7888              
##         Professional 12.0234  3.4675   -0.84      
##         Age65over    10.3866  3.2228   -0.41 -0.02
## Number of obs: 3409, groups:  COUNTY, 26
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.9862     0.1574  12.616  < 2e-16 ***
## Professional   2.0331     0.6885   2.953  0.00315 ** 
## Age65over     -3.8910     0.6405  -6.075 1.24e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Prfssn
## Professionl -0.839       
## Age65over   -0.416 -0.017

Using coef it is also possible to obtain and map the coefficient estimates fpor \(\gamma_0\), \(\gamma_1\) and \(\gamma_2\).

CTY_coef4 <- 
  coef(model4)[["COUNTY"]] %>% 
  rownames_to_column(var='COUNTY') %>% 
  rename(Gamma=`(Intercept)`) %>% 
  left_join(CTY_2016,.)
tm_shape(CTY_coef4) + tm_polygons(col=c('Gamma','Professional','Age65over'),style='cont') +
  tm_style('col_blind') + tm_legend(scale=0.6) 

Note that patterns are not so clear here. However, looking at the output from the fitting of this model, the \(\Gamma\) matrix has a high degree of negative correlation between \(\gamma_0\) and \(\gamma_1\) and between \(\gamma_0\) ands \(\gamma_2\) - so that high levels of the county level intercept term go with lower levels of the other two county level term coefficients - making geographical pattrerns harder to interpret. However this may still provide a better predictor.

Predictions from the model may be made using predict - for example

pred2 <- predict(model2,type='response')
pred3 <- predict(model3,type='response')
pred4 <- predict(model4,type='response')
ED_2016 %>% mutate(pred2=pred2,pred3=pred3,pred4=pred4) -> ED_preds
tm_shape(ED_preds) + 
  tm_fill(col=c('pred2','pred3','pred4'),style='cont') +
  tm_legend(scale=0.6)

More explicitly Spatial Models

By incorporating random county effects (the \(\gamma_j\)) geography has been taken into account in some of the models above - but apart from noting the county itself, no further spatial structure has been considered. Thus, for example in model 3 we note that there is correlation between pairs of EDs in the same county (the observations are binomial but the linear predictor depends on the same \(\gamma_j\)), but not otherwise. The independence between ED uptake rates for distinct counties in this model arises from the fact that the \(\gamma_j\)’s are independent. However, an alternative approach might be to consider \(\Gamma\) as a vector of county effects with a correlation structure reflecting their location. One way of doing this is via the mgcv package. This fits spline regressions and other related models - but is capable of fitting certain kinds of random effects models as well.

A random effects model (the same as model 3) can be fitted as below:

library(mgcv)
model3a <- gam(Broadband~Professional+Age65over + s(COUNTY,bs='re'),
              data=ED_2016, family=binomial, weight=Broadband_Denom, method='REML')
summary(model3a)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Broadband ~ Professional + Age65over + s(COUNTY, bs = "re")
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.49090    0.06631   22.48   <2e-16 ***
## Professional  3.05149    0.07932   38.47   <2e-16 ***
## Age65over    -1.82394    0.06356  -28.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(COUNTY) 24.92     25  22241  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.235   Deviance explained = 38.6%
## -REML =  33038  Scale est. = 1         n = 3409

The ‘s’ here actually stands for ‘spline’ but this isn’t a spline model. The gam function has been extended to take on random effects models as well - specifying bs='re' sets this kind of model.

In terms of prediction, both approaches seem to have very similar outputs:

library(purrr)
pred_resp <- partial(predict,type='response')
cmp_mod3 <- list(lmer=model3,mgcv=model3a) %>% map_df(pred_resp)
ggplot(cmp_mod3,aes(x=lmer,y=mgcv)) + geom_point() + coord_equal()

However this has yet to address spatial structure as described above. Setting bs='mrf' specifies this, by stating that the county effects are a Markov Random Field, specified by a contiguity list of counties. That is neighbouring counties are correlated. The Markov random field here follows a multivariate Gaussian distribution. Then here \(\Gamma\) is the vector of county effects - is \(\Gamma = (\gamma_1,\gamma_2,\cdots,\gamma_{26})^T\) having a distribution with mean \(\mathbf{0}\) and precision \(\mathbf{P}\) where \([\mathbf{p}]_{ij} = v_i\) if \(i = j\) and \(v_i\) is the number of adjacent counties to county \(i\), \([\mathbf{p}]_{ij} = -1\) if counties \(i\) and \(j\) are adjacent, and \([\mathbf{p}]_{ij} = 0\) otherwise. A further constraint that \(\sum_j \gamma_j = 0\) is applied (otherwise the distribution is degenerate).

nlist <- CTY_2016 %>% st_touches()
names(nlist) <- CTY_2016$COUNTY
model5 <- gam(Broadband~Professional+Age65over + s(COUNTY,bs='mrf',xt=list(nb=nlist)),
              data=ED_2016, family=binomial, weight=Broadband_Denom, method='REML')
summary(model5)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Broadband ~ Professional + Age65over + s(COUNTY, bs = "mrf", 
##     xt = list(nb = nlist))
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.56010    0.02040   76.49   <2e-16 ***
## Professional  3.05310    0.07930   38.50   <2e-16 ***
## Age65over    -1.82430    0.06355  -28.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(COUNTY) 24.77     25  22234  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.235   Deviance explained = 38.6%
## -REML =  33031  Scale est. = 1         n = 3409

To see the value of the county effects, one way is to use the predict function on the model. Here the prediction data is not the data used to fit the model, but a frame of 26 sets of independent predictor variables, one for each county, but with Broadband and Age65over sest to zero. This then gives the value \(\hat\beta_0 + \hat\gamma_i\) for each county - although this is an offset for the county effect, mapping this gives an indication of the geographical trend.

county_extract <- tibble(
  Professional=rep(0,26),
  Age65over=rep(0,26),
  COUNTY=levels(CTY_2016$COUNTY)) 
CTY_m5gamma <- CTY_2016 %>% mutate(gamma_i=predict(model5,newdata = county_extract))
tm_shape(CTY_m5gamma) + tm_polygons(col='gamma_i',style='cont')

The precision matrix \(\mathbf{P}\) can be expanded in terms of its eigenvectors as

\[ \mathbf{P} = \lambda_1 \mathbf{e}_1\mathbf{e}_1^T + \lambda_2 \mathbf{e}_2\mathbf{e}_2^T + \cdots + \lambda_n \mathbf{e}_n\mathbf{e}_n^T \]

where \(\{\lambda_1 \cdots \lambda_n\}\) are the eigenvalues of \(P\) indexed in order of magnitude, and \(\mathbf{e}_1 \cdots \mathbf{e}_n\) are the corresponding eigenvectors. Truncation of the expansion after \(k\) terms is equivalent to reducing the number of parameters in the model and tends to give smoother estimates of the \(\gamma_i\) values. Here we use \(k=4\):

names(nlist) <- CTY_2016$COUNTY
model6 <- gam(Broadband~Professional+Age65over + s(COUNTY,bs='mrf',k=4,xt=list(nb=nlist)),
              data=ED_2016, family=binomial, weight=Broadband_Denom, method='REML')
summary(model6)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Broadband ~ Professional + Age65over + s(COUNTY, bs = "mrf", 
##     k = 4, xt = list(nb = nlist))
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.58078    0.01902   83.11   <2e-16 ***
## Professional  3.31786    0.07490   44.29   <2e-16 ***
## Age65over    -2.21933    0.05975  -37.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(COUNTY) 2.999      3  18727  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.206   Deviance explained = 34.7%
## -REML =  34513  Scale est. = 1         n = 3409

A map is shown below. Since this is a fairly drastic truncation the result ius notably smoother.

CTY_m6gamma <- CTY_2016 %>% mutate(gamma_i=predict(model6,newdata = county_extract))
tm_shape(CTY_m6gamma) + tm_polygons(col='gamma_i',style='cont')

Finally - here an ED-level random effect is used. Again a Markov random field is used to model the ED-level effect. If there were no spatial autocorrelation the model would be over-parametrised, as the individual observatuions are at ED level themselves. However here a truncated precision matrix expansion is used. One issue is that the adjacency list for some EDs (especially for islands) is empty - the first three lines identify these EDs and add a 5.5km buffer around them - so that islands are adjacent to the land masses within this distance. Then the model is fitted with \(k=200\) (the full number of terms is around 3400).

ED_2016 %>% st_intersects() %>% map_int(~length(.x)) %>% {which(. == 1)} -> temp
bufs <- rep(0,nrow(ED_2016))
bufs[temp] <- 5500
nlist_ED <- ED_2016 %>% st_buffer(dist=bufs) %>% st_intersects() %>% imap(~setdiff(.x,.y))
names(nlist_ED) <- ED_2016$ED_ID

model7 <- gam(Broadband~Professional+Age65over + s(ED_ID,bs='mrf',xt=list(nb=nlist_ED),k=200),
              data=ED_2016, family=binomial, weight=Broadband_Denom, method='REML')
summary(model7)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Broadband ~ Professional + Age65over + s(ED_ID, bs = "mrf", xt = list(nb = nlist_ED), 
##     k = 200)
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.67779    0.02547   65.87   <2e-16 ***
## Professional  1.72981    0.09862   17.54   <2e-16 ***
## Age65over    -1.24063    0.08092  -15.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq p-value    
## s(ED_ID) 194.3    199  34414  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.365   Deviance explained = 56.9%
## -REML =  26164  Scale est. = 1         n = 3409

The ED-level outputs are also mapped.

ed_extract <- tibble(
  Professional=rep(0,nrow(ED_2016)),
  Age65over=rep(0,nrow(ED_2016)),
  ED_ID=ED_2016$ED_ID) 
ED_m7gamma <- ED_2016 %>% mutate(gamma_i=predict(model7,newdata = ed_extract))
tm_shape(ED_m7gamma) + tm_fill(col='gamma_i',style='cont') + tm_shape(CTY_2016) + tm_borders()

Which model to use

One quick and dirty guide might be to compare the AICs of these models. Here we drop model1 - and work with all of the binomial based options.

list(model2,model3,model4,model5,model6,model7) %>%
  imap_dfr(~{list(AIC=AIC(.x),model=paste("Model", .y + 1))}) %>%
  mutate(Rel_AIC=AIC-min(AIC)) %>%
  ggplot(.,aes(x=model,y=Rel_AIC)) + geom_col(fill='navyblue') 

Here we see that the ‘best’ model is the final one - and the ‘runner up’ is model 4 - which has random effects for all of the coefficients at county level - and also allowing for covariance between the effects. Note that the Markov random field models at county level did not perform as well as this one, but the MRF model at ED level performed best. This may be explained by the fact that the geography of the effects as seen in the final map do not match the county boundaries particularly well. Areas of raised or lowered geographical effects often straddle county boundaries.