Note – you will need to use your data from last week in today’s practical

In this practical session, you will start to learn some of the powerful statistical techniques we can employ to ask questions of our data.

I have termed the techniques here ‘spatial’ inferential statistics. We will be using spatial data, but certainly the first set of statistical techniques we will be using (everything before the GWR section) can be used on any data - spatial or non-spatial. However, as you will see - the spatial dimension is an important aspect to consider in modelling and if ignored, may well lead to errors in the results of your model and the interpretations you make…

Task 1 - A non-parametric test

In last week’s practical, you should have created two new re-coded variables from the average GCSE Score and unauthorised absence from school variables in your LondonWards dataset.

Your two new variables should be called “GCSE_recode” and “unauth_recode” and should be at the end of your dataframe.

If you’ve not created those variables, you can use the function below to create them. Highlight and run the whole function and then use it to create your new variables.

library(rgdal)
library(tidyverse)
library(sf)

LondonWards <- readOGR("NewLondonWard.shp", layer="NewLondonWard")
## OGR data source with driver: ESRI Shapefile 
## Source: "NewLondonWard.shp", layer: "NewLondonWard"
## with 625 features
## It has 76 fields
## Integer64 fields read as strings:  x y
LondonWardsSF <- st_as_sf(LondonWards)
extradata <- read_csv("https://www.dropbox.com/s/qay9q1jwpffxcqj/LondonAdditionalDataFixed.csv?raw=1")
LondonWardsSF <- merge(LondonWardsSF, extradata, by.x = "WD11CD", by.y = "Wardcode")
recode <- function(variable, high, medium, low){
  #create an empty character vector the same length as the input
  newvar <- vector(mode = "character", length = length(variable))
  #note the order of assignment matters: each line overwrites the
  #values at or below the next threshold down
  newvar[variable <= high] <- "High"
  newvar[variable <= medium] <- "Medium"
  newvar[variable <= low] <- "Low"
  return(newvar)
}
summary(LondonWardsSF$AvgGCSE201)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   245.0   332.3   343.7   345.8   358.3   409.1
summary(LondonWardsSF$UnauthAbse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2463  0.8215  1.1364  1.1286  1.4105  2.4675
LondonWardsSF$GCSE_recode <- recode(LondonWardsSF$AvgGCSE201, 409.1, 358.3, 332.3)
LondonWardsSF$unauth_recode <- recode(LondonWardsSF$UnauthAbse, 2.4675, 1.4105, 0.8215)

As you heard in the lecture, in answering a question using a statistical (or any other) technique, we are actually testing a research hypothesis. When testing a hypothesis, our default starting position should always be that there is no relationship, and we then use evidence to try to prove otherwise.

We might have a working research hypothesis that missing school is likely to have a negative effect on the grades you get (for obvious reasons). Our starting point, then, is that there is no relationship between missing school and the grades that people get. This is known as our null hypothesis.

\(H_0\) = “There is no relationship between low average GCSE scores in London wards and high levels of unauthorised school absence.”

Can we reject this null hypothesis by running a chi-squared test?
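
The chi-squared test works by comparing the observed counts in each cell of a cross-tabulation with the counts we would expect if the two variables were independent:

\[
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

where \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in row \(i\) and column \(j\). For example, the High/High cell below contributes \((5 - 39.00)^2 / 39.00 \approx 29.6\) to the total - the bigger the mismatch between observed and expected counts, the bigger the statistic and the less plausible the null hypothesis.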

##what does a cross tabulation of the data look like?

chisq<-chisq.test(LondonWardsSF$GCSE_recode,LondonWardsSF$unauth_recode)
#observed counts
chisq$observed
##                          LondonWardsSF$unauth_recode
## LondonWardsSF$GCSE_recode High Low Medium
##                    High      5 101     51
##                    Low      76   6     74
##                    Medium   74  49    188
#expected counts
chisq$expected
##                          LondonWardsSF$unauth_recode
## LondonWardsSF$GCSE_recode    High   Low   Medium
##                    High   38.9984 39.25  78.7516
##                    Low    38.7500 39.00  78.2500
##                    Medium 77.2516 77.75 155.9984
#chi squared statistic
chisq$statistic
## X-squared 
##  217.8617
#p-value
chisq$p.value
## [1] 5.40811e-46

It would appear that our p-value is very close to 0 - certainly much less than the 0.05 (95% confidence) or 0.01 (99% confidence) thresholds we would normally use. This tells us that if there really were no relationship between school absence and the average grades achieved by people living in Wards across London, the chance of observing an association as strong as this in our data would be far less than 1% - so we can reject the null hypothesis.

Task 2: A Parametric Test - a linear regression model

Our chi-squared test can tell us that there is a strong likelihood that there is an association between school absence and school grades, but it can’t tell us how strong the relationship is or, for example, what an average decline in grades would be for a day off school.
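
As an aside (an addition to this practical rather than something from the lecture), if you did want a single measure of the strength of association from a chi-squared test, Cramér's V is one common option - a quick sketch using the chisq object from above:

#Cramér's V scales the chi-squared statistic to lie between 0 (no
#association) and 1 (perfect association)
n <- sum(chisq$observed)
k <- min(dim(chisq$observed))
sqrt(chisq$statistic / (n * (k - 1)))

Here it comes out at roughly 0.42 - a moderate association.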

As our data are scale/interval data, we are going to use a standard linear regression model to analyse the relationship between our dependent (GCSE Score) and independent (Unauthorised Absence) variables.

Fitting a regression model is a little like following a recipe - there are several steps that you would normally need to follow in order to ensure the results that come out of the model are reliable. We will now go through these in turn.

In our regression model we are trying to predict the values of a dependent variable (in our case, GCSE scores) using the values of other independent variables.

1. Examine the frequency distributions in your data

We need to check that our variables are close to being normally distributed. This is because we are going to try and fit a straight line through these variables, and skewed variables will make it difficult to do this.

library(ggplot2)

#create a quick lookup of each variable's class and column index - handy
#for selecting columns by number later on
varlist <- data.frame(cbind(lapply(LondonWardsSF, class)))
varlist$id <- seq(1, nrow(varlist))

qplot(AvgGCSE201, data = LondonWardsSF, geom = "histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(UnauthAbse, data = LondonWardsSF, geom = "histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#OK, the variables look normally distributed. Would we expect there to be a relationship?

qplot(UnauthAbse, AvgGCSE201, data = LondonWardsSF, geom = "point") + stat_smooth(method="lm", se=FALSE, size=1)

OK, so looking at the histograms, our dependent and independent variables are both reasonably normally distributed - which is good. And when we plot the dependent (always the y axis variable) against the independent (always the x axis) variable, we can visualise a straight-line relationship, so fitting the regression model is a sensible next step.
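
Formally, the model we are about to fit is just the equation of a straight line with an error term added:

\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]

where \(y_i\) is the average GCSE score in ward \(i\), \(x_i\) is the unauthorised absence rate, \(\beta_0\) is the intercept, \(\beta_1\) is the slope and \(\epsilon_i\) is the residual error.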

2. Fit a linear model to your data

#It looks like there is a negative relationship, so can we discover exactly what this relationship is using a linear regression model? (we actually fitted one above to create the blue line)
library(broom)

#to fit the linear regression model, use the lm() function
model1 <- lm(AvgGCSE201 ~ UnauthAbse, data = LondonWardsSF)
#write the results out into a dataframe
model1_res <- tidy(model1)

#examine the results
summary(model1)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse, data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.876  -10.038    0.609    9.579   56.695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  379.558      1.830  207.45   <2e-16 ***
## UnauthAbse   -29.874      1.527  -19.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.39 on 623 degrees of freedom
## Multiple R-squared:  0.3807, Adjusted R-squared:  0.3797 
## F-statistic:   383 on 1 and 623 DF,  p-value: < 2.2e-16
#examine some of the diagnostic plots to see if there is any patterning in the residuals
plot(model1)

3. Interpreting the results of the model summary

  • The coefficient (estimate) in the summary() table shows that (on average) for a 1 unit (day) increase in unauthorised absence from school, the average GCSE point score falls by 29.874 points.

  • The p-values for the intercept and the coefficient are highly statistically significant (<0.001) so we can rely on the relationship that is being observed.

  • The adjusted R-squared statistic is 0.38, which tells us that 38% of the variation in GCSE scores across Wards in London can be explained by variation in unauthorised absence from school (which is quite a lot for a single variable).

  • Interrogating the last graph in plot(model1), which is a scatter plot of fitted values (the model estimates achieved by plugging the values and coefficients back into the regression equation - see the sketch below) against standardised residuals, we can see no apparent patterns in the cloud of points, which suggests the model has not violated any important assumptions.
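
To make the idea of plugging values back into the regression equation concrete, here is a quick sketch using the predict() function (the absence values of 0.5 and 1.5 days are purely illustrative):

#predicted average GCSE scores for two illustrative levels of unauthorised
#absence; by hand: 379.558 - 29.874*0.5 = 364.6 and 379.558 - 29.874*1.5 = 334.7
predict(model1, newdata = data.frame(UnauthAbse = c(0.5, 1.5)))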

For more detail about interpreting the model outputs here, try this resource:

http://blog.yhat.com/posts/r-lm-summary.html

In addition, the two papers by Dunn (1989) and Jones (1984) on Moodle are old, but very good!

Testing for Spatial Patterns

Right, so our model looks OK - on the face of it. However, we should also check that there is no spatial clustering in the residuals.

We should plot the residuals on a map to see if there is any obvious spatial clustering.

library(tmap)

#save the residuals into your dataframe
LondonWardsSF$model1_resids <- model1$residuals
#now plot the residuals
tmap_mode("view")
## tmap mode set to interactive viewing
qtm(LondonWardsSF, fill = "model1_resids")

OK, so no obvious problems visually. Can we confirm this statistically just to make sure?

If you remember back to a couple of lectures ago, we can test for spatial patterns (spatial autocorrelation) using the Moran’s I statistic. Let’s try that…
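
As a reminder, Moran's I compares each value with the values at neighbouring locations:

\[
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}
\]

where \(w_{ij}\) is the spatial weight between observations \(i\) and \(j\) (defined by the weights matrix we will build below) and \(n\) is the number of observations. Values near 1 indicate clustering of similar values, values near 0 indicate no spatial patterning, and negative values indicate dispersion.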

library(spdep)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
library(sp)
library(sf)
#####
LondonWards <- as(LondonWardsSF,"Spatial")
#First calculate the centroids of all Wards in London
coordsW <- coordinates(LondonWards)
plot(coordsW)

#Now we need to generate a spatial weights matrix (remember from the lecture). We'll start with a simple binary matrix of queen's case neighbours
#create a neighbours list
LWard_nb <- poly2nb(LondonWards, queen=T)
#plot them
plot(LWard_nb, coordsW, col="red")
#add a map underneath
plot(LondonWards, add=T)

#create a spatial weights object from these weights
Lward.lw <- nb2listw(LWard_nb, style="C")
#now run a moran's I test on the residuals
moran.test(LondonWards@data$model1_resids, Lward.lw)
## 
##  Moran I test under randomisation
## 
## data:  LondonWards@data$model1_resids  
## weights: Lward.lw  
## 
## Moran I statistic standard deviate = 12.969, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3036561498     -0.0016025641      0.0005540091

OK, so here we have a statistically significant but relatively weak indication that there is some spatial clustering of residual values. A value of 0.30 (where 1 indicates perfect positive spatial autocorrelation, 0 none at all, and -1 perfect dispersion) shows that there is some evidence that high residual values cluster near other high values and low residual values cluster near other low values.

So what does this mean?

If you recall, the residual values are the points in the scatter plot that do not fall along the blue regression line…

qplot(UnauthAbse, AvgGCSE201, data = LondonWardsSF, geom = "point") + stat_smooth(method="lm", se=FALSE, size=1)

These are the wards in London where either high unauthorised absence rates do not necessarily lead to lower GCSE scores, or low unauthorised absence rates do not necessarily lead to higher GCSE scores. If these places cluster in space, then there might be some unobserved underlying factor causing this. This is important as it means that the assumption of independent observations that regression models rely upon might be violated.

Now in our case here, there is not a clear violation of spatial independence, but it is certainly hinted at.

4. Investigating further - Dummy Variables

What if instead of fitting one line to our cloud of points, we could fit several, depending on whether the Wards we were analysing fell into one group or another? What if the relationship between attending school and achieving good exam results varied between Inner and Outer London, for example? Could we test for that? Well yes we can - quite easily in fact.

If we colour the points representing Wards for Inner and Outer London differently, we can start to see that there might be something interesting going on. There seems to be a stronger relationship between absence and GCSE scores in Outer London than Inner London. We can test for this in a standard linear regression model.

p <- ggplot(LondonWardsSF, aes(x=UnauthAbse, y=AvgGCSE201))
p + geom_point(aes(colour = InnerOuter))

Dummy variables are always categorical data (inner or outer London, or red / blue etc.). When we incorporate them into a regression model, they serve the purpose of splitting our analysis into groups. In the graph above, it would mean, effectively, having a separate regression line for the red points and a separate line for the blue points.
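
In equation form, adding the dummy simply shifts the intercept for one of the groups:

\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \epsilon_i
\]

where \(D_i = 1\) for wards in one group (e.g. Outer London) and \(D_i = 0\) for the reference group, so the reference group has intercept \(\beta_0\) and the other group has intercept \(\beta_0 + \beta_2\).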

Let’s try it!

#first, let's make sure R is reading our InnerOuter variable as a factor
LondonWardsSF$InnerOuter <- as.factor(LondonWardsSF$InnerOuter)

#now run the model
model1_dummy <- lm(AvgGCSE201 ~ UnauthAbse + InnerOuter, data = LondonWardsSF)

summary(model1_dummy)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse + InnerOuter, data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.256  -10.043   -0.084    9.117   58.966 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      374.630      2.239 167.328  < 2e-16 ***
## UnauthAbse       -28.156      1.579 -17.830  < 2e-16 ***
## InnerOuterOuter    4.888      1.306   3.742 0.000199 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 622 degrees of freedom
## Multiple R-squared:  0.3943, Adjusted R-squared:  0.3924 
## F-statistic: 202.5 on 2 and 622 DF,  p-value: < 2.2e-16

So how can we interpret this?

Well, the dummy variable is statistically significant and the coefficient tells us the difference between the two groups (Inner and Outer London) we are comparing. In this case, it is telling us that living in a Ward in Outer London will improve your average GCSE score by 4.89 points, on average, compared with living in Inner London. The R-squared has increased slightly, but not by much.

You will notice that despite there being two values in our dummy variable (Inner and Outer), we only get one coefficient. This is because with dummy variables, one value is always treated as the control (comparison/reference) group. In this case we are comparing Outer London to Inner London. If our dummy variable had more than 2 levels we would have more coefficients (one fewer than the number of levels), but always one level acting as the reference.

The order in which the dummy comparisons are made is determined by what is known as a ‘contrast matrix’. This determines the treatment group (1) and the control (reference) group (0). We can view the contrast matrix using the contrasts() function:

contrasts(LondonWardsSF$InnerOuter)
##       Outer
## Inner     0
## Outer     1

If we want to change the reference group, there are various ways of doing this. We can use the contrasts() function, or we can use the relevel() function. Let’s try it:

LondonWardsSF$InnerOuter <- relevel(LondonWardsSF$InnerOuter, ref="Outer")

model1_dummy <- lm(AvgGCSE201 ~ UnauthAbse + InnerOuter, data = LondonWardsSF)
summary(model1_dummy)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse + InnerOuter, data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.256  -10.043   -0.084    9.117   58.966 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      379.519      1.811 209.581  < 2e-16 ***
## UnauthAbse       -28.156      1.579 -17.830  < 2e-16 ***
## InnerOuterInner   -4.888      1.306  -3.742 0.000199 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 622 degrees of freedom
## Multiple R-squared:  0.3943, Adjusted R-squared:  0.3924 
## F-statistic: 202.5 on 2 and 622 DF,  p-value: < 2.2e-16

You will notice that the only thing that has changed in the model is that the coefficient for the InnerOuter variable now relates to Inner London and is now negative (meaning that living in Inner London is likely to reduce your average GCSE score by 4.89 points compared to Outer London). The rest of the model is exactly the same.

5. Investigating Further - Adding More Explanatory Variables

Up until this point, the model we have fitted using lm() has been exactly the same as the scatter plot we drew at the beginning - the relationships shown in the statistical outputs of the model are the ones we can observe in the scatter plot. One of the great things about regression models is that they allow us to go beyond the simple 2-dimensional scatter plot and extend our analysis into multiple dimensions. Bam!
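
In equation terms, extending into multiple dimensions just means adding more terms to the model:

\[
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i
\]

with one coefficient estimated for each of the \(k\) independent variables.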

So let’s have a go…

Hmmm, what other things might affect someone’s performance at school? If we look through our suite of other variables (here - https://rpubs.com/adam_dennett/228756) we can see that there are various candidates. Socio-economic deprivation might be a good one - employment is a proxy for this (employed people tend to be better off than unemployed people), so let’s try adding employment in. In our dataset we have a variable labelled ‘Employment’ which is the employment rate (as a %) for 16-64 year olds.

  • First we need to check that our variable is normally distributed:
p <- ggplot(LondonWardsSF, aes(x=Employment))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Yep looks OK. The employment rate for 16-64 year olds looks relatively normally distributed.

  • Now we should check that it’s not highly correlated with our other independent variables… (highly correlated independent variables - multicollinearity - will make our coefficient estimates and their standard errors unreliable)
library(corrplot)
## corrplot 0.84 loaded
#to check for correlations, we can create a correlation matrix and then visualise it using the corrplot package
#first, convert LondonWardsSF to a data frame
LondonWardsDF <- st_set_geometry(LondonWardsSF,NULL)
cormat <- cor(LondonWardsDF[,8:72], use="complete.obs", method="pearson")
corrplot(cormat)

##ok quite big, let's go a bit smaller and select just a few variables...
cormat <- cor(LondonWardsDF[,c(28,60,61,71)], use="complete.obs", method="pearson")
corrplot(cormat)

Right, so it doesn’t appear that unauthorised absence from school is too highly correlated with the employment rate (a moderate negative correlation - and what counts as ‘high’ correlation, I hear you ask? Good question - we’ll go for 0.5 and above for now), so let’s put it into the model and see what it does to the analysis…
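
A complementary check (an addition to this practical rather than part of it) is the variance inflation factor, available via the vif() function in the car package - a minimal sketch, assuming car is installed:

library(car)
#VIF measures how much the variance of each coefficient is inflated by
#correlation with the other predictors; values above roughly 5-10 are a
#common rule-of-thumb warning sign (for factors with more than two levels,
#vif() reports generalised VIFs instead)
vif(lm(AvgGCSE201 ~ UnauthAbse + Employment + InnerOuter, data = LondonWardsSF))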

  • Run the model again with a new variable added in.
model2_dummy <- lm(AvgGCSE201 ~ UnauthAbse + Employment + InnerOuter, data = LondonWardsSF)

summary(model2_dummy)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse + Employment + InnerOuter, 
##     data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.874   -9.457   -0.309    9.013   58.627 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     323.8259     8.3038  38.997  < 2e-16 ***
## UnauthAbse      -23.0673     1.6947 -13.612  < 2e-16 ***
## Employment        0.7678     0.1119   6.860 1.66e-11 ***
## InnerOuterInner  -5.6892     1.2658  -4.495 8.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.7 on 621 degrees of freedom
## Multiple R-squared:  0.437,  Adjusted R-squared:  0.4343 
## F-statistic: 160.7 on 3 and 621 DF,  p-value: < 2.2e-16

So, what do the model outputs tell us?

  • The first thing to note is that the new variable (employment) is significant (p-value is <0.001 or ***) and positive, which means that as the employment rate in a Ward goes up, so does the average GCSE score.

  • For a 1 percentage point increase in the employment rate in a Ward, we can expect a 0.77 point increase in the average GCSE score.

  • The fit of the model, represented by the R-squared statistic, has improved - around 43% of the variation in GCSE scores is now explained by the independent variables.

  • Including employment in the model has increased the magnitude of the coefficient for the inner/outer London dummy variable (from -4.89 to -5.69).

  • t-values are the coefficient estimates divided by their standard errors and give a sense of the relative importance of each independent variable - especially useful when variables are measured on different scales (the raw coefficients relate to the units of measurement). From these we can see clearly that unauthorised absence is the most important variable, and that employment is more important than the inner/outer London dummy variable. A worked example follows below.
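
For example, using the model output above, the t-value for unauthorised absence is just its estimate divided by its standard error:

\[
t = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{-23.0673}{1.6947} \approx -13.61
\]

which matches the value reported in the summary table.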

6. Building the optimum model

As you might have guessed, we can keep going and add more and more variables (as long as they are not highly correlated with each other) to try and explain as much of the variation in our dependent variable as possible.

Task

You should try and build the optimum model of GCSE performance from your data in your LondonWards dataset. Experiment with adding different variables - when building a regression model in this way, you are trying to hit a sweet spot between increasing your R-squared value as much as possible, but with as few explanatory variables as possible.

A few things to watch out for…
  • You should never just throw variables at a model without a good theoretical reason for why they might have an influence. Choose your variables carefully!

  • Be prepared to take variables out of your model either if a new variable confounds (becomes more important than) earlier variables or turns out not to be significant.

For example, let’s try adding the rate of drugs-related crime (logged, as it is a positively skewed variable whereas its log is approximately normal) and the number of cars per household…

model3_dummy <- lm(AvgGCSE201 ~ UnauthAbse + Employment + log(DrugsRate1) + CarsPerHH2 + InnerOuter, data = LondonWardsSF)

summary(model3_dummy)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse + Employment + log(DrugsRate1) + 
##     CarsPerHH2 + InnerOuter, data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.022   -9.825    0.056    9.104   54.770 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     308.4173    10.3263  29.867  < 2e-16 ***
## UnauthAbse      -19.6005     1.8352 -10.680  < 2e-16 ***
## Employment        0.7077     0.1150   6.152 1.37e-09 ***
## log(DrugsRate1)   0.2804     1.0077   0.278    0.781    
## CarsPerHH2       14.8086     3.7292   3.971 8.00e-05 ***
## InnerOuterInner   0.7253     1.9047   0.381    0.703    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.48 on 619 degrees of freedom
## Multiple R-squared:  0.4553, Adjusted R-squared:  0.4509 
## F-statistic: 103.5 on 5 and 619 DF,  p-value: < 2.2e-16
Interpretation

OK, so now things are getting interesting. Our R-squared value has improved, but we can see that:

  • Our drugs-related crime rate variable is insignificant. Perhaps unsurprising when we think very carefully about it - drugs-related crime is relatively small-scale (compared to other crimes) and therefore unlikely to affect very many school children.

  • Including a cars per household variable completely confounds the effects of the Inner/Outer London dummy variable. This does make sense, as Inner/Outer London was always just a proxy for affluence, which is captured far more effectively by the cars per household variable.
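
One way to formalise the ‘sweet spot’ between model fit and parsimony mentioned in the task above is an information criterion such as AIC, which rewards goodness-of-fit but penalises extra parameters (lower values are better). A quick sketch comparing the models fitted so far:

#AIC trades fit off against the number of parameters - the model with
#the lowest value is preferred
AIC(model1, model1_dummy, model2_dummy, model3_dummy)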

Task - continued

Keep experimenting with new explanatory variables until you are happy that you have built your optimum model.

When you have finished:

  • write your residual values out to your LondonWardsSF dataframe
  • plot your residuals on a map to check visually for spatial autocorrelation
  • run a Moran’s I test to confirm the presence or otherwise of spatial autocorrelation.

Task 3 - Geographically Weighted Regression Models (GWR)

Here’s my final model from the last section:

model_final <- lm(AvgGCSE201 ~ UnauthAbse + Employment + CarsPerHH2, data = LondonWardsSF)
summary(model_final)
## 
## Call:
## lm(formula = AvgGCSE201 ~ UnauthAbse + Employment + CarsPerHH2, 
##     data = LondonWardsSF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.128   -9.699    0.130    9.199   55.253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 310.5354     8.5374  36.373  < 2e-16 ***
## UnauthAbse  -19.7024     1.8160 -10.849  < 2e-16 ***
## Employment    0.7055     0.1097   6.434 2.48e-10 ***
## CarsPerHH2   13.4755     2.0906   6.446 2.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 621 degrees of freedom
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4525 
## F-statistic: 172.9 on 3 and 621 DF,  p-value: < 2.2e-16
LondonWardsSF$model_final_res <- model_final$residuals

plot(model_final)

qtm(LondonWardsSF, fill = "model_final_res")
LondonWards <- as(LondonWardsSF,"Spatial")
moran.test(LondonWards@data$model_final_res, Lward.lw)

Now, we probably could stop at this point, but running a Moran’s I test on these final-model residuals suggests that there may still be a little spatial autocorrelation present, therefore it will be worth seeing if we can learn even more about the factors affecting school performance in London using some geographically weighted models.
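
Where a standard regression fits one global coefficient for each variable, GWR re-fits the model at every location, weighting nearby observations more heavily, so that each coefficient becomes a function of position:

\[
y_i = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \epsilon_i
\]

where \((u_i, v_i)\) are the coordinates of ward \(i\).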

This part of the practical will only skirt the edges of GWR; for much more detail you should visit the GWR website, which is produced and maintained by Prof Chris Brunsdon and Dr Martin Charlton, who originally developed the technique - http://gwr.nuim.ie/

There are various packages which will carry out GWR in R; for this practical we will use spgwr (mainly because it was the first one I came across), although you could also use GWmodel or gwrr. At the end of this practical, you can test out these ideas in ArcGIS using the GWR toolset in the Spatial Statistics Toolbox - http://resources.arcgis.com/en/help/main/10.1/index.html#/Geographically_Weighted_Regression_GWR/005p00000021000000/

I should also acknowledge the guide on GWR (accessible here: http://www.bris.ac.uk/cmpo/events/2009/segregation/gwr.pdf) produced by the University of Bristol, which was a great help when producing this exercise.

library(spgwr)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge')`
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
#calculate kernel bandwidth
GWRbandwidth <- gwr.sel(AvgGCSE201 ~ UnauthAbse + Employment + CarsPerHH2, data = LondonWards, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 128009 
## Adaptive q: 0.618034 CV score: 129579.3 
## Adaptive q: 0.236068 CV score: 125818 
## Adaptive q: 0.145898 CV score: 122617.5 
## Adaptive q: 0.09016994 CV score: 119379.5 
## Adaptive q: 0.05572809 CV score: 117056.1 
## Adaptive q: 0.03444185 CV score: 115477 
## Adaptive q: 0.02128624 CV score: 114278.4 
## Adaptive q: 0.01315562 CV score: 113264.9 
## Adaptive q: 0.008130619 CV score: 113662.9 
## Adaptive q: 0.01319871 CV score: 113267.8 
## Adaptive q: 0.01201289 CV score: 113176.3 
## Adaptive q: 0.01053 CV score: 113129.9 
## Adaptive q: 0.01038398 CV score: 113113.6 
## Adaptive q: 0.009523271 CV score: 113079.5 
## Adaptive q: 0.00967515 CV score: 113066.9 
## Adaptive q: 0.009839582 CV score: 113069.7 
## Adaptive q: 0.009730461 CV score: 113067 
## Adaptive q: 0.009634459 CV score: 113067.3 
## Adaptive q: 0.00967515 CV score: 113066.9
#run the gwr model
gwr.model = gwr(AvgGCSE201 ~ UnauthAbse + Employment + CarsPerHH2, data = LondonWards, adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
#print the results of the model
gwr.model
## Call:
## gwr(formula = AvgGCSE201 ~ UnauthAbse + Employment + CarsPerHH2, 
##     data = LondonWards, adapt = GWRbandwidth, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.00967515 (about 6 of 625 data points)
## Summary of GWR coefficient estimates at data points:
##                    Min.    1st Qu.     Median    3rd Qu.       Max.
## X.Intercept. 116.023352 285.271262 320.394568 349.527999 516.089672
## UnauthAbse   -54.411470 -26.629102 -18.067046  -9.256077  27.148358
## Employment    -2.012669   0.036767   0.473397   0.979962   3.261127
## CarsPerHH2   -82.068169   5.554686  17.509463  32.406615  74.444542
##                Global
## X.Intercept. 310.5354
## UnauthAbse   -19.7024
## Employment     0.7055
## CarsPerHH2    13.4755
## Number of data points: 625 
## Effective number of parameters (residual: 2traceS - traceS'S): 193.2947 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 431.7053 
## Sigma (residual: 2traceS - traceS'S): 12.52927 
## Effective number of parameters (model: traceS): 138.6919 
## Effective degrees of freedom (model: traceS): 486.3081 
## Sigma (model: traceS): 11.80494 
## Sigma (ML): 10.41309 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 5063.047 
## AIC (GWR p. 96, eq. 4.22): 4841.194 
## Residual sum of squares: 67770.21 
## Quasi-global R2: 0.7155156

The output from the GWR model reveals how the coefficients vary across the 625 Wards in our London study region. You will see that the global coefficients are exactly the same as the coefficients in the earlier lm model. In this particular model (yours will look a little different if you have used different explanatory variables), if we take unauthorised absence from school, we can see that the coefficients range from a minimum value of -54.41 (a 1 unit change in unauthorised absence resulting in a drop in average GCSE score of 54.41 points) to +27.14 (a 1 unit change in unauthorised absence resulting in an increase in average GCSE score of 27.14 points). For half of the wards in the dataset, as unauthorised absence rises by 1 unit, GCSE scores will decrease by between 9.26 and 26.63 points (the inter-quartile range between the 1st and 3rd quartiles).

You will notice that the R-squared value has improved - this is not uncommon for GWR models, but it doesn’t necessarily mean they are definitely better than global models: GW models have been criticised precisely because each local regression is estimated from the small number of cases falling under the kernel, which can make the local estimates unstable.

Coefficient ranges can also be seen for the other variables and they suggest some interesting spatial patterning. To explore this we can plot the GWR coefficients for different variables. Firstly, we can attach the coefficients to our original dataframe - this is straightforward, as the coefficients for each ward appear in the same order in our spatial points dataframe as they do in the original dataframe.

results<-as.data.frame(gwr.model$SDF)
head(results)
##      sum.w X.Intercept. UnauthAbse   Employment CarsPerHH2 X.Intercept._se
## 1 13.34440     328.4640  -16.35222  0.400799290  11.652347        38.62478
## 2 10.57502     345.4072  -12.15976 -0.076741673  10.187048        76.04057
## 3 12.83661     365.8682  -24.06273 -0.074336089   8.648726        52.43361
## 4 13.81728     429.5445  -29.87014 -1.008009186   7.198977        52.91981
## 5 10.99497     350.3324  -18.06705 -0.001153272   6.525103        82.84685
## 6 12.64518     341.9212  -15.62106  0.149011498   9.327368        40.94221
##   UnauthAbse_se Employment_se CarsPerHH2_se     gwr.e     pred  pred.se
## 1      9.195995     0.7768819      16.11905  1.285872 334.0734 4.534617
## 2     15.474366     1.0927318      21.82247  1.633796 331.4778 3.608857
## 3     12.164446     0.8896000      19.07899 -5.745640 332.7595 3.762452
## 4     11.342668     0.7790738      15.42237  8.732071 331.0502 3.950802
## 5     15.242469     1.1117277      20.03778 -8.590038 338.1035 2.629712
## 6      9.896348     0.7985666      18.04344 12.964637 336.0480 2.473224
##     localR2 X.Intercept._se_EDF UnauthAbse_se_EDF Employment_se_EDF
## 1 0.5251514            40.99473          9.760247         0.8245502
## 2 0.4371310            80.70631         16.423849         1.1597801
## 3 0.5588944            55.65086         12.910837         0.9441845
## 4 0.6363133            56.16689         12.038636         0.8268766
## 5 0.6451704            87.93020         16.177723         1.1799416
## 6 0.4511655            43.45436         10.503573         0.8475654
##   CarsPerHH2_se_EDF pred.se.1
## 1          17.10809  4.812854
## 2          23.16146  3.830290
## 3          20.24965  3.993311
## 4          16.36866  4.193217
## 5          21.26727  2.791067
## 6          19.15056  2.624977
#attach coefficients to original dataframe
LondonWards@data$coefUnauthAbse<-results$UnauthAbse
LondonWards@data$coefEmployment<-results$Employment
LondonWards@data$coefCarsPerHH2<-results$CarsPerHH2
tm_shape(LondonWards) +
  tm_polygons(col = "coefUnauthAbse", palette = "RdBu")
tm_shape(LondonWards) +
  tm_polygons(col = "coefEmployment", palette = "RdBu")
tm_shape(LondonWards) +
  tm_polygons(col = "coefCarsPerHH2", palette = "RdBu")

Taking the first plot, which is for the unauthorised absence coefficients, we can see that for the majority of boroughs in London there is the negative relationship we would expect - i.e. as unauthorised absence goes up, exam scores go down. For three boroughs (Westminster, Kensington & Chelsea and Hammersmith & Fulham - the richest in London), however, the relationship is positive: as unauthorised school absence increases, so does the average GCSE score. This is a very interesting and counterintuitive pattern, but it may partly be explained by the multiple homes owned by many people living in these boroughs (with students living in different parts of the country, and indeed the world, for significant periods of the year) and by the over-representation of private schooling among those living in these areas. If unauthorised absence from school here reflects the absences of poorer students attending local, inner-city schools, then the high GCSE grades may instead reflect the achievements of those who are sent away to expensive fee-paying schools elsewhere in the country and who return to their parental homes later in the year. Either way, these factors could explain these results.

Of course, these results may not be statistically significant across the whole of London. Roughly speaking, if a coefficient estimate is more than 2 standard errors away from zero, then it is “statistically significant”.
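
In other words, a local coefficient is flagged as significant when

\[
|\hat{\beta}| - 2\,SE(\hat{\beta}) > 0
\]

and this is exactly what the code below computes.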

To run this significance test for a variable, you can use a formula similar to this:

sigTest = abs(gwr.model$SDF$UnauthAbse) - 2 * gwr.model$SDF$UnauthAbse_se

LondonWards$GWRUnauthSig<-sigTest

#tmaptools::palette_explorer()

If this is greater than zero (i.e. the estimate is more than two standard errors away from zero), it is very unlikely that the true value is zero, i.e. it is statistically significant (at nearly the 95% confidence level).

You should now calculate these for each variable in your GWR model and see if you can plot them on a map, for example:

tm_shape(LondonWards) +
  tm_polygons(col = "GWRUnauthSig", palette = "RdYlBu")

From the results of your GWR exercise, what are you able to conclude about the geographical variation in your explanatory variables when predicting your dependent variable?

ArcGIS

As mentioned at the beginning of this exercise, GWR can also be carried out using the Spatial Statistics Toolbox in ArcGIS. Using the guide in the ArcGIS help system below, try and repeat your GWR analysis using the tools available in Arc - http://resources.arcgis.com/en/help/main/10.1/index.html#/Geographically_Weighted_Regression_GWR/005p00000021000000/