1. Introduction

In this practical you will use regression, arguably the standard method for constructing predictive models. You will learn how to set up a regression, consider some of the key issues in model construction (selection) and develop local, kernel-based regression models using geographically weighted regression (GWR) (Brunsdon et al 2016; Fotheringham et al 2002).

You will find Chapters 5 and 8 of An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber useful for this practical.

The aim of this practical is to develop your competency and expertise in generating inference from data and to understand some of the limitations of classic statistical inference in the analysis of big data. This practical provides a smooth introduction to the sessions next week on Machine Learning and Outlier detection.

To do this you will use the georgia dataset in the GISTools package and the GWmodel package, which is curated and supported by the original GWR team. The code below makes sure that all of the R packages you will need are installed: it checks whether each package is in the list of installed packages and installs it if not. Package installation only needs to be done once, not every time you use R, so if you are sure you already have all of the packages installed you can skip this step.

# set a mirror
options(repos = c(CRAN = "http://cran.rstudio.com"))
# packages needed for this practical
pkgs <- c("tidyverse", "GISTools", "GWmodel", "rgdal", "tmap",
          "knitr", "car", "gclus", "kableExtra")
# test for package existence and install any that are missing
for (p in pkgs) {
    if (!p %in% rownames(installed.packages()))
        install.packages(p, dependencies = TRUE)
}
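As an aside, the same check can be made without scanning the full table of installed packages. The sketch below assumes only base R: requireNamespace() returns FALSE when a package cannot be loaded, so missing_pkgs (a name chosen here purely for illustration) lists what still needs installing. It does not install anything itself.

```r
# packages used in this practical
pkgs <- c("tidyverse", "GISTools", "GWmodel", "rgdal", "tmap",
          "knitr", "car", "gclus", "kableExtra")
# TRUE for each package whose namespace can be loaded, FALSE otherwise
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
# names of the packages that still need to be installed
missing_pkgs <- pkgs[!available]
missing_pkgs
```

If missing_pkgs is empty (character(0)), the installation step can safely be skipped.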

Once you are sure all the packages are installed, you need to load them into the current session:

# load into the R session
library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)

2. Data

You should load the georgia data, examine it and assign a selection of the variables to a data frame df:

library(GISTools)
data(georgia)
# have a look!
class(georgia)
head(georgia@data)
# assign selected variables to df
df <- as_tibble(georgia@data[, c(14,4:9)])
df$MedInc <- df$MedInc/1000

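You will see that the df object is in tibble format and contains a number of variables for the counties in Georgia from the 1990 census, including the percentage of the population in each county that is rural (PctRural), has a bachelor's degree (PctBach), is elderly (PctEld), is foreign born (PctFB), is classed as being in poverty (PctPov) or is black (PctBlack), together with the median income of each county (MedInc, here in thousands of dollars).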

Below is a quick look at the first few records of the data:

MedInc  PctRural  PctBach  PctEld  PctFB  PctPov  PctBlack
32.152      75.6      8.2   11.43   0.64    19.9     20.76
27.657     100.0      6.4   11.77   1.58    26.0     26.86
29.342      61.7      6.6   11.11   0.27    24.1     15.42
29.610     100.0      9.4   13.17   0.11    24.8     51.67
36.414      42.7     13.3    8.64   1.43    17.5     42.39
41.783     100.0      6.4   11.37   0.34    15.1      3.49

To see this in R, enter:

head(df)

Our aim in this practical is to build a model that describes the median income using the factors listed above. We will use this to understand a number of key features in regression.

3. Initial explorations

We should have a look at the data to see what kinds of properties it has. We could use previously introduced functions such as summary to summarize the attributes, and you may wish to do this now. In terms of model selection, however, what we are interested in is the degree to which different variables are good predictors of median income.
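A minimal sketch of such a summary is given below. So that it runs on its own, it rebuilds the six records shown in Section 2 as a small data frame, here called df6 (an illustrative name, not part of the practical). In the practical itself you would simply call summary(df) on the full data.

```r
# the six records printed in Section 2 (MedInc in thousands of dollars)
df6 <- data.frame(
  MedInc   = c(32.152, 27.657, 29.342, 29.610, 36.414, 41.783),
  PctRural = c(75.6, 100.0, 61.7, 100.0, 42.7, 100.0),
  PctBach  = c(8.2, 6.4, 6.6, 9.4, 13.3, 6.4),
  PctEld   = c(11.43, 11.77, 11.11, 13.17, 8.64, 11.37),
  PctFB    = c(0.64, 1.58, 0.27, 0.11, 1.43, 0.34),
  PctPov   = c(19.9, 26.0, 24.1, 24.8, 17.5, 15.1),
  PctBlack = c(20.76, 26.86, 15.42, 51.67, 42.39, 3.49)
)
# minimum, quartiles, mean and maximum of every variable
summary(df6)
# smallest and largest value of a single variable
range(df6$MedInc)
```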

In order to examine the relationships between the variables in df we can plot the continuous data against each other and use the cor function to examine collinearity amongst them:

plot(df, cex = 0.5, col = grey(0.145,alpha=0.5))
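The plot call above draws the scatterplot matrix; the cor step mentioned in the text could look like the sketch below. To keep it self-contained it uses only the six records shown in Section 2, stored in a data frame called df6 (an illustrative name), so the coefficients it produces are for demonstration only. In the practical, replace df6 with df.

```r
# the six records printed in Section 2 (MedInc in thousands of dollars)
df6 <- data.frame(
  MedInc   = c(32.152, 27.657, 29.342, 29.610, 36.414, 41.783),
  PctRural = c(75.6, 100.0, 61.7, 100.0, 42.7, 100.0),
  PctBach  = c(8.2, 6.4, 6.6, 9.4, 13.3, 6.4),
  PctEld   = c(11.43, 11.77, 11.11, 13.17, 8.64, 11.37),
  PctFB    = c(0.64, 1.58, 0.27, 0.11, 1.43, 0.34),
  PctPov   = c(19.9, 26.0, 24.1, 24.8, 17.5, 15.1),
  PctBlack = c(20.76, 26.86, 15.42, 51.67, 42.39, 3.49)
)
# pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(df6), 2)
cor_mat
# correlation of each predictor with median income, strongest first
cor_inc <- cor_mat["MedInc", colnames(cor_mat) != "MedInc"]
cor_inc[order(abs(cor_inc), decreasing = TRUE)]
```

Large absolute values in the MedInc row point to candidate predictors, while large values between two predictors are a warning sign of collinearity.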