library(tidyverse)
library(downloader)
library(rgdal)
library(sf)
library(ggplot2)
library(reshape2)
library(plotly)
library(highcharter)

Spatial Descriptive Statistics

What will you achieve by the end of this practical?

  1. You will learn how to make descriptive plots (histograms and boxplots) to help understand the frequency distributions of your data
  2. You will understand how you can use R to write custom functions to process your data
  3. You will produce a location quotient map to highlight interesting (above and below average) patterns in your data
  4. You will see an example of how you can write a function in R to produce a range of different maps based on user inputs
  5. You will learn how to perform a very basic cluster analysis and output the results of a basic geodemographic classification

Getting Started

Before we begin this week’s practical, we need to carry out some data preparation.

There is a problem with our London Wards data - we are missing some data relating to housing tenure. The housing tenure data in this file come from the 2011 Census. By visiting http://www.nomisweb.co.uk/ and interrogating Table KS402EW (the Tenure table), we can discover that the data for the percentage of shared owners and those living in accommodation rent free are missing.

Rather than making you go off to Nomisweb and fetch these data, because I’m really nice, I’ve posted a file on the web containing them, plus some extra categorical, ratio and geographical data that we will need to add to our existing London data file.

We can easily join this new data to our original data in R.

getwd()
## [1] "N:/GIS/wk7"
#download this LondonWards shapefile, unzip it into your working directory and read it in... https://www.dropbox.com/s/5hkkxufhy7a6y4t/NewLondonWard.zip?raw=1 

LondonWards <- readOGR("NewLondonWard.shp", layer="NewLondonWard")
## OGR data source with driver: ESRI Shapefile 
## Source: "NewLondonWard.shp", layer: "NewLondonWard"
## with 625 features
## It has 76 fields
## Integer64 fields read as strings:  x y
LondonWardsSF <- st_as_sf(LondonWards)

extradata <- read_csv("https://www.dropbox.com/s/qay9q1jwpffxcqj/LondonAdditionalDataFixed.csv?raw=1")
## Parsed with column specification:
## cols(
##   WardName = col_character(),
##   WardCode = col_character(),
##   Wardcode = col_character(),
##   PctSharedOwnership2011 = col_double(),
##   PctRentFree2011 = col_double(),
##   Candidate = col_character(),
##   InnerOuter = col_character(),
##   x = col_double(),
##   y = col_double(),
##   AvgGCSE2011 = col_double(),
##   UnauthAbsenceSchools11 = col_double()
## )
LondonWardsSF <- merge(LondonWardsSF, extradata, by.x = "WD11CD", by.y = "Wardcode")

Main Tasks

Task 1 - Descriptive Statistics

Using the lecture notes for guidance, you should generate the following graphs and descriptive statistics using standard functions and ggplot2 in R. Each element should be copied and saved to a Word document or something similar:

Generate the following from your LondonWardsSF data frame (hint: use the code in the lecture notes to help you if you are unsure how to do this):

  • A simple histogram for a scale/ratio variable of your choice
  • A simple histogram for a scale/ratio variable of your choice with a different frequency bin-width
      + The same histogram with vertical lines for the mean, median and mode (the mode will be the mid value for the bin with the largest count) and the inter-quartile range. Hint – use summary(table$variable) to find the values if you are not sure; there is a sketch for this after the faceting code below
      + The same histogram with three different kernel density smoothed frequency gradients
  • A boxplot of the same variable
  • A faceted grid of histograms for every variable in your London Wards data file. In order to do this, you will need to remove Factor (non-numeric) variables from your dataset and re-shape your data using the melt() function in the reshape2 package (hint – check the help file for melt.data.frame() to understand what the code below is doing). The code below will help you:

#check which variables are numeric first

list1 <- as.data.frame(cbind(lapply(LondonWardsSF, class)))
list1 <- cbind(list1, seq.int(nrow(list1)))
#you will notice that there are some non-numeric columns - we want to exclude these, and drop the geometry 
LondonSub <- LondonWardsSF[,c(1:73,83:86)]

#make sure the geometry is null or we will get errors - also create some subsets so that we can see our data better
LondonSub2 <- st_set_geometry(LondonWardsSF[,c(1:3,9:27)],NULL)
LondonSub3 <- st_set_geometry(LondonWardsSF[,c(1:3,28:50)],NULL)
LondonSub4 <- st_set_geometry(LondonWardsSF[,c(1:3,51:73,85:86)],NULL)

LondonMelt2 <- melt(LondonSub2, id.vars = 1:3)
attach(LondonMelt2)
hist2 <- ggplot(LondonMelt2, aes(x=value)) + geom_histogram(aes(y = ..density..)) + geom_density(colour="red", size=1, adjust=1)
hist2 + facet_wrap(~ variable, scales="free")

LondonMelt3 <- melt(LondonSub3, id.vars = 1:3)
attach(LondonMelt3)
hist3 <- ggplot(LondonMelt3, aes(x=value)) + geom_histogram(aes(y = ..density..)) + geom_density(colour="red", size=1, adjust=1)
hist3 + facet_wrap(~ variable, scales="free")

LondonMelt4 <- melt(LondonSub4, id.vars = 1:3)
attach(LondonMelt4)
hist4 <- ggplot(LondonMelt4, aes(x=value)) + geom_histogram(aes(y = ..density..)) + geom_density(colour="red", size=1, adjust=1)
hist4 + facet_wrap(~ variable, scales="free")
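
Before moving on, here is a minimal sketch for the earlier bullets asking for vertical lines at the mean, median, mode and quartiles, plus a boxplot. I have used AvgGCSE201 as the example variable (an assumption – swap in your own choice); the mode is taken as the mid-point of the tallest bin:

#a sketch using AvgGCSE201 - substitute your own variable
gcse <- LondonWardsSF$AvgGCSE201
bins <- seq(min(gcse, na.rm=TRUE), max(gcse, na.rm=TRUE), length.out=31)
#find the mid-point of the tallest bin to use as the mode
counts <- hist(gcse, breaks=bins, plot=FALSE)
modeval <- counts$mids[which.max(counts$counts)]
q <- quantile(gcse, na.rm=TRUE)

ggplot(LondonWardsSF, aes(x=AvgGCSE201)) +
  geom_histogram(breaks=bins) +
  geom_vline(xintercept=mean(gcse, na.rm=TRUE), colour="red") +
  geom_vline(xintercept=median(gcse, na.rm=TRUE), colour="blue") +
  geom_vline(xintercept=modeval, colour="green") +
  geom_vline(xintercept=c(q[["25%"]], q[["75%"]]), linetype="dashed")

#a boxplot of the same variable
ggplot(LondonWardsSF, aes(x="", y=AvgGCSE201)) + geom_boxplot()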

Make a note of which variables appear normally distributed and which appear to be skewed. What do the histograms for nominal and ordinal data look like?

Try performing a log10() transformation on the x variables and plotting a similar facet grid of histograms – what does this do to some of the skewed variables? For example, with the last subset:

hist5 <- ggplot(LondonMelt4, aes(x=log10(value))) + geom_histogram(aes(y = ..density..)) + stat_function(fun=dnorm, colour="red", size=1)
hist5 + facet_wrap(~ variable, scales="free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 90 rows containing non-finite values (stat_bin).

  • Create a 2D histogram and 2D kernel density estimate of ward centroids in London using the Eastings and Northings data in the x and y columns of your dataset. For example:
#plot the ward centroids (x.y and y.y are the easting and northing columns)
londonpoint<-ggplot(LondonSub, aes(x=x.y,y=y.y))+geom_point()+coord_equal()
londonpoint

#a 2D histogram of the centroids
londonpoint<-ggplot(LondonSub, aes(x=x.y,y=y.y))+stat_bin2d(bins=10)
londonpoint

#recreate the point plot so we can overlay a 2D kernel density estimate
londonpoint<-ggplot(LondonSub, aes(x=x.y,y=y.y))+geom_point()+coord_equal()
londonpoint

londonpoint+stat_density2d(aes(fill = ..level..), geom="polygon")

Task 2 - Introduction to functions in R

One of the great strengths of R is that it lets users define their own functions. Here we will practice writing a couple of basic functions to process some of the data we have been working with.

One of the benefits of a function is that it generalises some set of operations that can then be repeated over and again on different data.

In the lecture, it was mentioned that sometimes we should recode variables to reduce the amount of information they contain so that different tests can be carried out on the data. Here we will recode some of our scale/ratio data into nominal/weak-ordinal data to carry out some basic analysis on.

There are various online guides which will help you a little more in writing functions, but the structure of a function in R is given below:

myfunction <- function(arg1, arg2, ... ){
  statements
  return(object)
}

A function to recode data in our dataset might look like the one below:

newvar<-0
recode<-function(variable,high,medium,low){
  newvar[variable<=high]<-"High"
  newvar[variable<=medium]<-"Medium"
  newvar[variable<=low]<-"Low"
  return(newvar)
}

What’s going on in this function?

  • First we initialise a new variable called newvar and set it to 0. We then define a new function called recode. This takes in 4 pieces of information: a variable (called variable, but I could have called it anything) and three values called high, medium and low. It outputs a value to the new string variable newvar based on the values of high, medium and low that are given to the function. Note the order of the assignments: everything at or below high is first labelled “High”, then everything at or below medium is overwritten with “Medium”, and finally everything at or below low is overwritten with “Low”.

  • To create the function in R, highlight all of the code in the function and then run the whole block (Ctrl-Return in RStudio). You will see that the function is stored in the workspace.

  • We can now use this function to recode any of our continuous variables into high, medium and low values based on the values we enter into the function.

  • We are going to recode the Average GCSE Score variable into High, Medium and Low values – High will be anything above the 3rd quartile, Low anything below the 1st quartile, and Medium anything in between. Note: if your data contains the 2014 GCSE scores rather than the 2013 ones, the figures will differ from those below and you will need to refer to the column by the header you have.

attach(LondonWardsSF)
#Check the name of your column, there could be a slight error and it might be called 'AvgGCSED201'
summary(AvgGCSE201) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   245.0   332.3   343.7   345.8   358.3   409.1

Create a new column in your data frame and fill it with recoded data for the average GCSE score in 2013. To do this, pass the AvgGCSE201 variable to the recode() function, along with the three values for high, medium and low. You should create a new variable called GCSE_recode and use the function to fill it with values.

If you wanted to be really fancy, you could try altering the function to calculate the “High”, “Medium” and “Low” thresholds automatically – a sketch follows the code below.

LondonWards$GCSE_recode <- recode(AvgGCSE201,409.1,358.3,332.3)

#or

LondonWardsSF$GCSE_recode <- recode(AvgGCSE201,409.1,358.3,332.3)
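
A sketch of the “fancy” automatic version suggested above – the function name recode_auto is one I have made up; it simply reuses recode() and derives the thresholds from quantile():

#a sketch of an automatic version of recode() - the thresholds are
#derived from the quartiles of the data rather than passed in by hand
recode_auto <- function(variable){
  q <- quantile(variable, na.rm = TRUE)
  recode(variable, q[["100%"]], q[["75%"]], q[["25%"]])
}
LondonWardsSF$GCSE_recode <- recode_auto(LondonWardsSF$AvgGCSE201)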

You should also create a second re-coded variable from the unauthorised absence variable using the same function – call this variable unauth_recode and, again, use the 3rd and 1st quartiles to define your high, medium and low values.
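
For example, a minimal sketch (the column name UnauthAbse is how the variable appears in my file – check yours with colnames() first):

#check the quartiles, then plug them into recode() - or use the
#recode_auto() sketch from above
summary(LondonWardsSF$UnauthAbse)
q <- quantile(LondonWardsSF$UnauthAbse, na.rm = TRUE)
LondonWardsSF$unauth_recode <- recode(LondonWardsSF$UnauthAbse, q[["100%"]], q[["75%"]], q[["25%"]])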

Make sure these are saved to your data frame as we will use these in next week’s practical.

On to another function. This time, we will calculate some location quotients for housing tenure in London. If you remember, a location quotient is simply the ratio of a local proportion to the corresponding global proportion. In our case, our global distribution will be London as a whole.

#Location Quotient function 1
LQ1<-function(pctVariable){
  pctVariable /mean(pctVariable)
}

#Location Quotient function 2
LQ2<-function(variable,rowtotal){
  localprop<-variable/rowtotal
  globalprop<-sum(variable)/sum(rowtotal)
  return(localprop/globalprop)
}

The two functions above calculate the same location quotient. The first works on variables which have already been converted into row percentages; the second works on raw counts, where the row totals are stored in a separate column – e.g. “age 0-15”, “age 16-64” and “age 65 plus” all sum to the “Pop2013” column in our London Wards data set:

head(LondonWardsSF[,1:7])
## Simple feature collection with 6 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 543417.3 ymin: 183488.5 xmax: 551943.8 ymax: 191137.3
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.1502,0.247,0.8421,-20.4894 +units=m +no_defs
##      WD11CD WD11CDO         WD11NM WD11NMW
## 1 E05000026  00ABFX          Abbey    <NA>
## 2 E05000027  00ABFY         Alibon    <NA>
## 3 E05000028  00ABFZ      Becontree    <NA>
## 4 E05000029  00ABGA Chadwell Heath    <NA>
## 5 E05000030  00ABGB      Eastbrook    <NA>
## 6 E05000031  00ABGC       Eastbury    <NA>
##                              WardName.x WardCode.x Wardcode1
## 1          Barking and Dagenham - Abbey     00ABFX E05000026
## 2         Barking and Dagenham - Alibon     00ABFY E05000027
## 3      Barking and Dagenham - Becontree     00ABFZ E05000028
## 4 Barking and Dagenham - Chadwell Heath     00ABGA E05000029
## 5      Barking and Dagenham - Eastbrook     00ABGB E05000030
## 6       Barking and Dagenham - Eastbury     00ABGC E05000031
##                         geometry
## 1 MULTIPOLYGON (((543574.4997...
## 2 MULTIPOLYGON (((549604.1005...
## 3 MULTIPOLYGON (((547563.3995...
## 4 MULTIPOLYGON (((548880.9988...
## 5 MULTIPOLYGON (((551552.8994...
## 6 MULTIPOLYGON (((547271.2011...

Calculate location quotients for the 5 housing tenure variables (Owner Occupied, Private Rent, Social Rent, Shared Ownership, Rent Free) in your data set using either of the functions above. Save these as 5 new variables in your dataset. Hint – use the function to create the variable directly, for example:

#this is pseudo code, but you should see how this works
dataframe$newLQVariable <- LQ1(originalPercentageVariable)
#or
dataframe$newLQVariable <- LQ2(originalVariable,rowTotalVariable)
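
To make that concrete, here is a sketch using the row-percentage tenure columns as they are named in my file (check colnames() on yours – LQOwned and friends are just the names I have chosen for the new variables):

#concrete version of the pseudo code, using the row-percentage tenure
#columns already in the dataset
LondonWardsSF$LQOwned <- LQ1(LondonWardsSF$PctOwned20)
LondonWardsSF$LQSocRe <- LQ1(LondonWardsSF$PctSocialR)
LondonWardsSF$LQPriRe <- LQ1(LondonWardsSF$PctPrivate)
LondonWardsSF$LQShare <- LQ1(LondonWardsSF$PctSharedOwnership2011)
LondonWardsSF$LQRentF <- LQ1(LondonWardsSF$PctRentFree2011)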

Task 3 – Mapping Location Quotients

You should now try to create a map or series of maps of your housing tenure location quotients using ggplot() – you have two options for accomplishing this:

Easy option

Create a map by referring back to the Week 3 practical and following the steps from there (or your memory)
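
If your memory needs a jog, a minimal sketch using geom_sf() directly on the sf object might look like this (it assumes you created the LQ variables above and have a reasonably recent version of ggplot2):

#a quick map of one location quotient straight from the sf object
ggplot(LondonWardsSF) +
  geom_sf(aes(fill = LQOwned)) +
  scale_fill_gradient2(low = "orange", mid = "white", high = "blue", midpoint = 1) +
  labs(title = "LQ - Owner Occupied", fill = "LQ")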

Double hard option

If you want to blow your mind, try the function below. Warning: I wrote this myself, so the code is pretty untidy and it relies on the input data already being in the form of row percentages.

The function will take in a variable, calculate a location quotient for it, map it and then save the map as a PDF to your working directory. This is the sort of thing that you could produce for the coursework if you really got stuck into learning R…

Creating the maps using it is the easy bit, but see if you can figure out from my code what’s going on and what the function is doing at each stage! Here there are functions inside a function, so it gets a little complex.

You should, however, be able to copy the code into R, run the whole function and then test it out and produce a map like the one below:

#############################################################
##A Function for creating various location quotient maps
##
##By Adam Dennett October 2014
##
##Please note, this function requires input data to already be in
##the form of row percentages. To create the function, highlight the
##whole block of code and run it. To run the function, simply use
##LQMapper(your_dataframe)

library(rgeos)
library(ggplot2)
library(maptools)

LQMapper<-function(dataframe){
  print(colnames(dataframe))  
  vars<-readline("From the list above, select the variables 
                 you want to calculate location quotients for 
                 separated by spaces...")
  
  # split the string at the spaces  
  vars<-unlist(strsplit(vars, split = "\\s"))  
  # now save vars as a list
  vars<-as.list(vars)  
  
  shapefile<-readline("Now enter the name of your shapefile, 
                      e.g. foo.shp")
  LondonShp<-readShapePoly(shapefile)
  
  print(colnames(LondonShp@data))
  
  reg<-readline("Now, from the list above, choose the column
                header (variable) you are going to use to split 
                your shapefile regions in fortify, e.g. WD11CD.
                *note, this will be the column you will match
                your data on later")
  
  print("fortifying")
  London_geom<-fortify(LondonShp, region=reg)
  
  print("looping to create new location quotient variables...")
  attach(dataframe)  
  for(i in 1:length(vars)){
    pctVariable<-vars[[i]]
    colvect<-which(colnames(dataframe)==vars[[i]])
    
    #this is a little function to calculate location quotients
    LQ<-function(pctVariable){
      pctVariable/mean(pctVariable)
    }
    #use LQ function here to create new variable in dataframe 
    #and save it
    v<-dataframe[,colvect]
    dataframe[,paste("LQ_",pctVariable, sep="")]<-LQ(v)    
  }
  
  #reset i as we're going to use it again in a minute
  i=0
  
  matchvar<-readline("Now enter the name of the column header in 
                      your dataframe that you will match your 
                      shapefile region to, e.g. Wardcode")
  
  print("merging new data to fortified spatial dataframe")
  London_geom<-merge(London_geom,dataframe,by.x="id", by.y=matchvar)
  
  print("now entering the plotting loop")
  for(i in 1:length(vars)){
    print("I'm plotting")
    pctVariable<-paste("LQ_",vars[[i]],sep="")
    colvect<-which(colnames(dataframe)==paste("LQ_",vars[[i]],sep=""))
    
    #now make some plot layers - note the use of aes_string which 
    #allows us to pass variable names into the aesthetic mappings 
    #of ggplot
    layer1<-geom_polygon(aes_string(x="long", y="lat", fill=pctVariable, group="group"), data=London_geom)
    palette1<-scale_fill_gradient2(low="orange",mid="white", high="blue", midpoint =1)
    labels<-labs(title=pctVariable, x="Easting", y="Northing")
    layer2<-geom_path(aes(x=long, y=lat, group=group),data=London_geom, colour="#bdbdbd")
    
    #create the plot
    LQMapperPlot<-ggplot()+layer1+palette1+layer2+labels+coord_equal()
    LQMapperPlot
    #save the plot to a pdf and give it a name based on its variable
    ggsave(LQMapperPlot, filename=paste(pctVariable,".pdf",sep=""))
    
  }  
  return(dataframe)  
}

###################################################################
#run the function by calling it and putting in the data frame parameter - at several points you will be asked for some user input
LondonWardsDF <- st_set_geometry(LondonWardsSF,NULL)

LQMapper(LondonWardsDF)
## OGR data source with driver: ESRI Shapefile 
## Source: "NewLondonWard.shp", layer: "NewLondonWard"
## with 625 features
## It has 76 fields
## Integer64 fields read as strings:  x y

Task 4 – Creating a Basic Geodemographic Classification

As we saw in the lecture, geodemographic classifications are widely used to classify areas according to the characteristics of the population that inhabits them. All geodemographic classifications are created using cluster analysis algorithms. Many of these algorithms exist, but one of the most commonly used is k-means. One of the pitfalls of these algorithms is that they will always find a solution, whether the variables have been selected appropriately or standardised correctly. This means that it’s very easy to create a classification which is misleading.

All of that said, it is useful to see how straightforward it is to create a classification yourself to describe some spatial data you have.

In a cluster analysis, you should select variables that are:

  • Ranged on the same scale

  • Normally distributed

  • Not highly correlated

To make this task easier, we will just select two variables to make our classification from. In a real geodemographic classification, hundreds of variables are often used.
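
Our two variables (the percentage columns used below) are already on comparable scales, but if yours were not, a quick z-score standardisation sketch using scale() would look like this:

#z-score standardise variables before clustering (not needed here, as
#both of our chosen variables are percentages)
zdata <- as.data.frame(scale(LondonWardsDF[, c("PctOwned20", "PctNoEngli")]))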

cbind(lapply(LondonWardsDF, class))
##                        [,1]       
## WD11CD                 "factor"   
## WD11CDO                "factor"   
## WD11NM                 "factor"   
## WD11NMW                "factor"   
## WardName.x             "factor"   
## WardCode.x             "factor"   
## Wardcode1              "factor"   
## PopCensus2             "numeric"  
## Aged0_15               "numeric"  
## Aged16_64              "numeric"  
## Aged65plus             "numeric"  
## PctAged0_1             "numeric"  
## PctAged16_             "numeric"  
## PctAged65p             "numeric"  
## MeanAge201             "numeric"  
## MedianAge2             "numeric"  
## AreaSqKM               "numeric"  
## PopDensity             "numeric"  
## PctBame                "numeric"  
## PctNotBorn             "numeric"  
## PctNoEngli             "numeric"  
## GenFertRat             "numeric"  
## MaleLE0509             "numeric"  
## FemaleLE05             "numeric"  
## RateAmbula             "numeric"  
## RatesAmbul             "numeric"  
## InEmployme             "numeric"  
## Employment             "numeric"  
## NoJobs2011             "numeric"  
## EmpWkAgePo             "numeric"  
## RateNINoFo             "numeric"  
## MeanHouseP             "numeric"  
## NoProperti             "numeric"  
## NoHousehol             "numeric"  
## PctDetache             "numeric"  
## PctSemiDet             "numeric"  
## PctTerrace             "numeric"  
## PctFlatMai             "numeric"  
## PctOwned20             "numeric"  
## PctSocialR             "numeric"  
## PctPrivate             "numeric"  
## PctSharedO             "numeric"  
## PctRentFre             "numeric"  
## PctCTaxBan             "numeric"  
## PctCTaxB_1             "numeric"  
## PctCTaxB_2             "numeric"  
## MortgageRe             "numeric"  
## LandlordRe             "numeric"  
## Incapacity             "numeric"  
## IncomeSupp             "numeric"  
## JSAClaiman             "numeric"  
## JSAClaim_1             "numeric"  
## PctDepChil             "numeric"  
## PctDepCh_1             "numeric"  
## PctHHNoAdu             "numeric"  
## PctLonePar             "numeric"  
## IDRankLond             "numeric"  
## IDPctWorst             "numeric"  
## AvgGCSE201             "numeric"  
## UnauthAbse             "numeric"  
## PctWithNoQ             "numeric"  
## PctLev4Qua             "numeric"  
## CrimeRate1             "numeric"  
## ViolenceRa             "numeric"  
## RobberyRat             "numeric"  
## TheftAndHa             "numeric"  
## CriminalDa             "numeric"  
## DrugsRate1             "numeric"  
## Deliberate             "numeric"  
## PctOpenSpa             "numeric"  
## CarsPerHH2             "numeric"  
## AvgPubTran             "numeric"  
## TurnoutMay             "numeric"  
## ID                     "integer"  
## x.x                    "factor"   
## y.x                    "factor"   
## WardName.y             "character"
## WardCode.y             "character"
## PctSharedOwnership2011 "numeric"  
## PctRentFree2011        "numeric"  
## Candidate              "character"
## InnerOuter             "character"
## x.y                    "numeric"  
## y.y                    "numeric"  
## AvgGCSE2011            "numeric"  
## UnauthAbsenceSchools11 "numeric"  
## GCSE_recode            "character"
## LQOwned                "numeric"  
## LQSocRe                "numeric"  
## LQPriRe                "numeric"  
## LQShare                "numeric"  
## LQRentF                "numeric"
#clustering 
#create a new data frame just containing the two variables we are
#interested in
mydata<-as.data.frame(LondonWardsDF[,c("PctOwned20","PctNoEngli")])
attach(mydata)
## The following objects are masked from LondonWardsSF (pos = 5):
## 
##     PctNoEngli, PctOwned20
## The following objects are masked from LondonWardsSF (pos = 6):
## 
##     PctNoEngli, PctOwned20
#check the variable distributions first
histplot <- ggplot(data=mydata, aes(x=PctOwned20))
histplot+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

histplot <- ggplot(data=mydata, aes(x= PctNoEngli))
histplot+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# run a k-means to find 3 clusters – nstart=25 tries 25 random starting
# configurations and keeps the best solution
fit <- kmeans(mydata, 3, nstart=25) # 3 cluster solution
# get cluster means
centroid<-aggregate(mydata,by=list(fit$cluster),FUN=mean)
#print the results of the cluster groupings
centroid
##   Group.1 PctOwned20 PctNoEngli
## 1       1   27.49037  17.034225
## 2       2   48.92551  14.697571
## 3       3   72.36387   6.379581
# as we only have two variables (two dimensions) we can plot the clusters on a graph
p<-ggplot(mydata,aes(PctOwned20, PctNoEngli))
p+geom_point(aes(colour=factor(fit$cluster)))+geom_point(data=centroid[,2:3],aes(PctOwned20, PctNoEngli), size=7, shape=18)+ theme(legend.position="none")

mydata$cluster <- fit$cluster

#add the cluster groups to the LondonWards data frame
LondonWardsDF$cluster<-mydata$cluster
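
Note that London_geom was created inside the LQMapper() function, so it may not still exist in your global workspace. If it doesn’t, a minimal re-creation might look like this (assuming the same shapefile and WD11CD as the region column, as before):

#recreate the fortified data frame if London_geom is not in your
#workspace (it was made inside LQMapper earlier)
LondonShp <- readShapePoly("NewLondonWard.shp")
London_geom <- fortify(LondonShp, region = "WD11CD")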
#merge the cluster results into the fortified dataframe for plotting
#(we made London_geom earlier - see the note above if it is missing)
London_geom<-merge(London_geom,LondonWardsDF,by.x="id", by.y="Wardcode1")
#now map our geodemographic classification
map <- ggplot()+geom_polygon(data=London_geom, aes(x=long, y=lat, group=group, fill=cluster))+coord_equal()+scale_fill_continuous(breaks=c(1,2,3),guide=guide_legend(title="Cluster"))
map

Now of course this is just the most basic of classifications, but you can easily see how you could include more variables or different variables to create a different classification - this is perhaps something you could try.

I haven’t even gone into using different clustering algorithms, how to decide on the appropriate number of clusters, using silhouette plots to assess the strength of the clusters or creating pen-portraits using the variable z-scores for each cluster - this is practically a whole course in its own right… or indeed a dissertation topic!
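
As a small taster of the “how many clusters?” question, a common heuristic is the elbow plot: run k-means for a range of k and look for the bend in the total within-cluster sum of squares. A minimal sketch (my addition, reusing the two variables from above):

#elbow plot: total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k) kmeans(mydata[, c("PctOwned20", "PctNoEngli")], k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")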

Extension Exercise - Building a Shiny Application

If you would like to take your interactive visualisation further, then it is worth having a look at Shiny

Shiny

What is Shiny?

Shiny is an R package that lets you build interactive web-applications without needing to know HTML, CSS, JavaScript or other web-based languages.

What can you do with Shiny?

Shiny is great for visualising data and descriptive statistics interactively.
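
To give you a flavour, every Shiny app boils down to a ui object and a server function. A minimal sketch (it assumes the LondonWardsDF data frame from earlier is in your workspace):

library(shiny)

#a minimal app: pick a variable, see its histogram
ui <- fluidPage(
  selectInput("var", "Variable:", c("PctOwned20", "PctNoEngli", "AvgGCSE201")),
  plotOutput("hist")
)
server <- function(input, output){
  output$hist <- renderPlot({
    hist(LondonWardsDF[[input$var]], main = input$var, xlab = input$var)
  })
}
shinyApp(ui, server)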

Here is a Shiny Application that I built this summer:

https://adam-dennett.shinyapps.io/londonShiny/

Pretty bad-ass huh?

Now, creating this was not all that straightforward. I would like to think of myself as someone who is half-decent at R these days, but it took me a good week of playing around to get my head around the various quirks of Shiny and build this app.

For a showcase of various examples, you can have a look here: https://www.rstudio.com/products/shiny/shiny-user-showcase/

How can I learn Shiny?

Everything I learned about Shiny I learned from here: https://shiny.rstudio.com/

Shiny in your coursework

If you would like to build a shiny app for your coursework piece (I wouldn’t expect you to), then please do, but beware it may take some time to get up to speed with.

If you would like to download all of the code I wrote for the example app above, you can get it from here:

https://www.dropbox.com/s/g7cgzg7ep6pz3k4/LondonShiny.tar?raw=1

Enjoy!