Often we want to merge different datasets based on a shared variable of interest. In this post we build on an earlier post concerning merging and lookup tables and demonstrate how to combine two government datasets: an atlas of small-area statistics for London and a (simulated) extract of pupil attainment data. Merging the two lets us consider the student attainment data in the context of social deprivation.
The Lower Super Output Area (LSOA) atlas provides a summary of demographic and related data for each LSOA in Greater London. The average population of an LSOA in London in 2010 was 1,722 compared with 8,346 for an MSOA and 13,078 for a ward.
The profiles are designed to provide an overview of the population in these small areas by combining a range of data on the population, diversity, households, health, housing, crime, benefits, land use, deprivation, schools, and employment.
Here we demonstrate how we can use the atlas data as a lookup table in R. If we had data from the National Pupil Database (NPD) and were interested in the effect of deprivation on student attainment we could use this atlas to “lookup” these deprivation statistics for each student by their LSOA.
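Before turning to the real data, a toy sketch of the lookup idea: in its simplest form, a lookup table in R can be just a named vector (the two codes and IMD scores below are borrowed from the Todd data shown later in this post; the vector itself is made up for illustration).

deprivation_lookup <- c(E01000001 = 6.16, E01000002 = 5.59) # LSOA code -> IMD score
deprivation_lookup["E01000002"] # look up a single code by name

## E01000002
##      5.59

Merging two dataframes generalises this idea: every row of one table is "looked up" in the other by the shared key.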
We saved the second worksheet in this workbook as a csv to facilitate reading it into R. We also deleted rows 1 and 3 as they were redundant, leaving a header row that provides the variable names. Notice how read.csv() converts all the spaces in the variable names to dots, making them awkward to work with (lots of ...!). Column 1 is the LSOA code, named Codes; column 2 is called Names (a poor choice of name).
#SETUP
rm(list = ls()) # clear workspace
setwd("C://Users//mammykins//Google Drive//R//lsoa") # set wd
#getwd() # check wd has been changed, make sure file is here

#PACKAGES
if (!require("dplyr")) {
  install.packages("dplyr")
  require("dplyr")
} # if dplyr is not installed, install it, then load it

#INPUT
lsoa_atlas <- "lsoa_sheet2.csv"
lsoa_atlas <- read.csv(lsoa_atlas, sep = ",",
                       header = TRUE, stringsAsFactors = FALSE)
lsoa_atlas <- tbl_df(lsoa_atlas) # observed data set
This gives us the LSOA codes in London, ready for matching against another dataframe. Note that we read the file in with the argument stringsAsFactors = FALSE, so the codes are stored as character strings rather than factors.
glimpse(lsoa_atlas) # Have a butcher's
## Observations: 4,836
## Variables: 71
## $ Codes (chr) ...
## $ Names (chr) ...
## $ No.adults.in.employment.in.household..With.dependent.children (int) ...
## $ X..of.households.with.no.adults.in.employment..With.dependent.children (int) ...
## $ All.lone.parent.housholds.with.dependent.children (int) ...
## $ Lone.parents.not.in.employment (int) ...
## $ Lone.parent.not.in.employment.. (chr) ...
## $ Economically.active..Total (int) ...
## $ Economically.inactive..Total (int) ...
## $ Economically.active..Employee (int) ...
## $ Economically.active..Self.employed (int) ...
## $ Economically.active..Unemployed (int) ...
## $ Economically.active..Full.time.student (int) ...
## $ Employment.Rate (dbl) ...
## $ Unemployment.Rate (dbl) ...
## $ No.qualifications (int) ...
## $ Highest.level.of.qualification..Level.1.qualifications (int) ...
## $ Highest.level.of.qualification..Level.2.qualifications (int) ...
## $ Highest.level.of.qualification..Apprenticeship (int) ...
## $ Highest.level.of.qualification..Level.3.qualifications (int) ...
## $ Highest.level.of.qualification..Level.4.qualifications.and.above (int) ...
## $ Highest.level.of.qualification..Other.qualifications (int) ...
## $ Schoolchildren.and.full.time.students..Age.18.and.over (int) ...
## $ X..No.qualifications (dbl) ...
## $ X..Highest.level.of.qualification..Level.1.qualifications (dbl) ...
## $ X..Highest.level.of.qualification..Level.2.qualifications (dbl) ...
## $ X..Highest.level.of.qualification..Apprenticeship (dbl) ...
## $ X..Highest.level.of.qualification..Level.3.qualifications (dbl) ...
## $ X..Highest.level.of.qualification..Level.4.qualifications.and.above (dbl) ...
## $ X..Highest.level.of.qualification..Other.qualifications (dbl) ...
## $ X..Schoolchildren.and.full.time.students..Age.18.and.over (dbl) ...
## $ Day.to.day.activities.limited.a.lot (int) ...
## $ Day.to.day.activities.limited.a.little (int) ...
## $ Day.to.day.activities.not.limited (int) ...
## $ Very.good.or.Good.health (int) ...
## $ Fair.health (int) ...
## $ Bad.or.Very.Bad.health (int) ...
## $ Day.to.day.activities.limited.a.lot.... (dbl) ...
## $ Day.to.day.activities.limited.a.little.... (dbl) ...
## $ Day.to.day.activities.not.limited.... (dbl) ...
## $ Very.good.or.Good.health.... (dbl) ...
## $ Fair.health.... (dbl) ...
## $ Bad.or.Very.Bad.health.... (dbl) ...
## $ No.cars.or.vans.in.household (int) ...
## $ X1.car.or.van.in.household (int) ...
## $ X2.cars.or.vans.in.household (int) ...
## $ X3.cars.or.vans.in.household (int) ...
## $ X4.or.more.cars.or.vans.in.household (int) ...
## $ Sum.of.all.cars.or.vans.in.the.area (int) ...
## $ No.cars.or.vans.in.household.... (dbl) ...
## $ X1.car.or.van.in.household.... (dbl) ...
## $ X2.cars.or.vans.in.household.... (dbl) ...
## $ X3.cars.or.vans.in.household.... (dbl) ...
## $ X4.or.more.cars.or.vans.in.household.... (dbl) ...
## $ Cars.per.household (dbl) ...
## $ Total.Number.of.Children (int) ...
## $ Total.Number.of.Families.Claiming.Benefit (int) ...
## $ Number.of.families.with.3..children (chr) ...
## $ X..of.families.with.3..children (chr) ...
## $ X2010 (int) ...
## $ X (int) ...
## $ X.1 (int) ...
## $ X.2 (int) ...
## $ X2011 (int) ...
## $ X.3 (int) ...
## $ X.4 (int) ...
## $ X.5 (int) ...
## $ X2012 (int) ...
## $ X.6 (int) ...
## $ X.7 (int) ...
## $ X.8 (int) ...
length(unique(lsoa_atlas$Codes)) # the number of unique codes matches the number of rows, as expected
## [1] 4836
Note the messy variable names; I would tidy these up if doing a lot of work with this dataset, to make it easier on the eyes. A sketch of one way to do so follows.
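The tidy_names() helper below is hypothetical, not part of the original analysis; it simply lower-cases the names and replaces the runs of dots that read.csv() created.

tidy_names <- function(x) {
  x <- tolower(x)            # lower-case everything
  x <- gsub("\\.+", "_", x)  # collapse runs of dots into single underscores
  gsub("_+$", "", x)         # drop any trailing underscores
}
#names(lsoa_atlas) <- tidy_names(names(lsoa_atlas)) # uncomment to apply

Before using this as a lookup table we generate some fake NPD data for 30 students and name the object npd.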
set.seed(1337) # make the runif() and rnorm() draws reproducible
n <- 30
npd <- data.frame(
  studentid = 1:n,
  school = c("Park view", "Grange Hill", "Sweet valley"), # recycled across the 30 students
  # each fictional school has students attending from different areas;
  # for coding simplicity, LSOA suffixed 0000-0999 were omitted
  superoutputarea = paste("E0100",
                          round(runif(n = n, min = 1000, max = 4332), digits = 0),
                          sep = ""),
  attainment = rnorm(n = n),
  stringsAsFactors = FALSE # we can modify the class of variables later if required
)
glimpse(npd)
## Observations: 30
## Variables: 4
## $ studentid (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
## $ school (chr) "Park view", "Grange Hill", "Sweet valley", "P...
## $ superoutputarea (chr) "E01002920", "E01002882", "E01001247", "E01002...
## $ attainment (dbl) 1.27178094, 0.64167341, 0.80076125, 1.86265923...
If we look at the data, we notice that the variable superoutputarea is a nine-character code. We are interested in how the area relates to the socio-economic classification of the typical people who live in that area, or to a measure of the area's deprivation. We can convert the code into this more informative proxy, which can then be used in our machine learning tools later, by merging our npd data with our lsoa_atlas data. This will result in 4 + 71 - 1 = 74 columns (minus 1 because Codes appears in both dataframes). Note that merge() keeps only matching rows by default (an inner join), so any student whose simulated code does not appear in the London atlas is dropped; one such student exists here, leaving 29 of the 30 rows.
Note how the variable has a different name in each dataframe, so let's correct that using rename(). Once our super output area code variable has the same name in both dataframes we can merge them using the base merge() function.
npd <- rename(npd, Codes = superoutputarea) # rename superoutputarea to Codes to match lsoa_atlas name for variable
names(npd)
## [1] "studentid" "school" "Codes" "attainment"
npd_atlas_merge <- merge(npd, lsoa_atlas, by = "Codes") # merge by the lookup variable "Codes"
npd_depriv <- select(npd_atlas_merge, studentid, Codes, school, attainment,
                     Employment.Rate)
npd_depriv
## studentid Codes school attainment Employment.Rate
## 1 16 E01001084 Park view 1.26264948 64.5
## 2 23 E01001155 Grange Hill 0.52415274 49.9
## 3 3 E01001247 Sweet valley 0.80076125 61.1
## 4 10 E01001487 Park view -1.63515201 52.1
## 5 24 E01001512 Sweet valley -1.07960965 68.5
## 6 14 E01001646 Grange Hill -1.07460214 61.2
## 7 20 E01001822 Grange Hill -0.71972217 46.7
## 8 8 E01001937 Grange Hill -0.79098417 71.4
## 9 6 E01002104 Sweet valley 0.61927781 59.0
## 10 19 E01002130 Park view 3.44607885 65.7
## 11 5 E01002244 Grange Hill -0.54535603 68.8
## 12 4 E01002512 Park view 1.86265923 67.1
## 13 29 E01002565 Grange Hill 0.65497038 49.9
## 14 25 E01002875 Park view -0.05511102 66.1
## 15 2 E01002882 Grange Hill 0.64167341 69.2
## 16 1 E01002920 Park view 1.27178094 68.8
## 17 22 E01003413 Park view -0.52977397 65.2
## 18 13 E01003757 Park view 0.75250489 57.5
## 19 21 E01003829 Sweet valley -0.32685410 62.0
## 20 28 E01003994 Park view 0.06503223 57.0
## 21 18 E01004078 Sweet valley -1.12355513 70.3
## 22 27 E01004105 Sweet valley -1.77390933 69.8
## 23 7 E01004158 Park view 0.12264043 72.5
## 24 17 E01004240 Grange Hill 1.09994930 47.4
## 25 11 E01004263 Grange Hill 1.78456732 50.1
## 26 15 E01004270 Sweet valley 0.06427374 54.7
## 27 26 E01004270 Grange Hill 2.19868418 54.7
## 28 12 E01004311 Sweet valley 0.29887827 53.3
## 29 30 E01004315 Sweet valley -0.71286911 61.8
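As anticipated, only 29 of the 30 students survived the merge. A quick sketch of how to confirm this and identify the casualty: anti_join() returns the rows of npd with no matching Codes in lsoa_atlas, here studentid 9, whose code E01001818 does not appear in the London atlas sheet.

dim(npd_atlas_merge) # 29 rows, 74 columns
anti_join(npd, lsoa_atlas, by = "Codes") # the one student without a match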
The variable Employment.Rate can then be used as a proxy for the deprivation a student might experience based on where they live (a caveat: this is simulated data and the analysis is for demonstration purposes only). Note that this simulated data is unrealistic in that students attending the same school come from randomly selected LSOAs across London (in reality they would likely come from adjacent, or the same, LSOAs). Given that attainment was simulated using rnorm(), it is unsurprising to see a near-zero correlation. Furthermore, you would be unlikely to model this data with a linear regression; the important point, however, is that the data have been merged and can now be subjected to exploration or modelling activities.
plot(npd_depriv$Employment.Rate, npd_depriv$attainment,
     xlab = "Employment rate", ylab = "Attainment",
     pch = 19)
z <- lm(attainment ~ Employment.Rate, data = npd_depriv)
abline(z, col = "red") # an example of bad model fitting
cor(npd_depriv$Employment.Rate, npd_depriv$attainment) # if we try every variable we might get a strong correlation on one! Naughty.
## [1] -0.1651502
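To illustrate the "naughty" temptation flagged in the comment above, here is a sketch of what dredging every numeric variable for a correlation with attainment would look like (the num_cols and cors objects are ours, for demonstration only; attainment itself trivially tops the list with a correlation of 1).

num_cols <- sapply(npd_atlas_merge, is.numeric) # flag the numeric columns
cors <- sapply(npd_atlas_merge[, num_cols],
               function(x) cor(x, npd_atlas_merge$attainment))
head(sort(abs(cors), decreasing = TRUE)) # the "best" correlations; spurious by construction, and columns with zero variance give NA (dropped by sort())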
We can expand on this by considering the Index of Multiple Deprivation (IMD) score (2010) for each LSOA. A dataset taken from the Todd et al. (2015) paper provides the means to "lookup" and merge our npd data with this todd data, giving us a variable that describes the deprivation decile of the LSOA in which each student lives. This data includes a translation for the whole country, but our simulated data uses LSOAs from London only.
npd <- rename(npd, LSOA = Codes) # rename to match variable names
#INPUT TODD data
todd <- "gp_lsoa_todd_data.csv"
todd <- read.csv(todd, sep = ",",
                 header = TRUE, stringsAsFactors = FALSE)
todd <- tbl_df(todd)
glimpse(todd)
## Observations: 32,482
## Variables: 13
## $ LSOA (chr) "E01000001", "E01000002", "E01...
## $ LSOAName (chr) "City of London 001A", "City o...
## $ totalPopulation (int) 1563, 1464, 1538, 1538, 1075, ...
## $ PharmacyPopulationIn20minBuffer (int) 1563, 1464, 1538, 1538, 1075, ...
## $ PharmacyAccessPcent (dbl) 100, 100, 100, 100, 100, 100, ...
## $ IMD2010 (dbl) 6.16, 5.59, 13.29, 11.17, 21.3...
## $ IMDpercentile (int) 2, 1, 4, 4, 7, 5, 9, 10, 8, 9,...
## $ UrbanRural2 (chr) "Urban", "Urban", "Urban", "Ur...
## $ LA (chr) "E09000001", "E09000001", "E09...
## $ Region..Name. (chr) "London", "London", "London", ...
## $ Region..Code. (int) 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ GP.population.in.20min.buffer (int) 1563, 1464, 1538, 1538, 1075, ...
## $ GPAccessPcent (dbl) 100, 100, 100, 100, 100, 100, ...
Now, again, we merge, but this time we use the dplyr function inner_join() to do the work, joining the variables of interest, IMD2010 and IMDpercentile, to npd. The dplyr joins are typically much faster than base merge()!
#npd_todd <- merge(npd, todd, by = "LSOA") %>% # the base-R equivalent
#  select(studentid, LSOA, school, attainment,
#         IMD2010, IMDpercentile)
npd_todd <- inner_join(x = npd, y = todd, by = "LSOA") %>%
  select(studentid, LSOA, school, attainment,
         IMD2010, IMDpercentile)
npd_todd
## studentid LSOA school attainment IMD2010 IMDpercentile
## 1 1 E01002920 Park view 1.27178094 4.54 1
## 2 2 E01002882 Grange Hill 0.64167341 21.86 7
## 3 3 E01001247 Sweet valley 0.80076125 24.62 7
## 4 4 E01002512 Park view 1.86265923 7.94 2
## 5 5 E01002244 Grange Hill -0.54535603 16.78 5
## 6 6 E01002104 Sweet valley 0.61927781 36.79 9
## 7 7 E01004158 Park view 0.12264043 10.97 3
## 8 8 E01001937 Grange Hill -0.79098417 18.07 6
## 9 9 E01001818 Sweet valley -0.49977167 48.86 10
## 10 10 E01001487 Park view -1.63515201 35.43 9
## 11 11 E01004263 Grange Hill 1.78456732 41.05 9
## 12 12 E01004311 Sweet valley 0.29887827 54.11 10
## 13 13 E01003757 Park view 0.75250489 24.61 7
## 14 14 E01001646 Grange Hill -1.07460214 24.57 7
## 15 15 E01004270 Sweet valley 0.06427374 33.76 8
## 16 16 E01001084 Park view 1.26264948 14.32 5
## 17 17 E01004240 Grange Hill 1.09994930 44.52 9
## 18 18 E01004078 Sweet valley -1.12355513 16.59 5
## 19 19 E01002130 Park view 3.44607885 26.03 7
## 20 20 E01001822 Grange Hill -0.71972217 46.18 10
## 21 21 E01003829 Sweet valley -0.32685410 25.18 7
## 22 22 E01003413 Park view -0.52977397 18.49 6
## 23 23 E01001155 Grange Hill 0.52415274 45.25 10
## 24 24 E01001512 Sweet valley -1.07960965 21.30 6
## 25 25 E01002875 Park view -0.05511102 16.17 5
## 26 26 E01004270 Grange Hill 2.19868418 33.76 8
## 27 27 E01004105 Sweet valley -1.77390933 10.10 3
## 28 28 E01003994 Park view 0.06503223 39.85 9
## 29 29 E01002565 Grange Hill 0.65497038 14.70 5
## 30 30 E01004315 Sweet valley -0.71286911 50.44 10
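This time all 30 students matched, as the todd lookup covers the whole country. As an aside (a sketch, not part of the original analysis), if you wanted to keep every student even where an LSOA had no match, left_join() retains unmatched rows and fills the lookup columns with NA:

npd_all <- left_join(x = npd, y = todd, by = "LSOA") # unmatched LSOAs would get NA in the todd columns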
Again, we can explore the data for any associations (remember, this is simulated data, so please don't draw any inferences). The regression is for demonstration purposes only.
plot(npd_todd$IMD2010, npd_todd$attainment,
     xlab = "IMD2010 measure of deprivation", ylab = "Attainment",
     pch = 19)
z <- lm(attainment ~ IMD2010, data = npd_todd)
abline(z, col = "red") # people like a line
cor(npd_todd$IMD2010, npd_todd$attainment)
## [1] -0.01322955
We have demonstrated how to make the most of some publicly available databases, specifically those employing similar methods for splitting the population into homogeneous groupings called Lower Super Output Areas, each summarising the demographic and related data of roughly one to two thousand people. These small areas may provide insight when building predictive models that relate student attainment to the deprivation of the LSOA in which a student lives. We successfully merged a variety of datasets to show the utility of this approach and of the merge() and inner_join() functions.
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 digest_0.6.9 assertthat_0.1 R6_2.1.1
## [5] DBI_0.3.1 formatR_1.2.1 magrittr_1.5 evaluate_0.8
## [9] stringi_1.0-1 lazyeval_0.1.10 rmarkdown_0.9.2 tools_3.2.3
## [13] stringr_1.0.0 yaml_2.1.13 parallel_3.2.3 htmltools_0.3
## [17] knitr_1.12