Merging data from two different sources

Often we desire to merge different datasets based on a shared variable of interest. In this post we build on an earlier post concerning merging and lookup tables and demonstrate how to combine two government datasets:

student attainment data with their home address given by Lower Super Output Area (rather than the more sensitive postcode).
Lower Super Output Area summary data, including measures of deprivation.

Thus we can merge the data and consider the student attainment data in the context of social deprivation.

Lower Super Output Area

The Lower Super Output Area (LSOA) atlas provides a summary of demographic and related data for each LSOA in Greater London. The average population of an LSOA in London in 2010 was 1,722 compared with 8,346 for an MSOA and 13,078 for a ward.

The profiles are designed to provide an overview of the population in these small areas by combining a range of data on the population, diversity, households, health, housing, crime, benefits, land use, deprivation, schools, and employment.

Objectives

Succesfully merge the London LSOA atlas with our own simulated National Pupil Database (NPD) data.
Use this merged dataset to demonstrate how one could investigate relationships between student attainment and social measures of deprivation in simulated data.

The atlas data

Here we demonstrate how we can use the atlas data as a lookup table in R. If we had data from the National Pupil Database (NPD) and were interested in the effect of deprivation on student attainment we could use this atlas to “lookup” these deprivation statistics for each student by their LSOA.

We saved the second worksheet in this workbook as a csv to facilitate reading it into R. We also deleted rows 1 and 3 as they were redundant. This left a header row providing us with variable names. Notice how all the spaces in the variable names make them difficult to work with (lots of …!). Column 1 is the LSOA, named Codes, column 2 was called Names(a bad name).

#SETUP
rm(list = ls()) #clear workspace

setwd("C://Users//mammykins//Google Drive//R//lsoa") #set wd
#getwd() #check wd has been changed, make sure file is here

#PACKAGES
if (!require("dplyr")) {
  install.packages("dplyr")
  require("dplyr")
}  # if dplyr is not installed, install it then load it

#INPUT
lsoa_atlas <- "lsoa_sheet2.csv" 
lsoa_atlas <- read.csv(lsoa_atlas, sep = ",",
                     header = TRUE, stringsAsFactors = FALSE)
lsoa_atlas <- tbl_df(lsoa_atlas)  # observed data set

This gives us the LSOA codes in London, ready to assist in matching with another dataframe. Note we read it in with the argument stringsAsFactors = FALSE, thus we read them in as characters.

glimpse(lsoa_atlas)  #  Have a butcher's

## Observations: 4,836
## Variables: 71
## $ Codes                                                                  (chr) ...
## $ Names                                                                  (chr) ...
## $ No.adults.in.employment.in.household..With.dependent.children          (int) ...
## $ X..of.households.with.no.adults.in.employment..With.dependent.children (int) ...
## $ All.lone.parent.housholds.with.dependent.children                      (int) ...
## $ Lone.parents.not.in.employment                                         (int) ...
## $ Lone.parent.not.in.employment..                                        (chr) ...
## $ Economically.active..Total                                             (int) ...
## $ Economically.inactive..Total                                           (int) ...
## $ Economically.active..Employee                                          (int) ...
## $ Economically.active..Self.employed                                     (int) ...
## $ Economically.active..Unemployed                                        (int) ...
## $ Economically.active..Full.time.student                                 (int) ...
## $ Employment.Rate                                                        (dbl) ...
## $ Unemployment.Rate                                                      (dbl) ...
## $ No.qualifications                                                      (int) ...
## $ Highest.level.of.qualification..Level.1.qualifications                 (int) ...
## $ Highest.level.of.qualification..Level.2.qualifications                 (int) ...
## $ Highest.level.of.qualification..Apprenticeship                         (int) ...
## $ Highest.level.of.qualification..Level.3.qualifications                 (int) ...
## $ Highest.level.of.qualification..Level.4.qualifications.and.above       (int) ...
## $ Highest.level.of.qualification..Other.qualifications                   (int) ...
## $ Schoolchildren.and.full.time.students..Age.18.and.over                 (int) ...
## $ X..No.qualifications                                                   (dbl) ...
## $ X..Highest.level.of.qualification..Level.1.qualifications              (dbl) ...
## $ X..Highest.level.of.qualification..Level.2.qualifications              (dbl) ...
## $ X..Highest.level.of.qualification..Apprenticeship                      (dbl) ...
## $ X..Highest.level.of.qualification..Level.3.qualifications              (dbl) ...
## $ X..Highest.level.of.qualification..Level.4.qualifications.and.above    (dbl) ...
## $ X..Highest.level.of.qualification..Other.qualifications                (dbl) ...
## $ X..Schoolchildren.and.full.time.students..Age.18.and.over              (dbl) ...
## $ Day.to.day.activities.limited.a.lot                                    (int) ...
## $ Day.to.day.activities.limited.a.little                                 (int) ...
## $ Day.to.day.activities.not.limited                                      (int) ...
## $ Very.good.or.Good.health                                               (int) ...
## $ Fair.health                                                            (int) ...
## $ Bad.or.Very.Bad.health                                                 (int) ...
## $ Day.to.day.activities.limited.a.lot....                                (dbl) ...
## $ Day.to.day.activities.limited.a.little....                             (dbl) ...
## $ Day.to.day.activities.not.limited....                                  (dbl) ...
## $ Very.good.or.Good.health....                                           (dbl) ...
## $ Fair.health....                                                        (dbl) ...
## $ Bad.or.Very.Bad.health....                                             (dbl) ...
## $ No.cars.or.vans.in.household                                           (int) ...
## $ X1.car.or.van.in.household                                             (int) ...
## $ X2.cars.or.vans.in.household                                           (int) ...
## $ X3.cars.or.vans.in.household                                           (int) ...
## $ X4.or.more.cars.or.vans.in.household                                   (int) ...
## $ Sum.of.all.cars.or.vans.in.the.area                                    (int) ...
## $ No.cars.or.vans.in.household....                                       (dbl) ...
## $ X1.car.or.van.in.household....                                         (dbl) ...
## $ X2.cars.or.vans.in.household....                                       (dbl) ...
## $ X3.cars.or.vans.in.household....                                       (dbl) ...
## $ X4.or.more.cars.or.vans.in.household....                               (dbl) ...
## $ Cars.per.household                                                     (dbl) ...
## $ Total.Number.of.Children                                               (int) ...
## $ Total.Number.of.Families.Claiming.Benefit                              (int) ...
## $ Number.of.families.with.3..children                                    (chr) ...
## $ X..of.families.with.3..children                                        (chr) ...
## $ X2010                                                                  (int) ...
## $ X                                                                      (int) ...
## $ X.1                                                                    (int) ...
## $ X.2                                                                    (int) ...
## $ X2011                                                                  (int) ...
## $ X.3                                                                    (int) ...
## $ X.4                                                                    (int) ...
## $ X.5                                                                    (int) ...
## $ X2012                                                                  (int) ...
## $ X.6                                                                    (int) ...
## $ X.7                                                                    (int) ...
## $ X.8                                                                    (int) ...

sum(table(lsoa_atlas$Codes))  #  The number of unique codes match the number of rows as expected

## [1] 4836

Note the messy variable names, I would tidy these up if doing a lot of work with this dataset to make it easier on the eyes. Before using this as a lookup table we generate some fake NPD data for 30 students and name the object npd.

set.seed(1337)  # we use rnorm
n <- 30
npd <- data.frame(
  studentid = 1:n,
  school = c("Park view", "Grange Hill", "Sweet valley"),
  superoutputarea = paste("E0100",
                          round(runif(n = n, min = 1000, max = 4332), digits = 0),
                          sep = ""),  # Each fictional school has students attending from different areas, for coding simplicity, LSOA suffixed 0000-0999 were omitted.
  attainment = (rnorm(n = n)),
  stringsAsFactors = FALSE  #  we can modify the class of variables later if required
)
glimpse(npd)

## Observations: 30
## Variables: 4
## $ studentid       (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
## $ school          (chr) "Park view", "Grange Hill", "Sweet valley", "P...
## $ superoutputarea (chr) "E01002920", "E01002882", "E01001247", "E01002...
## $ attainment      (dbl) 1.27178094, 0.64167341, 0.80076125, 1.86265923...

If we look at the data, we notice the variable superoutputarea is a nine digit code. We are interested in how the area relates to the socio-economic classification of typical people who live in that area or a measure of deprivation of the area. We must convert this into the more informative proxy which can then be used in our machine learning tools later. This can be achieved by merging our npd data with our LSOA lsoa_atlas data. This will result in 30 rows with 4 + 71 - 1 columns (-1 as Codes is in both).

Note how the variable has a different name in each dataframe, thus let’s correct that using rename(). Once our superoutput code ID variables have the same name in both dataframes then we can merge them using the base merge() function.

npd <- rename(npd, Codes = superoutputarea)  #  rename superoutputarea to Codes to match lsoa_atlas name for variable
names(npd)

## [1] "studentid"  "school"     "Codes"      "attainment"

npd_atlas_merge <- merge(npd, lsoa_atlas, by = "Codes")  #  merge by the lookup variable "Codes"
npd_depriv <- select(npd_atlas_merge, studentid, Codes, school, attainment,
                     Employment.Rate)
npd_depriv

##    studentid     Codes       school  attainment Employment.Rate
## 1         16 E01001084    Park view  1.26264948            64.5
## 2         23 E01001155  Grange Hill  0.52415274            49.9
## 3          3 E01001247 Sweet valley  0.80076125            61.1
## 4         10 E01001487    Park view -1.63515201            52.1
## 5         24 E01001512 Sweet valley -1.07960965            68.5
## 6         14 E01001646  Grange Hill -1.07460214            61.2
## 7         20 E01001822  Grange Hill -0.71972217            46.7
## 8          8 E01001937  Grange Hill -0.79098417            71.4
## 9          6 E01002104 Sweet valley  0.61927781            59.0
## 10        19 E01002130    Park view  3.44607885            65.7
## 11         5 E01002244  Grange Hill -0.54535603            68.8
## 12         4 E01002512    Park view  1.86265923            67.1
## 13        29 E01002565  Grange Hill  0.65497038            49.9
## 14        25 E01002875    Park view -0.05511102            66.1
## 15         2 E01002882  Grange Hill  0.64167341            69.2
## 16         1 E01002920    Park view  1.27178094            68.8
## 17        22 E01003413    Park view -0.52977397            65.2
## 18        13 E01003757    Park view  0.75250489            57.5
## 19        21 E01003829 Sweet valley -0.32685410            62.0
## 20        28 E01003994    Park view  0.06503223            57.0
## 21        18 E01004078 Sweet valley -1.12355513            70.3
## 22        27 E01004105 Sweet valley -1.77390933            69.8
## 23         7 E01004158    Park view  0.12264043            72.5
## 24        17 E01004240  Grange Hill  1.09994930            47.4
## 25        11 E01004263  Grange Hill  1.78456732            50.1
## 26        15 E01004270 Sweet valley  0.06427374            54.7
## 27        26 E01004270  Grange Hill  2.19868418            54.7
## 28        12 E01004311 Sweet valley  0.29887827            53.3
## 29        30 E01004315 Sweet valley -0.71286911            61.8

The variable Employment.Rate can then be used as a proxy for deprivation that a student might experience based on where they live (A caveat, this is simulated data and the analysis is for demonstration purposes only). Note this simulated data is unrealistic in that students that attend the same school all come from randomly selected LSOA within London (they would likely come from adjacent or the same LSOA). Given that our attainment was simulated using rnorm() it is unsurprising to see a near zero correlation. Furthermore you would be unlikely to model this data with a linear regression, however the important point is that the data have been merged and can now be subjected to exploration or modelling activities.

plot(npd_depriv$Employment.Rate, npd_depriv$attainment,
     xlab = "Employment rate", ylab = "Attainment",
     pch = 19)

z <- lm(attainment ~ Employment.Rate, data = npd_depriv)
abline(z, col = "red")  #  an example of bad model fitting

cor(npd_depriv$Employment.Rate, npd_depriv$attainment)  #  if we try every variable we might get a strong correlation on one! Naughty.

## [1] -0.1651502

Other measures of deprivation

We can expand on this by considering the Index of Multiple Deprivation (IMD) score (2010) for each LSOA. A dataset taken from the Todd et al. (2015) paper provides us with the means to “lookup” and merge our npd data with this todd data providing us with an variable describing which decile of deprivaton corresponds to the relevant student’s LSOA they live. This data includes a translation for the whole country but our simulated data uses LSOA just from London.

npd <- rename(npd,  LSOA = Codes)  #  rename to match variable names

#INPUT TODD data
todd <- "gp_lsoa_todd_data.csv" 
todd <- read.csv(todd, sep = ",",
                     header = TRUE, stringsAsFactors = FALSE)
todd <- tbl_df(todd)
glimpse(todd)

## Observations: 32,482
## Variables: 13
## $ LSOA                            (chr) "E01000001", "E01000002", "E01...
## $ LSOAName                        (chr) "City of London 001A", "City o...
## $ totalPopulation                 (int) 1563, 1464, 1538, 1538, 1075, ...
## $ PharmacyPopulationIn20minBuffer (int) 1563, 1464, 1538, 1538, 1075, ...
## $ PharmacyAccessPcent             (dbl) 100, 100, 100, 100, 100, 100, ...
## $ IMD2010                         (dbl) 6.16, 5.59, 13.29, 11.17, 21.3...
## $ IMDpercentile                   (int) 2, 1, 4, 4, 7, 5, 9, 10, 8, 9,...
## $ UrbanRural2                     (chr) "Urban", "Urban", "Urban", "Ur...
## $ LA                              (chr) "E09000001", "E09000001", "E09...
## $ Region..Name.                   (chr) "London", "London", "London", ...
## $ Region..Code.                   (int) 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ GP.population.in.20min.buffer   (int) 1563, 1464, 1538, 1538, 1075, ...
## $ GPAccessPcent                   (dbl) 100, 100, 100, 100, 100, 100, ...

Now again, we merge, but this time we use the dplyr function join() to do the work, joining the variables of interest, IMD2010 and IMDpercentile to npd. This function is much faster!

#npd_todd <- merge(npd, todd, by = "LSOA") %>%
#  select(studentid, LSOA, school, attainment,
#         IMD2010, IMDpercentile)

npd_todd <- inner_join(x = npd, y = todd, by = "LSOA") %>%
  select(studentid, LSOA, school, attainment,
         IMD2010, IMDpercentile)

npd_todd

##    studentid      LSOA       school  attainment IMD2010 IMDpercentile
## 1          1 E01002920    Park view  1.27178094    4.54             1
## 2          2 E01002882  Grange Hill  0.64167341   21.86             7
## 3          3 E01001247 Sweet valley  0.80076125   24.62             7
## 4          4 E01002512    Park view  1.86265923    7.94             2
## 5          5 E01002244  Grange Hill -0.54535603   16.78             5
## 6          6 E01002104 Sweet valley  0.61927781   36.79             9
## 7          7 E01004158    Park view  0.12264043   10.97             3
## 8          8 E01001937  Grange Hill -0.79098417   18.07             6
## 9          9 E01001818 Sweet valley -0.49977167   48.86            10
## 10        10 E01001487    Park view -1.63515201   35.43             9
## 11        11 E01004263  Grange Hill  1.78456732   41.05             9
## 12        12 E01004311 Sweet valley  0.29887827   54.11            10
## 13        13 E01003757    Park view  0.75250489   24.61             7
## 14        14 E01001646  Grange Hill -1.07460214   24.57             7
## 15        15 E01004270 Sweet valley  0.06427374   33.76             8
## 16        16 E01001084    Park view  1.26264948   14.32             5
## 17        17 E01004240  Grange Hill  1.09994930   44.52             9
## 18        18 E01004078 Sweet valley -1.12355513   16.59             5
## 19        19 E01002130    Park view  3.44607885   26.03             7
## 20        20 E01001822  Grange Hill -0.71972217   46.18            10
## 21        21 E01003829 Sweet valley -0.32685410   25.18             7
## 22        22 E01003413    Park view -0.52977397   18.49             6
## 23        23 E01001155  Grange Hill  0.52415274   45.25            10
## 24        24 E01001512 Sweet valley -1.07960965   21.30             6
## 25        25 E01002875    Park view -0.05511102   16.17             5
## 26        26 E01004270  Grange Hill  2.19868418   33.76             8
## 27        27 E01004105 Sweet valley -1.77390933   10.10             3
## 28        28 E01003994    Park view  0.06503223   39.85             9
## 29        29 E01002565  Grange Hill  0.65497038   14.70             5
## 30        30 E01004315 Sweet valley -0.71286911   50.44            10

Again we can explore the data for any associations (Remember this is simulated data! Don’t make any inferences please). The regression is for demonstration purposes only.

plot(npd_todd$IMD2010, npd_todd$attainment,
     xlab = "IMD2010 measure of deprivation", ylab = "Attainment",
     pch = 19)

z <- lm(attainment ~ IMD2010, data = npd_todd)
abline(z, col = "red")  #  people like a line

cor(npd_todd$IMD2010, npd_todd$attainment)

## [1] -0.01322955

Conclusions

We have demonstrated how to make the most of some publicly available databases, specifically those employing similar methods for splitting up the population into homogenous groupings called lower super output areas which summarise the demographic and related data of about one to two thousand people. This captures the essence of small areas that may provide insight in building preditive models that relate student attainment with their LSOA and its deprivation. We succesfully merged a variety of datasets to show the utility of this approach and the merge() or join() function.

References

Todd, A., Copeland, A., Husband, A., Kasim, A., & Bambra, C. (2015). Access all areas? An area-level analysis of accessibility to general practice and community pharmacy services in England by urbanity and social deprivation. BMJ Open, 5(5), e007328. http://doi.org/10.1136/bmjopen-2014-007328

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3     digest_0.6.9    assertthat_0.1  R6_2.1.1       
##  [5] DBI_0.3.1       formatR_1.2.1   magrittr_1.5    evaluate_0.8   
##  [9] stringi_1.0-1   lazyeval_0.1.10 rmarkdown_0.9.2 tools_3.2.3    
## [13] stringr_1.0.0   yaml_2.1.13     parallel_3.2.3  htmltools_0.3  
## [17] knitr_1.12