Introduction

This practical is designed to introduce you analysing longitudinal data in R through some basic (but quite powerful) techniques. But the end of this practical you will:

  1. Be able to access and read in a large longitudinal microdata file
  2. Understand how to process your microdata file and generate aggregate variable transitions from these data
  3. Be able to visualise some of the longitudinal transitions using Sankey diagrams (if you’re not sure what a Sankey diagram is, check out Mike Bostock’s little introduction here: http://bost.ocks.org/mike/sankey/ or, of course, wikipedia: http://en.wikipedia.org/wiki/Sankey_diagram)

Getting Started

In this practial, I am assuming that you are using RStudio. If you using the UCL Desktop, you can find RStudio under ‘R’ in the start menu (if you’re not, you can download it from here: http://www.rstudio.com/).

Open RStudio and once the programme starts up, open a new R Script, either using the icon with the green ‘+’ or File > New File > R Script.

If you are unfamiliar with using R Studio, to run any script press ctrl+return when the cursor is on the line you want to run. To run multiple lines, highlight the whole section and press ctrl+return.

Loading some packages

Packages are pieces of software that allow us to carry out specific jobs. If you would like to find out what each of these packages does, then please Google them.

#load in the packages to carry out the data wrangling (ignore any error messages, you should be fine no matter what version of R you are using...)
library(tidyverse)
library(plyr)
library(reshape2)
library(plotly)

Analysing the Synthetic Longitudinal Data

In this practical we will be using the new England and Wales SYLLS synthetic longitudinal spine dataset - this can be manually downloaded from the Calls-Hub website here: https://calls.ac.uk/guides-resources/synthetic-ls-data/

Feel free to download and Unzip any of these files to a local directory on your computer and open it up in Excel (or a similar programme) to explore. It might be a good idea to do so as variable code look-ups are contained within.

Inside you will find three sheets: one containing meta data, one containing the list of (new) synthetic variables, and one containing all of the data (both the new synthetic data and the original SARs data).

Using R, we are able to read this data directly into R Studio. I have extracted the data as a .csv file and stored it in a publicly available dropbox folder, so we can now read this in…

Reading in your data

There are various ways that our data can be read in, here we will just use the base read.csv() function

#Read in the data...
data<-read.csv("https://www.dropbox.com/s/yn1y29zw7mkyqhu/SYLLS_data.csv?raw=1")
#Check the data has read in correctly
summary(data)
#examine the top of the file
head(data)

If you are in RStudio, data should now also appear in your global environment in the top-right Environment pane. Clicking on ‘data’ should bring a summary of it up as a selectable tab.

Data Wrangling

Now microdata are great, but what happens if we want to understand some of the aggregate characteristics of these data? For example, how many of these individuals underwent the transition from single to married between 2001 and 2011 or the numbers of healthy people in 2001 who became ill by 2011?

To do this we need to aggregate our data according to the characteristics we are interested in. In R, we can carry out these aggregations using the plyr (http://plyr.had.co.nz/) and reshape2 (http://had.co.nz/reshape/) packages (both written by Hadley Wickham of ggplot fame! http://ggplot2.org/):

First let’s try aggregating our data by the total marriage transitions:

#count the number of individuals undergoing the different marital transitions. 
marriage_trans <- ddply(data, .(SynMarital2001, Marital.Status), nrow)

You should now see a new data frame called ‘marriage_trans’ appear in your data pane at the top-right of RStudio. Click on it to open up and have a look.

What we have done is use the ddply function to create a cross-tabulation from the original microdataset. If you want to find out more about the ddply function type ?ddply into the console (or click the help tab in RStudio - bottom-right pane - and enter ddply in there).

The basic syntax for ddply is data frame / variables / function. Here we have used the ‘data’ dataframe, the variables are SynMarital2001 (synthetic martital status in 2001 variable) and Marital.Status (the real marital status in 2011 recorded in the SAR dataset) and the function is nrow (again, use ?nrow if you want to find out in detail what it does) which counts the number of rows (or people in our microdataset) for every combination of martital status in 2001 and 2011.

ddply is a very useful function. We can add additional variables/dimensions into any cross-tabulation we carry out. For example, we could add age to the cross-tabulation above:

#count transitions by a third dimension, such as age group:
marriage_trans_age<-ddply(data, .(SynMarital2001, Marital.Status, SynAge10YrGrp), nrow)

But with the extra age-group dimension, you might want to rearrange the data frame so that the age groups are columns. Do do this, we use the dcast function from the reshape2 package:

# cast 'molten' marriage_trans_age data frame into new square matrix
marriage_trans_age_DF<-dcast(marriage_trans_age,SynMarital2001+Marital.Status ~ SynAge10YrGrp)
## Using V1 as value column: use value.var to override.

If you click on the new data frames you have just created (top-right-hand pane in RStudio), you will see your cross-tabulated data.

Visualising the transitions

We can now visualise the marriage transitions we have just created using a sankey diagram. First we need to add some new columns to our data frame which also serve the purpose of converting the numeric codes to more readable text. In addition, for the Sankey diagrams to work we need three columns of data with three very specific names: ‘source’, ‘target’ and ‘value’.

#set up a new empty variable
newvar<-0 
#create a little function to recode the 2001 data *note highligh the whole thing before hitting ctrl+return to run and create the function*
recode_marriage_2001<-function(variable,one,two,three,four,five,missing){ 
  newvar[variable==one]<-"Single 2001" 
  newvar[variable==two]<-"Married (or Civil partnership) 2001" 
  newvar[variable==three]<-"Separated 2001" 
  newvar[variable==four]<-"Divorced 2001" 
  newvar[variable==five]<-"Widowed 2001"   
  newvar[variable==missing]<-"Missing Data 2001"  
  return(newvar) 
}

#use the function to create a new column called 'source'
marriage_trans$source<-recode_marriage_2001(marriage_trans$SynMarital2001,1,2,3,4,5,-7)

#reset the newvar variable
newvar<-0 
#now create a similar function to recode the 2011 data
recode_marriage_2011<-function(variable,one,two,three,four,five,missing){ 
  newvar[variable==one]<-"Single 2011" 
  newvar[variable==two]<-"Married (or Civil partnership) 2011" 
  newvar[variable==three]<-"Separated 2011" 
  newvar[variable==four]<-"Divorced 2011" 
  newvar[variable==five]<-"Widowed 2011"   
  newvar[variable==missing]<-"Missing Data 2011"  
  return(newvar) 
}

#and then use this function in a similar way to recode the data into a new column called 'target'
marriage_trans$target<-recode_marriage_2011(marriage_trans$Marital.Status,1,2,3,4,5,-7)

#we could just rename the V1 column as value using the names() function, but here it's just as easy to cast the data into a new column called 'value':
marriage_trans$value<-marriage_trans$V1

#check to see that the data look correct by looking at the top of the data frame
head(marriage_trans)
##   SynMarital2001 Marital.Status     V1            source
## 1             -7              1  66991 Missing Data 2001
## 2             -7              2     24 Missing Data 2001
## 3             -7              4      3 Missing Data 2001
## 4             -7              5     74 Missing Data 2001
## 5              1              1 202544       Single 2001
## 6              1              2  43544       Single 2001
##                                target  value
## 1                         Single 2011  66991
## 2 Married (or Civil partnership) 2011     24
## 3                       Divorced 2011      3
## 4                        Widowed 2011     74
## 5                         Single 2011 202544
## 6 Married (or Civil partnership) 2011  43544

Nothing like a good Sankey…

So, now we have set our data up correctly we can create our Sankey diagram to visualise the transitions.

We will use the plotly package to create our sankey diagram:

#create a copy of your data frame
dat = marriage_trans
#get a unique list of the source and target variables
node_list <- unique(c(dat[,4], dat[,5]))
#create a data frame of this list
node_list_df <- as.data.frame(node_list)
#create numeric indices of source and target variables
node_list_df$num <- seq(0,length(node_list)-1)
#update the dataset with numeric references
#first source
dat$source_num <- node_list_df[match(dat$source,node_list_df$node_list),2]
#then target
dat$target_num <- node_list_df[match(dat$target,node_list_df$node_list),2]


#now create a plotly sankey diagram  
p <- plot_ly(
  type = "sankey",
  orientation = "h",
  
  node = list(
    label = node_list,
    color = "blue",
    pad = 15,
    thickness = 20,
    line = list(
      color = "black",
      width = 0.5
    )
  ),
  
  link = list(
    source = dat$source_num,
    target = dat$target_num,
    value =  dat$value
  )
) %>% 
  layout(
    title = "Basic Sankey Diagram",
    font = list(
      size = 10
    )
  )

############
p

If you hover your mouse over the diagram, you should see the figures relating to the underlying data (don’t forget these numbers are from a 1% sample).

Now your turn…

Now you have created a Sankey diagram to visualise the transitions between marriage states in 2001 and 2011, you should try and explore some other transitions.

The easiest will be to look at some of the other basic transitions (such as social grade or religion), but perhaps try some 3-way transitions such as:

  • General health in 2001 (or 2011) by marital transition (does poor health mean you can’t find a partner?)
  • Social Grade transitions by ethnicity in 2011 (does ethnicity affect your your chances of switching social grade?)
  • Social grade and age by whether dead in 2011 (are you less likely to die if you transition into, or are already in, a higher social grade?)

If you look at the death transitions, you might need to be careful with your subsetting. While I have included death flags for individuals in the dataset, these are based on the probability of dying between 2001 and 2011 for each age group. Therefore, for each transition in your earlier Sankey diagram, some of those individuals might be dead by 2011 - in effect, we could add a ‘death’ transition to each group. If you wanted to try and deal with this, you could include death in any ddply reshaping of the microdata, or drop all dead people from the original ‘data’ file using the subset function, e.g.

#keep alive people
data_alive<-subset(data, SynAliveOrDead2011==1)
#keep dead people
data_dead<-subset(data, SynAliveOrDead2011==2)

With these three-way transitions, you may need to be careful about which columns you use to plot your Sankey diagram with, for example, if we use the marriage transition by age data frame we created earlier, we can assign the value label to the age group column we are interested in:

marriage_trans_age_DF$source<-recode_marriage_2001(marriage_trans_age_DF$SynMarital2001,1,2,3,4,5,-7)

marriage_trans_age_DF$target<-recode_marriage_2011(marriage_trans_age_DF$Marital.Status,1,2,3,4,5,-7)

#we are interested in age group '6' so call that the value column...
colnames(marriage_trans_age_DF)[colnames(marriage_trans_age_DF)=="6"]<-"value"

Now we can create the Sankey diagram as before:

#create a copy of your data frame
dat = marriage_trans_age_DF
#get a unique list of the source and target variables
node_list <- unique(c(dat[,14], dat[,15]))
#create a data frame of this list
node_list_df <- as.data.frame(node_list)
#create numeric indices of source and target variables
node_list_df$num <- seq(0,length(node_list)-1)
#update the dataset with numeric references
#first source
dat$source_num <- node_list_df[match(dat$source,node_list_df$node_list),2]
#then target
dat$target_num <- node_list_df[match(dat$target,node_list_df$node_list),2]


#now create a plotly sankey diagram  
p <- plot_ly(
  type = "sankey",
  orientation = "h",
  
  node = list(
    label = node_list,
    color = "blue",
    pad = 15,
    thickness = 20,
    line = list(
      color = "black",
      width = 0.5
    )
  ),
  
  link = list(
    source = dat$source_num,
    target = dat$target_num,
    value =  dat$value
  )
) %>% 
  layout(
    title = "Marriage Transitions for those in age-group 60-69",
    font = list(
      size = 10
    )
  )

############
p