Introduction

This practical is designed to introduce you analysing longitudinal data in R through some basic (but quite powerful) techniques. But the end of this practical you will:

  1. Be able to access and read in a large longitudinal microdata file
  2. Understand how to process your microdata file and generate aggregate variable transitions from these data
  3. Be able to visualise some of the longitudinal transitions using Sankey diagrams (if you’re not sure what a Sankey diagram is, check out Mike Bostock’s little introduction here: http://bost.ocks.org/mike/sankey/ or, of course, wikipedia: http://en.wikipedia.org/wiki/Sankey_diagram)

Getting Started

In this practial, I am assuming that you are using RStudio. If you using the UCL Desktop, you can find RStudio under ‘R’ in the start menu (if you’re not, you can download it from here: http://www.rstudio.com/).

Open RStudio and once the programme starts up, open a new R Script, either using the icon with the green ‘+’ or File > New File > R Script.

If you are unfamiliar with using R Studio, to run any script press ctrl+return when the cursor is on the line you want to run. To run multiple lines, highlight the whole section and press ctrl+return.

First, you should download the rCharts package. rCharts is not on the Comprehensive R Archive Network (CRAN - http://cran.r-project.org/), but can be installed from the web using the devtools package (which is!).

If you are working on your own computer with a version of R new enough to install devtools, then this is very easy - follow the steps below. If you are using R on the UCL Desktop (which at the time of writing is v3.0.1),you won’t be able to use devtools, sp skip the steps below and follow the steps in the ‘Installing rCharts on the UCL Desktop…’ section.

Installing rCharts on your own computer

First, install devtools if it is not already in your package directory and then use devtools to install rCharts:

install.packages(c("devtools"), contriburl="http://cran.ma.imperial.ac.uk/")
require(devtools)
install_github('rCharts', 'ramnathv')

Installing rCharts on the UCL Desktop running R version 3.0.1

#First install some dependencies necessary for rCharts to work. *Note, highlight and run this whole section...*
deps = c('RCurl', 'RJSONIO', 'whisker', 'yaml', 'plyr', 'reshape2')
for (dep in deps){
  install.packages(dep)
}

After a short time you’ll see various unpacking messages - some may look like error messages, but don’t worry just ignore them.

Now you should download the rCharts tar ball from github: https://github.com/ramnathv/rCharts/archive/master.tar.gz

By default, clicking on the link above on the UCL Desktop will open internet explorer and the file will be downloaded to some default file location when you choose to ‘save’. Once the file has saved, click the ‘Open folder’ option.

Windows explorer will open and you should see the .gz file in your downloads folder. Right click the address bar at the top of the Explorer window and ‘copy address as text’.

You can now point r to the location of the downloaded package to install it. Copy the code below, replacing file path with the one you have just copied which is specific to your own computer WHEN YOU PASTE YOUR FILE PATH IN YOU WILL NEED TO CHANGE THE BACKSLASH CHARACTERS TO FORWARD-SLASH CHARACTERS AS BELOW

install.packages("//uclusers.ucl.ac.uk/dfs/home/ucfn/ucfnard/Downloads/rCharts-master.tar.gz", repos = NULL, type = "source")

After a short time you should see some messages indicating that rCharts has loaded. You should now load the package:

require(rCharts)

Analysing the Synthetic Longitudinal Data

In this practical we will be using the new England and Wales SYLLS synthetic longitudinal spine dataset - this can be downloaded from here: https://dl.dropboxusercontent.com/u/8649795/NewSpine.zip

Unzip the file to a local directory on your computer and open it up in Excel (or a similar programme).

Inside you will find three sheets: one containing meta data, one containing the list of (new) synthetic variables, and one containing all of the data (both the new synthetic data and the original SARs data).

Reading in your data

The easiest way to get data into R load it in as a comma separated values file. Therefore, we need to save the appropriate sheets in our synthetic data Excel file as individual .csv files. Do do this, make sure you are in the data sheet and (assuming you are in Excel), go to File > Save As > (select an appropriate folder) > and save your data as CSV (Comma delimited) (*.csv) and call it SYLLS_data.csv. Alternatively, if you are having trouble doing this, download it from here: https://dl.dropboxusercontent.com/u/8649795/SYLLS_data.csv

Once you have done this, open RStudio, open a new R Script document and read in your data (or indeed read it in straight from the dropbox location above):

#Read your synthetic LS data into a data frame called 'data' - either from your local directory (quick) or directly from the web, as below (might take a little while, so be patient)
data<-read.csv("https://dl.dropboxusercontent.com/u/8649795/SYLLS_data.csv")
#Check the data has read in correctly
summary(data)
#examine the top of the file
head(data)

If you are in RStudio, data should now also appear in your global environment in the top-right Environment pane. Clicking on ‘data’ should bring a summary of it up as a selectable tab.

Data Wrangling

Now microdata are great, but what happens if we want to understand some of the aggregate characteristics of these data? For example, how many of these individuals underwent the transition from single to married between 2001 and 2011 or the numbers of healthy people in 2001 who became ill by 2011?

To do this we need to aggregate our data according to the characteristics we are interested in. In R, we can carry out these aggregations using the plyr (http://plyr.had.co.nz/) and reshape2 (http://had.co.nz/reshape/) packages (both written by Hadley Wickham of ggplot fame! http://ggplot2.org/):

#load in the packages to carry out the data wrangling (ignore any error messages, you should be fine no matter what version of R you are using...)
library(plyr)
library(reshape2)

First let’s try aggregating our data by the total marriage transitions:

#count the number of individuals undergoing the different marital transitions. 
marriage_trans<-ddply(data, .(SynMarital2001, Marital.Status), nrow)

You should now see a new data frame called ‘marriage_trans’ appear in your data pane at the top-right of RStudio. Click on it to open up and have a look.

What we have done is use the ddply function to create a cross-tabulation from the original microdataset. If you want to find out more about the ddply function type ?ddply into the console (or click the help tab in RStudio - bottom-right pane - and enter ddply in there).

The basic syntax for ddply is data frame / variables / function. Here we have used the ‘data’ dataframe, the variables are SynMarital2001 (synthetic martital status in 2001 variable) and Marital.Status (the real marital status in 2011 recorded in the SAR dataset) and the function is nrow (again, use ?nrow if you want to find out in detail what it does) which counts the number of rows (or people in our microdataset) for every combination of martital status in 2001 and 2011.

ddply is a very useful function. We can add additional variables/dimensions into any cross-tabulation we carry out. For example, we could add age to the cross-tabulation above:

#count transitions by a third dimension, such as age group:
marriage_trans_age<-ddply(data, .(SynMarital2001, Marital.Status, SynAge10YrGrp), nrow)

But with the extra age-group dimension, you might want to rearrange the data frame so that the age groups are columns. Do do this, we use the dcast function from the reshape2 package:

# cast 'molten' marriage_trans_age data frame into new square matrix
marriage_trans_age_DF<-dcast(marriage_trans_age,SynMarital2001+Marital.Status ~ SynAge10YrGrp)

If you click on the new data frames you have just created (top-right-hand pane in RStudio), you will see your cross-tabulated data.

Visualising the transitions

We can now visualise the marriage transitions we have just created using a sankey diagram. First we need to add some new columns to our data frame which also serve the purpose of converting the numeric codes to more readable text. In addition, for the Sankey diagrams to work we need three columns of data with three very specific names: ‘source’, ‘target’ and ‘value’.

#set up a new empty variable
newvar<-0 
#create a little function to recode the 2001 data *note highligh the whole thing before hitting ctrl+return to run and create the function*
recode_marriage_2001<-function(variable,one,two,three,four,five,missing){ 
  newvar[variable==one]<-"Single 2001" 
  newvar[variable==two]<-"Married (or Civil partnership) 2001" 
  newvar[variable==three]<-"Separated 2001" 
  newvar[variable==four]<-"Divorced 2001" 
  newvar[variable==five]<-"Widowed 2001"   
  newvar[variable==missing]<-"Missing Data 2001"  
  return(newvar) 
}

#use the function to create a new column called 'source'
marriage_trans$source<-recode_marriage_2001(marriage_trans$SynMarital2001,1,2,3,4,5,-7)

#reset the newvar variable
newvar<-0 
#now create a similar function to recode the 2011 data
recode_marriage_2011<-function(variable,one,two,three,four,five,missing){ 
  newvar[variable==one]<-"Single 2011" 
  newvar[variable==two]<-"Married (or Civil partnership) 2011" 
  newvar[variable==three]<-"Separated 2011" 
  newvar[variable==four]<-"Divorced 2011" 
  newvar[variable==five]<-"Widowed 2011"   
  newvar[variable==missing]<-"Missing Data 2011"  
  return(newvar) 
}

#and then use this function in a similar way to recode the data into a new column called 'target'
marriage_trans$target<-recode_marriage_2011(marriage_trans$Marital.Status,1,2,3,4,5,-7)

#we could just rename the V1 column as value using the names() function, but here it's just as easy to cast the data into a new column called 'value':
marriage_trans$value<-marriage_trans$V1

#check to see that the data look correct by looking at the top of the data frame
head(marriage_trans)
##   SynMarital2001 Marital.Status     V1            source
## 1             -7              1  66991 Missing Data 2001
## 2             -7              2     24 Missing Data 2001
## 3             -7              4      3 Missing Data 2001
## 4             -7              5     74 Missing Data 2001
## 5              1              1 202544       Single 2001
## 6              1              2  43544       Single 2001
##                                target  value
## 1                         Single 2011  66991
## 2 Married (or Civil partnership) 2011     24
## 3                       Divorced 2011      3
## 4                        Widowed 2011     74
## 5                         Single 2011 202544
## 6 Married (or Civil partnership) 2011  43544

Nothing like a good Sankey…

So, now we have set our data up correctly we can create our Sankey diagram to visualise the transitions.

The Sankey diagram uses some code written by @timelyportfolio. There are a couple of ways of using this code. If working on your own computer, then the most stable option is to download the code in the zip file from here: https://github.com/timelyportfolio/rCharts_d3_sankey and then unzip it to a location on your computer (pointing any commands to this location). Instead, here, we will read the code directly from @timelyportfolio’s github repository:

#first create a new rCharts object, into which we will place our Sankey diagram:
sankeyPlot <- rCharts$new()
#set the location of the library and html template to use:
sankeyPlot$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')
sankeyPlot$setTemplate(script = "http://timelyportfolio.github.io/rCharts_d3_sankey/layouts/chart.html")

#Now set up the plot itself - there are various parameters to play with
sankeyPlot$set(
  data = marriage_trans, #data to use in the plot
  nodeWidth = 15, #width of the little coloured blocks at either end of the plot
  nodePadding = 10, #the spacing between these coloured blocks
  layout = 32, #not really sure what this parameter controls
  width = 750, #width of the plot window
  height = 1000, #height of the plot window
  units = "People", #name of the units for when hovering cursor
  title = "Sankey Diagram"
  #labelFormat = ".1%" #can set the format of the label, but here we won't bother
)

To display your plot, simply call the object you have just created:

sankeyPlot
Your plot should look something like this:

If you hover your mouse over the diagram, you should see the figures relating to the underlying data (don’t forget these numbers are from a 1% sample).

Now your turn…

Now you have created a Sankey diagram to visualise the transitions between marriage states in 2001 and 2011, you should try and explore some other transitions.

The easiest will be to look at some of the other basic transitions (such as social grade or religion), but perhaps try some 3-way transitions such as:

  • General health in 2001 (or 2011) by marital transition (does poor health mean you can’t find a partner?)
  • Social Grade transitions by ethnicity in 2011 (does ethnicity affect your your chances of switching social grade?)
  • Social grade and age by whether dead in 2011 (are you less likely to die if you transition into, or are already in, a higher social grade?)

If you look at the death transitions, you might need to be careful with your subsetting. While I have included death flags for individuals in the dataset, these are based on the probability of dying between 2001 and 2011 for each age group. Therefore, for each transition in your earlier Sankey diagram, some of those individuals might be dead by 2011 - in effect, we could add a ‘death’ transition to each group. If you wanted to try and deal with this, you could include death in any ddply reshaping of the microdata, or drop all dead people from the original ‘data’ file using the subset function, e.g.

#keep alive people
data_alive<-subset(data, SynAliveOrDead2011==1)
#keep dead people
data_dead<-subset(data, SynAliveOrDead2011==2)

With these three-way transitions, you may need to be careful about which columns you use to plot your Sankey diagram with, for example, if we use the marriage transition by age data frame we created earlier, we can assign the value label to the age group column we are interested in:

marriage_trans_age_DF$source<-recode_marriage_2001(marriage_trans_age_DF$SynMarital2001,1,2,3,4,5,-7)

marriage_trans_age_DF$target<-recode_marriage_2011(marriage_trans_age_DF$Marital.Status,1,2,3,4,5,-7)

colnames(marriage_trans_age_DF)[colnames(marriage_trans_age_DF)=="6"]<-"value"

Now we can create the Sankey diagram as before:

sankeyPlot1 <- rCharts$new()
#set the location of the library and html template to use:
sankeyPlot1$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')
sankeyPlot1$setTemplate(script = "http://timelyportfolio.github.io/rCharts_d3_sankey/layouts/chart.html")

sankeyPlot1$set(
  data = marriage_trans_age_DF, #data to use in the plot
  nodeWidth = 15, #width of the little coloured blocks at either end of the plot
  nodePadding = 10, #the spacing between these coloured blocks
  layout = 32, #not really sure what this parameter controls
  width = 750, #width of the plot window
  height = 1000, #height of the plot window
  units = "People aged 60-69", #name of the units for when hovering cursor
  title = "Sankey Diagram"
  #labelFormat = ".1%" #can set the format of the label, but here we won't bother
)
sankeyPlot1