This practical is designed to introduce you analysing longitudinal data in R through some basic (but quite powerful) techniques. But the end of this practical you will:
In this practial, I am assuming that you are using RStudio. If you using the UCL Desktop, you can find RStudio under ‘R’ in the start menu (if you’re not, you can download it from here: http://www.rstudio.com/).
Open RStudio and once the programme starts up, open a new R Script, either using the icon with the green ‘+’ or File > New File > R Script.
If you are unfamiliar with using R Studio, to run any script press ctrl+return when the cursor is on the line you want to run. To run multiple lines, highlight the whole section and press ctrl+return.
Packages are pieces of software that allow us to carry out specific jobs. If you would like to find out what each of these packages does, then please Google them.
#load in the packages to carry out the data wrangling (ignore any error messages, you should be fine no matter what version of R you are using...)
library(tidyverse)
library(plyr)
library(reshape2)
library(plotly)
In this practical we will be using the new England and Wales SYLLS synthetic longitudinal spine dataset - this can be manually downloaded from the Calls-Hub website here: https://calls.ac.uk/guides-resources/synthetic-ls-data/
Feel free to download and Unzip any of these files to a local directory on your computer and open it up in Excel (or a similar programme) to explore. It might be a good idea to do so as variable code look-ups are contained within.
Inside you will find three sheets: one containing meta data, one containing the list of (new) synthetic variables, and one containing all of the data (both the new synthetic data and the original SARs data).
Using R, we are able to read this data directly into R Studio. I have extracted the data as a .csv file and stored it in a publicly available dropbox folder, so we can now read this in…
There are various ways that our data can be read in, here we will just use the base read.csv()
function
#Read in the data...
data<-read.csv("https://www.dropbox.com/s/yn1y29zw7mkyqhu/SYLLS_data.csv?raw=1")
#Check the data has read in correctly
summary(data)
#examine the top of the file
head(data)
If you are in RStudio, data should now also appear in your global environment in the top-right Environment pane. Clicking on ‘data’ should bring a summary of it up as a selectable tab.
Now microdata are great, but what happens if we want to understand some of the aggregate characteristics of these data? For example, how many of these individuals underwent the transition from single to married between 2001 and 2011 or the numbers of healthy people in 2001 who became ill by 2011?
To do this we need to aggregate our data according to the characteristics we are interested in. In R, we can carry out these aggregations using the plyr
(http://plyr.had.co.nz/) and reshape2
(http://had.co.nz/reshape/) packages (both written by Hadley Wickham of ggplot fame! http://ggplot2.org/):
First let’s try aggregating our data by the total marriage transitions:
#count the number of individuals undergoing the different marital transitions.
marriage_trans <- ddply(data, .(SynMarital2001, Marital.Status), nrow)
You should now see a new data frame called ‘marriage_trans’ appear in your data pane at the top-right of RStudio. Click on it to open up and have a look.
What we have done is use the ddply function to create a cross-tabulation from the original microdataset. If you want to find out more about the ddply function type ?ddply
into the console (or click the help tab in RStudio - bottom-right pane - and enter ddply in there).
The basic syntax for ddply is data frame / variables / function. Here we have used the ‘data’ dataframe, the variables are SynMarital2001 (synthetic martital status in 2001 variable) and Marital.Status (the real marital status in 2011 recorded in the SAR dataset) and the function is nrow
(again, use ?nrow
if you want to find out in detail what it does) which counts the number of rows (or people in our microdataset) for every combination of martital status in 2001 and 2011.
ddply
is a very useful function. We can add additional variables/dimensions into any cross-tabulation we carry out. For example, we could add age to the cross-tabulation above:
#count transitions by a third dimension, such as age group:
marriage_trans_age<-ddply(data, .(SynMarital2001, Marital.Status, SynAge10YrGrp), nrow)
But with the extra age-group dimension, you might want to rearrange the data frame so that the age groups are columns. Do do this, we use the dcast
function from the reshape2
package:
# cast 'molten' marriage_trans_age data frame into new square matrix
marriage_trans_age_DF<-dcast(marriage_trans_age,SynMarital2001+Marital.Status ~ SynAge10YrGrp)
## Using V1 as value column: use value.var to override.
If you click on the new data frames you have just created (top-right-hand pane in RStudio), you will see your cross-tabulated data.
We can now visualise the marriage transitions we have just created using a sankey diagram. First we need to add some new columns to our data frame which also serve the purpose of converting the numeric codes to more readable text. In addition, for the Sankey diagrams to work we need three columns of data with three very specific names: ‘source’, ‘target’ and ‘value’.
#set up a new empty variable
newvar<-0
#create a little function to recode the 2001 data *note highligh the whole thing before hitting ctrl+return to run and create the function*
recode_marriage_2001<-function(variable,one,two,three,four,five,missing){
newvar[variable==one]<-"Single 2001"
newvar[variable==two]<-"Married (or Civil partnership) 2001"
newvar[variable==three]<-"Separated 2001"
newvar[variable==four]<-"Divorced 2001"
newvar[variable==five]<-"Widowed 2001"
newvar[variable==missing]<-"Missing Data 2001"
return(newvar)
}
#use the function to create a new column called 'source'
marriage_trans$source<-recode_marriage_2001(marriage_trans$SynMarital2001,1,2,3,4,5,-7)
#reset the newvar variable
newvar<-0
#now create a similar function to recode the 2011 data
recode_marriage_2011<-function(variable,one,two,three,four,five,missing){
newvar[variable==one]<-"Single 2011"
newvar[variable==two]<-"Married (or Civil partnership) 2011"
newvar[variable==three]<-"Separated 2011"
newvar[variable==four]<-"Divorced 2011"
newvar[variable==five]<-"Widowed 2011"
newvar[variable==missing]<-"Missing Data 2011"
return(newvar)
}
#and then use this function in a similar way to recode the data into a new column called 'target'
marriage_trans$target<-recode_marriage_2011(marriage_trans$Marital.Status,1,2,3,4,5,-7)
#we could just rename the V1 column as value using the names() function, but here it's just as easy to cast the data into a new column called 'value':
marriage_trans$value<-marriage_trans$V1
#check to see that the data look correct by looking at the top of the data frame
head(marriage_trans)
## SynMarital2001 Marital.Status V1 source
## 1 -7 1 66991 Missing Data 2001
## 2 -7 2 24 Missing Data 2001
## 3 -7 4 3 Missing Data 2001
## 4 -7 5 74 Missing Data 2001
## 5 1 1 202544 Single 2001
## 6 1 2 43544 Single 2001
## target value
## 1 Single 2011 66991
## 2 Married (or Civil partnership) 2011 24
## 3 Divorced 2011 3
## 4 Widowed 2011 74
## 5 Single 2011 202544
## 6 Married (or Civil partnership) 2011 43544
So, now we have set our data up correctly we can create our Sankey diagram to visualise the transitions.
We will use the plotly
package to create our sankey diagram:
#create a copy of your data frame
dat = marriage_trans
#get a unique list of the source and target variables
node_list <- unique(c(dat[,4], dat[,5]))
#create a data frame of this list
node_list_df <- as.data.frame(node_list)
#create numeric indices of source and target variables
node_list_df$num <- seq(0,length(node_list)-1)
#update the dataset with numeric references
#first source
dat$source_num <- node_list_df[match(dat$source,node_list_df$node_list),2]
#then target
dat$target_num <- node_list_df[match(dat$target,node_list_df$node_list),2]
#now create a plotly sankey diagram
p <- plot_ly(
type = "sankey",
orientation = "h",
node = list(
label = node_list,
color = "blue",
pad = 15,
thickness = 20,
line = list(
color = "black",
width = 0.5
)
),
link = list(
source = dat$source_num,
target = dat$target_num,
value = dat$value
)
) %>%
layout(
title = "Basic Sankey Diagram",
font = list(
size = 10
)
)
############
p
If you hover your mouse over the diagram, you should see the figures relating to the underlying data (don’t forget these numbers are from a 1% sample).
Now you have created a Sankey diagram to visualise the transitions between marriage states in 2001 and 2011, you should try and explore some other transitions.
The easiest will be to look at some of the other basic transitions (such as social grade or religion), but perhaps try some 3-way transitions such as:
If you look at the death transitions, you might need to be careful with your subsetting. While I have included death flags for individuals in the dataset, these are based on the probability of dying between 2001 and 2011 for each age group. Therefore, for each transition in your earlier Sankey diagram, some of those individuals might be dead by 2011 - in effect, we could add a ‘death’ transition to each group. If you wanted to try and deal with this, you could include death in any ddply
reshaping of the microdata, or drop all dead people from the original ‘data’ file using the subset
function, e.g.
#keep alive people
data_alive<-subset(data, SynAliveOrDead2011==1)
#keep dead people
data_dead<-subset(data, SynAliveOrDead2011==2)
With these three-way transitions, you may need to be careful about which columns you use to plot your Sankey diagram with, for example, if we use the marriage transition by age data frame we created earlier, we can assign the value label to the age group column we are interested in:
marriage_trans_age_DF$source<-recode_marriage_2001(marriage_trans_age_DF$SynMarital2001,1,2,3,4,5,-7)
marriage_trans_age_DF$target<-recode_marriage_2011(marriage_trans_age_DF$Marital.Status,1,2,3,4,5,-7)
#we are interested in age group '6' so call that the value column...
colnames(marriage_trans_age_DF)[colnames(marriage_trans_age_DF)=="6"]<-"value"
Now we can create the Sankey diagram as before:
#create a copy of your data frame
dat = marriage_trans_age_DF
#get a unique list of the source and target variables
node_list <- unique(c(dat[,14], dat[,15]))
#create a data frame of this list
node_list_df <- as.data.frame(node_list)
#create numeric indices of source and target variables
node_list_df$num <- seq(0,length(node_list)-1)
#update the dataset with numeric references
#first source
dat$source_num <- node_list_df[match(dat$source,node_list_df$node_list),2]
#then target
dat$target_num <- node_list_df[match(dat$target,node_list_df$node_list),2]
#now create a plotly sankey diagram
p <- plot_ly(
type = "sankey",
orientation = "h",
node = list(
label = node_list,
color = "blue",
pad = 15,
thickness = 20,
line = list(
color = "black",
width = 0.5
)
),
link = list(
source = dat$source_num,
target = dat$target_num,
value = dat$value
)
) %>%
layout(
title = "Marriage Transitions for those in age-group 60-69",
font = list(
size = 10
)
)
############
p