Updates

v0.1: Initial draft

v0.2: Datasets made open-source, multiple imputation, intra-class correlation coefficients & sample size calculations

v0.3: Some more data manipulations added, handling survey data to estimate design effects

v0.4: Updates on Kaplan Meier plots, GGPLOT style, intro to meta-analysis including meta-analysis for proportions

v0.5: I learnt Broom and I am very happy

Introduction

R is a powerful, free and open-source piece of software for statistical analysis.

Unfortunately my experience of trying to learn R has been that materials are not always provided in a format that addresses the everyday problems of clinical epidemiology.

The aim of this guide is to teach some basic R skills so that you can:

  a) Load, clean and manipulate a dataset
  b) Do some descriptive statistics and basic statistical tests
  c) Do logistic regression
  d) Do poisson & cox regression
  e) Handle correlated outcomes
  f) Analyse matched case-control data
  g) Do sample size calculations
  h) Do multiple imputation with chained equations

We will use real datasets as we go along.

Broadly, the material here aims to show you how to do most of the analysis approaches one might learn in the LSHTM MSc modules Statistical Methods in Epidemiology & Advanced Statistical Methods in Epidemiology (or equivalent) in R.

Not everything is done identically but the key elements should come out!

Important Caveat

The aim of this guide is not to teach you statistics. I assume that you know what tests are appropriate in which situations. The aim is to show how these tests can be implemented in R and give an output with which a clinical epidemiologist is familiar.

Setting up R

A key point in R is that (unlike STATA) many of the analysis tools need to be specifically loaded into the software.

Think of these as modules designed for specific tasks. Each module is called a package. The strength of R is the wide variety of packages to address a huge range of analysis approaches. It can make R frustrating though as there can be multiple ways to approach the same problem.

The approaches and packages used here are ones I learnt and felt comfortable with - there may be better ways to do things using other packages.

R Studio

I always use R within "RStudio". When you open RStudio it opens R but just makes the layout nicer - more like STATA with boxes for current code, datasets, plots etc.

Installing Packages

The first time you set up R you need to install the packages.

The syntax is: > install.packages("packagename") #NB note the use of ""

For example

##Data Manipulation##
install.packages("tidyverse") 
#tidyverse contains many packages for data manipulation, graphs etc
install.packages("splitstackshape")
#Really nice syntax for expanding datasets

#Basic Epi
install.packages("epiR") 
#Some useful epitools especially 2*2 tables and stratified MH Odds
install.packages("lmtest") 
#for doing likelihood ratio tests between models 

#Poisson & Cox Regression & Survival Analysis
install.packages("Epi") 
#Tools for Lexis Models
install.packages("popEpi")
#Extra tools to go with 'Epi' and make nice rate tables
install.packages("survival") 
#for survival analysis, kaplan-meier plots and cox regression
install.packages("survminer")
#Really nice KM plots!

#Random-Effects Models
install.packages("lme4")
#For fitting random-effects models
install.packages("sjstats")
#For intraclass correlation coefficients from random-effects models

#Multiple Imputation
install.packages("mice")
#For multiple imputation

install.packages("arsenal")
#Really nice summary tables

#Complex survey data analysis
install.packages("survey")
#For handling complex survey data
install.packages("ICC")
#Alternative methods for ICC calculation from survey data

#Meta-analysis
install.packages("meta")
install.packages("metafor")

#Make saving your models easier
install.packages("broom.mixed")

Loading Packages

Once the packages are installed in R you need to load them so they can be used. This 'loading' step needs to be done each time you start R. The syntax is: > library(packagename) #NB note no use of ""

You don't need to load all the packages every time, especially if you aren't using them in a given analysis. But many packages you might use frequently - especially some of the data manipulation ones.

library(tidyverse)
library(lmtest)
library(survival)
library(survminer)
library(epiR)
library(Epi)
library(lme4)
library(popEpi)
library(mice)
library(sjstats)
library(splitstackshape)
library(arsenal)
library(survey)
library(ICC)
library(meta)
library(metafor)
library(broom)
library(broom.mixed)

Setting a working directory

You should set a working directory - this tells R where any files you might be looking for can be found without you needing to write the full file path

The syntax is: > setwd(“directory_where_the_files_are”)

setwd("~/Dropbox/Work/EPI/Learning R/")

If you plan to try the analyses below then you need to set your own working directory and put the dataset files inside that directory

From this point on all the code should work if everything is in the same directory

A few generic comments on how R holds and stores data

  1. Each dataset has a name
  2. Each variable within a dataset has a name
  3. In general when referencing a variable we need to give the dataset name too, because we might have more than one dataset i.e > mosquito$species

Refers to the species variable within the mosquito dataset

In general in R we put the output of whatever we do into a new object for storage. For example in STATA we might just say > logistic died i.agegroup

This would run the logistic regression and give an output straight away

But in R we would say > model <- glm(died~factor(agegroup), family=“binomial”, data = mosquito)

This would run the logistic regression and save the results in an object called model

The syntax for any procedure where we save data to a place is name_of_place_I_am_saving <- thing_I_am_saving_to_this_object

or if saving to a new or existing variable in a dataset my_dataset$variable_name <- thing_I_am_saving_to_that_variable

A key point in R is the ability to have multiple separate datasets available at any one time - unlike STATA for example where only a single dataset can be available.

Annotating code

Just like in STATA we can annotate our code. If we put a # symbol in a line then R will treat everything after it as annotation

some code here #This code is doing something important

Datasets

All the datasets used are available via the LSHTM Data Repository under "R For Clinical Epi".

Individual links to datasets are provided as needed.

Load a dataset

Loading Data

Next we want to be able to load in a dataset. For the purpose of this tutorial I presume that most of the datasets you will be using are some form of comma separated values (CSV) file

The generic syntax is: > name_of_data <- read.csv( file = “filename.csv”, header = TRUE, sep = “,”)

header = TRUE: the top row is variable names
sep = ",": this is comma delimited data

We can see the loaded data with > View(name_of_data)

Sample Dataset 1

The first dataset we are using is a cross-sectional survey of scabies. This dataset is freely available on the LSHTM Data Repository: "Scabies-Dataset".

scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
#Remember we set the working directory already. As the file is inside that directory we can just give the file name.

#We can see the data with: View(scabies)

Sample Dataset 2

A strength of R is that you can have more than one dataset available at a time. The second dataset we are loading is based on the original outbreak of Ebola in the DRC. This is also freely available on the LSHTM Data Repository: "Ebola-Dataset".

ebola <- read.csv(file = "mmc1.txt", header = TRUE, sep = ",")
#Now we have two separate datasets available!

I’m loading both datasets so I have examples of the different types of variables etc for the exercises below

Referring to subsets of data

We can reference a subset of a dataset based on the values of a specific variable

The syntax is > [data$variable == something]

For example:

[survey$gender == "male"]
#This would apply our rule only to values where gender = male

mean(scabies$age[scabies$gender == "male"])
#This would work out the mean age of those individuals where gender is male.

Referencing a subset of data is obviously valuable for making new variables based on existing values, plotting data and performing statistical tests.
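For example (a small sketch combining the two ideas above; the variable name age_cat and the cut-off at 16 are just for illustration):

scabies$age_cat[scabies$age < 16] <- "child"
scabies$age_cat[scabies$age >= 16] <- "adult"
#Make a new variable (age_cat) by assigning values only to the rows in each subset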

Clean and manipulate a dataset

In general we can either alter an existing variable OR make a new variable by manipulating an old one.

The syntax for altering an existing variable is: > data$variable <- thing_we_want_to_do

The syntax for making a new variable is the same: > data$new_variable <- thing_we_want_new_variable_to_be
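For example (a trivial sketch using the age variable from the scabies dataset loaded above; age_months is just an illustrative name):

scabies$age_months <- scabies$age * 12
#Make a new variable holding age in months from the existing age in years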

Discarding unwanted Variables

If we want to get rid of some variables we don't need we can do this with 'select'. The general syntax is > data <- select(data, -VAR_C) #Drop these vars

scabies <- select(scabies, -scabies_distribution_region, -scabieslesions_number_range, -impetigolesions_number_range)

Discarding unwanted Rows

We can get rid of rows of data we don’t need. The general syntax is > data <- filter(data, how_to_filter)

ebola_men <- filter(ebola, sex =="male")
#Make a copy of the dataset with only the men in it

Turning continuous into categorical variables

We often want to group continuous variables like age into groups. We can do this by "cutting" the variable. We need to tell R the new variable is a categorical/factor variable

The generic syntax is: > data$variable <- as.factor(cut(data$variable, c(cut_here,cut_here,cut_here)))

In general we also want to include the lowest value i.e recode from the value of the first cut to the second

data$variable <- as.factor(cut(data$variable, c(cut_here,cut_here,cut_here), include.lowest = TRUE))

scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE)) 
#Make a new variable called agegroups. Do this by splitting people based on the age variable into groups 0-10, 11-20 and 21+
#So the first number tells you where to start from
#Then everything between that number and the next break (inclusive) will be in the next group etc


scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
#Make a new variable of the house size.
#Check this looks correct

table(scabies$house_cat, scabies$house_inhabitants)
##       
##          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  17  30
##   0-5    5  39 153 288 389   0   0   0   0   0   0   0   0   0   0   0   0
##   6-10   0   0   0   0   0 349 203 184 101 101   0   0   0   0   0   0   0
##   10+    0   0   0   0   0   0   0   0   0   0  45  32   8   3   2   4   2

Turning categorical variables into numeric variables

We can do this using > as.numeric()

ebola$status <- as.numeric(ebola$status) 
#Convert each value of the variable status to a numeric one
#NB if the variable is currently a factor this returns the underlying level codes (1, 2, ...) rather than the text, so check the result looks as expected

Recoding categorical variables

We can do this using the ‘recode’ command

dataset$variable <- recode(dataset$variable, OldCat = "NewCat", OldCat2 = "NewCat2")

ebola$transmission <- recode(ebola$transmission, syringe = "needle")
##This is obviously not a very exciting recode but you get the idea!

Setting a baseline group for categorical variables

When we have a categorical variable it is key we know which value R uses as the reference - for example when doing logistic regression.

We can set this with the ‘relevel’ function

scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
#Make 0-5 household size the baseline group

scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
#Make age 21+ the baseline group

ebola$transmission <-relevel(ebola$transmission, ref = "person_to_person")
#Make person to person transmission the baseline group

Working with dates

Sometimes when dates are imported from a file R doesn't realise they are dates as opposed to just strings of characters. We can check this easily.

class(ebola$disease_onset)
## [1] "factor"
#We see that R currently thinks every date is a different level of a categorical variable, which is not very helpful

We can convert strings of characters stored as dates into an 'R' date (which in the background is stored as a numeric value (days since 1970)).

The general syntax is: > data$date <- as.Date(data$date, format = "")

There are a range of values for format. Common ones include:
%a - Abbreviated weekday: Mon
%A - Full weekday: Monday
%b - Abbreviated month: Nov
%B - Full month: November
%d - Day of the month: 01
%m - Month: 07
%y - Two digit year: 98
%Y - Four digit year: 1998

So given 2017-01-24 you might write format = “%Y-%m-%d”

If you need to turn the 'R' date into a strict number (for some calculations) you can simply specify: > data$date <- as.numeric(as.Date(data$date, format = ""))

#We can turn it in to a date
ebola$disease_onset <- as.Date(ebola$disease_onset, format = "%Y-%m-%d")
ebola$disease_ended <- as.Date(ebola$disease_ended, format = "%Y-%m-%d")
class(ebola$disease_onset)
## [1] "Date"
#We see that R now recognises this as a date variable.

Reshaping Datasets

The tidyr package has a number of commands for reshaping data. To go long to wide we use

spread(DATAFRAME, VARNAMES, VARVALUES)
#Take a given dataframe
#Make N columns, one for each unique value of VARNAMES
#Populate each new column with the corresponding value from VARVALUES

To go wide to long we use

gather(DATAFRAME, NEWVAR, NEWVALUES, (BETWEENTHISVAR:TOTHISVAR))
#Take the data in dataframe DATAFRAME
#Make a new variable called "NEWVAR"
#The values in NEWVAR are taken from the names of the original columns specified - currently BETWEENTHISVAR:TOTHISVAR
#Make a new variable called "NEWVALUES"
#The values in NEWVALUES are the corresponding values from columns BETWEENTHISVAR, TOTHISVAR etc

Sometimes we also want to combine two columns into a single value. For this we use 'unite' (a worked example follows the spread/gather demonstration below):

unite(DATAFRAME, TEMPVAR, VAR1, VAR2)
#Make a new variable TEMPVAR
#Make the variable by combining VAR1 and VAR2

df <- data.frame(month=rep(1:3,2),
                 student=rep(c("Amy", "Bob"), each=3),
                 A=c(9, 7, 6, 8, 6, 9),
                 B=c(6, 7, 8, 5, 6, 7))
#Here we have a dataframe where each student is on a different row for each month
#The students took two tests A and B. 
#For each student/month combination we have a value for A and a value for B

df2 <- gather(df,test,score,A:B)
#Make a new datatable df2
#Have a column called "test". 
#This will have the value either A or B as these are the names of the columns we specified.
#Have a column called "score". 
#This will have the value previously in column A or B respectively for each row

df2
##    month student test score
## 1      1     Amy    A     9
## 2      2     Amy    A     7
## 3      3     Amy    A     6
## 4      1     Bob    A     8
## 5      2     Bob    A     6
## 6      3     Bob    A     9
## 7      1     Amy    B     6
## 8      2     Amy    B     7
## 9      3     Amy    B     8
## 10     1     Bob    B     5
## 11     2     Bob    B     6
## 12     3     Bob    B     7
#Now we have a single row for each combination of month/student/test 
#Their score is in the score column

df3 <- spread(df2,test,score)
#Make a new datatable df3
#Make a column for each unique value in the test variable.
#Name each of these columns based on that unique value
#Under each column put the corresponding value that was in the score column

df3
##   month student A B
## 1     1     Amy 9 6
## 2     1     Bob 8 5
## 3     2     Amy 7 7
## 4     2     Bob 6 6
## 5     3     Amy 6 8
## 6     3     Bob 9 7
#We are back to where we started!
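We didn't give unite a worked example above, so here is a quick sketch using df2 (month_test is just an illustrative name):

df4 <- unite(df2, month_test, month, test)
#Make a new column month_test by pasting month and test together (separated by "_" by default)
#The original month and test columns are dropped unless we say remove = FALSE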

Generating ‘n’ by groups

Sometimes we have multiple rows per individual/group and want to number these. This is equivalent to STATA bysort VAR: gen n=_n

The general syntax is: > dataset$n <- ave(1:length(dataset$grouping_var), dataset$grouping_var, FUN = seq_along)
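A small sketch on a toy dataframe (visits and id are just illustrative names):

visits <- data.frame(id = c("A","A","B","B","B"))
visits$n <- ave(1:length(visits$id), visits$id, FUN = seq_along)
visits
#Each person's rows are now numbered 1, 2, 3... within their own id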

Expanding a dataset

Sometimes we want to expand panel data. For example our data might look like

##   fact count
## 1    a     1
## 2    b     2
## 3    c     3

But we might want

##   fact
## 1    a
## 2    b
## 3    b
## 4    c
## 5    c
## 6    c

We can use the ‘expandRows’ command. The general syntax is

expandRows(dataframe, X, count.is.col = , drop = )
#drop can be specified TRUE or FALSE. It says whether or not to keep the column that has the count data in. The default is TRUE
#count.is.col says whether the value X is the column number with the count value in it, or (if FALSE) that all rows should be expanded X times

dt <- data.frame(fact = c("a","b","c"), count = c(1,2,3))
#Recreate the small example table shown above
dt3 <- expandRows(dt, 2)
#Expand the original datatable. Replicate each row by the value in column 2. 
dt3
##     fact
## 1      a
## 2      b
## 2.1    b
## 3      c
## 3.1    c
## 3.2    c

Descriptive and basic statistics

We should do some exploratory basic analyses before we proceed to more complex regression analysis

Means etc

We might want to get a mean, median and maximum etc. For a single variable we can use 'summary'

summary(scabies$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00   12.00   18.79   30.00   83.00
#Or just

mean(scabies$age)
## [1] 18.79193

Means by groups/categories

If we want to summarise by a categorical variable we can do this nicely using both stat.table & tableby.

Why use one or the other? stat.table allows you to specify exactly what statistic you want summarised (“I want the maximum value”)

tableby lets you quickly generate the most common statistics (mean, counts) for lots of variables

The general syntax for stat.table is:

stat.table(grouping_factor/s, statistics_to_show, dataset)

stat.table(gender,mean(age), data = scabies)
##  ------------------- 
##  gender   mean(age)  
##  ------------------- 
##  female       22.18  
##  male         14.15  
##  -------------------
#Tabulate, by gender, the mean age from the scabies dataset

stat.table(gender,list(mean(age),median(age)), data = scabies)
##  ------------------------------- 
##  gender   mean(age) median(age)  
##  ------------------------------- 
##  female       22.18       15.00  
##  male         14.15        9.00  
##  -------------------------------
#Tabulate, by gender, the mean and median age from the scabies dataset

The general syntax for tableby is:

tableby(groupingvar~var_1+var_n, data = dataset)
#tableby will give you the mean, SD, IQR and range of numeric variables
#tableby will give you the count and % for each level of categorical variables

summary_data <- tableby(gender~age+scabies_infestation,data=scabies)
summary(summary_data)
## 
## 
## |                            | female (N=1103) |  male (N=805)   | Total (N=1908)  | p value|
## |:---------------------------|:---------------:|:---------------:|:---------------:|-------:|
## |**age**                     |                 |                 |                 | < 0.001|
## |&nbsp;&nbsp;&nbsp;Mean (SD) | 22.179 (17.500) | 14.152 (16.232) | 18.792 (17.429) |        |
## |&nbsp;&nbsp;&nbsp;Range     | 0.000 - 83.000  | 0.000 - 82.000  | 0.000 - 83.000  |        |
## |**scabies_infestation**     |                 |                 |                 | < 0.001|
## |&nbsp;&nbsp;&nbsp;no        |   931 (84.4%)   |   611 (75.9%)   |  1542 (80.8%)   |        |
## |&nbsp;&nbsp;&nbsp;yes       |   172 (15.6%)   |   194 (24.1%)   |   366 (19.2%)   |        |
#Tabulate, by gender, summary statistics for age and scabies

T-Test

We might want to know if the mean of a value like age varies between two groups.

t.test(scabies$age[scabies$gender=="male"],scabies$age[scabies$gender=="female"])
## 
##  Welch Two Sample t-test
## 
## data:  scabies$age[scabies$gender == "male"] and scabies$age[scabies$gender == "female"]
## t = -10.32, df = 1801.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.552509 -6.501593
## sample estimates:
## mean of x mean of y 
##  14.15155  22.17860

This presents the mean age of each group, the 95% confidence interval for the difference and the p-value. We can see that the groups do have different mean ages.

Confidence intervals of a proportion (and a p-value)

prop.test(numerator,denominator)
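For example, using counts from the tableby output above (366 people with scabies infestation out of 1908 surveyed):

prop.test(366, 1908)
#Gives the proportion with scabies infestation and a 95% confidence interval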

Cross-Tabs, Chi-Square Tests and Mantel-Haenszel Odds Ratios

We might want to tabulate categorical variables

table(data$dependent_variable, data$independent_variable)

table(impetigo = scabies$impetigo_active, scabies = scabies$scabies_infestation)
##         scabies
## impetigo   no  yes
##      no  1098  187
##      yes  444  179
#Simple table

We might want to do a chi-square and calculate an odds ratio

We can do this using > epi.2by2

The data needs to be laid out as:

            Has Disease      No Disease

Exposed

Not-Exposed

The generic syntax is > epi.2by2(table(scabies$dep_var, scabies$indep_var))

#Lets looks how table aligns the data
table(scabies$scabies_infestation, scabies$impetigo_active)
##      
##         no  yes
##   no  1098  444
##   yes  187  179
#See that because 'no' is the 'base' level the table is laid out
#               No Disease      Has Disease

# Not-exposed

# Exposed

#This is dependent on how your data is coded so you need to check this before using epi.2by2
#If the table is laid out correctly then you can input straight into epi.2by2, otherwise you need to recode or re-order the variables so that the table will be laid out correctly

#epi.2by2 wants the data with the exposed/disease group in top right corner
#So we just tell R to order the variables differently when we draw the table

epi.2by2(table(relevel(scabies$scabies_infestation,"yes"), relevel(scabies$impetigo_active,"yes")))
##              Outcome +    Outcome -      Total        Inc risk *
## Exposed +          179          187        366              48.9
## Exposed -          444         1098       1542              28.8
## Total              623         1285       1908              32.7
##                  Odds
## Exposed +       0.957
## Exposed -       0.404
## Total           0.485
## 
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio                               1.70 (1.49, 1.94)
## Odds ratio                                   2.37 (1.88, 2.99)
## Attrib risk *                                20.11 (14.52, 25.71)
## Attrib risk in population *                  3.86 (0.77, 6.95)
## Attrib fraction in exposed (%)               41.13 (32.89, 48.35)
## Attrib fraction in population (%)            11.82 (8.30, 15.20)
## -------------------------------------------------------------------
##  X2 test statistic: 54.415 p-value: < 0.001
##  Wald confidence limits
##  * Outcomes per 100 population units

Notice this gives you counts, an odds ratio and a chi-square result. Effectively this command combines elements of the STATA commands: mhodds and tab,chi

Stratified MH Odds Ratios for Confounding

We can also stratify our odds ratio by a third variable to look for confounding

epi.2by2(table(data$dep_var, data$indep_var, data$stratificationvariable))
#As above make sure the levels of dep_var and indep_var are set such that the table is ordered in the way that epi.2by2 wants the data

The output gives us the crude odds ratio (identical to the one we got without the stratification variable) and then an adjusted odds ratio. We can look at these two for evidence of confounding.

We can get (if we desire) the stratum specific odds ratios out

stratum <- epi.2by2(table(relevel(scabies$scabies_infestation,"yes"), relevel(scabies$impetigo_active,"yes"), scabies$gender))
#Note this time I'm saving the 2*2 table
stratum
##              Outcome +    Outcome -      Total        Inc risk *
## Exposed +          179          187        366              48.9
## Exposed -          444         1098       1542              28.8
## Total              623         1285       1908              32.7
##                  Odds
## Exposed +       0.957
## Exposed -       0.404
## Total           0.485
## 
## 
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio (crude)                       1.70 (1.49, 1.94)
## Inc risk ratio (M-H)                         1.58 (1.39, 1.79)
## Inc risk ratio (crude:M-H)                   1.08
## Odds ratio (crude)                           2.37 (1.88, 2.99)
## Odds ratio (M-H)                             2.18 (1.72, 2.76)
## Odds ratio (crude:M-H)                       1.09
## Attrib risk (crude) *                        20.11 (14.52, 25.71)
## Attrib risk (M-H) *                          17.75 (5.52, 29.99)
## Attrib risk (crude:M-H)                      1.13
## -------------------------------------------------------------------
##  Test of homogeneity of IRR: X2 test statistic: 8.011 p-value: 0.005
##  Test of homogeneity of  OR: X2 test statistic: 0.629 p-value: 0.428
##  Wald confidence limits
##  M-H: Mantel-Haenszel
##  * Outcomes per 100 population units
summary(stratum)$OR.strata.wald
##        est    lower   upper
## 1 2.417620 1.716620 3.40488
## 2 1.996753 1.439806 2.76914

Basic plots

The basic R software can make nice plots. The introduction below is really only an outline. Later we will introduce the specific GGPLOT2 package which makes really nice graphs.

You may well want to ignore this section and just learn how to make GGPLOT graphs as they are much prettier!

Histogram

We can do some simple plots to look at distributions such as a histogram. The general syntax is > hist(data$variable)

hist(scabies$age)

Scatterplot

We could do a basic scatter plot of house size vs number of people in a room

plot(scabies$house_inhabitants,scabies$room_cohabitation)

Colour Scatter

We might make this look nicer by adding some colours representing key variables

scabies$scatter_colour[scabies$scabies_infestation == "yes"] <- "red" 
#If scabies_infestation is "yes" then the colour is red

scabies$scatter_colour[scabies$scabies_infestation == "no"] <- "blue"
#If scabies_infestation is "no" then the colour is blue

plot(scabies$house_inhabitants,scabies$room_cohabitation, col = scabies$scatter_colour)

#For each point look in the variable scatter_colour and work out what colour to make each person

Lots of the points overlap so we can jitter them

plot(jitter(scabies$house_inhabitants),scabies$room_cohabitation, col = scabies$scatter_colour)

Later we will come back to GGPLOT2 which makes much nicer graphs!

Logistic Regression

I will start by showing the Long-Hand version of how to do this - but once you know how to do it then you will want to learn to use Broom and make your life better.

Sample Dataset 3

The examples here use the “scabies dataset” again. For ease I quickly repeat below the code needed to implement the data-cleaning steps demonstrated above on the dataset

scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
#Load dataset
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE)) 
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
#Categorise age and set a baseline

scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
#Categorise housesize and set a baseline

We can perform logistic regression in a straightforward fashion in R. The formula for any logistic regression is

logistic_model <- glm(dependent_variable~independent_variable+independent_variable, data = dataframe, family = binomial)

If we want to include a variable as a Factor then we state that by saying > factor(independent_variable)

logistic_model <- glm(dependent_variable~factor(independent_variable), data = dataframe, family = binomial)

Unlike in STATA this does not automatically generate an output. As a clinical epidemiologist we want a table with the odds ratios, coefficients, standard errors, confidence intervals and the Wald-test P-values.

We get the Odds ratios and Confidence intervals by taking the exponential of the output from the logistic regression model

logistic_model_OR <- cbind(OR = exp(coef(logistic_model)), exp(confint(logistic_model)))

Then we need to combine the Odds ratios and CI values with the raw coefficients and p-values into a single table

logistic_model_summary <- summary(logistic_model)
logistic_model_summary <- cbind(logistic_model_OR, logistic_model_summary$coefficients)

We can then display the complete table of the output

logistic_model_summary

An example of logistic regression

Here we will see if the size of the household is associated with risk of scabies (variable = house_cat), after controlling for age (categorised) and gender.

scabiesrisk <- glm(scabies_infestation~factor(agegroups)+factor(gender)+factor(house_cat),data=scabies,family=binomial())
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint(scabiesrisk)))
## Waiting for profiling to be done...
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
##                                OR      2.5 %    97.5 %   Estimate
## (Intercept)            0.09357141 0.06925691 0.1243857 -2.3690303
## factor(agegroups)0-10  2.20016940 1.61679635 3.0237474  0.7885344
## factor(agegroups)11-20 2.53291768 1.80434783 3.5769136  0.9293719
## factor(gender)male     1.44749159 1.13940858 1.8399368  0.3698321
## factor(house_cat)6-10  1.30521927 1.02664372 1.6624397  0.2663710
## factor(house_cat)10+   1.17003712 0.65654905 1.9900581  0.1570355
##                        Std. Error     z value     Pr(>|z|)
## (Intercept)             0.1492092 -15.8772359 9.110557e-57
## factor(agegroups)0-10   0.1594864   4.9442116 7.645264e-07
## factor(agegroups)11-20  0.1743214   5.3313714 9.747386e-08
## factor(gender)male      0.1221866   3.0267824 2.471718e-03
## factor(house_cat)6-10   0.1228792   2.1677478 3.017788e-02
## factor(house_cat)10+    0.2813713   0.5581076 5.767709e-01

The output gives us the OR associated with each variable, the 95% CI of that OR, and a Wald-Test P-Value for each variable

Trying a logistic regression for yourself

Investigate using logistic regression whether having impetigo (impetigo_odds) is associated with scabies (scabies_infestation) after controlling for age (agegroups) and sex (gender).

## Waiting for profiling to be done...
##                               OR     2.5 %    97.5 %   Estimate Std. Error
## (Intercept)            0.1379817 0.1064948 0.1768205 -1.9806341  0.1292414
## scabies_infestationyes 1.9380091 1.5180396 2.4740877  0.6616612  0.1245435
## agegroups0-10          3.7264517 2.8550721 4.8992194  1.3154565  0.1376324
## agegroups11-20         2.7487745 2.0310065 3.7353394  1.0111552  0.1553069
## gendermale             1.7509457 1.4225953 2.1557118  0.5601561  0.1060015
## house_cat6-10          0.9548564 0.7741093 1.1776160 -0.0461943  0.1069907
## house_cat10+           0.8941528 0.5449317 1.4384328 -0.1118786  0.2468512
##                            z value     Pr(>|z|)
## (Intercept)            -15.3250751 5.199446e-53
## scabies_infestationyes   5.3126904 1.080185e-07
## agegroups0-10            9.5577530 1.203411e-21
## agegroups11-20           6.5106924 7.480520e-11
## gendermale               5.2844143 1.261076e-07
## house_cat6-10           -0.4317601 6.659158e-01
## house_cat10+            -0.4532228 6.503883e-01

You should get a table showing an Odds Ratio of approximately 2 - i.e the odds of impetigo are about twice as high in people with scabies. The p-value suggests this is a highly significant finding.

Comparing two models via a Likelihood Ratio Test

We might want to compare two models to see if overall the factor variable house_cat (size of the house) is significant after controlling for other variables (as opposed to the individual Wald tests for specific values of house_cat).

As in STATA we simply run the model without the Variable we are interested in and do a Likelihood Ratio test

scabiesrisk2 <- glm(scabies_infestation~factor(agegroups)+factor(gender),data=scabies,family=binomial())
#We could have got out all the results of this new model just like we did for the initial Logistic Regression but for the purpose of this demonstration we don't need to
lrtest(scabiesrisk,scabiesrisk2)
## Likelihood ratio test
## 
## Model 1: scabies_infestation ~ factor(agegroups) + factor(gender) + factor(house_cat)
## Model 2: scabies_infestation ~ factor(agegroups) + factor(gender)
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   6 -900.44                       
## 2   4 -902.80 -2 4.7328    0.09382 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that overall there is borderline evidence of an association between household size and scabies risk. This makes sense if we look at the Wald-test for each category, as one house size category was associated and one was not.

Fitting an interaction term

We can easily fit an interaction term to our model. The general syntax is

interaction_model <- glm(dependent_variable~independent_variable*independent_variable, data = dataframe, family = binomial)

scabiesrisk3 <- glm(scabies_infestation~factor(agegroups)*factor(gender),data=scabies,family=binomial())
scabiesrisk3_OR <- exp(cbind(OR= coef(scabiesrisk3), confint(scabiesrisk3)))
## Waiting for profiling to be done...
scabiesrisk3_summary <- summary(scabiesrisk3)
scabiesrisk3_summary <- cbind(scabiesrisk3_OR, scabiesrisk3_summary$coefficients)
scabiesrisk3_summary
##                                                  OR      2.5 %    97.5 %
## (Intercept)                               0.1018100 0.07391201 0.1366901
## factor(agegroups)0-10                     2.6697985 1.81088534 3.9801955
## factor(agegroups)11-20                    2.3655612 1.51556214 3.6962701
## factor(gender)male                        1.6634407 0.93937388 2.8651848
## factor(agegroups)0-10:factor(gender)male  0.7017255 0.37220313 1.3517237
## factor(agegroups)11-20:factor(gender)male 1.1363006 0.56352873 2.3357191
##                                             Estimate Std. Error
## (Intercept)                               -2.2846473  0.1564481
## factor(agegroups)0-10                      0.9820030  0.2004429
## factor(agegroups)11-20                     0.8610153  0.2268275
## factor(gender)male                         0.5088882  0.2831246
## factor(agegroups)0-10:factor(gender)male  -0.3542130  0.3280584
## factor(agegroups)11-20:factor(gender)male  0.1277779  0.3619259
##                                               z value     Pr(>|z|)
## (Intercept)                               -14.6032328 2.678375e-48
## factor(agegroups)0-10                       4.8991654 9.624460e-07
## factor(agegroups)11-20                      3.7959034 1.471068e-04
## factor(gender)male                          1.7974000 7.227214e-02
## factor(agegroups)0-10:factor(gender)male   -1.0797255 2.802644e-01
## factor(agegroups)11-20:factor(gender)male   0.3530499 7.240510e-01

The output gives us:

  1. The Odds Ratio of each factor in the interaction in the baseline group of the other factor: i.e the odds ratio for scabies amongst people aged 0-10 compared to those aged 21+ (the reference group for age), in the baseline group of gender (female)
  2. The interaction terms

As in STATA we can then compare the model with and without the interaction term using a likelihood ratio test to see if the interaction is statistically significant.

#Compare our model containing age & gender (scabiesrisk2) without an interaction term to our third model (scabiesrisk3) which does have an interaction term between age and gender
lrtest(scabiesrisk2,scabiesrisk3)
## Likelihood ratio test
## 
## Model 1: scabies_infestation ~ factor(agegroups) + factor(gender)
## Model 2: scabies_infestation ~ factor(agegroups) * factor(gender)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   4 -902.80                     
## 2   6 -901.15  2 3.3131     0.1908

Remember the null-hypothesis = No interaction. So here the P-value suggests no evidence against the null; so we should not fit this interaction term.

Using Broom

Broom is a package within the tidyverse that makes all of the steps above MUCH easier. It installs when you install tidyverse but it doesn't load by default when you load the tidyverse library.

The general syntax is > logistic_model <- glm(dependent_variable~independent_variable+independent_variable, data = dataframe, family = binomial)

This is identical to fitting the model above. Then we simplify with > logistic_model_summary <- tidy(logistic_model, exponentiate = TRUE, conf.int = TRUE)

Take the model data and put it into a table, exponentiating the coefficients and giving a 95% CI. This replaces all the steps after model fitting with one line.

Because (as you will see below) there is a variant called broom.mixed for mixed/random-effects models we need to specify which version to use so the slight tweak (if both broom and broom.mixed are loaded) is > logistic_model_summary <- broom::tidy(logistic_model, exponentiate =TRUE, conf.int =TRUE)
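As a quick sketch, applying this to the scabiesrisk model fitted above:

scabiesrisk_tidy <- broom::tidy(scabiesrisk, exponentiate = TRUE, conf.int = TRUE)
#One line gives a table (a tibble) with the ORs, standard errors, p-values and 95% CIs
scabiesrisk_tidy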


Matched Case-Control Data

Conditional Logistic Regression

For individually matched case control studies we use Conditional Logistic Regression not normal logistic regression

The syntax for this is:

conditional_model <- clogit(variable_denoting_case_or_control~independent_variable+independent_variable+strata(matching_var), data = dataframe)

Where ‘matching_var’ is a variable saying which Case is matched to which Control

We get the ORs, 95% CIs and everything else exactly the same as for logistic regression.

conditional <- clogit(case~independent_variable+strata(matching_var), data = dataframe)
conditional_OR <- exp(cbind(OR = coef(conditional), confint(conditional)))
conditional_summary <- summary(conditional)
conditional_summary <- cbind(conditional_OR, conditional_summary$coefficients)
conditional_summary

Equally everything else is just like normal logistic regression i.e we can fit multiple models, run Likelihood-Ratio Tests etc
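As an aside, broom should also work here (a sketch, assuming broom is loaded; clogit returns a coxph-type object which broom's tidy knows how to summarise):

conditional_summary <- tidy(conditional, exponentiate = TRUE, conf.int = TRUE)
#ORs (exponentiated coefficients) with 95% CIs in one line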

Poisson and Cox Regression

Sample Dataset 4

For the purpose of Poisson and cox regression we will initially go back to a clean version of our “Ebola dataset”. For ease I quickly repeat below the code needed to implement the data-cleaning steps demonstrated above on the dataset

##This is the ebola dataset again
ebola <- read.csv(file = "mmc1.txt", header = TRUE, sep = ",")

#Lets set 'person_to_person' route of transmission as baseline
ebola$transmission <- relevel(ebola$transmission, "person_to_person")

Setting up our time variables

The package we are using for the Poisson and Cox Regression needs the dates in a particular format to work.

Specifically it needs them in a 'decimal year' i.e 1982.145. If our data isn't already like this then we set it up using the 'cal.yr' command. The cal.yr command wants a date as an input - so remember if your dates are stored as strings and not dates in the raw data then you need to convert them to a date and then to a decimal date. You can do this in one step.

class(ebola$disease_onset)
## [1] "factor"
#We see that the disease_onset variable is currently a factor (i.e just recognised as a bunch of characters), so we need to convert factor > date > decimal date. 
ebola$disease_onset <- cal.yr(as.Date(ebola$disease_onset, format = "%Y-%m-%d"))
ebola$disease_ended <- cal.yr(as.Date(ebola$disease_ended, format = "%Y-%m-%d"))

Declaring our periods of follow-up: Making a lexis model

We need to declare the way time is modelled when doing this kind of analysis. This is like STSET in STATA.

That is we need to specify 1) Entry point 2) Exit point

We may have multiple time scales - for example both Calendar Time and Chronological age. The general syntax is:

lexis_object <- Lexis( 
  entry = list ( name_first_time_scale = variable_setting_that,
                 name_second_time_scale = variable_setting_that),
  exit = list (name_of_time_scale_we_calculate_exit_against = variable_for_exit),
  exit.status = (where_we_define_the_outcome),
  data = data_frame)

For example we might be interested in time in follow-up and chronological age as time scales. So we might do:

ebola_lexis_model <- Lexis(entry = list("calendar_time"=disease_onset), exit = list("calendar_time" = disease_ended), exit.status = status =="died", data = ebola)
## NOTE: entry.status has been set to FALSE for all.
## Warning in Lexis(entry = list(calendar_time = disease_onset), exit = list(calendar_time = disease_ended), : Dropping NA rows with duration of follow up < tol
#We define the beginning of calendar time in the study as the date in the 'disease_onset' variable
#We call time on this scale "calendar_time"
#Time out is on the same timescale as time in - i.e calendar_time
#If people are missing either a time-in or a time-out they get dropped

R is also happy if we just have a duration of follow-up as opposed to a Date of Entry and a Date of Exit. In that case we can just say:

lexis_model <- Lexis(exit = list("calendar_time" = duration), exit.status = died, data = dataset)
##Here we specify the exit time as being the duration of follow-up
##We haven't stated an entry time so R assumes it to be 0 - i.e people were followed up from time 0 for the duration of time specified as the exit time

Key variables in the Lexis model

When we make this model it creates several key variables we need to reference when running our Cox and Poisson models.

'lex.dur' = duration of follow-up
'lex.Xst' = status of the individual at the end of each time block. For example in the above lexis model this is whether or not death occurred in any given time period.

We will reference these variables when we run our poisson and cox regression models.
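A quick sketch to peek at these variables (a Lexis object behaves like a dataframe):

head(ebola_lexis_model[, c("lex.dur", "lex.Xst")])
#Show the follow-up duration and exit status that Lexis created for the first few rows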

Making a rate table

Once we have a lexis model it is easy to see crude rates and rates stratified by a variable

The general syntax is: > rate_table <- rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = stratifying_var)

If we want we can give per 100 or 1,000 years by simply adjusting ‘pyrs’ > rate_table <- rate(lexis_model, obs = lex.Xst, pyrs = lex.dur/1000, print = stratifying_var)

average_rate <- rate(ebola_lexis_model, obs = lex.Xst, pyrs = lex.dur)
#Show the overall rate of death per 1 person-year (because people die fast of ebola and follow-up is short)
rate_by_transmission <- rate(ebola_lexis_model, obs = lex.Xst, pyrs = lex.dur, print = transmission)
#Show the rates of death by transmission route per 1 person-year
average_rate
## 
## Crude rates and 95% confidence intervals:
## 
##    lex.Xst  lex.dur     rate  SE.rate  rate.lo  rate.hi
## 1:     245 5.303217 46.19837 2.951506 40.76096 52.36112
rate_by_transmission
## 
## Crude rates and 95% confidence intervals:
## 
##        transmission lex.Xst   lex.dur     rate   SE.rate  rate.lo
## 1: person_to_person     129 2.8993840 44.49221  3.917321 37.44023
## 2:             both       8 0.2491444 32.10989 11.352561 16.05788
## 3:            other      10 0.1341547 74.54082 23.571876 40.10658
## 4:          syringe      86 1.7686516 48.62461  5.243327 39.36108
## 5:          unknown      12 0.2518823 47.64130 13.752860 27.05568
##      rate.hi
## 1:  52.87245
## 2:  64.20806
## 3: 138.53919
## 4:  60.06829
## 5:  83.88975

Running a simple Poisson Regression

We run a poisson regression essentially the same as a logistic regression. We use the same GLM command as before.

  1. Instead of setting ‘family = binomial’ we need to set ‘family = poisson’

  2. We include an additional variable called offset. This represents the fact that follow-up time is different for each person in the study (i.e it gives you a rate against person-time rather than a number of events / people). This variable is the log of the follow-up time in the lexis-model: ‘log(lex.dur)’

  3. We set the outcome as whether lex.Xst occurred (i.e did the event occur)

The generic formula is > poisson_model <- glm(lex.Xst~offset(log(lex.dur))+independent_variable+independent_variable, data = lexis_dataframe, family = poisson)

#Fit a model for the rate of death by transmission route and sex
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
#The outcome is lex.Xst - whatever we set as the outcome when we made our lexis dataset 
#The offset is the log(lex.dur) - the log of duration of follow-up
#We fit a poisson model 

ebola_ratios <- cbind(HR = exp(coef(ebola_model)), exp(confint(ebola_model)))
## Waiting for profiling to be done...
#Just like in the Logistic model we take exponentials of the coefficients to get our Hazard Ratios
ebola_model_out <- summary(ebola_model) 
ebola_model_out <-cbind (ebola_ratios, ebola_model_out$coefficients)
#We combine this all in to a nice table
ebola_model_out
##                                     HR     2.5 %     97.5 %    Estimate
## (Intercept)                 75.8444260 4.0975490 407.863217  4.32868422
## factor(transmission)both     0.7230170 0.3237964   1.389407 -0.32432256
## factor(transmission)other    1.6052597 0.7574323   2.976279  0.47328555
## factor(transmission)syringe  1.0928077 0.8294145   1.432693  0.08875023
## factor(transmission)unknown  1.0722117 0.5607743   1.861057  0.06972352
## sexfemale                    0.5879981 0.1102686  10.851812 -0.53103158
## sexmale                      0.5845758 0.1086098  10.822539 -0.53686884
##                             Std. Error    z value     Pr(>|z|)
## (Intercept)                  1.0578400  4.0920029 4.276632e-05
## factor(transmission)both     0.3666753 -0.8844951 3.764290e-01
## factor(transmission)other    0.3450006  1.3718400 1.701133e-01
## factor(transmission)syringe  0.1392193  0.6374849 5.238090e-01
## factor(transmission)unknown  0.3033065  0.2298781 8.181865e-01
## sexfemale                    1.0548472 -0.5034204 6.146687e-01
## sexmale                      1.0580754 -0.5074013 6.118733e-01

Using Broom with Poisson and Cox

As before we can simplify this process with Broom.

#Long hand way
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
ebola_ratios <- cbind(HR = exp(coef(ebola_model)), exp(confint(ebola_model)))
## Waiting for profiling to be done...
ebola_model_out <- summary(ebola_model) 
ebola_model_out <-cbind (ebola_ratios, ebola_model_out$coefficients)
ebola_model_out
##                                     HR     2.5 %     97.5 %    Estimate
## (Intercept)                 75.8444260 4.0975490 407.863217  4.32868422
## factor(transmission)both     0.7230170 0.3237964   1.389407 -0.32432256
## factor(transmission)other    1.6052597 0.7574323   2.976279  0.47328555
## factor(transmission)syringe  1.0928077 0.8294145   1.432693  0.08875023
## factor(transmission)unknown  1.0722117 0.5607743   1.861057  0.06972352
## sexfemale                    0.5879981 0.1102686  10.851812 -0.53103158
## sexmale                      0.5845758 0.1086098  10.822539 -0.53686884
##                             Std. Error    z value     Pr(>|z|)
## (Intercept)                  1.0578400  4.0920029 4.276632e-05
## factor(transmission)both     0.3666753 -0.8844951 3.764290e-01
## factor(transmission)other    0.3450006  1.3718400 1.701133e-01
## factor(transmission)syringe  0.1392193  0.6374849 5.238090e-01
## factor(transmission)unknown  0.3033065  0.2298781 8.181865e-01
## sexfemale                    1.0548472 -0.5034204 6.146687e-01
## sexmale                      1.0580754 -0.5074013 6.118733e-01
#Short hand using broom
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
broom_ebola <- tidy(ebola_model, exponentiate =TRUE, conf.int =TRUE)

Lexis Expansions and time as a confounder.

Time (for example age) is a key confounder so we need to be able to control for this. In STATA we would use STSPLIT to do this. We can do the same in R.

We do this using: > expanded_lexis_model <- splitLexis(lexis_model, “time_scale_to_split_on”, breaks = c(break1,break2,break3))

Sample Dataset 5

We need a different dataset for this example. Again the dataset is freely available on the LSHTM Data Repository: "Alcohol outcomes". This is a simulated dataset (a fully-anonymised, tweaked dataset based on a real one) looking at the association between age, alcohol and death.

#Load in our new dataset
alcohol <- read.csv(file = "alcohol_outcomes.csv", header = TRUE, sep=",")

#Check how dates are saved
class(alcohol$timein)
## [1] "factor"
#time is saved as a factor so we need to convert it
alcohol$timein <- cal.yr(as.Date(alcohol$timein, format = "%d/%m/%Y"))
alcohol$timeout <- cal.yr(as.Date(alcohol$timeout, format = "%d/%m/%Y"))
alcohol$timebth <- cal.yr(as.Date(alcohol$timebth, format = "%d/%m/%Y"))
#Setup our lexis model
alcohol_lexis <- Lexis(entry = list("calendar_time"=timein, "age" = timein - timebth), exit = list("calendar_time" = timeout), exit.status = death ==1, data = alcohol)
## NOTE: entry.status has been set to FALSE for all.
#We define the beginning of calendar time in the study as the date in the 'timein' variable
#We call time on this scale "calendar_time"
#We define the beginning of the age-time scale as the difference between when you joined the study and your date of birth. We call this time scale "age"
#Time out is on the same timescale as time in - i.e calendar_time

#Perform a lexis expansion
split_alcohol_lexis <- splitLexis(alcohol_lexis, breaks = c(50,60,70,80), time.scale = "age")
#We say we want to split our lexis_model
#We want to do this on our defined "age" time scale
#We want the breaks to be at the listed points

Now we need to tell R we want to use those Time-Blocks as a variable in our regression. NB Compare to STATA - in STATA we cut the time and make it a new variable in one go with

stsplit newvariable, at(timepoint,timepoint,timepoint)

But each time we want to split on a new timescale in STATA we need to stset the data again and do a new 'stsplit' command.

In R we declare all the timescales initially (in the lexis command), then we split each timeperiod and then make a new variable reflecting each split. The generic syntax is

expanded_model$block_name <- timeBand(expanded_model, “name_of_split”, type = “factor”)

split_alcohol_lexis$age_blocks <- timeBand(split_alcohol_lexis, "age", type = "factor")
#We make a new variable called age_blocks.
#We used the splits we defined on the "age" time-scale for this
#Make each segment of this time-scale an element of a categorical variable
#We then remove any blank 'time-blocks' i.e segments with no observations
#This is done by re-factoring the variable
split_alcohol_lexis$age_blocks <- factor(split_alcohol_lexis$age_blocks)

Running a poisson regression including a time-varying variable

Now we can run the poisson regression with age (in blocks) as a variable just like in STATA.

age_alcohol_model <- glm(lex.Xst~factor(alcohol)+factor(age_blocks)+offset(log(lex.dur)), data = split_alcohol_lexis, family = poisson)
age_alcohol_ratios <- cbind(HR = exp(coef(age_alcohol_model)), exp(confint(age_alcohol_model)))
## Waiting for profiling to be done...
age_alcohol_out <- summary(age_alcohol_model) 
age_alcohol_out <-cbind (age_alcohol_ratios, age_alcohol_out$coefficients)
age_alcohol_out
##                                      HR        2.5 %       97.5 %
## (Intercept)                 0.001758427  0.000698216 3.568313e-03
## factor(alcohol)2            1.583354946  1.287840186 1.943563e+00
## factor(age_blocks)(50,60]   2.788198316  1.304756320 7.235037e+00
## factor(age_blocks)(60,70]   8.358271714  4.051253875 2.125521e+01
## factor(age_blocks)(70,80]  18.434422669  8.856542929 4.712319e+01
## factor(age_blocks)(80,Inf] 49.721819767 18.829170021 1.449072e+02
##                             Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)                -6.343336  0.4088553 -15.514867 2.752023e-54
## factor(alcohol)2            0.459546  0.1049087   4.380437 1.184415e-05
## factor(age_blocks)(50,60]   1.025396  0.4292502   2.388806 1.690321e-02
## factor(age_blocks)(60,70]   2.123252  0.4154268   5.111013 3.204365e-07
## factor(age_blocks)(70,80]   2.914220  0.4189527   6.955964 3.501584e-12
## factor(age_blocks)(80,Inf]  3.906444  0.5098479   7.661979 1.830903e-14
#Now fit a model adjusting for the effect of changing age through follow-up time

Comparing Models

Just like in logistic regression we can compare our two poisson models using 'lrtest'. Remember the two models need to be fitted on the same number of observations. This means both models need to be fitted AFTER the lexis expansion - so in the current example we would compare:

  1. Lexis expanded dataset - controlling for Age & Alcohol
  2. Lexis expanded dataset - controlling for Age only

The LR test would then be whether Alcohol was associated having controlled for Age.

age_model <- glm(lex.Xst~age_blocks+offset(log(lex.dur)), data = split_alcohol_lexis, family = poisson)
#Fit a model on the expanded dataset where only Age is an independent variable
age_model_ratios <- cbind(HR = exp(coef(age_model)), exp(confint(age_model)))
## Waiting for profiling to be done...
age_model_out <- summary(age_model) 
age_model_out <-cbind (age_model_ratios, age_model_out$coefficients)
age_model_out
##                              HR        2.5 %       97.5 %  Estimate
## (Intercept)         0.001908232 7.583858e-04   0.00386644 -6.261578
## age_blocks(50,60]   2.864554178 1.340670e+00   7.43252992  1.052413
## age_blocks(60,70]   9.040064110 4.386843e+00  22.97293591  2.201666
## age_blocks(70,80]  21.271550431 1.025650e+01  54.25717095  3.057371
## age_blocks(80,Inf] 61.056794442 2.323188e+01 177.24621063  4.111804
##                    Std. Error    z value     Pr(>|z|)
## (Intercept)         0.4082441 -15.337829 4.272431e-53
## age_blocks(50,60]   0.4291935   2.452070 1.420369e-02
## age_blocks(60,70]   0.4149555   5.305789 1.121869e-07
## age_blocks(70,80]   0.4174887   7.323242 2.420515e-13
## age_blocks(80,Inf]  0.5075158   8.101825 5.414068e-16
lrtest(age_model,age_alcohol_model)
## Likelihood ratio test
## 
## Model 1: lex.Xst ~ age_blocks + offset(log(lex.dur))
## Model 2: lex.Xst ~ factor(alcohol) + factor(age_blocks) + offset(log(lex.dur))
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   5 -1487.5                        
## 2   6 -1478.1  1 18.72  1.514e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform a likelihood test of the model with Age&Alcohol compared to the model with only Age

If we wanted to test if age was associated we would instead compare:

  1. Lexis expanded dataset - controlling for Age & Alcohol
  2. Lexis expanded dataset - controlling for Alcohol only

Note even though strictly we don't need to expand the dataset for the Alcohol-only model (as it's not a time varying factor), we need the two models to have the same number of observations so it does need to be run against the lexis expanded dataset.

Cox Regression

Cox Regression is done in a very similar fashion to all other models. We declare the outcome and follow-up time slightly differently than in the poisson but otherwise things look very similar!

The generic formula is:

cox_model <- coxph(Surv(lex.dur,lex.Xst)~independent_variable+independent_variable, data = lexis_dataframe)

cox_lexis_model <- coxph(Surv(lex.dur,lex.Xst)~factor(alcohol),data = split_alcohol_lexis)
##Fit a cox model. 
##The initial statement Surv(lex.dur,lex.Xst) says that the duration of follow-up is in lex.dur and the outcome variable in lex.Xst (as outlined above)

cox_lexis_ratios <- cbind(HR = exp(coef(cox_lexis_model)), exp(confint(cox_lexis_model)))
#As every other time we exponentiate the coefficients to get hazard ratios
cox_lexis_out <- summary(cox_lexis_model)
cox_lexis_out <- cbind(cox_lexis_ratios,cox_lexis_out$coefficients)
#And we put everything in a nice table
cox_lexis_out
##                        HR    2.5 %   97.5 %      coef exp(coef)  se(coef)
## factor(alcohol)2 2.237758 1.830073 2.736263 0.8054745  2.237758 0.1026134
##                         z    Pr(>|z|)
## factor(alcohol)2 7.849601 4.17364e-15

Note that the Poisson and the Cox models give very similar estimates of the impact of alcohol on the outcome.

As in Poisson Regression we can compare two Cox models using ‘lrtest’ if we need to. Remember as always that both models must be fitted on a dataset with the same number of observations - i.e. after any lexis-expansion has been performed.
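For example, a minimal sketch comparing the model above with an age-adjusted version (both fitted on the expanded dataset):

cox_age_model <- coxph(Surv(lex.dur,lex.Xst)~factor(alcohol)+factor(age_blocks), data = split_alcohol_lexis)
#The same Cox model but additionally adjusted for age
lrtest(cox_lexis_model, cox_age_model)
#LRT of whether age improves the model having controlled for alcohol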

Kaplan Meier Plots

We can plot very nice KM plots. As before we have declared our data as a Lexis Model. We input the dataset in a similar fashion to our cox-regression but use the survfit command rather than coxph > km <- survfit(Surv(lex.dur, lex.Xst) ~ var_to_stratify_on, data = dataset)

Then we plot the KM Plot > ggsurvplot(km, data = dataset, risk.table = TRUE / FALSE)

km <- survfit(Surv(lex.dur,lex.Xst)~factor(alcohol),data = alcohol_lexis)
##Fit the data
##The initial statement Surv(lex.dur,lex.Xst) says that the duration of follow-up is in lex.dur and the outcome variable in lex.Xst (as outlined above). 
#We want to stratify by alcohol

ggsurvplot(km,data=alcohol_lexis)

#Plot the Kaplan Meier Plot
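If we also want the numbers at risk shown beneath the plot we can use the risk.table option mentioned above:

ggsurvplot(km, data = alcohol_lexis, risk.table = TRUE)
#The same plot with a table of numbers at risk over time added underneath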

Correlated Outcomes

We often have data that is correlated. For example 1) Multiple episodes of flu in a single individual 2) Clustered survey data - i.e. individuals within households

Commonly we use a random-effects model to adjust for this correlation.

We do this using a very similar model to all previous ones.

As before there is a version of BROOM (broom.mixed) to make your life easier. I’ll show you long hand first and then the short-cut.

The general syntax for a clustered Logistic Regression is:

correlated_logistic <- glmer(DEPVAR~INDEPVAR + (1|CLUSTER_VAR), family = "binomial", data = data)
#For example if we have individuals within households then CLUSTER_VAR would be the variable specifying which house they were in

The general syntax for a clustered Poisson Regression (after Lexis setup as before is):

correlated_poisson <- glmer(lex.Xst~INDEPVAR + offset(log(lex.dur))+ (1|CLUSTER_VAR), family = "poisson", data = data)
#For example if we have individuals followed over time with multiple episodes of infection, with one line in our dataframe per period of follow-up, then CLUSTER_VAR would be the variable identifying that person across each block of follow-up

Specifying the Clustering and getting the data out

In both cases the key point in setting up the formula is the addition of: > +(1|CLUSTER_VAR)

If we have nested levels of clustering we can specify this easily:

+(1|first_clustering_level/nested_clustering_level) #For example +(1|province/village) where villages are nested within provinces

When we want to get out our odds ratios and confidence intervals we have to specify things slightly differently than in a non-random-effects model

OR <- cbind(OR = exp(fixef(re_model)),exp(confint(re_model,parm="beta_")))
##Rather than the co-efficients we want the Fixed-Effects for our Odds Ratios
##And we want the confidence intervals specifically for those rather than all the other variables a RE-Model makes

Otherwise everything is the same as all our other models

Saying how many integration points to use

R defaults to 1 integration point for RE-Models. We can change this if needed (more accurate vs more time) > model_name <- glmer(DEPVAR~INDEPVAR +(1|CLUSTERVAR), family = "binomial", nAGQ = X, data = DATA) #Currently you can only set integration points >1 for models with a single level of clustering

Random Effects Logistic Regression

Sample Dataset 6

To look at a random effects logistic regression we are using a (simulated) dataset about trachoma. This is a clustered dataset with many individuals per household. Again the dataset is freely available on the LSHTM Data Repository: “Trachoma Household Survey”. We want to find out if having a latrine is protective against trachoma.

hh_tf <- read.csv (file = "hh_tf.csv", header = TRUE, sep = ",")
#Read in data

re_hh_tf <- glmer(trachoma~factor(latrine)+ (1 | household),family = "binomial", data = hh_tf)
summary(re_hh_tf)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: trachoma ~ factor(latrine) + (1 | household)
##    Data: hh_tf
## 
##      AIC      BIC   logLik deviance df.resid 
##   1922.3   1940.9   -958.1   1916.3     3676 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27997 -0.01801 -0.01775 -0.01742  3.06478 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  household (Intercept) 74.66    8.641   
## Number of obs: 3679, groups:  household, 1761
## 
## Fixed effects:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.0091     0.3434 -23.321   <2e-16 ***
## factor(latrine)1  -1.2450     0.5981  -2.082   0.0374 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## fctr(ltrn)1 0.017
#Run a Random Effects model
#Individuals share a household which is designated by the 'household' variable
#This therefore is the unit of clustering

re_hh_tf_OR <- cbind(OR = exp(fixef(re_hh_tf)),exp(confint(re_hh_tf,parm="beta_")))
## Computing profile confidence intervals ...
re_hh_tf_out <- summary(re_hh_tf)
#We still want our P-Values and raw data just like before

re_hh_tf_out <- cbind (re_hh_tf_OR, re_hh_tf_out$coefficients)
#Combine it all together
re_hh_tf_out
##                            OR        2.5 %       97.5 %  Estimate
## (Intercept)      0.0003324303 0.0001656782 0.0006394189 -8.009080
## factor(latrine)1 0.2879269306 0.0828465249 0.8755127240 -1.245049
##                  Std. Error    z value      Pr(>|z|)
## (Intercept)       0.3434317 -23.320735 2.731632e-120
## factor(latrine)1  0.5981090  -2.081642  3.737522e-02
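
As an aside, if we wanted more integration points (as described earlier) we could refit the same model with nAGQ - a minimal sketch:

re_hh_tf_10 <- glmer(trachoma~factor(latrine)+ (1 | household), family = "binomial", nAGQ = 10, data = hh_tf)
#The same model but with 10 integration points - the estimates should be close to the Laplace (nAGQ = 1) fit above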

Making life easier with broom.mixed

There is a special version of broom for mixed (random) effects models. This doesn’t install or load with tidyverse so we need to install and load it specifically. Otherwise it works similarly.
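For example:

install.packages("broom.mixed")
#Install the package (first time only)
library(broom.mixed)
#Load it for the current session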

As both broom & broom.mixed have a ‘tidy’ command we will need to specify which one to use.

So the general syntax is >FIT MODEL >broom.mixed::tidy(MODEL, exponentiate=TRUE, conf.int=TRUE)

re_hh_tf <- glmer(trachoma~factor(latrine)+ (1 | household),family = "binomial", data = hh_tf)
re_hh_tf_OR <- cbind(OR = exp(fixef(re_hh_tf)),exp(confint(re_hh_tf,parm="beta_")))
## Computing profile confidence intervals ...
re_hh_tf_out <- summary(re_hh_tf)
re_hh_tf_out <- cbind (re_hh_tf_OR, re_hh_tf_out$coefficients)
re_hh_tf_out
##                            OR        2.5 %       97.5 %  Estimate
## (Intercept)      0.0003324303 0.0001656782 0.0006394189 -8.009080
## factor(latrine)1 0.2879269306 0.0828465249 0.8755127240 -1.245049
##                  Std. Error    z value      Pr(>|z|)
## (Intercept)       0.3434317 -23.320735 2.731632e-120
## factor(latrine)1  0.5981090  -2.081642  3.737522e-02
#This is the long hand version

#or with broom.mixed
re_hh_tf <- glmer(trachoma~factor(latrine)+ (1 | household),family = "binomial", data = hh_tf)
re_hh_tf_withBROOM <- broom.mixed::tidy(re_hh_tf, exponentiate=TRUE, conf.int=TRUE)

Random Effects Poisson Regression

Sample Dataset 7

We are using a dataset about trachoma and PCR results - this is taken from a series of surveys in the Gambia.

Again the dataset is freely available on the LSHTM Data Repository: “Trachoma Cohort Study”

For your information: PCR is coded 2 for positive and 1 for negative. Clinical evidence of active trachoma is recorded as RE.Grade == TF

tf_pcr <- read.csv (file = "TF_PCR.csv", header = TRUE, sep = ",")
#Read in data

tf_pcr$timein <- cal.yr(as.Date(tf_pcr$timein, format = "%d-%b-%y"))
tf_pcr$timeout <- cal.yr(as.Date(tf_pcr$timeout, format = "%d-%b-%y"))
#Convert dates in to decimal dates

tf_pcr$RE.Grade <- relevel(tf_pcr$RE.Grade,"N")
#Make a normal clinical examination of the eye the baseline level

tf_pcr_lexis <- Lexis(entry = list("calendar" = timein), exit = list("calendar" = timeout), exit.status = PCR==2, data = tf_pcr)
## NOTE: entry.status has been set to FALSE for all.
#Setup a lexis model

re_tf_pcr <- glmer(lex.Xst~factor(RE.Grade)+offset(log(lex.dur))+(1|ID), family = "poisson", data = tf_pcr_lexis)
#Multiple readings from the same individual
#The ID for the individuals is in the 'ID' variable
#This therefore is the unit of clustering

re_tf_OR <- cbind(HR = exp(fixef(re_tf_pcr)),exp(confint(re_tf_pcr,parm="beta_")))
## Computing profile confidence intervals ...
##Rather than the co-efficients we want the Fixed-Effects for our rate ratios (labelled HR here)
##And we want the confidence intervals specifically for those rather than all the other variables a RE-Model makes

re_tf_pcr_out <- summary(re_tf_pcr)
#We still want our P-Values and raw data just like before

re_tf_pcr_out <- cbind (re_tf_OR, re_tf_pcr_out$coefficients)
#Combine it all together
re_tf_pcr_out
##                           HR     2.5 %   97.5 %   Estimate Std. Error
## (Intercept)        1.5609469 1.1207591 2.080427  0.4452926  0.1563783
## factor(RE.Grade)TF 2.0846403 1.4417488 2.975819  0.7345963  0.1845991
## factor(RE.Grade)TS 0.5650381 0.1418873 1.824576 -0.5708621  0.6397415
##                       z value     Pr(>|z|)
## (Intercept)         2.8475347 4.405929e-03
## factor(RE.Grade)TF  3.9794144 6.908522e-05
## factor(RE.Grade)TS -0.8923325 3.722148e-01
#Individuals with clinical evidence of trachoma (TF) are twice as likely to have a positive PCR result.

Confirming there is clustering

When we run a Random-Effects model we also want to see whether there is or is not evidence of clustering.

STATA does this by looking to see if rho (logistic regression, measure of within cluster variation) or alpha (poisson, measure of within cluster variation) = zero.

To do this STATA runs a likelihood ratio test comparing the model with and without random-effects (where rho/alpha = 0).

We can do the same by fitting a model without clustering and running an LRT comparing it to our RE-model. We can also get the equivalent estimate of rho.

no_re_hh_tf <- glm(trachoma~factor(latrine),family = "binomial", data = hh_tf)

re_hh_tf <- glmer(trachoma~factor(latrine)+ (1 | household),family = "binomial", data = hh_tf)

icc(re_hh_tf)
## 
## Generalized linear mixed model
## 
## Family : binomial (logit)
## Formula: trachoma ~ factor(latrine) + (1 | household)
## 
##   ICC (household): 0.9578
#Give estimate of the intra-cluster correlation == rho

lrtest(re_hh_tf,no_re_hh_tf)
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was
## of class "glmerMod", updated model is of class "glm"
## Likelihood ratio test
## 
## Model 1: trachoma ~ factor(latrine) + (1 | household)
## Model 2: trachoma ~ factor(latrine)
##   #Df   LogLik Df  Chisq Pr(>Chisq)    
## 1   3  -958.15                         
## 2   2 -1188.67 -1 461.05  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Test the null hypothesis that rho = 0 and that there is therefore no evidence of clustering

The difference between the two models is not zero and the p-value strongly suggests that there is evidence of clustering

Introduction to Sample Size Calculations

Suggested by “Chrissy h Roberts”

Below, the basics of a few types of sample size calculation are shown - for example comparing proportions or means, or detecting an effect size of a certain magnitude.

Comparing proportions

The general syntax is

epi.propsize(treat = X, control = Y, power = Z, n = W)
#Specify either n - to compute power - OR power - to compute sample size
#Specify the other as NA

epi.propsize(treat = 0.1, control = 0.2, power = 0.8, n = NA)
## $n.total
## [1] 398
## 
## $n.treat
## [1] 199
## 
## $n.control
## [1] 199
## 
## $power
## [1] 0.8
## 
## $lambda
## [1] 0.5
#Calculate sample size needed to have 80% power to show a difference of 10% assuming a proportion of 10% in the treatment group and 20% in the control group

epi.propsize(treat = 0.1, control = 0.2, power = 0.8, n = NA, sided.test = 1)
## $n.total
## [1] 314
## 
## $n.treat
## [1] 157
## 
## $n.control
## [1] 157
## 
## $power
## [1] 0.8
## 
## $lambda
## [1] 0.5
#As above but a 1-sided test

epi.propsize(treat = 0.1, control = 0.2, power = NA, n = 300)
## $n.total
## [1] 300
## 
## $n.treat
## [1] 150
## 
## $n.control
## [1] 150
## 
## $power
## [1] 0.6808308
## 
## $lambda
## [1] 0.5
#Calculate the power to detect a difference between arms given a sample size of 300

Sample size to detect a given Odds Ratio (case-control studies)

We need to know: 1) the proportion in the control group 2) the Odds Ratio we wish to detect

The general syntax when giving an Odds Ratio to detect is

epi.ccsize(OR = X, p0 = Y, n = Z, power = W)
#Specify either n - to compute power - OR power - to compute sample size
#Specify the other as NA
epi.ccsize(OR = X, p0 = Y, n = Z, power = W, r = R)
#Where r is the ratio of controls to cases

epi.ccsize(OR = 2, p0 = 0.2, n = NA, power = 0.8)
## $n.total
## [1] 344
## 
## $n.case
## [1] 172
## 
## $n.control
## [1] 172
## 
## $power
## [1] 0.8
## 
## $OR
## [1] 2
help("epi.ccsize")
#Calculate sample size required to detect an Odds Ratio of 2 given an expected prevalence of 20% in the control group

epi.ccsize(OR = 4, p0 = 0.3, n = NA, power = 0.8, r =4)
## $n.total
## [1] 110
## 
## $n.case
## [1] 22
## 
## $n.control
## [1] 88
## 
## $power
## [1] 0.8
## 
## $OR
## [1] 4
#Calculate sample size required to detect an Odds Ratio of 4 given an expected prevalence of 30% in the control group and 4 controls for every case

Sample size for detecting a specific prevalence

The general syntax is > epi.simplesize(Py = prevalence_to_find, epsilon.r = variation_around_prevalence, method = "proportion") #epsilon.r is specified as a proportion of Py #So if we want to estimate 10% +/- 2% we would set epsilon.r as 0.2

epi.simplesize(Py = 0.2, epsilon.r = 0.3, method = "proportion")
## [1] 171
#Calculate sample size required to estimate a true prevalence of 20% +/- 6% (an epsilon.r of 0.3 means a precision of 30% of Py)

#We can specify a fixed (finite) population calculation by adding in 'N =' 
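
A minimal sketch, assuming a fixed population of 1000 people (N being the population-size argument):

epi.simplesize(N = 1000, Py = 0.2, epsilon.r = 0.3, method = "proportion")
#As above but for a finite population of 1000 people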

Sample size for comparing means

The general syntax is

power.t.test(delta = DIFF_BETWEEN_MEANS, power = Z, type = "SAMPLE_TYPE")
#Type may be "two.sample", "one.sample", "paired"
#We can additionally specify the anticipated standard deviation, for example
power.t.test(delta = DIFF_BETWEEN_MEANS, power = Z, type = "SAMPLE_TYPE", sd = 1)

OR alternatively (using the same package as the other sample size calculations) > epi.meansize(treat = X, control = Y, n = Z, power = W, sigma = A) #Specify either n - to compute power - OR power - to compute sample size #Specify the other as NA #Sigma is the anticipated Standard Deviation

power.t.test(delta = 2, power = 0.8, type ="two.sample", sd = 1)
## 
##      Two-sample t test power calculation 
## 
##               n = 5.090008
##           delta = 2
##              sd = 1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Calculate sample size needed to have 80% power to show difference of 2 in the mean value between two groups

power.t.test(delta = 2, power = 0.8, type ="paired", sd = 1)
## 
##      Paired t test power calculation 
## 
##               n = 4.220731
##           delta = 2
##              sd = 1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
#Calculate sample size needed to have 80% power to show a difference of 2 in the mean of two paired values - assuming a standard deviation of the differences of 1

epi.meansize(treat = 14 , control = 12, n = NA, power = 0.8, sigma = 1)
## $n.total
## [1] 8
## 
## $n.treat
## [1] 4
## 
## $n.control
## [1] 4
## 
## $power
## [1] 0.8
## 
## $delta
## [1] 2
#Calculate sample size needed to have 80% power to show a difference of 2 between means assuming a standard deviation of 1

Capture-Recapture Population Estimates

We can calculate the predicted true population based on the results of a capture-recapture survey. The general syntax is > epi.popsize(people_seen_first_time, people_seen_second_time, people_seen_both_times)

epi.popsize(1104,1112,920)
## 
##  Estimated population size: 1334 (95% CI 1319 - 1350)
##  Estimated number of untested subjects: 38 (95% CI 23 - 54)
#We saw 1104 people in our first survey
#We saw 1112 people in our second survey
#920 people were seen both times
#Calculate estimated true population and proportion of individuals missed both times

Multiple Imputation

Suggested by “Tom Yates”

I presume as always that you understand the caveats of multiple imputation. The MICE package uses Multivariate Imputation via Chained Equations which is a common approach to multiple imputation.

Sample Dataset 8

I deleted some data from our earlier scabies dataset so we can look at how multiple imputation works.

Again the dataset with missing data is available on the LSHTM Data-Repository: “Scabies Missing Data”

scabies_missing <- read.csv(file = "scabies_results_missing.csv", header = TRUE, sep = ",")

Looking at what data is missing

We can see what data is missing using the ‘md.pattern’ command. The general syntax is:

md.pattern(dataset)

md.pattern(scabies_missing)

##      village impetigo_inactive other_lesions room_cohabitation age
## 1439       1                 1             1                 1   1
## 103        1                 1             1                 1   1
## 81         1                 1             1                 1   1
## 2          1                 1             1                 1   1
## 70         1                 1             1                 1   1
## 3          1                 1             1                 1   1
## 3          1                 1             1                 1   1
## 52         1                 1             1                 1   1
## 4          1                 1             1                 1   1
## 7          1                 1             1                 1   1
## 2          1                 1             1                 1   1
## 63         1                 1             1                 1   0
## 3          1                 1             1                 1   0
## 1          1                 1             1                 1   0
## 4          1                 1             1                 1   0
## 54         1                 1             1                 0   1
## 4          1                 1             1                 0   1
## 2          1                 1             1                 0   1
## 3          1                 1             1                 0   1
## 4          1                 1             1                 0   1
## 1          1                 1             1                 0   1
## 2          1                 1             1                 0   0
## 1          1                 1             1                 0   0
##            0                 0             0                71  74
##      house_inhabitants scabies_infestation gender impetigo_active    
## 1439                 1                   1      1               1   0
## 103                  1                   1      1               0   1
## 81                   1                   1      0               1   1
## 2                    1                   1      0               0   2
## 70                   1                   0      1               1   1
## 3                    1                   0      1               0   2
## 3                    1                   0      0               1   2
## 52                   0                   1      1               1   1
## 4                    0                   1      1               0   2
## 7                    0                   1      0               1   2
## 2                    0                   0      1               1   2
## 63                   1                   1      1               1   1
## 3                    1                   1      1               0   2
## 1                    1                   0      1               1   2
## 4                    0                   1      1               1   2
## 54                   1                   1      1               1   1
## 4                    1                   1      1               0   2
## 2                    1                   1      0               1   2
## 3                    1                   0      1               1   2
## 4                    0                   1      1               1   2
## 1                    0                   0      1               1   3
## 2                    1                   1      1               1   2
## 1                    1                   1      0               1   3
##                     74                  83     96             119 517

Imputing data using Multivariate Imputation via Chained Equations

We can impute our data using the ‘mice’ command. The general syntax is:

imputed_data <- mice(data = dataset, m = number_imputed_sets_to_make, defaultMethod = "method_for_imputing_values", maxit = number_iterations_to_impute_values, seed = 500)

This generates ‘m’ datasets each with slightly different imputed data.

NB Seed is just a number which kicks off the random-number generator. So if you alter the seed number the imputed dataset is going to change.

Specifying which imputation method to use

We don’t actually need to specify defaultMethod unless we want to change how R imputes values.

By default ‘mice’ will use the following methods:
* Numeric values: pmm, predictive mean matching
* Binary data: logreg, logistic regression imputation
* Unordered categorical data with >2 levels: polyreg, polytomous regression
* Ordered categorical data with >2 levels: polr, proportional odds model

We can specify alternative methods:
* norm = Bayesian linear regression

For example we could use Bayesian linear regression for Numeric imputation by stating

defaultMethod = c("norm", "logreg", "polyreg", "polr")
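
Putting that together, a sketch (we don’t actually run this alternative here):

imputed_scabies_norm <- mice(data = scabies_missing, m = 5, maxit = 10, seed = 500, defaultMethod = c("norm", "logreg", "polyreg", "polr"))
#As the call below but using Bayesian linear regression for the numeric variables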

imputed_scabies_data <- mice(data = scabies_missing, m = 5, maxit = 10, seed = 500)

By default MICE uses all values to impute from and imputes all missing values. These settings can be changed if needed
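
For example, a minimal sketch of restricting which variables are used to impute from (here assuming we wanted to stop 'village' being used as a predictor):

pred <- mice(scabies_missing, maxit = 0)$predictorMatrix
#A 'dry run' with no iterations just to extract the default predictor matrix
pred[, "village"] <- 0
#Stop 'village' being used as a predictor for any variable
imputed_custom <- mice(data = scabies_missing, m = 5, maxit = 10, seed = 500, predictorMatrix = pred)
#Impute as before but with the customised predictor matrix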

Checking our imputation

MICE includes a specific graphics package which is set up to work well with imputed data. We can use this to see how reasonable our imputations are.

We can show box and whisker plots to see if our numerical imputed data looks reasonable with

bwplot(imputed_dataset, variable) #R automatically colour codes the raw and imputed datasets differently

bwplot(imputed_scabies_data, age)

We can see distributions also with a density plot

densityplot(imputed_dataset) #R automatically colour codes the raw and imputed datasets differently

We can see if associations still look reasonable

xyplot(imputed_dataset, variable_a~variable_b) #R automatically colour codes the raw and imputed datasets differently
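
Applied to our scabies imputation this might look like the following (a sketch - age and house_inhabitants are simply two numeric variables from the dataset):

densityplot(imputed_scabies_data)
#Density plots comparing observed and imputed values
xyplot(imputed_scabies_data, age ~ house_inhabitants)
#Check the association between age and household size still looks plausible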

Seeing an imputed dataset

We can see an individual imputed dataset with

imputed_set_1 <- complete(imputed_data,1)

We can merge the datasets together - for example into a long dataset

imputed_long <- complete(imputed_data, "long")

If we want to include the original dataset we can do that too

imputed_long <- complete(imputed_data, "long", include = TRUE)

Manipulating an imputed dataset

If we want to apply a data-manipulation such as converting a continuous to a categorical variable or making a new variable we can do this across our datasets.

First of all we convert the imputed data into a standard-dataset that we can work with:

imputed_long <- mice::complete(imputed_data, "long", include = TRUE)
#We have to specify mice::complete as 'tidyr' also has a function called complete, so this lets R know which version to use.
#It's important we include the raw data as this lets us put the data back in to an 'imputed' dataset for the purposes of analysis
#Now we can perform whatever manipulations we want
#Then we convert the manipulated dataset back to an 'imputed' dataset using "as.mids"

coded_imputed_data <- as.mids(imputed_long)

complete_scabies_data <- mice::complete(imputed_scabies_data,"long", include = TRUE)
#Take out imputed data and make a long data-table. Include the original dataset with missing values.


complete_scabies_data$agegroups <- as.factor(cut(complete_scabies_data$age, c(0,10,20,Inf), include.lowest = TRUE, labels = FALSE)) 
#Categorise age into groups across all our imputed datasets
#labels = FALSE assigns a numeric value to each category (i.e. 1,2,3)

complete_scabies_data$agegroups <-relevel(complete_scabies_data$agegroups, ref = 3)
#Set the baseline value

coded_imputed_scabies <- as.mids(complete_scabies_data)
#Convert back to "Imputed" format

Fitting a model across all the imputed datasets

If we want to fit a model across all the different imputed datasets then we can do that! The general syntax is very similar to a normal logistic regression but with a few changes as we are working with multiple imputed datasets and then pooling the results.

imputed_model <- with(imputed_data, glm(attack~ smokes+female+hsgrad, family = binomial()))
#We specify the 'with' command to say we are modelling across each of our imputed datasets

imputed_model_summary <- summary(pool(imputed_model), exponentiate = TRUE, conf.int = TRUE)
#We specify the 'pool' command to now pool the estimates across each of our imputed datasets
#We ask it to exponentiate the estimates and give us the 95% CI all in one go

imputed_model_summary
#Show the summary

#Imputed Data
imputed_scabies_model <- with(coded_imputed_scabies,glm(impetigo_active~factor(scabies_infestation)+factor(gender)+factor(agegroups), family = binomial()))
imputed_scabies_model_summary <- (summary(pool(imputed_scabies_model), exponentiate = TRUE, conf.int = TRUE))
## Warning in process_mipo(z, object, conf.int = conf.int, conf.level =
## conf.level, : Exponentiating coefficients, but model did not use a log or
## logit link function
#Let us also fit this same model to the full-original dataset
impetigorisk <- glm(impetigo_active~factor(scabies_infestation)+factor(gender)+factor(agegroups),data=scabies,family=binomial())
impetigorisk_OR <- exp(cbind(OR= coef(impetigorisk), confint(impetigorisk)))
## Waiting for profiling to be done...
impetigorisk_summary <- summary(impetigorisk)
impetigorisk_summary <- cbind(impetigorisk_OR, impetigorisk_summary$coefficients)

You can see we get very similar estimates of risk factors for impetigo from both our imputed dataset and our original dataset

imputed_scabies_model_summary
##                               estimate std.error  statistic        df
## (Intercept)                  0.1284172 0.1321029 -15.536910 153.46067
## factor(scabies_infestation)1 1.9602606 0.1298640   5.182940 507.53928
## factor(gender)male           1.7469015 0.1090545   5.115273 787.60076
## factor(agegroups)1           3.8253644 0.1502099   8.931858 168.01825
## factor(agegroups)2           2.9491067 0.1759415   6.146941  82.66736
##                                   p.value      2.5 %    97.5 %
## (Intercept)                  0.000000e+00 0.09891984 0.1667106
## factor(scabies_infestation)1 2.777999e-07 1.51882984 2.5299883
## factor(gender)male           3.938015e-07 1.41025669 2.1639073
## factor(agegroups)1           0.000000e+00 2.84371631 5.1458766
## factor(agegroups)2           1.254234e-09 2.07828042 4.1848204
#Show the result of logistic regression on the imputed dataset
impetigorisk_summary
##                                       OR     2.5 %    97.5 %   Estimate
## (Intercept)                    0.1345342 0.1057763 0.1691508 -2.0059369
## factor(scabies_infestation)yes 1.9325497 1.5143192 2.4661511  0.6588402
## factor(gender)male             1.7577258 1.4287526 2.1630912  0.5640208
## factor(agegroups)0-10          3.7025777 2.8401712 4.8620068  1.3090292
## factor(agegroups)11-20         2.7347978 2.0226139 3.7127619  1.0060575
##                                Std. Error    z value     Pr(>|z|)
## (Intercept)                     0.1196454 -16.765690 4.349338e-63
## factor(scabies_infestation)yes  0.1243502   5.298267 1.169072e-07
## factor(gender)male              0.1057719   5.332426 9.690930e-08
## factor(agegroups)0-10           0.1370214   9.553466 1.254279e-21
## factor(agegroups)11-20          0.1548162   6.498398 8.117993e-11
#Show the result of logistic regression on the original dataset

Comparing two imputed datasets

When we want to compare two nested models derived from imputed datasets we can do so using ‘pool.compare’

pool.compare(model_1, model_2, data = NULL, method = "Wald")
#This is similar to performing an LRT on two normal models - but in the circumstance of pooled data this approach should be used.
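
A minimal sketch using our scabies models (assuming we also fit a smaller model nested within the full one):

imputed_scabies_small <- with(coded_imputed_scabies, glm(impetigo_active~factor(scabies_infestation), family = binomial()))
#A nested model with scabies infestation as the only independent variable
pool.compare(imputed_scabies_model, imputed_scabies_small, method = "Wald")
#Wald-type comparison of the full model against the nested one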

Intra-class-correlation Co-efficients and Design Effects

Suggested and with help from “Peter MacPherson” and “Daniel Parker”

Statistically there are lots of different ways to calculate an Intra-Class-Correlation coefficient.

Below I provide an outline of a few options. Which one is appropriate may depend on why you need an ICC.

Survey Data and Design Effects

If we have used a complex survey design then we can declare the survey design and then calculate the design effect.

We declare the survey design using > survey_dataset <- svydesign(id=~cluster_var,data=dataset)

We can then calculate the frequency of our outcome (say presence of Trachoma) adjusting for clustering and then get out the design effect of our data compared to simple random sampling with replacement. > svymean(~outcome, survey_dataset, proportion = TRUE, deff = "replace")

#Lets look at our scabies survey again
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")

#Declare the survey design
survey_scabies <- svydesign(id=~Village, data = scabies)
## Warning in svydesign.default(id = ~Village, data = scabies): No weights or
## probabilities supplied, assuming equal probability
svymean(~scabies_infestation,survey_scabies,proportion = TRUE, deff = "replace")
##                            mean       SE   DEff
## scabies_infestationno  0.808176 0.010242 1.2903
## scabies_infestationyes 0.191824 0.010242 1.2903
#Work out the design effect (DEff) of scabies adjusting for clustering at the village level

Survey Data and ICC

If we have used a complex survey design then we can also get out an ICC and then given the cluster size calculate the design effect.

The general syntax is > ICCest(group_var,outcome_var,data=dataset)

It expects outcome_var to be coded numerically

scabies$hasscabies <- 0
scabies$hasscabies[scabies$scabies_infestation == "yes"] <- 1
#Convert our yes/no variable to 0 / 1 variable as expected by ICCest

ICCest(Village,hasscabies, data = scabies)
## Warning in ICCest(Village, hasscabies, data = scabies): 'x' has been
## coerced to a factor
## $ICC
## [1] 0.004087404
## 
## $LowerCI
## [1] -0.0009097985
## 
## $UpperCI
## [1] 0.02560099
## 
## $N
## [1] 10
## 
## $k
## [1] 185.7484
## 
## $varw
## [1] 0.1545527
## 
## $vara
## [1] 0.0006343122
#Gives the ICC for scabies and the mean village size.
#ICC of 0.004087404
#K (average cluster size) 185
#So the Design Effect = 1 + (185-1)*0.004087404
1 + (185-1)*0.004087404
## [1] 1.752082
#Which is broadly similar to the output of the previous estimate

Getting an ICC from a Random-Effects Model

Alternatively in some situations you might want to fit a random-effects model and then calculate the ICC from the model.

The command for this is just: > icc (re_model)

re_village_scabies <- glmer(scabies_infestation~ + (1 | Village),family = "binomial", data = scabies, nAGQ = 10)
#Run a model clustered at village level. Ten integration points.
icc(re_village_scabies)
## 
## Generalized linear mixed model
## 
## Family : binomial (logit)
## Formula: scabies_infestation ~ +(1 | Village)
## 
##   ICC (Village): 0.0034
#Get out our ICC
#You can then use the ICC to calculate the design effect using standard formulas
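
For example, a quick sketch plugging the ICC above and the mean cluster size from the previous section (k of roughly 185) into the usual design effect formula:

1 + (185-1)*0.0034
## [1] 1.6256
#Broadly in line with the design effects estimated above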

Meta-analysis

Meta-analysis of proportions

The general syntax is > metaprop(Number_Events, Number_Observations)

metaprop(c(267,175,219,117),c(406,329,223,137),method="GLMM")
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = "FE", : Cannot invert Hessian for saturated model.
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = method.tau, : Cannot invert Hessian for saturated model.
##   proportion           95%-CI %W(fixed) %W(random)
## 1     0.6576 [0.6092; 0.7037]        --         --
## 2     0.5319 [0.4764; 0.5869]        --         --
## 3     0.9821 [0.9547; 0.9951]        --         --
## 4     0.8540 [0.7836; 0.9085]        --         --
## 
## Number of studies combined: k = 4
## 
##                      proportion           95%-CI  z p-value
## Fixed effect model       0.7105 [0.6829; 0.7366] --      --
## Random effects model     0.8330 [0.5400; 0.9550] --      --
## 
## Quantifying heterogeneity:
## tau^2 = 2.1010; H = 9.04; I^2 = 98.8%
## 
## Test of heterogeneity:
##       Q d.f.  p-value             Test
##      NA    3       --        Wald-type
##  187.20    3 < 0.0001 Likelihood-Ratio
## 
## Details on meta-analytical method:
## - Random intercept logistic regression model
## - Maximum-likelihood estimator for tau^2
## - Logit transformation
## - Clopper-Pearson confidence interval for individual studies
#Calculate the weighted proportion of events across four studies
#Method = GLMM specifies that a random-intercept should be used
meta_data <- metaprop(c(267,175,219,117),c(406,329,223,137),method="GLMM", comb.fixed = FALSE)
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = "FE", : Cannot invert Hessian for saturated model.
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = method.tau, : Cannot invert Hessian for saturated model.
#As above but don't give a fixed-effect estimate and store the results so we can use them in our plot

meta_forest <- forest(meta_data)

#Draw a forest plot of our meta-analysis of proportions

GGPLOT2

Basic elements of a GGPLOT

The GGPLOT package makes much better graphs than the basic R package. This is a very brief introduction to the syntax of GGPLOT2.

Everything in ggplot combines a similar group of elements to make its graphs. In any plot we specify 1) Dataset 2) A Co-ordinate system 3) How to show the data points (ggplot calls these ‘geoms’)

We add elements on top of one another to build up the plot. The first two elements are given together via the syntax > plot <- ggplot(dataset, aes(co-ordinate_system)) #Make something called plot. Take data from ‘dataset’ and define the way to show the data as stated by the co-ordinate system.

There are a variety of co-ordinate systems. When our plot involves a single variable (i.e. age as a histogram) then we need to specify x =

plot <- ggplot (scabies, aes(x = age)) #Look in the scabies dataframe - we are plotting age along the X-Axis

If we want to stratify our plots by a factor we can do so by telling GGPLOT with ‘fill’

plot <- ggplot (scabies, aes(x = age, fill = gender)) #Stratify by Gender - i.e. draw two histograms with different colours based on the variable gender

When our plot involves two variables (i.e. a scatter plot or a box-plot) then we specify x = and y =

plot <- ggplot (scabies, aes(x = house_inhabitants, y = room_cohabitation)) #Look in the scabies dataframe - we are plotting the number of people in the house along the X-Axis and the number of people sharing a room on the Y-Axis

Specifying the type of graph - ‘geoms’

We then tell GGPLOT how to render the data - this is the geom.

Some common geoms are: geom_point = scatterplot; geom_histogram = histogram; geom_boxplot = boxplot

We can specify specific values for the data-points (“geom”) like colour: > + geom_point(aes(colour = factor(gender))) #Add to our plot by drawing each value as a point, i.e. make a scatter-plot #Set the colour of each datapoint based on the factor variable gender

Labels, Legends etc

We can add labels using: > + labs (title = “Graph Title”, x = “X-Axis Label”, y = “Y-Axis Label”) #Add to our plot by adding labels for axis etc

Splitting GGPLOT code across multiple lines

Sometimes to make the code easier to follow we split it across several lines. For example

plot <- ggplot (scabies, aes(x = house_inhabitants, y = room_cohabitation)) +
  geom_point(aes(colour = factor(gender))) +
  labs(title = "House size and Room Cohabitation", x = "House size", y = "Room size")
#Note we put a PLUS sign at the end of each line to tell R that the code is continuing

Example Plots with GGPLOT

Here are some examples similar to the graphs we drew with basic R earlier. The aim is just to give a feel for how the plots are constructed.

GGPLOT Scatterplot

gg_scatter <- ggplot (scabies, aes(x = house_inhabitants, y = room_cohabitation)) +
#Two variable House size and room cohabitation
  
#NB I'm saving the plot under the name gg_scatter by saying gg_scatter <- X,Y,Z
  geom_point(aes(colour = factor(gender))) +

#I'm drawing a scatter plot and colouring dots based on gender  
  labs(title = "House size vs Number people sharing a room", x ="House Size", y = "People sharing a room")

#I'm giving some labels to the graph

gg_scatter

#Show the gg_scatter graph

GGPLOT Histogram

gg_histo <- ggplot (scabies, aes(x = age)) +
  
#I have a single variable age
geom_histogram() +

#With which I will draw a histogram  
labs(title = "Distribution of age in study", y = "No. participants", x = "Age (yrs)")

#I will give some axis and graph labels
gg_histo
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GGPLOT Boxplot

gg_boxplot <- ggplot (scabies, aes(x = gender, y = age)) +

#I have two variable Gender and Age
  geom_boxplot(aes(colour = factor(gender))) +

#I am drawing a box plot and colouring the boxes by gender
  labs(title = "Distribution of age by gender", y = "Age", x = "Gender")

#I have some axis labels
gg_boxplot

Examples of powerful plotting with GGPLOT

We can do really fancy things with GGPLOT like make multiple graphs in a single panel. We declare these using one of the “facet” options. For example we could draw age distribution histograms by village.

library(ggplot2)
gg_histo_by_village <- ggplot (scabies, aes(x = age, fill = gender)) +

#I have only one co-ordinate x = age
#Stratifying on = gender; i.e fill in  colour based on  value in gender
  geom_histogram(breaks = seq(0,80, by = 2)) +

#I'm plotting a histogram. I've divided the data between 0 and 80 into blocks 2 years wide.
  labs(title = "Distribution of age in study", y = "No. participants", x = "Age (yrs)")+

#I've given some labels
  facet_wrap("Village")

#I've broken the data down in to several facets based on the village
gg_histo_by_village