v0.1: Initial draft
v0.2: Datasets made open-source, multiple imputation, intra-class correlation coefficients & sample size calculations
v0.3: Some more data manipulations added, handling survey data to estimate design effects
v0.4: Updates on Kaplan-Meier plots, ggplot styling, intro to meta-analysis including meta-analysis of proportions
0.5: I learnt Broom and I am very happy
R is a powerful, free and open-source piece of software for statistical analysis.
Unfortunately my experience of trying to learn R has been that materials are not always provided in a format that addresses the everyday problems of clinical epidemiology.
The aim of this guide is to teach some basic R skills so that you can:
a) Load, clean and manipulate a dataset
b) Do some descriptive statistics and basic statistical tests
c) Do logistic regression
d) Do Poisson & Cox regression
e) Handle correlated outcomes
f) Analyse matched case-control data
g) Do sample size calculations
h) Do multiple imputation with chained equations
We will use real datasets as we go along.
Broadly, the material here aims to show you how to do most of the analysis approaches one might learn in the LSHTM MSc modules Statistical Methods in Epidemiology & Advanced Statistical Methods in Epidemiology (or equivalent) in R.
Not everything is done identically but the key elements should come out!
The aim of this guide is not to teach you statistics. I assume that you know which tests are appropriate in which situations. The aim is to show how these tests can be implemented in R and give an output with which a clinical epidemiologist is familiar.
A key point in R is that (unlike STATA) many of the analysis tools need to be specifically loaded into the software.
Think of these as modules designed for specific tasks. Each module is called a package. The strength of R is the wide variety of packages to address a huge range of analysis approaches. It can make R frustrating though as there can be multiple ways to approach the same problem.
The approaches and packages used here are ones I learn and felt comfortable with - there may be better ways to do things using other packages.
I always use R within “R-Studio”. When you open R Studio it opens R but makes the layout nicer - more like STATA, with boxes for current code, datasets, plots etc.
The first time you set up R you need to install the packages
The syntax is: > install.packages("packagename") #NB Note use of ""
For example
#Data Manipulation
install.packages("tidyverse")
#tidyverse contains many packages for data manipulation, graphs etc
install.packages("splitstackshape")
#Really nice syntax for expanding datasets
#Basic Epi
install.packages("epiR")
#Some useful epitools especially 2*2 tables and stratified MH Odds
install.packages("lmtest")
#for doing likelihood ratio tests between models
#Poisson & Cox Regression & Survival Analysis
install.packages("Epi")
#Tools for Lexis Models
install.packages("popEpi")
#Extra tools to go with 'Epi' and make nice rate tables
install.packages("survival")
#for survival analysis, Kaplan-Meier plots and Cox regression
install.packages("survminer")
#Really nice KM plots!
#Random-Effects Models
install.packages("lme4")
#For fitting random-effects models
install.packages("sjstats")
#For intraclass correlation coefficients from Random-effects models
#Multiple Imputation
install.packages("mice")
#For multiple imputation
install.packages("arsenal")
#Really nice summary tables
#Complex survey data analysis
install.packages("survey")
#For handling complex survey data
install.packages("ICC")
#Alternative methods for ICC calculation from survey data
#Meta-analysis
install.packages("meta")
install.packages("metafor")
#Make saving your models easier
install.packages("broom.mixed")
Once the packages are installed in R you need to load them so they can be used. This ‘loading’ step needs to be done each time you start R. The syntax is: > library(packagename) #NB Note no use of ""
You don’t need to load all the packages every time, especially if you aren’t using them in a given analysis. But many packages you might use frequently - especially some of the data manipulation ones.
library(tidyverse)
library(lmtest)
library(survival)
library(survminer)
library(epiR)
library(Epi)
library(lme4)
library(popEpi)
library(mice)
library(sjstats)
library(splitstackshape)
library(arsenal)
library(survey)
library(ICC)
library(meta)
library(metafor)
library(broom)
library(broom.mixed)
You should set a working directory - this tells R where any files you might be looking for can be found without you needing to write the full file path
The syntax is: > setwd("directory_where_the_files_are")
setwd("~/Dropbox/Work/EPI/Learning R/")
If you plan to try the analyses below then you need to set your own working directory and put the dataset files inside that directory
From this point on all the code should work if everything is in the same directory
We can refer to a variable within a dataset using the $ symbol. For example mosquito$species refers to the species variable within the mosquito dataset.
In general in R we put the output of whatever we do into a new object for storage. For example in STATA we might just say > logistic died i.agegroup
This would run the logistic regression and give an output straight away
But in R we would say > model <- glm(died~factor(agegroup), family="binomial", data = mosquito)
This would run the logistic regression and save it in an object called model
The syntax for any procedure where we save data to a place is name_of_place_I_am_saving <- thing_I_am_saving_to_this_object
or if saving to a new or existing variable in a dataset my_dataset$variable_name <- thing_I_am_saving_to_that_variable
A key point in R is the ability to have multiple separate datasets available at any one time - unlike STATA, for example, where only a single dataset can be available.
Just like in STATA we can annotate our code. If we put a # symbol then R will treat the rest of that line as annotation
some code here #This code is doing something important
All the datasets used are available via the LSHTM Data Repository at “R For Clinical Epi”
Individual links to datasets are provided as needed.
Next we want to be able to load in a dataset. For the purpose of this tutorial I presume that most of the datasets you will be using are some form of comma separated values file
The generic syntax is: > name_of_data <- read.csv(file = "filename.csv", header = TRUE, sep = ",")
header = TRUE: the top row contains variable names. sep = ",": this is comma-delimited data.
We can see the loaded data with > View(name_of_data)
The first dataset we are using is a cross-sectional survey of scabies. This dataset is freely available on the LSHTM Data Repository: “Scabies-Dataset”
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
#Remember we set the working directory already. As the file is inside that directory we can just give the file name.
#We can see the data with: View(scabies)
A strength of R is that you can have more than one dataset available at a time. The second dataset we are loading is based on the original outbreak of Ebola in the DRC. This is also freely available on the LSHTM Data Repository: “Ebola-Dataset”
ebola <- read.csv(file = "mmc1.txt", header = TRUE, sep = ",")
#Now we have two separate datasets available!
I’m loading both datasets so I have examples of the different types of variables etc for the exercises below
We can reference a subset of a dataset based on the values of a specific variable
The syntax is > [data$variable == something]
For example:
[survey$gender == "male"]
#This would apply our rule only to values where gender = male
mean(scabies$age[scabies$gender == "male"])
#This would work out the mean age of those individuals where gender is male.
Referencing a subset of data is obviously valuable for making new variables based on existing values, plotting data and performing statistical tests.
In general we can either alter an existing variable OR make a new variable by manipulating an old one.
The syntax for altering an existing variable is: > data$variable <- thing_we_want_do
The syntax for making a new variable is the same: > data$new_variable <- thing_we_want_new_variable_to_be
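For example (age_months here is just a hypothetical new variable, assuming the age variable is recorded in years):
scabies$age_months <- scabies$age * 12
#Make a new variable holding age in months from the existing age variable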
If we want to get rid of some variable we don’t need we can do this with select. The general syntax is > data <- select(data, -VAR_C) #Drop these vars
scabies <- select(scabies, -scabies_distribution_region, -scabieslesions_number_range, -impetigolesions_number_range)
We can get rid of rows of data we don’t need. The general syntax is > data <- filter(data, how_to_filter)
ebola_men <- filter(ebola, sex =="male")
#Make a copy of the dataset with only the men in it
We often want to group continuous variables like age into groups. We can do this by “cutting” the variable. We need to tell R the new variable is a categorical/factor variable.
The generic syntax is: > data$variable <- as.factor(cut(data$variable, c(cut_here,cut_here,cut_here)))
In general we also want to include the lowest value, i.e. recode from the value of the first cut to the second
data$variable <- as.factor(cut(data$variable, c(cut_here,cut_here,cut_here), include.lowest=TRUE))
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
#Make a new variable called agegroups. Do this by splitting people based on the age variable in to groups 0>10, 11>20, 21>Infinity
#So the first number tells you where to start from
#Then everything between that number and the next break; inclusive will be in the next group etc
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
#Make a new variable of the house size.
#Check this looks correct
table(scabies$house_cat, scabies$house_inhabitants)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 30
## 0-5 5 39 153 288 389 0 0 0 0 0 0 0 0 0 0 0 0
## 6-10 0 0 0 0 0 349 203 184 101 101 0 0 0 0 0 0 0
## 10+ 0 0 0 0 0 0 0 0 0 0 45 32 8 3 2 4 2
Sometimes we need to convert a variable (for example one stored as a factor) into a numeric one. We can do this using > as.numeric()
ebola$status <- as.numeric(ebola$status)
#Convert each value of the variable status to a numeric one
Sometimes we want to recode the values of a categorical variable. We can do this using the ‘recode’ command
dataset$variable <- recode(dataset$variable, OldCat = "NewCat", OldCat2 = "NewCat2")
ebola$transmission <- recode(ebola$transmission, syringe = "needle")
##This is obviously not a very exciting recode but you get the idea!
When we have a categorical variable it is key we know which value R uses as the reference - for example when doing logistic regression.
We can set this with the ‘relevel’ function
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
#Make 0-5 household size the baseline group
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
#Make age 21+ the baseline group
ebola$transmission <-relevel(ebola$transmission, ref = "person_to_person")
#Make person to person transmission the baseline group
Sometimes when dates are imported from a file R doesn’t realise it is a date as opposed to just a string of characters. We can check this easily.
class(ebola$disease_onset)
## [1] "factor"
#We see that R currently thinks every date is a different level of a categorical variable, which is not very helpful
We can convert strings of characters stored as dates into an ‘R’ date (which in the background is stored as a numeric value - days since 1970).
The general syntax is: > data$date <- as.Date(data$date, format = "")
There are a range of values for format. Common ones include:
%a - Abbreviated weekday: Mon
%A - Full weekday: Monday
%b - Abbreviated month: Nov
%B - Full month: November
%d - Day of the month: 01
%m - Month: 07
%y - Two digit year: 98
%Y - Four digit year
So given 2017-01-24 you might write format = "%Y-%m-%d"
If you need to turn the ‘R’ date into a strict number (for some calculations) you can simply specify: > data$date <- as.numeric(as.Date(data$date, format = ""))
#We can turn it into a date
ebola$disease_onset <- as.Date(ebola$disease_onset, format = "%Y-%m-%d")
ebola$disease_ended <- as.Date(ebola$disease_ended, format = "%Y-%m-%d")
class(ebola$disease_onset)
## [1] "Date"
#We see that R now recognises this as a date variable.
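For example, if we needed the onset date as a plain number (onset_days is a hypothetical new variable):
ebola$onset_days <- as.numeric(ebola$disease_onset)
#Days since 1970-01-01, which is how R stores dates internally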
The tidyr package has a number of commands for reshaping data. To go from long to wide we use:
spread(DATAFRAME, VARNAMES, VARVALUES)
#Take the data in dataframe DATAFRAME
#Make a column for each unique value of VARNAMES
#Populate each new column with the corresponding value from VARVALUES
To go from wide to long we use:
gather(DATAFRAME, NEWVAR, NEWVALUES, BETWEENTHISVAR:TOTHISVAR)
#Take the data in dataframe DATAFRAME
#Make a new variable called "NEWVAR"
#The values in NEWVAR are taken from the names of the original columns specified - currently BETWEENTHISVAR:TOTHISVAR
#Make a new variable called "NEWVALUES"
#The values in NEWVALUES are the corresponding values from columns BETWEENTHISVAR, TOTHISVAR etc
Sometimes we also want to combine two columns into a single value. We use:
unite(DATAFRAME, TEMPVAR, VAR1, VAR2)
#Make a new variable TEMPVAR
#Make the variable by combining VAR1 and VAR2
df <- data.frame(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 7, 6, 8, 6, 9),
B=c(6, 7, 8, 5, 6, 7))
#Here we have a dataset where each student is on a different row for each month
#The students took two tests A and B.
#For each student/month combination we have a value for A and a value for B
df2 <- gather(df,test,score,A:B)
#Make a new datatable df2
#Have a column called "test".
#This will have the value either A or B as these are the names of the columns we specified.
#Have a column called "score".
#This will have the value previously in column A or B respectively for each row
df2
## month student test score
## 1 1 Amy A 9
## 2 2 Amy A 7
## 3 3 Amy A 6
## 4 1 Bob A 8
## 5 2 Bob A 6
## 6 3 Bob A 9
## 7 1 Amy B 6
## 8 2 Amy B 7
## 9 3 Amy B 8
## 10 1 Bob B 5
## 11 2 Bob B 6
## 12 3 Bob B 7
#Now we have a single row for each combination of month/student/test
#Their score is in the score column
df3 <- spread(df2,test,score)
#Make a new datatable df3
#Make a column for each unique value in the test variable.
#Name each of these columns based on that unique value
#Under each column put the corresponding value that was in the score column
df3
## month student A B
## 1 1 Amy 9 6
## 2 1 Bob 8 5
## 3 2 Amy 7 7
## 4 2 Bob 6 6
## 5 3 Amy 6 8
## 6 3 Bob 9 7
#We are back to where we started!
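To illustrate unite, a minimal sketch combining the month and student columns of df3 (month_student is a hypothetical column name):
df4 <- unite(df3, month_student, month, student)
#Make a new datatable df4 with a single column month_student
#By default the two values are joined with an underscore e.g. "1_Amy"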
Sometimes we have multiple rows per individual/group and want to number these. This is equivalent to the STATA command bysort VAR: gen n=_n
The general syntax is: > dataset$n <- ave(1:length(dataset$grouping_var), dataset$grouping_var, FUN = seq_along)
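A minimal sketch using the df dataset from the reshaping example above:
df$n <- ave(1:length(df$student), df$student, FUN = seq_along)
#Number the rows 1,2,3... within each student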
Sometimes we want to expand panel data. For example our data might look like
## fact count
## 1 a 1
## 2 b 2
## 3 c 3
But we might want
## fact
## 1 a
## 2 b
## 3 b
## 4 c
## 5 c
## 6 c
We can use the ‘expandRows’ command. The general syntax is
expandRows(dataframe, X, drop =, count.is.col = )
#drop can be specified TRUE or FALSE. It says whether or not to keep the column that has the count data in. The default is TRUE.
#count.is.col says whether X is the number of the column holding the count values, or if FALSE that all rows should be expanded X times
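For the example below I assume a small data frame dt matching the table shown above:
dt <- data.frame(fact = c("a","b","c"), count = 1:3)
#A minimal example dataset with a count column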
dt3 <- expandRows(dt, 2)
#Expand the original datatable. Replicate each row by the value in column 2.
dt3
## fact
## 1 a
## 2 b
## 2.1 b
## 3 c
## 3.1 c
## 3.2 c
We should do some exploratory basic analyses before we proceed to more complex regression analysis
We might want to get a mean, median, maximum etc. For a single variable we can use ‘summary’
summary(scabies$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 12.00 18.79 30.00 83.00
#Or just
mean(scabies$age)
## [1] 18.79193
If we want to summarise by a categorical variable we can do this nicely using both stat.table & tableby.
Why use one or the other? stat.table allows you to specify exactly what statistic you want summarised ("I want the maximum value").
tableby lets you quickly generate the most common statistics (mean, counts) for lots of variables.
The general syntax for stat.table is:
stat.table(grouping_factor/s, statistics_to_show, dataset)
stat.table(gender,mean(age), data = scabies)
## -------------------
## gender mean(age)
## -------------------
## female 22.18
## male 14.15
## -------------------
#Tabulate, by gender, the mean age from the scabies dataset
stat.table(gender,list(mean(age),median(age)), data = scabies)
## -------------------------------
## gender mean(age) median(age)
## -------------------------------
## female 22.18 15.00
## male 14.15 9.00
## -------------------------------
#Tabulate, by gender, the mean and median age from the scabies dataset
The general syntax for tableby is:
tableby(groupingvar~var_1+var_n, data = dataset)
#tableby will give you the mean, SD, IQR and range of numeric variables
#tableby will give you the count and % for each level of categorical variables
summary_data <- tableby(gender~age+scabies_infestation,data=scabies)
summary(summary_data)
##
##
## | | female (N=1103) | male (N=805) | Total (N=1908) | p value|
## |:---------------------------|:---------------:|:---------------:|:---------------:|-------:|
## |**age** | | | | < 0.001|
## | Mean (SD) | 22.179 (17.500) | 14.152 (16.232) | 18.792 (17.429) | |
## | Range | 0.000 - 83.000 | 0.000 - 82.000 | 0.000 - 83.000 | |
## |**scabies_infestation** | | | | < 0.001|
## | no | 931 (84.4%) | 611 (75.9%) | 1542 (80.8%) | |
## | yes | 172 (15.6%) | 194 (24.1%) | 366 (19.2%) | |
#Tabulate, by gender, summary statistics for age and scabies
We might want to know if the mean of a value like age varies between two groups.
t.test(scabies$age[scabies$gender=="male"],scabies$age[scabies$gender=="female"])
##
## Welch Two Sample t-test
##
## data: scabies$age[scabies$gender == "male"] and scabies$age[scabies$gender == "female"]
## t = -10.32, df = 1801.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.552509 -6.501593
## sample estimates:
## mean of x mean of y
## 14.15155 22.17860
This presents the mean age of each group, the 95% confidence interval for the difference and the p-value. We can see that the groups do have different mean ages.
We might also want to compare proportions between groups. The syntax is > prop.test(numerator, denominator)
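For example, using the counts from the tableby output above (172/1103 females and 194/805 males with scabies):
prop.test(c(172,194), c(1103,805))
#Compare the proportion with scabies between females and males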
We might want to tabulate categorical variables
table(data$dependent_variable, data$independent_variable)
table(impetigo = scabies$impetigo_active, scabies = scabies$scabies_infestation)
## scabies
## impetigo no yes
## no 1098 187
## yes 444 179
#Simple table
We might want to do a chi-square and calculate an odds ratio
We can do this using > epi.2by2
The data needs to be laid out as:
Has Disease No Disease
Exposed
Not-Exposed
The generic syntax is > epi.2by2(table(scabies$dep_var, scabies$indep_var))
#Let's look at how table aligns the data
table(scabies$scabies_infestation, scabies$impetigo_active)
##
## no yes
## no 1098 444
## yes 187 179
#See that because 'no' is the 'base' level the table is laid out
# No Disease Has Disease
# Not-exposed
# Exposed
#This is dependent on how your data is coded so you need to check this before using epi.2by2
#If the table is laid out correctly then you can input straight into epi.2by2, otherwise you need to recode or re-order the variables so that the table will be laid out correctly
#epi.2by2 wants the data with the exposed/disease group in top right corner
#So we just tell R to order the variables differently when we draw the table
epi.2by2(table(relevel(scabies$scabies_infestation,"yes"), relevel(scabies$impetigo_active,"yes")))
## Outcome + Outcome - Total Inc risk *
## Exposed + 179 187 366 48.9
## Exposed - 444 1098 1542 28.8
## Total 623 1285 1908 32.7
## Odds
## Exposed + 0.957
## Exposed - 0.404
## Total 0.485
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.70 (1.49, 1.94)
## Odds ratio 2.37 (1.88, 2.99)
## Attrib risk * 20.11 (14.52, 25.71)
## Attrib risk in population * 3.86 (0.77, 6.95)
## Attrib fraction in exposed (%) 41.13 (32.89, 48.35)
## Attrib fraction in population (%) 11.82 (8.30, 15.20)
## -------------------------------------------------------------------
## X2 test statistic: 54.415 p-value: < 0.001
## Wald confidence limits
## * Outcomes per 100 population units
Notice this gives you counts, an odds ratio and a chi-square result. Effectively this command combines elements of the STATA commands: mhodds and tab,chi
We can also stratify our odds ratio by a third variable to look for confounding
epi.2by2(table(data$dep_var, data$indep_var, data$stratificationvariable))
#As above make sure the levels of dep_var and indep_var are set such that the table is ordered in the way that epi.2by2 wants the data
The output gives us the crude odds ratio (identical to the one we got without the stratification variable) and then an adjusted odds ratio. We can look at these two for evidence of confounding.
We can get (if we desire) the stratum specific odds ratios out
stratum <- epi.2by2(table(relevel(scabies$scabies_infestation,"yes"), relevel(scabies$impetigo_active,"yes"),scabies$gender))
#Note this time I'm saving the 2*2 table
stratum
## Outcome + Outcome - Total Inc risk *
## Exposed + 179 187 366 48.9
## Exposed - 444 1098 1542 28.8
## Total 623 1285 1908 32.7
## Odds
## Exposed + 0.957
## Exposed - 0.404
## Total 0.485
##
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio (crude) 1.70 (1.49, 1.94)
## Inc risk ratio (M-H) 1.58 (1.39, 1.79)
## Inc risk ratio (crude:M-H) 1.08
## Odds ratio (crude) 2.37 (1.88, 2.99)
## Odds ratio (M-H) 2.18 (1.72, 2.76)
## Odds ratio (crude:M-H) 1.09
## Attrib risk (crude) * 20.11 (14.52, 25.71)
## Attrib risk (M-H) * 17.75 (5.52, 29.99)
## Attrib risk (crude:M-H) 1.13
## -------------------------------------------------------------------
## Test of homogeneity of IRR: X2 test statistic: 8.011 p-value: 0.005
## Test of homogeneity of OR: X2 test statistic: 0.629 p-value: 0.428
## Wald confidence limits
## M-H: Mantel-Haenszel
## * Outcomes per 100 population units
summary(stratum)$OR.strata.wald
## est lower upper
## 1 2.417620 1.716620 3.40488
## 2 1.996753 1.439806 2.76914
The basic R software can make nice plots. The introduction below is really only an outline. Later we will introduce the specific GGPLOT2 package which makes really nice graphs.
You may well want to ignore this section and just learn how to make GGPLOT graphs as they are much prettier!
We can do some simple plots to look at distributions, such as a histogram. The general syntax is > hist(data$variable)
hist(scabies$age)
We could do a basic scatter plot of house size vs number of people in a room
plot(scabies$house_inhabitants,scabies$room_cohabitation)
We might make this look nicer by adding some colours representing key variables
scabies$scatter_colour[scabies$scabies_infestation == "yes"] <- "red"
#If scabies_infestation is "yes" then the colour is red
scabies$scatter_colour[scabies$scabies_infestation == "no"] <- "blue"
#If scabies_infestation is "no" then the colour is blue
plot(scabies$house_inhabitants,scabies$room_cohabitation, col = scabies$scatter_colour)
#For each value look in the variable scatter_colours and work out what colour to make each person
Lots of the points overlap so we can jitter them
plot(jitter(scabies$house_inhabitants),scabies$room_cohabitation, col = scabies$scatter_colour)
Later we will come back to GGPLOT2 which makes much nicer graphs!
I will start by showing the Long-Hand version of how to do this - but once you know how to do it then you will want to learn to use Broom and make your life better.
The examples here use the “scabies dataset” again. For ease I quickly repeat below the code needed to implement the data-cleaning steps demonstrated above on the dataset
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
#Load dataset
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
scabies$agegroups <-relevel(scabies$agegroups, ref = "21+")
#Categorise age and set a baseline
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
#Categorise housesize and set a baseline
We can perform logistic regression in a straightforward fashion in R. The formula for any logistic regression is
logistic_model <- glm(dependent_variable~independent_variable+independent_variable, data = dataframe, family = binomial)
If we want to include a variable as a Factor then we state that by saying > factor(independent_variable)
logistic_model <- glm(dependent_variable~factor(independent_variable), data = dataframe, family = binomial)
Unlike in STATA this does not automatically generate an output. As a clinical epidemiologist we want a table with the odds ratios, coefficients, standard errors, confidence intervals and the Wald-test P-values.
We get the Odds ratios and Confidence intervals by taking the exponential of the output from the logistic regression model
logistic_model_OR <- cbind(OR = exp(coef(logistic_model)), exp(confint(logistic_model)))
Then we need to combine the Odds ratios and CI values with the raw coefficients and p-values into a single table
logistic_model_summary <- summary(logistic_model)
logistic_model_summary <- cbind(logistic_model_OR, logistic_model_summary$coefficients)
We can then display the complete table of the output
logistic_model_summary
Here we will see if the size of the household is associated with risk of scabies (variable = house_cat), after controlling for age (categorised) and gender.
scabiesrisk <- glm(scabies_infestation~factor(agegroups)+factor(gender)+factor(house_cat),data=scabies,family=binomial())
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint(scabiesrisk)))
## Waiting for profiling to be done...
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
## OR 2.5 % 97.5 % Estimate
## (Intercept) 0.09357141 0.06925691 0.1243857 -2.3690303
## factor(agegroups)0-10 2.20016940 1.61679635 3.0237474 0.7885344
## factor(agegroups)11-20 2.53291768 1.80434783 3.5769136 0.9293719
## factor(gender)male 1.44749159 1.13940858 1.8399368 0.3698321
## factor(house_cat)6-10 1.30521927 1.02664372 1.6624397 0.2663710
## factor(house_cat)10+ 1.17003712 0.65654905 1.9900581 0.1570355
## Std. Error z value Pr(>|z|)
## (Intercept) 0.1492092 -15.8772359 9.110557e-57
## factor(agegroups)0-10 0.1594864 4.9442116 7.645264e-07
## factor(agegroups)11-20 0.1743214 5.3313714 9.747386e-08
## factor(gender)male 0.1221866 3.0267824 2.471718e-03
## factor(house_cat)6-10 0.1228792 2.1677478 3.017788e-02
## factor(house_cat)10+ 0.2813713 0.5581076 5.767709e-01
The output gives us the OR associated with each variable, the 95% CI of that OR, and a Wald-Test P-Value for each variable
Investigate using logistic regression whether having impetigo (impetigo_active) is associated with scabies (scabies_infestation) after controlling for age (agegroups), sex (gender) and household size (house_cat).
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % Estimate Std. Error
## (Intercept) 0.1379817 0.1064948 0.1768205 -1.9806341 0.1292414
## scabies_infestationyes 1.9380091 1.5180396 2.4740877 0.6616612 0.1245435
## agegroups0-10 3.7264517 2.8550721 4.8992194 1.3154565 0.1376324
## agegroups11-20 2.7487745 2.0310065 3.7353394 1.0111552 0.1553069
## gendermale 1.7509457 1.4225953 2.1557118 0.5601561 0.1060015
## house_cat6-10 0.9548564 0.7741093 1.1776160 -0.0461943 0.1069907
## house_cat10+ 0.8941528 0.5449317 1.4384328 -0.1118786 0.2468512
## z value Pr(>|z|)
## (Intercept) -15.3250751 5.199446e-53
## scabies_infestationyes 5.3126904 1.080185e-07
## agegroups0-10 9.5577530 1.203411e-21
## agegroups11-20 6.5106924 7.480520e-11
## gendermale 5.2844143 1.261076e-07
## house_cat6-10 -0.4317601 6.659158e-01
## house_cat10+ -0.4532228 6.503883e-01
You should get a table showing an Odds Ratio of about 2 - i.e. the odds of impetigo are roughly twice as high in people with scabies. The p-value suggests this is a highly significant finding.
We might want to compare two models to see if overall the factor variable house_cat (size of the house) is significant after controlling for other variables (as opposed to the individual Wald tests for specific values of house_cat).
As in STATA we simply run the model without the Variable we are interested in and do a Likelihood Ratio test
scabiesrisk2 <- glm(scabies_infestation~factor(agegroups)+factor(gender),data=scabies,family=binomial())
#We could have got out all the results of this new model just like we did for the initial Logistic Regression but for the purpose of this demonstration we don't need to
lrtest(scabiesrisk,scabiesrisk2)
## Likelihood ratio test
##
## Model 1: scabies_infestation ~ factor(agegroups) + factor(gender) + factor(house_cat)
## Model 2: scabies_infestation ~ factor(agegroups) + factor(gender)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -900.44
## 2 4 -902.80 -2 4.7328 0.09382 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that overall there is borderline significance between household size and scabies risk. This makes sense if we look at the Wald-Test for each category: one house size category was associated and one was not.
We can easily fit an interaction term to our model. The general syntax is
interaction_model <- glm(dependent_variable~independent_variable*independent_variable, data = dataframe, family = binomial)
scabiesrisk3 <- glm(scabies_infestation~factor(agegroups)*factor(gender),data=scabies,family=binomial())
scabiesrisk3_OR <- exp(cbind(OR= coef(scabiesrisk3), confint(scabiesrisk3)))
## Waiting for profiling to be done...
scabiesrisk3_summary <- summary(scabiesrisk3)
scabiesrisk3_summary <- cbind(scabiesrisk3_OR, scabiesrisk3_summary$coefficients)
scabiesrisk3_summary
## OR 2.5 % 97.5 %
## (Intercept) 0.1018100 0.07391201 0.1366901
## factor(agegroups)0-10 2.6697985 1.81088534 3.9801955
## factor(agegroups)11-20 2.3655612 1.51556214 3.6962701
## factor(gender)male 1.6634407 0.93937388 2.8651848
## factor(agegroups)0-10:factor(gender)male 0.7017255 0.37220313 1.3517237
## factor(agegroups)11-20:factor(gender)male 1.1363006 0.56352873 2.3357191
## Estimate Std. Error
## (Intercept) -2.2846473 0.1564481
## factor(agegroups)0-10 0.9820030 0.2004429
## factor(agegroups)11-20 0.8610153 0.2268275
## factor(gender)male 0.5088882 0.2831246
## factor(agegroups)0-10:factor(gender)male -0.3542130 0.3280584
## factor(agegroups)11-20:factor(gender)male 0.1277779 0.3619259
## z value Pr(>|z|)
## (Intercept) -14.6032328 2.678375e-48
## factor(agegroups)0-10 4.8991654 9.624460e-07
## factor(agegroups)11-20 3.7959034 1.471068e-04
## factor(gender)male 1.7974000 7.227214e-02
## factor(agegroups)0-10:factor(gender)male -1.0797255 2.802644e-01
## factor(agegroups)11-20:factor(gender)male 0.3530499 7.240510e-01
The output gives us:
1) The Odds Ratio of each factor in the interaction in the baseline group of the other factor: i.e. the odds ratio for scabies, amongst people aged 0-10 compared to those aged 21+ (the reference group for age), in the baseline group of gender (female)
2) The interaction terms
As in STATA we can then compare the model with and without the interaction term using a likelihood ratio test to see if the interaction is statistically significant.
#Compare our model containing age & gender (scabiesrisk2) without an interaction term to our third model (scabiesrisk3) which does have an interaction term between age and gender
lrtest(scabiesrisk2,scabiesrisk3)
## Likelihood ratio test
##
## Model 1: scabies_infestation ~ factor(agegroups) + factor(gender)
## Model 2: scabies_infestation ~ factor(agegroups) * factor(gender)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -902.80
## 2 6 -901.15 2 3.3131 0.1908
Remember the null-hypothesis = No interaction. So here the P-value suggests no evidence against the null; so we should not fit this interaction term.
#Using Broom
Broom is a package within the tidyverse that makes all of the steps above MUCH easier. It installs when you install tidyverse but it doesn’t load by default when you load the tidyverse library.
The general syntax is > logistic_model <- glm(dependent_variable~indepdent_variable+independent_variable, data = dataframe, family = binomial)
This is identical to fitting the model above. Then we simplify with > logistic_model_summary <- tidy(logistic_model, exponentiate =TRUE, conf.int =TRUE)
Take the model data and put it into a table, exponentiating the coefficients and giving a 95% CI. This replaces all the steps after model fitting with 1 line.
Because (as you will see below) there is a variant called broom.mixed for mixed/random-effects models we need to specify which version to use so the slight tweak (if both broom and broom.mixed are loaded) is > logistic_model_summary <- broom::tidy(logistic_model, exponentiate =TRUE, conf.int =TRUE)
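As a minimal sketch, applied to the scabiesrisk model we fitted above:
scabiesrisk_broom <- broom::tidy(scabiesrisk, exponentiate = TRUE, conf.int = TRUE)
#One line replaces the exp/cbind/summary steps - the table contains the ORs, CIs, standard errors and p-values
scabiesrisk_broom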
For individually matched case-control studies we use Conditional Logistic Regression, not normal logistic regression
The syntax for this is:
conditional_model <- clogit(variable_denoting_case_or_control~independent_variable+independent_variable+strata(matching_var), data = dataframe)
Where ‘matching_var’ is a variable saying which Case is matched to which Control
We get the ORs, 95% CIs and everything else exactly the same as for logistic regression.
> conditional <- clogit(case~independent_variable+strata(matching_var), data = dataframe)
> conditional_OR <- exp(cbind(OR= coef(conditional), confint(conditional)))
> conditional_summary <- summary(conditional)
> conditional_summary <- cbind(conditional_OR, conditional_summary$coefficients)
> conditional_summary
Equally everything else is just like normal logistic regression i.e we can fit multiple models, run Likelihood-Ratio Tests etc
For the purpose of Poisson and cox regression we will initially go back to a clean version of our “Ebola dataset”. For ease I quickly repeat below the code needed to implement the data-cleaning steps demonstrated above on the dataset
##This is the ebola dataset again
ebola <- read.csv(file = "mmc1.txt", header = TRUE, sep = ",")
#Lets set 'person_to_person' route of transmission as baseline
ebola$transmission <- relevel(ebola$transmission, "person_to_person")
The package we are using for the Poisson and Cox Regression needs the dates in a particular format to work.
Specifically it needs them in a ‘decimal year’, i.e. 1982.145. If our data isn’t already like this then we set it up using the ‘cal.yr’ command. The cal.yr command wants a date as an input - so remember if your dates are stored as strings and not as dates in the raw data then you need to convert them to a date and then to a decimal date. You can do this in one step.
class(ebola$disease_onset)
## [1] "factor"
#We see that the disease_onset variable is currently a factor (i.e. just recognised as a bunch of characters), so we need to convert factor > date > decimal date.
ebola$disease_onset <- cal.yr(as.Date(ebola$disease_onset, format = "%Y-%m-%d"))
ebola$disease_ended <- cal.yr(as.Date(ebola$disease_ended, format = "%Y-%m-%d"))
We need to declare the way time is modelled when doing this kind of analysis. This is like STSET in STATA.
That is we need to specify 1) Entry point 2) Exit point
We may have multiple time scales - for example both Calendar Time and Chronological age. The general syntax is:
lexis_object <- Lexis(
entry = list ( name_first_time_scale = variable_setting_that,
name_second_time_scale = variable_setting_that),
exit = list (name_of_time_scale_we_calculate_exit_against = variable_for_exit),
exit.status = (where_we_define_the_outcome),
data = data_frame)
For example we might be interested in time in follow-up and chronological age as time scales. To start simply, we use just calendar time:
ebola_lexis_model <- Lexis(entry = list("calendar_time"=disease_onset), exit = list("calendar_time" = disease_ended), exit.status = status =="died", data = ebola)
## NOTE: entry.status has been set to FALSE for all.
## Warning in Lexis(entry = list(calendar_time = disease_onset), exit = list(calendar_time = disease_ended), : Dropping NA rows with duration of follow up < tol
#We define the beginning of calendar time in the study as the date in the 'disease_onset' variable
#We call time on this scale "calendar_time"
#Time out is on the same timescale as time in - i.e calendar_time
#If people are missing either a time-in or a time-out then they get dropped
R is also happy if we just have a duration of follow-up as opposed to a Date of Entry and a Date of Exit. In that case we can just say:
lexis_model <- Lexis(exit = list("calendar_time" = duration), exit.status = died, data = dataset)
##Here we specify the exit time as being the duration of follow-up
##We haven't stated an entry time so R assumes it to be 0 - i.e people were followed up from time 0 for the duration of time specified as the exit time
When we make this model it creates several key variables we need to reference when running our Cox and Poisson models.
‘lex.dur’ = duration of follow-up
‘lex.Xst’ = status of the individual at the end of each time block. For example in the above lexis model this is whether or not the event (death) occurred in any given time period.
We will reference these variables when we run our poisson and cox regression models.
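A Lexis object behaves like a data frame, so we can peek at these variables directly:
head(ebola_lexis_model[, c("calendar_time", "lex.dur", "lex.Xst")])
#Entry time, duration of follow-up and exit status for the first few people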
Once we have a lexis model it is easy to see crude rates and rates stratified by a variable
The general syntax is: > rate_table <- rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = stratifying_var)
If we want we can give per 100 or 1,000 years by simply adjusting ‘pyrs’ > rate_table <- rate(lexis_model, obs = lex.Xst, pyrs = lex.dur/1000, print = stratifying_var)
average_rate <- rate(ebola_lexis_model, obs = lex.Xst, pyrs = lex.dur)
#Show the overall rate of death per 1 person-year (because people die fast of ebola and follow-up is short)
rate_by_transmission <- rate(ebola_lexis_model, obs = lex.Xst, pyrs = lex.dur, print = transmission)
#Show the rates of death by transmission route per 1 person-year (because people die fast of ebola and follow-up is short)
average_rate
##
## Crude rates and 95% confidence intervals:
##
## lex.Xst lex.dur rate SE.rate rate.lo rate.hi
## 1: 245 5.303217 46.19837 2.951506 40.76096 52.36112
rate_by_transmission
##
## Crude rates and 95% confidence intervals:
##
## transmission lex.Xst lex.dur rate SE.rate rate.lo
## 1: person_to_person 129 2.8993840 44.49221 3.917321 37.44023
## 2: both 8 0.2491444 32.10989 11.352561 16.05788
## 3: other 10 0.1341547 74.54082 23.571876 40.10658
## 4: syringe 86 1.7686516 48.62461 5.243327 39.36108
## 5: unknown 12 0.2518823 47.64130 13.752860 27.05568
## rate.hi
## 1: 52.87245
## 2: 64.20806
## 3: 138.53919
## 4: 60.06829
## 5: 83.88975
We run a poisson regression essentially the same as a logistic regression. We use the same GLM command as before.
Instead of setting ‘family = binomial’ we need to set ‘family = poisson’
We include an additional variable called offset. This represents the fact that follow-up time is different for each person in the study (i.e it gives you a rate against person-time rather than a number of events / people). This variable is the log of the follow-up time in the lexis-model: ‘log(lex.dur)’
We set the outcome as whether lex.Xst occurred (i.e. did the event occur)
The generic formula is > poisson_model <- glm(lex.Xst~offset(log(lex.dur))+independent_variable+independent_variable, data = lexis_dataframe, family = poisson)
#Fit a model for the rate of death from ebola by transmission route and sex
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
#The outcome is lex.Xst - whatever we set as the outcome when we made our lexis dataset
#The offset is the log(lex.dur) - the log of duration of follow-up
#We fit a poisson model
ebola_ratios <- cbind(HR = exp(coef(ebola_model)), exp(confint(ebola_model)))
## Waiting for profiling to be done...
#Just like in the Logistic model we take exponentials of the coefficients to get our Hazard Ratios
ebola_model_out <- summary(ebola_model)
ebola_model_out <-cbind (ebola_ratios, ebola_model_out$coefficients)
#We combine this all in to a nice table
ebola_model_out
## HR 2.5 % 97.5 % Estimate
## (Intercept) 75.8444260 4.0975490 407.863217 4.32868422
## factor(transmission)both 0.7230170 0.3237964 1.389407 -0.32432256
## factor(transmission)other 1.6052597 0.7574323 2.976279 0.47328555
## factor(transmission)syringe 1.0928077 0.8294145 1.432693 0.08875023
## factor(transmission)unknown 1.0722117 0.5607743 1.861057 0.06972352
## sexfemale 0.5879981 0.1102686 10.851812 -0.53103158
## sexmale 0.5845758 0.1086098 10.822539 -0.53686884
## Std. Error z value Pr(>|z|)
## (Intercept) 1.0578400 4.0920029 4.276632e-05
## factor(transmission)both 0.3666753 -0.8844951 3.764290e-01
## factor(transmission)other 0.3450006 1.3718400 1.701133e-01
## factor(transmission)syringe 0.1392193 0.6374849 5.238090e-01
## factor(transmission)unknown 0.3033065 0.2298781 8.181865e-01
## sexfemale 1.0548472 -0.5034204 6.146687e-01
## sexmale 1.0580754 -0.5074013 6.118733e-01
#Using Broom with Poisson and Cox
As before we can simplify this process with Broom
#Long hand way
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
ebola_ratios <- cbind(HR = exp(coef(ebola_model)), exp(confint(ebola_model)))
## Waiting for profiling to be done...
ebola_model_out <- summary(ebola_model)
ebola_model_out <-cbind (ebola_ratios, ebola_model_out$coefficients)
ebola_model_out
## HR 2.5 % 97.5 % Estimate
## (Intercept) 75.8444260 4.0975490 407.863217 4.32868422
## factor(transmission)both 0.7230170 0.3237964 1.389407 -0.32432256
## factor(transmission)other 1.6052597 0.7574323 2.976279 0.47328555
## factor(transmission)syringe 1.0928077 0.8294145 1.432693 0.08875023
## factor(transmission)unknown 1.0722117 0.5607743 1.861057 0.06972352
## sexfemale 0.5879981 0.1102686 10.851812 -0.53103158
## sexmale 0.5845758 0.1086098 10.822539 -0.53686884
## Std. Error z value Pr(>|z|)
## (Intercept) 1.0578400 4.0920029 4.276632e-05
## factor(transmission)both 0.3666753 -0.8844951 3.764290e-01
## factor(transmission)other 0.3450006 1.3718400 1.701133e-01
## factor(transmission)syringe 0.1392193 0.6374849 5.238090e-01
## factor(transmission)unknown 0.3033065 0.2298781 8.181865e-01
## sexfemale 1.0548472 -0.5034204 6.146687e-01
## sexmale 1.0580754 -0.5074013 6.118733e-01
#Short hand using broom
ebola_model <- glm(lex.Xst~factor(transmission)+sex+offset(log(lex.dur)), data = ebola_lexis_model, family = poisson)
broom_ebola <- tidy(ebola_model, exponentiate =TRUE, conf.int =TRUE)
Time (for example age) is a key confounder so we need to be able to control for this. In STATA we would use STSPLIT to do this. We can do the same in R.
We do this using: > expanded_lexis_model <- splitLexis(lexis_model, “time_scale_to_split_on”, breaks = c(break1,break2,break3))
We need a different dataset for this example. Again the dataset is freely available on the LSHTM Data Repository: “Alcohol outcomes”. This is a simulated dataset (a fully-anonymised, tweaked dataset based on a real one) looking at the association between age, alcohol and death.
#Load in our new dataset
alcohol <- read.csv(file = "alcohol_outcomes.csv", header = TRUE, sep=",")
#Check how dates are saved
class(alcohol$timein)
## [1] "factor"
#time is saved as a factor so we need to convert it
alcohol$timein <- cal.yr(as.Date(alcohol$timein, format = "%d/%m/%Y"))
alcohol$timeout <- cal.yr(as.Date(alcohol$timeout, format = "%d/%m/%Y"))
alcohol$timebth <- cal.yr(as.Date(alcohol$timebth, format = "%d/%m/%Y"))
#Setup our lexis model
alcohol_lexis <- Lexis(entry = list("calendar_time"=timein, "age" = timein - timebth), exit = list("calendar_time" = timeout), exit.status = death ==1, data = alcohol)
## NOTE: entry.status has been set to FALSE for all.
#We define the beginning of calendar time in the study as the date in the 'timein' variable
#We call time on this scale "calendar_time"
#We define the beginning of the age-time scale as the difference between when you joined the study and your date of birth. We call this time scale "age"
#Time out is on the same timescale as time in - i.e calendar_time
#Perform a lexis expansion
split_alcohol_lexis <- splitLexis(alcohol_lexis, breaks = c(50,60,70,80), time.scale = "age")
#We say we want to split our lexis_model
#We want to do this on our defined "age" time scale
#We want the breaks to be at the listed points
Now we need to tell R we want to use those Time-Blocks as a variable in our regression. NB Compare to STATA - in STATA we cut the time and make it a new variable in one go with
stsplit newvariable, at(timepoint,timepoint,timepoint)
But each time we want to split on a new timescale in STATA we need to stset the data again and run a new ‘stsplit’ command.
In R we declare all the timescales initially (in the lexis command), then we split each timeperiod and then make a new variable reflecting each split. The generic syntax is
expanded_model$block_name <- timeBand(expanded_model, "name_of_split", type = "factor")
split_alcohol_lexis$age_blocks <- timeBand(split_alcohol_lexis, "age", type = "factor")
#We make a new variable called age_blocks.
#We used the splits we defined on the "age" time-scale for this
#Make each segment of this time-scale an element of a categorical variable
#We then remove any blank 'time-blocks' i.e segments with no observations
#This is done by re-factoring the variable
split_alcohol_lexis$age_blocks <- factor(split_alcohol_lexis$age_blocks)
Now we can run the poisson regression with age (in blocks) as a variable just like in STATA.
age_alcohol_model <- glm(lex.Xst~factor(alcohol)+factor(age_blocks)+offset(log(lex.dur)), data = split_alcohol_lexis, family = poisson)
age_alcohol_ratios <- cbind(HR = exp(coef(age_alcohol_model)), exp(confint(age_alcohol_model)))
## Waiting for profiling to be done...
age_alcohol_out <- summary(age_alcohol_model)
age_alcohol_out <-cbind (age_alcohol_ratios, age_alcohol_out$coefficients)
age_alcohol_out
## HR 2.5 % 97.5 %
## (Intercept) 0.001758427 0.000698216 3.568313e-03
## factor(alcohol)2 1.583354946 1.287840186 1.943563e+00
## factor(age_blocks)(50,60] 2.788198316 1.304756320 7.235037e+00
## factor(age_blocks)(60,70] 8.358271714 4.051253875 2.125521e+01
## factor(age_blocks)(70,80] 18.434422669 8.856542929 4.712319e+01
## factor(age_blocks)(80,Inf] 49.721819767 18.829170021 1.449072e+02
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.343336 0.4088553 -15.514867 2.752023e-54
## factor(alcohol)2 0.459546 0.1049087 4.380437 1.184415e-05
## factor(age_blocks)(50,60] 1.025396 0.4292502 2.388806 1.690321e-02
## factor(age_blocks)(60,70] 2.123252 0.4154268 5.111013 3.204365e-07
## factor(age_blocks)(70,80] 2.914220 0.4189527 6.955964 3.501584e-12
## factor(age_blocks)(80,Inf] 3.906444 0.5098479 7.661979 1.830903e-14
#Now fit a model adjusting for the effect of changing age through follow-up time
Just like in logistic regression we can compare our two poisson models using ‘lrtest’. Remember the two models need to be fitted on the same number of observations. This means both models need to be fitted AFTER the lexis expansion - so in the current example we would compare:
Lexis Expanded Dataset - controlling for Age & Alcohol
Lexis Expanded Dataset - controlling for Age Only
The LRTEST would then test whether Alcohol was associated having controlled for Age
age_model <- glm(lex.Xst~age_blocks+offset(log(lex.dur)), data = split_alcohol_lexis, family = poisson)
#Fit a model on the expanded dataset where only Age is an independent variable
age_model_ratios <- cbind(HR = exp(coef(age_model)), exp(confint(age_model)))
## Waiting for profiling to be done...
age_model_out <- summary(age_model)
age_model_out <-cbind (age_model_ratios, age_model_out$coefficients)
age_model_out
## HR 2.5 % 97.5 % Estimate
## (Intercept) 0.001908232 7.583858e-04 0.00386644 -6.261578
## age_blocks(50,60] 2.864554178 1.340670e+00 7.43252992 1.052413
## age_blocks(60,70] 9.040064110 4.386843e+00 22.97293591 2.201666
## age_blocks(70,80] 21.271550431 1.025650e+01 54.25717095 3.057371
## age_blocks(80,Inf] 61.056794442 2.323188e+01 177.24621063 4.111804
## Std. Error z value Pr(>|z|)
## (Intercept) 0.4082441 -15.337829 4.272431e-53
## age_blocks(50,60] 0.4291935 2.452070 1.420369e-02
## age_blocks(60,70] 0.4149555 5.305789 1.121869e-07
## age_blocks(70,80] 0.4174887 7.323242 2.420515e-13
## age_blocks(80,Inf] 0.5075158 8.101825 5.414068e-16
lrtest(age_model,age_alcohol_model)
## Likelihood ratio test
##
## Model 1: lex.Xst ~ age_blocks + offset(log(lex.dur))
## Model 2: lex.Xst ~ factor(alcohol) + factor(age_blocks) + offset(log(lex.dur))
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -1487.5
## 2 6 -1478.1 1 18.72 1.514e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform a likelihood test of the model with Age&Alcohol compared to the model with only Age
If we wanted to test if age was associated we would compare:
Lexis Expanded Dataset - controlling for Age & Alcohol
Lexis Expanded Dataset - controlling for Alcohol Only
Note even though strictly we don’t need to expand the dataset for the Alcohol-only model (as alcohol is not a time-varying factor) we need the two models to have the same number of observations, so it does need to be run against the Lexis Expanded dataset
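A minimal sketch of that comparison:
alcohol_model <- glm(lex.Xst~factor(alcohol)+offset(log(lex.dur)), data = split_alcohol_lexis, family = poisson)
#Fit a model on the expanded dataset where only Alcohol is an independent variable
lrtest(alcohol_model, age_alcohol_model)
#Test whether Age is associated having controlled for Alcohol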
Cox Regression is done in a very similar fashion to all other models. We declare the outcome and follow-up time slightly differently than from in the poisson but otherwise things look very similar!
The generic formula is:
cox_model <- coxph(Surv(lex.dur,lex.Xst)~independent_variable+independent_variable, data = lexis_dataframe)
cox_lexis_model <- coxph(Surv(lex.dur,lex.Xst)~factor(alcohol),data = split_alcohol_lexis)
##Fit a cox model.
##The initial statement Surv(lex.dur,lex.Xst) says that the duration of follow-up is in lex.dur and the outcome variable in lex.Xst (as outlined above)
cox_lexis_ratios <- cbind(HR = exp(coef(cox_lexis_model)), exp(confint(cox_lexis_model)))
#As every other time we exponentiate the coefficients to get hazard ratios
cox_lexis_out <- summary(cox_lexis_model)
cox_lexis_out <- cbind(cox_lexis_ratios,cox_lexis_out$coefficients)
#And we put everything in a nice table
cox_lexis_out
## HR 2.5 % 97.5 % coef exp(coef) se(coef)
## factor(alcohol)2 2.237758 1.830073 2.736263 0.8054745 2.237758 0.1026134
## z Pr(>|z|)
## factor(alcohol)2 7.849601 4.17364e-15
Note that the poisson and the cox give very similar estimates of the impact of alcohol on the outcome.
As in Poisson Regression we can compare two cox-models using ‘lrtest’ if we need to. Remember as always that both models must be performed on a dataset with the same number of observations - i.e after any lexis-expansion has been performed.
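For example (a sketch, adding the age blocks we created earlier to the cox model fitted above):
cox_age_model <- coxph(Surv(lex.dur,lex.Xst)~factor(alcohol)+factor(age_blocks), data = split_alcohol_lexis)
#Fit a second cox model also controlling for age
lrtest(cox_lexis_model, cox_age_model)
#Both models are fitted on the same expanded dataset so the comparison is valid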
##Kaplan Meier Plots
We can plot very nice KM plots. As before we have declared our data as a Lexis Model. We input the dataset in a similar fashion to our cox-regression but use the survfit command rather than coxph > km <- survfit(Surv(lex.dur, lex.Xst) ~ var_to_stratify_on, data = dataset)
Then we plot the KM Plot > ggsurvplot(km, data = dataset, risk.table = TRUE / FALSE)
km <- survfit(Surv(lex.dur,lex.Xst)~factor(alcohol),data = alcohol_lexis)
##Fit the data
##The initial statement Surv(lex.dur,lex.Xst) says that the duration of follow-up is in lex.dur and the outcome variable in lex.Xst (as outlined above).
#We want to stratify by alcohol
ggsurvplot(km,data=alcohol_lexis)
#Plot the Kaplan Meier Plot
Suggested by “Chrissy h Roberts”
Below, the basics of a few types of sample size calculations are shown - for example comparing proportions, comparing means, or detecting an effect size of a certain magnitude.
The general syntax is
epi.propsize(treat = X, control = Y, power = Z, n = Z)
#Specify either n (to compute power) OR power (to compute sample size)
#Specify the other as NA
epi.propsize(treat = 0.1, control = 0.2, power = 0.8, n = NA)
## $n.total
## [1] 398
##
## $n.treat
## [1] 199
##
## $n.control
## [1] 199
##
## $power
## [1] 0.8
##
## $lambda
## [1] 0.5
#Calculate sample size needed to have 80% power to show difference of 10% assuming group 1 has proportion of 10%
epi.propsize(treat = 0.1, control = 0.2, power = 0.8, n = NA, sided.test = 1)
## $n.total
## [1] 314
##
## $n.treat
## [1] 157
##
## $n.control
## [1] 157
##
## $power
## [1] 0.8
##
## $lambda
## [1] 0.5
#As above but a 1-sided test
epi.propsize(treat = 0.1, control = 0.2, power = NA, n = 300)
## $n.total
## [1] 300
##
## $n.treat
## [1] 150
##
## $n.control
## [1] 150
##
## $power
## [1] 0.6808308
##
## $lambda
## [1] 0.5
#Calculate the power to detect a difference between arms given a sample size of 300
We need to know the 1) Proportion in control group 2) Odds ratio we wish to detect
The general syntax when giving an Odds Ratio to detect is
epi.ccsize(OR = X, p0 = Y, n = Z, power = Z)
#Specify either n (to compute power) OR power (to compute sample size)
#Specify the other as NA
epi.ccsize(OR = X, p0 = Y, n = Z, power = Z, r = )
#Where r is the ratio of controls to cases
epi.ccsize(OR = 2, p0 = 0.2, n = NA, power = 0.8)
## $n.total
## [1] 344
##
## $n.case
## [1] 172
##
## $n.control
## [1] 172
##
## $power
## [1] 0.8
##
## $OR
## [1] 2
help("epi.ccsize")
#Calculate sample size required to detect an Odds Ratio of 2 given an expected prevalence of 20% in the control group
epi.ccsize(OR = 4, p0 = 0.3, n = NA, power = 0.8, r =4)
## $n.total
## [1] 110
##
## $n.case
## [1] 22
##
## $n.control
## [1] 88
##
## $power
## [1] 0.8
##
## $OR
## [1] 4
#Calculate sample size required to detect an Odds Ratio of 4 given an expected prevalence of 30% in the control group and 4 controls for every case
The general syntax is > epi.simplesize(Py = prevalence_to_find, epsilon.r = variation_around_prevalence, method = "proportion")
#epsilon.r is specified as a proportion of Py
#So if we want to find 10% +/- 2% we would set epsilon.r as 0.2
epi.simplesize(Py = 0.2, epsilon.r = 0.3, method = "proportion")
## [1] 171
#Calculate the sample size required to detect a true prevalence of 20% +/- 6% (i.e. 0.3 x 20%)
#We can specify a fixed population calculation by adding in 'n ='
The general syntax is
power.t.test(delta = DIFF_BETWEEN_MEANS, power = Z, type = "SAMPLE_TYPE")
#Type may be "two.sample", "one.sample" or "paired"
#We can additionally specify the anticipated standard deviation, for example
power.t.test(delta = DIFF_BETWEEN_MEANS, power = Z, type = "SAMPLE_TYPE", sd = 1)
OR alternatively (using the same package as the other sample size calculations):
epi.meansize(treat = X, control = Y, n = Z, power = Z, sigma = A)
#Specify either n (to compute power) OR power (to compute sample size)
#Specify the other as NA
#Sigma is the anticipated Standard Deviation
power.t.test(delta = 2, power = 0.8, type ="two.sample", sd = 1)
##
## Two-sample t test power calculation
##
## n = 5.090008
## delta = 2
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
#Calculate sample size needed to have 80% power to show difference of 2 in the mean value between two groups
power.t.test(delta = 2, power = 0.8, type ="paired", sd = 1)
##
## Paired t test power calculation
##
## n = 4.220731
## delta = 2
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
#Calculate sample size needed to have 80% power to show a difference of 2 in the mean of two paired values - assuming a standard deviation of 1
epi.meansize(treat = 14 , control = 12, n = NA, power = 0.8, sigma = 1)
## $n.total
## [1] 8
##
## $n.treat
## [1] 4
##
## $n.control
## [1] 4
##
## $power
## [1] 0.8
##
## $delta
## [1] 2
#Calculate sample size needed to have 80% power to show difference of 2 between mean assuming a standard deviation of 1
##Capture-Recapture Population Estimates
We can calculate the predicted true population size based on the results of a capture-recapture survey.
The general syntax is
epi.popsize(people_seen_first_time, people_seen_second_time, people_seen_both_times)
epi.popsize(1104,1112,920)
##
## Estimated population size: 1334 (95% CI 1319 - 1350)
## Estimated number of untested subjects: 38 (95% CI 23 - 54)
#We saw 1104 people in our first survey
#We saw 1112 people in our second survey
#920 people were seen both times
#Calculate estimated true population and proportion of individuals missed both times
##Multiple Imputation
Suggested by "Tom Yates"
I presume, as always, that you understand the caveats of multiple imputation. The MICE package uses Multivariate Imputation via Chained Equations, which is a common approach to multiple imputation.
I deleted some data from our earlier scabies dataset so we can look at how multiple imputation works.
Again the dataset with missing data is available on the LSHTM Data-Repository: “Scabies Missing Data”
scabies_missing <- read.csv(file = "scabies_results_missing.csv", header = TRUE, sep = ",")
We can see what data is missing using the ‘md.pattern’ command. The general syntax is:
md.pattern(dataset)
md.pattern(scabies_missing)
## village impetigo_inactive other_lesions room_cohabitation age
## 1439 1 1 1 1 1
## 103 1 1 1 1 1
## 81 1 1 1 1 1
## 2 1 1 1 1 1
## 70 1 1 1 1 1
## 3 1 1 1 1 1
## 3 1 1 1 1 1
## 52 1 1 1 1 1
## 4 1 1 1 1 1
## 7 1 1 1 1 1
## 2 1 1 1 1 1
## 63 1 1 1 1 0
## 3 1 1 1 1 0
## 1 1 1 1 1 0
## 4 1 1 1 1 0
## 54 1 1 1 0 1
## 4 1 1 1 0 1
## 2 1 1 1 0 1
## 3 1 1 1 0 1
## 4 1 1 1 0 1
## 1 1 1 1 0 1
## 2 1 1 1 0 0
## 1 1 1 1 0 0
## 0 0 0 71 74
## house_inhabitants scabies_infestation gender impetigo_active
## 1439 1 1 1 1 0
## 103 1 1 1 0 1
## 81 1 1 0 1 1
## 2 1 1 0 0 2
## 70 1 0 1 1 1
## 3 1 0 1 0 2
## 3 1 0 0 1 2
## 52 0 1 1 1 1
## 4 0 1 1 0 2
## 7 0 1 0 1 2
## 2 0 0 1 1 2
## 63 1 1 1 1 1
## 3 1 1 1 0 2
## 1 1 0 1 1 2
## 4 0 1 1 1 2
## 54 1 1 1 1 1
## 4 1 1 1 0 2
## 2 1 1 0 1 2
## 3 1 0 1 1 2
## 4 0 1 1 1 2
## 1 0 0 1 1 3
## 2 1 1 1 1 2
## 1 1 1 0 1 3
## 74 83 96 119 517
We can impute our data using the ‘mice’ command. The general syntax is:
imputed_data <- mice(data = dataset, m = number_imputed_sets_to_make, defaultMethod = "method_for_imputing_values", maxit = number_iterations_to_impute_values, seed = 500)
This generates ‘m’ datasets each with slightly different imputed data.
NB Seed is just a number which kicks off the random-number generator. So if you alter the seed number the imputed dataset is going to change.
We don’t actually need to specify defaultMethod unless we want to change how R imputes values.
By default 'mice' will use the following methods:
* Numeric values: pmm, predictive mean matching
* Binary data: logreg, logistic regression imputation
* Unordered categorical data with >2 levels: polyreg, polytomous regression
* Ordered categorical data with >2 levels: polr, proportional odds model
We can specify alternative methods, for example:
* norm = Bayesian linear regression
So we could use Bayesian linear regression for numeric imputation by stating
defaultMethod = c("norm", "logreg", "polyreg", "polr")
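Put together in a full call, this might look like the following (a sketch only, not run here):
imputed_norm <- mice(data = scabies_missing, m = 5, maxit = 10, seed = 500, defaultMethod = c("norm", "logreg", "polyreg", "polr"))
#As the call below, but numeric variables are imputed with Bayesian linear regression instead of pmm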
imputed_scabies_data <- mice(data = scabies_missing, m = 5, maxit = 10, seed = 500)
By default MICE uses all of the other variables as predictors and imputes every missing value. These settings can be changed if needed, as sketched below.
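A sketch of how those settings can be changed (this assumes mice version 3 or later, which provides make.predictorMatrix and make.method; in older versions the predictor matrix can be built by hand):
pred <- make.predictorMatrix(scabies_missing)
pred[, "village"] <- 0
#Stop 'village' being used as a predictor when imputing other variables
meth <- make.method(scabies_missing)
meth["age"] <- ""
#Leave missing values of 'age' un-imputed
imputed_custom <- mice(scabies_missing, m = 5, maxit = 10, predictorMatrix = pred, method = meth, seed = 500)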
MICE includes a specific graphics package which is set up to work well with imputed data. We can use this to see how reasonable our imputations are.
We can show box and whisker plots to see if our numerical imputed data looks reasonable with
bwplot(imputed_dataset, variable)
#R automatically colour codes the raw and imputed datasets differently
bwplot(imputed_scabies_data, age)
We can also see the distributions with a density plot
densityplot(imputed_dataset)
#R automatically colour codes the raw and imputed datasets differently
We can check that associations still look reasonable with a scatter plot
xyplot(imputed_dataset, variable_a ~ variable_b)
#R automatically colour codes the raw and imputed datasets differently
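As concrete examples with our imputed scabies data (a sketch; the pairing of age with house_inhabitants is chosen purely for illustration):
densityplot(imputed_scabies_data)
#Density plots of observed vs imputed values for each imputed variable
xyplot(imputed_scabies_data, age ~ house_inhabitants)
#Check the association between age and household size looks similar in observed and imputed data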
We can see an individual imputed dataset with
imputed_set_1 <- complete(imputed_data,1)
We can merge the datasets together - for example into a long dataset
imputed_long <- complete(imputed_data, "long")
If we want to include the original dataset we can do that too
imputed_long <- complete(imputed_data, "long", include = TRUE)
If we want to apply a data-manipulation such as converting a continuous to a categorical variable or making a new variable we can do this across our datasets.
First of all we convert the imputed data into a standard-dataset that we can work with:
imputed_long <- mice::complete(imputed_data, "long", include = TRUE)
#We have to specify mice::complete as 'tidyr' also has a function called complete, so this lets R know which version to use
#It's important we include the raw data as this lets us put the data back into an 'imputed' dataset for the purposes of analysis
#Now we can perform whatever manipulations we want
#Then we convert the manipulated dataset back to an 'imputed' dataset using "as.mids"
coded_imputed_data <- as.mids(imputed_long)
complete_scabies_data <- mice::complete(imputed_scabies_data,"long", include = TRUE)
#Take out imputed data and make a long data-table. Include the original dataset with missing values.
complete_scabies_data$agegroups <- as.factor(cut(complete_scabies_data$age, c(0,10,20,Inf), include.lowest = TRUE, labels = FALSE))
#Categorise age into groups across all our imputed datasets
#labels = FALSE assigns a numeric value to each category (i.e. 1, 2, 3)
complete_scabies_data$agegroups <-relevel(complete_scabies_data$agegroups, ref = 3)
#Set the baseline value
coded_imputed_scabies <- as.mids(complete_scabies_data)
#Convert back to "Imputed" format
If we want to fit a model across all the different imputed datasets then we can do that! The general syntax is very similar to a normal logistic regression but with a few changes as we are working with multiple imputed datasets and then pooling the results.
imputed_model <- with(imputed_data, glm(attack ~ smokes + female + hsgrad, family = binomial()))
#We specify the 'with' command to say we are modelling across each of our imputed datasets
imputed_model_summary <- summary(pool(imputed_model), exponentiate = TRUE, conf.int = TRUE)
#We specify the 'pool' command to pool the estimates across each of our imputed datasets
#We ask it to exponentiate the estimates and give us the 95% CI all in one go
imputed_model_summary
#Show the summary
#Imputed Data
imputed_scabies_model <- with(coded_imputed_scabies,glm(impetigo_active~factor(scabies_infestation)+factor(gender)+factor(agegroups), family = binomial()))
imputed_scabies_model_summary <- (summary(pool(imputed_scabies_model), exponentiate = TRUE, conf.int = TRUE))
## Warning in process_mipo(z, object, conf.int = conf.int, conf.level =
## conf.level, : Exponentiating coefficients, but model did not use a log or
## logit link function
#Let us also fit this same model to the full-original dataset
impetigorisk <- glm(impetigo_active~factor(scabies_infestation)+factor(gender)+factor(agegroups),data=scabies,family=binomial())
impetigorisk_OR <- exp(cbind(OR= coef(impetigorisk), confint(impetigorisk)))
## Waiting for profiling to be done...
impetigorisk_summary <- summary(impetigorisk)
impetigorisk_summary <- cbind(impetigorisk_OR, impetigorisk_summary$coefficients)
You can see we get very similar estimates of risk factors for impetigo from both our imputed dataset and our original dataset
imputed_scabies_model_summary
## estimate std.error statistic df
## (Intercept) 0.1284172 0.1321029 -15.536910 153.46067
## factor(scabies_infestation)1 1.9602606 0.1298640 5.182940 507.53928
## factor(gender)male 1.7469015 0.1090545 5.115273 787.60076
## factor(agegroups)1 3.8253644 0.1502099 8.931858 168.01825
## factor(agegroups)2 2.9491067 0.1759415 6.146941 82.66736
## p.value 2.5 % 97.5 %
## (Intercept) 0.000000e+00 0.09891984 0.1667106
## factor(scabies_infestation)1 2.777999e-07 1.51882984 2.5299883
## factor(gender)male 3.938015e-07 1.41025669 2.1639073
## factor(agegroups)1 0.000000e+00 2.84371631 5.1458766
## factor(agegroups)2 1.254234e-09 2.07828042 4.1848204
#Show the result of logistic regression on the imputed dataset
impetigorisk_summary
## OR 2.5 % 97.5 % Estimate
## (Intercept) 0.1345342 0.1057763 0.1691508 -2.0059369
## factor(scabies_infestation)yes 1.9325497 1.5143192 2.4661511 0.6588402
## factor(gender)male 1.7577258 1.4287526 2.1630912 0.5640208
## factor(agegroups)0-10 3.7025777 2.8401712 4.8620068 1.3090292
## factor(agegroups)11-20 2.7347978 2.0226139 3.7127619 1.0060575
## Std. Error z value Pr(>|z|)
## (Intercept) 0.1196454 -16.765690 4.349338e-63
## factor(scabies_infestation)yes 0.1243502 5.298267 1.169072e-07
## factor(gender)male 0.1057719 5.332426 9.690930e-08
## factor(agegroups)0-10 0.1370214 9.553466 1.254279e-21
## factor(agegroups)11-20 0.1548162 6.498398 8.117993e-11
#Show the result of logistic regression on the original dataset
When we want to compare two nested models derived from imputed datasets we can do so using 'pool.compare'
pool.compare(model_1, model_2, data = NULL, method = "Wald")
#This is similar to performing an LRT on two normal models - but in the circumstance of pooled data this approach should be used
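As a sketch using the imputed scabies models fitted above (the reduced model without gender is hypothetical; note also that newer versions of mice replace pool.compare with D1() and D3()):
imputed_scabies_reduced <- with(coded_imputed_scabies, glm(impetigo_active ~ factor(scabies_infestation) + factor(agegroups), family = binomial()))
#The same model as before but without gender
pool.compare(imputed_scabies_model, imputed_scabies_reduced, method = "Wald")
#Wald test of whether adding gender improves the model, pooled across the imputed datasets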
##Intra-Class Correlation Coefficients
Suggested and with help from "Peter MacPherson" and "Daniel Parker"
Statistically there are lots of different ways to calculate an Intra-Class Correlation coefficient (ICC).
Below I provide an outline of a few options. Which one is appropriate may depend on why you need an ICC.
If we have used a complex survey design then we can declare the survey design and then calculate the design effect.
We declare the survey design using
survey_dataset <- svydesign(id = ~cluster_var, data = dataset)
We can then calculate the frequency of our outcome (say the presence of trachoma) adjusting for clustering, and get out the design effect of our data compared to simple random sampling (SRS) with replacement.
svymean(~outcome, survey_dataset, proportion = TRUE, deff = "replace")
#Lets look at our scabies survey again
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
#Declare the survey design
survey_scabies <- svydesign(id=~Village, data = scabies)
## Warning in svydesign.default(id = ~Village, data = scabies): No weights or
## probabilities supplied, assuming equal probability
svymean(~scabies_infestation,survey_scabies,proportion = TRUE, deff = "replace")
## mean SE DEff
## scabies_infestationno 0.808176 0.010242 1.2903
## scabies_infestationyes 0.191824 0.010242 1.2903
#Work out the design effect (DEff) for scabies, adjusting for clustering at the village level
If we have used a complex survey design then we can also get out an ICC and then given the cluster size calculate the design effect.
The general syntax is
ICCest(group_var, outcome_var, data = dataset)
It expects outcome_var to be coded numerically
scabies$hasscabies <- 0
scabies$hasscabies[scabies$scabies_infestation == "yes"] <- 1
#Convert our yes/no variable to 0 / 1 variable as expected by ICCest
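An equivalent one-line recode (same result, assuming scabies_infestation is coded "yes"/"no"):
scabies$hasscabies <- as.numeric(scabies$scabies_infestation == "yes")
#TRUE/FALSE coerced to 1/0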
ICCest(Village,hasscabies, data = scabies)
## Warning in ICCest(Village, hasscabies, data = scabies): 'x' has been
## coerced to a factor
## $ICC
## [1] 0.004087404
##
## $LowerCI
## [1] -0.0009097985
##
## $UpperCI
## [1] 0.02560099
##
## $N
## [1] 10
##
## $k
## [1] 185.7484
##
## $varw
## [1] 0.1545527
##
## $vara
## [1] 0.0006343122
#Gives the ICC for scabies and the mean village size.
#ICC of 0.004087404
#K (average cluster size) 185
#So the design effect = 1 + (k-1)*ICC = 1 + (185-1)*0.004087404
1 + (185-1)*0.004087404
## [1] 1.752082
#Which is broadly similar to the output of the previous estimate
Alternatively in some situations you might want to fit a random-effects model and then calculate the ICC from the model.
The command for this is just:
icc(re_model)
re_village_scabies <- glmer(scabies_infestation~ + (1 | Village),family = "binomial", data = scabies, nAGQ = 10)
#Run a model clustered at village level. Ten integration points.
icc(re_village_scabies)
##
## Generalized linear mixed model
##
## Family : binomial (logit)
## Formula: scabies_infestation ~ +(1 | Village)
##
## ICC (Village): 0.0034
#Get out our ICC
#You can then use the ICC to calculate the design effect using standard formulas
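For example, plugging the model-based ICC into the same formula as before (taking the mean cluster size of 185 from the ICCest output above):
1 + (185 - 1) * 0.0034
#Design effect of roughly 1.63 - broadly in line with the survey-based estimates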
##Meta-analysis of proportions
The general syntax is
metaprop(Number_Events, Number_Observations)
metaprop(c(267,175,219,117),c(406,329,223,137),method="GLMM")
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = "FE", : Cannot invert Hessian for saturated model.
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = method.tau, : Cannot invert Hessian for saturated model.
## proportion 95%-CI %W(fixed) %W(random)
## 1 0.6576 [0.6092; 0.7037] -- --
## 2 0.5319 [0.4764; 0.5869] -- --
## 3 0.9821 [0.9547; 0.9951] -- --
## 4 0.8540 [0.7836; 0.9085] -- --
##
## Number of studies combined: k = 4
##
## proportion 95%-CI z p-value
## Fixed effect model 0.7105 [0.6829; 0.7366] -- --
## Random effects model 0.8330 [0.5400; 0.9550] -- --
##
## Quantifying heterogeneity:
## tau^2 = 2.1010; H = 9.04; I^2 = 98.8%
##
## Test of heterogeneity:
## Q d.f. p-value Test
## NA 3 -- Wald-type
## 187.20 3 < 0.0001 Likelihood-Ratio
##
## Details on meta-analytical method:
## - Random intercept logistic regression model
## - Maximum-likelihood estimator for tau^2
## - Logit transformation
## - Clopper-Pearson confidence interval for individual studies
#Calculate the weighted proportion of events across four studies
#Method = GLMM specifies that a random-intercept should be used
meta_data <- metaprop(c(267,175,219,117),c(406,329,223,137),method="GLMM", comb.fixed = FALSE)
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = "FE", : Cannot invert Hessian for saturated model.
## Warning in metafor::rma.glmm(xi = event[!exclude], ni = n[!exclude], method
## = method.tau, : Cannot invert Hessian for saturated model.
#As above but don't give a fixed-effect estimate and store the results so we can use them in our plot
meta_forest <- forest(meta_data)
#Draw a forest plot of our meta-analysis results
##A Brief Introduction to GGPLOT
The GGPLOT package makes much better graphs than base R. This is a very brief introduction to the syntax of GGPLOT2.
Everything in ggplot combines a similar group of elements to make its graphs. In any plot we specify 1) Dataset 2) A Co-ordinate system 3) How to show the data points (ggplot calls these ‘geoms’)
We add elements on top of one another to build up the plot. The first two elements are given together via the syntax
plot <- ggplot(dataset, aes(co-ordinate_system))
#Make something called 'plot'. Take data from 'dataset' and display the data as stated by the co-ordinate system
There are a variety of co-ordinate systems. When our plot involves a single variable (i.e. age as a histogram) then we need to specify x =
plot <- ggplot(scabies, aes(x = age))
#Look in the scabies dataframe - we are plotting age along the X-Axis
If we want to stratify our plots by a factor we can do so by telling GGPLOT with 'fill'
plot <- ggplot(scabies, aes(x = age, fill = gender))
#Stratify by gender - i.e. draw two histograms with different colours based on the variable gender
When our plot involves two variables (i.e. a scatter plot or a box-plot) then we specify x = and y =
plot <- ggplot(scabies, aes(x = house_inhabitants, y = room_cohabitation))
#Look in the scabies dataframe - we are plotting the number of people in the house along the X-Axis and the number of people sharing a room on the Y-Axis
We then tell GGPLOT how to render the data - this is the geom.
Some common geoms are:
* geom_point = scatterplot
* geom_histogram = histogram
* geom_boxplot = boxplot
We can specify particular values for the data-points ("geoms") like colour:
+ geom_point(aes(colour = factor(gender)))
#Add to our plot by drawing each value as a point i.e. make a scatter-plot
#Set the colour of each datapoint based on the factor variable gender
We can add labels using:
+ labs(title = "Graph Title", x = "X-Axis Label", y = "Y-Axis Label")
#Add to our plot by adding labels for the axes etc
Sometimes to make the code easier to follow we split it across several lines. For example
plot <- ggplot(scabies, aes(x = house_inhabitants, y = room_cohabitation)) +
geom_point(aes(colour = factor(gender))) +
labs(title = "House size and Room Cohabitation", x = "House size", y = "Room size")
#Note we put a PLUS sign at the end of each line to tell R that the code is continuing
Here are some examples similar to the graphs we drew with basic R earlier. The aim is just to give a feel for how the plots are constructed.
gg_scatter <- ggplot (scabies, aes(x = house_inhabitants, y = room_cohabitation)) +
#Two variable House size and room cohabitation
#NB I'm saving the plot under the name gg_scatter by saying gg_scatter <- X,Y,Z
geom_point(aes(colour = factor(gender))) +
#I'm drawing a scatter plot and colouring dots based on gender
labs(title = "House size vs Number people sharing a room", x ="House Size", y = "People sharing a room")
#I'm giving some labels to the graph
gg_scatter
#Show the gg_scatter graph
gg_histo <- ggplot (scabies, aes(x = age)) +
#I have a single variable age
geom_histogram() +
#With which I will draw a histogram
labs(title = "Distribution of age in study", y = "No. participants", x = "Age (yrs)")
#I will give some axis and graph labels
gg_histo
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_boxplot <- ggplot (scabies, aes(x = gender, y = age)) +
#I have two variable Gender and Age
geom_boxplot(aes(colour = factor(gender))) +
#I am drawing a box plot and colouring them in by gender
labs(title = "Distribution of age by gender", y = "Age", x = "Gender")
#I have some axis labels
gg_boxplot
We can do really fancy things with GGPLOT like make multiple graphs in a single panel. We declare these using one of the “facet” options. For example we could draw age distribution histograms by village.
library(ggplot2)
gg_histo_by_village <- ggplot (scabies, aes(x = age, fill = gender)) +
#I have only one co-ordinate x = age
#Stratifying on = gender; i.e fill in colour based on value in gender
geom_histogram(breaks = seq(0,80, by = 2)) +
#I'm plotting a histogram. I've divided the data from 0 to 80 into blocks 2 years wide.
labs(title = "Distribution of age in study", y = "No. participants", x = "Age (yrs)")+
#I've given some labels
facet_wrap("Village")
#I've broken the data down into several facets based on the village
gg_histo_by_village