Introduction
Welcome to an “Introduction to biostatistics using R: summer edition 2023”.
The six modules will be delivered over three weekly Wednesday morning’s seminars, between 9am and 12 midday, on the the 1st, and the 8th of February 2023. Venue: Building 41, room 153.
Week 1. Wednesday February 1st, 2023 (9am to 12 midday).
Module 1. Data good care of your data, reproducible research.
This module will introduce you to good data management practices, and the concept of reproducible research. Concepts such as data security, archiving, data cleaning, and documenting a clear audit trail of these processes will be covered.
Module 2. Preparing data for analysis and presenting descriptive statistics.
This module provides you the skills to prepare data for analysis, including various formats of data, labelling discrete levels of variables, and creating new variables. This module will also explore the presentation of descriptive statistics. When is the median and interquartile range preferred over the mean and standard deviation? And, the presentation of descriptive statistics in a manuscript being prepared for peer review.
Reading data into R.
Data sets can be loaded into R various ways. The simplist, is via a spreadsheet, such as a comma separated volume (csv). Using the ‘read.csv’ function. However, a working directory needs to be set-up.
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
# check the names of variable.
names(framingham)## [1] "randid" "sex" "totchol" "age" "sysbp"
## [6] "diabp" "cursmoke" "cigpday" "bmi" "diabetes"
## [11] "bpmeds" "heartrte" "glucose" "educ" "prevchd"
## [16] "prevap" "prevmi" "prevstrk" "prevhyp" "time"
## [21] "period" "hdlc" "ldlc" "death" "angina"
## [26] "hospmi" "mi_fchd" "anychd" "stroke" "cvd"
## [31] "hyperten" "timeap" "timemi" "timemifc" "timechd"
## [36] "timestrk" "timecvd" "timedth" "timehyp" "start.date"
## [41] "stroke.date" "death.date"
# Look at the first ten rows.
# head(framingham)
# Look at the last ten rows.
# tail(framingham)
# View help file for dataset
# install riskCommunicator package which has the Framingham dataset.
## install.packages('riskCommunicator')
library(riskCommunicator)
## help(framingham)
# Create age groups with labels
framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,70),
labels = c('30-39','40-49',
'50-59','60-69'))
# Check lower and upper values of groups
tapply(framingham$age, framingham$age.group, min)## 30-39 40-49 50-59 60-69
## 32 40 50 60
tapply(framingham$age, framingham$age.group, max)## 30-39 40-49 50-59 60-69
## 39 49 59 70
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18
# Label factors with numeric values
table(framingham$sex)##
## 1 2
## 5022 6605
framingham$sex <- factor(framingham$sex, level = c(1,2),
labels = c('Male','Female'))
table(framingham$sex)##
## Male Female
## 5022 6605
table(framingham$diabetes)##
## 0 1
## 11097 530
framingham$diabetes <- factor(framingham$diabetes, level = c(0,1),
labels = c('No','Yes'))
table(framingham$diabetes)##
## No Yes
## 11097 530
Our specific hypothesis is related to confirming the relationship between blood pressure and the risk of stroke. Therefore, we will subset the data set.
# remove participants with a prior stroke or tx
# for BP and only use first visit data.
names(framingham)## [1] "randid" "sex" "totchol" "age"
## [5] "sysbp" "diabp" "cursmoke" "cigpday"
## [9] "bmi" "diabetes" "bpmeds" "heartrte"
## [13] "glucose" "educ" "prevchd" "prevap"
## [17] "prevmi" "prevstrk" "prevhyp" "time"
## [21] "period" "hdlc" "ldlc" "death"
## [25] "angina" "hospmi" "mi_fchd" "anychd"
## [29] "stroke" "cvd" "hyperten" "timeap"
## [33] "timemi" "timemifc" "timechd" "timestrk"
## [37] "timecvd" "timedth" "timehyp" "start.date"
## [41] "stroke.date" "death.date" "age.group" "glucose.mmols"
## [45] "totchol.mmols"
d <- subset(framingham, bpmeds == 0 &
prevstrk == 0 &
period == 1, select = c(randid,age,totchol.mmols,
diabp,glucose.mmols,bmi,
sex,sysbp,cursmoke,educ,
cigpday,diabetes,stroke,
timestrk,death,timedth,
prevhyp,age.group))
# add n = 1, for counting
d$n <- 1
# create follow-up in years
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
table(d$stroke.yes,d$stroke)##
## 0 1
## No 3863 0
## Yes 0 343
d$HT.yes <- factor(d$prevhyp, levels= 0:1, labels = c('No','Yes'))
table(d$HT.yes,d$prevhyp)##
## 0 1
## No 2963 0
## Yes 0 1243
# variables labels and unit
library(Hmisc)
d <- upData(d, labels = c(age = 'Age',
totchol.mmols = 'Total cholesterol',
diabp = 'Diasolic BP',
glucose.mmols = 'BGL',
sex = 'Sex',
sysbp = 'Systolic BP',
cursmoke = 'Current smoker',
educ = 'Education level',
stroke = 'Stroke',
timestrk = 'Follow-up',
timestrk.yrs = 'Follow-up'),
units = c(age = 'yrs',
totchol.mmols = 'mmol/L',
glucose.mmols = 'mmol/L',
diabp = 'mmHg',
sysbp = 'mmHg',
timestrk = 'days',
timestrk.yrs = 'yrs'))## Input object size: 510712 bytes; 22 variables 4206 observations
## New object size: 501024 bytes; 22 variables 4206 observations
d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
'5th, 6th, or 7th grades',
'grade school graduate no high school',
'high school, did not graduate'))
## The Hmisc package will create Table 1. Including continuous and discrete variables. Frequencies are presented as "% (n)" and ## continuous variables as "p25/p50/p75 mean +/- SD".
# Summary table using Hmisc package
# saving summary table a csv file
table.1 <- print(summary(sex ~ age +
sysbp +
bmi +
glucose.mmols +
diabetes +
HT.yes +
cursmoke +
educ +
timestrk.yrs,
method = 'reverse',
overall = F,
test = T,
data = d), prmsd = T,
digits = 3,
prtest = 'P')##
##
## Descriptive Statistics by Sex
##
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## | |N |Male |Female |P-value |
## | | |(N=1868) |(N=2338) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |Age [yrs] |4206|42.00/49.00/57.00 49.61+/- 8.70|42.00/49.00/56.00 49.62+/- 8.53| 0.808|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |Systolic BP [mmHg] |4206|118.0/128.0/141.0 130.9+/- 18.8|115.5/127.5/144.0 132.3+/- 23.2| 0.894|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |bmi |4190|23.97/26.05/28.30 26.16+/- 3.40|22.53/24.71/27.64 25.47+/- 4.47| <0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |BGL [mmol/L] |3822| 3.94/4.33/4.83 4.58+/-1.39 | 4.00/4.33/4.78 4.53+/-1.23 | 0.817|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |diabetes : Yes |4206| 3% ( 56) | 2% ( 51) | 0.095|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |HT.yes : Yes |4206| 31% ( 570) | 29% ( 673) | 0.222|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |Current smoker |4206| 61% (1135) | 41% ( 957) | <0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |educ : 4th grade or less |4096| 44% (805) | 40% (909) | <0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## | 5th, 6th, or 7th grades | | 27% (497) | 32% (724) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## | grade school graduate no high school| | 13% (228) | 20% (453) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## | high school, did not graduate | | 16% (282) | 9% (198) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
## |Follow-up [yrs] |4206|15.99/24.00/24.00 19.49+/- 6.59|21.09/24.00/24.00 21.02+/- 5.77| <0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------+
write.csv(table.1, file = 'TableOne.csv', row.names = F)Module 3. Inferential statistics 1: Comparing two or more groups, continuous and categorical outcomes.
This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.
# SysBP and stroke
hist(d$sysbp, main = '',
xlab = 'Systolic BP (mmHg) \n First visit')describe(d$sysbp)## d$sysbp : Systolic BP [mmHg]
## n missing distinct Info Mean Gmd .05 .10
## 4206 0 226 1 131.7 23.06 104.0 108.5
## .25 .50 .75 .90 .95
## 116.6 128.0 142.5 160.0 173.0
##
## lowest : 83.5 85.0 85.5 90.0 92.0, highest: 230.0 232.0 235.0 243.0 295.0
boxplot(sysbp ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'SBP (mmHg)',
las = 1)tapply(d$sysbp, d$stroke, mean)## 0 1
## 130.5967 143.7114
tapply(d$sysbp, d$stroke, sd)## 0 1
## 20.56035 25.91658
# save a jpeg version of SBP and stroke status
## library(grDevices)
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
jpeg(file = 'FigureOne.jpg',
width = 600, height = 600,
units = 'px',
pointsize = 12,
quality = 100)
par(family = 'sans')
boxplot(sysbp ~ stroke.yes, data = d,
col = c('lightgray','lightgray'),
xlab = 'Stroke',
ylab = 'SBP (mmHg)',
las = 1,
ylim = c(0,300),
frame = F,
yaxt = 'n')
axis(2, at = seq(0,300,25), label = seq(0,300,25), las = 1)
tapply(d$sysbp, d$stroke, mean)## 0 1
## 130.5967 143.7114
tapply(d$sysbp, d$stroke, sd)## 0 1
## 20.56035 25.91658
TTest <- t.test(sysbp ~ stroke, data = d)
TTest ##
## Welch Two Sample t-test
##
## data: sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
library(epiR)
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
sbp.stroke <- epi.conf(data.frame(group = d$stroke.yes, val = d$sysbp),
ctype = "mean.unpaired", conf.level = 0.95 )
sbp.stroke## est se lower upper
## 1 13.11468 1.185815 10.78986 15.43951
str(sbp.stroke)## 'data.frame': 1 obs. of 4 variables:
## $ est : num 13.1
## $ se : num 1.19
## $ lower: num 10.8
## $ upper: num 15.4
text(1.5, 40, paste('mean difference (mmHg) = ', round(sbp.stroke$est,1),
' (95%CI ', round(sbp.stroke$lower,1),' to ',
round(sbp.stroke$upper,1),', p < 0.001)', sep = ''), cex = 0.75)
dev.off()## quartz_off_screen
## 2
Null Hypothesis Significance Testing.
\(H_0\) Mean systolic BP among non-stroke \(=\) Mean systolic BP among stroke.
\(H_i\) Mean systolic BP among non-stroke \(\neq\) Mean systolic BP among stroke.
Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail)
# SHADE TAILS OF n-DISTRIBUTED PLOTS
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l",
axes = T,
xlab = '',
ylab = '')
lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey',
border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht),
col = 'grey', border = NA)
abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey')
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')
text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')mean(d$sysbp[d$stroke == 1]) - mean(d$sysbp[d$stroke == 0])## [1] 13.11468
t.test(sysbp ~ stroke, data = d)##
## Welch Two Sample t-test
##
## data: sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
# EXAMPLE OF T-TEST
set.seed(12345)
n <- 3000 # large number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.075),
lwd = 2,
las = 1)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(20,0.065, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
cex = 0.75,
bty = 'n',
lty = 1,
lwd = 2,
col = c('blue','black'))
TTest <- t.test(a,b)
TTest##
## Welch Two Sample t-test
##
## data: a and b
## t = -3.4118, df = 5986.2, p-value = 0.0006496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3417023 -0.3625029
## sample estimates:
## mean of x mean of y
## 61.76762 62.61972
str(TTest)## List of 10
## $ statistic : Named num -3.41
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 5986
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00065
## $ conf.int : num [1:2] -1.342 -0.363
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 61.8 62.6
## ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ stderr : num 0.25
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "a and b"
## - attr(*, "class")= chr "htest"
text(90, 0.04, paste('p-value = ', round(TTest$p.value,3), sep = ''))n <- 30 # small number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.075),
lwd = 2,
las = 1)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(50,0.065, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
cex = 0.75,
bty = 'n',
lty = 1,
lwd = 2,
col = c('blue','black'))
TTest <- t.test(a,b)
TTest##
## Welch Two Sample t-test
##
## data: a and b
## t = -0.28393, df = 54.736, p-value = 0.7775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.455870 4.101878
## sample estimates:
## mean of x mean of y
## 61.79375 62.47075
str(TTest)## List of 10
## $ statistic : Named num -0.284
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 54.7
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.778
## $ conf.int : num [1:2] -5.46 4.1
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 61.8 62.5
## ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ stderr : num 2.38
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "a and b"
## - attr(*, "class")= chr "htest"
text(80, 0.04, paste('p-value = ', round(TTest$p.value,3), sep = '')) ## Type I and Type II error
NEJM 2000 ARDSNet lower tidal volume (6 mL/kg) versus traditional 12mL/kg. The primary outome death before discharge and breathing without assitance. Results were tx group 134/432 (31.0%) versus control 171/429 (39.8%).
## setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
source("~/OneDrive - University of Wollongong/Data/epiTablesV3.R")##
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 482 1622 2104
## No 1110 3211 4321
## Total 1592 4833 6425
##
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
##
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
##
## Risk Difference = -2.78 (95% CI: -5, -0.56)
##
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 95 229 324
## No 283 400 683
## Total 378 629 1007
##
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
##
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
##
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
##
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(134,298,171,258)##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 134 298 432
## No 171 258 429
## Total 305 556 861
##
## Rate of outcome among intervention group (p1) = 0.3102 (95% CI: 26.64, 35.39)
##
## Rate of outcome among control group (p2) = 0.3986 (95% CI: 35.21, 44.51)
##
## Risk Difference = -8.84 (95% CI: -15.21, -2.47)
##
## Hazard Risk = 0.73 95% CI = 0.582 to 0.915
## Rate Ratio = 0.778 95% CI = 0.648 to 0.934
## Odds Ratio = 0.678 95% CI = 0.512 to 0.898
## chi2 p-value = 0.0067
## Fisher's exact p-value = 0.0069
## ------------------------------------------------------------------------
An example of potentially Type II error - NEJM 2018 EOLIA ECMO for ARDS primary outome 60-day mortality tx 44/124 (35%) control 57/125 (46%)
epiTab(44,80,57,68)##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 44 80 124
## No 57 68 125
## Total 101 148 249
##
## Rate of outcome among intervention group (p1) = 0.3548 (95% CI: 26.98, 43.99)
##
## Rate of outcome among control group (p2) = 0.456 (95% CI: 36.78, 54.42)
##
## Risk Difference = -10.12 (95% CI: -22.31, 2.07)
##
## Hazard Risk = 0.72 95% CI = 0.486 to 1.067
## Rate Ratio = 0.778 95% CI = 0.574 to 1.055
## Odds Ratio = 0.656 95% CI = 0.394 to 1.091
## chi2 p-value = 0.1041
## Fisher's exact p-value = 0.1217
## ------------------------------------------------------------------------
# original sample size calculation was based on 40% vs 60%
# 60-day mortality
power.prop.test(p1 = 0.4, p2 = 0.6, sig.level = 0.05, power = 0.8)##
## Two-sample comparison of proportions power calculation
##
## n = 96.92364
## p1 = 0.4
## p2 = 0.6
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
p1 <- 0.6
q1 <- 1 - p1
p2 <- 0.4
q2 <- 1 - p2
p.bar <- (p1 + p2)/2
q.bar <- 1 - p.bar
# Sample size
n <- 2*((p.bar * q.bar) * (1.96 + 0.84)^2) / (p1 - p2)^2
n## [1] 98
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, power = 0.8)##
## Two-sample comparison of proportions power calculation
##
## n = 378.5477
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, n = 125) # power##
## Two-sample comparison of proportions power calculation
##
## n = 125
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.3615169
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Age and stroke
boxplot(age ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)tapply(d$age, d$stroke, mean)## 0 1
## 49.13461 55.07289
tapply(d$age, d$stroke, sd)## 0 1
## 8.504941 7.768804
# Age and Systolic BP
plot(d$age, d$sysbp, pch = 16,
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '')plot(d$age[d$stroke == 0], d$sysbp[d$stroke == 0], pch = 16,
col = 'blue',
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '',
ylim = c(80,300),
las = 1)
points(d$age[d$stroke == 1], d$sysbp[d$stroke == 1], pch = 16,
col = 'red')
abline(lsfit(d$age,d$sysbp), lty = 2, lwd = 2)# Cigs per day and stroke
boxplot(cigpday ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)tapply(d$cigpday, d$stroke, mean, na.rm = T)## 0 1
## 9.105085 9.170088
tapply(d$cigpday, d$stroke, sd, na.rm = T)## 0 1
## 12.00131 12.06747
# table of rates of stroke based
# systolic HT
library(Epi)
d$sysHT <- ifelse(d$sysbp > 140, 1, 0)
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
'Stroke (n)' = sum(stroke),
'Stroke (%)' = ratio(stroke,n,100),
'Stroke (rate)' = ratio(stroke,timestrk,365.25),
'follow-up (Total years)' = sum(timestrk.yrs)),
data = d,
margin = T),
digits = 3)## ---------------------------------------------------
## sysHT N Stroke Stroke Stroke follow-up
## (n) (%) (rate) (Total
## years)
## ---------------------------------------------------
## No 3035.000 178.000 5.865 0.003 64518.680
## Yes 1171.000 165.000 14.091 0.008 21014.623
##
## Total 4206.000 343.000 8.155 0.004 85533.303
## ---------------------------------------------------
tapply(d$timestrk.yrs, d$sysHT, sum)## No Yes
## 64518.68 21014.62
# rate with 95% CI
library(epiDisplay)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)## Strokes N
## No 178 3035
## Yes 165 1171
ci.binomial(stroke.tab$Strokes, stroke.tab$N)## events total probability se exact.lower95ci exact.upper95ci
## 1 178 3035 0.05864909 0.004265079 0.05054965 0.06761068
## 2 165 1171 0.14090521 0.010167300 0.12146029 0.16216781
# incidence based on person years
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)## Strokes pyrs
## No 178 64518.68
## Yes 165 21014.62
ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)## events person.time incidence se exact.lower95ci exact.upper95ci
## 1 178 64518.68 0.002758891 0.0002067876 0.002367128 0.003197555
## 2 165 21014.62 0.007851676 0.0006112521 0.006697479 0.009147202
# risk of stroke
# Rate Ratio
## setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
source('~/OneDrive - University of Wollongong/Data/epiTablesV3.R')##
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 482 1622 2104
## No 1110 3211 4321
## Total 1592 4833 6425
##
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
##
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
##
## Risk Difference = -2.78 (95% CI: -5, -0.56)
##
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 95 229 324
## No 283 400 683
## Total 378 629 1007
##
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
##
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
##
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
##
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(165,1171-165,178,3035-178)##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 165 1006 1171
## No 178 2857 3035
## Total 343 3863 4206
##
## Rate of outcome among intervention group (p1) = 0.1409 (95% CI: 12.1, 16.09)
##
## Rate of outcome among control group (p2) = 0.0586 (95% CI: 5.03, 6.7)
##
## Risk Difference = 8.23 (95% CI: 6.06, 10.39)
##
## Hazard Risk = 2.513 95% CI = 2.033 to 3.106
## Rate Ratio = 2.403 95% CI = 1.966 to 2.937
## Odds Ratio = 2.633 95% CI = 2.105 to 3.293
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
(sum(d$stroke[d$sysHT == 'Yes']) / sum(d$n[d$sysHT == 'Yes'])) /
(sum(d$stroke[d$sysHT == 'No']) / sum(d$n[d$sysHT == 'No']))## [1] 2.402513
Confidence intervals
Confidence intervals can be found all over statistics. They provide an interval likely to include the true population parameter we’re trying to estimate, allowing us to express estimated values from sample data with some confidence.
Depending on the situation, there are numerous methods for calculating them.The following formula is used to compute it:
$$Confidence Interval = (point estimate)+/-(critical value)*(standard error)$$
WEEK 2. Wednesday February 8th, 2023 (9am to 12 midday).
Module 4. Inferential statistics 2: Correlation and regression, adjusting for confounders.
This module will initially explore correlation and simple linear regression. We will then extend these concepts to multi-linear regression, and generalised-linear-models that are use to analysis various outcomes, such as continuous, binomial, and rates.
# Correlation and linear regression
set.seed(12345)
d.samp <-subset(d, randid %in% sample(d$randid, 150))
par(mfrow = c(1,1))
plot(sysbp ~ age, data = d.samp,
las = 1,
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)')
summary(lm(sysbp ~ age, data = d.samp))##
## Call:
## lm(formula = sysbp ~ age, data = d.samp)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -49.120 -14.068 -1.351 9.512 91.466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.9392 9.7463 9.023 8.91e-16 ***
## age 0.8622 0.1925 4.479 1.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.21 on 148 degrees of freedom
## Multiple R-squared: 0.1194, Adjusted R-squared: 0.1134
## F-statistic: 20.06 on 1 and 148 DF, p-value: 1.489e-05
abline(a = 87.9392, b = 0.8622, lty = 2)summary(lm(sysbp ~ I(age-35), data = d.samp))##
## Call:
## lm(formula = sysbp ~ I(age - 35), data = d.samp)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -49.120 -14.068 -1.351 9.512 91.466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.1164 3.3385 35.380 < 2e-16 ***
## I(age - 35) 0.8622 0.1925 4.479 1.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.21 on 148 degrees of freedom
## Multiple R-squared: 0.1194, Adjusted R-squared: 0.1134
## F-statistic: 20.06 on 1 and 148 DF, p-value: 1.489e-05
cor(d.samp$age, d.samp$sysbp)## [1] 0.3455124
cov(d.samp$age, d.samp$sysbp) / (var(d.samp$age) * var(d.samp$sysbp))^0.5## [1] 0.3455124
tapply(d$sysbp, d$stroke, mean)## 0 1
## 130.5967 143.7114
summary(lm(sysbp ~ stroke, data = d))##
## Call:
## lm(formula = sysbp ~ stroke, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -49.711 -14.597 -3.597 10.903 151.289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.5967 0.3386 385.66 <2e-16 ***
## stroke 13.1147 1.1858 11.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.05 on 4204 degrees of freedom
## Multiple R-squared: 0.02827, Adjusted R-squared: 0.02804
## F-statistic: 122.3 on 1 and 4204 DF, p-value: < 2.2e-16
summary(lm(sysbp ~ stroke - 1, data = d))##
## Call:
## lm(formula = sysbp ~ stroke - 1, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -49.71 113.00 125.00 139.88 243.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## stroke 143.711 6.854 20.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126.9 on 4205 degrees of freedom
## Multiple R-squared: 0.09467, Adjusted R-squared: 0.09445
## F-statistic: 439.7 on 1 and 4205 DF, p-value: < 2.2e-16
regress.display(lm(sysbp ~ stroke, data = d))## Linear regression predicting sysbp
##
## Coeff.(95%CI) P(t-test) P(F-test)
## stroke: 1 vs 0 13.11 (10.79,15.44) < 0.001 < 0.001
##
## No. of observations = 4206
# crude models
plot(stroke ~ sysbp, data = d, las = 1)plot(sysbp ~ stroke.yes, data = d, las = 1)stat.table(HT.yes, contents = list(D = sum(stroke),
N = sum(n),
R = ratio(stroke,n)),
margin = T,
data = d)## ---------------------------------
## HT.yes D N R
## ---------------------------------
## No 157.00 2963.00 0.05
## Yes 186.00 1243.00 0.15
##
## Total 343.00 4206.00 0.08
## ---------------------------------
# logit model
summary(glm(stroke ~ HT.yes, data = d, family = binomial(link='logit')))##
## Call:
## glm(formula = stroke ~ HT.yes, family = binomial(link = "logit"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5694 -0.5694 -0.3300 -0.3300 2.4239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.88327 0.08201 -35.16 <2e-16 ***
## HT.yesYes 1.14583 0.11423 10.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2376.7 on 4205 degrees of freedom
## Residual deviance: 2277.3 on 4204 degrees of freedom
## AIC: 2281.3
##
## Number of Fisher Scoring iterations: 5
# predicted rate exp(a+bx)/(a+bx)+1
exp(-2.88327 + (1*1.1458)) / (exp(-2.88327 + (1*1.1458)) + 1)## [1] 0.1496346
# log odds
# observed odds ratio
(0.15/0.85)/(0.05/0.95)## [1] 3.352941
# predicted OR
exp(1.14583)## [1] 3.145051
# Rate Ratio
tapply(d$stroke, d$HT.yes, mean)## No Yes
## 0.05298684 0.14963797
0.14963797/0.05298684## [1] 2.824059
glm.stroke <- glm(cbind(stroke,n) ~ sysHT, data = d, family = binomial)
print(logistic.display(glm.stroke), digits = 3)##
## Logistic regression predicting cbind(stroke, n)
##
## OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.4 (1.92,3) < 0.001 < 0.001
##
## Log-likelihood = -949.6711
## No. of observations = 4206
## AIC value = 1903.3422
glm.stroke <- glm(stroke ~ sysHT + offset(n), data = d, family = poisson)
glm.stroke <- glm(stroke ~ sysHT, offset(n), data = d, family = poisson)
print(idr.display(glm.stroke), digits = 3)##
##
## Poisson regression predicting stroke
##
## IDR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.4 (1.94,2.97) < 0.001 < 0.001
##
## Log-likelihood = -1171.1858
## No. of observations = 4206
## AIC value = 2346.3716
cph <- coxph(Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
summary(cph)## Call:
## coxph(formula = Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
##
## n= 4206, number of events= 343
##
## coef exp(coef) se(coef) z Pr(>|z|)
## I(sysbp/10) 0.28278 1.32681 0.02013 14.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## I(sysbp/10) 1.327 0.7537 1.275 1.38
##
## Concordance= 0.684 (se = 0.015 )
## Likelihood ratio test= 157 on 1 df, p=<2e-16
## Wald test = 197.3 on 1 df, p=<2e-16
## Score (logrank) test = 196.6 on 1 df, p=<2e-16
# adjusted models
# freq data
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
ci.binomial(stroke.tab$Strokes, stroke.tab$N)## events total probability se exact.lower95ci exact.upper95ci
## 1 178 3035 0.05864909 0.004265079 0.05054965 0.06761068
## 2 165 1171 0.14090521 0.010167300 0.12146029 0.16216781
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sex, sum), tapply(d$n, d$sex, sum)))
names(stroke.tab) <- c('Strokes','N')
ci.binomial(stroke.tab$Strokes, stroke.tab$N)## events total probability se exact.lower95ci exact.upper95ci
## 1 167 1868 0.08940043 0.006601537 0.07683967 0.1032664
## 2 176 2338 0.07527802 0.005456541 0.06490470 0.0867278
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$age.group, sum), tapply(d$n, d$age.group, sum)))
names(stroke.tab) <- c('Strokes','N')
ci.binomial(stroke.tab$Strokes, stroke.tab$N)## events total probability se exact.lower95ci exact.upper95ci
## 1 11 553 0.01989150 0.005937564 0.00996962 0.03531338
## 2 68 1642 0.04141291 0.004916965 0.03229793 0.05220926
## 3 148 1307 0.11323642 0.008765147 0.09655669 0.13168263
## 4 116 704 0.16477273 0.013981658 0.13809602 0.19428352
table(d$age.group, d$stroke.yes)##
## Yes No
## 30-39 11 542
## 40-49 68 1574
## 50-59 148 1159
## 60-69 116 588
round(epi.conf(dat = table(d$age.group, d$stroke.yes), ctype = "prop.single"), digits = 3)## est se lower upper
## 30-39 0.020 0.006 0.011 0.035
## 40-49 0.041 0.005 0.033 0.052
## 50-59 0.113 0.009 0.097 0.132
## 60-69 0.165 0.014 0.139 0.194
d$sex <- relevel(d$sex, ref = 'Female')
glm.stroke <- glm(cbind(stroke,n) ~ sysHT + sex + age.group, data = d, family = binomial)
print(logistic.display(glm.stroke), digits = 3)##
## Logistic regression predicting cbind(stroke, n)
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test)
## sysHT: Yes vs No 2.4 (1.92,3) 1.72 (1.36,2.17) < 0.001
##
## sex: Male vs Female 1.19 (0.95,1.48) 1.27 (1.01,1.59) 0.038
##
## age.group: ref.=30-39
## 40-49 2.08 (1.09,3.96) 1.96 (1.03,3.74) 0.041
## 50-59 5.69 (3.06,10.59) 4.84 (2.59,9.05) < 0.001
## 60-69 8.28 (4.42,15.53) 6.49 (3.42,12.28) < 0.001
##
## P(LR-test)
## sysHT: Yes vs No < 0.001
##
## sex: Male vs Female < 0.001
##
## age.group: ref.=30-39 < 0.001
## 40-49
## 50-59
## 60-69
##
## Log-likelihood = -903.8435
## No. of observations = 4206
## AIC value = 1819.687
# Using follow-up time
# Kaplan-Meier estimates
km.0 <- survfit(Surv(timestrk.yrs, stroke) ~ sysHT, data = d)
summary(km.0, times = c(5,10,20))## Call: survfit(formula = Surv(timestrk.yrs, stroke) ~ sysHT, data = d)
##
## sysHT=No
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 2945 13 0.996 0.00120 0.993 0.998
## 10 2819 20 0.989 0.00194 0.985 0.993
## 20 2374 91 0.955 0.00398 0.947 0.963
##
## sysHT=Yes
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 1082 24 0.979 0.00427 0.971 0.987
## 10 950 36 0.944 0.00702 0.931 0.958
## 20 639 74 0.858 0.01154 0.836 0.881
par(mfrow = c(1,2))
plot(km.0, lty = c(1,2),
las = 1,
col = c('blue','red'),
xlab = 'Follow-up (yrs)',
ylab = 'Probability of avoiding stroke')
legend('bottomleft', c('no-HT','HT'),
lty = c(1,2),
col = c('blue','red'),
bty = 'n')
plot(km.0, fun = 'event',
lty = c(1,2),
col = c('blue','red'),
las = 1,
xlab = 'Follow-up (yrs)',
ylab = 'Probability of stroke')
legend('topleft', c('no-HT','HT'),
lty = c(1,2),
col = c('blue','red'),
bty = 'n')cph.0 <- coxph(Surv(timestrk.yrs, stroke) ~ sysHT, data = d)
summary(cph.0)## Call:
## coxph(formula = Surv(timestrk.yrs, stroke) ~ sysHT, data = d)
##
## n= 4206, number of events= 343
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sysHTYes 1.1139 3.0464 0.1083 10.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sysHTYes 3.046 0.3283 2.464 3.767
##
## Concordance= 0.628 (se = 0.013 )
## Likelihood ratio test= 99.52 on 1 df, p=<2e-16
## Wald test = 105.8 on 1 df, p=<2e-16
## Score (logrank) test = 117.1 on 1 df, p=<2e-16
newData <- data.frame(expand.grid(stroke = 0:1,
sysHT = c('Yes','No'),
timestrk.yrs = c(5,10,20)))
newData## stroke sysHT timestrk.yrs
## 1 0 Yes 5
## 2 1 Yes 5
## 3 0 No 5
## 4 1 No 5
## 5 0 Yes 10
## 6 1 Yes 10
## 7 0 No 10
## 8 1 No 10
## 9 0 Yes 20
## 10 1 Yes 20
## 11 0 No 20
## 12 1 No 20
yhat <- predict(cph.0, newdata = newData, type = 'expected')
pred.stroke <- cbind(newData, yhat)
subset(pred.stroke, stroke == 1 & timestrk.yrs %in% c(10,20))## stroke sysHT timestrk.yrs yhat
## 6 1 Yes 10 0.04631465
## 8 1 No 10 0.01520324
## 10 1 Yes 20 0.14772021
## 12 1 No 20 0.04849063
par(mfrow = c(1,1),
mar = c(8,4,2,8),
cex = 0.9)
plot(km.0, fun = 'event',
lty = c(2,2),
col = c('blue','red'),
las = 1,
lwd = 2,
xlab = 'Follow-up (yrs)',
ylab = 'Probability of stroke',
ylim = c(0,0.15),
las = 1,
frame = F,
axes = F,
xmax = 20)
axis(1, at = seq(0,20,1), label = seq(0,20,1))
axis(2, at = seq(0,0.15,0.01), label = seq(0,0.15,0.01), las = 1)
summary(km.0, times = c(5,10,20))## Call: survfit(formula = Surv(timestrk.yrs, stroke) ~ sysHT, data = d)
##
## sysHT=No
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 2945 13 0.996 0.00120 0.993 0.998
## 10 2819 20 0.989 0.00194 0.985 0.993
## 20 2374 91 0.955 0.00398 0.947 0.963
##
## sysHT=Yes
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 1082 24 0.979 0.00427 0.971 0.987
## 10 950 36 0.944 0.00702 0.931 0.958
## 20 639 74 0.858 0.01154 0.836 0.881
mtext(side = 4,at = 1-0.858, col = "red", text = 'Hypertension', las = 1)
mtext(side = 4,at = 1-0.955, col = "blue", text = 'No-hypertension', las = 1)
# Add no. at risk
n.surv <- summary(km.0, times = seq(0,20,1))
## n.surv
n.risk <- matrix(n.surv$n.risk, byrow = T, nrow = 2)
## n.risk
mtext(side = 1, at = -0.1, line = 4, text = 'no. at risk \nHT group', outer = F, cex = 0.7, font = 1, adj = 0)
mtext(side = 1, at = c(-1, seq(0,20,1)), line = 5, text = c('HT', n.risk[1,]), outer = F, cex = 0.7, adj = 0, col = 1)
mtext(side = 1, at = c(-1, seq(0,20,1)), line = 6, text = c('No HT', n.risk[2,]), outer = F, cex = 0.7, adj = 0, col = 2)Module 5. An introduction to meta-analysis: approaches to combining the results of a number of studies.
This module provides an introduction to the meta-analysis of the results a number of studies, including prevalence, continuous outcomes, and rates. This model will also cover approaches to heterogeneity between studies, and the concepts of fixed-effects and random-effects models. We will also cover Trial Sequential Analysis approaches to meta-analysis, essentially do we need more studies? ## Meta-analysis of various estimates (Rate Ratio, Prevalence etc)
library(meta)
# Meta analysis of Rate Ratios
d.3 <- data.frame(study = c('Study 1','Study 2','Study 3'),
event.e = c(5,20,15),
n.e = c(100,2000,300),
event.c = c(10,40,30),
n.c = c(100,2000,300))
meta.3 <- metabin(event.e, n.e,
event.c, n.c,
sm = 'RR',
data = d.3,
studlab = study)
meta.3## RR 95%-CI %W(fixed) %W(random)
## Study 1 0.5000 [0.1772; 1.4105] 12.5 12.8
## Study 2 0.5000 [0.2934; 0.8522] 50.0 48.6
## Study 3 0.5000 [0.2747; 0.9099] 37.5 38.5
##
## Number of studies combined: k = 3
##
## RR 95%-CI z p-value
## Fixed effect model 0.5000 [0.3447; 0.7252] -3.65 0.0003
## Random effects model 0.5000 [0.3448; 0.7251] -3.65 0.0003
##
## Quantifying heterogeneity:
## tau^2 = 0; tau = 0; I^2 = 0.0% [0.0%; 89.6%]; H = 1.00 [1.00; 3.10]
##
## Test of heterogeneity:
## Q d.f. p-value
## 0.00 2 1.0000
##
## Details on meta-analytical method:
## - Mantel-Haenszel method
## - DerSimonian-Laird estimator for tau^2
## - Mantel-Haenszel estimator used in calculation of Q and tau^2 (like RevMan 5)
par(cex = 0.9)
forest(meta.3)# Clinical variation among surgical patients
# meta analysis of a proportion
d <- data.frame(study = c('Chen (2015)',
'Hewitt (2015)',
'Joseph (2014)',
'Makary (2010)',
'Reisinger (2015)',
'Robinson (2013)',
'Tegels (2014)',
'Richardson (2019)'),
n = c(379,325,220,594,159,72,127,160),
events = c(100,88,82,62,39,15,30,47),
year = c(2015,2015,2014,2010,2015,2013,
2014,2019))
d <- d[ order(d$year),]
d## study n events year
## 4 Makary (2010) 594 62 2010
## 6 Robinson (2013) 72 15 2013
## 3 Joseph (2014) 220 82 2014
## 7 Tegels (2014) 127 30 2014
## 1 Chen (2015) 379 100 2015
## 2 Hewitt (2015) 325 88 2015
## 5 Reisinger (2015) 159 39 2015
## 8 Richardson (2019) 160 47 2019
meta.prop <- metaprop(events, n, studlab = study,
data = d)
meta.prop## proportion 95%-CI
## Makary (2010) 0.1044 [0.0810; 0.1318]
## Robinson (2013) 0.2083 [0.1216; 0.3202]
## Joseph (2014) 0.3727 [0.3087; 0.4403]
## Tegels (2014) 0.2362 [0.1654; 0.3197]
## Chen (2015) 0.2639 [0.2202; 0.3113]
## Hewitt (2015) 0.2708 [0.2232; 0.3226]
## Reisinger (2015) 0.2453 [0.1806; 0.3197]
## Richardson (2019) 0.2938 [0.2245; 0.3708]
##
## Number of studies combined: k = 8
##
## proportion 95%-CI
## Fixed effect model 0.2274 [0.2097; 0.2461]
## Random effects model 0.2401 [0.1863; 0.3035]
##
## Quantifying heterogeneity:
## tau^2 = 0.1852; tau = 0.4303; I^2 = 91.4% [85.5%; 94.9%]; H = 3.41 [2.62; 4.44]
##
## Test of heterogeneity:
## Q d.f. p-value Test
## 81.49 7 < 0.0001 Wald-type
## 94.55 7 < 0.0001 Likelihood-Ratio
##
## Details on meta-analytical method:
## - Random intercept logistic regression model
## - Maximum-likelihood estimator for tau^2
## - Logit transformation
## - Clopper-Pearson confidence interval for individual studies
forest(meta.prop)# sample size based on this MA of prevalence abd 5% margin of error
n.for.survey(p = 0.25, delta = 0.05)##
## Sample size for survey.
## Assumptions:
## Proportion = 0.25
## Confidence limit = 95 %
## Delta = 0.05 from the estimate.
##
## Sample size = 288
dta <- read.csv('~/OneDrive - University of Wollongong/Data/DataCaVitDFx.csv', header = T, sep = ',')
## dta <- read.csv('c:/data/DataCaVitDFx.csv', header = T, sep = ',')
dta## study n.e event.e n.c event.c fUp.mth year
## 1 Chapuy (1992) 1387 80 1403 110 18 1992
## 2 Dawson-Hughes (1997) 187 11 202 26 36 1997
## 3 Chapuy (2002) 393 27 190 21 24 2002
## 4 Harwood (2004) 75 6 37 5 12 2004
## 5 Avenell (2004) 35 3 35 5 12 2004
## 6 Larsen (2004) 4957 318 2116 167 36 2004
## 7 Porthouse (2005) 1321 58 1993 91 24 2005
## 8 Grant (2005) 1306 165 1332 179 62 2005
## 9 Jackson (2006) 18176 2012 18106 2158 84 2006
## 10 Bolton-Smith (2007) 62 2 61 2 24 2007
## 11 Salovaara (2010) 1586 78 1609 94 36 2010
# annualsed rates of fracture
dta$yr.rate.e <- (dta$event.e/dta$n.e)/ (dta$fUp.mth/12)
dta$yr.rate.c <- (dta$event.c/dta$n.c)/ (dta$fUp.mth/12)
dta$rate.e <- (dta$event.e/dta$n.e)
dta$rate.c <- (dta$event.c/dta$n.c)
dta$RD <- round(dta$rate.e - dta$rate.c,4)
dta$RR <- round(dta$rate.e/dta$rate.c,2)
dta## study n.e event.e n.c event.c fUp.mth year yr.rate.e
## 1 Chapuy (1992) 1387 80 1403 110 18 1992 0.03845230
## 2 Dawson-Hughes (1997) 187 11 202 26 36 1997 0.01960784
## 3 Chapuy (2002) 393 27 190 21 24 2002 0.03435115
## 4 Harwood (2004) 75 6 37 5 12 2004 0.08000000
## 5 Avenell (2004) 35 3 35 5 12 2004 0.08571429
## 6 Larsen (2004) 4957 318 2116 167 36 2004 0.02138390
## 7 Porthouse (2005) 1321 58 1993 91 24 2005 0.02195307
## 8 Grant (2005) 1306 165 1332 179 62 2005 0.02445290
## 9 Jackson (2006) 18176 2012 18106 2158 84 2006 0.01581363
## 10 Bolton-Smith (2007) 62 2 61 2 24 2007 0.01612903
## 11 Salovaara (2010) 1586 78 1609 94 36 2010 0.01639344
## yr.rate.c rate.e rate.c RD RR
## 1 0.05226895 0.05767844 0.07840342 -0.0207 0.74
## 2 0.04290429 0.05882353 0.12871287 -0.0699 0.46
## 3 0.05526316 0.06870229 0.11052632 -0.0418 0.62
## 4 0.13513514 0.08000000 0.13513514 -0.0551 0.59
## 5 0.14285714 0.08571429 0.14285714 -0.0571 0.60
## 6 0.02630750 0.06415170 0.07892250 -0.0148 0.81
## 7 0.02282990 0.04390613 0.04565981 -0.0018 0.96
## 8 0.02600988 0.12633997 0.13438438 -0.0080 0.94
## 9 0.01702672 0.11069542 0.11918701 -0.0085 0.93
## 10 0.01639344 0.03225806 0.03278689 -0.0005 0.98
## 11 0.01947379 0.04918033 0.05842138 -0.0092 0.84
d <- dta
# TRADITIONAL META-ANALYSIS
meta.all <- metabin(event.e,n.e,event.c,n.c,
data = d,
sm = 'RR',
studlab = study,
method = 'MH')
meta.all## RR 95%-CI %W(fixed) %W(random)
## Chapuy (1992) 0.7357 [0.5570; 0.9717] 3.8 7.0
## Dawson-Hughes (1997) 0.4570 [0.2324; 0.8988] 0.9 1.3
## Chapuy (2002) 0.6216 [0.3610; 1.0702] 1.0 2.0
## Harwood (2004) 0.5920 [0.1932; 1.8137] 0.2 0.5
## Avenell (2004) 0.6000 [0.1552; 2.3203] 0.2 0.3
## Larsen (2004) 0.8128 [0.6788; 0.9734] 8.0 14.6
## Porthouse (2005) 0.9616 [0.6969; 1.3267] 2.5 5.3
## Grant (2005) 0.9401 [0.7718; 1.1452] 6.1 12.6
## Jackson (2006) 0.9288 [0.8772; 0.9834] 74.2 49.9
## Bolton-Smith (2007) 0.9839 [0.1431; 6.7637] 0.1 0.2
## Salovaara (2010) 0.8418 [0.6286; 1.1274] 3.2 6.4
##
## Number of studies combined: k = 11
##
## RR 95%-CI z p-value
## Fixed effect model 0.9026 [0.8588; 0.9486] -4.04 < 0.0001
## Random effects model 0.8755 [0.8102; 0.9460] -3.36 0.0008
##
## Quantifying heterogeneity:
## tau^2 = 0.0023; tau = 0.0478; I^2 = 12.8% [0.0%; 53.3%]; H = 1.07 [1.00; 1.46]
##
## Test of heterogeneity:
## Q d.f. p-value
## 11.47 10 0.3224
##
## Details on meta-analytical method:
## - Mantel-Haenszel method
## - DerSimonian-Laird estimator for tau^2
## - Mantel-Haenszel estimator used in calculation of Q and tau^2 (like RevMan 5)
forest(meta.all)# meta-regression
mreg.all <- metareg(meta.all, ~ I(yr.rate.c * 1000), method.tau = 'REML', hakn = F)
mreg.all##
## Mixed-Effects Model (k = 11; tau^2 estimator: REML)
##
## tau^2 (estimated amount of residual heterogeneity): 0 (SE = 0.0050)
## tau (square root of estimated tau^2 value): 0
## I^2 (residual heterogeneity / unaccounted variability): 0.00%
## H^2 (unaccounted variability / sampling variability): 1.00
## R^2 (amount of heterogeneity accounted for): 100.00%
##
## Test for Residual Heterogeneity:
## QE(df = 9) = 4.9581, p-val = 0.8379
##
## Test of Moderators (coefficient 2):
## QM(df = 1) = 6.5063, p-val = 0.0107
##
## Model Results:
##
## estimate se zval pval ci.lb ci.ub
## intrcpt 0.0281 0.0568 0.4947 0.6208 -0.0832 0.1393
## I(yr.rate.c * 1000) -0.0063 0.0025 -2.5507 0.0107 -0.0112 -0.0015 *
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Predicted RR for various baseline risk
pred <- data.frame(annualRisk.1000 = seq(5,60,5),
pred.RR = round(exp(predict(mreg.all, newmods = seq(5,60,5))$pred),2),
lcl = round(exp(predict(mreg.all, newmods = seq(5,60,5))$ci.lb),2),
ucl = round(exp(predict(mreg.all, newmods = seq(5,60,5))$ci.ub),2))
pred## annualRisk.1000 pred.RR lcl ucl
## 1 5 1.00 0.91 1.09
## 2 10 0.97 0.90 1.04
## 3 15 0.94 0.88 0.99
## 4 20 0.91 0.86 0.95
## 5 25 0.88 0.83 0.93
## 6 30 0.85 0.80 0.91
## 7 35 0.82 0.76 0.90
## 8 40 0.80 0.72 0.89
## 9 45 0.77 0.68 0.88
## 10 50 0.75 0.64 0.87
## 11 55 0.73 0.61 0.87
## 12 60 0.70 0.58 0.86
# NNT with annual risk ranging from 5 to 60 fractures per 1000
risk.1000 <- seq(5,60,5)
RR <- pred$pred.RR
Exp <- risk.1000 * RR
ARR <- risk.1000 - Exp
NNT <- ceiling(1/(ARR/1000))
cbind(risk.1000, RR, Exp, ARR, NNT)## risk.1000 RR Exp ARR NNT
## [1,] 5 1.00 5.00 0.00 Inf
## [2,] 10 0.97 9.70 0.30 3334
## [3,] 15 0.94 14.10 0.90 1112
## [4,] 20 0.91 18.20 1.80 556
## [5,] 25 0.88 22.00 3.00 334
## [6,] 30 0.85 25.50 4.50 223
## [7,] 35 0.82 28.70 6.30 159
## [8,] 40 0.80 32.00 8.00 125
## [9,] 45 0.77 34.65 10.35 97
## [10,] 50 0.75 37.50 12.50 80
## [11,] 55 0.73 40.15 14.85 68
## [12,] 60 0.70 42.00 18.00 56
# meta-gression plot
par(mfrow = c(1,1), cex = 1,
family = 'sans',
bg = 'white',
col = 'black',
mar = c(5,5,3,3))
plot(d$yr.rate.c * 1000, meta.all$TE,
cex = d$n.c^0.5/7,
pch = c(rep(16,8),16,rep(16,8)),
las = 1,
xlab = 'Annual rate (per 1000) of fracture among control group',
xlim = c(0,150),
ylab = 'log(Relative Risk)',
ylim = c(-1.5,0.75),
axes = F,
col = 'lightgrey',
col.axis = 'black',
col.lab = 'black')
points(d$yr.rate.c * 1000, meta.all$TE, pch = 1, col = 'black',
cex = d$n.c^0.5/7)
axis(1, at = seq(0,150,10), labels = seq(0,150,10), cex.axis = 0.85, col = 'black', col.axis = 'black')
axis(2, at = seq(-1.5,0.75, 0.25), labels = seq(-1.5,0.75, 0.25), cex.axis = 0.85, las = 1, col = 'black', col.axis = 'black')
lines(seq(0,150,5), predict(mreg.all, newmods = seq(0,150,5))$pred, lty = 1, lwd = 2.5, col = 'black')
lines(seq(0,150,5), predict(mreg.all, newmods = seq(0,150,5))$cr.lb, lty = 3, lwd = 2, col = 'black')
lines(seq(0,150,5), predict(mreg.all, newmods = seq(0,150,5))$cr.ub, lty = 3, lwd = 2, col = 'black')
abline(h = 0, lty = 2, lwd = 1.25)
abline(h = seq(-1.5,0.5, 0.25), v = seq(0,150,5),lty = 3, lwd = 1, col = 'grey')
text(d$yr.rate.c[1] * 1000 + 16.5, meta.all$TE[1] , d$study[1], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[2] * 1000 + 5 , meta.all$TE[2] - 0.125 , d$study[2], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[3] * 1000 , meta.all$TE[3] - 0.095 , d$study[3], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[4] * 1000 - 7.5 , meta.all$TE[4] - 0.095 , d$study[4], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[5] * 1000 , meta.all$TE[5] + 0.08 , d$study[5], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[6] * 1000 + 10 , meta.all$TE[6] - 0.165 , d$study[6], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[7] * 1000 , meta.all$TE[7] + 0.175 , d$study[7], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[8] * 1000 + 14 , meta.all$TE[8] , d$study[8], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[9] * 1000 , meta.all$TE[9] + 0.425 , d$study[9], cex = 0.7, family = 'sans', col = 'black')
text(d$yr.rate.c[10] * 1000 - 9 , meta.all$TE[10] + 0.075 , d$study[10], cex = 0.5, family = 'sans', col = 'black')
text(d$yr.rate.c[11] * 1000 - 12 , meta.all$TE[11] - 0.15 , d$study[11], cex = 0.7, family = 'sans', col = 'black')# meta-analysis - outcome of interest mortality
dta <- data.frame(study = c('Look (2018)',
'Nielsen (2013)',
'Laurent (2005)',
'Hachimi-Idrissi (2005)',
'Bernard (2002)',
'HACA (2002)',
'Hachimi-Idrissi (2001)',
'Tiainen (2007)',
'Bernard (1997)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Nielsen (2013)'),
location = rep('Out of hospital',14),
year = c(2018,2013,2005,2005,2002,2002,2001,2007,1997,2021,2019,2021,2019,2013),
n = c(87,939,42,28,77,275,30,70,44,1850,581,939,1850,581),
event.e = c(27,226,15,6,22,56,13,8,10,465,231,465,231,226),
n.e = c(45,473,22,14,43,137,16,36,22,925,284,925,284,473),
event.c = c(33,220,11,8,23,76,13,12,17,446,247,446,247,220),
n.c = c(42,466,20,14,34,138,14,34,22,925,297,925,297,466),
target.temp = c(rep(33,5),32,34,33,33,33,33,33,33,33),
comparison = c('No-TTM','TTM-36','TTM-37','TTM-37',
'No-TTM','No-TTM','Tx>38','Tx>38',
'No-TTM','Tx>37.8','TTM-37','TTM-36','Tx>37.8','TTM-37'),
design = c('RCT','RCT','RCT','RCT','RCT',
'RCT','RCT','RCT','Non-R','RCT','RCT','RCT','RCT','RCT'),
control.target = c('Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance'))
dta <- dta[ order(dta$year),]
dta$id <- 1:14
dta$ratio.e <- paste(dta$event.e,'/',dta$n.e, sep = '')
dta$p.e <- paste('(', round(dta$event.e/dta$n.e,2)*100,'%)', sep = '')
dta$ratio.c <- paste(dta$event.c,'/',dta$n.c, sep = '')
dta$p.c <- paste('(', round(dta$event.c/dta$n.c,2)*100,'%)', sep = '')
dta$tx <- paste(dta$ratio.e,dta$p.e, sep = ' ')
dta$c <- paste(dta$ratio.c,dta$p.c, sep = ' ')
meta.all <- metabin(event.e, n.e,
event.c, n.c,
sm = 'RR',
data = dta,
studlab = study,
byvar = control.target,
subset = id %in% c(1:8,11,13))
# ---------------------------- Forest plot ---------------------------------------------
par(cex = 1.15)
forest(meta.all, leftcols = c('studlab',
'tx','c'),
leftlabs = c('Study (Year)','Hypothermia\n (n/N)(%)',
'Control\n (n/N)(%)'),
bylab = '',
label.right = '',
label.left = '',
col.square = 'blue',
col.diamond = 'blue',
overall = F,
col.by = 'black')Module 6. Writing for publication and presenting results at conferences.
This module will explore approaches to interpreting and presenting results of your research. We will focus on the presentation of results in manuscripts, abstract submissions, and reports. The presentation of results in the context of conference presentations will also be covered.
Summary table using Hmisc package
Table 1. Characteristics of study participants.
| Males (n=1,868) | Females (n=2,338) | p-value | |
|---|---|---|---|
| Age (years), mean (SD) | 49.6 (8.70) | 49.6 (8.53) | 0.808 |
| Systolic BP (mmHg), mean (SD) | 130.9 (18.80) | 132.3 (23.20) | 0.808 |
| BMI (kg/m2), mean (SD) | 26.2 (3.4) | 25.5 (3.3) | <0.001 |
| Hypertension, n (%) | 570 (31) | 673 (29) | 0.222 |
| Current smoker, n (%) | 1,135 (61) | 957 (41) | <0.001 |
| Diabetes, n (%) | 56 (3) | 51 (2) | 0.095 |
| BGL, (mmol/L), mean (SD) | 4.6 (1.39) | 4.5 (1.23) | 0.817 |
| Education, n (%) | <0.001 | ||
| - 4th grade or less | 805 (44) | 909 (40) | |
| - 5th, 6th, or 7th grades | 497 (27) | 724 (32) | |
| - grade school graduate no high school | 228 (13) | 453 (20) | |
| - high school, did not graduate | 282 (16) | 198 (9) | |
| Follow-up (yrs), mean (SD) | 19.5 (6.6) | 21 (5.8) | <0.001 |
Using the Hmisc ‘summary’ function we have created the above Table, we have included test = TRUE, so the function will compare the continuous variables using (un-paired T-Test), and the frequency data using a Pearson’s chi-squared test.
Table 2. Hypertension and risk of stroke.
| events/n (%) | crude OR(95%CI) | adj. OR(95%CI) | p-value* | |
|---|---|---|---|---|
| No-hypertension | 178/3,035 (5.9%) | 1.0 (ref) | 1.0 (ref) | |
| Hypertension | 165/1,171 (14.1%) | 2.40 (1.92, 3.00) | 1.72 (1.36,2.17) | <0.001 |
| Females | 176/2,338 (7.5%) | 1.0 (ref) | 1.0 (ref) | |
| Males | 167/1,868 (14.1%) | 1.19 (0.95, 1.48) | 1.27 (1.01,1.59) | 0.038 |
| Age Group | ||||
| 30-39 | 11/553 (2.0%) | 1.0 (ref) | 1.0 (ref) | |
| 40-49 | 68/1,642 (4.1%) | 2.08 (1.09,3.96) | 1.96 (1.03,3.74) | 0.041 |
| 50-59 | 148/1,307 (11.3%) | 5.69 (3.06,10.59) | 4.84 (2.59,9.05) | < 0.001 |
| 60-69 | 116/704 (16.4%) | 8.28 (4.42,15.53) | 6.49 (3.42,12.28) | < 0.001 |
- adjusted \(p\)-values.
Our practical sessions we will be using the R Statistical Language (https://www.r-project.org/) via the R Studio editor (https://www.rstudio.com/products/rstudio/download/#download). And, we suggest you load this onto your laptop prior to Week 1.
The R language is an object orientated statistical program, that can be used for all stages of the research process, including:
- The planning of a study, meta-analysis of current literature, power and sample size;
- All data cleaning and management;
- All analysis; and,
- Most important, presentation of results for publication in manuscripts (including all tables and figures).
Some useful links:
BMJ Statistics at square one.
https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one
Bland and Altman Statistical notes in the BMJ.