Introduction to Biostatistics using R: Summer edition, 2023
Introduction
Welcome to an “Introduction to biostatistics using R: summer edition 2023”.
The six modules will be delivered over three weekly Wednesday morning’s seminars, between 9am and 12 midday, on the the 1st, and the 8th of February 2023. Venue: Building 41, room 153.
Week 1. Wednesday February 1st, 2023 (9am to 12 midday).
Module 1. Data good care of your data, reproducible research.
This module will introduce you to good data management practices, and the concept of reproducible research. Concepts such as data security, archiving, data cleaning, and documenting a clear audit trail of these processes will be covered.
Module 2. Preparing data for analysis and presenting descriptive statistics.
This module provides you the skills to prepare data for analysis, including various formats of data, labelling discrete levels of variables, and creating new variables. This module will also explore the presentation of descriptive statistics. When is the median and interquartile range preferred over the mean and standard deviation? And, the presentation of descriptive statistics in a manuscript being prepared for peer review.
Reading data into R.
Data sets can be loaded into R various ways. The simplist, is via a spreadsheet, such as a comma separated volume (csv). Using the ‘read.csv’ function. However, a working directory needs to be set-up.
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
# read dataset into R, naming the data set dta.
<- read.csv(file = 'framingham.csv', header = T, sep = ',')
framingham # check the names of variable.
names(framingham)
## [1] "randid" "sex" "totchol" "age" "sysbp"
## [6] "diabp" "cursmoke" "cigpday" "bmi" "diabetes"
## [11] "bpmeds" "heartrte" "glucose" "educ" "prevchd"
## [16] "prevap" "prevmi" "prevstrk" "prevhyp" "time"
## [21] "period" "hdlc" "ldlc" "death" "angina"
## [26] "hospmi" "mi_fchd" "anychd" "stroke" "cvd"
## [31] "hyperten" "timeap" "timemi" "timemifc" "timechd"
## [36] "timestrk" "timecvd" "timedth" "timehyp" "start.date"
## [41] "stroke.date" "death.date"
# Look at the first ten rows.
# head(framingham)
# Look at the last ten rows.
# tail(framingham)
# View help file for dataset
# install riskCommunicator package which has the Framingham dataset.
## install.packages('riskCommunicator')
library(riskCommunicator)
help(framingham)
# Create age groups with labels
$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
framinghamlabels = c('30-39','40-49',
'50-59','60-69',
'70+'))
# Check lower and upper values of groups
tapply(framingham$age, framingham$age.group, min)
## 30-39 40-49 50-59 60-69 70+
## 32 40 50 60 70
tapply(framingham$age, framingham$age.group, max)
## 30-39 40-49 50-59 60-69 70+
## 39 49 59 69 81
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18
framingham
# Label factors with numeric values
table(framingham$sex)
##
## 1 2
## 5022 6605
$sex <- factor(framingham$sex, level = c(1,2),
framinghamlabels = c('Male','Female'))
table(framingham$sex)
##
## Male Female
## 5022 6605
table(framingham$diabetes)
##
## 0 1
## 11097 530
$diabetes <- factor(framingham$diabetes, level = c(0,1),
framinghamlabels = c('No','Yes'))
table(framingham$diabetes)
##
## No Yes
## 11097 530
Our specific hypothesis is related to confirming the relationship between blood pressure and the risk of stroke. Therefore, we will subset the data set.
# remove participants with a prior stroke or tx
# for BP and only use first visit data.
names(framingham)
## [1] "randid" "sex" "totchol" "age"
## [5] "sysbp" "diabp" "cursmoke" "cigpday"
## [9] "bmi" "diabetes" "bpmeds" "heartrte"
## [13] "glucose" "educ" "prevchd" "prevap"
## [17] "prevmi" "prevstrk" "prevhyp" "time"
## [21] "period" "hdlc" "ldlc" "death"
## [25] "angina" "hospmi" "mi_fchd" "anychd"
## [29] "stroke" "cvd" "hyperten" "timeap"
## [33] "timemi" "timemifc" "timechd" "timestrk"
## [37] "timecvd" "timedth" "timehyp" "start.date"
## [41] "stroke.date" "death.date" "age.group" "glucose.mmols"
## [45] "totchol.mmols"
<- subset(framingham, bpmeds == 0 &
d == 0 &
prevstrk == 1, select = c(randid,age,totchol.mmols,
period
diabp,glucose.mmols,
sex,sysbp,cursmoke,educ,
cigpday,diabetes,stroke,
timestrk,death,timedth))# add n = 1, for counting
$n <- 1
d# create follow-up in years
$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
dtable(d$stroke.yes,d$stroke)
##
## 0 1
## No 3863 0
## Yes 0 343
# variables labels and unit
library(Hmisc)
<- upData(d, labels = c(age = 'Age',
d totchol.mmols = 'Total cholesterol',
diabp = 'Diasolic BP',
glucose.mmols = 'BGL',
sex = 'Sex',
sysbp = 'Systolic BP',
cursmoke = 'Current smoker',
educ = 'Education level',
stroke = 'Stroke',
timestrk = 'Follow-up'),
units = c(age = 'yrs',
totchol.mmols = 'mmol/L',
diabp = 'mmHg',
sysbp = 'mmol/L',
timestrk = 'days'))
## Input object size: 424952 bytes; 18 variables 4206 observations
## New object size: 414288 bytes; 18 variables 4206 observations
$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
d'5th, 6th, or 7th grades',
'grade school graduate no high school',
'high school, did not graduate'))
The Hmisc package will create Table 1. Including continuous and discrete variables. Frequencies are presented as “% (n)” and continuous variables as “p25/p50/p75 mean +/- SD”.
# Summary table using Hmisc package
print(summary(sex ~ age +
+
sysbp +
educ +
diabetes +
cursmoke
timestrk,method = 'reverse',
overall = F,
data = d), prmsd = T, digits = 3)
##
##
## Descriptive Statistics by Sex
##
## +----------------------------------------+----+--------------------------------+--------------------------------+
## | |N |Male |Female |
## | | |(N=1868) |(N=2338) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Age [yrs] |4206|42.00/49.00/57.00 49.61+/- 8.70|42.00/49.00/56.00 49.62+/- 8.53|
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Systolic BP [mmol/L] |4206|118.0/128.0/141.0 130.9+/- 18.8|115.5/127.5/144.0 132.3+/- 23.2|
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |educ : 4th grade or less |4096| 44% (805) | 40% (909) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## | 5th, 6th, or 7th grades | | 27% (497) | 32% (724) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## | grade school graduate no high school| | 13% (228) | 20% (453) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## | high school, did not graduate | | 16% (282) | 9% (198) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |diabetes : Yes |4206| 3% ( 56) | 2% ( 51) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Current smoker |4206| 61% (1135) | 41% ( 957) |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Follow-up [days] |4206| 5842/8766/8766 7117+/-2405 | 7705/8766/8766 7676+/-2108 |
## +----------------------------------------+----+--------------------------------+--------------------------------+
Module 3. Inferential statistics 1: Comparing two or more groups, continuous and categorical outcomes.
This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.
# SysBP and stroke
hist(d$sysbp, main = '',
xlab = 'Systolic BP (mmHg) \n First visit')
describe(d$sysbp)
## d$sysbp : Systolic BP [mmol/L]
## n missing distinct Info Mean Gmd .05 .10
## 4206 0 226 1 131.7 23.06 104.0 108.5
## .25 .50 .75 .90 .95
## 116.6 128.0 142.5 160.0 173.0
##
## lowest : 83.5 85.0 85.5 90.0 92.0, highest: 230.0 232.0 235.0 243.0 295.0
boxplot(sysbp ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'SBP (mmHg)',
las = 1)
tapply(d$sysbp, d$stroke, mean)
## 0 1
## 130.5967 143.7114
tapply(d$sysbp, d$stroke, sd)
## 0 1
## 20.56035 25.91658
Null Hypothesis Significance Testing.
\(H_0\) Mean systolic BP among non-stroke \(=\) Mean systolic BP among stroke.
\(H_i\) Mean systolic BP among non-stroke \(\neq\) Mean systolic BP among stroke.
Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail)
# SHADE TAILS OF n-DISTRIBUTED PLOTS
<- pretty(c(-3,3), 100)
z <- dnorm(z)
ht <- data.frame(z = z, ht = ht)
data <- 1.96
zc <- subset(data, z > zc)
t plot(data, type = "l",
axes = T,
xlab = '',
ylab = '')
lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey',
border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht),
col = 'grey', border = NA)
abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey')
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')
text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')
mean(d$sysbp[d$stroke == 1]) - mean(d$sysbp[d$stroke == 0])
## [1] 13.11468
t.test(sysbp ~ stroke, data = d)
##
## Welch Two Sample t-test
##
## data: sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
# EXAMPLE OF T-TEST
set.seed(12345)
<- 3000 # large number
n <- rnorm(n, mean = 61.8, sd = 10)
a <- rnorm(n, mean = 62.7, sd = 9.5)
b
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.05),
lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
bty = n,
lty = 1,
lwd = 2,
col = c('blue','black'))
t.test(a,b)
##
## Welch Two Sample t-test
##
## data: a and b
## t = -3.4118, df = 5986.2, p-value = 0.0006496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3417023 -0.3625029
## sample estimates:
## mean of x mean of y
## 61.76762 62.61972
<- 30 # small number
n <- rnorm(n, mean = 61.8, sd = 10)
a <- rnorm(n, mean = 62.7, sd = 9.5)
b
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.05),
lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
bty = n,
lty = 1,
lwd = 2,
col = c('blue','black'))
t.test(a,b)
##
## Welch Two Sample t-test
##
## data: a and b
## t = -0.28393, df = 54.736, p-value = 0.7775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.455870 4.101878
## sample estimates:
## mean of x mean of y
## 61.79375 62.47075
Type I and Type II error
NEJM 2000 ARDSNet lower tidal volume (6 mL/kg) versus traditional 12mL/kg. The primary outome death before discharge and breathing without assitance. Results were tx group 134/432 (31.0%) versus control 171/429 (39.8%).
setwd('~/OneDrive - University of Wollongong/Data/')
source('EpiTables V3.R')
##
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 482 1622 2104
## No 1110 3211 4321
## Total 1592 4833 6425
##
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
##
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
##
## Risk Difference = -2.78 (95% CI: -5, -0.56)
##
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 95 229 324
## No 283 400 683
## Total 378 629 1007
##
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
##
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
##
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
##
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(134,298,171,258)
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 134 298 432
## No 171 258 429
## Total 305 556 861
##
## Rate of outcome among intervention group (p1) = 0.3102 (95% CI: 26.64, 35.39)
##
## Rate of outcome among control group (p2) = 0.3986 (95% CI: 35.21, 44.51)
##
## Risk Difference = -8.84 (95% CI: -15.21, -2.47)
##
## Hazard Risk = 0.73 95% CI = 0.582 to 0.915
## Rate Ratio = 0.778 95% CI = 0.648 to 0.934
## Odds Ratio = 0.678 95% CI = 0.512 to 0.898
## chi2 p-value = 0.0067
## Fisher's exact p-value = 0.0069
## ------------------------------------------------------------------------
An example of potentially Type II error - NEJM 2018 EOLIA ECMO for ARDS primary outome 60-day mortality tx 44/124 (35%) control 57/125 (46%)
epiTab(44,80,57,68)
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 44 80 124
## No 57 68 125
## Total 101 148 249
##
## Rate of outcome among intervention group (p1) = 0.3548 (95% CI: 26.98, 43.99)
##
## Rate of outcome among control group (p2) = 0.456 (95% CI: 36.78, 54.42)
##
## Risk Difference = -10.12 (95% CI: -22.31, 2.07)
##
## Hazard Risk = 0.72 95% CI = 0.486 to 1.067
## Rate Ratio = 0.778 95% CI = 0.574 to 1.055
## Odds Ratio = 0.656 95% CI = 0.394 to 1.091
## chi2 p-value = 0.1041
## Fisher's exact p-value = 0.1217
## ------------------------------------------------------------------------
# original sample size calculation was based on 40% vs 60%
# 60-day mortality
power.prop.test(p1 = 0.4, p2 = 0.6, sig.level = 0.05, power = 0.8)
##
## Two-sample comparison of proportions power calculation
##
## n = 96.92364
## p1 = 0.4
## p2 = 0.6
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
<- 0.6
p1 <- 1 - p1
q1 <- 0.4
p2 <- 1 - p2
q2 <- (p1 + p2)/2
p.bar <- 1 - p.bar
q.bar
# Sample size
<- 2*((p.bar * q.bar) * (1.96 + 0.84)^2) / (p1 - p2)^2
n n
## [1] 98
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, power = 0.8)
##
## Two-sample comparison of proportions power calculation
##
## n = 378.5477
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, n = 125) # power
##
## Two-sample comparison of proportions power calculation
##
## n = 125
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.3615169
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Age and stroke
boxplot(age ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)
tapply(d$age, d$stroke, mean)
## 0 1
## 49.13461 55.07289
tapply(d$age, d$stroke, sd)
## 0 1
## 8.504941 7.768804
# Age and Systolic BP
plot(d$age, d$sysbp, pch = 16,
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '')
plot(d$age[d$stroke == 0], d$sysbp[d$stroke == 0], pch = 16,
col = 'blue',
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '',
ylim = c(80,300),
las = 1)
points(d$age[d$stroke == 1], d$sysbp[d$stroke == 1], pch = 16,
col = 'red')
abline(lsfit(d$age,d$sysbp), lty = 2, lwd = 2)
# Cigs per day and stroke
boxplot(cigpday ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)
tapply(d$cigpday, d$stroke, mean, na.rm = T)
## 0 1
## 9.105085 9.170088
tapply(d$cigpday, d$stroke, sd, na.rm = T)
## 0 1
## 12.00131 12.06747
# table of rates of stroke based
# systolic HT
library(Epi)
$sysHT <- ifelse(d$sysbp > 140, 1, 0)
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
dprint(stat.table(list(sysHT), contents = list(N = count(),
'Stroke (n)' = sum(stroke),
'Stroke (%)' = ratio(stroke,n,100),
'Stroke (rate)' = ratio(stroke,timestrk,365.25),
'follow-up (Total years)' = sum(timestrk.yrs)),
data = d,
margin = T),
digits = 3)
## ---------------------------------------------------
## sysHT N Stroke Stroke Stroke follow-up
## (n) (%) (rate) (Total
## years)
## ---------------------------------------------------
## No 3035.000 178.000 5.865 0.003 64518.680
## Yes 1171.000 165.000 14.091 0.008 21014.623
##
## Total 4206.000 343.000 8.155 0.004 85533.303
## ---------------------------------------------------
tapply(d$timestrk.yrs, d$sysHT, sum)
## No Yes
## 64518.68 21014.62
# rate with 95% CI
library(epiDisplay)
<- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
stroke.tab names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)
## Strokes N
## No 178 3035
## Yes 165 1171
ci.binomial(stroke.tab$Strokes, stroke.tab$N)
## events total probability se exact.lower95ci exact.upper95ci
## 1 178 3035 0.05864909 0.004265079 0.05054965 0.06761068
## 2 165 1171 0.14090521 0.010167300 0.12146029 0.16216781
# incidence based on person years
<- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
stroke.tab names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)
## Strokes pyrs
## No 178 64518.68
## Yes 165 21014.62
ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)
## events person.time incidence se exact.lower95ci exact.upper95ci
## 1 178 64518.68 0.002758891 0.0002067876 0.002367128 0.003197555
## 2 165 21014.62 0.007851676 0.0006112521 0.006697479 0.009147202
# risk of stroke
# Rate Ratio
setwd('~/OneDrive - NSW Health Department/Data/')
source('epiTables V3.R')
##
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 482 1622 2104
## No 1110 3211 4321
## Total 1592 4833 6425
##
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
##
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
##
## Risk Difference = -2.78 (95% CI: -5, -0.56)
##
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 95 229 324
## No 283 400 683
## Total 378 629 1007
##
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
##
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
##
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
##
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(165,1171-165,178,3035-178)
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 165 1006 1171
## No 178 2857 3035
## Total 343 3863 4206
##
## Rate of outcome among intervention group (p1) = 0.1409 (95% CI: 12.1, 16.09)
##
## Rate of outcome among control group (p2) = 0.0586 (95% CI: 5.03, 6.7)
##
## Risk Difference = 8.23 (95% CI: 6.06, 10.39)
##
## Hazard Risk = 2.513 95% CI = 2.033 to 3.106
## Rate Ratio = 2.403 95% CI = 1.966 to 2.937
## Odds Ratio = 2.633 95% CI = 2.105 to 3.293
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
sum(d$stroke[d$sysHT == 'Yes']) / sum(d$n[d$sysHT == 'Yes'])) /
(sum(d$stroke[d$sysHT == 'No']) / sum(d$n[d$sysHT == 'No'])) (
## [1] 2.402513
WEEK 2. Wednesday February 8th, 2023 (9am to 12 midday).
Module 4. Inferential statistics 2: Correlation and regression, adjusting for confounders.
This module will initially explore correlation and simple linear regression. We will then extend these concepts to multi-linear regression, and generalised-linear-models that are use to analysis various outcomes, such as continuous, binomial, and rates.
# crude models
<- glm(cbind(stroke,n) ~ sysHT, data = d, family = binomial)
glm.stroke print(logistic.display(glm.stroke), digits = 3)
##
## Logistic regression predicting cbind(stroke, n)
##
## OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.4 (1.92,3) < 0.001 < 0.001
##
## Log-likelihood = -949.6711
## No. of observations = 4206
## AIC value = 1903.3422
<- coxph(Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
cph summary(cph)
## Call:
## coxph(formula = Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
##
## n= 4206, number of events= 343
##
## coef exp(coef) se(coef) z Pr(>|z|)
## I(sysbp/10) 0.28278 1.32681 0.02013 14.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## I(sysbp/10) 1.327 0.7537 1.275 1.38
##
## Concordance= 0.684 (se = 0.015 )
## Likelihood ratio test= 157 on 1 df, p=<2e-16
## Wald test = 197.3 on 1 df, p=<2e-16
## Score (logrank) test = 196.6 on 1 df, p=<2e-16
# adjusted models
<- glm(cbind(stroke,n) ~ sysHT + age + sex, data = d, family = binomial)
glm.stroke print(logistic.display(glm.stroke), digits = 3)
##
## Logistic regression predicting cbind(stroke, n)
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test)
## sysHT: Yes vs No 2.4 (1.92,3) 1.69 (1.33,2.14) < 0.001
##
## age (cont. var.) 1.08 (1.06,1.09) 1.07 (1.05,1.08) < 0.001
##
## sex: Female vs Male 0.84 (0.68,1.05) 0.79 (0.63,0.99) 0.045
##
## P(LR-test)
## sysHT: Yes vs No < 0.001
##
## age (cont. var.) < 0.001
##
## sex: Female vs Male < 0.001
##
## Log-likelihood = -904.9604
## No. of observations = 4206
## AIC value = 1817.9208
<- coxph(Surv(timestrk, stroke) ~ I(sysbp/10) + age + sex, data = d)
cph summary(cph)
## Call:
## coxph(formula = Surv(timestrk, stroke) ~ I(sysbp/10) + age +
## sex, data = d)
##
## n= 4206, number of events= 343
##
## coef exp(coef) se(coef) z Pr(>|z|)
## I(sysbp/10) 0.207071 1.230070 0.023106 8.962 < 2e-16 ***
## age 0.082763 1.086284 0.007267 11.389 < 2e-16 ***
## sexFemale -0.505446 0.603236 0.110869 -4.559 5.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## I(sysbp/10) 1.2301 0.8130 1.1756 1.2871
## age 1.0863 0.9206 1.0709 1.1019
## sexFemale 0.6032 1.6577 0.4854 0.7497
##
## Concordance= 0.754 (se = 0.013 )
## Likelihood ratio test= 308.2 on 3 df, p=<2e-16
## Wald test = 299.8 on 3 df, p=<2e-16
## Score (logrank) test = 330.7 on 3 df, p=<2e-16
Module 5. An introduction to meta-analysis: approaches to combining the results of a number of studies.
This module provides an introduction to the meta-analysis of the results a number of studies, including prevalence, continuous outcomes, and rates. This model will also cover approaches to heterogeneity between studies, and the concepts of fixed-effects and random-effects models. We will also cover Trial Sequential Analysis approaches to meta-analysis, essentially do we need more studies?
library(meta)
.3 <- data.frame(study = c('Study 1','Study 2','Study 3'),
devent.e = c(5,20,15),
n.e = c(100,2000,300),
event.c = c(10,40,30),
n.c = c(100,2000,300))
.3 <- metabin(event.e, n.e,
meta
event.c, n.c,sm = 'RR',
data = d.3,
studlab = study)
.3 meta
## RR 95%-CI %W(fixed) %W(random)
## Study 1 0.5000 [0.1772; 1.4105] 12.5 12.8
## Study 2 0.5000 [0.2934; 0.8522] 50.0 48.6
## Study 3 0.5000 [0.2747; 0.9099] 37.5 38.5
##
## Number of studies combined: k = 3
##
## RR 95%-CI z p-value
## Fixed effect model 0.5000 [0.3447; 0.7252] -3.65 0.0003
## Random effects model 0.5000 [0.3448; 0.7251] -3.65 0.0003
##
## Quantifying heterogeneity:
## tau^2 = 0; tau = 0; I^2 = 0.0% [0.0%; 89.6%]; H = 1.00 [1.00; 3.10]
##
## Test of heterogeneity:
## Q d.f. p-value
## 0.00 2 1.0000
##
## Details on meta-analytical method:
## - Mantel-Haenszel method
## - DerSimonian-Laird estimator for tau^2
## - Mantel-Haenszel estimator used in calculation of Q and tau^2 (like RevMan 5)
forest(meta.3)
# meta-analysis - outcome of interest mortality
<- data.frame(study = c('Look (2018)',
dta 'Nielsen (2013)',
'Laurent (2005)',
'Hachimi-Idrissi (2005)',
'Bernard (2002)',
'HACA (2002)',
'Hachimi-Idrissi (2001)',
'Tiainen (2007)',
'Bernard (1997)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Nielsen (2013)'),
location = rep('Out of hospital',14),
year = c(2018,2013,2005,2005,2002,2002,2001,2007,1997,2021,2019,2021,2019,2013),
n = c(87,939,42,28,77,275,30,70,44,1850,581,939,1850,581),
event.e = c(27,226,15,6,22,56,13,8,10,465,231,465,231,226),
n.e = c(45,473,22,14,43,137,16,36,22,925,284,925,284,473),
event.c = c(33,220,11,8,23,76,13,12,17,446,247,446,247,220),
n.c = c(42,466,20,14,34,138,14,34,22,925,297,925,297,466),
target.temp = c(rep(33,5),32,34,33,33,33,33,33,33,33),
comparison = c('No-TTM','TTM-36','TTM-37','TTM-37',
'No-TTM','No-TTM','Tx>38','Tx>38',
'No-TTM','Tx>37.8','TTM-37','TTM-36','Tx>37.8','TTM-37'),
design = c('RCT','RCT','RCT','RCT','RCT',
'RCT','RCT','RCT','Non-R','RCT','RCT','RCT','RCT','RCT'),
control.target = c('Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance'))
<- dta[ order(dta$year),]
dta $id <- 1:14
dta$ratio.e <- paste(dta$event.e,'/',dta$n.e, sep = '')
dta$p.e <- paste('(', round(dta$event.e/dta$n.e,2)*100,'%)', sep = '')
dta$ratio.c <- paste(dta$event.c,'/',dta$n.c, sep = '')
dta$p.c <- paste('(', round(dta$event.c/dta$n.c,2)*100,'%)', sep = '')
dta$tx <- paste(dta$ratio.e,dta$p.e, sep = ' ')
dta$c <- paste(dta$ratio.c,dta$p.c, sep = ' ')
dta
<- metabin(event.e, n.e,
meta.all
event.c, n.c,sm = 'RR',
data = dta,
studlab = study,
byvar = control.target,
subset = id %in% c(1:8,11,13))
# ---------------------------- Forest plot ---------------------------------------------
par(cex = 1.15)
forest(meta.all, leftcols = c('studlab',
'tx','c'),
leftlabs = c('Study (Year)','Hypothermia\n (n/N)(%)',
'Control\n (n/N)(%)'),
bylab = '',
label.right = '',
label.left = '',
col.square = 'blue',
col.diamond = 'blue',
overall = F,
col.by = 'black')
Module 6. Writing for publication and presenting results at conferences.
This module will explore approaches to interpreting and presenting results of your research. We will focus on the presentation of results in manuscripts, abstract submissions, and reports. The presentation of results in the context of conference presentations will also be covered.
Table 1. Characteristics of study participants.
Women (n=1,117) | Men (n=846) | Combined (N=2,023) | p-value | |
---|---|---|---|---|
Age (years), mean (SD) | 57 (4.9) | 57 (4.9) | 57 (4.9) | 0.247 |
Education, n (%) | <0.001 | |||
- Some high school | 564 (49) | 459 (56) | 1,023 (52) | |
- high school/GED | 310 (27) | 146 (18) | 456 (23) | |
- some college/vocational school | 195 (17) | 101 (12) | 295 (15) | |
- college | 78 (7) | 114 (14) | 192 (10) | |
BMI (\(kg/m^2\)), mean (SD) | 26.5 (4.6) | 26.2 (3.3) | 26.4 (4.1) | 0.536 |
Systolic BP (mmHg), mean (SD) | 143.3 (24.9) | 135.8 (22.1) | 140.2 (24.0) | <0.001 |
Hypertension, n (%) | 570 (48) | 322 (38) | 892 (44) | <0.001 |
Current smoker, n (%) | 336 (29) | 448 (53) | 784 (39) | <0.001 |
Diabetes, n (%) | 44 (4) | 37 (4) | 81 (4) | 0.472 |
Total chol (mmol/L), mean (SD) | 14.3 (2.5) | 13.1 (2.3) | 13.8 (2.5) | <0.001 |
Glucose (mmol/L), mean (SD) | 4.7 (1.5) | 4.7 (1.7) | 4.7 (1.6) | 0.081 |
Follow-up (mths), median (IQR) | 185 (102, 278) | 198 (102, 275) | 188 (102, 276) | 0.428 |
Using the Hmisc ‘summary’ function we have created the above Table, we have included test = TRUE, so the function will compare the continuous variables using (un-paired T-Test), and the frequency data using a Pearson’s chi-squared test. The mean age of women and men are both essentially 57 years and 4.9 years, respectively, so even before comparing these mean and SD, it is obvious there is now differences between the mean values. To assess the role of change when comparing the mean values of two groups, we often use a t-test, formally known as Student’s t-test, as William S. Gosset [^5] at the time of publishing this method was employee of the Head Brewer of Guinness, his employer’s policy in publishing research results at that time prevented him from using his real name. Student’s t-test essentially is the ration of the difference between the two group mean weighted by an estimate of error. If we let the two group means be expressed as \(\bar{x_1}\) and \(\bar{x_2}\), repestively, the test statistic is estimates as:
Our practical sessions we will be using the R Statistical Language (https://www.r-project.org/) via the R Studio editor (https://www.rstudio.com/products/rstudio/download/#download). And, we suggest you load this onto your laptop prior to Week 1.
The R language is an object orientated statistical program, that can be used for all stages of the research process, including:
- The planning of a study, meta-analysis of current literature, power and sample size;
- All data cleaning and management;
- All analysis; and,
- Most important, presentation of results for publication in manuscripts (including all tables and figures).