Introduction to Biostatistics using R: Summer edition, 2023

Introduction

Welcome to an “Introduction to biostatistics using R: summer edition 2023”.

The six modules will be delivered over three weekly Wednesday morning’s seminars, between 9am and 12 midday, on the the 1st, and the 8th of February 2023. Venue: Building 41, room 153.

Week 1. Wednesday February 1st, 2023 (9am to 12 midday).

Module 1. Data good care of your data, reproducible research.

This module will introduce you to good data management practices, and the concept of reproducible research. Concepts such as data security, archiving, data cleaning, and documenting a clear audit trail of these processes will be covered.

Module 2. Preparing data for analysis and presenting descriptive statistics.

This module provides you the skills to prepare data for analysis, including various formats of data, labelling discrete levels of variables, and creating new variables. This module will also explore the presentation of descriptive statistics. When is the median and interquartile range preferred over the mean and standard deviation? And, the presentation of descriptive statistics in a manuscript being prepared for peer review.

Reading data into R.

Data sets can be loaded into R various ways. The simplist, is via a spreadsheet, such as a comma separated volume (csv). Using the ‘read.csv’ function. However, a working directory needs to be set-up.

# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')

# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
# check the names of variable.
names(framingham)
##  [1] "randid"      "sex"         "totchol"     "age"         "sysbp"      
##  [6] "diabp"       "cursmoke"    "cigpday"     "bmi"         "diabetes"   
## [11] "bpmeds"      "heartrte"    "glucose"     "educ"        "prevchd"    
## [16] "prevap"      "prevmi"      "prevstrk"    "prevhyp"     "time"       
## [21] "period"      "hdlc"        "ldlc"        "death"       "angina"     
## [26] "hospmi"      "mi_fchd"     "anychd"      "stroke"      "cvd"        
## [31] "hyperten"    "timeap"      "timemi"      "timemifc"    "timechd"    
## [36] "timestrk"    "timecvd"     "timedth"     "timehyp"     "start.date" 
## [41] "stroke.date" "death.date"
# Look at the first ten rows.
# head(framingham)

# Look at the last ten rows.
# tail(framingham)

# View help file for dataset
# install riskCommunicator package which has the Framingham dataset.
## install.packages('riskCommunicator')
library(riskCommunicator)
help(framingham)

# Create age groups with labels
framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
                                            labels = c('30-39','40-49',
                                                        '50-59','60-69',
                                                        '70+'))
# Check lower and upper values of groups
tapply(framingham$age, framingham$age.group, min)
## 30-39 40-49 50-59 60-69   70+ 
##    32    40    50    60    70
tapply(framingham$age, framingham$age.group, max)
## 30-39 40-49 50-59 60-69   70+ 
##    39    49    59    69    81
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18

# Label factors with numeric values
table(framingham$sex)
## 
##    1    2 
## 5022 6605
framingham$sex <- factor(framingham$sex, level = c(1,2), 
                                         labels = c('Male','Female'))
table(framingham$sex)
## 
##   Male Female 
##   5022   6605
table(framingham$diabetes)
## 
##     0     1 
## 11097   530
framingham$diabetes <- factor(framingham$diabetes, level = c(0,1), 
                         labels = c('No','Yes'))
table(framingham$diabetes)
## 
##    No   Yes 
## 11097   530

Our specific hypothesis is related to confirming the relationship between blood pressure and the risk of stroke. Therefore, we will subset the data set.

# remove participants with a prior stroke or tx 
# for BP and only use first visit data.
names(framingham)
##  [1] "randid"        "sex"           "totchol"       "age"          
##  [5] "sysbp"         "diabp"         "cursmoke"      "cigpday"      
##  [9] "bmi"           "diabetes"      "bpmeds"        "heartrte"     
## [13] "glucose"       "educ"          "prevchd"       "prevap"       
## [17] "prevmi"        "prevstrk"      "prevhyp"       "time"         
## [21] "period"        "hdlc"          "ldlc"          "death"        
## [25] "angina"        "hospmi"        "mi_fchd"       "anychd"       
## [29] "stroke"        "cvd"           "hyperten"      "timeap"       
## [33] "timemi"        "timemifc"      "timechd"       "timestrk"     
## [37] "timecvd"       "timedth"       "timehyp"       "start.date"   
## [41] "stroke.date"   "death.date"    "age.group"     "glucose.mmols"
## [45] "totchol.mmols"
d <- subset(framingham, bpmeds == 0 &
                         prevstrk == 0 &
                         period == 1, select = c(randid,age,totchol.mmols,
                                                 diabp,glucose.mmols,
                                                 sex,sysbp,cursmoke,educ,
                                                 cigpday,diabetes,stroke,
                                                 timestrk,death,timedth))
# add n = 1, for counting
d$n <- 1
# create follow-up in years
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
table(d$stroke.yes,d$stroke)
##      
##          0    1
##   No  3863    0
##   Yes    0  343
# variables labels and unit
library(Hmisc)
d <- upData(d, labels = c(age = 'Age',
                          totchol.mmols = 'Total cholesterol',
                          diabp = 'Diasolic BP',
                          glucose.mmols = 'BGL',
                          sex = 'Sex',
                          sysbp = 'Systolic BP',
                          cursmoke = 'Current smoker',
                          educ = 'Education level',
                          stroke = 'Stroke',
                          timestrk = 'Follow-up'),
             units = c(age = 'yrs',
                       totchol.mmols = 'mmol/L',
                       diabp = 'mmHg',
                       sysbp = 'mmol/L',        
                       timestrk = 'days'))
## Input object size:    424952 bytes;   18 variables    4206 observations
## New object size: 414288 bytes;   18 variables    4206 observations
d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
                                                  '5th, 6th, or 7th grades',
                                                  'grade school graduate no high school',
                                                  'high school, did not graduate'))

The Hmisc package will create Table 1. Including continuous and discrete variables. Frequencies are presented as “% (n)” and continuous variables as “p25/p50/p75 mean +/- SD”.

# Summary table using Hmisc package
print(summary(sex ~ age +
                    sysbp +
                    educ +
                    diabetes +
                    cursmoke +
                    timestrk,
                    method = 'reverse',
                    overall = F,
                    data = d), prmsd = T, digits = 3)
## 
## 
## Descriptive Statistics by Sex
## 
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |                                        |N   |Male                            |Female                          |
## |                                        |    |(N=1868)                        |(N=2338)                        |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Age [yrs]                               |4206|42.00/49.00/57.00  49.61+/- 8.70|42.00/49.00/56.00  49.62+/- 8.53|
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Systolic BP [mmol/L]                    |4206|118.0/128.0/141.0  130.9+/- 18.8|115.5/127.5/144.0  132.3+/- 23.2|
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |educ : 4th grade or less                |4096|           44%  (805)           |           40%  (909)           |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |    5th, 6th, or 7th grades             |    |           27%  (497)           |           32%  (724)           |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |    grade school graduate no high school|    |           13%  (228)           |           20%  (453)           |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |    high school, did not graduate       |    |           16%  (282)           |            9%  (198)           |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |diabetes : Yes                          |4206|            3%  (  56)          |            2%  (  51)          |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Current smoker                          |4206|           61%  (1135)          |           41%  ( 957)          |
## +----------------------------------------+----+--------------------------------+--------------------------------+
## |Follow-up [days]                        |4206|   5842/8766/8766  7117+/-2405  |   7705/8766/8766  7676+/-2108  |
## +----------------------------------------+----+--------------------------------+--------------------------------+

Module 3. Inferential statistics 1: Comparing two or more groups, continuous and categorical outcomes.

This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.

# SysBP and stroke
hist(d$sysbp, main = '',
              xlab = 'Systolic BP (mmHg) \n First visit')

describe(d$sysbp)
## d$sysbp : Systolic BP [mmol/L] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4206        0      226        1    131.7    23.06    104.0    108.5 
##      .25      .50      .75      .90      .95 
##    116.6    128.0    142.5    160.0    173.0 
## 
## lowest :  83.5  85.0  85.5  90.0  92.0, highest: 230.0 232.0 235.0 243.0 295.0
boxplot(sysbp ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'SBP (mmHg)',
        las = 1)

tapply(d$sysbp, d$stroke, mean)
##        0        1 
## 130.5967 143.7114
tapply(d$sysbp, d$stroke, sd)
##        0        1 
## 20.56035 25.91658

Null Hypothesis Significance Testing.

\(H_0\) Mean systolic BP among non-stroke \(=\) Mean systolic BP among stroke.

\(H_i\) Mean systolic BP among non-stroke \(\neq\) Mean systolic BP among stroke.

Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail)

# SHADE TAILS OF n-DISTRIBUTED PLOTS
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l", 
           axes = T,
           xlab = '',
           ylab = '')

lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey', 
                                                      border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht), 
                                        col = 'grey', border = NA)
                                        

abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey') 
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')

text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')

mean(d$sysbp[d$stroke == 1]) - mean(d$sysbp[d$stroke == 0])
## [1] 13.11468
t.test(sysbp ~ stroke, data = d)
## 
##  Welch Two Sample t-test
## 
## data:  sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1 
##        130.5967        143.7114
# EXAMPLE OF T-TEST
set.seed(12345)

n <- 3000 # large number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)


plot(density(a), lty = 1, col = 'blue',
                          frame = F,
                          main = '',
                          xlab = '',
                          ylab = '',
                          ylim = c(0,0.05),
                          lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)

legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
                   'CABG - mean age 62.7 yrs (SD 9.5)'),
                   bty = n,
                   lty = 1,
                   lwd = 2,
                   col = c('blue','black'))

t.test(a,b)
## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -3.4118, df = 5986.2, p-value = 0.0006496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3417023 -0.3625029
## sample estimates:
## mean of x mean of y 
##  61.76762  62.61972
n <- 30 # small number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)


plot(density(a), lty = 1, col = 'blue',
                          frame = F,
                          main = '',
                          xlab = '',
                          ylab = '',
                          ylim = c(0,0.05),
                          lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)

legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
                   'CABG - mean age 62.7 yrs (SD 9.5)'),
                   bty = n,
                   lty = 1,
                   lwd = 2,
                   col = c('blue','black'))

t.test(a,b)
## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -0.28393, df = 54.736, p-value = 0.7775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.455870  4.101878
## sample estimates:
## mean of x mean of y 
##  61.79375  62.47075

Type I and Type II error

NEJM 2000 ARDSNet lower tidal volume (6 mL/kg) versus traditional 12mL/kg. The primary outome death before discharge and breathing without assitance. Results were tx group 134/432 (31.0%) versus control 171/429 (39.8%).

setwd('~/OneDrive - University of Wollongong/Data/')
source('EpiTables V3.R')
## 
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention  Yes   No Total
##        Yes    482 1622  2104
##        No    1110 3211  4321
##        Total 1592 4833  6425
## 
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
## 
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
## 
## Risk Difference = -2.78 (95% CI: -5, -0.56)
## 
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes    95 229   324
##        No    283 400   683
##        Total 378 629  1007
## 
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
## 
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
## 
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
## 
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(134,298,171,258)
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes   134 298   432
##        No    171 258   429
##        Total 305 556   861
## 
## Rate of outcome among intervention group (p1) = 0.3102 (95% CI: 26.64, 35.39)
## 
## Rate of outcome among control group (p2) = 0.3986 (95% CI: 35.21, 44.51)
## 
## Risk Difference = -8.84 (95% CI: -15.21, -2.47)
## 
## Hazard Risk = 0.73 95% CI = 0.582 to 0.915
## Rate Ratio = 0.778 95% CI = 0.648 to 0.934
## Odds Ratio = 0.678 95% CI = 0.512 to 0.898
## chi2 p-value = 0.0067
## Fisher's exact p-value = 0.0069
## ------------------------------------------------------------------------

An example of potentially Type II error - NEJM 2018 EOLIA ECMO for ARDS primary outome 60-day mortality tx 44/124 (35%) control 57/125 (46%)

epiTab(44,80,57,68)
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes    44  80   124
##        No     57  68   125
##        Total 101 148   249
## 
## Rate of outcome among intervention group (p1) = 0.3548 (95% CI: 26.98, 43.99)
## 
## Rate of outcome among control group (p2) = 0.456 (95% CI: 36.78, 54.42)
## 
## Risk Difference = -10.12 (95% CI: -22.31, 2.07)
## 
## Hazard Risk = 0.72 95% CI = 0.486 to 1.067
## Rate Ratio = 0.778 95% CI = 0.574 to 1.055
## Odds Ratio = 0.656 95% CI = 0.394 to 1.091
## chi2 p-value = 0.1041
## Fisher's exact p-value = 0.1217
## ------------------------------------------------------------------------
# original sample size calculation was based on 40% vs 60%
# 60-day mortality
power.prop.test(p1 = 0.4, p2 = 0.6, sig.level = 0.05, power = 0.8)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 96.92364
##              p1 = 0.4
##              p2 = 0.6
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
p1 <- 0.6
q1 <- 1 - p1
p2 <- 0.4
q2 <- 1 - p2
p.bar <- (p1 + p2)/2
q.bar <- 1 - p.bar

# Sample size
n <- 2*((p.bar * q.bar) * (1.96 + 0.84)^2)  / (p1 - p2)^2
n
## [1] 98
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, power = 0.8)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 378.5477
##              p1 = 0.36
##              p2 = 0.46
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, n = 125) # power
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 125
##              p1 = 0.36
##              p2 = 0.46
##       sig.level = 0.05
##           power = 0.3615169
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
# Age and stroke
boxplot(age ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'Age (years)',
        las = 1)

tapply(d$age, d$stroke, mean)
##        0        1 
## 49.13461 55.07289
tapply(d$age, d$stroke, sd)
##        0        1 
## 8.504941 7.768804
# Age and Systolic BP
plot(d$age, d$sysbp, pch = 16,
                     xlab = 'Age (yrs)',
                     ylab = 'Systolic BP (mmHg)',
                     main = '')

plot(d$age[d$stroke == 0], d$sysbp[d$stroke == 0], pch = 16,
                     col = 'blue',
                     xlab = 'Age (yrs)',
                     ylab = 'Systolic BP (mmHg)',
                     main = '',
                     ylim = c(80,300),
                     las = 1)
points(d$age[d$stroke == 1], d$sysbp[d$stroke == 1], pch = 16,
                             col = 'red')
abline(lsfit(d$age,d$sysbp), lty = 2, lwd = 2)

# Cigs per day and stroke
boxplot(cigpday ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'Age (years)',
        las = 1)

tapply(d$cigpday, d$stroke, mean, na.rm = T)
##        0        1 
## 9.105085 9.170088
tapply(d$cigpday, d$stroke, sd, na.rm = T)
##        0        1 
## 12.00131 12.06747
# table of rates of stroke based 
# systolic HT
library(Epi)
d$sysHT <- ifelse(d$sysbp > 140, 1, 0)
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
                                        'Stroke (n)' = sum(stroke),
                                        'Stroke (%)' = ratio(stroke,n,100),
                                        'Stroke (rate)' = ratio(stroke,timestrk,365.25),
                                        'follow-up (Total years)' = sum(timestrk.yrs)),
                                         data = d,
                                         margin = T),
                                         digits = 3)
##  --------------------------------------------------- 
##  sysHT          N  Stroke  Stroke  Stroke follow-up  
##                       (n)     (%)  (rate)    (Total  
##                                              years)  
##  --------------------------------------------------- 
##  No      3035.000 178.000   5.865   0.003 64518.680  
##  Yes     1171.000 165.000  14.091   0.008 21014.623  
##                                                      
##  Total   4206.000 343.000   8.155   0.004 85533.303  
##  ---------------------------------------------------
tapply(d$timestrk.yrs, d$sysHT, sum)
##       No      Yes 
## 64518.68 21014.62
# rate with 95% CI
library(epiDisplay)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)
##     Strokes    N
## No      178 3035
## Yes     165 1171
ci.binomial(stroke.tab$Strokes, stroke.tab$N)
##   events total probability          se exact.lower95ci exact.upper95ci
## 1    178  3035  0.05864909 0.004265079      0.05054965      0.06761068
## 2    165  1171  0.14090521 0.010167300      0.12146029      0.16216781
# incidence based on person years
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)
##     Strokes     pyrs
## No      178 64518.68
## Yes     165 21014.62
ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)
##   events person.time   incidence           se exact.lower95ci exact.upper95ci
## 1    178    64518.68 0.002758891 0.0002067876     0.002367128     0.003197555
## 2    165    21014.62 0.007851676 0.0006112521     0.006697479     0.009147202
# risk of stroke
# Rate Ratio
setwd('~/OneDrive - NSW Health Department/Data/')
source('epiTables V3.R')
## 
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention  Yes   No Total
##        Yes    482 1622  2104
##        No    1110 3211  4321
##        Total 1592 4833  6425
## 
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
## 
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
## 
## Risk Difference = -2.78 (95% CI: -5, -0.56)
## 
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes    95 229   324
##        No    283 400   683
##        Total 378 629  1007
## 
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
## 
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
## 
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
## 
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
epiTab(165,1171-165,178,3035-178)
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes   No Total
##        Yes   165 1006  1171
##        No    178 2857  3035
##        Total 343 3863  4206
## 
## Rate of outcome among intervention group (p1) = 0.1409 (95% CI: 12.1, 16.09)
## 
## Rate of outcome among control group (p2) = 0.0586 (95% CI: 5.03, 6.7)
## 
## Risk Difference = 8.23 (95% CI: 6.06, 10.39)
## 
## Hazard Risk = 2.513 95% CI = 2.033 to 3.106
## Rate Ratio = 2.403 95% CI = 1.966 to 2.937
## Odds Ratio = 2.633 95% CI = 2.105 to 3.293
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
(sum(d$stroke[d$sysHT == 'Yes']) / sum(d$n[d$sysHT == 'Yes'])) / 
(sum(d$stroke[d$sysHT == 'No']) / sum(d$n[d$sysHT == 'No']))
## [1] 2.402513

WEEK 2. Wednesday February 8th, 2023 (9am to 12 midday).

Module 4. Inferential statistics 2: Correlation and regression, adjusting for confounders.

This module will initially explore correlation and simple linear regression. We will then extend these concepts to multi-linear regression, and generalised-linear-models that are use to analysis various outcomes, such as continuous, binomial, and rates.

# crude models 
glm.stroke <- glm(cbind(stroke,n) ~ sysHT, data = d, family = binomial)
print(logistic.display(glm.stroke), digits = 3)
## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                  OR(95%CI)     P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.4 (1.92,3)  < 0.001        < 0.001   
##                                                         
## Log-likelihood = -949.6711
## No. of observations = 4206
## AIC value = 1903.3422
cph <- coxph(Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
summary(cph)
## Call:
## coxph(formula = Surv(timestrk, stroke) ~ I(sysbp/10), data = d)
## 
##   n= 4206, number of events= 343 
## 
##                coef exp(coef) se(coef)     z Pr(>|z|)    
## I(sysbp/10) 0.28278   1.32681  0.02013 14.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## I(sysbp/10)     1.327     0.7537     1.275      1.38
## 
## Concordance= 0.684  (se = 0.015 )
## Likelihood ratio test= 157  on 1 df,   p=<2e-16
## Wald test            = 197.3  on 1 df,   p=<2e-16
## Score (logrank) test = 196.6  on 1 df,   p=<2e-16
# adjusted models 
glm.stroke <- glm(cbind(stroke,n) ~ sysHT + age + sex, data = d, family = binomial)
print(logistic.display(glm.stroke), digits = 3)
## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                     crude OR(95%CI)   adj. OR(95%CI)    P(Wald's test)
## sysHT: Yes vs No    2.4 (1.92,3)      1.69 (1.33,2.14)  < 0.001       
##                                                                       
## age (cont. var.)    1.08 (1.06,1.09)  1.07 (1.05,1.08)  < 0.001       
##                                                                       
## sex: Female vs Male 0.84 (0.68,1.05)  0.79 (0.63,0.99)  0.045         
##                                                                       
##                     P(LR-test)
## sysHT: Yes vs No    < 0.001   
##                               
## age (cont. var.)    < 0.001   
##                               
## sex: Female vs Male < 0.001   
##                               
## Log-likelihood = -904.9604
## No. of observations = 4206
## AIC value = 1817.9208
cph <- coxph(Surv(timestrk, stroke) ~ I(sysbp/10) + age + sex, data = d)
summary(cph)
## Call:
## coxph(formula = Surv(timestrk, stroke) ~ I(sysbp/10) + age + 
##     sex, data = d)
## 
##   n= 4206, number of events= 343 
## 
##                  coef exp(coef)  se(coef)      z Pr(>|z|)    
## I(sysbp/10)  0.207071  1.230070  0.023106  8.962  < 2e-16 ***
## age          0.082763  1.086284  0.007267 11.389  < 2e-16 ***
## sexFemale   -0.505446  0.603236  0.110869 -4.559 5.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## I(sysbp/10)    1.2301     0.8130    1.1756    1.2871
## age            1.0863     0.9206    1.0709    1.1019
## sexFemale      0.6032     1.6577    0.4854    0.7497
## 
## Concordance= 0.754  (se = 0.013 )
## Likelihood ratio test= 308.2  on 3 df,   p=<2e-16
## Wald test            = 299.8  on 3 df,   p=<2e-16
## Score (logrank) test = 330.7  on 3 df,   p=<2e-16

Module 5. An introduction to meta-analysis: approaches to combining the results of a number of studies.

This module provides an introduction to the meta-analysis of the results a number of studies, including prevalence, continuous outcomes, and rates. This model will also cover approaches to heterogeneity between studies, and the concepts of fixed-effects and random-effects models. We will also cover Trial Sequential Analysis approaches to meta-analysis, essentially do we need more studies?

library(meta)

d.3 <- data.frame(study = c('Study 1','Study 2','Study 3'),
                  event.e = c(5,20,15),
                  n.e = c(100,2000,300),
                  event.c = c(10,40,30),
                  n.c = c(100,2000,300))

meta.3 <- metabin(event.e, n.e,
                    event.c, n.c,
                    sm = 'RR',
                    data = d.3,
                    studlab = study)

meta.3
##             RR           95%-CI %W(fixed) %W(random)
## Study 1 0.5000 [0.1772; 1.4105]      12.5       12.8
## Study 2 0.5000 [0.2934; 0.8522]      50.0       48.6
## Study 3 0.5000 [0.2747; 0.9099]      37.5       38.5
## 
## Number of studies combined: k = 3
## 
##                          RR           95%-CI     z p-value
## Fixed effect model   0.5000 [0.3447; 0.7252] -3.65  0.0003
## Random effects model 0.5000 [0.3448; 0.7251] -3.65  0.0003
## 
## Quantifying heterogeneity:
##  tau^2 = 0; tau = 0; I^2 = 0.0% [0.0%; 89.6%]; H = 1.00 [1.00; 3.10]
## 
## Test of heterogeneity:
##     Q d.f. p-value
##  0.00    2  1.0000
## 
## Details on meta-analytical method:
## - Mantel-Haenszel method
## - DerSimonian-Laird estimator for tau^2
## - Mantel-Haenszel estimator used in calculation of Q and tau^2 (like RevMan 5)
forest(meta.3)

# meta-analysis - outcome of interest mortality
dta <- data.frame(study = c('Look (2018)',  
                            'Nielsen (2013)',  
                            'Laurent (2005)', 
                            'Hachimi-Idrissi (2005)',
                            'Bernard  (2002)',  
                            'HACA (2002)',
                            'Hachimi-Idrissi (2001)',
                            'Tiainen (2007)',
                            'Bernard (1997)',
                            'Dankiewicz (2021)',
                            'Lascarrou (2019)',
                            'Dankiewicz (2021)',
                            'Lascarrou (2019)',
                            'Nielsen (2013)'),
                  location = rep('Out of hospital',14),
                  year = c(2018,2013,2005,2005,2002,2002,2001,2007,1997,2021,2019,2021,2019,2013),
                  n = c(87,939,42,28,77,275,30,70,44,1850,581,939,1850,581),
                  event.e = c(27,226,15,6,22,56,13,8,10,465,231,465,231,226),
                  n.e = c(45,473,22,14,43,137,16,36,22,925,284,925,284,473),
                  event.c = c(33,220,11,8,23,76,13,12,17,446,247,446,247,220),
                  n.c = c(42,466,20,14,34,138,14,34,22,925,297,925,297,466),
                  target.temp = c(rep(33,5),32,34,33,33,33,33,33,33,33),
                  comparison = c('No-TTM','TTM-36','TTM-37','TTM-37',
                                 'No-TTM','No-TTM','Tx>38','Tx>38',
                                 'No-TTM','Tx>37.8','TTM-37','TTM-36','Tx>37.8','TTM-37'),
                  design = c('RCT','RCT','RCT','RCT','RCT',
                             'RCT','RCT','RCT','Non-R','RCT','RCT','RCT','RCT','RCT'),
                  control.target = c('Trials without explicit normothermia / fever avoidance',
                                     'Trials with explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials with explicit normothermia / fever avoidance',
                                     'Trials with explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance',
                                     'Trials without explicit normothermia / fever avoidance'))
dta <- dta[ order(dta$year),]
dta$id <- 1:14
dta$ratio.e <- paste(dta$event.e,'/',dta$n.e, sep = '')
dta$p.e <- paste('(', round(dta$event.e/dta$n.e,2)*100,'%)', sep = '') 
dta$ratio.c <- paste(dta$event.c,'/',dta$n.c, sep = '')
dta$p.c <- paste('(', round(dta$event.c/dta$n.c,2)*100,'%)', sep = '') 
dta$tx <- paste(dta$ratio.e,dta$p.e, sep = ' ') 
dta$c <- paste(dta$ratio.c,dta$p.c, sep = ' ')   

meta.all <- metabin(event.e, n.e,
                    event.c, n.c,
                    sm = 'RR',
                    data = dta,
                    studlab = study,
                    byvar = control.target,
                    subset = id %in% c(1:8,11,13))

# ---------------------------- Forest plot ---------------------------------------------

par(cex = 1.15) 

forest(meta.all, leftcols = c('studlab',
                              'tx','c'),
                 leftlabs = c('Study (Year)','Hypothermia\n (n/N)(%)',
                                             'Control\n (n/N)(%)'),
                 bylab = '',
                 label.right = '',
                 label.left = '',
                 col.square = 'blue',
                 col.diamond = 'blue',
                 overall = F,
                 col.by = 'black')

Module 6. Writing for publication and presenting results at conferences.

This module will explore approaches to interpreting and presenting results of your research. We will focus on the presentation of results in manuscripts, abstract submissions, and reports. The presentation of results in the context of conference presentations will also be covered.

Table 1. Characteristics of study participants.

Women (n=1,117) Men (n=846) Combined (N=2,023) p-value
Age (years), mean (SD) 57 (4.9) 57 (4.9) 57 (4.9) 0.247
Education, n (%) <0.001
- Some high school 564 (49) 459 (56) 1,023 (52)
- high school/GED 310 (27) 146 (18) 456 (23)
- some college/vocational school 195 (17) 101 (12) 295 (15)
- college 78 (7) 114 (14) 192 (10)
BMI (\(kg/m^2\)), mean (SD) 26.5 (4.6) 26.2 (3.3) 26.4 (4.1) 0.536
Systolic BP (mmHg), mean (SD) 143.3 (24.9) 135.8 (22.1) 140.2 (24.0) <0.001
Hypertension, n (%) 570 (48) 322 (38) 892 (44) <0.001
Current smoker, n (%) 336 (29) 448 (53) 784 (39) <0.001
Diabetes, n (%) 44 (4) 37 (4) 81 (4) 0.472
Total chol (mmol/L), mean (SD) 14.3 (2.5) 13.1 (2.3) 13.8 (2.5) <0.001
Glucose (mmol/L), mean (SD) 4.7 (1.5) 4.7 (1.7) 4.7 (1.6) 0.081
Follow-up (mths), median (IQR) 185 (102, 278) 198 (102, 275) 188 (102, 276) 0.428

Using the Hmisc ‘summary’ function we have created the above Table, we have included test = TRUE, so the function will compare the continuous variables using (un-paired T-Test), and the frequency data using a Pearson’s chi-squared test. The mean age of women and men are both essentially 57 years and 4.9 years, respectively, so even before comparing these mean and SD, it is obvious there is now differences between the mean values. To assess the role of change when comparing the mean values of two groups, we often use a t-test, formally known as Student’s t-test, as William S. Gosset [^5] at the time of publishing this method was employee of the Head Brewer of Guinness, his employer’s policy in publishing research results at that time prevented him from using his real name. Student’s t-test essentially is the ration of the difference between the two group mean weighted by an estimate of error. If we let the two group means be expressed as \(\bar{x_1}\) and \(\bar{x_2}\), repestively, the test statistic is estimates as:

Our practical sessions we will be using the R Statistical Language (https://www.r-project.org/) via the R Studio editor (https://www.rstudio.com/products/rstudio/download/#download). And, we suggest you load this onto your laptop prior to Week 1.

The R language is an object orientated statistical program, that can be used for all stages of the research process, including:

  1. The planning of a study, meta-analysis of current literature, power and sample size;
  2. All data cleaning and management;
  3. All analysis; and,
  4. Most important, presentation of results for publication in manuscripts (including all tables and figures).