Quantitative Research Methodology Workshops
Welcome to our 2021 quantitative methods workshop.
The program runs over three sessions.
Workshop 1: Wednesday 22 September 2021, 9:00am–12:00pm. Research design – planning a study and determining an appropriate sample size: What is the appropriate study design for your project? What data will you need to collect, and how many study participants do you need?
Workshop 2: Wednesday 29 September 2021. Preparing data for analysis and descriptive statistics: how to prepare data for analysis and present the characteristics of your study population, and how to prepare tables and figures for publication.
Workshop 3: Wednesday 29 September 2021. Inferential statistics and combining data from multiple studies in a meta-analysis: comparing groups and adjusting for potential confounding factors. How do you combine results from multiple studies in a meta-analysis to summarize current evidence?
For our practical sessions we will be using the R statistical language (https://www.r-project.org/) via the RStudio editor (https://www.rstudio.com/products/rstudio/download/#download).
The R language is an object-oriented statistical programming language that can be used at all stages of the research process, including:
- the planning of a study, including meta-analysis of the current literature and sample size calculations;
- the cleaning and management of study data;
- the analysis; and,
- the preparation of manuscripts for publication - including all tables and figures.
Workshop 1: Research design – planning a study and determining an appropriate sample size
Research design: What is the most appropriate study design for your research aim?
- What is the appropriate study design for your project?
- What data will you need to collect, and
- How many study participants do you need?
What type of study are you and your team planning to undertake? Is your project aimed at evaluating the effectiveness of a new intervention? Identifying risk factors for a specific outcome of interest? Are you interested in describing the burden of a problem in a specific population at a specific time? Or do you want to evaluate the accuracy of a screening or diagnostic test? Depending on the specific aim of your project, consideration needs to be given to the study design best able to address it.
Table 1. Research aim, suitable study design, and outcome measure.
| Aim of study | Best study design | Outcomes measures |
|---|---|---|
| Effectiveness of a new intervention | Randomised controlled trial | Relative Risk Reduction, Absolute Risk Reduction, Risk Difference, and confidence intervals |
| Risk factor for disease | Cohort or Case-control study (rare outcome) | Relative Risk, Odds Ratio for exposure, and confidence intervals |
| Burden of a disease among a specific population | Cohort for incidence (short-term event, may result in death); cross-sectional survey (disease of long duration) | Incidence, prevalence, and confidence intervals |
| Development of diagnostic test | Case-control | Sensitivity, Specificity, Area Under the ROC Curve, and confidence intervals |
An example of a randomised controlled trial to assess the effectiveness of an intervention.
Previous trials involving patients with acute respiratory distress syndrome (ARDS) had failed to show a beneficial effect of prone positioning during mechanical ventilation on mortality. However, meta-analyses had suggested that survival was significantly improved with prone positioning, compared with supine positioning, among patients with severely hypoxemic ARDS at the time of randomisation. As a result, Guérin et al.1 conducted a randomised controlled trial to assess the effectiveness of prone positioning among patients with severe ARDS.
The study was a multicenter, prospective, randomised, controlled trial, randomly assigning 466 patients with severe ARDS to undergo prone-positioning sessions of at least 16 hours or to be left in the supine position. Severe ARDS was defined as a ratio of the partial pressure of arterial oxygen (Pa\(O_2\)) to the fraction of inspired oxygen (Fi\(O_2\)) of less than 150 mm Hg, with an Fi\(O_2\) of at least 0.6, a positive end-expiratory pressure of at least 5 cm of water, and a tidal volume close to 6 mL per kilogram of predicted body weight. The primary outcome was the proportion of patients who died from any cause within 28 days after randomisation.
A total of 237 patients were assigned to the prone group and 229 patients to the supine group. The 28-day mortality was 16.0% (38/237) in the prone group and 32.8% (75/229) in the supine group. The ratio of these two risks is 0.49 (0.160/0.328).
Table 2. Prone positioning in severe ARDS and 28-day mortality: risk ratio (RR), risk difference (RD), and associated 95% confidence intervals (CI).
| Outcome | Supine Group (n=229) | Prone Group (n=237) | estimate (95% CI) |
|---|---|---|---|
| Mortality at 28 days, n | 75 | 38 | RR = 0.49 (0.35, 0.69) |
| % | 32.8 | 16.0 | RD = -16.72 (-24.38, -9.05) |
The risk ratio estimate from Table 2 suggests that prone positioning reduced the risk of 28-day mortality by 51% (1 - 0.49), with the 95% confidence interval indicating the true risk ratio lies between 0.35 and 0.69. The absolute risk difference (RD) estimates a 16.7 percentage-point reduction in risk among prone patients compared with supine patients (0.160 - 0.328), with the true difference estimated to lie between -24.4 and -9.1 percentage points.
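Before turning to a package, the RR and its Wald 95% CI can be checked by hand in base R. The following is a minimal sketch using only the counts from Table 2; the standard error of the log RR is \(\sqrt{1/a - 1/n_1 + 1/b - 1/n_0}\).
a <- 38; n1 <- 237 # deaths and total, prone group
b <- 75; n0 <- 229 # deaths and total, supine group
# risk ratio, with the 95% CI computed on the log scale
rr <- (a/n1)/(b/n0)
se.log.rr <- sqrt(1/a - 1/n1 + 1/b - 1/n0)
round(c(RR = rr, lower = exp(log(rr) - 1.96*se.log.rr), upper = exp(log(rr) + 1.96*se.log.rr)), 2)
## RR lower upper
## 0.49 0.35 0.69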
Using the epiR package.
The epiR package allows summary data to be entered directly to estimate the RR, RD and associated 95% CIs.
library(epiR)
epi.2by2(dat = c(38,237-38,75,229-75), method = 'cohort.count', units = 100)
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 38 199 237 16.0 0.191
## Exposed - 75 154 229 32.8 0.487
## Total 113 353 466 24.2 0.320
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.49 (0.35, 0.69)
## Odds ratio 0.39 (0.25, 0.61)
## Attrib risk in the exposed * -16.72 (-24.38, -9.05)
## Attrib fraction in the exposed (%) -104.26 (-188.54, -44.60)
## Attrib risk in the population * -8.50 (-15.72, -1.28)
## Attrib fraction in the population (%) -35.06 (-52.33, -19.75)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 17.719 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
An example of a cohort study.
The British Doctors Study (1951-1966): smoking and death following a myocardial infarction.
Table 3. Smoking status and death following a myocardial infarction.2
| Outcome | Smokers | Non-smokers | Estimate (95% CI) |
|---|---|---|---|
| no. of events (person-years) | 630 (142,247) | 101 (39,220) | RR = 1.72 (1.39, 2.14) |
| rate per 1,000 person-years | 4.4 | 2.6 | RD = 1.85 (1.24, 2.46) |
This indicates a 72% increased risk of death following an AMI among smokers versus non-smokers. The RD indicates approximately two extra AMI-related deaths for every 1,000 man-years among smokers compared with non-smokers.
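The incidence rate ratio and its Wald 95% CI can likewise be reproduced by hand from the events and person-years in Table 3; a minimal sketch (the standard error of the log rate ratio is \(\sqrt{1/a + 1/b}\), where \(a\) and \(b\) are the two event counts):
a <- 630; pt1 <- 142247 # deaths and person-years, smokers
b <- 101; pt0 <- 39220 # deaths and person-years, non-smokers
irr <- (a/pt1)/(b/pt0)
se.log.irr <- sqrt(1/a + 1/b)
round(c(IRR = irr, lower = exp(log(irr) - 1.96*se.log.irr), upper = exp(log(irr) + 1.96*se.log.irr)), 2)
## IRR lower upper
## 1.72 1.39 2.12
The upper limit differs slightly from the 2.14 reported by epi.2by2 below, which computes the rate ratio interval by a different method.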
epi.2by2(dat = c(630,142247,101,39220), method = 'cohort.time', units = 1000)
## Outcome + Time at risk Inc rate *
## Exposed + 630 142247 4.43
## Exposed - 101 39220 2.58
## Total 731 181467 4.03
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc rate ratio 1.72 (1.39, 2.14)
## Attrib rate in the exposed * 1.85 (1.24, 2.46)
## Attrib fraction in the exposed (%) 41.85 (28.16, 53.35)
## Attrib rate in the population * 1.45 (0.87, 2.03)
## Attrib fraction in the population (%) 36.07 (27.75, 43.94)
## -------------------------------------------------------------------
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 1000 units of population time at risk
Various sample size calculations
In the next section we give some examples of sample size calculations, and the associated R code, for various study designs.
Sample size for a prevalence study.
\(N\) = \(\frac{Z_\alpha^2pq}{(d)^2}\)
Where \(Z_\alpha\) is the z-score for the desired confidence level (usually 95%, and therefore 1.96), \(p\) is the expected prevalence, \(q = 1 - p\), and \(d\) is the margin of error.
For example, the sample size for a study to estimate a prevalence of 11%, with a margin of error of 5%:
\(N\) = \(\frac{1.96^2*0.11*0.89}{(0.05)^2}\) = 150.43
Or in R:
library(epiDisplay)
n.for.survey(p = 0.11, delta = 0.05, alpha = 0.05)
##
## Sample size for survey.
## Assumptions:
## Proportion = 0.11
## Confidence limit = 95 %
## Delta = 0.05 from the estimate.
##
## Sample size = 150
What if the prevalence is instead thought to be around 55%, and a 2% margin of error is required?
n.for.survey(p = 0.55, delta = 0.02, alpha = 0.05)
##
## Sample size for survey.
## Assumptions:
## Proportion = 0.55
## Confidence limit = 95 %
## Delta = 0.02 from the estimate.
##
## Sample size = 2377
Sample size for comparing proportions between two groups.
\(N\) = \(\frac{2\bar{p}\bar{q}(Z_\alpha + Z_\beta)^2}{(p_1 - p_2)^2}\)
Where \(p_1\) and \(p_2\) are the expected proportions with the outcome of interest in the intervention and control groups, \(\bar{p}\) is the average of these two proportions, and \(\bar{q} = 1 - \bar{p}\). \(Z_\alpha\) is the z-score for the Type I error rate (1.96 for 5%, two-tailed) and \(Z_\beta\) is the z-score for the Type II error rate (1.28 for 90% power).
The sample size for the prone positioning study by Guérin et al.3 was based on a baseline mortality of 60%, which the intervention was proposed to reduce to 45%, with a Type I error rate of 5% and a Type II error rate of 10% (alpha = 0.05, beta = 0.10, power 90%).
\(N\) = \(\frac{2 \times \frac{0.60 + 0.45}{2} \times (1-\frac{0.60 + 0.45}{2}) \times (1.96 + 1.28)^2}{(0.60 - 0.45)^2}\) = 233 study participants in each group
power.prop.test(p1 = 0.60, p2 = 0.45, sig.level = 0.05, power = 0.90)
##
## Two-sample comparison of proportions power calculation
##
## n = 230.8303
## p1 = 0.6
## p2 = 0.45
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
p <- (0.60 + 0.45)/2
q <- 1-p
2*p*q * (1.96 + 1.28)^2/(0.60 - 0.45)^2
## [1] 232.6968
To illustrate Type II error, consider a situation where the prone intervention trial had enrolled only 50 patients in each arm of the study.
epi.2by2(dat = c(8,50-8,16,50-16), method = 'cohort.count', units = 100)
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 8 42 50 16 0.190
## Exposed - 16 34 50 32 0.471
## Total 24 76 100 24 0.316
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.50 (0.24, 1.06)
## Odds ratio 0.40 (0.15, 1.06)
## Attrib risk in the exposed * -16.00 (-32.45, 0.45)
## Attrib fraction in the exposed (%) -100.00 (-324.56, 5.79)
## Attrib risk in the population * -8.00 (-23.40, 7.40)
## Attrib fraction in the population (%) -33.33 (-72.80, -2.88)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 3.509 Pr>chi2 = 0.061
## Fisher exact test that OR = 1: Pr>chi2 = 0.100
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
Even though the same estimated effect is observed (RR = 0.50), the upper 95% confidence limit now crosses 1.0 (the null effect), and the p-value from the Pearson chi-squared test is now 0.06.
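The same point can be made prospectively with power.prop.test: holding the design proportions at 32% versus 16%, a trial of 50 patients per arm achieves only about 46% power, well short of the 90% targeted by the full trial.
# power achieved with only 50 patients per arm
power.prop.test(n = 50, p1 = 0.32, p2 = 0.16, sig.level = 0.05)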
Now let’s look at an example requiring a larger trial, as the event rate was much lower than in the prone study. The PREDIMED trial of a Mediterranean diet to reduce the risk of cardiovascular disease (CVD)4 was planned with a baseline four-year risk of 12% for the outcome of interest (CVD), and it was proposed that the intervention would reduce this risk by 20% (0.12 * 0.8 = 0.096). The Type I and Type II error rates were 5% and 20%, respectively (alpha = 0.05, power 80%).
\(N\) = \(\frac{(0.12*0.88 + 0.096*0.904) * (1.96 + 0.84)^2}{(0.12 - 0.096)^2}\) = 2618.56 study participants in each group
Or using R:
power.prop.test(p1 = 0.12, p2 = 0.12*0.8, sig.level = 0.05, power = 0.8)
##
## Two-sample comparison of proportions power calculation
##
## n = 2624.271
## p1 = 0.12
## p2 = 0.096
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Sample size for comparing the means of two groups.
\(N\) = \(\frac{2(Z_\alpha + Z_\beta)^2\sigma^2}{\delta^2}\)
Where \(\delta\) is the difference between group means (\(\mu_1 - \mu_2\)), \(\sigma\) is the standard deviation of the outcome of interest in the population, and \(Z_\alpha\) and \(Z_\beta\) are the z-scores for the Type I (0.05) and Type II (0.20) error rates, respectively (1.96 and 0.841).
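These notes do not include an R example for this design, but base R’s power.t.test covers it. As an illustration, the 5-unit difference in means and standard deviation of 10 below are assumed values, not taken from any study cited here:
# assumed values: difference in means 5, SD 10, alpha 0.05, power 80%
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)
# hand check against the formula above (normal approximation)
2 * (1.96 + 0.841)^2 * 10^2 / 5^2
## [1] 62.76481
power.t.test returns n of approximately 64 per group, slightly larger than the normal-approximation formula because it accounts for the t distribution.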
Sample size for the correlation of two normally distributed variables.
\(N\) = \(\left(\frac{Z_\alpha + Z_\beta}{C}\right)^2 + 3\)
Where \(C = 0.5 \times \log\left(\frac{1+r}{1-r}\right)\) (Fisher’s z transformation), and \(r\) is the expected correlation coefficient. \(Z_\alpha\) and \(Z_\beta\) are the z-scores for the Type I and Type II error rates, respectively.
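Again base R is sufficient. As an assumed example, for an expected correlation of r = 0.3 with two-tailed alpha = 0.05 and 80% power:
r <- 0.3
C <- 0.5 * log((1 + r)/(1 - r)) # Fisher's z transformation
ceiling(((1.96 + 0.841)/C)^2 + 3)
## [1] 85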
Sample size for one group proportion, non-inferiority5.
\(N\) = \(\frac{(Z_\alpha + Z_\beta)^2\,p(1-p)}{(p - p_0 - \delta)^2}\)
Where \(\delta\) is the non-inferiority margin, and \(Z_\alpha\) and \(Z_\beta\) are the z-scores for the Type I (0.05, one-sided: 1.645) and Type II (0.20: 0.841) error rates, respectively. \(p\) is the expected (current) event rate and \(p_0\) is the reference rate.
For example, consider a non-inferiority trial of femoral PICCs versus other central venous catheters (CVCs): the estimated current rate of adverse events is 15%, the non-inferiority margin is 5%, and the intervention is potentially superior, reducing the rate to 10%.
library(TrialSize)
library(epiDisplay)
# Non-inferiority sample size
OneSampleProportion.NIS(alpha = 0.05, beta = 0.2,
p = 0.15,
delta = 0.1,
differ = 0.05)
## [1] 315.3104
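A hand check against the formula reproduces the TrialSize result; note that \(Z_\alpha\) is one-sided here (1.645 for alpha = 0.05), and the denominator works out to \(0.05^2\) for these inputs.
# hand check: one-sided Z for alpha = 0.05 plus Z for 80% power
(qnorm(0.95) + qnorm(0.80))^2 * 0.15 * (1 - 0.15) / (0.10 - 0.05)^2
## [1] 315.3104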
Reading data into R.
Data sets can be loaded into R in various ways. The simplest is from a spreadsheet saved as a comma-separated values (CSV) file, using the read.csv function. A working directory also needs to be set so that R knows where to find the file.
# set working directory
setwd('~/OneDrive - Western Sydney University/Data/')
# read dataset into R, naming the data set dta.
dta <- read.csv(file = 'framingham.csv', header = T, sep = ',')
# check the names of the columns
names(dta)
## [1] "male" "age" "education" "currentSmoker"
## [5] "cigsPerDay" "BPMeds" "prevalentStroke" "prevalentHyp"
## [9] "diabetes" "totChol" "sysBP" "diaBP"
## [13] "BMI" "heartRate" "glucose" "TenYearCHD"
Loading and using R packages
The first time you use an R package it needs to be installed using the "install.packages" function. Once installed, a package is loaded at the start of each R session using "library(package.name)". Common packages for creating summary tables are Frank Harrell's Hmisc package and Bendix Carstensen's Epi package.
# Frank Harrell's Hmisc package
## install.packages('Hmisc')
# Bendix Carstensen's Epi package
## install.packages('Epi')
Describing data
# examples of describing data
library(Hmisc)
library(Epi)
# create an ID variable from 1 to the number of rows
dta$id <- 1:nrow(dta)
# describe a single variable
describe(dta$age)
## dta$age
## n missing distinct Info Mean Gmd .05 .10
## 4240 0 39 0.999 49.58 9.85 37 39
## .25 .50 .75 .90 .95
## 42 49 56 62 64
##
## lowest : 32 33 34 35 36, highest: 66 67 68 69 70
# Mean and SD age
mean(dta$age)
## [1] 49.58019
sd(dta$age)
## [1] 8.572942
# median and interquartile range (IQR)
quantile(dta$age, probs = c(0.5,0.25,0.75))
## 50% 25% 75%
## 49 42 56
# mean and sd of age based on sex
tapply(dta$age, dta$male, mean)
## 0 1
## 49.79587 49.29341
tapply(dta$age, dta$male, sd)
## 0 1
## 8.597883 8.533584
# describe the whole data file
## describe(dta)
# create a variable equal 1 for counting purposes
dta$n <- 1
# Create a new variable called sex and label numeric values (0 = female, 1 = male)
dta$sex <- factor(dta$male, levels = 0:1, labels = c('female','male'))
table(dta$sex)
##
## female male
## 2420 1820
# look at first 10 records
head(dta, 10)
## male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 1 1 39 4 0 0 0 0
## 2 0 46 2 0 0 0 0
## 3 1 48 1 1 20 0 0
## 4 0 61 3 1 30 0 0
## 5 0 46 3 1 23 0 0
## 6 0 43 2 0 0 0 0
## 7 0 63 1 0 0 0 0
## 8 0 45 2 1 20 0 0
## 9 1 52 1 0 0 0 0
## 10 1 43 1 1 30 0 0
## prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
## 1 0 0 195 106.0 70 26.97 80 77 0
## 2 0 0 250 121.0 81 28.73 95 76 0
## 3 0 0 245 127.5 80 25.34 75 70 0
## 4 1 0 225 150.0 95 28.58 65 103 1
## 5 0 0 285 130.0 84 23.10 85 85 0
## 6 1 0 228 180.0 110 30.30 77 99 0
## 7 0 0 205 138.0 71 33.11 60 85 1
## 8 0 0 313 100.0 71 21.68 79 78 0
## 9 1 0 260 141.5 89 26.36 76 79 0
## 10 1 0 225 162.0 107 23.61 93 88 0
## id n sex
## 1 1 1 male
## 2 2 1 female
## 3 3 1 male
## 4 4 1 female
## 5 5 1 female
## 6 6 1 female
## 7 7 1 female
## 8 8 1 female
## 9 9 1 male
## 10 10 1 male
# look at the last 10 records
tail(dta, 10)
## male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 4231 0 56 1 1 3 0 0
## 4232 1 58 3 0 0 0 0
## 4233 1 68 1 0 0 0 0
## 4234 1 50 1 1 1 0 0
## 4235 1 51 3 1 43 0 0
## 4236 0 48 2 1 20 NA 0
## 4237 0 44 1 1 15 0 0
## 4238 0 52 2 0 0 0 0
## 4239 1 40 3 0 0 0 0
## 4240 0 39 3 1 30 0 0
## prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose
## 4231 1 0 268 170.0 102 22.89 57 NA
## 4232 1 0 187 141.0 81 24.96 80 81
## 4233 1 0 176 168.0 97 23.14 60 79
## 4234 1 0 313 179.0 92 25.97 66 86
## 4235 0 0 207 126.5 80 19.71 65 68
## 4236 0 0 248 131.0 72 22.00 84 86
## 4237 0 0 210 126.5 87 19.16 86 NA
## 4238 0 0 269 133.5 83 21.47 80 107
## 4239 1 0 185 141.0 98 25.60 67 72
## 4240 0 0 196 133.0 86 20.91 85 80
## TenYearCHD id n sex
## 4231 0 4231 1 female
## 4232 0 4232 1 male
## 4233 1 4233 1 male
## 4234 1 4234 1 male
## 4235 0 4235 1 male
## 4236 0 4236 1 female
## 4237 0 4237 1 female
## 4238 0 4238 1 female
## 4239 0 4239 1 male
## 4240 0 4240 1 female
Creating summary tables using the Epi package.
# Using the Epi package stat.table create summary of rates
# various factors
stat.table(sex, contents = list('Age (mean)' = mean(age),
'Age (SD)' = sd(age),
'HR (mean)' = mean(heartRate),
'HR (SD)' = sd(heartRate),
'BMI (mean)' = mean(BMI)),
margin = T,
data = dta)
## -------------------------------------------------
## sex Age Age HR HR (SD) BMI
## (mean) (SD) (mean) (mean)
## -------------------------------------------------
## female 49.80 8.60 77.10 12.09 25.51
## male 49.29 8.53 74.26 11.74 26.19
##
## Total 49.58 8.57 75.88 12.03 25.80
## -------------------------------------------------
stat.table(sex, contents = list('BMI (SD)' = sd(BMI),
'BSL (mean)' = mean(glucose),
'BSL (SD)' = sd(glucose),
'SBP (mean)' = mean(sysBP),
'SBP (sd)' = sd(sysBP)),
margin = T,
data = dta)
## -------------------------------------------------
## sex BMI BSL BSL SBP SBP
## (SD) (mean) (SD) (mean) (sd)
## -------------------------------------------------
## female 4.50 81.84 23.24 133.04 23.79
## male 3.42 82.12 24.83 131.44 19.42
##
## Total 4.08 81.96 23.95 132.35 22.03
## -------------------------------------------------
stat.table(sex, contents = list('smk (%)' = ratio(currentSmoker,n),
'HT (%)' = ratio(prevalentHyp,n),
'DM (%)' = ratio(diabetes,n),
'CHD (%)' = ratio(TenYearCHD,n)),
margin = T,
data = dta)
## -----------------------------------------
## sex smk (%) HT (%) DM (%) CHD (%)
## -----------------------------------------
## female 0.41 0.31 0.02 0.12
## male 0.61 0.31 0.03 0.19
##
## Total 0.49 0.31 0.03 0.15
## -----------------------------------------
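The ratio() entries above are displayed as proportions. stat.table's ratio() takes an optional scale argument (per the Epi documentation), so percentages can be requested directly; a small sketch using the same variables:
stat.table(sex, contents = list('smk (%)' = ratio(currentSmoker, n, 100)),
margin = T,
data = dta)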
Frank Harrell’s Hmisc package can be easier to use, as continuous and frequency data can be combined into a single table. This approach can create Table 1 of a manuscript.
# Using Hmisc's summary function
print(summary(sex ~ age +
BMI +
heartRate +
glucose +
sysBP +
currentSmoker +
prevalentHyp +
diabetes, method = 'reverse',
overall = TRUE,
test = T,
data = dta), prmsd = T,
digits = 2,
prtest = 'P')
##
##
## Descriptive Statistics by sex
##
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## | |N |female |male |Combined |P-value |
## | | |(N=2420) |(N=1820) |(N=4240) | |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |age |4240|42.0/49.0/57.0 49.8+/- 8.6|42.0/48.0/56.0 49.3+/- 8.5|42.0/49.0/56.0 49.6+/- 8.6| 0.054 |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |BMI |4221|22.5/24.7/27.7 25.5+/- 4.5|24.0/26.1/28.3 26.2+/- 3.4|23.1/25.4/28.0 25.8+/- 4.1| <0.001|
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |heartRate |4239| 69/75/85 77+/-12 | 66/75/80 74+/-12 | 68/75/83 76+/-12 | <0.001|
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |glucose |3852| 72/78/86 82+/-23 | 70/78/87 82+/-25 | 71/78/87 82+/-24 | 0.56 |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |sysBP |4240| 116/128/145 133+/- 24 | 118/128/141 131+/- 19 | 117/128/144 132+/- 22 | 0.72 |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |currentSmoker|4240| 41% ( 989) | 61% (1106) | 49% (2095) | <0.001|
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |prevalentHyp |4240| 31% ( 746) | 31% ( 571) | 31% (1317) | 0.7 |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
## |diabetes |4240| 2% ( 57) | 3% ( 52) | 3% ( 109) | 0.31 |
## +-------------+----+---------------------------+---------------------------+---------------------------+-----------------+
Subset the data to include study participants aged 50 years or more.
# check the range of ages of the original dataset.
min(dta$age)
## [1] 32
max(dta$age)
## [1] 70
# check if any ages are missing
describe(dta$age)
## dta$age
## n missing distinct Info Mean Gmd .05 .10
## 4240 0 39 0.999 49.58 9.85 37 39
## .25 .50 .75 .90 .95
## 42 49 56 62 64
##
## lowest : 32 33 34 35 36, highest: 66 67 68 69 70
table(is.na(dta$age))
##
## FALSE
## 4240
Using the subset function, limit the data to study participants aged 50 years or older.
# subset the data to focus on study participants aged 50 years or older
d <- subset(dta, age > 49)
Some examples of exploring the relationship between two variables: in this case, systolic blood pressure (sysBP) and age.
# relationship between age and sbp
summary(lm(sysBP ~ age, data = d))
##
## Call:
## lm(formula = sysBP ~ age, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.41 -16.43 -3.42 13.03 148.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.0454 6.1093 13.430 <2e-16 ***
## age 1.0148 0.1063 9.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.51 on 2021 degrees of freedom
## Multiple R-squared: 0.04315, Adjusted R-squared: 0.04267
## F-statistic: 91.13 on 1 and 2021 DF, p-value: < 2.2e-16
In the model above the intercept may be difficult to interpret, as it is based on the estimated average sysBP at age 0. Using R it is quite simple to rescale age in the linear regression model to something more useful (i.e. an intercept at age 50).
summary(lm(sysBP ~ I(age-50), data = d))
##
## Call:
## lm(formula = sysBP ~ I(age - 50), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.41 -16.43 -3.42 13.03 148.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.7868 0.9320 142.468 <2e-16 ***
## I(age - 50) 1.0148 0.1063 9.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.51 on 2021 degrees of freedom
## Multiple R-squared: 0.04315, Adjusted R-squared: 0.04267
## F-statistic: 91.13 on 1 and 2021 DF, p-value: < 2.2e-16
mean(d$sysBP[d$age == 50], na.rm = T)
## [1] 133.8964
As you can see, the linear regression model’s intercept makes more sense when age is rescaled around the lowest age in the data. The ordinary least squares results indicate that the average sysBP at age 50 was 132.8 mmHg, and that on average it increased by 1.01 mmHg for each one-year increase in age.
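Fitted values at other ages can be obtained directly with predict(). For example, the model-based mean systolic BP at age 60, a hypothetical query using the rescaled model above:
fit <- lm(sysBP ~ I(age - 50), data = d)
# model-based mean sysBP at age 60: roughly 132.79 + 10 * 1.01 = 143 mmHg
predict(fit, newdata = data.frame(age = 60))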
Now let’s plot the data and see some of the strengths of the R language for annotating figures.
plot(d$sysBP ~ d$age, pch = 1, col = 'black',
xlab = 'Age (yrs)',
xlim = c(50,70),
ylab = 'Systolic BP (mmHg)',
las = 1,
ylim = c(75,250),
axes = F)
axis(side = 1, at = seq(50,70,1), labels = seq(50,70,1), cex = 0.8)
axis(side = 2, at = seq(75,250,25), labels = seq(75,250,25), las = 1, cex = 0.8)
# add a line of regression
abline(lsfit(d$age,d$sysBP), lty = 2)
# add the Pearson's correlation coefficient
text(69,240, paste('rho = ',round(cor.test(d$age,d$sysBP)$estimate,2)))
Is the relationship between age and systolic BP different between women and men?
# mean and sd of sbp by sex
tapply(d$sysBP, d$sex, mean)
## female male
## 143.2540 135.8398
tapply(d$sysBP, d$sex, sd)
## female male
## 24.89276 22.07442
summary(lm(sysBP ~ I(age-50) * sex, data = d))
##
## Call:
## lm(formula = sysBP ~ I(age - 50) * sex, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.569 -16.393 -3.195 12.994 142.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 133.3211 1.2188 109.388 < 2e-16 ***
## I(age - 50) 1.3498 0.1379 9.790 < 2e-16 ***
## sexmale -1.1586 1.8539 -0.625 0.532
## I(age - 50):sexmale -0.8334 0.2121 -3.929 8.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.17 on 2019 degrees of freedom
## Multiple R-squared: 0.07195, Adjusted R-squared: 0.07057
## F-statistic: 52.17 on 3 and 2019 DF, p-value: < 2.2e-16
plot(d$sysBP[d$sex == 'male'] ~ d$age[d$sex == 'male'],
pch = 1,
col = 'black',
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
las = 1,
ylim = c(75,250),
xlim = c(50,70),
axes = F)
axis(side = 1, at = seq(50,70,1), labels = seq(50,70,1), cex = 0.8)
axis(side = 2, at = seq(75,250,25), labels = seq(75,250,25), las = 1, cex = 0.8)
points(d$sysBP[d$sex == 'female'] ~ d$age[d$sex == 'female'],
pch = 16, col = 'black') # filled points for women, matching the legend
# add a line of regression for women and men separately
abline(lsfit(d$age[d$sex == 'male'],d$sysBP[d$sex == 'male']), lty = 2)
abline(lsfit(d$age[d$sex == 'female'],d$sysBP[d$sex == 'female']), lty = 3)
legend('topright', c('Women','Men'),
pch = c(16,1),
lty = c(3,2),
bty = 'n')
Assessing the association between a study factor and study outcome.
Let’s use the Framingham data to assess the relationship between being a male and ten-year risk of CHD.
library(Epi)
stat.table(male, contents = list(CHD = sum(TenYearCHD),
N = sum(n),
Rate = ratio(TenYearCHD,n)),
margin = T,
data = d)
## --------------------------------
## male CHD N Rate
## --------------------------------
## 0 221.00 1177.00 0.19
## 1 233.00 846.00 0.28
##
## Total 454.00 2023.00 0.22
## --------------------------------
Men appear to have a higher incidence of CHD at 10 years of follow-up than women (28% versus 19%). We can formally estimate the effect and the associated 95% CI using the epiR package.
epi.2by2(dat = c(233,846-233,221,1177-221), method = 'cohort.count')
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 233 613 846 27.5 0.380
## Exposed - 221 956 1177 18.8 0.231
## Total 454 1569 2023 22.4 0.289
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.47 (1.25, 1.72)
## Odds ratio 1.64 (1.33, 2.03)
## Attrib risk in the exposed * 8.76 (5.02, 12.51)
## Attrib fraction in the exposed (%) 31.82 (19.88, 41.99)
## Attrib risk in the population * 3.67 (0.79, 6.54)
## Attrib fraction in the population (%) 16.33 (9.07, 23.02)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 21.725 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
We can see from the results that on average men have a 47% higher risk of CHD over 10 years than women, and we are 95% confident the true increase lies between 25% and 72%. In a contemporary research setting we would use a modelling strategy to estimate these effects, using generalised linear regression.
# log binomial model (RR)
glm.0 <- glm(TenYearCHD ~ male, data = d, family = binomial(link='log'))
logistic.display(glm.0)
##
## Logistic regression predicting TenYearCHD
##
## OR(95%CI) P(Wald's test) P(LR-test)
## male: 1 vs 0 1.47 (1.25,1.72) < 0.001 < 0.001
##
## Log-likelihood = -1066.3815
## No. of observations = 2023
## AIC value = 2136.7629
# logit binomial model (OR)
glm.0 <- glm(TenYearCHD ~ male, data = d, family = binomial(link='logit'))
logistic.display(glm.0)
##
## Logistic regression predicting TenYearCHD
##
## OR(95%CI) P(Wald's test) P(LR-test)
## male: 1 vs 0 1.64 (1.33,2.03) < 0.001 < 0.001
##
## Log-likelihood = -1066.3815
## No. of observations = 2023
## AIC value = 2136.7629
The benefit of using a modelling strategy is that it is easy to add variables to the model and, in that sense, adjust for these other variables (covariates). Here we assess the association after adjusting for age.
library(epiDisplay)
glm.adj <- glm(TenYearCHD ~ male + age, data = d, family = binomial(link='log'))
logistic.display(glm.adj)
##
## Logistic regression predicting TenYearCHD
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test)
## male: 1 vs 0 1.64 (1.33,2.03) 1.46 (1.25,1.72) < 0.001 < 0.001
##
## age (cont. var.) 1.06 (1.03,1.08) 1.04 (1.03,1.06) < 0.001 < 0.001
##
## Log-likelihood = -1052.8745
## No. of observations = 2023
## AIC value = 2111.7491
glm.adj <- glm(TenYearCHD ~ male + age, data = d, family = binomial(link='logit'))
logistic.display(glm.adj)
##
## Logistic regression predicting TenYearCHD
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test)
## male: 1 vs 0 1.64 (1.33,2.03) 1.68 (1.36,2.07) < 0.001 < 0.001
##
## age (cont. var.) 1.06 (1.03,1.08) 1.06 (1.04,1.08) < 0.001 < 0.001
##
## Log-likelihood = -1052.3859
## No. of observations = 2023
## AIC value = 2110.7718
Meta-analysis - combining the results of a number of trials.
Using trials of targeted therapeutic hypothermia in out-of-hospital cardiac arrest patients, we can demonstrate how the results of multiple trials are combined in a meta-analysis to assess the current state of the published literature.
library(meta)
# meta-analysis - outcome of interest mortality
dta <- data.frame(study = c('Look (2018)',
'Nielsen (2013)',
'Laurent (2005)',
'Hachimi-Idrissi (2005)',
'Bernard (2002)',
'HACA (2002)',
'Hachimi-Idrissi (2001)',
'Tiainen (2007)',
'Bernard (1997)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Dankiewicz (2021)',
'Lascarrou (2019)',
'Nielsen (2013)'),
location = rep('Out of hospital',14),
year = c(2018,2013,2005,2005,2002,2002,2001,2007,1997,2021,2019,2021,2019,2013),
n = c(87,939,42,28,77,275,30,70,44,1850,581,939,1850,581),
event.e = c(27,226,15,6,22,56,13,8,10,465,231,465,231,226),
n.e = c(45,473,22,14,43,137,16,36,22,925,284,925,284,473),
event.c = c(33,220,11,8,23,76,13,12,17,446,247,446,247,220),
n.c = c(42,466,20,14,34,138,14,34,22,925,297,925,297,466),
target.temp = c(rep(33,5),32,34,33,33,33,33,33,33,33),
comparison = c('No-TTM','TTM-36','TTM-37','TTM-37',
'No-TTM','No-TTM','Tx>38','Tx>38',
'No-TTM','Tx>37.8','TTM-37','TTM-36','Tx>37.8','TTM-37'),
design = c('RCT','RCT','RCT','RCT','RCT',
'RCT','RCT','RCT','Non-R','RCT','RCT','RCT','RCT','RCT'),
control.target = c('Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials with explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance',
'Trials without explicit normothermia / fever avoidance'))
dta <- dta[ order(dta$year),]
dta$id <- 1:14
dta$ratio.e <- paste(dta$event.e,'/',dta$n.e, sep = '')
dta$p.e <- paste('(', round(dta$event.e/dta$n.e,2)*100,'%)', sep = '')
dta$ratio.c <- paste(dta$event.c,'/',dta$n.c, sep = '')
dta$p.c <- paste('(', round(dta$event.c/dta$n.c,2)*100,'%)', sep = '')
dta$tx <- paste(dta$ratio.e,dta$p.e, sep = ' ')
dta$c <- paste(dta$ratio.c,dta$p.c, sep = ' ')
meta.all <- metabin(event.e, n.e,
event.c, n.c,
sm = 'RR',
data = dta,
studlab = study,
byvar = control.target,
subset = id %in% c(1:8,11,13))
# ---------------------------- Forest plot ---------------------------------------------
par(cex = 1.15, mar = c(2,2,2,2))
forest(meta.all, leftcols = c('studlab',
'tx','c'),
leftlabs = c('Study (Year)','Hypothermia\n (n/N)(%)',
'Control\n (n/N)(%)'),
bylab = '',
label.right = '',
label.left = '',
col.square = 'blue',
col.diamond = 'blue',
overall = F,
col.by = 'black')
Guérin C, et al. Prone Positioning in Severe Acute Respiratory Distress Syndrome. NEJM 2013;368(23):2159-2168.↩︎
Doll R, Hill AB. Mortality of British doctors in relation to smoking: observations on coronary thrombosis. In: Haenszel W (ed), Epidemiological Approaches to the Study of Cancer and Other Chronic Diseases. National Cancer Institute Monograph 1966;19:204-268.↩︎
Guérin C, et al. Prone Positioning in Severe Acute Respiratory Distress Syndrome. NEJM 2013;368(23):2159-2168.↩︎
Estruch R, et al. Primary Prevention of Cardiovascular Disease with a Mediterranean Diet. NEJM 2013;368(14):1279-1290.↩︎
Chow SC, Shao J, Wang H. Sample Size Calculation in Clinical Research. New York: Marcel Dekker; 2003.↩︎