This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.

Our approach based on PPDAC:

Problem

At the time of the initiation of the Framingham study it was not clear the role of blood pressure played in the risk of stroke, as blood pressure was thought to naturally increase with age. Therefore, in terms of assessing the risk of stroke is relationship to increasing blood pressure, a cohort study design is the most suitable design. Importantly, it would be un-ethical to randomise individuals to high blood pressure, and a prospective cohort of individuals, without stroke at baseline, among whom blood pressure can be measured, and then follow-up for an incident stroke. Our approach will be to refer to blood pressure as our study factor and stroke as our study outcome.

Now using R:

## Input object size:    511536 bytes;   22 variables    4206 observations
## New object size: 501400 bytes;   22 variables    4206 observations

An introduction to multiple linear regression (continuous outcome).

For example let us explore the relationship between increase age and systolic blood pressure.

options(width = 1500)

summary(lm(sysbp ~ age, data = d))
## 
## Call:
## lm(formula = sysbp ~ age, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -56.085 -13.253  -2.304   9.862 149.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.7830     1.7826   47.56   <2e-16 ***
## age           0.9449     0.0354   26.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1447 
## F-statistic: 712.5 on 1 and 4204 DF,  p-value: < 2.2e-16
summary(lm(sysbp ~ I(age/5), data = d))
## 
## Call:
## lm(formula = sysbp ~ I(age/5), data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -56.085 -13.253  -2.304   9.862 149.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   84.783      1.783   47.56   <2e-16 ***
## I(age/5)       4.724      0.177   26.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1447 
## F-statistic: 712.5 on 1 and 4204 DF,  p-value: < 2.2e-16
par(mfrow = c(1,2))
plot(d$age[d$sex == 'Women'], d$sysbp[d$sex == 'Women'], xlab = 'Age (yrs)',
                     ylab = 'SBP (mmHg)',
                     las = 1,
                     col = 'red')
abline(lsfit(d$age[d$sex == 'Women'], d$sysbp[d$sex == 'Women']), lty = 2, lwd = 2)
mtext(side = 3, adj = 0, text = 'Women')
summary(lm(sysbp ~ age, data = d, subset = sex == 'Women'))
## 
## Call:
## lm(formula = sysbp ~ age, data = d, subset = sex == "Women")
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -58.285 -13.033  -2.497   9.810 144.017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 67.64338    2.48900   27.18   <2e-16 ***
## age          1.30218    0.04943   26.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 2336 degrees of freedom
## Multiple R-squared:  0.229,  Adjusted R-squared:  0.2287 
## F-statistic: 693.9 on 1 and 2336 DF,  p-value: < 2.2e-16
plot(d$age[d$sex == 'Men'], d$sysbp[d$sex == 'Men'], xlab = 'Age (yrs)',
                     ylab = 'SBP (mmHg)',
                     las = 1,
                     col = 'blue')
abline(lsfit(d$age[d$sex == 'Men'], d$sysbp[d$sex == 'Men']), lty = 2, lwd = 2)
mtext(side = 3, adj = 0, text = 'Men')

summary(lm(sysbp ~ age, data = d, subset = sex == 'Men'))
## 
## Call:
## lm(formula = sysbp ~ age, data = d, subset = sex == "Men")
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -51.737 -12.590  -2.141   9.558  96.160 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.38287    2.44171   43.16   <2e-16 ***
## age           0.51473    0.04848   10.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.22 on 1866 degrees of freedom
## Multiple R-squared:  0.05698,    Adjusted R-squared:  0.05647 
## F-statistic: 112.7 on 1 and 1866 DF,  p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) + sex, data = d))
## 
## Call:
## lm(formula = sysbp ~ I(age - 32) + sex, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -55.344 -13.339  -2.283  10.100 149.155 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 114.27922    0.77253 147.928   <2e-16 ***
## I(age - 32)   0.94482    0.03538  26.704   <2e-16 ***
## sexWomen      1.33194    0.61243   2.175   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.73 on 4203 degrees of freedom
## Multiple R-squared:  0.1459, Adjusted R-squared:  0.1455 
## F-statistic: 358.9 on 2 and 4203 DF,  p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) + sex + bmi, data = d))
## 
## Call:
## lm(formula = sysbp ~ I(age - 32) + sex + bmi, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -59.859 -12.422  -2.205   9.915 130.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 76.10972    1.96283   38.77  < 2e-16 ***
## I(age - 32)  0.85109    0.03390   25.10  < 2e-16 ***
## sexWomen     2.39997    0.58398    4.11 4.04e-05 ***
## bmi          1.52126    0.07239   21.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.72 on 4186 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.226 
## F-statistic: 408.7 on 3 and 4186 DF,  p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) * sex, data = d))
## 
## Call:
## lm(formula = sysbp ~ I(age - 32) * sex, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -58.285 -12.781  -2.428   9.729 144.017 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          121.85412    1.01652 119.873   <2e-16 ***
## I(age - 32)            0.51473    0.05175   9.946   <2e-16 ***
## sexWomen             -12.54110    1.37344  -9.131   <2e-16 ***
## I(age - 32):sexWomen   0.78745    0.07003  11.245   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.45 on 4202 degrees of freedom
## Multiple R-squared:  0.1708, Adjusted R-squared:  0.1702 
## F-statistic: 288.6 on 3 and 4202 DF,  p-value: < 2.2e-16
tapply(d$sysbp, d$sex, mean)
##      Men    Women 
## 130.9197 132.2626
summary(lm(sysbp ~ sex, data = d))
## 
## Call:
## lm(formula = sysbp ~ sex, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -48.763 -14.920  -3.763  10.737 162.737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 130.9197     0.4938 265.147   <2e-16 ***
## sexWomen      1.3429     0.6623   2.028   0.0426 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.34 on 4204 degrees of freedom
## Multiple R-squared:  0.0009771,  Adjusted R-squared:  0.0007395 
## F-statistic: 4.112 on 1 and 4204 DF,  p-value: 0.04265
t.test(sysbp ~ sex, data = d, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  sysbp by sex
## t = -2.0278, df = 4204, p-value = 0.04265
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -2.64130527 -0.04452955
## sample estimates:
##   mean in group Men mean in group Women 
##            130.9197            132.2626

Let’s explore generlised linear models.

options(width = 1500)
d$sysHT <- d$prevhyp
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')

# Rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)
##     Strokes    N
## No      157 2963
## Yes     186 1243
ci.binomial(stroke.tab$Strokes, stroke.tab$N)
##   events total probability         se exact.lower95ci exact.upper95ci
## 1    157  2963  0.05298684 0.00411525      0.04519247      0.06167668
## 2    186  1243  0.14963797 0.01011783      0.13024489      0.17070700
stat.table(sysHT, contents = list(N = count(),
                                  Stroke = sum(stroke),
                                  'Rate(%)' = ratio(stroke,n,100)),
                                  margin = T,
                                  data = d)
##  -------------------------------- 
##  sysHT         N  Stroke Rate(%)  
##  -------------------------------- 
##  No         2963  157.00    5.30  
##  Yes        1243  186.00   14.96  
##                                   
##  Total      4206  343.00    8.16  
##  --------------------------------
# Rate Ratio 
table(d$sysHT, d$stroke.yes)
##      
##        Yes   No
##   No   157 2806
##   Yes  186 1057
Pe <- (186/(186+1057))
Po <- (157/(157+2806))
RR <- Pe/Po
RR
## [1] 2.824059
# Odds Ratio 
OR <- (Pe/(1-Pe))/(Po/(1-Po))
OR
## [1] 3.145039
epiTab(186,1057,157,2806)
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes   No Total
##        Yes   186 1057  1243
##        No    157 2806  2963
##        Total 343 3863  4206
## 
## Rate of outcome among intervention group (p1) = 0.1496 (95% CI: 12.98, 16.95)
## 
## Rate of outcome among control group (p2) = 0.053 (95% CI: 4.49, 6.11)
## 
## Risk Difference = 9.67 (95% CI: 7.52, 11.81)
## 
## Hazard Risk = 2.977 95% CI = 2.407 to 3.682
## Rate Ratio = 2.824 95% CI = 2.308 to 3.456
## Odds Ratio = 3.145 95% CI = 2.514 to 3.934
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
d$sysHT <- relevel(d$sysHT, ref = 'No')
## RR
logistic.display(glm(cbind(stroke,n) ~ sysHT, family = binomial, data = d))
## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.82 (2.26,3.53)  < 0.001        < 0.001   
##                                                             
## Log-likelihood = -937.1428
## No. of observations = 4206
## AIC value = 1878.2856
## OR

logistic.display(glm(stroke ~ sysHT, family = binomial, data = d))
## 
## Logistic regression predicting stroke 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## sysHT: Yes vs No 3.15 (2.51,3.93)  < 0.001        < 0.001   
##                                                             
## Log-likelihood = -1138.632
## No. of observations = 4206
## AIC value = 2281.264
# each decrease in SBP by 10 mmHg
d$sysBP.10 <- -1 * d$sysbp/10
logistic.display(glm(stroke ~ sysBP.10, family = binomial, data = d))
## 
## Logistic regression predicting stroke 
##  
##                       OR(95%CI)         P(Wald's test) P(LR-test)
## sysBP.10 (cont. var.) 0.79 (0.76,0.83)  < 0.001        < 0.001   
##                                                                  
## Log-likelihood = -1137.0414
## No. of observations = 4206
## AIC value = 2278.0828
## adjRR
d$male <- as.factor(ifelse(d$sex == 'Men','Yes,','No'))
d$stroke.yes <- relevel(d$stroke.yes, ref = 'No')
tab.strat <- table(d$sysHT, d$stroke.yes, d$male)
mantelhaen.test(tab.strat)
## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  tab.strat
## Mantel-Haenszel X-squared = 107.19, df = 1, p-value < 2.2e-16
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.496475 3.904584
## sample estimates:
## common odds ratio 
##           3.12213
logistic.display(glm(stroke ~  sysHT + male, family = binomial, 
                                                            data = d), crude.p.value = T)
## 
## Logistic regression predicting stroke 
##  
##                  crude OR(95%CI)   crude P value adj. OR(95%CI)    P(Wald's test) P(LR-test)
## sysHT: Yes vs No 3.15 (2.51,3.93)  < 0.001       3.14 (2.51,3.92)  < 0.001        < 0.001   
##                                                                                             
## male: Yes, vs No 1.21 (0.97,1.5)   0.097         1.18 (0.95,1.48)  0.138          0.138     
##                                                                                             
## Log-likelihood = -1137.5345
## No. of observations = 4206
## AIC value = 2281.0691
logistic.display(glm(cbind(stroke,n) ~  sysHT + age + male, family = binomial, 
                                                            data = d), crude.p.value = T)
## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                  crude OR(95%CI)   crude P value adj. OR(95%CI)    P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.82 (2.26,3.53)  < 0.001       2.06 (1.63,2.6)   < 0.001        < 0.001   
##                                                                                             
## age (cont. var.) 1.08 (1.06,1.09)  < 0.001       1.07 (1.05,1.08)  < 0.001        < 0.001   
##                                                                                             
## male: Yes, vs No 1.19 (0.95,1.48)  0.126         1.23 (0.98,1.54)  0.073          < 0.001   
##                                                                                             
## Log-likelihood = -895.8385
## No. of observations = 4206
## AIC value = 1799.6771
# AUC ROC
lroc(glm(stroke ~ sysHT, family = binomial, data = d), graph = F)$auc
## [1] 0.6343263
lroc(glm(stroke ~ sysHT + age, family = binomial, data = d), graph = F)$auc
## [1] 0.7213891
lroc(glm(stroke ~ sysHT + age + sex, family = binomial, data = d), graph = F)$auc
## [1] 0.7238226
lroc(glm(stroke ~ sysHT + age + sex + bmi, family = binomial, data = d), graph = F)$auc
## [1] 0.7271359

The Area Under the Curve (ROC Curve) increases from 0.63 to > 0.71 with the addition of more predictors.