This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.
Our approach based on PPDAC:
Problem
At the time of the initiation of the Framingham study it was not clear the role of blood pressure played in the risk of stroke, as blood pressure was thought to naturally increase with age. Therefore, in terms of assessing the risk of stroke is relationship to increasing blood pressure, a cohort study design is the most suitable design. Importantly, it would be un-ethical to randomise individuals to high blood pressure, and a prospective cohort of individuals, without stroke at baseline, among whom blood pressure can be measured, and then follow-up for an incident stroke. Our approach will be to refer to blood pressure as our study factor and stroke as our study outcome.
Now using R:
## Input object size: 511536 bytes; 22 variables 4206 observations
## New object size: 501400 bytes; 22 variables 4206 observations
For example let us explore the relationship between increase age and systolic blood pressure.
options(width = 1500)
summary(lm(sysbp ~ age, data = d))
##
## Call:
## lm(formula = sysbp ~ age, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -56.085 -13.253 -2.304 9.862 149.746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.7830 1.7826 47.56 <2e-16 ***
## age 0.9449 0.0354 26.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared: 0.1449, Adjusted R-squared: 0.1447
## F-statistic: 712.5 on 1 and 4204 DF, p-value: < 2.2e-16
summary(lm(sysbp ~ I(age/5), data = d))
##
## Call:
## lm(formula = sysbp ~ I(age/5), data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -56.085 -13.253 -2.304 9.862 149.746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.783 1.783 47.56 <2e-16 ***
## I(age/5) 4.724 0.177 26.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared: 0.1449, Adjusted R-squared: 0.1447
## F-statistic: 712.5 on 1 and 4204 DF, p-value: < 2.2e-16
par(mfrow = c(1,2))
plot(d$age[d$sex == 'Women'], d$sysbp[d$sex == 'Women'], xlab = 'Age (yrs)',
ylab = 'SBP (mmHg)',
las = 1,
col = 'red')
abline(lsfit(d$age[d$sex == 'Women'], d$sysbp[d$sex == 'Women']), lty = 2, lwd = 2)
mtext(side = 3, adj = 0, text = 'Women')
summary(lm(sysbp ~ age, data = d, subset = sex == 'Women'))
##
## Call:
## lm(formula = sysbp ~ age, data = d, subset = sex == "Women")
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -58.285 -13.033 -2.497 9.810 144.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.64338 2.48900 27.18 <2e-16 ***
## age 1.30218 0.04943 26.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.38 on 2336 degrees of freedom
## Multiple R-squared: 0.229, Adjusted R-squared: 0.2287
## F-statistic: 693.9 on 1 and 2336 DF, p-value: < 2.2e-16
plot(d$age[d$sex == 'Men'], d$sysbp[d$sex == 'Men'], xlab = 'Age (yrs)',
ylab = 'SBP (mmHg)',
las = 1,
col = 'blue')
abline(lsfit(d$age[d$sex == 'Men'], d$sysbp[d$sex == 'Men']), lty = 2, lwd = 2)
mtext(side = 3, adj = 0, text = 'Men')
summary(lm(sysbp ~ age, data = d, subset = sex == 'Men'))
##
## Call:
## lm(formula = sysbp ~ age, data = d, subset = sex == "Men")
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -51.737 -12.590 -2.141 9.558 96.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.38287 2.44171 43.16 <2e-16 ***
## age 0.51473 0.04848 10.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.22 on 1866 degrees of freedom
## Multiple R-squared: 0.05698, Adjusted R-squared: 0.05647
## F-statistic: 112.7 on 1 and 1866 DF, p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) + sex, data = d))
##
## Call:
## lm(formula = sysbp ~ I(age - 32) + sex, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -55.344 -13.339 -2.283 10.100 149.155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.27922 0.77253 147.928 <2e-16 ***
## I(age - 32) 0.94482 0.03538 26.704 <2e-16 ***
## sexWomen 1.33194 0.61243 2.175 0.0297 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.73 on 4203 degrees of freedom
## Multiple R-squared: 0.1459, Adjusted R-squared: 0.1455
## F-statistic: 358.9 on 2 and 4203 DF, p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) + sex + bmi, data = d))
##
## Call:
## lm(formula = sysbp ~ I(age - 32) + sex + bmi, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -59.859 -12.422 -2.205 9.915 130.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.10972 1.96283 38.77 < 2e-16 ***
## I(age - 32) 0.85109 0.03390 25.10 < 2e-16 ***
## sexWomen 2.39997 0.58398 4.11 4.04e-05 ***
## bmi 1.52126 0.07239 21.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.72 on 4186 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.226
## F-statistic: 408.7 on 3 and 4186 DF, p-value: < 2.2e-16
summary(lm(sysbp ~ I(age-32) * sex, data = d))
##
## Call:
## lm(formula = sysbp ~ I(age - 32) * sex, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -58.285 -12.781 -2.428 9.729 144.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.85412 1.01652 119.873 <2e-16 ***
## I(age - 32) 0.51473 0.05175 9.946 <2e-16 ***
## sexWomen -12.54110 1.37344 -9.131 <2e-16 ***
## I(age - 32):sexWomen 0.78745 0.07003 11.245 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.45 on 4202 degrees of freedom
## Multiple R-squared: 0.1708, Adjusted R-squared: 0.1702
## F-statistic: 288.6 on 3 and 4202 DF, p-value: < 2.2e-16
tapply(d$sysbp, d$sex, mean)
## Men Women
## 130.9197 132.2626
summary(lm(sysbp ~ sex, data = d))
##
## Call:
## lm(formula = sysbp ~ sex, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -48.763 -14.920 -3.763 10.737 162.737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.9197 0.4938 265.147 <2e-16 ***
## sexWomen 1.3429 0.6623 2.028 0.0426 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.34 on 4204 degrees of freedom
## Multiple R-squared: 0.0009771, Adjusted R-squared: 0.0007395
## F-statistic: 4.112 on 1 and 4204 DF, p-value: 0.04265
t.test(sysbp ~ sex, data = d, var.equal = T)
##
## Two Sample t-test
##
## data: sysbp by sex
## t = -2.0278, df = 4204, p-value = 0.04265
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -2.64130527 -0.04452955
## sample estimates:
## mean in group Men mean in group Women
## 130.9197 132.2626
options(width = 1500)
d$sysHT <- d$prevhyp
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
# Rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)
## Strokes N
## No 157 2963
## Yes 186 1243
ci.binomial(stroke.tab$Strokes, stroke.tab$N)
## events total probability se exact.lower95ci exact.upper95ci
## 1 157 2963 0.05298684 0.00411525 0.04519247 0.06167668
## 2 186 1243 0.14963797 0.01011783 0.13024489 0.17070700
stat.table(sysHT, contents = list(N = count(),
Stroke = sum(stroke),
'Rate(%)' = ratio(stroke,n,100)),
margin = T,
data = d)
## --------------------------------
## sysHT N Stroke Rate(%)
## --------------------------------
## No 2963 157.00 5.30
## Yes 1243 186.00 14.96
##
## Total 4206 343.00 8.16
## --------------------------------
# Rate Ratio
table(d$sysHT, d$stroke.yes)
##
## Yes No
## No 157 2806
## Yes 186 1057
Pe <- (186/(186+1057))
Po <- (157/(157+2806))
RR <- Pe/Po
RR
## [1] 2.824059
# Odds Ratio
OR <- (Pe/(1-Pe))/(Po/(1-Po))
OR
## [1] 3.145039
epiTab(186,1057,157,2806)
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 186 1057 1243
## No 157 2806 2963
## Total 343 3863 4206
##
## Rate of outcome among intervention group (p1) = 0.1496 (95% CI: 12.98, 16.95)
##
## Rate of outcome among control group (p2) = 0.053 (95% CI: 4.49, 6.11)
##
## Risk Difference = 9.67 (95% CI: 7.52, 11.81)
##
## Hazard Risk = 2.977 95% CI = 2.407 to 3.682
## Rate Ratio = 2.824 95% CI = 2.308 to 3.456
## Odds Ratio = 3.145 95% CI = 2.514 to 3.934
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
d$sysHT <- relevel(d$sysHT, ref = 'No')
## RR
logistic.display(glm(cbind(stroke,n) ~ sysHT, family = binomial, data = d))
##
## Logistic regression predicting cbind(stroke, n)
##
## OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.82 (2.26,3.53) < 0.001 < 0.001
##
## Log-likelihood = -937.1428
## No. of observations = 4206
## AIC value = 1878.2856
## OR
logistic.display(glm(stroke ~ sysHT, family = binomial, data = d))
##
## Logistic regression predicting stroke
##
## OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 3.15 (2.51,3.93) < 0.001 < 0.001
##
## Log-likelihood = -1138.632
## No. of observations = 4206
## AIC value = 2281.264
# each decrease in SBP by 10 mmHg
d$sysBP.10 <- -1 * d$sysbp/10
logistic.display(glm(stroke ~ sysBP.10, family = binomial, data = d))
##
## Logistic regression predicting stroke
##
## OR(95%CI) P(Wald's test) P(LR-test)
## sysBP.10 (cont. var.) 0.79 (0.76,0.83) < 0.001 < 0.001
##
## Log-likelihood = -1137.0414
## No. of observations = 4206
## AIC value = 2278.0828
## adjRR
d$male <- as.factor(ifelse(d$sex == 'Men','Yes,','No'))
d$stroke.yes <- relevel(d$stroke.yes, ref = 'No')
tab.strat <- table(d$sysHT, d$stroke.yes, d$male)
mantelhaen.test(tab.strat)
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: tab.strat
## Mantel-Haenszel X-squared = 107.19, df = 1, p-value < 2.2e-16
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.496475 3.904584
## sample estimates:
## common odds ratio
## 3.12213
logistic.display(glm(stroke ~ sysHT + male, family = binomial,
data = d), crude.p.value = T)
##
## Logistic regression predicting stroke
##
## crude OR(95%CI) crude P value adj. OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 3.15 (2.51,3.93) < 0.001 3.14 (2.51,3.92) < 0.001 < 0.001
##
## male: Yes, vs No 1.21 (0.97,1.5) 0.097 1.18 (0.95,1.48) 0.138 0.138
##
## Log-likelihood = -1137.5345
## No. of observations = 4206
## AIC value = 2281.0691
logistic.display(glm(cbind(stroke,n) ~ sysHT + age + male, family = binomial,
data = d), crude.p.value = T)
##
## Logistic regression predicting cbind(stroke, n)
##
## crude OR(95%CI) crude P value adj. OR(95%CI) P(Wald's test) P(LR-test)
## sysHT: Yes vs No 2.82 (2.26,3.53) < 0.001 2.06 (1.63,2.6) < 0.001 < 0.001
##
## age (cont. var.) 1.08 (1.06,1.09) < 0.001 1.07 (1.05,1.08) < 0.001 < 0.001
##
## male: Yes, vs No 1.19 (0.95,1.48) 0.126 1.23 (0.98,1.54) 0.073 < 0.001
##
## Log-likelihood = -895.8385
## No. of observations = 4206
## AIC value = 1799.6771
# AUC ROC
lroc(glm(stroke ~ sysHT, family = binomial, data = d), graph = F)$auc
## [1] 0.6343263
lroc(glm(stroke ~ sysHT + age, family = binomial, data = d), graph = F)$auc
## [1] 0.7213891
lroc(glm(stroke ~ sysHT + age + sex, family = binomial, data = d), graph = F)$auc
## [1] 0.7238226
lroc(glm(stroke ~ sysHT + age + sex + bmi, family = binomial, data = d), graph = F)$auc
## [1] 0.7271359
The Area Under the Curve (ROC Curve) increases from 0.63 to > 0.71 with the addition of more predictors.