This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.
Our approach based on PPDAC:
Problem
At the time of the initiation of the Framingham study it was not clear the role of blood pressure played in the risk of stroke, as blood pressure was thought to naturally increase with age. Therefore, in terms of assessing the risk of stroke is relationship to increasing blood pressure, a cohort study design is the most suitable design. Importantly, it would be un-ethical to randomise individuals to high blood pressure, and a prospective cohort of individuals, without stroke at baseline, among whom blood pressure can be measured, and then follow-up for an incident stroke. Our approach will be to refer to blood pressure as our study factor and stroke as our study outcome.
Population
Adults with (exposed) and without high (un-exposed/comparator/counter factorial) blood pressure.
Data
Baseline blood pressure and follow stroke incidence.
Analysis
Hypothesis testing and estimation.
The NHST approach includes the stating of the null-hypothesis (\(H_0\)) (no-effect). In other would no association between the study factor and study outcome:
\[H_0: \text{The rate of stroke among HT} = \text{rate among non-HT}\]
In contrast to the null-hypothesis, an alternate hypothesis (\(H_i\)) is define:
\[H_i: \text{The rate of stroke among HT} \neq \text{rate among non-HT}\]
Then a critical value is set, say from a z-statistic (i.e > 1.96 upper 2.5% tail), or the case of the chi-squared test statistic (\(\chi^2\)) > 3.84 (\(1.96^2\)).
To assess the potential role of chance when comparing the frequencies of a two-by-two table of stroke yes/no and HT yes/no, the Pearson’s chi-squared test is commonly used. It is based on comparing the difference between the observed (O) table cell frequencies and what would be expected (E) by chance. The specific statistic being the sum of these differences:
\[\chi^2 = \sum{\frac{(O-E)^2}{E}}\] Commonly the approach of squaring an subtraction of two values is seen in statistics, this is a trick statisticians use to remove negative values! And, the expected values of each given cell is estimated using the row and column totals:
\[E = \frac{(total_{row} \times total_{column})}{Total_N}\] # Estimation of effect ## Rate Ratio, Odds Ratio and Incidence Rate Ratio.
SPSS Code:
* Cross tabulation and chisq test .
CROSSTABS TABLES = stroke BY prevhyp
/CELLS=COUNT ROW
/STATISTICS=CHISQ.
EXECUTE.
SPSS OUTPUT
| . | . | HT | no-HT. | Total | |
|---|---|---|---|---|---|
| stroke | No | Count. | 2806 | 1057 | 3863. |
| . | . | % within prevhyp | 85.0% | 94.7% | 91.8% |
| . | Yes | Count | 186 | 157 | 343. |
| . | % within prevhyp | 15.0% | 5.3%. | 8.2% | |
| Total | Count. | 1243 | 2963 | 4206. | |
| . | % within prevhyp | 100% | 100% | 100.0%. |
Chi-Squared Tests
| Value. | df | Asymptotic Significance (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided) | |
|---|---|---|---|---|---|
| Pearson Chi-Square | 109.211\(^a\) | 1 | <.001 | ||
| Continuity Correction\(^b\) | 107.925 | 1 | <.001 | ||
| Likelihood Ratio | 99.456. | 1 | <.001 | ||
| Fisher’s Exact Test | . | <.001 | <.001. | ||
| Linear-by-Linear Association. | 109.185. | 1 | <.001 | ||
| N of Valid Cases | 4206. |
note: \(^a\) 0
cells (0.0%) have expected count less than 5. The minimum expected count
is 101.37.
\(^b\) Computed only for a 2x2
table
The \(\chi^2\) statistic (Pearson’s without continuity correction) is 109.211, which has a p-value < 0.001.
Using R the assocated p-value is:
options(width = 1500)
pchisq(q = 109.211, df = 1, lower.tail = F)
## [1] 1.458982e-25
Now using R:
options(width = 1500)
# Framingham multiple visit data
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
## setwd("C:/Users/sfrost/OneDrive - University of Wollongong/Data/")
## setwd('Z:/OneDrive - University of Wollongong/Data/')
## source('epiTables V3.R')
# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
framingham$n <- 1
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18
# Create age groups with labels
framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
labels = c('30-39','40-49',
'50-59','60-69',
'70+'))
# remove participants with a prior stroke or tx
# for BP and only use first visit data.
d <- subset(framingham, bpmeds == 0 &
prevstrk == 0 &
period == 1, select = c(randid,age,totchol.mmols,
diabp,glucose.mmols,bmi,
sex,sysbp,cursmoke,educ,
cigpday,diabetes,stroke,
timestrk,death,timedth,
prevhyp,age.group))
# create follow-up in years
d$n <- 1
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
d$HT.yes <- factor(d$prevhyp, levels= 0:1, labels = c('No','Yes'))
d$cursmoke <- factor(d$cursmoke, levels= 0:1, labels = c('No','Yes'))
d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
'5th, 6th, or 7th grades',
'grade school graduate no high school',
'high school, did not graduate'))
# variables labels and unit
d <- upData(d, labels = c(age = 'Age',
totchol.mmols = 'Total cholesterol',
diabp = 'Diasolic BP',
glucose.mmols = 'BGL',
sex = 'Sex',
sysbp = 'Systolic BP',
cursmoke = 'Current smoker',
educ = 'Education level',
stroke = 'Stroke',
timestrk = 'Follow-up',
timestrk.yrs = 'Follow-up'),
units = c(age = 'yrs',
totchol.mmols = 'mmol/L',
glucose.mmols = 'mmol/L',
diabp = 'mmHg',
sysbp = 'mmHg',
timestrk = 'days',
timestrk.yrs = 'yrs'))
## Input object size: 511024 bytes; 22 variables 4206 observations
## New object size: 501112 bytes; 22 variables 4206 observations
d$sysHT <- d$prevhyp
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
'Stroke (n)' = sum(stroke),
'Stroke (%)' = ratio(stroke,n,100),
'Stroke (rate)' = ratio(stroke,timestrk.yrs,1000),
'follow-up (Total years)' = sum(timestrk.yrs)),
data = d,
margin = T),
digits = 3)
## ---------------------------------------------------
## sysHT N Stroke Stroke Stroke follow-up
## (n) (%) (rate) (Total
## years)
## ---------------------------------------------------
## No 2963.000 157.000 5.299 2.481 63293.676
## Yes 1243.000 186.000 14.964 8.363 22239.628
##
## Total 4206.000 343.000 8.155 4.010 85533.303
## ---------------------------------------------------
tapply(d$timestrk.yrs, d$sysHT, sum)
## No Yes
## 63293.68 22239.63
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
kableone(CreateTableOne(data = d,
vars = c("age","sex","sysbp", "bmi","glucose.mmols",
"totchol.mmols","diabetes","HT.yes",
"cursmoke","educ","timestrk.yrs"),
factorVars = c("diabetes","HT.yes",
"cursmoke","educ"),
strata = "stroke.yes",
addOverall = T))
| Overall | Yes | No | p | test | |
|---|---|---|---|---|---|
| n | 4206 | 343 | 3863 | ||
| age (mean (SD)) | 49.62 (8.60) | 55.07 (7.77) | 49.13 (8.50) | <0.001 | |
| sex (mean (SD)) | 1.56 (0.50) | 1.51 (0.50) | 1.56 (0.50) | 0.096 | |
| sysbp (mean (SD)) | 131.67 (21.35) | 143.71 (25.92) | 130.60 (20.56) | <0.001 | |
| bmi (mean (SD)) | 25.78 (4.04) | 26.79 (4.50) | 25.69 (3.98) | <0.001 | |
| glucose.mmols (mean (SD)) | 4.55 (1.30) | 4.70 (1.49) | 4.54 (1.29) | 0.030 | |
| totchol.mmols (mean (SD)) | 13.13 (2.46) | 13.43 (2.78) | 13.10 (2.43) | 0.016 | |
| diabetes = 1 (%) | 107 ( 2.5) | 23 ( 6.7) | 84 ( 2.2) | <0.001 | |
| HT.yes = Yes (%) | 1243 (29.6) | 186 (54.2) | 1057 (27.4) | <0.001 | |
| cursmoke = Yes (%) | 2092 (49.7) | 170 (49.6) | 1922 (49.8) | 0.991 | |
| educ (%) | 0.003 | ||||
| 4th grade or less | 1714 (41.8) | 172 (51.5) | 1542 (41.0) | ||
| 5th, 6th, or 7th grades | 1221 (29.8) | 86 (25.7) | 1135 (30.2) | ||
| grade school graduate no high school | 681 (16.6) | 45 (13.5) | 636 (16.9) | ||
| high school, did not graduate | 480 (11.7) | 31 ( 9.3) | 449 (11.9) | ||
| timestrk.yrs (mean (SD)) | 20.34 (6.19) | 14.41 (6.37) | 20.86 (5.90) | <0.001 |
# Rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)
## Strokes N
## No 157 2963
## Yes 186 1243
ci.binomial(stroke.tab$Strokes, stroke.tab$N)
## events total probability se exact.lower95ci exact.upper95ci
## 1 157 2963 0.05298684 0.00411525 0.04519247 0.06167668
## 2 186 1243 0.14963797 0.01011783 0.13024489 0.17070700
# Comparing rates between groups (%) (rate ratio)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)-tapply(d$stroke, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','n')
head(stroke.tab)
## Strokes n
## No 157 2806
## Yes 186 1057
ci.poisson(stroke.tab$Strokes, stroke.tab$n)
## events person.time incidence se exact.lower95ci exact.upper95ci
## 1 157 2806 0.05595153 0.004465418 0.04750285 0.06546329
## 2 186 1057 0.17596973 0.012902726 0.15150993 0.20324503
epi.2by2(dat = cbind(rev(stroke.tab$Strokes), rev(stroke.tab$n)), method = "cohort.count",
conf.level = 0.95,
outcome = 'as.rows',
units = 100)
## Exposed + Exposed - Total
## Outcome + 186 1057 1243
## Outcome - 157 2806 2963
## Total 343 3863 4206
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.98 (1.78, 2.21)
## Odds ratio 3.15 (2.51, 3.93)
## Attrib risk in the exposed * 26.87 (21.41, 32.32)
## Attrib fraction in the exposed (%) 49.54 (43.68, 54.80)
## Attrib risk in the population * 2.19 (0.22, 4.16)
## Attrib fraction in the population (%) 7.41 (5.72, 9.07)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 109.211 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
# Using the Observed and expected counts to obtain the $$chi^2$$ statistic
d$sysHT <- relevel(d$sysHT, ref = 'Yes')
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
stat.table(sysHT, contents = list(N = count(),
Stroke = sum(stroke),
'Rate(%)' = ratio(stroke,n,100)),
margin = T,
data = d)
## --------------------------------
## sysHT N Stroke Rate(%)
## --------------------------------
## Yes 1243 186.00 14.96
## No 2963 157.00 5.30
##
## Total 4206 343.00 8.16
## --------------------------------
tab.O <- table(d$sysHT, d$stroke.yes)
tab.O
##
## Yes No
## Yes 186 1057
## No 157 2806
tab.E <- matrix(c(343*1243/4206, 3863*1243/4206,343*2963/4206,2963*3863/4206), byrow = T, nrow = 2)
tab.E
## [,1] [,2]
## [1,] 101.3669 1141.633
## [2,] 241.6331 2721.367
chi2 <- sum((tab.O - tab.E)^2/tab.E)
chi2
## [1] 109.2112
pchisq(q = chi2, df = 1, lower.tail = F)
## [1] 1.458843e-25
# incidence based on person years (incidence rate ratio)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)
## Strokes pyrs
## Yes 186 22239.63
## No 157 63293.68
ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)
## events person.time incidence se exact.lower95ci exact.upper95ci
## 1 186 22239.63 0.008363449 0.0006132379 0.007200930 0.009659784
## 2 157 63293.68 0.002480501 0.0001979655 0.002105945 0.002902186
epi.2by2(dat = cbind(rev(stroke.tab$Strokes), rev(stroke.tab$pyrs)), method = "cohort.time",
conf.level = 0.95,
units = 100)
## Outcome + Time at risk Inc rate *
## Exposed + 157 63294 0.248
## Exposed - 186 22240 0.836
## Total 343 85533 0.401
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc rate ratio 0.30 (0.24, 0.37)
## Attrib rate in the exposed * -0.59 (-0.71, -0.46)
## Attrib fraction in the exposed (%) -237.17 (-319.66, -171.17)
## Attrib rate in the population * -0.44 (-0.56, -0.31)
## Attrib fraction in the population (%) -108.56 (-116.60, -100.30)
## -------------------------------------------------------------------
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 units of population time at risk
The NHST approach includes the stating of the null-hypothesis (\(H_0\)) (no-effect). In other would no association between the study factor and study outcome:
\[H_0: \text{Mean systolic BP among non-stroke} = \text{Mean systolic BP among stroke}\] In contrast to the null-hypothesis, an alternate hypothesis (\(H_i\)) is define:
\[H_i: \text{Mean systolic BP among non-stroke} \neq \text{Mean systolic BP among stroke}\] Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail). Comparing the means of two groups - Student’s T-Test. To assess the potential role of chance when comparing the mean systolic blood pressure at baseline, between the stroke yes and stroke no study participants, we will use a t-test, formally known as Student’s t-test, as William S. Gosset 1 at the time of publishing this method was employee of the Head Brewer of Guinness, his employer’s policy in publishing research results at that time prevented him from using his real name. Student’s t-test essentially is the ratio of the difference between the two group mean weighted by an estimate of error of this comparison. If we let the two group means be expressed as \(\bar{x_1}\) and \(\bar{x_2}\), repestively, the test statistic is estimates as:
\[t = \frac{\bar{x_1} - \bar{x_2}}{SE(\bar{x_1} - \bar{x_2})}\] Where, \(SE(\bar{x_1} - \bar{x_2})\) is a summary standard error of both the means, using the sample size and SD of the groups. However, there is two version, one based on assuming and equal variance (\(SD^2\) or \(\sigma^2\)), and the other unequal variance. The former being:
\[SE(\bar{x_1} - \bar{x_2}) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\] SPSS Code:
MEANS TABLES = sysbp BY stroke
/CELLS=COUNT MEAN STDDEV MEDIAN.
* Creat histograms.
EXAMINE VARIABLES = sysbp
/PLOT HISTOGRAM.
EXAMINE VARIABLES=sysbp BY stroke
/PLOT=BOXPLOT
/STATISTICS=NONE
/NOTOTAL.
EXECUTE.
SPSS Output Report sysbp
| stroke | N | Mean | Std. Deviation | Median |
|---|---|---|---|---|
| No | 3863 | 130.597 | 20.5604 | 127.000 |
| Yes | 343 | 143.711 | 25.9166. | 139.000 |
| Total | 4206 | 131.666 | 21.3485. | 128.000 |
R Output
SPSS Code
* T-test mean SBP between stroke Yes and Stroke No participants.
T-TEST GROUPS=stroke(0,1)
/MISSING=ANALYSIS
/VARIABLES=sysbp
/CRITERIA=CI(.95).
SPSS Output
Independent Samples Test
Levene’s Test for Equality of Variances —— t-test for Equality of
Means
| F | Sig. | t | df | One-Sided p | Two-Sided p. | Mean Difference | Std. Error Difference | 95% CI of the Difference | ||
|---|---|---|---|---|---|---|---|---|---|---|
| sysbp | Equal variances assumed | 31.316 | <.001 | -11.060 | 4204. | <.001 | <.001 | -13.1147 | 1.1858 | -15.4395 to -10.7899. |
| Equal variances not assumed | . | -9.121 | 381.186 | <.001 | <.001 | -13.1147 | 1.4379 | -15.9420 to -10.2874. |
When using the R t-test function we can choose between the two by using the subcommand ‘var.equal’ T/F. Given the equal SD and therefore variance of age between women and men, we will include TRUE.
R Output
# T-test
t.test(sysbp ~ stroke, data = d, var.equal = T) # assumption of equal variance (SD)
##
## Two Sample t-test
##
## data: sysbp by stroke
## t = -11.06, df = 4204, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.43951 -10.78986
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
t.test(sysbp ~ stroke, data = d, var.equal = F) # assumption of un0qual variance (SD)
##
## Welch Two Sample t-test
##
## data: sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
par(family = 'sans')
boxplot(sysbp ~ stroke.yes, data = d,
col = c('red','lightgreen'),
xlab = 'Stroke',
ylab = 'SBP (mmHg)',
las = 1,
ylim = c(0,300),
frame = F,
yaxt = 'n')
axis(2, at = seq(0,300,25), label = seq(0,300,25), las = 1)
TTest <- t.test(sysbp ~ stroke, data = d)
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
sbp.stroke <- epi.conf(data.frame(group = d$stroke.yes, val = d$sysbp),
ctype = "mean.unpaired", conf.level = 0.95 )
text(1.5, 40, paste('mean difference (mmHg) = ', round(sbp.stroke$est,1),
' (95%CI ', round(sbp.stroke$lower,1),' to ',
round(sbp.stroke$upper,1),', p < 0.001)', sep = ''), cex = 0.75)
# SHADE TAILS OF n-DISTRIBUTED PLOTS
par(mfrow = c(1,1))
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l",
axes = T,
xlab = '',
ylab = '')
lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey',
border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht),
col = 'grey', border = NA)
abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey')
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')
text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')
# Linear-regression model
summary(lm(sysbp ~ stroke, data = d))
##
## Call:
## lm(formula = sysbp ~ stroke, data = d)
##
## Residuals:
## Systolic BP [mmHg]
## Min 1Q Median 3Q Max
## -49.711 -14.597 -3.597 10.903 151.289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.5967 0.3386 385.66 <2e-16 ***
## stroke 13.1147 1.1858 11.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.05 on 4204 degrees of freedom
## Multiple R-squared: 0.02827, Adjusted R-squared: 0.02804
## F-statistic: 122.3 on 1 and 4204 DF, p-value: < 2.2e-16
regress.display(lm(sysbp ~ stroke, data = d))
## Linear regression predicting sysbp
##
## Coeff.(95%CI) P(t-test) P(F-test)
## stroke: 1 vs 0 13.11 (10.79,15.44) < 0.001 < 0.001
##
## No. of observations = 4206
Study participants in the Framingham Heart Study who had a stroke during the following period were observed to have a higher average systolic blood pressure at baseline, 144 mmHg versus 131 mmHg, respectively (mean difference = 13.1 mmHg, 95%CI 10.8 to 15.4, p < 0.001).
Student. The Probable error of a mean. Biometrika vol 6, no. 1, 1908 pp1-25↩︎