This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.

Our approach based on PPDAC:

Problem

At the time of the initiation of the Framingham study it was not clear the role of blood pressure played in the risk of stroke, as blood pressure was thought to naturally increase with age. Therefore, in terms of assessing the risk of stroke is relationship to increasing blood pressure, a cohort study design is the most suitable design. Importantly, it would be un-ethical to randomise individuals to high blood pressure, and a prospective cohort of individuals, without stroke at baseline, among whom blood pressure can be measured, and then follow-up for an incident stroke. Our approach will be to refer to blood pressure as our study factor and stroke as our study outcome.

Population

Adults with (exposed) and without high (un-exposed/comparator/counter factorial) blood pressure.

Data

Baseline blood pressure and follow stroke incidence.

Analysis

Hypothesis testing and estimation.

Null Hypothesis Significance Testing (NHST)

Comparing the rates of stroke between hypertensive and non-hypertensive study participants.

The NHST approach includes the stating of the null-hypothesis (\(H_0\)) (no-effect). In other would no association between the study factor and study outcome:

\[H_0: \text{The rate of stroke among HT} = \text{rate among non-HT}\]

In contrast to the null-hypothesis, an alternate hypothesis (\(H_i\)) is define:

\[H_i: \text{The rate of stroke among HT} \neq \text{rate among non-HT}\]

Then a critical value is set, say from a z-statistic (i.e > 1.96 upper 2.5% tail), or the case of the chi-squared test statistic (\(\chi^2\)) > 3.84 (\(1.96^2\)).

To assess the potential role of chance when comparing the frequencies of a two-by-two table of stroke yes/no and HT yes/no, the Pearson’s chi-squared test is commonly used. It is based on comparing the difference between the observed (O) table cell frequencies and what would be expected (E) by chance. The specific statistic being the sum of these differences:

\[\chi^2 = \sum{\frac{(O-E)^2}{E}}\] Commonly the approach of squaring an subtraction of two values is seen in statistics, this is a trick statisticians use to remove negative values! And, the expected values of each given cell is estimated using the row and column totals:

\[E = \frac{(total_{row} \times total_{column})}{Total_N}\] # Estimation of effect ## Rate Ratio, Odds Ratio and Incidence Rate Ratio.

SPSS Code:

* Cross tabulation and chisq test .
CROSSTABS TABLES =  stroke BY prevhyp
                    /CELLS=COUNT ROW
                    /STATISTICS=CHISQ.
EXECUTE.

SPSS OUTPUT

	.	.	HT	no-HT.	Total
stroke	No	Count.	2806	1057	3863.
.	.	% within prevhyp	85.0%	94.7%	91.8%
.	Yes	Count	186	157	343.
	.	% within prevhyp	15.0%	5.3%.	8.2%
Total		Count.	1243	2963	4206.
	.	% within prevhyp	100%	100%	100.0%.

Chi-Squared Tests

	Value.	df	Asymptotic Significance (2-sided)	Exact Sig. (2-sided)	Exact Sig. (1-sided)
Pearson Chi-Square	109.211\(^a\)	1	<.001
Continuity Correction\(^b\)	107.925	1	<.001
Likelihood Ratio	99.456.	1	<.001
Fisher’s Exact Test		.		<.001	<.001.
Linear-by-Linear Association.	109.185.	1	<.001
N of Valid Cases	4206.

note: \(^a\) 0 cells (0.0%) have expected count less than 5. The minimum expected count is 101.37.
\(^b\) Computed only for a 2x2 table

The \(\chi^2\) statistic (Pearson’s without continuity correction) is 109.211, which has a p-value < 0.001.

Using R the assocated p-value is:

options(width = 1500)
pchisq(q = 109.211, df = 1, lower.tail = F)

## [1] 1.458982e-25

Now using R:

options(width = 1500)
# Framingham multiple visit data
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
## setwd("C:/Users/sfrost/OneDrive - University of Wollongong/Data/")
## setwd('Z:/OneDrive - University of Wollongong/Data/')
## source('epiTables V3.R')

# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
framingham$n <- 1
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18
# Create age groups with labels
framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
                            labels = c('30-39','40-49',
                                       '50-59','60-69',
                                       '70+'))

# remove participants with a prior stroke or tx 
# for BP and only use first visit data.
d <- subset(framingham,  bpmeds == 0 &
              prevstrk == 0 &
              period == 1, select = c(randid,age,totchol.mmols,
                                      diabp,glucose.mmols,bmi,
                                      sex,sysbp,cursmoke,educ,
                                      cigpday,diabetes,stroke,
                                      timestrk,death,timedth,
                                      prevhyp,age.group))
# create follow-up in years
d$n <- 1
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
d$HT.yes <- factor(d$prevhyp, levels= 0:1, labels = c('No','Yes'))
d$cursmoke <- factor(d$cursmoke, levels= 0:1, labels = c('No','Yes'))
d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
                                                  '5th, 6th, or 7th grades',
                                                  'grade school graduate no high school',
                                                  'high school, did not graduate'))
# variables labels and unit
d <- upData(d, labels = c(age = 'Age',
                          totchol.mmols = 'Total cholesterol',
                          diabp = 'Diasolic BP',
                          glucose.mmols = 'BGL',
                          sex = 'Sex',
                          sysbp = 'Systolic BP',
                          cursmoke = 'Current smoker',
                          educ = 'Education level',
                          stroke = 'Stroke',
                          timestrk = 'Follow-up',
                          timestrk.yrs = 'Follow-up'),
               units = c(age = 'yrs',
                      totchol.mmols = 'mmol/L',
                      glucose.mmols = 'mmol/L',
                      diabp = 'mmHg',
                      sysbp = 'mmHg',        
                      timestrk = 'days',
                      timestrk.yrs = 'yrs'))

## Input object size:    511024 bytes;   22 variables    4206 observations
## New object size: 501112 bytes;   22 variables    4206 observations

d$sysHT <- d$prevhyp
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
                                        'Stroke (n)' = sum(stroke),
                                        'Stroke (%)' = ratio(stroke,n,100),
                                        'Stroke (rate)' = ratio(stroke,timestrk.yrs,1000),
                                        'follow-up (Total years)' = sum(timestrk.yrs)),
                                         data = d,
                                         margin = T),
                                         digits = 3)

##  --------------------------------------------------- 
##  sysHT          N  Stroke  Stroke  Stroke follow-up  
##                       (n)     (%)  (rate)    (Total  
##                                              years)  
##  --------------------------------------------------- 
##  No      2963.000 157.000   5.299   2.481 63293.676  
##  Yes     1243.000 186.000  14.964   8.363 22239.628  
##                                                      
##  Total   4206.000 343.000   8.155   4.010 85533.303  
##  ---------------------------------------------------

tapply(d$timestrk.yrs, d$sysHT, sum)

##       No      Yes 
## 63293.68 22239.63

d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
kableone(CreateTableOne(data = d,
                        vars = c("age","sex","sysbp", "bmi","glucose.mmols",
                                 "totchol.mmols","diabetes","HT.yes",
                                 "cursmoke","educ","timestrk.yrs"),
                        factorVars = c("diabetes","HT.yes",
                                       "cursmoke","educ"),
                        strata = "stroke.yes",
                        addOverall = T))

	Overall	Yes	No	p
n	4206	343	3863
age (mean (SD))	49.62 (8.60)	55.07 (7.77)	49.13 (8.50)	<0.001
sex (mean (SD))	1.56 (0.50)	1.51 (0.50)	1.56 (0.50)	0.096
sysbp (mean (SD))	131.67 (21.35)	143.71 (25.92)	130.60 (20.56)	<0.001
bmi (mean (SD))	25.78 (4.04)	26.79 (4.50)	25.69 (3.98)	<0.001
glucose.mmols (mean (SD))	4.55 (1.30)	4.70 (1.49)	4.54 (1.29)	0.030
totchol.mmols (mean (SD))	13.13 (2.46)	13.43 (2.78)	13.10 (2.43)	0.016
diabetes = 1 (%)	107 ( 2.5)	23 ( 6.7)	84 ( 2.2)	<0.001
HT.yes = Yes (%)	1243 (29.6)	186 (54.2)	1057 (27.4)	<0.001
cursmoke = Yes (%)	2092 (49.7)	170 (49.6)	1922 (49.8)	0.991
educ (%)				0.003
4th grade or less	1714 (41.8)	172 (51.5)	1542 (41.0)
5th, 6th, or 7th grades	1221 (29.8)	86 (25.7)	1135 (30.2)
grade school graduate no high school	681 (16.6)	45 (13.5)	636 (16.9)
high school, did not graduate	480 (11.7)	31 ( 9.3)	449 (11.9)
timestrk.yrs (mean (SD))	20.34 (6.19)	14.41 (6.37)	20.86 (5.90)	<0.001

# Rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)

##     Strokes    N
## No      157 2963
## Yes     186 1243

ci.binomial(stroke.tab$Strokes, stroke.tab$N)

##   events total probability         se exact.lower95ci exact.upper95ci
## 1    157  2963  0.05298684 0.00411525      0.04519247      0.06167668
## 2    186  1243  0.14963797 0.01011783      0.13024489      0.17070700

# Comparing rates between groups (%) (rate ratio)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)-tapply(d$stroke, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','n')
head(stroke.tab)

##     Strokes    n
## No      157 2806
## Yes     186 1057

ci.poisson(stroke.tab$Strokes, stroke.tab$n)

##   events person.time  incidence          se exact.lower95ci exact.upper95ci
## 1    157        2806 0.05595153 0.004465418      0.04750285      0.06546329
## 2    186        1057 0.17596973 0.012902726      0.15150993      0.20324503

epi.2by2(dat = cbind(rev(stroke.tab$Strokes), rev(stroke.tab$n)), method = "cohort.count", 
                      conf.level = 0.95, 
                      outcome = 'as.rows',
                      units = 100)

##              Exposed +    Exposed -      Total
## Outcome +          186         1057       1243
## Outcome -          157         2806       2963
## Total              343         3863       4206
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 1.98 (1.78, 2.21)
## Odds ratio                                     3.15 (2.51, 3.93)
## Attrib risk in the exposed *                   26.87 (21.41, 32.32)
## Attrib fraction in the exposed (%)            49.54 (43.68, 54.80)
## Attrib risk in the population *                2.19 (0.22, 4.16)
## Attrib fraction in the population (%)         7.41 (5.72, 9.07)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 109.211 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

# Using the Observed and expected counts to obtain the $$chi^2$$ statistic
d$sysHT <- relevel(d$sysHT, ref = 'Yes')
d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
stat.table(sysHT, contents = list(N = count(),
                                  Stroke = sum(stroke),
                                  'Rate(%)' = ratio(stroke,n,100)),
                                  margin = T,
                                  data = d)

##  -------------------------------- 
##  sysHT         N  Stroke Rate(%)  
##  -------------------------------- 
##  Yes        1243  186.00   14.96  
##  No         2963  157.00    5.30  
##                                   
##  Total      4206  343.00    8.16  
##  --------------------------------

tab.O <- table(d$sysHT, d$stroke.yes)
tab.O

##      
##        Yes   No
##   Yes  186 1057
##   No   157 2806

tab.E <- matrix(c(343*1243/4206, 3863*1243/4206,343*2963/4206,2963*3863/4206), byrow = T, nrow = 2)
tab.E

##          [,1]     [,2]
## [1,] 101.3669 1141.633
## [2,] 241.6331 2721.367

chi2 <- sum((tab.O - tab.E)^2/tab.E)
chi2

## [1] 109.2112

pchisq(q = chi2, df = 1, lower.tail = F)

## [1] 1.458843e-25

# Using the R function.
tab.O

##      
##        Yes   No
##   Yes  186 1057
##   No   157 2806

chisq.test(tab.O, correct = F)

## 
##  Pearson's Chi-squared test
## 
## data:  tab.O
## X-squared = 109.21, df = 1, p-value < 2.2e-16

# incidence based on person years (incidence rate ratio)
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)

##     Strokes     pyrs
## Yes     186 22239.63
## No      157 63293.68

ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)

##   events person.time   incidence           se exact.lower95ci exact.upper95ci
## 1    186    22239.63 0.008363449 0.0006132379     0.007200930     0.009659784
## 2    157    63293.68 0.002480501 0.0001979655     0.002105945     0.002902186

epi.2by2(dat = cbind(rev(stroke.tab$Strokes), rev(stroke.tab$pyrs)), method = "cohort.time", 
                      conf.level = 0.95, 
                      units = 100)

##              Outcome +    Time at risk        Inc rate *
## Exposed +          157           63294             0.248
## Exposed -          186           22240             0.836
## Total              343           85533             0.401
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc rate ratio                                 0.30 (0.24, 0.37)
## Attrib rate in the exposed *                   -0.59 (-0.71, -0.46)
## Attrib fraction in the exposed (%)            -237.17 (-319.66, -171.17)
## Attrib rate in the population *                -0.44 (-0.56, -0.31)
## Attrib fraction in the population (%)         -108.56 (-116.60, -100.30)
## -------------------------------------------------------------------
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 units of population time at risk

Let’s extent Pearson’s Chi-squared test for more than two groups. For example, let’s look at age groups and stroke.

options(width = 1500)
# remove empty levels of age.group
d$age.group <- droplevels(d$age.group)
tab.O <- table(d$age.group, d$stroke.yes)
tab.O

##        
##          Yes   No
##   30-39   11  542
##   40-49   68 1574
##   50-59  148 1159
##   60-69  116  588

stat.table(age.group, contents = list(N = count(),
                                  Stroke = sum(stroke),
                                  'Rate(%)' = ratio(stroke,n,100)),
                                  margin = T,
                                  data = d)

##  ------------------------------------ 
##  age.group         N  Stroke Rate(%)  
##  ------------------------------------ 
##  30-39           553   11.00    1.99  
##  40-49          1642   68.00    4.14  
##  50-59          1307  148.00   11.32  
##  60-69           704  116.00   16.48  
##                                       
##  Total          4206  343.00    8.16  
##  ------------------------------------

# Using the R function.
chisq.test(tab.O, correct = F)

## 
##  Pearson's Chi-squared test
## 
## data:  tab.O
## X-squared = 146.01, df = 3, p-value < 2.2e-16

A problem with using Pearson’s Chi-squared test with more than two groups is that even though at least two groups differ in terms of the outcome of interest, that overall test will not identify which specific groups differ. In the case of age groups, there is a natural order to the increasing age groups, and a trend test may be more useful.

options(width = 1500)
# remove empty levels of age.group
# Using the R function.
prop.trend.test(x = tab.O[,1], n = tab.O[,1]+tab.O[,2])

## 
##  Chi-squared Test for Trend in Proportions
## 
## data:  tab.O[, 1] out of tab.O[, 1] + tab.O[, 2] ,
##  using scores: 1 2 3 4
## X-squared = 139.06, df = 1, p-value < 2.2e-16

The benefit of the trend test is the assessment of a dose response between the study factor, in our case increasing SBP group, and risk of stroke. Repeated comparison between a reference group, say the 83.5,114 mmHg and subsequent groups may be in fact under powered.

options(width = 1500)
# SysBP groups
d$sysbp.group <- cut2(d$sysbp, g = 5)

tab.O <- table(d$sysbp.group, d$stroke.yes)
tab.O

##              
##               Yes  No
##   [ 83.5,114)  36 832
##   [114.5,124)  36 783
##   [124.0,133)  60 790
##   [133.0,148)  85 765
##   [147.5,295] 126 693

stat.table(sysbp.group, contents = list(N = count(),
                                  Stroke = sum(stroke),
                                  'Rate(%)' = ratio(stroke,n,100)),
                                  margin = T,
                                  data = d)

##  -------------------------------------- 
##  sysbp.group         N  Stroke Rate(%)  
##  -------------------------------------- 
##  [ 83.5,114)       868   36.00    4.15  
##  [114.5,124)       819   36.00    4.40  
##  [124.0,133)       850   60.00    7.06  
##  [133.0,148)       850   85.00   10.00  
##  [147.5,295]       819  126.00   15.38  
##                                         
##  Total            4206  343.00    8.16  
##  --------------------------------------

print(logistic.display(glm(cbind(stroke,n) ~ sysbp.group, family = binomial, data = d)), digits = 2)

## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                               OR(95%CI)         P(Wald's test) P(LR-test)
## sysbp.group: ref.=[ 83.5,114)                                  < 0.001   
##    [114.5,124)                1.06 (0.66,1.7)   0.809                    
##    [124.0,133)                1.7 (1.11,2.6)    0.014                    
##    [133.0,148)                2.41 (1.61,3.6)   < 0.001                  
##    [147.5,295]                3.71 (2.53,5.44)  < 0.001                  
##                                                                          
## Log-likelihood = -939.8625
## No. of observations = 4206
## AIC value = 1889.725

prop.trend.test(x = tab.O[,1], n = tab.O[,1]+tab.O[,2])

## 
##  Chi-squared Test for Trend in Proportions
## 
## data:  tab.O[, 1] out of tab.O[, 1] + tab.O[, 2] ,
##  using scores: 1 2 3 4 5
## X-squared = 87.211, df = 1, p-value < 2.2e-16

# Using a regression model the trend test can be
# undertaken by adding a ordinal value to the groups
d$sysbp.group.n <- as.numeric(d$sysbp.group) - 1
print(logistic.display(glm(cbind(stroke,n) ~ sysbp.group.n, family = binomial, data = d)), digits = 2)

## 
## Logistic regression predicting cbind(stroke, n) 
##  
##                            OR(95%CI)         P(Wald's test) P(LR-test)
## sysbp.group.n (cont. var.) 1.43 (1.32,1.56)  < 0.001        < 0.001   
##                                                                       
## Log-likelihood = -940.7791
## No. of observations = 4206
## AIC value = 1885.5582

print(summary(glm(cbind(stroke,n) ~ I(sysbp/10), family = binomial, data = d)), digits = 2)

## 
## Call:
## glm(formula = cbind(stroke, n) ~ I(sysbp/10), family = binomial, 
##     data = d)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.04   -0.41   -0.35   -0.31    2.04  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.325      0.315   -16.9   <2e-16 ***
## I(sysbp/10)    0.206      0.022     9.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1481.7  on 4205  degrees of freedom
## Residual deviance: 1399.0  on 4204  degrees of freedom
## AIC: 1878
## 
## Number of Fisher Scoring iterations: 5

Null Hypothesis Significance Testing (NHST)

Comparing the mean systolic blood pressure between stroke and non-stroke study participants.

The NHST approach includes the stating of the null-hypothesis (\(H_0\)) (no-effect). In other would no association between the study factor and study outcome:

\[H_0: \text{Mean systolic BP among non-stroke} = \text{Mean systolic BP among stroke}\] In contrast to the null-hypothesis, an alternate hypothesis (\(H_i\)) is define:

\[H_i: \text{Mean systolic BP among non-stroke} \neq \text{Mean systolic BP among stroke}\] Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail). Comparing the means of two groups - Student’s T-Test. To assess the potential role of chance when comparing the mean systolic blood pressure at baseline, between the stroke yes and stroke no study participants, we will use a t-test, formally known as Student’s t-test, as William S. Gosset [^1] at the time of publishing this method was employee of the Head Brewer of Guinness, his employer’s policy in publishing research results at that time prevented him from using his real name. Student’s t-test essentially is the ratio of the difference between the two group mean weighted by an estimate of error of this comparison. If we let the two group means be expressed as \(\bar{x_1}\) and \(\bar{x_2}\), repestively, the test statistic is estimates as:

\[t = \frac{\bar{x_1} - \bar{x_2}}{SE(\bar{x_1} - \bar{x_2})}\] Where, \(SE(\bar{x_1} - \bar{x_2})\) is a summary standard error of both the means, using the sample size and SD of the groups. However, there is two version, one based on assuming and equal variance (\(SD^2\) or \(\sigma^2\)), and the other unequal variance. The former being:

\[SE(\bar{x_1} - \bar{x_2}) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\] SPSS Code:

MEANS TABLES = sysbp BY stroke
  /CELLS=COUNT MEAN STDDEV MEDIAN.
  
* Creat histograms.
EXAMINE VARIABLES = sysbp 
      /PLOT HISTOGRAM.


EXAMINE VARIABLES=sysbp BY stroke
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.

EXECUTE.

SPSS Output Report sysbp

stroke	N	Mean	Std. Deviation	Median
No	3863	130.597	20.5604	127.000
Yes	343	143.711	25.9166.	139.000
Total	4206	131.666	21.3485.	128.000

R Output

SPSS Code

* T-test mean SBP between stroke Yes and Stroke No participants.
T-TEST GROUPS=stroke(0,1)
  /MISSING=ANALYSIS
  /VARIABLES=sysbp 
  /CRITERIA=CI(.95).

SPSS Output

Independent Samples Test
Levene’s Test for Equality of Variances —— t-test for Equality of Means

		F	Sig.	t	df	One-Sided p	Two-Sided p.	Mean Difference	Std. Error Difference	95% CI of the Difference
sysbp	Equal variances assumed	31.316	<.001	-11.060	4204.	<.001	<.001	-13.1147	1.1858	-15.4395 to -10.7899.
	Equal variances not assumed		.	-9.121	381.186	<.001	<.001	-13.1147	1.4379	-15.9420 to -10.2874.

When using the R t-test function we can choose between the two by using the subcommand ‘var.equal’ T/F. Given the equal SD and therefore variance of age between women and men, we will include TRUE.

R Output

# T-test
t.test(sysbp ~ stroke, data = d, var.equal = T) # assumption of equal variance (SD)

## 
##  Two Sample t-test
## 
## data:  sysbp by stroke
## t = -11.06, df = 4204, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -15.43951 -10.78986
## sample estimates:
## mean in group 0 mean in group 1 
##        130.5967        143.7114

t.test(sysbp ~ stroke, data = d, var.equal = F) # assumption of un0qual variance (SD)

## 
##  Welch Two Sample t-test
## 
## data:  sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1 
##        130.5967        143.7114

par(family = 'sans')
boxplot(sysbp ~ stroke.yes, data = d,
        col = c('red','lightgreen'),
        xlab = 'Stroke',
        ylab = 'SBP (mmHg)',
        las = 1,
        ylim = c(0,300),
        frame = F,
        yaxt = 'n')
axis(2, at = seq(0,300,25), label = seq(0,300,25), las = 1)

TTest <- t.test(sysbp ~ stroke, data = d)

d$stroke.yes <- relevel(d$stroke.yes, ref = 'Yes')
sbp.stroke <- epi.conf(data.frame(group = d$stroke.yes, val = d$sysbp), 
                                          ctype = "mean.unpaired", conf.level = 0.95 )

text(1.5, 40, paste('mean difference (mmHg) = ', round(sbp.stroke$est,1),
                    ' (95%CI ', round(sbp.stroke$lower,1),' to ',
                                round(sbp.stroke$upper,1),', p < 0.001)', sep = ''), cex = 0.75)

# SHADE TAILS OF n-DISTRIBUTED PLOTS
par(mfrow = c(1,1))
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l", 
           axes = T,
           xlab = '',
           ylab = '')

lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey', 
                                                      border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht), 
                                        col = 'grey', border = NA)
                                        
abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey') 
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')

text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')

Using a linear regression model

# Linear-regression model
summary(lm(sysbp ~ stroke, data = d))

## 
## Call:
## lm(formula = sysbp ~ stroke, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -49.711 -14.597  -3.597  10.903 151.289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 130.5967     0.3386  385.66   <2e-16 ***
## stroke       13.1147     1.1858   11.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.05 on 4204 degrees of freedom
## Multiple R-squared:  0.02827,    Adjusted R-squared:  0.02804 
## F-statistic: 122.3 on 1 and 4204 DF,  p-value: < 2.2e-16

regress.display(lm(sysbp ~ stroke, data  = d))

## Linear regression predicting sysbp
##  
##                 Coeff.(95%CI)        P(t-test) P(F-test)
## stroke: 1 vs 0  13.11 (10.79,15.44)  < 0.001   < 0.001  
##                                                         
## No. of observations = 4206

Conclusion

Study participants in the Framingham Heart Study who had a stroke during the following period were observed to have a higher average systolic blood pressure at baseline, 144 mmHg versus 131 mmHg, respectively (mean difference = 13.1 mmHg, 95%CI 10.8 to 15.4, p < 0.001).

Now let us explore the relationship between age and systolic blood pressure.

Both age (yrs) and systolic blood pressure (mmHg) are continuous variables.

par(mfrow=c(1,3))
hist(d$age, xlab = 'Age (yrs)', 
            main = '',
            las = 1)
hist(d$sysbp, xlab = 'SBP (mmHg)', 
              main = '',
              las = 1)

A scatter plot of the association between age (independent variable) and SBP (dependent variable).

par(mfrow=c(1,1))
plot(d$age, d$sysbp, 
            xlab = 'Age (yrs)',
            ylab = 'SBP (mmHg)',
            las = 1)
# adding a regression line:
abline(lsfit(d$age, d$sysbp), lwd = 2, lty = 2, col = 'blue')

The regression line suggests that SBP appears to increase with age (generally accepted as true). So let is us explore this association with a formal null-hypothesis approach. Given that both variables are continuous a correlation coefficient (Pearson’s) can be used.

\[H_0: \text{The correlation coefficient (rho) between age and SBP} = 0\] \[H_i: \text{The correlation coefficient (rho) between age and SBP} \neq 0\] Pearson’s correlation coefficient can be calculated as follows:

\[\text rho = \frac{cov(age, sysbp)}{(\sigma_{age}^2 \times \sigma_{sysbp}^2)^\frac{1}{2}}\]

cor.test(d$age, d$sysbp)

## 
##  Pearson's product-moment correlation
## 
## data:  d$age and d$sysbp
## t = 26.693, df = 4204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3545472 0.4062397
## sample estimates:
##       cor 
## 0.3806909

cov(d$age, d$sysbp) / (var(d$age)*var(d$sysbp))^0.5

## [1] 0.3806909

The Pearson’s correlation coefficient = 0.38 with a 95% confidence interval of 0.35 to 0.41. The test of the null hypothesis that rho is = 0 gives a p-value < 0.001. We can summarise this result as that there appears to be an association between increasing age and SBP, however age only appears to explain approximately 14% of the variation in SBP among study participants.

A scatter plot of the association between age (independent variable) and SBP (dependent variable). Rho (95% CI) and p-value has been added.

par(mfrow=c(1,1))
plot(d$age, d$sysbp, 
            xlab = 'Age (yrs)',
            ylab = 'SBP (mmHg)',
            las = 1)
# adding a regression line:
abline(lsfit(d$age, d$sysbp), lwd = 2, lty = 2, col = 'blue')
text(50, 275, 'rho = 0.38 (95%CI 0.35-0.41), p < 0.001')

An estimate of the observed (average increase) in SBP for each unit change in age can be estimates using linear-regression.

summary(lm(sysbp ~ age, data = d))

## 
## Call:
## lm(formula = sysbp ~ age, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -56.085 -13.253  -2.304   9.862 149.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.7830     1.7826   47.56   <2e-16 ***
## age           0.9449     0.0354   26.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1447 
## F-statistic: 712.5 on 1 and 4204 DF,  p-value: < 2.2e-16

# for each increase in age by 5-years
summary(lm(sysbp ~ I(age/5), data = d))

## 
## Call:
## lm(formula = sysbp ~ I(age/5), data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -56.085 -13.253  -2.304   9.862 149.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   84.783      1.783   47.56   <2e-16 ***
## I(age/5)       4.724      0.177   26.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1447 
## F-statistic: 712.5 on 1 and 4204 DF,  p-value: < 2.2e-16

The regression results suggest that SBP increases by 0.95 mmHg for each year increase in age. While for each five year increase in age the SBP is estimated to increase by 4.7 mmHg. Compare the results if age was used as an categorical variable (age group).

summary(aov(sysbp ~ age.group, data = d))

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## age.group      3  260645   86882   220.5 <2e-16 ***
## Residuals   4202 1655825     394                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance will identify if at least two groups have mean values that difference beyond that may have occurred by chance (at 0.05, or 1/20). However, ANOVA lacks estimates, to identify which groups differs. This can be overcome by using linear regression.

summary(aov(sysbp ~ age.group, data = d))

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## age.group      3  260645   86882   220.5 <2e-16 ***
## Residuals   4202 1655825     394                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm(sysbp ~ age.group, data = d))

## 
## Call:
## lm(formula = sysbp ~ age.group, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -52.559 -13.503  -2.503  10.384 150.384 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    120.1302     0.8441 142.310  < 2e-16 ***
## age.group40-49   6.3725     0.9760   6.529 7.39e-11 ***
## age.group50-59  15.9287     1.0070  15.818  < 2e-16 ***
## age.group60-69  24.4856     1.1280  21.708  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.85 on 4202 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.1354 
## F-statistic: 220.5 on 3 and 4202 DF,  p-value: < 2.2e-16

boxplot(sysbp ~ age.group, data = d,
                xlab = '',
                ylab = 'SBP (mmHg)',
                las = 1)

# Remove intercept
tapply(d$sysbp, d$age.group, mean)

##    30-39    40-49    50-59    60-69 
## 120.1302 126.5027 136.0589 144.6158

summary(lm(sysbp ~ age.group - 1, data = d))

## 
## Call:
## lm(formula = sysbp ~ age.group - 1, data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -52.559 -13.503  -2.503  10.384 150.384 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## age.group30-39 120.1302     0.8441   142.3   <2e-16 ***
## age.group40-49 126.5027     0.4899   258.2   <2e-16 ***
## age.group50-59 136.0589     0.5491   247.8   <2e-16 ***
## age.group60-69 144.6158     0.7482   193.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.85 on 4202 degrees of freedom
## Multiple R-squared:  0.9779, Adjusted R-squared:  0.9779 
## F-statistic: 4.642e+04 on 4 and 4202 DF,  p-value: < 2.2e-16

regress.display(lm(sysbp ~ age.group - 1, data = d))

## Linear regression predicting sysbp
##  
##                  Coeff.(95%CI)           P(t-test) P(F-test)
## age.group: ref.=                                   < 0.001  
##    30-39         120.13 (118.48,121.79)  < 0.001            
##    40-49         126.5 (125.54,127.46)   < 0.001            
##    50-59         136.06 (134.98,137.14)  < 0.001            
##    60-69         144.62 (143.15,146.08)  < 0.001            
##                                                             
## No. of observations = 4206

The best approach would be to use age as a continuous variable and rescale to something useful like 5-year increase.

# Remove intercept
summary(lm(sysbp ~ I(age/5), data = d))

## 
## Call:
## lm(formula = sysbp ~ I(age/5), data = d)
## 
## Residuals:
## Systolic BP [mmHg] 
##     Min      1Q  Median      3Q     Max 
## -56.085 -13.253  -2.304   9.862 149.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   84.783      1.783   47.56   <2e-16 ***
## I(age/5)       4.724      0.177   26.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.74 on 4204 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1447 
## F-statistic: 712.5 on 1 and 4204 DF,  p-value: < 2.2e-16

Essentially saying that SBP increases by approximately 4.7 mmHg for each 5-year increase in age.

In conclusion - the take away message is that it maybe convenient to categorise a continuous variable to present some descriptive statistics, however, formal statistical test should be based on preserving the contiuous nature of the given variable. [^1]: Student. The Probable error of a mean. Biometrika vol 6, no. 1, 1908 pp1-25

SNPG916 Inferential Statistics - Workshop 4

Steve Frost PhD SoN, Faculty of Science, Medicine and Health.

2023-09-12

Null Hypothesis Significance Testing (NHST)

Comparing the rates of stroke between hypertensive and non-hypertensive study participants.

Null Hypothesis Significance Testing (NHST)

Comparing the mean systolic blood pressure between stroke and non-stroke study participants.

Using a linear regression model

Conclusion

Now let us explore the relationship between age and systolic blood pressure.