This workshop with give an introduction to creating Table 1 of a research report using SPSS and R. We will use the teaching version of the Framingham Heart Study Data.
Opening the File menu and selecting Open - Data (find your location on the framingham.sav file). The SPSS Syntax for this:
* Set a working directory.
CD "/Users/stevefrost/Library/CloudStorage/OneDrive-UniversityofWollongong/Data/".
* CD "Z:\OneDrive - University of Wollongong\Data\".
* Load data.
GET FILE "framingham.sav".
* name dataset
DATASET NAME Framingham.
* List variable names (columns)
DISPLAY NAMES.
EXECUTE.
Variable Names
randid
sex
totchol
age
sysbp
diabp
cursmoke
cigpday
bmi
diabetes
bpmeds
heartrte
glucose
educ
prevchd
prevap
prevmi
prevstrk
prevhyp
time
period
hdlc
ldlc
death
angina
hospmi
mi_fchd
anychd
stroke
cvd
hyperten
timeap
timemi
timemifc
timechd
timestrk
timecvd
timedth
timehyp
start.date
stroke.date
death.date
Currently defined variables
Usinf the drop down menu Transform - Recode into new variable. SPSS Syntax:
* Create age group.
RECODE age (30 thru 39=0)
(40 thru 49=1)
(50 thru 59=2)
(60 thru 69=3)
(70 thru 108=4)
INTO age_group.
VALUE LABELS age_group 0'30-39'
1'40-49'
2'50-59'
3'60-69'
4'70+'.
* Min and max age for each age group.
MEANS TABLES = age BY age_group
/CELLS=COUNT MEAN MIN MAX.
EXECUTE.
Report
Age (yrs)
| age_group | N | Mean | Minimum | Maximum |
|---|---|---|---|---|
| 30-39 | 566 | 37.45 | 32 | 39. |
| 40-49 | 3097 | 45.10 | 40 | 49. |
| 50-59 | 4193 | 54.33 | 50 | 59. |
| 60-69 | 2950 | 63.92 | 60 | 69. |
| 70+ | 821 | 72.88 | 70 | 81. |
| Total | 11627 | 54.79 | 32 | 81. |
The total cholesterol and serum glucose are in mg/dL (~ 18mg/dL = 1 mmo/L). SPSS Syntax to create new variables and label values or various yes/no responses.
* Value labels sex and stroke.
VALUE LABELS
stroke 0'No' 1'Yes'.
EXECUTE.
VALUE LABELS
sex 1'Men' 2'Women'.
EXECUTE.
VALUE LABELS
diabetes 0'No' 1'Yes'.
EXECUTE.
VALUE LABELS
prevhyp 0'No' 1'Yes'.
EXECUTE.
VALUE LABELS
cursmoke 0'No' 1'Yes'.
EXECUTE.
VALUE LABELS
educ 1'4th grade or less'
2'5th, 6th, or 7th grades'
3'grade school graduate no high school'
4'high school, did not graduate'.
EXECUTE.
* Change mg/dL to mmol/L.
COMPUTE glucoseN = NUMBER(glucose,F2).
COMPUTE glucose_mmols = glucoseN/18.
COMPUTE totchol_mmols = totchol/18.
COMPUTE timestrk_yrs = timestrk/365.25.
EXECUTE.
The Framingham heart study data represents multiple visits, and study participants who have experienced a stroke prior to the start of follow-up, or are being treated for hypertension. In our workshop we will be exploring the relationship between high blood pressure and subsequent stroke. Therefore, will will subet the overall data to only include visit 1 variables, and exclude study participants. with a history of stroke or treatment for high BP.
SELECT IF bpmeds = '0' & prevstrk = 0 & period = 1.
MEANS TABLES = age sysbp bmi glucose_mmols totchol_mmols timestrk_yrs BY sex
/CELLS=COUNT MEAN STDDEV MEDIAN.
EXECUTE.
| sex | age | sysbp. | bmi | glucose_mmols | totchol_mmols | timestrk_yrs | |
|---|---|---|---|---|---|---|---|
| Men | N | 1868 | 1868 | 1864. | 1750. | 1861. | 1868. |
| Mean | 49.61 | 130.920 | 26.1634 | 3.9047 | 12.9756 | 19.4852. | |
| Std. | 8.696 | 18.7522 | 3.39747 | 1.20978 | 2.36139 | 6.58543. | |
| Median | 49.00 | 128.000 | 26.0500 | 4.1667 | 12.8333 | 24.0000. | |
| Women | N | 2338 | 2338 | 2326. | 2072. | 2297. | 2338. |
| Mean | 49.62 | 132.263 | 25.4714. | 3.9847. | 13.2499 | 21.0158. | |
| Std. | 8.527 | 23.2019 | 4.46580 | 1.13656 | 2.53250 | 5.77186. | |
| Median | 49.00 | 127.500 | 24.7100 | 4.2222 | 13.1111 | 24.0000. | |
| Total | N | 4206. | 4206. | 4190. | 3822. | 4158. | 4206. |
| Mean | 49.62 | 131.666 | 25.7793 | 3.9481 | 13.1272 | 20.3360. | |
| Std. | 8.601 | 21.3485 | 4.03991 | 1.17118 | 2.46088 | 6.19263. | |
| Median | 49.00 | 128.000 | 25.4100 | 4.1667 | 12.9444. | 24.0000. |
Categorical (discrete) variables variables’ ## SPSS Syntax to for N (%).
* Cross tabulation and chisq test .
CROSSTABS TABLES = diabetes prevhyp cursmoke educ BY sex
/CELLS=COUNT ROW
/STATISTICS=CHISQ.
EXECUTE.
Crosstab
| Men | Women | Total | |||
|---|---|---|---|---|---|
| diabetes | No | Count | 1812 | 2287 | 4099 |
| . | % within diabetes | 44.2% | 55.8% | 100.0% | |
| Yes | Count | 56 | 51 | 107. | |
| . | % within diabetes | 52.3% | 47.7% | 100.0% | |
| Total. | Count | 1868 | 2338 | 4206. | |
| . | % within diabetes | 44.4% | 55.6% | 100.0% |
| Overall | Men | Women | p | test | |
|---|---|---|---|---|---|
| (N=4206) | (n=1868) | (n=2338) | |||
| Age (yrs), mean(SD) | 49.6 (8.60) | 49.6 (8.70) | 49.6 (8.53) | 0.965 | |
| Systolic BP (mmHg), mean(SD) | 131.7 (21.35) | 130.9 (18.75) | 132.3 (23.20) | 0.043 | |
| BMI (kg/\(m^2\)), mean (SD)) | 25.78 (4.04) | 26.16 (3.40) | 25.47 (4.47) | <0.001 | |
| glucose (mmol/L) mean (SD)) | 4.55 (1.30) | 4.58 (1.39) | 4.53 (1.23) | 0.297 | |
| total chol (mmol/L), mean (SD) | 13.13 (2.46) | 12.98 (2.36) | 13.25 (2.53) | <0.001 | |
| diabetes, n(%). | 107 ( 2.5) | 56 ( 3.0) | 51 ( 2.2) | 0.116 | |
| Hypertension, n(%) | 1243 (29.6) | 570 (30.5) | 673 (28.8) | 0.235 | |
| current smoker, n(%) | 2092 (49.7) | 1135 (60.8) | 957 (40.9) | <0.001 | |
| eduction level, n(%) | <0.001 | ||||
| 4th grade or less | 1714 (41.8) | 805 (44.4) | 909 (39.8) | ||
| 5th, 6th, or 7th grades | 1221 (29.8) | 497 (27.4) | 724 (31.7) | ||
| grade school graduate no high school | 681 (16.6) | 228 (12.6) | 453 (19.8) | ||
| high school, did not graduate | 480 (11.7) | 282 (15.6) | 198 ( 8.7) | ||
| Follow-up (yrs), mean (SD) | 20.34 (6.19) | 19.49 (6.59) | 21.02 (5.77) | <0.001 |
The question often asked is how many decimal places should be used when presenting the descriptive statistics. The suggested approaches are based on the precision of the data collected. For example, if age (in years) was collected as an integer, than the convention in to present the mean with no-more-than 1 decimal point. However, the standard deviation (due to a numer of steps in its calculation can be reported with 2 decimal points).
options(width = 1500)
# Framingham multiple visit data
# load packages
library(Hmisc)
library(haven)
library(Epi)
library(tableone)
library(epiDisplay)
library(epitools)
library(epiR)
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
## setwd("C:/Users/sfrost/OneDrive - University of Wollongong/Data/")
## setwd('Z:/OneDrive - University of Wollongong/Data/')
source('epiTables V3.R')##
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 482 1622 2104
## No 1110 3211 4321
## Total 1592 4833 6425
##
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
##
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
##
## Risk Difference = -2.78 (95% CI: -5, -0.56)
##
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 95 229 324
## No 283 400 683
## Total 378 629 1007
##
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
##
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
##
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
##
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------
# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
framingham$n <- 1
# check the names of variable.
names(framingham)## [1] "randid" "sex" "totchol" "age" "sysbp" "diabp" "cursmoke" "cigpday" "bmi" "diabetes" "bpmeds" "heartrte" "glucose" "educ" "prevchd" "prevap" "prevmi" "prevstrk" "prevhyp" "time" "period" "hdlc" "ldlc" "death" "angina" "hospmi" "mi_fchd" "anychd" "stroke" "cvd" "hyperten" "timeap" "timemi" "timemifc" "timechd" "timestrk" "timecvd" "timedth" "timehyp" "start.date" "stroke.date" "death.date" "n"
# Look at the first ten rows.
# head(framingham)
# Look at the last ten rows.
# tail(framingham)
# View help file for dataset
## help(framingham)
# Create age groups with labels
min(framingham$age)## [1] 32
## [1] 81
framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
labels = c('30-39','40-49',
'50-59','60-69',
'70+'))
# Check lower and upper values of groups
stat.table(age.group, contents = list(N = sum(n),
Mean = mean(age),
Min = min(age),
Max = max(age)),
margin = T,
data = framingham)## ---------------------------------------------
## age.group N Mean Min Max
## ---------------------------------------------
## 30-39 566.00 37.45 32.00 39.00
## 40-49 3097.00 45.10 40.00 49.00
## 50-59 4193.00 54.33 50.00 59.00
## 60-69 2950.00 63.92 60.00 69.00
## 70+ 821.00 72.88 70.00 81.00
##
## Total 11627.00 54.79 32.00 81.00
## ---------------------------------------------
## 30-39 40-49 50-59 60-69 70+
## 32 40 50 60 70
## 30-39 40-49 50-59 60-69 70+
## 39 49 59 69 81
# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18
# Label factors with numeric values
table(framingham$sex)##
## 1 2
## 5022 6605
framingham$sex <- factor(framingham$sex, level = c(1,2),
labels = c('Men','Women'))
table(framingham$sex)##
## Men Women
## 5022 6605
##
## 0 1
## 11097 530
framingham$diabetes <- factor(framingham$diabetes, level = c(0,1),
labels = c('No','Yes'))
table(framingham$diabetes)##
## No Yes
## 11097 530
# remove participants with a prior stroke or tx
# for BP and only use first visit data.
names(framingham)## [1] "randid" "sex" "totchol" "age" "sysbp" "diabp" "cursmoke" "cigpday" "bmi" "diabetes" "bpmeds" "heartrte" "glucose" "educ" "prevchd" "prevap" "prevmi" "prevstrk" "prevhyp" "time" "period" "hdlc" "ldlc" "death" "angina" "hospmi" "mi_fchd" "anychd" "stroke" "cvd" "hyperten" "timeap" "timemi" "timemifc" "timechd" "timestrk" "timecvd" "timedth" "timehyp" "start.date" "stroke.date" "death.date" "n" "age.group" "glucose.mmols" "totchol.mmols"
d <- subset(framingham, bpmeds == 0 &
prevstrk == 0 &
period == 1, select = c(randid,age,totchol.mmols,
diabp,glucose.mmols,bmi,
sex,sysbp,cursmoke,educ,
cigpday,diabetes,stroke,
timestrk,death,timedth,
prevhyp,age.group))
# add n = 1, for counting
d$n <- 1
# create follow-up in years
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
table(d$stroke.yes,d$stroke)##
## 0 1
## No 3863 0
## Yes 0 343
##
## 0 1
## No 2963 0
## Yes 0 1243
##
## No Yes
## No 1395 1568
## Yes 719 524
d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
'5th, 6th, or 7th grades',
'grade school graduate no high school',
'high school, did not graduate'))
# variables labels and unit
d <- upData(d, labels = c(age = 'Age',
totchol.mmols = 'Total cholesterol',
diabp = 'Diasolic BP',
glucose.mmols = 'BGL',
sex = 'Sex',
sysbp = 'Systolic BP',
cursmoke = 'Current smoker',
educ = 'Education level',
stroke = 'Stroke',
timestrk = 'Follow-up',
timestrk.yrs = 'Follow-up'),
units = c(age = 'yrs',
totchol.mmols = 'mmol/L',
glucose.mmols = 'mmol/L',
diabp = 'mmHg',
sysbp = 'mmHg',
timestrk = 'days',
timestrk.yrs = 'yrs'))## Input object size: 512048 bytes; 22 variables 4206 observations
## New object size: 501912 bytes; 22 variables 4206 observations
# Table 1.
table.1 <- print(summary(stroke.yes ~ sex +
age +
sysbp +
bmi +
glucose.mmols +
totchol.mmols +
diabetes +
HT.yes +
cursmoke +
educ +
timestrk.yrs,
method = 'reverse',
overall = F,
test = T,
data = d), prmsd = T,
digits = 3)##
##
## Descriptive Statistics by stroke.yes
##
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## | |N |No |Yes | Test |
## | | |(N=3863) |(N=343) |Statistic |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Sex : Women |4206| 56% (2162) | 51% ( 176) | Chi-square=2.76 d.f.=1 P=0.096|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Age [yrs] |4206|42.00/48.00/56.00 49.13+/- 8.50|50.00/57.00/61.00 55.07+/- 7.77| F=152 d.f.=1,4204 P<0.001 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Systolic BP [mmHg] |4206|116.0/127.0/141.0 130.6+/- 20.6|126.0/139.0/158.0 143.7+/- 25.9| F=99.4 d.f.=1,4204 P<0.001 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |bmi |4190|23.06/25.33/27.89 25.69+/- 3.98|23.65/26.48/28.86 26.79+/- 4.50| F=19.8 d.f.=1,4188 P<0.001 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |BGL [mmol/L] |3822| 3.94/4.33/4.83 4.54+/-1.29 | 4.00/4.39/4.94 4.70+/-1.49 | F=2.9 d.f.=1,3820 P=0.089 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Total cholesterol [mmol/L] |4158|11.39/12.94/14.56 13.10+/- 2.43|11.72/13.22/14.67 13.43+/- 2.78| F=3.77 d.f.=1,4156 P=0.052 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |diabetes : Yes |4206| 2% ( 84) | 7% ( 23) | Chi-square=26.1 d.f.=1 P<0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |HT.yes : Yes |4206| 27% (1057) | 54% ( 186) | Chi-square=109 d.f.=1 P<0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Current smoker : Yes |4206| 50% (1922) | 50% ( 170) | Chi-square=0 d.f.=1 P=0.946 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Education level : 4th grade or less |4096| 41% (1542) | 51% ( 172) | Chi-square=14.1 d.f.=3 P=0.003|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## | 5th, 6th, or 7th grades | | 30% (1135) | 26% ( 86) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## | grade school graduate no high school| | 17% ( 636) | 13% ( 45) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## | high school, did not graduate | | 12% ( 449) | 9% ( 31) | |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Follow-up [yrs] |4206|20.45/24.00/24.00 20.86+/- 5.90| 9.83/15.43/19.92 14.41+/- 6.37| F=667 d.f.=1,4204 P<0.001 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
##
## Welch Two Sample t-test
##
## data: age by sex
## t = -0.043424, df = 3967, p-value = 0.9654
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -0.5361589 0.5129231
## sample estimates:
## mean in group Men mean in group Women
## 49.61242 49.62404
##
## Welch Two Sample t-test
##
## data: sysbp by sex
## t = -2.0759, df = 4203.4, p-value = 0.03797
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -2.61121307 -0.07462175
## sample estimates:
## mean in group Men mean in group Women
## 130.9197 132.2626
##
## Welch Two Sample t-test
##
## data: bmi by sex
## t = 5.6944, df = 4177, p-value = 1.323e-08
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## 0.4537310 0.9302105
## sample estimates:
## mean in group Men mean in group Women
## 26.16341 25.47144
##
## Welch Two Sample t-test
##
## data: glucose.mmols by sex
## t = 1.0314, df = 3521.1, p-value = 0.3024
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -0.03977396 0.12807297
## sample estimates:
## mean in group Men mean in group Women
## 4.576190 4.532041
##
## Welch Two Sample t-test
##
## data: totchol.mmols by sex
## t = -3.6057, df = 4075, p-value = 0.000315
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -0.4234912 -0.1251669
## sample estimates:
## mean in group Men mean in group Women
## 12.97561 13.24994
##
## Welch Two Sample t-test
##
## data: timestrk.yrs by sex
## t = -7.9078, df = 3737.3, p-value = 3.421e-15
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -1.910125 -1.151139
## sample estimates:
## mean in group Men mean in group Women
## 19.48518 21.01582
# Checking the distribution of follow-up
# one figure two plots
par(mfrow = c(1,2))
hist(d$timestrk.yrs,
xlab = 'Follow-up (yrs)',
ylab = '')
mtext(side = 3, adj = 0, text = '(a) All participants')
hist(d$timestrk.yrs[d$stroke == 1],
xlab = 'Follow-up (yrs)',
ylab = '')
mtext(side = 3, adj = 0, text = '(b) Stroke participants')# Checking the distributions
# one figure two by two plots
par(mfrow = c(2,2))
hist(d$glucose.mmols,
xlab = 'BGL (mmol/L)',
ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(a) Serum glucose')
hist(d$totchol.mmols,
xlab = 'Total Chol (mmol/L)',
ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(b) Serum Cholesterol')
hist(log(d$glucose.mmols),
xlab = 'BGL (log)',
ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(c) ln(serum glucose)')
hist(log(d$totchol.mmols),
xlab = 'Total Chol (log)',
ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(d) ln(serum cholesterol)')d$HT.yes <- relevel(d$HT.yes, ref = 'No')
printTableOne <- print(CreateTableOne(data = d,
vars = c("age", "sysbp", "bmi","glucose.mmols",
"totchol.mmols","diabetes","HT.yes",
"cursmoke","educ","timestrk.yrs"),
factorVars = c("sex","diabetes","HT.yes",
"cursmoke","educ"),
strata = "stroke.yes",
addOverall = F), nonnormal = "timestrk.yrs",
varLabels = T,
missing = T)## Stratified by stroke.yes
## No Yes p test Missing
## n 3863 343
## Age (mean (SD)) 49.13 (8.50) 55.07 (7.77) <0.001 0.0
## Systolic BP (mean (SD)) 130.60 (20.56) 143.71 (25.92) <0.001 0.0
## bmi (mean (SD)) 25.69 (3.98) 26.79 (4.50) <0.001 0.4
## BGL (mean (SD)) 4.54 (1.29) 4.70 (1.49) 0.030 9.1
## Total cholesterol (mean (SD)) 13.10 (2.43) 13.43 (2.78) 0.016 1.1
## diabetes = Yes (%) 84 ( 2.2) 23 ( 6.7) <0.001 0.0
## HT.yes = Yes (%) 1057 (27.4) 186 (54.2) <0.001 0.0
## Current smoker = Yes (%) 1922 (49.8) 170 (49.6) 0.991 0.0
## Education level (%) 0.003 2.6
## 4th grade or less 1542 (41.0) 172 (51.5)
## 5th, 6th, or 7th grades 1135 (30.2) 86 (25.7)
## grade school graduate no high school 636 (16.9) 45 (13.5)
## high school, did not graduate 449 (11.9) 31 ( 9.3)
## Follow-up (median [IQR]) 24.00 [20.45, 24.00] 15.43 [9.83, 19.92] <0.001 nonnorm 0.0
kableone(CreateTableOne(data = d,
vars = c("age", "sysbp", "bmi","glucose.mmols",
"totchol.mmols","diabetes","HT.yes",
"cursmoke","educ","timestrk.yrs"),
factorVars = c("diabetes","HT.yes",
"cursmoke","educ"),
strata = "sex",
addOverall = T))| Overall | Men | Women | p | test | |
|---|---|---|---|---|---|
| n | 4206 | 1868 | 2338 | ||
| age (mean (SD)) | 49.62 (8.60) | 49.61 (8.70) | 49.62 (8.53) | 0.965 | |
| sysbp (mean (SD)) | 131.67 (21.35) | 130.92 (18.75) | 132.26 (23.20) | 0.043 | |
| bmi (mean (SD)) | 25.78 (4.04) | 26.16 (3.40) | 25.47 (4.47) | <0.001 | |
| glucose.mmols (mean (SD)) | 4.55 (1.30) | 4.58 (1.39) | 4.53 (1.23) | 0.297 | |
| totchol.mmols (mean (SD)) | 13.13 (2.46) | 12.98 (2.36) | 13.25 (2.53) | <0.001 | |
| diabetes = Yes (%) | 107 ( 2.5) | 56 ( 3.0) | 51 ( 2.2) | 0.116 | |
| HT.yes = Yes (%) | 1243 (29.6) | 570 (30.5) | 673 (28.8) | 0.235 | |
| cursmoke = Yes (%) | 2092 (49.7) | 1135 (60.8) | 957 (40.9) | <0.001 | |
| educ (%) | <0.001 | ||||
| 4th grade or less | 1714 (41.8) | 805 (44.4) | 909 (39.8) | ||
| 5th, 6th, or 7th grades | 1221 (29.8) | 497 (27.4) | 724 (31.7) | ||
| grade school graduate no high school | 681 (16.6) | 228 (12.6) | 453 (19.8) | ||
| high school, did not graduate | 480 (11.7) | 282 (15.6) | 198 ( 8.7) | ||
| timestrk.yrs (mean (SD)) | 20.34 (6.19) | 19.49 (6.59) | 21.02 (5.77) | <0.001 |
This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.
# SysBP and stroke
par(mfrow = c(1,2))
hist(d$sysbp, main = '',
xlab = 'Systolic BP (mmHg) \n First visit')
describe(d$sysbp)## d$sysbp : Systolic BP [mmHg]
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
## 4206 0 226 1 131.7 23.06 104.0 108.5 116.6 128.0 142.5 160.0 173.0
##
## lowest : 83.5 85.0 85.5 90.0 92.0, highest: 230.0 232.0 235.0 243.0 295.0
boxplot(sysbp ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'SBP (mmHg)',
las = 1)## 0 1
## 130.5967 143.7114
## 0 1
## 20.56035 25.91658
\(H_0\) Mean systolic BP among non-stroke \(=\) Mean systolic BP among stroke.
\(H_i\) Mean systolic BP among non-stroke \(\neq\) Mean systolic BP among stroke.
Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail)
# SHADE TAILS OF n-DISTRIBUTED PLOTS
par(mfrow = c(1,1))
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l",
axes = T,
xlab = '',
ylab = '')
lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey',
border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht),
col = 'grey', border = NA)
abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey')
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')
text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')## [1] 13.11468
##
## Welch Two Sample t-test
##
## data: sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1
## 130.5967 143.7114
# EXAMPLE OF T-TEST
set.seed(12345)
n <- 3000 # large number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.05),
lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
bty = n,
lty = 1,
lwd = 2,
col = c('blue','black'))##
## Welch Two Sample t-test
##
## data: a and b
## t = -3.4118, df = 5986.2, p-value = 0.0006496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3417023 -0.3625029
## sample estimates:
## mean of x mean of y
## 61.76762 62.61972
n <- 30 # small number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)
plot(density(a), lty = 1, col = 'blue',
frame = F,
main = '',
xlab = '',
ylab = '',
ylim = c(0,0.05),
lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)
legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
'CABG - mean age 62.7 yrs (SD 9.5)'),
bty = n,
lty = 1,
lwd = 2,
col = c('blue','black'))##
## Welch Two Sample t-test
##
## data: a and b
## t = -0.28393, df = 54.736, p-value = 0.7775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.455870 4.101878
## sample estimates:
## mean of x mean of y
## 61.79375 62.47075
NEJM 2000 ARDSNet lower tidal volume (6 mL/kg) versus traditional 12mL/kg. The primary outome death before discharge and breathing without assitance. Results were tx group 134/432 (31.0%) versus control 171/429 (39.8%).
## setwd('~/OneDrive - University of Wollongong/Data/')
## source('EpiTables V3.R')
epiTab(134,298,171,258)##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 134 298 432
## No 171 258 429
## Total 305 556 861
##
## Rate of outcome among intervention group (p1) = 0.3102 (95% CI: 26.64, 35.39)
##
## Rate of outcome among control group (p2) = 0.3986 (95% CI: 35.21, 44.51)
##
## Risk Difference = -8.84 (95% CI: -15.21, -2.47)
##
## Hazard Risk = 0.73 95% CI = 0.582 to 0.915
## Rate Ratio = 0.778 95% CI = 0.648 to 0.934
## Odds Ratio = 0.678 95% CI = 0.512 to 0.898
## chi2 p-value = 0.0067
## Fisher's exact p-value = 0.0069
## ------------------------------------------------------------------------
An example of potentially Type II error - NEJM 2018 EOLIA ECMO for ARDS primary outome 60-day mortality tx 44/124 (35%) control 57/125 (46%)
##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 44 80 124
## No 57 68 125
## Total 101 148 249
##
## Rate of outcome among intervention group (p1) = 0.3548 (95% CI: 26.98, 43.99)
##
## Rate of outcome among control group (p2) = 0.456 (95% CI: 36.78, 54.42)
##
## Risk Difference = -10.12 (95% CI: -22.31, 2.07)
##
## Hazard Risk = 0.72 95% CI = 0.486 to 1.067
## Rate Ratio = 0.778 95% CI = 0.574 to 1.055
## Odds Ratio = 0.656 95% CI = 0.394 to 1.091
## chi2 p-value = 0.1041
## Fisher's exact p-value = 0.1217
## ------------------------------------------------------------------------
# original sample size calculation was based on 40% vs 60%
# 60-day mortality
power.prop.test(p1 = 0.4, p2 = 0.6, sig.level = 0.05, power = 0.8)##
## Two-sample comparison of proportions power calculation
##
## n = 96.92364
## p1 = 0.4
## p2 = 0.6
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
p1 <- 0.6
q1 <- 1 - p1
p2 <- 0.4
q2 <- 1 - p2
p.bar <- (p1 + p2)/2
q.bar <- 1 - p.bar
# Sample size
n <- 2*((p.bar * q.bar) * (1.96 + 0.84)^2) / (p1 - p2)^2
n## [1] 98
##
## Two-sample comparison of proportions power calculation
##
## n = 378.5477
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
##
## Two-sample comparison of proportions power calculation
##
## n = 125
## p1 = 0.36
## p2 = 0.46
## sig.level = 0.05
## power = 0.3615169
## alternative = two.sided
##
## NOTE: n is number in *each* group
# Age and stroke
boxplot(age ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)## 0 1
## 49.13461 55.07289
## 0 1
## 8.504941 7.768804
# Age and Systolic BP
plot(d$age, d$sysbp, pch = 16,
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '')plot(d$age[d$stroke == 0], d$sysbp[d$stroke == 0], pch = 16,
col = 'blue',
xlab = 'Age (yrs)',
ylab = 'Systolic BP (mmHg)',
main = '',
ylim = c(80,300),
las = 1)
points(d$age[d$stroke == 1], d$sysbp[d$stroke == 1], pch = 16,
col = 'red')
abline(lsfit(d$age,d$sysbp), lty = 2, lwd = 2)# Cigs per day and stroke
boxplot(cigpday ~ stroke.yes, data = d,
col = c('blue','red'),
xlab = 'Stroke',
ylab = 'Age (years)',
las = 1)## 0 1
## 9.105085 9.170088
## 0 1
## 12.00131 12.06747
# table of rates of stroke based
# systolic HT
d$sysHT <- ifelse(d$sysbp > 140, 1, 0)
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
'Stroke (n)' = sum(stroke),
'Stroke (%)' = ratio(stroke,n,100),
'Stroke (rate)' = ratio(stroke,timestrk,365.25),
'follow-up (Total years)' = sum(timestrk.yrs)),
data = d,
margin = T),
digits = 3)## ---------------------------------------------------
## sysHT N Stroke Stroke Stroke follow-up
## (n) (%) (rate) (Total
## years)
## ---------------------------------------------------
## No 3035.000 178.000 5.865 0.003 64518.680
## Yes 1171.000 165.000 14.091 0.008 21014.623
##
## Total 4206.000 343.000 8.155 0.004 85533.303
## ---------------------------------------------------
## No Yes
## 64518.68 21014.62
# rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)## Strokes N
## No 178 3035
## Yes 165 1171
## events total probability se exact.lower95ci exact.upper95ci
## 1 178 3035 0.05864909 0.004265079 0.05054965 0.06761068
## 2 165 1171 0.14090521 0.010167300 0.12146029 0.16216781
# incidence based on person years
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)## Strokes pyrs
## No 178 64518.68
## Yes 165 21014.62
## events person.time incidence se exact.lower95ci exact.upper95ci
## 1 178 64518.68 0.002758891 0.0002067876 0.002367128 0.003197555
## 2 165 21014.62 0.007851676 0.0006112521 0.006697479 0.009147202
# risk of stroke
# Rate Ratio
## setwd('C:/Users/25105517/OneDrive - University of Wollongong/Data')
## source('epiTables V3.R')
epiTab(165,1171-165,178,3035-178)##
## ------------------------------------------------------------------------
## Trial Data:
## Outcome
## Intervention Yes No Total
## Yes 165 1006 1171
## No 178 2857 3035
## Total 343 3863 4206
##
## Rate of outcome among intervention group (p1) = 0.1409 (95% CI: 12.1, 16.09)
##
## Rate of outcome among control group (p2) = 0.0586 (95% CI: 5.03, 6.7)
##
## Risk Difference = 8.23 (95% CI: 6.06, 10.39)
##
## Hazard Risk = 2.513 95% CI = 2.033 to 3.106
## Rate Ratio = 2.403 95% CI = 1.966 to 2.937
## Odds Ratio = 2.633 95% CI = 2.105 to 3.293
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------
(sum(d$stroke[d$sysHT == 'Yes']) / sum(d$n[d$sysHT == 'Yes'])) /
(sum(d$stroke[d$sysHT == 'No']) / sum(d$n[d$sysHT == 'No']))## [1] 2.402513
Our practical sessions we will be using the R Statistical Language (https://www.r-project.org/) via the R Studio editor (https://www.rstudio.com/products/rstudio/download/#download). And, we suggest you load this onto your laptop prior to Week 1.
The R language is an object orientated statistical program, that can be used for all stages of the research process, including: