Introduction

An introduction to descriptive statistics using SPSS and R.

This workshop with give an introduction to creating Table 1 of a research report using SPSS and R. We will use the teaching version of the Framingham Heart Study Data.

Load data into SPSS.

Opening the File menu and selecting Open - Data (find your location on the framingham.sav file). The SPSS Syntax for this:

* Set a working directory.
CD "/Users/stevefrost/Library/CloudStorage/OneDrive-UniversityofWollongong/Data/".
* CD "Z:\OneDrive - University of Wollongong\Data\".
* Load data.
GET FILE "framingham.sav".

* name dataset
DATASET NAME Framingham.

* List variable names (columns)
DISPLAY NAMES.

EXECUTE.

SPSS Output:

Variable Names
randid
sex
totchol
age
sysbp
diabp
cursmoke
cigpday
bmi
diabetes
bpmeds
heartrte
glucose
educ
prevchd
prevap
prevmi
prevstrk
prevhyp
time
period
hdlc
ldlc
death
angina
hospmi
mi_fchd
anychd
stroke
cvd
hyperten
timeap
timemi
timemifc
timechd
timestrk
timecvd
timedth
timehyp
start.date
stroke.date
death.date
Currently defined variables

Creating new variables in SPSS.

Usinf the drop down menu Transform - Recode into new variable. SPSS Syntax:

* Create age group.
RECODE age (30 thru 39=0) 
                    (40 thru 49=1) 
                    (50 thru 59=2) 
                    (60 thru 69=3) 
                    (70 thru 108=4) 
                    INTO age_group.

VALUE LABELS age_group 0'30-39' 
                                        1'40-49' 
                                        2'50-59' 
                                        3'60-69' 
                                        4'70+'.

* Min and max age for each age group.
MEANS TABLES = age BY age_group
  /CELLS=COUNT MEAN MIN MAX.

EXECUTE.

SPSS Output

Report
Age (yrs)

age_group	N	Mean	Minimum	Maximum
30-39	566	37.45	32	39.
40-49	3097	45.10	40	49.
50-59	4193	54.33	50	59.
60-69	2950	63.92	60	69.
70+	821	72.88	70	81.
Total	11627	54.79	32	81.

Recoding and labelling data.

The total cholesterol and serum glucose are in mg/dL (~ 18mg/dL = 1 mmo/L). SPSS Syntax to create new variables and label values or various yes/no responses.

SPSS Code.

* Value labels sex and stroke.
VALUE LABELS 
           stroke 0'No' 1'Yes'.
EXECUTE.

VALUE LABELS 
           sex 1'Men' 2'Women'.
EXECUTE.

VALUE LABELS 
           diabetes 0'No' 1'Yes'.
EXECUTE.

VALUE LABELS 
           prevhyp 0'No' 1'Yes'.
EXECUTE.

VALUE LABELS 
           cursmoke 0'No' 1'Yes'.
EXECUTE.

VALUE LABELS
            educ 1'4th grade or less'
                    2'5th, 6th, or 7th grades'
                    3'grade school graduate no high school'
                    4'high school, did not graduate'.
EXECUTE.

* Change mg/dL to mmol/L.
COMPUTE glucoseN = NUMBER(glucose,F2).
COMPUTE glucose_mmols = glucoseN/18.
COMPUTE totchol_mmols =  totchol/18.
COMPUTE timestrk_yrs = timestrk/365.25.
EXECUTE.

The Framingham heart study data represents multiple visits, and study participants who have experienced a stroke prior to the start of follow-up, or are being treated for hypertension. In our workshop we will be exploring the relationship between high blood pressure and subsequent stroke. Therefore, will will subet the overall data to only include visit 1 variables, and exclude study participants. with a history of stroke or treatment for high BP.

SPSS code to subset dataset.

SELECT IF  bpmeds = '0' & prevstrk = 0 & period = 1.

Descriptive statistics of continuous variables.

SPSS Syntax to for N Mean SD Median etc.

MEANS TABLES = age sysbp bmi glucose_mmols totchol_mmols timestrk_yrs BY sex
  /CELLS=COUNT MEAN STDDEV MEDIAN.
EXECUTE.

SPSS Output continuous variables:

sex		age	sysbp.	bmi	glucose_mmols	totchol_mmols	timestrk_yrs
Men	N	1868	1868	1864.	1750.	1861.	1868.
	Mean	49.61	130.920	26.1634	3.9047	12.9756	19.4852.
	Std.	8.696	18.7522	3.39747	1.20978	2.36139	6.58543.
	Median	49.00	128.000	26.0500	4.1667	12.8333	24.0000.
Women	N	2338	2338	2326.	2072.	2297.	2338.
	Mean	49.62	132.263	25.4714.	3.9847.	13.2499	21.0158.
	Std.	8.527	23.2019	4.46580	1.13656	2.53250	5.77186.
	Median	49.00	127.500	24.7100	4.2222	13.1111	24.0000.
Total	N	4206.	4206.	4190.	3822.	4158.	4206.
	Mean	49.62	131.666	25.7793	3.9481	13.1272	20.3360.
	Std.	8.601	21.3485	4.03991	1.17118	2.46088	6.19263.
	Median	49.00	128.000	25.4100	4.1667	12.9444.	24.0000.

Categorical (discrete) variables variables’ ## SPSS Syntax to for N (%).

* Cross tabulation and chisq test .
CROSSTABS TABLES = diabetes prevhyp cursmoke educ BY sex
                    /CELLS=COUNT ROW
                    /STATISTICS=CHISQ.
EXECUTE.

SPSS Output crosstabs diabetes by sex:

Crosstab

			Men	Women	Total
diabetes	No	Count	1812	2287	4099
	.	% within diabetes	44.2%	55.8%	100.0%
	Yes	Count	56	51	107.
	.	% within diabetes	52.3%	47.7%	100.0%
Total.		Count	1868	2338	4206.
	.	% within diabetes	44.4%	55.6%	100.0%

The resulting Table 1 describing the baseline characteristics among women and men:

	Overall	Men	Women	p
	(N=4206)	(n=1868)	(n=2338)
Age (yrs), mean(SD)	49.6 (8.60)	49.6 (8.70)	49.6 (8.53)	0.965
Systolic BP (mmHg), mean(SD)	131.7 (21.35)	130.9 (18.75)	132.3 (23.20)	0.043
BMI (kg/\(m^2\)), mean (SD))	25.78 (4.04)	26.16 (3.40)	25.47 (4.47)	<0.001
glucose (mmol/L) mean (SD))	4.55 (1.30)	4.58 (1.39)	4.53 (1.23)	0.297
total chol (mmol/L), mean (SD)	13.13 (2.46)	12.98 (2.36)	13.25 (2.53)	<0.001
diabetes, n(%).	107 ( 2.5)	56 ( 3.0)	51 ( 2.2)	0.116
Hypertension, n(%)	1243 (29.6)	570 (30.5)	673 (28.8)	0.235
current smoker, n(%)	2092 (49.7)	1135 (60.8)	957 (40.9)	<0.001
eduction level, n(%)				<0.001
4th grade or less	1714 (41.8)	805 (44.4)	909 (39.8)
5th, 6th, or 7th grades	1221 (29.8)	497 (27.4)	724 (31.7)
grade school graduate no high school	681 (16.6)	228 (12.6)	453 (19.8)
high school, did not graduate	480 (11.7)	282 (15.6)	198 ( 8.7)
Follow-up (yrs), mean (SD)	20.34 (6.19)	19.49 (6.59)	21.02 (5.77)	<0.001

The question often asked is how many decimal places should be used when presenting the descriptive statistics. The suggested approaches are based on the precision of the data collected. For example, if age (in years) was collected as an integer, than the convention in to present the mean with no-more-than 1 decimal point. However, the standard deviation (due to a numer of steps in its calculation can be reported with 2 decimal points).

The same analyses replicated in R:

options(width = 1500)
# Framingham multiple visit data
# load packages
library(Hmisc)
library(haven)
library(Epi)
library(tableone)
library(epiDisplay)
library(epitools)
library(epiR)
# set working directly
setwd('~/OneDrive - University of Wollongong/Data/')
## setwd("C:/Users/25105517/OneDrive - University of Wollongong/Data/")
## setwd("C:/Users/sfrost/OneDrive - University of Wollongong/Data/")
## setwd('Z:/OneDrive - University of Wollongong/Data/')
source('epiTables V3.R')

## 
## ------------------------------------------------------------------------------
## Epi Tables
## R epiTab Function
## Input values epiTab(a,b,c,d)
## a,b,c,d values from two-by-two table (rows exposure and columns outcome)
## -------------------------------------------------------------------------------
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention  Yes   No Total
##        Yes    482 1622  2104
##        No    1110 3211  4321
##        Total 1592 4833  6425
## 
## Rate of outcome among intervention group (p1) = 0.2291 (95% CI: 21.11, 24.7)
## 
## Rate of outcome among control group (p2) = 0.2569 (95% CI: 24.39, 26.99)
## 
## Risk Difference = -2.78 (95% CI: -5, -0.56)
## 
## Hazard Risk = 0.876 95% CI = 0.787 to 0.975
## Rate Ratio = 0.892 95% CI = 0.812 to 0.979
## Odds Ratio = 0.86 95% CI = 0.761 to 0.972
## chi2 p-value = 0.0154
## Fisher's exact p-value = 0.0163
## ------------------------------------------------------------------------
## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes    95 229   324
##        No    283 400   683
##        Total 378 629  1007
## 
## Rate of outcome among intervention group (p1) = 0.2932 (95% CI: 24.34, 34.3)
## 
## Rate of outcome among control group (p2) = 0.4143 (95% CI: 37.73, 45.14)
## 
## Risk Difference = -12.11 (95% CI: -18.3, -5.92)
## 
## Hazard Risk = 0.649 95% CI = 0.514 to 0.819
## Rate Ratio = 0.708 95% CI = 0.585 to 0.857
## Odds Ratio = 0.586 95% CI = 0.441 to 0.778
## chi2 p-value = 2e-04
## Fisher's exact p-value = 2e-04
## ------------------------------------------------------------------------

# read dataset into R, naming the data set dta.
framingham <- read.csv(file = 'framingham.csv', header = T, sep = ',')
framingham$n <- 1
# check the names of variable.
names(framingham)

##  [1] "randid"      "sex"         "totchol"     "age"         "sysbp"       "diabp"       "cursmoke"    "cigpday"     "bmi"         "diabetes"    "bpmeds"      "heartrte"    "glucose"     "educ"        "prevchd"     "prevap"      "prevmi"      "prevstrk"    "prevhyp"     "time"        "period"      "hdlc"        "ldlc"        "death"       "angina"      "hospmi"      "mi_fchd"     "anychd"      "stroke"      "cvd"         "hyperten"    "timeap"      "timemi"      "timemifc"    "timechd"     "timestrk"    "timecvd"     "timedth"     "timehyp"     "start.date"  "stroke.date" "death.date"  "n"

# Look at the first ten rows.
# head(framingham)

# Look at the last ten rows.
# tail(framingham)

# View help file for dataset
## help(framingham)

# Create age groups with labels
min(framingham$age)

## [1] 32

max(framingham$age)

## [1] 81

framingham$age.group <- cut(framingham$age, breaks = c(30,39,49,59,69,90),
                            labels = c('30-39','40-49',
                                       '50-59','60-69',
                                       '70+'))

# Check lower and upper values of groups
stat.table(age.group, contents = list(N = sum(n),
                                      Mean = mean(age),
                                      Min = min(age),
                                      Max = max(age)),
                                      margin = T,
                                      data = framingham)

##  --------------------------------------------- 
##  age.group          N    Mean     Min     Max  
##  --------------------------------------------- 
##  30-39         566.00   37.45   32.00   39.00  
##  40-49        3097.00   45.10   40.00   49.00  
##  50-59        4193.00   54.33   50.00   59.00  
##  60-69        2950.00   63.92   60.00   69.00  
##  70+           821.00   72.88   70.00   81.00  
##                                                
##  Total       11627.00   54.79   32.00   81.00  
##  ---------------------------------------------

tapply(framingham$age, framingham$age.group, min)

## 30-39 40-49 50-59 60-69   70+ 
##    32    40    50    60    70

tapply(framingham$age, framingham$age.group, max)

## 30-39 40-49 50-59 60-69   70+ 
##    39    49    59    69    81

# Convert md/dL to mmols (1 mg/dL = 18 mmols/L)
framingham$glucose.mmols <- framingham$glucose/18
framingham$totchol.mmols <- framingham$totchol/18

# Label factors with numeric values
table(framingham$sex)

## 
##    1    2 
## 5022 6605

framingham$sex <- factor(framingham$sex, level = c(1,2), 
                                         labels = c('Men','Women'))
table(framingham$sex)

## 
##   Men Women 
##  5022  6605

table(framingham$diabetes)

## 
##     0     1 
## 11097   530

framingham$diabetes <- factor(framingham$diabetes, level = c(0,1), 
                              labels = c('No','Yes'))
table(framingham$diabetes)

## 
##    No   Yes 
## 11097   530

# remove participants with a prior stroke or tx 
# for BP and only use first visit data.
names(framingham)

##  [1] "randid"        "sex"           "totchol"       "age"           "sysbp"         "diabp"         "cursmoke"      "cigpday"       "bmi"           "diabetes"      "bpmeds"        "heartrte"      "glucose"       "educ"          "prevchd"       "prevap"        "prevmi"        "prevstrk"      "prevhyp"       "time"          "period"        "hdlc"          "ldlc"          "death"         "angina"        "hospmi"        "mi_fchd"       "anychd"        "stroke"        "cvd"           "hyperten"      "timeap"        "timemi"        "timemifc"      "timechd"       "timestrk"      "timecvd"       "timedth"       "timehyp"       "start.date"    "stroke.date"   "death.date"    "n"             "age.group"     "glucose.mmols" "totchol.mmols"

d <- subset(framingham,  bpmeds == 0 &
              prevstrk == 0 &
              period == 1, select = c(randid,age,totchol.mmols,
                                      diabp,glucose.mmols,bmi,
                                      sex,sysbp,cursmoke,educ,
                                      cigpday,diabetes,stroke,
                                      timestrk,death,timedth,
                                      prevhyp,age.group))
# add n = 1, for counting
d$n <- 1
# create follow-up in years
d$timestrk.yrs <- d$timestrk/365.25
d$stroke.yes <- factor(d$stroke, levels= 0:1, labels = c('No','Yes'))
table(d$stroke.yes,d$stroke)

##      
##          0    1
##   No  3863    0
##   Yes    0  343

d$HT.yes <- factor(d$prevhyp, levels= 0:1, labels = c('No','Yes'))
table(d$HT.yes,d$prevhyp)

##      
##          0    1
##   No  2963    0
##   Yes    0 1243

d$cursmoke <- factor(d$cursmoke, levels= 0:1, labels = c('No','Yes'))
table(d$HT.yes,d$cursmoke)

##      
##         No  Yes
##   No  1395 1568
##   Yes  719  524

d$educ <- factor(d$educ, levels = 1:4, labels = c('4th grade or less',
                                                  '5th, 6th, or 7th grades',
                                                  'grade school graduate no high school',
                                                  'high school, did not graduate'))
# variables labels and unit
d <- upData(d, labels = c(age = 'Age',
                          totchol.mmols = 'Total cholesterol',
                          diabp = 'Diasolic BP',
                          glucose.mmols = 'BGL',
                          sex = 'Sex',
                          sysbp = 'Systolic BP',
                          cursmoke = 'Current smoker',
                          educ = 'Education level',
                          stroke = 'Stroke',
                          timestrk = 'Follow-up',
                          timestrk.yrs = 'Follow-up'),
               units = c(age = 'yrs',
                      totchol.mmols = 'mmol/L',
                      glucose.mmols = 'mmol/L',
                      diabp = 'mmHg',
                      sysbp = 'mmHg',        
                      timestrk = 'days',
                      timestrk.yrs = 'yrs'))

## Input object size:    512048 bytes;   22 variables    4206 observations
## New object size: 501912 bytes;   22 variables    4206 observations

# Table 1.
table.1 <- print(summary(stroke.yes ~ sex +
                               age +
                               sysbp +
                               bmi +
                               glucose.mmols +
                               totchol.mmols +
                               diabetes +
                               HT.yes +
                               cursmoke +
                               educ +
                               timestrk.yrs,
                         method = 'reverse',
                         overall = F,
                         test = T,
                         data = d), prmsd = T, 
                         digits = 3)

## 
## 
## Descriptive Statistics by stroke.yes
## 
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |                                        |N   |No                              |Yes                             |  Test                         |
## |                                        |    |(N=3863)                        |(N=343)                         |Statistic                      |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Sex : Women                             |4206|           56%  (2162)          |           51%  ( 176)          | Chi-square=2.76 d.f.=1 P=0.096|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Age [yrs]                               |4206|42.00/48.00/56.00  49.13+/- 8.50|50.00/57.00/61.00  55.07+/- 7.77|    F=152 d.f.=1,4204 P<0.001  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Systolic BP [mmHg]                      |4206|116.0/127.0/141.0  130.6+/- 20.6|126.0/139.0/158.0  143.7+/- 25.9|   F=99.4 d.f.=1,4204 P<0.001  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |bmi                                     |4190|23.06/25.33/27.89  25.69+/- 3.98|23.65/26.48/28.86  26.79+/- 4.50|   F=19.8 d.f.=1,4188 P<0.001  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |BGL [mmol/L]                            |3822|   3.94/4.33/4.83  4.54+/-1.29  |   4.00/4.39/4.94  4.70+/-1.49  |    F=2.9 d.f.=1,3820 P=0.089  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Total cholesterol [mmol/L]              |4158|11.39/12.94/14.56  13.10+/- 2.43|11.72/13.22/14.67  13.43+/- 2.78|   F=3.77 d.f.=1,4156 P=0.052  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |diabetes : Yes                          |4206|            2%  (  84)          |            7%  (  23)          | Chi-square=26.1 d.f.=1 P<0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |HT.yes : Yes                            |4206|           27%  (1057)          |           54%  ( 186)          |  Chi-square=109 d.f.=1 P<0.001|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Current smoker : Yes                    |4206|           50%  (1922)          |           50%  ( 170)          |   Chi-square=0 d.f.=1 P=0.946 |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Education level : 4th grade or less     |4096|           41%  (1542)          |           51%  ( 172)          | Chi-square=14.1 d.f.=3 P=0.003|
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |    5th, 6th, or 7th grades             |    |           30%  (1135)          |           26%  (  86)          |                               |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |    grade school graduate no high school|    |           17%  ( 636)          |           13%  (  45)          |                               |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |    high school, did not graduate       |    |           12%  ( 449)          |            9%  (  31)          |                               |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+
## |Follow-up [yrs]                         |4206|20.45/24.00/24.00  20.86+/- 5.90| 9.83/15.43/19.92  14.41+/- 6.37|    F=667 d.f.=1,4204 P<0.001  |
## +----------------------------------------+----+--------------------------------+--------------------------------+-------------------------------+

write.csv(table.1, file = 'TableOne.csv', row.names = F)

t.test(age ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = -0.043424, df = 3967, p-value = 0.9654
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -0.5361589  0.5129231
## sample estimates:
##   mean in group Men mean in group Women 
##            49.61242            49.62404

t.test(sysbp ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  sysbp by sex
## t = -2.0759, df = 4203.4, p-value = 0.03797
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -2.61121307 -0.07462175
## sample estimates:
##   mean in group Men mean in group Women 
##            130.9197            132.2626

t.test(bmi ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  bmi by sex
## t = 5.6944, df = 4177, p-value = 1.323e-08
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  0.4537310 0.9302105
## sample estimates:
##   mean in group Men mean in group Women 
##            26.16341            25.47144

t.test(glucose.mmols ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  glucose.mmols by sex
## t = 1.0314, df = 3521.1, p-value = 0.3024
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -0.03977396  0.12807297
## sample estimates:
##   mean in group Men mean in group Women 
##            4.576190            4.532041

t.test(totchol.mmols ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  totchol.mmols by sex
## t = -3.6057, df = 4075, p-value = 0.000315
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -0.4234912 -0.1251669
## sample estimates:
##   mean in group Men mean in group Women 
##            12.97561            13.24994

t.test(timestrk.yrs ~ sex, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  timestrk.yrs by sex
## t = -7.9078, df = 3737.3, p-value = 3.421e-15
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -1.910125 -1.151139
## sample estimates:
##   mean in group Men mean in group Women 
##            19.48518            21.01582

# Checking the distribution of follow-up
# one figure two plots
par(mfrow = c(1,2))

hist(d$timestrk.yrs,
    xlab = 'Follow-up (yrs)',
    ylab = '')
mtext(side = 3, adj = 0, text = '(a) All participants')
hist(d$timestrk.yrs[d$stroke == 1],
    xlab = 'Follow-up (yrs)',
    ylab = '')
mtext(side = 3, adj = 0, text = '(b) Stroke participants')

# Checking the distributions
# one figure two by two plots
par(mfrow = c(2,2))
hist(d$glucose.mmols,
    xlab = 'BGL (mmol/L)',
    ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(a) Serum glucose')

hist(d$totchol.mmols,
    xlab = 'Total Chol (mmol/L)',
    ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(b) Serum Cholesterol')

hist(log(d$glucose.mmols),
    xlab = 'BGL (log)',
    ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(c) ln(serum glucose)')

hist(log(d$totchol.mmols),
    xlab = 'Total Chol (log)',
    ylab = '', main = '')
mtext(side = 3, adj = 0, text = '(d) ln(serum cholesterol)')

# Check distribution of all cont' variables.
par(mfrow = c(3,2))
hist(d[,c(2,3,5,6,8,14)])

d$HT.yes <- relevel(d$HT.yes, ref = 'No')
printTableOne <- print(CreateTableOne(data = d,
                         vars = c("age", "sysbp", "bmi","glucose.mmols",
                                  "totchol.mmols","diabetes","HT.yes",
                                  "cursmoke","educ","timestrk.yrs"),
                         factorVars = c("sex","diabetes","HT.yes",
                                         "cursmoke","educ"),
                         strata = "stroke.yes",
                         addOverall = F), nonnormal = "timestrk.yrs",
                                          varLabels = T,
                                          missing = T)

##                                          Stratified by stroke.yes
##                                           No                    Yes                  p      test    Missing
##   n                                         3863                   343                                     
##   Age (mean (SD))                          49.13 (8.50)          55.07 (7.77)        <0.001         0.0    
##   Systolic BP (mean (SD))                 130.60 (20.56)        143.71 (25.92)       <0.001         0.0    
##   bmi (mean (SD))                          25.69 (3.98)          26.79 (4.50)        <0.001         0.4    
##   BGL (mean (SD))                           4.54 (1.29)           4.70 (1.49)         0.030         9.1    
##   Total cholesterol (mean (SD))            13.10 (2.43)          13.43 (2.78)         0.016         1.1    
##   diabetes = Yes (%)                          84 ( 2.2)             23 ( 6.7)        <0.001         0.0    
##   HT.yes = Yes (%)                          1057 (27.4)            186 (54.2)        <0.001         0.0    
##   Current smoker = Yes (%)                  1922 (49.8)            170 (49.6)         0.991         0.0    
##   Education level (%)                                                                 0.003         2.6    
##      4th grade or less                      1542 (41.0)            172 (51.5)                              
##      5th, 6th, or 7th grades                1135 (30.2)             86 (25.7)                              
##      grade school graduate no high school    636 (16.9)             45 (13.5)                              
##      high school, did not graduate           449 (11.9)             31 ( 9.3)                              
##   Follow-up (median [IQR])                 24.00 [20.45, 24.00]  15.43 [9.83, 19.92] <0.001 nonnorm 0.0

kableone(CreateTableOne(data = d,
                        vars = c("age", "sysbp", "bmi","glucose.mmols",
                                 "totchol.mmols","diabetes","HT.yes",
                                 "cursmoke","educ","timestrk.yrs"),
                        factorVars = c("diabetes","HT.yes",
                                       "cursmoke","educ"),
                        strata = "sex",
                        addOverall = T))

	Overall	Men	Women	p
n	4206	1868	2338
age (mean (SD))	49.62 (8.60)	49.61 (8.70)	49.62 (8.53)	0.965
sysbp (mean (SD))	131.67 (21.35)	130.92 (18.75)	132.26 (23.20)	0.043
bmi (mean (SD))	25.78 (4.04)	26.16 (3.40)	25.47 (4.47)	<0.001
glucose.mmols (mean (SD))	4.55 (1.30)	4.58 (1.39)	4.53 (1.23)	0.297
totchol.mmols (mean (SD))	13.13 (2.46)	12.98 (2.36)	13.25 (2.53)	<0.001
diabetes = Yes (%)	107 ( 2.5)	56 ( 3.0)	51 ( 2.2)	0.116
HT.yes = Yes (%)	1243 (29.6)	570 (30.5)	673 (28.8)	0.235
cursmoke = Yes (%)	2092 (49.7)	1135 (60.8)	957 (40.9)	<0.001
educ (%)				<0.001
4th grade or less	1714 (41.8)	805 (44.4)	909 (39.8)
5th, 6th, or 7th grades	1221 (29.8)	497 (27.4)	724 (31.7)
grade school graduate no high school	681 (16.6)	228 (12.6)	453 (19.8)
high school, did not graduate	480 (11.7)	282 (15.6)	198 ( 8.7)
timestrk.yrs (mean (SD))	20.34 (6.19)	19.49 (6.59)	21.02 (5.77)	<0.001

# ------------------------- The End ---------------------------------

Inferential statistics: Comparing two or more groups, continuous and categorical outcomes.

This module will introduce the concepts of estimation and null-hypothesis statistical significance testing. In addition sample size and power calculations will be covered, and Type I and Type II error.

# SysBP and stroke
par(mfrow = c(1,2))
hist(d$sysbp, main = '',
              xlab = 'Systolic BP (mmHg) \n First visit')
describe(d$sysbp)

## d$sysbp : Systolic BP [mmHg] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##     4206        0      226        1    131.7    23.06    104.0    108.5    116.6    128.0    142.5    160.0    173.0 
## 
## lowest :  83.5  85.0  85.5  90.0  92.0, highest: 230.0 232.0 235.0 243.0 295.0

boxplot(sysbp ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'SBP (mmHg)',
        las = 1)

tapply(d$sysbp, d$stroke, mean)

##        0        1 
## 130.5967 143.7114

tapply(d$sysbp, d$stroke, sd)

##        0        1 
## 20.56035 25.91658

Null Hypothesis Significance Testing.

\(H_0\) Mean systolic BP among non-stroke \(=\) Mean systolic BP among stroke.

\(H_i\) Mean systolic BP among non-stroke \(\neq\) Mean systolic BP among stroke.

Then a critical value is set, say from a z-statistic (i.e 1.96, < 2.5% tail)

# SHADE TAILS OF n-DISTRIBUTED PLOTS
par(mfrow = c(1,1))
z <- pretty(c(-3,3), 100)
ht <- dnorm(z)
data <- data.frame(z = z, ht = ht)
zc <- 1.96
t <- subset(data, z > zc)
plot(data, type = "l", 
           axes = T,
           xlab = '',
           ylab = '')

lines(data)
polygon(c(rev(t$z), t$z),c(rep(0, nrow(t)), t$ht), col = 'grey', 
                                                      border = NA)
polygon(c(rev(t$z*-1), t$z*-1), c(rep(0, nrow(t)), t$ht), 
                                        col = 'grey', border = NA)
                                        

abline(v = 0, lty = 2)
abline(v = c(-1,1), lty = 2, col = 'grey') 
text(-0.75, 0.02, '-1 SD')
text(0.75, 0.02, '+1 SD')

text(-1.75, 0.02, '-1.96 SD')
text(1.75, 0.02, '+1.96 SD')

mean(d$sysbp[d$stroke == 1]) - mean(d$sysbp[d$stroke == 0])

## [1] 13.11468

t.test(sysbp ~ stroke, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  sysbp by stroke
## t = -9.1205, df = 381.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -15.94196 -10.28741
## sample estimates:
## mean in group 0 mean in group 1 
##        130.5967        143.7114

# EXAMPLE OF T-TEST
set.seed(12345)

n <- 3000 # large number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)


plot(density(a), lty = 1, col = 'blue',
                          frame = F,
                          main = '',
                          xlab = '',
                          ylab = '',
                          ylim = c(0,0.05),
                          lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)

legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
                   'CABG - mean age 62.7 yrs (SD 9.5)'),
                   bty = n,
                   lty = 1,
                   lwd = 2,
                   col = c('blue','black'))

t.test(a,b)

## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -3.4118, df = 5986.2, p-value = 0.0006496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3417023 -0.3625029
## sample estimates:
## mean of x mean of y 
##  61.76762  62.61972

n <- 30 # small number
a <- rnorm(n, mean = 61.8, sd = 10)
b <- rnorm(n, mean = 62.7, sd = 9.5)


plot(density(a), lty = 1, col = 'blue',
                          frame = F,
                          main = '',
                          xlab = '',
                          ylab = '',
                          ylim = c(0,0.05),
                          lwd = 2)
lines(density(b), lty = 1, col = 'black', lwd = 2)

legend(75,0.045, c('PCI - mean age 61.8 yrs (SD 10)',
                   'CABG - mean age 62.7 yrs (SD 9.5)'),
                   bty = n,
                   lty = 1,
                   lwd = 2,
                   col = c('blue','black'))

t.test(a,b)

## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -0.28393, df = 54.736, p-value = 0.7775
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.455870  4.101878
## sample estimates:
## mean of x mean of y 
##  61.79375  62.47075

Type I and Type II error

NEJM 2000 ARDSNet lower tidal volume (6 mL/kg) versus traditional 12mL/kg. The primary outome death before discharge and breathing without assitance. Results were tx group 134/432 (31.0%) versus control 171/429 (39.8%).

## setwd('~/OneDrive - University of Wollongong/Data/')
## source('EpiTables V3.R')

epiTab(134,298,171,258)

## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes   134 298   432
##        No    171 258   429
##        Total 305 556   861
## 
## Rate of outcome among intervention group (p1) = 0.3102 (95% CI: 26.64, 35.39)
## 
## Rate of outcome among control group (p2) = 0.3986 (95% CI: 35.21, 44.51)
## 
## Risk Difference = -8.84 (95% CI: -15.21, -2.47)
## 
## Hazard Risk = 0.73 95% CI = 0.582 to 0.915
## Rate Ratio = 0.778 95% CI = 0.648 to 0.934
## Odds Ratio = 0.678 95% CI = 0.512 to 0.898
## chi2 p-value = 0.0067
## Fisher's exact p-value = 0.0069
## ------------------------------------------------------------------------

An example of potentially Type II error - NEJM 2018 EOLIA ECMO for ARDS primary outome 60-day mortality tx 44/124 (35%) control 57/125 (46%)

epiTab(44,80,57,68)

## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes  No Total
##        Yes    44  80   124
##        No     57  68   125
##        Total 101 148   249
## 
## Rate of outcome among intervention group (p1) = 0.3548 (95% CI: 26.98, 43.99)
## 
## Rate of outcome among control group (p2) = 0.456 (95% CI: 36.78, 54.42)
## 
## Risk Difference = -10.12 (95% CI: -22.31, 2.07)
## 
## Hazard Risk = 0.72 95% CI = 0.486 to 1.067
## Rate Ratio = 0.778 95% CI = 0.574 to 1.055
## Odds Ratio = 0.656 95% CI = 0.394 to 1.091
## chi2 p-value = 0.1041
## Fisher's exact p-value = 0.1217
## ------------------------------------------------------------------------

# original sample size calculation was based on 40% vs 60%
# 60-day mortality
power.prop.test(p1 = 0.4, p2 = 0.6, sig.level = 0.05, power = 0.8)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 96.92364
##              p1 = 0.4
##              p2 = 0.6
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

p1 <- 0.6
q1 <- 1 - p1
p2 <- 0.4
q2 <- 1 - p2
p.bar <- (p1 + p2)/2
q.bar <- 1 - p.bar

# Sample size
n <- 2*((p.bar * q.bar) * (1.96 + 0.84)^2)  / (p1 - p2)^2
n

## [1] 98

power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, power = 0.8)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 378.5477
##              p1 = 0.36
##              p2 = 0.46
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

power.prop.test(p1 = 0.36, p2 = 0.46, sig.level = 0.05, n = 125) # power

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 125
##              p1 = 0.36
##              p2 = 0.46
##       sig.level = 0.05
##           power = 0.3615169
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

# Age and stroke
boxplot(age ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'Age (years)',
        las = 1)

tapply(d$age, d$stroke, mean)

##        0        1 
## 49.13461 55.07289

tapply(d$age, d$stroke, sd)

##        0        1 
## 8.504941 7.768804

# Age and Systolic BP
plot(d$age, d$sysbp, pch = 16,
                     xlab = 'Age (yrs)',
                     ylab = 'Systolic BP (mmHg)',
                     main = '')

plot(d$age[d$stroke == 0], d$sysbp[d$stroke == 0], pch = 16,
                     col = 'blue',
                     xlab = 'Age (yrs)',
                     ylab = 'Systolic BP (mmHg)',
                     main = '',
                     ylim = c(80,300),
                     las = 1)
points(d$age[d$stroke == 1], d$sysbp[d$stroke == 1], pch = 16,
                             col = 'red')
abline(lsfit(d$age,d$sysbp), lty = 2, lwd = 2)

# Cigs per day and stroke
boxplot(cigpday ~ stroke.yes, data = d,
        col = c('blue','red'),
        xlab = 'Stroke',
        ylab = 'Age (years)',
        las = 1)

tapply(d$cigpday, d$stroke, mean, na.rm = T)

##        0        1 
## 9.105085 9.170088

tapply(d$cigpday, d$stroke, sd, na.rm = T)

##        0        1 
## 12.00131 12.06747

# table of rates of stroke based 
# systolic HT
d$sysHT <- ifelse(d$sysbp > 140, 1, 0)
d$sysHT <- factor(d$sysHT, levels = 0:1, labels = c('No','Yes'))
print(stat.table(list(sysHT), contents = list(N = count(),
                                        'Stroke (n)' = sum(stroke),
                                        'Stroke (%)' = ratio(stroke,n,100),
                                        'Stroke (rate)' = ratio(stroke,timestrk,365.25),
                                        'follow-up (Total years)' = sum(timestrk.yrs)),
                                         data = d,
                                         margin = T),
                                         digits = 3)

##  --------------------------------------------------- 
##  sysHT          N  Stroke  Stroke  Stroke follow-up  
##                       (n)     (%)  (rate)    (Total  
##                                              years)  
##  --------------------------------------------------- 
##  No      3035.000 178.000   5.865   0.003 64518.680  
##  Yes     1171.000 165.000  14.091   0.008 21014.623  
##                                                      
##  Total   4206.000 343.000   8.155   0.004 85533.303  
##  ---------------------------------------------------

tapply(d$timestrk.yrs, d$sysHT, sum)

##       No      Yes 
## 64518.68 21014.62

# rate with 95% CI
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$n, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','N')
head(stroke.tab)

##     Strokes    N
## No      178 3035
## Yes     165 1171

ci.binomial(stroke.tab$Strokes, stroke.tab$N)

##   events total probability          se exact.lower95ci exact.upper95ci
## 1    178  3035  0.05864909 0.004265079      0.05054965      0.06761068
## 2    165  1171  0.14090521 0.010167300      0.12146029      0.16216781

# incidence based on person years
stroke.tab <- data.frame(cbind(tapply(d$stroke, d$sysHT, sum), tapply(d$timestrk.yrs, d$sysHT, sum)))
names(stroke.tab) <- c('Strokes','pyrs')
head(stroke.tab)

##     Strokes     pyrs
## No      178 64518.68
## Yes     165 21014.62

ci.poisson(stroke.tab$Strokes, stroke.tab$pyrs)

##   events person.time   incidence           se exact.lower95ci exact.upper95ci
## 1    178    64518.68 0.002758891 0.0002067876     0.002367128     0.003197555
## 2    165    21014.62 0.007851676 0.0006112521     0.006697479     0.009147202

# risk of stroke
# Rate Ratio
## setwd('C:/Users/25105517/OneDrive - University of Wollongong/Data')
## source('epiTables V3.R')
epiTab(165,1171-165,178,3035-178)

## 
## ------------------------------------------------------------------------
## Trial Data:
##             Outcome
## Intervention Yes   No Total
##        Yes   165 1006  1171
##        No    178 2857  3035
##        Total 343 3863  4206
## 
## Rate of outcome among intervention group (p1) = 0.1409 (95% CI: 12.1, 16.09)
## 
## Rate of outcome among control group (p2) = 0.0586 (95% CI: 5.03, 6.7)
## 
## Risk Difference = 8.23 (95% CI: 6.06, 10.39)
## 
## Hazard Risk = 2.513 95% CI = 2.033 to 3.106
## Rate Ratio = 2.403 95% CI = 1.966 to 2.937
## Odds Ratio = 2.633 95% CI = 2.105 to 3.293
## chi2 p-value = 0
## Fisher's exact p-value = 0
## ------------------------------------------------------------------------

(sum(d$stroke[d$sysHT == 'Yes']) / sum(d$n[d$sysHT == 'Yes'])) / 
(sum(d$stroke[d$sysHT == 'No']) / sum(d$n[d$sysHT == 'No']))

## [1] 2.402513

Our practical sessions we will be using the R Statistical Language (https://www.r-project.org/) via the R Studio editor (https://www.rstudio.com/products/rstudio/download/#download). And, we suggest you load this onto your laptop prior to Week 1.

The R language is an object orientated statistical program, that can be used for all stages of the research process, including:

The planning of a study, meta-analysis of current literature, power and sample size;
All data cleaning and management;
All analysis; and,
Most important, presentation of results for publication in manuscripts (including all tables and figures).

SNPG916 Descriptive Statistics - Workshop

Steve Frost PhD SoN, Faculty of Science, Medicine and Health.

Spring, 2023.