On “Introduction to Biostatistics for Clinical and Translational Researchers”

notes by Dan Connolly

Course Information

Introduction to Biostatistics for Clinical and Translational Researchers

from the KUMC calendar:

Presentation materials are distributed by email and on the biostats department educational resources page.

Part I: Basic Concepts

Role of Statistics

Statistics is … for analyzing information to
help people make decisions when faced with uncertainty

The figure on “The Role of Statistics” is nice: model, reality, sample, histogram/observation, with observation linked to model.

Hypothesis Testing

with links to Wikipedia on Statistical hypothesis testing:

Methods usually test the null.

back to Einstein: “… a single experiment can prove me wrong.”

The scientific method…

Example:

\( H_{1} \): mean BMI of a population is 26.3. Suppose the sample average is 26: does that support it? Pretty close. What about 27.5? What about 30?

This is a simple one-sample t-test.

The data is the evidence for/against your hypothesis.

If observation is unlikely under \( H_{0} \), then reject \( H_{0} \) in favor of \( H_{1} \).
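
A minimal sketch of the BMI example as a one-sample t-test. The samples are simulated: the sample size of 50, the standard deviation of 4, and the seed are assumptions, not course data. Only the hypothesized mean of 26.3 and the sample averages of 26 and 30 come from the example above.

```r
set.seed(2012)  # arbitrary seed so the sketch is reproducible

# Hypothetical samples: 50 BMI values each, sd of 4 (both assumed)
near <- rnorm(50, mean = 26, sd = 4)  # drawn around 26, close to 26.3
far  <- rnorm(50, mean = 30, sd = 4)  # drawn around 30, far from 26.3

# One-sample t-tests against the hypothesized mean of 26.3
p.near <- t.test(near, mu = 26.3)$p.value
p.far  <- t.test(far,  mu = 26.3)$p.value
c(near = p.near, far = p.far)
```

A small p-value says the observed sample mean would be unlikely if the population mean really were 26.3; with the sample drawn around 30 it should come out far below 0.05.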

courtroom analogy: history of 0.05: the British court system quantified conviction as a “1 in 20 chance.” Stats literature continues to use the idioms “convict” or “acquit.”

Classic example: Student's sleep data

from R help on t.test function and the sleep data set, i.e. “Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.”

str(datasets::sleep)
## 'data.frame':    20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## Classical example: Student's sleep data
plot(extra ~ group, data = sleep)

## Formula interface
t.test(extra ~ group, data = sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group 
## t = -1.861, df = 17.78, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -3.3655  0.2055 
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33 
## 
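
The data set description says the same 10 patients received both drugs, so a paired test may be the better fit here. A sketch, assuming the rows of each group are in the same patient order (the ID column suggests they are):

```r
# Extra sleep per patient under each drug, in ID order
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]

welch  <- t.test(x1, x2)                 # treats the groups as independent
paired <- t.test(x1, x2, paired = TRUE)  # tests the per-patient differences

c(welch = welch$p.value, paired = paired$p.value)
```

Pairing removes the patient-to-patient variation, so the same difference in means gives a smaller p-value than the Welch test.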

Vocabulary

of hypothesis testing

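Note: at the same \( \alpha \), 2-sided testing is less powerful than 1-sided testing in the hypothesized direction; for a symmetric test statistic the 2-sided p-value is twice the 1-sided one.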
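
A quick way to compare the power of 1-sided and 2-sided tests is power.t.test from the stats package. The sample size, effect size, and standard deviation below are made-up values for illustration only.

```r
# Two-sample t-test, 20 per group, true difference 1, sd 2 (all assumed)
one.sided <- power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05,
                          alternative = "one.sided")$power
two.sided <- power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05,
                          alternative = "two.sided")$power

c(one.sided = one.sided, two.sided = two.sided)
```

Under these assumptions the 2-sided test has lower power than the 1-sided test, but more than half of it.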

R note: see also: Interactive depiction of type I and type II error and power

pub note on markdown vs latex: I suspect latex would let me define a \power macro instead of typing 1-\beta over and over.

Statistical Power

increases with effect size, sample size.
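
A small sketch of both effects using power.t.test. The helper name pow and all the numbers are assumptions chosen for illustration.

```r
# Power of a two-sample, two-sided t-test at alpha = 0.05 with sd = 1
pow <- function(n, delta) {
    power.t.test(n = n, delta = delta, sd = 1, sig.level = 0.05)$power
}

# Fixed effect size of 0.5, growing sample size per group
by.n <- sapply(c(10, 20, 40, 80), function(n) pow(n = n, delta = 0.5))

# Fixed sample size of 20 per group, growing effect size
by.delta <- sapply(c(0.2, 0.5, 0.8), function(d) pow(n = 20, delta = d))

round(by.n, 2)
round(by.delta, 2)
```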

influences:

Part II: Descriptive Statistics

box plot, histogram, mean, median…

Field of Statistics

Types of Data

Ratio Data

6 != 7
## [1] TRUE

6 < 7
## [1] TRUE

6 - 7 == 101 - 102
## [1] TRUE

10/5
## [1] 2

e.g. person A (\( t = 10 \) minutes) took twice as long as person B (\( t = 5 \) minutes)
examples: age, birth weight, follow-up time, time to complete a task, dose, survival time

Interval Data

examples: temperature (in °C or °F), dates

t1 <- as.POSIXct("2012-06-28")
t2 <- as.POSIXct("2012-06-30")
t2 - t1
## Time difference of 2 days
t2/t1
## Error: '/' not defined for "POSIXt" objects
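
Temperature shows the same thing with plain numbers: differences are meaningful, ratios are not, because the zero point is arbitrary. A sketch with two made-up readings:

```r
celsius <- c(10, 20)        # hypothetical readings
kelvin  <- celsius + 273.15 # same temperatures, different zero point

# Differences agree across scales
diff(celsius)
diff(kelvin)

# Ratios do not: "twice as hot" in Celsius is not twice in Kelvin
ratio.c <- celsius[2] / celsius[1]
ratio.k <- kelvin[2] / kelvin[1]
c(celsius = ratio.c, kelvin = ratio.k)
```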

Ordinal

e.g. sports rankings: top 5 basketball teams

team <- factor(c("UK", "KU", "MU", "UT"), ordered = T)
team
## [1] UK KU MU UT
## Levels: KU < MU < UK < UT

team[1] - team[2]
## Warning: '-' is not meaningful for ordered factors
## [1] NA

team[1] > team[2]
## [1] TRUE

examples: Likert scale, highest level of education
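
Without a levels argument, factor() sorts the levels alphabetically, which is why the team example above shows KU < MU < UK < UT rather than a ranking. A sketch with explicit levels; the Likert responses are made up:

```r
# Levels listed from lowest to highest define the order
likert <- factor(c("agree", "neutral", "disagree", "agree"),
                 levels = c("disagree", "neutral", "agree"),
                 ordered = TRUE)

likert[1] > likert[2]  # "agree" ranks above "neutral"
max(likert)            # highest response observed
table(likert)          # counts are reported in level order
```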

Nominal Data

no order.

fruit <- factor(c("apple", "banana", "apple"))

fruit[1]
## [1] apple
## Levels: apple banana

fruit[1] < fruit[2]
## Warning: < not meaningful for factors
## [1] NA

typical stats

nominal: mode
ordinal: median
interval: mean, standard deviation
ratio: geometric mean, coefficient of variation
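
Base R has no built-in geometric mean or coefficient of variation, so here is a short sketch of both on made-up ratio data. Both need a true zero, which is why they are listed under ratio data.

```r
x <- c(1, 2, 4, 8)  # hypothetical positive measurements

# Geometric mean: exponentiate the mean of the logs
geo.mean <- exp(mean(log(x)))

# Coefficient of variation: spread relative to the mean
cv <- sd(x) / mean(x)

c(geometric.mean = geo.mean, arithmetic.mean = mean(x), cv = cv)
```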

table(fruit)
## fruit
##  apple banana 
##      2      1 
# mode: the most frequent value (returns only the first of any tied values)
Mo <- function(x) {
    sort(table(x), decreasing = TRUE)[1]
}
Mo(fruit)
## apple 
##     2 
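# S3 method for ordered factors: returns the lower of the two middle values
median.ordered <- function(x) {
    x.ordered <- sort(x)  # sort() drops NAs
    # ceiling() also picks the middle element when the length is odd
    x.ordered[ceiling(length(x.ordered)/2)]
}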
median.ordered(team)
## [1] MU
## Levels: KU < MU < UK < UT
median(team)
## [1] MU
## Levels: KU < MU < UK < UT
table(datasets::state.division)
## 
##        New England    Middle Atlantic     South Atlantic 
##                  6                  3                  8 
## East South Central West South Central East North Central 
##                  4                  4                  5 
## West North Central           Mountain            Pacific 
##                  7                  8                  5 
Mo(datasets::state.division)
## South Atlantic 
##              8 

oops… mode seems ill-defined.
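
The problem is ties: South Atlantic and Mountain both have 8 states, and Mo reports only one of them. A sketch of a variant that returns every tied value; the name modes is my own, not a base R function:

```r
# All values that reach the maximum count
modes <- function(x) {
    counts <- table(x)
    names(counts)[counts == max(counts)]
}

modes(datasets::state.division)
modes(c("apple", "banana", "apple"))
```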

Using Graphs to Describe Data

# codes 1 and 0 labelled as male and female
gender <- factor(c(1, 0), levels = c(1, 0), labels = c('male', 'female'))

frequencies, percentage, proportions
bar charts, pie charts
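
A sketch of the three summaries and a bar chart for a nominal variable. The five observations are made up.

```r
gender <- factor(c("male", "female", "female", "male", "female"))

counts <- table(gender)       # frequencies
props  <- prop.table(counts)  # proportions, summing to 1
pct    <- 100 * props         # percentages

rbind(counts, props, pct)

barplot(counts, ylab = "Frequency")
```

pie(counts) draws the pie chart from the same table.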

Example: Myopia

bar chart… full light, night light, darkness

does not prove causation: we didn't randomize. possible confounding factors: genetics, parental behavior, …

Example: Nausea

bar graphs showing N, percent.

with unequal sample sizes, use proportion as the axis
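
A sketch of why the axis matters. The counts and group sizes are invented for illustration and are not the numbers from the course example.

```r
# Hypothetical: patients reporting nausea, and total patients per group
nausea <- c(placebo = 20, drug = 30)
n      <- c(placebo = 40, drug = 200)

prop <- nausea / n
rbind(nausea, n, prop)

# Raw counts make the drug group look worse; proportions show the opposite
barplot(prop, ylim = c(0, 1), ylab = "Proportion with nausea")
```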

Graphics and Continuous data: time-to-death

# log-normal draws are positive and right-skewed, like survival times
ttd <- rlnorm(50, meanlog = log(1000))
hist(ttd)

Example: Weight (box plot)

boxplots show quartiles

weight <- rnorm(100)
hist(weight)
boxplot(weight, horizontal = T, add = T)

how to overlay those?
where does R put the whiskers of a boxplot?
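
A sketch that addresses both questions. By default, boxplot puts each whisker at the most extreme data point that lies within 1.5 times the box length of the box; boxplot.stats returns those positions without drawing anything. The overlay places the boxplot near the top of the histogram with at and boxwex; the 0.9 and 0.1 scaling factors are arbitrary choices of mine.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
weight <- rnorm(100)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(weight, coef = 1.5)
box.length <- s$stats[4] - s$stats[2]
fence.lo <- s$stats[2] - 1.5 * box.length
fence.hi <- s$stats[4] + 1.5 * box.length

# Whiskers end at the most extreme points still inside the fences
whisker.lo <- min(weight[weight >= fence.lo])
whisker.hi <- max(weight[weight <= fence.hi])

# Overlay: histogram first, then a horizontal boxplot near the top
h <- hist(weight)
boxplot(weight, horizontal = TRUE, add = TRUE, axes = FALSE,
        at = 0.9 * max(h$counts), boxwex = 0.1 * max(h$counts))
```

Points beyond the fences are drawn individually as outliers.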

Central Tendency

relative frequency ~= percentage

central tendency, variability not meaningful for qualitative/discrete

mean(team)
## Warning: argument is not numeric or logical: returning NA
## [1] NA

median not affected by outliers, tails

x <- c(1, 2, 2, 4)
median(x)
## [1] 2
y <- c(1, 2, 2, 9)
median(x) == median(y)
## [1] TRUE

mean is affected by shape of distribution. takes account of magnitude of every observation
good sampling stability (varies the least among samples)

mean(x)
## [1] 2.25
mean(x) == mean(y)
## [1] FALSE

mean follows tail in case of skewness.
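
Both claims about the mean can be checked by simulation. A sketch, assuming standard normal samples of size 25 for the stability check and a log-normal sample for the skewness check; the seed and sizes are arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Sampling stability: 2000 samples of 25, one per column
samples <- replicate(2000, rnorm(25))
means   <- apply(samples, 2, mean)
medians <- apply(samples, 2, median)
c(sd.of.means = sd(means), sd.of.medians = sd(medians))

# Skewness: in a right-skewed sample the mean sits above the median
skewed <- rlnorm(10000)
c(mean = mean(skewed), median = median(skewed))
```

For normal data the sample means vary less from sample to sample than the sample medians do.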

Variability

x <- c(1, 2, 3, 4, 10)
y <- c(1, 2, 3, 4, 100)

mean(x)
## [1] 4
mean(y)
## [1] 22
range(x)
## [1]  1 10
sd(x)
## [1] 3.536

variance \( s^{2} \)
\( s \) is in the same units as the data.
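
For reference, the sample variance is \( s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} \). A sketch computing it by hand for the x above and checking it against var and sd:

```r
x <- c(1, 2, 3, 4, 10)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - mean(x))^2) / (n - 1)
s  <- sqrt(s2)  # back in the units of the data

c(by.hand = s2, builtin = var(x))
c(by.hand = s,  builtin = sd(x))
```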

Shape

normal

TODO: get a nice normal curve

hist(rnorm(1000))
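
One possible answer to the TODO: draw the histogram on the density scale and add the theoretical curve with curve and dnorm. The seed, title, and plotting range are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
z <- rnorm(1000)

# freq = FALSE puts the bars on the density scale, matching dnorm()
hist(z, freq = FALSE, main = "Standard normal", xlab = "z")

# Overlay the standard normal density
curve(dnorm(x), from = -4, to = 4, add = TRUE, lwd = 2)
```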

right skewed

bimodal (what's the precise definition of mode?)

Appendix: R Markdown boilerplate

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).

When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist    
##  Min.   : 4.0   Min.   :  2  
##  1st Qu.:12.0   1st Qu.: 26  
##  Median :15.0   Median : 36  
##  Mean   :15.4   Mean   : 43  
##  3rd Qu.:19.0   3rd Qu.: 56  
##  Max.   :25.0   Max.   :120  

You can also embed plots, for example:

plot(cars)
