On “Introduction to Biostatistics for Clinical and Translational Researchers”

notes by Dan Connolly

Course Information

Introduction to Biostatistics for Clinical and Translational Researchers

from the KUMC calendar:

WHEN Friday, June 29, 2012, 8 – 10am
LOCATION 1025 Orr Major
PRESENTER(S) Jo Wick, PhD
SPONSOR DEPT Biostatistics
DETAILS This short course will be comprised of four lectures on selected topics in biostatistics relevant to clinical and translational research. No knowledge of statistics is assumed. All are welcome to attend. Topics include: Designing experiments, descriptive statistics, estimation, hypothesis testing, one and two-sample tests, ANOVA, linear regression and survival analysis.

Presentation materials are distributed by email and on the biostats department educational resources page.

Part I: Basic Concepts

Role of Statistics

Statistics is … for analyzing information to
help people make decisions when faced with uncertainty

The figure on “The Role of Statistics” is nice: model, reality, sample, histogram/observation, with observation linked to model.

Hypothesis Testing

with links to Wikipedia on Statistical hypothesis testing:

Null hypothesis \( H_{0} \)
Alternative hypothesis \( H_{1} \)
- usually a statement of differences or association
The p-value cutoff is traditionally 0.05: a level of evidence that allows me to reject the null hypothesis.

Methods usually test the null.

back to Einstein: “… a single experiment can prove me wrong.”

The scientific method…

Example:

\( H_{0} \): mean BMI of a population is 26.3. Suppose the sample average is 26. Does that support it? Pretty close. What about 27.5? What about 30?

This is a simple t test.

The data is the evidence for/against your hypothesis.

If observation is unlikely under \( H_{0} \), then reject \( H_{0} \) in favor of \( H_{1} \).
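
A minimal sketch of the BMI example as a one-sample t test. The `bmi` values are simulated purely for illustration (they are not from the lecture), and the seed, sample size, and standard deviation are arbitrary assumptions.

```r
# Hypothetical sample: 30 simulated BMI values (not real data)
set.seed(42)
bmi <- rnorm(30, mean = 27.5, sd = 4)

# Test H0: population mean BMI is 26.3
result <- t.test(bmi, mu = 26.3)
result$statistic   # distance of the sample mean from 26.3, in standard errors
result$p.value     # small values count as evidence against H0
```

The further the sample mean sits from 26.3 relative to its standard error, the smaller the p-value.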

Courtroom analogy. History of 0.05: the British court system quantified conviction as a “1 in 20 chance.” The stats literature continues to use the idioms “convict” and “acquit.”

Classic example: Student's sleep data

from R help on t.test function and the sleep data set, i.e. “Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.”

str(datasets::sleep)

## 'data.frame':    20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

## Classical example: Student's sleep data
plot(extra ~ group, data = sleep)

[Plot: box plots of extra by group for the sleep data]

## Formula interface
t.test(extra ~ group, data = sleep)

## 
##  Welch Two Sample t-test
## 
## data:  extra by group 
## t = -1.861, df = 17.78, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -3.3655  0.2055 
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33 
##
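
The sleep data record both drugs on the same 10 patients (note the `ID` column), so a paired test is arguably a better fit than the Welch test above. A sketch, using the vector interface of `t.test` rather than the formula interface:

```r
# Split the extra-sleep values by drug; rows are in patient ID order within each group
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]

# Paired t test: works on the 10 within-patient differences
paired <- t.test(drug1, drug2, paired = TRUE)
paired$p.value   # far smaller than the Welch p-value of 0.079
```

Pairing removes the patient-to-patient variation, which is why the same data give much stronger evidence here.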

Vocabulary

of hypothesis testing

Type I Error (\( \alpha \)): \( H_{0} \) is incorrectly rejected
- “An innocent man is proven guilty in a court of law.”
- Commonly accepted rate is \( \alpha \le 0.05 \).
Type II Error (\( \beta \)): failing to reject a false \( H_{0} \)
- “A guilty man is proven not guilty in a court of law.”
- commonly accepted rate is \( \beta \le 0.2 \)
- more critical in early phase research.
Power (\( 1 - \beta \)): correctly rejecting \( H_{0} \)
- “Justice has been served”
- rate \( 1-\beta \ge 0.8 \)
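
One way to see what \( \alpha = 0.05 \) means: simulate many experiments in which \( H_{0} \) is true and count how often a t test rejects anyway. A sketch; the seed, sample size, and number of repetitions are arbitrary choices.

```r
set.seed(2012)   # arbitrary seed, for reproducibility only

# 2000 simulated experiments where H0 (true mean = 0) actually holds
p.values <- replicate(2000, t.test(rnorm(20), mu = 0)$p.value)

# Fraction of experiments that wrongly reject H0: the Type I error rate
type1.rate <- mean(p.values < 0.05)
type1.rate   # should land close to 0.05
```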

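Note: at the same \( \alpha \), a 2-sided test is less powerful than a 1-sided test (it splits \( \alpha \) across both tails), but not literally half as powerful.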
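
A rough check of the one-sided versus two-sided comparison with `power.t.test` from the stats package. The sample size and effect size are arbitrary assumptions.

```r
# Same design, same alpha; only the alternative differs
two.sided <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05,
                          alternative = "two.sided")$power
one.sided <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05,
                          alternative = "one.sided")$power
c(two.sided = two.sided, one.sided = one.sided)
```

In this setting the one-sided test comes out more powerful, though the gap is much smaller than a factor of two.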

pub note on markdown vs latex: I suspect latex would let me define a \power macro instead of typing 1-\beta over and over.

Statistical Power

increases with effect size, sample size.
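
A sketch of both effects with `power.t.test`; the particular sample sizes and effect sizes are arbitrary.

```r
# Power of a two-sample t test as the per-group sample size grows (effect size fixed)
n <- c(10, 20, 40, 80)
power.by.n <- sapply(n, function(k) power.t.test(n = k, delta = 0.5)$power)

# Power as the effect size grows (sample size fixed)
delta <- c(0.25, 0.5, 1)
power.by.delta <- sapply(delta, function(d) power.t.test(n = 20, delta = d)$power)

round(power.by.n, 2)
round(power.by.delta, 2)
```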

influences:

dropouts
…
post-hoc hypotheses (“fishing”)

Part II: Descriptive Statistics

box plot, histogram, mean, median…

Field of Statistics

descriptive statistics
- Graphical
- Numerical
experimental design
inferential statistics

Types of Data

level of measurement
- discrete / qualitative
  - Nominal a la Haskell Eq
  - Ordinal a la Haskell Ord
- continuous / quantitative
  - Interval + equivalence of intervals, order
  - Ratio + absolute zero

Ratio Data

6 != 7

## [1] TRUE


6 < 7

## [1] TRUE


6 - 7 == 101 - 102

## [1] TRUE


10/5

## [1] 2

e.g. person A (\( t = 10 \) minutes) took twice as long as person B (\( t = 5 \) minutes)
examples: age, birth weight, follow-up time, time to complete a task, dose, survival time

Interval Data

temp: C (interval) vs K (ratio)

examples: temp, dates

t1 <- as.POSIXct("2012-06-28")
t2 <- as.POSIXct("2012-06-30")
t2 - t1

## Time difference of 2 days

t2/t1

## Error: '/' not defined for "POSIXt" objects
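
The Celsius versus Kelvin distinction can be sketched the same way: differences agree across the two scales, but ratios only mean something on the scale with an absolute zero.

```r
celsius <- c(10, 20)
kelvin <- celsius + 273.15

# Intervals are the same on both scales
all.equal(diff(celsius), diff(kelvin))

# Ratios are not: 20 C is not "twice as hot" as 10 C
celsius[2] / celsius[1]
kelvin[2] / kelvin[1]
```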

Ordinal

e.g. sports rankings: top 5 basketball teams

team <- factor(c("UK", "KU", "MU", "UT"), ordered = TRUE)
team

## [1] UK KU MU UT
## Levels: KU < MU < UK < UT


team[1] - team[2]

## Warning: '-' is not meaningful for ordered factors

## [1] NA


team[1] > team[2]

## [1] TRUE

examples: likert scale, highest level of education

Nominal Data

no order.

fruit <- factor(c("apple", "banana", "apple"))

fruit[1]

## [1] apple
## Levels: apple banana


fruit[1] < fruit[2]

## Warning: < not meaningful for factors

## [1] NA

typical stats

nominal: mode
ordinal: median
interval: mean, standard deviation
ratio: geometric mean, coefficient of variation
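
As far as I know, base R has no built-in geometric mean or coefficient of variation, so here is a minimal sketch for the ratio-level entries in the list above. The helper names `geo_mean` and `cv` are made up for these notes, and both assume strictly positive data.

```r
# Geometric mean: back-transformed mean of the logs (requires x > 0)
geo_mean <- function(x) exp(mean(log(x)))

# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

x <- c(1, 2, 4, 8)
geo_mean(x)
cv(x)
```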

table(fruit)

## fruit
##  apple banana 
##      2      1

# Most frequent value; on a tie, only one of the tied values is returned
Mo <- function(x) {
    sort(table(x), decreasing = TRUE)[1]
}
Mo(fruit)

## apple 
##     2

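# Lower median of an ordered factor. Because of its name, this also acts as
# the S3 method that median() dispatches to for class "ordered".
median.ordered <- function(x) {
    x.ordered <- sort(x)
    # ceiling() picks the middle element for odd lengths too
    x.ordered[ceiling(length(x)/2)]
}
median.ordered(team)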

## [1] MU
## Levels: KU < MU < UK < UT

median(team)

## [1] MU
## Levels: KU < MU < UK < UT

table(datasets::state.division)

## 
##        New England    Middle Atlantic     South Atlantic 
##                  6                  3                  8 
## East South Central West South Central East North Central 
##                  4                  4                  5 
## West North Central           Mountain            Pacific 
##                  7                  8                  5

Mo(datasets::state.division)

## South Atlantic 
##              8

oops… mode seems ill-defined.
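
The mode looks ill-defined here because of a tie: South Atlantic and Mountain both have 8 states, and `Mo` reports only one of them. One possible fix is to return every value that reaches the maximum count. `modes` is a name made up for this sketch.

```r
# All values whose count equals the maximum count
modes <- function(x) {
    counts <- table(x)
    names(counts)[counts == max(counts)]
}
modes(datasets::state.division)
modes(factor(c("apple", "banana", "apple")))
```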

Using Graphs to Describe Data

# Assumed coding: 1 = male, 0 = female
gender <- factor(c(1, 0), levels = c(1, 0), labels = c('male', 'female'))

frequencies, percentage, proportions
bar charts, pie charts

Example: Myopia

bar chart… full light, night light, darkness

Does not prove causation: we didn't randomize. Possible confounding factors: genetics, parental behavior, …

Example: Nausea

bar graphs showing N, percent.

with unequal sample sizes, use proportion as the axis
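
A sketch of why proportions are the fairer axis. The counts below are invented for illustration and are not the lecture's data.

```r
# Hypothetical counts: rows are treatment groups of unequal size
counts <- matrix(c(30, 70,
                   20, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 nausea = c("yes", "no")))

# Row-wise proportions put both groups on the same 0-1 scale
props <- prop.table(counts, margin = 1)
props
barplot(props[, "yes"], ylab = "proportion with nausea")
```

In these made-up numbers, group A has more nausea cases in absolute terms but a lower proportion, which a bar chart of raw counts would hide.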

Graphics and Continuous data: time-to-death

ttd <- log(rnorm(50) * 1000)

## Warning: NaNs produced

hist(ttd)

[Plot: histogram of ttd]
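
The NaN warning comes from `log()` being applied to the negative half of the `rnorm` draws. If the aim is just a positive, right-skewed stand-in for time-to-death, one option (an assumption about the intent, not the lecture's data) is to draw from a log-normal distribution instead:

```r
set.seed(1)   # arbitrary seed
# Log-normal draws are always positive and right skewed
ttd.sim <- rlnorm(50, meanlog = log(1000), sdlog = 1)
hist(ttd.sim)
```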

Example: Weight (box plot)

boxplots show quartiles

weight <- rnorm(100)
hist(weight)
boxplot(weight, horizontal = T, add = T)

[Plot: histogram of weight with a horizontal box plot drawn over it]

**how to overlay those?**
where does R put the whiskers of a boxplot?
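
A sketch addressing both questions. With the default `range = 1.5`, each whisker ends at the most extreme observation within 1.5 times the box length (hinge to hinge) of the box, and `boxplot.stats` reports those positions. For the overlay, the `at` and `boxwex` values below are just one guess at a readable placement.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
weight <- rnorm(100)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(weight)
hinges <- s$stats[c(2, 4)]
fence <- hinges + c(-1.5, 1.5) * diff(hinges)

# Each whisker ends at the most extreme observation inside the fences
inside <- weight[weight >= fence[1] & weight <= fence[2]]
range(inside)
s$stats[c(1, 5)]

# Overlay: place the horizontal box plot at mid-height of the histogram
h <- hist(weight)
boxplot(weight, horizontal = TRUE, add = TRUE, axes = FALSE,
        at = max(h$counts) / 2, boxwex = max(h$counts) / 4)
```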

Central Tendency

relative frequency ~= percentage

mean: arithmetic average
median: middle value of the sorted data
mode: most frequent value

central tendency, variability not meaningful for qualitative/discrete

mean(team)

## Warning: argument is not numeric or logical: returning NA

## [1] NA

median not affected by outliers, tails

x <- c(1, 2, 2, 4)
median(x)

## [1] 2

y <- c(1, 2, 2, 9)
median(x) == median(y)

## [1] TRUE

mean is affected by shape of distribution. takes account of magnitude of every observation
good sampling stability (varies the least among samples)

mean(x)

## [1] 2.25

mean(x) == mean(y)

## [1] FALSE

mean follows tail in case of skewness.
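
A small made-up example of the mean following the tail:

```r
# Right-skewed toy data: one large value in the upper tail
skewed <- c(1, 2, 2, 3, 3, 3, 20)
median(skewed)
mean(skewed)    # pulled above the median by the tail
```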

Variability

range
standard deviation

x <- c(1, 2, 3, 4, 10)
y <- c(1, 2, 3, 4, 100)

mean(x)

## [1] 4

mean(y)

## [1] 22

range(x)

## [1]  1 10

sd(x)

## [1] 3.536

variance \( s^{2} \)
\( s \) is in the same units as the data.
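
For reference, the sample variance and standard deviation computed by hand, reusing the `x` from this section. `var()` and `sd()` use the \( n - 1 \) denominator.

```r
x <- c(1, 2, 3, 4, 10)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - mean(x))^2) / (length(x) - 1)
s <- sqrt(s2)   # standard deviation, in the units of x

s2
s
```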

Shape

normal

TODO: get a nice normal curve

hist(rnorm(1000))

[Plot: histogram of 1000 standard normal draws]
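
For the TODO above, one way to get a smooth normal curve is to draw the density function on top of a density-scaled histogram. A sketch:

```r
set.seed(1)   # arbitrary seed
z <- rnorm(1000)

# Histogram on the density scale, with the standard normal density drawn on top
hist(z, freq = FALSE)
normal.curve <- curve(dnorm(x), from = -4, to = 4, add = TRUE)
```

Dropping the `hist` call and `add = TRUE` gives the bare curve on its own.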

right skewed

bimodal

What's the precise definition of mode?

Appendix: R Markdown boilerplate

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).

When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist    
##  Min.   : 4.0   Min.   :  2  
##  1st Qu.:12.0   1st Qu.: 26  
##  Median :15.0   Median : 36  
##  Mean   :15.4   Mean   : 43  
##  3rd Qu.:19.0   3rd Qu.: 56  
##  Max.   :25.0   Max.   :120

You can also embed plots, for example:

plot(cars)

[Plot: scatter plot of dist against speed for the cars data]