notes by Dan Connolly
Introduction to Biostatistics for Clinical and Translational Researchers
from the KUMC calendar:
Presentation materials are distributed by email and on the biostats department educational resources page.
Statistics is … for analyzing information to
help people make decisions when faced with uncertainty.
The figure on “The Role of Statistics” is nice: model, reality, sample, histogram/observation, with observation linked to model.
with links to Wikipedia on Statistical hypothesis testing:
Methods usually test the null.
back to Einstein: “… a single experiment can prove me wrong.”
The scientific method…
Example:
\( H_{0} \): mean BMI of a population is 26.3. Suppose the sample average is 26: does that support it? Pretty close. What about 27.5? What about 30?
This is a simple t-test.
The data is the evidence for/against your hypothesis.
If observation is unlikely under \( H_{0} \), then reject \( H_{0} \) in favor of \( H_{1} \).
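A minimal sketch of that decision rule in R. The BMI values are simulated (hypothetical, not data from the lecture), 26.3 is passed to `t.test` as the hypothesized mean `mu`, and the 0.05 threshold is an assumed convention:

```r
# Hypothetical sample: 30 simulated BMI values (not real study data).
set.seed(42)
bmi <- rnorm(30, mean = 27.5, sd = 4)

# One-sample t-test of the hypothesized mean 26.3.
res <- t.test(bmi, mu = 26.3)

# The t statistic is the distance of the sample mean from 26.3,
# measured in standard-error units.
t_by_hand <- (mean(bmi) - 26.3) / (sd(bmi) / sqrt(length(bmi)))

alpha <- 0.05                  # assumed significance level
reject <- res$p.value < alpha  # TRUE means: reject the hypothesized mean
```

The further the sample mean sits from 26.3 relative to its standard error, the smaller the p-value.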
courtroom analogy: History of 0.05: the British court system quantified conviction as a “1 in 20 chance.” The stats literature continues to use the idioms “convict” and “acquit.”
from R help on t.test function and the sleep data set, i.e. “Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.”
str(datasets::sleep)
## 'data.frame': 20 obs. of 3 variables:
## $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
## $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## Classical example: Student's sleep data
plot(extra ~ group, data = sleep)
## Formula interface
t.test(extra ~ group, data = sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.861, df = 17.78, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3655 0.2055
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
##
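To connect this output to the decision rule, the pieces of the result can be pulled out of the object that `t.test` returns. A small sketch, assuming the conventional \( \alpha = 0.05 \) (the threshold is an assumption here, not part of the R help example):

```r
res <- t.test(extra ~ group, data = datasets::sleep)

res$p.value    # p-value of the Welch two-sample test
res$conf.int   # 95% confidence interval for the difference in means

alpha <- 0.05  # assumed significance level
reject <- res$p.value < alpha
reject         # FALSE here: p is about 0.079, so H0 is not rejected at 0.05
```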
of hypothesis testing
Type I Error (\( \alpha \)): \( H_{0} \) is incorrectly rejected
Type II Error (\( \beta \)): failing to reject a false \( H_{0} \)
Power (\( 1 - \beta \)): correctly rejecting a false \( H_{0} \)
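Note: at the same \( \alpha \), a 2-sided test is less powerful than a 1-sided test in the true direction of the effect, because it puts only \( \alpha/2 \) in each tail; it is not literally half as powerful.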
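These definitions can be checked by simulation. The sketch below is illustrative only: the sample size, effect size, and number of replications are arbitrary choices, and a one-sample t-test stands in for whatever test is actually used.

```r
set.seed(1)
n <- 20; alpha <- 0.05; reps <- 2000

# TRUE if a two-sided one-sample t-test of H0: mean = 0 rejects at level alpha.
rejects <- function(true_mean) {
  t.test(rnorm(n, mean = true_mean), mu = 0)$p.value < alpha
}

type1 <- mean(replicate(reps, rejects(0)))    # H0 true: rate estimates alpha
power <- mean(replicate(reps, rejects(0.5)))  # H0 false: rate estimates 1 - beta

# Analytic power for the same setting, two-sided vs one-sided.
p2 <- power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha,
                   type = "one.sample", alternative = "two.sided")$power
p1 <- power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha,
                   type = "one.sample", alternative = "one.sided")$power

c(type1 = type1, power = power, two.sided = p2, one.sided = p1)
```

In this setting the one-sided test is more powerful than the two-sided one, though by much less than a factor of two.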
R note: see also: Interactive depiction of type I and type II error and power
pub note on markdown vs latex: I suspect latex would let me define a \power macro instead of typing 1-\beta over and over.
Influences on power: it increases with effect size and with sample size.
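R's `power.t.test` makes both influences on power concrete. In this sketch the effect sizes, sample sizes, and the 80% target are arbitrary illustration values, and a two-sample t-test with unit standard deviation is assumed; `power_at` is a made-up helper name.

```r
# Power of a two-sample t-test (sd = 1, alpha = 0.05) for n subjects per group.
power_at <- function(n, delta) {
  power.t.test(n = n, delta = delta, sd = 1, sig.level = 0.05)$power
}

# Power grows with sample size (effect size fixed at 0.5) ...
by_n <- sapply(c(10, 20, 40, 80), power_at, delta = 0.5)

# ... and with effect size (n fixed at 20 per group).
by_delta <- sapply(c(0.2, 0.5, 0.8), function(d) power_at(20, d))

# Solve for the per-group n that gives 80% power at delta = 0.5.
n_needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n
ceiling(n_needed)
```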
box plot, histogram, mean, median…
6 != 7
## [1] TRUE
6 < 7
## [1] TRUE
6 - 7 == 101 - 102
## [1] TRUE
10/5
## [1] 2
e.g. person A (\( t = 10 \) minutes) took twice as long as person B (\( t = 5 \) minutes)
examples: age, birth weight, follow-up time, time to complete a task, dose, survival time
examples: temp, dates
t1 <- as.POSIXct("2012-06-28")
t2 <- as.POSIXct("2012-06-30")
t2 - t1
## Time difference of 2 days
t2/t1
## Error: '/' not defined for "POSIXt" objects
e.g. sports rankings: top 5 basketball teams
# Levels default to alphabetical order; pass levels = ... to encode an actual ranking.
team <- factor(c("UK", "KU", "MU", "UT"), ordered = TRUE)
team
## [1] UK KU MU UT
## Levels: KU < MU < UK < UT
team[1] - team[2]
## Warning: '-' is not meaningful for ordered factors
## [1] NA
team[1] > team[2]
## [1] TRUE
examples: likert scale, highest level of education
no order.
fruit <- factor(c("apple", "banana", "apple"))
fruit[1]
## [1] apple
## Levels: apple banana
fruit[1] < fruit[2]
## Warning: < not meaningful for factors
## [1] NA
nominal: mode
ordinal: median
interval: mean, standard deviation
ratio: geometric mean, coefficient of variation
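Base R has no built-in geometric mean or coefficient of variation, so here is a minimal sketch of both. The helper names `geo_mean` and `coef_var` and the dose values are made up for illustration; both statistics assume strictly positive ratio-scale data.

```r
# Geometric mean: exponentiate the mean of the logs (requires x > 0).
geo_mean <- function(x) exp(mean(log(x)))

# Coefficient of variation: standard deviation relative to the mean (unitless).
coef_var <- function(x) sd(x) / mean(x)

dose <- c(1, 2, 4, 8)  # hypothetical doses on a ratio scale
geo_mean(dose)         # 2^1.5, about 2.83
coef_var(dose)

# Rescaling the units (e.g. mg to g) leaves the coefficient of variation unchanged.
all.equal(coef_var(dose), coef_var(dose / 1000))
```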
table(fruit)
## fruit
## apple banana
## 2 1
Mo <- function(x) {
  sort(table(x), decreasing = TRUE)[1]
}
Mo(fruit)
## apple
## 2
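# Because of its name, this is also the S3 method that median() dispatches to for ordered factors.
median.ordered <- function(x) {
  x.ordered <- sort(x)
  # Lower median; ceiling() also picks the middle element when the length is odd.
  x.ordered[ceiling(length(x) / 2)]
}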
median.ordered(team)
## [1] MU
## Levels: KU < MU < UK < UT
median(team)
## [1] MU
## Levels: KU < MU < UK < UT
table(datasets::state.division)
##
##        New England    Middle Atlantic     South Atlantic
##                  6                  3                  8
## East South Central West South Central East North Central
##                  4                  4                  5
## West North Central           Mountain            Pacific
##                  7                  8                  5
Mo(datasets::state.division)
## South Atlantic
## 8
oops… the mode seems ill-defined: South Atlantic and Mountain tie at 8, and `Mo` reports only one of them.
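One way to make the mode well-defined in the presence of ties is to return every level that attains the maximum count. A sketch, with `modes` as a hypothetical helper name:

```r
# Return all levels whose frequency equals the maximum (there may be several).
modes <- function(x) {
  tab <- table(x)
  names(tab)[as.vector(tab) == max(tab)]
}

modes(datasets::state.division)       # "South Atlantic" "Mountain"
modes(c("apple", "banana", "apple"))  # "apple"
```

A distribution with two such values is still reported honestly, at the cost of the result no longer being a single value.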
# Assumed coding: 1 = male, 0 = female.
gender <- factor(c(1, 0), levels = c(1, 0), labels = c("male", "female"))
frequencies, percentage, proportions
bar charts, pie charts
bar chart… full light, night light, darkness
The association does not prove causation: we didn't randomize. Possible confounding factors: genetics, parental behavior, …
bar graphs showing N, percent.
with unequal sample sizes, use proportion as the axis
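A sketch of plotting proportions rather than counts when group sizes differ. The counts below are made up for illustration (they are not the data from the lecture's lighting example), and the yes/no outcome labels are placeholders:

```r
# Hypothetical counts: outcome (rows) by lighting group (columns).
counts <- matrix(c(10,  90,   # darkness:    10 with outcome,  90 without
                   40, 160,   # night light: 40 with outcome, 160 without
                   30,  20),  # full light:  30 with outcome,  20 without
                 nrow = 2,
                 dimnames = list(outcome  = c("yes", "no"),
                                 lighting = c("darkness", "night light", "full light")))

# Column-wise proportions put groups of different sizes on a common 0-1 axis.
props <- prop.table(counts, margin = 2)
props

barplot(props, beside = TRUE, legend.text = TRUE,
        ylab = "Proportion within lighting group")
```

With raw counts the night-light group would dominate the chart simply because it is the largest group.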
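# rnorm() is negative about half the time, and log() of a negative number is NaN, hence the warning.
ttd <- log(rnorm(50) * 1000)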
## Warning: NaNs produced
hist(ttd)
boxplots show quartiles
weight <- rnorm(100)
hist(weight)
boxplot(weight, horizontal = TRUE, add = TRUE)
**How to overlay those?**
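One possible answer to the overlay question: with `add = TRUE` the boxplot is drawn in the histogram's coordinate system, so its vertical position (`at`) and thickness (`boxwex`) have to be given in count units. A sketch, where the headroom and scaling factors are arbitrary choices:

```r
set.seed(1)
weight <- rnorm(100)  # simulated data, as in the notes

h <- hist(weight, plot = FALSE)  # compute the bins without drawing
top <- max(h$counts)

# Leave headroom above the tallest bar, then draw the boxplot in that strip.
plot(h, ylim = c(0, top * 1.3), main = "Histogram with boxplot overlay")
boxplot(weight, horizontal = TRUE, add = TRUE, axes = FALSE,
        at = top * 1.15,     # vertical position of the box (y-axis units)
        boxwex = top * 0.2)  # scales the box thickness (y-axis units)
```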
where does R put the whiskers of a boxplot?
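By default (`range = 1.5` in `boxplot`, `coef = 1.5` in `boxplot.stats`), each whisker extends to the most extreme data point that lies within 1.5 times the box length from the box; anything beyond that is drawn as an individual point. A small check on made-up numbers:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up data with one extreme value

bs <- boxplot.stats(x)  # the computation boxplot() relies on
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker: 1 3 5.5 8 9
bs$out    # points beyond the whiskers: 30

# Upper fence = upper hinge + 1.5 * (upper hinge - lower hinge) = 8 + 1.5 * 5 = 15.5,
# so the upper whisker stops at 9, the largest observation not above the fence.
```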
relative frequency ~= percentage
central tendency, variability not meaningful for qualitative/discrete
mean(team)
## Warning: argument is not numeric or logical: returning NA
## [1] NA
median not affected by outliers, tails
x <- c(1, 2, 2, 4)
median(x)
## [1] 2
y <- c(1, 2, 2, 9)
median(x) == median(y)
## [1] TRUE
mean is affected by shape of distribution. takes account of magnitude of every observation
good sampling stability (varies the least among samples)
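Sampling stability can be illustrated by simulation: draw many samples from the same population and see how much each statistic varies from sample to sample. The normal population, sample size, and replication count below are arbitrary choices for the sketch; for heavy-tailed populations the comparison can come out differently.

```r
set.seed(1)
reps <- 2000; n <- 25

# Each column is one sample of size n from a standard normal population.
samples <- matrix(rnorm(reps * n), nrow = n)

sd_of_means   <- sd(apply(samples, 2, mean))
sd_of_medians <- sd(apply(samples, 2, median))

c(mean = sd_of_means, median = sd_of_medians)  # the mean varies less here
```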
mean(x)
## [1] 2.25
mean(x) == mean(y)
## [1] FALSE
mean follows tail in case of skewness.
x <- c(1, 2, 3, 4, 10)
y <- c(1, 2, 3, 4, 100)
mean(x)
## [1] 4
mean(y)
## [1] 22
range(x)
## [1] 1 10
sd(x)
## [1] 3.536
variance \( s^{2} \)
\( s \) is in the same units as the data.
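For reference, the sample variance is \( s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} \), and \( s = \sqrt{s^{2}} \). A quick check of the formula against R's built-ins, reusing the values from the `sd` example:

```r
x <- c(1, 2, 3, 4, 10)
n <- length(x)

s2 <- sum((x - mean(x))^2) / (n - 1)  # sample variance by the formula: 12.5
s  <- sqrt(s2)                        # standard deviation, same units as x

all.equal(s2, var(x))  # TRUE
all.equal(s, sd(x))    # TRUE
```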
normal
TODO: get a nice normal curve
hist(rnorm(1000))
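One way to address the TODO is to draw the density itself with `curve` and `dnorm`, optionally on top of a histogram scaled to density. A sketch (the plotting range of -4 to 4 and the titles are arbitrary choices):

```r
# Smooth standard normal density curve.
curve(dnorm(x), from = -4, to = 4, ylab = "density", main = "Standard normal")

# Overlay on a histogram: freq = FALSE puts the bars on the density scale.
set.seed(1)
z <- rnorm(1000)
hist(z, freq = FALSE, main = "Sample vs. normal density")
curve(dnorm(x), add = TRUE, lwd = 2)
```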
right skewed
bimodal: what's the precise definition of mode?