| title: “Statistical inference” |
| author: “lianrui” |
| date: “2017/8/23” |
| output: |
| pdf_document: default |
We want to get some “fact” about the total population. The challenge here is that the data is noisy. Therefore we have to take the uncertainty into account.
https://en.wikipedia.org/wiki/Statistical_inference 1. Point estimate; 2. Interval estimate; 3. Hypothesis testing; 4. Clustering and classification; 5. Credible interval (further: https://en.wikipedia.org/wiki/Credible_interval).
parametric Nonparametric statistics are statistics not based on parameterized families of probability distributions. They include both descriptive and inferential statistics. The typical parameters are the mean, variance, etc.
non-parametric Unlike parametric statistics, nonparametric statistics make no assumptions about the probability distributions of the variables being assessed. The difference between parametric models and non-parametric models is that the former has a fixed number of parameters, while the latter grows the number of parameters with the amount of training data.[1] Note that the non-parametric model does, counterintuitively, contain parameters: the distinction is that parameters are determined by the training data in the case of non-parametric statistics, not the model.Bootstraping and permutation are non-parametric.
semi-parametric
Probability: a number between 0 and 1 to give a sense of the chance of a particular event.
Why is probability the fundation of statistic inference. Statistical inference is the process of deducing properties of an underlying probability distribution by analysis of data. a probability model connect data to population with assumptions.
For a fare coin flipping experiment, X = 0 represent tails; X = 1 represent heads p(x) = (1/2)^x * (1/2)^(1-x)
If the coin is not fair, the chance of tail is A, and the chance of head is 1-A p(x) = A^x * (1-A)^(1-x)
Gives the probability that a discrete random variable is exactly equal to some value. 1. >= 0 2. sum of total p = 1
PDF associated with a continuous variable. area under the function curve correspondent to the probability of that variable. areas of 1. >= 0 2. total area under the function curve equals to 1.
Code: demo of a continuous distrition (beta distribution). the area under the curve is 1.
x <- c (-.5, 0, 1, 1, 1.5)
y <- c (0, 0, 2, 0, 0)
d <- data.frame(x = x, y = y)
library(ggplot2)
c <- ggplot(data = d, aes(x = x, y = y)) + geom_line(colour = "steelblue")
c
Code: Fore beta(2, 1), What is the probability that 75% or fewer have been addressed?
pbeta(.75, 2, 1)
## [1] 0.5625
Code: culmulative distribution function F(x) = P(X <= x)
pbeta(c(0.4, 0.5, 0.6), 2, 1)
## [1] 0.16 0.25 0.36
S(x) = P (X > x) S(x) = 1- F(x)
understanding conditional prabability with Venn diagram. VennDiagram package as a tool.
Under condition a subject has or do not have the disease (sensitivity and specificity) Sensitivity : The probability of positive testing result given that the subject has the disease. Among all the subjects have the disease, the proportion of patients who get the positive test result; P(+|D) Specificity: The probability of negative testing result given that the subject does NOT actually have the disease. Among all the subjects who don’t have the disease, the proportion of patients who get the negative test result. P(-|Not D)
Under condition of positive or negative testing result. Positive predictive value: The probability of having the disease given that the testing result is positive. Among all the subjects who are of positive testing result, the proportion actually have the disease. P(D|+) Negative predictive value: The probability of NOT having the disease given that the test result is negative. Among all the subjects who are of negative testing result, the proportion actually NOT having the disease. P(Not D | -)
Diagnostic likelyhood ratio. Diagnostic likelyhood ratio of positive test. sensitivity/(1-specificity) The ratio of test positive patients to test negative patients How effective the test help to screen the patients Diagnostic likelyhood ratio of negative test. (1-sensitivity)/Specificity The extend a test will misdiagnosis.
see: conditional probability and diagnosis.xlsx https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2636062/
case:Flawed Statistics in Murder Trial May Cost Expert His Medical License http://science.sciencemag.org/content/309/5734/news-summaries a mother of two sudden infant death (SID) was convicted murder because a renowned piediatrician testified that “nature occuring” of two SIDs in a row is very very rare. the doctor simply squared the probability of the occurance of one SID in a family; But the thing is that if a family has one SID, the subseuqent SID risk is higher because of genetic factors that are playing a role.
iid sample: independent and identically distribution.
sample expected values will estimate some of the population characteristics. so -How the population is centered: mean.
-How the population is spread out: variance.
The difference between sample mean and population mean: For a discrete random variable with probability mass function (PMF) of p(x), The expected value or mean is the geometric average: E[x] = Sigma (x * p(x)) Geometrically, E[x] is the center mass of {x, p(x)}.
Sample mean is a empirical estimation of the population mean. 1. the population mean is the center of mass of population; 2. the sample mean is the center of mass of the sample (observed data); 3. the sample mean ESTIMATE the population mean ; 4. The sample mean of a random varialbe, is a random variable too; 4. the sample mean is unbiased: the population mean of its distribution is the mean that it is trying to estimate. 5. the more data that goes into the sample mean, the more concentrated its’ density/mass function is around the population mean. Code: The histogram of parent child height.
library(UsingR);
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
data(galton)
library(reshape2)
par(mfrow=c(1,2))
hist(galton$child,col="blue",breaks=100)
hist(galton$parent,col="blue",breaks=100)
head(galton)
## child parent
## 1 61.7 70.5
## 2 61.7 68.5
## 3 61.7 65.5
## 4 61.7 64.5
## 5 61.7 64.0
## 6 62.2 67.5
dim(galton)
## [1] 928 2
str(galton)
## 'data.frame': 928 obs. of 2 variables:
## $ child : num 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
## $ parent: num 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
longGalton <- melt(galton, measure.vars = c("child", "parent"))
g <- ggplot(longGalton, aes(x = value)) + geom_histogram(aes(y = ..density.., fill = variable), binwidth=1, color = "red") + geom_density(size = 0.5)
g <- g + facet_grid(. ~ variable)
g
Code: manipulate the child hight sample with different empirical mean. mse(mean squared error) is sort of a measure of imbalance, how teetering or tottering this histogram is.
myHist <- function(mu){
g <- ggplot(galton, aes(x = child))
g <- g + geom_histogram(fill = "steelblue", binwidth = 0.5,
aes(y = ..density..))
g <- g + geom_density(colour = "red", size = 0.5)
g <- g + geom_vline(xintercept = mu, size = 0.5, colour = "orange")
mse <- round(mean(galton$child - mu)^2, 3)
g <- g + labs(title = paste("mu = ", mu, " MSE = ", mse))
g
}
Note: expected mean is the center of mass of the sample distribution. When assumed mean approximate the expected mean, mse is going to 0.
Code: simulation of the expected value of a binominal experiment of n times with the probability of p.
library(ggplot2)
library(plyr)
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
binominalPlot <- function(n,p){
x <- dbinom(c(0:n),size=n,prob=p)
dat <- data.frame(N = c(0:n), P = x)
gd <- ggplot(data = dat, aes(x = N, y = P))
g <- gd + geom_bar(aes(fill = P), stat = "identity")
g
}
binominalPlot(8, 0.8)
Code: Simulating: comparing the center of the distribution of averages and the center of that population. They will be centered at the same place.
library(ggplot2)
nosim <- 10000; n <- 100
dat <- data.frame(x = c(rnorm(nosim), apply(matrix(data = rnorm(nosim * n), nrow = nosim), 1, mean)),
# the first 10000 of x is random number from rnorm;
# the second 10000 of x is the mean of 10 rnorm
what = factor(rep(c("Obs", "Mean"), c(nosim, nosim)))
# repeat "Obs" and "Mean", each = nosim
)
g <- ggplot(dat, aes(x = x, fill = what)) + geom_density(size = 0.5, alpha = .2)
g <- g + ggtitle("UNBIASED estimation of the means of normal")
g
#matrix: create a matrix from data, default is by column.
# apply: applying a function to margins of an array or matrix. by row = 1, by column = 2
Note:
1. Expected values are properties of distributions; 2. The average of random variables is itself a random variable 3. And its associated distribution has an expected value; 4. The center of the distribution means is the same as that of the original distribution; 5. Therefore, the expected value of the sample mean is the population mean that it’s trying to estimate 6. Unbiased: When the expected value of an estimator is what its trying to estimate, we say that the estimator is unbiased
ibid. Code: simulation of N number of dice roll and the distribution of the mean of these N number of dice rolls.
nosim = 10000
dat <- data.frame(x = c(sample(1 : 6, nosim, replace = TRUE),
apply(matrix(data = sample(1 : 6, nosim * 2, replace = TRUE),
nrow = nosim), 1, mean),
apply(matrix(data = sample(1 : 6, nosim * 3, replace = TRUE),
nrow = nosim), 1, mean),
apply(matrix(data = sample(1 : 6, nosim * 4, replace = TRUE),
nrow = nosim), 1, mean)),
size = factor(rep(1:4, rep(nosim, 4))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = 0.5, size = 0.2, colour = "steelblue")
g
g <- g + facet_grid(.~size)
g <- g + ggtitle("simulation of N number of dice roll and the distribution of the mean of these N number of dice rolls.")
g
ibid. Code: simulation of n number of standard normal and the distribution and mean of these n simulation.
dat <- data.frame(
x = c(sample(0 : 1, nosim, replace = TRUE),
apply(matrix(sample(0 : 1, nosim * 10, replace = TRUE),
nosim), 1, mean),
apply(matrix(sample(0 : 1, nosim * 20, replace = TRUE),
nosim), 1, mean),
apply(matrix(sample(0 : 1, nosim * 30, replace = TRUE),
nosim), 1, mean)
),
size = factor(rep(c(1, 10, 20, 30), rep(nosim, 4))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = 1 / 12, colour = "steelblue", size = 0.2);
g <- g + facet_grid(. ~ size)
g
An important characterization of a population is how spread out it is. One of the key measures of spread is variability. We estimate population variability with the sample variance, or the square root, called the standard deviation with the same units;
Variability has many important uses in statistics. First, the population variance is itself an intrinsically interesting quantity that we want to estimate. Secondly, variability in our estimates is what makes them not imprecise. An important aspect of statistics is quantifying the variability in our estimates.
Population: mean = mu, variance = sigma square, sd = sigma; Sample: mean = Xbar, variance = S = sigma square/(n-1), sd = sigma/square root (n-1) means os sample means: mean = mu, variance = S/n, sd = S/square root (n)
This is very useful since we don’t need to repeat to calculate the means of sample mean and get the distributio of it. instead, if we know the sample mean and sample variance, (xbar, and variance), we can deduce the distribution of the means of sample means.
Code: Distributions with increasing variance. https://github.com/bcaffo/courses/blob/master/06_StatisticalInference/05_Variance/index.Rmd#distributions-with-increasing-variance
library(ggplot2)
xvals <- seq(-10, 10, by = .01)
dat <- data.frame(
y = c(
dnorm(xvals, mean = 0, sd = 1),
dnorm(xvals, mean = 0, sd = 2),
dnorm(xvals, mean = 0, sd = 3),
dnorm(xvals, mean = 0, sd = 4)
),
x = rep(xvals, 4),
factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
g <- ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)
g <- g + ggtitle("data with big variance are more spread out")
g
Code: with more data, the sample variance concentrated around population variance (unbiased). when you take small sample size, usually the extreme value of the sample will not be included. therefore, you underestimate the range (or variance) of this sample. Therefore, we adjust the degree of freedom to N-1 to make the estimation more unbiased. Below, when the sample size is small, the sample variance is below the population variance.
library(ggplot2)
nosim <- 10000;
dat <- data.frame(
x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
apply(matrix(rnorm(nosim * 50), nosim), 1, var),
apply(matrix(rnorm(nosim * 300), nosim), 1, var)),
n = factor(rep(c("10", "50", "300"), c(nosim, nosim, nosim)))
)
g <- ggplot(dat, aes(x = x, fill = n)) +
geom_density(size = 0.2, alpha = .2) +
geom_vline(xintercept = 1, size = .4, colour = "orange")
g <- g + ggtitle("with more data, the sample variance concentrated around population variance (unbiased)")
g
ibid. Code: with more data, the sample variance concentrated around population variance (unbiased). for dice rolling experiments.
dat <- data.frame(
x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE),
nosim), 1, var),
apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE),
nosim), 1, var),
apply(matrix(sample(1 : 6, nosim * 30, replace = TRUE),
nosim), 1, var)
),
size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.3, colour = "steelblue")
g <- g + geom_vline(xintercept = 2.92, size = .5, colour = "orange")
g <- g + facet_grid(. ~ size)
g
Code: comparing standard deviation of population of the standard deviation of sample means.
nosim <- 1000
n <- 10
a <- sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
b <- 1 / sqrt(n)
a
## [1] 0.3101352
b
## [1] 0.3162278
ibid: standard uniform distribution.
nosim <- 1000
n <- 10
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
## [1] 0.09485237
1 / sqrt(12 * n)
## [1] 0.09128709
ibid. poisson distribution.
nosim <- 1000
n <- 10
sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean))
## [1] 0.6350883
2 / sqrt(n)
## [1] 0.6324555
ibid. binominal distribution.
nosim <- 1000
n <- 10
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE),
nosim), 1, mean))
## [1] 0.1599504
1 / (2 * sqrt(n))
## [1] 0.1581139
Code: we draw samples from a population of which we don’t know the exact parameters, mean, variance… we characterize the sample. than we can estimate the population mean and population variance, at the same time, we can estimate the center and spread-out of N samples from this population. Data example: father/sun height.
library(UsingR); data(father.son);
x <- father.son$sheight
n<-length(x)
g <- ggplot(data = father.son, aes(x = sheight))
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g
a <- round(c(population.var.est = var(x), mean.var.est = var(x) / n, population.sd.est = sd(x), mean.sd.est = sd(x) / sqrt(n)),2)
a
## population.var.est mean.var.est population.sd.est
## 7.92 0.01 2.81
## mean.sd.est
## 0.09
PMF of Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution mean = p, variance = p(1-p)
the binominal variable is the total number of heads by flip the coin. x1, …, xn be iid Bernoulli(p) X = Sigma (x) is a binominal random variable. binominal mass function: https://en.wikipedia.org/wiki/Binomial_distribution
Code: Bionominal example: a family of 8 children, 7 of them are girls. the probability of having 7 or more girls in that family.
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^8
## [1] 0.03515625
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03515625
code: Gaussian distribution with mean mu and variance of sigma square: X ~ N(mu, sigma square) standard normal distribution: mu = 0, sigma = 1
library(colorspace)
library(RColorBrewer)
library(ggplot2)
library(scales)
x <- seq(-3, 3, length = 1000)
dat <- data.frame(x = x, y = dnorm(x))
dat$qt <- cut(dat$x, breaks = 6, labels = FALSE)
g <- ggplot(data = dat,
aes(x = x, y = y)) + geom_line(size = 0.5, colour = "steelblue", fill = "black")
## Warning: Ignoring unknown parameters: fill
g <- g + geom_vline(xintercept = -3 : 3, size = 0.1, colour = "black")
g <- g + geom_area(aes(x=x,y=y,group=qt,fill=qt),color="red")
g <- g + geom_text(aes(label = "2.5%", x = -2.5, y = 0.03), colour = "steelblue") +
geom_text(aes(label = "2.5%", x = 2.5, y = 0.03), colour = "steelblue") +
geom_text(aes(label = "13.5%", x = -1.5, y = 0.15), colour = "steelblue") +
geom_text(aes(label = "13.5%", x = 1.5, y = 0.15), colour = "steelblue") +
geom_text(aes(label = "34%", x = -.5, y = 0.35), colour = "steelblue") +
geom_text(aes(label = "34%", x = .5, y = 0.35), colour = "steelblue")
g <- g + geom_vline(xintercept = c(-2.33, -1.96, -1.645, -1.28, 1.28, 1.645, 1.96, 2.33),
size = 0.1, colour = "red", lty = 4)
g <- g + scale_fill_gradient2(midpoint=median(unique(dat$qt)), guide="none",
low = muted("red"),
mid = "white",
high = muted("steelblue"))
g
convert non-normal to normal. Z ~ (mean-mu)/sigma Code: for a non-standardized normal distribution with mu = 2, and sigma = 2, what is the probability of x >5
?pnorm
pnorm(2, mean = 2, sd = 1, lower.tail = FALSE)
## [1] 0.5
Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. mean = 1020, sigma = 50.
# What's the probability of getting more than 1,160 clicks in a day?
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
## [1] 0.00255513
pnorm((1160-1020)/50, lower.tail = FALSE)
## [1] 0.00255513
# What number of clicks will represent the one where 75% of daily click is lower than that.
qnorm(0.75, mean = 1020, sd = 50, lower.tail = TRUE)
## [1] 1053.724
Code: Simulating the shape of normal with different mean and varance.
library(ggplot2)
mu0 <- 30
myplot <- function(sigma, mua, n, alpha){
g = ggplot(data.frame(mu = c(27, 37)), aes(x = mu))
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0, sd = sigma/sqrt(n)),size = 2, col = "red")
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua, sd = sigma/sqrt(n)),size = 2, col = "blue")
xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
g = g + geom_vline(xintercept = xitc, size = 3)
g
}
myplot(10, 30, 100, 0.05)
At any rate, the Poisson distribution is used to model counts. Poisson mass function: https://en.wikipedia.org/wiki/Poisson_distribution X in non-negative integers. mean of Poisson distribution is lamda, variance of Poisson is also lamda instances that Poisson distribution is used: -Modeling count data; -Event-time; -Survival data; -Contengency tables; In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. -Approximating binominals when n is large and p is small.
Poisson random variables are used to model rates. X ~ Poisson(lamda * t) lamda is the expected count per unit time; t is the total monitoring time.
Code: the number of people that show up at a bus stop is Possion with a mean of 2.5 per hour. what is the probability that less than 3 people show up for 4 hour in this bus stop.
ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605
Code: Poisson approximation to binominal. X ~ binominal (n, p) lambda = n * p n gets large p gets small the probability distribution governing acts which is binomial can be well approximated by Poisson probabilities with this specific notation lambda as n times p. Code: Poisson approximation to binominal
p_binom <- pbinom(2, size = 500, prob = 0.01)
p_poisson <- ppois(2, lambda = 500 * 0.01)
p_binom;p_poisson
## [1] 0.1233858
## [1] 0.124652
Asymptotics: the behavior of statistics when the sample size limits to infinity. In statistics, asymptotic theory, or large sample theory, is a generic framework for assessment of properties of estimators and statistical tests. Within this framework it is typically assumed that the sample size n grows indefinitely, and the properties of statistical procedures are evaluated in the limit as n → ∞. (wikipedia)
So, asymptotics are incredibly useful for simple statistical inference and approximations.
Asymptotics also form the basis for frequency interpretation of probabilities.
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. the average limits to what its estimating, the population mean. The variance limits to the population variance, and the standard dievation limits to population deviation.
The probability of a head on a coin is 50% because we believe that if we were to flip it infinitely many times, we would get exactly 50% heads. this is asymptotics. Our very notion of probability depends on the idea of asymptotics. We can use asymptotics to help is figure out things about distributions without knowing much about them to begin with.
Code: Law of large numbers in normal distribution.
n <- 1000
a <- rnorm(n)
b <- cumsum(a)
c <- b/(1:n)
dat <- data.frame(a = c(1:1000), b, c)
head(dat)
## a b c
## 1 1 1.396271 1.3962712
## 2 2 1.952760 0.9763799
## 3 3 3.327528 1.1091761
## 4 4 3.765534 0.9413834
## 5 5 3.799043 0.7598087
## 6 6 4.215188 0.7025313
library(ggplot2)
gd <- ggplot(data = dat, aes(x = a, y = c))
g <- gd + geom_line()
g
Code: the asymptotics of mean of coin flipping: Law of large numbers.
n = 1000
means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)
dat <- data.frame(x = c(1:1000), y = means)
gdat <- ggplot(data = dat, aes (x = x, y = y))
g <- gdat + geom_line()
g
Discussion: 1) Asymptotics has a “goal”, i.e., the estimator will converge to what you want to estimate. For example, the sample proportion from iid coin flips is consistent for the true success probability of a coin. the sample mean of iid samples is consistent for the population mean. the sample variance and standard dievation is also consistent. This is what the Law of Large Numbers says. In other words, if we get infinite amount of data, or large number of data, we can asymptotics to the answer.
Central Limit Theorem: the distribution of averages is often normal; even if the distribution of the data being sampled from is very non-normal; This helps us create robust strategies for creating statistical inferences when we’re not willing to assume much about the generating mechanism of our data.
(Estimate - Mean of estimate)/(Std.Err.estimate) ~ standard normal for large n. the sample average is approximately stardard normal with mean of mu and variance of sigma square/n
Code: simulate a standard normal random variable by rolling a dice of n times Xi is the outcome of dice i, mu = E[Xi] = 3.5; Var(Xi) = 2.92; sd = square root (2.92/n). roll n dice, take their mean, subtract off 3.5, and divide by 1.71/square root (n)
nosim <- 1000
# calculate the estimator: (Estimate - Mean of estimate)/(Std.Err.estimate)
cfunc <- function(x, n) sqrt(n) * (mean(x) - 3.5) / 1.71
dat <- data.frame(
x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE),
nosim), 1, cfunc, 10),
# sample number 1 to 6 of total (nosim *10 = 10000) times, replace = true;
# create a matrix of 1000 X 10;
# for each row (argument = 1), apply cfunc with n = 10.
apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE),
nosim), 1, cfunc, 20),
# sample number 1 to 6 of total (nosim *20 = 20000) times, replace = true;
# create a matrix of 1000 X 20;
# for each row (argument = 1), apply cfunc with n = 20.
apply(matrix(sample(1 : 6, nosim * 1000, replace = TRUE),
nosim), 1, cfunc, 1000)
# ibid.
),
size = factor(rep(c(10, 20, 1000), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .50, binwidth=.3, colour = "steelblue", aes(y = ..density..))
g <- g + stat_function(fun = dnorm, size = 0.5, colour = "orange")
g + facet_grid(. ~ size)
Code: simulate a standard normal random variable by coin flipping. Xi be the 0 or 1, of the ith flip of a possibly unfair coin, with the probability of p of head. E[Xi] = p; Var[Xi] = p(1-p) statistic = (phead - p)/square root(p(1-p)/n) Also, the speed at which the normalized coin flips converges to normality is governed by how biased the coin is. let p = 0.9. So just keep this in mind that the central limit theorem doesn’t guarantee that the normal distribution will be a good approximation. Simply that as the number of coin flips limits to infinity eventually, it will be a good approximation.
nosim <- 1000
cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5)
dat <- data.frame(
x = c(apply(matrix(sample(0:1, nosim * 10, replace = TRUE),
nosim), 1, cfunc, 10),
apply(matrix(sample(0:1, nosim * 20, replace = TRUE),
nosim), 1, cfunc, 20),
apply(matrix(sample(0:1, nosim * 30, replace = TRUE),
nosim), 1, cfunc, 30)
),
size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..))
g <- g + stat_function(fun = dnorm, size = 2)
g + facet_grid(. ~ size)
code: Confidence interval of the son’s height.
library(UsingR);data(father.son); x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12
## [1] 5.709670 5.737674
code: confidence interval of the coin flip. Xi is 0 or 1 with success probabillity of p, variance = p(1-p); the confidence of interval: phead plus/minus normal quantile * square root(p(1-p)/n)
p <- seq(0, 1, length = 1000)
dat <- data.frame(p = p, v = p * (1 - p))
gd <- ggplot(data = dat, aes(x = p, y = v))
g <- gd + geom_line(size = 0.5, colour = "orange")
g <- g + ggtitle("when p = 0.5, binomial distribution has the largest variance")
g
So the 95% confidence interval is: phead plus/minus 1/square root (n).
code: confidence interval of election. Imagine you were running for political office and your campaign advisor told you that in a random sample of 100 likely voters, 56 intended to vote for you. Can you relax? Do you have this race in the bag? Is 0.56 out of 100 sampled enough evidence to conclude that you’ll likely get more than 50% of the vote? But you don’t have access to a computer or calculator.
# a quick calculation.
0.56 + c(-1,1) * 1/sqrt(100)
## [1] 0.46 0.66
# formal calculation.
.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * (1 - 0.56)/100)
## [1] 0.4627099 0.6572901
# using function binom.test
binom.test(56, 100)$conf.int
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95
Code: Confidence interval of n coin flips for nosim times. see the empirical and real probability and the role of CLT here. a simulation where I flip a coin with a given success probability over and over and over again. And calculate the percentage of times my walled interval covers the true coin probability that we used to generate the data. So I’m going to say, in each one of my simulations I’m going to do 20 coin flips.
n <- 100;
pvals <- seq(.1, .9, by = .05);
nosim <- 1000
coverage <- sapply(pvals, function(p){
phats <- rbinom(nosim, prob = p, size = n) / n
# the observed probobility of "head" of each of 1000 (nosim) trials. In each trial, there are n = 30 coin flips, and the chance of "head" equals total number of heads over n = 30.
ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
# the lower level of 95% confidence interval of the observed (empirical) probability.
ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
# the upper level of 95% confidence interval of the observed (empirical) probability.
mean(ll < p & ul > p)
# the number of observations covered by the lower-upper level (confidence interval. )
# the logic vector could be treated as numeric, true = 1, false = 0.
})
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 0.5, colour = "orange") + geom_hline(yintercept = 0.95, type = 3) + ylim(0.75, 1.0)
## Warning: Ignoring unknown parameters: type
When n is small, the Central Limit Theroem may not work well. it’s simply because the central limit theorem isn’t as accurate, isn’t as accurate as we need it to be for this specific value of N for coins with this specific true probability.
Code: Agresti/Coull interval instead of Wald interval when n is small. there are two ways to fix this: 1) increase the total number of observations (n) in each simulation. For example, n = 100; 2) the Agresti/Coull interval instead of the Wald interval.
n <- 100;
pvals <- seq(.1, .9, by = .05);
nosim <- 1000
coverage <- sapply(pvals, function(p){
phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
# the observed probobility of "head" of each of 1000 (nosim) trials. In each trial, there are n = 30 coin flips, and the chance of "head" equals total number of heads over n = 30.
ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
# the lower level of 95% confidence interval of the observed (empirical) probability.
ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
# the upper level of 95% confidence interval of the observed (empirical) probability.
mean(ll < p & ul > p)
# the number of observations covered by the lower-upper level (confidence interval. )
# the logic vector could be treated as numeric, true = 1, false = 0.
})
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 0.5, colour = "orange") + geom_hline(yintercept = 0.95, type = 3) + ylim(0.75, 1.0)
## Warning: Ignoring unknown parameters: type
Code: confidence interval of Poisson distribution. case: a nuclear pump that failed 5 times out of 94.32 days, given 95% confidence interval for the failure rate per day. X ~ Poisson (lambda * t) Estimate lamda.head = X/t Var(lambda.head) = lambda/t
x <- 5
t <- 94.32
lambda <- x/t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda/t), 3)
## [1] 0.007 0.099
# use Poisson test in R
poisson.test(x, T = 94.32)$conf
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
t <- 100000
coverage <- sapply(lambdavals, function(lambda){
lhats <- rpois(nosim, lambda = lambda * t) / t
ll <- lhats - qnorm(.975) * sqrt(lhats / t)
ul <- lhats + qnorm(.975) * sqrt(lhats / t)
mean(ll < lambda & ul > lambda)
})
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ylim(0, 1.0)
for very small values of lambda which we have a good sense of if we could have relatively few events for a large amount of monitoring time. For relatively small values of lambda we shouldn’t be using this asymptotic interval.
The LLN states that averages of iid samples converge to the population means that they are estimating The CLT states that averages are approximately normal, with distributions centered at the population mean with standard deviation equal to the standard error of the mean CLT gives no guarantee that \(n\) is large enough Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean Adding and subtracting 2 SEs works for 95% intervals Confidence intervals get wider as the coverage increases (why?) Confidence intervals get narrower with less variability or larger sample sizes The Poisson and binomial case have exact intervals that don’t require the CLT But a quick fix for small sample size binomial calculations is to add 2 successes and failures
Confidence intervals are a convenient way to communicate that uncertainty in estimates. Using central limit theorem, we can create confidence interval: CI = est plus/minus Zquantile * SE(est) For small sample size, T confidence intervals is applied.
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
T distribution: (Xbar-mu)/(estimated standard error) Gosset’s t distribution has only one parameter: degree of freedom (n-1)
The T quantile is from T distribution. T distributio has heavier tail, because the estimated stardard error is smaller that the standard dievation of population. And so T quantile is a little bit wider. As you collect more and more data, the T interval will become more and more approximate to Z interval. So T interval maybe the choice at the very beginning. If you use z quantile when sample size is small, you confidence interval may be too narrow.
Est plus/minus Tquantile * SE (est)
The t interval technically assumes that the data are iid normal, though it is robust to this assumption It works well whenever the distribution of the data is roughly symmetric and mound shaped Paired observations are often analyzed using the \(t\) interval by taking differences For large degrees of freedom, t quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded For skewed distributions, the spirit of the \(t\) interval assumptions are violated. Also, for skewed distributions, it doesn’t make a lot of sense to center the interval at the mean In this case, consider taking logs or using a different summary like the median For highly discrete data, like binary, other intervals are available
Code: comparing T distribution and normal distribution, with different degree of freedom.
# library(ggplot2); library(manipulate)
k <- 1000
xvals <- seq(-5, 5, length = k)
myplot <- function(df){
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),
x = xvals,
dist = factor(rep(c("Normal", "T"), c(k,k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(colour = dist))
g
}
# manipulate(myplot(mu), mu = slider(1, 20, step = 1))
code: with the degree of freedom increase, t quantile approximate z quantile.
pvals <- seq(.5, .99, by = .01)
myplot2 <- function(df){
d <- data.frame(n= qnorm(pvals),t=qt(pvals, df),
p = pvals)
g <- ggplot(d, aes(x= n, y = t))
g <- g + geom_abline(size = 2, col = "lightblue")
g <- g + geom_line(size = 2, col = "black")
g <- g + geom_vline(xintercept = qnorm(0.975))
g <- g + geom_hline(yintercept = qt(0.975, df))
g
}
#manipulate(myplot2(df), df = slider(1, 20, step = 1))
code: sleep data set raw data, show group differences. (paried t)
data(sleep)
dim(sleep)
## [1] 20 3
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
library(ggplot2)
gd <- ggplot(data = sleep, aes (x = group, y = extra, colour = ID,
fill = ID, group = factor(ID)))
g <- gd + geom_point(size = 10, pch = 21, alpha = 0.5)
g <- g + geom_line(size = 1, aes(colour = ID))
g
code: t test of sleep data. (paired)
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
t.test(difference)
##
## One Sample t-test
##
## data: difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of x
## 1.58
t.test(g2, g1, paired = TRUE)
##
## Paired t-test
##
## data: g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of the differences
## 1.58
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
##
## Paired t-test
##
## data: extra by I(relevel(group, 2))
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of the differences
## 1.58
rbind(
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n),
as.vector(t.test(difference)$conf.int),
as.vector(t.test(g2, g1, paired = TRUE)$conf.int),
as.vector(t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)$conf.int)
)
## [,1] [,2]
## [1,] 0.7001142 2.459886
## [2,] 0.7001142 2.459886
## [3,] 0.7001142 2.459886
## [4,] 0.7001142 2.459886
Code: t confidence interval of independent group. Example: two groups in a randomized trial. and the variance of the two group is the same because of randomization. Note, constant variance is an assumption for independent t testing.
The t confidence interval of two independent group: code: comparing SBP of 8 oral contraceptive users versus 21 controls. Xoc = 132.86 mmHg, soc = 15.34 mmHg Xc = 127.44 mmHg, sc = 18.23 mmHg
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5
## [1] -9.521097 20.361097
# note the calculated interval contains zero, so we can't rule out that the two group are actually the same.
code: sleep data unsing independent t test.
data(sleep)
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
n1 <- length(g1); n2 <- length(g2)
x1 <- sleep[which (sleep$group == "1"), ]
x2 <- sleep[which (sleep$group == "2"), ]
sp <- sqrt(((n1 - 1) * sd(x1$extra)^2 + (n2-1) * sd(x2$extra)^2) / (n1 + n2-2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1 / n1 + 1/n2)
rbind(
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd,
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf,
t.test(g2, g1, paired = TRUE)$conf
)
## [,1] [,2]
## [1,] -0.2038740 3.363874
## [2,] -0.2038740 3.363874
## [3,] 0.7001142 2.459886
Note: 1) the unpaired t test attribute all the difference to inter-group difference. And because of the big intra-group variance, unpaired t test give a confidence interval containing 0. so we can’t rule out that the two group are different.
Code: chick weight.
library(datasets);data(ChickWeight)
library(reshape2)
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:Hmisc':
##
## combine, src, summarize
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
wideCW <- mutate(wideCW,gain = time21 - time0)
head(wideCW)
## Diet Chick time0 time2 time4 time6 time8 time10 time12 time14 time16
## 1 1 18 39 35 NA NA NA NA NA NA NA
## 2 1 16 41 45 49 51 57 51 54 NA NA
## 3 1 15 41 49 56 64 68 68 67 68 NA
## 4 1 13 41 48 53 60 65 67 71 70 71
## 5 1 9 42 51 59 68 85 96 90 92 93
## 6 1 20 41 47 54 58 65 73 77 89 98
## time18 time20 time21 gain
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 81 91 96 55
## 5 100 100 98 56
## 6 107 115 117 76
g <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet, group = Chick))
g <- g + geom_line()
g <- g + stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black")
g <- g + facet_grid(. ~ Diet)
g
Code: the violin plot of the variations of weight gain for each diet.
g <- ggplot(wideCW, aes(x = factor(Diet), y = gain, fill = factor(Diet)))
g <- g + geom_violin(col = "black", size = 2)
g
## Warning: Removed 5 rows containing non-finite values (stat_ydensity).
Note: It looks that group 1 and group 4 are of different variance. As such, the equal variance assumption of t test maybe violated.
Code: the independent t test with variance equal or unequal assumptions.
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
rbind(
t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf,
t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf
)
## [,1] [,2]
## [1,] -108.1468 -14.81154
## [2,] -104.6590 -18.29932
Note: the confidence interval calucation with unequal variance.
For binomial data, there’s lots of ways to compare two groups Relative risk, risk difference, odds ratio. Chi-squared tests, normal approximations, exact tests. For count data, there’s also Chi-squared tests and exact tests. We’ll leave the discussions for comparing groups of data for binary and count data until covering glms in the regression class. In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases relevant to biostatistics.
Hypothesis testing is concerned with making decisions using data. - Null hypothesis: the status quo, H0 (H naught). H naught is the starting point which is assumed to be true. - New evidence is required to reject H0 and to establish an alternative hypothesis.
Case: respiratory disturbance index (RDI). severe disordered breathing (SDI) a sample of 100 patients, mean RDI = 32, sd = 10 - H0: mu = 30 - Ha: mu >30 the alternative hypothesis are <, > or not equal. Truth Decide Result H0 H0 correctly accept the H0 H0 Ha type I error: false positive Ha Ha correctly reject H0 and establish Ha Ha H0 type II error: false negative
Type I error rate and type II error rate is related. As type I error rate increase, type II error rate decrease.
Considering the SDI case, if the sample mean is larger than some constant,
Code: the type I and II error rate is related. alpha = type I error rate = probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct.
a sample of 100 patients, mean RDI = 32, sd = 10, se = 10/squareroot (100) = 1 H0: X bar ~ N(30, 1)
library(ggplot2)
mu0 <- 30
myplot <- function(sigma, mua, n, alpha){
g = ggplot(data.frame(mu = c(27, 37)), aes(x = mu))
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0, sd = sigma/sqrt(n)),size = 2, col = "red")
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua, sd = sigma/sqrt(n)),size = 2, col = "blue")
xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
g = g + geom_vline(xintercept = xitc, colour = "orange", size = 0.5)
g = g + geom_text(aes(label = round(mu0 + qnorm(1 - alpha) * sigma / sqrt(n), 2)), x = mu0 + qnorm(1 - alpha) * sigma / sqrt(n), y = 0)
g
}
myplot(10, 30, 100, 0.05)
Note: 1-under the null hypothesis, most (<95%) of the data will below 32. In other word, under the null hypothesis H0, the chance of occuring a number greater than 31.64 is only 5%. 2-for a sample with mean of 32, it is highly unlikely that this sample belongs the null hypothesis. and it is highly likely that the sample is a new distribution. therefore, the null hypothesis is rejected. 3-comparing N~(30, 1) with N~(32, 1), there are two times of standard error of the mean from the null hypotheis (32-30)/(10/squareroot (100)).
Consider the RDI, if the sample size is 16 instead of 100. test statistic = (xbar - 30)/se, follows a T distribution with 15 df under H0.
qt(.95, 15)
## [1] 1.75305
teststatistic <- (32-30)*16^.5/10
teststatistic
## [1] 0.8
Note: under 16 samples, the mean of 32 suggests only 0.8 times of the standard deviation. for T test, we need 1.75 times of the standard deviation. therefore, we can’t reject the nulll hypothesis, the sample is N ~ (30, 1).
H0: mu = 30, sd = 10; Ha: mu is not equal t0 30, and sd = 10. we need to split the 5% into the two side of the T distribution.
qt(.975, 15)
## [1] 2.13145
# we are going to reject the null hypothesis if the test statistic is LARGER than this, or;
qt(.025, 15)
## [1] -2.13145
# we are going to reject the null hypothesis if the test statistic is SMALLER than this.
Note: if you failed with the one side test, you will also fail the two sided test, because we split the alpha into two side, and because of the smaller alpha, we will move further out into the tail.
code: we can pass data to R functions like t.test. if the sons’ height is equal to fathers’ height.
library(UsingR)
data("father.son")
t.test(father.son$sheight - father.son$fheight)
##
## One Sample t-test
##
## data: father.son$sheight - father.son$fheight
## t = 11.789, df = 1077, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.8310296 1.1629160
## sample estimates:
## mean of x
## 0.9969728
Consider testing H0 : mu = mu0 versus Ha: mu not equal to mu0; Take the set of all possible values for which you failed to reject H0, this set is a (1-alpha) confidence interval for mu. if a (1-alpha) contains mu0, then we failed to reject H0.
library(datasets); data(ChickWeight); library(reshape2)
##define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,gain = time21 - time0)
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
wideCW14
## Diet Chick time0 time2 time4 time6 time8 time10 time12 time14 time16
## 1 1 18 39 35 NA NA NA NA NA NA NA
## 2 1 16 41 45 49 51 57 51 54 NA NA
## 3 1 15 41 49 56 64 68 68 67 68 NA
## 4 1 13 41 48 53 60 65 67 71 70 71
## 5 1 9 42 51 59 68 85 96 90 92 93
## 6 1 20 41 47 54 58 65 73 77 89 98
## 7 1 10 41 44 52 63 74 81 89 96 101
## 8 1 8 42 50 61 71 84 93 110 116 126
## 9 1 17 42 51 61 72 83 89 98 103 113
## 10 1 19 43 48 55 62 65 71 82 88 106
## 11 1 4 42 49 56 67 74 87 102 108 136
## 12 1 6 41 49 59 74 97 124 141 148 155
## 13 1 11 43 51 63 84 112 139 168 177 182
## 14 1 3 43 39 55 67 84 99 115 138 163
## 15 1 1 42 51 59 64 76 93 106 125 149
## 16 1 12 41 49 56 62 72 88 119 135 162
## 17 1 2 40 49 58 72 84 103 122 138 162
## 18 1 5 41 42 48 60 79 106 141 164 197
## 19 1 14 41 49 62 79 101 128 164 192 227
## 20 1 7 41 49 57 71 89 112 146 174 218
## 41 4 44 42 51 65 86 103 118 127 138 145
## 42 4 45 41 50 61 78 98 117 135 141 147
## 43 4 43 42 55 69 96 131 157 184 188 197
## 44 4 41 42 51 66 85 103 124 155 153 175
## 45 4 47 41 53 66 79 100 123 148 157 168
## 46 4 49 40 53 64 85 108 128 152 166 184
## 47 4 46 40 52 62 82 101 120 144 156 173
## 48 4 50 41 54 67 84 105 122 155 175 205
## 49 4 42 42 49 63 84 103 126 160 174 204
## 50 4 48 39 50 62 80 104 125 154 170 222
## time18 time20 time21 gain
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 81 91 96 55
## 5 100 100 98 56
## 6 107 115 117 76
## 7 112 120 124 83
## 8 134 125 NA NA
## 9 123 133 142 100
## 10 120 144 157 114
## 11 154 160 157 115
## 12 160 160 157 116
## 13 184 181 175 132
## 14 187 198 202 159
## 15 171 199 205 163
## 16 185 195 205 164
## 17 187 209 215 175
## 18 199 220 223 182
## 19 248 259 266 225
## 20 250 288 305 264
## 41 146 NA NA NA
## 42 174 197 196 155
## 43 198 199 200 158
## 44 184 199 204 162
## 45 185 210 205 164
## 46 203 233 237 197
## 47 210 231 238 198
## 48 234 264 264 223
## 49 234 269 281 239
## 50 261 303 322 283
t.test(gain ~ Diet, paired = FALSE,
var.equal = TRUE, data = wideCW14)
##
## Two Sample t-test
##
## data: gain by Diet
## t = -2.7252, df = 23, p-value = 0.01207
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -108.14679 -14.81154
## sample estimates:
## mean in group 1 mean in group 4
## 136.1875 197.6667
Code: hypothesis testing that’s not a normal or a t. A family of 8 children, 7 of which are girls. You are a little bit skeptical that the gender rate is not 1:1. how much evidence is this, that the genders are iid coin, fair coin flips between boys and girls for this couple. In other words, what is the relevant rejection region so that the probabililty of rejecting is less than 5%.
H0: p = 0.5; Ha: p > 0.5
A hypothesis test what would be the number of girls that the couple could have in order for the probability of having that many or more the 5% under the, under the null hypothesis of a fair coin?
Well, if we set up a rejection region, so in other words we’re going to reject the null hypothesis if the couple had anywhere between zero and eight girls, well of course we’d always reject.
P value is the most common measure of statistical significance. Comments in favor of and against P values: http://warnercnr.colostate.edu/~anderson/thompson1.html Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman The hilariously titled: The Earth is Round (p < .05) by Cohen.
Ideas on P value: Start under the null hypothesis assuming nothing happened. how extreme, how unusual of the obtaining data under the null hypothesis.
The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more txtreme than that obatained. Then H0 is true and we have observed an rare event, or H0 is false.
pt(2.5, 15, lower.tail = FALSE)
## [1] 0.0122529
Note: when T statistic is 2.5, the actually obtained under the null hypothesis is one percent. So either the null hypothesis is true ans we have seen something “alien” or the null hypothesis is false.
Code: a family of 7 daugnther out of total 8 children.
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
## [1] 0.03515625
# in other way.
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03515625
Note: Under null hypothesis where p is 0.05, the probility of have 7 or 8 girls is 0.035. So either the null hypothesis is wrong or we have face some extreme in null hypothesis. we may think that in this family, the p to have daughter is larger than 0.5.
Code: family of 7 daughter out of total 8 children, p is different.
prob <- c(0.2, 0.4, 0.6, 0.8)
prob
## [1] 0.2 0.4 0.6 0.8
p <- pbinom(6, size = 8, prob = prob, lower.tail = FALSE)
p
## [1] 0.00008448 0.00851968 0.10637568 0.50331648
Note: 1) if the H0: p = 0.2, then the probability of having 7 daughter out of 8 children is 0.00008448. It’s so rare to have 7 daughters, either the p = 0.2 is not valid or we are facing some extreme scenario. 2) if the H0: p = 0.8, the probability of having 7 daughters out of 8 children is 0.5. Probability p =0.8 is right. and we can’t reject p = 0.8 because 50% of chance that we can have 7 girls.
Code: Poisson example suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate = 0.1) during the last monitoring period. assume that an infection rate of 0.05 is an important benchmark. H0: lamba = 0.05; lamda100 = 5 Ha: lamda > 0.05.
what’s the probability of obtaining 10 or more infecion.
ppois(9, 5, lower.tail = FALSE)
## [1] 0.03182806
Note: the probability of having more than 10 infections per 100 patient/days is 0.03. This is below the assumed 0.05 incidence rate and is kind of “extreme”. therefore, if in 100 patients we have 10 or more infection, measures should be taken.
Power is the prob of rejecting the null when it is false. Power is more important when you failed to reject the null hypothesis– when we get a null result, we need to consider the power.you want to design a study so that you have a reasonable chance of detecting the alternative hypothesis.
Type II error (beta): false negative error. you missed the chance of detecting actually positive result. power = 1 - beta. power is a function that depends on the mean under the null hypothesis.
mu0 <- 30; mua <- 32; sigma <- 4; n <- 16
alpha = 0.05
z <- qnorm(1 - alpha)
beta <- pnorm(mu0 + z * sigma/sqrt(n),
mean = mua, sd = sigma/sqrt(n),
lower.tail = FALSE)
beta
## [1] 0.63876
there is a 64% probability of detecting a mean as large as 32 or larger if we conduct this experiment.
Code: plotting the power curve.
library(ggplot2)
nseq = c(8, 16, 32, 64, 128)
mua = seq(30, 35, by = 0.1)
z = qnorm(.95)
power = sapply(nseq, function(n)
pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n),
lower.tail = FALSE)
)
colnames(power) <- paste("n", nseq, sep = "")
d <- data.frame(mua, power)
library(reshape2)
d2 <- melt(d, id.vars = "mua")
names(d2) <- c("mua", "n", "power")
g <- ggplot(d2,
aes(x = mua, y = power, col = n)) + geom_line(size = 2)
g
code: simulation of normal alpha, beta, power. Normal simulation-alpha-beta-power.R
Effective size: (mua - mu0)/sigma
power.t.test(n = 16, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329
power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329
power.t.test(n = 16, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329
Note: power only depends on the effect size.
power.t.test(power = 0.8 , delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$n
## [1] 26.13751
Note: we need an effective size as large as 0.5. so we need at least 27 to get the power of 0.8.
power.t.test(n = 16, power = 0.8, sd = 4, type = "one.sample", alt = "one.sided")$delta
## [1] 2.606797
Note: use power.t.test as the first attach on power calculation. power has a lot of knobs and dials that you can turn. and it’s very easy to get tripped up on thinking that you have better power than you have or thinking that you need a smaller sample size.
when you perform more than one hypothesis test you have to do some sort of correction to make sure that you’re not fooling yourself. This lecture is a little bit about how to do these corrections.
Why is hypothesis test/significance analysis is commonly overused??? multiple P values for the same data set.
Corrections to avoid positives or discoveries. Two components of multiple testing corrections: -Error measure; -Correction.
Problem: when we are doing 10000 tests with alpha = 0.05, for example, some high dimensional data, but because 10000 X 0.05 = 500, we may encounter 500 false positive result. So comparing with simple test, we need to control the error rate in a different way.
set.seed(1010093)
pValues <- rep(NA,1000) # an empty vector, length of 1000.
for(i in 1:1000){
y <- rnorm(20)
x <- rnorm(20)
# two normal random variable with no relationship.
pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
Controlling falimy-wise error rate (FWER) m tests; FWER at level alpha so P(V >= 1) < alpha; Calculate normal P value alpha-FWER = alpha/m call all P values less than alpha-FWER significant.
Code: example of Bonferroni correction.
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
y <- rnorm(20)
x <- rnorm(20)
pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
sum(p.adjust(pValues, method = "bonferroni") < 0.05)
## [1] 0
Pros: easy and handy, conservative; Cons: very conservative.
Basic idea: -m test -control FDR at level alpha so E[V/R] -Calculate normal P; -Order P values from smallest to largest: P(1),…, P(m) -call any P(i) <= alpha X (i/m)
Most popular correction for signal-processing disciplines. genomics, imaging, astronomy…
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
y <- rnorm(20)
x <- rnorm(20)
pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
sum(p.adjust(pValues, method = "BH") < 0.05)
## [1] 0
# Benjamini-Hochberg correction.
Pros: easy, less conservative; Cons: Allows for more false positives.
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
x <- rnorm(20)
# First 500 beta=0, last 500 beta=2
if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
## trueStatus
## not zero zero
## FALSE 0 476
## TRUE 500 24
Code:
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 23 500
## TRUE 477 0
# Controls FDR
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
## trueStatus
## not zero zero
## FALSE 0 487
## TRUE 500 13
par(mfrow=c(1,2))
plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
plot(pValues,p.adjust(pValues,method="BH"),pch=19)
Summary: Multiple testing is an entire subfield A basic Bonferroni/BH correction is usually enough If there is strong dependence between tests there may be problems Consider method=“BY”
Introduction to multiple testing https://ies.ed.gov/ncee/pubs/20084018/app_b.asp
Pull oneself up from their own bootstrap. doing something that’s physically impossible. We need many mathematics to perform inference. Bootstrapping liberate data analysis…e.g., how to get the confidence interval, you need complicated asymptotics Draw a direct line between informations we are using and the inferences that we are making. Bootstrapping may be dangerous but you don’t need complex mathematics. Bootstrapping: very useful to do statistical inferences where it would otherwise be difficult.
Very much in the spirit of data science: understand the mean of six die roll: we can do the algebra calculation; we can roll the die 50 times, and repeat this over and over.
imaging you don’t know if the die is fair. We play 50 die rolls and get a EMPIRICAL distribution, which is not the population distribution because the randomness. We can draw the sample and throw it into a bag, and we draw a sample from the bag repeatly. The fundamental idea is to exactly replicate our simulation experiment–use observed data to construct an estimated population distribution, we simulate that estimated population distribution and figure out the distribution,
Code:
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:Hmisc':
##
## combine
nosim <- 1000
cfunc <- function(x, n) mean(x)
g1 = ggplot(data.frame(y = rep(1/6, 6), x = 1 : 6), aes(y = y, x = x))
g1 = g1 + geom_bar(stat = "identity", fill = "lightblue", colour = "black")
dat <- data.frame(x = apply(matrix(sample(1 : 6, nosim * 50, replace = TRUE),
nosim), 1, mean))
g2 <- ggplot(dat, aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..))
grid.arrange(g1, g2, ncol = 2)
n = 50
B = 1000
## our data
x = sample(1 : 6, n, replace = TRUE)
x
## [1] 6 3 3 6 2 4 6 6 2 5 3 1 6 1 3 6 5 1 6 6 6 2 1 2 6 4 5 1 2 3 6 6 4 4 6
## [36] 6 5 3 2 1 2 4 3 2 1 4 1 6 6 4
## bootstrap resamples
resamples = matrix(sample(x,
n * B,
replace = TRUE),
B, n)
table(x)
## x
## 1 2 3 4 5 6
## 8 8 7 7 4 16
prop.table(table(x))
## x
## 1 2 3 4 5 6
## 0.16 0.16 0.14 0.14 0.08 0.32
resampledMeans = apply(resamples, 1, mean)
g1 <- ggplot(as.data.frame(prop.table(table(x))), aes(x = x, y = Freq)) + geom_bar(colour = "black", fill = "lightblue", stat = "identity")
g2 <- ggplot(data.frame(x = resampledMeans), aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..))
grid.arrange(g1, g2, ncol = 2)
Code: father and sun height.
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
resampledMedians <- apply(resamples, 1, median)
g = ggplot(data.frame(x = resampledMedians), aes(x = x))
g = g + geom_density(size = 2, fill = "red")
#g = g + geom_histogram(alpha = .20, binwidth=.3, colour = "black", fill = "blue", aes(y = ..density..))
g = g + geom_vline(xintercept = median(x), size = 2)
g
The Bootstrap principle: 1. Suppose a statistic that estimates some population parameter, but the sample distribution is unknown; 2. Use the distribution defined by the observed data, to approximate its sampling distribution.
Nonparametric bootstrap algorithm example: 1. sample n observations with replacement from the observed data resulting in one simulated complete dataset; 2. Sample with replacement; 3. take the statistics; 4. repeat the process many times. 5. the statistic get from this process is a approximation fo the population distribution; 6. Calculate the standard deviation to estimate the standard error of the median; 7. take 2.5th and 97.5th percentiles as confidence interval for the median. Bootstrap confidence interval.
library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
resampledMedians <- apply(resamples, 1, median)
sd(resampledMedians);
## [1] 0.08262171
quantile(resampledMedians, c(0.025, 0.975))
## 2.5% 97.5%
## 68.44665 68.81461
g = ggplot(data.frame(x = resampledMedians), aes(x = x))
g = g + geom_density(size = 0.5)
g = g + geom_histogram(alpha = .40, binwidth=.05, colour = "grey", fill = "orange", aes(y = ..density..))
g = g + geom_vline(xintercept = median(x), size = 1)
g
Further: “An Introduction to the Bootstrap”
Permutation is used for group comparisons.
Code:
data(InsectSprays)
g = ggplot(InsectSprays, aes(spray, count, fill = spray))
g = g + geom_boxplot()
g
Permutation tests: consider the null hypothesis that the distribution of the observations from each group is the same; the group labels are irrelevant; Consider a dataframe with two column: count and spray; Permute the spray labels recalculate the statistic -Mean difference in counts; -Geometric means; -T statistic; -… Calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed.
Code:
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
# to calculate the mean of two group B and C, notice here group is a variable.
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
# To calculate the difference of mean between group B and C.
observedStat <- testStat(y, group)
#Permutation: change the group label of subdata randomly for 10000 times, i.e., if the group if irrevelant to the y.
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
mean(permutations > observedStat)
## [1] 0
g = ggplot(data.frame(permutations = permutations),
aes(permutations))
g = g + geom_histogram(aes(stat = ..density..), fill = "steelblue", color = "grey", binwidth = 1)
## Warning: Ignoring unknown aesthetics: stat
g = g + geom_vline(xintercept = observedStat, size = 1, colour = "orange")
g
http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#1 http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#1
http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#1
http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#1