title: “Statistical inference”
author: “lianrui”
date: “2017/8/23”
output:
pdf_document: default

Course Content

1. Introduction

what is statistical inference

We want to get some “fact” about the total population. The challenge here is that the data is noisy. Therefore we have to take the uncertainty into account.

Key challenges through the data generation process:

  1. Sampling bias;
  2. Measurement relevance;
  3. Unseen variables;
  4. Measurement errors;
  5. Measure vabiability;
  6. Interpretation;
  7. Communication.

Statistical inference objectives:

https://en.wikipedia.org/wiki/Statistical_inference 1. Point estimate; 2. Interval estimate; 3. Hypothesis testing; 4. Clustering and classification; 5. Credible interval (further: https://en.wikipedia.org/wiki/Credible_interval).

assumptions and models

parametric Nonparametric statistics are statistics not based on parameterized families of probability distributions. They include both descriptive and inferential statistics. The typical parameters are the mean, variance, etc.

non-parametric Unlike parametric statistics, nonparametric statistics make no assumptions about the probability distributions of the variables being assessed. The difference between parametric models and non-parametric models is that the former has a fixed number of parameters, while the latter grows the number of parameters with the amount of training data.[1] Note that the non-parametric model does, counterintuitively, contain parameters: the distinction is that parameters are determined by the training data in the case of non-parametric statistics, not the model.Bootstraping and permutation are non-parametric.

semi-parametric

three paradigms

  1. Frequency;
  2. Baysian;
  3. AIC.

2. Probability

probability

Probability: a number between 0 and 1 to give a sense of the chance of a particular event.

Why is probability the fundation of statistic inference. Statistical inference is the process of deducing properties of an underlying probability distribution by analysis of data. a probability model connect data to population with assumptions.

probability calculus

Probability mass function (PMF)

For a fare coin flipping experiment, X = 0 represent tails; X = 1 represent heads p(x) = (1/2)^x * (1/2)^(1-x)

If the coin is not fair, the chance of tail is A, and the chance of head is 1-A p(x) = A^x * (1-A)^(1-x)

Gives the probability that a discrete random variable is exactly equal to some value. 1. >= 0 2. sum of total p = 1

Probability density function (PDF)

PDF associated with a continuous variable. area under the function curve correspondent to the probability of that variable. areas of 1. >= 0 2. total area under the function curve equals to 1.

Code: demo of a continuous distrition (beta distribution). the area under the curve is 1.

x <- c (-.5, 0, 1, 1, 1.5)
y <- c (0, 0, 2, 0, 0)
d <- data.frame(x = x, y = y)
library(ggplot2)
c <- ggplot(data = d, aes(x = x, y = y)) + geom_line(colour = "steelblue")
c

Code: Fore beta(2, 1), What is the probability that 75% or fewer have been addressed?

pbeta(.75, 2, 1)
## [1] 0.5625

Code: culmulative distribution function F(x) = P(X <= x)

pbeta(c(0.4, 0.5, 0.6), 2, 1)
## [1] 0.16 0.25 0.36

survival function

S(x) = P (X > x) S(x) = 1- F(x)

quantile and percentile.

3. Conditional probability

https://xkcd.com/795/

understanding conditional prabability with Venn diagram. VennDiagram package as a tool.

Sensitivity and specificity in diagnostic testing.

  1. Under condition a subject has or do not have the disease (sensitivity and specificity) Sensitivity : The probability of positive testing result given that the subject has the disease. Among all the subjects have the disease, the proportion of patients who get the positive test result; P(+|D) Specificity: The probability of negative testing result given that the subject does NOT actually have the disease. Among all the subjects who don’t have the disease, the proportion of patients who get the negative test result. P(-|Not D)

  2. Under condition of positive or negative testing result. Positive predictive value: The probability of having the disease given that the testing result is positive. Among all the subjects who are of positive testing result, the proportion actually have the disease. P(D|+) Negative predictive value: The probability of NOT having the disease given that the test result is negative. Among all the subjects who are of negative testing result, the proportion actually NOT having the disease. P(Not D | -)

  3. Diagnostic likelyhood ratio. Diagnostic likelyhood ratio of positive test. sensitivity/(1-specificity) The ratio of test positive patients to test negative patients How effective the test help to screen the patients Diagnostic likelyhood ratio of negative test. (1-sensitivity)/Specificity The extend a test will misdiagnosis.

see: conditional probability and diagnosis.xlsx https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2636062/

independence:

case:Flawed Statistics in Murder Trial May Cost Expert His Medical License http://science.sciencemag.org/content/309/5734/news-summaries a mother of two sudden infant death (SID) was convicted murder because a renowned piediatrician testified that “nature occuring” of two SIDs in a row is very very rare. the doctor simply squared the probability of the occurance of one SID in a family; But the thing is that if a family has one SID, the subseuqent SID risk is higher because of genetic factors that are playing a role.

iid sample: independent and identically distribution.

4. Expected values.

sample expected values will estimate some of the population characteristics. so -How the population is centered: mean.
-How the population is spread out: variance.

Population mean and sample mean.

The difference between sample mean and population mean: For a discrete random variable with probability mass function (PMF) of p(x), The expected value or mean is the geometric average: E[x] = Sigma (x * p(x)) Geometrically, E[x] is the center mass of {x, p(x)}.

Sample mean is a empirical estimation of the population mean. 1. the population mean is the center of mass of population; 2. the sample mean is the center of mass of the sample (observed data); 3. the sample mean ESTIMATE the population mean ; 4. The sample mean of a random varialbe, is a random variable too; 4. the sample mean is unbiased: the population mean of its distribution is the mean that it is trying to estimate. 5. the more data that goes into the sample mean, the more concentrated its’ density/mass function is around the population mean. Code: The histogram of parent child height.

library(UsingR); 
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
data(galton)
library(reshape2)
par(mfrow=c(1,2))
hist(galton$child,col="blue",breaks=100)
hist(galton$parent,col="blue",breaks=100)

head(galton)
##   child parent
## 1  61.7   70.5
## 2  61.7   68.5
## 3  61.7   65.5
## 4  61.7   64.5
## 5  61.7   64.0
## 6  62.2   67.5
dim(galton)
## [1] 928   2
str(galton)
## 'data.frame':    928 obs. of  2 variables:
##  $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
##  $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
longGalton <- melt(galton, measure.vars = c("child", "parent"))
g <- ggplot(longGalton, aes(x = value)) + geom_histogram(aes(y = ..density..,  fill = variable), binwidth=1, color = "red") + geom_density(size = 0.5)
g <- g + facet_grid(. ~ variable)
g

Code: manipulate the child hight sample with different empirical mean. mse(mean squared error) is sort of a measure of imbalance, how teetering or tottering this histogram is.

myHist <- function(mu){
        g <- ggplot(galton, aes(x = child))
        g <- g + geom_histogram(fill = "steelblue", binwidth = 0.5, 
                                aes(y = ..density..))
        g <- g + geom_density(colour = "red", size = 0.5)
        g <- g + geom_vline(xintercept = mu, size = 0.5, colour = "orange")
        mse <- round(mean(galton$child - mu)^2, 3)
        g <- g + labs(title = paste("mu = ", mu, " MSE = ", mse))
        g
}

Note: expected mean is the center of mass of the sample distribution. When assumed mean approximate the expected mean, mse is going to 0.

Code: simulation of the expected value of a binominal experiment of n times with the probability of p.

library(ggplot2)
library(plyr)
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
binominalPlot <- function(n,p){
        x <- dbinom(c(0:n),size=n,prob=p)
        dat <- data.frame(N = c(0:n), P = x)
        gd <- ggplot(data = dat, aes(x = N, y = P))
        g <- gd + geom_bar(aes(fill = P), stat = "identity")
        g
}
binominalPlot(8, 0.8)

Code: Simulating: comparing the center of the distribution of averages and the center of that population. They will be centered at the same place.

library(ggplot2)
nosim <- 10000; n <- 100
dat <- data.frame(x = c(rnorm(nosim), apply(matrix(data = rnorm(nosim * n), nrow = nosim), 1, mean)),
                  # the first 10000 of x is random number from rnorm; 
                  # the second 10000 of x is the mean of 10 rnorm
                  what = factor(rep(c("Obs", "Mean"), c(nosim, nosim)))
                  # repeat "Obs" and "Mean", each = nosim
                  )
g <- ggplot(dat, aes(x = x, fill = what)) + geom_density(size = 0.5, alpha = .2)
g <- g + ggtitle("UNBIASED estimation of the means of normal")
g

#matrix: create a matrix from data, default is by column. 
# apply: applying a function to margins of an array or matrix. by row = 1, by column = 2

Note:
1. Expected values are properties of distributions; 2. The average of random variables is itself a random variable 3. And its associated distribution has an expected value; 4. The center of the distribution means is the same as that of the original distribution; 5. Therefore, the expected value of the sample mean is the population mean that it’s trying to estimate 6. Unbiased: When the expected value of an estimator is what its trying to estimate, we say that the estimator is unbiased

ibid. Code: simulation of N number of dice roll and the distribution of the mean of these N number of dice rolls.

nosim = 10000
dat <- data.frame(x = c(sample(1 : 6, nosim, replace = TRUE),
                        apply(matrix(data = sample(1 : 6, nosim * 2, replace = TRUE), 
                                     nrow = nosim), 1, mean),
                        apply(matrix(data = sample(1 : 6, nosim * 3, replace = TRUE), 
                                     nrow = nosim), 1, mean),
                        apply(matrix(data = sample(1 : 6, nosim * 4, replace = TRUE), 
                                     nrow = nosim), 1, mean)),
                  size = factor(rep(1:4, rep(nosim, 4))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = 0.5, size = 0.2, colour = "steelblue") 
g 

g <- g + facet_grid(.~size)
g <- g + ggtitle("simulation of N number of dice roll and the distribution of the mean of these N number of dice rolls.")
g

ibid. Code: simulation of n number of standard normal and the distribution and mean of these n simulation.

dat <- data.frame(
  x = c(sample(0 : 1, nosim, replace = TRUE),
        apply(matrix(sample(0 : 1, nosim * 10, replace = TRUE), 
                     nosim), 1, mean),
        apply(matrix(sample(0 : 1, nosim * 20, replace = TRUE), 
                     nosim), 1, mean),
        apply(matrix(sample(0 : 1, nosim * 30, replace = TRUE), 
                     nosim), 1, mean)
        ),
  size = factor(rep(c(1, 10, 20, 30), rep(nosim, 4))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = 1 / 12, colour = "steelblue", size = 0.2); 
g <- g + facet_grid(. ~ size)
g

5. Variability.

An important characterization of a population is how spread out it is. One of the key measures of spread is variability. We estimate population variability with the sample variance, or the square root, called the standard deviation with the same units;

Variability has many important uses in statistics. First, the population variance is itself an intrinsically interesting quantity that we want to estimate. Secondly, variability in our estimates is what makes them not imprecise. An important aspect of statistics is quantifying the variability in our estimates.

comparing the different notations.

Population: mean = mu, variance = sigma square, sd = sigma; Sample: mean = Xbar, variance = S = sigma square/(n-1), sd = sigma/square root (n-1) means os sample means: mean = mu, variance = S/n, sd = S/square root (n)

This is very useful since we don’t need to repeat to calculate the means of sample mean and get the distributio of it. instead, if we know the sample mean and sample variance, (xbar, and variance), we can deduce the distribution of the means of sample means.

Code: Distributions with increasing variance. https://github.com/bcaffo/courses/blob/master/06_StatisticalInference/05_Variance/index.Rmd#distributions-with-increasing-variance

library(ggplot2)
xvals <- seq(-10, 10, by = .01)
dat <- data.frame(
    y = c(
        dnorm(xvals, mean = 0, sd = 1),
        dnorm(xvals, mean = 0, sd = 2),
        dnorm(xvals, mean = 0, sd = 3),
        dnorm(xvals, mean = 0, sd = 4)
    ),
    x = rep(xvals, 4),
    factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
g <- ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)    
g <- g + ggtitle("data with big variance are more spread out")
g

Code: with more data, the sample variance concentrated around population variance (unbiased). when you take small sample size, usually the extreme value of the sample will not be included. therefore, you underestimate the range (or variance) of this sample. Therefore, we adjust the degree of freedom to N-1 to make the estimation more unbiased. Below, when the sample size is small, the sample variance is below the population variance.

library(ggplot2)
nosim <- 10000; 
dat <- data.frame(
    x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
          apply(matrix(rnorm(nosim * 50), nosim), 1, var),
          apply(matrix(rnorm(nosim * 300), nosim), 1, var)),
    n = factor(rep(c("10", "50", "300"), c(nosim, nosim, nosim))) 
    )
g <- ggplot(dat, aes(x = x, fill = n)) + 
        geom_density(size = 0.2, alpha = .2) + 
        geom_vline(xintercept = 1, size = .4, colour = "orange")
g <- g + ggtitle("with more data, the sample variance concentrated around population variance (unbiased)")
g

ibid. Code: with more data, the sample variance concentrated around population variance (unbiased). for dice rolling experiments.

dat <- data.frame(
  x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE), 
                     nosim), 1, var),
        apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE), 
                     nosim), 1, var),
        apply(matrix(sample(1 : 6, nosim * 30, replace = TRUE), 
                     nosim), 1, var)
        ),
  size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.3, colour = "steelblue") 
g <- g + geom_vline(xintercept = 2.92, size = .5, colour = "orange")
g <- g + facet_grid(. ~ size)
g

Code: comparing standard deviation of population of the standard deviation of sample means.

nosim <- 1000
n <- 10
a <- sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
b <- 1 / sqrt(n)
a
## [1] 0.3101352
b
## [1] 0.3162278

ibid: standard uniform distribution.

nosim <- 1000
n <- 10
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
## [1] 0.09485237
1 / sqrt(12 * n)
## [1] 0.09128709

ibid. poisson distribution.

nosim <- 1000
n <- 10
sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean))
## [1] 0.6350883
2 / sqrt(n)
## [1] 0.6324555

ibid. binominal distribution.

nosim <- 1000
n <- 10
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE),
                nosim), 1, mean))
## [1] 0.1599504
1 / (2 * sqrt(n))
## [1] 0.1581139

Code: we draw samples from a population of which we don’t know the exact parameters, mean, variance… we characterize the sample. than we can estimate the population mean and population variance, at the same time, we can estimate the center and spread-out of N samples from this population. Data example: father/sun height.

library(UsingR); data(father.son); 
x <- father.son$sheight
n<-length(x)
g <- ggplot(data = father.son, aes(x = sheight)) 
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g

a <- round(c(population.var.est = var(x), mean.var.est = var(x) / n, population.sd.est = sd(x), mean.sd.est = sd(x) / sqrt(n)),2)
a
## population.var.est       mean.var.est  population.sd.est 
##               7.92               0.01               2.81 
##        mean.sd.est 
##               0.09

Summarize:

  1. sample variance estimate the population variance;
  2. the distribution of sample variance centered at what it’s estimating;
  3. it gets more concetrated when you are getting more data;
  4. the distribution of sample mean: variance = sigma square /n; sd = sigma/square root (n)

6. Distributions

binominal distribution (the Bernoulli distribution)

PMF of Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution mean = p, variance = p(1-p)

the binominal variable is the total number of heads by flip the coin. x1, …, xn be iid Bernoulli(p) X = Sigma (x) is a binominal random variable. binominal mass function: https://en.wikipedia.org/wiki/Binomial_distribution

Code: Bionominal example: a family of 8 children, 7 of them are girls. the probability of having 7 or more girls in that family.

choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^8
## [1] 0.03515625
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03515625

normal distribution

code: Gaussian distribution with mean mu and variance of sigma square: X ~ N(mu, sigma square) standard normal distribution: mu = 0, sigma = 1

library(colorspace)
library(RColorBrewer)
library(ggplot2)
library(scales)
x <- seq(-3, 3, length = 1000)
dat <- data.frame(x = x, y = dnorm(x))
dat$qt <- cut(dat$x, breaks = 6, labels = FALSE)
g <- ggplot(data = dat, 
            aes(x = x, y = y)) + geom_line(size = 0.5, colour = "steelblue", fill = "black")
## Warning: Ignoring unknown parameters: fill
g <- g + geom_vline(xintercept = -3 : 3, size = 0.1, colour = "black")
g <- g + geom_area(aes(x=x,y=y,group=qt,fill=qt),color="red")
g <- g + geom_text(aes(label = "2.5%", x = -2.5, y = 0.03), colour = "steelblue") +
         geom_text(aes(label = "2.5%", x =  2.5, y = 0.03), colour = "steelblue") +
         geom_text(aes(label = "13.5%", x = -1.5, y = 0.15), colour = "steelblue") +
         geom_text(aes(label = "13.5%", x = 1.5, y = 0.15), colour = "steelblue") +
         geom_text(aes(label = "34%", x = -.5, y = 0.35), colour = "steelblue") +
         geom_text(aes(label = "34%", x =  .5, y = 0.35), colour = "steelblue")
g <- g + geom_vline(xintercept = c(-2.33, -1.96, -1.645, -1.28, 1.28, 1.645, 1.96, 2.33), 
                    size = 0.1, colour = "red", lty = 4)
g <- g + scale_fill_gradient2(midpoint=median(unique(dat$qt)), guide="none", 
                              low = muted("red"), 
                              mid = "white",
                              high = muted("steelblue"))
g

convert non-normal to normal. Z ~ (mean-mu)/sigma Code: for a non-standardized normal distribution with mu = 2, and sigma = 2, what is the probability of x >5

?pnorm
pnorm(2, mean = 2, sd = 1, lower.tail = FALSE)
## [1] 0.5

Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50. mean = 1020, sigma = 50.

# What's the probability of getting more than 1,160 clicks in a day?
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
## [1] 0.00255513
pnorm((1160-1020)/50, lower.tail = FALSE)
## [1] 0.00255513
# What number of clicks will represent the one where 75% of daily click is lower than that. 
qnorm(0.75, mean = 1020, sd = 50, lower.tail = TRUE)
## [1] 1053.724

Code: Simulating the shape of normal with different mean and varance.

library(ggplot2)
mu0 <- 30
myplot <- function(sigma, mua, n, alpha){
        g = ggplot(data.frame(mu = c(27, 37)), aes(x = mu))
        g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0, sd = sigma/sqrt(n)),size = 2, col = "red")
        g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua, sd = sigma/sqrt(n)),size = 2, col = "blue")
        xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
        g = g + geom_vline(xintercept = xitc, size = 3)
        g
}
myplot(10, 30, 100, 0.05)

Poisson distribution.

At any rate, the Poisson distribution is used to model counts. Poisson mass function: https://en.wikipedia.org/wiki/Poisson_distribution X in non-negative integers. mean of Poisson distribution is lamda, variance of Poisson is also lamda instances that Poisson distribution is used: -Modeling count data; -Event-time; -Survival data; -Contengency tables; In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. -Approximating binominals when n is large and p is small.

Poisson random variables are used to model rates. X ~ Poisson(lamda * t) lamda is the expected count per unit time; t is the total monitoring time.

Code: the number of people that show up at a bus stop is Possion with a mean of 2.5 per hour. what is the probability that less than 3 people show up for 4 hour in this bus stop.

ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605

Code: Poisson approximation to binominal. X ~ binominal (n, p) lambda = n * p n gets large p gets small the probability distribution governing acts which is binomial can be well approximated by Poisson probabilities with this specific notation lambda as n times p. Code: Poisson approximation to binominal

p_binom <- pbinom(2, size = 500, prob = 0.01)
p_poisson <- ppois(2, lambda = 500 * 0.01)
p_binom;p_poisson
## [1] 0.1233858
## [1] 0.124652

7. Asymptotics

Asymptotics: the behavior of statistics when the sample size limits to infinity. In statistics, asymptotic theory, or large sample theory, is a generic framework for assessment of properties of estimators and statistical tests. Within this framework it is typically assumed that the sample size n grows indefinitely, and the properties of statistical procedures are evaluated in the limit as n → ∞. (wikipedia)

So, asymptotics are incredibly useful for simple statistical inference and approximations.

Asymptotics also form the basis for frequency interpretation of probabilities.

asymptotics and Law of large numbers.

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. the average limits to what its estimating, the population mean. The variance limits to the population variance, and the standard dievation limits to population deviation.

The probability of a head on a coin is 50% because we believe that if we were to flip it infinitely many times, we would get exactly 50% heads. this is asymptotics. Our very notion of probability depends on the idea of asymptotics. We can use asymptotics to help is figure out things about distributions without knowing much about them to begin with.

Code: Law of large numbers in normal distribution.

n <- 1000
a <- rnorm(n)
b <- cumsum(a)
c <- b/(1:n)
dat <- data.frame(a = c(1:1000), b, c)
head(dat)
##   a        b         c
## 1 1 1.396271 1.3962712
## 2 2 1.952760 0.9763799
## 3 3 3.327528 1.1091761
## 4 4 3.765534 0.9413834
## 5 5 3.799043 0.7598087
## 6 6 4.215188 0.7025313
library(ggplot2)
gd <- ggplot(data = dat, aes(x = a, y = c))
g <- gd + geom_line()
g

Code: the asymptotics of mean of coin flipping: Law of large numbers.

n = 1000
means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)
dat <- data.frame(x = c(1:1000), y = means)
gdat <- ggplot(data = dat, aes (x = x, y = y))
g <- gdat + geom_line()
g

Discussion: 1) Asymptotics has a “goal”, i.e., the estimator will converge to what you want to estimate. For example, the sample proportion from iid coin flips is consistent for the true success probability of a coin. the sample mean of iid samples is consistent for the population mean. the sample variance and standard dievation is also consistent. This is what the Law of Large Numbers says. In other words, if we get infinite amount of data, or large number of data, we can asymptotics to the answer.

asymptotics and center limit theorem.

Central Limit Theorem: the distribution of averages is often normal; even if the distribution of the data being sampled from is very non-normal; This helps us create robust strategies for creating statistical inferences when we’re not willing to assume much about the generating mechanism of our data.

(Estimate - Mean of estimate)/(Std.Err.estimate) ~ standard normal for large n. the sample average is approximately stardard normal with mean of mu and variance of sigma square/n

Code: simulate a standard normal random variable by rolling a dice of n times Xi is the outcome of dice i, mu = E[Xi] = 3.5; Var(Xi) = 2.92; sd = square root (2.92/n). roll n dice, take their mean, subtract off 3.5, and divide by 1.71/square root (n)

nosim <- 1000
# calculate the estimator: (Estimate - Mean of estimate)/(Std.Err.estimate)
cfunc <- function(x, n) sqrt(n) * (mean(x) - 3.5) / 1.71 

dat <- data.frame(
  x = c(apply(matrix(sample(1 : 6, nosim * 10, replace = TRUE), 
                     nosim), 1, cfunc, 10),
        # sample number 1 to 6 of total (nosim *10 = 10000) times, replace = true;  
        # create a matrix of 1000 X 10; 
        # for each row (argument = 1), apply cfunc with n = 10. 
        apply(matrix(sample(1 : 6, nosim * 20, replace = TRUE), 
                     nosim), 1, cfunc, 20),
        # sample number 1 to 6 of total (nosim *20 = 20000) times, replace = true;  
        # create a matrix of 1000 X 20; 
        # for each row (argument = 1), apply cfunc with n = 20. 
        apply(matrix(sample(1 : 6, nosim * 1000, replace = TRUE), 
                     nosim), 1, cfunc, 1000)
        # ibid. 
        ),
  size = factor(rep(c(10, 20, 1000), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .50, binwidth=.3, colour = "steelblue", aes(y = ..density..)) 
g <- g + stat_function(fun = dnorm, size = 0.5, colour = "orange")
g + facet_grid(. ~ size)

Code: simulate a standard normal random variable by coin flipping. Xi be the 0 or 1, of the ith flip of a possibly unfair coin, with the probability of p of head. E[Xi] = p; Var[Xi] = p(1-p) statistic = (phead - p)/square root(p(1-p)/n) Also, the speed at which the normalized coin flips converges to normality is governed by how biased the coin is. let p = 0.9. So just keep this in mind that the central limit theorem doesn’t guarantee that the normal distribution will be a good approximation. Simply that as the number of coin flips limits to infinity eventually, it will be a good approximation.

nosim <- 1000
cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5) 
dat <- data.frame(
  x = c(apply(matrix(sample(0:1, nosim * 10, replace = TRUE), 
                     nosim), 1, cfunc, 10),
        apply(matrix(sample(0:1, nosim * 20, replace = TRUE), 
                     nosim), 1, cfunc, 20),
        apply(matrix(sample(0:1, nosim * 30, replace = TRUE), 
                     nosim), 1, cfunc, 30)
        ),
  size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..)) 
g <- g + stat_function(fun = dnorm, size = 2)
g + facet_grid(. ~ size)

asymptotics and confidence interval.

code: Confidence interval of the son’s height.

library(UsingR);data(father.son); x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12
## [1] 5.709670 5.737674

code: confidence interval of the coin flip. Xi is 0 or 1 with success probabillity of p, variance = p(1-p); the confidence of interval: phead plus/minus normal quantile * square root(p(1-p)/n)

p <- seq(0, 1, length = 1000)
dat <- data.frame(p = p, v = p * (1 - p))
gd <- ggplot(data = dat, aes(x = p, y = v))
g <- gd + geom_line(size = 0.5, colour = "orange")
g <- g + ggtitle("when p = 0.5, binomial distribution has the largest variance")
g

So the 95% confidence interval is: phead plus/minus 1/square root (n).

code: confidence interval of election. Imagine you were running for political office and your campaign advisor told you that in a random sample of 100 likely voters, 56 intended to vote for you. Can you relax? Do you have this race in the bag? Is 0.56 out of 100 sampled enough evidence to conclude that you’ll likely get more than 50% of the vote? But you don’t have access to a computer or calculator.

# a quick calculation. 
0.56 + c(-1,1) * 1/sqrt(100)
## [1] 0.46 0.66
# formal calculation. 
.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * (1 - 0.56)/100)
## [1] 0.4627099 0.6572901
# using function binom.test
binom.test(56, 100)$conf.int
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95

Code: Confidence interval of n coin flips for nosim times. see the empirical and real probability and the role of CLT here. a simulation where I flip a coin with a given success probability over and over and over again. And calculate the percentage of times my walled interval covers the true coin probability that we used to generate the data. So I’m going to say, in each one of my simulations I’m going to do 20 coin flips.

n <- 100; 
pvals <- seq(.1, .9, by = .05); 
nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- rbinom(nosim, prob = p, size = n) / n
  # the observed probobility of "head" of each of 1000 (nosim) trials. In each trial, there are n = 30 coin flips, and the chance of "head" equals total number of heads over n = 30. 
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  # the lower level of 95% confidence interval of the observed (empirical) probability. 
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  # the upper level of 95% confidence interval of the observed (empirical) probability. 
  mean(ll < p & ul > p)
  # the number of observations covered by the lower-upper level (confidence interval. )
  # the logic vector could be treated as numeric, true = 1, false = 0. 
})
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 0.5, colour = "orange") + geom_hline(yintercept = 0.95, type = 3) + ylim(0.75, 1.0)
## Warning: Ignoring unknown parameters: type

When n is small, the Central Limit Theroem may not work well. it’s simply because the central limit theorem isn’t as accurate, isn’t as accurate as we need it to be for this specific value of N for coins with this specific true probability.

Code: Agresti/Coull interval instead of Wald interval when n is small. there are two ways to fix this: 1) increase the total number of observations (n) in each simulation. For example, n = 100; 2) the Agresti/Coull interval instead of the Wald interval.

n <- 100; 
pvals <- seq(.1, .9, by = .05); 
nosim <- 1000
coverage <- sapply(pvals, function(p){
  phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
  # the observed probobility of "head" of each of 1000 (nosim) trials. In each trial, there are n = 30 coin flips, and the chance of "head" equals total number of heads over n = 30. 
  ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
  # the lower level of 95% confidence interval of the observed (empirical) probability. 
  ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
  # the upper level of 95% confidence interval of the observed (empirical) probability. 
  mean(ll < p & ul > p)
  # the number of observations covered by the lower-upper level (confidence interval. )
  # the logic vector could be treated as numeric, true = 1, false = 0. 
})
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 0.5, colour = "orange") + geom_hline(yintercept = 0.95, type = 3) + ylim(0.75, 1.0)
## Warning: Ignoring unknown parameters: type

Code: confidence interval of Poisson distribution. case: a nuclear pump that failed 5 times out of 94.32 days, given 95% confidence interval for the failure rate per day. X ~ Poisson (lambda * t) Estimate lamda.head = X/t Var(lambda.head) = lambda/t

x <- 5
t <- 94.32
lambda <- x/t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda/t), 3)
## [1] 0.007 0.099
# use Poisson test in R
poisson.test(x, T = 94.32)$conf
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000
t <- 100000
coverage <- sapply(lambdavals, function(lambda){
  lhats <- rpois(nosim, lambda = lambda * t) / t
  ll <- lhats - qnorm(.975) * sqrt(lhats / t)
  ul <- lhats + qnorm(.975) * sqrt(lhats / t)
  mean(ll < lambda & ul > lambda)
})
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ylim(0, 1.0)

for very small values of lambda which we have a good sense of if we could have relatively few events for a large amount of monitoring time. For relatively small values of lambda we shouldn’t be using this asymptotic interval.

summary.

The LLN states that averages of iid samples converge to the population means that they are estimating The CLT states that averages are approximately normal, with distributions centered at the population mean with standard deviation equal to the standard error of the mean CLT gives no guarantee that \(n\) is large enough Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean Adding and subtracting 2 SEs works for 95% intervals Confidence intervals get wider as the coverage increases (why?) Confidence intervals get narrower with less variability or larger sample sizes The Poisson and binomial case have exact intervals that don’t require the CLT But a quick fix for small sample size binomial calculations is to add 2 successes and failures

Confidence Interval.

Confidence intervals are a convenient way to communicate that uncertainty in estimates. Using central limit theorem, we can create confidence interval: CI = est plus/minus Zquantile * SE(est) For small sample size, T confidence intervals is applied.

T confidence interval.

In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

T distribution: (Xbar-mu)/(estimated standard error) Gosset’s t distribution has only one parameter: degree of freedom (n-1)

The T quantile is from T distribution. T distributio has heavier tail, because the estimated stardard error is smaller that the standard dievation of population. And so T quantile is a little bit wider. As you collect more and more data, the T interval will become more and more approximate to Z interval. So T interval maybe the choice at the very beginning. If you use z quantile when sample size is small, you confidence interval may be too narrow.

Est plus/minus Tquantile * SE (est)

notes on t interval:

The t interval technically assumes that the data are iid normal, though it is robust to this assumption It works well whenever the distribution of the data is roughly symmetric and mound shaped Paired observations are often analyzed using the \(t\) interval by taking differences For large degrees of freedom, t quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded For skewed distributions, the spirit of the \(t\) interval assumptions are violated. Also, for skewed distributions, it doesn’t make a lot of sense to center the interval at the mean In this case, consider taking logs or using a different summary like the median For highly discrete data, like binary, other intervals are available

Code: comparing T distribution and normal distribution, with different degree of freedom.

# library(ggplot2); library(manipulate)
k <- 1000
xvals <- seq(-5, 5, length = k)
myplot <- function(df){
  d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),
                  x = xvals,
                  dist = factor(rep(c("Normal", "T"), c(k,k))))
  g <- ggplot(d, aes(x = x, y = y)) 
  g <- g + geom_line(size = 2, aes(colour = dist))
  g
}
# manipulate(myplot(mu), mu = slider(1, 20, step = 1))  

code: with the degree of freedom increase, t quantile approximate z quantile.

pvals <- seq(.5, .99, by = .01)
myplot2 <- function(df){
  d <- data.frame(n= qnorm(pvals),t=qt(pvals, df),
                  p = pvals)
  g <- ggplot(d, aes(x= n, y = t))
  g <- g + geom_abline(size = 2, col = "lightblue")
  g <- g + geom_line(size = 2, col = "black")
  g <- g + geom_vline(xintercept = qnorm(0.975))
  g <- g + geom_hline(yintercept = qt(0.975, df))
  g
}
#manipulate(myplot2(df), df = slider(1, 20, step = 1))

code: sleep data set raw data, show group differences. (paried t)

data(sleep)
dim(sleep)
## [1] 20  3
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
library(ggplot2)
gd <- ggplot(data = sleep, aes (x = group, y = extra, colour = ID,
                                fill = ID, group = factor(ID)))
g <- gd + geom_point(size = 10, pch = 21, alpha = 0.5)
g <- g + geom_line(size = 1, aes(colour = ID))
g

code: t test of sleep data. (paired)

g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
t.test(difference)
## 
##  One Sample t-test
## 
## data:  difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of x 
##      1.58
t.test(g2, g1, paired = TRUE)
## 
##  Paired t-test
## 
## data:  g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences 
##                    1.58
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
## 
##  Paired t-test
## 
## data:  extra by I(relevel(group, 2))
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences 
##                    1.58
rbind(
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n),
as.vector(t.test(difference)$conf.int),
as.vector(t.test(g2, g1, paired = TRUE)$conf.int),
as.vector(t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)$conf.int)
)
##           [,1]     [,2]
## [1,] 0.7001142 2.459886
## [2,] 0.7001142 2.459886
## [3,] 0.7001142 2.459886
## [4,] 0.7001142 2.459886

Code: t confidence interval of independent group. Example: two groups in a randomized trial. and the variance of the two group is the same because of randomization. Note, constant variance is an assumption for independent t testing.

The t confidence interval of two independent group: code: comparing SBP of 8 oral contraceptive users versus 21 controls. Xoc = 132.86 mmHg, soc = 15.34 mmHg Xc = 127.44 mmHg, sc = 18.23 mmHg

sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5
## [1] -9.521097 20.361097
# note the calculated interval contains zero, so we can't rule out that the two group are actually the same. 

code: sleep data unsing independent t test.

data(sleep)
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
n1 <- length(g1); n2 <- length(g2)
x1 <- sleep[which (sleep$group == "1"), ]
x2 <- sleep[which (sleep$group == "2"), ]
sp <- sqrt(((n1 - 1) * sd(x1$extra)^2 + (n2-1) * sd(x2$extra)^2) / (n1 + n2-2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1 / n1 + 1/n2)
rbind(
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd,  
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf,
t.test(g2, g1, paired = TRUE)$conf
)
##            [,1]     [,2]
## [1,] -0.2038740 3.363874
## [2,] -0.2038740 3.363874
## [3,]  0.7001142 2.459886

Note: 1) the unpaired t test attribute all the difference to inter-group difference. And because of the big intra-group variance, unpaired t test give a confidence interval containing 0. so we can’t rule out that the two group are different.

Code: chick weight.

library(datasets);data(ChickWeight)
library(reshape2)
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:Hmisc':
## 
##     combine, src, summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
wideCW <- mutate(wideCW,gain = time21 - time0)
head(wideCW)
##   Diet Chick time0 time2 time4 time6 time8 time10 time12 time14 time16
## 1    1    18    39    35    NA    NA    NA     NA     NA     NA     NA
## 2    1    16    41    45    49    51    57     51     54     NA     NA
## 3    1    15    41    49    56    64    68     68     67     68     NA
## 4    1    13    41    48    53    60    65     67     71     70     71
## 5    1     9    42    51    59    68    85     96     90     92     93
## 6    1    20    41    47    54    58    65     73     77     89     98
##   time18 time20 time21 gain
## 1     NA     NA     NA   NA
## 2     NA     NA     NA   NA
## 3     NA     NA     NA   NA
## 4     81     91     96   55
## 5    100    100     98   56
## 6    107    115    117   76
g <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet, group = Chick))
g <- g + geom_line()
g <- g + stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black")
g <- g + facet_grid(. ~ Diet)
g

Code: the violin plot of the variations of weight gain for each diet.

g <- ggplot(wideCW, aes(x = factor(Diet), y = gain, fill = factor(Diet)))
g <- g + geom_violin(col = "black", size = 2)
g
## Warning: Removed 5 rows containing non-finite values (stat_ydensity).

Note: It looks that group 1 and group 4 are of different variance. As such, the equal variance assumption of t test maybe violated.

Code: the independent t test with variance equal or unequal assumptions.

wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
rbind(
t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf,
t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf
)
##           [,1]      [,2]
## [1,] -108.1468 -14.81154
## [2,] -104.6590 -18.29932

Note: the confidence interval calucation with unequal variance.

For binomial data, there’s lots of ways to compare two groups Relative risk, risk difference, odds ratio. Chi-squared tests, normal approximations, exact tests. For count data, there’s also Chi-squared tests and exact tests. We’ll leave the discussions for comparing groups of data for binary and count data until covering glms in the regression class. In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases relevant to biostatistics.

Hypothesis testing.

Hypothesis testing: concepts.

Hypothesis testing is concerned with making decisions using data. - Null hypothesis: the status quo, H0 (H naught). H naught is the starting point which is assumed to be true. - New evidence is required to reject H0 and to establish an alternative hypothesis.

Case: respiratory disturbance index (RDI). severe disordered breathing (SDI) a sample of 100 patients, mean RDI = 32, sd = 10 - H0: mu = 30 - Ha: mu >30 the alternative hypothesis are <, > or not equal. Truth Decide Result H0 H0 correctly accept the H0 H0 Ha type I error: false positive Ha Ha correctly reject H0 and establish Ha Ha H0 type II error: false negative

Type I error rate and type II error rate is related. As type I error rate increase, type II error rate decrease.

Example of choosing a rejection region.

Considering the SDI case, if the sample mean is larger than some constant,

Code: the type I and II error rate is related. alpha = type I error rate = probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct.

a sample of 100 patients, mean RDI = 32, sd = 10, se = 10/squareroot (100) = 1 H0: X bar ~ N(30, 1)

library(ggplot2)
mu0 <- 30
myplot <- function(sigma, mua, n, alpha){
        g = ggplot(data.frame(mu = c(27, 37)), aes(x = mu))
        g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0, sd = sigma/sqrt(n)),size = 2, col = "red")
        g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua, sd = sigma/sqrt(n)),size = 2, col = "blue")
        xitc = mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
        g = g + geom_vline(xintercept = xitc, colour = "orange", size = 0.5)
        g = g + geom_text(aes(label = round(mu0 + qnorm(1 - alpha) * sigma / sqrt(n), 2)), x = mu0 + qnorm(1 - alpha) * sigma / sqrt(n), y = 0)
        g
}
myplot(10, 30, 100, 0.05)

Note: 1-under the null hypothesis, most (<95%) of the data will below 32. In other word, under the null hypothesis H0, the chance of occuring a number greater than 31.64 is only 5%. 2-for a sample with mean of 32, it is highly unlikely that this sample belongs the null hypothesis. and it is highly likely that the sample is a new distribution. therefore, the null hypothesis is rejected. 3-comparing N~(30, 1) with N~(32, 1), there are two times of standard error of the mean from the null hypotheis (32-30)/(10/squareroot (100)).

T test.

Consider the RDI, if the sample size is 16 instead of 100. test statistic = (xbar - 30)/se, follows a T distribution with 15 df under H0.

qt(.95, 15)
## [1] 1.75305
teststatistic <- (32-30)*16^.5/10
teststatistic
## [1] 0.8

Note: under 16 samples, the mean of 32 suggests only 0.8 times of the standard deviation. for T test, we need 1.75 times of the standard deviation. therefore, we can’t reject the nulll hypothesis, the sample is N ~ (30, 1).

two sided tests.

H0: mu = 30, sd = 10; Ha: mu is not equal t0 30, and sd = 10. we need to split the 5% into the two side of the T distribution.

qt(.975, 15)
## [1] 2.13145
# we are going to reject the null hypothesis if the test statistic is LARGER than this, or;
qt(.025, 15)
## [1] -2.13145
# we are going to reject the null hypothesis if the test statistic is SMALLER than this. 

Note: if you failed with the one side test, you will also fail the two sided test, because we split the alpha into two side, and because of the smaller alpha, we will move further out into the tail.

code: we can pass data to R functions like t.test. if the sons’ height is equal to fathers’ height.

library(UsingR)
data("father.son")
t.test(father.son$sheight - father.son$fheight)
## 
##  One Sample t-test
## 
## data:  father.son$sheight - father.son$fheight
## t = 11.789, df = 1077, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.8310296 1.1629160
## sample estimates:
## mean of x 
## 0.9969728

the confidence interval and hypothesis test.

Consider testing H0 : mu = mu0 versus Ha: mu not equal to mu0; Take the set of all possible values for which you failed to reject H0, this set is a (1-alpha) confidence interval for mu. if a (1-alpha) contains mu0, then we failed to reject H0.

library(datasets); data(ChickWeight); library(reshape2)
##define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,gain = time21 - time0)
wideCW14 <- subset(wideCW, Diet %in% c(1, 4))
wideCW14
##    Diet Chick time0 time2 time4 time6 time8 time10 time12 time14 time16
## 1     1    18    39    35    NA    NA    NA     NA     NA     NA     NA
## 2     1    16    41    45    49    51    57     51     54     NA     NA
## 3     1    15    41    49    56    64    68     68     67     68     NA
## 4     1    13    41    48    53    60    65     67     71     70     71
## 5     1     9    42    51    59    68    85     96     90     92     93
## 6     1    20    41    47    54    58    65     73     77     89     98
## 7     1    10    41    44    52    63    74     81     89     96    101
## 8     1     8    42    50    61    71    84     93    110    116    126
## 9     1    17    42    51    61    72    83     89     98    103    113
## 10    1    19    43    48    55    62    65     71     82     88    106
## 11    1     4    42    49    56    67    74     87    102    108    136
## 12    1     6    41    49    59    74    97    124    141    148    155
## 13    1    11    43    51    63    84   112    139    168    177    182
## 14    1     3    43    39    55    67    84     99    115    138    163
## 15    1     1    42    51    59    64    76     93    106    125    149
## 16    1    12    41    49    56    62    72     88    119    135    162
## 17    1     2    40    49    58    72    84    103    122    138    162
## 18    1     5    41    42    48    60    79    106    141    164    197
## 19    1    14    41    49    62    79   101    128    164    192    227
## 20    1     7    41    49    57    71    89    112    146    174    218
## 41    4    44    42    51    65    86   103    118    127    138    145
## 42    4    45    41    50    61    78    98    117    135    141    147
## 43    4    43    42    55    69    96   131    157    184    188    197
## 44    4    41    42    51    66    85   103    124    155    153    175
## 45    4    47    41    53    66    79   100    123    148    157    168
## 46    4    49    40    53    64    85   108    128    152    166    184
## 47    4    46    40    52    62    82   101    120    144    156    173
## 48    4    50    41    54    67    84   105    122    155    175    205
## 49    4    42    42    49    63    84   103    126    160    174    204
## 50    4    48    39    50    62    80   104    125    154    170    222
##    time18 time20 time21 gain
## 1      NA     NA     NA   NA
## 2      NA     NA     NA   NA
## 3      NA     NA     NA   NA
## 4      81     91     96   55
## 5     100    100     98   56
## 6     107    115    117   76
## 7     112    120    124   83
## 8     134    125     NA   NA
## 9     123    133    142  100
## 10    120    144    157  114
## 11    154    160    157  115
## 12    160    160    157  116
## 13    184    181    175  132
## 14    187    198    202  159
## 15    171    199    205  163
## 16    185    195    205  164
## 17    187    209    215  175
## 18    199    220    223  182
## 19    248    259    266  225
## 20    250    288    305  264
## 41    146     NA     NA   NA
## 42    174    197    196  155
## 43    198    199    200  158
## 44    184    199    204  162
## 45    185    210    205  164
## 46    203    233    237  197
## 47    210    231    238  198
## 48    234    264    264  223
## 49    234    269    281  239
## 50    261    303    322  283
t.test(gain ~ Diet, paired = FALSE, 
       var.equal = TRUE, data = wideCW14)
## 
##  Two Sample t-test
## 
## data:  gain by Diet
## t = -2.7252, df = 23, p-value = 0.01207
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -108.14679  -14.81154
## sample estimates:
## mean in group 1 mean in group 4 
##        136.1875        197.6667

Code: hypothesis testing that’s not a normal or a t. A family of 8 children, 7 of which are girls. You are a little bit skeptical that the gender rate is not 1:1. how much evidence is this, that the genders are iid coin, fair coin flips between boys and girls for this couple. In other words, what is the relevant rejection region so that the probabililty of rejecting is less than 5%.

H0: p = 0.5; Ha: p > 0.5

A hypothesis test what would be the number of girls that the couple could have in order for the probability of having that many or more the 5% under the, under the null hypothesis of a fair coin?

Well, if we set up a rejection region, so in other words we’re going to reject the null hypothesis if the couple had anywhere between zero and eight girls, well of course we’d always reject.

P value.

P value is the most common measure of statistical significance. Comments in favor of and against P values: http://warnercnr.colostate.edu/~anderson/thompson1.html Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman The hilariously titled: The Earth is Round (p < .05) by Cohen.

Ideas on P value: Start under the null hypothesis assuming nothing happened. how extreme, how unusual of the obtaining data under the null hypothesis.

  1. define the hypothetical distributio of a data summary(statistic) when “nothing is going on”–null hypothesis;
  2. calculate the summary/statistic with the data we have (test statistic);
  3. compare what we calculate with hypothetical distribution and see how extreme (p - value.) when P value is small, it suggest a test statistic is extreme in NULL hypothesis.

The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more txtreme than that obatained. Then H0 is true and we have observed an rare event, or H0 is false.

pt(2.5, 15, lower.tail = FALSE)
## [1] 0.0122529

Note: when T statistic is 2.5, the actually obtained under the null hypothesis is one percent. So either the null hypothesis is true ans we have seen something “alien” or the null hypothesis is false.

Code: a family of 7 daugnther out of total 8 children.

choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
## [1] 0.03515625
# in other way. 
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
## [1] 0.03515625

Note: Under null hypothesis where p is 0.05, the probility of have 7 or 8 girls is 0.035. So either the null hypothesis is wrong or we have face some extreme in null hypothesis. we may think that in this family, the p to have daughter is larger than 0.5.

Code: family of 7 daughter out of total 8 children, p is different.

prob <- c(0.2, 0.4, 0.6, 0.8)
prob
## [1] 0.2 0.4 0.6 0.8
p <- pbinom(6, size = 8, prob = prob, lower.tail = FALSE)
p
## [1] 0.00008448 0.00851968 0.10637568 0.50331648

Note: 1) if the H0: p = 0.2, then the probability of having 7 daughter out of 8 children is 0.00008448. It’s so rare to have 7 daughters, either the p = 0.2 is not valid or we are facing some extreme scenario. 2) if the H0: p = 0.8, the probability of having 7 daughters out of 8 children is 0.5. Probability p =0.8 is right. and we can’t reject p = 0.8 because 50% of chance that we can have 7 girls.

Code: Poisson example suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate = 0.1) during the last monitoring period. assume that an infection rate of 0.05 is an important benchmark. H0: lamba = 0.05; lamda100 = 5 Ha: lamda > 0.05.

what’s the probability of obtaining 10 or more infecion.

ppois(9, 5, lower.tail = FALSE)
## [1] 0.03182806

Note: the probability of having more than 10 infections per 100 patient/days is 0.03. This is below the assumed 0.05 incidence rate and is kind of “extreme”. therefore, if in 100 patients we have 10 or more infection, measures should be taken.

Power.

Power is the prob of rejecting the null when it is false. Power is more important when you failed to reject the null hypothesis– when we get a null result, we need to consider the power.you want to design a study so that you have a reasonable chance of detecting the alternative hypothesis.

Type II error (beta): false negative error. you missed the chance of detecting actually positive result. power = 1 - beta. power is a function that depends on the mean under the null hypothesis.

code: the power calculation.

mu0 <- 30; mua <- 32; sigma <- 4; n <- 16
alpha = 0.05
z <- qnorm(1 - alpha)
beta <- pnorm(mu0 + z * sigma/sqrt(n), 
      mean = mua, sd = sigma/sqrt(n), 
      lower.tail = FALSE)
beta
## [1] 0.63876

there is a 64% probability of detecting a mean as large as 32 or larger if we conduct this experiment.

Code: plotting the power curve.

library(ggplot2)
nseq = c(8, 16, 32, 64, 128)
mua = seq(30, 35, by = 0.1)
z = qnorm(.95)
power = sapply(nseq, function(n)
pnorm(mu0 + z * sigma / sqrt(n), mean = mua, sd = sigma / sqrt(n), 
          lower.tail = FALSE)
    )
colnames(power) <- paste("n", nseq, sep = "")
d <- data.frame(mua, power)
library(reshape2)
d2 <- melt(d, id.vars = "mua")
names(d2) <- c("mua", "n", "power")    
g <- ggplot(d2, 
            aes(x = mua, y = power, col = n)) + geom_line(size = 2)
g            

code: simulation of normal alpha, beta, power. Normal simulation-alpha-beta-power.R

Effective size: (mua - mu0)/sigma

Power in t test.

power.t.test(n = 16, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329
power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329
power.t.test(n = 16, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$power
## [1] 0.6040329

Note: power only depends on the effect size.

Code: calculate sample size

power.t.test(power = 0.8 , delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$n
## [1] 26.13751

Note: we need an effective size as large as 0.5. so we need at least 27 to get the power of 0.8.

Code: calculte te delta.

power.t.test(n = 16, power = 0.8, sd = 4, type = "one.sample", alt = "one.sided")$delta
## [1] 2.606797

Note: use power.t.test as the first attach on power calculation. power has a lot of knobs and dials that you can turn. and it’s very easy to get tripped up on thinking that you have better power than you have or thinking that you need a smaller sample size.

Multiple testing.

when you perform more than one hypothesis test you have to do some sort of correction to make sure that you’re not fooling yourself. This lecture is a little bit about how to do these corrections.

Why is hypothesis test/significance analysis is commonly overused??? multiple P values for the same data set.

Corrections to avoid positives or discoveries. Two components of multiple testing corrections: -Error measure; -Correction.

Problem: when we are doing 10000 tests with alpha = 0.05, for example, some high dimensional data, but because 10000 X 0.05 = 500, we may encounter 500 false positive result. So comparing with simple test, we need to control the error rate in a different way.

set.seed(1010093)
pValues <- rep(NA,1000) # an empty vector, length of 1000. 
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  # two normal random variable with no relationship. 
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
## [1] 51

Bonferroni correction.

Controlling falimy-wise error rate (FWER) m tests; FWER at level alpha so P(V >= 1) < alpha; Calculate normal P value alpha-FWER = alpha/m call all P values less than alpha-FWER significant.

Code: example of Bonferroni correction.

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}

# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
sum(p.adjust(pValues, method = "bonferroni") < 0.05)
## [1] 0

Pros: easy and handy, conservative; Cons: very conservative.

Controlling false discovery rate (FDR)

Basic idea: -m test -control FDR at level alpha so E[V/R] -Calculate normal P; -Order P values from smallest to largest: P(1),…, P(m) -call any P(i) <= alpha X (i/m)

Most popular correction for signal-processing disciplines. genomics, imaging, astronomy…

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}

# Controls false positive rate
sum(pValues < 0.05)
## [1] 51
sum(p.adjust(pValues, method = "BH") < 0.05)
## [1] 0
# Benjamini-Hochberg correction. 

Pros: easy, less conservative; Cons: Allows for more false positives.

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 beta=0, last 500 beta=2
  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  476
##   TRUE       500   24

Code:

# Controls FWER 
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE       23  500
##   TRUE       477    0
# Controls FDR 
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  487
##   TRUE       500   13
par(mfrow=c(1,2))
plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
plot(pValues,p.adjust(pValues,method="BH"),pch=19)

Summary: Multiple testing is an entire subfield A basic Bonferroni/BH correction is usually enough If there is strong dependence between tests there may be problems Consider method=“BY”

Introduction to multiple testing https://ies.ed.gov/ncee/pubs/20084018/app_b.asp

Resampling.

bootstrapping (Ephron, 1979).

Pull oneself up from their own bootstrap. doing something that’s physically impossible. We need many mathematics to perform inference. Bootstrapping liberate data analysis…e.g., how to get the confidence interval, you need complicated asymptotics Draw a direct line between informations we are using and the inferences that we are making. Bootstrapping may be dangerous but you don’t need complex mathematics. Bootstrapping: very useful to do statistical inferences where it would otherwise be difficult.

Very much in the spirit of data science: understand the mean of six die roll: we can do the algebra calculation; we can roll the die 50 times, and repeat this over and over.

imaging you don’t know if the die is fair. We play 50 die rolls and get a EMPIRICAL distribution, which is not the population distribution because the randomness. We can draw the sample and throw it into a bag, and we draw a sample from the bag repeatly. The fundamental idea is to exactly replicate our simulation experiment–use observed data to construct an estimated population distribution, we simulate that estimated population distribution and figure out the distribution,

Code:

library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:Hmisc':
## 
##     combine
nosim <- 1000
cfunc <- function(x, n) mean(x)
g1 = ggplot(data.frame(y = rep(1/6, 6), x = 1 : 6), aes(y = y, x = x))
g1 = g1 + geom_bar(stat = "identity", fill = "lightblue", colour = "black")

dat <- data.frame(x = apply(matrix(sample(1 : 6, nosim * 50, replace = TRUE), 
                     nosim), 1, mean))
g2 <- ggplot(dat, aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..)) 

grid.arrange(g1, g2, ncol = 2)

n = 50
B = 1000
## our data
x = sample(1 : 6, n, replace = TRUE)
x
##  [1] 6 3 3 6 2 4 6 6 2 5 3 1 6 1 3 6 5 1 6 6 6 2 1 2 6 4 5 1 2 3 6 6 4 4 6
## [36] 6 5 3 2 1 2 4 3 2 1 4 1 6 6 4
## bootstrap resamples
resamples = matrix(sample(x,
                           n * B,
                           replace = TRUE),
                    B, n)
table(x)
## x
##  1  2  3  4  5  6 
##  8  8  7  7  4 16
prop.table(table(x))
## x
##    1    2    3    4    5    6 
## 0.16 0.16 0.14 0.14 0.08 0.32
resampledMeans = apply(resamples, 1, mean)
g1 <- ggplot(as.data.frame(prop.table(table(x))), aes(x = x, y = Freq)) + geom_bar(colour = "black", fill = "lightblue", stat = "identity") 
g2 <- ggplot(data.frame(x = resampledMeans), aes(x = x)) + geom_histogram(binwidth=.2, colour = "black", fill = "salmon", aes(y = ..density..)) 
grid.arrange(g1, g2, ncol = 2)

Code: father and sun height.

library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
                           n * B,
                           replace = TRUE),
                    B, n)
resampledMedians <- apply(resamples, 1, median)
g = ggplot(data.frame(x = resampledMedians), aes(x = x)) 
g = g + geom_density(size = 2, fill = "red")
#g = g + geom_histogram(alpha = .20, binwidth=.3, colour = "black", fill = "blue", aes(y = ..density..)) 
g = g + geom_vline(xintercept = median(x), size = 2)
g

The Bootstrap principle: 1. Suppose a statistic that estimates some population parameter, but the sample distribution is unknown; 2. Use the distribution defined by the observed data, to approximate its sampling distribution.

Nonparametric bootstrap algorithm example: 1. sample n observations with replacement from the observed data resulting in one simulated complete dataset; 2. Sample with replacement; 3. take the statistics; 4. repeat the process many times. 5. the statistic get from this process is a approximation fo the population distribution; 6. Calculate the standard deviation to estimate the standard error of the median; 7. take 2.5th and 97.5th percentiles as confidence interval for the median. Bootstrap confidence interval.

library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
                           n * B,
                           replace = TRUE),
                    B, n)
resampledMedians <- apply(resamples, 1, median)
sd(resampledMedians); 
## [1] 0.08262171
quantile(resampledMedians, c(0.025, 0.975))
##     2.5%    97.5% 
## 68.44665 68.81461
g = ggplot(data.frame(x = resampledMedians), aes(x = x)) 
g = g + geom_density(size = 0.5)
g = g + geom_histogram(alpha = .40, binwidth=.05, colour = "grey", fill = "orange", aes(y = ..density..)) 
g = g + geom_vline(xintercept = median(x), size = 1)
g

Further: “An Introduction to the Bootstrap”

Permutation.

Permutation is used for group comparisons.

Code:

data(InsectSprays)
g = ggplot(InsectSprays, aes(spray, count, fill = spray))
g = g + geom_boxplot()
g

Permutation tests: consider the null hypothesis that the distribution of the observations from each group is the same; the group labels are irrelevant; Consider a dataframe with two column: count and spray; Permute the spray labels recalculate the statistic -Mean difference in counts; -Geometric means; -T statistic; -… Calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed.

Code:

subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
# to calculate the mean of two group B and C, notice here group is a variable. 
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
# To calculate the difference of mean between group B and C. 
observedStat <- testStat(y, group)
#Permutation: change the group label of subdata randomly for 10000 times, i.e., if the group if irrevelant to the y. 
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
mean(permutations > observedStat)
## [1] 0
g = ggplot(data.frame(permutations = permutations),
           aes(permutations))
g = g + geom_histogram(aes(stat = ..density..), fill = "steelblue", color = "grey", binwidth = 1)
## Warning: Ignoring unknown aesthetics: stat
g = g + geom_vline(xintercept = observedStat, size = 1, colour = "orange")
g

On-line exercises

http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw1.html#1 http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw2.html#1

http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw3.html#1

http://bcaffo.github.io/courses/06_StatisticalInference/homework/hw4.html#1

Refereces:

https://github.com/bcaffo/courses/tree/master/06_StatisticalInference

https://www.youtube.com/playlist?list=PLpl-gQkQivXiBmGyzLrUjzsblmQsLtkzJ

https://leanpub.com/LittleInferenceBook