The Normal distribution, \(\chi^2\) distribution and F-distribution are the three most important distributions in parametric statistics. They can be considered "measuring sticks or rulers" for examining the behavior of a test statistic. A test statistic is some measure (e.g., a mean, a variance, etc.) computed from sample data. The \(\chi^2\) and F-distributions are intimately related to the Normal distribution, as explained below.
The Normal distribution is the most important of the three because most test statistics are assumed to be normally distributed. When the sample size exceeds 30, it is generally safe to use the Normal distribution to assess the likelihood that a test statistic occurred purely by chance. This fact follows from the Central Limit Theorem (CLT), which gives us the distribution of sample averages (the sampling distribution of the sample mean).
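As a minimal sketch (all numbers here are made up for illustration), this is how one might use the Normal "ruler" to judge how surprising an observed sample mean is:
mu <- 100           #assumed population mean (hypothetical)
sigma <- 15         #assumed population standard deviation (hypothetical)
n <- 36             #sample size above 30, so the Normal approximation applies
xbar.obs <- 105     #made-up observed sample mean
z <- (xbar.obs-mu)/(sigma/sqrt(n))  #standardize: sd of xbar is sigma/sqrt(n)
2*pnorm(-abs(z))    #two-sided probability of a result at least this extreme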
The formula for the standard Normal distribution can be derived by analyzing random throws of darts under simple assumptions. The exact formula is \(p(x)=\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\). The peak of this distribution occurs at \(y=0.399\), since at \(x=0\) the density equals \(1/\sqrt{2\pi}\approx 0.399\). Let us write R code to verify this.
x <- seq(-5,5,by=0.01)
p <- function(x){1/sqrt(2*pi)*exp(-x^2/2)}  #standard Normal density
plot(x,p(x),type="l")
peak <- p(0)  #the maximum occurs at x=0
text(0.27,0.38,round(peak,3),adj=c(0,0),cex=0.8)
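As a quick sanity check, our hand-rolled density should agree with R's built-in dnorm():
all.equal(p(x),dnorm(x))  #should return TRUE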
Let us develop some intuition regarding the CLT. The claim is that even if the original distribution is not normal, the distribution of sample averages approaches the Normal distribution as the sample size increases. First, let us consider a DEFINITELY NOT-normal population: an arithmetic sequence from 1 to 5 of length 200. As we increase sampleSize in the code below, the distribution of the sample mean becomes more and more normal. Furthermore, the variance of the sampling distribution decreases according to \(var(\bar{X})=\sigma^2/n\), where n is the sample size.
sampleSize <- 30
population <- seq(1,5,length=200)
pop.mean <- mean(population)
#var() uses divisor n-1; rescale by (n-1)/n to get the population variance
pop.var <- round((200-1)/200*var(population),3)
xbar <- rep(0,200)
#draw 200 samples of size sampleSize; record each sample mean
for (i in 1:200) {
xbar[i] <- mean(sample(population,sampleSize,replace=TRUE))
}
#Sample calculations
s.mean <- round(mean(xbar),3)
s.var <- round(var(xbar),3)
s.sd <- sqrt(s.var)
est.pop.var <- round(s.var*sampleSize,3)  #invert var(xbar)=sigma^2/n
title.pop <- paste("Actual pop mean: ",pop.mean," var: ", pop.var)
title.sam <- paste("Estimated pop mean: ",s.mean, "var: ", est.pop.var)
par(mfrow = c(1,2))
hist(population,freq=FALSE, main = title.pop)
hist(xbar,freq=FALSE,main=title.sam)
#superimpose a normal curve
curve(dnorm(x,mean=s.mean,sd=s.sd),add=TRUE,lwd=2,col="red")
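To check the variance-shrinkage formula directly, compare the theoretical value \(\sigma^2/n\) with the observed variance of the 200 sample means; the two should be close, though not identical, since only 200 samples were drawn:
pop.var/sampleSize  #theoretical var(xbar) = sigma^2/n
s.var               #observed variance of the sample means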
The \(\chi^2\) distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
degrees <- 30
sumchisq <- rep(0,1000)
#accumulate squares of 'degrees' independent standard normal samples
for (i in 1:degrees) {
sumchisq <- sumchisq + rnorm(1000)^2
}
#computed empirical values
com_mean <- round(mean(sumchisq),3)
com_var <- round(var(sumchisq),3)
#theoretical values
th_mean <- degrees #mean= degrees of freedom (df)
th_var <- 2*degrees #variance=2*df
th_sd <- round(sqrt(th_var),3)
title1 <- paste("Computed mean: ",com_mean," var: ", com_var)
title1 <- paste(title1, "\n Theoretical mean:", th_mean, " var: ", th_var)
hist(sumchisq,freq=FALSE,main=title1)
#superimpose the theoretical chi-squared density (curve() supplies its own x grid)
curve(dchisq(x,th_mean),add=TRUE,lwd=2,col="red")
We can observe the following: 1) as the degrees of freedom increase, the \(\chi^2\) distribution approaches the Normal distribution; 2) the \(\chi^2\) statistic is always positive, as it is a sum of squared numbers.
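A quick sketch of the first observation, using a hypothetical df of 100: overlay the Normal density with matching mean (df) and variance (2*df) on the \(\chi^2\) density and note how closely they agree.
k <- 100  #hypothetical large degrees of freedom
curve(dchisq(x,k),from=50,to=160,lwd=2,main="Chi-sq vs Normal, df=100")
curve(dnorm(x,mean=k,sd=sqrt(2*k)),add=TRUE,lty=2,col="blue")  #Normal approximation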
Illustrating the use of the \(\chi^2\) distribution for a test of independence: is there a gender effect in reading preference?
reading.taste <- data.frame(c(25,20),c(25,30))
dimnames(reading.taste)<- list(c("Female","Male"),c("Fiction","Non-fiction"))
reading.taste
## Fiction Non-fiction
## Female 25 25
## Male 20 30
chisq.test(reading.taste, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reading.taste
## X-squared = 1.0101, df = NA, p-value = 0.4363
Since the p-value is much greater than 0.05, the null hypothesis that there is no gender effect cannot be rejected; the data show no evidence of a gender influence on reading preference.
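To see where the reported X-squared = 1.0101 comes from, we can compute the statistic by hand. Under independence, the expected count in each cell is (row total × column total)/grand total, and the statistic is \(\sum (O-E)^2/E\):
obs <- as.matrix(reading.taste)
expected <- outer(rowSums(obs),colSums(obs))/sum(obs)  #E = row total * col total / n
sum((obs-expected)^2/expected)  #reproduces X-squared = 1.0101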
A random variate of the F-distribution with parameters \(d_1\) and \(d_2\) arises as the ratio of two appropriately scaled chi-squared variates:
\(X = \frac{U_1/d_1}{U_2/d_2}\) where \(U_1\) and \(U_2\) have chi-squared distributions with \(d_1\) and \(d_2\) degrees of freedom respectively, and \(U_1\) and \(U_2\) are independent.
options(digits = 2)
ndf <- 20   #numerator degrees of freedom d1
ddf <- 25   #denominator degrees of freedom d2
f <- rep(0,1000)
for (i in 1:1000){
num <- rchisq(1,ndf)/ndf      #scaled chi-squared variate U1/d1
denom <- rchisq(1,ddf)/ddf    #scaled chi-squared variate U2/d2
f[i] <- num/denom
}
#Theoretical values
th.mean <- round(ddf/(ddf-2),3)
th.var <- round(2*ddf^2*(ndf+ddf-2)/(ndf*(ddf-2)^2*(ddf-4)),3)
#Computed values from sample
comp.mean <- round(mean(f),3)
comp.var <- round(var(f),3)
title1 <- paste("Theoretical mean: ",th.mean, "var: ", th.var)
title1 <- paste(title1, "\nComputed mean: ", comp.mean, " var:", comp.var)
hist(f,freq=FALSE, main = title1)
#superimpose the theoretical F density (curve() supplies its own x grid)
curve(df(x,ndf,ddf),add=TRUE,lwd=2,col="red")
We can observe the following: 1) as the degrees of freedom increase, the F-distribution approaches the Normal distribution; 2) the F-statistic is always positive, as it is a ratio of sums of squared numbers.
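The F-distribution is the basis of variance-comparison tests and ANOVA. As a minimal sketch with made-up data, R's var.test() compares two sample variances by referring their ratio to an F-distribution (here with the same 20 and 25 degrees of freedom used above):
set.seed(1)           #for reproducibility
g1 <- rnorm(21,sd=2)  #hypothetical sample of size 21
g2 <- rnorm(26,sd=2)  #hypothetical sample of size 26
var.test(g1,g2)       #F = var(g1)/var(g2) with df 20 and 25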