1 Normal Distribution

The Normal distribution, the \(\chi^2\) distribution, and the F-distribution are the three most important distributions in parametric statistics. They can be considered "measuring sticks or rulers" for examining the behavior of a test statistic. A test statistic is some measure (e.g., a mean or a variance) computed from sample data. The \(\chi^2\) and F-distributions are intimately related to the Normal distribution, as explained below.

The Normal distribution is the most important of the three because most test statistics are assumed to be normally distributed. When the sample size is more than 30, it is generally safe to use the Normal distribution to assess the likelihood of a test statistic occurring purely by chance. This fact arises from the Central Limit Theorem, which gives us the distribution of sample averages (the sampling distribution of the sample mean).
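
For instance (a hypothetical illustration with made-up numbers), suppose a sample of n = 36 yields a mean of 103 when the population has mean 100 and standard deviation 15. The Normal "ruler" tells us how surprising such a mean is under chance alone.

z <- (103 - 100)/(15/sqrt(36))  #the sample mean is 1.2 standard errors away
2*pnorm(-abs(z))                #two-sided p-value, about 0.23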

The formula for the standard Normal distribution can be derived by analyzing random dart throws using simple assumptions. The exact formula is \(p(x)={1/\sqrt{2\pi}}\, e^{-x^2/2}\). The peak of this distribution occurs at x = 0, where \(p(0)={1/\sqrt{2\pi}}\approx 0.399\). Let us write R code to verify this.

x <- seq(-5,5,by=0.01)
p <- function(x){1/sqrt(2*pi)*exp(-x^2/2)}  #standard Normal density
plot(x,p(x),"l")
peak <- p(0)                                #maximum occurs at x = 0
text(0.27,0.38,round(peak,3),adj=c(0,0),cex=0.8)
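
As a cross-check (not part of the original code), R's built-in dnorm() implements the same density, so the hand-rolled p() should agree with it exactly.

dnorm(0)                   #0.3989423, the same peak value
all.equal(p(0), dnorm(0))  #TRUE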

1.1 The Central Limit Theorem

Let us develop some intuition regarding the CLT. The claim is that even if the original distribution is not normal, the distribution of sample averages approaches the normal distribution as the sample size increases. First, let us consider a DEFINITELY NOT-normal distribution: an arithmetic sequence from 1 to 5 with length 200. As we increase the sample size (sampleSize in the code below), the distribution of the sample mean becomes more and more normal. Furthermore, the variance of the sampling distribution decreases according to \(\mathrm{Var}(\bar{X})=\sigma^2/n\), where n is the sample size.

  sampleSize <- 30

  population <- seq(1,5,length=200)
  pop.mean <- mean(population)
  #var() gives the sample variance; rescale by (n-1)/n for the population variance
  pop.var <- round((200-1)/200*var(population),3)
  
  #draw 200 samples of size sampleSize and record each sample mean
  xbar <- rep(0,200)
  
  for (i in 1:200) {
      xbar[i] <- mean(sample(population,sampleSize,replace=TRUE))
  }
  
  #Sample calculations
  s.mean <- round(mean(xbar),3)
  s.var <- round(var(xbar),3)
  s.sd <- sqrt(s.var)
  
  #invert var(xbar) = sigma^2/n to estimate the population variance
  est.pop.var <- round(s.var*sampleSize,3)
    
  title.pop <- paste("Actual pop mean: ",pop.mean," var: ", pop.var)
  title.sam <- paste("Estimated pop mean: ",s.mean, "var: ", est.pop.var)
  
  par(mfrow = c(1,2))
  hist(population,freq=FALSE, main = title.pop)
  hist(xbar,freq=FALSE,main=title.sam)
  
  #superimpose a normal curve
  curve(dnorm(x,mean=s.mean,sd=s.sd),add=TRUE,lwd=2,col="red") 
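
A quick check of \(\mathrm{Var}(\bar{X})=\sigma^2/n\) (not part of the original code): the variance of the 200 sample means should be close to the population variance divided by the sample size.

  #check Var(xbar) = sigma^2/n
  c(observed = s.var, expected = round(pop.var/sampleSize,3))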

2 The Chi-square Distribution

The \(\chi^2\) distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables: if \(Z_1,\dots,Z_k\) are independent and each \(Z_i \sim N(0,1)\), then \(\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k\).

2.1 Generating \(\chi^2\) with R

degrees <- 30

sumchisq <- rep(0,1000)   #accumulator for 1000 chi-squared variates

#sum the squares of `degrees` independent standard normals
for (i in 1:degrees) {
  chisq <- rnorm(1000)^2
  sumchisq <- sumchisq+chisq
}
  #computed empirical values
  com_mean <- round(mean(sumchisq),3) 
  com_var <-  round(var(sumchisq),3) 
  
  #theoretical values
  th_mean <- degrees  #mean= degrees of freedom (df)
  th_var <- 2*degrees #variance=2*df
  th_sd <- round(sqrt(th_var),3)
  
  title1 <- paste("Computed mean: ",com_mean," var: ", com_var)
  title1 <- paste(title1, "\n Theoretical mean:", th_mean, " var: ", th_var)
  
  hist(sumchisq,freq=FALSE,main=title1)
  
  #superimpose a theoretical chi-sq curve
  curve(dchisq(x,th_mean),add=TRUE,lwd=2,col="red") 
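
As an aside (not in the original), R can draw the same variates directly with rchisq(), or the loop can be collapsed with colSums(); either way the empirical mean and variance should again land near the theoretical values.

  #equivalent direct and vectorized draws
  direct <- rchisq(1000, df = degrees)
  vectorized <- colSums(matrix(rnorm(1000*degrees)^2, nrow = degrees))
  round(c(mean(direct), var(direct)), 3)   #should be near 30 and 60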

We can observe the following: 1) As the degrees of freedom increase, the \(\chi^2\) distribution approaches the Normal distribution. 2) The \(\chi^2\) statistic is always non-negative, as it is a sum of squared numbers.
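
Observation 1 can be seen directly (an illustrative sketch; df = 100 is an arbitrary choice): since a \(\chi^2\) variate with k degrees of freedom has mean k and variance 2k, overlay a Normal curve with matching mean and standard deviation.

curve(dchisq(x, df = 100), from = 40, to = 160, lwd = 2, ylab = "density")
curve(dnorm(x, mean = 100, sd = sqrt(200)), add = TRUE, lty = 2, col = "blue")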

Illustrating the use of the \(\chi^2\) distribution for a test of independence: is there a gender effect in reading preference?

reading.taste <- data.frame(c(25,20),c(25,30))
dimnames(reading.taste)<- list(c("Female","Male"),c("Fiction","Non-fiction"))
reading.taste
##        Fiction Non-fiction
## Female      25          25
## Male        20          30
chisq.test(reading.taste, simulate.p.value=TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  reading.taste
## X-squared = 1.0101, df = NA, p-value = 0.4363

Since the p-value is much greater than 0.05, the null hypothesis that there is no gender effect cannot be rejected; the data provide no evidence of a gender influence on reading preference.
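
To connect the test back to the distribution itself, here is a minimal by-hand sketch (not in the original): compute the expected counts under independence and then the statistic \(\sum (O-E)^2/E\). The asymptotic p-value from pchisq() differs somewhat from the simulated p-value reported above, but leads to the same conclusion.

obs <- as.matrix(reading.taste)
expected <- outer(rowSums(obs), colSums(obs))/sum(obs)  #expected counts under independence
X2 <- sum((obs - expected)^2/expected)
X2                                      #1.0101, matching chisq.test above
pchisq(X2, df = 1, lower.tail = FALSE)  #asymptotic p-value, about 0.31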

3 The F Distribution

A random variate of the F-distribution with parameters \(d_1\) and \(d_2\) arises as the ratio of two appropriately scaled chi-squared variates:

\(X = \frac{U_1/d_1}{U_2/d_2}\) where \(U_1\) and \(U_2\) have chi-squared distributions with \(d_1\) and \(d_2\) degrees of freedom respectively, and \(U_1\) and \(U_2\) are independent.

3.1 Generating F-distribution with R

options(digits = 2)

  ndf <- 20
  ddf <- 25
  f <- rep(0,1000)
  
  #each F variate is a ratio of two independent scaled chi-squared variates
  for (i in 1:1000){
      num   <- rchisq(1,ndf)/ndf
      denom <- rchisq(1,ddf)/ddf
      f[i] <- num/denom
  }
  
  ##Theoretical values
  th.mean <- round(ddf/(ddf-2),3)
  th.var <- round(2*ddf^2*(ndf+ddf-2)/(ndf*(ddf-2)^2*(ddf-4)),3)
  
  #Computed values from sample
  comp.mean <- round(mean(f),3)
  comp.var <- round(var(f),3)
    
  title1 <- paste("Theoretical mean: ",th.mean, "var: ", th.var)
  title1 <- paste(title1, "\nComputed mean: ", comp.mean, " var:",  comp.var)
  
  hist(f,freq=FALSE, main = title1)
  
  
  #superimpose a theoretical F-curve
  curve(df(x,ndf,ddf),add=TRUE,lwd=2,col="red") 
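
As with the chi-squared case, R can draw F variates directly (not part of the original code); the direct draw should reproduce the same theoretical mean and variance.

  #equivalent direct draw
  f.direct <- rf(1000, ndf, ddf)
  round(c(mean(f.direct), var(f.direct)), 3)  #should be near th.mean and th.var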

We can observe the following: 1) As both degrees of freedom increase, the F distribution approaches the Normal distribution. 2) The F-statistic is always positive, as it is a ratio of sums of squared numbers.
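
As a practical illustration (with made-up data), R's built-in var.test() uses exactly this F "ruler": it compares two sample variances via their ratio, which follows an F distribution when both samples come from normal populations.

set.seed(1)
g1 <- rnorm(21)  #sample of size 21 -> numerator df = 20
g2 <- rnorm(26)  #sample of size 26 -> denominator df = 25
var.test(g1, g2) #F = var(g1)/var(g2), assessed against F(20,25)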

