What is Chi-Square Goodness of Fit

The chi-square goodness of fit test compares an observed distribution with an expected distribution when discrete data fall into two or more categories. In other words, it compares multiple observed proportions to expected probabilities.
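For reference, the comparison is usually summarized by the Pearson statistic below. The symbols are introduced here for illustration and do not appear elsewhere in this document: O_i is the observed count in category i, p_i the expected probability, n the total count, and k the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i
$$

When the expected probabilities are fully specified in advance, the statistic is compared against a chi-square distribution with k - 1 degrees of freedom.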

Example

Suppose we collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.

Are these colors equally common?

R Function: chisq.test()

chisq.test(x, p)

x: a numeric vector of observed counts
p: a vector of probabilities of the same length as x

Answer: Are the Colors Equally Common

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res
## 
##  Chi-squared test for given probabilities
## 
## data:  tulip
## X-squared = 27.886, df = 2, p-value = 8.803e-07

The p-value of the test is 8.803 × 10^-7, which is less than the significance level alpha = 0.05. We can conclude that the colors are not equally common: the observed counts differ significantly from a uniform distribution.
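To see where these numbers come from, the statistic and p-value can be reproduced by hand. This is a minimal sketch, not part of the original analysis; the variable names (p_null, expected, stat, p_value) are chosen here for illustration.

```r
# Observed counts and the null hypothesis of equal proportions
tulip <- c(81, 50, 27)
p_null <- rep(1/3, 3)

# Expected counts under the null: total count times each probability
expected <- sum(tulip) * p_null

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
stat <- sum((tulip - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(tulip) - 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p.value = p_value)
```

The values should agree with the X-squared, df and p-value reported by chisq.test() above.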

Note that the chi-square test should be used only when all expected counts are greater than 5.

# Access to the expected values
res$expected
## [1] 52.66667 52.66667 52.66667
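The rule of thumb above can be checked programmatically before trusting the p-value. The following is a small sketch; the name expected_ok is introduced here for illustration.

```r
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))

# TRUE when every expected count is greater than 5
expected_ok <- all(res$expected > 5)
expected_ok
```

When the check fails, one option offered by chisq.test() itself is simulate.p.value = TRUE, which computes the p-value by Monte Carlo simulation instead of the chi-square approximation.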

Example

Suppose that, in the region where you collected the data, the ratio of red, yellow and white tulips is 3:2:1 (3+2+1 = 6). This means that the expected proportions are:

3/6 (= 1/2) for red
2/6 (= 1/3) for yellow
1/6 for white

We want to know whether there is any significant difference between the observed proportions and the expected proportions.

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res
## 
##  Chi-squared test for given probabilities
## 
## data:  tulip
## X-squared = 0.20253, df = 2, p-value = 0.9037

The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.
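The expected proportions do not have to be typed in as fractions. As a sketch, they can be derived from the 3:2:1 ratio directly, or the raw ratio can be passed with rescale.p = TRUE so that chisq.test() normalizes it to sum to 1. The names ratio, p_manual, res_manual and res_rescaled are introduced here for illustration.

```r
tulip <- c(81, 50, 27)
ratio <- c(3, 2, 1)  # red : yellow : white

# Option 1: normalize the ratio by hand (gives 1/2, 1/3, 1/6)
p_manual <- ratio / sum(ratio)
res_manual <- chisq.test(tulip, p = p_manual)

# Option 2: let chisq.test() rescale the ratio to probabilities
res_rescaled <- chisq.test(tulip, p = ratio, rescale.p = TRUE)

# Both approaches should give the same statistic
all.equal(res_manual$statistic, res_rescaled$statistic)
```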

Access to the values returned by the chisq.test() function

The result of the chisq.test() function is a list containing the following components:

statistic: the value of the chi-squared test statistic
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count

# printing the p-value
res$p.value
## [1] 0.9036928
# chisq.test() returns no 'estimate' component, so this is NULL
res$estimate
## NULL
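To see which components are actually available, names() can be called on the result. The sketch below also pulls out the Pearson residuals, a component of the chisq.test() result that is not listed above; they show which categories contribute most to the statistic.

```r
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))

# List every component stored in the result
names(res)

# Test statistic and degrees of freedom
res$statistic
res$parameter

# Pearson residuals: (observed - expected) / sqrt(expected)
res$residuals

# The squared residuals add up to the test statistic
sum(res$residuals^2)
```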

Example 3: Hanuma Airlines, page 167 of the Analytics Book

FoodType <- c(190, 185, 90,35)
res <- chisq.test(FoodType, p = c(.35,.40,.20,.05))
res
## 
##  Chi-squared test for given probabilities
## 
## data:  FoodType
## X-squared = 7.4107, df = 3, p-value = 0.0599

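The p-value (0.0599) is greater than 0.05, so we fail to reject the null hypothesis: the observed frequencies are not significantly different from the expected frequencies, and the data are consistent with the belief. Note that the p-value is close to the threshold, so the evidence is borderline.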
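The same decision can be reached by comparing the statistic with the critical value of the chi-square distribution. This is a sketch assuming a significance level of 0.05; the names p_belief, alpha, critical and reject are introduced here for illustration.

```r
FoodType <- c(190, 185, 90, 35)
p_belief <- c(.35, .40, .20, .05)
res <- chisq.test(FoodType, p = p_belief)

# Critical value at the 5% level for the test's degrees of freedom
alpha <- 0.05
critical <- qchisq(1 - alpha, df = unname(res$parameter))

# Reject the null only if the statistic exceeds the critical value
reject <- unname(res$statistic) > critical

c(statistic = unname(res$statistic), critical = critical)
reject
```

The statistic falls just below the critical value, which matches the p-value being slightly above 0.05.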

Example 6.15 of the Analytics Book, page 169

k <- c(4640,4967,4640,4967,4640,4957,5169,4957,5169,4957,5064,5033,5064,5033,5064,5062,4514,5062,4514,5062,5217,4883,5217,4883,5217,4658,4998,4658,4998,4658,5557,4843,5557,4843,5557,5510,5112,5510,5112,5510,5005,5111,5005,5111,5005,4967,4865,4967,4865,4967)
library(zoo)
min_r <- min(k)
max_r <- max(k)
# finding number of bins by Sturges formula
No_of_bins <- floor(1+3.3*log10(length(k)))
Bin_Width <- round((max_r-min_r)/No_of_bins,0)
# specifying the breaks in histogram
hk <- hist(k,breaks=c(seq(min_r,max_r+Bin_Width,Bin_Width)))

# candidate distribution - normal
breaks_cdf <- pnorm(hk$breaks,mean=5000,sd=300)
# Getting the probability value, assuming the dist is normal
null.probs <- rollapply(breaks_cdf, 2, function(x) x[2]-x[1])
# applying the chi-square test
chisq.test(hk$counts, p=null.probs, rescale.p=TRUE, simulate.p.value=TRUE)
## 
##  Chi-squared test for given probabilities with simulated p-value
##  (based on 2000 replicates)
## 
## data:  hk$counts
## X-squared = 15.246, df = NA, p-value = 0.01049

The p-value is less than 0.05, so the null hypothesis is rejected. This means the data are not consistent with the candidate normal distribution (mean 5000, sd 300).
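The bin probabilities in this example are differences of the candidate CDF at consecutive break points. As a sketch of that step using only base R, diff() gives the same kind of result as the rollapply() call. The bin edges below are hypothetical and chosen only for illustration; they are not the breaks produced by hist() above.

```r
# Hypothetical bin edges, for illustration only
breaks <- seq(4500, 5700, by = 200)

# CDF of the candidate normal distribution at each edge
cdf_at_breaks <- pnorm(breaks, mean = 5000, sd = 300)

# Probability of each bin = CDF at upper edge minus CDF at lower edge
bin_probs <- diff(cdf_at_breaks)
bin_probs

# The total is below 1 because the tails outside the bins are left out,
# which is why the test is called with rescale.p = TRUE
sum(bin_probs)
```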

Example 6.16 of the Analytics Book, page 170

library(MASS)
set.seed(101)
k <- c(2.47,4.23,5.41,3.49,4.17,10.09,18.78,0.68,2.28,16.16,0.28,2.97,4.01,5.88,20.32,26.88,19.07,0.22,6.37,10.38,4.2,10.17,1.84,21.88,9.42,0.01,6.15,4.99,3.07,18.6,1.54,10.23,3.99,6.17,0.39,11.03,9.38,1.57,6.91,2.49,5.52,11.53,7.64,8.8,7.17,3.26,6.74,16.32,10,7.45)
# rate of the exponential distribution, estimated as 1 / sample mean
rate <- 1/mean(k)
min_r <- min(k)
max_r <- max(k)
# finding number of bins by Sturges formula
No_of_bins <- floor(1+3.3*log10(length(k)))
Bin_Width <- round((max_r-min_r)/No_of_bins,0)
# specifying the breaks in histogram
hk <- hist(k,breaks=c(seq(min_r,max_r+Bin_Width,Bin_Width)))

# candidate distribution - exponential
breaks_cdf <- pexp(hk$breaks,rate=rate)
breaks_cdf
## [1] 0.001305994 0.407880023 0.648935445 0.791855829 0.876592509 0.926832403
## [7] 0.956619349 0.974279860
# Getting the probability value, assuming the dist is exponential
null.probs <- rollapply(breaks_cdf, 2, function(x) x[2]-x[1])
null.probs
## [1] 0.40657403 0.24105542 0.14292038 0.08473668 0.05023989 0.02978695
## [7] 0.01766051
# applying the chi-square test
chisq.test(hk$counts, p=null.probs, rescale.p=TRUE, simulate.p.value=TRUE)
## 
##  Chi-squared test for given probabilities with simulated p-value
##  (based on 2000 replicates)
## 
## data:  hk$counts
## X-squared = 9.0094, df = NA, p-value = 0.1524

The p-value is 0.15, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis: the data are consistent with an exponential distribution.