The chi-square goodness of fit test is used to compare an observed distribution to an expected distribution when the data fall into two or more categories of a discrete variable. In other words, it compares multiple observed proportions to expected probabilities.
For example, we collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.
Are these colors equally common?
chisq.test(x, p)
x: a numeric vector
p: a vector of probabilities of the same length as x.
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res
##
## Chi-squared test for given probabilities
##
## data: tulip
## X-squared = 27.886, df = 2, p-value = 8.803e-07
The p-value of the test is 8.803e-07, which is less than the significance level alpha = 0.05. We can conclude that the three colors are not equally common (p-value = 8.803e-07).
Note that the chi-square test should be used only when all of the calculated expected counts are greater than 5.
# Access to the expected values
res$expected
## [1] 52.66667 52.66667 52.66667
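These expected counts are simply the total number of observations multiplied by each hypothesized probability; a minimal check by hand:
# expected counts = total sample size x hypothesized probabilities
sum(tulip) * c(1/3, 1/3, 1/3)
## [1] 52.66667 52.66667 52.66667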
Suppose that, in the region where you collected the data, the ratio of red, yellow and white tulips is 3:2:1 (3 + 2 + 1 = 6). This means that the expected proportions are:
3/6 (= 1/2) for red
2/6 (= 1/3) for yellow
1/6 for white
We want to know whether there is any significant difference between the observed proportions and the expected proportions.
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res
##
## Chi-squared test for given probabilities
##
## data: tulip
## X-squared = 0.20253, df = 2, p-value = 0.9037
The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.
The result of chisq.test() function is a list containing the following components:
statistic: the value of the chi-squared test statistic
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed counts
expected: the expected counts
# printing the p-value
res$p.value
## [1] 0.9036928
# the estimate component is not returned by this test, hence NULL
res$estimate
## NULL
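The other components listed above can be accessed in the same way. For example, for the 3:2:1 tulip test:
# printing the chi-squared statistic
res$statistic
## X-squared 
## 0.2025316 
# printing the observed counts
res$observed
## [1] 81 50 27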
FoodType <- c(190, 185, 90, 35)
res <- chisq.test(FoodType, p = c(0.35, 0.40, 0.20, 0.05))
res
##
## Chi-squared test for given probabilities
##
## data: FoodType
## X-squared = 7.4107, df = 3, p-value = 0.0599
Since the p-value (0.0599) is greater than 0.05, we fail to reject the null hypothesis: there is no significant difference between the observed and expected frequencies, so the stated belief about the proportions is supported.
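As a quick sanity check, the statistic can be reproduced directly from its definition, sum((O - E)^2 / E), using the counts and proportions above:
# observed counts and expected counts under the hypothesized proportions
O <- c(190, 185, 90, 35)
E <- sum(O) * c(0.35, 0.40, 0.20, 0.05)
# chi-squared statistic computed by hand
sum((O - E)^2 / E)
## [1] 7.410714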
## Example 6.15 of the analytics book, page 169
k <- c(4640,4967,4640,4967,4640,4957,5169,4957,5169,4957,5064,5033,5064,5033,5064,5062,4514,5062,4514,5062,5217,4883,5217,4883,5217,4658,4998,4658,4998,4658,5557,4843,5557,4843,5557,5510,5112,5510,5112,5510,5005,5111,5005,5111,5005,4967,4865,4967,4865,4967)
library(zoo)
min_r <- min(k)
max_r <- max(k)
# finding the number of bins by Sturges' formula
No_of_bins <- floor(1+3.3*log10(length(k)))
Bin_Width <- round((max_r-min_r)/No_of_bins,0)
# specifying the breaks in histogram
hk <- hist(k,breaks=c(seq(min_r,max_r+Bin_Width,Bin_Width)))
# candidate distribution - normal
breaks_cdf <- pnorm(hk$breaks,mean=5000,sd=300)
# Getting the probability value, assuming the dist is normal
null.probs <- rollapply(breaks_cdf, 2, function(x) x[2]-x[1])
## applying Chisquare test
chisq.test(hk$counts, p=null.probs, rescale.p=TRUE, simulate.p.value=TRUE)
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 2000 replicates)
##
## data: hk$counts
## X-squared = 15.246, df = NA, p-value = 0.01049
The p-value (0.01049) is less than 0.05, so the null hypothesis is rejected, which means the data do not follow the assumed normal distribution.
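As a side note, the rollapply() call above just takes successive differences of the CDF values, so base R's diff() would give the same bin probabilities without needing zoo; a minimal equivalent sketch:
# bin probabilities as successive differences of the CDF values
diff(breaks_cdf)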
library(MASS)
set.seed(101)
k <- c(2.47,4.23,5.41,3.49,4.17,10.09,18.78,0.68,2.28,16.16,0.28,2.97,4.01,5.88,20.32,26.88,19.07,0.22,6.37,10.38,4.2,10.17,1.84,21.88,9.42,0.01,6.15,4.99,3.07,18.6,1.54,10.23,3.99,6.17,0.39,11.03,9.38,1.57,6.91,2.49,5.52,11.53,7.64,8.8,7.17,3.26,6.74,16.32,10,7.45)
rate <- 1/mean(k)
min_r <- min(k)
max_r <- max(k)
# finding the number of bins by Sturges' formula
No_of_bins <- floor(1+3.3*log10(length(k)))
Bin_Width <- round((max_r-min_r)/No_of_bins,0)
# specifying the breaks in histogram
hk <- hist(k,breaks=c(seq(min_r,max_r+Bin_Width,Bin_Width)))
# candidate distribution - exponential
breaks_cdf <- pexp(hk$breaks,rate=rate)
breaks_cdf
## [1] 0.001305994 0.407880023 0.648935445 0.791855829 0.876592509 0.926832403
## [7] 0.956619349 0.974279860
# Getting the probability value, assuming the dist is exponential
null.probs <- rollapply(breaks_cdf, 2, function(x) x[2]-x[1])
null.probs
## [1] 0.40657403 0.24105542 0.14292038 0.08473668 0.05023989 0.02978695
## [7] 0.01766051
## applying Chisquare test
chisq.test(hk$counts, p=null.probs, rescale.p=TRUE, simulate.p.value=TRUE)
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 2000 replicates)
##
## data: hk$counts
## X-squared = 9.0094, df = NA, p-value = 0.1524
The p-value is 0.15, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis: the data are consistent with an exponential distribution.
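Incidentally, MASS is loaded above but not actually used; its fitdistr() function could be used to obtain the same maximum likelihood estimate of the exponential rate, which equals 1/mean(k):
# ML estimate of the exponential rate (identical to 1/mean(k))
fitdistr(k, "exponential")$estimate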