Chi-square goodness of fit test hypotheses/ Pearson’s Chi-square test

The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.


The chi-square goodness of fit test evaluates two hypotheses: the null and alternative hypotheses. They’re two competing answers to the question “Was the sample drawn from a population that follows the specified distribution?”

Null hypothesis (H0): The population follows the specified distribution.

Alternative hypothesis (Ha): The population does not follow the specified distribution.

These are general hypotheses that apply to all chi-square goodness of fit tests. You should make your hypotheses more specific by describing the “specified distribution.” You can name the probability distribution (e.g., Poisson distribution) or give the expected proportions of each group.

Steps

Step 1: Calculate the expected frequencies

Step 2: Calculate chi-square

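$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Here O_i is the observed frequency and E_i is the expected frequency of group i, and n is the number of groups.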
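Steps 1 and 2 can be sketched in R as follows. This is a minimal illustration, not a replacement for `chisq.test()`: the helper name `gof_statistic` and the example counts are hypothetical and do not come from the datasets used later.

```r
# Hypothetical helper: chi-square goodness of fit statistic computed by hand.
gof_statistic <- function(observed, p) {
  stopifnot(length(observed) == length(p), abs(sum(p) - 1) < 1e-8)
  expected <- sum(observed) * p            # Step 1: expected frequencies
  sum((observed - expected)^2 / expected)  # Step 2: chi-square statistic
}

# Illustrative counts (assumed): three groups with equal expected proportions.
observed <- c(30, 50, 40)
p <- rep(1 / 3, 3)
stat <- gof_statistic(observed, p)
stat  # expected count is 40 per group, so the statistic is 100/40 + 100/40 + 0 = 5
```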

Step 3: Find the critical chi-square value

Find the critical chi-square value in a chi-square critical value table or using statistical software. The critical value is calculated from a chi-square distribution. To find the critical chi-square value, you’ll need to know two things:

The degrees of freedom (df):

For chi-square goodness of fit tests, the df is the number of groups minus one.

Significance level (alpha):

By convention, the significance level is usually 0.05.
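Instead of a printed table, the critical value can be obtained with base R's `qchisq()`. The sketch below assumes five groups, which is an illustrative choice; the variable names are hypothetical.

```r
alpha <- 0.05   # significance level
k <- 5          # number of groups (assumed for illustration)
df <- k - 1     # degrees of freedom for a goodness of fit test

# qchisq() expects a left-tail probability, so use 1 - alpha.
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)  # 9.488

# Decision rule: reject H0 when the test statistic exceeds the critical value.
reject_h0 <- function(statistic) statistic > critical_value
```

For example, `reject_h0(5)` returns `FALSE` because 5 is below 9.488.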


Simulated Dataset (Candy Data)

Flavor      Number of Pieces of Candy   Expected Number of Pieces of Candy
Apple       16                          20
Lime        25                          20
Mango       12                          20
Chocolate   24                          20
Grape       22                          20

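Worked calculation (10 bags), where O is the observed and E the expected number of pieces of candy:

Flavor      O    E    O - E (difference)   (O - E)^2   (O - E)^2 / E
Apple       16   20   16 - 20 = -4         16          16 / 20 = 0.80
Lime        25   20   25 - 20 = 5          25          25 / 20 = 1.25
Mango       12   20   12 - 20 = -8         64          64 / 20 = 3.20
Chocolate   24   20   24 - 20 = 4          16          16 / 20 = 0.80
Grape       22   20   22 - 20 = 2          4           4 / 20 = 0.20

Test statistic (sum of the last column) = 0.80 + 1.25 + 3.20 + 0.80 + 0.20 = 6.25

This hand calculation uses the rounded expected count of 20. The observed counts sum to 99, so the exact expected count is 99 × 0.2 = 19.8 per flavor, which gives a test statistic of about 6.303.

Table: Chi-Square Probabilities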

The areas given across the top are the areas to the right of the critical value. To look up an area on the left, subtract it from one and look up the result (i.e., 0.05 on the left is 0.95 on the right).

df   0.995   0.99    0.975   0.95    0.90    0.10     0.05     0.025    0.01     0.005
1    ---     ---     0.001   0.004   0.016   2.706    3.841    5.024    6.635    7.879
2    0.010   0.020   0.051   0.103   0.211   4.605    5.991    7.378    9.210    10.597
3    0.072   0.115   0.216   0.352   0.584   6.251    7.815    9.348    11.345   12.838
4    0.207   0.297   0.484   0.711   1.064   7.779    9.488    11.143   13.277   14.860
5    0.412   0.554   0.831   1.145   1.610   9.236    11.070   12.833   15.086   16.750
6    0.676   0.872   1.237   1.635   2.204   10.645   12.592   14.449   16.812   18.548
7    0.989   1.239   1.690   2.167   2.833   12.017   14.067   16.013   18.475   20.278
8    1.344   1.646   2.180   2.733   3.490   13.362   15.507   17.535   20.090   21.955
9    1.735   2.088   2.700   3.325   4.168   14.684   16.919   19.023   21.666   23.589
10   2.156   2.558   3.247   3.940   4.865   15.987   18.307   20.483   23.209   25.188

(--- marks values that are smaller than 0.0005 and so cannot be shown at three decimals.)
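Any row of the table can be regenerated with `qchisq()`. The sketch below rebuilds the df = 4 row and shows the left-tail/right-tail conversion described above; the variable names are hypothetical.

```r
# Right-tail areas, in the same order as the table header.
right_tail <- c(0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, 0.005)

# lower.tail = FALSE tells qchisq() that the probabilities are right-tail areas.
row_df4 <- round(qchisq(right_tail, df = 4, lower.tail = FALSE), 3)
row_df4

# A right-tail area of 0.05 is the same lookup as a left-tail area of 0.95.
same_value <- isTRUE(all.equal(qchisq(0.05, df = 4, lower.tail = FALSE),
                               qchisq(0.95, df = 4)))
same_value
```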

R Code:

Candy_observed_frequency <- c(16, 25, 12, 24, 22)
Candy_expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)

The expected probabilities must sum to 1.

Use the chi-square goodness of fit test to check whether the observed counts fit the expected distribution.

H0: Each of the five flavors makes up 20% of the candy.

H1: At least one flavor’s share differs from 20%.

chisq.test(x = Candy_observed_frequency, p = Candy_expected_probability)
## 
##  Chi-squared test for given probabilities
## 
## data:  Candy_observed_frequency
## X-squared = 6.303, df = 4, p-value = 0.1776
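The object returned by `chisq.test()` can be inspected to see what the test actually used. The sketch below stores the result under the hypothetical name `candy_test`. Note that the observed counts sum to 99, so R works with an expected count of 99 × 0.2 = 19.8 per flavor rather than the rounded 20 shown in the table above.

```r
Candy_observed_frequency <- c(16, 25, 12, 24, 22)
Candy_expected_probability <- rep(0.2, 5)

candy_test <- chisq.test(x = Candy_observed_frequency,
                         p = Candy_expected_probability)

sum(Candy_observed_frequency)  # 99 pieces in total
candy_test$expected            # 19.8 for every flavor
unname(candy_test$statistic)   # about 6.303
unname(candy_test$parameter)   # 4 degrees of freedom
```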

Inference:

A chi-square to p-value calculator confirms that the p-value for X-squared = 6.303 with 4 degrees of freedom is 0.1776.

We cannot reject the null hypothesis since the p-value is not less than 0.05, which means we don’t have enough evidence to conclude that the candy flavors are not equally distributed in the bag.
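The same p-value can be obtained in R without an external calculator by taking the right-tail area of the chi-square distribution with `pchisq()`. The variable names below are hypothetical.

```r
x_squared <- 6.303  # test statistic reported by chisq.test()
df <- 4             # number of flavors minus one

# Right-tail probability of observing a statistic at least this large under H0.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
round(p_value, 4)   # 0.1776

p_value < 0.05      # FALSE, so H0 is not rejected at the 5% level
```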


Simulated Dataset (data1)

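library(stats)  # attached by default; loaded explicitly for clarity
set.seed(102)
data1 <- rpois(200, 5)  # 200 draws from a Poisson distribution with mean 5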
data1
##   [1]  5  5  7  5  2  4 10  4  8  4  8  6  7  4  5  8  6  4 10  7  8  5  5  6  4
##  [26]  4  2  4  4  5  4  8  2  7  9  4  2  7  6  4  7  6  4  4  2  4  3  5  4  7
##  [51]  4  5  5  3  5  4  7  4 10  9  8  2  5  6  3  6  4  1  8  5 10  5  9  4  4
##  [76]  5  5  1 13  4  6  2  5  5  3  3  4  6  6  3  3  4 12  5  3  4  6  4  4  4
## [101]  8  4  7  5  2 10  5  7  3  4  6  2  4  4  9  4  4  3  5  0  3  4  4  5  5
## [126]  1  9  6 10  8  7  7  4  4  3  5  5  5  6  7  4  3  6  2  4  4  5  8  8  9
## [151] 10  3  2  5  4  9  5  7  4  4  5  4  1  6  2  4  4  6  7  1  4  7  1  4  5
## [176]  9  3  7  5  4  4  5  2  3  4  4  3  4  8  6  5 10  3  4  3  4  5  4  6  2
data1_table <- table(data1)  # frequency of each distinct value
data1_table
## data1
##  0  1  2  3  4  5  6  7  8  9 10 12 13 
##  1  6 14 19 58 36 19 17 12  8  8  1  1

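Chi-square test of whether the simulated dataset (data1) comes from a uniform distribution

# table() keeps only values that occur (11 never does), so count the categories directly
k <- length(data1_table)
chisq.test(data1_table, p = rep(1/k, k))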
## 
##  Chi-squared test for given probabilities
## 
## data:  data1_table
## X-squared = 201.57, df = 12, p-value < 2.2e-16
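The reported statistic can be checked by hand from the frequency table. The sketch below copies the counts from the `data1_table` output above rather than re-running the simulation; the variable names are hypothetical.

```r
# Counts for the values 0-10, 12 and 13 (11 never occurs in the sample).
counts <- c(1, 6, 14, 19, 58, 36, 19, 17, 12, 8, 8, 1, 1)

k <- length(counts)          # 13 observed categories
expected <- sum(counts) / k  # 200 / 13, about 15.38 per category under uniformity

x_squared <- sum((counts - expected)^2 / expected)
round(x_squared, 2)          # 201.57
k - 1                        # 12 degrees of freedom
```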

Inference: Homework


Simulated Dataset 2

A vendor claims that an equal number of customers visit the shop every day. To test this claim, a corporate executive records the number of customers who visit the shop in a given week and finds the following.

Monday: 258 customers, Tuesday: 235 customers, Wednesday: 260 customers, Thursday: 235 customers, and Friday: 224 customers

Objective: Perform the chi-square goodness of fit test in R to evaluate whether the data are consistent with the vendor’s claim.

observed_frequency <- c(258, 235, 260, 235, 224)
expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)

The expected probabilities must sum to 1.

Use the chi-square goodness of fit test to check whether the observed counts fit the expected distribution.

H0: The same share of customers (20%) visits the shop on each of the five days.

H1: At least one day’s share of customers differs from 20%.

chisq.test(x = observed_frequency, p = expected_probability)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_frequency
## X-squared = 4.1304, df = 4, p-value = 0.3887

Inference: A chi-square to p-value calculator confirms that the p-value for X-squared = 4.13 with 4 degrees of freedom is 0.3887.

We cannot reject the null hypothesis since the p-value is not less than 0.05, which means we don’t have enough evidence to conclude that the true customer distribution differs from the vendor’s claimed distribution.
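To see which days contribute most to the statistic, the Pearson residuals stored in the test result can be inspected. The sketch below uses the counts from the R code above; the day labels on the vector and the name `shop_test` are added here for readability and are assumptions.

```r
observed_frequency <- c(Monday = 258, Tuesday = 235, Wednesday = 260,
                        Thursday = 235, Friday = 224)

shop_test <- chisq.test(x = observed_frequency, p = rep(0.2, 5))

shop_test$expected             # 242.4 customers per day under H0
round(shop_test$residuals, 2)  # Pearson residuals: (O - E) / sqrt(E)

# The squared residuals add up to the test statistic.
sum(shop_test$residuals^2)
```

Residuals well beyond about ±2 would point to days that deviate noticeably from the claim; a non-significant test such as this one is not expected to show any.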


References:

https://www.jmp.com/en_sg/statistics-knowledge-portal/chi-square-test/chi-square-goodness-of-fit-test.html

https://people.richland.edu/james/lecture/m170/tbl-chi.html

https://www.tutorialspoint.com/how-to-perform-chi-square-test-for-goodness-of-fit-in-r

https://www.r-bloggers.com/2022/01/chi-square-goodness-of-fit-formula-in-r/

https://www.scribbr.com/statistics/chi-square-goodness-of-fit/