The chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution. It is often used to evaluate whether sample data is representative of the full population.
The chi-square goodness of fit test evaluates two hypotheses: the null hypothesis and the alternative hypothesis. They’re two competing answers to the question “Was the sample drawn from a population that follows the specified distribution?”
Null hypothesis (H0): The population follows the specified distribution.
Alternative hypothesis (Ha): The population does not follow the specified distribution.
These are general hypotheses that apply to all chi-square goodness of fit tests. You should make your hypotheses more specific by describing the “specified distribution.” You can name the probability distribution (e.g., Poisson distribution) or give the expected proportions of each group.
Steps
Find the critical chi-square value in a chi-square critical value table or using statistical software. The critical value is calculated from a chi-square distribution. To find the critical chi-square value, you’ll need to know two things: the degrees of freedom (df) and the significance level (α).
Degrees of freedom (df): For chi-square goodness of fit tests, the df is the number of groups minus one.
Significance level (α): By convention, the significance level is usually 0.05.
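As a rough sketch of the "statistical software" route, the critical value can be read from the chi-square quantile function `qchisq()` in base R. The group count of 5 is an assumption chosen for illustration; substitute the number of groups in your own data.

```r
# Assumed for illustration: 5 groups, so df = 5 - 1 = 4
n_groups <- 5
df <- n_groups - 1

# Conventional significance level
alpha <- 0.05

# Critical value: the point with area alpha to its right
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)
```

A test statistic larger than this critical value would lead to rejecting the null hypothesis at the chosen significance level.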
Flavor | Number of Pieces of Candy | Expected Number of Pieces of Candy |
---|---|---|
Apple | 16 | 20 |
Lime | 25 | 20 |
Mango | 12 | 20 |
Chocolate | 24 | 20 |
Grape | 22 | 20 |
Flavor | Number of Pieces of Candy (10 bags) | Expected Number of Pieces of Candy | Observed - Expected (= Difference) | Squared Difference | Squared Difference / Expected Number | Test Statistic (= Sum of Previous Column) |
---|---|---|---|---|---|---|
Apple | 15 | 20 | 15-20 = -5 | | | |
Lime | 25 | 20 | 25-20 = 5 | | | |
Mango | 12 | 20 | 12-20 = -8 | | | |
Chocolate | 24 | 20 | 24-20 = 4 | | | |
Grape | 23 | 20 | 23-20 = 3 | | | |
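
The remaining columns of this worked table can be filled in column by column. The sketch below is one way to do it in R, using the observed counts exactly as they appear in the table (15 for Apple, 23 for Grape) and taking the expected count of 20 per flavor as given; the variable names are illustrative only.

```r
# Counts copied from the worked table; expected count of 20 is taken as stated there
flavor   <- c("Apple", "Lime", "Mango", "Chocolate", "Grape")
observed <- c(15, 25, 12, 24, 23)
expected <- rep(20, length(observed))

# Column by column, as in the table header
difference         <- observed - expected            # Observed - Expected
squared_difference <- difference^2                   # Squared Difference
contribution       <- squared_difference / expected  # Squared Difference / Expected

# Test statistic = sum of the previous column
test_statistic <- sum(contribution)

worked_table <- data.frame(flavor, observed, expected,
                           difference, squared_difference, contribution)
print(worked_table)
test_statistic
```

With these inputs the statistic comes to 6.95, which is below the df = 4, 0.05 critical value of 9.488.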
Table: Chi-Square Probabilities
The areas given across the top are the areas to the right of the critical value. To look up an area on the left, subtract it from one and look up the result (i.e., an area of 0.05 on the left corresponds to 0.95 on the right).
df | 0.995 | 0.99 | 0.975 | 0.95 | 0.90 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 |
---|---|---|---|---|---|---|---|---|---|---|
1 | — | — | 0.001 | 0.004 | 0.016 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879 |
2 | 0.010 | 0.020 | 0.051 | 0.103 | 0.211 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597 |
3 | 0.072 | 0.115 | 0.216 | 0.352 | 0.584 | 6.251 | 7.815 | 9.348 | 11.345 | 12.838 |
4 | 0.207 | 0.297 | 0.484 | 0.711 | 1.064 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860 |
5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 9.236 | 11.070 | 12.833 | 15.086 | 16.750 |
6 | 0.676 | 0.872 | 1.237 | 1.635 | 2.204 | 10.645 | 12.592 | 14.449 | 16.812 | 18.548 |
7 | 0.989 | 1.239 | 1.690 | 2.167 | 2.833 | 12.017 | 14.067 | 16.013 | 18.475 | 20.278 |
8 | 1.344 | 1.646 | 2.180 | 2.733 | 3.490 | 13.362 | 15.507 | 17.535 | 20.090 | 21.955 |
9 | 1.735 | 2.088 | 2.700 | 3.325 | 4.168 | 14.684 | 16.919 | 19.023 | 21.666 | 23.589 |
10 | 2.156 | 2.558 | 3.247 | 3.940 | 4.865 | 15.987 | 18.307 | 20.483 | 23.209 | 25.188 |
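
If no printed table is at hand, the same values can be regenerated with `qchisq()`. This is a minimal sketch; the object names are illustrative, and `lower.tail = FALSE` is used so that the column headings are right-tail areas, matching the table's convention.

```r
# Right-tail areas (column headings) and degrees of freedom (rows)
right_tail_areas <- c(0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, 0.005)
df_values <- 1:10

# One critical value for every (df, area) pair
critical_table <- outer(
  df_values, right_tail_areas,
  function(d, a) qchisq(a, df = d, lower.tail = FALSE)
)
dimnames(critical_table) <- list(df = df_values, area = right_tail_areas)
round(critical_table, 3)

# Left area 0.05 is the same point as right area 0.95 (df = 4 here)
left_lookup  <- qchisq(0.05, df = 4)                      # area to the left
right_lookup <- qchisq(0.95, df = 4, lower.tail = FALSE)  # area to the right
```

Both lookups in the last two lines return the df = 4 entry under the 0.95 column.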
Candy_observed_frequency <- c(16, 25, 12, 24, 22)
Candy_expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)
The expected probabilities must sum to 1.
Use the chi-square goodness of fit test to check whether the observed counts fit the expected distribution.
H0: A variable follows a hypothesized distribution.
H1: A variable does not follow a hypothesized distribution.
chisq.test(x= Candy_observed_frequency, p= Candy_expected_probability)
##
## Chi-squared test for given probabilities
##
## data: Candy_observed_frequency
## X-squared = 6.303, df = 4, p-value = 0.1776
Inference:
The p-value for X² = 6.303 with 4 degrees of freedom is 0.1776; a chi-square to p-value calculator gives the same result.
Because the p-value is not less than 0.05, we fail to reject the null hypothesis: we don’t have enough evidence to conclude that the candy flavors are unevenly distributed in the bag.
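To see where the reported statistic and p-value come from, the calculation can be reproduced by hand in R. This is a sketch rather than what `chisq.test()` does internally; the variable names are illustrative. Note that the expected counts are the total count (99) times each probability, so 19.8 per flavor.

```r
# Same inputs as the chisq.test() call above
observed <- c(16, 25, 12, 24, 22)
probs    <- rep(0.2, 5)

# Expected counts under H0: total count times each probability
expected <- sum(observed) * probs

# Chi-square statistic: sum of (O - E)^2 / E
statistic <- sum((observed - expected)^2 / expected)

# p-value: right-tail area of the chi-square distribution with df = groups - 1
p_value <- pchisq(statistic, df = length(observed) - 1, lower.tail = FALSE)

round(statistic, 3)
round(p_value, 4)
```

`pchisq()` with `lower.tail = FALSE` plays the role of a chi-square to p-value calculator.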
require(stats)
set.seed(102)
data1<-rpois(200,5)
data1
## [1] 5 5 7 5 2 4 10 4 8 4 8 6 7 4 5 8 6 4 10 7 8 5 5 6 4
## [26] 4 2 4 4 5 4 8 2 7 9 4 2 7 6 4 7 6 4 4 2 4 3 5 4 7
## [51] 4 5 5 3 5 4 7 4 10 9 8 2 5 6 3 6 4 1 8 5 10 5 9 4 4
## [76] 5 5 1 13 4 6 2 5 5 3 3 4 6 6 3 3 4 12 5 3 4 6 4 4 4
## [101] 8 4 7 5 2 10 5 7 3 4 6 2 4 4 9 4 4 3 5 0 3 4 4 5 5
## [126] 1 9 6 10 8 7 7 4 4 3 5 5 5 6 7 4 3 6 2 4 4 5 8 8 9
## [151] 10 3 2 5 4 9 5 7 4 4 5 4 1 6 2 4 4 6 7 1 4 7 1 4 5
## [176] 9 3 7 5 4 4 5 2 3 4 4 3 4 8 6 5 10 3 4 3 4 5 4 6 2
data1_table<-table(data1)
data1_table
## data1
## 0 1 2 3 4 5 6 7 8 9 10 12 13
## 1 6 14 19 58 36 19 17 12 8 8 1 1
Chi-square test of whether the simulated dataset (data1) comes from a uniform distribution
m <- length(data1_table)  # number of observed categories (13); max(data1) equals this only by coincidence
chisq.test(data1_table, p = rep(1/m, m))
##
## Chi-squared test for given probabilities
##
## data: data1_table
## X-squared = 201.57, df = 12, p-value < 2.2e-16
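The test above only covers the 13 values that actually occur, so the value 11 (observed zero times) is left out. One possible refinement, sketched here under the assumption that "uniform" means uniform over every integer from 0 to the maximum, is to tabulate with explicit factor levels so that empty categories are kept. The counts are copied from the data1_table output so the block runs on its own.

```r
# Counts copied from the data1_table output (values 0..10, 12, 13)
values <- c(0:10, 12, 13)
counts <- c(1, 6, 14, 19, 58, 36, 19, 17, 12, 8, 8, 1, 1)

# Rebuild the raw sample and tabulate over every integer 0..max, keeping zero counts
rebuilt    <- rep(values, times = counts)
full_table <- table(factor(rebuilt, levels = 0:max(rebuilt)))

# Uniform probabilities over all categories, including the empty one
k <- length(full_table)
result <- chisq.test(full_table, p = rep(1 / k, k))
result
```

With 14 categories the test has 13 degrees of freedom instead of 12, and the statistic changes accordingly.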
Inference: Homework
A vendor claims that an equal number of customers enter the shop each weekday. To test this claim, an executive records the number of customers who visit the shop in a given week and finds the following.
Monday: 258 customers, Tuesday: 235 customers, Wednesday: 260 customers, Thursday: 235 customers, and Friday: 224 customers
Objective: Perform the chi-square goodness of fit test in R to evaluate whether the data are consistent with the vendor’s claim.
observed_frequency <- c(258, 235, 260, 235, 224)
expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)
The expected probabilities must sum to 1.
Use the chi-square goodness of fit test to check whether the observed counts fit the expected distribution.
H0: A variable follows a hypothesized distribution.
H1: A variable does not follow a hypothesized distribution.
chisq.test(x= observed_frequency, p= expected_probability)
##
## Chi-squared test for given probabilities
##
## data: observed_frequency
## X-squared = 4.1304, df = 4, p-value = 0.3887
Inference: The p-value for X² = 4.13 with 4 degrees of freedom is 0.3887; a chi-square to p-value calculator gives the same result.
Because the p-value is not less than 0.05, we fail to reject the null hypothesis: we don’t have enough evidence to conclude that the true distribution of customers differs from the one the vendor claims.
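Rather than reading the numbers off the printed output, the pieces of the result can be pulled from the object that `chisq.test()` returns. The sketch below uses the same inputs as the call above; `alpha` and `reject_null` are illustrative names, not part of the original analysis.

```r
# Same inputs as the chisq.test() call above
observed_frequency   <- c(258, 235, 260, 235, 224)
expected_probability <- rep(0.2, 5)

result <- chisq.test(x = observed_frequency, p = expected_probability)

result$expected    # expected count per day under H0
result$statistic   # X-squared
result$parameter   # degrees of freedom
result$p.value     # p-value

# Decision rule at the conventional 0.05 level
alpha <- 0.05
reject_null <- result$p.value < alpha
reject_null
```

Storing the decision as a logical value makes it easy to reuse the same check for other datasets or significance levels.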