Pearson Chi-Square Goodness of Fit

Chi-square goodness of fit test hypotheses (Pearson’s chi-square test)

The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.

The chi-square goodness of fit test evaluates two hypotheses: the null and the alternative. They are two competing answers to the question “Was the sample drawn from a population that follows the specified distribution?”

Null hypothesis (H0): The population follows the specified distribution.

Alternative hypothesis (Ha): The population does not follow the specified distribution.

These are general hypotheses that apply to all chi-square goodness of fit tests. Make your hypotheses more specific by describing the “specified distribution”: either name the probability distribution (e.g., a Poisson distribution) or give the expected proportion of each group.

Steps

Step 1: Calculate the expected frequencies

The expected frequency of each group is the total number of observations multiplied by that group’s expected proportion.

Step 2: Calculate chi-square

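$\chi^{2} = \sum_{i = 1}^{n} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$

Here $O_{i}$ and $E_{i}$ are the observed and expected frequencies of group $i$, and $n$ is the number of groups.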

Step 3: Find the critical chi-square value

Find the critical chi-square value in a chi-square critical value table or using statistical software. The critical value is calculated from a chi-square distribution. To find the critical chi-square value, you’ll need to know two things:

The degrees of freedom (df):

For chi-square goodness of fit tests, the df is the number of groups minus one.

Significance level (alpha):

By convention, the significance level is usually 0.05.
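
As a minimal sketch of Step 3 in R, the critical value can be taken from the chi-square quantile function `qchisq()` rather than a printed table. The variable names and the choice of five groups are illustrative assumptions, not part of the original steps.

```r
# Illustrative assumption: a test with 5 groups at the conventional 5% level
n_groups <- 5
df <- n_groups - 1        # degrees of freedom = number of groups - 1
alpha <- 0.05             # significance level

# Upper-tail critical value: the point with area alpha to its right
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)  # 9.488
```

The null hypothesis is rejected only when the chi-square statistic from Step 2 exceeds this critical value.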

Simulated Dataset (Candy Data)

Flavor	Number of Pieces of Candy	Expected Number of Pieces of Candy
Apple	16	20
Lime	25	20
Mango	12	20
Chocolate	24	20
Grape	22	20

Flavor	Number of Pieces of Candy (10 bags)	Expected Number of Pieces of Candy	Observed - Expected (= Difference)
Apple	16	20	16-20 = -4
Lime	25	20	25-20 = 5
Mango	12	20	12-20 = -8
Chocolate	24	20	24-20 = 4
Grape	22	20	22-20 = 2
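
The chi-square statistic squares each difference, divides it by the expected count, and sums the results. The sketch below does this by hand in R for the observed counts in the first candy table; the variable names are illustrative. The statistic is computed twice: once with the tabulated expected count of 20, and once with expected counts derived from the actual total of 99 pieces, which is how R's `chisq.test()` obtains them from the probabilities.

```r
# Observed counts from the first candy table (Apple, Lime, Mango, Chocolate, Grape)
observed <- c(16, 25, 12, 24, 22)

# Expected count of 20 per flavor, as listed in the table
expected_table <- rep(20, 5)
chi_sq_table <- sum((observed - expected_table)^2 / expected_table)
chi_sq_table             # 6.25

# Expected counts from the actual total: 99 pieces * 0.2 = 19.8 per flavor
expected_total <- sum(observed) * rep(0.2, 5)
chi_sq_total <- sum((observed - expected_total)^2 / expected_total)
round(chi_sq_total, 3)   # 6.303
```

The two values differ slightly because the observed counts sum to 99, not 100.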

Table: Chi-Square Probabilities

The areas given across the top of the table are the areas to the right of the critical value. To look up an area on the left, subtract it from one and look up the result (i.e., an area of 0.05 on the left corresponds to 0.95 on the right).

df	0.995	0.99	0.975	0.95	0.90	0.10	0.05	0.025	0.01	0.005
1	—	—	0.001	0.004	0.016	2.706	3.841	5.024	6.635	7.879
2	0.010	0.020	0.051	0.103	0.211	4.605	5.991	7.378	9.210	10.597
3	0.072	0.115	0.216	0.352	0.584	6.251	7.815	9.348	11.345	12.838
4	0.207	0.297	0.484	0.711	1.064	7.779	9.488	11.143	13.277	14.860
5	0.412	0.554	0.831	1.145	1.610	9.236	11.070	12.833	15.086	16.750
6	0.676	0.872	1.237	1.635	2.204	10.645	12.592	14.449	16.812	18.548
7	0.989	1.239	1.690	2.167	2.833	12.017	14.067	16.013	18.475	20.278
8	1.344	1.646	2.180	2.733	3.490	13.362	15.507	17.535	20.090	21.955
9	1.735	2.088	2.700	3.325	4.168	14.684	16.919	19.023	21.666	23.589
10	2.156	2.558	3.247	3.940	4.865	15.987	18.307	20.483	23.209	25.188
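
Any entry of the table can also be reproduced with R's `qchisq()`, which by default takes the area to the left of the value. The sketch below, with illustrative variable names, shows the left/right conversion for df = 4.

```r
df <- 4

# Area 0.05 to the RIGHT of the value (column 0.05 of the table)
right_tail <- qchisq(0.05, df = df, lower.tail = FALSE)

# Area 0.05 to the LEFT of the value, i.e. area 0.95 to the right (column 0.95)
left_tail <- qchisq(0.05, df = df, lower.tail = TRUE)

round(c(right_tail, left_tail), 3)   # 9.488 0.711
```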

R Code:

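Candy_observed_frequency <- c(16, 25, 12, 24, 22)
Candy_expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)

The expected probabilities must sum to 1. Because the observed counts sum to 99, chisq.test computes each expected count as 99 × 0.2 = 19.8 rather than the 20 shown in the tables.

Use the chi-square goodness of fit test to check whether the observed counts fit the hypothesized distribution.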

H0: A variable follows a hypothesized distribution.

H1: A variable does not follow a hypothesized distribution.

chisq.test(x= Candy_observed_frequency, p= Candy_expected_probability)

## 
##  Chi-squared test for given probabilities
## 
## data:  Candy_observed_frequency
## X-squared = 6.303, df = 4, p-value = 0.1776

Inference:

The p-value for X² = 6.303 with 4 degrees of freedom is 0.1776, as reported by chisq.test; a chi-square to p-value calculator gives the same result.

Because the p-value is not less than 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that the candy flavors are unevenly distributed in the bag.
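
Instead of an external calculator, the same p-value and the equivalent critical-value decision can be obtained directly in R. This is a small sketch using the statistic and degrees of freedom reported above; the variable names are illustrative.

```r
chi_sq <- 6.303   # X-squared reported by chisq.test() above
df <- 4
alpha <- 0.05

# p-value: area to the right of the statistic under the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
round(p_value, 4)   # 0.1776

# Equivalent decision rule: reject H0 only if the statistic exceeds the critical value
critical_value <- qchisq(1 - alpha, df = df)
reject_h0 <- chi_sq > critical_value
reject_h0           # FALSE
```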

Simulated Dataset (data1)

require(stats)
set.seed(102)

data1<-rpois(200,5)
data1

##   [1]  5  5  7  5  2  4 10  4  8  4  8  6  7  4  5  8  6  4 10  7  8  5  5  6  4
##  [26]  4  2  4  4  5  4  8  2  7  9  4  2  7  6  4  7  6  4  4  2  4  3  5  4  7
##  [51]  4  5  5  3  5  4  7  4 10  9  8  2  5  6  3  6  4  1  8  5 10  5  9  4  4
##  [76]  5  5  1 13  4  6  2  5  5  3  3  4  6  6  3  3  4 12  5  3  4  6  4  4  4
## [101]  8  4  7  5  2 10  5  7  3  4  6  2  4  4  9  4  4  3  5  0  3  4  4  5  5
## [126]  1  9  6 10  8  7  7  4  4  3  5  5  5  6  7  4  3  6  2  4  4  5  8  8  9
## [151] 10  3  2  5  4  9  5  7  4  4  5  4  1  6  2  4  4  6  7  1  4  7  1  4  5
## [176]  9  3  7  5  4  4  5  2  3  4  4  3  4  8  6  5 10  3  4  3  4  5  4  6  2

data1_table<-table(data1)
data1_table

## data1
##  0  1  2  3  4  5  6  7  8  9 10 12 13 
##  1  6 14 19 58 36 19 17 12  8  8  1  1

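Chi-square test of whether the simulated dataset (data1) comes from a uniform distribution

# Use the number of categories in the table; max(data1) equals it here
# only because 0 is observed and 11 is not
k <- length(data1_table)

# Equal probability for each observed value
chisq.test(data1_table, p = rep(1/k, k))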

## 
##  Chi-squared test for given probabilities
## 
## data:  data1_table
## X-squared = 201.57, df = 12, p-value < 2.2e-16

Inference: Homework

Simulated Dataset 2

A vendor claims that an equal number of customers visit the shop each weekday. To test this claim, a researcher records the number of customers who visit the shop in a given week and finds the following.

Monday: 258 customers, Tuesday: 235 customers, Wednesday: 260 customers, Thursday: 235 customers, and Friday: 224 customers

Objective: Perform the chi-square goodness of fit test in R to evaluate whether the data are consistent with the vendor’s claim.

observed_frequency <- c(258, 235, 260, 235, 224)
expected_probability <- c(0.2, 0.2, 0.2, 0.2, 0.2)

The expected probabilities must sum to 1.

Use the chi-square goodness of fit test to check whether the observed counts fit the hypothesized distribution.

H0: A variable follows a hypothesized distribution.

H1: A variable does not follow a hypothesized distribution.

chisq.test(x= observed_frequency, p= expected_probability)

## 
##  Chi-squared test for given probabilities
## 
## data:  observed_frequency
## X-squared = 4.1304, df = 4, p-value = 0.3887

Inference: The p-value for X² = 4.1304 with 4 degrees of freedom is 0.3887, as reported by chisq.test; a chi-square to p-value calculator gives the same result.

Because the p-value is not less than 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that the true distribution of customers differs from the distribution the vendor claims.
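
The object returned by `chisq.test()` also stores the expected counts and residuals, which help show which days deviate most from the claim. A brief sketch using the observed counts from the code above; the name `result` is illustrative.

```r
observed_frequency <- c(258, 235, 260, 235, 224)
expected_probability <- rep(0.2, 5)

result <- chisq.test(x = observed_frequency, p = expected_probability)

result$expected    # expected count per day: total (1212) * 0.2 = 242.4
result$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
result$p.value     # the p-value shown in the printed output
```

Residuals with the largest absolute values mark the days that contribute most to the statistic.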

References:

https://www.jmp.com/en_sg/statistics-knowledge-portal/chi-square-test/chi-square-goodness-of-fit-test.html

https://people.richland.edu/james/lecture/m170/tbl-chi.html

https://www.tutorialspoint.com/how-to-perform-chi-square-test-for-goodness-of-fit-in-r

https://www.r-bloggers.com/2022/01/chi-square-goodness-of-fit-formula-in-r/

https://www.scribbr.com/statistics/chi-square-goodness-of-fit/