Introduction :-
In this report, I perform a chi-square test of independence on two categorical variables from a given dataset.
For a chi-square test, the input is the observed frequency (fo) contingency table between the two categorical variables.
Based on the marginal frequencies, we compute the expected frequency (fe) contingency table for those two variables:
\[f_e = \frac{\sum X_r \times \sum X_c}{N}\]
where \(\sum X_r\) is the row total, \(\sum X_c\) is the column total, and \(N\) is the grand total.
The observed table (fo) is then compared with the expected table (fe): residuals between the two tables are calculated for each cell using one of the following formulas.
\[\text{Normal residuals} = \frac{(f_o - f_e)^2}{f_e}\]
or
\[\text{Pearson residuals} = \frac{f_o - f_e}{\sqrt{f_e}}\]
The chi-square value of the observed frequency table is then the sum of the individual residuals over all \(C\) cells of the contingency table.
\[\chi^2 = \sum_{c=1}^{C}\frac{(f_o - f_e)^2}{f_e}\]
For given degrees of freedom, \(df = (r - 1)(c - 1)\) for an \(r \times c\) table, we can look up the critical \(\chi^2\) value in the standard \(\chi^2\) table at a pre-defined significance level:
\[\chi^2_{critical}(df, \alpha)\]
Null hypothesis (H0): the two categorical variables are independent of each other.
Alternative hypothesis (Ha): the two categorical variables are related.
If our computed \(\chi^2\) value is greater than the critical \(\chi^2(df, \alpha)\) (equivalently, if the p-value is less than \(\alpha\)), the test is statistically significant. So, we reject the null hypothesis: there is a relationship between the two categorical variables.
If our computed \(\chi^2\) value is less than the critical \(\chi^2(df, \alpha)\) (p-value greater than \(\alpha\)), the test is not statistically significant. So, we fail to reject the null hypothesis: the two categorical variables appear to be independent of each other.
\(\chi^2\) Test :-
The given contingency table (fo) for the two categorical variables is as follows.
## A B C
## red 70 80 50
## green 50 10 40
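As a minimal sketch, this table can be entered in R as a matrix; the object name m and the dimension names color and products match the test report and model call shown later.

```r
# Observed frequency table (fo) as a 2 x 3 matrix
m <- matrix(c(70, 80, 50,
              50, 10, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(color = c("red", "green"),
                            products = c("A", "B", "C")))
m
```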
The expected contingency table (fe) is as follows.
## A B C
## red 80 60 60
## green 40 30 30
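One way to reproduce this expected table in R, following the marginal-totals formula above, is (a sketch):

```r
# Expected frequencies: (row total x column total) / grand total
fe <- outer(rowSums(m), colSums(m)) / sum(m)
fe
# The same table is available as chisq.test(m)$expected
```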
Pearson Residuals are as follows.
## A B C
## red -1.118034 2.581989 -1.290994
## green 1.581139 -3.651484 1.825742
The same Pearson residuals can be shown in the following association plot.
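A sketch of how these residuals and the association plot can be produced in base R:

```r
# Pearson residuals: (fo - fe) / sqrt(fe)
chisq.test(m)$residuals
# Cohen-Friendly association plot of the Pearson residuals
assocplot(m)
```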
Standardized residuals are as follows.
## A B C
## red -2.5 5.345225 -2.672612
## green 2.5 -5.345225 2.672612
The same standardized residuals can be visualized in the following mosaic plot.
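A sketch of the corresponding base R calls:

```r
# Standardized residuals
chisq.test(m)$stdres
# Mosaic plot with cells shaded by the residuals
mosaicplot(m, shade = TRUE)
```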
The \(\chi^2\) test report is as follows.
##
## Pearson's Chi-squared test
##
## data: m
## X-squared = 28.75, df = 2, p-value = 5.715e-07
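This report is produced by a single call to chisq.test; as a check against the critical-value rule described earlier, we can also look up the critical value directly (a sketch):

```r
# Pearson's chi-squared test of independence on the observed table
chisq.test(m)

# Critical value at alpha = 0.05 with df = (2 - 1) * (3 - 1) = 2
qchisq(0.95, df = 2)  # 5.991; since 28.75 > 5.991, reject H0
```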
As the p-value (5.715e-07) is far smaller than 0.05, the test is statistically significant, and we can reject the null hypothesis.
There is a relationship between the two given categorical variables.
The structure of the complete \(\chi^2\) test object is as follows.
## List of 9
## $ statistic: Named num 28.8
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 2
## ..- attr(*, "names")= chr "df"
## $ p.value : num 5.72e-07
## $ method : chr "Pearson's Chi-squared test"
## $ data.name: chr "m"
## $ observed : num [1:2, 1:3] 70 50 80 10 50 40
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "red" "green"
## .. ..$ : chr [1:3] "A" "B" "C"
## $ expected : num [1:2, 1:3] 80 40 60 30 60 30
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "red" "green"
## .. ..$ : chr [1:3] "A" "B" "C"
## $ residuals: num [1:2, 1:3] -1.12 1.58 2.58 -3.65 -1.29 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "red" "green"
## .. ..$ : chr [1:3] "A" "B" "C"
## $ stdres : num [1:2, 1:3] -2.5 2.5 5.35 -5.35 -2.67 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "red" "green"
## .. ..$ : chr [1:3] "A" "B" "C"
## - attr(*, "class")= chr "htest"
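Because the result is an ordinary R list of class "htest", its components can be accessed directly, for example:

```r
fit <- chisq.test(m)
fit$statistic  # X-squared = 28.75
fit$p.value    # 5.715e-07
fit$expected   # expected frequency table (fe)
fit$stdres     # standardized residuals
```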
Log-Linear Model (Extension of the \(\chi^2\) Test) :-
There are a few drawbacks to the \(\chi^2\) test.
Only two-way cross tables can be analysed with a \(\chi^2\) test; a multi-way cross table cannot be analysed this way.
The \(\chi^2\) test tells us whether there is a relationship between the variables or not, but it does not convey which type of association exists between them.
To overcome these drawbacks, we have to extend the \(\chi^2\) test, which is possible with log-linear models.
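For reference, the independence model fitted below expresses the log of each expected cell count as a sum of main effects only, with no interaction term (using the variable names color and products from our table):
\[\log f_e = \lambda + \lambda_i^{\text{color}} + \lambda_j^{\text{products}}\]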
Our cross table is as follows.
## products
## color A B C
## red 70 80 50
## green 50 10 40
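A sketch of the model fit, matching the call shown in the summary below:

```r
library(MASS)
# Independence model: main effects only, no color:products interaction
fit.ll <- loglm(~ color + products, data = m)
fit.ll
```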
The summary of the fitted log-linear model is as follows.
## Call:
## loglm(formula = ~color + products, data = m)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 32.45926 2 8.944624e-08
## Pearson 28.75000 2 5.715008e-07
In this log-linear model analysis as well, the p-values for the likelihood-ratio \(\chi^2\) and the Pearson \(\chi^2\) are very low (far below 0.05).
So, we can reject the null hypothesis and conclude that there is a relationship between the two variables.
Thank you.