Chi-Square Test of Independence in R

The chi-square test of independence is used to analyze the frequency table (i.e. contengency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.

Data format Contingency Tables

# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
## Tidying      53          11       1      57
## Dishes       32          24       4      53

The data is a contingency table containing 13 housetasks and their distribution in the couple:

rows are the different tasks
values are the frequencies of the tasks done :
by the wife only
alternatively
by the husband only
or jointly

Graphical Display of contingency Tables

Contingency table can be visualized using the function balloonplot() [in gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

install gplots

library(gplots)
## Warning: package 'gplots' was built under R version 3.6.1
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package garphics:

library(graphics)
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

The argument shade is used to color the graph
The argument las = 2 produces vertical labels

Blue color indicates that the observed value is higher than the expected value if the data were random
Red color specifies that the observed value is lower than the expected value if the data were random

From this mosaic plot, it can be seen that the housetasks Laundry, Main_meal, Dinner and breakfeast (blue color) are mainly done by the wife in our example.
## Compute Chi-square test in R

chisq <- chisq.test(housetasks)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  housetasks
## X-squared = 1944.5, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value = 0).

# Observed counts
chisq$observed
##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
## Tidying      53          11       1      57
## Dishes       32          24       4      53
## Shopping     33          23       9      55
## Official     12          46      23      15
## Driving      10          51      75       3
## Finances     13          13      21      66
## Insurance     8           1      53      77
## Repairs       0           3     160       2
## Holidays      0           1       6     153
# Expected counts
round(chisq$expected,2)
##             Wife Alternating Husband Jointly
## Laundry    60.55       25.63   38.45   51.37
## Main_meal  52.64       22.28   33.42   44.65
## Dinner     37.16       15.73   23.59   31.52
## Breakfeast 48.17       20.39   30.58   40.86
## Tidying    41.97       17.77   26.65   35.61
## Dishes     38.88       16.46   24.69   32.98
## Shopping   41.28       17.48   26.22   35.02
## Official   33.03       13.98   20.97   28.02
## Driving    47.82       20.24   30.37   40.57
## Finances   38.88       16.46   24.69   32.98
## Insurance  47.82       20.24   30.37   40.57
## Repairs    56.77       24.03   36.05   48.16
## Holidays   55.05       23.30   34.95   46.70

Nature of dependence between row and column variables

As mentioned above the total Chi-square statistic is 1944.456196.

If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:

Cells with the highest absolute standardized residuals contribute the most to the total Chi-square score.

round(chisq$residuals, 3)
##              Wife Alternating Husband Jointly
## Laundry    12.266      -2.298  -5.878  -6.609
## Main_meal   9.836      -0.484  -4.917  -6.084
## Dinner      6.537      -1.192  -3.416  -3.299
## Breakfeast  4.875       3.457  -2.818  -5.297
## Tidying     1.702      -1.606  -4.969   3.585
## Dishes     -1.103       1.859  -4.163   3.486
## Shopping   -1.289       1.321  -3.362   3.376
## Official   -3.659       8.563   0.443  -2.459
## Driving    -5.469       6.836   8.100  -5.898
## Finances   -4.150      -0.852  -0.742   5.750
## Insurance  -5.758      -4.277   4.107   5.720
## Repairs    -7.534      -4.290  20.646  -6.651
## Holidays   -7.419      -4.620  -4.897  15.556
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
corrplot(chisq$residuals, is.cor = FALSE)

For a given cell, the size of the circle is proportional to the amount of the cell contribution.

Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
In the image above, it’s evident that there are an association between the column Wife and the rows Laundry, Main_meal.
There is a strong positive association between the column Husband and the row Repair
Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example the column Wife are negatively associated (~ “not associated”) with the row Repairs. There is a repulsion between the column Husband and, the rows Laundry and Main_meal

# Contibution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
##             Wife Alternating Husband Jointly
## Laundry    7.738       0.272   1.777   2.246
## Main_meal  4.976       0.012   1.243   1.903
## Dinner     2.197       0.073   0.600   0.560
## Breakfeast 1.222       0.615   0.408   1.443
## Tidying    0.149       0.133   1.270   0.661
## Dishes     0.063       0.178   0.891   0.625
## Shopping   0.085       0.090   0.581   0.586
## Official   0.688       3.771   0.010   0.311
## Driving    1.538       2.403   3.374   1.789
## Finances   0.886       0.037   0.028   1.700
## Insurance  1.705       0.941   0.868   1.683
## Repairs    2.919       0.947  21.921   2.275
## Holidays   2.831       1.098   1.233  12.445
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)

It can be seen that:

The column “Wife” is strongly associated with Laundry, Main_meal, Dinner
The column “Husband” is strongly associated with the row Repairs
The column jointly is frequently associated with the row Holidays

From the image above, it can be seen that the most contributing cells to the Chi-square are Wife/Laundry (7.74%), Wife/Main_meal (4.98%), Husband/Repairs (21.9%), Jointly/Holidays (12.44%).

These cells contribute about 47.06% to the total Chi-square score and thus account for most of the difference between expected and observed values.

This confirms the earlier visual interpretation of the data. As stated earlier, visual interpretation may be complex when the contingency table is very large. In this case, the contribution of one cell to the total Chi-square score becomes a useful way of establishing the nature of dependency.

Access to values returned by chisq.test()function

The result of chisq.test() function is a list containing the following components:

statistic: the value the chi-squared test statistic.
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count

# printing the p-value
chisq$p.value
## [1] 0
# printing the mean
chisq$estimate
## NULL