Today we’re going to be working with tables, specifically those in which the independent variable, the dependent variable, or both are categorical.

Chi Squared

Let’s first load a table of data:

file <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file, row.names = 1)
head(housetasks)
##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
## Tidying      53          11       1      57
## Dishes       32          24       4      53

This data set describes household chores and who does them. Each chore is split across 4 categories: done by the wife, by the husband, alternately, or jointly. I think the “alternating” category is awfully confusing, so let’s just get rid of it.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
house1 <- select(housetasks, -Alternating)
head(house1)
##            Wife Husband Jointly
## Laundry     156       2       4
## Main_meal   124       5       4
## Dinner       77       7      13
## Breakfeast   82      15       7
## Tidying      53       1      57
## Dishes       32       4      53

Let’s try to visualize some relationships to see if we can spot any patterns by eye first.

table <- as.table(as.matrix(house1))
install.packages("gplots", repos="http://cran.rstudio.com/", dependencies=TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'gplots' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alexander Chang\AppData\Local\Temp\RtmpgZPvOa\downloaded_packages
library(gplots)
## Warning: package 'gplots' was built under R version 3.5.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
balloonplot(t(table), main = "housetasks", xlab = "", ylab="",
            label = FALSE, show.margins = FALSE)

The dot size here represents the magnitude of the corresponding cell count. We see that the wife does most of tasks such as laundry and meals, whereas the husband does most of tasks such as driving and repairs. Let’s see if there’s a more informative way to depict this than dot size alone.

# mosaicplot() lives in the base graphics package, which ships with R and is attached by default, so nothing needs to be installed
mosaicplot(table, shade = TRUE, las=2, main = "housetasks")

Blue here indicates a positive association (more observations than expected), whereas red indicates a negative one. Interestingly enough, there definitely seems to be a share of work clearly done by the husband (driving, repairs), by the wife (laundry, main meal, dinner, breakfast), and jointly (holidays, finances, tidying, dishes, shopping). Insurance is split between the husband and jointly.

A chi-squared test checks whether the distribution of tasks across our three categories is independent of the task (any differences are due to chance) or dependent on it (there is a clear pattern). It does so by calculating the expected counts under independence and comparing them to the observed counts. A common rule of thumb is to use this test only when the expected counts in all cells are at least 5.
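To make the comparison of expected and observed counts concrete, here is a minimal sketch of the calculation done by hand. It retypes the full `house1` table from this post so that it runs on its own; the names `obs`, `expected_by_hand`, `stat_by_hand`, `df_by_hand`, and `p_by_hand` are made up for illustration. Under independence, each expected count is (row total × column total) / grand total.

```r
# Observed counts, retyped from the house1 table so the sketch is self-contained
obs <- matrix(
  c(156,   2,   4,
    124,   5,   4,
     77,   7,  13,
     82,  15,   7,
     53,   1,  57,
     32,   4,  53,
     33,   9,  55,
     12,  23,  15,
     10,  75,   3,
     13,  21,  66,
      8,  53,  77,
      0, 160,   2,
      0,   6, 153),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance", "Repairs",
      "Holidays"),
    c("Wife", "Husband", "Jointly")
  )
)

# Expected count for each cell: row total * column total / grand total
expected_by_hand <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
stat_by_hand <- sum((obs - expected_by_hand)^2 / expected_by_hand)

# Degrees of freedom: (rows - 1) * (columns - 1)
df_by_hand <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
p_by_hand <- pchisq(stat_by_hand, df = df_by_hand, lower.tail = FALSE)

# Rule-of-thumb check: the smallest expected count should be at least 5
min(expected_by_hand)
```

These hand-computed values should agree with what `chisq.test()` reports for the same table. Note that the rule of thumb is checked on the expected counts, which is why the zeros in the observed table are not a problem by themselves.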

chisq <- chisq.test(house1)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  house1
## X-squared = 1613.8, df = 24, p-value < 2.2e-16

We can see that the difference between the expected and observed counts is highly significant (p-value < 2.2e-16), so we reject the hypothesis that who does a task is independent of the task.

Let’s print the observed and expected counts to compare them by eye.

#observed 
chisq$observed
##            Wife Husband Jointly
## Laundry     156       2       4
## Main_meal   124       5       4
## Dinner       77       7      13
## Breakfeast   82      15       7
## Tidying      53       1      57
## Dishes       32       4      53
## Shopping     33       9      55
## Official     12      23      15
## Driving      10      75       3
## Finances     13      21      66
## Insurance     8      53      77
## Repairs       0     160       2
## Holidays      0       6     153
#expected
round(chisq$expected,1)
##            Wife Husband Jointly
## Laundry    65.2    41.4    55.3
## Main_meal  53.6    34.0    45.4
## Dinner     39.1    24.8    33.1
## Breakfeast 41.9    26.6    35.5
## Tidying    44.7    28.4    37.9
## Dishes     35.8    22.8    30.4
## Shopping   39.1    24.8    33.1
## Official   20.1    12.8    17.1
## Driving    35.4    22.5    30.1
## Finances   40.3    25.6    34.2
## Insurance  55.6    35.3    47.1
## Repairs    65.2    41.4    55.3
## Holidays   64.0    40.7    54.3

Which cells are contributing most to our Chi-Squared value? Let’s take a look.

install.packages("corrplot", repos = "http://cran.rstudio.com", dependencies = TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alexander Chang\AppData\Local\Temp\RtmpgZPvOa\downloaded_packages
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
corrplot(chisq$residuals, is.corr = FALSE)

How about seeing this as percentages? Each cell’s contribution is its squared residual divided by the chi-squared statistic.

conperc <- 100*chisq$residuals^2/chisq$statistic
round(conperc,2)
##            Wife Husband Jointly
## Laundry    7.83    2.33    2.95
## Main_meal  5.74    1.53    2.34
## Dinner     2.28    0.79    0.76
## Breakfeast 2.38    0.31    1.42
## Tidying    0.10    1.64    0.60
## Dishes     0.03    0.96    1.04
## Shopping   0.06    0.62    0.89
## Official   0.20    0.51    0.02
## Driving    1.13    7.59    1.51
## Finances   1.14    0.05    1.84
## Insurance  2.52    0.55    1.17
## Repairs    4.04   21.03    3.19
## Holidays   3.97    1.83   11.11
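As a small sanity check on the percentage calculation, the sketch below applies the same formula to a hypothetical 3 x 2 table (the counts and the names `toy`, `res`, `contrib`, and `ranked` are invented for illustration). Because the squared Pearson residuals add up to the chi-squared statistic, the contributions should sum to 100, and sorting them gives a quick ranking of the cells.

```r
# Hypothetical counts, used only to illustrate the calculation
toy <- matrix(c(30, 10,
                12, 28,
                20, 20),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("X", "Y")))

res <- chisq.test(toy)

# Percentage contribution of each cell to the chi-squared statistic
contrib <- 100 * res$residuals^2 / res$statistic

# The contributions add up to 100 (up to floating-point error)
sum(contrib)

# Flatten to a data frame and rank cells from largest to smallest contribution
ranked <- as.data.frame(as.table(contrib))
ranked <- ranked[order(ranked$Freq, decreasing = TRUE), ]
ranked
```

The same `as.data.frame(as.table(...))` step could be applied to `conperc` to list the housetasks cells in order of contribution.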

We see that the largest contributions come from the wife doing the laundry and main meal, the husband doing the repairs and driving, and the couple jointly planning the holidays. This is easier to see from the numbers than from the visuals.

If any cells had an expected count below 5, we would use Fisher’s exact test instead.
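For completeness, here is a minimal sketch of that fallback using `fisher.test()` from base R's stats package. The 2 x 2 table is hypothetical (the counts and the names `small`, `small_expected`, and `fit` are invented for illustration) and is deliberately small enough that every expected count falls below 5.

```r
# Hypothetical small table, used only for illustration
small <- matrix(c(3, 1,
                  1, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(Task = c("T1", "T2"),
                                Who  = c("Wife", "Husband")))

# chisq.test() warns that its approximation may be incorrect for counts
# this small; we only want the expected counts, so the warning is silenced
small_expected <- suppressWarnings(chisq.test(small))$expected
any(small_expected < 5)

# Fisher's exact test does not rely on the large-sample approximation
fit <- fisher.test(small)
fit$p.value

# For larger r x c tables the exact computation can be slow or run out of
# workspace; one option is a Monte Carlo p-value, for example:
# fisher.test(x, simulate.p.value = TRUE, B = 10000)
```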