Let’s first load a table of data:
file <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file, row.names = 1)
head(housetasks)
## Wife Alternating Husband Jointly
## Laundry 156 14 2 4
## Main_meal 124 20 5 4
## Dinner 77 11 7 13
## Breakfeast 82 36 15 7
## Tidying 53 11 1 57
## Dishes 32 24 4 53
This data set describes household chores and who does them. Each chore is split across 4 categories: done by the wife, by the husband, alternately, or jointly. I think the “Alternating” category is awfully confusing, so let’s just get rid of it.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
house1 <- select(housetasks, -Alternating)
head(house1)
## Wife Husband Jointly
## Laundry 156 2 4
## Main_meal 124 5 4
## Dinner 77 7 13
## Breakfeast 82 15 7
## Tidying 53 1 57
## Dishes 32 4 53
Let’s try to visualize some relationships to see if we can spot any differences by eye first.
table <- as.table(as.matrix(house1))
install.packages("gplots", repos="http://cran.rstudio.com/", dependencies=TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'gplots' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alexander Chang\AppData\Local\Temp\RtmpgZPvOa\downloaded_packages
library(gplots)
## Warning: package 'gplots' was built under R version 3.5.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
balloonplot(t(table), main = "housetasks", xlab = "", ylab="",
label = FALSE, show.margins = FALSE)
The dot size here represents the magnitude of the corresponding cell. We see that the wife does a lot of the tasks in the top rows, whereas the husband does more of the tasks in the bottom rows. Let’s see if there’s a more accurate way to depict this than just dot size.
# mosaicplot() lives in the graphics package, which ships with base R, so there is nothing to install
mosaicplot(table, shade = TRUE, las=2, main = "housetasks")
Blue here indicates a positive association, whereas red indicates a negative one. Interestingly enough, there definitely seems to be a share of work clearly done by the husband (driving, insurance, repairs), by the wife (laundry, main meal, dinner, breakfast), and by both (tidying, dishes, shopping, holidays).
A chi-square test checks whether the distribution of the tasks across our three categories is independent (any differences are down to chance) or dependent (follows a clear pattern). It does so by calculating the expected counts under independence and comparing them to the observed counts. A common rule of thumb is to use this test only when the expected count in every cell is at least 5.
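To make the expected-value calculation concrete, here is a minimal sketch that rebuilds the three-column table by hand and computes the expected counts and Pearson's statistic directly. The counts are typed in from the house1 table in this post; the object names (obs, expected, x2) are my own and are not part of the original analysis.

```r
# Observed counts for house1 (Wife, Husband, Jointly), typed in by hand
obs <- matrix(
  c(156,   2,   4,
    124,   5,   4,
     77,   7,  13,
     82,  15,   7,
     53,   1,  57,
     32,   4,  53,
     33,   9,  55,
     12,  23,  15,
     10,  75,   3,
     13,  21,  66,
      8,  53,  77,
      0, 160,   2,
      0,   6, 153),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("Laundry", "Main_meal", "Dinner", "Breakfeast", "Tidying", "Dishes",
      "Shopping", "Official", "Driving", "Finances", "Insurance",
      "Repairs", "Holidays"),
    c("Wife", "Husband", "Jointly")
  )
)

# Under independence: expected = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson's statistic: sum over all cells of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)

round(expected["Laundry", ], 1)
x2
```

The Laundry row and the statistic should agree with what chisq.test() reports for house1.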
chisq <- chisq.test(house1)
chisq
##
## Pearson's Chi-squared test
##
## data: house1
## X-squared = 1613.8, df = 24, p-value < 2.2e-16
We can see that there is indeed a significant difference between the expected and observed counts.
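As a rough check on where the df and p-value in that output come from, the sketch below recomputes them from the reported statistic. The variable names are my own; the numbers 13, 3 and 1613.8 are taken from the table and the test output above.

```r
# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
n_rows <- 13  # tasks in house1
n_cols <- 3   # Wife, Husband, Jointly
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail probability of the reported statistic under a chi-square distribution
p_value <- pchisq(1613.8, df = df, lower.tail = FALSE)

df
p_value < .Machine$double.eps
```

R prints "< 2.2e-16" because the p-value falls below machine epsilon (.Machine$double.eps), not because it is exactly that number.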
Let’s print the observed and expected counts and compare them by eye.
#observed
chisq$observed
## Wife Husband Jointly
## Laundry 156 2 4
## Main_meal 124 5 4
## Dinner 77 7 13
## Breakfeast 82 15 7
## Tidying 53 1 57
## Dishes 32 4 53
## Shopping 33 9 55
## Official 12 23 15
## Driving 10 75 3
## Finances 13 21 66
## Insurance 8 53 77
## Repairs 0 160 2
## Holidays 0 6 153
round(chisq$expected,1)
## Wife Husband Jointly
## Laundry 65.2 41.4 55.3
## Main_meal 53.6 34.0 45.4
## Dinner 39.1 24.8 33.1
## Breakfeast 41.9 26.6 35.5
## Tidying 44.7 28.4 37.9
## Dishes 35.8 22.8 30.4
## Shopping 39.1 24.8 33.1
## Official 20.1 12.8 17.1
## Driving 35.4 22.5 30.1
## Finances 40.3 25.6 34.2
## Insurance 55.6 35.3 47.1
## Repairs 65.2 41.4 55.3
## Holidays 64.0 40.7 54.3
Which cells are contributing most to our Chi-Squared value? Let’s take a look.
install.packages("corrplot", repos = "http://cran.rstudio.com", dependencies = TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alexander Chang\AppData\Local\Temp\RtmpgZPvOa\downloaded_packages
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
corrplot(chisq$residuals, is.corr = FALSE)
How about seeing this as percentages? Each cell’s contribution is its squared residual divided by the test statistic.
conperc <- 100*chisq$residuals^2/chisq$statistic
round(conperc,2)
## Wife Husband Jointly
## Laundry 7.83 2.33 2.95
## Main_meal 5.74 1.53 2.34
## Dinner 2.28 0.79 0.76
## Breakfeast 2.38 0.31 1.42
## Tidying 0.10 1.64 0.60
## Dishes 0.03 0.96 1.04
## Shopping 0.06 0.62 0.89
## Official 0.20 0.51 0.02
## Driving 1.13 7.59 1.51
## Finances 1.14 0.05 1.84
## Insurance 2.52 0.55 1.17
## Repairs 4.04 21.03 3.19
## Holidays 3.97 1.83 11.11
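For readers who want to see what sits behind the residuals, here is a small sketch that computes Pearson residuals and percentage contributions by hand. To keep it short it uses only three rows of the table (Laundry, Repairs, Holidays), so the percentages will differ from the full-table values above; the object names are my own.

```r
# Three rows taken from house1; a subset, so results differ from the full table
obs <- matrix(
  c(156,   2,   4,
      0, 160,   2,
      0,   6, 153),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("Laundry", "Repairs", "Holidays"),
                  c("Wife", "Husband", "Jointly"))
)

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residual: positive means more observed than expected, negative means fewer
resid <- (obs - expected) / sqrt(expected)

# The squared residuals add up to the chi-square statistic,
# so each cell's share of that sum is its percentage contribution
contrib <- 100 * resid^2 / sum(resid^2)

round(resid, 2)
round(contrib, 2)
```

Because the squared residuals sum to the statistic, the contributions always add up to 100.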
We see that the wife deals with laundry and meals, the husband with repairs, and both plan the holidays together. This is easier to see in the numbers than in the visuals.
If we had cells with expected counts below 5, we would use Fisher’s exact test instead.
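As an illustration only, the sketch below applies that rule to a 2 x 2 slice of the table (Laundry and Tidying against Husband and Jointly), where the counts are small. It assumes the rule of thumb is checked on the expected counts; the slice and the object names are my own choice and not part of the original analysis.

```r
# A 2 x 2 slice of house1 with small counts
obs_small <- matrix(
  c(2,  4,
    1, 57),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Laundry", "Tidying"), c("Husband", "Jointly"))
)

# Expected counts under independence
expected <- outer(rowSums(obs_small), colSums(obs_small)) / sum(obs_small)
round(expected, 2)

# Fall back to Fisher's exact test when any expected count is below 5
if (any(expected < 5)) {
  result <- fisher.test(obs_small)
} else {
  result <- chisq.test(obs_small)
}

result$method
result$p.value
```

For tables larger than 2 x 2, fisher.test() can become expensive; its simulate.p.value = TRUE argument is one way around that.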