Replicate the correspondence analysis (CA) on household chores.

Source: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/

Get the data

The data come from a German sample of young, married, heterosexual couples surveyed in the late 1970s; for details, see https://www.tandfonline.com/doi/abs/10.1207/S15327906MBR3403_4

library(factoextra)
head(housetasks)

Graph contingency tables and run a chi-square test

library(gplots)
dt <- as.table(as.matrix(housetasks))
balloonplot(
  t(dt),
  main = "housetasks",
  xlab = "",
  ylab = "",
  label = FALSE,
  show.margins = FALSE
)
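chisq.test(housetasks) # test of independence between tasks and who does them
chisq.test(housetasks)$stdres # standardized residuals: cells that deviate most from independence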

Compute CA

library(FactoMineR)
res.ca <- CA(housetasks, graph = FALSE) # argument ncp: number of dimensions kept in the final results
print(res.ca)

Eigenvalues / Inertia

“Our data contains 13 rows and 4 columns.

If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(housetasks)-1) = 1/12 = 8.33% in terms of rows.

Likewise, the average axis should account for 1/(ncol(housetasks)-1) = 1/3 = 33.33% in terms of the 4 columns.”
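
The thresholds quoted above are plain arithmetic on the dimensions of the table. A minimal sketch, with the 13 rows and 4 columns hard-coded rather than read from `housetasks`:

```r
# Dimensions of the housetasks table as stated above (hard-coded here).
n_rows <- 13
n_cols <- 4

# Average share of inertia per axis, in percent, if the data were random.
row_threshold <- 100 / (n_rows - 1)
col_threshold <- 100 / (n_cols - 1)

round(c(rows = row_threshold, cols = col_threshold), 2)
```

The column-based value is the one used as the reference line on the scree plot.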

res.ca$eig
fviz_screeplot(res.ca) +
 geom_hline(yintercept = 33.33, linetype = 2, color = "red")
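
To see where the eigenvalues come from, the sketch below computes them by hand in base R for a small hypothetical table (the counts are invented for illustration and are not part of the data). It uses the standard CA construction: the squared singular values of the standardized residual matrix are the principal inertias, and their sum equals the chi-square statistic divided by the total count.

```r
# Hypothetical 3 x 3 table of counts, for illustration only.
N <- matrix(c(20,  5,  5,
               4, 18,  8,
               6,  7, 17),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

n <- sum(N)
P <- N / n                      # correspondence matrix
r_mass <- rowSums(P)            # row masses
c_mass <- colSums(P)            # column masses

# Standardized residuals: (P - r c') scaled by the square roots of the masses.
S <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))

# Squared singular values are the principal inertias (eigenvalues).
k <- min(dim(N)) - 1            # maximum number of non-trivial dimensions
eig <- svd(S)$d[seq_len(k)]^2
pct <- 100 * eig / sum(eig)     # percentage of inertia per dimension

# Total inertia equals the chi-square statistic divided by n.
chi2 <- unname(chisq.test(N, correct = FALSE)$statistic)
c(total_inertia = sum(eig), chi2_over_n = chi2 / n)
round(pct, 2)
```

The same quantities for the real data are what `res.ca$eig` reports, up to rounding.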

Biplot

“Symmetric plot represents the row and column profiles simultaneously in a common space. In this case, only the distance between row points or the distance between column points can be really interpreted.

The distance between any row and column items is not meaningful! You can only make general statements about the observed pattern.

In order to interpret the distance between column and row points, the column profiles must be presented in row space or vice versa. This type of map is called asymmetric biplot.”

fviz_ca_biplot(res.ca, repel = TRUE)

Graph of row variables

“If a row item is well represented by two dimensions, the sum of the cos2 is close to one. For some of the row items, more than 2 dimensions are required to perfectly represent the data.”

fviz_ca_row(res.ca, repel = TRUE)
fviz_ca_row(res.ca, alpha.row = "cos2", repel = TRUE)
library(corrplot)
corrplot(res.ca$row$cos2, is.corr = FALSE)
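
The "sum of the cos2" can also be read off numerically. A short sketch, assuming FactoMineR and factoextra are installed (the data set is loaded from factoextra with `data()`, and the name `cos2_dim12` is introduced here only for illustration):

```r
library(FactoMineR)
data(housetasks, package = "factoextra")

res.ca <- CA(housetasks, graph = FALSE)

# Quality of representation of each row on the first two dimensions:
# the closer to 1, the better the row is captured by the 2D map.
cos2_dim12 <- rowSums(res.ca$row$cos2[, 1:2])
sort(round(cos2_dim12, 3))
```

Rows at the low end of this list are the ones that need a third dimension to be represented well.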

Contributions of rows to the dimensions

“Rows that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.”

“It’s possible to use the function `corrplot()` to highlight the most contributing row points for each dimension:”

corrplot(res.ca$row$contrib, is.corr = FALSE)
fviz_contrib(res.ca, choice = "row", axes = 1, top = 10)
fviz_contrib(res.ca, choice = "row", axes = 2, top = 10)
fviz_contrib(res.ca, choice = "row", axes = 1:2, top = 10)
fviz_ca_row(res.ca, col.row = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)
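
As a numeric complement to the plots, the sketch below lists the rows whose contribution exceeds the average, that is, 100 divided by the number of rows. It assumes FactoMineR and factoextra are installed; `avg_contrib` is a name introduced here for illustration.

```r
library(FactoMineR)
data(housetasks, package = "factoextra")

res.ca <- CA(housetasks, graph = FALSE)

# Contributions are expressed in percent and sum to 100 within each dimension.
contrib <- res.ca$row$contrib
colSums(contrib)

# Contribution each row would have if all rows contributed equally.
avg_contrib <- 100 / nrow(contrib)

# Rows contributing more than average to Dim 1 and to Dim 2.
rownames(contrib)[contrib[, 1] > avg_contrib]
rownames(contrib)[contrib[, 2] > avg_contrib]
```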

Graph of column variables

fviz_ca_col(res.ca, col.col = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
fviz_ca_col(res.ca, col.col = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

Quality of column representation

“A cos2 close to 1 corresponds to column/row variables that are well represented on the factor map.”

fviz_cos2(res.ca, choice = "col", axes = 1:2)

Asymmetric biplot

“If the angle between two arrows is acute, then there is a strong association between the corresponding row and column.

To interpret the distance between rows and a column you should perpendicularly project row points on the column arrow.”

fviz_ca_biplot(res.ca, 
               map ="rowprincipal", arrow = c(TRUE, TRUE),
               repel = TRUE)
fviz_ca_biplot(res.ca, 
               map ="colprincipal", arrow = c(TRUE, TRUE),
               repel = TRUE)
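
The perpendicular projection mentioned in the quote is a scalar projection of a row point on the column arrow. A minimal base R sketch with hypothetical coordinates (they are not taken from `res.ca`):

```r
# Hypothetical coordinates, for illustration only.
col_arrow <- c(3, 4)            # a column point drawn as an arrow from the origin
row_pts <- rbind(a = c(2, 1),
                 b = c(-1, 2),
                 c = c(-3, -4))

# Scalar projection of each row point on the arrow: (row . col) / |col|.
proj <- as.vector(row_pts %*% col_arrow) / sqrt(sum(col_arrow^2))
names(proj) <- rownames(row_pts)
proj
```

A larger positive value suggests a stronger association with that column, while a negative value means the row falls on the opposite side of the origin.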

Export the results

biplot.ca <- fviz_ca_biplot(res.ca, repel = TRUE)
scree.plot <- fviz_eig(res.ca)
library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.ca), 
         filename = "CA.pdf")
write.infile(res.ca, "ca.csv", sep = ";")

Run a similar analysis on the data you collected.

Sample: student-driven sample of heterosexual couples in their 40s (n = 31), St. Petersburg, Russia (2023)

Get the data

simple_coa <- read.csv(file.choose(), encoding = "UTF-8") # pick the CSV file interactively (file.choose() works on all platforms)
names(simple_coa)
table(
  simple_coa$Laundry,
  simple_coa$Breakfast) # any pair of columns can be cross-tabulated like this
sum_total <- table(t(simple_coa[, 2:14])) # overall count of each answer across columns 2 to 14
sum_total
# The label vectors are called row_labels and col_labels so that they do not
# mask the base functions rownames() and colnames().
row_labels <-
  c(
    "Laundry",
    "Main_meal",
    "Dinner",
    "Breakfast",
    "Tidying",
    "Dishes",
    "Shopping",
    "Official",
    "Driving",
    "Finances",
    "Insurance",
    "Repairs",
    "Holidays"
  )
col_labels <- c("Wife", "Alternating", "Husband", "Jointly", "NA") # "NA" is a text label here, not a missing value

hc <- matrix(c(21,7,0,3,0, 
               13,9,5,4,0,
               11,12,3,5,0,
               12,8,3,6,2,
               11,7,3,10,0,
               5,14,4,6,2,
               7,9,5,10,0,
               7,3,10,6,5,
               1,5,14,5,6,
               3,3,7,13,5,
               1,4,8,10,8,
               3,0,16,10,0,
               9,2,2,15,3
               ) , 
             nrow = 13, 
             byrow = TRUE)
rownames(hc) <- row_labels
colnames(hc) <- col_labels
head(hc)

Graph contingency tables and run a chi-square test

Pick how to proceed: without NAs (option 1, columns 1-4), which is better for comparability with the example, or with NAs (option 2, all five columns), which keeps more observations and is truer to the data collected.

dt2 <- as.table(as.matrix(hc[ , 1:4])) # option 1: without the NA column
balloonplot(
  t(dt2),
  main = "housetasks",
  xlab = "",
  ylab = "",
  label = FALSE,
  show.margins = FALSE
)
dt2 <- as.table(as.matrix(hc)) # option 2: all five columns, including NA
balloonplot(
  t(dt2),
  main = "housetasks",
  xlab = "",
  ylab = "",
  label = FALSE,
  show.margins = FALSE
)

(The analysis below uses the full data, including the NA column.)

chisq.test(hc)
chisq.test(hc)$stdres
chisq.test(hc)$expected # rule of thumb: no more than 20% of the expected counts should be below 5. I recommend excluding the NA column.
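
The rule of thumb about expected counts can be checked directly. The sketch below rebuilds the table so that it runs on its own (row names are omitted for brevity) and computes the share of expected counts below 5 for both options; `expected_share_below_5` is a helper introduced here for illustration.

```r
# The collected-data table from above, rebuilt so that this block is self-contained.
hc <- matrix(c(21,7,0,3,0,   13,9,5,4,0,   11,12,3,5,0,
               12,8,3,6,2,   11,7,3,10,0,  5,14,4,6,2,
               7,9,5,10,0,   7,3,10,6,5,   1,5,14,5,6,
               3,3,7,13,5,   1,4,8,10,8,   3,0,16,10,0,
               9,2,2,15,3),
             nrow = 13, byrow = TRUE,
             dimnames = list(NULL, c("Wife", "Alternating", "Husband", "Jointly", "NA")))

# Expected counts under independence: row total * column total / grand total.
expected_share_below_5 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  mean(expected < 5)
}

share_full  <- expected_share_below_5(hc)         # option 2: with the NA column
share_no_na <- expected_share_below_5(hc[, 1:4])  # option 1: without it

# Shares in percent, to compare against the 20% rule of thumb.
round(100 * c(full = share_full, without_NA = share_no_na), 1)
```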

Compute CA

res2.ca <- CA(hc, graph = TRUE)
summary(res2.ca)
res3.ca <- CA(hc, graph = TRUE, col.sup = 5, row.sup = 7) # variant: column 5 (NA) and row 7 (Shopping) as supplementary elements

Eigenvalues / Inertia

res2.ca$eig
fviz_screeplot(res2.ca) +
 geom_hline(yintercept = 25.0, linetype = 2, color = "red") # 1/(5 - 1) = 25% for the 5 columns of hc

Biplot

fviz_ca_biplot(res2.ca, repel = TRUE, axes = 1:2)

Graph of row variables

fviz_ca_row(res2.ca, repel = TRUE)
fviz_ca_row(res2.ca, alpha.row = "cos2", repel = TRUE)
corrplot(res2.ca$row$cos2, is.corr = FALSE)

Contributions of rows to the dimensions

corrplot(res2.ca$row$contrib, is.corr = FALSE)
fviz_contrib(res2.ca, choice = "row", axes = 1, top = 10)
fviz_contrib(res2.ca, choice = "row", axes = 2, top = 10)
fviz_contrib(res2.ca, choice = "row", axes = 1:2, top = 10)
fviz_ca_row(res2.ca, col.row = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

Graph of column variables

fviz_ca_col(res2.ca, col.col = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
fviz_ca_col(res2.ca, col.col = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

Quality of column representation

fviz_cos2(res2.ca, choice = "col", axes = 1:2)

Asymmetric biplots

fviz_ca_biplot(res2.ca, 
               map ="rowprincipal", arrow = c(TRUE, TRUE),
               repel = TRUE)
fviz_ca_biplot(res2.ca, 
               map ="colprincipal", arrow = c(TRUE, TRUE),
               repel = TRUE)

Export the results

biplot2.ca <- fviz_ca_biplot(res2.ca)
scree2.plot <- fviz_eig(res2.ca)
ggexport(plotlist = list(scree2.plot, 
                         biplot2.ca), 
         filename = "CA2.pdf")
write.infile(res2.ca, "ca2.csv", sep = ";")

Self-check questions:

  1. Give a two-sentence summary of your CA on the collected data.
  2. Compare the example with the analysis of the collected data: Which CA retains more information? Which household tasks differentiate most in each case? Can you summarize the difference in 1-2 sentences?
  3. What behavioral or sociological conclusions can you draw from the analysis?
  4. What recommendations could you give to businesses targeting these or similar households?

Next time:

A new turn on crosstabs: Multiple Correspondence Analysis

Browsing through the press release https://www.levada.ru/en/2018/10/12/happiness/, you stumble upon the following cross-tabulation: https://www.levada.ru/cp/wp-content/uploads/2018/09/Schaste_tab..pdf

Follow the routine shown in https://rpubs.com/shirokaner/coa to create a cross-tab out of vectors of numbers and then run multiple correspondence analyses on the data.

Task:

Examine the original cross-tabulation and keep only the relevant numbers in your analysis. Remember that the chi-square test only works properly on raw counts, not on percentages.
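
One way to get from percentages back to approximate counts is to multiply each column by its group size. The sketch below uses entirely hypothetical percentages and group sizes (`pct`, `n_group`, `group_A`, and `group_B` are invented for illustration); the real group sizes have to come from the source table.

```r
# Hypothetical column percentages for two groups (each column sums to 100).
pct <- matrix(c(60, 40,
                30, 45,
                10, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Yes", "No", "DK"), c("group_A", "group_B")))

# Hypothetical numbers of respondents per group; take the real ones
# from the source table.
n_group <- c(group_A = 200, group_B = 400)

# Multiply each column by its group size and round to whole respondents.
counts <- round(sweep(pct / 100, 2, n_group, `*`))
counts

# The test is then run on the reconstructed counts, not on the percentages.
chisq.test(counts)
```

Feeding percentages straight into `chisq.test()` would treat every group as if it had exactly 100 respondents, which distorts the test statistic.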

Three correspondence maps are possible (pick any of them):

  1. happy-scale by sociodemographics (age, marital status, and income);
  2. widespread sources of happiness by sociodemographics; and
  3. widespread sources of unhappiness by sociodemographics.

Some prepared numbers for the analyses can be found below:

# library(datapasta)
# vector_paste()
happy <- c(22, 28, 20, 20, 23, 19, 25, 16, 21, 30,
  48, 52, 50, 41, 50, 39, 48, 34, 50, 54,
  19, 13, 17, 26, 17, 26, 15, 27, 20, 10,
  6, 4, 6, 7, 4, 11, 6, 13, 4, 3,
  6, 4, 6, 7, 6, 6, 5, 10, 5, 3)

hp_source <-  c(74, 70, 81, 70, 79, 68, 53, 72, 73, 75,
  19, 23, 13, 20, 15, 23, 31, 18, 19, 19,
  18, 20, 20, 14, 18, 14, 23, 13, 15, 25,
  12, 12, 11, 14, 11, 17, 11, 14, 12, 12,
  5, 4, 4, 7, 5, 1, 7, 3, 4, 7)

uhp_source <- c(54, 56, 57, 50, 58, 50, 42, 54, 54, 50,
                17, 31, 18, 10, 19, 9, 26, 12, 19, 22,
                15, 2, 9, 26, 13, 22, 9, 20, 14, 10,
                11, 4, 10, 14, 6, 18, 10, 12, 9, 12,
                9, 10, 8, 9, 7, 10, 13, 10, 7, 13)

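# Column labels shared by all three tables
coln <- c("18-34", "35-54", "55+", "Partnered", "Wid_Div", "Single", "Lower_Cl", "Middle_Cl", "Upper_Cl")

# Row labels for happy (run only the rown line that matches your chosen table)
rown <- c("Def_YES", "Rather_YES", "Rather_NO", "Def_NO", "DK")

# Row labels for hp_source
rown <- c("Family+", "Relationships+", "Work_Earnings+", "Health+", "Hobbies")

# Row labels for uhp_source
rown <- c("Work_Earnings-", "Ruling_Pwrs", "Health-", "Family-", "Intimacy")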
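
A possible first step is to reshape one of the vectors into a labelled table. The sketch below does this for `happy`. Each vector holds 50 values, so 5 rows give 10 columns, one more than there are labels in `coln`. Dropping the first column is an assumption made here for illustration (it could be, for example, a total column), so check it against the original table.

```r
happy <- c(22, 28, 20, 20, 23, 19, 25, 16, 21, 30,
           48, 52, 50, 41, 50, 39, 48, 34, 50, 54,
           19, 13, 17, 26, 17, 26, 15, 27, 20, 10,
           6, 4, 6, 7, 4, 11, 6, 13, 4, 3,
           6, 4, 6, 7, 6, 6, 5, 10, 5, 3)
rown <- c("Def_YES", "Rather_YES", "Rather_NO", "Def_NO", "DK")
coln <- c("18-34", "35-54", "55+", "Partnered", "Wid_Div", "Single",
          "Lower_Cl", "Middle_Cl", "Upper_Cl")

# 50 values in 5 rows give 10 columns, one more than the 9 labels in coln.
happy_mat <- matrix(happy, nrow = length(rown), byrow = TRUE)
dim(happy_mat)

# ASSUMPTION: the extra column is the first one (for example, a total column).
# Verify this against the original cross-tabulation before relying on it.
happy_tab <- happy_mat[, -1]
dimnames(happy_tab) <- list(rown, coln)
happy_tab

# Column sums near 100 indicate percentages rather than raw counts.
colSums(happy_tab)
```

If the column sums come out close to 100, the values are percentages and still need converting to counts before a chi-square test.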
