Introduction

In this lesson, we will focus on how to examine the association of two categorical variables in our data using a cross-tab. Think of cross-tabs as similar to a frequency table or table of proportions but to examine two variables, as opposed to just one variable.

We will focus mostly on using Base R (instead of tidyverse) to create our tables of frequencies/proportions. It is easier, more efficient, and less code to use Base R for this task.

Getting started

Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder. Note: even if you set your working directory through the toolbar in R Studio (e.g. by clicking Session –> Set Working Directory…) you should still paste the executed code in your R Script or R Markdown document. This helps make your code replicable and ensures you know which folder on your computer you’re working out of.

Set working directory

setwd("/Users/shanaya/POL3325G/Lectures/Lecture 8")

Packages and data

Today we will be using the 2021 Canadian Election Study data. You can download the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XBZHKC

Remember you need to unzip the folder that downloads and then move the data to your working directory. We will import the dta (stata) file. You may also want to keep a copy of the code book for reference (e.g. to learn about the variables so that you can properly recode them).

library(rio)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gt)

df <- import("2021 Canadian Election Study v2.0.dta")

Example 1: Cross-tab two ordinal level variables

dependent variable: cps21_volunteer (In the past 12 months how many times did you volunteer for a group or organization - 4 categories plus fifth category as don’t know/prefer not to answer)
independent variable: cps21_income_cat (Does your household income fall into one of these broad categories?)
steps: subset our data to keep these two variables (and maybe an ID variable), remove NAs, recode as factors with proper ordering of the levels

Wrangle data

# check out the NA values for our variables of interest: 
df %>% filter(cps21_volunteer == 5 | is.na(cps21_volunteer)) # 657 NA values for cps21_volunteer 
df %>% filter(cps21_income_cat == 9 | is.na(cps21_income_cat)) # 20,616 NA values for income variable

Below, I recode the variables I am interested in by: 1) changing and condensing categories of the volunteer and income variables 2) specifying the new volunteer and income variables should be of the class “factor” (ordinal) with levels from low to high 3) use select() to keep only the variables I am interested in (plus and ID variable)

df1 <- df %>% 
  mutate(volunteer = case_when(cps21_volunteer == 1 ~ "never",
                                cps21_volunteer %in% c(2,3) ~ "sometimes",
                                cps21_volunteer == 4 ~ "often",
                                TRUE ~ NA_character_)) %>%
  mutate(income = case_when(cps21_income_cat == 1 ~ "none",
                                      cps21_income_cat %in% c(2,3) ~ "low",
                                      cps21_income_cat %in% c(4,5,6) ~ "medium",
                                      cps21_income_cat %in% c(7,8) ~ "high",
                                      TRUE ~ NA_character_)) %>%
  mutate( 
    volunteer = factor(volunteer, levels = c("never", "sometimes", "often")),
    income = factor(income, levels = c("none", "low", "medium", "high"))
  ) %>% 
  select(cps21_ResponseId, volunteer, cps21_volunteer, income, cps21_income_cat)

Check recoding:

table(df1$cps21_volunteer, df1$volunteer)

##    
##     never sometimes often
##   1 13165         0     0
##   2     0      1758     0
##   3     0      2763     0
##   4     0         0  2625
##   5     0         0     0

table(df1$cps21_income_cat, df1$income)

##    
##     none low medium high
##   1   55   0      0    0
##   2    0  67      0    0
##   3    0  71      0    0
##   4    0   0     67    0
##   5    0   0     25    0
##   6    0   0     37    0
##   7    0   0      0   19
##   8    0   0      0   11
##   9    0   0      0    0

levels(df1$volunteer)

## [1] "never"     "sometimes" "often"

levels(df1$income)

## [1] "none"   "low"    "medium" "high"

drop na values

df1 <- df1 %>%
  drop_na(income, volunteer)

Above, we remove rows (observations, which are individuals in our survey) where at least one of the specified columns (income or volunteer) have NA values. If either income or volunteer has an NA value, that observation (individual) is removed. The size of our sample (data) is significantly smaller! We should discuss this when we present our cross tab (e.g. there were a lot of missing values for the income variable in particular).

Cross-tab (frequencies)

table(df1$volunteer, df1$income)

##            
##             none low medium high
##   never       36  86     72   17
##   sometimes   14  24     29    6
##   often        2  19     19    6

Above we ask, how are our 330 individuals distributed between income and volunteerism? E.g. 17 individuals never volunteered and have high income. Remember, the dependent variable (volunteerism) should appear in the rows and the independent variable (income) in the columns.

Cross-tab (proportions)

prop.table(table(df1$volunteer, df1$income))

##            
##                    none         low      medium        high
##   never     0.109090909 0.260606061 0.218181818 0.051515152
##   sometimes 0.042424242 0.072727273 0.087878788 0.018181818
##   often     0.006060606 0.057575758 0.057575758 0.018181818

For prop.table(), if we don’t specify “margin”, the entire sample (e.g. 330 observations in this case) is the reference group. Above, we see that 5% of respondents in CES 2021 never volunteered and had high income (17 % 330 = 0.515 x 100 = 5.15%)

prop.table(table(df1$volunteer, df1$income), margin=2)

##            
##                   none        low     medium       high
##   never     0.69230769 0.66666667 0.60000000 0.58620690
##   sometimes 0.26923077 0.18604651 0.24166667 0.20689655
##   often     0.03846154 0.14728682 0.15833333 0.20689655

For prop.table(), if we don’t specify “margin”, the entire sample (e.g. 330 observations in this case) is the reference group. If we specify a “margin” argument in the prop.table() function, then we can calculate proportions within subsets of the sample. We should specify margin=2 to look at the proportion of volunteers in each income category. We say 2 to mean the second variable (income).

Among respondents with no income, 69% never volunteer. (There are 52 individuals in total who reported having NO income in our survey. 36 of these 52 individuals reported that they NEVER volunteer.) Among respondents with high income, 58.6% (59%) never volunteer. (There 29 individuals in total who reported having high income in our survey. 17 of these 29 individuals reported that they NEVER volunteer)

There is some support for a relationship between these two variables, but it appears to be limited. As we move from no income to high income, the people who never reported volunteering decreases slightly (top row). As we move from no income to high income, the people who reported volunteering often increases slightly.

Using dplyr/tidyverse

# frequencies
df1 %>%
  count(volunteer, income) %>%
  pivot_wider(names_from = income, values_from = n, values_fill = 0)

## # A tibble: 3 × 5
##   volunteer  none   low medium  high
##   <fct>     <int> <int>  <int> <int>
## 1 never        36    86     72    17
## 2 sometimes    14    24     29     6
## 3 often         2    19     19     6

# proportions 
df1 %>%
  drop_na(volunteer, income) %>% # here just confirming we are dropping NA values from calculations
  count(volunteer, income) %>%
  group_by(income) %>% # group by income so that we calculate proportions within each income group 
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%  
  pivot_wider(names_from = income, values_from = prop, values_fill = 0)

## # A tibble: 3 × 5
##   volunteer   none   low medium  high
##   <fct>      <dbl> <dbl>  <dbl> <dbl>
## 1 never     0.692  0.667  0.6   0.586
## 2 sometimes 0.269  0.186  0.242 0.207
## 3 often     0.0385 0.147  0.158 0.207

Above, we calculated proportions within each income group (within each category of our independent variable). If we remove group_by(income), we would generate the cross tab created by: prop.table(table(df1$volunteer, df1$income))

Make your dplyr cross-tab “pretty” using the gt package

wide_df1 <- df1 %>%
  drop_na(volunteer, income) %>% # here just confirming we are dropping NA values from calculations
  count(volunteer, income) %>%
  group_by(income) %>% # group by income so that we calculate proportions within each income group 
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%  
  pivot_wider(names_from = income, values_from = prop, values_fill = 0)


wide_df1 %>%
  gt() %>%
  fmt_number(decimals=2) %>%
  tab_spanner(label="Income Level",
               columns =c(low, medium, high)) %>%
  tab_header(title= "Does volunteerism vary by income?")

volunteer	none	Income Level
Does volunteerism vary by income?
volunteer	none	low	medium	high
never	0.69	0.67	0.60	0.59
sometimes	0.27	0.19	0.24	0.21
often	0.04	0.15	0.16	0.21

If you run the above code in the console, you can export the table from the plot/viewer window by clicking “Export” –> “Save as Image” or “Copy to Clipboard”.

Example 2: Cross-tab binary dependent variable

Dependent variable: cps21_duty_choice (Some people feel voting is a duty and others feel it is a choice, which is it for you with 1 = duty and 2= choice 3= NAs)
Independent variable: income variable from above cps21_income_cat

df2 <- df %>% 
   mutate(income = case_when(cps21_income_cat == 1 ~ "none",
                                      cps21_income_cat %in% c(2,3) ~ "low",
                                      cps21_income_cat %in% c(4,5,6) ~ "medium",
                                      cps21_income_cat %in% c(7,8) ~ "high",
                                      TRUE ~ NA_character_),
          voteduty = case_match(cps21_duty_choice,
                                1 ~ "duty",
                                2 ~ "choice",
                                TRUE ~ NA_character_)) %>%
   mutate(income = factor(income, levels = c("none", "low", "medium", "high"))) %>% 
  select(cps21_ResponseId, income, cps21_income_cat, cps21_duty_choice, voteduty)

check recoding, drop NA values

table(df2$cps21_duty_choice, df2$voteduty)

##    
##     choice  duty
##   1      0 15718
##   2   4821     0
##   3      0     0

df2 <- df2 %>%
  drop_na(income, voteduty)

table(df2$voteduty, df2$income)

##         
##          none low medium high
##   choice   17  41     26    5
##   duty     29  91     93   25

You’ll notice that if we sum down the columns, we would find the total number of respondents in each income category:

table(df2$income)

## 
##   none    low medium   high 
##     46    132    119     30

prop.table(table(df2$voteduty, df2$income), margin=2)

##         
##               none       low    medium      high
##   choice 0.3695652 0.3106061 0.2184874 0.1666667
##   duty   0.6304348 0.6893939 0.7815126 0.8333333

The column percentages sum to 100 - focus on interpreting the row that fits with your hypothesis. As income increases, the proportion of respondents who see voting as a duty increases. We find that people with lower income are more likely to view voting as “choice”. The relationship appears to be linear.

Wrapping up

You can use the examples above to create cross-tabs of various categorical variables of interest. Just remember to identify which variable is your independent variable and which variable is your dependent variable before you make your cross-tab.

Important functions/operators:

table()
prop.table()
pivot_wider()
group_by()
gt()
pivot_wider()

Class 8: Describing the relationship between two variables

POL3325G Data Science for Politics (March 4, 2025)

Shanaya Vanhooren