In this lesson, we will focus on how to examine the association of two categorical variables in our data using a cross-tab. Think of cross-tabs as similar to a frequency table or table of proportions but to examine two variables, as opposed to just one variable.
We will focus mostly on using Base R (instead of tidyverse) to create our tables of frequencies/proportions. It is easier, more efficient, and less code to use Base R for this task.
Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder. Note: even if you set your working directory through the toolbar in R Studio (e.g. by clicking Session –> Set Working Directory…) you should still paste the executed code in your R Script or R Markdown document. This helps make your code replicable and ensures you know which folder on your computer you’re working out of.
setwd("/Users/shanaya/POL3325G/Lectures/Lecture 8")
Today we will be using the 2021 Canadian Election Study data. You can download the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XBZHKC
Remember you need to unzip the folder that downloads and then move the data to your working directory. We will import the dta (stata) file. You may also want to keep a copy of the code book for reference (e.g. to learn about the variables so that you can properly recode them).
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)
df <- import("2021 Canadian Election Study v2.0.dta")
dependent variable: cps21_volunteer (In the past 12 months how many times did you volunteer for a group or organization - 4 categories plus fifth category as don’t know/prefer not to answer)
independent variable: cps21_income_cat (Does your household income fall into one of these broad categories?)
steps: subset our data to keep these two variables (and maybe an ID variable), remove NAs, recode as factors with proper ordering of the levels
# check out the NA values for our variables of interest:
df %>% filter(cps21_volunteer == 5 | is.na(cps21_volunteer)) # 657 NA values for cps21_volunteer
df %>% filter(cps21_income_cat == 9 | is.na(cps21_income_cat)) # 20,616 NA values for income variable
Below, I recode the variables I am interested in by: 1) changing and condensing categories of the volunteer and income variables 2) specifying the new volunteer and income variables should be of the class “factor” (ordinal) with levels from low to high 3) use select() to keep only the variables I am interested in (plus and ID variable)
df1 <- df %>%
mutate(volunteer = case_when(cps21_volunteer == 1 ~ "never",
cps21_volunteer %in% c(2,3) ~ "sometimes",
cps21_volunteer == 4 ~ "often",
TRUE ~ NA_character_)) %>%
mutate(income = case_when(cps21_income_cat == 1 ~ "none",
cps21_income_cat %in% c(2,3) ~ "low",
cps21_income_cat %in% c(4,5,6) ~ "medium",
cps21_income_cat %in% c(7,8) ~ "high",
TRUE ~ NA_character_)) %>%
mutate(
volunteer = factor(volunteer, levels = c("never", "sometimes", "often")),
income = factor(income, levels = c("none", "low", "medium", "high"))
) %>%
select(cps21_ResponseId, volunteer, cps21_volunteer, income, cps21_income_cat)
table(df1$cps21_volunteer, df1$volunteer)
##
## never sometimes often
## 1 13165 0 0
## 2 0 1758 0
## 3 0 2763 0
## 4 0 0 2625
## 5 0 0 0
table(df1$cps21_income_cat, df1$income)
##
## none low medium high
## 1 55 0 0 0
## 2 0 67 0 0
## 3 0 71 0 0
## 4 0 0 67 0
## 5 0 0 25 0
## 6 0 0 37 0
## 7 0 0 0 19
## 8 0 0 0 11
## 9 0 0 0 0
levels(df1$volunteer)
## [1] "never" "sometimes" "often"
levels(df1$income)
## [1] "none" "low" "medium" "high"
df1 <- df1 %>%
drop_na(income, volunteer)
Above, we remove rows (observations, which are individuals in our survey) where at least one of the specified columns (income or volunteer) have NA values. If either income or volunteer has an NA value, that observation (individual) is removed. The size of our sample (data) is significantly smaller! We should discuss this when we present our cross tab (e.g. there were a lot of missing values for the income variable in particular).
table(df1$volunteer, df1$income)
##
## none low medium high
## never 36 86 72 17
## sometimes 14 24 29 6
## often 2 19 19 6
Above we ask, how are our 330 individuals distributed between income and volunteerism? E.g. 17 individuals never volunteered and have high income. Remember, the dependent variable (volunteerism) should appear in the rows and the independent variable (income) in the columns.
prop.table(table(df1$volunteer, df1$income))
##
## none low medium high
## never 0.109090909 0.260606061 0.218181818 0.051515152
## sometimes 0.042424242 0.072727273 0.087878788 0.018181818
## often 0.006060606 0.057575758 0.057575758 0.018181818
For prop.table(), if we don’t specify “margin”, the entire sample (e.g. 330 observations in this case) is the reference group. Above, we see that 5% of respondents in CES 2021 never volunteered and had high income (17 % 330 = 0.515 x 100 = 5.15%)
prop.table(table(df1$volunteer, df1$income), margin=2)
##
## none low medium high
## never 0.69230769 0.66666667 0.60000000 0.58620690
## sometimes 0.26923077 0.18604651 0.24166667 0.20689655
## often 0.03846154 0.14728682 0.15833333 0.20689655
For prop.table(), if we don’t specify “margin”, the entire sample (e.g. 330 observations in this case) is the reference group. If we specify a “margin” argument in the prop.table() function, then we can calculate proportions within subsets of the sample. We should specify margin=2 to look at the proportion of volunteers in each income category. We say 2 to mean the second variable (income).
Among respondents with no income, 69% never volunteer. (There are 52 individuals in total who reported having NO income in our survey. 36 of these 52 individuals reported that they NEVER volunteer.) Among respondents with high income, 58.6% (59%) never volunteer. (There 29 individuals in total who reported having high income in our survey. 17 of these 29 individuals reported that they NEVER volunteer)
There is some support for a relationship between these two variables, but it appears to be limited. As we move from no income to high income, the people who never reported volunteering decreases slightly (top row). As we move from no income to high income, the people who reported volunteering often increases slightly.
# frequencies
df1 %>%
count(volunteer, income) %>%
pivot_wider(names_from = income, values_from = n, values_fill = 0)
## # A tibble: 3 × 5
## volunteer none low medium high
## <fct> <int> <int> <int> <int>
## 1 never 36 86 72 17
## 2 sometimes 14 24 29 6
## 3 often 2 19 19 6
# proportions
df1 %>%
drop_na(volunteer, income) %>% # here just confirming we are dropping NA values from calculations
count(volunteer, income) %>%
group_by(income) %>% # group by income so that we calculate proportions within each income group
mutate(prop = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = income, values_from = prop, values_fill = 0)
## # A tibble: 3 × 5
## volunteer none low medium high
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 never 0.692 0.667 0.6 0.586
## 2 sometimes 0.269 0.186 0.242 0.207
## 3 often 0.0385 0.147 0.158 0.207
Above, we calculated proportions within each income group (within
each category of our independent variable). If we remove
group_by(income), we would generate the cross tab created by:
prop.table(table(df1$volunteer, df1$income))
wide_df1 <- df1 %>%
drop_na(volunteer, income) %>% # here just confirming we are dropping NA values from calculations
count(volunteer, income) %>%
group_by(income) %>% # group by income so that we calculate proportions within each income group
mutate(prop = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = income, values_from = prop, values_fill = 0)
wide_df1 %>%
gt() %>%
fmt_number(decimals=2) %>%
tab_spanner(label="Income Level",
columns =c(low, medium, high)) %>%
tab_header(title= "Does volunteerism vary by income?")
Does volunteerism vary by income? | ||||
volunteer | none | Income Level | ||
---|---|---|---|---|
low | medium | high | ||
never | 0.69 | 0.67 | 0.60 | 0.59 |
sometimes | 0.27 | 0.19 | 0.24 | 0.21 |
often | 0.04 | 0.15 | 0.16 | 0.21 |
If you run the above code in the console, you can export the table from the plot/viewer window by clicking “Export” –> “Save as Image” or “Copy to Clipboard”.
df2 <- df %>%
mutate(income = case_when(cps21_income_cat == 1 ~ "none",
cps21_income_cat %in% c(2,3) ~ "low",
cps21_income_cat %in% c(4,5,6) ~ "medium",
cps21_income_cat %in% c(7,8) ~ "high",
TRUE ~ NA_character_),
voteduty = case_match(cps21_duty_choice,
1 ~ "duty",
2 ~ "choice",
TRUE ~ NA_character_)) %>%
mutate(income = factor(income, levels = c("none", "low", "medium", "high"))) %>%
select(cps21_ResponseId, income, cps21_income_cat, cps21_duty_choice, voteduty)
table(df2$cps21_duty_choice, df2$voteduty)
##
## choice duty
## 1 0 15718
## 2 4821 0
## 3 0 0
df2 <- df2 %>%
drop_na(income, voteduty)
table(df2$voteduty, df2$income)
##
## none low medium high
## choice 17 41 26 5
## duty 29 91 93 25
You’ll notice that if we sum down the columns, we would find the total number of respondents in each income category:
table(df2$income)
##
## none low medium high
## 46 132 119 30
prop.table(table(df2$voteduty, df2$income), margin=2)
##
## none low medium high
## choice 0.3695652 0.3106061 0.2184874 0.1666667
## duty 0.6304348 0.6893939 0.7815126 0.8333333
The column percentages sum to 100 - focus on interpreting the row that fits with your hypothesis. As income increases, the proportion of respondents who see voting as a duty increases. We find that people with lower income are more likely to view voting as “choice”. The relationship appears to be linear.
You can use the examples above to create cross-tabs of various categorical variables of interest. Just remember to identify which variable is your independent variable and which variable is your dependent variable before you make your cross-tab.
table()
prop.table()
pivot_wider()
group_by()
gt()
pivot_wider()