Two Categorical Variables

Harold Nelson

1/29/2021

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.5     ✓ dplyr   1.0.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(socviz)
library(gmodels)
library(vcd)
## Loading required package: grid

Religion and Region

These are two categorical variables and we want to use them as an example to study different ways of displaying the relationship between them. The usual non-visual way to do this is called a “cross-tab.” The standard table function provides a very rudimentary version.

table(gss_sm$bigregion,gss_sm$religion)
##            
##             Protestant Catholic Jewish None Other
##   Northeast        158      162     27  112    28
##   Midwest          325      172      3  157    33
##   South            650      160     11  170    50
##   West             238      155     10  180    48

Note that these numbers are different from those in the book. The version of the data is apparently different, probably reflecting a difference in years.
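Raw counts are hard to compare across regions because the row totals differ. As a minimal sketch, base R's prop.table() converts the counts into shares. The counts are typed in by hand from the output above so the block runs without socviz; the names counts, row_share and col_share are illustrative only.

```r
# Counts copied from the table() output above (rows = region, columns = religion)
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    bigregion = c("Northeast", "Midwest", "South", "West"),
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
)

# margin = 1: each row sums to 1 (religion shares within a region)
row_share <- prop.table(counts, margin = 1)

# margin = 2: each column sums to 1 (region shares within a religion)
col_share <- prop.table(counts, margin = 2)

round(100 * row_share, 1)
round(100 * col_share, 1)
```

With the real data, the same idea would be prop.table(table(gss_sm$bigregion, gss_sm$religion), margin = 1).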

Fancy Crosstab

There is a more elaborate version, CrossTable, in the gmodels package. To interpret the output, use the “Cell Contents” key printed above the table; it lists, in order, what each of the five numbers in a cell means.

CrossTable(gss_sm$bigregion,gss_sm$religion)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2849 
## 
##  
##                  | gss_sm$religion 
## gss_sm$bigregion | Protestant |   Catholic |     Jewish |       None |      Other |  Row Total | 
## -----------------|------------|------------|------------|------------|------------|------------|
##        Northeast |        158 |        162 |         27 |        112 |         28 |        487 | 
##                  |     24.877 |     23.502 |     38.340 |      0.362 |      0.025 |            | 
##                  |      0.324 |      0.333 |      0.055 |      0.230 |      0.057 |      0.171 | 
##                  |      0.115 |      0.250 |      0.529 |      0.181 |      0.176 |            | 
##                  |      0.055 |      0.057 |      0.009 |      0.039 |      0.010 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##          Midwest |        325 |        172 |          3 |        157 |         33 |        690 | 
##                  |      0.149 |      1.397 |      7.080 |      0.335 |      0.788 |            | 
##                  |      0.471 |      0.249 |      0.004 |      0.228 |      0.048 |      0.242 | 
##                  |      0.237 |      0.265 |      0.059 |      0.254 |      0.208 |            | 
##                  |      0.114 |      0.060 |      0.001 |      0.055 |      0.012 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##            South |        650 |        160 |         11 |        170 |         50 |       1041 | 
##                  |     44.346 |     25.093 |      3.128 |     13.953 |      1.129 |            | 
##                  |      0.624 |      0.154 |      0.011 |      0.163 |      0.048 |      0.365 | 
##                  |      0.474 |      0.247 |      0.216 |      0.275 |      0.314 |            | 
##                  |      0.228 |      0.056 |      0.004 |      0.060 |      0.018 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##             West |        238 |        155 |         10 |        180 |         48 |        631 | 
##                  |     14.194 |      0.882 |      0.149 |     13.426 |      4.641 |            | 
##                  |      0.377 |      0.246 |      0.016 |      0.285 |      0.076 |      0.221 | 
##                  |      0.174 |      0.239 |      0.196 |      0.291 |      0.302 |            | 
##                  |      0.084 |      0.054 |      0.004 |      0.063 |      0.017 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##     Column Total |       1371 |        649 |         51 |        619 |        159 |       2849 | 
##                  |      0.481 |      0.228 |      0.018 |      0.217 |      0.056 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
## 
## 

Note that the row and column percentages (the “marginals,” in Healy’s terminology) closely match the results in Healy’s Tables 5.1 and 5.2.
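The “Chi-square contribution” line in each cell can be reproduced by hand, which may help in reading the CrossTable output. The sketch below assumes the usual formula, (observed - expected)^2 / expected, with expected counts computed under independence. The counts are copied from the output above, and the variable names are illustrative.

```r
# Counts copied from the CrossTable() output above
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    bigregion = c("Northeast", "Midwest", "South", "West"),
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Each cell's contribution to the chi-square statistic
contribution <- (counts - expected)^2 / expected
round(contribution, 3)

# The contributions add up to the overall chi-square statistic
sum(contribution)
```

Large contributions, such as the Northeast/Jewish and South/Protestant cells, mark the cells that depart most from what independence would predict.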

Visualize a Crosstab

Our goal in this course is to master ways of displaying the information visually.

Before getting into Healy’s methods, I’ll demonstrate two simple methods based on the idea of a scatterplot of two categorical variables. The plain scatterplot of two categorical variables is not very useful, because every observation in a cell lands on the same point. Try this and see what you get.

Answer

gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) + 
  geom_point()

Method 1

The first of the two simple methods uses the raw data and simply replaces geom_point() with geom_jitter(). Manual adjustment of the alpha and size parameters may improve on the result.

Try this.

Answer

gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) + 
  geom_jitter(alpha=.5,size=.5)

Exercise

Adjust size and alpha to get a result you find effective.

Another Method

My second method is to use dplyr to count the observations in the cells of the table and produce a scatterplot of the categorical variables. The counts are mapped to the size of the dots in the scatterplot.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,size=count)) + 
  geom_point()
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.

A variant on this idea is to map count to color instead of size. This requires larger dots to have any hope of success.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,color=count)) + geom_point(size=9)
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
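The group_by() / summarize() / ungroup() sequence used in these pipelines can be shortened. The sketch below uses a tiny made-up tibble (toy, for illustration only) rather than gss_sm, and compares the long form with dplyr's count(). It also shows .groups = "drop", which the summarise() message refers to.

```r
library(dplyr)

# Toy stand-in for gss_sm (made-up rows, for illustration only)
toy <- tibble(
  religion  = c("Catholic", "Catholic", "None", "None", "None", "Jewish"),
  bigregion = c("South", "South", "West", "South", "West", "Northeast")
)

# Long form, as in the pipelines above; .groups = "drop" returns an
# ungrouped tibble and silences the summarise() message
long_form <- toy %>%
  group_by(religion, bigregion) %>%
  summarize(count = n(), .groups = "drop")

# Short form: count() groups, tallies, and ungroups in one step
short_form <- toy %>%
  count(religion, bigregion, name = "count")

all.equal(long_form, short_form)
```

Either result can be piped straight into ggplot() as in the plots above.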

Mosaic Plots

The mosaic plot is another alternative for looking at relationships among categorical variables. The best version is in the vcd (Visualizing Categorical Data) package. Here’s a look at our relationship.

library(vcd)
mosaic(~religion + bigregion,data=gss_sm)

Discussion

Do any of these methods work? Which do you prefer?

Healy’s rel_by_region

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
## `summarise()` has grouped output by 'bigregion'. You can override using the `.groups` argument.
head(rel_by_region,10)
## # A tibble: 10 x 5
## # Groups:   bigregion [2]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23

Note that there is an alternative choice: the distribution of regions within each religion. Do this as an exercise before you look at the next slide.

My region_by_rel

region_by_rel <- gss_sm %>%
    group_by(religion,bigregion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
head(region_by_rel,10)
## # A tibble: 10 x 5
## # Groups:   religion [3]
##    religion   bigregion     N   freq   pct
##    <fct>      <fct>     <int>  <dbl> <dbl>
##  1 Protestant Northeast   158 0.115     12
##  2 Protestant Midwest     325 0.237     24
##  3 Protestant South       650 0.474     47
##  4 Protestant West        238 0.174     17
##  5 Catholic   Northeast   162 0.250     25
##  6 Catholic   Midwest     172 0.265     27
##  7 Catholic   South       160 0.247     25
##  8 Catholic   West        155 0.239     24
##  9 Jewish     Northeast    27 0.529     53
## 10 Jewish     Midwest       3 0.0588     6

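Note the difference: in rel_by_region the percentages are computed within each region, while in region_by_rel they are computed within each religion.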
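A quick way to confirm which variable the shares are computed within is to add up freq by group. The sketch below uses a 2 x 2 corner of the counts shown earlier (the tibble cells and the other names are illustrative) and computes the shares both ways.

```r
library(dplyr)

# A 2 x 2 corner of the table above, as counts (illustration only)
cells <- tibble(
  bigregion = c("Northeast", "Northeast", "Midwest", "Midwest"),
  religion  = c("Catholic", "Jewish", "Catholic", "Jewish"),
  N         = c(162, 27, 172, 3)
)

# Shares within each region: freq sums to 1 for every bigregion
within_region <- cells %>%
  group_by(bigregion) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup()

# Shares within each religion: freq sums to 1 for every religion
within_religion <- cells %>%
  group_by(religion) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup()

within_region %>% group_by(bigregion) %>% summarize(total = sum(freq))
within_religion %>% group_by(religion) %>% summarize(total = sum(freq))
```

The same cell gets a different freq in the two tibbles because the denominator changes from the region total to the religion total.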

Healy’s First Plot

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = "Region",y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

Trivial Exercise

What is the difference if you set position to dodge instead of dodge2?

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge") +
    labs(x = "Region",y = "Percent", fill = "Religion") +
    theme(legend.position = "top")
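The two plots differ only subtly, so it can help to exaggerate the difference. As a sketch on toy percentages (illustrative values, not the full rel_by_region table), the block below sets the padding argument of position_dodge2() to a large value so that the gaps between bars are easy to see.

```r
library(ggplot2)

# Toy percentages (illustrative values only)
toy <- data.frame(
  bigregion = rep(c("Northeast", "South"), each = 3),
  religion  = rep(c("Protestant", "Catholic", "None"), times = 2),
  pct       = c(32, 33, 23, 62, 15, 16)
)

p0 <- ggplot(toy, aes(x = bigregion, y = pct, fill = religion))

# "dodge": bars within a region sit directly next to each other
p_dodge <- p0 + geom_col(position = "dodge")

# dodge2 adds padding between bars; a large value makes the gap obvious
p_dodge2 <- p0 + geom_col(position = position_dodge2(padding = 0.4))

p_dodge
p_dodge2
```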

Another Trivial Exercise

What does the graph look like if you use geom_point() instead of geom_col()? Think for a minute before you do this.

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_point(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")

Note that since the default plot character is a solid dot, you must use color instead of fill to get the color to show. There is some overplotting. Would jitter help? Would a smaller or larger dot be preferred? Experiment!

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_jitter(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")
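One caution with jitter here: by default geom_jitter() moves points vertically as well as horizontally, so the plotted percentages no longer sit at their true values. A possible refinement, sketched on toy percentages (illustrative values only), is to jitter only sideways by setting height = 0. The seed argument just makes the random offsets repeatable.

```r
library(ggplot2)

# Toy percentages (illustrative values only)
toy <- data.frame(
  bigregion = rep(c("Northeast", "South"), each = 3),
  religion  = rep(c("Protestant", "Catholic", "None"), times = 2),
  pct       = c(32, 33, 23, 62, 15, 16)
)

# Jitter horizontally only, so each dot keeps its exact pct value
p <- ggplot(toy, aes(x = bigregion, y = pct, color = religion)) +
  geom_jitter(
    position = position_jitter(width = 0.2, height = 0, seed = 1),
    size = 4
  ) +
  labs(x = "Region", y = "Percent", color = "Religion") +
  theme(legend.position = "top")

p
```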

My First Plot (Exercise)

Repeat the graphics above reversing the roles of region and religion. Produce the dodge2 geom_col() and the corresponding geom_point.

Answer

p <- ggplot(region_by_rel, aes(x = religion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = "Religion",y = "Percent", fill = "Region") +
    theme(legend.position = "top")

Exercise

Try geom_jitter instead of geom_col.

Answer

p <- ggplot(region_by_rel, aes(x = religion, y = pct))
p + geom_jitter(aes(color = bigregion),size=4) +
    labs(x = "Religion",y = "Percent", color = "Region") +
    theme(legend.position = "top")

Healy’s Facet Example

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ bigregion)

Is there any difference if we use facet_wrap() with nrow=1 instead of facet_grid()?

Answer

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_wrap(~ bigregion,nrow=1)

Exercise

Repeat this with the roles of religion and region reversed.

Answer

p <- ggplot(region_by_rel, aes(x = bigregion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Region") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ religion)

We started by examining visualizations that emphasized the symmetry between religion and region. Could this be done with facet_grid(), putting a simple block in each cell of the grid to represent the count? In effect, this would use the grid to make the scatterplot. Try it.

Answer

rr = gss_sm %>%
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
ggplot(data = rr,aes(x=relreg,y=count)) + geom_col() +
  facet_grid(religion~bigregion) +
  labs(x="")

Exercise

Try that with geom_point() instead of geom_col(). Also map color to religion.

Answer

rr = gss_sm %>%
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
ggplot(data = rr,aes(x=relreg,y=count,color = religion)) + geom_point() +
  facet_grid(religion~bigregion) +
  labs(x="")

For comparison, here is the simple table.

t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
t
##             
##              Northeast Midwest South West
##   Protestant       158     325   650  238
##   Catholic         162     172   160  155
##   Jewish            27       3    11   10
##   None             112     157   170  180
##   Other             28      33    50   48
##   <NA>               1       5    11    1
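The <NA> row matters when comparing results. The CrossTable() output above reports 2849 observations, which excludes the cases with missing religion. The group_by() pipelines keep them as their own group. As a small sketch (Northeast column only, counts copied from the table above, names illustrative), this is why rel_by_region showed 0.332 for Northeast Catholics while CrossTable() showed 0.333.

```r
# Northeast counts from the table above; the last entry is the <NA> row
northeast <- c(Protestant = 158, Catholic = 162, Jewish = 27,
               None = 112, Other = 28, Missing = 1)

# Shares with the missing case kept in the denominator (as in rel_by_region)
with_na <- northeast / sum(northeast)

# Shares with the missing case dropped (as in the CrossTable() output)
observed <- northeast[names(northeast) != "Missing"]
without_na <- observed / sum(observed)

round(with_na["Catholic"], 3)     # 0.332
round(without_na["Catholic"], 3)  # 0.333
```

To match the CrossTable() figures in a dplyr pipeline, one option would be to add filter(!is.na(religion)) before group_by().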