Two Categorical Variables

Harold Nelson

10/3/2022

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(socviz)
library(gmodels)
library(vcd)
## Loading required package: grid
load("~/Dropbox/Documents/SMU/CSC 463/cdc2.Rdata")

Religion and Region

These are two categorical variables, and we will use them as a running example for different ways of displaying the relationship between two such variables. The usual non-visual display is a cross-tabulation, or “cross-tab.” The base R table() function provides a very rudimentary version.

table(gss_sm$bigregion,gss_sm$religion)
##            
##             Protestant Catholic Jewish None Other
##   Northeast        158      162     27  112    28
##   Midwest          325      172      3  157    33
##   South            650      160     11  170    50
##   West             238      155     10  180    48

Note that these numbers are different from those in the book. The version of the data is apparently different, probably reflecting a difference in years.
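
The raw counts can also be turned into proportions in base R with prop.table(). The following is a minimal sketch, not part of the original slides: it retypes the counts from the table() output above into a matrix so that it runs without gss_sm, and the names counts, row_prop and col_prop are illustrative.

```r
# Counts retyped from the table() output above (rows: region, columns: religion).
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region   = c("Northeast", "Midwest", "South", "West"),
    religion = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
)

# Share of each religion within a region (each row sums to 1).
row_prop <- prop.table(counts, margin = 1)

# Share of each region within a religion (each column sums to 1).
col_prop <- prop.table(counts, margin = 2)

round(row_prop, 3)
round(col_prop, 3)
```

With the real data, the same call should work directly on the table object, for example prop.table(table(gss_sm$bigregion, gss_sm$religion), margin = 1).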

Fancy Crosstab

There is a more elaborate version, CrossTable(), in the gmodels package. To interpret the output, use the “Cell Contents” legend printed at the top left, which lists the five numbers reported in each cell.

CrossTable(gss_sm$bigregion,gss_sm$religion)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2849 
## 
##  
##                  | gss_sm$religion 
## gss_sm$bigregion | Protestant |   Catholic |     Jewish |       None |      Other |  Row Total | 
## -----------------|------------|------------|------------|------------|------------|------------|
##        Northeast |        158 |        162 |         27 |        112 |         28 |        487 | 
##                  |     24.877 |     23.502 |     38.340 |      0.362 |      0.025 |            | 
##                  |      0.324 |      0.333 |      0.055 |      0.230 |      0.057 |      0.171 | 
##                  |      0.115 |      0.250 |      0.529 |      0.181 |      0.176 |            | 
##                  |      0.055 |      0.057 |      0.009 |      0.039 |      0.010 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##          Midwest |        325 |        172 |          3 |        157 |         33 |        690 | 
##                  |      0.149 |      1.397 |      7.080 |      0.335 |      0.788 |            | 
##                  |      0.471 |      0.249 |      0.004 |      0.228 |      0.048 |      0.242 | 
##                  |      0.237 |      0.265 |      0.059 |      0.254 |      0.208 |            | 
##                  |      0.114 |      0.060 |      0.001 |      0.055 |      0.012 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##            South |        650 |        160 |         11 |        170 |         50 |       1041 | 
##                  |     44.346 |     25.093 |      3.128 |     13.953 |      1.129 |            | 
##                  |      0.624 |      0.154 |      0.011 |      0.163 |      0.048 |      0.365 | 
##                  |      0.474 |      0.247 |      0.216 |      0.275 |      0.314 |            | 
##                  |      0.228 |      0.056 |      0.004 |      0.060 |      0.018 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##             West |        238 |        155 |         10 |        180 |         48 |        631 | 
##                  |     14.194 |      0.882 |      0.149 |     13.426 |      4.641 |            | 
##                  |      0.377 |      0.246 |      0.016 |      0.285 |      0.076 |      0.221 | 
##                  |      0.174 |      0.239 |      0.196 |      0.291 |      0.302 |            | 
##                  |      0.084 |      0.054 |      0.004 |      0.063 |      0.017 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
##     Column Total |       1371 |        649 |         51 |        619 |        159 |       2849 | 
##                  |      0.481 |      0.228 |      0.018 |      0.217 |      0.056 |            | 
## -----------------|------------|------------|------------|------------|------------|------------|
## 
## 

Note that the column and row percentages (“marginals” in Healy’s terminology) closely match the results in Healy’s Tables 5.1 and 5.2.
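
The “Chi-square contribution” line in each cell of the CrossTable() output can be reproduced by hand as (observed - expected)^2 / expected, where the expected count under independence is row total times column total divided by the table total. This is a sketch in base R, with the counts retyped from the output above; the object names are illustrative and not from the slides.

```r
# Counts retyped from the CrossTable() output above (rows: region, columns: religion).
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region   = c("Northeast", "Midwest", "South", "West"),
    religion = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
)

# Expected count in each cell if region and religion were independent.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Each cell's contribution to the chi-square statistic.
contribution <- (counts - expected)^2 / expected
round(contribution, 3)

# The contributions add up to the overall chi-square statistic.
sum(contribution)
```

Large contributions, such as Protestant in the South, mark the cells that depart most from what independence would predict.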

Exercise

Do a crosstable of gender and genhlth in cdc2.

Solution

CrossTable(cdc2$gender,cdc2$genhlth)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  19997 
## 
##  
##              | cdc2$genhlth 
##  cdc2$gender | excellent | very good |      good |      fair |      poor | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            m |      2298 |      3380 |      2721 |       884 |       283 |      9566 | 
##              |     2.214 |     0.628 |     0.017 |     6.933 |     5.155 |           | 
##              |     0.240 |     0.353 |     0.284 |     0.092 |     0.030 |     0.478 | 
##              |     0.493 |     0.485 |     0.480 |     0.438 |     0.418 |           | 
##              |     0.115 |     0.169 |     0.136 |     0.044 |     0.014 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            f |      2359 |      3590 |      2953 |      1135 |       394 |     10431 | 
##              |     2.030 |     0.576 |     0.015 |     6.359 |     4.727 |           | 
##              |     0.226 |     0.344 |     0.283 |     0.109 |     0.038 |     0.522 | 
##              |     0.507 |     0.515 |     0.520 |     0.562 |     0.582 |           | 
##              |     0.118 |     0.180 |     0.148 |     0.057 |     0.020 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |      4657 |      6970 |      5674 |      2019 |       677 |     19997 | 
##              |     0.233 |     0.349 |     0.284 |     0.101 |     0.034 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

Visualize a Crosstab

Our goal in this course is to master ways of displaying the information visually.

Before getting into Healy’s methods, I’ll demonstrate two simple methods based on the idea of a scatterplot of two categorical variables. A plain scatterplot of two categorical variables is not very useful, because every observation in a cell lands on the same point. Try it and see what you get.

Answer

gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) + 
  geom_point()

Method 1

The first of the two simple methods uses the raw data and simply replaces geom_point() with geom_jitter(). Manual adjustment of the alpha and size parameters may improve on the result.

Try this.

Answer

gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) + 
  geom_jitter(alpha=.5,size=.5)

Exercise

Adjust size and alpha to get a result you find effective.

Another Method

My second method is to use dplyr to count the observations in the cells of the table and produce a scatterplot of the categorical variables. The counts are mapped to the size of the dots in the scatterplot.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,size=count)) + 
  geom_point()
## `summarise()` has grouped output by 'religion'. You can override using the
## `.groups` argument.
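
The group_by() / summarize() / ungroup() sequence can be shortened with dplyr's count(), which returns an ungrouped result and so avoids the message above. Here is a minimal sketch on a small made-up data frame; toy is hypothetical and only stands in for gss_sm.

```r
library(dplyr)

# Hypothetical stand-in for gss_sm with two categorical columns.
toy <- data.frame(
  religion  = c("Protestant", "Protestant", "Catholic", "Catholic", "Catholic", "None"),
  bigregion = c("South", "South", "South", "West", "West", "West")
)

# Long form: group, count, then drop the grouping explicitly.
long_form <- toy %>%
  group_by(religion, bigregion) %>%
  summarize(count = n(), .groups = "drop")

# Short form: count() does all three steps; `name` sets the count column's name.
short_form <- toy %>%
  count(religion, bigregion, name = "count")

short_form
```

Either result can be piped into the same ggplot() call used above.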

A variant on this idea is to map count to color instead of size. This requires larger dots to have any hope of success.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,color=count)) + geom_point(size=9)
## `summarise()` has grouped output by 'religion'. You can override using the
## `.groups` argument.
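
ggplot2 also provides geom_count(), which tallies the observations at each position and maps the tally to point size, so the counting step of the size-based version above can be left to the geom. The sketch below uses a small made-up data frame (toy is hypothetical and stands in for gss_sm).

```r
library(ggplot2)

# Hypothetical stand-in for gss_sm with two categorical columns.
toy <- data.frame(
  religion  = c("Protestant", "Protestant", "Catholic", "Catholic", "Catholic", "None"),
  bigregion = c("South", "South", "South", "West", "West", "West")
)

# geom_count() counts the rows at each (x, y) combination
# and sizes the point by that count.
p <- ggplot(toy, aes(x = bigregion, y = religion)) +
  geom_count()

p
```

The dplyr version remains useful when you want the counts as a data frame for other purposes, such as mapping them to color.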

Mosaic Plots

The mosaic plot is another alternative for looking at relationships among categorical variables. The best version is in the vcd (Visualization of Categorical Data) package. Here’s a look at our relationship.

mosaic(~religion + bigregion,data=gss_sm)

Exercise

Use a mosaic plot to explore the relationship between gender and genhlth in cdc2.

Solution

mosaic(~gender + genhlth,data=cdc2)

And flipped

mosaic(~genhlth + gender,data=cdc2)

Discussion

Do any of these methods work? Which do you prefer?

Healy’s rel_by_region

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
## `summarise()` has grouped output by 'bigregion'. You can override using the
## `.groups` argument.
head(rel_by_region,10)
## # A tibble: 10 × 5
## # Groups:   bigregion [2]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23

Note that there is an alternative choice: the distribution of region within each religion. Do this as an exercise before you look at the next slide.

My region_by_rel

region_by_rel <- gss_sm %>%
    group_by(religion,bigregion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
## `summarise()` has grouped output by 'religion'. You can override using the
## `.groups` argument.
head(region_by_rel,10)
## # A tibble: 10 × 5
## # Groups:   religion [3]
##    religion   bigregion     N   freq   pct
##    <fct>      <fct>     <int>  <dbl> <dbl>
##  1 Protestant Northeast   158 0.115     12
##  2 Protestant Midwest     325 0.237     24
##  3 Protestant South       650 0.474     47
##  4 Protestant West        238 0.174     17
##  5 Catholic   Northeast   162 0.250     25
##  6 Catholic   Midwest     172 0.265     27
##  7 Catholic   South       160 0.247     25
##  8 Catholic   West        155 0.239     24
##  9 Jewish     Northeast    27 0.529     53
## 10 Jewish     Midwest       3 0.0588     6

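Note the difference: here freq sums to 1 within each religion, whereas in rel_by_region it sums to 1 within each region.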
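
The denominator is decided by the order of the variables in group_by(). By default, summarize() drops the last grouping variable, so the sum(N) in the following mutate() is taken within the first one. This sketch spells that out with an explicit .groups = "drop_last" on a small made-up data frame (toy is hypothetical, not from the slides).

```r
library(dplyr)

# Hypothetical stand-in for gss_sm with two categorical columns.
toy <- data.frame(
  religion  = c("Protestant", "Protestant", "Catholic", "Catholic", "Catholic", "None"),
  bigregion = c("South", "South", "South", "West", "West", "West")
)

# Region first: after summarize() the data are still grouped by bigregion,
# so sum(N) is the region total.
by_region <- toy %>%
  group_by(bigregion, religion) %>%
  summarize(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))

# Religion first: now the data stay grouped by religion,
# so sum(N) is the religion total.
by_religion <- toy %>%
  group_by(religion, bigregion) %>%
  summarize(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))

by_region
by_religion
```

The cell counts N are identical in the two results; only the totals they are divided by change.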

Healy’s First Plot

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = "Region",y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

Trivial Exercise

What is the difference if you set position to dodge instead of dodge2?

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge") +
    labs(x = "Region",y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

Another Trivial Exercise

What does the graph look like if you use geom_point() instead of geom_col()? Think for a minute before you do this.

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_point(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")

Note that since the default plot character is a solid dot, you must use color instead of fill to get the color to show. There is some overplotting. Would jitter help? Would a smaller or larger dot be preferred? Experiment!

Answer

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_jitter(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")

My First Plot (Exercise)

Repeat the graphics above reversing the roles of region and religion. Produce the dodge2 geom_col() and the corresponding geom_point.

Answer

p <- ggplot(region_by_rel, aes(x = religion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = "Religion",y = "Percent", fill = "Region") +
    theme(legend.position = "top")

Exercise

Try geom_jitter instead of geom_col.

Answer

p <- ggplot(region_by_rel, aes(x = religion, y = pct))
p + geom_jitter(aes(color = bigregion),size=4) +
    labs(x = "Religion",y = "Percent", color = "Region") +
    theme(legend.position = "top")

Healy’s Facet Example

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ bigregion)

Is there any difference if we use facet_wrap() with nrow=1 instead of facet_grid()?

Answer

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_wrap(~ bigregion,nrow=1)

Exercise

Repeat this with the roles of religion and region reversed.

Answer

p <- ggplot(region_by_rel, aes(x = bigregion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Region") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ religion)

We started by examining visualizations that emphasized the symmetry between religion and region. Could this be done with facet_grid(), putting a simple block in each cell of the grid to represent the count? In effect, this would use the grid to make the scatterplot. Try it.

Answer

rr = gss_sm %>%
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the
## `.groups` argument.
ggplot(data = rr,aes(x=relreg,y=count)) + geom_col() +
  facet_grid(religion~bigregion) +
  labs(x="")

Exercise

Try that with geom_point() instead of geom_col(). Also map color to religion.

Answer

rr = gss_sm %>%
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the
## `.groups` argument.
ggplot(data = rr,aes(x=relreg,y=count,color = religion)) + geom_point() +
  facet_grid(religion~bigregion) +
  labs(x="")

For comparison, here is the simple table.

t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
t
##             
##              Northeast Midwest South West
##   Protestant       158     325   650  238
##   Catholic         162     172   160  155
##   Jewish            27       3    11   10
##   None             112     157   170  180
##   Other             28      33    50   48
##   <NA>               1       5    11    1
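
The <NA> row accounts for a small discrepancy earlier in these notes: rel_by_region keeps the missing-religion rows in each region's denominator (Midwest Protestant 0.468), while the CrossTable() output excludes them (0.471). The sketch below reproduces both figures in base R from the counts in the table above; the object names are illustrative.

```r
# Counts retyped from the table above, including the <NA> row.
with_na <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48,
      1,   5,  11,   1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    religion = c("Protestant", "Catholic", "Jewish", "None", "Other", "<NA>"),
    region   = c("Northeast", "Midwest", "South", "West")
  )
)

# Same table with the missing-religion row removed.
without_na <- with_na[rownames(with_na) != "<NA>", ]

# Share of each religion within a region, with and without the <NA> row.
share_with_na    <- prop.table(with_na, margin = 2)
share_without_na <- prop.table(without_na, margin = 2)

round(share_with_na["Protestant", "Midwest"], 3)     # denominator 695
round(share_without_na["Protestant", "Midwest"], 3)  # denominator 690
```

On the real data, one way to match the CrossTable() figures would be to drop the missing values, for example with filter(!is.na(religion)), before group_by().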