Healy Chapter 5 Part 1

Harold Nelson

9/29/2018

Setup

library(tidyverse)
## ── Attaching packages ── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(socviz)
library(gmodels)

Religion and Region

We want to explore different ways of displaying the relationship between geographical region and religious affiliation in the US. The usual non-visual way to do this is called a “cross-tab.” The base R table() function provides a very rudimentary version.

table(gss_sm$religion,gss_sm$bigregion)
##             
##              Northeast Midwest South West
##   Protestant       158     325   650  238
##   Catholic         162     172   160  155
##   Jewish            27       3    11   10
##   None             112     157   170  180
##   Other             28      33    50   48
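
Raw counts are hard to compare across regions of different sizes. As a minimal sketch, the counts printed above can be re-entered by hand and converted to column proportions with base R's prop.table(). The object names counts and col_props are my own and not part of the original code.

```r
# Counts copied from the table() output above; `counts` is an illustrative name.
counts <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other"),
    bigregion = c("Northeast", "Midwest", "South", "West")
  )
)

# margin = 2 divides each cell by its column total,
# giving the distribution of religion within each region.
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
```

Using margin = 1 in the same call gives row proportions instead, that is, the distribution of region within each religion.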

Fancy Crosstab

There is a more elaborate version, CrossTable(), in the gmodels package. Its output begins with a “Cell Contents” legend that lists, in order from top to bottom, the quantities stacked in each cell of the table.

CrossTable(gss_sm$religion,gss_sm$bigregion)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2849 
## 
##  
##                 | gss_sm$bigregion 
## gss_sm$religion | Northeast |   Midwest |     South |      West | Row Total | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##      Protestant |       158 |       325 |       650 |       238 |      1371 | 
##                 |    24.877 |     0.149 |    44.346 |    14.194 |           | 
##                 |     0.115 |     0.237 |     0.474 |     0.174 |     0.481 | 
##                 |     0.324 |     0.471 |     0.624 |     0.377 |           | 
##                 |     0.055 |     0.114 |     0.228 |     0.084 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##        Catholic |       162 |       172 |       160 |       155 |       649 | 
##                 |    23.502 |     1.397 |    25.093 |     0.882 |           | 
##                 |     0.250 |     0.265 |     0.247 |     0.239 |     0.228 | 
##                 |     0.333 |     0.249 |     0.154 |     0.246 |           | 
##                 |     0.057 |     0.060 |     0.056 |     0.054 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##          Jewish |        27 |         3 |        11 |        10 |        51 | 
##                 |    38.340 |     7.080 |     3.128 |     0.149 |           | 
##                 |     0.529 |     0.059 |     0.216 |     0.196 |     0.018 | 
##                 |     0.055 |     0.004 |     0.011 |     0.016 |           | 
##                 |     0.009 |     0.001 |     0.004 |     0.004 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##            None |       112 |       157 |       170 |       180 |       619 | 
##                 |     0.362 |     0.335 |    13.953 |    13.426 |           | 
##                 |     0.181 |     0.254 |     0.275 |     0.291 |     0.217 | 
##                 |     0.230 |     0.228 |     0.163 |     0.285 |           | 
##                 |     0.039 |     0.055 |     0.060 |     0.063 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##           Other |        28 |        33 |        50 |        48 |       159 | 
##                 |     0.025 |     0.788 |     1.129 |     4.641 |           | 
##                 |     0.176 |     0.208 |     0.314 |     0.302 |     0.056 | 
##                 |     0.057 |     0.048 |     0.048 |     0.076 |           | 
##                 |     0.010 |     0.012 |     0.018 |     0.017 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
##    Column Total |       487 |       690 |      1041 |       631 |      2849 | 
##                 |     0.171 |     0.242 |     0.365 |     0.221 |           | 
## ----------------|-----------|-----------|-----------|-----------|-----------|
## 
## 
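
The “Chi-square contribution” line in each cell is the least familiar of the five quantities. The sketch below reproduces it with base R from the counts shown above, assuming the usual definition (observed minus expected, squared, divided by expected). The names counts, expected, and contrib are illustrative.

```r
# Counts copied from the CrossTable() output above; names are illustrative.
counts <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other"),
    bigregion = c("Northeast", "Midwest", "South", "West")
  )
)

# Expected count under independence: row total * column total / table total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Each cell's contribution to the chi-square statistic.
contrib <- (counts - expected)^2 / expected
round(contrib, 3)

# The contributions add up to the overall chi-square statistic.
sum(contrib)
```

Cells with large contributions, such as Jewish/Northeast and Protestant/South, are the ones that depart most from what independence of religion and region would predict.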

Visualize a Crosstab

Before getting into Healy’s method, I’ll demonstrate two simple methods based on the idea of a scatterplot of two categorical variables.

The first of these uses the raw data and simply replaces geom_point() with geom_jitter(). Manual adjustment of the alpha and size parameters may improve on the result.

gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) + 
  geom_jitter(alpha=.5,size=.5)

My second method is to use dplyr to count the observations in the cells of the table and produce a scatterplot of the categorical variables. The counts are mapped to the size of the dots in the scatterplot.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,size=count)) + geom_point()

  ggsave("rel_reg.pdf",width = 8, height = 10)
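
ggplot2 also has a geom that does the counting itself. As a sketch of the same idea, geom_count() tallies the observations at each region/religion combination and maps the tally to point size, so the dplyr step is not needed. The object name p_count is my own.

```r
library(ggplot2)
library(socviz)   # provides gss_sm

# geom_count() counts the rows at each (x, y) location
# and maps that count to the size of the point.
p_count <- ggplot(gss_sm, aes(x = bigregion, y = religion)) +
  geom_count()
p_count
```

The result should resemble the plot above; the main difference is that the legend is labelled by the computed count rather than by a column created with summarize().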

A variant on this idea is to map count to color instead of size. This requires larger dots to have any hope of success.

gss_sm %>% 
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=bigregion,y=religion,color=count)) + geom_point(size=9)

  #ggsave("rel_reg.pdf",width = 8, height = 10)
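
If color is going to carry the count, filling the whole cell may work better than coloring a dot. The following is a sketch of that alternative, a heatmap built with geom_tile(); it is not one of the methods in the original notes, and the names cell_counts and p_tile are illustrative.

```r
library(dplyr)
library(ggplot2)
library(socviz)   # provides gss_sm

# count() is shorthand for group_by() + summarize(n = n()) + ungroup();
# the resulting count column is named n.
cell_counts <- gss_sm %>%
  count(religion, bigregion)

# One tile per cell, shaded by count, with the count printed on top.
p_tile <- ggplot(cell_counts, aes(x = bigregion, y = religion, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white")
p_tile
```

Printing the number on each tile means the reader does not have to decode the color scale to recover the counts.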

Do these methods work?

Healy’s rel_by_region

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
head(rel_by_region,10)
## # A tibble: 10 x 5
## # Groups:   bigregion [2]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
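
The freq calculation works because summarize() removes the last grouping variable (religion) and leaves the result grouped by bigregion, so sum(N) is a regional total. A quick check, sketched here by rebuilding rel_by_region as above, is that the frequencies add to 1 within each region. The name region_check is my own.

```r
library(dplyr)
library(socviz)   # provides gss_sm

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%              # result is still grouped by bigregion
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))

# One row per region: the frequencies should total 1.
region_check <- rel_by_region %>%
  summarize(total_N = sum(N),
            total_freq = sum(freq),
            total_pct = sum(pct))
region_check
```

Because pct is rounded, its regional totals need not be exactly 100 even when the frequencies sum to 1.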

Note that there is an alternative choice: the distribution of region within each religion. Do this as an exercise before you look at the next slide.

My region_by_rel

region_by_rel <- gss_sm %>%
    group_by(religion,bigregion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
head(region_by_rel,10)
## # A tibble: 10 x 5
## # Groups:   religion [3]
##    religion   bigregion     N   freq   pct
##    <fct>      <fct>     <int>  <dbl> <dbl>
##  1 Protestant Northeast   158 0.115     12
##  2 Protestant Midwest     325 0.237     24
##  3 Protestant South       650 0.474     47
##  4 Protestant West        238 0.174     17
##  5 Catholic   Northeast   162 0.250     25
##  6 Catholic   Midwest     172 0.265     27
##  7 Catholic   South       160 0.247     25
##  8 Catholic   West        155 0.239     24
##  9 Jewish     Northeast    27 0.529     53
## 10 Jewish     Midwest       3 0.0588     6

Healy’s First Plot

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = "Region",y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

Trivial Exercise

What is the difference if you set position to dodge instead of dodge2?

Another Trivial Exercise

What does the graph look like if you use geom_point() instead of geom_col()? Think for a minute before you do this.

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_point(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")

Note that since the default plot character is a solid dot, you must use color instead of fill to get the color to show. There is some overplotting. Would jitter help? Would a smaller or larger dot be preferred? Experiment!

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_jitter(aes(color = religion),size=4) +
    labs(x = "Region",y = "Percent", color = "Religion") +
    theme(legend.position = "top")
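
One caution when experimenting: by default geom_jitter() perturbs both coordinates, so the dots no longer sit at their true percentages. The sketch below spreads the dots horizontally only by setting height = 0. It rebuilds rel_by_region so that it runs on its own; the width of 0.2 is an arbitrary choice of mine, and p_jit is an illustrative name.

```r
library(dplyr)
library(ggplot2)
library(socviz)   # provides gss_sm

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))

# width controls horizontal spread; height = 0 leaves pct untouched.
p_jit <- ggplot(rel_by_region, aes(x = bigregion, y = pct)) +
  geom_jitter(aes(color = religion), size = 4, width = 0.2, height = 0) +
  labs(x = "Region", y = "Percent", color = "Religion") +
  theme(legend.position = "top")
p_jit
```

Since the jitter is random, the horizontal positions change each time the plot is drawn.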

My First Plot (Exercise)

Repeat the graphics above reversing the roles of region and religion. Produce the dodge2 geom_col() and the corresponding geom_point.

p <- ggplot(region_by_rel, aes(x = religion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = "Religion",y = "Percent", fill = "Region") +
    theme(legend.position = "top")

p <- ggplot(region_by_rel, aes(x = religion, y = pct))
p + geom_jitter(aes(color = bigregion),size=4) +
    labs(x = "Religion",y = "Percent", color = "Region") +
    theme(legend.position = "top")

Healy’s Facet Example

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ bigregion)

Is there any difference if we use facet_wrap() with nrow=1 instead of facet_grid()?
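
A sketch for trying this out: the code below rebuilds rel_by_region so that it runs on its own, and differs from Healy’s plot only in the final line. The name p_wrap is illustrative.

```r
library(dplyr)
library(ggplot2)
library(socviz)   # provides gss_sm

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))

# Same plot as the facet_grid() version, but laid out with facet_wrap().
p_wrap <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion)) +
  geom_col(position = "dodge2") +
  labs(x = NULL, y = "Percent", fill = "Religion") +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ bigregion, nrow = 1)
p_wrap
```

Compare the two plots side by side, then try removing nrow = 1 to see how facet_wrap() arranges the panels when left to its defaults.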

Exercise

Repeat this with the roles of religion and region reversed.

Answer

p <- ggplot(region_by_rel, aes(x = bigregion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Region") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ religion)

We started by examining visualizations which emphasized symmetry between religion and region. Could this be done with facet_grid(), putting a simple block to represent the count in each cell of the grid? In effect, this would use the grid to make the scatterplot. Try it.

Answer

rr = gss_sm %>%
  select(religion,bigregion) %>%
  group_by(religion, bigregion) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(relreg = " ")

ggplot(data = rr,aes(x=relreg,y=count)) + geom_col() +
  facet_grid(religion~bigregion) +
  labs(x="")

For comparison, here is the simple table.

t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
t
##             
##              Northeast Midwest South West
##   Protestant       158     325   650  238
##   Catholic         162     172   160  155
##   Jewish            27       3    11   10
##   None             112     157   170  180
##   Other             28      33    50   48
##   <NA>               1       5    11    1

Mosaic Plots

Another method of visualizing relationships among categorical variables is the mosaic plot. Base R has a command to produce these.

t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
mosaicplot(t)
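
In a mosaic plot the width of each column is proportional to the total for the first variable, and the heights within a column show the conditional distribution of the second variable. The default output is hard to read, so here is a sketch with a few of mosaicplot()'s optional arguments. Putting region first and the title text are my choices, and region_rel is an illustrative name.

```r
library(socviz)   # provides gss_sm

# Region first, so regions run along the x axis; dnn supplies the axis labels.
# Without useNA, rows with a missing religion are dropped.
region_rel <- table(gss_sm$bigregion, gss_sm$religion,
                    dnn = c("Region", "Religion"))

# las = 1 keeps the category labels horizontal; color = TRUE shades the tiles.
mosaicplot(region_rel, main = "Religion by region", color = TRUE, las = 1)
```

With region on the x axis, this plot shows the same conditional percentages as rel_by_region, with the added information of how large each region's sample is.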