Harold Nelson
1/29/2021
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: grid
Religion and region (the religion and bigregion columns of gss_sm) are two categorical variables, and we use them as an example to study different ways of displaying the relationship between two such variables. The usual non-visual way to do this is called a “cross-tab.” The standard table function provides a very rudimentary version.
##
## Protestant Catholic Jewish None Other
## Northeast 158 162 27 112 28
## Midwest 325 172 3 157 33
## South 650 160 11 170 50
## West 238 155 10 180 48
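The call that produced this table is not shown. It was presumably something like table(gss_sm$bigregion, gss_sm$religion), with gss_sm assumed to come from the socviz package. The sketch below avoids that dependency: it rebuilds a respondent-level toy data set from the counts printed above (the names counts, region and religion are illustrative only) and cross-tabulates it with table().

```r
# Counts copied from the table above (rows: region, columns: religion).
region_levels   <- c("Northeast", "Midwest", "South", "West")
religion_levels <- c("Protestant", "Catholic", "Jewish", "None", "Other")
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE
)

# Expand each cell into one entry per respondent, mimicking the raw data.
# row() and col() give the row and column index of every cell.
region <- factor(
  rep(region_levels[as.vector(row(counts))], times = as.vector(counts)),
  levels = region_levels
)
religion <- factor(
  rep(religion_levels[as.vector(col(counts))], times = as.vector(counts)),
  levels = religion_levels
)

# table() cross-tabulates two categorical vectors of equal length.
xtab <- table(region, religion)
print(xtab)
```

Passing factors with explicit levels keeps the rows and columns in the order shown above rather than in alphabetical order.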
Note that these numbers are different from those in the book. The version of the data is apparently different, probably reflecting a difference in years.
There is a more elaborate version, CrossTable, in the gmodels package. To interpret the output, use the “Cell Contents” key printed at the top left, which lists what each of the five lines in a cell means.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2849
##
##
## | gss_sm$religion
## gss_sm$bigregion | Protestant | Catholic | Jewish | None | Other | Row Total |
## -----------------|------------|------------|------------|------------|------------|------------|
## Northeast | 158 | 162 | 27 | 112 | 28 | 487 |
## | 24.877 | 23.502 | 38.340 | 0.362 | 0.025 | |
## | 0.324 | 0.333 | 0.055 | 0.230 | 0.057 | 0.171 |
## | 0.115 | 0.250 | 0.529 | 0.181 | 0.176 | |
## | 0.055 | 0.057 | 0.009 | 0.039 | 0.010 | |
## -----------------|------------|------------|------------|------------|------------|------------|
## Midwest | 325 | 172 | 3 | 157 | 33 | 690 |
## | 0.149 | 1.397 | 7.080 | 0.335 | 0.788 | |
## | 0.471 | 0.249 | 0.004 | 0.228 | 0.048 | 0.242 |
## | 0.237 | 0.265 | 0.059 | 0.254 | 0.208 | |
## | 0.114 | 0.060 | 0.001 | 0.055 | 0.012 | |
## -----------------|------------|------------|------------|------------|------------|------------|
## South | 650 | 160 | 11 | 170 | 50 | 1041 |
## | 44.346 | 25.093 | 3.128 | 13.953 | 1.129 | |
## | 0.624 | 0.154 | 0.011 | 0.163 | 0.048 | 0.365 |
## | 0.474 | 0.247 | 0.216 | 0.275 | 0.314 | |
## | 0.228 | 0.056 | 0.004 | 0.060 | 0.018 | |
## -----------------|------------|------------|------------|------------|------------|------------|
## West | 238 | 155 | 10 | 180 | 48 | 631 |
## | 14.194 | 0.882 | 0.149 | 13.426 | 4.641 | |
## | 0.377 | 0.246 | 0.016 | 0.285 | 0.076 | 0.221 |
## | 0.174 | 0.239 | 0.196 | 0.291 | 0.302 | |
## | 0.084 | 0.054 | 0.004 | 0.063 | 0.017 | |
## -----------------|------------|------------|------------|------------|------------|------------|
## Column Total | 1371 | 649 | 51 | 619 | 159 | 2849 |
## | 0.481 | 0.228 | 0.018 | 0.217 | 0.056 | |
## -----------------|------------|------------|------------|------------|------------|------------|
##
##
Note that the column and row percents (“marginals” in Healy’s terminology) closely match the results in Healy’s Tables 5.1 and 5.2.
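To make the five lines of each CrossTable cell concrete, here is a minimal base R sketch that recomputes them for one cell from the counts above. It does not call gmodels. The variable names are illustrative, and the chi-square contribution is taken to be the usual (observed - expected)^2 / expected.

```r
# Counts copied from the cross-tab above.
counts <- matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    bigregion = c("Northeast", "Midwest", "South", "West"),
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
)

n        <- sum(counts)
# Counts expected if region and religion were independent.
expected <- outer(rowSums(counts), colSums(counts)) / n
chisq    <- (counts - expected)^2 / expected   # chi-square contribution
row_prop <- prop.table(counts, margin = 1)     # N / Row Total
col_prop <- prop.table(counts, margin = 2)     # N / Col Total
tab_prop <- counts / n                         # N / Table Total

# The five entries for the Northeast / Protestant cell.
cell <- c(
  N        = counts["Northeast", "Protestant"],
  chisq    = chisq["Northeast", "Protestant"],
  row_prop = row_prop["Northeast", "Protestant"],
  col_prop = col_prop["Northeast", "Protestant"],
  tab_prop = tab_prop["Northeast", "Protestant"]
)
print(round(cell, 3))
```

The rounded values can be compared line by line with the first cell of the CrossTable output.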
Our goal in this course is to master ways of displaying the information visually.
Before getting into Healy’s methods, I’ll demonstrate two simple methods based on the idea of a scatterplot of two categorical variables. The simple scatterplot of two categorical variables is not very useful, because every observation lands on one of a handful of grid positions. Try this and see what you get.
The first of the two simple methods uses the raw data and simply replaces geom_point() with geom_jitter(). Manual adjustment of the alpha and size parameters may improve on the result.
Try this.
gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) +
geom_jitter(alpha=.5,size=.5)
Adjust size and alpha to get a result you find effective.
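To see why jittering helps, the sketch below counts distinct plotting positions before and after adding noise. It uses a small simulated data set rather than gss_sm, and base R's jitter() on the integer codes of the factors as a stand-in for geom_jitter(). The data and all names here are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 200 respondents, two categorical variables.
toy <- data.frame(
  bigregion = factor(sample(c("Northeast", "Midwest", "South", "West"),
                            200, replace = TRUE)),
  religion  = factor(sample(c("Protestant", "Catholic", "None"),
                            200, replace = TRUE))
)

# Without jitter there are at most 4 x 3 = 12 distinct positions,
# so almost all of the 200 points are hidden behind another point.
n_positions <- nrow(unique(toy))

# Jitter the integer codes of the factors by a small random amount.
x <- jitter(as.integer(toy$bigregion), amount = 0.3)
y <- jitter(as.integer(toy$religion),  amount = 0.3)
n_jittered <- nrow(unique(data.frame(x, y)))

print(c(plain = n_positions, jittered = n_jittered))
```

After jittering, the density of each cloud of points conveys the cell count, which is why alpha and size matter.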
My second method is to use dplyr to count the observations in the cells of the table and produce a scatterplot of the categorical variables. The counts are mapped to the size of the dots in the scatterplot.
gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(aes(x=bigregion,y=religion,size=count)) +
geom_point()
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
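The message above appears because summarise() removes only the last grouping variable, so the result is still grouped by religion until ungroup() is called. As a sketch on a tiny made-up data frame (toy is hypothetical), either setting .groups = "drop" or using dplyr's count() gives the same ungrouped table of counts without the message.

```r
library(dplyr)

# Hypothetical toy data standing in for gss_sm.
toy <- tibble(
  religion  = c("Catholic", "Catholic", "None", "None", "None"),
  bigregion = c("West", "South", "West", "West", "South")
)

# Option 1: drop all grouping explicitly inside summarise().
by_summarise <- toy %>%
  group_by(religion, bigregion) %>%
  summarise(count = n(), .groups = "drop")

# Option 2: count() groups, tallies, and ungroups in one step.
by_count <- toy %>%
  count(religion, bigregion, name = "count")

print(by_count)
```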
A variant on this idea is to map count to color instead of size. This requires larger dots to have any hope of success.
gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(aes(x=bigregion,y=religion,color=count)) + geom_point(size=9)
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
The mosaic plot is another alternative for looking at relationships among categorical variables. The best version is in the vcd (Visualization of Categorical Data) package. Here’s a look at our relationship.
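The code for the mosaic plot is not shown. As a rough stand-in, the sketch below uses mosaicplot() from base R graphics rather than the vcd package, applied to the counts from the cross-tab above, so the appearance will differ from the vcd version.

```r
# Counts copied from the cross-tab above.
counts <- as.table(matrix(
  c(158, 162, 27, 112, 28,
    325, 172,  3, 157, 33,
    650, 160, 11, 170, 50,
    238, 155, 10, 180, 48),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    bigregion = c("Northeast", "Midwest", "South", "West"),
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other")
  )
))

# Column widths reflect each region's share of respondents;
# tile heights reflect the share of each religion within the region.
mosaicplot(counts, main = "Religion by region", color = TRUE, las = 1)
```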
Do any of these methods work? Which do you prefer?
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
## `summarise()` has grouped output by 'bigregion'. You can override using the `.groups` argument.
## # A tibble: 10 x 5
## # Groups: bigregion [2]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast <NA> 1 0.00205 0
## 7 Midwest Protestant 325 0.468 47
## 8 Midwest Catholic 172 0.247 25
## 9 Midwest Jewish 3 0.00432 0
## 10 Midwest None 157 0.226 23
Note that there is an alternative choice: the distribution of region within each religion. Do this as an exercise before you look at the next slide.
region_by_rel <- gss_sm %>%
group_by(religion,bigregion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
## # A tibble: 10 x 5
## # Groups: religion [3]
## religion bigregion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Protestant Northeast 158 0.115 12
## 2 Protestant Midwest 325 0.237 24
## 3 Protestant South 650 0.474 47
## 4 Protestant West 238 0.174 17
## 5 Catholic Northeast 162 0.250 25
## 6 Catholic Midwest 172 0.265 27
## 7 Catholic South 160 0.247 25
## 8 Catholic West 155 0.239 24
## 9 Jewish Northeast 27 0.529 53
## 10 Jewish Midwest 3 0.0588 6
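Note the difference. Because summarize() drops the last grouping variable, mutate() computes freq within the first one: in rel_by_region the frequencies sum to 1 within each region, while in region_by_rel they sum to 1 within each religion.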
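The two denominators can be seen side by side in a small base R sketch. It uses only a 2 × 2 corner of the table (so the resulting fractions are not those shown above) and ave() as a base R analogue of a grouped mutate(). The data frame d and its column names are illustrative.

```r
# A 2 x 2 corner of the cross-tab, in long form.
d <- data.frame(
  bigregion = c("Northeast", "Northeast", "Midwest", "Midwest"),
  religion  = c("Protestant", "Catholic", "Protestant", "Catholic"),
  N         = c(158, 162, 325, 172)
)

# Denominator = total for the region (the rel_by_region pattern).
d$freq_in_region <- d$N / ave(d$N, d$bigregion, FUN = sum)

# Denominator = total for the religion (the region_by_rel pattern).
d$freq_in_religion <- d$N / ave(d$N, d$religion, FUN = sum)

print(d)

# Each set of frequencies sums to 1 within its own grouping variable.
print(tapply(d$freq_in_region,   d$bigregion, sum))
print(tapply(d$freq_in_religion, d$religion,  sum))
```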
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = "Region",y = "Percent", fill = "Religion") +
theme(legend.position = "top")
What is the difference if you set position to dodge instead of dodge2?
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge") +
labs(x = "Region",y = "Percent", fill = "Religion") +
theme(legend.position = "top")
## Another Trivial Exercise
What does the graph look like if you use geom_point() instead of geom_col()? Think for a minute before you do this.
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_point(aes(color = religion),size=4) +
labs(x = "Region",y = "Percent", color = "Religion") +
theme(legend.position = "top")
Note that since the default plot character is a solid dot, you must use color instead of fill to get the color to show. There is some overplotting. Would jitter help? Would a smaller or larger dot be preferred? Experiment!
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_jitter(aes(color = religion),size=4) +
labs(x = "Region",y = "Percent", color = "Religion") +
theme(legend.position = "top")
Repeat the graphics above reversing the roles of region and religion. Produce the dodge2 geom_col() and the corresponding geom_point.
p <- ggplot(region_by_rel, aes(x = religion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
labs(x = "Religion",y = "Percent", fill = "Region") +
theme(legend.position = "top")
Try geom_jitter instead of geom_col.
p <- ggplot(region_by_rel, aes(x = religion, y = pct))
p + geom_jitter(aes(color = bigregion),size=4) +
labs(x = "Religion",y = "Percent", color = "Region") +
theme(legend.position = "top")
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = "none") +
coord_flip() +
facet_grid(~ bigregion)
Is there any difference if we use facet_wrap() with nrow=1 instead of facet_grid()?
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = "none") +
coord_flip() +
facet_wrap(~ bigregion,nrow=1)
Repeat this with the roles of religion and region reversed.
p <- ggplot(region_by_rel, aes(x = bigregion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "bigregion") +
guides(fill = "none") +
coord_flip() +
facet_grid(~ religion)
We started by examining visualizations that emphasized the symmetry between religion and region. Could this be done with facet_grid(), putting a simple block in each cell of the grid to represent its count? In effect, this would use the grid to make the scatterplot. Try it.
rr = gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
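The plotting call for this step is not shown. One plausible version, mirroring the geom_point() code later in this section, puts a single bar in each facet with its height mapped to count. The sketch below runs on a hypothetical four-row miniature of rr (rr_toy) so that it is self-contained. With the real rr, only the data argument would change.

```r
library(ggplot2)

# Hypothetical miniature of rr: one row per religion/region cell.
rr_toy <- data.frame(
  religion  = rep(c("Protestant", "Catholic"), each = 2),
  bigregion = rep(c("Northeast", "Midwest"), times = 2),
  count     = c(158, 325, 162, 172),
  relreg    = " "   # single dummy x value, so each facet holds one bar
)

# One facet per cell of the cross-tab; bar height shows the count.
p <- ggplot(rr_toy, aes(x = relreg, y = count)) +
  geom_col() +
  facet_grid(religion ~ bigregion) +
  labs(x = "")
print(p)
```

The dummy relreg column exists only because geom_col() needs an x aesthetic. The grid itself carries both categorical variables.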
Try that with geom_point() instead of geom_col(). Also map color to religion.
rr = gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(relreg = " ")
## `summarise()` has grouped output by 'religion'. You can override using the `.groups` argument.
ggplot(data = rr,aes(x=relreg,y=count,color = religion)) + geom_point() +
facet_grid(religion~bigregion) +
labs(x="")
For comparison, here is the simple table.
##
## Northeast Midwest South West
## Protestant 158 325 650 238
## Catholic 162 172 160 155
## Jewish 27 3 11 10
## None 112 157 170 180
## Other 28 33 50 48
## <NA> 1 5 11 1
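Unlike the first table, this one has an NA row. By default table() silently drops missing values, and the useNA argument controls whether they are shown. A minimal sketch on made-up vectors (not the gss_sm data) illustrates the difference.

```r
# Hypothetical toy vectors; one respondent has no recorded religion.
religion <- factor(c("Catholic", "None", NA, "Catholic"))
region   <- factor(c("West", "West", "South", "South"))

# Default: the respondent with missing religion is left out.
without_na <- table(religion, region)

# useNA = "ifany" adds an NA row or column only where NAs occur.
with_na <- table(religion, region, useNA = "ifany")

print(without_na)
print(with_na)
```

This is also why the group_by() and summarize() counts earlier include an NA row: dplyr keeps missing values as their own group.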