Harold Nelson
9/29/2018
library(tidyverse)
## ── Attaching packages ── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(socviz)
library(gmodels)
We want to explore different ways of displaying the relationship between geographical region and religious affiliation in the US. The usual non-visual way to do this is called a “cross-tab.” The standard table() function provides a very rudimentary version.
table(gss_sm$religion,gss_sm$bigregion)
##
## Northeast Midwest South West
## Protestant 158 325 650 238
## Catholic 162 172 160 155
## Jewish 27 3 11 10
## None 112 157 170 180
## Other 28 33 50 48
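Base R can add a little to the plain table. The following is a minimal sketch, not part of the original analysis: it re-enters the counts printed above as a matrix (so it runs without socviz) and uses addmargins() and prop.table() to get totals and row proportions. The object names counts and tab are my own.

```r
# Counts copied from the table() output above (NA responses excluded).
counts <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other"),
    bigregion = c("Northeast", "Midwest", "South", "West")
  )
)
tab <- as.table(counts)

# Append row and column totals (labelled "Sum").
with_totals <- addmargins(tab)
with_totals

# Share of each religion's respondents living in each region:
# margin = 1 makes every row sum to 1.
row_props <- prop.table(tab, margin = 1)
round(row_props, 3)
```

Setting margin = 2 instead gives the share of each religion within a region.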
There is a more elaborate version, CrossTable(), in the gmodels package. Its output begins with a “Cell Contents” legend at the upper left that shows how the values within each cell of the table are laid out.
CrossTable(gss_sm$religion,gss_sm$bigregion)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2849
##
##
## | gss_sm$bigregion
## gss_sm$religion | Northeast | Midwest | South | West | Row Total |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## Protestant | 158 | 325 | 650 | 238 | 1371 |
## | 24.877 | 0.149 | 44.346 | 14.194 | |
## | 0.115 | 0.237 | 0.474 | 0.174 | 0.481 |
## | 0.324 | 0.471 | 0.624 | 0.377 | |
## | 0.055 | 0.114 | 0.228 | 0.084 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## Catholic | 162 | 172 | 160 | 155 | 649 |
## | 23.502 | 1.397 | 25.093 | 0.882 | |
## | 0.250 | 0.265 | 0.247 | 0.239 | 0.228 |
## | 0.333 | 0.249 | 0.154 | 0.246 | |
## | 0.057 | 0.060 | 0.056 | 0.054 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## Jewish | 27 | 3 | 11 | 10 | 51 |
## | 38.340 | 7.080 | 3.128 | 0.149 | |
## | 0.529 | 0.059 | 0.216 | 0.196 | 0.018 |
## | 0.055 | 0.004 | 0.011 | 0.016 | |
## | 0.009 | 0.001 | 0.004 | 0.004 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## None | 112 | 157 | 170 | 180 | 619 |
## | 0.362 | 0.335 | 13.953 | 13.426 | |
## | 0.181 | 0.254 | 0.275 | 0.291 | 0.217 |
## | 0.230 | 0.228 | 0.163 | 0.285 | |
## | 0.039 | 0.055 | 0.060 | 0.063 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## Other | 28 | 33 | 50 | 48 | 159 |
## | 0.025 | 0.788 | 1.129 | 4.641 | |
## | 0.176 | 0.208 | 0.314 | 0.302 | 0.056 |
## | 0.057 | 0.048 | 0.048 | 0.076 | |
## | 0.010 | 0.012 | 0.018 | 0.017 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 487 | 690 | 1041 | 631 | 2849 |
## | 0.171 | 0.242 | 0.365 | 0.221 | |
## ----------------|-----------|-----------|-----------|-----------|-----------|
##
##
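The “Chi-square contribution” in each cell is (observed - expected)^2 / expected, where the expected count under independence is row total × column total / table total. The sketch below recomputes it in base R from the counts shown above. The names counts, expected and contrib are illustrative, not part of gmodels.

```r
# Counts copied from the CrossTable() output above.
counts <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other"),
    bigregion = c("Northeast", "Midwest", "South", "West")
  )
)

# Expected count in each cell if religion and region were independent.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Per-cell contribution to the Pearson chi-square statistic.
contrib <- (counts - expected)^2 / expected
round(contrib, 3)

# The contributions add up to the overall chi-square statistic.
chi_sq <- sum(contrib)
chi_sq
```

The Protestant/Northeast cell, for example, has an expected count of 1371 × 487 / 2849 ≈ 234.4, which gives the 24.877 shown in the table.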
Before getting into Healy’s method, I’ll demonstrate two simple methods based on the idea of a scatterplot of two categorical variables.
The first of these uses the raw data and simply replaces geom_point() with geom_jitter(). Manual adjustment of the alpha and size parameters may improve on the result.
gss_sm %>% select(religion,bigregion) %>% ggplot(aes(x=bigregion,y=religion)) +
geom_jitter(alpha=.5,size=.5)
My second method is to use dplyr to count the observations in the cells of the table and produce a scatterplot of the categorical variables. The counts are mapped to the size of the dots in the scatterplot.
gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(aes(x=bigregion,y=religion,size=count)) + geom_point()
ggsave("rel_reg.pdf",width = 8, height = 10)
A variant on this idea is to map count to color instead of size. This requires larger dots to have any hope of success.
gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(aes(x=bigregion,y=religion,color=count)) + geom_point(size=9)
#ggsave("rel_reg.pdf",width = 8, height = 10)
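The group_by(), summarize(), ungroup() sequence used in both pipelines above can also be written with dplyr's count(). A small sketch on made-up data follows. The data frame toy is hypothetical and only stands in for gss_sm, and note that count() names its result column n rather than count.

```r
library(dplyr)

# Hypothetical toy data standing in for gss_sm (not the real survey).
toy <- data.frame(
  religion  = c("Catholic", "Catholic", "None", "None", "None", "Protestant"),
  bigregion = c("West", "West", "South", "West", "West", "South"),
  stringsAsFactors = TRUE
)

# Long form, following the pattern used in the text.
long_form <- toy %>%
  group_by(religion, bigregion) %>%
  summarize(n = n()) %>%
  ungroup()

# Shorthand: count() groups, tallies, and returns an ungrouped result.
short_form <- toy %>%
  count(religion, bigregion)

long_form
short_form
```

Either result can be piped into ggplot() with size = n or color = n in place of count.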
Do these methods work?
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
head(rel_by_region,10)
## # A tibble: 10 x 5
## # Groups: bigregion [2]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast <NA> 1 0.00205 0
## 7 Midwest Protestant 325 0.468 47
## 8 Midwest Catholic 172 0.247 25
## 9 Midwest Jewish 3 0.00432 0
## 10 Midwest None 157 0.226 23
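The reason freq is a within-region share is that summarize() removes only the last grouping variable, so the result is still grouped by bigregion when mutate() runs (see the “Groups: bigregion” line in the output). A minimal sketch on hypothetical toy data, not the GSS, to check this:

```r
library(dplyr)

# Hypothetical toy data standing in for gss_sm.
toy <- data.frame(
  bigregion = c("South", "South", "South", "West", "West"),
  religion  = c("Catholic", "None", "None", "Catholic", "Catholic")
)

by_region <- toy %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%       # drops religion; still grouped by bigregion
  mutate(freq = N / sum(N))    # so sum(N) is the total for that region

# Which grouping is left after summarize()?
remaining_groups <- group_vars(by_region)
remaining_groups

# The shares should add to 1 within every region.
totals <- by_region %>%
  summarize(total = sum(freq))
totals
```

Reversing the order of the variables in group_by() changes which variable is left as the group, and therefore what the percentages mean.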
Note that there is an alternative choice: the distribution of region within each religion, obtained by grouping on religion first. Do this as an exercise before you look at the next slide.
region_by_rel <- gss_sm %>%
group_by(religion,bigregion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
head(region_by_rel,10)
## # A tibble: 10 x 5
## # Groups: religion [3]
## religion bigregion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Protestant Northeast 158 0.115 12
## 2 Protestant Midwest 325 0.237 24
## 3 Protestant South 650 0.474 47
## 4 Protestant West 238 0.174 17
## 5 Catholic Northeast 162 0.250 25
## 6 Catholic Midwest 172 0.265 27
## 7 Catholic South 160 0.247 25
## 8 Catholic West 155 0.239 24
## 9 Jewish Northeast 27 0.529 53
## 10 Jewish Midwest 3 0.0588 6
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = "Region",y = "Percent", fill = "Religion") +
theme(legend.position = "top")
What is the difference if you set position to dodge instead of dodge2?
What does the graph look like if you use geom_point() instead of geom_col()? Think for a minute before you do this.
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_point(aes(color = religion),size=4) +
labs(x = "Region",y = "Percent", color = "Religion") +
theme(legend.position = "top")
Note that since the default plot character is a solid dot, you must use color instead of fill to get the color to show. There is some overplotting. Would jitter help? Would a smaller or larger dot be preferred? Experiment!
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct))
p + geom_jitter(aes(color = religion),size=4) +
labs(x = "Region",y = "Percent", color = "Religion") +
theme(legend.position = "top")
Repeat the graphics above reversing the roles of region and religion. Produce the dodge2 geom_col() and the corresponding geom_point.
p <- ggplot(region_by_rel, aes(x = religion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
labs(x = "Religion",y = "Percent", fill = "Region") +
theme(legend.position = "top")
p <- ggplot(region_by_rel, aes(x = religion, y = pct))
p + geom_jitter(aes(color = bigregion),size=4) +
labs(x = "Religion",y = "Percent", color = "Region") +
theme(legend.position = "top")
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = "none") +
coord_flip() +
facet_grid(~ bigregion)
Is there any difference if we use facet_wrap() with nrow=1 instead of facet_grid()?
Repeat this with the roles of religion and region reversed.
p <- ggplot(region_by_rel, aes(x = bigregion, y = pct, fill = bigregion))
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "bigregion") +
guides(fill = "none") +
coord_flip() +
facet_grid(~ religion)
We started by examining visualizations which emphasized symmetry between religion and region. Could this be done with facet_grid(), putting a simple block to represent the count in each cell of the grid? In effect, this would use the grid to make the scatterplot. Try it.
rr = gss_sm %>%
select(religion,bigregion) %>%
group_by(religion, bigregion) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(relreg = " ")
ggplot(data = rr,aes(x=relreg,y=count)) + geom_col() +
facet_grid(religion~bigregion) +
labs(x="")
For comparison, here is the simple table.
t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
t
##
## Northeast Midwest South West
## Protestant 158 325 650 238
## Catholic 162 172 160 155
## Jewish 27 3 11 10
## None 112 157 170 180
## Other 28 33 50 48
## <NA> 1 5 11 1
Another method of visualizing relationships among categorical variables is the mosaic plot. Base R has a command to produce these.
t = table(gss_sm$religion,gss_sm$bigregion, useNA="ifany")
mosaicplot(t)
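mosaicplot() takes a few optional arguments that can make the result easier to read. The sketch below is one possible variation, not the author's code. It rebuilds the table from the counts printed earlier (without the NA row) under the name tab, which is my own choice and avoids masking base R's t() function.

```r
# Counts copied from the table shown earlier (NA row left out).
counts <- matrix(
  c(158, 325, 650, 238,
    162, 172, 160, 155,
     27,   3,  11,  10,
    112, 157, 170, 180,
     28,  33,  50,  48),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    religion  = c("Protestant", "Catholic", "Jewish", "None", "Other"),
    bigregion = c("Northeast", "Midwest", "South", "West")
  )
)
tab <- as.table(counts)

# shade = TRUE colors each tile by its residual from an independence
# model; las = 1 keeps the axis labels horizontal.
mosaicplot(tab,
           shade = TRUE,
           las   = 1,
           main  = "Religion by region",
           xlab  = "Religion",
           ylab  = "Region")
```

With shading on, the tiles that depart most from independence stand out, which should correspond to the cells with large chi-square contributions in the CrossTable() output.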