First, I load the tidyverse package.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.0     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We are going to look at some Pew survey data, which I have already uploaded. Now I just have to read it into this R Notebook.

pew <- read_csv("January 3-10, 2018 - Core Trends Survey/January 3-10, 2018 - Core Trends Survey - CSV.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   usr = col_character(),
##   `pial11ao@` = col_character()
## )
## See spec(...) for full column specifications.

The Pew survey data set contains, among many other things, a list of the top social media sites. Participants were asked whether or not they use each of these sites; the possible responses were “Yes”, “No”, “Don’t know”, and a refusal to answer.

In this case, I am most interested in learning about how many survey respondents use Twitter. The relevant variable from the dataset is called web1a.

I am particularly interested in seeing the number of Twitter users sorted by educational attainment. Do those with a college degree seem to be more likely to use Twitter than those with only some college, for example?

  1. I have chosen Twitter, but I want to view the participants’ answers as readable labels rather than the coded numbers the dataset provides (that is, “Yes” and “No” instead of 1 and 2). To achieve this, I will convert the web1a (Twitter) variable to a factor and recode it. Because 8 and 9 code for “Don’t know” and “Refused”, respectively, I am recoding both of those to NULL, which drops those levels so that they appear together as NA when I view the data.
pew <- pew %>% 
  mutate(web1a = as.factor(web1a))
pew <- pew %>% 
  mutate(web1a = fct_recode(web1a, "Yes" = "1", "No" = "2", NULL = "8", NULL = "9"))
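
To check that recoding to NULL behaves the way I described, here is a minimal sketch on a small made-up factor. The `codes` vector is a hypothetical stand-in for web1a, not the actual survey data.

```r
library(forcats)

# Toy stand-in for the coded web1a column (made-up values, not Pew data)
codes <- factor(c("1", "2", "8", "1", "9", "2"))

# Levels recoded to NULL are dropped, so their values become NA
answers <- fct_recode(codes, "Yes" = "1", "No" = "2", NULL = "8", NULL = "9")

levels(answers)                    # only the two named levels remain
table(answers, useNA = "ifany")    # the former 8 and 9 are counted as NA
```

In this toy example, the two values that were coded 8 and 9 end up as NA, which is what I expect to happen in the real column as well.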
  1. I am doing the same for educ2 (the variable containing the educational attainment of each participant), again so that I can view the answers as readable labels instead of coded numbers.
pew <- pew %>% 
  mutate(educ2 = as.factor(educ2))
pew <- pew %>% 
  mutate(educ2 = fct_recode(educ2,
                            "Less than HS" = "1",
                            "Some HS" = "2",
                            "HS graduate" = "3",
                            "Some college" = "4",
                            "Associate degree" = "5",
                            "College degree" = "6",
                            "Some grad school" = "7",
                            "Grad degree" = "8",
                            NULL = "98",
                            NULL = "99"))
  1. Now I want to see how many people gave each answer. I am therefore creating a table to show how many answered Yes, how many answered No, and how many fall under NA (“Don’t know” or “Refused”).
pew %>%
  count(web1a)

I also want to see how many participants fell into each category of educational attainment, so I am creating a table to display that as well.

pew %>%
  count(educ2)

Finally, it would be interesting to see these two variables (web1a and educ2, that is, Twitter usage and educational attainment) listed all in one table. I am now going to create that table to take a look.

pew %>%
  count(web1a, educ2)
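
Raw counts alone do not say whether one group is more likely to use Twitter, because the education groups may differ in size. One possible way to compare them is to turn the counts into within-group shares. The sketch below uses a tiny made-up data frame (`toy`, with invented rows) and a hypothetical column name `share`; it is an illustration of the idea, not a result from the Pew data.

```r
library(dplyr)

# Toy stand-in for the recoded survey columns (invented rows, not Pew results)
toy <- data.frame(
  web1a = factor(c("Yes", "No", "No", "Yes", "Yes", "No"), levels = c("Yes", "No")),
  educ2 = factor(c("Some college", "Some college", "Some college",
                   "College degree", "College degree", "College degree"))
)

twitter_share <- toy %>%
  count(educ2, web1a) %>%          # one row per education/answer combination
  group_by(educ2) %>%
  mutate(share = n / sum(n)) %>%   # share of each answer within its education group
  ungroup()

twitter_share
```

The same pipeline should work on pew after the recoding steps, ideally with drop_na(web1a) first so that the NA answers do not count toward the shares.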
  1. Now I want to create a graph to show visually how these two variables relate. I will set up the graph so that there are two separate sets of bars, one for those who use Twitter and one for those who do not. Within each set, there is one bar per level of educational attainment, showing the number of people at that level who gave that answer.
pew %>% 
  drop_na(web1a) %>% 
  ggplot(aes(x = web1a, fill = educ2)) +
  scale_fill_viridis_d() +
  geom_bar(position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(y = "Number of people", 
       x = "Do you use Twitter?", 
       title = "Twitter Usage by Educational Attainment")

  1. Come to think of it, I am not sure that showing a bar for every distinct level of educational attainment is necessary. To me, it looks cluttered. Why not break the data down into just two levels instead? I will create two categories: one for those with “some college or less,” and one for those with an “associate degree or more.” This leaves only two bars per answer to the Twitter usage question, for a total of four bars. That’s much easier to digest at a glance than the 16 bars above! To accomplish this, I will use fct_collapse() (from the forcats package, loaded with the tidyverse) to create a new variable, educ2_simple, representing the two combined categories. Then I will make a new graph using that new variable.
pew <- pew %>% 
  mutate(educ2_simple = fct_collapse(educ2,
                                     Some_college_or_less = c("Less than HS",
                                                              "Some HS",
                                                              "HS graduate",
                                                              "Some college"),
                                     Associate_degree_or_more = c("Associate degree",
                                                                  "College degree",
                                                                  "Some grad school",
                                                                  "Grad degree")))

pew %>% 
  count(educ2_simple)
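
As a quick sanity check on the collapsing step, here is a minimal sketch on a four-value toy factor. The `educ` vector is made up for illustration and only uses a few of the real level names.

```r
library(forcats)

# Toy stand-in for educ2 (made-up values, not Pew data)
educ <- factor(c("HS graduate", "Some college", "Associate degree", "Grad degree"))

# Each new level absorbs the old levels listed on its right-hand side
educ_simple <- fct_collapse(educ,
  Some_college_or_less = c("HS graduate", "Some college"),
  Associate_degree_or_more = c("Associate degree", "Grad degree")
)

levels(educ_simple)   # two combined levels instead of four
table(educ_simple)    # two toy respondents in each combined level
```
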
pew %>% 
  drop_na(web1a) %>% 
  ggplot(aes(x = web1a, fill = educ2_simple)) +
  scale_fill_viridis_d() +
  geom_bar(position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(y = "Number of people", 
       x = "Do you use Twitter?", 
       title = "Twitter Usage by Educational Attainment")

  1. Now I will publish to RPubs so that anyone can see this awesome data in all its glory!