We have already covered a lot! Let’s take a step back and review everything that we have learned so far. This is based very closely on a practice exam I gave to my Harvard students in Fall 2020.

We will use data from the College Scorecard, a public dataset from the US Department of Education that contains information about colleges and universities in the United States. The dataset was downloaded from:

https://collegescorecard.ed.gov/data/

The relevant columns in the data are described below:

Name                     Description
name                     Name
state                    State
region                   US region
lat                      Latitude
lon                      Longitude
campuses                 Number of campuses
id                       Institution ID - shared between campuses
id_long                  Longer ID - unique to every row
class                    College-type indicator
locale                   City, town, rural, etc.
ug_enrollment            Number of enrolled undergraduates 2020
main_campus              1 if main campus, 0 otherwise
hbcu                     Historically Black College / University
women_only               Women-only college / university
religious_affiliation    1 if religiously affiliated, NA otherwise
admission_rate           Overall admission rate
social_sciences          Percentage of all degrees in social sciences
physical_sciences        Percentage of all degrees in physical sciences
ethnic_gender_sciences   Percentage of all degrees in ethnic, gender, group, or cultural studies
comp_sci                 Percentage of all degrees in computer science
avg_faculty_salary       Average monthly faculty salary
completion_rate          Average 4-year completion rate
pell_grant               Percentage of undergraduates on a Pell Grant
first_gen_completion_4   Percentage of first-gen students who complete degree in 4 years

Question 1: group_by() and summarise()

Exercise

First, let’s read in the data. It is a .csv file, so we can use read_csv(). Don’t forget to call library(tidyverse) to access these functions.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
colleges <- read_csv("data/college_scorecard.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   name = col_character(),
##   state = col_character(),
##   region = col_character(),
##   locale = col_character(),
##   ug_enrollment = col_character(),
##   completion_rate = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Using group_by() and summarise(), calculate the proportion of religious colleges in each region of the United States. One way to do this is by creating these two new columns:

  1. religious_num = the number of colleges and universities with a religious affiliation in each region of the US. religious_affiliation is 1 for religiously affiliated colleges and NA otherwise, so adding up the values gives you the count. If you get an NA result, try adding na.rm = TRUE as an argument to your function. Run ?sum to read the documentation and see what it does.
  2. religious_prop = the proportion of all colleges and universities in that region that have a religious affiliation. A proportion is a number divided by the total. To get the total number of observations in a group defined by group_by(), you can use the n() function.

Save your result into an object called region_religion.

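region_religion <- colleges %>% 
  group_by(region) %>%
  # religious_affiliation is NA for unaffiliated colleges, so drop NAs when summing
  summarise(religious_num = sum(religious_affiliation, na.rm = TRUE),
            # n() is the number of rows (colleges) in the current region
            religious_prop = religious_num / n(), 
            .groups = "drop")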
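
If it is not obvious why na.rm = TRUE and n() work together here, it can help to try the same pipeline on a tiny dataset where you can check the answer by hand. The sketch below uses made-up data (the toy tibble and its "East" and "West" regions are invented for illustration and are not part of the College Scorecard).

```r
library(tidyverse)

# Made-up data for illustration only: 3 colleges in "East", 2 in "West".
# As in the real data, 1 means religiously affiliated and NA means not.
toy <- tibble(
  region = c("East", "East", "East", "West", "West"),
  religious_affiliation = c(1, NA, 1, NA, NA)
)

toy_summary <- toy %>%
  group_by(region) %>%
  summarise(religious_num = sum(religious_affiliation, na.rm = TRUE),
            religious_prop = religious_num / n(),
            .groups = "drop")

toy_summary
```

East should come out with a religious_num of 2 and a religious_prop of 2/3, and West with 0 for both. Note that n() counts every row in the group, including the rows where religious_affiliation is NA, which is what we want for the denominator. If you remove na.rm = TRUE, both regions return NA instead.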

Question 2: Practice Plotting

Now, let’s use the region_religion object to make a plot.

First, remove the rows where region is “US Service Schools”. Then, create a bar plot (with geom_col()) that has region on the x-axis and religious_prop on the y-axis. Add an appropriate title, axis labels, and a theme. Finally, try flipping the axes with coord_flip().

Once you’re done, visit this website: https://coolors.co/ to find a nice color for your bars. Click on “Start the Generator” and press the space bar until you find a color you like. The six-character sequence of letters and digits under each color is called a HEX code, and it’s one way to represent colors. Luckily, R recognizes HEX codes. In geom_col() (or any geom), you can use fill = or color = followed by the HEX code, in quotes, with a # in front of it. For example, geom_col(fill = "#69A297").
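
If you are curious what a HEX code actually encodes, base R can translate it for you. This is just an optional sketch and is not needed for the plot: col2rgb() converts a color into its red, green, and blue components, each on a 0 to 255 scale. The object name bar_color_rgb is only a placeholder.

```r
# col2rgb() comes with base R (the grDevices package), so no library() call is needed
bar_color_rgb <- col2rgb("#69A297")

# A 3 x 1 matrix with rows named red, green, and blue
bar_color_rgb
```

Each pair of characters in the HEX code is one component written in base 16, so "69", "A2", and "97" become red = 105, green = 162, and blue = 151.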

region_religion %>%
  # Drop the service schools, as the question asks
  filter(region != "US Service Schools") %>% 
  ggplot(aes(x = region, y = religious_prop)) + 
    geom_col(fill = "#69A297") + 
    coord_flip() + 
    theme_bw() + 
    labs(title = "Proportion of Colleges with a Religious Affiliation",
         x = "Region",
         y = "Religious Proportion")

Reordering

This looks great, but it’s not immediately clear how the bar values relate to each other. For example, the Rocky Mountains and New England have very similar values. It might be nice to reorder these bars so they appear in order.

This is what the reorder() function is for. It takes two arguments: the column you want to reorder, and the values you want to order it by. For example, here we want to order the region variable by religious_prop, so that the size of the bars increases as you move up the plot.

region_religion %>%
  filter(region != "US Service Schools") %>% 
  ggplot(aes(x = reorder(region, religious_prop), y = religious_prop)) + 
    geom_col(fill = "#69A297") + 
    coord_flip() + 
    theme_bw() + 
    labs(title = "Proportion of Colleges with a Religious Affiliation",
         x = "Region",
         y = "Religious Proportion")
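
To see what reorder() is doing behind the scenes, you can try it outside of a plot. The sketch below uses made-up values (region_toy and prop_toy are invented for illustration). reorder() returns a factor whose levels are sorted by the second argument, and ggplot2 draws discrete axes in level order.

```r
# Made-up values for illustration only
region_toy <- c("A", "B", "C")
prop_toy   <- c(0.30, 0.10, 0.20)

# By default, factor levels are alphabetical: "A" "B" "C"
levels(factor(region_toy))

# reorder() sorts the levels from the smallest prop_toy to the largest
reordered <- reorder(region_toy, prop_toy)
levels(reordered)  # "B" "C" "A"

# Putting a minus sign in front of the values reverses the order
reordered_desc <- reorder(region_toy, -prop_toy)
levels(reordered_desc)  # "A" "C" "B"
```

In the plot above, the same idea puts the region with the smallest religious_prop first on the axis, which ends up at the bottom after coord_flip().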