We have already covered a lot! Let’s take a step back and review everything that we have learned so far. This is based very closely on a practice exam I gave to my Harvard students in Fall 2020.
We will use data from the College Scorecard, a public dataset from the US Department of Education that contains information about colleges and universities in the United States. Here is the link to where the dataset was acquired:
The relevant columns in the data are described below:
|Column|Description|
|---|---|
||Number of campuses|
||Institution ID - shared between campuses|
||Longer ID - unique to every row|
||City, town, rural, etc.|
||Number of enrolled undergraduates 2020|
||1 if main campus, 0 otherwise|
||Historically Black College / University|
||Women-only college / university|
||1 if religiously affiliated, NA otherwise|
||Overall admission rate|
||Percentage of all degrees in social sciences|
||Percentage of all degrees in physical sciences|
||Percentage of all degrees in ethnic, gender, group, or cultural studies|
||Percentage of all degrees in computer science|
||Average monthly faculty salary|
||Average 4-year completion rate|
||Percentage of undergraduates on a Pell Grant|
||Percentage of first-gen students who complete degree in 4 years|
First, let’s read in the data. It is a .csv file, so we can use `read_csv()`. Don’t forget to call `library(tidyverse)` to access these functions.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   name = col_character(),
##   state = col_character(),
##   region = col_character(),
##   locale = col_character(),
##   ug_enrollment = col_character(),
##   completion_rate = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
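The read-in step can be sketched as follows. This is a minimal illustration rather than the original code: the object name `college` and the three-row file are assumptions made up for the example, and a temporary file stands in for the real College Scorecard download so the snippet runs anywhere.

```r
library(tidyverse)

# Hypothetical stand-in for the downloaded dataset: write a tiny CSV
# to a temporary file. Replace `path` with the location of the real file.
path <- tempfile(fileext = ".csv")
writeLines(
  c("name,region,religious_affiliation",
    "College A,New England,1",
    "College B,Plains,NA"),
  path
)

# read_csv() guesses a type for each column and reads "NA" as missing
college <- read_csv(path)

# glimpse() prints one line per column with its type and first values
glimpse(college)
```

With the real file, the column specification printed by `read_csv()` is a quick way to check which columns were read as numbers and which as text.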
Using `group_by()` and `summarise()`, calculate the proportion of religious colleges in each region of the United States. One way to do this is by creating these two new columns:
`religious_num` = the number of colleges and universities with a religious affiliation in each region of the US. `religious_affiliation` is a 1 or 0 value, so adding all of them up gives you the number. If you get an NA result, try adding `na.rm = TRUE` as an argument to your function. To see what it does, run `?sum` to read the documentation.
`religious_prop` = the proportion of all colleges and universities in that region that have a religious affiliation. A proportion is a number divided by the total. To get the total number of observations in a group defined by `group_by()`, you can use the `n()` function.
Save your result into an object called `region_religion`.
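One possible way to put these steps together is sketched below. The `college` data frame and its values are made up so the example runs on its own; only the column names `region` and `religious_affiliation` come from the exercise. The sketch relies on `n()`, the dplyr helper that returns the number of rows in the current group.

```r
library(tidyverse)

# Toy stand-in for the College Scorecard data (values are invented)
college <- tibble(
  region = c("New England", "New England", "Plains", "Plains", "Plains"),
  religious_affiliation = c(1, NA, 1, 1, NA)
)

region_religion <- college %>%
  group_by(region) %>%
  summarise(
    # Summing the 1s counts religious colleges; na.rm = TRUE skips the NAs
    religious_num = sum(religious_affiliation, na.rm = TRUE),
    # n() is the number of rows in the current region
    religious_prop = religious_num / n()
  )

region_religion
```

In this toy data, New England has 1 religious college out of 2 (a proportion of 0.5) and Plains has 2 out of 3. Without `na.rm = TRUE`, both sums would come back as NA.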
Now, let’s use the `region_religion` object to make a plot.
First, filter out the “US Service Schools” rows using the `region` column. Then, create a barplot (with `geom_col()`) that has `region` on the x-axis and `religious_prop` on the y-axis. Add an appropriate title, axis labels, and a theme. Finally, try flipping the axes with `coord_flip()`.
Once you’re done, visit this website: https://coolors.co/ to find a nice color for your bars. Click on “Start the Generator” and press the space bar until you find a color you like. The sequence of letters and numbers under each color is called a HEX code, and it’s one way to represent colors. Luckily, R recognizes this. In `geom_col()` (or any geom), you can use `fill =` or `color =` followed by the HEX code in quotes with a `#` in front of it. For example, `geom_col(fill = "#69A297")`.
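The plotting steps so far can be sketched like this. The `region_religion` values below are invented placeholders, not the real proportions, so that the example is self-contained; the plot title and axis labels are one reasonable choice among many.

```r
library(tidyverse)

# Made-up summary values standing in for the real `region_religion` object
region_religion <- tibble(
  region = c("New England", "Plains", "Rocky Mountains", "US Service Schools"),
  religious_prop = c(0.20, 0.35, 0.21, 0.00)
)

p <- region_religion %>%
  # Keep every row except the service schools
  filter(region != "US Service Schools") %>%
  ggplot(aes(x = region, y = religious_prop)) +
  # fill colors the inside of the bars; color would only change the outline
  geom_col(fill = "#69A297") +
  # Swap the axes so the region names are easy to read
  coord_flip() +
  theme_bw() +
  labs(title = "Proportion of Colleges with a Religious Affiliation",
       x = "Region",
       y = "Religious Proportion")
```

Type `p` in the console to draw the plot. At this stage the bars appear in alphabetical order by region.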
This looks great, but it’s not immediately clear how the bar values relate to each other. For example, the Rocky Mountains and New England have very similar values. It might be nice to reorder these bars so they appear in order.
This is what the `reorder()` function is for. It takes two arguments: the column you want to order, and the value you want to order it by. For example, here we want to order the `region` variable by `religious_prop`, so the size of the bars increases as you move up the plot.
region_religion %>%
  filter(region != "US Service Schools") %>%
  ggplot(aes(x = reorder(region, religious_prop), y = religious_prop)) +
  geom_col(fill = "#69A297") +
  coord_flip() +
  theme_bw() +
  labs(title = "Proportion of Colleges with a Religious Affiliation",
       x = "Region",
       y = "Religious Proportion")
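To see what `reorder()` does on its own, outside of a plot, here is a small sketch with invented values. It turns the first argument into a factor whose levels are sorted by the second argument, and ggplot2 draws bars in level order.

```r
# Invented values for illustration only
region <- c("Plains", "New England", "Rocky Mountains")
religious_prop <- c(0.35, 0.20, 0.21)

# Without reorder(), factor levels are alphabetical
levels(factor(region))
# "New England" "Plains" "Rocky Mountains"

# With reorder(), levels follow religious_prop from smallest to largest
ordered_region <- reorder(region, religious_prop)
levels(ordered_region)
# "New England" "Rocky Mountains" "Plains"
```

After `coord_flip()`, the first level sits at the bottom of the plot, which is why the bars grow as you move up. To reverse the order, one option is to reorder by the negated value, as in `reorder(region, -religious_prop)`.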