For this project, I will be using data published by the U.S. Department of Education on colleges in the United States. The dataset is called the College Scorecard. It includes hundreds of variables describing each accredited college within the United States. For the purposes of this project, however, I will focus solely on demographic data. Below is a breakdown of the variables that I will import from the Scorecard, along with their descriptions.
The data available through the Scorecard is massive: it covers each individual branch of each college in the United States. For speed and ease of use, I will limit the data in this project to main campuses only.
Data is available for 1996-2013. Again for speed and relevance, I will limit the data to the five most recent years available (2009-2013). Full documentation of the data can be found here.
To import the data, we will use a package called “rscorecard”. Accessing the data directly through the API is messy and confusing, so this package was made specifically for accessing the College Scorecard data.
First, we will load the packages.
## Load Required Packages ##
library(rscorecard)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(ggmap)
library(DT)
library(magrittr)
Finally, we will import the data. It’s important to note that rscorecard is currently only capable of pulling one year’s worth of data at a time, so we will pull each year individually and then combine them at the end.
## Initiate key for API usage ##
sc_key("JV4KLNjlODA8KU5ZDCNWDOCq6kLMuUcklZZO045s")
## Demographic data on ethnicity for each state and year (main campuses only). ##
## Note: College scorecard package is only capable of pulling 1 year at a time. ##
demographic_2009 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2009) %>%
  sc_get()
demographic_2010 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2010) %>%
  sc_get()
demographic_2011 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2011) %>%
  sc_get()
demographic_2012 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2012) %>%
  sc_get()
demographic_2013 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2013) %>%
  sc_get()
Fortunately, this data is already tidy. The only cleaning needed is to combine the individual years into a single dataset.
## Combine all years together ##
combined_data <- as_tibble(rbind(demographic_2009,
                                 demographic_2010,
                                 demographic_2011,
                                 demographic_2012,
                                 demographic_2013))
datatable(combined_data)
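The five year-by-year pulls and the rbind() step above differ only in the year, so they could be condensed into a loop. The sketch below is one way to do that, not the code used for this report. `fetch_demographics()` and `pull_years()` are hypothetical helper names, and the explicit `year` column is an assumption, added in case the returned table does not already carry one. It also assumes the API key has already been set with sc_key().

```r
# Hypothetical helper: pull one year of main-campus demographic data.
# It wraps the same rscorecard calls used above and only runs when called.
fetch_demographics <- function(year) {
  req <- rscorecard::sc_init()
  req <- rscorecard::sc_filter(req, main == 1)
  req <- rscorecard::sc_select(req, instnm, city, stabbr, zip, longitude, latitude,
                               ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian,
                               ugds_aian, ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn)
  req <- rscorecard::sc_year(req, year)
  rscorecard::sc_get(req)
}

# Hypothetical helper: fetch each year, tag it, and stack the results.
# `fetch` can be swapped for a stub when testing without API access.
pull_years <- function(years, fetch = fetch_demographics) {
  yearly <- lapply(years, function(y) {
    df <- fetch(y)
    df$year <- y  # assumption: label rows explicitly with the year requested
    df
  })
  do.call(rbind, yearly)
}
```

With a key set, `combined_data <- as_tibble(pull_years(2009:2013))` would stand in for the five blocks and the rbind() call above. Passing the fetch function as an argument keeps the combining logic testable on its own.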
With this data, I plan to do the following analyses.