Proposal Breakdown

Data Description

For this project, I will be using data published by the U.S. Department of Education on colleges in the United States. The dataset is called the College Scorecard. It includes hundreds of variables describing each accredited college in the United States. For the purposes of this project, however, I will focus solely on demographic data. Below is a breakdown of the variables that I will import from the Scorecard, with their descriptions.

  • instnm: Listed legal name of the college.
  • city: Municipality in which the university is located.
  • stabbr: Abbreviation of the state in which the university is located.
  • zip: Zip code in which the university is located.
  • longitude: Longitudinal coordinate at which the university is located.
  • latitude: Latitudinal coordinate at which the university is located.
  • ugds: Total number of undergraduate students enrolled at the university.
  • ugds_white: Share of undergraduate students who are White (a proportion from 0 to 1, as are the shares below).
  • ugds_black: Share of undergraduate students who are Black.
  • ugds_hisp: Share of undergraduate students who are Hispanic.
  • ugds_asian: Share of undergraduate students who are Asian.
  • ugds_aian: Share of undergraduate students who are American Indian or Alaska Native.
  • ugds_nhpi: Share of undergraduate students who are Native Hawaiian or Pacific Islander.
  • ugds_2mor: Share of undergraduate students who are two or more races.
  • ugds_nra: Share of undergraduate students who are non-resident aliens.
  • ugds_unkn: Share of undergraduate students whose race is unknown.

The data available through the Scorecard is massive and covers each individual branch of each college in the United States. For speed and ease of use, I will limit the data in this project to main campuses only.

In terms of time, data is available from 1996 to 2013. Again, for speed and relevance, I will limit the data to the five most recent years (2009-2013). Full documentation of the data can be found here.

Importing the data

To import the data, we will use a package called “rscorecard”. The API for accessing the data directly is messy and confusing, so this package was made specifically for accessing the College Scorecard data.

First, we will load the packages.

## Load Required Packages ##
library(rscorecard)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(ggmap)
library(DT)
library(magrittr)

Next, we will import the data. Note that rscorecard can currently pull only one year’s worth of data at a time, so we will pull each year individually and then combine them at the end.

## Initiate key for API usage ##
sc_key("JV4KLNjlODA8KU5ZDCNWDOCq6kLMuUcklZZO045s")

## Demographic data on ethnicity for each state and year (main campuses only). ##
## Note: The rscorecard package is only capable of pulling 1 year at a time. ##
demographic_2009 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2009) %>%
  sc_get()

demographic_2010 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2010) %>%
  sc_get()

demographic_2011 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2011) %>%
  sc_get()

demographic_2012 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2012) %>%
  sc_get()

demographic_2013 <- sc_init() %>%
  sc_filter(main == 1) %>%
  sc_select(instnm, city, stabbr, zip, longitude, latitude,
            ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian, ugds_aian,
            ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) %>%
  sc_year(2013) %>%
  sc_get()
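
The five pulls above differ only in the year, so the same request could be written once and applied to each year. The following is a minimal sketch of that idea, not the code used in this proposal. The helper names fetch_year and pull_years are hypothetical (they are not part of rscorecard), and the sketch assumes R 4.1 or later for the native |> pipe and that sc_key() has already been called.

```r
## Sketch: pull several years with one helper instead of five copied blocks. ##
## fetch_year() and pull_years() are hypothetical names, not rscorecard functions. ##

## Pull one year of main-campus demographic data (needs an API key at call time). ##
fetch_year <- function(yr) {
  rscorecard::sc_init() |>
    rscorecard::sc_filter(main == 1) |>
    rscorecard::sc_select(instnm, city, stabbr, zip, longitude, latitude,
                          ugds, ugds_white, ugds_black, ugds_hisp, ugds_asian,
                          ugds_aian, ugds_nhpi, ugds_2mor, ugds_nra, ugds_unkn) |>
    rscorecard::sc_year(yr) |>
    rscorecard::sc_get()
}

## Call `fetch` once per year and stack the results row-wise. ##
pull_years <- function(years, fetch = fetch_year) {
  do.call(rbind, lapply(years, fetch))
}

## Usage (makes one API request per year): ##
## combined_data <- pull_years(2009:2013)
```

Passing the fetch function as an argument keeps the stacking step separate from the API call, so it can be tried out with a stand-in function that returns a small data frame.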

Cleaning data

Fortunately, this data is already tidy. The only real cleaning necessary is combining the individual years into a single dataset.

## Combine all years together ##
combined_data <- as_tibble(rbind(demographic_2009,
                                demographic_2010,
                                demographic_2011,
                                demographic_2012,
                                demographic_2013))
datatable(combined_data)
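
Before moving on, it may be worth checking that the combined data looks as expected. The sketch below is one possible check, offered under assumptions rather than as part of the original workflow: check_combined is a hypothetical helper, the 0.01 tolerance is an arbitrary choice, and the code assumes the combined data carries a year column alongside the share columns. It is run here on a small made-up data frame, not on real Scorecard values.

```r
## Sketch: basic sanity checks on a combined data frame. ##
## check_combined() is a hypothetical helper; `tol` is an assumed tolerance. ##
share_cols <- c("ugds_white", "ugds_black", "ugds_hisp", "ugds_asian",
                "ugds_aian", "ugds_nhpi", "ugds_2mor", "ugds_nra", "ugds_unkn")

check_combined <- function(df, tol = 0.01) {
  ## Row total of the race/ethnicity shares; NA if any share is missing. ##
  share_sum <- rowSums(df[, share_cols])
  list(
    rows_per_year = table(df$year),                       # rows pulled per year
    n_missing     = sum(is.na(share_sum)),                # rows with a missing share
    n_bad_sum     = sum(abs(share_sum - 1) > tol, na.rm = TRUE)  # totals far from 1
  )
}

## Made-up example data (not real Scorecard values). ##
toy <- data.frame(year = c(2009, 2009, 2010))
toy[share_cols] <- 0
toy$ugds_white <- c(0.5, 0.6, NA)
toy$ugds_black <- c(0.5, 0.2, 0.3)

toy_checks <- check_combined(toy)
print(toy_checks)
```

On the real data, the same call would be check_combined(combined_data).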

Planned Analysis

With this data, I plan to do the following analyses.

  • Aggregate the information by state and year to see if there are any geographical differences in the diversity of their college populations.
  • Determine which areas and universities are the most diverse in the country and which have changed the most over the five-year period.
  • Analyze the historical diversity of the University of Cincinnati.
  • Analyze the change in the different racial categories at the University of Cincinnati.
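
To compare diversity across states and years, the shares need to be reduced to a single number per institution. One common option is the Simpson (Blau) index, 1 minus the sum of the squared shares. Using it here is an assumption of this sketch, not a method the proposal has committed to. The helper names diversity_index and state_year_diversity are hypothetical, institutions are weighted by ugds, and the example data is made up.

```r
## Sketch: a diversity index per institution, averaged by state and year. ##
## The Simpson (Blau) index is an assumed choice; helper names are hypothetical. ##
share_cols <- c("ugds_white", "ugds_black", "ugds_hisp", "ugds_asian",
                "ugds_aian", "ugds_nhpi", "ugds_2mor", "ugds_nra", "ugds_unkn")

## 1 - sum(p_i^2): 0 when one group holds every student, larger when shares ##
## are spread evenly. Rows with any missing share return NA. ##
diversity_index <- function(df) {
  shares <- as.matrix(df[, share_cols])
  1 - rowSums(shares^2)
}

## Enrollment-weighted mean of the index for each state and year. ##
state_year_diversity <- function(df) {
  df$diversity <- diversity_index(df)
  df <- df[!is.na(df$ugds) & df$ugds > 0, ]     # drop rows without enrollment
  groups <- split(df, list(df$stabbr, df$year), drop = TRUE)
  out <- do.call(rbind, lapply(groups, function(g) {
    data.frame(stabbr = g$stabbr[1],
               year = g$year[1],
               diversity = weighted.mean(g$diversity, g$ugds, na.rm = TRUE))
  }))
  rownames(out) <- NULL
  out
}

## Made-up example: two Ohio schools in 2009 (not real Scorecard values). ##
toy <- data.frame(stabbr = c("OH", "OH"), year = c(2009, 2009),
                  ugds = c(100, 300))
toy[share_cols] <- 0
toy$ugds_white <- c(1.0, 0.5)
toy$ugds_black <- c(0.0, 0.5)

toy_result <- state_year_diversity(toy)
print(toy_result)
```

Weighting by ugds means a large university counts for more than a small college in its state's average. An unweighted mean would instead treat every institution equally, which answers a slightly different question.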