US College Analysis

Hello

In this notebook i’ll talk a little about an analysis of the data released by the american government about colleges. This data can be found at: https://collegescorecard.ed.gov/data/

For this analysis, i used R. ggplot2 was used for some basic plotting, readr to read the csv dataset.

All the code can be found on Github: https://github.com/felipegalvao/us_college_analysis_in_R

Preparation

Well, first we got to set the working directory and load the required packages. Remember to change the directory to where the datasets are stored on your computer.

# Set the working directory to where the datasets are stored
setwd("C:/Data_Science/US College/CollegeScorecard_Raw_Data")

# Load the required packages
library(readr)
library(ggplot2)
library(R2HTML)

Then we’ll read the CSV. We’ll use the data from 2005, because this is the latest year where the earnings informations are available.

# Read the data. 2005 is the latest year where 
# earnings information is available
college_data <- read_csv("MERGED2005_PP.csv")
## 
|================================================================================| 100%  101 MB
## Warning: 63099 problems parsing 'MERGED2005_PP.csv'. See problems(...) for
## more details.

Cleaning

We will start the cleaning removing the duplicated data based on the opeid6 column. Then we’ll change values that are equal to “PrivacySuppressed” and “NULL” for columns GRAD_DEBT_MDN (the median debt of the students), md_earn_wne_p6 (median earnings 6 years after entry) and TUITIONFEE_IN (tuition fee):

# Remove duplicates on the dataframe based on the opeid6 column, since
# earnings and debt informations is the same
college_data <- college_data[!duplicated(college_data[,"opeid6"]),]

# Change NULL and PrivacySuppresed values on column to R NA, then convert
# everything to numeric
college_data$GRAD_DEBT_MDN[college_data$GRAD_DEBT_MDN == "PrivacySuppressed"] <- NA
college_data$GRAD_DEBT_MDN[college_data$GRAD_DEBT_MDN == "NULL"] <- NA
college_data$GRAD_DEBT_MDN <- as.numeric(college_data$GRAD_DEBT_MDN)

# Do the same cleaning for the earnings in 6 years column
college_data$md_earn_wne_p6[college_data$md_earn_wne_p6 == "PrivacySuppressed"] <- NA
college_data$md_earn_wne_p6[college_data$md_earn_wne_p6 == "NULL"] <- NA
college_data$md_earn_wne_p6 <- as.numeric(college_data$md_earn_wne_p6)

# Cleaning for tuition fees
college_data$TUITIONFEE_IN[college_data$TUITIONFEE_IN == "PrivacySuppressed"] <- NA
college_data$TUITIONFEE_IN[college_data$TUITIONFEE_IN == "NULL"] <- NA
college_data$TUITIONFEE_IN <- as.numeric(college_data$TUITIONFEE_IN)

Features

We are also going to create some new features. Here they are:

Earnings to Debt Ratio

This feature will calculate the ratio between the median earnings 6 months from entry and the average debt for each college. Below is the code:

# Create a column on the DF for Earnings to Debt ratio (earnings / debt)
college_data$earnings_debt_ratio <- college_data$md_earn_wne_p6 / college_data$GRAD_DEBT_MDN

Months to pay the average debt

On this feature we make some assumptions. We assume that 10% of the median earnings are used to pay the debt and calculate, based on that, the number of months necessary to pay the debt in each college. There’s the code:

# Create column that calculates the number of months necessary to pay the
# average debt with the median earnings, considering that one person
# uses 10% of that earnings to pay the debt
college_data$months_to_pay <- college_data$GRAD_DEBT_MDN / (college_data$md_earn_wne_p6 / 10 / 12)

We are also going to set some constant variables, that are the US average per capita income for 2011 (6 years from 2011, according to the dataset variable) and the average student loan debt for 2005 (source: http://www.forbes.com/sites/halahtouryalai/2013/01/29/more-evidence-on-the-student-debt-crisis-average-grads-loan-jumps-to-27000/)

# Set constant variables about the US average per capita income
# and average student loan debt (debt for 2005, income for 2011 (6 years after entry))
us_per_capita_income <- 49781.4
us_avg_stdnt_loan_debt <- 17233

Filtering and Sorting by Earnings to Debt

Now we will do some filtering to show some stuff related to the Earnings to Debt ratio feature that we created. We are going to leave only colleges that have a median earning above the US average per capita income and the average debt lower than the average student loan debt, and then we will sort and get the top 10 based on the earnings to debt ratio, with the code below:

# Filter the colleges where the earnings average is above 
# and the average debt is lesser than the US average
selected_colleges <- college_data[which(college_data$md_earn_wne_p6 > us_per_capita_income & college_data$GRAD_DEBT_MDN < us_avg_stdnt_loan_debt),]

# Sort the data frame by the ratio column and select the top 10
top_10_selected_by_ratio <- selected_colleges[with(selected_colleges, order(-earnings_debt_ratio)),]
top_10_selected_by_ratio <- top_10_selected_by_ratio[1:10,]

# Here we'll trim the college names so that they don't mess up the plot
top_10_selected_by_ratio$Institution <- strtrim(top_10_selected_by_ratio$INSTNM, 40)

# And this is so the graphic won't get organized in alphabetical order, but in 
# the same order of the data frame
top_10_selected_by_ratio$Institution <- factor(top_10_selected_by_ratio$Institution, 
                                               levels=top_10_selected_by_ratio$Institution)

Plotting the top 10 colleges by earnings to debt ratio

Now we will plot the top 10 colleges sorted by earnings to debt ratio

# Plot bar graph with the top 10 colleges by Earnings to Debt Ratio
ggplot(data=top_10_selected_by_ratio, aes(x=Institution, y=earnings_debt_ratio)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And on the table below, the details about the plot that you just saw

print(top_10_selected_by_ratio[,c("Institution","md_earn_wne_p6","GRAD_DEBT_MDN","earnings_debt_ratio", "months_to_pay")])
##                                   Institution md_earn_wne_p6 GRAD_DEBT_MDN
## 2650                      Brooklyn Law School          86900        6000.0
## 1911    Massachusetts Institute of Technology          75300        6125.0
## 246        California Institute of Technology          62100        5500.0
## 2349             St Louis College of Pharmacy         115900       11625.0
## 1887                       Harvard University          60200        6125.0
## 425  Los Angeles County College of Nursing an          83700        8647.5
## 2556                     Princeton University          52300        5500.0
## 174               Jefferson School of Nursing          54600        5859.0
## 2787 Southern Westchester BOCES-Practical Nur          61300        6625.0
## 2732           Helene Fuld College of Nursing          85200        9250.0
##      earnings_debt_ratio months_to_pay
## 2650           14.483333      8.285386
## 1911           12.293878      9.760956
## 246            11.290909     10.628019
## 2349            9.969892     12.036238
## 1887            9.828571     12.209302
## 425             9.679098     12.397849
## 2556            9.509091     12.619503
## 174             9.318996     12.876923
## 2787            9.252830     12.969005
## 2732            9.210811     13.028169

Now, the top 10 by earnings

We will do the same plotting as above, but now we will consider the colleges that will leave you with the highest average earnings. First, we’ll select the top 10 colleges by earnings that meet the requirements:

top_10_selected_by_earnings <- selected_colleges[with(selected_colleges, order(-md_earn_wne_p6)),]
top_10_selected_by_earnings <- top_10_selected_by_earnings[1:10,]
top_10_selected_by_earnings$Institution <- strtrim(top_10_selected_by_earnings$INSTNM, 40)
top_10_selected_by_earnings$Institution <- factor(top_10_selected_by_earnings$Institution, levels=top_10_selected_by_earnings$Institution)

Now we will plot this data

# Plot bar graph with the top 10 colleges by earnings
ggplot(data=top_10_selected_by_earnings, aes(x=Institution, y=md_earn_wne_p6)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And the associated table:

print(top_10_selected_by_earnings[,c("Institution","md_earn_wne_p6","GRAD_DEBT_MDN","earnings_debt_ratio", "months_to_pay")])
##                                   Institution md_earn_wne_p6 GRAD_DEBT_MDN
## 2349             St Louis College of Pharmacy         115900       11625.0
## 2623               Albany College of Pharmacy         114500       17125.0
## 2650                      Brooklyn Law School          86900        6000.0
## 2732           Helene Fuld College of Nursing          85200        9250.0
## 425  Los Angeles County College of Nursing an          83700        8647.5
## 2501      Charles E Gregory School of Nursing          78700       11625.0
## 2662 St Vincent Catholic Medical Center New Y          76800       10750.0
## 1911    Massachusetts Institute of Technology          75300        6125.0
## 2672                Cochran School of Nursing          73800       14125.0
## 2542 Muhlenberg Regional Medical Center-Harol          73400       14125.0
##      earnings_debt_ratio months_to_pay
## 2349            9.969892     12.036238
## 2623            6.686131     17.947598
## 2650           14.483333      8.285386
## 2732            9.210811     13.028169
## 425             9.679098     12.397849
## 2501            6.769892     17.725540
## 2662            7.144186     16.796875
## 1911           12.293878      9.760956
## 2672            5.224779     22.967480
## 2542            5.196460     23.092643

Correlation between Earnings and Tuition

To finish, we’ll check if bigger tuitions lead to bigger earnings. Let’s calculate the correlation between the two variables:

cor_tuition_earnings <- cor(college_data$TUITIONFEE_IN, college_data$md_earn_wne_p6, use="complete.obs")
print(cor_tuition_earnings)
## [1] 0.4716303

And then, check it on a scatterplot:

# Plot scatterplot with the correlation between Tuition and Earnings
ggplot(college_data, aes(x=TUITIONFEE_IN, y=md_earn_wne_p6)) +
  geom_point(shape=1) +
  geom_smooth(method=lm)
## Warning: Removed 2499 rows containing missing values (stat_smooth).
## Warning: Removed 2499 rows containing missing values (geom_point).

We can see that there is a relation, but there are some big outliers in there.