Hello
In this notebook i’ll talk a little about an analysis of the data released by the american government about colleges. This data can be found at: https://collegescorecard.ed.gov/data/
For this analysis, i used R. ggplot2 was used for some basic plotting, readr to read the csv dataset.
All the code can be found on Github: https://github.com/felipegalvao/us_college_analysis_in_R
Well, first we got to set the working directory and load the required packages. Remember to change the directory to where the datasets are stored on your computer.
# Set the working directory to where the datasets are stored
setwd("C:/Data_Science/US College/CollegeScorecard_Raw_Data")
# Load the required packages
library(readr)
library(ggplot2)
library(R2HTML)
Then we’ll read the CSV. We’ll use the data from 2005, because this is the latest year where the earnings informations are available.
# Read the data. 2005 is the latest year where
# earnings information is available
college_data <- read_csv("MERGED2005_PP.csv")
##
|================================================================================| 100% 101 MB
## Warning: 63099 problems parsing 'MERGED2005_PP.csv'. See problems(...) for
## more details.
We will start the cleaning removing the duplicated data based on the opeid6 column. Then we’ll change values that are equal to “PrivacySuppressed” and “NULL” for columns GRAD_DEBT_MDN (the median debt of the students), md_earn_wne_p6 (median earnings 6 years after entry) and TUITIONFEE_IN (tuition fee):
# Remove duplicates on the dataframe based on the opeid6 column, since
# earnings and debt informations is the same
college_data <- college_data[!duplicated(college_data[,"opeid6"]),]
# Change NULL and PrivacySuppresed values on column to R NA, then convert
# everything to numeric
college_data$GRAD_DEBT_MDN[college_data$GRAD_DEBT_MDN == "PrivacySuppressed"] <- NA
college_data$GRAD_DEBT_MDN[college_data$GRAD_DEBT_MDN == "NULL"] <- NA
college_data$GRAD_DEBT_MDN <- as.numeric(college_data$GRAD_DEBT_MDN)
# Do the same cleaning for the earnings in 6 years column
college_data$md_earn_wne_p6[college_data$md_earn_wne_p6 == "PrivacySuppressed"] <- NA
college_data$md_earn_wne_p6[college_data$md_earn_wne_p6 == "NULL"] <- NA
college_data$md_earn_wne_p6 <- as.numeric(college_data$md_earn_wne_p6)
# Cleaning for tuition fees
college_data$TUITIONFEE_IN[college_data$TUITIONFEE_IN == "PrivacySuppressed"] <- NA
college_data$TUITIONFEE_IN[college_data$TUITIONFEE_IN == "NULL"] <- NA
college_data$TUITIONFEE_IN <- as.numeric(college_data$TUITIONFEE_IN)
We are also going to create some new features. Here they are:
This feature will calculate the ratio between the median earnings 6 months from entry and the average debt for each college. Below is the code:
# Create a column on the DF for Earnings to Debt ratio (earnings / debt)
college_data$earnings_debt_ratio <- college_data$md_earn_wne_p6 / college_data$GRAD_DEBT_MDN
On this feature we make some assumptions. We assume that 10% of the median earnings are used to pay the debt and calculate, based on that, the number of months necessary to pay the debt in each college. There’s the code:
# Create column that calculates the number of months necessary to pay the
# average debt with the median earnings, considering that one person
# uses 10% of that earnings to pay the debt
college_data$months_to_pay <- college_data$GRAD_DEBT_MDN / (college_data$md_earn_wne_p6 / 10 / 12)
We are also going to set some constant variables, that are the US average per capita income for 2011 (6 years from 2011, according to the dataset variable) and the average student loan debt for 2005 (source: http://www.forbes.com/sites/halahtouryalai/2013/01/29/more-evidence-on-the-student-debt-crisis-average-grads-loan-jumps-to-27000/)
# Set constant variables about the US average per capita income
# and average student loan debt (debt for 2005, income for 2011 (6 years after entry))
us_per_capita_income <- 49781.4
us_avg_stdnt_loan_debt <- 17233
Now we will do some filtering to show some stuff related to the Earnings to Debt ratio feature that we created. We are going to leave only colleges that have a median earning above the US average per capita income and the average debt lower than the average student loan debt, and then we will sort and get the top 10 based on the earnings to debt ratio, with the code below:
# Filter the colleges where the earnings average is above
# and the average debt is lesser than the US average
selected_colleges <- college_data[which(college_data$md_earn_wne_p6 > us_per_capita_income & college_data$GRAD_DEBT_MDN < us_avg_stdnt_loan_debt),]
# Sort the data frame by the ratio column and select the top 10
top_10_selected_by_ratio <- selected_colleges[with(selected_colleges, order(-earnings_debt_ratio)),]
top_10_selected_by_ratio <- top_10_selected_by_ratio[1:10,]
# Here we'll trim the college names so that they don't mess up the plot
top_10_selected_by_ratio$Institution <- strtrim(top_10_selected_by_ratio$INSTNM, 40)
# And this is so the graphic won't get organized in alphabetical order, but in
# the same order of the data frame
top_10_selected_by_ratio$Institution <- factor(top_10_selected_by_ratio$Institution,
levels=top_10_selected_by_ratio$Institution)
Now we will plot the top 10 colleges sorted by earnings to debt ratio
# Plot bar graph with the top 10 colleges by Earnings to Debt Ratio
ggplot(data=top_10_selected_by_ratio, aes(x=Institution, y=earnings_debt_ratio)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
And on the table below, the details about the plot that you just saw
print(top_10_selected_by_ratio[,c("Institution","md_earn_wne_p6","GRAD_DEBT_MDN","earnings_debt_ratio", "months_to_pay")])
## Institution md_earn_wne_p6 GRAD_DEBT_MDN
## 2650 Brooklyn Law School 86900 6000.0
## 1911 Massachusetts Institute of Technology 75300 6125.0
## 246 California Institute of Technology 62100 5500.0
## 2349 St Louis College of Pharmacy 115900 11625.0
## 1887 Harvard University 60200 6125.0
## 425 Los Angeles County College of Nursing an 83700 8647.5
## 2556 Princeton University 52300 5500.0
## 174 Jefferson School of Nursing 54600 5859.0
## 2787 Southern Westchester BOCES-Practical Nur 61300 6625.0
## 2732 Helene Fuld College of Nursing 85200 9250.0
## earnings_debt_ratio months_to_pay
## 2650 14.483333 8.285386
## 1911 12.293878 9.760956
## 246 11.290909 10.628019
## 2349 9.969892 12.036238
## 1887 9.828571 12.209302
## 425 9.679098 12.397849
## 2556 9.509091 12.619503
## 174 9.318996 12.876923
## 2787 9.252830 12.969005
## 2732 9.210811 13.028169
We will do the same plotting as above, but now we will consider the colleges that will leave you with the highest average earnings. First, we’ll select the top 10 colleges by earnings that meet the requirements:
top_10_selected_by_earnings <- selected_colleges[with(selected_colleges, order(-md_earn_wne_p6)),]
top_10_selected_by_earnings <- top_10_selected_by_earnings[1:10,]
top_10_selected_by_earnings$Institution <- strtrim(top_10_selected_by_earnings$INSTNM, 40)
top_10_selected_by_earnings$Institution <- factor(top_10_selected_by_earnings$Institution, levels=top_10_selected_by_earnings$Institution)
Now we will plot this data
# Plot bar graph with the top 10 colleges by earnings
ggplot(data=top_10_selected_by_earnings, aes(x=Institution, y=md_earn_wne_p6)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
And the associated table:
print(top_10_selected_by_earnings[,c("Institution","md_earn_wne_p6","GRAD_DEBT_MDN","earnings_debt_ratio", "months_to_pay")])
## Institution md_earn_wne_p6 GRAD_DEBT_MDN
## 2349 St Louis College of Pharmacy 115900 11625.0
## 2623 Albany College of Pharmacy 114500 17125.0
## 2650 Brooklyn Law School 86900 6000.0
## 2732 Helene Fuld College of Nursing 85200 9250.0
## 425 Los Angeles County College of Nursing an 83700 8647.5
## 2501 Charles E Gregory School of Nursing 78700 11625.0
## 2662 St Vincent Catholic Medical Center New Y 76800 10750.0
## 1911 Massachusetts Institute of Technology 75300 6125.0
## 2672 Cochran School of Nursing 73800 14125.0
## 2542 Muhlenberg Regional Medical Center-Harol 73400 14125.0
## earnings_debt_ratio months_to_pay
## 2349 9.969892 12.036238
## 2623 6.686131 17.947598
## 2650 14.483333 8.285386
## 2732 9.210811 13.028169
## 425 9.679098 12.397849
## 2501 6.769892 17.725540
## 2662 7.144186 16.796875
## 1911 12.293878 9.760956
## 2672 5.224779 22.967480
## 2542 5.196460 23.092643
To finish, we’ll check if bigger tuitions lead to bigger earnings. Let’s calculate the correlation between the two variables:
cor_tuition_earnings <- cor(college_data$TUITIONFEE_IN, college_data$md_earn_wne_p6, use="complete.obs")
print(cor_tuition_earnings)
## [1] 0.4716303
And then, check it on a scatterplot:
# Plot scatterplot with the correlation between Tuition and Earnings
ggplot(college_data, aes(x=TUITIONFEE_IN, y=md_earn_wne_p6)) +
geom_point(shape=1) +
geom_smooth(method=lm)
## Warning: Removed 2499 rows containing missing values (stat_smooth).
## Warning: Removed 2499 rows containing missing values (geom_point).
We can see that there is a relation, but there are some big outliers in there.