Introduction

The college Scorecard site provides the data behind the College Scorecard, as well as other data on federal financial aid and earnings information. These data provide insights into the performance of schools eligible to receive federal financial aid, and offer a look at the outcomes of students at those schools. This project aims to analyze the data and check if the student at private for-profit institutions have lower student loan repayment rate compared to that of non-profit institutions.

Research question

  1. Are students at private for-profit institution have lower rate of federal loan repayment compared to students at non-profit institutions?

  2. Build regression model for federal loan repayment rate of students from an institution based on the data collected by Integrated Postsecondary Education Data System (IPEDS), National Student Loan Data System (NSLDS) and Administrative Earnings Data from Tax Records.

Data

Data collection Education Data- Collected annually through surveys administered by the Department of Education’s National Center for Education Statistics (NCES)

Federal Financial Aid - Collected from federal load records. This data is taken from the National Student Loan Data System (NSLDS) which is the Department of Education’s central database for monitoring federal student aid—primarily federal student loans and Pell grants

Earnings Data - From Tax records

Cases There are 7804 cases available in the website. But as the data has missing invalid value the data has to be cleansed. After cleansing 6281 observation are found to have enough information for Hypothesis testing. For linear modeling 5600 cases are used. The number of cases linear modeling is less than that of Hypothesis testing as more variables are considered for linear modeling.

Variables
There are more than 100 variable available for each case. But after examining the data only subset of the variables is used in the project. The variables that are relevant and have enough data is selected for the project.

The variables of the most interest are repayment_rate(3_yr_repayment_suppressed.overall) and the ownership of the institution. These variables are considered for Hypothesis testing. In addition to above two Variables following variables are considered for linear modeling. ownership
part_time_share
price
federal_loan_rate

Type of Study
The data is collected by observing the past and so this study is observational. The project conclusion is arrived based on the Exploratory analysis and Hypothesis testing. Further a linear model is fitted to the data to study the effect of various variables.

Scope of Inference - Generalizability
The population of interest is undergraduate/graduate students of US educational institutions who avail education loan for their education. As the sampling is random and has enough number of observations the results of this study could be generalized for population of interest.

Scope of Inference - Causality
As this is observational study we may not in a position to establish casual relationships.

Obtain and Clean Data

Load data into a dataframe

scorecard=read.csv("Data/Most+Recent+Cohorts+(Scorecard+Elements).csv",stringsAsFactors = F)
#Check number of variables and number of observations
dim(scorecard)
## [1] 7804  122

Check some variable to see how the names/data look

#Peek into first 10 Variables
str(scorecard[,1:10])
## 'data.frame':    7804 obs. of  10 variables:
##  $ UNITID : int  100654 100663 100690 100706 100724 100751 100760 100812 100830 100858 ...
##  $ OPEID  : int  100200 105200 2503400 105500 100500 105100 100700 100800 831000 100900 ...
##  $ opeid6 : int  1002 1052 25034 1055 1005 1051 1007 1008 8310 1009 ...
##  $ INSTNM : chr  "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
##  $ CITY   : chr  "Normal" "Birmingham" "Montgomery" "Huntsville" ...
##  $ STABBR : chr  "AL" "AL" "AL" "AL" ...
##  $ INSTURL: chr  "www.aamu.edu/" "www.uab.edu" "www.amridgeuniversity.edu" "www.uah.edu" ...
##  $ NPCURL : chr  "galileo.aamu.edu/netpricecalculator/npcalc.htm" "www.collegeportraits.org/AL/UAB/estimator/agree" "tcc.noellevitz.com/(S(miwoihs5stz5cpyifh4nczu0))/Amridge%20University/Freshman-Students" "finaid.uah.edu/" ...
##  $ HCM2   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PREDDEG: int  3 3 3 3 3 3 2 3 3 3 ...

The college scorecard site provides data dictionary for the data. Note that the data dictionary is for all the data available on the website while the data we loaded pertains to scorecard data. The data dictionary has easy to read description for Variables. These friendly variable names will be used for analysis instead of the cryptic names provided on the csv file.

#Read Data dictionary
datadictionary=read.csv("Data/CollegeScorecardDataDictionary-09-08-2015.csv",stringsAsFactors = F)
#Friendly Names
better_names=datadictionary$developer.friendly.name
#Use names to map cryptic names to friendly names to use the vector like a dictionary
names(better_names)=datadictionary$VARIABLE.NAME
#set column names to friendly names
colnames(scorecard)=better_names[colnames(scorecard)]
#Peek into first 10 variables
str(scorecard[,1:10])
## 'data.frame':    7804 obs. of  10 variables:
##  $ id                         : int  100654 100663 100690 100706 100724 100751 100760 100812 100830 100858 ...
##  $ ope8_id                    : int  100200 105200 2503400 105500 100500 105100 100700 100800 831000 100900 ...
##  $ ope6_id                    : int  1002 1052 25034 1055 1005 1051 1007 1008 8310 1009 ...
##  $ name                       : chr  "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
##  $ city                       : chr  "Normal" "Birmingham" "Montgomery" "Huntsville" ...
##  $ state                      : chr  "AL" "AL" "AL" "AL" ...
##  $ school_url                 : chr  "www.aamu.edu/" "www.uab.edu" "www.amridgeuniversity.edu" "www.uah.edu" ...
##  $ price_calculator_url       : chr  "galileo.aamu.edu/netpricecalculator/npcalc.htm" "www.collegeportraits.org/AL/UAB/estimator/agree" "tcc.noellevitz.com/(S(miwoihs5stz5cpyifh4nczu0))/Amridge%20University/Freshman-Students" "finaid.uah.edu/" ...
##  $ under_investigation        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ degrees_awarded.predominant: int  3 3 3 3 3 3 2 3 3 3 ...

There are more than hundred variables available in the data. After some exploratory analysis following variables are picked up for the project

#Variables selected for analysis
sel_variables=c("id",
              "name",
              "state",
              "degrees_awarded.predominant",
              "ownership",
              "part_time_share",
              "avg_net_price.public",
              "avg_net_price.private",
              "federal_loan_rate",
              "3_yr_repayment_suppressed.overall",
              "10_yrs_after_entry.median"
              )
scorecard=scorecard[sel_variables]
dim(scorecard)
## [1] 7804   11

As the data contains many missing value and some textual indication for masked values the variables are not converted into right data type. Let’s examine the data type of variables of interest and convert them to the format provided in data dictionary.

lapply(scorecard,class)
## $id
## [1] "integer"
## 
## $name
## [1] "character"
## 
## $state
## [1] "character"
## 
## $degrees_awarded.predominant
## [1] "integer"
## 
## $ownership
## [1] "integer"
## 
## $part_time_share
## [1] "character"
## 
## $avg_net_price.public
## [1] "character"
## 
## $avg_net_price.private
## [1] "character"
## 
## $federal_loan_rate
## [1] "character"
## 
## $`3_yr_repayment_suppressed.overall`
## [1] "character"
## 
## $`10_yrs_after_entry.median`
## [1] "character"
datadictionary[datadictionary$developer.friendly.name %in% sel_variables,3:4]
##                developer.friendly.name API.data.type
## 1                                   id       integer
## 4                                 name  autocomplete
## 6                                state        string
## 15         degrees_awarded.predominant       integer
## 25                           ownership       integer
## 545                    part_time_share         float
## 546                    part_time_share         float
## 549               avg_net_price.public       integer
## 550              avg_net_price.private       integer
## 661                  federal_loan_rate         float
## 1864         10_yrs_after_entry.median       integer
## 1936 3_yr_repayment_suppressed.overall         float
#Check the data
head(scorecard)
##       id                                name state
## 1 100654            Alabama A & M University    AL
## 2 100663 University of Alabama at Birmingham    AL
## 3 100690                  Amridge University    AL
## 4 100706 University of Alabama in Huntsville    AL
## 5 100724            Alabama State University    AL
## 6 100751           The University of Alabama    AL
##   degrees_awarded.predominant ownership part_time_share
## 1                           3         1          0.0622
## 2                           3         1          0.2579
## 3                           3         2          0.3727
## 4                           3         1          0.2395
## 5                           3         1          0.0902
## 6                           3         1          0.0852
##   avg_net_price.public avg_net_price.private federal_loan_rate
## 1                13415                  NULL            0.8204
## 2                14805                  NULL            0.5397
## 3                 NULL                  7455            0.7629
## 4                17520                  NULL            0.4728
## 5                11936                  NULL            0.8735
## 6                20916                  NULL            0.4148
##   3_yr_repayment_suppressed.overall 10_yrs_after_entry.median
## 1                    0.444713870029                     31400
## 2                    0.756266666667                     40300
## 3                    0.647249190939                     38100
## 4                    0.781997918835                     46600
## 5                    0.331198861615                     27800
## 6                     0.81394129979                     42400
#Missing values are coded as NULL, let's convert them to NA
scorecard[scorecard=="NULL"]=NA
#Now covert the data into right data type
scorecard$id=as.factor(scorecard$id)
scorecard$state=as.factor(scorecard$state)
scorecard$degrees_awarded.predominant=as.factor(scorecard$degrees_awarded.predominant)
scorecard$ownership=as.factor(scorecard$ownership)

scorecard$part_time_share=as.double(scorecard$part_time_share)
scorecard$avg_net_price.private=as.integer(scorecard$avg_net_price.private)
scorecard$avg_net_price.public=as.integer(scorecard$avg_net_price.public)
scorecard$federal_loan_rate=as.double(scorecard$federal_loan_rate)
scorecard$`3_yr_repayment_suppressed.overall`=
    as.double(scorecard$`3_yr_repayment_suppressed.overall`)
scorecard$`10_yrs_after_entry.median`=as.double(scorecard$`10_yrs_after_entry.median`)
#Check the data class Now
lapply(scorecard,class)
## $id
## [1] "factor"
## 
## $name
## [1] "character"
## 
## $state
## [1] "factor"
## 
## $degrees_awarded.predominant
## [1] "factor"
## 
## $ownership
## [1] "factor"
## 
## $part_time_share
## [1] "numeric"
## 
## $avg_net_price.public
## [1] "integer"
## 
## $avg_net_price.private
## [1] "integer"
## 
## $federal_loan_rate
## [1] "numeric"
## 
## $`3_yr_repayment_suppressed.overall`
## [1] "numeric"
## 
## $`10_yrs_after_entry.median`
## [1] "numeric"

Let’s do more edits to variables of most interest so that it is easy to understand/work with

levels(scorecard$ownership)<- c("Public","Non-profit","for-profit")
#Rename "3_yr_repayment_suppressed.overall" as "repayment_rate"
colnames(scorecard)[10]<-"repayment_rate"
#Let's obtain boolean values so that we can easily separate data by for-profit or not
for_profit=scorecard$ownership=="for-profit"

Let’s clean up the data

#Remove entries that has missing values for repayment_rate
scorecard=scorecard[!is.na(scorecard$repayment_rate),]
#Convert repayment to percentage value
scorecard$repayment_rate=scorecard$repayment_rate*100
#Cost of the education is stored in two variables, let consolidated that into one
scorecard$price=NA
scorecard$price[!is.na(scorecard$avg_net_price.public)]=
  scorecard$avg_net_price.public[!is.na(scorecard$avg_net_price.public)]
scorecard$price[!is.na(scorecard$avg_net_price.private)]=
  scorecard$avg_net_price.private[!is.na(scorecard$avg_net_price.private)]
scorecard$avg_net_price.public=NULL
scorecard$avg_net_price.private=NULL
dim(scorecard)
## [1] 6281   10

Exploratory data analysis

Check the distribution of repayment rate

library(ggplot2)
title_theme=theme(plot.title = element_text(face="bold",vjust=2,size=16))
ggplot(data=scorecard,aes(repayment_rate))+
    geom_histogram(aes(y=..density..),binwidth=2) +
     xlab("Repayment %") +
     ggtitle("Distribution of Repayment Rate") +
     title_theme

Distribution Though the distribution is not perfect normal (distribution is somewhat normal), given the number of observations we could proceed with the assumption that the data is normal. Each observation is independent of each other and given the large number of students we could assume that the sample represent less than 10% of total population. As the repayment could only be between 0% to 100% the skew of the data is limited.

Summary statistics Check the summary statistics to better understand the data Note: For exploratory purpose the institution is broken into three categories: public, non-profit and for-profit. But for hypothesis testing we shall analyze for-profit vs rest.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
scorecard %>%
  group_by(ownership) %>% summarize(repay=mean(repayment_rate),cost=mean(price,na.rm=T),
    Earning=mean(`10_yrs_after_entry.median`,na.rm=T))
## Source: local data frame [3 x 4]
## 
##    ownership    repay     cost  Earning
##       (fctr)    (dbl)    (dbl)    (dbl)
## 1     Public 66.28180 10191.56 35322.43
## 2 Non-profit 77.08854 21171.94 40817.64
## 3 for-profit 48.33115 18298.14 28404.29

From the summary statistics above we could see that the for-profit institution performs poorly on all three parameters repayment rate, cost and earning compared to other institution type.

Graph Let’s explore the data using some graphs to see how the data looks and check some characteristics we could observe Check the Repayment rate by institution type

ggplot(data=scorecard,aes(ownership,repayment_rate))+
    geom_boxplot(aes(ownership,fill=ownership)) +
     ylab("Repayment %") +
     ggtitle("Repayment Rate") +
     title_theme

Let’s check the cost and repayment rate

ggplot(data=scorecard,aes(id,price/1000))+
    geom_point(aes(color=ownership,size=repayment_rate)) +
     ylab("Cost in Thousands") +
     xlab("Institutions")+
    ggtitle("Cost Vs Repayment") +
     title_theme

Check cost, earing and Repayment by institution type

ggplot(data=scorecard,aes(`10_yrs_after_entry.median`/1000,price/1000)) +
     geom_point(aes(color=ownership,size=repayment_rate)) +
     xlab("Earning after 10yrs") +
    ylab("Cost in Thousands") +
     xlim(0,125) +
     ylim(0,60) +
    ggtitle("Cost, Earning and Repayment") +
     title_theme

In the above diagram we could see that student’s from for-profit earn less comparatively and has lesser repayment rate.

Let’s check distribution of repayment rate by institution type

ggplot(data=scorecard,aes(repayment_rate))+
    geom_histogram(aes(y=..density..,fill=ownership),binwidth=5) +
     xlab("Repayment %") +
     facet_grid(~ownership) +
     ggtitle("Distribution of Repayment") +
     title_theme

How about the loan rate? Is student from particular institution avail more loan?

ggplot(data=scorecard,aes(federal_loan_rate*100))+
    geom_histogram(aes(y=..density..,fill=ownership),binwidth=5) +
     xlab("Fed Loan %") +
     facet_grid(~ownership) +
     ggtitle("Loan Rate") +
     title_theme

From above graph we could see that more proportion of student from for-profit institution avail fed loan compared to others

Cost of education

ggplot(data=scorecard,aes(price/1000)) +
    geom_histogram(aes(y=..density..,fill=ownership),binwidth=2,position="dodge") +
     xlab("Cost in Thousands") +
     facet_grid(~ownership) +
     ggtitle("Cost") +
     title_theme

Is there difference from state to state?

#Summarise by state and institution type
repay_by_state = scorecard %>%
    group_by(ownership,state) %>%
    summarise(mean_repay=mean(repayment_rate))

ggplot(repay_by_state,aes(ownership,mean_repay))+
    geom_bar(aes(fill=ownership),position="dodge",stat="identity") +
     facet_wrap(~state) +
     xlab("") +
     ylab("") +
     ggtitle("Mean payment rate - By State") +
     title_theme

From the above diagram we could see that in every state the for-profit repayment rates are lower than others. (Though in three states the for-profit repayment rate appear to be better than non-profit, for-profit repayment rate are still lower than that of public institutions in those states. Public institutions are consider not-for-profit.)

Inference

Hypothesis Testing

Let’s formulate hypothesis testing Null hypothesis: The repayment rate of for-profit institutions are not different from repayment_rate of population Alternate hypothesis: The repayment rate of for-profit institutions are less than repayment rate of all institutions Though this testing requires to do one-tail testing I am sticking to two tail test in order to ensure higher accuracy. While I would check if the value under question falls on left side of population mean I shall consider both the tails in order to compute p-value.

I am using p-value to compute population mean and 95% confidence interval.

t.test(scorecard$repayment_rate)
## 
##  One Sample t-test
## 
## data:  scorecard$repayment_rate
## t = 230.85, df = 6280, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  59.31882 60.33491
## sample estimates:
## mean of x 
##  59.82687
mean(scorecard$repayment_rate[for_profit],na.rm=T)
## [1] 56.00262

The point estimate of population mean of repayment rate is 59.826 and 95% confidence interval is 59.319 to 60.3349. The mean rate repayment for for-profit institutions is 56.002. So, the mean repayment rate of for-profit institutions falls outside the confidence interval on left side of confidence interval. So, we reject Null hypothesis.

Student T-test

Let’s conduct two group student t-test by dividing the sample into two groups for-profit institutions and rest.
Check variance of two groups

#Check if the variance are equal
var.test(scorecard$repayment_rate[for_profit],scorecard$repayment_rate[!for_profit])
## 
##  F test to compare two variances
## 
## data:  scorecard$repayment_rate[for_profit] and scorecard$repayment_rate[!for_profit]
## F = 0.95454, num df = 2432, denom df = 3847, p-value = 0.2059
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8886462 1.0259160
## sample estimates:
## ratio of variances 
##          0.9545362

As the p-value is .2 (>.05) the variance of two groups could be considered as same.

Conduct T-test

t.test(scorecard$repayment_rate[!for_profit],scorecard$repayment_rate[for_profit])
## 
##  Welch Two Sample t-test
## 
## data:  scorecard$repayment_rate[!for_profit] and scorecard$repayment_rate[for_profit]
## t = 11.926, df = 5261.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.216085 7.268360
## sample estimates:
## mean of x mean of y 
##  62.24485  56.00262

The p-value from student t-test is nearly zero which strongly indicates that the mean repayment rate of the two groups are not the same. The 95% confidence interval for the difference of mean repayment rate is 5.216 to 7.268. Which indicates that the other institutions have higher repayment rate in the range 5.216 to 7.268 at 95% confidence level.

Linear Model

In order to identify the variable of importance and their impact on the repayment rate let’s build a linear model. Linear model is selected for regression as it is easier to understand and interpret linear models

lm_repayment=lm(repayment_rate~ownership+
    part_time_share+
    price+
    federal_loan_rate,data=scorecard,na.action = na.omit)
summary(lm_repayment)
## 
## Call:
## lm(formula = repayment_rate ~ ownership + part_time_share + price + 
##     federal_loan_rate, data = scorecard, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.361 -10.285   0.686  11.244  51.192 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.612e+01  7.892e-01  83.788  < 2e-16 ***
## ownershipNon-profit  3.324e+00  7.538e-01   4.410 1.05e-05 ***
## ownershipfor-profit -2.267e+01  6.464e-01 -35.073  < 2e-16 ***
## part_time_share     -1.111e+01  1.025e+00 -10.837  < 2e-16 ***
## price                6.049e-04  3.698e-05  16.358  < 2e-16 ***
## federal_loan_rate   -5.915e+00  1.109e+00  -5.335 9.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.94 on 5594 degrees of freedom
##   (681 observations deleted due to missingness)
## Multiple R-squared:  0.3878, Adjusted R-squared:  0.3872 
## F-statistic: 708.6 on 5 and 5594 DF,  p-value: < 2.2e-16

From the linear model we could see that the institution type is a significant variable and the payment rate is reduces by about 22.6 points if the institution type is for-profit compared to compared to public institutions. The difference is even large (that is a drop of about 25.99 points) if we are to compare the profit with private non-profit. The linear model is inline with the results of hypothesis testing.

Conclusion

All the analysis done on the given the sample data indicate that the student loan repayment rates of for-profit institutions are lower than student loan repayment rate of not for-profit institutions. The exploratory analysis, hypothesis testing and the regression model agree with each other and strongly indicate that the loan repayment rate of for-profit institutes are lower than population. So, we could conclude that the student loan repayment rates of student from for-profit institutions are less than that of student from non for-profit institutions

References

College Scorecard website - https://collegescorecard.ed.gov/
Data - https://collegescorecard.ed.gov/data/
Two sample student t test http://www.r-bloggers.com/two-sample-students-t-test-1/