Student Loan Default Rate Data Analysis Summary

Bo Suzow
September 1, 2018

Full Report is available here

Introduction

  • Student loan debt (SLD) total in the US reached 1.5 trillion dollars, borrowed by 44 million people.
  • Between 2008 and 2015, the total almost doubled from 640 billion to 1.2 trillion.
  • SLD as societal issues such as divorce, and lack of financial independence and home ownership.
  • Average student loan default rate is 12% for FY2012 tracking cohort.
  • Institutions at high default rate (30% or above) risking the loss of federal loan eligibility.
  • What are the strong predictors for the default rate?
  • Can we build a predictive model?

Data Source and Wrangling

  • 2014-15 College Scorecard data MERGED2014_15_PP.csv set published by the US Dept. of Education.
  • The data set reports default rates for the FY2012 cohort whose members entered repayment between October 2011 and September 2012 and defaulted on their loans by September 2014.
  • Institutions with missing values are removed. The removal left in the data set the institutions whose predominant degrees are (PREDDEG) Certificate, Associate's and Bachelor's. It also reduced the number of columns from 1700+ to 23. See the next slide for the list.

Independent Variables

  • OPEID6 : 6-digit ID for institution
  • INSTNM : Institution name
  • NUMBRANCH : Number of branch campuses
  • CONTROL : Institution ownership type
  • PREDDEG : Predominant degree awarded
  • REGION : Region code (New England, Mid East, Great Lakes, Plains, Southeast, Southwest, Rocky Mountain, Far West, Outlying Areas)
  • DISTANCEONLY : Whether distance education only institution
  • UGDS : Enrollment of certificate/degree seeking students
  • TUITFTE : Net tuition per full-time equivalent student (FTE)
  • INEXPFTE : Instructional expenditures per FTE
  • PCTPELL : Percentage of students who receive a Pell Grant

Independent Variables (continued)

  • PCTFLOAN : Percentage of students receiving a federal student loan
  • PAR_ED_PCT_1STGEN : Percentage of first-generation students
  • DEP_INC_AVG : Average family income of dependent students
  • IND_INC_AVG : Average family income of independent students
  • DEBT_MDN : Median loan amount upon entering repayment
  • GRAD_DEBT_MDN : Median debt for students who have completed
  • WDRAW_DEBT_MDN : Median debt for students who have not completed
  • MD_FAMINC : Median family income
  • CDR3_DENOM : Number of students in the tracking cohort

Predictive Models

  • Linear Regression
  • Classification and Regression Tree (CART)
  • Random Forest (RF)

Linear Regression

  • Backward feature selection using a 10-fold cross validation reduces the number of variables to 19.

  • The Root Mean Square Error (RMSE) of this model is 0.05.

CART - Decision Tree

plot of chunk unnamed-chunk-2

  • The regression tree reports only 4 variables (DEP_INC_AVG, IND_INC_AVG, TUITFTE, and GRAD_DEB) for splitting rules.
  • The RMSE of this model is 0.051 which is a little higher than the linear regression model.

Random Forest

plot of chunk unnamed-chunk-3

  • This model reports the variables in the order of importance.
  • The RMSE of this model is 0.045 which is the lowest of the three models.

Further Possible Studies and Analyses

  • Multi-year analysis with the most recent 3 or 4-year cohorts.
    • Whether the same or similar set of predictors emerges, or time-series based trends would be of research interests.
  • Analysis on repayment rates.
    • The college scorecard data set provides repayment rates in RPY_*YR_RT columns. What are the predictors for repayment trends? Any correlation between repayment and default rates?
  • Study on defaults after tracking period.
    • A recent New York Times article discusses student loan default statistics after the end of a given tracking period, and states that the picture looks gloomier. An analysis on the after-tracking-period data would provide a fuller picture.

What does the data tell us?

  • Bachelor's degree institution is associated with a decrease in default rate in comparison with Associate's and Certificate institutions. Bachelor's degree holders tend to achieve higher earnings, improving family income which is negatively correlated to default rate.
  • Parent's education level is an important predictor. The average default rate of the institutions in the 4th quartile of PAR_ED_PCT_1STGEN is 25% higher than the overall average (16% vs 13%).
  • Institution's spending on instructional resources is negatively correlated to default rate. The average default rate of the instituions in the 4th quartile of INEXPFTE is 38% lower than the overall average (8% vs 13%). These institutions report $7,800 or more per student (FTE) for instructional expenditures.
  • 60% of the 4th quartitle default rate institutions are private-for-profit.

Recommendations

  • Rule change in federal loan eligibility.
    • from: losing eligibility if the default rate reaches 30% or above for 3 years in a row or 40%+ for a single year
    • to: losing eligibility if the default rate falls on the 90th percentile for 3 years in a row or the 95th percentfile for a single year.
  • Empower first generation students with information regarding dire consequencies of student loan defaults and tools with which students control loans within a manageable amount.
  • Incentivize employers that offer repayment assistant programs that would attract talent, relieve our future generations from heavy financial burden, and pave the way for soft landing of possible student loan crashes.

References