This RMarkdown document describes my project work in response to a Coursera Data Science challenge. I have tried to document the entire process as a clean, reproducible piece of research.
Working on this challenge, and completing it, has really enriched my experience as a beginner. I see this project as a stepping stone toward solving more complex real-life problems as I continue learning in the field of Data Science.
Using Institutional Features to Predict Students’ Ability to Repay Educational Loans
During their college education, most students incur a significant amount of debt. The average debt can vary from one institution to another, and so can students’ repayment rates. Many factors can influence debt and repayment, including institutional features and students’ earnings after graduation.
This project explores the extent to which institutional characteristics, along with certain other factors, can predict debt repayment. As a guideline, the accuracy of the prediction system, assessed using RMSE (Root Mean Squared Error), should be at most 10-11 on a hold-out from the training data.
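For reference, the RMSE over $n$ hold-out observations is
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
where $\hat{y}_i$ is the predicted repayment rate and $y_i$ the observed one.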
In an effort to make educational investments less speculative, the US Department of Education has matched information from the student financial aid system with federal tax returns to create the College Scorecard dataset. The College Scorecard is designed to provide students, families, and their advisers with a truer picture of college cost and value, and includes the most reliable national data on the earnings of former college graduates and new data on student debt.
For this project, I have used the data from the following source, which contains the data for the most recent cohort:
https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv
Data from all the preceding cohorts can be downloaded from:
https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data.zip
Documentation is located at https://collegescorecard.ed.gov/data/documentation/ while the data dictionary can be found at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx
The College Scorecard dataset provides cohort data for every year since 1996. For this project, I have used the data for the most recent cohort, as provided in the Coursera challenge. This section broadly covers my approach to solving the challenge.
Since the dataset is very large and contains a huge number of features, I restrict my attention to a small subset of promising features. Investigating every feature individually would be counterproductive; instead, using judgement, we can initially pick a few seemingly important features that may influence loan repayment rates. Before that, however, we need to choose our response variable.
The response variable is present in disaggregated form. The repayment rate, i.e. the fraction of students properly repaying their loans, is a value in the range 0-1. I choose COMPL_RPY_3YR_RT as the response variable, which the data dictionary defines as the “Three-year repayment rate for completers”.
After spending a considerable amount of time exploring the data dictionary and the dataset, I have decided to select the following features for further analysis:
| Feature Name | Description |
|---|---|
| GRAD_DEBT_MDN | The median debt for students who have completed |
| COMP_ORIG_YR4_RT | Percent completed within 4 years at original institution |
| PCTFLOAN | Percent of all undergraduate students receiving a federal student loan |
| MD_EARN_WNE_P8 | Median earnings of students working and not enrolled 8 years after entry |
| CDR3 | Three-year cohort default rate |
| MEDIAN_HH_INC | Median household income |
| AGE_ENTRY | Average age of entry |
| HIGHDEG | Highest degree awarded |
| UGDS_WOMEN | Total share of enrollment of undergraduate degree-seeking students who are women |
| UGDS_NRA | Total share of enrollment of undergraduate degree-seeking students who are non-resident aliens |
| UGDS | Enrollment of undergraduate certificate/degree-seeking students |
| CONTROL | Control of institution |
| COSTT4_A | Average cost of attendance (academic year institutions) |
| COSTT4_P | Average cost of attendance (program-year institutions) |
| PCIP11 | Percentage of degrees awarded in Computer And Information Sciences And Support Services |
| STABBR | State postcode |
| INSTNM | Institution name |
The 17 features shown above were selected for their potential usefulness after considering many candidates from the dataset and the data dictionary. This feature selection was done manually. Their vetting, however, doesn’t stop here: they will be analyzed further in subsequent stages, right up to the modeling phase.
Missing values in this dataset are not immediately obvious. Instead of NA, we see NULL values across different features, along with values marked PrivacySuppressed. According to the official documentation on the College Scorecard website, data shown as PrivacySuppressed are withheld to protect individuals’ privacy. The presence of these values hinders the predictive ability of our features. We can either remove the rows containing such data, or convert these values to NA and then impute them; I go with the latter. Values shown as PrivacySuppressed can be treated as randomly missing and can safely be imputed, but the NULL values may be missing either at random or not at random. We shouldn’t impute non-randomly missing data using typical imputation methods; it is better to either remove such values or investigate how the data were collected. I will remove them. Moreover, if certain rows contain far too many NA values, I will remove those as well.
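As a side note, a minimal alternative (which I do not take here, since the raw sentinel strings are useful for row-level filtering later on) would be to map these values to NA at read time:
# alternative, not used in this analysis: treat the sentinel strings as NA on read
schoolDf <- read.csv("Most-Recent-Cohorts-All-Data-Elements.csv",
                     na.strings = c("NULL", "PrivacySuppressed"))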
After analyzing the features and imputing the randomly missing values, I will split the dataset into training, validation, and test sets. The test set will contain the rows with missing values in the response variable; my predictive system’s task will be to predict those values.
The model will be built on the training set and then applied to the validation set to compute the RMSE value. That helps verify that the model works properly and hasn’t overfit the training set. After that, the model will predict the response variable in the test set.
First I load all the necessary packages.
library(dplyr)
library(Amelia)
library(ggplot2)
library(mice)
library(lattice)
library(gridExtra)
library(caret)
You may need to install these packages from CRAN first if they are not already present on your system.
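A minimal sketch for doing that, assuming the package names listed above:
pkgs <- c("dplyr", "Amelia", "ggplot2", "mice", "lattice", "gridExtra", "caret")
# install only the packages that are not yet available on this system
install.packages(setdiff(pkgs, rownames(installed.packages())))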
Next, I read the dataset from the directory where I have stored it, and select the features I chose earlier.
# read the file most recent cohorts
schoolDf <- read.csv("C:/Users/admin/Desktop/R/Coursera - Predicting Student Loan Repayment/Dataset/Most-Recent-Cohorts-All-Data-Elements.csv")
schoolDf_sub <- schoolDf %>% select(INSTNM, STATE = STABBR, CONTROL, HIGHDEG,
CSDEG = PCIP11, COSTA = COSTT4_A, COSTP = COSTT4_P,
UGDS, UGDS_WOMEN, UGDS_NRA, AGE_ENTRY, COMP_ORIG_YR4_RT,
MEDIAN_HH_INC, PCTFLOAN, MD_EARN_WNE_P8, GRAD_DEBT_MDN, CDR3, COMPL_RPY_3YR_RT)
# remove the huge data frame schoolDf as we're not going to use it again.
rm(schoolDf)
First, I will redefine a couple of my features, CONTROL and HIGHDEG, using their definitions from the data dictionary.
controlTable <- data.frame(CONTROL = c(1, 2, 3), OWNERSHIP = c("Public", "Private nonprofit", "Private for-profit"))
degTable <- data.frame(HIGHDEG = c(0, 1, 2, 3, 4), Degree = c("Non-degree-granting", "Certificate degree", "Associate degree", "Bachelor's degree", "Graduate degree"))
# Now Inner join with schoolDf_sub dataframe
schoolDf_sub2 <- inner_join(x = schoolDf_sub, y = controlTable)
# and once more
schoolDf_sub2 <- inner_join(x = schoolDf_sub2, y = degTable)
# now copy and paste the newly added features to CONTROL and HIGHDEG and remove the last two features afterwards
schoolDf_sub2$CONTROL <- schoolDf_sub2$OWNERSHIP
schoolDf_sub2$HIGHDEG <- schoolDf_sub2$Degree
schoolDf_sub2 <- schoolDf_sub2[, c(-19, -20)]
Now, I will operate on COSTA and COSTP. From the data dictionary, COSTT4_A (COSTA) and COSTT4_P (COSTP) are the average costs of attendance for academic-year and program-year institutions respectively. There are many missing values here and, as one would expect, some of them are non-random: only academic-year institutions have a value for COSTA (NULL for the rest), and vice versa for COSTP. I can therefore engineer a combined feature representing the cost of attendance regardless of whether the institution is academic-year or program-year based.
# first it is required to convert factor to character before converting to numeric
schoolDf_sub2$COSTA <- as.numeric(as.character(schoolDf_sub2$COSTA))
schoolDf_sub2$COSTP <- as.numeric(as.character(schoolDf_sub2$COSTP))
# make the non-random NA values equal to zero.
schoolDf_sub2$COSTA[is.na(schoolDf_sub2$COSTA) & !is.na(schoolDf_sub2$COSTP)] <- 0
schoolDf_sub2$COSTP[is.na(schoolDf_sub2$COSTP) & !is.na(schoolDf_sub2$COSTA)] <- 0
# now use mutate to create a feature for the cost of attendance
schoolDf_cost <- schoolDf_sub2 %>% mutate(ATDCOST = COSTA + COSTP)
# remove the old cost features and move the new feature to the correct position
schoolDf_cost2 <- schoolDf_cost[, c(1:5, 19, 8:18)]
summary(schoolDf_cost2)
## INSTNM STATE
## UEI College : 8 CA : 795
## McCann School of Business & Technology: 7 TX : 485
## Stevens-Henager College : 7 NY : 467
## Bryan University : 5 FL : 446
## Columbia College : 5 PA : 405
## Academy of Cosmetology : 4 OH : 355
## (Other) :7667 (Other):4750
## CONTROL HIGHDEG CSDEG
## Private for-profit:3703 Associate degree :1671 0 :4021
## Private nonprofit :1956 Bachelor's degree : 937 NULL : 714
## Public :2044 Certificate degree :2606 0.013 : 15
## Graduate degree :2026 0.0101 : 13
## Non-degree-granting: 463 0.0196 : 13
## 0.0214 : 13
## (Other):2914
## ATDCOST UGDS UGDS_WOMEN UGDS_NRA
## Min. : 2473 NULL : 713 NULL : 713 0 :3992
## 1st Qu.:14328 39 : 30 1 : 188 NULL : 713
## Median :20253 61 : 28 0 : 96 0.0018 : 27
## Mean :22521 38 : 27 0.9286 : 21 0.0009 : 24
## 3rd Qu.:27165 46 : 26 0.9167 : 17 0.0015 : 24
## Max. :91135 58 : 26 0.9655 : 16 0.0007 : 23
## NA's :1102 (Other):6853 (Other):6652 (Other):2900
## AGE_ENTRY COMP_ORIG_YR4_RT
## NULL : 206 PrivacySuppressed: 963
## 30.396914011 : 142 NULL : 515
## PrivacySuppressed: 111 0.3599931422 : 140
## 24.776654776 : 85 0.504224454 : 85
## 32.888760139 : 81 0.1045028733 : 81
## 31.763342552 : 38 0.3725579628 : 33
## (Other) :7040 (Other) :5886
## MEDIAN_HH_INC PCTFLOAN MD_EARN_WNE_P8
## NULL :2068 NULL : 737 NULL :1169
## PrivacySuppressed: 98 0 : 698 PrivacySuppressed: 705
## 63381.69 : 23 1 : 57 33900 : 150
## 62086.89 : 17 0.5 : 17 20700 : 106
## 48773.91 : 14 0.75 : 15 43300 : 87
## 55670.96 : 14 0.8 : 15 27200 : 66
## (Other) :5469 (Other):6164 (Other) :5420
## GRAD_DEBT_MDN CDR3 COMPL_RPY_3YR_RT
## PrivacySuppressed:1408 NULL :1059 PrivacySuppressed:1388
## 9500 : 546 0.188 : 151 NULL : 902
## 27000 : 319 0 : 120 0.3438101347 : 142
## 25827.5 : 142 0.116 : 109 0.3570504528 : 85
## 25000 : 125 0.141 : 109 0.4160583942 : 81
## 12000 : 124 0.135 : 66 0.4495896681 : 38
## (Other) :5039 (Other):6089 (Other) :5067
From the summary of the data frame above, we observe two features, CSDEG and UGDS_NRA, with a large number of 0s. If they were categorical, that would be no problem, but they are numeric, so they won’t be of much use as predictors and I can drop them. Moreover, the table may contain duplicate rows, which I will need to remove as well.
schoolDf2 <- schoolDf_cost2[, c(-5, -9)]
# some institutes seem to have duplicate entries. So clear them.
schoolDf2 <- unique(schoolDf2)
To identify more non-random NA values, I can define a criterion and remove the rows that match it.
schoolDf2_sub <- schoolDf2 %>% filter(!(PCTFLOAN == "0" & (GRAD_DEBT_MDN == "PrivacySuppressed" | GRAD_DEBT_MDN == "NULL")
& (COMPL_RPY_3YR_RT == "NULL" | COMPL_RPY_3YR_RT == "PrivacySuppressed")))
# After subsetting, some factor levels remain even when the observation is removed. We can reset the factor levels this way:
schoolDf2_sub <- droplevels(schoolDf2_sub)
Up to this point, almost all of the numeric features still exist as factors. I will convert them all to numeric at once; in the process, the NULL and PrivacySuppressed entries are coerced to NA.
# set of features to be converted to numeric
toNumeric <- names(schoolDf2_sub[, -c(1:5)])
nextSchoolDf <- schoolDf2_sub
# convert all to numeric in one go
nextSchoolDf[toNumeric] <- nextSchoolDf[toNumeric] %>% lapply(FUN = function(x) { as.numeric(as.character(x)) })
str(nextSchoolDf)
## 'data.frame': 7040 obs. of 15 variables:
## $ INSTNM : Factor w/ 6903 levels "A & W Healthcare Educators",..: 85 6185 220 6186 87 5953 1050 361 376 375 ...
## $ STATE : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ CONTROL : Factor w/ 3 levels "Private for-profit",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ HIGHDEG : Factor w/ 5 levels "Associate degree",..: 4 4 4 4 4 4 1 2 4 4 ...
## $ ATDCOST : num 21475 20621 16370 21107 18184 ...
## $ UGDS : num 4206 11383 291 5451 4811 ...
## $ UGDS_WOMEN : num 0.517 0.583 0.601 0.427 0.612 ...
## $ AGE_ENTRY : num 20.5 23.4 33.7 23.8 20.5 ...
## $ COMP_ORIG_YR4_RT: num 0.242 0.428 0.21 0.392 0.187 ...
## $ MEDIAN_HH_INC : num 49720 55735 53684 58689 46065 ...
## $ PCTFLOAN : num 0.828 0.521 0.779 0.46 0.755 ...
## $ MD_EARN_WNE_P8 : num 26600 36900 36400 40900 23000 38300 24200 35800 31200 41400 ...
## $ GRAD_DEBT_MDN : num 33888 21942 23370 24097 33119 ...
## $ CDR3 : num 0.172 0.062 0.122 0.061 0.156 0.063 0.219 0.086 0.106 0.045 ...
## $ COMPL_RPY_3YR_RT: num 0.389 0.611 0.514 0.702 0.327 ...
First of all, let’s try to visualize the missingness in the dataset obtained so far using the missmap function of the Amelia package.
missmap(nextSchoolDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8)
Displayed above is the map of missingness in our dataset. We can see some horizontal lines stretching across the plot; these are the rows that contain many NA values. We will first get rid of those rows before carrying out any imputation on the rest.
One more thing we can observe from the visualization is that a few features have contiguous NAs in the bottom portion of the dataset.
To move forward, I can create another feature containing the number of occurrences of NA values in each row. Then I can remove the rows that have too many NAs.
nextSchoolDf$countNA <- rowSums(is.na(nextSchoolDf))
summary(nextSchoolDf)
## INSTNM STATE
## Stevens-Henager College : 7 CA : 727
## Bryan University : 5 TX : 450
## Academy of Cosmetology : 4 NY : 415
## Columbia College : 4 PA : 396
## McCann School of Business & Technology: 4 FL : 388
## UEI College : 4 OH : 335
## (Other) :7012 (Other):4329
## CONTROL HIGHDEG ATDCOST
## Private for-profit:3490 Associate degree :1541 Min. : 4725
## Private nonprofit :1821 Bachelor's degree : 868 1st Qu.:15448
## Public :1729 Certificate degree :2201 Median :21376
## Graduate degree :1987 Mean :23468
## Non-degree-granting: 443 3rd Qu.:27849
## Max. :91135
## NA's :1025
## UGDS UGDS_WOMEN AGE_ENTRY COMP_ORIG_YR4_RT
## Min. : 0 Min. :0.0000 Min. :17.65 Min. :0.0023
## 1st Qu.: 125 1st Qu.:0.5364 1st Qu.:23.70 1st Qu.:0.2687
## Median : 446 Median :0.6424 Median :26.13 Median :0.4613
## Mean : 2497 Mean :0.6642 Mean :26.33 Mean :0.4434
## 3rd Qu.: 2092 3rd Qu.:0.8710 3rd Qu.:28.84 3rd Qu.:0.6129
## Max. :151558 Max. :1.0000 Max. :50.89 Max. :0.9464
## NA's :697 NA's :697 NA's :280 NA's :1104
## MEDIAN_HH_INC PCTFLOAN MD_EARN_WNE_P8 GRAD_DEBT_MDN
## Min. : 16155 Min. :0.0000 Min. : 10500 Min. : 1409
## 1st Qu.: 51032 1st Qu.:0.4298 1st Qu.: 22300 1st Qu.: 9500
## Median : 58697 Median :0.6142 Median : 28850 Median :14250
## Mean : 59175 Mean :0.5767 Mean : 30802 Mean :16750
## 3rd Qu.: 66484 3rd Qu.:0.7630 3rd Qu.: 36300 3rd Qu.:24500
## Max. :101579 Max. :1.0000 Max. :213900 Max. :49750
## NA's :2011 NA's :721 NA's :1612 NA's :791
## CDR3 COMPL_RPY_3YR_RT countNA
## Min. :0.0000 Min. :0.0565 Min. : 0.000
## 1st Qu.:0.0600 1st Qu.:0.3571 1st Qu.: 0.000
## Median :0.1160 Median :0.5267 Median : 0.000
## Mean :0.1235 Mean :0.5318 Mean : 1.567
## 3rd Qu.:0.1770 3rd Qu.:0.6995 3rd Qu.: 2.000
## Max. :0.7890 Max. :0.9718 Max. :11.000
## NA's :454 NA's :1640
As can be seen from the summary above, the first 4 features don’t contain any NA values; the remaining 11 features do. Let’s tabulate the countNA feature to get a sense of what it depicts.
misRows <- data.frame(table(nextSchoolDf$countNA))
names(misRows) <- c("NACount", "Freq")
# Now, let's also calculate the cumulative frequency
misRows$CumFreq <- cumsum(misRows$Freq)
# visualize the tabular data for more insights
ggplot(misRows, aes(x = NACount, y = CumFreq, group = 1)) +
geom_line() +
geom_point(size = 2) +
geom_text(aes(label = CumFreq), hjust = 0, vjust = 1.5)
Our goal here is to keep rows with as few NA values as possible while also retaining as many observations as possible. What we need is a tradeoff between NACount and CumFreq, and an appropriate threshold can achieve it.
Upon inspecting the plot above, we can see that up to the mark of 5 on the x-axis, the curve still climbs appreciably; from mark 6 onwards it seems to saturate. NACount equal to 5 therefore seems a reasonable threshold: I will remove all rows having a NACount greater than 5 and keep the rest. This choice retains most of the observations while discarding the rows with more than 5 NA values.
noLinesDf <- nextSchoolDf %>% filter(countNA <= 5)
# 6485 rows
# Let's again create a new missmap to show contrast with the previous one
par(mfrow = c(1,2))
missmap(nextSchoolDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8, main = "Before")
missmap(noLinesDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8, main = "After")
From the above illustration, we can see how the long horizontal lines showing missingness have faded away.
Now, as part of some more wrangling, I turn to another feature of interest: UGDS. UGDS can’t be used directly as an effective predictor, because it’s an absolute count (not a relative value like a percentage) and the number of degree-seeking students enrolling varies hugely across schools.
So what I aim to do is create a new feature called INST_ENSIZE, short for “Institution Enrollment Size”, derived directly from UGDS. It will be a categorical feature that makes more sense for prediction than the raw UGDS values.
Let’s first visualize the distribution of values in UGDS.
ggplot(noLinesDf, aes(x = "UGDS", y = UGDS)) +
geom_boxplot() +
ggtitle("Distribution of enrollment at different schools")From this boxplot, it is evident that the range of values for UGDS is extrememly big. There are also many outliers present in the feature. We’ll break UGDS down into categories to overcome the impact of outliers on the predictive model that we are going to build.
To determine the number of categories, let’s try to plot a histogram.
summary(noLinesDf$UGDS)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 136 482 2562 2211 151558 307
ggplot(noLinesDf, aes(x = UGDS)) +
geom_histogram()
We can see from the above that the distribution of UGDS is positively skewed, or skewed to the right; the fact that the mean is greater than the median confirms this. I will now break UGDS into categories near the quartile and median values, as shown below:
noLinesDf$INST_ENSIZE <- NA
noLinesDf$INST_ENSIZE[noLinesDf$UGDS <= 150] <- "Very Small"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 150 & noLinesDf$UGDS <= 500] <- "Small"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 500 & noLinesDf$UGDS <= 2500] <- "Medium"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 2500 & noLinesDf$UGDS <= 15000] <- "Large"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 15000] <- "Very Large"
noLinesDf$INST_ENSIZE <- as.factor(noLinesDf$INST_ENSIZE)
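# An equivalent, more compact alternative (a sketch, not run here): cut() with the
# same breakpoints yields the same categories in one call, though the factor levels
# would then be ordered by size rather than alphabetically:
# noLinesDf$INST_ENSIZE <- cut(noLinesDf$UGDS,
#                              breaks = c(-Inf, 150, 500, 2500, 15000, Inf),
#                              labels = c("Very Small", "Small", "Medium", "Large", "Very Large"))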
# visualizing the new feature
ggplot(noLinesDf, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) +
geom_bar(alpha = 0.5) +
xlab("Institute Enrollment Size")Fair enough, let’s replace UGDS with INST_ENSIZE. After that, my data frame will be ready for imputation of all the remaining NA values. I am going to use MICE imputation technique for that.
miceImpDf <- noLinesDf
miceImpDf$UGDS <- miceImpDf$INST_ENSIZE
miceImpDf <- miceImpDf[, c(1:16)]
colnames(miceImpDf)[6] <- "INST_ENSIZE"
Now I can apply MICE imputation to the data frame. MICE stands for Multivariate Imputation by Chained Equations. It will take care of all the remaining NA values, which are hopefully all missing at random. I will exclude a few features from the imputation, most importantly the response variable, because the rows with missing values in the response variable will form the test set.
# set a random seed
set.seed(165)
# use the CART or Decision Tree method in mice
mice_mod <- mice(miceImpDf[, !names(miceImpDf) %in% c("INSTNM", "countNA", "COMPL_RPY_3YR_RT")],
method = "cart")To inspect the results of imputation, I can create density plots.
# inspecting the result
densityplot(mice_mod)
The density plots look similar, if not identical. I can place the imputed values back into the dataset.
mice_output <- complete(mice_mod)
# let's place this into the same dataframe
verifyMice <- miceImpDf
verifyMice[, c(2:14)] <- mice_output
Let’s visualize the data frame to compare how it looks before and after imputation.
# first, visualize the categorical variable INST_ENSIZE
plot1 <- ggplot(miceImpDf, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) +
geom_bar(alpha = 0.5) +
ggtitle("Before Imputation")
plot2 <- ggplot(verifyMice, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) +
geom_bar(alpha = 0.5) +
ggtitle("After Imputation")
grid.arrange(plot1, plot2, nrow = 1, ncol = 2)
# now the rest of the variables using boxplots
par(las = 2, mfrow = c(1,2))
boxplot(miceImpDf[, c(5, 7:14)])
boxplot(verifyMice[, c(5, 7:14)])
Nothing looks out of the ordinary in the comparison: there is no drastic change in the distributions, which suggests that the MICE imputation was carried out well.
I have my final imputed dataset. Before splitting into training and test sets, I will shuffle the rows in the dataset and perform a few other important tasks.
# remove the countNA variable
imputedDf <- verifyMice[, -16]
# set a seed and shuffle the rows
set.seed(243)
schoolData <- imputedDf[sample(nrow(imputedDf)), ]
# backUp; to be used later
backUp <- schoolData$INSTNM[is.na(schoolData$COMPL_RPY_3YR_RT)]
# Let's remove the INSTNM feature right here
schoolData <- schoolData[, c(2:15)]
Now comes the part where I actually start building the predictive system. Since the task involves predicting continuous values, I will apply a linear regression model. First, I split the current dataset into train and test sets; from the train set, I then form training and validation sets. The training set will be used to train the machine learning model, the validation set for validating the model and determining the associated RMSE value, and the test set for the final, supplementary prediction task.
# splitting into train and test set
train <- schoolData %>% filter(!is.na(COMPL_RPY_3YR_RT))
test <- schoolData %>% filter(is.na(COMPL_RPY_3YR_RT))
# Now creating data partitions for training set and validation set
# We make a 75-25 split.
set.seed(162)
forTraining <- createDataPartition(train$COMPL_RPY_3YR_RT, p = 0.75, list = FALSE)
training <- train[forTraining, ]
validation <- train[-forTraining, ]
Now I can start training my model. I will use the functions of the caret machine learning package, and I will incorporate cross-validation into the training.
# training a linear regression model using cross-validation
control <- trainControl(method = "cv", number = 10)
modelFit <- train(COMPL_RPY_3YR_RT ~ ., data = training, method = "lm", metric = "RMSE", trControl = control)
# check the performance on the training set, how well my model fits the training set using cross-validation
print(modelFit)
## Linear Regression
##
## 3990 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3591, 3591, 3590, 3591, 3591, 3591, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1032599 0.7418828 0.0782965
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# checking variable importance
importance <- varImp(modelFit, scale = FALSE)
print(importance)
## lm variable importance
##
## only 20 most important variables shown (out of 72)
##
## Overall
## CDR3 20.435
## CONTROL2 19.207
## MD_EARN_WNE_P8 18.368
## AGE_ENTRY 16.881
## CONTROL3 16.444
## MEDIAN_HH_INC 9.484
## PCTFLOAN 7.299
## HIGHDEG4 4.870
## HIGHDEG2 3.923
## STATE44 2.803
## STATE28 2.756
## STATE30 2.631
## STATE8 2.075
## STATE22 1.937
## UGDS_WOMEN 1.935
## STATE11 1.826
## COMP_ORIG_YR4_RT 1.733
## INST_ENSIZE2 1.717
## HIGHDEG5 1.699
## STATE43 1.696
plot(importance, top = 22)
We can draw some important insights from the above: first, the cross-validated RMSE on the training set (about 0.103) is pleasingly low and the Rsquared (about 0.74) reasonably high; second, the plot shows the relative importance of the various features in the linear regression model.
As we can see, many features were very important, while some, e.g. GRAD_DEBT_MDN and ATDCOST, were not. Surprisingly, ATDCOST ranks among the least important and is not even visible in the plot. On the other hand, CDR3 (the three-year cohort default rate) has a critical impact on the predictions, and MD_EARN_WNE_P8 (median earnings of students 8 years after entry) is also very important. The average age of entry and the median household income are other notable features.
If we wanted, we could remove the less important features and keep only the stronger ones; this may or may not improve the model. I am going to stick with my present model and move on to predictions on the validation set. The model will predict the COMPL_RPY_3YR_RT feature, the three-year loan repayment rate for completers, in the validation set, and I will compute the RMSE value to assess accuracy. (A sketch of the feature-reduced alternative follows, for the curious.)
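A hypothetical sketch of that alternative (not run or tuned here), keeping only the predictors that ranked highest above:
# refit with only the highest-ranked predictors from the importance plot
reducedFit <- train(COMPL_RPY_3YR_RT ~ CDR3 + CONTROL + MD_EARN_WNE_P8 + AGE_ENTRY +
                      MEDIAN_HH_INC + PCTFLOAN + HIGHDEG,
                    data = training, method = "lm", metric = "RMSE", trControl = control)
# compare against the full model's validation RMSE
RMSE(validation$COMPL_RPY_3YR_RT, predict(reducedFit, validation))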
valPredictions <- predict(modelFit, validation)
# computing RMSE using caret's function
RMSE(validation$COMPL_RPY_3YR_RT, valPredictions)
## [1] 0.09805046
The RMSE score obtained from this predictive model on the hold-out from the train set is well below the threshold of 10-11. This is a good score, and it indicates that the model performed well on the validation set.
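As a quick sanity check, caret’s RMSE() is equivalent to applying the formula from the introduction directly, using the objects defined above:
# manual computation of the same quantity
sqrt(mean((valPredictions - validation$COMPL_RPY_3YR_RT)^2))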
Finally, I can extend this model to predict the loan repayment rates in the test set as well. However, there is no way of assessing accuracy in that case, as the actual values are unknown.
# predicting and writing the final result into a csv file
Predictions <- predict(modelFit, test)
predictionsDf <- data.frame(INSTNM = backUp, COMPL_RPY_3YR_RT = Predictions)
summary(predictionsDf$COMPL_RPY_3YR_RT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.005392 0.339784 0.434507 0.459931 0.568855 1.193114
# The summary shows a few predicted values outside the valid 0-1 range
# (a maximum above 1 and a slightly negative minimum). Clamp them to [0, 1].
predictionsDf$COMPL_RPY_3YR_RT[predictionsDf$COMPL_RPY_3YR_RT > 1] <- 1
predictionsDf$COMPL_RPY_3YR_RT[predictionsDf$COMPL_RPY_3YR_RT < 0] <- 0
# saving the predictions to disk
write.csv(predictionsDf, file = "TestPredictions.csv", row.names = FALSE)
The RMSE value calculated on the validation-set predictions was well below the maximum permissible value of 10-11, and even slightly lower than the cross-validated RMSE on the training set (0.098 vs 0.103). This indicates that the model generalized well and did not overfit the training data.
Future work on this project may include implementing other machine learning models such as knn (k-nearest neighbors) and rpart (decision trees), training the model with only the most important features, exploring additional features related to institutions and demographics, and exploring other years’ cohort data. A sketch of the first of these extensions appears below.
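As a hypothetical sketch (untuned and not run here), a decision tree can be swapped into the existing caret pipeline with a one-line change; this assumes the rpart package is installed:
# reuse the same cross-validation control and training/validation split as before
treeFit <- train(COMPL_RPY_3YR_RT ~ ., data = training,
                 method = "rpart", metric = "RMSE", trControl = control)
RMSE(validation$COMPL_RPY_3YR_RT, predict(treeFit, validation))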