This RMarkdown describes my project work as a response to a Coursera DS challenge. I have tried to document the entire process in the form of a neat reproducible research work.

Working on this challenge, and being able to complete it, has really enriched my experience as a beginner. I see this project as a stepping stone to solving even more complex real-life problems by learning a lot in this field of Data Science.

1 Objective

Using Institutional Features to Predict Students’ Ability to Repay Educational Loans

1.1 Context

During their college education, most students incur a significant amount of debt. The average debt can vary from one institute to the other and so can the students’ repayment rates. Different factors can influence the debt and repayment including institute features and the students’ earnings after graduating.

This project explores to what extent institutional characteristics as well as other certain factors can predict debt repayment. As a guideline, accuracy of the prediction system, which is assessed using RMSE (Root Mean Squared Error), should be a maximum of 10-11 on the hold-out from the training data.

2 College Scorecard Dataset

In an effort to make educational investments less speculative, the US Department of Education has matched information from the student financial aid system with federal tax returns to create the College Scorecard dataset. The College Scorecard is designed to provide students, families, and their advisers with a truer picture on college cost and value, and includes the most reliable national data on the earnings of former college graduates and new data on student debt.

For this project, I have used the data from the following source, which contains the data for the most recent cohort:

https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv

Data from all the preceding cohorts can be downloaded from:

https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data.zip

Documentation is located at https://collegescorecard.ed.gov/data/documentation/ while the data dictionary can be found at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx

3 Approach

College Scorecard Dataset provides the cohort data for every year since 1996. For this project, I have taken the data for the most recent cohort as provided in the Coursera challenge. This section broadly covers my approach to solving the challenge.

3.1 Feature Selection

Since the dataset is very huge and consists of a large number of features, I resort to picking up a small subset of useful-looking features from the dataset. It would be counterproductive to investigate each feature by itself. Through our judgement and good sense, we can initially pick a few seemingly important features that may influence the loan repayment rates. However, before that, we would need to choose our response variable.

3.1.1 Response Variable

The response variable is present in disaggregated form. Repayment rate, which is the fraction of students properly repaying their loans, is a value in the range of 0-1. I choose COMPL_RPY_3YR_RT as the response variable which is defined in the data dictionary as “Three-year repayment rate for completers”.

3.1.2 Predictor Variables

After spending a considerable amount of time exploring the data dictionary and the dataset, I have decided to select the following features for further analysis:-

Feature Name Description
GRAD_DEBT_MDN The median debt for students who have completed
COMP_ORIG_YR4_RT Percent completed within 4 years at original institution
PCTFLOAN Percent of all undergraduate students receiving a federal student loan
MD_EARN_WNE_P8 Median earnings of students working and not enrolled 8 years after entry
CDR3 Three-year cohort default rate
MEDIAN_HH_INC Median household income
AGE_ENTRY Average age of entry
HIGHDEG Highest degree awarded
UGDS_WOMEN Total share of enrollment of undergraduate degree-seeking students who are women
UGDS_NRA Total share of enrollment of undergraduate degree-seeking students who are non-resident aliens
UGDS Enrollment of undergraduate certificate/degree-seeking students
CONTROL Control of institution
COSTT4_A Average cost of attendance (academic year institutions)
COSTT4_P Average cost of attendance (program-year institutions)
PCIP11 Percentage of degrees awarded in Computer And Information Sciences And Support Services.
STABBR State postcode
INSTNM Institution name

These 17 features shown above are selected due to their potential usefulness and after considering many features from the dataset and the data dictionary. This task of feature selection has been done manually. Their vetting, however, doesn’t stop here. They will be analyzed further in the subsequent stages till the modeling phase.

3.2 Handling Missing Values

Missing values in this dataset are not clearly obvious. We can see NULL values instead of NA across different features along with some unknown values called PrivacySuppressed. From the official documentation on the College Scorecard website, data shown as PrivacySuppressed are the data which are not reported in order to protect an individual’s privacy. The presence of these values hinders the predictive ability of our features. We can either opt for removing the entire rows containing such data or converting these values into NA and then imputing them using different methods. I try to go ahead with the latter. Values shown as PrivacySuppressed come in the category of randomly missing data and can be definitely imputed but the NULL values can either be randomly or non-randomly missing data. We shouldn’t try to impute non-random missing data using typical imputation methods. It’s better to either remove them or investigate the method through which the data was collected. I will remove them. Moreover, if certain rows contain far too many NA values, I will remove them as well.

3.3 Splitting into Train-Validation-Test sets

After analyzing the features and imputing the randomly missing values, I will split the dataset into training, validation and test set. Test set would contain the missing data in the response variable. My predictive system’s task would be to predict their values.

3.4 Modeling

Model will be built on the training set and then applied on the validation set to compute the RMSE value. That would help in verifying that the model is working properly and hasn’t overfit the training set. After that, the model will predict the response variable in the test set.

4 Execution

4.1 Loading Data

First I load all the necessary packages.

library(dplyr) 
library(Amelia) 
library(ggplot2)
library(mice)
library(lattice)
library(gridExtra)
library(caret)

One may be required to install these packages from the CRAN first if they don’t exist in the system before.

Next, I read my dataset from the directory I have stored it in, and the features that I had selected before.

# read the file most recent cohorts
schoolDf <- read.csv("C:/Users/admin/Desktop/R/Coursera - Predicting Student Loan Repayment/Dataset/Most-Recent-Cohorts-All-Data-Elements.csv")

schoolDf_sub <- schoolDf %>% select(INSTNM, STATE = STABBR, CONTROL, HIGHDEG, 
                                    CSDEG = PCIP11, COSTA = COSTT4_A, COSTP = COSTT4_P, 
                                    UGDS, UGDS_WOMEN, UGDS_NRA, AGE_ENTRY, COMP_ORIG_YR4_RT,
                                    MEDIAN_HH_INC, PCTFLOAN, MD_EARN_WNE_P8, GRAD_DEBT_MDN, CDR3, COMPL_RPY_3YR_RT)

# remove the huge data frame schoolDf as we're not going to use it again.
rm(schoolDf)

4.2 Wrangling

First, I will redefine a couple of my features, CONTROL and HIGHDEG, using their definition from the data dictionary.

controlTable <- data.frame(CONTROL = c(1, 2, 3), OWNERSHIP = c("Public", "Private nonprofit", "Private for-profit"))
degTable <- data.frame(HIGHDEG = c(0, 1, 2, 3, 4), Degree = c("Non-degree-granting", "Certificate degree", "Associate degree", "Bachelor's degree", "Graduate degree"))

# Now Inner join with schoolDf_sub dataframe
schoolDf_sub2 <- inner_join(x = schoolDf_sub, y = controlTable)
# and once more
schoolDf_sub2 <- inner_join(x = schoolDf_sub2, y = degTable)

# now copy and paste the newly added features to CONTROL and HIGHDEG and remove the last two features afterwards
schoolDf_sub2$CONTROL <- schoolDf_sub2$OWNERSHIP
schoolDf_sub2$HIGHDEG <- schoolDf_sub2$Degree
schoolDf_sub2 <- schoolDf_sub2[, c(-19, -20)]

Now, I will operate on COSTA and COSTP. From the data dictionary, COSTT4_A (COSTA) and COSTT4_P (COSTP) are the average costs of attendance for academic-year institutions and program-year institutions respectively. There are too many missing values and as obvious, some of them must be non-random NA values. This is because only the academic-year institutions will have a value for COSTA feature and NULL for the rest and vice-versa. I can go ahead and engineer a common feature representing the cost of attendance irrespective of whether the institute is an academic-year based or program-year.

# first it is required to convert factor to character before converting to numeric
schoolDf_sub2$COSTA <- as.numeric(as.character(schoolDf_sub2$COSTA))
schoolDf_sub2$COSTP <- as.numeric(as.character(schoolDf_sub2$COSTP))

# make the non-random NA values equal to zero. 
schoolDf_sub2$COSTA[is.na(schoolDf_sub2$COSTA) & !is.na(schoolDf_sub2$COSTP)] <- 0
schoolDf_sub2$COSTP[is.na(schoolDf_sub2$COSTP) & !is.na(schoolDf_sub2$COSTA)] <- 0

# now use mutate to create a feature for the cost of attendance
schoolDf_cost <- schoolDf_sub2 %>% mutate(ATDCOST = COSTA + COSTP)

# remove the old cost features and move the new feature to the correct position
schoolDf_cost2 <- schoolDf_cost[, c(1:5, 19, 8:18)]
summary(schoolDf_cost2)
##                                     INSTNM         STATE     
##  UEI College                           :   8   CA     : 795  
##  McCann School of Business & Technology:   7   TX     : 485  
##  Stevens-Henager College               :   7   NY     : 467  
##  Bryan University                      :   5   FL     : 446  
##  Columbia College                      :   5   PA     : 405  
##  Academy of Cosmetology                :   4   OH     : 355  
##  (Other)                               :7667   (Other):4750  
##                CONTROL                    HIGHDEG         CSDEG     
##  Private for-profit:3703   Associate degree   :1671   0      :4021  
##  Private nonprofit :1956   Bachelor's degree  : 937   NULL   : 714  
##  Public            :2044   Certificate degree :2606   0.013  :  15  
##                            Graduate degree    :2026   0.0101 :  13  
##                            Non-degree-granting: 463   0.0196 :  13  
##                                                       0.0214 :  13  
##                                                       (Other):2914  
##     ATDCOST           UGDS        UGDS_WOMEN      UGDS_NRA   
##  Min.   : 2473   NULL   : 713   NULL   : 713   0      :3992  
##  1st Qu.:14328   39     :  30   1      : 188   NULL   : 713  
##  Median :20253   61     :  28   0      :  96   0.0018 :  27  
##  Mean   :22521   38     :  27   0.9286 :  21   0.0009 :  24  
##  3rd Qu.:27165   46     :  26   0.9167 :  17   0.0015 :  24  
##  Max.   :91135   58     :  26   0.9655 :  16   0.0007 :  23  
##  NA's   :1102    (Other):6853   (Other):6652   (Other):2900  
##              AGE_ENTRY             COMP_ORIG_YR4_RT
##  NULL             : 206   PrivacySuppressed: 963   
##  30.396914011     : 142   NULL             : 515   
##  PrivacySuppressed: 111   0.3599931422     : 140   
##  24.776654776     :  85   0.504224454      :  85   
##  32.888760139     :  81   0.1045028733     :  81   
##  31.763342552     :  38   0.3725579628     :  33   
##  (Other)          :7040   (Other)          :5886   
##            MEDIAN_HH_INC     PCTFLOAN              MD_EARN_WNE_P8
##  NULL             :2068   NULL   : 737   NULL             :1169  
##  PrivacySuppressed:  98   0      : 698   PrivacySuppressed: 705  
##  63381.69         :  23   1      :  57   33900            : 150  
##  62086.89         :  17   0.5    :  17   20700            : 106  
##  48773.91         :  14   0.75   :  15   43300            :  87  
##  55670.96         :  14   0.8    :  15   27200            :  66  
##  (Other)          :5469   (Other):6164   (Other)          :5420  
##            GRAD_DEBT_MDN       CDR3               COMPL_RPY_3YR_RT
##  PrivacySuppressed:1408   NULL   :1059   PrivacySuppressed:1388   
##  9500             : 546   0.188  : 151   NULL             : 902   
##  27000            : 319   0      : 120   0.3438101347     : 142   
##  25827.5          : 142   0.116  : 109   0.3570504528     :  85   
##  25000            : 125   0.141  : 109   0.4160583942     :  81   
##  12000            : 124   0.135  :  66   0.4495896681     :  38   
##  (Other)          :5039   (Other):6089   (Other)          :5067

From the summary of the data frame above, we observe two features, CSDEG and UGDS_NRA, having a large number of 0’s. If they were categorical data, there would be no problem. But they are actually numerical and so they won’t be of much use as predictors. So, I can get rid of them. Moreover, there can be duplicate rows in the table too. I will need to remove them as well.

schoolDf2 <- schoolDf_cost2[, c(-5, -9)]
# some institutes seem to have duplicate entries. So clear them.
schoolDf2 <- unique(schoolDf2)

For identifying more non-random NA values, I can form a criteria and remove the resulting rows.

schoolDf2_sub <- schoolDf2 %>% filter(!(PCTFLOAN == "0" & (GRAD_DEBT_MDN == "PrivacySuppressed" | GRAD_DEBT_MDN == "NULL") 
                                        & (COMPL_RPY_3YR_RT == "NULL" | COMPL_RPY_3YR_RT == "PrivacySuppressed")))

# After subsetting, some factor levels remain even when the observation is removed. We can reset the factor levels this way:
schoolDf2_sub <- droplevels(schoolDf2_sub)

Upto this point, almost all the numeric features still exist in the form of factor. I will convert them to numeric at once and release all the NA values from the NULL and PrivacySuppressed values in the process.

# set of features to be converted to numeric
toNumeric <- names(schoolDf2_sub[, -c(1:5)])
nextSchoolDf <- schoolDf2_sub

# convert all to numeric in one go
nextSchoolDf[toNumeric] <- nextSchoolDf[toNumeric] %>% lapply(FUN = function(x) { as.numeric(as.character(x)) })
str(nextSchoolDf)
## 'data.frame':    7040 obs. of  15 variables:
##  $ INSTNM          : Factor w/ 6903 levels "A & W Healthcare Educators",..: 85 6185 220 6186 87 5953 1050 361 376 375 ...
##  $ STATE           : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ CONTROL         : Factor w/ 3 levels "Private for-profit",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ HIGHDEG         : Factor w/ 5 levels "Associate degree",..: 4 4 4 4 4 4 1 2 4 4 ...
##  $ ATDCOST         : num  21475 20621 16370 21107 18184 ...
##  $ UGDS            : num  4206 11383 291 5451 4811 ...
##  $ UGDS_WOMEN      : num  0.517 0.583 0.601 0.427 0.612 ...
##  $ AGE_ENTRY       : num  20.5 23.4 33.7 23.8 20.5 ...
##  $ COMP_ORIG_YR4_RT: num  0.242 0.428 0.21 0.392 0.187 ...
##  $ MEDIAN_HH_INC   : num  49720 55735 53684 58689 46065 ...
##  $ PCTFLOAN        : num  0.828 0.521 0.779 0.46 0.755 ...
##  $ MD_EARN_WNE_P8  : num  26600 36900 36400 40900 23000 38300 24200 35800 31200 41400 ...
##  $ GRAD_DEBT_MDN   : num  33888 21942 23370 24097 33119 ...
##  $ CDR3            : num  0.172 0.062 0.122 0.061 0.156 0.063 0.219 0.086 0.106 0.045 ...
##  $ COMPL_RPY_3YR_RT: num  0.389 0.611 0.514 0.702 0.327 ...

4.3 Missingness and some more wrangling

First of all, let’s try to visualize the missingness in the dataset obtained so far using the missmap function of the Amelia package.

missmap(nextSchoolDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8)

Displayed above is the map of missingness in our dataset. We can see some horizontal lines stretching across from one direction to another. These are the rows that contain many NA values. We will first try to get rid of those rows before carrying out any imputation on the rest.

One more thing we can observe from the visualization is that a few features have contiguous NAs in the bottom portion of the dataset.

To move forward, I can make another feature that will contain the number of occurences of NA values in each row. Then I can remove those rows that have too many NAs.

nextSchoolDf$countNA <- rowSums(is.na(nextSchoolDf))
summary(nextSchoolDf)
##                                     INSTNM         STATE     
##  Stevens-Henager College               :   7   CA     : 727  
##  Bryan University                      :   5   TX     : 450  
##  Academy of Cosmetology                :   4   NY     : 415  
##  Columbia College                      :   4   PA     : 396  
##  McCann School of Business & Technology:   4   FL     : 388  
##  UEI College                           :   4   OH     : 335  
##  (Other)                               :7012   (Other):4329  
##                CONTROL                    HIGHDEG        ATDCOST     
##  Private for-profit:3490   Associate degree   :1541   Min.   : 4725  
##  Private nonprofit :1821   Bachelor's degree  : 868   1st Qu.:15448  
##  Public            :1729   Certificate degree :2201   Median :21376  
##                            Graduate degree    :1987   Mean   :23468  
##                            Non-degree-granting: 443   3rd Qu.:27849  
##                                                       Max.   :91135  
##                                                       NA's   :1025   
##       UGDS          UGDS_WOMEN       AGE_ENTRY     COMP_ORIG_YR4_RT
##  Min.   :     0   Min.   :0.0000   Min.   :17.65   Min.   :0.0023  
##  1st Qu.:   125   1st Qu.:0.5364   1st Qu.:23.70   1st Qu.:0.2687  
##  Median :   446   Median :0.6424   Median :26.13   Median :0.4613  
##  Mean   :  2497   Mean   :0.6642   Mean   :26.33   Mean   :0.4434  
##  3rd Qu.:  2092   3rd Qu.:0.8710   3rd Qu.:28.84   3rd Qu.:0.6129  
##  Max.   :151558   Max.   :1.0000   Max.   :50.89   Max.   :0.9464  
##  NA's   :697      NA's   :697      NA's   :280     NA's   :1104    
##  MEDIAN_HH_INC       PCTFLOAN      MD_EARN_WNE_P8   GRAD_DEBT_MDN  
##  Min.   : 16155   Min.   :0.0000   Min.   : 10500   Min.   : 1409  
##  1st Qu.: 51032   1st Qu.:0.4298   1st Qu.: 22300   1st Qu.: 9500  
##  Median : 58697   Median :0.6142   Median : 28850   Median :14250  
##  Mean   : 59175   Mean   :0.5767   Mean   : 30802   Mean   :16750  
##  3rd Qu.: 66484   3rd Qu.:0.7630   3rd Qu.: 36300   3rd Qu.:24500  
##  Max.   :101579   Max.   :1.0000   Max.   :213900   Max.   :49750  
##  NA's   :2011     NA's   :721      NA's   :1612     NA's   :791    
##       CDR3        COMPL_RPY_3YR_RT    countNA      
##  Min.   :0.0000   Min.   :0.0565   Min.   : 0.000  
##  1st Qu.:0.0600   1st Qu.:0.3571   1st Qu.: 0.000  
##  Median :0.1160   Median :0.5267   Median : 0.000  
##  Mean   :0.1235   Mean   :0.5318   Mean   : 1.567  
##  3rd Qu.:0.1770   3rd Qu.:0.6995   3rd Qu.: 2.000  
##  Max.   :0.7890   Max.   :0.9718   Max.   :11.000  
##  NA's   :454      NA's   :1640

As can be seen from the summary above, the first 4 features don’t contain any NA value. The rest of the 11 features contain NA values. Let’s try to tabulate the countNA feature and get a sense of what it depicts.

misRows <- data.frame(table(nextSchoolDf$countNA))
names(misRows) <- c("NACount", "Freq")

# Now, let's also calculate the cumulative frequency
misRows$CumFreq <- cumsum(misRows$Freq)

# visualize the tabular data for more insights
ggplot(misRows, aes(x = NACount, y = CumFreq, group = 1)) +
  geom_line() +
  geom_point(size = 2) + 
  geom_text(aes(label = CumFreq), hjust = 0, vjust = 1.5)

Our goal here basically is to allow the rows that have as less NA values as possible while also keeping as many observations in the dataset as possible. So, what we need here is a tradeoff between NACount and CumFreq. An appropriate threshold can help us achieve that tradeoff.

Upon inspecting the plot above, we can see that upto the mark 5 at the x-axis, the slope is quite good. From the next value mark 6, the plot seems to be saturating. NACount equal to 5 seems a pretty reasonable choice for threshold. This means that I will remove all the rows having NACount greater than 5 and keep the rest. This choice will retain many observations in my dataset as well as delete those that have more than 5 NA values.

noLinesDf <- nextSchoolDf %>% filter(countNA <= 5)
# 6485 rows

# Let's again create a new missmap to show contrast with the previous one
par(mfrow = c(1,2))
missmap(nextSchoolDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8, main = "Before")
missmap(noLinesDf, col=c('grey', 'steelblue'), y.cex=0.5, x.cex=0.8, main = "After")

From the above illustration, we can see how the long horizontal lines showing missingness have faded away.

Now, as a part of some more wrangling, I see another feature of interest here: UGDS. UGDS directly can’t be used effectively as a predictor because it’s an absolute value (not relative like %age), and the number of degree seeking students enrolling can vary hugely across different schools.

So, what I aim to do here is create a new feature called INST_ENSIZE which will be an institutional feature directly derived from UGDS. Short for “Institution Enrollment Size”, it will be a categorical feature which will make more sense for prediction instead of using UGDS.

Let’s first visualize the distribution of values in UGDS.

ggplot(noLinesDf, aes(x = "UGDS", y = UGDS)) + 
  geom_boxplot() +
  ggtitle("Distribution of enrollment at different schools")

From this boxplot, it is evident that the range of values for UGDS is extrememly big. There are also many outliers present in the feature. We’ll break UGDS down into categories to overcome the impact of outliers on the predictive model that we are going to build.

To determine the number of categories, let’s try to plot a histogram.

summary(noLinesDf$UGDS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     136     482    2562    2211  151558     307
ggplot(noLinesDf, aes(x = UGDS)) + 
  geom_histogram()

We can see from above that the distribution of UGDS is positively skewed, or skewed to the right. Moreover, mean > median also confirms the same. I will now try to break UGDS into categories near the quartile and median values, in the way as shown below:

noLinesDf$INST_ENSIZE <- NA
noLinesDf$INST_ENSIZE[noLinesDf$UGDS <= 150] <- "Very Small"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 150 & noLinesDf$UGDS <= 500] <- "Small"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 500 & noLinesDf$UGDS <= 2500] <- "Medium"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 2500 & noLinesDf$UGDS <= 15000] <- "Large"
noLinesDf$INST_ENSIZE[noLinesDf$UGDS > 15000] <- "Very Large"
noLinesDf$INST_ENSIZE <- as.factor(noLinesDf$INST_ENSIZE)

# visualizing the new feature
ggplot(noLinesDf, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) + 
  geom_bar(alpha = 0.5) + 
  xlab("Institute Enrollment Size")

Fair enough, let’s replace UGDS with INST_ENSIZE. After that, my data frame will be ready for imputation of all the remaining NA values. I am going to use MICE imputation technique for that.

miceImpDf <- noLinesDf
miceImpDf$UGDS <- miceImpDf$INST_ENSIZE
miceImpDf <- miceImpDf[, c(1:16)]
colnames(miceImpDf)[6] <- "INST_ENSIZE"

Now, I can apply MICE imputation on the data frame. MICE stands for Multiple Imputation using Chained Equations. It will take care of all the NA values, all of which are hopefully completely random NA values. Moreover, I will remove a few features from imputation, particularly the response variable. This is because I am going to use the rows having missing values in the response variable in the TEST set.

# set a random seed
set.seed(165)
# use the CART or Decision Tree method in mice
mice_mod <- mice(miceImpDf[, !names(miceImpDf) %in% c("INSTNM", "countNA", "COMPL_RPY_3YR_RT")], 
                 method = "cart")

To inspect the results of imputation, I can create density plots.

# inspecting the result
densityplot(mice_mod)

The density plots look similar if not equal. I can place the imputed values back into the dataset.

mice_output <- complete(mice_mod)

# let's place this into the same dataframe
verifyMice <- miceImpDf
verifyMice[, c(2:14)] <- mice_output

Let’s just visualize to compare how the data frame look like before and after imputation.

# first, visualize the categorical variable INST_ENSIZE
plot1 <- ggplot(miceImpDf, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) +
  geom_bar(alpha = 0.5) + 
  ggtitle("Before Imputation")

plot2 <- ggplot(verifyMice, aes(x = INST_ENSIZE, fill = INST_ENSIZE)) +
  geom_bar(alpha = 0.5) + 
  ggtitle("After Imputation")

grid.arrange(plot1, plot2, nrow = 1, ncol = 2)

# now the rest of the variables using boxplot
par(las = 2, mfrow = c(1,2))
boxplot(miceImpDf[, c(5, 7:14)])
boxplot(verifyMice[, c(5, 7:14)])

We observe that nothing looks out of ordianary when comparing. There is no drastic change in the distributions which confirms that our imputation using MICE was carried out well.

I have my final imputed dataset. Before splitting into training and test sets, I will shuffle the rows in the dataset and perform a few other important tasks.

# remove the countNA variable
imputedDf <- verifyMice[, -16]

# set a seed and shuffle the rows
set.seed(243)
schoolData <- imputedDf[sample(nrow(imputedDf)), ]

# backUp; to be used later
backUp <- schoolData$INSTNM[is.na(schoolData$COMPL_RPY_3YR_RT)]

# Let's remove the INSTNM feature right here
schoolData <- schoolData[, c(2:15)]

4.4 Modeling

Now comes the part where I am actually going to start creating a predictive system. Since my task involves predicting continuous values, I will have to create my model according to that. I will apply a linear regression model to solve this prediction problem. First of all, I will split the dataset I have right now into train and test sets. From the train set, I will form training and validation sets. Training set will be used to train my machine learning model, validation set for validating the model and determining the associated RMSE value, and the test set as a supplementary predictive task.

# splitting into train and test set
train <- schoolData %>% filter(!is.na(COMPL_RPY_3YR_RT))
test <- schoolData %>% filter(is.na(COMPL_RPY_3YR_RT))

# Now creating data partitions for training set and validation set
# We make a 75-25 split.
set.seed(162)
forTraining <- createDataPartition(train$COMPL_RPY_3YR_RT, p = 0.75, list = FALSE)

training <- train[forTraining, ]
validation <- train[-forTraining, ]

Now, I can start training my model. I will use the functions contained in the caret package, which is a machine learning package. I am also going to involve cross-validation in the training of my model.

# training a linear regression model using cross-validation
control <- trainControl(method = "cv", number = 10)
modelFit <- train(COMPL_RPY_3YR_RT ~ ., data = training, method = "lm", metric = "RMSE", trControl = control)

# check the performance on the training set, how well my model fits the training set using cross-validation
print(modelFit)
## Linear Regression 
## 
## 3990 samples
##   13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3591, 3591, 3590, 3591, 3591, 3591, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1032599  0.7418828  0.0782965
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# checking variable importance
importance <- varImp(modelFit, scale = FALSE)
print(importance)
## lm variable importance
## 
##   only 20 most important variables shown (out of 72)
## 
##                  Overall
## CDR3              20.435
## CONTROL2          19.207
## MD_EARN_WNE_P8    18.368
## AGE_ENTRY         16.881
## CONTROL3          16.444
## MEDIAN_HH_INC      9.484
## PCTFLOAN           7.299
## HIGHDEG4           4.870
## HIGHDEG2           3.923
## STATE44            2.803
## STATE28            2.756
## STATE30            2.631
## STATE8             2.075
## STATE22            1.937
## UGDS_WOMEN         1.935
## STATE11            1.826
## COMP_ORIG_YR4_RT   1.733
## INST_ENSIZE2       1.717
## HIGHDEG5           1.699
## STATE43            1.696
plot(importance, top = 22)

We can get some important insights from above: first, the RMSE and Rsquared values on the training set which are on the lower side; and then, the plot of importance of various features in training the linear regression model.

As we can see, many features were very important, while some features were not much, eg, GRAD_DEBT_MDN and ATDCOST. Surprisingly, ATDCOST is somewhere on the least important side and is not visible in the plot. On the other hand, CDR3 (Cohort Default Rates) has a critical impact on the predictions. Also MD_EARN_WNE_P8 (Median Earnings of the individuals after 8 years of enrollment) is very important. Average age of entry, median household income, etc are some other important features.

If we want, we can go ahead by removing the less important features and keep only the higher ones. It may or may not improve the model. I am going to stick with my present model and go ahead with predictions on the the validation set. The model will predict the COMPL_RPY_3YR_RT feature, or the Three-year loan repayment rate for completers, in the validation set and also compute the RMSE value for assessing accuracy.

valPredictions <- predict(modelFit, validation)

# computing RMSE using caret's function
RMSE(validation$COMPL_RPY_3YR_RT, valPredictions)
## [1] 0.09805046

The RMSE score I have obtained from this predictive model on the hold-out from the train set is far less than the threshold of 10-11. This is a fairly good score which, in fact, tells that my model performed quite well on the validation set.

Finally, I can extend this model to predict the loan repayment rates in the test set as well. However, there wouldn’t be any way of assessing the accuracy in that case as there is no way of knowing the actual values.

# predicting and writing the final result into a csv file
Predictions <- predict(modelFit, test)
predictionsDf <- data.frame(INSTNM = backUp, COMPL_RPY_3YR_RT = Predictions)
summary(predictionsDf$COMPL_RPY_3YR_RT)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.005392  0.339784  0.434507  0.459931  0.568855  1.193114
# A few values seem to be greater than 1. I can correct them in the way shown below.
predictionsDf$COMPL_RPY_3YR_RT[predictionsDf$COMPL_RPY_3YR_RT > 1] <- 1
# saving the predictions to disk
write.csv(predictionsDf, file = "TestPredictions.csv", row.names = FALSE)

5 Conclusion

The RMSE value calculated on the predictions made on the validation set was far lower than the maximum permissible value of 10-11. It was even lower than the RMSE value calculated in the case of training set. This clearly indicates that my model performed very well. It didn’t overfit my training set as is obvious from a better RMSE score.

The future work in this project may include implementing some other machine learning models like knn (k-nearest neighbors), rpart (decision trees), etc., training the model with only top important features, exploring additional features related to institutes and demographics, and exploring other years’ cohort data.