1 Introduction

Loan Smart is a lending advisory firm. Based on their client’s characteristic and needed loan amount they advise them on which Financial Institution to apply for loan at. So far their recommendations have been based hunches business experience. Now they are trying to leverage power of data that they have collected so far.

They want to check whether given their client’s characteristics , they can predict how much interest rates they will be offered by various financial institution. They want to run with proof of concept for this idea. They have given us data collected for one such financial institution ABC Capitals Ltd.

What we need to do is to figure out whether using that data we can predict interest rate offered to client.

2 Loading and Exploring Data

2.1 Loading libraries required

Loading R packages used besides base R.

library(knitr)
library(dplyr)
library(car)
library(ggplot2)

2.2 Reading the data into R

Below, I am reading the csv’s as dataframes into R.

ld_train=read.csv("loan_data_train.csv",stringsAsFactors = F)
ld_test= read.csv("loan_data_test.csv",stringsAsFactors = F)
glimpse(ld_train)
## Observations: 2,200
## Variables: 15
## $ ID                             <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested               <chr> "25000", "19750", "2100", "28000", "...
## $ Amount.Funded.By.Investors     <chr> "25000", "19750", "2100", "28000", "...
## $ Interest.Rate                  <chr> "18.49%", "17.27%", "14.33%", "16.29...
## $ Loan.Length                    <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose                   <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio           <chr> "27.56%", "13.39%", "3.50%", "19.62%...
## $ State                          <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership                 <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income                 <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ FICO.Range                     <chr> "720-724", "710-714", "690-694", "71...
## $ Open.CREDIT.Lines              <chr> "11", "14", "13", "12", "6", "2", "5...
## $ Revolving.CREDIT.Balance       <chr> "15210", "19070", "893", "38194", "3...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length              <chr> "5 years", "4 years", "< 1 year", "1...
glimpse(ld_test)
## Observations: 300
## Variables: 14
## $ ID                             <int> 20093, 62445, 65248, 81822, 57923, 3...
## $ Amount.Requested               <int> 5000, 18000, 7200, 7200, 22000, 6000...
## $ Amount.Funded.By.Investors     <chr> "5000", "18000", "7200", "7200", "22...
## $ Loan.Length                    <chr> "60 months", "60 months", "60 months...
## $ Loan.Purpose                   <chr> "moving", "debt_consolidation", "deb...
## $ Debt.To.Income.Ratio           <chr> "12.59%", "4.93%", "25.16%", "17.27%...
## $ State                          <chr> "NY", "CA", "LA", "NY", "MI", "NJ", ...
## $ Home.Ownership                 <chr> "RENT", "RENT", "MORTGAGE", "MORTGAG...
## $ Monthly.Income                 <dbl> 4416.67, 5258.50, 3750.00, 3416.67, ...
## $ FICO.Range                     <chr> "690-694", "710-714", "750-754", "79...
## $ Open.CREDIT.Lines              <chr> "13", "6", "13", "14", "9", ".", "11...
## $ Revolving.CREDIT.Balance       <int> 7686, 11596, 7283, 4838, 20181, 1442...
## $ Inquiries.in.the.Last.6.Months <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, ...
## $ Employment.Length              <chr> "< 1 year", "10+ years", "6 years", ...

Variable names are self explanatory as to what they represent.

3 Data preparation

We’ll combine our two datasets so that we do not need to prepare data separately for them. And we’ll also avoid problem of dealing with different columns in different datasets.

However before combining them, we’ll need to add response column to test because number of columns need to be same for two datasets to stack vertically.

3.1 Adding response to test

ld_test$Interest.Rate=NA
ld_train$data='train'
ld_test$data='test'
ld_all=rbind(ld_train,ld_test)

Now we’ll start with data prep on ld_all.

3.2 Converting columns to proper form

When you look at the data set , you’ll find out Interest.Rate , Debt.To.Income.Ratio have been imported as characters due to “%” sign.

Open.Credit.Lines and some othe supposedly numeric variables have been converted to character becuase of some character value present possibly. Lets convert them to numbers first.

ld_all=ld_all %>%
  mutate(Interest.Rate=as.numeric(gsub("%","",Interest.Rate)) ,
         Debt.To.Income.Ratio=as.numeric(gsub("%","",Debt.To.Income.Ratio)) ,
         Open.CREDIT.Lines=as.numeric(Open.CREDIT.Lines) , 
         Amount.Requested=as.numeric(Amount.Requested) ,
         Revolving.CREDIT.Balance=as.numeric(Revolving.CREDIT.Balance)
         )

glimpse(ld_all)
## Observations: 2,500
## Variables: 16
## $ ID                             <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested               <dbl> 25000, 19750, 2100, 28000, 24250, 54...
## $ Amount.Funded.By.Investors     <chr> "25000", "19750", "2100", "28000", "...
## $ Interest.Rate                  <dbl> 18.49, 17.27, 14.33, 16.29, 12.23, 8...
## $ Loan.Length                    <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose                   <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio           <dbl> 27.56, 13.39, 3.50, 19.62, 23.79, 6....
## $ State                          <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership                 <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income                 <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ FICO.Range                     <chr> "720-724", "710-714", "690-694", "71...
## $ Open.CREDIT.Lines              <dbl> 11, 14, 13, 12, 6, 2, 5, 11, 24, 7, ...
## $ Revolving.CREDIT.Balance       <dbl> 15210, 19070, 893, 38194, 31061, 90,...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length              <chr> "5 years", "4 years", "< 1 year", "1...
## $ data                           <chr> "train", "train", "train", "train", ...

3.3 Removing highly correlated predictors

The variable Amount_Funded_By_Investors happens to have high correlation with Interest.Rate [our target], but we should still drop this variables because it contains information which will not be available at the time when we need to use this model.

ld_all = ld_all %>%
select(-Amount.Funded.By.Investors)
# ld_all$Amount.Funded.By.Investors=NULL
# we could have used commented code instead for dropping. Infact i find commented method
# more convenient when dropping a single column. dplyr way is more conevnient when dropping
# multiple columns

3.4 Converting Columns

Although fico_range is recorded as categorical variable in terms of ranges , we can convert it to a numeric variable by assigning value as average of the range. lets do that.

ld_all= ld_all %>% 
  mutate(f1=as.numeric(substr(FICO.Range,1,3)),
         f2=as.numeric(substr(FICO.Range,5,7)),
         fico=0.5*(f1+f2)
         ) %>%
  select(-FICO.Range,-f1,-f2)

glimpse(ld_all)
## Observations: 2,500
## Variables: 15
## $ ID                             <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested               <dbl> 25000, 19750, 2100, 28000, 24250, 54...
## $ Interest.Rate                  <dbl> 18.49, 17.27, 14.33, 16.29, 12.23, 8...
## $ Loan.Length                    <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose                   <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio           <dbl> 27.56, 13.39, 3.50, 19.62, 23.79, 6....
## $ State                          <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership                 <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income                 <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ Open.CREDIT.Lines              <dbl> 11, 14, 13, 12, 6, 2, 5, 11, 24, 7, ...
## $ Revolving.CREDIT.Balance       <dbl> 15210, 19070, 893, 38194, 31061, 90,...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length              <chr> "5 years", "4 years", "< 1 year", "1...
## $ data                           <chr> "train", "train", "train", "train", ...
## $ fico                           <dbl> 722, 712, 692, 712, 732, 787, 667, 7...

In a similar fashion we can convert employment length to numbers as well.

ld_all=ld_all %>%
  mutate(el=ifelse(substr(Employment.Length,1,2)=="10",10,Employment.Length),
         el=ifelse(substr(Employment.Length,1,1)=="<",0,el),
         el=gsub("years","",el),
         el=gsub("year","",el),
         el=as.numeric(el)
         ) %>%
  select(-Employment.Length)

Lets take a pause at this point and think over the decision of how we processed employment length here. Converting this to numbers using the logic above was a subjective decision. We can’t really say that this is what we do in every case. We might have as well went on treating it as categorical variable and created dummy variables for it. We might have used some other logic to convert it to numeric by assigning some other numeric values. Each of these decisions would have resulted in a different model. Lets say we build models m1, m2, m3 based on all these decision separately. You can select one of these models to go forward with , based on performance on the validation data. In this text however , we’ll be focusing on just one iteration and in turn developing just one model.

3.5 Dummy Variables

CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  
  for( cat in categories){
    name=paste(var,cat,sep="_")
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    
    data[,name]=as.numeric(data[,var]==cat)
}

  data[,var]=NULL
  return(data)
}

3.5.1 Explaining the function

Let me explain the function if you havent come across this before

t=table(data[,var]) this bit creates a frequency table for the given categorical column. t here is now simply a table which contains names as categories of the categorical variable and their frequency in the data.

t=t[t>freq_cutoff] this line of code removes those categories from the table which have frequencies below the frequency cutoff. ( this is a subjective choice)

‘t=sort(t)’ this line simple sorts the remaining table in ascending order

categories=names(t)[-1] since we sorted the table in ascending manner in the previous line, first category here has least count. In this line of code we are taking out the category names except the first one ( which has least count), thus making n-1 dummies from the remaining categories.

name=paste(var,cat,sep=“_“) all the dummy vars that we intend to make, need to have some name. this line of code creates that name by concatenating variable name with category name with an _.

name=gsub(" “,”“,name) subsequent lines like these using gsub are essentially cleaning up the name thats all. Since we dont have any control over what the categories can be, we are removing special characters and spaces in the code in an automated fashion.

data[,name]=as.numeric(data[,var]==cat) once we have a cleaned up name, this line creates the dummy var for that particular category.

data[,var]=NULL once we are done creating dummies for the variable using for loop. Variable is removed from the data in this line.

3.5.2 Example

Consider a hypothetical example where a categorical variable takes 5 distinct values a ,b ,c, d, e. Among these e occurs very few times. We do not make 5 dummy variable because of this relation ship:

cat_a + cat_b + cat_c + cat_d + cat_e = 1

Which implies that if we make 5 dummy variables , any one taken at a time will be perfect linear combination of the rest which is undesirable as discussed previously.

So we will go ahead and take e as base category , this being least frequent and make 4 dummy variables. But e was a rare category. Meaning dummy variable for e , if made would have been equal to 1 rarely, which means that rest 4 dummies will still add up to exactly one for most part of the data , hence the high VIF values. We’ll that in our results also.

3.6 Looking at columns specifically

3.6.1 Loan.Purpose

Next we look at variable Loan.Purpose

table(ld_all$Loan.Purpose)
## 
##                car        credit_card debt_consolidation        educational 
##                 50                444               1307                 15 
##   home_improvement              house     major_purchase            medical 
##                152                 20                101                 30 
##             moving              other   renewable_energy     small_business 
##                 29                200                  4                 87 
##           vacation            wedding 
##                 21                 39

It has too many categories. There is no direct harm in considering to create dummy variables for n-1 for them. That can be one iteration to try. Here we are going to combine categories on the basis of average response , this will bring down number of categories and we can make dummies then.

round(tapply(ld_all$Interest.Rate,ld_all$Loan.Purpose,mean,na.rm=T))
##                car        credit_card debt_consolidation        educational 
##                 11                 13                 14                 10 
##   home_improvement              house     major_purchase            medical 
##                 12                 14                 11                 11 
##             moving              other   renewable_energy     small_business 
##                 14                 13                  8                 13 
##           vacation            wedding 
##                 12                 12

we’ll combine categories into new one which have similar reponse rate as per the table obtained abvoe and make dummies for them as well while we are at it.

ld_all=ld_all %>%
mutate(lp_10=as.numeric(Loan.Purpose=='educational'),
lp_11=as.numeric(Loan.Purpose %in% c("major_purchase","medical","car")),
lp_12=as.numeric(Loan.Purpose %in% c("vacation","wedding","home_improvement")),
lp_13=as.numeric(Loan.Purpose %in% c("other","small_business","credit_card")),
lp_14=as.numeric(Loan.Purpose %in% c("debt_consolidation","house","moving"))) %>%
select(-Loan.Purpose)

3.6.2 Remaining columns

Next we will make dummy vars for remaining categorical variables.

for(col in c("Loan.Length","State","Home.Ownership"))
  {
  ld_all=CreateDummies(ld_all,col,100)
  }

3.7 Missing Values

3.7.1 Checking missing values

lapply(ld_all,function(x) sum(is.na(x)))
## $ID
## [1] 1
## 
## $Amount.Requested
## [1] 5
## 
## $Interest.Rate
## [1] 300
## 
## $Debt.To.Income.Ratio
## [1] 1
## 
## $Monthly.Income
## [1] 3
## 
## $Open.CREDIT.Lines
## [1] 9
## 
## $Revolving.CREDIT.Balance
## [1] 5
## 
## $Inquiries.in.the.Last.6.Months
## [1] 3
## 
## $data
## [1] 0
## 
## $fico
## [1] 0
## 
## $el
## [1] 80
## 
## $lp_10
## [1] 1
## 
## $lp_11
## [1] 0
## 
## $lp_12
## [1] 0
## 
## $lp_13
## [1] 0
## 
## $lp_14
## [1] 0
## 
## $Loan.Length_36months
## [1] 1
## 
## $State_FL
## [1] 1
## 
## $State_TX
## [1] 1
## 
## $State_NY
## [1] 1
## 
## $State_CA
## [1] 1
## 
## $Home.Ownership_RENT
## [1] 1
## 
## $Home.Ownership_MORTGAGE
## [1] 1

3.7.2 Removing missing values

It doesnt make sense to keep an observation with a missing ID, we’ll filter that.

ld_all=ld_all[!(is.na(ld_all$ID)),]

3.7.3 Imputing missing values

for(col in names(ld_all)){
  if(sum(is.na(ld_all[,col]))>0 & !(col %in% c("ID","data","Interest.Rate"))){

    ld_all[is.na(ld_all[,col]),col]=mean(ld_all[ld_all$data=="train",col],na.rm=T)
}
}

Notice that in missing values imputation also, only training data obs are used.

4 Training and Test Data

Now we are done with preparing data , lets separate the data.

ld_train=ld_all %>% filter(data=='train') %>% select(-data)
ld_test=ld_all %>% filter(data=='test') %>% select(-data,-Interest.Rate)

4.1 Splitting Train Data

we can very well build the model on entire training data if we are not interested in looking at the performance of the model . However here we’d like to know how our model might tentatively perform on test/production data ( it wont be an exact match ever, and in some cases realties might be very different). To do this, we’ll break our data into two parts.

set.seed(2)
s=sample(1:nrow(ld_train),0.7*nrow(ld_train))
ld_train1=ld_train[s,]
ld_train2=ld_train[-s,]

5 Linear Regression

Now we’ll start with our modeling process. We’ll be using function lm for building our linear regression model. First argument to this function is the modeling equation which we are trying to model. second argument is the dataset.

glimpse(ld_train1)
## Observations: 1,539
## Variables: 22
## $ ID                             <int> 34331, 92046, 8382, 47097, 62900, 29...
## $ Amount.Requested               <dbl> 3300, 32000, 6250, 7200, 6000, 6000,...
## $ Interest.Rate                  <dbl> 11.71, 17.27, 10.25, 7.90, 6.03, 11....
## $ Debt.To.Income.Ratio           <dbl> 4.16, 15.61, 18.04, 12.13, 5.65, 19....
## $ Monthly.Income                 <dbl> 3730.00, 9250.00, 2400.00, 2333.33, ...
## $ Open.CREDIT.Lines              <dbl> 7, 15, 5, 7, 15, 6, 13, 13, 9, 16, 1...
## $ Revolving.CREDIT.Balance       <dbl> 3675, 32566, 14863, 6262, 5747, 4298...
## $ Inquiries.in.the.Last.6.Months <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 0, 3, 0, 2, ...
## $ fico                           <dbl> 702, 702, 722, 727, 752, 727, 667, 7...
## $ el                             <dbl> 1.000000, 2.000000, 10.000000, 2.000...
## $ lp_10                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ lp_11                          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ lp_12                          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ lp_13                          <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ lp_14                          <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, ...
## $ Loan.Length_36months           <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, ...
## $ State_FL                       <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ State_TX                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ State_NY                       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ State_CA                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Home.Ownership_RENT            <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, ...
## $ Home.Ownership_MORTGAGE        <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, ...

You can see that all the variables in data are numeric , lets start now. It doesnt make sense to include variable ID in the modelling process. Its just a id number for the transactions which should be ignored.

fit=lm(Interest.Rate~. -ID,data=ld_train1)

5.1 Checking variance inflation factor of the columns

Before we go start looking at model summary and p-value for the variables , we need to drop variables from the model which have high VIF. Remember that VIF is a mutual thing, so if you drop one variable, it might happen that VIF value for several other variables will go down. For this reason, always drop variables for high VIF , one by one, never in a chunk. For looking at VIF values for the variables , we’ll use function vif from package car.

vif(fit)
##               Amount.Requested           Debt.To.Income.Ratio 
##                       1.777267                       1.390000 
##                 Monthly.Income              Open.CREDIT.Lines 
##                       1.684595                       1.358993 
##       Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months 
##                       1.322998                       1.049780 
##                           fico                             el 
##                       1.167498                       1.103588 
##                          lp_10                          lp_11 
##                       3.517704                      26.712734 
##                          lp_12                          lp_13 
##                      32.261817                      81.984659 
##                          lp_14           Loan.Length_36months 
##                      97.509584                       1.278274 
##                       State_FL                       State_TX 
##                       1.055457                       1.058533 
##                       State_NY                       State_CA 
##                       1.109780                       1.110674 
##            Home.Ownership_RENT        Home.Ownership_MORTGAGE 
##                       3.623192                       3.686325

It will be easier to identify high VIF vars if we sort the results and say look at top 3 only

sort(vif(fit),decreasing = T)[1:3]
##    lp_14    lp_13    lp_12 
## 97.50958 81.98466 32.26182

We’ll drop highest vif var and run the model again.

fit=lm(Interest.Rate~. -ID - lp_14,data=ld_train)
sort(vif(fit),decreasing = T)[1:3]
## Home.Ownership_MORTGAGE     Home.Ownership_RENT        Amount.Requested 
##                3.735010                3.650675                1.587393

5.2 Model Summary

All VIF values are under control.Now we look at the model summary , specifically p-values asscociated with the variables. We can drop variables with high p-values [ >0.05] one by one or we can use step function which drops vars based on AIC score one by one. Although the methodology is different but end result is generally similar due to both of them targetting vars which do not contribute towards explaning not very well.

fit=step(fit)
## Start:  AIC=3203.34
## Interest.Rate ~ (ID + Amount.Requested + Debt.To.Income.Ratio + 
##     Monthly.Income + Open.CREDIT.Lines + Revolving.CREDIT.Balance + 
##     Inquiries.in.the.Last.6.Months + fico + el + lp_10 + lp_11 + 
##     lp_12 + lp_13 + lp_14 + Loan.Length_36months + State_FL + 
##     State_TX + State_NY + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE) - 
##     ID - lp_14
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_12                           1       0.0  9267.7 3201.3
## - State_NY                        1       0.2  9267.9 3201.4
## - State_CA                        1       0.4  9268.1 3201.4
## - Debt.To.Income.Ratio            1       0.8  9268.5 3201.5
## - State_FL                        1       1.3  9269.0 3201.6
## - lp_11                           1       2.2  9269.9 3201.9
## - lp_10                           1       2.6  9270.3 3202.0
## - Home.Ownership_RENT             1       3.7  9271.5 3202.2
## - Revolving.CREDIT.Balance        1       5.6  9273.3 3202.7
## - el                              1       6.5  9274.2 3202.9
## - lp_13                           1       8.4  9276.1 3203.3
## <none>                                         9267.7 3203.3
## - Monthly.Income                  1      11.0  9278.7 3203.9
## - Home.Ownership_MORTGAGE         1      23.8  9291.5 3207.0
## - State_TX                        1      51.4  9319.1 3213.5
## - Open.CREDIT.Lines               1      55.4  9323.1 3214.4
## - Inquiries.in.the.Last.6.Months  1     383.3  9651.1 3290.5
## - Amount.Requested                1    2006.0 11273.7 3632.2
## - Loan.Length_36months            1    3129.4 12397.1 3841.1
## - fico                            1   17558.2 26826.0 5538.5
## 
## Step:  AIC=3201.34
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months + 
##     State_FL + State_TX + State_NY + State_CA + Home.Ownership_RENT + 
##     Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_NY                        1       0.2  9267.9 3199.4
## - State_CA                        1       0.4  9268.1 3199.4
## - Debt.To.Income.Ratio            1       0.8  9268.5 3199.5
## - State_FL                        1       1.3  9269.0 3199.6
## - lp_11                           1       2.3  9270.0 3199.9
## - lp_10                           1       2.6  9270.4 3200.0
## - Home.Ownership_RENT             1       3.7  9271.5 3200.2
## - Revolving.CREDIT.Balance        1       5.6  9273.3 3200.7
## - el                              1       6.5  9274.2 3200.9
## <none>                                         9267.7 3201.3
## - lp_13                           1       8.9  9276.6 3201.4
## - Monthly.Income                  1      11.0  9278.8 3202.0
## - Home.Ownership_MORTGAGE         1      23.8  9291.5 3205.0
## - State_TX                        1      51.5  9319.2 3211.5
## - Open.CREDIT.Lines               1      55.4  9323.2 3212.5
## - Inquiries.in.the.Last.6.Months  1     383.5  9651.3 3288.5
## - Amount.Requested                1    2019.3 11287.0 3632.8
## - Loan.Length_36months            1    3132.3 12400.0 3839.6
## - fico                            1   17802.8 27070.5 5556.5
## 
## Step:  AIC=3199.38
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months + 
##     State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_CA                        1       0.5  9268.4 3197.5
## - Debt.To.Income.Ratio            1       0.8  9268.7 3197.6
## - State_FL                        1       1.2  9269.1 3197.7
## - lp_11                           1       2.3  9270.2 3197.9
## - lp_10                           1       2.6  9270.6 3198.0
## - Home.Ownership_RENT             1       3.8  9271.7 3198.3
## - Revolving.CREDIT.Balance        1       5.6  9273.5 3198.7
## - el                              1       6.6  9274.5 3198.9
## <none>                                         9267.9 3199.4
## - lp_13                           1       8.9  9276.8 3199.5
## - Monthly.Income                  1      10.9  9278.8 3200.0
## - Home.Ownership_MORTGAGE         1      24.7  9292.6 3203.2
## - State_TX                        1      51.5  9319.4 3209.6
## - Open.CREDIT.Lines               1      55.3  9323.2 3210.5
## - Inquiries.in.the.Last.6.Months  1     383.4  9651.3 3286.5
## - Amount.Requested                1    2020.1 11288.0 3631.0
## - Loan.Length_36months            1    3137.7 12405.6 3838.6
## - fico                            1   17802.6 27070.5 5554.5
## 
## Step:  AIC=3197.51
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months + 
##     State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Debt.To.Income.Ratio            1       0.8  9269.2 3195.7
## - State_FL                        1       1.4  9269.8 3195.8
## - lp_11                           1       2.2  9270.7 3196.0
## - lp_10                           1       2.6  9271.0 3196.1
## - Home.Ownership_RENT             1       4.1  9272.5 3196.5
## - Revolving.CREDIT.Balance        1       5.7  9274.2 3196.9
## - el                              1       6.5  9275.0 3197.1
## <none>                                         9268.4 3197.5
## - lp_13                           1       9.0  9277.4 3197.6
## - Monthly.Income                  1      10.8  9279.3 3198.1
## - Home.Ownership_MORTGAGE         1      24.7  9293.1 3201.4
## - State_TX                        1      53.5  9322.0 3208.2
## - Open.CREDIT.Lines               1      55.4  9323.8 3208.6
## - Inquiries.in.the.Last.6.Months  1     385.2  9653.7 3285.1
## - Amount.Requested                1    2019.9 11288.3 3629.0
## - Loan.Length_36months            1    3147.0 12415.4 3838.3
## - fico                            1   17802.4 27070.9 5552.5
## 
## Step:  AIC=3195.69
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months + 
##     State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_FL                        1       1.4  9270.6 3194.0
## - lp_11                           1       2.4  9271.6 3194.3
## - lp_10                           1       2.8  9272.0 3194.4
## - Home.Ownership_RENT             1       4.0  9273.3 3194.7
## - el                              1       6.4  9275.6 3195.2
## - Revolving.CREDIT.Balance        1       6.7  9275.9 3195.3
## <none>                                         9269.2 3195.7
## - lp_13                           1       9.1  9278.3 3195.8
## - Monthly.Income                  1      10.1  9279.3 3196.1
## - Home.Ownership_MORTGAGE         1      24.5  9293.7 3199.5
## - State_TX                        1      52.9  9322.1 3206.2
## - Open.CREDIT.Lines               1      68.4  9337.6 3209.8
## - Inquiries.in.the.Last.6.Months  1     386.9  9656.1 3283.6
## - Amount.Requested                1    2023.6 11292.8 3627.9
## - Loan.Length_36months            1    3146.9 12416.1 3836.5
## - fico                            1   18301.5 27570.7 5590.7
## 
## Step:  AIC=3194.03
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months + 
##     State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_11                           1       2.4  9273.1 3192.6
## - lp_10                           1       3.0  9273.6 3192.7
## - Home.Ownership_RENT             1       4.3  9274.9 3193.0
## - el                              1       6.3  9276.9 3193.5
## - Revolving.CREDIT.Balance        1       6.4  9277.0 3193.5
## <none>                                         9270.6 3194.0
## - lp_13                           1       9.2  9279.8 3194.2
## - Monthly.Income                  1      10.4  9281.0 3194.5
## - Home.Ownership_MORTGAGE         1      24.8  9295.4 3197.9
## - State_TX                        1      51.9  9322.5 3204.3
## - Open.CREDIT.Lines               1      68.6  9339.2 3208.2
## - Inquiries.in.the.Last.6.Months  1     386.2  9656.8 3281.8
## - Amount.Requested                1    2024.0 11294.7 3626.3
## - Loan.Length_36months            1    3145.7 12416.3 3834.5
## - fico                            1   18317.4 27588.1 5590.1
## 
## Step:  AIC=3192.61
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_13 + Loan.Length_36months + State_TX + 
##     Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_10                           1       2.7  9275.8 3191.2
## - Home.Ownership_RENT             1       4.4  9277.5 3191.7
## - el                              1       6.0  9279.0 3192.0
## - Revolving.CREDIT.Balance        1       6.6  9279.6 3192.2
## - lp_13                           1       7.8  9280.9 3192.5
## <none>                                         9273.1 3192.6
## - Monthly.Income                  1       9.5  9282.6 3192.9
## - Home.Ownership_MORTGAGE         1      25.5  9298.5 3196.6
## - State_TX                        1      52.6  9325.7 3203.1
## - Open.CREDIT.Lines               1      69.3  9342.3 3207.0
## - Inquiries.in.the.Last.6.Months  1     385.3  9658.3 3280.1
## - Amount.Requested                1    2073.0 11346.1 3634.3
## - Loan.Length_36months            1    3181.3 12454.4 3839.2
## - fico                            1   18654.6 27927.7 5615.0
## 
## Step:  AIC=3191.25
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_RENT + 
##     Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Home.Ownership_RENT             1       4.6  9280.4 3190.3
## - el                              1       5.8  9281.6 3190.6
## - Revolving.CREDIT.Balance        1       6.6  9282.4 3190.8
## - lp_13                           1       7.3  9283.1 3191.0
## <none>                                         9275.8 3191.2
## - Monthly.Income                  1       9.6  9285.4 3191.5
## - Home.Ownership_MORTGAGE         1      26.1  9301.8 3195.4
## - State_TX                        1      52.1  9327.9 3201.6
## - Open.CREDIT.Lines               1      69.3  9345.1 3205.6
## - Inquiries.in.the.Last.6.Months  1     391.4  9667.2 3280.1
## - Amount.Requested                1    2070.9 11346.7 3632.4
## - Loan.Length_36months            1    3179.0 12454.8 3837.3
## - fico                            1   18680.3 27956.1 5615.3
## 
## Step:  AIC=3190.35
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - el                              1       6.2  9286.6 3189.8
## - Revolving.CREDIT.Balance        1       6.2  9286.7 3189.8
## - lp_13                           1       7.3  9287.8 3190.1
## <none>                                         9280.4 3190.3
## - Monthly.Income                  1       9.9  9290.3 3190.7
## - Home.Ownership_MORTGAGE         1      34.9  9315.3 3196.6
## - State_TX                        1      53.1  9333.5 3200.9
## - Open.CREDIT.Lines               1      68.4  9348.9 3204.5
## - Inquiries.in.the.Last.6.Months  1     392.0  9672.4 3279.3
## - Amount.Requested                1    2069.3 11349.7 3631.0
## - Loan.Length_36months            1    3175.8 12456.2 3835.5
## - fico                            1   18727.5 28007.9 5617.3
## 
## Step:  AIC=3189.81
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Revolving.CREDIT.Balance        1       5.7  9292.3 3189.2
## - lp_13                           1       7.5  9294.1 3189.6
## <none>                                         9286.6 3189.8
## - Monthly.Income                  1       9.9  9296.5 3190.2
## - Home.Ownership_MORTGAGE         1      30.3  9316.9 3195.0
## - State_TX                        1      51.2  9337.8 3199.9
## - Open.CREDIT.Lines               1      68.1  9354.7 3203.9
## - Inquiries.in.the.Last.6.Months  1     388.5  9675.1 3277.9
## - Amount.Requested                1    2107.7 11394.3 3637.6
## - Loan.Length_36months            1    3176.8 12463.4 3834.8
## - fico                            1   18751.9 28038.5 5617.7
## 
## Step:  AIC=3189.16
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + lp_13 + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_13                           1       6.4  9298.7 3188.7
## <none>                                         9292.3 3189.2
## - Monthly.Income                  1      14.9  9307.2 3190.7
## - Home.Ownership_MORTGAGE         1      33.2  9325.5 3195.0
## - State_TX                        1      52.5  9344.8 3199.6
## - Open.CREDIT.Lines               1      80.1  9372.3 3206.0
## - Inquiries.in.the.Last.6.Months  1     389.8  9682.1 3277.5
## - Amount.Requested                1    2119.6 11411.9 3639.0
## - Loan.Length_36months            1    3191.2 12483.5 3836.3
## - fico                            1   18755.2 28047.4 5616.4
## 
## Step:  AIC=3188.68
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## <none>                                         9298.7 3188.7
## - Monthly.Income                  1      15.2  9313.9 3190.3
## - Home.Ownership_MORTGAGE         1      34.2  9333.0 3194.8
## - State_TX                        1      53.1  9351.8 3199.2
## - Open.CREDIT.Lines               1      78.9  9377.6 3205.3
## - Inquiries.in.the.Last.6.Months  1     388.4  9687.1 3276.7
## - Amount.Requested                1    2113.3 11412.1 3637.0
## - Loan.Length_36months            1    3184.8 12483.6 3834.4
## - fico                            1   18803.2 28101.9 5618.7

lets check model summary to see what all variable remain in the model.

summary(fit)
## 
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Monthly.Income + 
##     Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico + 
##     Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE, 
##     data = ld_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5280 -1.3681 -0.2007  1.2432 10.0654 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7.466e+01  9.250e-01  80.718  < 2e-16 ***
## Amount.Requested                1.509e-04  6.764e-06  22.310  < 2e-16 ***
## Monthly.Income                 -2.275e-05  1.203e-05  -1.892 0.058675 .  
## Open.CREDIT.Lines              -4.414e-02  1.024e-02  -4.311  1.7e-05 ***
## Inquiries.in.the.Last.6.Months  3.504e-01  3.664e-02   9.564  < 2e-16 ***
## fico                           -8.562e-02  1.287e-03 -66.547  < 2e-16 ***
## Loan.Length_36months           -3.233e+00  1.180e-01 -27.388  < 2e-16 ***
## State_TX                        6.253e-01  1.769e-01   3.536 0.000415 ***
## Home.Ownership_MORTGAGE        -2.623e-01  9.238e-02  -2.839 0.004566 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.061 on 2190 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7562 
## F-statistic: 853.4 on 8 and 2190 DF,  p-value: < 2.2e-16

5.3 Checking the assumption of Normality

It turns out there is still one var for which p-value is higher than .05 . we can manually iclude all other vars of the model or extract the exact formula using formula function.

formula(fit)
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE

We can drop variable Monthly.Income from this.

fit=lm(Interest.Rate ~ Amount.Requested + Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE,data=ld_train1)

summary(fit)
## 
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE, data = ld_train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6999 -1.3742 -0.2214  1.1553 10.1446 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7.474e+01  1.104e+00  67.690  < 2e-16 ***
## Amount.Requested                1.489e-04  7.630e-06  19.520  < 2e-16 ***
## Open.CREDIT.Lines              -4.902e-02  1.224e-02  -4.006 6.48e-05 ***
## Inquiries.in.the.Last.6.Months  3.216e-01  4.180e-02   7.696 2.51e-14 ***
## fico                           -8.583e-02  1.544e-03 -55.595  < 2e-16 ***
## Loan.Length_36months           -3.192e+00  1.412e-01 -22.605  < 2e-16 ***
## State_TX                        5.441e-01  2.084e-01   2.611  0.00912 ** 
## Home.Ownership_MORTGAGE        -2.621e-01  1.099e-01  -2.385  0.01719 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.061 on 1531 degrees of freedom
## Multiple R-squared:  0.7596, Adjusted R-squared:  0.7585 
## F-statistic:   691 on 7 and 1531 DF,  p-value: < 2.2e-16

This is our final model with all variables being significant [ p-value < 0.05]. How good is this model. Adj R-squared value for the model is 0.7585 which means that this models explains 75.85% variation in our response. You can quickly check this visually too by plotting predicted values and the original response.

6 Visualizing the model

6.1 Predicted vs Original

ld_train1 %>%
mutate(pred_IR=predict(fit,newdata=ld_train1)) %>%
ggplot(aes(x=Interest.Rate,y=pred_IR))+geom_point(alpha=0.6)

You can see that there is good overlap. Our model equation from the model is :[taking down values from estimates from summary(fit)]. We are going to use collapse property of paste function. The result will need little modification but it saves us from noting down each value and writing prediction equation manually.

model_string=paste(fit$coefficients,names(fit$coefficients),sep="*",collapse = " + ")
model_eq=strwrap(sub("\\*\\(Intercept\\)","",gsub("+ -","- ",model_string,fixed=TRUE)))
model_eq
## [1] "74.7429652067777 + 0.000148930791905846*Amount.Requested -"            
## [2] "0.0490239121960481*Open.CREDIT.Lines +"                                
## [3] "0.321643365683685*Inquiries.in.the.Last.6.Months -"                    
## [4] "0.0858327626569249*fico - 3.19157260176938*Loan.Length_36months +"     
## [5] "0.544104302623496*State_TX - 0.262061920444776*Home.Ownership_MORTGAGE"

6.2 Diagnostic Plots

Next we look at different diagnostic plots to see if any of our linear regression assumptions are being violated and some other information regarding outliers.

plot(fit,which=1)

Ideally the red line the plot should be horizontal or close to horizontal . If it shows a significant pattern , it means we might have missed non-linear components in the model. If that is the case you might want to look at pair wise relation ship between response and each predictor and figure out if any of the predictors need to be transformed to make the relationship linear between response and that predictor. [Although a better alternative will be to go ahead and to use an algorithm which can extract non linear relationship. e.g. RandomForest , BoostinMachines etc]

6.2.1 QQ Plot

plot(fit,which=2)

This tells you whether you residuals are normal or not.

6.2.2 Density Plot

df=data.frame(res=fit$residual)
ggplot(df,aes(x=res))+geom_density(color="red")+stat_function(fun=dnorm ,args = list(mean=mean(df$res),sd=sd(df$res)),color="green")

6.2.2.1 Normality Test

shapiro.test(fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residuals
## W = 0.97713, p-value = 6.184e-15

This tells you that there is significant deviation from normality in your residual which is confirmed by shapiro wilk test as well. This essentially means that you should not use p-values to drop vars as they rely on the assumption of normality. [ We should have only used AIC score to drop vars]

6.2.3 Checking Homoscedasticity

plot(fit,which=3)

This plot tells us whether our residuals are homoscedastic or not. If this plot has a pattern then that mean residuals are heteroscedastic. If that is the case which is not in our case , you can again log transform response and rebuild your model. Why log transform of response in these deviations of residuals? Becuase it brings down scale of response and in turn for residuals. Effect of scale coming down is differences being less prominent.

6.2.4 Cook’s distance

plot(fit,which=4)

This gives us cook’s distance for all the observations. If cook’s distance for any observation is greater than 1 , you should consider removing that observation from your modeling process and rebuild the model. In our case; cook’s distance is less than 1 for all the observations.[ dont worry about high cook’s distance value for any observation if its less than 1]

Cook’s distance basically tells how much leverage [effect] one observation has on your model equation. High value of cook’s distance [>1] hints at that observation being very different from rest of the data or in other words, that observation being an outlier.

7 RMSE

Done with the diagnostic plots.But we havent yet done the crucial part of testing here. Lets look at rmse value for validation data

rmse= mean((ld_train2$Interest.Rate-predict(fit,newdata=ld_train2))**2) %>%
sqrt()

rmse
## [1] 2.065264

We see that model performance RMSE on validation data comes out to be 2.06. We can check performance of other models with this and then choose the best one having least RMSE.

8 Final model

Now that we know about the tentative performance of the model , we’ll go ahead and build the model on the entire training data and use that model to predict result on the test/production data.

fit.final=fit=lm(Interest.Rate ~ .-ID, data=ld_train)

fit.final=step(fit.final)
## Start:  AIC=3204.82
## Interest.Rate ~ (ID + Amount.Requested + Debt.To.Income.Ratio + 
##     Monthly.Income + Open.CREDIT.Lines + Revolving.CREDIT.Balance + 
##     Inquiries.in.the.Last.6.Months + fico + el + lp_10 + lp_11 + 
##     lp_12 + lp_13 + lp_14 + Loan.Length_36months + State_FL + 
##     State_TX + State_NY + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE) - 
##     ID
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_NY                        1       0.2  9265.7 3202.9
## - lp_10                           1       0.3  9265.8 3202.9
## - State_CA                        1       0.3  9265.9 3202.9
## - Debt.To.Income.Ratio            1       0.8  9266.3 3203.0
## - State_FL                        1       1.3  9266.9 3203.1
## - lp_13                           1       1.4  9266.9 3203.2
## - lp_11                           1       1.4  9267.0 3203.2
## - lp_12                           1       2.2  9267.7 3203.3
## - lp_14                           1       2.2  9267.7 3203.3
## - Home.Ownership_RENT             1       3.8  9269.3 3203.7
## - Revolving.CREDIT.Balance        1       5.6  9271.1 3204.2
## - el                              1       6.6  9272.1 3204.4
## <none>                                         9265.5 3204.8
## - Monthly.Income                  1      10.9  9276.4 3205.4
## - Home.Ownership_MORTGAGE         1      24.0  9289.5 3208.5
## - State_TX                        1      51.7  9317.3 3215.1
## - Open.CREDIT.Lines               1      54.5  9320.1 3215.7
## - Inquiries.in.the.Last.6.Months  1     383.3  9648.8 3292.0
## - Amount.Requested                1    2008.1 11273.7 3634.2
## - Loan.Length_36months            1    3130.7 12396.3 3842.9
## - fico                            1   17547.1 26812.7 5539.4
## 
## Step:  AIC=3202.86
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_10 + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_10                           1       0.3  9266.0 3200.9
## - State_CA                        1       0.5  9266.2 3201.0
## - Debt.To.Income.Ratio            1       0.8  9266.6 3201.1
## - State_FL                        1       1.2  9267.0 3201.2
## - lp_13                           1       1.4  9267.1 3201.2
## - lp_11                           1       1.4  9267.2 3201.2
## - lp_12                           1       2.1  9267.9 3201.4
## - lp_14                           1       2.2  9267.9 3201.4
## - Home.Ownership_RENT             1       3.8  9269.6 3201.8
## - Revolving.CREDIT.Balance        1       5.6  9271.3 3202.2
## - el                              1       6.7  9272.4 3202.4
## <none>                                         9265.7 3202.9
## - Monthly.Income                  1      10.7  9276.5 3203.4
## - Home.Ownership_MORTGAGE         1      24.9  9290.6 3206.8
## - State_TX                        1      51.7  9317.4 3213.1
## - Open.CREDIT.Lines               1      54.3  9320.1 3213.7
## - Inquiries.in.the.Last.6.Months  1     383.1  9648.8 3290.0
## - Amount.Requested                1    2008.6 11274.3 3632.3
## - Loan.Length_36months            1    3135.8 12401.5 3841.9
## - fico                            1   17547.1 26812.8 5537.4
## 
## Step:  AIC=3200.93
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_CA                        1       0.5  9266.5 3199.0
## - Debt.To.Income.Ratio            1       0.8  9266.8 3199.1
## - State_FL                        1       1.2  9267.2 3199.2
## - lp_11                           1       2.3  9268.3 3199.5
## - lp_13                           1       2.3  9268.3 3199.5
## - Home.Ownership_RENT             1       3.8  9269.8 3199.8
## - lp_12                           1       4.2  9270.2 3199.9
## - lp_14                           1       4.6  9270.5 3200.0
## - Revolving.CREDIT.Balance        1       5.6  9271.6 3200.3
## - el                              1       6.7  9272.7 3200.5
## <none>                                         9266.0 3200.9
## - Monthly.Income                  1      10.8  9276.8 3201.5
## - Home.Ownership_MORTGAGE         1      24.7  9290.7 3204.8
## - State_TX                        1      51.7  9317.7 3211.2
## - Open.CREDIT.Lines               1      54.7  9320.7 3211.9
## - Inquiries.in.the.Last.6.Months  1     382.8  9648.8 3288.0
## - Amount.Requested                1    2009.1 11275.1 3630.5
## - Loan.Length_36months            1    3135.6 12401.6 3839.9
## - fico                            1   17547.1 26813.1 5535.4
## 
## Step:  AIC=3199.05
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income + 
##     Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Debt.To.Income.Ratio            1       0.8  9267.2 3197.2
## - State_FL                        1       1.4  9267.9 3197.4
## - lp_11                           1       2.3  9268.8 3197.6
## - lp_13                           1       2.3  9268.8 3197.6
## - Home.Ownership_RENT             1       4.1  9270.5 3198.0
## - lp_12                           1       4.2  9270.7 3198.0
## - lp_14                           1       4.5  9271.0 3198.1
## - Revolving.CREDIT.Balance        1       5.8  9272.2 3198.4
## - el                              1       6.6  9273.1 3198.6
## <none>                                         9266.5 3199.0
## - Monthly.Income                  1      10.7  9277.2 3199.6
## - Home.Ownership_MORTGAGE         1      24.8  9291.3 3202.9
## - State_TX                        1      53.7  9320.2 3209.7
## - Open.CREDIT.Lines               1      54.9  9321.3 3210.0
## - Inquiries.in.the.Last.6.Months  1     384.6  9651.1 3286.5
## - Amount.Requested                1    2008.7 11275.2 3628.5
## - Loan.Length_36months            1    3144.5 12411.0 3839.5
## - fico                            1   17546.6 26813.1 5533.5
## 
## Step:  AIC=3197.23
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - State_FL                        1       1.4  9268.7 3195.6
## - lp_11                           1       2.3  9269.6 3195.8
## - lp_13                           1       2.4  9269.7 3195.8
## - Home.Ownership_RENT             1       4.0  9271.3 3196.2
## - lp_12                           1       4.3  9271.5 3196.2
## - lp_14                           1       4.7  9272.0 3196.4
## - el                              1       6.5  9273.7 3196.8
## - Revolving.CREDIT.Balance        1       6.7  9273.9 3196.8
## <none>                                         9267.2 3197.2
## - Monthly.Income                  1      10.0  9277.2 3197.6
## - Home.Ownership_MORTGAGE         1      24.6  9291.8 3201.0
## - State_TX                        1      53.1  9320.3 3207.8
## - Open.CREDIT.Lines               1      67.4  9334.6 3211.2
## - Inquiries.in.the.Last.6.Months  1     386.2  9653.4 3285.0
## - Amount.Requested                1    2011.4 11278.7 3627.2
## - Loan.Length_36months            1    3144.2 12411.5 3837.6
## - fico                            1   17978.9 27246.2 5566.7
## 
## Step:  AIC=3195.56
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_11                           1       2.4  9271.1 3194.1
## - lp_13                           1       2.5  9271.2 3194.2
## - Home.Ownership_RENT             1       4.3  9272.9 3194.6
## - lp_12                           1       4.5  9273.1 3194.6
## - lp_14                           1       4.9  9273.6 3194.7
## - el                              1       6.3  9275.0 3195.1
## - Revolving.CREDIT.Balance        1       6.4  9275.0 3195.1
## <none>                                         9268.7 3195.6
## - Monthly.Income                  1      10.3  9279.0 3196.0
## - Home.Ownership_MORTGAGE         1      24.9  9293.6 3199.5
## - State_TX                        1      52.1  9320.7 3205.9
## - Open.CREDIT.Lines               1      67.6  9336.3 3209.5
## - Inquiries.in.the.Last.6.Months  1     385.5  9654.2 3283.2
## - Amount.Requested                1    2011.5 11280.2 3625.5
## - Loan.Length_36months            1    3143.0 12411.6 3835.7
## - fico                            1   17990.7 27259.4 5565.8
## 
## Step:  AIC=3194.14
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_12 + lp_13 + lp_14 + Loan.Length_36months + 
##     State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_13                           1       0.1  9271.3 3192.2
## - lp_12                           1       2.7  9273.8 3192.8
## - Home.Ownership_RENT             1       4.4  9275.5 3193.2
## - lp_14                           1       4.6  9275.7 3193.2
## - el                              1       6.3  9277.4 3193.6
## - Revolving.CREDIT.Balance        1       6.4  9277.5 3193.6
## <none>                                         9271.1 3194.1
## - Monthly.Income                  1      10.7  9281.8 3194.7
## - Home.Ownership_MORTGAGE         1      25.1  9296.2 3198.1
## - State_TX                        1      51.4  9322.5 3204.3
## - Open.CREDIT.Lines               1      67.9  9339.0 3208.2
## - Inquiries.in.the.Last.6.Months  1     390.9  9662.0 3283.0
## - Amount.Requested                1    2012.1 11283.2 3624.1
## - Loan.Length_36months            1    3141.4 12412.6 3833.8
## - fico                            1   17988.4 27259.5 5563.8
## 
## Step:  AIC=3192.18
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_12 + lp_14 + Loan.Length_36months + State_TX + 
##     Home.Ownership_RENT + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_12                           1       3.3  9274.6 3191.0
## - Home.Ownership_RENT             1       4.4  9275.7 3191.2
## - el                              1       6.2  9277.4 3191.6
## - Revolving.CREDIT.Balance        1       6.5  9277.8 3191.7
## <none>                                         9271.3 3192.2
## - Monthly.Income                  1      10.6  9281.8 3192.7
## - lp_14                           1      11.2  9282.5 3192.8
## - Home.Ownership_MORTGAGE         1      25.2  9296.4 3196.1
## - State_TX                        1      51.4  9322.7 3202.3
## - Open.CREDIT.Lines               1      68.2  9339.4 3206.3
## - Inquiries.in.the.Last.6.Months  1     391.1  9662.3 3281.0
## - Amount.Requested                1    2047.4 11318.6 3628.9
## - Loan.Length_36months            1    3165.5 12436.7 3836.1
## - fico                            1   18411.2 27682.4 5595.6
## 
## Step:  AIC=3190.96
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_14 + Loan.Length_36months + State_TX + Home.Ownership_RENT + 
##     Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Home.Ownership_RENT             1       4.1  9278.7 3189.9
## - Revolving.CREDIT.Balance        1       6.1  9280.7 3190.4
## - el                              1       6.3  9280.9 3190.5
## <none>                                         9274.6 3191.0
## - lp_14                           1       8.5  9283.1 3191.0
## - Monthly.Income                  1      11.1  9285.7 3191.6
## - Home.Ownership_MORTGAGE         1      25.6  9300.2 3195.0
## - State_TX                        1      51.2  9325.8 3201.1
## - Open.CREDIT.Lines               1      67.2  9341.8 3204.8
## - Inquiries.in.the.Last.6.Months  1     389.4  9664.0 3279.4
## - Amount.Requested                1    2045.2 11319.8 3627.2
## - Loan.Length_36months            1    3162.5 12437.1 3834.2
## - fico                            1   18550.7 27825.3 5604.9
## 
## Step:  AIC=3189.93
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months + 
##     fico + el + lp_14 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - Revolving.CREDIT.Balance        1       5.8  9284.4 3189.3
## - el                              1       6.7  9285.4 3189.5
## <none>                                         9278.7 3189.9
## - lp_14                           1       9.1  9287.8 3190.1
## - Monthly.Income                  1      11.5  9290.1 3190.6
## - Home.Ownership_MORTGAGE         1      36.3  9315.0 3196.5
## - State_TX                        1      52.1  9330.8 3200.2
## - Open.CREDIT.Lines               1      66.3  9345.0 3203.6
## - Inquiries.in.the.Last.6.Months  1     389.9  9668.5 3278.4
## - Amount.Requested                1    2045.1 11323.8 3625.9
## - Loan.Length_36months            1    3159.5 12438.1 3832.3
## - fico                            1   18588.0 27866.7 5606.2
## 
## Step:  AIC=3189.3
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + el + lp_14 + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - el                              1       6.2  9290.6 3188.8
## <none>                                         9284.4 3189.3
## - lp_14                           1       8.5  9292.9 3189.3
## - Monthly.Income                  1      16.7  9301.1 3191.3
## - Home.Ownership_MORTGAGE         1      38.9  9323.4 3196.5
## - State_TX                        1      53.4  9337.8 3199.9
## - Open.CREDIT.Lines               1      78.4  9362.8 3205.8
## - Inquiries.in.the.Last.6.Months  1     391.2  9675.6 3278.1
## - Amount.Requested                1    2054.7 11339.1 3626.9
## - Loan.Length_36months            1    3176.2 12460.7 3834.3
## - fico                            1   18597.8 27882.2 5605.4
## 
## Step:  AIC=3188.76
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + lp_14 + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## - lp_14                           1       8.1  9298.7 3188.7
## <none>                                         9290.6 3188.8
## - Monthly.Income                  1      16.4  9307.0 3190.6
## - Home.Ownership_MORTGAGE         1      34.1  9324.7 3194.8
## - State_TX                        1      51.5  9342.1 3198.9
## - Open.CREDIT.Lines               1      77.6  9368.2 3205.0
## - Inquiries.in.the.Last.6.Months  1     387.7  9678.3 3276.7
## - Amount.Requested                1    2093.6 11384.2 3633.7
## - Loan.Length_36months            1    3176.7 12467.3 3833.5
## - fico                            1   18621.3 27911.9 5605.8
## 
## Step:  AIC=3188.68
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines + 
##     Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + 
##     State_TX + Home.Ownership_MORTGAGE
## 
##                                  Df Sum of Sq     RSS    AIC
## <none>                                         9298.7 3188.7
## - Monthly.Income                  1      15.2  9313.9 3190.3
## - Home.Ownership_MORTGAGE         1      34.2  9333.0 3194.8
## - State_TX                        1      53.1  9351.8 3199.2
## - Open.CREDIT.Lines               1      78.9  9377.6 3205.3
## - Inquiries.in.the.Last.6.Months  1     388.4  9687.1 3276.7
## - Amount.Requested                1    2113.3 11412.1 3637.0
## - Loan.Length_36months            1    3184.8 12483.6 3834.4
## - fico                            1   18803.2 28101.9 5618.7
summary(fit.final)
## 
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Monthly.Income + 
##     Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico + 
##     Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE, 
##     data = ld_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5280 -1.3681 -0.2007  1.2432 10.0654 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7.466e+01  9.250e-01  80.718  < 2e-16 ***
## Amount.Requested                1.509e-04  6.764e-06  22.310  < 2e-16 ***
## Monthly.Income                 -2.275e-05  1.203e-05  -1.892 0.058675 .  
## Open.CREDIT.Lines              -4.414e-02  1.024e-02  -4.311  1.7e-05 ***
## Inquiries.in.the.Last.6.Months  3.504e-01  3.664e-02   9.564  < 2e-16 ***
## fico                           -8.562e-02  1.287e-03 -66.547  < 2e-16 ***
## Loan.Length_36months           -3.233e+00  1.180e-01 -27.388  < 2e-16 ***
## State_TX                        6.253e-01  1.769e-01   3.536 0.000415 ***
## Home.Ownership_MORTGAGE        -2.623e-01  9.238e-02  -2.839 0.004566 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.061 on 2190 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7562 
## F-statistic: 853.4 on 8 and 2190 DF,  p-value: < 2.2e-16

We can use this model to make prediction on the test data

pred.IR=predict(fit.final,newdata=ld_test)
head(pred.IR)
##         1         2         3         4         5         6 
## 15.494899 16.034174 10.442629  3.748237 15.368185  7.133317
#write.csv(pred.IR,"mysubmission.csv",row.names = F)

9 Interpreting Model Coefficients

There are many other observation you can make from the coefficients. For example :

  1. For high amount requested , interest rates are going to be higher as well

  2. If someone has been applying for loan at too many places [ results in high number of inquiries ], they’ll get higher interest rates

  3. Where as if you have high fico score [ mroe credit worthy] , your interest rates are going to be lower.

  4. If you are applying for a loan for 36 months , your interest will be lower by almost 3.5% in comparisons to 60 months loan.