Loan Smart is a lending advisory firm. Based on their client’s characteristic and needed loan amount they advise them on which Financial Institution to apply for loan at. So far their recommendations have been based hunches business experience. Now they are trying to leverage power of data that they have collected so far.
They want to check whether given their client’s characteristics , they can predict how much interest rates they will be offered by various financial institution. They want to run with proof of concept for this idea. They have given us data collected for one such financial institution ABC Capitals Ltd.
What we need to do is to figure out whether using that data we can predict interest rate offered to client.
Loading R packages used besides base R.
library(knitr)
library(dplyr)
library(car)
library(ggplot2)
Below, I am reading the csv’s as dataframes into R.
ld_train=read.csv("loan_data_train.csv",stringsAsFactors = F)
ld_test= read.csv("loan_data_test.csv",stringsAsFactors = F)
glimpse(ld_train)
## Observations: 2,200
## Variables: 15
## $ ID <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested <chr> "25000", "19750", "2100", "28000", "...
## $ Amount.Funded.By.Investors <chr> "25000", "19750", "2100", "28000", "...
## $ Interest.Rate <chr> "18.49%", "17.27%", "14.33%", "16.29...
## $ Loan.Length <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio <chr> "27.56%", "13.39%", "3.50%", "19.62%...
## $ State <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ FICO.Range <chr> "720-724", "710-714", "690-694", "71...
## $ Open.CREDIT.Lines <chr> "11", "14", "13", "12", "6", "2", "5...
## $ Revolving.CREDIT.Balance <chr> "15210", "19070", "893", "38194", "3...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length <chr> "5 years", "4 years", "< 1 year", "1...
glimpse(ld_test)
## Observations: 300
## Variables: 14
## $ ID <int> 20093, 62445, 65248, 81822, 57923, 3...
## $ Amount.Requested <int> 5000, 18000, 7200, 7200, 22000, 6000...
## $ Amount.Funded.By.Investors <chr> "5000", "18000", "7200", "7200", "22...
## $ Loan.Length <chr> "60 months", "60 months", "60 months...
## $ Loan.Purpose <chr> "moving", "debt_consolidation", "deb...
## $ Debt.To.Income.Ratio <chr> "12.59%", "4.93%", "25.16%", "17.27%...
## $ State <chr> "NY", "CA", "LA", "NY", "MI", "NJ", ...
## $ Home.Ownership <chr> "RENT", "RENT", "MORTGAGE", "MORTGAG...
## $ Monthly.Income <dbl> 4416.67, 5258.50, 3750.00, 3416.67, ...
## $ FICO.Range <chr> "690-694", "710-714", "750-754", "79...
## $ Open.CREDIT.Lines <chr> "13", "6", "13", "14", "9", ".", "11...
## $ Revolving.CREDIT.Balance <int> 7686, 11596, 7283, 4838, 20181, 1442...
## $ Inquiries.in.the.Last.6.Months <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, ...
## $ Employment.Length <chr> "< 1 year", "10+ years", "6 years", ...
Variable names are self explanatory as to what they represent.
We’ll combine our two datasets so that we do not need to prepare data separately for them. And we’ll also avoid problem of dealing with different columns in different datasets.
However before combining them, we’ll need to add response column to test because number of columns need to be same for two datasets to stack vertically.
ld_test$Interest.Rate=NA
ld_train$data='train'
ld_test$data='test'
ld_all=rbind(ld_train,ld_test)
Now we’ll start with data prep on ld_all.
When you look at the data set , you’ll find out Interest.Rate , Debt.To.Income.Ratio have been imported as characters due to “%” sign.
Open.Credit.Lines and some othe supposedly numeric variables have been converted to character becuase of some character value present possibly. Lets convert them to numbers first.
ld_all=ld_all %>%
mutate(Interest.Rate=as.numeric(gsub("%","",Interest.Rate)) ,
Debt.To.Income.Ratio=as.numeric(gsub("%","",Debt.To.Income.Ratio)) ,
Open.CREDIT.Lines=as.numeric(Open.CREDIT.Lines) ,
Amount.Requested=as.numeric(Amount.Requested) ,
Revolving.CREDIT.Balance=as.numeric(Revolving.CREDIT.Balance)
)
glimpse(ld_all)
## Observations: 2,500
## Variables: 16
## $ ID <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested <dbl> 25000, 19750, 2100, 28000, 24250, 54...
## $ Amount.Funded.By.Investors <chr> "25000", "19750", "2100", "28000", "...
## $ Interest.Rate <dbl> 18.49, 17.27, 14.33, 16.29, 12.23, 8...
## $ Loan.Length <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio <dbl> 27.56, 13.39, 3.50, 19.62, 23.79, 6....
## $ State <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ FICO.Range <chr> "720-724", "710-714", "690-694", "71...
## $ Open.CREDIT.Lines <dbl> 11, 14, 13, 12, 6, 2, 5, 11, 24, 7, ...
## $ Revolving.CREDIT.Balance <dbl> 15210, 19070, 893, 38194, 31061, 90,...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length <chr> "5 years", "4 years", "< 1 year", "1...
## $ data <chr> "train", "train", "train", "train", ...
Although fico_range is recorded as categorical variable in terms of ranges , we can convert it to a numeric variable by assigning value as average of the range. lets do that.
ld_all= ld_all %>%
mutate(f1=as.numeric(substr(FICO.Range,1,3)),
f2=as.numeric(substr(FICO.Range,5,7)),
fico=0.5*(f1+f2)
) %>%
select(-FICO.Range,-f1,-f2)
glimpse(ld_all)
## Observations: 2,500
## Variables: 15
## $ ID <int> 79542, 75473, 67265, 80167, 17240, 3...
## $ Amount.Requested <dbl> 25000, 19750, 2100, 28000, 24250, 54...
## $ Interest.Rate <dbl> 18.49, 17.27, 14.33, 16.29, 12.23, 8...
## $ Loan.Length <chr> "60 months", "60 months", "36 months...
## $ Loan.Purpose <chr> "debt_consolidation", "debt_consolid...
## $ Debt.To.Income.Ratio <dbl> 27.56, 13.39, 3.50, 19.62, 23.79, 6....
## $ State <chr> "VA", "NY", "LA", "NV", "OH", "OH", ...
## $ Home.Ownership <chr> "MORTGAGE", "MORTGAGE", "OWN", "MORT...
## $ Monthly.Income <dbl> 8606.56, 6737.50, 1000.00, 7083.33, ...
## $ Open.CREDIT.Lines <dbl> 11, 14, 13, 12, 6, 2, 5, 11, 24, 7, ...
## $ Revolving.CREDIT.Balance <dbl> 15210, 19070, 893, 38194, 31061, 90,...
## $ Inquiries.in.the.Last.6.Months <int> 3, 3, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, ...
## $ Employment.Length <chr> "5 years", "4 years", "< 1 year", "1...
## $ data <chr> "train", "train", "train", "train", ...
## $ fico <dbl> 722, 712, 692, 712, 732, 787, 667, 7...
In a similar fashion we can convert employment length to numbers as well.
ld_all=ld_all %>%
mutate(el=ifelse(substr(Employment.Length,1,2)=="10",10,Employment.Length),
el=ifelse(substr(Employment.Length,1,1)=="<",0,el),
el=gsub("years","",el),
el=gsub("year","",el),
el=as.numeric(el)
) %>%
select(-Employment.Length)
Lets take a pause at this point and think over the decision of how we processed employment length here. Converting this to numbers using the logic above was a subjective decision. We can’t really say that this is what we do in every case. We might have as well went on treating it as categorical variable and created dummy variables for it. We might have used some other logic to convert it to numeric by assigning some other numeric values. Each of these decisions would have resulted in a different model. Lets say we build models m1, m2, m3 based on all these decision separately. You can select one of these models to go forward with , based on performance on the validation data. In this text however , we’ll be focusing on just one iteration and in turn developing just one model.
CreateDummies=function(data,var,freq_cutoff=0){
t=table(data[,var])
t=t[t>freq_cutoff]
t=sort(t)
categories=names(t)[-1]
for( cat in categories){
name=paste(var,cat,sep="_")
name=gsub(" ","",name)
name=gsub("-","_",name)
name=gsub("\\?","Q",name)
name=gsub("<","LT_",name)
name=gsub("\\+","",name)
data[,name]=as.numeric(data[,var]==cat)
}
data[,var]=NULL
return(data)
}
Let me explain the function if you havent come across this before
t=table(data[,var]) this bit creates a frequency table for the given categorical column. t here is now simply a table which contains names as categories of the categorical variable and their frequency in the data.
t=t[t>freq_cutoff] this line of code removes those categories from the table which have frequencies below the frequency cutoff. ( this is a subjective choice)
‘t=sort(t)’ this line simple sorts the remaining table in ascending order
categories=names(t)[-1] since we sorted the table in ascending manner in the previous line, first category here has least count. In this line of code we are taking out the category names except the first one ( which has least count), thus making n-1 dummies from the remaining categories.
name=paste(var,cat,sep=“_“) all the dummy vars that we intend to make, need to have some name. this line of code creates that name by concatenating variable name with category name with an _.
name=gsub(" “,”“,name) subsequent lines like these using gsub are essentially cleaning up the name thats all. Since we dont have any control over what the categories can be, we are removing special characters and spaces in the code in an automated fashion.
data[,name]=as.numeric(data[,var]==cat) once we have a cleaned up name, this line creates the dummy var for that particular category.
data[,var]=NULL once we are done creating dummies for the variable using for loop. Variable is removed from the data in this line.
Consider a hypothetical example where a categorical variable takes 5 distinct values a ,b ,c, d, e. Among these e occurs very few times. We do not make 5 dummy variable because of this relation ship:
cat_a + cat_b + cat_c + cat_d + cat_e = 1
Which implies that if we make 5 dummy variables , any one taken at a time will be perfect linear combination of the rest which is undesirable as discussed previously.
So we will go ahead and take e as base category , this being least frequent and make 4 dummy variables. But e was a rare category. Meaning dummy variable for e , if made would have been equal to 1 rarely, which means that rest 4 dummies will still add up to exactly one for most part of the data , hence the high VIF values. We’ll that in our results also.
Next we look at variable Loan.Purpose
table(ld_all$Loan.Purpose)
##
## car credit_card debt_consolidation educational
## 50 444 1307 15
## home_improvement house major_purchase medical
## 152 20 101 30
## moving other renewable_energy small_business
## 29 200 4 87
## vacation wedding
## 21 39
It has too many categories. There is no direct harm in considering to create dummy variables for n-1 for them. That can be one iteration to try. Here we are going to combine categories on the basis of average response , this will bring down number of categories and we can make dummies then.
round(tapply(ld_all$Interest.Rate,ld_all$Loan.Purpose,mean,na.rm=T))
## car credit_card debt_consolidation educational
## 11 13 14 10
## home_improvement house major_purchase medical
## 12 14 11 11
## moving other renewable_energy small_business
## 14 13 8 13
## vacation wedding
## 12 12
we’ll combine categories into new one which have similar reponse rate as per the table obtained abvoe and make dummies for them as well while we are at it.
ld_all=ld_all %>%
mutate(lp_10=as.numeric(Loan.Purpose=='educational'),
lp_11=as.numeric(Loan.Purpose %in% c("major_purchase","medical","car")),
lp_12=as.numeric(Loan.Purpose %in% c("vacation","wedding","home_improvement")),
lp_13=as.numeric(Loan.Purpose %in% c("other","small_business","credit_card")),
lp_14=as.numeric(Loan.Purpose %in% c("debt_consolidation","house","moving"))) %>%
select(-Loan.Purpose)
Next we will make dummy vars for remaining categorical variables.
for(col in c("Loan.Length","State","Home.Ownership"))
{
ld_all=CreateDummies(ld_all,col,100)
}
lapply(ld_all,function(x) sum(is.na(x)))
## $ID
## [1] 1
##
## $Amount.Requested
## [1] 5
##
## $Interest.Rate
## [1] 300
##
## $Debt.To.Income.Ratio
## [1] 1
##
## $Monthly.Income
## [1] 3
##
## $Open.CREDIT.Lines
## [1] 9
##
## $Revolving.CREDIT.Balance
## [1] 5
##
## $Inquiries.in.the.Last.6.Months
## [1] 3
##
## $data
## [1] 0
##
## $fico
## [1] 0
##
## $el
## [1] 80
##
## $lp_10
## [1] 1
##
## $lp_11
## [1] 0
##
## $lp_12
## [1] 0
##
## $lp_13
## [1] 0
##
## $lp_14
## [1] 0
##
## $Loan.Length_36months
## [1] 1
##
## $State_FL
## [1] 1
##
## $State_TX
## [1] 1
##
## $State_NY
## [1] 1
##
## $State_CA
## [1] 1
##
## $Home.Ownership_RENT
## [1] 1
##
## $Home.Ownership_MORTGAGE
## [1] 1
It doesnt make sense to keep an observation with a missing ID, we’ll filter that.
ld_all=ld_all[!(is.na(ld_all$ID)),]
for(col in names(ld_all)){
if(sum(is.na(ld_all[,col]))>0 & !(col %in% c("ID","data","Interest.Rate"))){
ld_all[is.na(ld_all[,col]),col]=mean(ld_all[ld_all$data=="train",col],na.rm=T)
}
}
Notice that in missing values imputation also, only training data obs are used.
Now we are done with preparing data , lets separate the data.
ld_train=ld_all %>% filter(data=='train') %>% select(-data)
ld_test=ld_all %>% filter(data=='test') %>% select(-data,-Interest.Rate)
we can very well build the model on entire training data if we are not interested in looking at the performance of the model . However here we’d like to know how our model might tentatively perform on test/production data ( it wont be an exact match ever, and in some cases realties might be very different). To do this, we’ll break our data into two parts.
set.seed(2)
s=sample(1:nrow(ld_train),0.7*nrow(ld_train))
ld_train1=ld_train[s,]
ld_train2=ld_train[-s,]
Now we’ll start with our modeling process. We’ll be using function lm for building our linear regression model. First argument to this function is the modeling equation which we are trying to model. second argument is the dataset.
glimpse(ld_train1)
## Observations: 1,539
## Variables: 22
## $ ID <int> 34331, 92046, 8382, 47097, 62900, 29...
## $ Amount.Requested <dbl> 3300, 32000, 6250, 7200, 6000, 6000,...
## $ Interest.Rate <dbl> 11.71, 17.27, 10.25, 7.90, 6.03, 11....
## $ Debt.To.Income.Ratio <dbl> 4.16, 15.61, 18.04, 12.13, 5.65, 19....
## $ Monthly.Income <dbl> 3730.00, 9250.00, 2400.00, 2333.33, ...
## $ Open.CREDIT.Lines <dbl> 7, 15, 5, 7, 15, 6, 13, 13, 9, 16, 1...
## $ Revolving.CREDIT.Balance <dbl> 3675, 32566, 14863, 6262, 5747, 4298...
## $ Inquiries.in.the.Last.6.Months <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 0, 3, 0, 2, ...
## $ fico <dbl> 702, 702, 722, 727, 752, 727, 667, 7...
## $ el <dbl> 1.000000, 2.000000, 10.000000, 2.000...
## $ lp_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ lp_11 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ lp_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ lp_13 <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ lp_14 <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, ...
## $ Loan.Length_36months <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, ...
## $ State_FL <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ State_TX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ State_NY <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ State_CA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Home.Ownership_RENT <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, ...
## $ Home.Ownership_MORTGAGE <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, ...
You can see that all the variables in data are numeric , lets start now. It doesnt make sense to include variable ID in the modelling process. Its just a id number for the transactions which should be ignored.
fit=lm(Interest.Rate~. -ID,data=ld_train1)
Before we go start looking at model summary and p-value for the variables , we need to drop variables from the model which have high VIF. Remember that VIF is a mutual thing, so if you drop one variable, it might happen that VIF value for several other variables will go down. For this reason, always drop variables for high VIF , one by one, never in a chunk. For looking at VIF values for the variables , we’ll use function vif from package car.
vif(fit)
## Amount.Requested Debt.To.Income.Ratio
## 1.777267 1.390000
## Monthly.Income Open.CREDIT.Lines
## 1.684595 1.358993
## Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months
## 1.322998 1.049780
## fico el
## 1.167498 1.103588
## lp_10 lp_11
## 3.517704 26.712734
## lp_12 lp_13
## 32.261817 81.984659
## lp_14 Loan.Length_36months
## 97.509584 1.278274
## State_FL State_TX
## 1.055457 1.058533
## State_NY State_CA
## 1.109780 1.110674
## Home.Ownership_RENT Home.Ownership_MORTGAGE
## 3.623192 3.686325
It will be easier to identify high VIF vars if we sort the results and say look at top 3 only
sort(vif(fit),decreasing = T)[1:3]
## lp_14 lp_13 lp_12
## 97.50958 81.98466 32.26182
We’ll drop highest vif var and run the model again.
fit=lm(Interest.Rate~. -ID - lp_14,data=ld_train)
sort(vif(fit),decreasing = T)[1:3]
## Home.Ownership_MORTGAGE Home.Ownership_RENT Amount.Requested
## 3.735010 3.650675 1.587393
All VIF values are under control.Now we look at the model summary , specifically p-values asscociated with the variables. We can drop variables with high p-values [ >0.05] one by one or we can use step function which drops vars based on AIC score one by one. Although the methodology is different but end result is generally similar due to both of them targetting vars which do not contribute towards explaning not very well.
fit=step(fit)
## Start: AIC=3203.34
## Interest.Rate ~ (ID + Amount.Requested + Debt.To.Income.Ratio +
## Monthly.Income + Open.CREDIT.Lines + Revolving.CREDIT.Balance +
## Inquiries.in.the.Last.6.Months + fico + el + lp_10 + lp_11 +
## lp_12 + lp_13 + lp_14 + Loan.Length_36months + State_FL +
## State_TX + State_NY + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE) -
## ID - lp_14
##
## Df Sum of Sq RSS AIC
## - lp_12 1 0.0 9267.7 3201.3
## - State_NY 1 0.2 9267.9 3201.4
## - State_CA 1 0.4 9268.1 3201.4
## - Debt.To.Income.Ratio 1 0.8 9268.5 3201.5
## - State_FL 1 1.3 9269.0 3201.6
## - lp_11 1 2.2 9269.9 3201.9
## - lp_10 1 2.6 9270.3 3202.0
## - Home.Ownership_RENT 1 3.7 9271.5 3202.2
## - Revolving.CREDIT.Balance 1 5.6 9273.3 3202.7
## - el 1 6.5 9274.2 3202.9
## - lp_13 1 8.4 9276.1 3203.3
## <none> 9267.7 3203.3
## - Monthly.Income 1 11.0 9278.7 3203.9
## - Home.Ownership_MORTGAGE 1 23.8 9291.5 3207.0
## - State_TX 1 51.4 9319.1 3213.5
## - Open.CREDIT.Lines 1 55.4 9323.1 3214.4
## - Inquiries.in.the.Last.6.Months 1 383.3 9651.1 3290.5
## - Amount.Requested 1 2006.0 11273.7 3632.2
## - Loan.Length_36months 1 3129.4 12397.1 3841.1
## - fico 1 17558.2 26826.0 5538.5
##
## Step: AIC=3201.34
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months +
## State_FL + State_TX + State_NY + State_CA + Home.Ownership_RENT +
## Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - State_NY 1 0.2 9267.9 3199.4
## - State_CA 1 0.4 9268.1 3199.4
## - Debt.To.Income.Ratio 1 0.8 9268.5 3199.5
## - State_FL 1 1.3 9269.0 3199.6
## - lp_11 1 2.3 9270.0 3199.9
## - lp_10 1 2.6 9270.4 3200.0
## - Home.Ownership_RENT 1 3.7 9271.5 3200.2
## - Revolving.CREDIT.Balance 1 5.6 9273.3 3200.7
## - el 1 6.5 9274.2 3200.9
## <none> 9267.7 3201.3
## - lp_13 1 8.9 9276.6 3201.4
## - Monthly.Income 1 11.0 9278.8 3202.0
## - Home.Ownership_MORTGAGE 1 23.8 9291.5 3205.0
## - State_TX 1 51.5 9319.2 3211.5
## - Open.CREDIT.Lines 1 55.4 9323.2 3212.5
## - Inquiries.in.the.Last.6.Months 1 383.5 9651.3 3288.5
## - Amount.Requested 1 2019.3 11287.0 3632.8
## - Loan.Length_36months 1 3132.3 12400.0 3839.6
## - fico 1 17802.8 27070.5 5556.5
##
## Step: AIC=3199.38
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months +
## State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - State_CA 1 0.5 9268.4 3197.5
## - Debt.To.Income.Ratio 1 0.8 9268.7 3197.6
## - State_FL 1 1.2 9269.1 3197.7
## - lp_11 1 2.3 9270.2 3197.9
## - lp_10 1 2.6 9270.6 3198.0
## - Home.Ownership_RENT 1 3.8 9271.7 3198.3
## - Revolving.CREDIT.Balance 1 5.6 9273.5 3198.7
## - el 1 6.6 9274.5 3198.9
## <none> 9267.9 3199.4
## - lp_13 1 8.9 9276.8 3199.5
## - Monthly.Income 1 10.9 9278.8 3200.0
## - Home.Ownership_MORTGAGE 1 24.7 9292.6 3203.2
## - State_TX 1 51.5 9319.4 3209.6
## - Open.CREDIT.Lines 1 55.3 9323.2 3210.5
## - Inquiries.in.the.Last.6.Months 1 383.4 9651.3 3286.5
## - Amount.Requested 1 2020.1 11288.0 3631.0
## - Loan.Length_36months 1 3137.7 12405.6 3838.6
## - fico 1 17802.6 27070.5 5554.5
##
## Step: AIC=3197.51
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months +
## State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Debt.To.Income.Ratio 1 0.8 9269.2 3195.7
## - State_FL 1 1.4 9269.8 3195.8
## - lp_11 1 2.2 9270.7 3196.0
## - lp_10 1 2.6 9271.0 3196.1
## - Home.Ownership_RENT 1 4.1 9272.5 3196.5
## - Revolving.CREDIT.Balance 1 5.7 9274.2 3196.9
## - el 1 6.5 9275.0 3197.1
## <none> 9268.4 3197.5
## - lp_13 1 9.0 9277.4 3197.6
## - Monthly.Income 1 10.8 9279.3 3198.1
## - Home.Ownership_MORTGAGE 1 24.7 9293.1 3201.4
## - State_TX 1 53.5 9322.0 3208.2
## - Open.CREDIT.Lines 1 55.4 9323.8 3208.6
## - Inquiries.in.the.Last.6.Months 1 385.2 9653.7 3285.1
## - Amount.Requested 1 2019.9 11288.3 3629.0
## - Loan.Length_36months 1 3147.0 12415.4 3838.3
## - fico 1 17802.4 27070.9 5552.5
##
## Step: AIC=3195.69
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months +
## State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - State_FL 1 1.4 9270.6 3194.0
## - lp_11 1 2.4 9271.6 3194.3
## - lp_10 1 2.8 9272.0 3194.4
## - Home.Ownership_RENT 1 4.0 9273.3 3194.7
## - el 1 6.4 9275.6 3195.2
## - Revolving.CREDIT.Balance 1 6.7 9275.9 3195.3
## <none> 9269.2 3195.7
## - lp_13 1 9.1 9278.3 3195.8
## - Monthly.Income 1 10.1 9279.3 3196.1
## - Home.Ownership_MORTGAGE 1 24.5 9293.7 3199.5
## - State_TX 1 52.9 9322.1 3206.2
## - Open.CREDIT.Lines 1 68.4 9337.6 3209.8
## - Inquiries.in.the.Last.6.Months 1 386.9 9656.1 3283.6
## - Amount.Requested 1 2023.6 11292.8 3627.9
## - Loan.Length_36months 1 3146.9 12416.1 3836.5
## - fico 1 18301.5 27570.7 5590.7
##
## Step: AIC=3194.03
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_13 + Loan.Length_36months +
## State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_11 1 2.4 9273.1 3192.6
## - lp_10 1 3.0 9273.6 3192.7
## - Home.Ownership_RENT 1 4.3 9274.9 3193.0
## - el 1 6.3 9276.9 3193.5
## - Revolving.CREDIT.Balance 1 6.4 9277.0 3193.5
## <none> 9270.6 3194.0
## - lp_13 1 9.2 9279.8 3194.2
## - Monthly.Income 1 10.4 9281.0 3194.5
## - Home.Ownership_MORTGAGE 1 24.8 9295.4 3197.9
## - State_TX 1 51.9 9322.5 3204.3
## - Open.CREDIT.Lines 1 68.6 9339.2 3208.2
## - Inquiries.in.the.Last.6.Months 1 386.2 9656.8 3281.8
## - Amount.Requested 1 2024.0 11294.7 3626.3
## - Loan.Length_36months 1 3145.7 12416.3 3834.5
## - fico 1 18317.4 27588.1 5590.1
##
## Step: AIC=3192.61
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_13 + Loan.Length_36months + State_TX +
## Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_10 1 2.7 9275.8 3191.2
## - Home.Ownership_RENT 1 4.4 9277.5 3191.7
## - el 1 6.0 9279.0 3192.0
## - Revolving.CREDIT.Balance 1 6.6 9279.6 3192.2
## - lp_13 1 7.8 9280.9 3192.5
## <none> 9273.1 3192.6
## - Monthly.Income 1 9.5 9282.6 3192.9
## - Home.Ownership_MORTGAGE 1 25.5 9298.5 3196.6
## - State_TX 1 52.6 9325.7 3203.1
## - Open.CREDIT.Lines 1 69.3 9342.3 3207.0
## - Inquiries.in.the.Last.6.Months 1 385.3 9658.3 3280.1
## - Amount.Requested 1 2073.0 11346.1 3634.3
## - Loan.Length_36months 1 3181.3 12454.4 3839.2
## - fico 1 18654.6 27927.7 5615.0
##
## Step: AIC=3191.25
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_RENT +
## Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Home.Ownership_RENT 1 4.6 9280.4 3190.3
## - el 1 5.8 9281.6 3190.6
## - Revolving.CREDIT.Balance 1 6.6 9282.4 3190.8
## - lp_13 1 7.3 9283.1 3191.0
## <none> 9275.8 3191.2
## - Monthly.Income 1 9.6 9285.4 3191.5
## - Home.Ownership_MORTGAGE 1 26.1 9301.8 3195.4
## - State_TX 1 52.1 9327.9 3201.6
## - Open.CREDIT.Lines 1 69.3 9345.1 3205.6
## - Inquiries.in.the.Last.6.Months 1 391.4 9667.2 3280.1
## - Amount.Requested 1 2070.9 11346.7 3632.4
## - Loan.Length_36months 1 3179.0 12454.8 3837.3
## - fico 1 18680.3 27956.1 5615.3
##
## Step: AIC=3190.35
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - el 1 6.2 9286.6 3189.8
## - Revolving.CREDIT.Balance 1 6.2 9286.7 3189.8
## - lp_13 1 7.3 9287.8 3190.1
## <none> 9280.4 3190.3
## - Monthly.Income 1 9.9 9290.3 3190.7
## - Home.Ownership_MORTGAGE 1 34.9 9315.3 3196.6
## - State_TX 1 53.1 9333.5 3200.9
## - Open.CREDIT.Lines 1 68.4 9348.9 3204.5
## - Inquiries.in.the.Last.6.Months 1 392.0 9672.4 3279.3
## - Amount.Requested 1 2069.3 11349.7 3631.0
## - Loan.Length_36months 1 3175.8 12456.2 3835.5
## - fico 1 18727.5 28007.9 5617.3
##
## Step: AIC=3189.81
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + lp_13 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Revolving.CREDIT.Balance 1 5.7 9292.3 3189.2
## - lp_13 1 7.5 9294.1 3189.6
## <none> 9286.6 3189.8
## - Monthly.Income 1 9.9 9296.5 3190.2
## - Home.Ownership_MORTGAGE 1 30.3 9316.9 3195.0
## - State_TX 1 51.2 9337.8 3199.9
## - Open.CREDIT.Lines 1 68.1 9354.7 3203.9
## - Inquiries.in.the.Last.6.Months 1 388.5 9675.1 3277.9
## - Amount.Requested 1 2107.7 11394.3 3637.6
## - Loan.Length_36months 1 3176.8 12463.4 3834.8
## - fico 1 18751.9 28038.5 5617.7
##
## Step: AIC=3189.16
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + lp_13 + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_13 1 6.4 9298.7 3188.7
## <none> 9292.3 3189.2
## - Monthly.Income 1 14.9 9307.2 3190.7
## - Home.Ownership_MORTGAGE 1 33.2 9325.5 3195.0
## - State_TX 1 52.5 9344.8 3199.6
## - Open.CREDIT.Lines 1 80.1 9372.3 3206.0
## - Inquiries.in.the.Last.6.Months 1 389.8 9682.1 3277.5
## - Amount.Requested 1 2119.6 11411.9 3639.0
## - Loan.Length_36months 1 3191.2 12483.5 3836.3
## - fico 1 18755.2 28047.4 5616.4
##
## Step: AIC=3188.68
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## <none> 9298.7 3188.7
## - Monthly.Income 1 15.2 9313.9 3190.3
## - Home.Ownership_MORTGAGE 1 34.2 9333.0 3194.8
## - State_TX 1 53.1 9351.8 3199.2
## - Open.CREDIT.Lines 1 78.9 9377.6 3205.3
## - Inquiries.in.the.Last.6.Months 1 388.4 9687.1 3276.7
## - Amount.Requested 1 2113.3 11412.1 3637.0
## - Loan.Length_36months 1 3184.8 12483.6 3834.4
## - fico 1 18803.2 28101.9 5618.7
lets check model summary to see what all variable remain in the model.
summary(fit)
##
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Monthly.Income +
## Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico +
## Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE,
## data = ld_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5280 -1.3681 -0.2007 1.2432 10.0654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.466e+01 9.250e-01 80.718 < 2e-16 ***
## Amount.Requested 1.509e-04 6.764e-06 22.310 < 2e-16 ***
## Monthly.Income -2.275e-05 1.203e-05 -1.892 0.058675 .
## Open.CREDIT.Lines -4.414e-02 1.024e-02 -4.311 1.7e-05 ***
## Inquiries.in.the.Last.6.Months 3.504e-01 3.664e-02 9.564 < 2e-16 ***
## fico -8.562e-02 1.287e-03 -66.547 < 2e-16 ***
## Loan.Length_36months -3.233e+00 1.180e-01 -27.388 < 2e-16 ***
## State_TX 6.253e-01 1.769e-01 3.536 0.000415 ***
## Home.Ownership_MORTGAGE -2.623e-01 9.238e-02 -2.839 0.004566 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 2190 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7562
## F-statistic: 853.4 on 8 and 2190 DF, p-value: < 2.2e-16
It turns out there is still one var for which p-value is higher than .05 . we can manually iclude all other vars of the model or extract the exact formula using formula function.
formula(fit)
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
We can drop variable Monthly.Income from this.
fit=lm(Interest.Rate ~ Amount.Requested + Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE,data=ld_train1)
summary(fit)
##
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE, data = ld_train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6999 -1.3742 -0.2214 1.1553 10.1446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.474e+01 1.104e+00 67.690 < 2e-16 ***
## Amount.Requested 1.489e-04 7.630e-06 19.520 < 2e-16 ***
## Open.CREDIT.Lines -4.902e-02 1.224e-02 -4.006 6.48e-05 ***
## Inquiries.in.the.Last.6.Months 3.216e-01 4.180e-02 7.696 2.51e-14 ***
## fico -8.583e-02 1.544e-03 -55.595 < 2e-16 ***
## Loan.Length_36months -3.192e+00 1.412e-01 -22.605 < 2e-16 ***
## State_TX 5.441e-01 2.084e-01 2.611 0.00912 **
## Home.Ownership_MORTGAGE -2.621e-01 1.099e-01 -2.385 0.01719 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 1531 degrees of freedom
## Multiple R-squared: 0.7596, Adjusted R-squared: 0.7585
## F-statistic: 691 on 7 and 1531 DF, p-value: < 2.2e-16
This is our final model with all variables being significant [ p-value < 0.05]. How good is this model. Adj R-squared value for the model is 0.7585 which means that this models explains 75.85% variation in our response. You can quickly check this visually too by plotting predicted values and the original response.
ld_train1 %>%
mutate(pred_IR=predict(fit,newdata=ld_train1)) %>%
ggplot(aes(x=Interest.Rate,y=pred_IR))+geom_point(alpha=0.6)
You can see that there is good overlap. Our model equation from the model is :[taking down values from estimates from summary(fit)]. We are going to use collapse property of paste function. The result will need little modification but it saves us from noting down each value and writing prediction equation manually.
model_string=paste(fit$coefficients,names(fit$coefficients),sep="*",collapse = " + ")
model_eq=strwrap(sub("\\*\\(Intercept\\)","",gsub("+ -","- ",model_string,fixed=TRUE)))
model_eq
## [1] "74.7429652067777 + 0.000148930791905846*Amount.Requested -"
## [2] "0.0490239121960481*Open.CREDIT.Lines +"
## [3] "0.321643365683685*Inquiries.in.the.Last.6.Months -"
## [4] "0.0858327626569249*fico - 3.19157260176938*Loan.Length_36months +"
## [5] "0.544104302623496*State_TX - 0.262061920444776*Home.Ownership_MORTGAGE"
Next we look at different diagnostic plots to see if any of our linear regression assumptions are being violated and some other information regarding outliers.
plot(fit,which=1)
Ideally the red line the plot should be horizontal or close to horizontal . If it shows a significant pattern , it means we might have missed non-linear components in the model. If that is the case you might want to look at pair wise relation ship between response and each predictor and figure out if any of the predictors need to be transformed to make the relationship linear between response and that predictor. [Although a better alternative will be to go ahead and to use an algorithm which can extract non linear relationship. e.g. RandomForest , BoostinMachines etc]
plot(fit,which=2)
This tells you whether you residuals are normal or not.
df=data.frame(res=fit$residual)
ggplot(df,aes(x=res))+geom_density(color="red")+stat_function(fun=dnorm ,args = list(mean=mean(df$res),sd=sd(df$res)),color="green")
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.97713, p-value = 6.184e-15
This tells you that there is significant deviation from normality in your residual which is confirmed by shapiro wilk test as well. This essentially means that you should not use p-values to drop vars as they rely on the assumption of normality. [ We should have only used AIC score to drop vars]
plot(fit,which=3)
This plot tells us whether our residuals are homoscedastic or not. If this plot has a pattern then that mean residuals are heteroscedastic. If that is the case which is not in our case , you can again log transform response and rebuild your model. Why log transform of response in these deviations of residuals? Becuase it brings down scale of response and in turn for residuals. Effect of scale coming down is differences being less prominent.
plot(fit,which=4)
This gives us cook’s distance for all the observations. If cook’s distance for any observation is greater than 1 , you should consider removing that observation from your modeling process and rebuild the model. In our case; cook’s distance is less than 1 for all the observations.[ dont worry about high cook’s distance value for any observation if its less than 1]
Cook’s distance basically tells how much leverage [effect] one observation has on your model equation. High value of cook’s distance [>1] hints at that observation being very different from rest of the data or in other words, that observation being an outlier.
Done with the diagnostic plots.But we havent yet done the crucial part of testing here. Lets look at rmse value for validation data
rmse= mean((ld_train2$Interest.Rate-predict(fit,newdata=ld_train2))**2) %>%
sqrt()
rmse
## [1] 2.065264
We see that model performance RMSE on validation data comes out to be 2.06. We can check performance of other models with this and then choose the best one having least RMSE.
Now that we know about the tentative performance of the model , we’ll go ahead and build the model on the entire training data and use that model to predict result on the test/production data.
fit.final=fit=lm(Interest.Rate ~ .-ID, data=ld_train)
fit.final=step(fit.final)
## Start: AIC=3204.82
## Interest.Rate ~ (ID + Amount.Requested + Debt.To.Income.Ratio +
## Monthly.Income + Open.CREDIT.Lines + Revolving.CREDIT.Balance +
## Inquiries.in.the.Last.6.Months + fico + el + lp_10 + lp_11 +
## lp_12 + lp_13 + lp_14 + Loan.Length_36months + State_FL +
## State_TX + State_NY + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE) -
## ID
##
## Df Sum of Sq RSS AIC
## - State_NY 1 0.2 9265.7 3202.9
## - lp_10 1 0.3 9265.8 3202.9
## - State_CA 1 0.3 9265.9 3202.9
## - Debt.To.Income.Ratio 1 0.8 9266.3 3203.0
## - State_FL 1 1.3 9266.9 3203.1
## - lp_13 1 1.4 9266.9 3203.2
## - lp_11 1 1.4 9267.0 3203.2
## - lp_12 1 2.2 9267.7 3203.3
## - lp_14 1 2.2 9267.7 3203.3
## - Home.Ownership_RENT 1 3.8 9269.3 3203.7
## - Revolving.CREDIT.Balance 1 5.6 9271.1 3204.2
## - el 1 6.6 9272.1 3204.4
## <none> 9265.5 3204.8
## - Monthly.Income 1 10.9 9276.4 3205.4
## - Home.Ownership_MORTGAGE 1 24.0 9289.5 3208.5
## - State_TX 1 51.7 9317.3 3215.1
## - Open.CREDIT.Lines 1 54.5 9320.1 3215.7
## - Inquiries.in.the.Last.6.Months 1 383.3 9648.8 3292.0
## - Amount.Requested 1 2008.1 11273.7 3634.2
## - Loan.Length_36months 1 3130.7 12396.3 3842.9
## - fico 1 17547.1 26812.7 5539.4
##
## Step: AIC=3202.86
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_10 + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_10 1 0.3 9266.0 3200.9
## - State_CA 1 0.5 9266.2 3201.0
## - Debt.To.Income.Ratio 1 0.8 9266.6 3201.1
## - State_FL 1 1.2 9267.0 3201.2
## - lp_13 1 1.4 9267.1 3201.2
## - lp_11 1 1.4 9267.2 3201.2
## - lp_12 1 2.1 9267.9 3201.4
## - lp_14 1 2.2 9267.9 3201.4
## - Home.Ownership_RENT 1 3.8 9269.6 3201.8
## - Revolving.CREDIT.Balance 1 5.6 9271.3 3202.2
## - el 1 6.7 9272.4 3202.4
## <none> 9265.7 3202.9
## - Monthly.Income 1 10.7 9276.5 3203.4
## - Home.Ownership_MORTGAGE 1 24.9 9290.6 3206.8
## - State_TX 1 51.7 9317.4 3213.1
## - Open.CREDIT.Lines 1 54.3 9320.1 3213.7
## - Inquiries.in.the.Last.6.Months 1 383.1 9648.8 3290.0
## - Amount.Requested 1 2008.6 11274.3 3632.3
## - Loan.Length_36months 1 3135.8 12401.5 3841.9
## - fico 1 17547.1 26812.8 5537.4
##
## Step: AIC=3200.93
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_FL + State_TX + State_CA + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - State_CA 1 0.5 9266.5 3199.0
## - Debt.To.Income.Ratio 1 0.8 9266.8 3199.1
## - State_FL 1 1.2 9267.2 3199.2
## - lp_11 1 2.3 9268.3 3199.5
## - lp_13 1 2.3 9268.3 3199.5
## - Home.Ownership_RENT 1 3.8 9269.8 3199.8
## - lp_12 1 4.2 9270.2 3199.9
## - lp_14 1 4.6 9270.5 3200.0
## - Revolving.CREDIT.Balance 1 5.6 9271.6 3200.3
## - el 1 6.7 9272.7 3200.5
## <none> 9266.0 3200.9
## - Monthly.Income 1 10.8 9276.8 3201.5
## - Home.Ownership_MORTGAGE 1 24.7 9290.7 3204.8
## - State_TX 1 51.7 9317.7 3211.2
## - Open.CREDIT.Lines 1 54.7 9320.7 3211.9
## - Inquiries.in.the.Last.6.Months 1 382.8 9648.8 3288.0
## - Amount.Requested 1 2009.1 11275.1 3630.5
## - Loan.Length_36months 1 3135.6 12401.6 3839.9
## - fico 1 17547.1 26813.1 5535.4
##
## Step: AIC=3199.05
## Interest.Rate ~ Amount.Requested + Debt.To.Income.Ratio + Monthly.Income +
## Open.CREDIT.Lines + Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Debt.To.Income.Ratio 1 0.8 9267.2 3197.2
## - State_FL 1 1.4 9267.9 3197.4
## - lp_11 1 2.3 9268.8 3197.6
## - lp_13 1 2.3 9268.8 3197.6
## - Home.Ownership_RENT 1 4.1 9270.5 3198.0
## - lp_12 1 4.2 9270.7 3198.0
## - lp_14 1 4.5 9271.0 3198.1
## - Revolving.CREDIT.Balance 1 5.8 9272.2 3198.4
## - el 1 6.6 9273.1 3198.6
## <none> 9266.5 3199.0
## - Monthly.Income 1 10.7 9277.2 3199.6
## - Home.Ownership_MORTGAGE 1 24.8 9291.3 3202.9
## - State_TX 1 53.7 9320.2 3209.7
## - Open.CREDIT.Lines 1 54.9 9321.3 3210.0
## - Inquiries.in.the.Last.6.Months 1 384.6 9651.1 3286.5
## - Amount.Requested 1 2008.7 11275.2 3628.5
## - Loan.Length_36months 1 3144.5 12411.0 3839.5
## - fico 1 17546.6 26813.1 5533.5
##
## Step: AIC=3197.23
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_FL + State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - State_FL 1 1.4 9268.7 3195.6
## - lp_11 1 2.3 9269.6 3195.8
## - lp_13 1 2.4 9269.7 3195.8
## - Home.Ownership_RENT 1 4.0 9271.3 3196.2
## - lp_12 1 4.3 9271.5 3196.2
## - lp_14 1 4.7 9272.0 3196.4
## - el 1 6.5 9273.7 3196.8
## - Revolving.CREDIT.Balance 1 6.7 9273.9 3196.8
## <none> 9267.2 3197.2
## - Monthly.Income 1 10.0 9277.2 3197.6
## - Home.Ownership_MORTGAGE 1 24.6 9291.8 3201.0
## - State_TX 1 53.1 9320.3 3207.8
## - Open.CREDIT.Lines 1 67.4 9334.6 3211.2
## - Inquiries.in.the.Last.6.Months 1 386.2 9653.4 3285.0
## - Amount.Requested 1 2011.4 11278.7 3627.2
## - Loan.Length_36months 1 3144.2 12411.5 3837.6
## - fico 1 17978.9 27246.2 5566.7
##
## Step: AIC=3195.56
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_11 + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_11 1 2.4 9271.1 3194.1
## - lp_13 1 2.5 9271.2 3194.2
## - Home.Ownership_RENT 1 4.3 9272.9 3194.6
## - lp_12 1 4.5 9273.1 3194.6
## - lp_14 1 4.9 9273.6 3194.7
## - el 1 6.3 9275.0 3195.1
## - Revolving.CREDIT.Balance 1 6.4 9275.0 3195.1
## <none> 9268.7 3195.6
## - Monthly.Income 1 10.3 9279.0 3196.0
## - Home.Ownership_MORTGAGE 1 24.9 9293.6 3199.5
## - State_TX 1 52.1 9320.7 3205.9
## - Open.CREDIT.Lines 1 67.6 9336.3 3209.5
## - Inquiries.in.the.Last.6.Months 1 385.5 9654.2 3283.2
## - Amount.Requested 1 2011.5 11280.2 3625.5
## - Loan.Length_36months 1 3143.0 12411.6 3835.7
## - fico 1 17990.7 27259.4 5565.8
##
## Step: AIC=3194.14
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_12 + lp_13 + lp_14 + Loan.Length_36months +
## State_TX + Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_13 1 0.1 9271.3 3192.2
## - lp_12 1 2.7 9273.8 3192.8
## - Home.Ownership_RENT 1 4.4 9275.5 3193.2
## - lp_14 1 4.6 9275.7 3193.2
## - el 1 6.3 9277.4 3193.6
## - Revolving.CREDIT.Balance 1 6.4 9277.5 3193.6
## <none> 9271.1 3194.1
## - Monthly.Income 1 10.7 9281.8 3194.7
## - Home.Ownership_MORTGAGE 1 25.1 9296.2 3198.1
## - State_TX 1 51.4 9322.5 3204.3
## - Open.CREDIT.Lines 1 67.9 9339.0 3208.2
## - Inquiries.in.the.Last.6.Months 1 390.9 9662.0 3283.0
## - Amount.Requested 1 2012.1 11283.2 3624.1
## - Loan.Length_36months 1 3141.4 12412.6 3833.8
## - fico 1 17988.4 27259.5 5563.8
##
## Step: AIC=3192.18
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_12 + lp_14 + Loan.Length_36months + State_TX +
## Home.Ownership_RENT + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_12 1 3.3 9274.6 3191.0
## - Home.Ownership_RENT 1 4.4 9275.7 3191.2
## - el 1 6.2 9277.4 3191.6
## - Revolving.CREDIT.Balance 1 6.5 9277.8 3191.7
## <none> 9271.3 3192.2
## - Monthly.Income 1 10.6 9281.8 3192.7
## - lp_14 1 11.2 9282.5 3192.8
## - Home.Ownership_MORTGAGE 1 25.2 9296.4 3196.1
## - State_TX 1 51.4 9322.7 3202.3
## - Open.CREDIT.Lines 1 68.2 9339.4 3206.3
## - Inquiries.in.the.Last.6.Months 1 391.1 9662.3 3281.0
## - Amount.Requested 1 2047.4 11318.6 3628.9
## - Loan.Length_36months 1 3165.5 12436.7 3836.1
## - fico 1 18411.2 27682.4 5595.6
##
## Step: AIC=3190.96
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_14 + Loan.Length_36months + State_TX + Home.Ownership_RENT +
## Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Home.Ownership_RENT 1 4.1 9278.7 3189.9
## - Revolving.CREDIT.Balance 1 6.1 9280.7 3190.4
## - el 1 6.3 9280.9 3190.5
## <none> 9274.6 3191.0
## - lp_14 1 8.5 9283.1 3191.0
## - Monthly.Income 1 11.1 9285.7 3191.6
## - Home.Ownership_MORTGAGE 1 25.6 9300.2 3195.0
## - State_TX 1 51.2 9325.8 3201.1
## - Open.CREDIT.Lines 1 67.2 9341.8 3204.8
## - Inquiries.in.the.Last.6.Months 1 389.4 9664.0 3279.4
## - Amount.Requested 1 2045.2 11319.8 3627.2
## - Loan.Length_36months 1 3162.5 12437.1 3834.2
## - fico 1 18550.7 27825.3 5604.9
##
## Step: AIC=3189.93
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Revolving.CREDIT.Balance + Inquiries.in.the.Last.6.Months +
## fico + el + lp_14 + Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - Revolving.CREDIT.Balance 1 5.8 9284.4 3189.3
## - el 1 6.7 9285.4 3189.5
## <none> 9278.7 3189.9
## - lp_14 1 9.1 9287.8 3190.1
## - Monthly.Income 1 11.5 9290.1 3190.6
## - Home.Ownership_MORTGAGE 1 36.3 9315.0 3196.5
## - State_TX 1 52.1 9330.8 3200.2
## - Open.CREDIT.Lines 1 66.3 9345.0 3203.6
## - Inquiries.in.the.Last.6.Months 1 389.9 9668.5 3278.4
## - Amount.Requested 1 2045.1 11323.8 3625.9
## - Loan.Length_36months 1 3159.5 12438.1 3832.3
## - fico 1 18588.0 27866.7 5606.2
##
## Step: AIC=3189.3
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + el + lp_14 + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - el 1 6.2 9290.6 3188.8
## <none> 9284.4 3189.3
## - lp_14 1 8.5 9292.9 3189.3
## - Monthly.Income 1 16.7 9301.1 3191.3
## - Home.Ownership_MORTGAGE 1 38.9 9323.4 3196.5
## - State_TX 1 53.4 9337.8 3199.9
## - Open.CREDIT.Lines 1 78.4 9362.8 3205.8
## - Inquiries.in.the.Last.6.Months 1 391.2 9675.6 3278.1
## - Amount.Requested 1 2054.7 11339.1 3626.9
## - Loan.Length_36months 1 3176.2 12460.7 3834.3
## - fico 1 18597.8 27882.2 5605.4
##
## Step: AIC=3188.76
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + lp_14 + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## - lp_14 1 8.1 9298.7 3188.7
## <none> 9290.6 3188.8
## - Monthly.Income 1 16.4 9307.0 3190.6
## - Home.Ownership_MORTGAGE 1 34.1 9324.7 3194.8
## - State_TX 1 51.5 9342.1 3198.9
## - Open.CREDIT.Lines 1 77.6 9368.2 3205.0
## - Inquiries.in.the.Last.6.Months 1 387.7 9678.3 3276.7
## - Amount.Requested 1 2093.6 11384.2 3633.7
## - Loan.Length_36months 1 3176.7 12467.3 3833.5
## - fico 1 18621.3 27911.9 5605.8
##
## Step: AIC=3188.68
## Interest.Rate ~ Amount.Requested + Monthly.Income + Open.CREDIT.Lines +
## Inquiries.in.the.Last.6.Months + fico + Loan.Length_36months +
## State_TX + Home.Ownership_MORTGAGE
##
## Df Sum of Sq RSS AIC
## <none> 9298.7 3188.7
## - Monthly.Income 1 15.2 9313.9 3190.3
## - Home.Ownership_MORTGAGE 1 34.2 9333.0 3194.8
## - State_TX 1 53.1 9351.8 3199.2
## - Open.CREDIT.Lines 1 78.9 9377.6 3205.3
## - Inquiries.in.the.Last.6.Months 1 388.4 9687.1 3276.7
## - Amount.Requested 1 2113.3 11412.1 3637.0
## - Loan.Length_36months 1 3184.8 12483.6 3834.4
## - fico 1 18803.2 28101.9 5618.7
summary(fit.final)
##
## Call:
## lm(formula = Interest.Rate ~ Amount.Requested + Monthly.Income +
## Open.CREDIT.Lines + Inquiries.in.the.Last.6.Months + fico +
## Loan.Length_36months + State_TX + Home.Ownership_MORTGAGE,
## data = ld_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5280 -1.3681 -0.2007 1.2432 10.0654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.466e+01 9.250e-01 80.718 < 2e-16 ***
## Amount.Requested 1.509e-04 6.764e-06 22.310 < 2e-16 ***
## Monthly.Income -2.275e-05 1.203e-05 -1.892 0.058675 .
## Open.CREDIT.Lines -4.414e-02 1.024e-02 -4.311 1.7e-05 ***
## Inquiries.in.the.Last.6.Months 3.504e-01 3.664e-02 9.564 < 2e-16 ***
## fico -8.562e-02 1.287e-03 -66.547 < 2e-16 ***
## Loan.Length_36months -3.233e+00 1.180e-01 -27.388 < 2e-16 ***
## State_TX 6.253e-01 1.769e-01 3.536 0.000415 ***
## Home.Ownership_MORTGAGE -2.623e-01 9.238e-02 -2.839 0.004566 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 2190 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7562
## F-statistic: 853.4 on 8 and 2190 DF, p-value: < 2.2e-16
We can use this model to make prediction on the test data
pred.IR=predict(fit.final,newdata=ld_test)
head(pred.IR)
## 1 2 3 4 5 6
## 15.494899 16.034174 10.442629 3.748237 15.368185 7.133317
#write.csv(pred.IR,"mysubmission.csv",row.names = F)
There are many other observation you can make from the coefficients. For example :
For high amount requested , interest rates are going to be higher as well
If someone has been applying for loan at too many places [ results in high number of inquiries ], they’ll get higher interest rates
Where as if you have high fico score [ mroe credit worthy] , your interest rates are going to be lower.
If you are applying for a loan for 36 months , your interest will be lower by almost 3.5% in comparisons to 60 months loan.