PROBLEM STATEMENT
Payday loans are high-risk, short-term lending products, so it is very important to assess the risk of payment default. Use the dataset "paydayloan_collections.csv" to build a model that predicts whether repayment will be successful or not.
AIM: Split the dataset into train and test data. Use Random Forest and Decision Tree to build models on the train data and compare their performance on the test data.
Initial setup: load the packages needed for the decision tree and random forest models.
library(tree)
## Warning: package 'tree' was built under R version 3.3.3
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
loan=read.csv("paydayloan_collections.csv")
View(loan)
str(loan) #note: payment (the response) is a factor with 2 levels; the predictors are a mix of factor and numeric types
## 'data.frame': 30000 obs. of 31 variables:
## $ payment: Factor w/ 2 levels "Denied","Success": 2 1 1 2 2 2 2 2 1 1 ...
## $ var1 : Factor w/ 4 levels "kq","ma","qw",..: 3 3 3 4 2 1 4 3 4 3 ...
## $ var2 : Factor w/ 9 levels "bq","hk","js",..: 2 7 9 3 8 5 5 7 7 7 ...
## $ var3 : num 3.11 3.35 4.15 6.23 1.28 -2.45 1.05 5.41 7.29 3.13 ...
## $ var4 : num 16.1 11.2 29.2 15.7 20.7 ...
## $ var5 : num -4.6 -18.55 18.91 2.81 14.98 ...
## $ var6 : num 22.34 6.68 16.4 4.46 11.19 ...
## $ var7 : num 13.53 12.78 3.67 5.13 17.66 ...
## $ var8 : num 1.53 6.62 5.72 8.66 1.13 ...
## $ var9 : Factor w/ 3 levels "ch","ja","nv": 3 3 1 2 3 1 1 1 1 1 ...
## $ var10 : Factor w/ 3 levels "db","ld","pe": 2 2 2 2 2 1 2 2 1 2 ...
## $ var11 : Factor w/ 2 levels "rl","te": 1 2 2 2 2 2 2 2 2 1 ...
## $ var12 : num 4.46 4.04 -4.41 2.14 14.93 ...
## $ var13 : Factor w/ 7 levels "cb","iz","kh",..: 1 3 1 2 6 2 1 1 3 6 ...
## $ var14 : num 4.93 -0.76 1.21 3.56 2.2 0.37 -0.84 5.85 5.23 0.62 ...
## $ var15 : num 26.5 16.2 29.1 23.6 -19.2 ...
## $ var16 : num 10.48 -0.87 5.49 15.34 -3 ...
## $ var17 : Factor w/ 5 levels "az","bw","ki",..: 4 3 2 2 3 3 5 2 2 2 ...
## $ var18 : num 11.2 15.5 -10.8 -24.3 12 ...
## $ var19 : Factor w/ 7 levels "bz","ev","fh",..: 2 7 5 2 5 2 2 7 6 5 ...
## $ var20 : num 19.1 28.4 1.6 -11.9 18.5 ...
## $ var21 : num 8.94 31.02 23.26 29.25 2.19 ...
## $ var22 : num -12.76 34.76 9.5 -1.53 10.24 ...
## $ var23 : Factor w/ 10 levels "cz","da","fe",..: 9 1 6 6 9 6 8 9 8 8 ...
## $ var24 : num 12.06 1.44 7.77 8.94 8.92 ...
## $ var25 : num 2.46 9.44 8.7 19.33 5.48 ...
## $ var26 : num 4.73 13.56 -1.75 23.73 -0.28 ...
## $ var27 : num -1.72 -2.24 5.96 5.54 4.01 6.65 1.22 9.08 0.87 -2.54 ...
## $ var28 : num 0.91 0.24 1.91 0.85 1.21 -1.12 1.26 0.68 1.6 -1.43 ...
## $ var29 : Factor w/ 2 levels "dg","ev": 2 2 2 2 2 2 2 2 1 2 ...
## $ var30 : num 8 -2.9 22.7 36.3 11.3 ...
glimpse(loan)
## Observations: 30,000
## Variables: 31
## $ payment <fctr> Success, Denied, Denied, Success, Success, Success, S...
## $ var1 <fctr> qw, qw, qw, wv, ma, kq, wv, qw, wv, qw, qw, qw, wv, w...
## $ var2 <fctr> hk, rv, zg, js, xn, py, py, rv, rv, rv, py, bq, py, p...
## $ var3 <dbl> 3.11, 3.35, 4.15, 6.23, 1.28, -2.45, 1.05, 5.41, 7.29,...
## $ var4 <dbl> 16.06, 11.18, 29.19, 15.70, 20.71, 22.45, 23.02, 17.92...
## $ var5 <dbl> -4.60, -18.55, 18.91, 2.81, 14.98, 15.18, 17.59, -14.5...
## $ var6 <dbl> 22.34, 6.68, 16.40, 4.46, 11.19, -2.12, 6.65, 5.00, 13...
## $ var7 <dbl> 13.53, 12.78, 3.67, 5.13, 17.66, -8.24, -2.06, 1.34, 2...
## $ var8 <dbl> 1.53, 6.62, 5.72, 8.66, 1.13, 10.34, 12.20, -8.54, 4.4...
## $ var9 <fctr> nv, nv, ch, ja, nv, ch, ch, ch, ch, ch, ch, ch, ch, n...
## $ var10 <fctr> ld, ld, ld, ld, ld, db, ld, ld, db, ld, ld, ld, db, l...
## $ var11 <fctr> rl, te, te, te, te, te, te, te, te, rl, te, rl, te, t...
## $ var12 <dbl> 4.46, 4.04, -4.41, 2.14, 14.93, 8.64, 14.91, 13.97, 15...
## $ var13 <fctr> cb, kh, cb, iz, te, iz, cb, cb, kh, te, iz, np, te, i...
## $ var14 <dbl> 4.93, -0.76, 1.21, 3.56, 2.20, 0.37, -0.84, 5.85, 5.23...
## $ var15 <dbl> 26.48, 16.21, 29.06, 23.61, -19.16, 35.83, -8.38, 28.9...
## $ var16 <dbl> 10.48, -0.87, 5.49, 15.34, -3.00, 8.05, 15.66, 15.06, ...
## $ var17 <fctr> ov, ki, bw, bw, ki, ki, zk, bw, bw, bw, bw, bw, ki, z...
## $ var18 <dbl> 11.17, 15.50, -10.84, -24.26, 12.02, 32.12, 69.45, 15....
## $ var19 <fctr> ev, tg, me, ev, me, ev, ev, tg, qu, me, ev, ev, me, e...
## $ var20 <dbl> 19.15, 28.39, 1.60, -11.89, 18.47, 9.94, 27.55, 17.71,...
## $ var21 <dbl> 8.94, 31.02, 23.26, 29.25, 2.19, -0.26, 14.39, 4.09, 1...
## $ var22 <dbl> -12.76, 34.76, 9.50, -1.53, 10.24, 8.11, -4.73, 22.62,...
## $ var23 <fctr> ub, cz, ri, ri, ub, ri, tf, ub, tf, tf, tf, tf, ub, u...
## $ var24 <dbl> 12.06, 1.44, 7.77, 8.94, 8.92, 2.32, 8.57, 2.82, 15.00...
## $ var25 <dbl> 2.46, 9.44, 8.70, 19.33, 5.48, 6.89, 8.73, -11.05, 7.4...
## $ var26 <dbl> 4.73, 13.56, -1.75, 23.73, -0.28, -3.51, 2.87, 11.94, ...
## $ var27 <dbl> -1.72, -2.24, 5.96, 5.54, 4.01, 6.65, 1.22, 9.08, 0.87...
## $ var28 <dbl> 0.91, 0.24, 1.91, 0.85, 1.21, -1.12, 1.26, 0.68, 1.60,...
## $ var29 <fctr> ev, ev, ev, ev, ev, ev, ev, ev, dg, ev, dg, dg, ev, d...
## $ var30 <dbl> 8.00, -2.90, 22.67, 36.31, 11.33, 9.93, 21.75, 7.53, -...
table(loan$payment)
##
## Denied Success
## 18755 11245
The response variable is categorical (payment = Denied/Success), i.e. a classification problem. Let's build the decision tree first and then use the random forest method.
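As a quick baseline (not in the original analysis): a model that always predicts the majority class "Denied" would get every "Success" wrong, so any useful model must beat that error.
11245/30000 #baseline error of always predicting "Denied": ~0.375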
Step 1: divide the data into train and test sets (50:50).
set.seed(2)
s=sample(1:nrow(loan),15000)
loan_train=loan[s,]
loan_test=loan[-s,]
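As an optional sanity check on the split (not in the original run), the class balance should be similar in both halves:
prop.table(table(loan_train$payment)) #proportions of Denied/Success in train
prop.table(table(loan_test$payment))  #and in test; both should be near 0.625/0.375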
Step 2: build the tree model on the train data.
tree.loan=tree(payment~.,data=loan_train)
Let's plot it:
plot(tree.loan) #model with 7 terminal nodes
text(tree.loan,pretty=0)
tree.loan #text format output
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 15000 19790.0 Denied ( 0.62847 0.37153 )
## 2) var25 < 9.105 7041 9332.0 Success ( 0.37722 0.62278 )
## 4) var17: bw 3197 4338.0 Denied ( 0.58555 0.41445 )
## 8) var25 < 1.345 1355 1661.0 Success ( 0.30258 0.69742 ) *
## 9) var25 > 1.345 1842 1875.0 Denied ( 0.79370 0.20630 ) *
## 5) var17: az,ki,ov,zk 3844 3889.0 Success ( 0.20395 0.79605 ) *
## 3) var25 > 9.105 7959 6708.0 Denied ( 0.85074 0.14926 )
## 6) var25 < 15.195 3510 3703.0 Denied ( 0.77949 0.22051 )
## 12) var23: cz,da,fe,po,qu,ri,sy,tf 2778 2408.0 Denied ( 0.84377 0.15623 ) *
## 13) var23: ub,yv 732 1011.0 Denied ( 0.53552 0.46448 )
## 26) var17: bw 327 262.0 Denied ( 0.86239 0.13761 ) *
## 27) var17: az,ki,ov,zk 405 473.7 Success ( 0.27160 0.72840 ) *
## 7) var25 > 15.195 4449 2754.0 Denied ( 0.90695 0.09305 ) *
summary(tree.loan)
##
## Classification tree:
## tree(formula = payment ~ ., data = loan_train)
## Variables actually used in tree construction:
## [1] "var25" "var17" "var23"
## Number of terminal nodes: 7
## Residual mean deviance: 0.8886 = 13320 / 14990
## Misclassification error rate: 0.1718 = 2577 / 15000
Output: on the train dataset, the tree has 7 terminal nodes, residual mean deviance = 0.89, and misclassification error rate = 0.1718.
Step 3: predict on the test dataset.
tree.pred=predict(tree.loan,newdata = loan_test,type ="class")
Step 4: tabulate the predictions against the actual test labels and compute the error.
table(tree.pred,loan_test$payment)
##
## tree.pred Denied Success
## Denied 8076 1373
## Success 1252 4299
(1252+1373)/15000
## [1] 0.175
Error on the test data = 0.175.
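As a quick equivalent check, the same error can be computed in one line from the predictions themselves:
mean(tree.pred!=loan_test$payment) #fraction of test rows misclassified, = 0.175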
Let's also find the error on the train data:
tree.pred.train=predict(tree.loan,newdata = loan_train,type ="class")
table(tree.pred.train,loan_train$payment)
##
## tree.pred.train Denied Success
## Denied 8123 1273
## Success 1304 4300
(1304+1273)/15000
## [1] 0.1718
CONCLUSION: the error on the train data is 0.1718, while on the test data it is 0.175. The two are close, so the tree generalizes well (a good model).
Step 5: even so, let's try pruning to find the optimal tree size, using the cross-validation function cv.tree with prune.misclass:
set.seed(2)
cv.loan=cv.tree(tree.loan,FUN = prune.misclass)
Let's plot the cross-validated deviance against tree size:
plot(cv.loan$size,cv.loan$dev,type="b")
The plot shows that the deviance (here, the misclassification count) is minimized at 7 terminal nodes.
Hence the tree we already have, with 7 terminal nodes, is the best size and needs no pruning.
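For completeness, had the CV curve favored a smaller tree, prune.misclass from the tree package would do the pruning; a hypothetical sketch (best=5 is purely illustrative):
prune.loan=prune.misclass(tree.loan,best=5) #cut the tree back to 5 terminal nodes
plot(prune.loan)
text(prune.loan,pretty=0)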
Let's now apply the random forest method.
Here the response is again the classification target (payment = Success/Denied).
Step 1: build the random forest model on the train dataset.
rf.loan=randomForest(payment~.,data = loan_train) #you can add do.trace=T to print progress while the forest is grown
rf.loan
##
## Call:
## randomForest(formula = payment ~ ., data = loan_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 13.43%
## Confusion matrix:
## Denied Success class.error
## Denied 8537 890 0.09440967
## Success 1125 4448 0.20186614
Output interpretation: ntree = 500, number of variables tried at each split = 5, OOB estimate of the error rate = 13.43%, i.e. (1125+890)/15000 = 13.43%.
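Note that mtry = 5 is just the default floor(sqrt(30)) for classification. If we wanted to tune it, the randomForest package provides tuneRF, which searches over mtry using the OOB error; a sketch, not run here (payment is column 1 of loan_train, and the stepFactor/improve values are illustrative):
set.seed(2)
tuneRF(loan_train[,-1],loan_train$payment,ntreeTry=200,stepFactor=1.5,improve=0.01)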
Step 2: predict with the random forest model on the test dataset and find the error.
rf.loan.pred=predict(rf.loan,newdata = loan_test)
t=table(loan_test$payment,rf.loan.pred)
t
## rf.loan.pred
## Denied Success
## Denied 8496 832
## Success 1149 4523
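Since t was built as table(actual, predicted), the per-class recall can be read off its diagonal; a quick check:
diag(t)/rowSums(t) #recall: ~0.91 for Denied, ~0.80 for Success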
Interpretation: the off-diagonal entries are the misclassifications, so the test error is
(1149+832)/15000
## [1] 0.1320667
Error = 13.21% for the random forest model.
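Because the business problem is about assessing default risk, class probabilities can be more useful than hard labels; predict.randomForest supports this via type="prob" (a sketch, not part of the original run):
rf.prob=predict(rf.loan,newdata=loan_test,type="prob") #per-class vote fractions
head(rf.prob) #columns Denied/Success; a cutoff other than 0.5 can then be applied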
Step 3: plot the variable importance (varImpPlot shows which variables are most important):
importance(rf.loan)
## MeanDecreaseGini
## var1 40.32600
## var2 128.59517
## var3 168.25258
## var4 167.97693
## var5 163.46701
## var6 163.03284
## var7 252.18814
## var8 165.88244
## var9 25.67926
## var10 26.78148
## var11 12.40307
## var12 160.85998
## var13 98.65424
## var14 164.22960
## var15 168.69391
## var16 340.11406
## var17 517.84024
## var18 167.12299
## var19 96.80594
## var20 162.52858
## var21 168.95544
## var22 168.19927
## var23 265.28208
## var24 168.26335
## var25 2360.36785
## var26 157.98775
## var27 162.64079
## var28 154.84962
## var29 39.65420
## var30 167.07499
varImpPlot(rf.loan)
Variables var25, var17, var16, and var23 are the most important contributors to the prediction (highest MeanDecreaseGini).
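One caveat: by default importance() reports only MeanDecreaseGini; the permutation-based MeanDecreaseAccuracy requires importance=TRUE at fit time. A sketch, not run above:
rf.loan2=randomForest(payment~.,data = loan_train,importance=TRUE) #refit with permutation importance
importance(rf.loan2) #now also shows MeanDecreaseAccuracy
varImpPlot(rf.loan2)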
Conclusion: the error when predicting the TEST data is 13.21% for the random forest, against 17.5% for the decision tree model.
Hence the random forest model is better than the decision tree model on this dataset.