Tired of remembering too many different Packages?
One of the biggest challenges beginners in Data Science face is deciding which algorithms to learn and focus on. In R, the problem is compounded because the same functionality can often be achieved through several different libraries. That choice is great, but it is also frustrating, since each package was designed independently and has very different syntax, inputs and outputs. This can be too much for a beginner.
Here is a tip to handle everything, from exploring data to running complex machine learning algorithms to tuning their hyperparameters, under a single roof.
All this has been made possible by the years of effort that have gone into caret (Classification And REgression Training), which is possibly one of the biggest projects in R. This package alone covers most of what you need to solve almost any supervised machine learning problem. Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:
• Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)
• Data splitting
• Variable selection
• Model evaluation
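To make the "single roof" idea concrete, here is a minimal sketch of caret's common interface. It is only an illustration: the built-in iris data stands in for a real problem, and the two methods ("lda" and "rpart") are picked purely because their underlying packages ship with standard R installations.

```r
# Illustration only: iris is a stand-in dataset, not the loan data used in this article.
library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Same call, different `method` string: caret deals with each package's own syntax.
fit_lda  <- train(Species ~ ., data = iris, method = "lda",   trControl = ctrl)
fit_tree <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Predictions come back in the same form regardless of the underlying model.
pred_lda  <- predict(fit_lda,  newdata = head(iris))
pred_tree <- predict(fit_tree, newdata = head(iris))
```

Switching algorithms is then mostly a matter of changing the method argument, while resampling and prediction stay the same.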
In this problem, we have to predict the loan status (default or not) of a borrower based on its financial profile. We’ll get started by loading the caret library and the Loan Default dataset, which is available in my working directory.
# Loading the library (install it once with install.packages("caret")).
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Setting up the working Directory and Loading the Loan Default dataset.
setwd("D:/Great Learning/Finance and Risk Analytics")
dataset <- read.csv('raw-data.csv')
Once we have the data available in R Environment, we perform a few exploratory checks to understand the Structure of the data and ensure that the data loaded is correct.
### Performing basic Exploratory Analysis
# Checking the class of the data.
class(dataset)
## [1] "data.frame"
# Checking the dimension of data.
dim(dataset)
## [1] 3541 53
# Reading top 5 Rows.
head(dataset, n=5)
## Num Networth.Next.Year Total.assets Net.worth Total.income
## 1 1 8890.6 17512.3 7093.2 24965.2
## 2 2 394.3 941.0 351.5 1527.4
## 3 3 92.2 232.8 100.6 477.3
## 4 4 2.7 2.7 2.7 NA
## 5 5 109.0 478.5 107.6 1580.5
## Change.in.stock Total.expenses Profit.after.tax PBDITA PBT
## 1 235.8 23657.8 1543.2 2860.2 2417.2
## 2 42.7 1454.9 115.2 283.0 188.4
## 3 -5.2 478.7 -6.6 5.8 -6.6
## 4 NA NA NA NA NA
## 5 -17.0 1558.0 5.5 31.0 6.3
## Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## 1 1872.8 11.46 9.68
## 2 158.6 18.53 12.33
## 3 0.3 1.22 -1.38
## 4 NA 0.00 0.00
## 5 11.9 1.96 0.40
## PAT.as...of.total.income Cash.profit.as...of.total.income
## 1 6.18 7.50
## 2 7.54 10.38
## 3 -1.38 0.06
## 4 0.00 0.00
## 5 0.35 0.75
## PAT.as...of.net.worth Sales Income.from.financial.services
## 1 23.78 24458.0 158.0
## 2 38.08 1504.3 4.0
## 3 -6.35 475.6 1.5
## 4 0.00 NA NA
## 5 5.25 1575.1 3.9
## Other.income Total.capital Reserves.and.funds
## 1 297.2 423.8 6822.8
## 2 15.9 115.5 257.8
## 3 0.2 81.4 19.2
## 4 NA 0.5 2.2
## 5 0.9 6.2 161.8
## Deposits..accepted.by.commercial.banks. Borrowings
## 1 NA 14.9
## 2 NA 272.5
## 3 NA 35.4
## 4 NA NA
## 5 NA 193.1
## Current.liabilities...provisions Deferred.tax.liability
## 1 9965.9 284.9
## 2 210.0 85.2
## 3 96.8 NA
## 4 NA NA
## 5 112.8 4.6
## Shareholders.funds Cumulative.retained.profits Capital.employed TOL.TNW
## 1 7093.2 6263.3 7108.1 1.33
## 2 351.5 247.4 624.0 1.23
## 3 100.6 32.4 136.0 1.44
## 4 2.7 2.2 2.7 0.00
## 5 107.6 82.7 300.7 2.83
## Total.term.liabilities...tangible.net.worth
## 1 0.00
## 2 0.34
## 3 0.29
## 4 0.00
## 5 1.59
## Contingent.liabilities...Net.worth.... Contingent.liabilities
## 1 14.80 1049.7
## 2 19.23 67.6
## 3 45.83 46.1
## 4 0.00 NA
## 5 34.94 37.6
## Net.fixed.assets Investments Current.assets Net.working.capital
## 1 1900.2 1069.6 13277.5 3588.5
## 2 286.4 2.2 563.9 203.5
## 3 38.7 4.3 167.5 59.6
## 4 2.5 NA 0.2 0.2
## 5 94.8 7.4 349.7 215.8
## Quick.ratio..times. Current.ratio..times. Debt.to.equity.ratio..times.
## 1 1.18 1.37 0.00
## 2 0.95 1.56 0.78
## 3 1.11 1.55 0.35
## 4 NA NA 0.00
## 5 1.41 2.54 1.79
## Cash.to.current.liabilities..times.
## 1 0.43
## 2 0.06
## 3 0.21
## 4 NA
## 5 0.00
## Cash.to.average.cost.of.sales.per.day Creditors.turnover
## 1 68.21 3.62
## 2 5.96 9.80
## 3 17.07 5.28
## 4 NA 0.00
## 5 0.00 13.00
## Debtors.turnover Finished.goods.turnover WIP.turnover
## 1 3.85 200.55 21.78
## 2 5.70 14.21 7.49
## 3 5.07 9.24 0.23
## 4 0.00 NA NA
## 5 9.46 12.68 7.90
## Raw.material.turnover Shares.outstanding Equity.face.value EPS
## 1 7.71 42381675 10 35.52
## 2 11.46 11550000 10 9.97
## 3 NA 8149090 10 -0.50
## 4 0.00 52404 10 0.00
## 5 17.03 619635 10 7.91
## Adjusted.EPS Total.liabilities PE.on.BSE Default
## 1 7.10 17512.3 27.31 0
## 2 9.97 941.0 8.17 0
## 3 -0.50 232.8 -5.76 0
## 4 0.00 2.7 NA 0
## 5 7.91 478.5 NA 0
# Reading bottom 5 Rows.
tail(dataset, n=5)
## Num Networth.Next.Year Total.assets Net.worth Total.income
## 3537 3541 226.4 450.5 172.3 565.0
## 3538 3542 89.4 97.6 82.0 75.8
## 3539 3543 246.2 902.9 209.1 1005.1
## 3540 3544 146.9 177.0 137.2 371.0
## 3541 3545 -0.2 0.6 0.3 NA
## Change.in.stock Total.expenses Profit.after.tax PBDITA PBT
## 3537 30.5 581.1 14.4 76.7 41.1
## 3538 -4.0 66.5 5.3 11.1 6.2
## 3539 5.6 966.5 44.2 120.3 70.0
## 3540 3.9 348.9 26.0 50.5 40.8
## 3541 NA 17.4 -17.4 -17.4 -17.4
## Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## 3537 48.4 13.58 7.27
## 3538 9.2 14.64 8.18
## 3539 62.6 11.97 6.96
## 3540 33.6 13.61 11.00
## 3541 -17.4 NA NA
## PAT.as...of.total.income Cash.profit.as...of.total.income
## 3537 2.55 8.57
## 3538 6.99 12.14
## 3539 4.40 6.23
## 3540 7.01 9.06
## 3541 NA NA
## PAT.as...of.net.worth Sales Income.from.financial.services
## 3537 8.71 564.5 0.5
## 3538 6.68 73.9 1.7
## 3539 22.77 995.9 2.6
## 3540 20.30 365.8 3.3
## 3541 -193.33 NA NA
## Other.income Total.capital Reserves.and.funds
## 3537 NA 89.0 85.5
## 3538 NA 38.6 48.4
## 3539 0.3 30.0 179.1
## 3540 1.6 50.9 86.3
## 3541 NA 28.3 -28.0
## Deposits..accepted.by.commercial.banks. Borrowings
## 3537 NA 190.2
## 3538 NA 3.0
## 3539 NA 305.0
## 3540 NA 1.3
## 3541 NA NA
## Current.liabilities...provisions Deferred.tax.liability
## 3537 42.5 36.8
## 3538 7.6 NA
## 3539 363.4 25.4
## 3540 21.1 17.4
## 3541 0.3 NA
## Shareholders.funds Cumulative.retained.profits Capital.employed
## 3537 172.3 76.8 362.5
## 3538 87.0 36.6 90.0
## 3539 209.1 179.1 514.1
## 3540 137.2 77.1 138.5
## 3541 0.3 -28.0 0.3
## TOL.TNW Total.term.liabilities...tangible.net.worth
## 3537 1.30 0.72
## 3538 0.12 0.02
## 3539 2.45 0.68
## 3540 0.10 0.01
## 3541 1.00 0.00
## Contingent.liabilities...Net.worth.... Contingent.liabilities
## 3537 0.00 NA
## 3538 5.12 4.2
## 3539 93.45 195.4
## 3540 6.20 8.5
## 3541 0.00 NA
## Net.fixed.assets Investments Current.assets Net.working.capital
## 3537 227.0 NA 187.0 78.3
## 3538 21.9 6.8 55.8 47.2
## 3539 217.7 17.5 477.5 -49.5
## 3540 73.5 NA 80.8 59.7
## 3541 NA NA 0.6 0.3
## Quick.ratio..times. Current.ratio..times.
## 3537 0.41 1.71
## 3538 4.58 6.49
## 3539 0.59 0.91
## 3540 2.83 3.83
## 3541 2.00 2.00
## Debt.to.equity.ratio..times. Cash.to.current.liabilities..times.
## 3537 1.10 0.07
## 3538 0.10 3.88
## 3539 1.46 0.05
## 3540 0.01 1.35
## 3541 0.00 2.00
## Cash.to.average.cost.of.sales.per.day Creditors.turnover
## 3537 5.67 15.65
## 3538 177.71 10.07
## 3539 11.05 3.96
## 3540 29.93 25.00
## 3541 2190.00 0.00
## Debtors.turnover Finished.goods.turnover WIP.turnover
## 3537 20.64 8.66 5.14
## 3538 14.21 5.13 4.17
## 3539 3.76 33.03 11.68
## 3540 13.75 49.00 47.03
## 3541 0.00 NA NA
## Raw.material.turnover Shares.outstanding Equity.face.value EPS
## 3537 19.47 14904213 10 0.97
## 3538 4.83 3362800 10 1.61
## 3539 4.63 3000000 10 13.10
## 3540 17.42 4422346 10 6.06
## 3541 0.00 5220000 10 -0.02
## Adjusted.EPS Total.liabilities PE.on.BSE Default
## 3537 0.97 450.5 NA 0
## 3538 1.61 97.6 2.49 0
## 3539 13.10 902.9 12.62 0
## 3540 6.06 177.0 4.07 0
## 3541 -0.02 0.6 NA 1
# Understanding the Structure of the data loaded.
str(dataset)
## 'data.frame': 3541 obs. of 53 variables:
## $ Num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Networth.Next.Year : num 8890.6 394.3 92.2 2.7 109 ...
## $ Total.assets : num 17512.3 941 232.8 2.7 478.5 ...
## $ Net.worth : num 7093.2 351.5 100.6 2.7 107.6 ...
## $ Total.income : num 24965 1527 477 NA 1580 ...
## $ Change.in.stock : num 235.8 42.7 -5.2 NA -17 ...
## $ Total.expenses : num 23658 1455 479 NA 1558 ...
## $ Profit.after.tax : num 1543.2 115.2 -6.6 NA 5.5 ...
## $ PBDITA : num 2860.2 283 5.8 NA 31 ...
## $ PBT : num 2417.2 188.4 -6.6 NA 6.3 ...
## $ Cash.profit : num 1872.8 158.6 0.3 NA 11.9 ...
## $ PBDITA.as...of.total.income : num 11.46 18.53 1.22 0 1.96 ...
## $ PBT.as...of.total.income : num 9.68 12.33 -1.38 0 0.4 ...
## $ PAT.as...of.total.income : num 6.18 7.54 -1.38 0 0.35 2.81 0 0.72 8.29 -2.88 ...
## $ Cash.profit.as...of.total.income : num 7.5 10.38 0.06 0 0.75 ...
## $ PAT.as...of.net.worth : num 23.78 38.08 -6.35 0 5.25 ...
## $ Sales : num 24458 1504 476 NA 1575 ...
## $ Income.from.financial.services : num 158 4 1.5 NA 3.9 6.4 NA NA 7.3 NA ...
## $ Other.income : num 297.2 15.9 0.2 NA 0.9 ...
## $ Total.capital : num 423.8 115.5 81.4 0.5 6.2 ...
## $ Reserves.and.funds : num 6822.8 257.8 19.2 2.2 161.8 ...
## $ Deposits..accepted.by.commercial.banks. : logi NA NA NA NA NA NA ...
## $ Borrowings : num 14.9 272.5 35.4 NA 193.1 ...
## $ Current.liabilities...provisions : num 9965.9 210 96.8 NA 112.8 ...
## $ Deferred.tax.liability : num 284.9 85.2 NA NA 4.6 ...
## $ Shareholders.funds : num 7093.2 351.5 100.6 2.7 107.6 ...
## $ Cumulative.retained.profits : num 6263.3 247.4 32.4 2.2 82.7 ...
## $ Capital.employed : num 7108.1 624 136 2.7 300.7 ...
## $ TOL.TNW : num 1.33 1.23 1.44 0 2.83 1.8 0.03 5.17 1.05 3.25 ...
## $ Total.term.liabilities...tangible.net.worth: num 0 0.34 0.29 0 1.59 0.37 0.03 0.94 0.3 0.54 ...
## $ Contingent.liabilities...Net.worth.... : num 14.8 19.2 45.8 0 34.9 ...
## $ Contingent.liabilities : num 1049.7 67.6 46.1 NA 37.6 ...
## $ Net.fixed.assets : num 1900.2 286.4 38.7 2.5 94.8 ...
## $ Investments : num 1069.6 2.2 4.3 NA 7.4 ...
## $ Current.assets : num 13277.5 563.9 167.5 0.2 349.7 ...
## $ Net.working.capital : num 3588.5 203.5 59.6 0.2 215.8 ...
## $ Quick.ratio..times. : num 1.18 0.95 1.11 NA 1.41 0.48 NA 0.54 0.59 0.39 ...
## $ Current.ratio..times. : num 1.37 1.56 1.55 NA 2.54 1.27 NA 1.15 1.58 0.5 ...
## $ Debt.to.equity.ratio..times. : num 0 0.78 0.35 0 1.79 1.09 0.32 2.31 0.94 3.13 ...
## $ Cash.to.current.liabilities..times. : num 0.43 0.06 0.21 NA 0 0.11 NA 0.04 0.19 0 ...
## $ Cash.to.average.cost.of.sales.per.day : num 68.21 5.96 17.07 NA 0 ...
## $ Creditors.turnover : num 3.62 9.8 5.28 0 13 ...
## $ Debtors.turnover : num 3.85 5.7 5.07 0 9.46 ...
## $ Finished.goods.turnover : num 200.55 14.21 9.24 NA 12.68 ...
## $ WIP.turnover : num 21.78 7.49 0.23 NA 7.9 ...
## $ Raw.material.turnover : num 7.71 11.46 NA 0 17.03 ...
## $ Shares.outstanding : num 42381675 11550000 8149090 52404 619635 ...
## $ Equity.face.value : num 10 10 10 10 10 10 10 NA 10 10 ...
## $ EPS : num 35.52 9.97 -0.5 0 7.91 ...
## $ Adjusted.EPS : num 7.1 9.97 -0.5 0 7.91 ...
## $ Total.liabilities : num 17512.3 941 232.8 2.7 478.5 ...
## $ PE.on.BSE : num 27.31 8.17 -5.76 NA NA ...
## $ Default : int 0 0 0 0 0 0 0 0 0 1 ...
# Understanding the Summary of the data loaded.
summary(dataset)
## Num Networth.Next.Year Total.assets Net.worth
## Min. : 1 Min. :-74265.6 Min. : 0.1 Min. : 0.0
## 1st Qu.: 886 1st Qu.: 31.7 1st Qu.: 91.3 1st Qu.: 31.3
## Median :1773 Median : 116.3 Median : 309.7 Median : 102.3
## Mean :1772 Mean : 1616.3 Mean : 3443.4 Mean : 1295.9
## 3rd Qu.:2658 3rd Qu.: 456.1 3rd Qu.: 1098.7 3rd Qu.: 377.3
## Max. :3545 Max. :805773.4 Max. :1176509.2 Max. :613151.6
##
## Total.income Change.in.stock Total.expenses
## Min. : 0.0 Min. :-3029.40 Min. : -0.1
## 1st Qu.: 106.5 1st Qu.: -1.80 1st Qu.: 95.8
## Median : 444.9 Median : 1.60 Median : 407.7
## Mean : 4582.8 Mean : 41.49 Mean : 4262.9
## 3rd Qu.: 1440.9 3rd Qu.: 18.05 3rd Qu.: 1359.8
## Max. :2442828.2 Max. :14185.50 Max. :2366035.3
## NA's :198 NA's :458 NA's :139
## Profit.after.tax PBDITA PBT
## Min. : -3908.30 Min. : -440.7 Min. : -3894.80
## 1st Qu.: 0.50 1st Qu.: 6.9 1st Qu.: 0.70
## Median : 8.80 Median : 35.4 Median : 12.40
## Mean : 277.36 Mean : 578.1 Mean : 383.81
## 3rd Qu.: 52.27 3rd Qu.: 150.2 3rd Qu.: 71.97
## Max. :119439.10 Max. :208576.5 Max. :145292.60
## NA's :131 NA's :131 NA's :131
## Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## Min. : -2245.70 Min. :-6400.000 Min. :-21340.00
## 1st Qu.: 2.90 1st Qu.: 5.000 1st Qu.: 0.55
## Median : 18.85 Median : 9.660 Median : 3.31
## Mean : 392.07 Mean : 4.571 Mean : -17.28
## 3rd Qu.: 93.20 3rd Qu.: 16.390 3rd Qu.: 8.80
## Max. :176911.80 Max. : 100.000 Max. : 100.00
## NA's :131 NA's :68 NA's :68
## PAT.as...of.total.income Cash.profit.as...of.total.income
## Min. :-21340.00 Min. :-15020.000
## 1st Qu.: 0.35 1st Qu.: 2.020
## Median : 2.34 Median : 5.640
## Mean : -19.20 Mean : -8.229
## 3rd Qu.: 6.34 3rd Qu.: 10.700
## Max. : 150.00 Max. : 100.000
## NA's :68 NA's :68
## PAT.as...of.net.worth Sales Income.from.financial.services
## Min. :-748.72 Min. : 0.1 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 112.7 1st Qu.: 0.40
## Median : 7.92 Median : 453.1 Median : 1.80
## Mean : 10.27 Mean : 4549.5 Mean : 80.84
## 3rd Qu.: 20.19 3rd Qu.: 1433.5 3rd Qu.: 9.68
## Max. :2466.67 Max. :2384984.4 Max. :51938.20
## NA's :259 NA's :935
## Other.income Total.capital Reserves.and.funds
## Min. : 0.00 Min. : 0.1 Min. : -6525.9
## 1st Qu.: 0.40 1st Qu.: 13.1 1st Qu.: 5.0
## Median : 1.40 Median : 42.1 Median : 54.8
## Mean : 41.36 Mean : 216.6 Mean : 1163.8
## 3rd Qu.: 5.97 3rd Qu.: 100.3 3rd Qu.: 277.3
## Max. :42856.70 Max. :78273.2 Max. :625137.8
## NA's :1295 NA's :4 NA's :85
## Deposits..accepted.by.commercial.banks. Borrowings
## Mode:logical Min. : 0.10
## NA's:3541 1st Qu.: 23.95
## Median : 99.20
## Mean : 1122.28
## 3rd Qu.: 352.60
## Max. :278257.30
## NA's :366
## Current.liabilities...provisions Deferred.tax.liability
## Min. : 0.1 Min. : 0.1
## 1st Qu.: 17.8 1st Qu.: 3.2
## Median : 69.4 Median : 13.4
## Mean : 940.6 Mean : 227.2
## 3rd Qu.: 261.7 3rd Qu.: 50.0
## Max. :352240.3 Max. :72796.6
## NA's :96 NA's :1140
## Shareholders.funds Cumulative.retained.profits Capital.employed
## Min. : 0.0 Min. : -6534.3 Min. : 0.0
## 1st Qu.: 32.0 1st Qu.: 1.1 1st Qu.: 60.8
## Median : 105.6 Median : 37.1 Median : 214.7
## Mean : 1322.1 Mean : 890.5 Mean : 2328.3
## 3rd Qu.: 393.2 3rd Qu.: 202.3 3rd Qu.: 767.3
## Max. :613151.6 Max. :390133.8 Max. :891408.9
## NA's :38
## TOL.TNW Total.term.liabilities...tangible.net.worth
## Min. :-350.480 Min. :-325.600
## 1st Qu.: 0.600 1st Qu.: 0.050
## Median : 1.430 Median : 0.340
## Mean : 3.994 Mean : 1.844
## 3rd Qu.: 2.830 3rd Qu.: 1.000
## Max. : 473.000 Max. : 456.000
##
## Contingent.liabilities...Net.worth.... Contingent.liabilities
## Min. : 0.00 Min. : 0.1
## 1st Qu.: 0.00 1st Qu.: 6.3
## Median : 5.33 Median : 38.0
## Mean : 53.94 Mean : 932.9
## 3rd Qu.: 30.76 3rd Qu.: 192.7
## Max. :14704.27 Max. :559506.8
## NA's :1188
## Net.fixed.assets Investments Current.assets
## Min. : 0.0 Min. : 0.00 Min. : 0.1
## 1st Qu.: 26.0 1st Qu.: 1.00 1st Qu.: 36.2
## Median : 93.5 Median : 8.35 Median : 145.1
## Mean : 1189.7 Mean : 694.73 Mean : 1293.4
## 3rd Qu.: 344.9 3rd Qu.: 64.30 3rd Qu.: 502.2
## Max. :636604.6 Max. :199978.60 Max. :354815.2
## NA's :118 NA's :1435 NA's :66
## Net.working.capital Quick.ratio..times. Current.ratio..times.
## Min. :-63839.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: -1.1 1st Qu.: 0.410 1st Qu.: 0.93
## Median : 16.2 Median : 0.670 Median : 1.23
## Mean : 138.6 Mean : 1.401 Mean : 2.13
## 3rd Qu.: 84.2 3rd Qu.: 1.030 3rd Qu.: 1.71
## Max. : 85782.8 Max. :341.000 Max. :505.00
## NA's :32 NA's :93 NA's :93
## Debt.to.equity.ratio..times. Cash.to.current.liabilities..times.
## Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.22 1st Qu.: 0.0200
## Median : 0.79 Median : 0.0700
## Mean : 2.78 Mean : 0.4904
## 3rd Qu.: 1.75 3rd Qu.: 0.1900
## Max. :456.00 Max. :165.0000
## NA's :93
## Cash.to.average.cost.of.sales.per.day Creditors.turnover
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.79 1st Qu.: 3.700
## Median : 8.03 Median : 6.095
## Mean : 158.44 Mean : 15.446
## 3rd Qu.: 21.79 3rd Qu.: 11.490
## Max. :128040.76 Max. :2401.000
## NA's :85 NA's :333
## Debtors.turnover Finished.goods.turnover WIP.turnover
## Min. : 0.00 Min. : -0.09 Min. : -0.18
## 1st Qu.: 3.76 1st Qu.: 8.20 1st Qu.: 5.10
## Median : 6.32 Median : 17.27 Median : 9.76
## Mean : 17.04 Mean : 87.08 Mean : 27.93
## 3rd Qu.: 11.68 3rd Qu.: 40.35 3rd Qu.: 20.24
## Max. :3135.20 Max. :17947.60 Max. :5651.40
## NA's :328 NA's :740 NA's :640
## Raw.material.turnover Shares.outstanding Equity.face.value
## Min. : -2.00 Min. :-2.147e+09 Min. :-999999
## 1st Qu.: 2.99 1st Qu.: 1.316e+06 1st Qu.: 10
## Median : 6.40 Median : 4.672e+06 Median : 10
## Mean : 19.09 Mean : 2.207e+07 Mean : -1334
## 3rd Qu.: 11.85 3rd Qu.: 1.065e+07 3rd Qu.: 10
## Max. :21092.00 Max. : 4.130e+09 Max. : 100000
## NA's :361 NA's :692 NA's :692
## EPS Adjusted.EPS Total.liabilities
## Min. :-843181.8 Min. :-843181.8 Min. : 0.1
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 91.3
## Median : 1.4 Median : 1.2 Median : 309.7
## Mean : -220.3 Mean : -221.5 Mean : 3443.4
## 3rd Qu.: 9.6 3rd Qu.: 7.5 3rd Qu.: 1098.7
## Max. : 34522.5 Max. : 34522.5 Max. :1176509.2
##
## PE.on.BSE Default
## Min. :-1116.64 Min. :0.00000
## 1st Qu.: 3.27 1st Qu.:0.00000
## Median : 9.10 Median :0.00000
## Mean : 63.91 Mean :0.06608
## 3rd Qu.: 17.79 3rd Qu.:0.00000
## Max. :51002.74 Max. :1.00000
## NA's :2194
We need to pre-process our data before we can use it for modeling. This involves the following steps:
• Missing Value Treatment
• Outlier Treatment
• Multicollinearity Check
Let’s check if there are any missing values present in data.
# Checking for missing values available in data.
colSums(is.na(dataset))
## Num
## 0
## Networth.Next.Year
## 0
## Total.assets
## 0
## Net.worth
## 0
## Total.income
## 198
## Change.in.stock
## 458
## Total.expenses
## 139
## Profit.after.tax
## 131
## PBDITA
## 131
## PBT
## 131
## Cash.profit
## 131
## PBDITA.as...of.total.income
## 68
## PBT.as...of.total.income
## 68
## PAT.as...of.total.income
## 68
## Cash.profit.as...of.total.income
## 68
## PAT.as...of.net.worth
## 0
## Sales
## 259
## Income.from.financial.services
## 935
## Other.income
## 1295
## Total.capital
## 4
## Reserves.and.funds
## 85
## Deposits..accepted.by.commercial.banks.
## 3541
## Borrowings
## 366
## Current.liabilities...provisions
## 96
## Deferred.tax.liability
## 1140
## Shareholders.funds
## 0
## Cumulative.retained.profits
## 38
## Capital.employed
## 0
## TOL.TNW
## 0
## Total.term.liabilities...tangible.net.worth
## 0
## Contingent.liabilities...Net.worth....
## 0
## Contingent.liabilities
## 1188
## Net.fixed.assets
## 118
## Investments
## 1435
## Current.assets
## 66
## Net.working.capital
## 32
## Quick.ratio..times.
## 93
## Current.ratio..times.
## 93
## Debt.to.equity.ratio..times.
## 0
## Cash.to.current.liabilities..times.
## 93
## Cash.to.average.cost.of.sales.per.day
## 85
## Creditors.turnover
## 333
## Debtors.turnover
## 328
## Finished.goods.turnover
## 740
## WIP.turnover
## 640
## Raw.material.turnover
## 361
## Shares.outstanding
## 692
## Equity.face.value
## 692
## EPS
## 0
## Adjusted.EPS
## 0
## Total.liabilities
## 0
## PE.on.BSE
## 2194
## Default
## 0
We observe that several variables have missing values in more than 25% of the records. Imputing such variables can end up creating artificial data and lower the accuracy of the model. Hence we’ll be eliminating the variables where more than 25% of the data is missing.
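The 25% rule can also be applied programmatically rather than by reading column positions off the output. The following is a minimal sketch on a made-up toy data frame; the column names and the 0.25 threshold variable are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame standing in for `dataset`; column names are hypothetical.
toy <- data.frame(
  id = 1:8,
  x1 = c(1, NA, 3, 4, 5, 6, 7, 8),    # 12.5% missing
  x2 = c(NA, NA, NA, 4, 5, 6, 7, 8),  # 37.5% missing
  x3 = NA,                            # 100% missing
  y  = c(0, 0, 1, 0, 1, 0, 0, 1)
)

# Share of missing values per column (the mean of a logical vector is a proportion).
na_share <- colMeans(is.na(toy))

# Names of the columns above the 25% threshold.
drop_cols <- names(na_share)[na_share > 0.25]

# Keep everything else.
reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
```

On the loan data, the NA counts printed above imply that this rule would flag the seven columns with more than 885 missing values (25% of 3541 rows), which includes Other.income with 1295.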
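# Eliminating the ID column (Num, column 1) and variables having more than 25% missing values.
# Note: Other.income (column 19, 1295 of 3541 values missing) also exceeds 25% but is kept by this selection.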
data <- dataset[,-c(1,22,25,18,32,34,52)]
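# Imputing missing values for the remaining variables (the target, column 46, is excluded).
# Note: knnImpute also centers and scales the predictors, so the imputed data is standardized.
imputed <- preProcess(data[, -46], method = "knnImpute", k = 5)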
imputed_val <- predict(imputed,data)
# Checking for missing values on the Output data.
anyNA(imputed_val)
## [1] FALSE
# Checking the dimension of the Output data.
dim(imputed_val)
## [1] 3541 46
Once the missing values are treated, we proceed with the next step for treating Outliers in data. For more details on Outlier Detection and treatment, visit http://r-statistics.co/Outlier-Treatment-With-R.html
# Checking for Outliers.
boxplot(imputed_val)
From the above plot, we can see that the data contains a large number of outliers that need to be treated. For this problem, we’ll cap them: values lying more than 1.5 times the IQR below the first quartile or above the third quartile are replaced with the 5th and 95th percentiles respectively.
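With dozens of columns on one boxplot, a per-column count of outliers can be easier to read than the plot. Below is a minimal sketch on made-up data using base R's boxplot.stats, which applies the same 1.5 x IQR rule as the boxplot whiskers. Note that it works from hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Toy data standing in for the imputed data frame; column names are hypothetical.
toy <- data.frame(
  u = c(1:19, 1000),   # one extreme high value
  v = c(-500, 2:20),   # one extreme low value
  w = 1:20             # no outliers
)

# Number of points beyond the boxplot whiskers in each column.
n_out <- sapply(toy, function(col) length(boxplot.stats(col)$out))
```

Sorting n_out in decreasing order would show which variables are most affected.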
# Performing Outlier Treatment.
final <- imputed_val
names(final)
## [1] "Networth.Next.Year"
## [2] "Total.assets"
## [3] "Net.worth"
## [4] "Total.income"
## [5] "Change.in.stock"
## [6] "Total.expenses"
## [7] "Profit.after.tax"
## [8] "PBDITA"
## [9] "PBT"
## [10] "Cash.profit"
## [11] "PBDITA.as...of.total.income"
## [12] "PBT.as...of.total.income"
## [13] "PAT.as...of.total.income"
## [14] "Cash.profit.as...of.total.income"
## [15] "PAT.as...of.net.worth"
## [16] "Sales"
## [17] "Other.income"
## [18] "Total.capital"
## [19] "Reserves.and.funds"
## [20] "Borrowings"
## [21] "Current.liabilities...provisions"
## [22] "Shareholders.funds"
## [23] "Cumulative.retained.profits"
## [24] "Capital.employed"
## [25] "TOL.TNW"
## [26] "Total.term.liabilities...tangible.net.worth"
## [27] "Contingent.liabilities...Net.worth...."
## [28] "Net.fixed.assets"
## [29] "Current.assets"
## [30] "Net.working.capital"
## [31] "Quick.ratio..times."
## [32] "Current.ratio..times."
## [33] "Debt.to.equity.ratio..times."
## [34] "Cash.to.current.liabilities..times."
## [35] "Cash.to.average.cost.of.sales.per.day"
## [36] "Creditors.turnover"
## [37] "Debtors.turnover"
## [38] "Finished.goods.turnover"
## [39] "WIP.turnover"
## [40] "Raw.material.turnover"
## [41] "Shares.outstanding"
## [42] "Equity.face.value"
## [43] "EPS"
## [44] "Adjusted.EPS"
## [45] "Total.liabilities"
## [46] "Default"
summary(final)
## Networth.Next.Year Total.assets Net.worth
## Min. :-4.34613 Min. :-0.11118 Min. :-0.09679
## 1st Qu.:-0.09076 1st Qu.:-0.10823 1st Qu.:-0.09446
## Median :-0.08591 Median :-0.10118 Median :-0.08915
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.06645 3rd Qu.:-0.07571 3rd Qu.:-0.06861
## Max. :46.05806 Max. :37.87640 Max. :45.70217
## Total.income Change.in.stock Total.expenses
## Min. :-0.08230 Min. :-6.97020 Min. :-0.08039
## 1st Qu.:-0.08044 1st Qu.:-0.09803 1st Qu.:-0.07862
## Median :-0.07514 Median :-0.09099 Median :-0.07319
## Mean :-0.00403 Mean :-0.00820 Mean :-0.00281
## 3rd Qu.:-0.05784 3rd Qu.:-0.05990 3rd Qu.:-0.05603
## Max. :43.78935 Max. :32.10362 Max. :44.53764
## Profit.after.tax PBDITA PBT
## Min. :-1.36590 Min. :-0.18019 Min. :-1.03924
## 1st Qu.:-0.09038 1st Qu.:-0.10113 1st Qu.:-0.09308
## Median :-0.08796 Median :-0.09632 Median :-0.09063
## Mean :-0.00325 Mean :-0.00343 Mean :-0.00334
## 3rd Qu.:-0.07465 3rd Qu.:-0.07718 3rd Qu.:-0.07661
## Max. :38.88575 Max. :36.78933 Max. :35.19707
## Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## Min. :-0.62408 Min. :-43.53237 Min. :-50.11996
## 1st Qu.:-0.09212 1st Qu.: 0.00299 1st Qu.: 0.04186
## Median :-0.08864 Median : 0.03493 Median : 0.04847
## Mean :-0.00320 Mean : 0.00084 Mean : 0.00069
## 3rd Qu.:-0.07211 3rd Qu.: 0.08108 3rd Qu.: 0.06153
## Max. :41.76360 Max. : 0.64864 Max. : 0.27567
## PAT.as...of.total.income Cash.profit.as...of.total.income
## Min. :-49.60502 Min. :-49.28902
## 1st Qu.: 0.04537 1st Qu.: 0.03368
## Median : 0.05002 Median : 0.04561
## Mean : 0.00055 Mean : 0.00077
## 3rd Qu.: 0.05930 3rd Qu.: 0.06251
## Max. : 0.39366 Max. : 0.35536
## PAT.as...of.net.worth Sales Other.income
## Min. :-11.65019 Min. :-0.08277 Min. :-0.04509
## 1st Qu.: -0.15763 1st Qu.:-0.08087 1st Qu.:-0.04454
## Median : -0.03606 Median :-0.07558 Median :-0.04343
## Mean : 0.00000 Mean :-0.00541 Mean :-0.01510
## 3rd Qu.: 0.15228 3rd Qu.:-0.05851 3rd Qu.:-0.04055
## Max. : 37.70481 Max. :43.30847 Max. :46.67590
## Total.capital Reserves.and.funds Borrowings
## Min. :-0.12868 Min. :-0.57503 Min. :-0.12925
## 1st Qu.:-0.12095 1st Qu.:-0.08666 1st Qu.:-0.12652
## Median :-0.10366 Median :-0.08301 Median :-0.11838
## Mean : 0.00013 Mean :-0.00194 Mean :-0.00976
## 3rd Qu.:-0.06901 3rd Qu.:-0.06706 3rd Qu.:-0.09288
## Max. :46.39073 Max. :46.66034 Max. :31.92094
## Current.liabilities...provisions Shareholders.funds
## Min. :-0.09852 Min. :-0.09833
## 1st Qu.:-0.09675 1st Qu.:-0.09595
## Median :-0.09161 Median :-0.09048
## Mean :-0.00253 Mean : 0.00000
## 3rd Qu.:-0.07230 3rd Qu.:-0.06908
## Max. :36.80037 Max. :45.50511
## Cumulative.retained.profits Capital.employed TOL.TNW
## Min. :-0.73423 Min. :-0.11051 Min. :-18.2730
## 1st Qu.:-0.08795 1st Qu.:-0.10762 1st Qu.: -0.1750
## Median :-0.08441 Median :-0.10032 Median : -0.1322
## Mean :-0.00091 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.06832 3rd Qu.:-0.07409 3rd Qu.: -0.0600
## Max. :38.49177 Max. :42.19850 Max. : 24.1771
## Total.term.liabilities...tangible.net.worth
## Min. :-22.16484
## 1st Qu.: -0.12143
## Median : -0.10180
## Mean : 0.00000
## 3rd Qu.: -0.05712
## Max. : 30.74205
## Contingent.liabilities...Net.worth.... Net.fixed.assets
## Min. :-0.14238 Min. :-0.08950
## 1st Qu.:-0.14238 1st Qu.:-0.08758
## Median :-0.12831 Median :-0.08258
## Mean : 0.00000 Mean :-0.00270
## 3rd Qu.:-0.06118 3rd Qu.:-0.06459
## Max. :38.67202 Max. :47.80464
## Current.assets Net.working.capital Quick.ratio..times.
## Min. :-0.12764 Min. :-21.720618 Min. :-0.18130
## 1st Qu.:-0.12411 1st Qu.: -0.047384 1st Qu.:-0.12696
## Median :-0.11367 Median : -0.041579 Median :-0.09333
## Mean :-0.00223 Mean : -0.000359 Mean :-0.00049
## 3rd Qu.:-0.07970 3rd Qu.: -0.019307 3rd Qu.:-0.04546
## Max. :34.89241 Max. : 29.076529 Max. :43.93427
## Current.ratio..times. Debt.to.equity.ratio..times.
## Min. :-0.21076 Min. :-0.18563
## 1st Qu.:-0.11777 1st Qu.:-0.17094
## Median :-0.08809 Median :-0.13287
## Mean :-0.00027 Mean : 0.00000
## 3rd Qu.:-0.03961 3rd Qu.:-0.06876
## Max. :49.74931 Max. :30.26704
## Cash.to.current.liabilities..times. Cash.to.average.cost.of.sales.per.day
## Min. :-0.11688 Min. :-0.05768
## 1st Qu.:-0.11211 1st Qu.:-0.05663
## Median :-0.10020 Median :-0.05469
## Mean :-0.00049 Mean :-0.00097
## 3rd Qu.:-0.07160 3rd Qu.:-0.04948
## Max. :39.20874 Max. :46.55514
## Creditors.turnover Debtors.turnover Finished.goods.turnover
## Min. :-0.22645 Min. :-0.20301 Min. :-0.145634
## 1st Qu.:-0.16942 1st Qu.:-0.15667 1st Qu.:-0.129004
## Median :-0.13364 Median :-0.12515 Median :-0.112196
## Mean : 0.00263 Mean :-0.00354 Mean :-0.004225
## 3rd Qu.:-0.04641 3rd Qu.:-0.06125 3rd Qu.:-0.073205
## Max. :34.97439 Max. :37.14581 Max. :29.839856
## WIP.turnover Raw.material.turnover Shares.outstanding
## Min. :-0.18611 Min. :-0.05604 Min. :-13.10754
## 1st Qu.:-0.14764 1st Qu.:-0.04182 1st Qu.: -0.12201
## Median :-0.11613 Median :-0.03279 Median : -0.10635
## Mean :-0.01257 Mean :-0.00222 Mean : -0.01667
## 3rd Qu.:-0.05264 3rd Qu.:-0.01939 3rd Qu.: -0.07281
## Max. :37.23095 Max. :55.99423 Max. : 24.82087
## Equity.face.value EPS Adjusted.EPS
## Min. :-26.63055 Min. :-59.10564 Min. :-59.10565
## 1st Qu.: 0.03583 1st Qu.: 0.01545 1st Qu.: 0.01553
## Median : 0.03583 Median : 0.01555 Median : 0.01561
## Mean : 0.00711 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.03583 3rd Qu.: 0.01612 3rd Qu.: 0.01606
## Max. : 2.70218 Max. : 2.43605 Max. : 2.43614
## Total.liabilities Default
## Min. :-0.11118 Min. :0.00000
## 1st Qu.:-0.10823 1st Qu.:0.00000
## Median :-0.10118 Median :0.00000
## Mean : 0.00000 Mean :0.06608
## 3rd Qu.:-0.07571 3rd Qu.:0.00000
## Max. :37.87640 Max. :1.00000
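# Capping outliers in the 45 predictor columns (column 46 is Default).
a <- c(1:45)
for (val in a) {
  qnt <- quantile(final[, val], probs = c(0.25, 0.75))  # first and third quartiles
  cap <- quantile(final[, val], probs = c(0.05, 0.95))  # capping values
  h <- 1.5 * IQR(final[, val])                          # distance of the fences from the quartiles
  final[, val][final[, val] > (qnt[2] + h)] <- cap[2]   # cap high outliers at the 95th percentile
  final[, val][final[, val] < (qnt[1] - h)] <- cap[1]   # cap low outliers at the 5th percentile
}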
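The same capping rule can be wrapped in a small reusable function. This is a sketch only: cap_outliers is a hypothetical helper name, not a caret function, and the toy vector is made up to show the effect.

```r
# Hypothetical helper: caps values outside the 1.5 * IQR fences
# at the given lower and upper percentiles.
cap_outliers <- function(x, probs = c(0.05, 0.95), k = 1.5) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  cap <- quantile(x, probs = probs, na.rm = TRUE)
  h <- k * IQR(x, na.rm = TRUE)
  x[which(x > qnt[2] + h)] <- cap[2]
  x[which(x < qnt[1] - h)] <- cap[1]
  x
}

toy <- c(1:19, 1000)          # 1000 is far above the upper fence
capped <- cap_outliers(toy)   # 1000 is replaced by the 95th percentile of toy
```

Applied to a data frame of numeric columns, the helper could be used column by column, for example with lapply().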
After we are done with outlier treatment, we are left with the last step of data preparation, i.e. treating the problem of multicollinearity.
For details on multicollinearity, visit https://towardsdatascience.com/https-towardsdatascience-com-multicollinearity-how-does-it-create-a-problem-72956a49058
In this case, we use caret’s findCorrelation function to treat collinearity in the dataset. This function searches through a correlation matrix and returns the columns to remove in order to reduce pair-wise correlations: a vector of integer positions by default, or of column names when names = TRUE, as used here.
# Checking for variables where correlation is higher than 0.7
x<- findCorrelation(cor(final),
cutoff = 0.7,
verbose = TRUE,
names = TRUE,
exact = ncol(imputed_val) < 100
)
## Compare row 8 and column 3 with corr 0.841
## Means: 0.458 vs 0.274 so flagging column 8
## Compare row 3 and column 22 with corr 0.981
## Means: 0.444 vs 0.266 so flagging column 3
## Compare row 22 and column 1 with corr 0.92
## Means: 0.431 vs 0.258 so flagging column 22
## Compare row 1 and column 10 with corr 0.839
## Means: 0.42 vs 0.249 so flagging column 1
## Compare row 10 and column 2 with corr 0.802
## Means: 0.413 vs 0.241 so flagging column 10
## Compare row 2 and column 45 with corr 1
## Means: 0.4 vs 0.233 so flagging column 2
## Compare row 45 and column 24 with corr 0.952
## Means: 0.384 vs 0.225 so flagging column 45
## Compare row 24 and column 4 with corr 0.839
## Means: 0.365 vs 0.217 so flagging column 24
## Compare row 4 and column 9 with corr 0.754
## Means: 0.353 vs 0.209 so flagging column 4
## Compare row 9 and column 19 with corr 0.779
## Means: 0.35 vs 0.201 so flagging column 9
## Compare row 19 and column 7 with corr 0.776
## Means: 0.324 vs 0.193 so flagging column 19
## Compare row 7 and column 16 with corr 0.733
## Means: 0.318 vs 0.185 so flagging column 7
## Compare row 16 and column 6 with corr 0.984
## Means: 0.296 vs 0.178 so flagging column 16
## Compare row 6 and column 29 with corr 0.871
## Means: 0.274 vs 0.171 so flagging column 6
## Compare row 29 and column 23 with corr 0.736
## Means: 0.259 vs 0.164 so flagging column 29
## Compare row 21 and column 28 with corr 0.717
## Means: 0.234 vs 0.159 so flagging column 21
## Compare row 28 and column 20 with corr 0.783
## Means: 0.217 vs 0.155 so flagging column 28
## Compare row 41 and column 18 with corr 0.824
## Means: 0.16 vs 0.152 so flagging column 41
## Compare row 12 and column 13 with corr 0.963
## Means: 0.247 vs 0.148 so flagging column 12
## Compare row 13 and column 14 with corr 0.778
## Means: 0.211 vs 0.141 so flagging column 13
## Compare row 14 and column 11 with corr 0.855
## Means: 0.192 vs 0.137 so flagging column 14
## Compare row 43 and column 44 with corr 0.888
## Means: 0.151 vs 0.134 so flagging column 43
## Compare row 25 and column 33 with corr 0.843
## Means: 0.191 vs 0.13 so flagging column 25
## Compare row 33 and column 26 with corr 0.877
## Means: 0.159 vs 0.127 so flagging column 33
## Compare row 31 and column 32 with corr 0.84
## Means: 0.17 vs 0.123 so flagging column 31
## Compare row 34 and column 35 with corr 0.707
## Means: 0.12 vs 0.12 so flagging column 35
## All correlations <= 0.7
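The above piece of code returns, in x, the names of the variables flagged because their pair-wise correlation is higher than 0.7. Note that final[x], as used below, selects exactly these flagged variables, so the output that follows contains the 26 flagged columns. To eliminate them from the dataset instead, select the complement, e.g. final[, setdiff(names(final), x)].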
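Because findCorrelation returns the columns to remove, dropping them means selecting everything except its result. Here is a minimal sketch on simulated data; the toy columns a, b, c and d are made up, with b constructed to be almost a copy of a.

```r
# Simulated stand-in data: `b` is nearly identical to `a`, `c` and `d` are independent.
library(caret)

set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n), d = rnorm(n))

# Names of the columns suggested for removal at a 0.7 cutoff.
flagged <- findCorrelation(cor(toy), cutoff = 0.7, names = TRUE)

# Keep the complement, i.e. drop the flagged columns.
reduced <- toy[, setdiff(names(toy), flagged), drop = FALSE]
```

Here one of a or b is flagged, and reduced keeps the other together with c and d.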
final1 <- final[x]
# Adding the dependent variable to the dataset.
final <- cbind(data$Default, final1)
# Checking the Dimensions of the final data.
dim(final)
## [1] 3541 27
# Checking the structure of the final data.
str(final)
## 'data.frame': 3541 obs. of 27 variables:
## $ data$Default : int 0 0 0 0 0 0 0 0 0 1 ...
## $ PBDITA : num 0.1147 -0.0522 -0.1012 -0.1018 -0.0968 ...
## $ Net.worth : num 0.1299 -0.0705 -0.0893 -0.0966 -0.0888 ...
## $ Shareholders.funds : num 0.1367 -0.0722 -0.0908 -0.0981 -0.0903 ...
## $ Networth.Next.Year : num 0.123 -0.07 -0.0873 -0.0924 -0.0863 ...
## $ Cash.profit : num 0.0972 -0.0552 -0.0927 -0.0925 -0.0899 ...
## $ Total.assets : num 0.1617 -0.0808 -0.1037 -0.1111 -0.0957 ...
## $ Total.liabilities : num 0.1617 -0.0808 -0.1037 -0.1111 -0.0957 ...
## $ Capital.employed : num 0.1737 -0.0809 -0.1041 -0.1104 -0.0962 ...
## $ Total.income : num 0.0773 -0.0549 -0.0737 -0.0817 -0.0539 ...
## $ PBT : num 0.0853 0.0853 -0.0948 -0.0932 -0.0917 ...
## $ Reserves.and.funds : num 0.116 -0.0678 -0.0856 -0.0869 -0.0749 ...
## $ Profit.after.tax : num 0.093 -0.0529 -0.0927 -0.0905 -0.0887 ...
## $ Sales : num 0.0777 -0.0554 -0.0741 -0.0821 -0.0541 ...
## $ Total.expenses : num 0.0815 -0.053 -0.0714 -0.0797 -0.051 ...
## $ Current.assets : num 0.1981 -0.072 -0.1111 -0.1276 -0.0931 ...
## $ Current.liabilities...provisions : num 0.1116 -0.0765 -0.0884 -0.0973 -0.0867 ...
## $ Net.fixed.assets : num 0.1114 -0.068 -0.0866 -0.0893 -0.0824 ...
## $ Shares.outstanding : num 0.2571 -0.0635 -0.0841 -0.133 -0.1296 ...
## $ PBT.as...of.total.income : num 0.0634 0.0696 0.0374 0.0406 0.0416 ...
## $ PAT.as...of.total.income : num 0.059 0.0622 0.0415 0.0447 0.0455 ...
## $ Cash.profit.as...of.total.income : num 0.0516 0.0611 0.0272 0.027 0.0295 ...
## $ EPS : num 0.0216 0.0161 0.0154 0.0154 0.016 ...
## $ TOL.TNW : num -0.137 -0.142 -0.132 -0.206 -0.06 ...
## $ Debt.to.equity.ratio..times. : num -0.1856 -0.1335 -0.1623 -0.1856 -0.0661 ...
## $ Quick.ratio..times. : num -0.02864 -0.0584 -0.0377 -0.07754 0.00111 ...
## $ Cash.to.average.cost.of.sales.per.day: num 0.0135 -0.0555 -0.0515 -0.052 -0.0577 ...
#Converting the Dependent variable to Factor
colnames(final)[colnames(final) == "data$Default"] <- "Default"
final$Default <- as.factor(final$Default)
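To make the correlation filter used above more concrete, here is a minimal sketch on toy data. The data frame toy and its columns a, b and c are made up for illustration and are not part of the loan dataset; the sketch assumes findCorrelation() is given a correlation matrix and returns the indices of the columns it suggests removing.

```r
library(caret)

set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # nearly a copy of `a`
  c = rnorm(n)                 # independent noise
)

# Indices of columns suggested for removal at a 0.7 absolute-correlation cutoff
drop_idx <- findCorrelation(cor(toy), cutoff = 0.7)

# Guard against the empty case: toy[, -integer(0)] would drop every column
kept <- if (length(drop_idx) > 0) toy[, -drop_idx, drop = FALSE] else toy
names(kept)
```

Only one of the two near-duplicate columns is flagged, so the reduced data keeps c plus either a or b.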
Feature selection is a crucial part of modeling. For now, we’ll be using Recursive Feature Elimination (RFE), a wrapper method that repeatedly fits a model and drops the weakest predictors, to find the best subset of features to use for modeling.
# Feature Selection using Recursive Feature Elimination
ctrl <- rfeControl(functions = rfFuncs,
method = "cv",
verbose = FALSE)
lmProfile <- rfe(x=final[, 2:27], y=final$Default,
rfeControl = ctrl)
lmProfile
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 1 1 0 0 *
## 8 1 1 0 0
## 16 1 1 0 0
## 26 1 1 0 0
##
## The top 4 variables (out of 4):
## Networth.Next.Year, TOL.TNW, Net.worth, Debt.to.equity.ratio..times.
The feature selection algorithm above selected a subset of only 4 features as the most important variables.
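The same rfe() workflow can be tried on a small built-in dataset. The sketch below is an illustration rather than a rerun of the analysis above: it assumes the mtcars data and swaps rfFuncs for lmFuncs (a linear model, so a regression problem) to avoid needing the randomForest package. The names ctrl_lm and rfe_fit are hypothetical.

```r
library(caret)

set.seed(100)
# 5-fold cross-validation around the whole elimination procedure
ctrl_lm <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

rfe_fit <- rfe(x = mtcars[, -1], y = mtcars$mpg,
               sizes = c(2, 4, 6),      # subset sizes to evaluate
               rfeControl = ctrl_lm)

rfe_fit$optsize      # subset size chosen by resampling
predictors(rfe_fit)  # names of the selected predictors
```

The sizes argument controls which subset sizes are compared; the full set of predictors is always evaluated as well.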
This is probably the part where Caret stands out from any other available package. It provides the ability for implementing 200+ machine learning algorithms using consistent syntax. To get a list of all the algorithms that Caret supports, you can use:
modelnames <- paste(names(getModelInfo()))
modelnames
## [1] "ada" "AdaBag" "AdaBoost.M1"
## [4] "adaboost" "amdai" "ANFIS"
## [7] "avNNet" "awnb" "awtan"
## [10] "bag" "bagEarth" "bagEarthGCV"
## [13] "bagFDA" "bagFDAGCV" "bam"
## [16] "bartMachine" "bayesglm" "binda"
## [19] "blackboost" "blasso" "blassoAveraged"
## [22] "bridge" "brnn" "BstLm"
## [25] "bstSm" "bstTree" "C5.0"
## [28] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [31] "cforest" "chaid" "CSimca"
## [34] "ctree" "ctree2" "cubist"
## [37] "dda" "deepboost" "DENFIS"
## [40] "dnn" "dwdLinear" "dwdPoly"
## [43] "dwdRadial" "earth" "elm"
## [46] "enet" "evtree" "extraTrees"
## [49] "fda" "FH.GBML" "FIR.DM"
## [52] "foba" "FRBCS.CHI" "FRBCS.W"
## [55] "FS.HGD" "gam" "gamboost"
## [58] "gamLoess" "gamSpline" "gaussprLinear"
## [61] "gaussprPoly" "gaussprRadial" "gbm_h2o"
## [64] "gbm" "gcvEarth" "GFS.FR.MOGUL"
## [67] "GFS.LT.RS" "GFS.THRIFT" "glm.nb"
## [70] "glm" "glmboost" "glmnet_h2o"
## [73] "glmnet" "glmStepAIC" "gpls"
## [76] "hda" "hdda" "hdrda"
## [79] "HYFIS" "icr" "J48"
## [82] "JRip" "kernelpls" "kknn"
## [85] "knn" "krlsPoly" "krlsRadial"
## [88] "lars" "lars2" "lasso"
## [91] "lda" "lda2" "leapBackward"
## [94] "leapForward" "leapSeq" "Linda"
## [97] "lm" "lmStepAIC" "LMT"
## [100] "loclda" "logicBag" "LogitBoost"
## [103] "logreg" "lssvmLinear" "lssvmPoly"
## [106] "lssvmRadial" "lvq" "M5"
## [109] "M5Rules" "manb" "mda"
## [112] "Mlda" "mlp" "mlpKerasDecay"
## [115] "mlpKerasDecayCost" "mlpKerasDropout" "mlpKerasDropoutCost"
## [118] "mlpML" "mlpSGD" "mlpWeightDecay"
## [121] "mlpWeightDecayML" "monmlp" "msaenet"
## [124] "multinom" "mxnet" "mxnetAdam"
## [127] "naive_bayes" "nb" "nbDiscrete"
## [130] "nbSearch" "neuralnet" "nnet"
## [133] "nnls" "nodeHarvest" "null"
## [136] "OneR" "ordinalNet" "ordinalRF"
## [139] "ORFlog" "ORFpls" "ORFridge"
## [142] "ORFsvm" "ownn" "pam"
## [145] "parRF" "PART" "partDSA"
## [148] "pcaNNet" "pcr" "pda"
## [151] "pda2" "penalized" "PenalizedLDA"
## [154] "plr" "pls" "plsRglm"
## [157] "polr" "ppr" "PRIM"
## [160] "protoclass" "qda" "QdaCov"
## [163] "qrf" "qrnn" "randomGLM"
## [166] "ranger" "rbf" "rbfDDA"
## [169] "Rborist" "rda" "regLogistic"
## [172] "relaxo" "rf" "rFerns"
## [175] "RFlda" "rfRules" "ridge"
## [178] "rlda" "rlm" "rmda"
## [181] "rocc" "rotationForest" "rotationForestCp"
## [184] "rpart" "rpart1SE" "rpart2"
## [187] "rpartCost" "rpartScore" "rqlasso"
## [190] "rqnc" "RRF" "RRFglobal"
## [193] "rrlda" "RSimca" "rvmLinear"
## [196] "rvmPoly" "rvmRadial" "SBC"
## [199] "sda" "sdwd" "simpls"
## [202] "SLAVE" "slda" "smda"
## [205] "snn" "sparseLDA" "spikeslab"
## [208] "spls" "stepLDA" "stepQDA"
## [211] "superpc" "svmBoundrangeString" "svmExpoString"
## [214] "svmLinear" "svmLinear2" "svmLinear3"
## [217] "svmLinearWeights" "svmLinearWeights2" "svmPoly"
## [220] "svmRadial" "svmRadialCost" "svmRadialSigma"
## [223] "svmRadialWeights" "svmSpectrumString" "tan"
## [226] "tanSearch" "treebag" "vbmpRadial"
## [229] "vglmAdjCat" "vglmContRatio" "vglmCumulative"
## [232] "widekernelpls" "WM" "wsrf"
## [235] "xgbDART" "xgbLinear" "xgbTree"
## [238] "xyf"
Yes, it’s a huge list!
And if you want to know more details, like the hyperparameters of an algorithm and whether it can be used for regression or classification problems, then call modelLookup(algo).
Let’s look up the hyperparameters for one of the many models supported by Caret.
modelLookup("rf")
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
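The lookup can also be done programmatically. The following sketch assumes each entry returned by getModelInfo() carries a type field listing "Classification" and/or "Regression"; the variable names are made up for illustration.

```r
library(caret)

# Tuning parameters for k-nearest neighbours
knn_info <- modelLookup("knn")
knn_info[, c("parameter", "forReg", "forClass")]

# Names of all registered models that declare support for classification
all_models <- getModelInfo()
is_classifier <- vapply(all_models,
                        function(m) "Classification" %in% m$type,
                        logical(1))
classifiers <- names(all_models)[is_classifier]
length(classifiers)
```

This is handy for narrowing the long list above down to the methods that actually fit the problem at hand.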
Let’s split the data into Train and Test before we proceed with Model Creation.
# Create the training and test datasets
set.seed(100)
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(final$Default, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainData <- final[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- final[-trainRowNumbers,]
Once the data is split into Train and Test, we check the proportion of Default and Non-Default in the Train data.
prop.table(table(trainData$Default))
##
## 0 1
## 0.93366267 0.06633733
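createDataPartition() samples within each class when the outcome is a factor, which is why the class proportions survive the split. A minimal sketch on a made-up, imbalanced outcome vector (y is hypothetical and only mimics the imbalance seen here):

```r
library(caret)

set.seed(100)
# Hypothetical imbalanced outcome: 930 non-defaults, 70 defaults
y <- factor(c(rep("0", 930), rep("1", 70)))

idx <- createDataPartition(y, p = 0.8, list = FALSE)

round(prop.table(table(y)), 3)        # full data
round(prop.table(table(y[idx])), 3)   # training part
round(prop.table(table(y[-idx])), 3)  # held-out part
```

All three tables should show roughly the same share of defaults, which a plain random sample does not guarantee for a rare class.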
The distribution of Non-Default vs Default is roughly 93:7 percent, so the classes are heavily imbalanced. We can balance the data by using either undersampling or oversampling. Caret provides undersampling through downSample() and oversampling through upSample().
Here we use the oversampling method in Caret.
# Note: columns 2:24 keep 23 of the 26 predictors; the last three columns of `final` are left out.
train <- upSample(trainData[,c(2:24)], as.factor(trainData$Default))
prop.table(table(train$Class))
##
## 0 1
## 0.5 0.5
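To see what the two balancing helpers do, here is a small sketch on made-up data (x and y are hypothetical). Both functions return a data frame whose outcome column is named Class by default, which is why the formula Class ~ . is used later on.

```r
library(caret)

set.seed(100)
x <- data.frame(f1 = rnorm(100), f2 = rnorm(100))
y <- factor(c(rep("0", 90), rep("1", 10)))

up   <- upSample(x, y)    # minority rows resampled with replacement
down <- downSample(x, y)  # majority rows randomly dropped

table(up$Class)
table(down$Class)
```

As in the post, balancing is applied to the training data only, so the test set keeps the original class distribution.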
Let’s train a Random Forest model by setting method=‘rf’. Its only tuning parameter is mtry, and tuneLength=10 asks Caret to try up to 10 candidate values for it.
# Performing Random Forest
Control <- trainControl(method = "cv",number = 5)
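# The argument must be spelled trControl; a misspelled name is silently ignored and the default bootstrap is used.
rf <- train(Class~., train,trControl = Control,method = "rf",tuneLength=10)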
In the above piece of code, we have used trainControl(). This function defines how train() evaluates each candidate set of hyperparameters, which makes parameter tuning in Caret extremely easy.
It is possible to customize almost every step in the tuning process. By default, Caret evaluates each set of parameters using the bootstrap, but k-fold, repeated k-fold and leave-one-out cross-validation (LOOCV) are also available and can be specified using trainControl(). In this example, we’ll be using 5-fold cross-validation.
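The resampling options and the two ways of choosing candidate hyperparameters can be sketched as follows. This is an illustration on the built-in iris data with method = "knn" (chosen so that no extra modeling package is needed); the object names are hypothetical.

```r
library(caret)

cv5    <- trainControl(method = "cv", number = 5)                      # 5-fold CV
rep_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # repeated 5-fold CV
loocv  <- trainControl(method = "LOOCV")                               # leave-one-out

set.seed(100)
# tuneLength: let caret pick 3 candidate values of k
knn_auto <- train(Species ~ ., data = iris, method = "knn",
                  trControl = cv5, tuneLength = 3)

set.seed(100)
# tuneGrid: evaluate exactly the values supplied
knn_fixed <- train(Species ~ ., data = iris, method = "knn",
                   trControl = cv5, tuneGrid = data.frame(k = c(5, 11)))

knn_auto$results[, c("k", "Accuracy")]
knn_fixed$bestTune
```

In the same way, passing tuneGrid = data.frame(mtry = 5) to a method = "rf" call would fix mtry at 5 instead of searching over several values.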
Let’s train a few more models from the list using the same resampling settings we used for the Random Forest model.
# Training eXtreme Gradient Boost Model
xgbModel1 <- train(Class~.,train,
trControl = Control ,method = "xgbTree",tuneLength=10)
# Training Regularized Logistic Regression Model (glmnet)
logit.Model2 <- train(Class ~.,
data = train,
method = "glmnet",
trControl = Control,
tuneLength=10)
# Training AdaBoost Model (stored in the object neural.net; it is not a neural network)
neural.net <- train(Class~.,train,
trControl = Control ,method = "adaboost",
tuneLength=10,
verbose=FALSE)
Caret also makes variable importance estimates accessible through varImp(). For models without a built-in importance measure, it falls back to a model-free, ROC-based measure, as seen for adaboost below. Let’s have a look at the variable importance for all four models that we created:
#Checking variable importance for Random Forest
#Variable Importance
varImp(rf)
## rf variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## Networth.Next.Year 1.000e+02
## PAT.as...of.total.income 5.697e+00
## PBT.as...of.total.income 4.701e-01
## Reserves.and.funds 1.199e-01
## TOL.TNW 1.139e-01
## Sales 5.577e-03
## Net.worth 4.914e-03
## Cash.profit 4.766e-03
## Total.expenses 0.000e+00
## Current.liabilities...provisions 0.000e+00
## Total.income 0.000e+00
## Total.liabilities 0.000e+00
## PBDITA 0.000e+00
## EPS 0.000e+00
## PBT 0.000e+00
## Cash.profit.as...of.total.income 0.000e+00
## Current.assets 0.000e+00
## Net.fixed.assets 0.000e+00
## Shares.outstanding 0.000e+00
## Shareholders.funds 0.000e+00
#Plotting Variable Importance
plot(varImp(rf),main="RF - Variable Importance")
#Checking variable importance for XG Boost
#Variable Importance
varImp(xgbModel1)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## Networth.Next.Year 100.00000
## PAT.as...of.total.income 2.88568
## TOL.TNW 2.01947
## PBT 1.51418
## Reserves.and.funds 0.69349
## Cash.profit.as...of.total.income 0.24398
## Net.worth 0.06515
## EPS 0.01307
## Total.liabilities 0.00000
## Current.liabilities...provisions 0.00000
## Total.expenses 0.00000
## Total.income 0.00000
## Total.assets 0.00000
## Cash.profit 0.00000
## Shares.outstanding 0.00000
## PBDITA 0.00000
## Sales 0.00000
## Current.assets 0.00000
## Shareholders.funds 0.00000
## Capital.employed 0.00000
#Plotting Variable Importance
plot(varImp(xgbModel1),main="XG Boost - Variable Importance")
#Checking variable importance for Logistic Regression
#Variable Importance
varImp(logit.Model2)
## glmnet variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## Networth.Next.Year 100.00000
## Net.fixed.assets 1.32306
## EPS 1.03294
## PAT.as...of.total.income 0.81899
## PBT.as...of.total.income 0.49490
## Profit.after.tax 0.48259
## Total.expenses 0.36353
## Net.worth 0.33063
## Total.income 0.32835
## Sales 0.29101
## PBDITA 0.26791
## Cash.profit.as...of.total.income 0.25505
## TOL.TNW 0.06527
## Shares.outstanding 0.00344
## Cash.profit 0.00000
## Shareholders.funds 0.00000
## PBT 0.00000
## Total.assets 0.00000
## Current.liabilities...provisions 0.00000
## Capital.employed 0.00000
#Plotting Variable Importance
plot(varImp(logit.Model2),main="Logistic Regression - Variable Importance")
#Checking variable importance for adaboost
#Variable Importance
varImp(neural.net)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 23)
##
## Importance
## Networth.Next.Year 100.00
## PBT 74.24
## Profit.after.tax 74.11
## Cash.profit 72.16
## PAT.as...of.total.income 71.16
## PBT.as...of.total.income 69.03
## Reserves.and.funds 68.72
## Cash.profit.as...of.total.income 65.35
## EPS 65.35
## Net.worth 62.41
## Shareholders.funds 59.89
## TOL.TNW 57.47
## PBDITA 57.44
## Current.assets 38.07
## Capital.employed 37.50
## Total.expenses 34.96
## Total.liabilities 33.99
## Total.assets 33.99
## Total.income 33.18
## Sales 32.16
#Plotting Variable Importance
plot(varImp(neural.net),main="adaboost - Variable Importance")
Clearly, the variable importance estimates differ across models, and comparing them gives a more holistic view of the importance of each predictor. Two main uses of variable importance from various models are:
• Predictors that are important for the majority of models are likely to be genuinely important.
• For ensembling, we should use predictions from models that have significantly different variable importance, as their predictions are also expected to be different. One thing that must be taken into consideration, though, is that all of them should be sufficiently accurate.
We can fine tune our Models by using the variables that appeared to be the most significant in variable Importance estimations.
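One way to act on this is to pull the top-ranked predictors out of the varImp() result and refit on them. The sketch below is only an illustration: it assumes the built-in mtcars data and a linear model (a regression problem, unlike the classification task in this post), and the choice of three predictors is arbitrary.

```r
library(caret)

set.seed(100)
ctrl <- trainControl(method = "cv", number = 5)
full_fit <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)

# Importance scores as a data frame, one row per predictor, scaled to 0-100
imp  <- varImp(full_fit)$importance
top3 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:3]

# Refit using only the three highest-ranked predictors
small_fit <- train(reformulate(top3, response = "mpg"),
                   data = mtcars, method = "lm", trControl = ctrl)
top3
```

Comparing the resampled error of full_fit and small_fit then shows whether the reduced model gives up any accuracy.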
For predicting the dependent variable we will be using predict(), a generic function from base R for which Caret supplies a method for its models. For classification problems, it accepts an argument named ‘type’, which can be set to either “prob” or “raw”. With type=“raw”, the predictions are just the outcome classes, while with type=“prob”, it returns, for each observation, the probability of belonging to each class of the outcome variable.
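The difference between the two prediction types is easiest to see side by side. A minimal sketch on the built-in iris data with a k-nearest-neighbours model (the object names are hypothetical):

```r
library(caret)

set.seed(100)
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = trainControl(method = "cv", number = 5))

new_rows <- iris[c(1, 51, 101), ]

raw_pred  <- predict(fit, newdata = new_rows, type = "raw")   # factor of class labels
prob_pred <- predict(fit, newdata = new_rows, type = "prob")  # one column per class

raw_pred
rowSums(prob_pred)  # each row of class probabilities sums to 1
```

Note that the new observations are passed through the newdata argument; an argument named data is not recognised by predict() for Caret models.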
We run the predictions on both Train and Test to understand how the model performs on the training data and on unseen data.
# Predicting Model on Train Data for Random Forest
pred.train.rf <- predict(rf, train, type = "raw")
summary(pred.train.rf)
## 0 1
## 2646 2646
Caret also provides a confusionMatrix function which will give the confusion matrix along with various other metrics for your predictions. Here is the performance analysis of our Random Forest model.
# Creating Confusion Matrix on Train data.
confusionMatrix(pred.train.rf, train$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2646 0
## 1 0 2646
##
## Accuracy : 1
## 95% CI : (0.9993, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : 0
##
# Predicting Model on Test Data for Random Forest
pred.test.rf <- predict(rf, testData, type = "raw")
summary(pred.test.rf)
## 0 1
## 661 46
# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.rf, testData$Default, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 661 0
## 1 0 46
##
## Accuracy : 1
## 95% CI : (0.9948, 1)
## No Information Rate : 0.9349
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00000
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 1.00000
## Prevalence : 0.06506
## Detection Rate : 0.06506
## Detection Prevalence : 0.06506
## Balanced Accuracy : 1.00000
##
## 'Positive' Class : 1
##
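To check how the reported metrics relate to the table, here is a small hand-made example. The vectors truth and pred are invented for illustration; confusionMatrix() takes the predictions first and the reference (actual) classes second.

```r
library(caret)

truth <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1), levels = c(0, 1))

cm <- confusionMatrix(data = pred, reference = truth, positive = "1")

cm$table                   # rows = predictions, columns = reference
cm$overall["Accuracy"]     # (2 + 6) / 10
cm$byClass["Sensitivity"]  # 2 of 3 actual positives found
cm$byClass["Specificity"]  # 6 of 7 actual negatives found
```

With a rare positive class such as Default, sensitivity and specificity are usually more informative than accuracy alone, which is why positive = "1" is set for the test data.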
We now repeat the predictions for the XG Boost, Logistic Regression and adaboost models created above, on both Train and Test data.
# Predicting Model on Train Data for XG Boost
pred.train.xgb <- predict(xgbModel1, newdata = train, type="raw")
summary(pred.train.xgb)
## 0 1
## 2646 2646
# Creating Confusion Matrix on Train data.
confusionMatrix(pred.train.xgb, train$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2646 0
## 1 0 2646
##
## Accuracy : 1
## 95% CI : (0.9993, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : 0
##
# Predicting Model on Test Data for XG Boosting
pred.test.xgb <- predict(xgbModel1, testData, type = "raw")
summary(pred.test.xgb)
## 0 1
## 661 46
# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.xgb, testData$Default, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 661 0
## 1 0 46
##
## Accuracy : 1
## 95% CI : (0.9948, 1)
## No Information Rate : 0.9349
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00000
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 1.00000
## Prevalence : 0.06506
## Detection Rate : 0.06506
## Detection Prevalence : 0.06506
## Balanced Accuracy : 1.00000
##
## 'Positive' Class : 1
##
# Predicting Model on Train Data for Logistic Regression
pred.train.logit <- predict(logit.Model2, newdata = train, type="raw")
summary(pred.train.logit)
## 0 1
## 2555 2737
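# Creating Confusion Matrix on Train data.
# Note: the arguments are in (actual, predicted) order here, so in the output below the rows
# labelled "Prediction" are the actual classes and the columns labelled "Reference" are the predictions.
confusionMatrix(train$Class,pred.train.logit)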
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2555 91
## 1 0 2646
##
## Accuracy : 0.9828
## 95% CI : (0.9789, 0.9861)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9656
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.9668
## Pos Pred Value : 0.9656
## Neg Pred Value : 1.0000
## Prevalence : 0.4828
## Detection Rate : 0.4828
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9834
##
## 'Positive' Class : 0
##
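# Predicting Model on Test Data for Logistic Regression
# Note: this call uses xgbModel1, so the output below repeats the XG Boost test results;
# pass logit.Model2 instead to evaluate the logistic regression model.
pred.test.logit <- predict(xgbModel1, testData, type = "raw")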
summary(pred.test.logit)
## 0 1
## 661 46
# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.logit, testData$Default, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 661 0
## 1 0 46
##
## Accuracy : 1
## 95% CI : (0.9948, 1)
## No Information Rate : 0.9349
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00000
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 1.00000
## Prevalence : 0.06506
## Detection Rate : 0.06506
## Detection Prevalence : 0.06506
## Balanced Accuracy : 1.00000
##
## 'Positive' Class : 1
##
# Predicting Model on Train Data for adaboost Model
pred.train.nn <- predict(neural.net, newdata = train, type="raw")
summary(pred.train.nn)
## 0 1
## 2646 2646
# Creating Confusion Matrix on Train data.
confusionMatrix(pred.train.nn, train$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2646 0
## 1 0 2646
##
## Accuracy : 1
## 95% CI : (0.9993, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : 0
##
# Predicting Model on Test Data for adaboost Model
pred.test.nn <- predict(neural.net, testData, type = "raw")
summary(pred.test.nn)
## 0 1
## 661 46
# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.nn, testData$Default, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 661 0
## 1 0 46
##
## Accuracy : 1
## 95% CI : (0.9948, 1)
## No Information Rate : 0.9349
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00000
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 1.00000
## Prevalence : 0.06506
## Detection Rate : 0.06506
## Detection Prevalence : 0.06506
## Balanced Accuracy : 1.00000
##
## 'Positive' Class : 1
##
Now that we have created various models using Caret, you might be wondering how to check which model works best and should be considered. This can be done in many ways. In this problem, we select the model by comparing the confusion matrices created on Train and Test data for all the models and choosing the one that gives the highest accuracy on both. Since several models here reach an accuracy of 1 on both sets, which is unusual on real data, it is worth checking that no single predictor directly encodes the outcome before relying on that ranking.
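Another of those ways, offered here as a sketch rather than the method used in this post, is Caret’s resamples() function, which compares models on their cross-validation results. The example assumes the built-in iris data and two lightweight methods ("knn" and "rpart", the latter from the rpart package that ships with R); fixing the folds through the index argument makes sure both models are scored on identical splits.

```r
library(caret)

set.seed(100)
# Fix the fold membership so every model is evaluated on identical splits
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

knn_fit   <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)
rpart_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Collect the per-fold metrics of both models and summarise them
cmp <- resamples(list(knn = knn_fit, rpart = rpart_fit))
summary(cmp)$statistics$Accuracy
```

Because the comparison uses held-out folds, it is less prone to rewarding a model that has merely memorised the training data.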
The purpose of this post was to showcase the power of Caret as a standalone package that can act as a one-stop shop for most of your needs as a data scientist, from basic data preparation to building complex machine learning models.
This information should serve as a reference and also as a template you can use to build a standardised machine learning workflow from scratch, so you can develop it further from there.
• Caret Package Homepage: http://topepo.github.io/caret/index.html
• Caret Package on CRAN: http://cran.r-project.org/web/packages/caret/
• Caret Package Manual: http://cran.r-project.org/web/packages/caret/caret.pdf
• A Short Introduction to the caret Package: http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf