Caret: A Complete Solution for Machine Learning in R

Tired of remembering too many different packages?

One of the biggest challenges beginners in data science face is deciding which algorithms to learn and focus on. In R, the problem is compounded by the fact that the same functionality can be achieved in several ways using different libraries. That variety is great, but it is also frustrating: each package was designed independently and has its own syntax, inputs and outputs. This can be too much for a beginner.

Here is a tip for handling everything, from exploring data to running complex machine learning algorithms to tuning their hyperparameters, under a single roof.

All this has been made possible by the years of effort that have gone into caret (Classification And REgression Training), possibly one of the biggest projects in R. This package alone is enough to solve almost any supervised machine learning problem. Not only does caret let you run a plethora of ML methods, it also provides tools for auxiliary tasks such as:

• Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)

• Data splitting

• Variable selection

• Model evaluation
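As a rough illustration of how these pieces fit together, the sketch below runs data splitting, data preparation, model training and model evaluation on R's built-in iris data (not the loan dataset used in this article). The choice of k-nearest neighbours, 5-fold cross-validation and an 80/20 split are assumptions made for the example only.

```r
library(caret)

set.seed(42)  # make the split and the resampling reproducible

# Data splitting: stratified 80/20 split on the outcome
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Data preparation: learn centering/scaling on the training rows only
pp <- preProcess(train_set[, 1:4], method = c("center", "scale"))
train_x <- predict(pp, train_set[, 1:4])
test_x  <- predict(pp, test_set[, 1:4])

# Model training: k-nearest neighbours tuned with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(x = train_x, y = train_set$Species,
             method = "knn", trControl = ctrl)

# Model evaluation on the held-out rows
pred <- predict(fit, newdata = test_x)
cm <- confusionMatrix(pred, test_set$Species)
print(cm$overall["Accuracy"])
```

Every step goes through the same package and the same predict() interface, which is the main convenience caret offers over mixing several independent libraries.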

Here is an end-to-end guide that showcases the power of a package that has it all.

In this problem statement, we have to predict the loan status (the Default column) of a borrower based on their profile. We'll get started by loading the caret library and the Loan Default dataset, which is available in my working directory.

# Loading the library (install it once with install.packages("caret") if needed).
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

# Setting up the working Directory and Loading the Loan Default dataset.

setwd("D:/Great Learning/Finance and Risk Analytics")

dataset <- read.csv('raw-data.csv')
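A hard-coded setwd() ties the script to one machine. The sketch below shows one possible alternative that builds the file path explicitly and fails early if the file is missing. The data_dir variable is hypothetical, and a tiny stand-in file is written to a temporary folder so that the snippet runs on its own; it is not the real loan data.

```r
# Hypothetical folder; in practice this would be the directory holding raw-data.csv
data_dir <- tempdir()
csv_path <- file.path(data_dir, "raw-data.csv")

# Stand-in file so the example is self-contained (made-up values)
write.csv(data.frame(Num = 1:3, Default = c(0, 0, 1)),
          csv_path, row.names = FALSE)

# Stop with an error right away if the file is not where we expect it
stopifnot(file.exists(csv_path))

demo <- read.csv(csv_path)
dim(demo)
```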

Once the data is available in the R environment, we perform a few exploratory checks to understand its structure and to make sure it loaded correctly.

### Performing basic Exploratory Analysis

# Checking the class of the data. 
class(dataset)

## [1] "data.frame"

# Checking the dimension of data.
dim(dataset)

## [1] 3541   53

# Reading top 5 Rows.
head(dataset, n=5)

##   Num Networth.Next.Year Total.assets Net.worth Total.income
## 1   1             8890.6      17512.3    7093.2      24965.2
## 2   2              394.3        941.0     351.5       1527.4
## 3   3               92.2        232.8     100.6        477.3
## 4   4                2.7          2.7       2.7           NA
## 5   5              109.0        478.5     107.6       1580.5
##   Change.in.stock Total.expenses Profit.after.tax PBDITA    PBT
## 1           235.8        23657.8           1543.2 2860.2 2417.2
## 2            42.7         1454.9            115.2  283.0  188.4
## 3            -5.2          478.7             -6.6    5.8   -6.6
## 4              NA             NA               NA     NA     NA
## 5           -17.0         1558.0              5.5   31.0    6.3
##   Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## 1      1872.8                       11.46                     9.68
## 2       158.6                       18.53                    12.33
## 3         0.3                        1.22                    -1.38
## 4          NA                        0.00                     0.00
## 5        11.9                        1.96                     0.40
##   PAT.as...of.total.income Cash.profit.as...of.total.income
## 1                     6.18                             7.50
## 2                     7.54                            10.38
## 3                    -1.38                             0.06
## 4                     0.00                             0.00
## 5                     0.35                             0.75
##   PAT.as...of.net.worth   Sales Income.from.financial.services
## 1                 23.78 24458.0                          158.0
## 2                 38.08  1504.3                            4.0
## 3                 -6.35   475.6                            1.5
## 4                  0.00      NA                             NA
## 5                  5.25  1575.1                            3.9
##   Other.income Total.capital Reserves.and.funds
## 1        297.2         423.8             6822.8
## 2         15.9         115.5              257.8
## 3          0.2          81.4               19.2
## 4           NA           0.5                2.2
## 5          0.9           6.2              161.8
##   Deposits..accepted.by.commercial.banks. Borrowings
## 1                                      NA       14.9
## 2                                      NA      272.5
## 3                                      NA       35.4
## 4                                      NA         NA
## 5                                      NA      193.1
##   Current.liabilities...provisions Deferred.tax.liability
## 1                           9965.9                  284.9
## 2                            210.0                   85.2
## 3                             96.8                     NA
## 4                               NA                     NA
## 5                            112.8                    4.6
##   Shareholders.funds Cumulative.retained.profits Capital.employed TOL.TNW
## 1             7093.2                      6263.3           7108.1    1.33
## 2              351.5                       247.4            624.0    1.23
## 3              100.6                        32.4            136.0    1.44
## 4                2.7                         2.2              2.7    0.00
## 5              107.6                        82.7            300.7    2.83
##   Total.term.liabilities...tangible.net.worth
## 1                                        0.00
## 2                                        0.34
## 3                                        0.29
## 4                                        0.00
## 5                                        1.59
##   Contingent.liabilities...Net.worth.... Contingent.liabilities
## 1                                  14.80                 1049.7
## 2                                  19.23                   67.6
## 3                                  45.83                   46.1
## 4                                   0.00                     NA
## 5                                  34.94                   37.6
##   Net.fixed.assets Investments Current.assets Net.working.capital
## 1           1900.2      1069.6        13277.5              3588.5
## 2            286.4         2.2          563.9               203.5
## 3             38.7         4.3          167.5                59.6
## 4              2.5          NA            0.2                 0.2
## 5             94.8         7.4          349.7               215.8
##   Quick.ratio..times. Current.ratio..times. Debt.to.equity.ratio..times.
## 1                1.18                  1.37                         0.00
## 2                0.95                  1.56                         0.78
## 3                1.11                  1.55                         0.35
## 4                  NA                    NA                         0.00
## 5                1.41                  2.54                         1.79
##   Cash.to.current.liabilities..times.
## 1                                0.43
## 2                                0.06
## 3                                0.21
## 4                                  NA
## 5                                0.00
##   Cash.to.average.cost.of.sales.per.day Creditors.turnover
## 1                                 68.21               3.62
## 2                                  5.96               9.80
## 3                                 17.07               5.28
## 4                                    NA               0.00
## 5                                  0.00              13.00
##   Debtors.turnover Finished.goods.turnover WIP.turnover
## 1             3.85                  200.55        21.78
## 2             5.70                   14.21         7.49
## 3             5.07                    9.24         0.23
## 4             0.00                      NA           NA
## 5             9.46                   12.68         7.90
##   Raw.material.turnover Shares.outstanding Equity.face.value   EPS
## 1                  7.71           42381675                10 35.52
## 2                 11.46           11550000                10  9.97
## 3                    NA            8149090                10 -0.50
## 4                  0.00              52404                10  0.00
## 5                 17.03             619635                10  7.91
##   Adjusted.EPS Total.liabilities PE.on.BSE Default
## 1         7.10           17512.3     27.31       0
## 2         9.97             941.0      8.17       0
## 3        -0.50             232.8     -5.76       0
## 4         0.00               2.7        NA       0
## 5         7.91             478.5        NA       0

# Reading bottom 5 Rows.
tail(dataset, n=5)

##       Num Networth.Next.Year Total.assets Net.worth Total.income
## 3537 3541              226.4        450.5     172.3        565.0
## 3538 3542               89.4         97.6      82.0         75.8
## 3539 3543              246.2        902.9     209.1       1005.1
## 3540 3544              146.9        177.0     137.2        371.0
## 3541 3545               -0.2          0.6       0.3           NA
##      Change.in.stock Total.expenses Profit.after.tax PBDITA   PBT
## 3537            30.5          581.1             14.4   76.7  41.1
## 3538            -4.0           66.5              5.3   11.1   6.2
## 3539             5.6          966.5             44.2  120.3  70.0
## 3540             3.9          348.9             26.0   50.5  40.8
## 3541              NA           17.4            -17.4  -17.4 -17.4
##      Cash.profit PBDITA.as...of.total.income PBT.as...of.total.income
## 3537        48.4                       13.58                     7.27
## 3538         9.2                       14.64                     8.18
## 3539        62.6                       11.97                     6.96
## 3540        33.6                       13.61                    11.00
## 3541       -17.4                          NA                       NA
##      PAT.as...of.total.income Cash.profit.as...of.total.income
## 3537                     2.55                             8.57
## 3538                     6.99                            12.14
## 3539                     4.40                             6.23
## 3540                     7.01                             9.06
## 3541                       NA                               NA
##      PAT.as...of.net.worth Sales Income.from.financial.services
## 3537                  8.71 564.5                            0.5
## 3538                  6.68  73.9                            1.7
## 3539                 22.77 995.9                            2.6
## 3540                 20.30 365.8                            3.3
## 3541               -193.33    NA                             NA
##      Other.income Total.capital Reserves.and.funds
## 3537           NA          89.0               85.5
## 3538           NA          38.6               48.4
## 3539          0.3          30.0              179.1
## 3540          1.6          50.9               86.3
## 3541           NA          28.3              -28.0
##      Deposits..accepted.by.commercial.banks. Borrowings
## 3537                                      NA      190.2
## 3538                                      NA        3.0
## 3539                                      NA      305.0
## 3540                                      NA        1.3
## 3541                                      NA         NA
##      Current.liabilities...provisions Deferred.tax.liability
## 3537                             42.5                   36.8
## 3538                              7.6                     NA
## 3539                            363.4                   25.4
## 3540                             21.1                   17.4
## 3541                              0.3                     NA
##      Shareholders.funds Cumulative.retained.profits Capital.employed
## 3537              172.3                        76.8            362.5
## 3538               87.0                        36.6             90.0
## 3539              209.1                       179.1            514.1
## 3540              137.2                        77.1            138.5
## 3541                0.3                       -28.0              0.3
##      TOL.TNW Total.term.liabilities...tangible.net.worth
## 3537    1.30                                        0.72
## 3538    0.12                                        0.02
## 3539    2.45                                        0.68
## 3540    0.10                                        0.01
## 3541    1.00                                        0.00
##      Contingent.liabilities...Net.worth.... Contingent.liabilities
## 3537                                   0.00                     NA
## 3538                                   5.12                    4.2
## 3539                                  93.45                  195.4
## 3540                                   6.20                    8.5
## 3541                                   0.00                     NA
##      Net.fixed.assets Investments Current.assets Net.working.capital
## 3537            227.0          NA          187.0                78.3
## 3538             21.9         6.8           55.8                47.2
## 3539            217.7        17.5          477.5               -49.5
## 3540             73.5          NA           80.8                59.7
## 3541               NA          NA            0.6                 0.3
##      Quick.ratio..times. Current.ratio..times.
## 3537                0.41                  1.71
## 3538                4.58                  6.49
## 3539                0.59                  0.91
## 3540                2.83                  3.83
## 3541                2.00                  2.00
##      Debt.to.equity.ratio..times. Cash.to.current.liabilities..times.
## 3537                         1.10                                0.07
## 3538                         0.10                                3.88
## 3539                         1.46                                0.05
## 3540                         0.01                                1.35
## 3541                         0.00                                2.00
##      Cash.to.average.cost.of.sales.per.day Creditors.turnover
## 3537                                  5.67              15.65
## 3538                                177.71              10.07
## 3539                                 11.05               3.96
## 3540                                 29.93              25.00
## 3541                               2190.00               0.00
##      Debtors.turnover Finished.goods.turnover WIP.turnover
## 3537            20.64                    8.66         5.14
## 3538            14.21                    5.13         4.17
## 3539             3.76                   33.03        11.68
## 3540            13.75                   49.00        47.03
## 3541             0.00                      NA           NA
##      Raw.material.turnover Shares.outstanding Equity.face.value   EPS
## 3537                 19.47           14904213                10  0.97
## 3538                  4.83            3362800                10  1.61
## 3539                  4.63            3000000                10 13.10
## 3540                 17.42            4422346                10  6.06
## 3541                  0.00            5220000                10 -0.02
##      Adjusted.EPS Total.liabilities PE.on.BSE Default
## 3537         0.97             450.5        NA       0
## 3538         1.61              97.6      2.49       0
## 3539        13.10             902.9     12.62       0
## 3540         6.06             177.0      4.07       0
## 3541        -0.02               0.6        NA       1

# Understanding the Structure of the data loaded. 
str(dataset)

## 'data.frame':    3541 obs. of  53 variables:
##  $ Num                                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Networth.Next.Year                         : num  8890.6 394.3 92.2 2.7 109 ...
##  $ Total.assets                               : num  17512.3 941 232.8 2.7 478.5 ...
##  $ Net.worth                                  : num  7093.2 351.5 100.6 2.7 107.6 ...
##  $ Total.income                               : num  24965 1527 477 NA 1580 ...
##  $ Change.in.stock                            : num  235.8 42.7 -5.2 NA -17 ...
##  $ Total.expenses                             : num  23658 1455 479 NA 1558 ...
##  $ Profit.after.tax                           : num  1543.2 115.2 -6.6 NA 5.5 ...
##  $ PBDITA                                     : num  2860.2 283 5.8 NA 31 ...
##  $ PBT                                        : num  2417.2 188.4 -6.6 NA 6.3 ...
##  $ Cash.profit                                : num  1872.8 158.6 0.3 NA 11.9 ...
##  $ PBDITA.as...of.total.income                : num  11.46 18.53 1.22 0 1.96 ...
##  $ PBT.as...of.total.income                   : num  9.68 12.33 -1.38 0 0.4 ...
##  $ PAT.as...of.total.income                   : num  6.18 7.54 -1.38 0 0.35 2.81 0 0.72 8.29 -2.88 ...
##  $ Cash.profit.as...of.total.income           : num  7.5 10.38 0.06 0 0.75 ...
##  $ PAT.as...of.net.worth                      : num  23.78 38.08 -6.35 0 5.25 ...
##  $ Sales                                      : num  24458 1504 476 NA 1575 ...
##  $ Income.from.financial.services             : num  158 4 1.5 NA 3.9 6.4 NA NA 7.3 NA ...
##  $ Other.income                               : num  297.2 15.9 0.2 NA 0.9 ...
##  $ Total.capital                              : num  423.8 115.5 81.4 0.5 6.2 ...
##  $ Reserves.and.funds                         : num  6822.8 257.8 19.2 2.2 161.8 ...
##  $ Deposits..accepted.by.commercial.banks.    : logi  NA NA NA NA NA NA ...
##  $ Borrowings                                 : num  14.9 272.5 35.4 NA 193.1 ...
##  $ Current.liabilities...provisions           : num  9965.9 210 96.8 NA 112.8 ...
##  $ Deferred.tax.liability                     : num  284.9 85.2 NA NA 4.6 ...
##  $ Shareholders.funds                         : num  7093.2 351.5 100.6 2.7 107.6 ...
##  $ Cumulative.retained.profits                : num  6263.3 247.4 32.4 2.2 82.7 ...
##  $ Capital.employed                           : num  7108.1 624 136 2.7 300.7 ...
##  $ TOL.TNW                                    : num  1.33 1.23 1.44 0 2.83 1.8 0.03 5.17 1.05 3.25 ...
##  $ Total.term.liabilities...tangible.net.worth: num  0 0.34 0.29 0 1.59 0.37 0.03 0.94 0.3 0.54 ...
##  $ Contingent.liabilities...Net.worth....     : num  14.8 19.2 45.8 0 34.9 ...
##  $ Contingent.liabilities                     : num  1049.7 67.6 46.1 NA 37.6 ...
##  $ Net.fixed.assets                           : num  1900.2 286.4 38.7 2.5 94.8 ...
##  $ Investments                                : num  1069.6 2.2 4.3 NA 7.4 ...
##  $ Current.assets                             : num  13277.5 563.9 167.5 0.2 349.7 ...
##  $ Net.working.capital                        : num  3588.5 203.5 59.6 0.2 215.8 ...
##  $ Quick.ratio..times.                        : num  1.18 0.95 1.11 NA 1.41 0.48 NA 0.54 0.59 0.39 ...
##  $ Current.ratio..times.                      : num  1.37 1.56 1.55 NA 2.54 1.27 NA 1.15 1.58 0.5 ...
##  $ Debt.to.equity.ratio..times.               : num  0 0.78 0.35 0 1.79 1.09 0.32 2.31 0.94 3.13 ...
##  $ Cash.to.current.liabilities..times.        : num  0.43 0.06 0.21 NA 0 0.11 NA 0.04 0.19 0 ...
##  $ Cash.to.average.cost.of.sales.per.day      : num  68.21 5.96 17.07 NA 0 ...
##  $ Creditors.turnover                         : num  3.62 9.8 5.28 0 13 ...
##  $ Debtors.turnover                           : num  3.85 5.7 5.07 0 9.46 ...
##  $ Finished.goods.turnover                    : num  200.55 14.21 9.24 NA 12.68 ...
##  $ WIP.turnover                               : num  21.78 7.49 0.23 NA 7.9 ...
##  $ Raw.material.turnover                      : num  7.71 11.46 NA 0 17.03 ...
##  $ Shares.outstanding                         : num  42381675 11550000 8149090 52404 619635 ...
##  $ Equity.face.value                          : num  10 10 10 10 10 10 10 NA 10 10 ...
##  $ EPS                                        : num  35.52 9.97 -0.5 0 7.91 ...
##  $ Adjusted.EPS                               : num  7.1 9.97 -0.5 0 7.91 ...
##  $ Total.liabilities                          : num  17512.3 941 232.8 2.7 478.5 ...
##  $ PE.on.BSE                                  : num  27.31 8.17 -5.76 NA NA ...
##  $ Default                                    : int  0 0 0 0 0 0 0 0 0 1 ...

# Understanding the Summary of the data loaded.
summary(dataset)

##       Num       Networth.Next.Year  Total.assets         Net.worth       
##  Min.   :   1   Min.   :-74265.6   Min.   :      0.1   Min.   :     0.0  
##  1st Qu.: 886   1st Qu.:    31.7   1st Qu.:     91.3   1st Qu.:    31.3  
##  Median :1773   Median :   116.3   Median :    309.7   Median :   102.3  
##  Mean   :1772   Mean   :  1616.3   Mean   :   3443.4   Mean   :  1295.9  
##  3rd Qu.:2658   3rd Qu.:   456.1   3rd Qu.:   1098.7   3rd Qu.:   377.3  
##  Max.   :3545   Max.   :805773.4   Max.   :1176509.2   Max.   :613151.6  
##                                                                          
##   Total.income       Change.in.stock    Total.expenses     
##  Min.   :      0.0   Min.   :-3029.40   Min.   :     -0.1  
##  1st Qu.:    106.5   1st Qu.:   -1.80   1st Qu.:     95.8  
##  Median :    444.9   Median :    1.60   Median :    407.7  
##  Mean   :   4582.8   Mean   :   41.49   Mean   :   4262.9  
##  3rd Qu.:   1440.9   3rd Qu.:   18.05   3rd Qu.:   1359.8  
##  Max.   :2442828.2   Max.   :14185.50   Max.   :2366035.3  
##  NA's   :198         NA's   :458        NA's   :139        
##  Profit.after.tax        PBDITA              PBT           
##  Min.   : -3908.30   Min.   :  -440.7   Min.   : -3894.80  
##  1st Qu.:     0.50   1st Qu.:     6.9   1st Qu.:     0.70  
##  Median :     8.80   Median :    35.4   Median :    12.40  
##  Mean   :   277.36   Mean   :   578.1   Mean   :   383.81  
##  3rd Qu.:    52.27   3rd Qu.:   150.2   3rd Qu.:    71.97  
##  Max.   :119439.10   Max.   :208576.5   Max.   :145292.60  
##  NA's   :131         NA's   :131        NA's   :131        
##   Cash.profit        PBDITA.as...of.total.income PBT.as...of.total.income
##  Min.   : -2245.70   Min.   :-6400.000           Min.   :-21340.00       
##  1st Qu.:     2.90   1st Qu.:    5.000           1st Qu.:     0.55       
##  Median :    18.85   Median :    9.660           Median :     3.31       
##  Mean   :   392.07   Mean   :    4.571           Mean   :   -17.28       
##  3rd Qu.:    93.20   3rd Qu.:   16.390           3rd Qu.:     8.80       
##  Max.   :176911.80   Max.   :  100.000           Max.   :   100.00       
##  NA's   :131         NA's   :68                  NA's   :68              
##  PAT.as...of.total.income Cash.profit.as...of.total.income
##  Min.   :-21340.00        Min.   :-15020.000              
##  1st Qu.:     0.35        1st Qu.:     2.020              
##  Median :     2.34        Median :     5.640              
##  Mean   :   -19.20        Mean   :    -8.229              
##  3rd Qu.:     6.34        3rd Qu.:    10.700              
##  Max.   :   150.00        Max.   :   100.000              
##  NA's   :68               NA's   :68                      
##  PAT.as...of.net.worth     Sales           Income.from.financial.services
##  Min.   :-748.72       Min.   :      0.1   Min.   :    0.00              
##  1st Qu.:   0.00       1st Qu.:    112.7   1st Qu.:    0.40              
##  Median :   7.92       Median :    453.1   Median :    1.80              
##  Mean   :  10.27       Mean   :   4549.5   Mean   :   80.84              
##  3rd Qu.:  20.19       3rd Qu.:   1433.5   3rd Qu.:    9.68              
##  Max.   :2466.67       Max.   :2384984.4   Max.   :51938.20              
##                        NA's   :259         NA's   :935                   
##   Other.income      Total.capital     Reserves.and.funds
##  Min.   :    0.00   Min.   :    0.1   Min.   : -6525.9  
##  1st Qu.:    0.40   1st Qu.:   13.1   1st Qu.:     5.0  
##  Median :    1.40   Median :   42.1   Median :    54.8  
##  Mean   :   41.36   Mean   :  216.6   Mean   :  1163.8  
##  3rd Qu.:    5.97   3rd Qu.:  100.3   3rd Qu.:   277.3  
##  Max.   :42856.70   Max.   :78273.2   Max.   :625137.8  
##  NA's   :1295       NA's   :4         NA's   :85        
##  Deposits..accepted.by.commercial.banks.   Borrowings       
##  Mode:logical                            Min.   :     0.10  
##  NA's:3541                               1st Qu.:    23.95  
##                                          Median :    99.20  
##                                          Mean   :  1122.28  
##                                          3rd Qu.:   352.60  
##                                          Max.   :278257.30  
##                                          NA's   :366        
##  Current.liabilities...provisions Deferred.tax.liability
##  Min.   :     0.1                 Min.   :    0.1       
##  1st Qu.:    17.8                 1st Qu.:    3.2       
##  Median :    69.4                 Median :   13.4       
##  Mean   :   940.6                 Mean   :  227.2       
##  3rd Qu.:   261.7                 3rd Qu.:   50.0       
##  Max.   :352240.3                 Max.   :72796.6       
##  NA's   :96                       NA's   :1140          
##  Shareholders.funds Cumulative.retained.profits Capital.employed  
##  Min.   :     0.0   Min.   : -6534.3            Min.   :     0.0  
##  1st Qu.:    32.0   1st Qu.:     1.1            1st Qu.:    60.8  
##  Median :   105.6   Median :    37.1            Median :   214.7  
##  Mean   :  1322.1   Mean   :   890.5            Mean   :  2328.3  
##  3rd Qu.:   393.2   3rd Qu.:   202.3            3rd Qu.:   767.3  
##  Max.   :613151.6   Max.   :390133.8            Max.   :891408.9  
##                     NA's   :38                                    
##     TOL.TNW         Total.term.liabilities...tangible.net.worth
##  Min.   :-350.480   Min.   :-325.600                           
##  1st Qu.:   0.600   1st Qu.:   0.050                           
##  Median :   1.430   Median :   0.340                           
##  Mean   :   3.994   Mean   :   1.844                           
##  3rd Qu.:   2.830   3rd Qu.:   1.000                           
##  Max.   : 473.000   Max.   : 456.000                           
##                                                                
##  Contingent.liabilities...Net.worth.... Contingent.liabilities
##  Min.   :    0.00                       Min.   :     0.1      
##  1st Qu.:    0.00                       1st Qu.:     6.3      
##  Median :    5.33                       Median :    38.0      
##  Mean   :   53.94                       Mean   :   932.9      
##  3rd Qu.:   30.76                       3rd Qu.:   192.7      
##  Max.   :14704.27                       Max.   :559506.8      
##                                         NA's   :1188          
##  Net.fixed.assets    Investments        Current.assets    
##  Min.   :     0.0   Min.   :     0.00   Min.   :     0.1  
##  1st Qu.:    26.0   1st Qu.:     1.00   1st Qu.:    36.2  
##  Median :    93.5   Median :     8.35   Median :   145.1  
##  Mean   :  1189.7   Mean   :   694.73   Mean   :  1293.4  
##  3rd Qu.:   344.9   3rd Qu.:    64.30   3rd Qu.:   502.2  
##  Max.   :636604.6   Max.   :199978.60   Max.   :354815.2  
##  NA's   :118        NA's   :1435        NA's   :66        
##  Net.working.capital Quick.ratio..times. Current.ratio..times.
##  Min.   :-63839.0    Min.   :  0.000     Min.   :  0.00       
##  1st Qu.:    -1.1    1st Qu.:  0.410     1st Qu.:  0.93       
##  Median :    16.2    Median :  0.670     Median :  1.23       
##  Mean   :   138.6    Mean   :  1.401     Mean   :  2.13       
##  3rd Qu.:    84.2    3rd Qu.:  1.030     3rd Qu.:  1.71       
##  Max.   : 85782.8    Max.   :341.000     Max.   :505.00       
##  NA's   :32          NA's   :93          NA's   :93           
##  Debt.to.equity.ratio..times. Cash.to.current.liabilities..times.
##  Min.   :  0.00               Min.   :  0.0000                   
##  1st Qu.:  0.22               1st Qu.:  0.0200                   
##  Median :  0.79               Median :  0.0700                   
##  Mean   :  2.78               Mean   :  0.4904                   
##  3rd Qu.:  1.75               3rd Qu.:  0.1900                   
##  Max.   :456.00               Max.   :165.0000                   
##                               NA's   :93                         
##  Cash.to.average.cost.of.sales.per.day Creditors.turnover
##  Min.   :     0.00                     Min.   :   0.000  
##  1st Qu.:     2.79                     1st Qu.:   3.700  
##  Median :     8.03                     Median :   6.095  
##  Mean   :   158.44                     Mean   :  15.446  
##  3rd Qu.:    21.79                     3rd Qu.:  11.490  
##  Max.   :128040.76                     Max.   :2401.000  
##  NA's   :85                            NA's   :333       
##  Debtors.turnover  Finished.goods.turnover  WIP.turnover    
##  Min.   :   0.00   Min.   :   -0.09        Min.   :  -0.18  
##  1st Qu.:   3.76   1st Qu.:    8.20        1st Qu.:   5.10  
##  Median :   6.32   Median :   17.27        Median :   9.76  
##  Mean   :  17.04   Mean   :   87.08        Mean   :  27.93  
##  3rd Qu.:  11.68   3rd Qu.:   40.35        3rd Qu.:  20.24  
##  Max.   :3135.20   Max.   :17947.60        Max.   :5651.40  
##  NA's   :328       NA's   :740             NA's   :640      
##  Raw.material.turnover Shares.outstanding   Equity.face.value
##  Min.   :   -2.00      Min.   :-2.147e+09   Min.   :-999999  
##  1st Qu.:    2.99      1st Qu.: 1.316e+06   1st Qu.:     10  
##  Median :    6.40      Median : 4.672e+06   Median :     10  
##  Mean   :   19.09      Mean   : 2.207e+07   Mean   :  -1334  
##  3rd Qu.:   11.85      3rd Qu.: 1.065e+07   3rd Qu.:     10  
##  Max.   :21092.00      Max.   : 4.130e+09   Max.   : 100000  
##  NA's   :361           NA's   :692          NA's   :692      
##       EPS             Adjusted.EPS       Total.liabilities  
##  Min.   :-843181.8   Min.   :-843181.8   Min.   :      0.1  
##  1st Qu.:      0.0   1st Qu.:      0.0   1st Qu.:     91.3  
##  Median :      1.4   Median :      1.2   Median :    309.7  
##  Mean   :   -220.3   Mean   :   -221.5   Mean   :   3443.4  
##  3rd Qu.:      9.6   3rd Qu.:      7.5   3rd Qu.:   1098.7  
##  Max.   :  34522.5   Max.   :  34522.5   Max.   :1176509.2  
##                                                             
##    PE.on.BSE           Default       
##  Min.   :-1116.64   Min.   :0.00000  
##  1st Qu.:    3.27   1st Qu.:0.00000  
##  Median :    9.10   Median :0.00000  
##  Mean   :   63.91   Mean   :0.06608  
##  3rd Qu.:   17.79   3rd Qu.:0.00000  
##  Max.   :51002.74   Max.   :1.00000  
##  NA's   :2194
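The summary above shows two things worth acting on: the mean of Default is about 0.066, so roughly 6.6% of the rows are defaults, and Deposits..accepted.by.commercial.banks. is NA in all 3541 rows. The sketch below shows one way such checks could be scripted. It uses a small made-up data frame called toy rather than the real dataset, and the factor labels "No" and "Yes" are an assumption.

```r
# Toy stand-in for the loan data (values are made up for illustration)
toy <- data.frame(
  Total.assets = c(100, 250, 80, 40, 500),
  Deposits     = c(NA, NA, NA, NA, NA),
  Default      = c(0, 0, 0, 1, 0)
)

# Share of each class in the target variable
class_share <- prop.table(table(toy$Default))
print(class_share)

# caret treats a factor outcome as a classification problem, so the 0/1
# target is recoded with labels that are also valid R names
toy$Default <- factor(toy$Default, levels = c(0, 1), labels = c("No", "Yes"))

# Columns that contain nothing but NA carry no information
all_na_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]
print(all_na_cols)
```

A heavily imbalanced target like this one is worth keeping in mind later, since plain accuracy can look good even for a model that never predicts a default.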

We need to pre-process our data before we can use it for modeling. Pre-processing involves the following steps.

• Missing Value Treatment

• Outlier Treatment

• Performing Multicollinearity check.
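The three steps listed above can be sketched with caret as follows. This is a minimal illustration on made-up numeric data, not the treatment applied to the loan dataset. Median imputation, capping at the 1st and 99th percentiles, and a correlation cutoff of 0.9 are all assumptions chosen for the example, and cap() is a hypothetical helper.

```r
library(caret)

set.seed(1)
# Toy numeric predictors (made-up values, not the loan data)
x <- data.frame(a = rnorm(100), b = rnorm(100))
x$c <- x$a * 2 + rnorm(100, sd = 0.01)  # nearly collinear with a
x$a[c(3, 10)] <- NA                     # inject missing values
x$b[5] <- 50                            # inject an outlier

# 1. Missing value treatment: replace NAs with the column median
imp <- preProcess(x, method = "medianImpute")
x_imp <- predict(imp, x)

# 2. Outlier treatment: cap each column at its 1st and 99th percentiles
cap <- function(v) {
  q <- quantile(v, c(0.01, 0.99), names = FALSE)
  pmin(pmax(v, q[1]), q[2])
}
x_cap <- as.data.frame(lapply(x_imp, cap))

# 3. Multicollinearity check: flag columns with pairwise correlation above 0.9
high_cor <- findCorrelation(cor(x_cap), cutoff = 0.9, names = TRUE)
x_clean <- x_cap[, setdiff(names(x_cap), high_cor), drop = FALSE]

print(high_cor)
print(names(x_clean))
```

findCorrelation() returns the columns it suggests dropping, so one column of each highly correlated pair is kept.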

Let’s check if there are any missing values present in data.

# Counting the missing values in each column.

colSums(is.na(dataset))

##                                         Num 
##                                           0 
##                          Networth.Next.Year 
##                                           0 
##                                Total.assets 
##                                           0 
##                                   Net.worth 
##                                           0 
##                                Total.income 
##                                         198 
##                             Change.in.stock 
##                                         458 
##                              Total.expenses 
##                                         139 
##                            Profit.after.tax 
##                                         131 
##                                      PBDITA 
##                                         131 
##                                         PBT 
##                                         131 
##                                 Cash.profit 
##                                         131 
##                 PBDITA.as...of.total.income 
##                                          68 
##                    PBT.as...of.total.income 
##                                          68 
##                    PAT.as...of.total.income 
##                                          68 
##            Cash.profit.as...of.total.income 
##                                          68 
##                       PAT.as...of.net.worth 
##                                           0 
##                                       Sales 
##                                         259 
##              Income.from.financial.services 
##                                         935 
##                                Other.income 
##                                        1295 
##                               Total.capital 
##                                           4 
##                          Reserves.and.funds 
##                                          85 
##     Deposits..accepted.by.commercial.banks. 
##                                        3541 
##                                  Borrowings 
##                                         366 
##            Current.liabilities...provisions 
##                                          96 
##                      Deferred.tax.liability 
##                                        1140 
##                          Shareholders.funds 
##                                           0 
##                 Cumulative.retained.profits 
##                                          38 
##                            Capital.employed 
##                                           0 
##                                     TOL.TNW 
##                                           0 
## Total.term.liabilities...tangible.net.worth 
##                                           0 
##      Contingent.liabilities...Net.worth.... 
##                                           0 
##                      Contingent.liabilities 
##                                        1188 
##                            Net.fixed.assets 
##                                         118 
##                                 Investments 
##                                        1435 
##                              Current.assets 
##                                          66 
##                         Net.working.capital 
##                                          32 
##                         Quick.ratio..times. 
##                                          93 
##                       Current.ratio..times. 
##                                          93 
##                Debt.to.equity.ratio..times. 
##                                           0 
##         Cash.to.current.liabilities..times. 
##                                          93 
##       Cash.to.average.cost.of.sales.per.day 
##                                          85 
##                          Creditors.turnover 
##                                         333 
##                            Debtors.turnover 
##                                         328 
##                     Finished.goods.turnover 
##                                         740 
##                                WIP.turnover 
##                                         640 
##                       Raw.material.turnover 
##                                         361 
##                          Shares.outstanding 
##                                         692 
##                           Equity.face.value 
##                                         692 
##                                         EPS 
##                                           0 
##                                Adjusted.EPS 
##                                           0 
##                           Total.liabilities 
##                                           0 
##                                   PE.on.BSE 
##                                        2194 
##                                     Default 
##                                           0

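We observe that several variables have missing values in more than 25% of the 3,541 records. Imputing such variables can end up creating artificial data and lowering accuracy in modeling, so we eliminate the variables where the missing share exceeds 25%. The code below also drops the identifier column Num (column 1). Note that Other.income, with 1,295 missing values (about 37%), is not in the dropped list and is therefore imputed along with the remaining variables.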

# Eliminating the identifier column and variables with more than 25% missing values
data <- dataset[,-c(1,22,25,18,32,34,52)]

# Imputing missing values for the remaining variables (column 46 is Default).
# Note: method = "knnImpute" also centers and scales the predictors.
imputed <- preProcess(data[,-46],method = "knnImpute",k = 5)

imputed_val <- predict(imputed,data)

# Checking for missing values on the Output data.
anyNA(imputed_val)

## [1] FALSE

# Checking the dimension of the Output data.
dim(imputed_val)

## [1] 3541   46
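The 25% rule can also be applied programmatically instead of reading column indices off the output. The following is a minimal sketch in base R; the data frame `toy` and its columns are made up for illustration and are not part of the Loan Default dataset.

```r
# Toy data frame standing in for `dataset`; all columns are hypothetical.
toy <- data.frame(
  id      = 1:8,
  income  = c(10, NA, 12, 11, 13, NA, 9, 10),  # 2 of 8 missing (25%)
  deposit = c(NA, NA, NA, NA, 5, NA, NA, NA),  # 7 of 8 missing (87.5%)
  sales   = c(1, 2, 3, 4, 5, 6, 7, 8)          # no missing values
)

# Share of missing values per column.
na_share <- colMeans(is.na(toy))

# Columns whose missing share is strictly above the 25% threshold.
drop_cols <- names(na_share)[na_share > 0.25]

# Keep every other column.
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(drop_cols)
print(names(toy_reduced))
```

Selecting by name rather than by position keeps the step valid if the column order of the input file changes.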

Once the missing values are treated, the next step is to treat Outliers in the data. For more details on Outlier Detection and treatment, visit http://r-statistics.co/Outlier-Treatment-With-R.html

# Checking for Outliers.
boxplot(imputed_val)

From the above plot, we can see a large number of Outliers in the data that need to be treated. For this problem, we use Capping: values more than 1.5 times the IQR below the first quartile are replaced with the 5th percentile, and values more than 1.5 times the IQR above the third quartile are replaced with the 95th percentile.

# Performing Outlier Treatment. 
final <- imputed_val
names(final)

##  [1] "Networth.Next.Year"                         
##  [2] "Total.assets"                               
##  [3] "Net.worth"                                  
##  [4] "Total.income"                               
##  [5] "Change.in.stock"                            
##  [6] "Total.expenses"                             
##  [7] "Profit.after.tax"                           
##  [8] "PBDITA"                                     
##  [9] "PBT"                                        
## [10] "Cash.profit"                                
## [11] "PBDITA.as...of.total.income"                
## [12] "PBT.as...of.total.income"                   
## [13] "PAT.as...of.total.income"                   
## [14] "Cash.profit.as...of.total.income"           
## [15] "PAT.as...of.net.worth"                      
## [16] "Sales"                                      
## [17] "Other.income"                               
## [18] "Total.capital"                              
## [19] "Reserves.and.funds"                         
## [20] "Borrowings"                                 
## [21] "Current.liabilities...provisions"           
## [22] "Shareholders.funds"                         
## [23] "Cumulative.retained.profits"                
## [24] "Capital.employed"                           
## [25] "TOL.TNW"                                    
## [26] "Total.term.liabilities...tangible.net.worth"
## [27] "Contingent.liabilities...Net.worth...."     
## [28] "Net.fixed.assets"                           
## [29] "Current.assets"                             
## [30] "Net.working.capital"                        
## [31] "Quick.ratio..times."                        
## [32] "Current.ratio..times."                      
## [33] "Debt.to.equity.ratio..times."               
## [34] "Cash.to.current.liabilities..times."        
## [35] "Cash.to.average.cost.of.sales.per.day"      
## [36] "Creditors.turnover"                         
## [37] "Debtors.turnover"                           
## [38] "Finished.goods.turnover"                    
## [39] "WIP.turnover"                               
## [40] "Raw.material.turnover"                      
## [41] "Shares.outstanding"                         
## [42] "Equity.face.value"                          
## [43] "EPS"                                        
## [44] "Adjusted.EPS"                               
## [45] "Total.liabilities"                          
## [46] "Default"

summary(final)

##  Networth.Next.Year  Total.assets        Net.worth       
##  Min.   :-4.34613   Min.   :-0.11118   Min.   :-0.09679  
##  1st Qu.:-0.09076   1st Qu.:-0.10823   1st Qu.:-0.09446  
##  Median :-0.08591   Median :-0.10118   Median :-0.08915  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.06645   3rd Qu.:-0.07571   3rd Qu.:-0.06861  
##  Max.   :46.05806   Max.   :37.87640   Max.   :45.70217  
##   Total.income      Change.in.stock    Total.expenses    
##  Min.   :-0.08230   Min.   :-6.97020   Min.   :-0.08039  
##  1st Qu.:-0.08044   1st Qu.:-0.09803   1st Qu.:-0.07862  
##  Median :-0.07514   Median :-0.09099   Median :-0.07319  
##  Mean   :-0.00403   Mean   :-0.00820   Mean   :-0.00281  
##  3rd Qu.:-0.05784   3rd Qu.:-0.05990   3rd Qu.:-0.05603  
##  Max.   :43.78935   Max.   :32.10362   Max.   :44.53764  
##  Profit.after.tax       PBDITA              PBT          
##  Min.   :-1.36590   Min.   :-0.18019   Min.   :-1.03924  
##  1st Qu.:-0.09038   1st Qu.:-0.10113   1st Qu.:-0.09308  
##  Median :-0.08796   Median :-0.09632   Median :-0.09063  
##  Mean   :-0.00325   Mean   :-0.00343   Mean   :-0.00334  
##  3rd Qu.:-0.07465   3rd Qu.:-0.07718   3rd Qu.:-0.07661  
##  Max.   :38.88575   Max.   :36.78933   Max.   :35.19707  
##   Cash.profit       PBDITA.as...of.total.income PBT.as...of.total.income
##  Min.   :-0.62408   Min.   :-43.53237           Min.   :-50.11996       
##  1st Qu.:-0.09212   1st Qu.:  0.00299           1st Qu.:  0.04186       
##  Median :-0.08864   Median :  0.03493           Median :  0.04847       
##  Mean   :-0.00320   Mean   :  0.00084           Mean   :  0.00069       
##  3rd Qu.:-0.07211   3rd Qu.:  0.08108           3rd Qu.:  0.06153       
##  Max.   :41.76360   Max.   :  0.64864           Max.   :  0.27567       
##  PAT.as...of.total.income Cash.profit.as...of.total.income
##  Min.   :-49.60502        Min.   :-49.28902               
##  1st Qu.:  0.04537        1st Qu.:  0.03368               
##  Median :  0.05002        Median :  0.04561               
##  Mean   :  0.00055        Mean   :  0.00077               
##  3rd Qu.:  0.05930        3rd Qu.:  0.06251               
##  Max.   :  0.39366        Max.   :  0.35536               
##  PAT.as...of.net.worth     Sales           Other.income     
##  Min.   :-11.65019     Min.   :-0.08277   Min.   :-0.04509  
##  1st Qu.: -0.15763     1st Qu.:-0.08087   1st Qu.:-0.04454  
##  Median : -0.03606     Median :-0.07558   Median :-0.04343  
##  Mean   :  0.00000     Mean   :-0.00541   Mean   :-0.01510  
##  3rd Qu.:  0.15228     3rd Qu.:-0.05851   3rd Qu.:-0.04055  
##  Max.   : 37.70481     Max.   :43.30847   Max.   :46.67590  
##  Total.capital      Reserves.and.funds   Borrowings      
##  Min.   :-0.12868   Min.   :-0.57503   Min.   :-0.12925  
##  1st Qu.:-0.12095   1st Qu.:-0.08666   1st Qu.:-0.12652  
##  Median :-0.10366   Median :-0.08301   Median :-0.11838  
##  Mean   : 0.00013   Mean   :-0.00194   Mean   :-0.00976  
##  3rd Qu.:-0.06901   3rd Qu.:-0.06706   3rd Qu.:-0.09288  
##  Max.   :46.39073   Max.   :46.66034   Max.   :31.92094  
##  Current.liabilities...provisions Shareholders.funds
##  Min.   :-0.09852                 Min.   :-0.09833  
##  1st Qu.:-0.09675                 1st Qu.:-0.09595  
##  Median :-0.09161                 Median :-0.09048  
##  Mean   :-0.00253                 Mean   : 0.00000  
##  3rd Qu.:-0.07230                 3rd Qu.:-0.06908  
##  Max.   :36.80037                 Max.   :45.50511  
##  Cumulative.retained.profits Capital.employed      TOL.TNW        
##  Min.   :-0.73423            Min.   :-0.11051   Min.   :-18.2730  
##  1st Qu.:-0.08795            1st Qu.:-0.10762   1st Qu.: -0.1750  
##  Median :-0.08441            Median :-0.10032   Median : -0.1322  
##  Mean   :-0.00091            Mean   : 0.00000   Mean   :  0.0000  
##  3rd Qu.:-0.06832            3rd Qu.:-0.07409   3rd Qu.: -0.0600  
##  Max.   :38.49177            Max.   :42.19850   Max.   : 24.1771  
##  Total.term.liabilities...tangible.net.worth
##  Min.   :-22.16484                          
##  1st Qu.: -0.12143                          
##  Median : -0.10180                          
##  Mean   :  0.00000                          
##  3rd Qu.: -0.05712                          
##  Max.   : 30.74205                          
##  Contingent.liabilities...Net.worth.... Net.fixed.assets  
##  Min.   :-0.14238                       Min.   :-0.08950  
##  1st Qu.:-0.14238                       1st Qu.:-0.08758  
##  Median :-0.12831                       Median :-0.08258  
##  Mean   : 0.00000                       Mean   :-0.00270  
##  3rd Qu.:-0.06118                       3rd Qu.:-0.06459  
##  Max.   :38.67202                       Max.   :47.80464  
##  Current.assets     Net.working.capital  Quick.ratio..times.
##  Min.   :-0.12764   Min.   :-21.720618   Min.   :-0.18130   
##  1st Qu.:-0.12411   1st Qu.: -0.047384   1st Qu.:-0.12696   
##  Median :-0.11367   Median : -0.041579   Median :-0.09333   
##  Mean   :-0.00223   Mean   : -0.000359   Mean   :-0.00049   
##  3rd Qu.:-0.07970   3rd Qu.: -0.019307   3rd Qu.:-0.04546   
##  Max.   :34.89241   Max.   : 29.076529   Max.   :43.93427   
##  Current.ratio..times. Debt.to.equity.ratio..times.
##  Min.   :-0.21076      Min.   :-0.18563            
##  1st Qu.:-0.11777      1st Qu.:-0.17094            
##  Median :-0.08809      Median :-0.13287            
##  Mean   :-0.00027      Mean   : 0.00000            
##  3rd Qu.:-0.03961      3rd Qu.:-0.06876            
##  Max.   :49.74931      Max.   :30.26704            
##  Cash.to.current.liabilities..times. Cash.to.average.cost.of.sales.per.day
##  Min.   :-0.11688                    Min.   :-0.05768                     
##  1st Qu.:-0.11211                    1st Qu.:-0.05663                     
##  Median :-0.10020                    Median :-0.05469                     
##  Mean   :-0.00049                    Mean   :-0.00097                     
##  3rd Qu.:-0.07160                    3rd Qu.:-0.04948                     
##  Max.   :39.20874                    Max.   :46.55514                     
##  Creditors.turnover Debtors.turnover   Finished.goods.turnover
##  Min.   :-0.22645   Min.   :-0.20301   Min.   :-0.145634      
##  1st Qu.:-0.16942   1st Qu.:-0.15667   1st Qu.:-0.129004      
##  Median :-0.13364   Median :-0.12515   Median :-0.112196      
##  Mean   : 0.00263   Mean   :-0.00354   Mean   :-0.004225      
##  3rd Qu.:-0.04641   3rd Qu.:-0.06125   3rd Qu.:-0.073205      
##  Max.   :34.97439   Max.   :37.14581   Max.   :29.839856      
##   WIP.turnover      Raw.material.turnover Shares.outstanding 
##  Min.   :-0.18611   Min.   :-0.05604      Min.   :-13.10754  
##  1st Qu.:-0.14764   1st Qu.:-0.04182      1st Qu.: -0.12201  
##  Median :-0.11613   Median :-0.03279      Median : -0.10635  
##  Mean   :-0.01257   Mean   :-0.00222      Mean   : -0.01667  
##  3rd Qu.:-0.05264   3rd Qu.:-0.01939      3rd Qu.: -0.07281  
##  Max.   :37.23095   Max.   :55.99423      Max.   : 24.82087  
##  Equity.face.value        EPS             Adjusted.EPS      
##  Min.   :-26.63055   Min.   :-59.10564   Min.   :-59.10565  
##  1st Qu.:  0.03583   1st Qu.:  0.01545   1st Qu.:  0.01553  
##  Median :  0.03583   Median :  0.01555   Median :  0.01561  
##  Mean   :  0.00711   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.:  0.03583   3rd Qu.:  0.01612   3rd Qu.:  0.01606  
##  Max.   :  2.70218   Max.   :  2.43605   Max.   :  2.43614  
##  Total.liabilities     Default       
##  Min.   :-0.11118   Min.   :0.00000  
##  1st Qu.:-0.10823   1st Qu.:0.00000  
##  Median :-0.10118   Median :0.00000  
##  Mean   : 0.00000   Mean   :0.06608  
##  3rd Qu.:-0.07571   3rd Qu.:0.00000  
##  Max.   :37.87640   Max.   :1.00000

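# Predictor columns to cap (column 46 is Default and is left untouched).
a <- 1:45

for (val in a) {
  qnt <- quantile(final[, val], probs = c(0.25, 0.75))  # first and third quartile
  cap <- quantile(final[, val], probs = c(0.05, 0.95))  # replacement values

  h <- 1.5 * IQR(final[, val])                          # distance from the quartiles
  final[, val][final[, val] > (qnt[2] + h)] <- cap[2]   # cap high outliers
  final[, val][final[, val] < (qnt[1] - h)] <- cap[1]   # cap low outliers
}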
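The same capping rule can be wrapped in a small function, which makes it easier to test on a few values before applying it to a full data frame. This is only a sketch: the function name `cap_outliers` and the data frame `toy` are hypothetical, and the rule is the 1.5 × IQR / 5th and 95th percentile rule used in this article.

```r
# Cap values lying more than 1.5 * IQR outside the quartiles at the
# 5th / 95th percentile of the same vector.
cap_outliers <- function(x) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  cap <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  x[x > qnt[2] + h] <- cap[2]
  x[x < qnt[1] - h] <- cap[1]
  x
}

# Hypothetical data: one extreme high value in `a`, one extreme low value in `b`.
toy <- data.frame(a = c(1:19, 1000), b = c(-500, 2:20))

# Apply the function to every column while keeping the data frame structure.
toy[] <- lapply(toy, cap_outliers)

print(range(toy$a))
print(range(toy$b))
```

Because the percentiles are computed from data that still contains the extreme values, the capped value can remain outside the 1.5 × IQR fence; it is simply much closer to the bulk of the data than before.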

After we are done with Outlier treatment, we are left with the last step of Data preparation, i.e. treating the problem of Multicollinearity.

For Details on Multicollinearity, visit https://towardsdatascience.com/https-towardsdatascience-com-multicollinearity-how-does-it-create-a-problem-72956a49058

In this case, we use Caret's findCorrelation() function to treat collinearity in the dataset. It searches through a correlation matrix and returns the columns to remove in order to reduce pair-wise correlations: a vector of integer column indices by default, or the column names when names = TRUE.

# Checking for variables where correlation is higher than 0.7

x<- findCorrelation(cor(final),
  cutoff = 0.7,
  verbose = TRUE,
  names = TRUE,
  exact = ncol(imputed_val) < 100
)

## Compare row 8  and column  3 with corr  0.841 
##   Means:  0.458 vs 0.274 so flagging column 8 
## Compare row 3  and column  22 with corr  0.981 
##   Means:  0.444 vs 0.266 so flagging column 3 
## Compare row 22  and column  1 with corr  0.92 
##   Means:  0.431 vs 0.258 so flagging column 22 
## Compare row 1  and column  10 with corr  0.839 
##   Means:  0.42 vs 0.249 so flagging column 1 
## Compare row 10  and column  2 with corr  0.802 
##   Means:  0.413 vs 0.241 so flagging column 10 
## Compare row 2  and column  45 with corr  1 
##   Means:  0.4 vs 0.233 so flagging column 2 
## Compare row 45  and column  24 with corr  0.952 
##   Means:  0.384 vs 0.225 so flagging column 45 
## Compare row 24  and column  4 with corr  0.839 
##   Means:  0.365 vs 0.217 so flagging column 24 
## Compare row 4  and column  9 with corr  0.754 
##   Means:  0.353 vs 0.209 so flagging column 4 
## Compare row 9  and column  19 with corr  0.779 
##   Means:  0.35 vs 0.201 so flagging column 9 
## Compare row 19  and column  7 with corr  0.776 
##   Means:  0.324 vs 0.193 so flagging column 19 
## Compare row 7  and column  16 with corr  0.733 
##   Means:  0.318 vs 0.185 so flagging column 7 
## Compare row 16  and column  6 with corr  0.984 
##   Means:  0.296 vs 0.178 so flagging column 16 
## Compare row 6  and column  29 with corr  0.871 
##   Means:  0.274 vs 0.171 so flagging column 6 
## Compare row 29  and column  23 with corr  0.736 
##   Means:  0.259 vs 0.164 so flagging column 29 
## Compare row 21  and column  28 with corr  0.717 
##   Means:  0.234 vs 0.159 so flagging column 21 
## Compare row 28  and column  20 with corr  0.783 
##   Means:  0.217 vs 0.155 so flagging column 28 
## Compare row 41  and column  18 with corr  0.824 
##   Means:  0.16 vs 0.152 so flagging column 41 
## Compare row 12  and column  13 with corr  0.963 
##   Means:  0.247 vs 0.148 so flagging column 12 
## Compare row 13  and column  14 with corr  0.778 
##   Means:  0.211 vs 0.141 so flagging column 13 
## Compare row 14  and column  11 with corr  0.855 
##   Means:  0.192 vs 0.137 so flagging column 14 
## Compare row 43  and column  44 with corr  0.888 
##   Means:  0.151 vs 0.134 so flagging column 43 
## Compare row 25  and column  33 with corr  0.843 
##   Means:  0.191 vs 0.13 so flagging column 25 
## Compare row 33  and column  26 with corr  0.877 
##   Means:  0.159 vs 0.127 so flagging column 33 
## Compare row 31  and column  32 with corr  0.84 
##   Means:  0.17 vs 0.123 so flagging column 31 
## Compare row 34  and column  35 with corr  0.707 
##   Means:  0.12 vs 0.12 so flagging column 35 
## All correlations <= 0.7

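The call to findCorrelation() above returns the names of the variables it flags for removal because their pairwise correlation is higher than 0.7. Note that the line below, final[x], selects these 26 flagged variables rather than eliminating them; to drop them instead, one would index with setdiff(names(final), x). The outputs that follow were produced with the selection exactly as written.

# x holds the flagged column names, so final[x] keeps those columns.
final1 <- final[x]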
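The difference between keeping and dropping the columns returned by findCorrelation() is easy to see on a small synthetic example. The sketch below assumes the caret package is installed; the data frame `toy` and its columns are invented for illustration.

```r
library(caret)

set.seed(1)
n <- 200
toy <- data.frame(a = rnorm(n), c = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.1)  # b is almost a copy of a

# Names of the columns that findCorrelation() recommends removing.
flagged <- findCorrelation(cor(toy), cutoff = 0.7, names = TRUE)

keep_flagged <- toy[flagged]                       # same pattern as final[x]
drop_flagged <- toy[setdiff(names(toy), flagged)]  # removes the flagged columns

print(flagged)
print(names(drop_flagged))
```

Only one member of the correlated pair (a, b) is flagged, so `drop_flagged` retains the other member together with the unrelated column c.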

# Adding the Dependent variable to the dataset.
final <- cbind(data$Default, final1)

# Checking the Dimensions of the final data.
dim(final)

## [1] 3541   27

# Checking the structure of the final data.
str(final)

## 'data.frame':    3541 obs. of  27 variables:
##  $ data$Default                         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ PBDITA                               : num  0.1147 -0.0522 -0.1012 -0.1018 -0.0968 ...
##  $ Net.worth                            : num  0.1299 -0.0705 -0.0893 -0.0966 -0.0888 ...
##  $ Shareholders.funds                   : num  0.1367 -0.0722 -0.0908 -0.0981 -0.0903 ...
##  $ Networth.Next.Year                   : num  0.123 -0.07 -0.0873 -0.0924 -0.0863 ...
##  $ Cash.profit                          : num  0.0972 -0.0552 -0.0927 -0.0925 -0.0899 ...
##  $ Total.assets                         : num  0.1617 -0.0808 -0.1037 -0.1111 -0.0957 ...
##  $ Total.liabilities                    : num  0.1617 -0.0808 -0.1037 -0.1111 -0.0957 ...
##  $ Capital.employed                     : num  0.1737 -0.0809 -0.1041 -0.1104 -0.0962 ...
##  $ Total.income                         : num  0.0773 -0.0549 -0.0737 -0.0817 -0.0539 ...
##  $ PBT                                  : num  0.0853 0.0853 -0.0948 -0.0932 -0.0917 ...
##  $ Reserves.and.funds                   : num  0.116 -0.0678 -0.0856 -0.0869 -0.0749 ...
##  $ Profit.after.tax                     : num  0.093 -0.0529 -0.0927 -0.0905 -0.0887 ...
##  $ Sales                                : num  0.0777 -0.0554 -0.0741 -0.0821 -0.0541 ...
##  $ Total.expenses                       : num  0.0815 -0.053 -0.0714 -0.0797 -0.051 ...
##  $ Current.assets                       : num  0.1981 -0.072 -0.1111 -0.1276 -0.0931 ...
##  $ Current.liabilities...provisions     : num  0.1116 -0.0765 -0.0884 -0.0973 -0.0867 ...
##  $ Net.fixed.assets                     : num  0.1114 -0.068 -0.0866 -0.0893 -0.0824 ...
##  $ Shares.outstanding                   : num  0.2571 -0.0635 -0.0841 -0.133 -0.1296 ...
##  $ PBT.as...of.total.income             : num  0.0634 0.0696 0.0374 0.0406 0.0416 ...
##  $ PAT.as...of.total.income             : num  0.059 0.0622 0.0415 0.0447 0.0455 ...
##  $ Cash.profit.as...of.total.income     : num  0.0516 0.0611 0.0272 0.027 0.0295 ...
##  $ EPS                                  : num  0.0216 0.0161 0.0154 0.0154 0.016 ...
##  $ TOL.TNW                              : num  -0.137 -0.142 -0.132 -0.206 -0.06 ...
##  $ Debt.to.equity.ratio..times.         : num  -0.1856 -0.1335 -0.1623 -0.1856 -0.0661 ...
##  $ Quick.ratio..times.                  : num  -0.02864 -0.0584 -0.0377 -0.07754 0.00111 ...
##  $ Cash.to.average.cost.of.sales.per.day: num  0.0135 -0.0555 -0.0515 -0.052 -0.0577 ...

# Renaming the Dependent variable and converting it to Factor
colnames(final)[colnames(final) == "data$Default"] <- "Default"
final$Default <- as.factor(final$Default)

Feature selection using Caret

Feature selection is a crucial part of modeling. Here we’ll be using Recursive Feature Elimination (RFE), a wrapper method that repeatedly fits a model and discards the weakest features, to find the best subset of features to use for modeling.

# Feature Selection using Recursive Feature Elimination
ctrl <- rfeControl(functions = rfFuncs,
                   method = "cv",
                   verbose = FALSE)

lmProfile <- rfe(x=final[, 2:27], y=final$Default,
                 rfeControl = ctrl)


lmProfile

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy Kappa AccuracySD KappaSD Selected
##          4        1     1          0       0        *
##          8        1     1          0       0         
##         16        1     1          0       0         
##         26        1     1          0       0         
## 
## The top 4 variables (out of 4):
##    Networth.Next.Year, TOL.TNW, Net.worth, Debt.to.equity.ratio..times.

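The feature selection algorithm above picked a subset of only 4 features as the most important variables: Networth.Next.Year, TOL.TNW, Net.worth and Debt.to.equity.ratio..times. A resampled accuracy of exactly 1 at every subset size is unusual, though; it suggests that at least one predictor (Networth.Next.Year ranks first) may encode the outcome almost directly, which is worth checking before relying on this selection.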
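To see how rfe() reports its choice, here is a minimal sketch on synthetic regression data. It swaps in lmFuncs (linear models) for the rfFuncs used in this article so that it runs quickly and needs no extra packages; the data, the `sizes` values and the object names are assumptions made for illustration.

```r
library(caret)

set.seed(42)
n <- 200
x <- data.frame(matrix(rnorm(n * 5), ncol = 5))
names(x) <- paste0("x", 1:5)
y <- 3 * x$x1 + rnorm(n, sd = 0.5)  # only x1 drives the outcome

# 5-fold cross-validation around the elimination procedure.
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

# Evaluate subsets of 1 to 4 features (the full set of 5 is always included).
profile <- rfe(x = x, y = y, sizes = 1:4, rfeControl = ctrl)

print(predictors(profile))  # names of the selected features
print(profile$results)      # resampled performance per subset size
```

predictors() returns the variables of the chosen subset, and the `results` element holds one row of resampled metrics per subset size, which is the table printed in the output above.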

Training models using Caret

This is probably the part where Caret stands out from any other available package. It provides the ability for implementing 200+ machine learning algorithms using consistent syntax. To get a list of all the algorithms that Caret supports, you can use:

modelnames <- names(getModelInfo())
modelnames

##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.LT.RS"           "GFS.THRIFT"          "glm.nb"             
##  [70] "glm"                 "glmboost"            "glmnet_h2o"         
##  [73] "glmnet"              "glmStepAIC"          "gpls"               
##  [76] "hda"                 "hdda"                "hdrda"              
##  [79] "HYFIS"               "icr"                 "J48"                
##  [82] "JRip"                "kernelpls"           "kknn"               
##  [85] "knn"                 "krlsPoly"            "krlsRadial"         
##  [88] "lars"                "lars2"               "lasso"              
##  [91] "lda"                 "lda2"                "leapBackward"       
##  [94] "leapForward"         "leapSeq"             "Linda"              
##  [97] "lm"                  "lmStepAIC"           "LMT"                
## [100] "loclda"              "logicBag"            "LogitBoost"         
## [103] "logreg"              "lssvmLinear"         "lssvmPoly"          
## [106] "lssvmRadial"         "lvq"                 "M5"                 
## [109] "M5Rules"             "manb"                "mda"                
## [112] "Mlda"                "mlp"                 "mlpKerasDecay"      
## [115] "mlpKerasDecayCost"   "mlpKerasDropout"     "mlpKerasDropoutCost"
## [118] "mlpML"               "mlpSGD"              "mlpWeightDecay"     
## [121] "mlpWeightDecayML"    "monmlp"              "msaenet"            
## [124] "multinom"            "mxnet"               "mxnetAdam"          
## [127] "naive_bayes"         "nb"                  "nbDiscrete"         
## [130] "nbSearch"            "neuralnet"           "nnet"               
## [133] "nnls"                "nodeHarvest"         "null"               
## [136] "OneR"                "ordinalNet"          "ordinalRF"          
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "PRIM"               
## [160] "protoclass"          "qda"                 "QdaCov"             
## [163] "qrf"                 "qrnn"                "randomGLM"          
## [166] "ranger"              "rbf"                 "rbfDDA"             
## [169] "Rborist"             "rda"                 "regLogistic"        
## [172] "relaxo"              "rf"                  "rFerns"             
## [175] "RFlda"               "rfRules"             "ridge"              
## [178] "rlda"                "rlm"                 "rmda"               
## [181] "rocc"                "rotationForest"      "rotationForestCp"   
## [184] "rpart"               "rpart1SE"            "rpart2"             
## [187] "rpartCost"           "rpartScore"          "rqlasso"            
## [190] "rqnc"                "RRF"                 "RRFglobal"          
## [193] "rrlda"               "RSimca"              "rvmLinear"          
## [196] "rvmPoly"             "rvmRadial"           "SBC"                
## [199] "sda"                 "sdwd"                "simpls"             
## [202] "SLAVE"               "slda"                "smda"               
## [205] "snn"                 "sparseLDA"           "spikeslab"          
## [208] "spls"                "stepLDA"             "stepQDA"            
## [211] "superpc"             "svmBoundrangeString" "svmExpoString"      
## [214] "svmLinear"           "svmLinear2"          "svmLinear3"         
## [217] "svmLinearWeights"    "svmLinearWeights2"   "svmPoly"            
## [220] "svmRadial"           "svmRadialCost"       "svmRadialSigma"     
## [223] "svmRadialWeights"    "svmSpectrumString"   "tan"                
## [226] "tanSearch"           "treebag"             "vbmpRadial"         
## [229] "vglmAdjCat"          "vglmContRatio"       "vglmCumulative"     
## [232] "widekernelpls"       "WM"                  "wsrf"               
## [235] "xgbDART"             "xgbLinear"           "xgbTree"            
## [238] "xyf"

Yes, it’s a huge list!

If you want more details, such as a model's hyperparameters and whether it can be used for regression or classification problems, call modelLookup() with the name of the algorithm.

Let’s look up the hyperparameters for one of the many Models supported by Caret.

modelLookup("rf")

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
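The same lookup works for the other algorithms trained later in this article. A short sketch, assuming the caret package is installed (modelLookup() only reads caret's model registry, so the modeling packages themselves do not need to be loaded):

```r
library(caret)

# Tuning parameters that caret exposes for two other models.
glmnet_info <- modelLookup("glmnet")
xgb_info <- modelLookup("xgbTree")

print(glmnet_info[, c("parameter", "label")])
print(xgb_info[, c("parameter", "label")])
```

Each row is one hyperparameter that train() can tune; models with several rows, such as xgbTree, are tuned over a grid of combinations.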

Let’s split the data into Train and Test before we proceed with Model Creation.

# Create the training and test datasets
set.seed(100)


# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(final$Default, p=0.8, list=FALSE)


# Step 2: Create the training  dataset
trainData <- final[trainRowNumbers,]


# Step 3: Create the test dataset
testData <- final[-trainRowNumbers,]

Once the data is split into Train and Test, we check the proportion of Default and Non-Default in the Train data.

prop.table(table(trainData$Default))

## 
##          0          1 
## 0.93366267 0.06633733
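The training proportions match the full data closely because createDataPartition() samples within each class. A minimal sketch on a made-up outcome vector (the 930/70 split is hypothetical and only mimics the imbalance seen here):

```r
library(caret)

set.seed(100)
y <- factor(c(rep(0, 930), rep(1, 70)))  # hypothetical 93:7 outcome

# Row indices for an 80% training partition, stratified by class.
idx <- createDataPartition(y, p = 0.8, list = FALSE)

print(prop.table(table(y)))        # full data
print(prop.table(table(y[idx])))   # training part
print(prop.table(table(y[-idx])))  # test part
```

A plain random sample of rows would give similar proportions only on average; with a rare class, stratifying avoids a test set that happens to contain very few defaults.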

The split between Non-Default and Default is roughly 93:7, so the classes are heavily imbalanced. We can balance the data by using either Undersampling or Oversampling. Caret provides downSample() for Undersampling and upSample() for Oversampling.

Here we use the Oversampling method in Caret.

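# Columns 2:24 cover 23 of the 26 predictors (columns 25 to 27 are left out).
# upSample() returns the outcome in a column named Class.
train <- upSample(trainData[,c(2:24)], as.factor(trainData$Default))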

prop.table(table(train$Class))

## 
##   0   1 
## 0.5 0.5
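For comparison, the undersampling counterpart downSample() is called the same way. The sketch below uses invented data (`toy`, `cls`) rather than the loan dataset.

```r
library(caret)

set.seed(100)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))  # hypothetical predictors
cls <- factor(c(rep("0", 90), rep("1", 10)))         # hypothetical 90:10 outcome

# Randomly discard majority-class rows until both classes are the same size.
down <- downSample(x = toy, y = cls)  # outcome is returned as column `Class`

print(table(down$Class))
print(dim(down))
```

Undersampling throws away majority-class rows, so it suits large datasets; with only a few hundred defaults, oversampling as used in this article keeps all the original observations.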

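Let’s train a Random Forest model by setting method = ‘rf’. With tuneLength = 10, Caret tries up to 10 candidate values of the hyperparameter mtry and keeps the one with the best resampled performance.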

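# Performing Random Forest with 5-fold cross-validation
Control <- trainControl(method = "cv", number = 5)

# The resampling settings must be passed through the trControl argument
rf <- train(Class ~ ., data = train, trControl = Control, method = "rf", tuneLength = 10)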

In the code above we used trainControl(). It defines how train() evaluates each candidate set of hyperparameters (the resampling scheme), while tuneLength or tuneGrid defines which candidates are tried. Together they make parameter tuning in Caret straightforward.
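
When specific candidate values are wanted instead of an automatic search, train() accepts a tuneGrid. The sketch below uses the built-in iris data and a k-nearest-neighbours model purely because they need no extra packages; the dataset, model, and candidate values are assumptions for illustration:

```r
library(caret)
set.seed(100)

ctrl <- trainControl(method = "cv", number = 5)

# Explicit candidates; column names must match the model's tuning parameters
knn_grid <- expand.grid(k = c(3, 5, 7))

knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 trControl = ctrl,
                 tuneGrid = knn_grid)

knn_fit$results[, c("k", "Accuracy")]  # one row per candidate value
knn_fit$bestTune                       # the value caret selected
```

For a Random Forest, tuneGrid = expand.grid(mtry = 5) would pin mtry to a single value instead of searching over several.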

Almost every step of the tuning process can be customized. By default Caret evaluates each set of parameters with the bootstrap, but trainControl() also offers k-fold, repeated k-fold, and leave-one-out cross-validation (LOOCV). In this example we use 5-fold cross-validation.
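
As a quick sketch, the resampling schemes mentioned above could be set up as follows (the object names and the number of folds and repeats are arbitrary choices for illustration):

```r
library(caret)

ctrl_boot  <- trainControl(method = "boot", number = 25)                    # bootstrap
ctrl_cv    <- trainControl(method = "cv", number = 5)                       # 5-fold CV
ctrl_rcv   <- trainControl(method = "repeatedcv", number = 5, repeats = 3)  # 5-fold CV, repeated 3 times
ctrl_loocv <- trainControl(method = "LOOCV")                                # leave-one-out CV

# Each object is a plain list; the chosen scheme is stored in $method
sapply(list(ctrl_boot, ctrl_cv, ctrl_rcv, ctrl_loocv), function(x) x$method)
```

Any of these objects can be passed to train() through its trControl argument.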

Let’s train a few more models from the list using the same resampling settings we used for the Random Forest model.

# Training eXtreme Gradient Boost Model
xgbModel1 <- train(Class~.,train,
                   trControl = Control ,method = "xgbTree",tuneLength=10)



# Training a regularized (elastic-net) Logistic Regression Model via glmnet
logit.Model2 <- train(Class ~.,
                      data = train,
                      method = "glmnet",
                      trControl = Control,
                      tuneLength=10)


# Training an AdaBoost Model (a boosting ensemble, not a neural network;
# the object name neural.net is kept because the code below refers to it)
neural.net <- train(Class~.,train,
                   trControl = Control ,method = "adaboost",
                   tuneLength=10,
                   verbose=FALSE)

Variable importance estimation using caret

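Caret also makes variable importance estimates accessible through varImp(). Models with a built-in importance measure use it; for the others, varImp() falls back to a model-independent measure, such as the ROC curve importance reported for adaboost below. Let’s have a look at the variable importance for all four models that we created: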

#Checking variable importance for Random Forest

#Variable Importance
varImp(rf)

## rf variable importance
## 
##   only 20 most important variables shown (out of 23)
## 
##                                    Overall
## Networth.Next.Year               1.000e+02
## PAT.as...of.total.income         5.697e+00
## PBT.as...of.total.income         4.701e-01
## Reserves.and.funds               1.199e-01
## TOL.TNW                          1.139e-01
## Sales                            5.577e-03
## Net.worth                        4.914e-03
## Cash.profit                      4.766e-03
## Total.expenses                   0.000e+00
## Current.liabilities...provisions 0.000e+00
## Total.income                     0.000e+00
## Total.liabilities                0.000e+00
## PBDITA                           0.000e+00
## EPS                              0.000e+00
## PBT                              0.000e+00
## Cash.profit.as...of.total.income 0.000e+00
## Current.assets                   0.000e+00
## Net.fixed.assets                 0.000e+00
## Shares.outstanding               0.000e+00
## Shareholders.funds               0.000e+00

#Plotting Variable Importance
plot(varImp(rf),main="RF - Variable Importance")

#Checking variable importance for XG Boost

#Variable Importance
varImp(xgbModel1)

## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 23)
## 
##                                    Overall
## Networth.Next.Year               100.00000
## PAT.as...of.total.income           2.88568
## TOL.TNW                            2.01947
## PBT                                1.51418
## Reserves.and.funds                 0.69349
## Cash.profit.as...of.total.income   0.24398
## Net.worth                          0.06515
## EPS                                0.01307
## Total.liabilities                  0.00000
## Current.liabilities...provisions   0.00000
## Total.expenses                     0.00000
## Total.income                       0.00000
## Total.assets                       0.00000
## Cash.profit                        0.00000
## Shares.outstanding                 0.00000
## PBDITA                             0.00000
## Sales                              0.00000
## Current.assets                     0.00000
## Shareholders.funds                 0.00000
## Capital.employed                   0.00000

#Plotting Variable Importance
plot(varImp(xgbModel1),main="XG Boost - Variable Importance")

#Checking variable importance for Logistic Regression

#Variable Importance
varImp(logit.Model2)

## glmnet variable importance
## 
##   only 20 most important variables shown (out of 23)
## 
##                                    Overall
## Networth.Next.Year               100.00000
## Net.fixed.assets                   1.32306
## EPS                                1.03294
## PAT.as...of.total.income           0.81899
## PBT.as...of.total.income           0.49490
## Profit.after.tax                   0.48259
## Total.expenses                     0.36353
## Net.worth                          0.33063
## Total.income                       0.32835
## Sales                              0.29101
## PBDITA                             0.26791
## Cash.profit.as...of.total.income   0.25505
## TOL.TNW                            0.06527
## Shares.outstanding                 0.00344
## Cash.profit                        0.00000
## Shareholders.funds                 0.00000
## PBT                                0.00000
## Total.assets                       0.00000
## Current.liabilities...provisions   0.00000
## Capital.employed                   0.00000

#Plotting Variable Importance
plot(varImp(logit.Model2),main="Logistic Regression - Variable Importance")

#Checking variable importance for adaboost

#Variable Importance
varImp(neural.net)

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 23)
## 
##                                  Importance
## Networth.Next.Year                   100.00
## PBT                                   74.24
## Profit.after.tax                      74.11
## Cash.profit                           72.16
## PAT.as...of.total.income              71.16
## PBT.as...of.total.income              69.03
## Reserves.and.funds                    68.72
## Cash.profit.as...of.total.income      65.35
## EPS                                   65.35
## Net.worth                             62.41
## Shareholders.funds                    59.89
## TOL.TNW                               57.47
## PBDITA                                57.44
## Current.assets                        38.07
## Capital.employed                      37.50
## Total.expenses                        34.96
## Total.liabilities                     33.99
## Total.assets                          33.99
## Total.income                          33.18
## Sales                                 32.16

#Plotting Variable Importance
plot(varImp(neural.net),main="adaboost - Variable Importance")

Clearly, the variable importance estimates differ from model to model, and together they give a more holistic view of the importance of each predictor. Two main uses of variable importance from various models are:

• Predictors that are important for the majority of models are likely to be genuinely important.

• For ensembling, we should combine models whose variable importance differs significantly, since their predictions are also expected to differ. Each of those models must still be sufficiently accurate on its own.

We can fine-tune our models by refitting them on only the variables that ranked highest in the variable importance estimates.
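
A minimal sketch of that idea, using the built-in mtcars data and a linear model so that it runs without extra packages (the dataset, the model, and the choice of keeping the top three predictors are all assumptions for illustration):

```r
library(caret)
set.seed(100)

ctrl <- trainControl(method = "cv", number = 5)

# Fit on all predictors and extract the importance table
full_fit <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)
imp <- varImp(full_fit)$importance

# Names of the three highest-ranked predictors
top_vars <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:3]
top_vars

# Refit using only those predictors
reduced_fit <- train(reformulate(top_vars, response = "mpg"),
                     data = mtcars, method = "lm", trControl = ctrl)
reduced_fit$results[, c("RMSE", "Rsquared")]
```

Comparing the resampled metrics of the full and reduced fits shows whether dropping the low-ranked variables costs any accuracy.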

Predictions using Caret

To predict the dependent variable we use predict(), R's generic prediction function, for which Caret supplies a method for models created with train(). For classification problems it accepts a type argument, which can be set to either “prob” or “raw”. With type=”raw” the predictions are the outcome classes, while with type=”prob” they are the predicted probabilities of each class for every observation.
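
The difference between the two types is easiest to see side by side. This sketch uses iris and a k-nearest-neighbours model only because they need no extra packages; both are assumptions for illustration:

```r
library(caret)
set.seed(100)

knn_model <- train(Species ~ ., data = iris, method = "knn",
                   trControl = trainControl(method = "cv", number = 5))

# type = "raw": a factor with one predicted class per row
raw_pred <- predict(knn_model, newdata = iris, type = "raw")
head(raw_pred)

# type = "prob": a data frame with one probability column per class
prob_pred <- predict(knn_model, newdata = iris, type = "prob")
head(prob_pred)
```

The probability output is what you would threshold yourself if the default class assignment is not suitable, for example with imbalanced classes.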

We generate predictions on both Train and Test to see how the model performs on the training data and on unseen data.

# Predicting Model on Train Data for Random Forest
pred.train.rf <- predict(rf, train, type = "raw")
summary(pred.train.rf)

##    0    1 
## 2646 2646

Caret also provides a confusionMatrix function which will give the confusion matrix along with various other metrics for your predictions. Here is the performance analysis of our Random Forest model.

# Creating Confusion Matrix on Train data (predictions first, then the actual classes).
confusionMatrix(pred.train.rf, train$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2646    0
##          1    0 2646
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9993, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
##
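
confusionMatrix() takes the predicted classes first (data) and the true classes second (reference), and the order matters whenever the model is not perfect. A small sketch with hand-made vectors (the ten labels below are invented for illustration):

```r
library(caret)

# Hypothetical true labels and predictions for ten observations
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0), levels = c(0, 1))

cm <- confusionMatrix(data = predicted, reference = actual, positive = "1")

cm$table                                      # rows = predictions, columns = truth
cm$overall["Accuracy"]                        # 7 correct out of 10
cm$byClass[c("Sensitivity", "Specificity")]   # 4/6 and 3/4
```

Swapping the two arguments transposes the table, which exchanges sensitivity with positive predictive value and specificity with negative predictive value.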

# Predicting Model on Test Data for Random Forest
pred.test.rf <- predict(rf, testData, type = "raw")
summary(pred.test.rf)

##   0   1 
## 661  46

# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.rf, as.factor(testData$Default), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 661   0
##          1   0  46
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9948, 1)
##     No Information Rate : 0.9349     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00000    
##             Specificity : 1.00000    
##          Pos Pred Value : 1.00000    
##          Neg Pred Value : 1.00000    
##              Prevalence : 0.06506    
##          Detection Rate : 0.06506    
##    Detection Prevalence : 0.06506    
##       Balanced Accuracy : 1.00000    
##                                      
##        'Positive' Class : 1          
##

Next, we generate predictions with the XG Boost, Logistic Regression, and adaboost models created above, on both Train and Test data.

# Predicting Model on Train Data for XG Boost
pred.train.xgb <- predict(xgbModel1, newdata = train, type = "raw")
summary(pred.train.xgb)

##    0    1 
## 2646 2646

# Creating Confusion Matrix on Train data (predictions first, then the actual classes).
confusionMatrix(pred.train.xgb, train$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2646    0
##          1    0 2646
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9993, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
##

# Predicting Model on Test Data for XG Boosting
pred.test.xgb <- predict(xgbModel1, testData, type = "raw")
summary(pred.test.xgb)

##   0   1 
## 661  46

# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.xgb, as.factor(testData$Default), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 661   0
##          1   0  46
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9948, 1)
##     No Information Rate : 0.9349     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00000    
##             Specificity : 1.00000    
##          Pos Pred Value : 1.00000    
##          Neg Pred Value : 1.00000    
##              Prevalence : 0.06506    
##          Detection Rate : 0.06506    
##    Detection Prevalence : 0.06506    
##       Balanced Accuracy : 1.00000    
##                                      
##        'Positive' Class : 1          
##

# Predicting Model on Train Data for Logistic Regression
pred.train.logit <- predict(logit.Model2, newdata = train, type = "raw")
summary(pred.train.logit)

##    0    1 
## 2555 2737

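# Creating Confusion Matrix on Train data.
# Note: confusionMatrix() treats its first argument as the predictions, so in the
# table below the rows labelled Prediction hold the actual classes and the
# Reference columns hold the predicted classes.
confusionMatrix(train$Class,pred.train.logit)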

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2555   91
##          1    0 2646
##                                           
##                Accuracy : 0.9828          
##                  95% CI : (0.9789, 0.9861)
##     No Information Rate : 0.5172          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9656          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9668          
##          Pos Pred Value : 0.9656          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4828          
##          Detection Rate : 0.4828          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9834          
##                                           
##        'Positive' Class : 0               
##

# Predicting Model on Test Data for Logistic Regression
# (use logit.Model2 here; passing xgbModel1 would simply reproduce the XG Boost results)
pred.test.logit <- predict(logit.Model2, testData, type = "raw")
summary(pred.test.logit)

##   0   1 
## 661  46

# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.logit, as.factor(testData$Default), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 661   0
##          1   0  46
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9948, 1)
##     No Information Rate : 0.9349     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00000    
##             Specificity : 1.00000    
##          Pos Pred Value : 1.00000    
##          Neg Pred Value : 1.00000    
##              Prevalence : 0.06506    
##          Detection Rate : 0.06506    
##    Detection Prevalence : 0.06506    
##       Balanced Accuracy : 1.00000    
##                                      
##        'Positive' Class : 1          
##

# Predicting Model on Train Data for adaboost Model
pred.train.nn <- predict(neural.net, newdata = train, type = "raw")
summary(pred.train.nn)

##    0    1 
## 2646 2646

# Creating Confusion Matrix on Train data (predictions first, then the actual classes).
confusionMatrix(pred.train.nn, train$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2646    0
##          1    0 2646
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9993, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : 0          
##

# Predicting Model on Test Data for adaboost Model
pred.test.nn <- predict(neural.net, testData, type = "raw")
summary(pred.test.nn)

##   0   1 
## 661  46

# Creating Confusion Matrix on Test data.
confusionMatrix(pred.test.nn, as.factor(testData$Default), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 661   0
##          1   0  46
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9948, 1)
##     No Information Rate : 0.9349     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00000    
##             Specificity : 1.00000    
##          Pos Pred Value : 1.00000    
##          Neg Pred Value : 1.00000    
##              Prevalence : 0.06506    
##          Detection Rate : 0.06506    
##    Detection Prevalence : 0.06506    
##       Balanced Accuracy : 1.00000    
##                                      
##        'Positive' Class : 1          
##

Performing Model Comparisons

Now that we have created several models with Caret, the natural question is which one works best and should be chosen. This can be answered in many ways. In this problem, we compare the confusion matrices produced on the Train and Test data for all the models and select the model that gives the highest accuracy on both.
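
One of those other ways is Caret's resamples() function, which collects the cross-validated metrics of several models into one summary. The sketch below is an illustration on the built-in iris data with two models that need no extra installation (knn and rpart); the dataset, models, and object names are assumptions, not the loan models above:

```r
library(caret)
set.seed(100)

# Fix the folds once so both models are evaluated on identical resamples
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

fit_knn   <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)
fit_rpart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Collect the per-fold Accuracy and Kappa of both models
cmp <- resamples(list(knn = fit_knn, rpart = fit_rpart))
summary(cmp)
```

The comparison is only meaningful when all models were trained with the same resampling folds, which is why the folds are created once and shared.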

Conclusion

The purpose of this post was to showcase Caret as a standalone package that can act as a one-stop shop for most of your needs as a data scientist, from basic data preparation to building complex machine learning models.

It should serve both as a reference and as a template for building a standardised machine learning workflow from scratch, which you can then develop further.

Additional Resources

• Caret Package Homepage: http://topepo.github.io/caret/index.html

• Caret Package on CRAN: http://cran.r-project.org/web/packages/caret/

• Caret Package Manual: http://cran.r-project.org/web/packages/caret/caret.pdf

• A Short Introduction to the caret Package: http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf