Great Lakes

Credits

Great Lakes Institute of Management
Dr V Balachandar Campus
ECR Road, Manamai
Tirukazhukundram
Tamil Nadu 603102

Mentor: Dr. P.K. Viswanathan

Data Partner: Kaggle

Cast: Logistic Regression, Naive Bayes, SVM, Decision Trees and Random Forests.

Crew: R, MS Excel.

We use set.seed(211) as the random seed so that results can be reproduced.

You can visit the following webpage in your browser to access the presentation.

http://rpubs.com/cjagtap

Tools, Techniques and more…

Executive Summary:

Before we plunge into the depths of each of these techniques, it will be good to have an overview of the results and comparisons based on my experience.

Definitions

The performance seen has to be viewed in the light of the following:

  1. There is no one technique that can be termed perfect or superior to all others.
  2. Certain techniques work well on particular kinds of data.
  3. The user's experience with a particular technique plays a role in its performance.

Comparisons

Introduction:

  • Banks play a pivotal role in the world and national economies. They decide who is creditworthy and who is not based on the individual's or company's demographic information, credit history with the bank (if available) and the information available at Credit Bureaus. In India we have 4 credit bureaus, namely Credit Information Bureau (India) Ltd (CIBIL), Experian Credit Information Co. of India Pvt. Ltd, Equifax Credit Information Services Pvt. Ltd and High Mark Credit Information Services Pvt. Ltd.
  • Banks and Credit Bureaus compute a score called a Credit Score to assess the creditworthiness of customers and to set other loan terms and conditions such as the interest charged on the loan, the value of collateral required, the need for a guarantor, etc.
  • The demographic and credit history information of the customer is used to compute the credit score, which has a probability of default associated with it.
  • Globally, a credit score is used for a variety of purposes, including the movement of goods and services, the purchase of homes and cars, and even getting jobs. It is therefore an integral part of any banking system.
  • Historically, banks have used Discriminant Analysis and Logistic Regression to compute the probability of default of a customer. Recently, with advances in technology, many new techniques like Classification and Regression Trees (CART), Neural Networks, Naïve Bayes, Support Vector Machines (SVM), and ensemble methods like Bagging, Boosting and Random Forest have come up as alternatives to discriminant analysis and logistic regression.
  • Through this project I aim to compare the various popular techniques available to an analyst/data scientist for solving classification problems in the Consumer Credit Risk domain.

Scope & Objectives:

The primary objective of the project is to explore and compare the most popular classification techniques used by the analytics industry in the Consumer Credit Risk domain.

  1. Using Logistic Regression, Naïve Bayes, Support Vector Machines (SVM), and the ensemble method Random Forest, classify the set of customers into two groups:
     • Those who will experience financial distress in the next two years.
     • Those who will not experience financial distress in the next two years.
  2. Do a comparative study of the discriminatory power of the techniques listed in point #1 using AUROC, i.e. the Area under the Receiver Operating Characteristic Curve.
  3. Generate the classification (confusion) matrix to compare the models on Sensitivity (True Positive Rate), Specificity (True Negative Rate) and Accuracy.

Exploratory data analysis:

  • As a first step data exploration is done through graphs that help one understand a single categorical or continuous variable by visualizing the distribution of the variable.
  • Bar plots have been used to gain insight into the distribution of categorical variables.
  • Histograms and box plots have been used to visualize the distribution of continuous variables.

Plots of Dependent variable SeriousDlqin2yrs

Distribution of dependent variable

  • Above we look at the distribution of the dependent variable.
  • The overall proportions of Good and Bad accounts are 93.32% and 6.68% respectively.
  • This tells us that the event of interest, a person experiencing a 90-days-past-due delinquency or worse, is a rare event, occurring just 6.68% of the time.
  • Thus the title of the project
    'Minority Report'.

Categorical Independent variables

We now look at the distribution of each of the independent variables in turn.

  • Bar plots have been used to gain insight into the distribution of categorical variables.

Plots: num_30_59_dpd, num_60_89_dpd

Plots of Categorical variables

  • Number Of Time 30-59 Days Past Due Not Worse: The variable takes values 0 through 98; the extreme values have been clubbed together in a group '7-98'.
  • Number Of Time 60-89 Days Past Due Not Worse: The variable takes values 0 through 98; the extreme values have been clubbed together in a group '6-98'.

Plots: num_90_dpd, num_open_trades

Plots of Categorical variables

  • Number of times borrower has been 90 days or more past due: The variable takes values 0 through 98; the extreme values have been clubbed together in a group '6-98'.
  • Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards): The variable takes values from zero to 58.

Plots: num_RE_trades, num_depend

Plots of Categorical variables

  • Number of mortgage and real estate loans including home equity lines of credit: The variable takes values from zero to 54.
  • Number of Dependents in Family: The variable takes values from zero to 20.

Continuous Independent variables

  • Histograms and box plots have been used to visualize the distribution of continuous variables.
  • A normal curve is superimposed on each of the Histograms.

Histogram: Rev_Util_unsec, Debt_Ratio

Histograms of Continuous variables

  • Revolving Utilization of Unsecured Lines: The variable is continuous, taking values from zero to 50,708. The variable also takes missing ‘NA’ values. For the purpose of plotting the histogram, the ‘NA’ values and values greater than the 99th percentile (1.09) have been ignored.
  • Debt Ratio: The variable is continuous, taking values from zero to 329,664. For the purpose of plotting the histogram, values greater than the 99th percentile (4,979) have been ignored.

Histogram: Monthly_Income, Age

Histograms of Continuous variables

  • Monthly Income: The variable is continuous, taking values from zero to 3,008,750. The variable also takes missing ‘NA’ values. For the purpose of plotting the histogram, the ‘NA’ values and values greater than the 99th percentile (25,000) have been ignored.
  • Age: The variable is continuous, taking values from zero to 109.

Box Plot of Continuous Independent variables

We now look at the distribution of the continuous variables across the categories of the outcome variable through box plots.

Box Plot: Age, Monthly_Income

Box Plots of Continuous variables

  • Age: The variable has outliers at the higher end, with just one outlier at the lower end. The box plot also tells us that younger people tend to be riskier.
  • Monthly Income: The variable has many outliers at the higher end, with some at the lower end. The box plot also tells us that people with less income tend to be riskier.
  • Each whisker extends to 1.5 times the interquartile range. Values outside this range are depicted as dots.

Box Plot: Debt_Ratio, Rev_Util_unsec

Box Plots of Continuous variables

  • Debt Ratio: The variable has many outliers at the higher end. The box plot also tells us that people with higher debt ratio tend to be riskier.
  • Revolving Utilization of Unsecured Lines: The variable has many outliers on the higher end. The box plot also tells us that people with higher Revolving Utilization of Unsecured Lines tend to be riskier.

Descriptive Statistics:

We now look at the distribution of each variable numerically.

Age, num_30_59_dpd, num_60_89_dpd, num_90_dpd

                    Age  num_30_59_dpd  num_60_89_dpd  num_90_dpd
n             150,000.0      150,000.0      150,000.0   150,000.0
na                  0.0            0.0            0.0         0.0
min                 0.0            0.0            0.0         0.0
max               109.0           98.0           98.0        98.0
mean               52.3            0.4            0.2         0.3
stdev              14.8            4.2            4.2         4.2
skew                0.2           22.6           23.3        23.1
kurtosis            2.5          525.4          548.7       540.7
quantile.1%        24.0            0.0            0.0         0.0
quantile.5%        29.0            0.0            0.0         0.0
quantile.10%       33.0            0.0            0.0         0.0
quantile.25%       41.0            0.0            0.0         0.0
quantile.50%       52.0            0.0            0.0         0.0
quantile.75%       63.0            0.0            0.0         0.0
quantile.90%       72.0            1.0            0.0         0.0
quantile.95%       78.0            2.0            1.0         1.0
quantile.99%       87.0            4.0            2.0         3.0
  • Skewness: If a unimodal distribution has a longer tail extending towards lower values of the variate it is said to have negative skewness; in the contrary case, positive skewness.

num_open_trades, num_RE_trades, num_depend

              num_open_trades  num_RE_trades  num_depend
n                   150,000.0      150,000.0   146,076.0
na                        0.0            0.0     3,924.0
min                       0.0            0.0         0.0
max                      58.0           54.0        20.0
mean                      8.5            1.0         0.8
stdev                     5.1            1.1         1.1
skew                      1.2            3.5         1.6
kurtosis                  6.1           63.5         6.0
quantile.1%               0.0            0.0         0.0
quantile.5%               2.0            0.0         0.0
quantile.10%              3.0            0.0         0.0
quantile.25%              5.0            0.0         0.0
quantile.50%              8.0            1.0         0.0
quantile.75%             11.0            2.0         1.0
quantile.90%             15.0            2.0         2.0
quantile.95%             18.0            3.0         3.0
quantile.99%             24.0            4.0         4.0
  • Kurtosis: A term used to describe the extent to which a unimodal frequency curve is “peaked”; that is to say, the extent of the relative steepness of ascent in the neighbourhood of the mode. The term was introduced by Karl Pearson in 1906.

Rev_Util_unsec, Debt_Ratio, Monthly_Income

              Rev_Util_unsec  Debt_Ratio  Monthly_Income
n                  150,000.0   150,000.0       120,269.0
na                       0.0         0.0        29,731.0
min                      0.0         0.0             0.0
max                 50,708.0   329,664.0     3,008,750.0
mean                     6.0       353.0         6,670.2
stdev                  249.8     2,037.8        14,384.7
skew                    97.6        95.2           114.0
kurtosis            14,547.3    13,736.9        19,507.1
quantile.1%              0.0         0.0             0.0
quantile.5%              0.0         0.0         1,300.0
quantile.10%             0.0         0.0         2,005.0
quantile.25%             0.0         0.2         3,400.0
quantile.50%             0.2         0.4         5,400.0
quantile.75%             0.6         0.9         8,249.0
quantile.90%             1.0     1,267.0        11,666.0
quantile.95%             1.0     2,449.0        14,587.6
quantile.99%             1.1     4,979.0        25,000.0

Bivariate Correlations:

  • Before we begin the modeling process, we first look at the correlations among the raw independent variables.
  • Exploration of the relationships among selected variables two at a time using Pearson Correlation.
  • In the chart on the next slide we see the bivariate correlations between the raw variables:
  1. Blue color shows positive correlation.
  2. Red color shows negative correlation.
  3. The darker and more saturated the color, the greater the magnitude of the correlation.
  4. The filled portion of the pie indicates the magnitude of the correlation.
  5. Positive correlations fill the pie starting at 12 o’clock and moving in a clockwise direction.
  6. Negative correlations fill the pie by moving in a counterclockwise direction.
  7. In the upper triangular matrix we have correlations and their confidence intervals.
  8. Each variable occurs in the diagonal of the matrix with the minimum and maximum value it takes.

Corrgram of Raw variables

Corrgram of Bi-variate correlations

Data Preparation

  • One of the most challenging tasks in data analysis is data preparation.
  • Data Preparation will typically involve ways to create new variables, transform and recode existing variables, merge datasets, and select observations (train and test samples).

Variable transformation

  • We create a new variable, the Ratio of Monthly Income to Number of Dependents, which is a better metric than Monthly Income and Number of Dependents used individually.
  • The variable age (in years) had some records with the value zero, which is a data error; the variable has been floored at 21.
  • All of the independent variables have been binned to take care of extreme and missing values. No outliers or missing observations on the independent variables have been deleted.
  • The bins have been assigned bin values called Weight of Evidence (WOE) that are representative of the bin.
  • WOE is computed as log_e (Proportion of Bad in the bin / Proportion of Good in the bin).
  • The binning has been done so that the binned independent variables perfectly rank-order the bad rate on the training sample.
  • The data was randomly split into training and testing samples in the ratio 70:30 such that the proportion of the dependent variable remains the same in the two sub-samples; the random seed set.seed(211) was used so that results can be reproduced. A minimal sketch of the WOE computation and the stratified split is shown below.
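The sketch assumes a data frame cs with the outcome SeriousDlqin2yrs (0 = Good, 1 = Bad) and an illustrative binned factor age_bin; caTools::sample.split is used as one possible way to preserve the bad rate in both samples (the deck does not name the splitting function it actually used).

tab <- table(cs$age_bin, cs$SeriousDlqin2yrs)              # rows = bins, columns = 0 (Good) / 1 (Bad)
woe <- log((tab[, "1"] / sum(tab[, "1"])) /                # proportion of Bads in each bin
           (tab[, "0"] / sum(tab[, "0"])))                 # divided by the proportion of Goods
cs$age_woe <- woe[as.character(cs$age_bin)]                # replace each record's bin by its WOE value

library(caTools)
set.seed(211)                                              # seed used throughout the project
in_train <- sample.split(cs$SeriousDlqin2yrs, SplitRatio = 0.70)
train <- cs[in_train, ]                                    # 70% training sample
test  <- cs[!in_train, ]                                   # 30% test sample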

Binned Variables

Binned Variables cont.

  • The overall proportion of Bad accounts in the training sample is 6.68%.
  • The overall proportion of Bad accounts in the test sample is 6.68%.
  • Thus the resulting train and test samples have the same proportion of good and bad accounts as the original sample.

Logit and Bad Rate Plots of Independent variables

  1. Logistic regression models the log odds of the event as a linear function of Xi.
  2. The logit model assumes that the log odds are linearly related to Xi.
  3. To check this, we plot the logit against the binned variables.
  4. In the following plots we request a linear model fit (method = "lm") between the logit and the variable bins. It is shown by the straight red line, which includes a 95% confidence interval (the darker band). A minimal plotting sketch is given after this list.
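The sketch assumes a small data frame bins with one row per bin, holding its WOE value and the counts of good and bad accounts in that bin (illustrative names, not taken from the deck):

library(ggplot2)
bins$logit <- log(bins$bad / bins$good)                        # empirical log odds of bad in each bin

ggplot(bins, aes(x = woe, y = logit)) +
  geom_point() +                                               # one point per bin
  geom_smooth(method = "lm", colour = "red", level = 0.95) +   # straight red line with 95% CI band
  labs(x = "Bin value (WOE)", y = "Logit of bad rate")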

Logit Plots: num_30_59_dpd, num_60_89_dpd, num_90_dpd, num_open_trades

Plots of Binned Variables

Logit Plots: num_RE_trades, num_depend, Rev_Util_unsec, Debt_Ratio

Plots of Binned Variables

Logit Plots: Monthly Income, Age, Ratio_Income_num_depend

Plots of Binned Variables

Logistic Regression

  1. When the outcome variable is binary and one wishes to measure the effects of several independent variables on it, the method of analysis to use is Logistic Regression.
  2. The binary outcome variable is coded 0 and 1. The convention is to associate 1 with ‘success’ (e.g. the patient survived; the risk factor is present; the correct answer is given, and so on), and 0 with ‘failure’.
  3. In multiple linear regression the basic activity is to fit the least-squares line around which the values of Y (the outcome variable) are distributed. In logistic regression we are instead trying to estimate the probability that a given individual will fall into one outcome group or the other.

Violations of OLS assumptions when used to model a binary response variable:

  1. Homoscedasticity: This assumption means that the variance of Y around the regression line is the same for all values of the predictor variable (X).
  2. When OLS is used to predict probabilities for a dichotomous dependent variable the regression equation can produce impossible probabilities either above 1 or below 0.
  3. Assumption of Linearity: a unit change in X should produce a constant change in Y, whatever the value of X.

Correlations between binned WOE variables

  • We see that the variables num_30_59_dpd, num_60_89_dpd and num_90_dpd show moderate positive correlations.

Corrgram of Binned variables

Corrgram of Bi-variate correlations WOE variables

Logistic Model:

  • Stepwise selection dropped Debt Ratio; we also drop num_open_trades_woe as its coefficient has the wrong sign. (A minimal sketch of the fitting step is given after this list.)
  1. Subtracting the deviance of the final model from that of the null model gives the model chi-square.
  2. The smaller the deviance, the better the fit of the model.
  3. Since the significance of the model chi-square is below 0.05, we conclude that the model with predictors significantly improves the fit compared to the model with no predictors.
  • df = 7
  • Chi-square = 12923.6273
  • p-value = 0
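The sketch assumes a training data frame train_woe holding the dependent variable and the binned WOE predictors; stepAIC from MASS is used here as one common stepwise routine (the deck does not say which routine it used).

library(MASS)
full <- glm(SeriousDlqin2yrs ~ ., data = train_woe, family = binomial)
fit  <- stepAIC(full, direction = "both", trace = FALSE)   # stepwise selection
fit  <- update(fit, . ~ . - num_open_trades_woe)           # drop the variable with the wrong sign

chisq <- fit$null.deviance - fit$deviance                  # model chi-square
df    <- fit$df.null - fit$df.residual                     # its degrees of freedom
pval  <- pchisq(chisq, df, lower.tail = FALSE)             # significance of the model chi-square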

Logistic Model: Optimal Threshold Values for classification

  • Threshold is a single integer value representing the number of equally spaced threshold values to be tried between 0 and 1.
  1. If you vary the threshold, you can increase the sensitivity of the classification model at the expense of its specificity.
  2. Lowering the threshold for classifying bad credit to 6% results in a model with improved sensitivity.
  3. sensitivity = specificity = 0.06: threshold value or range where sensitivity is equal to specificity.
  4. max.sensitivity+specificity = 0.05: threshold value or range that maximizes sensitivity plus specificity.
  5. min.ROC.plot.distance = 0.06: threshold value or range where the ROC curve is closest to the point (0,1) (a perfect fit).
  6. We choose 0.06 as the threshold value for classifying a customer as bad credit. A minimal threshold-sweep sketch is given after this list.
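The deck uses an optimal-threshold helper (optim.thresh); the sketch below does a manual sweep illustrating the same idea, assuming pred holds the predicted probabilities of bad and obs the observed 0/1 outcome on the training sample.

thresholds <- seq(0.01, 0.99, by = 0.01)
stats <- t(sapply(thresholds, function(th) {
  pred_class <- as.integer(pred >= th)
  sens <- sum(pred_class == 1 & obs == 1) / sum(obs == 1)   # true positive rate
  spec <- sum(pred_class == 0 & obs == 0) / sum(obs == 0)   # true negative rate
  c(threshold = th, sensitivity = sens, specificity = spec)
}))
stats[which.max(stats[, "sensitivity"] + stats[, "specificity"]), ]  # threshold maximising sens + spec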

Sensitivity, Specificity and Accuracy of Logistic Model on Training and Test samples

Classification Table of Logistic Model on Training sample
Predicted      Actual Good    Actual Bad       Total
0                    76034          1618       77652
1                    21948          5400       27348
Sum                  97982          7018      105000
  1. Sensitivity is 76.94%
  2. Specificity is 77.60%
  3. Accuracy is 77.56%
Classification Table of Logistic Model on Test sample
Predicted      Actual Good    Actual Bad       Total
0                    32553           698       33251
1                     9439          2310       11749
Sum                  41992          3008       45000
  1. Sensitivity is 76.80%
  2. Specificity is 77.52%
  3. Accuracy is 77.47%

ROCR and Area under the curve Logistic Model

  1. The impact of varying the threshold value is typically assessed using a receiver operating characteristic (ROC) curve.
  2. A ROC curve plots sensitivity (the true-positive rate) against 1 - specificity (the false-positive rate) for a range of threshold values.
  3. The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds.
  4. For each candidate threshold, the resulting true-positive rate (i.e., the sensitivity) and the false-positive rate (one minus the specificity) are plotted against each other.
  5. The Area under the Receiver Operating Characteristic curve is 85.69%. (A minimal ROCR sketch of computing the curve and the AUC is given below.)
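The sketch assumes prob_test holds the logistic model's predicted probabilities of bad on the test sample:

library(ROCR)
pred_obj <- prediction(prob_test, test$SeriousDlqin2yrs)
perf     <- performance(pred_obj, "tpr", "fpr")            # ROC: true vs false positive rate
plot(perf, main = "ROC Curve - Logistic Model (test sample)")
abline(0, 1, lty = 2)                                      # reference line for a random classifier
auc <- performance(pred_obj, "auc")@y.values[[1]]          # area under the ROC curve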

Receiver Operating Characteristic Curve of Logistic Model on test sample

Naive Bayes

Predicting a class label using Naive Bayes classifier:

  1. A Naive Bayes classifier is a program that predicts a class value given a set of attributes.
  2. The model estimates the probability of observing the attributes X given that a customer is bad credit, i.e. the class-conditional likelihoods.
  3. Note that the model is built "the other way round": it starts from the probability of the attributes given the class, and Bayes' theorem is then applied to obtain the probability of bad credit given the attributes, which is what we use for classification. (A minimal e1071 sketch is given below.)
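The sketch uses the same assumed train_woe / test_woe data frames as in the earlier sketches; the 0.005 cut-off is the threshold chosen on the next slide.

library(e1071)
nb_fit   <- naiveBayes(as.factor(SeriousDlqin2yrs) ~ ., data = train_woe)
nb_prob  <- predict(nb_fit, newdata = test_woe, type = "raw")[, "1"]  # P(bad | attributes)
nb_class <- ifelse(nb_prob >= 0.005, 1, 0)                            # classify as bad above the threshold
table(Predicted = nb_class, Actual = test_woe$SeriousDlqin2yrs)       # confusion matrix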

Sensitivity, Specificity and Accuracy of Naive Bayes Model on Training and Test samples

We choose 0.005 as the threshold to classify an account as bad credit.

Classification Table of Naive Bayes Model on Training sample
Predicted      Actual Good    Actual Bad       Total
0                    74677          1658       76335
1                    23305          5360       28665
Sum                  97982          7018      105000
  1. Sensitivity is 76.38%
  2. Specificity is 76.22%
  3. Accuracy is 76.23%
Classification Table of Naive Bayes Model on Test sample
Predicted      Actual Good    Actual Bad       Total
0                    32018           681       32699
1                     9974          2327       12301
Sum                  41992          3008       45000
  1. Sensitivity is 77.36%
  2. Specificity is 76.25%
  3. Accuracy is 76.32%

ROCR and Area under the curve Naive Bayes Model

Area under the Receiver Operating Characteristic curve is 76.80%

Receiver Operating Characteristic Curve of Naive Bayes Model on test sample

Classification Tree

What is a Classification Tree?

  1. A classification tree is used to predict a qualitative response.
  2. For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.
  3. We use recursive binary splitting to grow a classification tree. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.
  4. The Gini index is a measure of total variance across the K classes. Hence the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
  5. When building a classification tree, the Gini index is typically used to evaluate the quality of a particular split.
  6. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
  7. When pruning the tree, the classification error rate is the preferable measure if prediction accuracy of the final pruned tree is the goal. A minimal rpart sketch of growing such a tree is given after this list.
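The sketch grows such a tree with Gini splits and 10-fold cross-validation, again assuming the illustrative train_woe data frame from the earlier sketches:

library(rpart)
tree_fit <- rpart(as.factor(SeriousDlqin2yrs) ~ .,
                  data    = train_woe,
                  method  = "class",                                # classification tree
                  parms   = list(split = "gini"),                   # Gini index as the split criterion
                  control = rpart.control(cp = 0.0001, xval = 10))  # small cp grows a large tree; 10-fold CV
printcp(tree_fit)                                                   # prints the cptable used to size the tree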

Tree Beard trying to judge whether to classify as Orc or Hobbit!

CP Table of Decision Tree

      CP  nsplit  rel error  xerror    xstd
  0.4165       0     1.0000  1.0000  0.0025
  0.0293       1     0.5835  0.5835  0.0022
  0.0166       2     0.5543  0.5543  0.0022
  0.0070       3     0.5376  0.5376  0.0022
  0.0049       7     0.5098  0.5037  0.0021
  0.0026      10     0.4906  0.4876  0.0021
  0.0016      14     0.4784  0.4802  0.0021
  0.0009      16     0.4753  0.4819  0.0021
  0.0007      17     0.4744  0.4822  0.0021
  0.0005      18     0.4737  0.4829  0.0021
  0.0003      19     0.4732  0.4836  0.0021
  0.0003      20     0.4730  0.4831  0.0021
  0.0002      23     0.4722  0.4831  0.0021
  0.0001      25     0.4717  0.4827  0.0021
  0.0000      28     0.4714  0.4831  0.0021
 -1.0000     136     0.4714  0.4831  0.0021
  1. In order to choose a final tree size, examine the cptable component of the list returned by rpart().
  2. It contains data about the prediction error for various tree sizes.
  3. The complexity parameter (cp) is used to penalize larger trees.
  4. Tree size is defined by the number of branch splits (nsplit). A tree with n splits has n + 1 terminal nodes.
  5. The rel error column contains the error rate for a tree of a given size in the training sample.
  6. The cross-validated error (xerror) is based on 10-fold cross validation (also using the training sample).
  7. The xstd column contains the standard error of the cross-validation error.

CP Plot of Decision Tree

Plot of cross-validated error against complexity parameter cp

  • The plotcp() function plots the cross-validated error against the complexity parameter cp.
  • As a rule of thumb, it is best to prune a decision tree using the cp of the smallest tree whose cross-validated error is within one standard error of the tree with the smallest xerror, or equivalently the tree size associated with the largest complexity parameter below the horizontal line in the plotcp output.
  • The cp plot suggests selecting the tree with the leftmost cp value below the line, hence we select a cp value of 0.00035. (A minimal pruning sketch is given below.)
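The sketch uses the cp value chosen above, with tree_fit and test_woe as in the earlier sketches:

plotcp(tree_fit)                                      # cross-validated error vs cp, with the one-SE line
pruned    <- prune(tree_fit, cp = 0.00035)            # keep the sub-tree at the chosen cp
prob_tree <- predict(pruned, newdata = test_woe, type = "prob")[, "1"]  # P(bad) on the test sample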

Pruned Decision Tree

Sensitivity, Specificity and Accuracy of Decision Tree on Training and Test samples

We choose a probability cut-off of 0.43 to classify an account as Bad credit.

Classification Table of Decision Tree Training Sample
Predicted      Actual Good    Actual Bad       Total
0                    75618          1521       77139
1                    22364          5497       27861
Sum                  97982          7018      105000
  • Sensitivity is 78.33%
  • Specificity is 77.18%
  • Accuracy is 77.25%
Classification Table of Decision Tree on Test Sample
Predicted      Actual Good    Actual Bad       Total
0                    32315           679       32994
1                     9677          2329       12006
Sum                  41992          3008       45000
  • Sensitivity is 77.43%
  • Specificity is 76.96%
  • Accuracy is 76.99%
  • Area under the Receiver Operating Characteristic curve is 77.19%

ROC Curve of Decision Tree on test sample

  • The Receiver Operating Characteristic curve (or ROC curve) is a plot of the true positive rate against the false positive rate for the different possible probability cut-points of the classification.
  • The area under the curve (AUC) is a measure of the accuracy of the Decision Tree.
  • A note of historical interest:
  • ROC analysis is part of a field called "Signal Detection Theory", developed during World War II for the analysis of radar images.
  • Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise.
  • Signal detection theory measures the ability of radar receiver operators to make these important distinctions.
  • Their ability to do so was called the Receiver Operating Characteristics.

Random Forest

  • Random forest is an ensemble learning method used for classification and regression.
  • It combines multiple tree models to obtain better performance than a single tree model.
  • The forest error rate depends on two things:
  1. The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
  2. The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Broad Steps in Random Forest

  1. Random Forests grows many classification trees.
  2. To classify a new object from an input vector (x variables), put the input vector down each of the trees in the forest.
  3. Each tree gives a classification, and we say the tree "votes" for that class.
  4. The forest chooses the classification having the most votes (over all the trees in the forest).

Parameters of a Random Forest

  1. ntree is the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
  2. mtry is the number of variables randomly sampled as candidates at each split. Note that the default value for classification is sqrt(p), where p is the number of variables in x.
  3. replace: Should sampling of cases be done with or without replacement?
  4. maxnodes: Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.
  5. nodesize: Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time).
  6. classwt: Priors of the classes. Need not add up to one.
  7. sampsize: Size(s) of sample to draw.
  8. importance: Should importance of predictors be assessed?
  9. do.trace: If set to TRUE, gives a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.
  10. keep.forest: If set to FALSE, the forest will not be retained in the output object.
  • The decision tree we built earlier, corresponding to the cp value of 0.00035, had 20 nodes.
  • In order not to over-fit the forest to the training sample, we choose nodesize = 500.
  • We start with the default value of 500 decision trees in the forest. (A minimal randomForest sketch with these settings is given below.)
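The sketch uses the settings chosen above, on the same assumed train_woe data frame:

library(randomForest)
set.seed(211)
rf_fit <- randomForest(as.factor(SeriousDlqin2yrs) ~ .,
                       data       = train_woe,
                       ntree      = 500,            # default number of trees
                       nodesize   = 500,            # large terminal nodes to curb over-fitting
                       importance = TRUE)           # compute variable importance
plot(rf_fit)                                        # OOB and per-class error rates vs number of trees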

tuneRF: Finding Optimum Value of mtry

  • Reducing mtry reduces both the correlation between trees and the strength of an individual tree. Increasing it increases both.
  • Somewhere in between is an "optimal" range of mtry - usually quite wide.
  • Using the OOB error rate, a value of mtry in this range can quickly be found.
  • mtry is the only adjustable parameter to which random forests is somewhat sensitive.
  • We will now use the tuneRF function to find the optimum value of mtry. (A minimal tuneRF sketch is given below.)
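The sketch passes the WOE predictors as x and the outcome factor as y; the argument names come from the randomForest package, while the ntreeTry, stepFactor and improve values are illustrative placeholders:

library(randomForest)
x <- train_woe[, setdiff(names(train_woe), "SeriousDlqin2yrs")]   # predictor columns
y <- as.factor(train_woe$SeriousDlqin2yrs)                        # outcome factor
set.seed(211)
tuned <- tuneRF(x, y,
                mtryStart  = floor(sqrt(ncol(x))),   # default starting point for classification
                ntreeTry   = 100,                    # trees grown at each candidate mtry
                stepFactor = 2,                      # mtry is multiplied/divided by this factor
                improve    = 0.0001)                 # minimum relative OOB improvement to keep searching
tuned                                                # matrix of mtry vs OOB error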

tuneRF Plot of mtry against OOB error

tuneRF output (mtry vs OOB error):

       mtry  OOBError
1.OOB     1   0.06444
2.OOB     2   0.06331
3.OOB     3   0.06469
6.OOB     6   0.07020

       mtry  OOBError
3.OOB     3   0.06449
4.OOB     4   0.06660
5.OOB     5   0.06853
7.OOB     7   0.07124

The plots of OOB error against mtry suggest that mtry = 2 is the optimum.

Random Forest with 500 trees and mtry=2

  1. An out-of-bag (OOB) error estimate is obtained by classifying the cases that aren’t selected when building a tree, using that tree.
  2. The number of trees necessary for good performance grows with the number of predictors.
  3. The best way to determine how many trees are necessary is to compare predictions made by a forest to predictions made by a subset of a forest.
  4. When the subsets work as well as the full forest, you have enough trees.
  5. We see that the three error rates are stable after we have 100 trees in the forest.
  6. Hence we decide to use 100 trees in the random forest.

Variable Importance in Random Forest with 100 trees

  • We decide to drop the variable Debt_Ratio_woe due to its very low mean decrease in Gini.
  • Finally we decide to build a Random Forest with 100 trees and 8 variables. (A minimal sketch of the importance check and the final forest is given below.)
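The sketch shows the importance check and the final forest (100 trees, mtry = 2, without Debt_Ratio_woe), with object and variable names following the earlier sketches:

importance(rf_fit)                        # MeanDecreaseAccuracy and MeanDecreaseGini per variable
varImpPlot(rf_fit)                        # the variable-importance plot referred to above

rf_final <- randomForest(as.factor(SeriousDlqin2yrs) ~ . - Debt_Ratio_woe,
                         data = train_woe, ntree = 100, mtry = 2,
                         nodesize = 500, importance = TRUE)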

Sensitivity, Specificity and Accuracy of Random Forest on Training and Test sample

Based on the output of optim.thresh we choose a probability cut-off of 0.58 to classify an account as Bad credit.

Classification Table of Random Forest on Training Sample
Predicted      Actual Good    Actual Bad       Total
0                    74027          1410       75437
1                    23955          5608       29563
Sum                  97982          7018      105000
  • Sensitivity is 79.91%
  • Specificity is 75.55%
  • Accuracy is 75.84%
Classification Table of Random Forest on Test Sample
Predicted      Actual Good    Actual Bad       Total
0                    31592           617       32209
1                    10400          2391       12791
Sum                  41992          3008       45000
  • Sensitivity is 79.49%
  • Specificity is 75.23%
  • Accuracy is 75.52%
  • Area under the Receiver Operating Characteristic curve is 77.36%

ROC Curve & Area under the curve of Random Forest on Test Sample

Age of Support Vector Machines

  • Support vector machines (SVMs) are a group of supervised machine-learning models that can be used for classification and regression.
  • The process of how support vector machines transform from input to output is less clear and can be hard to interpret.
  • Hence support vector machines are referred to as a black-box method.

Introduction:

  • SVMs use kernel functions to transform the data into higher dimensions, in the hope that they will become more linearly separable. So they first map input data into a high dimension feature space defined by the kernel function, and find the optimum hyper plane that separates the training data by the maximum margin.
  • We can think of support vector machines as a linear algorithm in a high dimensional space.

SVM Advantages and Disadvantages:

Advantages:

  1. It can build a highly accurate model by engineering a problem-oriented kernel.
  2. It makes use of the regularization term to avoid over-fitting.
  3. It also does not suffer from multicollinearity.

Disadvantages:

  1. The main limitation of SVM is its speed.
  2. It is not suitable or efficient enough to construct classification models for data that is large in size as we will see shortly.

Working of a SVM:

  1. Support vector machines are built by solving constrained optimization problems. Ring any bells?
  2. The popularity of SVMs has led to the development of a large number of special purpose solvers for the SVM optimization problem.
  3. The complexity of training of non-linear SVMs with solvers such as LIBSVM has been estimated to be quadratic in the number of training examples, which can be prohibitive for datasets with hundreds of thousands of sample points, due to the sheer amount of time required to train the model.

SVM Details:

  1. The support vector machine constructs a hyperplane (or set of hyperplanes) that maximizes the margin width between two classes in a high-dimensional space. The cases that define the hyperplane are the support vectors. The margin is the gap, represented by the distance between the two lines.
  2. The hyperplane is chosen to maximize the margin between the two classes’ closest points.
  3. The points on the boundary of the margin are called support vectors (they help define the margin), and the middle of the margin is the separating hyperplane.
  4. In the two-dimensional case, the optimal hyperplane is the discontinuous black line in the middle of the gap.
  5. For an N-dimensional space (that is, with N predictor variables), the optimal hyperplane (also called a linear decision surface) has N – 1 dimensions. If there are two variables, the surface is a line. For three variables, the surface is a plane.
  6. The optimal hyperplane is identified using quadratic programming to optimize the margin under the constraint that the data points on one side have an outcome value of +1 and the data on the other side have an outcome value of -1.
  7. Because predictor variables with larger variances typically have a greater influence on the development of SVMs, the svm() function scales each variable to a mean of 0 and standard deviation of 1 before fitting the model by default.

SVM Parameters: kernel, cost, and gamma

  1. Kernel can be linear, polynomial, radial, or sigmoid. The default value is radial basis function (RBF) to map samples into a higher-dimensional space. The RBF kernel is often a good choice because it’s a nonlinear mapping that can handle relations between class labels and predictors that are nonlinear. There are no golden rules for determining which admissible kernel will result in the most accurate SVM. In practice, the kernel chosen does not generally make a large difference in resulting accuracy.
  2. The Gamma parameter controls the shape of the separating hyperplane. Increasing the gamma argument usually increases the number of support vectors. Gamma must be greater than zero. The number of support vectors can range from very few to every single data point if you completely over-fit your data; thus a high value of gamma can lead to over-fitting. The default value is 1 / (number of predictor variables).
  3. Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors. Thus for high values of gamma the influence of a single training example is close and hence more sample points are needed as support vectors.
  4. The Cost parameter controls training errors and margins. The default value is set to 1. Thus the cost parameter represents the cost of making errors. A large cost creates a narrow margin (a hard margin) and permits fewer misclassifications. A large value severely penalizes errors and leads to a more complex classification boundary. There will be less misclassification in the training sample, but over-fitting may result in poor predictive ability in new samples. A small cost creates a large margin (a soft margin) and allows more misclassifications. Smaller values lead to a flatter classification boundary but may result in under-fitting. Like gamma, cost is always positive.
  5. By default, the svm() function sets gamma to 1 / (number of predictors) and cost to 1, but a different combination of gamma and cost may lead to a more effective model. (A minimal svm() sketch is given below.)
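The sketch fits an SVM with a radial kernel via e1071; the gamma and cost values are placeholders to be replaced by the tuned values, and train_sub stands for one of the four training sub-samples described later:

library(e1071)
svm_fit <- svm(as.factor(SeriousDlqin2yrs) ~ .,
               data        = train_sub,       # one of the four training sub-samples
               kernel      = "radial",        # RBF kernel (the default)
               gamma       = 0.111,           # placeholder; replace with the tuned value
               cost        = 1,               # placeholder; replace with the tuned value
               probability = TRUE)            # needed to extract class probabilities later
pred     <- predict(svm_fit, newdata = test_woe, probability = TRUE)
svm_prob <- attr(pred, "probabilities")       # matrix of class probabilities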

SVM Accuracy

We can see from the graph below (image borrowed from scikit-learn) that as the cost and gamma are increased, the validation accuracy goes down.

Tuning a support vector machine

  • Adjusting the gamma and cost parameters can optimize the performance of SVM.
  • The e1071 package provides a tuning function, tune.svm(), which makes the tuning easier.
  • tune.svm() fits every combination of values and reports on the performance of each by doing a grid search.
  1. To tune the support vector machine one has to generate a variety of combinations of gamma and cost to find the best gamma and cost parameters.
  2. In the SVM I ran, I started with five values of Gamma and nine values of Cost, Gamma <- list(0.001, 0.011, 0.111, 1.111, 11.111) and Cost <- list(0.01, 1, 30, 50, 70, 85, 100, 250, 500).
  3. The tune.svm run with the above parameter sets (5 x 9 = 45 combinations of parameters) did not finish running over the weekend.
  4. It was run on the entire training data set with 105,000 observations. Since the main limitation of SVM is speed, it was decided to run processes in parallel.
  5. The R libraries doParallel and foreach were used to do the parallel processing. (A hedged sketch of the parallel tuning call is given after the code below.)
  6. To find out the number of cores on my system the following approach was used:
library(doParallel)               # also attaches parallel and foreach
num_cores <- detectCores()
cl <- makeCluster(num_cores)
registerDoParallel(cl)            # register the cluster as the parallel backend
getDoParWorkers()
[1] 4
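With the backend registered, the tuning runs on the four sub-samples can be dispatched with foreach. A hedged sketch, assuming the sub-samples are held in a list train_parts and using the gamma and cost grids listed above with 4-fold cross-validation:

library(e1071)
gammas <- c(0.001, 0.011, 0.111, 1.111, 11.111)
costs  <- c(0.01, 1, 30, 50, 70, 85, 100, 250, 500)

tuned <- foreach(part = train_parts, .packages = "e1071") %dopar% {
  tune.svm(as.factor(SeriousDlqin2yrs) ~ ., data = part,
           gamma = gammas, cost = costs,
           tunecontrol = tune.control(cross = 4))   # 4-fold cross-validation on each sub-sample
}
lapply(tuned, function(res) res$best.parameters)    # best gamma/cost for each sub-sample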

Calling all workers!

  • My Mac has 4 cores and all 4 were used to run the tuning algorithm.
  • One of the ways of checking if indeed 4 cores are being used is through the Activity monitor on a Mac.
  • The CPU usage window as seen below shows that all 4 cores are being utilized.
  • The experience of the four processors while working on SVM is similar to what we have experienced!
  • The state of the processors once the run was complete, after 7 hours (or 4 lectures)!
Dimensions of each training sub-sample: [1] 18375    10

Tune.svm

  • The training data set was split into 4 equal subsets with the same delinquency rates.
  • Each of the four training sub-samples has 18,375 records, so we cover 70% of the training sample in tuning the SVM.

Tune.svm First run

tune.svm Output

  1. Best Performance represents the error of the model corresponding to the best Gamma and Cost parameters.
  2. Of the four tuning procedures, three agree that the model with the lowest 4-fold cross-validated error on the training sub-sample has gamma = 0.111 and cost = 1.
  3. Hence we decide to look in the neighbourhood of these values to further fine-tune the parameters, e.g. (0.011 + 0.111)/2 = 0.061 and (0.111 + 1.111)/2 = 0.611.
  4. Similarly we arrive at the cost values of 0.505 and 15.5.

Tune.svm Second run

Training a SVM

SVM: Optimal Threshold Values for classification

  • Threshold is a single integer value representing the number of equally spaced threshold values to be tried between 0 and 1.
  • The optimal threshold value given by the optim.thresh function is 0.9447.
  • So we choose a probability cut-off of 0.9447 to classify an account as Good credit which optimizes sensitivity and specificity on the combined training sample.
  • Blended prediction using the 4 training samples:
  • If the predicted probability of being good credit for a customer in the test sample using the models built on the four training samples is greater than the threshold value of 0.9447 then we classify that customer as good credit.
  • We make the final class prediction by simply counting each model’s predicted class for a given input and determining the final blended class assignment using a voting scheme between the models.
  • For example, if 2 or more of the four models predict Bad Credit, the blended prediction is Bad Credit. (A minimal voting sketch is given below.)
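The sketch assumes prob_list holds the four models' predicted probabilities of good credit on the test sample (one vector per model):

votes_bad <- Reduce(`+`, lapply(prob_list, function(p) as.integer(p < 0.9447)))  # 1 = a Bad vote
blended   <- ifelse(votes_bad >= 2, "Bad", "Good")       # 2 or more Bad votes => Bad, as described above
table(Blended = blended, Actual = test_woe$SeriousDlqin2yrs)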

Sensitivity, Specificity and Accuracy of SVM on Training and Test samples

Classification Table of SVM on Training Sample
Predicted      Actual Good    Actual Bad       Total
0                    11171           665       11836
1                    30821          2343       33164
Sum                  41992          3008       45000
  1. Sensitivity is 67.13%
  2. Specificity is 56.84%
  3. Accuracy is 57.52%
Classification Table of SVM on Test Sample
Predicted      Actual Good    Actual Bad       Total
0                    11171           665       11836
1                    30821          2343       33164
Sum                  41992          3008       45000
  1. Sensitivity is 77.89%
  2. Specificity is 26.60%
  3. Accuracy is 30.03%
  4. Area under the Receiver Operating Characteristic curve is 52.25%
  5. We see that the model's accuracy is dismal on the test sample.
  6. As I said before, this has to be seen in the light of my being relatively new to SVM.

SVM ROCR on Test sample

Recommendation and Conclusions:

  1. The various classification techniques available to an analyst can only be compared in the context of the classification problem at hand.
  2. A priori, one cannot determine whether a classification technique is superior to the others.
  3. The performance of the classifier depends to a large extent on the nature of the classification problem and on the user's expertise in getting the best out of the technique.

References:

Online:

References: Decision Trees
References: Random forests
References: Parallel Processing in R
References: Pandoc
References: R Markdown
References: LaTeX
References: R Cook Book
References: Knitr Wrap output
References: cs.cmu.edu

Books:

R in Action: Data analysis and graphics with R, Second Edition, by Robert I. Kabacoff
Publisher: Manning Publications

Dynamic Documents with R and knitr, Second Edition, by Yihui Xie
Publisher: CRC Press

Data Mining Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber, Jian Pei
Publisher: Morgan Kaufmann

Machine Learning with R Cookbook, by Yu-Wei Chiu (David Chiu)
Publisher: Packt Publishing

Thank You

Mentor: Dr. P.K. Viswanathan

Dr. Mathew Thomas

Dr. R.L. Shankar

Great Lakes Institute of Management, Chennai.
and last but not least
BABI5 Friends