Machine Learning Metrics

Sameer Mathur

Machine Learning Metrics Using library(MLmetrics) and data(Default) from library(ISLR)

---

Default Dataset

Reading Default Data into R

# loading required package
library(ISLR)
# loading default dataset
data("Default")
attach(Default)

First Few Rows of the Dataset

# first few rows of the dataset
head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

Describing the Dataset

# some descriptive statistics of the dataset
library(psych)
describe(Default)[,c(1:5,8:9)]
         vars     n     mean       sd   median    min      max
default*    1 10000     1.03     0.18     1.00   1.00     2.00
student*    2 10000     1.29     0.46     1.00   1.00     2.00
balance     3 10000   835.37   483.71   823.64   0.00  2654.32
income      4 10000 33516.98 13336.64 34552.64 771.97 73554.23

Structure of the Dataset

# structure of the dataset
str(Default)
'data.frame':   10000 obs. of  4 variables:
 $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
 $ balance: num  730 817 1074 529 786 ...
 $ income : num  44362 12106 31767 35704 38463 ...

Logit Model to Predict default as a Function of student, balance, and income

Objective of the Logistic Regression Model

We apply the glm() function to a formula that models default as a function of student, balance, and income.

This creates a generalized linear model (GLM) in the binomial family.
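
In equation form, the model expresses the log-odds of default as a linear function of the predictors, where \( p \) denotes the probability of default and the \( \beta \) coefficients are estimated below:

\( \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \, studentYes + \beta_2 \, balance + \beta_3 \, income \)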

Fitting the Logistic Regression Model

# fitting logistic regression model
logitReg <- glm(default ~ student + balance + income, 
              data = Default, 
              family = binomial)
# summary of the model
summary(logitReg)

Call:
glm(formula = default ~ student + balance + income, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

Log-Odds Ratios

# log-odds ratios
cbind(LogOddsRatio = coef(logitReg))
             LogOddsRatio
(Intercept) -1.086905e+01
studentYes  -6.467758e-01
balance      5.736505e-03
income       3.033450e-06

Odds Ratios

# odds ratios
exp(cbind(OddsRatio = coef(logitReg)))
               OddsRatio
(Intercept) 1.903854e-05
studentYes  5.237317e-01
balance     1.005753e+00
income      1.000003e+00
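
Confidence intervals for the odds ratios are a natural extension; a minimal sketch using Wald intervals via confint.default() (this choice of interval is an assumption, not part of the original output):

# odds ratios with 95% Wald confidence intervals (illustrative sketch)
exp(cbind(OddsRatio = coef(logitReg), confint.default(logitReg)))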

Calculating the Probability

# linear predictor (log-odds) for a student with average balance and income, using rounded coefficients
eqn <- (-10.87 - 0.6467758*1) + (0.005736505*mean(Default$balance)) + (0.00000303345*mean(Default$income))
eqn
[1] -6.622972
# calculating the probability
exp(eqn)/(1+exp(eqn))
[1] 0.001327709
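
The same probability can be computed from the unrounded coefficients stored in the fitted model; a minimal sketch (plogis() applies the logistic transform, and the small difference from the rounded result above is due to rounding):

# linear predictor (log-odds) from the stored, unrounded coefficients
b <- coef(logitReg)
eta <- b["(Intercept)"] + b["studentYes"]*1 +
  b["balance"]*mean(Default$balance) + b["income"]*mean(Default$income)
# logistic transform of the log-odds into a probability
plogis(eta)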

Prediction

# predicted probabilities from the fitted model
ProbPred <- predict(logitReg, Default, type = "response")
# assigning predicted classes using a 0.5 cutoff
predDefault <- ifelse(logitReg$fitted.values < 0.5, "No", "Yes")
# combining actual values, predicted probabilities, and predicted classes
pred_DF2 <- data.frame(default, ProbPred, predDefault)
# some rows for the predicted probabilities
head(pred_DF2,n=12)
   default     ProbPred predDefault
1       No 1.428724e-03          No
2       No 1.122204e-03          No
3       No 9.812272e-03          No
4       No 4.415893e-04          No
5       No 1.935506e-03          No
6       No 1.989518e-03          No
7       No 2.333767e-03          No
8       No 1.086718e-03          No
9       No 1.638333e-02          No
10      No 2.080617e-05          No
11      No 1.065494e-05          No
12      No 1.127658e-02          No
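
The 0.5 cutoff used above is a modeling choice; a quick sketch of how the predicted class counts change with a stricter cutoff (output not shown):

# class counts under the default 0.5 cutoff
table(predDefault)
# class counts under a 0.2 cutoff, which flags more potential defaulters
table(ifelse(ProbPred >= 0.2, "Yes", "No"))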

Probability for average balance and average income when student = “Yes”

# creating a one-row data frame with average balance, average income, and student = "Yes"
newdata <- with(Default, data.frame(balance = mean(balance),
                                    income = mean(income),
                                    student = "Yes"))
# predicting probability
PredProb <- predict(logitReg, newdata , type = "response")

PredProb
          1 
0.001328976 
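
For comparison, the same prediction can be made for a non-student with average balance and income (an illustrative sketch; output not shown):

# one-row data frame for a non-student with average balance and income
newdata2 <- with(Default, data.frame(balance = mean(balance),
                                     income = mean(income),
                                     student = "No"))
# predicted probability of default
predict(logitReg, newdata2, type = "response")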

Confusion Matrix

  • true positives (TP): the number of correct predictions that an instance is positive (we predicted yes, and the actual value is yes).

  • true negatives (TN): the number of correct predictions that an instance is negative (we predicted no, and the actual value is no).

  • false positives (FP): the number of incorrect predictions that an instance is positive (we predicted yes, but the actual value is no). (Also known as a “Type I error.”)

  • false negatives (FN): the number of incorrect predictions that an instance is negative (we predicted no, but the actual value is yes). (Also known as a “Type II error.”)

Making Confusion Matrix

# assigning predicted classes using a 0.5 cutoff
pred <- ifelse(logitReg$fitted.values < 0.5, "No", "Yes")
# creating confusion matrix of actual vs. predicted classes
table(y_true = Default$default, y_pred = pred)
      y_pred
y_true   No  Yes
   No  9627   40
   Yes  228  105

Here “No” is treated as the positive class (it is the first factor level, and the MLmetrics values reported below are consistent with this choice):

true positives (TP): = 9627

true negatives (TN): = 105

false positives (FP): = 228

false negatives (FN): = 40
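
The four cells can also be extracted programmatically; a small sketch, treating "No" as the positive class as above:

# confusion matrix as a table, then the four cells by name
cm <- table(y_true = Default$default, y_pred = pred)
TP <- cm["No", "No"]    # actual No, predicted No
TN <- cm["Yes", "Yes"]  # actual Yes, predicted Yes
FP <- cm["Yes", "No"]   # actual Yes, predicted No
FN <- cm["No", "Yes"]   # actual No, predicted Yes
c(TP = TP, TN = TN, FP = FP, FN = FN)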

Confusion Matrix Using the MLmetrics Package

# loading the required package
library(MLmetrics)
# making confusion matrix
ConfusionMatrix(y_pred = predDefault, y_true = Default$default)
      y_pred
y_true   No  Yes
   No  9627   40
   Yes  228  105

Again, with “No” as the positive class:

true positives (TP): = 9627

true negatives (TN): = 105

false positives (FP): = 228

false negatives (FN): = 40

Accuracy

Accuracy is defined as


\( Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

true negatives (TN): the number of correct predictions that an instance is negative.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Computing Accuracy Using the MLmetrics Package

# computing accuracy
Accuracy(y_pred = predDefault, y_true = Default$default)
[1] 0.9732
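
As a cross-check, the same accuracy follows directly from the confusion-matrix cells:

# accuracy from the confusion matrix: (TP + TN) / total
(9627 + 105) / (9627 + 105 + 228 + 40)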

Sensitivity and Specificity


Sensitivity

\( Sensitivity = \frac{TP}{TP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Specificity

\( Specificity = \frac{TN}{TN + FP} \)

where

true negatives (TN): the number of correct predictions that an instance is negative.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

Sensitivity and Specificity Using the MLmetrics Package

# computing Sensitivity (with positive = NULL, "No" is treated as the positive class here)
Sensitivity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
# computing Specificity
Specificity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.3153153
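
These values match the ratios of the confusion-matrix cells, with "No" as the positive class:

# Sensitivity = TP / (TP + FN)
9627 / (9627 + 40)
# Specificity = TN / (TN + FP)
105 / (105 + 228)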

Precision and Recall


Precision

\( Precision = \frac{TP}{TP + FP} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

Recall

\( Recall = \frac{TP}{TP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Precision and Recall Using the MLmetrics Package

# computing Precision
Precision(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9768645
# computing Recall
Recall(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
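
Again, these agree with the confusion-matrix cells:

# Precision = TP / (TP + FP)
9627 / (9627 + 228)
# Recall = TP / (TP + FN)
9627 / (9627 + 40)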