Machine Learning Metrics

Sameer Mathur

Machine Learning Metrics Using library(MLmetrics) and data(Default) from library(ISLR)

---

Default Dataset

Reading Default Data into R

# loading required package
library(ISLR)
# loading default dataset
data("Default")
attach(Default)

First Few Rows of the Dataset

# first few rows of the dataset
head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

Describing the Dataset

# some descriptive statistics of the dataset
library(psych)
describe(Default)[,c(1:5,8:9)]
         vars     n     mean       sd   median    min      max
default*    1 10000     1.03     0.18     1.00   1.00     2.00
student*    2 10000     1.29     0.46     1.00   1.00     2.00
balance     3 10000   835.37   483.71   823.64   0.00  2654.32
income      4 10000 33516.98 13336.64 34552.64 771.97 73554.23

Structure of the Dataset

# structure of the dataset
str(Default)
'data.frame':   10000 obs. of  4 variables:
 $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
 $ balance: num  730 817 1074 529 786 ...
 $ income : num  44362 12106 31767 35704 38463 ...

Logit Model to Predict default as a Function of student, balance, and income

Objective of the Logistic Regression Model

We apply the glm() function to a formula that models default as a function of student, balance, and income.

This creates a generalized linear model (GLM) in the binomial family.
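
In equation form, the model expresses the log-odds of default as a linear function of the predictors, where \( p \) denotes the probability of default and the \( \beta \) coefficients are estimated below:

\( \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \, studentYes + \beta_2 \, balance + \beta_3 \, income \)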

Fitting the Logistic Regression Model

# fitting logistic regression model
logitReg <- glm(default ~ student + balance + income, 
              data = Default, 
              family = binomial)
# summary of the model
summary(logitReg)

Call:
glm(formula = default ~ student + balance + income, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

Log-Odds Ratios

# log-odds ratios
cbind(LogOddsRatio = coef(logitReg))
             LogOddsRatio
(Intercept) -1.086905e+01
studentYes  -6.467758e-01
balance      5.736505e-03
income       3.033450e-06

Odds Ratios

# odds ratios
exp(cbind(OddsRatio = coef(logitReg)))
               OddsRatio
(Intercept) 1.903854e-05
studentYes  5.237317e-01
balance     1.005753e+00
income      1.000003e+00
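
Confidence intervals for the odds ratios are a natural extension; a minimal sketch using Wald intervals via confint.default() (this choice of interval is an assumption, not part of the original output):

# odds ratios with 95% Wald confidence intervals (illustrative sketch)
exp(cbind(OddsRatio = coef(logitReg), confint.default(logitReg)))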

Calculating the Probability

# linear predictor (log-odds) for a student with average balance and income, using rounded coefficients
eqn <- (-10.87 - 0.6467758*1) + (0.005736505*mean(Default$balance)) + (0.00000303345*mean(Default$income))
eqn
[1] -6.622972
# calculating the probability
exp(eqn)/(1+exp(eqn))
[1] 0.001327709
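
The same probability can be computed from the unrounded coefficients stored in the fitted model; a minimal sketch (plogis() applies the logistic transform, and the small difference from the rounded result above is due to rounding):

# linear predictor (log-odds) from the stored, unrounded coefficients
b <- coef(logitReg)
eta <- b["(Intercept)"] + b["studentYes"]*1 +
  b["balance"]*mean(Default$balance) + b["income"]*mean(Default$income)
# logistic transform of the log-odds into a probability
plogis(eta)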

Prediction

# predicted probabilities from the fitted model
ProbPred <- predict(logitReg, Default, type = "response")
# assigning predicted classes using a 0.5 cutoff
predDefault <- ifelse(logitReg$fitted.values < 0.5, "No", "Yes")
# combining actual values, predicted probabilities, and predicted classes
pred_DF2 <- data.frame(default, ProbPred, predDefault)
# some rows for the predicted probabilities
head(pred_DF2,n=12)
   default     ProbPred predDefault
1       No 1.428724e-03          No
2       No 1.122204e-03          No
3       No 9.812272e-03          No
4       No 4.415893e-04          No
5       No 1.935506e-03          No
6       No 1.989518e-03          No
7       No 2.333767e-03          No
8       No 1.086718e-03          No
9       No 1.638333e-02          No
10      No 2.080617e-05          No
11      No 1.065494e-05          No
12      No 1.127658e-02          No
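
The 0.5 cutoff used above is a modeling choice; a quick sketch of how the predicted class counts change with a stricter cutoff (output not shown):

# class counts under the default 0.5 cutoff
table(predDefault)
# class counts under a 0.2 cutoff, which flags more potential defaulters
table(ifelse(ProbPred >= 0.2, "Yes", "No"))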

Probability for average balance and average income when student = “Yes”

# creating a one-row data frame with average balance, average income, and student = "Yes"
newdata <- with(Default, data.frame(balance = mean(balance),
                                    income = mean(income),
                                    student = "Yes"))
# predicting probability
PredProb <- predict(logitReg, newdata , type = "response")

PredProb
          1 
0.001328976 
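
For comparison, the same prediction can be made for a non-student with average balance and income (an illustrative sketch; output not shown):

# one-row data frame for a non-student with average balance and income
newdata2 <- with(Default, data.frame(balance = mean(balance),
                                     income = mean(income),
                                     student = "No"))
# predicted probability of default
predict(logitReg, newdata2, type = "response")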

Confusion Matrix

  • true positives (TP): the number of correct predictions that an instance is positive (we predicted yes, and the actual value is yes).

  • true negatives (TN): the number of correct predictions that an instance is negative (we predicted no, and the actual value is no).

  • false positives (FP): the number of incorrect predictions that an instance is positive (we predicted yes, but the actual value is no). (Also known as a “Type I error.”)

  • false negatives (FN): the number of incorrect predictions that an instance is negative (we predicted no, but the actual value is yes). (Also known as a “Type II error.”)

Making Confusion Matrix

# assigning predicted classes using a 0.5 cutoff
pred <- ifelse(logitReg$fitted.values < 0.5, "No", "Yes")
# creating confusion matrix of actual vs. predicted classes
table(y_true = Default$default, y_pred = pred)
      y_pred
y_true   No  Yes
   No  9627   40
   Yes  228  105

Here “No” is treated as the positive class (it is the first factor level, and the MLmetrics values reported below are consistent with this choice):

true positives (TP): = 9627

true negatives (TN): = 105

false positives (FP): = 228

false negatives (FN): = 40
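
The four cells can also be extracted programmatically; a small sketch, treating "No" as the positive class as above:

# confusion matrix as a table, then the four cells by name
cm <- table(y_true = Default$default, y_pred = pred)
TP <- cm["No", "No"]    # actual No, predicted No
TN <- cm["Yes", "Yes"]  # actual Yes, predicted Yes
FP <- cm["Yes", "No"]   # actual Yes, predicted No
FN <- cm["No", "Yes"]   # actual No, predicted Yes
c(TP = TP, TN = TN, FP = FP, FN = FN)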

Confusion Matrix Using the MLmetrics Package

# loading the required package
library(MLmetrics)
# making confusion matrix
ConfusionMatrix(y_pred = predDefault, y_true = Default$default)
      y_pred
y_true   No  Yes
   No  9627   40
   Yes  228  105

Again, with “No” as the positive class:

true positives (TP): = 9627

true negatives (TN): = 105

false positives (FP): = 228

false negatives (FN): = 40

Accuracy

Accuracy is defined as


\( Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

true negatives (TN): the number of correct predictions that an instance is negative.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Computing Accuracy Using the MLmetrics Package

# computing accuracy
Accuracy(y_pred = predDefault, y_true = Default$default)
[1] 0.9732
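
As a cross-check, the same accuracy follows directly from the confusion-matrix cells:

# accuracy from the confusion matrix: (TP + TN) / total
(9627 + 105) / (9627 + 105 + 228 + 40)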

Sensitivity and Specificity


Sensitivity

\( Sensitivity = \frac{TP}{TP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Specificity

\( Specificity = \frac{TN}{TN + FP} \)

where

true negatives (TN): the number of correct predictions that an instance is negative.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

Sensitivity and Specificity Using the MLmetrics Package

# computing Sensitivity (with positive = NULL, "No" is treated as the positive class here)
Sensitivity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
# computing Specificity
Specificity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.3153153
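
These values match the ratios of the confusion-matrix cells, with "No" as the positive class:

# Sensitivity = TP / (TP + FN)
9627 / (9627 + 40)
# Specificity = TN / (TN + FP)
105 / (105 + 228)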

Precision and Recall


Precision

\( Precision = \frac{TP}{TP + FP} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false positives (FP): the number of incorrect predictions that an instance is positive. (Also known as a “Type I error.”)

Recall

\( Recall = \frac{TP}{TP + FN} \)

where

true positives (TP): the number of correct predictions that an instance is positive.

false negatives (FN): the number of incorrect predictions that an instance is negative. (Also known as a “Type II error.”)

Precision and Recall Using the MLmetrics Package

# computing Precision
Precision(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9768645
# computing Recall
Recall(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
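
Again, these agree with the confusion-matrix cells:

# Precision = TP / (TP + FP)
9627 / (9627 + 228)
# Recall = TP / (TP + FN)
9627 / (9627 + 40)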