Sameer Mathur
Machine Learning Metrics Using library(MLmetrics) and data(Default) from library(ISLR)
---
Default Dataset
Loading the Default Data into R
# loading required package
library(ISLR)
# loading default dataset
data("Default")
attach(Default)
# first few rows of the dataset
head(Default)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
# some descriptive statistics of the dataset
library(psych)
describe(Default)[,c(1:5,8:9)]
vars n mean sd median min max
default* 1 10000 1.03 0.18 1.00 1.00 2.00
student* 2 10000 1.29 0.46 1.00 1.00 2.00
balance 3 10000 835.37 483.71 823.64 0.00 2654.32
income 4 10000 33516.98 13336.64 34552.64 771.97 73554.23
# structure of the dataset
str(Default)
'data.frame': 10000 obs. of 4 variables:
$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
$ balance: num 730 817 1074 529 786 ...
$ income : num 44362 12106 31767 35704 38463 ...
We apply the glm() function to a formula that models default as a function of student, balance, and income.
This fits a generalized linear model (GLM) in the binomial family, i.e., a logistic regression.
# fitting logistic regression model
logitReg <- glm(default ~ student + balance + income,
data = Default,
family = binomial)
# summary of the model
summary(logitReg)
Call:
glm(formula = default ~ student + balance + income, family = binomial,
data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
# log-odds ratios
cbind(LogOddsRatio = coef(logitReg))
LogOddsRatio
(Intercept) -1.086905e+01
studentYes -6.467758e-01
balance 5.736505e-03
income 3.033450e-06
# odds ratios
exp(cbind(OddsRatio = coef(logitReg)))
OddsRatio
(Intercept) 1.903854e-05
studentYes 5.237317e-01
balance 1.005753e+00
income 1.000003e+00
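To attach a measure of uncertainty to these odds ratios, the Wald confidence intervals for the coefficients can be exponentiated as well; a minimal sketch using base R's confint.default():
# 95% Wald confidence intervals on the odds-ratio scale
exp(cbind(OddsRatio = coef(logitReg), confint.default(logitReg)))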
# writing out the fitted equation for a student (student = "Yes") with average balance and average income
eqn <- (-10.87 - 0.6467758*1) + (0.005736505*mean(Default$balance)) + (0.00000303345*mean(Default$income))
eqn
[1] -6.622972
# calculating the probability
exp(eqn)/(1+exp(eqn))
[1] 0.001327709
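Equivalently, base R's plogis() applies the same inverse-logit transformation:
# the same probability via the inverse-logit function
plogis(eqn)
[1] 0.001327709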
# predicted probabilities from the fitted model
ProbPred <- predict(logitReg, Default, type = "response")
# classifying each observation with a 0.5 probability cutoff
predDefault <- ifelse(logitReg$fitted.values < 0.5, "No", "Yes")
# creating new dataframe with predicted values
pred_DF2 <- data.frame(default,ProbPred,predDefault)
# some rows for the predicted probabilities
head(pred_DF2,n=12)
default ProbPred predDefault
1 No 1.428724e-03 No
2 No 1.122204e-03 No
3 No 9.812272e-03 No
4 No 4.415893e-04 No
5 No 1.935506e-03 No
6 No 1.989518e-03 No
7 No 2.333767e-03 No
8 No 1.086718e-03 No
9 No 1.638333e-02 No
10 No 2.080617e-05 No
11 No 1.065494e-05 No
12 No 1.127658e-02 No
Predicting Probability for Average Balance and Average Income for student = “Yes”
# creating a single-row data frame of predictor values
newdata <- with(Default, data.frame(balance = mean(balance),
income = mean(income),
student = "Yes"))
# predicting probability
PredProb <- predict(logitReg, newdata , type = "response")
PredProb
1
0.001328976
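This differs slightly from the hand-calculated 0.001327709 above because the intercept was rounded to -10.87 there. As a cross-check, recomputing the log-odds from the unrounded coefficients (a minimal sketch, reusing the fitted logitReg object) reproduces the predicted probability:
# recomputing the log-odds with full-precision coefficients
b <- coef(logitReg)
eta <- b["(Intercept)"] + b["studentYes"] +
  b["balance"] * mean(Default$balance) +
  b["income"] * mean(Default$income)
# inverse-logit of the log-odds; matches PredProb above
plogis(eta)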
true positives (TP): the number of cases correctly predicted as positive.
true negatives (TN): the number of cases correctly predicted as negative.
false positives (FP): the number of cases incorrectly predicted as positive when they are actually negative (also known as a “Type I error”).
false negatives (FN): the number of cases incorrectly predicted as negative when they are actually positive (also known as a “Type II error”).
# creating the confusion matrix from the predicted classes
table(y_true = Default$default, y_pred = predDefault)
y_pred
y_true No Yes
No 9627 40
Yes 228 105
Reading off the cells, with the first factor level, “No”, treated as the positive class (the MLmetrics functions below behave the same way when positive = NULL):
true positives (TP) = 9627
true negatives (TN) = 105
false positives (FP) = 228
false negatives (FN) = 40
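These counts can also be extracted programmatically; a small sketch, again treating “No” as the positive class:
# counting the four confusion-matrix cells directly
TP <- sum(predDefault == "No"  & Default$default == "No")
TN <- sum(predDefault == "Yes" & Default$default == "Yes")
FP <- sum(predDefault == "No"  & Default$default == "Yes")
FN <- sum(predDefault == "Yes" & Default$default == "No")
c(TP = TP, TN = TN, FP = FP, FN = FN)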
MLmetrics Package
# loading the required package
library(MLmetrics)
# making confusion matrix
ConfusionMatrix(y_pred = predDefault, y_true = Default$default)
y_pred
y_true No Yes
No 9627 40
Yes 228 105
\( Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \)
where
true positives (TP): the number of cases correctly predicted as positive.
true negatives (TN): the number of cases correctly predicted as negative.
false positives (FP): the number of cases incorrectly predicted as positive (also known as a “Type I error”).
false negatives (FN): the number of cases incorrectly predicted as negative (also known as a “Type II error”).
MLmetrics Package
# computing accuracy
Accuracy(y_pred = predDefault, y_true = Default$default)
[1] 0.9732
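Because accuracy is simply the fraction of predictions that agree with the true labels (it does not depend on which class is treated as positive), it can be verified directly:
# accuracy as the fraction of correct predictions
mean(predDefault == Default$default)
[1] 0.9732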
\( Sensitivity = \frac{TP}{TP + FN} \)
where
true positives (TP): the number of cases correctly predicted as positive.
false negatives (FN): the number of cases incorrectly predicted as negative (also known as a “Type II error”).
\( Specificity = \frac{TN}{TN + FP} \)
where
true negatives (TN): the number of cases correctly predicted as negative.
false positives (FP): the number of cases incorrectly predicted as positive (also known as a “Type I error”).
MLmetrics Package
# computing Sensitivity
Sensitivity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
# computing Specificity
Specificity(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.3153153
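Both values can be verified from the confusion-matrix cells above, with “No” as the positive class:
# sensitivity = TP / (TP + FN)
9627 / (9627 + 40)
[1] 0.9958622
# specificity = TN / (TN + FP)
105 / (105 + 228)
[1] 0.3153153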
\( Precision = \frac{TP}{TP + FP} \)
where
true positives (TP): the number of cases correctly predicted as positive.
false positives (FP): the number of cases incorrectly predicted as positive (also known as a “Type I error”).
\( Recall = \frac{TP}{TP + FN} \)
where
true positives (TP): the number of cases correctly predicted as positive.
false negatives (FN): the number of cases incorrectly predicted as negative (also known as a “Type II error”).
MLmetrics Package
# computing Precision
Precision(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9768645
# computing Recall
Recall(y_pred = predDefault, y_true = Default$default, positive = NULL)
[1] 0.9958622
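Precision and recall are often summarized by the F1 score, their harmonic mean. MLmetrics provides F1_Score() for this; a minimal sketch with the same positive = NULL default as above:
# F1 score = 2 * Precision * Recall / (Precision + Recall)
F1_Score(y_pred = predDefault, y_true = Default$default, positive = NULL)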