Image Source: medium.com
Exercise: Fit a logistic regression model to the training data (training_dat) using function glm() from the stats package, and name the resulting model as LR_CRX. Print a summary of your model with summary(LR_CRX). Then, perform classification of the test data (test_predictors) using function predict.glm(), applying the 0.5 probability threshold to get a binary classification result. Compute classification accuracy (the fraction of correct classifications) and the classification error (the fraction of misclassifications).
LR_CRX <- glm(data=training_dat, formula = Approved~.,family = binomial)
summary(LR_CRX)
##
## Call:
## glm(formula = Approved ~ ., family = binomial, data = training_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0651 -0.7611 -0.5933 0.7382 2.0219
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2297037 0.3584100 -3.431 0.000601 ***
## Age -0.0096094 0.0104315 -0.921 0.356953
## MonthlyExpenses 0.0334341 0.0237634 1.407 0.159440
## YearsEmployed 0.2401305 0.0482561 4.976 6.49e-07 ***
## CreditScore 0.2968467 0.0461484 6.432 1.26e-10 ***
## MonthlyIncome -0.0007063 0.0006965 -1.014 0.310534
## AccountBalance 0.0007475 0.0001862 4.015 5.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 722.69 on 522 degrees of freedom
## Residual deviance: 507.77 on 516 degrees of freedom
## AIC: 521.77
##
## Number of Fisher Scoring iterations: 8
test_pred <- predict.glm(LR_CRX,newdata = test_predictors, type='response')
test_pred_labels <- ifelse(test_pred - 0.5>0,'+','-')
accuracy = sum(test_pred_labels == test_class_labels) / length(test_class_labels)
class_error = 1 - accuracy
# class_error = sum(test_pred_labels != test_class_labels) / length(test_class)
paste('Test Accuracy of model ', round(accuracy,5))
## [1] "Test Accuracy of model 0.78462"
paste('Test Error of model ', round(class_error,5))
## [1] "Test Error of model 0.21538"
Age, MonthlyExpenses and
MonthlyIncome are not significant for the model.
Exercise: Repeat the previous exercise using precisely the same training and test data sets, but now using only variables YearsEmployed, CreditScore, and AccountBalance as predictors in the logistic regression model.
LR_CRX <- glm(data=training_dat, formula = Approved~YearsEmployed+CreditScore+AccountBalance,family = binomial)
summary(LR_CRX)
##
## Call:
## glm(formula = Approved ~ YearsEmployed + CreditScore + AccountBalance,
## family = binomial, data = training_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0604 -0.7445 -0.6415 0.7583 1.8382
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4855064 0.1552672 -9.567 < 2e-16 ***
## YearsEmployed 0.2289114 0.0452678 5.057 4.26e-07 ***
## CreditScore 0.3031317 0.0458008 6.618 3.63e-11 ***
## AccountBalance 0.0007439 0.0001847 4.028 5.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 722.69 on 522 degrees of freedom
## Residual deviance: 511.98 on 519 degrees of freedom
## AIC: 519.98
##
## Number of Fisher Scoring iterations: 7
test_pred <- predict.glm(LR_CRX,newdata = test_predictors, type='response')
test_pred_labels <- ifelse(test_pred - 0.5>0,'+','-')
accuracy = sum(test_pred_labels == test_class_labels) / length(test_class_labels)
class_error = 1 - accuracy
# class_error = sum(test_pred_labels != test_class_labels) / length(test_class_labels)
paste('Test Accuracy of model ', round(accuracy,5))
## [1] "Test Accuracy of model 0.77692"
paste('Test Error of model ', round(class_error,5))
## [1] "Test Error of model 0.22308"
Exercise: Using the model coefficients from the previous exercise, i.e., β0, β1, β2 and β3, associated with the intercept and with predictors YearsEmployed, CreditScore, and AccountBalance, respectively:
compute the probability that a person that has been employed for 5 years, with a credit score of 6, and an account balance of 10000, will get his/her credit application approved (class positive, +). Do the calculation analytically first, using Eq. (2), then check your result by computing the same probability using predict.glm();
Repeat item (a) above, but now for a person with only 1/2 year employed, no account balance (zero), and a credit score of 3;
for a given person, according to our model how much will the log-odds — Eq. (7) — of credit approval increase or decrease if the person stays 1 additional year employed?
\[P(Y = +1\|X = x) = 1/(1+e^-f(x)\]
fn_manual_predict <- function(YearsEmployed=5, CreditScore=6,AccountBalance=10000){
df_check <- data.frame(YearsEmployed=YearsEmployed,
CreditScore=CreditScore,
AccountBalance=AccountBalance)
# coef(LR_CRX)
f_df_check <- coef(LR_CRX)[1] + sum(coef(LR_CRX)[-1]*df_check)
P_df_check <- 1/(1+exp(-f_df_check))
cat(paste('\n\n***Approval Probability for a person who has been employed for', df_check[1],
' years, with a credit score of ', df_check[2],
', and an account balance of ', df_check[3]))
cat('\nP(Aprroved = +) by using equation ', P_df_check)
P_predict <- predict.glm(LR_CRX,newdata = df_check, type = 'response')
cat('\nP(Approved = 1) by predict function ', P_predict)
}
fn_manual_predict() # default parameters
##
##
## ***Approval Probability for a person who has been employed for 5 years, with a credit score of 6 , and an account balance of 10000
## P(Aprroved = +) by using equation 0.999866
## P(Approved = 1) by predict function 0.999866
fn_manual_predict(0.5,3,0)
##
##
## ***Approval Probability for a person who has been employed for 0.5 years, with a credit score of 3 , and an account balance of 0
## P(Aprroved = +) by using equation 0.3865932
## P(Approved = 1) by predict function 0.3865932
fn_manual_predict(0.25,-2,0)
##
##
## ***Approval Probability for a person who has been employed for 0.25 years, with a credit score of -2 , and an account balance of 0
## P(Aprroved = +) by using equation 0.1156237
## P(Approved = 1) by predict function 0.1156237
cat('\nCoefficient for YearsEmployed:' ,LR_CRX$coefficients['YearsEmployed'])
##
## Coefficient for YearsEmployed: 0.2289114
For a given person, the logit of credit approval increases 0.2289114 for each additional year employed.
Let’s consider once again the data set iris from base R (datasets package), but now in its original form,with all its four predictors. We will randomly split the data set into a training set (with 80% of the observations) and a test set (with the remaining 20% of the observations)
Exercise: Instead of randomly splitting the data into a training and
a test set only once, repeat this
procedure 10 times (use a for loop like for(i in 1:10){# your code}) and
compute the mean and the
standard deviation of the classification accuracy over the 10 test
sets.
library(class)
set.seed(8)
no_obs <- dim(iris)[1] # No. of observations (150)
accuracy <- list()
for (i in 1:10){
# 20% data records for test:
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE)
test_predictors <- iris[test_index, c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width")]
test_class <- iris[test_index, "Species"]
# 80% data records for training:
training_predictors <- iris[-test_index, c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width")]
training_class <- iris[-test_index, "Species"]
Pred_class <- knn(train=training_predictors, test=test_predictors,
cl=training_class, k=1)
confusion_matrix <- table(Pred_class, test_class)
# (TP + TN) / all
accuracy <- append(accuracy,sum(diag(confusion_matrix))/sum(confusion_matrix))
}
# flatten the list
accuracy <- unlist(accuracy)
summary(accuracy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9333 0.9667 0.9667 0.9700 0.9917 1.0000
cat('Over 10 times, mean of accuracy is', mean(accuracy),
' with standard deviation ', sd(accuracy))
## Over 10 times, mean of accuracy is 0.97 with standard deviation 0.02459549
confusion_matrix <- as.data.frame(confusion_matrix)
ggplot(data=confusion_matrix, aes(y=test_class, x=Pred_class, fill=Freq))+
geom_tile(show.legend = FALSE)+
scale_y_discrete(limits=rev)+
geom_text(aes(label = Freq), color='white')+
ggtitle('Confusion Matrix of the last model')
There are some positive false or negative false from the last model
Exercise: Compute the Euclidean distance from observation 78 to all observations in the iris dataset, using all 4 predictors. Sort the distances to identify the 7 nearest neighbours to observation 78. Look at the class labels of these 7 nearest neighbours. Manually compute the probabilities that observation 78 belong to each class, for k = 1, 3, 5 and 7, and classify this observation based on these probabilities for each k. Repeat, but now using function knn(), and compare the results. Hints:you can use dist() from the stats package (the result can be converted into a distance matrix using as.matrix()) and order() from base R to make things easier. Do not include the class labels (“Species”) when you call dist(), distances in KNN must be computed using the predictors only!
fn_predict_knn <- function(predictors,labels,k=1, target=1){
# Euclidean distance from the #(target)th obs to all neighbors
data_dist <- as.matrix(dist(predictors, method = 'euclidean'))
dist_from_target <- data_dist[target,]
#sort by distance, take k nearest indices, exclude itself (index 1)
nearest_idx <- order(dist_from_target)
nearest_idx <- nearest_idx[2:(k+1)]
#count for each classification, take proportions
classify_counts <- prop.table(table(labels[nearest_idx]))
# classify_counts <- table(labels[nearest_idx])
#return the highest freq
return(classify_counts[which.max(classify_counts)])
}
# exclude column #5 Species
iris_predictors <- iris[,-5]
iris_labels <- iris[,5]
k=c(1,3,5,7,9,11)
target <- 78
for (i in k) {
cat('\n\n *** Prediction for the ', target,'th observation, with k =',i,'***\n')
cat('\n==> Using Euclidean distance, predicted and proability: \n')
print(fn_predict_knn(predictors=iris_predictors,labels=iris_labels,k=i,target=target))
pred_knn <- knn(train=iris[-target,-5], test=iris[target,-5],
cl=iris[-target,5], k=i)
cat('\n==> Predicted by knn() function: ',levels(pred_knn)[pred_knn])
cat('\n\n+++ True Species: ', levels(iris$Species)[iris[target,5]])
cat('\n****************')
}
##
##
## *** Prediction for the 78 th observation, with k = 1 ***
##
## ==> Using Euclidean distance, predicted and proability:
## versicolor
## 1
##
## ==> Predicted by knn() function: versicolor
##
## +++ True Species: versicolor
## ****************
##
## *** Prediction for the 78 th observation, with k = 3 ***
##
## ==> Using Euclidean distance, predicted and proability:
## versicolor
## 0.6666667
##
## ==> Predicted by knn() function: versicolor
##
## +++ True Species: versicolor
## ****************
##
## *** Prediction for the 78 th observation, with k = 5 ***
##
## ==> Using Euclidean distance, predicted and proability:
## versicolor
## 0.6
##
## ==> Predicted by knn() function: versicolor
##
## +++ True Species: versicolor
## ****************
##
## *** Prediction for the 78 th observation, with k = 7 ***
##
## ==> Using Euclidean distance, predicted and proability:
## virginica
## 0.5714286
##
## ==> Predicted by knn() function: virginica
##
## +++ True Species: versicolor
## ****************
##
## *** Prediction for the 78 th observation, with k = 9 ***
##
## ==> Using Euclidean distance, predicted and proability:
## virginica
## 0.5555556
##
## ==> Predicted by knn() function: virginica
##
## +++ True Species: versicolor
## ****************
##
## *** Prediction for the 78 th observation, with k = 11 ***
##
## ==> Using Euclidean distance, predicted and proability:
## virginica
## 0.5454545
##
## ==> Predicted by knn() function: virginica
##
## +++ True Species: versicolor
## ****************
The prediction by using knn function or computing the Euclidean distance are the same.
We can see that when k is small enough (1,3,5), the predicted species is correct. Increasing k, will reduce overfitting risk but the model might underfitting or unreliable ^^