Image Source: medium.com

Week 4, Exercise

Exercise: Fit a logistic regression model to the training data (training_dat) using function glm() from the stats package, and name the resulting model as LR_CRX. Print a summary of your model with summary(LR_CRX). Then, perform classification of the test data (test_predictors) using function predict.glm(), applying the 0.5 probability threshold to get a binary classification result. Compute classification accuracy (the fraction of correct classifications) and the classification error (the fraction of misclassifications).

LR_CRX <- glm(data=training_dat, formula = Approved~.,family = binomial)
summary(LR_CRX)
## 
## Call:
## glm(formula = Approved ~ ., family = binomial, data = training_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0651  -0.7611  -0.5933   0.7382   2.0219  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.2297037  0.3584100  -3.431 0.000601 ***
## Age             -0.0096094  0.0104315  -0.921 0.356953    
## MonthlyExpenses  0.0334341  0.0237634   1.407 0.159440    
## YearsEmployed    0.2401305  0.0482561   4.976 6.49e-07 ***
## CreditScore      0.2968467  0.0461484   6.432 1.26e-10 ***
## MonthlyIncome   -0.0007063  0.0006965  -1.014 0.310534    
## AccountBalance   0.0007475  0.0001862   4.015 5.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 722.69  on 522  degrees of freedom
## Residual deviance: 507.77  on 516  degrees of freedom
## AIC: 521.77
## 
## Number of Fisher Scoring iterations: 8
test_pred <- predict.glm(LR_CRX,newdata = test_predictors, type='response')
test_pred_labels <- ifelse(test_pred - 0.5>0,'+','-')

accuracy = sum(test_pred_labels == test_class_labels) / length(test_class_labels)
class_error = 1 - accuracy
# class_error = sum(test_pred_labels != test_class_labels) / length(test_class)
paste('Test Accuracy of model ', round(accuracy,5))
## [1] "Test Accuracy of model  0.78462"
paste('Test Error of model ', round(class_error,5))
## [1] "Test Error of model  0.21538"

Age, MonthlyExpenses and MonthlyIncome are not significant for the model.

Exercise: Repeat the previous exercise using precisely the same training and test data sets, but now using only variables YearsEmployed, CreditScore, and AccountBalance as predictors in the logistic regression model.

LR_CRX <- glm(data=training_dat, formula = Approved~YearsEmployed+CreditScore+AccountBalance,family = binomial)
summary(LR_CRX)
## 
## Call:
## glm(formula = Approved ~ YearsEmployed + CreditScore + AccountBalance, 
##     family = binomial, data = training_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0604  -0.7445  -0.6415   0.7583   1.8382  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.4855064  0.1552672  -9.567  < 2e-16 ***
## YearsEmployed   0.2289114  0.0452678   5.057 4.26e-07 ***
## CreditScore     0.3031317  0.0458008   6.618 3.63e-11 ***
## AccountBalance  0.0007439  0.0001847   4.028 5.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 722.69  on 522  degrees of freedom
## Residual deviance: 511.98  on 519  degrees of freedom
## AIC: 519.98
## 
## Number of Fisher Scoring iterations: 7
test_pred <- predict.glm(LR_CRX,newdata = test_predictors, type='response')
test_pred_labels <- ifelse(test_pred - 0.5>0,'+','-')

accuracy = sum(test_pred_labels == test_class_labels) / length(test_class_labels)
class_error = 1 - accuracy
# class_error = sum(test_pred_labels != test_class_labels) / length(test_class_labels)
paste('Test Accuracy of model ', round(accuracy,5))
## [1] "Test Accuracy of model  0.77692"
paste('Test Error of model ', round(class_error,5))
## [1] "Test Error of model  0.22308"

Exercise: Using the model coefficients from the previous exercise, i.e., β0, β1, β2 and β3, associated with the intercept and with predictors YearsEmployed, CreditScore, and AccountBalance, respectively:

  1. compute the probability that a person that has been employed for 5 years, with a credit score of 6, and an account balance of 10000, will get his/her credit application approved (class positive, +). Do the calculation analytically first, using Eq. (2), then check your result by computing the same probability using predict.glm();

  2. Repeat item (a) above, but now for a person with only 1/2 year employed, no account balance (zero), and a credit score of 3;

  3. for a given person, according to our model how much will the log-odds — Eq. (7) — of credit approval increase or decrease if the person stays 1 additional year employed?

\[P(Y = +1\|X = x) = 1/(1+e^-f(x)\]

fn_manual_predict <- function(YearsEmployed=5, CreditScore=6,AccountBalance=10000){
      df_check <- data.frame(YearsEmployed=YearsEmployed,
                             CreditScore=CreditScore,
                             AccountBalance=AccountBalance)
      # coef(LR_CRX) 
      f_df_check <- coef(LR_CRX)[1] + sum(coef(LR_CRX)[-1]*df_check)
      P_df_check <- 1/(1+exp(-f_df_check))
      cat(paste('\n\n***Approval Probability for a person who has been employed for', df_check[1],
            ' years, with a credit score of ', df_check[2],
            ', and an account balance of ', df_check[3]))
      cat('\nP(Aprroved = +) by using equation ', P_df_check)
      P_predict <- predict.glm(LR_CRX,newdata = df_check, type = 'response')
      cat('\nP(Approved = 1) by predict function ', P_predict)
}

fn_manual_predict() # default parameters
## 
## 
## ***Approval Probability for a person who has been employed for 5  years, with a credit score of  6 , and an account balance of  10000
## P(Aprroved = +) by using equation  0.999866
## P(Approved = 1) by predict function  0.999866
fn_manual_predict(0.5,3,0)
## 
## 
## ***Approval Probability for a person who has been employed for 0.5  years, with a credit score of  3 , and an account balance of  0
## P(Aprroved = +) by using equation  0.3865932
## P(Approved = 1) by predict function  0.3865932
fn_manual_predict(0.25,-2,0)
## 
## 
## ***Approval Probability for a person who has been employed for 0.25  years, with a credit score of  -2 , and an account balance of  0
## P(Aprroved = +) by using equation  0.1156237
## P(Approved = 1) by predict function  0.1156237
cat('\nCoefficient for YearsEmployed:' ,LR_CRX$coefficients['YearsEmployed'])
## 
## Coefficient for YearsEmployed: 0.2289114

For a given person, the logit of credit approval increases 0.2289114 for each additional year employed.

K-Nearest Neighbours (KNN)

Let’s consider once again the data set iris from base R (datasets package), but now in its original form,with all its four predictors. We will randomly split the data set into a training set (with 80% of the observations) and a test set (with the remaining 20% of the observations)

Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this
procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the
standard deviation of the classification accuracy over the 10 test sets.

library(class)
set.seed(8)
no_obs <- dim(iris)[1] # No. of observations (150)

accuracy <- list()
for (i in 1:10){
      # 20% data records for test:
      test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE)
      test_predictors <- iris[test_index, c("Sepal.Length", "Sepal.Width",
      "Petal.Length", "Petal.Width")]
      test_class <- iris[test_index, "Species"]
      # 80% data records for training:
      training_predictors <- iris[-test_index, c("Sepal.Length", "Sepal.Width",
      "Petal.Length", "Petal.Width")]
      training_class <- iris[-test_index, "Species"]
      Pred_class <- knn(train=training_predictors, test=test_predictors,
                        cl=training_class, k=1)
      
      confusion_matrix <- table(Pred_class, test_class)
      # (TP + TN) / all
      accuracy <- append(accuracy,sum(diag(confusion_matrix))/sum(confusion_matrix))
}
# flatten the list
accuracy <- unlist(accuracy)
summary(accuracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9333  0.9667  0.9667  0.9700  0.9917  1.0000
cat('Over 10 times, mean of accuracy is', mean(accuracy),
    ' with standard deviation ', sd(accuracy))
## Over 10 times, mean of accuracy is 0.97  with standard deviation  0.02459549
confusion_matrix <- as.data.frame(confusion_matrix)
ggplot(data=confusion_matrix, aes(y=test_class, x=Pred_class, fill=Freq))+
      geom_tile(show.legend = FALSE)+
      scale_y_discrete(limits=rev)+
      geom_text(aes(label = Freq), color='white')+
      ggtitle('Confusion Matrix of the last model')

There are some positive false or negative false from the last model

Exercise: Compute the Euclidean distance from observation 78 to all observations in the iris dataset, using all 4 predictors. Sort the distances to identify the 7 nearest neighbours to observation 78. Look at the class labels of these 7 nearest neighbours. Manually compute the probabilities that observation 78 belong to each class, for k = 1, 3, 5 and 7, and classify this observation based on these probabilities for each k. Repeat, but now using function knn(), and compare the results. Hints:you can use dist() from the stats package (the result can be converted into a distance matrix using as.matrix()) and order() from base R to make things easier. Do not include the class labels (“Species”) when you call dist(), distances in KNN must be computed using the predictors only!

fn_predict_knn <- function(predictors,labels,k=1, target=1){
      # Euclidean distance from the #(target)th obs to all neighbors
      data_dist <- as.matrix(dist(predictors, method = 'euclidean'))
      dist_from_target <- data_dist[target,]
      #sort by distance, take k nearest indices, exclude itself (index 1)
      nearest_idx <- order(dist_from_target)
      nearest_idx <- nearest_idx[2:(k+1)]
      #count for each classification, take proportions
      classify_counts <- prop.table(table(labels[nearest_idx]))
      # classify_counts <- table(labels[nearest_idx])
      #return the highest freq
      return(classify_counts[which.max(classify_counts)])
}

# exclude column #5 Species
iris_predictors <- iris[,-5]
iris_labels <- iris[,5]

k=c(1,3,5,7,9,11)
target <- 78

for (i in k) {
      cat('\n\n *** Prediction for the ', target,'th observation, with k =',i,'***\n')
      cat('\n==> Using Euclidean distance, predicted and proability: \n')
      print(fn_predict_knn(predictors=iris_predictors,labels=iris_labels,k=i,target=target))
      
      pred_knn <- knn(train=iris[-target,-5], test=iris[target,-5],
                        cl=iris[-target,5], k=i)
      
      cat('\n==> Predicted by knn() function: ',levels(pred_knn)[pred_knn])
      cat('\n\n+++ True Species: ', levels(iris$Species)[iris[target,5]])
      cat('\n****************')
}
## 
## 
##  *** Prediction for the  78 th observation, with k = 1 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## versicolor 
##          1 
## 
## ==> Predicted by knn() function:  versicolor
## 
## +++ True Species:  versicolor
## ****************
## 
##  *** Prediction for the  78 th observation, with k = 3 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## versicolor 
##  0.6666667 
## 
## ==> Predicted by knn() function:  versicolor
## 
## +++ True Species:  versicolor
## ****************
## 
##  *** Prediction for the  78 th observation, with k = 5 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## versicolor 
##        0.6 
## 
## ==> Predicted by knn() function:  versicolor
## 
## +++ True Species:  versicolor
## ****************
## 
##  *** Prediction for the  78 th observation, with k = 7 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## virginica 
## 0.5714286 
## 
## ==> Predicted by knn() function:  virginica
## 
## +++ True Species:  versicolor
## ****************
## 
##  *** Prediction for the  78 th observation, with k = 9 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## virginica 
## 0.5555556 
## 
## ==> Predicted by knn() function:  virginica
## 
## +++ True Species:  versicolor
## ****************
## 
##  *** Prediction for the  78 th observation, with k = 11 ***
## 
## ==> Using Euclidean distance, predicted and proability: 
## virginica 
## 0.5454545 
## 
## ==> Predicted by knn() function:  virginica
## 
## +++ True Species:  versicolor
## ****************

The prediction by using knn function or computing the Euclidean distance are the same.

We can see that when k is small enough (1,3,5), the predicted species is correct. Increasing k, will reduce overfitting risk but the model might underfitting or unreliable ^^