Classification Models

Purpose:

Learning Outcomes measured in this assignment: LO1 to LO8
Content knowledge you’ll gain from doing this assignment: Test/Train split, data prep, building classification models using the logistic regression, LDA, and KNN algorithms, and communicating your results to non-technical and technical audiences.
Decision/why?: Explain your reasoning behind your choice of the procedure, set of variables and such for the question.
- Explain why you use the procedure/model/variable
- To exceed this criterion, describe steps taken to implement the procedure in a non technical way.
Communication of your findings: Explain your results in terms of training MSE, testing MSE, and prediction of the variable Y
- Explain why you think one model is better than the other.
- To exceed this criterion, explain your model and how it predicts the variable of interest in a non technical way.

The Data:

The data set has the following variables:

Administrative: Number of pages visited by the visitor about account management
Administrative_Duration: Total amount of time (in seconds) spent by the visitor on account management related pages
Informational: Number of pages visited by the visitor about Web site, communication and address information of the shopping site
Informational_Duration: Total amount of time (in seconds) spent by the visitor on informational pages
ProductRelated: Number of pages visited by visitor about product related pages
ProductRelated_Duration: Total amount of time (in seconds) spent by the visitor on product related pages
BounceRates: Average bounce rate value of the pages visited by the visitor
ExitRates: Average exit rate value of the pages visited by the visitor
PageValues: Average page value of the pages visited by the visitor

The following variables are non-numeric (categorical) values:

OperatingSystems: Operating system of the visitor
Browser: Browser of the visitor
Region: Geographic region from which the session has been started by the visitor
TrafficType: Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS)
VisitorType: Visitor type: New Visitor, Returning Visitor, and Other
Weekend: Boolean value indicating whether the date of the visit is weekend
Revenue: Class label indicating whether the visit has been finalized with a transaction (1) or not (0)
The variable of interest: Revenue

Source: Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018).

shoppers = read.table("https://unh.box.com/shared/static/ohhlu1ee64z0aed11ccazmeibzxpi717.csv", header=TRUE, sep=",")

head(shoppers)

##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                0.000000  0.20000000 0.2000000          0
## 2              2               64.000000  0.00000000 0.1000000          0
## 3              1                0.000000  0.20000000 0.2000000          0
## 4              2                2.666667  0.05000000 0.1400000          0
## 5             10              627.500000  0.02000000 0.0500000          0
## 6             19              154.216667  0.01578947 0.0245614          0
##   OperatingSystems Browser Region TrafficType       VisitorType Weekend Revenue
## 1                1       1      1           1 Returning_Visitor   FALSE       0
## 2                2       2      1           2 Returning_Visitor   FALSE       0
## 3                4       1      9           3 Returning_Visitor   FALSE       0
## 4                3       2      2           4 Returning_Visitor   FALSE       0
## 5                3       3      1           4 Returning_Visitor    TRUE       0
## 6                2       2      1           3 Returning_Visitor   FALSE       0

Part 1: Data Prep (10 points)

Check for existence of NA’s (missing data) (Hint: check if complete.cases(df) keeps the same number of rows as the original df)

The dataset consists of feature vectors belonging to 12,330 sessions.

sum(is.na(shoppers))

## [1] 0
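The hint above suggests comparing row counts rather than summing NA values. A minimal sketch of that check, run on a small made-up data frame (toy_df is hypothetical and only stands in for shoppers):

```r
# Hypothetical toy data frame standing in for shoppers; one row has an NA.
toy_df <- data.frame(
  pages   = c(1, 2, NA, 4),
  revenue = c(0, 1, 0, 0)
)

# complete.cases() returns TRUE for rows that have no missing values.
n_total    <- nrow(toy_df)
n_complete <- sum(complete.cases(toy_df))

# The data contains missing values exactly when the two counts differ.
has_missing <- n_complete < n_total

# Per-column NA counts show where the gaps are.
na_per_column <- colSums(is.na(toy_df))
```

Applied to shoppers, equal counts would agree with the sum(is.na(shoppers)) result of 0 shown above.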

Test/training separation: Separate your data into test/train sets. (testing set should be no less than 10% and no more than 30% of the entire data set)

library(caTools)
set.seed(123)
sh_split = sample.split(shoppers$Revenue, SplitRatio=.75)
train_sh = subset(shoppers, sh_split == TRUE)
test_sh = subset(shoppers, sh_split == FALSE)

c(nrow(shoppers), nrow(train_sh), nrow(test_sh))

## [1] 12330  9247  3083

Part 2: Logistic Regression (25 points)

Develop a classification model where the variable Revenue is the dependent variable using the Logistic Regression method, 5 predictors and your training data set.

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-3

log_classifier = glm(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
                    family = binomial,
                    data = train_sh)
summary(log_classifier)

## 
## Call:
## glm(formula = Revenue ~ ProductRelated + ProductRelated_Duration + 
##     BounceRates + Administrative + Administrative_Duration, family = binomial, 
##     data = train_sh)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8693  -0.6297  -0.5874  -0.2029   4.0858  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.675e+00  4.398e-02 -38.084  < 2e-16 ***
## ProductRelated           1.923e-03  1.177e-03   1.634   0.1023    
## ProductRelated_Duration  6.938e-05  2.856e-05   2.429   0.0151 *  
## BounceRates             -3.337e+01  2.957e+00 -11.285  < 2e-16 ***
## Administrative           4.563e-02  1.052e-02   4.339 1.43e-05 ***
## Administrative_Duration  7.950e-05  1.838e-04   0.432   0.6654    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7968.4  on 9246  degrees of freedom
## Residual deviance: 7395.8  on 9241  degrees of freedom
## AIC: 7407.8
## 
## Number of Fisher Scoring iterations: 7

Predict the Revenue using testing data. Obtain the confusion matrix and compute the testing error rate based on the logistic regression classification.

Prediction in Probabilities

probPred = predict(log_classifier, type = 'response', newdata = test_sh[,-16])
head(probPred, 5)

##            2            4            5            8           11 
## 0.1588904168 0.0342517735 0.0928443856 0.0002477675 0.1622447846
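To make the link between the coefficient table and these probabilities concrete, the sketch below recomputes the prediction for session 2 by hand. It assumes the rounded coefficients printed by summary(log_classifier) and the predictor values shown for row 2 in head(shoppers); the names beta, x, log_odds and prob are illustrative only.

```r
# Coefficients copied (rounded) from the summary(log_classifier) output.
beta <- c(
  intercept               = -1.675,
  ProductRelated          = 1.923e-03,
  ProductRelated_Duration = 6.938e-05,
  BounceRates             = -33.37,
  Administrative          = 4.563e-02,
  Administrative_Duration = 7.950e-05
)

# Predictor values for row 2 as shown by head(shoppers), in the same order
# as beta; the leading 1 multiplies the intercept.
x <- c(1, 2, 64, 0, 0, 0)

# Linear predictor (log-odds), then the logistic function.
log_odds <- sum(beta * x)
prob <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))
```

Up to rounding of the coefficients, prob should land close to the 0.1589 reported for session 2. Because it is below the 0.5 cutoff, the session is classified as no purchase.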

Prediction in Actual Values

log_pred = ifelse(probPred>.5, 1, 0)

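Confusion Matrix (rows are actual values, columns are predictions): 2599 (class 0) and 7 (class 1) correct predictions; 470 purchases incorrectly predicted as no purchase and 7 non-purchases incorrectly predicted as purchase.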

log_cm = table(test_sh[, 16], log_pred)
log_cm

##    log_pred
##        0    1
##   0 2599    7
##   1  470    7

Computing Testing Error Rate: about 84.5% correct predictions

log_error = mean(test_sh[, 16]!= log_pred)
log_error

## [1] 0.1547194
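The overall error rate hides how the two classes are treated. As a rough sketch, the block below re-enters the confusion matrix printed above by hand and derives the error rate, the per-class rates, and the error of a rule that always predicts the majority class. The object names (cm, sensitivity, and so on) are illustrative and not part of the original analysis.

```r
# Confusion matrix entered by hand from the output above
# (rows = actual Revenue, columns = predicted Revenue).
cm <- matrix(c(2599, 470, 7, 7), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of misclassified sessions.
error_rate <- 1 - sum(diag(cm)) / sum(cm)

# Share of actual purchases that the model flagged as purchases.
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of actual non-purchases that the model flagged as non-purchases.
specificity <- cm["0", "0"] / sum(cm["0", ])

# Error of always predicting the majority class (no purchase).
baseline_error <- min(rowSums(cm)) / sum(cm)
```

Comparing error_rate with baseline_error is one way to judge whether the model adds anything beyond the class imbalance.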

Hint: If your testing data and training data happens to have different levels of a variable, say TrafficType, you can remove that from the data set using the following code: test = test[!(test$TrafficType==25),] This code will remove TrafficType=25 from your testing set.

Hint 2: Alternatively, you can use a stratified sampling strategy for your test/train split. See Week 1 lecture notes/R code.
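A stratified split samples within each group so that both sets keep the original proportions. The following is a minimal base R sketch on made-up data (toy, train_idx, toy_train and toy_test are hypothetical names); it is not the Week 1 code, and the grouping column could equally be a factor such as TrafficType.

```r
# Hypothetical toy data standing in for shoppers: 80 non-buyers, 20 buyers.
set.seed(123)
toy <- data.frame(id = 1:100, Revenue = rep(c(0, 1), times = c(80, 20)))

# Sample 75% of the row indices within each class separately,
# so both sets keep the original class proportions.
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Revenue), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each set, for comparison.
prop_train <- mean(toy_train$Revenue)
prop_test  <- mean(toy_test$Revenue)
```

In this toy example both sets end up with the same 20% share of buyers as the full data.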

Do you believe logistic regression did a good job here? Why/Why not?

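Yes, it did a fairly decent job by overall error rate: it misclassified about 15.5% of the test sessions. However, 477 of the 3,083 test sessions (15.5%) ended in a purchase, so a rule that always predicts no purchase would reach the same error rate, and the model correctly identified only 7 of the 477 buyers.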

Visualizing Test Result

#library(ElemStatLearn)
#set = test_sh
#X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.1)
#X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.1)
#grid_set = expand.grid(X1, X2)
#colnames(grid_set) = c('ProductRelated', 'ProductRelated_Duration', 'BounceRates', #'Administrative', 'Administrative_Duration')
#prob_set = predict(log_classifier, type = 'response', newdata = grid_set)
#y_grid = ifelse(prob_set > 0.5, 1, 0)
#plot(set[, -3],
#     main = 'Logistic Regression (Test set)',
#    xlab = 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'Administrative', #'Administrative_Duration', ylab = 'Revenue',
#     xlim = range(X1), ylim = range(X2))
#contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
#points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
#points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Part 3: KNN (25 points)

Apply a KNN classification to the training data using the same set of variables as in Part 2. Did you run into any issues? If so, how did you solve them?

library(class)
knn_train = train_sh[, c(1,2,5,6,7)]
knn_test = test_sh[, c(1,2,5,6,7)]

knn_trainLabel = train_sh$Revenue
knn_testLabel = test_sh$Revenue
knn_classifier = knn(knn_train, 
                     knn_test, 
                     cl = knn_trainLabel, 
                     k=5)

Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the KNN classification.

plot(knn_classifier)

Confusion Matrix

knn_cm = table(knn_classifier, knn_testLabel)

Test error rate

knn_error = mean(knn_classifier!=knn_testLabel)
knn_error

## [1] 0.1832631

Do you believe knn classifier did a good job here? Why/Why not?

KNN was not as good as Logistic Regression. The error rate is about 18.3%, compared to 15.5% for the logistic regression model.
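One issue that commonly affects KNN is that predictors on very different scales (durations in seconds versus rates between 0 and 1) let the large-scale variables dominate the distance. The sketch below shows one possible remedy, standardizing with the training set's means and standard deviations, on made-up data. The toy_* objects are hypothetical, and this step was not part of the analysis above, so the reported error rate does not reflect it.

```r
library(class)  # provides knn()

# Hypothetical toy data: one feature in seconds (large range), one rate (small range).
set.seed(1)
toy_train <- data.frame(duration = c(10, 2000, 50, 1800, 30, 2200),
                        bounce   = c(0.20, 0.01, 0.18, 0.02, 0.15, 0.00))
toy_test  <- data.frame(duration = c(40, 1900), bounce = c(0.19, 0.01))
toy_label <- factor(c(0, 1, 0, 1, 0, 1))

# Standardize with the TRAINING means and standard deviations only,
# then apply the same transformation to the test set.
train_scaled <- scale(toy_train)
test_scaled  <- scale(toy_test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

# Classify the test rows by majority vote among the 3 nearest training rows.
toy_pred <- knn(train_scaled, test_scaled, cl = toy_label, k = 3)
```

Reusing the training centers and scales for the test set keeps information from the test data out of the preprocessing.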

Part 4: LDA (25 points)

Apply a LDA classification to the training data using the same set of variables as in Part 2.

library(MASS)
lda_model = lda(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
                data = train_sh)
lda_model

## Call:
## lda(Revenue ~ ProductRelated + ProductRelated_Duration + BounceRates + 
##     Administrative + Administrative_Duration, data = train_sh)
## 
## Prior probabilities of groups:
##         0         1 
## 0.8452471 0.1547529 
## 
## Group means:
##   ProductRelated ProductRelated_Duration BounceRates Administrative
## 0       28.66824                1072.185 0.026121006       2.059493
## 1       47.14675                1838.834 0.004980754       3.320056
##   Administrative_Duration
## 0                72.69565
## 1               119.62421
## 
## Coefficients of linear discriminants:
##                                   LD1
## ProductRelated           5.736195e-03
## ProductRelated_Duration  1.369778e-04
## BounceRates             -1.152448e+01
## Administrative           1.027339e-01
## Administrative_Duration  1.774148e-04

plot(lda_model)

Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the LDA classification.

ldaPred = predict(lda_model, newdata = test_sh)

Confusion Matrix

lda_cm = table(test_sh$Revenue, ldaPred$class)
lda_cm

##    
##        0    1
##   0 2589   17
##   1  468    9

Testing error rate

lda_error = mean(test_sh$Revenue!= ldaPred$class)
lda_error

## [1] 0.1573143

Do you believe LDA classifier did a good job here? Why/Why not?

The error rate is about 15.7%, which is similar to that of the Logistic Regression model. It’s our second best model based on prediction accuracy.
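To put the three models side by side, the sketch below collects the test error rates reported in Parts 2 to 4 and ranks them. It also compares how many of the 477 actual purchases in the test set were correctly identified, using the two confusion matrices that were printed (the KNN matrix was not printed, so it is left out). The vector names are illustrative.

```r
# Test error rates copied from the outputs in Parts 2 to 4.
test_error <- c(logistic = 0.1547194, knn = 0.1832631, lda = 0.1573143)

# Lowest error first.
ranking <- sort(test_error)

# Purchases correctly identified, read from the bottom-right cell of the
# logistic regression and LDA confusion matrices.
buyers_found <- c(logistic = 7, lda = 9)

# Share of the 477 actual purchases in the test set that each model caught.
recall <- buyers_found / 477
```

On overall error the order is logistic regression, LDA, then KNN, while LDA catches slightly more of the buyers than logistic regression does.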

Part 5: Communication of your results

Pick one of the techniques from parts 2, 3, or 4, i.e, Logistic Regression, KNN, LDA, and explain the process to a 10 year old with no experience at all with statistics.

In Logistic Regression, we’re basically trying to predict whether something happens or not based on some other things or factors. For example, we can predict whether it will rain or not based on the month of the year, how cloudy the sky has been recently, how cool the temperature is, and so on. We make this decision based on probabilities: if we’re more than 50% sure that it’s going to rain, we say that our prediction is that it will rain; if we’re 50% sure or less, we say it won’t rain.

However, in this case we’re trying to predict whether somebody buys a product on a website, based on their behavior on the website (the pages they visited, for how long, etc).

Pick one of the techniques from parts 2, 3, or 4, i.e, Logistic Regression, KNN, LDA. This time explain what you did to your boss, i.e., technical audience.

In KNN, we start from the categories present in the response variable (Buy or Not Buy). When a new data point is added, how do we decide which category it belongs to?

The first step is choosing the number K of neighbors for the model (here K = 5). The model then finds the K nearest neighbors of the new data point according to a distance measure such as the Euclidean, Manhattan, or Minkowski distance (the knn function from the class package uses Euclidean distance). Among these neighbors, it counts the number of data points in each category. For instance, if the new data point has 3 neighbors in the buy category and 2 neighbors in the not buy category, the model assigns the new data point to the category with the most neighbors (in this case, buy).
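The voting procedure described above can be sketched in a few lines of base R. This is an illustrative from-scratch version on made-up sessions, not the class::knn implementation used in Part 3; knn_vote, toy_x and toy_y are hypothetical names.

```r
# Classify one new point by majority vote among its k nearest training points.
knn_vote <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training row.
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points.
  nearest <- train_y[order(dists)[seq_len(k)]]
  # The most frequent label among them wins.
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Hypothetical toy sessions: (pages viewed, minutes on site) and outcome.
toy_x <- data.frame(pages   = c(1, 2, 2, 20, 22, 25, 3),
                    minutes = c(1, 1, 2, 15, 18, 20, 2))
toy_y <- c("not buy", "not buy", "not buy", "buy", "buy", "buy", "not buy")

# The 5 nearest neighbors of this point are 3 "buy" and 2 "not buy" sessions.
result <- knn_vote(toy_x, toy_y, new_x = c(18, 14), k = 5)
```

With 3 of the 5 nearest neighbors in the buy category, result is "buy", mirroring the 3-versus-2 example in the text.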