Purpose:

The Data:

The data set has the following variables:

The following variables are non-numeric:

Source: Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018).

shoppers = read.table("https://unh.box.com/shared/static/ohhlu1ee64z0aed11ccazmeibzxpi717.csv", header=TRUE, sep=",")

head(shoppers)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                0.000000  0.20000000 0.2000000          0
## 2              2               64.000000  0.00000000 0.1000000          0
## 3              1                0.000000  0.20000000 0.2000000          0
## 4              2                2.666667  0.05000000 0.1400000          0
## 5             10              627.500000  0.02000000 0.0500000          0
## 6             19              154.216667  0.01578947 0.0245614          0
##   OperatingSystems Browser Region TrafficType       VisitorType Weekend Revenue
## 1                1       1      1           1 Returning_Visitor   FALSE       0
## 2                2       2      1           2 Returning_Visitor   FALSE       0
## 3                4       1      9           3 Returning_Visitor   FALSE       0
## 4                3       2      2           4 Returning_Visitor   FALSE       0
## 5                3       3      1           4 Returning_Visitor    TRUE       0
## 6                2       2      1           3 Returning_Visitor   FALSE       0

Part 1: Data Prep (10 points)

  1. Check for the existence of NAs (missing data). (Hint: check whether complete.cases(df) has the same number of rows as the original df.)

The dataset consists of feature vectors belonging to 12,330 sessions.

sum(is.na(shoppers))
## [1] 0
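
The row-count comparison suggested in the hint can be sketched as follows. The data frame `toy` is a hypothetical stand-in with one missing value, not the shoppers data; the same checks would apply to `shoppers`.

```r
# Hypothetical toy data frame with one missing value (not the shoppers data)
toy <- data.frame(x = c(1, 2, NA, 4), y = c("a", "b", "c", "d"))

# complete.cases() returns TRUE for rows that contain no NA
n_complete <- sum(complete.cases(toy))

# If the counts differ, at least one row has missing data
has_missing <- n_complete != nrow(toy)

# Per-column NA counts show where the gaps are
na_by_column <- colSums(is.na(toy))
na_by_column
```

For a data frame without missing values, `n_complete` equals `nrow()` and every per-column count is zero, which agrees with the `sum(is.na(shoppers))` result above.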
  2. Test/training separation: Separate your data into test/train sets. (testing set should be no less than 10% and no more than 30% of the entire data set)
library(caTools)
set.seed(123)
sh_split = sample.split(shoppers$Revenue, SplitRatio=.75)
train_sh = subset(shoppers, sh_split == TRUE)
test_sh = subset(shoppers, sh_split == FALSE)

c(nrow(shoppers), nrow(train_sh), nrow(test_sh))
## [1] 12330  9247  3083

Part 2: Logistic Regression (25 points)

  1. Develop a classification model where the variable Revenue is the dependent variable using the Logistic Regression method, 5 predictors and your training data set.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-3
log_classifier = glm(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
                    family = binomial,
                    data = train_sh)
summary(log_classifier)
## 
## Call:
## glm(formula = Revenue ~ ProductRelated + ProductRelated_Duration + 
##     BounceRates + Administrative + Administrative_Duration, family = binomial, 
##     data = train_sh)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8693  -0.6297  -0.5874  -0.2029   4.0858  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.675e+00  4.398e-02 -38.084  < 2e-16 ***
## ProductRelated           1.923e-03  1.177e-03   1.634   0.1023    
## ProductRelated_Duration  6.938e-05  2.856e-05   2.429   0.0151 *  
## BounceRates             -3.337e+01  2.957e+00 -11.285  < 2e-16 ***
## Administrative           4.563e-02  1.052e-02   4.339 1.43e-05 ***
## Administrative_Duration  7.950e-05  1.838e-04   0.432   0.6654    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7968.4  on 9246  degrees of freedom
## Residual deviance: 7395.8  on 9241  degrees of freedom
## AIC: 7407.8
## 
## Number of Fisher Scoring iterations: 7
  2. Predict the Revenue using testing data. Obtain the confusion matrix and compute the testing error rate based on the logistic regression classification.

Prediction in Probabilities

probPred = predict(log_classifier, type = 'response', newdata = test_sh[,-16])
head(probPred, 5)
##            2            4            5            8           11 
## 0.1588904168 0.0342517735 0.0928443856 0.0002477675 0.1622447846
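
To illustrate where these probabilities come from, the sketch below recomputes the value for row 2 by hand. It assumes the rounded coefficients printed by `summary(log_classifier)` and the predictor values of row 2 shown in `head(shoppers)`; because the coefficients are rounded, the result only approximately matches the 0.1589 reported above.

```r
# Coefficients copied (rounded) from the summary() output
beta <- c(intercept = -1.675, ProductRelated = 1.923e-03,
          ProductRelated_Duration = 6.938e-05, BounceRates = -33.37,
          Administrative = 4.563e-02, Administrative_Duration = 7.950e-05)

# Row 2 of head(shoppers), in the same order, with a leading 1 for the intercept:
# ProductRelated = 2, ProductRelated_Duration = 64, all other predictors = 0
x <- c(1, 2, 64, 0, 0, 0)

# Linear predictor (log-odds), then the logistic function to get a probability
log_odds <- sum(beta * x)
p_row2 <- plogis(log_odds)
p_row2
```

`predict(..., type = 'response')` performs the same two steps for every row of the test set.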

Prediction in Actual Values

log_pred = ifelse(probPred>.5, 1, 0)

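Confusion Matrix (rows are actual values, columns are predictions): 2599 (class 0) and 7 (class 1) correct predictions; 470 buyers were incorrectly predicted as 0 and 7 non-buyers were incorrectly predicted as 1.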

log_cm = table(test_sh[, 16], log_pred)
log_cm
##    log_pred
##        0    1
##   0 2599    7
##   1  470    7

Computing Testing Error Rate: about 84.5% correct predictions

log_error = mean(test_sh[, 16]!= log_pred)
log_error
## [1] 0.1547194

Hint: If your testing data and training data happens to have different levels of a variable, say TrafficType, you can remove that from the data set using the following code: test = test[!(test$TrafficType==25),] This code will remove TrafficType=25 from your testing set.

Hint 2: Alternatively, you can use a stratified sampling strategy for your test/train split. See Week 1 lecture notes/R code.
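
A stratified split of the kind mentioned in Hint 2 can be sketched in base R as below. This is a minimal illustration on hypothetical toy data, not the code from the lecture notes; it stratifies on `Revenue`, but the same pattern works for any grouping column such as `TrafficType`.

```r
set.seed(123)

# Hypothetical toy data: 100 sessions, 20% with Revenue = 1
toy <- data.frame(id = 1:100, Revenue = rep(c(0, 1), times = c(80, 20)))

# Within each Revenue class, draw 75% of the row indices for training
idx_by_class <- split(seq_len(nrow(toy)), toy$Revenue)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both subsets keep the original share of Revenue = 1
c(train = mean(toy_train$Revenue), test = mean(toy_test$Revenue))
```

Sampling within each group keeps every group represented in both subsets in roughly the original proportion, which reduces the chance that a level appears in the test set only.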

  3. Do you believe logistic regression did a good job here? Why/Why not?

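Judged by the error rate alone, it did a fairly decent job: about 15.5% of the test predictions were incorrect. That figure should be read with care, though. The confusion matrix shows that only 7 of the 477 actual buyers were identified, and always predicting 0 would give the same 477/3083 ≈ 15.5% error rate.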
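
The error rate can be put in context with a few per-class measures. The sketch below re-enters the logistic regression confusion matrix from the output above by hand (rows are actual values, columns are predictions); the variable names are illustrative only.

```r
# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm <- matrix(c(2599, 470, 7, 7), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

error_rate  <- 1 - sum(diag(cm)) / sum(cm)    # share of wrong predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of buyers correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of non-buyers correctly flagged
baseline    <- sum(cm["1", ]) / sum(cm)       # error rate of always predicting 0

round(c(error = error_rate, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 4)
```

Comparing `error_rate` with `baseline` shows how much the model improves on always predicting the majority class, and `sensitivity` shows how many buyers it actually finds.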

Visualizing Test Result

#library(ElemStatLearn)
#set = test_sh
#X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.1)
#X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.1)
#grid_set = expand.grid(X1, X2)
#colnames(grid_set) = c('ProductRelated', 'ProductRelated_Duration', 'BounceRates', #'Administrative', 'Administrative_Duration')
#prob_set = predict(log_classifier, type = 'response', newdata = grid_set)
#y_grid = ifelse(prob_set > 0.5, 1, 0)
#plot(set[, -3],
#     main = 'Logistic Regression (Test set)',
#    xlab = 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'Administrative', #'Administrative_Duration', ylab = 'Revenue',
#     xlim = range(X1), ylim = range(X2))
#contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
#points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
#points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Part 3: KNN (25 points)

  1. Apply a KNN classification to the training data using the same set of variables as in Part 2. Did you run into any issues? If so, how did you solve them?
library(class)
knn_train = train_sh[, c(1,2,5,6,7)]
knn_test = test_sh[, c(1,2,5,6,7)]

knn_trainLabel = train_sh$Revenue
knn_testLabel = test_sh$Revenue
knn_classifier = knn(knn_train, 
                     knn_test, 
                     cl = knn_trainLabel, 
                     k=5)
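
One issue that commonly arises with KNN on variables like these is scale: durations are measured in seconds and can run into the thousands, while bounce rates lie between 0 and 1, so the distance is dominated by the durations. The sketch below shows one possible remedy, standardizing with the training set's means and standard deviations. It uses hypothetical toy data and was not run on the shoppers data, so whether it changes the error rate here is untested.

```r
library(class)
set.seed(123)

# Hypothetical toy data: one feature in seconds (large scale), one rate (small scale)
toy_train <- data.frame(duration = c(10, 2000, 50, 1800, 30, 2200),
                        bounce   = c(0.20, 0.01, 0.18, 0.02, 0.15, 0.00))
toy_test  <- data.frame(duration = c(40, 1900),
                        bounce   = c(0.19, 0.01))
toy_label <- factor(c(0, 1, 0, 1, 0, 1))

# Standardize with the training means and standard deviations only,
# then apply the same transformation to the test set
ctr <- colMeans(toy_train)
scl <- apply(toy_train, 2, sd)
train_scaled <- scale(toy_train, center = ctr, scale = scl)
test_scaled  <- scale(toy_test,  center = ctr, scale = scl)

toy_pred <- knn(train_scaled, test_scaled, cl = toy_label, k = 3)
toy_pred
```

Reusing the training statistics for the test set keeps information from the test data out of the preprocessing step.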
  2. Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the KNN classification.
plot(knn_classifier)

Confusion Matrix

knn_cm = table(knn_classifier, knn_testLabel)

Test error rate

knn_error = mean(knn_classifier!=knn_testLabel)
knn_error
## [1] 0.1832631
  3. Do you believe knn classifier did a good job here? Why/Why not?

KNN was not as good as Logistic Regression. Its error rate is about 18.3%, compared with about 15.5% for the logistic regression model.

Part 4: LDA (25 points)

  1. Apply a LDA classification to the training data using the same set of variables as in Part 2.
library(MASS)
lda_model = lda(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
                data = train_sh)
lda_model
## Call:
## lda(Revenue ~ ProductRelated + ProductRelated_Duration + BounceRates + 
##     Administrative + Administrative_Duration, data = train_sh)
## 
## Prior probabilities of groups:
##         0         1 
## 0.8452471 0.1547529 
## 
## Group means:
##   ProductRelated ProductRelated_Duration BounceRates Administrative
## 0       28.66824                1072.185 0.026121006       2.059493
## 1       47.14675                1838.834 0.004980754       3.320056
##   Administrative_Duration
## 0                72.69565
## 1               119.62421
## 
## Coefficients of linear discriminants:
##                                   LD1
## ProductRelated           5.736195e-03
## ProductRelated_Duration  1.369778e-04
## BounceRates             -1.152448e+01
## Administrative           1.027339e-01
## Administrative_Duration  1.774148e-04
plot(lda_model)

  2. Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the LDA classification.
ldaPred = predict(lda_model, newdata = test_sh)

Confusion Matrix

lda_cm = table(test_sh$Revenue, ldaPred$class)
lda_cm
##    
##        0    1
##   0 2589   17
##   1  468    9

Testing error rate

lda_error = mean(test_sh$Revenue!= ldaPred$class)
lda_error
## [1] 0.1573143
  3. Do you believe LDA classifier did a good job here? Why/Why not?

The error rate is about 15.7%, which is similar to the Logistic Regression model. It’s our second best model based on prediction accuracy score.

Part 5: Communication of your results

  1. Pick one of the techniques from parts 2, 3, or 4, i.e., Logistic Regression, KNN, LDA, and explain the process to a 10 year old with no experience at all with statistics.

In Logistic Regression, we’re basically trying to predict whether something happens or not based on some other things or factors. For example, we can predict whether it will rain or not based on the month of the year, how cloudy the sky has been recently, how cool the temperature is, and so on. We make this decision based on probabilities: if we’re more than 50% sure that it’s gonna rain, we say that our prediction is that it will rain; otherwise, we say it won’t rain.

However, in this case we’re trying to predict whether somebody buys a product on a website, based on their behavior on the website (the pages they visited, for how long, etc).

  2. Pick one of the techniques from parts 2, 3, or 4, i.e., Logistic Regression, KNN, LDA. This time explain what you did to your boss, i.e., technical audience.

In KNN, we start from the categories present in the response variable (Buy or Not Buy). The question is: when a new data point is added, how do we decide which category it belongs to?

The first step is choosing the number K of neighbors for the model (5 is a common choice). The model then takes the K nearest neighbors of the new data point according to a distance measure such as the Euclidean, Manhattan, or Minkowski distance. Among these neighbors, it counts the number of data points in each category. For instance, if the new data point has 3 neighbors in the buy category and 2 in the not buy category, the model assigns it to the category with the most neighbors (in this case, buy).
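
The voting procedure described above can be sketched from scratch as follows. The helper `knn_one` and the toy points are hypothetical and only illustrate the idea with Euclidean distance; the analysis in Part 3 used `knn()` from the `class` package rather than this function.

```r
# Hypothetical helper: classify one new point by majority vote of its k nearest neighbors
knn_one <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(d)[seq_len(k)]]
  # Most frequent label among them
  names(which.max(table(nearest)))
}

# Toy data: three "buy" and three "not buy" sessions in two dimensions
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2), c(6, 6), c(7, 7))
train_y <- c("buy", "buy", "buy", "not buy", "not buy", "not buy")

# The 5 nearest neighbors of (1.5, 1.5) are 3 "buy" and 2 "not buy" points
vote <- knn_one(train_x, train_y, new_x = c(1.5, 1.5), k = 5)
vote
```

With these toy points the vote is 3 to 2 in favor of buy, mirroring the example in the text.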