Learning Outcomes measured in this assignment: LO1 to LO8
Content knowledge you’ll gain from doing this assignment: test/train split, data preparation, building a classification model using the logistic regression, LDA, and KNN algorithms, and communicating your results to non-technical and technical audiences.
Decision/why?: Explain the reasoning behind your choice of procedure, set of variables, and so on for the question.
Communication of your findings: Explain your results in terms of the training error rate, testing error rate, and prediction of the variable Y.
The data set has the following variables:
The following variables are categorical rather than numeric:
OperatingSystems: Operating system of the visitor
Browser: Browser of the visitor
Region: Geographic region from which the session has been started by the visitor
TrafficType: Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS)
VisitorType: Visitor type: New Visitor, Returning Visitor, and Other
Weekend: Boolean value indicating whether the date of the visit is weekend
Revenue: Class label indicating whether the visit has been finalized with a transaction (1) or not (0)
The variable of interest: Revenue
Source: Sakar, C.O., Polat, S.O., Katircioglu, M., et al. Neural Computing & Applications (2018).
shoppers = read.table("https://unh.box.com/shared/static/ohhlu1ee64z0aed11ccazmeibzxpi717.csv", header=TRUE, sep=",")
head(shoppers)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.000000 0.20000000 0.2000000 0
## 2 2 64.000000 0.00000000 0.1000000 0
## 3 1 0.000000 0.20000000 0.2000000 0
## 4 2 2.666667 0.05000000 0.1400000 0
## 5 10 627.500000 0.02000000 0.0500000 0
## 6 19 154.216667 0.01578947 0.0245614 0
## OperatingSystems Browser Region TrafficType VisitorType Weekend Revenue
## 1 1 1 1 1 Returning_Visitor FALSE 0
## 2 2 2 1 2 Returning_Visitor FALSE 0
## 3 4 1 9 3 Returning_Visitor FALSE 0
## 4 3 2 2 4 Returning_Visitor FALSE 0
## 5 3 3 1 4 Returning_Visitor TRUE 0
## 6 2 2 1 3 Returning_Visitor FALSE 0
The dataset consists of feature vectors belonging to 12,330 sessions.
sum(is.na(shoppers))
## [1] 0
library(caTools)
set.seed(123)
# sample.split stratifies on Revenue, preserving the class ratio in both subsets
sh_split = sample.split(shoppers$Revenue, SplitRatio=.75)
train_sh = subset(shoppers, sh_split == TRUE)
test_sh = subset(shoppers, sh_split == FALSE)
c(nrow(shoppers), nrow(train_sh), nrow(test_sh))
## [1] 12330 9247 3083
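Note that 9247 / 12330 ≈ 0.75 and 3083 / 12330 ≈ 0.25, so the split matches the requested ratio.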
Revenue is the dependent variable. Fit a logistic regression model using 5 predictors and your training data set.
library(glmnet)  # note: glm() below comes from base R; glmnet itself is not used here
## Loading required package: Matrix
## Loaded glmnet 4.1-3
log_classifier = glm(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
family = binomial,
data = train_sh)
summary(log_classifier)
##
## Call:
## glm(formula = Revenue ~ ProductRelated + ProductRelated_Duration +
## BounceRates + Administrative + Administrative_Duration, family = binomial,
## data = train_sh)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8693 -0.6297 -0.5874 -0.2029 4.0858
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.675e+00 4.398e-02 -38.084 < 2e-16 ***
## ProductRelated 1.923e-03 1.177e-03 1.634 0.1023
## ProductRelated_Duration 6.938e-05 2.856e-05 2.429 0.0151 *
## BounceRates -3.337e+01 2.957e+00 -11.285 < 2e-16 ***
## Administrative 4.563e-02 1.052e-02 4.339 1.43e-05 ***
## Administrative_Duration 7.950e-05 1.838e-04 0.432 0.6654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7968.4 on 9246 degrees of freedom
## Residual deviance: 7395.8 on 9241 degrees of freedom
## AIC: 7407.8
##
## Number of Fisher Scoring iterations: 7
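The coefficients are on the log-odds scale, so exponentiating them makes them easier to read as odds ratios. For example, exp(4.563e-02) ≈ 1.047 for Administrative: each additional administrative page visited multiplies the odds of a purchase by roughly 1.05.
exp(coef(log_classifier))  # coefficients expressed as odds ratios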
Predict Revenue using the testing data. Obtain the confusion matrix and compute the testing error rate based on the logistic regression classification.
Prediction in Probabilities
probPred = predict(log_classifier, type = 'response', newdata = test_sh[,-16])  # drop column 16 (Revenue)
head(probPred, 5)
## 2 4 5 8 11
## 0.1588904168 0.0342517735 0.0928443856 0.0002477675 0.1622447846
Prediction in Actual Values
log_pred = ifelse(probPred > .5, 1, 0)  # predict a purchase when the probability exceeds 0.5
Confusion Matrix: 2599 correct predictions for class 0 (no purchase) and 7 correct predictions for class 1 (purchase), with 7 false positives and 470 false negatives.
log_cm = table(test_sh[, 16], log_pred)  # rows = true class, columns = predicted class
log_cm
## log_pred
## 0 1
## 0 2599 7
## 1 470 7
Computing the Testing Error Rate: about 84.5% correct predictions
log_error = mean(test_sh[, 16]!= log_pred)
log_error
## [1] 0.1547194
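As a check, the same error rate can be recovered from the confusion matrix as the off-diagonal counts over the total number of test observations: (7 + 470) / 3083 ≈ 0.1547.
(log_cm[1, 2] + log_cm[2, 1]) / sum(log_cm)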
Hint: If your testing data and training data happen to have different levels of a variable, say TrafficType, you can remove those rows from your testing set using the following code: test = test[!(test$TrafficType==25),]. This code removes the TrafficType = 25 sessions from your testing set.
Hint 2: Alternatively, you can use a stratified sampling strategy for your test/train split. See Week 1 lecture notes/R code.
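A minimal sketch of such a stratified split in base R (sample.split from caTools already stratifies on the class label; this version just makes the per-class sampling explicit):
# Sample 75% of each Revenue class separately, then combine the indices
idx0 = which(shoppers$Revenue == 0)
idx1 = which(shoppers$Revenue == 1)
train_idx = c(sample(idx0, floor(0.75 * length(idx0))),
              sample(idx1, floor(0.75 * length(idx1))))
strat_train = shoppers[train_idx, ]
strat_test = shoppers[-train_idx, ]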
Yes, the model did a fairly decent job: the testing error rate shows that about 15.5% of its predictions were incorrect.
Visualizing the Test Result
# A decision boundary can only be drawn in two dimensions, so a two-predictor
# model would have to be refit just for plotting; the original five-predictor
# grid cannot be visualized this way. Left commented out, as in the template.
#vis_model = glm(Revenue ~ ProductRelated_Duration + BounceRates,
#                family = binomial, data = train_sh)
#set = test_sh[, c('ProductRelated_Duration', 'BounceRates', 'Revenue')]
#X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, length.out = 200)
#X2 = seq(min(set[, 2]) - 0.01, max(set[, 2]) + 0.01, length.out = 200)
#grid_set = expand.grid(ProductRelated_Duration = X1, BounceRates = X2)
#prob_set = predict(vis_model, type = 'response', newdata = grid_set)
#y_grid = ifelse(prob_set > 0.5, 1, 0)
#plot(set[, 1:2],
#     main = 'Logistic Regression (Test set)',
#     xlab = 'ProductRelated_Duration', ylab = 'BounceRates',
#     xlim = range(X1), ylim = range(X2))
#contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
#points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
#points(set[, 1:2], pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
library(class)
# Use the same 5 predictors as the logistic model
# (columns 1, 2, 5, 6, 7: Administrative, Administrative_Duration,
# ProductRelated, ProductRelated_Duration, BounceRates)
knn_train = train_sh[, c(1,2,5,6,7)]
knn_test = test_sh[, c(1,2,5,6,7)]
knn_trainLabel = train_sh$Revenue
knn_testLabel = test_sh$Revenue
knn_classifier = knn(knn_train,
                     knn_test,
                     cl = knn_trainLabel,
                     k=5)
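A side note, not required by the assignment: KNN relies on raw distances, so features on very different scales (ProductRelated_Duration runs into the thousands while BounceRates stays within [0, 1]) can dominate the neighbor search. A minimal sketch of standardizing with training-set statistics before calling knn():
# Standardize with the training means/sds so no test-set information leaks in
train_means = colMeans(knn_train)
train_sds = apply(knn_train, 2, sd)
knn_train_sc = scale(knn_train, center = train_means, scale = train_sds)
knn_test_sc = scale(knn_test, center = train_means, scale = train_sds)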
Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the KNN classification.
plot(knn_classifier)
Confusion Matrix
knn_cm = table(knn_classifier, knn_testLabel)  # rows = predicted class, columns = true class
Test error rate
knn_error = mean(knn_classifier!=knn_testLabel)
knn_error
## [1] 0.1832631
KNN was not as good as Logistic Regression: its error rate is about 18.3%, compared to 15.5% for the logistic regression model.
library(MASS)
lda_model = lda(formula = Revenue~ProductRelated+ProductRelated_Duration+BounceRates+Administrative+Administrative_Duration,
data = train_sh)
lda_model
## Call:
## lda(Revenue ~ ProductRelated + ProductRelated_Duration + BounceRates +
## Administrative + Administrative_Duration, data = train_sh)
##
## Prior probabilities of groups:
## 0 1
## 0.8452471 0.1547529
##
## Group means:
## ProductRelated ProductRelated_Duration BounceRates Administrative
## 0 28.66824 1072.185 0.026121006 2.059493
## 1 47.14675 1838.834 0.004980754 3.320056
## Administrative_Duration
## 0 72.69565
## 1 119.62421
##
## Coefficients of linear discriminants:
## LD1
## ProductRelated 5.736195e-03
## ProductRelated_Duration 1.369778e-04
## BounceRates -1.152448e+01
## Administrative 1.027339e-01
## Administrative_Duration 1.774148e-04
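These coefficients define the single discriminant direction LD1. A hedged sketch of how each observation's LD1 score is formed, as the dot product of its predictor values with these coefficients (up to the centering that predict() applies internally):
preds = c('ProductRelated', 'ProductRelated_Duration', 'BounceRates',
          'Administrative', 'Administrative_Duration')
ld1 = as.matrix(train_sh[, preds]) %*% lda_model$scaling  # uncentered LD1 scores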
plot(lda_model)
Predict Revenue. Obtain the confusion matrix and compute the testing error rate based on the LDA classification.
ldaPred = predict(lda_model, newdata = test_sh)
Confusion Matrix
lda_cm = table(test_sh$Revenue, ldaPred$class)
lda_cm
##
## 0 1
## 0 2589 17
## 1 468 9
Testing error rate
lda_error = mean(test_sh$Revenue!= ldaPred$class)
lda_error
## [1] 0.1573143
The error rate is about 15.7%, which is similar to the Logistic Regression model; LDA is our second-best model based on prediction accuracy (Logistic 15.5% < LDA 15.7% < KNN 18.3%).
In Logistic Regression, we are trying to predict whether something happens or not based on other factors. For example, we can predict whether it will rain based on the month of the year, how cloudy the sky has been recently, how cool the temperature is, and so on. We make this decision based on probabilities: if we are at least 50% sure that it is going to rain, our prediction is that it will rain; if we are less than 50% sure, we say it will not rain.
However, in this case we’re trying to predict whether somebody buys a product on a website, based on their behavior on the website (the pages they visited, for how long, etc).
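Under the hood, the model turns a weighted sum of the predictors into a probability with the logistic function, p = 1 / (1 + exp(-(b0 + b1*x1 + ... + b5*x5))). A quick sanity check on the first test observation (this should match predict(..., type = 'response')):
eta = predict(log_classifier, newdata = test_sh[1, ], type = 'link')  # the linear predictor
1 / (1 + exp(-eta))  # applying the logistic transform recovers the predicted probability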
In KNN, we identify the categories present in the response variable (Buy or Not Buy). When a new data point arrives, how do we decide which category it belongs to?
The first step is choosing the number K of neighbors for the model (usually 5). The model then takes the K nearest neighbors of the new data point according to some distance measure (Euclidean, Manhattan, Minkowski, and so on). Among these neighbors, it counts the number of data points in each category: for instance, if the new data point has 3 neighbors in the buy category and 2 neighbors in the not-buy category, the model assigns the new data point to the category with the most neighbors (in this case, buy).
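A minimal sketch of that voting step, assuming Euclidean distance and using the first test row as the hypothetical "new" point:
# Euclidean distances from the new point to every training point
new_point = as.numeric(knn_test[1, ])
dists = sqrt(rowSums(sweep(as.matrix(knn_train), 2, new_point)^2))
nearest = order(dists)[1:5]             # indices of the 5 nearest neighbors
votes = table(knn_trainLabel[nearest])  # count neighbors in each class
names(which.max(votes))                 # majority vote = predicted class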