Homework RQ 2

R Markdown

###2.Research question 


#Research Question: "What factors influence the likelihood of online shoppers making a purchase?"

#Revenue = True: the visitor makes a purchase
#Revenue = False: the visitor does not make a purchase

mydata <- read.table("C:/Users/teodo/OneDrive/Desktop/Program R/Homework - Research Question 2/online shoppers purchasing intention dataset/online_shoppers_intention.csv", header = TRUE, sep=",", dec = ",")
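
#A minimal sketch of how the dec argument affects the column types, using a tiny made-up CSV (the values and the names csv_text, with_comma and with_point are illustrative assumptions, not part of the dataset). With dec = "," a value such as 0.2 is not recognised as a number, so the column is read as character; with dec = "." it is read as numeric.

```r
#tiny illustrative CSV; the values are made up
csv_text <- "BounceRates,PageValues\n0.2,12.5\n0.05,0"

#dec = "," : "0.2" is not a valid number here, so the column becomes character
with_comma <- read.table(text = csv_text, header = TRUE, sep = ",", dec = ",",
                         stringsAsFactors = FALSE)

#dec = "." : the same values are parsed as numeric
with_point <- read.table(text = csv_text, header = TRUE, sep = ",", dec = ".",
                         stringsAsFactors = FALSE)

sapply(with_comma, class) #column types with dec = ","
sapply(with_point, class) #column types with dec = "."
```

#Checking the column types right after import (for example with str(mydata)) shows early whether the decimal columns were read as numbers.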

###2.1 Data 

#The dataset used for the second research question of the homework comes from the *UCI Machine Learning Repository* and is named **Online Shoppers Purchasing Intention Dataset** [The Dataset](https://doi.org/10.24432/C5F88Q).

#The dataset describes an online bookstore that uses an e-commerce platform and the behavior of its online shoppers. It contains 12330 sessions collected over one year, which can help to predict a visitor's likelihood of making a purchase (Revenue = True). This information is useful for sellers who want to understand what attracts their customers and leads them to buy.

mydata$ID <- 1:nrow(mydata) #adding ID column to each row

head(mydata) #showing the first six rows

##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates   ExitRates PageValues
## 1              1                       0         0.2         0.2          0
## 2              2                      64           0         0.1          0
## 3              1                       0         0.2         0.2          0
## 4              2             2.666666667        0.05        0.14          0
## 5             10                   627.5        0.02        0.05          0
## 6             19             154.2166667 0.015789474 0.024561404          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType
## 1          0   Feb                1       1      1           1
## 2          0   Feb                2       2      1           2
## 3          0   Feb                4       1      9           3
## 4          0   Feb                3       2      2           4
## 5          0   Feb                3       3      1           4
## 6          0   Feb                2       2      1           3
##         VisitorType Weekend Revenue ID
## 1 Returning_Visitor   FALSE   FALSE  1
## 2 Returning_Visitor   FALSE   FALSE  2
## 3 Returning_Visitor   FALSE   FALSE  3
## 4 Returning_Visitor   FALSE   FALSE  4
## 5 Returning_Visitor    TRUE   FALSE  5
## 6 Returning_Visitor   FALSE   FALSE  6

tail(mydata) #showing the last six rows, we can see that the dataset has 12330 observations

##       Administrative Administrative_Duration Informational
## 12325              0                       0             1
## 12326              3                     145             0
## 12327              0                       0             0
## 12328              0                       0             0
## 12329              4                      75             0
## 12330              0                       0             0
##       Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 12325                      0             16                     503           0
## 12326                      0             53             1783.791667 0.007142857
## 12327                      0              5                  465.75           0
## 12328                      0              6                  184.25 0.083333333
## 12329                      0             15                     346           0
## 12330                      0              3                   21.25           0
##         ExitRates  PageValues SpecialDay Month OperatingSystems Browser Region
## 12325 0.037647059           0          0   Nov                2       2      1
## 12326 0.029030612 12.24171745          0   Dec                4       6      1
## 12327 0.021333333           0          0   Nov                3       2      1
## 12328 0.086666667           0          0   Nov                3       2      1
## 12329 0.021052632           0          0   Nov                2       2      3
## 12330 0.066666667           0          0   Nov                3       2      1
##       TrafficType       VisitorType Weekend Revenue    ID
## 12325           1 Returning_Visitor   FALSE   FALSE 12325
## 12326           1 Returning_Visitor    TRUE   FALSE 12326
## 12327           8 Returning_Visitor    TRUE   FALSE 12327
## 12328          13 Returning_Visitor    TRUE   FALSE 12328
## 12329          11 Returning_Visitor   FALSE   FALSE 12329
## 12330           2       New_Visitor    TRUE   FALSE 12330

#showing the number of columns, we can see that the dataset has 19 columns (one is the dependent variable - Revenue)
ncol(mydata)

## [1] 19

nrow(mydata) #showing the number of rows

## [1] 12330

#creating a vector named categorical_columns, that will contain the names of the columns in the data frame named mydata from the 11th to the 18th column

categorical_columns <- names(mydata)[11:18] 

for (col in categorical_columns) #for each column in the vector named categorical_columns do this 
                             
{
  #for each column find the unique values and print them as string in one line
  print(toString(unique(mydata[[col]]))) 
}

## [1] "Feb, Mar, May, Oct, June, Jul, Aug, Nov, Sep, Dec"
## [1] "1, 2, 4, 3, 7, 6, 8, 5"
## [1] "1, 2, 3, 4, 5, 6, 7, 10, 8, 9, 12, 13, 11"
## [1] "1, 9, 2, 3, 4, 5, 6, 7, 8"
## [1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 16, 17, 20"
## [1] "Returning_Visitor, New_Visitor, Other"
## [1] "FALSE, TRUE"
## [1] "FALSE, TRUE"
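
#The same listing of unique values can be sketched without an explicit loop. The data frame toy below is a made-up illustration (its column names mirror the dataset, its values do not come from it); sapply() returns one string of unique values per column.

```r
#illustrative data frame; the values are made up
toy <- data.frame(Month = c("Feb", "Mar", "Feb"),
                  Weekend = c(FALSE, TRUE, FALSE))

#one string of unique values per column, returned as a named character vector
unique_values <- sapply(toy, function(x) toString(unique(x)))
unique_values
```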

#is.na() is used to find missing values (NA) in the data frame named mydata; the function returns TRUE if the corresponding element is NA, and FALSE if not
#sum() is used to count the number of missing values; the result is 0, so there are no missing values
sum(is.na(mydata))

## [1] 0

#Duplicates can lead to inaccurate analyses, biased results, and misinterpretation of patterns.

#duplicated() flags rows that repeat an earlier row; sum() counts them.
#The result is 0, but note that the unique ID column added above makes every row distinct, so this check cannot detect duplicates among the original columns.
sum(duplicated(mydata))

## [1] 0
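
#A minimal sketch, on made-up data, of checking for duplicates while leaving the ID column out of the comparison (toy and its values are illustrative assumptions). Applied to mydata, the same idea would compare only the original columns.

```r
#illustrative data: rows 1 and 3 are identical apart from the ID
toy <- data.frame(PageValues = c(0, 5, 0),
                  Revenue = c(FALSE, TRUE, FALSE))
toy$ID <- 1:nrow(toy)

sum(duplicated(toy)) #the unique ID makes every row distinct
sum(duplicated(toy[, names(toy) != "ID"])) #duplicates among the original columns
```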

#creating a vector named char_columns that will contain the names of the columns from the 7th to the 10th column (BounceRates, ExitRates, PageValues, SpecialDay)

char_columns <- names(mydata)[7:10] 

for (col in char_columns) #for each column in the vector named char_columns do the following
{
  #convert each column from character to integer; note that as.integer() drops the fractional part (e.g. "0.2" becomes 0)
  mydata[[col]] <- as.integer(mydata[[col]]) 
}
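
#A short illustration of the difference between as.integer() and as.numeric() on character values such as those in these columns (the vector x is a made-up example). as.integer() truncates towards zero, while as.numeric() keeps the decimals.

```r
x <- c("0.2", "0.05", "12.24171745", "1") #illustrative character values

as.integer(x) #fractional part is dropped
as.numeric(x) #decimals are kept
```

#If the decimals matter for the analysis, as.numeric() would be the conversion to consider; the results shown in this document are based on the integer conversion.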

unique_values_BounceRates <- unique(mydata$BounceRates) #finding the unique values for Bounce Rates

print("The unique values are:")

## [1] "The unique values are:"

print(unique_values_BounceRates)

## [1] 0

print("The number of unique values is:")

## [1] "The number of unique values is:"

print(length(unique_values_BounceRates))

## [1] 1

unique_values_ExitRates <- unique(mydata$ExitRates) #finding the unique values for Exit Rates

print("The unique values are:")

## [1] "The unique values are:"

print(unique_values_ExitRates)

## [1] 0

print("The number of unique values is:")

## [1] "The number of unique values is:"

print(length(unique_values_ExitRates))

## [1] 1

###Description of variables in the dataset:

##Numerical variables

#-Administrative: the number of pages related to administrative tasks (like account settings) and shows the user's engagement with administrative content
#-Administrative Duration: the time spent on administrative pages in a session 
#-Informational: the number of pages related to informational content (non-product related) and shows the user's interest in informational content
#-Informational Duration: the time spent on informational pages in a session 
#-Product Related: the number of pages related to products and shows the user's engagement with product content
#-Product Related Duration: the time spent on product related pages in a session 
#-Bounce Rates: the percentage of users who enter a page and leave without taking any action and shows how often users leave without exploring
#-Exit Rates: the percentage of page views that were the last in a session and shows the last pages users view before leaving the site
#-Page Values: the average value of a page that a user visited before making a transaction and shows the contribution of a page to completing transactions.
#-Special Day: how close the visit of a user is to a specific day/holiday (e.g. Christmas, Valentine's Day)

##Categorical variables
#-Operating Systems: type of OS used by the user (e.g. Windows, Mac) (1-8)
#-Browser: web browser used by the user (e.g. Chrome, Firefox) (1-13)
#-Region: geographical region of the user (1-9)
#-Traffic type: type of traffic source that brought the user to the site (1-20)
#-Visitor type: if the visitor is a returning or new visitor (Returning, New & Other). 
#-Weekend: if the date of the visit is on a weekend (False & True)
#-Month: month in which the session has happened (Feb to Dec; w/o January and April)
#-Revenue: if a transaction resulted in income during the user's online session (False, True)

#Note that "Bounce Rate", "Exit Rate" and "Page Value" are metrics measured by "Google Analytics".

#For the purpose of this homework, several variables will not be used because they are not considered useful for the analysis: "Administrative" & "Administrative Duration", "Informational" & "Informational Duration", and "Product Related" & "Product Related Duration". These columns represent the number of pages visited by the user in each category and the time spent on them, and this information is considered to be captured by the variable "Page Values". The type of OS or browser is also not very meaningful, since people can buy online no matter which OS or internet browser they use. Month is removed as well, because some months are missing and we cannot be sure whether the month makes a big difference. The variables Weekend and Region will not be used either. After the integer conversion above, "Bounce Rates" and "Exit Rates" contain only zeros, so they have no variability and cannot provide useful information; they will not be used.


###Description of the expected effect of the explanatory variables on the dependent variable Revenue

#**Page Values**: a high page value indicates that the user visited pages that tend to precede a completed transaction, which suggests how important those pages are for revenue

#**Special Day**: will show whether revenue increases around special days, which may suggest that people are more willing to buy during holidays and take advantage of the special deals and discounts offered by the store

#**Visitor Type**: returning visitors may have a higher likelihood of making a purchase than new visitors

#the type Other will be removed since it has no clear meaning for this variable
mydata <- mydata[mydata$VisitorType != "Other", ]

unique_values_VisitorType <- unique(mydata$VisitorType) #finding the unique values for Visitor Type

print("The unique values are:")

## [1] "The unique values are:"

print(unique_values_VisitorType)

## [1] "Returning_Visitor" "New_Visitor"

print("The number of unique values is:")

## [1] "The number of unique values is:"

print(length(unique_values_VisitorType))

## [1] 2

#The function factor() converts categorical vectors into the data type factor, which helps when analyzing categorical data. It was used for the variables VisitorType and Revenue, and the resulting factors were stored in the new variables VisitorTypeF and RevenueF respectively.

#reference group Returning_Visitor

mydata$VisitorTypeF <- factor(mydata$VisitorType, 
                          levels = c("Returning_Visitor", "New_Visitor"), 
                          labels = c("Returning_Visitor", "New_Visitor"))

#reference group FALSE

mydata$RevenueF <- factor(mydata$Revenue, 
                          levels = c("FALSE", "TRUE"), 
                          labels = c("False", "True"))

#The descriptive statistics for the variable VisitorTypeF, show the occurrences of the type of visitors on the site. Based on the summary, the "Returning_Visitor" type of user is the most common, "New_Visitor" type is the second most common. This can help when making decisions like targeting promotions for the new visitors to encourage their first purchase, or offering rewards for the returning visitors.

summary(mydata$VisitorTypeF)

## Returning_Visitor       New_Visitor 
##             10551              1694

#The descriptive statistics for the variable RevenueF show how many sessions ended with a purchase (Revenue=True) and how many did not (Revenue=False).

summary(mydata$RevenueF)

## False  True 
## 10353  1892

#The summary for PageValues shows that this variable ranges from a minimum of 0 to a maximum of 361. Its mean of about 5.7 is the average page value per session, while the median of 0 shows that most sessions have a page value of 0. SpecialDay ranges from 0 to 1: a value close to 0 means that the visit is not close to a special day, and a value of 1 means that the visit is very close to, or on, a special day. The mean of 0.01 shows that, on average, user visits are not close to special days.

summary(mydata[colnames(mydata) %in% c("PageValues", "SpecialDay")])

##    PageValues        SpecialDay     
##  Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :  0.000   Median :0.00000  
##  Mean   :  5.694   Mean   :0.01258  
##  3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :361.000   Max.   :1.00000

###2.2 Analysis

##fit0

#dependent and explanatory variables | only reg. constant is included because we want to know the starting point without explanatory variables

fit0 <- glm(RevenueF ~ 1, 
            family = binomial, #binary logistic regression
            data = mydata)

summary(fit0)

## 
## Call:
## glm(formula = RevenueF ~ 1, family = binomial, data = mydata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.700      0.025  -67.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10542  on 12244  degrees of freedom
## Residual deviance: 10542  on 12244  degrees of freedom
## AIC: 10544
## 
## Number of Fisher Scoring iterations: 3

#The odds of a purchase are 0.183: the probability that a visitor will make a purchase on the site is 0.183 times the probability that the visitor will not make a purchase.

exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #odds for Revenue=True

##                 odds     2.5 %    97.5 %
## (Intercept) 0.182749 0.1740097 0.1919271

head(fitted(fit0)) #estimated probability for Revenue=True

##        1        2        3        4        5        6 
## 0.154512 0.154512 0.154512 0.154512 0.154512 0.154512
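
#A short worked example of how the odds of the null model relate to the fitted probability (the names odds and prob are illustrative). The probability is odds / (1 + odds), which here equals the share of sessions with Revenue=True reported by summary(mydata$RevenueF).

```r
odds <- 0.182749 #exp(intercept) of fit0, taken from the output above
prob <- odds / (1 + odds) #probability of Revenue=True
prob

1892 / 12245 #share of sessions with Revenue=True in the data
```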

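#prop. of correctly classified units for the model with the constant only (fit0)
#the counts are hard-coded; note that summary(mydata$RevenueF) above reports 10353 False out of 12245 units
Pseudo_R2_fit1 <- 10297/12205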

Pseudo_R2_fit1

## [1] 0.8436706
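
#A hedged sketch of computing this proportion from the observed counts instead of typing the numbers by hand (revenue_counts and null_accuracy are illustrative names). The model with only a constant classifies every unit into the larger group, so its share of correctly classified units is the share of that group. With the counts reported by summary(mydata$RevenueF), the result differs slightly from the hard-coded ratio above.

```r
#counts taken from summary(mydata$RevenueF) above
revenue_counts <- c(False = 10353, True = 1892)

#the null model classifies every unit into the larger group
null_accuracy <- max(revenue_counts) / sum(revenue_counts)
null_accuracy
```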

#H0: beta1 = 0 H1: beta1 != 0

#PageValues is significant at p<0.001

#If PageValues increases by one unit, the odds that the visitor will make a purchase on the site are equal to e^0.089242 = 1.093 times the initial odds (p<0.001).

#So, if PageValues increases, the probability that the visitor makes a purchase increases.

#Null deviance: 10541.9 (simpler model, fit0)
#Residual deviance: 7888.9 (more complex model, fit1)

fit1 <- glm(RevenueF ~ PageValues, #dependent and explanatory variables
            family = binomial, #binary logistic regression
            data = mydata)

summary(fit1)

## 
## Call:
## glm(formula = RevenueF ~ PageValues, family = binomial, data = mydata)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.399622   0.034278  -70.01   <2e-16 ***
## PageValues   0.089242   0.002373   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10541.9  on 12244  degrees of freedom
## Residual deviance:  7888.9  on 12243  degrees of freedom
## AIC: 7892.9
## 
## Number of Fisher Scoring iterations: 5

#The more complex model (fit1) fits the data significantly better than the model with the constant only (fit0).
anova(fit0, fit1, test = "Chi") #comparison of models based on -2LL statistics

## Analysis of Deviance Table
## 
## Model 1: RevenueF ~ 1
## Model 2: RevenueF ~ PageValues
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     12244    10541.9                          
## 2     12243     7888.9  1     2653 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
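
#The test in the table above can be reproduced by hand as a sketch (the variable names are illustrative): the test statistic is the difference between the two deviances, compared with a chi-square distribution with one degree of freedom, because fit1 has one additional parameter.

```r
#deviances taken from the output above
null_deviance <- 10541.9
residual_deviance <- 7888.9

lr_statistic <- null_deviance - residual_deviance #difference in -2LL
p_value <- pchisq(lr_statistic, df = 1, lower.tail = FALSE) #one extra parameter

lr_statistic
p_value
```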

exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #odds ratio for Revenue=1 (with 95% CI)

##                     OR      2.5 %     97.5 %
## (Intercept) 0.09075223 0.08485555 0.09705869
## PageValues  1.09334488 1.08827082 1.09844258
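
#A short worked example of the odds ratio (beta_pagevalues is an illustrative name). Exponentiating the coefficient gives the odds ratio for a one-unit increase in PageValues; for a larger increase the coefficient is multiplied by the size of the increase before exponentiating.

```r
beta_pagevalues <- 0.089242 #coefficient of PageValues from summary(fit1)

exp(beta_pagevalues) #odds ratio for a one-unit increase
exp(10 * beta_pagevalues) #odds ratio for a ten-unit increase
```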

##fit2

#Assumptions for logistic regression

#install.packages("car")

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF, #dependent and explanatory variables
            family = binomial, #binary logistic regression
            data = mydata)

#1. The dependent variable is dichotomous. - Yes, Revenue has False & True.

#2. The explanatory variables are numerical or in the form of a dummy variable. - Yes, PageValues and SpecialDay are numerical & VisitorTypeF is a dummy variable.

#3. No outliers and/or units with high impact.

mydata$StdResid <- round(rstandard(fit2), 3) #standardized residuals | remove above 3 and below -3

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

head(mydata[order(-mydata$StdResid), c("ID", "StdResid")])

##        ID StdResid
## 2670 2670    2.337
## 4902 4902    2.266
## 1189 1189    2.255
## 5491 5491    2.255
## 5515 5515    2.255
## 5555 5555    2.255

head(mydata[order(mydata$StdResid), c("ID", "StdResid")], n=30) #showing the first thirty rows

##          ID StdResid
## 5774   5774   -6.288
## 8346   8346   -5.802
## 10641 10641   -5.139
## 6430   6430   -5.038
## 7939   7939   -4.417
## 12248 12248   -4.315
## 5878   5878   -4.233
## 10026 10026   -4.040
## 3201   3201   -4.018
## 7008   7008   -3.861
## 8895   8895   -3.838
## 3272   3272   -3.526
## 9055   9055   -3.239
## 7037   7037   -3.212
## 9250   9250   -3.212
## 10106 10106   -3.212
## 1790   1790   -3.185
## 11599 11599   -3.129
## 5543   5543   -3.105
## 8417   8417   -3.101
## 11520 11520   -3.072
## 3629   3629   -3.044
## 4195   4195   -3.015
## 9960   9960   -3.015
## 10436 10436   -2.956
## 8097   8097   -2.926
## 8706   8706   -2.926
## 2957   2957   -2.897
## 10068 10068   -2.805
## 7728   7728   -2.774

mydata$CooksD <- round(cooks.distance(fit2), 3) #cooks distances | remove above 1 or gaps 

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")])

##          ID CooksD
## 5774   5774  0.083
## 8346   8346  0.064
## 10641 10641  0.042
## 6430   6430  0.037
## 2670   2670  0.026
## 4902   4902  0.026

#install.packages("dplyr")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  dplyr::filter(!(ID %in% c(5689, 8249, 10496, 6344, 5688, 7850, 12043, 5792, 9896,
                            3148, 6921, 8787, 3218, 8944, 6950, 9139, 9974, 1749,
                            11424, 5458, 8319, 11347, 3572, 4132, 9833)))

nrow(mydata)

## [1] 12220
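
#A hedged sketch of removing units by rule instead of by a typed list of IDs, which has to be kept in sync with the residual table by hand. The data frame toy and its residual values are made up; the threshold of 3 follows the comment next to rstandard() above.

```r
#illustrative data; the residual values are made up
toy <- data.frame(ID = 1:5,
                  StdResid = c(-0.4, -6.3, 2.3, -3.1, 0.1))

#keep only units whose standardized residual lies within [-3, 3]
toy_clean <- toy[abs(toy$StdResid) <= 3, ]
toy_clean$ID
```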

#4. Sufficiently large samples. - Yes, the dataset has more than 400 observations.

#5. Absence of too strong multicollinearity. - Yes, all VIF values are close to 1, which shows that the explanatory variables are not highly correlated and that multicollinearity is not a problem.
vif(fit2)

##   PageValues   SpecialDay VisitorTypeF 
##     1.000439     1.000734     1.000435

fit1 <- glm(RevenueF ~ PageValues,  
            family = binomial, 
            data = mydata)

fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF,  
            family = binomial, 
            data = mydata)

anova(fit1, fit2, test = "Chi") #comparison of models based on -2LL statistics

## Analysis of Deviance Table
## 
## Model 1: RevenueF ~ PageValues
## Model 2: RevenueF ~ PageValues + SpecialDay + VisitorTypeF
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     12218     7875.0                          
## 2     12216     7841.8  2   33.263 5.986e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fit2)

## 
## Call:
## glm(formula = RevenueF ~ PageValues + SpecialDay + VisitorTypeF, 
##     family = binomial, data = mydata)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.457519   0.036979 -66.457  < 2e-16 ***
## PageValues               0.088391   0.002387  37.026  < 2e-16 ***
## SpecialDay              -0.537325   0.344977  -1.558    0.119    
## VisitorTypeFNew_Visitor  0.450792   0.079997   5.635 1.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10516.5  on 12219  degrees of freedom
## Residual deviance:  7841.8  on 12216  degrees of freedom
## AIC: 7849.8
## 
## Number of Fisher Scoring iterations: 5

#SpecialDay is not statistically significant (p = 0.119), so there is not enough evidence that it affects the odds of a purchase, and its coefficient is not interpreted.

#If PageValues increases by one unit, the odds that the visitor will make a purchase on the site are equal to 1.092 times the initial odds (p<0.001), assuming that the remaining explanatory variables do not change. The probability of making a purchase on the site increases.

#Given the values of the other explanatory variables, the odds of making a purchase for a new visitor are 1.570 times the odds for a returning visitor (p<0.001). New visitors are more likely to purchase a product on the site.

exp(cbind(OR = fit2$coefficients, confint.default(fit2))) #odds ratio for Revenue=True (with 95% CI)

##                                 OR      2.5 %     97.5 %
## (Intercept)             0.08564717 0.07965924 0.09208521
## PageValues              1.09241505 1.08731569 1.09753833
## SpecialDay              0.58430947 0.29716507 1.14891551
## VisitorTypeFNew_Visitor 1.56955542 1.34178511 1.83599012

mydata$EstProb <- fitted(fit2) #estimates of probabilities for Revenue=True: P(Y=1)
head(mydata)

##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                       0           0         0          0
## 2              2                      64           0         0          0
## 3              1                       0           0         0          0
## 4              2             2.666666667           0         0          0
## 5             10                   627.5           0         0          0
## 6             19             154.2166667           0         0          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType
## 1          0   Feb                1       1      1           1
## 2          0   Feb                2       2      1           2
## 3          0   Feb                4       1      9           3
## 4          0   Feb                3       2      2           4
## 5          0   Feb                3       3      1           4
## 6          0   Feb                2       2      1           3
##         VisitorType Weekend Revenue ID      VisitorTypeF RevenueF StdResid
## 1 Returning_Visitor   FALSE   FALSE  1 Returning_Visitor    False   -0.405
## 2 Returning_Visitor   FALSE   FALSE  2 Returning_Visitor    False   -0.405
## 3 Returning_Visitor   FALSE   FALSE  3 Returning_Visitor    False   -0.405
## 4 Returning_Visitor   FALSE   FALSE  4 Returning_Visitor    False   -0.405
## 5 Returning_Visitor    TRUE   FALSE  5 Returning_Visitor    False   -0.405
## 6 Returning_Visitor   FALSE   FALSE  6 Returning_Visitor    False   -0.405
##   CooksD    EstProb
## 1      0 0.07889043
## 2      0 0.07889043
## 3      0 0.07889043
## 4      0 0.07889043
## 5      0 0.07889043
## 6      0 0.07889043
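
#A minimal sketch of how an estimated probability follows from the coefficients of fit2 (the names b0, b_page, b_special, b_new, p_returning and p_new are illustrative). The linear predictor is passed through the logistic function plogis(); for a returning visitor with PageValues = 0 and SpecialDay = 0 this reproduces the EstProb value shown above.

```r
#coefficients taken from the summary(fit2) output above
b0 <- -2.457519 #intercept
b_page <- 0.088391 #PageValues
b_special <- -0.537325 #SpecialDay
b_new <- 0.450792 #VisitorTypeFNew_Visitor

#returning visitor, PageValues = 0, SpecialDay = 0
p_returning <- plogis(b0 + b_page * 0 + b_special * 0 + b_new * 0)

#new visitor with the same PageValues and SpecialDay
p_new <- plogis(b0 + b_page * 0 + b_special * 0 + b_new * 1)

p_returning
p_new
```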

mydata$Classification <- ifelse(test = mydata$EstProb < 0.50, 
                                yes = "False", 
                                no = "True")

#if estimated probability is below 0.50, visitor is classified into the group of people that have not made a purchase (Revenue=False), otherwise True

mydata$ClassificationF <- factor(mydata$Classification,
                                 levels = c("False", "True"), 
                                 labels = c("False", "True"))

#classification table based on 2 categorical variables

ClassificationTable <- table(mydata$RevenueF, mydata$ClassificationF) 

ClassificationTable

##        
##         False  True
##   False 10129   204
##   True   1206   681

#Pseudo_R2_fit2 is 0.8846154. It is the proportion of correctly classified units: with a cut-off of 0.50, the model fit2 classifies about 88% of the visitors correctly (purchase made or not). It is not the share of explained variability of the dependent variable RevenueF.

#prop. of correctly classified units

Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata) 

Pseudo_R2_fit2

## [1] 0.8846154
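
#The overall proportion of correctly classified units can be split by group, as a sketch (class_tab, accuracy, sensitivity and specificity are illustrative names; the counts are copied from the classification table above). Sensitivity is the share of actual purchases that the model classifies as True, and specificity is the share of non-purchases that it classifies as False.

```r
#classification table copied from the output above (rows = observed, columns = predicted)
class_tab <- matrix(c(10129, 1206, 204, 681), nrow = 2,
                    dimnames = list(Observed = c("False", "True"),
                                    Predicted = c("False", "True")))

accuracy <- sum(diag(class_tab)) / sum(class_tab) #correctly classified units
sensitivity <- class_tab["True", "True"] / sum(class_tab["True", ]) #purchases detected
specificity <- class_tab["False", "False"] / sum(class_tab["False", ]) #non-purchases detected

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

#Because purchases are the smaller group, the overall proportion is driven mainly by the non-purchases, so looking at the two groups separately gives a fuller picture of the classification.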

###2.3 Conclusion 

#In summary, this dataset helped to understand what influences online shoppers and their decision to make a purchase or not. We found that closeness to special days does not have a significant effect on whether a visitor makes a transaction, while the type of visitor and the page value do. For the purpose of this analysis, some of the variables were excluded because they did not help to explain the purchase decision.

#Explanations for each part of chapter 2.2 Analysis are given alongside the code, to show the different logistic models and the effect of the variables.


2024-01-08
