library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.3.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
project_data <- read.csv("online_shoppers_intention.csv")

Explanation & Multicollinearity Analysis

Based on last weeks model, we had the best results when we didn’t use the interaction variable we made, and we just used the 3 duration variables.

So this week I’ll use those 3 and also add in the binary variable, weekend, to see if this addition affects / improves our model.

First, I’ll check for multicollinearity by looking at the correlation for all the variables I’m going to use in my model.

# Select the variables for correlation analysis
corr <- project_data |>
  select(ProductRelated_Duration, Administrative_Duration, Informational_Duration, Weekend, Revenue)

# Compute the correlation matrix
correlation_matrix <- cor(corr, use = "complete.obs")

# Display the correlation matrix as a table
correlation_table <- as.data.frame(as.table(correlation_matrix))
names(correlation_table) <- c("Variable 1", "Variable 2", "Correlation")

# Print the correlation table
print(correlation_table)
##                 Variable 1              Variable 2 Correlation
## 1  ProductRelated_Duration ProductRelated_Duration 1.000000000
## 2  Administrative_Duration ProductRelated_Duration 0.355421954
## 3   Informational_Duration ProductRelated_Duration 0.347363577
## 4                  Weekend ProductRelated_Duration 0.007310614
## 5                  Revenue ProductRelated_Duration 0.152372611
## 6  ProductRelated_Duration Administrative_Duration 0.355421954
## 7  Administrative_Duration Administrative_Duration 1.000000000
## 8   Informational_Duration Administrative_Duration 0.238030789
## 9                  Weekend Administrative_Duration 0.014990142
## 10                 Revenue Administrative_Duration 0.093586719
## 11 ProductRelated_Duration  Informational_Duration 0.347363577
## 12 Administrative_Duration  Informational_Duration 0.238030789
## 13  Informational_Duration  Informational_Duration 1.000000000
## 14                 Weekend  Informational_Duration 0.024078486
## 15                 Revenue  Informational_Duration 0.070344502
## 16 ProductRelated_Duration                 Weekend 0.007310614
## 17 Administrative_Duration                 Weekend 0.014990142
## 18  Informational_Duration                 Weekend 0.024078486
## 19                 Weekend                 Weekend 1.000000000
## 20                 Revenue                 Weekend 0.029295368
## 21 ProductRelated_Duration                 Revenue 0.152372611
## 22 Administrative_Duration                 Revenue 0.093586719
## 23  Informational_Duration                 Revenue 0.070344502
## 24                 Weekend                 Revenue 0.029295368
## 25                 Revenue                 Revenue 1.000000000

Although there is some correlation within the duration variables, .355 isn’t high enough for me to consider it an issue.

Now I’ll use VIF to check further into the multicollinearity issue. I’ll do this because my correlation table only captures relationships between 2 variables. Since multicollinearity can sometimes arise as a result of more than two variables relationship, Variance Inflation Factor (VIF) is used as a more robust check.

# Model for VIF calculation without the Revenue variable
model_vif <- lm(PageValues ~ ProductRelated_Duration + Administrative_Duration + Informational_Duration + Weekend, 
                data = project_data)

# Calculate VIF values
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif_values <- vif(model_vif)

# Display VIF values
print(vif_values)
## ProductRelated_Duration Administrative_Duration  Informational_Duration 
##                1.249265                1.164611                1.157518 
##                 Weekend 
##                1.000689

Since these are all close to one I’m not concerned about multicollinearity.

Evaluating the model

# Fit the model
model <- lm(PageValues ~ ProductRelated_Duration + Administrative_Duration + Informational_Duration + Weekend, data = project_data)

# Print the model summary
summary(model)
## 
## Call:
## lm(formula = PageValues ~ ProductRelated_Duration + Administrative_Duration + 
##     Informational_Duration + Weekend, data = project_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.97  -5.71  -5.18  -4.92 356.44 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.924e+00  2.205e-01  22.334  < 2e-16 ***
## ProductRelated_Duration 2.996e-04  9.741e-05   3.076   0.0021 ** 
## Administrative_Duration 5.761e-03  1.018e-03   5.658 1.56e-08 ***
## Informational_Duration  8.995e-04  1.275e-03   0.706   0.4805    
## WeekendTRUE             4.742e-01  3.949e-01   1.201   0.2298    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.52 on 12325 degrees of freedom
## Multiple R-squared:  0.005679,   Adjusted R-squared:  0.005357 
## F-statistic:  17.6 on 4 and 12325 DF,  p-value: 2.056e-14

My main takeaways from this model summary:

Low R-squared (0.005679): This value indicates that the model explains only about 0.57% of the variance in PageValues. This suggests that the predictors used here don't capture much of the variability in PageValues and that other, possibly unobserved, factors may influence it more significantly.

Statistical Significance of Predictors:

Positive Coefficients: The coefficients for ProductRelated_Duration (0.00029965) and Administrative_Duration (0.00576107) are positive, meaning that as time on these pages increases, PageValues slightly increases. However, the effect size is quite small.

Impact of Weekend: The Weekend variable has a p-value of 0.2298, indicating that whether the session occurred on a weekend does not significantly impact PageValues in this model.

Overall Model Significance (F-statistic): Despite the low R-squared, the model's F-statistic (17.6) with a very low p-value (< 0.05) suggests that, collectively, the predictors still contribute some meaningful information to the model, albeit modestly.

Nonetheless, I’ll move on to making diagnostics plots which will further highlight these issues.

Diagnostics Plots

First, I’ll plot the confidence intervals for all the terms

# Tidy the model and add confidence intervals
model_data <- tidy(model, conf.int = TRUE)

# Create the coefficient plot
ggplot(model_data, aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, height = 0.3)) +
  labs(title = "Model Coefficient C.I.", x = "Estimate", y = "Term") +
  theme_minimal()

having confidence intervals that touch or overlap with zero indicates a potential issue with statistical significance, which was expected. When a confidence interval includes zero, it suggests that the effect of that predictor might not be distinguishable from zero, meaning that it might not have a significant impact on the response variable.

The intercept estimate is around 4.92. This suggests that if a session had zero time spent on all types of pages and did not occur on a weekend, PageValues would be approximately 4.92. However, in real-world terms, having zero values for all duration variables is unlikely (and potentially impossible), so the intercept has limited practical interpretation, in this case.

Next we’ll interpret leverage with Cook’s Distance analysis

project_data <- project_data |>
  mutate(
    cooks_d = cooks.distance(model),           # Calculate Cook's Distance
    influential = cooks_d > (4 / nrow(project_data))  # Mark influential points
  )

# Plotting with Cook's Distance
ggplot(project_data, aes(x = ProductRelated_Duration, y = PageValues)) +
  geom_point(size = 2, color = 'darkblue') +
  
  # Add labels for influential points based on Cook's Distance
  geom_text_repel(data = filter(project_data, influential),
                  aes(label = paste("Cook's D", round(cooks_d, 3))),
                  color = "darkred") +
  
  # Regression line without influential points
  geom_smooth(data = filter(project_data, influential == FALSE),
              method = 'lm', se = FALSE, color = 'lightblue') +
  
  # Regression line with all points
  geom_smooth(method = 'lm', se = FALSE, color = 'darkred') +
  
  labs(title = "Influence of Cook's Distance on Regression Line",
       x = "ProductRelated_Duration",
       y = "PageValues")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: ggrepel: 410 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

This plot shows that most data points have very low Cook's Distance, meaning they have minimal influence on the regression line. A few points with high ProductRelated_Duration have slightly elevated Cook's D values, indicating they are points of high leverage—far from the mean in terms of ProductRelated_Duration. However, they don't appear to be strongly influential, as removing them would not drastically change the slope of the regression line. Overall, this plot suggests that there are no points with significant influence on the model’s fit.

Next, we’ll plot a histogram of the residuals:

ggplot(data = model, aes(x = .resid)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Residual Histogram",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()

The residual histogram shows that most residuals are concentrated very close to zero, but there is a long right tail, indicating some positive residuals that are much larger than the bulk of the data. This skew suggests that the residuals are not normally distributed, which may violate the normality assumption for linear regression. This could be due to outliers or other factors influencing the response variable.

Now a QQ plot:

# QQ-Plot
ggplot(data = model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "QQ-Plot of Residuals") +
  theme_minimal()

The QQ-plot shows a heavy deviation from the normality line, particularly on the right side, indicating that the residuals have a right-skewed distribution with some extreme positive outliers. This again suggests that the normality assumption for linear regression is not met.

Next, I’ll look at the residuals vs fitted plot

# Residuals vs Fitted plot
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

The residuals vs. fitted values plot reveals a funnel shape, with a wider spread of residuals at lower fitted values that narrows as fitted values increase. This pattern indicates a violation of the constant variance assumption in linear regression, meaning that the errors are not equally distributed across all predicted values. This issue suggests that the model's accuracy may be inconsistent, especially for predictions at lower values, impacting the reliability of those predictions.

Now I’ll plot the residuals vs the ProductRelated_Duration variable:

ggplot(data = project_data, aes(x = ProductRelated_Duration, y = model$residuals)) +
  geom_point(color = "purple") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs ProductRelated_Duration",
       x = "ProductRelated_Duration",
       y = "Residuals") +
  theme_minimal()

This visual pretty much highlights the same insight as the one before, no consistency in residuals is an issue.

Conclusion