library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.3.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
project_data <- read.csv("online_shoppers_intention.csv")
Based on last weeks model, we had the best results when we didn’t use the interaction variable we made, and we just used the 3 duration variables.
So this week I’ll use those 3 and also add in the binary variable, weekend, to see if this addition affects / improves our model.
First, I’ll check for multicollinearity by looking at the correlation for all the variables I’m going to use in my model.
# Select the variables for correlation analysis
corr <- project_data |>
select(ProductRelated_Duration, Administrative_Duration, Informational_Duration, Weekend, Revenue)
# Compute the correlation matrix
correlation_matrix <- cor(corr, use = "complete.obs")
# Display the correlation matrix as a table
correlation_table <- as.data.frame(as.table(correlation_matrix))
names(correlation_table) <- c("Variable 1", "Variable 2", "Correlation")
# Print the correlation table
print(correlation_table)
## Variable 1 Variable 2 Correlation
## 1 ProductRelated_Duration ProductRelated_Duration 1.000000000
## 2 Administrative_Duration ProductRelated_Duration 0.355421954
## 3 Informational_Duration ProductRelated_Duration 0.347363577
## 4 Weekend ProductRelated_Duration 0.007310614
## 5 Revenue ProductRelated_Duration 0.152372611
## 6 ProductRelated_Duration Administrative_Duration 0.355421954
## 7 Administrative_Duration Administrative_Duration 1.000000000
## 8 Informational_Duration Administrative_Duration 0.238030789
## 9 Weekend Administrative_Duration 0.014990142
## 10 Revenue Administrative_Duration 0.093586719
## 11 ProductRelated_Duration Informational_Duration 0.347363577
## 12 Administrative_Duration Informational_Duration 0.238030789
## 13 Informational_Duration Informational_Duration 1.000000000
## 14 Weekend Informational_Duration 0.024078486
## 15 Revenue Informational_Duration 0.070344502
## 16 ProductRelated_Duration Weekend 0.007310614
## 17 Administrative_Duration Weekend 0.014990142
## 18 Informational_Duration Weekend 0.024078486
## 19 Weekend Weekend 1.000000000
## 20 Revenue Weekend 0.029295368
## 21 ProductRelated_Duration Revenue 0.152372611
## 22 Administrative_Duration Revenue 0.093586719
## 23 Informational_Duration Revenue 0.070344502
## 24 Weekend Revenue 0.029295368
## 25 Revenue Revenue 1.000000000
Although there is some correlation within the duration variables, .355 isn’t high enough for me to consider it an issue.
Now I’ll use VIF to check further into the multicollinearity issue. I’ll do this because my correlation table only captures relationships between 2 variables. Since multicollinearity can sometimes arise as a result of more than two variables relationship, Variance Inflation Factor (VIF) is used as a more robust check.
# Model for VIF calculation without the Revenue variable
model_vif <- lm(PageValues ~ ProductRelated_Duration + Administrative_Duration + Informational_Duration + Weekend,
data = project_data)
# Calculate VIF values
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif_values <- vif(model_vif)
# Display VIF values
print(vif_values)
## ProductRelated_Duration Administrative_Duration Informational_Duration
## 1.249265 1.164611 1.157518
## Weekend
## 1.000689
Since these are all close to one I’m not concerned about multicollinearity.
# Fit the model
model <- lm(PageValues ~ ProductRelated_Duration + Administrative_Duration + Informational_Duration + Weekend, data = project_data)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = PageValues ~ ProductRelated_Duration + Administrative_Duration +
## Informational_Duration + Weekend, data = project_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.97 -5.71 -5.18 -4.92 356.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.924e+00 2.205e-01 22.334 < 2e-16 ***
## ProductRelated_Duration 2.996e-04 9.741e-05 3.076 0.0021 **
## Administrative_Duration 5.761e-03 1.018e-03 5.658 1.56e-08 ***
## Informational_Duration 8.995e-04 1.275e-03 0.706 0.4805
## WeekendTRUE 4.742e-01 3.949e-01 1.201 0.2298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.52 on 12325 degrees of freedom
## Multiple R-squared: 0.005679, Adjusted R-squared: 0.005357
## F-statistic: 17.6 on 4 and 12325 DF, p-value: 2.056e-14
My main takeaways from this model summary:
Low R-squared (0.005679): This value indicates that the model explains only about 0.57% of the variance in PageValues. This suggests that the predictors used here don't capture much of the variability in PageValues and that other, possibly unobserved, factors may influence it more significantly.
Statistical Significance of Predictors:
ProductRelated_Duration and Administrative_Duration both have very low p-values (< 0.01), indicating they are statistically significant predictors of PageValues.
Informational_Duration and Weekend are not statistically significant (p-values > 0.05), suggesting that their impact on PageValues is not strong or consistent enough in this model.
Positive Coefficients: The coefficients for ProductRelated_Duration (0.00029965) and Administrative_Duration (0.00576107) are positive, meaning that as time on these pages increases, PageValues slightly increases. However, the effect size is quite small.
Impact of Weekend: The Weekend variable has a p-value of 0.2298, indicating that whether the session occurred on a weekend does not significantly impact PageValues in this model.
Overall Model Significance (F-statistic): Despite the low R-squared, the model's F-statistic (17.6) with a very low p-value (< 0.05) suggests that, collectively, the predictors still contribute some meaningful information to the model, albeit modestly.
In summary, ProductRelated_Duration and Administrative_Duration are the strongest predictors here, but the model as a whole doesn't explain much of the variance in PageValues, indicating room for improvement with additional or different variables.
Another important thing here is that the data set is not designed to predict a variable that can be analyzed with a linear regression model, the target variable is binary and more suitable for a logistic regression. As well, just because the PageValues variable is continuous, and one of our more correlated variables to the target, does not mean it can be easily predicted with the other explanatory variables. This make’s it difficult to come up with a significant model in this exercise. Hopefully when we move on to logistic regression I will be able to produce a model that provides real insights.
Nonetheless, I’ll move on to making diagnostics plots which will further highlight these issues.
First, I’ll plot the confidence intervals for all the terms
# Tidy the model and add confidence intervals
model_data <- tidy(model, conf.int = TRUE)
# Create the coefficient plot
ggplot(model_data, aes(x = estimate, y = term)) +
geom_point() +
geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, height = 0.3)) +
labs(title = "Model Coefficient C.I.", x = "Estimate", y = "Term") +
theme_minimal()
having confidence intervals that touch or overlap with zero indicates a potential issue with statistical significance, which was expected. When a confidence interval includes zero, it suggests that the effect of that predictor might not be distinguishable from zero, meaning that it might not have a significant impact on the response variable.
The intercept estimate is around 4.92. This suggests that if a session had zero time spent on all types of pages and did not occur on a weekend, PageValues would be approximately 4.92. However, in real-world terms, having zero values for all duration variables is unlikely (and potentially impossible), so the intercept has limited practical interpretation, in this case.
Next we’ll interpret leverage with Cook’s Distance analysis
project_data <- project_data |>
mutate(
cooks_d = cooks.distance(model), # Calculate Cook's Distance
influential = cooks_d > (4 / nrow(project_data)) # Mark influential points
)
# Plotting with Cook's Distance
ggplot(project_data, aes(x = ProductRelated_Duration, y = PageValues)) +
geom_point(size = 2, color = 'darkblue') +
# Add labels for influential points based on Cook's Distance
geom_text_repel(data = filter(project_data, influential),
aes(label = paste("Cook's D", round(cooks_d, 3))),
color = "darkred") +
# Regression line without influential points
geom_smooth(data = filter(project_data, influential == FALSE),
method = 'lm', se = FALSE, color = 'lightblue') +
# Regression line with all points
geom_smooth(method = 'lm', se = FALSE, color = 'darkred') +
labs(title = "Influence of Cook's Distance on Regression Line",
x = "ProductRelated_Duration",
y = "PageValues")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: ggrepel: 410 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
This plot shows that most data points have very low Cook's Distance, meaning they have minimal influence on the regression line. A few points with high ProductRelated_Duration have slightly elevated Cook's D values, indicating they are points of high leverage—far from the mean in terms of ProductRelated_Duration. However, they don't appear to be strongly influential, as removing them would not drastically change the slope of the regression line. Overall, this plot suggests that there are no points with significant influence on the model’s fit.
Next, we’ll plot a histogram of the residuals:
ggplot(data = model, aes(x = .resid)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Residual Histogram",
x = "Residuals",
y = "Frequency") +
theme_minimal()
The residual histogram shows that most residuals are concentrated very close to zero, but there is a long right tail, indicating some positive residuals that are much larger than the bulk of the data. This skew suggests that the residuals are not normally distributed, which may violate the normality assumption for linear regression. This could be due to outliers or other factors influencing the response variable.
Now a QQ plot:
# QQ-Plot
ggplot(data = model, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "QQ-Plot of Residuals") +
theme_minimal()
The QQ-plot shows a heavy deviation from the normality line, particularly on the right side, indicating that the residuals have a right-skewed distribution with some extreme positive outliers. This again suggests that the normality assumption for linear regression is not met.
Next, I’ll look at the residuals vs fitted plot
# Residuals vs Fitted plot
ggplot(data = model, aes(x = .fitted, y = .resid)) +
geom_point(color = "blue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
The residuals vs. fitted values plot reveals a funnel shape, with a wider spread of residuals at lower fitted values that narrows as fitted values increase. This pattern indicates a violation of the constant variance assumption in linear regression, meaning that the errors are not equally distributed across all predicted values. This issue suggests that the model's accuracy may be inconsistent, especially for predictions at lower values, impacting the reliability of those predictions.
Now I’ll plot the residuals vs the ProductRelated_Duration variable:
ggplot(data = project_data, aes(x = ProductRelated_Duration, y = model$residuals)) +
geom_point(color = "purple") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs ProductRelated_Duration",
x = "ProductRelated_Duration",
y = "Residuals") +
theme_minimal()
This visual pretty much highlights the same insight as the one before, no consistency in residuals is an issue.
Although multicollinearity is not an issue for our model, I think all of the other key assumptions of a linear regression are failing to be met in the model we created.
Like I said earlier, these prediction variables and the chosen target variable aren’t ideal for an analysis of this type, but they did provide me with some practice interpreting the glaring issues in a model that shouldn’t be missed in future real-world application.