library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
project_data <- read.csv("online_shoppers_intention.csv")

NEW VARIABLES:

1. Time Efficiency

  • New Column: Calculate a “Time Efficiency” column by dividing PageValues by the total number of Administrative, Informational, and ProductRelated pages visited in the session. This column gives an idea of how efficiently users navigate the website in terms of value generated per page viewed (the page counts act as a rough proxy for time spent).

  • Potential Insight: Explore whether users with higher Time Efficiency interact more with high-value content. Correlating it with ExitRates and Revenue can show whether more efficient sessions have lower bounce or exit rates, or are more likely to end in a purchase.

2. Special Day Impact

  • New Column: Multiply the SpecialDay column by PageValues to create a “SpecialDay_Impact” column. This gives an indication of how much value users generate in sessions close to a special day.

  • Potential Insight: Explore how SpecialDay_Impact correlates with other numeric variables like ExitRates and Revenue. The goal is to see whether users browsing near special days view pages differently (in terms of engagement or value) than users on ordinary days.

# Creating the Time Efficiency column; sessions that visited zero pages in
# all three categories produce NaN here (division by zero)
project_data$TimeEfficiency <- project_data$PageValues / 
                               (project_data$Administrative + 
                                project_data$Informational + 
                                project_data$ProductRelated)

# Creating the Special Day Impact column
project_data$SpecialDay_Impact <- project_data$SpecialDay * project_data$PageValues
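
Before plotting, a quick sanity check on the two new columns (this snippet is an illustrative addition, not part of the original pipeline): it counts the NaN sessions created by the zero-page division (the same rows the plots below warn about removing) and shows how heavily both variables sit at zero.

# Sanity checks on the new columns
sum(!is.finite(project_data$TimeEfficiency))          # sessions with NaN (zero pages visited)
mean(project_data$TimeEfficiency == 0, na.rm = TRUE)  # share of sessions with zero Time Efficiency
mean(project_data$SpecialDay_Impact == 0)             # share of sessions with zero Special Day Impact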

VISUALIZATION:

Now, plot both new variables vs ExitRates:

# Plot Time Efficiency vs Exit Rates
ggplot(project_data, aes(x = TimeEfficiency, y = ExitRates)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), col = "blue") +  # Quadratic trend line
  labs(title = "Time Efficiency vs Exit Rates",
       x = "Time Efficiency",
       y = "Exit Rates")
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).

# Plot Special Day Impact vs Exit Rates
ggplot(project_data, aes(x = SpecialDay_Impact, y = ExitRates)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), col = "blue") +  # Quadratic trend line
  labs(title = "Special Day Impact vs Exit Rates",
       x = "Special Day Impact",
       y = "Exit Rates")

Now we’ll plot the same values, but with two trend lines split by whether the session ended in a purchase (Revenue = TRUE) or not (Revenue = FALSE).

# Plot Time Efficiency vs Exit Rates with Revenue-based trendlines and black data points
ggplot(project_data, aes(x = TimeEfficiency, y = ExitRates)) +
  geom_point(color = "black") +  # Black data points
  geom_smooth(aes(color = as.factor(Revenue)), method = "lm", formula = y ~ poly(x, 2), se = FALSE) +  # Colored trend lines
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green")) +  # Red for FALSE, Green for TRUE
  labs(title = "Time Efficiency vs Exit Rates",
       x = "Time Efficiency",
       y = "Exit Rates",
       color = "Revenue")  # Legend for Revenue
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).

# Plot Special Day Impact vs Exit Rates with Revenue-based trendlines and black data points
ggplot(project_data, aes(x = SpecialDay_Impact, y = ExitRates)) +
  geom_point(color = "black") +  # Black data points
  geom_smooth(aes(color = as.factor(Revenue)), method = "lm", formula = y ~ poly(x, 2), se = FALSE) +  # Colored trend lines
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green")) +  # Red for FALSE, Green for TRUE
  labs(title = "Special Day Impact vs Exit Rates",
       x = "Special Day Impact",
       y = "Exit Rates",
       color = "Revenue")

Now we’ll remove all the zero instances for both new variables just to see how it impacts things. Both new variables are overwhelmingly zero, so filtering the zeros out may reveal patterns the full data hides.

# Filter out rows where Time Efficiency or Special Day Impact is zero
filtered_data <- project_data |>
  filter(TimeEfficiency > 0, SpecialDay_Impact > 0)
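
To see how aggressive this filter is (another illustrative addition), we can check how many sessions survive:

# Rows remaining after the zero filter, and the fraction of the original data kept
nrow(filtered_data)
nrow(filtered_data) / nrow(project_data)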

# Plot Time Efficiency vs Exit Rates with Revenue-based trendlines and black data points
ggplot(filtered_data, aes(x = TimeEfficiency, y = ExitRates)) +
  geom_point(color = "black") +  # Black data points
  geom_smooth(aes(color = as.factor(Revenue)), method = "lm", formula = y ~ poly(x, 2), se = FALSE) +  # Colored trend lines
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green")) +  # Red for FALSE, Green for TRUE
  labs(title = "Time Efficiency vs Exit Rates (Filtered)",
       x = "Time Efficiency",
       y = "Exit Rates",
       color = "Revenue")

# Plot Special Day Impact vs Exit Rates with Revenue-based trendlines and black data points
ggplot(filtered_data, aes(x = SpecialDay_Impact, y = ExitRates)) +
  geom_point(color = "black") +  # Black data points
  geom_smooth(aes(color = as.factor(Revenue)), method = "lm", formula = y ~ poly(x, 2), se = FALSE) +  # Colored trend lines
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green")) +  # Red for FALSE, Green for TRUE
  labs(title = "Special Day Impact vs Exit Rates (Filtered)",
       x = "Special Day Impact",
       y = "Exit Rates",
       color = "Revenue")
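
As a design note: instead of overlaying the two trend lines on one point cloud, you could facet by Revenue so the purchasing and non-purchasing sessions are separated entirely. A sketch with the same filtered data:

# Same relationship, faceted by purchase outcome instead of overlaid trend lines
ggplot(filtered_data, aes(x = TimeEfficiency, y = ExitRates)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  facet_wrap(~ Revenue, labeller = label_both) +
  labs(title = "Time Efficiency vs Exit Rates by Revenue (Filtered)",
       x = "Time Efficiency",
       y = "Exit Rates")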

CORRELATION AND CONFIDENCE INTERVAL ANALYSIS:

Correlations for all four combinations:

# Correlation between Time Efficiency and Exit Rates
cor_time_efficiency_exit <- cor(project_data$TimeEfficiency, project_data$ExitRates, use = "complete.obs")  # drop the NaN sessions
cat("Correlation between Time Efficiency and Exit Rates: ", cor_time_efficiency_exit, "\n")
## Correlation between Time Efficiency and Exit Rates:  -0.09212315
# Correlation between Time Efficiency and Revenue (Revenue is treated as numeric: 0 for FALSE, 1 for TRUE)
cor_time_efficiency_revenue <- cor(project_data$TimeEfficiency, as.numeric(project_data$Revenue), use = "complete.obs")
cat("Correlation between Time Efficiency and Revenue: ", cor_time_efficiency_revenue, "\n")
## Correlation between Time Efficiency and Revenue:  0.3531592
# Correlation between Special Day Impact and Exit Rates
cor_special_day_exit <- cor(project_data$SpecialDay_Impact, project_data$ExitRates)
cat("Correlation between Special Day Impact and Exit Rates: ", cor_special_day_exit, "\n")
## Correlation between Special Day Impact and Exit Rates:  -0.03563266
# Correlation between Special Day Impact and Revenue
cor_special_day_revenue <- cor(project_data$SpecialDay_Impact, as.numeric(project_data$Revenue))
cat("Correlation between Special Day Impact and Revenue: ", cor_special_day_revenue, "\n")
## Correlation between Special Day Impact and Revenue:  0.1105657
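
These are point estimates only. As an illustrative extra, base R’s cor.test() also reports a p-value and a 95% confidence interval for a correlation, and it drops incomplete pairs (the NaN sessions) on its own. For example, for the strongest relationship above:

# p-value and 95% CI for the Time Efficiency vs Revenue correlation
cor.test(project_data$TimeEfficiency, as.numeric(project_data$Revenue))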

Now the confidence intervals:

# Calculate mean, standard deviation, and sample size for Exit Rates
mean_exit <- mean(project_data$ExitRates)
std_exit <- sd(project_data$ExitRates)
n_exit <- length(project_data$ExitRates)

# 95% confidence interval for Exit Rates (z = 1.96; the normal approximation
# is reasonable at this sample size)
error_exit <- 1.96 * (std_exit / sqrt(n_exit))  # Margin of error
ci_exit_lower <- mean_exit - error_exit
ci_exit_upper <- mean_exit + error_exit

cat("95% Confidence Interval for Exit Rates: [", ci_exit_lower, ", ", ci_exit_upper, "]\n")
## 95% Confidence Interval for Exit Rates: [ 0.04221501 ,  0.04393059 ]
# Calculate proportion (mean), standard deviation, and sample size for Revenue
mean_revenue <- mean(as.numeric(project_data$Revenue))  # Revenue as numeric: 0 for FALSE, 1 for TRUE
std_revenue <- sd(as.numeric(project_data$Revenue))
n_revenue <- length(project_data$Revenue)

# 95% confidence interval for the Revenue proportion
error_revenue <- 1.96 * (std_revenue / sqrt(n_revenue))  # Margin of error
ci_revenue_lower <- mean_revenue - error_revenue
ci_revenue_upper <- mean_revenue + error_revenue

cat("95% Confidence Interval for Revenue: [", ci_revenue_lower, ", ", ci_revenue_upper, "]\n")
## 95% Confidence Interval for Revenue: [ 0.1483605 ,  0.1611285 ]
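
As a cross-check (an illustrative addition), base R can produce essentially the same intervals without the manual arithmetic. t.test() gives a t-based interval for the mean, which is negligibly different from the z interval at this sample size, and prop.test() gives a score-based interval (with continuity correction) for the proportion, which will differ slightly from the normal approximation above:

# Cross-check the manual intervals with built-in functions
t.test(project_data$ExitRates)$conf.int           # t-based 95% CI for mean Exit Rate
prop.test(sum(project_data$Revenue),
          length(project_data$Revenue))$conf.int  # score-based 95% CI for Revenue proportion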