DataDive7_mohler

** Apologies for the late submission, was struggling to determine how I should handle the D-value dilemna I have at the start of the assignment since it kept coming out really small.**

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.3.3

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.2

library(effsize)

## Warning: package 'effsize' was built under R version 4.3.3

library(pwrss)

## Warning: package 'pwrss' was built under R version 4.3.3

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

project_data <- read.csv("online_shoppers_intention.csv")

Hypothesis 1:

Comparing the avg time spent on the product pages during a viewing session for both positive and negative cases of the user making a purchase.

# make the average product page time variable for each instance of shopping
project_data$Avg_Product_Page_Time <- project_data$ProductRelated_Duration / project_data$ProductRelated

Plot new variable to look for relationships:

ggplot(project_data, aes(x = as.factor(Revenue), y = Avg_Product_Page_Time, fill = as.factor(Revenue))) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("No Purchase", "Purchase")) +
  labs(title = "Average Time per Product Page by Purchase Status",
       x = "Purchase Status (Revenue)",
       y = "Average Time per Product Page (seconds)",
       fill = "Purchase") +
  theme_minimal() +
  scale_x_discrete(labels = c("No Purchase", "Purchase"))

## Warning: Removed 38 rows containing non-finite values (`stat_boxplot()`).

The extremely long times (greater than 500 seconds) in the no purchase group likely indicate sessions where users left the website open for extended periods without interaction. These outliers skew the results, as they are not representative of typical user behavior and could distort insights. By removing these outliers, we obtain a clearer picture of how average time per product page differs between those who make a purchase and those who don't, without the influence of extreme values that likely result from unattended sessions.

So i’ll flip the chart horizontally and remove the outliers to give us a better picture of the data:

# Remove rows where Avg_Product_Page_Time is greater than 500
filtered_data <- subset(project_data, Avg_Product_Page_Time <= 500)

# Create a horizontal boxplot with outliers removed (greater than 500)
ggplot(filtered_data, aes(x = as.factor(Revenue), y = Avg_Product_Page_Time, fill = as.factor(Revenue))) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("No Purchase", "Purchase")) +
  labs(title = "Average Time per Product Page by Purchase Status (Filtered)",
       x = "Purchase Status (Revenue)",
       y = "Average Time per Product Page (seconds)",
       fill = "Purchase") +
  coord_flip() +  # Make the plot horizontal
  theme_minimal() +
  scale_x_discrete(labels = c("No Purchase", "Purchase"))

now I’ll do one more so we can get a better idea of where the data is centralized, only showing times up to 150. This will give a better idea of how the medians compare.

# Filter out rows where Avg_Product_Page_Time is greater than 150
centralized_data <- subset(project_data, Avg_Product_Page_Time <= 150)

# Create a horizontal boxplot showing only times up to 150
ggplot(centralized_data, aes(x = as.factor(Revenue), y = Avg_Product_Page_Time, fill = as.factor(Revenue))) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("No Purchase", "Purchase")) +
  labs(title = "Average Time per Product Page by Purchase Status (<= 150 seconds)",
       x = "Purchase Status (Revenue)",
       y = "Average Time per Product Page (seconds)",
       fill = "Purchase") +
  coord_flip() +  # Make the plot horizontal
  theme_minimal() +
  scale_x_discrete(labels = c("No Purchase", "Purchase"))

With this being known:

Null Hypothesis (H₀): The average time spent on product-related pages is the same for customers who made a purchase and those who did not.
Alternative Hypothesis (H₁): The average time spent on product-related pages is higher for customers who made a purchase compared to those who did not.

# Calculate Cohen's d to estimate the effect size between Revenue groups
cohen.d(
  d = filter(project_data, Revenue == FALSE) |> pluck("Avg_Product_Page_Time"),
  f = filter(project_data, Revenue == TRUE) |> pluck("Avg_Product_Page_Time")
)

## 
## Cohen's d
## 
## d estimate: NaN (NA)
## 95 percent confidence interval:
## lower upper 
##   NaN   NaN

sum(is.na(project_data$Avg_Product_Page_Time))

## [1] 38

Remove the 38 nulls, and also limit outliers:

# Remove rows with NA values and filter for Avg_Product_Page_Time <= 500
filtered_data <- project_data |>
  filter(!is.na(Avg_Product_Page_Time) & Avg_Product_Page_Time <= 500)

# Calculate Cohen's d for Avg_Product_Page_Time between Revenue groups (TRUE vs FALSE)
cohen.d(
  d = filter(filtered_data, Revenue == FALSE) |> pluck("Avg_Product_Page_Time"),
  f = filter(filtered_data, Revenue == TRUE) |> pluck("Avg_Product_Page_Time")
)

## 
## Cohen's d
## 
## d estimate: -0.1810281 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.2299729 -0.1320833

This isn’t very good, but based on how the boxplots are, I feel there are more faulty instances in our dataset
realistically tons of the long instances that are negative cases are instances of people being idled for a long time before being forced off the screen.
In order to try and identify the issue I isolated instance where the total number of pages equaled 1 and product related duration was greater than 150 for both true and false instances, and compared the two.

# Isolate instances where total number of pages is 1 and ProductRelated_Duration > 150
faulty_data <- project_data %>%
  filter(ProductRelated < 5 & Avg_Product_Page_Time > 100)

# Count the number of instances for each Revenue group (TRUE vs FALSE)
faulty_counts <- table(faulty_data$Revenue)

# Display the count of instances for each group
print(faulty_counts)

## 
## FALSE  TRUE 
##   113     6

after messing around with that I found that less than 5 pages and greater than an average 100 seemed to remove an appropriate amount of these outliers, which I likely attribute to an idled customer rather than a customer with true intentions to not make a purchase. So we’ll remove all of these instances and try to get our effect size again.

# Remove rows with NA values and filter based on the specified conditions
cleaned_data <- project_data %>%
  filter(!is.na(Avg_Product_Page_Time) & ProductRelated >= 5 & Avg_Product_Page_Time <= 100)

# Recalculate Cohen's d for Avg_Product_Page_Time between Revenue groups (TRUE vs FALSE)
cohen_d_result <- cohen.d(
  d = filter(cleaned_data, Revenue == FALSE) |> pluck("Avg_Product_Page_Time"),
  f = filter(cleaned_data, Revenue == TRUE) |> pluck("Avg_Product_Page_Time")
)

# Print the result
print(cohen_d_result)

## 
## Cohen's d
## 
## d estimate: -0.2635538 (small)
## 95 percent confidence interval:
##      lower      upper 
## -0.3152265 -0.2118812

this is a better representation of the data to me, although it is “small” in D.

Now I’ll work on finding a sample size:

library(pwr)

## Warning: package 'pwr' was built under R version 4.3.3

# Calculate the sample size for a one-tailed t-test with d = 0.26, alpha = 0.05, power = 0.80
pwr.t.test(d = 0.26, power = 0.80, sig.level = 0.05, type = "two.sample", alternative = "two.sided")

## 
##      Two-sample t test power calculation 
## 
##               n = 233.1791
##               d = 0.26
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

So this suggests the required sample size should be about 233. Next let’s see if our samples of each outcome are large enough to use

# Count the number of instances for each Revenue group (TRUE vs FALSE)
table(cleaned_data$Revenue)

## 
## FALSE  TRUE 
##  8159  1758

As we see the sample sizes are sufficient so we can perform the t-test:

# Perform a two-tailed t-test comparing Avg_Product_Page_Time between Revenue groups
t_test_result <- t.test(Avg_Product_Page_Time ~ Revenue, 
                        data = cleaned_data, 
                        alternative = "two.sided")

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  Avg_Product_Page_Time by Revenue
## t = -10.632, df = 2744.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -5.979014 -4.117072
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            33.68675            38.73480

The results of the two-tailed t-test show that there is a statistically significant difference in the average time spent per product page between purchasers and non-purchasers. While the difference is not enormous (around 4 to 6 seconds), the p-value and confidence interval confirm that this difference is real and not due to random variation.

Hypothesis 2:

Null Hypothesis (H₀): The proportion of purchases made by New Visitors is the same as the proportion of purchases made by Returning Visitors.
Alternative Hypothesis (H₁): The proportion of purchases made by New Visitors is different from the proportion of purchases made by Returning Visitors.

# Create a contingency table of VisitorType and Revenue
visitor_revenue_table <- table(cleaned_data$VisitorType, cleaned_data$Revenue)

# Perform a chi-square test of independence
chi_test_result <- chisq.test(visitor_revenue_table)

# Display the result of the chi-square test
print(chi_test_result)

## 
##  Pearson's Chi-squared test
## 
## data:  visitor_revenue_table
## X-squared = 101.12, df = 2, p-value < 2.2e-16

The X-squared value of 101.12 with 2 degrees of freedom and a p-value < 2.2e-16 indicates a significant difference in purchase behavior between New Visitors and Returning Visitors.
We can confidently reject the null hypothesis, meaning that the proportion of purchases is significantly different between the two visitor types.
The large sample size and the suitability of the chi-square test give us confidence that these results are reliable. The extremely small p-value further supports that this difference is unlikely to be due to chance.

library(ggplot2)
library(dplyr)

# Calculate the proportions of TRUE for each VisitorType
prop_data <- filtered_data |>
  group_by(VisitorType, Revenue) |>
  summarise(count = n()) |>
  mutate(percentage = count / sum(count) * 100)

## `summarise()` has grouped output by 'VisitorType'. You can override using the
## `.groups` argument.

# Create the stacked bar chart with percentage labels for TRUE (purchases)
ggplot(filtered_data, aes(x = VisitorType, fill = Revenue)) +
  geom_bar(position = "fill", color = "black") +  # Stacked bar chart for proportions
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("No Purchase", "Purchase")) +
  labs(title = "Proportion of Purchases by Visitor Type",
       x = "Visitor Type",
       y = "Proportion",
       fill = "Purchase Status") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right") +
  geom_text(data = subset(prop_data, Revenue == TRUE),  # Only label TRUE values
            aes(x = VisitorType, y = percentage / 100, label = paste0(round(percentage, 1), "%")),
            position = position_fill(vjust = 0.5), size = 5, color = "black")

Coclusions:

Significant difference between visitor types: The p-value from the Chi-Square test was extremely small (< 2.2e-16), indicating a statistically significant difference in purchase behavior between New Visitors and Returning Visitors. This is reflected in the chart, where New Visitors have a higher purchase proportion (27%) compared to Returning Visitors (16.2%).
Support for rejecting the null hypothesis: The X-squared value of 101.12 indicates a strong deviation from the null hypothesis, which assumed no difference in purchasing behavior. The chart visually supports this conclusion, showing that New Visitors are more likely to make a purchase compared to Returning Visitors, strengthening the case for rejecting the null hypothesis.

DataDive7_mohler

Carson

2024-10-16

Hypothesis 1:

Hypothesis 2: