Data Dive – Summaries
Bring in libraries:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read in the CSV:
project_data <- read.csv("online_shoppers_intention.csv")
Get summary statistics:
summary(project_data)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.8
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
## Month OperatingSystems Browser Region
## Length:12330 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.357 Mean :3.147
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 Length:12330 Mode :logical Mode :logical
## 1st Qu.: 2.00 Class :character FALSE:9462 FALSE:10422
## Median : 2.00 Mode :character TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
First, here are the column descriptions as provided by the data documentation:
Administrative: This is the number of pages of this type (administrative) that the user visited.
Administrative_Duration: This is the amount of time spent in this category of pages.
Informational: This is the number of pages of this type (informational) that the user visited.
Informational_Duration: This is the amount of time spent in this category of pages.
ProductRelated: This is the number of pages of this type (product related) that the user visited.
ProductRelated_Duration: This is the amount of time spent in this category of pages.
BounceRates: The percentage of visitors who enter the website through that page and exit without triggering any additional tasks.
ExitRates: The percentage of pageviews on the website that end at that specific page.
PageValues: The average value of the page averaged over the value of the target page and/or the completion of an eCommerce transaction.
SpecialDay: This value represents the closeness of the browsing date to special days or holidays (e.g., Mother’s Day or Valentine’s Day) on which the transaction is more likely to be finalized.
Month: Contains the month the pageview occurred, in string form.
OperatingSystems: An integer value representing the operating system that the user was on when viewing the page.
Browser: An integer value representing the browser that the user was using to view the page.
Region: An integer value representing which region the user is located in.
TrafficType: An integer value representing what type of traffic the user is categorized into.
VisitorType: A string representing whether a visitor is New Visitor, Returning Visitor, or Other.
Weekend: A boolean representing whether the session is on a weekend.
Revenue: A boolean representing whether or not the user completed the purchase.
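The descriptions above mark OperatingSystems, Browser, Region, and TrafficType as integer codes rather than quantities, so the means that summary() reports for them (for example 2.124 for OperatingSystems) have no real interpretation. A minimal sketch of one way to handle this is to convert those columns to factors. The `toy` data frame below is a hypothetical stand-in for project_data so the snippet runs without the CSV.

```r
# Hypothetical miniature of project_data; only the integer-coded columns are included.
toy <- data.frame(
  OperatingSystems = c(1L, 2L, 2L, 3L, 2L),
  Browser          = c(1L, 2L, 2L, 4L, 2L),
  Region           = c(1L, 3L, 1L, 4L, 9L),
  TrafficType      = c(1L, 2L, 2L, 4L, 20L)
)

# Columns whose numbers are category labels, per the column descriptions.
code_cols <- c("OperatingSystems", "Browser", "Region", "TrafficType")

# Convert each coded column to a factor so R treats it as categorical.
toy[code_cols] <- lapply(toy[code_cols], factor)

# summary() now reports counts per category instead of quartiles and means.
summary(toy)
```

With the real data, the same two lines (`code_cols` and the `lapply()` call) could be applied to project_data directly.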
Three Novel Questions:
Are shoppers more likely to purchase on the weekend?
How often do new visitors make purchases compared with our returning customers?
How do changes in our exit and bounce rates affect the likelihood of a purchase?
With that understood, let’s look at how the operating system and browser fields are distributed to judge whether they will be useful.
table(project_data$OperatingSystems)
##
## 1 2 3 4 5 6 7 8
## 2585 6601 2555 478 6 19 7 79
table(project_data$Browser)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 2462 7961 105 736 467 174 49 135 1 163 6 10 61
Because the source does not document these two variables, I’m unsure what each number (category) represents. By comparing the distribution with real-world usage shares, I may be able to infer which is which.
browser_counts <- table(project_data$Browser)
total_count <- sum(browser_counts)
browser_percentages <- (browser_counts / total_count) * 100
browser_distribution <- data.frame(
Browser = names(browser_counts),
Count = as.integer(browser_counts),
# as.numeric() drops the table class so this becomes a single Percentage column
Percentage = round(as.numeric(browser_percentages), 2)
)
print(browser_distribution)
## Browser Count Percentage
## 1 1 2462 19.97
## 2 2 7961 64.57
## 3 3 105 0.85
## 4 4 736 5.97
## 5 5 467 3.79
## 6 6 174 1.41
## 7 7 49 0.40
## 8 8 135 1.09
## 9 9 1 0.01
## 10 10 163 1.32
## 11 11 6 0.05
## 12 12 10 0.08
## 13 13 61 0.49
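As a cross-check, the same percentages can be computed in one step with prop.table(). The sketch below rebuilds a Browser vector from the counts printed above so that it runs without the CSV; the names `browser_counts_printed`, `browser`, and `browser_share` are illustrative, and with the real data `table(project_data$Browser)` would take the place of `table(browser)`.

```r
# Counts for browser codes 1 to 13, copied from the table() output above.
browser_counts_printed <- c(2462, 7961, 105, 736, 467, 174, 49, 135, 1, 163, 6, 10, 61)

# Rebuild one browser code per session from those counts.
browser <- rep(1:13, times = browser_counts_printed)

# prop.table() turns counts into proportions; scale to percent and round.
browser_share <- round(prop.table(table(browser)) * 100, 2)

# Largest shares first, which makes the dominant codes easy to spot.
sort(browser_share, decreasing = TRUE)
```

Sorting by share puts codes 2 and 1 at the top, which is the ordering to compare against real-world browser usage.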
Next, I want to see how many sessions ended in a purchase and how many didn’t across the full dataset. This is represented by ‘Revenue’.
table(project_data$Revenue)
##
## FALSE TRUE
## 10422 1908
So 1908 of the 12330 sessions ended in a purchase, which is 15.47%. Next, I want to compare this with the Weekend variable to see if there’s a simple relationship.
table(project_data$Weekend)
##
## FALSE TRUE
## 9462 2868
But are weekend visitors more likely to complete a purchase?
ggplot(project_data, aes(x = Weekend, fill = Revenue)) +
geom_bar(position = "stack") +
labs(title = "Purchase Distribution by Weekend",
x = "Weekend",
y = "Count",
fill = "Made a Purchase (Revenue)") +
theme_minimal()
weekend_filteredT <- filter(project_data, Weekend == TRUE)
weekend_filteredF <- filter(project_data, Weekend == FALSE)
weekend_purchase_percentage <- sum(weekend_filteredT$Revenue == TRUE) / nrow(weekend_filteredT) * 100
weekday_purchase_percentage <- sum(weekend_filteredF$Revenue == TRUE) / nrow(weekend_filteredF) * 100
print(weekend_purchase_percentage)
## [1] 17.39888
print(weekday_purchase_percentage)
## [1] 14.89114
So yes, a session is slightly more likely to end in a purchase on the weekend (17.40%) than on a weekday (14.89%).
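Whether a gap of this size could be due to chance can be checked with a two-sample test of proportions. The sketch below uses prop.test() from base R's stats package. The purchase counts (499 weekend, 1409 weekday) are not printed anywhere above; they are an assumption recovered from the printed percentages and the Weekend totals, and they do sum to the 1908 purchases in the Revenue table.

```r
# Purchases per group: assumed values, recovered as percentage * group size.
purchases <- c(weekend = 499, weekday = 1409)

# Sessions per group, from table(project_data$Weekend).
sessions <- c(weekend = 2868, weekday = 9462)

# Sanity check: the recovered counts reproduce the printed purchase rates.
round(purchases / sessions * 100, 2)

# Two-sample test for equality of proportions (continuity-corrected by default).
result <- prop.test(x = purchases, n = sessions)
result$p.value
```

A small p-value would suggest the weekend difference is unlikely to be noise alone, though it says nothing about why weekend sessions convert more often.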
Next, let’s look at how the sessions are spread across the months.
table(project_data$Month)
##
## Aug Dec Feb Jul June Mar May Nov Oct Sep
## 433 1727 184 432 288 1907 3364 2998 549 448
Let’s put them in calendar order and visualize them with a bar chart.
month_order <- c("Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)
ggplot(project_data, aes(x = Month)) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of Data by Month",
x = "Month",
y = "Count") +
theme_minimal()
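It appears site traffic picks up in the spring (March and May) and leading up to Christmas (November and December), which makes sense. Note that the data contains no sessions for January or April.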
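One caveat about the ordering step: factor() silently turns any value that is missing from `levels` into NA, so a spelling mismatch (the data uses "June", not "Jun") would quietly drop rows from the chart. A minimal check, using the month labels printed by table() above; `months_in_data`, `unmatched`, and `converted` are illustrative names.

```r
# Month labels exactly as they appear in the table() output.
months_in_data <- c("Aug", "Dec", "Feb", "Jul", "June", "Mar", "May", "Nov", "Oct", "Sep")

# The level order used for the chart.
month_order <- c("Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Any label that would become NA after factor(); empty means every label matches.
unmatched <- setdiff(months_in_data, month_order)
unmatched

# Confirm directly that the conversion introduces no NA values.
converted <- factor(months_in_data, levels = month_order)
anyNA(converted)

# For contrast, a label spelled differently from the levels becomes NA.
is.na(factor("Jun", levels = month_order))
```

With the real data, the same check could be run on `unique(project_data$Month)` before the column is converted.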
Next, let’s analyze the visitor type to see the distribution of new and returning customers and find out how often new visitors make purchases compared with our returning customers.
table(project_data$VisitorType)
##
## New_Visitor Other Returning_Visitor
## 1694 85 10551
Next, we can visualize this with a bar chart:
visitor_counts <- table(project_data$VisitorType)
visitor_data <- as.data.frame(visitor_counts)
names(visitor_data) <- c("VisitorType", "Count")
ggplot(visitor_data, aes(x = VisitorType, y = Count)) +
geom_bar(stat = "identity", fill = "blue", color = "black") +
labs(title = "Distribution of Visitor Types",
x = "Visitor Type",
y = "Count") +
theme_minimal()
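The chart shows how many sessions fall into each visitor type, but the question of how often each type actually purchases needs the Revenue column as well. A minimal sketch of that calculation follows. `purchase_rate_by` is a hypothetical helper and `toy` is made-up data used only so the snippet runs on its own; its rates are not results from the real dataset.

```r
# Hypothetical helper: percentage of sessions with Revenue == TRUE per group.
purchase_rate_by <- function(data, group_col) {
  # The mean of a logical vector is the share of TRUE values.
  rates <- tapply(data$Revenue, data[[group_col]], mean)
  round(rates * 100, 2)
}

# Made-up sessions for illustration only.
toy <- data.frame(
  VisitorType = c("New_Visitor", "New_Visitor",
                  "Returning_Visitor", "Returning_Visitor",
                  "Returning_Visitor", "Returning_Visitor"),
  Revenue = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

rates <- purchase_rate_by(toy, "VisitorType")
rates
```

On the real data the call would be `purchase_rate_by(project_data, "VisitorType")`, and the same helper would also work for the Weekend comparison.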
Next, let’s analyze the bounce rates and exit rates:
summary(project_data$BounceRates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.003112 0.022191 0.016813 0.200000
summary(project_data$ExitRates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01429 0.02516 0.04307 0.05000 0.20000
Let’s use box plots to visualize these in terms of their relationship with Revenue.
This way we can see how changes in our exit and bounce rates affect the likelihood of a purchase.
ggplot(project_data, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
geom_boxplot() +
labs(title = "Bounce Rates by Shopping Status",
x = "Made a Purchase (Revenue)",
y = "Bounce Rate") +
theme_minimal()
ggplot(project_data, aes(x = Revenue, y = ExitRates, fill = Revenue)) +
geom_boxplot() +
labs(title = "Exit Rates by Shopping Status",
x = "Made a Purchase (Revenue)",
y = "Exit Rate") +
theme_minimal()
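Box plots show the shape of each group but do not put numbers on the difference. One way to do that is to compute the median of each rate per Revenue group with aggregate(). The `toy` data frame below is made up for illustration and says nothing about the real dataset; `group_medians` is an illustrative name.

```r
# Made-up sessions for illustration only.
toy <- data.frame(
  Revenue     = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  BounceRates = c(0.02, 0.05, 0.20, 0.000, 0.010, 0.005),
  ExitRates   = c(0.05, 0.08, 0.20, 0.010, 0.020, 0.030)
)

# Median bounce and exit rate for purchasing and non-purchasing sessions.
group_medians <- aggregate(
  cbind(BounceRates, ExitRates) ~ Revenue,
  data = toy,
  FUN = median
)
group_medians
```

With the real data, replacing `toy` with `project_data` gives the medians that correspond to the center lines of the two box plots.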
Going forward, there are a few things I’d like to investigate:
First, I’d like to dig deeper into determining which operating systems the numeric codes represent.
I also want to get a better understanding of the duration variables and the PageValues variable. They would likely be useful in finding insights for my project.