Project_summary.knit

Data Dive – Summaries

Bring in libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read in the CSV:

project_data <- read.csv("online_shoppers_intention.csv")

Get summary statistics:

summary(project_data)

##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##     Month           OperatingSystems    Browser           Region     
##  Length:12330       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##   TrafficType    VisitorType         Weekend         Revenue       
##  Min.   : 1.00   Length:12330       Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Class :character   FALSE:9462      FALSE:10422    
##  Median : 2.00   Mode  :character   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                     
##  3rd Qu.: 4.00                                                     
##  Max.   :20.00

First, here are the column descriptions as provided by the data documentation:

Administrative: This is the number of pages of this type (administrative) that the user visited.

Administrative_Duration: This is the amount of time spent in this category of pages.

Informational: This is the number of pages of this type (informational) that the user visited.

Informational_Duration: This is the amount of time spent in this category of pages.

ProductRelated: This is the number of pages of this type (product related) that the user visited.

ProductRelated_Duration: This is the amount of time spent in this category of pages.

BounceRates: The percentage of visitors who enter the website through that page and exit without triggering any additional tasks.

ExitRates: The percentage of pageviews on the website that end at that specific page.

PageValues: The average value of the page averaged over the value of the target page and/or the completion of an eCommerce transaction.

SpecialDay: This value represents the closeness of the browsing date to special days or holidays (e.g., Mother’s Day or Valentine’s Day) on which the transaction is more likely to be finalized.

Month: Contains the month the pageview occurred, in string form.

OperatingSystems: An integer value representing the operating system that the user was on when viewing the page.

Browser: An integer value representing the browser that the user was using to view the page.

Region: An integer value representing which region the user is located in.

TrafficType: An integer value representing what type of traffic the user is categorized into.

VisitorType: A string representing whether a visitor is New Visitor, Returning Visitor, or Other.

Weekend: A boolean representing whether the session is on a weekend.

Revenue: A boolean representing whether or not the user completed the purchase.

With that understood, let’s determine whether the operating system and browser fields will be significant.

Three Novel Questions:

Are shoppers more likely to purchase on the weekend?

How often do new visitors choose to make purchases compared with our returning customers?

How do changes in our exit and bounce rates affect the likelihood of a purchase?

table(project_data$OperatingSystems)

## 
##    1    2    3    4    5    6    7    8 
## 2585 6601 2555  478    6   19    7   79

table(project_data$Browser)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13 
## 2462 7961  105  736  467  174   49  135    1  163    6   10   61

Because the source doesn’t document these 2 variables, I’m unsure what each number (category) represents. I thought that by comparing the distribution with real-world usage data I might be able to infer which is which.

For the operating systems, although 3 categories make up over 95% of the sessions, the numbers still don’t align with the real-world distribution of Windows, macOS, and Linux (macOS and Linux shouldn’t be so close to equal, yet categories 1 and 3 have nearly the same count). So it’s difficult to say what’s what.
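The "over 95%" figure can be checked directly from the printed counts. This is a minimal sketch with the counts copied by hand from the `table(project_data$OperatingSystems)` output above; the names `os_counts`, `os_share`, and `top3_share` are introduced here for illustration only.

```r
# Counts copied from the table(project_data$OperatingSystems) output above.
os_counts <- c("1" = 2585, "2" = 6601, "3" = 2555, "4" = 478,
               "5" = 6, "6" = 19, "7" = 7, "8" = 79)

# Share of sessions per operating system code, in percent.
os_share <- round(os_counts / sum(os_counts) * 100, 2)
print(os_share)

# Combined share of the three most common codes (1, 2 and 3).
top3_share <- sum(sort(os_counts, decreasing = TRUE)[1:3]) / sum(os_counts) * 100
print(round(top3_share, 2))
```

On the full dataset the same shares could be obtained with `prop.table(table(project_data$OperatingSystems))`.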

browser_counts <- table(project_data$Browser)
total_count <- sum(browser_counts)
browser_percentages <- (browser_counts / total_count) * 100

# as.numeric() drops the table structure so Percentage is a single numeric column
browser_distribution <- data.frame(
  Browser = names(browser_counts),
  Count = as.integer(browser_counts),
  Percentage = as.numeric(round(browser_percentages, 2))
)

print(browser_distribution)

##    Browser Count Percentage
## 1        1  2462      19.97
## 2        2  7961      64.57
## 3        3   105       0.85
## 4        4   736       5.97
## 5        5   467       3.79
## 6        6   174       1.41
## 7        7    49       0.40
## 8        8   135       1.09
## 9        9     1       0.01
## 10      10   163       1.32
## 11      11     6       0.05
## 12      12    10       0.08
## 13      13    61       0.49

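For browsers, the shares do roughly line up with the real-world distribution, which suggests that 2 is Chrome (65-70%), 1 is Safari (15-20%), 4 is Firefox (5-10%), and 5 is Edge (3-5%). This mapping is my inference; the documentation doesn’t confirm it.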

Next, I want to see how many sessions ended in a purchase and how many didn’t across the full dataset. This is recorded in the ‘Revenue’ column.

table(project_data$Revenue)

## 
## FALSE  TRUE 
## 10422  1908

So 1908 of the 12330 sessions ended in a purchase, which is 15.47%.

Next, I want to compare this with the Weekend variable to see if there’s a simple relationship.

table(project_data$Weekend)

## 
## FALSE  TRUE 
##  9462  2868

The number of sessions per day is lower on weekends than on weekdays (9462/5 ≈ 1892 sessions per weekday vs. 2868/2 = 1434 per weekend day).

But are they more likely to complete a purchase?

ggplot(project_data, aes(x = Weekend, fill = Revenue)) +
  geom_bar(position = "stack") +
  labs(title = "Purchase Distribution by Weekend",
       x = "Weekend",
       y = "Count",
       fill = "Made a Purchase (Revenue)") +
  theme_minimal()

weekend_filteredT <- filter(project_data, Weekend == TRUE)
weekend_filteredF <- filter(project_data, Weekend == FALSE)

weekend_purchase_percentage <- sum(weekend_filteredT$Revenue == TRUE) / nrow(weekend_filteredT) * 100
weekday_purchase_percentage <- sum(weekend_filteredF$Revenue == TRUE) / nrow(weekend_filteredF) * 100

print(weekend_purchase_percentage)

## [1] 17.39888

print(weekday_purchase_percentage)

## [1] 14.89114

So yes, a customer is slightly more likely to complete a purchase on the weekend (17.40% of sessions) than on a weekday (14.89%).
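Whether a gap of this size could be down to chance can be checked with a two-sample test of proportions. This is an additional check, not part of the analysis above, and a sketch only: the purchase counts (499 and 1409) are back-calculated from the printed percentages and session totals, and the test treats sessions as independent, which is an assumption.

```r
# Purchase counts implied by the percentages above:
# 17.39888% of 2868 weekend sessions and 14.89114% of 9462 weekday sessions.
purchases <- c(weekend = 499, weekday = 1409)
sessions  <- c(weekend = 2868, weekday = 9462)

# Two-sample test of equal proportions (base R, stats package).
weekend_test <- prop.test(x = purchases, n = sessions)

print(weekend_test$estimate)  # observed purchase rates for weekend and weekday
print(weekend_test$p.value)   # a small value suggests the gap is not just noise
```

With the full data frame loaded, `prop.test(table(project_data$Weekend, project_data$Revenue))` would be one way to run the same comparison without copying counts by hand.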

Next, let’s look at how the sessions are spread across the months.

table(project_data$Month)

## 
##  Aug  Dec  Feb  Jul June  Mar  May  Nov  Oct  Sep 
##  433 1727  184  432  288 1907 3364 2998  549  448

Let’s put them in calendar order and visualize them with a bar chart.

month_order <- c("Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)

ggplot(project_data, aes(x = Month)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Distribution of Data by Month",
       x = "Month",
       y = "Count") +
  theme_minimal()

It appears that activity picks up in the spring (March and May) and in the lead-up to Christmas (November and December), which makes sense.

I’m unsure why no data is included for January and April.
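The missing months can be confirmed programmatically instead of by eye. This is a small sketch with the month labels copied from the `table(project_data$Month)` output above; `observed_months`, `all_months`, and `missing_months` are illustrative names.

```r
# Month labels as they appear in the table(project_data$Month) output above.
observed_months <- c("Aug", "Dec", "Feb", "Jul", "June", "Mar", "May",
                     "Nov", "Oct", "Sep")

# Full calendar using the dataset's own spelling ("June" rather than "Jun").
all_months <- c("Jan", "Feb", "Mar", "Apr", "May", "June", "Jul",
                "Aug", "Sep", "Oct", "Nov", "Dec")

# Months with no sessions in the data.
missing_months <- setdiff(all_months, observed_months)
print(missing_months)
```

Note that the dataset spells June in full, so R’s built-in `month.abb` would not match it without adjustment.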

Next, let’s look at visitor type: how are sessions split between new and returning visitors, and how often do new visitors choose to make purchases compared with our returning customers?

table(project_data$VisitorType)

## 
##       New_Visitor             Other Returning_Visitor 
##              1694                85             10551

13.74% are new visitors
85.57% are returning visitors
0.69% are categorized as Other

Next, we can visualize this distribution with a bar chart.

visitor_counts <- table(project_data$VisitorType)

visitor_data <- as.data.frame(visitor_counts)

names(visitor_data) <- c("VisitorType", "Count")

ggplot(visitor_data, aes(x = VisitorType, y = Count)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Distribution of Visitor Types",
       x = "Visitor Type",
       y = "Count") +
  theme_minimal()
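The counts show how sessions split across visitor types, but comparing how often each type purchases also needs the Revenue column. Below is one possible sketch using base R on a small made-up data frame; `toy_sessions` and its values are hypothetical and only stand in for `project_data`, which has the same two column names.

```r
# Hypothetical toy data with the same column names as project_data;
# replace `toy_sessions` with project_data to run on the real dataset.
toy_sessions <- data.frame(
  VisitorType = c("New_Visitor", "New_Visitor", "Returning_Visitor",
                  "Returning_Visitor", "Returning_Visitor", "Other"),
  Revenue     = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the share of TRUE values,
# so this gives the purchase rate (in percent) per visitor type.
purchase_rate <- tapply(toy_sessions$Revenue, toy_sessions$VisitorType, mean) * 100
print(round(purchase_rate, 2))
```

The printed rates here describe the toy data only; the real comparison comes from running the same `tapply()` call on `project_data`.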

Next, let’s analyze the bounce rates and exit rates.

summary(project_data$BounceRates)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.003112 0.022191 0.016813 0.200000

summary(project_data$ExitRates)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01429 0.02516 0.04307 0.05000 0.20000

Let’s use box plots to visualize how these relate to Revenue.

This way we can answer the question: how do changes in our exit and bounce rates affect the likelihood of a purchase?

ggplot(project_data, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
  geom_boxplot() +
  labs(title = "Bounce Rates by Shopping Status",
       x = "Made a Purchase (Revenue)",
       y = "Bounce Rate") +
  theme_minimal()

ggplot(project_data, aes(x = Revenue, y = ExitRates, fill = Revenue)) +
  geom_boxplot() +
  labs(title = "Exit Rates by Shopping Status",
       x = "Made a Purchase (Revenue)",
       y = "Exit Rate") +
  theme_minimal()

Both box plots suggest a negative relationship: sessions that ended in a purchase tend to have lower bounce and exit rates, so the likelihood of a purchase increases as the bounce rate and exit rate decrease.
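The visual impression from the box plots can be backed up with group medians. This is a sketch on made-up values: `toy_rates` is a hypothetical stand-in for `project_data` with the same column names, and the numbers are invented for illustration.

```r
# Hypothetical toy data; swap in project_data for the real analysis.
toy_rates <- data.frame(
  Revenue     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  BounceRates = c(0.00, 0.01, 0.02, 0.05, 0.10, 0.20),
  ExitRates   = c(0.01, 0.02, 0.03, 0.06, 0.12, 0.20)
)

# Median bounce and exit rate for sessions with and without a purchase.
rate_medians <- aggregate(cbind(BounceRates, ExitRates) ~ Revenue,
                          data = toy_rates, FUN = median)
print(rate_medians)
```

Medians are used here because both rates are strongly right-skewed in the summaries above, so the mean would be pulled toward the few high values.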

Going forward, there are a few things I’d like to investigate:

First, I’d like to dig deeper into working out which operating systems the category codes represent.
I also want to get a better understanding of the duration variables and the PageValues variable. They would likely be useful in finding insights for my project.