library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
project_data <- read.csv("online_shoppers_intention.csv")
Head:
head(project_data)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.000000 0.20000000 0.2000000 0
## 2 2 64.000000 0.00000000 0.1000000 0
## 3 1 0.000000 0.20000000 0.2000000 0
## 4 2 2.666667 0.05000000 0.1400000 0
## 5 10 627.500000 0.02000000 0.0500000 0
## 6 19 154.216667 0.01578947 0.0245614 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 0 Feb 1 1 1 1
## 2 0 Feb 2 2 1 2
## 3 0 Feb 4 1 9 3
## 4 0 Feb 3 2 2 4
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## VisitorType Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
Variable List:
names(project_data)
## [1] "Administrative" "Administrative_Duration"
## [3] "Informational" "Informational_Duration"
## [5] "ProductRelated" "ProductRelated_Duration"
## [7] "BounceRates" "ExitRates"
## [9] "PageValues" "SpecialDay"
## [11] "Month" "OperatingSystems"
## [13] "Browser" "Region"
## [15] "TrafficType" "VisitorType"
## [17] "Weekend" "Revenue"
Aside from Month, VisitorType, and perhaps Weekend, almost none of the variables can be fully understood just by looking at the table.
To understand them we'll need to consult the dataset documentation.
After reading the documentation, the following variables are completely clear (a quick check of them follows this list):
Administrative, Informational, and ProductRelated: The total number of pages of each page type the visitor viewed during the session.
The corresponding _Duration variables: The total time, in seconds, the visitor spent on that page type.
Revenue: A boolean variable indicating whether the shopping session resulted in a purchase.
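As a quick sanity check of these (a minimal sketch of my own, not something the documentation calls for), we can look at their values directly:
# Quick check of the clearly documented variables
table(project_data$Revenue)                    # how many sessions ended in a purchase
summary(project_data$ProductRelated_Duration)  # seconds spent on product-related pages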
Also, after checking the documentation, some variables are mostly understood, but questions remain:
BounceRates: The percentage of visitors who enter the website through a given page and leave without triggering any other pages. But how is this represented for shopping sessions in which several different pages were viewed?
ExitRates: The percentage of pageviews on the website that end at that specific page. We're left with the same question as for BounceRates.
SpecialDay: The documentation explains that this measures how close the session was to a holiday or special day, on a scale from 0 to 1; what's not clear is which days count as special days (the values it actually takes are checked below).
OperatingSystems, Browser, and Region: Categorical variables encoded as numbers, but nowhere in the documentation is it stated what each number represents.
TrafficType: Numerically coded classes describing how the user reached the website, such as direct traffic, organic search, paid search, or referral traffic. The documentation links to an article explaining how Google Analytics records traffic types, but our dataset has 20 types, which doesn't match what the article describes, so it's hard to say which codes correspond to which types (the number of distinct codes is checked below).
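Two of the open questions above can at least be measured in the data itself; this is a quick check of my own, not something the documentation provides:
# Which values does SpecialDay actually take, and how often does each occur?
table(project_data$SpecialDay)
# How many distinct TrafficType codes appear in the dataset?
length(unique(project_data$TrafficType))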
Additionally, some variables remain confusing even with the data documentation, so I did some further investigation and found the following:
BounceRates and ExitRates: When a session includes multiple pages, the reported rate is the average across all the pages viewed.
SpecialDay: This variable isn't explained in more detail anywhere, but we can still draw value from it.
OperatingSystems, Browser, Region, TrafficType: To my knowledge no further explanation of these variables exists, but there are still ways to infer the meaning of the classes.
I'll use graphs and charts to try to learn more about these four, starting with Browser:
browser_counts <- table(project_data$Browser)
# Create labels for the pie chart, only showing percentages for 1, 2, 4, and 5
labels <- ifelse(names(browser_counts) %in% c("1", "2", "4", "5"),
paste(round(prop.table(browser_counts) * 100, 1), "%"), "")
# Create a pie chart with the specified labels
pie(browser_counts,
labels = labels, # Only shows % for 1, 2, 4, and 5
main = "Distribution of Browser Variable",
col = rainbow(length(browser_counts)))
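Since the tidyverse is already loaded, the same browser breakdown can also be read off a table; this is just a complementary sketch to the pie chart:
# Browser counts and percentage shares, largest first
project_data %>%
  count(Browser, sort = TRUE) %>%
  mutate(pct = round(100 * n / sum(n), 1))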
Next, I'll look at OperatingSystems:
table(project_data$OperatingSystems)
##
## 1 2 3 4 5 6 7 8
## 2585 6601 2555 478 6 19 7 79
os_counts <- table(project_data$OperatingSystems)
# Create labels for the pie chart, only showing percentages for 1, 2, 3, and 4
os_labels <- ifelse(names(os_counts) %in% c("1", "2", "3", "4"),
paste(round(prop.table(os_counts) * 100, 1), "%"), "")
# Create a pie chart with the specified labels
pie(os_counts,
labels = os_labels, # Only shows % for 1, 2, 3, and 4
main = "Distribution of Operating System Variable",
col = rainbow(length(os_counts)))
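One speculative way to read the numeric codes (my own idea, not backed by the documentation) is to cross-tabulate OperatingSystems against Browser: if an operating system code almost always appears with one particular browser code, that pairing likely corresponds to a common real-world platform combination.
# Cross-tabulate OS codes against browser codes to look for dominant pairings
table(OS = project_data$OperatingSystems, Browser = project_data$Browser)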
Next, we'll try to determine what some of the traffic types represent:
table(project_data$TrafficType)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2451 3913 2052 1069 260 444 40 343 42 450 247 1 738 13 38 3
## 17 18 19 20
## 1 10 17 198
# Create a new variable 'TotalPagesVisited' by combining the three page count variables
project_data$TotalPagesVisited <- project_data$Administrative +
project_data$Informational +
project_data$ProductRelated
# Calculate the average total page count for each TrafficType
avg_pages_by_traffic <- aggregate(TotalPagesVisited ~ TrafficType, data = project_data, FUN = mean)
# View the list of averages for each traffic type
print(avg_pages_by_traffic)
## TrafficType TotalPagesVisited
## 1 1 34.20441
## 2 2 41.75415
## 3 3 27.90789
## 4 4 31.37325
## 5 5 21.84231
## 6 6 32.09459
## 7 7 32.60000
## 8 8 29.49854
## 9 9 17.02381
## 10 10 35.75111
## 11 11 27.34413
## 12 12 3.00000
## 13 13 35.10298
## 14 14 85.07692
## 15 15 17.92105
## 16 16 18.33333
## 17 17 4.00000
## 18 18 16.30000
## 19 19 41.35294
## 20 20 22.47980
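A dplyr version of the same summary (a sketch, equivalent to the aggregate() call above) also records how many sessions each traffic type has, which matters because some of the averages above rest on a single session (types 12 and 17 each appear only once):
# Average pages visited per traffic type, with the number of sessions behind each average
project_data %>%
  group_by(TrafficType) %>%
  summarise(sessions = n(),
            avg_pages = mean(TotalPagesVisited)) %>%
  arrange(desc(avg_pages))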
Now I'll see if I can find anything out about the Region variable:
table(project_data$Region)
##
## 1 2 3 4 5 6 7 8 9
## 4780 1136 2403 1182 318 805 761 434 511
# Create a contingency table of Region and Revenue
region_revenue_counts <- table(project_data$Revenue, project_data$Region)
# Calculate percentages for False (assuming False is the first row in the table)
false_percents <- prop.table(region_revenue_counts, 2)[1, ] * 100
# Create a stacked bar plot for Region by Revenue
bar_positions <- barplot(region_revenue_counts,
main = "Distribution of Region by Revenue",
xlab = "Region",
ylab = "Count",
col = c("lightcoral", "lightblue"), # Two different colors for True/False
legend = rownames(region_revenue_counts), # Add legend for True/False
beside = FALSE) # Stacked bars
# Add percentage labels for the False part of each bar (first row in table)
text(bar_positions, region_revenue_counts[1, ] / 2,
labels = paste(round(false_percents, 1), "%"),
col = "black", cex = 0.8)
I have a better understanding of my variables now, but there is still more to learn about them.
The variables were probably stored this way for simplicity and so that certain modeling methods can work with them directly, but it certainly takes away some of the analysis value.
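If the priority is analysis rather than compact storage, one option (my own suggestion, not something the dataset prescribes) is to convert the coded identifiers to factors so that models and plots treat them as categories instead of ordered numbers:
# Treat the numerically coded identifiers as categorical variables
project_data <- project_data %>%
  mutate(across(c(OperatingSystems, Browser, Region, TrafficType),
                as.factor))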