I decided to group by the categorical variables VisitorType and Revenue, and then compare them against continuous variables like bounce/exit rates and the duration metrics.
I figured this could lead to insights into what leads customers to leave the website or to make a purchase.
First, I load the libraries:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra) #to view later visualizations side by side
## Warning: package 'gridExtra' was built under R version 4.3.3
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
Then I look at summary() and head() of the data to see the distributions, which groups are smallest/largest, and a few other summary details to get started:
project_data <- read.csv("online_shoppers_intention.csv")
summary(project_data)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.8
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
## Month OperatingSystems Browser Region
## Length:12330 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.357 Mean :3.147
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 Length:12330 Mode :logical Mode :logical
## 1st Qu.: 2.00 Class :character FALSE:9462 FALSE:10422
## Median : 2.00 Mode :character TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
print(head(project_data[,c('VisitorType','Administrative_Duration','Informational_Duration','ProductRelated_Duration', 'ExitRates','BounceRates', 'Revenue')], 150))
## VisitorType Administrative_Duration Informational_Duration
## 1 Returning_Visitor 0.00000 0
## 2 Returning_Visitor 0.00000 0
## 3 Returning_Visitor 0.00000 0
## 4 Returning_Visitor 0.00000 0
## 5 Returning_Visitor 0.00000 0
## 6 Returning_Visitor 0.00000 0
## 7 Returning_Visitor 0.00000 0
## 8 Returning_Visitor 0.00000 0
## 9 Returning_Visitor 0.00000 0
## 10 Returning_Visitor 0.00000 0
## 11 Returning_Visitor 0.00000 0
## 12 Returning_Visitor 0.00000 0
## 13 Returning_Visitor 0.00000 0
## 14 Returning_Visitor 0.00000 0
## 15 Returning_Visitor 0.00000 0
## 16 Returning_Visitor 53.00000 0
## 17 Returning_Visitor 0.00000 0
## 18 Returning_Visitor 0.00000 0
## 19 Returning_Visitor 0.00000 0
## 20 Returning_Visitor 0.00000 0
## 21 Returning_Visitor 0.00000 0
## 22 Returning_Visitor 0.00000 0
## 23 Returning_Visitor 0.00000 0
## 24 Returning_Visitor 0.00000 0
## 25 Returning_Visitor 0.00000 0
## 26 Returning_Visitor 0.00000 0
## 27 Returning_Visitor 64.60000 0
## 28 Returning_Visitor 0.00000 0
## 29 Returning_Visitor 0.00000 0
## 30 Returning_Visitor 6.00000 0
## 31 Returning_Visitor 0.00000 0
## 32 Returning_Visitor 0.00000 0
## 33 Returning_Visitor 0.00000 0
## 34 Returning_Visitor 0.00000 0
## 35 Returning_Visitor 0.00000 0
## 36 Returning_Visitor 0.00000 0
## 37 Returning_Visitor 0.00000 0
## 38 Returning_Visitor 18.00000 0
## 39 Returning_Visitor 0.00000 0
## 40 Returning_Visitor 0.00000 0
## 41 Returning_Visitor 9.00000 0
## 42 Returning_Visitor 0.00000 0
## 43 Returning_Visitor 0.00000 0
## 44 Returning_Visitor 0.00000 0
## 45 Returning_Visitor 0.00000 0
## 46 Returning_Visitor 0.00000 0
## 47 Returning_Visitor 0.00000 0
## 48 Returning_Visitor 0.00000 0
## 49 Returning_Visitor 0.00000 0
## 50 Returning_Visitor 0.00000 0
## 51 Returning_Visitor 0.00000 0
## 52 Returning_Visitor 0.00000 0
## 53 Returning_Visitor 0.00000 0
## 54 Returning_Visitor 0.00000 0
## 55 Returning_Visitor 0.00000 0
## 56 Returning_Visitor 0.00000 0
## 57 Returning_Visitor 0.00000 0
## 58 Returning_Visitor 56.00000 120
## 59 Returning_Visitor 0.00000 0
## 60 Returning_Visitor 16.00000 0
## 61 Returning_Visitor 0.00000 0
## 62 Returning_Visitor 0.00000 0
## 63 Returning_Visitor 279.41667 0
## 64 Returning_Visitor 0.00000 0
## 65 Returning_Visitor 0.00000 0
## 66 Returning_Visitor 87.83333 0
## 67 Returning_Visitor 44.00000 0
## 68 Returning_Visitor 0.00000 0
## 69 Returning_Visitor 0.00000 0
## 70 Returning_Visitor 0.00000 0
## 71 Returning_Visitor 0.00000 0
## 72 Returning_Visitor 0.00000 0
## 73 Returning_Visitor 0.00000 0
## 74 Returning_Visitor 0.00000 0
## 75 Returning_Visitor 0.00000 0
## 76 Returning_Visitor 0.00000 0
## 77 Returning_Visitor 1005.66667 0
## 78 Returning_Visitor 0.00000 0
## 79 Returning_Visitor 0.00000 0
## 80 Returning_Visitor 0.00000 0
## 81 Returning_Visitor 0.00000 0
## 82 Returning_Visitor 0.00000 0
## 83 Returning_Visitor 0.00000 0
## 84 Returning_Visitor 0.00000 0
## 85 Returning_Visitor 0.00000 0
## 86 Returning_Visitor 0.00000 0
## 87 Returning_Visitor 0.00000 0
## 88 Returning_Visitor 0.00000 0
## 89 Returning_Visitor 0.00000 0
## 90 Returning_Visitor 0.00000 0
## 91 Returning_Visitor 36.00000 0
## 92 Returning_Visitor 0.00000 0
## 93 Returning_Visitor 0.00000 0
## 94 New_Visitor 0.00000 0
## 95 Returning_Visitor 0.00000 0
## 96 Returning_Visitor 0.00000 0
## 97 Returning_Visitor 0.00000 0
## 98 Returning_Visitor 0.00000 0
## 99 Returning_Visitor 0.00000 0
## 100 Returning_Visitor 0.00000 0
## 101 Returning_Visitor 18.33333 0
## 102 Returning_Visitor 61.00000 0
## 103 Returning_Visitor 0.00000 0
## 104 Returning_Visitor 31.00000 16
## 105 Returning_Visitor 58.50000 0
## 106 Returning_Visitor 0.00000 0
## 107 Returning_Visitor 0.00000 0
## 108 Returning_Visitor 0.00000 0
## 109 Returning_Visitor 0.00000 0
## 110 Returning_Visitor 326.25000 94
## 111 Returning_Visitor 0.00000 0
## 112 Returning_Visitor 0.00000 0
## 113 Returning_Visitor 0.00000 0
## 114 Returning_Visitor 22.00000 0
## 115 Returning_Visitor 0.00000 93
## 116 Returning_Visitor 0.00000 0
## 117 Returning_Visitor 0.00000 0
## 118 Returning_Visitor 0.00000 0
## 119 Returning_Visitor 0.00000 0
## 120 Returning_Visitor 0.00000 0
## 121 Returning_Visitor 0.00000 0
## 122 Returning_Visitor 0.00000 0
## 123 Returning_Visitor 0.00000 75
## 124 Returning_Visitor 0.00000 0
## 125 Returning_Visitor 0.00000 0
## 126 Returning_Visitor 0.00000 0
## 127 Returning_Visitor 0.00000 0
## 128 Returning_Visitor 0.00000 0
## 129 Returning_Visitor 462.00000 0
## 130 Returning_Visitor 0.00000 0
## 131 Returning_Visitor 0.00000 0
## 132 Returning_Visitor 20.00000 0
## 133 Returning_Visitor 0.00000 0
## 134 Returning_Visitor 0.00000 0
## 135 Returning_Visitor 111.50000 0
## 136 Returning_Visitor 0.00000 0
## 137 Returning_Visitor 0.00000 0
## 138 Returning_Visitor 0.00000 0
## 139 Returning_Visitor 0.00000 0
## 140 Returning_Visitor 103.62500 0
## 141 Returning_Visitor 0.00000 0
## 142 Returning_Visitor 0.00000 0
## 143 Returning_Visitor 19.00000 0
## 144 Returning_Visitor 0.00000 0
## 145 Returning_Visitor 0.00000 19
## 146 Returning_Visitor 0.00000 0
## 147 Returning_Visitor 0.00000 0
## 148 Returning_Visitor 0.00000 0
## 149 Returning_Visitor 0.00000 0
## 150 Returning_Visitor 0.00000 0
## ProductRelated_Duration ExitRates BounceRates Revenue
## 1 0.000000 0.200000000 0.200000000 FALSE
## 2 64.000000 0.100000000 0.000000000 FALSE
## 3 0.000000 0.200000000 0.200000000 FALSE
## 4 2.666667 0.140000000 0.050000000 FALSE
## 5 627.500000 0.050000000 0.020000000 FALSE
## 6 154.216667 0.024561404 0.015789474 FALSE
## 7 0.000000 0.200000000 0.200000000 FALSE
## 8 0.000000 0.200000000 0.200000000 FALSE
## 9 37.000000 0.100000000 0.000000000 FALSE
## 10 738.000000 0.022222222 0.000000000 FALSE
## 11 395.000000 0.066666667 0.000000000 FALSE
## 12 407.750000 0.025833333 0.018750000 FALSE
## 13 280.500000 0.028571429 0.000000000 FALSE
## 14 98.000000 0.066666667 0.000000000 FALSE
## 15 68.000000 0.100000000 0.000000000 FALSE
## 16 1668.285119 0.016312636 0.008333333 FALSE
## 17 0.000000 0.200000000 0.200000000 FALSE
## 18 334.966667 0.007692308 0.000000000 FALSE
## 19 32.000000 0.100000000 0.000000000 FALSE
## 20 2981.166667 0.010000000 0.000000000 FALSE
## 21 136.166667 0.008333333 0.000000000 FALSE
## 22 0.000000 0.200000000 0.200000000 FALSE
## 23 105.000000 0.033333333 0.000000000 FALSE
## 24 15.000000 0.100000000 0.000000000 FALSE
## 25 0.000000 0.200000000 0.200000000 FALSE
## 26 156.000000 0.040000000 0.000000000 FALSE
## 27 1135.444444 0.009523810 0.002857143 FALSE
## 28 76.000000 0.100000000 0.050000000 FALSE
## 29 63.000000 0.050000000 0.000000000 FALSE
## 30 1582.750000 0.050821256 0.043478261 FALSE
## 31 35.000000 0.100000000 0.000000000 FALSE
## 32 78.000000 0.066666667 0.000000000 FALSE
## 33 209.500000 0.025000000 0.000000000 FALSE
## 34 183.666667 0.080000000 0.040000000 FALSE
## 35 380.500000 0.028571429 0.014285714 FALSE
## 36 2086.242857 0.020352564 0.015384615 FALSE
## 37 388.000000 0.056250000 0.025000000 FALSE
## 38 298.000000 0.028571429 0.000000000 FALSE
## 39 63.000000 0.071428571 0.028571429 FALSE
## 40 482.000000 0.022222222 0.000000000 FALSE
## 41 4084.393939 0.001794872 0.000000000 FALSE
## 42 22.000000 0.066666667 0.000000000 FALSE
## 43 310.166667 0.006666667 0.000000000 FALSE
## 44 34.000000 0.050000000 0.000000000 FALSE
## 45 88.000000 0.050000000 0.000000000 FALSE
## 46 622.250000 0.006060606 0.003030303 FALSE
## 47 222.400000 0.057142857 0.017142857 FALSE
## 48 80.000000 0.133333333 0.066666667 FALSE
## 49 800.833333 0.003636364 0.000000000 FALSE
## 50 0.000000 0.200000000 0.200000000 FALSE
## 51 0.000000 0.200000000 0.200000000 FALSE
## 52 265.166667 0.026111111 0.011111111 FALSE
## 53 29.000000 0.100000000 0.000000000 FALSE
## 54 160.000000 0.075000000 0.000000000 FALSE
## 55 135.666667 0.025000000 0.050000000 FALSE
## 56 0.000000 0.200000000 0.200000000 FALSE
## 57 0.000000 0.200000000 0.200000000 FALSE
## 58 998.741667 0.014736467 0.000000000 FALSE
## 59 104.000000 0.050000000 0.000000000 FALSE
## 60 381.686508 0.046568627 0.011764706 FALSE
## 61 169.000000 0.016666667 0.000000000 FALSE
## 62 400.800000 0.120833333 0.050000000 FALSE
## 63 1553.583333 0.019666667 0.009000000 FALSE
## 64 706.500000 0.007142857 0.000000000 FALSE
## 65 0.000000 0.200000000 0.200000000 FALSE
## 66 798.333333 0.012643678 0.000000000 TRUE
## 67 6951.972222 0.015013034 0.002150538 FALSE
## 68 0.000000 0.200000000 0.200000000 FALSE
## 69 902.000000 0.007407407 0.000000000 FALSE
## 70 0.000000 0.200000000 0.200000000 FALSE
## 71 0.000000 0.200000000 0.200000000 FALSE
## 72 197.377778 0.052500000 0.025000000 FALSE
## 73 43.000000 0.100000000 0.000000000 FALSE
## 74 426.666667 0.012500000 0.000000000 FALSE
## 75 135.000000 0.066666667 0.000000000 FALSE
## 76 588.333333 0.025000000 0.000000000 FALSE
## 77 2111.341667 0.014492754 0.004347826 TRUE
## 78 76.000000 0.050000000 0.000000000 FALSE
## 79 0.000000 0.200000000 0.200000000 FALSE
## 80 0.000000 0.200000000 0.200000000 FALSE
## 81 208.000000 0.028571429 0.000000000 FALSE
## 82 270.000000 0.016666667 0.000000000 FALSE
## 83 39.500000 0.020000000 0.000000000 FALSE
## 84 375.000000 0.058333333 0.016666667 FALSE
## 85 0.000000 0.200000000 0.200000000 FALSE
## 86 0.000000 0.200000000 0.200000000 FALSE
## 87 150.000000 0.085714286 0.057142857 FALSE
## 88 138.000000 0.066666667 0.000000000 FALSE
## 89 337.500000 0.023809524 0.028571429 FALSE
## 90 620.033333 0.007894737 0.000000000 FALSE
## 91 168.846154 0.011764706 0.000000000 FALSE
## 92 0.000000 0.200000000 0.200000000 FALSE
## 93 52.000000 0.100000000 0.000000000 FALSE
## 94 649.250000 0.015384615 0.000000000 FALSE
## 95 925.333333 0.025925926 0.003703704 FALSE
## 96 33.000000 0.100000000 0.000000000 FALSE
## 97 1566.500000 0.066666667 0.050000000 FALSE
## 98 105.000000 0.025000000 0.000000000 FALSE
## 99 50.000000 0.080952381 0.038095238 FALSE
## 100 644.200000 0.031666667 0.004166667 FALSE
## 101 2635.177778 0.008947368 0.000000000 FALSE
## 102 607.000000 0.026984127 0.000000000 TRUE
## 103 415.000000 0.033333333 0.000000000 FALSE
## 104 2083.530952 0.013510250 0.000000000 FALSE
## 105 281.666667 0.017647059 0.007058824 FALSE
## 106 27.000000 0.150000000 0.150000000 FALSE
## 107 215.000000 0.074074074 0.044444444 FALSE
## 108 162.000000 0.066666667 0.050000000 FALSE
## 109 114.000000 0.066666667 0.050000000 FALSE
## 110 5062.213753 0.017918470 0.000854701 FALSE
## 111 253.000000 0.016666667 0.000000000 FALSE
## 112 0.000000 0.200000000 0.200000000 FALSE
## 113 0.000000 0.200000000 0.200000000 FALSE
## 114 436.500000 0.024000000 0.008000000 FALSE
## 115 1045.833333 0.035483871 0.012903226 FALSE
## 116 317.833333 0.005555556 0.000000000 FALSE
## 117 1146.333333 0.012962963 0.000000000 FALSE
## 118 577.000000 0.027272727 0.018181818 FALSE
## 119 95.000000 0.075000000 0.000000000 FALSE
## 120 301.500000 0.111111111 0.000000000 FALSE
## 121 243.000000 0.033333333 0.000000000 FALSE
## 122 135.500000 0.050000000 0.000000000 FALSE
## 123 442.333333 0.034375000 0.000000000 FALSE
## 124 41.000000 0.066666667 0.000000000 FALSE
## 125 292.900000 0.119523810 0.028000000 FALSE
## 126 0.000000 0.200000000 0.200000000 FALSE
## 127 1238.333333 0.040740741 0.000000000 FALSE
## 128 225.466667 0.050000000 0.000000000 FALSE
## 129 1873.216667 0.007547170 0.000000000 FALSE
## 130 13.000000 0.166666667 0.133333333 FALSE
## 131 222.000000 0.100000000 0.000000000 FALSE
## 132 214.833333 0.022222222 0.000000000 FALSE
## 133 0.000000 0.200000000 0.200000000 FALSE
## 134 2017.166667 0.011111111 0.000000000 FALSE
## 135 449.027778 0.018518519 0.000000000 FALSE
## 136 158.166667 0.033333333 0.000000000 FALSE
## 137 303.666667 0.046296296 0.005555556 FALSE
## 138 25.000000 0.066666667 0.000000000 FALSE
## 139 42.000000 0.150000000 0.050000000 FALSE
## 140 1003.416667 0.018039216 0.000000000 FALSE
## 141 0.000000 0.200000000 0.200000000 FALSE
## 142 272.500000 0.011851852 0.000000000 FALSE
## 143 344.000000 0.007407407 0.000000000 FALSE
## 144 0.000000 0.200000000 0.200000000 FALSE
## 145 852.000000 0.009090909 0.000000000 FALSE
## 146 780.000000 0.066666667 0.066666667 FALSE
## 147 1038.000000 0.033333333 0.000000000 FALSE
## 148 573.000000 0.050000000 0.000000000 FALSE
## 149 1231.233333 0.030357143 0.007142857 FALSE
## 150 316.666667 0.027272727 0.000000000 FALSE
df_visitor_type <- project_data |>
  group_by(VisitorType) |>
  summarize(
    count = n(),                            # Count the number of rows in each group
    revenue_rate = mean(Revenue),           # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates)          # Average bounce rate
  )
print(df_visitor_type)
## # A tibble: 3 × 5
## VisitorType count revenue_rate total_revenue_sessions avg_bounce
## <chr> <int> <dbl> <int> <dbl>
## 1 New_Visitor 1694 0.249 422 0.00526
## 2 Other 85 0.188 16 0.0386
## 3 Returning_Visitor 10551 0.139 1470 0.0248
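What we find is that, setting aside the small “Other” category (85 sessions), our smallest group is new customers who do make a purchase (only 422 sessions), even though new visitors have the highest purchase rate (24.9%). So let’s analyze this group based on bounce and exit rates and the durations on certain page types.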
From our head() we notice that there are a lot of zero values in the duration variables, so I chose to divide the data set into 3 subsets, one per duration variable, each keeping only the rows where that particular duration isn’t equal to zero. This way our data won’t be skewed towards the zeros.
As for the reason for all these zero values in the durations, it’s hard to tell why it’s happening given the limited documentation.
You would think it would be impossible to spend zero seconds on a page type that was actually visited and still have that be a legitimate value for analysis.
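To get a feel for how common the zeros are, one option is to compute the share of zero values per duration column. This is only a sketch: `toy` is a made-up stand-in for `project_data`, and its values are invented for illustration.

```r
# Toy stand-in for project_data; the values are made up for illustration only.
toy <- data.frame(
  Administrative_Duration = c(0, 53, 0, 0, 64.6),
  Informational_Duration  = c(0, 0, 0, 120, 0),
  ProductRelated_Duration = c(0, 64, 627.5, 0, 1135.4)
)

duration_cols <- c("Administrative_Duration",
                   "Informational_Duration",
                   "ProductRelated_Duration")

# Comparing to 0 gives a logical matrix; colMeans() then returns the
# proportion of zero-duration sessions for each page type.
zero_share <- colMeans(toy[, duration_cols] == 0)
print(zero_share)
```

On the real data, the same two lines should work with `project_data` in place of `toy`, assuming the column names match the ones shown in the summary.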
We’ll also go ahead and isolate the variables we need and remove all the “Other” instances of VisitorType.
# Administrative duration subset
df_admin_duration <- subset(project_data, Administrative_Duration != 0 & VisitorType != "Other")
df_admin_duration <- df_admin_duration[, c("Administrative_Duration",
                                           "Informational_Duration",
                                           "ProductRelated_Duration",
                                           "ExitRates",
                                           "BounceRates",
                                           "VisitorType",
                                           "Revenue")]
# Informational duration subset
df_info_duration <- subset(project_data, Informational_Duration != 0 & VisitorType != "Other")
df_info_duration <- df_info_duration[, c("Administrative_Duration",
                                         "Informational_Duration",
                                         "ProductRelated_Duration",
                                         "ExitRates",
                                         "BounceRates",
                                         "VisitorType",
                                         "Revenue")]
# Product related duration subset
df_product_duration <- subset(project_data, ProductRelated_Duration != 0 & VisitorType != "Other")
df_product_duration <- df_product_duration[, c("Administrative_Duration",
                                               "Informational_Duration",
                                               "ProductRelated_Duration",
                                               "ExitRates",
                                               "BounceRates",
                                               "VisitorType",
                                               "Revenue")]
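Since the three subsets follow the same recipe, the steps could also be wrapped in a small helper. This is a sketch, not what I actually ran: `subset_nonzero` is a hypothetical function name, and `toy` is made-up data used only to show the behavior.

```r
# Hypothetical helper: keep the rows where `duration_col` is non-zero,
# drop the "Other" visitor type, and keep only the columns used later.
subset_nonzero <- function(data, duration_col) {
  keep_cols <- c("Administrative_Duration", "Informational_Duration",
                 "ProductRelated_Duration", "ExitRates", "BounceRates",
                 "VisitorType", "Revenue")
  rows <- data[[duration_col]] != 0 & data$VisitorType != "Other"
  # which() drops rows where the condition is NA, as subset() does
  data[which(rows), keep_cols]
}

# Made-up rows for illustration only
toy <- data.frame(
  Administrative_Duration = c(0, 53, 10, 0),
  Informational_Duration  = c(0, 0, 5, 120),
  ProductRelated_Duration = c(64, 0, 30, 998.7),
  ExitRates               = c(0.1, 0.02, 0.05, 0.01),
  BounceRates             = c(0, 0.01, 0.02, 0),
  VisitorType             = c("Returning_Visitor", "New_Visitor",
                              "Other", "Returning_Visitor"),
  Revenue                 = c(FALSE, TRUE, FALSE, FALSE)
)

toy_admin   <- subset_nonzero(toy, "Administrative_Duration")
toy_info    <- subset_nonzero(toy, "Informational_Duration")
toy_product <- subset_nonzero(toy, "ProductRelated_Duration")
```

With the real data the call would presumably look like `subset_nonzero(project_data, "Administrative_Duration")`, which keeps the filtering rule in one place if it ever needs to change.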
Now let’s look at summaries and head() of these 3 subsets.
# Print the summaries of the new data frames
summary(df_admin_duration)
## Administrative_Duration Informational_Duration ProductRelated_Duration
## Min. : 1.333 Min. : 0.00 Min. : 0.0
## 1st Qu.: 40.000 1st Qu.: 0.00 1st Qu.: 411.5
## Median : 88.000 Median : 0.00 Median : 969.8
## Mean : 155.015 Mean : 57.04 Mean : 1673.4
## 3rd Qu.: 183.000 3rd Qu.: 21.44 3rd Qu.: 2032.2
## Max. :3398.750 Max. :2549.38 Max. :63973.5
## ExitRates BounceRates VisitorType Revenue
## Min. :0.00000 Min. :0.000000 Length:6394 Mode :logical
## 1st Qu.:0.01111 1st Qu.:0.000000 Class :character FALSE:5044
## Median :0.01852 Median :0.001905 Mode :character TRUE :1350
## Mean :0.02270 Mean :0.007063
## 3rd Qu.:0.02941 3rd Qu.:0.009684
## Max. :0.15000 Max. :0.161905
summary(df_info_duration)
## Administrative_Duration Informational_Duration ProductRelated_Duration
## Min. : 0.00 Min. : 1.0 Min. : 0.0
## 1st Qu.: 23.00 1st Qu.: 28.1 1st Qu.: 727.4
## Median : 97.75 Median : 76.0 Median : 1572.5
## Mean : 173.82 Mean : 176.8 Mean : 2490.5
## 3rd Qu.: 215.51 3rd Qu.: 201.0 3rd Qu.: 3066.3
## Max. :3398.75 Max. :2549.4 Max. :63973.5
## ExitRates BounceRates VisitorType Revenue
## Min. :0.00000 Min. :0.000000 Length:2399 Mode :logical
## 1st Qu.:0.01215 1st Qu.:0.000000 Class :character FALSE:1836
## Median :0.01909 Median :0.004167 Mode :character TRUE :563
## Mean :0.02266 Mean :0.007678
## 3rd Qu.:0.02915 3rd Qu.:0.010690
## Max. :0.13333 Max. :0.090323
summary(df_product_duration)
## Administrative_Duration Informational_Duration ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.5
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 243.4
## Median : 16.00 Median : 0.00 Median : 673.0
## Mean : 85.96 Mean : 36.83 Mean : 1276.0
## 3rd Qu.: 100.83 3rd Qu.: 0.00 3rd Qu.: 1560.8
## Max. :3398.75 Max. :2549.38 Max. :63973.5
## ExitRates BounceRates VisitorType Revenue
## Min. :0.00000 Min. :0.000000 Length:11507 Mode :logical
## 1st Qu.:0.01339 1st Qu.:0.000000 Class :character FALSE:9628
## Median :0.02403 Median :0.002062 Mode :character TRUE :1879
## Mean :0.03328 Mean :0.011472
## 3rd Qu.:0.04187 3rd Qu.:0.013520
## Max. :0.19231 Max. :0.183333
# Print the first few rows of each new data frame to verify it's what we want
print(head(df_admin_duration))
## Administrative_Duration Informational_Duration ProductRelated_Duration
## 16 53.0 0 1668.2851
## 27 64.6 0 1135.4444
## 30 6.0 0 1582.7500
## 38 18.0 0 298.0000
## 41 9.0 0 4084.3939
## 58 56.0 120 998.7417
## ExitRates BounceRates VisitorType Revenue
## 16 0.016312636 0.008333333 Returning_Visitor FALSE
## 27 0.009523810 0.002857143 Returning_Visitor FALSE
## 30 0.050821256 0.043478261 Returning_Visitor FALSE
## 38 0.028571429 0.000000000 Returning_Visitor FALSE
## 41 0.001794872 0.000000000 Returning_Visitor FALSE
## 58 0.014736467 0.000000000 Returning_Visitor FALSE
print(head(df_info_duration))
## Administrative_Duration Informational_Duration ProductRelated_Duration
## 58 56.00 120 998.7417
## 104 31.00 16 2083.5310
## 110 326.25 94 5062.2138
## 115 0.00 93 1045.8333
## 123 0.00 75 442.3333
## 145 0.00 19 852.0000
## ExitRates BounceRates VisitorType Revenue
## 58 0.014736467 0.000000000 Returning_Visitor FALSE
## 104 0.013510250 0.000000000 Returning_Visitor FALSE
## 110 0.017918470 0.000854701 Returning_Visitor FALSE
## 115 0.035483871 0.012903226 Returning_Visitor FALSE
## 123 0.034375000 0.000000000 Returning_Visitor FALSE
## 145 0.009090909 0.000000000 Returning_Visitor FALSE
print(head(df_product_duration))
## Administrative_Duration Informational_Duration ProductRelated_Duration
## 2 0 0 64.000000
## 4 0 0 2.666667
## 5 0 0 627.500000
## 6 0 0 154.216667
## 9 0 0 37.000000
## 10 0 0 738.000000
## ExitRates BounceRates VisitorType Revenue
## 2 0.10000000 0.00000000 Returning_Visitor FALSE
## 4 0.14000000 0.05000000 Returning_Visitor FALSE
## 5 0.05000000 0.02000000 Returning_Visitor FALSE
## 6 0.02456140 0.01578947 Returning_Visitor FALSE
## 9 0.10000000 0.00000000 Returning_Visitor FALSE
## 10 0.02222222 0.00000000 Returning_Visitor FALSE
Let’s first look at a comparison of exit/bounce rates on the different page types overall.
We’ll get some summary stats on the different variables involved in each of our subsets.
# Summarize the data for df_admin_duration
df_admin_summary <- df_admin_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),                                          # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),           # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)          # Average bounce rate
  )
# Summarize the data for df_info_duration
df_info_summary <- df_info_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),                                          # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),           # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)          # Average bounce rate
  )
# Summarize the data for df_product_duration
df_product_summary <- df_product_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),                                          # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),           # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)          # Average bounce rate
  )
# Print the summaries of the new data frames
print(df_admin_summary)
## # A tibble: 2 × 5
## VisitorType count revenue_rate total_revenue_sessions avg_bounce
## <chr> <int> <dbl> <int> <dbl>
## 1 New_Visitor 1185 0.220 261 0.00155
## 2 Returning_Visitor 5209 0.209 1089 0.00832
print(df_info_summary)
## # A tibble: 2 × 5
## VisitorType count revenue_rate total_revenue_sessions avg_bounce
## <chr> <int> <dbl> <int> <dbl>
## 1 New_Visitor 254 0.232 59 0.00126
## 2 Returning_Visitor 2145 0.235 504 0.00844
print(df_product_summary)
## # A tibble: 2 × 5
## VisitorType count revenue_rate total_revenue_sessions avg_bounce
## <chr> <int> <dbl> <int> <dbl>
## 1 New_Visitor 1652 0.252 417 0.00186
## 2 Returning_Visitor 9855 0.148 1462 0.0131
The row counts of these data frames are:
admin duration - 6394
info duration - 2399
product duration - 11507
New and returning customers have roughly equal odds of making a purchase in the administrative and informational subsets (22.0% vs. 20.9% and 23.2% vs. 23.5%), but in the product subset new visitors are about 70% more likely to purchase (25.2% vs. 14.8%).
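The “70% more likely” figure is just the ratio of the two purchase rates in the product subset. A quick check, using the counts copied from the `df_product_summary` output (the variable names here are my own, for illustration):

```r
# Counts copied from the df_product_summary output
new_purchases       <- 417
new_sessions        <- 1652
returning_purchases <- 1462
returning_sessions  <- 9855

new_rate       <- new_purchases / new_sessions              # about 0.252
returning_rate <- returning_purchases / returning_sessions  # about 0.148

# Relative lift: how much more likely new visitors are to purchase
relative_lift <- new_rate / returning_rate - 1
round(relative_lift, 2)
```

Note that this is a ratio of rates, not a difference in percentage points; the absolute gap is roughly 10 points.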
Now let’s see how the durations and rates are distributed once we also group by the Revenue variable, to see whether anything changes:
# Summarize the data for df_admin_duration including Revenue
df_admin_duration_summary <- df_admin_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )
# Summarize the data for df_info_duration including Revenue
df_info_duration_summary <- df_info_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )
# Summarize the data for df_product_duration including Revenue
df_product_duration_summary <- df_product_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )
# Print the summaries of the new data frames
print(df_admin_duration_summary)
## # A tibble: 4 × 8
## VisitorType Revenue count avg_product_duration avg_informational_durat…¹
## <chr> <lgl> <int> <dbl> <dbl>
## 1 New_Visitor FALSE 924 566. 22.1
## 2 New_Visitor TRUE 261 943. 24.8
## 3 Returning_Visitor FALSE 4120 1763. 60.1
## 4 Returning_Visitor TRUE 1089 2450. 82.8
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## # avg_bounce_rate <dbl>
print(df_info_duration_summary)
## # A tibble: 4 × 8
## VisitorType Revenue count avg_product_duration avg_informational_durat…¹
## <chr> <lgl> <int> <dbl> <dbl>
## 1 New_Visitor FALSE 195 817. 124.
## 2 New_Visitor TRUE 59 1325. 141.
## 3 Returning_Visitor FALSE 1641 2508. 177.
## 4 Returning_Visitor TRUE 504 3218. 201.
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## # avg_bounce_rate <dbl>
print(df_product_duration_summary)
## # A tibble: 4 × 8
## VisitorType Revenue count avg_product_duration avg_informational_durat…¹
## <chr> <lgl> <int> <dbl> <dbl>
## 1 New_Visitor FALSE 1235 574. 19.6
## 2 New_Visitor TRUE 417 885. 20.0
## 3 Returning_Visitor FALSE 8393 1241. 34.6
## 4 Returning_Visitor TRUE 1462 2182. 69.2
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## # avg_bounce_rate <dbl>
The main difference I notice in this ‘group_by’ comparison for our smallest group, new visitors, is how the average bounce/exit rates and durations are affected differently depending on the type of customer.
For example, when a new customer spends longer on an administrative page, they often don’t make a purchase.
Why might this be?
Next, we try to visualize this.
First, we’ll look at these stats for all the data:
This visualization shows the two Revenue groups (TRUE and FALSE), comparing bounce/exit rates against the different duration variables.
I fit a degree-2 polynomial line of best fit rather than a straight line, based on the distribution I saw in last week’s data dive, because I feel I’ll get more out of the data that way.
For now, this includes both visitor types (new and returning).
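For reference, the polynomial line of best fit can also be fit directly with `lm()` using the same formula that is passed to `geom_smooth()` in the plots. The sketch below uses simulated data (not the shopping dataset), so the coefficients mean nothing beyond illustration:

```r
# Simulated data, for illustration only (not from the shopping dataset)
set.seed(1)
x <- seq(1, 100, length.out = 50)
y <- 0.05 - 0.0008 * x + 0.000005 * x^2 + rnorm(50, sd = 0.002)

# Same model form as geom_smooth(method = "lm", formula = y ~ poly(x, 2)):
# a linear model on an orthogonal polynomial basis of degree 2
fit <- lm(y ~ poly(x, 2))

# Evaluate the fitted curve on a small grid of x values
grid <- data.frame(x = seq(1, 100, length.out = 5))
grid$fitted <- predict(fit, newdata = grid)
print(grid)
```

Fitting the model outside of ggplot2 like this makes it possible to inspect `summary(fit)` and judge whether the quadratic term is actually doing any work.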
# Plot 1: Product Related Duration vs Exit Rates (with legend)
plot1 <- ggplot(df_product_duration, aes(x = ProductRelated_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = TRUE) + # Show legend
labs(title = "Product vs Exit",
x = "Product Related Duration",
y = "Exit Rates",
color = "Revenue Status") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 2: Informational Duration vs Exit Rates (without legend)
plot2 <- ggplot(df_info_duration, aes(x = Informational_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Info vs Exit",
x = "Informational Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 3: Administrative Duration vs Exit Rates (without legend)
plot3 <- ggplot(df_admin_duration, aes(x = Administrative_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Admin vs Exit",
x = "Administrative Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 4: Product Related Duration vs Bounce Rates (without legend)
plot4 <- ggplot(df_product_duration, aes(x = ProductRelated_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Product vs Bounce",
x = "Product Related Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 5: Informational Duration vs Bounce Rates (without legend)
plot5 <- ggplot(df_info_duration, aes(x = Informational_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Info vs Bounce",
x = "Informational Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 6: Administrative Duration vs Bounce Rates (without legend)
plot6 <- ggplot(df_admin_duration, aes(x = Administrative_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Admin vs Bounce",
x = "Administrative Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Combine all six plots into a grid layout
Exit_vs_Bounce1 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
ncol = 3,
top = "Rates Vs Duration (All Cases)"
)
## Warning: Removed 37 rows containing missing values (`geom_smooth()`).
## Warning: Removed 43 rows containing missing values (`geom_smooth()`).
print(Exit_vs_Bounce1)
## TableGrob (3 x 3) "arrange": 7 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (2-2,3-3) arrange gtable[layout]
## 4 4 (3-3,1-1) arrange gtable[layout]
## 5 5 (3-3,2-2) arrange gtable[layout]
## 6 6 (3-3,3-3) arrange gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.255]
As we expect, in false cases bounce and exit rates increase at a greater magnitude than that of true cases overall.
Now let's look at this with only the positive cases for the Revenue field:
# Plot 1: Product Related Duration vs Exit Rates (without legend)
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE), aes(x = ProductRelated_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Product vs Exit",
x = "Product Related Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 2: Informational Duration vs Exit Rates (without legend)
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE), aes(x = Informational_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Info vs Exit",
x = "Informational Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 3: Administrative Duration vs Exit Rates (without legend)
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE), aes(x = Administrative_Duration, y = ExitRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Admin vs Exit",
x = "Administrative Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 4: Product Related Duration vs Bounce Rates (without legend)
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE), aes(x = ProductRelated_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Product vs Bounce",
x = "Product Related Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 5: Informational Duration vs Bounce Rates (without legend)
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE), aes(x = Informational_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Info vs Bounce",
x = "Informational Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Plot 6: Administrative Duration vs Bounce Rates (without legend)
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE), aes(x = Administrative_Duration, y = BounceRates, color = factor(Revenue))) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) + # Hide legend
labs(title = "Admin vs Bounce",
x = "Administrative Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) + # Ensure no negative y-axis values
theme_minimal()
# Combine all six plots into a grid layout
Exit_vs_Bounce2 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
ncol = 3,
top = "Rates Vs Duration (All Positive Cases)"
)
print(Exit_vs_Bounce2)
## TableGrob (3 x 3) "arrange": 7 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (2-2,3-3) arrange gtable[layout]
## 4 4 (3-3,1-1) arrange gtable[layout]
## 5 5 (3-3,2-2) arrange gtable[layout]
## 6 6 (3-3,3-3) arrange gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.472]
This shows that, in sessions leading to a purchase, bounce rates drop sharply as duration increases and as visitors get closer to the product pages.
For exit rates we consistently see a U shape: exit rates fall at first and then start to increase again after a certain duration.
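To put a number on where the U shape turns, one option is to refit the same quadratic with raw (non-orthogonal) coefficients and compute the vertex of the parabola. This is a minimal sketch, assuming a hypothetical helper `quadratic_turning_point()` and a synthetic data frame; on the real data it would be called on `df_product_duration |> filter(Revenue == TRUE)` with the relevant column names.

```r
# Hypothetical helper: fit y = b0 + b1*x + b2*x^2 and return the vertex x = -b1 / (2*b2)
quadratic_turning_point <- function(data, x_col, y_col) {
  fit <- lm(data[[y_col]] ~ poly(data[[x_col]], 2, raw = TRUE))
  b <- unname(coef(fit))
  # b[3] > 0 means the curve opens upward (a U shape), so the vertex is a minimum
  list(turning_point = -b[2] / (2 * b[3]), opens_upward = b[3] > 0)
}

# Synthetic example (not the shopping data): an exact U shape with its minimum at x = 3
toy <- data.frame(duration = 0:6)
toy$exit_rate <- (toy$duration - 3)^2 / 100 + 0.02

result <- quadratic_turning_point(toy, "duration", "exit_rate")
result$turning_point
result$opens_upward
```

Using `raw = TRUE` gives the same fitted curve as the orthogonal `poly(x, 2)` in the plots, but the coefficients map directly onto the terms of the quadratic, which is what the vertex formula needs.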
Now we can plot a smaller group on its own, using only the positive cases for new visitors:
# Plot 1: Product Related Duration vs Bounce Rates
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = ProductRelated_Duration, y = BounceRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Bounce vs. Product",
x = "Product Related Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 2: Informational Duration vs Bounce Rates
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = Informational_Duration, y = BounceRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Bounce vs. Informational",
x = "Informational Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 3: Administrative Duration vs Bounce Rates
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = Administrative_Duration, y = BounceRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Bounce vs. Admin",
x = "Administrative Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 4: Product Related Duration vs Exit Rates
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = ProductRelated_Duration, y = ExitRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Exit vs. Product",
x = "Product Related Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 5: Informational Duration vs Exit Rates
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = Informational_Duration, y = ExitRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Exit vs. Informational",
x = "Informational Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 6: Administrative Duration vs Exit Rates
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"),
aes(x = Administrative_Duration, y = ExitRates)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Exit vs. Admin",
x = "Administrative Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Combine plots into a grid
Exit_vs_bounce3 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
ncol = 3,
top = "Rates Vs Duration (All Positive Cases, New Visitor only)"
)
## Warning: Removed 15 rows containing missing values (`geom_smooth()`).
print(Exit_vs_bounce3)
## TableGrob (3 x 3) "arrange": 7 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (2-2,3-3) arrange gtable[layout]
## 4 4 (3-3,1-1) arrange gtable[layout]
## 5 5 (3-3,2-2) arrange gtable[layout]
## 6 6 (3-3,3-3) arrange gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.689]
This shows a similar result to the returning customers, just to a much higher degree.
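The plot definitions in this section differ only in the data frame, the two columns, and the title, so they could be generated by one small function. The sketch below assumes a hypothetical helper `rate_plot()` that is not part of the original analysis, and it uses a tiny made-up data frame so it runs on its own.

```r
library(ggplot2)

# Hypothetical helper: one quadratic-trend panel for a duration column vs a rate column
rate_plot <- function(data, x_col, y_col, title, colour_col = NULL) {
  mapping <- if (is.null(colour_col)) {
    aes(x = .data[[x_col]], y = .data[[y_col]])
  } else {
    aes(x = .data[[x_col]], y = .data[[y_col]],
        colour = factor(.data[[colour_col]]))
  }
  ggplot(data, mapping) +
    geom_smooth(method = "lm", formula = y ~ poly(x, 2),
                se = FALSE, show.legend = FALSE) +
    labs(title = title, x = x_col, y = y_col) +
    scale_y_continuous(limits = c(0, NA)) +
    theme_minimal()
}

# Made-up data with the same column names as the analysis (values are illustrative)
toy <- data.frame(
  ProductRelated_Duration = c(10, 50, 120, 300, 600, 900),
  BounceRates = c(0.20, 0.12, 0.08, 0.04, 0.02, 0.01),
  Revenue = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

p <- rate_plot(toy, "ProductRelated_Duration", "BounceRates",
               "Product vs Bounce", colour_col = "Revenue")
```

With a helper like this, each grid would reduce to six short calls passed to `grid.arrange()`, and a change such as the y-axis limits would only need to be made in one place.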
Now I will put the two visitor types side by side to see how they differ:
# Keep successful revenue cases (Revenue == TRUE) for both visitor types
# (RED = new, BLUE = returning)
# Plot 1: Product Related Duration vs Bounce Rates
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = ProductRelated_Duration, y = BounceRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Bounce vs. Product",
x = "Product Related Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 2: Informational Duration vs Bounce Rates
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = Informational_Duration, y = BounceRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Bounce vs. Informational",
x = "Informational Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 3: Administrative Duration vs Bounce Rates
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = Administrative_Duration, y = BounceRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Bounce vs. Admin",
x = "Administrative Duration",
y = "Bounce Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 4: Product Related Duration vs Exit Rates
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = ProductRelated_Duration, y = ExitRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Exit vs. Product",
x = "Product Related Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 5: Informational Duration vs Exit Rates
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = Informational_Duration, y = ExitRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Exit vs. Informational",
x = "Informational Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Plot 6: Administrative Duration vs Exit Rates
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')),
aes(x = Administrative_Duration, y = ExitRates, color = VisitorType)) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
labs(title = "Exit vs. Admin",
x = "Administrative Duration",
y = "Exit Rates") +
scale_y_continuous(limits = c(0, NA)) +
theme_minimal()
# Combine all six plots into a 2-row grid (legends are hidden; the color key is in the title)
Exit_vs_Bounce4 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
ncol = 3,
top = "Rates Vs Duration (All Positive Cases by Visitor Type(RED = New))"
)
## Warning: Removed 15 rows containing missing values (`geom_smooth()`).
print(Exit_vs_Bounce4)
## TableGrob (3 x 3) "arrange": 7 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (2-2,3-3) arrange gtable[layout]
## 4 4 (3-3,1-1) arrange gtable[layout]
## 5 5 (3-3,2-2) arrange gtable[layout]
## 6 6 (3-3,3-3) arrange gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.912]
New Customers: The plots suggest that new customers who engage more tend to bounce and exit less, possibly due to finding what they need or being more engaged with the site.
New Customers’ Behavior: For new customers, higher product-related duration could indicate that they are exploring and evaluating products. If they find what they’re looking for, they might not bounce or exit as frequently. Returning customers, by contrast, might have a different interaction pattern, which could produce an upward slope if they return frequently and have become more efficient in their browsing.
Final Hypothesis: New customers who spend more time on the site are more engaged and find what they need, leading to lower bounce and exit rates. This could be due to a higher relevance of content or products, better site usability, or targeted marketing efforts. In contrast, the general customer base, which includes both new and returning customers, might exhibit different engagement patterns based on their familiarity with the site and their specific needs.
Website Usability: If the website is user-friendly and helps users quickly find what they need, new customers might have a more positive user journey, resulting in lower bounce and exit rates at longer durations.
Site Changes or Campaigns: There could be specific site changes or marketing campaigns targeting new visitors, influencing their behavior differently compared to the general customer base.
Relevance of Content: The content or products that new visitors encounter might be different from what returning visitors see. High engagement with relevant content could reduce bounce rates for new customers as they are more likely to be interested in the content.
Product Discovery: New customers may spend more time discovering and evaluating products, which would lead to lower bounce and exit rates if they find the site engaging and valuable.
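One way to check the final hypothesis numerically, rather than by eye, is to compare how duration and the two rates move together within each visitor type among purchasing sessions. The sketch below assumes the column names used in the plots (`VisitorType`, `Revenue`, `ProductRelated_Duration`, `BounceRates`, `ExitRates`) and uses a small made-up data frame in place of `project_data`. Spearman correlation is an assumption on my part for illustration; it is not something used in the analysis above.

```r
library(dplyr)

# Made-up stand-in for project_data (values are illustrative only)
toy <- data.frame(
  VisitorType = rep(c("New_Visitor", "Returning_Visitor"), each = 5),
  Revenue = TRUE,
  ProductRelated_Duration = c(100, 300, 600, 900, 1500,
                              200, 400, 800, 1200, 2000),
  BounceRates = c(0.05, 0.04, 0.02, 0.01, 0.00,
                  0.02, 0.02, 0.01, 0.01, 0.00),
  ExitRates = c(0.08, 0.06, 0.04, 0.03, 0.02,
                0.04, 0.03, 0.03, 0.02, 0.02)
)

# Rank correlation between duration and each rate, per visitor type
duration_rate_cor <- toy |>
  filter(Revenue == TRUE) |>
  group_by(VisitorType) |>
  summarise(
    n = n(),
    bounce_cor = cor(ProductRelated_Duration, BounceRates, method = "spearman"),
    exit_cor = cor(ProductRelated_Duration, ExitRates, method = "spearman"),
    .groups = "drop"
  )

duration_rate_cor
```

On the real data, a more negative correlation for `New_Visitor` than for `Returning_Visitor` would be consistent with the hypothesis, although a single correlation cannot capture the U shape seen in the exit-rate curves.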