Data Dive 3: Group_by analysis

I decided to group by the categorical variables VisitorType and Revenue, and then compare the groups on continuous variables like bounce/exit rates and the duration metrics.
I figured this could lead to insights into what is leading customers to leave the website or potentially make a purchase.

First I load in the libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gridExtra) #to view later visualizations side by side

## Warning: package 'gridExtra' was built under R version 4.3.3

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Then I look at summary() and head() of the data, along with the distribution of these variables, to see which groups are the smallest/largest and to pick up a few other summary details to get started:

project_data <- read.csv("online_shoppers_intention.csv")
summary(project_data)

##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##     Month           OperatingSystems    Browser           Region     
##  Length:12330       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##   TrafficType    VisitorType         Weekend         Revenue       
##  Min.   : 1.00   Length:12330       Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Class :character   FALSE:9462      FALSE:10422    
##  Median : 2.00   Mode  :character   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                     
##  3rd Qu.: 4.00                                                     
##  Max.   :20.00

print(head(project_data[,c('VisitorType','Administrative_Duration','Informational_Duration','ProductRelated_Duration', 'ExitRates','BounceRates', 'Revenue')], 150))

##           VisitorType Administrative_Duration Informational_Duration
## 1   Returning_Visitor                 0.00000                      0
## 2   Returning_Visitor                 0.00000                      0
## 3   Returning_Visitor                 0.00000                      0
## 4   Returning_Visitor                 0.00000                      0
## 5   Returning_Visitor                 0.00000                      0
## 6   Returning_Visitor                 0.00000                      0
## 7   Returning_Visitor                 0.00000                      0
## 8   Returning_Visitor                 0.00000                      0
## 9   Returning_Visitor                 0.00000                      0
## 10  Returning_Visitor                 0.00000                      0
## 11  Returning_Visitor                 0.00000                      0
## 12  Returning_Visitor                 0.00000                      0
## 13  Returning_Visitor                 0.00000                      0
## 14  Returning_Visitor                 0.00000                      0
## 15  Returning_Visitor                 0.00000                      0
## 16  Returning_Visitor                53.00000                      0
## 17  Returning_Visitor                 0.00000                      0
## 18  Returning_Visitor                 0.00000                      0
## 19  Returning_Visitor                 0.00000                      0
## 20  Returning_Visitor                 0.00000                      0
## 21  Returning_Visitor                 0.00000                      0
## 22  Returning_Visitor                 0.00000                      0
## 23  Returning_Visitor                 0.00000                      0
## 24  Returning_Visitor                 0.00000                      0
## 25  Returning_Visitor                 0.00000                      0
## 26  Returning_Visitor                 0.00000                      0
## 27  Returning_Visitor                64.60000                      0
## 28  Returning_Visitor                 0.00000                      0
## 29  Returning_Visitor                 0.00000                      0
## 30  Returning_Visitor                 6.00000                      0
## 31  Returning_Visitor                 0.00000                      0
## 32  Returning_Visitor                 0.00000                      0
## 33  Returning_Visitor                 0.00000                      0
## 34  Returning_Visitor                 0.00000                      0
## 35  Returning_Visitor                 0.00000                      0
## 36  Returning_Visitor                 0.00000                      0
## 37  Returning_Visitor                 0.00000                      0
## 38  Returning_Visitor                18.00000                      0
## 39  Returning_Visitor                 0.00000                      0
## 40  Returning_Visitor                 0.00000                      0
## 41  Returning_Visitor                 9.00000                      0
## 42  Returning_Visitor                 0.00000                      0
## 43  Returning_Visitor                 0.00000                      0
## 44  Returning_Visitor                 0.00000                      0
## 45  Returning_Visitor                 0.00000                      0
## 46  Returning_Visitor                 0.00000                      0
## 47  Returning_Visitor                 0.00000                      0
## 48  Returning_Visitor                 0.00000                      0
## 49  Returning_Visitor                 0.00000                      0
## 50  Returning_Visitor                 0.00000                      0
## 51  Returning_Visitor                 0.00000                      0
## 52  Returning_Visitor                 0.00000                      0
## 53  Returning_Visitor                 0.00000                      0
## 54  Returning_Visitor                 0.00000                      0
## 55  Returning_Visitor                 0.00000                      0
## 56  Returning_Visitor                 0.00000                      0
## 57  Returning_Visitor                 0.00000                      0
## 58  Returning_Visitor                56.00000                    120
## 59  Returning_Visitor                 0.00000                      0
## 60  Returning_Visitor                16.00000                      0
## 61  Returning_Visitor                 0.00000                      0
## 62  Returning_Visitor                 0.00000                      0
## 63  Returning_Visitor               279.41667                      0
## 64  Returning_Visitor                 0.00000                      0
## 65  Returning_Visitor                 0.00000                      0
## 66  Returning_Visitor                87.83333                      0
## 67  Returning_Visitor                44.00000                      0
## 68  Returning_Visitor                 0.00000                      0
## 69  Returning_Visitor                 0.00000                      0
## 70  Returning_Visitor                 0.00000                      0
## 71  Returning_Visitor                 0.00000                      0
## 72  Returning_Visitor                 0.00000                      0
## 73  Returning_Visitor                 0.00000                      0
## 74  Returning_Visitor                 0.00000                      0
## 75  Returning_Visitor                 0.00000                      0
## 76  Returning_Visitor                 0.00000                      0
## 77  Returning_Visitor              1005.66667                      0
## 78  Returning_Visitor                 0.00000                      0
## 79  Returning_Visitor                 0.00000                      0
## 80  Returning_Visitor                 0.00000                      0
## 81  Returning_Visitor                 0.00000                      0
## 82  Returning_Visitor                 0.00000                      0
## 83  Returning_Visitor                 0.00000                      0
## 84  Returning_Visitor                 0.00000                      0
## 85  Returning_Visitor                 0.00000                      0
## 86  Returning_Visitor                 0.00000                      0
## 87  Returning_Visitor                 0.00000                      0
## 88  Returning_Visitor                 0.00000                      0
## 89  Returning_Visitor                 0.00000                      0
## 90  Returning_Visitor                 0.00000                      0
## 91  Returning_Visitor                36.00000                      0
## 92  Returning_Visitor                 0.00000                      0
## 93  Returning_Visitor                 0.00000                      0
## 94        New_Visitor                 0.00000                      0
## 95  Returning_Visitor                 0.00000                      0
## 96  Returning_Visitor                 0.00000                      0
## 97  Returning_Visitor                 0.00000                      0
## 98  Returning_Visitor                 0.00000                      0
## 99  Returning_Visitor                 0.00000                      0
## 100 Returning_Visitor                 0.00000                      0
## 101 Returning_Visitor                18.33333                      0
## 102 Returning_Visitor                61.00000                      0
## 103 Returning_Visitor                 0.00000                      0
## 104 Returning_Visitor                31.00000                     16
## 105 Returning_Visitor                58.50000                      0
## 106 Returning_Visitor                 0.00000                      0
## 107 Returning_Visitor                 0.00000                      0
## 108 Returning_Visitor                 0.00000                      0
## 109 Returning_Visitor                 0.00000                      0
## 110 Returning_Visitor               326.25000                     94
## 111 Returning_Visitor                 0.00000                      0
## 112 Returning_Visitor                 0.00000                      0
## 113 Returning_Visitor                 0.00000                      0
## 114 Returning_Visitor                22.00000                      0
## 115 Returning_Visitor                 0.00000                     93
## 116 Returning_Visitor                 0.00000                      0
## 117 Returning_Visitor                 0.00000                      0
## 118 Returning_Visitor                 0.00000                      0
## 119 Returning_Visitor                 0.00000                      0
## 120 Returning_Visitor                 0.00000                      0
## 121 Returning_Visitor                 0.00000                      0
## 122 Returning_Visitor                 0.00000                      0
## 123 Returning_Visitor                 0.00000                     75
## 124 Returning_Visitor                 0.00000                      0
## 125 Returning_Visitor                 0.00000                      0
## 126 Returning_Visitor                 0.00000                      0
## 127 Returning_Visitor                 0.00000                      0
## 128 Returning_Visitor                 0.00000                      0
## 129 Returning_Visitor               462.00000                      0
## 130 Returning_Visitor                 0.00000                      0
## 131 Returning_Visitor                 0.00000                      0
## 132 Returning_Visitor                20.00000                      0
## 133 Returning_Visitor                 0.00000                      0
## 134 Returning_Visitor                 0.00000                      0
## 135 Returning_Visitor               111.50000                      0
## 136 Returning_Visitor                 0.00000                      0
## 137 Returning_Visitor                 0.00000                      0
## 138 Returning_Visitor                 0.00000                      0
## 139 Returning_Visitor                 0.00000                      0
## 140 Returning_Visitor               103.62500                      0
## 141 Returning_Visitor                 0.00000                      0
## 142 Returning_Visitor                 0.00000                      0
## 143 Returning_Visitor                19.00000                      0
## 144 Returning_Visitor                 0.00000                      0
## 145 Returning_Visitor                 0.00000                     19
## 146 Returning_Visitor                 0.00000                      0
## 147 Returning_Visitor                 0.00000                      0
## 148 Returning_Visitor                 0.00000                      0
## 149 Returning_Visitor                 0.00000                      0
## 150 Returning_Visitor                 0.00000                      0
##     ProductRelated_Duration   ExitRates BounceRates Revenue
## 1                  0.000000 0.200000000 0.200000000   FALSE
## 2                 64.000000 0.100000000 0.000000000   FALSE
## 3                  0.000000 0.200000000 0.200000000   FALSE
## 4                  2.666667 0.140000000 0.050000000   FALSE
## 5                627.500000 0.050000000 0.020000000   FALSE
## 6                154.216667 0.024561404 0.015789474   FALSE
## 7                  0.000000 0.200000000 0.200000000   FALSE
## 8                  0.000000 0.200000000 0.200000000   FALSE
## 9                 37.000000 0.100000000 0.000000000   FALSE
## 10               738.000000 0.022222222 0.000000000   FALSE
## 11               395.000000 0.066666667 0.000000000   FALSE
## 12               407.750000 0.025833333 0.018750000   FALSE
## 13               280.500000 0.028571429 0.000000000   FALSE
## 14                98.000000 0.066666667 0.000000000   FALSE
## 15                68.000000 0.100000000 0.000000000   FALSE
## 16              1668.285119 0.016312636 0.008333333   FALSE
## 17                 0.000000 0.200000000 0.200000000   FALSE
## 18               334.966667 0.007692308 0.000000000   FALSE
## 19                32.000000 0.100000000 0.000000000   FALSE
## 20              2981.166667 0.010000000 0.000000000   FALSE
## 21               136.166667 0.008333333 0.000000000   FALSE
## 22                 0.000000 0.200000000 0.200000000   FALSE
## 23               105.000000 0.033333333 0.000000000   FALSE
## 24                15.000000 0.100000000 0.000000000   FALSE
## 25                 0.000000 0.200000000 0.200000000   FALSE
## 26               156.000000 0.040000000 0.000000000   FALSE
## 27              1135.444444 0.009523810 0.002857143   FALSE
## 28                76.000000 0.100000000 0.050000000   FALSE
## 29                63.000000 0.050000000 0.000000000   FALSE
## 30              1582.750000 0.050821256 0.043478261   FALSE
## 31                35.000000 0.100000000 0.000000000   FALSE
## 32                78.000000 0.066666667 0.000000000   FALSE
## 33               209.500000 0.025000000 0.000000000   FALSE
## 34               183.666667 0.080000000 0.040000000   FALSE
## 35               380.500000 0.028571429 0.014285714   FALSE
## 36              2086.242857 0.020352564 0.015384615   FALSE
## 37               388.000000 0.056250000 0.025000000   FALSE
## 38               298.000000 0.028571429 0.000000000   FALSE
## 39                63.000000 0.071428571 0.028571429   FALSE
## 40               482.000000 0.022222222 0.000000000   FALSE
## 41              4084.393939 0.001794872 0.000000000   FALSE
## 42                22.000000 0.066666667 0.000000000   FALSE
## 43               310.166667 0.006666667 0.000000000   FALSE
## 44                34.000000 0.050000000 0.000000000   FALSE
## 45                88.000000 0.050000000 0.000000000   FALSE
## 46               622.250000 0.006060606 0.003030303   FALSE
## 47               222.400000 0.057142857 0.017142857   FALSE
## 48                80.000000 0.133333333 0.066666667   FALSE
## 49               800.833333 0.003636364 0.000000000   FALSE
## 50                 0.000000 0.200000000 0.200000000   FALSE
## 51                 0.000000 0.200000000 0.200000000   FALSE
## 52               265.166667 0.026111111 0.011111111   FALSE
## 53                29.000000 0.100000000 0.000000000   FALSE
## 54               160.000000 0.075000000 0.000000000   FALSE
## 55               135.666667 0.025000000 0.050000000   FALSE
## 56                 0.000000 0.200000000 0.200000000   FALSE
## 57                 0.000000 0.200000000 0.200000000   FALSE
## 58               998.741667 0.014736467 0.000000000   FALSE
## 59               104.000000 0.050000000 0.000000000   FALSE
## 60               381.686508 0.046568627 0.011764706   FALSE
## 61               169.000000 0.016666667 0.000000000   FALSE
## 62               400.800000 0.120833333 0.050000000   FALSE
## 63              1553.583333 0.019666667 0.009000000   FALSE
## 64               706.500000 0.007142857 0.000000000   FALSE
## 65                 0.000000 0.200000000 0.200000000   FALSE
## 66               798.333333 0.012643678 0.000000000    TRUE
## 67              6951.972222 0.015013034 0.002150538   FALSE
## 68                 0.000000 0.200000000 0.200000000   FALSE
## 69               902.000000 0.007407407 0.000000000   FALSE
## 70                 0.000000 0.200000000 0.200000000   FALSE
## 71                 0.000000 0.200000000 0.200000000   FALSE
## 72               197.377778 0.052500000 0.025000000   FALSE
## 73                43.000000 0.100000000 0.000000000   FALSE
## 74               426.666667 0.012500000 0.000000000   FALSE
## 75               135.000000 0.066666667 0.000000000   FALSE
## 76               588.333333 0.025000000 0.000000000   FALSE
## 77              2111.341667 0.014492754 0.004347826    TRUE
## 78                76.000000 0.050000000 0.000000000   FALSE
## 79                 0.000000 0.200000000 0.200000000   FALSE
## 80                 0.000000 0.200000000 0.200000000   FALSE
## 81               208.000000 0.028571429 0.000000000   FALSE
## 82               270.000000 0.016666667 0.000000000   FALSE
## 83                39.500000 0.020000000 0.000000000   FALSE
## 84               375.000000 0.058333333 0.016666667   FALSE
## 85                 0.000000 0.200000000 0.200000000   FALSE
## 86                 0.000000 0.200000000 0.200000000   FALSE
## 87               150.000000 0.085714286 0.057142857   FALSE
## 88               138.000000 0.066666667 0.000000000   FALSE
## 89               337.500000 0.023809524 0.028571429   FALSE
## 90               620.033333 0.007894737 0.000000000   FALSE
## 91               168.846154 0.011764706 0.000000000   FALSE
## 92                 0.000000 0.200000000 0.200000000   FALSE
## 93                52.000000 0.100000000 0.000000000   FALSE
## 94               649.250000 0.015384615 0.000000000   FALSE
## 95               925.333333 0.025925926 0.003703704   FALSE
## 96                33.000000 0.100000000 0.000000000   FALSE
## 97              1566.500000 0.066666667 0.050000000   FALSE
## 98               105.000000 0.025000000 0.000000000   FALSE
## 99                50.000000 0.080952381 0.038095238   FALSE
## 100              644.200000 0.031666667 0.004166667   FALSE
## 101             2635.177778 0.008947368 0.000000000   FALSE
## 102              607.000000 0.026984127 0.000000000    TRUE
## 103              415.000000 0.033333333 0.000000000   FALSE
## 104             2083.530952 0.013510250 0.000000000   FALSE
## 105              281.666667 0.017647059 0.007058824   FALSE
## 106               27.000000 0.150000000 0.150000000   FALSE
## 107              215.000000 0.074074074 0.044444444   FALSE
## 108              162.000000 0.066666667 0.050000000   FALSE
## 109              114.000000 0.066666667 0.050000000   FALSE
## 110             5062.213753 0.017918470 0.000854701   FALSE
## 111              253.000000 0.016666667 0.000000000   FALSE
## 112                0.000000 0.200000000 0.200000000   FALSE
## 113                0.000000 0.200000000 0.200000000   FALSE
## 114              436.500000 0.024000000 0.008000000   FALSE
## 115             1045.833333 0.035483871 0.012903226   FALSE
## 116              317.833333 0.005555556 0.000000000   FALSE
## 117             1146.333333 0.012962963 0.000000000   FALSE
## 118              577.000000 0.027272727 0.018181818   FALSE
## 119               95.000000 0.075000000 0.000000000   FALSE
## 120              301.500000 0.111111111 0.000000000   FALSE
## 121              243.000000 0.033333333 0.000000000   FALSE
## 122              135.500000 0.050000000 0.000000000   FALSE
## 123              442.333333 0.034375000 0.000000000   FALSE
## 124               41.000000 0.066666667 0.000000000   FALSE
## 125              292.900000 0.119523810 0.028000000   FALSE
## 126                0.000000 0.200000000 0.200000000   FALSE
## 127             1238.333333 0.040740741 0.000000000   FALSE
## 128              225.466667 0.050000000 0.000000000   FALSE
## 129             1873.216667 0.007547170 0.000000000   FALSE
## 130               13.000000 0.166666667 0.133333333   FALSE
## 131              222.000000 0.100000000 0.000000000   FALSE
## 132              214.833333 0.022222222 0.000000000   FALSE
## 133                0.000000 0.200000000 0.200000000   FALSE
## 134             2017.166667 0.011111111 0.000000000   FALSE
## 135              449.027778 0.018518519 0.000000000   FALSE
## 136              158.166667 0.033333333 0.000000000   FALSE
## 137              303.666667 0.046296296 0.005555556   FALSE
## 138               25.000000 0.066666667 0.000000000   FALSE
## 139               42.000000 0.150000000 0.050000000   FALSE
## 140             1003.416667 0.018039216 0.000000000   FALSE
## 141                0.000000 0.200000000 0.200000000   FALSE
## 142              272.500000 0.011851852 0.000000000   FALSE
## 143              344.000000 0.007407407 0.000000000   FALSE
## 144                0.000000 0.200000000 0.200000000   FALSE
## 145              852.000000 0.009090909 0.000000000   FALSE
## 146              780.000000 0.066666667 0.066666667   FALSE
## 147             1038.000000 0.033333333 0.000000000   FALSE
## 148              573.000000 0.050000000 0.000000000   FALSE
## 149             1231.233333 0.030357143 0.007142857   FALSE
## 150              316.666667 0.027272727 0.000000000   FALSE

df_visitor_type <- project_data |>
  group_by(VisitorType) |>
  summarize(
    count = n(),  # Count the number of rows in each group
    revenue_rate = mean(Revenue),  # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates)
  )

print(df_visitor_type)

## # A tibble: 3 × 5
##   VisitorType       count revenue_rate total_revenue_sessions avg_bounce
##   <chr>             <int>        <dbl>                  <int>      <dbl>
## 1 New_Visitor        1694        0.249                    422    0.00526
## 2 Other                85        0.188                     16    0.0386 
## 3 Returning_Visitor 10551        0.139                   1470    0.0248
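
Since the plan is to group by both VisitorType and Revenue, here is a minimal sketch of that two-way grouping. It runs on a small made-up data frame (toy_sessions is hypothetical, not the real data), so only the shape of the result carries over; swapping in project_data should give the real group sizes.

```r
library(dplyr)

# Hypothetical toy data standing in for project_data (values are made up).
toy_sessions <- data.frame(
  VisitorType = c("New_Visitor", "New_Visitor", "Returning_Visitor",
                  "Returning_Visitor", "Returning_Visitor", "Returning_Visitor"),
  Revenue     = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  BounceRates = c(0.00, 0.02, 0.01, 0.20, 0.05, 0.00),
  ExitRates   = c(0.01, 0.05, 0.02, 0.20, 0.10, 0.03)
)

# Group by both categorical variables and summarize the continuous ones.
toy_by_type_revenue <- toy_sessions |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count      = n(),               # Sessions in each VisitorType/Revenue group
    avg_bounce = mean(BounceRates), # Mean bounce rate per group
    avg_exit   = mean(ExitRates),   # Mean exit rate per group
    .groups    = "drop"             # Return an ungrouped result
  ) |>
  mutate(share_of_sessions = count / sum(count))  # Group size as a share of all sessions

print(toy_by_type_revenue)
```

The share_of_sessions column is what makes the smallest and largest groups easy to spot, since it puts every VisitorType/Revenue combination on the same scale.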

What we find is that, setting aside the tiny Other category (85 sessions), our smallest group is new visitors who make a purchase (only 422 sessions), even though new visitors have the highest purchase rate (0.249). So let’s analyze this group based on bounce rates, exit rates, and the durations for each page type.
- This is a useful group to look at because determining what leads a new customer to make a purchase could significantly improve business overall.
From our head() output we notice that there are a lot of zero values in the duration variables, so I chose to divide the data set into 3 subsets, one per duration variable, each keeping only the rows where that particular duration isn’t equal to zero. This way our data won’t be skewed towards the zeros.
- As for the reason for all these zero values in the durations, it’s hard to tell why it’s happening given the limited documentation.
- You would think it would be impossible to spend zero seconds on a page type and still have it be a legitimate value for analysis.
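
To put a number on how common the zeros are, one option is to compute the share of zero values in each duration column. The sketch below does that on a small made-up data frame (toy_durations is hypothetical); pointing the same call at the three duration columns of project_data should give the real proportions.

```r
# Hypothetical toy data standing in for the duration columns of project_data
# (values are made up).
toy_durations <- data.frame(
  Administrative_Duration = c(0, 0, 53, 0, 64.6),
  Informational_Duration  = c(0, 0, 0, 0, 120),
  ProductRelated_Duration = c(0, 64, 1668.3, 627.5, 998.7)
)

# `toy_durations == 0` is a logical matrix; the column means of a logical
# matrix are the proportions of TRUE, i.e. the share of zeros per column.
zero_share <- colMeans(toy_durations == 0)
print(zero_share)
```

In the real summary() output above, the 3rd quartile of Informational_Duration is already 0, so at least three quarters of the sessions have a zero there.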

We’ll also go ahead and isolate the variables we need and remove all the “Other” instances of VisitorType.

# Administrative duration subset
df_admin_duration <- subset(project_data, Administrative_Duration != 0 & VisitorType != "Other")
df_admin_duration <- df_admin_duration[, c("Administrative_Duration", 
                                           "Informational_Duration", 
                                           "ProductRelated_Duration", 
                                           "ExitRates", 
                                           "BounceRates",
                                           "VisitorType",
                                           "Revenue")]

# Informational duration subset
df_info_duration <- subset(project_data, Informational_Duration != 0 & VisitorType != "Other")
df_info_duration <- df_info_duration[, c("Administrative_Duration", 
                                         "Informational_Duration", 
                                         "ProductRelated_Duration", 
                                         "ExitRates", 
                                         "BounceRates",
                                         "VisitorType",
                                         "Revenue")]

# Product related duration subset
df_product_duration <- subset(project_data, ProductRelated_Duration != 0 & VisitorType != "Other")
df_product_duration <- df_product_duration[, c("Administrative_Duration", 
                                               "Informational_Duration", 
                                               "ProductRelated_Duration", 
                                               "ExitRates", 
                                               "BounceRates",
                                               "VisitorType",
                                               "Revenue")]

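The three subsets above repeat the same filter and column selection, so they could also be built with one small helper. This is only a sketch of an alternative, not what I ran: subset_nonzero_duration() and toy_data are hypothetical names, and it uses dplyr's filter() and select() in place of base subset().

```r
library(dplyr)

# Hypothetical helper: keep rows where `duration_col` is nonzero, drop the
# "Other" visitor type, and keep only the columns used in the analysis.
subset_nonzero_duration <- function(data, duration_col) {
  data |>
    filter(.data[[duration_col]] != 0, VisitorType != "Other") |>
    select(Administrative_Duration, Informational_Duration,
           ProductRelated_Duration, ExitRates, BounceRates,
           VisitorType, Revenue)
}

# Hypothetical toy data standing in for project_data (values are made up).
toy_data <- data.frame(
  Administrative_Duration = c(0, 53, 10, 0),
  Informational_Duration  = c(0, 0, 120, 16),
  ProductRelated_Duration = c(64, 1668.3, 0, 300),
  ExitRates   = c(0.10, 0.016, 0.05, 0.02),
  BounceRates = c(0.00, 0.008, 0.01, 0.00),
  VisitorType = c("Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"),
  Revenue     = c(FALSE, TRUE, FALSE, FALSE),
  Month       = c("Feb", "Mar", "May", "Nov")
)

toy_admin <- subset_nonzero_duration(toy_data, "Administrative_Duration")
toy_info  <- subset_nonzero_duration(toy_data, "Informational_Duration")
print(toy_admin)
print(toy_info)
```

One difference to keep in mind: base subset() keeps the original row numbers (like the 16, 27, 30 that show up in head()), while dplyr verbs generally do not preserve them.
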
Now let’s look at summaries and head() of these 3 subsets.

# Print the summaries of the new data frames
summary(df_admin_duration)

##  Administrative_Duration Informational_Duration ProductRelated_Duration
##  Min.   :   1.333        Min.   :   0.00        Min.   :    0.0        
##  1st Qu.:  40.000        1st Qu.:   0.00        1st Qu.:  411.5        
##  Median :  88.000        Median :   0.00        Median :  969.8        
##  Mean   : 155.015        Mean   :  57.04        Mean   : 1673.4        
##  3rd Qu.: 183.000        3rd Qu.:  21.44        3rd Qu.: 2032.2        
##  Max.   :3398.750        Max.   :2549.38        Max.   :63973.5        
##    ExitRates        BounceRates       VisitorType         Revenue       
##  Min.   :0.00000   Min.   :0.000000   Length:6394        Mode :logical  
##  1st Qu.:0.01111   1st Qu.:0.000000   Class :character   FALSE:5044     
##  Median :0.01852   Median :0.001905   Mode  :character   TRUE :1350     
##  Mean   :0.02270   Mean   :0.007063                                     
##  3rd Qu.:0.02941   3rd Qu.:0.009684                                     
##  Max.   :0.15000   Max.   :0.161905

summary(df_info_duration)

##  Administrative_Duration Informational_Duration ProductRelated_Duration
##  Min.   :   0.00         Min.   :   1.0         Min.   :    0.0        
##  1st Qu.:  23.00         1st Qu.:  28.1         1st Qu.:  727.4        
##  Median :  97.75         Median :  76.0         Median : 1572.5        
##  Mean   : 173.82         Mean   : 176.8         Mean   : 2490.5        
##  3rd Qu.: 215.51         3rd Qu.: 201.0         3rd Qu.: 3066.3        
##  Max.   :3398.75         Max.   :2549.4         Max.   :63973.5        
##    ExitRates        BounceRates       VisitorType         Revenue       
##  Min.   :0.00000   Min.   :0.000000   Length:2399        Mode :logical  
##  1st Qu.:0.01215   1st Qu.:0.000000   Class :character   FALSE:1836     
##  Median :0.01909   Median :0.004167   Mode  :character   TRUE :563      
##  Mean   :0.02266   Mean   :0.007678                                     
##  3rd Qu.:0.02915   3rd Qu.:0.010690                                     
##  Max.   :0.13333   Max.   :0.090323

summary(df_product_duration)

##  Administrative_Duration Informational_Duration ProductRelated_Duration
##  Min.   :   0.00         Min.   :   0.00        Min.   :    0.5        
##  1st Qu.:   0.00         1st Qu.:   0.00        1st Qu.:  243.4        
##  Median :  16.00         Median :   0.00        Median :  673.0        
##  Mean   :  85.96         Mean   :  36.83        Mean   : 1276.0        
##  3rd Qu.: 100.83         3rd Qu.:   0.00        3rd Qu.: 1560.8        
##  Max.   :3398.75         Max.   :2549.38        Max.   :63973.5        
##    ExitRates        BounceRates       VisitorType         Revenue       
##  Min.   :0.00000   Min.   :0.000000   Length:11507       Mode :logical  
##  1st Qu.:0.01339   1st Qu.:0.000000   Class :character   FALSE:9628     
##  Median :0.02403   Median :0.002062   Mode  :character   TRUE :1879     
##  Mean   :0.03328   Mean   :0.011472                                     
##  3rd Qu.:0.04187   3rd Qu.:0.013520                                     
##  Max.   :0.19231   Max.   :0.183333

# Print the first few rows of each new data frame to verify they look how we want
print(head(df_admin_duration))

##    Administrative_Duration Informational_Duration ProductRelated_Duration
## 16                    53.0                      0               1668.2851
## 27                    64.6                      0               1135.4444
## 30                     6.0                      0               1582.7500
## 38                    18.0                      0                298.0000
## 41                     9.0                      0               4084.3939
## 58                    56.0                    120                998.7417
##      ExitRates BounceRates       VisitorType Revenue
## 16 0.016312636 0.008333333 Returning_Visitor   FALSE
## 27 0.009523810 0.002857143 Returning_Visitor   FALSE
## 30 0.050821256 0.043478261 Returning_Visitor   FALSE
## 38 0.028571429 0.000000000 Returning_Visitor   FALSE
## 41 0.001794872 0.000000000 Returning_Visitor   FALSE
## 58 0.014736467 0.000000000 Returning_Visitor   FALSE

print(head(df_info_duration))

##     Administrative_Duration Informational_Duration ProductRelated_Duration
## 58                    56.00                    120                998.7417
## 104                   31.00                     16               2083.5310
## 110                  326.25                     94               5062.2138
## 115                    0.00                     93               1045.8333
## 123                    0.00                     75                442.3333
## 145                    0.00                     19                852.0000
##       ExitRates BounceRates       VisitorType Revenue
## 58  0.014736467 0.000000000 Returning_Visitor   FALSE
## 104 0.013510250 0.000000000 Returning_Visitor   FALSE
## 110 0.017918470 0.000854701 Returning_Visitor   FALSE
## 115 0.035483871 0.012903226 Returning_Visitor   FALSE
## 123 0.034375000 0.000000000 Returning_Visitor   FALSE
## 145 0.009090909 0.000000000 Returning_Visitor   FALSE

print(head(df_product_duration))

##    Administrative_Duration Informational_Duration ProductRelated_Duration
## 2                        0                      0               64.000000
## 4                        0                      0                2.666667
## 5                        0                      0              627.500000
## 6                        0                      0              154.216667
## 9                        0                      0               37.000000
## 10                       0                      0              738.000000
##     ExitRates BounceRates       VisitorType Revenue
## 2  0.10000000  0.00000000 Returning_Visitor   FALSE
## 4  0.14000000  0.05000000 Returning_Visitor   FALSE
## 5  0.05000000  0.02000000 Returning_Visitor   FALSE
## 6  0.02456140  0.01578947 Returning_Visitor   FALSE
## 9  0.10000000  0.00000000 Returning_Visitor   FALSE
## 10 0.02222222  0.00000000 Returning_Visitor   FALSE

Let’s first look at an overall comparison of exit/bounce rates on the different page types

We’ll get some summary stats on the different variables involved in each of our subsets

# Summarize the data for df_admin_duration
df_admin_summary <- df_admin_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),  # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),  # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)  # Average bounce rate
  )

# Summarize the data for df_info_duration
df_info_summary <- df_info_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),  # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),  # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)  # Average bounce rate
  )

# Summarize the data for df_product_duration
df_product_summary <- df_product_duration |>
  group_by(VisitorType) |>
  summarize(
    count = n(),  # Count the number of rows in each group
    revenue_rate = mean(Revenue, na.rm = TRUE),  # Proportion of sessions resulting in a purchase (TRUE)
    total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # Total number of sessions where Revenue is TRUE
    avg_bounce = mean(BounceRates, na.rm = TRUE)  # Average bounce rate
  )

# Print the summaries of the new data frames
print(df_admin_summary)

## # A tibble: 2 × 5
##   VisitorType       count revenue_rate total_revenue_sessions avg_bounce
##   <chr>             <int>        <dbl>                  <int>      <dbl>
## 1 New_Visitor        1185        0.220                    261    0.00155
## 2 Returning_Visitor  5209        0.209                   1089    0.00832

print(df_info_summary)

## # A tibble: 2 × 5
##   VisitorType       count revenue_rate total_revenue_sessions avg_bounce
##   <chr>             <int>        <dbl>                  <int>      <dbl>
## 1 New_Visitor         254        0.232                     59    0.00126
## 2 Returning_Visitor  2145        0.235                    504    0.00844

print(df_product_summary)

## # A tibble: 2 × 5
##   VisitorType       count revenue_rate total_revenue_sessions avg_bounce
##   <chr>             <int>        <dbl>                  <int>      <dbl>
## 1 New_Visitor        1652        0.252                    417    0.00186
## 2 Returning_Visitor  9855        0.148                   1462    0.0131
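
The three summaries above apply the same pipeline to each subset. As a minimal sketch (not the code that produced the output above), the repeated steps could be wrapped in a helper. The names `summarize_by_visitor` and `toy_sessions` are hypothetical and the toy values are made up for illustration:

```r
library(dplyr)

# Hypothetical helper: per-visitor-type counts, purchase rate, and mean bounce rate
summarize_by_visitor <- function(df) {
  df |>
    group_by(VisitorType) |>
    summarize(
      count = n(),                                          # rows in each group
      revenue_rate = mean(Revenue, na.rm = TRUE),           # share of sessions with a purchase
      total_revenue_sessions = sum(Revenue, na.rm = TRUE),  # sessions where Revenue is TRUE
      avg_bounce = mean(BounceRates, na.rm = TRUE),         # mean bounce rate
      .groups = "drop"
    )
}

# Toy data standing in for one of the duration subsets (made-up values)
toy_sessions <- data.frame(
  VisitorType = c("New_Visitor", "New_Visitor", "Returning_Visitor", "Returning_Visitor"),
  Revenue     = c(TRUE, FALSE, FALSE, FALSE),
  BounceRates = c(0.00, 0.02, 0.01, 0.03)
)

toy_summary <- summarize_by_visitor(toy_sessions)
print(toy_summary)
```

With the real data this would be called as `summarize_by_visitor(df_admin_duration)`, and likewise for the other two subsets.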

The row counts of these data frames are:
- admin duration - 6,394
- info duration - 2,399
- product duration - 11,507
New and returning customers purchase at nearly the same rate in the administrative and informational subsets (about 21-24%), but in the product subset new visitors purchase at a rate about 70% higher than returning visitors (25.2% vs. 14.8%)
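
The 70% figure can be checked by dividing the two purchase rates in the product subset. A quick sketch using only the counts printed in the `df_product_summary` output (the variable names are hypothetical):

```r
# Counts copied from the df_product_summary output above
new_purchases       <- 417
new_sessions        <- 1652
returning_purchases <- 1462
returning_sessions  <- 9855

new_rate       <- new_purchases / new_sessions              # about 0.252
returning_rate <- returning_purchases / returning_sessions  # about 0.148

# Relative difference in purchase rate (new vs. returning)
relative_lift <- new_rate / returning_rate - 1
print(round(relative_lift, 2))  # 0.7
```

Note that this is a ratio of purchase rates rather than a ratio of odds.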

Now let’s see how these are distributed once Revenue is added as a grouping variable, to see if there are any differences:

# Summarize the data for df_admin_duration including Revenue
df_admin_duration_summary <- df_admin_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )

# Summarize the data for df_info_duration including Revenue
df_info_duration_summary <- df_info_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )

# Summarize the data for df_product_duration including Revenue
df_product_duration_summary <- df_product_duration |>
  group_by(VisitorType, Revenue) |>
  summarize(
    count = n(),
    avg_product_duration = mean(ProductRelated_Duration, na.rm = TRUE),
    avg_informational_duration = mean(Informational_Duration, na.rm = TRUE),
    avg_administrative_duration = mean(Administrative_Duration, na.rm = TRUE),
    avg_exit_rate = mean(ExitRates, na.rm = TRUE),
    avg_bounce_rate = mean(BounceRates, na.rm = TRUE),
    .groups = 'drop'  # Ungroup after summarizing
  )

# Print the summaries of the new data frames
print(df_admin_duration_summary)

## # A tibble: 4 × 8
##   VisitorType       Revenue count avg_product_duration avg_informational_durat…¹
##   <chr>             <lgl>   <int>                <dbl>                     <dbl>
## 1 New_Visitor       FALSE     924                 566.                      22.1
## 2 New_Visitor       TRUE      261                 943.                      24.8
## 3 Returning_Visitor FALSE    4120                1763.                      60.1
## 4 Returning_Visitor TRUE     1089                2450.                      82.8
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## #   avg_bounce_rate <dbl>

print(df_info_duration_summary)

## # A tibble: 4 × 8
##   VisitorType       Revenue count avg_product_duration avg_informational_durat…¹
##   <chr>             <lgl>   <int>                <dbl>                     <dbl>
## 1 New_Visitor       FALSE     195                 817.                      124.
## 2 New_Visitor       TRUE       59                1325.                      141.
## 3 Returning_Visitor FALSE    1641                2508.                      177.
## 4 Returning_Visitor TRUE      504                3218.                      201.
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## #   avg_bounce_rate <dbl>

print(df_product_duration_summary)

## # A tibble: 4 × 8
##   VisitorType       Revenue count avg_product_duration avg_informational_durat…¹
##   <chr>             <lgl>   <int>                <dbl>                     <dbl>
## 1 New_Visitor       FALSE    1235                 574.                      19.6
## 2 New_Visitor       TRUE      417                 885.                      20.0
## 3 Returning_Visitor FALSE    8393                1241.                      34.6
## 4 Returning_Visitor TRUE     1462                2182.                      69.2
## # ℹ abbreviated name: ¹avg_informational_duration
## # ℹ 3 more variables: avg_administrative_duration <dbl>, avg_exit_rate <dbl>,
## #   avg_bounce_rate <dbl>
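
The printed tibbles above hide three columns (`avg_administrative_duration`, `avg_exit_rate`, `avg_bounce_rate`) because of the console width. One way to show every column, sketched here on a small made-up tibble (`toy_wide` and its values are hypothetical, not results from the data), is to pass `width = Inf` when printing:

```r
library(tibble)

# Made-up values; only the column names mirror the summaries above
toy_wide <- tibble(
  VisitorType = c("New_Visitor", "Returning_Visitor"),
  Revenue = c(TRUE, TRUE),
  avg_administrative_duration = c(100, 200),
  avg_exit_rate = c(0.01, 0.02),
  avg_bounce_rate = c(0.001, 0.002)
)

# width = Inf tells the tibble print method not to drop columns
print(toy_wide, width = Inf)

# Alternative: convert to a base data frame, which prints all columns
print(as.data.frame(toy_wide))
```

For the real summaries, the equivalent call would be `print(df_admin_duration_summary, width = Inf)`.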

The main difference I notice in this group_by comparison concerns our smallest group, new visitors: the average bounce/exit rates and durations are affected differently depending on the type of customer
For example, when a new customer spends longer on administrative pages, they often don’t make a purchase
- the opposite holds for returning customers: if they spend a lot of time on the same type of page, they are more likely to make a purchase
Why might this be?
- New vs. Returning Customers: New customers might be more exploratory, leading to longer durations and potentially higher bounce rates if they do not find what they are looking for quickly. Returning customers might navigate more efficiently or have specific goals, leading to different patterns.
- Revenue Generating vs. Non-Revenue Generating: Customers who make a purchase might spend more time on the site to gather information or compare options. In contrast, non-revenue generating customers might leave sooner if they do not find relevant content quickly.

Visualization:

Next we try to visualize this

First we’ll look at these stats for all the data:
This visualization compares bounce/exit rates against each duration metric, with separate lines for the TRUE and FALSE values of Revenue
I set the line of best fit to a second-degree polynomial regression rather than a straight line (based on the distribution I saw in last week’s data dive), because I feel I’ll get more out of it that way
For now this includes all customer types
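
For context, `geom_smooth(method = "lm", formula = y ~ poly(x, 2))` fits an ordinary least-squares quadratic, the same model as calling `lm()` directly. A minimal sketch on made-up data (the `toy` data frame is hypothetical and not taken from the dataset):

```r
# Made-up data following an exact quadratic: y = 2 - 3x + 0.5x^2
toy <- data.frame(x = 0:10)
toy$y <- 2 - 3 * toy$x + 0.5 * toy$x^2

# Orthogonal polynomial basis (the form used in the plots of this section)
fit_orth <- lm(y ~ poly(x, 2), data = toy)

# Raw polynomial basis: coefficients are the intercept, x, and x^2 terms
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = toy)

# Both parameterizations describe the same fitted curve
print(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(round(unname(coef(fit_raw)), 3))  # 2.0 -3.0 0.5
```

The orthogonal basis is numerically more stable, while the raw basis gives coefficients that are easier to read.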

# Plot 1: Product Related Duration vs Exit Rates (with legend)
plot1 <- ggplot(df_product_duration, aes(x = ProductRelated_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = TRUE) +  # Show legend
  labs(title = "Product vs Exit",
       x = "Product Related Duration",
       y = "Exit Rates",
       color = "Revenue Status") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 2: Informational Duration vs Exit Rates (without legend)
plot2 <- ggplot(df_info_duration, aes(x = Informational_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Info vs Exit",
       x = "Informational Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 3: Administrative Duration vs Exit Rates (without legend)
plot3 <- ggplot(df_admin_duration, aes(x = Administrative_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Admin vs Exit",
       x = "Administrative Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 4: Product Related Duration vs Bounce Rates (without legend)
plot4 <- ggplot(df_product_duration, aes(x = ProductRelated_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Product vs Bounce",
       x = "Product Related Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 5: Informational Duration vs Bounce Rates (without legend)
plot5 <- ggplot(df_info_duration, aes(x = Informational_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Info vs Bounce",
       x = "Informational Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 6: Administrative Duration vs Bounce Rates (without legend)
plot6 <- ggplot(df_admin_duration, aes(x = Administrative_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Admin vs Bounce",
       x = "Administrative Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Combine all six plots into a grid layout
Exit_vs_Bounce1 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
                                ncol = 3,
                                top = "Rates Vs Duration (All Cases)"
)

## Warning: Removed 37 rows containing missing values (`geom_smooth()`).

## Warning: Removed 43 rows containing missing values (`geom_smooth()`).

print(Exit_vs_Bounce1)

## TableGrob (3 x 3) "arrange": 7 grobs
##   z     cells    name                grob
## 1 1 (2-2,1-1) arrange      gtable[layout]
## 2 2 (2-2,2-2) arrange      gtable[layout]
## 3 3 (2-2,3-3) arrange      gtable[layout]
## 4 4 (3-3,1-1) arrange      gtable[layout]
## 5 5 (3-3,2-2) arrange      gtable[layout]
## 6 6 (3-3,3-3) arrange      gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.255]
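
The "Removed ... rows containing missing values" warnings above are most likely a side effect of `scale_y_continuous(limits = c(0, NA))`: scale limits turn out-of-range values into `NA`, and that includes points on the fitted curve that fall below zero. `coord_cartesian()` instead zooms the view without discarding anything. A sketch on made-up data (`toy`, `p_scale`, and `p_coord` are hypothetical names, and the `NA` upper limit assumes a reasonably recent ggplot2):

```r
library(ggplot2)

# Made-up rates that fall to zero halfway through the duration range
toy <- data.frame(duration = seq(0, 100, length.out = 50))
toy$rate <- pmax(0, 0.05 - 0.001 * toy$duration)

# The quadratic fit undershoots zero where the data flattens out
fit <- lm(rate ~ poly(duration, 2), data = toy)
print(min(fitted(fit)) < 0)

base_plot <- ggplot(toy, aes(x = duration, y = rate)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  theme_minimal()

# Scale limits censor out-of-range values to NA before drawing
p_scale <- base_plot + scale_y_continuous(limits = c(0, NA))

# Coordinate limits only zoom the view; no values are dropped
p_coord <- base_plot + coord_cartesian(ylim = c(0, NA))
```

With the scale-limit version, the part of the curve below zero is cut out of the line; with the coordinate-limit version, it simply runs off the bottom of the panel.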

As expected, bounce and exit rates increase by a greater magnitude in the FALSE cases than in the TRUE cases overall.

Now let’s look at this with only our positive cases for the Revenue field:

# Plot 1: Product Related Duration vs Exit Rates (without legend; only Revenue == TRUE is plotted)
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE), aes(x = ProductRelated_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Product vs Exit",
       x = "Product Related Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 2: Informational Duration vs Exit Rates (without legend)
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE), aes(x = Informational_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Info vs Exit",
       x = "Informational Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 3: Administrative Duration vs Exit Rates (without legend)
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE), aes(x = Administrative_Duration, y = ExitRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Admin vs Exit",
       x = "Administrative Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 4: Product Related Duration vs Bounce Rates (without legend)
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE), aes(x = ProductRelated_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Product vs Bounce",
       x = "Product Related Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 5: Informational Duration vs Bounce Rates (without legend)
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE), aes(x = Informational_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Info vs Bounce",
       x = "Informational Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Plot 6: Administrative Duration vs Bounce Rates (without legend)
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE), aes(x = Administrative_Duration, y = BounceRates, color = factor(Revenue))) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +  # Hide legend
  labs(title = "Admin vs Bounce",
       x = "Administrative Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +  # Ensure no negative y-axis values
  theme_minimal()

# Combine all six plots into a grid layout
Exit_vs_Bounce2 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
                                ncol = 3,
                                top = "Rates Vs Duration (All Positive Cases)"
)

print(Exit_vs_Bounce2)

## TableGrob (3 x 3) "arrange": 7 grobs
##   z     cells    name                grob
## 1 1 (2-2,1-1) arrange      gtable[layout]
## 2 2 (2-2,2-2) arrange      gtable[layout]
## 3 3 (2-2,3-3) arrange      gtable[layout]
## 4 4 (3-3,1-1) arrange      gtable[layout]
## 5 5 (3-3,2-2) arrange      gtable[layout]
## 6 6 (3-3,3-3) arrange      gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.472]

This shows that, in sessions leading to a purchase, bounce rates go down substantially as duration increases, and as the page type gets closer to the product page
For exit rates we consistently see a U shape, showing that exit rates start to increase again after a certain duration
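
The duration at which a U-shaped quadratic fit turns upward can be read off its coefficients: for y = b0 + b1*x + b2*x^2 with b2 > 0, the minimum sits at x = -b1 / (2*b2). A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from the dataset):

```r
# Made-up exit-rate curve with its minimum at a duration of 300
toy <- data.frame(duration = seq(0, 600, by = 50))
toy$exit_rate <- 0.01 + 1e-7 * (toy$duration - 300)^2

# Raw polynomial terms so the coefficients map directly to b0, b1, b2
fit <- lm(exit_rate ~ poly(duration, 2, raw = TRUE), data = toy)
b <- unname(coef(fit))

# Vertex of the parabola: the duration where the fitted exit rate is lowest
turning_point <- -b[2] / (2 * b[3])
print(round(turning_point))  # 300
```

Because a quadratic fit is always a parabola, a turning point found this way is worth checking against the raw data before reading much into it.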

Now we can plot this smaller group alone with just positive cases of new customers:

# Plot 1: Product Related Duration vs Bounce Rates
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = ProductRelated_Duration, y = BounceRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Bounce vs. Product",
       x = "Product Related Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 2: Informational Duration vs Bounce Rates
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = Informational_Duration, y = BounceRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Bounce vs. Informational",
       x = "Informational Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 3: Administrative Duration vs Bounce Rates
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = Administrative_Duration, y = BounceRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Bounce vs. Admin",
       x = "Administrative Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 4: Product Related Duration vs Exit Rates
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = ProductRelated_Duration, y = ExitRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Exit vs. Product",
       x = "Product Related Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 5: Informational Duration vs Exit Rates
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = Informational_Duration, y = ExitRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Exit vs. Informational",
       x = "Informational Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 6: Administrative Duration vs Exit Rates
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType == "New_Visitor"), 
                aes(x = Administrative_Duration, y = ExitRates)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Exit vs. Admin",
       x = "Administrative Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Combine plots into a grid
Exit_vs_bounce3 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
                                ncol = 3,
                                top = "Rates Vs Duration (All Positive Cases, New Visitor only)"
)

## Warning: Removed 15 rows containing missing values (`geom_smooth()`).

print(Exit_vs_bounce3)

## TableGrob (3 x 3) "arrange": 7 grobs
##   z     cells    name                grob
## 1 1 (2-2,1-1) arrange      gtable[layout]
## 2 2 (2-2,2-2) arrange      gtable[layout]
## 3 3 (2-2,3-3) arrange      gtable[layout]
## 4 4 (3-3,1-1) arrange      gtable[layout]
## 5 5 (3-3,2-2) arrange      gtable[layout]
## 6 6 (3-3,3-3) arrange      gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.689]

This shows a similar result to the previous plot of all purchasing sessions (most of which come from returning customers), just to a much higher degree in magnitude.
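
Each grid in this section repeats the same six-panel recipe with a different filter. As a sketch of how one panel could be generated by a helper (`rate_panel` is a hypothetical name and the toy data is made up), the `.data` pronoun lets column names be passed as strings:

```r
library(ggplot2)

# Hypothetical helper: quadratic smooth of one rate column against one duration column
rate_panel <- function(df, duration_col, rate_col, title) {
  ggplot(df, aes(x = .data[[duration_col]], y = .data[[rate_col]])) +
    geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
    labs(title = title, x = duration_col, y = rate_col) +
    scale_y_continuous(limits = c(0, NA)) +  # same y-axis limits as the plots above
    theme_minimal()
}

# Made-up data with the same column names as the subsets above
toy <- data.frame(
  ProductRelated_Duration = c(10, 50, 100, 200, 400, 800),
  BounceRates = c(0.05, 0.03, 0.02, 0.01, 0.005, 0.004),
  ExitRates = c(0.10, 0.06, 0.04, 0.03, 0.02, 0.03)
)

panels <- list(
  rate_panel(toy, "ProductRelated_Duration", "BounceRates", "Bounce vs. Product"),
  rate_panel(toy, "ProductRelated_Duration", "ExitRates", "Exit vs. Product")
)
```

The resulting list can then be passed to `gridExtra::grid.arrange(grobs = panels, ncol = 2)`.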

Now I will set the 2 visitor types side by side to see what is different about the two types:

# Filter each subset to successful revenue cases (Revenue == TRUE), colored by visitor type
# (RED = new, BLUE = returning)

# Plot 1: Product Related Duration vs Bounce Rates
plot1 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = ProductRelated_Duration, y = BounceRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Bounce vs. Product",
       x = "Product Related Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 2: Informational Duration vs Bounce Rates
plot2 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = Informational_Duration, y = BounceRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Bounce vs. Informational",
       x = "Informational Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 3: Administrative Duration vs Bounce Rates
plot3 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = Administrative_Duration, y = BounceRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Bounce vs. Admin",
       x = "Administrative Duration",
       y = "Bounce Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 4: Product Related Duration vs Exit Rates
plot4 <- ggplot(df_product_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = ProductRelated_Duration, y = ExitRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Exit vs. Product",
       x = "Product Related Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 5: Informational Duration vs Exit Rates
plot5 <- ggplot(df_info_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = Informational_Duration, y = ExitRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Exit vs. Informational",
       x = "Informational Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Plot 6: Administrative Duration vs Exit Rates
plot6 <- ggplot(df_admin_duration |> filter(Revenue == TRUE, VisitorType %in% c('New_Visitor', 'Returning_Visitor')), 
                aes(x = Administrative_Duration, y = ExitRates, color = VisitorType)) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, show.legend = FALSE) +
  labs(title = "Exit vs. Admin",
       x = "Administrative Duration",
       y = "Exit Rates") +
  scale_y_continuous(limits = c(0, NA)) +
  theme_minimal()

# Combine all six plots into a 2-row grid (legends are hidden; the title notes the colors)
Exit_vs_Bounce4 <- grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6,
                                ncol = 3,
                                top = "Rates Vs Duration (All Positive Cases by Visitor Type(RED = New))"
)

## Warning: Removed 15 rows containing missing values (`geom_smooth()`).

print(Exit_vs_Bounce4)

## TableGrob (3 x 3) "arrange": 7 grobs
##   z     cells    name                grob
## 1 1 (2-2,1-1) arrange      gtable[layout]
## 2 2 (2-2,2-2) arrange      gtable[layout]
## 3 3 (2-2,3-3) arrange      gtable[layout]
## 4 4 (3-3,1-1) arrange      gtable[layout]
## 5 5 (3-3,2-2) arrange      gtable[layout]
## 6 6 (3-3,3-3) arrange      gtable[layout]
## 7 7 (1-1,1-3) arrange text[GRID.text.912]

This shows us that new customers do not stay engaged with the website for nearly as long as returning customers, especially on the product pages

ANALYSIS:

New Customers: The plots suggest that new customers who engage more tend to bounce and exit less, possibly due to finding what they need or being more engaged with the site.
New Customers’ Behavior: For new customers, higher product-related duration could indicate that they are exploring and evaluating products. If they find what they’re looking for, they might not bounce or exit as frequently. Conversely, returning customers might have a different interaction pattern, which could lead to an upward slope if they’re more likely to return frequently and become more efficient in their browsing.

Final Hypothesis: New customers who spend more time on the site are more engaged and find what they need, leading to lower bounce and exit rates. This could be due to a higher relevance of content or products, better site usability, or targeted marketing efforts. In contrast, the general customer base, which includes both new and returning customers, might exhibit different engagement patterns based on their familiarity with the site and their specific needs.

Possible explanations:
- Website Usability: If the website is designed to be user-friendly and helps users quickly find what they need, new customers might experience a more positive user journey, resulting in lower bounce/exit rates with longer durations.
- Site Changes or Campaigns: There could be specific site changes or marketing campaigns targeting new visitors, influencing their behavior differently compared to the general customer base.
- Relevance of Content: The content or products that new visitors encounter might be different from what returning visitors see. High engagement with relevant content could reduce bounce rates for new customers as they are more likely to be interested in the content.
- Product Discovery: New customers may spend more time discovering and evaluating products, leading to decreased bounce and exit rates if they find the site engaging and valuable.

DataDive3, Mohler

Carson

2024-09-18
