1. What insights can be gained from examining the dataset’s structure, dimensions, and summary statistics? How does this initial exploration help identify potential data quality issues before analysis?

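library(tidyverse)  # read_csv(), %>%, dplyr verbs, ggplot2
library(GGally)     # ggpairs() for the pair plots
data <- read_csv(file.choose())                   # pick the CSV interactively
colnames(data) <- gsub(" ", "_", colnames(data))  # spaces -> underscores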

head(data)

## # A tibble: 6 × 7
##   Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
##        <dbl>            <dbl>       <dbl> <chr>                 <dbl>
## 1          5            11.1        0.231 Organic                3.89
## 2          4             3.43       0.391 Social                 8.48
## 3          4             1.62       0.398 Organic                9.64
## 4          5             3.63       0.180 Organic                2.07
## 5          5             4.24       0.292 Paid                   1.96
## 6          3             4.54       0.421 Social                 3.44
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>

dim(data)

## [1] 2000    7

str(data)

## spc_tbl_ [2,000 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Page_Views      : num [1:2000] 5 4 4 5 5 3 5 4 6 7 ...
##  $ Session_Duration: num [1:2000] 11.05 3.43 1.62 3.63 4.24 ...
##  $ Bounce_Rate     : num [1:2000] 0.231 0.391 0.398 0.18 0.292 ...
##  $ Traffic_Source  : chr [1:2000] "Organic" "Social" "Organic" "Organic" ...
##  $ Time_on_Page    : num [1:2000] 3.89 8.48 9.64 2.07 1.96 ...
##  $ Previous_Visits : num [1:2000] 3 0 2 3 5 2 1 5 1 5 ...
##  $ Conversion_Rate : num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Page Views` = col_double(),
##   ..   `Session Duration` = col_double(),
##   ..   `Bounce Rate` = col_double(),
##   ..   `Traffic Source` = col_character(),
##   ..   `Time on Page` = col_double(),
##   ..   `Previous Visits` = col_double(),
##   ..   `Conversion Rate` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(data)

##    Page_Views    Session_Duration     Bounce_Rate       Traffic_Source    
##  Min.   : 0.00   Min.   : 0.003613   Min.   :0.007868   Length:2000       
##  1st Qu.: 3.00   1st Qu.: 0.815828   1st Qu.:0.161986   Class :character  
##  Median : 5.00   Median : 1.993983   Median :0.266375   Mode  :character  
##  Mean   : 4.95   Mean   : 3.022045   Mean   :0.284767                     
##  3rd Qu.: 6.00   3rd Qu.: 4.197569   3rd Qu.:0.388551                     
##  Max.   :14.00   Max.   :20.290516   Max.   :0.844939                     
##   Time_on_Page      Previous_Visits Conversion_Rate 
##  Min.   : 0.06852   Min.   :0.000   Min.   :0.3437  
##  1st Qu.: 1.93504   1st Qu.:1.000   1st Qu.:1.0000  
##  Median : 3.31532   Median :2.000   Median :1.0000  
##  Mean   : 4.02744   Mean   :1.978   Mean   :0.9821  
##  3rd Qu.: 5.41463   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :24.79618   Max.   :9.000   Max.   :1.0000

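Interpretation: The dataset has 2,000 rows and 7 columns: six numeric variables and one character variable (Traffic_Source). The summary already flags two things to check before further analysis. First, Conversion_Rate equals 1.0 at the first quartile, median, and third quartile (minimum 0.34, mean 0.98), so it has very little variation. Second, Session_Duration and Time_on_Page are right-skewed, with means above their medians and maxima of about 20 and 25. The units of the duration columns are not stated in the file and should be confirmed.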

2. Which variables contain missing values and how might they impact analysis accuracy? What steps should be taken to handle incomplete data for reliable results?

colSums(is.na(data))

##       Page_Views Session_Duration      Bounce_Rate   Traffic_Source 
##                0                0                0                0 
##     Time_on_Page  Previous_Visits  Conversion_Rate 
##                0                0                0

Interpretation: Missing values can distort means, counts, and model fits. Here colSums(is.na(data)) returns 0 for all seven columns, so no imputation or row removal is needed. Had any column contained NAs, the options would be imputation, removal, or flagging the gap for stakeholders.
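As a minimal sketch of how missing values could be handled if any were present, the example below uses a small made-up data frame (`toy` is hypothetical, not the traffic data) and base R only. It reports the percentage of NAs per column, then applies median imputation to a numeric column and an explicit "Unknown" label to a categorical one. These are two common choices, not the only reasonable ones.

```r
# Hypothetical toy data with a few missing values (not the real dataset)
toy <- data.frame(
  Session_Duration = c(1.2, NA, 3.4, 0.8, NA),
  Traffic_Source   = c("Organic", "Paid", NA, "Social", "Organic"),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column
na_pct <- colMeans(is.na(toy)) * 100
print(na_pct)

# Numeric column: replace NA with the median of the observed values
toy_clean <- toy
med <- median(toy_clean$Session_Duration, na.rm = TRUE)
toy_clean$Session_Duration[is.na(toy_clean$Session_Duration)] <- med

# Categorical column: label missing values explicitly instead of guessing
toy_clean$Traffic_Source[is.na(toy_clean$Traffic_Source)] <- "Unknown"

print(toy_clean)
```

Median imputation is used here because duration data is usually skewed, which makes the median more robust than the mean.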

3. Which traffic source contributes the highest number of visitors to the website? How can this information guide marketing resource allocation decisions?

traffic_count <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Visitors = n())
traffic_count

## # A tibble: 5 × 2
##   Traffic_Source Visitors
##   <chr>             <int>
## 1 Direct              216
## 2 Organic             786
## 3 Paid                428
## 4 Referral            301
## 5 Social              269

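Interpretation: Organic is the largest channel with 786 of 2,000 records (39.3%), followed by Paid (428), Referral (301), Social (269), and Direct (216). Note that n() counts rows, so "Visitors" here means records (sessions), not unique users. This informs budget allocation: with Organic leading, SEO is the main acquisition lever, while Paid, the second-largest channel, should be evaluated on ROI.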
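Raw counts are easier to compare as shares of the total. The sketch below re-enters the counts printed in the traffic_count output by hand (the vector name `visitors` is only illustrative) and converts them to percentages with base R's prop.table().

```r
# Counts copied from the traffic_count output
visitors <- c(Direct = 216, Organic = 786, Paid = 428, Referral = 301, Social = 269)

# Share of all records per source, in percent, largest first
share_pct <- sort(round(100 * prop.table(visitors), 1), decreasing = TRUE)
print(share_pct)
```

On the full data frame the same shares could be obtained with prop.table(table(data$Traffic_Source)).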

4. How does the visual comparison of traffic sources help identify dominant channels? Which source clearly outperforms others in attracting users?

ggplot(traffic_count, aes(Traffic_Source, Visitors)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Visitors by Traffic Source")

Interpretation: The bar chart makes the differences between traffic sources immediately visible: Organic is clearly the tallest bar, at nearly twice the height of Paid. The marketing director can use this visual in executive presentations.

5. Which traffic source results in the longest average session duration? How does session duration reflect user engagement across sources?

avg_session <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Session = mean(Session_Duration, na.rm = TRUE))
avg_session

## # A tibble: 5 × 2
##   Traffic_Source Avg_Session
##   <chr>                <dbl>
## 1 Direct                2.69
## 2 Organic               3.10
## 3 Paid                  2.94
## 4 Referral              3.13
## 5 Social                3.06

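Interpretation: Referral (3.13) and Organic (3.10) have the longest average sessions and Direct (2.69) the shortest. The gap between sources is small compared with the within-source standard deviations of roughly 2.6 to 3.4 reported under question 20, so treat the ranking as tentative. Longer sessions often indicate higher interest, but they are not proof of conversion intent.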

6. How does session duration vary across different traffic sources visually? Which source attracts the most engaged users based on time spent?

ggplot(avg_session, aes(Traffic_Source, Avg_Session)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Session Duration")

Interpretation: The plot shows Referral, Organic, and Social at a similar height (about 3.1) with Direct lowest (2.69). Referral sessions are only about 0.4 units longer than Direct on average, so the chart suggests, rather than demonstrates, that external links bring more committed readers.

7. Which traffic source has the highest and lowest average bounce rate? What does this indicate about user interest and landing page effectiveness?

avg_bounce <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Bounce = mean(Bounce_Rate, na.rm = TRUE))
avg_bounce

## # A tibble: 5 × 2
##   Traffic_Source Avg_Bounce
##   <chr>               <dbl>
## 1 Direct              0.285
## 2 Organic             0.282
## 3 Paid                0.296
## 4 Referral            0.266
## 5 Social              0.296

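Interpretation: Referral has the lowest average bounce rate (0.266); Paid and Social share the highest (0.296). A lower rate signals that visitors from that source tend to explore further, while a higher rate points to a possible landing-page mismatch. The spread is only three percentage points, so this helps prioritise optimisation efforts but is not strong evidence on its own.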

8. How can bounce rate comparisons help identify areas needing UX improvement? Which traffic source performs best in retaining users?

ggplot(avg_bounce, aes(Traffic_Source, Avg_Bounce)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Bounce Rate")

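Interpretation: The UX team can use this plot to decide where to run A/B tests on page design or content relevance. Referral, the source with the lowest average bounce rate (0.266), may serve as the benchmark, and Paid and Social (0.296) are the first candidates for testing.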

9. Which traffic source delivers the highest conversion rate? How can this insight influence ROI-focused marketing strategies?

conversion_analysis <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Conversion = mean(Conversion_Rate, na.rm = TRUE))
conversion_analysis

## # A tibble: 5 × 2
##   Traffic_Source Avg_Conversion
##   <chr>                   <dbl>
## 1 Direct                  0.979
## 2 Organic                 0.982
## 3 Paid                    0.979
## 4 Referral                0.988
## 5 Social                  0.983

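Interpretation: Referral has the highest average conversion rate (0.988) and Direct and Paid the lowest (0.979). The range across sources is under one percentage point, so these figures alone do not justify moving budget. Even a low-traffic source can be valuable if it converts often, but ROI calculations also need cost per channel, which this dataset does not contain.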

10. How does the visual representation of conversion rates support decision-making? Which channels should receive increased investment based on performance?

ggplot(conversion_analysis, aes(Traffic_Source, Avg_Conversion)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Conversion Rate")

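Interpretation: Because all five averages lie between 0.979 and 0.988, the bars look almost identical on an axis that starts at zero. The CMO can use this visual to show that conversion is uniformly high across channels; any case for shifting budget would need to rest on other metrics. Clear labels and a title keep the chart readable for executives.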

11. Is there a relationship between the number of page views and session duration? Does increased browsing activity lead to longer user engagement?

ggplot(data, aes(Page_Views, Session_Duration)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Page Views vs Session Duration")

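Interpretation: Each point represents a user session. Page_Views takes integer values, so the points form vertical bands; transparency or jitter makes the overlap easier to read. If the cloud trends upward, encouraging deeper browsing could increase overall time on site, though the plot alone does not establish causation.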

12. Which traffic source shows the highest average time spent on pages? What does this reveal about content relevance across channels?

avg_time_page <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Time = mean(Time_on_Page, na.rm = TRUE))
avg_time_page

## # A tibble: 5 × 2
##   Traffic_Source Avg_Time
##   <chr>             <dbl>
## 1 Direct             3.95
## 2 Organic            3.98
## 3 Paid               4.09
## 4 Referral           3.98
## 5 Social             4.19

Interpretation: Time on page is a rough proxy for content relevance, not a direct measure of quality. Social (4.19) and Paid (4.09) are slightly ahead of Organic and Referral (both 3.98) and Direct (3.95). The differences are small, so the data does not show that any channel's content is clearly more relevant.

13. How does user interaction time differ visually across traffic sources? Which source indicates deeper content engagement?

ggplot(avg_time_page, aes(Traffic_Source, Avg_Time)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Time on Page")

Interpretation: The bars are close in height (3.95 to 4.19), so the chart gives the content team little basis for separate content strategies per channel. A per-source distribution, for example a boxplot, would show more than the averages do.

14. Do repeat visitors show higher conversion rates compared to new users? How does prior interaction influence user purchasing behavior?

ggplot(data, aes(Previous_Visits, Conversion_Rate)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Previous Visits vs Conversion")

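Interpretation: Look for a pattern: if sessions with more Previous_Visits show fewer low Conversion_Rate values, a re-targeting or loyalty campaign is worth testing. Because Previous_Visits is an integer from 0 to 9 and most Conversion_Rate values equal 1, many points overlap, so the plot should be read together with the t-test under question 32.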

15. What does the distribution of session duration reveal about user behavior patterns? Are most users highly engaged or leaving quickly?

ggplot(data, aes(Session_Duration)) +
  geom_histogram(bins = 20) +
  theme_minimal() +
  labs(title = "Session Duration Distribution")

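Interpretation: The distribution is right-skewed: the median session is about 2.0 and the mean about 3.0, with a long tail out to roughly 20. Most sessions are short, and a small group is highly engaged. The units are not given in the data, so thresholds for “good” vs. “bad” sessions should be set only after confirming them.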

16. What is the relationship between bounce rate and conversion rate? Does reducing bounce rate lead to higher conversions?

ggplot(data, aes(Bounce_Rate, Conversion_Rate)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Bounce Rate vs Conversion")

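Interpretation: A downward trend would indicate that lower bounce rates go with higher conversion. In this data the linear correlation between Bounce_Rate and Conversion_Rate is only about -0.05 (see the correlation ranking under question 24), so the relationship is very weak. A scatter plot also cannot show that reducing bounce rate (e.g., by improving page load speed or relevance) causes more conversions.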

17. What characteristics can be observed in the longest user sessions? How can these highly engaged sessions inform product or content strategy?

top_sessions <- data %>%
  arrange(desc(Session_Duration))

head(top_sessions, 10)

## # A tibble: 10 × 7
##    Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
##         <dbl>            <dbl>       <dbl> <chr>                 <dbl>
##  1          4             20.3      0.600  Paid                  1.26 
##  2          3             19.9      0.250  Paid                  0.188
##  3          6             19.1      0.297  Organic               4.03 
##  4          4             19.1      0.0741 Paid                  1.70 
##  5          4             19.0      0.364  Referral              4.05 
##  6          6             18.3      0.338  Referral              2.98 
##  7          8             18.2      0.0468 Social                4.17 
##  8          4             17.8      0.298  Organic               3.48 
##  9          8             16.7      0.457  Organic               4.65 
## 10          4             16.6      0.207  Paid                  0.543
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>

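Interpretation: The ten longest sessions last between 16.6 and 20.3 and come from Paid (4), Organic (3), Referral (2), and Social (1). Several of the Paid sessions combine a long Session_Duration with a short Time_on_Page (for example 19.9 vs 0.188), which is worth checking as a possible tracking artefact before treating these as the most valuable users. Share this list with the product team to understand what keeps highly engaged users on the site.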

18. Which traffic source generates the highest average number of page views? How does this metric reflect user exploration and engagement quality?

avg_pageviews <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Page_Views = mean(Page_Views, na.rm = TRUE))

avg_pageviews

## # A tibble: 5 × 2
##   Traffic_Source Avg_Page_Views
##   <chr>                   <dbl>
## 1 Direct                   4.96
## 2 Organic                  5.03
## 3 Paid                     4.94
## 4 Referral                 4.98
## 5 Social                   4.70

Interpretation: Page views per session indicate browsing depth. Organic has the highest average (5.03) and Social the lowest (4.70); Direct, Referral, and Paid fall between 4.94 and 4.98. This metric helps assess traffic quality, not just quantity, although the differences here are modest.

19. How do page views vary across traffic sources visually? Which sources should be targeted to improve user engagement depth?

ggplot(avg_pageviews, aes(Traffic_Source, Avg_Page_Views)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Page Views")

Interpretation: The growth team can use this chart to choose sources for campaigns aimed at increasing engagement; Social, with the lowest average page views (4.70), is the natural first candidate. If a source shows high page views but low conversion, the issue may be the call-to-action rather than traffic volume.

20. Which traffic source provides the most stable and predictable user engagement based on session duration variability? How can we identify the source with the least fluctuation in user behavior?

consistency <- data %>%
  group_by(Traffic_Source) %>%
  summarise(
    Mean_Session = mean(Session_Duration, na.rm = TRUE),
    SD_Session   = sd(Session_Duration, na.rm = TRUE),
    CV           = SD_Session / Mean_Session
  ) %>%
  arrange(CV)

consistency

## # A tibble: 5 × 4
##   Traffic_Source Mean_Session SD_Session    CV
##   <chr>                 <dbl>      <dbl> <dbl>
## 1 Social                 3.06       2.89 0.947
## 2 Direct                 2.69       2.61 0.969
## 3 Organic                3.10       3.13 1.01 
## 4 Referral               3.13       3.18 1.02 
## 5 Paid                   2.94       3.36 1.14

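Interpretation: The traffic source with the smallest coefficient of variation (CV = SD / mean) delivers the most predictable engagement. Social has the lowest CV (0.947) and Paid the highest (1.14). All five CVs are close to 1, meaning the standard deviation is about as large as the mean, so session duration is highly variable for every source.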
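The CV column can be reproduced by hand from the Mean_Session and SD_Session columns. The sketch below re-enters the rounded values printed in the table (the vector names are illustrative), so its results differ from the table's CV in the third decimal, but the ordering is the same.

```r
# Mean and SD of Session_Duration per source, copied (rounded) from the table
mean_session <- c(Social = 3.06, Direct = 2.69, Organic = 3.10,
                  Referral = 3.13, Paid = 2.94)
sd_session   <- c(Social = 2.89, Direct = 2.61, Organic = 3.13,
                  Referral = 3.18, Paid = 3.36)

# Coefficient of variation: spread relative to the mean (unitless)
cv <- sd_session / mean_session

# Smallest CV first = most consistent source first
print(round(sort(cv), 3))
```

Because the CV is unitless, it allows a fair comparison between sources whose mean session durations differ.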

21. Does spending more time on a page increase the likelihood of conversion? What is the relationship between user engagement time and conversion performance?

cor_time_conversion <- cor(data$Time_on_Page, data$Conversion_Rate, use = "complete.obs")
cor_time_conversion

## [1] 0.2296688

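Interpretation: The correlation is 0.23, a weak positive association: sessions with more time on page tend to have slightly higher conversion rates. Correlation does not show that longer time on page causes conversion.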
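To put the size of this correlation in context, the sketch below computes the shared variance (r squared) and the usual t statistic for testing a zero correlation, using only the reported r and the row count. It assumes all 2,000 rows entered the calculation, which is consistent with the zero missing-value counts under question 2.

```r
r <- 0.2296688  # correlation reported above
n <- 2000       # number of rows, from dim(data); assumes no rows were dropped

r_squared <- r^2                               # share of variance in common
t_stat    <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic for H0: rho = 0
p_value   <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

cat("r^2:", round(r_squared, 4),
    " t:", round(t_stat, 2),
    " p:", signif(p_value, 3), "\n")
```

With a sample this large, even a weak correlation is statistically significant, so the r squared value is the more informative number. On the data frame itself, cor.test(data$Time_on_Page, data$Conversion_Rate) performs the same test directly.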

22. Do different traffic sources lead to significantly different bounce rates? Which source contributes to unusually high or low bounce behavior?

anova_bounce <- aov(Bounce_Rate ~ Traffic_Source, data = data)
summary(anova_bounce)

##                  Df Sum Sq Mean Sq F value Pr(>F)  
## Traffic_Source    4   0.20 0.04981   1.955 0.0989 .
## Residuals      1995  50.84 0.02548                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

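Interpretation: The p-value is 0.0989, above the 0.05 threshold, so the ANOVA does not show a statistically significant difference in mean bounce rate between traffic sources. A p-value below 0.05 would have indicated that at least one source differs from the others.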
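The F value and p-value in the ANOVA table follow from its Mean Sq and Df columns. As a worked check, the sketch below re-enters those printed (rounded) figures and recomputes both; the variable names are illustrative.

```r
# Values copied from the summary(anova_bounce) table
ms_between <- 0.04981  # Mean Sq for Traffic_Source
ms_within  <- 0.02548  # Mean Sq for Residuals
df_between <- 4
df_within  <- 1995

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

cat("F:", round(f_value, 3), " p:", round(p_value, 4), "\n")
```

An F value near 1 means the variation between sources is about what within-source noise alone would produce; here it is roughly 2, which is not enough to reach the 0.05 threshold.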

23. How do page views and time on page together influence conversion rates? To what extent can these engagement metrics explain conversion performance?

model_multiple <- lm(Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
summary(model_multiple)

## 
## Call:
## lm(formula = Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62137  0.00077  0.01648  0.02719  0.04892 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.9432452  0.0039984 235.906  < 2e-16 ***
## Page_Views   0.0036452  0.0006501   5.607 2.34e-08 ***
## Time_on_Page 0.0051582  0.0004917  10.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06346 on 1997 degrees of freedom
## Multiple R-squared:  0.06743,    Adjusted R-squared:  0.0665 
## F-statistic:  72.2 on 2 and 1997 DF,  p-value: < 2.2e-16

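Interpretation: Both predictors are statistically significant with positive coefficients, but the adjusted R-squared is 0.0665, so page views and time on page together explain only about 6.7% of the variance in conversion rate. The residuals are strongly left-skewed (minimum -0.62, median 0.016), which suggests a linear model is a poor fit for a rate that mostly equals 1.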
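The adjusted R-squared is the multiple R-squared with a penalty for the number of predictors. The sketch below recomputes it from the printed values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors.

```r
r2 <- 0.06743  # Multiple R-squared from the model summary
n  <- 2000     # observations
p  <- 2        # predictors: Page_Views and Time_on_Page

# Penalise R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

With 2,000 rows and only two predictors the penalty is tiny, which is why the two R-squared values in the summary are almost equal.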

24. Which numeric variable has the strongest relationship with conversion rate? What factors act as the best predictors of user conversion?

numeric_cols <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_cols, use = "complete.obs")
cor_matrix[, "Conversion_Rate"] %>% sort(decreasing = TRUE)

##  Conversion_Rate     Time_on_Page Session_Duration       Page_Views 
##       1.00000000       0.22966876       0.17779803       0.12663537 
##  Previous_Visits      Bounce_Rate 
##       0.10949602      -0.04905104

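Interpretation: Time_on_Page has the highest absolute correlation with Conversion_Rate (0.23), followed by Session_Duration (0.18), Page_Views (0.13), Previous_Visits (0.11), and Bounce_Rate (-0.05). All of these are weak, so Time_on_Page is the strongest linear association but not a strong predictor.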

25. Are there duplicate entries in the dataset that could distort analysis results? How might repeated records impact overall data accuracy?

dup_count <- sum(duplicated(data))
dup_count

## [1] 0

Interpretation: Duplicate rows artificially inflate counts and bias averages. The count here is 0, so no rows need to be removed.
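For illustration only, the sketch below shows how duplicates would be removed and how they shift an average, using a made-up four-row data frame (`toy` is hypothetical). duplicated() marks the second and later copies of a row, so negating it keeps one copy of each.

```r
# Hypothetical toy data: row 3 repeats row 1
toy <- data.frame(
  Page_Views     = c(5, 4, 5, 3),
  Traffic_Source = c("Organic", "Social", "Organic", "Paid"),
  stringsAsFactors = FALSE
)

dup_flag <- duplicated(toy)   # TRUE for the second and later copies of a row
cat("Duplicate rows:", sum(dup_flag), "\n")

toy_unique <- toy[!dup_flag, ]  # keep the first copy of each row

# The repeated row pulls the mean toward its own value
cat("Mean with duplicates:   ", mean(toy$Page_Views), "\n")
cat("Mean without duplicates:", mean(toy_unique$Page_Views), "\n")
```

In a tidyverse pipeline, dplyr::distinct() does the same job as the !duplicated() subset.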

26. What is the variability in the number of pages viewed by users? What is the threshold below which 90% of user page views fall?

sd_pageviews  <- sd(data$Page_Views, na.rm = TRUE)
p90_pageviews <- quantile(data$Page_Views, 0.9, na.rm = TRUE)
cat("Standard Deviation:", sd_pageviews, "\n90th Percentile:", p90_pageviews)

## Standard Deviation: 2.183903 
## 90th Percentile: 8

Interpretation: The standard deviation of page views is about 2.18 around a mean of 4.95. The 90th percentile is 8, so 90% of sessions have at most 8 page views.

27. Is there a relationship between session duration and bounce rate? Do longer sessions reduce the chances of users leaving immediately?

ggplot(data, aes(Session_Duration, Bounce_Rate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Session Duration vs Bounce Rate (with regression line)")

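Interpretation: A downward-sloping regression line would indicate that longer sessions tend to have lower bounce rates; the width of the confidence band shows how precisely the slope is estimated. The slope should be checked numerically, for example with cor() or lm(), before drawing a conclusion.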

28. Which traffic source shows the highest and most consistent conversion rates? Are there noticeable outliers or variations across sources?

ggplot(data, aes(Traffic_Source, Conversion_Rate, fill = Traffic_Source)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Conversion Rate Distribution by Traffic Source")

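Interpretation: Boxplots show medians, spread, and outliers; a high median and a narrow box indicate consistently good performance. Since the overall first quartile, median, and third quartile of Conversion_Rate all equal 1, the boxes are likely to collapse near 1, with the lower values appearing as outlier points.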

29. Which traffic source contributes the largest share of total users? How is overall user traffic distributed among sources?

source_counts <- table(data$Traffic_Source)
pie(source_counts, main = "Traffic Source Share", col = rainbow(length(source_counts)))

Interpretation: The largest slice is Organic, our primary acquisition channel, with 786 of 2,000 records (39.3%).

30. How are key engagement metrics interrelated with each other? Which variables show strong positive or negative relationships?

ggpairs(data %>% select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate))

Interpretation: This matrix shows scatter plots, pairwise correlations, and distributions. The correlation between Time_on_Page and Conversion_Rate is positive but weak (about 0.23), so engaged users convert only slightly better on average.

31. What percentage of users fall below a certain session duration? How can we determine session duration percentiles visually?

ggplot(data, aes(Session_Duration)) +
  stat_ecdf(geom = "step") +
  labs(title = "Cumulative Distribution of Session Duration", y = "Cumulative Probability")

Interpretation: From the CDF we can read percentiles directly: the x-value at which the curve reaches 0.8 is the 80th percentile of session duration.
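The same readings can be taken numerically. The sketch below uses simulated right-skewed durations (`sim_duration` is hypothetical stand-in data drawn from an exponential distribution, not the real Session_Duration column) to show the two directions: ecdf() answers "what share falls at or below x?", and quantile() answers "which x has a given share below it?".

```r
set.seed(42)
# Hypothetical stand-in for Session_Duration: right-skewed, mean about 3
sim_duration <- rexp(2000, rate = 1 / 3)

F_hat <- ecdf(sim_duration)   # empirical CDF as a function

# Direction 1: share of sessions lasting at most 5 units
share_below_5 <- F_hat(5)
cat("Share of sessions <= 5:", round(share_below_5, 3), "\n")

# Direction 2: the duration below which 80% of sessions fall
q80 <- quantile(sim_duration, 0.8)
cat("80th percentile:", round(q80, 2), "\n")

# Consistency check: the CDF evaluated at the 80th percentile is about 0.8
cat("F_hat(q80):", round(F_hat(q80), 3), "\n")
```

Replacing `sim_duration` with data$Session_Duration gives the values that correspond to the plotted curve.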

32. Do repeat visitors convert differently compared to new visitors? Is the difference in conversion rates statistically significant?

data <- data %>%
  mutate(Visitor_Type = ifelse(Previous_Visits > 0, "Repeat", "New"))
t.test(Conversion_Rate ~ Visitor_Type, data = data)

## 
##  Welch Two Sample t-test
## 
## data:  Conversion_Rate by Visitor_Type
## t = -3.2839, df = 335, p-value = 0.001132
## alternative hypothesis: true difference in means between group New and group Repeat is not equal to 0
## 95 percent confidence interval:
##  -0.027038461 -0.006780576
## sample estimates:
##    mean in group New mean in group Repeat 
##            0.9675400            0.9844495

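Interpretation: The p-value is 0.0011, below 0.05, so the difference is statistically significant. Repeat visitors have a higher mean conversion rate (0.984) than new visitors (0.968); the 95% confidence interval puts the gap between about 0.7 and 2.7 percentage points.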
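The pieces of the t-test output fit together arithmetically: the t statistic is the difference in means divided by its standard error, and the confidence interval is that difference plus or minus a t quantile times the standard error. The sketch below rebuilds the interval from the printed figures. It assumes the printed df of 335 is close enough to the unrounded Welch degrees of freedom, which R displays in rounded form.

```r
# Values copied from the Welch t-test output
mean_new    <- 0.9675400
mean_repeat <- 0.9844495
t_stat      <- -3.2839
df_welch    <- 335      # printed value; the exact Welch df may be fractional

diff_means <- mean_new - mean_repeat   # New minus Repeat, as in the output
se_diff    <- diff_means / t_stat      # standard error implied by t

# 95% confidence interval for the difference in means
half_width <- qt(0.975, df_welch) * se_diff
ci <- diff_means + c(-1, 1) * half_width

cat("Difference:", round(diff_means, 4), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

The interval excludes zero, which is the same conclusion the p-value gives, and its width shows that the gap, while real, is small in absolute terms.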

33. Is the relationship between session duration and conversion linear or nonlinear? Is there an optimal session duration that maximizes conversion?

poly_model <- lm(Conversion_Rate ~ poly(Session_Duration, 2), data = data)
summary(poly_model)

## 
## Call:
## lm(formula = Conversion_Rate ~ poly(Session_Duration, 2), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62258 -0.00087  0.01598  0.02825  0.04532 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.982065   0.001435 684.433  < 2e-16 ***
## poly(Session_Duration, 2)1  0.522114   0.064169   8.137 7.07e-16 ***
## poly(Session_Duration, 2)2 -0.357530   0.064169  -5.572 2.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06417 on 1997 degrees of freedom
## Multiple R-squared:  0.04644,    Adjusted R-squared:  0.04548 
## F-statistic: 48.62 on 2 and 1997 DF,  p-value: < 2.2e-16

ggplot(data, aes(Session_Duration, Conversion_Rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
  labs(title = "Polynomial Regression: Conversion Rate vs Session Duration")

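Interpretation: The quadratic term is significant and negative, which points to an inverted-U (concave) relationship: conversion rises with session duration and then levels off or declines. The model explains under 5% of the variance (adjusted R-squared 0.045), so any “sweet spot” session duration is at best a weak tendency.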

34. Are there extreme values in page views that deviate from normal behavior? Could these outliers indicate abnormal user activity like bots?

Q1        <- quantile(data$Page_Views, 0.25, na.rm = TRUE)
Q3        <- quantile(data$Page_Views, 0.75, na.rm = TRUE)
IQR_val   <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
outliers  <- data %>% filter(Page_Views < lower_bound | Page_Views > upper_bound)
n_outliers <- nrow(outliers)
n_outliers

## [1] 21

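Interpretation: With Q1 = 3 and Q3 = 6, the IQR fences are -1.5 and 10.5. Page views cannot be negative, so all 21 flagged sessions are on the high side (11 or more page views) and may indicate bots or unusually heavy users. Sessions with zero page views are not flagged by this rule and need a separate check.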
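The fences used by this rule can be worked out from the quartiles that summary(data) reports for Page_Views (1st Qu. 3, 3rd Qu. 6). The sketch below does the arithmetic and checks which candidate values would be flagged; the vector `candidates` is illustrative.

```r
# Quartiles of Page_Views as printed by summary(data)
q1 <- 3
q3 <- 6

iqr_val     <- q3 - q1
lower_bound <- q1 - 1.5 * iqr_val
upper_bound <- q3 + 1.5 * iqr_val
cat("Fences:", lower_bound, "to", upper_bound, "\n")

# Which page-view counts would the rule flag?
candidates <- c(0, 1, 10, 11, 14)
flagged <- candidates < lower_bound | candidates > upper_bound
print(setNames(flagged, candidates))
```

Because the lower fence is negative, a count of 0 can never be flagged, so the rule is effectively one-sided for this variable.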

35. How does performance across engagement metrics vary by traffic source? Which source consistently performs better across multiple indicators?

pair_data <- data %>%
  select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate, Traffic_Source)

ggpairs(pair_data,
        columns = 1:5,
        aes(color = Traffic_Source, alpha = 0.5),
        title = "Pair Plot: Engagement Metrics Colored by Traffic Source") +
  theme_minimal()

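Interpretation: This coloured pair plot overlays the five traffic sources on each panel, making it possible to see whether any source stands out across several metrics at once. Given the similar per-source averages reported earlier, expect heavily overlapping distributions.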

Website Traffic and User Behavior Analysis

Sanjana Upadhyay and Sudhanshu Thakur

14-04-2026