1. Import Dataset

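library(readr)    # read_csv()
library(dplyr)    # %>%, group_by(), summarise(), mutate(), select()
library(ggplot2)  # ggplot() and geoms

data <- read_csv("C:/Users/tsudh/Desktop/R/R project/website_wata.csv")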
colnames(data) <- gsub(" ", "_", colnames(data))

head(data)
## # A tibble: 6 × 7
##   Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
##        <dbl>            <dbl>       <dbl> <chr>                 <dbl>
## 1          5            11.1        0.231 Organic                3.89
## 2          4             3.43       0.391 Social                 8.48
## 3          4             1.62       0.398 Organic                9.64
## 4          5             3.63       0.180 Organic                2.07
## 5          5             4.24       0.292 Paid                   1.96
## 6          3             4.54       0.421 Social                 3.44
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>
dim(data)
## [1] 2000    7
str(data)
## spc_tbl_ [2,000 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Page_Views      : num [1:2000] 5 4 4 5 5 3 5 4 6 7 ...
##  $ Session_Duration: num [1:2000] 11.05 3.43 1.62 3.63 4.24 ...
##  $ Bounce_Rate     : num [1:2000] 0.231 0.391 0.398 0.18 0.292 ...
##  $ Traffic_Source  : chr [1:2000] "Organic" "Social" "Organic" "Organic" ...
##  $ Time_on_Page    : num [1:2000] 3.89 8.48 9.64 2.07 1.96 ...
##  $ Previous_Visits : num [1:2000] 3 0 2 3 5 2 1 5 1 5 ...
##  $ Conversion_Rate : num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Page Views` = col_double(),
##   ..   `Session Duration` = col_double(),
##   ..   `Bounce Rate` = col_double(),
##   ..   `Traffic Source` = col_character(),
##   ..   `Time on Page` = col_double(),
##   ..   `Previous Visits` = col_double(),
##   ..   `Conversion Rate` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##    Page_Views    Session_Duration     Bounce_Rate       Traffic_Source    
##  Min.   : 0.00   Min.   : 0.003613   Min.   :0.007868   Length:2000       
##  1st Qu.: 3.00   1st Qu.: 0.815828   1st Qu.:0.161986   Class :character  
##  Median : 5.00   Median : 1.993983   Median :0.266375   Mode  :character  
##  Mean   : 4.95   Mean   : 3.022045   Mean   :0.284767                     
##  3rd Qu.: 6.00   3rd Qu.: 4.197569   3rd Qu.:0.388551                     
##  Max.   :14.00   Max.   :20.290516   Max.   :0.844939                     
##   Time_on_Page      Previous_Visits Conversion_Rate 
##  Min.   : 0.06852   Min.   :0.000   Min.   :0.3437  
##  1st Qu.: 1.93504   1st Qu.:1.000   1st Qu.:1.0000  
##  Median : 3.31532   Median :2.000   Median :1.0000  
##  Mean   : 4.02744   Mean   :1.978   Mean   :0.9821  
##  3rd Qu.: 5.41463   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :24.79618   Max.   :9.000   Max.   :1.0000

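Interpretation: Start with the dataset’s dimensions, column types, and summary statistics. Here there are 2,000 rows and 7 columns: six numeric and one character (Traffic_Source). This overview helps plan further analyses and detect obvious data issues before diving into specific metrics. One such issue is already visible: the first quartile, median, and third quartile of Conversion_Rate are all 1.0, so the values are concentrated at the ceiling and later comparisons of conversion rest on small differences.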

2. Missing Values Check

colSums(is.na(data))
##       Page_Views Session_Duration      Bounce_Rate   Traffic_Source 
##                0                0                0                0 
##     Time_on_Page  Previous_Visits  Conversion_Rate 
##                0                0                0

Interpretation: Missing data can distort calculations. Counting missing values per column shows which columns would need imputation, removal, or flagging for stakeholders. Here every column reports 0 missing values, so no such handling is needed.
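
As a small illustration of this check on data that does have gaps, the sketch below uses a made-up data frame (`toy` and its values are hypothetical, not taken from the website dataset) and adds the share of missing values per column, which is often easier to act on than raw counts.

```r
# Hypothetical toy data with a few gaps; NOT the website dataset.
toy <- data.frame(
  Page_Views     = c(5, NA, 4, 3),
  Bounce_Rate    = c(0.23, 0.39, NA, NA),
  Traffic_Source = c("Organic", "Social", NA, "Paid")
)

na_count <- colSums(is.na(toy))   # number of missing values per column
na_share <- colMeans(is.na(toy))  # proportion of missing values per column

print(na_count)
print(na_share)
```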

3. Visitors by Traffic Source

traffic_count <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Visitors = n())
traffic_count
## # A tibble: 5 × 2
##   Traffic_Source Visitors
##   <chr>             <int>
## 1 Direct              216
## 2 Organic             786
## 3 Paid                428
## 4 Referral            301
## 5 Social              269

Interpretation:
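Grouping by Traffic_Source shows which acquisition channel drives the most visitors. Organic leads with 786 of the 2,000 rows (about 39%), followed by Paid (428); Direct is the smallest (216). This informs budget allocation: with Organic in the lead, SEO is the obvious place to keep investing, while the Paid share should be weighed against its cost before judging ROI.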
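
To put the counts on a common scale, the following sketch converts them to percentage shares. It retypes the counts from the traffic_count output above rather than reading the CSV; the variable names `visitors`, `share`, and `share_pct` are illustrative.

```r
# Counts copied from the traffic_count output above.
visitors <- c(Direct = 216, Organic = 786, Paid = 428,
              Referral = 301, Social = 269)

share     <- prop.table(visitors)   # each source's fraction of all rows
share_pct <- round(100 * share, 1)  # the same, as percentages

print(sort(share_pct, decreasing = TRUE))
```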

4. Traffic Source Bar Plot

ggplot(traffic_count, aes(Traffic_Source, Visitors)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Visitors by Traffic Source")

Interpretation:
The bar chart makes differences between traffic sources immediately visible. The marketing director can use this visual to highlight which source outperforms others in executive presentations.

5. Avg Session Duration by Source

avg_session <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Session = mean(Session_Duration, na.rm = TRUE))
avg_session
## # A tibble: 5 × 2
##   Traffic_Source Avg_Session
##   <chr>                <dbl>
## 1 Direct                2.69
## 2 Organic               3.10
## 3 Paid                  2.94
## 4 Referral              3.13
## 5 Social                3.06

Interpretation:
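Average session duration shows which sources bring users who stay longer, a key engagement metric. The spread here is modest: Referral (3.13) and Organic (3.10) are highest and Direct (2.69) is lowest. Longer sessions often indicate higher interest, but the link to conversion should be checked rather than assumed.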

6. Session Duration Bar Plot

ggplot(avg_session, aes(Traffic_Source, Avg_Session)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Session Duration")

Interpretation:
The plot highlights whether, for example, Referral traffic has notably longer sessions than Direct traffic, suggesting external links bring more committed readers – guiding content strategy.

7. Bounce Rate Analysis

avg_bounce <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Bounce = mean(Bounce_Rate))
avg_bounce
## # A tibble: 5 × 2
##   Traffic_Source Avg_Bounce
##   <chr>               <dbl>
## 1 Direct              0.285
## 2 Organic             0.282
## 3 Paid                0.296
## 4 Referral            0.266
## 5 Social              0.296

Interpretation:
A low average bounce rate signals that visitors from that source tend to explore further, while a high rate indicates a possible landing page mismatch. This helps prioritise optimisation efforts.

8. Bounce Rate Plot

ggplot(avg_bounce, aes(Traffic_Source, Avg_Bounce)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Bounce Rate")

Interpretation:
The UX team uses this plot to decide where to run A/B tests on page design or content relevance – the lowest bounce rate source may serve as a benchmark.

9. Conversion Rate Analysis

conversion_analysis <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Conversion = mean(Conversion_Rate))
conversion_analysis
## # A tibble: 5 × 2
##   Traffic_Source Avg_Conversion
##   <chr>                   <dbl>
## 1 Direct                  0.979
## 2 Organic                 0.982
## 3 Paid                    0.979
## 4 Referral                0.988
## 5 Social                  0.983

Interpretation:
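Even a low‑traffic source may be highly valuable if it converts often. In this dataset the averages sit in a narrow band, from 0.979 (Direct, Paid) to 0.988 (Referral), so conversion alone gives little reason to move budget between channels. Combine it with volume and cost before drawing ROI conclusions.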

10. Conversion Plot

ggplot(conversion_analysis, aes(Traffic_Source, Avg_Conversion)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Conversion Rate")

Interpretation:
The CMO can use this visual to justify shifting budget from low‑conversion channels to high‑conversion ones. Clear labels and title ensure executive clarity.

11. Page Views vs Session Duration

ggplot(data, aes(Page_Views, Session_Duration)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Page Views vs Session Duration")

Interpretation:
Each point represents a user session. A positive trend means encouraging deeper browsing could increase overall time on site – a potential lever for engagement.

12. Avg Time on Page by Source

avg_time_page <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Time = mean(Time_on_Page))
avg_time_page
## # A tibble: 5 × 2
##   Traffic_Source Avg_Time
##   <chr>             <dbl>
## 1 Direct             3.95
## 2 Organic            3.98
## 3 Paid               4.09
## 4 Referral           3.98
## 5 Social             4.19

Interpretation:
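Time on page is a proxy for content relevance, not a direct measure of quality. If Organic visitors had a much higher time on page than Paid visitors, that would suggest the organic content is more relevant and the paid landing pages need improvement. Here the averages are close (3.95 to 4.19), with Social and Paid slightly ahead of Organic, so no such gap appears.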

13. Time on Page Plot

ggplot(avg_time_page, aes(Traffic_Source, Avg_Time)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Time on Page")

Interpretation:
The content team uses this to identify which traffic sources appreciate long‑form articles vs. quick scans – enabling different content strategies per channel.

14. Previous Visits vs Conversion

ggplot(data, aes(Previous_Visits, Conversion_Rate)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Previous Visits vs Conversion")

Interpretation:
Look for a pattern – if repeat visitors (higher Previous_Visits) convert more often, a retargeting or loyalty campaign is justified. This plot visualises that relationship.

15. Session Duration Distribution

ggplot(data, aes(Session_Duration)) +
  geom_histogram(bins = 20) +
  theme_minimal() +
  labs(title = "Session Duration Distribution")

Interpretation:
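The histogram reveals right skew: many very short sessions that may be bounces, and a long tail of highly engaged users (the median is about 1.99 against a maximum of about 20.3). This informs how to define “good” vs. “bad” sessions. The dataset does not state the unit of Session_Duration, so any threshold should be set in the column's own units.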

16. Bounce Rate vs Conversion

ggplot(data, aes(Bounce_Rate, Conversion_Rate)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Bounce Rate vs Conversion")

Interpretation:
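A clear downward trend would support the idea that sessions with lower bounce rates convert more often. The correlation reported in section 24 is only about -0.05, so the relationship here is very weak, and a scatter plot cannot by itself show that reducing bounce rate (e.g., by improving page load speed or relevance) causes more conversions.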

17. Top 10 Longest Sessions

top_sessions <- data %>%
  arrange(desc(Session_Duration))

head(top_sessions, 10)
## # A tibble: 10 × 7
##    Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
##         <dbl>            <dbl>       <dbl> <chr>                 <dbl>
##  1          4             20.3      0.600  Paid                  1.26 
##  2          3             19.9      0.250  Paid                  0.188
##  3          6             19.1      0.297  Organic               4.03 
##  4          4             19.1      0.0741 Paid                  1.70 
##  5          4             19.0      0.364  Referral              4.05 
##  6          6             18.3      0.338  Referral              2.98 
##  7          8             18.2      0.0468 Social                4.17 
##  8          4             17.8      0.298  Organic               3.48 
##  9          8             16.7      0.457  Organic               4.65 
## 10          4             16.6      0.207  Paid                  0.543
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>

Interpretation:
These extreme cases may be your most valuable users – completing a long checkout or reading many articles. Share this list with the product team to understand what keeps ultra‑engaged users on the site.

18. Avg Page Views by Source

avg_pageviews <- data %>%
  group_by(Traffic_Source) %>%
  summarise(Avg_Page_Views = mean(Page_Views))

avg_pageviews
## # A tibble: 5 × 2
##   Traffic_Source Avg_Page_Views
##   <chr>                   <dbl>
## 1 Direct                   4.96
## 2 Organic                  5.03
## 3 Paid                     4.94
## 4 Referral                 4.98
## 5 Social                   4.70

Interpretation:
Page views per session indicate browsing depth. Some sources (e.g., Referral) might lead to deeper exploration than others (e.g., Direct) – this metric helps assess traffic quality, not just quantity.

19. Page Views Plot

ggplot(avg_pageviews, aes(Traffic_Source, Avg_Page_Views)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Average Page Views")

Interpretation:
The growth team uses this chart to decide which sources to target for campaigns aimed at increasing engagement. If Paid search gives high page views but low conversion, the issue may be the call‑to‑action, not quantity.

20. Session Duration Consistency by Source

consistency <- data %>%
  group_by(Traffic_Source) %>%
  summarise(
    Mean_Session = mean(Session_Duration, na.rm = TRUE),
    SD_Session = sd(Session_Duration, na.rm = TRUE),
    CV = SD_Session / Mean_Session   # Coefficient of variation (lower = more consistent)
  ) %>%
  arrange(CV)

consistency
## # A tibble: 5 × 4
##   Traffic_Source Mean_Session SD_Session    CV
##   <chr>                 <dbl>      <dbl> <dbl>
## 1 Social                 3.06       2.89 0.947
## 2 Direct                 2.69       2.61 0.969
## 3 Organic                3.10       3.13 1.01 
## 4 Referral               3.13       3.18 1.02 
## 5 Paid                   2.94       3.36 1.14

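Interpretation: The traffic source with the smallest coefficient of variation (CV = SD / mean) delivers the most predictable engagement. Here that is Social (0.947), while Paid (1.14) is the least consistent. All five CVs are close to 1, so session duration varies widely within every source.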
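
The CV column can be re-derived by hand from the other two columns. The sketch below retypes the rounded means and SDs from the table above, so the results may differ from the printed CV in the third decimal place; the ranking is what matters here.

```r
# Rounded means and SDs copied from the consistency table above.
mean_session <- c(Social = 3.06, Direct = 2.69, Organic = 3.10,
                  Referral = 3.13, Paid = 2.94)
sd_session   <- c(Social = 2.89, Direct = 2.61, Organic = 3.13,
                  Referral = 3.18, Paid = 3.36)

cv        <- sd_session / mean_session  # coefficient of variation per source
cv_sorted <- sort(cv)                   # lowest CV (most consistent) first

print(round(cv_sorted, 3))
```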

21. Correlation: Time on Page vs Conversion

cor_time_conversion <- cor(data$Time_on_Page, data$Conversion_Rate, use = "complete.obs")
cor_time_conversion
## [1] 0.2296688

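Interpretation: A positive correlation means more time on page is associated with higher conversion rates. At r ≈ 0.23 the association is weak, and it does not establish that longer time on page causes conversion.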
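
To gauge how precisely this correlation is estimated, a rough sketch follows. It works only from the reported r and the row count, and assumes all 2,000 rows entered the calculation (no missing values were found earlier). Base R's cor.test() performs the same kind of test directly on the raw columns.

```r
r <- 0.2296688  # correlation reported above
n <- 2000       # rows, from dim(data); assumes no rows were dropped

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Approximate 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]))
```

With a sample this large, even a weak correlation is statistically distinguishable from zero; the size of r, not the p-value, is what indicates practical importance.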

22. ANOVA: Bounce Rate by Traffic Source

anova_bounce <- aov(Bounce_Rate ~ Traffic_Source, data = data)
summary(anova_bounce)
##                  Df Sum Sq Mean Sq F value Pr(>F)  
## Traffic_Source    4   0.20 0.04981   1.955 0.0989 .
## Residuals      1995  50.84 0.02548                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

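Interpretation: If the p-value were below 0.05, at least one traffic source would have a statistically different mean bounce rate. Here p = 0.0989, so the differences between sources are not significant at the 5% level.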
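
The F value and p-value in the table can be reproduced from the mean squares and degrees of freedom. The sketch below retypes those rounded figures from the ANOVA output above and also computes eta squared, an effect-size measure that is an addition here and not part of the original output.

```r
# Values copied from the ANOVA table above (rounded as printed).
df_between <- 4
df_within  <- 1995
ms_between <- 0.04981
ms_within  <- 0.02548

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation attributable to Traffic_Source
ss_between <- ms_between * df_between
ss_within  <- ms_within * df_within
eta_sq     <- ss_between / (ss_between + ss_within)

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```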

23. Multiple Regression: Conversion Rate on Page Views and Time on Page

model_multiple <- lm(Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
summary(model_multiple)
## 
## Call:
## lm(formula = Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62137  0.00077  0.01648  0.02719  0.04892 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.9432452  0.0039984 235.906  < 2e-16 ***
## Page_Views   0.0036452  0.0006501   5.607 2.34e-08 ***
## Time_on_Page 0.0051582  0.0004917  10.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06346 on 1997 degrees of freedom
## Multiple R-squared:  0.06743,    Adjusted R-squared:  0.0665 
## F-statistic:  72.2 on 2 and 1997 DF,  p-value: < 2.2e-16

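Interpretation: The adjusted R‑squared tells how much conversion variance is explained by page views and time on page. Here it is 0.0665, so the model explains under 7% of the variance even though both coefficients are highly significant: the effects are statistically detectable but small.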
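
For reference, the adjusted value follows from the multiple R‑squared, the number of observations, and the number of predictors. A short check using the figures printed in the summary above:

```r
r_squared <- 0.06743  # Multiple R-squared from the summary above
n <- 2000             # observations (1997 residual df + 3 coefficients)
p <- 2                # predictors: Page_Views and Time_on_Page

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))  # 0.0665, matching the summary
```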

24. Correlations with Conversion Rate

numeric_cols <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_cols, use = "complete.obs")
cor_matrix[, "Conversion_Rate"] %>% sort(decreasing = TRUE)
##  Conversion_Rate     Time_on_Page Session_Duration       Page_Views 
##       1.00000000       0.22966876       0.17779803       0.12663537 
##  Previous_Visits      Bounce_Rate 
##       0.10949602      -0.04905104

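Interpretation: Excluding Conversion_Rate's correlation with itself (1.0), the variable with the highest absolute correlation has the strongest linear association with conversion. Here that is Time_on_Page (0.23); Bounce_Rate is the weakest (-0.05). Note that sort(decreasing = TRUE) orders by signed value, not by absolute value.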
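
Ranking by absolute value, with the self-correlation dropped, can be sketched as follows. The correlations are retyped from the output above; `cors`, `others`, and `ranked` are illustrative names.

```r
# Correlations with Conversion_Rate, copied from the output above.
cors <- c(Conversion_Rate = 1.00000000, Time_on_Page = 0.22966876,
          Session_Duration = 0.17779803, Page_Views = 0.12663537,
          Previous_Visits = 0.10949602, Bounce_Rate = -0.04905104)

others <- cors[names(cors) != "Conversion_Rate"]         # drop self-correlation
ranked <- others[order(abs(others), decreasing = TRUE)]  # rank by strength

print(ranked)
```

In this dataset the signed and absolute orderings happen to agree, because the only negative correlation is also the smallest in magnitude.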

25. Duplicate Rows Check

dup_count <- sum(duplicated(data))
dup_count
## [1] 0

Interpretation: Duplicate rows artificially inflate counts and bias averages. The count here is 0, so no de-duplication is needed.

26. Page Views: Standard Deviation and 90th Percentile

sd_pageviews <- sd(data$Page_Views, na.rm = TRUE)
p90_pageviews <- quantile(data$Page_Views, 0.9, na.rm = TRUE)
cat("Standard Deviation:", sd_pageviews, "\n90th Percentile:", p90_pageviews)
## Standard Deviation: 2.183903 
## 90th Percentile: 8

Interpretation: The 90th percentile tells us that 90% of sessions have at most that many page views, here 8.

27. Session Duration vs Bounce Rate

ggplot(data, aes(Session_Duration, Bounce_Rate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Session Duration vs Bounce Rate (with regression line)")

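Interpretation: A downward-sloping regression line would indicate that longer sessions tend to have lower bounce rates. The shaded band (se = TRUE) shows the uncertainty of the fit; a wide band around a nearly flat line means the relationship is weak.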

28. Conversion Rate Boxplot by Source

ggplot(data, aes(Traffic_Source, Conversion_Rate, fill = Traffic_Source)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Conversion Rate Distribution by Traffic Source")

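Interpretation: Boxplots reveal medians, outliers, and spread; a high median and a narrow box indicate consistently good performance. Because at least three quarters of all Conversion_Rate values equal 1.0 (see the summary in section 1), expect the boxes to be compressed near 1.0 with the lower values drawn as outlier points.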

29. Traffic Source Pie Chart

source_counts <- table(data$Traffic_Source)
pie(source_counts, main = "Traffic Source Share", col = rainbow(length(source_counts)))

Interpretation: The largest slice shows our primary acquisition channel, here Organic (786 of 2,000 rows).

30. Pair Plot of Engagement Metrics

library(GGally)
ggpairs(data %>% select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate))

Interpretation: This matrix shows scatter plots, correlations, and distributions. The correlation between Time_on_Page and Conversion_Rate is positive but weak (about 0.23), which is consistent with engaged users converting somewhat better but is not proof of it.

31. Cumulative Distribution of Session Duration

ggplot(data, aes(Session_Duration)) +
  stat_ecdf(geom = "step") +
  labs(title = "Cumulative Distribution of Session Duration", y = "Cumulative Probability")

Interpretation: From the CDF we can read percentiles directly – e.g., where the curve reaches 0.8 gives the 80th percentile.
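
The same reading can be done numerically with base R's ecdf() and quantile(). The sketch below uses simulated right-skewed durations (an assumption for illustration, not the website data) to show that the two functions are inverses of each other at a given probability.

```r
set.seed(1)
# Simulated, right-skewed durations; NOT the website data.
x <- rexp(1000, rate = 1 / 3)

Fn  <- ecdf(x)                     # empirical CDF as a function
q80 <- quantile(x, 0.8, type = 1)  # type 1 = inverse of the empirical CDF

print(unname(q80))  # value where the curve reaches 0.8
print(Fn(q80))      # fraction of values at or below q80
print(Fn(5))        # fraction of simulated durations <= 5
```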

32. New vs Repeat Visitors: t-test on Conversion Rate

data <- data %>%
  mutate(Visitor_Type = ifelse(Previous_Visits > 0, "Repeat", "New"))
t.test(Conversion_Rate ~ Visitor_Type, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  Conversion_Rate by Visitor_Type
## t = -3.2839, df = 335, p-value = 0.001132
## alternative hypothesis: true difference in means between group New and group Repeat is not equal to 0
## 95 percent confidence interval:
##  -0.027038461 -0.006780576
## sample estimates:
##    mean in group New mean in group Repeat 
##            0.9675400            0.9844495

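Interpretation: With p = 0.0011 (below 0.05), repeat visitors convert at a statistically different rate from new visitors. The mean is higher for repeat visitors (0.984 vs 0.968), a difference of about 0.017 with a 95% confidence interval of roughly 0.007 to 0.027.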

33. Polynomial Regression: Conversion Rate vs Session Duration

poly_model <- lm(Conversion_Rate ~ poly(Session_Duration, 2), data = data)
summary(poly_model)
## 
## Call:
## lm(formula = Conversion_Rate ~ poly(Session_Duration, 2), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62258 -0.00087  0.01598  0.02825  0.04532 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.982065   0.001435 684.433  < 2e-16 ***
## poly(Session_Duration, 2)1  0.522114   0.064169   8.137 7.07e-16 ***
## poly(Session_Duration, 2)2 -0.357530   0.064169  -5.572 2.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06417 on 1997 degrees of freedom
## Multiple R-squared:  0.04644,    Adjusted R-squared:  0.04548 
## F-statistic: 48.62 on 2 and 1997 DF,  p-value: < 2.2e-16
ggplot(data, aes(Session_Duration, Conversion_Rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
  labs(title = "Polynomial Regression: Conversion Rate vs Session Duration")

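Interpretation: A significant quadratic term indicates curvature, either a U‑shape or an inverted U. Here the second-degree coefficient is negative (-0.358) and significant, which points to an inverted U: conversion rises with session duration and then levels off or declines. Whether there is a true “sweet spot” depends on the peak falling inside the observed range, and with an R‑squared of about 0.046 the curve explains little of the variance.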
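
Locating such a peak requires coefficients on the original scale, which the orthogonal basis used by poly() by default does not give directly. A minimal sketch on simulated data with a known peak at x = 6 (the data and all variable names are made up for illustration) refits with raw = TRUE and computes the vertex of the parabola:

```r
set.seed(42)
# Simulated data with a known peak at x = 6; NOT the website data.
x <- runif(500, min = 0, max = 12)
y <- 1 + 0.6 * x - 0.05 * x^2 + rnorm(500, sd = 0.1)

# raw = TRUE gives coefficients on x and x^2 directly
fit <- lm(y ~ poly(x, 2, raw = TRUE))
b   <- unname(coef(fit))  # b[1] intercept, b[2] linear, b[3] quadratic

peak_x        <- -b[2] / (2 * b[3])  # vertex of the fitted parabola
is_inverted_u <- b[3] < 0            # concave: the vertex is a maximum
in_range      <- peak_x >= min(x) && peak_x <= max(x)

print(c(peak_x = peak_x, inverted_u = is_inverted_u, in_range = in_range))
```

The in_range check matters: a vertex outside the observed durations means the fitted curve is monotone over the data, and no sweet spot is supported.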

34. Page Views Outliers (IQR Method)

Q1 <- quantile(data$Page_Views, 0.25, na.rm = TRUE)
Q3 <- quantile(data$Page_Views, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
outliers <- data %>% filter(Page_Views < lower_bound | Page_Views > upper_bound)
n_outliers <- nrow(outliers)
n_outliers
## [1] 21

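Interpretation: Outliers may be bots (extremely high page views). With Q1 = 3 and Q3 = 6 (see the summary in section 1), the bounds are -1.5 and 10.5, so all 21 flagged rows have 11 or more page views. Sessions with zero page views fall inside the bounds and are not flagged by this rule, so immediate exits need a separate check.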
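
The bounds can be worked out from the quartiles alone. The sketch below uses the Page_Views quartiles printed by summary(data) earlier (1st Qu. = 3, 3rd Qu. = 6); `iqr_bounds` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: Tukey fences computed from the two quartiles.
iqr_bounds <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of Page_Views from the summary(data) output
bounds <- iqr_bounds(3, 6)
print(bounds)   # lower = -1.5, upper = 10.5

# Page_Views ranges from 0 to 14 in the summary; see which values get flagged
candidates <- 0:14
flagged <- candidates[candidates < bounds[["lower"]] |
                      candidates > bounds[["upper"]]]
print(flagged)  # 11 12 13 14
```

Because the lower bound is negative and page views cannot be, this rule can only ever flag the high side of this column.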

35. Pair Plot Colored by Traffic Source

library(GGally)

pair_data <- data %>%
  select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate, Traffic_Source)

ggpairs(pair_data,
        columns = 1:5,
        aes(color = Traffic_Source, alpha = 0.5),
        title = "Pair Plot: Engagement Metrics Colored by Traffic Source") +
  theme_minimal()

Interpretation: This coloured pair plot shows how the traffic sources compare across multiple metrics simultaneously.