library(tidyverse)  # provides read_csv (readr), %>% and group_by (dplyr), and ggplot2
data <- read_csv("C:/Users/tsudh/Desktop/R/R project/website_wata.csv")
colnames(data) <- gsub(" ", "_", colnames(data))
head(data)
## # A tibble: 6 × 7
## Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 5 11.1 0.231 Organic 3.89
## 2 4 3.43 0.391 Social 8.48
## 3 4 1.62 0.398 Organic 9.64
## 4 5 3.63 0.180 Organic 2.07
## 5 5 4.24 0.292 Paid 1.96
## 6 3 4.54 0.421 Social 3.44
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>
dim(data)
## [1] 2000 7
str(data)
## spc_tbl_ [2,000 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Page_Views : num [1:2000] 5 4 4 5 5 3 5 4 6 7 ...
## $ Session_Duration: num [1:2000] 11.05 3.43 1.62 3.63 4.24 ...
## $ Bounce_Rate : num [1:2000] 0.231 0.391 0.398 0.18 0.292 ...
## $ Traffic_Source : chr [1:2000] "Organic" "Social" "Organic" "Organic" ...
## $ Time_on_Page : num [1:2000] 3.89 8.48 9.64 2.07 1.96 ...
## $ Previous_Visits : num [1:2000] 3 0 2 3 5 2 1 5 1 5 ...
## $ Conversion_Rate : num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. `Page Views` = col_double(),
## .. `Session Duration` = col_double(),
## .. `Bounce Rate` = col_double(),
## .. `Traffic Source` = col_character(),
## .. `Time on Page` = col_double(),
## .. `Previous Visits` = col_double(),
## .. `Conversion Rate` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data)
## Page_Views Session_Duration Bounce_Rate Traffic_Source
## Min. : 0.00 Min. : 0.003613 Min. :0.007868 Length:2000
## 1st Qu.: 3.00 1st Qu.: 0.815828 1st Qu.:0.161986 Class :character
## Median : 5.00 Median : 1.993983 Median :0.266375 Mode :character
## Mean : 4.95 Mean : 3.022045 Mean :0.284767
## 3rd Qu.: 6.00 3rd Qu.: 4.197569 3rd Qu.:0.388551
## Max. :14.00 Max. :20.290516 Max. :0.844939
## Time_on_Page Previous_Visits Conversion_Rate
## Min. : 0.06852 Min. :0.000 Min. :0.3437
## 1st Qu.: 1.93504 1st Qu.:1.000 1st Qu.:1.0000
## Median : 3.31532 Median :2.000 Median :1.0000
## Mean : 4.02744 Mean :1.978 Mean :0.9821
## 3rd Qu.: 5.41463 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :24.79618 Max. :9.000 Max. :1.0000
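Interpretation: The dataset has 2,000 rows and 7 columns: six numeric metrics and one character column (Traffic_Source). Checking dimensions, data types, and summary statistics first helps plan further analyses and surfaces obvious data issues before diving into specific metrics. For example, Conversion_Rate has a median of 1 and a mean of 0.98, so most sessions sit at the maximum and the variable has little spread.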
colSums(is.na(data))
## Page_Views Session_Duration Bounce_Rate Traffic_Source
## 0 0 0 0
## Time_on_Page Previous_Visits Conversion_Rate
## 0 0 0
Interpretation: Missing data can distort calculations. Counting missing values per column shows which columns would need imputation, removal, or flagging for stakeholders. Here every column has zero missing values, so no such handling is required.
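Had the check found gaps, one common approach is to flag the affected rows and fill numeric columns with the median. The following is a minimal sketch under assumptions: `toy` is a made-up data frame (not the project data), and median imputation and the "Unknown" label are just one option among several.

```r
# Toy data with deliberate gaps; values are illustrative, not from the dataset
toy <- data.frame(
  Session_Duration = c(1.2, NA, 3.4, 5.0),
  Traffic_Source   = c("Organic", "Paid", NA, "Social"),
  stringsAsFactors = FALSE
)

# Count missing values per column, same check as above
na_counts <- colSums(is.na(toy))

# Flag rows with a missing duration before filling them in
toy$Duration_Imputed <- is.na(toy$Session_Duration)

# Median imputation for the numeric column
toy$Session_Duration[toy$Duration_Imputed] <- median(toy$Session_Duration, na.rm = TRUE)

# Label missing categories explicitly rather than dropping the rows
toy$Traffic_Source[is.na(toy$Traffic_Source)] <- "Unknown"

toy
```

Keeping the flag column lets later analyses be rerun with and without the imputed rows.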
traffic_count <- data %>%
group_by(Traffic_Source) %>%
summarise(Visitors = n())
traffic_count
## # A tibble: 5 × 2
## Traffic_Source Visitors
## <chr> <int>
## 1 Direct 216
## 2 Organic 786
## 3 Paid 428
## 4 Referral 301
## 5 Social 269
Interpretation:
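Grouping by Traffic_Source shows which acquisition channel drives the
most visitors. Organic leads with 786 of 2,000 sessions, followed by Paid
with 428. This informs budget allocation: the strong Organic share
supports continued SEO investment, while the Paid share should be
weighed against its ROI.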
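Raw counts are easier to compare as shares of the total. A small sketch that re-enters the counts from the table above by hand (the `Share_Pct` column name is an illustrative choice):

```r
# Visitor counts copied from the summary table above
traffic_share <- data.frame(
  Traffic_Source = c("Direct", "Organic", "Paid", "Referral", "Social"),
  Visitors       = c(216, 786, 428, 301, 269)
)

# Share of all visitors per source, in percent
traffic_share$Share_Pct <- round(100 * traffic_share$Visitors / sum(traffic_share$Visitors), 1)

# Largest channel first
traffic_share[order(-traffic_share$Visitors), ]
```

On the real data the same column could be added to `traffic_count` with `mutate(Share_Pct = 100 * Visitors / sum(Visitors))`.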
ggplot(traffic_count, aes(Traffic_Source, Visitors)) +
geom_col() +
theme_minimal() +
labs(title = "Visitors by Traffic Source")
Interpretation:
The bar chart makes differences between traffic sources immediately
visible. The marketing director can use this visual to highlight which
source outperforms others in executive presentations.
avg_session <- data %>%
group_by(Traffic_Source) %>%
summarise(Avg_Session = mean(Session_Duration, na.rm = TRUE))
avg_session
## # A tibble: 5 × 2
## Traffic_Source Avg_Session
## <chr> <dbl>
## 1 Direct 2.69
## 2 Organic 3.10
## 3 Paid 2.94
## 4 Referral 3.13
## 5 Social 3.06
Interpretation:
Average session duration reveals which sources bring users who stay
longer – a key engagement metric. Longer sessions often indicate higher
interest and likelihood of conversion.
ggplot(avg_session, aes(Traffic_Source, Avg_Session)) +
geom_col() +
theme_minimal() +
labs(title = "Average Session Duration")
Interpretation:
The plot highlights whether, for example, Referral traffic has notably
longer sessions than Direct traffic, suggesting external links bring
more committed readers – guiding content strategy.
avg_bounce <- data %>%
group_by(Traffic_Source) %>%
summarise(Avg_Bounce = mean(Bounce_Rate))
avg_bounce
## # A tibble: 5 × 2
## Traffic_Source Avg_Bounce
## <chr> <dbl>
## 1 Direct 0.285
## 2 Organic 0.282
## 3 Paid 0.296
## 4 Referral 0.266
## 5 Social 0.296
Interpretation:
A low average bounce rate signals that visitors from that source tend to
explore further, while a high rate indicates a possible landing page
mismatch. This helps prioritise optimisation efforts.
ggplot(avg_bounce, aes(Traffic_Source, Avg_Bounce)) +
geom_col() +
theme_minimal() +
labs(title = "Average Bounce Rate")
Interpretation:
The UX team uses this plot to decide where to run A/B tests on page
design or content relevance – the lowest bounce rate source may serve as
a benchmark.
conversion_analysis <- data %>%
group_by(Traffic_Source) %>%
summarise(Avg_Conversion = mean(Conversion_Rate))
conversion_analysis
## # A tibble: 5 × 2
## Traffic_Source Avg_Conversion
## <chr> <dbl>
## 1 Direct 0.979
## 2 Organic 0.982
## 3 Paid 0.979
## 4 Referral 0.988
## 5 Social 0.983
Interpretation:
Even a low-traffic source may be highly valuable if it converts often.
This analysis directly informs ROI calculations and channel strategy.
Here the averages are nearly identical across sources (0.979 to 0.988),
so conversion rate alone gives little basis for shifting budget.
ggplot(conversion_analysis, aes(Traffic_Source, Avg_Conversion)) +
geom_col() +
theme_minimal() +
labs(title = "Average Conversion Rate")
Interpretation:
The CMO can use this visual to justify shifting budget from
low‑conversion channels to high‑conversion ones. Clear labels and title
ensure executive clarity.
ggplot(data, aes(Page_Views, Session_Duration)) +
geom_point() +
theme_minimal() +
labs(title = "Page Views vs Session Duration")
Interpretation:
Each point represents a user session. A positive trend means encouraging
deeper browsing could increase overall time on site – a potential lever
for engagement.
avg_time_page <- data %>%
group_by(Traffic_Source) %>%
summarise(Avg_Time = mean(Time_on_Page))
avg_time_page
## # A tibble: 5 × 2
## Traffic_Source Avg_Time
## <chr> <dbl>
## 1 Direct 3.95
## 2 Organic 3.98
## 3 Paid 4.09
## 4 Referral 3.98
## 5 Social 4.19
Interpretation:
Time on page is a rough proxy for content relevance. If Organic visitors
had much higher time on page than Paid visitors, organic content would
likely be more relevant and paid landing pages would need improvement.
Here the averages are close (3.95 to 4.19), so no source stands out.
ggplot(avg_time_page, aes(Traffic_Source, Avg_Time)) +
geom_col() +
theme_minimal() +
labs(title = "Average Time on Page")
Interpretation:
The content team uses this to identify which traffic sources appreciate
long‑form articles vs. quick scans – enabling different content
strategies per channel.
ggplot(data, aes(Previous_Visits, Conversion_Rate)) +
geom_point() +
theme_minimal() +
labs(title = "Previous Visits vs Conversion")
Interpretation:
Look for a pattern – if repeat visitors (higher Previous_Visits) convert
more often, a retargeting or loyalty campaign is justified. This plot
visualises that relationship.
ggplot(data, aes(Session_Duration)) +
geom_histogram(bins = 20) +
theme_minimal() +
labs(title = "Session Duration Distribution")
Interpretation:
The histogram reveals right skew: many short sessions (the median is
about 2 against a maximum above 20, in the dataset's time units) that may
be bounces, and a long tail of highly engaged users. This informs how to
define “good” vs. “bad” sessions.
ggplot(data, aes(Bounce_Rate, Conversion_Rate)) +
geom_point() +
theme_minimal() +
labs(title = "Bounce Rate vs Conversion")
Interpretation:
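A clear downward trend would suggest that reducing bounce rate (e.g., by
improving page load speed or relevance) goes along with higher
conversions. In this dataset the correlation between Bounce_Rate and
Conversion_Rate is only about -0.05, so the relationship is weak, and a
scatter plot alone cannot establish causation.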
top_sessions <- data %>%
arrange(desc(Session_Duration))
head(top_sessions, 10)
## # A tibble: 10 × 7
## Page_Views Session_Duration Bounce_Rate Traffic_Source Time_on_Page
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 4 20.3 0.600 Paid 1.26
## 2 3 19.9 0.250 Paid 0.188
## 3 6 19.1 0.297 Organic 4.03
## 4 4 19.1 0.0741 Paid 1.70
## 5 4 19.0 0.364 Referral 4.05
## 6 6 18.3 0.338 Referral 2.98
## 7 8 18.2 0.0468 Social 4.17
## 8 4 17.8 0.298 Organic 3.48
## 9 8 16.7 0.457 Organic 4.65
## 10 4 16.6 0.207 Paid 0.543
## # ℹ 2 more variables: Previous_Visits <dbl>, Conversion_Rate <dbl>
Interpretation:
These extreme cases may be your most valuable users – completing a long
checkout or reading many articles. Share this list with the product team
to understand what keeps ultra‑engaged users on the site.
avg_pageviews <- data %>%
group_by(Traffic_Source) %>%
summarise(Avg_Page_Views = mean(Page_Views))
avg_pageviews
## # A tibble: 5 × 2
## Traffic_Source Avg_Page_Views
## <chr> <dbl>
## 1 Direct 4.96
## 2 Organic 5.03
## 3 Paid 4.94
## 4 Referral 4.98
## 5 Social 4.70
Interpretation:
Page views per session indicate browsing depth. Some sources (e.g.,
Referral) might lead to deeper exploration than others (e.g., Direct) –
this metric helps assess traffic quality, not just quantity.
ggplot(avg_pageviews, aes(Traffic_Source, Avg_Page_Views)) +
geom_col() +
theme_minimal() +
labs(title = "Average Page Views")
Interpretation:
The growth team uses this chart to decide which sources to target for
campaigns aimed at increasing engagement. If Paid search gives high page
views but low conversion, the issue may be the call‑to‑action, not
quantity.
consistency <- data %>%
group_by(Traffic_Source) %>%
summarise(
Mean_Session = mean(Session_Duration, na.rm = TRUE),
SD_Session = sd(Session_Duration, na.rm = TRUE),
CV = SD_Session / Mean_Session # Coefficient of variation (lower = more consistent)
) %>%
arrange(CV)
consistency
## # A tibble: 5 × 4
## Traffic_Source Mean_Session SD_Session CV
## <chr> <dbl> <dbl> <dbl>
## 1 Social 3.06 2.89 0.947
## 2 Direct 2.69 2.61 0.969
## 3 Organic 3.10 3.13 1.01
## 4 Referral 3.13 3.18 1.02
## 5 Paid 2.94 3.36 1.14
Interpretation: The traffic source with the smallest coefficient of variation (CV) delivers the most predictable engagement. Here that is Social (CV = 0.947), while Paid is the least consistent (CV = 1.14).
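To make the CV concrete, the sketch below wraps the same SD / mean ratio in a small helper and applies it to two made-up vectors that share a mean but differ in spread. The helper name `cv` and both vectors are illustrative assumptions, not part of the analysis.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Illustrative vectors (made up): both have a mean of 3
steady  <- c(2.8, 3.0, 3.2, 3.0)
erratic <- c(0.5, 6.0, 1.0, 4.5)

cv(steady)   # small: values cluster tightly around the mean
cv(erratic)  # larger: same mean, much wider spread
```

In the table above every source has a CV close to 1, meaning the standard deviation is about as large as the mean, which is typical of right-skewed duration data.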
cor_time_conversion <- cor(data$Time_on_Page, data$Conversion_Rate, use = "complete.obs")
cor_time_conversion
## [1] 0.2296688
Interpretation: The correlation of about 0.23 is positive but modest: more time on page is associated with somewhat higher conversion rates, which does not by itself show that one causes the other.
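The strength of this correlation can be put in context with a little arithmetic on the reported value. The sketch below assumes all 2,000 rows entered the calculation (no missing values were found earlier) and uses the standard t statistic for a Pearson correlation.

```r
r <- 0.2296688   # correlation reported above
n <- 2000        # number of rows, from dim(data)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

# t statistic for H0: true correlation is zero, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(r_squared, 3)
p_value < 0.001
```

With a sample this large even a modest correlation is statistically significant, so the size of `r_squared` (roughly 5%) is the more useful figure. On the real data, `cor.test(data$Time_on_Page, data$Conversion_Rate)` reports the same test together with a confidence interval.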
anova_bounce <- aov(Bounce_Rate ~ Traffic_Source, data = data)
summary(anova_bounce)
## Df Sum Sq Mean Sq F value Pr(>F)
## Traffic_Source 4 0.20 0.04981 1.955 0.0989 .
## Residuals 1995 50.84 0.02548
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
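Interpretation: A p-value below 0.05 would indicate that at least one traffic source has a statistically different mean bounce rate. Here p = 0.0989, so the differences between sources are not significant at the 5% level.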
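The decision rule can be applied in code, and when the overall test is significant a post-hoc comparison shows which pairs differ. The sketch below uses simulated bounce rates for three hypothetical sources A, B and C (not the project data); Tukey's HSD is one common follow-up and is an assumption here, not something the analysis above ran.

```r
set.seed(1)

# Simulated data: source C has a higher mean bounce rate by construction.
# TukeyHSD expects the grouping variable to be a factor.
toy <- data.frame(
  Traffic_Source = factor(rep(c("A", "B", "C"), each = 30)),
  Bounce_Rate    = c(rnorm(30, 0.30, 0.05),
                     rnorm(30, 0.30, 0.05),
                     rnorm(30, 0.40, 0.05))
)

fit <- aov(Bounce_Rate ~ Traffic_Source, data = toy)

# Pull the overall F-test p-value out of the summary table
p_overall <- summary(fit)[[1]][["Pr(>F)"]][1]

# Only look at pairwise differences if the overall test is significant
if (p_overall < 0.05) {
  tukey <- TukeyHSD(fit)
  print(tukey)
}
```

For the real data the overall p-value is above 0.05, so the pairwise step would be skipped.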
model_multiple <- lm(Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
summary(model_multiple)
##
## Call:
## lm(formula = Conversion_Rate ~ Page_Views + Time_on_Page, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62137 0.00077 0.01648 0.02719 0.04892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9432452 0.0039984 235.906 < 2e-16 ***
## Page_Views 0.0036452 0.0006501 5.607 2.34e-08 ***
## Time_on_Page 0.0051582 0.0004917 10.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06346 on 1997 degrees of freedom
## Multiple R-squared: 0.06743, Adjusted R-squared: 0.0665
## F-statistic: 72.2 on 2 and 1997 DF, p-value: < 2.2e-16
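Interpretation: The adjusted R-squared of 0.0665 means page views and time on page together explain only about 6.7% of the variance in conversion rate. Both coefficients are positive and statistically significant, but the model leaves most of the variation unexplained.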
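The adjusted figure can be reproduced from the multiple R-squared in the summary using the usual adjustment formula. The numbers below are copied from the model output; only the variable names are new.

```r
r2 <- 0.06743  # Multiple R-squared from the summary above
n  <- 2000     # observations
p  <- 2        # predictors: Page_Views and Time_on_Page

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(adj_r2, 4)  # 0.0665, matching the summary
```

With 2,000 observations and two predictors the penalty is tiny, which is why the two values are almost equal.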
numeric_cols <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_cols, use = "complete.obs")
cor_matrix[, "Conversion_Rate"] %>% sort(decreasing = TRUE)
## Conversion_Rate Time_on_Page Session_Duration Page_Views
## 1.00000000 0.22966876 0.17779803 0.12663537
## Previous_Visits Bounce_Rate
## 0.10949602 -0.04905104
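Interpretation: The variable with the highest absolute correlation has the strongest linear association with conversion. Ignoring the self-correlation of 1, that is Time_on_Page (0.23); Bounce_Rate has the weakest (-0.05).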
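Sorting by the signed value puts negative correlations last even when they are large in magnitude. A sketch that ranks by absolute value instead, using the correlations printed above (re-entered by hand) and dropping the self-correlation:

```r
# Correlations with Conversion_Rate, copied from the output above
cors <- c(
  Conversion_Rate  =  1.00000000,
  Time_on_Page     =  0.22966876,
  Session_Duration =  0.17779803,
  Page_Views       =  0.12663537,
  Previous_Visits  =  0.10949602,
  Bounce_Rate      = -0.04905104
)

# Drop the self-correlation, then order by magnitude regardless of sign
cors   <- cors[names(cors) != "Conversion_Rate"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]

ranked
```

Here the ranking does not change because the only negative value is also the smallest in magnitude, but the absolute-value ordering is safer in general.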
dup_count <- sum(duplicated(data))
dup_count
## [1] 0
Interpretation: Duplicate rows artificially inflate counts and bias averages. None were found here, so no rows need to be removed.
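If duplicates had been present, they could be dropped before any aggregation. A minimal sketch on a made-up three-row data frame (`toy` and `deduped` are illustrative names):

```r
# Toy data: row 3 repeats row 1 exactly
toy <- data.frame(
  Page_Views     = c(5, 4, 5),
  Traffic_Source = c("Organic", "Social", "Organic")
)

sum(duplicated(toy))                 # number of repeated rows

# Keep the first occurrence of each row
deduped <- toy[!duplicated(toy), ]
nrow(deduped)
```

With dplyr loaded, `distinct(toy)` gives the same result.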
sd_pageviews <- sd(data$Page_Views, na.rm = TRUE)
p90_pageviews <- quantile(data$Page_Views, 0.9, na.rm = TRUE)
cat("Standard Deviation:", sd_pageviews, "\n90th Percentile:", p90_pageviews)
## Standard Deviation: 2.183903
## 90th Percentile: 8
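Interpretation: The 90th percentile of 8 means that at least 90% of sessions have 8 or fewer page views. The standard deviation of about 2.2 describes the typical spread around the mean of 4.95.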
ggplot(data, aes(Session_Duration, Bounce_Rate)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Session Duration vs Bounce Rate (with regression line)")
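Interpretation: A downward-sloping regression line would indicate that longer sessions tend to go with lower bounce rates. The shaded band (se = TRUE) shows the uncertainty of the fitted line; a nearly flat line means the relationship is weak.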
ggplot(data, aes(Traffic_Source, Conversion_Rate, fill = Traffic_Source)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Conversion Rate Distribution by Traffic Source")
Interpretation: Boxplots reveal medians, outliers, and spread; a high median and a narrow box indicate consistently good performance.
source_counts <- table(data$Traffic_Source)
pie(source_counts, main = "Traffic Source Share", col = rainbow(length(source_counts)))
Interpretation: The largest slice shows the primary acquisition channel, which here is Organic with 786 of 2,000 sessions (about 39%).
library(GGally)
ggpairs(data %>% select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate))
Interpretation: This matrix shows scatter plots, correlations, and distributions in one view. The correlation between Time_on_Page and Conversion_Rate is positive but modest (about 0.23), consistent with more engaged users converting slightly better.
ggplot(data, aes(Session_Duration)) +
stat_ecdf(geom = "step") +
labs(title = "Cumulative Distribution of Session Duration", y = "Cumulative Probability")
Interpretation: From the CDF we can read percentiles directly – e.g., where the curve reaches 0.8 gives the 80th percentile.
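The same reading can be done numerically. The sketch below uses simulated right-skewed durations (an assumption made so the example runs on its own) together with base R's `ecdf()` and `quantile()`.

```r
set.seed(42)

# Simulated right-skewed durations (illustrative only, not the project data)
x <- rexp(500, rate = 1 / 3)

Fn  <- ecdf(x)            # empirical CDF as a function
p80 <- quantile(x, 0.8)   # value where the curve reaches 0.8

Fn(p80)                   # proportion of sessions at or below p80
Fn(5)                     # proportion of sessions lasting at most 5 units
```

`quantile()` goes from a probability to a value, and `Fn()` goes from a value back to a probability, so the two are inverse readings of the same curve.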
data <- data %>%
mutate(Visitor_Type = ifelse(Previous_Visits > 0, "Repeat", "New"))
t.test(Conversion_Rate ~ Visitor_Type, data = data)
##
## Welch Two Sample t-test
##
## data: Conversion_Rate by Visitor_Type
## t = -3.2839, df = 335, p-value = 0.001132
## alternative hypothesis: true difference in means between group New and group Repeat is not equal to 0
## 95 percent confidence interval:
## -0.027038461 -0.006780576
## sample estimates:
## mean in group New mean in group Repeat
## 0.9675400 0.9844495
Interpretation: With p = 0.0011 (below 0.05), the difference is statistically significant: repeat visitors have a higher mean conversion rate (0.984) than new visitors (0.968).
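The sign of the t statistic and confidence interval depends on group order: R compares New minus Repeat because "New" sorts first. A short check using the numbers printed above (re-entered by hand):

```r
mean_new    <- 0.9675400
mean_repeat <- 0.9844495
ci          <- c(-0.027038461, -0.006780576)  # 95% CI for New minus Repeat

# Observed difference in the same direction as the test
diff_new_minus_repeat <- mean_new - mean_repeat

diff_new_minus_repeat   # negative: new visitors convert at a lower rate
mean(ci)                # the interval is centred on the observed difference
all(ci < 0)             # interval excludes zero, in line with p < 0.05
```

In absolute terms the gap is under two percentage points, so the result is statistically significant but small.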
poly_model <- lm(Conversion_Rate ~ poly(Session_Duration, 2), data = data)
summary(poly_model)
##
## Call:
## lm(formula = Conversion_Rate ~ poly(Session_Duration, 2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62258 -0.00087 0.01598 0.02825 0.04532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.982065 0.001435 684.433 < 2e-16 ***
## poly(Session_Duration, 2)1 0.522114 0.064169 8.137 7.07e-16 ***
## poly(Session_Duration, 2)2 -0.357530 0.064169 -5.572 2.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06417 on 1997 degrees of freedom
## Multiple R-squared: 0.04644, Adjusted R-squared: 0.04548
## F-statistic: 48.62 on 2 and 1997 DF, p-value: < 2.2e-16
ggplot(data, aes(Session_Duration, Conversion_Rate)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
labs(title = "Polynomial Regression: Conversion Rate vs Session Duration")
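Interpretation: A significant quadratic term indicates curvature, either a U-shape or an inverted U. Here the quadratic coefficient is negative and significant, so the fitted curve is concave: conversion rises with session duration at a decreasing rate, and any “sweet spot” is where the curve peaks, provided that peak lies within the observed range of durations.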
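Locating the peak needs coefficients on the original scale, which the orthogonal `poly()` terms above do not give directly. The sketch below refits with `raw = TRUE` on noise-free toy data built to peak at x = 5; the data and names are illustrative assumptions.

```r
# Noise-free toy data with a known peak at x = 5 (illustrative only)
x <- seq(0, 10, by = 0.5)
y <- 1 - 0.01 * (x - 5)^2

# raw = TRUE gives coefficients on x and x^2 directly
fit <- lm(y ~ poly(x, 2, raw = TRUE))
b   <- unname(coef(fit))   # b[1] intercept, b[2] linear, b[3] quadratic

# Vertex of b0 + b1*x + b2*x^2; it is a maximum when b2 < 0
peak_x <- -b[2] / (2 * b[3])
peak_x
```

On the real data the same formula would apply to a model fitted with `poly(Session_Duration, 2, raw = TRUE)`. If the resulting peak falls beyond the longest observed session, the curve is better read as diminishing returns than as a sweet spot.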
Q1 <- quantile(data$Page_Views, 0.25, na.rm = TRUE)
Q3 <- quantile(data$Page_Views, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
outliers <- data %>% filter(Page_Views < lower_bound | Page_Views > upper_bound)
n_outliers <- nrow(outliers)
n_outliers
## [1] 21
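Interpretation: With Q1 = 3 and Q3 = 6, the bounds are -1.5 and 10.5, so all 21 outliers are sessions with 11 or more page views, possibly bots or unusually engaged users. Sessions with zero page views fall inside the bounds and are not flagged by this rule.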
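The bounds can be checked by hand from the Page_Views quartiles reported by `summary(data)`. A worked version of the 1.5 × IQR rule with those values:

```r
q1 <- 3   # 1st quartile of Page_Views, from summary(data)
q3 <- 6   # 3rd quartile of Page_Views

iqr_val <- q3 - q1           # 3
lower   <- q1 - 1.5 * iqr_val  # -1.5
upper   <- q3 + 1.5 * iqr_val  # 10.5

# Page views are non-negative counts, so only the upper bound can be crossed
c(lower = lower, upper = upper)
```

Because the lower bound is negative, a separate rule (for example `Page_Views == 0`) would be needed to flag immediate exits.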
library(GGally)
pair_data <- data %>%
select(Session_Duration, Page_Views, Time_on_Page, Bounce_Rate, Conversion_Rate, Traffic_Source)
ggpairs(pair_data,
columns = 1:5,
aes(color = Traffic_Source, alpha = 0.5),
title = "Pair Plot: Engagement Metrics Colored by Traffic Source") +
theme_minimal()
Interpretation: This coloured pair plot reveals which
traffic source performs best across multiple metrics simultaneously.