# Data Manipulation
library(tidyverse) # Contains dplyr, ggplot2, tidyr, readr, etc.
library(grid) # Provides functions for custom graphics like adding text, lines, and shapes to plots.
library(gridExtra) # Allows you to arrange multiple plots (including ggplot2 plots) in a grid layout.
library(data.table) # Provides fast data manipulation functions
# Visualization
library(ggplot2) # For creating static plots
library(scales) # Formats axis labels and scales (percentages, thousands, etc.)
library(plotly) # For interactive plots
library(kableExtra) # For styling and formatting HTML tables
# Additional Libraries
library(corrplot) # For correlation matrices
library(stringr) # For text/string manipulation
library(knitr) # For generating reports (RMarkdown, Quarto)
library(htmltools) # Tools for building and rendering HTML content
library(networkD3) # For D3-based network and Sankey visualizations

Data Analytics and Consumer Insights Assignment
SOUTH EAST TECHNOLOGICAL UNIVERSITY
Supervisor: Dr. Denise Earle
Title: Exploratory Data Analysis of User Activity on Planning Alerts
Date: 20 November 2024
Author: Sushma Mahesh (C00313660)
Table of Contents
- Introduction
- Objectives
- Scope of the Analysis
- Significance of the Project
- Methodology
- Data Preparation and Cleaning
- Exploratory Data Analysis
- Conclusion
- Recommendations
- References
1. Introduction
PlanningAlerts.ie is an online service that provides users with notifications regarding planning applications and developments in their local area across Ireland. By subscribing to the platform, users can receive timely alerts about new planning proposals, ensuring they stay informed about residential, commercial, and industrial developments in their neighborhoods.
Source: PlanningAlerts
The service is crucial for empowering individuals and communities to engage with the local planning process. It helps citizens stay informed about developments that may affect their quality of life, while also providing developers with a way to track relevant planning regulations.
2. Objectives
The project aims to generate actionable insights to improve user engagement and enhance the overall user experience on PlanningAlerts.ie.
3. Scope of the Analysis
The analysis is based on a dataset of user activity logs from PlanningAlerts.ie, covering key variables such as session duration, device type (mobile, desktop, app), and referral source (e.g., Google, social media). By focusing on these aspects, the project aims to understand usage patterns and identify opportunities for enhancing the user experience.
4. Significance of the Project
Understanding how users engage with PlanningAlerts.ie is essential for optimizing the platform. The findings from this project can guide future improvements, such as refining notification accuracy, enhancing mobile usability, and targeting users based on their behavior to improve engagement.
5. Methodology
The methodology involves exploratory data analysis using R, with a focus on visualizations, summary statistics, and trend analysis.
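As a minimal illustration of this workflow (a sketch; pa_data is the dataset created in Section 6.2), a first pass typically combines structure checks with summary statistics:
# First-pass EDA checks (sketch; assumes pa_data from Section 6.2)
glimpse(pa_data)        # Column types and example values
summary(pa_data)        # Per-column summary statistics
colSums(is.na(pa_data)) # Missing values per column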
6. Data Preparation and Cleaning
Data preparation and cleaning are vital to ensure that the dataset is accurate, reliable, and ready for analysis or modeling. R provides a rich set of libraries, such as tidyverse, lubridate, and dplyr, that streamline these processes, making it easier to transform, clean, and analyze data. Repetitive cleaning tasks can also be automated with R scripts, saving time in the long run.
6.1 Install and Load packages
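The packages themselves are loaded in the setup chunk at the top of this report. As a one-time setup sketch, any packages that are not yet installed can be added before loading:
# Install any missing packages before loading them (one-time setup sketch)
pkgs <- c("tidyverse", "gridExtra", "data.table", "scales", "plotly",
          "kableExtra", "corrplot", "knitr", "htmltools", "networkD3")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)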
6.2 Data loading and Data transformation
This step involves data transformation and feature engineering: the raw datetime field is converted and cleaned into a proper format for further analysis. It is an essential part of data preparation, ensuring the data is ready for analysis, modeling, or visualization.
# Import the planning_alerts_data.csv file and create a new field, tfc_stamped_dt,
# containing the tfc_stamped datetime converted to YYYY-MM-DD HH:MM:SS format.
# Remove the old tfc_stamped field and rename the new one.
pa_data <- read_csv("planning_alerts_data.csv") %>%
mutate(tfc_stamped_dt = dmy_hm(tfc_stamped)) %>% # Parse "day-month-year hour:minute" text into POSIXct
select(tfc_id, tfc_stamped_dt, tfc_cookie:tfc_referrer) %>% # Drop the original tfc_stamped field
rename(tfc_stamped = tfc_stamped_dt)

7. Exploratory Data Analysis (EDA)
EDA in R is a comprehensive process of summarizing and visualizing data and identifying patterns, correlations, or issues such as missing values and outliers. Libraries such as dplyr, ggplot2, and psych support effective exploratory analysis. The ultimate goal of EDA is to gain insights that can guide further data processing or model-building steps.
7.1 User Session Pattern Analysis
The goal of this analysis is to understand how users interact with a website by examining their session data.
i. Number of sessions per User and session duration
user_sessions <- pa_data %>%
group_by(tfc_cookie) %>%
summarise(num_sessions = n_distinct(tfc_session), # Count distinct sessions
first_visit = min(tfc_stamped), # First visit timestamp
last_visit = max(tfc_stamped), # Last visit timestamp
session_duration = as.numeric(difftime(max(tfc_stamped), min(tfc_stamped), units = "mins")), # Time span between first and last visit, in minutes
.groups = 'drop')
# Extract hour of the day from the first visit time (tfc_stamped)
user_sessions <- user_sessions %>%
mutate(hour_of_day = hour(first_visit)) # Get the hour of day for the first visit
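Before plotting, a quick distribution check (a sketch using the fields created above) helps choose sensible axis scales:
# Sketch: inspect the spread of session counts and durations before plotting
summary(user_sessions$num_sessions)
summary(user_sessions$session_duration) # The wide range motivates the scaled y-axis below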
ggplot(user_sessions, aes(x = factor(hour_of_day), y = session_duration / 1000)) +
geom_violin(color = "darkorange3") +
labs(
title = "Session Duration by Hour of Day",
x = "Hour of Day",
y = "Session Duration (thousands of minutes)" # Y-axis now represents thousands of minutes
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 8),
axis.title.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) # Remove background grid lines

ii. Session Patterns and Most Visited Pages
# Calculate the sequence of pages visited within each session
page_sequence <- pa_data %>%
arrange(tfc_cookie, tfc_session, tfc_stamped) %>%
group_by(tfc_cookie, tfc_session) %>%
summarise(page_sequence = paste(tfc_full_url_screen, collapse = " > "), .groups = "drop") # Create a sequence of pages
# Find the most common paths (Top 10 most frequent paths by users)
common_paths_users <- page_sequence %>%
distinct(tfc_cookie, page_sequence) %>% # Ensure each user is counted once per page sequence
count(page_sequence, sort = TRUE) %>% # Count the number of users per page sequence
arrange(desc(n)) %>% # Arrange the counts in descending order
head(10) # Get top 10 most common paths
# Create the table
knitr::kable(common_paths_users,
format = "html",
digits = 2,
align = "lr",
col.names = c("Page Sequence", "Total Users"),
caption = '<div style="text-align: center;">Top 10 Most Common Paths by Users</div>',
color = "black",
table.attr = 'data-quarto-disable-processing = "true"') %>%
row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
row_spec(1:nrow(common_paths_users), background = "white") %>%
kableExtra::kable_styling(full_width = FALSE, font_size = 12) %>%
kableExtra::column_spec(1, width = "24em") %>%
kableExtra::column_spec(2, width = "16em") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Page Sequence | Total Users |
|---|---|
| application | 112499 |
| applicationmob | 26159 |
| list | 21318 |
| map | 19606 |
| applicationmob > applicationmob | 3064 |
| signup | 1677 |
| applicationmob > list | 1111 |
| contact | 871 |
| applicationmob > applicationmob > applicationmob | 820 |
| mobilemap | 554 |
most_visited_pages <- pa_data %>%
mutate(page_name = str_extract(tfc_full_url, "[^/]+$"), # Extract page name from URL
page_name = str_remove(page_name, "\\?[^/]+$")) %>% # Remove query parameter (everything after ?)
group_by(page_name) %>% # Group by cleaned page name
summarise(page_visits = n()) %>% # Count visits for each page
arrange(desc(page_visits)) %>% # Sort by visits in descending order
head(10) # Get top 10 most visited pages
knitr::kable(most_visited_pages,
format = "html",
digits = 2,
align = "lr",
col.names = c("Page", "Total Page Visits"),
caption = '<div style="text-align: center;">Top 10 Most Visited Pages</div>', color = "black",
table.attr = 'data-quarto-disable-processing = "true"') %>%
kable_styling(full_width = F, font_size = 12) %>%
row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
row_spec(1:nrow(most_visited_pages), background = "lightyellow2")

| Page | Total Page Visits |
|---|---|
| application | 215223 |
| list | 62111 |
| applicationmob | 57038 |
| map | 20326 |
| signup | 17781 |
| item | 6790 |
| data-centre-planning-applications | 2347 |
| subscriber | 2024 |
| mobilemap | 2012 |
| wind-farm-windfarm-turbine-planning-applications | 1907 |
iii. User Visit Frequency Analysis
# Step 1: Classify users into "once-off" and "repeat" visitors
user_sessions <- user_sessions %>%
mutate(visitor_type = ifelse(num_sessions == 1, "Once-off", "Repeat"))
# Step 2: Calculate the visit frequency for repeat visitors
repeat_visits <- user_sessions %>%
filter(visitor_type == "Repeat") %>%
mutate(days_between = as.numeric(difftime(last_visit, first_visit, units = "days")),
weeks_between = days_between / 7,
months_between = days_between / 30)
# Step 3: Summarize the number of once-off and repeat visitors
visitor_summary <- user_sessions %>%
summarise(once_off = sum(visitor_type == "Once-off"), # Count once-off visitors
repeat_visitors = sum(visitor_type == "Repeat")) # Count repeat visitors
# Step 4: Calculate the frequency of repeat visitors (e.g., visits within 1 day, 1 week, 1 month)
visit_frequency <- repeat_visits %>%
summarise(
daily_visits = mean(days_between <= 1),
weekly_visits = mean(weeks_between <= 1),
monthly_visits = mean(months_between <= 1)
) %>%
rename(
`Daily Visits` = daily_visits,
`Weekly Visits` = weekly_visits,
`Monthly Visits` = monthly_visits
)
# Visitor Summary Table
knitr::kable(visitor_summary,
format = "html",
digits = 2, # Round numbers to 2 decimal places
align = c("r", "r"), # Left-align the first column, right-align the second column
col.names = c("Once Off Visitors", "Repeat Visitors"),
caption = '<div style="text-align: center; font-weight: bold;">Visitor Summary</div>',
color = "black",
table.attr = 'data-quarto-disable-processing = "true"') %>%
kable_styling(full_width = FALSE, font_size = 14, position = "center") %>% # Reduce table size
row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
row_spec(1, color = "black", background = "lemonchiffon")

| Once Off Visitors | Repeat Visitors |
|---|---|
| 175408 | 13620 |
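A useful follow-up (a sketch using the visitor_summary table above) is the share of visitors who return, roughly 7% here:
# Sketch: percentage of all visitors who are repeat visitors
visitor_summary %>%
  mutate(repeat_share = repeat_visitors / (once_off + repeat_visitors) * 100)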
# Visitor Frequency Table
knitr::kable(visit_frequency,
format = "html",
digits = 2, # Round numbers to 2 decimal places
align = c("r", "r", "r"), # Right-align all three columns
col.names = c("Daily Visits", "Weekly Visits", "Monthly Visits"),
caption = '<div style="text-align: center; font-weight: bold;">Visitor Frequency</div>',
color = "black",
table.attr = 'data-quarto-disable-processing = "true"') %>%
kable_styling(full_width = FALSE, font_size = 14, position = "center") %>% # Reduce table size
row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
row_spec(1, color = "black", background = "bisque1")

| Daily Visits | Weekly Visits | Monthly Visits |
|---|---|---|
| 0.74 | 0.83 | 0.95 |
iv. Bounce rate per user, overall bounce rate, and page-wise bounce rates
# Step 1: Count the number of unique pages each visitor viewed
page_count_per_session <- pa_data %>%
group_by(tfc_cookie) %>%
mutate(num_pages = n_distinct(tfc_full_url)) %>%
ungroup()

# Step 2: Identify bounces for each visitor (only 1 unique page viewed)
bounced_sessions_user <- page_count_per_session %>%
filter(num_pages == 1) %>%
group_by(tfc_cookie) %>%
mutate(bounced_sessions = n()) %>%
ungroup()
# Step 3: Calculate total sessions for each user
total_sessions_user <- page_count_per_session %>%
group_by(tfc_cookie) %>%
mutate(total_sessions = n_distinct(tfc_session)) %>%
ungroup()
# Step 4: Merge bounced sessions and total sessions for each user
user_bounce_rate <- bounced_sessions_user %>%
left_join(total_sessions_user, by = "tfc_cookie") %>%
mutate(bounce_rate = (bounced_sessions / total_sessions) * 100)
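The heading also calls for an overall figure. A minimal sketch, assuming a bounce is a session with exactly one distinct page view:
# Sketch: overall site-wide bounce rate (bounced sessions / all sessions)
overall_bounce_rate <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  summarise(num_pages = n_distinct(tfc_full_url_screen), .groups = "drop") %>%
  summarise(overall_rate = mean(num_pages == 1) * 100)
overall_bounce_rate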
# Calculate the number of pages visited per session (for page-specific bounces)
page_specific <- pa_data %>%
group_by(tfc_cookie, tfc_session) %>%
summarise(pages_in_session = list(unique(tfc_full_url_screen)))

# Identify bounce sessions (sessions with only 1 page)
bounce_sessions <- page_specific %>%
filter(sapply(pages_in_session, length) == 1) %>%
ungroup()
# Count how many bounces each page has
bounce_count_per_page <- bounce_sessions %>%
unnest(pages_in_session) %>%
group_by(pages_in_session) %>%
summarise(bounce_count = n(), .groups = "drop")
top_10_bounce_pages <- bounce_count_per_page %>%
arrange(desc(bounce_count)) %>%
head(10)

# Calculate total sessions per page
total_sessions_per_page <- pa_data %>%
group_by(tfc_full_url_screen) %>%
summarise(total_sessions = n_distinct(tfc_session), .groups = "drop")
# Calculate the bounce rate for each page
bounce_rate_df <- bounce_count_per_page %>%
left_join(total_sessions_per_page, by = c("pages_in_session" = "tfc_full_url_screen")) %>%
mutate(bounce_rate = (bounce_count / total_sessions) * 100)
top_10_bounce_rate_df <- bounce_rate_df %>%
arrange(desc(bounce_rate)) %>% # Sort pages by bounce rate
head(10)
ggplot(top_10_bounce_rate_df, aes(x = reorder(pages_in_session, bounce_rate), y = bounce_rate)) +
geom_bar(stat = "identity", fill = "olivedrab") +
geom_text(aes(label = round(bounce_rate, 1)), vjust = -0.3, size = 3) +
labs(
title = "Top 10 Pages with the Highest Bounce Rates",
y = "Bounce Rate (%)"
) +
theme_minimal() +
coord_flip() + # Flip axes for better readability
theme(
axis.title.y.left = element_blank(),
axis.title.y = element_text(size = 8, hjust = 0.4),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5, size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())

7.2 Device Type and Application Interaction Analysis
The aim is to analyze and optimize user behavior, performance metrics, and cross-platform interactions through data manipulation and visualization techniques.
i. User and Session Breakdown by Device Type
# Calculate the number of unique users and sessions by device type
device_user_session <- pa_data %>%
group_by(tfc_device_type) %>%
summarise(
unique_users = n_distinct(tfc_cookie),
total_sessions = n_distinct(tfc_session),
avg_sessions_per_user = total_sessions / unique_users
)
ggplot(device_user_session, aes(x = reorder(tfc_device_type, avg_sessions_per_user),
y = avg_sessions_per_user,
fill = tfc_device_type)) +
geom_bar(stat = "identity", width = 0.4) +
labs(
title = "Average Sessions per User by Device Type",
y = "Average Sessions per User"
) +
theme_minimal() +
coord_flip() + # Flip for better readability
scale_y_continuous(limits = c(0, 3), breaks = 0:3) +
scale_fill_manual(values = c("lightblue", "mediumblue", "dodgerblue3", "lightblue3", "navyblue")) +
theme(
axis.title.y.left = element_blank(),
axis.title.y = element_text(size = 8, hjust = 0.4),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5, size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = margin(6, 6, 6, 6) # Adjust margins for a more compact plot
) +
guides(fill = "none") # Remove legendii. Application Interaction by Device Type
# Filter rows where user is viewing a planning application page
app_data <- pa_data %>%
filter(tfc_full_url_screen == "application")
# Count views by application reference and device type
app_interaction_device <- app_data %>% # Use the filtered application-page rows rather than all events
group_by(tfc_device_type, tfc_application_reference) %>%
summarise(
views = n(),
unique_users = n_distinct(tfc_cookie)
) %>%
arrange(tfc_device_type, desc(views))
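The resulting table can be inspected directly; as a small sketch, the single most-viewed application per device type:
# Sketch: top application reference per device type, by views
app_interaction_device %>%
  group_by(tfc_device_type) %>%
  slice_max(views, n = 1, with_ties = FALSE)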

iii. Cross-Device User Journeys
# Identify users who have multiple sessions on different devices
cross_device_users <- pa_data %>%
arrange(tfc_cookie, tfc_stamped) %>% # Order visits chronologically so device sequences are meaningful
group_by(tfc_cookie) %>%
filter(n_distinct(tfc_device_type) > 1) %>%
summarise(
device_types = paste(unique(tfc_device_type), collapse = " -> ")
)
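How common cross-device use is can then be checked with a short sketch:
# Sketch: number of cross-device users and their most frequent device paths
nrow(cross_device_users) # Users seen on more than one device type
cross_device_users %>%
  count(device_types, sort = TRUE) %>%
  head(5)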

iv. Device-Specific Conversion Rates (Registration Sign-Up Pages)
# Count the number of users visiting registration or signup pages
signup_data <- pa_data %>%
filter(tfc_full_url_screen %in% c("register", "signup"))
# Calculate conversion rates by device type
signup_conversion_rate <- signup_data %>%
group_by(tfc_device_type) %>%
summarise(
signups = n_distinct(tfc_cookie),
sessions = n_distinct(tfc_session)
) %>%
mutate(
conversion_rate = signups / sessions
)
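Because the denominator above only counts sessions that touched a signup page, a complementary sketch (assuming site-wide sessions as the base) puts signups in proportion to all sessions on each device:
# Sketch: signup users relative to ALL sessions on each device type
all_sessions_by_device <- pa_data %>%
  group_by(tfc_device_type) %>%
  summarise(all_sessions = n_distinct(tfc_session), .groups = "drop")
signup_conversion_rate %>%
  left_join(all_sessions_by_device, by = "tfc_device_type") %>%
  mutate(site_wide_rate = signups / all_sessions * 100)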
ggplot(signup_conversion_rate, aes(x = reorder(tfc_device_type, conversion_rate),
y = conversion_rate * 100, # Multiply by 100 to convert to percentage
fill = tfc_device_type)) +
geom_bar(stat = "identity", width = 0.4) + # Reduced bar width
geom_text(aes(label = round(conversion_rate * 100, 2)), vjust = -0.3, size = 3) +
labs(
title = "Signup Conversion Rate by Device Type",
y = "Conversion Rate (%)" # Label as percentage
) +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentage
theme_minimal() +
coord_flip() + # Flip for better readability
scale_fill_manual(values = c("burlywood4", "pink3", "lightblue", "darkolivegreen3", "plum4")) +
theme(
axis.title.y.left = element_blank(),
axis.title.y = element_text(size = 8, hjust = 0.4),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5, size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = margin(6, 6, 6, 6)
) +
guides(fill = "none") # Remove legend7.3 Referrer Source Impact
The aim is to evaluate how traffic sources affect user behavior and conversions using data wrangling and visualization techniques.
user_data <- pa_data %>%
mutate(
traffic_source = case_when(
# Direct traffic (no referrer or internal site)
is.na(tfc_referrer) | tfc_referrer == "" | str_detect(tfc_referrer, "^https://www.planningalerts.ie/") ~ "Direct",
# External traffic from Google
str_detect(tfc_referrer, "google.com") ~ "Google",
# Any other external traffic source
TRUE ~ "Other"
))
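A quick spot check (a sketch) confirms the classification behaves as intended on the raw referrer strings:
# Sketch: most frequent raw referrers within each traffic source bucket
user_data %>%
  count(traffic_source, tfc_referrer, sort = TRUE) %>%
  head(10)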
traffic_summary <- user_data %>%
group_by(traffic_source) %>%
summarise(
total_users = n_distinct(tfc_cookie), # Count distinct users (cookies)
total_sessions = n_distinct(tfc_session), # Count distinct sessions
.groups = 'drop' # Avoid grouping for the next steps
) %>%
arrange(desc(total_users))
traffic_summary <- traffic_summary %>%
mutate(
percentage = total_users / sum(total_users) * 100
)
ggplot(traffic_summary, aes(x = traffic_source, y = total_users, fill = traffic_source)) +
geom_bar(stat = "identity", width = 0.4) +
geom_text(aes(label = paste0(round(percentage, 2), "%")), vjust = -0.3, size = 3) +
labs(
title = "Total Users by Traffic Source",
y = "Total Users",
x = NULL
) +
scale_fill_brewer(palette = "Dark2") + # Color palette for better distinction
scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "k")) + # Format y-axis as thousands
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.4, size = 10),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 8), # Reduce axis title size
axis.text.x = element_text(size = 8), # Reduce axis text size
axis.text.y = element_text(size = 8), # Reduce axis text size
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none", # Remove legend
plot.margin = margin(6, 6, 6, 6))

7.4 Clickstream Analysis
Clickstream analysis tracks the individual clicks users make as they move through the website or app. The goal is to understand the user's journey and spot bottlenecks where users abandon the flow.
# Create user flow (transitions between pages)
page_transitions <- pa_data %>%
arrange(tfc_cookie, tfc_session, tfc_stamped) %>% # Order events chronologically within each session
group_by(tfc_cookie, tfc_session) %>%
mutate(next_page = lead(tfc_full_url_screen)) %>% # The page viewed immediately after the current one
filter(!is.na(next_page)) # Exclude the last page in each session (no next page)
# Count transitions between pages
transition_counts <- page_transitions %>%
group_by(tfc_full_url_screen, next_page) %>%
summarise(transition_count = n()) %>%
ungroup()
top_transitions <- transition_counts %>%
arrange(desc(transition_count)) %>%
head(20)
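Transitions show where users go next; abandonment shows where they stop. A sketch of the most common exit pages (the last page viewed in each session):
# Sketch: pages where sessions most often end (potential bottlenecks)
exit_pages <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  slice_max(tfc_stamped, n = 1, with_ties = FALSE) %>% # Last event per session
  ungroup() %>%
  count(tfc_full_url_screen, sort = TRUE, name = "exits")
head(exit_pages, 10)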
# Create a unique list of pages (nodes)
nodes <- unique(c(top_transitions$tfc_full_url_screen, top_transitions$next_page))
nodes <- data.frame(name = nodes)
# Map pages to numeric IDs
top_transitions <- top_transitions %>%
mutate(from = match(tfc_full_url_screen, nodes$name) - 1, # Subtract 1 to make IDs zero-based
to = match(next_page, nodes$name) - 1)
# Create Sankey diagram using plotly
sankey_diagram <- plot_ly(
type = "sankey",
orientation = "h",
node = list(
pad = 40, # Increase padding between nodes to make space
thickness = 30, # Width of the nodes
line = list(color = "black", width = 0.5),
label = nodes$name,
color = "skyblue" # Color for nodes
),
link = list(
source = top_transitions$from,
target = top_transitions$to,
value = top_transitions$transition_count,
color = "rgba(0, 102, 204, 0.6)" # Semi-transparent color for links
)
)
# Customize the layout to improve clarity
sankey_diagram <- sankey_diagram %>%
layout(
title = "Top 20 User Journey Transitions",
font = list(size = 10),
xaxis = list(showgrid = FALSE, zeroline = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE),
showlegend = FALSE, # Hide the legend to avoid clutter
hoverlabel = list(bgcolor = "white", font = list(size = 10)) # Hover label styling
)
# Display the Sankey diagram
sankey_diagram

7.5 User Engagement Heatmap across Time and Month
The aim is to visualize patterns in user activity across different time periods, leveraging ggplot2 for plotting and lubridate for time-based data manipulation. Since the logs contain no explicit engagement score, an engagement metric is simulated below for illustration.
n <- 189028
# Simulate data
engagement_data <- data.frame(
tfc_cookie = sample(1:n, n, replace = TRUE), # Simulated user IDs (reuses n rather than a hard-coded count)
tfc_stamped = sample(seq(from = as.POSIXct("2024-06-14 00:00:00"),
to = as.POSIXct("2024-08-29 23:59:59"), by = "min"),
n, replace = TRUE),
engagement_metric = sample(1:5, n, replace = TRUE) # Engagement level (e.g., 1-5 scale)
)
# Extract hour of the day and month from tfc_stamped
engagement_data <- engagement_data %>%
mutate(
hour_of_day = hour(tfc_stamped),
month = factor(month(tfc_stamped)))
# Create time intervals (bins) based on the hour of day
engagement_data <- engagement_data %>%
mutate(
time_interval = case_when(
hour_of_day >= 0 & hour_of_day < 6 ~ "Midnight to Morning",
hour_of_day >= 6 & hour_of_day < 12 ~ "Morning",
hour_of_day >= 12 & hour_of_day < 18 ~ "Afternoon",
hour_of_day >= 18 & hour_of_day < 24 ~ "Evening to Night"))
# Aggregate engagement metrics by hour and month
heatmap_data <- engagement_data %>%
group_by(hour_of_day, month) %>%
summarise(average_engagement = mean(engagement_metric), .groups = "drop")
# Create a heatmap to visualize engagement by month and hour of day
ggplot(heatmap_data, aes(x = month, y = hour_of_day, fill = average_engagement)) +
geom_tile() +
scale_fill_gradient(name = "Avg Engagement", low = "#FFFFFF", high = "purple4") + # Gradient colors
labs(x = "Month", y = "Hour of Day") +
ggtitle("User Engagement Heatmap") +
theme_bw() +
theme(
strip.placement = "outside",
plot.title = element_text(hjust = 0.5), # Center-justify plot title
axis.title.y = element_blank(), # Remove y-axis title for clarity
strip.background = element_rect(fill = "#EEEEEE", color = "#FFFFFF"),
axis.text.x = element_text(hjust = 1))

# Save the plot as an image
ggsave("user_engagement_heatmap.png", width = 6, height = 8, dpi = 300)

7.6 Application Success Analysis Based on User Behaviour
The aim is to assess application performance and user engagement through summary statistics and visualization techniques.
i. Categorizing User Actions
# Create the user journey variable and summarize data in one step
user_journey_summary <- pa_data %>%
mutate(user_journey = case_when(
grepl("application", tfc_full_url_screen) ~ "Application Page",
grepl("signup", tfc_full_url_screen) ~ "Signup Page",
grepl("register", tfc_full_url_screen) ~ "Register Page",
TRUE ~ "Other Pages"
)) %>%
group_by(user_journey) %>%
summarise(
total_actions = n(),
unique_users = n_distinct(tfc_cookie),
avg_sessions_per_user = n_distinct(tfc_session) / n_distinct(tfc_cookie) # Distinct sessions divided by distinct users
) %>%
ungroup()
# Calculate total actions across all journeys for percentage calculations
total_actions_all <- sum(user_journey_summary$total_actions)
# Add percentage to the summary data
user_journey_summary <- user_journey_summary %>%
mutate(percentage = total_actions / total_actions_all * 100)
# Define custom colors for the pie slices (you can modify these colors as needed)
custom_colors <- c("maroon", "pink1", "grey41", "grey90")
ggplot(user_journey_summary, aes(x = "", y = total_actions, fill = user_journey)) +
geom_bar(stat = "identity", width = 1) + # Create a bar chart first
coord_polar(theta = "y") + # Convert bar chart to pie chart
scale_fill_manual(values = custom_colors) + # Apply custom colors
# Add text labels for percentages, positioned outside the pie chart
geom_text(aes(label = paste0(round(percentage, 1), "%")),
color = "black", size = 3, alpha = 0.7,
position = position_nudge(x = 0.7)) + # Correct placement inside geom_text
labs(title = "Distribution of Total Actions per User Journey",
fill = "User Journey") + # Set legend title to "User Journey"
theme_void() + # Remove axis and background for a clean pie chart
theme(
plot.title = element_text(hjust = 0.5), # Center-justify the plot title
legend.position = "right", # Position the legend to the right
legend.title = element_text(size = 6), # Size of the legend title
legend.text = element_text(size = 6)) # Size of the legend text
ggsave("user_journey_pie_chart.png", width = 7, height = 7, dpi = 300)

ii. Identifying Successful Applications and Conversion Rate by User Journey
# Create a new variable for user journey analysis, identify application success, and summarize the results
conversion_rate_by_journey <- pa_data %>%
# Categorize user actions based on URL content and define user journey
mutate(user_journey = case_when(
grepl("application", tfc_full_url_screen, ignore.case = TRUE) ~ "Application Page",
grepl("signup", tfc_full_url_screen, ignore.case = TRUE) ~ "Signup Page",
grepl("register", tfc_full_url_screen, ignore.case = TRUE) ~ "Register Page",
grepl("success|confirmation", tfc_full_url_screen, ignore.case = TRUE) ~ "Success Page",
TRUE ~ "Other Pages"
)) %>%
# Group by user (tfc_cookie) to check user journey
group_by(tfc_cookie) %>%
# Mark application status based on which pages were visited
mutate(application_status = case_when(
any(user_journey == "Application Page") & any(user_journey == "Success Page") ~ "Successful", # Both Application and Success Page
any(user_journey == "Application Page") & !any(user_journey == "Success Page") ~ "Only Application Page", # Only Application Page
TRUE ~ "Unsuccessful" # Any other combination (e.g., no relevant pages visited)
)) %>%
# Ungroup to remove user-level grouping
ungroup() %>%
# Group by user journey category and calculate total number of users and successful users
group_by(user_journey) %>%
summarise(
total_users = n_distinct(tfc_cookie), # Total unique users in each user journey category
successful_users = n_distinct(tfc_cookie[application_status == "Successful"]), # Distinct users who reached a Success Page
conversion_rate = (successful_users / total_users) * 100 # Calculate conversion rate
)

iii. Device Type Analysis by User Journey
# Device Type Analysis and Conversion Rate Calculation by User Journey
device_analysis <- pa_data %>%
# Step 1: Create the user journey variable based on URL content
mutate(user_journey = case_when(
grepl("application", tfc_full_url_screen, ignore.case = TRUE) ~ "Application Page",
grepl("signup", tfc_full_url_screen, ignore.case = TRUE) ~ "Signup Page",
grepl("register", tfc_full_url_screen, ignore.case = TRUE) ~ "Register Page",
grepl("success|confirmation", tfc_full_url_screen, ignore.case = TRUE) ~ "Success Page",
TRUE ~ "Other Pages"
)) %>%
# Step 2: Group by device type and user journey to calculate summary metrics
group_by(tfc_device_type, user_journey) %>%
summarise(
total_actions = n(), # Total number of actions (sessions)
unique_users = n_distinct(tfc_cookie), # Count of unique users (tfc_cookie)
avg_sessions_per_user = n_distinct(tfc_session) / n_distinct(tfc_cookie), # Distinct sessions per distinct user
successful_users = sum(user_journey == "Success Page"), # Success Page events in this group (non-zero only for that journey)
.groups = "drop" # Drop the grouping for further operations
) %>%
# Step 3: Calculate the conversion rate
mutate(conversion_rate = (successful_users / unique_users) * 100) # Conversion rate as a percentage
top_10_device_analysis <- device_analysis %>%
group_by(user_journey) %>%
arrange(desc(total_actions)) %>% # Sort by total actions
slice_head(n = 10) %>% # Select top 10 devices by total actions
ungroup()
top_10_device_analysis <- top_10_device_analysis %>%
group_by(user_journey) %>%
mutate(percentage = total_actions / sum(total_actions) * 100) %>% # Percentage of total actions for each device
ungroup()
# Step 4: Visualize the data with a bar chart
ggplot(top_10_device_analysis, aes(x = reorder(tfc_device_type, total_actions), y = total_actions, fill = tfc_device_type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ user_journey, scales = "free_y") + # Create a facet for each user journey
coord_flip() + # Flip the coordinates for better readability
scale_fill_brewer(palette = "Set2") + # Apply color palette
scale_y_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) + # Format y-axis numbers (e.g., 1,000 -> 1k)
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_stack(vjust = 0.5), color = "black", size = 1.8) + # Add percentage near bars
labs(
title = "Top 10 Devices by Total Actions",
y = "Total Actions",
fill = "Device Type"
) +
theme_minimal() +
theme(
axis.title.y.left = element_blank(),
plot.title = element_text(hjust = 0.5, size = 10),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none")8. Conclusion
The analysis of user behavior on the website provides valuable insights into user engagement and areas for improvement. The Application Page receives the most user activity, indicating that users are highly engaged with the content there. However, the Registration Page shows minimal activity, suggesting it may be a barrier to user sign-up. Pages such as planningalerts.ie, terms, and pumping station planning applications have a high bounce rate, averaging around 80%, which signals that users may be exiting without interacting further. Direct traffic accounts for about 75% of visits, highlighting strong brand recognition but possibly a missed opportunity for deeper user engagement.
The clickstream analysis also reveals patterns in user behavior, pointing to friction points in the journey that could be optimized for better retention and conversion. This project emphasizes the significance of continuous data analysis to refine and evolve the user experience, ensuring that PlanningAlerts.ie remains responsive to its users’ preferences and behaviors.
9. Recommendations
Simplify the Registration Page: Offer more user-friendly sign-up options (e.g., social logins or guest mode) to increase conversion rates.
Improve Content on High Bounce Rate Pages: Focus on pages such as the Terms and planning application pages, and enhance content relevance, clarity, and navigation through better organization or more compelling content.
Optimize the Application and Sign-Up Pages: Add interactive features and personalized calls to action to encourage deeper engagement.
Analyze Clickstream Data: Identify and address friction points, especially during the registration or application process, to reduce abandonment and improve the user journey.
Enhance User Feedback Mechanisms: Implement more interactive feedback tools (e.g., surveys, in-app polls) to gather real-time insights on user needs and continuously improve the platform based on this input.
10. References
Raschka, S., 2015. Heatmaps in R. Sebastian Raschka’s Blog. Available at: https://sebastianraschka.com/Articles/heatmaps_in_r.html [Accessed 12 Nov. 2024].
Coliver, J., 2019. Creating Heatmaps in R. Learn R. Available at: https://jcoliver.github.io/learn-r/006-heatmaps.html [Accessed 12 Nov. 2024].
Wickham, H., 2020. ggplot2: Elegant Graphics for Data Analysis. GitHub. Available at: https://github.com/hadley/ggplot2 [Accessed 12 Nov. 2024].
SETU Blackboard, 2024. Advanced Analysis using R & R Studio. Available at: https://blackboard.itcarlow.ie/ultra/courses/_22536_1/cl/outline [Accessed 12 Nov. 2024].
Wickham, H., 2024. Debugging. Advanced R. Available at: https://adv-r.hadley.nz/debugging.html [Accessed 12 Nov. 2024].