Data Analytics and Consumer Insights Assignment
                                SOUTH EAST TECHNOLOGICAL UNIVERSITY

                                  Supervisor: Dr. Denise Earle

                 Title: Exploratory Data Analysis of User Activity on Planning Alerts

                                      Date: 20 November 2024

                                Author: Sushma Mahesh (C00313660)



Table of Contents

  1. Introduction
  2. Objectives
  3. Scope of the Analysis
  4. Significance of the Project
  5. Methodology
  6. Data Preparation and Cleaning
  7. Exploratory Data Analysis
  8. Conclusion
  9. Recommendations
  10. References

1. Introduction

PlanningAlerts.ie is an online service that provides users with notifications regarding planning applications and developments in their local area across Ireland. By subscribing to the platform, users can receive timely alerts about new planning proposals, ensuring they stay informed about residential, commercial, and industrial developments in their neighborhoods.

                                   Source: PlanningAlerts

The service is crucial for empowering individuals and communities to engage with the local planning process. It helps citizens stay informed about developments that may affect their quality of life, while also providing developers with a way to track relevant planning regulations.

2. Objectives

The project aims to generate actionable insights to improve user engagement and enhance the overall user experience on PlanningAlerts.ie.

3. Scope of the Analysis

The analysis is based on a dataset of user activity logs from PlanningAlerts.ie, covering key variables such as session duration, device type (mobile, desktop, app), and referral source (e.g., Google, social media). By focusing on these aspects, the project aims to understand usage patterns and identify opportunities for enhancing the user experience.

4. Significance of the Project

Understanding how users engage with PlanningAlerts.ie is essential for optimizing the platform. The findings from this project can guide future improvements, such as refining notification accuracy, enhancing mobile usability, and targeting users based on their behavior to improve engagement.

5. Methodology

The methodology involves exploratory data analysis using R, with a focus on visualizations, summary statistics, and trend analysis.

6. Data Preparation and Cleaning

Data preparation and cleaning are vital to ensure that the dataset is accurate, reliable, and ready for analysis or modeling. R provides a rich set of libraries, such as tidyverse, lubridate, and dplyr, that streamline these processes, making it easier to transform, clean, and analyze data. Repetitive cleaning tasks can also be automated with R scripts, saving time in the long run.

6.1 Install and Load packages

# Data Manipulation
library(tidyverse)     # Contains dplyr, ggplot2, tidyr, readr, etc.
library(lubridate)     # For parsing and manipulating dates and times (dmy_hm(), hour(), etc.)
library(grid)          # Provides functions for custom graphics like adding text, lines, and shapes to plots
library(gridExtra)     # Allows you to arrange multiple plots (including ggplot2 plots) in a grid layout
library(data.table)    # Provides fast data manipulation functions

# Visualization
library(ggplot2)      # For creating static plots (also attached by tidyverse)
library(scales)       # For axis formatting (percentages, thousands separators, etc.)
library(plotly)       # For interactive plots
library(kableExtra)   # For styling kable tables (kable_styling, row_spec, column_spec)

# Additional Libraries
library(corrplot)     # For correlation matrices
library(stringr)      # For text/string manipulation
library(knitr)        # For generating reports (RMarkdown, Quarto)
library(htmltools)    # HTML helpers used when rendering custom HTML output
library(networkD3)    # For D3-based network and Sankey visualizations

6.2 Data loading and Data transformation

This step covers data transformation and feature engineering: converting and cleaning the datetime field so that it is in a proper format for further analysis. It is an essential part of preparing the data for analysis, modeling, or visualization.

# Import the planning_alerts_data.csv file and create a new field, tfc_stamped_dt,
# containing a converted version of the tfc_stamped datetime field with values in
# the format YYYY-MM-DD HH:MM:SS. Remove the old tfc_stamped field and rename the
# new one.

pa_data <- read_csv("planning_alerts_data.csv") %>%
  mutate(tfc_stamped_dt = dmy_hm(tfc_stamped)) %>%
  select(tfc_id, tfc_stamped_dt, tfc_cookie:tfc_referrer) %>%
  rename(tfc_stamped = tfc_stamped_dt)
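
Before moving on, it is worth confirming that the conversion behaved as expected. The snippet below is a minimal sanity check (a sketch, assuming pa_data as created above): it counts rows that dmy_hm() failed to parse and reports the date range covered by the activity log.

# Sanity check (a minimal sketch): count parse failures and report the
# date range covered by the activity log.
pa_data %>%
  summarise(
    parse_failures = sum(is.na(tfc_stamped)),         # rows dmy_hm() could not parse
    first_event    = min(tfc_stamped, na.rm = TRUE),  # earliest timestamp
    last_event     = max(tfc_stamped, na.rm = TRUE)   # latest timestamp
  )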

7. Exploratory Data Analysis (EDA)

EDA in R is a comprehensive process that involves summarizing and visualizing data and identifying patterns, correlations, or issues (such as missing data or outliers). Libraries such as dplyr, ggplot2, and psych support effective exploratory analysis. The ultimate goal of EDA is to gain insights that can guide further data processing or model building.
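
As a first pass, a quick audit of completeness helps flag issues before any aggregation. A minimal sketch (column names as loaded in Section 6.2):

# First-pass audit (sketch): count missing values in every column and
# preview the structure of the result.
pa_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  glimpse()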

7.1 User Session Pattern Analysis

The goal of this analysis is to understand how users interact with a website by examining their session data.

i. Number of Sessions per User and Session Duration

user_sessions <- pa_data %>%
  group_by(tfc_cookie) %>%
  summarise(num_sessions = n_distinct(tfc_session),  # Count distinct sessions
            first_visit = min(tfc_stamped),  # First visit timestamp
            last_visit = max(tfc_stamped),   # Last visit timestamp
            session_duration = as.numeric(difftime(max(tfc_stamped), min(tfc_stamped), units = "mins")),  # Span between first and last visit (mins)
            .groups = 'drop')

# Extract hour of the day from the first visit time (tfc_stamped)
user_sessions <- user_sessions %>%
  mutate(hour_of_day = hour(first_visit))  # Get the hour of day for the first visit
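
Before plotting, a quick look at the spread of session counts per user is useful (a sketch):

# Quick distribution check (sketch): how many sessions do users typically have?
summary(user_sessions$num_sessions)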

ggplot(user_sessions, aes(x = factor(hour_of_day), y = session_duration / 1000)) +
  geom_violin(color = "darkorange3") +
  labs(
    title = "Session Duration by Hour of Day",
    x = "Hour of Day",
    y = "Session Duration (thousands of minutes)"  # Y-axis now represents thousands of minutes
  ) +
  theme_minimal() + 
  theme(
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    axis.title.x = element_text(size = 8),
    axis.title.y = element_text(size = 8),
    plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank())  # Remove background grid lines 

ii. Session Patterns and Most Visited Pages

# Calculate the sequence of pages visited within each session
page_sequence <- pa_data %>%
  arrange(tfc_cookie, tfc_session, tfc_stamped) %>% 
  group_by(tfc_cookie, tfc_session) %>%             
  summarise(page_sequence = paste(tfc_full_url_screen, collapse = " > "), .groups = "drop")  # Create a sequence of pages

# Find the most common paths (Top 10 most frequent paths by users)
common_paths_users <- page_sequence %>%
  distinct(tfc_cookie, page_sequence) %>%    # Ensure each user is counted once per page sequence
  count(page_sequence, sort = TRUE) %>%      # Count the number of users per page sequence
  arrange(desc(n)) %>%                       # Arrange the counts in descending order
  head(10)                                   # Get top 10 most common paths

# Create the table
knitr::kable(common_paths_users, 
             format = "html",
             digits = 2, 
             align = "lr", 
             col.names = c("Page Sequence", "Total Users"),
             caption = '<div style="text-align: center;">Top 10 Most Common Paths by Users</div>', 
             table.attr = 'data-quarto-disable-processing = "true"') %>% 
    row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
    row_spec(1:nrow(common_paths_users), background = "white") %>%
  kableExtra::kable_styling(full_width = FALSE, font_size = 12) %>%
  kableExtra::column_spec(1, width = "24em") %>%  
  kableExtra::column_spec(2, width = "16em") %>%   
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top 10 Most Common Paths by Users

  Page Sequence                                        Total Users
  application                                               112499
  applicationmob                                             26159
  list                                                       21318
  map                                                        19606
  applicationmob > applicationmob                             3064
  signup                                                      1677
  applicationmob > list                                       1111
  contact                                                      871
  applicationmob > applicationmob > applicationmob             820
  mobilemap                                                    554

most_visited_pages <- pa_data %>%
  mutate(page_name = str_extract(tfc_full_url, "[^/]+$"),  # Extract page name from URL
         page_name = str_remove(page_name, "\\?[^/]+$")) %>%  # Remove query parameter (everything after ?)
  group_by(page_name) %>%  # Group by cleaned page name
  summarise(page_visits = n()) %>%  # Count visits for each page
  arrange(desc(page_visits)) %>%  # Sort by visits in descending order
  head(10)  # Get top 10 most visited pages

knitr::kable(most_visited_pages, 
             format = "html",
             digits = 2, 
             align = "lr", 
             col.names = c("Page", "Total Page Visits"),
             caption = '<div style="text-align: center;">Top 10 Most Visited Pages</div>',
             table.attr = 'data-quarto-disable-processing = "true"') %>% 
  kable_styling(full_width = FALSE, font_size = 12) %>%
    row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
    row_spec(1:nrow(most_visited_pages), background = "lightyellow2")
Top 10 Most Visited Pages

  Page                                                Total Page Visits
  application                                                    215223
  list                                                            62111
  applicationmob                                                  57038
  map                                                             20326
  signup                                                          17781
  item                                                             6790
  data-centre-planning-applications                                2347
  subscriber                                                       2024
  mobilemap                                                        2012
  wind-farm-windfarm-turbine-planning-applications                 1907

iii. User Visit Frequency Analysis

# Step 1: Classify users into "once-off" and "repeat" visitors
user_sessions <- user_sessions %>%
  mutate(visitor_type = ifelse(num_sessions == 1, "Once-off", "Repeat"))

# Step 2: Calculate the visit frequency for repeat visitors
repeat_visits <- user_sessions %>%
  filter(visitor_type == "Repeat") %>%
  mutate(days_between = as.numeric(difftime(last_visit, first_visit, units = "days")),
         weeks_between = days_between / 7,
         months_between = days_between / 30)

# Step 3: Summarize the number of once-off and repeat visitors
visitor_summary <- user_sessions %>%
  summarise(once_off = sum(visitor_type == "Once-off"),  # Count once-off visitors
            repeat_visitors = sum(visitor_type == "Repeat"))  # Count repeat visitors

# Step 4: Calculate the frequency of repeat visitors (e.g., visits within 1 day, 1 week, 1 month)
visit_frequency <- repeat_visits %>%
  summarise(
    daily_visits = mean(days_between <= 1),
    weekly_visits = mean(weeks_between <= 1),
    monthly_visits = mean(months_between <= 1)
  ) %>%
  rename(
    `Daily Visits` = daily_visits,
    `Weekly Visits` = weekly_visits, 
    `Monthly Visits` = monthly_visits 
  )
# Visitor Summary Table
knitr::kable(visitor_summary, 
             format = "html",
             digits = 2,           # Round numbers to 2 decimal places
             align = c("r", "r"),  # Left-align the first column, right-align the second column
             col.names = c("Once Off Visitors", "Repeat Visitors"),
             caption = '<div style="text-align: center; font-weight: bold;">Visitor Summary</div>', 
             color = "black",
             table.attr = 'data-quarto-disable-processing = "true"') %>% 
  kable_styling(full_width = FALSE, font_size = 14, position = "center") %>%  # Reduce table size
  row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
  row_spec(1, color = "black", background = "lemonchiffon")
Visitor Summary

  Once Off Visitors    Repeat Visitors
             175408              13620

# Visitor Frequency Table
knitr::kable(visit_frequency, 
             format = "html",
             digits = 2,           # Round numbers to 2 decimal places
             align = c("r", "r", "r"),  # Right-align all three columns
             col.names = c("Daily Visits", "Weekly Visits", "Monthly Visits"),
             caption = '<div style="text-align: center; font-weight: bold;">Visitor Frequency</div>', 
             color = "black",
             table.attr = 'data-quarto-disable-processing = "true"') %>% 
  kable_styling(full_width = FALSE, font_size = 14, position = "center") %>%  # Reduce table size
  row_spec(0, bold = TRUE, color = "black", background = "gray80") %>%
  row_spec(1, color = "black", background = "bisque1")
Visitor Frequency

  Daily Visits    Weekly Visits    Monthly Visits
          0.74             0.83              0.95

iv. Bounce Rate per User, Overall Bounce Rate, and Page-wise Bounce Rates

# Step 1: Count the number of unique pages visited per session
page_count_per_session <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  mutate(num_pages = n_distinct(tfc_full_url)) %>%
  ungroup()

# Step 2: Identify bounced sessions for each user (sessions with only 1 page view)
bounced_sessions_user <- page_count_per_session %>%
  filter(num_pages == 1) %>%
  group_by(tfc_cookie) %>%
  mutate(bounced_sessions = n_distinct(tfc_session)) %>%
  ungroup()

# Step 3: Calculate total sessions for each user
total_sessions_user <- page_count_per_session %>%
  group_by(tfc_cookie) %>%
  mutate(total_sessions = n_distinct(tfc_session)) %>%
  ungroup()

# Step 4: Merge bounced sessions and total sessions for each user
user_bounce_rate <- bounced_sessions_user %>%
  left_join(total_sessions_user, by = "tfc_cookie") %>%
  mutate(bounce_rate = (bounced_sessions / total_sessions) * 100)
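
For the overall bounce rate, one option (a minimal sketch, using the same session-level definition of a bounce as above) is the share of all sessions that viewed exactly one distinct page:

# Overall bounce rate (sketch): share of all sessions with exactly one
# distinct page view.
overall_bounce_rate <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  summarise(num_pages = n_distinct(tfc_full_url), .groups = "drop") %>%
  summarise(overall_bounce_rate = mean(num_pages == 1) * 100)
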
# Calculate the number of pages visited per session (for page-specific bounces)
page_specific <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  summarise(pages_in_session = list(unique(tfc_full_url_screen)))
# Identify Bounce Sessions (sessions with only 1 page)
bounce_sessions <- page_specific %>%
  filter(sapply(pages_in_session, length) == 1) %>%
  ungroup()

# Count how many bounces each page has
bounce_count_per_page <- bounce_sessions %>%
  unnest(pages_in_session) %>%
  group_by(pages_in_session) %>%
  summarise(bounce_count = n(), .groups = "drop")
top_10_bounce_pages <- bounce_count_per_page %>%
  arrange(desc(bounce_count)) %>%  
  head(10)                          
# Calculate total sessions per page
total_sessions_per_page <- pa_data %>%
  group_by(tfc_full_url_screen) %>%
  summarise(total_sessions = n_distinct(tfc_session), .groups = "drop")

# Calculate the bounce rate for each page
bounce_rate_df <- bounce_count_per_page %>%
  left_join(total_sessions_per_page, by = c("pages_in_session" = "tfc_full_url_screen")) %>%
  mutate(bounce_rate = (bounce_count / total_sessions) * 100)
top_10_bounce_rate_df <- bounce_rate_df %>%
  arrange(desc(bounce_rate)) %>%  # Sort pages by bounce rate, highest first
  head(10)

ggplot(top_10_bounce_rate_df, aes(x = reorder(pages_in_session, bounce_rate), y = bounce_rate)) +
  geom_bar(stat = "identity", fill = "olivedrab") +
  geom_text(aes(label = round(bounce_rate, 1)), vjust = -0.3, size = 3) + 
  labs(
    title = "Top 10 Pages with the Highest Bounce Rates", 
    y = "Bounce Rate (%)"
  ) +
  theme_minimal() +
  coord_flip() +  # Flip axes for better readability
  theme(
    axis.title.y.left = element_blank(),  
    axis.title.y = element_text(size = 8, hjust = 0.4),  
    axis.text.x = element_text(size = 8),  
    axis.text.y = element_text(size = 8),   
    plot.title = element_text(hjust = 0.5, size = 10),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank() )  

7.2 Device Type and Application Interaction Analysis

To analyze user behavior, performance metrics, and cross-platform interactions by device type through data manipulation and visualization techniques.

i. User and Session Breakdown by Device Type

# Calculate the number of unique users and sessions by device type
device_user_session <- pa_data %>%
  group_by(tfc_device_type) %>%
  summarise(
    unique_users = n_distinct(tfc_cookie),
    total_sessions = n_distinct(tfc_session),
    avg_sessions_per_user = total_sessions / unique_users
  )

ggplot(device_user_session, aes(x = reorder(tfc_device_type, avg_sessions_per_user), 
                                y = avg_sessions_per_user, 
                                fill = tfc_device_type)) +
  geom_bar(stat = "identity", width = 0.4) + 
  labs(
    title = "Average Sessions per User by Device Type",
    y = "Average Sessions per User"
  ) +
  theme_minimal() +
  coord_flip() +  # Flip for better readability
  scale_y_continuous(limits = c(0, 3), breaks = 0:3) +  
  scale_fill_manual(values = c("lightblue", "mediumblue", "dodgerblue3", "lightblue3", "navyblue")) +
  theme(
    axis.title.y.left = element_blank(),  
    axis.title.y = element_text(size = 8, hjust = 0.4),  
    axis.text.x = element_text(size = 8),  
    axis.text.y = element_text(size = 8),   
    plot.title = element_text(hjust = 0.5, size = 10),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),  
    plot.margin = margin(6, 6, 6, 6)  # Adjust margins for a more compact plot
  ) +
  guides(fill = "none")  # Remove legend

ii. Application Interaction by Device Type

# Filter rows where user is viewing a planning application page
app_data <- pa_data %>%
  filter(tfc_full_url_screen == "application")

# Count views by application reference and device type (using the filtered app_data)
app_interaction_device <- app_data %>%
  group_by(tfc_device_type, tfc_application_reference) %>%
  summarise(
    views = n(),
    unique_users = n_distinct(tfc_cookie)
  ) %>%
  arrange(tfc_device_type, desc(views))
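
A short follow-up makes the result easier to inspect (a sketch; slice_head() is from dplyr):

# Preview the three most viewed application references on each device (sketch):
app_interaction_device %>%
  group_by(tfc_device_type) %>%
  slice_head(n = 3) %>%
  ungroup()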

iii. Cross-Device User Journeys

# Identify users who have multiple sessions on different devices
cross_device_users <- pa_data %>%
  group_by(tfc_cookie) %>%
  filter(n_distinct(tfc_device_type) > 1) %>%
  summarise(
    device_types = paste(unique(tfc_device_type), collapse = " -> ")
  )
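
To gauge how prevalent cross-device usage is, the combinations can be tallied (a sketch):

# Count users per device combination and show the most common ones (sketch):
cross_device_users %>%
  count(device_types, sort = TRUE) %>%
  head(5)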

iv. Device-Specific Conversion Rates (Registration Sign-Up Pages)

# Count the number of users visiting registration or signup pages
signup_data <- pa_data %>%
  filter(tfc_full_url_screen %in% c("register", "signup"))

# Calculate conversion rates by device type
signup_conversion_rate <- signup_data %>%
  group_by(tfc_device_type) %>%
  summarise(
    signups = n_distinct(tfc_cookie),
    sessions = n_distinct(tfc_session)
  ) %>%
  mutate(
    conversion_rate = signups / sessions
  )
ggplot(signup_conversion_rate, aes(x = reorder(tfc_device_type, conversion_rate), 
                                          y = conversion_rate * 100,  # Multiply by 100 to convert to percentage
                                          fill = tfc_device_type)) +
  geom_bar(stat = "identity", width = 0.4) +  # Reduced bar width
  geom_text(aes(label = round(conversion_rate * 100, 2)), vjust = -0.3, size = 3) +
  labs(
    title = "Signup Conversion Rate by Device Type",
    y = "Conversion Rate (%)"  # Label as percentage
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentage
  theme_minimal() +
  coord_flip() +  # Flip for better readability
    scale_fill_manual(values = c("burlywood4", "pink3", "lightblue", "darkolivegreen3", "plum4")) +
  theme(
    axis.title.y.left = element_blank(),  
    axis.title.y = element_text(size = 8, hjust = 0.4),  
    axis.text.x = element_text(size = 8),  
    axis.text.y = element_text(size = 8),   
    plot.title = element_text(hjust = 0.5, size = 10),  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),  
    plot.margin = margin(6, 6, 6, 6)  
  ) +
  guides(fill = "none")  # Remove legend

7.3 Referrer Source Impact

To evaluate how traffic sources affect user behavior and conversions using data wrangling and visualization techniques.

user_data <- pa_data %>%
  mutate(
    traffic_source = case_when(
      # Direct traffic (no referrer or internal site)
      is.na(tfc_referrer) | tfc_referrer == "" | str_detect(tfc_referrer, "^https://www.planningalerts.ie/") ~ "Direct",
      
      # External traffic from Google
      str_detect(tfc_referrer, "google.com") ~ "Google",
      
      # Any other external traffic source
      TRUE ~ "Other"
    ))
traffic_summary <- user_data %>%
  group_by(traffic_source) %>%
  summarise(
    total_users = n_distinct(tfc_cookie),    # Count distinct users (cookies)
    total_sessions = n_distinct(tfc_session),  # Count distinct sessions
    .groups = 'drop'  # Avoid grouping for the next steps
  ) %>%
  arrange(desc(total_users))

traffic_summary <- traffic_summary %>%
  mutate(
    percentage = total_users / sum(total_users) * 100  
  )

ggplot(traffic_summary, aes(x = traffic_source, y = total_users, fill = traffic_source)) +
  geom_bar(stat = "identity", width = 0.4) +  
  geom_text(aes(label = paste0(round(percentage, 2), "%")), vjust = -0.3, size = 3) +
  labs(
    title = "Total Users by Traffic Source",
    y = "Total Users",
    x = NULL
  ) +
  scale_fill_brewer(palette = "Dark2") +  # Color palette for better distinction
  scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "k")) +  # Format y-axis as thousands
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.4, size = 10), 
    axis.title.x = element_blank(),  
    axis.title.y = element_text(size = 8),  # Reduce axis title size
    axis.text.x = element_text(size = 8),  # Reduce axis text size
    axis.text.y = element_text(size = 8),  # Reduce axis text size
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",  # Remove legend
    plot.margin = margin(6, 6, 6, 6))

7.4 Clickstream Analysis

Clickstream analysis tracks the individual clicks users make as they move through the website or app. The goal is to understand the user’s journey and spot bottlenecks where users abandon the flow.

# Create user flow (transitions between pages)
page_transitions <- pa_data %>%
  group_by(tfc_cookie, tfc_session) %>%
  arrange(tfc_stamped, .by_group = TRUE) %>%  # Order events chronologically within each session
  mutate(next_page = lead(tfc_full_url_screen)) %>%
  filter(!is.na(next_page))  # Exclude last page in session (no next page)

# Count transitions between pages
transition_counts <- page_transitions %>%
  group_by(tfc_full_url_screen, next_page) %>%
  summarise(transition_count = n()) %>%
  ungroup()

top_transitions <- transition_counts %>%
  arrange(desc(transition_count)) %>%
  head(20)

# Create a unique list of pages (nodes)
nodes <- unique(c(top_transitions$tfc_full_url_screen, top_transitions$next_page))
nodes <- data.frame(name = nodes)

# Map pages to numeric IDs
top_transitions <- top_transitions %>%
  mutate(from = match(tfc_full_url_screen, nodes$name) - 1,  # Subtract 1 to make IDs zero-based
         to = match(next_page, nodes$name) - 1)

# Create Sankey diagram using plotly
sankey_diagram <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    pad = 40,  # Increase padding between nodes to make space
    thickness = 30,  # Width of the nodes
    line = list(color = "black", width = 0.5),
    label = nodes$name,
    color = "skyblue"  # Color for nodes
  ),
  link = list(
    source = top_transitions$from,
    target = top_transitions$to,
    value = top_transitions$transition_count,
    color = "rgba(0, 102, 204, 0.6)"  # Semi-transparent color for links
  )
)

# Customize the layout to improve clarity
sankey_diagram <- sankey_diagram %>%
  layout(
    title = "Top 20 User Journey Transitions",
    font = list(size = 10),
    xaxis = list(showgrid = FALSE, zeroline = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE),
    showlegend = FALSE,  # Hide the legend to avoid clutter
    hoverlabel = list(bgcolor = "white", font = list(size = 10))  # Hover label styling
  )

# Display the Sankey diagram
sankey_diagram
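
networkD3, loaded in Section 6.1, can draw the same diagram from the nodes and zero-based from/to indices built above; a minimal sketch:

# Equivalent Sankey with networkD3 (sketch), reusing the link and node tables.
networkD3::sankeyNetwork(
  Links = as.data.frame(top_transitions),
  Nodes = nodes,
  Source = "from", Target = "to",
  Value = "transition_count",
  NodeID = "name",
  fontSize = 10, nodeWidth = 30
)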

7.5 User Engagement Heatmap across Time and Month

To visualize patterns in user activity over different time periods, leveraging packages like ggplot2 for plotting and lubridate for time-based data manipulation.

n <- 189028  # Number of simulated records (matches the total users observed above)
# Simulate engagement data for illustration: the raw log has no explicit
# engagement score, so one is generated on a 1-5 scale
engagement_data <- data.frame(
  tfc_cookie = sample(1:189028, n, replace = TRUE),
  tfc_stamped = sample(seq(from = as.POSIXct("2024-06-14 00:00:00"), 
                           to = as.POSIXct("2024-08-29 23:59:59"), by = "min"), 
                       n, replace = TRUE),
  engagement_metric = sample(1:5, n, replace = TRUE)  # Engagement level (e.g., 1-5 scale)
)
# Extract hour of the day and month from tfc_stamped
engagement_data <- engagement_data %>%
  mutate(
    hour_of_day = hour(tfc_stamped),
    month = factor(month(tfc_stamped)))

# Create time intervals (bins) based on the hour of day
engagement_data <- engagement_data %>%
  mutate(
    time_interval = case_when(
      hour_of_day >= 0 & hour_of_day < 6  ~ "Midnight to Morning",
      hour_of_day >= 6 & hour_of_day < 12 ~ "Morning",
      hour_of_day >= 12 & hour_of_day < 18 ~ "Afternoon",
      hour_of_day >= 18 & hour_of_day < 24 ~ "Evening to Night"))

# Aggregate engagement metrics by hour and month
heatmap_data <- engagement_data %>%
  group_by(hour_of_day, month) %>%
  summarise(average_engagement = mean(engagement_metric), .groups = "drop")

# Create a heatmap to visualize engagement by month and hour of day
ggplot(heatmap_data, aes(x = month, y = hour_of_day, fill = average_engagement)) + 
  geom_tile() +
  scale_fill_gradient(name = "Avg Engagement", low = "#FFFFFF", high = "purple4") +  # Gradient colors
  labs(x = "Month", y = "Hour of Day") + 
  ggtitle("User Engagement Heatmap") +
  theme_bw() +
  theme(
    strip.placement = "outside", 
    plot.title = element_text(hjust = 0.5), # Center-justify plot title
    axis.title.y = element_blank(),         # Remove y-axis title for clarity
    strip.background = element_rect(fill = "#EEEEEE", color = "#FFFFFF"),
    axis.text.x = element_text(hjust = 1))

# Save the plot as an image
ggsave("user_engagement_heatmap.png", width = 6, height = 8, dpi = 300)

7.6 Application Success Analysis Based on User Behaviour

To assess application success and user engagement by categorizing user journeys and applying summary and visualization techniques.

i. Categorizing User Actions

# Create the user journey variable and summarize data in one step
user_journey_summary <- pa_data %>%
  mutate(user_journey = case_when(
    grepl("application", tfc_full_url_screen) ~ "Application Page",
    grepl("signup", tfc_full_url_screen) ~ "Signup Page",
    grepl("register", tfc_full_url_screen) ~ "Register Page",
    TRUE ~ "Other Pages"
  )) %>%
  group_by(user_journey) %>%
  summarise(
    total_actions = n(),
    unique_users = n_distinct(tfc_cookie),
    avg_sessions_per_user = n_distinct(tfc_session) / n_distinct(tfc_cookie)  # Distinct sessions per distinct user
  ) %>%
  ungroup()

# Calculate total actions across all journeys for percentage calculations
total_actions_all <- sum(user_journey_summary$total_actions)

# Add percentage to the summary data
user_journey_summary <- user_journey_summary %>%
  mutate(percentage = total_actions / total_actions_all * 100)

# Define custom colors for the pie slices (you can modify these colors as needed)
custom_colors <- c("maroon", "pink1", "grey41", "grey90")

ggplot(user_journey_summary, aes(x = "", y = total_actions, fill = user_journey)) +
  geom_bar(stat = "identity", width = 1) +  # Create a bar chart first
  coord_polar(theta = "y") +  # Convert bar chart to pie chart
  scale_fill_manual(values = custom_colors) +  # Apply custom colors
  # Add text labels for percentages, positioned outside the pie chart
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            color = "black", size = 3, alpha = 0.7, 
            position = position_nudge(x = 0.7)) +  # Correct placement inside geom_text
  labs(title = "Distribution of Total Actions per User Journey",
       fill = "User Journey") +  # Set legend title to "User Journey"
  theme_void() +  # Remove axis and background for a clean pie chart
  theme(
    plot.title = element_text(hjust = 0.5),  # Center-justify the plot title
    legend.position = "right",  # Position the legend to the right
    legend.title = element_text(size = 6),  # Size of the legend title
    legend.text = element_text(size = 6) ) # Size of the legend text

ggsave("user_journey_pie_chart.png", width = 7, height = 7, dpi = 300)

ii. Identifying Successful Applications and Conversion Rate by User Journey

# Create a new variable for user journey analysis, identify application success, and summarize the results
conversion_rate_by_journey <- pa_data %>%
  # Categorize user actions based on URL content and define user journey
  mutate(user_journey = case_when(
    grepl("application", tfc_full_url_screen, ignore.case = TRUE) ~ "Application Page",
    grepl("signup", tfc_full_url_screen, ignore.case = TRUE) ~ "Signup Page",
    grepl("register", tfc_full_url_screen, ignore.case = TRUE) ~ "Register Page",
    grepl("success|confirmation", tfc_full_url_screen, ignore.case = TRUE) ~ "Success Page",
    TRUE ~ "Other Pages"
  )) %>%
  # Group by user (tfc_cookie) to check user journey
  group_by(tfc_cookie) %>%
  # Mark application status based on which pages were visited
  mutate(application_status = case_when(
    any(user_journey == "Application Page") & any(user_journey == "Success Page") ~ "Successful",          # Both Application and Success Page
    any(user_journey == "Application Page") & !any(user_journey == "Success Page") ~ "Only Application Page", # Only Application Page
    TRUE ~ "Unsuccessful"  # Any other combination (e.g., no relevant pages visited)
  )) %>%
  # Ungroup to remove user-level grouping
  ungroup() %>%
  # Group by user journey category and calculate total and successful users
  group_by(user_journey) %>%
  summarise(
    total_users = n_distinct(tfc_cookie),  # Total unique users in each user journey category
    successful_users = n_distinct(tfc_cookie[application_status == "Successful"]),  # Unique successful users
    conversion_rate = (successful_users / total_users) * 100  # Conversion rate as a percentage
  )
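
The summary can be rendered in the same style as the earlier tables; a sketch following the kable conventions used above:

# Render the conversion summary (sketch, styled like the earlier tables):
knitr::kable(conversion_rate_by_journey,
             format = "html",
             digits = 2,
             col.names = c("User Journey", "Total Users", "Successful Users", "Conversion Rate (%)"),
             caption = '<div style="text-align: center;">Conversion Rate by User Journey</div>',
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  row_spec(0, bold = TRUE, color = "black", background = "gray80")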

iii. Device Type Analysis by User Journey

# Device Type Analysis and Conversion Rate Calculation by User Journey
device_analysis <- pa_data %>%
  # Step 1: Create the user journey variable based on URL content
  mutate(user_journey = case_when(
    grepl("application", tfc_full_url_screen, ignore.case = TRUE) ~ "Application Page",
    grepl("signup", tfc_full_url_screen, ignore.case = TRUE) ~ "Signup Page",
    grepl("register", tfc_full_url_screen, ignore.case = TRUE) ~ "Register Page",
    grepl("success|confirmation", tfc_full_url_screen, ignore.case = TRUE) ~ "Success Page",
    TRUE ~ "Other Pages"
  )) %>%
  # Step 2: Group by device type and user journey to calculate summary metrics
  group_by(tfc_device_type, user_journey) %>%
  summarise(
    total_actions = n(),  # Total number of actions (page views)
    unique_users = n_distinct(tfc_cookie),  # Count of unique users (tfc_cookie)
    avg_sessions_per_user = n_distinct(tfc_session) / n_distinct(tfc_cookie),  # Distinct sessions per user
    successful_users = n_distinct(tfc_cookie[user_journey == "Success Page"]),  # Unique users reaching a Success Page
    .groups = "drop"  # Drop the grouping for further operations
  ) %>%
  # Step 3: Calculate the conversion rate
  mutate(conversion_rate = (successful_users / unique_users) * 100)  # Conversion rate as a percentage

top_10_device_analysis <- device_analysis %>%
  group_by(user_journey) %>%
  arrange(desc(total_actions)) %>%  # Sort by total actions
  slice_head(n = 10) %>%  # Select top 10 devices by total actions
  ungroup()

top_10_device_analysis <- top_10_device_analysis %>%
  group_by(user_journey) %>%
  mutate(percentage = total_actions / sum(total_actions) * 100) %>%  # Percentage of total actions for each device
  ungroup()

# Step 4: Visualize the data with a bar chart
ggplot(top_10_device_analysis, aes(x = reorder(tfc_device_type, total_actions), y = total_actions, fill = tfc_device_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ user_journey, scales = "free_y") +  # Create a facet for each user journey
  coord_flip() +  # Flip the coordinates for better readability
  scale_fill_brewer(palette = "Set2") +  # Apply color palette
  scale_y_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) +  # Format y-axis numbers (e.g., 1,000 -> 1k)
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_stack(vjust = 0.5), color = "black", size = 1.8) +  # Add percentage near bars
  labs(
    title = "Top 10 Devices by Total Actions",
    y = "Total Actions",
    fill = "Device Type"
  ) +
  theme_minimal() +
  theme(
    axis.title.y.left = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 10),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),  
    legend.position = "none")

8. Conclusion

The analysis of user behavior on the website provides valuable insights into user engagement and areas for improvement. The Application Page receives the most user activity, indicating that users are highly engaged with the content there. However, the Registration Page shows minimal activity, suggesting it may be a barrier to user sign-up. Pages such as planningalerts.ie, terms, and pumping station planning applications have a high bounce rate, averaging around 80%, which signals that users may be exiting without interacting further. Direct traffic accounts for about 75% of visits, highlighting strong brand recognition but possibly a missed opportunity for deeper user engagement.

The clickstream analysis also reveals patterns in user behavior, pointing to friction points in the journey that could be optimized for better retention and conversion. This project emphasizes the significance of continuous data analysis to refine and evolve the user experience, ensuring that PlanningAlerts.ie remains responsive to its users’ preferences and behaviors.

9. Recommendations

  • Simplify the Registration Page: Offer more user-friendly sign-up options (e.g., social logins or guest mode) to increase conversion rates.

  • Improve Content on High Bounce Rate Pages: Focus on pages such as the Terms and Planning Application pages, enhancing content relevance, clarity, and navigation through better organization or more compelling content.

  • Optimize the Application and Sign-Up Pages: Add interactive features and personalized calls to action to encourage deeper engagement.

  • Analyze Clickstream Data: Identify and address friction points, especially during the registration or application process, to reduce abandonment and improve the user journey.

  • Enhance User Feedback Mechanisms: Implement more interactive feedback tools (e.g., surveys, in-app polls) to gather real-time insights on user needs and continuously improve the platform based on their input.

10. References

  1. Raschka, S., 2015. Heatmaps in R. Sebastian Raschka’s Blog. Available at: https://sebastianraschka.com/Articles/heatmaps_in_r.html [Accessed 12 Nov. 2024].

  2. Coliver, J., 2019. Creating Heatmaps in R. Learn R. Available at: https://jcoliver.github.io/learn-r/006-heatmaps.html [Accessed 12 Nov. 2024].

  3. Wickham, H., 2020. ggplot2: Elegant Graphics for Data Analysis. GitHub. Available at: https://github.com/hadley/ggplot2 [Accessed 12 Nov. 2024].

  4. SETU Blackboard, 2024. Advanced Analysis using R & R Studio. Available at: https://blackboard.itcarlow.ie/ultra/courses/_22536_1/cl/outline [Accessed 12 Nov. 2024].

  5. Wickham, H., 2024. Debugging. Advanced R. Available at: https://adv-r.hadley.nz/debugging.html [Accessed 12 Nov. 2024].