library(tidyverse)
library(knitr)
library(gridExtra)Planning Alerts Project
Introduction
In this analysis, we will use planning_alerts_data.csv to conduct explanatory data analysis to explore different trends and patterns in the data. The dataset contains 400,215 observations and 15 variables and each observation is a record of each time a page was visited on the website PlanningAlerts.ie. The dataset contains the user’s information of June, July, and August and September from the current year. PlanningAlerts.ie is a notifications service provider about Irish planning applications based on user-defined preferences. This innovative platform saves users significant time by eliminating the need to manually search through countywide lists or newspapers. Instead, users receive timely alerts via email, app, or SMS for any new, updated, or decided applications relevant to their locations of interest. The service caters to individuals and businesses, offering tools like mobile-friendly interfaces and app-based notifications. This convenience empowers users to stay informed effortlessly. First the data is loaded, date and time variables are added in the code chunk data_setup below. We will now perform the explanatory data analysis now.
# loading the data into rstudio
pa_data <- read_csv("planning_alerts_data.csv")Rows: 400215 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): tfc_stamped, tfc_cookie, tfc_session, tfc_device_type, tfc_full_url...
dbl (1): tfc_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# using mutate to add new variables into the data. We have converted the timestamp into date time
# stamp using dmy_hm() function. as_date() function is used to create date variable
# year() is used to get the year from the data. month() is used to get the month,
# weekdays() and hour() are used to add the week day and hour variables
pa_data <- pa_data %>% mutate(datetime = dmy_hm(tfc_stamped),
date = as_date(datetime), year = year(datetime),
month = month(datetime, label = TRUE),
week = weekdays(datetime, abbr = TRUE),
hour = hour(datetime))Explanatory Data Analysis
In this section, we will perform explanatory data analysis to get the key findings and insights from the data. The code chunk below, we have created a time series plot of daily total visits and unique users on PlanningAlerts.ie. The total visits are colored as red and total total unique users are colored as black. The plot shows total visits and unique users are lowest in the month of June and both are highest in the month of July. The spikes in the plot indicate the presence of weekly trends of visits in the data.
# group_by() is used to group the data by date and summarize is used to to count
# total unique user on each data. ggplot() is used to create the timeseries plot.
# geom_line() is used to for time series plot and labs() is used to change the x and y-axis labels
# and plot title theme_light() is used to add light theme. scale_x_date() is used to add month 2024 labels on
# x-axis
pa_data %>% group_by(date) %>%
summarise(total_unique_users = n_distinct(tfc_cookie), total_visits = n()) %>%
ggplot(aes(x = date, y = total_unique_users, group = 1, color = 'Unique Users'))+
geom_line()+
geom_line(aes(x = date, y = total_visits, color = 'Total Visitors'))+
scale_x_date(breaks = "month", date_labels = "%b %Y") +
labs(x = "Date", y = "Total Visits & Unique Users", color = 'Web Stats',
title = 'Daily Total Visits & Unique Users on PlanningAlerts.ie') +
theme_light()In the code chunk below, we have created a summary table indicating the total visits, unique users, ratio of unique users to total visits and percentage increase in unique users on the website. The table shows in June, the website recorded 79,295 visits and 32,773 unique users, and 41.33% of the visited users were unique on the website. Since this is the first month, the percentage increase is not applicable (NA). In July, the total visits significantly increased to 167,186, and the number of unique users grew to 81,866, leading to a ratio of 48.97% and 18.48% increase in the new users as compare to the previous month. In August, the total visits decreased slightly to 153,734, while the number of unique users declined to 76,246. Despite the decline in unique users, the ratio increased slightly to 49.60%, indicating a 1.28% increase in unique users as compare to previous month.
# group_by(month) is used to group the data per month. summarise() is used to calculate
# Total.Visits, Total.Unique.Users and Total.Visits.to.Unique.Users.Ratio. mutate() is used
# to create a new column, Percentage.Increase, which calculates the percentage increase in the ratio from the previous month.
# lag() is used to get the previous month's ratio to calculate the percentage change.
# kable() to generate nice table
pa_data %>%
group_by(month) %>%
summarise(Total.Visits = n(), Total.Unique.Users = n_distinct(tfc_cookie),
Total.Visits.to.Unique.Users.Ratio = Total.Unique.Users/Total.Visits*100) %>%
mutate(Percentage.Increase = (Total.Visits.to.Unique.Users.Ratio -
lag(Total.Visits.to.Unique.Users.Ratio)) /
lag(Total.Visits.to.Unique.Users.Ratio) * 100) %>%
kable(digits = 2)| month | Total.Visits | Total.Unique.Users | Total.Visits.to.Unique.Users.Ratio | Percentage.Increase |
|---|---|---|---|---|
| Jun | 79295 | 32773 | 41.33 | NA |
| Jul | 167186 | 81866 | 48.97 | 18.48 |
| Aug | 153734 | 76246 | 49.60 | 1.28 |
To observe the trends of each week day and hour, we have visualized line plots in the code chunk below. The line plots shows the total visits and total unique users starts to increase from Monday to Wednesday and on the website, total visits and total unique users are highest on Wednesday. From Thursday to Saturday, the trend of total visits and unique users on the website is downward. Based on this plot we can say the trend of users on the website is increasing from Sunday to Wednesday and trend is downward from Thursday to Saturday. The hourly line plot shows the trend of total visits and unique users remains almost constant from 12am to 7am and trend is increasing from 6am till 10am. The increasing trend touches the highest point at 4pm and after that the trend is decreasing.
# group_by(month) is used to group the data each weekday. summarise() is used to calculate
# total visits and total unique users on the website. ggplot() is used to create the timeseries plot.
# geom_line() is used to for time series plot and labs() is used to change the x and y-axis labels
# and plot title. theme_light() is used to add light theme. geom_point() is used to add data points
# in the plot
p1 <- pa_data %>% group_by(week) %>%
summarise(total_unique_users = n_distinct(tfc_cookie), total_visits = n()) %>%
ggplot(aes(x = week, y = total_unique_users, group = 1, color = "Unique Users"))+
geom_line()+
geom_point()+
geom_line(aes(x = week, y = total_visits, color = 'Total Visits'))+
geom_point(aes(x = week, y = total_visits, color = 'Total Visits'))+
labs(x = "Weekday", y = "Total Visits & Unique Users", color = 'Web Stats') +
theme_light()
# group_by(month) is used to group the data each hour. summarise() is used to calculate
# total visits and total unique users on the website. ggplot() is used to create the timeseries plot.
# geom_line() is used to for time series plot and labs() is used to change the x and y-axis labels
# and plot title. theme_light() is used to add light theme. geom_point() is used to add data points
# in the plot. scale_x_continuous is used to change the x-axis labels.
p2 <- pa_data %>% group_by(hour) %>%
summarise(total_unique_users = n_distinct(tfc_cookie), total_visits = n()) %>%
ggplot(aes(x = hour, y = total_unique_users, group = 1, color = 'Unique Users'))+
geom_line()+
geom_point()+
geom_line(aes(x = hour, y = total_visits, color = 'Total Visits'))+
geom_point(aes(x = hour, y = total_visits, color = 'Total Visits'))+
labs(x = "Hour", y = "Total Visits & Unique Users") +
scale_x_continuous(breaks = 0:23, labels = 0:23) +
theme_light()
# grid.arrange() method is used to visualize two plots in a same figure
grid.arrange(p1, p2, nrow = 2)In the code chunk below, we have visualized a barplot of percentage of users who had more than 1 session on the website. The plot shows almost 90% of the users have only a single session indicating they visited only a single page and quit the browsing the website. Only 10% of the users had more than one session on the website indicating those users had browsed multiple pages on the website. These findings suggest very limited percentage of users browse more than one page on the website.
# group_by() is used to group by each tfc_cookie user. summarize() is used to
# total sessions by each users. ungroup() is used to drop the groups. mutate is used
# create sessions variable which indicates 1 if user had more than 1 sessions. mutate
# is also used to create the session percentage and label variables. pase0() is
# used to add % sign in the session percentage variable.
session_percentage <- pa_data %>%
group_by(tfc_cookie) %>%
summarise(total_sessions = length(tfc_session)) %>%
ungroup() %>%
mutate(sessions = factor(ifelse(total_sessions > 1, 1, 0))) %>%
group_by(sessions) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(session_percentage = round(total/sum(total)*100, 1),
sessions = c("Single Session", "Multiple Sessions"),
label = paste0(session_percentage, "%"))
# ggplot is used to create a pie chart. coorad_polar() is used to rotate the axis.
# geom_text() is used to add session percentage inside the pie chart. labs() is to
# add title. theme_minimal() is used to add white background of the plot. theme()
# is used to remove all the axis ticks and labels.
ggplot(session_percentage, aes(x = 2, y = session_percentage, fill = sessions)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
labs(title = "Website Session Distribution", fill = "Sessions") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"))We have visualized a barplot and generated a summary table to check the devices used by the users that visited the PlanningAlerts.ie website. The plot shows majority of the users visited the website through desktop and mobile browsers. The descriptive summary shows 63.41% of the users visited the website through desktop, 24% visited from mobile browser and 11% of the users visited via android app. Less than 1% of the users visited through iPhone app and tablet browser. Based on these findings we have observed the most common medium of browsing to the website is desktop and mobile browser and least common source of browsing is iPhone app and tablet.
# group_by() is used to group the data by tfc_device_type. summarize() is used
# count the total users. ggplot() is used to create the barplot. fct_reorder() is used
# reorder the barplot based on highest frequency. theme() is used to tilt the x-axis labels
# by 15 degree.
pa_data %>%
group_by(tfc_device_type) %>%
summarise(Total.Users = n()) %>%
ggplot(aes(x = fct_reorder(tfc_device_type, -Total.Users), y = Total.Users,
fill = tfc_device_type))+
geom_col()+
theme_light()+
labs(x = 'Device Type', y = 'Total Users', fill = 'Device Type')+
theme(axis.text.x = element_text(angle = 15))# group_by() is used to group the data by tfc_device_type. summarize() is used
# count the total users. mutate() is used to get the percentage of devices used by users
# arrange(desc()) is used to arrange the table from highest to lowest by total users.
# rename() is used to rename tfc_device_type and kable() is used print nice table
pa_data %>%
group_by(tfc_device_type) %>%
summarise(Total.Users = n()) %>%
mutate(Percentage = Total.Users / sum(Total.Users)*100) %>%
arrange(desc(Total.Users)) %>%
rename('Device.Type' = tfc_device_type) %>%
kable(digits = 2)| Device.Type | Total.Users | Percentage |
|---|---|---|
| Desktop | 253774 | 63.41 |
| Mobile (browser) | 96146 | 24.02 |
| Android App | 44101 | 11.02 |
| iPhone App | 3133 | 0.78 |
| Tablet (browser) | 3061 | 0.76 |
We have created a plot and created a summary table to check the most visited pages on the website. The plot and summary shows almost 54% of the users visited planning application listing pages on the desktop while 14.25% of the users visited planning application listing pages on the mobile. So almost 68% of the users are browsed planning application listing pages. 15.5% of the users visited the visited the pages that contained list of applications either nationally or by Council. 4.44% of the users visited signup pages and 0.51% of the users visited subscriber pages. 0.35% of the users visited the pricing page and only 0.12% of the visitors clicked on privacy web page. So these findings suggest majority of the users are looking for planning application listing and very few are looking for signup and subscription services.
# group_by() is used to group the data by tfc_full_url_screen. summarize() is used
# count the total users. mutate() is used to get the percentage of users visits on specific page type.
# arrange(desc()) is used to arrange the table from highest to lowest by total users.
# rename() is used to rename tfc_full_url_screen and kable() is used print nice table
page_visits <- pa_data %>%
group_by(tfc_full_url_screen) %>%
summarise(Total.Users = n()) %>%
mutate(Percentage = Total.Users / sum(Total.Users)*100) %>%
rename('Page.Visit.Type' = tfc_full_url_screen) %>%
arrange(desc(Total.Users)) %>%
slice_head(n = 20)
page_visits %>% kable(digits = 2)| Page.Visit.Type | Total.Users | Percentage |
|---|---|---|
| application | 215223 | 53.78 |
| list | 62111 | 15.52 |
| applicationmob | 57038 | 14.25 |
| map | 20326 | 5.08 |
| signup | 17781 | 4.44 |
| item | 6790 | 1.70 |
| data-centre-planning-applications | 2347 | 0.59 |
| subscriber | 2025 | 0.51 |
| mobilemap | 2012 | 0.50 |
| wind-farm-windfarm-turbine-planning-applications | 1907 | 0.48 |
| contact | 1691 | 0.42 |
| pricing | 1615 | 0.40 |
| about | 1392 | 0.35 |
| solar-pv-photovoltaic-planning-applications | 929 | 0.23 |
| demolition-planning-applications | 799 | 0.20 |
| freetrial-business | 688 | 0.17 |
| mobile | 666 | 0.17 |
| submission-deadlines | 497 | 0.12 |
| privacy | 463 | 0.12 |
| register | 436 | 0.11 |
# ggplot() is used to create the barplot. fct_reorder() is used reorder the
# ggplot is used to creat this plot. geom_segment() is used to generate the
# vertical lines in the plot. geom_point() is used add points for each page visit type
# fct_reorder is used to order the page visit type category be percentage.
# scale_y_continuous() is used to add percentage sign on y-axis and to add y-axis
# labels and limits as well.
page_visits %>% head(10) %>%
ggplot(aes(x = fct_reorder(Page.Visit.Type, -Percentage), y = Percentage,
color = Page.Visit.Type))+
geom_segment(aes(x = fct_reorder(Page.Visit.Type, -Percentage),
xend = fct_reorder(Page.Visit.Type, -Percentage),
y = 0, yend = Percentage, color = 'black'), show.legend = FALSE)+
geom_point(show.legend = FALSE)+
theme_light()+
labs(x = 'Page Visit Type', y = 'Percetange Visits')+
theme(axis.text.x = element_text(angle = 30, hjust = 1))+
scale_y_continuous(labels = scales::percent_format(scale = 1),
limits = c(0, 60), breaks = seq(0, 60, by = 10))We have created two tables to check the user traffic on the website from different search engines and social platforms. The first shows the 97.66% of the users from search engines came from Google searches and 1.55% of the users came from Bing and less than one percent of the users came from duckduck go search engine. From the social platforms, almost 97.5% of the users came from X and only 2.38% of the users came from Facebook. These findings suggest majority of the users use Google search engine and X social platform for searching planning applications list.
# mutate is used to create new variable named referrer_platform. case_when()
# is used to add the categories based on conidition which checks the presence of given keyword
# in regex.
pa_data <- pa_data %>%
mutate(referrer_platform = case_when(
str_detect(tfc_referrer, regex('google', ignore_case = TRUE)) ~ "Google",
str_detect(tfc_referrer, regex('facebook', ignore_case = TRUE)) ~ "Facebook",
str_detect(tfc_referrer, regex('instagram', ignore_case = TRUE)) ~ "Instagram",
str_detect(tfc_referrer, regex('linkedin', ignore_case = TRUE)) ~ "LinkedIn",
str_detect(tfc_referrer, regex('bing', ignore_case = TRUE)) ~ "Bing",
str_detect(tfc_referrer, regex('duckduckgo', ignore_case = TRUE)) ~ "Duckduckgo",
str_detect(tfc_referrer, regex('x', ignore_case = TRUE)) ~ "X",
is.na(tfc_referrer) ~ "Direct Access/Email/Unknown",
TRUE ~ 'Other'
))
# select is used to select referrer_platform, filter() is used to filter the
# search engine categories. data is then grouped by group_by() and summarized is used to get total
# visits and mutate is used to get the percentage. arrange() is used to arrange by percentage
# and kable() is used to get the nice table
pa_data %>% select(referrer_platform) %>%
filter(referrer_platform %in% c('Google', 'Bing', 'Duckduckgo')) %>%
group_by(referrer_platform) %>%
summarise(Total = n()) %>%
mutate(Percentage = Total / sum(Total)*100) %>%
arrange(desc(Percentage)) %>%
rename('Referrer.Platform' = referrer_platform) %>%
kable(digits = 2)| Referrer.Platform | Total | Percentage |
|---|---|---|
| 72422 | 97.66 | |
| Bing | 1146 | 1.55 |
| Duckduckgo | 590 | 0.80 |
# select is used to select referrer_platform, filter() is used to filter the
# social platform categories. data is then grouped by group_by() and summarized is used to get total
# visits and mutate is used to get the percentage. arrange() is used to arrange by percentage
# and kable() is used to get the nice table.
pa_data %>% select(referrer_platform) %>%
filter(referrer_platform %in% c('X', 'Facebook', 'Instagram', 'LinkedIn')) %>%
group_by(referrer_platform) %>%
summarise(Total = n()) %>%
mutate(Percentage = Total / sum(Total)*100) %>%
arrange(desc(Percentage)) %>%
rename('Referrer.Platform' = referrer_platform) %>%
kable(digits = 2)| Referrer.Platform | Total | Percentage |
|---|---|---|
| X | 1190 | 97.46 |
| 29 | 2.38 | |
| 1 | 0.08 | |
| 1 | 0.08 |
Below we have create a table to check the percentage of users by each device type session indicator. The session indicator is representing whether a user had a single or multiple sessions on the website where 1 indicates multiple sessions and 0 indicates only a single session. The table reveals 95.8% users on android app had only a single session while 62.97% of the users from desktop had multiple sessions on the website. 65.26% of the users from mobile browser had multiple sessions. 92% of the users from tablet and 96.5% of the users from iPhone app had multiple sessions. These findings suggest android app users mostly have single session while the multiple session increase when the user’s device type is desktop and mobile browser. The users with multiple session on the websites are highest when the device type is tablet or iPhone.
# we created a new data. first the data is grouped by tfc_cookie and total sessions
# are summed by each user cookie using summarize() method. data is ungrouped using ungroup()
# mutate() is used to create sessions indicator with 1 indicating user has more than 1 session
# inner_join() is used to get tfc_device_type variable from the pa_data
new_data <- pa_data %>%
group_by(tfc_cookie) %>%
summarise(total_sessions = length(tfc_session)) %>%
ungroup() %>%
mutate(sessions = factor(ifelse(total_sessions > 1, 1, 0))) %>%
group_by(sessions) %>%
inner_join(pa_data[, c('tfc_device_type', 'tfc_cookie')], by = 'tfc_cookie')
# data is grouped by tfc_device_type and sessions indicator, summarize is used to get the
# total users. mutate() is used to create percentage of users for each device type
new_data %>%
group_by(tfc_device_type, sessions) %>%
summarise( total = n()) %>%
mutate(Percentage = total / sum(total)*100) %>%
rename('Device.Type' = tfc_device_type) %>%
kable(digits = 2)`summarise()` has grouped output by 'tfc_device_type'. You can override using
the `.groups` argument.
| Device.Type | sessions | total | Percentage |
|---|---|---|---|
| Android App | 0 | 42244 | 95.79 |
| Android App | 1 | 1857 | 4.21 |
| Desktop | 0 | 93964 | 37.03 |
| Desktop | 1 | 159810 | 62.97 |
| Mobile (browser) | 0 | 33398 | 34.74 |
| Mobile (browser) | 1 | 62748 | 65.26 |
| Tablet (browser) | 0 | 238 | 7.78 |
| Tablet (browser) | 1 | 2823 | 92.22 |
| iPhone App | 0 | 109 | 3.48 |
| iPhone App | 1 | 3024 | 96.52 |
Conclusion
The explanatory data analysis of the PlanningAlerts.ie dataset revealed insightful patterns and trends. The analysis revealed the total users on the website were highest in July and lowest in June. Weekly and hourly trends revealed that user activity is highest mid-week (Wednesday) and peaks during the afternoon hours (4 PM). The analysis showed that 90% of users had only a single session, indicating limited page exploration. Device usage analysis revealed that the majority of users accessed the website through desktop and mobile browsers, with desktop dominating at over 63% of total visits. The least-used mediums were iPhone apps and tablets, indicating opportunities to improve these platforms. Regarding page visits, the majority of users browsed planning application listings, with less engagement on pages related to subscriptions, pricing, and privacy. Traffic sources analysis demonstrated that the website heavily relies on Google as a search engine (97.66% of search traffic) and X (formerly Twitter) as a social media platform (97.5% of social traffic). The session trends revealed that most Android app users had single sessions, while multiple sessions were more common among desktop and mobile browser users. Tablet and iPhone app users displayed the highest rates of multiple sessions, suggesting higher engagement levels among these device types.