Load Required Libraries

library(ggplot2)
library(dplyr)
library(lubridate)

Getting the data and creating the main data frame

logs_notifications = read.csv('logs_notifications.csv',stringsAsFactors = FALSE)
logs_web = read.csv('logs_web.csv',stringsAsFactors = FALSE)

logs_notifications %>%
  dplyr::full_join(logs_web %>% rename(datetime_click = datetime)) -> log_data_intermediate
## Joining, by = c("browser", "ip_anonymized", "language", "message")
log_data_intermediate %>% head()
log_data_intermediate %>% select(ip_anonymized) %>% unique() %>% nrow()
## [1] 500054

Assumptions about the data

Basic EDA

Are there any entries that have notification click date , but no notification send date?

print(log_data_intermediate %>% filter(is.na(datetime)) %>% nrow())
## [1] 54
  • There are 54 users who have clicked the notification but lack the date of receipt of the notification
  • Given that this is < 0.5% of the data, this should not affect our analysis significantly.

What were the total number of users who clicked the notifications?

The users that have not clicked the notification, do not have the date_click variable assigned to them.

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  group_by(clicked) %>%
  summarise(n=n())
  • Approximately, 1.3 % of users, clicked the notification for the given period of time.

How many notifications were sent on a daily basis?

log_data_intermediate %>% mutate(datetime = ymd_hms(datetime)) %>%
  mutate(date_ = date(datetime)) %>%
  group_by(date_,message) %>%
  summarise(n=n()) %>% ungroup() %>% filter(!is.na(date_)) %>%
  ggplot(aes(x=as.factor(date_),y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Date',y='Number of Notification Sent')+
  scale_fill_discrete(name = "Notification Type")+
  theme(axis.text.x = element_text(angle = 45, hjust=1))

  • The number of notifications sent per day reduces between the date period 2015-09-01 & 2015-09-07. If our assumption about choosing 1% of daily users is right, then this would mean that the active users are going down.

  • The number of notification types sent per day is equal to each other as evidenced by the problem statement (The quote “Once selected, we sent one of three promoting notifications to the user with equal probability”).

At what times were the notifications sent?

log_data_intermediate %>% filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(hour_ = hour(datetime)) %>%
  mutate(time_period = ifelse(hour_<=4,'12AM to 4AM','8PM to 12AM'),
         time_period = ifelse(hour_>4 & hour_<=8,'4AM to 8AM',time_period),
        time_period = ifelse(hour_>8 & hour_<=12,'8AM to 12PM',time_period),
        time_period = ifelse(hour_>12 & hour_<=16,'12PM to 4PM',time_period),
        time_period = ifelse(hour_>16 & hour_<=20,'4PM to 8PM',time_period)
         ) %>%
  mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
  group_by(time_period,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Hour of the Day',y='Number of Notification Sent')+
  scale_fill_discrete(name = "Notification Type")+
    theme(axis.text.x = element_text(angle = 45, hjust=1))

  • Most notifications are sent between 12 AM and 4AM & 4PM and 8PM.

At what time did the clicks happen?

log_data_intermediate %>% 
  filter(!is.na(datetime_click)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(hour_ = hour(datetime_click)) %>%
  mutate(time_period = ifelse(hour_<=4,'12AM to 4AM','8PM to 12AM'),
         time_period = ifelse(hour_>4 & hour_<=8,'4AM to 8AM',time_period),
        time_period = ifelse(hour_>8 & hour_<=12,'8AM to 12PM',time_period),
        time_period = ifelse(hour_>12 & hour_<=16,'12PM to 4PM',time_period),
        time_period = ifelse(hour_>16 & hour_<=20,'4PM to 8PM',time_period)
         ) %>%
  mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
  group_by(time_period,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Time of the Day',y='Number of Notification Clicked')+
  scale_fill_discrete(name = "Notification Type") + facet_wrap(~message,nrow = 3)+
    theme(axis.text.x = element_text(angle = 45, hjust=1))

  • Most notifications are clicked on between 12 AM to 4AM , following by the period 4PM to 8PM according to the data. This could be attributed to late night internet usage & internet usage while travelling from work respectively.

How long did the converted users take to click the notification?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(day_diff = lubridate::interval(datetime,datetime_click)) %>%
  mutate(hour_diff = as.period(day_diff) %>% hour()) %>%
  group_by(hour_diff,message) %>% 
  summarise(n=n()) %>% ungroup() %>%
  arrange(desc(message)) %>%
  inner_join(log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>% group_by(message) %>% summarise(total=n())) %>%
  mutate(percentage_click = n/total) %>%
  rename(`Number of Hours Elapsed Since Notification is sent` = hour_diff) %>%
  mutate(percentage_click = stringr::str_c(round(percentage_click,3)*100,'%')) %>%
  mutate(`Number of Hours Elapsed Since Notification is sent` = stringr::str_c(`Number of Hours Elapsed Since Notification is sent`,'H') ) %>%
  rename(`Percentage Clicks %` = percentage_click)
## Joining, by = "message"
  # ggplot(aes(x=hour_diff,y=percentage_click,fill=as.factor(message),label=scales::percent(round(percentage_click,2))))+geom_bar(stat='identity',position = 'dodge')+
  # #facet_wrap(~message)+
  # labs(x='Number of Hours Elapsed between Notification Receipt and Click', y= 'Percentage %')+
  # scale_y_continuous(labels = scales::percent)
  • A larger number of users clicked message 1 compared to messages 2 & 0 within the 1st hour.
  • This could be because of the phrasing/syntax of message 1.

For each message type, how does the click through rate change with time?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(date_click = lubridate::as_date(datetime_click)) %>%
  group_by(date_click,message) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(date_click = lubridate::as_date(datetime)) %>%
  group_by(date_click,message) %>%
  summarise(total=n())) %>% ungroup() %>%
  arrange((date_click)) %>%
  mutate(ctr = n/total) %>%
  ggplot(aes(x=as.factor(date_click),y=ctr,fill=as.factor(message),label = scales::percent(round(ctr,4))))+geom_bar(stat='identity',position='dodge')+ labs(x='Date',y='CTR')+ scale_fill_discrete(name = "Notification Type")+
  scale_y_continuous(labels = scales::percent)+
  geom_text(position = position_dodge(width = .9),    # move to center of bars
              vjust = 0.5,# nudge above top of bar
              size = 3,
            hjust = 1.5
            ) + coord_flip()
## Joining, by = c("date_click", "message")

  • The click through rates for message 1 remain high between 1st till 4th of September, and slightly decrease after that. (Tuesday to Friday)
  • Message 0 experiences a peak click through rate on 5th of September.(Saturday)
  • Message 2’s click through rates peak on 3rd September & 5th September.(Thursday & Saturday)
  • Understanding the messaging syntax and wordings would give us a better idea of why we see peak behaviour for message 1 during week days. The messaging could be to the point. The other messages might have been too wordy.

For each message type, how does the click through rate change by browser?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  group_by(message,browser) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  group_by(message,browser) %>%
  summarise(total=n())) %>%
  mutate(`% CTR` = n/total)
## Joining, by = c("message", "browser")
  • The click through rates are higher for messages that have been sent to Firefox users.

How significant is this difference?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  group_by(browser) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  group_by(browser) %>%
  summarise(total=n())) 
## Joining, by = "browser"
prop.test(x = c(2794, 3590), n = c(166584,333416),alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2794, 3590) out of c(166584, 333416)
## X-squared = 317.31, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.005405284 1.000000000
## sample estimates:
##     prop 1     prop 2 
## 0.01677232 0.01076733
  • The difference in click through rates via Firefox and Chrome is significant as evidenced by the p-value<0.05 of the proportion test.
  • This could be because of how the notifications are served on Firefox and Chrome. Deep diving into the UI of the notifications for Firefox and Chrome would give us a better idea about the reason/s for this difference.

Deep Dive EDA

How does the click through rate change by notification type? Are these differences significant?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(message) %>%
  summarise(total=n())) %>%
  arrange(desc(message)) %>%
  mutate(percentage_ = n/total)
## Joining, by = "message"
  • Message type 1 had the highest click through rate at 1.4 %
  • This is followed by messages 2 & 0

How significantly higher is the proportion of clicks to total notifications for message type 1?

prop.test(x = c(2380, 2180), n = c(167055,166744),alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2380, 2180) out of c(167055, 166744)
## X-squared = 8.4328, df = 1, p-value = 0.001843
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.0005059651 1.0000000000
## sample estimates:
##     prop 1     prop 2 
## 0.01424680 0.01307393
prop.test(x = c(2380, 1824), n = c(167055,166201),alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2380, 1824) out of c(167055, 166201)
## X-squared = 71.353, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.00263042 1.00000000
## sample estimates:
##     prop 1     prop 2 
## 0.01424680 0.01097466
  • Given the p-values<0.05, this shows that there is enough evidence to prove that the click through rates of message 1 is significantly higher than that of 0 & 2, and do not occur due to randomness.

Which messages were the top performers for each language?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,message,language) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(message,language) %>%
  summarise(total=n())) %>%
  arrange(desc(language)) %>%
  mutate(percentage_ = n/total) %>%
  filter(clicked == 'Clicked') %>%
  arrange(desc(percentage_)) -> language_message_ctr
## Joining, by = c("message", "language")
language_message_ctr %>% ungroup() %>% 
  arrange_(~ desc(percentage_)) %>%
  group_by_(~ language) %>%
  do(head(., n = 2))
## Warning: arrange_() is deprecated. 
## Please use arrange() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with arrange() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.
## Warning: group_by_() is deprecated. 
## Please use group_by() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.
  • Message 1 consistently emerges as the top performer across all languages except for French & Chinese.
  • For French and Chinese users, Message 2 & Message 0 emerge as top performers.

Which language had the least click through rate?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,language) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(language) %>%
  summarise(total=n())) %>%
  arrange(desc(language)) %>%
  mutate(percentage_ = n/total) %>%
  ungroup() %>%
  filter(clicked == ' Not Clicked')  %>%
  arrange(desc(percentage_))
## Joining, by = "language"
  • Close to 100 % of notifications in Spanish & Russian did not get clicked.
  • This might have been due to translation issues but only deep diving into the message syntax/grammar & consumer feedback can support/reject our hypothesis.

Estimation of the number of installations if notifications were sent to all users

Before we proceed with the problem its good to revisit what we have so far

  • We have assumed that the sample (500,054 users) comprises of 1% of users.

  • Each of these users are independent of each other.

  • We can assume that the behaviour of the population would be similar to the sample as the sample is selected randomly.

  • This estimation is for a week of notifications.

  • The process used to send notifications will be the same as the sample. There are no confounding variables that are taken into account, that affect the conversion and clicks.

  • 1.3% of the sample size clicked the notifications that were sent to them.

  • Out of this 1.3% we have to assume that 50% of them actually installed.

We need to estimate the proportion of the population, that will click the link in the notification & the number of installs that would be generated because of this

Here, we try to estimate the 95% confidence interval of the proportion of users that will click the notification.

sample_size = 500054
proportion_estimate = 0.01287461

standard_error = sqrt(proportion_estimate*(1-proportion_estimate))/ sqrt(sample_size)

margin_of_error = qnorm(0.975)*standard_error

lower_ = proportion_estimate - margin_of_error

upper_ = proportion_estimate + margin_of_error

print(lower_)
## [1] 0.01256215
print(upper_)
## [1] 0.01318707
  • At 95% confidence interval , between 1.26% and 1.32% of the larger consumer base would click the notifications that will be sent over the week.

  • Given that the total consumer base is 50,005,400 the clicks would then range from 628,175 to 659,425 at 95% confidence interval.

  • This would then translate to 314,088 to 329,713 installations achieved via notifications for a week. This is because we can assume 50% of clicks to translate to installations.

  • Thus, for the given time period and data, it will not be possible to attain > 1 million installations.

Conclusions

Which message should we send if we are sending to all users?

Our translation manager thinks that one or several of the translations are crappy and look like a 4-year old could have written that. Can you confirm that?

Our designer is interested if there are differences between browsers, because the mechanisms for displaying notifications for these differ.

Our Head of Business Development wants to decide if we actually should send out notifications to all users. He expects that one million or more installations should be worth it. Estimate the number of installations if we send out notifications to all users (assume that 50% of users that click the link included in the notifications will install Adblock Browser)

Given that the users are getting their notifications on a desktop,news about a new browser for mobile platforms might not be attended to as this requires an extra step to install them on their mobile devices. As an alternative, sending an email notification would be better.These email notifications can then be attended to via mobile devices and hence the browser can be downloaded directly.

Considering user segments based on early vs late adoption of new features, can help target the notifications to the right audiences. Based on feedback from early adopters(if the product is a beta version), the notifications and the product can be changed to meet the demands of a wider audience.

Questions/Concerns that should be cleared before proceeding with the analysis.