library(ggplot2)
library(dplyr)
library(lubridate)
logs_notifications <- read.csv('logs_notifications.csv', stringsAsFactors = FALSE)
logs_web <- read.csv('logs_web.csv', stringsAsFactors = FALSE)
logs_notifications %>%
dplyr::full_join(logs_web %>% rename(datetime_click = datetime)) -> log_data_intermediate
## Joining, by = c("browser", "ip_anonymized", "language", "message")
log_data_intermediate %>% head()
log_data_intermediate %>% select(ip_anonymized) %>% unique() %>% nrow()
## [1] 500054
logs_notifications.csv & logs_web.csv have been joined using the variables browser, ip_anonymized, language & message. If the datetime_click variable is NA, it means that the user did not click the notification. Each IP address refers to a single user, and these users are randomly selected and independent of each other.
The 500,054 unique IP addresses represent 1% of the total consumer base. Under this assumption, the total consumer base is 50,005,400 IP addresses (a quick sketch of this arithmetic follows after this list of assumptions).
The remaining 99% of users have properties similar to the 1% sample because the sample was drawn at random. This assumption lets us extrapolate our results to the general consumer base with a good degree of statistical confidence.
The data covers one week, so we present the results for this time period only and do not consider clicks/conversions that happened afterwards.
AdBlock Plus is used primarily on desktops. Users were notified via browser notifications about the Adblock Browser when they activated their AdBlock Plus plugin. The message contains a link that redirects the user to the app download page.
During this period, 1% of users were chosen daily to receive the notifications.
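A quick sketch of the arithmetic behind the 50,005,400 figure (it simply scales the sample up, assuming the 1% random-sample claim above holds):
sample_ips <- 500054              # unique IP addresses in the logs
total_base <- sample_ips / 0.01   # extrapolated consumer base, assuming a 1% sample
total_base                        # 50,005,400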
print(log_data_intermediate %>% filter(is.na(datetime)) %>% nrow())
## [1] 54
There are 54 click records with no matching notification, so their datetime is NA. Conversely, users who did not click the notification have no datetime_click value (it is NA).
log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
group_by(clicked) %>%
summarise(n=n())
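As a reference point, the overall click share can be summarised directly. This is a sketch; depending on how duplicate rows are treated it should land close to the ~1.29% proportion reused later for the confidence interval.
# Sketch: clicked notifications relative to unique users in the sample
log_data_intermediate %>%
summarise(clicked = sum(!is.na(datetime_click)),
          users = n_distinct(ip_anonymized),
          click_share = clicked / users)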
log_data_intermediate %>% mutate(datetime = ymd_hms(datetime)) %>%
mutate(date_ = date(datetime)) %>%
group_by(date_,message) %>%
summarise(n=n()) %>% ungroup() %>% filter(!is.na(date_)) %>%
ggplot(aes(x=as.factor(date_),y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
labs(x='Date',y='Number of Notifications Sent')+
scale_fill_discrete(name = "Notification Type")+
theme(axis.text.x = element_text(angle = 45, hjust=1))
The number of notifications sent per day decreases over the period 2015-09-01 to 2015-09-07. If our assumption that 1% of daily active users are chosen is correct, this would mean that the number of active users is declining.
The three notification types are sent in roughly equal numbers per day, consistent with the problem statement (the quote “Once selected, we sent one of three promoting notifications to the user with equal probability”).
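As an added sanity check (not part of the original write-up), a chi-squared goodness-of-fit test on the per-message counts of sent notifications would quantify how consistent the data is with the equal-probability claim:
# Sketch: under H0 the three message types are sent with equal probability
log_data_intermediate %>%
filter(!is.na(datetime)) %>%   # sent notifications only
count(message) %>%
pull(n) %>%
chisq.test()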
log_data_intermediate %>% filter(!is.na(datetime)) %>%
mutate(datetime = ymd_hms(datetime)) %>%
mutate(hour_ = hour(datetime)) %>%
mutate(time_period = case_when(   # each 4-hour bucket includes its upper hour
  hour_ <= 4  ~ '12AM to 4AM',
  hour_ <= 8  ~ '4AM to 8AM',
  hour_ <= 12 ~ '8AM to 12PM',
  hour_ <= 16 ~ '12PM to 4PM',
  hour_ <= 20 ~ '4PM to 8PM',
  TRUE        ~ '8PM to 12AM'
)) %>%
mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
group_by(time_period,message) %>%
summarise(n=n()) %>% ungroup() %>%
ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
labs(x='Hour of the Day',y='Number of Notifications Sent')+
scale_fill_discrete(name = "Notification Type")+
theme(axis.text.x = element_text(angle = 45, hjust=1))
Notification sending peaks between 12 AM and 4 AM and again between 4 PM and 8 PM.
log_data_intermediate %>%
filter(!is.na(datetime_click)) %>%
mutate(datetime_click = ymd_hms(datetime_click)) %>%
mutate(hour_ = hour(datetime_click)) %>%
mutate(time_period = case_when(   # each 4-hour bucket includes its upper hour
  hour_ <= 4  ~ '12AM to 4AM',
  hour_ <= 8  ~ '4AM to 8AM',
  hour_ <= 12 ~ '8AM to 12PM',
  hour_ <= 16 ~ '12PM to 4PM',
  hour_ <= 20 ~ '4PM to 8PM',
  TRUE        ~ '8PM to 12AM'
)) %>%
mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
group_by(time_period,message) %>%
summarise(n=n()) %>% ungroup() %>%
ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
labs(x='Time of the Day',y='Number of Notifications Clicked')+
scale_fill_discrete(name = "Notification Type") + facet_wrap(~message,nrow = 3)+
theme(axis.text.x = element_text(angle = 45, hjust=1))
log_data_intermediate %>% filter(!is.na(datetime_click)) %>%
filter(!is.na(datetime)) %>%
mutate(datetime = ymd_hms(datetime)) %>%
mutate(datetime_click = ymd_hms(datetime_click)) %>%
mutate(day_diff = lubridate::interval(datetime,datetime_click)) %>%
mutate(hour_diff = as.period(day_diff) %>% hour()) %>%
group_by(hour_diff,message) %>%
summarise(n=n()) %>% ungroup() %>%
arrange(desc(message)) %>%
inner_join(log_data_intermediate %>% filter(!is.na(datetime_click)) %>%
filter(!is.na(datetime)) %>% group_by(message) %>% summarise(total=n())) %>%
mutate(percentage_click = n/total) %>%
rename(`Number of Hours Elapsed Since Notification Was Sent` = hour_diff) %>%
mutate(percentage_click = stringr::str_c(round(percentage_click,3)*100,'%')) %>%
mutate(`Number of Hours Elapsed Since Notification Was Sent` = stringr::str_c(`Number of Hours Elapsed Since Notification Was Sent`,'H')) %>%
rename(`Percentage Clicks %` = percentage_click)
## Joining, by = "message"
# ggplot(aes(x=hour_diff,y=percentage_click,fill=as.factor(message),label=scales::percent(round(percentage_click,2))))+geom_bar(stat='identity',position = 'dodge')+
# #facet_wrap(~message)+
# labs(x='Number of Hours Elapsed between Notification Receipt and Click', y= 'Percentage %')+
# scale_y_continuous(labels = scales::percent)
A noticeably larger share of clicks happened within the first hour after sending for message 1 compared to messages 2 & 0.
log_data_intermediate %>% filter(!is.na(datetime_click)) %>%
filter(!is.na(datetime)) %>%
mutate(datetime = ymd_hms(datetime)) %>%
mutate(datetime_click = ymd_hms(datetime_click)) %>%
mutate(date_click = lubridate::as_date(datetime_click)) %>%
group_by(date_click,message) %>%
summarise(n=n()) %>%
inner_join(log_data_intermediate %>%
filter(!is.na(datetime)) %>%
mutate(datetime = ymd_hms(datetime)) %>%
mutate(datetime_click = ymd_hms(datetime_click)) %>%
mutate(date_click = lubridate::as_date(datetime)) %>%
group_by(date_click,message) %>%
summarise(total=n())) %>% ungroup() %>%
arrange((date_click)) %>%
mutate(ctr = n/total) %>%
ggplot(aes(x=as.factor(date_click),y=ctr,fill=as.factor(message),label = scales::percent(round(ctr,4))))+geom_bar(stat='identity',position='dodge')+ labs(x='Date',y='CTR')+ scale_fill_discrete(name = "Notification Type")+
scale_y_continuous(labels = scales::percent)+
geom_text(position = position_dodge(width = .9), # align labels with the dodged bars
vjust = 0.5,
size = 3,
hjust = 1.5 # place labels just inside the bar ends (bars are flipped below)
) + coord_flip()
## Joining, by = c("date_click", "message")
Click-through rates for message 1 remain high from the 1st to the 4th of September (Tuesday to Friday) and decrease slightly after that. Message 0 experiences a peak click-through rate on the 5th of September (Saturday). Message 2's click-through rates peak on the 3rd and 5th of September (Thursday & Saturday). Message 1 performs best during weekdays; its wording may be more to the point, while the other messages might have been too wordy.
log_data_intermediate %>% filter(!is.na(datetime_click)) %>%
filter(!is.na(datetime)) %>%
group_by(message,browser) %>%
summarise(n=n()) %>%
inner_join(log_data_intermediate %>%
filter(!is.na(datetime)) %>%
group_by(message,browser) %>%
summarise(total=n())) %>%
mutate(`% CTR` = n/total)
## Joining, by = c("message", "browser")
How significant is this difference in click-through rate between the two browsers?
log_data_intermediate %>% filter(!is.na(datetime_click)) %>%
filter(!is.na(datetime)) %>%
group_by(browser) %>%
summarise(n=n()) %>%
inner_join(log_data_intermediate %>%
filter(!is.na(datetime)) %>%
group_by(browser) %>%
summarise(total=n()))
## Joining, by = "browser"
prop.test(x = c(2794, 3590), n = c(166584,333416),alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(2794, 3590) out of c(166584, 333416)
## X-squared = 317.31, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.005405284 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.01677232 0.01076733
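The one-sided test above only bounds the difference from below. A two-sided version of the same test (a sketch reusing the counts printed above) would give a full confidence interval for the gap between the two browsers' click-through rates:
# Sketch: two-sided test for the difference in browser CTRs (same counts as above)
prop.test(x = c(2794, 3590), n = c(166584, 333416))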
log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(clicked,message) %>%
summarise(n=n()) %>% ungroup() %>%
inner_join(log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(message) %>%
summarise(total=n())) %>%
arrange(desc(message)) %>%
mutate(percentage_ = n/total)
## Joining, by = "message"
Message 1 had the highest click-through rate, at about 1.4%, ahead of messages 2 & 0. How significantly higher is the proportion of clicks to total notifications for message type 1?
prop.test(x = c(2380, 2180), n = c(167055,166744),alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(2380, 2180) out of c(167055, 166744)
## X-squared = 8.4328, df = 1, p-value = 0.001843
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.0005059651 1.0000000000
## sample estimates:
## prop 1 prop 2
## 0.01424680 0.01307393
prop.test(x = c(2380, 1824), n = c(167055,166201),alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(2380, 1824) out of c(167055, 166201)
## X-squared = 71.353, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.00263042 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.01424680 0.01097466
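As a compact cross-check of the two one-sided tests above (a sketch; the counts and their message labels are taken from the tests just shown), all pairwise message comparisons can be run at once with a multiple-comparison correction:
# Sketch: Holm-adjusted pairwise comparisons of per-message click rates
clicks <- c(msg1 = 2380, msg0 = 2180, msg2 = 1824)
sent   <- c(msg1 = 167055, msg0 = 166744, msg2 = 166201)
pairwise.prop.test(clicks, sent, p.adjust.method = "holm")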
The click-through rate of message 1 is significantly higher than that of messages 0 & 2, and the differences are unlikely to be due to chance.
log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(clicked,message,language) %>%
summarise(n=n()) %>% ungroup() %>%
inner_join(log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(message,language) %>%
summarise(total=n())) %>%
arrange(desc(language)) %>%
mutate(percentage_ = n/total) %>%
filter(clicked == 'Clicked') %>%
arrange(desc(percentage_)) -> language_message_ctr
## Joining, by = c("message", "language")
language_message_ctr %>% ungroup() %>%
arrange(desc(percentage_)) %>%
group_by(language) %>%
do(head(., n = 2))
Message 1 consistently emerges as the top performer across all languages except French & Chinese, where Message 2 and Message 0, respectively, emerge as top performers.
log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(clicked,language) %>%
summarise(n=n()) %>% ungroup() %>%
inner_join(log_data_intermediate %>%
mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
filter(!is.na(datetime)) %>%
group_by(language) %>%
summarise(total=n())) %>%
arrange(desc(language)) %>%
mutate(percentage_ = n/total) %>%
ungroup() %>%
filter(clicked == ' Not Clicked') %>%
arrange(desc(percentage_))
## Joining, by = "language"
Before we proceed with the problem, it is good to revisit what we have so far:
We have assumed that the sample (500,054 users) comprises 1% of all users.
Each of these users is independent of the others.
We can assume that the behaviour of the population is similar to that of the sample, since the sample was selected randomly.
This estimate covers one week of notifications.
The process used to send notifications to the full base will be the same as for the sample. No confounding variables that could affect clicks and conversions have been taken into account.
1.3% of the sample clicked the notifications that were sent to them.
Of this 1.3%, we assume that 50% actually installed the browser.
We need to estimate the proportion of the population that will click the link in the notification, and the number of installs that would be generated as a result.
Here, we try to estimate the 95% confidence interval of the proportion of users that will click the notification.
sample_size = 500054
proportion_estimate = 0.01287461
standard_error = sqrt(proportion_estimate*(1-proportion_estimate))/ sqrt(sample_size)
margin_of_error = qnorm(0.975)*standard_error
lower_ = proportion_estimate - margin_of_error
upper_ = proportion_estimate + margin_of_error
print(lower_)
## [1] 0.01256215
print(upper_)
## [1] 0.01318707
At the 95% confidence level, between 1.26% and 1.32% of the larger consumer base would click the notifications sent over the week.
Given that the total consumer base is 50,005,400, the clicks would then range from 628,175 to 659,425 at the 95% confidence level.
This translates to 314,088 to 329,713 installations via notifications over a week, assuming that 50% of clicks convert to installations (this scaling is sketched in code below).
Thus, for the given time period and data, it will not be possible to attain > 1 million installations.
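The scaling described above can be reproduced directly from the interval computed earlier (a sketch; total_base is the assumed 50,005,400-user figure and 0.5 is the assumed click-to-install rate):
# Sketch: scale the click-rate CI to the full base, then apply the 50% install rate
total_base <- 50005400
clicks_range   <- c(lower_, upper_) * total_base   # ~628,175 to ~659,425
installs_range <- 0.5 * clicks_range               # ~314,088 to ~329,713
installs_range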
Which message should we send if we are sending to all users?
Message 1 for all languages except Chinese & French.
Message 0 for Chinese users and Message 2 for French users.
Apart from doing the above, exploring the pain points experienced by Spanish users would be essential.
Our translation manager thinks that one or several of the translations are crappy and look like a 4-year old could have written that. Can you confirm that?
Our designer is interested if there are differences between browsers, because the mechanisms for displaying notifications for these differ.
There is significant evidence that Firefox users would have a higher proportion of clicks than Chrome users if the notifications were sent to the entire consumer base.
To understand the reasons for this, we have to dive deeper into the mechanisms (syntax/UI/positioning of the messages) of the messaging on Firefox and Chrome platforms.
Our Head of Business Development wants to decide if we actually should send out notifications to all users. He expects that one million or more installations should be worth it. Estimate the number of installations if we send out notifications to all users (assume that 50% of users that click the link included in the notifications will install Adblock Browser)
We estimate 314,088 to 329,713 installations, with a high degree of statistical confidence. Given that users receive these notifications on a desktop, news about a new browser for mobile platforms may not be acted upon, since installing it requires an extra step on a mobile device. As an alternative, sending an email notification would be better: such emails can be opened on mobile devices, from which the browser can be downloaded directly.
Considering user segments based on early vs. late adoption of new features can help target the notifications to the right audiences. Based on feedback from early adopters (if the product is a beta version), the notifications and the product can be adjusted to meet the demands of a wider audience.
Understanding the product better
Who is the target of our new product?
What specific need of these target consumers will the product satisfy?
Which specific benefits should the product have?
The number of notifications sent per day decreases over the period 2015-09-01 to 2015-09-07. If our assumption about choosing 1% of daily users is right, this would mean that the number of active users is declining. A thorough discussion with the product manager/data scientist would clear this up.
Did the introduction of notifications affect the performance of the platform in any way?
Are there any competitors in the market that have introduced a similar product?
Are the notifications intuitive to understand?
Are there any other media/channels through which users are made aware of the browser? This could be blog coverage, word of mouth, advertising, social media, etc.
Are there better ways to serve notifications to users?
The assumption that each IP address refers to a single user is likely wrong: individuals may own more than one desktop device (office and home computers), so a more robust form of user identification is required. Conversely, a converted user might install the browser on multiple mobile devices, so the conversion estimates might be underestimates.
Our analysis assumes independence of data points, but there will be repeat users among the rest of the consumer base (the 99%) from day to day.
Considering installations due to external factors such as blog coverage, advertising, and word of mouth can help assess the effectiveness of notifications.
Our analysis does not assess the effectiveness of the notifications as required by the task. To assess this, we would need to compare our results against the clicks and installs generated without notifications. This would entail using two groups of random users: one group exposed to notifications and the other left as is (A/B testing). Installation behaviour in the two groups could then be compared. A rough sizing sketch for such a test is given at the end of this section.
Users are not necessarily independent of each other. One might inform others about the new product release via mass communication tools such as social media.
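As a rough idea of how the proposed A/B test could be sized (a sketch; the ~1.3% baseline rate comes from the sample above, while the 0.2 percentage-point lift to detect is a hypothetical choice):
# Sketch: approximate per-group sample size to detect a lift from 1.3% to 1.5%
power.prop.test(p1 = 0.013, p2 = 0.015, power = 0.8, sig.level = 0.05)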