Eyeo Analysis

Load Required Libraries
Getting the data and creating the main data frame
Assumptions about the data
Basic EDA
Deep Dive EDA
Conclusions
Questions/Concerns that should be cleared before proceeding with the analysis.

Load Required Libraries

library(ggplot2)
library(dplyr)
library(lubridate)

Getting the data and creating the main data frame

logs_notifications = read.csv('logs_notifications.csv',stringsAsFactors = FALSE)
logs_web = read.csv('logs_web.csv',stringsAsFactors = FALSE)

logs_notifications %>%
  dplyr::full_join(logs_web %>% rename(datetime_click = datetime)) -> log_data_intermediate

## Joining, by = c("browser", "ip_anonymized", "language", "message")

log_data_intermediate %>% head()

log_data_intermediate %>% select(ip_anonymized) %>% unique() %>% nrow()

## [1] 500054

In the above snippet, the datasets log_notifications.csv & logs_web.csv have been joined using the variables ip_anonymized, language & message
The the resulting table,the rows where the datetime_click variable is NA, it means that the user did not click the notification.

Assumptions about the data

Each IP address refers to a single user and these users are randomly selected and are independent from each other.
The total number of IP addresses (500,054 addresses) refer to 1% of the total consumer base. Using this assumption, the total consumer base is 50,005,400 IP addresses.
The rest of the users (the 99%) have similar properties as the 1% due to randomness. This assumption will enable us to extrapolate our results to the general consumer base with a good degee of statistical confidence.
This data is for a week. So we will present the results for this analysis using this time period only. We are not going to consider the clicks/conversions that happened after this.
AdBlock Plus software is used primarily on desktops. The users were notified via browser notifications about the the AdBlock Plus Browser when they activated their AdBlock Plus plugin. This message has a link that redirects the user to the app download page.
For this time period, 1 % of users were chosen daily for the notifications.

Basic EDA

Are there any entries that have notification click date , but no notification send date?

print(log_data_intermediate %>% filter(is.na(datetime)) %>% nrow())

## [1] 54

There are 54 users who have clicked the notification but lack the date of receipt of the notification
Given that this is < 0.5% of the data, this should not affect our analysis significantly.

What were the total number of users who clicked the notifications?

The users that have not clicked the notification, do not have the date_click variable assigned to them.

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  group_by(clicked) %>%
  summarise(n=n())

Approximately, 1.3 % of users, clicked the notification for the given period of time.

How many notifications were sent on a daily basis?

log_data_intermediate %>% mutate(datetime = ymd_hms(datetime)) %>%
  mutate(date_ = date(datetime)) %>%
  group_by(date_,message) %>%
  summarise(n=n()) %>% ungroup() %>% filter(!is.na(date_)) %>%
  ggplot(aes(x=as.factor(date_),y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Date',y='Number of Notification Sent')+
  scale_fill_discrete(name = "Notification Type")+
  theme(axis.text.x = element_text(angle = 45, hjust=1))

The number of notifications sent per day reduces between the date period 2015-09-01 & 2015-09-07. If our assumption about choosing 1% of daily users is right, then this would mean that the active users are going down.
The number of notification types sent per day is equal to each other as evidenced by the problem statement (The quote “Once selected, we sent one of three promoting notifications to the user with equal probability”).

At what times were the notifications sent?

log_data_intermediate %>% filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(hour_ = hour(datetime)) %>%
  mutate(time_period = ifelse(hour_<=4,'12AM to 4AM','8PM to 12AM'),
         time_period = ifelse(hour_>4 & hour_<=8,'4AM to 8AM',time_period),
        time_period = ifelse(hour_>8 & hour_<=12,'8AM to 12PM',time_period),
        time_period = ifelse(hour_>12 & hour_<=16,'12PM to 4PM',time_period),
        time_period = ifelse(hour_>16 & hour_<=20,'4PM to 8PM',time_period)
         ) %>%
  mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
  group_by(time_period,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Hour of the Day',y='Number of Notification Sent')+
  scale_fill_discrete(name = "Notification Type")+
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Most notifications are sent between 12 AM and 4AM & 4PM and 8PM.

At what time did the clicks happen?

log_data_intermediate %>% 
  filter(!is.na(datetime_click)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(hour_ = hour(datetime_click)) %>%
  mutate(time_period = ifelse(hour_<=4,'12AM to 4AM','8PM to 12AM'),
         time_period = ifelse(hour_>4 & hour_<=8,'4AM to 8AM',time_period),
        time_period = ifelse(hour_>8 & hour_<=12,'8AM to 12PM',time_period),
        time_period = ifelse(hour_>12 & hour_<=16,'12PM to 4PM',time_period),
        time_period = ifelse(hour_>16 & hour_<=20,'4PM to 8PM',time_period)
         ) %>%
  mutate(time_period = factor(time_period,levels = c('12AM to 4AM','4AM to 8AM','8AM to 12PM','12PM to 4PM','4PM to 8PM','8PM to 12AM'))) %>%
  group_by(time_period,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  ggplot(aes(x=time_period,y=n,fill= as.factor(message)))+geom_bar(stat = 'identity',position = 'dodge')+
  labs(x='Time of the Day',y='Number of Notification Clicked')+
  scale_fill_discrete(name = "Notification Type") + facet_wrap(~message,nrow = 3)+
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Most notifications are clicked on between 12 AM to 4AM , following by the period 4PM to 8PM according to the data. This could be attributed to late night internet usage & internet usage while travelling from work respectively.

How long did the converted users take to click the notification?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(day_diff = lubridate::interval(datetime,datetime_click)) %>%
  mutate(hour_diff = as.period(day_diff) %>% hour()) %>%
  group_by(hour_diff,message) %>% 
  summarise(n=n()) %>% ungroup() %>%
  arrange(desc(message)) %>%
  inner_join(log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>% group_by(message) %>% summarise(total=n())) %>%
  mutate(percentage_click = n/total) %>%
  rename(`Number of Hours Elapsed Since Notification is sent` = hour_diff) %>%
  mutate(percentage_click = stringr::str_c(round(percentage_click,3)*100,'%')) %>%
  mutate(`Number of Hours Elapsed Since Notification is sent` = stringr::str_c(`Number of Hours Elapsed Since Notification is sent`,'H') ) %>%
  rename(`Percentage Clicks %` = percentage_click)

## Joining, by = "message"

  # ggplot(aes(x=hour_diff,y=percentage_click,fill=as.factor(message),label=scales::percent(round(percentage_click,2))))+geom_bar(stat='identity',position = 'dodge')+
  # #facet_wrap(~message)+
  # labs(x='Number of Hours Elapsed between Notification Receipt and Click', y= 'Percentage %')+
  # scale_y_continuous(labels = scales::percent)

A larger number of users clicked message 1 compared to messages 2 & 0 within the 1st hour.
This could be because of the phrasing/syntax of message 1.

For each message type, how does the click through rate change with time?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(date_click = lubridate::as_date(datetime_click)) %>%
  group_by(date_click,message) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  mutate(datetime = ymd_hms(datetime)) %>%
  mutate(datetime_click = ymd_hms(datetime_click)) %>%
  mutate(date_click = lubridate::as_date(datetime)) %>%
  group_by(date_click,message) %>%
  summarise(total=n())) %>% ungroup() %>%
  arrange((date_click)) %>%
  mutate(ctr = n/total) %>%
  ggplot(aes(x=as.factor(date_click),y=ctr,fill=as.factor(message),label = scales::percent(round(ctr,4))))+geom_bar(stat='identity',position='dodge')+ labs(x='Date',y='CTR')+ scale_fill_discrete(name = "Notification Type")+
  scale_y_continuous(labels = scales::percent)+
  geom_text(position = position_dodge(width = .9),    # move to center of bars
              vjust = 0.5,# nudge above top of bar
              size = 3,
            hjust = 1.5
            ) + coord_flip()

## Joining, by = c("date_click", "message")

The click through rates for message 1 remain high between 1st till 4th of September, and slightly decrease after that. (Tuesday to Friday)
Message 0 experiences a peak click through rate on 5th of September.(Saturday)
Message 2’s click through rates peak on 3rd September & 5th September.(Thursday & Saturday)
Understanding the messaging syntax and wordings would give us a better idea of why we see peak behaviour for message 1 during week days. The messaging could be to the point. The other messages might have been too wordy.

For each message type, how does the click through rate change by browser?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  group_by(message,browser) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  group_by(message,browser) %>%
  summarise(total=n())) %>%
  mutate(`% CTR` = n/total)

## Joining, by = c("message", "browser")

The click through rates are higher for messages that have been sent to Firefox users.

How significant is this difference?

log_data_intermediate %>%   filter(!is.na(datetime_click)) %>%
  filter(!is.na(datetime)) %>%
  group_by(browser) %>%
  summarise(n=n()) %>%
  inner_join(log_data_intermediate %>%
  filter(!is.na(datetime)) %>%
  group_by(browser) %>%
  summarise(total=n()))

## Joining, by = "browser"

prop.test(x = c(2794, 3590), n = c(166584,333416),alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2794, 3590) out of c(166584, 333416)
## X-squared = 317.31, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.005405284 1.000000000
## sample estimates:
##     prop 1     prop 2 
## 0.01677232 0.01076733

The difference in click through rates via Firefox and Chrome is significant as evidenced by the p-value<0.05 of the proportion test.
This could be because of how the notifications are served on Firefox and Chrome. Deep diving into the UI of the notifications for Firefox and Chrome would give us a better idea about the reason/s for this difference.

Deep Dive EDA

How does the click through rate change by notification type? Are these differences significant?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,message) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(message) %>%
  summarise(total=n())) %>%
  arrange(desc(message)) %>%
  mutate(percentage_ = n/total)

## Joining, by = "message"

Message type 1 had the highest click through rate at 1.4 %
This is followed by messages 2 & 0

How significantly higher is the proportion of clicks to total notifications for message type 1?

prop.test(x = c(2380, 2180), n = c(167055,166744),alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2380, 2180) out of c(167055, 166744)
## X-squared = 8.4328, df = 1, p-value = 0.001843
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.0005059651 1.0000000000
## sample estimates:
##     prop 1     prop 2 
## 0.01424680 0.01307393

prop.test(x = c(2380, 1824), n = c(167055,166201),alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2380, 1824) out of c(167055, 166201)
## X-squared = 71.353, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.00263042 1.00000000
## sample estimates:
##     prop 1     prop 2 
## 0.01424680 0.01097466

Given the p-values<0.05, this shows that there is enough evidence to prove that the click through rates of message 1 is significantly higher than that of 0 & 2, and do not occur due to randomness.

Which messages were the top performers for each language?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,message,language) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(message,language) %>%
  summarise(total=n())) %>%
  arrange(desc(language)) %>%
  mutate(percentage_ = n/total) %>%
  filter(clicked == 'Clicked') %>%
  arrange(desc(percentage_)) -> language_message_ctr

## Joining, by = c("message", "language")

language_message_ctr %>% ungroup() %>% 
  arrange_(~ desc(percentage_)) %>%
  group_by_(~ language) %>%
  do(head(., n = 2))

## Warning: arrange_() is deprecated. 
## Please use arrange() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with arrange() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.

## Warning: group_by_() is deprecated. 
## Please use group_by() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.

Message 1 consistently emerges as the top performer across all languages except for French & Chinese.
For French and Chinese users, Message 2 & Message 0 emerge as top performers.

Which language had the least click through rate?

log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(clicked,language) %>%
  summarise(n=n()) %>% ungroup() %>%
  inner_join(log_data_intermediate %>% 
  mutate(clicked = ifelse(is.na(datetime_click),' Not Clicked','Clicked')) %>%
  filter(!is.na(datetime)) %>%
  group_by(language) %>%
  summarise(total=n())) %>%
  arrange(desc(language)) %>%
  mutate(percentage_ = n/total) %>%
  ungroup() %>%
  filter(clicked == ' Not Clicked')  %>%
  arrange(desc(percentage_))

## Joining, by = "language"

Close to 100 % of notifications in Spanish & Russian did not get clicked.
This might have been due to translation issues but only deep diving into the message syntax/grammar & consumer feedback can support/reject our hypothesis.

Estimation of the number of installations if notifications were sent to all users

Before we proceed with the problem its good to revisit what we have so far

We have assumed that the sample (500,054 users) comprises of 1% of users.
Each of these users are independent of each other.
We can assume that the behaviour of the population would be similar to the sample as the sample is selected randomly.
This estimation is for a week of notifications.
The process used to send notifications will be the same as the sample. There are no confounding variables that are taken into account, that affect the conversion and clicks.
1.3% of the sample size clicked the notifications that were sent to them.
Out of this 1.3% we have to assume that 50% of them actually installed.

We need to estimate the proportion of the population, that will click the link in the notification & the number of installs that would be generated because of this

Here, we try to estimate the 95% confidence interval of the proportion of users that will click the notification.

sample_size = 500054
proportion_estimate = 0.01287461

standard_error = sqrt(proportion_estimate*(1-proportion_estimate))/ sqrt(sample_size)

margin_of_error = qnorm(0.975)*standard_error

lower_ = proportion_estimate - margin_of_error

upper_ = proportion_estimate + margin_of_error

print(lower_)

## [1] 0.01256215

print(upper_)

## [1] 0.01318707

At 95% confidence interval , between 1.26% and 1.32% of the larger consumer base would click the notifications that will be sent over the week.
Given that the total consumer base is 50,005,400 the clicks would then range from 628,175 to 659,425 at 95% confidence interval.
This would then translate to 314,088 to 329,713 installations achieved via notifications for a week. This is because we can assume 50% of clicks to translate to installations.
Thus, for the given time period and data, it will not be possible to attain > 1 million installations.

Conclusions

Which message should we send if we are sending to all users?

Message 1 for all languages except Chinese & French
Message 0 & Message 2 respectively for Chinese & French users respectively.
Apart from doing the above, exploring the pain points experienced by Spanish users would be essential.

Our translation manager thinks that one or several of the translations are crappy and look like a 4-year old could have written that. Can you confirm that?

Due to all notifications not being clicked, Spanish might not have been translated accurately.

Our designer is interested if there are differences between browsers, because the mechanisms for displaying notifications for these differ.

There is significant evidence that FireFox would have a higher proportion of clicks as compared to Chrome if the notifications were sent to the entire consumer base.
To understand the reasons for this, we have to dive deeper into the mechanisms (syntax/UI/positioning of the messages) of the messaging on Firefox and Chrome platforms.

Our Head of Business Development wants to decide if we actually should send out notifications to all users. He expects that one million or more installations should be worth it. Estimate the number of installations if we send out notifications to all users (assume that 50% of users that click the link included in the notifications will install Adblock Browser)

If we sent out notifications to all users, we would attain between 314,088 to 329,713 installations with a high statistical degree of confidence.
This estimate is assuming that users are independent of each other and the sample is chosen at random.

Given that the users are getting their notifications on a desktop,news about a new browser for mobile platforms might not be attended to as this requires an extra step to install them on their mobile devices. As an alternative, sending an email notification would be better.These email notifications can then be attended to via mobile devices and hence the browser can be downloaded directly.

Considering user segments based on early vs late adoption of new features, can help target the notifications to the right audiences. Based on feedback from early adopters(if the product is a beta version), the notifications and the product can be changed to meet the demands of a wider audience.

Questions/Concerns that should be cleared before proceeding with the analysis.

Understanding the product better
- Who is the target of our new product?
- What specific need of these target consumers will the product satisfy?
- Which specific benefits should the product have?
The number of notifications sent per day reduces between the date period 2015-09-01 & 2015-09-07. If our assumption about choosing 1% of daily users is right, then this would mean that the active users are going down.Thorough discussion with the product manager/data scientist would clear this confusion up.
With the click through rate for a week’s notification period being < 2%, it is essential to confirm this with the product manager/data science lead/data scientist who will have more experience with the product/consumer behaviour. If this is not normal,we should investigate confounding issues that might have lead to this. For example :
- Did the introduction of notifications reduce the performance/affect the performance of the platform in anyway?
- Are there any competitors in the market that have introduced a similar product?
- Are the notifications intuitive to understand?
- Are there any other mediums/channels through which users are made aware of the browser? This could be blog coverage, word of mouth, advertising,social media etc.
- Are there better ways to serve notifications to users?
The assumption that each IP address refers to a single user is wrong. It is possible for individuals to own more than one desktop device(office and home computer). Thus, a more robust representation of user identification is required. A converted user might install the browser on multiple mobile devices. Thus the conversion estimates might have been underestimated.
Our analysis assumes independence of data points. But there will be repeated users in the rest of the consumer base per day (the 99%)
Considering installations due to external factors such as blog coverage,advertising, word of mouth can help assess the effectiveness of notifications.
Our analysis does not assess the effectiveness of the notifications as was required by the task. To assess this, we need to compare the results we have, against the clicks and installs generated without the use of notifications. This would entail using two groups of random users. One group would be exposed to notifications and the others would be left as is (A/B testing). Installation behaviour in both the groups can then be compared against each other.
Users are not necessarily independent of each other. One might inform others about the new product release via mass communication tools such as social media.