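DATA DICTIONARY :
The main data frame (ab_data.csv) has 294478 rows and 5 columns; the sixth column described below, country, comes from a separate file and is merged in later. Apart from the timestamp, the variables are categorical or discrete rather than continuous.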
This data frame contains the following columns:
user_id :
A unique user_id is allotted to every user who opens the advertisement link. A user_id repeats in the dataset if the same user opens the link more than once.
country :
The country the user belongs to. There are only three countries in the data: United States (US), Canada (CA) and United Kingdom (UK).
group :
The users are divided into two groups: a control group and a treatment group.
landing_page :
Users in the control group are meant to see old_page and users in the treatment group are meant to see new_page. However, the split is not clean: some control users were exposed to new_page and some treatment users were exposed to old_page.
timestamp :
The time at which the user visited the link.
converted :
Dependent variable. 1 signifies a conversion and 0 no conversion.
Importing the library and the dataset
library(tidyverse)
dataset = read.csv("ab_data.csv")
Inspecting the structure and checking for any NA values in the data
str(dataset)
## 'data.frame': 294478 obs. of 5 variables:
## $ user_id : int 851104 804228 661590 853541 864975 936923 679687 719014 817355 839785 ...
## $ timestamp : chr "2017-01-21 22:11:48.556739" "2017-01-12 08:01:45.159739" "2017-01-11 16:55:06.154213" "2017-01-08 18:28:03.143765" ...
## $ group : chr "control" "control" "treatment" "treatment" ...
## $ landing_page: chr "old_page" "old_page" "new_page" "new_page" ...
## $ converted : int 0 0 0 0 1 0 1 0 1 1 ...
sum(is.na(dataset))
## [1] 0
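The single total above says nothing about where missing values would sit if there were any. A minimal sketch of a per-column check follows; the toy data frame and its values are made up for illustration and are not part of ab_data.csv.

```r
# Toy data with deliberately missing values (hypothetical, not from ab_data.csv)
toy <- data.frame(
  user_id   = c(1, 2, NA),
  group     = c("control", NA, "treatment"),
  converted = c(0, 1, 1)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE entries in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# The overall total, as used in the text
total_na <- sum(is.na(toy))
print(total_na)
```

On the real data the same call would be colSums(is.na(dataset)).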
Plotting a stacked bar plot to check whether all users in the control group were exposed only to old_page and all users in the treatment group only to new_page, which was the original idea
ggplot(data = dataset)+ aes(group,fill=landing_page) + geom_bar()
As the bar graph shows, the great majority of the control group was exposed only to old_page and only a small portion was exposed to new_page; the same holds, in reverse, for the treatment group.
Since the discrepancy is so small, I am going to assume here that the control group represents the old_page group and the treatment group represents the new_page group. We could remove the discrepancy with a simple filter, but with so few mismatched rows the rates should barely change.
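For reference, the filter mentioned above could look roughly like this. It is a sketch in base R on a small made-up data frame; the column names match the data dictionary, but the rows are hypothetical.

```r
# Toy data standing in for ab_data.csv (rows are made up for illustration)
toy <- data.frame(
  user_id      = 1:6,
  group        = c("control", "control", "treatment",
                   "treatment", "control", "treatment"),
  landing_page = c("old_page", "new_page", "new_page",
                   "old_page", "old_page", "new_page"),
  stringsAsFactors = FALSE
)

# Cross-tabulate group against landing_page to count the mismatches
mismatch_table <- table(toy$group, toy$landing_page)
print(mismatch_table)

# A row is consistent when control saw old_page or treatment saw new_page
consistent <- (toy$group == "control"   & toy$landing_page == "old_page") |
              (toy$group == "treatment" & toy$landing_page == "new_page")

# Keep only the consistent rows
toy_clean <- toy[consistent, ]
print(nrow(toy_clean))
```

Applied to the real data, the same logical vector built from dataset$group and dataset$landing_page would drop the mismatched rows before any rate is computed.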
So now let's see which page performed better. To compare the pages, we will simply use the click-through rate (CTR), a convenient metric for this kind of comparison. CTR is the ratio of users who click on a specific link to the total number of users who view a page, email, or advertisement; here the converted column plays the role of the click. With that in mind, I am going to use the simple subset function to find the CTR for both the control and the treatment group.
CTR_control = nrow(subset(dataset,group == "control" & converted == "1"))/nrow(subset(dataset,group == "control" ))
CTR_treatment= nrow(subset(dataset,group == "treatment" & converted == "1"))/nrow(subset(dataset,group == "treatment" ))
(CTR_control)
## [1] 0.1203992
(CTR_treatment)
## [1] 0.1189196
So the click-through rate is 12.04% for the control group and 11.89% for the treatment group.
So both groups seem to be converting at nearly the same rate.
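Because converted is coded 0/1, the rate per group is simply the mean of that column within each group. The sketch below shows this more compact form on made-up toy data; it assumes converted is still numeric at this point.

```r
# Toy data (hypothetical values, not taken from ab_data.csv)
toy <- data.frame(
  group     = c("control", "control", "control", "control",
                "treatment", "treatment", "treatment", "treatment"),
  converted = c(1, 0, 0, 0, 1, 1, 0, 0)
)

# The mean of a 0/1 column is the share of 1s, i.e. the conversion rate
rate_by_group <- tapply(toy$converted, toy$group, mean)
print(rate_by_group)
```

On the real data, tapply(dataset$converted, dataset$group, mean) should reproduce the two rates computed with subset.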
library(RColorBrewer)
#install.packages("RColorBrewer")
dataset$converted = factor(dataset$converted)
ggplot(data = dataset)+ aes(group,fill=converted) + geom_bar()
Hypothesis testing :
To check whether the change was well accepted, we build a simple hypothesis here.
H0 : old_page_CTR = new_page_CTR
H1 : old_page_CTR != new_page_CTR
Checking the hypothesis
Old_page_CTR = nrow(subset(dataset,landing_page == "old_page" & converted == "1"))/nrow(subset(dataset,landing_page == "old_page" ))
new_page_CTR= nrow(subset(dataset,landing_page == "new_page" & converted == "1"))/nrow(subset(dataset,landing_page == "new_page" ))
percentage_change = ((new_page_CTR-Old_page_CTR)/Old_page_CTR)*100
(Old_page_CTR)
## [1] 0.1204776
(new_page_CTR)
## [1] 0.1188408
(percentage_change)
## [1] -1.358588
The relative change from old_page_CTR to new_page_CTR comes to -1.36%, i.e. the new page converts about 1.36% less than the old page in relative terms.
This is too small a difference to draw a comparison from. So up to this point both landing pages are working at the same level.
We need to bring in more information before we can reject or retain the hypothesis.
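One common way to put a number on such a comparison is a two-proportion test. The sketch below uses prop.test from base R's stats package; the helper name two_prop_test and the toy counts are my own assumptions for illustration, not results from ab_data.csv. It also assumes converted is numeric 0/1, so a factor column would need converting first.

```r
# Hypothetical helper: two-sided test of equal conversion rates on two pages
two_prop_test <- function(converted, page) {
  successes <- tapply(converted, page, sum)     # conversions per page
  trials    <- tapply(converted, page, length)  # visits per page
  prop.test(x = as.vector(successes), n = as.vector(trials), correct = FALSE)
}

# Toy data: 1000 visits per page with made-up conversion counts
toy_page      <- rep(c("old_page", "new_page"), each = 1000)
toy_converted <- c(rep(c(1, 0), times = c(120, 880)),   # old_page
                   rep(c(1, 0), times = c(119, 881)))   # new_page

result <- two_prop_test(toy_converted, toy_page)
print(result$estimate)  # estimated rates, in alphabetical page order
print(result$p.value)   # a large p-value means no evidence against H0
```

A p-value above the chosen significance level (0.05 is a typical choice) would mean we fail to reject H0; it would not prove that the two pages are identical.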
Let's see if time plays an important role in the conversion rate. Importing the lubridate library, creating a new column date which holds the date on which the user visited the link, and creating a new column hour which holds the hour (0-23) at which the user visited the link.
library(lubridate)
dataset$date = date(dataset$timestamp)
dataset$hour = hour(dataset$timestamp)
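For readers without lubridate, the same two columns can be derived in base R. This is a sketch using two timestamps from the str output shown earlier; treating them as UTC is my assumption, since the source does not state a time zone.

```r
# Two timestamps copied from the str() output of the dataset
ts <- c("2017-01-21 22:11:48.556739", "2017-01-12 08:01:45.159739")

# %OS reads seconds including the fractional part; tz = "UTC" is an assumption
parsed <- as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

visit_date <- as.Date(parsed)                    # calendar date of the visit
visit_hour <- as.integer(format(parsed, "%H"))   # hour of day, 0-23

print(visit_date)
print(visit_hour)
```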
class(dataset$converted)
## [1] "factor"
Our dependent variable is in factor format. We need to convert it into integer format so that we can do arithmetic on it. Calling as.integer on a factor returns the internal level codes, which turns 0 into 1 and 1 into 2, so we subtract 1 from the whole column.
dataset$converted = as.integer(dataset$converted)
dataset$converted=dataset$converted-1
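The behaviour described above is easy to see on a tiny factor. The sketch also shows an alternative that goes through the labels instead of the level codes, which avoids the subtraction; the vector f is made up for illustration.

```r
# A small 0/1 factor, as produced by factor(dataset$converted)
f <- factor(c(0, 1, 1, 0))

# as.integer() returns the internal level codes, not the labels
codes <- as.integer(f)
print(codes)           # 1 2 2 1

# Subtracting 1 recovers 0/1 only because the levels happen to be "0" and "1"
print(codes - 1L)      # 0 1 1 0

# Converting via the labels works for any numeric labels
values <- as.integer(as.character(f))
print(values)          # 0 1 1 0
```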
str(dataset)
## 'data.frame': 294478 obs. of 7 variables:
## $ user_id : int 851104 804228 661590 853541 864975 936923 679687 719014 817355 839785 ...
## $ timestamp : chr "2017-01-21 22:11:48.556739" "2017-01-12 08:01:45.159739" "2017-01-11 16:55:06.154213" "2017-01-08 18:28:03.143765" ...
## $ group : chr "control" "control" "treatment" "treatment" ...
## $ landing_page: chr "old_page" "old_page" "new_page" "new_page" ...
## $ converted : num 0 0 0 0 1 0 1 0 1 1 ...
## $ date : Date, format: "2017-01-21" "2017-01-12" ...
## $ hour : int 22 8 16 18 1 15 3 1 17 18 ...
Checking if there is any strong relation between the hour of the day and total conversion.
Creating a new object called timeconversion that holds the sum of all conversions in each hour.
timeconversion = dataset %>%
group_by(hour) %>%
dplyr::summarise(totalconversion = sum(converted))%>%
arrange(desc(totalconversion))
(timeconversion)
## # A tibble: 24 x 2
## hour totalconversion
## <int> <dbl>
## 1 18 1552
## 2 12 1525
## 3 9 1522
## 4 11 1522
## 5 23 1517
## 6 17 1513
## 7 5 1502
## 8 6 1481
## 9 21 1476
## 10 22 1471
## # ... with 14 more rows
Plotting a line graph of total conversion against hour
str(timeconversion)
## tibble [24 x 2] (S3: tbl_df/tbl/data.frame)
## $ hour : int [1:24] 18 12 9 11 23 17 5 6 21 22 ...
## $ totalconversion: num [1:24] 1552 1525 1522 1522 1517 ...
ggplot(timeconversion) + aes(x = hour, y = totalconversion, label = totalconversion) + geom_line(color = 'blue') +
theme(panel.grid.major = element_line(colour = "yellow"), panel.background = element_rect(fill = "white", colour = "grey50")) + geom_text()
Finding :
1. The conversion is lowest at 2-3 AM.
2. It is definitely not recommended to send the link between 12 AM and 4 AM.
3. At 6 PM we get the highest number of conversions, maybe because this is the time when most working professionals and students are free from their work or studies and check their email.
4. After 6 PM the line falls sharply. Maybe the users get busy with their routine household work, so we again see very low conversion.
5. It is recommended to send the email at 6-7 PM, 12-1 PM or 9-10 AM to get the maximum conversion.
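The findings above rest on total conversions per hour, which also depend on how many users visited in that hour. A conversion rate per hour separates the two effects. The sketch below shows the idea on made-up toy data, in which the hour with the most conversions is not the hour with the best rate.

```r
# Toy visits (hypothetical): hour 2 has many visits, hour 18 has few
toy <- data.frame(
  hour      = c(rep(2, 8), rep(9, 5), rep(18, 2)),
  converted = c(1, 1, 0, 0, 0, 0, 0, 0,   # hour 2: 2 of 8
                1, 0, 0, 0, 0,            # hour 9: 1 of 5
                1, 0)                     # hour 18: 1 of 2
)

total  <- tapply(toy$converted, toy$hour, sum)     # conversions per hour
visits <- tapply(toy$converted, toy$hour, length)  # visits per hour

by_hour <- data.frame(
  hour   = as.integer(names(total)),
  total  = as.vector(total),
  visits = as.vector(visits)
)
by_hour$rate <- by_hour$total / by_hour$visits

# Sort by rate, highest first
by_hour <- by_hour[order(-by_hour$rate), ]
print(by_hour)
```

If the hourly rates on the real data turn out to be flat, the peaks in the totals would mostly reflect traffic volume rather than a better time to send the email.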
Checking if there is any correlation between the date and total conversion.
Creating a new object called dateconversion, which contains two columns: date and the total conversion on that date.
dateconversion = dataset %>%
group_by(date) %>%
dplyr::summarize(totalconversion = sum(converted))%>%
arrange(desc(totalconversion))
(dateconversion)
## # A tibble: 23 x 2
## date totalconversion
## <date> <dbl>
## 1 2017-01-23 1668
## 2 2017-01-17 1662
## 3 2017-01-18 1655
## 4 2017-01-14 1641
## 5 2017-01-21 1631
## 6 2017-01-12 1629
## 7 2017-01-06 1626
## 8 2017-01-08 1626
## 9 2017-01-10 1621
## 10 2017-01-16 1611
## # ... with 13 more rows
Plotting a line graph of total conversion against date
ggplot(dateconversion) + aes(x = date, y = totalconversion, label = totalconversion) + geom_line(size = 1, color = "darkblue") + labs(
x = "date",
y = "number of conversions") +
theme(panel.grid.major = element_line(colour = "green")) + geom_text()
Finding :
1. We see the lowest conversion on 2 January 2017, followed by 24 January 2017. Please note that this is not due to the date itself. It is only because data collection started and ended partway through those days, so we don’t get the whole picture for those dates.
2. Among the complete days, 13 January shows the lowest conversion. There might be some factor behind it, but since the difference is not that large, we can’t draw any conclusion.
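Partially covered days can also be flagged from the data rather than read off the plot. The sketch below marks a day as partial when its first observed hour is after 0 or its last observed hour is before 23; the timestamps are made up, and both the UTC time zone and the 0/23 rule are assumptions.

```r
# Toy timestamps (hypothetical): the first and last day are only partly covered
ts <- as.POSIXct(c("2017-01-02 13:42:05", "2017-01-02 23:59:10",
                   "2017-01-03 00:00:30", "2017-01-03 23:58:00",
                   "2017-01-04 00:01:00", "2017-01-04 09:15:00"),
                 tz = "UTC")

day  <- format(ts, "%Y-%m-%d")           # calendar day as text
hour <- as.integer(format(ts, "%H"))     # hour of day, 0-23

first_hour <- tapply(hour, day, min)     # earliest hour seen on each day
last_hour  <- tapply(hour, day, max)     # latest hour seen on each day

coverage <- data.frame(
  date       = as.Date(names(first_hour)),
  first_hour = as.vector(first_hour),
  last_hour  = as.vector(last_hour)
)

# Assumed rule: a full day has visits in hour 0 and in hour 23
coverage$partial <- coverage$first_hour > 0 | coverage$last_hour < 23
print(coverage)
```

Days flagged as partial could then be excluded before comparing daily totals.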
Let’s see if country has any role to play in conversion. Importing the countries dataset.
country= read.csv("countries.csv")
table(country$country)
##
## CA UK US
## 14499 72466 203619
Most of the users belong to the United States.
Merging the two datasets by user_id. Interestingly, the difference between the number of rows in the countries dataset and in our dataset gives the exact number of duplicate user_id entries (assuming the countries file lists each user once), because all the duplicates come from the same users visiting the link more than once.
dataset=dataset %>%
left_join(country,by= c("user_id"="user_id"))
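The row-count argument can be checked directly. From the outputs shown earlier, the countries table sums to 14499 + 72466 + 203619 = 290584 rows against 294478 rows in the dataset, a gap of 3894. The sketch below verifies the same logic on made-up toy tables, using base R's merge in place of left_join.

```r
# Toy tables (hypothetical): user 2 visits twice, user 3 three times
visits <- data.frame(
  user_id   = c(1, 2, 2, 3, 3, 3),
  converted = c(0, 1, 0, 0, 0, 1)
)
countries <- data.frame(
  user_id = c(1, 2, 3),
  country = c("US", "UK", "CA")
)

# Repeat rows: every occurrence of a user_id after its first one
n_repeat_rows <- sum(duplicated(visits$user_id))

# Row gap between the two tables; equal to the above only if
# countries holds exactly one row per user
row_gap <- nrow(visits) - nrow(countries)

# A left join (all.x = TRUE) keeps every visit and adds the country
merged <- merge(visits, countries, by = "user_id", all.x = TRUE)

print(c(n_repeat_rows = n_repeat_rows, row_gap = row_gap, rows = nrow(merged)))
```

If the join ever returns more rows than the visits table, the lookup table contains repeated user_id values and the row-count shortcut no longer holds.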
Arranging the columns
dataset = dataset[c(1,8,3,4,2,6,7,5)]
Checking if there is any correlation between the country and total conversion.
Using the group_by function to group the data first by country and then by landing_page, using the summarize function to calculate the sum of all conversions, and arranging the whole data frame in ascending order of total conversion.
countryconversion = dataset %>%
group_by(country,landing_page) %>%
dplyr::summarize(totalconversion = sum(converted))%>%
arrange(totalconversion)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
Plotting a stacked bar graph of total conversion by country, filled by type of landing page
ggplot(countryconversion)+aes(x=country,y=totalconversion,fill=landing_page)+geom_bar(stat="identity")
Finding :
1. Unsurprisingly, total conversion is highest in the US, since most of the emails were sent to the US.
2. As for the type of page, in all 3 countries both pages are performing nearly equally.
3. So the data so far are consistent with our null hypothesis.
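Since the countries differ a lot in size, comparing rates rather than totals makes the page comparison within each country easier to read. A minimal sketch on made-up toy data, assuming converted is numeric 0/1:

```r
# Toy data (hypothetical values)
toy <- data.frame(
  country      = c("US", "US", "US", "US", "US", "US", "US", "US",
                   "CA", "CA", "CA", "CA"),
  landing_page = c("old_page", "old_page", "old_page", "old_page",
                   "new_page", "new_page", "new_page", "new_page",
                   "old_page", "old_page", "new_page", "new_page"),
  converted    = c(1, 0, 0, 0,
                   1, 1, 0, 0,
                   1, 0,
                   0, 0)
)

# Mean of the 0/1 column per country and page = conversion rate
rates <- aggregate(converted ~ country + landing_page, data = toy, FUN = mean)
print(rates)
```

On the real data the call would be aggregate(converted ~ country + landing_page, data = dataset, FUN = mean).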
Using the aggregate function to group by country, then by landing page and then by hour, to see the total conversion in each category
group = aggregate (converted ~ country+landing_page+hour,dataset,sum)
(head(group))
## country landing_page hour converted
## 1 CA new_page 0 37
## 2 UK new_page 0 161
## 3 US new_page 0 534
## 4 CA old_page 0 37
## 5 UK old_page 0 175
## 6 US old_page 0 508
ggplot(group)+aes(x=country,y=converted,fill=hour)+geom_bar(stat="identity")
Conclusion :
1. We fail to reject the null hypothesis, since there isn’t much difference in total conversion between old_page and new_page.
2. It is recommended to send the email at 6-7 PM, 12-1 PM or 9-10 AM to get the maximum conversion.
3. Date doesn’t have much of a role to play in conversion.
4. So the conversion rate is more or less the same on both pages.