The dataset is retrieved from Kaggle. This data set contains the results of a mock email campaign.
R is a free programming language that is dedicated for statisticians and statistical analysis. R also supports building statistical models and machine learning models.
R is an interpreted programming language, meaning the code is not compiled and it is executed line by line. R can be used through a command-line interface, through writing R scripts or through the use of notebooks.
Although R presents many advantages when it comes to statistical modeling and data analysis, it is not a scalable programming language.
In this notebook, we will explore the following:
library('ggChernoff')
face <- chernoffGrob(x = 0.5, y = 0.5, size = 1000, colour = "black", fill = "yellow",
alpha = 1, smile = 1, brow = NA, nose = FALSE)
grid::grid.newpage()
grid::grid.draw(face)
Packages (or libraries) in R are codes written by the maintainers of R or the R community to extend the functionality of R language. These packages often used to simplify the process of reading, cleaning and analyzing the data. Some packages are also used to generate charts and visualizations. Sometimes, these packages are written in C, C++ or Python and have an R interface so we can use them in R.
Here are a few of popular R packages:
filter to filter data in the dataframe).pivot_longer or pivot_wider functions to pivot the data before we do any analysis. We can also use tidyr replace missing data in a few different ways.str_length and str_interpTo install a package, we can run the following code and the package will be installed from the CRAN repository.
install.packages('dplyr')
In order to use the functions in a package, we need to load the package. We can do that by typing the following code.
library(ggplot2)
library(scales)
library(dplyr)
It is optional to surround the library name with quotations.
The data is retrieved from Kaggle as a zip file. The zip file contains three separate files.
To load the data, we can use read.csv funciton.
userbase <- read.csv('mock_emails/userbase.csv')
sent_emails <- read.csv('mock_emails/sent_emails.csv')
responded <- read.csv('mock_emails/responded.csv')
print("Summary of userbase data")
summary(userbase)
print("Summary of sent_emails data")
summary(sent_emails)
print("Summary of respnded data")
summary(responded)
## [1] "Summary of userbase data"
## Customer_ID Gender Type Email_Address
## Min. : 1 Length:500000 Length:500000 Length:500000
## 1st Qu.:125001 Class :character Class :character Class :character
## Median :250000 Mode :character Mode :character Mode :character
## Mean :250000
## 3rd Qu.:375000
## Max. :500000
## Age Tenure
## Min. :18.00 Min. : 0.00
## 1st Qu.:27.00 1st Qu.:11.00
## Median :31.00 Median :14.00
## Mean :31.62 Mean :16.19
## 3rd Qu.:36.00 3rd Qu.:22.00
## Max. :63.00 Max. :38.00
## [1] "Summary of sent_emails data"
## Sent_Date Customer_ID SubjectLine_ID
## Length:2476354 Min. : 1 Min. :1.000
## Class :character 1st Qu.:124883 1st Qu.:1.000
## Mode :character Median :249815 Median :2.000
## Mean :249894 Mean :1.999
## 3rd Qu.:374998 3rd Qu.:3.000
## Max. :500000 Max. :3.000
## [1] "Summary of respnded data"
## Responded_Date Customer_ID SubjectLine_ID
## Length:378208 Min. : 1 Min. :1.000
## Class :character 1st Qu.:125161 1st Qu.:1.000
## Mode :character Median :250326 Median :2.000
## Mean :250108 Mean :1.961
## 3rd Qu.:375056 3rd Qu.:3.000
## Max. :500000 Max. :3.000
The summary of the sent_emails & responded data shows that the date columns (Sent_Date & Respnded_Date respectivly) were parsed as character columns, however, they’re date columns written in ISO 8601 format. As an alternative, we can use the read_csv function from readr library.
library('readr')
library(ggplot2)
library(scales)
library(dplyr)
userbase <- read_csv('mock_emails/userbase.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Customer_ID = col_double(),
## Gender = col_character(),
## Type = col_character(),
## Email_Address = col_character(),
## Age = col_double(),
## Tenure = col_double()
## )
sent_emails <- read_csv('mock_emails/sent_emails.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Sent_Date = col_date(format = ""),
## Customer_ID = col_double(),
## SubjectLine_ID = col_double()
## )
responded <- read_csv('mock_emails/responded.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Responded_Date = col_date(format = ""),
## Customer_ID = col_double(),
## SubjectLine_ID = col_double()
## )
print("Summary of userbase data")
summary(userbase)
print("Summary of sent_emails data")
summary(sent_emails)
print("Summary of respnded data")
summary(responded)
## [1] "Summary of userbase data"
## Customer_ID Gender Type Email_Address
## Min. : 1 Length:500000 Length:500000 Length:500000
## 1st Qu.:125001 Class :character Class :character Class :character
## Median :250000 Mode :character Mode :character Mode :character
## Mean :250000
## 3rd Qu.:375000
## Max. :500000
## Age Tenure
## Min. :18.00 Min. : 0.00
## 1st Qu.:27.00 1st Qu.:11.00
## Median :31.00 Median :14.00
## Mean :31.62 Mean :16.19
## 3rd Qu.:36.00 3rd Qu.:22.00
## Max. :63.00 Max. :38.00
## [1] "Summary of sent_emails data"
## Sent_Date Customer_ID SubjectLine_ID
## Min. :2016-01-01 Min. : 1 Min. :1.000
## 1st Qu.:2016-01-22 1st Qu.:124883 1st Qu.:1.000
## Median :2016-02-13 Median :249815 Median :2.000
## Mean :2016-02-13 Mean :249894 Mean :1.999
## 3rd Qu.:2016-03-06 3rd Qu.:374998 3rd Qu.:3.000
## Max. :2016-03-27 Max. :500000 Max. :3.000
## [1] "Summary of respnded data"
## Responded_Date Customer_ID SubjectLine_ID
## Min. :2016-01-01 Min. : 1 Min. :1.000
## 1st Qu.:2016-01-23 1st Qu.:125161 1st Qu.:1.000
## Median :2016-02-14 Median :250326 Median :2.000
## Mean :2016-02-13 Mean :250108 Mean :1.961
## 3rd Qu.:2016-03-07 3rd Qu.:375056 3rd Qu.:3.000
## Max. :2016-04-03 Max. :500000 Max. :3.000
As clearly demonstrated here, when function read_csv from readr library was used the date columns were loaded as date values. This saves us the effort of having to cast the values to the correct data types.
Check for missing values in any of the dataframes
any(is.na(userbase) == TRUE)
any(is.na(sent_emails) == TRUE)
any(is.na(responded) == TRUE)
## [1] FALSE
## [1] FALSE
## [1] FALSE
To see a sample of the data, we can use head() & tail() functions.
head(userbase, n=5)
## # A tibble: 5 x 6
## Customer_ID Gender Type Email_Address Age Tenure
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 M C Mph@msn.com 38 12
## 2 2 F B SophcYN7381qMe8lRqK@aol.com 37 23
## 3 3 F C xE@msn.com 32 15
## 4 4 M C WduR0o0@comcast.net 31 24
## 5 5 M C walHI9cmWJIBKvy0@aol.com 35 13
tail(userbase, n=5)
## # A tibble: 5 x 6
## Customer_ID Gender Type Email_Address Age Tenure
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 499996 M B iMt8ckJsdR@msn.com 25 12
## 2 499997 F C z58KHLR7VtjRZYZh@gmail.com 30 9
## 3 499998 M B PGqUn8eT9A4lz7geMZ@comcast.net 30 11
## 4 499999 M C QkI8AH@hotmail.com 33 11
## 5 500000 F B 7NYSY@yahoo.com 37 13
head(sent_emails, n=5)
## # A tibble: 5 x 3
## Sent_Date Customer_ID SubjectLine_ID
## <date> <dbl> <dbl>
## 1 2016-01-28 1413 2
## 2 2016-03-02 83889 2
## 3 2016-03-09 457832 3
## 4 2016-01-20 127772 1
## 5 2016-02-03 192123 3
tail(sent_emails, n=5)
## # A tibble: 5 x 3
## Sent_Date Customer_ID SubjectLine_ID
## <date> <dbl> <dbl>
## 1 2016-02-22 460565 3
## 2 2016-02-25 360823 2
## 3 2016-03-02 443177 2
## 4 2016-03-12 179200 3
## 5 2016-01-17 68532 2
head(responded, n=5)
## # A tibble: 5 x 3
## Responded_Date Customer_ID SubjectLine_ID
## <date> <dbl> <dbl>
## 1 2016-01-25 217102 3
## 2 2016-03-26 366875 2
## 3 2016-03-17 201561 2
## 4 2016-01-11 395345 3
## 5 2016-01-11 427539 3
tail(responded, n=5)
## # A tibble: 5 x 3
## Responded_Date Customer_ID SubjectLine_ID
## <date> <dbl> <dbl>
## 1 2016-01-09 43156 1
## 2 2016-03-16 397330 2
## 3 2016-02-25 377520 1
## 4 2016-02-04 146008 2
## 5 2016-02-09 374971 3
The gender data is saved as M for male and F for female, while the type column indicate the customer types using C for consumer & B for business. We can replace these values with clearer names to get a better view of the data when we are reporting on it.
Furthermore, the user email addresses are masked, the only unmasked part is the domain name. Therefore, we can split the email column into two columns, one will contain the masked part of the email, and another will contain the domain name. Having the domain names of the customers can then help in deciding if we can send future marketing campaigns through the ads service of the email domain providers rather than sending out emails.
library(tidyr)
userbase$Gender[userbase$Gender=='M'] <- 'Male'
userbase$Gender[userbase$Gender=='F'] <- 'Female'
userbase$Type[userbase$Type=='C'] <- 'Consumer'
userbase$Type[userbase$Type=='B'] <- 'Business'
userbase <- separate(userbase, Email_Address, into=c('email', 'domain'), sep="@")
head(userbase)
## # A tibble: 6 x 7
## Customer_ID Gender Type email domain Age Tenure
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Male Consumer Mph msn.com 38 12
## 2 2 Female Business SophcYN7381qMe8lRqK aol.com 37 23
## 3 3 Female Consumer xE msn.com 32 15
## 4 4 Male Consumer WduR0o0 comcast.net 31 24
## 5 5 Male Consumer walHI9cmWJIBKvy0 aol.com 35 13
## 6 6 Female Business OoececBAnkd yahoo.com 38 23
In order to be able to analyze the relation between sent emails and the responses received from users, we need to join the sent_emails dataframe to the responded dataframe. The common values between both dataframes are Customer_ID and SubjectLine_ID. Ho
joined_emails <- merge(x=sent_emails, y=responded, all.x=TRUE)
joined_emails$days_till_response <- (joined_emails$Responded_Date - joined_emails$Sent_Date)
head(joined_emails)
## Customer_ID SubjectLine_ID Sent_Date Responded_Date days_till_response
## 1 1 1 2016-02-28 2016-02-28 0 days
## 2 1 2 2016-02-28 2016-01-24 -35 days
## 3 1 2 2016-01-24 2016-01-24 0 days
## 4 1 2 2016-01-12 2016-01-24 12 days
## 5 1 3 2016-01-19 <NA> NA days
## 6 1 3 2016-03-10 <NA> NA days
After looking at the values, there seems to be emails that have Sent_Date after Responded_Date. Upon closer examination of the data (in data analysis section), it is clear that the same emails can be sent multiple times. To remove those records, we will need to records that have Responded_Date >= to Sent_Date, or records that have NA Responded_Date.
joined_emails2 <- filter(joined_emails, is.na(joined_emails$Responded_Date) | (joined_emails$Responded_Date >= joined_emails$Sent_Date))
head(joined_emails2)
## Customer_ID SubjectLine_ID Sent_Date Responded_Date days_till_response
## 1 1 1 2016-02-28 2016-02-28 0 days
## 2 1 2 2016-01-24 2016-01-24 0 days
## 3 1 2 2016-01-12 2016-01-24 12 days
## 4 1 3 2016-01-19 <NA> NA days
## 5 1 3 2016-03-10 <NA> NA days
## 6 1 3 2016-03-14 <NA> NA days
The data analysis is divided into three parts:
Based on the charts below, we can conclude the following:
# Here, we group the userbase data by gender and display the count. The same process will be nominal variable
group_by(userbase, Gender) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
ggplot(aes(x=Gender, y=customers, color=Gender, fill=Gender)) +
geom_bar(stat="Identity", show.legend = TRUE) +
ggtitle("Gender Distribution") +
geom_text(aes(label=customers), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
position = position_dodge(0.9), size=3.5)
group_by(userbase, Type) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
ggplot(aes(x=Type, y=customers, color=Type, fill=Type)) +
geom_bar(stat="Identity", show.legend = TRUE) +
ggtitle("Customer Type Distribution") +
geom_text(aes(label=customers), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
position = position_dodge(0.9), size=3.5)
group_by(userbase, domain) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
ggplot(aes(x=domain, y=customers, color=domain, fill=domain)) +
geom_bar(stat="identity", show.legend = TRUE) +
ggtitle("Customer Email Domain Distribution") +
geom_text(aes(label=customers), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) +
geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
position = position_dodge(0.9), size=3.5)
group_by(userbase, Gender, Type) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
arrange(desc(Gender), desc(Type)) %>%
ggplot(aes(x=Gender, y=customers, color=Type, fill=Type)) +
geom_bar(stat="identity", show.legend = TRUE) +
ggtitle("Customer Type per Gender") +
geom_text(aes(y=customers, label=customers), vjust=1.6, color="white",
position = 'stack', size=3.5) +
geom_text(aes(y=customers, label=label_percent()(customers_percentage)), vjust=3.6, color="white",
position = 'stack', size=3.5)
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
After analyzing the numeric variables, we can conclude the following:
library(stringr)
group_by(userbase, Age) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
ggplot(aes(x=Age, y=customers)) +
geom_bar(stat="identity", show.legend = TRUE) +
ggtitle("Age Distribution") +
geom_text(aes(label=customers), vjust=0.5, hjust=1.5, color="white",
position = position_dodge(0.9), size=2.5, angle=90)
group_by(userbase, Age, Gender) %>%
ggplot(aes(x=Age, color=Gender, fill=Gender)) +
geom_histogram(position="identity", show.legend = TRUE, alpha=0.2, binwidth = 1) +
ggtitle("Age Distribution per Gender")
group_by(userbase, Tenure) %>%
summarise(customers=length(Customer_ID),
customers_percentage=(length(Customer_ID)/500000)) %>%
ggplot(aes(x=Tenure, y=customers)) +
geom_bar(stat="identity", show.legend = TRUE) +
ggtitle("Age Distribution") +
geom_text(aes(label=customers), vjust=0.5, hjust=1.5, color="white",
position = position_dodge(0.9), size=2.5, angle=90)
group_by(userbase, Tenure, Gender) %>%
ggplot(aes(x=Tenure, color=Gender, fill=Gender)) +
geom_histogram(position="identity", show.legend = TRUE, alpha=0.2, binwidth = 1) +
ggtitle("Age Distribution per Gender")
cor_age_tenure <- cor(userbase$Age, userbase$Tenure)
print(str_interp("The correlation coefficeint between customer's age & customer's tenure is: ${cor_age_tenure}"))
## [1] "The correlation coefficeint between customer's age & customer's tenure is: 0.00105923035031279"
userbase %>% ggplot(aes(x=Age, y=Tenure, shape=Type, color=Gender)) + geom_point() + geom_smooth(method=lm, fullrange=TRUE)
After analysing the emails sent and responses data, we can conclude the following:
group_by(joined_emails2, SubjectLine_ID) %>%
summarise(emails = length(Customer_ID)) %>%
ggplot(aes(x=SubjectLine_ID, y=emails)) +
geom_bar(stat="Identity") +
ggtitle("Emails sent per campeign") +
geom_text(aes(label=emails), vjust=2, color="white",
position = position_dodge(0.9), size=3.5)
group_by(joined_emails2, SubjectLine_ID) %>%
summarise(ratio=length(Customer_ID)/length(unique(Customer_ID))) %>%
ggplot(aes(x=SubjectLine_ID, y=ratio)) +
geom_bar(stat="Identity") +
ggtitle("Emails sent per customer per Campeign") +
geom_text(aes(label=ratio), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5)
filter(joined_emails2, !is.na(days_till_response)) %>%
ggplot(aes(x=days_till_response)) +
geom_histogram(binwidth=7, show.legend = TRUE, position="identity") +
ggtitle("Days till response from customer")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
filter(joined_emails2, !is.na(days_till_response)) %>%
group_by(SubjectLine_ID) %>%
summarise(emails = length(Customer_ID)) %>%
ggplot(aes(x=SubjectLine_ID, y=emails)) +
geom_bar(stat="Identity") +
ggtitle("Responses per Campaign") +
geom_text(aes(label=emails), vjust=2, color="white",
position = position_dodge(0.9), size=3.5)