Description

The dataset is retrieved from Kaggle. This data set contains the results of a mock email campaign.

Introduction

R is a free programming language that is dedicated for statisticians and statistical analysis. R also supports building statistical models and machine learning models.

R is an interpreted programming language, meaning the code is not compiled and it is executed line by line. R can be used through a command-line interface, through writing R scripts or through the use of notebooks.

Although R presents many advantages when it comes to statistical modeling and data analysis, it is not a scalable programming language.

In this notebook, we will explore the following:

  • Popular library that are often used in R
  • Prepare a dataset for analysis
  • Analyze the dataset and build visualizations
library('ggChernoff')
face <- chernoffGrob(x = 0.5, y = 0.5, size = 1000, colour = "black", fill = "yellow",
  alpha = 1, smile = 1, brow = NA, nose = FALSE)
grid::grid.newpage()
grid::grid.draw(face)

Packages

Packages (or libraries) in R are codes written by the maintainers of R or the R community to extend the functionality of R language. These packages often used to simplify the process of reading, cleaning and analyzing the data. Some packages are also used to generate charts and visualizations. Sometimes, these packages are written in C, C++ or Python and have an R interface so we can use them in R.

Here are a few of popular R packages:

  • dplyr: dplyr is mainly used for data manipulation. It provides easy to use functions with familiar names to perform certain tasks (e.g. filter to filter data in the dataframe).
  • tidyr: tidyr is used to clean and shape the data before doing the analysis. For example, we can use pivot_longer or pivot_wider functions to pivot the data before we do any analysis. We can also use tidyr replace missing data in a few different ways.
  • readr: readr is a library used for loading data into R dataframes. It is mainly used for loading CSV files and parsing the data to the correct types
  • stringr: stringr is a library used for making string manipulation in R easier to do. It provides a set of pre-built functions like str_length and str_interp
  • ggplot2: ggplot2 is used to create plots and visualize the data during the analysis. It helps simplify the process of creating aesthetically pleasing plots.
  • sqldf: sqldf is a library that does data transformation and manipulation on dataframes using SQL syntax.

To install a package, we can run the following code and the package will be installed from the CRAN repository.

install.packages('dplyr')

In order to use the functions in a package, we need to load the package. We can do that by typing the following code.

library(ggplot2)
library(scales)
library(dplyr)

It is optional to surround the library name with quotations.

Data Preperation

The data is retrieved from Kaggle as a zip file. The zip file contains three separate files.

  • userbase.csv: This file contains information about the users that were contacted. The data in this file can be used to perform demographic analysis on the target audience of the marketing campeign
  • sent_emails.csv: This file contains the list of user IDs, the marketing email ID that was sent to the user and the date the emails were sent
  • respoded.csv: This file contains the IDs of the users who responded to the marketing emails, the ID of the marketing email they responded to and the date when the customer responded

To load the data, we can use read.csv funciton.

userbase <- read.csv('mock_emails/userbase.csv')
sent_emails <- read.csv('mock_emails/sent_emails.csv')
responded <- read.csv('mock_emails/responded.csv')
print("Summary of userbase data")
summary(userbase)
print("Summary of sent_emails data")
summary(sent_emails)
print("Summary of respnded data")
summary(responded)
## [1] "Summary of userbase data"
##   Customer_ID        Gender              Type           Email_Address     
##  Min.   :     1   Length:500000      Length:500000      Length:500000     
##  1st Qu.:125001   Class :character   Class :character   Class :character  
##  Median :250000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :250000                                                           
##  3rd Qu.:375000                                                           
##  Max.   :500000                                                           
##       Age            Tenure     
##  Min.   :18.00   Min.   : 0.00  
##  1st Qu.:27.00   1st Qu.:11.00  
##  Median :31.00   Median :14.00  
##  Mean   :31.62   Mean   :16.19  
##  3rd Qu.:36.00   3rd Qu.:22.00  
##  Max.   :63.00   Max.   :38.00  
## [1] "Summary of sent_emails data"
##   Sent_Date          Customer_ID     SubjectLine_ID 
##  Length:2476354     Min.   :     1   Min.   :1.000  
##  Class :character   1st Qu.:124883   1st Qu.:1.000  
##  Mode  :character   Median :249815   Median :2.000  
##                     Mean   :249894   Mean   :1.999  
##                     3rd Qu.:374998   3rd Qu.:3.000  
##                     Max.   :500000   Max.   :3.000  
## [1] "Summary of respnded data"
##  Responded_Date      Customer_ID     SubjectLine_ID 
##  Length:378208      Min.   :     1   Min.   :1.000  
##  Class :character   1st Qu.:125161   1st Qu.:1.000  
##  Mode  :character   Median :250326   Median :2.000  
##                     Mean   :250108   Mean   :1.961  
##                     3rd Qu.:375056   3rd Qu.:3.000  
##                     Max.   :500000   Max.   :3.000

The summary of the sent_emails & responded data shows that the date columns (Sent_Date & Respnded_Date respectivly) were parsed as character columns, however, they’re date columns written in ISO 8601 format. As an alternative, we can use the read_csv function from readr library.

library('readr')
library(ggplot2)
library(scales)
library(dplyr)
userbase <- read_csv('mock_emails/userbase.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Customer_ID = col_double(),
##   Gender = col_character(),
##   Type = col_character(),
##   Email_Address = col_character(),
##   Age = col_double(),
##   Tenure = col_double()
## )
sent_emails <- read_csv('mock_emails/sent_emails.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Sent_Date = col_date(format = ""),
##   Customer_ID = col_double(),
##   SubjectLine_ID = col_double()
## )
responded <- read_csv('mock_emails/responded.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Responded_Date = col_date(format = ""),
##   Customer_ID = col_double(),
##   SubjectLine_ID = col_double()
## )
print("Summary of userbase data")
summary(userbase)
print("Summary of sent_emails data")
summary(sent_emails)
print("Summary of respnded data")
summary(responded)
## [1] "Summary of userbase data"
##   Customer_ID        Gender              Type           Email_Address     
##  Min.   :     1   Length:500000      Length:500000      Length:500000     
##  1st Qu.:125001   Class :character   Class :character   Class :character  
##  Median :250000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :250000                                                           
##  3rd Qu.:375000                                                           
##  Max.   :500000                                                           
##       Age            Tenure     
##  Min.   :18.00   Min.   : 0.00  
##  1st Qu.:27.00   1st Qu.:11.00  
##  Median :31.00   Median :14.00  
##  Mean   :31.62   Mean   :16.19  
##  3rd Qu.:36.00   3rd Qu.:22.00  
##  Max.   :63.00   Max.   :38.00  
## [1] "Summary of sent_emails data"
##    Sent_Date           Customer_ID     SubjectLine_ID 
##  Min.   :2016-01-01   Min.   :     1   Min.   :1.000  
##  1st Qu.:2016-01-22   1st Qu.:124883   1st Qu.:1.000  
##  Median :2016-02-13   Median :249815   Median :2.000  
##  Mean   :2016-02-13   Mean   :249894   Mean   :1.999  
##  3rd Qu.:2016-03-06   3rd Qu.:374998   3rd Qu.:3.000  
##  Max.   :2016-03-27   Max.   :500000   Max.   :3.000  
## [1] "Summary of respnded data"
##  Responded_Date        Customer_ID     SubjectLine_ID 
##  Min.   :2016-01-01   Min.   :     1   Min.   :1.000  
##  1st Qu.:2016-01-23   1st Qu.:125161   1st Qu.:1.000  
##  Median :2016-02-14   Median :250326   Median :2.000  
##  Mean   :2016-02-13   Mean   :250108   Mean   :1.961  
##  3rd Qu.:2016-03-07   3rd Qu.:375056   3rd Qu.:3.000  
##  Max.   :2016-04-03   Max.   :500000   Max.   :3.000

As clearly demonstrated here, when function read_csv from readr library was used the date columns were loaded as date values. This saves us the effort of having to cast the values to the correct data types.

Check for missing values in any of the dataframes

any(is.na(userbase) == TRUE)
any(is.na(sent_emails) == TRUE)
any(is.na(responded) == TRUE)
## [1] FALSE
## [1] FALSE
## [1] FALSE

To see a sample of the data, we can use head() & tail() functions.

head(userbase, n=5)
## # A tibble: 5 x 6
##   Customer_ID Gender Type  Email_Address                 Age Tenure
##         <dbl> <chr>  <chr> <chr>                       <dbl>  <dbl>
## 1           1 M      C     Mph@msn.com                    38     12
## 2           2 F      B     SophcYN7381qMe8lRqK@aol.com    37     23
## 3           3 F      C     xE@msn.com                     32     15
## 4           4 M      C     WduR0o0@comcast.net            31     24
## 5           5 M      C     walHI9cmWJIBKvy0@aol.com       35     13
tail(userbase, n=5)
## # A tibble: 5 x 6
##   Customer_ID Gender Type  Email_Address                    Age Tenure
##         <dbl> <chr>  <chr> <chr>                          <dbl>  <dbl>
## 1      499996 M      B     iMt8ckJsdR@msn.com                25     12
## 2      499997 F      C     z58KHLR7VtjRZYZh@gmail.com        30      9
## 3      499998 M      B     PGqUn8eT9A4lz7geMZ@comcast.net    30     11
## 4      499999 M      C     QkI8AH@hotmail.com                33     11
## 5      500000 F      B     7NYSY@yahoo.com                   37     13
head(sent_emails, n=5)
## # A tibble: 5 x 3
##   Sent_Date  Customer_ID SubjectLine_ID
##   <date>           <dbl>          <dbl>
## 1 2016-01-28        1413              2
## 2 2016-03-02       83889              2
## 3 2016-03-09      457832              3
## 4 2016-01-20      127772              1
## 5 2016-02-03      192123              3
tail(sent_emails, n=5)
## # A tibble: 5 x 3
##   Sent_Date  Customer_ID SubjectLine_ID
##   <date>           <dbl>          <dbl>
## 1 2016-02-22      460565              3
## 2 2016-02-25      360823              2
## 3 2016-03-02      443177              2
## 4 2016-03-12      179200              3
## 5 2016-01-17       68532              2
head(responded, n=5)
## # A tibble: 5 x 3
##   Responded_Date Customer_ID SubjectLine_ID
##   <date>               <dbl>          <dbl>
## 1 2016-01-25          217102              3
## 2 2016-03-26          366875              2
## 3 2016-03-17          201561              2
## 4 2016-01-11          395345              3
## 5 2016-01-11          427539              3
tail(responded, n=5)
## # A tibble: 5 x 3
##   Responded_Date Customer_ID SubjectLine_ID
##   <date>               <dbl>          <dbl>
## 1 2016-01-09           43156              1
## 2 2016-03-16          397330              2
## 3 2016-02-25          377520              1
## 4 2016-02-04          146008              2
## 5 2016-02-09          374971              3

The gender data is saved as M for male and F for female, while the type column indicate the customer types using C for consumer & B for business. We can replace these values with clearer names to get a better view of the data when we are reporting on it.

Furthermore, the user email addresses are masked, the only unmasked part is the domain name. Therefore, we can split the email column into two columns, one will contain the masked part of the email, and another will contain the domain name. Having the domain names of the customers can then help in deciding if we can send future marketing campaigns through the ads service of the email domain providers rather than sending out emails.

library(tidyr)
userbase$Gender[userbase$Gender=='M'] <- 'Male'
userbase$Gender[userbase$Gender=='F'] <- 'Female'
userbase$Type[userbase$Type=='C'] <- 'Consumer'
userbase$Type[userbase$Type=='B'] <- 'Business'
userbase <- separate(userbase, Email_Address, into=c('email', 'domain'), sep="@")
head(userbase)
## # A tibble: 6 x 7
##   Customer_ID Gender Type     email               domain        Age Tenure
##         <dbl> <chr>  <chr>    <chr>               <chr>       <dbl>  <dbl>
## 1           1 Male   Consumer Mph                 msn.com        38     12
## 2           2 Female Business SophcYN7381qMe8lRqK aol.com        37     23
## 3           3 Female Consumer xE                  msn.com        32     15
## 4           4 Male   Consumer WduR0o0             comcast.net    31     24
## 5           5 Male   Consumer walHI9cmWJIBKvy0    aol.com        35     13
## 6           6 Female Business OoececBAnkd         yahoo.com      38     23

In order to be able to analyze the relation between sent emails and the responses received from users, we need to join the sent_emails dataframe to the responded dataframe. The common values between both dataframes are Customer_ID and SubjectLine_ID. Ho

joined_emails <- merge(x=sent_emails, y=responded, all.x=TRUE)
joined_emails$days_till_response <- (joined_emails$Responded_Date - joined_emails$Sent_Date)
head(joined_emails)
##   Customer_ID SubjectLine_ID  Sent_Date Responded_Date days_till_response
## 1           1              1 2016-02-28     2016-02-28             0 days
## 2           1              2 2016-02-28     2016-01-24           -35 days
## 3           1              2 2016-01-24     2016-01-24             0 days
## 4           1              2 2016-01-12     2016-01-24            12 days
## 5           1              3 2016-01-19           <NA>            NA days
## 6           1              3 2016-03-10           <NA>            NA days

After looking at the values, there seems to be emails that have Sent_Date after Responded_Date. Upon closer examination of the data (in data analysis section), it is clear that the same emails can be sent multiple times. To remove those records, we will need to records that have Responded_Date >= to Sent_Date, or records that have NA Responded_Date.

joined_emails2 <- filter(joined_emails, is.na(joined_emails$Responded_Date) | (joined_emails$Responded_Date >= joined_emails$Sent_Date))
head(joined_emails2)
##   Customer_ID SubjectLine_ID  Sent_Date Responded_Date days_till_response
## 1           1              1 2016-02-28     2016-02-28             0 days
## 2           1              2 2016-01-24     2016-01-24             0 days
## 3           1              2 2016-01-12     2016-01-24            12 days
## 4           1              3 2016-01-19           <NA>            NA days
## 5           1              3 2016-03-10           <NA>            NA days
## 6           1              3 2016-03-14           <NA>            NA days

Data Analysis

The data analysis is divided into three parts:

  1. The demographic analysis based on the nominal categorical variables for customer base
  2. The demographic analysis based on the numerical continuous variables for customer base
  3. The customer responses to email campaigns

Categorical Variables

Based on the charts below, we can conclude the following:

  • Most of our customers are males (65%)
  • Most of the marketing is done direct-to-consumer since most of the customers fall under “Consumer” type
  • Gmail is the highest used email domain name among the customers. However, if we combine hotmail.com and msn.com into one “Microsoft” domain, we find that 40% of the customers use Microsoft’s email services.
  • 75% of male and female customers fall under business group.
# Here, we group the userbase data by gender and display the count. The same process will be nominal variable
group_by(userbase, Gender) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  ggplot(aes(x=Gender, y=customers, color=Gender, fill=Gender)) + 
  geom_bar(stat="Identity", show.legend = TRUE) +
  ggtitle("Gender Distribution") +
  geom_text(aes(label=customers), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
            position = position_dodge(0.9), size=3.5)

group_by(userbase, Type) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  ggplot(aes(x=Type, y=customers, color=Type, fill=Type)) + 
  geom_bar(stat="Identity", show.legend = TRUE) +
  ggtitle("Customer Type Distribution") +
  geom_text(aes(label=customers), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
            position = position_dodge(0.9), size=3.5)

group_by(userbase, domain) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  ggplot(aes(x=domain, y=customers, color=domain, fill=domain)) + 
  geom_bar(stat="identity", show.legend = TRUE) +
  ggtitle("Customer Email Domain Distribution") +
  geom_text(aes(label=customers), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  geom_text(aes(label=label_percent()(customers_percentage)), vjust=3.6, color="white",
            position = position_dodge(0.9), size=3.5)

group_by(userbase, Gender, Type) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  arrange(desc(Gender), desc(Type)) %>%
  ggplot(aes(x=Gender, y=customers, color=Type, fill=Type)) + 
  geom_bar(stat="identity", show.legend = TRUE) +
  ggtitle("Customer Type per Gender") +
  geom_text(aes(y=customers, label=customers), vjust=1.6, color="white",
            position = 'stack', size=3.5) +
  geom_text(aes(y=customers, label=label_percent()(customers_percentage)), vjust=3.6, color="white",
           position = 'stack', size=3.5)
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.

Numerical Variables

After analyzing the numeric variables, we can conclude the following:

  • Age has a normal distribution that is skewed slightly to the left (towards younger customer base). However, there is a large spike in the number of 18 year old customers
  • Age distribution over gender shows the same results as the overall age distribution
  • Tenure has a binomial distribution.
  • Tenure distribution over gender has the same results as the overall tenure distribution
  • The correlation coefficient between age and tenure is 0.00105923035031279, which means that there is no correlation between customer’s age & customer’s tenure. This is demonstrated when looking at the scatter plot.
library(stringr)

group_by(userbase, Age) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  ggplot(aes(x=Age, y=customers)) + 
  geom_bar(stat="identity", show.legend = TRUE) +
  ggtitle("Age Distribution") +
  geom_text(aes(label=customers), vjust=0.5, hjust=1.5, color="white",
            position = position_dodge(0.9), size=2.5, angle=90)

group_by(userbase, Age, Gender) %>%
  ggplot(aes(x=Age, color=Gender, fill=Gender)) + 
  geom_histogram(position="identity", show.legend = TRUE, alpha=0.2, binwidth = 1) +
  ggtitle("Age Distribution per Gender")

group_by(userbase, Tenure) %>%
  summarise(customers=length(Customer_ID),
            customers_percentage=(length(Customer_ID)/500000)) %>%
  ggplot(aes(x=Tenure, y=customers)) + 
  geom_bar(stat="identity", show.legend = TRUE) +
  ggtitle("Age Distribution") +
  geom_text(aes(label=customers), vjust=0.5, hjust=1.5, color="white",
            position = position_dodge(0.9), size=2.5, angle=90)

group_by(userbase, Tenure, Gender) %>%
  ggplot(aes(x=Tenure, color=Gender, fill=Gender)) + 
  geom_histogram(position="identity", show.legend = TRUE, alpha=0.2, binwidth = 1) +
  ggtitle("Age Distribution per Gender")

cor_age_tenure <- cor(userbase$Age, userbase$Tenure)
print(str_interp("The correlation coefficeint between customer's age & customer's tenure is: ${cor_age_tenure}"))
## [1] "The correlation coefficeint between customer's age & customer's tenure is: 0.00105923035031279"

userbase %>% ggplot(aes(x=Age, y=Tenure, shape=Type, color=Gender)) + geom_point() + geom_smooth(method=lm, fullrange=TRUE)

Campaign Responses

After analysing the emails sent and responses data, we can conclude the following:

  • The number of customers emailed for each campaign are almost the same (marginal differences)
  • The number of emails sent for each campaign are almost the same (marginal differences)
  • On average, each customer is sent two emails per campaign
  • Most customers who are going to respond to the emails would respond within the first 7 days of receiving the emails
  • Campaign 3 is the campaign that received the least email responses
group_by(joined_emails2, SubjectLine_ID) %>%
  summarise(emails = length(Customer_ID)) %>%
  ggplot(aes(x=SubjectLine_ID, y=emails)) +
  geom_bar(stat="Identity") +
  ggtitle("Emails sent per campeign") +
  geom_text(aes(label=emails), vjust=2, color="white",
            position = position_dodge(0.9), size=3.5)

group_by(joined_emails2, SubjectLine_ID) %>%
  summarise(ratio=length(Customer_ID)/length(unique(Customer_ID))) %>%
  ggplot(aes(x=SubjectLine_ID, y=ratio)) +
  geom_bar(stat="Identity") +
  ggtitle("Emails sent per customer per Campeign") +
  geom_text(aes(label=ratio), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5)

filter(joined_emails2, !is.na(days_till_response)) %>%
  ggplot(aes(x=days_till_response)) +
  geom_histogram(binwidth=7, show.legend = TRUE, position="identity") +
  ggtitle("Days till response from customer")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

filter(joined_emails2, !is.na(days_till_response)) %>% 
  group_by(SubjectLine_ID) %>%
  summarise(emails = length(Customer_ID)) %>%
  ggplot(aes(x=SubjectLine_ID, y=emails)) +
  geom_bar(stat="Identity") +
  ggtitle("Responses per Campaign") +
  geom_text(aes(label=emails), vjust=2, color="white",
            position = position_dodge(0.9), size=3.5)