Assignment 2

R Markdown

This document contains all the analysis for the second Anova Analytics R Bootcamp assignment. The requirements of this assignment are: Find 3-5 datasets pertinant to: Churn, Segmentation, Marketing Datasets should be good for regression (time series data is not ideal) Load the data into R Explore the datasets 10 plots using different variables are required We will probably do some other stuff beyond the assignment description =)

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 366250 19.6     592000 31.7   460000 24.6
## Vcells 561653  4.3    1308461 10.0   786403  6.0

#onlineRetailInformation
#This fails and asks us to provide a sheet name or sheet index, as I don't have MSoffice I don't know about this...
#onlineRetailUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx' 
#onlineRetailxls <- './data/onlineRetail/onlineRetail.xls'
#download.file(url =onlineRetailUrl, destfile = onlineRetailxls)
#library(xlsx)
#onlineRetailData <- read.xlsx(onlineRetailxls)

#bankdata
#bankUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'
#bankZip <- './data/bank/bank.zip'
#bankAdditionalUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip'
#bankAdditionalZip <- './data/bank/bank-additional.zip'
#download.file(url = bankUrl, destfile = bankZip)
#download.file(url = bankAdditionalUrl, destfile = bankAdditionalZip)
#unzip(zipfile = bankZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/bank')
#unzip(zipfile = bankAdditionalZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/bank')
bankFullCsv <- './data/bank/bank-full.csv'
bankNamesTxt <- './data/bank/bank-names.txt'
bankCsv <- './data/bank/bank.csv'

bankFull <- read.csv(bankFullCsv, sep = ';')
bankSample <- read.csv(bankCsv, sep = ';')

##Load more sample data
#bankdata
#webAnalyticsDemoUrl <- 'http://www.tatvic.com/blog/downloads/dataset.zip'
#webAnalyticsDemoZip <- './data/bank/bank.zip'
#download.file(url = webAnalyticsDemoUrl, destfile = webAnalyticsDemoZip)
#unzip(zipfile = webAnalyticsDemoZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/webAnalyticsDemo')

webAnalyticsDemoCsv1 <- './data/webAnalyticsDemo/dataset1.csv'
webAnalyticsDemoCsv2 <- './data/webAnalyticsDemo/dataset2.csv'
webAnalyticsDemoCsv3 <- './data/webAnalyticsDemo/dataset3.csv'

webAnalyticsDemo1 <- read.csv(webAnalyticsDemoCsv1)
webAnalyticsDemo2 <- read.csv(webAnalyticsDemoCsv2)
webAnalyticsDemo3 <- read.csv(webAnalyticsDemoCsv3)

Bank info

Relevant Information for the bank dataset:

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed.

There are two datasets: 1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010). 2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv. The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).

The classification goal is to predict if the client will subscribe a term deposit (variable y).

For more information, read [Moro et al., 2011].

Input variables: # bank client data: 1 - age (numeric) 2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”) 3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed) 4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”) 5 - default: has credit in default? (binary: “yes”,“no”) 6 - balance: average yearly balance, in euros (numeric) 7 - housing: has housing loan? (binary: “yes”,“no”) 8 - loan: has personal loan? (binary: “yes”,“no”) # related with the last contact of the current campaign: 9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”) 10 - day: last contact day of the month (numeric) 11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) 12 - duration: last contact duration, in seconds (numeric) # other attributes: 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) 15 - previous: number of contacts performed before this campaign and for this client (numeric) 16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target): 17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Exploring the Bank Dataset

##Find out if there are any NAs, data that we might have to either destroy or impute

naColumns <- function(df) {
    colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
naColumns(bankSample)

## character(0)

#There are no NAs to work with, this is a nice clean dataset

head(bankSample)

##   age         job marital education default balance housing loan  contact
## 1  30  unemployed married   primary      no    1787      no   no cellular
## 2  33    services married secondary      no    4789     yes  yes cellular
## 3  35  management  single  tertiary      no    1350     yes   no cellular
## 4  30  management married  tertiary      no    1476     yes  yes  unknown
## 5  59 blue-collar married secondary      no       0     yes   no  unknown
## 6  35  management  single  tertiary      no     747      no   no cellular
##   day month duration campaign pdays previous poutcome  y
## 1  19   oct       79        1    -1        0  unknown no
## 2  11   may      220        1   339        4  failure no
## 3  16   apr      185        1   330        1  failure no
## 4   3   jun      199        4    -1        0  unknown no
## 5   5   may      226        1    -1        0  unknown no
## 6  23   feb      141        2   176        3  failure no

head(bankFull)

##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
## 6  35   management married  tertiary      no     231     yes   no unknown
##   day month duration campaign pdays previous poutcome  y
## 1   5   may      261        1    -1        0  unknown no
## 2   5   may      151        1    -1        0  unknown no
## 3   5   may       76        1    -1        0  unknown no
## 4   5   may       92        1    -1        0  unknown no
## 5   5   may      198        1    -1        0  unknown no
## 6   5   may      139        1    -1        0  unknown no

dim(bankSample)

## [1] 4521   17

dim(bankFull)

## [1] 45211    17

#As expected there are 17 attributes across both datasets, the bankSample dataset contains 4521 observations
#The bankFull dataset contains 45211 observations

#Lets see if there is any relationship between y (client subscribed to term deposit) and client age

library('ggplot2')
library('dplyr')

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("gridExtra")

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

bankSampleYYes <-filter(bankSample, y == 'yes')
bankSampleYNo <-filter(bankSample, y == 'no')
yesTermDepositsByAge <- ggplot(bankSampleYYes, aes(age)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits Yes by Age Count", x="age", y="Count of Yes")
noTermDepositsByAge <- ggplot(bankSampleYNo, aes(age)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits No by Age Count", x="age", y="Count of No")
grid.arrange(yesTermDepositsByAge, noTermDepositsByAge)

#This chart makes me think that age probably doesn't have much to do with our predicted value
#lets plot education level vs y on a chart and see if there is anything interesting there
ggplot(bankSample, aes(education, y)) + geom_jitter()

#This again doesn't look all that meaninful. Relatively speaking, the "yes's" and the "No's" seem to line up
#relatively evenly across educational levels

#lets do the same analysis but with job type
ggplot(bankSample, aes(job, y)) + geom_jitter()

#Here again the relative pattern appears about the same across job types. So job type may not be a good predictor of Y

#Now we'll do single vs married folks
ggplot(bankSample, aes(marital, y)) + geom_jitter()

#No obvious relationship... again

#how about default Y/N vs the predicted?
ggplot(bankSample, aes(default, y)) + geom_jitter()

#Maybe there is something here, but it doesn't look like a home-run

#lets consider the continious variable of "balance" now
yesTermDepositsByBalance <- ggplot(bankSampleYYes, aes(balance)) + geom_histogram(binwidth = 1000) + labs(title = "Term Deposits Yes by Balance", x="Balance", y="Count of Yes") +xlim(-1000,25000)
noTermDepositsByBalance <- ggplot(bankSampleYNo, aes(balance)) + geom_histogram(binwidth = 1000) + labs(title = "Term Deposits No by Age Count", x="balance", y="Count of No")+xlim(-1000,25000)
grid.arrange(yesTermDepositsByBalance, noTermDepositsByBalance)

## Warning: Removed 2 rows containing non-finite values (stat_bin).

## Warning: Removed 22 rows containing non-finite values (stat_bin).

#Yeah, even term deposits doesn't seem to have much effect on the predicted

#lets consider housing/loan/contact variables and see if there is anything interesting 
ggplot(bankSample, aes(housing, y)) + geom_jitter()

#That doesn't look strong either
ggplot(bankSample, aes(loan, y)) + geom_jitter()

# or that....
ggplot(bankSample, aes(contact, y)) + geom_jitter()

# or that :(

#Surely there should be some relationship between duration and the predicted?
yesTermDepositsByDuration <- ggplot(bankSampleYYes, aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits Yes by Duration", x="Duration", y="Count of Yes") + xlim(0,3000)
noTermDepositsByDuration <- ggplot(bankSampleYNo, aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits No by Duration", x="duration", y="Count of No")+ xlim(0,3000)
grid.arrange(yesTermDepositsByDuration, noTermDepositsByDuration)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

#finally, something interesting. Unfortunately this doesn't really help anyone to know, obviously, the longer the prospect is on the call the higher the probability that they will say yes...

yesTermDepositsByCampaign <- ggplot(bankSampleYYes, aes(campaign)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits Yes by number of campaigns", x="campaigns", y="Count of Yes") + xlim(0,20)
noTermDepositsByCampaign <- ggplot(bankSampleYNo, aes(campaign)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits No by number of campaigns", x="campaigns", y="Count of No") + xlim(0,20)
grid.arrange(yesTermDepositsByCampaign, noTermDepositsByCampaign)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 22 rows containing non-finite values (stat_bin).

#This is actually actionable, because it indicates that if a person is going to say yes, they probably will on the first 5 campaigns. 

#Lets do previous now
yesTermDepositsByPrevious <- ggplot(bankSampleYYes, aes(previous)) + geom_histogram(binwidth = 2) + labs(title = "Term Deposits Yes by number of previous", x="previous", y="Count of Yes") + xlim(0,15)
noTermDepositsByPrevious <- ggplot(bankSampleYNo, aes(previous)) + geom_histogram(binwidth = 2) + labs(title = "Term Deposits No by number of previous", x="previous", y="Count of No") + xlim(0,15)
grid.arrange(yesTermDepositsByPrevious, noTermDepositsByPrevious)

## Warning: Removed 8 rows containing non-finite values (stat_bin).

#This is also somewhat interesting, as the first graph has a much longer "tail"

Summary of what we’ve learned in the bankSample data set

The longer the duration of the last call, the higher the probability of closing the sale
The larger the number of previous contacts, the higher the probability of closing the sale
Other factors, job, marital, education, default, balance, housing, loan, and contact don’t seem to be strongly related to probability of closing the sale
I left day & month columns unexplored
I’d really like other common segmentation factors like demographics, religion, region, urban/suburban/country
I’d like other banking specific factors such as credit score
I wouldn’t want to make a regression model based on this dataset {peter}

Web Analytics Demo Exploration

require(ggplot2)
head(webAnalyticsDemo1)

##   month transactions
## 1     1          287
## 2     2          265
## 3     3          336
## 4     4          360
## 5     5          372
## 6     6          335

# Append a new column that maps month numbers to month names
webAnalyticsDemo1$monthf <- factor(webAnalyticsDemo1$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)

# Plot Transactions vs Month
ggplot(webAnalyticsDemo1,aes(monthf,transactions)) +
geom_bar(stat="identity")

# December has the highest transactions, this was a little too basic

# Load data frame that includes Medium as a dimension
head(webAnalyticsDemo2)

##   month   medium transactions
## 1     1   (none)           70
## 2     1     FBJ9            1
## 3     1      cpc           30
## 4     1  organic          153
## 5     1 referral           33
## 6     1  twitter            0

#now the problem is we need to map month 1 = jan, month 2 = feb etc
webAnalyticsDemo2$monthf <- factor(webAnalyticsDemo2$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)

# Facet the Transactions by medium
ggplot(webAnalyticsDemo2,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium)

# What is the problem with this plot ?

# Exclude the mediums having zero transactions
fresh_data <- filter(webAnalyticsDemo2, medium == "cpc" | medium == "organic" | medium == "referral" | medium == "(none)")
# Re-plot
ggplot(fresh_data,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium)

# Stack the plots vertically for easier comparison
ggplot(fresh_data,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium,ncol=1)

# Which medium performed best w.r.t transactions ?

# Load the data frame including an additional dimension Visitor Type

webAnalyticsDemo3$monthf <- factor(webAnalyticsDemo3$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)

# Strip the grey background and add a plot title
ggplot(webAnalyticsDemo3,aes(monthf,transactions,fill=visitorType)) +
geom_bar(stat="identity",position="dodge") +
facet_wrap(~medium) +
theme_bw() +
ggtitle("Transactions split by Visitor Type")

Citations

[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

Available at: [pdf] http://hdl.handle.net/1822/14838 [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

The full dataset was described and analyzed in:

S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.