This document contains all the analysis for the second Anova Analytics R Bootcamp assignment. The requirements of this assignment are: Find 3-5 datasets pertinant to: Churn, Segmentation, Marketing Datasets should be good for regression (time series data is not ideal) Load the data into R Explore the datasets 10 plots using different variables are required We will probably do some other stuff beyond the assignment description =)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 366250 19.6 592000 31.7 460000 24.6
## Vcells 561653 4.3 1308461 10.0 786403 6.0
#onlineRetailInformation
#This fails and asks us to provide a sheet name or sheet index, as I don't have MSoffice I don't know about this...
#onlineRetailUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
#onlineRetailxls <- './data/onlineRetail/onlineRetail.xls'
#download.file(url =onlineRetailUrl, destfile = onlineRetailxls)
#library(xlsx)
#onlineRetailData <- read.xlsx(onlineRetailxls)
#bankdata
#bankUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'
#bankZip <- './data/bank/bank.zip'
#bankAdditionalUrl <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip'
#bankAdditionalZip <- './data/bank/bank-additional.zip'
#download.file(url = bankUrl, destfile = bankZip)
#download.file(url = bankAdditionalUrl, destfile = bankAdditionalZip)
#unzip(zipfile = bankZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/bank')
#unzip(zipfile = bankAdditionalZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/bank')
bankFullCsv <- './data/bank/bank-full.csv'
bankNamesTxt <- './data/bank/bank-names.txt'
bankCsv <- './data/bank/bank.csv'
bankFull <- read.csv(bankFullCsv, sep = ';')
bankSample <- read.csv(bankCsv, sep = ';')
##Load more sample data
#bankdata
#webAnalyticsDemoUrl <- 'http://www.tatvic.com/blog/downloads/dataset.zip'
#webAnalyticsDemoZip <- './data/bank/bank.zip'
#download.file(url = webAnalyticsDemoUrl, destfile = webAnalyticsDemoZip)
#unzip(zipfile = webAnalyticsDemoZip, exdir = 'C:/Users/grass_000/Dropbox/Assignment 2/data/webAnalyticsDemo')
webAnalyticsDemoCsv1 <- './data/webAnalyticsDemo/dataset1.csv'
webAnalyticsDemoCsv2 <- './data/webAnalyticsDemo/dataset2.csv'
webAnalyticsDemoCsv3 <- './data/webAnalyticsDemo/dataset3.csv'
webAnalyticsDemo1 <- read.csv(webAnalyticsDemoCsv1)
webAnalyticsDemo2 <- read.csv(webAnalyticsDemoCsv2)
webAnalyticsDemo3 <- read.csv(webAnalyticsDemoCsv3)
Relevant Information for the bank dataset:
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed.
There are two datasets: 1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010). 2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv. The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).
The classification goal is to predict if the client will subscribe a term deposit (variable y).
For more information, read [Moro et al., 2011].
Input variables: # bank client data: 1 - age (numeric) 2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”) 3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed) 4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”) 5 - default: has credit in default? (binary: “yes”,“no”) 6 - balance: average yearly balance, in euros (numeric) 7 - housing: has housing loan? (binary: “yes”,“no”) 8 - loan: has personal loan? (binary: “yes”,“no”) # related with the last contact of the current campaign: 9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”) 10 - day: last contact day of the month (numeric) 11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) 12 - duration: last contact duration, in seconds (numeric) # other attributes: 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) 15 - previous: number of contacts performed before this campaign and for this client (numeric) 16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target): 17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
##Find out if there are any NAs, data that we might have to either destroy or impute
naColumns <- function(df) {
colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
naColumns(bankSample)
## character(0)
#There are no NAs to work with, this is a nice clean dataset
head(bankSample)
## age job marital education default balance housing loan contact
## 1 30 unemployed married primary no 1787 no no cellular
## 2 33 services married secondary no 4789 yes yes cellular
## 3 35 management single tertiary no 1350 yes no cellular
## 4 30 management married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 management single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome y
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
head(bankFull)
## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue-collar married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## 6 35 management married tertiary no 231 yes no unknown
## day month duration campaign pdays previous poutcome y
## 1 5 may 261 1 -1 0 unknown no
## 2 5 may 151 1 -1 0 unknown no
## 3 5 may 76 1 -1 0 unknown no
## 4 5 may 92 1 -1 0 unknown no
## 5 5 may 198 1 -1 0 unknown no
## 6 5 may 139 1 -1 0 unknown no
dim(bankSample)
## [1] 4521 17
dim(bankFull)
## [1] 45211 17
#As expected there are 17 attributes across both datasets, the bankSample dataset contains 4521 observations
#The bankFull dataset contains 45211 observations
#Lets see if there is any relationship between y (client subscribed to term deposit) and client age
library('ggplot2')
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("gridExtra")
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
bankSampleYYes <-filter(bankSample, y == 'yes')
bankSampleYNo <-filter(bankSample, y == 'no')
yesTermDepositsByAge <- ggplot(bankSampleYYes, aes(age)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits Yes by Age Count", x="age", y="Count of Yes")
noTermDepositsByAge <- ggplot(bankSampleYNo, aes(age)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits No by Age Count", x="age", y="Count of No")
grid.arrange(yesTermDepositsByAge, noTermDepositsByAge)
#This chart makes me think that age probably doesn't have much to do with our predicted value
#lets plot education level vs y on a chart and see if there is anything interesting there
ggplot(bankSample, aes(education, y)) + geom_jitter()
#This again doesn't look all that meaninful. Relatively speaking, the "yes's" and the "No's" seem to line up
#relatively evenly across educational levels
#lets do the same analysis but with job type
ggplot(bankSample, aes(job, y)) + geom_jitter()
#Here again the relative pattern appears about the same across job types. So job type may not be a good predictor of Y
#Now we'll do single vs married folks
ggplot(bankSample, aes(marital, y)) + geom_jitter()
#No obvious relationship... again
#how about default Y/N vs the predicted?
ggplot(bankSample, aes(default, y)) + geom_jitter()
#Maybe there is something here, but it doesn't look like a home-run
#lets consider the continious variable of "balance" now
yesTermDepositsByBalance <- ggplot(bankSampleYYes, aes(balance)) + geom_histogram(binwidth = 1000) + labs(title = "Term Deposits Yes by Balance", x="Balance", y="Count of Yes") +xlim(-1000,25000)
noTermDepositsByBalance <- ggplot(bankSampleYNo, aes(balance)) + geom_histogram(binwidth = 1000) + labs(title = "Term Deposits No by Age Count", x="balance", y="Count of No")+xlim(-1000,25000)
grid.arrange(yesTermDepositsByBalance, noTermDepositsByBalance)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 22 rows containing non-finite values (stat_bin).
#Yeah, even term deposits doesn't seem to have much effect on the predicted
#lets consider housing/loan/contact variables and see if there is anything interesting
ggplot(bankSample, aes(housing, y)) + geom_jitter()
#That doesn't look strong either
ggplot(bankSample, aes(loan, y)) + geom_jitter()
# or that....
ggplot(bankSample, aes(contact, y)) + geom_jitter()
# or that :(
#Surely there should be some relationship between duration and the predicted?
yesTermDepositsByDuration <- ggplot(bankSampleYYes, aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits Yes by Duration", x="Duration", y="Count of Yes") + xlim(0,3000)
noTermDepositsByDuration <- ggplot(bankSampleYNo, aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits No by Duration", x="duration", y="Count of No")+ xlim(0,3000)
grid.arrange(yesTermDepositsByDuration, noTermDepositsByDuration)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
#finally, something interesting. Unfortunately this doesn't really help anyone to know, obviously, the longer the prospect is on the call the higher the probability that they will say yes...
yesTermDepositsByCampaign <- ggplot(bankSampleYYes, aes(campaign)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits Yes by number of campaigns", x="campaigns", y="Count of Yes") + xlim(0,20)
noTermDepositsByCampaign <- ggplot(bankSampleYNo, aes(campaign)) + geom_histogram(binwidth = 5) + labs(title = "Term Deposits No by number of campaigns", x="campaigns", y="Count of No") + xlim(0,20)
grid.arrange(yesTermDepositsByCampaign, noTermDepositsByCampaign)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 22 rows containing non-finite values (stat_bin).
#This is actually actionable, because it indicates that if a person is going to say yes, they probably will on the first 5 campaigns.
#Lets do previous now
yesTermDepositsByPrevious <- ggplot(bankSampleYYes, aes(previous)) + geom_histogram(binwidth = 2) + labs(title = "Term Deposits Yes by number of previous", x="previous", y="Count of Yes") + xlim(0,15)
noTermDepositsByPrevious <- ggplot(bankSampleYNo, aes(previous)) + geom_histogram(binwidth = 2) + labs(title = "Term Deposits No by number of previous", x="previous", y="Count of No") + xlim(0,15)
grid.arrange(yesTermDepositsByPrevious, noTermDepositsByPrevious)
## Warning: Removed 8 rows containing non-finite values (stat_bin).
#This is also somewhat interesting, as the first graph has a much longer "tail"
require(ggplot2)
head(webAnalyticsDemo1)
## month transactions
## 1 1 287
## 2 2 265
## 3 3 336
## 4 4 360
## 5 5 372
## 6 6 335
# Append a new column that maps month numbers to month names
webAnalyticsDemo1$monthf <- factor(webAnalyticsDemo1$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)
# Plot Transactions vs Month
ggplot(webAnalyticsDemo1,aes(monthf,transactions)) +
geom_bar(stat="identity")
# December has the highest transactions, this was a little too basic
# Load data frame that includes Medium as a dimension
head(webAnalyticsDemo2)
## month medium transactions
## 1 1 (none) 70
## 2 1 FBJ9 1
## 3 1 cpc 30
## 4 1 organic 153
## 5 1 referral 33
## 6 1 twitter 0
#now the problem is we need to map month 1 = jan, month 2 = feb etc
webAnalyticsDemo2$monthf <- factor(webAnalyticsDemo2$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)
# Facet the Transactions by medium
ggplot(webAnalyticsDemo2,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium)
# What is the problem with this plot ?
# Exclude the mediums having zero transactions
fresh_data <- filter(webAnalyticsDemo2, medium == "cpc" | medium == "organic" | medium == "referral" | medium == "(none)")
# Re-plot
ggplot(fresh_data,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium)
# Stack the plots vertically for easier comparison
ggplot(fresh_data,aes(monthf,transactions)) +
geom_bar(stat="identity") +
facet_wrap(~medium,ncol=1)
# Which medium performed best w.r.t transactions ?
# Load the data frame including an additional dimension Visitor Type
webAnalyticsDemo3$monthf <- factor(webAnalyticsDemo3$month,levels=as.character(1:12),
labels=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"),
ordered=TRUE)
# Strip the grey background and add a plot title
ggplot(webAnalyticsDemo3,aes(monthf,transactions,fill=visitorType)) +
geom_bar(stat="identity",position="dodge") +
facet_wrap(~medium) +
theme_bw() +
ggtitle("Transactions split by Visitor Type")
[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
Available at: [pdf] http://hdl.handle.net/1822/14838 [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
The full dataset was described and analyzed in:
S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.