Welcome to my LBB.
I use the bank.csv dataset from https://archive.ics.uci.edu. It contains data on
telemarketing from a bank in Portugal. Using this data, we can predict
which prospective customers will buy the product when the bank calls
them.
Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).
First, we read the data with read.csv2() and check its structure with str().
bank <- read.csv2(file = "bank.csv")
str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
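read.csv2() parsed the file into 17 columns, which suggests that the fields are separated by semicolons. As a minimal sketch, assuming a made-up two-row sample in that layout (the inline text is illustrative and not taken from bank.csv), the same result can be obtained from read.csv() by spelling out the separator:

```r
# Hypothetical semicolon-separated sample (not the real file contents)
sample_txt <- 'age;job;y\n30;"unemployed";"no"\n33;"services";"yes"'

# read.csv2() defaults to sep = ";" and dec = ","
a <- read.csv2(text = sample_txt)

# read.csv() defaults to sep = ",", so the separator must be given explicitly
b <- read.csv(text = sample_txt, sep = ";")

# Both calls should yield the same data frame for this sample
identical(a, b)
```

Note that the two functions also differ in the decimal mark they expect, which matters only when a column contains decimal numbers.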
There are 17 variables and 4521 observations.
We have to change some data types.
bank[,c("job", "marital", "education")] <- lapply(bank[,c("job", "marital", "education")], as.factor)
bank[,c("age", "balance", "duration")] <- lapply(bank[,c("age", "balance", "duration")], as.numeric)
str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : num 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : num 1787 4789 1350 1476 0 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : num 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
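Several columns (for example default, housing and loan) are still of type chr after this step. If they are needed as factors later, one possible approach is to convert every character column at once. The sketch below uses a small toy data frame standing in for bank; its values are illustrative only.

```r
# Toy data frame standing in for `bank` (values are illustrative)
toy <- data.frame(
  age     = c(30L, 33L, 35L),
  job     = c("unemployed", "services", "management"),
  housing = c("no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Flag every character column, then convert the flagged columns to factors
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

str(toy)
```

vapply() is used instead of sapply() here so that the result is guaranteed to be a logical vector.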
colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan
##         0         0         0         0         0         0         0         0
##   contact       day     month  duration  campaign     pdays  previous  poutcome
##         0         0         0         0         0         0         0         0
##         y
##         0
anyNA(bank)
## [1] FALSE
There are no missing values in this dataframe.
Let us begin to explore the data.
Question 1: Which job types do the customers have?
bjob <- as.data.frame(prop.table(table(bank$job)))
bjob[order(bjob$Freq, decreasing = TRUE),]
Answer 1: management, blue-collar and technician are the top 3 job types.
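The proportion table can also be sorted directly, without converting it to a data frame first. This is a minimal sketch on a made-up job vector (the counts are illustrative, not the real data):

```r
# Toy job vector: 4 management, 3 blue-collar, 2 technician, 1 services
job <- rep(c("management", "blue-collar", "technician", "services"),
           times = c(4, 3, 2, 1))

# Share of each job type, sorted from most to least common
job_share <- sort(prop.table(table(job)), decreasing = TRUE)
job_share

# Names of the three most common job types
top3 <- names(job_share)[1:3]
top3
```

The data frame version used in this post is handy when the result is needed for further processing; the sorted table is enough for a quick look.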
Question 2: Which job types subscribed to the term deposit product the most?
bank_yes <- bank[bank$y == "yes",]
yesbjob <- as.data.frame(prop.table(table(bank_yes$job)))
yesbjob[order(yesbjob$Freq, decreasing = TRUE),]
Answer 2: management, technician and blue-collar are also the top 3 job types among customers who subscribed to the term deposit, although in a different order.
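These proportions describe the make-up of the subscribers, so job types with many customers tend to rank high. A possible complement is the share of subscribers within each job type. The sketch below assumes a toy data frame with made-up values and uses prop.table() with margin = 1 to get row-wise proportions:

```r
# Toy data: 4 management customers (1 subscribed), 2 students (1 subscribed)
toy <- data.frame(
  job = c("management", "management", "management", "management",
          "student", "student"),
  y   = c("yes", "no", "no", "no", "yes", "no")
)

# margin = 1 makes each row (job type) sum to 1
rate <- prop.table(table(toy$job, toy$y), margin = 1)

# Share of "yes" within each job type
rate[, "yes"]
```

In this toy example both job types have one subscriber each, but the rate within student is higher because the group is smaller.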
Question 3: Which education level subscribed to the term deposit the most?
yesedu <- as.data.frame(prop.table(table(bank_yes$education)))
yesedu[order(yesedu$Freq, decreasing = TRUE),]
Answer 3: Customers with secondary education buy the product the most.
Question 4: Using the job and education variables together, who subscribed to the term deposit the most?
# recode column 'y' from "yes" to the string "1"
bank_yes$y <- sapply(X = as.character(bank_yes$y),
                     FUN = switch,
                     "yes" = "1")
# add numeric column ynum (1 for every subscriber)
bank_yes$ynum <- as.numeric(bank_yes$y)
head(bank_yes)
# sum `ynum` for each combination of `job` and `education`
bank_yes_agg1 <- aggregate(ynum ~ job + education,
                           data = bank_yes,
                           FUN = sum)
bank_yes_agg1[order(bank_yes_agg1$ynum, decreasing = TRUE),]
Answer 4: Customers with a “management” job and “tertiary” education subscribed to the term deposit the most.
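Because every row of bank_yes is a subscriber, summing ynum is the same as counting rows per combination. As an alternative sketch, table() can do that count directly without the helper column. The toy data frame below is made up for illustration; this is a different technique from the aggregate() call used in this post, not a replacement for it.

```r
# Toy subscribers (illustrative values only)
toy_yes <- data.frame(
  job       = c("management", "management", "management",
                "technician", "technician", "management"),
  education = c("tertiary", "tertiary", "tertiary",
                "secondary", "secondary", "secondary")
)

# Count rows for each job/education combination
combo <- as.data.frame(table(job = toy_yes$job,
                             education = toy_yes$education))

# Most frequent combination first
combo <- combo[order(combo$Freq, decreasing = TRUE), ]
combo
```

Unlike aggregate(), table() also lists combinations that never occur, with a count of 0.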
Step 1: Let us look at the summary of the age column.
summary(bank_yes$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   19.00   32.00   40.00   42.49   50.00   87.00
Step 2: Use a boxplot.
boxplot(x = bank_yes$age)
Answer 5: From the boxplot, we can see that the median age of customers who buy the product is 40 years.
There are outliers at ages above 78.
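By default, boxplot() flags points that lie more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule to the quartiles printed by summary() to approximate the upper limit, and then shows boxplot.stats() on a made-up age vector. The result is approximate because boxplot() works with hinges, which can differ slightly from the quartiles reported by summary().

```r
# Quartiles copied from the summary() output of bank_yes$age
q1 <- 32
q3 <- 50

# Upper limit under the 1.5 * IQR rule
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence

# Made-up ages, used only to show how boxplot.stats() reports outliers
toy_age <- c(25, 30, 32, 35, 40, 42, 45, 50, 55, 87)
toy_out <- boxplot.stats(toy_age)$out
toy_out
```

With these quartiles the upper limit is 77, which is in line with the outliers seen at the top of the boxplot.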
Question 6: How many customers with a balance > 200 USD subscribed to the term deposit?
bank200 <- bank[bank$balance > 200,]
bank200[bank200$y == "yes",]
Answer 6: There are 381 customers with a balance > 200 USD who subscribed to the term deposit.
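The subsetting above prints the matching rows; the count itself can be obtained with nrow() or by summing a logical condition. A minimal sketch on a toy data frame (values are made up):

```r
# Toy data standing in for `bank` (illustrative values)
toy <- data.frame(
  balance = c(1787, 150, 4789, 0, 747),
  y       = c("no", "yes", "yes", "no", "yes")
)

# Option 1: subset, then count rows
n_rows <- nrow(toy[toy$balance > 200 & toy$y == "yes", ])

# Option 2: TRUE counts as 1 when summed
n_sum <- sum(toy$balance > 200 & toy$y == "yes")

c(n_rows, n_sum)
```

Both options give the same number; the second avoids creating an intermediate data frame.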
Question 7: What is the mean number of contacts before this campaign (previous) for customers with a “success” outcome (poutcome), compared with the last campaign?
outcome <- bank[bank$poutcome == "success",]
mean(outcome$previous)
## [1] 3.015504
outcome1 <- bank[bank$y == "yes",]
mean(outcome1$campaign)
## [1] 2.266795
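Group means like these can also be computed for every outcome at once with tapply(), which makes the groups easier to compare side by side. This is a sketch on made-up values, assuming the same column names as in bank:

```r
# Toy data standing in for `bank` (illustrative values)
toy <- data.frame(
  poutcome = c("success", "failure", "success", "unknown"),
  previous = c(2, 4, 4, 0)
)

# Mean of `previous` within each level of `poutcome`
mean_prev <- tapply(toy$previous, toy$poutcome, mean)
mean_prev
```

The same pattern would apply to campaign grouped by y.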