The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.
bank <- read.csv("bank-full.csv",sep = ";")head(bank, 5)dim(bank)## [1] 45211 17
From our inspection we can conclude :
str(bank)## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
Based on STR, I want to change some columns into factor
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)
str(bank)## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
After that, I want to check if there is any NA value in the data.
colSums(is.na(bank))## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
anyNA(bank)## [1] FALSE
Based on the result, the data don’t have any NA values, so we can continue to explore the data
I want to know more about education degree of our clients
data.frame(sort(table("education" = bank$education),decreasing = T))I want to know more about loan condition of our clients
data.frame(table("loan" = bank$loan,
"housing" = bank$housing,
"default" = bank$default))I want to know marital status of our clients
data.frame(sort(table("marital_status" = bank$marital),decreasing = T))I want to know more about how to contact our clients
data.frame(sort(table("contact" = bank$contact),decreasing = T))I want to know how many clients have not been contacted
data.frame(sort(table("contacted" = bank$previous != 0),decreasing = T))Based on the data, we know that:
Despite majority of our clients already have housing loan, many of them don’t
For this reason, I want to create a new program for loan.
But first, I have to explore the data about job and age distribution.
To see job description and the distribution
data.frame(sort(table("job" = bank$job),decreasing = T))To see age distribution
data.frame(sort(table("age" = bank$age),decreasing = T))Based on the data, I want to know how many clients we can prospect to join the loan program. The requirements:
Condition for subsetting
age_cond <- bank$age == 30:60
job_cond <- bank$job %in% c("management", "blue-collar", "technician", "admin.", "services")
loan_cond <- bank$loan == "no" & bank$housing == "no" & bank$default == "no"
cont_cond <- bank$contact != "unknown"Subsetting to get new dataframe
prog_cond <- bank[age_cond & job_cond & loan_cond & cont_cond, ]
prog_condBased the data, we have 278 clients to be offered.