library(DT)
library(graphics)Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Understanding the data with basic statistics will give us a tremendous advantage in making the best business decision. We try to get insights from The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.
To access the data, check the followinG URL https://archive.ics.uci.edu/ml/datasets/bank+marketing
bank <- read.csv("bank-full.csv", header = TRUE, sep = ";")rmarkdown::paged_table(bank)dim(bank)#> [1] 45211 17
names(bank)#> [1] "age" "job" "marital" "education" "default" "balance"
#> [7] "housing" "loan" "contact" "day" "month" "duration"
#> [13] "campaign" "pdays" "previous" "poutcome" "y"
The dataset includes information as follow:
1 - age (numeric)2 - job : type of job (categorical)3 - marital : marital status (categorical)4 - education (categorical)5 - default: has credit in default? (categorical)6 - housing: has housing loan? (categorical)7 - loan: has personal loan? (categorical)8 - contact: contact communication type (categorical)9 - month: last contact month of year (categorical)10 - day_of_week: last contact day of the week (categorical)11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)14 - previous: number of contacts performed before this campaign and for this client (numeric)15 - poutcome: outcome of the previous marketing campaign (categorical)16 - balance: average yearly balance17 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)We will see whether the data have missing value or not. Missing values are represented by the symbol NA (not available) in R.
colSums(is.na(bank))#> age job marital education default balance housing loan
#> 0 0 0 0 0 0 0 0
#> contact day month duration campaign pdays previous poutcome
#> 0 0 0 0 0 0 0 0
#> y
#> 0
anyNA(bank)#> [1] FALSE
We see no missing value in the data. The next step is viewing the metadata to check the type of each variable.
str(bank)#> 'data.frame': 45211 obs. of 17 variables:
#> $ age : int 58 44 33 47 33 35 28 42 58 43 ...
#> $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
#> $ marital : chr "married" "single" "married" "married" ...
#> $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
#> $ housing : chr "yes" "yes" "yes" "yes" ...
#> $ loan : chr "no" "no" "yes" "no" ...
#> $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ day : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ month : chr "may" "may" "may" "may" ...
#> $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
#> $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ y : chr "no" "no" "no" "no" ...
From this result, we find some of the data type not in the correct type. we need to convert it into the correct type (data coercion)
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$default <- as.factor(bank$default)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)
str(bank)#> 'data.frame': 45211 obs. of 17 variables:
#> $ age : int 58 44 33 47 33 35 28 42 58 43 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
#> $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
#> $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
#> $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
#> $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
#> $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
#> $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ day : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#> $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
#> $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#> $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Each column already changed into the desired data type.
nrow(bank)#> [1] 45211
summary(bank)#> age job marital education
#> Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
#> 1st Qu.:33.00 management :9458 married :27214 secondary:23202
#> Median :39.00 technician :7597 single :12790 tertiary :13301
#> Mean :40.94 admin. :5171 unknown : 1857
#> 3rd Qu.:48.00 services :4154
#> Max. :95.00 retired :2264
#> (Other) :6835
#> default balance housing loan contact
#> no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
#> yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
#> Median : 448 unknown :13020
#> Mean : 1362
#> 3rd Qu.: 1428
#> Max. :102127
#>
#> day month duration campaign
#> Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
#> 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
#> Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
#> Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
#> 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
#> Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
#> (Other): 6060
#> pdays previous poutcome y
#> Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
#> 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
#> Median : -1.0 Median : 0.0000 success: 1511
#> Mean : 40.2 Mean : 0.5803 unknown:36959
#> 3rd Qu.: -1.0 3rd Qu.: 0.0000
#> Max. :871.0 Max. :275.0000
#>
Based on the summary above, we find pieces of information:
We can see the distribution of the subscriber with their equal marital status and education.
bank_yes <- bank[bank$y == "yes",]graphics::pie(xtabs(~ bank_yes$marital, bank_yes$y))graphics::pie(xtabs(~ bank_yes$education, bank_yes$y))xtabs(~ bank_yes$marital + bank_yes$education, bank_yes)#> bank_yes$education
#> bank_yes$marital primary secondary tertiary unknown
#> divorced 104 291 203 24
#> married 396 1304 913 142
#> single 91 855 880 86
The subscribers mostly come from married and secondary education status. Next we’ll find out the balance of the customes with their job.
blue_collar.mean <- (mean(bank_yes$balance[bank_yes$job == "blue-collar"]))
management.mean <- (mean(bank_yes$balance[bank_yes$job == "management"]))
technician.mean <- (mean(bank_yes$balance[bank_yes$job == "technician"]))
admin.mean <- (mean(bank_yes$balance[bank_yes$job == "admin."]))
services.mean <- (mean(bank_yes$balance[bank_yes$job == "services"]))
retired.mean <- (mean(bank_yes$balance[bank_yes$job == "retired"]))
other <- subset(bank_yes, job != "blue-collar" | job != "management"| job!= "technician" | job!= "admin."| job!= "services"| job != "retired")
other.mean <- mean(other$balance)
job.mean <- cbind(job=c("blue-collar","management","technician","admin.","services", "retired", "other"))
job.mean <- cbind(job.mean,as.data.frame(c(blue_collar.mean,management.mean,technician.mean,admin.mean,services.mean, retired.mean, other.mean)))
names(job.mean)[2] <- paste("Mean")
job.meangraphics::barplot(xtabs(Mean ~ job, job.mean), col= "Green")Customers who retired have the highest average yearly balance with the average annual balance of 2690 Euro, with the following ranking from management in the second place and technician in the third place.
bank_yes$age.Range <-c("18-32", "33-39", "40-47", "48-95")[findInterval(as.numeric(as.character(bank_yes$age)) , c(18, 33, 40, 48, Inf) )]
xtabs(~ bank_yes$age.Range + bank_yes$job, bank_yes)#> bank_yes$job
#> bank_yes$age.Range admin. blue-collar entrepreneur housemaid management retired
#> 18-32 193 218 18 10 372 0
#> 33-39 164 179 35 20 389 1
#> 40-47 131 169 30 17 227 4
#> 48-95 143 142 40 62 313 511
#> bank_yes$job
#> bank_yes$age.Range self-employed services student technician unemployed unknown
#> 18-32 76 125 247 256 52 5
#> 33-39 42 116 19 253 46 5
#> 40-47 33 62 3 153 45 6
#> 48-95 36 66 0 178 59 18
graphics::barplot(xtabs(~ age.Range + job, bank_yes), legend.text = TRUE)table(droplevels(bank_yes$job),bank_yes$age.Range)#>
#> 18-32 33-39 40-47 48-95
#> admin. 193 164 131 143
#> blue-collar 218 179 169 142
#> entrepreneur 18 35 30 40
#> housemaid 10 20 17 62
#> management 372 389 227 313
#> retired 0 1 4 511
#> self-employed 76 42 33 36
#> services 125 116 62 66
#> student 247 19 3 0
#> technician 256 253 153 178
#> unemployed 52 46 45 59
#> unknown 5 5 6 18
The Data show the customers with the management background are the most likely to subscribe to the marketing campaign. We also see here the customers with management background at the age interval between 33 - 39 are the highest subscribers.
crosstab <- xtabs(formula = duration ~ contact + job, data = bank_yes)heatmap(crosstab, Rowv = NA, Colv = NA, cexCol = 0.8, scale = "row")We see that the customer dialed through telephone and retired has the highest in the last contact duration. Interestingly, the blue-collar people have the highest last contact duration through an unknown source. Further investigation may appear regarding this insight.
graphics::pie(xtabs(~ bank_yes$loan + bank_yes$housing, bank_yes$y))xtabs(~ bank_yes$loan + bank_yes$housing, bank_yes$y)#> bank_yes$housing
#> bank_yes$loan no yes
#> no 3135 1670
#> yes 219 265
The subscribers with no housing and personal loan will likely to accept the marketing campaign.
We have a few insights regarding the data. We find that customers who will likely subscribe come from the person with married status = yes, has a secondary education, with a management job. The target customer may come from the age ranging from 33 - 39 and should be called by cellular. That customer may also have no loan to take this consideration.
As we know, the data provide much more insight that has been dug. We have not discovered other variables that may affect a customer’s decision to participate in the campaign. Further analysis should be done to take insight genuinely; for example, we may run a machine learning model for more careful consideration.