This is an overview of the bank data set:
bank <- read.csv("C:/Users/jiahui/Documents/R/bank-full.csv", header = TRUE, sep = ";", stringsAsFactors = TRUE)
summary(bank)
##       age              job          marital          education
##  Min.   :18.0   blue-collar:9732   divorced: 5207   primary  : 6851
##  1st Qu.:33.0   management :9458   married :27214   secondary:23202
##  Median :39.0   technician :7597   single  :12790   tertiary :13301
##  Mean   :40.9   admin.     :5171                    unknown  : 1857
##  3rd Qu.:48.0   services   :4154
##  Max.   :95.0   retired    :2264
##                 (Other)    :6835
##  default        balance        housing      loan            contact
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906
##              Median :   448                           unknown  :13020
##              Mean   :  1362
##              3rd Qu.:  1428
##              Max.   :102127
##
##       day           month          duration       campaign
##  Min.   : 1.0   may    :13766   Min.   :   0   Min.   : 1.00
##  1st Qu.: 8.0   jul    : 6895   1st Qu.: 103   1st Qu.: 1.00
##  Median :16.0   aug    : 6247   Median : 180   Median : 2.00
##  Mean   :15.8   jun    : 5341   Mean   : 258   Mean   : 2.76
##  3rd Qu.:21.0   nov    : 3970   3rd Qu.: 319   3rd Qu.: 3.00
##  Max.   :31.0   apr    : 2932   Max.   :4918   Max.   :63.00
##                 (Other): 6060
##      pdays          previous         poutcome       y
##  Min.   : -1.0   Min.   :  0.00   failure: 4901   no :39922
##  1st Qu.: -1.0   1st Qu.:  0.00   other  : 1840   yes: 5289
##  Median : -1.0   Median :  0.00   success: 1511
##  Mean   : 40.2   Mean   :  0.58   unknown:36959
##  3rd Qu.: -1.0   3rd Qu.:  0.00
##  Max.   :871.0   Max.   :275.00
##
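The summary shows that the outcome y has far more "no" than "yes" values. As a small sketch, the class shares can be computed directly from the counts printed above; the variable names `y_counts` and `y_share` are only illustrative and are not part of the original analysis.

```r
# Counts of the outcome variable y, copied from the summary() output above
y_counts <- c(no = 39922, yes = 5289)

# Share of each class among all observations
y_share <- prop.table(y_counts)
round(y_share, 3)
```

With the data frame loaded, `prop.table(table(bank$y))` should give the same shares without copying the counts by hand.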
A quick look at the data: the following are the first 5 rows of the data set.
head(bank, n=5)
##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
##   day month duration campaign pdays previous poutcome  y
## 1   5   may      261        1    -1        0  unknown no
## 2   5   may      151        1    -1        0  unknown no
## 3   5   may       76        1    -1        0  unknown no
## 4   5   may       92        1    -1        0  unknown no
## 5   5   may      198        1    -1        0  unknown no
More in-depth information on the structure of the data set:
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
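The pdays column starts with a run of -1 values, and its median in the summary is also -1. Assuming that -1 is a sentinel meaning the client was not contacted before (this is an assumption based on the usual description of this data set, not something the output above proves), a hedged sketch of recoding it to NA before computing statistics could look like this. The toy vector is invented for illustration.

```r
# Toy pdays values, invented for illustration (not rows from the bank data)
pdays <- c(-1, -1, 151, -1, 92)

# Assumption: -1 marks "not previously contacted", so treat it as missing
pdays_clean <- ifelse(pdays == -1, NA, pdays)

# Mean over the clients that do have a previous contact
mean(pdays_clean, na.rm = TRUE)
```

Without this recoding, the -1 values pull statistics such as the mean towards zero, which may explain why the mean (40.2) sits so far above the median (-1) in the summary.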
The following plots give an overview of the relationships between age and job, marital status and job, and balance and education:
require(lattice)
## Loading required package: lattice
require(ggplot2)
## Loading required package: ggplot2
ggplot(bank, aes(age)) + geom_histogram(fill = "skyblue") + facet_wrap(~ job)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(bank, aes(marital)) + geom_bar(fill="skyblue") + facet_wrap(~job)
ggplot(bank, aes(balance)) + geom_histogram(fill = "skyblue") + facet_wrap(~ education)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
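The message above says the bin width defaulted to range/30. As a rough sketch, the default for age can be worked out from the minimum and maximum in the summary, and compared with an explicit width. The 5-year width and the variable names are hypothetical choices for illustration only.

```r
# Minimum and maximum age, copied from the summary() output above
age_min <- 18
age_max <- 95

# Default reported in the message: range / 30
default_binwidth <- (age_max - age_min) / 30
default_binwidth

# Hypothetical explicit alternative: 5-year bins
explicit_binwidth <- 5
n_bins <- ceiling((age_max - age_min) / explicit_binwidth)
n_bins
```

Passing an explicit value, for example `binwidth = 5`, to the histogram layer gives bins with round edges and should suppress the message.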
Visualizing balance by job, with outliers:
balance <- ggplot(bank, aes(factor(job), balance))
balance + geom_boxplot()
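Visualizing balance by job, without outliers. The y-axis is limited to the 10th to 90th percentile range of balance; because scale limits drop out-of-range values before the box statistics are computed, the warnings below report removed rows: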
balance + geom_boxplot(outlier.shape = NA) + scale_y_continuous(limits = quantile(bank$balance, c(0.1, 0.9)))
## Warning: Removed 8287 rows containing non-finite values (stat_boxplot).
## Warning: Removed 290 rows containing missing values (geom_point).
## Warning: Removed 502 rows containing missing values (geom_point).
## Warning: Removed 79 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 299 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 64 rows containing missing values (geom_point).
## Warning: Removed 226 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 413 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
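The warnings above show that rows were dropped before the boxplot statistics were computed. The following is a minimal base-R sketch, on toy values invented purely for illustration, of how trimming to the 10th and 90th percentiles changes the hinges (quartiles) compared with computing them on all values.

```r
# Toy values, invented for illustration (not taken from the bank data)
x <- c(-50, 0, 10, 20, 30, 40, 50, 60, 70, 1000)

# Limits at the 10th and 90th percentiles, as in the plot above
lims <- quantile(x, c(0.1, 0.9))

# Dropping out-of-range values first, which is what scale limits do
trimmed <- x[x >= lims[1] & x <= lims[2]]

# Box statistics (lower whisker, lower hinge, median, upper hinge, upper whisker)
stats_all <- boxplot.stats(x)$stats
stats_trimmed <- boxplot.stats(trimmed)$stats
stats_all
stats_trimmed
```

In ggplot2, `coord_cartesian(ylim = ...)` zooms the axis without dropping data, so it may be a closer match if the aim is only to hide the extreme points while keeping the boxes computed on all rows.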
Divide the bank data set into two groups, bank_training and bank_testing:
bank_training <- bank[1:25211, ]
bank_testing <- bank[25212:45211, ]
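The split above takes the first 25,211 rows for training and the remaining 20,000 for testing. The first rows shown earlier all share the same day and month, which suggests the file may be ordered by date; if so, a sequential split could give the two groups different distributions. A hedged sketch of a random split of the same sizes follows; the seed and the index names are hypothetical.

```r
n_total <- 45211   # number of observations reported by str(bank)
n_train <- 25211   # training-set size used in the split above

set.seed(1)        # hypothetical seed, only to make the draw repeatable

# Randomly chosen row numbers for training; the rest go to testing
train_idx <- sample(n_total, n_train)
test_idx <- setdiff(seq_len(n_total), train_idx)

# With the data loaded, the groups would then be:
# bank_training <- bank[train_idx, ]
# bank_testing  <- bank[test_idx, ]
```

Whether a random or a sequential split is more appropriate depends on how the model will be used, so this is an alternative to consider, not a correction of the split above.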
Scatter-plot matrix of age, balance and duration (columns 1, 6 and 12):
bank_temp <- bank[c("age", "balance", "duration")]
splom(bank_temp)