Bank dataset overview

This is an overview of the bank data set:

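# stringsAsFactors = TRUE keeps text columns as factors (no longer the default since R 4.0)
bank <- read.csv("C:/Users/jiahui/Documents/R/bank-full.csv", header = TRUE, sep = ";", stringsAsFactors = TRUE)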
summary(bank)

##       age                job           marital          education    
##  Min.   :18.0   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.0   management :9458   married :27214   secondary:23202  
##  Median :39.0   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.9   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.0   services   :4154                                     
##  Max.   :95.0   retired    :2264                                     
##                 (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day           month          duration       campaign    
##  Min.   : 1.0   may    :13766   Min.   :   0   Min.   : 1.00  
##  1st Qu.: 8.0   jul    : 6895   1st Qu.: 103   1st Qu.: 1.00  
##  Median :16.0   aug    : 6247   Median : 180   Median : 2.00  
##  Mean   :15.8   jun    : 5341   Mean   : 258   Mean   : 2.76  
##  3rd Qu.:21.0   nov    : 3970   3rd Qu.: 319   3rd Qu.: 3.00  
##  Max.   :31.0   apr    : 2932   Max.   :4918   Max.   :63.00  
##                 (Other): 6060                                 
##      pdays          previous         poutcome       y        
##  Min.   : -1.0   Min.   :  0.00   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.00   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.00   success: 1511              
##  Mean   : 40.2   Mean   :  0.58   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.00                              
##  Max.   :871.0   Max.   :275.00                              
##
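
The counts for y in the summary above suggest that the two outcome classes are far from balanced. As a small sketch, the proportions can be worked out directly from those printed counts (the names y_counts and y_props are made up for this illustration):

```r
# Counts of the response y, copied from the summary() output above
y_counts <- c(no = 39922, yes = 5289)

# Share of each class among all observations
y_props <- y_counts / sum(y_counts)
print(round(y_props, 3))
```

With the data loaded, prop.table(table(bank$y)) should give the same proportions.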

For a quick look at the data, here are the first 5 rows of the data set:

head(bank, n=5)

##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
##   day month duration campaign pdays previous poutcome  y
## 1   5   may      261        1    -1        0  unknown no
## 2   5   may      151        1    -1        0  unknown no
## 3   5   may       76        1    -1        0  unknown no
## 4   5   may       92        1    -1        0  unknown no
## 5   5   may      198        1    -1        0  unknown no

More in-depth information on the structure of the data set:

str(bank)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The following plots give an overview of the relationships between age and job, marital status and job, and balance and education:

require(lattice)

## Loading required package: lattice

require(ggplot2)

## Loading required package: ggplot2

# age is continuous, so use a histogram with an explicit number of bins
ggplot(bank, aes(age)) + geom_histogram(fill = "skyblue", bins = 30) + facet_wrap(~job)

[Plot: distribution of age, faceted by job]

ggplot(bank, aes(marital)) + geom_bar(fill="skyblue") +  facet_wrap(~job)

[Plot: counts of marital status, faceted by job]

# balance is continuous, so use a histogram with an explicit number of bins
ggplot(bank, aes(balance)) + geom_histogram(fill = "skyblue", bins = 30) + facet_wrap(~education)

[Plot: distribution of balance, faceted by education]

Visualizing balance by job, with outliers:

balance <- ggplot(bank, aes(factor(job), balance))
balance + geom_boxplot()

[Plot: box plots of balance by job, outliers included]

Visualizing balance by job, without outliers:

balance + geom_boxplot(outlier.shape = NA) + scale_y_continuous(limits = quantile(bank$balance, c(0.1, 0.9)))

## Warning: Removed 8287 rows containing non-finite values (stat_boxplot).
## Warning: Removed 290 rows containing missing values (geom_point).
## Warning: Removed 502 rows containing missing values (geom_point).
## Warning: Removed 79 rows containing missing values (geom_point).
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 299 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 64 rows containing missing values (geom_point).
## Warning: Removed 226 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 413 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
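
The first warning indicates that setting limits in scale_y_continuous() drops the rows outside the 10% to 90% range before the box statistics are computed, so the boxes describe the trimmed data rather than the full data. The base-R sketch below illustrates the effect on a small synthetic, right-skewed sample (the vector x, the seed, and the other names are assumptions for illustration, not the bank data):

```r
# Synthetic right-skewed sample standing in for a balance-like variable
set.seed(1)
x <- rexp(1000, rate = 1 / 1000)

# Limits analogous to quantile(bank$balance, c(0.1, 0.9))
lims <- quantile(x, c(0.1, 0.9))

# Box statistics (whisker ends, hinges, median) on the full data
full_stats <- boxplot.stats(x)$stats

# Box statistics after dropping values outside the limits,
# which is what happens when the limits are set on the scale
trimmed <- x[x >= lims[1] & x <= lims[2]]
trimmed_stats <- boxplot.stats(trimmed)$stats

print(rbind(full = full_stats, trimmed = trimmed_stats))
```

In ggplot2, coord_cartesian(ylim = ...) zooms the view without discarding data, so it may be closer to what is wanted when the goal is only to hide the outliers.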

[Plot: box plots of balance by job, outliers hidden and y-axis limited to the 10th to 90th percentiles]

Divide the bank data set into two groups: bank_training and bank_testing.

# First 25211 rows for training, remaining 20000 rows for testing
bank_training <- bank[1:25211, ]
bank_testing  <- bank[25212:45211, ]
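
The split above takes the first 25211 rows for training and the remaining 20000 for testing. If the rows are ordered in some way (the first rows shown earlier all share the same day and month, which hints that they may be), a random split might be more representative. A minimal sketch on a small made-up data frame, where demo, n_train, train_idx, demo_training and demo_testing are hypothetical names:

```r
# Small synthetic data frame standing in for bank (not the real data)
set.seed(42)  # arbitrary seed so the split is reproducible
demo <- data.frame(
  age     = sample(18:95, 100, replace = TRUE),
  balance = round(rnorm(100, mean = 1362, sd = 3000))
)

# Draw row indices for the training set at random; the 25211/45211
# proportion mirrors the split used above
n_train   <- floor(nrow(demo) * 25211 / 45211)
train_idx <- sample(nrow(demo), n_train)

demo_training <- demo[train_idx, ]
demo_testing  <- demo[-train_idx, ]

# The two parts together cover every row
stopifnot(nrow(demo_training) + nrow(demo_testing) == nrow(demo))
```

The same pattern would apply to bank by replacing demo with bank; fixing the seed keeps the split reproducible between runs.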

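# Scatter-plot matrix of age, balance, and duration (columns 1, 6, and 12)
bank_temp <- bank[, c("age", "balance", "duration")]
splom(bank_temp)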

[Plot: scatter-plot matrix of age, balance, and duration]