Exploratory data analysis (EDA) is an approach to data analysis that summarises and visualises the important characteristics of a data set. John Tukey, an American mathematician, contributed significantly to the development of EDA and was instrumental in distinguishing it from confirmatory data analysis and initial data analysis. He published the book Exploratory Data Analysis in 1977 and introduced the box plot.
In this post, we will cover the fundamentals of EDA using R and RStudio. The R code used in this tutorial can be found on GitHub. An accompanying video has been uploaded to YouTube. The slides used in the video are available on RPubs.
In simple terms, any method of analysing data that is not part of formal statistical modeling and inference falls into the category of exploratory data analysis. Before we build statistical models or do any other analysis, it is important to examine all the variables in our data set.
Why?
It is not possible to determine the characteristics of the data just by looking at columns or a whole spreadsheet, and this is where EDA comes to our aid. It is also important to keep in mind the problems with EDA:
In the next section, we look at the objectives of EDA.
EDA can be broadly classified into the following:
Quantitative methods include summary statistics, while graphical methods include plots and charts. Univariate methods analyse one variable at a time, while bivariate methods analyse two variables together to examine their underlying relationship (methods covering more than two variables are multivariate). EDA also depends on the role of the variable being examined:
Further division is based on the type of variable being examined:
Univariate analysis includes:
We will explore each of the above in detail in the case study provided at the end of this tutorial.
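As a quick preview before the case study, the sketch below applies one technique from each quadrant of the classification (quantitative or graphical, univariate or bivariate). It uses R's built-in mtcars data set rather than the bank data, purely as an illustrative assumption.

```r
# Illustration only: mtcars ships with base R and is not the case-study data
mpg <- mtcars$mpg   # miles per gallon
wt  <- mtcars$wt    # weight in 1000 lbs

# Univariate, quantitative: summary statistics for one variable
summary(mpg)

# Univariate, graphical: distribution of one variable
hist(mpg, main = "Histogram of mpg", xlab = "Miles per gallon")

# Bivariate, quantitative: strength of the linear relationship
r <- cor(mpg, wt)
r

# Bivariate, graphical: scatter plot of the two variables
plot(wt, mpg, main = "mpg vs weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

The correlation is negative here, which the scatter plot shows as a downward trend: heavier cars tend to have lower fuel efficiency.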
Graphical methods include the following:
Bivariate analysis includes:
Advanced techniques include:
Graphical techniques include:
In this case study, we will select a data set that has all the variable types discussed in Types of EDA. R has many built-in data sets that ship with its packages and can be used directly. Here, however, we will use a data set from the UCI Machine Learning Repository. You can download the data set from here. It is related to the direct marketing campaigns of a Portuguese bank. The description of the data set can be found on the website. Please go through the tutorials on using RStudio and GitHub to better organise your analysis and projects.
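For readers who want to practise on a built-in data set first, the following minimal sketch lists what the datasets package provides and inspects one of them. The choice of iris is only an example and is not used in the rest of this tutorial.

```r
# List the data sets bundled with the 'datasets' package (attached by default)
available <- data(package = "datasets")$results[, "Item"]
head(available)

# iris is one such data set: four numeric columns and one factor
str(iris)
```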
Now, without further delay, let us begin our analysis.
We have the data set in our project folder. Let us explore it:
# check working directory
getwd()
## [1] "H:/R/Exploratory Data Analysis"
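# import the data set into the current workspace
# stringsAsFactors = TRUE keeps text columns as factors in R >= 4.0, matching the output below
data <- read.csv("bank-data.csv", header = TRUE, stringsAsFactors = TRUE)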
# let us explore the data set a bit
View(data) # allows us to view the data set
names(data) # names of the variables
## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "contact" "day"
## [11] "month" "duration" "campaign" "pdays" "previous"
## [16] "poutcome" "y"
dim(data) # dimension (number of rows and columns)
## [1] 45211 17
str(data) # structure of the data set
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
class(data) # type of data
## [1] "data.frame"
head(data, n = 5) # displays the first 5 rows
## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue-collar married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## day month duration campaign pdays previous poutcome y
## 1 5 may 261 1 -1 0 unknown no
## 2 5 may 151 1 -1 0 unknown no
## 3 5 may 76 1 -1 0 unknown no
## 4 5 may 92 1 -1 0 unknown no
## 5 5 may 198 1 -1 0 unknown no
tail(data, n = 5) # displays the last 5 rows
## age job marital education default balance housing loan
## 45207 51 technician married tertiary no 825 no no
## 45208 71 retired divorced primary no 1729 no no
## 45209 72 retired married secondary no 5715 no no
## 45210 57 blue-collar married secondary no 668 no no
## 45211 37 entrepreneur married secondary no 2971 no no
## contact day month duration campaign pdays previous poutcome y
## 45207 cellular 17 nov 977 3 -1 0 unknown yes
## 45208 cellular 17 nov 456 2 -1 0 unknown yes
## 45209 cellular 17 nov 1127 5 184 3 success yes
## 45210 telephone 17 nov 508 4 -1 0 unknown no
## 45211 cellular 17 nov 361 2 188 11 other no
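The Factor columns in the str() output above depend on how text columns are converted at import. Since R 4.0.0, read.csv() leaves them as character vectors unless stringsAsFactors = TRUE is passed. The sketch below shows the difference on a tiny inline CSV; the three rows are made up for illustration and are not read from bank-data.csv.

```r
# A tiny, made-up CSV supplied inline through the text argument
csv_text <- "age,marital,housing
58,married,yes
44,single,yes
33,married,no"

as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$marital)     # "character"
class(as_factor$marital)   # "factor"
levels(as_factor$marital)  # levels are sorted alphabetically
```

Functions such as summary() treat the two representations differently, so it is worth checking the column types right after import.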
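When we run str(data), R displays comprehensive information about the data set. It provides the following details: the class of the object, the number of observations and variables, and, for each variable, its name, its type (with the number of levels for factors) and its first few values.
We will select one variable from the data set for each of the following categories of variables: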
R has a wide range of built-in functions for summary statistics. Before we explore these functions, let us install the necessary packages and load the libraries. Packages can be installed using install.packages() and loaded using library(). Install and load the following packages:
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(pastecs)
## Loading required package: boot
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:survival':
##
## aml
##
## The following object is masked from 'package:lattice':
##
## melanoma
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following object is masked from 'package:Hmisc':
##
## describe
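install.packages() reinstalls a package even when it is already present, so it can be convenient to install only what is missing. The helper below is a hypothetical convenience function written for this illustration (it is not part of any package). The install call is left commented out so that the snippet has no side effects beyond checking which packages are available.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("Hmisc", "pastecs", "psych"))
to_install

# Uncomment to install whatever is missing, then load with library()
# if (length(to_install) > 0) install.packages(to_install)

# When two loaded packages export the same name, pkg::fun() selects one
# explicitly, e.g. psych::describe(x) or Hmisc::describe(x)
```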
Let us run some code and examine the output:
summary(data)
## age job marital education
## Min. :18.0 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.0 management :9458 married :27214 secondary:23202
## Median :39.0 technician :7597 single :12790 tertiary :13301
## Mean :40.9 admin. :5171 unknown : 1857
## 3rd Qu.:48.0 services :4154
## Max. :95.0 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.0 may :13766 Min. : 0 Min. : 1.00
## 1st Qu.: 8.0 jul : 6895 1st Qu.: 103 1st Qu.: 1.00
## Median :16.0 aug : 6247 Median : 180 Median : 2.00
## Mean :15.8 jun : 5341 Mean : 258 Mean : 2.76
## 3rd Qu.:21.0 nov : 3970 3rd Qu.: 319 3rd Qu.: 3.00
## Max. :31.0 apr : 2932 Max. :4918 Max. :63.00
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.00 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.00 other : 1840 yes: 5289
## Median : -1.0 Median : 0.00 success: 1511
## Mean : 40.2 Mean : 0.58 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.00
## Max. :871.0 Max. :275.00
##
Examining the output of the summary() function, we can see that the result differs for continuous and categorical variables. For categorical variables, only the count of each level is provided, while for continuous variables we get the minimum, first quartile, median, mean, third quartile and maximum. Let us subset the data set by selecting the 4 variables mentioned earlier.
var <- c("age", "marital", "education", "housing")
eda_data <- data[var]
names(eda_data)
## [1] "age" "marital" "education" "housing"
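The different behaviour of summary() for continuous and categorical variables can be reproduced on two small vectors. The values below are made up for illustration and are not taken from the bank data.

```r
# Made-up values: one numeric vector and one factor
x_num <- c(18, 33, 39, 48, 95)
x_cat <- factor(c("married", "single", "married", "divorced", "married"))

num_summary <- summary(x_num)   # min, quartiles, median, mean, max
cat_summary <- summary(x_cat)   # count of observations per level

num_summary
cat_summary
```

For a factor, summary() returns the same counts as table(), which is why the categorical columns in the earlier output show only frequencies.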
We have a new data set “eda_data” which we will use for the rest of the analysis. Let us do some univariate analysis on the “age” variable. We begin with quantitative techniques before exploring the graphical techniques.
summary(eda_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 33.0 39.0 40.9 48.0 95.0
The mean (40.9) and median (39) are fairly close, which suggests the distribution is roughly symmetric; the mean lying slightly above the median hints at a mild right skew. Closeness of the two does not by itself establish normality, so we will check the shape with a graphical technique at a later stage. We will use the fivenum() function next. The fivenum() function displays, in order, the minimum, lower hinge, median, upper hinge and maximum.
fivenum(eda_data$age)
## [1] 18 33 39 48 95
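The hinges returned by fivenum() are Tukey's version of the quartiles and need not equal the values from quantile(), whose default interpolation differs. For the age variable the two happen to agree (33 and 48 in both outputs above), but on a small made-up vector the difference is visible:

```r
x <- 1:8   # made-up data for illustration

hinges    <- fivenum(x)                    # 1.0 2.5 4.5 6.5 8.0
quartiles <- quantile(x, c(0.25, 0.75))    # 2.75 and 6.25 with the default type = 7

hinges
quartiles
```

With large samples such as ours the two definitions rarely differ by much, so either is fine for exploration.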
describe(eda_data$age)
## vars n mean sd median trimmed mad min max range skew kurtosis
## 1 1 45211 40.94 10.62 39 40.25 10.38 18 95 77 0.68 0.32
## se
## 1 0.05
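The describe() function used here comes from the psych package, which was loaded last and therefore masks the Hmisc function of the same name (see the loading messages above). In addition to the values reported by summary(), it displays the number of observations, standard deviation, trimmed mean, median absolute deviation (mad), range, skewness, kurtosis and standard error of the mean (se).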
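To make the skew and kurtosis columns less of a black box, here is a minimal base R sketch of the plain moment-based definitions. The function names are hypothetical, and describe() may apply a slightly different small-sample adjustment, so treat the numbers as approximately comparable rather than identical.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

x <- c(2, 3, 3, 4, 4, 4, 5, 20)   # made-up data with one large value
skewness(x)          # positive: long right tail
excess_kurtosis(x)
skewness(1:5)        # 0 for perfectly symmetric data
```

A positive skewness, like the 0.68 reported for age, means the right tail is longer than the left.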
stat.desc(eda_data$age)
## nbr.val nbr.null nbr.na min max
## 4.521e+04 0.000e+00 0.000e+00 1.800e+01 9.500e+01
## range sum median mean SE.mean
## 7.700e+01 1.851e+06 3.900e+01 4.094e+01 4.994e-02
## CI.mean.0.95 var std.dev coef.var
## 9.788e-02 1.128e+02 1.062e+01 2.594e-01
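The stat.desc() function, which is part of the pastecs package, displays the following additional statistics: the number of values, null values and missing values (nbr.val, nbr.null, nbr.na), the sum, the standard error of the mean (SE.mean), the half-width of the 95% confidence interval of the mean (CI.mean.0.95), the variance, the standard deviation and the coefficient of variation (coef.var).
We now move on to the graphical techniques that can be employed for EDA and begin with the histogram. The age variable is positively skewed: the histogram shows only marginal frequencies for ages greater than 60, which gives the distribution a long right tail.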
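Three of the stat.desc() values can be reproduced with base R, which helps in reading the output. For the age variable, 10.62 / sqrt(45211) gives approximately 0.0499, matching SE.mean above. The sketch below repeats the calculation on a small made-up vector; the variable names are chosen for illustration.

```r
x <- c(23, 31, 35, 39, 42, 48, 57, 64)   # made-up ages for illustration
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                  # corresponds to SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # corresponds to CI.mean.0.95
coef_var <- sd(x) / mean(x)                  # corresponds to coef.var

c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var)

# The 95% confidence interval for the mean
ci <- mean(x) + c(-1, 1) * ci_half
ci
```

Note that CI.mean.0.95 is the half-width of the interval, so it is added to and subtracted from the mean.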
hist(eda_data$age,
main = "Histogram of Age",
xlab = "Age in Years")
boxplot(eda_data$age,
main = toupper("Boxplot of Age"),
ylab = "Age in years",
col = "blue")
d <- density(eda_data$age)
plot(d, main = "Kernel density of Age")
polygon(d, col = "red", border = "blue")
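A normal Q-Q plot is another common graphical check for the question of approximate normality raised earlier. The sketch below uses simulated right-skewed values as a stand-in (an assumption made so that the snippet runs on its own); with the case-study data one would pass eda_data$age instead.

```r
set.seed(42)                               # reproducible simulation
x <- rgamma(1000, shape = 4, scale = 10)   # simulated right-skewed stand-in for age

# Sample quantiles against theoretical normal quantiles
qqnorm(x, main = "Normal Q-Q plot")
# Reference line through the first and third quartiles
qqline(x, col = "red")

# Points curving upward away from the line at the right end
# indicate a long right tail, i.e. positive skew
```

If the points follow the reference line closely, the normal approximation is reasonable; systematic curvature suggests skewness or heavy tails.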