Exploratory Data Analysis with R

Exploratory data analysis (EDA) is an approach to data analysis for summarising and visualising the important characteristics of a data set. John Tukey, an American mathematician, contributed significantly to the development of EDA and was instrumental in distinguishing it from confirmatory data analysis and initial data analysis. He published the book Exploratory Data Analysis in 1977 and introduced the box plot.

In this post, we will cover the fundamentals of EDA using R and RStudio. The R code used in this tutorial can be found on GitHub. An accompanying video has been uploaded to YouTube. The slides used in the video are available at RPubs.

Overview & Objectives

In simple terms, any method of analysing data that is not part of formal statistical modeling and inference falls into the category of exploratory data analysis. Before we build statistical models or do any other analysis, it is important to examine all the variables in our data set.

Why?

  • Discover any existing patterns
  • Identify mistakes
  • Suggest hypotheses that can be tested
  • Check assumptions
  • Identify relationships among explanatory variables
  • Assess the direction and size of the relationship between explanatory and outcome variables

It is not possible for us to determine the characteristics of data by looking at columns or a whole spreadsheet, and this is where EDA comes to our aid. It is also important to keep in mind the limitations of EDA:

  • Does not have the same rigor as formal statistical techniques
  • It cannot be used for prediction or inference
  • Not efficient in case of voluminous data

In the next section, we look at the types of EDA.

Types of EDA

EDA can be broadly classified into the following:

  • Graphical or Quantitative
  • Univariate or Bivariate

Quantitative methods include summary statistics, while graphical methods include plots and charts. Univariate methods analyse one variable at a time, while bivariate methods analyse two variables together to examine the relationship between them (with more than two variables, the analysis is multivariate). EDA also depends on the role of the variable being examined:

  • Outcome variable
  • Explanatory variable

Further division is based on the type of variable being examined:

  • Categorical variable
    • Binary
    • Ordinal
    • Nominal
  • Quantitative variable
    • Discrete
    • Continuous
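
The variable types listed above map onto R's own data types. The following is a minimal sketch using a small made-up data frame; the toy object, its column names and its values are hypothetical and are not part of the case study data.

```r
# Hypothetical toy data; all values are made up for illustration
toy <- data.frame(
  age       = c(58.5, 44.0, 33.2, 47.8),                       # quantitative, continuous
  children  = c(2L, 0L, 1L, 3L),                               # quantitative, discrete
  housing   = factor(c("yes", "no", "yes", "yes")),            # categorical, binary
  marital   = factor(c("married", "single", "married", "divorced")),  # categorical, nominal
  education = factor(c("tertiary", "secondary", "primary", "tertiary"),
                     levels = c("primary", "secondary", "tertiary"),
                     ordered = TRUE)                           # categorical, ordinal
)

# First class of every column: numeric, integer, factor or ordered
var_types <- sapply(toy, function(x) class(x)[1])
print(var_types)

# A binary variable is a factor with exactly two levels
nlevels(toy$housing)

# Ordered factors support comparisons; unordered factors do not
toy$education[1] > toy$education[2]
```

Declaring an ordinal variable as an ordered factor is a design choice: it lets R respect the ranking of the levels in comparisons and in plots.
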

Univariate

Univariate Quantitative

Univariate analysis includes:

Measures of central tendency:
  • Mean
  • Median
  • Mode
Measures of dispersion:
  • Min
  • Max
  • Range
  • Quartiles
  • Variance
  • Standard deviation
Other measures include:
  • Skewness
  • Kurtosis

We will explore each of the above in detail in the case study provided at the end of this tutorial.
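
Before we reach the case study, here is a minimal base R sketch of the measures listed above, computed on a small hypothetical vector of ages. Base R has no built-in function for the statistical mode, skewness or kurtosis, so the sketch computes them by hand; the moment-based formulas used here are one common convention among several, and packages may use a slightly different one.

```r
x <- c(23, 35, 35, 41, 48, 52, 35, 60, 29, 44)  # hypothetical ages

# Measures of central tendency
x_mean   <- mean(x)
x_median <- median(x)
# Mode: the most frequent value (the first one, if there are ties)
x_mode   <- as.numeric(names(which.max(table(x))))

# Measures of dispersion
x_range   <- diff(range(x))                          # max - min
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
x_var     <- var(x)                                  # sample variance
x_sd      <- sd(x)                                   # sample standard deviation

# Moment-based skewness and excess kurtosis
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
x_skew <- mean(z^3)
x_kurt <- mean(z^4) - 3

print(c(mean = x_mean, median = x_median, mode = x_mode, range = x_range,
        variance = x_var, sd = x_sd, skewness = x_skew, kurtosis = x_kurt))
print(quartiles)
```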

Univariate Graphical

Graphical methods include the following:

  • Histogram
  • Box plots
  • Bar plots
  • Kernel density plots
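
Histograms, box plots and kernel density plots suit quantitative variables, while bar plots suit categorical ones. As a minimal sketch of the latter, the code below draws a bar plot from a frequency table; the marital vector is hypothetical and only stands in for a real categorical column.

```r
# Hypothetical categorical variable
marital <- factor(c("married", "single", "married", "divorced",
                    "married", "single"))

# Frequency of each level
counts <- table(marital)
print(counts)

# barplot() draws one bar per level and returns the bar midpoints
mids <- barplot(counts,
                main = "Bar plot of Marital Status",
                xlab = "Marital status",
                ylab = "Frequency")
```
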

Bivariate

Bivariate Quantitative

Bivariate analysis includes:

  • Crosstabs
  • Covariance
  • Correlation
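
The three measures above can be sketched with base R functions: table() for crosstabs of two categorical variables, and cov() and cor() for two quantitative variables. All vectors below are hypothetical and chosen only to keep the example small.

```r
# Two hypothetical categorical variables
housing  <- factor(c("yes", "no", "yes", "yes", "no", "no"))
response <- factor(c("no", "no", "yes", "no", "yes", "no"))

# Crosstab of counts, then row proportions (each row sums to 1)
xtab <- table(housing, response)
print(xtab)
print(prop.table(xtab, margin = 1))

# Two hypothetical quantitative variables
age     <- c(25, 32, 41, 47, 53, 60)
balance <- c(300, 450, 800, 950, 1200, 1500)

# Covariance depends on the units; correlation is scaled to [-1, 1]
cv <- cov(age, balance)
r  <- cor(age, balance)   # Pearson correlation by default
print(c(covariance = cv, correlation = r))
```

Correlation is simply the covariance divided by the product of the two standard deviations, which is why it is easier to compare across pairs of variables.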

Advanced techniques include:

  • Cluster analysis
  • Analysis of variance (ANOVA)
  • Factor analysis
  • Principal component analysis (PCA)

Bivariate Graphical

Graphical techniques include:

  • Scatterplot
  • Box plot
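
As a minimal sketch of both techniques, the code below draws a scatterplot of two quantitative variables and a box plot of a quantitative variable split by a categorical one. The data are simulated, so the variable names only mimic the case study and the relationship between them is an assumption built into the simulation.

```r
set.seed(1)  # make the simulated data reproducible
n <- 100
age     <- round(runif(n, min = 18, max = 80))
balance <- 20 * age + rnorm(n, sd = 300)     # assumed linear relationship
housing <- factor(sample(c("no", "yes"), n, replace = TRUE))

# Scatterplot: two quantitative variables
plot(age, balance,
     main = "Balance versus Age",
     xlab = "Age in years",
     ylab = "Balance")

# Box plot by group: quantitative variable split by a categorical one
bp <- boxplot(age ~ housing,
              main = "Age by Housing Loan",
              xlab = "Housing loan",
              ylab = "Age in years")
print(bp$names)   # the groups that were plotted
```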

Case Study

In this case study, we will select a data set that has all the variable types discussed in Types of EDA. R has many built-in data sets that ship with its packages, but here we will use a data set from the UCI Machine Learning Repository. You can download the data set from here. It is related to the direct marketing campaigns of a Portuguese bank. The description of the data set can be found at the website. Please go through the tutorials on using RStudio and GitHub to better organise your analysis and projects.

Now, without further delay, let us begin our analysis.

  • Step 1 : Create a new project in RStudio and name it Exploratory Data Analysis.
  • Step 2 : Create a new R file and save it as Codes.
  • Step 3 : Download the bank direct marketing data set from here.

We have the data set in our project folder. Let us begin the EDA:

# check working directory
getwd()
## [1] "H:/R/Exploratory Data Analysis"
# import the data set into the current workspace
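# stringsAsFactors = TRUE keeps text columns as factors, as in the str() output
# below (since R 4.0.0 the default is FALSE and text columns are read as character)
data <- read.csv("bank-data.csv", header = TRUE, stringsAsFactors = TRUE)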

# let us explore the data set a bit
View(data)  # allows us to view the data set
names(data)  # names of the variables 
##  [1] "age"       "job"       "marital"   "education" "default"  
##  [6] "balance"   "housing"   "loan"      "contact"   "day"      
## [11] "month"     "duration"  "campaign"  "pdays"     "previous" 
## [16] "poutcome"  "y"
dim(data)  # dimension (number of rows and columns)
## [1] 45211    17
str(data)  # structure of the data set
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
class(data)  # type of data
## [1] "data.frame"
head(data, n = 5)  # displays the first 5 rows 
##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
##   day month duration campaign pdays previous poutcome  y
## 1   5   may      261        1    -1        0  unknown no
## 2   5   may      151        1    -1        0  unknown no
## 3   5   may       76        1    -1        0  unknown no
## 4   5   may       92        1    -1        0  unknown no
## 5   5   may      198        1    -1        0  unknown no
tail(data, n = 5)  # displays the last 5 rows
##       age          job  marital education default balance housing loan
## 45207  51   technician  married  tertiary      no     825      no   no
## 45208  71      retired divorced   primary      no    1729      no   no
## 45209  72      retired  married secondary      no    5715      no   no
## 45210  57  blue-collar  married secondary      no     668      no   no
## 45211  37 entrepreneur  married secondary      no    2971      no   no
##         contact day month duration campaign pdays previous poutcome   y
## 45207  cellular  17   nov      977        3    -1        0  unknown yes
## 45208  cellular  17   nov      456        2    -1        0  unknown yes
## 45209  cellular  17   nov     1127        5   184        3  success yes
## 45210 telephone  17   nov      508        4    -1        0  unknown  no
## 45211  cellular  17   nov      361        2   188       11    other  no

When we run str(data), R displays comprehensive information about the data set. It provides the following details:

  • Type of data set: dataframe, vector, matrix, list
  • Dimensions of the data set: Number of rows and columns
  • Variable name
  • Variable type: Factor, integer, numeric

We will select one variable from the data set for the following category of variables:

  • Continuous: age
  • Binary: housing
  • Nominal: marital
  • Ordinal: education

R has a wide range of built-in functions for summary statistics. Before we explore these functions, let us install the necessary packages and load the libraries. Packages can be installed using install.packages() and loaded using library(). Install and load the following packages:

  • install.packages("Hmisc")
  • install.packages("pastecs")
  • install.packages("psych")

library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
library(pastecs)
## Loading required package: boot
## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:survival':
## 
##     aml
## 
## The following object is masked from 'package:lattice':
## 
##     melanoma
library(psych)
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:Hmisc':
## 
##     describe

Let us run some code and examine the output:

summary(data)
##       age                job           marital          education    
##  Min.   :18.0   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.0   management :9458   married :27214   secondary:23202  
##  Median :39.0   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.9   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.0   services   :4154                                     
##  Max.   :95.0   retired    :2264                                     
##                 (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day           month          duration       campaign    
##  Min.   : 1.0   may    :13766   Min.   :   0   Min.   : 1.00  
##  1st Qu.: 8.0   jul    : 6895   1st Qu.: 103   1st Qu.: 1.00  
##  Median :16.0   aug    : 6247   Median : 180   Median : 2.00  
##  Mean   :15.8   jun    : 5341   Mean   : 258   Mean   : 2.76  
##  3rd Qu.:21.0   nov    : 3970   3rd Qu.: 319   3rd Qu.: 3.00  
##  Max.   :31.0   apr    : 2932   Max.   :4918   Max.   :63.00  
##                 (Other): 6060                                 
##      pdays          previous         poutcome       y        
##  Min.   : -1.0   Min.   :  0.00   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.00   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.00   success: 1511              
##  Mean   : 40.2   Mean   :  0.58   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.00                              
##  Max.   :871.0   Max.   :275.00                              
## 

Examining the output of the summary() function, we can see that the result is different for continuous and categorical variables. For categorical variables, only the count of each level is provided, while for continuous variables we get the minimum, the quartiles, the mean and the maximum. Let us subset the data set by selecting the 4 variables mentioned earlier.

var <- c("age", "marital", "education", "housing")
eda_data <- data[var]
names(eda_data)
## [1] "age"       "marital"   "education" "housing"
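
Returning briefly to the summary() output discussed above, the difference in how it treats the two kinds of variables can be seen on a small made-up example. The two vectors below are hypothetical and are not taken from the bank data.

```r
# Hypothetical quantitative and categorical variables
ages    <- c(58, 44, 33, 47, 33)
marital <- factor(c("married", "single", "married", "married", "single"))

# For a numeric vector: minimum, quartiles, mean and maximum
num_summary <- summary(ages)
print(num_summary)

# For a factor: the count of each level
cat_summary <- summary(marital)
print(cat_summary)
```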

We have a new data set “eda_data” which we will use for the rest of the analysis. Let us do some univariate analysis on the “age” variable. We begin with quantitative techniques before exploring the graphical techniques.

summary(eda_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    33.0    39.0    40.9    48.0    95.0

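The mean and median are fairly close, which suggests that the distribution of the variable is roughly symmetric; closeness of the two does not by itself establish normality, and the mean being slightly higher than the median hints at a mild positive skew (we will check this with a graphical technique at a later stage). We will use the fivenum() function next. The fivenum() function displays Tukey's five-number summary: the minimum, lower hinge, median, upper hinge and maximum.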

fivenum(eda_data$age)
## [1] 18 33 39 48 95
describe(eda_data$age)
##   vars     n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 45211 40.94 10.62     39   40.25 10.38  18  95    77 0.68     0.32
##     se
## 1 0.05

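The describe() function used here comes from the psych package (loading psych masks the describe() function of Hmisc, as the messages printed while loading the packages show). It displays the following additional statistics:

  • Number of valid (non-missing) observations
  • Standard deviation
  • Trimmed mean
  • Median absolute deviation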
  • Skewness
  • Kurtosis
  • Standard error
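
A few of the less familiar entries in this list can be reproduced with base R. The sketch below uses a hypothetical vector of ages and assumes the defaults that the psych version of describe() appears to use (a 10% trim for the trimmed mean, and a median absolute deviation scaled by the constant 1.4826); treat those defaults as assumptions and check the package documentation before relying on them.

```r
x <- c(18, 25, 31, 33, 39, 39, 44, 48, 57, 95)  # hypothetical ages

# Trimmed mean: drop the lowest and highest 10% of values, then average
trimmed_mean <- mean(x, trim = 0.1)

# Median absolute deviation from the median, unscaled and scaled
mad_raw    <- mad(x, constant = 1)
mad_scaled <- mad(x)               # default constant is 1.4826

# Standard error of the mean
se_mean <- sd(x) / sqrt(length(x))

print(c(mean = mean(x), trimmed = trimmed_mean,
        mad_raw = mad_raw, mad_scaled = mad_scaled, se = se_mean))
```

The trimmed mean and the median absolute deviation are robust counterparts of the mean and the standard deviation: the single extreme value of 95 pulls the ordinary mean upwards but has little effect on the trimmed mean.
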
stat.desc(eda_data$age)
##      nbr.val     nbr.null       nbr.na          min          max 
##    4.521e+04    0.000e+00    0.000e+00    1.800e+01    9.500e+01 
##        range          sum       median         mean      SE.mean 
##    7.700e+01    1.851e+06    3.900e+01    4.094e+01    4.994e-02 
## CI.mean.0.95          var      std.dev     coef.var 
##    9.788e-02    1.128e+02    1.062e+01    2.594e-01

The stat.desc() function, which is part of the pastecs package, displays the following additional statistics:

  • Variance
  • Coefficient of variation
  • Confidence interval for the mean (CI.mean.0.95, reported as the half-width of the interval)
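
These quantities follow from the mean, standard deviation and sample size. The sketch below recomputes them from the rounded values printed in the output above (n = 45211, mean = 40.94, sd = 10.62), so small differences in the last digit are to be expected; the t-based formula for the confidence interval half-width is an assumption, chosen because it agrees with the printed value.

```r
# Rounded values taken from the describe() and stat.desc() output above
n <- 45211
m <- 40.94
s <- 10.62

# Standard error of the mean
se <- s / sqrt(n)

# Half-width of the 95% confidence interval for the mean (t distribution)
ci_half <- qt(0.975, df = n - 1) * se

# Coefficient of variation: standard deviation relative to the mean
coef_var <- s / m

print(c(SE.mean = se, CI.mean.0.95 = ci_half, coef.var = coef_var))

# The interval itself is mean +/- half-width
print(c(lower = m - ci_half, upper = m + ci_half))
```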

We now move on to the graphical techniques that can be employed for EDA and begin with the histogram. The age variable is positively skewed: the histogram shows only marginal frequencies for ages greater than 60.

Histogram

hist(eda_data$age,
     main = "Histogram of Age",
     xlab = "Age in Years")

[Figure: histogram of age]

Boxplot

boxplot(eda_data$age,
        main = toupper("Boxplot of Age"),
        ylab = "Age in years",
        col = "blue")

[Figure: box plot of age]

Kernel Density plot

d <- density(eda_data$age)
plot(d, main = "Kernel density of Age")
polygon(d, col = "red", border = "blue")

[Figure: kernel density plot of age]