“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”

John Tukey

Introduction

The purpose of Exploratory Data Analysis (EDA) is to obtain a deeper understanding of a data set and to extract meaningful insights about the problems associated with it. There is no hard and fast rule on how to conduct an EDA, but identifying underlying patterns through visualization, summarization and modeling of the data is a fundamental part of any data analysis venture. EDA is inherently an iterative cycle with the following steps:

  • Generate questions about a data set
  • Try to answer, or better understand, the questions through visualization or modeling of the data
  • Use the obtained insights to refine the understanding of the data or modify existing questions or generate new questions

EDA strengthens the analyst's ability to handle and understand data independently and encourages the use of one’s own judgement. For more details on EDA in \(\textrm{R}\), see \(\textit{Exploratory Data Analysis Using R}\) (Pearson 2018).

Exploratory Data Analysis Example 1

The following data set on the birth weight of children was collected at Baystate Medical Center, Springfield, Mass during 1986. To load the data set in \(\textrm{R}\), run the following commands in the \(\textrm{R}\) console:

library(MASS)
mydata <- birthwt

Type ?birthwt to get the details of the data set. We can view the data set by simply printing it on the console with the command print(mydata) or we can use the following commands:

#View(mydata) #To open the data set in a spreadsheet-style environment

head(mydata) #Print first 6 rows of the data set
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622
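
Besides head(), a few other base \(\textrm{R}\) functions give a quick first impression of the data. The following is a minimal sketch, assuming the data set has been loaded as above:

```r
library(MASS)          # provides the birthwt data set
mydata <- birthwt

dim(mydata)            # number of rows and columns
names(mydata)          # variable names
tail(mydata, n = 3)    # last 3 rows of the data set
```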

The primary steps of EDA include data cleaning and pre-processing. Our next task is to understand the structure of the data set and to check for missing values. The following commands display the structure and count the missing values:

str(mydata)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
sum(is.na(mydata)) #To check whether there are any missing values
## [1] 0
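
The single total above does not show where missing values occur. When a data set does contain them, a per-variable count is usually more informative. A minimal sketch (the object names na_per_column and complete_rows are only illustrative):

```r
library(MASS)
mydata <- birthwt

na_per_column <- colSums(is.na(mydata))       # missing values in each variable
na_per_column
complete_rows <- sum(complete.cases(mydata))  # rows without any missing value
complete_rows
```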

Now we have to identify the level of measurement of each variable. According to the documentation (?birthwt), age, lwt, ptl, ftv and bwt are numeric, while the remaining variables (low, race, smoke, ht and ui) are nominal codes stored as integers. Converting these variables into factors is necessary for further analysis.

mydata[,c("low", "race", "smoke", "ht", "ui")] <- lapply(mydata[,c("low", "race", "smoke", "ht", "ui")], FUN = factor)

#Checking the structure of the transformed data

str(mydata)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
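
The factor levels above are still the raw codes. For more readable tables and plots, one may attach descriptive labels. The sketch below works on a separate copy (the name labelled is illustrative) so that mydata stays as shown; the race coding 1 = white, 2 = black, 3 = other is taken from ?birthwt, while the smoke labels are our own wording:

```r
library(MASS)
labelled <- birthwt   # separate copy, so mydata is left unchanged

labelled$race  <- factor(labelled$race, levels = c(1, 2, 3),
                         labels = c("white", "black", "other"))
labelled$smoke <- factor(labelled$smoke, levels = c(0, 1),
                         labels = c("non-smoker", "smoker"))
levels(labelled$race)   # labels replace the numeric codes
table(labelled$smoke)   # frequencies under the new labels
```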

Summarizing data includes calculating measures of central tendency, dispersion, skewness, kurtosis, etc. for continuous variables. For categorical variables, one can obtain the frequencies.

summary(mydata)
##  low          age             lwt        race   smoke        ptl        
##  0:130   Min.   :14.00   Min.   : 80.0   1:96   0:115   Min.   :0.0000  
##  1: 59   1st Qu.:19.00   1st Qu.:110.0   2:26   1: 74   1st Qu.:0.0000  
##          Median :23.00   Median :121.0   3:67           Median :0.0000  
##          Mean   :23.24   Mean   :129.8                  Mean   :0.1958  
##          3rd Qu.:26.00   3rd Qu.:140.0                  3rd Qu.:0.0000  
##          Max.   :45.00   Max.   :250.0                  Max.   :3.0000  
##  ht      ui           ftv              bwt      
##  0:177   0:161   Min.   :0.0000   Min.   : 709  
##  1: 12   1: 28   1st Qu.:0.0000   1st Qu.:2414  
##                  Median :0.0000   Median :2977  
##                  Mean   :0.7937   Mean   :2945  
##                  3rd Qu.:1.0000   3rd Qu.:3487  
##                  Max.   :6.0000   Max.   :4990
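
For a single categorical variable, the frequencies can also be obtained directly, together with the relative frequencies. A minimal sketch using race (the object names are illustrative):

```r
library(MASS)
mydata <- birthwt
mydata$race <- factor(mydata$race)

race_counts <- table(mydata$race)                  # absolute frequencies
race_props  <- round(prop.table(race_counts), 3)   # relative frequencies
race_counts
race_props
```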

For a continuous variable, we can calculate further measures:

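#Most frequent value; named stat_mode because base R's mode() returns the storage mode
stat_mode <- function(x)
{
  uniq <- unique(x)
  uniq[which.max(tabulate(match(x, uniq)))] #first value with the highest count
}
stat_mode(mydata$age)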
## [1] 20
sd(mydata$age)
## [1] 5.298678
var(mydata$age)
## [1] 28.07599
IQR(mydata$age)
## [1] 7
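
Skewness and kurtosis have no built-in functions in base \(\textrm{R}\). The sketch below defines simple moment-based versions; the function names are our own, and the moments divide by n, so the values can differ slightly from those of add-on packages that apply bias corrections:

```r
library(MASS)
mydata <- birthwt

# Moment-based skewness: third central moment over the 3/2 power of the variance
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Moment-based kurtosis: equals 3 for a normal distribution
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

age_skew <- skewness(mydata$age)
age_kurt <- kurtosis(mydata$age)
age_skew
age_kurt
```

A skewness near 0 indicates a roughly symmetric distribution, and a kurtosis far from 3 indicates tails that are heavier or lighter than those of a normal distribution.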

The next step is to graphically visualize the data set. Graphical exploration of the data set is a crucial part of EDA as it helps us to detect patterns in the data set in a meaningful way.

par(mfrow = c(3, 2))
for(i in c("age", "lwt", "bwt"))
{
  qqnorm(mydata[, i], main = paste("Normal Q-Q Plot of", i))
  qqline(mydata[, i])
  hist(mydata[, i], probability = TRUE, xlab = i, main = paste("Histogram of", i))
  lines(density(mydata[, i]), col = "blue")
}

par(mfrow = c(1, 2))
for(i in c("ptl", "ftv"))
{
  barplot(table(mydata[, i]), main = paste("Barplot of", i) )
}

par(mfrow = c(1, 1))
#Scatter plots
#Relationships between continuous variables
plot(mydata[, c("age", "lwt", "bwt")])

#Mosaic plot
#Association between categorical variables
mosaicplot(race~low, data = mydata, shade = TRUE, main = "Mosaic plot for low birth weight against race")
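
The scatter plots and the mosaic plot can be complemented by their numerical counterparts: a correlation matrix for the continuous variables and a contingency table for the categorical ones. A minimal sketch (the object names cor_mat and race_low are illustrative):

```r
library(MASS)
mydata <- birthwt

# Pearson correlations between the continuous variables
cor_mat <- cor(mydata[, c("age", "lwt", "bwt")])
round(cor_mat, 2)

# Contingency table of race against low birth weight, with row proportions
race_low <- table(race = mydata$race, low = mydata$low)
race_low
round(prop.table(race_low, margin = 1), 2)
```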

Box plots can be drawn to compare the distribution of a continuous variable across the levels of a categorical variable. For example,

boxplot(bwt ~ race, data = mydata)
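
The quantities displayed in the box plot can also be computed for each group. A minimal sketch using tapply() (the object names are illustrative):

```r
library(MASS)
mydata <- birthwt
mydata$race <- factor(mydata$race)

# Median and IQR of birth weight (in grams) within each race category
bwt_median <- tapply(mydata$bwt, mydata$race, median)
bwt_iqr    <- tapply(mydata$bwt, mydata$race, IQR)
bwt_median
bwt_iqr
```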

Exercises

\(\large{\textbf{Exercise 1:}}\)

Load the survival data set bladder1 in the \(\textrm{R}\) console by running the following commands:

library(survival)
bladder1

The next step is to understand the data.

  • Find the class of the loaded data (e.g. data frame or matrix)

  • Find the number of variables and observations in the data

  • Find the level of measurement of each variable in the data set

Summarize the data according to the requirement.

  • Convert into appropriate level of measurement whenever necessary

  • Derive the appropriate summary statistics for each variable

  • Find the number of missing values

  • Find a way to deal with the missing values based on your judgement

  • What do the summary statistics tell you about the data?

Visualize the data.

  • Draw appropriate plots for each variable. Properly define axis names, plot titles and customize the plots according to the need and better visibility

  • Draw plots involving two or more variables and try to decipher whether there are any relations between them

  • What can you tell about the data from different plots?

Model the data.

  • Try to establish suitable regression models using the different variables in the data set.

  • Verify the assumptions behind the regression models you are choosing

  • Which model explains the data set most efficiently?

  • Identify the significant variables in the model

  • Explain the regression outputs

  • What do you conclude from your model about the data?
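
As a starting point for the modelling steps, the sketch below fits a linear regression to the birthwt data from Example 1 rather than to the exercise data. The choice of predictors is only an illustration, not a recommended model:

```r
library(MASS)
mydata <- birthwt
mydata[, c("race", "smoke")] <- lapply(mydata[, c("race", "smoke")], FUN = factor)

# Linear model for birth weight with one numeric and two categorical predictors
fit <- lm(bwt ~ lwt + smoke + race, data = mydata)
summary(fit)      # coefficient estimates, standard errors and p-values
#plot(fit)        #To inspect the residual diagnostic plots
```

The diagnostic plots help in checking assumptions such as constant residual variance and approximate normality of the residuals.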

\(\large{\textbf{Exercise 2:}}\)

Load the reliability data set capacitor in the \(\textrm{R}\) console by running the following commands:

library(survival)
data(reliability)
capacitor

Apply the steps from \(\large{\textbf{Exercise 1}}\). Try to gain as much insight as possible from the data.

Pearson, Ronald K. 2018. Exploratory Data Analysis Using R. Chapman and Hall/CRC. https://doi.org/10.1201/9781315382111.