“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”
— John Tukey
The purpose of Exploratory Data Analysis (EDA) is to gain a deeper understanding of a data set and to extract meaningful insights about the problems associated with it. There is no hard and fast rule for conducting an EDA. Identifying underlying patterns through visualization, summarization and modeling of the data is a fundamental part of any data analysis venture. EDA is inherently an iterative cycle.
EDA strengthens the analyst's ability to handle and understand data independently, and it encourages the use of one’s own judgement. For more details on EDA in \(\textrm{R}\), see \(\textit{Exploratory Data Analysis Using R}\) (Pearson 2018).
The following data set on the birth weights of children was collected at Baystate Medical Center, Springfield, Massachusetts, in 1986. To load the data set, run the following commands in the \(\textrm{R}\) console:
library(MASS)
mydata <- birthwt
Type ?birthwt to see the details of the data set. We can view the data set by printing it on the console with print(mydata), or we can use the following commands:
#View(mydata) #To open the data set in a spreadsheet-style environment
head(mydata) #Print first 6 rows of the data set
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
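Before examining the structure in detail, it can help to check the size of the data set and its column names. The following is a minimal sketch, assuming the MASS package is installed; the names dims and vars are illustrative only.

```r
library(MASS)  # provides the birthwt data set

mydata <- birthwt

dims <- dim(mydata)    # number of rows and number of columns
vars <- names(mydata)  # column names

dims
vars
tail(mydata, n = 3)    # last 3 rows, complementing head()
```

The dimensions reported here should agree with the observation and variable counts printed by str().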
The primary steps of EDA include data cleaning and pre-processing. Our next task is to understand the structure of the data set, which the following command displays:
str(mydata)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
sum(is.na(mydata)) #To check whether there are any missing values
## [1] 0
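The total above is zero, so nothing further is needed for this data set. For data that do contain missing values, a per-column breakdown is usually more informative than a single total. The sketch below uses a small made-up data frame (toy is a hypothetical example, not part of the birth weight data) to show one way of obtaining it.

```r
# Toy data frame with deliberately missing entries (made-up values)
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, NA),
  z = c(TRUE, FALSE, TRUE, TRUE)
)

na_per_column <- colSums(is.na(toy))  # number of missing values in each column
rows_complete <- complete.cases(toy)  # TRUE for rows without any missing value

na_per_column
sum(rows_complete)                    # number of fully observed rows
```

Knowing which columns carry the missing values helps in deciding whether to drop rows, drop variables, or impute.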
Now we have to identify the level of measurement of each variable. Except for age, lwt, ptl, ftv and bwt, all the other variables are categorical, even though they are stored as integers. Converting these variables into factors is necessary for further analysis.
mydata[,c("low", "race", "smoke", "ht", "ui")] <- lapply(mydata[,c("low", "race", "smoke", "ht", "ui")], FUN = factor)
#Checking the structure of the transformed data
str(mydata)
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
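The factor levels above are still the raw numeric codes. For more readable tables and plots, one could attach descriptive labels. The sketch below does this on a separate copy (the hypothetical name labelled), so the outputs shown in the rest of this section are unaffected. The labels follow the coding described in ?birthwt (race: 1 = white, 2 = black, 3 = other; smoke: 0 = no, 1 = yes) and are worth checking against the help page.

```r
library(MASS)

# Work on a copy so that the original codes remain available
labelled <- birthwt

# levels = the codes in the data, labels = the names to display
labelled$race <- factor(labelled$race, levels = 1:3,
                        labels = c("white", "black", "other"))
labelled$smoke <- factor(labelled$smoke, levels = 0:1,
                         labels = c("no", "yes"))

levels(labelled$race)
table(labelled$race)
```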
Summarizing data includes calculating measures of central tendency, dispersion, skewness, kurtosis, etc. for continuous variables. For categorical data, one can obtain the frequencies.
summary(mydata)
## low age lwt race smoke ptl
## 0:130 Min. :14.00 Min. : 80.0 1:96 0:115 Min. :0.0000
## 1: 59 1st Qu.:19.00 1st Qu.:110.0 2:26 1: 74 1st Qu.:0.0000
## Median :23.00 Median :121.0 3:67 Median :0.0000
## Mean :23.24 Mean :129.8 Mean :0.1958
## 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:0.0000
## Max. :45.00 Max. :250.0 Max. :3.0000
## ht ui ftv bwt
## 0:177 0:161 Min. :0.0000 Min. : 709
## 1: 12 1: 28 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
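For the categorical variables, summary() reports only the counts. Relative frequencies and cross-tabulations can be obtained with table() and prop.table(). The following sketch works directly on birthwt; the object names are illustrative.

```r
library(MASS)

smoke_counts <- table(birthwt$smoke)      # absolute frequencies
smoke_props  <- prop.table(smoke_counts)  # relative frequencies

smoke_counts
round(smoke_props, 3)

# Two-way table of race by low birth weight indicator
race_low <- table(race = birthwt$race, low = birthwt$low)

# Row proportions: share of low birth weight within each race group
round(prop.table(race_low, margin = 1), 3)
```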
For a continuous variable, we can calculate further measures:
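# Named stat_mode to avoid masking base R's mode(), which returns the storage mode
stat_mode <- function(x) {
  uniq <- unique(x)
  # Most frequent value; ties resolve to the value that appears first in x
  uniq[which.max(tabulate(match(x, uniq)))]
}
stat_mode(mydata$age)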
## [1] 20
sd(mydata$age)
## [1] 5.298678
var(mydata$age)
## [1] 28.07599
IQR(mydata$age)
## [1] 7
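Skewness and kurtosis were mentioned earlier, but base R has no built-in functions for them. The sketch below defines simple moment-based versions. These are one possible convention (population moments, without small-sample corrections); add-on packages such as moments or e1071 provide alternatives whose definitions may differ slightly.

```r
library(MASS)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis: fourth central moment divided by the squared
# second central moment (3 for a normal distribution; subtract 3 to
# obtain excess kurtosis)
kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

skewness(birthwt$age)
kurtosis(birthwt$age)
```

A skewness near zero suggests a roughly symmetric distribution, while positive values indicate a longer right tail.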
The next step is to graphically visualize the data set. Graphical exploration of the data set is a crucial part of EDA as it helps us to detect patterns in the data set in a meaningful way.
#Q-Q plots and histograms with density curves for the continuous variables
par(mfrow = c(3, 2))
for (i in c("age", "lwt", "bwt")) {
  qqnorm(mydata[, i], main = paste("Normal Q-Q Plot of", i))
  qqline(mydata[, i])
  hist(mydata[, i], probability = TRUE, xlab = i, main = paste("Histogram of", i))
  lines(density(mydata[, i]), col = "blue")
}
#Bar plots for the count variables
par(mfrow = c(1, 2))
for (i in c("ptl", "ftv")) {
  barplot(table(mydata[, i]), main = paste("Barplot of", i))
}
par(mfrow = c(1, 1)) #Reset the plotting layout
#Scatter plots
#Relation between continuous variables
plot(mydata[, c("age", "lwt", "bwt")])
#Mosaic plot
#Association between categorical variables
mosaicplot(race ~ low, data = mydata, shade = TRUE, main = "Mosaic plot for low birth weight against race")
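The plots above can be complemented with numerical summaries of the same relationships. The sketch below computes pairwise correlations for the variables in the scatter plot matrix and a chi-squared test of independence for the table behind the mosaic plot. It is one possible follow-up rather than a required step, and the object names are illustrative.

```r
library(MASS)

# Pairwise Pearson correlations for the variables in the scatter plot matrix
cor_mat <- cor(birthwt[, c("age", "lwt", "bwt")])
round(cor_mat, 2)

# Contingency table underlying the mosaic plot
race_low <- table(race = birthwt$race, low = birthwt$low)
race_low

# Pearson chi-squared test of independence between race and low
chi <- chisq.test(race_low)
chi$p.value

# Pearson residuals: positive where a cell has more observations than
# expected under independence, negative where it has fewer
round(chi$residuals, 2)
```

With shade = TRUE, the mosaic plot colours its tiles by residuals from the independence model, so the residual table gives a numerical reading of the same picture.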
Box plots can be drawn to compare the distribution of a continuous variable across the levels of a categorical variable. For example,
boxplot(bwt ~ race, data = mydata)
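The group differences seen in a box plot can also be read off numerically. The following sketch computes per-group summaries of bwt by race with aggregate() and tapply(); the object names are illustrative.

```r
library(MASS)

# Median birth weight (in grams) for each level of race
med_by_race <- aggregate(bwt ~ race, data = birthwt, FUN = median)
med_by_race

# Tukey five-number summary per group: minimum, lower hinge, median,
# upper hinge, maximum. The hinges and median correspond to the box;
# the whiskers may stop short of the minimum and maximum when there
# are outlying points.
five_by_race <- tapply(birthwt$bwt, birthwt$race, fivenum)
five_by_race
```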
\(\large{\textbf{Exercise 1:}}\)
Load the following survival data by running these commands in the \(\textrm{R}\) console:
library(survival)
bladder1
1. Understand the data.
   - Find the class of the loaded data (e.g. data frame or matrix).
   - Find the number of variables and observations in the data.
   - Find the level of measurement of each variable in the data set.
2. Summarize the data as required.
   - Convert variables to the appropriate level of measurement wherever necessary.
   - Derive the appropriate summary statistics for each variable.
   - Find the number of missing values.
   - Find a way to deal with the missing values based on your judgement.
   - What do the summary statistics tell you about the data?
3. Visualize the data.
   - Draw appropriate plots for each variable. Define axis names and plot titles properly, and customize the plots as needed for better visibility.
   - Draw plots involving two or more variables and try to decipher whether there are any relations between them.
   - What can you tell about the data from the different plots?
4. Model the data.
   - Try to establish suitable regression models using the different variables in the data set.
   - Verify the assumptions behind the regression models you choose.
   - Which model explains the data set most efficiently?
   - Identify the significant variables in the model.
   - Explain the regression outputs.
   - What do you conclude from your model about the data?
\(\large{\textbf{Exercise 2:}}\)
Load the following reliability data by running these commands in the \(\textrm{R}\) console:
library(survival)
data(reliability)
capacitor
Apply the steps from \(\large{\textbf{Exercise 1}}\). Try to gain as much insight as possible from the data.