This post will guide you through the basics of EDA, the essential first step to understand your dataset and ask the right questions. EDA is a process that depend on iteration. So you need to:

Professor John Tukey (Tukey, 1977) strongly advocated the practice of exploratory data analysis (EDA) as a critical part of the scientific process.

“No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as a recognition that the picture examining eye is the best finder we have of the wholly unanticipated.”

What would you find in this post?

EDA covers different techniques, in this article we will focus on descriptive statistics and visualization and how we can apply those with R.

Descriptive statistics include: - Mean - arithmetic average - Median - middle value in the sample - Mode - most frequently observed value - Standard Deviation - variation around the mean - Interquartile Range - range encompasses 50% of the values between 25% and 75% of the distribution - Skewness - is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean - Kurtosis - identifies whether the tails of a given distribution differ from the null distribution

Graphical methods include: - Histogram - a graphic where each bar equals to the frequency of observations for a given range of sample values - Density estimation - an estimation of the frequency distribution based on the sample data - Quantile-quantile plot - a plot of the actual data values against points normally distributed - Scatter plot - a graphical display of a variable plotted on the x axis against another one on the y axis - Box plot - a visual representation of a group of statistics such as median, quartiles, symmetry, skewness, and outliers

Hands-on calculating some basic descriptive statistics

To perform EDA you need data so let’s load a dataset that is publicly available in UCI machine learning repository and dive in for the specifics. The dataset contains values for six biomechanical features used to classify orthopaedic patients into 3 classes (normal, disk hernia or spondilolysthesis) of vertebral column lesions.

suppressMessages(library(tidyverse))
spine=read_csv('spine_3c.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   plv_inc = col_double(),
##   plv_tilt = col_double(),
##   l_l_angle = col_double(),
##   sacr_slope = col_double(),
##   plv_radius = col_double(),
##   grade_spondy = col_double(),
##   class = col_character(),
##   category = col_double()
## )

After loading the data we have to check the data types that are present in our data and the name of the variables

glimpse(spine)
## Rows: 310
## Columns: 8
## $ plv_inc      <dbl> 63.03, 39.06, 68.83, 69.30, 49.71, 40.25, 53.43, 45.37, …
## $ plv_tilt     <dbl> 22.55, 10.06, 22.22, 24.65, 9.65, 13.92, 15.86, 10.76, 1…
## $ l_l_angle    <dbl> 39.61, 25.02, 50.09, 44.31, 28.32, 25.12, 37.17, 29.04, …
## $ sacr_slope   <dbl> 40.48, 29.00, 46.61, 44.64, 40.06, 26.33, 37.57, 34.61, …
## $ plv_radius   <dbl> 98.67, 114.41, 105.99, 101.87, 108.17, 130.33, 120.57, 1…
## $ grade_spondy <dbl> -0.25, 4.56, -3.53, 11.21, 7.92, 2.23, 5.99, -10.68, 13.…
## $ class        <chr> "DH", "DH", "DH", "DH", "DH", "DH", "DH", "DH", "DH", "D…
## $ category     <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…

If there are categorical variables check the distribution of each level

spine%>%group_by(class)%>%count()
## # A tibble: 3 x 2
## # Groups:   class [3]
##   class     n
##   <chr> <int>
## 1 DH       60
## 2 NO      100
## 3 SL      150

Another important aspect of exploration is the normality of the distribution for each numeric variable based on class. For example if we want to check how the pelvic tilt (plv_tilt) is distributed across the three levels of class we can use:

spine1=spine %>%
    group_by(class) %>%
    mutate(n_samples = n()) %>%
    nest() %>%
    mutate(Shapiro = map(data, ~ shapiro.test(.x$plv_tilt)))
spine1
## # A tibble: 3 x 3
## # Groups:   class [3]
##   class data               Shapiro
##   <chr> <list>             <list> 
## 1 DH    <tibble [60 × 8]>  <htest>
## 2 SL    <tibble [150 × 8]> <htest>
## 3 NO    <tibble [100 × 8]> <htest>
library(broom)
spine1.glance <- spine1 %>%
   mutate(glance_shapiro = Shapiro %>% map(glance)) %>%
   unnest(glance_shapiro)
spine1.glance
## # A tibble: 3 x 6
## # Groups:   class [3]
##   class data               Shapiro statistic p.value method                     
##   <chr> <list>             <list>      <dbl>   <dbl> <chr>                      
## 1 DH    <tibble [60 × 8]>  <htest>     0.974  0.237  Shapiro-Wilk normality test
## 2 SL    <tibble [150 × 8]> <htest>     0.980  0.0286 Shapiro-Wilk normality test
## 3 NO    <tibble [100 × 8]> <htest>     0.992  0.792  Shapiro-Wilk normality test

Checking the output of the code chunk we see that the SL group is not normally distributed since the p.value indicates rejection of the null hypothesis for this level. So if we wish to include this variable in a model we must consider a tranformation in a log or another scale to surpass probable side effects. Alternatively for calculating the shapiro-wilk statistic for a single variable not taking into account any of the inherent levels present in the data we could just call:

shapiro.test(spine$plv_tilt)
## 
##  Shapiro-Wilk normality test
## 
## data:  spine$plv_tilt
## W = 0.96638, p-value = 1.318e-06

Calculating various descriptive statistics to understand our dataset is also very straight-forward using the might of the tidyverse. Lets assume we need to calculate the mean, sd, and IQR of lumbar lordosis angle (l_l_angle)

spine.descriptives=spine%>%group_by(class)%>%
  summarise(mu=mean(plv_tilt),std=sd(plv_tilt),iqr=IQR(plv_tilt))
## `summarise()` ungrouping output (override with `.groups` argument)
spine.descriptives
## # A tibble: 3 x 4
##   class    mu   std   iqr
##   <chr> <dbl> <dbl> <dbl>
## 1 DH     17.4  7.02  9.12
## 2 NO     12.8  6.78  7.99
## 3 SL     20.7 11.5  15.8

Inside the summarise() verb we can also use skewness() and kurtosis() functions to calculate their values.

Again we could calculate separately all them calling the mean(), sd() and IQR() functions.

Visualize data

We have several choices regarding plotting in R. Base R, Lattice and GGplot2 are the most often used libraries for visualization. Here we will use the ggplot2 library because it is widely adopted and it is relatively easy to produce custom, informative and beautiful plots. GGplot is based on the grammar of graphics principles

Histogram is the first type of plot that is essential for EDA. In ggplot a histogram with the default bins can be obtained calling:

ggplot(spine,aes(plv_tilt))+geom_histogram(bins = 30)+
  theme_bw()

That was easy! just plug the data set and the aisthetics(aes) into the ggplot() function add the desired geom_, in our case histogram and voila. Let’s make our histogram a little more informative using a pre specified condition for coloring the bins according to categorical variable in our dataset

ggplot(spine,aes(plv_tilt,fill=class))+geom_histogram(bins = 30)+
  theme_bw()

Now it is easy to observe the differences in the distribution of pelvic_tilt across the different levels of class. In almost the same way the density function can be displayed

ggplot(spine,aes(plv_tilt,fill=class,color=class))+geom_density()+
  theme_bw()

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution. Producing this plot is a way to visualize normal distribution deviations

ggplot(spine, aes(sample = plv_tilt)) +
  stat_qq() +
  stat_qq_line()+theme_bw()

The code below explores pelvic_tilt in all three levels of the class variable.

ggplot(spine, aes(sample = plv_tilt,colour = class))+
  stat_qq() +
  stat_qq_line()+theme_bw()

We can see the deviation from the 45 degree line for the SL level of class. We previously observed that when we tested for normality with the shapiro.wilk test.

Scatterplots are commonly used to explore two conitnuous variables

ggplot(spine,aes(plv_tilt,grade_spondy))+geom_point()+
  theme_bw()

It seems that there is some kind of linear relationship between the variables and also an extreme value that is probably an outlier. We can go a step further just by adding some color at the points.

ggplot(spine,aes(plv_tilt,grade_spondy,color=class))+geom_point()+
  theme_bw()

Just by looking at the graph a very interesting pattern emerges! Remember the linear trend we observed? it seems that only resides in the data points of the SL level of the class variable.