stay connected with me: linggaajiandika.github.io | linggaajiandika@gmail.com

00. Introduction

Exploratory Data Analysis (EDA) is the first step in the data analysis process. It was developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

*Note: some data processing steps, such as cbind or variable selection, are skipped because they are not the focus of this tutorial.

01. Load Package

library(dplyr) #A Grammar of Data Manipulation
library(tibble) #modern take on data frames.
library(dlookr) #Tools for Data Diagnosis, Exploration, Transformation (main library)

02. Load Dataset

The dataset used is airquality, which ships with R. It contains daily air quality measurements in New York from May to September 1973.

# View Data
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

03. Data Diagnosis

The first step is a basic diagnosis of the data.

3.1 General Data Diagnosis

diagnose(airquality)

## # A tibble: 6 x 6
##   variables types   missing_count missing_percent unique_count unique_rate
##   <chr>     <chr>           <int>           <dbl>        <int>       <dbl>
## 1 Ozone     integer            37           24.2            68      0.444 
## 2 Solar.R   integer             7            4.58          118      0.771 
## 3 Wind      numeric             0            0              31      0.203 
## 4 Temp      integer             0            0              40      0.261 
## 5 Month     integer             0            0               5      0.0327
## 6 Day       integer             0            0              31      0.203

variables : variable names
types : the data type of the variables
missing_count : number of missing values
missing_percent : percentage of missing values
unique_count : number of unique values
unique_rate : rate of unique values, unique_count / number of observations
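
The columns above can be recomputed by hand for a single variable, which helps when checking what each number means. This is a minimal base-R sketch, not dlookr's own implementation: the helper name diagnose_one is hypothetical, and it assumes that NA is counted as one distinct value.

```r
# Hypothetical helper: recompute the diagnose() columns for one vector.
diagnose_one <- function(x) {
  n <- length(x)
  missing_count <- sum(is.na(x))
  unique_count <- length(unique(x))   # assumption: NA counts as a distinct value
  data.frame(
    types = class(x)[1],
    missing_count = missing_count,
    missing_percent = 100 * missing_count / n,
    unique_count = unique_count,
    unique_rate = unique_count / n
  )
}

ozone_diag <- diagnose_one(airquality$Ozone)
print(ozone_diag)
```

A high missing_percent, as for Ozone here, is a signal to decide between removing and imputing the missing values before modelling.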

3.2 Diagnose Numeric Variable

Diagnose numeric variables only

diagnose_numeric(airquality)

## # A tibble: 6 x 10
##   variables   min    Q1   mean median    Q3   max  zero minus outlier
##   <chr>     <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <int> <int>   <int>
## 1 Ozone       1    18    42.1    31.5  63.2 168       0     0       2
## 2 Solar.R     7   116.  186.    205   259.  334       0     0       0
## 3 Wind        1.7   7.4   9.96    9.7  11.5  20.7     0     0       3
## 4 Temp       56    72    77.9    79    85    97       0     0       0
## 5 Month       5     6     6.99    7     8     9       0     0       0
## 6 Day         1     8    15.8    16    23    31       0     0       0

min : minimum value
Q1 : 1/4 quartile, 25th percentile
mean : arithmetic mean
median : median, 50th percentile
Q3 : 3/4 quartile, 75th percentile
max : maximum value
zero : number of observations with a value of 0
minus : number of observations with negative numbers
outlier : number of outliers
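
To make the column definitions concrete, the same quantities can be computed for one variable in base R. This is a sketch under assumptions: the helper name numeric_one is hypothetical, and the outlier count assumes the boxplot rule (values beyond 1.5 * IQR from the quartiles), which may differ in detail from what dlookr does internally.

```r
# Hypothetical helper: recompute the diagnose_numeric() columns for one vector.
numeric_one <- function(x) {
  x <- x[!is.na(x)]                          # drop missing values first
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  c(
    min = min(x), Q1 = q[1], mean = mean(x), median = median(x),
    Q3 = q[2], max = max(x),
    zero = sum(x == 0),                      # observations equal to 0
    minus = sum(x < 0),                      # negative observations
    outlier = length(boxplot.stats(x)$out)   # assumption: boxplot (1.5 * IQR) rule
  )
}

wind_diag <- numeric_one(airquality$Wind)
print(wind_diag)
```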

3.3 Diagnose Outlier

diagnose_outlier(airquality)

##   variables outliers_cnt outliers_ratio outliers_mean  with_mean without_mean
## 1     Ozone            2       1.307190     151.50000  42.129310    40.210526
## 2   Solar.R            0       0.000000           NaN 185.931507   185.931507
## 3      Wind            3       1.960784      19.73333   9.957516     9.762000
## 4      Temp            0       0.000000           NaN  77.882353    77.882353
## 5     Month            0       0.000000           NaN   6.993464     6.993464
## 6       Day            0       0.000000           NaN  15.803922    15.803922
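
The table compares each variable's mean with and without its outliers. The sketch below reproduces the Wind row in base R. It assumes outliers are flagged with the boxplot rule via boxplot.stats(), and that outliers_ratio is a percentage of all rows; both are assumptions that happen to agree with the output above, not a statement about dlookr's internals.

```r
# Flag Wind outliers with the boxplot rule (assumption: mirrors the rule used above).
wind <- airquality$Wind
is_out <- wind %in% boxplot.stats(wind)$out

outliers_cnt   <- sum(is_out)
outliers_ratio <- 100 * outliers_cnt / length(wind)  # percent of all rows
outliers_mean  <- mean(wind[is_out])                 # mean of the outliers only
with_mean      <- mean(wind)                         # mean including outliers
without_mean   <- mean(wind[!is_out])                # mean excluding outliers

print(c(outliers_cnt, outliers_ratio, outliers_mean, with_mean, without_mean))
```

A small gap between with_mean and without_mean, as for Wind, suggests the outliers have little influence on the average.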

3.4 Plot Outlier

airquality %>%
  plot_outlier(Wind)

3.5 Plot Intersect Missing Value

plot_na_intersect(airquality)

3.6 It’s Magic Time!

Diagnose the data easily with diagnose_web_report:

diagnose_web_report(airquality)


04. Descriptive Statistics

Going deeper, this is the main step.

4.1 Summary

Simple descriptive statistics

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

4.2 Describe

More detailed descriptive statistics

describe(airquality)

## # A tibble: 6 x 26
##   variable     n    na   mean    sd se_mean   IQR skewness kurtosis   p00   p01
##   <chr>    <int> <int>  <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl>
## 1 Ozone      116    37  42.1  33.0    3.06   45.2  1.24       1.29    1    4.3 
## 2 Solar.R    146     7 186.   90.1    7.45  143   -0.428     -0.968   7   10.2 
## 3 Wind       153     0   9.96  3.52   0.285   4.1  0.348      0.111   1.7  2.56
## 4 Temp       153     0  77.9   9.47   0.765  13   -0.378     -0.404  56   57   
## 5 Month      153     0   6.99  1.42   0.115   2   -0.00239   -1.30    5    5   
## 6 Day        153     0  15.8   8.86   0.717  15    0.00265   -1.20    1    1   
## # ... with 15 more variables: p05 <dbl>, p10 <dbl>, p20 <dbl>, p25 <dbl>,
## #   p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>,
## #   p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
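
Some of the describe() columns are simple functions of others. As an illustration, this base-R sketch recomputes n, se_mean and IQR for Ozone after dropping missing values. The variable names are chosen for this example only.

```r
# Keep only the observed Ozone values (describe() reports n = 116, na = 37).
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

n       <- length(ozone)
se_mean <- sd(ozone) / sqrt(n)   # standard error of the mean
iqr     <- IQR(ozone)            # 75th percentile minus 25th percentile

print(c(n = n, mean = mean(ozone), sd = sd(ozone), se_mean = se_mean, IQR = iqr))
```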

4.3 Normality Check

Is your data normal?

normality(airquality)

## # A tibble: 6 x 4
##   vars    statistic       p_value sample
##   <chr>       <dbl>         <dbl>  <dbl>
## 1 Ozone       0.879 0.0000000279     153
## 2 Solar.R     0.942 0.00000949       153
## 3 Wind        0.986 0.118            153
## 4 Temp        0.976 0.00932          153
## 5 Month       0.888 0.00000000226    153
## 6 Day         0.953 0.0000505        153
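
The statistic and p_value columns are consistent with a Shapiro-Wilk test, so the same check can be sketched with base R's shapiro.test(). The significance level of 0.05 is an assumption made for this example; the tutorial does not fix one.

```r
alpha <- 0.05   # assumed significance level; not specified in the tutorial

# Run the Shapiro-Wilk test on every column (shapiro.test() drops NA values).
p_values <- sapply(airquality, function(x) shapiro.test(x)$p.value)
is_normal <- p_values > alpha   # TRUE: normality is not rejected at this alpha

print(data.frame(p_value = p_values, normality_not_rejected = is_normal))
```

With this threshold only Wind keeps the normality assumption, which matches the p-values in the table.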

If p-value <= \(\alpha\), the data is not normally distributed.

4.4 Plot Normality

airquality %>%
  plot_normality(Ozone, Wind)

4.5 Correlation Matrix

correlate(airquality)

## # A tibble: 30 x 3
##    var1    var2    coef_corr
##    <fct>   <fct>       <dbl>
##  1 Solar.R Ozone      0.348 
##  2 Wind    Ozone     -0.602 
##  3 Temp    Ozone      0.698 
##  4 Month   Ozone      0.165 
##  5 Day     Ozone     -0.0132
##  6 Ozone   Solar.R    0.348 
##  7 Wind    Solar.R   -0.0568
##  8 Temp    Solar.R    0.276 
##  9 Month   Solar.R   -0.0753
## 10 Day     Solar.R   -0.150 
## # ... with 20 more rows
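
The long table above is a correlation matrix in "one pair per row" form. A comparable matrix can be sketched with base R's cor(). The choice of Pearson correlation on pairwise-complete observations is an assumption that agrees with the coefficients shown, not a documented description of correlate().

```r
# Pairwise correlations; rows with NA are dropped per pair of variables.
# Assumption: Pearson correlation with pairwise-complete observations.
corr_matrix <- cor(airquality, use = "pairwise.complete.obs")
print(round(corr_matrix, 3))

# Rank the other variables by the strength of their linear relation with Ozone.
ozone_corr <- corr_matrix["Ozone", colnames(corr_matrix) != "Ozone"]
print(sort(abs(ozone_corr), decreasing = TRUE))
```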

4.6 Plot Correlation Matrix

plot_correlate(airquality)

4.7 It’s Magic Time Again!!!

Exploratory Data Analysis using only one line of code:

eda_web_report(airquality)


05. Data Transformation

The last step is to transform your data according to your needs.

5.1 Handling Missing Value

Missing values can be handled with two methods:

5.1.1 Remove Missing Value

NROW(airquality$Ozone)

## [1] 153

x <- na.omit(airquality$Ozone)
NROW(x)

## [1] 116
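
The example above removes missing values from a single column. When the whole data frame is needed, a row has to be dropped if any of its columns is missing. A minimal base-R sketch, with variable names chosen for this example:

```r
# Count rows that have no missing value in any column.
rows_total    <- nrow(airquality)
rows_complete <- sum(complete.cases(airquality))

# na.omit() on the data frame keeps only the complete rows.
airquality_complete <- na.omit(airquality)

print(c(total = rows_total, complete = rows_complete,
        dropped = rows_total - rows_complete))
```

More rows are dropped this way than for Ozone alone, because rows with a missing Solar.R value are removed as well.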

5.1.2 Impute Missing Value

Methods supported by imputate_na():

mean : arithmetic mean
median : median
mode : mode
knn : K-nearest neighbors
- target variable must be specified
rpart : Recursive Partitioning and Regression Trees
- target variable must be specified
mice : Multivariate Imputation by Chained Equations
- target variable must be specified
- random seed must be set

ozone_impute <- imputate_na(airquality, Ozone, method = "mean")
summary(ozone_impute)

## Impute missing values with mean
## 
## * Information of Imputation (before vs after)
##            Original Imputation
## n        116.000000 153.000000
## na        37.000000   0.000000
## mean      42.129310  42.129310
## sd        32.987885  28.693372
## se_mean    3.062848   2.319722
## IQR       45.250000  25.000000
## skewness   1.241796   1.421624
## kurtosis   1.290303   2.643199
## p00        1.000000   1.000000
## p01        4.300000   5.040000
## p05        7.750000   9.000000
## p10       11.000000  12.200000
## p20       14.000000  18.000000
## p25       18.000000  21.000000
## p30       20.000000  23.000000
## p40       23.000000  33.600000
## p50       31.500000  42.129310
## p60       39.000000  42.129310
## p70       49.500000  42.129310
## p75       63.250000  46.000000
## p80       73.000000  60.200000
## p90       87.000000  81.600000
## p95      108.500000  97.000000
## p99      133.050000 128.240000
## p100     168.000000 168.000000
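
The summary shows that mean imputation keeps the mean but shrinks the standard deviation, because every filled value sits exactly at the mean. The effect can be reproduced with a short base-R sketch. This is an illustration of the idea, not the code imputate_na() runs.

```r
ozone <- airquality$Ozone
ozone_mean <- mean(ozone, na.rm = TRUE)

# Replace every NA with the mean of the observed values.
ozone_filled <- ifelse(is.na(ozone), ozone_mean, ozone)

# The mean is unchanged, but the spread shrinks.
print(c(mean_before = ozone_mean, mean_after = mean(ozone_filled)))
print(c(sd_before = sd(ozone, na.rm = TRUE), sd_after = sd(ozone_filled)))
```

This loss of variance is the main reason to consider model-based methods such as knn, rpart, or mice when many values are missing.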

plot(ozone_impute)

5.2 Handling Outlier

Outliers can be handled with two methods:

5.2.1 Remove Outlier

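# Quartiles and interquartile range of Wind
q1 <- quantile(airquality$Wind, .25)
q3 <- quantile(airquality$Wind, .75)
iqr <- IQR(airquality$Wind)   # named iqr so it does not mask the IQR() function
# Keep only rows inside the 1.5 * IQR fences
no_outliers <- subset(airquality, Wind > (q1 - 1.5 * iqr) & Wind < (q3 + 1.5 * iqr))
NROW(no_outliers)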

## [1] 150

5.2.2 Impute Outlier

Methods supported by imputate_outlier():

mean : arithmetic mean
median : median
mode : mode
capping : impute upper outliers with the 95th percentile and lower outliers with the 5th percentile

out_wind <- imputate_outlier(airquality, Wind, method = "capping")
summary(out_wind)

## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##             Original  Imputation
## n        153.0000000 153.0000000
## na         0.0000000   0.0000000
## mean       9.9575163   9.8745098
## sd         3.5230014   3.3325652
## se_mean    0.2848178   0.2694219
## IQR        4.1000000   4.1000000
## skewness   0.3478178   0.0590653
## kurtosis   0.1114183  -0.5478332
## p00        1.7000000   1.7000000
## p01        2.5600000   2.5600000
## p05        4.6000000   4.6000000
## p10        5.8200000   5.8200000
## p20        6.9000000   6.9000000
## p25        7.4000000   7.4000000
## p30        8.0000000   8.0000000
## p40        8.6000000   8.6000000
## p50        9.7000000   9.7000000
## p60       10.4200000  10.4200000
## p70       11.5000000  11.5000000
## p75       11.5000000  11.5000000
## p80       12.9600000  12.9600000
## p90       14.9000000  14.9000000
## p95       15.5000000  15.5000000
## p99       19.2160000  16.6000000
## p100      20.7000000  16.6000000
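
Capping can also be written out by hand, which makes it clear why only the top percentiles change in the summary. The sketch below assumes outliers are the values outside the 1.5 * IQR fences and that replacement values come from the default quantile() definition. It illustrates the technique and is not dlookr's implementation.

```r
wind <- airquality$Wind

# Fences from the 1.5 * IQR rule, plus the percentiles used as replacement values.
q     <- quantile(wind, c(0.25, 0.75), names = FALSE)
fence <- c(q[1] - 1.5 * (q[2] - q[1]), q[2] + 1.5 * (q[2] - q[1]))
caps  <- quantile(wind, c(0.05, 0.95), names = FALSE)

wind_capped <- wind
wind_capped[wind < fence[1]] <- caps[1]   # lower outliers -> 5th percentile
wind_capped[wind > fence[2]] <- caps[2]   # upper outliers -> 95th percentile

print(c(mean_before = mean(wind), mean_after = mean(wind_capped),
        max_after = max(wind_capped)))
```

Only the values beyond the upper fence are replaced here, since Wind has no observations below the lower fence.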

plot(out_wind)

5.3 Standardization