Exploratory Data Analysis (EDA) is the first step in the data analysis process; the approach was developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
*Note: some data-processing steps, such as cbind or selecting variables, are skipped because they are not the focus of this tutorial.
library(dplyr) #A Grammar of Data Manipulation
library(tibble) #modern take on data frames.
library(dlookr) #Tools for Data Diagnosis, Exploration, Transformation (main library)
The dataset used is airquality, which is already available in R. It contains daily air quality measurements in New York from May to September 1973.
# View Data
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
The first step is a simple diagnosis of the data:
diagnose(airquality)
## # A tibble: 6 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Ozone integer 37 24.2 68 0.444
## 2 Solar.R integer 7 4.58 118 0.771
## 3 Wind numeric 0 0 31 0.203
## 4 Temp integer 0 0 40 0.261
## 5 Month integer 0 0 5 0.0327
## 6 Day integer 0 0 31 0.203
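As a cross-check, the missing and unique counts in the table above can be recomputed with base R alone. This is a minimal sketch, not dlookr's implementation; it assumes NA is counted as one unique value, which is what the counts in the table suggest. The name diag_base is only illustrative.

```r
# Base-R cross-check of the diagnosis table (no dlookr needed).
# Assumption: unique values are counted with NA included as one value.
n_obs <- nrow(airquality)

diag_base <- data.frame(
  variables     = names(airquality),
  types         = vapply(airquality, function(x) class(x)[1], character(1)),
  missing_count = vapply(airquality, function(x) sum(is.na(x)), integer(1)),
  unique_count  = vapply(airquality, function(x) length(unique(x)), integer(1)),
  row.names     = NULL
)

# Derived columns: share of missing values (in percent) and share of unique values
diag_base$missing_percent <- 100 * diag_base$missing_count / n_obs
diag_base$unique_rate     <- diag_base$unique_count / n_obs

print(diag_base)
```

Recomputing the table by hand like this is mostly useful for checking what each column means before relying on it.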
variables : variable names
types : the data type of the variables
missing_count : number of missing values
missing_percent : percentage of missing values
unique_count : number of unique values
unique_rate : rate of unique value. unique_count / number of observation
Only Numeric Variable
diagnose_numeric(airquality)
## # A tibble: 6 x 10
## variables min Q1 mean median Q3 max zero minus outlier
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Ozone 1 18 42.1 31.5 63.2 168 0 0 2
## 2 Solar.R 7 116. 186. 205 259. 334 0 0 0
## 3 Wind 1.7 7.4 9.96 9.7 11.5 20.7 0 0 3
## 4 Temp 56 72 77.9 79 85 97 0 0 0
## 5 Month 5 6 6.99 7 8 9 0 0 0
## 6 Day 1 8 15.8 16 23 31 0 0 0
min : minimum value
Q1 : 1/4 quartile, 25th percentile
mean : arithmetic mean
median : median, 50th percentile
Q3 : 3/4 quartile, 75th percentile
max : maximum value
zero : number of observations with a value of 0
minus : number of observations with negative numbers
outlier : number of outliers
diagnose_outlier(airquality)
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## 1 Ozone 2 1.307190 151.50000 42.129310 40.210526
## 2 Solar.R 0 0.000000 NaN 185.931507 185.931507
## 3 Wind 3 1.960784 19.73333 9.957516 9.762000
## 4 Temp 0 0.000000 NaN 77.882353 77.882353
## 5 Month 0 0.000000 NaN 6.993464 6.993464
## 6 Day 0 0.000000 NaN 15.803922 15.803922
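To see where these numbers come from, the Wind row can be rebuilt in base R. The sketch below assumes outliers are identified with the boxplot rule (boxplot.stats), which reproduces the counts shown above; the variable names are illustrative only.

```r
# Sketch of the outlier summary for one variable, assuming the
# boxplot rule (boxplot.stats) defines what counts as an outlier.
x   <- airquality$Wind
out <- boxplot.stats(x)$out   # values beyond 1.5 * IQR from the hinges

outlier_summary <- c(
  outliers_cnt   = length(out),
  outliers_ratio = 100 * length(out) / length(x),       # in percent
  outliers_mean  = mean(out),                            # mean of the outliers only
  with_mean      = mean(x, na.rm = TRUE),                # mean including outliers
  without_mean   = mean(x[!x %in% out], na.rm = TRUE)    # mean excluding outliers
)

print(round(outlier_summary, 4))
```

Comparing with_mean and without_mean shows how much the outliers pull the average: for Wind the shift is small.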
airquality %>%
plot_outlier(Wind)
plot_na_intersect(airquality)
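The information behind a missing-value intersection plot can also be tabulated directly. The following is a rough base-R sketch that counts how often Ozone and Solar.R (the only two variables with missing values) are missing alone or together; na_flags and pattern_counts are illustrative names.

```r
# Flag missing values in the two variables that contain NAs
na_flags <- is.na(airquality[, c("Ozone", "Solar.R")])

# Cross-tabulate the flags: each cell is one combination of missingness
pattern_counts <- table(
  Ozone_missing   = na_flags[, "Ozone"],
  Solar.R_missing = na_flags[, "Solar.R"]
)

print(pattern_counts)
```

The cell where both flags are TRUE gives the rows that would be lost no matter which of the two variables is kept.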
Diagnose the data in one step with diagnose_web_report:
diagnose_web_report(airquality)
Going deeper: exploration is the main step.
Simple descriptive statistics
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
Detailed descriptive statistics
describe(airquality)
## # A tibble: 6 x 26
## variable n na mean sd se_mean IQR skewness kurtosis p00 p01
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ozone 116 37 42.1 33.0 3.06 45.2 1.24 1.29 1 4.3
## 2 Solar.R 146 7 186. 90.1 7.45 143 -0.428 -0.968 7 10.2
## 3 Wind 153 0 9.96 3.52 0.285 4.1 0.348 0.111 1.7 2.56
## 4 Temp 153 0 77.9 9.47 0.765 13 -0.378 -0.404 56 57
## 5 Month 153 0 6.99 1.42 0.115 2 -0.00239 -1.30 5 5
## 6 Day 153 0 15.8 8.86 0.717 15 0.00265 -1.20 1 1
## # ... with 15 more variables: p05 <dbl>, p10 <dbl>, p20 <dbl>, p25 <dbl>,
## # p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>,
## # p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
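Several of these columns are straightforward to recompute, which helps when checking what they mean. A minimal base-R sketch for Ozone follows; it assumes the statistics are calculated on the non-missing values and that percentiles use R's default quantile method. stats_base and pcts are illustrative names.

```r
# Recompute a few describe()-style statistics for Ozone in base R
x_all <- airquality$Ozone
x     <- x_all[!is.na(x_all)]          # statistics use non-missing values only

stats_base <- c(
  n       = length(x),
  na      = sum(is.na(x_all)),
  mean    = mean(x),
  sd      = sd(x),
  se_mean = sd(x) / sqrt(length(x)),   # standard error of the mean
  IQR     = IQR(x)                     # Q3 - Q1
)

# A selection of percentiles (p00, p25, p50, p75, p100)
pcts <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(round(stats_base, 3))
print(pcts)
```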
Is your data normal?
normality(airquality)
## # A tibble: 6 x 4
## vars statistic p_value sample
## <chr> <dbl> <dbl> <dbl>
## 1 Ozone 0.879 0.0000000279 153
## 2 Solar.R 0.942 0.00000949 153
## 3 Wind 0.986 0.118 153
## 4 Temp 0.976 0.00932 153
## 5 Month 0.888 0.00000000226 153
## 6 Day 0.953 0.0000505 153
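The statistic and p_value columns can be cross-checked with the Shapiro-Wilk test from base R. This sketch assumes the table above comes from that test applied to each column separately; the name sw is illustrative.

```r
# Run the Shapiro-Wilk test on every column and collect W and the p-value.
# shapiro.test() drops missing values itself.
sw <- t(sapply(airquality, function(x) {
  res <- shapiro.test(x)
  c(statistic = unname(res$statistic), p_value = res$p.value)
}))

print(signif(sw, 3))
```

Each row of sw corresponds to one variable, so the values can be compared line by line with the table.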
If the p-value is less than or equal to \(\alpha\), the data are not normally distributed.
Plot Normality
airquality %>%
plot_normality(Ozone, Wind)
correlate(airquality)
## # A tibble: 30 x 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 Solar.R Ozone 0.348
## 2 Wind Ozone -0.602
## 3 Temp Ozone 0.698
## 4 Month Ozone 0.165
## 5 Day Ozone -0.0132
## 6 Ozone Solar.R 0.348
## 7 Wind Solar.R -0.0568
## 8 Temp Solar.R 0.276
## 9 Month Solar.R -0.0753
## 10 Day Solar.R -0.150
## # ... with 20 more rows
plot_correlate(airquality)
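For comparison, the same coefficients can be obtained with base R's cor(). The sketch below assumes Pearson correlation computed on pairwise-complete observations, which matches the values shown above; r and r_long are illustrative names.

```r
# Pearson correlation matrix, using all complete pairs for each variable pair
r <- cor(airquality, use = "pairwise.complete.obs")
print(round(r, 3))

# Reshape the matrix into a long table with one row per ordered pair
r_long <- as.data.frame(as.table(r), responseName = "coef_corr")
names(r_long)[1:2] <- c("var1", "var2")

# Drop the diagonal (each variable paired with itself)
r_long <- r_long[as.character(r_long$var1) != as.character(r_long$var2), ]
head(r_long)
```

The long format has one row per ordered pair, so each coefficient appears twice (for example Wind/Ozone and Ozone/Wind).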
Exploratory data analysis in a single line of code:
eda_web_report(airquality)
The last step is to transform your data according to your needs.
Missing values can be handled in two ways: remove them or impute them.
NROW(airquality$Ozone)
## [1] 153
x <- na.omit(airquality$Ozone)
NROW(x)
## [1] 116
Method:
mean : arithmetic mean
median : median
mode : mode
knn : K-nearest neighbors
rpart : Recursive Partitioning and Regression Trees
mice : Multivariate Imputation by Chained Equations
ozone_impute <- imputate_na(airquality, Ozone, method = "mean")
summary(ozone_impute)
## Impute missing values with mean
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 116.000000 153.000000
## na 37.000000 0.000000
## mean 42.129310 42.129310
## sd 32.987885 28.693372
## se_mean 3.062848 2.319722
## IQR 45.250000 25.000000
## skewness 1.241796 1.421624
## kurtosis 1.290303 2.643199
## p00 1.000000 1.000000
## p01 4.300000 5.040000
## p05 7.750000 9.000000
## p10 11.000000 12.200000
## p20 14.000000 18.000000
## p25 18.000000 21.000000
## p30 20.000000 23.000000
## p40 23.000000 33.600000
## p50 31.500000 42.129310
## p60 39.000000 42.129310
## p70 49.500000 42.129310
## p75 63.250000 46.000000
## p80 73.000000 60.200000
## p90 87.000000 81.600000
## p95 108.500000 97.000000
## p99 133.050000 128.240000
## p100 168.000000 168.000000
plot(ozone_impute)
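Mean imputation is simple enough to write by hand, which makes its side effect easy to see: the mean stays the same while the spread shrinks. The following is a minimal base-R sketch, not dlookr's implementation; the variable names are illustrative.

```r
# Replace missing Ozone values with the mean of the observed values
ozone         <- airquality$Ozone
ozone_mean    <- mean(ozone, na.rm = TRUE)
ozone_imputed <- ifelse(is.na(ozone), ozone_mean, ozone)

# Compare the distribution before and after imputation
comparison <- c(
  mean_before = ozone_mean,
  mean_after  = mean(ozone_imputed),
  sd_before   = sd(ozone, na.rm = TRUE),
  sd_after    = sd(ozone_imputed)
)

print(round(comparison, 4))
```

Because every imputed value sits exactly at the mean, the standard deviation drops; this is worth keeping in mind before using the imputed column in a model.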
Outliers can be handled in two ways: remove them or impute them.
Q1 <- quantile(airquality$Wind, .25)
Q3 <- quantile(airquality$Wind, .75)
iqr <- IQR(airquality$Wind) # lowercase name avoids masking the IQR() function
no_outliers <- subset(airquality, Wind > (Q1 - 1.5 * iqr) & Wind < (Q3 + 1.5 * iqr))
NROW(no_outliers)
## [1] 150
Method:
mean : arithmetic mean
median : median
mode : mode
capping : impute the upper outliers with the 95th percentile and the lower outliers with the 5th percentile
out_wind <- imputate_outlier(airquality, Wind, method = "capping")
summary(out_wind)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 153.0000000 153.0000000
## na 0.0000000 0.0000000
## mean 9.9575163 9.8745098
## sd 3.5230014 3.3325652
## se_mean 0.2848178 0.2694219
## IQR 4.1000000 4.1000000
## skewness 0.3478178 0.0590653
## kurtosis 0.1114183 -0.5478332
## p00 1.7000000 1.7000000
## p01 2.5600000 2.5600000
## p05 4.6000000 4.6000000
## p10 5.8200000 5.8200000
## p20 6.9000000 6.9000000
## p25 7.4000000 7.4000000
## p30 8.0000000 8.0000000
## p40 8.6000000 8.6000000
## p50 9.7000000 9.7000000
## p60 10.4200000 10.4200000
## p70 11.5000000 11.5000000
## p75 11.5000000 11.5000000
## p80 12.9600000 12.9600000
## p90 14.9000000 14.9000000
## p95 15.5000000 15.5000000
## p99 19.2160000 16.6000000
## p100 20.7000000 16.6000000
plot(out_wind)
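The capping idea can be sketched in base R as well. This version assumes outliers are found with the boxplot rule and that only those outliers are replaced (high ones with the 95th percentile, low ones with the 5th), which is consistent with the summary above; it is an illustration rather than dlookr's own code.

```r
# Cap Wind outliers at the 5th / 95th percentile
x    <- airquality$Wind
out  <- boxplot.stats(x)$out                          # boxplot-rule outliers
caps <- unname(quantile(x, probs = c(0.05, 0.95), na.rm = TRUE))

capped <- x
capped[x %in% out & x < caps[1]] <- caps[1]           # low outliers  -> 5th percentile
capped[x %in% out & x > caps[2]] <- caps[2]           # high outliers -> 95th percentile

print(c(mean_before = mean(x), mean_after = mean(capped),
        max_before  = max(x),  max_after  = max(capped)))
```

Note that the maximum after capping is the largest non-outlier value, not the 95th percentile, because values between the percentile and the outlier fence are left untouched.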
Method:
zscore : z-score transformation. (x - mu) / sigma
minmax : minmax transformation. (x - min) / (max - min)
air_trans <- transform(airquality$Wind, method = "minmax")
boxplot(air_trans)
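Both standardization formulas are short enough to apply directly, which can be handy for checking the result. A minimal base-R sketch on Wind (which has no missing values) follows; the variable names are illustrative.

```r
x <- airquality$Wind

# minmax: rescale to the interval [0, 1]
wind_minmax <- (x - min(x)) / (max(x) - min(x))

# zscore: center on the mean and scale by the standard deviation
wind_zscore <- (x - mean(x)) / sd(x)

print(range(wind_minmax))
print(c(mean = mean(wind_zscore), sd = sd(wind_zscore)))
```

After minmax scaling the values span exactly 0 to 1; after z-scoring the mean is 0 (up to floating-point error) and the standard deviation is 1.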
log : log transformation. log(x)
log+1 : log transformation. log(x + 1). Used for values that contain 0.
sqrt : square root transformation
1/x : 1 / x transformation
x^2 : x square transformation
x^3 : x cube transformation
Ozone_log <- transform(airquality$Ozone, method = "log")
summary(Ozone_log)
## * Resolving Skewness with log
##
## * Information of Transformation (before vs after)
## Original Transformation
## n 116.000000 116.00000000
## na 37.000000 37.00000000
## mean 42.129310 3.41851510
## sd 32.987885 0.86547454
## se_mean 3.062848 0.08035729
## IQR 45.250000 1.25670006
## skewness 1.241796 -0.56230968
## kurtosis 1.290303 0.93250632
## p00 1.000000 0.00000000
## p01 4.300000 1.44711413
## p05 7.750000 2.04605869
## p10 11.000000 2.39789527
## p20 14.000000 2.63905733
## p25 18.000000 2.89037176
## p30 20.000000 2.99573227
## p40 23.000000 3.13549422
## p50 31.500000 3.44986155
## p60 39.000000 3.66356165
## p70 49.500000 3.90192165
## p75 63.250000 4.14707182
## p80 73.000000 4.29045944
## p90 87.000000 4.46564381
## p95 108.500000 4.68671851
## p99 133.050000 4.89008672
## p100 168.000000 5.12396398
plot(Ozone_log)
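The effect of the log transformation on skewness can be checked by hand. The sketch below uses the plain moment-based skewness coefficient, defined in a hypothetical helper skew_g1; dlookr may apply a small-sample correction, so the magnitudes can differ slightly from the summary above while the direction of the change is the same.

```r
# Moment-based skewness: third central moment / (second central moment)^1.5
# skew_g1 is an illustrative helper, not a dlookr function.
skew_g1 <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

ozone     <- airquality$Ozone
ozone_log <- log(ozone)          # Ozone has no zeros, so log(x) is safe here

skew_cmp <- c(original = skew_g1(ozone), log = skew_g1(ozone_log))
print(round(skew_cmp, 3))
```

A strongly right-skewed variable becomes mildly left-skewed after the log; if the overshoot matters, sqrt is a gentler alternative to compare against.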
Stay connected with me: linggaajiandika.github.io | linggaajiandika@gmail.com
Special thanks to the dlookr team, who made an amazing package.