The aim of this EDA is to spot-check a few correlations and covariances and to plot some of the variables of the ‘mtcars’ dataset, which is built into R. It is meant as practice in working with R and in performing some preliminary analysis.
# Install and load required library packages
# Uncomment below if the packages are not installed
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("funModeling")
library(funModeling)
library(tidyverse)
library(Hmisc)
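If you are not sure which of these packages are already installed, a small check like the following can help. This is only a sketch: the variable names `pkgs`, `installed` and `missing_pkgs` are my own, and the install line is left commented out so nothing is installed by accident.

```r
# Packages used in this post
pkgs <- c("tidyverse", "Hmisc", "funModeling")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install
} else {
  message("All packages are installed.")
}
```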
We will be using the built-in dataset ‘mtcars’. Let us check the structure of mtcars first.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Every column of ‘mtcars’ is stored as numeric; none is stored as a factor. Note, however, that cyl, vs, am, gear and carb take only a few distinct values and are categorical in nature.
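If you want R to treat some of these columns as categorical, you can convert them to factors on a copy of the data. The sketch below is optional and the rest of this post keeps using the original numeric mtcars. The name `mtcars_cat` is just an illustrative choice; the labels follow the coding documented in `?mtcars`.

```r
# Work on a copy so the original mtcars stays untouched
mtcars_cat <- mtcars

# Codes documented in ?mtcars:
#   vs: 0 = V-shaped, 1 = straight
#   am: 0 = automatic, 1 = manual
mtcars_cat$vs  <- factor(mtcars_cat$vs, levels = c(0, 1),
                         labels = c("V-shaped", "straight"))
mtcars_cat$am  <- factor(mtcars_cat$am, levels = c(0, 1),
                         labels = c("automatic", "manual"))
mtcars_cat$cyl <- factor(mtcars_cat$cyl)

# The converted columns now show up as factors
str(mtcars_cat[c("cyl", "vs", "am")])
table(mtcars_cat$am)
```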
class(mtcars)
## [1] "data.frame"
Check the number of zeros, NAs and infinite values in each variable of the data frame.
df_status(mtcars)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 mpg 0 0.00 0 0 0 0 numeric 25
## 2 cyl 0 0.00 0 0 0 0 numeric 3
## 3 disp 0 0.00 0 0 0 0 numeric 27
## 4 hp 0 0.00 0 0 0 0 numeric 22
## 5 drat 0 0.00 0 0 0 0 numeric 22
## 6 wt 0 0.00 0 0 0 0 numeric 29
## 7 qsec 0 0.00 0 0 0 0 numeric 30
## 8 vs 18 56.25 0 0 0 0 numeric 2
## 9 am 19 59.38 0 0 0 0 numeric 2
## 10 gear 0 0.00 0 0 0 0 numeric 3
## 11 carb 0 0.00 0 0 0 0 numeric 6
Notation:
q_XX : the count of such values (zeros, NAs or infinite values).
p_XX : the same count expressed as a percentage of all rows.
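The same counts can be cross-checked in base R without funModeling. This is a rough equivalent for illustration, not the way df_status is implemented; the names `zeros`, `nas`, `infs` and `status` are mine.

```r
# Count zeros, NAs and infinite values per column
zeros <- colSums(mtcars == 0)
nas   <- colSums(is.na(mtcars))
infs  <- colSums(sapply(mtcars, is.infinite))

# Assemble a small table similar in spirit to df_status()
status <- data.frame(
  q_zeros = zeros,
  p_zeros = round(100 * zeros / nrow(mtcars), 2),
  q_na    = nas,
  q_inf   = infs
)
status
```

Only vs and am should show any zeros, which matches the df_status output above.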
Check the summary statistics of mtcars (the five-number summary plus the mean).
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
profiling_num(mtcars)
## variable mean std_dev variation_coef p_01 p_05
## 1 mpg 20.090625 6.0269481 0.2999881 10.40000 11.9950
## 2 cyl 6.187500 1.7859216 0.2886338 4.00000 4.0000
## 3 disp 230.721875 123.9386938 0.5371779 72.52600 77.3500
## 4 hp 146.687500 68.5628685 0.4674077 55.10000 63.6500
## 5 drat 3.596563 0.5346787 0.1486638 2.76000 2.8535
## 6 wt 3.217250 0.9784574 0.3041285 1.54462 1.7360
## 7 qsec 17.848750 1.7869432 0.1001159 14.53100 15.0455
## 8 vs 0.437500 0.5040161 1.1520369 0.00000 0.0000
## 9 am 0.406250 0.4989909 1.2282853 0.00000 0.0000
## 10 gear 3.687500 0.7378041 0.2000825 3.00000 3.0000
## 11 carb 2.812500 1.6152000 0.5742933 1.00000 1.0000
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis
## 1 15.42500 19.200 22.80 31.30000 33.43500 0.6404399 2.799467
## 2 4.00000 6.000 8.00 8.00000 8.00000 -0.1831287 1.319032
## 3 120.82500 196.300 326.00 449.00000 468.28000 0.4002724 1.910317
## 4 96.50000 123.000 180.00 253.55000 312.99000 0.7614356 3.052233
## 5 3.08000 3.695 3.92 4.31450 4.77500 0.2788734 2.435116
## 6 2.58125 3.325 3.61 5.29275 5.39951 0.4437855 3.172471
## 7 16.89250 17.710 18.90 20.10450 22.06920 0.3870456 3.553753
## 8 0.00000 0.000 1.00 1.00000 1.00000 0.2519763 1.063492
## 9 0.00000 0.000 1.00 1.00000 1.00000 0.3817709 1.145749
## 10 3.00000 4.000 4.00 5.00000 5.00000 0.5546495 2.056790
## 11 2.00000 2.000 4.00 4.90000 7.38000 1.1021304 4.536121
## iqr range_98 range_80
## 1 7.37500 [10.4, 33.435] [14.34, 30.09]
## 2 4.00000 [4, 8] [4, 8]
## 3 205.17500 [72.526, 468.28] [80.61, 396]
## 4 83.50000 [55.1, 312.99] [66, 243.5]
## 5 0.84000 [2.76, 4.775] [3.007, 4.209]
## 6 1.02875 [1.54462, 5.39951] [1.9555, 4.0475]
## 7 2.00750 [14.531, 22.0692] [15.534, 19.99]
## 8 1.00000 [0, 1] [0, 1]
## 9 1.00000 [0, 1] [0, 1]
## 10 1.00000 [3, 5] [3, 5]
## 11 2.00000 [1, 7.38] [1, 4]
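Notation:
p_XX - In the above result, p_XX denotes the XXth percentile. For example, p_25 is the 25th percentile of the data.
iqr - Interquartile range (p_75 minus p_25).
range_98, range_80 - The intervals that hold the middle 98% and the middle 80% of the data.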
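To see where these numbers come from, here is a short sketch that recomputes a few percentiles and the interquartile range for mpg with base R. The variable names `x` and `pcts` are mine.

```r
x <- mtcars$mpg

# Percentiles corresponding to p_01, p_05, p_25, p_50, p_75, p_95, p_99
pcts <- quantile(x, probs = c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99))
pcts

# Interquartile range: 75th percentile minus 25th percentile
IQR(x)
unname(pcts["75%"] - pcts["25%"])
```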
Note: Data cleaning will be covered in a separate post. For now the variables look reasonable (there are no NAs or infinite values), so we continue with the original data without any clean-up.
describe(mtcars)
## mtcars
##
## 11 Variables 32 Observations
## ---------------------------------------------------------------------------
## mpg
## n missing distinct Info Mean Gmd .05 .10
## 32 0 25 0.999 20.09 6.796 12.00 14.34
## .25 .50 .75 .90 .95
## 15.43 19.20 22.80 30.09 31.30
##
## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
## ---------------------------------------------------------------------------
## cyl
## n missing distinct Info Mean Gmd
## 32 0 3 0.866 6.188 1.948
##
## Value 4 6 8
## Frequency 11 7 14
## Proportion 0.344 0.219 0.438
## ---------------------------------------------------------------------------
## disp
## n missing distinct Info Mean Gmd .05 .10
## 32 0 27 0.999 230.7 142.5 77.35 80.61
## .25 .50 .75 .90 .95
## 120.83 196.30 326.00 396.00 449.00
##
## lowest : 71.1 75.7 78.7 79.0 95.1, highest: 360.0 400.0 440.0 460.0 472.0
## ---------------------------------------------------------------------------
## hp
## n missing distinct Info Mean Gmd .05 .10
## 32 0 22 0.997 146.7 77.04 63.65 66.00
## .25 .50 .75 .90 .95
## 96.50 123.00 180.00 243.50 253.55
##
## lowest : 52 62 65 66 91, highest: 215 230 245 264 335
## ---------------------------------------------------------------------------
## drat
## n missing distinct Info Mean Gmd .05 .10
## 32 0 22 0.997 3.597 0.6099 2.853 3.007
## .25 .50 .75 .90 .95
## 3.080 3.695 3.920 4.209 4.314
##
## lowest : 2.76 2.93 3.00 3.07 3.08, highest: 4.08 4.11 4.22 4.43 4.93
## ---------------------------------------------------------------------------
## wt
## n missing distinct Info Mean Gmd .05 .10
## 32 0 29 0.999 3.217 1.089 1.736 1.956
## .25 .50 .75 .90 .95
## 2.581 3.325 3.610 4.048 5.293
##
## lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
## ---------------------------------------------------------------------------
## qsec
## n missing distinct Info Mean Gmd .05 .10
## 32 0 30 1 17.85 2.009 15.05 15.53
## .25 .50 .75 .90 .95
## 16.89 17.71 18.90 19.99 20.10
##
## lowest : 14.50 14.60 15.41 15.50 15.84, highest: 19.90 20.00 20.01 20.22 22.90
## ---------------------------------------------------------------------------
## vs
## n missing distinct Info Sum Mean Gmd
## 32 0 2 0.739 14 0.4375 0.5081
##
## ---------------------------------------------------------------------------
## am
## n missing distinct Info Sum Mean Gmd
## 32 0 2 0.724 13 0.4062 0.498
##
## ---------------------------------------------------------------------------
## gear
## n missing distinct Info Mean Gmd
## 32 0 3 0.841 3.688 0.7863
##
## Value 3 4 5
## Frequency 15 12 5
## Proportion 0.469 0.375 0.156
## ---------------------------------------------------------------------------
## carb
## n missing distinct Info Mean Gmd
## 32 0 6 0.929 2.812 1.718
##
## Value 1 2 3 4 6 8
## Frequency 7 10 3 10 1 1
## Proportion 0.219 0.312 0.094 0.312 0.031 0.031
## ---------------------------------------------------------------------------
Notation:
.05, .10 etc. denote the 5th percentile, 10th percentile etc.
Plot all the numeric variables of mtcars.
plot_num(mtcars)
The code below tests the correlation between the two variables mpg and cyl of mtcars using all three available methods: Pearson, Kendall and Spearman. Of the three, Pearson is the most widely used.
cor.test(mtcars$mpg,mtcars$cyl,method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$cyl
## t = -8.9197, df = 30, p-value = 6.113e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9257694 -0.7163171
## sample estimates:
## cor
## -0.852162
cor.test(mtcars$mpg,mtcars$cyl,method = "spearman")
## Warning in cor.test.default(mtcars$mpg, mtcars$cyl, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mtcars$mpg and mtcars$cyl
## S = 10425, p-value = 4.69e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.9108013
cor.test(mtcars$mpg,mtcars$cyl,method = "kendall")
## Warning in cor.test.default(mtcars$mpg, mtcars$cyl, method = "kendall"):
## Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: mtcars$mpg and mtcars$cyl
## z = -5.5913, p-value = 2.254e-08
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7953134
We notice that the Pearson correlation is strongly negative (about -0.85), and the two rank-based methods agree on the sign.
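The "Cannot compute exact p-value with ties" warnings above appear because both mpg and cyl contain repeated values, so R falls back to an approximate p-value. As a sketch, passing `exact = FALSE` requests the approximation up front, which should avoid the warning; the correlation estimates themselves do not change. The names `spearman_res` and `kendall_res` are mine.

```r
# Both variables contain repeated values (ties)
anyDuplicated(mtcars$mpg) > 0
anyDuplicated(mtcars$cyl) > 0

# Ask for the approximate p-value directly
spearman_res <- cor.test(mtcars$mpg, mtcars$cyl,
                         method = "spearman", exact = FALSE)
kendall_res  <- cor.test(mtcars$mpg, mtcars$cyl,
                         method = "kendall", exact = FALSE)

spearman_res$estimate
kendall_res$estimate
```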
The code below computes the covariance between mpg and cyl of mtcars using the same three methods: Pearson, Kendall and Spearman. Unlike a correlation, a covariance is not scaled to lie between -1 and 1, so its magnitude depends on the units of the variables.
cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "pearson")
## [1] -9.172379
cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "spearman")
## [1] -74.54032
cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "kendall")
## [1] -638
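The Pearson covariance and the Pearson correlation are directly related: dividing the covariance by the product of the two standard deviations gives the correlation. A minimal check, with `cov_xy` and `r_from_cov` as illustrative names:

```r
# Pearson covariance of mpg and cyl
cov_xy <- cov(mtcars$mpg, mtcars$cyl)

# Rescale by the standard deviations to get the correlation
r_from_cov <- cov_xy / (sd(mtcars$mpg) * sd(mtcars$cyl))
r_from_cov

# Should agree with cor()
all.equal(r_from_cov, cor(mtcars$mpg, mtcars$cyl))
```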
Use the profiling_num function to profile a numerical variable of the mtcars data set. The code below extracts the skewness and kurtosis of the mpg variable. These two values tell us a lot about the ‘shape’ of the distribution of mpg.
dispersion <- profiling_num(mtcars$mpg)
kurtosis <-dispersion$kurtosis
kurtosis
## [1] 2.799467
skewness<-dispersion$skewness
skewness
## [1] 0.6404399
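profiling_num reports kurtosis on the scale where a normal distribution has a value of 3 (not excess kurtosis, where the benchmark is 0). The kurtosis of 2.799467 is therefore slightly below that of a normal distribution, which means the tails of ‘mpg’ are, if anything, a little lighter than normal rather than heavier. The skewness of 0.64 is positive, so the distribution of ‘mpg’ is right-skewed: most of the data sit on the left and a longer tail stretches to the right of the graph.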
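To make these two statistics less of a black box, here is a sketch that computes them by hand from the central moments of mpg. It assumes the plain moment-based definitions (dividing by n), under which a normal distribution has kurtosis 3; the results should match the profiling_num values above. All variable names are mine.

```r
x <- mtcars$mpg
d <- x - mean(x)          # deviations from the mean

# Central moments (dividing by n, not n - 1)
m2 <- mean(d^2)
m3 <- mean(d^3)
m4 <- mean(d^4)

skew_manual <- m3 / m2^1.5   # > 0 means a longer right tail
kurt_manual <- m4 / m2^2     # 3 for a normal distribution
excess_kurt <- kurt_manual - 3

skew_manual
kurt_manual
excess_kurt
```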
These two observations can be checked visually by plotting the distribution of ‘mpg’.
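# Refer to columns by name inside aes(); the data argument supplies them
ggplot(mtcars, mapping = aes(x = mpg)) + geom_density()
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(mtcars, mapping = aes(x = mpg)) + geom_histogram(aes(y = after_stat(density)),
colour = "black", fill = "brown")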
Let us plot mpg against cyl and check the relationship.
ggplot(mtcars, mapping = aes(x = mpg, y = cyl))
You may be surprised to find an empty plot above. This is because ggplot builds a graph in layers. We added the layer that instructs R to create a plot, but did not add a layer that draws the data points. Hence we add another layer to the graph, one that actually draws the data. We can do this in any of the three ways below, among many others:
ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_point()
ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_jitter()
ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_line()
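The layering idea is easier to see if the plot is stored in a variable and built up step by step. The following is a small sketch; `base_plot` and `scatter` are names I made up, and the straight-line fit added at the end is only there to show a second layer being stacked on the first.

```r
library(ggplot2)

# Canvas only: data and aesthetic mapping, no layers yet
base_plot <- ggplot(mtcars, aes(x = mpg, y = cyl))
length(base_plot$layers)

# Adding a geom appends one layer that draws the points
scatter <- base_plot + geom_point()
length(scatter$layers)

# Layers stack: points plus a fitted straight line
scatter + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```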
By now you must have realised, both visually and from the numbers, that the two variables mpg and cyl are negatively correlated. This means that as one increases, the other tends to decrease.
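Since cyl takes only three values, a quick way to see this negative relationship in numbers is to compare the average mpg within each cylinder group. A minimal sketch, with `mpg_by_cyl` as an illustrative name:

```r
# Mean mpg for each number of cylinders
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 2)

# Number of cars in each group, for context
table(mtcars$cyl)
```

The group means fall as the number of cylinders rises, which is what a negative correlation suggests.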
Now, pick another pair of variables and study the same details as above. For example, I am picking {mpg, hp}.
result <- cor.test(mtcars$mpg,mtcars$hp,method = "pearson")
result
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
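Because the test was saved in `result`, its individual pieces can be pulled out instead of reading them off the printed output. A short sketch using the standard components of the object returned by cor.test:

```r
result <- cor.test(mtcars$mpg, mtcars$hp, method = "pearson")

result$estimate    # the correlation coefficient
result$p.value     # p-value of the test
result$conf.int    # 95 percent confidence interval
result$parameter   # degrees of freedom
```

This is handy when you want to collect the coefficients of many variable pairs into one table.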
Plotting the graph between mpg and hp.
ggplot(mtcars, mapping = aes(x = mpg, y = hp)) + geom_line()
Another attempt, this time with {mpg, wt}:
cor.test(mtcars$wt,mtcars$mpg)
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
ggplot(mtcars, mapping = aes(x = mpg, y = wt)) + geom_line()
We can try as many combinations as we like to find a good subset of highly correlated variables in the given data set. Believe me, it takes perseverance. But with enough practice it becomes much easier. In the beginning, I suggest using your intuition to select a good set of variables to study. As you go deeper into the numbers, you can refine your sets.
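Once intuition has given you a starting point, you can also let R compute every pairwise correlation at once and then filter. The sketch below uses the Pearson correlation matrix; the 0.8 cut-off and the names `cor_mat`, `mpg_cor`, `idx` and `strong_pairs` are arbitrary choices of mine.

```r
# Pearson correlation between every pair of variables
cor_mat <- cor(mtcars)

# Correlations of mpg with every other variable, strongest first
mpg_cor <- cor_mat["mpg", colnames(cor_mat) != "mpg"]
mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]

# All distinct pairs with |r| above the cut-off
# (upper.tri() keeps each pair once and drops the diagonal)
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Keep in mind that a high correlation is only a hint that a pair is worth a closer look, so it still pays to plot the pairs you shortlist.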