EDA of mtcars data set

The aim of this EDA is simply to check a few correlations and covariances and to plot some of the variables of ‘mtcars’, a data set built into R. It is meant as practice in working with R and in performing some preliminary analysis.

# Install and load required library packages
# Uncomment below if the packages are not installed
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("funModeling")
library(funModeling)
library(tidyverse)
library(Hmisc)

We will use the built-in data set ‘mtcars’. Let us check its structure first.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

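‘mtcars’ is a data frame in which every variable is stored as numeric. There are no factor (categorical) columns, although cyl, vs, am, gear and carb take only a few distinct values and behave like categories.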

class(mtcars)

## [1] "data.frame"
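Because several numeric columns hold only a handful of distinct values, it can be convenient to treat them as factors in later steps. The following is a minimal sketch of that idea, not part of the original analysis; the name `mtcars_cat` and the choice of columns are assumptions made for illustration.

```r
# Work on a copy so the original mtcars stays untouched.
# `mtcars_cat` is a hypothetical name used only in this sketch.
mtcars_cat <- mtcars

# Columns with only a few distinct values; treating them as
# categories is an assumption, not something the data set enforces.
cat_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars_cat[cat_cols] <- lapply(mtcars_cat[cat_cols], factor)

# The selected columns are now factors; the rest remain numeric.
str(mtcars_cat[cat_cols])
```

The rest of this post keeps working with the original, all-numeric mtcars.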

Check the number of zeros, NAs and infinite values in each variable of the data frame.

df_status(mtcars)

##    variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1       mpg       0    0.00    0    0     0     0 numeric     25
## 2       cyl       0    0.00    0    0     0     0 numeric      3
## 3      disp       0    0.00    0    0     0     0 numeric     27
## 4        hp       0    0.00    0    0     0     0 numeric     22
## 5      drat       0    0.00    0    0     0     0 numeric     22
## 6        wt       0    0.00    0    0     0     0 numeric     29
## 7      qsec       0    0.00    0    0     0     0 numeric     30
## 8        vs      18   56.25    0    0     0     0 numeric      2
## 9        am      19   59.38    0    0     0     0 numeric      2
## 10     gear       0    0.00    0    0     0     0 numeric      3
## 11     carb       0    0.00    0    0     0     0 numeric      6

Notation:
q_XX : Denotes a count (for example, q_zeros is the number of zeros).
p_XX : Denotes the same quantity as a percentage of the rows.
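To see what these columns mean, the same counts can be reproduced with base R. This is only a sketch of the idea, not how df_status is implemented; the variable names are made up for illustration.

```r
# Count and percentage of zeros per column.
zero_count <- colSums(mtcars == 0)
zero_pct   <- round(100 * colMeans(mtcars == 0), 2)

# Count of missing and infinite values per column.
na_count  <- colSums(is.na(mtcars))
inf_count <- colSums(sapply(mtcars, is.infinite))

# Assemble a small table in the spirit of the df_status output.
data.frame(q_zeros = zero_count, p_zeros = zero_pct,
           q_na = na_count, q_inf = inf_count)
```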

Check the summary statistics of mtcars (the five-number summary plus the mean).

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
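summary() reports six values per column because it includes the mean. For the classical five-number summary alone, base R offers fivenum(). The sketch below applies it to mpg; note that fivenum() uses Tukey's hinges, so its two middle values can differ slightly from the quartiles that quantile() returns by default.

```r
# Tukey's five-number summary for mpg.
five <- fivenum(mtcars$mpg)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five

# Default quantiles, for comparison with the hinges above.
quantile(mtcars$mpg)
```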

profiling_num(mtcars)

##    variable       mean     std_dev variation_coef     p_01    p_05
## 1       mpg  20.090625   6.0269481      0.2999881 10.40000 11.9950
## 2       cyl   6.187500   1.7859216      0.2886338  4.00000  4.0000
## 3      disp 230.721875 123.9386938      0.5371779 72.52600 77.3500
## 4        hp 146.687500  68.5628685      0.4674077 55.10000 63.6500
## 5      drat   3.596563   0.5346787      0.1486638  2.76000  2.8535
## 6        wt   3.217250   0.9784574      0.3041285  1.54462  1.7360
## 7      qsec  17.848750   1.7869432      0.1001159 14.53100 15.0455
## 8        vs   0.437500   0.5040161      1.1520369  0.00000  0.0000
## 9        am   0.406250   0.4989909      1.2282853  0.00000  0.0000
## 10     gear   3.687500   0.7378041      0.2000825  3.00000  3.0000
## 11     carb   2.812500   1.6152000      0.5742933  1.00000  1.0000
##         p_25    p_50   p_75      p_95      p_99   skewness kurtosis
## 1   15.42500  19.200  22.80  31.30000  33.43500  0.6404399 2.799467
## 2    4.00000   6.000   8.00   8.00000   8.00000 -0.1831287 1.319032
## 3  120.82500 196.300 326.00 449.00000 468.28000  0.4002724 1.910317
## 4   96.50000 123.000 180.00 253.55000 312.99000  0.7614356 3.052233
## 5    3.08000   3.695   3.92   4.31450   4.77500  0.2788734 2.435116
## 6    2.58125   3.325   3.61   5.29275   5.39951  0.4437855 3.172471
## 7   16.89250  17.710  18.90  20.10450  22.06920  0.3870456 3.553753
## 8    0.00000   0.000   1.00   1.00000   1.00000  0.2519763 1.063492
## 9    0.00000   0.000   1.00   1.00000   1.00000  0.3817709 1.145749
## 10   3.00000   4.000   4.00   5.00000   5.00000  0.5546495 2.056790
## 11   2.00000   2.000   4.00   4.90000   7.38000  1.1021304 4.536121
##          iqr           range_98         range_80
## 1    7.37500     [10.4, 33.435]   [14.34, 30.09]
## 2    4.00000             [4, 8]           [4, 8]
## 3  205.17500   [72.526, 468.28]     [80.61, 396]
## 4   83.50000     [55.1, 312.99]      [66, 243.5]
## 5    0.84000      [2.76, 4.775]   [3.007, 4.209]
## 6    1.02875 [1.54462, 5.39951] [1.9555, 4.0475]
## 7    2.00750  [14.531, 22.0692]  [15.534, 19.99]
## 8    1.00000             [0, 1]           [0, 1]
## 9    1.00000             [0, 1]           [0, 1]
## 10   1.00000             [3, 5]           [3, 5]
## 11   2.00000          [1, 7.38]           [1, 4]

Notation:
p_XX - In the above result, p_XX denotes the XXth percentile. For example, p_25 means the 25th percentile of the data.
iqr - Interquartile range (p_75 minus p_25)
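A few of these columns are easy to reproduce by hand, which helps in reading the table. The sketch below does so for mpg with base R; it assumes that variation_coef is the standard deviation divided by the mean, which agrees with the numbers printed above.

```r
x <- mtcars$mpg

# Coefficient of variation: spread relative to the mean.
cv <- sd(x) / mean(x)

# Selected percentiles, matching the p_XX columns.
pcts <- quantile(x, probs = c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99))

# Interquartile range: 75th percentile minus 25th percentile.
iqr_mpg <- IQR(x)

cv
pcts
iqr_mpg
```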

Note: We will focus on cleaning the data in another post. For now, the variables seem to be in reasonable shape, so we continue with the original data without any clean-up.

describe(mtcars)

## mtcars 
## 
##  11  Variables      32  Observations
## ---------------------------------------------------------------------------
## mpg 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       25    0.999    20.09    6.796    12.00    14.34 
##      .25      .50      .75      .90      .95 
##    15.43    19.20    22.80    30.09    31.30 
## 
## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
## ---------------------------------------------------------------------------
## cyl 
##        n  missing distinct     Info     Mean      Gmd 
##       32        0        3    0.866    6.188    1.948 
##                             
## Value          4     6     8
## Frequency     11     7    14
## Proportion 0.344 0.219 0.438
## ---------------------------------------------------------------------------
## disp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       27    0.999    230.7    142.5    77.35    80.61 
##      .25      .50      .75      .90      .95 
##   120.83   196.30   326.00   396.00   449.00 
## 
## lowest :  71.1  75.7  78.7  79.0  95.1, highest: 360.0 400.0 440.0 460.0 472.0
## ---------------------------------------------------------------------------
## hp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       22    0.997    146.7    77.04    63.65    66.00 
##      .25      .50      .75      .90      .95 
##    96.50   123.00   180.00   243.50   253.55 
## 
## lowest :  52  62  65  66  91, highest: 215 230 245 264 335
## ---------------------------------------------------------------------------
## drat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       22    0.997    3.597   0.6099    2.853    3.007 
##      .25      .50      .75      .90      .95 
##    3.080    3.695    3.920    4.209    4.314 
## 
## lowest : 2.76 2.93 3.00 3.07 3.08, highest: 4.08 4.11 4.22 4.43 4.93
## ---------------------------------------------------------------------------
## wt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       29    0.999    3.217    1.089    1.736    1.956 
##      .25      .50      .75      .90      .95 
##    2.581    3.325    3.610    4.048    5.293 
## 
## lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
## ---------------------------------------------------------------------------
## qsec 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       30        1    17.85    2.009    15.05    15.53 
##      .25      .50      .75      .90      .95 
##    16.89    17.71    18.90    19.99    20.10 
## 
## lowest : 14.50 14.60 15.41 15.50 15.84, highest: 19.90 20.00 20.01 20.22 22.90
## ---------------------------------------------------------------------------
## vs 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##       32        0        2    0.739       14   0.4375   0.5081 
## 
## ---------------------------------------------------------------------------
## am 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##       32        0        2    0.724       13   0.4062    0.498 
## 
## ---------------------------------------------------------------------------
## gear 
##        n  missing distinct     Info     Mean      Gmd 
##       32        0        3    0.841    3.688   0.7863 
##                             
## Value          3     4     5
## Frequency     15    12     5
## Proportion 0.469 0.375 0.156
## ---------------------------------------------------------------------------
## carb 
##        n  missing distinct     Info     Mean      Gmd 
##       32        0        6    0.929    2.812    1.718 
##                                               
## Value          1     2     3     4     6     8
## Frequency      7    10     3    10     1     1
## Proportion 0.219 0.312 0.094 0.312 0.031 0.031
## ---------------------------------------------------------------------------

Notation:
.05, .10 etc. denote the 5th percentile, 10th percentile and so on.

Plot all the numeric variables of mtcars.

plot_num(mtcars)

The code below checks the correlation between the two variables mpg and cyl of mtcars using all three methods: Pearson, Kendall and Spearman. Of these, Pearson is the most widely used.

cor.test(mtcars$mpg,mtcars$cyl,method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$cyl
## t = -8.9197, df = 30, p-value = 6.113e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9257694 -0.7163171
## sample estimates:
##       cor 
## -0.852162

cor.test(mtcars$mpg,mtcars$cyl,method = "spearman")

## Warning in cor.test.default(mtcars$mpg, mtcars$cyl, method = "spearman"):
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  mtcars$mpg and mtcars$cyl
## S = 10425, p-value = 4.69e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.9108013

cor.test(mtcars$mpg,mtcars$cyl,method = "kendall")

## Warning in cor.test.default(mtcars$mpg, mtcars$cyl, method = "kendall"):
## Cannot compute exact p-value with ties

## 
##  Kendall's rank correlation tau
## 
## data:  mtcars$mpg and mtcars$cyl
## z = -5.5913, p-value = 2.254e-08
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7953134

We notice that the Pearson correlation is strongly negative (about -0.85), and the two rank-based methods agree on the sign.
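To make the numbers less of a black box, the Pearson and Spearman estimates can be rebuilt from simpler pieces. This is a sketch for illustration only: Pearson's coefficient is the covariance scaled by both standard deviations, and Spearman's is Pearson's coefficient applied to the ranks.

```r
x <- mtcars$mpg
y <- mtcars$cyl

# Pearson: covariance divided by the product of standard deviations.
pearson_manual <- cov(x, y) / (sd(x) * sd(y))

# Spearman: Pearson correlation of the ranks (ties get average ranks).
spearman_manual <- cor(rank(x), rank(y))

c(pearson = pearson_manual, spearman = spearman_manual)
```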

The code below computes the covariance between mpg and cyl using the same three methods: Pearson, Kendall and Spearman. Unlike correlation, covariance is not scaled to a fixed range, so the three values are not directly comparable with one another.

cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "pearson")

## [1] -9.172379

cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "spearman")

## [1] -74.54032

cov(mtcars$mpg,mtcars$cyl,use = "everything",method = "kendall")

## [1] -638
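The three values differ so much because each method works on a different scale. The sketch below rebuilds them by hand. The Pearson and Spearman lines follow the textbook definitions; the Kendall line reflects my understanding of what R sums for this method (the sign products over all ordered pairs), so treat it as an assumption rather than documented behaviour.

```r
x <- mtcars$mpg
y <- mtcars$cyl

# Pearson: average product of deviations, with n - 1 in the denominator.
cov_pearson_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Spearman: the same covariance, computed on ranks instead of raw values.
cov_spearman_manual <- cov(rank(x), rank(y))

# Kendall (assumed): sum of sign(x_i - x_j) * sign(y_i - y_j)
# over all ordered pairs (i, j), without any normalisation.
cov_kendall_manual <- sum(sign(outer(x, x, "-")) * sign(outer(y, y, "-")))

c(pearson = cov_pearson_manual,
  spearman = cov_spearman_manual,
  kendall = cov_kendall_manual)
```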

Use the profiling_num function to profile a numerical variable of the mtcars data set. The code below extracts the skewness and kurtosis of the mpg variable. These two values tell us a lot about the ‘shape’ of the distribution of mpg.

dispersion <- profiling_num(mtcars$mpg)
kurtosis <-dispersion$kurtosis
kurtosis

## [1] 2.799467

skewness<-dispersion$skewness
skewness

## [1] 0.6404399

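The kurtosis value of 2.799467 is close to 3, which is the kurtosis of a normal distribution (profiling_num reports plain kurtosis, not excess kurtosis). This means the tails of the ‘mpg’ distribution are about as heavy as, or slightly lighter than, those of a normal distribution. The skewness value of 0.64 signifies that the distribution of ‘mpg’ is right-skewed: most of the data sits towards the left, with a longer tail stretching to the right of the graph.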
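Both numbers can be reproduced from the central moments of mpg, which also shows which kurtosis convention is in play. The sketch below uses moment-based formulas with n in the denominator; this choice is an assumption on my part, made because it reproduces the values printed above.

```r
x <- mtcars$mpg
d <- x - mean(x)

# Central moments, dividing by n.
m2 <- mean(d^2)
m3 <- mean(d^3)
m4 <- mean(d^4)

skew_manual <- m3 / m2^1.5   # 0 for a symmetric distribution
kurt_manual <- m4 / m2^2     # 3 for a normal distribution
excess_kurt <- kurt_manual - 3

c(skewness = skew_manual, kurtosis = kurt_manual, excess = excess_kurt)
```

A slightly negative excess kurtosis, as here, points to tails a little lighter than those of a normal distribution.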

Both observations can be checked visually by plotting the distribution of ‘mpg’.

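# Inside aes(), refer to columns by name; ggplot looks them up in mtcars
ggplot(mtcars, mapping = aes(x = mpg)) + geom_density()

# after_stat(density) replaces the older ..density.. notation
ggplot(mtcars, mapping = aes(x = mpg)) + geom_histogram(aes(y = after_stat(density)),
                   colour = "black", fill = "brown")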

Let us plot a graph between mpg and cyl variables of mtcars and check the relation.

ggplot(mtcars, mapping = aes(x = mpg, y = cyl))

You may be surprised to see an empty plot above. This is because ggplot builds a graph in layers. We added the layer that tells R to create a plot, but no layer that actually draws the data points. Hence we add another layer that draws the data. We can do this in any of the three ways below, among many others:

ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_point()

ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_jitter()

ggplot(mtcars, mapping = aes(x = mpg, y = cyl)) + geom_line()
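The layer idea becomes clearer when the base plot is stored in a variable and layers are added to it afterwards. The following is a small sketch; the object names are made up for illustration.

```r
library(ggplot2)

# Base plot: data and aesthetic mapping only, no drawing layer yet.
base_plot <- ggplot(mtcars, aes(x = mpg, y = cyl))

# Adding a geom creates a new plot object with one layer.
scatter_plot <- base_plot + geom_point()

# The base plot has no layers; the scatter plot has one.
length(base_plot$layers)
length(scatter_plot$layers)

# Printing the object draws it.
scatter_plot
```

Keeping the base plot in a variable makes it easy to try different geoms without repeating the mapping.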

By now, both the plots and the numbers show that the two variables mpg and cyl are negatively correlated: as one increases, the other tends to decrease.

Now, pick another pair of variables and study the same details as above. For example, I am picking {mpg, hp}.

result <- cor.test(mtcars$mpg,mtcars$hp,method = "pearson")
result

## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

Plotting the graph between mpg and hp.

ggplot(mtcars, mapping = aes(x = mpg, y = hp)) + geom_line()

Another attempt, this time with {mpg, wt}:

cor.test(mtcars$wt,mtcars$mpg)

## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

ggplot(mtcars, mapping = aes(x = mpg, y = wt)) + geom_line()

We can try as many combinations as we like to find a good subset of highly correlated variables in the data set. Believe me, it takes perseverance. But with enough practice it becomes much easier. In the beginning, I suggest using your intuition to select a good set of variables to study. As you go deeper into the numbers, you can refine your sets.
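Rather than testing pairs one at a time, one option is to compute every pairwise correlation at once and rank the results. The sketch below does this for mpg with base R; it uses Pearson correlation by default and the variable names are chosen for illustration.

```r
# Pearson correlation between every pair of columns.
cor_matrix <- cor(mtcars)

# Correlations of mpg with all other variables.
mpg_cor <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]

# Rank by strength, ignoring the sign.
mpg_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_ranked, 3)
```

The variables at the top of this ranking are natural candidates for the closer look described above.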

Vikas Sinha (G.O.D.S - https://www.gods-online.com)

September 21, 2019