Analysing the mpg Dataset

Objective :-

In this document, I analyse the mpg dataset.

Description of the Dataset :-

This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

Descriptive Statistics of the Dataset :-

The dataset has 234 records and 11 attributes for each record. The first few records of the data are as follows.

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

The input structure of the dataset is as follows.

## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
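
The two outputs above can be reproduced with a short sketch like the one below. The original code is not shown in this document, so this is an assumption: it takes the data from the `mpg` tibble bundled with the ggplot2 package, and it borrows the object name `df` from the model calls further down.

```r
# Assumption: mpg is loaded from ggplot2, which bundles this dataset.
library(ggplot2)

df <- mpg

dim(df)   # number of records and number of attributes
head(df)  # first six records
str(df)   # column types before any preprocessing
```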

Data Preprocessing :-

The raw data structure does not meet our requirements, so a few transformation steps are needed before visualizing the data or applying any ML algorithms.

Based on my understanding of the attributes, the categorical variables are as follows:

manufacturer
model
year (year of release)
cyl (number of cylinders)
trans (type of transmission)
drv (type of drive train)
fl (fuel type)
class (type of car)

The numerical attributes are as follows:

displ (engine displacement, in litres)
cty (city miles per gallon)
hwy (highway miles per gallon)

After this basic preprocessing, the dataset structure is as follows. Note that trans has been reduced to two levels, auto and manual.

## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : Factor w/ 2 levels "1999","2008": 1 1 2 2 1 1 2 1 1 2 ...
##  $ cyl         : Factor w/ 4 levels "4","5","6","8": 1 1 1 1 3 3 3 1 1 1 ...
##  $ trans       : Factor w/ 2 levels "auto","manual": 1 2 2 1 1 2 1 2 1 2 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
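
A minimal sketch of preprocessing that leads to the structure above is given below. The original code is not shown, so the regular expression used to collapse trans and the loop over column names are my assumptions; only the resulting factor levels are taken from the output.

```r
library(ggplot2)  # assumption: source of the mpg data

df <- mpg

# Drop the bracketed detail, so "auto(l5)" becomes "auto" and "manual(m5)" becomes "manual".
df$trans <- sub("\\(.*\\)$", "", df$trans)

# Convert every categorical attribute to a factor.
categorical <- c("manufacturer", "model", "year", "cyl", "trans", "drv", "fl", "class")
for (col in categorical) {
  df[[col]] <- factor(df[[col]])
}

str(df)
```

The numeric attributes displ, cty and hwy are left untouched.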

Visualizing data :-

Based on year of release :-
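
The plot for this heading is not preserved in this text. As a sketch, and assuming a simple bar chart of record counts was intended, it could be drawn as follows. The object name `p_year` and the axis labels are illustrative.

```r
library(ggplot2)

# Record counts per release year, as a plain table.
table(mpg$year)

# The same counts as a bar chart; year is treated as a category, not a number.
p_year <- ggplot(mpg, aes(x = factor(year))) +
  geom_bar() +
  labs(x = "Year of release", y = "Number of records")

print(p_year)
```

The same pattern applies to the other categorical attributes by swapping the column inside `factor()`.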

Based on number of cylinders :-

Based on transmission type :-

Based on type of drive train :-

Distribution of continuous variables :-
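
The distribution plots are not preserved in this text either. A sketch of one way to look at the three continuous attributes is shown below; the choice of a histogram and the bin width of 1 are my assumptions.

```r
library(ggplot2)

# Numeric summaries of the three continuous attributes.
summary(mpg[, c("displ", "cty", "hwy")])

# Histogram of city mileage; swap in displ or hwy for the other two attributes.
p_cty <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1) +
  labs(x = "City miles per gallon", y = "Number of records")

print(p_cty)
```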

Cross tables of categorical variables :-

Manufacturer vs year of release :-

##            1999 2008 Total
## audi          9    9    18
## chevrolet     7   12    19
## dodge        16   21    37
## ford         15   10    25
## honda         5    4     9
## hyundai       6    8    14
## jeep          2    6     8
## land rover    2    2     4
## lincoln       2    1     3
## mercury       2    2     4
## nissan        6    7    13
## pontiac       3    2     5
## subaru        6    8    14
## toyota       20   14    34
## volkswagen   16   11    27
## Total       117  117   234

Number of cylinders vs year of release :-

##        4 5  6  8 Total
## 1999  45 0 45 27   117
## 2008  36 4 34 43   117
## Total 81 4 79 70   234
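
Both cross tables can be rebuilt with base R. The sketch below is one possible way and not necessarily the code used for this report; `with_totals` is a hypothetical helper that adds the margins and renames the default "Sum" label to "Total".

```r
library(ggplot2)  # assumption: source of the mpg data

# Hypothetical helper: add row and column totals to a two-way table.
with_totals <- function(tab) {
  tab <- addmargins(tab)
  rownames(tab)[nrow(tab)] <- "Total"
  colnames(tab)[ncol(tab)] <- "Total"
  tab
}

manufacturer_by_year <- with_totals(table(mpg$manufacturer, mpg$year))
year_by_cyl <- with_totals(table(mpg$year, mpg$cyl))

manufacturer_by_year
year_by_cyl
```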

Correlation between all variables :-
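
No output survives under this heading. As a sketch, and assuming Pearson correlation over the three numeric attributes was intended, the matrix can be computed like this. The object name `cor_matrix` is illustrative.

```r
library(ggplot2)  # assumption: source of the mpg data

# Pearson correlation between the numeric attributes only.
numeric_cols <- c("displ", "cty", "hwy")
cor_matrix <- cor(mpg[, numeric_cols])

round(cor_matrix, 3)
```

The categorical attributes are left out here, because Pearson correlation is only defined for numeric columns.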

Fitting a linear model :-

## 
## Call:
## lm(formula = displ ~ cty + hwy, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8616 -0.5159 -0.1413  0.4789  3.0983 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.56454    0.21063  35.914  < 2e-16 ***
## cty         -0.23332    0.04094  -5.699 3.65e-08 ***
## hwy         -0.00679    0.02926  -0.232    0.817    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.781 on 231 degrees of freedom
## Multiple R-squared:  0.6377, Adjusted R-squared:  0.6346 
## F-statistic: 203.3 on 2 and 231 DF,  p-value: < 2.2e-16
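
The model call is visible in the output above, so the fit can be reproduced with the sketch below. Only the data source (ggplot2's `mpg`) and the object name `fit.lm` are assumptions.

```r
library(ggplot2)  # assumption: source of the mpg data

df <- mpg

# Regress engine displacement on city and highway mileage.
fit.lm <- lm(displ ~ cty + hwy, data = df)

summary(fit.lm)
```

Reading the coefficients, cty is strongly significant while hwy adds little once cty is in the model (p = 0.817). A likely reason is that the two mileage attributes carry largely the same information.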

Fitting greedy.wilks :-

## Formula containing included variables: 
## 
## trans ~ cty
## <environment: 0x0000000028116358>
## 
## 
## Values calculated in each step of the selection procedure: 
## 
##   vars Wilks.lambda F.statistics.overall p.value.overall F.statistics.diff
## 1  cty    0.9102864             22.86483    3.088974e-06          22.86483
##   p.value.diff
## 1 3.088974e-06
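
The stepwise selection above comes from `greedy.wilks()` in the klaR package. The candidate variables passed to it are not shown in this document, so the formula in the sketch below (the three numeric attributes) is an assumption, and the selected set may differ from the one printed above.

```r
library(ggplot2)  # assumption: source of the mpg data
library(klaR)     # provides greedy.wilks()

df <- as.data.frame(mpg)
df$trans <- factor(sub("\\(.*\\)$", "", df$trans))  # auto / manual

# Assumption: displ, cty and hwy are the candidate predictors.
fit.gw <- greedy.wilks(trans ~ displ + cty + hwy, data = df)

fit.gw          # selected variables and Wilks' lambda per step
fit.gw$formula  # formula with only the selected variables
```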

Fitting LDA :-

## Call:
## lda(fit.gw$formula, data = df)
## 
## Prior probabilities of groups:
##      auto    manual 
## 0.6709402 0.3290598 
## 
## Group means:
##             cty
## auto   15.96815
## manual 18.67532
## 
## Coefficients of linear discriminants:
##           LD1
## cty 0.2457429
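
The LDA fit can be reproduced with `lda()` from the MASS package. The report passes `fit.gw$formula`; the sketch below writes out `trans ~ cty`, which is the formula printed by greedy.wilks, so that it runs on its own. The data source and the object name `fit.lda` are assumptions.

```r
library(ggplot2)  # assumption: source of the mpg data
library(MASS)     # provides lda()

df <- as.data.frame(mpg)
df$trans <- factor(sub("\\(.*\\)$", "", df$trans))  # auto / manual

# Discriminate transmission type using city mileage only.
fit.lda <- lda(trans ~ cty, data = df)

fit.lda$prior  # share of auto and manual records
fit.lda$means  # mean city mileage per transmission type
```

The group means show that manual cars in this dataset have a higher average city mileage than automatic ones.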