In this document, I analyze the mpg data set.
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
The dataset has 234 records with 11 attributes each. The first few records are shown below.
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
The input structure of the dataset is as follows.
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
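The preview and structure above can be reproduced with a short sketch like the following. It assumes the data comes from the `mpg` tibble shipped with the ggplot2 package, and the name `df` is borrowed from the model calls later in this document.

```r
library(ggplot2)   # provides the mpg data set

df <- mpg          # working copy; `df` matches the name used in later calls

dim(df)            # number of records and attributes
head(df)           # first six records
str(df)            # column types before any preprocessing
```

At this point every categorical column is still stored as character or integer, which is why the next step converts them.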
The data structure as loaded does not meet our requirements, so a few transformation steps are needed before visualizing the data or applying any ML algorithms.
Based on my basic understanding of the attributes, the categorical variables are manufacturer, model, year, cyl, trans, drv, fl and class.
The numerical attributes are displ, cty and hwy.
After the basic preprocessing, the dataset structure is as follows.
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : Factor w/ 2 levels "1999","2008": 1 1 2 2 1 1 2 1 1 2 ...
## $ cyl : Factor w/ 4 levels "4","5","6","8": 1 1 1 1 3 3 3 1 1 1 ...
## $ trans : Factor w/ 2 levels "auto","manual": 1 2 2 1 1 2 1 2 1 2 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
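One way to arrive at a structure like the one above is sketched below. The exact steps used for this report are not shown, so this is an assumption: it strips the parenthesised gearbox detail from `trans` (which would explain the two levels "auto" and "manual") and then converts the categorical columns to factors. The helper name `cat_cols` is only illustrative.

```r
library(ggplot2)   # provides the mpg data set

df <- mpg

# Collapse values such as "auto(l5)" or "manual(m6)" to "auto" / "manual"
df$trans <- sub("\\(.*\\)$", "", df$trans)

# Columns treated as categorical (illustrative name)
cat_cols <- c("manufacturer", "model", "year", "cyl",
              "trans", "drv", "fl", "class")

# Convert each categorical column to a factor; displ, cty and hwy stay numeric
df[cat_cols] <- lapply(df[cat_cols], factor)

str(df)
```

Treating `year` and `cyl` as factors fits this data because each takes only a handful of distinct values.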
##### Based on number of cylinders :-
##### Based on transmission type :-
##### Based on type of drive train :-
## 1999 2008 Total
## audi 9 9 18
## chevrolet 7 12 19
## dodge 16 21 37
## ford 15 10 25
## honda 5 4 9
## hyundai 6 8 14
## jeep 2 6 8
## land rover 2 2 4
## lincoln 2 1 3
## mercury 2 2 4
## nissan 6 7 13
## pontiac 3 2 5
## subaru 6 8 14
## toyota 20 14 34
## volkswagen 16 11 27
## Total 117 117 234
## 4 5 6 8 Total
## 1999 45 0 45 27 117
## 2008 36 4 34 43 117
## Total 81 4 79 70 234
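The two contingency tables above (manufacturer by year, and year by number of cylinders) can be rebuilt with base R. The sketch below is one possible approach, not necessarily the code used for this report; `with_totals` is a hypothetical helper that adds margins and relabels the default "Sum" margin as "Total".

```r
library(ggplot2)   # provides the mpg data set

df <- mpg

# Hypothetical helper: cross-tabulate two vectors and add "Total" margins
with_totals <- function(rows, cols) {
  tab <- addmargins(table(rows, cols))
  # addmargins() labels the margins "Sum" by default; rename them
  rownames(tab)[nrow(tab)] <- "Total"
  colnames(tab)[ncol(tab)] <- "Total"
  tab
}

by_manufacturer <- with_totals(df$manufacturer, df$year)
by_year         <- with_totals(df$year, df$cyl)

by_manufacturer
by_year
```

Both tables show 117 records for each model year, so the two years are balanced in this data set.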
##
## Call:
## lm(formula = displ ~ cty + hwy, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8616 -0.5159 -0.1413 0.4789 3.0983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.56454 0.21063 35.914 < 2e-16 ***
## cty -0.23332 0.04094 -5.699 3.65e-08 ***
## hwy -0.00679 0.02926 -0.232 0.817
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.781 on 231 degrees of freedom
## Multiple R-squared: 0.6377, Adjusted R-squared: 0.6346
## F-statistic: 203.3 on 2 and 231 DF, p-value: < 2.2e-16
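The summary above comes from the call `lm(formula = displ ~ cty + hwy, data = df)`. A minimal sketch to reproduce it is given below, again assuming `df` holds the ggplot2 `mpg` data. The correlation check at the end is an addition for illustration: city and highway mileage are strongly correlated, which is consistent with `hwy` showing no significant effect once `cty` is in the model.

```r
library(ggplot2)   # provides the mpg data set

df <- mpg

# Regress engine displacement on city and highway mileage
fit <- lm(displ ~ cty + hwy, data = df)
summary(fit)

# cty and hwy carry largely overlapping information
cor(df$cty, df$hwy)
```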
## Formula containing included variables:
##
## trans ~ cty
## <environment: 0x0000000028116358>
##
##
## Values calculated in each step of the selection procedure:
##
## vars Wilks.lambda F.statistics.overall p.value.overall F.statistics.diff
## 1 cty 0.9102864 22.86483 3.088974e-06 22.86483
## p.value.diff
## 1 3.088974e-06
## Call:
## lda(fit.gw$formula, data = df)
##
## Prior probabilities of groups:
## auto manual
## 0.6709402 0.3290598
##
## Group means:
## cty
## auto 15.96815
## manual 18.67532
##
## Coefficients of linear discriminants:
## LD1
## cty 0.2457429
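The discriminant analysis above is called as `lda(fit.gw$formula, data = df)`, and the stepwise output reports the selected formula as `trans ~ cty`. The sketch below writes that formula out directly so it runs without the selection step, and it assumes `trans` was collapsed to "auto" / "manual" as in the preprocessed structure. The confusion table at the end is an illustrative addition, not part of the original output.

```r
library(ggplot2)   # provides the mpg data set
library(MASS)      # provides lda()

df <- mpg

# Assumed preprocessing: keep only "auto" / "manual" in trans
df$trans <- factor(sub("\\(.*\\)$", "", df$trans))

# Linear discriminant analysis with the formula selected above
fit.lda <- lda(trans ~ cty, data = df)
fit.lda

# Illustrative check: in-sample predicted class against actual class
pred <- predict(fit.lda, df)$class
table(actual = df$trans, predicted = pred)
```

Because the table is computed on the same data used to fit the model, it gives an optimistic view of accuracy; a held-out split would be needed for a fair estimate.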