This study is part of Coursera Course on Regression Model. We will use R package mtcars You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”
Analytical Process :
We load mtcars and do some data processing and explanation. ‘mtcars’ data is part of datasets library and we ll load necessary library like, ggplot2, tidyverse,dplyr, GGally. Below variables are taken from the r studio help, ?mtcars We ll change variable class for listed below to factor
library(datasets)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# head(mtcars)
raw_mtcars <- mtcars
raw_mtcars[,c(2,8,9,10,11)] <- lapply(raw_mtcars[,c(2,8,9,10,11)],as.factor)
levels(raw_mtcars$am) <- c("AT","MN")
GGPairs plot showing corelation plot for each variable to understand which variable are corelated so we can remove it and we will explain it on lm model
Variable am and vs seems correlated, similar impact to mpg –> we will use am
Variable am and cyc seems uncorelated, it has negative impact to mpg –> we will use cyc variable as confound variable
Variable am and disp seems uncorelated, it has different in distribution
Variable am and drat seems uncorelated, it has different in distribution
Variable am and wt seems uncorelated, it has different in distribution
Variable am and qseq seems corelated, it has similar distributions
Variable am and hp seems uncorelated
We will see this distribution to get better understanding
head(raw_mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 MN 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 MN 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 MN 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 AT 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 AT 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 AT 3 1
glimpse(raw_mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <fct> MN, MN, MN, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, A…
## $ gear <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
ggplot(raw_mtcars,aes(x=am,y=mpg))+geom_boxplot(aes(fill = am))+theme_bw()+labs(x="Auto or Manual",y="Miles per gallon",title = "Miles per Gallons vs Auto/Manual")
Below are Correlation Density for above variable We will see mpg distribution per transmissions to get idea how data distributed. We can see mpg vs AM are not perfectly separated, there will be some unknown factor to get perfect modelling
ggplot(raw_mtcars,aes(mpg))+geom_density(aes(fill = am),alpha = 0.3, width = 0.3)+labs(x="mpg",y="Density",title = "MPG vs AM Density")
## Warning in geom_density(aes(fill = am), alpha = 0.3, width = 0.3): Ignoring
## unknown parameters: `width`
Below are variable corelation density We see here that 1. DISP and AM
are unclearly separated, this suggest that DISP is abit corelated with
AM, this suggest not to use DISP as regressor
DRAT and AM are separated, this suggest that DRAT is uncorelated with AM, this suggest to use DRAT as regressor
WT and AM are separated, this suggest that WT is uncorelated with AM, this suggest to use WT as regressor
QSEC and AM are not separated, this suggest that QSEC is corelated with AM, this suggest not to use DISP as regressor
HP and AM are separated, this suggest that HP is uncorelated with AM, this suggest to use HP as regressor
We can see cyl is also uncorelated based on pairs on appendix
ggplot(raw_mtcars,aes(disp))+geom_density(aes(fill = am),alpha = 0.3)+labs(x="disp",y="Density",title = "DISP vs AM Density")
ggplot(raw_mtcars,aes(drat))+geom_density(aes(fill = am),alpha = 0.3)+labs(x="drat",y="Density",title = "DRAT vs AM Density")
ggplot(raw_mtcars,aes(wt))+geom_density(aes(fill = am),alpha = 0.3)+labs(x="wt",y="Density",title = "WT vs AM Density")
ggplot(raw_mtcars,aes(qsec))+geom_density(aes(fill = am),alpha = 0.3)+labs(x="qsec",y="Density",title = "QSEC vs AM Density")
ggplot(raw_mtcars,aes(hp))+geom_density(aes(fill = am),alpha = 0.3)+labs(x="hp",y="Density",title = "HP vs AM Density")
This is to calculate the statistical inference about the MPG difference for Automatic and Manual Transmission. From previous box plot, we can see clear MPG difference for Automatic and Manual Transmission.
t.test(filter(raw_mtcars,am == "AT")$mpg, filter(raw_mtcars,am == "MN")$mpg)
##
## Welch Two Sample t-test
##
## data: filter(raw_mtcars, am == "AT")$mpg and filter(raw_mtcars, am == "MN")$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
We can see the t value = -3.77 and confidence interval is negative value with no 0 crossing, it suggests that MPG value for Automatic and Manual Transmission is different with Manual Transmission has higher Miles per Gallon. P-Value of 0.1% is sufficient to reject Null Hipothesis and suggested Alternative Hipothesis
We will use multiple modelling to see which model have better understanding ### 1. AM only as regressor
model1 <- lm(mpg~am,raw_mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = raw_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amMN 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
model2 <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb,raw_mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb, data = raw_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amMN 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
As per Linear Regression Modelling we can conclude below points :
Transmission Type suggest to have significant impact on Miles per Gallons based on T - test statistic suggest that it has significant difference
Here are factors impacting MPG based on linear model, Transmission Type, Weight, Rear Axle, Horse Power, Cylinder
Model suggest that Manual Transmission has better MPG
Some unexplainable factors due to Automatic and Manual Transmission MPG Density data are not perfectly segregated
Correlation Pairs Diagram
ggpairs(raw_mtcars,mapping = aes(col = am)) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Residual diagram of model 1
plot(model1)
Residual plot of model 2
plot(model2)
## Warning: not plotting observations with leverage one:
## 30, 31
residual plot of model 3 (best model)
plot(model3)