1. Overview/Executive Summary
This report was prepared for the Coursera Regression Models course, following the given instructions. It presents a regression analysis of a car dataset of the kind used in the automobile industry and explores the relationships among the variables in that dataset. We use the mtcars dataset, which ships with base R in the datasets package. The project focuses on two questions.
1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions.
2. Data Preprocessing
The dataset consists of 32 observations and 11 variables.
mpg :- Miles per US gallon
cyl :- Number of cylinders
disp :- Displacement (cubic inches)
hp :- Gross horsepower
drat :- Rear axle ratio
wt :- Weight (1000 lbs)
qsec :- ¼ mile time (seconds)
vs :- Engine shape (0 = V-shaped, 1 = straight)
am :- Transmission (0 = automatic, 1 = manual)
gear :- Number of forward gears
carb :- Number of carburetors
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
library(tidyverse)
library(corrplot)
df <- as_tibble(mtcars)
df
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
The variables cyl, vs, am, gear and carb are qualitative, but in this dataset they are stored as numeric values, so they need to be converted to factors.
df$cyl <- as.factor(df$cyl)
df$vs <- as.factor(df$vs)
df$gear <- as.factor(df$gear)
df$carb <- as.factor(df$carb)
df$am <- factor(df$am,labels=c('Automatic','Manual'))
str(df)
tibble [32 x 11] (S3: tbl_df/tbl/data.frame)
$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num [1:32] 160 160 108 258 360 ...
$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
$ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
$ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
$ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
$ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
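The same conversion can also be written in one step with base R. The block below is a minimal sketch rather than the method used in this report; the names cars_fct and cat_cols are hypothetical, and the data are rebuilt from mtcars so that df is left untouched.

```r
# Sketch: convert several categorical columns to factors in one step.
cars_fct <- mtcars                           # hypothetical working copy
cat_cols <- c("cyl", "vs", "gear", "carb")   # columns treated as categorical

# lapply() applies factor() to each selected column
cars_fct[cat_cols] <- lapply(cars_fct[cat_cols], factor)

# Spell out levels as well as labels so the 0/1 mapping is explicit
cars_fct$am <- factor(cars_fct$am, levels = c(0, 1),
                      labels = c("Automatic", "Manual"))

table(cars_fct$am)   # number of cars per transmission type
```

Giving levels explicitly protects against the labels being attached in the wrong order if the coding of am ever changes.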
M<-cor(mtcars)
corrplot(M, method="number")
According to this correlation plot, cyl, disp, hp, drat, wt, vs and am each show a fairly strong linear relationship with mpg.
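To read the plot more easily, the correlations with mpg can be pulled out of the matrix and ranked by absolute size. This is a small sketch using the original numeric coding of mtcars; the names mpg_cor and ranked are hypothetical.

```r
# Sketch: rank the predictors by the strength of their correlation with mpg.
M <- cor(mtcars)                               # full correlation matrix

# Keep the mpg row and drop mpg's correlation with itself
mpg_cor <- M["mpg", colnames(M) != "mpg"]

# Order by absolute value so strong negative correlations rank high too
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(ranked, 2)
```

Ranking by absolute value matters here because a strongly negative correlation is just as informative as a strongly positive one.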
model1 <- lm(mpg ~ factor(am), data=df)
summary(model1)
Call:
lm(formula = mpg ~ factor(am), data = df)
Residuals:
Min 1Q Median 3Q Max
-9.3923 -3.0923 -0.2974 3.2439 9.5077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.147 1.125 15.247 1.13e-15 ***
factor(am)Manual 7.245 1.764 4.106 0.000285 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
According to this output, the p-value is 0.000285, which is less than the 0.05 level of significance, so we reject the null hypothesis that mean mpg is the same for both transmission types. The am factor is therefore significant when considered alone, although this model explains only about 36% of the variance in mpg (R-squared 0.3598). For further analysis, we perform an analysis of variance on the data with all variables included.
anova <- aov(mpg ~ ., data = df)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
cyl 2 824.8 412.4 51.377 1.94e-07 ***
disp 1 57.6 57.6 7.181 0.0171 *
hp 1 18.5 18.5 2.305 0.1497
drat 1 11.9 11.9 1.484 0.2419
wt 1 55.8 55.8 6.950 0.0187 *
qsec 1 1.5 1.5 0.190 0.6692
vs 1 0.3 0.3 0.038 0.8488
am 1 16.6 16.6 2.064 0.1714
gear 2 5.0 2.5 0.313 0.7361
carb 5 13.6 2.7 0.339 0.8814
Residuals 15 120.4 8.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA table, the variables with p-values below the 0.05 level of significance are cyl, disp and wt. The am variable is not significant here (p = 0.1714), but it is kept in the next model because it is the variable of interest.
model2 <- lm(mpg ~ cyl + disp + wt + am, data = df)
summary(model2)
Call:
lm(formula = mpg ~ cyl + disp + wt + am, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.5029 -1.2829 -0.4825 1.4954 5.7889
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.816067 2.914272 11.604 8.79e-12 ***
cyl6 -4.304782 1.492355 -2.885 0.00777 **
cyl8 -6.318406 2.647658 -2.386 0.02458 *
disp 0.001632 0.013757 0.119 0.90647
wt -3.249176 1.249098 -2.601 0.01513 *
amManual 0.141212 1.326751 0.106 0.91605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.652 on 26 degrees of freedom
Multiple R-squared: 0.8376, Adjusted R-squared: 0.8064
F-statistic: 26.82 on 5 and 26 DF, p-value: 1.73e-09
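This output shows that the R-squared value is 0.8376 (adjusted 0.8064), which means about 84% of the variance in mpg is explained by the model. Note that the amManual coefficient is only 0.14 with a p-value of 0.916, so once cyl, disp and wt are accounted for, the transmission effect is no longer significant.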
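One way to check whether the extra predictors improve on the transmission-only model is a nested-model F-test. The block below is a minimal sketch and not part of the original analysis; the names cars, fit_am, fit_adj and cmp are hypothetical, and the data are rebuilt from mtcars so the block runs on its own.

```r
# Sketch: compare the transmission-only model with the adjusted model.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, levels = c(0, 1),
                   labels = c("Automatic", "Manual"))

fit_am  <- lm(mpg ~ am, data = cars)                    # transmission only
fit_adj <- lm(mpg ~ cyl + disp + wt + am, data = cars)  # adjusted model

# F-test for the nested models: do cyl, disp and wt add explanatory power?
cmp <- anova(fit_am, fit_adj)
cmp

# R-squared of each model, for comparison
c(am_only  = summary(fit_am)$r.squared,
  adjusted = summary(fit_adj)$r.squared)
```

A small p-value in the second row of the comparison table indicates that the larger model fits significantly better than the transmission-only model.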
ggplot(model2,aes(sample=.resid)) + stat_qq()+stat_qq_line()
This figure suggests that the residuals are approximately normally distributed. A Q-Q plot by itself does not show whether the errors are uncorrelated.
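A formal test can complement the visual check of normality. The sketch below applies the Shapiro-Wilk test from base R to the residuals of the same model formula; it is an illustration under the assumption that this test suits a sample of 32 cars, and the names cars, fit, res and sw are hypothetical.

```r
# Sketch: Shapiro-Wilk normality test on the residuals of the adjusted model.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, levels = c(0, 1),
                   labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + disp + wt + am, data = cars)
res <- resid(fit)            # one residual per car

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(res)
sw
```

A p-value above the chosen significance level means the test finds no evidence against normality; it does not prove that the residuals are normal.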
ggplot(df, aes(x = mpg, y = am, fill = am, col = am)) +
geom_boxplot(outlier.shape = NA, alpha = 0.2) +
geom_jitter(aes(col = am)) +
coord_flip()
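This report focused on two questions. For the first, manual transmission appears better than automatic for MPG when transmission is considered on its own; after adjusting for cyl, disp and wt in model2, the difference is small and not statistically significant.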
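The following output answers the second question. We perform a t-test, assuming mpg is approximately normally distributed within each transmission group, and see that the mean mpg of manual and automatic cars is significantly different (p-value = 0.001374): manual cars average about 7.2 mpg more (24.39 versus 17.15), with a 95 percent confidence interval of 3.21 to 11.28 mpg.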
t.test(mpg ~ am, data = df)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group Automatic mean in group Manual
17.14737 24.39231
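The difference between the two group means above is the same quantity that the transmission-only regression estimates as its slope. The block below is a small sketch of that link; the names cars, fit, means and diff_means are hypothetical, and the data are rebuilt from mtcars so the block runs on its own.

```r
# Sketch: the slope of mpg ~ am equals the difference in group means.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean mpg per transmission type
means <- tapply(cars$mpg, cars$am, mean)
diff_means <- unname(means["Manual"] - means["Automatic"])

# The intercept is the Automatic mean; the amManual slope is the difference
fit <- lm(mpg ~ am, data = cars)
coef(fit)
diff_means

# Confidence interval for the difference from the regression
confint(fit)["amManual", ]
```

The regression interval assumes equal variances in the two groups, so it will not match the Welch interval reported by t.test exactly.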
The source of this report is available at this link for verification: https://github.com/muhan1027/regressionModelReport