1. Overview/Executive Summary

This report was prepared for the Coursera Regression Models course, following the course instructions. It presents a regression analysis of a car dataset relevant to the automobile industry and explores the relationship between miles per gallon (MPG) and the other variables in the dataset. We use the mtcars dataset, which ships with the datasets package in base R. The project focuses on two questions.

1. Is an automatic or manual transmission better for MPG?

2. Quantify the MPG difference between automatic and manual transmissions.

2. Data Preprocessing

The dataset consists of 32 observations of 11 variables:
mpg  :- Miles per US gallon
cyl  :- Number of cylinders
disp :- Displacement (cubic inches)
hp   :- Gross horsepower
drat :- Rear axle ratio
wt   :- Weight (1000 lbs)
qsec :- ¼ mile time (seconds)
vs   :- Engine shape (0 = V-shaped, 1 = straight)
am   :- Transmission (0 = automatic, 1 = manual)
gear :- Number of forward gears
carb :- Number of carburetors

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
library(tidyverse)
library(corrplot)
df <- as_tibble(mtcars)  # as.tibble() is deprecated; the car-model row names are dropped
df
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ... with 22 more rows

The variables cyl, vs, am, gear, and carb are qualitative, but in this dataset they are stored as numeric values. They therefore need to be converted to factors.

df$cyl <- as.factor(df$cyl)
df$vs <- as.factor(df$vs)
df$gear <- as.factor(df$gear)
df$carb <- as.factor(df$carb)
df$am <- factor(df$am,labels=c('Automatic','Manual'))
str(df)
tibble [32 x 11] (S3: tbl_df/tbl/data.frame)
 $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num [1:32] 160 160 108 258 360 ...
 $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
M<-cor(mtcars)
corrplot(M, method="number")

According to the correlation matrix, cyl, disp, hp, drat, wt, vs, and am each have a moderate to strong linear relationship with mpg.
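The correlations with mpg can also be read directly from the matrix rather than from the plot. The following is a minimal sketch using only base R and the built-in mtcars data; the name mpg_cor is introduced here for illustration and is not part of the original analysis.

```r
# Correlation of every variable with mpg, using the built-in mtcars data.
M <- cor(mtcars)

# Drop mpg's correlation with itself and order the rest by absolute strength.
mpg_cor <- M["mpg", colnames(M) != "mpg"]
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]

print(round(mpg_cor, 2))
```

Sorting by absolute value puts strong negative relationships (such as weight) next to strong positive ones, which makes it easier to pick candidate predictors.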

model1 <- lm(mpg ~ factor(am), data=df)
summary(model1)

Call:
lm(formula = mpg ~ factor(am), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        17.147      1.125  15.247 1.13e-15 ***
factor(am)Manual    7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

According to this output, the p-value is 0.000285, which is below the 0.05 significance level. We therefore reject the null hypothesis that transmission type has no effect on mpg and conclude that am is a significant predictor: on average, manual cars achieve 7.245 mpg more than automatic cars. This model explains only about 36% of the variance, however, so we next perform an analysis of variance on all variables to find further predictors.
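With a single two-level factor as the predictor, the regression coefficients are just group means in disguise. The sketch below illustrates this with base R only; the data frame d and the object names fit and group_means are assumptions made for the example, chosen to mirror the labelled am factor used in the report.

```r
# Recreate the labelled transmission factor (assumed to match the report's df$am).
d <- data.frame(
  mpg = mtcars$mpg,
  am  = factor(mtcars$am, labels = c("Automatic", "Manual"))
)

fit <- lm(mpg ~ am, data = d)
group_means <- tapply(d$mpg, d$am, mean)

# The intercept is the Automatic mean; the slope is Manual minus Automatic.
print(coef(fit))
print(group_means)
print(diff(group_means))

# 95% confidence interval for the unadjusted difference (pooled variance).
print(confint(fit)["amManual", ])
```

Because lm() pools the variance across both groups, this interval will differ slightly from one produced by a Welch t-test, which does not assume equal variances.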

anova <- aov(mpg ~ ., data = df)
summary(anova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
cyl          2  824.8   412.4  51.377 1.94e-07 ***
disp         1   57.6    57.6   7.181   0.0171 *  
hp           1   18.5    18.5   2.305   0.1497    
drat         1   11.9    11.9   1.484   0.2419    
wt           1   55.8    55.8   6.950   0.0187 *  
qsec         1    1.5     1.5   0.190   0.6692    
vs           1    0.3     0.3   0.038   0.8488    
am           1   16.6    16.6   2.064   0.1714    
gear         2    5.0     2.5   0.313   0.7361    
carb         5   13.6     2.7   0.339   0.8814    
Residuals   15  120.4     8.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA table, the variables with p-values below the 0.05 significance level are cyl, disp, and wt. The am variable is not significant in this table (p = 0.1714), but we keep it in the next model because it is the variable of interest.
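One caveat when reading this table: the tests are sequential, so each term is assessed only after the terms listed above it, and the p-values can change if the variables are entered in a different order. As a hedged illustration, the sketch below refits the full model with base R and compares the sequential table with the marginal tests from drop1(), where each term is tested after all the others. The names d, full, and marginal are introduced for this example only.

```r
# Rebuild the factor-coded data from the built-in mtcars (mirrors the report's df).
d <- mtcars
for (v in c("cyl", "vs", "gear", "carb")) d[[v]] <- factor(d[[v]])
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

full <- lm(mpg ~ ., data = d)

# Sequential (Type I) table: each term is tested after the terms listed before it.
print(anova(full))

# Marginal tests: each term is tested after all of the other terms.
marginal <- drop1(full, test = "F")
print(marginal)
```

If a variable looks significant in the sequential table but not in the marginal one, much of its apparent effect is shared with predictors that enter the model later.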

model2 <- lm(mpg ~ cyl + disp + wt + am, data = df)
summary(model2)

Call:
lm(formula = mpg ~ cyl + disp + wt + am, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5029 -1.2829 -0.4825  1.4954  5.7889 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.816067   2.914272  11.604 8.79e-12 ***
cyl6        -4.304782   1.492355  -2.885  0.00777 ** 
cyl8        -6.318406   2.647658  -2.386  0.02458 *  
disp         0.001632   0.013757   0.119  0.90647    
wt          -3.249176   1.249098  -2.601  0.01513 *  
amManual     0.141212   1.326751   0.106  0.91605    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.652 on 26 degrees of freedom
Multiple R-squared:  0.8376,    Adjusted R-squared:  0.8064 
F-statistic: 26.82 on 5 and 26 DF,  p-value: 1.73e-09

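This output shows that the R-squared value is 0.8376 (adjusted R-squared 0.8064), which means the model explains about 84% of the variance in mpg. Note that once cyl, disp, and wt are held fixed, the estimated effect of a manual transmission shrinks to 0.14 mpg and is no longer significant (p = 0.916).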
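Since the transmission-only model is nested inside the larger model, the two can be compared formally with an F-test. The following sketch uses base R only; m1 and m2 are stand-in names for the report's model1 and model2, refitted here from the built-in data so that the example is self-contained.

```r
# Rebuild the two factors the models need from the built-in mtcars.
d <- mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))

m1 <- lm(mpg ~ am, data = d)                     # transmission only
m2 <- lm(mpg ~ cyl + disp + wt + am, data = d)   # adjusted model

# Share of variance explained, with and without the penalty for extra terms.
print(c(model1 = summary(m1)$r.squared,     model2 = summary(m2)$r.squared))
print(c(model1 = summary(m1)$adj.r.squared, model2 = summary(m2)$adj.r.squared))

# F-test: do cyl, disp, and wt jointly improve on the transmission-only model?
cmp <- anova(m1, m2)
print(cmp)
```

A small p-value in the second row of the comparison indicates that the added covariates explain variation in mpg that transmission type alone does not.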

ggplot(model2,aes(sample=.resid)) + stat_qq()+stat_qq_line()

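The points in the Q-Q plot lie close to the reference line, which suggests that the residuals are approximately normally distributed. A Q-Q plot does not show whether the errors are uncorrelated; that assumption would need a separate check.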
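A visual Q-Q check can be complemented with a formal test. As one possible approach, and not something the original analysis ran, the sketch below applies the Shapiro-Wilk test from base R to the residuals of the adjusted model; m2 and sw are names assumed for this example.

```r
# Refit the adjusted model from the built-in mtcars so the example stands alone.
d <- mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
m2 <- lm(mpg ~ cyl + disp + wt + am, data = d)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value means no evidence against normality.
sw <- shapiro.test(resid(m2))
print(sw)
```

With only 32 observations the test has limited power, so it is best read alongside the plot rather than in place of it.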

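# Boxplot of MPG by transmission type, with the individual cars overlaid
ggplot(df, aes(x = am, y = mpg, fill = am, col = am)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.2) +  # hide outliers; the jittered points already show every car
  geom_jitter()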

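This report focused on two questions. For the first, manual transmission cars achieve better MPG than automatic cars on average in this dataset. However, model2 shows that most of this difference is accounted for by cyl, disp, and wt, so transmission type alone may not be the cause.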

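The following output answers the second question. We perform a Welch two-sample t-test, which assumes that mpg is approximately normally distributed within each transmission group. The means of the manual and automatic groups differ significantly (p = 0.001374): manual cars average 24.39 mpg against 17.15 mpg for automatic cars, a difference of about 7.25 mpg (95% confidence interval 3.21 to 11.28).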

t.test(mpg ~ am, data = df)

    Welch Two Sample t-test

data:  mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group Automatic    mean in group Manual 
               17.14737                24.39231 
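The size of the difference and its confidence interval can be pulled out of the test object directly instead of being read off the printout. This is a minimal base R sketch; the names d, tt, manual_minus_auto, and ci are introduced here for illustration.

```r
# Labelled transmission factor, assumed to match the report's df$am.
d <- data.frame(
  mpg = mtcars$mpg,
  am  = factor(mtcars$am, labels = c("Automatic", "Manual"))
)

tt <- t.test(mpg ~ am, data = d)   # Welch test, unequal variances

# The test reports Automatic minus Manual, so flip the sign and the
# interval endpoints to read the result as the Manual advantage.
manual_minus_auto <- unname(tt$estimate[2] - tt$estimate[1])
ci <- rev(-as.numeric(tt$conf.int))

print(round(manual_minus_auto, 3))
print(round(ci, 3))
```

Flipping the sign does not change the test; it only expresses the interval in the direction the question is asked, that is, how many more miles per gallon manual cars achieve.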

The source of this report is available for verification at https://github.com/muhan1027/regressionModelReport