Introduction

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. In particular, it wants to know whether automatic or manual transmissions are associated with better MPG, and how large the difference between the two is.

Project Tasks // Data Introduction

The project consists of three parts that lead to an answer to the question:

- Data Introduction and Exploratory Analysis
- Hypothesis Testing / Analysis
- Results / Conclusion

library(datasets)
data(mtcars)


dim(mtcars)
## [1] 32 11
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The data set contains 32 observations of 11 variables. The help page ?mtcars tells us that it comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. The question concerns how the variable “mpg” behaves with respect to automatic and manual transmissions (the binary variable “am”, which holds 0 for automatic and 1 for manual transmission).
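As a quick check of the coding described above, the following minimal sketch counts the cars of each transmission type and computes their mean MPG. The names `cars`, `transmission`, `counts`, and `group_means` are illustrative and do not appear elsewhere in this report; the sketch works on a fresh copy of the data so it does not depend on any later changes to mtcars.

```r
# Fresh copy so the sketch is independent of changes made to mtcars elsewhere
cars <- datasets::mtcars

# Label the transmission codes (0 = automatic, 1 = manual, per ?mtcars)
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Number of cars per transmission type
counts <- table(cars$transmission)

# Mean MPG per transmission type
group_means <- tapply(cars$mpg, cars$transmission, mean)

print(counts)
print(round(group_means, 2))
```

The two groups are not the same size, which is worth keeping in mind when comparing their means.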

cor(mtcars[,1], mtcars)
##      mpg       cyl       disp         hp      drat         wt     qsec
## [1,]   1 -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251
# cyl, disp and wt have strong negative correlations with mpg (about -0.85); "am" (about 0.60) is not among the strongest

# Convert am to a factor for the boxplot and the models that follow
mtcars$am <- as.factor(mtcars$am)

boxplot(mpg ~ am, data = mtcars, outpch = 15, ylab = "MPG", xlab = "Transmission (0 = automatic, 1 = manual)", main = "MPG vs transmission type", col = "maroon")

The boxplot shows that manual cars tend to have higher miles per gallon than automatic cars. To check whether this difference is statistically significant, we perform a two-sample t-test at the 5% significance level. The null hypothesis is that mean MPG is the same for both transmission types, and we expect it to be rejected.

performT <- t.test(mpg~am, data=mtcars, conf.level=0.95)

performT$p.value
## [1] 0.001373638

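The p-value of about 0.0014 is well below the 0.05 threshold, so we reject the null hypothesis: the data give strong evidence that automatic cars have a lower mean MPG than manual cars. This comparison does not account for other differences between the cars, so next we create a model that predicts MPG from all variables, along with a model that uses the transmission type alone.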
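The test object holds more than the p-value. The sketch below, which assumes a fresh copy of the data under the illustrative names `cars` and `welch`, pulls out the group means and the 95% confidence interval for the difference. Note that t.test does not assume equal variances by default, so this is a Welch two-sample t-test.

```r
# Fresh copy so the sketch is independent of changes made to mtcars elsewhere
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Welch two-sample t-test (t.test's default is var.equal = FALSE)
welch <- t.test(mpg ~ am, data = cars, conf.level = 0.95)

# Mean MPG in group 0 (automatic) and group 1 (manual)
print(welch$estimate)

# 95% confidence interval for mean(automatic) - mean(manual)
print(welch$conf.int)

# p-value of the two-sided test
print(welch$p.value)
```

Because the interval is computed as automatic minus manual, an interval lying entirely below zero corresponds to a lower mean MPG for automatic cars.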

myModel <- lm(mpg ~ ., data = mtcars)

myModelAM <- lm(mpg ~ am, data = mtcars)

summary(myModel)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am1          2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
summary(myModelAM)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

In the model that includes all variables, the smallest p-value belongs to “wt” (0.063), followed by “am” (0.234) and “qsec” (0.274). No coefficient is significant at the 5% level, which is common when many correlated predictors are included together. This model explains 87% of the variance (R-squared), or 81% in its adjusted version. The model containing only the transmission variable explains just 36% of the variance, which is poor, although it does give the unadjusted difference: manual cars average 7.245 MPG more than automatic cars (17.147 MPG). Another approach is a stepwise algorithm that determines which variables are needed.
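The R-squared figures quoted above can be read directly from the model summaries instead of the printed output. This is a minimal sketch on a fresh copy of the data; the names `cars`, `full_model`, `am_model`, and `fit_summary` are illustrative. The anova() call at the end is an addition not used elsewhere in this report: it runs an F-test comparing the two nested models.

```r
# Fresh copy so the sketch is independent of changes made to mtcars elsewhere
cars <- datasets::mtcars
cars$am <- factor(cars$am)

full_model <- lm(mpg ~ ., data = cars)   # all ten predictors
am_model   <- lm(mpg ~ am, data = cars)  # transmission only

# Collect R-squared and adjusted R-squared for both models
fit_summary <- data.frame(
  model         = c("all variables", "am only"),
  r_squared     = c(summary(full_model)$r.squared,
                    summary(am_model)$r.squared),
  adj_r_squared = c(summary(full_model)$adj.r.squared,
                    summary(am_model)$adj.r.squared)
)
print(fit_summary, digits = 3)

# F-test: do the extra predictors improve on the am-only model?
print(anova(am_model, full_model))
```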

myModelStep <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 5000)
summary(myModelStep)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am1           2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

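From this model we can see that only wt, qsec and am are selected, and they explain 85% (R-squared) or 83% (adjusted R-squared) of the variance. From the selected variables we can conclude the following, each with the other two variables held constant:

- Each additional 1000 lbs of weight (the unit of wt according to ?mtcars) is associated with a decrease of 3.92 MPG.
- Each additional second of quarter-mile time (qsec), that is, slower acceleration, is associated with an increase of 1.23 MPG.
- Manual cars (am1) get an estimated 2.94 MPG more than automatic cars (p = 0.047). This suggests manual transmissions are the better choice for MPG, although the adjusted difference is much smaller than the unadjusted 7.245 MPG.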
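To see how precisely these effects are estimated, one option is to look at confidence intervals for the coefficients. The sketch below refits the selected formula directly on a fresh copy of the data and does not re-run the stepwise search; `cars`, `selected_model`, and `ci` are illustrative names, and confint() is an addition not used elsewhere in this report.

```r
# Fresh copy so the sketch is independent of changes made to mtcars elsewhere
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Refit the formula chosen by the stepwise procedure
selected_model <- lm(mpg ~ wt + qsec + am, data = cars)

# Point estimates for the intercept, wt, qsec and am1
print(round(coef(selected_model), 4))

# 95% confidence intervals for each coefficient
ci <- confint(selected_model, level = 0.95)
print(round(ci, 3))
```

The interval for am1 has a lower bound only slightly above zero, which matches its p-value sitting just under 0.05: the adjusted advantage of manual transmissions is estimated with considerable uncertainty.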