library(datasets)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

EXECUTIVE SUMMARY

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in two questions: is an automatic or manual transmission better for MPG, and how large is the MPG difference between the two transmission types?

Regression analysis using transmission type, weight, and 1/4 mile time as explanatory variables leads to the conclusion that manual cars get on average 2.9 more mpg than automatic cars, holding weight and 1/4 mile time constant.

DATA DESCRIPTION

The data for this project were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

The data set consists of 32 observations on 11 variables.

EXPLORATORY DATA ANALYSIS

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

To make the data more readable, the variable am will be recoded so that 0 becomes auto and 1 becomes manual.

Also, to use am as a categorical predictor in regression models, it needs to be converted to a factor.

The variables vs, gear, and carb will likewise be converted to factors.

# Decoding

mtcars$am <- gsub("0", "auto", mtcars$am)
mtcars$am <- gsub("1", "manual", mtcars$am)

# Factoring

mtcars$am <- factor(mtcars$am)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)

dim(mtcars)  ## 32 observations and 11 variables
## [1] 32 11
head(mtcars) ## some observations to better understand mtcars 
##                    mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0 manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0 manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1 manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   auto    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   auto    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1   auto    3    1
str(mtcars) ## variable types after coercion
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "auto","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

To get a first look at a possible difference between manual and automatic cars, we plot mpg by transmission type as a boxplot.

boxplot1 <- mtcars %>%
  ggplot(aes(x = am,
             y = mpg,
             fill = am)) +
  geom_boxplot() +
  labs(title = "Relation between manual and automatic cars",
       x = "Type of car")
boxplot1

According to the graph, manual cars seem to get better mileage on average than automatic cars. We can verify this with statistical inference.

STATISTICAL INFERENCE

We will use the t.test function to test the hypothesis that manual cars get better gas mileage than automatic cars.

t.test(mpg ~ am, mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group auto and group manual is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
##   mean in group auto mean in group manual 
##             17.14737             24.39231

The 95% confidence interval shown in the t.test output does not contain 0, so we can conclude that the difference in mean mpg between manual and automatic transmissions is statistically significant.

The p-value for this comparison is 0.001374, which is smaller than 0.05.
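
As a quick sanity check on the t-test result, the unadjusted group means can be recomputed directly with dplyr. This is a minimal sketch using the already-recoded am factor; it should reproduce the 17.1 and 24.4 mpg group means reported above.

# Unadjusted mpg summaries by transmission type (should match the t-test group means)
mtcars %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg   = sd(mpg),
            n        = n())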

However, other variables such as cyl, disp, hp, and wt may also play a role in determining MPG. For example, it stands to reason that the heavier the car, the higher its fuel consumption.

**Refer to the appendix.**

The pairs graph in the appendix shows that MPG is correlated with variables other than just am. To obtain a more accurate model, we need to predict MPG using these additional variables as well. Let's fit some models to evaluate these relationships.
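
Before fitting any models, a quick look at the pairwise correlations gives a sense of how strongly mpg moves with the remaining numeric variables. This is a minimal sketch; vs, am, gear, and carb are already factors at this point and are therefore excluded.

num_vars <- sapply(mtcars, is.numeric)      # keep only the numeric columns
round(cor(mtcars[, num_vars])["mpg", ], 2)  # correlations of mpg with cyl, disp, hp, drat, wt, qsec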

REGRESSION MODEL

Let’s start by regressing mpg on am alone.

am_model <- lm(mpg ~ am, mtcars)
summary(am_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The R2 value for this model is only 0.3598, which means that regressing mpg on am alone explains only about 36% of the variance in mpg. The p-value for the am coefficient is low (0.000285).
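
As a side check, the R2 reported by summary can be recovered by hand as the squared correlation between the observed and fitted values (a minimal sketch).

# For an OLS fit with an intercept, R-squared equals cor(observed, fitted)^2
cor(mtcars$mpg, fitted(am_model))^2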

Before drawing any conclusions about the effect of transmission type on fuel efficiency, we should look at how the other variables in the dataset relate to mpg and to each other.

Building a model that regresses mpg on all other variables in the dataset will explain more of the variance.

full_model <- lm(mpg ~ ., mtcars)
summary(full_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6533 -1.3325 -0.5166  0.7643  4.7284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 25.31994   23.88164   1.060   0.3048  
## cyl         -1.02343    1.48131  -0.691   0.4995  
## disp         0.04377    0.03058   1.431   0.1716  
## hp          -0.04881    0.03189  -1.531   0.1454  
## drat         1.82084    2.38101   0.765   0.4556  
## wt          -4.63540    2.52737  -1.834   0.0853 .
## qsec         0.26967    0.92631   0.291   0.7747  
## vs1          1.04908    2.70495   0.388   0.7032  
## ammanual     0.96265    3.19138   0.302   0.7668  
## gear4        1.75360    3.72534   0.471   0.6442  
## gear5        1.87899    3.65935   0.513   0.6146  
## carb2       -0.93427    2.30934  -0.405   0.6912  
## carb3        3.42169    4.25513   0.804   0.4331  
## carb4       -0.99364    3.84683  -0.258   0.7995  
## carb6        1.94389    5.76983   0.337   0.7406  
## carb8        4.36998    7.75434   0.564   0.5809  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.823 on 16 degrees of freedom
## Multiple R-squared:  0.8867, Adjusted R-squared:  0.7806 
## F-statistic: 8.352 on 15 and 16 DF,  p-value: 6.044e-05

As expected, the full model has a higher R2 value (0.8867). However, the summary output shows that none of the individual coefficients are significant at the 0.05 level.
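
One plausible explanation is collinearity among the regressors. If the car package happens to be installed (it is not loaded above, so this is an optional, hedged check), generalized variance inflation factors can flag heavily correlated predictors:

# Optional collinearity check; assumes the car package is installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(full_model))   # GVIF values well above ~5-10 indicate strong collinearity
}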

Excluding variables that are correlated with both transmission type and mpg would introduce bias into the am coefficient, while including unnecessary regressors inflates the variance of the estimates. We will use R's step function to determine which variables to include in the final model.

step_model <- step(full_model, 
                   direction = "both", 
                   trace = FALSE)
summary(step_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ammanual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The Residual standard error of this model is 2.459 on 28 degrees of freedom.

The adjusted R-squared has increased to 0.8336 and the coefficients are significant.
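
Because the am-only model is nested inside the model selected by step (mpg ~ wt + qsec + am), the two fits can be compared directly with anova; a small p-value would confirm that wt and qsec add real explanatory power (a minimal sketch).

# F-test comparing the nested models: mpg ~ am  vs.  mpg ~ wt + qsec + am
anova(am_model, step_model)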

confint(step_model)['ammanual', ]
##      2.5 %     97.5 % 
## 0.04573031 5.82594408

The 95% confidence interval for the manual-transmission coefficient runs from about 0.05 to 5.83 mpg and excludes zero, consistent with the significance of the effect.

Diagnostic plotting using base graphics shows that the residuals are uncorrelated with the fitted values. The quantile-quantile plot indicates that the distribution of the residuals is roughly normal.

par(mfrow = c(2,2))
plot(step_model)

According to the residual plots, the following underlying assumptions can be verified:

1. The Residuals vs. Fitted plot shows no consistent pattern, supporting the independence assumption.
2. The Normal Q-Q plot indicates that the residuals are normally distributed, as the points lie close to the line.
3. The Scale-Location plot supports the constant variance assumption, as the points are randomly scattered.
4. The Residuals vs. Leverage plot suggests that no influential outliers are present, as all values fall well within the 0.5 bands (see the numeric check below).
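
As a numeric complement to the Residuals vs. Leverage panel, the hat values and dfbetas of the fitted model can be inspected directly using base R influence measures (a minimal sketch).

# Five largest leverage values and five largest |dfbeta| values
# for the manual-transmission coefficient
round(sort(hatvalues(step_model), decreasing = TRUE)[1:5], 3)
round(sort(abs(dfbetas(step_model)[, "ammanual"]), decreasing = TRUE)[1:5], 3)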

APPENDIX: FIGURES

1. Pair Graph of Motor Trend Car Road Tests

pairs(mtcars,
      panel = panel.smooth,
      main = "Pair Graph of Motor Trend Car Road Test")

2. Scatter Plot of MPG vs. Weight by Transmission

ggplot(mtcars, 
       aes(x = wt, 
           y = mpg, 
           color = am)) + 
  geom_point() +
  scale_colour_discrete(labels = c("Automatic", 
                                   "Manual")) + 
  xlab("weight") + 
  ggtitle("Scatter Plot of MPG vs. Weight by Transmission")