1. SUMMARY

The Regression Models Course Project studies the "Motor Trend Car Road Test" (mtcars) dataset, which can be loaded directly from R's built-in datasets. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

Description of variables:

mpg Miles/(US) gallon: how far a car can travel

cyl Number of cylinders

disp Displacement (in cubic inches): this metric gives a good proxy for the total amount of power the engine can generate. It measures the volume of an engine’s cylinders.

hp Gross horsepower (HP)

drat Rear axle ratio: indicates the ratio of the angular velocity of the input gear to the angular velocity of the output gear. A vehicle with a high ratio would provide more torque and thus more towing capability

wt Weight (1000 lbs.)

qsec 1/4 mile time. A performance measure, primarily of acceleration from a standstill

vs Engine shape (0 = V-shaped, 1 = straight)

am Transmission (0 = automatic, 1 = manual)

gear Number of forward gears

carb Number of carburetors. A carburetor is a device that blends air and fuel for an internal combustion engine

2. RESEARCH QUESTIONS AND MOTIVATION

Our main goal is to explore the relationship between a set of variables (predictors) and miles per gallon (outcome). Specifically, we have been asked to answer the following research questions:

"Is an automatic or manual transmission better for MPG?"
"Quantify the MPG difference between automatic and manual transmissions."

3. LOADING PACKAGES AND DATA

library(ggplot2)   #R graphics library
library(gridExtra) #Arranging several plots in a grid
library(dplyr)     #Useful for data transformation
library(plyr)      #Useful for data transformation (revalue); loading it after dplyr masks some dplyr verbs, fixed below
library(car)       #Useful statistical functions and regression diagnostics
library(knitr)     #Table formatting with kable
library(MASS)      #Statistical functions, including stepAIC
library(effects)   #Visualizing effect plots (especially for regression models)
library(psych)     #Scatterplot matrices and correlation tests
library(jtools)    #Interaction plots
library(modelr)    #Performance metrics (R-squared, RMSE, MAE)
library(broom)     #Converts model output into tidy data frames

#The data ship with R as the built-in dataset "mtcars". We assign this data frame to a new variable called data.cars:

data.cars <- mtcars

head(data.cars,5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

4. EXPLORATORY DATA ANALYSIS (EDA)

First, we explore the structure and summary statistics of the dataset:

str(data.cars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(data.cars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

We have just identified that all of our variables are stored as numeric values. However, some of them take only a few discrete values and could be treated as factors: cyl, vs, am, gear and carb.
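As a minimal sketch (not used in the analysis below, which converts only am), those columns could be recast as factors on a copy of the data:

#Hypothetical helper data frame, kept separate so the later steps are unaffected
data.cars.fct <- data.cars
cols <- c("cyl", "vs", "am", "gear", "carb")
data.cars.fct[cols] <- lapply(data.cars.fct[cols], as.factor)
str(data.cars.fct)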

Each row of the dataset corresponds to a distinct car model. The sample spans very different classes of vehicle: gross horsepower has a mean of 146.7, with a minimum of 52 and a maximum of 335. Fuel efficiency averages roughly 20 miles per gallon across the sample, and since the median (19.20) is close to the mean (20.09), we might expect mpg to be approximately normally distributed.
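If we wanted to hedge that expectation with a formal check, a quick Shapiro-Wilk test (base R, not part of the original write-up) would do:

#Shapiro-Wilk normality test for mpg; a large p-value is consistent with normality
shapiro.test(data.cars$mpg)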

Looking at one of our main predictors, am, we can see that the variable is binary: "0" corresponds to automatic transmission and "1" to manual transmission. We therefore give this variable a more meaningful representation:

data.cars$am <- as.factor(data.cars$am)
data.cars$am <- revalue(data.cars$am, c("0"="Automatic", "1"="Manual"))

Our summary also suggests outliers in some variables. It is useful to look at the variability of the data, the outliers, and the differences between the am groups by creating boxplots:

mpg <- ggplot(data.cars, aes(x = am, y = mpg)) +
        geom_boxplot(fill = "cornflowerblue") +
        xlab("Transmission") + ylab("Miles per gallon") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Fuel Efficiency (mpg)")

disp <- ggplot(data.cars, aes(x = am, y = disp)) +
        geom_boxplot(fill = "gold") +
        xlab("Transmission") + ylab("Cubic inches") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Displacement in Cubic Inches (disp)")

hp <- ggplot(data.cars, aes(x = am, y = hp)) +
        geom_boxplot(fill = "darkseagreen1") +
        xlab("Transmission") + ylab("Gross horsepower") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Engine's Power Output (hp)")

drat <- ggplot(data.cars, aes(x = am, y = drat)) +
        geom_boxplot(fill = "lightblue3") +
        xlab("Transmission") + ylab("Ratio") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Rear Axle Ratio (drat)")

wt <- ggplot(data.cars, aes(x = am, y = wt)) +
        geom_boxplot(fill = "khaki2") +
        xlab("Transmission") + ylab("1,000 lbs") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Weight (wt)")

qsec <- ggplot(data.cars, aes(x = am, y = qsec)) +
        geom_boxplot(fill = "lightslateblue") +
        xlab("Transmission") + ylab("Seconds") +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Acceleration Performance (qsec)")
grid.arrange(mpg,disp,hp,drat,wt,qsec)

Judging by the variability (size of the boxes) of every variable, we see clear differences between the two transmission types; that is mainly due to the fact that different types of car models ended up in each group. Moreover, we can see an unbalanced design with respect to these variables:

detach("package:dplyr", character.only = TRUE)
library(dplyr)

data.cars.no <- data.cars %>%
        group_by(am) %>%
        summarize(n = n())
        
data.cars.type <- data.cars %>%
        mutate(car.model = row.names(data.cars)) %>%
        filter(disp > 300 | hp > 200) %>%
        select(car.model, disp, hp, am) %>%
        arrange(desc(as.character(am)))

print(data.cars.no)
## # A tibble: 2 x 2
##   am            n
##   <fct>     <int>
## 1 Automatic    19
## 2 Manual       13
print(data.cars.type)
##              car.model disp  hp        am
## 1       Ford Pantera L  351 264    Manual
## 2        Maserati Bora  301 335    Manual
## 3    Hornet Sportabout  360 175 Automatic
## 4           Duster 360  360 245 Automatic
## 5   Cadillac Fleetwood  472 205 Automatic
## 6  Lincoln Continental  460 215 Automatic
## 7    Chrysler Imperial  440 230 Automatic
## 8     Dodge Challenger  318 150 Automatic
## 9          AMC Javelin  304 150 Automatic
## 10          Camaro Z28  350 245 Automatic
## 11    Pontiac Firebird  400 175 Automatic
  1. In general, the median fuel efficiency (MPG) is higher for manual transmission than for automatic. Manual cars also show more variation (roughly 23 to 31 MPG), so automatic transmission is more consistent in terms of MPG, but at a lower level. (The sketch after this list computes the exact group medians.)

  2. However, the opposite holds for overall power: automatic cars show larger median displacement and horsepower. For displacement the gap is substantial: roughly 275 cubic inches (automatic) versus 120 (manual), a difference of about 155 cubic inches.

  3. In terms of the rear axle ratio (drat), manual cars tend to have higher values, implying more torque at the wheels.

  4. There appears to be no large difference in acceleration performance, or in its variation, between automatic and manual transmission, with medians of around 17.9 and 17 seconds respectively.
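As referenced in point 1 above, a short dplyr sketch can back these boxplot readings with the exact per-group medians (the figures quoted above are approximate readings from the plots):

#Per-group medians for the variables plotted above
data.cars %>%
        group_by(am) %>%
        summarize(med.mpg  = median(mpg),
                  med.disp = median(disp),
                  med.hp   = median(hp),
                  med.drat = median(drat),
                  med.qsec = median(qsec))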

Based on these observations, we now proceed to answer our research questions and to examine whether other variables confound the relationship between transmission type (am) and MPG when fitting a regression model to quantify the difference between manual and automatic cars.

5. BUILDING A REGRESSION MODEL

Now that we have a clear idea of the "Motor Trend Car Road Test" dataset, our next step is to construct a regression model in order to answer our research questions. We start by building a new data frame, data.cars.mod, with the variables of interest:

data.cars.mod <- data.cars %>%
        select(mpg, disp, hp, wt, qsec, am) 

First of all, we should get to know the relationships between our selected variables. We can do this with a scatterplot matrix from the psych package:

pairs.panels(data.cars.mod[c("mpg", "disp", "hp", "wt", "qsec")])

#A nice feature of this package: it also displays the correlation strength and the distribution of each variable.

From the above matrix there appears to be a linear relationship between mpg and the rest of the variables. Moreover, our outcome variable mpg is approximately normally distributed, which is convenient for fitting a linear model. We can also see a highly skewed hp predictor and roughly normally distributed qsec and wt variables; disp, by contrast, does not appear normally distributed.

Stretched ovals indicate a strong correlation between variables. Our outcome mpg appears related to each predictor variable, which is good for model fitting. However, highly correlated predictor pairs such as disp-hp, wt-disp and hp-qsec may introduce multicollinearity problems into our model.

predict.cor <- cor(data.cars.mod[c("mpg", "disp", "hp", "wt", "qsec")])
kable(predict.cor, format = 'markdown')
|     |        mpg|       disp|         hp|         wt|       qsec|
|:----|----------:|----------:|----------:|----------:|----------:|
|mpg  |  1.0000000| -0.8475514| -0.7761684| -0.8676594|  0.4186840|
|disp | -0.8475514|  1.0000000|  0.7909486|  0.8879799| -0.4336979|
|hp   | -0.7761684|  0.7909486|  1.0000000|  0.6587479| -0.7082234|
|wt   | -0.8676594|  0.8879799|  0.6587479|  1.0000000| -0.1747159|
|qsec |  0.4186840| -0.4336979| -0.7082234| -0.1747159|  1.0000000|
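As a complementary check (a sketch using the car package loaded earlier, not part of the original analysis), variance inflation factors quantify this concern directly:

#VIFs for the full additive model; values well above 5-10 would flag collinearity
vif(lm(mpg ~ am + disp + hp + wt + qsec, data = data.cars.mod))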

With these high correlations confirmed, we take a first look at a preliminary model including all selected predictors:

fit.cars.1 <- lm( mpg ~ am + disp + hp + wt + qsec, data = data.cars.mod)
summary(fit.cars.1)
## 
## Call:
## lm(formula = mpg ~ am + disp + hp + wt + qsec, data = data.cars.mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5399 -1.7398 -0.3196  1.1676  4.5534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 14.36190    9.74079   1.474  0.15238   
## amManual     3.47045    1.48578   2.336  0.02749 * 
## disp         0.01124    0.01060   1.060  0.29897   
## hp          -0.02117    0.01450  -1.460  0.15639   
## wt          -4.08433    1.19410  -3.420  0.00208 **
## qsec         1.00690    0.47543   2.118  0.04391 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.429 on 26 degrees of freedom
## Multiple R-squared:  0.8637, Adjusted R-squared:  0.8375 
## F-statistic: 32.96 on 5 and 26 DF,  p-value: 1.844e-10
#Model-based averages: the intercept is the predicted MPG for an automatic
#car with the remaining predictors held at zero
Automatic.1 <- fit.cars.1$coefficients[1]
Manual.1 <- fit.cars.1$coefficients[1] + fit.cars.1$coefficients[2]
Difference.1 <- fit.cars.1$coefficients[2]

cat("Automatic Transmission - MPG (Average):", Automatic.1)
## Automatic Transmission - MPG (Average): 14.3619
cat("Manual Transmission - MPG (Average):", Manual.1)
## Manual Transmission - MPG (Average): 17.83236
cat("Difference in AM Transmission - MPG (Average):", Difference.1)
## Difference in AM Transmission - MPG (Average): 3.470453
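To attach uncertainty to this estimated difference, we could also inspect a 95% confidence interval for the transmission coefficient (a quick base-R sketch, not in the original write-up):

#95% confidence interval for the adjusted MPG difference between transmissions
confint(fit.cars.1, "amManual", level = 0.95)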

Our model fits well: the Multiple R-squared is 86.37% (the proportion of variation in the outcome MPG explained by the selected predictors); even more relevant is the Adjusted R-squared (83.75%), since it penalizes the addition of further predictors.

  1. From the results, we do see significant predictors such as wt:

All else being constant, a 1,000 lbs. increase in weight (wt) is associated with a 4.08 decrease in miles per gallon (mpg).

  2. Our categorical predictor am has been dummy-coded into a binary variable: 0 = automatic transmission and 1 = manual transmission. Transmission type is in fact significant in this model: the estimate of 3.47045 represents the average difference in MPG between automatic and manual transmissions, holding the other predictors fixed. That means that:

All else being constant, the model baseline in MPG for automatic transmission is 14.36 (the intercept, with the remaining predictors at zero), whereas for manual transmission it is 17.83; representing a difference of 3.47 between both transmission types.

  3. Predictors disp and hp do not appear significant in this model (p-values of roughly 0.30 and 0.16). That does not mean we should discard them outright, since interactions with other predictors might still yield significant results; to decide, we conduct a STEPWISE PROCESS to find which predictors should be kept in the model.
Stepwise process

We now run a stepwise process that iteratively adds and removes predictors in order to find the subset of variables that yields the best performing model (lowest prediction error). This step will also tell us whether our concern about correlated predictors is borne out. The goal is to find the model that minimizes the AIC (Akaike Information Criterion, an estimator of the relative quality of statistical models):

AIC = 2k - 2 ln(L)

where,

k = number of estimated parameters

L = maximized value of the model's likelihood function
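To make the formula concrete, here is a small base-R sketch applying it to our preliminary model. Note that stepAIC below computes AIC via extractAIC, which drops additive constants from the log-likelihood, so the values it prints differ from AIC() by a constant while ranking models identically:

k <- attr(logLik(fit.cars.1), "df")         #number of estimated parameters (coefficients + residual variance)
2 * k - 2 * as.numeric(logLik(fit.cars.1))  #AIC computed by hand
AIC(fit.cars.1)                             #same value from base R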

#We can run a stepwise process using the MASS package; direction = "both" combines forward and backward selection to find the best fit
step.cars <- stepAIC(fit.cars.1, direction ="both")
## Start:  AIC=62.16
## mpg ~ am + disp + hp + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ am + hp + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ am + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + disp  1     3.276 166.01 62.682
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
print(step.cars$anova)
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ am + disp + hp + wt + qsec
## 
## Final Model:
## mpg ~ am + wt + qsec
## 
## 
##     Step Df Deviance Resid. Df Resid. Dev     AIC
## 1                           26   153.4378 62.1619
## 2 - disp  1 6.628654        27   160.0665 61.5153
## 3   - hp  1 9.219469        28   169.2859 61.3073

Two predictors have been dropped as non-significant during the stepwise process: disp and hp. Searching in both directions, the function returns the final model:

mpg ~ am + wt + qsec

fit.cars.2 <- lm( mpg ~ am + qsec + wt , data = data.cars.mod)
summary(fit.cars.2)
## 
## Call:
## lm(formula = mpg ~ am + qsec + wt, data = data.cars.mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## amManual      2.9358     1.4109   2.081 0.046716 *  
## qsec          1.2259     0.2887   4.247 0.000216 ***
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
Automatic.2 <- fit.cars.2$coefficients[1]
Manual.2 <- fit.cars.2$coefficients[1] + fit.cars.2$coefficients[2]
Difference.2 <- fit.cars.2$coefficients[2]

cat("Automatic Transmission - MPG (Average):", Automatic.2)
## Automatic Transmission - MPG (Average): 9.617781
cat("Manual Transmission - MPG (Average):", Manual.2)
## Manual Transmission - MPG (Average): 12.55362
cat("Difference in AM Transmission - MPG (Average):", Difference.2)
## Difference in AM Transmission - MPG (Average): 2.935837

Our Adjusted R-squared is now 83.36%, nearly equal to that of the first model. The coefficient estimates and the intercept are different now, since the model equation has changed as well:

All else being constant, the model baseline in MPG for automatic transmission is 9.62 (the intercept, with wt and qsec at zero), whereas for manual transmission it is 12.55; representing a difference of 2.94 between both transmission types.
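Since the broom package is already loaded, a compact way to view these estimates together with 95% confidence intervals (a sketch, not part of the original write-up):

#Tidy coefficient table with confidence intervals for the stepwise model
tidy(fit.cars.2, conf.int = TRUE, conf.level = 0.95)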

Effect plots:

effects.2 <- allEffects(fit.cars.2)

plot(effects.2, multiline = TRUE)

The effect plots confirm that manual transmission is the more fuel-efficient level in terms of miles per gallon (MPG). We can also see graphically the positive effect that qsec has on MPG and the negative effect that wt has on MPG.

Interactions of predictors

Now that we have detected the significant variables, it is worth checking whether the predictors wt and qsec interact with the categorical am predictor:

#wt and transmission

fit.wt <- lm(mpg ~ wt*am, data = data.cars.mod)
interact_plot(fit.wt, pred = "wt", modx = "am")  #the interaction is significant

#qsec and transmission

fit.qsec <- lm(mpg ~ qsec*am, data = data.cars.mod)
interact_plot(fit.qsec, pred = "qsec", modx = "am")  #there is little interaction

fit.cars.3 <- lm(mpg ~ am*wt + qsec, data = data.cars.mod)
summary(fit.cars.3)
## 
## Call:
## lm(formula = mpg ~ am * wt + qsec, data = data.cars.mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5076 -1.3801 -0.5588  1.0630  4.3684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.723      5.899   1.648 0.110893    
## amManual      14.079      3.435   4.099 0.000341 ***
## wt            -2.937      0.666  -4.409 0.000149 ***
## qsec           1.017      0.252   4.035 0.000403 ***
## amManual:wt   -4.141      1.197  -3.460 0.001809 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared:  0.8959, Adjusted R-squared:  0.8804 
## F-statistic: 58.06 on 4 and 27 DF,  p-value: 7.168e-13

We can see a clear difference in slope between manual and automatic transmission when weight (wt) is the moderated predictor, whereas for qsec there is little interaction. Interestingly enough, when weight is around 2,800 lbs., both transmission types are predicted to perform similarly.
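That crossover can be read directly off the weight interaction model: the manual effect is amManual + wt:amManual * wt, which equals zero when the two terms cancel. A small sketch (base R, using fit.wt from above):

#Weight (in 1,000 lbs) at which the two transmission lines in the plot cross
coefs <- coef(fit.wt)
unname(-coefs["amManual"] / coefs["wt:amManual"])  #about 2.81, i.e. roughly 2,800 lbs

We now run an ANOVA test to assess whether adding the am * wt interaction significantly improves our stepwise model: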

anova(fit.cars.2, fit.cars.3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am + qsec + wt
## Model 2: mpg ~ am * wt + qsec
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     28 169.29                                
## 2     27 117.28  1     52.01 11.974 0.001809 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The two models are compared through their RSS (Residual Sum of Squares), which evaluates the overall quality of the fitted models. The interaction model (Model 2: mpg ~ am * wt + qsec) significantly improves on the original additive model (Model 1: mpg ~ am + wt + qsec), with p = 0.0018.

6. REGRESSION DIAGNOSTICS

Linearity, Normality, Homoscedasticity and Potential Outliers

We have to take into account that a linear regression model rests on some assumptions about the data: linearity, normality of the residuals, absence of multicollinearity, and homogeneity of the residual variance (constant variance, or homoscedasticity). We assessed the multicollinearity issue in our previous calculations.
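Besides the diagnostic plots below, the car package offers formal tests we could add (a sketch, not in the original analysis), for example a score test for non-constant error variance:

#Score test for non-constant variance; the null hypothesis is homoscedastic residuals
ncvTest(fit.cars.2)
ncvTest(fit.cars.3)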

ASSESSING Model 1: mpg ~ am + wt + qsec

#base R's plot method for lm objects gives four diagnostic plots: a) Residuals vs Fitted, b) Normal Q-Q, c) Scale-Location and d) Residuals vs Leverage
par(mfrow = c(2,2))
plot(fit.cars.2, main = "mpg ~ am + wt + qsec")

ASSESSING Model 2: mpg ~ am * wt + qsec

par(mfrow = c(2,2))
plot(fit.cars.3, main = "mpg ~ am * wt + qsec")

  1. Residuals vs Fitted: ideally the trend line stays close to horizontal around zero, meaning no systematic pattern remains between the residuals and the fitted values.

  2. Normal Q-Q: the residuals appear to be approximately normally distributed.

  3. Scale-Location: standardizing the residuals lets us check for heteroscedasticity; ideally the trend line is horizontal with points spread evenly around it.

  4. Residuals vs Leverage: through Cook's distance we can identify high-leverage points and potential outliers.

Model 2 seems to fit noticeably better than Model 1, which can be appreciated by looking at the Residuals vs Fitted plot.

In order to improve the model, it is worth taking a look at the potential outliers and deciding whether we should include them in our predictions. For example, in Model 1 the Fiat 128 and the Chrysler Imperial lie close to the Cook's distance borders, whereas in Model 2 the Maserati Bora joins the Chrysler Imperial and Fiat 128 as influential candidates.
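A short base-R sketch to inspect those candidates numerically rather than visually:

#Largest Cook's distances for Model 2; the names identify the car models
head(sort(cooks.distance(fit.cars.3), decreasing = TRUE), 3)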

7. REGRESSION ACCURACY

After the diagnostics stage, we now assess each model's accuracy using the following metrics:

R-Squared: proportion of variation in the outcome that is explained by the predictor variables

RMSE: Root Mean Squared Error, the square root of the average squared difference between the observed outcome values and the predicted values; it measures the typical prediction error.

MAE: Mean Absolute Error. Average absolute difference between observed and predicted outcomes. Less sensitive to outliers.

m1 <- data.frame(
        MODEL = "mpg ~ am + wt + qsec",
        R2 = rsquare(fit.cars.2, data = data.cars.mod ), 
        RMSE = rmse(fit.cars.2, data = data.cars.mod), 
        MAE = mae(fit.cars.2, data = data.cars.mod) )

m2 <- data.frame(
        MODEL = "mpg ~ am * wt + qsec",
        R2 = rsquare(fit.cars.3, data = data.cars.mod ), 
        RMSE = rmse(fit.cars.3, data = data.cars.mod), 
        MAE = mae(fit.cars.3, data = data.cars.mod) )

metrics <- rbind(m1,m2)

kable(metrics, format = 'markdown')
|MODEL                |        R2|     RMSE|      MAE|
|:--------------------|---------:|--------:|--------:|
|mpg ~ am + wt + qsec | 0.8496636| 2.300040| 1.931954|
|mpg ~ am * wt + qsec | 0.8958514| 1.914389| 1.580804|

From the above results, Model 2 (mpg ~ am * wt + qsec) performs better than Model 1: both of its error metrics are lower, and its higher R-squared indicates that 89.6% of the variation in MPG is explained by the predictors.

It is also helpful to compare predicted versus actual values for both Model 1 and Model 2:

par(mfrow = c(1,2))

plot(predict(fit.cars.2), data.cars$mpg, col = "Blue", cex = 0.6,
      xlab="predicted",ylab="actual", main = "MODEL 1: mpg ~ am + wt + qsec")


plot(predict(fit.cars.3), data.cars$mpg, col = "Red", cex = 0.6,
      xlab="predicted",ylab="actual", main = "MODEL 2: mpg ~ am*wt + qsec")

In conclusion, both models predict MPG well, but Model 2 shows smaller prediction errors and therefore yields more accurate predictions.
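One caveat worth keeping in mind: the metrics above are computed in-sample, which tends to favor the more flexible model. For linear models, a leave-one-out cross-validated RMSE can be obtained cheaply through the PRESS residuals (a sketch, using a hypothetical helper press.rmse):

#Leave-one-out residuals for lm: e_i / (1 - h_ii), with h_ii the hat values
press.rmse <- function(fit) {
        loo <- residuals(fit) / (1 - hatvalues(fit))
        sqrt(mean(loo^2))
}
press.rmse(fit.cars.2)
press.rmse(fit.cars.3)

If Model 2 still shows the lower out-of-sample error, that strengthens the conclusion above.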