In this report, we will perform a multiple linear regression analysis
on the built-in mtcars dataset in R. The
dataset contains information about various car models and their
performance characteristics. Our aim is to explore relationships between
different variables and build multiple regression models to predict the
mpg (miles per gallon) of the cars based on other
attributes.
The mtcars dataset consists of
32 observations and 11 variables, as
follows:
mpg: Miles per gallon (numeric)
cyl: Number of cylinders (stored as numeric; takes the values 4, 6, and 8)
disp: Displacement (numeric)
hp: Horsepower (numeric)
drat: Rear axle ratio (numeric)
wt: Weight (numeric, in units of 1000 lb)
qsec: 1/4 mile time (numeric)
vs: Engine shape (numeric: 0 = V-shaped engine, 1 = straight engine)
am: Transmission (numeric: 0 = automatic, 1 = manual)
gear: Number of forward gears (numeric)
carb: Number of carburetors (numeric)
We began by exploring the dataset to understand its structure and
gain insights into the variables. The str() function
shows that every variable, including cyl, is stored as numeric.
Because cyl represents a categorical variable with three levels,
we convert it to a factor before fitting the models.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
The summary statistics give a brief overview of each variable
in the mtcars dataset. Here is an interpretation of the results:
“mpg” (Miles per Gallon):
The data represents car fuel efficiency (miles per gallon). The values range from 10.40 to 33.90, with an average of 20.09. The median is slightly lower at 19.20, indicating that the data might be positively skewed.
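“cyl” (Number of Cylinders):
The values range from 4 to 8, with a median of 6. The variable takes only the values 4, 6, and 8.
“disp” (Displacement):
The values range from 71.1 to 472.0, with a mean of 230.7 and a median of 196.3.
“hp” (Horsepower):
The values range from 52.0 to 335.0, with a mean of 146.7 and a median of 123.0.
“drat” (Rear Axle Ratio):
The values range from 2.760 to 4.930, with a mean of 3.597 and a median of 3.695.
“wt” (Weight):
The values range from 1.513 to 5.424, with a mean of 3.217 and a median of 3.325.
“qsec” (Quarter Mile Time):
The values range from 14.50 to 22.90, with a mean of 17.85 and a median of 17.71.
“vs” (Engine Type V/S):
Coded 0 or 1. The mean of 0.4375 means that about 44% of the cars have a straight engine.
“am” (Transmission Type):
Coded 0 or 1. The mean of 0.4062 means that about 41% of the cars have a manual transmission.
“gear” (Number of Gears):
The values range from 3 to 5, with a median of 4.
“carb” (Number of Carburetors):
The values range from 1 to 8, with a mean of 2.812 and a median of 2.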
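The mean-versus-median reading used for mpg above can be applied to every column at once. The following is a minimal sketch, not part of the original analysis; the name skew_hint is introduced here for illustration only.

```r
# Gap between mean and median for each column.
# A clearly positive gap hints at a right-skewed distribution.
skew_hint <- sapply(mtcars, function(x) mean(x) - median(x))
round(skew_hint, 3)

# For variables with only a few distinct values, counts are more
# informative than quartiles.
table(mtcars$cyl)
table(mtcars$am)
```

This is only a rough screen. A histogram of an individual variable, for example hist(mtcars$mpg), gives a fuller picture of its shape.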
To gain a better understanding of relationships between variables, we
drew a scatter plot matrix with the plot() function, which for a
data frame plots every pair of variables against each other. This
allowed us to visualize potential associations and identify any
outliers or patterns.
plot(mtcars)
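With 11 variables the full plot is dense, so a narrower view may be easier to read. The sketch below is one possible complement, assuming that mpg, wt, hp, and disp are the variables of most interest; the name mpg_cor is illustrative and does not come from the original analysis.

```r
# Pairwise scatter plots for a handful of variables only.
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])

# Box plot of mpg for each cylinder count.
boxplot(mpg ~ factor(cyl), data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon")

# Correlation of every variable with mpg, sorted from most negative
# to most positive (mpg itself appears last with a correlation of 1).
mpg_cor <- sort(cor(mtcars)[, "mpg"])
round(mpg_cor, 2)
```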
We built three multiple linear regression models to predict
mpg based on different combinations of predictors. The
predictors used were cyl and wt in the first
model, cyl and hp in the second model, and
cyl, wt, and hp in the third
model. We utilized the lm() function to fit these
models.
mtcars$cyl <- as.factor(mtcars$cyl)
model_cars <- lm(mtcars$mpg ~ mtcars$cyl + mtcars$wt)
model_cars_1 <- lm(mtcars$mpg ~ mtcars$cyl + mtcars$hp)
model_cars_2 <- lm(mtcars$mpg ~ mtcars$cyl + mtcars$wt + mtcars$hp)
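The calls above refer to columns as mtcars$..., which fits the same models but tends to make predict() with new data awkward. As a hedged alternative, the sketch below refits the first model with the data argument and predicts mpg for a hypothetical car. The names cars, fit, new_car, and pred, and the example car itself, are assumptions made for illustration.

```r
# Work on a copy so the built-in dataset is left untouched.
cars <- transform(mtcars, cyl = factor(cyl))

# Same predictors as model_cars, written with the data argument.
fit <- lm(mpg ~ cyl + wt, data = cars)
coef(fit)

# Hypothetical new car: 6 cylinders, wt = 3.0 (3000 lb).
# The factor must use the same levels as the training data.
new_car <- data.frame(cyl = factor("6", levels = levels(cars$cyl)),
                      wt = 3.0)

# Point prediction with a 95% prediction interval.
pred <- predict(fit, newdata = new_car, interval = "prediction")
pred
```

Writing the formula with bare column names also gives shorter coefficient labels (cyl6 rather than mtcars$cyl6).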
We evaluated each model by examining its summary, obtained using the summary() function. The summaries provide the coefficients, standard errors, t-values, p-values, and the adjusted R-squared.
summary(model_cars)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## mtcars$cyl6 -4.2556 1.3861 -3.070 0.004718 **
## mtcars$cyl8 -6.0709 1.6523 -3.674 0.000999 ***
## mtcars$wt -3.2056 0.7539 -4.252 0.000213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
summary(model_cars_1)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.818 -1.959 0.080 1.627 6.812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.65012 1.58779 18.044 < 2e-16 ***
## mtcars$cyl6 -5.96766 1.63928 -3.640 0.00109 **
## mtcars$cyl8 -8.52085 2.32607 -3.663 0.00103 **
## mtcars$hp -0.02404 0.01541 -1.560 0.12995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.146 on 28 degrees of freedom
## Multiple R-squared: 0.7539, Adjusted R-squared: 0.7275
## F-statistic: 28.59 on 3 and 28 DF, p-value: 1.14e-08
summary(model_cars_2)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$wt + mtcars$hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2612 -1.0320 -0.3210 0.9281 5.3947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.84600 2.04102 17.563 2.67e-16 ***
## mtcars$cyl6 -3.35902 1.40167 -2.396 0.023747 *
## mtcars$cyl8 -3.18588 2.17048 -1.468 0.153705
## mtcars$wt -3.18140 0.71960 -4.421 0.000144 ***
## mtcars$hp -0.02312 0.01195 -1.934 0.063613 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.44 on 27 degrees of freedom
## Multiple R-squared: 0.8572, Adjusted R-squared: 0.8361
## F-statistic: 40.53 on 4 and 27 DF, p-value: 4.869e-11
Three multiple linear regression models were developed to understand the relationship between the miles per gallon (mpg) of cars and their predictor variables (cylinders, weight, and horsepower). Each model includes different combinations of predictors. Here’s a summary of the key findings for each model:
Model 1 (model_cars)
Formula: mpg ~ cyl + wt
Coefficients:
Intercept: 33.9908
cyl6: -4.2556
cyl8: -6.0709
wt: -3.2056
Interpretation:
The model estimates that, at the same weight, cars with 6 cylinders get about 4.26 mpg less and cars with 8 cylinders about 6.07 mpg less than cars with 4 cylinders (the reference level). Additionally, each one-unit increase in wt (1000 lb) is associated with a decrease of about 3.21 mpg, holding the number of cylinders constant. All three effects are significant at the 1% level.
The model’s R-squared value (0.8374) indicates that about 83.74% of the variance in mpg is explained by the predictors (cyl and wt). The F-statistic is significant (p-value: 3.594e-11), suggesting that the model is statistically significant in predicting mpg.
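Point estimates alone do not show how precisely each effect is measured. As a rough sketch, confint() gives confidence intervals for the coefficients, and the summary object exposes the fit statistics directly. The names cars, fit, ci, and s are illustrative, and the model is refitted with the data argument rather than reusing model_cars.

```r
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl + wt, data = cars)

# 95% confidence intervals for each coefficient.
ci <- confint(fit, level = 0.95)
round(ci, 2)

# R-squared and adjusted R-squared taken from the summary object.
s <- summary(fit)
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level in the summary table.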
Model 2 (model_cars_1)
Formula: mpg ~ cyl + hp
Coefficients:
Intercept: 28.65012
cyl6: -5.96766
cyl8: -8.52085
hp: -0.02404
Interpretation:
The model suggests that cars with 6 or 8 cylinders have lower mpg compared to cars with 4 cylinders when controlling for horsepower (hp). However, the hp coefficient is not statistically significant at the 5% level (p-value 0.12995).
Goodness of Fit: The model’s R-squared value (0.7539) indicates that about 75.39% of the variance in mpg is explained by the predictors (cyl and hp). The F-statistic is significant (p-value: 1.14e-08), indicating that the model is statistically significant in predicting mpg.
Model 3 (model_cars_2)
Formula: mpg ~ cyl + wt + hp
Coefficients:
Intercept: 35.846
cyl6: -3.35902
cyl8: -3.18588
wt: -3.1814
hp: -0.02312
Interpretation:
The cylinder coefficients remain negative, but they shrink once both weight (wt) and horsepower (hp) are controlled for: the cyl6 effect (-3.36) is significant only at the 5% level, and the cyl8 effect (-3.19) is no longer statistically significant (p-value 0.153705). The wt coefficient (-3.18) is almost unchanged from Model 1 (-3.21) and remains highly significant. The hp coefficient is close to its value in Model 2 and is marginal (p-value 0.063613), so it is not significant at the 5% level.
Goodness of Fit: The model’s R-squared value (0.8572) indicates that about 85.72% of the variance in mpg is explained by the predictors (cyl, wt, and hp). The F-statistic is significant (p-value: 4.869e-11), indicating that the model is statistically significant in predicting mpg.
Overall, Model 3 with predictors cyl, wt, and hp has the highest R-squared value (0.8572). Because R-squared never decreases when a predictor is added, the adjusted R-squared is the fairer comparison, and Model 3 is also highest on that measure (0.8361, against 0.82 for Model 1 and 0.7275 for Model 2). The weight (wt) variable shows a significant negative effect on mpg in both models that include it, suggesting that heavier cars tend to have lower fuel efficiency. The number of cylinders (cyl) is significant in Models 1 and 2, but in Model 3 only the cyl6 contrast remains significant at the 5% level. Horsepower (hp) is not statistically significant at the 5% level in either model that includes it.
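The comparison can be reproduced programmatically. Since Model 1 is nested in Model 3, a partial F-test is one way to ask whether adding hp improves the fit by more than chance. The sketch below is an illustration rather than part of the original analysis; m1, m2, m3, adj_r2, and hp_test are names introduced here.

```r
cars <- transform(mtcars, cyl = factor(cyl))
m1 <- lm(mpg ~ cyl + wt, data = cars)
m2 <- lm(mpg ~ cyl + hp, data = cars)
m3 <- lm(mpg ~ cyl + wt + hp, data = cars)

# Adjusted R-squared for each model, side by side.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)

# Partial F-test: Model 1 against Model 3 (Model 1 plus hp).
hp_test <- anova(m1, m3)
hp_test
```

Because only one term is added, the p-value of this F-test matches the p-value of the hp coefficient in the Model 3 summary.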
We checked the fitted values and residuals of the first model to
assess the goodness of fit. A scatter plot of residuals against
fitted values was used to look for remaining patterns, such as
curvature or non-constant variance. Additionally, we examined the
normality of residuals using a quantile-quantile (QQ) plot.
model_cars$fitted.values
## 1 2 3 4 5 6 7 8
## 21.33650 20.51907 26.55377 19.42916 16.89262 18.64379 16.47590 23.76489
## 9 10 11 12 13 14 15 16
## 23.89311 18.70790 18.70790 14.87309 15.96300 15.80272 11.09046 10.53269
## 17 18 19 20 21 22 23 24
## 10.78593 26.93844 28.81373 28.10849 26.08896 16.63618 16.90865 15.61038
## 25 26 27 28 29 30 31 32
## 15.59435 27.78793 27.13078 29.14070 17.75814 20.85566 16.47590 25.07919
plot(model_cars$fitted.values, model_cars$residuals)
abline(0, 0, col = "red")
qqnorm(model_cars$residuals)
qqline(model_cars$residuals, col = "red")
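The same two diagnostics are also available through plot() on a fitted model, and a formal test can complement the visual reading of the QQ plot. The sketch below assumes the first model refitted with the data argument. The names cars, fit, and sw are illustrative, and the Shapiro-Wilk test is an addition, not something the original analysis ran.

```r
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl + wt, data = cars)

# Built-in diagnostic plots: residuals vs fitted (1) and normal QQ (2).
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)

# Shapiro-Wilk test of the residuals; a small p-value would cast
# doubt on the normality assumption.
sw <- shapiro.test(residuals(fit))
sw

# The three cars with the largest absolute residuals.
head(sort(abs(residuals(fit)), decreasing = TRUE), 3)
```

With only 32 observations the test has limited power, so it is best read alongside the plots rather than in place of them.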
model_cars_only_wt <- lm(mtcars$mpg ~ mtcars$wt)
We compared model_cars with a simpler model that uses wt as the
only predictor (model_cars_only_wt), using the
Akaike Information Criterion (AIC). The model with the
lower AIC value (model_cars) was selected as
the better fit for the data.
AIC(model_cars)
## [1] 156.6223
AIC(model_cars_only_wt)
## [1] 166.0294
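AIC() accepts several fitted models in one call, so the comparison can be extended to every candidate in a single table. The following is a sketch under the assumption that all four models are of interest. The names m_wt, m1, m2, m3, and aic_table are illustrative, and the models are refitted with the data argument.

```r
cars <- transform(mtcars, cyl = factor(cyl))
m_wt <- lm(mpg ~ wt, data = cars)
m1 <- lm(mpg ~ cyl + wt, data = cars)
m2 <- lm(mpg ~ cyl + hp, data = cars)
m3 <- lm(mpg ~ cyl + wt + hp, data = cars)

# One row per model: number of estimated parameters (df) and AIC.
aic_table <- AIC(m_wt, m1, m2, m3)

# Sort so that the lowest AIC comes first.
aic_table[order(aic_table$AIC), ]
```

AIC values are comparable here because all four models are fitted to the same 32 observations with the same response.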
In conclusion, we performed multiple linear regression analysis on the mtcars dataset to predict “mpg” based on various car attributes. Of the two models compared by AIC, the one combining “cyl” and “wt” (model_cars) had the lower value (156.6223, against 166.0294 for the weight-only model) and was selected. It explains about 84% of the variance in “mpg” with a residual standard error of about 2.56 mpg, so it can give reasonable predictions for cars similar to those in the dataset.
It’s essential to note that this analysis is based on the available dataset and may not be applicable to all car models. Further studies and additional variables may be necessary for more comprehensive predictions and insights.