mtcars General Information
The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973-74 models). mtcars is
a data frame with 32 observations on 11 (numeric) variables.
The data ships with R in the built-in datasets package
as mtcars. We can use head() to obtain the
first few rows of the data (six by default) and tail() to get the last few
rows.
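As a minimal sketch, the calls below should produce listings like the ones that follow. It assumes only base R, where the datasets package is attached by default.

```r
# mtcars is available without loading any extra package
head(mtcars)  # first six rows
tail(mtcars)  # last six rows
```

Both functions accept an optional second argument, for example head(mtcars, 10), to change the number of rows shown.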
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
We can also use str() to display the structure of the data: the type
of each column and its first few values.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
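From this output, we can see that every column is stored as numeric (num). The variables vs and am take only the values 0 and 1, so they are binary (as written in the general information of the data), while the variables cyl, gear, and carb are counts that take only whole-number values. The rest of the variables are measurements.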
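One way to confirm this reading of the str() output is to inspect the distinct values directly. The sketch below uses only base R; the helper name is_binary is introduced here for illustration and is not part of the original analysis.

```r
# Columns whose values are all 0 or 1 (binary indicators)
is_binary <- sapply(mtcars, function(x) all(x %in% c(0, 1)))
names(mtcars)[is_binary]

# Distinct values of the count-like columns
lapply(mtcars[c("cyl", "gear", "carb")], function(x) sort(unique(x)))
```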
In order to obtain the summary of the data (basic descriptive
statistics), we can use summary():
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
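From the results, we can infer a few things:
The minimum and maximum fuel economy among these cars are 10.4 and 33.9 miles per US gallon.
The mean gross horsepower is 146.6875.
The mean weight is 3.21725 (in units of 1000 lbs).
The most common engine is V-shaped (vs = 0 codes a V-shaped engine, and the mean of vs is below 0.5).
The most common transmission is automatic (am = 0 codes an automatic transmission, and the mean of am is below 0.5).
For the remaining discrete variables, summary() reports quartiles rather than frequencies, so the most common values come from counting the observations:
The most common number of cylinders is 8.
The most common number of forward gears is 3.
The most common numbers of carburetors are 2 and 4, which are tied.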
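The statements about the most common values can be checked by tabulating the discrete variables. This is a minimal sketch in base R; the names discrete, counts, and modes are chosen here for illustration.

```r
# Frequency of each value for the discrete variables
discrete <- c("cyl", "vs", "am", "gear", "carb")
counts <- lapply(mtcars[discrete], table)
counts

# Most frequent value(s) per variable; tied values are all reported
modes <- lapply(counts, function(tab) names(tab)[tab == max(tab)])
modes
```

Reporting every tied value avoids silently picking one of them, which matters for carb.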
Multiple linear regression general form: \[ y = \beta_0 + \beta_1X_1+ \beta_2X_2 +
...+\beta_pX_p+\varepsilon \] In multiple linear regression, one
of the most common parameter estimation methods is the Ordinary Least Squares (OLS)
method. In OLS, the parameters \(\hat{\beta}\) are calculated using the
matrix formula below: \[\mathbf{\hat{\beta}}=\mathbf{(X^TX)^{-1}X^TY}\]
where, for \(n\) observations and \(p\) predictors, X is an \((n\times
(p+1))\) matrix whose first column is all ones (for the intercept) and whose remaining columns hold the values of the predictors, and Y is an \((n\times 1)\) matrix \([y_1, y_2, ..., y_n]^T\) of the observed responses.
The lm() command does the calculation automatically
after specifying the model.
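The sketch below fits the full model with lm() and, for comparison, solves the normal equations by hand. The object names model, X, Y, and beta_hat are assumptions made for this illustration. Note that lm() itself uses a QR decomposition internally rather than this formula, so the two columns agree only up to rounding.

```r
# Fit mpg on the ten other variables
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)
summary(model)

# Same estimates from the normal equations (X^T X) beta = X^T Y
X <- model.matrix(model)   # 32 x 11 design matrix, first column is the intercept
Y <- mtcars$mpg            # response vector of length 32
beta_hat <- solve(crossprod(X), crossprod(X, Y))

# Side-by-side comparison of the two sets of estimates
cbind(lm = coef(model), manual = drop(beta_hat))
```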
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
The model that is created using the data is:
\[ \begin{align*} Y_{mpg} = 12.3034-0.1114 X_{cyl} + 0.0133 X_{disp}-0.0215 X_{hp} + 0.7871 X_{drat}-3.7153 X_{wt} \\ + 0.821 X_{qsec} + 0.3178 X_{vs} + 2.5202 X_{am} + 0.6554 X_{gear}-0.1994 X_{carb} \end{align*} \]
The errors (residuals) of the model are assumed to be normally
distributed. The t and F tests reported by lm() rely on this assumption,
so if the residuals are clearly not normal, the p-values of those tests
become unreliable. The test that will be used is the
Kolmogorov-Smirnov test, provided by base R in the
form of ks.test(error, "pnorm", mean, sd), where mean and sd are the parameters of the normal distribution being compared against.
error <- residuals(model)  # residuals of the fitted lm() model
normality_ks <- ks.test(error, "pnorm", mean(error), sd(error))
normality_ks
##
## Exact one-sample Kolmogorov-Smirnov test
##
## data: error
## D = 0.096266, p-value = 0.9006
## alternative hypothesis: two-sided
Besides the Kolmogorov-Smirnov test, the Shapiro-Wilk test can
also be used to test the normality assumption. The command
shapiro.test(residuals) is also provided by base R.
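A minimal sketch of the call follows. It refits the full model so that the block runs on its own; the name normality_sw is chosen here for illustration.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
normality_sw <- shapiro.test(model$residuals)
normality_sw
```

The p-value is stored in normality_sw$p.value if it is needed later in the analysis.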
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.95694, p-value = 0.2261
The p-values of the Kolmogorov-Smirnov test and the Shapiro-Wilk test are 0.9006 and 0.2261, respectively. Both are above 0.05, so the null hypothesis that the residuals are normally distributed cannot be rejected. The model can be regarded as fulfilling the normality assumption.
Just as the normality assumption concerns the distribution of the error
variable, the non-autocorrelation
assumption states that the errors are not correlated with one
another, whether across time or across the order of the observations. This test is important because if the error
variable turns out to correlate with itself, there is a
pattern within the errors that the model has not captured, which makes the
standard errors and the tests based on them unreliable. The test that will be used is the Durbin-Watson test, provided
by the car package with the
durbinWatsonTest(model) command.
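durbinWatsonTest() requires the car package. As a hedged illustration of what the statistic measures, the sketch below computes the Durbin-Watson statistic and the lag-1 autocorrelation in base R. It does not compute a p-value, and the names e, dw, and r1 are assumptions made for this example.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)
e <- residuals(model)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the residual sum of squares; values near 2 suggest
# no lag-1 autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation of the residuals
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(autocorrelation = r1, dw = dw)
```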
## Loading required package: carData
## lag Autocorrelation D-W Statistic p-value
## 1 0.03101277 1.860893 0.336
## Alternative hypothesis: rho != 0
With a p-value of 0.336, which is higher than 0.05, we fail to reject the null hypothesis of no autocorrelation, so the residuals of the model can be treated as not autocorrelated.
Linearity is checked graphically here rather than with a formal test: we can use a Residuals vs Fitted plot to look for non-linear patterns.
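A Residuals vs Fitted plot can be drawn with base R graphics, as in the sketch below. The smoothing line from lowess() is an addition made here to help spot curvature. The built-in diagnostic plot(model, which = 1) draws a similar figure.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)

# Residuals against fitted values; under linearity the points should
# scatter around zero without a systematic curve
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)                                        # reference line at zero
lines(lowess(fitted(model), residuals(model)), col = "red")   # smoothed trend
```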
From the plot, it can be inferred that there is no
Testing the homoscedasticity of the model is important to ascertain
that the residuals from the model have constant
variance. The test that will be used is the Breusch-Pagan test, provided
by the lmtest package in the form of
bptest(model).
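bptest() requires the lmtest package. As a sketch of the idea behind the studentized (Koenker) version of the test, the statistic can be computed in base R as n times the R-squared of an auxiliary regression of the squared residuals on the same predictors. The names aux_data, aux, bp, and p_value are assumptions made for this example.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)

# Auxiliary data: the ten predictors plus the squared residuals
aux_data <- cbind(mtcars[names(mtcars) != "mpg"],
                  sq_res = residuals(model)^2)

# Regress the squared residuals on all predictors
aux <- lm(sq_res ~ ., data = aux_data)

# Studentized Breusch-Pagan statistic and its chi-squared p-value
bp <- nobs(model) * summary(aux)$r.squared
df <- length(coef(aux)) - 1
p_value <- pchisq(bp, df = df, lower.tail = FALSE)

c(BP = bp, df = df, p.value = p_value)
```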
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 14.914, df = 10, p-value = 0.1352
The resulting p-value of 0.1352 is higher than 0.05, so we fail to reject the null hypothesis of homoscedasticity. The errors can be treated as homoscedastic, that is, as having approximately constant variance.
The assumption of non-multicollinearity states that the independent variables used in the model are not strongly linearly related to one another, so that each contributes its own information about the response variable. It is checked here with the variance inflation factor (VIF) of each predictor:
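The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it in base R through an equivalent shortcut, the diagonal of the inverse correlation matrix of the predictors. Packages such as car offer a vif() function for the same purpose; the names X and vif_values are assumptions made for this example.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)

# Predictor columns only (drop the intercept column)
X <- model.matrix(model)[, -1]

# Diagonal of the inverse correlation matrix gives one VIF per predictor
vif_values <- diag(solve(cor(X)))
round(vif_values, 6)

# Predictors above the commonly used threshold of 10
names(vif_values)[vif_values > 10]
```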
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
Three variables exceed the commonly used VIF threshold of 10:
cyl, disp, and wt.
In addition, hp has a VIF value close to 10 (9.83), so for good
measure we will be taking it out as well.
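One way to refit the model without these four variables is update(), sketched below. The name model_reduced is an assumption made for this example.

```r
# Full model, refitted so this block is self-contained
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
              am + gear + carb, data = mtcars)

# Drop the predictors flagged by the VIF check and refit
model_reduced <- update(model, . ~ . - cyl - disp - wt - hp)
summary(model_reduced)
```

Using update() keeps the reduced model tied to the original call, so the data argument and remaining terms do not need to be retyped.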
##
## Call:
## lm(formula = mpg ~ drat + qsec + vs + am + gear + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0858 -1.2152 0.4257 2.0044 4.7274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5965 11.5196 0.052 0.95911
## drat 2.1085 1.6265 1.296 0.20669
## qsec 0.4517 0.5586 0.809 0.42640
## vs 1.5001 1.9905 0.754 0.45811
## am 4.0671 2.0394 1.994 0.05714 .
## gear 1.7015 1.4540 1.170 0.25293
## carb -1.6831 0.4999 -3.367 0.00246 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.869 on 25 degrees of freedom
## Multiple R-squared: 0.8173, Adjusted R-squared: 0.7735
## F-statistic: 18.64 on 6 and 25 DF, p-value: 3.991e-08
Without conducting parameter testing, the model we obtain after removing multicollinear variables is:
\[ \begin{align*} Y_{mpg} = 0.5965 + 2.1085 X_{drat} + 0.4517 X_{qsec} + 1.5001 X_{vs} + 4.0671 X_{am} \\ + 1.7015 X_{gear} -1.6831 X_{carb} \end{align*} \]