mtcars General Information

The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). mtcars is a data frame with 32 observations on 11 (numeric) variables:

  • mpg: Miles/(US) gallon
  • cyl: Number of cylinders
  • disp: Displacement (cu.in.)
  • hp: Gross horsepower
  • drat: Rear axle ratio
  • wt: Weight (1000 lbs)
  • qsec: 1/4 mile time (seconds)
  • vs: Engine (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors

The data ships with base R (in the datasets package) as mtcars. We can use head() to show the first six rows of the data and tail() to show the last six rows.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

We can also use str() to display the structure of the data frame: the type of each column and its first few values.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

From this output, every column is stored as a numeric (double) vector. The variables vs and am are binary (as stated in the general information above), while cyl, gear, and carb are counts that take only whole-number values. The remaining variables are measured quantities.
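As a small sketch of how this reading of str() could be verified, the snippet below checks the storage type of each column and which columns contain only whole numbers or only 0 and 1. The helper name is_whole is an illustrative assumption and is not part of the original analysis.

```r
# Storage type of every column (all are "numeric" in mtcars)
col_classes <- sapply(mtcars, class)

# Hypothetical helper: TRUE if every value in x is a whole number
is_whole <- function(x) all(x == round(x))
whole_cols <- sapply(mtcars, is_whole)

# Binary columns: only the values 0 and 1 occur
binary_cols <- sapply(mtcars, function(x) all(x %in% c(0, 1)))

col_classes
whole_cols
binary_cols
```

If a later model should treat vs or am as categories rather than numbers, they could be converted with factor(); the regression in this document keeps them numeric.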

In order to obtain the summary of the data (basic descriptive statistics), we can use summary():

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

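From the results, we can infer a few things:

  • The minimum and maximum fuel economy among these cars are 10.4 and 33.9 miles per US gallon.

  • The mean gross horsepower is 146.6875.

  • The mean weight is 3.21725 (in units of 1000 lbs).

  • The most common engine is V-shaped, since the mean of vs (0.4375) is below 0.5.

  • The most common transmission is automatic, since the mean of am (0.4062) is below 0.5.

The most frequent values of cyl, gear, and carb cannot be read from summary(), which reports only quartiles and the mean; they require frequency counts such as table(mtcars$cyl). Those counts show that:

  • The most common number of cylinders is 8 (14 cars).

  • The most common number of forward gears is 3 (15 cars).

  • The most common numbers of carburetors are 2 and 4, tied at 10 cars each.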
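The statements about the most common values rest on frequency counts rather than on summary(). A minimal sketch of how those counts could be obtained with base R is shown below; the object names counts and modes are illustrative.

```r
# Frequency counts for the discrete variables
counts <- lapply(mtcars[c("cyl", "vs", "am", "gear", "carb")], table)

# Most frequent value(s) of each variable; ties are all reported
modes <- lapply(counts, function(tab) names(tab)[tab == max(tab)])

counts$carb
modes
```

Reporting every value that reaches the maximum count, instead of using which.max(), avoids silently dropping one side of a tie.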

Regression Model

Parameter Estimation

The general form of a multiple linear regression with \(p\) predictors is: \[ y = \beta_0 + \beta_1X_1+ \beta_2X_2 + \dots+\beta_pX_p+\varepsilon \] One of the most common parameter estimation methods for multiple linear regression is Ordinary Least Squares (OLS). In OLS, the parameter estimates \(\hat{\beta}\) are calculated using the matrix formula below: \[\mathbf{\hat{\beta}}=\mathbf{(X^TX)^{-1}X^TY}\] where X is the \((n\times (p+1))\) design matrix, containing a leading column of ones for the intercept followed by the \(n\) observed values of each of the \(p\) predictors, and Y is the \((n\times 1)\) vector \([y_1, y_2, \dots, y_n]^T\) of observed responses. The lm() command performs this estimation automatically once the model is specified.

model = lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb, data=mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

The fitted model is:

\[ \begin{aligned} Y_{mpg} = {} & 12.3034-0.1114 X_{cyl} + 0.0133 X_{disp}-0.0215 X_{hp} + 0.7871 X_{drat}-3.7153 X_{wt} \\ & + 0.8210 X_{qsec} + 0.3178 X_{vs} + 2.5202 X_{am} + 0.6554 X_{gear}-0.1994 X_{carb} \end{aligned} \]
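To connect the OLS matrix formula with the lm() output, the sketch below computes the estimates by hand and compares them with coef(). It solves the normal equations rather than forming the inverse explicitly; this is an illustration only, and lm() itself uses a different numerical method internally. The names X, y, beta_hat, and fit are illustrative.

```r
# Design matrix: a column of ones plus the ten predictors (32 x 11)
X <- model.matrix(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
                  data = mtcars)
y <- mtcars$mpg

# Solve (X'X) b = X'y, which is equivalent to b = (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Reference fit with lm() for comparison
fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
          data = mtcars)

cbind(manual = drop(beta_hat), lm = coef(fit))
```

The two columns should agree up to rounding error.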

Hypothesis Test

Residual Normality Test

The t-tests and F-test reported for an OLS model assume that the errors are normally distributed. If the residuals deviate strongly from normality, the reported p-values and confidence intervals become unreliable. The test used here is the Kolmogorov-Smirnov test, provided by base R as ks.test(error, "pnorm", mean, sd). Because the mean and standard deviation are estimated from the same residuals, this test is conservative, so its p-value should be read as approximate.

error = model$residuals
normality_ks=ks.test(error, "pnorm", mean(error), sd(error))  # sd(error) equals sqrt(var(error))
normality_ks
## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  error
## D = 0.096266, p-value = 0.9006
## alternative hypothesis: two-sided

Besides the Kolmogorov-Smirnov test, the Shapiro-Wilk test can also be used to test the normality assumption. It is available in base R as shapiro.test(residuals).

normality_sw=shapiro.test(model$residuals)
normality_sw
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.95694, p-value = 0.2261

The p-values of the Kolmogorov-Smirnov and Shapiro-Wilk tests are 0.9006 and 0.2261 respectively. Both are above 0.05, so the null hypothesis that the residuals are normally distributed cannot be rejected, and the model can be treated as satisfying the normality assumption.
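The decision rule can also be applied programmatically instead of reading the p-values off the printed output. The following sketch assumes a 0.05 significance level and refits the full model with the shorthand mpg ~ ., which uses the same ten predictors; the names res, p_values, and alpha are illustrative.

```r
# Refit the full model (mpg ~ . uses all ten other columns as predictors)
model <- lm(mpg ~ ., data = mtcars)
res <- residuals(model)

# Collect the p-values of both normality tests in one named vector
p_values <- c(
  ks = ks.test(res, "pnorm", mean(res), sd(res))$p.value,
  sw = shapiro.test(res)$p.value
)

# TRUE means normality is not rejected at the assumed 5% level
alpha <- 0.05
p_values > alpha
```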

Autocorrelation Test

Just as the normality assumption requires the errors to be normally distributed, the non-autocorrelation assumption requires that the errors are not correlated with one another, whether across time or any other ordering of the observations. This matters because autocorrelated errors indicate structure that the model has not captured, which makes the standard errors unreliable. The test used is the Durbin-Watson test, provided by the car package as durbinWatsonTest(model).

library(car)
## Loading required package: carData
autocorrelation=durbinWatsonTest(model)
autocorrelation
##  lag Autocorrelation D-W Statistic p-value
##    1      0.03101277      1.860893   0.336
##  Alternative hypothesis: rho != 0

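With a p-value of 0.336, which is higher than 0.05, we cannot reject the null hypothesis of no autocorrelation, so the residuals of the model can be treated as not autocorrelated. Note that durbinWatsonTest() estimates its p-value by bootstrap simulation, so the value can differ slightly between runs.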
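The Durbin-Watson statistic itself is simple to reproduce in base R, which can help when interpreting it: values near 2 indicate little lag-1 autocorrelation. The sketch below computes the statistic and the lag-1 autocorrelation directly from the residuals; it does not reproduce the bootstrap p-value, and the names e, dw_stat, and rho_1 are illustrative.

```r
# Refit the full model and take its residuals in row order
model <- lm(mpg ~ ., data = mtcars)
e <- residuals(model)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw_stat <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (their mean is zero up to rounding)
rho_1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(dw = dw_stat, rho_1 = rho_1)
```

Both values depend on the row order of mtcars, which is not a time ordering, so the test is only a check against an arbitrary ordering here.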

Linearity Test

This analysis does not use a formal test for linearity. Instead, we use a Residuals vs Fitted plot: if the linear form is adequate, the residuals should scatter around zero with no systematic curve.

plot(model, 1)

From the plot, it can be inferred that there is no
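For a numerical complement to the plot, one option is a RESET-style check: add the squared fitted values as an extra regressor and test whether they improve the fit. This is a swapped-in technique that is not part of the original analysis, sketched here in base R; the names full, aug_data, fit2, and reset_p are illustrative.

```r
# Full model as fitted earlier in this document
full <- lm(mpg ~ ., data = mtcars)

# Augment the data with the squared fitted values (hypothetical column fit2)
aug_data <- transform(mtcars, fit2 = fitted(full)^2)
augmented <- lm(mpg ~ ., data = aug_data)

# F-test: does the extra non-linear term explain additional variation?
reset_test <- anova(full, augmented)
reset_p <- reset_test[["Pr(>F)"]][2]

reset_test
```

A small p-value would suggest that a purely linear specification misses some curvature; a large one gives no evidence against linearity.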

Homoscedasticity Test

Testing for homoscedasticity checks whether the residuals of the model have constant variance. The test used is the Breusch-Pagan test, provided by the lmtest package as bptest(model).

library(lmtest)  # provides bptest()
homoscedasticity=bptest(model)
homoscedasticity
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 14.914, df = 10, p-value = 0.1352

The resulting p-value of 0.1352 is higher than 0.05, so we fail to reject the null hypothesis of homoscedasticity. The errors can be treated as homoscedastic, that is, having approximately constant variance.
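To show what the studentized Breusch-Pagan statistic measures, the sketch below reproduces it in base R: regress the squared residuals on the predictors and multiply the resulting R-squared by the number of observations. This is an illustrative reconstruction rather than the lmtest source code; the names aux_data, aux, bp_stat, bp_df, and bp_p are assumptions.

```r
# Full model and its squared residuals
model <- lm(mpg ~ ., data = mtcars)
aux_data <- transform(mtcars, e2 = residuals(model)^2)

# Auxiliary regression of squared residuals on the ten predictors (mpg excluded)
aux <- lm(e2 ~ . - mpg, data = aux_data)

# Studentized BP statistic: n * R^2, compared with a chi-squared distribution
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```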

Multicollinearity Test

The non-multicollinearity assumption requires that no predictor is close to a linear combination of the other predictors, so that each variable makes its own contribution to explaining the response. We check it with the variance inflation factor (VIF), computed by vif() from the car package; the larger the VIF, the more strongly a predictor is explained by the other predictors.

multicol=vif(model)
multicol
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487 
##      gear      carb 
##  5.357452  7.908747
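The VIF of a predictor can be written as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below reproduces the values above in base R under that definition; the names predictors and vif_manual are illustrative.

```r
# All columns except the response are predictors in the full model
predictors <- setdiff(names(mtcars), "mpg")

# For each predictor, regress it on the remaining predictors and apply 1 / (1 - R^2)
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 3)
```

A VIF of 10 corresponds to \(R_j^2 = 0.9\), meaning 90% of the variation in that predictor is explained by the others.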

Three variables exceed the commonly used VIF threshold of 10: cyl, disp, and wt. In addition, hp has a VIF of 9.83, which is close to 10, so for good measure we remove it as well and refit the model.

model = lm(mpg~drat+qsec+vs+am+gear+carb, data=mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ drat + qsec + vs + am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0858 -1.2152  0.4257  2.0044  4.7274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.5965    11.5196   0.052  0.95911   
## drat          2.1085     1.6265   1.296  0.20669   
## qsec          0.4517     0.5586   0.809  0.42640   
## vs            1.5001     1.9905   0.754  0.45811   
## am            4.0671     2.0394   1.994  0.05714 . 
## gear          1.7015     1.4540   1.170  0.25293   
## carb         -1.6831     0.4999  -3.367  0.00246 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.869 on 25 degrees of freedom
## Multiple R-squared:  0.8173, Adjusted R-squared:  0.7735 
## F-statistic: 18.64 on 6 and 25 DF,  p-value: 3.991e-08

Results

Without testing the significance of individual parameters, the model we obtain after removing the multicollinear variables is:

\[ \begin{aligned} Y_{mpg} = {} & 0.5965 + 2.1085 X_{drat} + 0.4517 X_{qsec} + 1.5001 X_{vs} + 4.0671 X_{am} \\ & + 1.7015 X_{gear} -1.6831 X_{carb} \end{aligned} \]
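To see what removing the four variables costs in terms of fit, the two models can be placed side by side. The sketch below only collects quantities that summary() already reports; the object names full, reduced, and comparison are illustrative.

```r
# Full model and the reduced model without cyl, disp, hp, and wt
full <- lm(mpg ~ ., data = mtcars)
reduced <- lm(mpg ~ drat + qsec + vs + am + gear + carb, data = mtcars)

# Fit statistics taken from summary() of each model
comparison <- data.frame(
  model = c("full", "reduced"),
  r_squared = c(summary(full)$r.squared, summary(reduced)$r.squared),
  adj_r_squared = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  sigma = c(summary(full)$sigma, summary(reduced)$sigma)
)

comparison
```

The reduced model gives up some explanatory power (adjusted R-squared of about 0.77 against 0.81) in exchange for predictors that are less entangled with one another.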