Data Exploration

The mtcars dataset contains 32 car models (rows) and 11 variables (columns).
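The row and column counts can be checked directly. This is a minimal sketch, assuming the built-in copy of the data from the datasets package; the name dat is only an illustrative variable.

```r
# Take a fresh copy of the built-in data so later edits to mtcars do not matter
dat <- datasets::mtcars

dims <- dim(dat)        # c(number of rows, number of columns)
dims
head(rownames(dat))     # the car models are stored as row names
```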

hist(mtcars$mpg, main="Miles per Gallon", xlab="mpg")

hist(mtcars$wt, main="Weight", xlab="Weight (1000 lbs)")

hist(mtcars$hp, main="Horsepower", xlab="hp")

hist(mtcars$cyl, main="Cylinders", xlab="cyl")

hist(mtcars$qsec, main="Quarter-mile time", xlab="qsec")

hist(mtcars$disp, main="Displacement", xlab="disp")

plot(mtcars$wt, mtcars$mpg, main="mpg vs Weight", xlab="Weight (1000 lbs)", ylab="mpg")

plot(mtcars$hp, mtcars$mpg, main="mpg vs Horsepower", xlab="Horsepower", ylab="mpg")

plot(mtcars$cyl, mtcars$mpg, main="mpg vs Cylinders", xlab="Cylinders", ylab="mpg")

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

Visualizations

hist(mtcars$mpg, main="Miles per Gallon", xlab="mpg")
hist(mtcars$wt, main="Weight", xlab="Weight (1000 lbs)")

plot(mtcars$wt, mtcars$mpg, main="mpg vs Weight", xlab="Weight (1000 lbs)", ylab="mpg")
plot(mtcars$hp, mtcars$mpg, main="mpg vs Horsepower", xlab="Horsepower", ylab="mpg")

cor(mtcars$mpg, mtcars)
##      mpg       cyl       disp         hp      drat         wt     qsec
## [1,]   1 -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251

Correlations

It really interests me how strong the negative correlation between cylinders and miles per gallon is (r = -0.85). This is not something that I had previously considered in vehicle gas mileage. It also interested me that weight and mpg are even more strongly negatively correlated (r = -0.87).

cor(mtcars$mpg, mtcars)

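Weight (r = -0.87), cylinder count (-0.85), engine size measured as displacement (-0.85), and horsepower (-0.78) are the variables most strongly correlated with mpg. A plausible explanation is that all of these relate to engine size and overall vehicle weight, which largely determine how much gas is needed to move a vehicle. Correlation alone does not prove this, since these variables are also strongly correlated with each other.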
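The ranking can be read off the correlation output by eye, but sorting makes it explicit. This is a small sketch, assuming a fresh copy of the built-in data; dat, r, and ranked are illustrative names.

```r
dat <- datasets::mtcars

# Correlation of mpg with every other variable (drop mpg itself)
r <- cor(dat)["mpg", ]
r <- r[names(r) != "mpg"]

# Sort by absolute strength, strongest first
ranked <- r[order(abs(r), decreasing = TRUE)]
round(ranked, 2)
```

Sorting on the absolute value keeps strong negative and strong positive correlations on the same scale.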

Data Preprocessing

sum(is.na(mtcars))
## [1] 0
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The mtcars dataset has no missing values, since sum(is.na(mtcars)) returns 0. There appear to be no major inconsistencies in the dataset: str(mtcars) shows that all 11 variables are numeric, and summary(mtcars) shows no obviously implausible minimum or maximum values.
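The same checks can be broken down by column. The sketch below assumes a fresh copy of the built-in data and uses illustrative variable names. It also counts distinct values per column, because several variables that str() reports as numeric (for example vs and am) only take a handful of values.

```r
dat <- datasets::mtcars

# Missing values in each column
na_per_column <- colSums(is.na(dat))

# TRUE if every column is stored as numeric
all_numeric <- all(vapply(dat, is.numeric, logical(1)))

# Number of distinct values in each column
n_unique <- vapply(dat, function(x) length(unique(x)), integer(1))

na_per_column
all_numeric
n_unique
```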

Linear Regression using lm

regress <- lm(mpg ~., data = mtcars)
summary(regress)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
plot(regress)

est <- predict(regress, mtcars)
mse <- mean((mtcars$mpg - est)^2)
mse
## [1] 4.609201
interact <- lm(mpg ~ wt * am + ., data = mtcars)
summary(interact)
## 
## Call:
## lm(formula = mpg ~ wt * am + ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0807 -1.4803 -0.4741  1.3226  4.5850 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.070807  17.191096   0.120  0.90532   
## wt          -3.030663   1.712406  -1.770  0.09200 . 
## am          13.334354   4.663324   2.859  0.00969 **
## cyl          0.225567   0.942201   0.239  0.81323   
## disp         0.004384   0.016328   0.269  0.79105   
## hp          -0.006131   0.020359  -0.301  0.76643   
## drat         0.359683   1.469370   0.245  0.80911   
## qsec         1.109647   0.662235   1.676  0.10938   
## vs          -0.077414   1.884792  -0.041  0.96764   
## gear         1.108383   1.344775   0.824  0.41954   
## carb        -0.090545   0.740919  -0.122  0.90395   
## wt:am       -4.137886   1.640318  -2.523  0.02023 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.365 on 20 degrees of freedom
## Multiple R-squared:  0.9006, Adjusted R-squared:  0.846 
## F-statistic: 16.48 on 11 and 20 DF,  p-value: 1.081e-07
boxplot(mtcars$wt, main="Boxplot of Weight")

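library(DescTools)  # assumed source of Winsorize(); the original does not show which package was loaded
mtcars$wt_wins <- Winsorize(mtcars$wt)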
summary(mtcars$wt_wins)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.736   2.581   3.325   3.222   3.610   5.293
model_winsorized <- lm(mpg ~ . - wt + wt_wins, data = mtcars)
summary(model_winsorized)
## 
## Call:
## lm(formula = mpg ~ . - wt + wt_wins, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6280 -1.4760 -0.1924  1.2648  4.5156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 13.81707   18.95690   0.729   0.4741  
## cyl         -0.09028    1.06107  -0.085   0.9330  
## disp         0.01202    0.01861   0.646   0.5252  
## hp          -0.02146    0.02223  -0.965   0.3454  
## drat         0.76681    1.66752   0.460   0.6503  
## qsec         0.72003    0.73178   0.984   0.3363  
## vs           0.51686    2.13095   0.243   0.8107  
## am           2.42856    2.09624   1.159   0.2597  
## gear         0.75687    1.51096   0.501   0.6216  
## carb        -0.27079    0.85296  -0.317   0.7540  
## wt_wins     -3.61419    2.05327  -1.760   0.0929 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.691 on 21 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8006 
## F-statistic: 13.45 on 10 and 21 DF,  p-value: 5.141e-07
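For readers without the package that provides Winsorize(), the clipping step can be sketched in base R. This assumes the limits are the 5th and 95th percentiles of weight, which is consistent with the minimum (1.736) and maximum (5.293) reported for wt_wins; winsorize_base is a hypothetical helper, not a function from the original analysis.

```r
dat <- datasets::mtcars

# Illustrative helper: clip x to the given lower and upper quantiles
winsorize_base <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

wt_wins <- winsorize_base(dat$wt)
summary(wt_wins)
sum(wt_wins != dat$wt)  # how many weights were changed by the clipping
```

Note that this clips both tails, so the lightest cars are pulled up as well as the heaviest ones being pulled down.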

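In the full model (regress), the intercept (12.30) is the predicted mpg if all of the predictors were zero, which does not describe a real car, so it has no practical interpretation. The weight coefficient of -3.72 means that every 1,000 lb increase in vehicle weight is associated with a decrease of about 3.72 mpg, holding the other predictors constant. The horsepower coefficient of -0.02 means that every additional unit of horsepower is associated with a 0.02 decrease in mpg. The transmission coefficient of 2.52 means that manual cars are estimated to get 2.52 mpg more than otherwise identical automatics. None of these coefficients is significant at the 0.05 level; weight comes closest (p = 0.063).

The plots from plot(regress) show almost linear relationships, which suggests that the linearity and consistency assumptions behind this model are approximately met by this dataset.

The MSE is 4.61. It is computed on the same data that was used to fit the model, so it measures in-sample fit rather than error on new cars.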
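To make the in-sample caveat concrete, the MSE can be recomputed from the residuals and compared with a leave-one-out estimate. This is a sketch rather than part of the original analysis: it refits the same formula on a fresh copy of the data and uses the standard hat-value shortcut for linear models, in which the held-out residual for row i equals e_i / (1 - h_i). The names fit, mse_in, and mse_loo are illustrative.

```r
dat <- datasets::mtcars
fit <- lm(mpg ~ ., data = dat)

# In-sample MSE: mean squared residual on the data used for fitting
mse_in <- mean(residuals(fit)^2)

# Leave-one-out MSE via the hat-value shortcut (no refitting needed)
h <- hatvalues(fit)
mse_loo <- mean((residuals(fit) / (1 - h))^2)

c(in_sample = mse_in, leave_one_out = mse_loo)
```

With 10 predictors and only 32 rows, the leave-one-out value is the more cautious indication of how the model would do on cars it has not seen.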

The summary of the interaction model (interact) shows that there is a significant difference in the way that weight affects mpg for a manual transmission vs an automatic transmission (wt:am = -4.14, p = 0.020). The weight slope is -3.03 mpg per 1,000 lb for automatics and -3.03 - 4.14 = -7.17 for manuals, so manuals lose more mpg than automatics do as weight increases.

The boxplot of weight showed two outliers. After winsorizing weight, the R^2 went from 0.869 to 0.865 (adjusted R^2 from 0.807 to 0.801), which is a slight decrease, so winsorizing does not improve the fit of this model. By comparison, adding the interaction term raised the R^2 to 0.901 (adjusted 0.846).
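The two weight slopes can be pulled out of the fitted coefficients instead of being added by hand. This sketch refits the same formula on a fresh copy of the data; fit_int, slope_automatic, and slope_manual are illustrative names.

```r
dat <- datasets::mtcars
fit_int <- lm(mpg ~ wt * am + ., data = dat)
b <- coef(fit_int)

# am = 0 (automatic): the slope is just the wt coefficient
slope_automatic <- unname(b["wt"])

# am = 1 (manual): add the interaction term to the wt coefficient
slope_manual <- unname(b["wt"] + b["wt:am"])

c(automatic = slope_automatic, manual = slope_manual)
```

The large positive am coefficient (13.33) has to be read together with this steeper slope: it describes a manual car at zero weight, so it is not the mpg advantage of a typical manual car.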