Load in the mtcars dataset, assigning it to its own name, and load the appropriate packages.
library(ggplot2)
library(MASS)
data("mtcars")
attach(mtcars)
## The following object is masked from package:ggplot2:
##
## mpg
?mtcars
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
table(mtcars$cyl) #Number of cylinders
##
## 4 6 8
## 11 7 14
table(mtcars$vs) #Type of engine, V-Shaped or Straight
##
## 0 1
## 18 14
table(mtcars$am) #Type of Transmission, Automatic or Manual
##
## 0 1
## 19 13
table(mtcars$gear) #Number of Forward gears
##
## 3 4 5
## 15 12 5
table(mtcars$carb) #Number of carburetors
##
## 1 2 3 4 6 8
## 7 10 3 10 1 1
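The same counts can also be viewed as shares of the 32 cars. As an optional check (not part of the original output), prop.table converts a table of counts into proportions:
prop.table(table(mtcars$cyl)) # share of cars with 4, 6, and 8 cylinders
prop.table(table(mtcars$am)) # share of automatic (0) vs. manual (1) transmissions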
Visualize the mtcars dataset.
# Tell R to treat the continuous-looking cyl column as the categorical variable it actually is
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_point()
# Map wt against mpg, with disp mapped to point size
ggplot(mtcars, aes(wt, mpg, size = disp)) +
geom_point()
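A fitted line can optionally be overlaid on the weight plot to make the trend clearer; a minimal sketch using geom_smooth with a linear fit:
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(size = disp)) +
  geom_smooth(method = "lm", se = FALSE) # straight-line fit for reference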
The relationship between weight (wt) and miles per gallon (mpg) is negative: as the weight of the vehicle increases, the miles one can travel per gallon decreases.
Next, examine how strongly each variable correlates with mpg.
corr_matrix <- cor(mtcars)
print(corr_matrix)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
wt has the strongest correlation with mpg, even though the value is negative, followed by disp and then cyl.
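As a quick check of that ordering, the correlations with mpg can be sorted by absolute value (an optional snippet, not part of the original output):
sort(abs(corr_matrix["mpg", ]), decreasing = TRUE) # wt, cyl, and disp should top the list after mpg itself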
library(corrplot)
## corrplot 0.92 loaded
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
tl.col="black", tl.srt=45)
The correlation plot shows the same strong relationships with mpg, though with values this close together the exact ordering is easier to discern from the numeric correlation matrix.
Check for missing values.
colSums(is.na(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
boxplot(mtcars, las=2, cex.axis=0.6)
hp, wt, qsec, and carb have noticeable outliers, while disp and hp have a larger spread between their lower quartile, median, and upper quartile.
hist(mtcars$disp)
hist(mtcars$hp)
hist(mtcars$wt)
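To see exactly which observations the box plot flags, boxplot.stats can list the values beyond the whiskers; a small optional sketch:
boxplot.stats(mtcars$hp)$out # hp values beyond the whiskers
boxplot.stats(mtcars$wt)$out # wt values beyond the whiskers
boxplot.stats(mtcars$carb)$out # carb values beyond the whiskers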
Fit a linear model predicting mpg from all other variables.
mtcars_lm <- lm(mpg ~ . , data = mtcars)
summary(mtcars_lm)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Most coefficients are positive, meaning that as the predictor increases, the predicted response increases. For example, as disp increases by 1 unit, predicted mpg increases by 0.01334, holding the other variables constant.
There are also a few negatively signed coefficients, such as cyl, hp, wt, and carb, meaning that as the predictor increases, the predicted response decreases. For example, as wt increases by 1 unit, predicted mpg decreases by 3.71530, holding the other variables constant.
We can also see that no p-value is significant at the 0.05 level when all other variables are in the model; wt comes closest (p = 0.0633), and its effect on mpg is negative.
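Those per-unit interpretations can be read directly off the coefficient vector; a short illustrative snippet:
coef(mtcars_lm)["disp"] # change in predicted mpg per 1-unit increase in disp, other variables held fixed
coef(mtcars_lm)["wt"] # change in predicted mpg per 1-unit (1,000 lb) increase in wt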
The assumptions we make when we use linear regression: a linear relationship between the predictors and the response, independent errors, constant error variance (homoscedasticity), and normally distributed residuals.
We can check whether the assumptions of linear regression are met using diagnostic plots:
# Diagnostic Plots
par(mfrow = c(2, 2))
plot(mtcars_lm)
lm_mse <- mean((mtcars_lm$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for Linear Model:", round(lm_mse, 2)))
## [1] "Mean Squared Error for Linear Model: 4.61"
An MSE of 4.61 is relatively low, meaning the model's predicted values are close to the observed values, which suggests it should predict future observations reasonably well.
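For context, this MSE can be compared with a naive intercept-only model that always predicts the mean mpg; a hedged sketch (not part of the original analysis):
baseline_mse <- mean((mean(mtcars$mpg) - mtcars$mpg)^2) # MSE of always predicting the mean
print(paste("Baseline MSE:", round(baseline_mse, 2)))
print(paste("Linear model MSE:", round(lm_mse, 2)))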
Min-max scale a variable and refit the full model, then build an interaction model with cyl and disp.
mtcars_std <- mtcars
# Note: this line overwrites mpg with a min-max scaled copy of wt, making the response an exact
# linear function of one predictor, which explains the "essentially perfect fit" warning below
mtcars_std$mpg <- (mtcars$wt - min(mtcars$wt)) / (max(mtcars$wt) - min(mtcars$wt))
mtcars_lm_std <- lm(mpg ~ ., data = mtcars_std)
summary(mtcars_lm_std)
## Warning in summary.lm(mtcars_lm_std): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = mpg ~ ., data = mtcars_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.112e-16 -2.390e-17 -8.700e-18 1.618e-17 4.119e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.869e-01 7.807e-16 -4.955e+14 <2e-16 ***
## cyl -5.516e-18 4.359e-17 -1.270e-01 0.900
## disp 7.965e-19 7.448e-19 1.069e+00 0.297
## hp -8.821e-19 9.080e-19 -9.710e-01 0.342
## drat -2.947e-17 6.821e-17 -4.320e-01 0.670
## wt 2.557e-01 7.902e-17 3.236e+15 <2e-16 ***
## qsec -2.741e-18 3.048e-17 -9.000e-02 0.929
## vs -4.101e-17 8.778e-17 -4.670e-01 0.645
## am 4.591e-17 8.578e-17 5.350e-01 0.598
## gear -4.851e-17 6.228e-17 -7.790e-01 0.445
## carb 3.340e-17 3.457e-17 9.660e-01 0.345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105e-16 on 21 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.588e+31 on 10 and 21 DF, p-value: < 2.2e-16
mtcars_lm_it1 <- lm(mpg ~ cyl + disp + cyl*disp, data = mtcars_std)
summary(mtcars_lm_it1)
##
## Call:
## lm(formula = mpg ~ cyl + disp + cyl * disp, data = mtcars_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.210244 -0.059353 -0.000182 0.079461 0.175009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3387703 0.2100789 -1.613 0.11805
## cyl 0.0373506 0.0352685 1.059 0.29863
## disp 0.0054955 0.0016792 3.273 0.00283 **
## cyl:disp -0.0004470 0.0002077 -2.152 0.04017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1117 on 28 degrees of freedom
## Multiple R-squared: 0.8201, Adjusted R-squared: 0.8008
## F-statistic: 42.55 on 3 and 28 DF, p-value: 1.471e-10
With one categorical variable (cyl) and one continuous variable (disp), cyl on its own is still not a significant predictor, but disp and the cyl by disp interaction are significant predictors of mpg, with this model explaining around 82% of the variance.
While this interaction model explains only 82% of the variance, compared with roughly 87% for the original regression with all variables and no interaction, the full model had become complicated. The newer model keeps more of its terms significant, is easier to understand, leaves room for modification, and still explains a large portion of the variance with few variables.
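To compare the interaction model with the full additive model more formally, AIC is one option; a brief sketch that refits the interaction model on the original, unmodified mtcars (the name mtcars_lm_int is introduced here for illustration):
mtcars_lm_int <- lm(mpg ~ cyl * disp, data = mtcars) # interaction model on the original data
AIC(mtcars_lm, mtcars_lm_int) # lower AIC indicates the better fit/complexity trade-off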
boxplot(mtcars, las=2, cex.axis=0.6)
It can be seen from the box plot that hp, wt, qsec, and carb contain noticeable outliers. We will winsorize hp at the 1st and 99th percentiles; in other words, we replace extreme values with less extreme values, which keeps all of the rows. Since hp has the largest outliers compared to the others, we will work on this variable.
# Create a separate data set to preserve the original
mtcars2 <- mtcars
# Calculate the 1st and 99th percentiles for 'hp'
lower_bound_hp <- quantile(mtcars2$hp, 0.01, na.rm = TRUE)
upper_bound_hp <- quantile(mtcars2$hp, 0.99, na.rm = TRUE)
# Winsorize the data, replacing values outside the thresholds with the thresholds
mtcars2$hp[mtcars2$hp < lower_bound_hp] <- lower_bound_hp
mtcars2$hp[mtcars2$hp > upper_bound_hp] <- upper_bound_hp
Double-check the transformation.
summary(mtcars[, "hp"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.0 96.5 123.0 146.7 180.0 335.0
summary(mtcars2[, "hp"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.1 96.5 123.0 146.1 180.0 313.0
The variable hp has been transformed via winsorizing. In the original mtcars data set the minimum value is 52 and the maximum is 335, but the winsorized mtcars2 replaces those extreme values, so the mean shifts slightly while the variable should contribute more stable predictions because its outliers have been replaced.
Refit the full linear model on both the original data and the winsorized mtcars2 data set.
mtcars_lm <- lm(mpg ~ . , data = mtcars)
summary(mtcars_lm)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
mtcars2_lm <- lm(mpg ~ . , data = mtcars2)
summary(mtcars2_lm)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4201 -1.5602 -0.1017 1.1611 4.6188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.72098 18.68259 0.681 0.5034
## cyl -0.09268 1.04285 -0.089 0.9300
## disp 0.01389 0.01778 0.782 0.4432
## hp -0.02414 0.02288 -1.055 0.3034
## drat 0.86325 1.62519 0.531 0.6009
## wt -3.68497 1.87561 -1.965 0.0628 .
## qsec 0.78554 0.73222 1.073 0.2955
## vs 0.32962 2.09148 0.158 0.8763
## am 2.46214 2.04818 1.202 0.2427
## gear 0.65888 1.48812 0.443 0.6625
## carb -0.20609 0.80918 -0.255 0.8014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.642 on 21 degrees of freedom
## Multiple R-squared: 0.8698, Adjusted R-squared: 0.8079
## F-statistic: 14.03 on 10 and 21 DF, p-value: 3.562e-07
Focusing on R^2, the original model explains 86.9% of the variance with an adjusted R^2 of 80.7%, while the mtcars2 model reaches 87.0% with an adjusted R^2 of 80.8%. Though the increase is minuscule, there was an increase, and the coefficients of the new model have also changed, some increasing and others decreasing, meaning their predictive contributions have shifted.
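Those R^2 values can also be extracted programmatically rather than read off the printed summaries; a small optional check:
summary(mtcars_lm)$r.squared # original full model
summary(mtcars2_lm)$r.squared # winsorized full model
summary(mtcars_lm)$adj.r.squared
summary(mtcars2_lm)$adj.r.squared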
One reason the model only predicts this well is that it includes variables that predict poorly. If we had a more precise question to answer, we could use the new mtcars2 set, whose transformed hp variable carries slightly more predictive power. As it stands, the increase in R^2 provides limited value: the model has become too complicated with so many variables, so the slight increase in R^2 did not really improve its predictive usefulness.
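Since the conclusion is that the model carries too many weak predictors, one natural follow-up (not performed here) would be backward elimination with step(), which drops variables by AIC; a minimal sketch, with mtcars2_step as a name introduced for illustration:
mtcars2_step <- step(mtcars2_lm, direction = "backward", trace = 0) # backward selection by AIC
summary(mtcars2_step) # for mtcars this tends to retain a small set of predictors such as wt, qsec, and am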