Course: Data Mining for Bus Analytics (MS-4373-001,
MS-5333-001)
Instructor: (Insert Name if Required)
Author: Nicole Faith
library(tidyverse)  # as_tibble(), select(), pivot_longer(), ggplot()
library(car)        # vif()
data("mtcars")
mtcars <- as_tibble(mtcars, rownames = "model")
head(mtcars)
## # A tibble: 6 × 12
## model mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
summary(mtcars)
## model mpg cyl disp
## Length:32 Min. :10.40 Min. :4.000 Min. : 71.1
## Class :character 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
## Mode :character Median :19.20 Median :6.000 Median :196.3
## Mean :20.09 Mean :6.188 Mean :230.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
## Max. :33.90 Max. :8.000 Max. :472.0
## hp drat wt qsec
## Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
## 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
## Median :123.0 Median :3.695 Median :3.325 Median :17.71
## Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
## vs am gear carb
## Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
The mtcars dataset contains information on fuel
consumption and automobile design characteristics for 32 car models.
GGally::ggpairs(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))
Interpretation: The pair plots show that
mpg (fuel efficiency) is negatively correlated with
wt (weight) and hp (horsepower). Heavier,
higher-power cars tend to have lower mpg.
mtcars %>% pivot_longer(cols = c(mpg, disp, hp, wt), names_to = "var", values_to = "val") %>%
ggplot(aes(x = val)) +
facet_wrap(~var, scales = "free") +
geom_histogram(bins = 12, fill = "steelblue", color = "black") +
labs(title = "Histograms for Selected Variables")
Observations: The histograms show that
wt and hp are right-skewed. mpg
is roughly normal, and higher wt and hp values
appear less common.
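The skewness read off the histograms can also be checked numerically. The sketch below is not part of the original analysis: `skewness_moment()` is a hypothetical helper that computes the simple moment-based skewness coefficient in base R. Positive values indicate a longer right tail.

```r
# Hypothetical helper (not in the original analysis): moment-based skewness.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Same four variables as the histograms above.
vars <- c("mpg", "disp", "hp", "wt")
skew <- sapply(datasets::mtcars[vars], skewness_moment)
round(skew, 2)
```

With only 32 observations these coefficients are noisy, so they are best read alongside the histograms rather than in place of them.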
A clear trend is that as car weight (wt) and horsepower
(hp) increase, miles per gallon (mpg)
decreases. This pattern aligns with real-world expectations: heavier,
more powerful vehicles consume more fuel.
cor_matrix <- cor(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))
round(cor_matrix['mpg', ], 3) %>% sort()
## wt cyl disp hp qsec drat mpg
## -0.868 -0.852 -0.848 -0.776 0.419 0.681 1.000
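Answer: The strongest negative correlations with
mpg are with wt (-0.868), cyl (-0.852), and disp
(-0.848), followed by hp (-0.776). The strongest positive
correlation is with drat (0.681). Higher weight, cylinder count,
displacement, and horsepower are all associated with lower mpg,
with wt the strongest single correlate.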
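Sorting by signed value puts the strongest negative correlation first, but it mixes direction with strength. A minimal sketch, assuming the same seven numeric variables, ranks the predictors by absolute correlation instead (the object names `cors` and `ranked` are illustrative):

```r
num_vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")

# Correlations of every variable with mpg, dropping mpg's correlation with itself.
cors <- cor(datasets::mtcars[num_vars])["mpg", ]
cors <- cors[names(cors) != "mpg"]

# Order by strength of association, ignoring sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)
```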
sapply(mtcars, function(x) sum(is.na(x)))
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0 0
Answer: There are no missing values in the dataset (all columns show 0 missing observations).
mtcars %>% summarise(across(everything(), list(min = ~min(.), max = ~max(.))))
## # A tibble: 1 × 24
## model_min model_max mpg_min mpg_max cyl_min cyl_max disp_min disp_max hp_min
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AMC Javelin Volvo 14… 10.4 33.9 4 8 71.1 472 52
## # ℹ 15 more variables: hp_max <dbl>, drat_min <dbl>, drat_max <dbl>,
## # wt_min <dbl>, wt_max <dbl>, qsec_min <dbl>, qsec_max <dbl>, vs_min <dbl>,
## # vs_max <dbl>, am_min <dbl>, am_max <dbl>, gear_min <dbl>, gear_max <dbl>,
## # carb_min <dbl>, carb_max <dbl>
Answer: All variable ranges are plausible (no negative weights, displacements, or horsepower), so there are no obviously invalid entries.
model1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
Interpretation:
- Intercept: The estimated mpg when both
wt and hp are 0 (not meaningful in
practice).
- wt coefficient: Negative, around -3.88,
meaning for each additional 1000 lbs, mpg decreases by about 3.9,
holding hp constant.
- hp coefficient: Negative, around -0.03,
meaning for each additional unit of horsepower, mpg decreases by 0.03,
holding wt constant.
Model summary: Both predictors are statistically significant (p < 0.05). The multiple R² is 0.827 (adjusted R² 0.815), meaning roughly 83% of mpg variation is explained by weight and horsepower.
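To make the coefficient interpretation concrete, the sketch below predicts mpg for one hypothetical car (3,000 lbs, 110 hp; these values are chosen for illustration only) both by hand from the coefficients and with `predict()`. The model is refit here so the example runs on its own.

```r
fit <- lm(mpg ~ wt + hp, data = datasets::mtcars)

# Hypothetical car: wt is in units of 1000 lbs, so 3.0 means 3,000 lbs.
new_car <- data.frame(wt = 3.0, hp = 110)

# Prediction by hand: intercept + slope_wt * wt + slope_hp * hp.
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["wt"] * new_car$wt + b["hp"] * new_car$hp)

# The same prediction using predict().
by_predict <- unname(predict(fit, newdata = new_car))

c(by_hand = by_hand, by_predict = by_predict)
```

Both values come out at about 22.1 mpg, which is 37.23 - 3.88 × 3.0 - 0.032 × 110 using the rounded coefficients from the summary output.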
Assumptions:
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of residuals
5. No severe multicollinearity
par(mfrow=c(2,2))
plot(model1)
par(mfrow=c(1,1))
Interpretation: Diagnostic plots show approximately linear relationships and fairly constant variance. Residuals appear roughly normal, with no extreme outliers or influential points causing major concern.
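The visual checks can be supplemented with formal tests. The sketch below is an addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand in base R, so no extra packages are assumed. Large p-values mean no evidence against the assumption; with 32 observations both tests have limited power.

```r
fit <- lm(mpg ~ wt + hp, data = datasets::mtcars)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk (null hypothesis: residuals are normal).
sw <- shapiro.test(res)

# Homoscedasticity: regress squared residuals on the predictors.
# Under constant variance, n * R^2 is approximately chi-squared
# with df equal to the number of predictors (2 here).
aux <- lm(I(res^2) ~ wt + hp, data = datasets::mtcars)
bp_stat <- nrow(datasets::mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```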
preds1 <- predict(model1)
mse1 <- mean((mtcars$mpg - preds1)^2)
mse1
## [1] 6.095242
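Answer: The MSE is about 6.10 (RMSE ≈ 2.47 mpg). A lower MSE indicates better fit; because it is computed on the same data used to fit the model, it is an in-sample figure and likely understates the error on new cars.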
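The MSE printed above divides the residual sum of squares by n = 32, while the residual standard error in the model summary divides by the residual degrees of freedom (29). A short sketch of how the two relate, refitting the model so the block is self-contained:

```r
fit <- lm(mpg ~ wt + hp, data = datasets::mtcars)
n <- nrow(datasets::mtcars)

rss <- sum(residuals(fit)^2)          # residual sum of squares
mse <- rss / n                        # in-sample mean squared error
rse <- sqrt(rss / df.residual(fit))   # residual standard error from summary()

c(mse = mse, rmse = sqrt(mse), rse = rse)
```

The RMSE is slightly smaller than the residual standard error (2.593) because it does not adjust for the three estimated coefficients.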
model2 <- lm(mpg ~ wt * hp, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -1.6491 -0.7362 1.4211 4.5513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
## wt -8.21662 1.26971 -6.471 5.20e-07 ***
## hp -0.12010 0.02470 -4.863 4.04e-05 ***
## wt:hp 0.02785 0.00742 3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
Interpretation:
- The interaction term (wt:hp) tests whether the effect of
weight on mpg changes depending on horsepower.
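- The p-value for the interaction term is about 0.0008, which is
significant at the 0.05 level.
- The R² increases from 0.827 to 0.885 (adjusted R² from 0.815 to
0.872), and the residual standard error falls from 2.593 to 2.153.
Thus, the interaction meaningfully improves model fit. The positive
wt:hp coefficient (0.028) means the negative effect of weight on mpg
weakens as horsepower increases.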
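One way to compare the two models directly is a nested-model F test with `anova()`. For a single added term its p-value matches the t test on wt:hp in the summary output. The sketch below also computes the implied slope of mpg on wt at two horsepower levels; the levels 100 and 200 are illustrative choices, not values from the assignment.

```r
fit_add <- lm(mpg ~ wt + hp, data = datasets::mtcars)
fit_int <- lm(mpg ~ wt * hp, data = datasets::mtcars)

# Nested-model F test: does adding wt:hp reduce the residual variance?
cmp <- anova(fit_add, fit_int)
p_interaction <- cmp[["Pr(>F)"]][2]

# With an interaction, the slope of mpg on wt depends on hp:
# slope(hp) = b_wt + b_wt:hp * hp
b <- coef(fit_int)
hp_levels <- c(100, 200)
wt_slope <- unname(b["wt"] + b["wt:hp"] * hp_levels)

p_interaction
setNames(round(wt_slope, 2), paste0("hp=", hp_levels))
```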
par(mfrow=c(1,2))
boxplot(mtcars$hp, main="Horsepower", ylab="hp")
boxplot(mtcars$mpg, main="Miles per Gallon", ylab="mpg")
par(mfrow=c(1,1))
Interpretation: hp shows possible
high-end outliers above ~300. These may influence the model.
q05 <- quantile(mtcars$hp, 0.05)
q95 <- quantile(mtcars$hp, 0.95)
mtcars <- mtcars %>% mutate(hp_wins = pmin(pmax(hp, q05), q95))
summary(mtcars$hp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.0 96.5 123.0 146.7 180.0 335.0
summary(mtcars$hp_wins)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63.65 96.50 123.00 144.23 180.00 253.55
model_wins <- lm(mpg ~ wt + hp_wins, data = mtcars)
summary(model_wins)
##
## Call:
## lm(formula = mpg ~ wt + hp_wins, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8825 -1.6545 -0.0968 0.8367 5.7259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.31722 1.56964 23.774 < 2e-16 ***
## wt -3.58279 0.66427 -5.394 8.5e-06 ***
## hp_wins -0.03952 0.01059 -3.732 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8215
## F-statistic: 72.34 on 2 and 29 DF, p-value: 5.348e-12
Interpretation: After winsorization, R² is essentially unchanged (0.833 versus 0.827 for the original model), and the coefficients shift only slightly (wt from -3.88 to -3.58, hp from -0.032 to -0.040). This suggests the extreme hp values had little impact on the model.
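The winsorization step can be wrapped in a small reusable function. `winsorize()` below is a hypothetical helper written for illustration in base R; it applies the same `pmin(pmax(...))` clamp at the 5th and 95th percentiles and also counts how many observations were changed.

```r
# Hypothetical helper: clamp values to the [lower, upper] quantile range.
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

hp <- datasets::mtcars$hp
hp_wins <- winsorize(hp)

n_changed <- sum(hp_wins != hp)  # observations pulled in to a cutoff
range(hp_wins)
n_changed
```

Only four of the 32 cars are affected (two at each end), which is consistent with the small change in the fitted model.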
vif(model1)
## wt hp
## 1.766625 1.766625
Interpretation: VIF values for wt and
hp are both below 5, indicating no severe
multicollinearity.
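The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors this R² is the squared correlation between wt and hp, which is why both VIF values are identical. A minimal base-R sketch that reproduces the figure without the car package:

```r
# VIF by hand for a two-predictor model.
r <- cor(datasets::mtcars$wt, datasets::mtcars$hp)
vif_by_hand <- 1 / (1 - r^2)

# Equivalent route: R^2 from the auxiliary regression of wt on hp.
aux_r2 <- summary(lm(wt ~ hp, data = datasets::mtcars))$r.squared
vif_from_aux <- 1 / (1 - aux_r2)

c(vif_by_hand = vif_by_hand, vif_from_aux = vif_from_aux)
```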
The model mpg ~ wt + hp explains most of the
variability in fuel efficiency (multiple R² of about 0.83).
Submit both files:
✅ Homework3_NicoleFaith.Rmd (this file)
✅ Homework3_NicoleFaith.html (generated by knitting in RStudio)
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DescTools_0.99.60 car_3.1-3 carData_3.0-5 broom_1.0.10
## [5] GGally_2.4.0 lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2
## [9] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
## [13] tibble_3.3.0 ggplot2_4.0.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gld_2.6.8 gtable_0.3.6 xfun_0.53 bslib_0.9.0
## [5] lattice_0.22-7 tzdb_0.5.0 vctrs_0.6.5 tools_4.5.1
## [9] generics_0.1.4 proxy_0.4-27 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.0 readxl_1.4.5
## [17] rootSolve_1.8.2.4 lifecycle_1.0.4 compiler_4.5.1 farver_2.1.2
## [21] Exact_3.3 htmltools_0.5.8.1 class_7.3-23 sass_0.4.10
## [25] yaml_2.3.10 Formula_1.2-5 pillar_1.11.1 jquerylib_0.1.4
## [29] MASS_7.3-65 cachem_1.1.0 boot_1.3-31 abind_1.4-8
## [33] ggstats_0.11.0 tidyselect_1.2.1 digest_0.6.37 mvtnorm_1.3-3
## [37] stringi_1.8.7 labeling_0.4.3 fastmap_1.2.0 grid_4.5.1
## [41] lmom_3.2 expm_1.0-0 cli_3.6.5 magrittr_2.0.3
## [45] utf8_1.2.6 e1071_1.7-16 withr_3.0.2 scales_1.4.0
## [49] backports_1.5.0 timechange_0.3.0 httr_1.4.7 rmarkdown_2.29
## [53] cellranger_1.1.0 hms_1.1.3 evaluate_1.0.5 haven_2.5.5
## [57] knitr_1.50 rlang_1.1.6 Rcpp_1.1.0 glue_1.8.0
## [61] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1 fs_1.6.6