1 Course Information

Course: Data Mining for Bus Analytics (MS-4373-001, MS-5333-001)
Instructor: (Insert Name if Required)
Author: Nicole Faith


2 1. Data Exploration

library(tidyverse)  # provides as_tibble(), %>%, select(), pivot_longer(), ggplot2
data("mtcars")
mtcars <- as_tibble(mtcars, rownames = "model")
head(mtcars)
## # A tibble: 6 × 12
##   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
summary(mtcars)
##     model                mpg             cyl             disp      
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3  
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

The mtcars dataset contains information on fuel consumption and automobile design characteristics for 32 car models.

2.1 1.1 Summary statistics and visualizations

GGally::ggpairs(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))

Interpretation: The pair plots show that mpg (fuel efficiency) is negatively correlated with wt (weight) and hp (horsepower). Heavier, higher-power cars tend to have lower mpg.

mtcars %>% pivot_longer(cols = c(mpg, disp, hp, wt), names_to = "var", values_to = "val") %>%
  ggplot(aes(x = val)) +
  facet_wrap(~var, scales = "free") +
  geom_histogram(bins = 12, fill = "steelblue", color = "black") +
  labs(title = "Histograms for Selected Variables")

Observations: The histograms show that wt and hp are right-skewed. mpg is roughly normal, and higher wt and hp values appear less common.
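A numeric check can back up a visual reading of skew. The sketch below computes a simple moment-based skewness for the same four variables; `skewness_moment` and `cars` are illustrative names rather than part of the original analysis, and positive values indicate a longer right tail.

```r
# Moment-based sample skewness (illustrative helper, not from a package)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

cars <- datasets::mtcars  # fresh copy of the built-in data
skew_vals <- sapply(cars[, c("mpg", "disp", "hp", "wt")], skewness_moment)
round(skew_vals, 2)
```

With only 32 observations, these estimates are noisy, so they are best read alongside the histograms rather than in place of them.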

2.2 1.2 Interesting Trend

A clear trend is that as car weight (wt) and horsepower (hp) increase, miles per gallon (mpg) decreases. This pattern aligns with real-world expectations: heavier, more powerful vehicles consume more fuel.

2.3 1.3 Strongest Correlations with mpg

cor_matrix <- cor(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))
round(cor_matrix['mpg', ], 3) %>% sort()
##     wt    cyl   disp     hp   qsec   drat    mpg 
## -0.868 -0.852 -0.848 -0.776  0.419  0.681  1.000

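Answer: The strongest negative correlations with mpg are with wt (-0.868), cyl (-0.852), and disp (-0.848); hp follows at -0.776. The strongest positive correlation is with drat (0.681). Among these variables, wt is the single strongest linear predictor of lower mpg, and hp is also strongly negatively related.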
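Sorting by signed value puts the strongest negative and positive correlations at opposite ends of the list. To compare strength regardless of direction, one option is to rank by absolute value. This is a minimal sketch in base R; `cars`, `vars`, and `ranked` are illustrative names.

```r
cars <- datasets::mtcars
vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")

# Correlations of every selected variable with mpg, dropping mpg itself
cor_mpg <- cor(cars[, vars])["mpg", ]
cor_mpg <- cor_mpg[names(cor_mpg) != "mpg"]

# Rank by strength (absolute value), keeping the sign for interpretation
ranked <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(ranked, 3)
```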


3 2. Data Preprocessing

3.1 2.1 Missing Data

sapply(mtcars, function(x) sum(is.na(x)))
## model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
##     0     0     0     0     0     0     0     0     0     0     0     0

Answer: There are no missing values in the dataset (all columns show 0 missing observations).

3.2 2.2 Inconsistent/Invalid Data

mtcars %>% summarise(across(everything(), list(min = ~min(.), max = ~max(.))))
## # A tibble: 1 × 24
##   model_min   model_max mpg_min mpg_max cyl_min cyl_max disp_min disp_max hp_min
##   <chr>       <chr>       <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>  <dbl>
## 1 AMC Javelin Volvo 14…    10.4    33.9       4       8     71.1      472     52
## # ℹ 15 more variables: hp_max <dbl>, drat_min <dbl>, drat_max <dbl>,
## #   wt_min <dbl>, wt_max <dbl>, qsec_min <dbl>, qsec_max <dbl>, vs_min <dbl>,
## #   vs_max <dbl>, am_min <dbl>, am_max <dbl>, gear_min <dbl>, gear_max <dbl>,
## #   carb_min <dbl>, carb_max <dbl>

Answer: All variable ranges are logical (no negative weights, displacements, or horsepower). There are no invalid entries.
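Minimum and maximum values catch out-of-range numbers, but explicit rules can make the validity check easier to audit. The sketch below encodes a few plausible rules; the rules themselves (for example, that cyl should be 4, 6, or 8) are assumptions about this dataset, and `checks` is an illustrative name.

```r
cars <- datasets::mtcars

# Each entry is TRUE when the corresponding rule holds for every row
checks <- c(
  positive_measures = all(cars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")] > 0),
  binary_flags      = all(cars$vs %in% c(0, 1)) && all(cars$am %in% c(0, 1)),
  cyl_levels        = all(cars$cyl %in% c(4, 6, 8)),
  whole_counts      = all(cars$gear == round(cars$gear)) &&
                      all(cars$carb == round(cars$carb))
)
checks
```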


4 3. Linear Regression using lm

4.1 3.1 Model: mpg ~ wt + hp

model1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpretation:
- Intercept: The estimated mpg when both wt and hp are 0 (not meaningful in practice).
- wt coefficient: Negative, around -3.88. Because wt is measured in units of 1000 lbs, each additional 1000 lbs is associated with a decrease of about 3.9 mpg, holding hp constant.
- hp coefficient: Negative, around -0.03, meaning each additional unit of horsepower is associated with a decrease of about 0.03 mpg, holding wt constant.

Model summary: Both predictors are statistically significant (p < 0.01). The adjusted R² is 0.815 (multiple R² = 0.827), so weight and horsepower together explain roughly 83% of the variation in mpg.
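To make the coefficients concrete, the sketch below predicts mpg for a hypothetical car and reproduces the same number by hand from the fitted coefficients. The car (3000 lbs, 150 hp) is an invented example, and `fit`, `new_car`, and `by_hand` are illustrative names.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

# Hypothetical car: wt is in units of 1000 lbs, so 3.0 means 3000 lbs
new_car <- data.frame(wt = 3.0, hp = 150)
pred <- predict(fit, newdata = new_car)

# Same prediction from the regression equation: b0 + b_wt * wt + b_hp * hp
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["wt"] * 3.0 + b["hp"] * 150

c(predict = unname(pred), by_hand = unname(by_hand))
```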

4.2 3.2 Regression Assumptions and Diagnostics

Assumptions:
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of residuals
5. No severe multicollinearity

par(mfrow=c(2,2))
plot(model1)

par(mfrow=c(1,1))

Interpretation: Diagnostic plots show approximately linear relationships and fairly constant variance. Residuals appear roughly normal, with no extreme outliers or influential points causing major concern.
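Visual diagnostics can be supplemented with a few numeric checks. The sketch below runs a Shapiro-Wilk test on the residuals and lists the observations with the largest Cook's distance. The 4/n cutoff is a common rule of thumb rather than a formal test, and `fit`, `sw`, `cd`, and `cutoff` are illustrative names.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

# Normality of residuals: a small p-value suggests departure from normality
sw <- shapiro.test(residuals(fit))
sw$p.value

# Influence: Cook's distance per observation, largest first
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)

# Rule-of-thumb flag for potentially influential cars (assumed cutoff: 4 / n)
cutoff <- 4 / nrow(cars)
names(cd)[cd > cutoff]
```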

4.3 3.3 Model Evaluation: Mean Square Error

preds1 <- predict(model1)
mse1 <- mean((mtcars$mpg - preds1)^2)
mse1
## [1] 6.095242

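Answer: The in-sample MSE is about 6.10 (RMSE ≈ 2.47 mpg). A lower MSE indicates a better fit; relative to an mpg range of 10.4 to 33.9, this is reasonable. Because the MSE is computed on the same data used to fit the model, it likely understates the error on new cars.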
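The in-sample MSE and the residual standard error reported by `summary()` are both built from the residual sum of squares (RSS): MSE divides by n, while the residual standard error divides by the residual degrees of freedom before taking the square root. The sketch below shows the relationship, using illustrative names.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

rss <- sum(residuals(fit)^2)  # residual sum of squares
n <- nrow(cars)

mse_in_sample <- rss / n                    # mean squared error on the training data
rse <- sqrt(rss / df.residual(fit))         # residual standard error (divides by n - 3)

c(mse = mse_in_sample, rmse = sqrt(mse_in_sample), rse = rse)
```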

4.4 3.4 Adding an Interaction Term (wt * hp)

model2 <- lm(mpg ~ wt * hp, data = mtcars)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0632 -1.6491 -0.7362  1.4211  4.5513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared:  0.8848, Adjusted R-squared:  0.8724 
## F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

Interpretation:
- The interaction term (wt:hp) tests whether the effect of weight on mpg changes depending on horsepower.
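- The p-value for the interaction term is about 0.0008, which is significant at the 0.05 level. The positive coefficient (0.028) means the negative effect of weight on mpg weakens as horsepower increases.
- R² increases from 0.827 to 0.885 (adjusted R² from 0.815 to 0.872), and the residual standard error falls from 2.593 to 2.153. Thus, the interaction meaningfully improves model fit.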
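With an interaction in the model, the slope of wt is no longer a single number: it equals the wt coefficient plus the wt:hp coefficient times hp. The sketch below evaluates that slope at two arbitrary horsepower levels and compares the nested models with an F-test. The levels 100 and 200 are chosen only for illustration, and the object names are hypothetical.

```r
cars <- datasets::mtcars
fit_add <- lm(mpg ~ wt + hp, data = cars)   # additive model
fit_int <- lm(mpg ~ wt * hp, data = cars)   # adds the wt:hp interaction

# Conditional slope of wt at chosen hp levels: b_wt + b_wt:hp * hp
b <- coef(fit_int)
hp_levels <- c(100, 200)
wt_slope <- b["wt"] + b["wt:hp"] * hp_levels
names(wt_slope) <- paste0("hp=", hp_levels)
wt_slope

# Nested-model F-test: does the interaction reduce residual error?
cmp <- anova(fit_add, fit_int)
cmp
```

Because only one term is added, the F-test p-value matches the t-test p-value for wt:hp in the model summary.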


5 4. Outliers and Winsorization

5.1 4.1 Boxplots for Outliers

par(mfrow=c(1,2))
boxplot(mtcars$hp, main="Horsepower", ylab="hp")
boxplot(mtcars$mpg, main="Miles per Gallon", ylab="mpg")

par(mfrow=c(1,1))

Interpretation: hp shows possible high-end outliers above ~300. These may influence the model.

5.2 4.2 Apply 5%/95% Winsorization to hp

q05 <- quantile(mtcars$hp, 0.05)
q95 <- quantile(mtcars$hp, 0.95)
mtcars <- mtcars %>% mutate(hp_wins = pmin(pmax(hp, q05), q95))

summary(mtcars$hp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0
summary(mtcars$hp_wins)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   63.65   96.50  123.00  144.23  180.00  253.55
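The same clamping logic can be wrapped in a small reusable helper, which also makes it easy to count how many values were changed. This is a sketch in base R; `winsorize` is an illustrative function name, not a package function.

```r
# Clamp values below the lower quantile and above the upper quantile
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

hp <- datasets::mtcars$hp
hp_w <- winsorize(hp)

range(hp_w)       # new minimum and maximum after clamping
sum(hp_w != hp)   # number of observations that were modified
```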

5.3 4.3 New Model Using Winsorized hp

model_wins <- lm(mpg ~ wt + hp_wins, data = mtcars)
summary(model_wins)
## 
## Call:
## lm(formula = mpg ~ wt + hp_wins, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -1.6545 -0.0968  0.8367  5.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_wins     -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8215 
## F-statistic: 72.34 on 2 and 29 DF,  p-value: 5.348e-12

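Interpretation: After winsorization, R² rises slightly from 0.827 to 0.833 (adjusted R² from 0.815 to 0.822), and the residual standard error falls from 2.593 to 2.546. The wt coefficient shifts from -3.88 to -3.58 and the hp coefficient from -0.032 to -0.040. These changes are small, which suggests the extreme hp values had only a minor influence on the model.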


6 5. Multicollinearity Check

car::vif(model1)  # vif() comes from the car package
##       wt       hp 
## 1.766625 1.766625

Interpretation: VIF values for wt and hp are both below 5, indicating no severe multicollinearity.
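For a predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors, both share the same VIF, which is why wt and hp show identical values. The sketch below reproduces the number in base R; `r2_wt_on_hp` and `vif_manual` are illustrative names.

```r
cars <- datasets::mtcars

# R-squared from regressing one predictor on the other
r2_wt_on_hp <- summary(lm(wt ~ hp, data = cars))$r.squared

# Variance inflation factor: 1 / (1 - R^2)
vif_manual <- 1 / (1 - r2_wt_on_hp)
vif_manual
```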


7 6. Final Discussion


8 Submission Instructions

Submit both files:
Homework3_NicoleFaith.Rmd (this file)
Homework3_NicoleFaith.html (generated by knitting in RStudio)


9 Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DescTools_0.99.60 car_3.1-3         carData_3.0-5     broom_1.0.10     
##  [5] GGally_2.4.0      lubridate_1.9.4   forcats_1.0.1     stringr_1.5.2    
##  [9] dplyr_1.1.4       purrr_1.1.0       readr_2.1.5       tidyr_1.3.1      
## [13] tibble_3.3.0      ggplot2_4.0.0     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] gld_2.6.8          gtable_0.3.6       xfun_0.53          bslib_0.9.0       
##  [5] lattice_0.22-7     tzdb_0.5.0         vctrs_0.6.5        tools_4.5.1       
##  [9] generics_0.1.4     proxy_0.4-27       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 S7_0.2.0           readxl_1.4.5      
## [17] rootSolve_1.8.2.4  lifecycle_1.0.4    compiler_4.5.1     farver_2.1.2      
## [21] Exact_3.3          htmltools_0.5.8.1  class_7.3-23       sass_0.4.10       
## [25] yaml_2.3.10        Formula_1.2-5      pillar_1.11.1      jquerylib_0.1.4   
## [29] MASS_7.3-65        cachem_1.1.0       boot_1.3-31        abind_1.4-8       
## [33] ggstats_0.11.0     tidyselect_1.2.1   digest_0.6.37      mvtnorm_1.3-3     
## [37] stringi_1.8.7      labeling_0.4.3     fastmap_1.2.0      grid_4.5.1        
## [41] lmom_3.2           expm_1.0-0         cli_3.6.5          magrittr_2.0.3    
## [45] utf8_1.2.6         e1071_1.7-16       withr_3.0.2        scales_1.4.0      
## [49] backports_1.5.0    timechange_0.3.0   httr_1.4.7         rmarkdown_2.29    
## [53] cellranger_1.1.0   hms_1.1.3          evaluate_1.0.5     haven_2.5.5       
## [57] knitr_1.50         rlang_1.1.6        Rcpp_1.1.0         glue_1.8.0        
## [61] rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1           fs_1.6.6