1 Course Information

Course: Data Mining for Bus Analytics (MS-4373-001, MS-5333-001)
Instructor: (Insert Name if Required)
Author: Nicole Faith

2 1. Data Exploration

data("mtcars")
mtcars <- as_tibble(mtcars, rownames = "model")
head(mtcars)

## # A tibble: 6 × 12
##   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

summary(mtcars)

##     model                mpg             cyl             disp      
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3  
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

The mtcars dataset contains information on fuel consumption and automobile design characteristics for 32 car models.

2.1 1.1 Summary statistics and visualizations

GGally::ggpairs(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))

Interpretation: The pair plots show that mpg (fuel efficiency) is negatively correlated with wt (weight) and hp (horsepower). Heavier, higher-power cars tend to have lower mpg.

mtcars %>% pivot_longer(cols = c(mpg, disp, hp, wt), names_to = "var", values_to = "val") %>%
  ggplot(aes(x = val)) +
  facet_wrap(~var, scales = "free") +
  geom_histogram(bins = 12, fill = "steelblue", color = "black") +
  labs(title = "Histograms for Selected Variables")

Observations: The histograms show that wt and hp are right-skewed. mpg is roughly normal, and higher wt and hp values appear less common.
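The visual impression of skew can be backed by a number. The sketch below computes moment-based sample skewness for the four plotted variables; the `skewness` helper is a hypothetical function written for this illustration (it is not from a package used in the report), and it reads the built-in `datasets::mtcars` so it runs on its own.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
# `skewness` is an illustrative helper, not part of the original analysis.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

cars <- datasets::mtcars
skews <- sapply(cars[, c("mpg", "disp", "hp", "wt")], skewness)
round(skews, 2)
```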

2.2 1.2 Interesting Trend

A clear trend is that as car weight (wt) and horsepower (hp) increase, miles per gallon (mpg) decreases. This pattern aligns with real-world expectations: heavier, more powerful vehicles consume more fuel.

2.3 1.3 Strongest Correlations with mpg

cor_matrix <- cor(mtcars %>% select(mpg, cyl, disp, hp, drat, wt, qsec))
round(cor_matrix['mpg', ], 3) %>% sort()

##     wt    cyl   disp     hp   qsec   drat    mpg 
## -0.868 -0.852 -0.848 -0.776  0.419  0.681  1.000

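Answer: The strongest negative correlations with mpg are with wt (-0.868), cyl (-0.852), and disp (-0.848), followed by hp (-0.776). The strongest positive correlation is with drat (0.681). Therefore, wt is the single strongest linear predictor of lower mpg; hp is also strongly related, though less so than cyl and disp.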
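Sorting by signed value makes it easy to overlook a strong correlation in the middle of the list. A minimal sketch that ranks the same correlations by absolute value instead, assuming the built-in `datasets::mtcars` and base R only:

```r
cars <- datasets::mtcars
vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")

# Correlation of every variable with mpg, dropping mpg's correlation with itself
cors <- cor(cars[, vars])["mpg", ]
cors <- cors[names(cors) != "mpg"]

# Rank by strength of the relationship, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)
```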

3 2. Data Preprocessing

3.1 2.1 Missing Data

sapply(mtcars, function(x) sum(is.na(x)))

## model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
##     0     0     0     0     0     0     0     0     0     0     0     0

Answer: There are no missing values in the dataset (all columns show 0 missing observations).

3.2 2.2 Inconsistent/Invalid Data

mtcars %>% summarise(across(everything(), list(min = ~min(.), max = ~max(.))))

## # A tibble: 1 × 24
##   model_min   model_max mpg_min mpg_max cyl_min cyl_max disp_min disp_max hp_min
##   <chr>       <chr>       <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>  <dbl>
## 1 AMC Javelin Volvo 14…    10.4    33.9       4       8     71.1      472     52
## # ℹ 15 more variables: hp_max <dbl>, drat_min <dbl>, drat_max <dbl>,
## #   wt_min <dbl>, wt_max <dbl>, qsec_min <dbl>, qsec_max <dbl>, vs_min <dbl>,
## #   vs_max <dbl>, am_min <dbl>, am_max <dbl>, gear_min <dbl>, gear_max <dbl>,
## #   carb_min <dbl>, carb_max <dbl>

Answer: All variable ranges are logical (no negative weights, displacements, or horsepower). There are no invalid entries.
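Range checks like the one above can also be written as explicit rules. The sketch below is one possible set of checks; the allowed values (for example, cylinders in 4, 6, 8 and gears in 3 to 5) are assumptions read off the summary output, not rules stated in the assignment.

```r
cars <- datasets::mtcars

# Each entry is TRUE when the rule holds for every row.
# The allowed value sets are assumptions based on the summary output.
checks <- c(
  positive_measures = all(cars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")] > 0),
  cyl_valid         = all(cars$cyl %in% c(4, 6, 8)),
  binary_flags      = all(cars$vs %in% 0:1, cars$am %in% 0:1),
  gear_valid        = all(cars$gear %in% 3:5),
  carb_whole_number = all(cars$carb >= 1 & cars$carb == round(cars$carb))
)
checks
```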

4 3. Linear Regression using `lm`

4.1 3.1 Model: mpg ~ wt + hp

model1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpretation:
- Intercept: The estimated mpg when both wt and hp are 0 (not meaningful in practice).
- wt coefficient: Negative, around -3.88. Because wt is measured in units of 1000 lbs, each additional 1000 lbs is associated with a decrease of about 3.9 mpg, holding hp constant.
- hp coefficient: Negative, around -0.03, meaning each additional unit of horsepower is associated with a decrease of about 0.03 mpg, holding wt constant.

Model summary: Both predictors are statistically significant (p < 0.05). The multiple R² is about 0.827 (adjusted R² 0.815), meaning roughly 83% of the variation in mpg is explained by weight and horsepower.
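To make the coefficient interpretation concrete, the sketch below predicts mpg for a hypothetical car (3000 lbs, 150 hp; these values are made up for illustration) both by plugging into the fitted equation and with `predict()`. It refits the model on the built-in `datasets::mtcars` so that it is self-contained.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

# Hypothetical car: wt is in 1000 lbs, so 3.0 means 3000 lbs
new_car <- data.frame(wt = 3.0, hp = 150)

# Prediction by hand: intercept + b_wt * wt + b_hp * hp
b <- coef(fit)
by_hand <- b[["(Intercept)"]] + b[["wt"]] * new_car$wt + b[["hp"]] * new_car$hp

# Same prediction through the model object
by_predict <- unname(predict(fit, newdata = new_car))

c(by_hand = by_hand, by_predict = by_predict)
```

Both values should agree, at roughly 20.8 mpg given the coefficients reported above.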

4.2 3.2 Regression Assumptions and Diagnostics

Assumptions:
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of residuals
5. No severe multicollinearity

par(mfrow=c(2,2))
plot(model1)

par(mfrow=c(1,1))

Interpretation: Diagnostic plots show approximately linear relationships and fairly constant variance. Residuals appear roughly normal, with no extreme outliers or influential points causing major concern.
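Visual diagnostics can be supplemented with numeric checks. The sketch below runs a Shapiro-Wilk test on the residuals and flags observations by Cook's distance; the `4 / n` cutoff is a common rule of thumb, used here as an assumption rather than a threshold taken from the assignment.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

# Normality of residuals: a small p-value suggests departure from normality
sw <- shapiro.test(resid(fit))
sw$p.value

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / nrow(cars)]
flagged
```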

4.3 3.3 Model Evaluation: Mean Square Error

preds1 <- predict(model1)
mse1 <- mean((mtcars$mpg - preds1)^2)
mse1

## [1] 6.095242

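Answer: The MSE is about 6.10, which corresponds to an RMSE of about 2.47 mpg. A lower MSE indicates better fit; relative to an mpg range of 10.4 to 33.9, this is a reasonable error. Because it is computed on the same data used to fit the model, it is an in-sample measure and likely understates the error on new cars.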
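The MSE here is not the square of the residual standard error printed by `summary()`: the MSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom (n minus the number of coefficients). A short sketch of the relationship, refitting on `datasets::mtcars`:

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

rss <- sum(resid(fit)^2)          # residual sum of squares
n   <- nrow(cars)

mse <- rss / n                    # in-sample mean squared error
rse <- sqrt(rss / df.residual(fit))  # residual standard error, df = n - 3

c(mse = mse, rmse = sqrt(mse), rse = rse)
```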

4.4 3.4 Adding an Interaction Term (wt * hp)

model2 <- lm(mpg ~ wt * hp, data = mtcars)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0632 -1.6491 -0.7362  1.4211  4.5513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared:  0.8848, Adjusted R-squared:  0.8724 
## F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

Interpretation:
- The interaction term (wt:hp) tests whether the effect of weight on mpg changes depending on horsepower.
- The p-value for the interaction term is about 0.0008, which is significant at the 0.05 level. The positive coefficient (about 0.028) means the negative effect of weight on mpg weakens as horsepower increases.
- The R² increases from 0.827 to 0.885 (adjusted R² from 0.815 to 0.872), and the residual standard error falls from 2.593 to 2.153. Thus, the interaction meaningfully improves the in-sample fit.
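One way to examine the interaction is a nested-model F-test together with the implied slope of wt at different horsepower levels. The sketch below is illustrative: the horsepower values 100 and 200 are arbitrary choices, and `wt_slope` is a helper defined only for this example.

```r
cars <- datasets::mtcars
additive <- lm(mpg ~ wt + hp, data = cars)
interact <- lm(mpg ~ wt * hp, data = cars)

# F-test for the single added term; equivalent to the t-test on wt:hp
cmp <- anova(additive, interact)
p_interaction <- cmp[["Pr(>F)"]][2]
p_interaction

# With an interaction, the slope of wt depends on hp:
#   d(mpg)/d(wt) = b_wt + b_wt:hp * hp
b <- coef(interact)
wt_slope <- function(hp) b[["wt"]] + b[["wt:hp"]] * hp
sapply(c(hp_100 = 100, hp_200 = 200), wt_slope)
```

Under the coefficients reported above, the slope of wt is about -5.4 mpg per 1000 lbs at 100 hp and about -2.6 at 200 hp.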

5 4. Outliers and Winsorization

5.1 4.1 Boxplots for Outliers

par(mfrow=c(1,2))
boxplot(mtcars$hp, main="Horsepower", ylab="hp")
boxplot(mtcars$mpg, main="Miles per Gallon", ylab="mpg")

par(mfrow=c(1,1))

Interpretation: hp shows possible high-end outliers above ~300. These may influence the model.

5.2 4.2 Apply 5%/95% Winsorization to hp

q05 <- quantile(mtcars$hp, 0.05)
q95 <- quantile(mtcars$hp, 0.95)
mtcars <- mtcars %>% mutate(hp_wins = pmin(pmax(hp, q05), q95))

summary(mtcars$hp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0

summary(mtcars$hp_wins)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   63.65   96.50  123.00  144.23  180.00  253.55
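The same clipping can be wrapped in a small reusable function. `winsorize` below is a hypothetical helper written for this sketch (it is not the `DescTools` function), using base R's default quantile type:

```r
# Clip values below the lower quantile and above the upper quantile.
# Illustrative helper; argument names are assumptions for this sketch.
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

hp   <- datasets::mtcars$hp
hp_w <- winsorize(hp)

range(hp_w)        # new minimum and maximum after clipping
sum(hp_w != hp)    # number of observations that were changed
```

With 32 cars and 5%/95% limits, only the two lowest and two highest hp values are replaced.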

5.3 4.3 New Model Using Winsorized hp

model_wins <- lm(mpg ~ wt + hp_wins, data = mtcars)
summary(model_wins)

## 
## Call:
## lm(formula = mpg ~ wt + hp_wins, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -1.6545 -0.0968  0.8367  5.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_wins     -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8215 
## F-statistic: 72.34 on 2 and 29 DF,  p-value: 5.348e-12

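Interpretation: After winsorization, R² is about 0.833 (adjusted 0.822), compared with 0.827 (adjusted 0.815) for the original model. The wt coefficient moves from -3.88 to -3.58 and the hp coefficient from -0.032 to -0.040, while their standard errors are slightly larger. This suggests the extreme hp values had only a small effect on the fit.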
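A side-by-side table makes the before/after comparison easier to read. The sketch below refits both models on `datasets::mtcars` and collects the fit statistics; the layout of the `comparison` matrix is just one possible choice.

```r
cars <- datasets::mtcars

# Recreate the 5%/95% winsorized horsepower
q <- quantile(cars$hp, c(0.05, 0.95), names = FALSE)
cars$hp_wins <- pmin(pmax(cars$hp, q[1]), q[2])

fits <- list(
  original   = lm(mpg ~ wt + hp,      data = cars),
  winsorized = lm(mpg ~ wt + hp_wins, data = cars)
)

# One column per model: R-squared, adjusted R-squared, residual standard error
comparison <- sapply(fits, function(f) {
  s <- summary(f)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, rse = s$sigma)
})
round(comparison, 4)
```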

6 5. Multicollinearity Check

vif(model1)

##       wt       hp 
## 1.766625 1.766625

Interpretation: VIF values for wt and hp are both below 5, indicating no severe multicollinearity.
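The two VIF values are identical because, with exactly two predictors, each VIF reduces to 1 / (1 - r²), where r is the correlation between wt and hp. A quick check in base R, without the `car` package:

```r
cars <- datasets::mtcars

# Correlation between the two predictors
r <- cor(cars$wt, cars$hp)

# VIF for either predictor in a two-predictor model
vif_by_hand <- 1 / (1 - r^2)
c(r = r, vif = vif_by_hand)
```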

7 6. Final Discussion

The model mpg ~ wt + hp explains most of the variability in fuel efficiency (R² of about 0.83).
Both predictors are negatively correlated with mpg and statistically significant.
Adding an interaction term raises R² to about 0.88, and the interaction is statistically significant.
Winsorization changes the fit only slightly and doesn’t drastically alter results.
Improved R² does not necessarily mean better real-world predictability: a higher in-sample R² can reflect overfitting, so cross-validation would better assess predictive power.
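As a rough illustration of the cross-validation idea, the sketch below computes leave-one-out cross-validated MSE for the additive and interaction models. It uses the standard shortcut for linear models, in which each leave-one-out residual equals the ordinary residual divided by (1 - leverage), so no refitting loop is needed. `loocv_mse` is a helper defined for this example only.

```r
cars <- datasets::mtcars

# Leave-one-out CV error for a linear model via the hat-value identity:
#   e_loo_i = e_i / (1 - h_ii)
loocv_mse <- function(fit) {
  mean((resid(fit) / (1 - hatvalues(fit)))^2)
}

fits <- list(
  additive    = lm(mpg ~ wt + hp, data = cars),
  interaction = lm(mpg ~ wt * hp, data = cars)
)

in_sample <- sapply(fits, function(f) mean(resid(f)^2))
loocv     <- sapply(fits, loocv_mse)

round(rbind(in_sample, loocv), 3)
```

The cross-validated error is always at least as large as the in-sample error; comparing the `loocv` row across models gives a fairer picture of predictive performance than R² alone.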

8 Submission Instructions

Submit both files:
✅ Homework3_NicoleFaith.Rmd (this file)
✅ Homework3_NicoleFaith.html (generated by knitting in RStudio)

9 Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DescTools_0.99.60 car_3.1-3         carData_3.0-5     broom_1.0.10     
##  [5] GGally_2.4.0      lubridate_1.9.4   forcats_1.0.1     stringr_1.5.2    
##  [9] dplyr_1.1.4       purrr_1.1.0       readr_2.1.5       tidyr_1.3.1      
## [13] tibble_3.3.0      ggplot2_4.0.0     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] gld_2.6.8          gtable_0.3.6       xfun_0.53          bslib_0.9.0       
##  [5] lattice_0.22-7     tzdb_0.5.0         vctrs_0.6.5        tools_4.5.1       
##  [9] generics_0.1.4     proxy_0.4-27       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 S7_0.2.0           readxl_1.4.5      
## [17] rootSolve_1.8.2.4  lifecycle_1.0.4    compiler_4.5.1     farver_2.1.2      
## [21] Exact_3.3          htmltools_0.5.8.1  class_7.3-23       sass_0.4.10       
## [25] yaml_2.3.10        Formula_1.2-5      pillar_1.11.1      jquerylib_0.1.4   
## [29] MASS_7.3-65        cachem_1.1.0       boot_1.3-31        abind_1.4-8       
## [33] ggstats_0.11.0     tidyselect_1.2.1   digest_0.6.37      mvtnorm_1.3-3     
## [37] stringi_1.8.7      labeling_0.4.3     fastmap_1.2.0      grid_4.5.1        
## [41] lmom_3.2           expm_1.0-0         cli_3.6.5          magrittr_2.0.3    
## [45] utf8_1.2.6         e1071_1.7-16       withr_3.0.2        scales_1.4.0      
## [49] backports_1.5.0    timechange_0.3.0   httr_1.4.7         rmarkdown_2.29    
## [53] cellranger_1.1.0   hms_1.1.3          evaluate_1.0.5     haven_2.5.5       
## [57] knitr_1.50         rlang_1.1.6        Rcpp_1.1.0         glue_1.8.0        
## [61] rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1           fs_1.6.6

Homework 3 — Exploring mtcars data with Linear Regression

Nicole Faith

October 06, 2025
