Portifolio

Author

Jeremiah Mogaka

Published

December 8, 2025

INTRODUCTION

The better part of the semester we worked with the Auto data to practice and learn various regression techniques. Having have interacted with the data, I decided to use it to demonstrate my understanding of the course objectives as set out in the Statistical Modeling I course, my commitment letter and the portfolio requirements. I set out to do this in three parts.Part one will solely focus of data exploration of the Auto dataset, part two that will focus on explaining how I achieved the course objectives and part three which is the code and output to show my mastery of the course objectives

EXPLORATORY DATA ANALYSIS

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.9     ✔ rsample      1.3.1
✔ dials        1.4.1     ✔ tune         1.3.0
✔ infer        1.0.9     ✔ workflows    1.2.0
✔ modeldata    1.5.0     ✔ workflowsets 1.1.1
✔ parsnip      1.3.2     ✔ yardstick    1.3.2
✔ recipes      1.3.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

library(ISLR)
library(nnet)
library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

library(glmnet)

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-10

library(discrim)


Attaching package: 'discrim'

The following object is masked from 'package:dials':

    smoothness

library(GGally)
library(corrplot)

corrplot 0.95 loaded

set.seed(123)
#| warning: false 
#| Message: false
#| error: false
#| echo:

data("Auto")

# Convert to tibble for easier handling
Auto <- as_tibble(Auto)

head(Auto)

# A tibble: 6 × 9
    mpg cylinders displacement horsepower weight acceleration  year origin name 
  <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl> <fct>
1    18         8          307        130   3504         12      70      1 chev…
2    15         8          350        165   3693         11.5    70      1 buic…
3    18         8          318        150   3436         11      70      1 plym…
4    16         8          304        150   3433         12      70      1 amc …
5    17         8          302        140   3449         10.5    70      1 ford…
6    15         8          429        198   4341         10      70      1 ford…

summary(Auto)

      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
                                                                               
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365

Histograms and Density Plots

colSums(is.na(Auto))

         mpg    cylinders displacement   horsepower       weight acceleration 
           0            0            0            0            0            0 
        year       origin         name 
           0            0            0

Auto %>% 
  dplyr::select(where(is.numeric)) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "#4DBBD5", color = "white") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(title = "Histograms of Numeric Variables")

Auto %>%
  dplyr::select(where(is.numeric)) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_density(fill = "#00A087", alpha = 0.5) +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(title = "Density Plots of Numeric Variables")

Pairwise Scatter Plots

ggpairs(
  Auto %>% 
    dplyr::select(mpg, horsepower, weight, displacement, acceleration),
  title = "Pairwise Scatterplots"
)

Correlation Matrix

num_vars <- Auto %>% 
  dplyr::select(where(is.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.cex = 0.8,
         title = "Correlation Matrix", mar = c(0,0,2,0))

Scatterplots with smoothers

Auto %>%
  ggplot(aes(weight, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(title = "MPG vs Weight")

`geom_smooth()` using formula = 'y ~ x'

Auto %>%
  ggplot(aes(horsepower, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(title = "MPG vs Horsepower")

`geom_smooth()` using formula = 'y ~ x'

Auto %>%
  ggplot(aes(displacement, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(title = "MPG vs Displacement")

`geom_smooth()` using formula = 'y ~ x'

Boxplot

# Cylinders vs MPG boxplot
Auto %>% 
  ggplot(aes(factor(cylinders), mpg)) +
  geom_boxplot(fill = "#E64B35") +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)
  ) +
  labs(
    title = "MPG by Number of Cylinders",
    x = "Cylinders"
  )

OBJECTIVES

Objective 1:Describe probability as a foundation of statistical modeling

Through the semester, I gained a deeper knowledge of how probability underpins every modelling decision, particularly while working with the Auto dataset. In my commitment letter I stated that I wanted to improve my understanding of maximum likelihood estimation and inference so that statistical ideas would feel operational rather than abstract. Through repeated practice with models that explicitly rely on probabilistic assumptions, this goal was achieved.

For instance, multinomial probability distributions combined with maximum likelihood estimation are the foundation of the multinomial logistic regression model. Fitting this model confirmed how parameters are selected to maximise the probability of observing the sample and how likelihood functions reflect assumptions about how data occur.

I also learnt the probabilistic foundations of Linear Discriminant Analysis. LDA is predicated on a shared covariance structure and normal distributions for every class. It became clear from observing how these presumptions affect classification boundaries why probabilistic thinking is necessary, not optional, for understanding and believing models. My understanding for distributional thinking has been reinforced by this practical experience, which is closely related to my objective of developing a deeper theoretical understanding of probability.

Objective 2: Apply the appropriate Generalized Linear Models

Working with the Auto dataset gave me an opportunity to evaluate different model and select those best suited to different response types. This aligns with my earlier desire to move beyond simply running models toward understanding why one model fits better than another.

For example, predicting mpg_bin required a model appropriate for categorical outcomes. This guided me toward a multinomial logistic regression model, which connects the log-odds of each category to linear predictors. Using horsepower, weight, and displacement as predictors illustrated how a GLM can handle multi-class classification while still maintaining interpretability. In contrast, predicting the continuous outcome mpg required linear modeling, making multiple regression, polynomial regression, ridge, and lasso appropriate.

Objective 3: Demonstrate model selection given a set of candidate models

A crucial stage in the statistical modelling process is model selection, which entails comparing several candidate models to determine which is best based on factors including predictive performance, complexity, and goodness of fit.

The Auto dataset provided a valuable context for exercising this skill.

I compared several candidate models aimed at predicting mpg:

Multiple linear regression — RMSE 3.57, R Squared 0.785
Polynomial regression — RMSE 3.37, R Squared 0.808
Ridge regression — RMSE 5.10, R Squared 0.868
Lasso regression — RMSE 2.16, R Squared 0.922

The results reveal a compelling narrative: the lasso model dramatically outperformed all others, likely because the dataset contains substantial multicollinearity which is clearly visible in the correlation heatmap and redundant predictors. Lasso’s ability to shrink some coefficients to zero allowed it to reduce variance and improve predictive accuracy far beyond the ordinary least squares model.

This process embodies exactly what my instructor emphasized throughout the semester comparing candidate models, validating performance, and justifying choices with evidence .

Before this course, model selection felt abstract to me, but working through these comparisons grounded the process in concrete metrics and trade-offs, bringing me closer to the disciplined approach I set as a goal in my commitment letter.

Objective 4: Communicating Results to General Audiences

For statistical results to be understood and be useful to a broad audience, effective communication is crucial. In order to convey the main conclusions and insights from this project, I use straightforward language that avoids technical language and offers understandable explanations of complex ideas.

Scatter plots, correlation matrix, and summary tables are examples of visualisations that are used to highlight the key findings of the auto data and show the relationships between the variables. I also provided interpretations of the regression coefficients.

Objective 5: Use programming software to fit and assess statistical models

The model outputs in my portfolio clearly evidence my progress in becoming more fluent and independent with R and the tidymodels ecosystem. Early in the semester, I lacked confidence with workflows, recipes, and tuning, but by the time I built the ridge and lasso pipelines, complete with dummy variables, normalizations, cross-validation folds, and hyperparameter tuning grids, I recognized how far I had progressed.

For instance, the ridge model necessitated defining a penalty grid, utilising tune_grid(), choosing the optimal model using select_best(metric = “rmse”), and assessing it using test data. I now understand how preprocessing, model definition, and validation fit together in a cohesive pipeline, as evidenced by the workflow’s effective execution and interpretable outputs.

Likewise, the multinomial logistic regression and LDA models required carefully managing factor encoding and recipe design. Seeing their outputs affirmed my mastery of the software tools required to meet this objective and validated the regular practice I committed to in my initial letter.

REGRESSIONS

1. Data Preparation

Auto2 <- Auto %>%
  na.omit() %>%
  mutate(
    cylinders = factor(cylinders),
    origin    = factor(origin),
    year      = factor(year),
    mpg_bin   = cut(mpg, breaks = 3, labels = c("low","med","high"))
  )

auto_split <- initial_split(Auto2, prop = 0.8, strata = mpg_bin)
auto_train <- training(auto_split)
auto_test  <- testing(auto_split)
auto_folds <- vfold_cv(auto_train, v = 5)

# Helper function for consistent metric printing
print_metrics <- function(name, df) {
  cat("--\n")
  cat(name, "\n")
  print(df)
}

2. Multiple Regression

rec_mult <- recipe(mpg ~ horsepower + weight + cylinders, data = auto_train)

mod_mult <- linear_reg() %>% 
  set_engine("lm")

wf_mult <- workflow() %>% 
  add_recipe(rec_mult) %>% 
  add_model(mod_mult)

fit_mult <- fit(wf_mult, auto_train)

metrics_mult <- predict(fit_mult, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg, estimate = .pred)

print_metrics("Multiple Linear Regression", metrics_mult)

--
Multiple Linear Regression 
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.57 
2 rsq     standard       0.785
3 mae     standard       2.79

coeff_mult <- tidy(extract_fit_engine(fit_mult))
coeff_mult

# A tibble: 7 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 38.3      2.63         14.6  7.08e-37
2 horsepower  -0.0619   0.0140       -4.42 1.39e- 5
3 weight      -0.00484  0.000711     -6.82 5.04e-11
4 cylinders4   6.99     2.09          3.34 9.44e- 4
5 cylinders5   8.95     4.69          1.91 5.72e- 2
6 cylinders6   3.44     2.18          1.58 1.16e- 1
7 cylinders8   6.24     2.38          2.63 9.06e- 3

The main variables influencing vehicle fuel efficiency, as expressed in miles per gallon (mpg), are clearly revealed by the multiple linear regression model. Both weight and horsepower exhibit significant negative correlations with mpg after adjusting for all model factors, suggesting that cars with larger engines and higher weight typically have lower fuel efficiency. In particular, mpg drops by roughly 0.06 for every increased horsepower unit and by roughly 0.005 for every additional pound of weight.

3. Polynomial Regression

rec_poly <- recipe(mpg ~ horsepower + weight, data = auto_train) %>%
  step_poly(horsepower, degree = 2) %>%
  step_poly(weight, degree = 2)

mod_poly <- linear_reg() %>% 
  set_engine("lm")

wf_poly <- workflow() %>% 
  add_recipe(rec_poly) %>% 
  add_model(mod_poly)

fit_poly <- fit(wf_poly, auto_train)

metrics_poly <- predict(fit_poly, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg, estimate = .pred)

print_metrics("Polynomial Regression", metrics_poly)

--
Polynomial Regression 
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.37 
2 rsq     standard       0.808
3 mae     standard       2.49

4. Multinomial Logistic Regression

rec_multi <- recipe(mpg_bin ~ horsepower + weight + displacement,
                    data = auto_train)

mod_multi <- multinom_reg() %>% 
  set_engine("nnet")

wf_multi <- workflow() %>% 
  add_recipe(rec_multi) %>% 
  add_model(mod_multi)

fit_multi <- fit(wf_multi, auto_train)

metrics_multi <- predict(fit_multi, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg_bin, estimate = .pred_class)

print_metrics("Multinomial Logistic Regression", metrics_multi)

--
Multinomial Logistic Regression 
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.838
2 kap      multiclass     0.718

5. Linear Discriminant Analysis

rec_lda <- recipe(origin ~ mpg + horsepower + weight, data = auto_train)

mod_lda <- discrim_linear() %>% 
  set_engine("MASS")

wf_lda <- workflow() %>% 
  add_recipe(rec_lda) %>% 
  add_model(mod_lda)

fit_lda <- fit(wf_lda, auto_train)

metrics_lda <- predict(fit_lda, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = origin, estimate = .pred_class)

print_metrics("Linear Discriminant Analysis", metrics_lda)

--
Linear Discriminant Analysis 
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.662
2 kap      multiclass     0.356

6. Ridge Regression

rec_glmnet <- recipe(mpg ~ ., data = auto_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors())

mod_ridge <- linear_reg(penalty = tune(), mixture = 0) %>% 
  set_engine("glmnet")

wf_ridge <- workflow() %>% 
  add_recipe(rec_glmnet) %>% 
  add_model(mod_ridge)

ridge_tune <- tune_grid(wf_ridge, resamples = auto_folds, grid = 20)

best_ridge <- select_best(ridge_tune, metric = "rmse")

final_ridge <- finalize_workflow(wf_ridge, best_ridge) %>%
  fit(auto_train)

ridge_metrics <- predict(final_ridge, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg, estimate = .pred)

print_metrics("Ridge Regression", ridge_metrics)

--
Ridge Regression 
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       5.10 
2 rsq     standard       0.868
3 mae     standard       4.23

7. Lasso Regression

mod_lasso <- linear_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

wf_lasso <- workflow() %>% 
  add_recipe(rec_glmnet) %>% 
  add_model(mod_lasso)

lasso_tune <- tune_grid(wf_lasso, resamples = auto_folds, grid = 20)

best_lasso <- select_best(lasso_tune, metric = "rmse")

final_lasso <- finalize_workflow(wf_lasso, best_lasso) %>% 
  fit(auto_train)

lasso_metrics <- predict(final_lasso, auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg, estimate = .pred)

print_metrics("Lasso Regression", lasso_metrics)

--
Lasso Regression 
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       2.16 
2 rsq     standard       0.922
3 mae     standard       1.73

Comparing Models

model_comparison <- bind_rows(
  metrics_mult %>% mutate(model = "Multiple Regression"),
  metrics_poly %>% mutate(model = "Polynomial Regression"),
  ridge_metrics %>% mutate(model = "Ridge Regression"),
  lasso_metrics %>% mutate(model = "Lasso Regression")
) %>%
  dplyr::select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(rmse)

model_comparison

# A tibble: 4 × 4
  model                  rmse   rsq   mae
  <chr>                 <dbl> <dbl> <dbl>
1 Lasso Regression       2.16 0.922  1.73
2 Polynomial Regression  3.37 0.808  2.49
3 Multiple Regression    3.57 0.785  2.79
4 Ridge Regression       5.10 0.868  4.23

CONCLUSION

This project provided a comprehensive exploration of statistical modeling using the Auto dataset, allowing me to examine how vehicle characteristics influence fuel efficiency and how different modeling techniques perform in predictive tasks. Through careful data preparation, model building, and performance evaluation, several important insights emerged.

The models consistently revealed that weight and horsepower are the strongest determinants of miles per gallon, reaffirming well understood engineering principles. More importantly, the comparison across models demonstrated the practical value of choosing the right analytical approach. While basic linear regression provided a useful starting point, models that could account for nonlinear patterns such as polynomial regression, offered improved accuracy. The greatest gains came from applying regularization techniques, especially Lasso Regression, which achieved the best predictive performance by simplifying the model and reducing overfitting. This illustrates how modern statistical methods can offer substantial advantages when dealing with correlated predictors and complex relationships.

For classification tasks, multinomial logistic regression effectively categorized vehicles into fuel efficiency groups, whereas Linear Discriminant Analysis provided moderate success in identifying country of origin. These results show that models perform differently depending on the clarity of the underlying group distinctions in the data.

SELF EVALUATION

My goals remained consistent throughout the semester because they were directly aligned with the course objectives. I believe I have met, and in several areas surpassed—these goals by fully immersing myself in the coursework, engaging deeply with the class materials, practicing extensively with multiple datasets, and taking time to explain my projects and insights to peers. Over the semester, I progressed from simply interpreting regression models to confidently building them and understanding which modeling approaches are most appropriate for different types of data.

Although I did not present it to this group as originally planned, I am currently developing an API that predicts Airbnb prices based on location and room type, a project made possible entirely through the skills I gained in this class. Working with web-scraped datasets initially posed challenges, and after attempting to clean Airbnb data from 2020 and 2023, I ultimately chose to seek a more well-organized dataset to continue refining my model, which I hope to complete before the end of the year. With a background in Economics, I especially appreciated seeing how linear regression can be applied to topics in educational economics, such as parent and family involvement. I intend to apply the knowledge and skills gained from this course in my current role as a Research Assistant in the Economics Department and in my future academic or professional endeavors.