For the better part of the semester we worked with the Auto data to practice various regression techniques. Having interacted with the data extensively, I decided to use it to demonstrate my understanding of the course objectives as set out in the Statistical Modeling I course, my commitment letter, and the portfolio requirements. I set out to do this in three parts. Part one focuses solely on exploration of the Auto dataset, part two explains how I achieved the course objectives, and part three presents the code and output that show my mastery of those objectives.
EXPLORATORY DATA ANALYSIS
# Load every package used in this portfolio. library(MASS) and the
# tidymodels packages are inferred from the attach messages in the
# original output; GGally and corrplot are needed for ggpairs() and
# corrplot() below. The Auto data ships with the ISLR2 package (it is
# also available in ISLR).
library(ISLR2)      # Auto dataset
library(tidyverse)  # data wrangling and ggplot2 graphics
library(tidymodels) # recipes, workflows, resampling, tuning
library(MASS)       # lda(); masks dplyr::select(), hence dplyr::select() below
library(glmnet)     # ridge and lasso engines
library(discrim)    # tidymodels interface to discriminant analysis
library(GGally)     # ggpairs()
library(corrplot)   # correlation heatmap
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140
  acceleration        year           origin                      name
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
                                                 (Other)           :365
Missing Values and Histograms
colSums(is.na(Auto))
         mpg    cylinders displacement   horsepower       weight acceleration
           0            0            0            0            0            0
        year       origin         name
           0            0            0
Auto %>%
  dplyr::select(where(is.numeric)) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "#4DBBD5", color = "white") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(title = "Histograms of Numeric Variables")
ggpairs(
  Auto %>% dplyr::select(mpg, horsepower, weight, displacement, acceleration),
  title = "Pairwise Scatterplots"
)
Correlation Matrix
num_vars <- Auto %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.cex = 0.8,
         title = "Correlation Matrix", mar = c(0, 0, 2, 0))
Scatterplots with Smoothers
Auto %>%
  ggplot(aes(weight, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(title = "MPG vs Weight")
Auto %>%
  ggplot(aes(horsepower, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(title = "MPG vs Horsepower")
Auto %>%
  ggplot(aes(displacement, mpg)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(title = "MPG vs Displacement")
Boxplot
# Cylinders vs MPG boxplot
Auto %>%
  ggplot(aes(factor(cylinders), mpg)) +
  geom_boxplot(fill = "#E64B35") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(
    title = "MPG by Number of Cylinders",
    x = "Cylinders"
  )
OBJECTIVES
Objective 1: Describe probability as a foundation of statistical modeling
Over the course of the semester, I gained a deeper understanding of how probability underpins every modelling decision, particularly while working with the Auto dataset. In my commitment letter I stated that I wanted to improve my understanding of maximum likelihood estimation and inference so that statistical ideas would feel operational rather than abstract. Repeated practice with models that explicitly rely on probabilistic assumptions helped me achieve this goal.
For instance, multinomial probability distributions combined with maximum likelihood estimation are the foundation of the multinomial logistic regression model. Fitting this model showed how parameters are selected to maximise the probability of observing the sample and how likelihood functions encode assumptions about how the data arise.
I also learnt the probabilistic foundations of Linear Discriminant Analysis. LDA is predicated on normal distributions for every class with a shared covariance structure. Observing how these assumptions shape the classification boundaries made clear why probabilistic thinking is necessary, not optional, for understanding and trusting models. This practical experience reinforced my appreciation for distributional thinking and is closely related to my objective of developing a deeper theoretical understanding of probability.
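To make these assumptions concrete, the following is a minimal sketch of fitting LDA with MASS::lda(); the formula (predicting origin from three engine-related variables) is illustrative rather than a record of the exact model I fit.

# Sketch: LDA models each class as Gaussian with a shared covariance
# matrix; the fitted object exposes the estimated priors and class
# means that determine the linear decision boundaries.
auto_lda <- MASS::lda(factor(origin) ~ horsepower + weight + displacement,
                      data = Auto)
auto_lda$prior # estimated prior probability of each class
auto_lda$means # estimated class-conditional means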
Objective 2: Apply the appropriate Generalized Linear Models
Working with the Auto dataset gave me an opportunity to evaluate different models and select those best suited to different response types. This aligns with my earlier desire to move beyond simply running models toward understanding why one model fits better than another.
For example, predicting mpg_bin required a model appropriate for categorical outcomes. This guided me toward a multinomial logistic regression model, which connects the log-odds of each category to linear predictors. Using horsepower, weight, and displacement as predictors illustrated how a GLM can handle multi-class classification while still maintaining interpretability. In contrast, predicting the continuous outcome mpg required linear modeling, making multiple regression, polynomial regression, ridge, and lasso appropriate.
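As an illustration, here is a minimal sketch of that specification with parsnip's multinom_reg(); the nnet engine and the workflow details are my assumptions, and mpg_bin along with the data splits are created in the data-preparation chunk in the Regressions section.

# Sketch: the multinomial GLM links the log-odds of each mpg_bin
# category to a linear predictor; the nnet engine estimates the
# coefficients by maximum likelihood.
multi_spec <- multinom_reg() %>% set_engine("nnet")

multi_fit <- workflow() %>%
  add_model(multi_spec) %>%
  add_formula(mpg_bin ~ horsepower + weight + displacement) %>%
  fit(data = auto_train)

predict(multi_fit, auto_test) %>% head() # predicted mpg_bin classes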
Objective 3: Demonstrate model selection given a set of candidate models
A crucial stage in the statistical modelling process is model selection, which entails comparing several candidate models to determine which is best based on factors including predictive performance, complexity, and goodness of fit.
The Auto dataset provided a valuable context for exercising this skill.
I compared several candidate models aimed at predicting mpg (a sketch of the shared evaluation step follows the list):
Multiple linear regression — RMSE 3.57, R Squared 0.785
Polynomial regression — RMSE 3.37, R Squared 0.808
Ridge regression — RMSE 5.10, R Squared 0.868
Lasso regression — RMSE 2.16, R Squared 0.922
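All four models were scored on the same held-out test set. Below is a minimal sketch of that evaluation step, assuming each candidate is a fitted tidymodels workflow; the helper name is mine, not from the original code.

# Sketch: score any fitted regression workflow on the test set,
# producing the RMSE and R-squared values compared above.
reg_metrics <- metric_set(rmse, rsq)

score_model <- function(fitted_wf, test_data) {
  fitted_wf %>%
    predict(test_data) %>%
    bind_cols(test_data) %>%
    reg_metrics(truth = mpg, estimate = .pred)
}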
The results reveal a compelling narrative: the lasso model dramatically outperformed all others, likely because the dataset contains substantial multicollinearity (clearly visible in the correlation heatmap) and redundant predictors. Lasso's ability to shrink some coefficients exactly to zero allowed it to reduce variance and improve predictive accuracy far beyond the ordinary least squares model.
This process embodies exactly what my instructor emphasized throughout the semester: comparing candidate models, validating performance, and justifying choices with evidence.
Before this course, model selection felt abstract to me, but working through these comparisons grounded the process in concrete metrics and trade-offs, bringing me closer to the disciplined approach I set as a goal in my commitment letter.
Objective 4: Communicating Results to General Audiences
For statistical results to be understood and useful to a broad audience, effective communication is crucial. To convey the main conclusions and insights from this project, I use straightforward language that avoids jargon and offers clear explanations of complex ideas.
Visualisations such as scatter plots, a correlation matrix, and summary tables highlight the key findings of the Auto data and show the relationships between the variables. I also provided interpretations of the regression coefficients.
Objective 5: Use programming software to fit and assess statistical models
The model outputs in my portfolio clearly evidence my progress in becoming more fluent and independent with R and the tidymodels ecosystem. Early in the semester, I lacked confidence with workflows, recipes, and tuning, but by the time I built the ridge and lasso pipelines, complete with dummy variables, normalization, cross-validation folds, and hyperparameter tuning grids, I recognized how far I had progressed.
For instance, the ridge model required defining a penalty grid, utilising tune_grid(), choosing the optimal model with select_best(metric = "rmse"), and assessing it on test data. I now understand how preprocessing, model definition, and validation fit together in a cohesive pipeline, as evidenced by the workflow's effective execution and interpretable outputs.
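A condensed sketch of that ridge pipeline follows, assuming the recipe steps and grid range shown (my original preprocessing and grid values may have differed); it uses the split and folds defined in the data-preparation section, and setting mixture = 1 instead yields the lasso pipeline.

# Sketch: ridge regression (mixture = 0) with a cross-validated penalty.
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

ridge_rec <- recipe(mpg ~ horsepower + weight + displacement + cylinders + origin,
                    data = auto_train) %>%
  step_dummy(all_nominal_predictors()) %>%  # dummy-code the factor predictors
  step_normalize(all_numeric_predictors()) # glmnet needs comparable scales

ridge_wf <- workflow() %>%
  add_recipe(ridge_rec) %>%
  add_model(ridge_spec)

# Penalty grid on the log10 scale, then tune over the CV folds.
penalty_grid <- grid_regular(penalty(range = c(-4, 1)), levels = 30)
ridge_tuned <- tune_grid(ridge_wf, resamples = auto_folds, grid = penalty_grid)

best_ridge <- select_best(ridge_tuned, metric = "rmse")
final_ridge <- finalize_workflow(ridge_wf, best_ridge) %>% fit(auto_train)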
Likewise, the multinomial logistic regression and LDA models required carefully managing factor encoding and recipe design. Seeing their outputs affirmed my mastery of the software tools required to meet this objective and validated the regular practice I committed to in my initial letter.
REGRESSIONS
1. Data Preparation
Auto2 <- Auto %>%
  na.omit() %>%
  mutate(
    cylinders = factor(cylinders),
    origin = factor(origin),
    year = factor(year),
    mpg_bin = cut(mpg, breaks = 3, labels = c("low", "med", "high"))
  )

auto_split <- initial_split(Auto2, prop = 0.8, strata = mpg_bin)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
auto_folds <- vfold_cv(auto_train, v = 5)

# Helper function for consistent metric printing
print_metrics <- function(name, df) {
  cat("--\n")
  cat(name, "\n")
  print(df)
}
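The chunk that fit the first model did not survive rendering; what follows is a minimal sketch consistent with the printed output below, assuming a plain lm engine and the three predictors discussed in the objectives (the original formula may have included more terms).

# Sketch: fit a multiple linear regression and print test-set metrics.
lm_spec <- linear_reg() %>% set_engine("lm")

lm_fit <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(mpg ~ horsepower + weight + displacement) %>%
  fit(data = auto_train)

lm_metrics <- lm_fit %>%
  predict(auto_test) %>%
  bind_cols(auto_test) %>%
  metrics(truth = mpg, estimate = .pred) # rmse, rsq, mae

print_metrics("Multiple Linear Regression", lm_metrics)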
--
Multiple Linear Regression
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.57
2 rsq     standard       0.785
3 mae     standard       2.79
The multiple linear regression model clearly reveals the main variables influencing vehicle fuel efficiency, expressed in miles per gallon (mpg). After adjusting for the other predictors, both weight and horsepower show significant negative associations with mpg, suggesting that heavier cars with larger engines typically have lower fuel efficiency. In particular, mpg drops by roughly 0.06 for each additional unit of horsepower and by roughly 0.005 for each additional pound of weight.
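Coefficient readings like these come straight from the fitted model object; a one-line sketch, assuming the lm_fit workflow from the sketch above:

# Sketch: extract estimated coefficients, standard errors, and p-values.
lm_fit %>%
  extract_fit_parsnip() %>%
  tidy()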
This project provided a comprehensive exploration of statistical modeling using the Auto dataset, allowing me to examine how vehicle characteristics influence fuel efficiency and how different modeling techniques perform in predictive tasks. Through careful data preparation, model building, and performance evaluation, several important insights emerged.
The models consistently revealed that weight and horsepower are the strongest determinants of miles per gallon, reaffirming well-understood engineering principles. More importantly, the comparison across models demonstrated the practical value of choosing the right analytical approach. While basic linear regression provided a useful starting point, models that could account for nonlinear patterns, such as polynomial regression, offered improved accuracy. The greatest gains came from applying regularization techniques, especially lasso regression, which achieved the best predictive performance by simplifying the model and reducing overfitting. This illustrates how modern statistical methods can offer substantial advantages when dealing with correlated predictors and complex relationships.
For classification tasks, multinomial logistic regression effectively categorized vehicles into fuel efficiency groups, whereas Linear Discriminant Analysis provided moderate success in identifying country of origin. These results show that models perform differently depending on the clarity of the underlying group distinctions in the data.
SELF EVALUATION
My goals remained consistent throughout the semester because they were directly aligned with the course objectives. I believe I have met, and in several areas surpassed, these goals by fully immersing myself in the coursework, engaging deeply with the class materials, practicing extensively with multiple datasets, and taking time to explain my projects and insights to peers. Over the semester, I progressed from simply interpreting regression models to confidently building them and understanding which modeling approaches are most appropriate for different types of data.
Although I did not present it to this group as originally planned, I am currently developing an API that predicts Airbnb prices based on location and room type, a project made possible entirely through the skills I gained in this class. Working with web-scraped datasets initially posed challenges, and after attempting to clean Airbnb data from 2020 and 2023, I ultimately chose to seek a more well-organized dataset to continue refining my model, which I hope to complete before the end of the year. With a background in Economics, I especially appreciated seeing how linear regression can be applied to topics in educational economics, such as parent and family involvement. I intend to apply the knowledge and skills gained from this course in my current role as a Research Assistant in the Economics Department and in my future academic or professional endeavors.