# SETUP
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.9 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.0
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.1.0 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(dplyr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
In this portfolio I walk through several of the objectives I learned throughout the course. The code presented below is intentionally minimal: I have trimmed most of the code from my homework and projects, keeping just enough to demonstrate what I did and what I learned over the semester, where I failed multiple times before finally understanding the concepts and applying them in my projects and assignments.
Fit a simple linear regression model (mpg ~ horsepower) using the Auto dataset.
library(ISLR)
data("Auto")
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
summary(Auto$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 17.00 22.75 23.45 29.00 46.60
summary(Auto$horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46.0 75.0 93.5 104.5 126.0 230.0
Relation <- Auto %>%
ggplot(aes(x = acceleration, y = mpg)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(
title = "Relationship Between Acceleration and MPG",
subtitle = "Scatterplot with Linear Regression Line",
x = "Acceleration (0–60 mph time, seconds)",
y = "Miles Per Gallon (MPG)"
) +
theme_minimal(base_size = 14)
Relation
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(mpg)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(
title = "Distribution of MPG",
x = "MPG",
y = "Count"
) +
theme_minimal()
Auto %>%
select_if(is.numeric) %>%
ggcorr(label = TRUE, label_round = 2)
These three visualizations helped me build an intuitive understanding of how fuel efficiency behaves in the Auto dataset. The scatterplot with a linear regression line between acceleration and MPG shows a mild positive relationship: cars that take longer to reach 60 mph (higher acceleration time, meaning slower acceleration) tend to have slightly higher MPG. This makes sense mechanically—slower-accelerating cars are often smaller and less powerful, which can translate into better fuel economy. However, the points are fairly spread out around the line, suggesting that acceleration alone does not fully explain variation in MPG and that other variables like weight, displacement, and horsepower play important roles as well.
The histogram of MPG reveals a right-skewed distribution with most cars clustered between roughly 15 and 30 MPG, and a smaller number of very high-efficiency vehicles stretching out the upper tail. This indicates that while truly high-MPG cars exist, they are relatively rare in the dataset. The correlation heatmap ties these observations together by showing strong negative correlations between MPG and variables such as weight, displacement, and horsepower, and strong positive correlations among those engine and size measures themselves. Seeing all the correlations in one place reinforced why multicollinearity is a concern in regression modeling and why careful variable selection or regularization might be necessary. Overall, these EDA plots gave me a clearer picture of the data structure before fitting any models and guided my thinking about which predictors would be most influential for explaining fuel efficiency.
set.seed(123)
auto_split <- initial_split(Auto, prop = 0.8)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
auto_rec_simple <- recipe(mpg ~ horsepower, data = auto_train)
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_wflow_simple <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(auto_rec_simple)
lm_fit_simple <- lm_wflow_simple %>%
fit(data = auto_train)
lm_fit_simple %>%
tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 40.3 0.802 50.3 2.54e-151
## 2 horsepower -0.160 0.00719 -22.2 4.47e- 66
lm_fit_simple %>%
glance()
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.613 0.612 4.95 493. 4.47e-66 1 -944. 1893. 1905.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
auto_train_preds <- lm_fit_simple %>%
predict(auto_train) %>%
bind_cols(auto_train) %>%
mutate(residuals = mpg - .pred)
ggplot(auto_train_preds, aes(x = .pred, y = residuals)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Fitted – Simple Linear Regression",
x = "Fitted mpg",
y = "Residuals"
)
auto_test_results <- lm_wflow_simple %>%
last_fit(auto_split)
auto_test_results %>% collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 4.75 pre0_mod0_post0
## 2 rsq standard 0.583 pre0_mod0_post0
This section reflects my understanding from Homework 1: I fit a simple linear regression model predicting mpg from horsepower, using the Auto dataset. This model allowed me to focus on the fundamentals of least squares regression—specifically, how a single predictor influences a continuous response. The slope estimate provided a clear interpretation: for every additional unit of horsepower, the expected fuel efficiency decreases by a fixed amount on average. This relationship reflects the intuitive trade-off between engine power and efficiency. I examined the model summary to interpret the intercept, slope, standard errors, and p-values, which helped me understand both the practical meaning and statistical significance of the predictor. The model also provided confidence intervals for each parameter, reinforcing the idea that estimates come with uncertainty, and highlighting how sample variability affects regression inference.
I also assessed how well the model fit the data by examining both R², which shows the proportion of variability explained, and RMSE, which provides an interpretable measure of prediction error in mpg units. These metrics gave me complementary perspectives on model performance. To evaluate least squares assumptions, I generated residual-versus-fitted plots and examined the distribution of residuals. These checks helped me identify potential issues with nonlinearity or heteroscedasticity and see whether a simple linear model appropriately captured the trend between horsepower and mpg. Working through this example strengthened my understanding of how to fit, interpret, and diagnose simple regression models before moving on to more complex methods.
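The confidence intervals mentioned above are not shown in the printed output; a minimal sketch of obtaining them from the fitted workflow (assuming the lm_fit_simple object created above) is:
# Pull the underlying lm object out of the fitted workflow, then compute
# 95% confidence intervals for the intercept and the horsepower slope.
lm_engine_fit <- extract_fit_engine(lm_fit_simple)
confint(lm_engine_fit, level = 0.95)
# The same intervals in tidy form, next to the point estimates.
broom::tidy(lm_engine_fit, conf.int = TRUE, conf.level = 0.95)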
Fit a multiple linear regression model using both quantitative and qualitative explanatory variables, and interpret each coefficient while holding all other predictors—including numerical and categorical variables—constant.
library(broom)
library(ggplot2)
data("diamonds", package = "ggplot2")
#Numeric-only dataset
diamonds_num <- diamonds %>%
select(carat, depth, table, x, y, z)
# Distribution of carat (response variable)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Diamond Carat",
x = "Carat",
y = "Count"
) +
theme_minimal()
# Pairwise scatterplots among numeric variables
library(GGally)
diamonds_num %>%
sample_n(1000) %>% # sample to avoid overplotting
ggpairs() +
ggtitle("Pairwise Relationships Among Numeric Diamond Features")
#Box plots before grouping
ggplot(diamonds, aes(x = cut, y = carat)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Carat by Cut",
x = "Cut",
y = "Carat"
) +
theme_minimal()
ggplot(diamonds, aes(x = color, y = carat)) +
geom_boxplot(fill = "lightgreen") +
labs(
title = "Carat by Color",
x = "Color",
y = "Carat"
) +
theme_minimal()
ggplot(diamonds, aes(x = clarity, y = carat)) +
geom_boxplot(fill = "plum") +
labs(
title = "Carat by Clarity",
x = "Clarity",
y = "Carat"
) +
theme_minimal()
# Correlation matrix among numeric predictors
cor_matrix <- cor(diamonds_num)
round(cor_matrix, 3)
## carat depth table x y z
## carat 1.000 0.028 0.182 0.975 0.952 0.953
## depth 0.028 1.000 -0.296 -0.025 -0.029 0.095
## table 0.182 -0.296 1.000 0.195 0.184 0.151
## x 0.975 -0.025 0.195 1.000 0.975 0.971
## y 0.952 -0.029 0.184 0.975 1.000 0.952
## z 0.953 0.095 0.151 0.971 0.952 1.000
The correlation heat map, bar plots, box plots, and ggpairs visualization played a crucial role in shaping my understanding of the diamonds dataset before fitting any regression models (only some of these plots are reproduced here; the correlation matrix above is printed numerically, with a heat-map version sketched below). The correlation heat map allowed me to quickly detect strong linear relationships—particularly among the x, y, and z dimensions—which signaled potential multicollinearity issues that could destabilize coefficient estimates in multiple regression. The bar plots of categorical variables helped me assess the distribution of factor levels and guided my decision to collapse categories into meaningful groups, which improves model interpretability and prevents sparse-level problems. The ggpairs plot provided a compact view of pairwise relationships, including scatter plots, density plots, and correlations, helping me evaluate linearity, outliers, and heteroscedasticity across predictors. These diagnostic visuals gave me enough evidence to justify transformations and variable engineering choices used later in the model. I did not include additional EDA plots such as density curves, box plots after grouping, or distribution comparisons because I had already explored them extensively in my earlier project work, and repeating them here would add redundancy without improving the narrative. Instead, I focused on the plots that best supported the modeling decisions made specifically for this objective.
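The correlation matrix above is printed numerically; a minimal sketch of rendering it as the heat map described here, reusing the ggcorr() helper from GGally that is already loaded, would be:
# Correlation heat map of the numeric diamond features (same values as the
# printed matrix, displayed as a colored grid with labels).
diamonds_num %>%
  ggcorr(label = TRUE, label_round = 2) +
  ggtitle("Correlation Heat Map of Numeric Diamond Features")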
# Multiple linear regression with only numeric explanatory variables
mlr_num <- lm(carat ~ depth + table + x + y + z, data = diamonds_num)
tidy(mlr_num)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0281 -98.8 0
## 2 depth 0.0180 0.000367 49.1 0
## 3 table 0.00165 0.000210 7.84 4.65e-15
## 4 x 0.396 0.00239 166. 0
## 5 y 0.0130 0.00174 7.47 8.06e-14
## 6 z 0.00470 0.00302 1.56 1.20e- 1
glance(mlr_num)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.954 0.954 0.102 222237. 0 5 46604. -93195. -93133.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
diamonds_grp <- diamonds %>%
mutate(
cut_grouped = fct_collapse(
cut,
LowerEndCut = c("Fair", "Good"),
HigherEndCut = c("Very Good", "Premium", "Ideal")
),
color_grouped = fct_collapse(
color,
Better = c("D", "E", "F"),
Middle = c("G", "H"),
Lower = c("I", "J")
),
clarity_grouped = fct_collapse(
clarity,
Lower = c("I1", "SI2", "SI1"),
Medium = c("VS2", "VS1"),
Higher = c("VVS2", "VVS1", "IF")
)
)
diamonds_grp %>% count(cut_grouped, color_grouped, clarity_grouped) %>% head()
## # A tibble: 6 × 4
## cut_grouped color_grouped clarity_grouped n
## <ord> <ord> <ord> <int>
## 1 LowerEndCut Better Lower 2018
## 2 LowerEndCut Better Medium 884
## 3 LowerEndCut Better Higher 301
## 4 LowerEndCut Middle Lower 1216
## 5 LowerEndCut Middle Medium 722
## 6 LowerEndCut Middle Higher 252
# Log-transform response (correction for skew / assumption issues)
mlr_cat <- lm(log(carat) ~ cut_grouped + color_grouped + clarity_grouped,
data = diamonds_grp)
tidy(mlr_cat)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.360 0.00382 -94.2 0
## 2 cut_grouped.L -0.0605 0.00495 -12.2 2.86e-34
## 3 color_grouped.L 0.284 0.00470 60.3 0
## 4 color_grouped.Q 0.00756 0.00411 1.84 6.59e- 2
## 5 clarity_grouped.L -0.390 0.00442 -88.3 0
## 6 clarity_grouped.Q -0.00734 0.00393 -1.87 6.16e- 2
glance(mlr_cat)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.192 0.192 0.526 2570. 0 5 -41837. 83688. 83751.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mlr_cat_aug <- augment(mlr_cat)
ggplot(mlr_cat_aug, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Fitted – log(carat) model",
x = "Fitted log(carat)",
y = "Residuals"
)
ggplot(mlr_cat_aug, aes(sample = .std.resid)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Normal Q–Q Plot of Standardized Residuals"
)
This part of the analysis reflects my work from Project 1. Here I demonstrated my understanding of multiple linear regression using both numeric and categorical predictors. I first fit an MLR model using only numeric variables such as depth, table, and the three diamond dimensions (x, y, z). This allowed me to interpret the meaning of each coefficient and understand how changes in individual predictors affect carat weight while holding other variables constant. Beyond relying on R², I used additional summary statistics such as residual standard error, F-statistics, AIC, and BIC to evaluate model fit more holistically. These metrics provided a clearer sense of how well the model explained the data and helped me compare the numeric-only model against later models that incorporated grouped categorical predictors.
To address the “other considerations” objective, I incorporated qualitative predictors by grouping cut, color, and clarity into meaningful factor levels. This highlighted how categorical variables enter a regression model through dummy/indicator variables, and how the number of indicators depends on the number of factor levels. I also explored multicollinearity by computing a correlation matrix among numeric predictors, which revealed extremely high correlations—especially among the x, y, and z dimensions—illustrating why multicollinearity inflates variance and makes coefficients unstable. To address assumption violations, I applied a log transformation to the response variable and examined residual diagnostics such as residual-versus-fitted plots and Q–Q plots. These checks helped me evaluate linearity, variance patterns, and normality, and reinforced why validating assumptions is essential before interpreting a regression model.
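Two quick checks make these points concrete. The sketch below assumes the car package is installed for variance inflation factors and reuses the mlr_num and diamonds_grp objects created above.
# Variance inflation factors for the numeric-only model: values far above 10
# for x, y, and z reflect the multicollinearity seen in the correlation matrix.
car::vif(mlr_num)
# Because the grouped factors are ordered, R encodes them with polynomial
# contrasts (.L, .Q), which is why those terms appear in tidy(mlr_cat) above.
head(model.matrix(~ color_grouped, data = diamonds_grp), n = 3)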
Linear regression is a powerful starting point for understanding relationships between predictors and a continuous outcome, but throughout this semester I learned that it is often too restrictive for many real-world problems. Linear regression relies on strong assumptions: linearity, constant variance, normally distributed errors, and low multicollinearity among predictors. In several of my assignments, especially the Diamonds Project, I discovered major violations of these assumptions. For example, numeric predictors like x, y, and z in the diamonds dataset were extremely collinear, making linear regression coefficients unstable. In Homework 3, non-linear patterns between horsepower and mpg caused systematic residual patterns, showing that the simple linear model could not capture the true underlying relationship. Additionally, some outcomes I modeled—such as academic achievement levels—were categorical, meaning linear regression was not mathematically appropriate at all.
Because of these limitations, I learned that linear regression cannot be treated as a one-size-fits-all modeling tool. Certain types of outcomes (binary categories, multiple categories, count data) and certain data issues (non-linearity, multicollinearity, skewness, or heteroscedasticity) require specialized models. Throughout the semester, I explored a wide set of alternative modeling techniques designed to overcome these limitations.
To move beyond the constraints of linear regression, I learned and implemented several more flexible modeling approaches; a minimal sketch of how each is specified in tidymodels follows the list below.
Logistic Regression (Binary Outcomes): Used when the response is 0/1 or Yes/No. I used this in Project 2 to predict academic achievement, where probabilities, not numeric predictions, were required.
Multinomial Logistic Regression (3+ Outcome Categories): When the response has more than two categories (e.g., Mostly A's, B's, C's, etc.), multinomial logistic regression models the probability of each category. This was essential in Project 2 for modeling SEGRADES.
LDA (Linear Discriminant Analysis): Another classification method that assumes normality within classes and equal covariance. Useful when logistic regression is unstable or when classes are well separated. I used this in the classification part of my second project.
Poisson Regression (Count Outcomes): Used when the response is a count (non-negative integers). I applied this while modeling academic absences in Project 2.
Regularized Models (Ridge, LASSO, Elastic Net): These handle multicollinearity and high-dimensional predictor sets by shrinking coefficients, which improves prediction accuracy and stabilizes estimates. I used this in Project 2.
Polynomial Regression & Interaction Models: These capture non-linear relationships that a straight-line regression cannot represent. In Phase 3 of Project 2, squared terms and interaction terms improved model flexibility.
PCA Regression (Dimension Reduction): When predictors are highly correlated, PCA transforms them into orthogonal components, preventing multicollinearity.
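As referenced above, here is a minimal sketch of how these approaches are declared in tidymodels; the engines and settings are illustrative defaults rather than the exact configurations from my projects, and the discrim and poissonreg extension packages are assumed to be installed.
library(discrim)      # extension package providing discriminant analysis models
library(poissonreg)   # extension package providing Poisson regression engines
logistic_spec <- logistic_reg() %>% set_engine("glm")          # binary outcomes
multinom_spec_demo <- multinom_reg() %>% set_engine("nnet")    # 3+ outcome categories
lda_spec_demo <- discrim_linear() %>% set_engine("MASS")       # linear discriminant analysis
poisson_spec <- poisson_reg() %>% set_engine("glm")            # count outcomes
lasso_spec_demo <- linear_reg(penalty = 0.01, mixture = 1) %>% # mixture = 1 is LASSO, 0 is ridge
  set_engine("glmnet")
# Polynomial, interaction, and PCA terms enter through recipe steps rather than the model spec:
flex_rec_demo <- recipe(mpg ~ horsepower + weight + displacement, data = Auto) %>%
  step_poly(horsepower, degree = 2) %>%
  step_interact(~ weight:displacement) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)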
Use tidymodels to build a multinomial logistic regression model and explain how to interpret the predicted probabilities and final class predictions.
tidymodels_prefer()
pfi_model <- read_csv("pfi_phase2_model_ready.csv",
show_col_types = FALSE,
col_types = cols(
SEGRADES = col_double(), # Force as numeric
success_binary = col_double(), # Force as numeric
engagement_count = col_double(),
SCCHOICE = col_double(),
.default = col_guess()
))
# Remove category 5 (no grades) and any missing values
# KEEP AS NUMERIC for now
pfi_complete <- pfi_model %>%
filter(!is.na(SEGRADES) & !is.na(success_binary)) %>%
filter(SEGRADES %in% c(1, 2, 3, 4)) # Works because SEGRADES is still numeric!
cat("Starting observations:", nrow(pfi_model), "\n")
## Starting observations: 15990
cat("After removing missing and 'no grades' (category 5):", nrow(pfi_complete), "\n")
## After removing missing and 'no grades' (category 5): 14002
cat("Observations removed:", nrow(pfi_model) - nrow(pfi_complete), "\n\n")
## Observations removed: 1988
# Verify SEGRADES distribution before split (still numeric)
cat("Grade distribution before split (numeric):\n")
## Grade distribution before split (numeric):
print(table(pfi_complete$SEGRADES))
##
## 1 2 3 4
## 7866 4528 1345 263
cat("\n")
# Create stratified split on SEGRADES (while still numeric)
set.seed(631)
data_split <- initial_split(pfi_complete, prop = 0.70, strata = SEGRADES)
train_data <- training(data_split)
test_data <- testing(data_split)
cat("Split created:\n")
## Split created:
cat(" Training:", nrow(train_data), "observations\n")
## Training: 9801 observations
cat(" Testing:", nrow(test_data), "observations\n\n")
## Testing: 4201 observations
# NOW convert to factors with labels (AFTER the split)
train_data <- train_data %>%
mutate(
success_binary = factor(success_binary,
levels = c(1, 0),
labels = c("High_Achievement", "Low_Achievement")),
SEGRADES = factor(SEGRADES,
levels = c(1, 2, 3, 4),
labels = c("Mostly_As", "Mostly_Bs", "Mostly_Cs", "Ds_or_Below"))
)
test_data <- test_data %>%
mutate(
success_binary = factor(success_binary,
levels = c(1, 0),
labels = c("High_Achievement", "Low_Achievement")),
SEGRADES = factor(SEGRADES,
levels = c(1, 2, 3, 4),
labels = c("Mostly_As", "Mostly_Bs", "Mostly_Cs", "Ds_or_Below"))
)
# Start from train_data
multi_data <- train_data %>%
filter(!is.na(SEGRADES)) %>%
droplevels()
# Check class counts
multi_data %>%
count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 5508
## 2 Mostly_Bs 3159
## 3 Mostly_Cs 945
## 4 Ds_or_Below 189
set.seed(631)
multi_split <- initial_split(multi_data, prop = 0.8, strata = SEGRADES)
multi_train <- training(multi_split)
multi_test <- testing(multi_split)
multi_train %>% count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 4411
## 2 Mostly_Bs 2521
## 3 Mostly_Cs 756
## 4 Ds_or_Below 152
multi_test %>% count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 1097
## 2 Mostly_Bs 638
## 3 Mostly_Cs 189
## 4 Ds_or_Below 37
multi_train_mod <- multi_train %>%
mutate(
choice_x_engagement = SCCHOICE * engagement_count,
income_x_homework = TTLHHINC * homework_intensity,
parent_ed_x_involvement = PARGRADEX * school_involvement,
engagement_sq = engagement_count^2,
homework_sq = homework_intensity^2
) %>%
select(
SEGRADES,
SCCHOICE,
TTLHHINC,
homework_intensity,
engagement_count,
school_involvement,
communication_index,
enrichment_count,
choice_x_engagement,
income_x_homework,
parent_ed_x_involvement,
engagement_sq,
homework_sq
)
multi_test_mod <- multi_test %>%
mutate(
choice_x_engagement = SCCHOICE * engagement_count,
income_x_homework = TTLHHINC * homework_intensity,
parent_ed_x_involvement = PARGRADEX * school_involvement,
engagement_sq = engagement_count^2,
homework_sq = homework_intensity^2
) %>%
select(
SEGRADES,
SCCHOICE,
TTLHHINC,
homework_intensity,
engagement_count,
school_involvement,
communication_index,
enrichment_count,
choice_x_engagement,
income_x_homework,
parent_ed_x_involvement,
engagement_sq,
homework_sq
)
multi_rec <- recipe(SEGRADES ~ ., data = multi_train_mod) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_wf <- workflow() %>%
add_recipe(multi_rec) %>%
add_model(multi_spec)
set.seed(631)
multi_fit <- multi_wf %>%
fit(data = multi_train_mod)
cat("✓ Multinomial model fitted\n\n")
## ✓ Multinomial model fitted
multi_preds <- predict(multi_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(multi_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
multi_metrics <- multi_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
multi_confmat <- multi_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
multi_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.585
## 2 kap multiclass 0.149
multi_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 978 471 119 15
## Mostly_Bs 114 163 64 19
## Mostly_Cs 5 4 6 3
## Ds_or_Below 0 0 0 0
When I first started working on multinomial logistic regression, I struggled to understand how it extended beyond binary logistic regression. I was confused about how the model handled multiple categories simultaneously, how probabilities were calculated for each class, and how the tidymodels workflow differed from what I used earlier in the semester. Preparing the data was also challenging because I had to ensure that all levels of the SEGRADES variable were correctly defined and that my engineered features—such as polynomial and interaction terms—were properly incorporated. Initially, many parts of the model didn’t run because I was missing predictors or misaligned factor levels, which made me question whether I was approaching the model correctly at all.
Through discussions with classmates, help from team members, and a lot of searching through documentation and online examples, I gradually pieced the concepts together. Seeing how others approached multi-class problems helped me understand why multinomial logistic regression was the right tool and how tidymodels structures recipes, preprocessing, and workflow steps. Once everything clicked, I was able to successfully build the model, evaluate it with accuracy and a confusion matrix, and interpret how multiple predictors influenced different grade categories. This experience taught me not only the technical side of multinomial logistic regression but also the value of collaboration, patience, and iterative problem-solving when tackling new and initially confusing statistical methods. This was a great learning experience from Project 2.
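Since this objective asks how the predicted probabilities and final class predictions are interpreted, here is a minimal sketch using the multi_preds tibble created above: the .pred_<level> columns are the per-class probabilities (they sum to 1 within each row), and .pred_class is simply the level with the largest probability.
# First few test rows: per-class probabilities, the hard class prediction,
# and the observed SEGRADES value side by side.
multi_preds %>%
  select(starts_with(".pred_"), SEGRADES) %>%
  slice_head(n = 5)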
I used a LASSO-regularized multinomial logistic regression model in tidymodels to tune the penalty parameter through cross-validation and to evaluate how regularization improves variable selection, reduces overfitting, and enhances interpretability in multi-class classification problems.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-10
set.seed(631)
multi_folds <- vfold_cv(multi_train_mod, v = 5, strata = SEGRADES)
lasso_multi_spec <- multinom_reg(
penalty = tune(), # penalty λ will be tuned
mixture = 1 # 1 = pure LASSO
) %>%
set_engine("glmnet") %>%
set_mode("classification")
lasso_multi_wf <- workflow() %>%
add_recipe(multi_rec) %>%
add_model(lasso_multi_spec)
set.seed(631)
lasso_multi_tune <- lasso_multi_wf %>%
tune_grid(
resamples = multi_folds,
grid = 20,
metrics = metric_set(accuracy)
)
lasso_multi_best <- select_best(lasso_multi_tune, metric = "accuracy")
lasso_multi_final_wf <- finalize_workflow(
lasso_multi_wf,
lasso_multi_best
)
lasso_multi_fit <- lasso_multi_final_wf %>%
fit(data = multi_train_mod)
lasso_multi_preds <- predict(lasso_multi_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(lasso_multi_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
lasso_multi_metrics <- lasso_multi_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
lasso_multi_confmat <- lasso_multi_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
lasso_multi_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.584
## 2 kap multiclass 0.145
lasso_multi_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 979 473 118 14
## Mostly_Bs 114 164 69 21
## Mostly_Cs 4 1 2 2
## Ds_or_Below 0 0 0 0
In this section, I applied LASSO regularization to a multinomial logistic regression model to strengthen prediction and feature selection in a multi-class classification setting, as part of my second project with my teammates. Compared to the standard multinomial model, LASSO introduced a penalty on the magnitude of coefficients, shrinking less important predictors toward zero. This encouraged a sparser and more interpretable model, especially valuable given the engineered features, interaction terms, and polynomial terms in my dataset. Using 5-fold cross-validation allowed me to tune the penalty parameter λ in a principled way, balancing bias and variance to improve generalization performance. After finalizing the workflow and fitting the model, I evaluated its performance using accuracy and a confusion matrix to understand how well the regularized model classified SEGRADES categories.
Working through this process helped me appreciate why regularization is essential in modern statistical learning—especially when dealing with many correlated predictors or large feature sets created through engineering. LASSO mitigates overfitting by shrinking unstable coefficients, improves model interpretability by effectively selecting important predictors, and often leads to more robust predictions on new data. Even though the accuracy of my LASSO model (around 58%) was only moderately strong, the experience reinforced the importance of model tuning, careful preprocessing, and understanding how regularization shapes decision boundaries in multi-class problems. This module strengthened my ability to diagnose high-dimensional models and select appropriate tools for predictive modeling in real-world settings, and the whole learning process definitely improved my collaboration and research skills.
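Because the point of LASSO here is variable selection, the sketch below (assuming the lasso_multi_fit object above and parsnip's tidy() support for glmnet fits) lists the coefficients that remain non-zero at the tuned penalty.
# One row per outcome class and term, evaluated at the tuned penalty.
lasso_coefs <- lasso_multi_fit %>%
  extract_fit_parsnip() %>%
  tidy()
# Keep only the predictors that LASSO did not shrink all the way to zero.
lasso_coefs %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))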
I used this section to fit, evaluate, and interpret a Linear Discriminant Analysis model for multi-class classification, to understand how LDA constructs linear boundaries based on class means and a pooled covariance matrix, and to compare its performance to other classification methods using accuracy metrics and confusion matrices.
library(discrim)
lda_rec <- recipe(SEGRADES ~ ., data = multi_train_mod) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
lda_spec <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
lda_wf <- workflow() %>%
add_recipe(lda_rec) %>%
add_model(lda_spec)
lda_fit <- lda_wf %>%
fit(data = multi_train_mod)
lda_preds <- predict(lda_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(lda_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
lda_metrics <- lda_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
lda_confmat <- lda_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
lda_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.585
## 2 kap multiclass 0.159
lda_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 976 471 119 15
## Mostly_Bs 104 156 56 10
## Mostly_Cs 13 3 7 4
## Ds_or_Below 4 8 7 8
In this section, I applied Linear Discriminant Analysis (LDA) to the same predictors used in my multinomial and LASSO models for Project 2 (completed as a team), which allowed me to explore a more classical statistical approach to multi-class classification. LDA works by estimating class-specific means and a shared covariance matrix, then creating linear discriminant functions that best separate the classes. This method assumes normality and equal covariance across groups, which makes it computationally efficient and interpretable. Following the tidymodels workflow, I prepared the predictors by imputing missing values, normalizing numeric variables, and creating dummy variables for categorical predictors. After fitting the model, I evaluated its accuracy and confusion matrix to see how well LDA distinguished between SEGRADES levels.
Working with LDA helped me understand the importance of modeling assumptions and their influence on classification performance. Unlike logistic regression—which models probabilities directly—LDA focuses on maximizing separation between class means, making it particularly useful when predictors are continuous and roughly follow a multivariate normal distribution. Even if LDA did not outperform every other model in terms of accuracy, it provided valuable insight into how linear decision boundaries behave in multi-class problems and how simpler generative models can still be effective when assumptions are reasonable. This experience strengthened my understanding of classical classification techniques and highlighted the importance of comparing multiple models when dealing with real-world data.
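To connect this description to the fitted object, here is a minimal sketch (assuming the lda_fit workflow fitted above) of pulling out the underlying MASS::lda fit and inspecting the quantities that define its linear boundaries.
# extract_fit_engine() returns the engine-level MASS::lda object.
lda_engine_fit <- extract_fit_engine(lda_fit)
lda_engine_fit$prior     # estimated prior probability of each SEGRADES class
lda_engine_fit$means     # class-specific means of the preprocessed predictors
lda_engine_fit$scaling   # coefficients of the linear discriminant functions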
multi_train_mod
## # A tibble: 7,840 × 13
## SEGRADES SCCHOICE TTLHHINC homework_intensity engagement_count
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Mostly_As 2 8 3 3
## 2 Mostly_As 2 10 3.33 4
## 3 Mostly_As 1 11 2.67 5
## 4 Mostly_As 1 10 2.33 4
## 5 Mostly_As 1 8 2.33 4
## 6 Mostly_As 2 7 3.33 0
## 7 Mostly_As 2 7 2 4
## 8 Mostly_As 1 10 3.33 5
## 9 Mostly_As 1 12 2.67 2
## 10 Mostly_As 1 8 4 5
## # ℹ 7,830 more rows
## # ℹ 8 more variables: school_involvement <dbl>, communication_index <dbl>,
## # enrichment_count <dbl>, choice_x_engagement <dbl>, income_x_homework <dbl>,
## # parent_ed_x_involvement <dbl>, engagement_sq <dbl>, homework_sq <dbl>
multi_test_mod
## # A tibble: 1,961 × 13
## SEGRADES SCCHOICE TTLHHINC homework_intensity engagement_count
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Mostly_As 1 3 3 2
## 2 Mostly_As 2 5 2 1
## 3 Mostly_As 1 8 4 4
## 4 Mostly_As 1 11 3.33 6
## 5 Mostly_As 1 2 2.67 0
## 6 Mostly_Cs 1 3 3.33 5
## 7 Mostly_As 1 5 3 4
## 8 Mostly_As 2 12 2.67 6
## 9 Mostly_As 2 8 2 1
## 10 Mostly_As 1 5 2.33 7
## # ℹ 1,951 more rows
## # ℹ 8 more variables: school_involvement <dbl>, communication_index <dbl>,
## # enrichment_count <dbl>, choice_x_engagement <dbl>, income_x_homework <dbl>,
## # parent_ed_x_involvement <dbl>, engagement_sq <dbl>, homework_sq <dbl>
library(ggplot2)
library(yardstick)
# Multinomial
autoplot(multi_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – Multinomial Logistic Regression")
# LASSO Multinomial
autoplot(lasso_multi_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – LASSO Multinomial Logistic Regression")
# LDA
autoplot(lda_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – LDA")
To make the results easier to interpret, I summarized model performance in a single table comparing accuracy for the multinomial logistic regression, LASSO multinomial regression, and LDA models (a sketch of how that table can be assembled appears below). All three models landed at essentially the same accuracy, roughly 58–59%, with LDA showing a slightly higher kappa, so regularization did not meaningfully change overall test-set accuracy here. I then visualized the confusion matrices as heatmaps, which made it clear which grade categories were most often correctly classified and which were commonly confused. Across all models, "Mostly A's" were predicted well, while "Mostly B's", "Mostly C's", and "Ds or Below" were misclassified far more often, usually being pulled toward the "Mostly A's" prediction. These visual and numerical summaries helped me move beyond just a single accuracy number and think more carefully about how each model behaves for different groups of students.
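A minimal sketch of the comparison table described above, assembled from the metric tibbles computed earlier (multi_metrics, lasso_multi_metrics, lda_metrics):
# One row per model, one column per metric (accuracy and kappa).
bind_rows(
  multi_metrics %>% mutate(model = "Multinomial"),
  lasso_multi_metrics %>% mutate(model = "LASSO Multinomial"),
  lda_metrics %>% mutate(model = "LDA")
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)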