# SETUP
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.9 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.0
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.1.0 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(dplyr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
In this portfolio I walk through several of the objectives I learned throughout the course. The code presented below is intentionally minimal: I have trimmed most of the code from my homework and projects, keeping just enough to demonstrate what I did and what I learned over the semester, where I failed multiple times before finally understanding the concepts and applying them in my projects and assignments.
Fit a simple linear regression model (mpg ~ horsepower) using the Auto dataset.
library(ISLR)
data("Auto")
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
summary(Auto$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 17.00 22.75 23.45 29.00 46.60
summary(Auto$horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46.0 75.0 93.5 104.5 126.0 230.0
Relation <- Auto %>%
ggplot(aes(x = acceleration, y = mpg)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(
title = "Relationship Between Acceleration and MPG",
subtitle = "Scatterplot with Linear Regression Line",
x = "Acceleration (0–60 mph time, seconds)",
y = "Miles Per Gallon (MPG)"
) +
theme_minimal(base_size = 14)
Relation
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(mpg)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(
title = "Distribution of MPG",
x = "MPG",
y = "Count"
) +
theme_minimal()
Auto %>%
select_if(is.numeric) %>%
ggcorr(label = TRUE, label_round = 2)
These three visualizations helped me build an intuitive understanding of how fuel efficiency behaves in the Auto dataset. The scatterplot with a linear regression line between acceleration and MPG shows a mild positive relationship: cars that take longer to reach 60 mph (higher acceleration time, meaning slower acceleration) tend to have slightly higher MPG. This makes sense mechanically—slower-accelerating cars are often smaller and less powerful, which can translate into better fuel economy. However, the points are fairly spread out around the line, suggesting that acceleration alone does not fully explain variation in MPG and that other variables like weight, displacement, and horsepower play important roles as well.
The histogram of MPG reveals a right-skewed distribution with most cars clustered between roughly 15 and 30 MPG, and a smaller number of very high-efficiency vehicles stretching out the upper tail. This indicates that while truly high-MPG cars exist, they are relatively rare in the dataset. The correlation heatmap ties these observations together by showing strong negative correlations between MPG and variables such as weight, displacement, and horsepower, and strong positive correlations among those engine and size measures themselves. Seeing all the correlations in one place reinforced why multicollinearity is a concern in regression modeling and why careful variable selection or regularization might be necessary. Overall, these EDA plots gave me a clearer picture of the data structure before fitting any models and guided my thinking about which predictors would be most influential for explaining fuel efficiency.
set.seed(123)
auto_split <- initial_split(Auto, prop = 0.8)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
auto_rec_simple <- recipe(mpg ~ horsepower, data = auto_train)
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_wflow_simple <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(auto_rec_simple)
lm_fit_simple <- lm_wflow_simple %>%
fit(data = auto_train)
lm_fit_simple %>%
tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 40.3 0.802 50.3 2.54e-151
## 2 horsepower -0.160 0.00719 -22.2 4.47e- 66
lm_fit_simple %>%
glance()
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.613 0.612 4.95 493. 4.47e-66 1 -944. 1893. 1905.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
auto_train_preds <- lm_fit_simple %>%
predict(auto_train) %>%
bind_cols(auto_train) %>%
mutate(residuals = mpg - .pred)
ggplot(auto_train_preds, aes(x = .pred, y = residuals)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Fitted – Simple Linear Regression",
x = "Fitted mpg",
y = "Residuals"
)
auto_test_results <- lm_wflow_simple %>%
last_fit(auto_split)
auto_test_results %>% collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 4.75 pre0_mod0_post0
## 2 rsq standard 0.583 pre0_mod0_post0
This section reflects my understanding from Homework 1: I fit a simple linear regression model predicting mpg from horsepower, using the Auto dataset. This model allowed me to focus on the fundamentals of least squares regression—specifically, how a single predictor influences a continuous response. The slope estimate provided a clear interpretation: for every additional unit of horsepower, the expected fuel efficiency decreases by a fixed amount on average. This relationship reflects the intuitive trade-off between engine power and efficiency. I examined the model summary to interpret the intercept, slope, standard errors, and p-values, which helped me understand both the practical meaning and statistical significance of the predictor. The model also provided confidence intervals for each parameter, reinforcing the idea that estimates come with uncertainty, and highlighting how sample variability affects regression inference.
I also assessed how well the model fit the data by examining both R², which shows the proportion of variability explained, and RMSE, which provides an interpretable measure of prediction error in mpg units. These metrics gave me complementary perspectives on model performance. To evaluate least squares assumptions, I generated residual-versus-fitted plots and examined the distribution of residuals. These checks helped me identify potential issues with nonlinearity or heteroscedasticity and see whether a simple linear model appropriately captured the trend between horsepower and mpg. Working through this example strengthened my understanding of how to fit, interpret, and diagnose simple regression models before moving on to more complex methods.
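The confidence intervals mentioned above are not shown in the printed output; a minimal sketch of obtaining them from the fitted workflow (assuming the lm_fit_simple object created above) is:
# Pull the underlying lm object out of the fitted workflow, then compute
# 95% confidence intervals for the intercept and the horsepower slope.
lm_engine_fit <- extract_fit_engine(lm_fit_simple)
confint(lm_engine_fit, level = 0.95)
# The same intervals in tidy form, next to the point estimates.
broom::tidy(lm_engine_fit, conf.int = TRUE, conf.level = 0.95)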
Fit a multiple linear regression model using both quantitative and qualitative explanatory variables, and interpret each coefficient while holding all other predictors—including numerical and categorical variables—constant.
library(broom)
library(ggplot2)
data("diamonds", package = "ggplot2")
#Numeric-only dataset
diamonds_num <- diamonds %>%
select(carat, depth, table, x, y, z)
# Distribution of carat (response variable)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Diamond Carat",
x = "Carat",
y = "Count"
) +
theme_minimal()
# Pairwise scatterplots among numeric variables
library(GGally)
diamonds_num %>%
sample_n(1000) %>% # sample to avoid overplotting
ggpairs() +
ggtitle("Pairwise Relationships Among Numeric Diamond Features")
#Box plots before grouping
ggplot(diamonds, aes(x = cut, y = carat)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Carat by Cut",
x = "Cut",
y = "Carat"
) +
theme_minimal()
ggplot(diamonds, aes(x = color, y = carat)) +
geom_boxplot(fill = "lightgreen") +
labs(
title = "Carat by Color",
x = "Color",
y = "Carat"
) +
theme_minimal()
ggplot(diamonds, aes(x = clarity, y = carat)) +
geom_boxplot(fill = "plum") +
labs(
title = "Carat by Clarity",
x = "Clarity",
y = "Carat"
) +
theme_minimal()
# Correlation matrix among numeric predictors
cor_matrix <- cor(diamonds_num)
round(cor_matrix, 3)
## carat depth table x y z
## carat 1.000 0.028 0.182 0.975 0.952 0.953
## depth 0.028 1.000 -0.296 -0.025 -0.029 0.095
## table 0.182 -0.296 1.000 0.195 0.184 0.151
## x 0.975 -0.025 0.195 1.000 0.975 0.971
## y 0.952 -0.029 0.184 0.975 1.000 0.952
## z 0.953 0.095 0.151 0.971 0.952 1.000
The correlation heat map, bar plots, box plots, and ggpairs visualization played a crucial role in shaping my understanding of the diamonds dataset before fitting any regression models (only some of these plots are reproduced here; the correlation matrix above is printed numerically, with a heat-map version sketched below). The correlation heat map allowed me to quickly detect strong linear relationships—particularly among the x, y, and z dimensions—which signaled potential multicollinearity issues that could destabilize coefficient estimates in multiple regression. The bar plots of categorical variables helped me assess the distribution of factor levels and guided my decision to collapse categories into meaningful groups, which improves model interpretability and prevents sparse-level problems. The ggpairs plot provided a compact view of pairwise relationships, including scatter plots, density plots, and correlations, helping me evaluate linearity, outliers, and heteroscedasticity across predictors. These diagnostic visuals gave me enough evidence to justify transformations and variable engineering choices used later in the model. I did not include additional EDA plots such as density curves, box plots after grouping, or distribution comparisons because I had already explored them extensively in my earlier project work, and repeating them here would add redundancy without improving the narrative. Instead, I focused on the plots that best supported the modeling decisions made specifically for this objective.
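The correlation matrix above is printed numerically; a minimal sketch of rendering it as the heat map described here, reusing the ggcorr() helper from GGally that is already loaded, would be:
# Correlation heat map of the numeric diamond features (same values as the
# printed matrix, displayed as a colored grid with labels).
diamonds_num %>%
  ggcorr(label = TRUE, label_round = 2) +
  ggtitle("Correlation Heat Map of Numeric Diamond Features")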
# Multiple linear regression with only numeric explanatory variables
mlr_num <- lm(carat ~ depth + table + x + y + z, data = diamonds_num)
tidy(mlr_num)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0281 -98.8 0
## 2 depth 0.0180 0.000367 49.1 0
## 3 table 0.00165 0.000210 7.84 4.65e-15
## 4 x 0.396 0.00239 166. 0
## 5 y 0.0130 0.00174 7.47 8.06e-14
## 6 z 0.00470 0.00302 1.56 1.20e- 1
glance(mlr_num)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.954 0.954 0.102 222237. 0 5 46604. -93195. -93133.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
diamonds_grp <- diamonds %>%
mutate(
cut_grouped = fct_collapse(
cut,
LowerEndCut = c("Fair", "Good"),
HigherEndCut = c("Very Good", "Premium", "Ideal")
),
color_grouped = fct_collapse(
color,
Better = c("D", "E", "F"),
Middle = c("G", "H"),
Lower = c("I", "J")
),
clarity_grouped = fct_collapse(
clarity,
Lower = c("I1", "SI2", "SI1"),
Medium = c("VS2", "VS1"),
Higher = c("VVS2", "VVS1", "IF")
)
)
diamonds_grp %>% count(cut_grouped, color_grouped, clarity_grouped) %>% head()
## # A tibble: 6 × 4
## cut_grouped color_grouped clarity_grouped n
## <ord> <ord> <ord> <int>
## 1 LowerEndCut Better Lower 2018
## 2 LowerEndCut Better Medium 884
## 3 LowerEndCut Better Higher 301
## 4 LowerEndCut Middle Lower 1216
## 5 LowerEndCut Middle Medium 722
## 6 LowerEndCut Middle Higher 252
# Log-transform response (correction for skew / assumption issues)
mlr_cat <- lm(log(carat) ~ cut_grouped + color_grouped + clarity_grouped,
data = diamonds_grp)
tidy(mlr_cat)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.360 0.00382 -94.2 0
## 2 cut_grouped.L -0.0605 0.00495 -12.2 2.86e-34
## 3 color_grouped.L 0.284 0.00470 60.3 0
## 4 color_grouped.Q 0.00756 0.00411 1.84 6.59e- 2
## 5 clarity_grouped.L -0.390 0.00442 -88.3 0
## 6 clarity_grouped.Q -0.00734 0.00393 -1.87 6.16e- 2
glance(mlr_cat)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.192 0.192 0.526 2570. 0 5 -41837. 83688. 83751.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mlr_cat_aug <- augment(mlr_cat)
ggplot(mlr_cat_aug, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Fitted – log(carat) model",
x = "Fitted log(carat)",
y = "Residuals"
)
ggplot(mlr_cat_aug, aes(sample = .std.resid)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Normal Q–Q Plot of Standardized Residuals"
)
This part of the analysis reflects my work from Project 1. Here I demonstrated my understanding of multiple linear regression using both numeric and categorical predictors. I first fit an MLR model using only numeric variables such as depth, table, and the three diamond dimensions (x, y, z). This allowed me to interpret the meaning of each coefficient and understand how changes in individual predictors affect carat weight while holding other variables constant. Beyond relying on R², I used additional summary statistics such as residual standard error, F-statistics, AIC, and BIC to evaluate model fit more holistically. These metrics provided a clearer sense of how well the model explained the data and helped me compare the numeric-only model against later models that incorporated grouped categorical predictors.
To address the “other considerations” objective, I incorporated qualitative predictors by grouping cut, color, and clarity into meaningful factor levels. This highlighted how categorical variables enter a regression model through dummy/indicator variables, and how the number of indicators depends on the number of factor levels. I also explored multicollinearity by computing a correlation matrix among numeric predictors, which revealed extremely high correlations—especially among the x, y, and z dimensions—illustrating why multicollinearity inflates variance and makes coefficients unstable. To address assumption violations, I applied a log transformation to the response variable and examined residual diagnostics such as residual-versus-fitted plots and Q–Q plots. These checks helped me evaluate linearity, variance patterns, and normality, and reinforced why validating assumptions is essential before interpreting a regression model.
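Two quick checks make these points concrete. The sketch below assumes the car package is installed for variance inflation factors and reuses the mlr_num and diamonds_grp objects created above.
# Variance inflation factors for the numeric-only model: values far above 10
# for x, y, and z reflect the multicollinearity seen in the correlation matrix.
car::vif(mlr_num)
# Because the grouped factors are ordered, R encodes them with polynomial
# contrasts (.L, .Q), which is why those terms appear in tidy(mlr_cat) above.
head(model.matrix(~ color_grouped, data = diamonds_grp), n = 3)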
Linear regression is a powerful starting point for understanding relationships between predictors and a continuous outcome, but throughout this semester I learned that it is often too restrictive for many real-world problems. Linear regression relies on strong assumptions: linearity, constant variance, normally distributed errors, and low multicollinearity among predictors. In several of my assignments, especially the Diamonds Project, I discovered major violations of these assumptions. For example, numeric predictors like x, y, and z in the diamonds dataset were extremely collinear, making linear regression coefficients unstable. In Homework 3, non-linear patterns between horsepower and mpg caused systematic residual patterns, showing that the simple linear model could not capture the true underlying relationship. Additionally, some outcomes I modeled—such as academic achievement levels—were categorical, meaning linear regression was not mathematically appropriate at all.
Because of these limitations, I learned that linear regression cannot be treated as a one-size-fits-all modeling tool. Certain types of outcomes (binary categories, multiple categories, count data) and certain data issues (non-linearity, multicollinearity, skewness, or heteroscedasticity) require specialized models. Throughout the semester, I explored a wide set of alternative modeling techniques designed to overcome these limitations.
To move beyond the constraints of linear regression, I learned and implemented several more flexible modeling approaches; a minimal sketch of how each is specified in tidymodels follows the list below.
Logistic Regression (Binary Outcomes): Used when the response is 0/1 or Yes/No. I used this in Project 2 to predict academic achievement, where probabilities, not numeric predictions, were required.
Multinomial Logistic Regression (3+ Outcome Categories): When the response has more than two categories (e.g., Mostly A's, B's, C's, etc.), multinomial logistic regression models the probability of each category. This was essential in Project 2 for modeling SEGRADES.
LDA (Linear Discriminant Analysis): Another classification method that assumes normality within classes and equal covariance. Useful when logistic regression is unstable or when classes are well separated. I used this in the classification part of my second project.
Poisson Regression (Count Outcomes): Used when the response is a count (non-negative integers). I applied this while modeling academic absences in Project 2.
Regularized Models (Ridge, LASSO, Elastic Net): These handle multicollinearity and high-dimensional predictor sets by shrinking coefficients, which improves prediction accuracy and stabilizes estimates. I used this in Project 2.
Polynomial Regression & Interaction Models: These capture non-linear relationships that a straight-line regression cannot represent. In Phase 3 of Project 2, squared terms and interaction terms improved model flexibility.
PCA Regression (Dimension Reduction): When predictors are highly correlated, PCA transforms them into orthogonal components, preventing multicollinearity.
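As referenced above, here is a minimal sketch of how these approaches are declared in tidymodels; the engines and settings are illustrative defaults rather than the exact configurations from my projects, and the discrim and poissonreg extension packages are assumed to be installed.
library(discrim)      # extension package providing discriminant analysis models
library(poissonreg)   # extension package providing Poisson regression engines
logistic_spec <- logistic_reg() %>% set_engine("glm")          # binary outcomes
multinom_spec_demo <- multinom_reg() %>% set_engine("nnet")    # 3+ outcome categories
lda_spec_demo <- discrim_linear() %>% set_engine("MASS")       # linear discriminant analysis
poisson_spec <- poisson_reg() %>% set_engine("glm")            # count outcomes
lasso_spec_demo <- linear_reg(penalty = 0.01, mixture = 1) %>% # mixture = 1 is LASSO, 0 is ridge
  set_engine("glmnet")
# Polynomial, interaction, and PCA terms enter through recipe steps rather than the model spec:
flex_rec_demo <- recipe(mpg ~ horsepower + weight + displacement, data = Auto) %>%
  step_poly(horsepower, degree = 2) %>%
  step_interact(~ weight:displacement) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)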
Use tidymodels to build a multinomial logistic regression model and explain how to interpret the predicted probabilities and final class predictions.
tidymodels_prefer()
pfi_model <- read_csv("pfi_phase2_model_ready.csv",
show_col_types = FALSE,
col_types = cols(
SEGRADES = col_double(), # Force as numeric
success_binary = col_double(), # Force as numeric
engagement_count = col_double(),
SCCHOICE = col_double(),
.default = col_guess()
))
# Remove category 5 (no grades) and any missing values
# KEEP AS NUMERIC for now
pfi_complete <- pfi_model %>%
filter(!is.na(SEGRADES) & !is.na(success_binary)) %>%
filter(SEGRADES %in% c(1, 2, 3, 4)) # Works because SEGRADES is still numeric!
cat("Starting observations:", nrow(pfi_model), "\n")
## Starting observations: 15990
cat("After removing missing and 'no grades' (category 5):", nrow(pfi_complete), "\n")
## After removing missing and 'no grades' (category 5): 14002
cat("Observations removed:", nrow(pfi_model) - nrow(pfi_complete), "\n\n")
## Observations removed: 1988
# Verify SEGRADES distribution before split (still numeric)
cat("Grade distribution before split (numeric):\n")
## Grade distribution before split (numeric):
print(table(pfi_complete$SEGRADES))
##
## 1 2 3 4
## 7866 4528 1345 263
cat("\n")
# Create stratified split on SEGRADES (while still numeric)
set.seed(631)
data_split <- initial_split(pfi_complete, prop = 0.70, strata = SEGRADES)
train_data <- training(data_split)
test_data <- testing(data_split)
cat("Split created:\n")
## Split created:
cat(" Training:", nrow(train_data), "observations\n")
## Training: 9801 observations
cat(" Testing:", nrow(test_data), "observations\n\n")
## Testing: 4201 observations
# NOW convert to factors with labels (AFTER the split)
train_data <- train_data %>%
mutate(
success_binary = factor(success_binary,
levels = c(1, 0),
labels = c("High_Achievement", "Low_Achievement")),
SEGRADES = factor(SEGRADES,
levels = c(1, 2, 3, 4),
labels = c("Mostly_As", "Mostly_Bs", "Mostly_Cs", "Ds_or_Below"))
)
test_data <- test_data %>%
mutate(
success_binary = factor(success_binary,
levels = c(1, 0),
labels = c("High_Achievement", "Low_Achievement")),
SEGRADES = factor(SEGRADES,
levels = c(1, 2, 3, 4),
labels = c("Mostly_As", "Mostly_Bs", "Mostly_Cs", "Ds_or_Below"))
)
# Start from train_data
multi_data <- train_data %>%
filter(!is.na(SEGRADES)) %>%
droplevels()
# Check class counts
multi_data %>%
count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 5508
## 2 Mostly_Bs 3159
## 3 Mostly_Cs 945
## 4 Ds_or_Below 189
set.seed(631)
multi_split <- initial_split(multi_data, prop = 0.8, strata = SEGRADES)
multi_train <- training(multi_split)
multi_test <- testing(multi_split)
multi_train %>% count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 4411
## 2 Mostly_Bs 2521
## 3 Mostly_Cs 756
## 4 Ds_or_Below 152
multi_test %>% count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 1097
## 2 Mostly_Bs 638
## 3 Mostly_Cs 189
## 4 Ds_or_Below 37
multi_train_mod <- multi_train %>%
mutate(
choice_x_engagement = SCCHOICE * engagement_count,
income_x_homework = TTLHHINC * homework_intensity,
parent_ed_x_involvement = PARGRADEX * school_involvement,
engagement_sq = engagement_count^2,
homework_sq = homework_intensity^2
) %>%
select(
SEGRADES,
SCCHOICE,
TTLHHINC,
homework_intensity,
engagement_count,
school_involvement,
communication_index,
enrichment_count,
choice_x_engagement,
income_x_homework,
parent_ed_x_involvement,
engagement_sq,
homework_sq
)
multi_test_mod <- multi_test %>%
mutate(
choice_x_engagement = SCCHOICE * engagement_count,
income_x_homework = TTLHHINC * homework_intensity,
parent_ed_x_involvement = PARGRADEX * school_involvement,
engagement_sq = engagement_count^2,
homework_sq = homework_intensity^2
) %>%
select(
SEGRADES,
SCCHOICE,
TTLHHINC,
homework_intensity,
engagement_count,
school_involvement,
communication_index,
enrichment_count,
choice_x_engagement,
income_x_homework,
parent_ed_x_involvement,
engagement_sq,
homework_sq
)
multi_rec <- recipe(SEGRADES ~ ., data = multi_train_mod) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_wf <- workflow() %>%
add_recipe(multi_rec) %>%
add_model(multi_spec)
set.seed(631)
multi_fit <- multi_wf %>%
fit(data = multi_train_mod)
cat("✓ Multinomial model fitted\n\n")
## ✓ Multinomial model fitted
multi_preds <- predict(multi_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(multi_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
multi_metrics <- multi_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
multi_confmat <- multi_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
multi_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.585
## 2 kap multiclass 0.149
multi_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 978 471 119 15
## Mostly_Bs 114 163 64 19
## Mostly_Cs 5 4 6 3
## Ds_or_Below 0 0 0 0
When I first started working on multinomial logistic regression, I struggled to understand how it extended beyond binary logistic regression. I was confused about how the model handled multiple categories simultaneously, how probabilities were calculated for each class, and how the tidymodels workflow differed from what I used earlier in the semester. Preparing the data was also challenging because I had to ensure that all levels of the SEGRADES variable were correctly defined and that my engineered features—such as polynomial and interaction terms—were properly incorporated. Initially, many parts of the model didn’t run because I was missing predictors or misaligned factor levels, which made me question whether I was approaching the model correctly at all.
Through discussions with classmates, help from team members, and a lot of searching through documentation and online examples, I gradually pieced the concepts together. Seeing how others approached multi-class problems helped me understand why multinomial logistic regression was the right tool and how tidymodels structures recipes, preprocessing, and workflow steps. Once everything clicked, I was able to successfully build the model, evaluate it with accuracy and a confusion matrix, and interpret how multiple predictors influenced different grade categories. This experience taught me not only the technical side of multinomial logistic regression but also the value of collaboration, patience, and iterative problem-solving when tackling new and initially confusing statistical methods. This was a great learning experience from Project 2.
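Since this objective asks how the predicted probabilities and final class predictions are interpreted, here is a minimal sketch using the multi_preds tibble created above: the .pred_<level> columns are the per-class probabilities (they sum to 1 within each row), and .pred_class is simply the level with the largest probability.
# First few test rows: per-class probabilities, the hard class prediction,
# and the observed SEGRADES value side by side.
multi_preds %>%
  select(starts_with(".pred_"), SEGRADES) %>%
  slice_head(n = 5)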
I used a LASSO-regularized multinomial logistic regression model in tidymodels to tune the penalty parameter through cross-validation and to evaluate how regularization improves variable selection, reduces overfitting, and enhances interpretability in multi-class classification problems.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-10
set.seed(631)
multi_folds <- vfold_cv(multi_train_mod, v = 5, strata = SEGRADES)
lasso_multi_spec <- multinom_reg(
penalty = tune(), # penalty λ will be tuned
mixture = 1 # 1 = pure LASSO
) %>%
set_engine("glmnet") %>%
set_mode("classification")
lasso_multi_wf <- workflow() %>%
add_recipe(multi_rec) %>%
add_model(lasso_multi_spec)
set.seed(631)
lasso_multi_tune <- lasso_multi_wf %>%
tune_grid(
resamples = multi_folds,
grid = 20,
metrics = metric_set(accuracy)
)
lasso_multi_best <- select_best(lasso_multi_tune, metric = "accuracy")
lasso_multi_final_wf <- finalize_workflow(
lasso_multi_wf,
lasso_multi_best
)
lasso_multi_fit <- lasso_multi_final_wf %>%
fit(data = multi_train_mod)
lasso_multi_preds <- predict(lasso_multi_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(lasso_multi_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
lasso_multi_metrics <- lasso_multi_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
lasso_multi_confmat <- lasso_multi_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
lasso_multi_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.584
## 2 kap multiclass 0.145
lasso_multi_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 979 473 118 14
## Mostly_Bs 114 164 69 21
## Mostly_Cs 4 1 2 2
## Ds_or_Below 0 0 0 0
In this section, I applied LASSO regularization to a multinomial logistic regression model to strengthen prediction and feature selection in a multi-class classification setting, as part of my second project with my teammates. Compared to the standard multinomial model, LASSO introduced a penalty on the magnitude of coefficients, shrinking less important predictors toward zero. This encouraged a sparser and more interpretable model, especially valuable given the engineered features, interaction terms, and polynomial terms in my dataset. Using 5-fold cross-validation allowed me to tune the penalty parameter λ in a principled way, balancing bias and variance to improve generalization performance. After finalizing the workflow and fitting the model, I evaluated its performance using accuracy and a confusion matrix to understand how well the regularized model classified SEGRADES categories.
Working through this process helped me appreciate why regularization is essential in modern statistical learning—especially when dealing with many correlated predictors or large feature sets created through engineering. LASSO mitigates overfitting by shrinking unstable coefficients, improves model interpretability by effectively selecting important predictors, and often leads to more robust predictions on new data. Even though the accuracy of my LASSO model (around 58%) was only moderately strong, the experience reinforced the importance of model tuning, careful preprocessing, and understanding how regularization shapes decision boundaries in multi-class problems. This module strengthened my ability to diagnose high-dimensional models and select appropriate tools for predictive modeling in real-world settings, and the whole learning process definitely improved my collaboration and research skills.
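Because the point of LASSO here is variable selection, the sketch below (assuming the lasso_multi_fit object above and parsnip's tidy() support for glmnet fits) lists the coefficients that remain non-zero at the tuned penalty.
# One row per outcome class and term, evaluated at the tuned penalty.
lasso_coefs <- lasso_multi_fit %>%
  extract_fit_parsnip() %>%
  tidy()
# Keep only the predictors that LASSO did not shrink all the way to zero.
lasso_coefs %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))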
I used this section to fit, evaluate, and interpret a Linear Discriminant Analysis model for multi-class classification, to understand how LDA constructs linear boundaries based on class means and a pooled covariance matrix, and to compare its performance to other classification methods using accuracy metrics and confusion matrices.
library(discrim)
lda_rec <- recipe(SEGRADES ~ ., data = multi_train_mod) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
lda_spec <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
lda_wf <- workflow() %>%
add_recipe(lda_rec) %>%
add_model(lda_spec)
lda_fit <- lda_wf %>%
fit(data = multi_train_mod)
lda_preds <- predict(lda_fit, new_data = multi_test_mod, type = "prob") %>%
bind_cols(
predict(lda_fit, new_data = multi_test_mod, type = "class"),
multi_test_mod %>% select(SEGRADES)
) %>%
rename(.pred_class = .pred_class)
lda_metrics <- lda_preds %>%
metrics(truth = SEGRADES, estimate = .pred_class)
lda_confmat <- lda_preds %>%
conf_mat(truth = SEGRADES, estimate = .pred_class)
lda_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.585
## 2 kap multiclass 0.159
lda_confmat
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Ds_or_Below
## Mostly_As 976 471 119 15
## Mostly_Bs 104 156 56 10
## Mostly_Cs 13 3 7 4
## Ds_or_Below 4 8 7 8
In this section, I applied Linear Discriminant Analysis (LDA) to the same predictors used in my multinomial and LASSO models for Project 2 (completed as a team), which allowed me to explore a more classical statistical approach to multi-class classification. LDA works by estimating class-specific means and a shared covariance matrix, then creating linear discriminant functions that best separate the classes. This method assumes normality and equal covariance across groups, which makes it computationally efficient and interpretable. Following the tidymodels workflow, I prepared the predictors by imputing missing values, normalizing numeric variables, and creating dummy variables for categorical predictors. After fitting the model, I evaluated its accuracy and confusion matrix to see how well LDA distinguished between SEGRADES levels.
Working with LDA helped me understand the importance of modeling assumptions and their influence on classification performance. Unlike logistic regression—which models probabilities directly—LDA focuses on maximizing separation between class means, making it particularly useful when predictors are continuous and roughly follow a multivariate normal distribution. Even if LDA did not outperform every other model in terms of accuracy, it provided valuable insight into how linear decision boundaries behave in multi-class problems and how simpler generative models can still be effective when assumptions are reasonable. This experience strengthened my understanding of classical classification techniques and highlighted the importance of comparing multiple models when dealing with real-world data.
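To connect this description to the fitted object, here is a minimal sketch (assuming the lda_fit workflow fitted above) of pulling out the underlying MASS::lda fit and inspecting the quantities that define its linear boundaries.
# extract_fit_engine() returns the engine-level MASS::lda object.
lda_engine_fit <- extract_fit_engine(lda_fit)
lda_engine_fit$prior     # estimated prior probability of each SEGRADES class
lda_engine_fit$means     # class-specific means of the preprocessed predictors
lda_engine_fit$scaling   # coefficients of the linear discriminant functions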
multi_train_mod
## # A tibble: 7,840 × 13
## SEGRADES SCCHOICE TTLHHINC homework_intensity engagement_count
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Mostly_As 2 8 3 3
## 2 Mostly_As 2 10 3.33 4
## 3 Mostly_As 1 11 2.67 5
## 4 Mostly_As 1 10 2.33 4
## 5 Mostly_As 1 8 2.33 4
## 6 Mostly_As 2 7 3.33 0
## 7 Mostly_As 2 7 2 4
## 8 Mostly_As 1 10 3.33 5
## 9 Mostly_As 1 12 2.67 2
## 10 Mostly_As 1 8 4 5
## # ℹ 7,830 more rows
## # ℹ 8 more variables: school_involvement <dbl>, communication_index <dbl>,
## # enrichment_count <dbl>, choice_x_engagement <dbl>, income_x_homework <dbl>,
## # parent_ed_x_involvement <dbl>, engagement_sq <dbl>, homework_sq <dbl>
multi_test_mod
## # A tibble: 1,961 × 13
## SEGRADES SCCHOICE TTLHHINC homework_intensity engagement_count
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Mostly_As 1 3 3 2
## 2 Mostly_As 2 5 2 1
## 3 Mostly_As 1 8 4 4
## 4 Mostly_As 1 11 3.33 6
## 5 Mostly_As 1 2 2.67 0
## 6 Mostly_Cs 1 3 3.33 5
## 7 Mostly_As 1 5 3 4
## 8 Mostly_As 2 12 2.67 6
## 9 Mostly_As 2 8 2 1
## 10 Mostly_As 1 5 2.33 7
## # ℹ 1,951 more rows
## # ℹ 8 more variables: school_involvement <dbl>, communication_index <dbl>,
## # enrichment_count <dbl>, choice_x_engagement <dbl>, income_x_homework <dbl>,
## # parent_ed_x_involvement <dbl>, engagement_sq <dbl>, homework_sq <dbl>
library(ggplot2)
library(yardstick)
# Multinomial
autoplot(multi_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – Multinomial Logistic Regression")
# LASSO Multinomial
autoplot(lasso_multi_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – LASSO Multinomial Logistic Regression")
# LDA
autoplot(lda_confmat, type = "heatmap") +
labs(title = "Confusion Matrix – LDA")
To make the results easier to interpret, I summarized model performance in a single table comparing accuracy for the multinomial logistic regression, LASSO multinomial regression, and LDA models (a sketch of how that table can be assembled appears below). All three models landed at essentially the same accuracy, roughly 58–59%, with LDA showing a slightly higher kappa, so regularization did not meaningfully change overall test-set accuracy here. I then visualized the confusion matrices as heatmaps, which made it clear which grade categories were most often correctly classified and which were commonly confused. Across all models, "Mostly A's" were predicted well, while "Mostly B's", "Mostly C's", and "Ds or Below" were misclassified far more often, usually being pulled toward the "Mostly A's" prediction. These visual and numerical summaries helped me move beyond just a single accuracy number and think more carefully about how each model behaves for different groups of students.
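A minimal sketch of the comparison table described above, assembled from the metric tibbles computed earlier (multi_metrics, lasso_multi_metrics, lda_metrics):
# One row per model, one column per metric (accuracy and kappa).
bind_rows(
  multi_metrics %>% mutate(model = "Multinomial"),
  lasso_multi_metrics %>% mutate(model = "LASSO Multinomial"),
  lda_metrics %>% mutate(model = "LDA")
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)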