Objective 1: Probability, Inference, and Maximum Likelihood

Course Learning Objective: Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation

For this objective, I demonstrate my understanding by applying a Linear Discriminant Analysis model to the Cardiovascular Disease dataset. The goal of the analysis is to predict the presence of heart disease using three categorical predictors: chestpain, slope, and exerciseangia. These medically relevant variables are nominal in nature and were converted to factors to ensure compatibility with the LDA framework. This modeling approach enabled me to explore class separation driven by clinically significant features.

I will start by loading all the necessary libraries and imports the dataset from a CSV file.

Load packages and dataset

library(ggformula)
## Loading required package: ggplot2
## Loading required package: scales
## Loading required package: ggridges
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(ggplot2)
library(discrim)
## 
## Attaching package: 'discrim'
## 
## The following object is masked from 'package:dials':
## 
##     smoothness
library(nnet)
library(yardstick)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
# The dataset I have used in the entire modeling process, was acquired from one of the multispecialty hospitals in India. # I downloaded it from https://data.mendeley.com/datasets/dzz48mvjht/1 website.

cardiovascular_disease <- read_csv("Cardiovascular_Disease_Dataset.csv")
## Rows: 1000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): patientid, age, gender, chestpain, restingBP, serumcholestrol, fas...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cardiovascular_disease)
## # A tibble: 6 × 14
##   patientid   age gender chestpain restingBP serumcholestrol fastingbloodsugar
##       <dbl> <dbl>  <dbl>     <dbl>     <dbl>           <dbl>             <dbl>
## 1    103368    53      1         2       171               0                 0
## 2    119250    40      1         0        94             229                 0
## 3    119372    49      1         2       133             142                 0
## 4    132514    43      1         0       138             295                 1
## 5    146211    31      1         1       199               0                 0
## 6    148462    24      1         1       173               0                 0
## # ℹ 7 more variables: restingrelectro <dbl>, maxheartrate <dbl>,
## #   exerciseangia <dbl>, oldpeak <dbl>, slope <dbl>, noofmajorvessels <dbl>,
## #   target <dbl>

Check for missing values per column

cardiovascular_disease %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Values") %>%
  arrange(desc(Missing_Values))
## # A tibble: 14 × 2
##    Variable          Missing_Values
##    <chr>                      <int>
##  1 patientid                      0
##  2 age                            0
##  3 gender                         0
##  4 chestpain                      0
##  5 restingBP                      0
##  6 serumcholestrol                0
##  7 fastingbloodsugar              0
##  8 restingrelectro                0
##  9 maxheartrate                   0
## 10 exerciseangia                  0
## 11 oldpeak                        0
## 12 slope                          0
## 13 noofmajorvessels               0
## 14 target                         0
head(cardiovascular_disease)
## # A tibble: 6 × 14
##   patientid   age gender chestpain restingBP serumcholestrol fastingbloodsugar
##       <dbl> <dbl>  <dbl>     <dbl>     <dbl>           <dbl>             <dbl>
## 1    103368    53      1         2       171               0                 0
## 2    119250    40      1         0        94             229                 0
## 3    119372    49      1         2       133             142                 0
## 4    132514    43      1         0       138             295                 1
## 5    146211    31      1         1       199               0                 0
## 6    148462    24      1         1       173               0                 0
## # ℹ 7 more variables: restingrelectro <dbl>, maxheartrate <dbl>,
## #   exerciseangia <dbl>, oldpeak <dbl>, slope <dbl>, noofmajorvessels <dbl>,
## #   target <dbl>

Convert outcome and predictor variables to factors

Will then convert the categorical variables to factors before modeling

cardiovascular_data <- cardiovascular_disease %>%
  mutate(
    chestpain = factor(chestpain),
    slope = factor(slope),
    exerciseangia = factor(exerciseangia),
    target = factor(target, levels = c(0, 1), labels = c("No", "Yes"))
  )

head(cardiovascular_data)
## # A tibble: 6 × 14
##   patientid   age gender chestpain restingBP serumcholestrol fastingbloodsugar
##       <dbl> <dbl>  <dbl> <fct>         <dbl>           <dbl>             <dbl>
## 1    103368    53      1 2               171               0                 0
## 2    119250    40      1 0                94             229                 0
## 3    119372    49      1 2               133             142                 0
## 4    132514    43      1 0               138             295                 1
## 5    146211    31      1 1               199               0                 0
## 6    148462    24      1 1               173               0                 0
## # ℹ 7 more variables: restingrelectro <dbl>, maxheartrate <dbl>,
## #   exerciseangia <fct>, oldpeak <dbl>, slope <fct>, noofmajorvessels <dbl>,
## #   target <fct>

Fit the LDA model

Using the discrim_linear() function from the tidymodels package with the MASS engine, I fit an LDA model. This model assumes normal distribution of predictors within each class and equal covariance matrices.

lda_heart <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") %>%
  fit(target ~ chestpain + slope + exerciseangia, data = cardiovascular_data)
lda_heart
## parsnip model object
## 
## Call:
## lda(target ~ chestpain + slope + exerciseangia, data = data)
## 
## Prior probabilities of groups:
##   No  Yes 
## 0.42 0.58 
## 
## Group means:
##     chestpain1 chestpain2 chestpain3    slope1     slope2    slope3
## No   0.1666667 0.07857143 0.01190476 0.5428571 0.02857143 0.0000000
## Yes  0.2655172 0.48103448 0.06724138 0.1224138 0.53448276 0.3431034
##     exerciseangia1
## No       0.5214286
## Yes      0.4810345
## 
## Coefficients of linear discriminants:
##                        LD1
## chestpain1      0.73920092
## chestpain2      0.94466126
## chestpain3      1.04919170
## slope1          1.00501855
## slope2          4.04526585
## slope3          4.18066084
## exerciseangia1 -0.07249772

The coefficients from the LDA model suggest that the slope of the ST segment is the strongest predictor of heart disease. In particular, slope types 2 and 3 have the highest positive discriminant values, indicating they are strongly associated with the presence of heart disease. Similarly, higher levels of chest pain, such as non-anginal pain and asymptomatic cases, also contribute positively to predicting heart disease. In contrast, exercise-induced angina (exerciseangia1) shows a small negative coefficient, suggesting it has minimal discriminatory power in distinguishing between patients with and without heart disease in this dataset.

Generate class probability predictions

The LDA model estimates the likelihood of an observation belonging to each class by combining class prior probabilities with the distribution of predictor values, making it a direct application of Bayes’ Theorem. These class-specific means and the shared covariance matrix are estimated using maximum likelihood, enabling the model to construct discriminant functions that optimally separate the classes. This structure not only supports accurate classification but also facilitates inference through the interpretation of coefficients and predicted class probabilities.

predict(lda_heart, new_data = cardiovascular_data, type = "prob")
## # A tibble: 1,000 × 2
##     .pred_No .pred_Yes
##        <dbl>     <dbl>
##  1 0.0000937  1.00    
##  2 0.996      0.00430 
##  3 0.911      0.0887  
##  4 0.00442    0.996   
##  5 0.000195   1.00    
##  6 0.000195   1.00    
##  7 0.000152   1.00    
##  8 0.997      0.00332 
##  9 1.00       0.000119
## 10 0.00442    0.996   
## # ℹ 990 more rows

The predicted probabilities indicate the model is highly confident in its classifications. For most observations, the probability of either “Yes” (disease) or “No” (no disease) is close to 1. This shows that the LDA model distinguishes well between the two classes based on the input features. The sharp separation in probabilities reflects the model’s strong linear discriminant boundary.

Model Evaluation

augment(lda_heart, new_data = cardiovascular_data) %>%
  conf_mat(truth = target, estimate = .pred_class)
##           Truth
## Prediction  No Yes
##        No  408  71
##        Yes  12 509

The model correctly classified 408 “No” and 509 “Yes” cases, with only 83 total misclassifications. This reflects high overall accuracy and balanced performance across both classes.

augment(lda_heart, new_data = cardiovascular_data) %>%
  accuracy(truth = target, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.917

The model achieved an accuracy of 91.7%, meaning it correctly predicted heart disease status in over 9 out of 10 cases. This indicates strong performance for a classification task using LDA.

Visualize prediction vs actual Heart Disease Status

The plot below shows how predictions align with actual outcomes, making the classification accuracy more interpretable.

lda_preds <- predict(lda_heart, new_data = cardiovascular_data) %>%
  bind_cols(cardiovascular_data %>% dplyr::select(target))

gf_bar(~ .pred_class | target, 
       data = lda_preds, 
       fill = ~.pred_class, 
       position = "dodge", 
       alpha = 0.8) %>%
  gf_labs(
    title = "LDA: Predicted vs Actual Heart Disease Status",
    x = "Predicted Class", 
    y = "Count"
  )

The above plot shows that most predictions align with the actual heart disease status, with few misclassifications. The LDA model performed well, especially in predicting cases labeled “Yes” for heart disease.

Objective 2: Generalized Linear Models in Multiclass Classification

Course Learning Objective: Apply the appropriate generalized linear model for a specific data context

For this objective, I applied a multinomial logistic regression model to the Cardiovascular Disease dataset to predict the type of chest pain (chestpain) experienced by patients. The outcome variable includes four categories: typical angina, atypical angina, non-anginal pain, and asymptomatic. These distinct categories make it a strong candidate for a multinomial model, as it goes beyond binary classification. The goal was to examine how multiple predictors, including age, slope, oldpeak, maxheartrate, restingBP, and exerciseangia, influence the likelihood of each chest pain category.

The code below prepares the data by converting necessary variables to factors and ensuring there are no missing values for the selected predictors.

cardiovascular_data1 <- cardiovascular_disease %>%
  mutate(
    chestpain = case_match(
      chestpain,
      0 ~ "typical angina",
      1 ~ "atypical angina",
      2 ~ "non-anginal pain",
      3 ~ "asymptomatic"
    ),
    slope = case_match(
      slope,
      1 ~ "upsloping",
      2 ~ "flat",
      3 ~ "downsloping"
    ),
    exerciseangia = case_match(
      exerciseangia,
      0 ~ "no",
      1 ~ "yes"
    ),
    gender = case_match(
      gender,
      0 ~ "female",
      1 ~ "male"
    ),
    chestpain = factor(chestpain),
    slope = factor(slope),
    exerciseangia = factor(exerciseangia),
    gender = factor(gender)
  ) %>%
  drop_na(chestpain, age, slope, oldpeak, maxheartrate, restingBP, exerciseangia)

head(cardiovascular_data1)
## # A tibble: 6 × 14
##   patientid   age gender chestpain   restingBP serumcholestrol fastingbloodsugar
##       <dbl> <dbl> <fct>  <fct>           <dbl>           <dbl>             <dbl>
## 1    103368    53 male   non-angina…       171               0                 0
## 2    119250    40 male   typical an…        94             229                 0
## 3    119372    49 male   non-angina…       133             142                 0
## 4    132514    43 male   typical an…       138             295                 1
## 5    146211    31 male   atypical a…       199               0                 0
## 6    148462    24 male   atypical a…       173               0                 0
## # ℹ 7 more variables: restingrelectro <dbl>, maxheartrate <dbl>,
## #   exerciseangia <fct>, oldpeak <dbl>, slope <fct>, noofmajorvessels <dbl>,
## #   target <dbl>

Fit multinomial logistic regression model

The model was fitted using multinom_reg() from tidymodels, selecting multiple predictors relevant to cardiovascular function and symptom presentation. The nnet engine handles the multinomial case efficiently

multi_model <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification") %>%
  fit(chestpain ~ age + slope + oldpeak + maxheartrate + restingBP + exerciseangia, data = cardiovascular_data1)
multi_model
## parsnip model object
## 
## Call:
## nnet::multinom(formula = chestpain ~ age + slope + oldpeak + 
##     maxheartrate + restingBP + exerciseangia, data = data, trace = FALSE)
## 
## Coefficients:
##                  (Intercept)           age slopeflat slopeupsloping    oldpeak
## atypical angina    -2.215566 -3.724011e-03 0.6331198       1.441832 0.16767374
## non-anginal pain    4.487791  6.651009e-03 0.2801581      -0.103724 0.10516310
## typical angina      4.022153 -5.622435e-05 0.5544377       2.236218 0.09455084
##                  maxheartrate    restingBP exerciseangiayes
## atypical angina   0.010252589  0.007794564      0.008277056
## non-anginal pain  0.013356538 -0.032180090      0.149008794
## typical angina    0.008959388 -0.029824880      0.229483468
## 
## Residual Deviance: 1691.75 
## AIC: 1739.75

The coefficients above show how each predictor influences the odds of being in each chest pain category relative to the baseline (“asymptomatic”). For example, having exercise-induced angina (exerciseangiayes) increases the odds of “typical angina” compared to the baseline, while higher age has a small negative effect across all categories.

Make predictions and bind results

Here, I generated class predictions and probabilities, then bound them with the original outcome to enable evaluation.

preds <- predict(multi_model, new_data = cardiovascular_data1) %>%
  bind_cols(predict(multi_model, new_data = cardiovascular_data1, type = "prob")) %>%
  bind_cols(cardiovascular_data1 %>% dplyr::select(chestpain))
head(preds, 10)
## # A tibble: 10 × 6
##    .pred_class  .pred_asymptomatic .pred_atypical angin…¹ .pred_non-anginal pa…²
##    <fct>                     <dbl>                  <dbl>                  <dbl>
##  1 non-anginal…            0.0756                  0.282                   0.485
##  2 typical ang…            0.00601                 0.0301                  0.209
##  3 typical ang…            0.00533                 0.107                   0.239
##  4 non-anginal…            0.0298                  0.125                   0.594
##  5 atypical an…            0.121                   0.544                   0.235
##  6 non-anginal…            0.0762                  0.335                   0.428
##  7 non-anginal…            0.0214                  0.0687                  0.688
##  8 typical ang…            0.00506                 0.0850                  0.243
##  9 atypical an…            0.0829                  0.452                   0.325
## 10 atypical an…            0.0725                  0.426                   0.341
## # ℹ abbreviated names: ¹​`.pred_atypical angina`, ²​`.pred_non-anginal pain`
## # ℹ 2 more variables: `.pred_typical angina` <dbl>, chestpain <fct>

The predicted class matches the highest probability among the four chest pain types for each observation. Most predictions are confident, as seen by one class having a dominant probability, indicating the model distinguishes well between categories.

Model Evaluation

To evaluate the model, I used a confusion matrix and calculated the accuracy:

# Evaluate model using original data
conf_mat(preds, truth = chestpain, estimate = .pred_class)
##                   Truth
## Prediction         asymptomatic atypical angina non-anginal pain typical angina
##   asymptomatic                0               0                0              0
##   atypical angina            25              86               47             33
##   non-anginal pain           15              60              216             83
##   typical angina              1              43               41            170
accuracy(preds, truth = chestpain, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.576

The confusion matrix indicates the model most accurately predicts “typical angina” (170 correct), while misclassifying a significant number of other chest pain types. The accuracy on the training data is 57.6%, confirming moderate predictive power. To improve model robustness, I implemented 5-fold cross-validation:

Create 5-fold cross-validation splits for more robust model evaluation

set.seed(123)
cv_splits <- vfold_cv(cardiovascular_data1, v = 5)

multi_model <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Create a workflow with the model and formula
multi_wf <- workflow() %>%
  add_model(multi_model) %>%
  add_formula(chestpain ~ age + slope + oldpeak + maxheartrate + restingBP + exerciseangia)

# Fit the model to the 5-fold resamples and evaluate with accuracy and ROC AUC
cv_results <- fit_resamples(
  multi_wf,
  resamples = cv_splits,
  metrics = metric_set(accuracy, roc_auc),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(cv_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.574     5  0.0212 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.689     5  0.0130 Preprocessor1_Model1

The 5-fold cross-validation further supports this with a similar accuracy (57.4%) and an average ROC AUC of 0.689, suggesting the model distinguishes between classes better than random guessing but lacks high discriminative strength. The model performs decently but may benefit from feature engineering or alternative modeling techniques, such as backward or forward selection to eliminate non-significant predictors.

Visualize predictions

Visualization using gf_bar() helped me understand prediction distribution across the actual categories as shown below

gf_bar(~ .pred_class | chestpain, data = preds, fill = ~.pred_class, position = "dodge") %>%
  gf_labs(title = "Predicted vs Actual Chest Pain Type", x = "Predicted", y = "Count")

The plot shows a decent alignment between predicted and actual chest pain types, especially for “non-anginal pain” and “typical angina,” where most predictions match the true class. However, there’s noticeable misclassification in “atypical angina” and “asymptomatic” categories, where predictions are more dispersed across other types. This suggests the model struggles to differentiate subtler classes, possibly due to overlapping predictor patterns.

Objective 3: Model Selection Using Ridge Regression

Course Learning Objective: Demonstrate model selection given a set of candidate models

To demonstrate this objective, I used Ridge Regression to model the presence of heart disease (target) using all available predictors in the Cardiovascular Disease dataset. Ridge Regression applies L2 regularization, which helps reduce overfitting by shrinking large coefficients. To select the best model configuration, I performed 5-fold cross-validation and tuned the penalty parameter (lambda) over a predefined grid. The goal was to identify the model that minimized RMSE and achieved strong generalization.

Split the dataset

The code below splits the data into 80% training and 20% testing sets while preserving the target distribution, and sets up 5-fold cross-validation for robust model evaluation.

set.seed(123)
cardio_split <- initial_split(cardiovascular_data1, prop = 0.8, strata = noofmajorvessels)
cardio_train <- training(cardio_split)
cardio_test <- testing(cardio_split)

cardio_folds <- vfold_cv(cardio_train, v = 5)

Define recipe and model specification for Ridge Regression

The code below prepares a Ridge regression model by defining a recipe with preprocessing steps, specifying the model with tuning, creating a workflow, and generating a grid of penalty values for tuning.

ridge_recipe <- 
  recipe(noofmajorvessels ~ oldpeak + slope + exerciseangia + chestpain, data = cardio_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

ridge_spec <- 
  linear_reg(penalty = tune(), mixture = 0) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

ridge_workflow <- workflow() %>% 
  add_recipe(ridge_recipe) %>% 
  add_model(ridge_spec)

ridge_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
ridge_grid
## # A tibble: 50 × 1
##      penalty
##        <dbl>
##  1 0.00001  
##  2 0.0000160
##  3 0.0000256
##  4 0.0000409
##  5 0.0000655
##  6 0.000105 
##  7 0.000168 
##  8 0.000268 
##  9 0.000429 
## 10 0.000687 
## # ℹ 40 more rows

The output displays a grid of 50 penalty values generated for Ridge regression tuning. These values enable model selection by identifying the optimal level of regularization that minimizes error and enhances generalization

Model tuning with cross-validation

ridge_results <- tune_grid(
  ridge_workflow,
  resamples = cardio_folds,
  grid = ridge_grid,
  metrics = metric_set(rmse, rsq)
)
## → A | warning: A correlation computation is required, but `estimate` is constant and has 0
##                standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## There were issues with some computations   A: x1There were issues with some computations   A: x2There were issues with some computations   A: x3There were issues with some computations   A: x4There were issues with some computations   A: x5There were issues with some computations   A: x5
ridge_results
## # Tuning results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits            id    .metrics           .notes          
##   <list>            <chr> <list>             <list>          
## 1 <split [523/131]> Fold1 <tibble [100 × 5]> <tibble [1 × 3]>
## 2 <split [523/131]> Fold2 <tibble [100 × 5]> <tibble [1 × 3]>
## 3 <split [523/131]> Fold3 <tibble [100 × 5]> <tibble [1 × 3]>
## 4 <split [523/131]> Fold4 <tibble [100 × 5]> <tibble [1 × 3]>
## 5 <split [524/130]> Fold5 <tibble [100 × 5]> <tibble [1 × 3]>
## 
## There were issues with some computations:
## 
##   - Warning(s) x5: A correlation computation is required, but `estimate` is constant...
## 
## Run `show_notes(.Last.tune.result)` for more information.

Visualization of tuning results

autoplot(ridge_results)

The plot above shows as regularization (penalty) increases, RMSE rises and R-squared decreases, indicating reduced model performance. This plot supports the objective to demonstrate model selection given a set of candidate models by helping identify the optimal penalty that balances model complexity and predictive accuracy.

collect_metrics(ridge_results)
## # A tibble: 100 × 7
##      penalty .metric .estimator  mean     n std_err .config              
##        <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   0.893     5 0.00374 Preprocessor1_Model01
##  2 0.00001   rsq     standard   0.142     5 0.0225  Preprocessor1_Model01
##  3 0.0000160 rmse    standard   0.893     5 0.00374 Preprocessor1_Model02
##  4 0.0000160 rsq     standard   0.142     5 0.0225  Preprocessor1_Model02
##  5 0.0000256 rmse    standard   0.893     5 0.00374 Preprocessor1_Model03
##  6 0.0000256 rsq     standard   0.142     5 0.0225  Preprocessor1_Model03
##  7 0.0000409 rmse    standard   0.893     5 0.00374 Preprocessor1_Model04
##  8 0.0000409 rsq     standard   0.142     5 0.0225  Preprocessor1_Model04
##  9 0.0000655 rmse    standard   0.893     5 0.00374 Preprocessor1_Model05
## 10 0.0000655 rsq     standard   0.142     5 0.0225  Preprocessor1_Model05
## # ℹ 90 more rows

The above output shows Ridge regression tuning results across penalty values, with consistent RMSE (0.893) and R-squared (0.142) across models. It supports the objective to demonstrate model selection given a set of candidate models by helping identify which penalty yields optimal predictive performance with minimal error.

Best penalty selection

best_ridge <- select_best(ridge_results, metric = "rmse")
best_ridge
## # A tibble: 1 × 2
##   penalty .config              
##     <dbl> <chr>                
## 1 0.00001 Preprocessor1_Model01

The output shows that the optimal penalty value for the Ridge regression model—based on the lowest RMSE—is 1e-05. It supports the objective by identifying the best regularization strength through systematic tuning.

Final workflow fit

final_ridge <- finalize_workflow(ridge_workflow, best_ridge)

ridge_fit <- fit(final_ridge, data = cardio_train)

Generate predictions on test data

ridge_preds <- predict(ridge_fit, new_data = cardio_test) %>%
  bind_cols(cardio_test %>% dplyr::select(noofmajorvessels))
ridge_preds
## # A tibble: 166 × 2
##    .pred noofmajorvessels
##    <dbl>            <dbl>
##  1 0.968                1
##  2 1.81                 2
##  3 0.930                0
##  4 1.46                 0
##  5 1.60                 1
##  6 1.89                 2
##  7 0.912                2
##  8 1.57                 2
##  9 1.42                 2
## 10 0.908                0
## # ℹ 156 more rows

The model’s predictions closely align with the actual values of noofmajorvessels, indicating good performance in this regression task

Evaluate model performance on test set

rmse(ridge_preds, truth = noofmajorvessels, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.871
rsq(ridge_preds, truth = noofmajorvessels, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.187

The ridge regression model achieved an RMSE of 0.87 and an R² of 0.19 on the test set, indicating moderate prediction accuracy and that about 19% of the variance in the target is explained by the selected predictors.

Objective 4: Expressing Model Results to a General Audience

Course Learning Objective: Express the results of statistical models to a general audience

In this objective, I apply multiple linear regression to examine how several predictors, slope of the ST segment, chest pain type, resting blood pressure, maximum heart rate, and serum cholesterol—relate to the number of major vessels observed in patients. The primary goal of this activity is not only to fit a meaningful model, but also to interpret and communicate its findings clearly and accessibly for a general audience without a statistical background.

Select predictors and outcome for pairwise visualization

I will use ggpairs to visualize the numeric variables

# Select only numeric columns for the ggpairs plot
cardiovascular_data1 %>%
  dplyr::select(noofmajorvessels, restingBP, serumcholestrol, oldpeak, age, maxheartrate) %>%
  ggpairs()

The ggpairs plot shows that noofmajorvessels has weak positive correlations with variables like restingBP (0.192), maxheartrate (0.107), and serumcholestrol (0.092). These predictors may provide some predictive value, but overall the linear relationships are relatively weak.

Will use the Boxplot to display categorical predictors

# Individual plots
gf_boxplot(noofmajorvessels ~ gender, data = cardiovascular_data1, fill = ~gender) %>%
  gf_labs(title = "No. of Major Vessels by Gender", x = "Gender", y = "No. of Major Vessels")

gf_boxplot(noofmajorvessels ~ chestpain, data = cardiovascular_data1, fill = ~chestpain) %>%
  gf_labs(title = "By Chest Pain Type", x = "Chest Pain", y = "")

gf_boxplot(noofmajorvessels ~ fastingbloodsugar, data = cardiovascular_data1, fill = ~fastingbloodsugar) %>%
  gf_labs(title = "By Fasting Blood Sugar", x = "Fasting Sugar", y = "")
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

gf_boxplot(noofmajorvessels ~ restingrelectro, data = cardiovascular_data1, fill = ~restingrelectro) %>%
  gf_labs(title = "By Resting ECG", x = "Resting ECG", y = "")
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

gf_boxplot(noofmajorvessels ~ exerciseangia, data = cardiovascular_data1, fill = ~exerciseangia) %>%
  gf_labs(title = "By Exercise Angina", x = "Exercise Angina", y = "")

gf_boxplot(noofmajorvessels ~ slope, data = cardiovascular_data1, fill = ~slope) %>%
  gf_labs(title = "By Slope", x = "Slope", y = "")

Having a look at all the plots, slope and chest pain type tend to have potential predictive value for the noofmajorvessels response variable whereby downsloping tends to be associated with higher vessel counts and asymptomatic patients appear to have slightly higher vessel counts.

So we will use slope, chest pain, restingBP, maxheartrate, and serumcholestrol numeric variables to fit the model

Fit a multiple linear regression model

I will re-use the variables that has the training and testing data from objective 3 in the code below

#fit the mlr model
mlr_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(noofmajorvessels ~ slope + chestpain + restingBP + maxheartrate + serumcholestrol,
      data = cardio_train)

tidy(mlr_model)
## # A tibble: 9 × 5
##   term                       estimate std.error statistic  p.value
##   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                1.07      0.337        3.18  1.54e- 3
## 2 slopeflat                 -0.377     0.0889      -4.24  2.60e- 5
## 3 slopeupsloping            -0.833     0.101       -8.25  9.04e-16
## 4 chestpainatypical angina  -0.0217    0.163       -0.133 8.95e- 1
## 5 chestpainnon-anginal pain  0.201     0.159        1.26  2.07e- 1
## 6 chestpaintypical angina    0.0784    0.164        0.478 6.33e- 1
## 7 restingBP                  0.00357   0.00137      2.61  9.13e- 3
## 8 maxheartrate               0.000578  0.00110      0.525 6.00e- 1
## 9 serumcholestrol            0.000113  0.000252     0.450 6.53e- 1

When all predictors are at their reference level or 0, the expected number of major vessels is approximately 1.07. The ST slope is the strongest predictor, with both flat and upsloping categories significantly associated with fewer major vessels compared to the baseline (downsloping). Additionally, resting blood pressure shows a small but statistically significant positive effect, while chest pain type, serum cholesterol, and max heart rate do not significantly influence the outcome.

Check for multicollinearity

lm_model <- extract_fit_engine(mlr_model)
vif(lm_model)
##                     GVIF Df GVIF^(1/(2*Df))
## slope           1.390890  2        1.085983
## chestpain       1.500300  3        1.069949
## restingBP       1.316622  1        1.147441
## maxheartrate    1.069045  1        1.033946
## serumcholestrol 1.044457  1        1.021987

All GVIF values are below the common threshold of 5, indicating no serious multicollinearity among the predictors.

Model prediction and evaluation

# Predict on test set and evaluate
mlr_predictions <- predict(mlr_model, new_data = cardio_test) %>%
  bind_cols(cardio_test)

metrics(mlr_predictions, truth = noofmajorvessels, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.860
## 2 rsq     standard       0.205
## 3 mae     standard       0.715

The model achieved an RMSE of approximately 0.86 and MAE of 0.72, indicating moderate prediction error in estimating the number of major vessels. An R² value of 0.20 suggests the model explains about 20% of the variance in the outcome.

lm_model <- extract_fit_engine(mlr_model)

# Plot the 4 standard residual diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)

The Residuals vs Fitted plot shows that most of the prediction errors are fairly balanced, although the spread widens slightly for some fitted values. The Normal Q-Q plot suggests that the errors mostly follow a normal distribution, which is a good sign for model accuracy. The Scale-Location plot shows that the consistency of the errors varies somewhat across predicted values. Lastly, the Residuals vs Leverage plot highlights a few points that stand out, but none are strong enough to significantly affect the overall model.

Objective 5: Use programming software to fit and assess statistical models

Course Learning Objective: Use programming software to fit and assess statistical models

In this objective, I apply polynomial regression to explore how resting blood pressure, serum cholesterol, maximum heart rate, oldpeak, slope and chest pain relate to the number of major vessels observed in patients. The primary goal of this activity is to practice using statistical programming software to fit and assess a more complex regression model. Beyond model fitting, I focus on evaluating assumptions, interpreting output, and using diagnostics to ensure the model is statistically sound and computationally well-executed.

Having explored the data from the previous objectives, I will proceed to fit the polynomial regression model. I will also use the training data and test data from objective 4.

Define and fit the polynomial regression model

poly_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

poly_fit <- poly_model %>%
  fit(noofmajorvessels ~ 
        restingBP + I(restingBP^2) +
        serumcholestrol + I(serumcholestrol^2) +
        maxheartrate + I(maxheartrate^2) +
        oldpeak + I(oldpeak^2) +
        chestpain + slope,
      data = cardio_train
  )

tidy(poly_fit)
## # A tibble: 14 × 5
##    term                         estimate  std.error statistic  p.value
##    <chr>                           <dbl>      <dbl>     <dbl>    <dbl>
##  1 (Intercept)                0.335      1.17           0.287 7.74e- 1
##  2 restingBP                  0.0127     0.0138         0.920 3.58e- 1
##  3 I(restingBP^2)            -0.0000301  0.0000451     -0.666 5.05e- 1
##  4 serumcholestrol           -0.000884   0.000865      -1.02  3.07e- 1
##  5 I(serumcholestrol^2)       0.00000177 0.00000145     1.22  2.23e- 1
##  6 maxheartrate               0.00443    0.00919        0.483 6.29e- 1
##  7 I(maxheartrate^2)         -0.0000139  0.0000317     -0.437 6.62e- 1
##  8 oldpeak                   -0.0742     0.0803        -0.924 3.56e- 1
##  9 I(oldpeak^2)               0.0141     0.0126         1.12  2.64e- 1
## 10 chestpainatypical angina  -0.0537     0.164         -0.327 7.44e- 1
## 11 chestpainnon-anginal pain  0.167      0.162          1.03  3.01e- 1
## 12 chestpaintypical angina    0.0426     0.168          0.253 8.00e- 1
## 13 slopeflat                 -0.389      0.0893        -4.35  1.57e- 5
## 14 slopeupsloping            -0.805      0.105         -7.69  5.48e-14

The intercept (0.33) represents the predicted number of major vessels when all numeric predictors are zero, though this has limited real-world meaning. All numeric predictors, including their squared terms are not statistically significant, indicating weak predictive power. However, the categorical variable slope (flat and upsloping) is highly significant, its p value is less than 0.05, while chest pain types do not show a meaningful effect.

Model prediction and evaluation

# Predict on test set
poly_preds <- predict(poly_fit, new_data = cardio_test) %>%
  bind_cols(cardio_test)

# Evaluate performance
metrics(poly_preds, truth = noofmajorvessels, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.863
## 2 rsq     standard       0.199
## 3 mae     standard       0.713

The model has modest explanatory power (R² = 0.20), meaning it explains about 20% of the variation in the number of major vessels. The prediction errors are moderate, with an RMSE of 0.86 and an MAE of 0.71, indicating the model’s predictions are somewhat close to actual values but still have room for improvement.

Plot predicted vs actual

gf_point(noofmajorvessels ~ .pred, data = poly_preds) %>%
  gf_labs(title = "Predicted vs Actual: Polynomial Regression",
          x = "Predicted Number of Major Vessels",
          y = "Actual Number of Major Vessels")

The plot shows that the predicted number of major vessels mostly falls between 1 and 2, regardless of the actual values ranging from 0 to 3. This suggests that the polynomial regression model tends to underpredict or average predictions, limiting its ability to capture more extreme values accurately.

poly_model <- extract_fit_engine(poly_fit)

par(mfrow = c(2, 2))
plot(poly_model)

The diagnostic plots reveal that the Residuals vs Fitted plot displays a slight funnel shape, indicating mild heteroscedasticity. The Q-Q plot suggests the residuals generally follow a normal distribution, with minor deviations in the tails. The Scale-Location plot shows some variance inconsistency across fitted values, reinforcing the presence of non-constant error spread. Finally, the Residuals vs Leverage plot highlights a few observations with higher leverage, but none of the high leverage points seem to have an unreasonable or overly strong effect on the model’s outcome.