Discriminant Function Analysis: A Supervised Classification Technique

A requirement in PhDITI 617
Author

Jamal Kay B. Rogers

Published

September 23, 2023

This presentation uses the R programming language and some R packages for modeling and data transformation.
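
The examples that follow assume a few packages are loaded up front; this is a minimal setup sketch (the exact package list is an assumption, inferred from the functions used later).

library(tidyverse)   # dplyr, tidyr, ggplot2: data transformation and plots
library(tidymodels)  # rsample, recipes, parsnip, workflows, yardstick
library(discrim)     # parsnip engines for discrim_linear(), discrim_quad(), discrim_flexible()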

What is DFA?

Introduction: DFA as a Classifier

Suppose we have recorded data from 12 patients with the following variables: Type of Infection, C-Reactive Protein (CRP), and Temperature (Temp).

   infection  CRP Temp
1      Viral 40.0 36.0
2      Viral 11.1 37.2
3      Viral 30.0 36.5
4      Viral 21.4 39.4
5      Viral 10.7 39.6
6      Viral  3.4 40.7
7  Bacterial 42.0 37.6
8  Bacterial 31.1 42.2
9  Bacterial 50.0 38.5
10 Bacterial 60.4 39.4
11 Bacterial 45.7 38.6
12 Bacterial 17.3 42.7

Our aim is to classify whether a patient’s infection is viral or bacterial (dependent variable) given the CRP and temperature (independent variables).

Use Case

It usually takes several hours or days to determine whether a patient has a viral or a bacterial infection, which delays the appropriate treatment.

DFA can help classify whether a new patient has a viral or a bacterial infection based on CRP and Temperature.

Classify based on CRP

Classify based on Temp

Classify based on CRP and Temp

The Discriminant Function

A line that separates the two infection types indicates that linear discriminant analysis (LDA) would do a good job of classifying the infections by combining the two variables. The discriminant function is represented by a new variable, LD1, a linear combination of CRP and Temp (with more than two groups, additional functions such as LD2 appear).
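
As a minimal sketch of that idea, the discriminant function for the 12 patients can be fit directly with MASS::lda(); the data frame name infections is assumed here for illustration.

# the 12 patients shown above, as a data frame
infections <- data.frame(
  infection = factor(c(rep("Viral", 6), rep("Bacterial", 6))),
  CRP  = c(40.0, 11.1, 30.0, 21.4, 10.7, 3.4, 42.0, 31.1, 50.0, 60.4, 45.7, 17.3),
  Temp = c(36.0, 37.2, 36.5, 39.4, 39.6, 40.7, 37.6, 42.2, 38.5, 39.4, 38.6, 42.7)
)

# fit a discriminant function that combines CRP and Temp
infection_lda <- MASS::lda(infection ~ CRP + Temp, data = infections)

# classify the same patients with the fitted function
predict(infection_lda, infections)$class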

Definition

Discriminant Function Analysis (DFA) is a statistical technique that classifies observations into groups based on certain characteristics. It creates new variables, called discriminant functions, that maximize the differences between groups while minimizing the differences within each group. These functions are then used to assign each observation to a group based on the values of the independent variables.

DFA is typically used to predict membership in naturally occurring groups, answering the question of whether a set of variables can predict group membership.

Example Plot: Linear Discriminant Analysis

Example Plot: Flexible Discriminant Analysis

Widely Used DFA Methods

  • Linear Discriminant Analysis (LDA). Uses linear combinations of predictors to predict the class of a given observation. Assumes that the predictor variables (p) are normally distributed and the classes have identical variances (for univariate analysis, p = 1), or identical covariance matrices (for multivariate analysis, p > 1).

  • Quadratic Discriminant Analysis (QDA). More flexible than LDA. Here, there is no assumption that the covariance matrix of classes is the same.

  • Flexible Discriminant Analysis (FDA). Non-linear combinations of predictors are used (the tidymodels specifications for all three methods are sketched after this list).
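
In tidymodels, these three methods map to the parsnip model specifications used later in this presentation; they are sketched together here as a quick reference.

# LDA: linear boundaries, one shared covariance matrix (MASS engine)
discrim_linear() |> set_engine("MASS") |> set_mode("classification")

# QDA: quadratic boundaries, class-specific covariance matrices (MASS engine)
discrim_quad() |> set_engine("MASS") |> set_mode("classification")

# FDA: non-linear combinations of predictors via MARS basis functions (earth engine)
discrim_flexible() |> set_engine("earth") |> set_mode("classification")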

The Palmer Penguins Data Set

The Palmer Penguins data set contains size measurements for adult foraging penguins near Palmer Station, Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

You can access the data set as a CSV file here: GitHub Gist
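
One way to load the data in R (the route assumed for the output below) is the palmerpenguins package, which bundles the same data set; the gist URL is left as a placeholder.

library(palmerpenguins)  # makes the penguins tibble (344 rows, 8 columns) available

# alternatively, read the CSV from the gist linked above:
# penguins <- readr::read_csv("<gist URL>")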

Goal

Create a discriminant model that predicts the species to which a penguin belongs, given relevant characteristics.

The Data Set

Let’s examine the data set.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Exploratory Data Analysis

penguins |>
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
penguins |>
  ggplot() +
  geom_bar(aes(x = species, fill = species))

penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))

penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = flipper_length_mm, color = species))

penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = body_mass_g, color = species))

Final Dataset

We remove the unwanted variables island, sex, and year, drop rows with NA values, and retain the variables that we want to analyze:

Independent variables:

bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g

Dependent variable:

species

datapeng <- penguins |>
        drop_na() |>
        select(-year, -island, -sex)
head(datapeng)
# A tibble: 6 × 5
  species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>            <dbl>         <dbl>             <int>       <int>
1 Adelie            39.1          18.7               181        3750
2 Adelie            39.5          17.4               186        3800
3 Adelie            40.3          18                 195        3250
4 Adelie            36.7          19.3               193        3450
5 Adelie            39.3          20.6               190        3650
6 Adelie            38.9          17.8               181        3625
glimpse(datapeng)
Rows: 333
Columns: 5
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…

Supervised Modeling Framework

Splitting Data into Training and Testing with Stratification

80% Training, 20% Testing

set.seed(18)
datapeng_split <- initial_split(datapeng, prop = 0.8, strata = species)
datapeng_train <- training(datapeng_split)
datapeng_test <- testing(datapeng_split)
datapeng_split
<Training/Testing/Total>
<265/68/333>

Resampling using 10-Fold Cross Validation

set.seed(18)
samples_peng <- vfold_cv(datapeng_train, strata = species)
samples_peng
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits           id    
   <list>           <chr> 
 1 <split [237/28]> Fold01
 2 <split [237/28]> Fold02
 3 <split [237/28]> Fold03
 4 <split [237/28]> Fold04
 5 <split [238/27]> Fold05
 6 <split [239/26]> Fold06
 7 <split [240/25]> Fold07
 8 <split [240/25]> Fold08
 9 <split [240/25]> Fold09
10 <split [240/25]> Fold10

Linear Discriminant Analysis (LDA)

ldamodel_peng <-
  discrim_linear() |>
  set_engine('MASS') |>
  set_mode("classification")

ldarecipe_peng <-
  recipe(species ~., data = datapeng_train) |>
  step_scale(all_numeric_predictors())

ldaworkflow_peng <-
        workflow() |>
        add_recipe(ldarecipe_peng) |>
        add_model(ldamodel_peng)

ldaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_linear()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Linear Discriminant Model Specification (classification)

Computational engine: MASS 

LDA Training Performance

ldafit_peng <-
        fit_resamples(
                ldaworkflow_peng,
                samples_peng
        )

ldafit_peng |>
        collect_metrics()
# A tibble: 2 × 6
  .metric  .estimator  mean     n  std_err .config             
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy multiclass 0.985    10 0.00620  Preprocessor1_Model1
2 roc_auc  hand_till  1.00     10 0.000463 Preprocessor1_Model1

Quadratic Discriminant Analysis (QDA)

qdamodel_peng <-
  discrim_quad() |>
  set_engine('MASS') |>
  set_mode("classification")

qdarecipe_peng <-
  recipe(species ~., data = datapeng_train)

qdaworkflow_peng <-
        workflow() |>
        add_recipe(qdarecipe_peng) |>
        add_model(qdamodel_peng)

qdaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_quad()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Quadratic Discriminant Model Specification (classification)

Computational engine: MASS 

QDA Training Performance

qdafit_peng <-
        fit_resamples(
                qdaworkflow_peng,
                samples_peng
        )

qdafit_peng |>
        collect_metrics()
# A tibble: 2 × 6
  .metric  .estimator  mean     n  std_err .config             
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy multiclass 0.989    10 0.00569  Preprocessor1_Model1
2 roc_auc  hand_till  0.999    10 0.000926 Preprocessor1_Model1

Flexible Discriminant Analysis (FDA)

fdamodel_peng <-
  discrim_flexible() |>
  set_engine('earth') |>
  set_mode("classification")

fdarecipe_peng <-
  recipe(species ~., data = datapeng_train)

fdaworkflow_peng <-
        workflow() |>
        add_recipe(fdarecipe_peng) |>
        add_model(fdamodel_peng)

fdaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_flexible()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Flexible Discriminant Model Specification (classification)

Computational engine: earth 

FDA Training Performance

fdafit_peng <-
        fit_resamples(
                fdaworkflow_peng,
                samples_peng
        )

fdafit_peng |>
        collect_metrics()
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy multiclass 0.977    10 0.00628 Preprocessor1_Model1
2 roc_auc  hand_till  0.998    10 0.00133 Preprocessor1_Model1

LDA Final Model Fit and Test Performance

final_peng <-
        last_fit(
                ldaworkflow_peng,
                datapeng_split
        )

final_peng |>
        collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy multiclass     0.985 Preprocessor1_Model1
2 roc_auc  hand_till      0.999 Preprocessor1_Model1

Confusion Matrix on LDA Test Performance

final_peng |>
        collect_predictions() |>
        conf_mat(species, .pred_class) |>
        autoplot(type = "heatmap")

LDA Trained Model

results <- extract_workflow(final_peng)
results
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_linear()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Call:
lda(..y ~ ., data = data)

Prior probabilities of groups:
   Adelie Chinstrap    Gentoo 
0.4377358 0.2037736 0.3584906 

Group means:
          bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
Adelie          6.983339      9.543835          13.45330    4.639260
Chinstrap       8.794963      9.520066          13.84416    4.599919
Gentoo          8.603930      7.828523          15.39062    6.347715

Coefficients of linear discriminants:
                         LD1         LD2
bill_length_mm    -0.5509088 -2.21796571
bill_depth_mm      2.0666181 -0.07613317
flipper_length_mm -1.1048074 -0.01438381
body_mass_g       -1.1466640  1.54105502

Proportion of trace:
   LD1    LD2 
0.8663 0.1337 

Interpretation

Prior probabilities of groups

These represent the proportions of each class in the training set. For example, 43.8% of the observations in the training set are Adelie.
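
As a quick sketch, these priors can be verified from the class frequencies in the training set:

datapeng_train |>
  count(species) |>
  mutate(prop = n / sum(n))  # proportions match the priors shown above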

Group means

These display the mean values of each predictor variable for each species.

Coefficients of linear discriminants

These display the linear combinations of the predictor variables (scaled by the recipe's step_scale()) that form the decision rule of the LDA model. In this case:

LD1: (-0.55*bill_length_mm) + (2.07*bill_depth_mm) + (-1.10*flipper_length_mm) + (-1.15*body_mass_g)

LD2: (-2.22*bill_length_mm) + (-0.08*bill_depth_mm) + (-0.01*flipper_length_mm) + (1.54*body_mass_g)

Proportion of trace

These display the proportion of the between-group separation achieved by each linear discriminant function; here LD1 accounts for about 87% of the separation and LD2 for the remaining 13%.
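
These values can be recovered from the underlying MASS::lda object stored in the trained workflow; a sketch using the singular values of the fit:

lda_engine <- extract_fit_engine(results)  # the fitted MASS::lda object
lda_engine$svd^2 / sum(lda_engine$svd^2)   # ~0.8663 and 0.1337, the proportion of trace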

Application

Suppose measurements are recorded for three new penguins and we would like to determine their species.

  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1           41.1          17.6               182        3200
2           46.3          15.8               215        5050
3           53.5          19.9               205        4500
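
For the prediction step below, these measurements are assumed to be stored in a tibble named new_penguins:

new_penguins <- tibble(
  bill_length_mm    = c(41.1, 46.3, 53.5),
  bill_depth_mm     = c(17.6, 15.8, 19.9),
  flipper_length_mm = c(182, 215, 205),
  body_mass_g       = c(3200, 5050, 4500)
)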

We can predict each penguin's species using the LDA model we trained.

prediction <- augment(results, new_penguins)
prediction <- as.data.frame(prediction)
prediction
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g .pred_class
1           41.1          17.6               182        3200      Adelie
2           46.3          15.8               215        5050      Gentoo
3           53.5          19.9               205        4500   Chinstrap
  .pred_Adelie .pred_Chinstrap .pred_Gentoo
1 9.403125e-01    5.968747e-02 2.325703e-17
2 5.132116e-12    4.143134e-11 1.000000e+00
3 1.314401e-05    9.999869e-01 2.696294e-12