Patient infection CRP Temp
1 Viral 40.0 36.0
2 Viral 11.1 37.2
3 Viral 30.0 36.5
4 Viral 21.4 39.4
5 Viral 10.7 39.6
6 Viral 3.4 40.7
7 Bacterial 42.0 37.6
8 Bacterial 31.1 42.2
9 Bacterial 50.0 38.5
10 Bacterial 60.4 39.4
11 Bacterial 45.7 38.6
12 Bacterial 17.3 42.7
Discriminant Function Analysis: A Supervised Classification Technique
This presentation uses the R programming language, together with tidyverse packages for data transformation and the tidymodels framework (with the discrim extension) for modeling.
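The code in the following sections assumes these packages are installed and loaded (a minimal setup sketch; palmerpenguins supplies the penguins data set used later):
library(tidyverse)      # data transformation and plotting (dplyr, ggplot2, readr, ...)
library(tidymodels)     # modeling framework (rsample, recipes, workflows, tune, yardstick)
library(discrim)        # parsnip extension with the discriminant model specifications
library(palmerpenguins) # the penguins data set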
What is DFA?
Introduction: DFA as a Classifier
Suppose we have recorded data from 12 patients with the following variables (shown in the table above): Type of Infection, C-Reactive Protein (CRP), and Temperature.
Our aim is to classify whether a patient’s infection is viral or bacterial (dependent variable) given the CRP and temperature (independent variables).
Use Case
It usually takes several hours or days to determine whether a patient has a viral or a bacterial infection, which delays the appropriate treatment for the patient.
DFA can help classify new patients as having a viral or a bacterial infection based on their CRP and temperature.
Classify based on CRP
Classify based on Temp
Classify based on CRP and Temp
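A minimal sketch of the combined CRP-and-Temp plot, assuming the 12-patient table above is stored in a tibble named infections (a hypothetical name, not defined in the original slides):
# Hypothetical tibble holding the 12-patient table shown earlier
infections <- tibble(
  infection = rep(c("Viral", "Bacterial"), each = 6),
  CRP  = c(40.0, 11.1, 30.0, 21.4, 10.7, 3.4, 42.0, 31.1, 50.0, 60.4, 45.7, 17.3),
  Temp = c(36.0, 37.2, 36.5, 39.4, 39.6, 40.7, 37.6, 42.2, 38.5, 39.4, 38.6, 42.7)
)
# Scatter plot of CRP and Temp, colored by infection type
infections |>
  ggplot() +
  geom_point(aes(x = CRP, y = Temp, color = infection), size = 3)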
The Discriminant Function
A line that separates the two infection types indicates that linear discriminant analysis (LDA) would do a good job of classifying the infections by combining the two variables. The discriminant function is a new variable formed as a linear combination of the predictors; such variables are labelled LD1, LD2, and so on (with only two groups there is a single discriminant, LD1).
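A minimal sketch of fitting this discriminant directly with MASS::lda on the hypothetical infections tibble from the sketch above:
# Two-group LDA on the infection example; with two classes only LD1 is produced
infection_lda <- MASS::lda(infection ~ CRP + Temp, data = infections)
infection_lda$scaling   # coefficients of the single linear discriminant (LD1)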
Definition
Discriminant Function Analysis (DFA) is a statistical technique that helps to classify observations into different groups based on certain characteristics. It creates new variables, called discriminant functions, that maximize the differences between groups while minimizing the differences within each group. These functions are used to assign each observation to a particular group or category based on the values of the independent variables.
DFA is typically used to predict membership in naturally occurring groups; it addresses the question of whether a set of variables can be used to predict group membership.
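Formally, each discriminant function is a linear combination of the predictors chosen to maximize the ratio of between-group to within-group scatter (Fisher's criterion, stated here in standard notation; this formula is not from the original slides):
% S_B and S_W are the between-group and within-group scatter matrices
% of the predictors x; w gives the weights of one discriminant function
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \,\mathbf{w}}{\mathbf{w}^{\top} S_W \,\mathbf{w}},
\qquad \mathrm{LD1} = \hat{\mathbf{w}}^{\top} \mathbf{x}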
Example Plot: Linear Discriminant Analysis
Example Plot: Flexible Discriminant Analysis
Widely Used DFA Methods
Linear Discriminant Analysis (LDA). Uses linear combinations of the predictors to predict the class of a given observation. Assumes that the p predictor variables are normally distributed and that the classes have identical variances (univariate case, p = 1) or identical covariance matrices (multivariate case, p > 1).
Quadratic Discriminant Analysis (QDA). More flexible than LDA; it does not assume that the covariance matrices of the classes are identical.
Flexible Discriminant Analysis (FDA). Uses non-linear combinations of the predictors.
The Palmer Penguins Data Set
The Palmer Penguins data set contains size measurements for adult foraging penguins near Palmer Station, Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
You can access the data set as a CSV file here: GitHub Gist
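The later sections use the penguins object directly (as provided by the palmerpenguins package). If you prefer the CSV, a minimal sketch (the file name below is a placeholder, not the actual Gist URL):
# Read a downloaded copy of the CSV (placeholder file name) and restore
# the factor columns so the result matches the package version
penguins <- read_csv("penguins.csv") |>
  mutate(across(c(species, island, sex), as.factor))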
Goal
Create a discriminant model that predicts the species of a penguin given relevant characteristics.
The Data Set
Let’s examine the data set.
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Exploratory Data Analysis
penguins |>
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins |>
ggplot() +
geom_bar(aes(x = species, fill = species))
penguins |>
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
penguins |>
ggplot() +
geom_point(aes(x = bill_length_mm, y = flipper_length_mm, color = species))
penguins |>
ggplot() +
geom_point(aes(x = bill_length_mm, y = body_mass_g, color = species))
Final Dataset
We remove the unwanted variables island, sex, and year, drop NA values, and retain the variables that we want to analyze:
Independent variables:
bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
Dependent variable:
species
datapeng <- penguins |>
drop_na() |>
select(-year, -island, -sex)
head(datapeng)
# A tibble: 6 × 5
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <dbl> <dbl> <int> <int>
1 Adelie 39.1 18.7 181 3750
2 Adelie 39.5 17.4 186 3800
3 Adelie 40.3 18 195 3250
4 Adelie 36.7 19.3 193 3450
5 Adelie 39.3 20.6 190 3650
6 Adelie 38.9 17.8 181 3625
glimpse(datapeng)
Rows: 333
Columns: 5
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
Supervised Modeling Framework
Splitting Data into Training and Testing with Stratification
80% Training, 20% Testing
set.seed(18)
datapeng_split <- initial_split(datapeng, prop = 0.8, strata = species)
datapeng_train <- training(datapeng_split)
datapeng_test <- testing(datapeng_split)
datapeng_split
<Training/Testing/Total>
<265/68/333>
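A quick way to verify that the stratified split preserved the species proportions (a sketch; these counts are not shown in the original output):
# Compare species proportions in the training and testing partitions
datapeng_train |> count(species) |> mutate(prop = n / sum(n))
datapeng_test |> count(species) |> mutate(prop = n / sum(n))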
Resampling using 10-Fold Cross Validation
set.seed(18)
samples_peng <- vfold_cv(datapeng_train, strata = species)
samples_peng
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [237/28]> Fold01
2 <split [237/28]> Fold02
3 <split [237/28]> Fold03
4 <split [237/28]> Fold04
5 <split [238/27]> Fold05
6 <split [239/26]> Fold06
7 <split [240/25]> Fold07
8 <split [240/25]> Fold08
9 <split [240/25]> Fold09
10 <split [240/25]> Fold10
Linear Discriminant Analysis (LDA)
ldamodel_peng <-
discrim_linear() |>
set_engine('MASS') |>
set_mode("classification")
ldarecipe_peng <-
recipe(species ~., data = datapeng_train) |>
step_scale(all_numeric_predictors())
ldaworkflow_peng <-
workflow() |>
add_recipe(ldarecipe_peng) |>
add_model(ldamodel_peng)
ldaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_linear()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_scale()
── Model ───────────────────────────────────────────────────────────────────────
Linear Discriminant Model Specification (classification)
Computational engine: MASS
LDA Training Performance
ldafit_peng <-
fit_resamples(
ldaworkflow_peng,
samples_peng
)
ldafit_peng |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.985 10 0.00620 Preprocessor1_Model1
2 roc_auc hand_till 1.00 10 0.000463 Preprocessor1_Model1
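To inspect the individual fold results rather than the averages, collect_metrics() can be called with summarize = FALSE (a sketch; the per-fold output is not shown here):
# Per-fold accuracy and ROC AUC instead of the resampling averages
ldafit_peng |>
  collect_metrics(summarize = FALSE)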
Quadratic Discriminant Analysis (QDA)
qdamodel_peng <-
discrim_quad() |>
set_engine('MASS') |>
set_mode("classification")
qdarecipe_peng <-
recipe(species ~., data = datapeng_train)
qdaworkflow_peng <-
workflow() |>
add_recipe(qdarecipe_peng) |>
add_model(qdamodel_peng)
qdaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_quad()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Quadratic Discriminant Model Specification (classification)
Computational engine: MASS
QDA Training Performance
qdafit_peng <-
fit_resamples(
qdaworkflow_peng,
samples_peng
)
qdafit_peng |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.989 10 0.00569 Preprocessor1_Model1
2 roc_auc hand_till 0.999 10 0.000926 Preprocessor1_Model1
Flexible Discriminant Analysis (FDA)
fdamodel_peng <-
discrim_flexible() |>
set_engine('earth') |>
set_mode("classification")
fdarecipe_peng <-
recipe(species ~., data = datapeng_train)
fdaworkflow_peng <-
workflow() |>
add_recipe(fdarecipe_peng) |>
add_model(fdamodel_peng)
fdaworkflow_peng
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_flexible()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Flexible Discriminant Model Specification (classification)
Computational engine: earth
FDA Training Performance
fdafit_peng <-
fit_resamples(
fdaworkflow_peng,
samples_peng
)
fdafit_peng |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.977 10 0.00628 Preprocessor1_Model1
2 roc_auc hand_till 0.998 10 0.00133 Preprocessor1_Model1
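Before fitting a final model, the cross-validation metrics of the three candidates can be gathered side by side for comparison (a sketch; this combined table is not part of the original output):
# Combine the resampling metrics of the three candidate models
bind_rows(
  collect_metrics(ldafit_peng) |> mutate(model = "LDA"),
  collect_metrics(qdafit_peng) |> mutate(model = "QDA"),
  collect_metrics(fdafit_peng) |> mutate(model = "FDA")
) |>
  select(model, .metric, mean, std_err)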
LDA Final Model Fit and Test Performance
final_peng <-
last_fit(
ldaworkflow_peng,
datapeng_split
)
final_peng |>
collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy multiclass 0.985 Preprocessor1_Model1
2 roc_auc hand_till 0.999 Preprocessor1_Model1
Confusion Matrix on LDA Test Performance
final_peng |>
collect_predictions() |>
conf_mat(species, .pred_class) |>
autoplot(type = "heatmap")
LDA Trained Model
results <- extract_workflow(final_peng)
results
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: discrim_linear()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_scale()
── Model ───────────────────────────────────────────────────────────────────────
Call:
lda(..y ~ ., data = data)
Prior probabilities of groups:
Adelie Chinstrap Gentoo
0.4377358 0.2037736 0.3584906
Group means:
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
Adelie 6.983339 9.543835 13.45330 4.639260
Chinstrap 8.794963 9.520066 13.84416 4.599919
Gentoo 8.603930 7.828523 15.39062 6.347715
Coefficients of linear discriminants:
LD1 LD2
bill_length_mm -0.5509088 -2.21796571
bill_depth_mm 2.0666181 -0.07613317
flipper_length_mm -1.1048074 -0.01438381
body_mass_g -1.1466640 1.54105502
Proportion of trace:
LD1 LD2
0.8663 0.1337
Interpretation
Prior probabilities of groups
These represent the proportions of each class in the training set. For example, 43.8% of the observations in the training set are Adelie.
Group means
These display the mean value of each predictor variable (after the step_scale() preprocessing) for each species.
Coefficients of linear discriminant
These display the linear combinations of the (scaled) predictor variables that form the decision rule of the LDA model. In this case:
LD1 = (-0.55*bill_length_mm) + (2.07*bill_depth_mm) + (-1.10*flipper_length_mm) + (-1.15*body_mass_g)
LD2 = (-2.22*bill_length_mm) + (-0.08*bill_depth_mm) + (-0.01*flipper_length_mm) + (1.54*body_mass_g)
Proportion of trace
These display the percentage of between-group separation achieved by each linear discriminant function; here LD1 accounts for about 86.6% and LD2 for about 13.4%.
Application
Suppose a new data set is recorded from new penguins and we would like to determine the species.
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 41.1 17.6 182 3200
2 46.3 15.8 215 5050
3 53.5 19.9 205 4500
We can classify each new penguin's species by using the LDA model we have trained.
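The prediction code below uses a tibble named new_penguins; a minimal sketch constructing it from the measurements above:
# Measurements of the three new penguins from the table above
new_penguins <- tibble(
  bill_length_mm = c(41.1, 46.3, 53.5),
  bill_depth_mm = c(17.6, 15.8, 19.9),
  flipper_length_mm = c(182L, 215L, 205L),
  body_mass_g = c(3200L, 5050L, 4500L)
)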
prediction <- augment(results, new_penguins)
prediction <- as.data.frame(prediction)
prediction
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g .pred_class
1 41.1 17.6 182 3200 Adelie
2 46.3 15.8 215 5050 Gentoo
3 53.5 19.9 205 4500 Chinstrap
.pred_Adelie .pred_Chinstrap .pred_Gentoo
1 9.403125e-01 5.968747e-02 2.325703e-17
2 5.132116e-12 4.143134e-11 1.000000e+00
3 1.314401e-05 9.999869e-01 2.696294e-12