---
title: Classifying USV subtypes
subtitle: Rat vocalization and alcohol
author: <a href="https://marce10.github.io/">Marcelo Araya-Salas</a>
date: "`r Sys.Date()`"
toc: true
toc-depth: 3
toc-location: left
number-sections: true
highlight-style: pygments
format:
html:
df-print: kable
code-fold: true
code-tools: true
css: qmd.css
editor_options:
chunk_output_type: console
---
```{r set root directory}
#| echo: false
# set working directory as project directory or one directory above,
knitr::opts_knit$set(root.dir = "..")
```
```{r add link to github repo}
#| echo: false
#| output: asis
# print link to github repo if any
if (file.exists("./.git/config")){
config <- readLines("./.git/config")
url <- grep("url", config, value = TRUE)
url <- gsub("\\turl = |.git$", "", url)
cat("\nSource code and data found at [", url, "](", url, ")", sep = "")
}
```
```{r setup style, echo = FALSE, message = FALSE, warning=FALSE}
# options to customize chunk outputs
knitr::opts_chunk$set(
tidy.opts = list(width.cutoff = 65),
tidy = TRUE,
message = FALSE
)
```
<!-- skyblue box -->
<div class="alert alert-info">
# Purpose
- Optimizing USV subtype classification based on acoustic features
</div>
# Load packages and set custom parameters {.unnumbered .unlisted}
```{r load packages}
# knitr is require for creating html/pdf/word reports
# formatR is used for soft-wrapping code
# install/ load packages
sketchy::load_packages(
packages = c(
"knitr",
"formatR",
"viridis",
"warbleR",
github = "maRce10/Rraven",
github = "maRce10/warbleR",
github = "maRce10/ohun",
"caret",
"randomForest",
"DiagrammeR",
"beepr",
"pbapply",
"ggridges",
"doParallel"
)
)
# recording hours
# isf <- info_sound_files()
# rh <- sum(isf$duration) / 60
rh <- 1455
# Define the custom colors
custom_colors <- rev(c("#38AAAC66", "#54C9AD66", "#A0DFB966", "#DEF5E566", "#ffffff"))
# Create a color palette function
custom_palette <- colorRampPalette(custom_colors)
# Generate a gradient of colors from the custom palette
gradient_colors <- custom_palette(100) # 100 colors for a smooth gradient
```
# Prepare data {.unnumbered .unlisted}
```{r}
#| eval: true
anns <- read.csv ("./data/processed/curated_final_data_set_single_unit_and_composed_annotations_and_spectrogram_and_mel_features.csv" )
anns$ modulation.index <- anns$ modindx * anns$ dfrange / (anns$ end - anns$ start)
```
# Describe annotations (single element subtypes)
- Single units were manually annotated in Raven
- Composed calls (2 or more single units) were determined automatically as those in which single units were 5 ms apart or less
- The data set `anns` contains all single units for the composed calls as single annotations (so several annotations per composed subtype) as well as the single unit calls (the column `anns$composed` )
- The column `call_id` contains a unique ID for each call (either single or multielement)
- The data comes from an acoustic data set with `r round(rh)` recording hours
- `r length(unique(anns$sound.files))` sound files were analyzed
- `r nrow(anns)` single units were manually annotated
- `r sum(!anns$composed)` single unit calls
- `r length(unique(anns$call_id[anns$composed]))` composed calls
# Summarize features for composed calls
- Acoustic features are summarized as the average value weighted by the duration of each composing single unit
```{r}
#| eval: false
sel_feats <- names (anns)[c (16 : 219 , 226 )]
# features at the composed level
comp_feats <- song_analysis (X = anns, song_colm = "call_id" , mean_colm = sel_feats, weight = "duration" , parallel = 20 )
comp_feats$ subtype <- sapply (comp_feats$ call_id, function (x) anns$ overall_type[anns$ call_id == x][1 ])
comp_feats$ subunit_type <- sapply (comp_feats$ call_id, function (x) anns$ subunit_type[anns$ call_id == x][1 ])
comp_feats$ ordered_subunit_type <- sapply (comp_feats$ call_id, function (x) anns$ ordered_subunit_type[anns$ call_id == x][1 ])
write.csv (comp_feats, "./data/processed/call_level_features.csv" , row.names = FALSE )
```
```{r}
#| eval: true
# read
comp_feats <- read.csv ("./data/processed/call_level_features.csv" )
comp_feats$ composed <- ! is.na (comp_feats$ gap.duration)
comp_feats$ gap.duration <- ifelse (is.na (comp_feats$ gap.duration), 0 , comp_feats$ gap.duration)
# order levels by mean
agg <- aggregate (modulation.index ~ subtype, data = comp_feats, mean)
comp_feats$ subtype <- factor (comp_feats$ subtype, levels = agg[order (agg$ modulation.index), "subtype" ])
```
# Classify subtypes
- Random forest is used for supervised classification (30 interations and 100 repeats)
- The data set is split into 75% training and 25% testing
- The model is trained using the training set and tested on the testing set
## Classification using manual single unit classification
- It uses the manually classified single unit categories and their combination for composed calls as predictors
- This approach represents the base case scenario in which there is no error in single unit classification
- It shows how distinguishable subtypes are
```{mermaid}
flowchart LR
A(Manual single\nunit classification) --> B(Measure\nacoustic\nstructure)
B --> C(Build\ncomposed\ntypes)
C --> D(Average\nvalues for\ncomposed calls)
D --> E(Supervised\nsubtype classification\nof calls)
style A fill:#0B04054D
style B fill:#3E356B4D
style C fill:#357BA24D
style D fill:#49C1AD4D
style E fill:#DEF5E54D
```
```{r}
#| eval: false
# remove collinear
target_features <- names (comp_feats)[! names (comp_feats) %in% c ("sound.files" , "selec" , "start" , "end" , "bottom.freq" , "top.freq" , "subtype" , names (comp_feats)[sapply (comp_feats, anyNA)])]
num_target_features <- target_features[sapply (comp_feats[, target_features], is.numeric)]
cat_target_features <- target_features[! sapply (comp_feats[, target_features], is.numeric)]
cormat <- cor (comp_feats[, num_target_features], use = "pairwise.complete.obs" )
hc <- findCorrelation (cormat, cutoff= 0.95 )
hc <- sort (hc)
num_target_features <- num_target_features[- hc]
num_target_features <- unique (c (num_target_features, "gap.duration" , "elm.duration" ))
target_features <- c (num_target_features, cat_target_features)
# Create data subsets
partition <- createDataPartition (
y = comp_feats$ subtype,
times = 1 ,
p = 0.75 ,
list = FALSE
)
comp_feats$ subtype <- make.names (comp_feats$ subtype, unique = FALSE , allow_ = TRUE )
comp_feats$ subtype <- as.factor (comp_feats$ subtype)
trainset <- comp_feats[partition, c (target_features, "subtype" )]
testset <- comp_feats[- partition, c (target_features, "subtype" )]
testset <- testset[testset$ subunit_type %in% unique (trainset$ subunit_type), ]
trcontrol <-
trainControl (
method = "repeatedcv" ,
number = 100 ,
repeats = 30 ,
savePredictions = TRUE ,
classProbs = TRUE
)
# use parallelization
registerDoParallel (cores = 20 )
pred_model <- train (
subtype ~ .,
data = trainset,
method = "rf" ,
trControl = trcontrol,
metric = "Accuracy"
)
beepr:: beep (2 )
ggplot (pred_model) + theme_bw ()
# save confusion matrix
conf_mat <-
confusionMatrix (predict (pred_model, testset), testset$ subtype)
conf_df <- as.data.frame (conf_mat$ table)
conf_df$ total <-
sapply (conf_df$ Reference, function (x)
sum (testset$ subtype ==
x))
conf_df$ proportion <- conf_df$ Freq / conf_df$ total
rf_model_results <-
list (
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
)
saveRDS (
rf_model_results,
"./data/processed/random_forest_model_results_perfect_subunit_classification.RDS"
)
```
### Checking performance on test data
```{r}
#| eval: true
rf_model_results <-
readRDS ("./data/processed/random_forest_model_results_perfect_subunit_classification.RDS" )
# print confusion matrix results
#rf_model_results$conf_mat_bb
confusion_df <- rf_model_results$ confusion_df_bb
confusion_df$ Prediction <- as.character (confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("step." , "step-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("complex." , "complex-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("." , " " , confusion_df$ Prediction, fixed = TRUE )
confusion_df$ Reference <- as.character (confusion_df$ Reference)
confusion_df$ Reference <- gsub ("step." , "step-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("complex." , "complex-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("." , " " , confusion_df$ Reference, fixed = TRUE )
# Use the custom palette in the ggplot
ggplot (confusion_df, aes (x = Reference, y = Prediction, fill = proportion)) +
geom_tile () +
coord_equal () +
scale_fill_gradientn (colors = gradient_colors) + # Use custom gradient
geom_text (aes (label = round (proportion, 2 )), color = "black" , size = 3 ) +
theme_classic () +
labs (x = "Observed" , y = "Predicted" , title = paste ("Overall accuracy:" , round (max (rf_model_results$ pred_model_bb$ results$ Accuracy), 2 ))) +
theme (axis.text.x = element_text (
color = "black" ,
size = 11 ,
angle = 30 ,
vjust = 0.8 ,
hjust = 0.8
))
```
#### Feature importance
Which predictor features contribute the most to the classification
```{r}
# feature importance
imp <- varImp (rf_model_results$ pred_model_bb)
dat <- ggplot (imp, top = 10 ) + theme_classic ()
ggplot (data = dat$ data, aes (x = Importance, y = Feature, fill = after_stat (x))) +
scale_fill_viridis_c (name = "Importance" , option = "G" , guide = "none" , begin = 0.4 ) +
geom_bar (stat = "identity" ) +
theme_classic ()
```
## Classification without single unit classification
- It only uses the numeric features of single and composed calls
- Skips any classification of single units
```{mermaid}
flowchart LR
A(Measure\nacoustic\nstructure) --> B(Build\ncomposed\ntypes)
B --> C(Average\nvalues for\ncomposed calls)
C --> D(Supervised\nsubtype classification\nof calls)
style A fill:#0B04054D
style B fill:#3E356B4D
style C fill:#357BA24D
style D fill:#DEF5E54D
```
```{r}
#| eval: false
comp_feats$ composed <- ! is.na (comp_feats$ gap.duration)
comp_feats$ gap.duration <- ifelse (is.na (comp_feats$ gap.duration), 0 , comp_feats$ gap.duration)
# remove collinear
target_features <- names (comp_feats)[! names (comp_feats) %in% c ("sound.files" , "selec" , "start" , "end" , "bottom.freq" , "top.freq" , "subtype" , names (comp_feats)[sapply (comp_feats, anyNA)])]
num_target_features <- target_features[sapply (comp_feats[, target_features], is.numeric)]
cat_target_features <- "composed"
cormat <- cor (comp_feats[, num_target_features], use = "pairwise.complete.obs" )
hc <- findCorrelation (cormat, cutoff= 0.95 )
hc <- sort (hc)
num_target_features <- num_target_features[- hc]
num_target_features <- unique (c (num_target_features, "gap.duration" , "elm.duration" ))
target_features <- c (num_target_features, cat_target_features)
set.seed (123 )
# Create data subsets
partition <- createDataPartition (
y = comp_feats$ subtype,
times = 1 ,
p = 0.75 ,
list = FALSE
)
comp_feats$ subtype <- make.names (comp_feats$ subtype, unique = FALSE , allow_ = TRUE )
comp_feats$ subtype <- as.factor (comp_feats$ subtype)
trainset <- comp_feats[partition, c (target_features, "subtype" )]
testset <- comp_feats[- partition, c (target_features, "subtype" )]
trcontrol <-
trainControl (
method = "repeatedcv" ,
number = 100 ,
repeats = 30 ,
savePredictions = TRUE ,
classProbs = TRUE
)
# use parallelization
registerDoParallel (cores = 20 )
pred_model <- train (
subtype ~ .,
data = trainset,
method = "rf" ,
trControl = trcontrol,
metric = "Accuracy"
)
beepr:: beep (2 )
ggplot (pred_model) + theme_bw ()
# save confusion matrix
conf_mat <-
confusionMatrix (predict (pred_model, testset), testset$ subtype)
conf_df <- as.data.frame (conf_mat$ table)
conf_df$ total <-
sapply (conf_df$ Reference, function (x)
sum (testset$ subtype ==
x))
conf_df$ proportion <- conf_df$ Freq / conf_df$ total
rf_model_results <-
list (
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
)
saveRDS (
rf_model_results,
"./data/processed/random_forest_model_results_no_subunit_classification.RDS"
)
```
### Checking performance on test data
```{r}
#| eval: true
rf_model_results <-
readRDS ("./data/processed/random_forest_model_results_no_subunit_classification.RDS" )
# print confusion matrix results
#rf_model_results$conf_mat_bb
confusion_df <- rf_model_results$ confusion_df_bb
confusion_df$ Prediction <- as.character (confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("step." , "step-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("complex." , "complex-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("." , " " , confusion_df$ Prediction, fixed = TRUE )
confusion_df$ Reference <- as.character (confusion_df$ Reference)
confusion_df$ Reference <- gsub ("step." , "step-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("complex." , "complex-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("." , " " , confusion_df$ Reference, fixed = TRUE )
ggplot (confusion_df, aes (x = Reference, y = Prediction, fill = proportion)) +
geom_tile () +
coord_equal () +
scale_fill_gradientn (colors = gradient_colors) + # Use custom gradient
geom_text (aes (label = round (proportion, 2 )), color = "black" , size = 3 ) +
theme_classic () +
labs (x = "Observed" , y = "Predicted" , title = paste ("Overall accuracy:" , round (max (rf_model_results$ pred_model_bb$ results$ Accuracy), 2 ))) +
theme (axis.text.x = element_text (
color = "black" ,
size = 11 ,
angle = 30 ,
vjust = 0.8 ,
hjust = 0.8
))
```
#### Feature importance
Which predictor features contribute the most to the classification
```{r}
# feature importance
imp <- varImp (rf_model_results$ pred_model_bb)
dat <- ggplot (imp, top = 10 ) + theme_classic ()
ggplot (data = dat$ data, aes (x = Importance, y = Feature, fill = after_stat (x))) +
scale_fill_viridis_c (name = "Importance" , option = "G" , guide = "none" , begin = 0.4 ) +
geom_bar (stat = "identity" ) +
theme_classic ()
```
## 2 step random forest
- The first step classifies single units in a supervised manner
- The second step classifies composed calls using the single units categories from the previous step along with the numeric features as predictors
```{mermaid}
flowchart LR
A(Measure\nacoustic\nstructure) --> B(Supervised single\nunit classification)
B --> C(Build\ncomposed\ntypes)
C --> D(Average\nvalues for\ncomposed calls)
D --> E(Supervised\nsubtype classification\nof calls\nusing single unit\nclassification)
style A fill:#0B04054D
style B fill:#3E356B4D
style C fill:#357BA24D
style D fill:#49C1AD4D
style E fill:#DEF5E54D
```
### 1. Classified single units
```{r}
#| eval: false
single_anns <- anns[! anns$ composed, ]
# remove collinear
target_features <- names (single_anns)[! names (single_anns) %in% c ("sound.files" , "selec" , "start" , "end" , "bottom.freq" , "top.freq" , "subtype" , "Delta.Time..s." , "composed.subtype" , "ovlp.sels" , "call_id" , "Channel" , "subtipos" , names (comp_feats)[sapply (single_anns, anyNA)])]
target_features <- target_features[sapply (single_anns[, target_features], is.numeric)]
cormat <- cor (single_anns[, target_features], use = "pairwise.complete.obs" )
hc <- findCorrelation (cormat, cutoff= 0.95 )
hc <- sort (hc)
target_features <- target_features[- hc]
target_features <- unique (c (target_features, "duration" ))
set.seed (123 )
# Create data subsets
partition <- createDataPartition (
y = single_anns$ subtype,
times = 1 ,
p = 0.75 ,
list = FALSE
)
# single_anns$subtype <- make.names(single_anns$subtype, unique = FALSE, allow_ = TRUE)
single_anns$ subtype <- as.factor (single_anns$ subtype)
trainset <- single_anns[partition, c (target_features, "subtype" )]
testset <- single_anns[- partition, c (target_features, "subtype" )]
trcontrol <-
trainControl (
method = "repeatedcv" ,
number = 100 ,
repeats = 30 ,
savePredictions = TRUE ,
classProbs = TRUE
)
# use parallelization
registerDoParallel (cores = 20 )
pred_model <- train (
subtype ~ .,
data = trainset,
method = "rf" ,
trControl = trcontrol,
metric = "Accuracy"
)
beepr:: beep (2 )
ggplot (pred_model) + theme_bw ()
# save confusion matrix
conf_mat <-
confusionMatrix (predict (pred_model, testset), testset$ subtype)
conf_df <- as.data.frame (conf_mat$ table)
conf_df$ total <-
sapply (conf_df$ Reference, function (x)
sum (testset$ subtype ==
x))
conf_df$ proportion <- conf_df$ Freq / conf_df$ total
rf_model_results_step_1 <-
list (
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
)
saveRDS (
rf_model_results_step_1,
"./data/processed/random_forest_subunit_classification.RDS"
)
```
#### Checking performance on test data
```{r}
#| eval: true
rf_model_results_step_1 <-
readRDS ("./data/processed/random_forest_subunit_classification.RDS" )
# print confusion matrix results
#rf_model_results_step_1$conf_mat_bb
confusion_df <- rf_model_results_step_1$ confusion_df_bb
ggplot (confusion_df, aes (x = Reference, y = Prediction, fill = proportion)) +
geom_tile () +
coord_equal () +
scale_fill_gradientn (colors = gradient_colors) + # Use custom gradient
geom_text (aes (label = round (proportion, 2 )), color = "black" , size = 3 ) +
theme_classic () +
labs (x = "Observed" , y = "Predicted" , title = paste ("Overall accuracy:" , round (max (rf_model_results_step_1$ pred_model_bb$ results$ Accuracy), 2 ))) +
theme (axis.text.x = element_text (
color = "black" ,
size = 11 ,
angle = 30 ,
vjust = 0.8 ,
hjust = 0.8
))
```
##### Feature importance
Which predictor features contribute the most to the classification
```{r}
# feature importance
imp <- varImp (rf_model_results_step_1$ pred_model_bb)
dat <- ggplot (imp, top = 10 ) + theme_classic ()
ggplot (data = dat$ data, aes (x = Importance, y = Feature, fill = after_stat (x))) +
scale_fill_viridis_c (name = "Importance" , option = "G" , guide = "none" , begin = 0.4 ) +
geom_bar (stat = "identity" ) +
theme_classic ()
```
### 2. Random forest with unsupervised-classified sub-units
```{r}
#| eval: true
anns$ superv_subunit_type <- NA
anns$ superv_subunit_type[! anns$ composed] <- as.character (predict (rf_model_results_step_1$ pred_model_bb, anns[! anns$ composed,names (anns) != "subtype" ]))
comp_feats$ superv_subunit_type <- sapply (comp_feats$ call_id, function (x) {
subtypes <- anns$ subtype.label[anns$ call_id == x]
paste (subtypes, collapse = "-" )
})
comp_feats$ superv_ordered_subunit_type <- sapply (comp_feats$ call_id, function (x) {
subtypes <- anns$ subtype.label[anns$ call_id == x]
order_subtypes <- sort (subtypes)
paste (order_subtypes, collapse = "-" )
})
comp_feats$ subtype <- make.names (comp_feats$ subtype, unique = FALSE , allow_ = TRUE )
comp_feats$ subtype <- as.factor (comp_feats$ subtype)
```
```{r}
#| eval: false
# remove collinear
target_features <- names (comp_feats)[! names (comp_feats) %in% c ("call_id" , "sound.files" , "selec" , "start" , "end" , "bottom.freq" , "top.freq" , "subtype" , names (comp_feats)[sapply (comp_feats, anyNA)])]
num_target_features <- target_features[sapply (comp_feats[, target_features], is.numeric)]
cat_target_features <- c ("composed" , "superv_subunit_type" , "superv_ordered_subunit_type" )
cormat <- cor (comp_feats[, num_target_features], use = "pairwise.complete.obs" )
hc <- findCorrelation (cormat, cutoff= 0.95 )
hc <- sort (hc)
num_target_features <- num_target_features[- hc]
num_target_features <- unique (c (num_target_features, "gap.duration" , "elm.duration" ))
target_features <- c (num_target_features, cat_target_features)
# factor superv_subunit_type has new levels complex-trill-complex, flat-complex-flat-trill, flat-flat-trill-complex, trill-flat-flat-complex-trill
# Create data subsets
partition <- createDataPartition (
y = comp_feats$ subtype,
times = 1 ,
p = 0.75 ,
list = FALSE
)
trainset <- comp_feats[partition, c (target_features, "subtype" )]
testset <- comp_feats[- partition, c (target_features, "subtype" )]
testset <- testset[testset$ superv_subunit_type %in% unique (trainset$ superv_subunit_type), ]
trcontrol <-
trainControl (
method = "repeatedcv" ,
number = 100 ,
repeats = 30 ,
savePredictions = TRUE ,
classProbs = TRUE
)
# use parallelization
registerDoParallel (cores = 20 )
pred_model <- train (
subtype ~ .,
data = trainset,
method = "rf" ,
trControl = trcontrol,
metric = "Accuracy"
)
beepr:: beep (2 )
ggplot (pred_model) + theme_bw ()
# save confusion matrix
conf_mat <-
confusionMatrix (predict (pred_model, testset), testset$ subtype)
conf_df <- as.data.frame (conf_mat$ table)
conf_df$ total <-
sapply (conf_df$ Reference, function (x)
sum (testset$ subtype ==
x))
conf_df$ proportion <- conf_df$ Freq / conf_df$ total
rf_model_results_step_2 <-
list (
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
)
saveRDS (
rf_model_results_step_2,
"./data/processed/random_forest_model_results_2_step_classification.RDS"
)
```
#### Checking performance on test data
```{r}
#| eval: true
rf_model_results_step_2 <-
readRDS ("./data/processed/random_forest_model_results_2_step_classification.RDS" )
# print confusion matrix results
#rf_model_results_step_2$conf_mat_bb
confusion_df <- rf_model_results_step_2$ confusion_df_bb
confusion_df$ Prediction <- as.character (confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("step." , "step-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("complex." , "complex-" , confusion_df$ Prediction)
confusion_df$ Prediction <- gsub ("." , " " , confusion_df$ Prediction, fixed = TRUE )
confusion_df$ Reference <- as.character (confusion_df$ Reference)
confusion_df$ Reference <- gsub ("step." , "step-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("complex." , "complex-" , confusion_df$ Reference)
confusion_df$ Reference <- gsub ("." , " " , confusion_df$ Reference, fixed = TRUE )
ggplot (confusion_df, aes (x = Reference, y = Prediction, fill = proportion)) +
geom_tile () +
coord_equal () +
scale_fill_gradientn (colors = gradient_colors) + # Use custom gradient
geom_text (aes (label = round (proportion, 2 )), color = "black" , size = 3 ) +
theme_classic () +
labs (x = "Observed" , y = "Predicted" , title = paste ("Overall accuracy:" , round (max (rf_model_results_step_2$ pred_model_bb$ results$ Accuracy), 2 ))) +
theme (axis.text.x = element_text (
color = "black" ,
size = 11 ,
angle = 30 ,
vjust = 0.8 ,
hjust = 0.8
))
```
##### Feature importance
Which predictor features contribute the most to the classification
```{r}
# feature importance
imp <- varImp (rf_model_results_step_2$ pred_model_bb)
dat <- ggplot (imp, top = 10 ) + theme_classic ()
ggplot (data = dat$ data, aes (x = Importance, y = Feature, fill = after_stat (x))) +
scale_fill_viridis_c (name = "Importance" , option = "G" , guide = "none" , begin = 0.4 ) +
geom_bar (stat = "identity" ) +
theme_classic ()
```
```{r}
#| eval: false
## XG boosting
trcontrol <-
trainControl (
method = "repeatedcv" ,
number = 100 ,
repeats = 30 ,
savePredictions = TRUE ,
classProbs = TRUE
)
# use parallelization
registerDoParallel (cores = 20 )
xgb_model <- train (
subtype ~ .,
data = trainset,
method = "xgbTree" ,
trControl = trcontrol,
metric = "Accuracy"
)
saveRDS (xbg_model, "./data/processed/xgb_model_results.RDS" )
```
# Compare manual vs supervised classification
- Comparing features from both approaches helps to understand the putative effect on downstream analyses
- Modulation index measures the cumulative change in frequency of the dominant frequency contour over the duration of the call
```{r}
#| eval: true
# read
# comp_feats <- read.csv("./data/processed/call_level_features.csv")
#
# comp_feats$composed <- !is.na(comp_feats$gap.duration)
# comp_feats$gap.duration <- ifelse(is.na(comp_feats$gap.duration), 0, comp_feats$gap.duration)
# order levels by mean
agg <- aggregate (modulation.index ~ subtype, data = comp_feats, mean)
# comp_feats$subtype <- factor(comp_feats$subtype, levels = agg[order(agg$modulation.index), "subtype"])
comp_feats2 <- comp_feats
comp_feats2 <- comp_feats2[comp_feats2$ superv_subunit_type %in% unique (rf_model_results_step_2$ pred_model_bb$ trainingData$ superv_subunit_type), ]
comp_feats2$ subtype <- predict (rf_model_results_step_2$ pred_model_bb, comp_feats2[, names (comp_feats2) != "subtype" ])
comp_feats$ method <- "manual"
comp_feats2$ method <- "2 step supervised"
comp_pooled <- rbind (comp_feats, comp_feats2)
# ggplot(comp_feats, aes(x = modulation.index, y = subtype, fill = after_stat(x))) +
# geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
# scale_fill_viridis_c(name = "Modulation", option = "G", guide = "none", begin = 0.4) +
# xlim(c(-100, 10000)) +
# theme_minimal() +
# labs(x = "Modulation index", y = "Density")
comp_pooled$ subtype <- factor (comp_pooled$ subtype, levels = agg[order (agg$ modulation.index), "subtype" ])
ggplot (comp_pooled, aes (y = modulation.index, x = subtype, fill = after_stat (x))) +
geom_boxplot (outliers = FALSE ) +
scale_fill_viridis_c (name = "Modulation" , option = "G" , guide = "none" , begin = 0.4 ) +
theme_minimal () +
facet_wrap (~ method, nrow = 2 ) +
labs (y = "Modulation index" , x = "Subtype" )
```
::: {.alert .alert-success}
# Takeaways {#takeaways .unnumbered .unlisted}
- Good classification performance with 2 step supervised classification
:::
<!-- '---' adds a gray vertical line -->
------------------------------------------------------------------------
---
# Session information {.unnumbered .unlisted}
```{r session info, echo=F}
devtools::session_info()
```