---
title: Classifying USV subtypes
subtitle: Rat vocalization and alcohol
author: <a href="https://marce10.github.io/">Marcelo Araya-Salas</a>
date: "`r Sys.Date()`"
toc: true
toc-depth: 2
toc-location: left
number-sections: true
highlight-style: pygments
format:
  html:
    df-print: kable
    code-fold: true
    code-tools: true
    css: qmd.css
editor_options:
  chunk_output_type: console
---
```{=html}
<style>
body
{ counter-reset: source-line 0; }
pre.numberSource code
{ counter-reset: none; }
</style>
```
```{r set root directory, echo = FALSE}
# set the working directory to the project root (one directory above this script)
knitr::opts_knit$set(root.dir = "..")
```
```{r add link to github repo, echo = FALSE, results='asis'}
# print link to github repo if any
if (file.exists("./.git/config")){
config <- readLines("./.git/config")
url <- grep("url", config, value = TRUE)
url <- gsub("\\turl = |.git$", "", url)
cat("\nSource code and data found at [", url, "](", url, ")", sep = "")
}
```
```{r setup style, echo = FALSE, message = FALSE, warning=FALSE}
# options to customize chunk outputs
knitr::opts_chunk$set(
class.source = "numberLines lineAnchors", # for code line numbers
tidy.opts = list(width.cutoff = 65),
tidy = TRUE,
message = FALSE
)
```
<!-- skyblue box -->
<div class="alert alert-info">
# Purpose
- Classify subtypes based on structure
- Determine multi-element subtypes based on composing single subtypes
</div>
# Load packages {.unnumbered .unlisted}
```{r load packages}
# knitr is required for creating html/pdf/word reports
# formatR is used for soft-wrapping code
# install/ load packages
sketchy::load_packages(packages = c("knitr", "formatR", "viridis", github = "maRce10/Rraven", github = "maRce10/warbleR", github = "maRce10/ohun", "caret", "randomForest", "DiagrammeR"))
warbleR_options(wav.path = "/media/m/Expansion/audios_cin_alcohol_2023")
```
# Read data {.unnumbered .unlisted}
```{r, eval = TRUE}
# import Raven selection tables (all columns, in warbleR format)
anns <- imp_raven("./data/processed/annotations", warbler.format = TRUE, all.data = TRUE)

# drop unused columns and annotations without a subtype
anns$subc <- anns$subtipos_completo <- NULL
anns <- anns[!is.na(anns$subtipos), ]

# recode annotations labeled 5 and 44 as trills (4)
anns$subtipos[anns$subtipos %in% c(5, 44)] <- 4

# order by sound file and time, then remove annotations with inverted time coordinates
anns <- anns[order(anns$sound.files, anns$start), ]
nrow(anns)
anns <- anns[anns$start < anns$end, ]

# add subtype labels
anns$subtype.label <- NA
anns$subtype.label[anns$subtipos == 22] <- "22 kHz"
anns$subtype.label[anns$subtipos == 1] <- "flat"
anns$subtype.label[anns$subtipos == 4] <- "trill"
anns$subtype.label[anns$subtipos == 6] <- "complex"
```
# Describe annotations (single element subtypes)
- Proportion of calls per subtype
- These are the original annotation data, without any modification
```{r}
count_types <- round(table(anns$subtype.label) / nrow(anns), 2)
count_types <- as.data.frame(count_types)
count_types$Count <- as.vector(table(anns$subtype.label))
names(count_types) <- c("Subtype", "Proportion", "Count")
count_types
```
# Data analysis
```{mermaid}
flowchart LR
A[Annotations] --> C(Measure\nsingle-note\nsubtype\nstructure)
B[recordings] --> C
C --> D(Build\ncomposed\nsubtypes)
C --> E(Classify\nsimple\nsubtypes)
D --> F(Classify\ncomposed\nsubtypes)
F --> G(Evaluate classification)
E --> G
style A fill:#44015466
style B fill:#3E4A894D
style C fill:#26828E4D
style D fill:#31688E4D
style E fill:#6DCD594D
style G fill:#FDE7254D
```
```{r, eval = FALSE, echo = FALSE}
# to render this graph the chunk engine must be changed from 'r' to 'dot'
digraph {
graph [layout = dot]
# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]
annotations [label = "Annotations", shape = rectangle, fillcolor = "#44015466"]
recordings [label = "Recordings", shape = rectangle, fillcolor = "#3E4A894D"]
structure [label = "Measure\nsingle-note\nsubtype\nstructure", fillcolor = "#26828E4D"]
simple [label = "Classify\nsimple\nsubtypes", fillcolor = "#6DCD594D"]
bcomposed [label = "Build\ncomposed\nsubtypes", fillcolor = "#31688E4D"]
class [label = "Classify\ncomposed\nsubtypes", fillcolor = "#6DCD594D"]
eval [label = "Evaluate classification", fillcolor = "#FDE7254D"]
# edge definitions with the node IDs
{annotations recordings} -> structure -> bcomposed -> class -> eval
structure -> simple
simple -> eval
}
```
# Measure acoustic structure
Acoustic structure was measured using spectrographic, spectrum, and envelope features, as well as statistical descriptors of Mel-frequency cepstral coefficients.
```{r, eval = FALSE}
# check that annotations can be read from the sound files
cs <- check_sels(anns, parallel = 1)

# spectrographic/spectrum/envelope parameters
sp <- spectro_analysis(anns, parallel = 5, ovlp = 70)

# descriptive statistics of Mel frequency cepstral coefficients
mfccs <- mfcc_stats(anns, parallel = 5, ovlp = 70)
# combine all features with the annotations
anns_sp <- merge(anns, sp)
anns_sp <- merge(anns_sp, mfccs)
colnames(anns_sp)
write.csv(anns_sp, "./data/processed/annotations_and_spectrogram_and_mel_features.csv", row.names = FALSE)
```
# Multi-element composed subtypes
- Defined as those in which single elements are within 5 ms of each other
```{r, eval = FALSE}
anns <- read.csv("./data/processed/annotations_and_spectrogram_and_mel_features.csv")
# temporarily extend each element's end by 5 ms so elements closer than 5 ms
# are flagged as overlapping, then restore the original end times
anns$true.end <- anns$end
anns$end <- anns$end + 0.005
anns <- overlapping_sels(anns)
anns$end <- anns$true.end
anns$true.end <- NULL

# assign a composed-subtype ID: overlapping elements share the ID given by
# overlapping_sels(), each non-overlapping element gets its own new ID
anns$composed.subtype <- anns$ovlp.sels
for (i in seq_len(nrow(anns)))
  if (is.na(anns$composed.subtype[i]))
    anns$composed.subtype[i] <- max(anns$composed.subtype, na.rm = TRUE) + 1
write.csv(anns, "./data/processed/annotations_groups_spectrogram_and_mel_features.csv", row.names = FALSE)
```
<!-- light brown box -->
<div class="alert alert-warning">
## Rules to determine composed subtypes
- If any of the composing subtypes is complex, then **composed complex**
- If it contains 3 or more single subtypes, then **composed complex**
- If 'flat-trill' or 'trill-flat', then **step flat trill**
- If it has two adjacent trills ('trill-trill') and the frequency change is equal to or higher than 5 kHz, then **step trill**
- If it has two adjacent flats ('flat-flat') and the frequency change is equal to or higher than 5 kHz, then **step flat**

A compact sketch of these rules as a labeling function is shown right after this box, followed by the full implementation.
</div>
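Before the full implementation below, here is a minimal sketch of these rules as a stand-alone labeling function (a hypothetical helper, not part of the pipeline), assuming we have the dash-separated element codes of one composed call (e.g. "1-4") and the mean peak frequencies (in kHz) of its elements:
```{r, eval = FALSE}
# hypothetical helper sketching the rules above (the next chunk applies them
# to the full annotation table)
label_composed <- function(codes, peakf) {
  elements <- strsplit(codes, "-")[[1]]
  # any complex element, or 3 or more elements -> composed complex
  if (any(elements == "6") || length(elements) >= 3) return("composed.complex")
  # flat-trill or trill-flat -> step flat trill
  if (codes %in% c("1-4", "4-1")) return("step.flat.trill")
  freq_change <- abs(diff(peakf))
  # two adjacent trills or flats: a step if the frequency change is >= 5 kHz
  if (codes == "4-4") return(if (freq_change >= 5) "step.trill" else "composed.trill")
  if (codes == "1-1") return(if (freq_change >= 5) "step.flat" else "composed.flat")
  NA_character_
}

# example: two adjacent trills whose peak frequencies differ by 7 kHz -> "step.trill"
label_composed("4-4", c(55, 62))
```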
```{r, eval = FALSE}
anns <- read.csv("./data/processed/annotations_groups_spectrogram_and_mel_features.csv")
composed_anns_list <- lapply(unique(na.omit(anns$ovlp.sels)), function(x){
data.frame(sound.files = anns$sound.files[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)][1],
selec = x,
start = min(anns$start[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)]),
end = max(anns$end[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)]),
bottom.freq = min(anns$bottom.freq[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)]),
top.freq = max(anns$top.freq[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)]),
subtypes = paste(anns$subtipos[anns$ovlp.sels == x & !is.na(anns$ovlp.sels)], collapse = "-"),
composed.subtype = x,
count.subtypes = sum(anns$ovlp.sels == x & !is.na(anns$ovlp.sels))
)
})
composed_anns <- do.call(rbind, composed_anns_list)
unique(composed_anns$subtypes)
# subtypes <- c("1" = "flat", "4" = "trill", "6" = "complex", "22" = "22 kHz")
# add labels
composed_anns$composed.subtype.label <- NA
# if it contains a complex element (6) then it is composed complex
composed_anns$composed.subtype.label <- ifelse(grepl("6", composed_anns$subtypes), "composed.complex", composed_anns$composed.subtype.label)
# if it contains 3 or more elements then it is composed complex
composed_anns$composed.subtype.label <- ifelse(composed_anns$count.subtypes >= 3, "composed.complex", composed_anns$composed.subtype.label)
# 1-4 or 4-1 is a step flat trill
composed_anns$composed.subtype.label <- ifelse(grepl("^1-4$|^4-1$", composed_anns$subtypes), "step.flat.trill", composed_anns$composed.subtype.label)
# if it has two adjacent trills ("4-4") and the frequency change is 5 kHz or more it is a step trill, otherwise a composed trill
composed_anns$composed.subtype.label <- sapply(seq_len(nrow(composed_anns)), function(x){
  if (grepl("4-4", composed_anns$subtypes[x]) && composed_anns$count.subtypes[x] == 2) {
    peakfs <- anns$meanpeakf[anns$composed.subtype == composed_anns$composed.subtype[x]]
    diff_freq <- abs(peakfs[1] - peakfs[2])
    out <- if (diff_freq >= 5) "step.trill" else "composed.trill"
  } else out <- composed_anns$composed.subtype.label[x]
  return(out)
})
# if it has two adjacent flats ("1-1") and the frequency change is 5 kHz or more it is a step flat, otherwise a composed flat
composed_anns$composed.subtype.label <- sapply(seq_len(nrow(composed_anns)), function(x){
  if (grepl("1-1", composed_anns$subtypes[x]) && composed_anns$count.subtypes[x] == 2) {
    peakfs <- anns$meanpeakf[anns$composed.subtype == composed_anns$composed.subtype[x]]
    diff_freq <- abs(peakfs[1] - peakfs[2])
    out <- if (diff_freq >= 5) "step.flat" else "composed.flat"
  } else out <- composed_anns$composed.subtype.label[x]
  return(out)
})
unique(composed_anns$subtypes[composed_anns$count.subtypes == 2 & is.na(composed_anns$composed.subtype.label)])
round(table(composed_anns$composed.subtype.label)/ nrow(composed_anns), 2)
# collapse composed labels into broader subtype labels
composed_anns$subtype.label <- ifelse(grepl("step", composed_anns$composed.subtype.label), "step", composed_anns$composed.subtype.label)
composed_anns$subtype.label <- ifelse(grepl("complex", composed_anns$composed.subtype.label), "complex", composed_anns$subtype.label)
composed_anns$subtype.label <- ifelse(grepl("composed.trill", composed_anns$composed.subtype.label), "trill", composed_anns$subtype.label)
write.csv(composed_anns, "./data/processed/annotations_composed_subtypes_5ms.csv", row.names = FALSE)
anns$type <- "single"
anns$composed.subtype.label <- anns$subtype.label
composed_anns$type <- "composed"
cols <- intersect(colnames(anns), colnames(composed_anns))
combined_anns <- rbind(anns[is.na(anns$ovlp.sels), cols], composed_anns[, cols])
write.csv(combined_anns, "./data/processed/annotations_simple_and_composed_subtypes_5ms.csv", row.names = FALSE)
```
# Summary
Proportion of calls per subtype for composed (multi-element) subtypes
```{r}
composed_anns <- read.csv("./data/processed/annotations_composed_subtypes_5ms.csv")
count_types <- round(table(composed_anns$composed.subtype.label) / nrow(composed_anns), 2)
count_types <- as.data.frame(count_types)
count_types$Count <- as.vector(table(composed_anns$composed.subtype.label))
names(count_types) <- c("Subtype", "Proportion", "Count")
count_types
```
Combinations of single-element calls within each composed subtype
```{r}
subtypes <- c("1" = "flat", "4" = "trill", "6" = "complex", "22" = "22 kHz")
composed_anns$single.type.combs <- composed_anns$subtypes
for (i in seq_along(subtypes))
  composed_anns$single.type.combs <- gsub(names(subtypes)[i], subtypes[i], composed_anns$single.type.combs)
count_types <- round(table(composed_anns$single.type.combs) / nrow(composed_anns), 2)
count_types <- as.data.frame(count_types)
count_types$Count <- as.vector(table(composed_anns$single.type.combs))
count_types$comp <- sapply(count_types$Var1, function(x) composed_anns$composed.subtype.label[composed_anns$single.type.combs == x][1])
names(count_types) <- c("Single subtype combination", "Proportion", "Count", "Composed subtype")
count_types <- count_types[order(count_types$`Composed subtype`), ]
count_types[, c(4, 1:3)]
```
Proportion of calls per subtype for both single and composed (multi-element) subtypes
```{r}
combined_anns <- read.csv("./data/processed/annotations_simple_and_composed_subtypes_5ms.csv")
count_types <- round(table(combined_anns$subtype.label) / nrow(combined_anns), 2)
count_types <- as.data.frame(count_types)
count_types$Count <- as.vector(table(combined_anns$subtype.label))
names(count_types) <- c("Subtype", "Proportion", "Count")
count_types$type <- sapply(count_types$Subtype, function(x) unique(combined_anns$type[combined_anns$subtype.label == x]))
count_types
```
Proportion of calls per composed subtype label, split by type (single vs. composed)
```{r}
combined_anns <- read.csv("./data/processed/annotations_simple_and_composed_subtypes_5ms.csv")
count_types <- round(table(combined_anns$composed.subtype.label) / nrow(combined_anns), 2)
count_types <- as.data.frame(count_types)
count_types$Count <- as.vector(table(combined_anns$composed.subtype.label))
names(count_types) <- c("Subtype", "Proportion", "Count")
count_types$type <- sapply(count_types$Subtype, function(x) unique(combined_anns$type[combined_anns$composed.subtype.label == x]))
count_types
```
# Create spectrograms
```{r, eval = FALSE}
types <- unique(combined_anns$composed.subtype.label)
for (i in types[7:1]) {
print(i)
X <- combined_anns[combined_anns$composed.subtype.label == i,]
# subsample at most 64 calls per subtype for the catalog
set.seed(123)
if (nrow(X) > 64)
X <- X[sample(seq_len(nrow(X)), 64),]
nrw <- ncl <- 2
n <- nrow(X)
if (n > 4) {
nrw <- ncl <- ceiling(sqrt(n))
if (((nrw - 1) * ncl) >= n)
nrw <- nrw - 1
}
# print catalog
catalog(
X = X,
nrow = nrw,
ncol = ncl,
same.time.scale = TRUE,
mar = 0.001,
res = 100,
pb = FALSE,
spec.mar = 0.001,
max.group.cols = 5,
title = i,
ovlp = 90,
wl = 512,
width = 15,
height = 9,
hatching = 0,
cex = 1.3,
fast.spec = FALSE,
pal = viridis,
img.prefix = i,
rm.axes = TRUE,
flim = c(min(X$bottom.freq), max(X$top.freq)) + c(-1, 1),
collevels = seq(-120, 0, 5),
box = FALSE,
lab.mar = 0.00001
)
move_images(
from = .Options$warbleR$path,
to = "./scripts",
overwrite = TRUE,
cut = TRUE,
pb = FALSE
)
}
```
## Single element subtypes
22 kHz

Flats

Trills

Complex

## Composed subtypes
Step flat

Step trill

Step flat trill

Composed complex

# Random forest classification
Models were trained on 75% of the data, using 100-fold cross-validation repeated 100 times for tuning, and tested on the remaining 25%.
## On single element subtypes
```{r, eval = FALSE}
elm_anns <- read.csv("./data/processed/annotations_groups_spectrogram_and_mel_features.csv")
# exclude 22 kHz calls
elm_anns <- elm_anns[elm_anns$subtype.label != "22 kHz", ]
# remove collinear
target_features <- names(elm_anns)[!names(elm_anns) %in% c("sound.files", "selec", "View", "Channel", "start", "end", "bottom.freq", "top.freq", "Delta.Time..s.", "Begin.File", "subtipos", "selec.file", "true.end", "ovlp.sels", "composed.subtype", "subtype.label", names(elm_anns)[sapply(elm_anns, anyNA)])]
cormat <- cor(elm_anns[, target_features], use = "pairwise.complete.obs")
# findCorrelation() returns the indices of the highly correlated features to drop (cutoff = 0.9)
hc <- sort(findCorrelation(cormat, cutoff = 0.9))
if (length(hc) > 0) target_features <- target_features[-hc]
# Create data subsets
partition <- createDataPartition(
y = elm_anns$subtype.label,
times = 1,
p = 0.75,
list = FALSE
)
elm_anns$subtype.label <- as.factor(elm_anns$subtype.label)
trainset <- elm_anns[partition, c(target_features, "subtype.label")]
testset <- elm_anns[-partition, c(target_features, "subtype.label")]
trcontrol <-
trainControl(
method = "repeatedcv",
number = 100,
repeats = 100,
savePredictions = TRUE,
sampling = "down",
classProbs = TRUE,
returnResamp = "all"
)
pred_model <- train(
subtype.label ~ .,
data = trainset,
method = "rf",
trControl = trcontrol,
metric = "Accuracy",
preProcess = "scale"
)
ggplot(pred_model) + theme_bw()
# save confusion matrix
conf_mat <-
confusionMatrix(predict(pred_model, testset), testset$subtype.label)
conf_df <- as.data.frame(conf_mat$table)
conf_df$total <-
sapply(conf_df$Reference, function(x)
sum(testset$subtype.label ==
x))
conf_df$proportion <- conf_df$Freq / conf_df$total
# fit model on complete data set
best_rf_model <- pred_model$finalModel
# all_rf_model <- randomForest(
# subtype.label ~ .,
# data = elm_anns[, c(target_features, "subtype.label")], # Your entire dataset
# proximity = TRUE, # Include proximity matrix
# ntree = best_rf_model$ntree, # Number of trees
# mtry = best_rf_model$mtry, # Number of variables tried for splitting at each node
# nodesize = best_rf_model$nodesize, # Minimum size of terminal nodes
# maxnodes = best_rf_model$maxnodes # Maximum number of terminal nodes
# )
rf_model_results <-
list(
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
#all_rf_model = all_rf_model
)
saveRDS(
rf_model_results,
"./data/processed/random_forest_model_results.RDS"
)
```
### Checking performance on test data
```{r, eval = TRUE}
rf_model_results <-
readRDS("./data/processed/random_forest_model_results.RDS")
# print confusion matrix results
rf_model_results$conf_mat_bb
confusion_df <- rf_model_results$confusion_df_bb
ggplot(confusion_df, aes(x = Reference, y = Prediction, fill = proportion)) +
geom_tile() +
coord_equal() +
scale_fill_distiller(palette = "Greens", direction = 1) +
geom_text(aes(label = round(proportion, 2)), color = "black", size = 3) +
theme_classic() +
labs(x = "Observed", y = "Predicted") +
theme(axis.text.x = element_text(
color = "black",
size = 11,
angle = 30,
vjust = 0.8,
hjust = 0.8
))
```
## On composed (multi-element) subtypes
Summarize structure at the multi-element subtype level
```{r, eval = FALSE}
elm_anns <- read.csv("./data/processed/annotations_groups_spectrogram_and_mel_features.csv")
mean_colms <- c("duration", "meanfreq", "sd", "freq.median", "freq.Q25", "freq.Q75", "freq.IQR", "time.median", "time.Q25", "time.Q75", "time.IQR", "skew", "kurt", "sp.ent", "time.ent", "entropy", "sfm", "meandom", "mindom", "maxdom", "dfrange", "modindx", "startdom", "enddom", "dfslope", "meanpeakf", "min.cc1", "min.cc2", "min.cc3", "min.cc4", "min.cc5", "min.cc6", "min.cc7", "min.cc8", "min.cc9", "min.cc10", "min.cc11", "min.cc12", "min.cc13", "min.cc14", "min.cc15", "min.cc16", "min.cc17", "min.cc18", "min.cc19", "min.cc20", "min.cc21", "min.cc22", "min.cc23", "min.cc24", "min.cc25", "max.cc1", "max.cc2", "max.cc3", "max.cc4", "max.cc5", "max.cc6", "max.cc7", "max.cc8", "max.cc9", "max.cc10", "max.cc11", "max.cc12", "max.cc13", "max.cc14", "max.cc15", "max.cc16", "max.cc17", "max.cc18", "max.cc19", "max.cc20", "max.cc21", "max.cc22", "max.cc23", "max.cc24", "max.cc25", "median.cc1", "median.cc2", "median.cc3", "median.cc4", "median.cc5", "median.cc6", "median.cc7", "median.cc8", "median.cc9", "median.cc10", "median.cc11", "median.cc12", "median.cc13", "median.cc14", "median.cc15", "median.cc16", "median.cc17", "median.cc18", "median.cc19", "median.cc20", "median.cc21", "median.cc22", "median.cc23", "median.cc24", "median.cc25", "mean.cc1", "mean.cc2", "mean.cc3", "mean.cc4", "mean.cc5", "mean.cc6", "mean.cc7", "mean.cc8", "mean.cc9", "mean.cc10", "mean.cc11", "mean.cc12", "mean.cc13", "mean.cc14", "mean.cc15", "mean.cc16", "mean.cc17", "mean.cc18", "mean.cc19", "mean.cc20", "mean.cc21", "mean.cc22", "mean.cc23", "mean.cc24", "mean.cc25", "var.cc1", "var.cc2", "var.cc3", "var.cc4", "var.cc5", "var.cc6", "var.cc7", "var.cc8", "var.cc9", "var.cc10", "var.cc11", "var.cc12", "var.cc13", "var.cc14", "var.cc15", "var.cc16", "var.cc17", "var.cc18", "var.cc19", "var.cc20", "var.cc21", "var.cc22", "var.cc23", "var.cc24", "var.cc25", "skew.cc1", "skew.cc2", "skew.cc3", "skew.cc4", "skew.cc5", "skew.cc6", "skew.cc7", "skew.cc8", "skew.cc9", "skew.cc10", "skew.cc11", "skew.cc12", "skew.cc13", "skew.cc14", "skew.cc15", "skew.cc16", "skew.cc17", "skew.cc18", "skew.cc19", "skew.cc20", "skew.cc21", "skew.cc22", "skew.cc23", "skew.cc24", "skew.cc25", "kurt.cc1", "kurt.cc2", "kurt.cc3", "kurt.cc4", "kurt.cc5", "kurt.cc6", "kurt.cc7", "kurt.cc8", "kurt.cc9", "kurt.cc10", "kurt.cc11", "kurt.cc12", "kurt.cc13", "kurt.cc14", "kurt.cc15", "kurt.cc16", "kurt.cc17", "kurt.cc18", "kurt.cc19", "kurt.cc20", "kurt.cc21", "kurt.cc22", "kurt.cc23", "kurt.cc24", "kurt.cc25", "mean.d1.cc", "var.d1.cc", "mean.d2.cc", "var.d2.cc")
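# Sketch (not run): a more compact way to build the same list of features to
# average, assuming only the MFCC descriptors produced by mfcc_stats() contain
# ".cc" in their column names:
# mfcc_colms <- grep("\\.cc", names(elm_anns), value = TRUE)
# mean_colms <- c("duration", "meanfreq", "sd", "freq.median", "freq.Q25", "freq.Q75",
#                 "freq.IQR", "time.median", "time.Q25", "time.Q75", "time.IQR", "skew",
#                 "kurt", "sp.ent", "time.ent", "entropy", "sfm", "meandom", "mindom",
#                 "maxdom", "dfrange", "modindx", "startdom", "enddom", "dfslope",
#                 "meanpeakf", mfcc_colms)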
composed_param <- song_param(X = elm_anns, song_colm = "composed.subtype", mean_colm = mean_colms)
combined_anns <- read.csv("./data/processed/annotations_simple_and_composed_subtypes_5ms.csv")
composed_param$composed.subtype.label <- sapply(composed_param$composed.subtype, function(x) unique(combined_anns$composed.subtype.label[combined_anns$composed.subtype == x]))
write.csv(composed_param, "./data/processed/accoustic_features_simple_and_composed_subtypes.csv", row.names = FALSE)
```
### Train model
```{r, eval = FALSE}
composed_param <- read.csv("./data/processed/accoustic_features_simple_and_composed_subtypes.csv")
# keep only multi-element
composed_param <- composed_param[composed_param$num.elms > 1, ]
composed_param$song.rate[is.infinite(composed_param$song.rate)] <- mean(composed_param$song.rate[!is.infinite(composed_param$song.rate)])
# remove collinear
target_features <- names(composed_param)[!names(composed_param) %in% c("sound.files", "selec", "View", "Channel", "start", "end", "bottom.freq", "top.freq", "Delta.Time..s.", "Begin.File", "subtipos", "selec.file", "true.end", "ovlp.sels", "composed.subtype", "subtype.label", "composed.subtype.label", names(composed_param)[sapply(composed_param, anyNA)])]
cormat <- cor(composed_param[, target_features], use = "pairwise.complete.obs")
# findCorrelation() returns the indices of the highly correlated features to drop (cutoff = 0.9)
hc <- sort(findCorrelation(cormat, cutoff = 0.9))
if (length(hc) > 0) target_features <- target_features[-hc]
# Create data subsets
partition <- createDataPartition(
y = composed_param$composed.subtype.label,
times = 1,
p = 0.75,
list = FALSE
)
composed_param$composed.subtype.label <- as.factor(composed_param$composed.subtype.label)
trainset <- composed_param[partition, c(target_features, "composed.subtype.label")]
testset <- composed_param[-partition, c(target_features, "composed.subtype.label")]
trcontrol <-
trainControl(
method = "repeatedcv",
number = 100,
repeats = 100,
savePredictions = TRUE,
sampling = "down",
classProbs = TRUE,
returnResamp = "all"
)
pred_model <- train(
composed.subtype.label ~ .,
data = trainset,
method = "rf",
trControl = trcontrol,
metric = "Accuracy",
preProcess = "scale"
)
# ggplot(pred_model) + theme_bw()
# save confusion matrix
conf_mat <-
confusionMatrix(predict(pred_model, testset), as.factor(testset$composed.subtype.label))
conf_df <- as.data.frame(conf_mat$table)
conf_df$total <-
sapply(conf_df$Reference, function(x)
sum(testset$composed.subtype.label ==
x))
conf_df$proportion <- conf_df$Freq / conf_df$total
# fit model on complete data set
best_rf_model <- pred_model$finalModel
# all_rf_model <- randomForest(
# subtype.label ~ .,
# data = elm_anns[, c(target_features, "subtype.label")], # Your entire dataset
# proximity = TRUE, # Include proximity matrix
# ntree = best_rf_model$ntree, # Number of trees
# mtry = best_rf_model$mtry, # Number of variables tried for splitting at each node
# nodesize = best_rf_model$nodesize, # Minimum size of terminal nodes
# maxnodes = best_rf_model$maxnodes # Maximum number of terminal nodes
# )
rf_model_results <-
list(
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
#all_rf_model = all_rf_model
)
saveRDS(
rf_model_results,
"./data/processed/random_forest_model_results_composed_subtypes.RDS"
)
```
### Checking performance on test data
```{r, eval = TRUE}
rf_model_results <-
readRDS("./data/processed/random_forest_model_results_composed_subtypes.RDS")
# print confusion matrix results
rf_model_results$conf_mat_bb
confusion_df <- rf_model_results$confusion_df_bb
ggplot(confusion_df, aes(x = Reference, y = Prediction, fill = proportion)) +
geom_tile() +
coord_equal() +
scale_fill_distiller(palette = "Greens", direction = 1) +
geom_text(aes(label = round(proportion, 2)), color = "black", size = 3) +
theme_classic() +
labs(x = "Observed", y = "Predicted") +
theme(axis.text.x = element_text(
color = "black",
size = 11,
angle = 30,
vjust = 0.8,
hjust = 0.8
))
```
## On single and composed (multi-element) subtypes
### Train model
```{r, eval = FALSE}
composed_param <- read.csv("./data/processed/accoustic_features_simple_and_composed_subtypes.csv")
composed_param <- composed_param[composed_param$composed.subtype.label != "22 kHz", ]
composed_param$song.rate[is.infinite(composed_param$song.rate)] <- mean(composed_param$song.rate[!is.infinite(composed_param$song.rate)])
# remove collinear
target_features <- names(composed_param)[!names(composed_param) %in% c("sound.files", "selec", "View", "Channel", "start", "end", "bottom.freq", "top.freq", "Delta.Time..s.", "Begin.File", "subtipos", "selec.file", "true.end", "ovlp.sels", "composed.subtype", "subtype.label", "composed.subtype.label" , "song.rate", names(composed_param)[sapply(composed_param, anyNA)])]
cormat <- cor(composed_param[, target_features], use = "pairwise.complete.obs")
# findCorrelation() returns the indices of the highly correlated features to drop (cutoff = 0.9)
hc <- sort(findCorrelation(cormat, cutoff = 0.9))
if (length(hc) > 0) target_features <- target_features[-hc]
# Create data subsets
partition <- createDataPartition(
y = composed_param$composed.subtype.label,
times = 1,
p = 0.75,
list = FALSE
)
composed_param$composed.subtype.label <- as.factor(composed_param$composed.subtype.label)
trainset <- composed_param[partition, c(target_features, "composed.subtype.label")]
testset <- composed_param[-partition, c(target_features, "composed.subtype.label")]
trcontrol <-
trainControl(
method = "repeatedcv",
number = 100,
repeats = 100,
savePredictions = TRUE,
sampling = "down",
classProbs = TRUE,
returnResamp = "all"
)
pred_model <- train(
composed.subtype.label ~ .,
data = trainset,
method = "rf",
trControl = trcontrol,
metric = "Accuracy",
preProcess = "scale"
)
ggplot(pred_model) + theme_bw()
# save confusion matrix
conf_mat <-
confusionMatrix(predict(pred_model, testset), as.factor(testset$composed.subtype.label))
conf_df <- as.data.frame(conf_mat$table)
conf_df$total <-
sapply(conf_df$Reference, function(x)
sum(testset$composed.subtype.label ==
x))
conf_df$proportion <- conf_df$Freq / conf_df$total
# fit model on complete data set
best_rf_model <- pred_model$finalModel
# all_rf_model <- randomForest(
# subtype.label ~ .,
# data = elm_anns[, c(target_features, "subtype.label")], # Your entire dataset
# proximity = TRUE, # Include proximity matrix
# ntree = best_rf_model$ntree, # Number of trees
# mtry = best_rf_model$mtry, # Number of variables tried for splitting at each node
# nodesize = best_rf_model$nodesize, # Minimum size of terminal nodes
# maxnodes = best_rf_model$maxnodes # Maximum number of terminal nodes
# )
rf_model_results <-
list(
pred_model_bb = pred_model,
conf_mat_bb = conf_mat,
confusion_df_bb = conf_df,
testset_bb = testset
#all_rf_model = all_rf_model
)
saveRDS(
rf_model_results,
"./data/processed/random_forest_model_results_single_and_composed_subtypes.RDS"
)
```
### Checking performance on test data
```{r, eval = TRUE}
rf_model_results <-
readRDS("./data/processed/random_forest_model_results_single_and_composed_subtypes.RDS")
# print confusion matrix results
rf_model_results$conf_mat_bb
confusion_df <- rf_model_results$confusion_df_bb
ggplot(confusion_df, aes(x = Reference, y = Prediction, fill = proportion)) +
geom_tile() +
coord_equal() +
scale_fill_distiller(palette = "Greens", direction = 1) +
geom_text(aes(label = round(proportion, 2)), color = "black", size = 3) +
theme_classic() +
labs(x = "Observed", y = "Predicted") +
theme(axis.text.x = element_text(
color = "black",
size = 11,
angle = 30,
vjust = 0.8,
hjust = 0.8
))
```
<!-- light green box -->
::: {.alert .alert-success}
# Takeaways {#takeaways .unnumbered .unlisted}
- Decent classification of single-element subtypes (accuracy: 0.76)
- Poor classification of composed subtypes (accuracy: 0.52, and 0.37 when single-element subtypes are included); see the accuracy check right after this box
:::
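A quick way to retrieve these accuracies from the saved model results (a sketch, assuming the three RDS files written above are present):
```{r, eval = FALSE}
# overall accuracy from each saved confusion matrix
sapply(
  c(single = "./data/processed/random_forest_model_results.RDS",
    composed = "./data/processed/random_forest_model_results_composed_subtypes.RDS",
    single_and_composed = "./data/processed/random_forest_model_results_single_and_composed_subtypes.RDS"),
  function(x) readRDS(x)$conf_mat_bb$overall["Accuracy"]
)
```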
<!-- '---' adds a gray horizontal line -->
------------------------------------------------------------------------
# To-do list
- Try other classification methods (VGGish, BirdNET, Google model)
- Fix empty subtypes (NAs) and those labeled 5 and 44
- Find specific examples of rare subtypes to increase the training sample size
- Find complete recordings (or clips) containing rare subtypes for the test sample
```{r, eval = FALSE, echo=FALSE}
# Flats: no frequency changes equal to or greater than 5 kHz within the same call.
# Trill: 1 frequency change of 5 kHz or more, 2 changes of 4 kHz each, or 3 or more changes of 3 kHz each.
# Steps: a change equal to or greater than 5 kHz between components, which must not be separated by more than 0.05 seconds (50 milliseconds).
# With these criteria all the remaining combinations are built: trill-trills, step-flats, step-trills, etc.
#
# If you already have the flats (1), trills (4) and complex (6), you only need to extract the step-flats, step-trills and trill-trills.
```
---
# Session information {.unnumbered .unlisted}
```{r session info, echo=F}
sessionInfo()
```