The object of study of this paper is the analysis of the Anuran Calls (MFCCs) dataset. As we can learn from the documentation:
“This dataset was used in several classifications tasks related to the challenge of anuran species recognition through their calls. It is a multilabel dataset with three columns of labels. This dataset was created segmenting 60 audio records belonging to 4 different families, 8 genus, and 10 species. Each audio corresponds to one specimen (an individual frog), the record ID is also included as an extra column.”
The dataset page states that, after using the spectral entropy and a binary cluster method to detect the audio frames belonging to each syllable, 7195 syllables were found. From each of them 22 Mel-frequency cepstral coefficients (MFCCs) were calculated; MFCCs are the coefficients that collectively make up a mel-frequency cepstrum (MFC). These coefficients were normalized between -1 and 1: since each syllable has a different length, every row was normalized according to \(\frac{MFCCs_i}{max|MFCCs_i|}\). Looking at the final data (the only ones available), it seems to us that the declared normalization was not applied to all the variables; in fact we noticed that some variables are distributed over a smaller interval. However, since the raw dataset is not available and we could not obtain precise information on the meaning of the individual attributes, we decided not to modify the normalization of the other attributes any further.
The dataset is therefore composed of 22 attributes, all numeric. Moreover we have 3 possible levels of classification: our anurans are divided into families, genera and species. Obviously, if an anuran belongs to a species it also belongs to a precise genus and family, so it makes no sense to use one of these 3 variables as target and the others as regressors. The object of this analysis is to find a way to identify the species/genus/family of an anuran using only its call. Obviously it would be easier to separate anurans by family than by species (we have only 4 different families but 10 species). In any case we will discuss later the type of analysis we want to perform.
This dataset seems simple, but the difficulty could lie in interpreting the attributes. One might think that the interpretation of any single attribute is not important, since they are only a codification of sound, but understanding their meaning can help to remove some of them, to check whether one is particularly important, and to see in which attributes the differences between families are most visible.
So, before starting to look at the data, we will have a quick look at the meaning of the attributes. Cepstral features contain information about the rate of change in the different spectrum bands. The influence of the vocal cords and of the vocal tract on a signal can be separated, since the low-frequency excitation and the formant filtering of the vocal tract are located in different regions of the cepstral domain. The first-order coefficient represents the distribution of spectral energy between low and high frequencies (normally low-frequency regions represent sonorant sounds while high-frequency regions represent fricative sounds). The lower-order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function, while higher-order coefficients represent increasing levels of spectral detail.
It would therefore seem possible to use only the first attributes, but, for example, 12 to 20 cepstral coefficients are typically considered optimal for speech analysis. This makes us understand that an immediate choice of the most important attributes is not possible.
#Generic libraries
library(plyr) #"revalue" function belongs to this package
library(knitr) #"kable" function belongs to this package
library(dplyr, warn.conflicts = FALSE)
#Graphical libraries
library(ggplot2) #main component in building graphs
library(magrittr) #required by ggpubr - if omitted it is loaded automatically
library(ggpubr, warn.conflicts = FALSE) #provides some functions for creating and customizing ggplot2 plots
library(ggcorrplot) #used in the correlation plot
library(ggdendro) #used to build dendrograms
#Machine Learning libraries
library(lattice) #required by caret - if omitted it is loaded automatically
library(caret) #used for Classification and Regression Training - functions like "train" belong to this package
library(grid) #required by DMwR - if omitted it is loaded automatically
suppressMessages(library(DMwR)) #includes functions accompanying the book "Data Mining with R" - functions like "kNN" belong to this package - "suppressMessages" is used to avoid warning messages in the output
library(rpart) #used to build decision trees
library(rpart.plot) #used to plot decision trees
suppressMessages(library(randomForest)) #used to perform the random forest algorithm
library(nnet) #provides Neural Networks and Multinomial Log-Linear Models - the "multinom" function belongs to this package
library(e1071) #utility package - the "svm" function belongs to this package
suppressMessages(library(UBL)) #the "SmoteClassif" function belongs to this package
The dataset, composed of 7195 rows and 26 columns, has no missing values. We removed the last attribute, which was the record ID of the frog.
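For reference, a minimal sketch of how the data could be loaded and the ID column dropped; the file name Frogs_MFCCs.csv and the column name RecordID are assumptions based on the standard UCI distribution, not necessarily the exact names used here.
anuran <- read.csv("Frogs_MFCCs.csv", stringsAsFactors = TRUE) #load the raw data (assumed file name)
anuran <- anuran[, names(anuran) != "RecordID"] #drop the record ID column (assumed column name)
dim(anuran) #7195 rows, 25 remaining columns: 22 MFCCs + Family, Genus, Species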
First, we take a look at some data points to get a feeling for what the values of the various columns look like. As said before, all the attributes take real values, normalized between -1 and +1, except for the last three attributes, which are categorical. So in this case there is no preprocessing work to do, because it was done before publishing the dataset.
## MFCCs_.1 MFCCs_.2 MFCCs_.3 MFCCs_.4 MFCCs_.5 MFCCs_.6 MFCCs_.7
## 1 1 0.15293630 -0.10558590 0.2007219 0.3172011 0.2607639 0.100944641
## 2 1 0.17153426 -0.09897474 0.2684252 0.3386719 0.2683531 0.060835087
## 3 1 0.15231709 -0.08297267 0.2871280 0.2760141 0.1898668 0.008713957
## 4 1 0.22439245 0.11898466 0.3294317 0.3720880 0.3610046 0.015501040
## 5 1 0.08781691 -0.06834489 0.3069667 0.3309229 0.2491439 0.006883713
## 6 1 0.09970374 -0.03340782 0.3498951 0.3445353 0.2475688 0.022406957
## MFCCs_.8 MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12 MFCCs_13 MFCCs_14
## 1 -0.1500626 -0.17112763 0.1246764 0.1886541 -0.07562172 -0.1564359 0.08224512
## 2 -0.2224746 -0.20769267 0.1708829 0.2709583 -0.09500394 -0.2543415 0.02278623
## 3 -0.2422342 -0.21915332 0.2325383 0.2660645 -0.07282719 -0.2373836 0.05079074
## 4 -0.1943475 -0.09818067 0.2703754 0.2672789 -0.16225825 -0.3170842 -0.01156743
## 5 -0.2654234 -0.17269981 0.2664343 0.3326951 -0.10074854 -0.2985239 0.03743889
## 6 -0.2137672 -0.12791598 0.2773526 0.3098613 -0.13452792 -0.2951227 0.01248602
## MFCCs_15 MFCCs_16 MFCCs_17 MFCCs_18 MFCCs_19 MFCCs_20
## 1 0.1357520 -0.02401665 -0.10835111 -0.07762252 -0.009567802 0.05768398
## 2 0.1633201 0.01202228 -0.09097401 -0.05650952 -0.035303357 0.02013996
## 3 0.2073384 0.08353570 -0.05069143 -0.02359023 -0.066721549 -0.02508323
## 4 0.1004128 -0.05022373 -0.13600940 -0.17703701 -0.130498133 -0.05476640
## 5 0.2191528 0.06283723 -0.04888462 -0.05307351 -0.088550403 -0.03134557
## 6 0.1806410 0.05524178 -0.08048746 -0.13008922 -0.171477611 -0.07156940
## MFCCs_21 MFCCs_22 Family Genus Species
## 1 0.11868014 0.01403845 Leptodactylidae Adenomera AdenomeraAndre
## 2 0.08226299 0.02905574 Leptodactylidae Adenomera AdenomeraAndre
## 3 0.09910840 0.07716238 Leptodactylidae Adenomera AdenomeraAndre
## 4 -0.01869145 0.02395431 Leptodactylidae Adenomera AdenomeraAndre
## 5 0.10860983 0.07924433 Leptodactylidae Adenomera AdenomeraAndre
## 6 0.07764295 0.06490259 Leptodactylidae Adenomera AdenomeraAndre
It can be useful to have a statistical overview of the data, to understand how each attribute is distributed.
## MFCCs_.1 MFCCs_.2 MFCCs_.3 MFCCs_.4
## Min. :-0.2512 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## 1st Qu.: 1.0000 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 1.0000 Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.9899 Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 1.0000 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
##
## MFCCs_.5 MFCCs_.6 MFCCs_.7 MFCCs_.8
## Min. :-0.63601 Min. :-0.41042 Min. :-0.538982 Min. :-0.5765062
## 1st Qu.: 0.05172 1st Qu.: 0.01258 1st Qu.:-0.125737 1st Qu.:-0.0631089
## Median : 0.16136 Median : 0.07208 Median :-0.052630 Median : 0.0132649
## Mean : 0.12705 Mean : 0.09794 Mean :-0.001397 Mean :-0.0003701
## 3rd Qu.: 0.22259 3rd Qu.: 0.17596 3rd Qu.: 0.085580 3rd Qu.: 0.0751075
## Max. : 0.75225 Max. : 0.96424 Max. : 1.000000 Max. : 0.5517624
##
## MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12
## Min. :-0.587313 Min. :-0.952266 Min. :-0.90199 Min. :-0.79944
## 1st Qu.: 0.004648 1st Qu.:-0.001132 1st Qu.:-0.26986 1st Qu.:-0.03393
## Median : 0.189317 Median : 0.063478 Median :-0.15332 Median : 0.05105
## Mean : 0.128213 Mean : 0.055998 Mean :-0.11568 Mean : 0.04337
## 3rd Qu.: 0.265395 3rd Qu.: 0.117725 3rd Qu.: 0.02669 3rd Qu.: 0.13243
## Max. : 0.738033 Max. : 0.522768 Max. : 0.52303 Max. : 0.69089
##
## MFCCs_13 MFCCs_14 MFCCs_15 MFCCs_16
## Min. :-0.644116 Min. :-0.59038 Min. :-0.71716 Min. :-0.49868
## 1st Qu.:-0.002859 1st Qu.:-0.13298 1st Qu.:-0.25593 1st Qu.:-0.01955
## Median : 0.196921 Median :-0.05071 Median :-0.14326 Median : 0.04108
## Mean : 0.150945 Mean :-0.03924 Mean :-0.10175 Mean : 0.04206
## 3rd Qu.: 0.324589 3rd Qu.: 0.03916 3rd Qu.: 0.01735 3rd Qu.: 0.10705
## Max. : 0.945710 Max. : 0.57575 Max. : 0.66892 Max. : 0.67070
##
## MFCCs_17 MFCCs_18 MFCCs_19
## Min. :-0.421480 Min. :-0.759322 Min. :-0.680745
## 1st Qu.:-0.001764 1st Qu.:-0.042122 1st Qu.:-0.106079
## Median : 0.112769 Median : 0.011820 Median :-0.052626
## Mean : 0.088680 Mean : 0.007755 Mean :-0.049474
## 3rd Qu.: 0.201932 3rd Qu.: 0.061889 3rd Qu.: 0.006321
## Max. : 0.681157 Max. : 0.614064 Max. : 0.574209
##
## MFCCs_20 MFCCs_21 MFCCs_22
## Min. :-0.361649 Min. :-0.43081 Min. :-0.3793043
## 1st Qu.:-0.120971 1st Qu.:-0.01762 1st Qu.: 0.0005327
## Median :-0.055180 Median : 0.03127 Median : 0.1053726
## Mean :-0.053244 Mean : 0.03731 Mean : 0.0875675
## 3rd Qu.: 0.001342 3rd Qu.: 0.08962 3rd Qu.: 0.1948188
## Max. : 0.467831 Max. : 0.38980 Max. : 0.4322068
##
## Family Genus Species
## Bufonidae : 68 Adenomera :4150 AdenomeraHylaedactylus:3478
## Dendrobatidae : 542 Hypsiboas :1593 HypsiboasCordobae :1121
## Hylidae :2165 Ameerega : 542 AdenomeraAndre : 672
## Leptodactylidae:4420 Dendropsophus: 310 Ameeregatrivittata : 542
## Leptodactylus: 270 HypsiboasCinerascens : 472
## Scinax : 148 HylaMinuta : 310
## (Other) : 182 (Other) : 600
We can immediately notice that the frogs are not balanced between the 4 families (nor between genera and species). In fact, as shown in the plot below, about 50% of the data belong to the AdenomeraHylaedactylus species and more than 60% come from the Leptodactylidae family, while the Bufonidae family includes less than 1% of the analyzed frogs.
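A quick way to quantify this imbalance (assuming the data frame is called anuran, as in the sketch above) is to look at the class proportions directly.
round(prop.table(table(anuran$Family)) * 100, 2) #percentage of rows per family
round(prop.table(table(anuran$Species)) * 100, 2) #percentage of rows per species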
It could also be interesting to have a look at the distribution of the different attributes. To have a complete view we plot them in pairs, one mirrored below the other.
As usual, our hope is that the attributes are more or less normally distributed: obviously this is not the case. We have to keep this feature of our dataset in mind when we choose the methods we want to apply. In particular, most of the values of the attribute MFCCs_.1 are equal or very close to 1. We decided to remove it, because it does not give us much information and because its meaning is related only to a summary measure.
Before starting to analyze our data, we would like to discover whether some attributes differ strongly between families. A way to do this is to draw boxplots of these attributes and see whether there are substantial differences between families. It is unlikely that we can select only 2 or 3 attributes for our analysis, but it could help to get an idea of the data behaviour. In principle, as we tried to explain before, we do not know which coefficients are more significant, so we start by looking at some boxplots for the first ones (attribute one is excluded, since it shows only how the spectral energy is distributed between high and low frequencies).
A very tight box, like the one for the Bufonidae family, means that about \(50\%\) of the frogs from this family have more or less the same value. However, there is also a high number of outliers. In any case, even the concentrated region is not useful to classify an anuran into the Bufonidae family, because there are many overlaps with the other classes. Moreover, in general we can notice that the other classes also have many outliers and there are no clear patterns in the distribution of the families through the attributes. So obviously we need more advanced methods to classify our data.
Before moving on to these methods, it could be useful to have an idea of how the attributes are correlated with each other, so we plot a correlation matrix.
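A possible sketch of this plot with ggcorrplot (the object name anuran is an assumption carried over from the loading sketch above):
mfcc <- anuran[, sapply(anuran, is.numeric)] #keep only the numeric MFCC columns
ggcorrplot(cor(mfcc), type = "lower", tl.cex = 7) #lower triangle of the correlation matrix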
We can notice a strange pattern in the correlation matrix: most variables \(i\) are negatively correlated with variables \(i-2\) and \(i+2\) and, as a consequence, positively correlated with \(i-4\) and \(i+4\). Some of the negative correlations are quite high in absolute value, around \(0.8\) or even \(0.9\), so this must be taken into consideration.
In order to analyze our dataset we can proceed in different ways: the first is to perform classification methods on the entire dataset, while the other is to perform the classification on a smaller set of predictors.
New predictors can be selected according to two categories of methods: subset selection and dimension reduction. The first approach involves identifying a subset of the \(p\) variables that we believe to be related to the response. The second approach involves projecting the \(p\) variables onto a subspace of reduced dimension; the projected variables are then used as predictors.
There exist both supervised and unsupervised techniques to reduce the dimensionality of our problem; we will consider two unsupervised methods, one among the subset selection methods and one among the dimension reduction methods.
Principal component analysis (PCA) is a data reduction technique that transforms correlated variables into uncorrelated ones called principal components. We know that, when faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. Given the non-negligible correlations between the variables, PCA may do a good job on our dataset.
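A minimal sketch of how such a PCA can be computed with prcomp; the data frame anuran_MFCCs (the 21 retained MFCCs plus the label columns) and the standardization of the variables are assumptions, not necessarily the exact choices behind the summary below.
pca <- prcomp(anuran_MFCCs[, sapply(anuran_MFCCs, is.numeric)], center = TRUE, scale. = TRUE) #PCA on the MFCC columns
summary(pca) #standard deviations and (cumulative) proportions of variance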
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.7644 1.8048 1.5746 1.30174 1.2020 0.96354 0.90391
## Proportion of Variance 0.3639 0.1551 0.1181 0.08069 0.0688 0.04421 0.03891
## Cumulative Proportion 0.3639 0.5190 0.6371 0.71776 0.7866 0.83077 0.86968
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86319 0.63264 0.5705 0.53322 0.48519 0.45679 0.38008
## Proportion of Variance 0.03548 0.01906 0.0155 0.01354 0.01121 0.00994 0.00688
## Cumulative Proportion 0.90516 0.92422 0.9397 0.95326 0.96447 0.97440 0.98128
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.36717 0.32550 0.24633 0.20558 0.15962 0.12433 0.09191
## Proportion of Variance 0.00642 0.00505 0.00289 0.00201 0.00121 0.00074 0.00040
## Cumulative Proportion 0.98770 0.99275 0.99564 0.99765 0.99886 0.99960 1.00000
Normally in PCA there are two main criteria to choose how many principal components to use in the subsequent analysis. The first is the minimum explained variance criterion: using the cumulative explained variance plot, we choose the first \(p\) components needed to reach a predefined explained variance threshold. Imagine we would like to keep at least \(90\%\) of the original variance in our dataset. In this case, looking at the summary we can see that the first \(8\) components explain \(90.5\%\) of the variance, so we could decide to use only these variables in our analysis. The second way to choose the number of principal components is the elbow rule. PCA is built in such a way that the proportion of variance explained by each component decreases, initially sharply, then more modestly. The elbow rule says that we should choose the first number \(p\) of principal components such that the drop in explained variance from component \(p\) to component \(p+1\) becomes negligible. In this case, as we can see from the second plot, we could choose either the fourth or the sixth component.
In the end we decided to use the first six components. Below it is possible to see the plots of the cumulative and per-component proportion of variance explained. The horizontal line in the first plot is the \(0.9\) threshold we chose, while the vertical line in the second plot corresponds to the sixth component.
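A sketch of how these two plots can be produced from the prcomp object defined above (the object name pca is an assumption):
pve <- pca$sdev^2 / sum(pca$sdev^2) #proportion of variance explained by each component
par(mfrow = c(1, 2))
plot(cumsum(pve), type = "b", xlab = "Principal component", ylab = "Cumulative PVE")
abline(h = 0.9, col = "red") #the 90% threshold
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
abline(v = 6, col = "red") #the chosen sixth component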
Now it is possible to see how the families are distributed in the first two principal components and in two of the original variables (we decided to use MFCCs_.3 and MFCCs_13, looking at the PCA's autoplot). Obviously the data are better separated when the PCA variables are used.
We also decided to use a subset selection technique, i.e. to select a subset of variables that explains the whole dataset. Often this is done with supervised methods such as best subset selection or stepwise selection; on the contrary, we would like to use an approach based only on our predictors. Our aim is to identify and remove features according to their pairwise correlation with the others: if two variables have a high correlation, we can probably use only one of them instead of both. To do this we cluster the features using agglomerative hierarchical clustering. At the start this method creates one cluster for each feature. Then, by computing the distance between all clusters, it selects the two clusters with the lowest distance to be merged (this corresponds to merging the attributes with the highest correlation in absolute value). This process is repeated, using the average linkage criterion, until we end up with a single cluster. This criterion works as follows: if \(A\) and \(B\) are two clusters, the distance between them is calculated as \[\frac{1}{|A||B|}\sum_{a\in A} \sum_{b\in B} d(a,b)\] where \[d(a,b) = 1 - |\rho_{a,b}|.\]
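A hedged sketch of this feature clustering, using the correlation-based distance defined above and average linkage (the mfcc data frame of MFCC columns is an assumption from the earlier sketch):
d_feat <- as.dist(1 - abs(cor(mfcc))) #distance d(a,b) = 1 - |rho_ab| between features
hc <- hclust(d_feat, method = "average") #agglomerative clustering with average linkage
ggdendrogram(hc, rotate = TRUE) #dendrogram of the features (ggdendro package)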
Now we have to decide the threshold at which to cut our hierarchical clustering. This threshold is strictly related to the pairwise correlation between attributes: intuitively, the lower the threshold, the greater the average correlation left between the remaining variables. In this specific case we decided to set the threshold to 0.3; greater values would imply combining \(5\) or more variables into \(1\). The couples/triples to merge that we obtain are:
To decide which attribute to keep in each couple/triple we use their variance: if a variable has a greater variance, it will better explain the data. So we keep MFCCs_12, MFCCs_13, MFCCs_11 and MFCCs_.3.
This method has several limits. Firstly, in general it does not guarantee the elimination of all the highest correlations: as we can see from the correlation matrix below, even if the average correlation between variables has decreased, there are still some highly correlated variables, like MFCCs_11 and MFCCs_13. Moreover, in theory this method is less effective than PCA at extracting information using a small number of variables. Normally a subset selection method is preferred when the principal components are really hard to interpret and a clear interpretation is needed. This is not our case, so PCA would be preferred, but for some methods we would like to use both strategies to compare them and see the differences.
Now we have 3 different datasets on which we can test our classification techniques. The first one, anuran_MFCCs, is the original dataset with the first variable removed; it contains 21 attributes, some of which are strongly correlated. The second one, anuran_cl, is composed of a subset of the original variables; it contains 14 attributes, most of which are not strongly correlated. The third one, anuran_pca, is the set composed of the first 6 principal components; obviously all its variables are uncorrelated. (In this count we are not considering the 3 attributes Family, Genus and Species.)
Before starting with the classification some considerations have to be made. For instance, the Naive Bayes classifier should not be used on anuran_MFCCs and anuran_cl because the variables are correlated, while it can be used on anuran_pca since the principal components are uncorrelated. Moreover, as we noticed before, most of the original attributes have bimodal densities, far from a normal distribution, so on anuran_MFCCs and anuran_cl it is not appropriate to use linear (or quadratic) discriminant analysis. anuran_pca has different distributions with respect to the original attributes, but, as we can see from the graph below, the first component is strongly bimodal, so it is still better not to use LDA and QDA.
As we pointed out before, using this dataset we could theoretically perform \(3\) different levels of analysis. All three lead us to a multinomial and very unbalanced classification problem. Obviously, it is more difficult to achieve good results using Species as the classification attribute.
To have reliable results we decided to simplify our job and follow two different paths. The first is a binary family classification: in this case our goal is to decide whether an anuran belongs to the Leptodactylidae family or to one of the other families. In this way the analysis becomes simpler and the dataset becomes more balanced. The second type of analysis we would like to perform uses all four families as target. In this case we have to pay attention, because the dataset is very unbalanced; we will see later some possible ways to deal with it.
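A possible way to build the binary version of the target (the object names anuran_MFCCs and anuran_bin are assumptions):
anuran_bin <- anuran_MFCCs[, !(names(anuran_MFCCs) %in% c("Genus", "Species"))] #keep MFCCs + Family
anuran_bin$Family <- factor(ifelse(anuran_bin$Family == "Leptodactylidae",
                                   "Leptodactylidae", "Others"),
                            levels = c("Others", "Leptodactylidae")) #collapse to two classes
table(anuran_bin$Family) #Others: 2775, Leptodactylidae: 4420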
Below we can take a look at the summary of the \(3\) datasets used in their binary version. The first is the complete dataset, the second is the dataset resulting from the feature clustering procedure, while the third is the dataset with the principal components (often abbreviated as PCA dataset).
## Family MFCCs_.2 MFCCs_.3 MFCCs_.4
## Others :2775 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## Leptodactylidae:4420 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## MFCCs_.5 MFCCs_.6 MFCCs_.7 MFCCs_.8
## Min. :-0.63601 Min. :-0.41042 Min. :-0.538982 Min. :-0.5765062
## 1st Qu.: 0.05172 1st Qu.: 0.01258 1st Qu.:-0.125737 1st Qu.:-0.0631089
## Median : 0.16136 Median : 0.07208 Median :-0.052630 Median : 0.0132649
## Mean : 0.12705 Mean : 0.09794 Mean :-0.001397 Mean :-0.0003701
## 3rd Qu.: 0.22259 3rd Qu.: 0.17596 3rd Qu.: 0.085580 3rd Qu.: 0.0751075
## Max. : 0.75225 Max. : 0.96424 Max. : 1.000000 Max. : 0.5517624
## MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12
## Min. :-0.587313 Min. :-0.952266 Min. :-0.90199 Min. :-0.79944
## 1st Qu.: 0.004648 1st Qu.:-0.001132 1st Qu.:-0.26986 1st Qu.:-0.03393
## Median : 0.189317 Median : 0.063478 Median :-0.15332 Median : 0.05105
## Mean : 0.128213 Mean : 0.055998 Mean :-0.11568 Mean : 0.04337
## 3rd Qu.: 0.265395 3rd Qu.: 0.117725 3rd Qu.: 0.02669 3rd Qu.: 0.13243
## Max. : 0.738033 Max. : 0.522768 Max. : 0.52303 Max. : 0.69089
## MFCCs_13 MFCCs_14 MFCCs_15 MFCCs_16
## Min. :-0.644116 Min. :-0.59038 Min. :-0.71716 Min. :-0.49868
## 1st Qu.:-0.002859 1st Qu.:-0.13298 1st Qu.:-0.25593 1st Qu.:-0.01955
## Median : 0.196921 Median :-0.05071 Median :-0.14326 Median : 0.04108
## Mean : 0.150945 Mean :-0.03924 Mean :-0.10175 Mean : 0.04206
## 3rd Qu.: 0.324589 3rd Qu.: 0.03916 3rd Qu.: 0.01735 3rd Qu.: 0.10705
## Max. : 0.945710 Max. : 0.57575 Max. : 0.66892 Max. : 0.67070
## MFCCs_17 MFCCs_18 MFCCs_19
## Min. :-0.421480 Min. :-0.759322 Min. :-0.680745
## 1st Qu.:-0.001764 1st Qu.:-0.042122 1st Qu.:-0.106079
## Median : 0.112769 Median : 0.011820 Median :-0.052626
## Mean : 0.088680 Mean : 0.007755 Mean :-0.049474
## 3rd Qu.: 0.201932 3rd Qu.: 0.061889 3rd Qu.: 0.006321
## Max. : 0.681157 Max. : 0.614064 Max. : 0.574209
## MFCCs_20 MFCCs_21 MFCCs_22
## Min. :-0.361649 Min. :-0.43081 Min. :-0.3793043
## 1st Qu.:-0.120971 1st Qu.:-0.01762 1st Qu.: 0.0005327
## Median :-0.055180 Median : 0.03127 Median : 0.1053726
## Mean :-0.053244 Mean : 0.03731 Mean : 0.0875675
## 3rd Qu.: 0.001342 3rd Qu.: 0.08962 3rd Qu.: 0.1948188
## Max. : 0.467831 Max. : 0.38980 Max. : 0.4322068
## Family MFCCs_.2 MFCCs_.3 MFCCs_.4
## Others :2775 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## Leptodactylidae:4420 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## MFCCs_.6 MFCCs_.8 MFCCs_10 MFCCs_11
## Min. :-0.41042 Min. :-0.5765062 Min. :-0.952266 Min. :-0.90199
## 1st Qu.: 0.01258 1st Qu.:-0.0631089 1st Qu.:-0.001132 1st Qu.:-0.26986
## Median : 0.07208 Median : 0.0132649 Median : 0.063478 Median :-0.15332
## Mean : 0.09794 Mean :-0.0003701 Mean : 0.055998 Mean :-0.11568
## 3rd Qu.: 0.17596 3rd Qu.: 0.0751075 3rd Qu.: 0.117725 3rd Qu.: 0.02669
## Max. : 0.96424 Max. : 0.5517624 Max. : 0.522768 Max. : 0.52303
## MFCCs_12 MFCCs_13 MFCCs_16 MFCCs_18
## Min. :-0.79944 Min. :-0.644116 Min. :-0.49868 Min. :-0.759322
## 1st Qu.:-0.03393 1st Qu.:-0.002859 1st Qu.:-0.01955 1st Qu.:-0.042122
## Median : 0.05105 Median : 0.196921 Median : 0.04108 Median : 0.011820
## Mean : 0.04337 Mean : 0.150945 Mean : 0.04206 Mean : 0.007755
## 3rd Qu.: 0.13243 3rd Qu.: 0.324589 3rd Qu.: 0.10705 3rd Qu.: 0.061889
## Max. : 0.69089 Max. : 0.945710 Max. : 0.67070 Max. : 0.614064
## MFCCs_19 MFCCs_20 MFCCs_21
## Min. :-0.680745 Min. :-0.361649 Min. :-0.43081
## 1st Qu.:-0.106079 1st Qu.:-0.120971 1st Qu.:-0.01762
## Median :-0.052626 Median :-0.055180 Median : 0.03127
## Mean :-0.049474 Mean :-0.053244 Mean : 0.03731
## 3rd Qu.: 0.006321 3rd Qu.: 0.001342 3rd Qu.: 0.08962
## Max. : 0.574209 Max. : 0.467831 Max. : 0.38980
## MFCCs_22
## Min. :-0.3793043
## 1st Qu.: 0.0005327
## Median : 0.1053726
## Mean : 0.0875675
## 3rd Qu.: 0.1948188
## Max. : 0.4322068
## Family PC1 PC2 PC3
## Others :2775 Min. :-7.8008 Min. :-9.2926 Min. :-5.9990
## Leptodactylidae:4420 1st Qu.:-2.6456 1st Qu.:-0.7959 1st Qu.:-0.8600
## Median : 0.1537 Median : 0.2125 Median :-0.2168
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 2.5459 3rd Qu.: 1.0528 3rd Qu.: 0.6380
## Max. : 7.0325 Max. : 9.0015 Max. : 8.3187
## PC4 PC5 PC6
## Min. :-6.8051 Min. :-4.61914 Min. :-4.37908
## 1st Qu.:-0.5745 1st Qu.:-0.70560 1st Qu.:-0.53654
## Median : 0.1476 Median :-0.02184 Median :-0.01848
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.7040 3rd Qu.: 0.53067 3rd Qu.: 0.49216
## Max. : 6.2008 Max. : 6.40734 Max. : 4.68080
The performance of machine learning algorithms is typically evaluated using predictive accuracy. Accuracy tells us the percentage of test cases that have been correctly classified, i.e. the fraction of test cases for which we correctly identified the Family. However, this is not always the most appropriate measure. Particularly when the data is imbalanced and/or the costs of different errors vary considerably, we also need other measures to evaluate our results.
We can define the confusion matrix, which for a binary classification problem is a table with 2 rows and 2 columns. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.
| CONFUSION.MATRIX | Predicted.Others | Predicted.Leptodactylidae |
|---|---|---|
| Others | True negative | False positive |
| Leptodactylidae | False negative | True positive |
We can define the precision, that is the number of true positive predictions divided by the total number of predicted positives. Precision can be thought of as a measure of a classifier's exactness: a low precision indicates a large number of false positives.
\[precision = \frac{TP}{TP+FP}\]
The recall instead is the number of true positive predictions divided by the number of actual positives. Recall can be thought of as a measure of a classifier's completeness: a low recall indicates many false negatives.
\[recall = \frac{TP}{TP+FN}\]
Another value that can be useful, especially with an unbalanced dataset, is the specificity. It tells us what fraction of all negative samples is correctly predicted as negative by the classifier.
\[specificity = \frac{TN}{TN+FP}\]
Lastly, the F1 score expresses the balance between the two previous values, precision and recall. \[F_{1}\ score = 2\,\frac{precision \times recall}{precision + recall}\]
When we face multi-class classification the confusion matrix becomes slightly more complex. Unlike binary classification, in fact, there are no positive or negative classes here: what we have to do is to find TP, TN, FP and FN for each individual class. The formulas for precision, recall and specificity are the same. In general we will use accuracy or the F1 score, which is a good summary measure.
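As an illustration, a small helper that computes these measures from a 2x2 confusion matrix laid out as in the table above (rows = actual, columns = predicted, with the positive class in the second position). This is only a sketch, not the code used for the results reported later.
binary_metrics <- function(cm) {
  TN <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]; TP <- cm[2, 2]
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1          <- 2 * precision * recall / (precision + recall)
  c(accuracy = (TP + TN) / sum(cm), precision = precision,
    recall = recall, specificity = specificity, f1 = f1)
}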
Our goal is to analyze the performance of the different classifiers using \(k\)-fold cross-validation. This procedure involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k-1\) folds. The model is then evaluated on the observations in the held-out fold. This procedure is repeated \(k\) times and at each iteration a different group of observations is treated as the validation set. The \(k\)-fold CV estimate is computed by averaging the model score over the iterations. For a good trade-off between runtime and accuracy of the score we chose \(k=5\), so the classifiers are trained on \(80\%\) of the training data in each iteration.
When subpopulations within an overall population vary a lot, it can be better to sample each subpopulation independently. This technique is called stratified sampling and its objective is to improve the precision of the sample by reducing the sampling error. A possibility is to use the proportionate allocation strategy: it uses a sampling fraction in each of the strata that is proportional to that of the total population. For instance, in our binary classification \(62\%\) of the observations come from the Leptodactylidae family, so, according to this technique, approximately \(62\%\) of the sampled data should belong to the Leptodactylidae family.
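With caret, this kind of stratified 5-fold split can be obtained with createFolds, which stratifies on the outcome factor by default. A sketch, assuming the binary data frame anuran_bin defined earlier:
set.seed(1)
folds <- createFolds(anuran_bin$Family, k = 5) #list of 5 stratified index vectors
#each fold keeps roughly the 62% proportion of Leptodactylidae observations
sapply(folds, function(idx) mean(anuran_bin$Family[idx] == "Leptodactylidae"))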
When the dataset is very unbalanced (like in our multiclass analysis), another thing we can do is try to balance the classes in the samples. Normally the algorithm receives significantly more examples from one class, prompting it to be biased towards that particular class. It does not learn what makes the other classes different and fails to understand the underlying patterns that allow us to distinguish among the classes. The algorithm only learns that a given class is more common, making it natural to have a greater tendency towards it, so it is led to overfit the majority class. A technique to overcome this problem is oversampling, i.e. increasing the number of minority class instances in order to match the size of the majority class. It consists of resampling the smaller class at random until it contains as many samples as the majority class. This process can balance the class distribution but does not provide any additional information to the model. A more elaborate technique is SMOTE (Synthetic Minority Oversampling TEchnique). It introduces synthetic examples belonging to the smaller class: each new example is generated at a random point along the line segment between a minority-class observation and one of its nearest minority-class neighbours. This approach effectively forces the decision region of the minority class to become more general.
In our case, since the dataset is quite simple, we do not need any of these techniques to improve our results. At the end of the classification part we will show an example of the use of the SMOTE technique, trying to classify correctly the families of our anurans.
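A hedged sketch of how SMOTE could be applied with the SmoteClassif function of the UBL package, balancing the four families; the object names and the oversampling settings are assumptions and may differ from those used later.
dat <- anuran_MFCCs[, !(names(anuran_MFCCs) %in% c("Genus", "Species"))] #numeric MFCCs + Family
set.seed(1)
anuran_smote <- SmoteClassif(Family ~ ., dat, C.perc = "balance", k = 5) #synthetic oversampling
table(anuran_smote$Family) #the four classes are now approximately balanced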
In this section we start classifying our data. We will focus both on the binary dataset, the simplified one in which only two target classes are available, Leptodactylidae and Others, and on the complete one, in which all \(4\) families are used. In the first case we are obviously interested in finding the anurans belonging to the Leptodactylidae family. As mentioned before, we will try each classification method on the original dataset, on the dataset resulting from PCA and on the dataset resulting from feature clustering.
The first classifier we want to apply to our datasets is the K-Nearest Neighbors classifier. It computes the distance between the record to classify and each observation in the training set, identifies its \(k\) nearest neighbors and uses their class labels to determine the class label of the unknown record. There is only one parameter, the number \(k\), which corresponds to the number of nearest neighbors considered for the prediction.
We will first use the very powerful train function, which allows us to set many options in an easy way; for example, it can automatically perform cross-validation and tune the value of \(k\). Unfortunately, if we want to obtain the F1 score we need a customized approach, so we also built another procedure based on the kNN function to show some results for the F1 score as well.
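A sketch of the caret-based approach with 5-fold cross-validation and tuning of \(k\) (the data frame anuran_bin and the tuning grid are assumptions):
set.seed(1)
ctrl <- trainControl(method = "cv", number = 5) #5-fold cross-validation
knn_fit <- train(Family ~ ., data = anuran_bin, method = "knn",
                 trControl = ctrl, tuneGrid = data.frame(k = seq(1, 15, by = 2)))
knn_fit$bestTune #value of k with the best cross-validated accuracy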
As shown in the table below, all three binary models have very high accuracy, but the time spent on the PCA_bin dataset is much smaller, so it is preferable.
| Binary.dataset | Accuracy | Best.k | Elapsed.time |
|---|---|---|---|
| Base | 0.9911049 | 5 | 6.342187 secs |
| Features clustering | 0.9902710 | 5 | 4.857320 secs |
| PCA | 0.9845726 | 5 | 2.972195 secs |
Also with our homemade function we find similar results; in particular, the F1 score values are also close to \(1\).
| Binary.dataset | Accuracy | F1_score | k | Elapsed.time |
|---|---|---|---|---|
| Base | 0.9915219 | 0.9931097 | 5 | 0.9646461 secs |
| Features clustering | 0.9906880 | 0.9924371 | 5 | 0.6799769 secs |
| PCA | 0.9851286 | 0.9879137 | 5 | 0.3258700 secs |
Now we try the first approach also on the dataset with all \(4\) families.
| Family.dataset | Accuracy | Best.k | Elapsed.time |
|---|---|---|---|
| Base | 0.9894366 | 5 | 6.274721 secs |
| Features clustering | 0.9879080 | 5 | 4.953400 secs |
| PCA | 0.9827650 | 5 | 3.017058 secs |
What we can notice is that, as before, the best results are given by the use of the entire dataset, but this is also the slowest approach. On the contrary, using the PCA dataset gives us slightly worse results, justified, however, by the shorter execution time.
Obviously, the accuracy obtained on the binary dataset is (slightly) higher than the accuracy on the \(4\)-family dataset.
We did not try any particular sampling technique because accuracy and F1 score are already close to \(1\) and there is no point in building more complex models.
Below we can see how the accuracy changes when we change the \(k\) parameter.
Tree-based methods involve segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation we use the most commonly occurring class of the training observations in the region to which it belongs. In interpreting the result of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.
A possible criterion for making the binary splits is the classification error rate, which is simply the fraction of the training observations in a region that do not belong to the most common class. However, the two most common measures used to build a classification tree are the Gini index
\[G = \sum_{k=1}^{K} p_{mk}(1-p_{mk})\]
and the cross entropy
\[D = -\sum_{k=1}^{K} p_{mk}\log(p_{mk})\]
Both measures take on a value near zero if the \(p_{mk}\)'s are all near zero or near one, where \(p_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class. We will use the Gini index. Let's see an example of a decision tree based on our dataset.
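A hedged sketch of how such a tree could be fitted and evaluated on a held-out portion of the data; the object names and the 80/20 split are assumptions, not the exact code that produced the matrix below.
set.seed(1)
idx <- createDataPartition(anuran_MFCCs$Family, p = 0.8, list = FALSE) #stratified train/test split
tree <- rpart(Family ~ . - Genus - Species, data = anuran_MFCCs[idx, ],
              method = "class", parms = list(split = "gini")) #classification tree with Gini splits
rpart.plot(tree) #draw the fitted tree
predict_unseen <- predict(tree, anuran_MFCCs[-idx, ], type = "class")
table(anuran_MFCCs$Family[-idx], predict_unseen) #confusion matrix on the held-out data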
## predict_unseen
## Bufonidae Dendrobatidae Hylidae Leptodactylidae
## Bufonidae 8 0 1 2
## Dendrobatidae 0 78 12 4
## Hylidae 1 10 416 32
## Leptodactylidae 0 1 28 846
| Family.dataset | Accuracy |
|---|---|
| Base | 0.9367616 |
To perform our analysis with this classifier we use algorithms based on the rpart function, both for the binary and the four-class dataset. In this case we can notice that PCA is both faster and more accurate. This is due to the fact that the decision tree algorithm looks for simple regions in which to divide the data, and this is easier when the features are simplified like the principal components.
| Binary.dataset | F1_score | Elapsed.time |
|---|---|---|
| Base | 0.9542883 | 1.0641739 secs |
| Features clustering | 0.9518497 | 0.7542579 secs |
| PCA | 0.9620110 | 0.3160570 secs |
As we can see below, also in the case in which we consider all \(4\) families we have the same behaviour. This time we decided to use accuracy for simplicity.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9161918 | 1.4675071 secs |
| Features clustering | 0.9095205 | 0.9549370 secs |
| PCA | 0.9309243 | 0.3622961 secs |
The Random Forest classifier is a set of decision trees built on various sub-samples of the training set, randomly selected by bootstrapping. In the inference stage it aggregates the votes from the different decision trees to decide the final class of the test object; this improves the predictive accuracy and controls overfitting. We can summarize the Random Forest algorithm in the following steps: select random samples from the given dataset, construct a decision tree for each sample and get a prediction from each tree, take a vote over the predicted results and finally select the prediction with the most votes as the final prediction. The result of the Random Forest algorithm is more reliable than that of a single decision tree because each tree compensates the errors of the others.
As done before with KNN, we could use different algorithms. The first possibility, the one we will use, is based on the specific randomForest function from the namesake package. The other uses the general train function, in which it is easier to tune the parameters; unfortunately, this second method was too slow and we decided to discard it.
Random forests allow us to obtain better results compared with decision trees, but the algorithms take longer to finish their job. We decided to tune the number of trees and the number of variables randomly sampled as candidates at each split, but, to avoid spending too much time, we built a small \(3\times 3\) grid in which the \(3\) values for the number of trees are \(100\), \(500\) and \(1000\), while the \(3\) values for the number of variables at each split are chosen around the square root of the number of features.
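A hedged sketch of the grid described above and of a single fit with the finally chosen values; each grid combination would be scored with 5-fold cross-validation, and the object names are assumptions.
p <- ncol(anuran_bin) - 1 #number of predictors
grid <- expand.grid(ntree = c(100, 500, 1000), mtry = round(sqrt(p)) + (-1):1) #3x3 grid
set.seed(1)
rf <- randomForest(Family ~ ., data = anuran_bin, ntree = 100, mtry = 3) #finally chosen values
rf$confusion #out-of-bag confusion matrix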
The best hyperparameters selected are shown in the table below, together with their scores.
Although the best performances were obtained with 1000 trees, we noticed that the difference in performance with respect to 100 trees was negligible. We therefore chose the latter value, as it makes the algorithm faster. We also noticed that, in our case, even variations of mtry do not lead to considerable differences in the F1 score.
| Binary.dataset | F1_score | Elapsed.time | Ntree | Mtry |
|---|---|---|---|---|
| Base | 0.9902671 | 5.210014 secs | 100 | 3 |
| Features clustering | 0.9876599 | 4.031174 secs | 100 | 3 |
| PCA | 0.9850577 | 2.261171 secs | 100 | 1 |
As we can see, contrary to what happened with decision trees, the best results have been obtained with the complete dataset. In fact, when we combine many trees we can better explore all the features of the dataset, so with more attributes it is possible to achieve better results. Below we can see the accuracy values when all \(4\) families are used; the trend is the same as in the binary dataset.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9842946 | 5.293714 secs |
| Features clustering | 0.9835997 | 4.315891 secs |
| PCA | 0.9780403 | 2.796565 secs |
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables, so it is a binary classifier by nature. However, it can also be used for multi-class classification with \(K>2\) classes.
In the binary case it takes a linear combination of the features and applies a non-linear function (the sigmoid) to it. We can define \(x_{i}\) as the n-dimensional feature vector of a given sample and \(\beta_{0},\beta=(\beta_{1},..,\beta_{n})\) as the model parameters. Then the logistic regression model is defined as:
\[\mathcal{P}(y=1|x_{i})=\frac{\exp(\beta_{0}+x^{T}_{i}\beta)}{1+\exp(\beta_{0}+x^{T}_{i}\beta)}\] where \(y\) is the response vector of the binary problem.
A first possible approach is to convert the binary classifier to a multi-class classifier with 1 vs rest and 1 vs 1 methods.
It is also possible to generalize it into multinomial logistic regression, or softmax regression, for multi-class problems. In this case, the dependent variable \(y\) is a categorical variable that takes any one of \(K\) distinct values representing \(K>2\) different classes, and to model the probability of the event \(y=i\) a softmax function is used instead of the logistic function:
\[\mathcal{P}(y=i\vert{\bf x})=\frac{\exp(w_{i}^T x)}{\sum_{k=1}^{K} \exp(w_{k}^T x)}\]
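A sketch of the two fits, a binomial glm for the binary problem and nnet's multinom for the four families (the data frame names are assumptions):
logit_bin <- glm(Family ~ ., data = anuran_bin, family = binomial) #binary logistic regression
logit_fam <- multinom(Family ~ . - Genus - Species, data = anuran_MFCCs,
                      trace = FALSE, maxit = 200) #multinomial (softmax) regression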
| Binary.dataset | Accuracy | F1_score | Elapsed.time |
|---|---|---|---|
| Base | 0.9915219 | 0.9622599 | 1.3017950 secs |
| Features clustering | 0.9906880 | 0.9543400 | 0.5539131 secs |
| PCA | 0.9452397 | 0.9554097 | 0.5427570 secs |
With logistic regression the accuracy is clearly worse in the \(4\)-class case with respect to the binary case. The issue with the softmax function is that it blows small differences out of proportion, which makes our classifier biased towards a particular class.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9391244 | 3.045820 secs |
| Features clustering | 0.9266157 | 2.347250 secs |
| PCA | 0.9081306 | 2.120302 secs |
The support vector machine is also a natural approach for classification in the two-class setting. It is a discriminative classifier formally defined by a separating hyperplane. If the training data is linearly separable, we actually select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. In a two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side. We can then classify a test observation based on which side of the maximum-margin hyperplane it lies on.
This leads to the following optimization problem, called the hard-margin problem
\[\min_{w,b} \frac{1}{2} \|w\|^2\]
\[\text{s.t.} \hspace{2mm} y_{i}(\langle x_{i}, w\rangle +b) \geq 1, \hspace{2mm} \forall i\]
where \(w\) is the normal vector to the hyperplane, \(x_i\) is a point, \(y_i\) is the \(i\)-th target and \(\frac{b}{\|w\|}\) determines the offset of the hyperplane from the origin along the normal vector.
To extend SVM to cases in which the data are not linearly separable, we can introduce the soft-margin problem
\[\min_{w,b} \frac{1}{2} \|w\|^2 + c\sum_{i} \xi_i\]
\[\text{s.t.} \hspace{2mm} y_{i}(\langle x_{i}, w\rangle +b) \geq 1- \xi_{i}, \hspace{2mm} \xi_{i} \geq 0, \hspace{2mm} \forall i\]
where \(\xi_i\) is a slack variable whose value is the distance of \(x_i\) from its class' margin if \(x_i\) is on the wrong side of the margin, and zero otherwise. \(c\) determines the trade-off between increasing the margin size and ensuring that each \(x_i\) lies on the correct side of the margin; in fact it controls the number and severity of the margin violations that we tolerate. When \(c\) is small, classification mistakes are given less importance and we focus on maximizing the margin, whereas when \(c\) is large the focus is more on avoiding misclassification at the expense of keeping the margin small.
As presented, it is clear that a support vector classifier will perform poorly with non-linear class boundaries. In order to address non-linearity it is possible to enlarge the feature space using kernels. The algorithm is essentially the same, except that every inner product is replaced by a non-linear kernel function.
One of the most popular choices is the radial kernel, which takes the form
\[K(x_{i}, x_{i'}) = \exp (-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2})\]
The parameter gamma defines how far the influence of a single training example reaches, with low values meaning "far" and high values meaning "close". It is a sort of sensitivity of the model: if it is too large, the radius of the area of influence of the support vectors includes only the support vector itself, whereas if gamma is too small the model is too constrained.
Everything we have said so far is limited to the case of binary classification. The two most popular \(K\)-class algorithms for SVM are one-versus-one and one-versus-all.
If there are \(K>2\) classes, a one-versus-one approach constructs \(\binom{K}{2}\) SVMs, each of which compares a pair of classes. The final classification is performed by assigning the test observation to the class to which it was most frequently assigned in the \(\binom{K}{2}\) pairwise classifications. The one-versus-all approach fits \(K\) SVMs and, using the coefficients \(\beta_{ik}\), it builds an index of confidence that the test observation belongs to the \(k\)th class rather than to any of the others.
For the binary case, after tuning the parameters we found that the best value for gamma is about \(0.1\) while the best value for the cost is \(100\). Moreover, we noticed that gamma is a very sensitive parameter, while changing the value of the cost does not have much impact. We use a radial kernel, but good results can also be obtained with a linear kernel.
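A hedged sketch of this tuning and of the final radial-kernel fit with e1071 (the grid values and object names are assumptions):
set.seed(1)
tuned <- tune.svm(Family ~ ., data = anuran_bin, kernel = "radial",
                  gamma = c(0.01, 0.1, 1), cost = c(1, 10, 100)) #grid search with cross-validation
tuned$best.parameters #around gamma = 0.1 and cost = 100
svm_fit <- svm(Family ~ ., data = anuran_bin, kernel = "radial", gamma = 0.1, cost = 100)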
| Binary.dataset | F1_score | Elapsed.time |
|---|---|---|
| Base | 0.9941003 | 5.040407 secs |
| Features clustering | 0.9933235 | 3.813344 secs |
| PCA | 0.9864161 | 3.486036 secs |
Here we can see a graph of how the accuracy on the PCA dataset changes as gamma changes.
Finally, we also show the results for the \(4\)-family dataset. The function we use performs the multi-class classification with the one-versus-one approach. As we can see, SVM outperforms logistic regression: this normally happens when the classes are well separated.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9917999 | 6.768103 secs |
| Features clustering | 0.9899931 | 4.717085 secs |
| PCA | 0.9799861 | 3.417723 secs |
To test how effective the SMOTE approach could be on our dataset, we decided to apply it to the SVM classifier in the case of \(4\) families. We can notice an improvement in our results, but, since the accuracy was already about \(99\%\), the SMOTE approach can be avoided. Maybe it could be more useful when analyzing the species of each anuran, but this is beyond our scope.
| SMOTE.approach | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9958310 | 5.947000 secs |
| Features clustering | 0.9933296 | 4.281599 secs |
| PCA | 0.9874931 | 2.538894 secs |
Having completed the study of the various classifiers, our goal now is to compare them and understand which is the best performing model. As a general trend we noticed that it was better to conduct the analysis on the whole dataset: even if it takes more time, this does not imply a prohibitive wait, since the number of observations is limited. In order to compare the classifiers we decided to use, in addition to accuracy, also precision, recall and specificity (the F1 score is used as well). However, computing these indicators for the multi-class dataset is rather cumbersome. Since in our case the results on two classes are very similar to those on four, we decided to compute these values only for the binary classifiers.
| Classifier | Accuracy | Precision | Recall | Specificity | F1.score |
|---|---|---|---|---|---|
| KNN | 0.9915219 | 0.9909890 | 0.9952489 | 0.9923803 | 0.9931143 |
| Decision tree | 0.9423211 | 0.9513185 | 0.9549774 | 0.9278463 | 0.9531444 |
| Random forest | 0.9874913 | 0.9911525 | 0.9884615 | 0.9817008 | 0.9898052 |
| Logistic regression | 0.9535789 | 0.9615906 | 0.9628959 | 0.9407728 | 0.9622428 |
| SVM | 0.9926338 | 0.9974937 | 0.9904977 | 0.9850321 | 0.9939834 |
It is clear from the table that the two best classifiers are KNN and SVM. In this case we think KNN is to be preferred, because SVM has some problems in predicting the "negative" class, so it could suffer with more unbalanced data, as when we try to classify all \(4\) families.