The object of study of this paper is the analysis of the Anuran Calls (MFCCs) dataset. As we can learn from the documentation:
“This dataset was used in several classifications tasks related to the challenge of anuran species recognition through their calls. It is a multilabel dataset with three columns of labels. This dataset was created segmenting 60 audio records belonging to 4 different families, 8 genus, and 10 species. Each audio corresponds to one specimen (an individual frog), the record ID is also included as an extra column.”
The dataset page states that, after using the spectral entropy and a binary cluster method to detect the audio frames belonging to each syllable, 7195 syllables were found. From each of them 22 Mel-frequency cepstral coefficients (MFCCs) were calculated; MFCCs are the coefficients that collectively make up a mel-frequency cepstrum (MFC). These coefficients were normalized between -1 and 1: since each syllable has a different length, every row was normalized according to \(\frac{MFCCs_i}{max|MFCCs_i|}\). Looking at the final data (the only ones available), it seems to us that the declared normalization was not applied to all the variables; in fact we noticed that some variables are distributed over a smaller interval. However, since the raw dataset is not available and we could not obtain precise information on the meaning of the individual attributes, we decided not to modify the normalization of the other attributes any further.
The dataset is therefore composed of 22 attributes, all numeric. Moreover we have 3 possible levels of classification: our anurans are divided into families, genera and species. Obviously, if an anuran belongs to a species it also belongs to a precise genus and family, so it makes no sense to use one of these 3 variables as target and the others as regressors. The object of this analysis is to find a way to identify the species/genus/family of an anuran using only its call. Obviously it would be easier to separate anurans by family than by species (we have only 4 different families but 10 species). In any case we will discuss later the type of analysis we want to perform.
This dataset seems simple, but the difficulty could lie in interpreting the attributes. One might think that the interpretation of any single attribute is not important, since they are only a codification of sound, but understanding their meaning can help to remove some of them, to check whether one is particularly important, and to see in which attributes the differences between families are most visible.
So, before starting to look at the data, we will have a quick look at the meaning of the attributes. Cepstral features contain information about the rate of change in the different spectrum bands. The influence of the vocal cords and of the vocal tract on a signal can be separated, since the low-frequency excitation and the formant filtering of the vocal tract are located in different regions of the cepstral domain. The first-order coefficient represents the distribution of spectral energy between low and high frequencies (normally low-frequency regions represent sonorant sounds while high-frequency regions represent fricative sounds). The lower-order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function, while higher-order coefficients represent increasing levels of spectral detail.
It would therefore seem possible to use only the first attributes, but, for example, 12 to 20 cepstral coefficients are typically considered optimal for speech analysis. This makes us understand that an immediate choice of the most important attributes is not possible.
#Generic libraries
library(plyr) #"revalue" function belongs to this package
library(knitr) #"kable" function belongs to this package
library(dplyr, warn.conflicts = FALSE)
#Graphical libraries
library(ggplot2) #main component in building graphs
library(magrittr) #required by ggpubr - if omitted it is loaded automatically
library(ggpubr, warn.conflicts = FALSE) #provides some functions for creating and customizing ggplot2 plots
library(ggcorrplot) #used in the correlation plot
library(ggdendro) #used to build dendrograms
#Machine Learning libraries
library(lattice) #required by caret - if omitted it is loaded automatically
library(caret) #used for Classification and Regression Training - functions like "train" belong to this package
library(grid) #required by DMwR - if omitted it is loaded automatically
suppressMessages(library(DMwR)) #includes functions accompanying the book "Data Mining with R" - functions like "kNN" belong to this package - "suppressMessages" is used to avoid warning messages in the output
library(rpart) #used to build decision trees
library(rpart.plot) #used to plot decision trees
suppressMessages(library(randomForest)) #used to perform the random forest algorithm
library(nnet) #provides Neural Networks and Multinomial Log-Linear Models - the "multinom" function belongs to this package
library(e1071) #utility package - the "svm" function belongs to this package
suppressMessages(library(UBL)) #the "SmoteClassif" function belongs to this package
The dataset, composed of 7195 rows and 26 columns, has no missing values. We removed the last attribute, which was the record ID of the frog.
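For reference, a minimal sketch of how the data could be loaded and the ID column dropped; the file name Frogs_MFCCs.csv and the column name RecordID are assumptions based on the standard UCI distribution, not necessarily the exact names used here.
anuran <- read.csv("Frogs_MFCCs.csv", stringsAsFactors = TRUE) #load the raw data (assumed file name)
anuran <- anuran[, names(anuran) != "RecordID"] #drop the record ID column (assumed column name)
dim(anuran) #7195 rows, 25 remaining columns: 22 MFCCs + Family, Genus, Species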
First, we take a look at some data points to get a feeling for what the values of the various columns look like. As said before, all the attributes take real values, normalized between -1 and +1, except for the last three attributes, which are categorical. So in this case there is no preprocessing work to do, because it was done before publishing the dataset.
## MFCCs_.1 MFCCs_.2 MFCCs_.3 MFCCs_.4 MFCCs_.5 MFCCs_.6 MFCCs_.7
## 1 1 0.15293630 -0.10558590 0.2007219 0.3172011 0.2607639 0.100944641
## 2 1 0.17153426 -0.09897474 0.2684252 0.3386719 0.2683531 0.060835087
## 3 1 0.15231709 -0.08297267 0.2871280 0.2760141 0.1898668 0.008713957
## 4 1 0.22439245 0.11898466 0.3294317 0.3720880 0.3610046 0.015501040
## 5 1 0.08781691 -0.06834489 0.3069667 0.3309229 0.2491439 0.006883713
## 6 1 0.09970374 -0.03340782 0.3498951 0.3445353 0.2475688 0.022406957
## MFCCs_.8 MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12 MFCCs_13 MFCCs_14
## 1 -0.1500626 -0.17112763 0.1246764 0.1886541 -0.07562172 -0.1564359 0.08224512
## 2 -0.2224746 -0.20769267 0.1708829 0.2709583 -0.09500394 -0.2543415 0.02278623
## 3 -0.2422342 -0.21915332 0.2325383 0.2660645 -0.07282719 -0.2373836 0.05079074
## 4 -0.1943475 -0.09818067 0.2703754 0.2672789 -0.16225825 -0.3170842 -0.01156743
## 5 -0.2654234 -0.17269981 0.2664343 0.3326951 -0.10074854 -0.2985239 0.03743889
## 6 -0.2137672 -0.12791598 0.2773526 0.3098613 -0.13452792 -0.2951227 0.01248602
## MFCCs_15 MFCCs_16 MFCCs_17 MFCCs_18 MFCCs_19 MFCCs_20
## 1 0.1357520 -0.02401665 -0.10835111 -0.07762252 -0.009567802 0.05768398
## 2 0.1633201 0.01202228 -0.09097401 -0.05650952 -0.035303357 0.02013996
## 3 0.2073384 0.08353570 -0.05069143 -0.02359023 -0.066721549 -0.02508323
## 4 0.1004128 -0.05022373 -0.13600940 -0.17703701 -0.130498133 -0.05476640
## 5 0.2191528 0.06283723 -0.04888462 -0.05307351 -0.088550403 -0.03134557
## 6 0.1806410 0.05524178 -0.08048746 -0.13008922 -0.171477611 -0.07156940
## MFCCs_21 MFCCs_22 Family Genus Species
## 1 0.11868014 0.01403845 Leptodactylidae Adenomera AdenomeraAndre
## 2 0.08226299 0.02905574 Leptodactylidae Adenomera AdenomeraAndre
## 3 0.09910840 0.07716238 Leptodactylidae Adenomera AdenomeraAndre
## 4 -0.01869145 0.02395431 Leptodactylidae Adenomera AdenomeraAndre
## 5 0.10860983 0.07924433 Leptodactylidae Adenomera AdenomeraAndre
## 6 0.07764295 0.06490259 Leptodactylidae Adenomera AdenomeraAndre
It can be useful to have a statistical overview of the data, to understand how each attribute is distributed.
## MFCCs_.1 MFCCs_.2 MFCCs_.3 MFCCs_.4
## Min. :-0.2512 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## 1st Qu.: 1.0000 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 1.0000 Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.9899 Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 1.0000 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
##
## MFCCs_.5 MFCCs_.6 MFCCs_.7 MFCCs_.8
## Min. :-0.63601 Min. :-0.41042 Min. :-0.538982 Min. :-0.5765062
## 1st Qu.: 0.05172 1st Qu.: 0.01258 1st Qu.:-0.125737 1st Qu.:-0.0631089
## Median : 0.16136 Median : 0.07208 Median :-0.052630 Median : 0.0132649
## Mean : 0.12705 Mean : 0.09794 Mean :-0.001397 Mean :-0.0003701
## 3rd Qu.: 0.22259 3rd Qu.: 0.17596 3rd Qu.: 0.085580 3rd Qu.: 0.0751075
## Max. : 0.75225 Max. : 0.96424 Max. : 1.000000 Max. : 0.5517624
##
## MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12
## Min. :-0.587313 Min. :-0.952266 Min. :-0.90199 Min. :-0.79944
## 1st Qu.: 0.004648 1st Qu.:-0.001132 1st Qu.:-0.26986 1st Qu.:-0.03393
## Median : 0.189317 Median : 0.063478 Median :-0.15332 Median : 0.05105
## Mean : 0.128213 Mean : 0.055998 Mean :-0.11568 Mean : 0.04337
## 3rd Qu.: 0.265395 3rd Qu.: 0.117725 3rd Qu.: 0.02669 3rd Qu.: 0.13243
## Max. : 0.738033 Max. : 0.522768 Max. : 0.52303 Max. : 0.69089
##
## MFCCs_13 MFCCs_14 MFCCs_15 MFCCs_16
## Min. :-0.644116 Min. :-0.59038 Min. :-0.71716 Min. :-0.49868
## 1st Qu.:-0.002859 1st Qu.:-0.13298 1st Qu.:-0.25593 1st Qu.:-0.01955
## Median : 0.196921 Median :-0.05071 Median :-0.14326 Median : 0.04108
## Mean : 0.150945 Mean :-0.03924 Mean :-0.10175 Mean : 0.04206
## 3rd Qu.: 0.324589 3rd Qu.: 0.03916 3rd Qu.: 0.01735 3rd Qu.: 0.10705
## Max. : 0.945710 Max. : 0.57575 Max. : 0.66892 Max. : 0.67070
##
## MFCCs_17 MFCCs_18 MFCCs_19
## Min. :-0.421480 Min. :-0.759322 Min. :-0.680745
## 1st Qu.:-0.001764 1st Qu.:-0.042122 1st Qu.:-0.106079
## Median : 0.112769 Median : 0.011820 Median :-0.052626
## Mean : 0.088680 Mean : 0.007755 Mean :-0.049474
## 3rd Qu.: 0.201932 3rd Qu.: 0.061889 3rd Qu.: 0.006321
## Max. : 0.681157 Max. : 0.614064 Max. : 0.574209
##
## MFCCs_20 MFCCs_21 MFCCs_22
## Min. :-0.361649 Min. :-0.43081 Min. :-0.3793043
## 1st Qu.:-0.120971 1st Qu.:-0.01762 1st Qu.: 0.0005327
## Median :-0.055180 Median : 0.03127 Median : 0.1053726
## Mean :-0.053244 Mean : 0.03731 Mean : 0.0875675
## 3rd Qu.: 0.001342 3rd Qu.: 0.08962 3rd Qu.: 0.1948188
## Max. : 0.467831 Max. : 0.38980 Max. : 0.4322068
##
## Family Genus Species
## Bufonidae : 68 Adenomera :4150 AdenomeraHylaedactylus:3478
## Dendrobatidae : 542 Hypsiboas :1593 HypsiboasCordobae :1121
## Hylidae :2165 Ameerega : 542 AdenomeraAndre : 672
## Leptodactylidae:4420 Dendropsophus: 310 Ameeregatrivittata : 542
## Leptodactylus: 270 HypsiboasCinerascens : 472
## Scinax : 148 HylaMinuta : 310
## (Other) : 182 (Other) : 600
We can immediately notice that the frogs are not balanced between the 4 families (nor between genera and species). In fact, as shown in the plot below, about 50% of the data belong to the AdenomeraHylaedactylus species and more than 60% come from the Leptodactylidae family, while the Bufonidae family includes less than 1% of the analyzed frogs.
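A quick way to quantify this imbalance (assuming the data frame is called anuran, as in the sketch above) is to look at the class proportions directly.
round(prop.table(table(anuran$Family)) * 100, 2) #percentage of rows per family
round(prop.table(table(anuran$Species)) * 100, 2) #percentage of rows per species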
It could also be interesting to have a look at the distribution of the different attributes. To have a complete view we plot them in pairs, one mirrored below the other.
As usual, our hope is that the attributes are more or less normally distributed: obviously this is not the case. We have to keep this feature of our dataset in mind when we choose the methods we want to apply. In particular, most of the values of the attribute MFCCs_.1 are equal or very close to 1. We decided to remove it, because it does not give us much information and because its meaning is related only to a summary measure.
Before starting to analyze our data, we would like to discover whether some attributes differ strongly between families. A way to do this is to draw boxplots of these attributes and see whether there are substantial differences between families. It is unlikely that we can select only 2 or 3 attributes for our analysis, but it could help to get an idea of the data behaviour. In principle, as we tried to explain before, we do not know which coefficients are more significant, so we start by looking at some boxplots for the first ones (attribute one is excluded, since it shows only how the spectral energy is distributed between high and low frequencies).
A very tight box, like the one for the Bufonidae family, means that about \(50\%\) of the frogs from this family have more or less the same value. However, there is also a high number of outliers. In any case, even the concentrated region is not useful to classify an anuran into the Bufonidae family, because there are many overlaps with the other classes. Moreover, in general we can notice that the other classes also have many outliers and there are no clear patterns in the distribution of the families through the attributes. So obviously we need more advanced methods to classify our data.
Before moving on to these methods, it could be useful to have an idea of how the attributes are correlated with each other, so we plot a correlation matrix.
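A possible sketch of this plot with ggcorrplot (the object name anuran is an assumption carried over from the loading sketch above):
mfcc <- anuran[, sapply(anuran, is.numeric)] #keep only the numeric MFCC columns
ggcorrplot(cor(mfcc), type = "lower", tl.cex = 7) #lower triangle of the correlation matrix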
We can notice a strange pattern in the correlation matrix: most variables \(i\) are negatively correlated with variables \(i-2\) and \(i+2\) and, as a consequence, positively correlated with \(i-4\) and \(i+4\). Some of the negative correlations are quite high in absolute value, around \(0.8\) or even \(0.9\), so this must be taken into consideration.
In order to analyze our dataset we can proceed in different ways: the first is to perform classification methods on the entire dataset, while the other is to perform the classification on a smaller set of predictors.
New predictors can be selected according to two categories of methods: subset selection and dimension reduction. The first approach involves identifying a subset of the \(p\) variables that we believe to be related to the response. The second approach involves projecting the \(p\) variables onto a subspace of reduced dimension; the projected variables are then used as predictors.
There exist both supervised and unsupervised techniques to reduce the dimensionality of our problem; we will consider two unsupervised methods, one among the subset selection methods and one among the dimension reduction methods.
Principal component analysis (PCA) is a data reduction technique that transforms correlated variables into uncorrelated ones called principal components. We know that, when faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. Given the non-negligible correlations between the variables, PCA may do a good job on our dataset.
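A minimal sketch of how such a PCA can be computed with prcomp; the data frame anuran_MFCCs (the 21 retained MFCCs plus the label columns) and the standardization of the variables are assumptions, not necessarily the exact choices behind the summary below.
pca <- prcomp(anuran_MFCCs[, sapply(anuran_MFCCs, is.numeric)], center = TRUE, scale. = TRUE) #PCA on the MFCC columns
summary(pca) #standard deviations and (cumulative) proportions of variance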
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.7644 1.8048 1.5746 1.30174 1.2020 0.96354 0.90391
## Proportion of Variance 0.3639 0.1551 0.1181 0.08069 0.0688 0.04421 0.03891
## Cumulative Proportion 0.3639 0.5190 0.6371 0.71776 0.7866 0.83077 0.86968
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86319 0.63264 0.5705 0.53322 0.48519 0.45679 0.38008
## Proportion of Variance 0.03548 0.01906 0.0155 0.01354 0.01121 0.00994 0.00688
## Cumulative Proportion 0.90516 0.92422 0.9397 0.95326 0.96447 0.97440 0.98128
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.36717 0.32550 0.24633 0.20558 0.15962 0.12433 0.09191
## Proportion of Variance 0.00642 0.00505 0.00289 0.00201 0.00121 0.00074 0.00040
## Cumulative Proportion 0.98770 0.99275 0.99564 0.99765 0.99886 0.99960 1.00000
Normally in PCA there are two main criteria to choose how many principal components to use in the subsequent analysis. The first is the minimum explained variance criterion: using the cumulative explained variance plot, we choose the first \(p\) components needed to reach a predefined explained variance threshold. Imagine we would like to keep at least \(90\%\) of the original variance in our dataset. In this case, looking at the summary we can see that the first \(8\) components explain \(90.5\%\) of the variance, so we could decide to use only these variables in our analysis. The second way to choose the number of principal components is the elbow rule. PCA is built in such a way that the proportion of variance explained by each component decreases, initially sharply, then more modestly. The elbow rule says that we should choose the first number \(p\) of principal components such that the drop in explained variance from component \(p\) to component \(p+1\) becomes negligible. In this case, as we can see from the second plot, we could choose either the fourth or the sixth component.
In the end we decided to use the first six components. Below it is possible to see the plots of the cumulative and per-component proportion of variance explained. The horizontal line in the first plot is the \(0.9\) threshold we chose, while the vertical line in the second plot corresponds to the sixth component.
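A sketch of how these two plots can be produced from the prcomp object defined above (the object name pca is an assumption):
pve <- pca$sdev^2 / sum(pca$sdev^2) #proportion of variance explained by each component
par(mfrow = c(1, 2))
plot(cumsum(pve), type = "b", xlab = "Principal component", ylab = "Cumulative PVE")
abline(h = 0.9, col = "red") #the 90% threshold
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
abline(v = 6, col = "red") #the chosen sixth component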
Now it is possible to see how the families are distributed in the first two principal components and in two of the original variables (we decided to use MFCCs_.3 and MFCCs_13, looking at the PCA's autoplot). Obviously the data are better separated when the PCA variables are used.
We also decided to use a subset selection technique, i.e. to select a subset of variables that explains the whole dataset. Often this is done with supervised methods such as best subset selection or stepwise selection; on the contrary, we would like to use an approach based only on our predictors. Our aim is to identify and remove features according to their pairwise correlation with the others: if two variables have a high correlation, we can probably use only one of them instead of both. To do this we cluster the features using agglomerative hierarchical clustering. At the start this method creates one cluster for each feature. Then, by computing the distance between all clusters, it selects the two clusters with the lowest distance to be merged (this corresponds to merging the attributes with the highest correlation in absolute value). This process is repeated, using the average linkage criterion, until we end up with a single cluster. This criterion works as follows: if \(A\) and \(B\) are two clusters, the distance between them is calculated as \[\frac{1}{|A||B|}\sum_{a\in A} \sum_{b\in B} d(a,b)\] where \[d(a,b) = 1 - |\rho_{a,b}|.\]
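A hedged sketch of this feature clustering, using the correlation-based distance defined above and average linkage (the mfcc data frame of MFCC columns is an assumption from the earlier sketch):
d_feat <- as.dist(1 - abs(cor(mfcc))) #distance d(a,b) = 1 - |rho_ab| between features
hc <- hclust(d_feat, method = "average") #agglomerative clustering with average linkage
ggdendrogram(hc, rotate = TRUE) #dendrogram of the features (ggdendro package)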
Now we have to decide the threshold at which to cut our hierarchical clustering. This threshold is strictly related to the pairwise correlation between attributes: intuitively, the lower the threshold, the greater the average correlation left between the remaining variables. In this specific case we decided to set the threshold to 0.3; greater values would imply combining \(5\) or more variables into \(1\). The couples/triples to merge that we obtain are:
To decide which attribute to keep in each couple/triple we use their variance: if a variable has a greater variance, it will better explain the data. So we keep MFCCs_12, MFCCs_13, MFCCs_11 and MFCCs_.3.
This method has several limits. Firstly, in general it does not guarantee the elimination of all the highest correlations: as we can see from the correlation matrix below, even if the average correlation between variables has decreased, there are still some highly correlated variables, like MFCCs_11 and MFCCs_13. Moreover, in theory this method is less effective than PCA at extracting information using a small number of variables. Normally a subset selection method is preferred when the principal components are really hard to interpret and a clear interpretation is needed. This is not our case, so PCA would be preferred, but for some methods we would like to use both strategies to compare them and see the differences.
Now we have 3 different datasets on which we can test our classification techniques. The first one, anuran_MFCCs, is the original dataset with the first variable removed; it contains 21 attributes, some of which are strongly correlated. The second one, anuran_cl, is composed of a subset of the original variables; it contains 14 attributes, most of which are not strongly correlated. The third one, anuran_pca, is the set composed of the first 6 principal components; obviously all its variables are uncorrelated. (In this count we are not considering the 3 attributes Family, Genus and Species.)
Before starting with the classification some considerations have to be made. For instance, the Naive Bayes classifier should not be used on anuran_MFCCs and anuran_cl because the variables are correlated, while it can be used on anuran_pca since the principal components are uncorrelated. Moreover, as we noticed before, most of the original attributes have bimodal densities, far from a normal distribution, so on anuran_MFCCs and anuran_cl it is not appropriate to use linear (or quadratic) discriminant analysis. anuran_pca has different distributions with respect to the original attributes, but, as we can see from the graph below, the first component is strongly bimodal, so it is still better not to use LDA and QDA.
As we pointed out before, using this dataset we could theoretically perform \(3\) different levels of analysis. All three lead us to a multinomial and very unbalanced classification problem. Obviously, it is more difficult to achieve good results using Species as the classification attribute.
To have reliable results we decided to simplify our job and follow two different paths. The first is a binary family classification: in this case our goal is to decide whether an anuran belongs to the Leptodactylidae family or to one of the other families. In this way the analysis becomes simpler and the dataset becomes more balanced. The second type of analysis we would like to perform uses all four families as target. In this case we have to pay attention, because the dataset is very unbalanced; we will see later some possible ways to deal with it.
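A possible way to build the binary version of the target (the object names anuran_MFCCs and anuran_bin are assumptions):
anuran_bin <- anuran_MFCCs[, !(names(anuran_MFCCs) %in% c("Genus", "Species"))] #keep MFCCs + Family
anuran_bin$Family <- factor(ifelse(anuran_bin$Family == "Leptodactylidae",
                                   "Leptodactylidae", "Others"),
                            levels = c("Others", "Leptodactylidae")) #collapse to two classes
table(anuran_bin$Family) #Others: 2775, Leptodactylidae: 4420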
Below we can take a look at the summary of the \(3\) datasets used in their binary version. The first is the complete dataset, the second is the dataset resulting from the feature clustering procedure, while the third is the dataset with the principal components (often abbreviated as PCA dataset).
## Family MFCCs_.2 MFCCs_.3 MFCCs_.4
## Others :2775 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## Leptodactylidae:4420 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## MFCCs_.5 MFCCs_.6 MFCCs_.7 MFCCs_.8
## Min. :-0.63601 Min. :-0.41042 Min. :-0.538982 Min. :-0.5765062
## 1st Qu.: 0.05172 1st Qu.: 0.01258 1st Qu.:-0.125737 1st Qu.:-0.0631089
## Median : 0.16136 Median : 0.07208 Median :-0.052630 Median : 0.0132649
## Mean : 0.12705 Mean : 0.09794 Mean :-0.001397 Mean :-0.0003701
## 3rd Qu.: 0.22259 3rd Qu.: 0.17596 3rd Qu.: 0.085580 3rd Qu.: 0.0751075
## Max. : 0.75225 Max. : 0.96424 Max. : 1.000000 Max. : 0.5517624
## MFCCs_.9 MFCCs_10 MFCCs_11 MFCCs_12
## Min. :-0.587313 Min. :-0.952266 Min. :-0.90199 Min. :-0.79944
## 1st Qu.: 0.004648 1st Qu.:-0.001132 1st Qu.:-0.26986 1st Qu.:-0.03393
## Median : 0.189317 Median : 0.063478 Median :-0.15332 Median : 0.05105
## Mean : 0.128213 Mean : 0.055998 Mean :-0.11568 Mean : 0.04337
## 3rd Qu.: 0.265395 3rd Qu.: 0.117725 3rd Qu.: 0.02669 3rd Qu.: 0.13243
## Max. : 0.738033 Max. : 0.522768 Max. : 0.52303 Max. : 0.69089
## MFCCs_13 MFCCs_14 MFCCs_15 MFCCs_16
## Min. :-0.644116 Min. :-0.59038 Min. :-0.71716 Min. :-0.49868
## 1st Qu.:-0.002859 1st Qu.:-0.13298 1st Qu.:-0.25593 1st Qu.:-0.01955
## Median : 0.196921 Median :-0.05071 Median :-0.14326 Median : 0.04108
## Mean : 0.150945 Mean :-0.03924 Mean :-0.10175 Mean : 0.04206
## 3rd Qu.: 0.324589 3rd Qu.: 0.03916 3rd Qu.: 0.01735 3rd Qu.: 0.10705
## Max. : 0.945710 Max. : 0.57575 Max. : 0.66892 Max. : 0.67070
## MFCCs_17 MFCCs_18 MFCCs_19
## Min. :-0.421480 Min. :-0.759322 Min. :-0.680745
## 1st Qu.:-0.001764 1st Qu.:-0.042122 1st Qu.:-0.106079
## Median : 0.112769 Median : 0.011820 Median :-0.052626
## Mean : 0.088680 Mean : 0.007755 Mean :-0.049474
## 3rd Qu.: 0.201932 3rd Qu.: 0.061889 3rd Qu.: 0.006321
## Max. : 0.681157 Max. : 0.614064 Max. : 0.574209
## MFCCs_20 MFCCs_21 MFCCs_22
## Min. :-0.361649 Min. :-0.43081 Min. :-0.3793043
## 1st Qu.:-0.120971 1st Qu.:-0.01762 1st Qu.: 0.0005327
## Median :-0.055180 Median : 0.03127 Median : 0.1053726
## Mean :-0.053244 Mean : 0.03731 Mean : 0.0875675
## 3rd Qu.: 0.001342 3rd Qu.: 0.08962 3rd Qu.: 0.1948188
## Max. : 0.467831 Max. : 0.38980 Max. : 0.4322068
## Family MFCCs_.2 MFCCs_.3 MFCCs_.4
## Others :2775 Min. :-0.6730 Min. :-0.4360 Min. :-0.4727
## Leptodactylidae:4420 1st Qu.: 0.1659 1st Qu.: 0.1384 1st Qu.: 0.3367
## Median : 0.3022 Median : 0.2746 Median : 0.4815
## Mean : 0.3236 Mean : 0.3112 Mean : 0.4460
## 3rd Qu.: 0.4666 3rd Qu.: 0.4307 3rd Qu.: 0.5599
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## MFCCs_.6 MFCCs_.8 MFCCs_10 MFCCs_11
## Min. :-0.41042 Min. :-0.5765062 Min. :-0.952266 Min. :-0.90199
## 1st Qu.: 0.01258 1st Qu.:-0.0631089 1st Qu.:-0.001132 1st Qu.:-0.26986
## Median : 0.07208 Median : 0.0132649 Median : 0.063478 Median :-0.15332
## Mean : 0.09794 Mean :-0.0003701 Mean : 0.055998 Mean :-0.11568
## 3rd Qu.: 0.17596 3rd Qu.: 0.0751075 3rd Qu.: 0.117725 3rd Qu.: 0.02669
## Max. : 0.96424 Max. : 0.5517624 Max. : 0.522768 Max. : 0.52303
## MFCCs_12 MFCCs_13 MFCCs_16 MFCCs_18
## Min. :-0.79944 Min. :-0.644116 Min. :-0.49868 Min. :-0.759322
## 1st Qu.:-0.03393 1st Qu.:-0.002859 1st Qu.:-0.01955 1st Qu.:-0.042122
## Median : 0.05105 Median : 0.196921 Median : 0.04108 Median : 0.011820
## Mean : 0.04337 Mean : 0.150945 Mean : 0.04206 Mean : 0.007755
## 3rd Qu.: 0.13243 3rd Qu.: 0.324589 3rd Qu.: 0.10705 3rd Qu.: 0.061889
## Max. : 0.69089 Max. : 0.945710 Max. : 0.67070 Max. : 0.614064
## MFCCs_19 MFCCs_20 MFCCs_21
## Min. :-0.680745 Min. :-0.361649 Min. :-0.43081
## 1st Qu.:-0.106079 1st Qu.:-0.120971 1st Qu.:-0.01762
## Median :-0.052626 Median :-0.055180 Median : 0.03127
## Mean :-0.049474 Mean :-0.053244 Mean : 0.03731
## 3rd Qu.: 0.006321 3rd Qu.: 0.001342 3rd Qu.: 0.08962
## Max. : 0.574209 Max. : 0.467831 Max. : 0.38980
## MFCCs_22
## Min. :-0.3793043
## 1st Qu.: 0.0005327
## Median : 0.1053726
## Mean : 0.0875675
## 3rd Qu.: 0.1948188
## Max. : 0.4322068
## Family PC1 PC2 PC3
## Others :2775 Min. :-7.8008 Min. :-9.2926 Min. :-5.9990
## Leptodactylidae:4420 1st Qu.:-2.6456 1st Qu.:-0.7959 1st Qu.:-0.8600
## Median : 0.1537 Median : 0.2125 Median :-0.2168
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 2.5459 3rd Qu.: 1.0528 3rd Qu.: 0.6380
## Max. : 7.0325 Max. : 9.0015 Max. : 8.3187
## PC4 PC5 PC6
## Min. :-6.8051 Min. :-4.61914 Min. :-4.37908
## 1st Qu.:-0.5745 1st Qu.:-0.70560 1st Qu.:-0.53654
## Median : 0.1476 Median :-0.02184 Median :-0.01848
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.7040 3rd Qu.: 0.53067 3rd Qu.: 0.49216
## Max. : 6.2008 Max. : 6.40734 Max. : 4.68080
The performance of machine learning algorithms is typically evaluated using predictive accuracy. Accuracy tells us the percentage of test cases that have been correctly classified, i.e. the fraction of test cases for which we correctly identified the Family. However, this is not always the most appropriate measure. Particularly when the data is imbalanced and/or the costs of different errors vary considerably, we also need other measures to evaluate our results.
We can define the confusion matrix, which for a binary classification problem is a table with 2 rows and 2 columns. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.
| CONFUSION.MATRIX | Predicted.Others | Predicted.Leptodactylidae |
|---|---|---|
| Others | True negative | False positive |
| Leptodactylidae | False negative | True positive |
We can define the precision, that is the number of true positive predictions divided by the total number of predicted positives. Precision can be thought of as a measure of a classifier's exactness: a low precision indicates a large number of false positives.
\[precision = \frac{TP}{TP+FP}\]
The recall instead is the number of true positive predictions divided by the number of actual positives. Recall can be thought of as a measure of a classifier's completeness: a low recall indicates many false negatives.
\[recall = \frac{TP}{TP+FN}\]
Another value that can be useful, especially with an unbalanced dataset, is the specificity. It tells us what fraction of all negative samples is correctly predicted as negative by the classifier.
\[specificity = \frac{TN}{TN+FP}\]
Lastly, the F1 score expresses the balance between the two previous values, precision and recall. \[F_{1}\ score = 2\,\frac{precision \times recall}{precision + recall}\]
When we face multi-class classification the confusion matrix becomes slightly more complex. Unlike binary classification, in fact, there are no positive or negative classes here: what we have to do is to find TP, TN, FP and FN for each individual class. The formulas for precision, recall and specificity are the same. In general we will use accuracy or the F1 score, which is a good summary measure.
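As an illustration, a small helper that computes these measures from a 2x2 confusion matrix laid out as in the table above (rows = actual, columns = predicted, with the positive class in the second position). This is only a sketch, not the code used for the results reported later.
binary_metrics <- function(cm) {
  TN <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]; TP <- cm[2, 2]
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1          <- 2 * precision * recall / (precision + recall)
  c(accuracy = (TP + TN) / sum(cm), precision = precision,
    recall = recall, specificity = specificity, f1 = f1)
}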
Our goal is to analyze the performance of the different classifiers using \(k\)-fold cross-validation. This procedure involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k-1\) folds. The model is then evaluated on the observations in the held-out fold. This procedure is repeated \(k\) times and at each iteration a different group of observations is treated as the validation set. The \(k\)-fold CV estimate is computed by averaging the model score over the iterations. For a good trade-off between runtime and accuracy of the score we chose \(k=5\), so the classifiers are trained on \(80\%\) of the training data in each iteration.
When subpopulations within an overall population vary a lot, it can be better to sample each subpopulation independently. This technique is called stratified sampling and its objective is to improve the precision of the sample by reducing the sampling error. A possibility is to use the proportionate allocation strategy: it uses a sampling fraction in each of the strata that is proportional to that of the total population. For instance, in our binary classification \(62\%\) of the observations come from the Leptodactylidae family, so, according to this technique, approximately \(62\%\) of the sampled data should belong to the Leptodactylidae family.
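With caret, this kind of stratified 5-fold split can be obtained with createFolds, which stratifies on the outcome factor by default. A sketch, assuming the binary data frame anuran_bin defined earlier:
set.seed(1)
folds <- createFolds(anuran_bin$Family, k = 5) #list of 5 stratified index vectors
#each fold keeps roughly the 62% proportion of Leptodactylidae observations
sapply(folds, function(idx) mean(anuran_bin$Family[idx] == "Leptodactylidae"))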
When the dataset is very unbalanced (like in our multiclass analysis), another thing we can do is try to balance the classes in the samples. Normally the algorithm receives significantly more examples from one class, prompting it to be biased towards that particular class. It does not learn what makes the other classes different and fails to understand the underlying patterns that allow us to distinguish among the classes. The algorithm only learns that a given class is more common, making it natural to have a greater tendency towards it, so it is led to overfit the majority class. A technique to overcome this problem is oversampling, i.e. increasing the number of minority class instances in order to match the size of the majority class. It consists of resampling the smaller class at random until it contains as many samples as the majority class. This process can balance the class distribution but does not provide any additional information to the model. A more elaborate technique is SMOTE (Synthetic Minority Oversampling TEchnique). It introduces synthetic examples belonging to the smaller class: each new example is generated at a random point along the line segment between a minority-class observation and one of its nearest minority-class neighbours. This approach effectively forces the decision region of the minority class to become more general.
In our case, since the dataset is quite simple, we do not need any of these techniques to improve our results. At the end of the classification part we will show an example of the use of the SMOTE technique, trying to classify correctly the families of our anurans.
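A hedged sketch of how SMOTE could be applied with the SmoteClassif function of the UBL package, balancing the four families; the object names and the oversampling settings are assumptions and may differ from those used later.
dat <- anuran_MFCCs[, !(names(anuran_MFCCs) %in% c("Genus", "Species"))] #numeric MFCCs + Family
set.seed(1)
anuran_smote <- SmoteClassif(Family ~ ., dat, C.perc = "balance", k = 5) #synthetic oversampling
table(anuran_smote$Family) #the four classes are now approximately balanced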
In this section we start classifying our data. We will focus both on the binary dataset, the simplified one in which only two target classes are available, Leptodactylidae and Others, and on the complete one, in which all \(4\) families are used. In the first case we are obviously interested in finding the anurans belonging to the Leptodactylidae family. As mentioned before, we will try each classification method on the original dataset, on the dataset resulting from PCA and on the dataset resulting from feature clustering.
The first classifier we want to apply to our datasets is the K-Nearest Neighbors classifier. It computes the distance between the record to classify and each observation in the training set, identifies its \(k\) nearest neighbors and uses their class labels to determine the class label of the unknown record. There is only one parameter, the number \(k\), which corresponds to the number of nearest neighbors considered for the prediction.
We will first use the very powerful train function, which allows us to set many options in an easy way; for example, it can automatically perform cross-validation and tune the value of \(k\). Unfortunately, if we want to obtain the F1 score we need a customized approach, so we also built another procedure based on the kNN function to show some results for the F1 score as well.
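A sketch of the caret-based approach with 5-fold cross-validation and tuning of \(k\) (the data frame anuran_bin and the tuning grid are assumptions):
set.seed(1)
ctrl <- trainControl(method = "cv", number = 5) #5-fold cross-validation
knn_fit <- train(Family ~ ., data = anuran_bin, method = "knn",
                 trControl = ctrl, tuneGrid = data.frame(k = seq(1, 15, by = 2)))
knn_fit$bestTune #value of k with the best cross-validated accuracy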
As shown in the table below, all three binary models have very high accuracy, but the time spent on the PCA_bin dataset is much smaller, so it is preferable.
| Binary.dataset | Accuracy | Best.k | Elapsed.time |
|---|---|---|---|
| Base | 0.9911049 | 5 | 6.342187 secs |
| Features clustering | 0.9902710 | 5 | 4.857320 secs |
| PCA | 0.9845726 | 5 | 2.972195 secs |
Also with our homemade function we find similar results; in particular, the F1 score values are also close to \(1\).
| Binary.dataset | Accuracy | F1_score | k | Elapsed.time |
|---|---|---|---|---|
| Base | 0.9915219 | 0.9931097 | 5 | 0.9646461 secs |
| Features clustering | 0.9906880 | 0.9924371 | 5 | 0.6799769 secs |
| PCA | 0.9851286 | 0.9879137 | 5 | 0.3258700 secs |
Now we try the first approach also on the dataset with all \(4\) families.
| Family.dataset | Accuracy | Best.k | Elapsed.time |
|---|---|---|---|
| Base | 0.9894366 | 5 | 6.274721 secs |
| Features clustering | 0.9879080 | 5 | 4.953400 secs |
| PCA | 0.9827650 | 5 | 3.017058 secs |
What we can notice is that, as before, the best results are given by the use of the entire dataset, but this is also the slowest approach. On the contrary, using the PCA dataset gives us slightly worse results, justified, however, by the shorter execution time.
Obviously, the accuracy obtained on the binary dataset is (slightly) higher than the accuracy on the \(4\)-family dataset.
We did not try any particular sampling technique because accuracy and F1 score are already close to \(1\) and there is no point in building more complex models.
Below we can see how the accuracy changes when we change the \(k\) parameter.
Tree-based methods involve segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation we use the most commonly occurring class of the training observations in the region to which it belongs. In interpreting the result of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.
A possible criterion for making the binary splits is the classification error rate, which is simply the fraction of the training observations in a region that do not belong to the most common class. However, the two most common measures used to build a classification tree are the Gini index
\[G = \sum_{k=1}^{K} p_{mk}(1-p_{mk})\]
and the cross entropy
\[D = -\sum_{k=1}^{K} p_{mk}\log(p_{mk})\]
Both measures take on a value near zero if the \(p_{mk}\)'s are all near zero or near one, where \(p_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class. We will use the Gini index. Let's see an example of a decision tree based on our dataset.
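A hedged sketch of how such a tree could be fitted and evaluated on a held-out portion of the data; the object names and the 80/20 split are assumptions, not the exact code that produced the matrix below.
set.seed(1)
idx <- createDataPartition(anuran_MFCCs$Family, p = 0.8, list = FALSE) #stratified train/test split
tree <- rpart(Family ~ . - Genus - Species, data = anuran_MFCCs[idx, ],
              method = "class", parms = list(split = "gini")) #classification tree with Gini splits
rpart.plot(tree) #draw the fitted tree
predict_unseen <- predict(tree, anuran_MFCCs[-idx, ], type = "class")
table(anuran_MFCCs$Family[-idx], predict_unseen) #confusion matrix on the held-out data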
## predict_unseen
## Bufonidae Dendrobatidae Hylidae Leptodactylidae
## Bufonidae 8 0 1 2
## Dendrobatidae 0 78 12 4
## Hylidae 1 10 416 32
## Leptodactylidae 0 1 28 846
| Family.dataset | Accuracy |
|---|---|
| Base | 0.9367616 |
To perform our analysis with this classifier we use algorithms based on the rpart function, both for the binary and the four-class dataset. In this case we can notice that PCA is both faster and more accurate. This is due to the fact that the decision tree algorithm looks for simple regions in which to divide the data, and this is easier when the features are simplified like the principal components.
| Binary.dataset | F1_score | Elapsed.time |
|---|---|---|
| Base | 0.9542883 | 1.0641739 secs |
| Features clustering | 0.9518497 | 0.7542579 secs |
| PCA | 0.9620110 | 0.3160570 secs |
As we can see below, also in the case in which we consider all \(4\) families we have the same behaviour. This time we decided to use accuracy for simplicity.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9161918 | 1.4675071 secs |
| Features clustering | 0.9095205 | 0.9549370 secs |
| PCA | 0.9309243 | 0.3622961 secs |
The Random Forest classifier is a set of decision trees built on various sub-samples of the training set, randomly selected by bootstrapping. In the inference stage it aggregates the votes from the different decision trees to decide the final class of the test object; this improves the predictive accuracy and controls overfitting. We can summarize the Random Forest algorithm in the following steps: select random samples from the given dataset, construct a decision tree for each sample and get a prediction from each tree, take a vote over the predicted results and finally select the prediction with the most votes as the final prediction. The result of the Random Forest algorithm is more reliable than that of a single decision tree because each tree compensates the errors of the others.
As done before with KNN, we could use different algorithms. The first possibility, the one we will use, is based on the specific randomForest function from the namesake package. The other uses the general train function, in which it is easier to tune the parameters; unfortunately, this second method was too slow and we decided to discard it.
Random forests allow us to obtain better results compared with decision trees, but the algorithms take longer to finish their job. We decided to tune the number of trees and the number of variables randomly sampled as candidates at each split, but, to avoid spending too much time, we built a small \(3\times 3\) grid in which the \(3\) values for the number of trees are \(100\), \(500\) and \(1000\), while the \(3\) values for the number of variables at each split are chosen around the square root of the number of features.
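A hedged sketch of the grid described above and of a single fit with the finally chosen values; each grid combination would be scored with 5-fold cross-validation, and the object names are assumptions.
p <- ncol(anuran_bin) - 1 #number of predictors
grid <- expand.grid(ntree = c(100, 500, 1000), mtry = round(sqrt(p)) + (-1):1) #3x3 grid
set.seed(1)
rf <- randomForest(Family ~ ., data = anuran_bin, ntree = 100, mtry = 3) #finally chosen values
rf$confusion #out-of-bag confusion matrix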
The best hyperparameters selected are shown in the table below, together with their scores.
Although the best performances were obtained with 1000 trees, we noticed that the difference in performance with respect to 100 trees was negligible. We therefore chose the latter value, as it makes the algorithm faster. We also noticed that, in our case, even variations of mtry do not lead to considerable differences in the F1 score.
| Binary.dataset | F1_score | Elapsed.time | Ntree | Mtry |
|---|---|---|---|---|
| Base | 0.9902671 | 5.210014 secs | 100 | 3 |
| Features clustering | 0.9876599 | 4.031174 secs | 100 | 3 |
| PCA | 0.9850577 | 2.261171 secs | 100 | 1 |
As we can see, contrary to what happened with decision trees, the best results have been obtained with the complete dataset. In fact, when we combine many trees we can better explore all the features of the dataset, so with more attributes it is possible to achieve better results. Below we can see the accuracy values when all \(4\) families are used; the trend is the same as in the binary dataset.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9842946 | 5.293714 secs |
| Features clustering | 0.9835997 | 4.315891 secs |
| PCA | 0.9780403 | 2.796565 secs |
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables, so it is a binary classifier by nature. However, it can also be used for multi-class classification with \(K>2\) classes.
In the binary case it takes a linear combination of the features and applies a non-linear function (the sigmoid) to it. We can define \(x_{i}\) as the n-dimensional feature vector of a given sample and \(\beta_{0},\beta=(\beta_{1},..,\beta_{n})\) as the model parameters. Then the logistic regression model is defined as:
\[\mathcal{P}(y=1|x_{i})=\frac{\exp(\beta_{0}+x^{T}_{i}\beta)}{1+\exp(\beta_{0}+x^{T}_{i}\beta)}\] where \(y\) is the response vector of the binary problem.
A first possible approach is to convert the binary classifier to a multi-class classifier with 1 vs rest and 1 vs 1 methods.
It is also possible to generalize it into multinomial logistic regression, or softmax regression, for multi-class problems. In this case, the dependent variable \(y\) is a categorical variable that takes any one of \(K\) distinct values representing \(K>2\) different classes, and to model the probability of the event \(y=i\) a softmax function is used instead of the logistic function:
\[\mathcal{P}(y=i\vert{\bf x})=\frac{\exp(w_{i}^T x)}{\sum_{k=1}^{K} \exp(w_{k}^T x)}\]
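A sketch of the two fits, a binomial glm for the binary problem and nnet's multinom for the four families (the data frame names are assumptions):
logit_bin <- glm(Family ~ ., data = anuran_bin, family = binomial) #binary logistic regression
logit_fam <- multinom(Family ~ . - Genus - Species, data = anuran_MFCCs,
                      trace = FALSE, maxit = 200) #multinomial (softmax) regression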
| Binary.dataset | Accuracy | F1_score | Elapsed.time |
|---|---|---|---|
| Base | 0.9915219 | 0.9622599 | 1.3017950 secs |
| Features clustering | 0.9906880 | 0.9543400 | 0.5539131 secs |
| PCA | 0.9452397 | 0.9554097 | 0.5427570 secs |
With logistic regression the accuracy is clearly worse in the \(4\)-class case with respect to the binary case. The issue with the softmax function is that it blows small differences out of proportion, which makes our classifier biased towards a particular class.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9391244 | 3.045820 secs |
| Features clustering | 0.9266157 | 2.347250 secs |
| PCA | 0.9081306 | 2.120302 secs |
The support vector machine is also a natural approach for classification in the two-class setting. It is a discriminative classifier formally defined by a separating hyperplane. If the training data is linearly separable, we actually select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. In a two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side. We can then classify a test observation based on which side of the maximum-margin hyperplane it lies on.
This leads to the following optimization problem, called the hard-margin problem
\[\min_{w,b} \frac{1}{2} \|w\|^2\]
\[\text{s.t.} \hspace{2mm} y_{i}(\langle x_{i}, w\rangle +b) \geq 1, \hspace{2mm} \forall i\]
where \(w\) is the normal vector to the hyperplane, \(x_i\) is a point, \(y_i\) is the \(i\)-th target and \(\frac{b}{\|w\|}\) determines the offset of the hyperplane from the origin along the normal vector.
To extend SVM to cases in which the data are not linearly separable, we can introduce the soft-margin problem
\[\min_{w,b} \frac{1}{2} \|w\|^2 + c\sum_{i} \xi_i\]
\[\text{s.t.} \hspace{2mm} y_{i}(\langle x_{i}, w\rangle +b) \geq 1- \xi_{i}, \hspace{2mm} \xi_{i} \geq 0, \hspace{2mm} \forall i\]
where \(\xi_i\) is a slack variable whose value is the distance of \(x_i\) from its class' margin if \(x_i\) is on the wrong side of the margin, and zero otherwise. \(c\) determines the trade-off between increasing the margin size and ensuring that each \(x_i\) lies on the correct side of the margin; in fact it controls the number and severity of the margin violations that we tolerate. When \(c\) is small, classification mistakes are given less importance and we focus on maximizing the margin, whereas when \(c\) is large the focus is more on avoiding misclassification at the expense of keeping the margin small.
As presented, it is clear that a support vector classifier will perform poorly with non-linear class boundaries. In order to address non-linearity it is possible to enlarge the feature space using kernels. The algorithm is essentially the same, except that every inner product is replaced by a non-linear kernel function.
One of the most popular choices is the radial kernel, which takes the form
\[K(x_{i}, x_{i'}) = \exp (-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2})\]
The parameter gamma defines how far the influence of a single training example reaches, with low values meaning "far" and high values meaning "close". It is a sort of sensitivity of the model: if it is too large, the radius of the area of influence of the support vectors includes only the support vector itself, whereas if gamma is too small the model is too constrained.
Everything we have said so far is limited to the case of binary classification. The two most popular \(K\)-class algorithms for SVM are one-versus-one and one-versus-all.
If there are \(K>2\) classes, a one-versus-one approach constructs \(\binom{K}{2}\) SVMs, each of which compares a pair of classes. The final classification is performed by assigning the test observation to the class to which it was most frequently assigned in the \(\binom{K}{2}\) pairwise classifications. The one-versus-all approach fits \(K\) SVMs and, using the coefficients \(\beta_{ik}\), it builds an index of confidence that the test observation belongs to the \(k\)th class rather than to any of the others.
For the binary case, after tuning the parameters we found that the best value for gamma is about \(0.1\) while the best value for the cost is \(100\). Moreover, we noticed that gamma is a very sensitive parameter, while changing the value of the cost does not have much impact. We use a radial kernel, but good results can also be obtained with a linear kernel.
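A hedged sketch of this tuning and of the final radial-kernel fit with e1071 (the grid values and object names are assumptions):
set.seed(1)
tuned <- tune.svm(Family ~ ., data = anuran_bin, kernel = "radial",
                  gamma = c(0.01, 0.1, 1), cost = c(1, 10, 100)) #grid search with cross-validation
tuned$best.parameters #around gamma = 0.1 and cost = 100
svm_fit <- svm(Family ~ ., data = anuran_bin, kernel = "radial", gamma = 0.1, cost = 100)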
| Binary.dataset | F1_score | Elapsed.time |
|---|---|---|
| Base | 0.9941003 | 5.040407 secs |
| Features clustering | 0.9933235 | 3.813344 secs |
| PCA | 0.9864161 | 3.486036 secs |
Here we can see a graph of how the accuracy on the PCA dataset changes as gamma changes.
Finally, we also show the results for the \(4\)-family dataset. The function we use performs the multi-class classification with the one-versus-one approach. As we can see, SVM outperforms logistic regression: this normally happens when the classes are well separated.
| Family.dataset | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9917999 | 6.768103 secs |
| Features clustering | 0.9899931 | 4.717085 secs |
| PCA | 0.9799861 | 3.417723 secs |
To test how effective the SMOTE approach could be on our dataset, we decided to apply it to the SVM classifier in the case of \(4\) families. We can notice an improvement in our results, but, since the accuracy was already about \(99\%\), the SMOTE approach can be avoided. Maybe it could be more useful when analyzing the species of each anuran, but this is beyond our scope.
| SMOTE.approach | Accuracy | Elapsed.time |
|---|---|---|
| Base | 0.9958310 | 5.947000 secs |
| Features clustering | 0.9933296 | 4.281599 secs |
| PCA | 0.9874931 | 2.538894 secs |
Having completed the study of the various classifiers, our goal now is to compare them and understand which is the best performing model. As a general trend we noticed that it was better to conduct the analysis on the whole dataset: even if it takes more time, this does not imply a prohibitive wait, since the number of observations is limited. In order to compare the classifiers we decided to use, in addition to accuracy, also precision, recall and specificity (the F1 score is used as well). However, computing these indicators for the multi-class dataset is rather cumbersome. Since in our case the results on two classes are very similar to those on four, we decided to compute these values only for the binary classifiers.
| Classifier | Accuracy | Precision | Recall | Specificity | F1.score |
|---|---|---|---|---|---|
| KNN | 0.9915219 | 0.9909890 | 0.9952489 | 0.9923803 | 0.9931143 |
| Decision tree | 0.9423211 | 0.9513185 | 0.9549774 | 0.9278463 | 0.9531444 |
| Random forest | 0.9874913 | 0.9911525 | 0.9884615 | 0.9817008 | 0.9898052 |
| Logistic regression | 0.9535789 | 0.9615906 | 0.9628959 | 0.9407728 | 0.9622428 |
| SVM | 0.9926338 | 0.9974937 | 0.9904977 | 0.9850321 | 0.9939834 |
It is clear from the table that the two best classifiers are KNN and SVM. In this case we think KNN is to be preferred, because SVM has some problems in predicting the "negative" class, so it could suffer with more unbalanced data, as when we try to classify all \(4\) families.