1. Objective
Our objective in this project is to identify a voice as male or female, based upon acoustic properties of the voice and speech.
2. About dataset
The dataset consists of 3,168 recorded voice samples, collected from male and female speakers.
2.1. Import data
voice <- read.csv("voice.csv")
rmarkdown::paged_table(voice)
2.2. Meta data
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female
3. Data wrangling
3.1. Data type check
str(voice)
## 'data.frame': 3168 obs. of 21 variables:
## $ meanfreq: num 0.0598 0.066 0.0773 0.1512 0.1351 ...
## $ sd : num 0.0642 0.0673 0.0838 0.0721 0.0791 ...
## $ median : num 0.032 0.0402 0.0367 0.158 0.1247 ...
## $ Q25 : num 0.0151 0.0194 0.0087 0.0966 0.0787 ...
## $ Q75 : num 0.0902 0.0927 0.1319 0.208 0.206 ...
## $ IQR : num 0.0751 0.0733 0.1232 0.1114 0.1273 ...
## $ skew : num 12.86 22.42 30.76 1.23 1.1 ...
## $ kurt : num 274.4 634.61 1024.93 4.18 4.33 ...
## $ sp.ent : num 0.893 0.892 0.846 0.963 0.972 ...
## $ sfm : num 0.492 0.514 0.479 0.727 0.784 ...
## $ mode : num 0 0 0 0.0839 0.1043 ...
## $ centroid: num 0.0598 0.066 0.0773 0.1512 0.1351 ...
## $ meanfun : num 0.0843 0.1079 0.0987 0.089 0.1064 ...
## $ minfun : num 0.0157 0.0158 0.0157 0.0178 0.0169 ...
## $ maxfun : num 0.276 0.25 0.271 0.25 0.267 ...
## $ meandom : num 0.00781 0.00901 0.00799 0.2015 0.71281 ...
## $ mindom : num 0.00781 0.00781 0.00781 0.00781 0.00781 ...
## $ maxdom : num 0.00781 0.05469 0.01562 0.5625 5.48438 ...
## $ dfrange : num 0 0.04688 0.00781 0.55469 5.47656 ...
## $ modindx : num 0 0.0526 0.0465 0.2471 0.2083 ...
## $ label : chr "male" "male" "male" "male" ...
All variables already have the appropriate data type except the target variable "label".
The data type of the target variable will be changed in the data pre-processing stage.
3.2. Missing value check
colSums(is.na(voice))
## meanfreq sd median Q25 Q75 IQR skew kurt
## 0 0 0 0 0 0 0 0
## sp.ent sfm mode centroid meanfun minfun maxfun meandom
## 0 0 0 0 0 0 0 0
## mindom maxdom dfrange modindx label
## 0 0 0 0 0
There are no missing values in our data.
4. Exploratory data analysis
4.1. Correlation check
# Change target variable: male becomes 1 and female becomes 0
voice$label <- ifelse(voice$label == "male",1,0)
# plot feature correlation
ggcorr(data = voice,label = T, label_size = 3, hjust =.9, layout.exp = 2)
Based on the correlation plot above, several predictors have a high correlation with the target variable. Here we treat a correlation as "high" when its absolute value is at least 0.5.
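For instance, here is a quick sketch (not part of the original code) of how these predictors could be listed: compute the correlation of each predictor with the 0/1 label and keep those with an absolute value of at least 0.5.
# correlation of every predictor with the numeric label (male = 1, female = 0)
cors <- sapply(voice[, names(voice) != "label"], cor, y = voice$label)
sort(cors[abs(cors) >= 0.5])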
4.2. Plot density graph
Predictors that have a high correlation with the target variable: meanfun, sp.ent, IQR, Q25, sd.
Change data type
voice <- voice %>%
mutate(label = as.factor(label))
a <- ggplot(data=voice, aes(x=meanfun, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
b <- ggplot(data=voice, aes(x=sp.ent, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
c <- ggplot(data=voice, aes(x=IQR, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
d <- ggplot(data=voice, aes(x=Q25, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
e <- ggplot(data=voice, aes(x=sd, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
ggarrange(a,b,c,d,e, ncol = 2, nrow = 3)
The variables that distinguish male and female characteristics most clearly are IQR, meanfun, sd, sp.ent, and Q25. Less overlap between the class densities means the algorithm can more easily and more accurately decide whether a voice is male or female from these values.
However, we should not use predictors that show perfect separation. Perfect separation is a condition in which a single predictor can completely separate the target classes.
Using a model with perfect separation is not recommended, because the model becomes heavily biased toward that one variable and ignores the others, which can make it perform poorly on new data.
The variables above are not cases of perfect separation because their class densities still overlap.
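As a rough illustration (this check is not part of the original code), a predictor would perfectly separate the classes only if the value ranges of the two classes did not overlap at all:
# TRUE = the class ranges overlap, so there is no perfect separation
sapply(c("meanfun", "sp.ent", "IQR", "Q25", "sd"), function(v) {
  r_female <- range(voice[voice$label == "0", v])
  r_male   <- range(voice[voice$label == "1", v])
  r_female[1] <= r_male[2] && r_male[1] <= r_female[2]
})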
4.3. Proportion target
prop.table(table(voice$label))
##
## 0 1
## 0.5 0.5
The target variable in our dataset is balanced, so we do not need to worry about balancing the classes.
5. Logistic Regression
Logistic regression is a classification algorithm that fits a regression curve y = f(x) where y is a categorical variable.
It uses a linear model of the predictors to estimate the log-odds of the target class, which is then converted into a probability that can be used for classification.
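To make this concrete, here is a minimal sketch with made-up numbers (the intercept, coefficient, and predictor value are purely illustrative): the linear predictor gives the log-odds, and the inverse-logit maps it to a probability.
log_odds <- -2 + 1.5 * 0.8   # hypothetical intercept, coefficient, and predictor value
prob <- plogis(log_odds)     # inverse-logit: 1 / (1 + exp(-log_odds))
prob                         # approximately 0.31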
5.1. Pre-processing Data
# Remove multicolinearity
voice_clean <- voice %>% select(-c(dfrange,IQR,centroid))
An indication of multicollinearity can be seen in the earlier correlation plot: three predictors have a correlation of 1 with another predictor, so we need to remove them.
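As an optional sanity check (a sketch, not part of the original pipeline), the perfectly correlated pairs can be listed directly from the correlation matrix:
cor_mat <- cor(voice %>% select(-label))
# predictor pairs with an absolute correlation of (almost) 1
idx <- which(abs(cor_mat) > 0.99 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[idx[, 1]],
           var2 = colnames(cor_mat)[idx[, 2]],
           corr = round(cor_mat[idx], 3))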
5.2. Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(15)
# index sampling
index <- sample(x = nrow(voice_clean), size = nrow(voice_clean)*0.8)
# splitting, by doing subseting
voice_train <- voice_clean[index,]
voice_test <- voice_clean[-index,]
The training set contains 80% of the data and the test set contains the remaining 20%.
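As an optional sanity check (not part of the original output), the class proportions can be compared between the two splits to confirm the balance is roughly preserved:
prop.table(table(voice_train$label))
prop.table(table(voice_test$label))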
5.3. Build model
model_voice <- glm(formula = label~.,data = voice_train,family = "binomial")
summary(model_voice)
##
## Call:
## glm(formula = label ~ ., family = "binomial", data = voice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2231 -0.0439 -0.0005 0.1150 4.2255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.005e+01 1.007e+01 -1.991 0.0465 *
## meanfreq 1.996e+01 5.003e+01 0.399 0.6899
## sd 4.216e+01 3.881e+01 1.086 0.2773
## median -1.305e+01 1.411e+01 -0.925 0.3550
## Q25 -4.989e+01 1.256e+01 -3.971 7.15e-05 ***
## Q75 4.262e+01 2.224e+01 1.916 0.0553 .
## skew 1.262e-01 1.799e-01 0.701 0.4831
## kurt -6.695e-03 4.775e-03 -1.402 0.1609
## sp.ent 4.557e+01 1.107e+01 4.115 3.87e-05 ***
## sfm -1.237e+01 2.770e+00 -4.466 7.98e-06 ***
## mode 2.334e+00 2.401e+00 0.972 0.3309
## meanfun -1.616e+02 9.154e+00 -17.650 < 2e-16 ***
## minfun 4.030e+01 9.854e+00 4.090 4.32e-05 ***
## maxfun -5.660e+00 7.906e+00 -0.716 0.4741
## meandom 5.687e-02 4.674e-01 0.122 0.9032
## mindom 4.323e-02 2.336e+00 0.019 0.9852
## maxdom 1.890e-02 7.205e-02 0.262 0.7931
## modindx -2.786e+00 1.792e+00 -1.554 0.1201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3512.68 on 2533 degrees of freedom
## Residual deviance: 469.89 on 2516 degrees of freedom
## AIC: 505.89
##
## Number of Fisher Scoring iterations: 8
Model Interpretation
Variables that increase the odds: meanfreq, sd, Q75, skew, sp.ent, mode, minfun, meandom, mindom, maxdom
Variables that decrease the odds: median, Q25, kurt, sfm, meanfun, maxfun, modindx
Significant variables: Q25, sp.ent, sfm, meanfun, minfun
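To read these effects on the odds scale, one optional check (not shown in the original output) is to exponentiate the coefficients: values above 1 increase the odds of the positive class (male = 1) and values below 1 decrease them.
# coefficients expressed as odds ratios
exp(coef(model_voice))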
5.4. Predict
voice_test$pred <- predict(object = model_voice,newdata = voice_test,type = "response")
voice_test$pred_label <- ifelse(voice_test$pred > 0.5,1,0)
voice_test$pred_label <- as.factor(voice_test$pred_label)
The threshold we use is 0.5; this value can be changed according to the business objective.
Predicted probabilities greater than 0.5 are labelled 1 (male) and probabilities of 0.5 or less are labelled 0 (female).
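As a purely hypothetical illustration of changing the cut-off, a stricter threshold of 0.7 would make the model more conservative about predicting class 1 (male):
pred_label_strict <- factor(ifelse(voice_test$pred > 0.7, 1, 0))
table(prediction = pred_label_strict, actual = voice_test$label)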
5.5. Model evaluation
confusionMatrix(data = voice_test$pred_label,
reference = voice_test$label,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 293 5
## 1 13 323
##
## Accuracy : 0.9716
## 95% CI : (0.9555, 0.9831)
## No Information Rate : 0.5174
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9431
##
## Mcnemar's Test P-Value : 0.09896
##
## Sensitivity : 0.9848
## Specificity : 0.9575
## Pos Pred Value : 0.9613
## Neg Pred Value : 0.9832
## Prevalence : 0.5174
## Detection Rate : 0.5095
## Detection Prevalence : 0.5300
## Balanced Accuracy : 0.9711
##
## 'Positive' Class : 1
##
Because in this case we do not focus on one particular class, the positive parameter of the confusion matrix is not essential.
From the resulting confusion matrix we therefore focus on accuracy, since we simply want the model to predict objectively whether a voice is male or female.
The accuracy is very good: the model predicts gender from the voice with about 97% accuracy (97.16% on the test set).
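For reference, the accuracy can be worked out directly from the confusion matrix counts above:
# correctly classified / total test observations
(293 + 323) / (293 + 5 + 13 + 323)   # 616 / 634 = 0.9716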
5.6. Model Tuning
Feature selection is performed with backward stepwise selection, removing predictors one at a time until the lowest AIC value is obtained.
model_step <- step(object = model_voice,direction = "backward", trace = 0)
summary(model_step)
##
## Call:
## glm(formula = label ~ Q25 + Q75 + kurt + sp.ent + sfm + meanfun +
## minfun + modindx, family = "binomial", data = voice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0470 -0.0467 -0.0006 0.1132 4.2500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.196e+01 7.177e+00 -1.667 0.09560 .
## Q25 -5.448e+01 5.223e+00 -10.431 < 2e-16 ***
## Q75 5.295e+01 5.991e+00 8.839 < 2e-16 ***
## kurt -3.686e-03 1.149e-03 -3.210 0.00133 **
## sp.ent 3.726e+01 8.666e+00 4.300 1.71e-05 ***
## sfm -9.897e+00 1.963e+00 -5.043 4.59e-07 ***
## meanfun -1.633e+02 9.077e+00 -17.989 < 2e-16 ***
## minfun 3.831e+01 8.725e+00 4.391 1.13e-05 ***
## modindx -2.768e+00 1.409e+00 -1.964 0.04953 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3512.68 on 2533 degrees of freedom
## Residual deviance: 474.41 on 2525 degrees of freedom
## AIC: 492.41
##
## Number of Fisher Scoring iterations: 8
5.7. Predict model tuning
voice_test$pred2 <- predict(object = model_step,newdata = voice_test,type = "response")
voice_test$pred_label2 <- ifelse(voice_test$pred2 > 0.5,1,0)
voice_test$pred_label2 <- as.factor(voice_test$pred_label2)
5.8. Model Tuning Evaluation
confusionMatrix(data = voice_test$pred_label2,
reference = voice_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 296 5
## 1 10 323
##
## Accuracy : 0.9763
## 95% CI : (0.9613, 0.9867)
## No Information Rate : 0.5174
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9526
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.9673
## Specificity : 0.9848
## Pos Pred Value : 0.9834
## Neg Pred Value : 0.9700
## Prevalence : 0.4826
## Detection Rate : 0.4669
## Detection Prevalence : 0.4748
## Balanced Accuracy : 0.9760
##
## 'Positive' Class : 0
##
The result is that the tuned model has a smaller AIC (492.41 versus 505.89). In addition, predictions from the tuned model are slightly more accurate, increasing from 97.16% to 97.63%.
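A compact way to confirm this (an optional check, not in the original output) is to compare both models side by side:
AIC(model_voice, model_step)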
5.9. Prediction Result
ggplot(voice_test, aes(x=pred2)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
The distribution of predicted probabilities is fairly balanced and is not skewed toward one of the target classes.
6. K-Nearest Neighbour
K-NN stands for K-nearest neighbours. The method classifies new data (the test set) by comparing its characteristics with existing data (the training set). Similarity is measured with the Euclidean distance, the k nearest neighbours of the new observation are selected, and its class is determined by majority voting.
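The following toy sketch (hypothetical training points and a hypothetical new observation, with k = 3) illustrates the distance-and-vote idea before we apply knn() to the voice data:
train_x <- matrix(c(1, 1,
                    1.2, 0.9,
                    5, 5,
                    5.1, 4.8), ncol = 2, byrow = TRUE)    # 4 toy training points
train_y <- factor(c("female", "female", "male", "male"))  # their classes
new_x <- c(1.1, 1.0)                                      # a new observation
# Euclidean distance from the new observation to every training point
d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
nearest <- order(d)[1:3]                                  # indices of the 3 nearest neighbours
names(which.max(table(train_y[nearest])))                 # majority vote -> "female"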
6.1. Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index_knn <- sample(x = nrow(voice),
size = 0.8*nrow(voice))
# splitting
voice_train <- voice[index_knn,]
voice_test <- voice[-index_knn,]
The training set contains 80% of the data and the test set contains the remaining 20%.
6.2. Data pre-processing
# predictors
voice_train_x <- voice_train %>% select(-label)
voice_test_x <- voice_test %>% select(-label)
# target
voice_train_y <- voice_train[,"label"]
voice_test_y <- voice_test[,"label"]
6.3. Scaling
Check the distribution of data
summary(voice)
## meanfreq sd median Q25
## Min. :0.03936 Min. :0.01836 Min. :0.01097 Min. :0.0002288
## 1st Qu.:0.16366 1st Qu.:0.04195 1st Qu.:0.16959 1st Qu.:0.1110865
## Median :0.18484 Median :0.05916 Median :0.19003 Median :0.1402864
## Mean :0.18091 Mean :0.05713 Mean :0.18562 Mean :0.1404556
## 3rd Qu.:0.19915 3rd Qu.:0.06702 3rd Qu.:0.21062 3rd Qu.:0.1759388
## Max. :0.25112 Max. :0.11527 Max. :0.26122 Max. :0.2473469
## Q75 IQR skew kurt
## Min. :0.04295 Min. :0.01456 Min. : 0.1417 Min. : 2.068
## 1st Qu.:0.20875 1st Qu.:0.04256 1st Qu.: 1.6496 1st Qu.: 5.670
## Median :0.22568 Median :0.09428 Median : 2.1971 Median : 8.319
## Mean :0.22476 Mean :0.08431 Mean : 3.1402 Mean : 36.569
## 3rd Qu.:0.24366 3rd Qu.:0.11418 3rd Qu.: 2.9317 3rd Qu.: 13.649
## Max. :0.27347 Max. :0.25223 Max. :34.7255 Max. :1309.613
## sp.ent sfm mode centroid
## Min. :0.7387 Min. :0.03688 Min. :0.0000 Min. :0.03936
## 1st Qu.:0.8618 1st Qu.:0.25804 1st Qu.:0.1180 1st Qu.:0.16366
## Median :0.9018 Median :0.39634 Median :0.1866 Median :0.18484
## Mean :0.8951 Mean :0.40822 Mean :0.1653 Mean :0.18091
## 3rd Qu.:0.9287 3rd Qu.:0.53368 3rd Qu.:0.2211 3rd Qu.:0.19915
## Max. :0.9820 Max. :0.84294 Max. :0.2800 Max. :0.25112
## meanfun minfun maxfun meandom
## Min. :0.05557 Min. :0.009775 Min. :0.1031 Min. :0.007812
## 1st Qu.:0.11700 1st Qu.:0.018223 1st Qu.:0.2540 1st Qu.:0.419828
## Median :0.14052 Median :0.046110 Median :0.2712 Median :0.765795
## Mean :0.14281 Mean :0.036802 Mean :0.2588 Mean :0.829211
## 3rd Qu.:0.16958 3rd Qu.:0.047904 3rd Qu.:0.2775 3rd Qu.:1.177166
## Max. :0.23764 Max. :0.204082 Max. :0.2791 Max. :2.957682
## mindom maxdom dfrange modindx
## Min. :0.004883 Min. : 0.007812 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.007812 1st Qu.: 2.070312 1st Qu.: 2.045 1st Qu.:0.09977
## Median :0.023438 Median : 4.992188 Median : 4.945 Median :0.13936
## Mean :0.052647 Mean : 5.047277 Mean : 4.995 Mean :0.17375
## 3rd Qu.:0.070312 3rd Qu.: 7.007812 3rd Qu.: 6.992 3rd Qu.:0.20918
## Max. :0.458984 Max. :21.867188 Max. :21.844 Max. :0.93237
## label
## 0:1584
## 1:1584
##
##
##
##
All predictors must be on a comparable scale because the K-NN model classifies based on distance: if one variable has a much larger range than the others, it will dominate the distance calculation and the other variables will effectively be ignored.
Because the ranges of the variables differ, feature rescaling is needed.
#scaling
voice_train_x_scale <- scale(voice_train_x)
voice_test_x_scale <- scale(voice_test_x,
center = attr(voice_train_x_scale,"scaled:center"),
scale = attr(voice_train_x_scale,"scaled:scale"))
The predictors are scaled using z-score standardization. The test data are scaled with the center and scale parameters taken from the training data, so that no information from the test set leaks into the preprocessing.
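As a quick optional check (not shown in the original), the scaled training predictors should now have a mean of roughly 0 and a standard deviation of roughly 1:
round(colMeans(voice_train_x_scale), 2)
round(apply(voice_train_x_scale, 2, sd), 2)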
6.4. Pick optimum K
Some conditions for choosing the optimum k:
- k should not be too large: the prediction is then driven only by the dominant class and small but important patterns are ignored.
- k should not be too small: the prediction becomes prone to assigning new data to an outlier's class.
- a common rule of thumb is the square root of the number of training observations: sqrt(nrow(data)).
To avoid a tie during majority voting:
- k should be odd if the number of target classes is even
- k should be even if the number of target classes is odd
- k should not be a multiple of the number of target classes
If the majority vote still ends in a tie, the class is chosen randomly.
sqrt(nrow(voice_train))
## [1] 50.33885
Because the rounded result (50) is even while the number of target classes is also even, we take the nearest odd value, so the optimum k used is 49.
6.5. Predict
voice_pred <- knn(train = voice_train_x_scale,
test = voice_test_x_scale,
cl = voice_train_y,
k = 49)
Conceptually, K-NN does not build an explicit model: it classifies each test observation directly based on its Euclidean distance to the training observations, and the class is then chosen by majority voting.
6.6. Model evaluation
confusionMatrix(data = voice_pred,
reference = voice_test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 289 8
## 1 25 312
##
## Accuracy : 0.9479
## 95% CI : (0.9277, 0.9639)
## No Information Rate : 0.5047
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8958
##
## Mcnemar's Test P-Value : 0.005349
##
## Sensitivity : 0.9204
## Specificity : 0.9750
## Pos Pred Value : 0.9731
## Neg Pred Value : 0.9258
## Prevalence : 0.4953
## Detection Rate : 0.4558
## Detection Prevalence : 0.4685
## Balanced Accuracy : 0.9477
##
## 'Positive' Class : 0
##
The result shows that our K-NN model has an accuracy of 94.79% on the test set, meaning 94.79% of the observations are correctly classified as male or female. The sensitivity is 92.04% and the specificity is 97.50%; because the positive class here is 0 (female), this means almost all male voices are classified correctly, while a somewhat larger share of female voices are misclassified. The positive predictive value is 97.31%, meaning 97.31% of the voices predicted as female are indeed female.
7. Conclusion
Both models have very good accuracy: the logistic regression model reaches 97.63% and K-NN reaches 94.79%, so the two models do not differ much in their ability to predict gender.
After tuning with backward stepwise selection, not only did accuracy improve and AIC decrease; the specificity and positive predictive value reported for the tuned model are also about 1-2% higher than before. This is because the tuned model became slightly better at identifying female voices that were previously predicted as male.
The apparent drop of about 2% in sensitivity is not because more female voices were predicted as male; it mainly reflects the fact that the second confusion matrix uses class 0 as the positive class, together with the trade-off against the higher accuracy and precision.