1. Objective
Our objective in this project is to identify a voice as male or female, based upon acoustic properties of the voice and speech.
2. About dataset
The dataset consists of 3,168 recorded voice samples, collected from male and female speakers.
2.1. Import data
voice <- read.csv("voice.csv")
rmarkdown::paged_table(voice)
2.2. Meta data
- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- peakf: peak frequency (frequency with highest energy)
- meanfun: average of fundamental frequency measured across acoustic signal
- minfun: minimum fundamental frequency measured across acoustic signal
- maxfun: maximum fundamental frequency measured across acoustic signal
- meandom: average of dominant frequency measured across acoustic signal
- mindom: minimum of dominant frequency measured across acoustic signal
- maxdom: maximum of dominant frequency measured across acoustic signal
- dfrange: range of dominant frequency measured across acoustic signal
- modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
- label: male or female
3. Data wrangling
3.1. Data type check
str(voice)
## 'data.frame': 3168 obs. of 21 variables:
## $ meanfreq: num 0.0598 0.066 0.0773 0.1512 0.1351 ...
## $ sd : num 0.0642 0.0673 0.0838 0.0721 0.0791 ...
## $ median : num 0.032 0.0402 0.0367 0.158 0.1247 ...
## $ Q25 : num 0.0151 0.0194 0.0087 0.0966 0.0787 ...
## $ Q75 : num 0.0902 0.0927 0.1319 0.208 0.206 ...
## $ IQR : num 0.0751 0.0733 0.1232 0.1114 0.1273 ...
## $ skew : num 12.86 22.42 30.76 1.23 1.1 ...
## $ kurt : num 274.4 634.61 1024.93 4.18 4.33 ...
## $ sp.ent : num 0.893 0.892 0.846 0.963 0.972 ...
## $ sfm : num 0.492 0.514 0.479 0.727 0.784 ...
## $ mode : num 0 0 0 0.0839 0.1043 ...
## $ centroid: num 0.0598 0.066 0.0773 0.1512 0.1351 ...
## $ meanfun : num 0.0843 0.1079 0.0987 0.089 0.1064 ...
## $ minfun : num 0.0157 0.0158 0.0157 0.0178 0.0169 ...
## $ maxfun : num 0.276 0.25 0.271 0.25 0.267 ...
## $ meandom : num 0.00781 0.00901 0.00799 0.2015 0.71281 ...
## $ mindom : num 0.00781 0.00781 0.00781 0.00781 0.00781 ...
## $ maxdom : num 0.00781 0.05469 0.01562 0.5625 5.48438 ...
## $ dfrange : num 0 0.04688 0.00781 0.55469 5.47656 ...
## $ modindx : num 0 0.0526 0.0465 0.2471 0.2083 ...
## $ label : chr "male" "male" "male" "male" ...
All variables already have the appropriate data type except the target variable "label".
The data type of the target variable will be changed in the data pre-processing stage.
3.2. Missing value check
colSums(is.na(voice))
## meanfreq sd median Q25 Q75 IQR skew kurt
## 0 0 0 0 0 0 0 0
## sp.ent sfm mode centroid meanfun minfun maxfun meandom
## 0 0 0 0 0 0 0 0
## mindom maxdom dfrange modindx label
## 0 0 0 0 0
There are no missing values in our data.
4. Exploratory data analysis
4.1. Correlation check
# Change target variable: male becomes 1 and female becomes 0
voice$label <- ifelse(voice$label == "male",1,0)
# plot feature correlation
ggcorr(data = voice,label = T, label_size = 3, hjust =.9, layout.exp = 2)
Based on the correlation plot above, several predictors have a high correlation with the target variable. Here we treat a correlation as "high" when its absolute value is at least 0.5.
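For instance, here is a quick sketch (not part of the original code) of how these predictors could be listed: compute the correlation of each predictor with the 0/1 label and keep those with an absolute value of at least 0.5.
# correlation of every predictor with the numeric label (male = 1, female = 0)
cors <- sapply(voice[, names(voice) != "label"], cor, y = voice$label)
sort(cors[abs(cors) >= 0.5])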
4.2. Plot density graph
Predictors that have a high correlation with the target variable: meanfun, sp.ent, IQR, Q25, sd.
Change data type
voice <- voice %>%
mutate(label = as.factor(label))
a <- ggplot(data=voice, aes(x=meanfun, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
b <- ggplot(data=voice, aes(x=sp.ent, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
c <- ggplot(data=voice, aes(x=IQR, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
d <- ggplot(data=voice, aes(x=Q25, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
e <- ggplot(data=voice, aes(x=sd, group=label, fill=label)) +
geom_density(adjust=1.5, alpha=.2) +
scale_fill_manual(values=c("red", "blue"))+
theme_get()
ggarrange(a,b,c,d,e, ncol = 2, nrow = 3)
The variables that distinguish male and female characteristics most clearly are IQR, meanfun, sd, sp.ent, and Q25. Less overlap between the class densities means the algorithm can more easily and more accurately decide whether a voice is male or female from these values.
However, we should not use predictors that show perfect separation. Perfect separation is a condition in which a single predictor can completely separate the target classes.
Using a model with perfect separation is not recommended, because the model becomes heavily biased toward that one variable and ignores the others, which can make it perform poorly on new data.
The variables above are not cases of perfect separation because their class densities still overlap.
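As a rough illustration (this check is not part of the original code), a predictor would perfectly separate the classes only if the value ranges of the two classes did not overlap at all:
# TRUE = the class ranges overlap, so there is no perfect separation
sapply(c("meanfun", "sp.ent", "IQR", "Q25", "sd"), function(v) {
  r_female <- range(voice[voice$label == "0", v])
  r_male   <- range(voice[voice$label == "1", v])
  r_female[1] <= r_male[2] && r_male[1] <= r_female[2]
})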
4.3. Proportion target
prop.table(table(voice$label))
##
## 0 1
## 0.5 0.5
The target variable in our dataset is balanced, so we do not need to worry about balancing the classes.
5. Logistic Regression
Logistic regression is a classification algorithm that fits a regression curve y = f(x) where y is a categorical variable.
It uses a linear model of the predictors to estimate the log-odds of the target class, which is then converted into a probability that can be used for classification.
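To make this concrete, here is a minimal sketch with made-up numbers (the intercept, coefficient, and predictor value are purely illustrative): the linear predictor gives the log-odds, and the inverse-logit maps it to a probability.
log_odds <- -2 + 1.5 * 0.8   # hypothetical intercept, coefficient, and predictor value
prob <- plogis(log_odds)     # inverse-logit: 1 / (1 + exp(-log_odds))
prob                         # approximately 0.31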
5.1. Pre-processing Data
# Remove multicolinearity
voice_clean <- voice %>% select(-c(dfrange,IQR,centroid))
An indication of multicollinearity can be seen in the earlier correlation plot: three predictors have a correlation of 1 with another predictor, so we need to remove them.
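As an optional sanity check (a sketch, not part of the original pipeline), the perfectly correlated pairs can be listed directly from the correlation matrix:
cor_mat <- cor(voice %>% select(-label))
# predictor pairs with an absolute correlation of (almost) 1
idx <- which(abs(cor_mat) > 0.99 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[idx[, 1]],
           var2 = colnames(cor_mat)[idx[, 2]],
           corr = round(cor_mat[idx], 3))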
5.2. Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(15)
# index sampling
index <- sample(x = nrow(voice_clean), size = nrow(voice_clean)*0.8)
# splitting, by doing subseting
voice_train <- voice_clean[index,]
voice_test <- voice_clean[-index,]
The training set contains 80% of the data and the test set contains the remaining 20%.
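As an optional sanity check (not part of the original output), the class proportions can be compared between the two splits to confirm the balance is roughly preserved:
prop.table(table(voice_train$label))
prop.table(table(voice_test$label))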
5.3. Build model
model_voice <- glm(formula = label~.,data = voice_train,family = "binomial")
summary(model_voice)
##
## Call:
## glm(formula = label ~ ., family = "binomial", data = voice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2231 -0.0439 -0.0005 0.1150 4.2255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.005e+01 1.007e+01 -1.991 0.0465 *
## meanfreq 1.996e+01 5.003e+01 0.399 0.6899
## sd 4.216e+01 3.881e+01 1.086 0.2773
## median -1.305e+01 1.411e+01 -0.925 0.3550
## Q25 -4.989e+01 1.256e+01 -3.971 7.15e-05 ***
## Q75 4.262e+01 2.224e+01 1.916 0.0553 .
## skew 1.262e-01 1.799e-01 0.701 0.4831
## kurt -6.695e-03 4.775e-03 -1.402 0.1609
## sp.ent 4.557e+01 1.107e+01 4.115 3.87e-05 ***
## sfm -1.237e+01 2.770e+00 -4.466 7.98e-06 ***
## mode 2.334e+00 2.401e+00 0.972 0.3309
## meanfun -1.616e+02 9.154e+00 -17.650 < 2e-16 ***
## minfun 4.030e+01 9.854e+00 4.090 4.32e-05 ***
## maxfun -5.660e+00 7.906e+00 -0.716 0.4741
## meandom 5.687e-02 4.674e-01 0.122 0.9032
## mindom 4.323e-02 2.336e+00 0.019 0.9852
## maxdom 1.890e-02 7.205e-02 0.262 0.7931
## modindx -2.786e+00 1.792e+00 -1.554 0.1201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3512.68 on 2533 degrees of freedom
## Residual deviance: 469.89 on 2516 degrees of freedom
## AIC: 505.89
##
## Number of Fisher Scoring iterations: 8
Model Interpretation
Variables that increase the odds: meanfreq, sd, Q75, skew, sp.ent, mode, minfun, meandom, mindom, maxdom
Variables that decrease the odds: median, Q25, kurt, sfm, meanfun, maxfun, modindx
Significant variables: Q25, sp.ent, sfm, meanfun, minfun
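To read these effects on the odds scale, one optional check (not shown in the original output) is to exponentiate the coefficients: values above 1 increase the odds of the positive class (male = 1) and values below 1 decrease them.
# coefficients expressed as odds ratios
exp(coef(model_voice))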
5.4. Predict
voice_test$pred <- predict(object = model_voice,newdata = voice_test,type = "response")
voice_test$pred_label <- ifelse(voice_test$pred > 0.5,1,0)
voice_test$pred_label <- as.factor(voice_test$pred_label)
The threshold we use is 0.5; this value can be changed according to the business objective.
Predicted probabilities greater than 0.5 are labelled 1 (male) and probabilities of 0.5 or less are labelled 0 (female).
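As a purely hypothetical illustration of changing the cut-off, a stricter threshold of 0.7 would make the model more conservative about predicting class 1 (male):
pred_label_strict <- factor(ifelse(voice_test$pred > 0.7, 1, 0))
table(prediction = pred_label_strict, actual = voice_test$label)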
5.5. Model evaluation
confusionMatrix(data = voice_test$pred_label,
reference = voice_test$label,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 293 5
## 1 13 323
##
## Accuracy : 0.9716
## 95% CI : (0.9555, 0.9831)
## No Information Rate : 0.5174
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9431
##
## Mcnemar's Test P-Value : 0.09896
##
## Sensitivity : 0.9848
## Specificity : 0.9575
## Pos Pred Value : 0.9613
## Neg Pred Value : 0.9832
## Prevalence : 0.5174
## Detection Rate : 0.5095
## Detection Prevalence : 0.5300
## Balanced Accuracy : 0.9711
##
## 'Positive' Class : 1
##
Because in this case we do not focus on one particular class, the positive parameter of the confusion matrix is not essential.
From the resulting confusion matrix we therefore focus on accuracy, since we simply want the model to predict objectively whether a voice is male or female.
The accuracy is very good: the model predicts gender from the voice with about 97% accuracy (97.16% on the test set).
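For reference, the accuracy can be worked out directly from the confusion matrix counts above:
# correctly classified / total test observations
(293 + 323) / (293 + 5 + 13 + 323)   # 616 / 634 = 0.9716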
5.6. Model Tuning
Feature selection is performed with backward stepwise selection, removing predictors one at a time until the lowest AIC value is obtained.
model_step <- step(object = model_voice,direction = "backward", trace = 0)
summary(model_step)
##
## Call:
## glm(formula = label ~ Q25 + Q75 + kurt + sp.ent + sfm + meanfun +
## minfun + modindx, family = "binomial", data = voice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0470 -0.0467 -0.0006 0.1132 4.2500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.196e+01 7.177e+00 -1.667 0.09560 .
## Q25 -5.448e+01 5.223e+00 -10.431 < 2e-16 ***
## Q75 5.295e+01 5.991e+00 8.839 < 2e-16 ***
## kurt -3.686e-03 1.149e-03 -3.210 0.00133 **
## sp.ent 3.726e+01 8.666e+00 4.300 1.71e-05 ***
## sfm -9.897e+00 1.963e+00 -5.043 4.59e-07 ***
## meanfun -1.633e+02 9.077e+00 -17.989 < 2e-16 ***
## minfun 3.831e+01 8.725e+00 4.391 1.13e-05 ***
## modindx -2.768e+00 1.409e+00 -1.964 0.04953 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3512.68 on 2533 degrees of freedom
## Residual deviance: 474.41 on 2525 degrees of freedom
## AIC: 492.41
##
## Number of Fisher Scoring iterations: 8
5.7. Predict model tuning
voice_test$pred2 <- predict(object = model_step,newdata = voice_test,type = "response")
voice_test$pred_label2 <- ifelse(voice_test$pred2 > 0.5,1,0)
voice_test$pred_label2 <- as.factor(voice_test$pred_label2)
5.8. Model Tuning Evaluation
confusionMatrix(data = voice_test$pred_label2,
reference = voice_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 296 5
## 1 10 323
##
## Accuracy : 0.9763
## 95% CI : (0.9613, 0.9867)
## No Information Rate : 0.5174
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9526
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.9673
## Specificity : 0.9848
## Pos Pred Value : 0.9834
## Neg Pred Value : 0.9700
## Prevalence : 0.4826
## Detection Rate : 0.4669
## Detection Prevalence : 0.4748
## Balanced Accuracy : 0.9760
##
## 'Positive' Class : 0
##
The result is that the tuned model has a smaller AIC (492.41 versus 505.89). In addition, predictions from the tuned model are slightly more accurate, increasing from 97.16% to 97.63%.
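A compact way to confirm this (an optional check, not in the original output) is to compare both models side by side:
AIC(model_voice, model_step)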
5.9. Prediction Result
ggplot(voice_test, aes(x=pred2)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
The distribution of predicted probabilities is fairly balanced and is not skewed toward one of the target classes.
6. K-Nearest Neighbour
K-NN stands for K-nearest neighbours. The method classifies new data (the test set) by comparing its characteristics with existing data (the training set). Similarity is measured with the Euclidean distance, the k nearest neighbours of the new observation are selected, and its class is determined by majority voting.
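The following toy sketch (hypothetical training points and a hypothetical new observation, with k = 3) illustrates the distance-and-vote idea before we apply knn() to the voice data:
train_x <- matrix(c(1, 1,
                    1.2, 0.9,
                    5, 5,
                    5.1, 4.8), ncol = 2, byrow = TRUE)    # 4 toy training points
train_y <- factor(c("female", "female", "male", "male"))  # their classes
new_x <- c(1.1, 1.0)                                      # a new observation
# Euclidean distance from the new observation to every training point
d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
nearest <- order(d)[1:3]                                  # indices of the 3 nearest neighbours
names(which.max(table(train_y[nearest])))                 # majority vote -> "female"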
6.1. Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index_knn <- sample(x = nrow(voice),
size = 0.8*nrow(voice))
# splitting
voice_train <- voice[index_knn,]
voice_test <- voice[-index_knn,]
The training set contains 80% of the data and the test set contains the remaining 20%.
6.2. Data pre-processing
# predictors
voice_train_x <- voice_train %>% select(-label)
voice_test_x <- voice_test %>% select(-label)
# target
voice_train_y <- voice_train[,"label"]
voice_test_y <- voice_test[,"label"]
6.3. Scaling
Check the distribution of data
summary(voice)
## meanfreq sd median Q25
## Min. :0.03936 Min. :0.01836 Min. :0.01097 Min. :0.0002288
## 1st Qu.:0.16366 1st Qu.:0.04195 1st Qu.:0.16959 1st Qu.:0.1110865
## Median :0.18484 Median :0.05916 Median :0.19003 Median :0.1402864
## Mean :0.18091 Mean :0.05713 Mean :0.18562 Mean :0.1404556
## 3rd Qu.:0.19915 3rd Qu.:0.06702 3rd Qu.:0.21062 3rd Qu.:0.1759388
## Max. :0.25112 Max. :0.11527 Max. :0.26122 Max. :0.2473469
## Q75 IQR skew kurt
## Min. :0.04295 Min. :0.01456 Min. : 0.1417 Min. : 2.068
## 1st Qu.:0.20875 1st Qu.:0.04256 1st Qu.: 1.6496 1st Qu.: 5.670
## Median :0.22568 Median :0.09428 Median : 2.1971 Median : 8.319
## Mean :0.22476 Mean :0.08431 Mean : 3.1402 Mean : 36.569
## 3rd Qu.:0.24366 3rd Qu.:0.11418 3rd Qu.: 2.9317 3rd Qu.: 13.649
## Max. :0.27347 Max. :0.25223 Max. :34.7255 Max. :1309.613
## sp.ent sfm mode centroid
## Min. :0.7387 Min. :0.03688 Min. :0.0000 Min. :0.03936
## 1st Qu.:0.8618 1st Qu.:0.25804 1st Qu.:0.1180 1st Qu.:0.16366
## Median :0.9018 Median :0.39634 Median :0.1866 Median :0.18484
## Mean :0.8951 Mean :0.40822 Mean :0.1653 Mean :0.18091
## 3rd Qu.:0.9287 3rd Qu.:0.53368 3rd Qu.:0.2211 3rd Qu.:0.19915
## Max. :0.9820 Max. :0.84294 Max. :0.2800 Max. :0.25112
## meanfun minfun maxfun meandom
## Min. :0.05557 Min. :0.009775 Min. :0.1031 Min. :0.007812
## 1st Qu.:0.11700 1st Qu.:0.018223 1st Qu.:0.2540 1st Qu.:0.419828
## Median :0.14052 Median :0.046110 Median :0.2712 Median :0.765795
## Mean :0.14281 Mean :0.036802 Mean :0.2588 Mean :0.829211
## 3rd Qu.:0.16958 3rd Qu.:0.047904 3rd Qu.:0.2775 3rd Qu.:1.177166
## Max. :0.23764 Max. :0.204082 Max. :0.2791 Max. :2.957682
## mindom maxdom dfrange modindx
## Min. :0.004883 Min. : 0.007812 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.007812 1st Qu.: 2.070312 1st Qu.: 2.045 1st Qu.:0.09977
## Median :0.023438 Median : 4.992188 Median : 4.945 Median :0.13936
## Mean :0.052647 Mean : 5.047277 Mean : 4.995 Mean :0.17375
## 3rd Qu.:0.070312 3rd Qu.: 7.007812 3rd Qu.: 6.992 3rd Qu.:0.20918
## Max. :0.458984 Max. :21.867188 Max. :21.844 Max. :0.93237
## label
## 0:1584
## 1:1584
##
##
##
##
All predictors must be on a comparable scale because the K-NN model classifies based on distance: if one variable has a much larger range than the others, it will dominate the distance calculation and the other variables will effectively be ignored.
Because the ranges of the variables differ, feature rescaling is needed.
#scaling
voice_train_x_scale <- scale(voice_train_x)
voice_test_x_scale <- scale(voice_test_x,
center = attr(voice_train_x_scale,"scaled:center"),
scale = attr(voice_train_x_scale,"scaled:scale"))
The predictors are scaled using z-score standardization. The test data are scaled with the center and scale parameters taken from the training data, so that no information from the test set leaks into the preprocessing.
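As a quick optional check (not shown in the original), the scaled training predictors should now have a mean of roughly 0 and a standard deviation of roughly 1:
round(colMeans(voice_train_x_scale), 2)
round(apply(voice_train_x_scale, 2, sd), 2)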
6.4. Pick optimum K
Some conditions for choosing the optimum k:
- k should not be too large: the prediction is then driven only by the dominant class and small but important patterns are ignored.
- k should not be too small: the prediction becomes prone to assigning new data to an outlier's class.
- a common rule of thumb is the square root of the number of training observations: sqrt(nrow(data)).
To avoid a tie during majority voting:
- k should be odd if the number of target classes is even
- k should be even if the number of target classes is odd
- k should not be a multiple of the number of target classes
If the majority vote still ends in a tie, the class is chosen randomly.
sqrt(nrow(voice_train))
## [1] 50.33885
Because the rounded result (50) is even while the number of target classes is also even, we take the nearest odd value, so the optimum k used is 49.
6.5. Predict
voice_pred <- knn(train = voice_train_x_scale,
test = voice_test_x_scale,
cl = voice_train_y,
k = 49)
Conceptually, K-NN does not build an explicit model: it classifies each test observation directly based on its Euclidean distance to the training observations, and the class is then chosen by majority voting.
6.6. Model evaluation
confusionMatrix(data = voice_pred,
reference = voice_test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 289 8
## 1 25 312
##
## Accuracy : 0.9479
## 95% CI : (0.9277, 0.9639)
## No Information Rate : 0.5047
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8958
##
## Mcnemar's Test P-Value : 0.005349
##
## Sensitivity : 0.9204
## Specificity : 0.9750
## Pos Pred Value : 0.9731
## Neg Pred Value : 0.9258
## Prevalence : 0.4953
## Detection Rate : 0.4558
## Detection Prevalence : 0.4685
## Balanced Accuracy : 0.9477
##
## 'Positive' Class : 0
##
The result shows that our K-NN model has an accuracy of 94.79% on the test set, meaning 94.79% of the observations are correctly classified as male or female. The sensitivity is 92.04% and the specificity is 97.50%; because the positive class here is 0 (female), this means almost all male voices are classified correctly, while a somewhat larger share of female voices are misclassified. The positive predictive value is 97.31%, meaning 97.31% of the voices predicted as female are indeed female.
7. Conclusion
Both models have very good accuracy: the logistic regression model reaches 97.63% and K-NN reaches 94.79%, so the two models do not differ much in their ability to predict gender.
After tuning with backward stepwise selection, not only did accuracy improve and AIC decrease; the specificity and positive predictive value reported for the tuned model are also about 1-2% higher than before. This is because the tuned model became slightly better at identifying female voices that were previously predicted as male.
The apparent drop of about 2% in sensitivity is not because more female voices were predicted as male; it mainly reflects the fact that the second confusion matrix uses class 0 as the positive class, together with the trade-off against the higher accuracy and precision.