In this project, several ML algorithms are trained on the voice.csv dataset, which contains acoustic features of voice recordings (mean frequency etc.), to build models that recognize a person's gender from the features of their voice.

A small dataset is used for training and testing the models. The genders are almost evenly distributed across the records. The goal of gender recognition here is to support existing systems by reducing their search space, which in turn reduces the delay in their responses.

Loading necessary libraries

import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
Loading Voice (voice.csv) Dataset into Pandas Dataframe
df = pd.read_csv('voice.csv')
df.head(5)
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male

5 rows × 21 columns
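
The claim that the genders are almost evenly distributed can be checked by counting the values of the target column. The sketch below is only an illustration: the helper name `class_balance` and the tiny stand-in frame are assumptions, not part of the original notebook.

```python
import pandas as pd

def class_balance(data_frame, target):
    """Return the share of rows that belongs to each class of the target column."""
    return data_frame[target].value_counts(normalize=True)

# Tiny stand-in frame with made-up values (NOT rows from voice.csv)
demo = pd.DataFrame({
    'meanfun': [0.08, 0.10, 0.17, 0.19],
    'label': ['male', 'male', 'female', 'female'],
})

balance = class_balance(demo, 'label')
print(balance)
```

On the real data the equivalent call would be `class_balance(df, 'label')`.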

Visualizing the correlation among the features.

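# Drop the non-numeric 'label' column so that corr() works on recent pandas versions
corrmat = df.drop(columns=['label']).corr()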
sns.set(rc={'figure.figsize':(10,8)})
sns.heatmap(corrmat, linewidths=0.25, vmax=1.0, cmap='YlGnBu', linecolor='black')

[Figure: heatmap of the correlations among the features]

Data Preprocessing

col_names = list(df.columns.values)
print(col_names)
type(df.columns.values)

['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'label']

numpy.ndarray

df = df.rename(columns = {'label': 'gender'})
df.columns.values

array(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'gender'], dtype=object)

Creating Machine Learning Models to classify

Logistic Regression

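# Features: every column except the target (positional axis argument is not accepted by pandas 2.x)
X = np.array(df.drop(columns=['gender']))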
y = np.array(df['gender'])

Dividing the data randomly into training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
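
Because the split above is random, the accuracies change from run to run. A minimal sketch of a reproducible split follows, assuming that a fixed seed and a class-stratified split are wanted; the original notebook uses neither. The data is synthetic and the names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (not voice.csv data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array(['male'] * 50 + ['female'] * 50)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```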

Creating and training the model

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy1 is the accuracy on the training set, Accuracy2 the accuracy on the test set
print('Accuracy1: ', model.score(X_train, y_train))
print('Accuracy2: ', model.score(X_test, y_test))

Accuracy1: 0.9033149171270718
Accuracy2: 0.8943217665615142

KNN Classifier

X = np.array(df.drop(columns=['gender']))
y = np.array(df['gender'])

Dividing the data randomly into training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Creating and training the model

KNNmodel = neighbors.KNeighborsClassifier()
KNNmodel.fit(X_train, y_train)

accuracy = KNNmodel.score(X_test, y_test)
print('Accuracy: ', accuracy)

Accuracy: 0.722397476340694
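
KNN ranks neighbours by distance, so a feature with a large numeric range (kurt reaches values in the hundreds in the sample rows, while most other features stay below 1) can dominate the result. The notebook itself does not scale the features. The sketch below only illustrates the effect on synthetic data, under the assumption that standardizing with `StandardScaler` is an acceptable preprocessing step; it is not a measurement on voice.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one small-range feature separates the classes,
# the other is pure noise with a much larger range
rng = np.random.default_rng(0)
n = 200
y_demo = np.array([0] * n + [1] * n)
informative = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(1.0, 0.1, n)])
noisy = rng.normal(0.0, 100.0, 2 * n)
X_demo = np.column_stack([informative, noisy])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Same classifier, once on raw features and once behind a scaler
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print('raw:', raw_acc, 'scaled:', scaled_acc)
```

Putting the scaler inside the pipeline means it is fitted on the training part only, so the test rows do not influence the scaling.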

The results so far were obtained with all features and no feature selection. Next, we drop the columns that do not help classification.
We keep only the columns meanfreq, sd, and meanfun (plus the gender target) to reduce the error.

df1 = df[['meanfreq', 'sd', 'meanfun', 'gender']]
X = np.array(df1.drop(columns=['gender']))
y = np.array(df1['gender'])

Splitting data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Training and testing KNN model

KNNmodel2 = neighbors.KNeighborsClassifier()
KNNmodel2.fit(X_train, y_train)

accuracy = KNNmodel2.score(X_test, y_test)
print('Accuracy: ', accuracy)

Accuracy: 0.9574132492113565

Improving models through feature selection from sklearn

from sklearn.feature_selection import SelectKBest, f_classif

Selecting the K best features for classification
1. param data_frame: a pandas DataFrame with the training data
2. param target: name of the target variable in the DataFrame
3. param k: desired number of features from the data
4. returns feat_scores: scores for each feature in the data as a pandas DataFrame

def select_kbest_clf(data_frame, target, k=5):
  feat_selector = SelectKBest(f_classif, k=k)
  _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
  feat_scores = pd.DataFrame()
  feat_scores['F Score'] = feat_selector.scores_
  feat_scores['P Value'] = feat_selector.pvalues_
  feat_scores['Attribute'] = data_frame.drop(target, axis=1).columns
  return feat_scores
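
The `f_classif` score used above is the one-way ANOVA F value: the variance between the class means divided by the variance within the classes. The following sketch recomputes it with NumPy for a single feature to show where the number comes from; the helper name `anova_f` and the sample values are made up for illustration.

```python
import numpy as np

def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = values.mean()

    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the samples around their own class mean
    for c in classes:
        group = values[labels == c]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        ss_within += ((group - group.mean()) ** 2).sum()

    df_between = len(classes) - 1
    df_within = len(values) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical feature values, three per class
values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
labels = ['male'] * 3 + ['female'] * 3
f_score = anova_f(values, labels)
print(f_score)  # 24.0
```

A large F score means the class means are far apart relative to the scatter inside each class, which is why such features are ranked as more useful.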

k = select_kbest_clf(df, 'gender', k=5).sort_values(['F Score'], ascending = False)
k
F Score P Value Attribute
12 7228.790362 0.000000e+00 meanfun
5 1965.750000 0.000000e+00 IQR
3 1121.569224 9.140832e-211 Q25
8 1003.308717 1.614016e-191 sp.ent
1 945.461376 6.654756e-182 sd
9 463.923194 3.877715e-96 sfm
0 406.752820 3.368951e-85 meanfreq
11 406.752820 3.368951e-85 centroid
2 277.588158 8.259210e-60 median
17 126.024161 1.050986e-28 maxdom
16 125.110999 1.636130e-28 mindom
18 121.457858 9.626061e-28 dfrange
15 119.959108 1.992966e-27 meandom
10 96.257909 2.097044e-22 mode
14 90.228036 4.044625e-21 maxfun
13 60.282137 1.101400e-14 minfun
7 24.255365 8.869557e-07 kurt
4 14.236082 1.642021e-04 Q75
6 4.252980 3.926293e-02 skew
19 3.006445 8.303136e-02 modindx
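
Note that the `k` argument of `select_kbest_clf` does not change the table above, because `scores_` always covers every column. If the names of the top k features are wanted directly, the selector's `get_support()` mask can be used. This is a sketch on synthetic data; the helper name `top_k_features` and the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def top_k_features(data_frame, target, k=5):
    """Return the names of the k columns with the highest ANOVA F score."""
    features = data_frame.drop(columns=[target])
    selector = SelectKBest(f_classif, k=k).fit(features, data_frame[target])
    # get_support() is a boolean mask over the feature columns
    return list(features.columns[selector.get_support()])

# Synthetic frame: 'strong' differs a lot between classes, 'noise' not at all
rng = np.random.default_rng(0)
labels = np.array(['male'] * 50 + ['female'] * 50)
shift = (labels == 'female').astype(float)
demo = pd.DataFrame({
    'strong': shift * 5.0 + rng.normal(size=100),
    'weak': shift * 0.5 + rng.normal(size=100),
    'noise': rng.normal(size=100),
    'gender': labels,
})

selected = top_k_features(demo, 'gender', k=1)
print(selected)
```

The names come back in column order, not in order of score, so the sorted table is still the better view for ranking.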

Plotting a bar graph to find the crucial features for recognition

k1 = sns.barplot(x = k['F Score'], y = k['Attribute'])
k1.set_title('Feature Importance')

[Figure: bar plot of the F score of each attribute, titled 'Feature Importance']

Selecting top 7 features to train the ML models

df3 = df[['meanfun', 'IQR', 'Q25', 'sp.ent', 'sd', 'sfm', 'meanfreq', 'gender']]
X = np.array(df3.drop(columns=['gender']))
y = np.array(df3['gender'])

K-Nearest Neighbours

Dividing dataset into training and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Training and testing the model

KNNmodel3 = neighbors.KNeighborsClassifier()
KNNmodel3.fit(X_train, y_train)

accuracy = KNNmodel3.score(X_test, y_test)
print('Accuracy: ', accuracy)

Accuracy: 0.9858044164037855
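
Each accuracy in this walkthrough comes from a single, different random split, so small differences between runs may be due to the split rather than the features. One way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 folds and uses synthetic data; it does not reproduce the numbers above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (not voice.csv): the first feature carries the signal
rng = np.random.default_rng(0)
y_demo = np.array([0] * 100 + [1] * 100)
X_demo = np.column_stack([
    y_demo + rng.normal(0.0, 0.3, 200),
    rng.normal(0.0, 1.0, 200),
])

# One accuracy per fold; the mean and spread summarize the estimate
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=5)
print('mean accuracy:', scores.mean(), 'std:', scores.std())
```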