In this project, several ML algorithms are trained on the voice.csv dataset, whose features (mean frequency, spectral entropy, and so on) describe recorded voices, to recognize a speaker's gender from the acoustic properties of their voice.
A small dataset is used for training and testing the models, with a roughly even gender distribution across the records. The system targets gender recognition as a supporting component for existing systems: narrowing their search space and, in turn, reducing response delay.
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('voice.csv')
df.head(5)
| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | … | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.032027 | 0.015071 | 0.090193 | 0.075122 | 12.863462 | 274.402906 | 0.893369 | 0.491918 | … | 0.059781 | 0.084279 | 0.015702 | 0.275862 | 0.007812 | 0.007812 | 0.007812 | 0.000000 | 0.000000 | male |
| 1 | 0.066009 | 0.067310 | 0.040229 | 0.019414 | 0.092666 | 0.073252 | 22.423285 | 634.613855 | 0.892193 | 0.513724 | … | 0.066009 | 0.107937 | 0.015826 | 0.250000 | 0.009014 | 0.007812 | 0.054688 | 0.046875 | 0.052632 | male |
| 2 | 0.077316 | 0.083829 | 0.036718 | 0.008701 | 0.131908 | 0.123207 | 30.757155 | 1024.927705 | 0.846389 | 0.478905 | … | 0.077316 | 0.098706 | 0.015656 | 0.271186 | 0.007990 | 0.007812 | 0.015625 | 0.007812 | 0.046512 | male |
| 3 | 0.151228 | 0.072111 | 0.158011 | 0.096582 | 0.207955 | 0.111374 | 1.232831 | 4.177296 | 0.963322 | 0.727232 | … | 0.151228 | 0.088965 | 0.017798 | 0.250000 | 0.201497 | 0.007812 | 0.562500 | 0.554688 | 0.247119 | male |
| 4 | 0.135120 | 0.079146 | 0.124656 | 0.078720 | 0.206045 | 0.127325 | 1.101174 | 4.333713 | 0.971955 | 0.783568 | … | 0.135120 | 0.106398 | 0.016931 | 0.266667 | 0.712812 | 0.007812 | 5.484375 | 5.476562 | 0.208274 | male |
5 rows × 21 columns
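To back up the claim of a near-even gender split, a quick count of the label column (the commonly distributed voice.csv has 1584 records per gender, though that figure is an assumption about the exact file in use):

print(df['label'].value_counts())  # expect roughly equal male/female counts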
corrmat = df.corr(numeric_only=True)  # exclude the non-numeric label column (required on recent pandas)
sns.set(rc={'figure.figsize':(10,8)})
sns.heatmap(corrmat, linewidths=0.25, vmax=1.0, cmap='YlGnBu', linecolor='black')
[Figure: correlation heatmap of the voice features]
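The heatmap suggests some features are nearly redundant. A small helper to list the strongest pairs explicitly (a sketch; the 0.9 cutoff is an arbitrary choice):

import itertools

# Print every feature pair whose absolute correlation exceeds the cutoff,
# reusing the corrmat computed above.
for a, b in itertools.combinations(corrmat.columns, 2):
    r = corrmat.loc[a, b]
    if abs(r) > 0.9:
        print(f'{a:10s} {b:10s} r = {r:+.3f}')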
col_names = list(df.columns.values)
print(col_names)
type(df.columns.values)
['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'label']
numpy.ndarray
df = df.rename(columns = {'label': 'gender'})
df.columns.values
array(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'gender'], dtype=object)
X = np.array(df.drop('gender', axis=1))  # positional axis argument was removed in pandas 2.0
y = np.array(df['gender'])
Dividing the data randomly into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Creating and training the model
model = LogisticRegression(max_iter=1000)  # raise the iteration cap; the default can trigger a ConvergenceWarning on unscaled features
model.fit(X_train, y_train)
print('Training accuracy: ', model.score(X_train, y_train))
print('Test accuracy: ', model.score(X_test, y_test))
Training accuracy:  0.9033149171270718
Test accuracy:  0.8943217665615142
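Note that the preprocessing module imported at the top is never used above. Standardizing the features generally helps the solver converge and can nudge accuracy upward; a minimal sketch reusing the split above (exact numbers will vary with the random split):

# Fit the scaler on the training split only, to avoid leaking test statistics.
scaler = preprocessing.StandardScaler().fit(X_train)
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(scaler.transform(X_train), y_train)
print('Scaled test accuracy: ', model_scaled.score(scaler.transform(X_test), y_test))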
X = np.array(df.drop('gender', axis=1))
y = np.array(df['gender'])
Dividing the data randomly into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Creating and training the model
KNNmodel = neighbors.KNeighborsClassifier()
KNNmodel.fit(X_train, y_train)
accuracy = KNNmodel.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.722397476340694
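KNN is distance-based, so features with large numeric ranges (kurt runs into the hundreds here) dominate the Euclidean distance and drag accuracy down. Rescaling typically recovers most of the gap; a minimal sketch on the same split:

# Map every feature to [0, 1] so no single column dominates the distance metric.
minmax = preprocessing.MinMaxScaler().fit(X_train)
knn_scaled = neighbors.KNeighborsClassifier()
knn_scaled.fit(minmax.transform(X_train), y_train)
print('Scaled KNN accuracy: ', knn_scaled.score(minmax.transform(X_test), y_test))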
The results above use all features without any selection; next we drop columns that do not help the classification.
We will keep only (meanfreq, sd, meanfun, gender) to reduce the error.
df1 = df[['meanfreq', 'sd', 'meanfun', 'gender']]
X = np.array(df1.drop('gender', axis=1))
y = np.array(df1['gender'])
Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Training and testing KNN model
KNNmodel2 = neighbors.KNeighborsClassifier()
KNNmodel2.fit(X_train, y_train)
accuracy = KNNmodel2.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.9574132492113565
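A single random split can flatter or shortchange a model; cross-validation gives a steadier estimate. A sketch with 5 folds (an arbitrary but common choice):

from sklearn.model_selection import cross_val_score

# Average accuracy across 5 folds on the reduced three-feature set.
scores = cross_val_score(neighbors.KNeighborsClassifier(), X, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))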
from sklearn.feature_selection import SelectKBest, f_classif
Selecting the k best features for classification

def select_kbest_clf(data_frame, target, k=5):
    """
    :param data_frame: a pandas DataFrame with the training data
    :param target: target variable name in the DataFrame
    :param k: desired number of features from the data
    :returns feat_scores: scores for each feature in the data as a pandas DataFrame
    """
    feat_selector = SelectKBest(f_classif, k=k)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])

    feat_scores = pd.DataFrame()
    feat_scores['F Score'] = feat_selector.scores_
    feat_scores['P Value'] = feat_selector.pvalues_
    feat_scores['Attribute'] = data_frame.drop(target, axis=1).columns

    return feat_scores
k = select_kbest_clf(df, 'gender', k=5).sort_values(['F Score'], ascending = False)
k
| | F Score | P Value | Attribute |
|---|---|---|---|
| 12 | 7228.790362 | 0.000000e+00 | meanfun |
| 5 | 1965.750000 | 0.000000e+00 | IQR |
| 3 | 1121.569224 | 9.140832e-211 | Q25 |
| 8 | 1003.308717 | 1.614016e-191 | sp.ent |
| 1 | 945.461376 | 6.654756e-182 | sd |
| 9 | 463.923194 | 3.877715e-96 | sfm |
| 0 | 406.752820 | 3.368951e-85 | meanfreq |
| 11 | 406.752820 | 3.368951e-85 | centroid |
| 2 | 277.588158 | 8.259210e-60 | median |
| 17 | 126.024161 | 1.050986e-28 | maxdom |
| 16 | 125.110999 | 1.636130e-28 | mindom |
| 18 | 121.457858 | 9.626061e-28 | dfrange |
| 15 | 119.959108 | 1.992966e-27 | meandom |
| 10 | 96.257909 | 2.097044e-22 | mode |
| 14 | 90.228036 | 4.044625e-21 | maxfun |
| 13 | 60.282137 | 1.101400e-14 | minfun |
| 7 | 24.255365 | 8.869557e-07 | kurt |
| 4 | 14.236082 | 1.642021e-04 | Q75 |
| 6 | 4.252980 | 3.926293e-02 | skew |
| 19 | 3.006445 | 8.303136e-02 | modindx |
Plotting a bar graph to identify the crucial features for recognition
k1 = sns.barplot(x = k['F Score'], y = k['Attribute'])
k1.set_title('Feature Importance')
[Figure: 'Feature Importance' bar plot of F score per attribute]
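Rather than hardcoding the winning columns, the fitted selector can report them directly via get_support(); a short sketch retrieving the seven highest-scoring names (note that meanfreq and centroid share a score, so the last slot may go to either):

# Fit SelectKBest with k=7 and read the selected column names back.
selector = SelectKBest(f_classif, k=7)
selector.fit(df.drop('gender', axis=1), df['gender'])
print(list(df.drop('gender', axis=1).columns[selector.get_support()]))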
Selecting top 7 features to train the ML models
df3 = df[['meanfun', 'IQR', 'Q25', 'sp.ent', 'sd', 'sfm', 'meanfreq', 'gender']]
X = np.array(df3.drop('gender', axis=1))
y = np.array(df3['gender'])
Dividing dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Training and testing the model
KNNmodel3 = neighbors.KNeighborsClassifier()
KNNmodel3.fit(X_train, y_train)
accuracy = KNNmodel3.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.9858044164037855
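Accuracy alone hides per-class behavior; a confusion matrix and classification report show whether the errors skew toward either gender (a minimal sketch for the model above):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the KNN predictions on the held-out test set.
y_pred = KNNmodel3.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))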