In this project, several ML algorithms are trained on the voice.csv dataset, whose features (mean frequency, spectral entropy, and so on) describe recorded voices, to recognize a speaker's gender from the acoustic properties of their voice.
A small dataset is used for training and testing the models, with a roughly even gender distribution across the records. The system targets gender recognition as a supporting component for existing systems: narrowing their search space and, in turn, reducing response delay.
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('voice.csv')
df.head(5)
| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | … | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.032027 | 0.015071 | 0.090193 | 0.075122 | 12.863462 | 274.402906 | 0.893369 | 0.491918 | … | 0.059781 | 0.084279 | 0.015702 | 0.275862 | 0.007812 | 0.007812 | 0.007812 | 0.000000 | 0.000000 | male |
| 1 | 0.066009 | 0.067310 | 0.040229 | 0.019414 | 0.092666 | 0.073252 | 22.423285 | 634.613855 | 0.892193 | 0.513724 | … | 0.066009 | 0.107937 | 0.015826 | 0.250000 | 0.009014 | 0.007812 | 0.054688 | 0.046875 | 0.052632 | male |
| 2 | 0.077316 | 0.083829 | 0.036718 | 0.008701 | 0.131908 | 0.123207 | 30.757155 | 1024.927705 | 0.846389 | 0.478905 | … | 0.077316 | 0.098706 | 0.015656 | 0.271186 | 0.007990 | 0.007812 | 0.015625 | 0.007812 | 0.046512 | male |
| 3 | 0.151228 | 0.072111 | 0.158011 | 0.096582 | 0.207955 | 0.111374 | 1.232831 | 4.177296 | 0.963322 | 0.727232 | … | 0.151228 | 0.088965 | 0.017798 | 0.250000 | 0.201497 | 0.007812 | 0.562500 | 0.554688 | 0.247119 | male |
| 4 | 0.135120 | 0.079146 | 0.124656 | 0.078720 | 0.206045 | 0.127325 | 1.101174 | 4.333713 | 0.971955 | 0.783568 | … | 0.135120 | 0.106398 | 0.016931 | 0.266667 | 0.712812 | 0.007812 | 5.484375 | 5.476562 | 0.208274 | male |
5 rows × 21 columns
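To back up the claim of a near-even gender split, a quick count of the label column (the commonly distributed voice.csv has 1584 records per gender, though that figure is an assumption about the exact file in use):

print(df['label'].value_counts())  # expect roughly equal male/female counts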
corrmat = df.corr(numeric_only=True)  # exclude the non-numeric label column (required on recent pandas)
sns.set(rc={'figure.figsize':(10,8)})
sns.heatmap(corrmat, linewidths=0.25, vmax=1.0, cmap='YlGnBu', linecolor='black')
[Figure: correlation heatmap of the voice features]
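The heatmap suggests some features are nearly redundant. A small helper to list the strongest pairs explicitly (a sketch; the 0.9 cutoff is an arbitrary choice):

import itertools

# Print every feature pair whose absolute correlation exceeds the cutoff,
# reusing the corrmat computed above.
for a, b in itertools.combinations(corrmat.columns, 2):
    r = corrmat.loc[a, b]
    if abs(r) > 0.9:
        print(f'{a:10s} {b:10s} r = {r:+.3f}')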
col_names = list(df.columns.values)
print(col_names)
type(df.columns.values)
['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'label']
numpy.ndarray
df = df.rename(columns = {'label': 'gender'})
df.columns.values
array(['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt', 'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun', 'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx', 'gender'], dtype=object)
X = np.array(df.drop('gender', axis=1))  # positional axis argument was removed in pandas 2.0
y = np.array(df['gender'])
Dividing the data randomly into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Creating and training the model
model = LogisticRegression(max_iter=1000)  # raise the iteration cap; the default can trigger a ConvergenceWarning on unscaled features
model.fit(X_train, y_train)
print('Training accuracy: ', model.score(X_train, y_train))
print('Test accuracy: ', model.score(X_test, y_test))
Training accuracy:  0.9033149171270718
Test accuracy:  0.8943217665615142
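Note that the preprocessing module imported at the top is never used above. Standardizing the features generally helps the solver converge and can nudge accuracy upward; a minimal sketch reusing the split above (exact numbers will vary with the random split):

# Fit the scaler on the training split only, to avoid leaking test statistics.
scaler = preprocessing.StandardScaler().fit(X_train)
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(scaler.transform(X_train), y_train)
print('Scaled test accuracy: ', model_scaled.score(scaler.transform(X_test), y_test))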
X = np.array(df.drop('gender', axis=1))
y = np.array(df['gender'])
Dividing the data randomly into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Creating and training the model
KNNmodel = neighbors.KNeighborsClassifier()
KNNmodel.fit(X_train, y_train)
accuracy = KNNmodel.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.722397476340694
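KNN is distance-based, so features with large numeric ranges (kurt runs into the hundreds here) dominate the Euclidean distance and drag accuracy down. Rescaling typically recovers most of the gap; a minimal sketch on the same split:

# Map every feature to [0, 1] so no single column dominates the distance metric.
minmax = preprocessing.MinMaxScaler().fit(X_train)
knn_scaled = neighbors.KNeighborsClassifier()
knn_scaled.fit(minmax.transform(X_train), y_train)
print('Scaled KNN accuracy: ', knn_scaled.score(minmax.transform(X_test), y_test))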
The results above use all features without any selection; next we drop columns that do not help the classification.
We will keep only (meanfreq, sd, meanfun, gender) to reduce the error.
df1 = df[['meanfreq', 'sd', 'meanfun', 'gender']]
X = np.array(df1.drop('gender', axis=1))
y = np.array(df1['gender'])
Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Training and testing KNN model
KNNmodel2 = neighbors.KNeighborsClassifier()
KNNmodel2.fit(X_train, y_train)
accuracy = KNNmodel2.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.9574132492113565
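A single random split can flatter or shortchange a model; cross-validation gives a steadier estimate. A sketch with 5 folds (an arbitrary but common choice):

from sklearn.model_selection import cross_val_score

# Average accuracy across 5 folds on the reduced three-feature set.
scores = cross_val_score(neighbors.KNeighborsClassifier(), X, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))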
from sklearn.feature_selection import SelectKBest, f_classif
Selecting the k best features for classification

def select_kbest_clf(data_frame, target, k=5):
    """
    :param data_frame: a pandas DataFrame with the training data
    :param target: target variable name in the DataFrame
    :param k: desired number of features from the data
    :returns feat_scores: scores for each feature in the data as a pandas DataFrame
    """
    feat_selector = SelectKBest(f_classif, k=k)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])

    feat_scores = pd.DataFrame()
    feat_scores['F Score'] = feat_selector.scores_
    feat_scores['P Value'] = feat_selector.pvalues_
    feat_scores['Attribute'] = data_frame.drop(target, axis=1).columns

    return feat_scores
k = select_kbest_clf(df, 'gender', k=5).sort_values(['F Score'], ascending = False)
k
| | F Score | P Value | Attribute |
|---|---|---|---|
| 12 | 7228.790362 | 0.000000e+00 | meanfun |
| 5 | 1965.750000 | 0.000000e+00 | IQR |
| 3 | 1121.569224 | 9.140832e-211 | Q25 |
| 8 | 1003.308717 | 1.614016e-191 | sp.ent |
| 1 | 945.461376 | 6.654756e-182 | sd |
| 9 | 463.923194 | 3.877715e-96 | sfm |
| 0 | 406.752820 | 3.368951e-85 | meanfreq |
| 11 | 406.752820 | 3.368951e-85 | centroid |
| 2 | 277.588158 | 8.259210e-60 | median |
| 17 | 126.024161 | 1.050986e-28 | maxdom |
| 16 | 125.110999 | 1.636130e-28 | mindom |
| 18 | 121.457858 | 9.626061e-28 | dfrange |
| 15 | 119.959108 | 1.992966e-27 | meandom |
| 10 | 96.257909 | 2.097044e-22 | mode |
| 14 | 90.228036 | 4.044625e-21 | maxfun |
| 13 | 60.282137 | 1.101400e-14 | minfun |
| 7 | 24.255365 | 8.869557e-07 | kurt |
| 4 | 14.236082 | 1.642021e-04 | Q75 |
| 6 | 4.252980 | 3.926293e-02 | skew |
| 19 | 3.006445 | 8.303136e-02 | modindx |
Plotting a bar graph to identify the crucial features for recognition
k1 = sns.barplot(x = k['F Score'], y = k['Attribute'])
k1.set_title('Feature Importance')
[Figure: 'Feature Importance' bar plot of F score per attribute]
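Rather than hardcoding the winning columns, the fitted selector can report them directly via get_support(); a short sketch retrieving the seven highest-scoring names (note that meanfreq and centroid share a score, so the last slot may go to either):

# Fit SelectKBest with k=7 and read the selected column names back.
selector = SelectKBest(f_classif, k=7)
selector.fit(df.drop('gender', axis=1), df['gender'])
print(list(df.drop('gender', axis=1).columns[selector.get_support()]))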
Selecting top 7 features to train the ML models
df3 = df[['meanfun', 'IQR', 'Q25', 'sp.ent', 'sd', 'sfm', 'meanfreq', 'gender']]
X = np.array(df3.drop('gender', axis=1))
y = np.array(df3['gender'])
Dividing dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Training and testing the model
KNNmodel3 = neighbors.KNeighborsClassifier()
KNNmodel3.fit(X_train, y_train)
accuracy = KNNmodel3.score(X_test, y_test)
print('Accuracy: ', accuracy)
Accuracy: 0.9858044164037855
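Accuracy alone hides per-class behavior; a confusion matrix and classification report show whether the errors skew toward either gender (a minimal sketch for the model above):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the KNN predictions on the held-out test set.
y_pred = KNNmodel3.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))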