K Nearest Neighbors


Objective

You’ve been given a classified data set from a company. They’ve hidden the feature column names but have given you the data and the target classes.

Try to use KNN to create a model that directly predicts a class for a new data point based on its features.

The Libraries

> import pandas as pd
+ import seaborn as sns
+ import matplotlib.pyplot as plt
+ import numpy as np

The Data

> df = pd.read_csv("Classified Data",index_col=0)
df.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.9139173 1.1620727 0.5679459 0.7554639 0.7808616 0.3526077 0.7596969 0.6437976 0.8794221 1.231409 1
1 0.6356319 1.0037216 0.5353424 0.8256448 0.9241089 0.6484502 0.6753343 1.0135460 0.6215522 1.492702 0
2 0.7213598 1.2014926 0.9219897 0.8555950 1.5266294 0.7207809 1.6263507 1.1544831 0.9578770 1.285597 0
3 1.2342044 1.3867263 0.6530463 0.8256244 1.1425035 0.8751279 1.4097081 1.3800025 1.5226920 1.153093 1
4 1.2794908 0.9497496 0.6272800 0.6689761 1.2325373 0.7037274 1.1155955 0.6466907 1.4638118 1.419167 1
5 0.8339278 1.5233023 1.1047427 1.0211390 1.1073771 1.0109295 1.2795375 1.2806766 0.5103503 1.528044 0

Exploratory Data Analysis

> sns.pairplot(df,hue='TARGET CLASS',palette='coolwarm');
> plt.show()

Standardize the Variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
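
For intuition, here's a minimal sketch (with made-up numbers, not taken from this data set) of how a feature on a large scale dominates the Euclidean distance between two observations:

> a = np.array([1.0, 1000.0])   # second feature is on a much larger scale
+ b = np.array([2.0, 1100.0])
+ print(np.linalg.norm(a - b))  # ~100.005 -- driven almost entirely by the second feature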

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

\[z = \frac{(x - \mu)}{\sigma}\]

where \(\mu\) is the mean of the training samples and \(\sigma\) is the standard deviation of the training samples.
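
As a quick sanity check, the same score can be computed by hand with NumPy (toy array, values chosen only for illustration; note that the population standard deviation is used, matching StandardScaler):

> x = np.array([2.0, 4.0, 6.0, 8.0])
+ z = (x - x.mean()) / x.std()   # mu = 5.0, sigma = sqrt(5) ~= 2.236
+ print(z)                       # [-1.3416 -0.4472  0.4472  1.3416]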

> from sklearn.preprocessing import StandardScaler
> scaler = StandardScaler()
+ scaler.fit(df.drop('TARGET CLASS',axis=1))
StandardScaler(copy=True, with_mean=True, with_std=True)
> scaled_features = scaler.transform(
+   df.drop('TARGET CLASS',axis=1))
> df_feat = pd.DataFrame(scaled_features,
+                 columns=df.columns[:-1])
df_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
-0.1235419 0.1859075 -0.9134307 0.3196291 -1.0336368 -2.3083747 -0.7989513 -1.4823681 -0.9497194 -0.6433142
-1.0848360 -0.4303484 -1.0253133 0.6253883 -0.4448471 -1.1527060 -1.1297975 -0.2022403 -1.8280509 0.6367586
-0.7887022 0.3393182 0.3015114 0.7558728 2.0316930 -0.8701562 2.5998184 0.2857065 -0.6824938 -0.3778499
0.9828405 1.0601933 -0.6213988 0.6252994 0.4528203 -0.2672204 1.7502076 1.0664905 1.2413246 -1.0269871
1.1392755 -0.6403919 -0.7098186 -0.0571746 0.8228862 -0.9367731 0.5967817 -1.4723516 1.0407722 0.2765098
-0.3998534 1.5917070 0.9286490 1.4771025 0.3084402 0.2632702 1.2397158 0.7226082 -2.2068162 0.8098998

Train Test Split

> from sklearn.model_selection import train_test_split
> X_train, X_test, y_train, y_test = train_test_split(
+     scaled_features,df['TARGET CLASS'],
+         test_size=0.30, random_state=101)

Using KNN

Start with K=1

> from sklearn.neighbors import KNeighborsClassifier
> knn = KNeighborsClassifier(n_neighbors=1)
> knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
> pred = knn.predict(X_test)
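
As an optional sanity check (not part of the original workflow), kneighbors() returns, for each test point, the distance to and the index of the single training observation that decides its class when n_neighbors=1:

> dist, idx = knn.kneighbors(X_test[:3])
+ print(dist)   # distance from each of the first 3 test points to its nearest training point
+ print(idx)    # row index of that training point in X_train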

Predictions and Evaluations

> from sklearn.metrics import classification_report,confusion_matrix
> print(confusion_matrix(y_test,pred))
[[151   8]
 [ 15 126]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300
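
A quick sketch (not in the original notebook) for reading the confusion matrix: rows are true classes, columns are predicted classes, so the counts can be unpacked and used to recompute the accuracy shown in the report:

> tn, fp, fn, tp = confusion_matrix(y_test,pred).ravel()
+ print(tn, fp, fn, tp)                    # 151 8 15 126
+ print((tn + tp) / (tn + fp + fn + tp))   # ~0.92, matching the report above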

Choosing a K Value

> error_rate = []
+ 
+ for i in range(1,40):
+     
+     knn = KNeighborsClassifier(n_neighbors=i)
+     knn.fit(X_train,y_train)
+     pred_i = knn.predict(X_test)
+     error_rate.append(np.mean(pred_i != y_test))
> plt.figure(figsize=(10,6))
+ plt.plot(range(1,40), error_rate, color='blue',
+          linestyle='dashed', marker='o',
+          markerfacecolor='red', markersize=10)
+ plt.title('Error Rate vs. K Value')
+ plt.xlabel('K')
+ plt.ylabel('Error Rate')
+ plt.show()

Here we can see that after around K > 12 the error rate doesn't get much lower. We can retrain the model with K=12 and check the classification report.
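
As a side note, the elbow can also be read off programmatically from the error_rate list computed above (a small sketch that simply picks the K with the lowest test error):

> best_k = int(np.argmin(error_rate)) + 1   # +1 because range(1,40) starts at K=1
+ print(best_k, error_rate[best_k - 1])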

K=1

> # FIRST, A QUICK COMPARISON TO THE ORIGINAL K=1
+ knn = KNeighborsClassifier(n_neighbors=1)
+ 
+ knn.fit(X_train,y_train);
+ pred = knn.predict(X_test)
+ 
+ 
+ print(confusion_matrix(y_test,pred))
[[151   8]
 [ 15 126]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

K=12

> # NOW WITH K=12
+ knn = KNeighborsClassifier(n_neighbors=12)
+ 
+ knn.fit(X_train,y_train);
+ pred = knn.predict(X_test)
+ 
+ print(confusion_matrix(y_test,pred))
[[155   4]
 [ 10 131]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       159
           1       0.97      0.93      0.95       141

    accuracy                           0.95       300
   macro avg       0.95      0.95      0.95       300
weighted avg       0.95      0.95      0.95       300