K Nearest Neighbors


Objective

You’ve been given a classified data set from a company. They’ve hidden the feature column names but have given you the data and the target classes.

Try to use KNN to create a model that directly predicts a class for a new data point based on its features.

The Libraries

> import pandas as pd
+ import seaborn as sns
+ import matplotlib.pyplot as plt
+ import numpy as np

The Data

> df = pd.read_csv("Classified Data",index_col=0)
df.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.9139173 1.1620727 0.5679459 0.7554639 0.7808616 0.3526077 0.7596969 0.6437976 0.8794221 1.231409 1
1 0.6356319 1.0037216 0.5353424 0.8256448 0.9241089 0.6484502 0.6753343 1.0135460 0.6215522 1.492702 0
2 0.7213598 1.2014926 0.9219897 0.8555950 1.5266294 0.7207809 1.6263507 1.1544831 0.9578770 1.285597 0
3 1.2342044 1.3867263 0.6530463 0.8256244 1.1425035 0.8751279 1.4097081 1.3800025 1.5226920 1.153093 1
4 1.2794908 0.9497496 0.6272800 0.6689761 1.2325373 0.7037274 1.1155955 0.6466907 1.4638118 1.419167 1
5 0.8339278 1.5233023 1.1047427 1.0211390 1.1073771 1.0109295 1.2795375 1.2806766 0.5103503 1.528044 0

Exploratory Data Analysis

> sns.pairplot(df,hue='TARGET CLASS',palette='coolwarm');
> plt.show()

Standardize the Variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
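
For intuition, here's a minimal sketch (with made-up numbers, not taken from this data set) of how a feature on a large scale dominates the Euclidean distance between two observations:

> a = np.array([1.0, 1000.0])   # second feature is on a much larger scale
+ b = np.array([2.0, 1100.0])
+ print(np.linalg.norm(a - b))  # ~100.005 -- driven almost entirely by the second feature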

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

\[z = \frac{(x - \mu)}{\sigma}\]

where \(\mu\) is the mean of the training samples and \(\sigma\) is the standard deviation of the training samples.
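
As a quick sanity check, the same score can be computed by hand with NumPy (toy array, values chosen only for illustration; note that the population standard deviation is used, matching StandardScaler):

> x = np.array([2.0, 4.0, 6.0, 8.0])
+ z = (x - x.mean()) / x.std()   # mu = 5.0, sigma = sqrt(5) ~= 2.236
+ print(z)                       # [-1.3416 -0.4472  0.4472  1.3416]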

> from sklearn.preprocessing import StandardScaler
> scaler = StandardScaler()
+ scaler.fit(df.drop('TARGET CLASS',axis=1))
StandardScaler(copy=True, with_mean=True, with_std=True)
> scaled_features = scaler.transform(
+   df.drop('TARGET CLASS',axis=1))
> df_feat = pd.DataFrame(scaled_features,
+                 columns=df.columns[:-1])
df_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
-0.1235419 0.1859075 -0.9134307 0.3196291 -1.0336368 -2.3083747 -0.7989513 -1.4823681 -0.9497194 -0.6433142
-1.0848360 -0.4303484 -1.0253133 0.6253883 -0.4448471 -1.1527060 -1.1297975 -0.2022403 -1.8280509 0.6367586
-0.7887022 0.3393182 0.3015114 0.7558728 2.0316930 -0.8701562 2.5998184 0.2857065 -0.6824938 -0.3778499
0.9828405 1.0601933 -0.6213988 0.6252994 0.4528203 -0.2672204 1.7502076 1.0664905 1.2413246 -1.0269871
1.1392755 -0.6403919 -0.7098186 -0.0571746 0.8228862 -0.9367731 0.5967817 -1.4723516 1.0407722 0.2765098
-0.3998534 1.5917070 0.9286490 1.4771025 0.3084402 0.2632702 1.2397158 0.7226082 -2.2068162 0.8098998

Train Test Split

> from sklearn.model_selection import train_test_split
> X_train, X_test, y_train, y_test = train_test_split(
+     scaled_features,df['TARGET CLASS'],
+         test_size=0.30, random_state=101)

Using KNN

Start with K=1

> from sklearn.neighbors import KNeighborsClassifier
> knn = KNeighborsClassifier(n_neighbors=1)
> knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
> pred = knn.predict(X_test)
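
As an optional sanity check (not part of the original workflow), kneighbors() returns, for each test point, the distance to and the index of the single training observation that decides its class when n_neighbors=1:

> dist, idx = knn.kneighbors(X_test[:3])
+ print(dist)   # distance from each of the first 3 test points to its nearest training point
+ print(idx)    # row index of that training point in X_train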

Predictions and Evaluations

> from sklearn.metrics import classification_report,confusion_matrix
> print(confusion_matrix(y_test,pred))
[[151   8]
 [ 15 126]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300
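
A quick sketch (not in the original notebook) for reading the confusion matrix: rows are true classes, columns are predicted classes, so the counts can be unpacked and used to recompute the accuracy shown in the report:

> tn, fp, fn, tp = confusion_matrix(y_test,pred).ravel()
+ print(tn, fp, fn, tp)                    # 151 8 15 126
+ print((tn + tp) / (tn + fp + fn + tp))   # ~0.92, matching the report above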

Choosing a K Value

> error_rate = []
+ 
+ for i in range(1,40):
+     
+     knn = KNeighborsClassifier(n_neighbors=i)
+     knn.fit(X_train,y_train)
+     pred_i = knn.predict(X_test)
+     error_rate.append(np.mean(pred_i != y_test))
> plt.figure(figsize=(10,6))
+ plt.plot(range(1,40), error_rate, color='blue',
+          linestyle='dashed', marker='o',
+          markerfacecolor='red', markersize=10)
+ plt.title('Error Rate vs. K Value')
+ plt.xlabel('K')
+ plt.ylabel('Error Rate')
+ plt.show()

Here we can see that after around K > 12 the error rate doesn't get much lower. We can retrain the model with K=12 and check the classification report.
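
As a side note, the elbow can also be read off programmatically from the error_rate list computed above (a small sketch that simply picks the K with the lowest test error):

> best_k = int(np.argmin(error_rate)) + 1   # +1 because range(1,40) starts at K=1
+ print(best_k, error_rate[best_k - 1])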

K=1

> # FIRST, A QUICK COMPARISON TO THE ORIGINAL K=1
+ knn = KNeighborsClassifier(n_neighbors=1)
+ 
+ knn.fit(X_train,y_train);
+ pred = knn.predict(X_test)
+ 
+ 
+ print(confusion_matrix(y_test,pred))
[[151   8]
 [ 15 126]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

K=12

> # NOW WITH K=12
+ knn = KNeighborsClassifier(n_neighbors=12)
+ 
+ knn.fit(X_train,y_train);
+ pred = knn.predict(X_test)
+ 
+ print(confusion_matrix(y_test,pred))
[[155   4]
 [ 10 131]]
> print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       159
           1       0.97      0.93      0.95       141

    accuracy                           0.95       300
   macro avg       0.95      0.95      0.95       300
weighted avg       0.95      0.95      0.95       300