You’ve been given a classified data set from a company. They’ve hidden the feature column names but have given you the data and the target classes. Use KNN to build a model that predicts the class of a new data point directly from its features.
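First we need the data in a DataFrame. A minimal sketch of the loading step, assuming the data ships as a CSV; the file name 'Classified Data' and the use of the first column as the row index are assumptions:

```python
import pandas as pd

# Hypothetical file name; the first column is treated as the row index
df = pd.read_csv('Classified Data', index_col=0)
df.head()
```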
|   | WTT | PTI | EQW | SBI | LQE | QWG | FDJ | PJF | HQE | NXJ | TARGET CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.9139173 | 1.1620727 | 0.5679459 | 0.7554639 | 0.7808616 | 0.3526077 | 0.7596969 | 0.6437976 | 0.8794221 | 1.231409 | 1 |
| 1 | 0.6356319 | 1.0037216 | 0.5353424 | 0.8256448 | 0.9241089 | 0.6484502 | 0.6753343 | 1.0135460 | 0.6215522 | 1.492702 | 0 |
| 2 | 0.7213598 | 1.2014926 | 0.9219897 | 0.8555950 | 1.5266294 | 0.7207809 | 1.6263507 | 1.1544831 | 0.9578770 | 1.285597 | 0 |
| 3 | 1.2342044 | 1.3867263 | 0.6530463 | 0.8256244 | 1.1425035 | 0.8751279 | 1.4097081 | 1.3800025 | 1.5226920 | 1.153093 | 1 |
| 4 | 1.2794908 | 0.9497496 | 0.6272800 | 0.6689761 | 1.2325373 | 0.7037274 | 1.1155955 | 0.6466907 | 1.4638118 | 1.419167 | 1 |
| 5 | 0.8339278 | 1.5233023 | 1.1047427 | 1.0211390 | 1.1073771 | 1.0109295 | 1.2795375 | 1.2806766 | 0.5103503 | 1.528044 | 0 |
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
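As a quick toy illustration (these numbers are made up, not from the data set), a feature on a large scale swamps the Euclidean distance:

```python
import numpy as np

# Two points: the first feature is on a scale ~1000, the second ~1
a = np.array([1000.0, 0.1])
b = np.array([1010.0, 0.9])

# Raw distance is dominated by the large-scale feature
print(np.linalg.norm(a - b))      # ~10.03; the 0.8 gap in feature 2 barely registers

# After dividing each feature by its rough scale, feature 2 dominates instead
a_s = a / np.array([1000.0, 1.0])
b_s = b / np.array([1000.0, 1.0])
print(np.linalg.norm(a_s - b_s))  # ~0.80
```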
To fix this, we standardize the features: remove the mean and scale to unit variance.
The standard score of a sample x is calculated as:
\[z = \frac{x - \mu}{\sigma}\]
where \(\mu\) is the mean of the training samples and \(\sigma\) is the standard deviation of the training samples.
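The scaling itself is a standard scikit-learn step; a minimal sketch, assuming the raw data lives in the DataFrame `df` from the loading step above, with the label in the 'TARGET CLASS' column:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the feature columns only (everything except the label)
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))

# Transform the features and rebuild a DataFrame with the original column names
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
```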
The `fit` call echoes the scaler’s parameters:

```
StandardScaler(copy=True, with_mean=True, with_std=True)
```

The head of the scaled feature DataFrame:

```python
df_feat.head()
```
| WTT | PTI | EQW | SBI | LQE | QWG | FDJ | PJF | HQE | NXJ |
|---|---|---|---|---|---|---|---|---|---|
| -0.1235419 | 0.1859075 | -0.9134307 | 0.3196291 | -1.0336368 | -2.3083747 | -0.7989513 | -1.4823681 | -0.9497194 | -0.6433142 |
| -1.0848360 | -0.4303484 | -1.0253133 | 0.6253883 | -0.4448471 | -1.1527060 | -1.1297975 | -0.2022403 | -1.8280509 | 0.6367586 |
| -0.7887022 | 0.3393182 | 0.3015114 | 0.7558728 | 2.0316930 | -0.8701562 | 2.5998184 | 0.2857065 | -0.6824938 | -0.3778499 |
| 0.9828405 | 1.0601933 | -0.6213988 | 0.6252994 | 0.4528203 | -0.2672204 | 1.7502076 | 1.0664905 | 1.2413246 | -1.0269871 |
| 1.1392755 | -0.6403919 | -0.7098186 | -0.0571746 | 0.8228862 | -0.9367731 | 0.5967817 | -1.4723516 | 1.0407722 | 0.2765098 |
| -0.3998534 | 1.5917070 | 0.9286490 | 1.4771025 | 0.3084402 | 0.2632702 | 1.2397158 | 0.7226082 | -2.2068162 | 0.8098998 |
Start with K=1
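A sketch of the split-and-fit step, assuming the usual scikit-learn workflow; the 70/30 split is an assumption that matches the test support of 300 in the reports below, and the random_state value is an arbitrary choice for reproducibility:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Hold out 30% of the data for testing (assumption: this yields the
# 300-row test set seen in the reports below)
X_train, X_test, y_train, y_test = train_test_split(
    df_feat, df['TARGET CLASS'], test_size=0.30, random_state=101)

# Fit a 1-nearest-neighbor classifier and evaluate it on the held-out set
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```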
The fitted model’s parameters and its performance on the test set:

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
```

```
[[151   8]
 [ 15 126]]
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300
```
Let’s try to pick a better K value: fit a model for a range of K values and plot the test error rate for each.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fit a KNN model for each K from 1 to 39 and record its test error rate
error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

# Plot error rate against K to look for the "elbow"
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue',
         linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
```

Here we can see that after around K>12 the error rate doesn’t go much lower. We can retrain the model with a K in that range and check the classification report.
```python
# FIRST A QUICK COMPARISON TO THE ORIGINAL K=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```

```
[[151   8]
 [ 15 126]]
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300
```
```python
# NOW WITH K=12
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```

```
[[155   4]
 [ 10 131]]
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       159
           1       0.97      0.93      0.95       141

    accuracy                           0.95       300
   macro avg       0.95      0.95      0.95       300
weighted avg       0.95      0.95      0.95       300
```

Retraining with K=12 lifts accuracy from 0.92 to 0.95, with fewer misclassifications in both classes.