k-Nearest Neighbor (k-NN)

Introduction

The k-Nearest Neighbor (k-NN) algorithm is a supervised learning method that can be used for both classification and regression; here we use it for classification. In the medical domain, it can be particularly useful for tasks such as predicting whether a patient has a certain disease based on their medical history and test results.

Example Dataset

Consider the following dataset of patients with their Age, Blood Pressure, and Cholesterol levels. The goal is to classify whether a patient has a disease (Yes or No).

Age   Blood Pressure   Cholesterol   Disease (Label)
45    High             High          Yes
30    Normal           Normal        No
50    High             Normal        Yes
35    Low              Normal        No
60    High             High          Yes
25    Low              Normal        No

Goal

We want to predict whether a new patient, based on their age, blood pressure, and cholesterol level, is likely to have the disease.

Let’s say we have a new patient with the following features:

  • Age: 40
  • Blood Pressure: Normal
  • Cholesterol: High

Steps of the k-NN Algorithm in This Case

1. Choose k:

Let’s assume k = 3, which means we will look at the 3 closest neighbors to make our prediction. An odd value of k is a common choice for binary classification because it rules out tied votes.

2. Compute the Distance:

To classify the new patient, we need to compute the distance between the new patient and every patient in the dataset. One commonly used metric is Euclidean distance. Because blood pressure and cholesterol are categorical, we first convert them into numerical values. For this worked example we use a simple binary encoding (Low and Normal = 0, High = 1); the scikit-learn code later in this section uses a finer three-level encoding instead.

Here’s how we might encode the dataset numerically:

Age   Blood Pressure (Encoded)   Cholesterol (Encoded)   Disease (Label)
45    1                          1                       Yes
30    0                          0                       No
50    1                          0                       Yes
35    0                          0                       No
60    1                          1                       Yes
25    0                          0                       No

The new patient would be encoded as:

  • Age: 40
  • Blood Pressure: Normal (encoded as 0)
  • Cholesterol: High (encoded as 1)

The Euclidean distance formula for two points p and q in n-dimensional space is:

\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]

For instance, to calculate the distance between the new patient and the first patient in the dataset:

\[ d = \sqrt{(40 - 45)^2 + (0 - 1)^2 + (1 - 1)^2} = \sqrt{(-5)^2 + (-1)^2 + 0^2} = \sqrt{25 + 1} = \sqrt{26} \approx 5.10 \]

Repeating this for all six patients gives distances of approximately 5.10, 10.05, 10.10, 5.10, 20.02, and 15.03 for Patients 1 through 6, respectively.
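
Rather than computing each distance by hand, here is a minimal NumPy sketch of this step. The array values come from the encoded table above; the variable names are our own.

import numpy as np

# Encoded dataset: one row per patient [Age, BloodPressure, Cholesterol]
patients = np.array([[45, 1, 1],   # Patient 1: Yes
                     [30, 0, 0],   # Patient 2: No
                     [50, 1, 0],   # Patient 3: Yes
                     [35, 0, 0],   # Patient 4: No
                     [60, 1, 1],   # Patient 5: Yes
                     [25, 0, 0]])  # Patient 6: No
labels = np.array(['Yes', 'No', 'Yes', 'No', 'Yes', 'No'])

new_patient = np.array([40, 0, 1])  # Age 40, BP Normal (0), Cholesterol High (1)

# Euclidean distance from the new patient to every patient in the dataset
distances = np.sqrt(((patients - new_patient) ** 2).sum(axis=1))
print(np.round(distances, 2))   # [ 5.1  10.05 10.1   5.1  20.02 15.03]

# Indices of the k = 3 nearest neighbors (stable sort keeps ties in row order)
nearest = np.argsort(distances, kind='stable')[:3]
print(nearest + 1, labels[nearest])   # [1 4 2] ['Yes' 'No' 'No']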

3. Identify the Neighbors:

After calculating the distances, select the 3 closest patients (smallest distances) to the new patient. From the distances above, the 3 nearest neighbors are:

  • Patient 1 (distance ≈ 5.10): Disease = Yes
  • Patient 4 (distance ≈ 5.10): Disease = No
  • Patient 2 (distance ≈ 10.05): Disease = No

4. Vote for the Label:

The new patient’s label is determined by a majority vote among the labels of the 3 nearest neighbors. In this case:

  • 2 out of 3 neighbors have “No” for the disease label.
  • 1 out of 3 neighbors has “Yes” for the disease label.

Therefore, the new patient is classified as “No” (i.e., the patient likely does not have the disease). Note how close the call is: Patient 3 (distance ≈ 10.10, Disease = Yes) only just misses the neighborhood. The distances are dominated by Age, which has a far larger range than the encoded features; this is exactly the scaling issue discussed under Key Considerations below.
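
The vote itself is one line of code; here is a minimal sketch using Python’s collections.Counter, with the neighbor labels found above:

from collections import Counter

neighbor_labels = ['Yes', 'No', 'No']  # labels of the 3 nearest neighbors
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)                      # 'No'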

5. Make Predictions:

Based on the votes, the model predicts that the new patient is unlikely to have the disease.

Key Considerations

  • Choosing k: The value of k affects the prediction. A small k (e.g., 1) is sensitive to noise in the training data, while a larger k smooths over finer details; in practice, k is often chosen by cross-validation (see the sketch after this list).
  • Feature Scaling: Features should be normalized or standardized before computing distances; otherwise features with larger ranges (here, Age) dominate the distance computation, as seen in the worked example.
  • Handling Categorical Features: Non-numerical features like “Blood Pressure” and “Cholesterol” need a numerical representation. An ordinal encoding is reasonable when the categories are ordered (Low < Normal < High), while one-hot encoding is safer for unordered categories.
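
The first two points can be sketched together. Below is a minimal example, reusing the six encoded patients from the worked example: a scikit-learn Pipeline standardizes the features before k-NN, and 3-fold cross-validation compares two values of k. On a dataset this small the scores are not meaningful; the sketch only shows the mechanics.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# The six encoded patients from the worked example (1 = Yes, 0 = No)
X = np.array([[45, 1, 1], [30, 0, 0], [50, 1, 0],
              [35, 0, 0], [60, 1, 1], [25, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

for k in (1, 3):
    # Standardize features so Age does not dominate, then apply k-NN
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=3)
    print(f'k={k}: mean CV accuracy = {scores.mean():.2f}')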

Python Implementation with scikit-learn

Steps:

  • Load the dataset from the CSV file.
  • Preprocess the data (convert categorical variables to numerical values if necessary).
  • Split the dataset into features (X) and labels (y).
  • Use scikit-learn to apply the k-NN algorithm.
  • Make predictions.

Import necessary libraries

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Step 1: Load the dataset

# Assume the CSV file is named 'medical_data.csv'
# The dataset should have columns like Age, BloodPressure, Cholesterol, and Disease
df = pd.read_csv('medical_data.csv')

Step 2: Preprocess the data (Encode categorical features if needed)

# BloodPressure and Cholesterol are categorical, so encode them as ordered integers.
# Note: this three-level encoding (Low=0, Normal=1, High=2) is finer than the
# simplified binary encoding used in the worked example above.
df['BloodPressure'] = df['BloodPressure'].map({'Low': 0, 'Normal': 1, 'High': 2})
df['Cholesterol'] = df['Cholesterol'].map({'Normal': 0, 'High': 1})
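
One caveat: pandas’ map() returns NaN for any value it does not recognize (e.g., a different spelling in the CSV), which would silently corrupt the distance computation. A quick optional check:

# Optional sanity check: map() yields NaN for any unmapped category value
assert df[['BloodPressure', 'Cholesterol']].notna().all().all(), \
    'Unexpected category values in the CSV'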

Step 3: Split the dataset into features (X) and labels (y)


# Step 3: Split the dataset into features (X) and labels (y)
X = df[['Age', 'BloodPressure', 'Cholesterol']]  # Features
y = df['Disease'].map({'No': 0, 'Yes': 1})  # Labels (0 for No, 1 for Yes)

Step 4: Split the data into training and testing sets


# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y)  # keep the Yes/No proportions similar in both splits

Step 5: Initialize the k-NN classifier with k=3


# Step 5: Initialize the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

Step 6: Train the model


# Step 6: Train the model
knn.fit(X_train, y_train)

Step 7: Make predictions on the test set


# Step 7: Make predictions on the test set
y_pred = knn.predict(X_test)

Step 8: Evaluate the model’s accuracy


# Step 8: Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
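
Accuracy alone can be misleading on medical data where one class is rare; per-class precision and recall are often more informative. A small optional addition, assuming both classes appear in the test split:

from sklearn.metrics import classification_report

# Per-class precision/recall: more informative than accuracy alone
# when one class (e.g., Disease = Yes) is rare
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))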

Step 9: Predict for a new patient


# Step 9: Predict for a new patient
# Age=40, Blood Pressure=Normal (encoded 1), Cholesterol=High (encoded 1).
# A one-row DataFrame with the training column names avoids scikit-learn's
# feature-name warning that a plain list would trigger.
new_patient = pd.DataFrame([[40, 1, 1]], columns=['Age', 'BloodPressure', 'Cholesterol'])
prediction = knn.predict(new_patient)
print("New patient likely has disease" if prediction[0] == 1 else "New patient likely does not have disease")