k-Nearest Neighbor (k-NN)
Introduction
The k-Nearest Neighbor (k-NN) algorithm is a supervised learning method that can be used for both classification and regression. In the medical domain, it can be particularly useful for tasks such as predicting whether a patient has a certain disease based on their medical history and test results.
Example Dataset
Consider the following dataset of patients with their Age, Blood Pressure, and Cholesterol levels. The goal is to classify whether a patient has a disease (Yes or No).
| Age | Blood Pressure | Cholesterol | Disease (Label) |
|---|---|---|---|
| 45 | High | High | Yes |
| 30 | Normal | Normal | No |
| 50 | High | Normal | Yes |
| 35 | Low | Normal | No |
| 60 | High | High | Yes |
| 25 | Low | Normal | No |
Goal
We want to predict whether a new patient, based on their age, blood pressure, and cholesterol level, is likely to have the disease.
Let’s say we have a new patient with the following features:
- Age: 40
- Blood Pressure: Normal
- Cholesterol: High
Steps of the k-NN Algorithm in This Case
1. Choose k:
Let’s assume k = 3, which means we will look at the 3 closest neighbors to make our prediction.
2. Compute the Distance:
To classify the new patient, we need to compute the distance between the new patient and every patient in the dataset. One commonly used metric is Euclidean distance. Because blood pressure and cholesterol are categorical, we first convert them into numerical values. This example uses a binary encoding: High = 1, and both Normal and Low = 0.
Here’s how we might encode the dataset numerically:
| Age | Blood Pressure (Encoded) | Cholesterol (Encoded) | Disease (Label) |
|---|---|---|---|
| 45 | 1 | 1 | Yes |
| 30 | 0 | 0 | No |
| 50 | 1 | 0 | Yes |
| 35 | 0 | 0 | No |
| 60 | 1 | 1 | Yes |
| 25 | 0 | 0 | No |
The new patient would be encoded as:
- Age: 40
- Blood Pressure: Normal (encoded as 0)
- Cholesterol: High (encoded as 1)
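The encoding step can be sketched in Python as follows. This is a minimal illustration that assumes the binary scheme used in the encoded table (High = 1; Normal and Low = 0). The names `LEVEL_CODES` and `encode_patient` are hypothetical helpers, not part of any library.

```python
# Binary encoding assumed from the encoded table: High -> 1, Normal/Low -> 0.
LEVEL_CODES = {"High": 1, "Normal": 0, "Low": 0}

def encode_patient(age, blood_pressure, cholesterol):
    """Return a numeric feature vector [age, bp_code, chol_code]."""
    return [age, LEVEL_CODES[blood_pressure], LEVEL_CODES[cholesterol]]

new_patient = encode_patient(40, "Normal", "High")
print(new_patient)  # [40, 0, 1]
```

Mapping Low and Normal to the same code discards the difference between them. An ordinal or one-hot encoding would be a reasonable alternative if that difference matters.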
The Euclidean distance formula for two points p and q in n-dimensional space is:
\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]
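As a rough sketch, the formula translates almost directly into Python. The function name `euclidean_distance` is an illustrative choice, and the length check is an added safeguard rather than part of the formula.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of features")
    # Sum the squared per-feature differences, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```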
For instance, to calculate the distance between the new patient and the first patient in the dataset:
\[ d = \sqrt{(40 - 45)^2 + (0 - 1)^2 + (1 - 1)^2} = \sqrt{(-5)^2 + (-1)^2 + 0^2} = \sqrt{25 + 1} = \sqrt{26} \approx 5.10 \]
Repeat this for all patients in the dataset.
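Repeating the calculation for every row is easy to automate. The sketch below assumes the encoded table above and uses the standard-library `math.dist` (available in Python 3.8 and later). The variable names are illustrative.

```python
import math

# Encoded rows from the table above: (age, blood pressure, cholesterol).
patients = [
    (45, 1, 1),
    (30, 0, 0),
    (50, 1, 0),
    (35, 0, 0),
    (60, 1, 1),
    (25, 0, 0),
]
new_patient = (40, 0, 1)

# math.dist computes the Euclidean distance between two points.
distances = [math.dist(new_patient, row) for row in patients]

for number, d in enumerate(distances, start=1):
    print(f"Patient {number}: {d:.2f}")
```

The first printed line reproduces the hand calculation for Patient 1 (about 5.10).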
3. Identify the Neighbors:
After calculating the distances, select the 3 closest patients (smallest distances) to the new patient. With the encoding above, the distances are approximately 5.10 (Patient 1), 10.05 (Patient 2), 10.10 (Patient 3), 5.10 (Patient 4), 20.02 (Patient 5), and 15.03 (Patient 6), so the 3 nearest neighbors are:
- Patient 1: Disease = Yes
- Patient 4: Disease = No
- Patient 2: Disease = No
4. Vote for the Label:
The new patient’s label is determined by a majority vote from the labels of the 3 nearest neighbors. In this case:
- 1 out of 3 neighbors has “Yes” for the disease label.
- 2 out of 3 neighbors have “No” for the disease label.
Therefore, the new patient is classified as “No” (i.e., the patient likely does not have the disease).
5. Make Predictions:
Based on the votes, the model predicts that the new patient is unlikely to have the disease.
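Steps 3 to 5 can be sketched together in plain Python. This is a minimal illustration on the encoded table with unscaled Euclidean distance. The function `knn_predict` is a hypothetical helper, and breaking distance ties by table order is a simplifying assumption.

```python
import math
from collections import Counter

# (encoded features, label) pairs taken from the encoded table.
training = [
    ((45, 1, 1), "Yes"),
    ((30, 0, 0), "No"),
    ((50, 1, 0), "Yes"),
    ((35, 0, 0), "No"),
    ((60, 1, 1), "Yes"),
    ((25, 0, 0), "No"),
]
new_patient = (40, 0, 1)
k = 3

def knn_predict(training, query, k):
    """Return (predicted label, indices of the k nearest rows)."""
    # Sort row indices by distance to the query; sorted() is stable,
    # so rows at equal distance keep their table order.
    order = sorted(
        range(len(training)),
        key=lambda i: math.dist(query, training[i][0]),
    )
    nearest = order[:k]
    # Majority vote over the labels of the k nearest rows.
    votes = Counter(training[i][1] for i in nearest)
    label, _ = votes.most_common(1)[0]
    return label, nearest

label, nearest = knn_predict(training, new_patient, k)
print("Nearest patients:", [i + 1 for i in nearest])
print("Predicted label:", label)
```

Choosing an odd k for a two-class problem avoids tied votes, which is one reason k = 3 is convenient here.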
Key Considerations
- Choosing k: The value of k affects the prediction. A small k (e.g., 1) might lead to a noisy model, while a larger k may smooth over finer details.
- Feature Scaling: Features such as age and blood pressure should be normalized or standardized to avoid bias in the distance calculation, since features with larger ranges could dominate the distance computation.
- Handling Categorical Features: For non-numerical features like “Blood Pressure” and “Cholesterol,” we need to convert them into numerical representations.
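To make the feature-scaling point concrete, here is a small sketch of min-max normalization applied to the Age column. Min-max scaling is only one possible choice (standardization is another), and `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no distance information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [45, 30, 50, 35, 60, 25]
scaled_ages = min_max_scale(ages)
print([round(a, 2) for a in scaled_ages])  # [0.57, 0.14, 0.71, 0.29, 1.0, 0.0]
```

After scaling, Age spans the same 0 to 1 range as the encoded Blood Pressure and Cholesterol columns. A new patient would need to be scaled with the same minimum and maximum taken from the training data.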
Steps:
- Load the dataset from the CSV file.
- Preprocess the data (convert categorical variables to numerical values if necessary).
- Split the dataset into features (X) and labels (y).
- Use scikit-learn to apply the k-NN algorithm.
- Make predictions.
Import necessary libraries
Step 1: Load the dataset
[Download CSV file](https://docs.google.com/spreadsheets/d/1hYdBeAhgNIyXnp1QTlrEcwihyOhfmCk3jbytBf3GoyE/edit?usp=sharing)
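The workflow listed under Steps might look roughly like the sketch below. Because the column layout of the CSV file is not shown here, this sketch is an assumption-laden stand-in: it uses the six-row encoded table from the example in place of the file, and it skips a train/test split because the table is so small. It relies on scikit-learn's `KNeighborsClassifier`.

```python
from sklearn.neighbors import KNeighborsClassifier

# Encoded example table used in place of the CSV file (assumption).
# Columns: age, blood pressure (High = 1, else 0), cholesterol (High = 1, else 0).
X = [
    [45, 1, 1],
    [30, 0, 0],
    [50, 1, 0],
    [35, 0, 0],
    [60, 1, 1],
    [25, 0, 0],
]
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

# k = 3; the default metric is Minkowski with p=2, i.e. Euclidean distance.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

new_patient = [[40, 0, 1]]
prediction = model.predict(new_patient)[0]

# kneighbors returns the distances to, and row indices of, the k nearest rows.
distances, indices = model.kneighbors(new_patient)
print("Prediction:", prediction)
print("Neighbor rows (0-based):", indices[0].tolist())
```

With real data, the features would normally be scaled before fitting, and part of the data would be held out to evaluate the model.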