Clustering Universities: A K-Means Clustering Approach in Python

Independent Data Analysis Project

Author

John Karuitha, PhD

Published

November 19, 2024

Modified

November 19, 2024

Executive Summary

This project demonstrates the application of K-Means Clustering, an unsupervised learning algorithm, to classify universities into two groups: Private and Public. Although K-Means typically operates without labels, we leverage the true labels in the dataset for educational purposes, providing a unique opportunity to evaluate clustering performance using a classification report and confusion matrix. The analysis begins with exploratory data visualization and summary statistics to understand the dataset’s structure and relationships. We then implement the K-Means algorithm with two clusters and assess its results against the actual classifications. The findings highlight the limitations of K-Means, including its sensitivity to feature scaling and assumptions of cluster shape and size. While the clustering algorithm does not perform well in this specific context, the project underscores the importance of preprocessing, feature selection, and domain expertise in unsupervised learning. This exercise provides valuable insights into the practical application and challenges of clustering techniques in real-world datasets.

Keywords

Data Analysis, Python, Pandas, Seaborn, Numpy, Machine Learning, K-Means Clustering, Scikit-learn, Classification

1 Background

In this project, we use the K-Means Clustering algorithm to classify universities into two groups: Private and Public. K-Means is an unsupervised learning algorithm, meaning it clusters data without relying on pre-existing labels. However, for this specific analysis, we have access to the true labels of the dataset, which allows us to evaluate the clustering performance.

It is crucial to note that such evaluation is atypical in real-world clustering tasks. In practice, K-Means is used when labels are not available, making it a tool for discovering hidden patterns or natural groupings within data. Here, the inclusion of labels is strictly for educational purposes to gauge the algorithm’s accuracy using metrics like a classification report and a confusion matrix. This approach, while insightful for understanding K-Means, does not represent its standard application.


2 Key Takeaways

  • K-Means can be a powerful tool for discovering patterns in unlabeled data but is highly dependent on feature selection and scaling.
  • The evaluation metrics (classification report and confusion matrix) in this exercise provide educational value but are not applicable in real-world unsupervised learning tasks.
  • Domain expertise and preprocessing (e.g., scaling, dimensionality reduction) play a crucial role in improving clustering outcomes.

3 Data

We begin by loading a dataset of universities, which includes various metrics about each institution along with a label indicating whether it is Private or Public.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Load the university dataset
college = pd.read_csv("College_Data")

3.0.1 Data Overview

  1. Features: The dataset contains a mix of numerical and categorical variables that describe university characteristics.
  2. Target Variable: The Private column indicates whether a university is private (Yes) or public (No). This column will not be used for clustering, as K-Means does not rely on labels; its class balance is checked below.
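
For instance, the class balance of the target can be checked directly (per the summaries below, this gives 565 private and 212 public institutions):

Code
college["Private"].value_counts()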

4 Data Visualization

Visualizing the dataset helps in identifying patterns or potential clusters:

Code
# pairplot is figure-level, so set a suptitle rather than plt.title
g = sns.pairplot(college, corner=True, hue="Private", palette="mako")
g.fig.suptitle("Pair Plot of University Data", y=1.02)
plt.show()

This pair plot reveals potential groupings among universities based on features, providing a visual intuition for clustering.
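
To zoom in on a single pair of features, a scatter plot (out-of-state tuition against room-and-board costs, chosen here purely for illustration) makes the separation easier to see:

Code
sns.scatterplot(data=college, x="Outstate", y="Room.Board", hue="Private", palette="mako")
plt.title("Out-of-State Tuition vs. Room and Board")
plt.show()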


5 Data Summary

We generate summary statistics to better understand the dataset:

5.0.1 General Information

Code
college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   777 non-null    object 
 1   Private      777 non-null    object 
 2   Apps         777 non-null    int64  
 3   Accept       777 non-null    int64  
 4   Enroll       777 non-null    int64  
 5   Top10perc    777 non-null    int64  
 6   Top25perc    777 non-null    int64  
 7   F.Undergrad  777 non-null    int64  
 8   P.Undergrad  777 non-null    int64  
 9   Outstate     777 non-null    int64  
 10  Room.Board   777 non-null    int64  
 11  Books        777 non-null    int64  
 12  Personal     777 non-null    int64  
 13  PhD          777 non-null    int64  
 14  Terminal     777 non-null    int64  
 15  S.F.Ratio    777 non-null    float64
 16  perc.alumni  777 non-null    int64  
 17  Expend       777 non-null    int64  
 18  Grad.Rate    777 non-null    int64  
dtypes: float64(1), int64(16), object(2)
memory usage: 115.5+ KB

5.0.2 Categorical Variables

Code
college.describe(include="object")
                          Unnamed: 0 Private
count                            777     777
unique                           777       2
top     Abilene Christian University     Yes
freq                               1     565

5.0.3 Numerical Variables

Code
college.describe().T
             count          mean          std     min     25%     50%      75%      max
Apps         777.0   3001.638353  3870.201484    81.0   776.0  1558.0   3624.0  48094.0
Accept       777.0   2018.804376  2451.113971    72.0   604.0  1110.0   2424.0  26330.0
Enroll       777.0    779.972973   929.176190    35.0   242.0   434.0    902.0   6392.0
Top10perc    777.0     27.558559    17.640364     1.0    15.0    23.0     35.0     96.0
Top25perc    777.0     55.796654    19.804778     9.0    41.0    54.0     69.0    100.0
F.Undergrad  777.0   3699.907336  4850.420531   139.0   992.0  1707.0   4005.0  31643.0
P.Undergrad  777.0    855.298584  1522.431887     1.0    95.0   353.0    967.0  21836.0
Outstate     777.0  10440.669241  4023.016484  2340.0  7320.0  9990.0  12925.0  21700.0
Room.Board   777.0   4357.526384  1096.696416  1780.0  3597.0  4200.0   5050.0   8124.0
Books        777.0    549.380952   165.105360    96.0   470.0   500.0    600.0   2340.0
Personal     777.0   1340.642214   677.071454   250.0   850.0  1200.0   1700.0   6800.0
PhD          777.0     72.660232    16.328155     8.0    62.0    75.0     85.0    103.0
Terminal     777.0     79.702703    14.722359    24.0    71.0    82.0     92.0    100.0
S.F.Ratio    777.0     14.089704     3.958349     2.5    11.5    13.6     16.5     39.8
perc.alumni  777.0     22.743887    12.391801     0.0    13.0    21.0     31.0     64.0
Expend       777.0   9660.171171  5221.768440  3186.0  6751.0  8377.0  10830.0  56233.0
Grad.Rate    777.0     65.463320    17.177710    10.0    53.0    65.0     78.0    118.0

5.0.4 Correlation Analysis

To examine relationships between numerical variables, we create a correlation heatmap:

Code
sns.heatmap(college.corr(numeric_only=True), annot=False, cmap="Blues")
plt.title("Correlation Heatmap")
plt.show()

These exploratory steps provide insights into the dataset’s structure and relationships between variables.

6 Data Scaling

In K-Means clustering, it is critical to scale the variables; otherwise, features with larger numeric ranges dominate the distance calculations.

Code
scaler = StandardScaler()

# Drop the label and the university-name column, then standardize the rest
# (fit_transform fits the scaler and applies it in one step)
features = college.drop(columns=["Private", "Unnamed: 0"])
scaled_data = scaler.fit_transform(features)

kmeans_data = pd.DataFrame(scaled_data, columns=features.columns)
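
As a quick sanity check (illustrative only), each scaled feature should now have a mean of roughly 0 and a standard deviation of roughly 1:

Code
print(kmeans_data.mean().round(2).head())
print(kmeans_data.std().round(2).head())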

7 K-Means Clustering

7.0.1 Model Implementation

We initialize and fit the K-Means algorithm with two clusters, corresponding to the binary grouping (Private and Public universities):

Code
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(kmeans_data)
KMeans(n_clusters=2, random_state=42)
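
Beyond the cluster assignments, the fitted model exposes the cluster centroids in the scaled feature space; inspecting them (a sketch, reusing the kmeans_data frame from above) can hint at which features drive the split:

Code
centers = pd.DataFrame(kmeans.cluster_centers_, columns=kmeans_data.columns)
centers.round(2)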

7.0.2 Cluster Labels

After fitting the model, we extract the predicted cluster labels:

Code
pred = kmeans.labels_

8 Model Evaluation

8.0.1 Understanding the Limitations

Typically, clustering algorithms like K-Means are used in unsupervised settings where labels are not available. The goal is to identify natural groupings within data. In this project, we perform an artificial evaluation by comparing the predicted clusters to the actual labels (Private), which is not standard practice for unsupervised learning.

8.0.2 Creating True Labels

To evaluate the clustering, we convert the Private column into binary values:

Code
truth = [1 if x == "Yes" else 0 for x in college["Private"]]
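
An equivalent, vectorized idiom (purely a stylistic alternative) uses a boolean comparison on the column:

Code
truth = (college["Private"] == "Yes").astype(int)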

8.0.3 Classification Report and Confusion Matrix

We generate evaluation metrics to assess the clustering performance:

Code
print(classification_report(truth, pred))
              precision    recall  f1-score   support

           0       0.20      0.61      0.30       212
           1       0.34      0.07      0.12       565

    accuracy                           0.22       777
   macro avg       0.27      0.34      0.21       777
weighted avg       0.30      0.22      0.17       777
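
One caveat before reading too much into these numbers: K-Means numbers its clusters arbitrarily, so cluster 0 may simply correspond to "private" rather than "public". In a two-class problem, an accuracy below 0.5 usually signals a flipped mapping; inverting the predicted labels here yields an accuracy of 1 − 0.22 ≈ 0.78:

Code
# Cluster numbering is arbitrary; re-score with the mapping inverted
flipped = 1 - pred
print(classification_report(truth, flipped))

Even after this correction, roughly 78% accuracy is only modestly better than always predicting the majority class (565/777 ≈ 73%).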

We also visualize the confusion matrix:

Code
sns.heatmap(confusion_matrix(truth, pred), annot=True, cmap="Blues", fmt=".0f")
plt.title("Confusion Matrix")
plt.xlabel("Predicted cluster")
plt.ylabel("True label (1 = private, 0 = public)")
plt.show()


9 Insights and Observations

The classification report and confusion matrix show that the K-Means clusters align only loosely with the private/public split. Even after correcting for the arbitrary cluster numbering (roughly 78% accuracy), the result is only modestly better than always predicting the majority class. This is not surprising: clustering finds the dominant axes of variation in the features, and without careful feature selection and domain expertise there is no guarantee that those axes correspond to the private/public distinction.

10 Challenges in K-Means Clustering

  1. Feature Scaling: Without scaling, certain features might dominate the clustering process.
  2. Cluster Assumptions: K-Means assumes clusters are spherical and equally sized, which may not hold true for this dataset.
  3. Domain Expertise: Identifying relevant features often requires significant domain knowledge, which impacts clustering quality.

11 Conclusion

In this project, we applied K-Means Clustering to categorize universities into private and public groups. While the algorithm is inherently unsupervised, we leveraged the true labels in this dataset to evaluate its performance, demonstrating the limitations of K-Means in this context (Kodinariya, Makwana, et al. 2013; Fisher 1936; James et al. 2013; Muddana and Vinayakam 2024).

12 Future Work

  • Explore other clustering techniques such as DBSCAN or Hierarchical Clustering, which may better handle non-spherical clusters.
  • Investigate the impact of feature scaling or dimensionality reduction using PCA on clustering performance (see the sketch after this list).
  • Apply K-Means to a dataset without labels to showcase its true use case.
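
As a minimal sketch of the PCA idea above (assuming the scaled kmeans_data frame from Section 6; illustrative, not part of the analysis itself):

Code
from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components,
# then cluster in the reduced two-dimensional space
pca = PCA(n_components=2)
reduced = pca.fit_transform(kmeans_data)
kmeans_pca = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_pca.fit(reduced)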

References

Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Kodinariya, Trupti M., Prashant R. Makwana, et al. 2013. “Review on Determining Number of Cluster in k-Means Clustering.” International Journal 1 (6): 90–95.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.