K Means Clustering


K Means Clustering is an unsupervised learning algorithm that groups data based on their similarity.

First, we specify the number of clusters (data groups), K. Next, the algorithm randomly assigns each observation to a cluster and computes the centroid of each cluster. Then the algorithm iterates through two steps:

  1. Reassign each data point to the cluster whose centroid is closest.
  2. Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
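The two steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch` and its parameters are made up for this example, it starts from a random assignment as described above (scikit-learn's `KMeans` defaults to `k-means++` initialization instead), and it stops as soon as no point changes cluster.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Toy K-means: random assignment, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    # Randomly assign each observation to one of the k clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Centroid of each cluster = mean of its points.
        # An empty cluster is re-seeded with a randomly chosen data point.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign each point to the cluster with the closest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
    # Within-cluster variation: squared distance of each point to its centroid.
    wcss = sq_dists[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss

# Two small, well-separated groups of hypothetical points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print(labels, wcss)
```

The result depends on the random starting assignment, which is one reason scikit-learn runs several initializations (`n_init`) and keeps the best one.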

> import seaborn as sns
+ import matplotlib.pyplot as plt
+ import pandas as pd
+ import numpy as np

Artificial Data

> from sklearn.datasets import make_blobs
> # Create Artificial Data
+ data = make_blobs(n_samples=200, n_features=2, 
+                   centers=4, cluster_std=1.8,
+                   random_state=101)
> data[0].shape
(200, 2)
> #cluster number
+ data[1]
array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
       0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 2, 1, 3, 3, 3,
       0, 3, 3, 0, 1, 2, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
       1, 2, 2, 1, 2, 0, 1, 3, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
       2, 1, 1, 1, 1, 3, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
       0, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
       0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
       0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 3, 0, 3,
       2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
       1, 0])

Visualize the Data

> plt.figure(figsize=(8,6))
+ sns.set_style('darkgrid')
+ plt.scatter(data[0][:,0],data[0][:,1],
+             c=data[1],cmap='rainbow');
+ plt.show()

Creating the Clusters

> from sklearn.cluster import KMeans
> kmeans = KMeans(n_clusters=4)
+ kmeans.fit(data[0])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
> kmeans.cluster_centers_
array([[-0.0123077 ,  2.13407664],
       [-9.46941837, -6.56081545],
       [ 3.71749226,  7.01388735],
       [-4.13591321,  7.95389851]])
> kmeans.labels_
array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
       0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 0, 1, 3, 3, 3,
       0, 2, 3, 0, 1, 0, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
       1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
       0, 1, 1, 1, 1, 0, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
       2, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
       0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
       0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 2, 0, 3,
       2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
       1, 0])

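The colors may differ between the two plots because K Means numbers its clusters arbitrarily, but we can still see how the points are grouped. Some groups overlap in the original plot, whereas K Means assigns every point to the cluster with the nearest centroid, so its clusters are fully separated.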

> fig, (ax1, ax2) = plt.subplots(1, 2, 
+             sharey=True,figsize=(10,6));
+ ax1.set_title('K Means')
+ ax1.scatter(data[0][:,0],data[0][:,1],
+             c=kmeans.labels_,cmap='rainbow');
+ ax2.set_title("Original")
+ ax2.scatter(data[0][:,0],data[0][:,1],
+             c=data[1],cmap='rainbow');
+ plt.show()
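Because K Means numbers its clusters arbitrarily, cluster id 0 need not correspond to original group 0, which is why the colors can differ between the two panels. One way to compare the two labelings is to try every renumbering of the cluster ids and keep the one that agrees best. The sketch below does this by brute force on made-up labels; `best_label_match` is a hypothetical helper written for this illustration, and the approach is only practical for a small number of clusters (4 clusters means 24 permutations).

```python
from itertools import permutations

def best_label_match(true_labels, cluster_labels, k):
    """Return the best agreement and the cluster-id -> true-label mapping."""
    best_acc, best_map = -1.0, None
    for perm in permutations(range(k)):
        # perm[c] is the true label that cluster id c is mapped to.
        hits = sum(perm[c] == t for t, c in zip(true_labels, cluster_labels))
        acc = hits / len(true_labels)
        if acc > best_acc:
            best_acc, best_map = acc, dict(enumerate(perm))
    return best_acc, best_map

# Hypothetical labels: same grouping, different numbering, one disagreement.
true = [0, 0, 1, 1, 2, 2]
found = [2, 2, 0, 0, 1, 0]
acc, mapping = best_label_match(true, found, k=3)
print(acc, mapping)  # 5 of 6 points agree under the mapping {0: 1, 1: 2, 2: 0}
```

With the blobs above, the same helper could be called as `best_label_match(data[1], kmeans.labels_, k=4)`. For many clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the contingency table would be a more scalable choice.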

Example


We’ll attempt to use KMeans to cluster universities into two groups, Private and Public.

Although we actually have the labels for this data set, we will NOT use them for the KMeans clustering algorithm, since it is an unsupervised learning algorithm.

Under normal circumstances you would use KMeans precisely because you don’t have labels. In this case we will use the labels only afterwards, to get an idea of how well the algorithm performed.

The Data

We will use a data frame with 777 observations on the following 18 variables.

  • Private - A factor with levels No and Yes indicating private or public university
  • Apps - Number of applications received
  • Accept - Number of applications accepted
  • Enroll - Number of new students enrolled
  • Top10perc - Pct. new students from top 10% of H.S. class
  • Top25perc - Pct. new students from top 25% of H.S. class
  • F.Undergrad - Number of full-time undergraduates
  • P.Undergrad - Number of part-time undergraduates
  • Outstate - Out-of-state tuition
  • Room.Board - Room and board costs
  • Books - Estimated book costs
  • Personal - Estimated personal spending
  • PhD - Pct. of faculty with Ph.D.’s
  • Terminal - Pct. of faculty with terminal degree
  • S.F.Ratio - Student/faculty ratio
  • perc.alumni - Pct. alumni who donate
  • Expend - Instructional expenditure per student
  • Grad.Rate - Graduation rate
> df = pd.read_csv('College_Data',index_col=0)
> df.head()
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate
Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440
Adelphi University                Yes  2186    1924     512         16         29         2683         1227     12280
Adrian College                    Yes  1428    1097     336         22         50         1036           99     11250
Agnes Scott College               Yes   417     349     137         60         89          510           63     12960
Alaska Pacific University         Yes   193     146      55         16         44          249          869      7560

                              Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University                  6450    750      1500   29        30       12.2           16   10527         56
Adrian College                      3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College                 5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University           4120    800      1500   76        72       11.9            2   10922         15
> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private        777 non-null object
Apps           777 non-null int64
Accept         777 non-null int64
Enroll         777 non-null int64
Top10perc      777 non-null int64
Top25perc      777 non-null int64
F.Undergrad    777 non-null int64
P.Undergrad    777 non-null int64
Outstate       777 non-null int64
Room.Board     777 non-null int64
Books          777 non-null int64
Personal       777 non-null int64
PhD            777 non-null int64
Terminal       777 non-null int64
S.F.Ratio      777 non-null float64
perc.alumni    777 non-null int64
Expend         777 non-null int64
Grad.Rate      777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB
> df.describe()
            Apps     Accept     Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad   Outstate  Room.Board
count    777.000    777.000   777.0000  777.00000  777.00000      777.000     777.0000    777.000     777.000
mean    3001.638   2018.804   779.9730   27.55856   55.79665     3699.907     855.2986  10440.669    4357.526
std     3870.201   2451.114   929.1762   17.64036   19.80478     4850.421    1522.4319   4023.016    1096.696
min       81.000     72.000    35.0000    1.00000    9.00000      139.000       1.0000   2340.000    1780.000
25%      776.000    604.000   242.0000   15.00000   41.00000      992.000      95.0000   7320.000    3597.000
50%     1558.000   1110.000   434.0000   23.00000   54.00000     1707.000     353.0000   9990.000    4200.000
75%     3624.000   2424.000   902.0000   35.00000   69.00000     4005.000     967.0000  12925.000    5050.000
max    48094.000  26330.000  6392.0000   96.00000  100.00000    31643.000   21836.0000  21700.000    8124.000

           Books   Personal        PhD   Terminal   S.F.Ratio  perc.alumni     Expend  Grad.Rate
count   777.0000   777.0000  777.00000  777.00000  777.000000    777.00000    777.000  777.00000
mean    549.3810  1340.6422   72.66023   79.70270   14.089704     22.74389   9660.171   65.46332
std     165.1054   677.0715   16.32815   14.72236    3.958349     12.39180   5221.768   17.17771
min      96.0000   250.0000    8.00000   24.00000    2.500000      0.00000   3186.000   10.00000
25%     470.0000   850.0000   62.00000   71.00000   11.500000     13.00000   6751.000   53.00000
50%     500.0000  1200.0000   75.00000   82.00000   13.600000     21.00000   8377.000   65.00000
75%     600.0000  1700.0000   85.00000   92.00000   16.500000     31.00000  10830.000   78.00000
max    2340.0000  6800.0000  103.00000  100.00000   39.800000     64.00000  56233.000  118.00000

Exploratory Analysis

Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private column.

> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Room.Board',y='Grad.Rate',data=df, hue='Private',
+            height=6,aspect=1,fit_reg=False);
+ plt.show()

Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.

> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Outstate',y='F.Undergrad',data=df, hue='Private',
+            height=6,aspect=1,fit_reg=False);
+ plt.show()

Create overlaid histograms of out-of-state tuition, one for each value of the Private column.

> sns.set_style('darkgrid')
+ g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Outstate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Create a similar histogram for the Grad.Rate column.

> g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Notice that there seems to be a private school with a graduation rate higher than 100%.

> df[df['Grad.Rate'] > 100].loc[:,['Private','Grad.Rate']]
                  Private  Grad.Rate
Cazenovia College     Yes        118

Set that school’s graduation rate to 100 so it makes sense.

> # Use .loc to avoid chained assignment, which may not modify df
+ df.loc['Cazenovia College','Grad.Rate'] = 100
> df['Grad.Rate'].max()
100
> g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Cluster Creation

> kmeans = KMeans(n_clusters=2)

Fit the model to all the data except for the Private label.

> kmeans.fit(df.drop('Private',axis=1))
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

What are the cluster center vectors?

> kmeans.cluster_centers_
array([[1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
        7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
        4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
        9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
        6.75925926e+01],
       [1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
        5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
        4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
        7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
        6.50926756e+01]])
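The raw array is hard to read because the coordinates are not labeled. A small sketch, assuming pandas is available, is to wrap the centers in a DataFrame with the feature names as columns. To keep the example self-contained it hard-codes only the first three coordinates of each center from the output above (Apps, Accept, Enroll, following the column order of `df` once Private is dropped).

```python
import numpy as np
import pandas as pd

# First three coordinates of the two centers printed above.
centers = np.array([
    [1.03631389e+04, 6.55089815e+03, 2.56972222e+03],
    [1.81323468e+03, 1.28716592e+03, 4.91044843e+02],
])
columns = ['Apps', 'Accept', 'Enroll']

# One row per cluster, one labeled column per feature.
centers_df = pd.DataFrame(centers, columns=columns).round(1)
centers_df.index.name = 'cluster'
print(centers_df)
```

With the fitted model, the equivalent call would be `pd.DataFrame(kmeans.cluster_centers_, columns=df.drop('Private',axis=1).columns)`. Read this way, the two centers differ most in the size-related columns (roughly 10,363 versus 1,813 applications), while their Outstate values are close, which suggests the clusters mostly separate large schools from small ones.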

Evaluation

There is no perfect way to evaluate clustering if you don’t have the labels. However, in this instance we do have the labels, so we’ll take advantage of this to evaluate our clusters.

Create a new column for df called Cluster, which is a 1 for a Private school, and a 0 for a Public school.

> def converter(cluster):
+     if cluster=='Yes':
+         return 1
+     else:
+         return 0
> df['Cluster'] = df['Private'].apply(converter)
> df.head()
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board
Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440        3300
Adelphi University                Yes  2186    1924     512         16         29         2683         1227     12280        6450
Adrian College                    Yes  1428    1097     336         22         50         1036           99     11250        3750
Agnes Scott College               Yes   417     349     137         60         89          510           63     12960        5450
Alaska Pacific University         Yes   193     146      55         16         44          249          869      7560        4120

                              Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate  Cluster
Abilene Christian University    450      2200   70        78       18.1           12    7041         60        1
Adelphi University              750      1500   29        30       12.2           16   10527         56        1
Adrian College                  400      1165   53        66       12.9           30    8735         54        1
Agnes Scott College             450       875   92        97        7.7           37   19016         59        1
Alaska Pacific University       800      1500   76        72       11.9            2   10922         15        1
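The `converter` function with `apply` works, but the same 0/1 column can also be built without a helper function. The sketch below compares three equivalent ways on a tiny made-up Series (the index values A, B, C are placeholders, not rows from the college data).

```python
import pandas as pd

# Hypothetical stand-in for df['Private'].
private = pd.Series(['Yes', 'No', 'Yes'], index=['A', 'B', 'C'], name='Private')

# Row-by-row function call, as in the converter approach.
via_apply = private.apply(lambda v: 1 if v == 'Yes' else 0)
# Vectorized comparison, then cast the booleans to integers.
via_compare = (private == 'Yes').astype(int)
# Explicit lookup table; values missing from the dict would become NaN.
via_map = private.map({'Yes': 1, 'No': 0})

print(via_compare.tolist())
```

The comparison form is usually the shortest, while the `map` form makes the encoding explicit and surfaces unexpected values as NaN rather than silently turning them into 0.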

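Create a confusion matrix and classification report to see how well the KMeans clustering worked without being given any labels. Keep in mind that KMeans numbers its clusters arbitrarily, so cluster 1 is not guaranteed to correspond to Private; if the numbering had come out reversed, the two cluster labels would need to be swapped before comparing.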

> from sklearn.metrics import confusion_matrix,classification_report
> confu = confusion_matrix(df['Cluster'],kmeans.labels_)
> pd.DataFrame(confu,index=['Actual 0','Actual 1'],
+             columns=['Predicted 0','Predicted 1'])
          Predicted 0  Predicted 1
Actual 0           74          138
Actual 1           34          531
> print(classification_report(df['Cluster'],kmeans.labels_))
              precision    recall  f1-score   support

           0       0.69      0.35      0.46       212
           1       0.79      0.94      0.86       565

    accuracy                           0.78       777
   macro avg       0.74      0.64      0.66       777
weighted avg       0.76      0.78      0.75       777
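The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python from the four counts shown above, and also shows the accuracy that would result if the cluster numbering were read the other way round.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
cm = [[74, 138],
      [34, 531]]

total = sum(map(sum, cm))
# Fraction of schools on the diagonal (cluster id equals the 0/1 label).
accuracy = (cm[0][0] + cm[1][1]) / total
# Precision: correct predictions of class i over all predictions of class i.
precision = [cm[i][i] / (cm[0][i] + cm[1][i]) for i in range(2)]
# Recall: correct predictions of class i over all actual members of class i.
recall = [cm[i][i] / sum(cm[i]) for i in range(2)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
# Accuracy if cluster 0 were interpreted as Private and cluster 1 as Public.
flipped_accuracy = (cm[0][1] + cm[1][0]) / total

print(round(accuracy, 2))                 # 0.78
print([round(p, 2) for p in precision])   # [0.69, 0.79]
print([round(r, 2) for r in recall])      # [0.35, 0.94]
print([round(f, 2) for f in f1])          # [0.46, 0.86]
print(round(flipped_accuracy, 2))         # 0.22
```

The low recall for class 0 reflects the 138 public schools that landed in the cluster dominated by private schools.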