K Means Clustering

K Means Clustering is an unsupervised learning algorithm that groups data based on their similarity.

First, we specify the number of clusters, or data groups. Next, the algorithm randomly assigns each observation to a cluster and computes the centroid of each cluster. Then the algorithm iterates through two steps:

1. Reassign each data point to the cluster whose centroid is closest.
2. Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
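The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the tiny one-dimensional data set are all hypothetical, and it uses the random-assignment start described in the text, whereas scikit-learn's `KMeans` defaults to `init='k-means++'`. It reports the within-cluster variation as the sum of squared Euclidean distances to the assigned centroids.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-means: random assignment, then alternate the two steps."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start by randomly assigning every observation to one of k clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Centroid of each cluster; an empty cluster is re-seeded
        # with a randomly chosen data point (an assumption of this sketch).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign each point to the cluster whose centroid is closest.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so nothing can improve
        labels = new_labels
    # Within-cluster variation: squared distance to the assigned centroid.
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss


# Hypothetical data: two tight groups on a line.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print(labels, centroids.ravel(), wcss)
```

Which group ends up as cluster 0 depends on the random start; only the grouping itself is meaningful.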

> import seaborn as sns
+ import matplotlib.pyplot as plt
+ import pandas as pd
+ import numpy as np

Artificial Data

> from sklearn.datasets import make_blobs

> # Create Artificial Data
+ data = make_blobs(n_samples=200, n_features=2, 
+                   centers=4, cluster_std=1.8,
+                   random_state=101)

> data[0].shape

(200, 2)

> #cluster number
+ data[1]

array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
       0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 2, 1, 3, 3, 3,
       0, 3, 3, 0, 1, 2, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
       1, 2, 2, 1, 2, 0, 1, 3, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
       2, 1, 1, 1, 1, 3, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
       0, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
       0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
       0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 3, 0, 3,
       2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
       1, 0])
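Since `make_blobs` returns a tuple, `data[0]` holds the features and `data[1]` the generating cluster of each sample. An equivalent way to write this, shown here only as a readability sketch with the assumed names `X` and `y`, is to unpack the tuple directly:

```python
from sklearn.datasets import make_blobs

# Same call as above, unpacked into a feature matrix and a label vector.
X, y = make_blobs(n_samples=200, n_features=2,
                  centers=4, cluster_std=1.8,
                  random_state=101)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one generating-cluster index per sample
```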

Visualize the Data

> plt.figure(figsize=(8,6))
+ sns.set_style('darkgrid')
+ plt.scatter(data[0][:,0],data[0][:,1],
+             c=data[1],cmap='rainbow');
+ plt.show()

Creating the Clusters

> from sklearn.cluster import KMeans

> kmeans = KMeans(n_clusters=4)
+ kmeans.fit(data[0])

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

> kmeans.cluster_centers_

array([[-0.0123077 ,  2.13407664],
       [-9.46941837, -6.56081545],
       [ 3.71749226,  7.01388735],
       [-4.13591321,  7.95389851]])

> kmeans.labels_

array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
       0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 0, 1, 3, 3, 3,
       0, 2, 3, 0, 1, 0, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
       1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
       0, 1, 1, 1, 1, 0, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
       2, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
       0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
       0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 2, 0, 3,
       2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
       1, 0])

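The colors may differ between the two plots because K means numbers its clusters arbitrarily, but we can still see how the points are grouped. Some groups overlap in the original plot, while K means separates them fully, because it assigns every point to exactly one cluster: the one with the nearest centroid.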

> fig, (ax1, ax2) = plt.subplots(1, 2, 
+             sharey=True,figsize=(10,6));
+ ax1.set_title('K Means')
+ ax1.scatter(data[0][:,0],data[0][:,1],
+             c=kmeans.labels_,cmap='rainbow');
+ ax2.set_title("Original")
+ ax2.scatter(data[0][:,0],data[0][:,1],
+             c=data[1],cmap='rainbow');
+ plt.show()
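Because cluster numbers carry no meaning, comparing them with the generating labels requires matching the two numbering schemes first. One simple option for a small number of clusters is to try every relabeling and keep the best one. The helper `best_label_agreement` and the six-point example below are hypothetical, written only to illustrate the idea.

```python
from itertools import permutations

import numpy as np


def best_label_agreement(true_labels, cluster_labels, k):
    """Try every renumbering of k cluster ids; return the best (mapping, agreement)."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    best_perm, best_score = None, -1.0
    for perm in permutations(range(k)):
        remapped = np.array(perm)[cluster_labels]  # cluster id i becomes perm[i]
        score = float((remapped == true_labels).mean())
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Hypothetical labels: same grouping, different numbering, one disagreement.
true = [0, 0, 1, 1, 2, 2]
found = [2, 2, 0, 0, 1, 0]
mapping, agreement = best_label_agreement(true, found, k=3)
print(mapping, agreement)
```

The search is factorial in `k`, so this brute-force approach only suits a handful of clusters, such as the four used here.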

Example

We’ll attempt to use KMeans to cluster universities into two groups, private and public.

Although we actually have the labels for this data set, we will NOT use them for the KMeans clustering algorithm, since it is an unsupervised learning algorithm.

Under normal circumstances you use the KMeans algorithm because you don’t have labels. In this case we will use the labels only to get an idea of how well the algorithm performed.

The Data

We will use a data frame with 777 observations on the following 18 variables.

Private - A factor with levels No and Yes indicating private or public university
Apps - Number of applications received
Accept - Number of applications accepted
Enroll - Number of new students enrolled
Top10perc - Pct. new students from top 10% of H.S. class
Top25perc - Pct. new students from top 25% of H.S. class
F.Undergrad - Number of fulltime undergraduates
P.Undergrad - Number of parttime undergraduates
Outstate - Out-of-state tuition
Room.Board - Room and board costs
Books - Estimated book costs
Personal - Estimated personal spending
PhD - Pct. of faculty with Ph.D.’s
Terminal - Pct. of faculty with terminal degree
S.F.Ratio - Student/faculty ratio
perc.alumni - Pct. alumni who donate
Expend - Instructional expenditure per student
Grad.Rate - Graduation rate

> df = pd.read_csv('College_Data',index_col=0)

> df.head()

	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate
Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440
Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280
Adrian College	Yes	1428	1097	336	22	50	1036	99	11250
Agnes Scott College	Yes	417	349	137	60	89	510	63	12960
Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560

	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
Abilene Christian University	3300	450	2200	70	78	18.1	12	7041	60
Adelphi University	6450	750	1500	29	30	12.2	16	10527	56
Adrian College	3750	400	1165	53	66	12.9	30	8735	54
Agnes Scott College	5450	450	875	92	97	7.7	37	19016	59
Alaska Pacific University	4120	800	1500	76	72	11.9	2	10922	15

> df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private        777 non-null object
Apps           777 non-null int64
Accept         777 non-null int64
Enroll         777 non-null int64
Top10perc      777 non-null int64
Top25perc      777 non-null int64
F.Undergrad    777 non-null int64
P.Undergrad    777 non-null int64
Outstate       777 non-null int64
Room.Board     777 non-null int64
Books          777 non-null int64
Personal       777 non-null int64
PhD            777 non-null int64
Terminal       777 non-null int64
S.F.Ratio      777 non-null float64
perc.alumni    777 non-null int64
Expend         777 non-null int64
Grad.Rate      777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB

> abc = df.describe()

> df.describe()

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board
count	777.000	777.000	777.0000	777.00000	777.00000	777.000	777.0000	777.000	777.000
mean	3001.638	2018.804	779.9730	27.55856	55.79665	3699.907	855.2986	10440.669	4357.526
std	3870.201	2451.114	929.1762	17.64036	19.80478	4850.421	1522.4319	4023.016	1096.696
min	81.000	72.000	35.0000	1.00000	9.00000	139.000	1.0000	2340.000	1780.000
25%	776.000	604.000	242.0000	15.00000	41.00000	992.000	95.0000	7320.000	3597.000
50%	1558.000	1110.000	434.0000	23.00000	54.00000	1707.000	353.0000	9990.000	4200.000
75%	3624.000	2424.000	902.0000	35.00000	69.00000	4005.000	967.0000	12925.000	5050.000
max	48094.000	26330.000	6392.0000	96.00000	100.00000	31643.000	21836.0000	21700.000	8124.000

	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
count	777.0000	777.0000	777.00000	777.00000	777.000000	777.00000	777.000	777.00000
mean	549.3810	1340.6422	72.66023	79.70270	14.089704	22.74389	9660.171	65.46332
std	165.1054	677.0715	16.32815	14.72236	3.958349	12.39180	5221.768	17.17771
min	96.0000	250.0000	8.00000	24.00000	2.500000	0.00000	3186.000	10.00000
25%	470.0000	850.0000	62.00000	71.00000	11.500000	13.00000	6751.000	53.00000
50%	500.0000	1200.0000	75.00000	82.00000	13.600000	21.00000	8377.000	65.00000
75%	600.0000	1700.0000	85.00000	92.00000	16.500000	31.00000	10830.000	78.00000
max	2340.0000	6800.0000	103.00000	100.00000	39.800000	64.00000	56233.000	118.00000

Exploratory Analysis

Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private column.

> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Room.Board',y='Grad.Rate',data=df, hue='Private',
+            height=6,aspect=1,fit_reg=False);
+ plt.show()

Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.

> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Outstate',y='F.Undergrad',data=df, hue='Private',
+           height=6,aspect=1,fit_reg=False);
+ plt.show()

Create overlaid histograms of out-of-state tuition, one for each value of the Private column.

> sns.set_style('darkgrid')
+ g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Outstate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Create a similar histogram for the Grad.Rate column.

> g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Notice that one private school appears to have a graduation rate higher than 100%.

> df[df['Grad.Rate'] > 100].loc[:,['Private','Grad.Rate']]

                  Private  Grad.Rate
Cazenovia College     Yes        118

Set that school’s graduation rate to 100 so it makes sense.

> df.loc['Cazenovia College','Grad.Rate'] = 100

> df['Grad.Rate'].max()
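An out-of-range value like this can be capped in more than one way. The sketch below uses a hypothetical three-row frame (`demo`, with made-up school names) rather than the college data, and shows a single-cell assignment with `.loc` next to a column-wide cap with `clip`.

```python
import pandas as pd

# Hypothetical stand-in for the college data frame.
demo = pd.DataFrame({'Grad.Rate': [60, 118, 100]},
                    index=['School A', 'School B', 'School C'])

# Option 1: one .loc call selects row and column together, so the
# assignment writes to the frame itself rather than to a temporary object.
fixed = demo.copy()
fixed.loc['School B', 'Grad.Rate'] = 100

# Option 2: cap every value in the column at 100 in one step.
capped = demo['Grad.Rate'].clip(upper=100)

print(fixed['Grad.Rate'].max(), capped.tolist())
```

The `.loc` form changes exactly one known cell, while `clip` would also catch any other values above 100.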

> g = sns.FacetGrid(df,hue="Private",palette='coolwarm',
+                   height=6,aspect=2);
+ g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7);
+ plt.legend();
+ plt.show()

Cluster Creation

> kmeans = KMeans(n_clusters=2)

Fit the model to all the data except for the Private label.

> kmeans.fit(df.drop('Private',axis=1))

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

What are the cluster center vectors?

> kmeans.cluster_centers_

array([[1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
        7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
        4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
        9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
        6.75925926e+01],
       [1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
        5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
        4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
        7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
        6.50926756e+01]])
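The raw array is hard to read because the columns are unlabeled. A small sketch of one way to attach names: the two columns below are the Apps and Outstate entries of the centers printed above, rounded to two decimals, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Two of the seventeen features, copied (rounded) from the output above.
feature_names = ['Apps', 'Outstate']
centers = np.array([[10363.14, 10719.18],
                    [1813.23, 10395.71]])

centers_df = pd.DataFrame(centers, columns=feature_names)
centers_df.index.name = 'cluster'

# Difference between the two cluster centers, feature by feature.
gap = (centers_df.loc[0] - centers_df.loc[1]).round(2)
print(centers_df)
print(gap)
```

With the fitted model, the same idea would be `pd.DataFrame(kmeans.cluster_centers_, columns=df.drop('Private',axis=1).columns)`. In this run the two centers differ far more in Apps than in Outstate.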

Evaluation

There is no perfect way to evaluate clustering if you don’t have the labels. However, in this instance we do have the labels, so we’ll take advantage of this to evaluate our clusters.
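When labels are missing, internal measures can still give a rough signal. As an illustration only (the silhouette score is not used elsewhere in this walkthrough), the sketch below scores several values of k on hypothetical, well-separated synthetic blobs. The blob centers and variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, widely spaced blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [20, 0], [0, 20]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette compares each point's distance to its own cluster
    # with its distance to the nearest other cluster (higher is better).
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data such as the college set the scores are usually much less clear-cut, so this is a guide rather than a verdict.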

Create a new column for df called Cluster, which is 1 for a private school and 0 for a public school.

> def converter(private):
+     # Map the Private label to an integer: 'Yes' -> 1, anything else -> 0
+     if private=='Yes':
+         return 1
+     else:
+         return 0

> df['Cluster'] = df['Private'].apply(converter)
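The same 0/1 encoding can also be produced without a helper function. A minimal sketch on a hypothetical three-value Series, shown as an alternative rather than a replacement:

```python
import pandas as pd

# Hypothetical values standing in for the Private column.
private = pd.Series(['Yes', 'No', 'Yes'], name='Private')

# Comparison yields booleans; astype(int) turns True/False into 1/0.
via_compare = (private == 'Yes').astype(int)

# An explicit dictionary makes the mapping visible at a glance.
via_map = private.map({'Yes': 1, 'No': 0})

print(via_compare.tolist(), via_map.tolist())
```

The two forms differ on unexpected values: the comparison sends anything other than 'Yes' to 0, while `map` leaves it as a missing value.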

> df.head()

	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board
Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300
Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450
Adrian College	Yes	1428	1097	336	22	50	1036	99	11250	3750
Agnes Scott College	Yes	417	349	137	60	89	510	63	12960	5450
Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560	4120

	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate	Cluster
Abilene Christian University	450	2200	70	78	18.1	12	7041	60	1
Adelphi University	750	1500	29	30	12.2	16	10527	56	1
Adrian College	400	1165	53	66	12.9	30	8735	54	1
Agnes Scott College	450	875	92	97	7.7	37	19016	59	1
Alaska Pacific University	800	1500	76	72	11.9	2	10922	15	1

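Create a confusion matrix and classification report to see how well the KMeans clustering worked without being given any labels. Cluster numbers are arbitrary, so on another run cluster 0 may correspond to the private schools, in which case the two predicted columns are swapped.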

> from sklearn.metrics import confusion_matrix,classification_report

> confu = confusion_matrix(df['Cluster'],kmeans.labels_)

> pd.DataFrame(confu,index=['Actual 0','Actual 1'],
+             columns=['Predicted 0','Predicted 1'])

          Predicted 0  Predicted 1
Actual 0           74          138
Actual 1           34          531

> print(classification_report(df['Cluster'],kmeans.labels_))

              precision    recall  f1-score   support

           0       0.69      0.35      0.46       212
           1       0.79      0.94      0.86       565

    accuracy                           0.78       777
   macro avg       0.74      0.64      0.66       777
weighted avg       0.76      0.78      0.75       777
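The figures in the report follow directly from the confusion matrix. As a worked check using only the four counts printed above (the variable names are chosen for this sketch):

```python
import numpy as np

# Rows are actual classes, columns are predicted classes.
confu = np.array([[74, 138],
                  [34, 531]])

correct = np.diag(confu)                  # correctly assigned, per class
precision = correct / confu.sum(axis=0)   # divide by column totals (predicted)
recall = correct / confu.sum(axis=1)      # divide by row totals (actual)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / confu.sum()

print(np.round(precision, 2))  # 0.69 and 0.79, as in the report
print(np.round(recall, 2))     # 0.35 and 0.94
print(np.round(f1, 2))         # 0.46 and 0.86
print(round(accuracy, 2))      # 0.78
```

The low recall for class 0 reflects the 138 public schools that landed in the cluster dominated by private schools.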

Python for K Means Clustering

Python code in R Markdown

Paul Jozefek

2020-09-26