K-means clustering is an unsupervised learning algorithm that groups data points based on their similarity.
First, we specify the number of clusters, or data groups. Next, the algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:
1. Reassign each data point to the cluster whose centroid is nearest.
2. Recompute each centroid as the mean of the points assigned to it.
These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
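To make the two steps concrete, here is a minimal NumPy sketch of a single iteration (the function name kmeans_step is mine, not from any library; it assumes numpy has been imported as np, as in the next code block):
> def kmeans_step(X, centroids):
+     # Step 1: assign each point to the nearest centroid
+     dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
+     labels = dists.argmin(axis=1)
+     # Step 2: move each centroid to the mean of its assigned points
+     new_centroids = np.array([X[labels == k].mean(axis=0)
+                               for k in range(len(centroids))])
+     # Within-cluster variation: sum of squared distances to own centroid
+     wcss = ((X - new_centroids[labels]) ** 2).sum()
+     return labels, new_centroids, wcss
In practice we let scikit-learn's KMeans run this loop for us, with the smarter 'k-means++' initialization scheme.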
> import seaborn as sns
+ import matplotlib.pyplot as plt
+ import pandas as pd
+ import numpy as np
+ from sklearn.datasets import make_blobs
+ from sklearn.cluster import KMeans
> # Create Artificial Data
+ data = make_blobs(n_samples=200, n_features=2,
+                   centers=4, cluster_std=1.8,
+                   random_state=101)
> data[0].shape
(200, 2)
> data[1]
array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 2, 1, 3, 3, 3,
0, 3, 3, 0, 1, 2, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
1, 2, 2, 1, 2, 0, 1, 3, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
2, 1, 1, 1, 1, 3, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
0, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 3, 0, 3,
2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
1, 0])
> plt.figure(figsize=(8,6))
+ sns.set_style('darkgrid')
+ plt.scatter(data[0][:,0], data[0][:,1],
+             c=data[1], cmap='rainbow');
+ plt.show()
> kmeans = KMeans(n_clusters=4)
+ kmeans.fit(data[0])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
> kmeans.cluster_centers_
array([[-0.0123077 , 2.13407664],
[-9.46941837, -6.56081545],
[ 3.71749226, 7.01388735],
[-4.13591321, 7.95389851]])
> kmeans.labels_
array([3, 2, 0, 2, 2, 1, 2, 0, 2, 0, 3, 0, 2, 2, 3, 0, 2, 0, 1, 3, 1, 0,
0, 1, 3, 1, 1, 0, 2, 2, 3, 1, 2, 0, 0, 3, 1, 1, 1, 0, 1, 3, 3, 3,
0, 2, 3, 0, 1, 0, 0, 3, 2, 0, 1, 3, 0, 0, 3, 2, 1, 2, 1, 3, 2, 0,
1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 2, 0, 3, 0, 0, 1, 2, 1, 0, 0, 0, 3,
0, 1, 1, 1, 1, 0, 0, 1, 2, 3, 1, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1,
2, 3, 3, 2, 1, 2, 3, 3, 2, 3, 0, 3, 0, 3, 0, 2, 3, 0, 1, 3, 3, 3,
0, 1, 1, 3, 2, 3, 2, 0, 1, 2, 1, 3, 3, 2, 0, 1, 3, 3, 3, 3, 0, 2,
0, 3, 2, 2, 2, 0, 2, 0, 0, 3, 1, 3, 0, 2, 3, 0, 2, 0, 3, 2, 0, 3,
2, 2, 1, 2, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 1, 2, 2, 3, 1, 0, 2, 2,
1, 0])
Comparing the two plots below: although the color assignments differ (K-means numbers its clusters arbitrarily), we can see how the points are grouped. Some of the groups overlap in the original data, but K-means separates them completely.
> fig, (ax1, ax2) = plt.subplots(1, 2,
+                                sharey=True, figsize=(10,6));
+ ax1.set_title('K Means')
+ ax1.scatter(data[0][:,0], data[0][:,1],
+             c=kmeans.labels_, cmap='rainbow');
+ ax2.set_title('Original')
+ ax2.scatter(data[0][:,0], data[0][:,1],
+             c=data[1], cmap='rainbow');
+ plt.show()
We’ll attempt to use K-means to cluster universities into two groups: private and public.
Although we actually have the labels for this data set, we will NOT give them to the K-means algorithm, since it is an unsupervised learning algorithm.
Under normal circumstances you use K-means precisely because you don’t have labels; here we’ll use the labels only to get an idea of how well the algorithm performed.
We will use a data frame with 777 observations on the following 18 variables.
Private
- A factor with levels No and Yes indicating private or public university
Apps
- Number of applications received
Accept
- Number of applications accepted
Enroll
- Number of new students enrolled
Top10perc
- Pct. new students from top 10% of H.S. class
Top25perc
- Pct. new students from top 25% of H.S. class
F.Undergrad
- Number of fulltime undergraduates
P.Undergrad
- Number of parttime undergraduates
Outstate
- Out-of-state tuition
Room.Board
- Room and board costs
Books
- Estimated book costs
Personal
- Estimated personal spending
PhD
- Pct. of faculty with Ph.D.’s
Terminal
- Pct. of faculty with terminal degree
S.F.Ratio
- Student/faculty ratio
perc.alumni
- Pct. alumni who donate
Expend
- Instructional expenditure per student
Grad.Rate
- Graduation rate
 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate |
---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 |
 | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
Adelphi University | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
Adrian College | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
Agnes Scott College | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
Alaska Pacific University | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
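For reference, the DataFrame shown above can be loaded along these lines; this is a minimal sketch, assuming the data sits in a CSV file named College_Data with the university names in the first column (adjust the file name to your copy of the data set):
> df = pd.read_csv('College_Data', index_col=0)  # file name is an assumption
+ df.head()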
> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private 777 non-null object
Apps 777 non-null int64
Accept 777 non-null int64
Enroll 777 non-null int64
Top10perc 777 non-null int64
Top25perc 777 non-null int64
F.Undergrad 777 non-null int64
P.Undergrad 777 non-null int64
Outstate 777 non-null int64
Room.Board 777 non-null int64
Books 777 non-null int64
Personal 777 non-null int64
PhD 777 non-null int64
Terminal 777 non-null int64
S.F.Ratio 777 non-null float64
perc.alumni 777 non-null int64
Expend 777 non-null int64
Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB
> df.describe()
 | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board |
---|---|---|---|---|---|---|---|---|---|
count | 777.000 | 777.000 | 777.0000 | 777.00000 | 777.00000 | 777.000 | 777.0000 | 777.000 | 777.000 |
mean | 3001.638 | 2018.804 | 779.9730 | 27.55856 | 55.79665 | 3699.907 | 855.2986 | 10440.669 | 4357.526 |
std | 3870.201 | 2451.114 | 929.1762 | 17.64036 | 19.80478 | 4850.421 | 1522.4319 | 4023.016 | 1096.696 |
min | 81.000 | 72.000 | 35.0000 | 1.00000 | 9.00000 | 139.000 | 1.0000 | 2340.000 | 1780.000 |
25% | 776.000 | 604.000 | 242.0000 | 15.00000 | 41.00000 | 992.000 | 95.0000 | 7320.000 | 3597.000 |
50% | 1558.000 | 1110.000 | 434.0000 | 23.00000 | 54.00000 | 1707.000 | 353.0000 | 9990.000 | 4200.000 |
75% | 3624.000 | 2424.000 | 902.0000 | 35.00000 | 69.00000 | 4005.000 | 967.0000 | 12925.000 | 5050.000 |
max | 48094.000 | 26330.000 | 6392.0000 | 96.00000 | 100.00000 | 31643.000 | 21836.0000 | 21700.000 | 8124.000 |
 | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
---|---|---|---|---|---|---|---|---|
count | 777.0000 | 777.0000 | 777.00000 | 777.00000 | 777.000000 | 777.00000 | 777.000 | 777.00000 |
mean | 549.3810 | 1340.6422 | 72.66023 | 79.70270 | 14.089704 | 22.74389 | 9660.171 | 65.46332 |
std | 165.1054 | 677.0715 | 16.32815 | 14.72236 | 3.958349 | 12.39180 | 5221.768 | 17.17771 |
min | 96.0000 | 250.0000 | 8.00000 | 24.00000 | 2.500000 | 0.00000 | 3186.000 | 10.00000 |
25% | 470.0000 | 850.0000 | 62.00000 | 71.00000 | 11.500000 | 13.00000 | 6751.000 | 53.00000 |
50% | 500.0000 | 1200.0000 | 75.00000 | 82.00000 | 13.600000 | 21.00000 | 8377.000 | 65.00000 |
75% | 600.0000 | 1700.0000 | 85.00000 | 92.00000 | 16.500000 | 31.00000 | 10830.000 | 78.00000 |
max | 2340.0000 | 6800.0000 | 103.00000 | 100.00000 | 39.800000 | 64.00000 | 56233.000 | 118.00000 |
Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private column.
> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Room.Board', y='Grad.Rate', data=df, hue='Private',
+            height=6, aspect=1, fit_reg=False);
+ plt.show()
Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.
> sns.set_style('whitegrid')
+ ylteal = ["#DDE61B", "#1BE6DE"]
+ sns.set_palette(ylteal)
+ sns.lmplot(x='Outstate', y='F.Undergrad', data=df, hue='Private',
+            height=6, aspect=1, fit_reg=False);
+ plt.show()
Create a stacked histogram showing Out of State Tuition based on the Private column.
> sns.set_style('darkgrid')
+ g = sns.FacetGrid(df, hue="Private", palette='coolwarm',
+                   height=6, aspect=2);
+ g = g.map(plt.hist, 'Outstate', bins=20, alpha=0.7);
+ plt.legend();
+ plt.show()
Create a similar histogram for the Grad.Rate column.
> g = sns.FacetGrid(df, hue="Private", palette='coolwarm',
+                   height=6, aspect=2);
+ g = g.map(plt.hist, 'Grad.Rate', bins=20, alpha=0.7);
+ plt.legend();
+ plt.show()
Notice how there seems to be a private school with a graduation rate higher than 100%.
> df[df['Grad.Rate'] > 100][['Private', 'Grad.Rate']]
                  Private  Grad.Rate
Cazenovia College     Yes        118
Set that school’s graduation rate to 100 so it makes sense.
> df.loc['Cazenovia College', 'Grad.Rate'] = 100
+ df.loc['Cazenovia College', 'Grad.Rate']
100
> g = sns.FacetGrid(df, hue="Private", palette='coolwarm',
+                   height=6, aspect=2);
+ g = g.map(plt.hist, 'Grad.Rate', bins=20, alpha=0.7);
+ plt.legend();
+ plt.show()
Fit the model to all the data except for the Private label.
> kmeans = KMeans(n_clusters=2)
+ kmeans.fit(df.drop('Private', axis=1))
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
What are the cluster center vectors?
> kmeans.cluster_centers_
array([[1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
6.75925926e+01],
[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
6.50926756e+01]])
There is no perfect way to evaluate clustering if you don’t have the labels. However, in this instance we do have the labels, so we’ll take advantage of this to evaluate our clusters.
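When labels are unavailable, a common heuristic (not part of the original walkthrough; shown here only for context) is the silhouette score, which rates cluster separation from the data alone:
> from sklearn.metrics import silhouette_score
+ # Ranges from -1 to 1; higher means tighter, better-separated clusters
+ silhouette_score(df.drop('Private', axis=1), kmeans.labels_)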
Create a new column for df called Cluster, which is 1 for a private school and 0 for a public school.
> df['Cluster'] = df['Private'].apply(lambda x: 1 if x == 'Yes' else 0)
+ df.head()
 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board |
---|---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 |
Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 |
 | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | Cluster |
---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 | 1 |
Adelphi University | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 | 1 |
Adrian College | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 | 1 |
Agnes Scott College | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | 1 |
Alaska Pacific University | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 | 1 |
Create a confusion matrix and classification report to see how well the K-means clustering worked without being given any labels.
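A minimal sketch of how these numbers can be produced follows. Wrapping the raw confusion matrix in a DataFrame to get the labeled display is my own touch; note also that K-means numbers its clusters arbitrarily, so if cluster 0 had come out as the private group, the matrix would simply appear with its columns swapped.
> from sklearn.metrics import confusion_matrix, classification_report
+ cm = confusion_matrix(df['Cluster'], kmeans.labels_)
+ print(pd.DataFrame(cm, index=['Actual 0', 'Actual 1'],
+                    columns=['Predicted 0', 'Predicted 1']))
+ print(classification_report(df['Cluster'], kmeans.labels_))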
Predicted 0 Predicted 1
Actual 0 74 138
Actual 1 34 531
precision recall f1-score support
0 0.69 0.35 0.46 212
1 0.79 0.94 0.86 565
accuracy 0.78 777
macro avg 0.74 0.64 0.66 777
weighted avg 0.76 0.78 0.75 777