Unsupervised Machine Learning Tutorial, HE930

1 Introduction

Link to this document: https://rpubs.com/anshulkumar/unsupervised1

1.1 Sources and resources

The code and ideas used in this tutorial were taken almost entirely from the following resources:

  • K-means Cluster Analysis. UC Business Analytics R Programming Guide. https://uc-r.github.io/kmeans_clustering. – This resource was used to guide the “Import and prepare data,” “Calculating distance,” and “K-means clustering” sections of this tutorial.

The following optional resources further help describe or relate to this tutorial:1

  1. K-means clustering, from StatQuest with Josh Starmer. Video: https://youtu.be/4b5d3muPQmA.

  2. Principal Component Analysis (PCA), Step-by-Step, from StatQuest with Josh Starmer. Video: https://youtu.be/FgakZw6K1QQ.

  3. PCA main ideas in only 5 minutes!!!, from StatQuest with Josh Starmer. Video: https://youtu.be/HMOI_lkzW08.

  4. Clustering tutorial in Google Colab, by Anshul Kumar. Short link: http://tinyurl.com/AnshulCluster1. Full link: https://colab.research.google.com/drive/1fDT9vQ1yMbDr7AczPB1xaaDCMl4ck7bd?usp=sharing#scrollTo=j98RD388gxUZ.

1.2 Learning goals

  1. Apply one unsupervised ML technique—k-means clustering—to given data.

  2. Present and interpret results of k-means clustering analysis.

1.3 Overview

In this tutorial, you will encounter the following techniques, which will help us perform clustering, an unsupervised machine learning approach for exploring our data:

  1. Euclidean distance calculations
  2. K-means clustering
  3. Principal component analysis (PCA)

We will use the mtcars dataset to see some examples. Then, you will conduct an unsupervised analysis of your own using either your own data or the student-por.csv data that we have looked at before.

1.4 Tasks and tips for this week

This week, please read this entire document and then complete the assignment at the bottom of this document. Please also follow along by copying the code from this document into your own RMarkdown file and running all of it in RStudio2 on your own.

Also be sure to complete any additional tasks described in this week’s portion of the course’s D2L site.

2 Import and prepare data

Import:

For the examples in this tutorial, I will just use the mtcars dataset that is already built into R. However, we will still “import” this and give it a new name—df—so that it is easier to type.

df <- mtcars

Run the following command in your own R console to see information about the data:

?mtcars

And we can see all of the variable names, like we have done before:

names(df)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Number of observations:

nrow(df)
## [1] 32

Remove missing observations:

df <- na.omit(df)

Inspect some rows in the dataset:

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Convert non-numeric variables to dummy (binary) variables, if necessary:3

if (!require(fastDummies)) install.packages('fastDummies')
library(fastDummies)

df <- dummy_cols(df, remove_first_dummy = TRUE,  remove_selected_columns = TRUE)

There are no non-numeric variables in the mtcars data set, so the code above is not necessary to run in that case. For other data sets, the code above is useful.
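To see what this conversion does, here is a tiny, hypothetical example that is separate from the mtcars workflow (the toy data frame and its column names are made up purely for illustration):

toy <- data.frame(id = 1:3, color = c("red", "blue", "red"))
dummy_cols(toy, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# The character column "color" is dropped and replaced by a 0/1 dummy column
# for one of its two categories (the first category is removed, as requested).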

Scale all variables in dataset and inspect again:

df <- scale(df)
head(df)
##                          mpg        cyl        disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
## Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
##                             wt       qsec         vs         am       gear
## Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
## Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
## Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
## Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
##                         carb
## Mazda RX4          0.7352031
## Mazda RX4 Wag      0.7352031
## Datsun 710        -1.1221521
## Hornet 4 Drive    -1.1221521
## Hornet Sportabout -0.5030337
## Valiant           -1.1221521
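If you want to confirm that the scaling worked, you can check that every column now has a mean of roughly 0 and a standard deviation of roughly 1 (output not shown):

round(colMeans(df), 2)     # column means should all be (approximately) 0
round(apply(df, 2, sd), 2) # column standard deviations should all be (approximately) 1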

3 Calculating distance

3.1 Distances between physical objects – distance formula

You may recall from your math education that the distance formula—which is based on the Pythagorean Theorem—allows you to calculate the distance between two plotted points on a graph, or between two latitude and longitude locations.

You can review the distance formula here, if you wish:

Now, imagine that instead of just x and y for each point you have plotted, you have three coordinates. Then you would add a term to the formula and you could still calculate the distance between the two points. Imagine that you have latitude, longitude, and height-from-ground measurements for two hot air balloons and you wanted to know the distance between them. Then, you could use the following formula to calculate how far apart they are:4

\[distance = \sqrt{(latitude_1-latitude_2)^2 + (longitude_1-longitude_2)^2 + (height_1-height_2)^2}\]

You can review the three-dimensional distance formula here, if you wish:

If we wanted, we could have a whole bunch of hot air balloons hovering in the air, all with their own three location coordinates. We could make a database of all of the hot air balloons. The database would be in a spreadsheet with each row representing a hot air balloon. There would be three columns, one for each type of location coordinate.

Here’s an example:

            Latitude (degrees north)   Longitude (degrees west)   Height-above-ground (meters)
Balloon 1   42.3601                    71.0589                    100
Balloon 2   42.3876                    71.0995                    200
Balloon 3   42.3720                    71.1075                    205
Balloon 4   42.2058                    71.0348                    158

Within the database above, we could calculate the distances between each pair of balloons.5 And we could say that maybe all of the balloons that are close to each other have passengers with similar preferences. That’s why they have all chosen to navigate to the same location. Based on these revealed preferences, maybe we could even identify a few groups and put each balloon into one of these groups. That’s what clustering is all about.
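To make this concrete, here is a small optional sketch in R that computes the pairwise Euclidean distances between the four balloons in the table above, using the built-in dist() function. (In reality, the three measurements would first need to be converted to the same units, such as meters, for the distances to be physically meaningful.)

# The balloon table from above, re-typed as a small data frame
balloons <- data.frame(
  latitude  = c(42.3601, 42.3876, 42.3720, 42.2058),
  longitude = c(71.0589, 71.0995, 71.1075, 71.0348),
  height    = c(100, 200, 205, 158),
  row.names = c("Balloon 1", "Balloon 2", "Balloon 3", "Balloon 4")
)

dist(balloons) # Euclidean distance between every pair of balloons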

3.2 “Distances” between observations

In the hot air balloons, it is (hopefully) very intuitive that we could calculate distances between items. But now imagine that you have a dataset like the mtcars dataset, which we prepared and inspected earlier in this tutorial. Each row is a car and each column is an attribute that a car might have. We can pretend that these attributes for the cars are just like location measurements for the hot air balloons. Even though these car attributes are not physical locations that we can visualize, we can still use the distance formula to calculate a “distance” between any two objects. And maybe those distances will tell us something noteworthy about the relationship (or similarity or difference) between those two objects.

The “distance” we get between two observations in our data (between two cars, in this case) is called the Euclidean distance when it is calculated using the distance formula above. There are other ways to calculate the distance between two observations, but we will not discuss them in as much detail.6
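Here is a minimal sketch (assuming df is the scaled mtcars data from the data preparation section) that applies the distance formula by hand to two cars and then checks the result against R’s built-in dist() function:

# "Distance" between the Mazda RX4 and the Datsun 710, computed by hand
manual <- sqrt(sum((df["Mazda RX4", ] - df["Datsun 710", ])^2))
manual

# The same value, taken from the full matrix of distances that dist() produces
as.matrix(dist(df))["Mazda RX4", "Datsun 710"]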

3.3 Calculating distance in R

Please load the following packages in preparation for running unsupervised learning techniques in R on your computer:

if (!require(factoextra)) install.packages('factoextra') 
library(factoextra)

if (!require(cluster)) install.packages('cluster') 
library(cluster)

Also, make sure that your data is scaled or standardized, as demonstrated in the data preparation section of this tutorial.

Now, we’ll have R calculate the distances for us between every pair of cars in our dataset:

distance <- get_dist(df)

Above, all of the distances have been stored as the object distance. We can refer to distance any time we want as we continue.

On your own computer, you can inspect the resulting matrix with the following code:

View(as.matrix(distance))

3.4 Visualizing distances between observations

Now we’ll visualize these distances that we just calculated. In the command below, change distance to the name of your own stored distances object and leave everything else as it is:

fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Interpretation:

  • Each box corresponds to a pair of cars in our dataset.
  • The color of that box tells us whether those two cars are “near” or “far” in distance to each other. Blue means close and red means far.
  • The smallest distance between two cars in this dataset is very close to 0.
  • The largest distance between two cars in this dataset is around 8.

The Maserati Bora and the Datsun 710 are two cars that are very “far” from each other. They are 7.3 distance units apart (which you can see if you run the View(as.matrix(distance)) command). Does that mean that they are pretty different from each other?

The AMC Javelin and Hornet Sportabout are two cars that are very “close” to each other. They are 0.84 distance units apart (you can also see this in the matrix). Perhaps that means that they are very similar to each other.
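If you would rather pull these values out with code than scroll through the Viewer, here is a short optional sketch:

d <- as.matrix(distance)              # convert the stored distances to a plain matrix
d["Maserati Bora", "Datsun 710"]      # roughly 7.3
d["AMC Javelin", "Hornet Sportabout"] # roughly 0.84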

4 K-means clustering

Now that we have reviewed some of the basic underlying concepts, we are ready to start clustering!

4.1 K-means algorithm

Please watch this short video that explains how k-means clustering works:

The process described in this video is what the computer will be doing with our data when it makes clusters for us.

4.2 Two clusters

Now we will finally assign the observations in our data to individual clusters. Before you do this, make sure that your data is scaled or standardized, as demonstrated in the data preparation section of this tutorial. With the code below, we are instructing the computer to create two clusters and then assign each observation to one of those clusters:

k2 <- kmeans(df, centers = 2)

Above, we created a new object called k2 which holds the clustering results. We created this k-means results object with the kmeans(...) function, telling it to use df as the data and to create two centers, or clusters.
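One optional note: k-means starts from randomly chosen centers, so the cluster numbering (and occasionally the assignments themselves) can vary from run to run. If you want reproducible results, a common approach is to set a seed and ask for several random starts, as sketched below. This is an optional addition rather than part of the workflow above, and the output shown in the rest of this section comes from the original command.

set.seed(123)                              # any fixed number works here
k2 <- kmeans(df, centers = 2, nstart = 25) # try 25 random starts and keep the best result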

Let’s inspect the results by running k2 as a command:

k2
## K-means clustering with 2 clusters of sizes 18, 14
## 
## Cluster means:
##          mpg        cyl       disp         hp       drat         wt       qsec
## 1  0.6440403 -0.7893528 -0.7679844 -0.7093044  0.5342642 -0.6215850  0.4685997
## 2 -0.8280518  1.0148821  0.9874085  0.9119628 -0.6869112  0.7991807 -0.6024854
##           vs         am       gear       carb
## 1  0.6751327  0.4105508  0.4235542 -0.3310564
## 2 -0.8680278 -0.5278510 -0.5445697  0.4256439
## 
## Clustering vector:
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   1 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   2                   1                   2                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   2                   2 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   2                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   2                   1                   2                   1 
## 
## Within cluster sum of squares by cluster:
## [1] 114.14693  64.93441
##  (between_SS / total_SS =  47.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The output above gives us a number of items:

  • Cluster means: A list of the mean value of each variable in the dataset, for each of the two clusters. For example, the mean value of mpg in cluster #2 is -0.83.
  • Clustering vector: A list of which cluster each observation in the dataset belongs to. For example, the car Toyota Corolla belongs to cluster #1.
  • Sum-of-squares metrics: We will not discuss these in detail this week. But you can read more about this if you wish in the K-means Cluster Analysis7 reading.

You can run the following command to view the clusters assigned to each observation:

View(k2$cluster)

The command above will open a Viewer window showing just a single column of data with the assigned clusters. The k2 object we created has a component called cluster, which is where the cluster assignments are stored. That is what we are accessing above.
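If you just want a quick count of how many observations ended up in each cluster, here is one more optional command:

table(k2$cluster) # number of cars assigned to cluster 1 and to cluster 2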

We can add these cluster assignments back to the original dataset if we wish, as follows:

k2.clusters.temp <- as.data.frame(k2$cluster)
df <- cbind(df,k2.clusters.temp)

In the code above, we’re making a new dataset called k2.clusters.temp, which only has a single column (variable) in it. This dataset temporarily holds the cluster assignments for us. Then, in the second line, we join together the original dataset, df, with the new dataset that only has the cluster assignments, k2.clusters.temp. The cbind function joins together datasets by including all of the columns from both provided datasets.

After that, we can run the View command on the entire dataset again (you can do this on your own computer):

View(df)

Again, a Viewer window will appear and show you your original df dataset with the cluster assignments included as a column on the right side!

4.3 Visualizing two clusters with PCA

Now that we have split our data into two clusters, we can also visualize these clusters using the following command:

fviz_cluster(k2, data = df)

Above, we are using the fviz_cluster command and we are giving it the following arguments:

  • k2 – This is the clustering object that we created before that we want to now visualize.
  • df – This is the dataset, which we need to remind the computer about again.

Here is a description of what this visualization represents:

This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.8

It would be impossible to plot all of the variables in df at once, because we do not have enough axes for all of the variables. To get around this problem, the fviz_cluster function we used above reduces all of the variables into principal components, which are composite variables built from the original variables. In the plot we see above, the first of these principal components, called Dim1, captures 61.8% of the variation in all of the variables in df. The second principal component, Dim2, captures 22.1% of the variation in all of the variables in df. It is not necessary for you to fully understand PCA this week.
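If you are curious, here is a minimal optional sketch of the kind of PCA that fviz_cluster performs behind the scenes, using base R’s prcomp function. The name df_scaled below is just a hypothetical clean copy of the scaled data, so that we do not disturb df, which now has a cluster column attached:

df_scaled <- scale(mtcars) # a clean scaled copy, without the cluster column
pca <- prcomp(df_scaled)   # principal component analysis
summary(pca)               # the "Proportion of Variance" row for PC1 and PC2 is what the
                           # Dim1 and Dim2 percentages are based on; the exact values can
                           # differ slightly depending on which columns fviz_cluster received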

4.4 Visualizing clusters with standard scatterplots

If we do want to select just two variables and make a scatterplot (instead of using PCA), we can still do that. I will summarize but not fully explain this procedure, focusing only on the items that you need to modify when you do this with different data. There are many possible ways to do this, two of which are shown below. You can use whichever one is best for your situation and data.

4.4.1 Scatterplot version 1

Here is how you can adapt the code below for your own use:

  • df – change all instances of df to the name of the dataset that you are using.
  • k2 – change this to the name of the stored k-means object that you created before.
  • mpg, hp – change these two variables to the names of the variables that you want to include in your own scatterplot.

if (!require(tidyverse)) install.packages('tidyverse') 
library(tidyverse)

if (!require(ggplot2)) install.packages('ggplot2') 
library(ggplot2)
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         car = row.names(df)) %>%
  ggplot(aes(mpg, hp, color = factor(cluster), label = car)) +
  geom_text()

Above, we see an ordinary scatterplot showing hp plotted against mpg, with the following additions:

  • Points have been replaced by the names of each observation (names of cars, in this situation).
  • Each observation is colored according to its cluster membership.

The scatterplot shows us that overall, the clusters help us differentiate between cars pretty well, in this case at least!

4.4.2 Scatterplot version 2

This9 is a simpler scatterplot that does not include a label for each observation. That may be desirable if you do not need labels or if you have a large number of observations (in which case labels would cause clutter).

First, we create a categorical variable identifying each observation’s cluster:

df$cluster <- as.factor(df$`k2$cluster`)

Next, we make the scatterplot.

Here is how you can adapt the code below for your own use:

  • Change df to the name of your data set.
  • Change mpg to your desired X variable.
  • Change hp to your desired Y variable.
  • Change cluster to your variable that identifies each observation’s cluster (this might remain as being called cluster if you use the code used above; you do not necessarily need to change this).

if (!require(ggplot2)) install.packages('ggplot2') 
library(ggplot2)

ggplot(df,aes(mpg,hp,color=cluster))+geom_point()

Above, you can see that we have a standard scatterplot with plotted points. Each point’s color is determined according to which cluster it belongs to (from the k-means clustering that we already conducted).

A slight modification to the scatterplot above is here:10

ggplot(df,aes(jitter(mpg),jitter(hp),color=cluster))+geom_point()

The results are not shown, but you can run it on your own computer and see the result. The change we made is that we took both the X variable (mpg) and the Y (hp) variable and we changed them to jitter(mpg) and jitter(hp). This tells the computer to slightly modify where the dots (points) are plotted, such that you can more easily see ones that have the same coordinates.

4.5 Visualize clustering results with box plots

A box plot might also be useful to help look at how our clusters might have split up our data.11

Here is an example:

boxplot(mpg ~ cluster, data = df)

Above, we visualize the distribution of the variable mpg separately for the observations (cars, in this case) in each cluster.

We can repeat this with a different variable by changing mpg in the code to carb:

boxplot(carb ~ cluster, data = df)

In the two plots above, we see that observations (cars, in this case) have fairly different mpg measurements when comparing clusters 1 and 2 (the only two clusters). But the carb measurements are not as different when comparing the two clusters. Box plots are a convenient way to quickly look across groups (clusters, in this case) within your data.

If you made more than two clusters, note that more than two boxes should appear!

4.6 “Visualize” cluster results with tables

Typically, scatter and box plots will be most useful. However, if you are using dummy (binary) variables, a simple table can often be the simplest way to inspect results and look for patterns.

Below, we make a two-way table with the dummy variable am and the cluster membership (saved as the variable cluster) of each observation:12

with(df,table(am,cluster))
##                     cluster
## am                    1  2
##   -0.814143075490065  7 12
##   1.18990141802394   11  2

Above, we see at least a hint of some noteworthy patterns. Out of the 13 cars that have am coded as 113—meaning that they have manual transmission—11 of these cars belong to cluster 1 and only 2 of these cars belong to cluster 2. Cars with automatic transmission are more evenly spread out between the two clusters that were created.
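If you find proportions easier to read than counts, here is a small optional follow-up that converts the same table into row proportions (the share of each am group that falls into each cluster):

with(df, prop.table(table(am, cluster), margin = 1)) # each row sums to 1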

If we had made three clusters instead of two, three columns would have appeared in the table above instead of two.

4.7 Three clusters – k-means

Below is all the code that you already saw above, but changed so that we make three clusters instead of two. The outputs are not shown but you can run it all in R on your own computer to see the results.

df <- mtcars # you don't need to run this again if you already have
df <- scale(df) # you don't need to run this again if you already have

k3 <- kmeans(df, centers = 3)

k3

View(k3$cluster)

df <- as.data.frame(df) # scale() returns a matrix, so convert back to a data frame before adding a column
df$k3.clusters <- as.numeric(k3$cluster)

View(df)

fviz_cluster(k3, data = df)

if (!require(tidyverse)) install.packages('tidyverse') 
library(tidyverse)

if (!require(ggplot2)) install.packages('ggplot2') 
library(ggplot2)

df %>%
  as_tibble() %>%
  mutate(cluster = k3$cluster,
         car = row.names(df)) %>%
  ggplot(aes(mpg, hp, color = factor(cluster), label = car)) +
  geom_text()

4.8 How many clusters are best? – elbow method

There are multiple ways to determine how many clusters are the best number to use when doing k-means clustering. You are required to learn one of these methods, called the “elbow method”.

You can read about and find the code to execute other methods in the K-means Cluster Analysis14 reading, but this is optional (not required).

One of the most important metrics of error in k-means clustering is the within-clusters sum of squares (WSS). This is a measure of how dispersed—or far apart—all of the data points within each cluster are.15 We don’t want them to be too far apart from each other. If the data points within a single cluster are too far apart from each other, maybe they should just be divided into two clusters. Looking at the WSS for many different numbers of clusters helps us decide on an optimal number of clusters to use.

The fviz_nbclust command below plots the WSS for many different numbers of clusters. Here is what we are giving to the fviz_nbclust function:

  • df – The dataset that we are using for clustering analysis.
  • kmeans – The method of clustering that we are using.
  • method = "wss" – The error metric we want to use. In this case, we have specified wss, meaning the same WSS that is described above.

Once it receives these instructions, the fviz_nbclust function runs the same k-means process that we ran above for 2 and then 3 clusters, and then repeats it for many other numbers of clusters. It records the WSS each time it does this. Then it plots the WSS against the number of clusters, as you can see below.

fviz_nbclust(df, kmeans, method = "wss")

You can see that once we reach about four clusters, we are really not reducing the WSS much at all with each additional cluster we add. Four clusters might be optimal. You could also argue that 4 isn’t much better than 3, and 3 isn’t much better than 2! Two clusters might also be appropriate for this dataset.

This is called the “elbow method” of determining the optimal number of clusters because the curve is shaped like a human’s upper arm, elbow, and then lower arm. The optimal number of clusters lies in the elbow region of the curve, which in this case is in the range between 2 clusters and 4 clusters.
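For reference, here is a rough optional sketch of what fviz_nbclust is doing internally when we give it method = "wss". The name df_scaled is a hypothetical clean scaled copy of the data, used because df may have extra cluster columns attached at this point:

df_scaled <- scale(mtcars) # a clean scaled copy of the data
set.seed(123)              # k-means starts are random, so fix a seed for reproducibility

# Run k-means for k = 1 through 10 and record the total within-cluster sum of squares each time
wss <- sapply(1:10, function(k) kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss)

# Plot WSS against the number of clusters and look for the "elbow"
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")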

5 Assignment

In this week’s assignment, you are required to answer all of the questions below, which relate to unsupervised machine learning and k-means clustering. Please show R code and R output for all questions that require you to do anything in R. It is recommended that you submit this assignment as an HTML, Word, or PDF file that is “knitted” through RStudio.

If you have data of your own, I strongly recommend that you use it this week instead of the data I have provided for this assignment. Unsupervised machine learning is a great way to learn new things about your own data. You can just do all of the tasks described below with your own data. Feel free to take my help as you do this.

If you do want to use data provided by me, you should again use the student-por.csv dataset that we have used before. This is in D2L, if you don’t already have it downloaded. You can also get it from here:

Here’s how you can load the data into R and make sure all variables are numeric:16

df_original <- read.csv("student-por.csv")

if (!require(fastDummies)) install.packages('fastDummies')
library(fastDummies)
df <- dummy_cols(df_original, remove_first_dummy = TRUE,  remove_selected_columns = TRUE)

5.1 Prepare your data

In this first part of the assignment, you will prepare the data for further analysis.

Task 1: Conduct all data preparation steps that you think are necessary for you to do before you do distance calculations and k-means clustering analysis. Be sure to make all variables numeric that are not.

5.2 Practice calculating distances

In this part of the assignment, you will calculate a distance matrix for your data and identify observations that are “close” and “far” from each other.

Task 2: Calculate the distance between the following two points, called A and B:

point   w coordinate   x coordinate   y coordinate   z coordinate
A       8              7              6              5
B       4              3              2              1

Show all of your calculation work and the final answer. Remember that a coordinate is like a variable. A point is like an observation in your data set.

Task 3: Calculate a distance matrix for the data you are using for this assignment.

Task 4: Look at the distance matrix you created in the Viewer. What are the highest and lowest distances that you can see between observations?

Task 5: Choose two observations (students, in the case of the student-por.csv data) in the data that are very “close” to each other in distance. Do they have similar final grades (choose a different outcome to look at if you’re using your own data)?

Task 6: Choose two observations in the data that are very “far” from each other in distance. Do they have different final grades (choose a different outcome to look at if you’re using your own data)?

Task 7: Make a visualization showing the distances between the observations. This visualization may not be useful if your data has a large number of observations. But still, see what happens when you try to make one!

5.3 k-means clustering

Now it’s time to make some clusters with our data! You will use k-means clustering, as demonstrated earlier in this tutorial. However, we will change the order in which we do things a little bit, as you’ll see below.

Task 8: For the data you’re using for this assignment, start off by identifying the number of clusters that are ideal. Make a plot and use the “elbow method.”

Task 9: Create clusters in the data, using k-means clustering and the number of clusters that you found to be optimal.

Task 10: Add the clusters as a variable into the original—unstandardized/unscaled—dataset.

Task 11: Inspect the dataset (now with the clusters added), using the View(df) command, replacing df with the name of your dataset. Pick a single cluster to focus on. Look at some of the observations that fall into that single cluster. In what ways are these observations similar or different? You may also want to look back at the unstandardized/unscaled version of the dataset to do this. The scatter and box plots that you are asked to make later in the assignment might help you answer this question.17

Task 12: Make a visualization that uses PCA to plot your clusters.

It is good to have a sense for which variables are helping you create your clusters and which ones are not helping as much. The following tasks are meant to help you practice this process.

When you make the scatterplots in the following tasks, I recommend that you try making them with and without the jitter() option described above to see which version helps you see your data the best.18

Task 13: Choose two variables in the dataset that you feel are important in differentiating among observations. Make a scatterplot that plots these two variables for all observations and shows the separate clusters.19 How much do the clusters overlap, when looking at just these two variables? Ideally they should not overlap much. You can try multiple combinations of variables until you get two that don’t have much overlap in the clusters. These variables are fairly important in differentiating between observations.

Task 14: Choose two variables in the dataset that you feel are not that important in differentiating among observations. Make a scatterplot that plots these two variables for all observations and shows the separate clusters.20 How much do the clusters overlap, when looking at just these two variables? Ideally there should be some overlap. You can try multiple combinations of variables until you get two that have some overlap in the clusters. These variables are less important in differentiating between observations.

Task 15: Try to make a box plot showing one of your data set’s variables that is different when comparing across clusters. If you are using a dummy (binary) variable, make a table instead of a box plot.21

Task 16: Try to make a box plot showing one of your data set’s variables that is similar when comparing across clusters. If you are using a dummy (binary) variable, make a table instead of a box plot.22

5.4 Conclusions

Finally, you will practice drawing some conclusions this week from the analysis you have just completed in the previous tasks.

Task 17: What conclusions can you draw from the unsupervised analysis you just conducted?

Task 18: What did you learn about the data that was new?

Task 19: What did you already suspect about the data that was supported/confirmed?

5.5 Logistics and follow-up

Task 20: Write any feedback you have about this week’s content/assignment or anything else at all (optional).

Task 21: Please write any questions you have this week (optional).

Task 22: Submit your final version of this assignment to the appropriate assignment drop box in D2L.


  1. This list was added on June 24 2021.↩︎

  2. It is fine to use either the cloud version or a version downloaded to your own computer↩︎

  3. This was added on June 21 2021.↩︎

  4. This example is just for illustrative purposes. In this example, we are pretending that the world is flat or that the balloons are in the same town. This wouldn’t work if the balloons were farther apart, because of the earth’s curvature.↩︎

  5. We would have to convert all of the three metrics to the same units—like meters—first.↩︎

  6. This article shows a few other ways to calculate distances: K-means Cluster Analysis. UC Business Analytics R Programming Guide. https://uc-r.github.io/kmeans_clustering.↩︎

  7. K-means Cluster Analysis. UC Business Analytics R Programming Guide. https://uc-r.github.io/kmeans_clustering.↩︎

  8. K-means Cluster Analysis. UC Business Analytics R Programming Guide. https://uc-r.github.io/kmeans_clustering.↩︎

  9. This second version was added on June 21 2021.↩︎

  10. This modification option was added on June 22 2021.↩︎

  11. This section was added on June 22 2021.↩︎

  12. This entire section was added to this tutorial on June 24 2021.↩︎

  13. The 1 was changed to 1.19 after scaling the data.↩︎

  14. K-means Cluster Analysis. UC Business Analytics R Programming Guide. https://uc-r.github.io/kmeans_clustering.↩︎

  15. More specifically, WSS is calculated by taking the center coordinate of all points in a cluster and then seeing how far apart all points are from that center coordinate. It is not required for you to understand WSS in this level of detail.↩︎

  16. This code was modified on June 21 2021. The old code was: df <- read.csv("student-por.csv") and then df <- sapply(df, as.numeric). The new code uses the dummy_cols function from the fastDummies package to convert categorical variables into one or more dummy (binary) variables.↩︎

  17. This suggestion was added on June 25 2021.↩︎

  18. This suggestion was added on June 22 2021.↩︎

  19. You can use any one of the scatterplot options demonstrated earlier. This note was added on June 21 2021.↩︎

  20. You can use any one of the scatterplot options demonstrated earlier. This note was added on June 21 2021.↩︎

  21. This task was added on June 22 2021 and is optional (not required) for students in 2021. It is required for students in 2022 and beyond. Clarification about making a table for dummy variables was added on June 24 2021.↩︎

  22. This task was added on June 22 2021 and is optional (not required) for students in 2021. It is required for students in 2022 and beyond. Clarification about making a table for dummy variables was added on June 24 2021.↩︎