This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
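For example, a minimal chunk (here summarizing the built-in cars data set, as in the default R Markdown template; any small piece of R code would do) looks like this:
# a minimal example chunk: summary statistics for the built-in cars data set
summary(cars)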
Use Ctrl+Enter to run a code chunk on a PC, and Command+Enter on a Mac.
In this section, we install and load the necessary packages.
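No extra packages are strictly required for this lab, since kmeans(), dist(), hclust(), and cutree() all come with base R (the stats package). A minimal sketch of what this chunk could contain, with an optional visualization package (factoextra) shown purely as an assumption:
# The clustering functions used below (kmeans, dist, hclust, cutree) ship with base R,
# so nothing needs to be installed. Optionally (assumption, not required by the lab),
# install and load a helper package for cluster visualization:
if (!requireNamespace("factoextra", quietly = TRUE)) install.packages("factoextra")
library(factoextra)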
In this section, we import the necessary data for this lab.
We use the usarrests.csv data set.
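A hedged sketch of the import step, assuming usarrests.csv sits in the working directory (adjust the file path if needed):
# read the csv file into a data frame; the file name and location are assumptions
usarrests <- read.csv("usarrests.csv", stringsAsFactors = FALSE)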
This data set contains crime statistics for the 50 US states. For each state, it records the number of arrests per 100,000 residents for each of three crimes (Murder, Assault, and Rape), plus UrbanPop, the percentage of the state's population living in urban areas. Note that the variables are measured in different units: Murder, Assault, and Rape are counts per 100,000 people, while UrbanPop is a percentage.
The objective is to cluster different US states using crime statistics.
First, familiarize yourself with the data.
Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail().
# Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail()
dim(usarrests)
## [1] 50 5
str(usarrests)
## 'data.frame': 50 obs. of 5 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
colnames(usarrests)
## [1] "State" "Murder" "Assault" "UrbanPop" "Rape"
head(usarrests)
## State Murder Assault UrbanPop Rape
## 1 Alabama 13.2 236 58 21.2
## 2 Alaska 10.0 263 48 44.5
## 3 Arizona 8.1 294 80 31.0
## 4 Arkansas 8.8 190 50 19.5
## 5 California 9.0 276 91 40.6
## 6 Colorado 7.9 204 78 38.7
tail(usarrests)
## State Murder Assault UrbanPop Rape
## 45 Vermont 2.2 48 32 11.2
## 46 Virginia 8.5 156 63 20.7
## 47 Washington 4.0 145 73 26.2
## 48 West Virginia 5.7 81 39 9.3
## 49 Wisconsin 2.6 53 66 10.8
## 50 Wyoming 6.8 161 60 15.6
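Beyond these five functions, a quick look at each variable's spread is helpful for the scaling discussion later; a small sketch using the same column names:
# summary statistics and standard deviations of the four numeric variables;
# Assault has a much larger spread than Murder, Rape, or UrbanPop
summary(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")])
apply(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 2, sd)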
Do the following tasks and answer the questions below.
Use K-Means clustering to cluster the states.
From our previous experience and according to the experts’ view, we are expecting 3 kinds of US state clusters: low, medium and high crime rate. Perform K-means clustering with K = 3.
set.seed(1234) # set a seed so that the k-means results are reproducible
# We are looking for three kinds of US state clusters (low, medium, and high crime rate),
# so in kmeans() we set the number of clusters to 3; the clustering variables are
# "Murder", "Assault", "UrbanPop" and "Rape", and nstart = 20 runs 20 random starts and keeps the best solution
km_usarrests <- kmeans(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 3, nstart = 20)
km_usarrests
## K-means clustering with 3 clusters of sizes 14, 20, 16
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 8.214286 173.2857 70.64286 22.84286
## 2 4.270000 87.5500 59.75000 14.39000
## 3 11.812500 272.5625 68.31250 28.37500
##
## Clustering vector:
## [1] 3 3 3 1 3 1 2 3 3 1 2 2 3 2 2 2 2 3 2 3 1 3 2 3 1 2 2 3 2 1 3 3 3 2 2 1 1 2
## [39] 1 3 2 1 1 2 2 1 1 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 9136.643 19263.760 19563.863
## (between_SS / total_SS = 86.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Question 1: How do you interpret the results? Interpret: (1) Cluster size, (2) Profile the Clusters and (3) Goodness of fit (Within cluster variation).
Please add your response here
# print out the assigned clusters to each state
km_usarrests$cluster
## [1] 3 3 3 1 3 1 2 3 3 1 2 2 3 2 2 2 2 3 2 3 1 3 2 3 1 2 2 3 2 1 3 3 3 2 2 1 1 2
## [39] 1 3 2 1 1 2 2 1 1 2 2 1
# summarize the frequency of each cluster
table(km_usarrests$cluster)
##
## 1 2 3
## 14 20 16
Interpret the outputs here
The table summarizes the cluster sizes: of the 50 states, 14 are assigned to cluster 1, 20 to cluster 2, and 16 to cluster 3.
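To see cluster membership by name rather than by frequency, one option (a sketch using base R's split()) is:
# list the states assigned to each of the three k-means clusters
split(usarrests$State, km_usarrests$cluster)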
# Print the final centroids (center of the cluster)
km_usarrests$centers
## Murder Assault UrbanPop Rape
## 1 8.214286 173.2857 70.64286 22.84286
## 2 4.270000 87.5500 59.75000 14.39000
## 3 11.812500 272.5625 68.31250 28.37500
Interpret the outputs here
The centroid of Cluster 1 corresponds to 8.21 murders, 173.29 assaults, and 22.84 rapes per 100,000 residents, with 70.6% of the population living in urban areas (a medium crime rate). The centroid of Cluster 2 corresponds to 4.27 murders, 87.55 assaults, and 14.39 rapes per 100,000 residents, with 59.8% urban population (a low crime rate). The centroid of Cluster 3 corresponds to 11.81 murders, 272.56 assaults, and 28.38 rapes per 100,000 residents, with 68.3% urban population (a high crime rate).
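A scatter-plot matrix colored by cluster assignment can make these centroid differences visible; a minimal sketch with base graphics:
# pairwise scatter plots of the four variables, colored by k-means cluster
pairs(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")],
      col = km_usarrests$cluster, pch = 19,
      main = "USArrests colored by k-means cluster")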
# Print the within-cluster sum of squares for each cluster
km_usarrests$withinss
## [1] 9136.643 19263.760 19563.863
# Print the total within-cluster sum of squares
km_usarrests$tot.withinss
## [1] 47964.27
# Print the between_SS / total_SS
km_usarrests$betweenss/km_usarrests$totss
## [1] 0.8651961
Interpret the outputs here
The total within-cluster sum of squares is 47964.27.
The between_SS / total_SS ratio is about 86.5%, meaning the three clusters account for roughly 86.5% of the total variation in the data, which indicates a reasonably good fit.
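To put this goodness-of-fit number in context, the total within-cluster sum of squares can be computed for a range of K values (an elbow plot); a hedged sketch using the same seed and columns as above:
# total within-cluster sum of squares for K = 1, ..., 8
set.seed(1234)
X <- usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")]
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")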
Use hierarchical clustering with complete linkage and Euclidean distance to cluster the states.
Cut the dendrogram at a height that results in three distinct clusters.
# The dist() function is used to compute the inter-observation Euclidean distance matrix
# compute distance matrix for X values
d <- dist(as.matrix(usarrests[,c("Murder", "Assault", "UrbanPop", "Rape")]))
# apply hierarchical clustering
hc <- hclust(d, method ="complete")
# We can now plot the dendrograms obtained using the usual plot() function.
plot(hc, labels = usarrests$State, xlab = "States", ylab = "distance")
# To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function
# Cut the dendrogram at a height that results in three distinct clusters.
ct <- cutree(hc, 3)
# Print which states go into each cluster:
for (k in 1:3) {
  print(k)
  print(usarrests$State[ct == k])
}
## [1] 1
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
## [1] 2
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
## [1] 3
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
Question 2: How do you interpret the dendrogram? Which states belong to which clusters with three distinct clusters? List the states for each cluster.
Please add your response here
In the dendrogram, each leaf is a state, and states (or groups of states) fuse at heights equal to their complete-linkage Euclidean dissimilarity, so states that merge low in the tree have similar crime profiles. Cutting the tree at a height that produces three branches gives the following clusters:
Cluster 1: Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, and South Carolina.
Cluster 2: Arkansas, Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, and Wyoming.
Cluster 3: Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, and Wisconsin.
Note that in the USArrests dataset the variables are measured in different units, and this might make the clustering flawed. A good way to handle this problem is to standardize the data so that all standardized variables have a mean of zero and a standard deviation of one; then all variables will be on a comparable scale. To scale the variables before performing hierarchical clustering of the observations, we use the scale() function.
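As a quick check of what scale() does, the standardized columns should come out with mean zero and standard deviation one; a small sketch (note that the chunk below follows the lab hint and calls scale() with center = FALSE, which only divides each column by its root-mean-square instead of centering it first):
# standardized variables: each column of scale(x) has mean 0 and standard deviation 1
usarrests_std <- scale(usarrests[, 2:5])
round(colMeans(usarrests_std), 10)  # effectively zero
apply(usarrests_std, 2, sd)         # all equal to 1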
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
# Inside dist(), use scale(usarrests[, 2:5], center = FALSE) instead of as.matrix(usarrests[, 2:5]),
# then apply hierarchical clustering with hclust()
d <- dist(scale(usarrests[, 2:5], center = FALSE))
hc <- hclust(d, method = "complete")
# plot the dendrogram
plot(hc, labels = usarrests$State, xlab = "States", ylab = "distance")
# Cut the dendrogram at a height that results in three distinct clusters,
# storing the new labels (this overwrites ct from the unscaled clustering)
ct <- cutree(hc, 3)
ct
## [1] 1 2 2 3 2 2 3 3 2 1 3 3 2 3 3 3 3 1 3 2 3 2 3 1 2 3 3 2 3 3 2 2 1 3 3 3 3 3
## [39] 3 1 3 2 2 3 3 3 3 3 3 3
# Print which states go into each cluster in this case:
for (k in 1:3) {
  print(k)
  print(usarrests$State[ct == k])
}
## [1] 1
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina"
## [1] 2
## [1] "Alaska" "Arizona" "California" "Colorado"
## [5] "Florida" "Illinois" "Maryland" "Michigan"
## [9] "Missouri" "Nevada" "New Mexico" "New York"
## [13] "Tennessee" "Texas"
## [1] 3
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Montana" "Nebraska" "New Hampshire" "New Jersey"
## [17] "North Dakota" "Ohio" "Oklahoma" "Oregon"
## [21] "Pennsylvania" "Rhode Island" "South Dakota" "Utah"
## [25] "Vermont" "Virginia" "Washington" "West Virginia"
## [29] "Wisconsin" "Wyoming"
Question 3: Which states belong to which clusters with three distinct clusters? List the states for each cluster.
Please add your response here
Cluster 1: Alabama, Georgia, Louisiana, Mississippi, North Carolina, and South Carolina.
Cluster 2: Alaska, Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Missouri, Nevada, New Mexico, New York, Tennessee, and Texas.
Cluster 3: Arkansas, Connecticut, Delaware, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Dakota, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, and Wyoming.
Question 4: What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
Please add your response here
Scaling changes the hierarchical clustering noticeably. Without scaling, Assault (which has by far the largest variance of the four variables) dominates the Euclidean distances, so the clusters are driven almost entirely by assault rates; after scaling, each variable contributes on a comparable footing and several states move between clusters (for example, the first cluster shrinks from sixteen states to six). In my opinion the variables should be scaled before the inter-observation dissimilarities are computed, because Murder, Assault, and Rape are counts per 100,000 people with very different magnitudes and UrbanPop is a percentage; without scaling, the variable with the largest numeric spread effectively determines the clustering.
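One way to check the effect directly is to cross-tabulate the hierarchical cluster labels obtained without and with scaling; a self-contained sketch that recomputes both three-cluster cuts:
# three-cluster labels from the unscaled and the scaled data
ct_raw    <- cutree(hclust(dist(as.matrix(usarrests[, 2:5])), method = "complete"), 3)
ct_scaled <- cutree(hclust(dist(scale(usarrests[, 2:5], center = FALSE)), method = "complete"), 3)
# cross-tabulate the two labelings; a one-to-one pattern would mean the groupings
# agree up to relabeling, anything else shows states being reassigned by scaling
table(unscaled = ct_raw, scaled = ct_scaled)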