This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
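For example, a minimal chunk (here summarizing the built-in cars data set, as in the default R Markdown template; any small piece of R code would do) looks like this:
# a minimal example chunk: summary statistics for the built-in cars data set
summary(cars)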
Use Ctrl+Enter to run a code chunk on a PC, and Command+Enter on a Mac.
In this section, we install and load the necessary packages.
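No extra packages are strictly required for this lab, since kmeans(), dist(), hclust(), and cutree() all come with base R (the stats package). A minimal sketch of what this chunk could contain, with an optional visualization package (factoextra) shown purely as an assumption:
# The clustering functions used below (kmeans, dist, hclust, cutree) ship with base R,
# so nothing needs to be installed. Optionally (assumption, not required by the lab),
# install and load a helper package for cluster visualization:
if (!requireNamespace("factoextra", quietly = TRUE)) install.packages("factoextra")
library(factoextra)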
In this section, we import the necessary data for this lab.
We use the usarrests.csv data set.
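A hedged sketch of the import step, assuming usarrests.csv sits in the working directory (adjust the file path if needed):
# read the csv file into a data frame; the file name and location are assumptions
usarrests <- read.csv("usarrests.csv", stringsAsFactors = FALSE)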
This data set contains crime statistics for the 50 US states. For each state, it records the number of arrests per 100,000 residents for each of three crimes (Murder, Assault, and Rape), plus UrbanPop, the percentage of the state's population living in urban areas. Note that the variables are measured in different units: Murder, Assault, and Rape are counts per 100,000 people, while UrbanPop is a percentage.
The objective is to cluster different US states using crime statistics.
First, familiarize yourself with the data.
Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail().
# Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail()
dim(usarrests)
## [1] 50 5
str(usarrests)
## 'data.frame': 50 obs. of 5 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
colnames(usarrests)
## [1] "State" "Murder" "Assault" "UrbanPop" "Rape"
head(usarrests)
## State Murder Assault UrbanPop Rape
## 1 Alabama 13.2 236 58 21.2
## 2 Alaska 10.0 263 48 44.5
## 3 Arizona 8.1 294 80 31.0
## 4 Arkansas 8.8 190 50 19.5
## 5 California 9.0 276 91 40.6
## 6 Colorado 7.9 204 78 38.7
tail(usarrests)
## State Murder Assault UrbanPop Rape
## 45 Vermont 2.2 48 32 11.2
## 46 Virginia 8.5 156 63 20.7
## 47 Washington 4.0 145 73 26.2
## 48 West Virginia 5.7 81 39 9.3
## 49 Wisconsin 2.6 53 66 10.8
## 50 Wyoming 6.8 161 60 15.6
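Beyond these five functions, a quick look at each variable's spread is helpful for the scaling discussion later; a small sketch using the same column names:
# summary statistics and standard deviations of the four numeric variables;
# Assault has a much larger spread than Murder, Rape, or UrbanPop
summary(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")])
apply(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 2, sd)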
Do the following tasks and answer the questions below.
Use K-Means clustering to cluster the states.
From our previous experience and according to the experts’ view, we are expecting 3 kinds of US state clusters: low, medium and high crime rate. Perform K-means clustering with K = 3.
set.seed(1234) # set a seed so that the k-means results are reproducible
# We are looking for three kinds of US state clusters (low, medium, and high crime rate),
# so in kmeans() we set the number of clusters to 3; the clustering variables are
# "Murder", "Assault", "UrbanPop" and "Rape", and nstart = 20 runs 20 random starts and keeps the best solution
km_usarrests <- kmeans(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")], 3, nstart = 20)
km_usarrests
## K-means clustering with 3 clusters of sizes 14, 20, 16
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 8.214286 173.2857 70.64286 22.84286
## 2 4.270000 87.5500 59.75000 14.39000
## 3 11.812500 272.5625 68.31250 28.37500
##
## Clustering vector:
## [1] 3 3 3 1 3 1 2 3 3 1 2 2 3 2 2 2 2 3 2 3 1 3 2 3 1 2 2 3 2 1 3 3 3 2 2 1 1 2
## [39] 1 3 2 1 1 2 2 1 1 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 9136.643 19263.760 19563.863
## (between_SS / total_SS = 86.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Question 1: How do you interpret the results? Interpret: (1) Cluster size, (2) Profile the Clusters and (3) Goodness of fit (Within cluster variation).
Please add your response here
# print out the assigned clusters to each state
km_usarrests$cluster
## [1] 3 3 3 1 3 1 2 3 3 1 2 2 3 2 2 2 2 3 2 3 1 3 2 3 1 2 2 3 2 1 3 3 3 2 2 1 1 2
## [39] 1 3 2 1 1 2 2 1 1 2 2 1
# summarize the frequency of each cluster
table(km_usarrests$cluster)
##
## 1 2 3
## 14 20 16
Interpret the outputs here
The table summarizes the cluster sizes: of the 50 states, 14 are assigned to cluster 1, 20 to cluster 2, and 16 to cluster 3.
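To see cluster membership by name rather than by frequency, one option (a sketch using base R's split()) is:
# list the states assigned to each of the three k-means clusters
split(usarrests$State, km_usarrests$cluster)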
# Print the final centroids (center of the cluster)
km_usarrests$centers
## Murder Assault UrbanPop Rape
## 1 8.214286 173.2857 70.64286 22.84286
## 2 4.270000 87.5500 59.75000 14.39000
## 3 11.812500 272.5625 68.31250 28.37500
Interpret the outputs here
The centroid of Cluster 1 corresponds to 8.21 murders, 173.29 assaults, and 22.84 rapes per 100,000 residents, with 70.6% of the population living in urban areas (a medium crime rate). The centroid of Cluster 2 corresponds to 4.27 murders, 87.55 assaults, and 14.39 rapes per 100,000 residents, with 59.8% urban population (a low crime rate). The centroid of Cluster 3 corresponds to 11.81 murders, 272.56 assaults, and 28.38 rapes per 100,000 residents, with 68.3% urban population (a high crime rate).
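A scatter-plot matrix colored by cluster assignment can make these centroid differences visible; a minimal sketch with base graphics:
# pairwise scatter plots of the four variables, colored by k-means cluster
pairs(usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")],
      col = km_usarrests$cluster, pch = 19,
      main = "USArrests colored by k-means cluster")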
# Print the within-cluster sum of squares for each cluster
km_usarrests$withinss
## [1] 9136.643 19263.760 19563.863
# Print the total within-cluster sum of squares
km_usarrests$tot.withinss
## [1] 47964.27
# Print the between_SS / total_SS
km_usarrests$betweenss/km_usarrests$totss
## [1] 0.8651961
Interpret the outputs here
The total within-cluster sum of squares is 47964.27.
The between_SS / total_SS ratio is about 86.5%, meaning the three clusters account for roughly 86.5% of the total variation in the data, which indicates a reasonably good fit.
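To put this goodness-of-fit number in context, the total within-cluster sum of squares can be computed for a range of K values (an elbow plot); a hedged sketch using the same seed and columns as above:
# total within-cluster sum of squares for K = 1, ..., 8
set.seed(1234)
X <- usarrests[, c("Murder", "Assault", "UrbanPop", "Rape")]
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")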
Use hierarchical clustering with complete linkage and Euclidean distance to cluster the states.
Cut the dendrogram at a height that results in three distinct clusters.
# The dist() function is used to compute the inter-observation Euclidean distance matrix
# compute distance matrix for X values
d <- dist(as.matrix(usarrests[,c("Murder", "Assault", "UrbanPop", "Rape")]))
# apply hierarchical clustering
hc <- hclust(d, method ="complete")
# We can now plot the dendrograms obtained using the usual plot() function.
plot(hc, labels = usarrests$State, xlab = "States", ylab = "distance")
# To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function
# Cut the dendrogram at a height that results in three distinct clusters.
ct <- cutree(hc, 3)
# Print which states go into each cluster:
for (k in 1:3) {
  print(k)
  print(usarrests$State[ct == k])
}
## [1] 1
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
## [1] 2
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
## [1] 3
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
Question 2: How do you interpret the dendrogram? Which states belong to which clusters with three distinct clusters? List the states for each cluster.
Please add your response here
In the dendrogram, each leaf is a state, and states (or groups of states) fuse at heights equal to their complete-linkage Euclidean dissimilarity, so states that merge low in the tree have similar crime profiles. Cutting the tree at a height that produces three branches gives the following clusters:
Cluster 1: Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, and South Carolina.
Cluster 2: Arkansas, Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, and Wyoming.
Cluster 3: Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, and Wisconsin.
Note that in the USArrests dataset the variables are measured in different units, and this might make the clustering flawed. A good way to handle this problem is to standardize the data so that all standardized variables have a mean of zero and a standard deviation of one; then all variables will be on a comparable scale. To scale the variables before performing hierarchical clustering of the observations, we use the scale() function.
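As a quick check of what scale() does, the standardized columns should come out with mean zero and standard deviation one; a small sketch (note that the chunk below follows the lab hint and calls scale() with center = FALSE, which only divides each column by its root-mean-square instead of centering it first):
# standardized variables: each column of scale(x) has mean 0 and standard deviation 1
usarrests_std <- scale(usarrests[, 2:5])
round(colMeans(usarrests_std), 10)  # effectively zero
apply(usarrests_std, 2, sd)         # all equal to 1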
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
# Inside dist(), use scale(usarrests[, 2:5], center = FALSE) instead of as.matrix(usarrests[, 2:5]),
# then apply hierarchical clustering with hclust()
d <- dist(scale(usarrests[, 2:5], center = FALSE))
hc <- hclust(d, method = "complete")
# plot the dendrogram
plot(hc, labels = usarrests$State, xlab = "States", ylab = "distance")
# Cut the dendrogram at a height that results in three distinct clusters,
# storing the new labels (this overwrites ct from the unscaled clustering)
ct <- cutree(hc, 3)
ct
## [1] 1 2 2 3 2 2 3 3 2 1 3 3 2 3 3 3 3 1 3 2 3 2 3 1 2 3 3 2 3 3 2 2 1 3 3 3 3 3
## [39] 3 1 3 2 2 3 3 3 3 3 3 3
# Print which states go into each cluster in this case:
for (k in 1:3) {
  print(k)
  print(usarrests$State[ct == k])
}
## [1] 1
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina"
## [1] 2
## [1] "Alaska" "Arizona" "California" "Colorado"
## [5] "Florida" "Illinois" "Maryland" "Michigan"
## [9] "Missouri" "Nevada" "New Mexico" "New York"
## [13] "Tennessee" "Texas"
## [1] 3
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Montana" "Nebraska" "New Hampshire" "New Jersey"
## [17] "North Dakota" "Ohio" "Oklahoma" "Oregon"
## [21] "Pennsylvania" "Rhode Island" "South Dakota" "Utah"
## [25] "Vermont" "Virginia" "Washington" "West Virginia"
## [29] "Wisconsin" "Wyoming"
Question 3: Which states belong to which clusters with three distinct clusters? List the states for each cluster.
Please add your response here
Cluster 1: Alabama, Georgia, Louisiana, Mississippi, North Carolina, and South Carolina.
Cluster 2: Alaska, Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Missouri, Nevada, New Mexico, New York, Tennessee, and Texas.
Cluster 3: Arkansas, Connecticut, Delaware, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Dakota, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, and Wyoming.
Question 4: What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
Please add your response here
Scaling changes the hierarchical clustering noticeably. Without scaling, Assault (which has by far the largest variance of the four variables) dominates the Euclidean distances, so the clusters are driven almost entirely by assault rates; after scaling, each variable contributes on a comparable footing and several states move between clusters (for example, the first cluster shrinks from sixteen states to six). In my opinion the variables should be scaled before the inter-observation dissimilarities are computed, because Murder, Assault, and Rape are counts per 100,000 people with very different magnitudes and UrbanPop is a percentage; without scaling, the variable with the largest numeric spread effectively determines the clustering.
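One way to check the effect directly is to cross-tabulate the hierarchical cluster labels obtained without and with scaling; a self-contained sketch that recomputes both three-cluster cuts:
# three-cluster labels from the unscaled and the scaled data
ct_raw    <- cutree(hclust(dist(as.matrix(usarrests[, 2:5])), method = "complete"), 3)
ct_scaled <- cutree(hclust(dist(scale(usarrests[, 2:5], center = FALSE)), method = "complete"), 3)
# cross-tabulate the two labelings; a one-to-one pattern would mean the groupings
# agree up to relabeling, anything else shows states being reassigned by scaling
table(unscaled = ct_raw, scaled = ct_scaled)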