STAT 388 HW 8

We will perform hierarchical clustering. First, we vary the distance metric, which measures how dissimilar two observations are.

Euclidean Distance

#VARIOUS DISTANCES
clusters_euc <- hclust(dist(MLBdata, method = "euclidean"))

## Warning in dist(MLBdata, method = "euclidean"): NAs introduced by coercion

plot(clusters_euc)
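The "NAs introduced by coercion" warning usually means the data frame holds a non-numeric column (for example, team abbreviations) that dist() tries to convert to numbers. The layout of MLBdata is not shown, so the following is only a minimal sketch on a toy data frame; the column names (Team, W, HR, ERA) and all values are hypothetical.

```r
# Toy stand-in for MLBdata: column names and values are hypothetical.
toy <- data.frame(
  Team = c("AAA", "BBB", "CCC", "DDD"),
  W    = c(90, 85, 70, 60),
  HR   = c(200, 180, 150, 140),
  ERA  = c(3.5, 3.9, 4.4, 4.8)
)

# Keep only the numeric columns and carry the team labels as row names,
# so the dendrogram leaves are labelled without passing a character column.
is_num  <- vapply(toy, is.numeric, logical(1))
toy_num <- toy[, is_num, drop = FALSE]
rownames(toy_num) <- toy$Team

d_euc  <- dist(toy_num, method = "euclidean")  # no coercion warning here
hc_euc <- hclust(d_euc)                        # default linkage is "complete"
plot(hc_euc)
```

As far as the dist() documentation describes it, a column that is entirely NA is skipped and the Euclidean and Manhattan sums are rescaled for the missing column, so the warning probably changes the merge heights rather than the shape of the tree. Dropping the column explicitly still makes the input easier to reason about.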

Maximum Distance

clusters_max <- hclust(dist(MLBdata, method = "maximum"))

## Warning in dist(MLBdata, method = "maximum"): NAs introduced by coercion

plot(clusters_max)

Manhattan Distance

clusters_man <- hclust(dist(MLBdata, method = "manhattan"))

## Warning in dist(MLBdata, method = "manhattan"): NAs introduced by coercion

plot(clusters_man)

Comparing the three distance metrics, we observe the following:

-All three create 2 clear clusters, although it can be argued that there are 3 clusters for the maximum distance metric and the Manhattan distance metric

-The first cluster on the left is about the same for both the Euclidean distance and the Manhattan distance

-At first I assumed the teams were grouped into the best and the worst. After looking at the standings, it is not clear what drives the grouping. I do not know much about the teams or baseball statistics, so it is difficult to say what they may be clustered by.

-The maximum distance produces a much different clustering than the other two

-COL, DET, and LAA are grouped together in all three dendrograms
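Reading cluster membership off three dendrograms by eye is error-prone, so one option is to cut each tree into the same number of groups and compare the assignments directly. The sketch below uses the built-in USArrests data as a stand-in for MLBdata (an assumption made only so the example runs on its own) and k = 2 to match the two clusters noted above.

```r
# USArrests is a stand-in for the (numeric part of) MLBdata.
x <- USArrests

hc_euc <- hclust(dist(x, method = "euclidean"))
hc_max <- hclust(dist(x, method = "maximum"))
hc_man <- hclust(dist(x, method = "manhattan"))

# Cut every tree into k = 2 groups and collect the assignments side by side.
grp <- data.frame(
  euclidean = cutree(hc_euc, k = 2),
  maximum   = cutree(hc_max, k = 2),
  manhattan = cutree(hc_man, k = 2),
  row.names = rownames(x)
)

# Cross-tabulate two metrics: large off-diagonal counts mean disagreement.
print(table(euclidean = grp$euclidean, manhattan = grp$manhattan))

# Check whether a chosen set of observations shares a group under each metric.
chosen <- c("Alabama", "Alaska", "Arizona")
together <- vapply(grp[chosen, ], function(g) length(unique(g)) == 1, logical(1))
print(together)
```

With the real data, `chosen` would hold the team labels of interest, such as COL, DET, and LAA.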

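Next, we perform hierarchical clustering with various linkage methods. Whereas the distance metric measures dissimilarity between individual observations, the linkage determines how the distance between two clusters is computed from those pairwise distances.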
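The code for the linkage comparisons is not echoed in this report. A minimal sketch of how the four linkages could be fitted on one distance matrix follows; it uses the built-in USArrests data as a stand-in for MLBdata, and the object names are illustrative rather than the ones used for the original plots.

```r
# USArrests is a stand-in for the (numeric part of) MLBdata.
d <- dist(USArrests, method = "euclidean")

# Fit one tree per linkage from the same distance matrix.
linkages <- c("complete", "average", "mcquitty", "centroid")
fits <- lapply(linkages, function(m) hclust(d, method = m))
names(fits) <- linkages

# Draw the four dendrograms in a 2 x 2 grid for side-by-side comparison.
op <- par(mfrow = c(2, 2))
for (m in linkages) {
  plot(fits[[m]], main = paste("Euclidean,", m), cex = 0.5, xlab = "", sub = "")
}
par(op)
```

Computing the distance matrix once and reusing it also means dist() is called only once per metric rather than once per linkage.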

Euclidean Distance with different linkages (Complete, average, mcquitty, centroid)

## Warning in dist(MLBdata, method = "euclidean"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "euclidean"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "euclidean"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "euclidean"): NAs introduced by coercion

Comparing the linkages under Euclidean distance, we notice:

-The first three linkages create 2 separate clusters, while the centroid linkage creates a very strange dendrogram that technically has 3 clusters but looks like one big one

-COL and DET are next to each other in all 4

-CIN and SDP are next to each other in all 4 (with centroid linkage they fall in different clusters, but just barely)

-Complete and mcquitty produce the most similar clusterings

-Honestly though, I have no idea what variables are creating these clusters

Maximum Distance with different linkages (Complete, average, mcquitty, centroid)

## Warning in dist(MLBdata, method = "maximum"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "maximum"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "maximum"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "maximum"): NAs introduced by coercion

Comparing the linkages under maximum distance, we notice:

-BAL and MIA both branch off into their own groups very early in the first three linkages

-Again, the dendrogram created by the centroid linkage is very different from the rest, as it does not produce distinct clusters

-The average and mcquitty linkages produce similar-looking dendrograms

Manhattan Distance with different linkages (Complete, average, mcquitty, centroid)

## Warning in dist(MLBdata, method = "manhattan"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "manhattan"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "manhattan"): NAs introduced by coercion

## Warning in dist(MLBdata, method = "manhattan"): NAs introduced by coercion

Comparing the linkages under Manhattan distance, we notice:

-The dendrograms produced by the complete and mcquitty linkages are extremely similar

-The dendrogram produced by the centroid linkage is yet again very different from the other 3

-SDP seems to branch off very quickly in the average linkage and the centroid linkage

-Again, I am not sure how to interpret the clusterings; most have two clusters

Overall, complete and mcquitty linkages seem to produce similar dendrograms when the distance metric is the same, while centroid linkage produces a dendrogram quite unlike the others.
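One possible reason the centroid dendrograms look odd: the hclust() documentation notes that centroid (and median) linkage is not monotone, so a later merge can happen at a lower height than an earlier one (an "inversion"), and that centroid linkage is intended for squared Euclidean distances. Whether this explains the plots above is an assumption; the sketch below only shows how inversions could be checked, again with USArrests standing in for MLBdata.

```r
# USArrests is a stand-in for the (numeric part of) MLBdata.
d <- dist(USArrests, method = "euclidean")
linkages <- c("complete", "average", "mcquitty", "centroid")

# hclust() stores merge heights in merge order; if they ever decrease,
# the dendrogram contains at least one inversion.
has_inversion <- vapply(
  linkages,
  function(m) is.unsorted(hclust(d, method = m)$height),
  logical(1)
)
print(has_inversion)

# Centroid linkage on squared distances, as the documentation suggests.
hc_centroid_sq <- hclust(d^2, method = "centroid")
```

A TRUE entry flags a tree whose branch heights cannot be read as a simple nesting of clusters, which makes it hard to pick a cut height.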

One way to judge which clustering is best is the cophenetic correlation. I will compute cophenetic correlations between all the clusterings. From this, I see that the best clusterings use the Euclidean or Manhattan distance with the complete or mcquitty linkage. These clusterings recur most often and agree with each other most closely, which suggests they might be the best.
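The code for the cophenetic correlations is not shown, so the following is a sketch of one way to compute them between every pair of clusterings. It uses USArrests as a stand-in for MLBdata, and the object names are illustrative.

```r
# USArrests is a stand-in for the (numeric part of) MLBdata.
x <- USArrests
metrics  <- c("euclidean", "maximum", "manhattan")
linkages <- c("complete", "average", "mcquitty", "centroid")

# One tree per (metric, linkage) combination: 3 x 4 = 12 clusterings.
combos <- expand.grid(metric = metrics, linkage = linkages,
                      stringsAsFactors = FALSE)
fits <- Map(function(m, l) hclust(dist(x, method = m), method = l),
            combos$metric, combos$linkage)
names(fits) <- paste(combos$metric, combos$linkage, sep = "/")

# cophenetic() gives, for each pair of observations, the height at which
# they are first merged; one column of pairwise heights per clustering.
coph <- sapply(fits, function(h) as.vector(cophenetic(h)))

# Correlation between the cophenetic distances of every pair of clusterings.
coph_cor <- cor(coph)
print(round(coph_cor, 2))
```

High off-diagonal values indicate two trees that merge observations in a similar order. The cophenetic correlation is also commonly computed between a tree and the distance matrix it was built from, which measures how faithfully that tree preserves the original distances instead of how much two trees agree.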