Load all the libraries and functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
#install.packages("pvclust")
library(Rling)
library(cluster)
library(pvclust)
## Warning: package 'pvclust' was built under R version 3.6.1
The data is from a publication that I worked on in graduate school, focusing on the differences in semantic (meaning) and associative (context) memory. You can view the article here if you are interested; this dataset is a different one, but it is based on the same ideas. Each of the measures provided is a type of distance measure, quantifying how related word pairs are by examining their features or some other relation between them. The measures fall into three theoretical categories.
The main goal is to examine whether the clusters match what is expected based on theory; we will cover more of these models and how they work over the next several weeks.
The original dataset includes word pairs as the rows and distance measures as the columns. We want to cluster on the distance measures, so you will want to use t(), as the clustering variables should be rows in the dataframe.
data_set = read.csv("C:/Users/HPi/OneDrive - HP Inc/Pradosh/Personal/Harrisburg/Course 540/385pairs.csv")
rownames(data_set) = paste(data_set[ , 1], data_set[ , 2]) # label each row with its word pair
data_set <- data_set[ , c(-1, -2)] # drop the two word columns, keeping only the distance measures
cluster_dataset <- t(data_set) # transpose so the distance measures become the rows
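As a quick sanity check (a small optional sketch using the objects created above), you can confirm that the transposed data now has the distance measures as rows and the word pairs as columns:
# rows should now be the distance measures, columns the word pairs
dim(cluster_dataset)
rownames(cluster_dataset)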
While the data set includes popular distance measures, we still need to figure out how these distance measures are related to each other. Create a distance matrix for the measures using Euclidean distance.
In looking at the distances, what seems immediately obvious about one of the variables? Clearly, the variable “jcn” consistently has a much larger Euclidean distance to every other variable.
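For reference, the Euclidean distance between two measures here is just the square root of the summed squared differences across all of the word pairs. A small sketch, assuming no missing values in the data, that should reproduce the fsg/bsg entry in the matrix below by hand:
# Euclidean distance computed manually for one pair of measures;
# this should match the fsg/bsg entry printed by dist() below
sqrt(sum((cluster_dataset["fsg", ] - cluster_dataset["bsg", ])^2))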
cluster_dataset.dist = dist(cluster_dataset, method = "euclidean")
cluster_dataset.dist
## fsg bsg was_comp cos lsa419
## bsg 3.304094e+00
## was_comp 4.686207e+01 4.692190e+01
## cos 6.659115e+00 6.870998e+00 4.261738e+01
## lsa419 5.997115e+00 6.216735e+00 4.319664e+01 4.909861e+00
## lsa300 6.842315e+00 7.038163e+00 4.237142e+01 4.948454e+00 1.184146e+00
## bgl_item 5.645493e+00 5.916190e+00 4.377613e+01 5.203739e+00 3.439947e+00
## bgl_comp 7.295988e+00 7.576532e+00 4.225131e+01 5.623813e+00 4.395417e+00
## t1700 2.900379e+00 3.282844e+00 4.896078e+01 7.853448e+00 6.973351e+00
## t900 2.928459e+00 3.327696e+00 4.897038e+01 7.864880e+00 6.965698e+00
## lch 4.022548e+01 4.041156e+01 2.491672e+01 3.517572e+01 3.604332e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.887924e+01 1.900341e+01 3.709973e+01 1.546005e+01 1.647046e+01
## lsa300 bgl_item bgl_comp t1700 t900
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item 3.678593e+00
## bgl_comp 4.140901e+00 2.796369e+00
## t1700 7.908813e+00 6.516250e+00 8.390504e+00
## t900 7.898376e+00 6.505255e+00 8.375305e+00 3.797271e-01
## lch 3.512777e+01 3.630402e+01 3.454637e+01 4.177355e+01 4.175591e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.603306e+01 1.678751e+01 1.588326e+01 2.023125e+01 2.021113e+01
## lch jcn
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp
## t1700
## t900
## lch
## jcn 8.120926e+08
## lesk 2.739274e+01 8.120926e+08
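Rather than scanning the printed matrix, a quick summary of each measure's median distance to all the others makes the outlier easier to see (a small optional sketch using the distance object above):
# median Euclidean distance from each measure to every other measure;
# jcn is several orders of magnitude larger than the rest
dist_mat = as.matrix(cluster_dataset.dist)
sort(round(apply(dist_mat, 1, median), 2), decreasing = TRUE)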
cluster_dataset.hc = hclust(cluster_dataset.dist, method = "ward.D2")
plot(cluster_dataset.hc, hang = -1)
Clearly there is one variable, “jcn”, that is radically different from the rest, so we remove it and recompute the distances.
cluster_dataset2 <- cluster_dataset[-12, ] # drop row 12 ("jcn"), the outlying variable
cluster_dataset2.dist = dist(cluster_dataset2, method = "euclidean")
cluster_dataset2.dist
## fsg bsg was_comp cos lsa419 lsa300
## bsg 3.3040943
## was_comp 46.8620668 46.9219009
## cos 6.6591147 6.8709981 42.6173776
## lsa419 5.9971155 6.2167348 43.1966405 4.9098609
## lsa300 6.8423155 7.0381634 42.3714175 4.9484542 1.1841457
## bgl_item 5.6454931 5.9161902 43.7761316 5.2037392 3.4399470 3.6785934
## bgl_comp 7.2959877 7.5765317 42.2513052 5.6238128 4.3954168 4.1409005
## t1700 2.9003790 3.2828443 48.9607793 7.8534482 6.9733509 7.9088129
## t900 2.9284589 3.3276961 48.9703791 7.8648798 6.9656985 7.8983757
## lch 40.2254782 40.4115600 24.9167194 35.1757249 36.0433219 35.1277715
## lesk 18.8792447 19.0034103 37.0997273 15.4600546 16.4704562 16.0330591
## bgl_item bgl_comp t1700 t900 lch
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp 2.7963691
## t1700 6.5162504 8.3905044
## t900 6.5052547 8.3753055 0.3797271
## lch 36.3040197 34.5463723 41.7735521 41.7559057
## lesk 16.7875062 15.8832554 20.2312470 20.2111268 27.3927395
cluster_dataset2.hc = hclust(cluster_dataset2.dist, method = "ward.D2")
plot(cluster_dataset2.hc, hang = -1)
Use sapply to calculate the average silhouette distances for 2 to n-1 clusters on only the second cluster analysis.
sapply(2:11, function(x) {
summary(
silhouette(
cutree(cluster_dataset2.hc, x),
cluster_dataset2.dist
)
)$avg.width
}
)
## [1] 0.7271035 0.6436228 0.5137831 0.3828193 0.3263801 0.3657955 0.2993538
## [8] 0.3077878 0.2561061 0.1449507
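The average silhouette width is highest for two clusters (about 0.73) and drops steadily after that, so the tree is cut at k = 2 below. If you would rather see the pattern than read the numbers, a small optional sketch that plots the average widths against the number of clusters:
# plot average silhouette width against the number of clusters
avg_widths = sapply(2:11, function(x) {
  summary(silhouette(cutree(cluster_dataset2.hc, x), cluster_dataset2.dist))$avg.width
})
plot(2:11, avg_widths, type = "b",
     xlab = "Number of clusters", ylab = "Average silhouette width")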
{plot(cluster_dataset2.hc, hang = -1)
rect.hclust(cluster_dataset2.hc, k = 2)}
Make a snake plot of the results by plotting a random subset of 25 word pairs. In the notes we used the behavioral profile data; in this example you can use the original dataset without the bad variable. Use something like random_data = dataframe[ , sample(1:ncol(dataframe), 25)], then calculate the snake plot on that smaller dataset.
What word pairs appear to be most heavily tied to each cluster? Are there any interesting differences you see given the top and bottom most distinguishing pairs? Note: you can run this a few times to see what you think over a wide variety of plots. Please detail your answer, including the pairs, since the knitted version will be a different random run.
Since all the difference values are negative, almost all pairs are tied to the second cluster. The pairs closest to Cluster 2 are “dishwasher stove”, “dolphin tuna”, and “radish tomato”. The pairs at the opposite end of the spectrum are “bureau dresser”, “octopus squid”, and “knife sword”. Most of the Cluster 2 words seem synonymous.
set.seed(12345)
random_data = cluster_dataset2[ , sample(1:ncol(cluster_dataset2), 25)]
# save the clusters
cluster_chunks = cutree(cluster_dataset2.hc, k =2)
firstcluster = random_data[names(cluster_chunks[cluster_chunks == 1]), ]
secondcluster = random_data[names(cluster_chunks[cluster_chunks == 2]), ]
# create the differences
cluster_differences = colMeans(firstcluster) - colMeans(secondcluster)
# create the plot
plot(sort(cluster_differences),
1:length(cluster_differences),
type = "n",
xlab = "First Cluster <--> Second Cluster",
yaxt = "n", ylab = "")
text(sort(cluster_differences),
1:length(cluster_differences),
names(sort(cluster_differences)))
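To read the most extreme pairs off directly rather than from the plot labels, you can inspect the sorted difference vector (a small sketch using the objects created above; the exact pairs will change with each random sample):
# word pairs at the two ends of the snake plot
head(sort(cluster_differences), 3) # most negative differences
tail(sort(cluster_differences), 3) # most positive differences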
Use pvclust to validate your solution on the dataframe without the bad variable.
set.seed(12345)
cluster_dataset2.pvc = pvclust(t(cluster_dataset2),
method.hclust = "ward.D2",
method.dist = "euclidean")
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
plot(cluster_dataset2.pvc, hang = -1)
Most of the clusters have very high AU/BP values, and those values are relatively close to each other. This indicates that the clustering solution is well supported by the bootstrap resampling.
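To highlight which clusters are strongly supported, pvclust also provides pvrect(), which boxes the clusters whose AU value exceeds a chosen threshold (a small follow-up sketch using the fitted object above):
# redraw the dendrogram and box clusters with AU >= 95%
plot(cluster_dataset2.pvc, hang = -1)
pvrect(cluster_dataset2.pvc, alpha = 0.95)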