Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
#install.packages("pvclust")
library(Rling)
library(pvclust)
## Warning: package 'pvclust' was built under R version 3.5.3
library(cluster)
The data is from a publication that I worked on in graduate school - focusing on the differences in semantic (meaning) and associative (context) memory. You can view the article if you are interested here - this dataset is a different one but based on the same ideas. Each of the measures provided is a type of distance measure - figuring out how related word-pairs are by examining their features or some other relation between them. They fall into three theoretical categories:
The main goal is to examine if the cluters match what is expected based on theory - and we will cover more of these models and how they work in the next several weeks.
The original dataset includes word pairs as the rows and distance measures as the columns. We want to cluster on the distance measures, so you will want to:
t()
, as the clustering variables should be rows in the dataframe.dataSetInfo = read.csv("385pairs.csv")
rownames(dataSetInfo) = paste(dataSetInfo[ , 1], dataSetInfo[ , 2])
dataSetInfo <- dataSetInfo[,c(-1,-2)]
datCluster <- t(dataSetInfo)
While the data set includes popular distance measures, we still need to figure out how these distance measures are related to each other. Create distance measures in Euclidean distance.
In looking at the distances - what seems immediately obvious about one of the variables? CLearly, the variable “jcn” consistently seems to havelarger Euclidean distance than all other variables.
datCluster.dist = dist(datCluster, method = "euclidean")
datCluster.dist
## fsg bsg was_comp cos lsa419
## bsg 3.304094e+00
## was_comp 4.686207e+01 4.692190e+01
## cos 6.659115e+00 6.870998e+00 4.261738e+01
## lsa419 5.997115e+00 6.216735e+00 4.319664e+01 4.909861e+00
## lsa300 6.842315e+00 7.038163e+00 4.237142e+01 4.948454e+00 1.184146e+00
## bgl_item 5.645493e+00 5.916190e+00 4.377613e+01 5.203739e+00 3.439947e+00
## bgl_comp 7.295988e+00 7.576532e+00 4.225131e+01 5.623813e+00 4.395417e+00
## t1700 2.900379e+00 3.282844e+00 4.896078e+01 7.853448e+00 6.973351e+00
## t900 2.928459e+00 3.327696e+00 4.897038e+01 7.864880e+00 6.965698e+00
## lch 4.022548e+01 4.041156e+01 2.491672e+01 3.517572e+01 3.604332e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.887924e+01 1.900341e+01 3.709973e+01 1.546005e+01 1.647046e+01
## lsa300 bgl_item bgl_comp t1700 t900
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item 3.678593e+00
## bgl_comp 4.140901e+00 2.796369e+00
## t1700 7.908813e+00 6.516250e+00 8.390504e+00
## t900 7.898376e+00 6.505255e+00 8.375305e+00 3.797271e-01
## lch 3.512777e+01 3.630402e+01 3.454637e+01 4.177355e+01 4.175591e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.603306e+01 1.678751e+01 1.588326e+01 2.023125e+01 2.021113e+01
## lch jcn
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp
## t1700
## t900
## lch
## jcn 8.120926e+08
## lesk 2.739274e+01 8.120926e+08
datCluster.hc = hclust(datCluster.dist, method = "ward.D2")
plot(datCluster.hc, hang = -1)
Clearly there’s one variable that is pretty radically different.
datCluster2 <- datCluster[-12,]
datCluster2.dist = dist(datCluster2, method = "euclidean")
datCluster2.dist
## fsg bsg was_comp cos lsa419 lsa300
## bsg 3.3040943
## was_comp 46.8620668 46.9219009
## cos 6.6591147 6.8709981 42.6173776
## lsa419 5.9971155 6.2167348 43.1966405 4.9098609
## lsa300 6.8423155 7.0381634 42.3714175 4.9484542 1.1841457
## bgl_item 5.6454931 5.9161902 43.7761316 5.2037392 3.4399470 3.6785934
## bgl_comp 7.2959877 7.5765317 42.2513052 5.6238128 4.3954168 4.1409005
## t1700 2.9003790 3.2828443 48.9607793 7.8534482 6.9733509 7.9088129
## t900 2.9284589 3.3276961 48.9703791 7.8648798 6.9656985 7.8983757
## lch 40.2254782 40.4115600 24.9167194 35.1757249 36.0433219 35.1277715
## lesk 18.8792447 19.0034103 37.0997273 15.4600546 16.4704562 16.0330591
## bgl_item bgl_comp t1700 t900 lch
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp 2.7963691
## t1700 6.5162504 8.3905044
## t900 6.5052547 8.3753055 0.3797271
## lch 36.3040197 34.5463723 41.7735521 41.7559057
## lesk 16.7875062 15.8832554 20.2312470 20.2111268 27.3927395
datCluster2.hc = hclust(datCluster2.dist, method = "ward.D2")
plot(datCluster2.hc, hang = -1)
sapply
calculate the average silhouette distances for 2 to n-1 clusters on only the second cluster analysis.sapply(2:11, function(x) { summary( silhouette( cutree(datCluster2.hc, x),
datCluster2.dist
)
)$avg.width
}
)
## [1] 0.7271035 0.6436228 0.5137831 0.3828193 0.3263801 0.3657955 0.2993538
## [8] 0.3077878 0.2561061 0.1449507
Ther is no obvious disticntion that can be considered for thematic/text measures vs semantic. They very much look different from the theoritical listings.
{plot(datCluster2.hc, hang = -1)
rect.hclust(datCluster2.hc, k = 2)}
Make a snake plot of the results by plotting a random subset of 25 word pairs. In the notes we used the behavioral profile data, in this example you can use the original dataset without the bad variable. - Use something like random_data = dataframe[ , sample(1:ncol(dataframe), 25)]. - Then calculate the snake plot on that smaller dataset.
What word pairs appear to be most heavily tied to each cluster? Are there any interesting differences you see given the top and bottom most distinguishing pairs? - Note: you can run this a few times to see what you think over a wide variety of plots. Please detail you answer including the pairs, since the knitted version will be a different random run.
set.seed(12345)
random_data = datCluster2[ , sample(1:ncol(datCluster2), 25)]
# save the clusters
chunkClust = cutree(datCluster2.hc, k =2)
Clustfirst = random_data[names(chunkClust[chunkClust == 1]), ]
Clustsecond = random_data[names(chunkClust[chunkClust == 2]), ]
# create the differences
Clustdiff = colMeans(Clustfirst) - colMeans(Clustsecond)
# create the plot
plot(sort(Clustdiff),
1:length(Clustdiff),
type = "n",
xlab = "First Cluster <--> Second Cluster",
yaxt = "n", ylab = "")
text(sort(Clustdiff),
1:length(Clustdiff),
names(sort(Clustdiff)))
The pairs “dolphin tuna”, “radish tomato” , “dishwasher stove” are close to 2nd Cluster. “octopus squid”, “knife sword” and “bureau dresser” seem to be on the opposite end. Most of the pairs are tied to 2nd cluster and seem to be synonymous.
pvclust
to validate your solution on the dataframe without the bad variable.Cluster seems to work well with eachg other. They have high AU/BP value.
# Adding the seed
set.seed(12345)
datCluster2.pvc = pvclust(t(datCluster2), method.hclust = "ward.D2", method.dist = "euclidean")
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
plot(datCluster2.pvc, hang = -1)