Load the libraries + functions

Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.

#install.packages("pvclust")
library(Rling)
library(cluster)
library(pvclust)
## Warning: package 'pvclust' was built under R version 3.6.1

The Data

The data is from a publication that I worked on in graduate school - focusing on the differences in semantic (meaning) and associative (context) memory. You can view the article if you are interested here - this dataset is a different one but based on the same ideas. Each of the measures provided is a type of distance measure - figuring out how related word-pairs are by examining their features or some other relation between them. They fall into three theoretical categories:

The main goal is to examine if the cluters match what is expected based on theory - and we will cover more of these models and how they work in the next several weeks.

The original dataset includes word pairs as the rows and distance measures as the columns. We want to cluster on the distance measures, so you will want to:

data_set = read.csv("C:/Users/HPi/OneDrive - HP Inc/Pradosh/Personal/Harrisburg/Course 540/385pairs.csv")
rownames(data_set) = paste(data_set[ , 1], data_set[ , 2])
data_set <- data_set[,c(-1,-2)]
cluster_dataset <- t(data_set)

Create Distances

While the data set includes popular distance measures, we still need to figure out how these distance measures are related to each other. Create distance measures in Euclidean distance.

In looking at the distances - what seems immediately obvious about one of the variables? CLearly, the variable “jcn” consistently seems to havelarger Euclidean distance than all other variables.

cluster_dataset.dist = dist(cluster_dataset, method = "euclidean")
cluster_dataset.dist
##                   fsg          bsg     was_comp          cos       lsa419
## bsg      3.304094e+00                                                    
## was_comp 4.686207e+01 4.692190e+01                                       
## cos      6.659115e+00 6.870998e+00 4.261738e+01                          
## lsa419   5.997115e+00 6.216735e+00 4.319664e+01 4.909861e+00             
## lsa300   6.842315e+00 7.038163e+00 4.237142e+01 4.948454e+00 1.184146e+00
## bgl_item 5.645493e+00 5.916190e+00 4.377613e+01 5.203739e+00 3.439947e+00
## bgl_comp 7.295988e+00 7.576532e+00 4.225131e+01 5.623813e+00 4.395417e+00
## t1700    2.900379e+00 3.282844e+00 4.896078e+01 7.853448e+00 6.973351e+00
## t900     2.928459e+00 3.327696e+00 4.897038e+01 7.864880e+00 6.965698e+00
## lch      4.022548e+01 4.041156e+01 2.491672e+01 3.517572e+01 3.604332e+01
## jcn      8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk     1.887924e+01 1.900341e+01 3.709973e+01 1.546005e+01 1.647046e+01
##                lsa300     bgl_item     bgl_comp        t1700         t900
## bsg                                                                      
## was_comp                                                                 
## cos                                                                      
## lsa419                                                                   
## lsa300                                                                   
## bgl_item 3.678593e+00                                                    
## bgl_comp 4.140901e+00 2.796369e+00                                       
## t1700    7.908813e+00 6.516250e+00 8.390504e+00                          
## t900     7.898376e+00 6.505255e+00 8.375305e+00 3.797271e-01             
## lch      3.512777e+01 3.630402e+01 3.454637e+01 4.177355e+01 4.175591e+01
## jcn      8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk     1.603306e+01 1.678751e+01 1.588326e+01 2.023125e+01 2.021113e+01
##                   lch          jcn
## bsg                               
## was_comp                          
## cos                               
## lsa419                            
## lsa300                            
## bgl_item                          
## bgl_comp                          
## t1700                             
## t900                              
## lch                               
## jcn      8.120926e+08             
## lesk     2.739274e+01 8.120926e+08

Create Cluster

cluster_dataset.hc = hclust(cluster_dataset.dist, method = "ward.D2")
plot(cluster_dataset.hc, hang = -1)

Try Again

Clearly there’s one variable that is pretty radically different.

cluster_dataset2 <- cluster_dataset[-12,]

cluster_dataset2.dist = dist(cluster_dataset2, method = "euclidean")
cluster_dataset2.dist
##                 fsg        bsg   was_comp        cos     lsa419     lsa300
## bsg       3.3040943                                                       
## was_comp 46.8620668 46.9219009                                            
## cos       6.6591147  6.8709981 42.6173776                                 
## lsa419    5.9971155  6.2167348 43.1966405  4.9098609                      
## lsa300    6.8423155  7.0381634 42.3714175  4.9484542  1.1841457           
## bgl_item  5.6454931  5.9161902 43.7761316  5.2037392  3.4399470  3.6785934
## bgl_comp  7.2959877  7.5765317 42.2513052  5.6238128  4.3954168  4.1409005
## t1700     2.9003790  3.2828443 48.9607793  7.8534482  6.9733509  7.9088129
## t900      2.9284589  3.3276961 48.9703791  7.8648798  6.9656985  7.8983757
## lch      40.2254782 40.4115600 24.9167194 35.1757249 36.0433219 35.1277715
## lesk     18.8792447 19.0034103 37.0997273 15.4600546 16.4704562 16.0330591
##            bgl_item   bgl_comp      t1700       t900        lch
## bsg                                                            
## was_comp                                                       
## cos                                                            
## lsa419                                                         
## lsa300                                                         
## bgl_item                                                       
## bgl_comp  2.7963691                                            
## t1700     6.5162504  8.3905044                                 
## t900      6.5052547  8.3753055  0.3797271                      
## lch      36.3040197 34.5463723 41.7735521 41.7559057           
## lesk     16.7875062 15.8832554 20.2312470 20.2111268 27.3927395
cluster_dataset2.hc = hclust(cluster_dataset2.dist, method = "ward.D2")
plot(cluster_dataset2.hc, hang = -1)

Silhouette

sapply(2:11, function(x) {
  summary(
    silhouette(
      cutree(cluster_dataset2.hc, x),
      cluster_dataset2.dist
    )
  )$avg.width
}
)
##  [1] 0.7271035 0.6436228 0.5137831 0.3828193 0.3263801 0.3657955 0.2993538
##  [8] 0.3077878 0.2561061 0.1449507

Examine those results

{plot(cluster_dataset2.hc, hang = -1)
  rect.hclust(cluster_dataset2.hc, k = 2)}

Snake Plots

Make a snake plot of the results by plotting a random subset of 25 word pairs. In the notes we used the behavioral profile data, in this example you can use the original dataset without the bad variable. - Use something like random_data = dataframe[ , sample(1:ncol(dataframe), 25)]. - Then calculate the snake plot on that smaller dataset.

What word pairs appear to be most heavily tied to each cluster? Are there any interesting differences you see given the top and bottom most distinguishing pairs? - Note: you can run this a few times to see what you think over a wide variety of plots. Please detail you answer including the pairs, since the knitted version will be a different random run.

Since all the difference values are negative, almost all pairs are tied to second cluster.The ones that are closer to Cluster 2 than cluster one are “dishwasher stove”, “dolphin tuna”, “radish tomato”. The ones at the opposite end of the spectrum are “bureau dresser”, “octopus squid”, “knife sword”. Most of the CLuster 2 words seem synonymous.

set.seed(12345)
random_data = cluster_dataset2[ , sample(1:ncol(cluster_dataset2), 25)]

# save the clusters
cluster_chunks = cutree(cluster_dataset2.hc, k =2)
firstcluster = random_data[names(cluster_chunks[cluster_chunks == 1]), ]
secondcluster = random_data[names(cluster_chunks[cluster_chunks == 2]), ]

# create the differences
cluster_differences = colMeans(firstcluster) - colMeans(secondcluster)

# create the plot
plot(sort(cluster_differences),
     1:length(cluster_differences),
     type = "n",
     xlab = "First Cluster <--> Second Cluster",
     yaxt = "n", ylab = "")
text(sort(cluster_differences),
     1:length(cluster_differences),
     names(sort(cluster_differences)))

Bootstrapping

set.seed(12345)
cluster_dataset2.pvc = pvclust(t(cluster_dataset2),
                            method.hclust = "ward.D2",
                            method.dist = "euclidean")
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
plot(cluster_dataset2.pvc, hang = -1)

Most of the clusters seem to have a very high AU/BP value and are relative close to each other. This indicates that the clusters work well with each other.