Load all the libraries and functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
#install.packages("pvclust")
library(Rling)
library(cluster)
library(pvclust)
## Warning: package 'pvclust' was built under R version 3.6.1
The data is from a publication that I worked on in graduate school, focusing on the differences in semantic (meaning) and associative (context) memory. You can view the article here if you are interested; this dataset is a different one, but it is based on the same ideas. Each of the measures provided is a type of distance measure, quantifying how related word pairs are by examining their features or some other relation between them. The measures fall into three theoretical categories.
The main goal is to examine whether the clusters match what is expected based on theory; we will cover more of these models and how they work over the next several weeks.
The original dataset includes word pairs as the rows and distance measures as the columns. We want to cluster on the distance measures, so you will want to use t(), as the clustering variables should be rows in the dataframe.
data_set = read.csv("C:/Users/HPi/OneDrive - HP Inc/Pradosh/Personal/Harrisburg/Course 540/385pairs.csv")
rownames(data_set) = paste(data_set[ , 1], data_set[ , 2]) # label each row with its word pair
data_set <- data_set[ , c(-1, -2)] # drop the two word columns, keeping only the distance measures
cluster_dataset <- t(data_set) # transpose so the distance measures become the rows
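As a quick sanity check (a small optional sketch using the objects created above), you can confirm that the transposed data now has the distance measures as rows and the word pairs as columns:
# rows should now be the distance measures, columns the word pairs
dim(cluster_dataset)
rownames(cluster_dataset)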
While the data set includes popular distance measures, we still need to figure out how these distance measures are related to each other. Create a distance matrix for the measures using Euclidean distance.
In looking at the distances, what seems immediately obvious about one of the variables? Clearly, the variable “jcn” consistently has a much larger Euclidean distance to every other variable.
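For reference, the Euclidean distance between two measures here is just the square root of the summed squared differences across all of the word pairs. A small sketch, assuming no missing values in the data, that should reproduce the fsg/bsg entry in the matrix below by hand:
# Euclidean distance computed manually for one pair of measures;
# this should match the fsg/bsg entry printed by dist() below
sqrt(sum((cluster_dataset["fsg", ] - cluster_dataset["bsg", ])^2))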
cluster_dataset.dist = dist(cluster_dataset, method = "euclidean")
cluster_dataset.dist
## fsg bsg was_comp cos lsa419
## bsg 3.304094e+00
## was_comp 4.686207e+01 4.692190e+01
## cos 6.659115e+00 6.870998e+00 4.261738e+01
## lsa419 5.997115e+00 6.216735e+00 4.319664e+01 4.909861e+00
## lsa300 6.842315e+00 7.038163e+00 4.237142e+01 4.948454e+00 1.184146e+00
## bgl_item 5.645493e+00 5.916190e+00 4.377613e+01 5.203739e+00 3.439947e+00
## bgl_comp 7.295988e+00 7.576532e+00 4.225131e+01 5.623813e+00 4.395417e+00
## t1700 2.900379e+00 3.282844e+00 4.896078e+01 7.853448e+00 6.973351e+00
## t900 2.928459e+00 3.327696e+00 4.897038e+01 7.864880e+00 6.965698e+00
## lch 4.022548e+01 4.041156e+01 2.491672e+01 3.517572e+01 3.604332e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.887924e+01 1.900341e+01 3.709973e+01 1.546005e+01 1.647046e+01
## lsa300 bgl_item bgl_comp t1700 t900
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item 3.678593e+00
## bgl_comp 4.140901e+00 2.796369e+00
## t1700 7.908813e+00 6.516250e+00 8.390504e+00
## t900 7.898376e+00 6.505255e+00 8.375305e+00 3.797271e-01
## lch 3.512777e+01 3.630402e+01 3.454637e+01 4.177355e+01 4.175591e+01
## jcn 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk 1.603306e+01 1.678751e+01 1.588326e+01 2.023125e+01 2.021113e+01
## lch jcn
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp
## t1700
## t900
## lch
## jcn 8.120926e+08
## lesk 2.739274e+01 8.120926e+08
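Rather than scanning the printed matrix, a quick summary of each measure's median distance to all the others makes the outlier easier to see (a small optional sketch using the distance object above):
# median Euclidean distance from each measure to every other measure;
# jcn is several orders of magnitude larger than the rest
dist_mat = as.matrix(cluster_dataset.dist)
sort(round(apply(dist_mat, 1, median), 2), decreasing = TRUE)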
cluster_dataset.hc = hclust(cluster_dataset.dist, method = "ward.D2")
plot(cluster_dataset.hc, hang = -1)
Clearly there is one variable, “jcn”, that is radically different from the rest, so we remove it and recompute the distances.
cluster_dataset2 <- cluster_dataset[-12, ] # drop row 12 ("jcn"), the outlying variable
cluster_dataset2.dist = dist(cluster_dataset2, method = "euclidean")
cluster_dataset2.dist
## fsg bsg was_comp cos lsa419 lsa300
## bsg 3.3040943
## was_comp 46.8620668 46.9219009
## cos 6.6591147 6.8709981 42.6173776
## lsa419 5.9971155 6.2167348 43.1966405 4.9098609
## lsa300 6.8423155 7.0381634 42.3714175 4.9484542 1.1841457
## bgl_item 5.6454931 5.9161902 43.7761316 5.2037392 3.4399470 3.6785934
## bgl_comp 7.2959877 7.5765317 42.2513052 5.6238128 4.3954168 4.1409005
## t1700 2.9003790 3.2828443 48.9607793 7.8534482 6.9733509 7.9088129
## t900 2.9284589 3.3276961 48.9703791 7.8648798 6.9656985 7.8983757
## lch 40.2254782 40.4115600 24.9167194 35.1757249 36.0433219 35.1277715
## lesk 18.8792447 19.0034103 37.0997273 15.4600546 16.4704562 16.0330591
## bgl_item bgl_comp t1700 t900 lch
## bsg
## was_comp
## cos
## lsa419
## lsa300
## bgl_item
## bgl_comp 2.7963691
## t1700 6.5162504 8.3905044
## t900 6.5052547 8.3753055 0.3797271
## lch 36.3040197 34.5463723 41.7735521 41.7559057
## lesk 16.7875062 15.8832554 20.2312470 20.2111268 27.3927395
cluster_dataset2.hc = hclust(cluster_dataset2.dist, method = "ward.D2")
plot(cluster_dataset2.hc, hang = -1)
Use sapply to calculate the average silhouette distances for 2 to n-1 clusters on only the second cluster analysis.
sapply(2:11, function(x) {
summary(
silhouette(
cutree(cluster_dataset2.hc, x),
cluster_dataset2.dist
)
)$avg.width
}
)
## [1] 0.7271035 0.6436228 0.5137831 0.3828193 0.3263801 0.3657955 0.2993538
## [8] 0.3077878 0.2561061 0.1449507
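The average silhouette width is highest for two clusters (about 0.73) and drops steadily after that, so the tree is cut at k = 2 below. If you would rather see the pattern than read the numbers, a small optional sketch that plots the average widths against the number of clusters:
# plot average silhouette width against the number of clusters
avg_widths = sapply(2:11, function(x) {
  summary(silhouette(cutree(cluster_dataset2.hc, x), cluster_dataset2.dist))$avg.width
})
plot(2:11, avg_widths, type = "b",
     xlab = "Number of clusters", ylab = "Average silhouette width")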
{plot(cluster_dataset2.hc, hang = -1)
rect.hclust(cluster_dataset2.hc, k = 2)}
Make a snake plot of the results by plotting a random subset of 25 word pairs. In the notes we used the behavioral profile data; in this example you can use the original dataset without the bad variable. Use something like random_data = dataframe[ , sample(1:ncol(dataframe), 25)], then calculate the snake plot on that smaller dataset.
What word pairs appear to be most heavily tied to each cluster? Are there any interesting differences you see given the top and bottom most distinguishing pairs? Note: you can run this a few times to see what you think over a wide variety of plots. Please detail your answer, including the pairs, since the knitted version will be a different random run.
Since all the difference values are negative, almost all pairs are tied to the second cluster. The pairs closest to Cluster 2 are “dishwasher stove”, “dolphin tuna”, and “radish tomato”. The pairs at the opposite end of the spectrum are “bureau dresser”, “octopus squid”, and “knife sword”. Most of the Cluster 2 words seem synonymous.
set.seed(12345)
random_data = cluster_dataset2[ , sample(1:ncol(cluster_dataset2), 25)]
# save the clusters
cluster_chunks = cutree(cluster_dataset2.hc, k =2)
firstcluster = random_data[names(cluster_chunks[cluster_chunks == 1]), ]
secondcluster = random_data[names(cluster_chunks[cluster_chunks == 2]), ]
# create the differences
cluster_differences = colMeans(firstcluster) - colMeans(secondcluster)
# create the plot
plot(sort(cluster_differences),
1:length(cluster_differences),
type = "n",
xlab = "First Cluster <--> Second Cluster",
yaxt = "n", ylab = "")
text(sort(cluster_differences),
1:length(cluster_differences),
names(sort(cluster_differences)))
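To read the most extreme pairs off directly rather than from the plot labels, you can inspect the sorted difference vector (a small sketch using the objects created above; the exact pairs will change with each random sample):
# word pairs at the two ends of the snake plot
head(sort(cluster_differences), 3) # most negative differences
tail(sort(cluster_differences), 3) # most positive differences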
Use pvclust to validate your solution on the dataframe without the bad variable.
set.seed(12345)
cluster_dataset2.pvc = pvclust(t(cluster_dataset2),
method.hclust = "ward.D2",
method.dist = "euclidean")
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
plot(cluster_dataset2.pvc, hang = -1)
Most of the clusters have very high AU/BP values, and those values are relatively close to each other. This indicates that the clustering solution is well supported by the bootstrap resampling.
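To highlight which clusters are strongly supported, pvclust also provides pvrect(), which boxes the clusters whose AU value exceeds a chosen threshold (a small follow-up sketch using the fitted object above):
# redraw the dendrogram and box clusters with AU >= 95%
plot(cluster_dataset2.pvc, hang = -1)
pvrect(cluster_dataset2.pvc, alpha = 0.95)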