ANLY540 - Analysis of Human Language - Assignment 7: Profiles and Clustering

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(Rling)
library(cluster)
library(pvclust)

The Data

The data is from a publication that I worked on in graduate school, focusing on the differences between semantic (meaning) and associative (context) memory. You can view the article here if you are interested; this dataset is a different one but is based on the same ideas. Each of the measures provided is a type of distance measure: it quantifies how related word pairs are by examining their features or some other relation between them. They fall into three theoretical categories:

Association measures: fsg, bsg, was_comp
Semantic measures: cos, jcn, lesk, lch
Thematic/Text measures: lsa419, lsa300, bgl_item, bgl_comp, t1700, t900

The main goal is to examine whether the clusters match what is expected based on theory - and we will cover more of these models and how they work in the next several weeks.

The original dataset includes word pairs as the rows and distance measures as the columns. We want to cluster on the distance measures, so you will want to:

Load the data.
Use rownames(dataframe_name) = paste(dataframe_name[ , 1], dataframe_name[ , 2]) to set the rownames as the word-pairs from the data.
Delete column 1 and 2 from the data.
Flip the data using t(), as the clustering variables should be rows in the dataframe.

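# Load the word-pair data
data <- read.csv("385pairs.csv")
# Use the word pairs as row names, then drop the two word columns
rownames(data) <- paste(data[ , 1], data[ , 2])
data <- data[ , -c(1, 2)]
# Transpose so that the distance measures are the rows to be clustered
cluster_data <- t(data)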
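
The four preparation steps can be tried on a small stand-in data frame before running them on the real file. This is only a sketch: the `toy` data frame, its column names (`cue`, `target`) and its values are made up for illustration and are not taken from 385pairs.csv.

```r
# Toy stand-in for the word-pair file; names and values are invented.
toy <- data.frame(
  cue    = c("cat", "hut", "sun"),
  target = c("dog", "shack", "moon"),
  fsg    = c(0.40, 0.10, 0.25),
  cos    = c(0.55, 0.70, 0.30)
)

# Label each row with its word pair, then drop the two word columns.
rownames(toy) <- paste(toy[ , 1], toy[ , 2])
toy <- toy[ , -c(1, 2)]

# Transpose so that each measure becomes a row (the unit being clustered).
toy_t <- t(toy)

dim(toy_t)       # measures by word pairs
rownames(toy_t)  # the measure names
colnames(toy_t)  # the word pairs
```

After the transpose, the measures are the row names and the word pairs are the column names, which is the layout dist() needs in order to compare measures rather than word pairs.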

Create Distances

While the data set includes popular distance measures, we still need to figure out how these distance measures are related to each other. Create distance measures in Euclidean distance.

In looking at the distances - what seems immediately obvious about one of the variables?

cluster_data.dist = dist(cluster_data, method = "euclidean")
cluster_data.dist

##                   fsg          bsg     was_comp          cos       lsa419
## bsg      3.304094e+00                                                    
## was_comp 4.686207e+01 4.692190e+01                                       
## cos      6.659115e+00 6.870998e+00 4.261738e+01                          
## lsa419   5.997115e+00 6.216735e+00 4.319664e+01 4.909861e+00             
## lsa300   6.842315e+00 7.038163e+00 4.237142e+01 4.948454e+00 1.184146e+00
## bgl_item 5.645493e+00 5.916190e+00 4.377613e+01 5.203739e+00 3.439947e+00
## bgl_comp 7.295988e+00 7.576532e+00 4.225131e+01 5.623813e+00 4.395417e+00
## t1700    2.900379e+00 3.282844e+00 4.896078e+01 7.853448e+00 6.973351e+00
## t900     2.928459e+00 3.327696e+00 4.897038e+01 7.864880e+00 6.965698e+00
## lch      4.022548e+01 4.041156e+01 2.491672e+01 3.517572e+01 3.604332e+01
## jcn      8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk     1.887924e+01 1.900341e+01 3.709973e+01 1.546005e+01 1.647046e+01
##                lsa300     bgl_item     bgl_comp        t1700         t900
## bsg                                                                      
## was_comp                                                                 
## cos                                                                      
## lsa419                                                                   
## lsa300                                                                   
## bgl_item 3.678593e+00                                                    
## bgl_comp 4.140901e+00 2.796369e+00                                       
## t1700    7.908813e+00 6.516250e+00 8.390504e+00                          
## t900     7.898376e+00 6.505255e+00 8.375305e+00 3.797271e-01             
## lch      3.512777e+01 3.630402e+01 3.454637e+01 4.177355e+01 4.175591e+01
## jcn      8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08 8.120926e+08
## lesk     1.603306e+01 1.678751e+01 1.588326e+01 2.023125e+01 2.021113e+01
##                   lch          jcn
## bsg                               
## was_comp                          
## cos                               
## lsa419                            
## lsa300                            
## bgl_item                          
## bgl_comp                          
## t1700                             
## t900                              
## lch                               
## jcn      8.120926e+08             
## lesk     2.739274e+01 8.120926e+08

Looking at the distances, ‘jcn’ stands out: its Euclidean distance to every other variable is about 8.12e+08, while every other pairwise distance is below 50.
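
One way such a pattern can arise is that a single variable contains values far larger in magnitude than the others. Whether that is what happens with jcn is an assumption here, since the raw values are not shown. The sketch below uses made-up numbers (`m`, with a hypothetical row `big`) to show how one large-valued row dominates every Euclidean distance it takes part in.

```r
# Three made-up measures over four word pairs; "big" mimics a measure
# whose values are far larger than those of the others.
m <- rbind(
  a   = c(0.1, 0.2, 0.3, 0.4),
  b   = c(0.2, 0.1, 0.4, 0.3),
  big = c(1e6, 2e6, 3e6, 4e6)
)

# Per-measure minimum and maximum: a quick check for an outlying scale.
t(apply(m, 1, range))

# Every distance that involves "big" is dominated by its raw magnitude.
d <- as.matrix(dist(m, method = "euclidean"))
round(d, 2)
```

Running the same range check on the rows of cluster_data would be one way to see what sets jcn apart.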

Create Cluster

Use hierarchical clustering to examine the relatedness of these measures.
Create a dendrogram plot of the results.

cluster_data.hc = hclust(cluster_data.dist, method = "ward.D2")
plot(cluster_data.hc, hang = -1)

Try Again

Clearly there’s one variable that is pretty radically different.

Remove that variable from the original dataset.
Rerun the distance and cluster measures below.
Create a new plot of the cluster analysis (the branches may be hard to see but they are clearly separating out more).

‘jcn’ is removed and the analysis is re-run.

# Drop jcn by name rather than by row position
cluster_data2 <- cluster_data[rownames(cluster_data) != "jcn", ]

cluster_data2.dist = dist(cluster_data2, method = "euclidean")
cluster_data2.dist

##                 fsg        bsg   was_comp        cos     lsa419     lsa300
## bsg       3.3040943                                                       
## was_comp 46.8620668 46.9219009                                            
## cos       6.6591147  6.8709981 42.6173776                                 
## lsa419    5.9971155  6.2167348 43.1966405  4.9098609                      
## lsa300    6.8423155  7.0381634 42.3714175  4.9484542  1.1841457           
## bgl_item  5.6454931  5.9161902 43.7761316  5.2037392  3.4399470  3.6785934
## bgl_comp  7.2959877  7.5765317 42.2513052  5.6238128  4.3954168  4.1409005
## t1700     2.9003790  3.2828443 48.9607793  7.8534482  6.9733509  7.9088129
## t900      2.9284589  3.3276961 48.9703791  7.8648798  6.9656985  7.8983757
## lch      40.2254782 40.4115600 24.9167194 35.1757249 36.0433219 35.1277715
## lesk     18.8792447 19.0034103 37.0997273 15.4600546 16.4704562 16.0330591
##            bgl_item   bgl_comp      t1700       t900        lch
## bsg                                                            
## was_comp                                                       
## cos                                                            
## lsa419                                                         
## lsa300                                                         
## bgl_item                                                       
## bgl_comp  2.7963691                                            
## t1700     6.5162504  8.3905044                                 
## t900      6.5052547  8.3753055  0.3797271                      
## lch      36.3040197 34.5463723 41.7735521 41.7559057           
## lesk     16.7875062 15.8832554 20.2312470 20.2111268 27.3927395

cluster_data2.hc = hclust(cluster_data2.dist, method = "ward.D2")
plot(cluster_data2.hc, hang = -1)

Silhouette

Using sapply, calculate the average silhouette widths for 2 to n-1 clusters on only the second cluster analysis.

sapply(2:11, function(x) {
  summary(
    silhouette(
      cutree(cluster_data2.hc, x),
      cluster_data2.dist
    )
  )$avg.width
}
)

##  [1] 0.7271035 0.6436228 0.5137831 0.3828193 0.3263801 0.3657955 0.2993538
##  [8] 0.3077878 0.2561061 0.1449507
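
The first value in this output corresponds to k = 2, the second to k = 3, and so on. A minimal sketch of how the best k could be picked programmatically is shown below. It runs on made-up data (`toy`), and the names `ks`, `avg_width` and `best_k` are illustrative rather than part of the assignment.

```r
library(cluster)  # provides silhouette()

# Made-up data: six items forming two well-separated groups.
toy <- rbind(
  a1 = c(0.0, 0.1), a2 = c(0.1, 0.0), a3 = c(0.1, 0.1),
  b1 = c(5.0, 5.1), b2 = c(5.1, 5.0), b3 = c(5.1, 5.1)
)
toy_dist <- dist(toy, method = "euclidean")
toy_hc   <- hclust(toy_dist, method = "ward.D2")

# Average silhouette width for k = 2 .. n - 1, labelled by k.
ks <- 2:(nrow(toy) - 1)
avg_width <- sapply(ks, function(k) {
  summary(silhouette(cutree(toy_hc, k), toy_dist))$avg.width
})
names(avg_width) <- ks

# The k with the largest average silhouette width.
best_k <- ks[which.max(avg_width)]
best_k
```

Naming the vector by k avoids the off-by-one slip of reading a position in the output as the number of clusters.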

Examine those results

Replot the dendrogram with cluster markers based on the highest silhouette value.
Interpret the results - do these match the theoretical listings we expected?

{plot(cluster_data2.hc, hang = -1)
  rect.hclust(cluster_data2.hc, k = 2)}

The clusters do not match our theoretical expectations: there is no clear separation between the association, semantic, and thematic/text measures. With two clusters, all of the thematic/text measures fall into a single cluster, but that cluster also contains every association and semantic measure except was_comp and lch, which form a cluster of their own. The semantic measure jcn had already been removed because it was so different from all the other measures.
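
This comparison can also be laid out as a cross-tabulation of theoretical category against cluster. In the sketch below, `theory` follows the category lists given earlier in the assignment, while `membership` is a hand-written stand-in for cutree(cluster_data2.hc, k = 2) based on the description above. It is not computed from the data, and the cluster labels 1 and 2 are assumed.

```r
# Theoretical grouping from the assignment text (jcn omitted, as removed).
theory <- c(fsg = "association", bsg = "association", was_comp = "association",
            cos = "semantic", lch = "semantic", lesk = "semantic",
            lsa419 = "thematic", lsa300 = "thematic", bgl_item = "thematic",
            bgl_comp = "thematic", t1700 = "thematic", t900 = "thematic")

# Hand-written stand-in for the two-cluster solution: was_comp and lch
# sit apart from everything else (labels 1 and 2 are assumed).
membership <- ifelse(names(theory) %in% c("was_comp", "lch"), 2, 1)
names(membership) <- names(theory)

# Rows: theoretical category; columns: cluster.
tab <- table(theory, membership)
tab
```

If the clusters matched the theory, each row of the table would have its counts concentrated in a single column that no other row shares.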

Snake Plots

Make a snake plot of the results by plotting a random subset of 25 word pairs. In the notes we used the behavioral profile data; in this example you can use the original dataset without the bad variable.

Use something like random_data = dataframe[ , sample(1:ncol(dataframe), 25)].
Then calculate the snake plot on that smaller dataset.

What word pairs appear to be most heavily tied to each cluster? Are there any interesting differences you see given the top and bottom most distinguishing pairs?

Note: you can run this a few times to see what you think over a wide variety of plots. Please detail your answer, including the pairs, since the knitted version will be a different random run.

set.seed(12345)
random_data = cluster_data2[ , sample(1:ncol(cluster_data2), 25)]

# save the clusters
clustercut = cutree(cluster_data2.hc, k = 2)
cluster1 = random_data[names(clustercut[clustercut == 1]), ]
cluster2 = random_data[names(clustercut[clustercut == 2]), ]

# create the differences
differences = colMeans(cluster1) - colMeans(cluster2)

# create the plot
plot(sort(differences),
     1:length(differences),
     type = "n",
     xlab = "Cluster 2 <--> Cluster 1",
     yaxt = "n", ylab = "")
text(sort(differences),
     1:length(differences),
     names(sort(differences)))

Almost all of the sampled word pairs seem to be tied to Cluster 2, since the difference values are less than 0. “cathedral church” and “hut shack” are the most strongly tied to Cluster 2, while “biscuit chicken” and “submarine subway” are the least tied to it. Interestingly, the pairs most tied to Cluster 2 seem to be synonyms.
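
The pairs at either end of the snake plot can also be read off numerically from the sorted differences. The sketch below uses made-up data: `toy` and the cluster assignment `toy_cut` are hypothetical stand-ins for random_data and clustercut.

```r
# Made-up example: four measures (rows) by five word pairs (columns).
set.seed(1)
toy <- matrix(runif(20), nrow = 4,
              dimnames = list(c("m1", "m2", "m3", "m4"),
                              paste0("pair", 1:5)))

# Hypothetical two-cluster assignment, in the format cutree() returns.
toy_cut <- c(m1 = 1, m2 = 1, m3 = 2, m4 = 2)

# drop = FALSE keeps a matrix even if a cluster has a single member,
# so colMeans() still works.
c1 <- toy[names(toy_cut)[toy_cut == 1], , drop = FALSE]
c2 <- toy[names(toy_cut)[toy_cut == 2], , drop = FALSE]
diffs <- sort(colMeans(c1) - colMeans(c2))

head(diffs, 2)  # lowest differences: relatively higher in cluster 2
tail(diffs, 2)  # highest differences: relatively higher in cluster 1
```

Because the differences are cluster 1 means minus cluster 2 means, a negative value indicates a pair that scores higher on average in cluster 2.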

Bootstrapping

Use pvclust to validate your solution on the dataframe without the bad variable.
Plot the pv cluster.
How well do our clusters appear to work?

set.seed(12345)
cluster_data2.pvc = pvclust(t(cluster_data2),
                            method.hclust = "ward.D2",
                            method.dist = "euclidean")

## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.

plot(cluster_data2.pvc, hang = -1)

The clusters seem to work well overall. The only relatively unstable cluster is Cluster 4, which has low AU and BP values.
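
To give some intuition for the BP value, here is a simplified base-R sketch of a plain bootstrap probability. It is not pvclust's algorithm: pvclust resamples at several sample sizes to derive AU and checks for a cluster anywhere in the tree, whereas this sketch resamples only at the original size and only checks the two-cluster cut. The data (`toy`), the helper `has_cluster` and `n_boot` are all made up for illustration.

```r
# Made-up data: 4 measures (rows) by 30 word pairs (columns); m3 and m4
# are shifted so that they form their own group.
set.seed(42)
toy <- matrix(rnorm(120), nrow = 4,
              dimnames = list(paste0("m", 1:4), NULL))
toy[3:4, ] <- toy[3:4, ] + 5

# TRUE if cutting the tree at k = 2 isolates exactly the target measures.
has_cluster <- function(x, target) {
  cut <- cutree(hclust(dist(x), method = "ward.D2"), k = 2)
  members <- names(cut)[cut == cut[target[1]]]
  setequal(members, target)
}

# Resample word pairs (columns) with replacement and recluster each time.
n_boot <- 200
hits <- replicate(n_boot, {
  idx <- sample(ncol(toy), replace = TRUE)
  has_cluster(toy[ , idx, drop = FALSE], c("m3", "m4"))
})
mean(hits)  # share of resamples in which {m3, m4} reappears
```

A cluster that reappears in nearly every resample is stable in this sense, while a low share points to a grouping that depends on which word pairs happened to be sampled.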