This document guides you through Exercise 5. After this exercise you should be able to calculate cosine similarities and illustrate the result in a heatmap. Make sure you can answer the following questions:
What is cosine similarity?
Why do we need TF-IDF weights for the calculation of cosine similarity measures?
Which R command calculates cosine similarities?
How can we output a matrix of cosine similarities?
Let’s run our code from Exercise 1 to create a dfm of political manifestos:
rm(list = ls())
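library(quanteda)
# In quanteda version 3 and later, the textstat_*() functions live in the
# separate quanteda.textstats package; uncomment the next line if needed:
# library(quanteda.textstats)
library(readtext)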
dat <- readtext("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/manifesto_pdfs/*.pdf",
docvarsfrom = "filenames",
encoding = "UTF-8")
corp <- corpus(dat)
summary(corp)
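# tokenize first; recent quanteda versions no longer accept these options in dfm()
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
# remove German stopwords before stemming
toks <- tokens_remove(toks, stopwords("de"))
# stem with the German Snowball stemmer (the default stemmer is English)
toks <- tokens_wordstem(toks, language = "german")
# unigrams are the default, so no ngrams argument is needed
dfm <- dfm(toks)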
Now we calculate tf-idf scores as in the previous exercise:
# convert to tf-idf
dfm_tfidf <- dfm_tfidf(dfm)
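To make the weighting less of a black box, here is a minimal base R sketch of what dfm_tfidf() does, assuming its default settings (raw term counts multiplied by the base-10 logarithm of the number of documents divided by the document frequency). The toy counts and feature names are made up for illustration and are not taken from the manifestos.

```r
# Toy document-feature counts (hypothetical data, not the manifestos)
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   3, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("steuer", "klima", "rente")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)        # number of documents containing each feature
idf      <- log10(n_docs / doc_freq)   # inverse document frequency, base 10

# multiply every feature column by its idf weight
tfidf <- sweep(counts, 2, idf, `*`)
tfidf
```

Note that "steuer" occurs in all three toy documents, so its idf is log10(3/3) = 0 and its weight vanishes everywhere: features that appear in every document do not help to tell documents apart.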
The command textstat_simil() is quanteda’s command for calculating similarity measures between the documents (or features) of a dfm; its counterpart textstat_dist() calculates distances instead. Let’s read the documentation first:
?textstat_simil
As you can see, there are many options for different similarity measures to choose from. However, the convention in text data analysis is to use cosine similarity. Let’s do that:
tstat.cos <- textstat_simil(dfm_tfidf, method = "cosine", margin = "documents")
as.matrix(tstat.cos)
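To see what the cosine method computes, the measure can be reproduced by hand: it is the dot product of two document vectors divided by the product of their Euclidean lengths. The following base R sketch uses made-up weight vectors; the helper cosine_sim() is a hypothetical name introduced here and is not part of quanteda.

```r
# Toy weight vectors for three documents (hypothetical values)
a <- c(1, 2, 0)
b <- c(2, 4, 1)
m <- rbind(doc1 = a, doc2 = b, doc3 = c(0, 0, 3))

# cosine similarity of two vectors: dot product over product of lengths
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_sim(a, b)

# the same for all pairs of rows at once
norms <- sqrt(rowSums(m^2))
sim   <- tcrossprod(m) / outer(norms, norms)
sim
```

The diagonal of sim is 1 because every document is perfectly similar to itself, and doc1 and doc3 get a similarity of 0 because they share no feature with a non-zero weight.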
Finally, we could visualize our results using a heatmap and the levelplot() command from the lattice package. However, note that the graph is far from perfect. What are its shortcomings?
# convert to matrix format (named sim_mat so that base R's matrix() is not masked)
sim_mat <- as.matrix(tstat.cos)
# Optional: visualisation with heatmap
library(lattice)
# blank out the diagonal: every document has similarity 1 with itself
diag(sim_mat) <- NA
?levelplot
levelplot(sim_mat, col.regions = heat.colors(20),
          xlab = "Party Manifesto", ylab = "Party Manifesto", main = "Cosine Similarity",
          scales = list(x = list(rot = 90)))
Copyright (c) Felix Hagemeister, 2020