This document guides you through exercise 7. Please try to follow the instructions on your own PC and feel free to ask questions if something is unclear. After this exercise you should be able to run topic models on text data. In particular, make sure you can answer the following questions:
What is a topic model?
How do I read in .csv data?
How can I subset data?
How can I visualize the most frequent hashtags in a network plot?
What is the difference between k-means clustering and LDA? How are these methods implemented?
When people tweet about the novel coronavirus and COVID-19, which topics do they talk about? In this exercise we'll use a corpus of UK and US tweets from April 2020 and run topic models.
Let’s first clear the environment and load the data (source: https://www.kaggle.com/smid80/coronavirus-covid19-tweets-early-april/version/3):
rm(list = ls())
library(quanteda)
# install.packages("readtext")
library(readtext)
# install.packages("lubridate")
library(lubridate)
# install.packages("topicmodels")
library(topicmodels)
# Let's use a corpus of tweets about the Corona virus:
?read.table
data <- read.delim("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/corona_tweets/2020-03-29_Coronavirus_Tweets.csv",
                   header = TRUE, sep = ",", dec = ".", stringsAsFactors = FALSE, fill = TRUE)
summary(data)
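Since we load readtext above but do not actually use it, note that the same file could also be read with readtext(), which returns a data frame that quanteda handles directly. A minimal sketch, assuming the tweets sit in a column named text (as in the Kaggle data):
# alternative: readtext() reads the CSV; text_field names the column holding the document texts
rt_data <- readtext("C:/Users/felix/Dropbox/Teaching/sps_text_sose2020/material/corona_tweets/2020-03-29_Coronavirus_Tweets.csv",
                    text_field = "text")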
As mentioned at the beginning, we want to focus on tweets from the UK and the US:
# only use English language tweets in US and GB
data <- subset(data, lang=="en")
data <- subset(data, country_code=="GB" | country_code=="US")
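As a quick sanity check (an addition to the original script), we can tabulate the remaining country codes and languages; only English-language tweets from GB and US should be left:
# verify the subsetting: only "en" tweets from GB and US should remain
table(data$country_code, data$lang)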
Of course, we need to convert the tweets to a quanteda corpus and a dfm:
# convert to Quanteda corpus
corp <- corpus(data)
summary(corp)
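To see what actually went into the corpus, we can peek at the raw text of the first tweets:
# show the text of the first two documents
as.character(corp)[1:2]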
# To dfm
dfm <- dfm(corp,
           tolower = TRUE,
           stem = FALSE,
           remove_punct = TRUE,
           remove_numbers = TRUE,
           remove_symbols = TRUE,
           remove = stopwords("en"),
           ngrams = 1)
dfm
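Before moving on, it is worth glancing at the most frequent tokens in the dfm:
# inspect the 20 most frequent features
topfeatures(dfm, 20)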
We could visualize the hashtags that are mentioned most frequently in a network plot:
## VISUALIZATION: NETWORKPLOT
# Top hashtags
tag_dfm <- dfm_select(dfm, pattern = "#*")
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag)
toptag[1:20]
tag_fcm <- fcm(tag_dfm)
head(tag_fcm)
toptag_fcm <- fcm_select(tag_fcm, pattern = toptag)
textplot_network(toptag_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
However, our question was about topics. Let's run two different topic model algorithms: LDA and k-means.
What is the difference between these methods? In short: LDA is a mixed-membership model in which each document has a distribution over topics and each topic is a distribution over words, whereas k-means assigns each document to exactly one cluster.
## TOPIC MODEL ANALYSIS
## (1) LDA
# Remove some special characters
?dfm_remove
dfm <- dfm_remove(dfm, c("*ð*", "ðÿ","*covid*", "*corona*","â", "#", "@", "amp" ,"#ø", "âš", "ž", "ÿ", "t", "s", "ï"))
# keep only the top 30% of the most frequent features (min_termfreq = 0.70)
# that appear in less than 5% of all documents (max_docfreq = 0.05)
# using dfm_trim() to focus on common but distinguishing features.
dfm <- dfm_trim(dfm, min_termfreq = 0.7, termfreq_type = "quantile",
                max_docfreq = 0.05, docfreq_type = "prop")
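One practical caveat (an addition to the original script): after such aggressive trimming, some tweets may end up with zero remaining tokens, and LDA() fails on documents with no features. We can drop them first:
# drop documents that lost all their features during trimming; LDA cannot handle empty rows
dfm <- dfm_subset(dfm, ntoken(dfm) > 0)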
# LDA Topicmodels
?convert
dtm <- convert(dfm, to = "topicmodels")
?LDA
lda <- LDA(dtm, k = 2)
terms(lda, 20)
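Besides the top terms per topic, we can also ask which topic each tweet is most associated with. Both helpers below come from the topicmodels package:
# most likely topic for the first few tweets
head(topics(lda))
# full document-topic probability matrix (rows sum to 1)
head(posterior(lda)$topics)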
## (2) K-means
# convert to tf-idf
dfm_tfidf <- dfm_tfidf(dfm)
# run k-means
set.seed(123) # k-means starts from random centroids; fixing the seed makes results reproducible
k <- 2 # number of clusters
km_out <- stats::kmeans(dfm_tfidf, centers = k)
colnames(km_out$centers) <- featnames(dfm_tfidf)
for (i in 1:k) { # loop over clusters
  cat("CLUSTER", i, "\n")
  cat("Top 10 words:\n") # 10 most important terms at the centroid
  print(head(sort(km_out$centers[i, ], decreasing = TRUE), n = 10))
  cat("\n")
  cat("Corona tweets classified:\n") # tweets assigned to this cluster
  print(head(docnames(dfm_tfidf)[km_out$cluster == i])) # print() is required inside a loop, otherwise nothing is shown
  cat("\n")
}
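We fixed k = 2 above, but nothing guarantees that two clusters is the right choice. A common heuristic, sketched here as an addition to the exercise, is to compare the total within-cluster sum of squares for several values of k and look for an "elbow" where the curve flattens:
# compare cluster compactness for k = 2..8 (smaller tot.withinss = tighter clusters)
wss <- sapply(2:8, function(k) stats::kmeans(dfm_tfidf, centers = k)$tot.withinss)
plot(2:8, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")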
Copyright (c) Felix Hagemeister, 2020