Netflix recently launched a new series called Mindhunter. Let’s listen in on how Twitter users talk about the show. In this tutorial, we will use a pre-collected dataset containing tweets from #mindhunter and #davidfincher (David Fincher is the executive producer of the show).
You can download the dataset at: https://www.dropbox.com/s/eudqxwi5scdhiq9/mindhunter_tweets.csv?dl=0
Note that the data are stored in a CSV file.
To begin, let’s load the libraries. Make sure they are installed!
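If any of them are missing, install them once with install.packages(), for example:
install.packages(c("tm", "topicmodels", "LDAvis", "servr", "dplyr", "stringi"))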
library(tm)           # text mining: corpus handling and cleaning
library(topicmodels)  # LDA topic modeling
library(LDAvis)       # interactive visualization of topic models
library(servr)        # serves the LDAvis visualization in the browser
library(dplyr)        # data manipulation
library(stringi)      # string processing
Now, let’s import the CSV file (mindhunter_tweets.csv is the filename). We will keep only the first 3,000 tweets.
tweets <- read.csv("mindhunter_tweets.csv")
tweets <- tweets[1:3000, ] # keep only the first 3,000 tweets
Let’s see what columns are in the dataset. Q: Which column holds the Twitter user screen name? Q: Which column holds the tweet content?
colnames(tweets)
## [1] "id" "query"
## [3] "tweet_id" "inserted_date"
## [5] "truncated" "language"
## [7] "possibly_sensitive" "coordinates"
## [9] "retweeted_status" "created_at_text"
## [11] "created_at" "content"
## [13] "favorite_count" "from_user_screen_name"
## [15] "from_user_id" "from_user_followers_count"
## [17] "from_user_friends_count" "from_user_listed_count"
## [19] "from_user_statuses_count" "from_user_description"
## [21] "from_user_location" "from_user_created_at"
## [23] "retweet_count" "entities_urls"
## [25] "entities_urls_count" "entities_hashtags"
## [27] "entities_hashtags_count" "entities_mentions"
## [29] "entities_mentions_count" "in_reply_to_screen_name"
## [31] "in_reply_to_status_id" "source"
## [33] "entities_expanded_urls" "json_output"
## [35] "entities_media_count" "media_expanded_url"
## [37] "media_url" "media_type"
## [39] "video_link" "photo_link"
## [41] "twitpic" "subjectivity"
## [43] "polarity"
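If you are unsure what a column contains, you can preview a few rows of it, for example:
head(tweets[, c("from_user_screen_name", "content")])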
We use the aggregate() function to combine all tweets posted by the same user into one document. Try it and see what it returns.
tweets_byusers <- aggregate(x = tweets$content, by = list(tweets$from_user_screen_name), paste, collapse=". ")
colnames(tweets_byusers) <- c("from_user_screen_name", "content")
Convert the text into a corpus. iconv() normalizes the character encoding, and Corpus(VectorSource(...)) turns the text into a corpus object that tm can work with.
tweets_byusers_corpus <- iconv(tweets_byusers$content)
corpus <- Corpus(VectorSource(tweets_byusers_corpus))
Now, perform text cleaning. Raw tweets contain a lot of noise. We will remove English stop words and some other noisy words (such as the show’s and director’s names), convert all words to lower case, delete punctuation and numbers, and strip extra white space between words.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("mindhunter", "david", "fincher", "netflix"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
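To check the effect of the cleaning steps, you can peek at a few cleaned documents:
lapply(corpus[1:3], as.character) # show the first three cleaned documents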
Construct a document-term matrix (DTM) (https://en.wikipedia.org/wiki/Document-term_matrix).
dtm <- DocumentTermMatrix(corpus)
The line above converts the corpus to a DTM. But we need to run the next four lines to remove empty documents (documents left with no terms after cleaning) from the DTM, which could cause errors later.
rowTotals <- apply(dtm, 1, sum) # running this line takes time
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
if (length(empty.rows) > 0) corpus <- corpus[-as.numeric(empty.rows)] # guard against the case where no documents are empty
dtm <- DocumentTermMatrix(corpus)
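As a quick sanity check, dim() shows how many documents and terms remain after removing the empty rows:
dim(dtm) # rows = documents, columns = terms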
A DTM is a matrix, with documents in the rows and terms in the columns. In this example, a document is a user’s combined tweets and a term is a word that appears in those tweets. Let’s see what the DTM looks like (showing only 5 rows and 5 columns).
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity : 80%
## Maximal term length: 8
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs cant directed far tomorrow wait
## 1 1 0 0 1 1
## 2 0 1 1 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
Let’s explore the frequency of different terms. The following lines of code give you the 25 most frequent terms.
dtm.mx <- as.matrix(dtm)
frequency <- colSums(dtm.mx)
frequency <- sort(frequency, decreasing=TRUE)
frequency[1:25]
## episode just review watching good first
## 299 284 214 207 205 198
## killer watched show tvafterdark ive groff
## 187 183 180 177 174 174
## series finchers jonathan amp tvtime serial
## 173 172 171 158 156 150
## new far watch like now get
## 145 136 129 119 112 109
## https
## 107
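If you prefer a visual summary, a simple bar plot of the top terms works too (a minimal sketch using base R graphics):
barplot(frequency[1:15], las = 2, main = "15 most frequent terms")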
To produce a topic model, we first set some control parameters for the Gibbs sampler. No need to change anything here.
burnin <- 4000 # number of initial Gibbs iterations to discard
iter <- 2000 # number of Gibbs sampling iterations
thin <- 500 # keep only every 500th iteration
seed <- list(2003, 5, 63, 100001, 765) # one seed per run
nstart <- 5 # number of independent runs (must equal the number of seeds)
best <- TRUE # return only the run with the highest posterior likelihood
Tuning a topic model is an art, much like focusing a telescope. The algorithm does not know how many topics your text data entail; you need to specify the number of topics to be identified. Let’s start by asking the algorithm to give us 5 topics (k <- 5).
k <- 5 #find 5 topics
Give the topic model a try! Pay attention to what we specify in LDA(): dtm is the document-term matrix we created in the previous steps, k is the number of topics to be identified, and the rest are the control parameters we defined above.
ldaOut <- LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))
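Since there is no single correct number of topics, one informal way to tune k is to fit models with a few different values and compare how interpretable the top keywords look. A rough sketch (the values of k below are arbitrary examples, and each extra model adds to the runtime):
for (k_try in c(5, 10, 15)) {
  lda_try <- LDA(dtm, k_try, method = "Gibbs",
                 control = list(nstart = nstart, seed = seed, best = best,
                                burnin = burnin, iter = iter, thin = thin))
  print(terms(lda_try, 6)) # eyeball the top 6 keywords per topic for each k
}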
Let’s examine what’s in the topic model. Running the next two lines will generate a CSV file. The file lists which topic each document (that is, each user’s combined tweets) belongs to.
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("topic_model",k,"DocsToTopics.csv"))
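topics() only reports the single most likely topic per document. If you also want the full topic probabilities for each document, the posterior() function from topicmodels provides them:
topicProbabilities <- as.data.frame(posterior(ldaOut)$topics)
head(topicProbabilities) # one row per document, one column per topic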
The following lines will give you keywords associated with each topic. The current output file gives you 6 keywords for each topic. Do you know how to show 15 keywords per topic?
ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("topic_model",k,"TopicsToTerms.csv"))
ldaOut.terms[1:6,]
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "show" "first" "tvafterdark" "episode" "review"
## [2,] "groff" "finchers" "get" "just" "good"
## [3,] "jonathan" "far" "crazy" "watching" "killer"
## [4,] "new" "watch" "now" "watched" "series"
## [5,] "holtmccallany" "https" "dont" "ive" "amp"
## [6,] "netflixs" "reviews" "know" "tvtime" "serial"
Let’s visualize the result. We will use the R library called LDAvis for visualization. However, LDAvis does not directly accept the output of the topicmodels package (which we used for topic modeling), so we need the following helper function to convert the output into an LDAvis-readable format.
topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]],
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}
serVis(topicmodels2LDAvis(ldaOut))
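serVis() opens the interactive visualization in your browser. If you would rather save it as a standalone set of files (for example, to share with others), serVis() also accepts an out.dir argument; the folder name below is just an example:
serVis(topicmodels2LDAvis(ldaOut), out.dir = "ldavis_output", open.browser = FALSE) # writes the visualization files into the folder ldavis_output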
This tutorial is developed for COMM497DB Fall 2017, taught at UMass-Amherst.
If you find this tutorial helpful and would like to use it in your projects, please acknowledge the source:
Xu, Weiai W. (2017). How to Detect Sentiments from Donald Trump’s Tweets?. Amherst, MA: http://curiositybits.com