Netflix has recently launched a new series called Mindhunter. Let’s listen in on how Twitter users talk about the show. In this tutorial, we will use a pre-collected dataset containing tweets from #mindhunter and #davidfincher (David Fincher is the executive producer of the show).

You can download the dataset at: https://www.dropbox.com/s/eudqxwi5scdhiq9/mindhunter_tweets.csv?dl=0

Please note that the data are stored in a CSV file.

To begin, let’s load the libraries. Make sure they are installed!

library(tm)
library(topicmodels)
library(LDAvis)
library(servr)
library(dplyr)
library(stringi) 
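If any of these packages are missing on your machine, you can install them first (a one-time step; the list below simply mirrors the library() calls above):

# Install any packages you don't already have (run once)
install.packages(c("tm", "topicmodels", "LDAvis", "servr", "dplyr", "stringi"))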

Now, let’s import the CSV file. (mindhunter_tweets.csv is the filename)

tweets <- read.csv("mindhunter_tweets.csv")
tweets <- tweets[1:3000, ]  # keep the first 3,000 tweets

Let’s see what columns are in the dataset. Q: Which is the column for Twitter user screen name? Q: Which is the column for tweet content?

colnames(tweets)
##  [1] "id"                        "query"                    
##  [3] "tweet_id"                  "inserted_date"            
##  [5] "truncated"                 "language"                 
##  [7] "possibly_sensitive"        "coordinates"              
##  [9] "retweeted_status"          "created_at_text"          
## [11] "created_at"                "content"                  
## [13] "favorite_count"            "from_user_screen_name"    
## [15] "from_user_id"              "from_user_followers_count"
## [17] "from_user_friends_count"   "from_user_listed_count"   
## [19] "from_user_statuses_count"  "from_user_description"    
## [21] "from_user_location"        "from_user_created_at"     
## [23] "retweet_count"             "entities_urls"            
## [25] "entities_urls_count"       "entities_hashtags"        
## [27] "entities_hashtags_count"   "entities_mentions"        
## [29] "entities_mentions_count"   "in_reply_to_screen_name"  
## [31] "in_reply_to_status_id"     "source"                   
## [33] "entities_expanded_urls"    "json_output"              
## [35] "entities_media_count"      "media_expanded_url"       
## [37] "media_url"                 "media_type"               
## [39] "video_link"                "photo_link"               
## [41] "twitpic"                   "subjectivity"             
## [43] "polarity"
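One quick way to check your answers (just a suggestion; the column names come from the output above) is to peek at a few values from the two candidate columns:

# Peek at a few values to identify the screen-name and tweet-content columns
head(tweets$from_user_screen_name)
head(as.character(tweets$content), 3)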

Try the aggregate() function and see what it returns. Here we combine all tweets from the same user into a single document, separated by periods.

tweets_byusers <- aggregate(x = tweets$content, by = list(tweets$from_user_screen_name), paste, collapse=". ")

colnames(tweets_byusers) <- c("from_user_screen_name", "content")
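Since dplyr is already loaded but unused so far, here is an equivalent way to build the same per-user table (a sketch assuming the same column names as above; either version works):

# dplyr equivalent: one row per user, with that user's tweets pasted together
tweets_byusers <- tweets %>%
  group_by(from_user_screen_name) %>%
  summarise(content = paste(content, collapse = ". "))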

Next, convert the text into a corpus.

tweets_byusers_corpus <- iconv(tweets_byusers$content)  # normalize character encoding
corpus <- Corpus(VectorSource(tweets_byusers_corpus))

Now, perform text cleaning. There is a lot of noise in the text. We will remove stop words and a few other noisy words (the show-related terms that appear in nearly every tweet), convert all words to lower case, delete punctuation and numbers, and strip extra white space between words.

corpus <- tm_map(corpus, content_transformer(tolower))        # convert to lower case
corpus <- tm_map(corpus, removePunctuation)                   # delete punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # filter English stop words
corpus <- tm_map(corpus, removeWords, c("mindhunter", "david", "fincher", "netflix"))  # drop show-related terms
corpus <- tm_map(corpus, removeNumbers)                       # delete numbers
corpus <- tm_map(corpus, stripWhitespace)                     # remove extra white space
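To see the effect of the cleaning, you can print one cleaned document (a quick sanity check; document 1 is an arbitrary choice):

# Inspect the first cleaned document to verify the transformations
as.character(corpus[[1]])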

Construct a document-term matrix (DTM) (https://en.wikipedia.org/wiki/Document-term_matrix).

dtm <- DocumentTermMatrix(corpus)  

The line above converts the corpus to a DTM. We then need to run the next four lines to remove empty documents from the DTM (documents whose words were all removed during cleaning), which prevents errors later when fitting the LDA model.

rowTotals <- apply(dtm, 1, sum)  # total word count per document (running this line takes time)
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]  # IDs of documents left with no words
if (length(empty.rows) > 0) corpus <- corpus[-as.numeric(empty.rows)]  # drop those documents
dtm <- DocumentTermMatrix(corpus)  # rebuild the DTM without the empty documents
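You can check how many documents and terms remain after this step (a quick sanity check; the counts will depend on your data):

# How many documents (rows) and terms (columns) are left in the DTM?
dim(dtm)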

A DTM is a matrix with documents in the rows and terms in the columns. In this example, a document is a user’s tweets and a term is a word that appears in those tweets. Let’s see what the DTM looks like (showing only 5 rows and 5 columns).

inspect(dtm[1:5, 1:5])  
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs cant directed far tomorrow wait
##    1    1        0   0        1    1
##    2    0        1   1        0    0
##    3    0        0   0        0    0
##    4    0        0   0        0    0
##    5    0        0   0        0    0

Let’s explore the frequency of different terms. The following lines of code give you the 25 most frequent terms.

dtm.mx <- as.matrix(dtm)
frequency <- colSums(dtm.mx)
frequency <- sort(frequency, decreasing=TRUE)
frequency[1:25] 
##     episode        just      review    watching        good       first 
##         299         284         214         207         205         198 
##      killer     watched        show tvafterdark         ive       groff 
##         187         183         180         177         174         174 
##      series    finchers    jonathan         amp      tvtime      serial 
##         173         172         171         158         156         150 
##         new         far       watch        like         now         get 
##         145         136         129         119         112         109 
##       https 
##         107
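If you would like a quick visual of these counts, a simple base-R bar chart works (just one option; ggplot2 would do the job too if you have it installed):

# Bar chart of the 25 most frequent terms
barplot(frequency[1:25], las = 2, cex.names = 0.7,
        main = "Top 25 terms", ylab = "Frequency")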

To produce a topic model, we will supply R with initial parameters for the Gibbs sampler. No need to change anything here: burnin is the number of early iterations to discard, iter is the number of iterations to run, thin keeps only every 500th iteration, nstart is the number of independent runs (each with its own seed from the seed list), and best = TRUE keeps only the best-fitting run.

burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

Tuning a topic model is an art, much like tuning a telescope. The algorithm does not know how many topics your text data contain; you need to specify the number of topics to be identified. Let’s start by asking the algorithm to give us 5 topics (k <- 5).

k <- 5 #find 5 topics

Give the topic model a try! Pay attention to what we specify in LDA(): dtm is the document-term matrix we created in the previous steps, k is the number of topics to be identified, and the rest are the initial parameters we defined above.

ldaOut <- LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))

Let’s examine what’s in the topic model. Running the next two lines will generate a CSV file that lists which topic each document (that is, each user’s tweets) most likely belongs to.

ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("topic_model",k,"DocsToTopics.csv"))
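If you want the full topic probabilities for each document rather than just the single most likely topic (an optional extra step; the filename here is only a suggestion), topicmodels exposes them through posterior():

# Per-document topic probabilities (rows = documents, columns = topics)
topicProbabilities <- as.data.frame(posterior(ldaOut)$topics)
write.csv(topicProbabilities, file = paste("topic_model", k, "TopicProbabilities.csv"))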

The following lines will give you keywords associated with each topic. The current output file gives you 6 keywords for each topic. Do you know how to show 15 keywords per topic?

ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("topic_model",k,"TopicsToTerms.csv"))
ldaOut.terms[1:6,]
##      Topic 1         Topic 2    Topic 3       Topic 4    Topic 5 
## [1,] "show"          "first"    "tvafterdark" "episode"  "review"
## [2,] "groff"         "finchers" "get"         "just"     "good"  
## [3,] "jonathan"      "far"      "crazy"       "watching" "killer"
## [4,] "new"           "watch"    "now"         "watched"  "series"
## [5,] "holtmccallany" "https"    "dont"        "ive"      "amp"   
## [6,] "netflixs"      "reviews"  "know"        "tvtime"   "serial"

Let’s visualize the result. We will use the R library LDAvis for visualization. However, LDAvis does not directly accept the output of topicmodels (the library we used to fit the model), so we need the following function to convert the output into an LDAvis-readable format.

topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]],                             # topic-term probabilities
    theta = post[["topics"]],                          # document-topic probabilities
    vocab = colnames(post[["terms"]]),                 # vocabulary
    doc.length = slam::row_sums(mat, na.rm = TRUE),    # number of words per document
    term.frequency = slam::col_sums(mat, na.rm = TRUE) # corpus-wide term counts
  )
}

serVis(topicmodels2LDAvis(ldaOut))
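serVis() opens the interactive visualization in your browser. If you want to keep the generated files rather than write them to a temporary directory (optional; the folder name below is just an example), you can specify an output directory:

# Save the visualization files to a named folder and open it in the browser
serVis(topicmodels2LDAvis(ldaOut), out.dir = "ldavis_output", open.browser = TRUE)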

This tutorial is developed for COMM497DB Fall 2017, taught at UMass-Amherst.

If you find this tutorial helpful and would like to use it in your projects, please acknowledge the source:

Xu, Weiai W. (2017). How to Detect Sentiments from Donald Trump’s Tweets?. Amherst, MA: http://curiositybits.com