Intro to Text Mining

Text Mining on Avengers End Game Tweets

Author

Burak Demirtas

Published

September 24, 2023

1 Loading the Needed Libraries

Hide / show the code
if (!require("pacman")) install.packages("pacman")
pacman::p_load(pacman,   # package loader
               ggplot2,  # for better graphs
               dplyr,    # for data manipulation
               tm,       # for text mining
               wordcloud # for word cloud viz
               )

2 Text Data Preparation for Analysis

2.1 Loading the Data and Isolating the Text

First, we need to load the data and extract the text content as a vector so we can process it later. For this study, I will use a Kaggle dataset for the Marvel movie "Avengers: Endgame", which contains a large number of tweets scraped from Twitter.

Hide / show the code
# Import text data from CSV
tweets <- read.csv("avengers_end_game_tweets.csv")

# View the structure of tweets
str(tweets)
'data.frame':   15000 obs. of  17 variables:
 $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ text         : chr  "RT @mrvelstan: literally nobody:\nme:\n\n#AvengersEndgame https://t.co/LR9kFwfD5c" "RT @agntecarter: i’m emotional, sorry!!\n\n2014 x 2019\n#blackwidow\n#captainamerica https://t.co/xcwkCMw18w" "saving these bingo cards for tomorrow \n©\n #AvengersEndgame https://t.co/d6For0jwRb" "RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX" ...
 $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: int  0 0 0 0 0 0 0 0 0 0 ...
 $ replyToSN    : chr  NA NA NA NA ...
 $ created      : chr  "2019-04-23 10:43:30" "2019-04-23 10:43:30" "2019-04-23 10:43:30" "2019-04-23 10:43:29" ...
 $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ id           : num  1.12e+18 1.12e+18 1.12e+18 1.12e+18 1.12e+18 ...
 $ replyToUID   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ statusSource : chr  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" ...
 $ screenName   : chr  "DavidAc96" "NRmalaa" "jijitsuu" "SahapunB" ...
 $ retweetCount : int  637 302 0 23781 13067 3122 269 5687 349 23781 ...
 $ isRetweet    : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ longitude    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ latitude     : num  NA NA NA NA NA NA NA NA NA NA ...

To make the data easier to work with, I will isolate the text column from the dataset:

Hide / show the code
# Isolate text from tweets
end_game_tweets <- tweets$text

2.2 Creating a Source Vector and Corpus

To be able to work with our text, we first need to turn it into a source object. The two source types we will use are:

  • Vector Source
  • Dataframe Source

Let’s first create a vector source using tm package:

Hide / show the code
# Make a vector source
endgame_source = VectorSource(end_game_tweets)

Now that we’ve converted our vector to a Source object, we pass it to another tm function, VCorpus(), to create our volatile¹ corpus².

The VCorpus object is structured as a nested list or a list of lists. Each index within the VCorpus contains a PlainTextDocument object, which is essentially a list that holds the actual text data (content) along with associated metadata (meta).

Let’s first create a volatile corpus from our source file:

Hide / show the code
# Make a volatile corpus from source
endgame_corpus <- VCorpus(endgame_source)

# Print out the corpus
endgame_corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 15000

We have 15000 documents in our corpus. If we want to see a specific element, say the 50th tweet, we can use the same indexing we always use with lists, because a VCorpus object behaves like a list.

Hide / show the code
# Print the 50th tweet in corpus
endgame_corpus[[50]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 60

The 60 under Content is the number of characters. If we want to see the tweet text itself, we can use the content() function.

Hide / show the code
# Print the contents of the 50th tweet in endgame_corpus
content(endgame_corpus[[50]])
[1] "Steering clear of social media until I see #AvengersEndgame."

In this structure, each tweet is stored as a PlainTextDocument.
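
The VCorpus really is a list of lists, so we can also peek inside a single document to see its content and metadata slots. A minimal sketch (these calls are not part of the original output):

Hide / show the code
# Each document is a small list holding the text and its metadata
str(endgame_corpus[[50]], max.level = 1)  # shows the $content and $meta slots
meta(endgame_corpus[[50]])                # the per-document metadata alone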

2.3 Creating a Data Frame Source Instead of Vector

Using DataframeSource(), we can also create a source from a data frame instead of a vector.

But for that, the data frame has to meet certain requirements:

  • Column 1 must be called doc_id and contain a unique string for each row
  • Column 2 must be called text with “UTF-8” encoding
  • Any other columns, 3+, are considered metadata and will be retained as such.

In our data, the first column is called X, so we need to rename it:

Hide / show the code
# Making a copy of original source
tweets_df <- tweets

# Changing column name
colnames(tweets_df)[1] <- "doc_id"
# re-coding the text column as UTF-8
tweets_df$text <- iconv(tweets_df$text, to = "UTF-8", sub = "byte")

# Check the result
colnames(tweets_df)
 [1] "doc_id"        "text"          "favorited"     "favoriteCount"
 [5] "replyToSN"     "created"       "truncated"     "replyToSID"   
 [9] "id"            "replyToUID"    "statusSource"  "screenName"   
[13] "retweetCount"  "isRetweet"     "retweeted"     "longitude"    
[17] "latitude"     

Since our data is now ready, we can use it as a data frame source:

Hide / show the code
# Create a DataframeSource from the example text
df_source <- DataframeSource(tweets_df)

# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)

# Examine df_corpus
df_corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 15
Content:  documents: 15000

As you can see, we get exactly the same corpus from the data frame. Notice that the 15 extra columns are now indexed as document-level metadata, so if we also need to work with metadata, this is the better approach.
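
For instance, the extra columns can be pulled back out with tm's meta() function. A quick sketch (not run in the original post):

Hide / show the code
# Document-level metadata carried over from the extra data frame columns
head(meta(df_corpus))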

2.4 The Most Used Cleaning Functions

The functions below are commonly used in text analysis:

  • tolower(): Makes all characters lowercase (This is a base-R function)
  • removePunctuation(): Removes all punctuation marks
  • removeNumbers(): Removes numbers
  • stripWhitespace(): Removes extra white space
Hide / show the code
# Create the text
text <- "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more    important than knowledge. Knowledge is limited. Imagination encircles the world.   - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."

Now we can try these functions:

Hide / show the code
# Make lowercase
tolower(text)
[1] "<b>imagination<b/> i'm enough of the artist to draw freely upon my imagination. imagination is more    important than knowledge. knowledge is limited. imagination encircles the world.   - (1929 / dr. albert einstein) - sentences worth 1000$s."
Hide / show the code
# Remove punctuation
removePunctuation(text)
[1] "bImaginationb Im enough of the artist to draw freely upon my imagination Imagination is more    important than knowledge Knowledge is limited Imagination encircles the world    1929  Dr Albert Einstein  Sentences worth 1000s"
Hide / show the code
# Remove numbers
removeNumbers(text)
[1] "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more    important than knowledge. Knowledge is limited. Imagination encircles the world.   - ( / Dr. Albert Einstein) - Sentences worth $s."
Hide / show the code
# Remove whitespace
stripWhitespace(text)
[1] "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."

2.5 Stop Words

There are words that occur frequently but usually carry little information. These are called stop words, and it is usually best to simply remove them. Some common English stop words are "I", "the", "to", "a", etc. The tm package comes with a list of 174 common English stop words.

We can also enlarge our stop words list using the c() function.

Once we have a list of stop words, we can use the removeWords() function to remove them from our text.

Hide / show the code
# List of English stop words in tm package
stopwords("en")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"      

Let’s remove those words from our initial text:

Hide / show the code
# Print text without standard stop words
removeWords(text, stopwords("en"))
[1] "<b>Imagination<b/> I'm enough   artist  draw freely upon  imagination. Imagination      important  knowledge. Knowledge  limited. Imagination encircles  world.   - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."
Hide / show the code
# Adding new words to stop words list
added_stops <- c("I'm", "limited" , stopwords("en"))

# Remove stop words from text
removeWords(text, added_stops)
[1] "<b>Imagination<b/>  enough   artist  draw freely upon  imagination. Imagination      important  knowledge. Knowledge  . Imagination encircles  world.   - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."

2.6 Preprocessing on Corpus

In the tm package, the tm_map() function applies a preprocessing function to every document in the corpus.

tm_map() needs two arguments:

  • a corpus
  • a cleaning function

If we need to use a function from outside tm, such as base R or qdap functions, we have to wrap it in content_transformer().

In text analysis we usually need to apply the same cleaning steps over and over. Instead of repeating the lines each time, we can wrap them in a custom function such as clearthecorpus() below.

Hide / show the code
# Defining our function
clearthecorpus <- function(corpus) {

  # Transform to lower case - tolower() is a base R function, so it needs content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))

  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)

  # Remove stop words (plus the extra word "may")
  corpus <- tm_map(corpus, removeWords,
                   words = c(stopwords("en"), "may"))

  # Clear extra whitespace
  corpus <- tm_map(corpus, stripWhitespace)

  # Remove numbers
  corpus <- tm_map(corpus, removeNumbers)

  return(corpus)
}

Now we can use our cleaning function on our corpus:

Hide / show the code
# Applying the custom function
cleaned_corpus <- clearthecorpus(endgame_corpus)

# Print out a cleaned tweet
content(cleaned_corpus[[50]])
[1] "steering clear social media see avengersendgame"
Hide / show the code
# Print out the same tweet in the original form
tweets$text[50]
[1] "Steering clear of social media until I see #AvengersEndgame."

3 Creating a Document-Term Matrix

One of the most common structures that text mining packages work with is the document-term matrix (or DTM). This is a matrix where:

  • each row represents one document (such as a book or article),
  • each column represents one term, and
  • each value (typically) contains the number of appearances of that term in that document.³ (A tiny toy example is sketched right after this list.)
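
As a tiny illustration (a sketch with made-up sentences, not part of the tweet data), this is how a DTM is built for a two-document toy corpus:

Hide / show the code
# Build a two-document toy corpus and print its document-term matrix
toy_corpus <- VCorpus(VectorSource(c("iron man and captain america",
                                     "captain marvel film")))
as.matrix(DocumentTermMatrix(toy_corpus))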

The tm package stores a DTM in a sparse "simple triplet matrix" class. However, it is often easier to manipulate and examine the object by re-classifying the DTM with as.matrix().

Hide / show the code
# Create the document-term matrix from the corpus
endgame_dtm <- DocumentTermMatrix(cleaned_corpus)

# Print out endgame_dtm data
endgame_dtm
<<DocumentTermMatrix (documents: 15000, terms: 7391)>>
Non-/sparse entries: 132652/110732348
Sparsity           : 100%
Maximal term length: 34
Weighting          : term frequency (tf)

The DTM also gives us some statistics about our text data. For example, we have 15000 tweets, but only 7391 distinct terms appear across all of them.
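
Converting the whole 15000 x 7391 DTM to a dense matrix allocates more than 100 million cells. If memory is a concern, we can instead peek at a small slice with tm's inspect(). A sketch (this call is not in the original post):

Hide / show the code
# Look at a 5-document x 5-term slice without densifying the whole DTM
inspect(endgame_dtm[1:5, 1:5])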

Hide / show the code
# Convert to a matrix
endgame_matrix <- as.matrix(endgame_dtm)

# Print the dimensions of the matrix
dim(endgame_matrix)
[1] 15000  7391
Hide / show the code
# Review a portion of the matrix for two related terms
endgame_matrix[1:10, c("avengers", "avengersendgame")]
    Terms
Docs avengers avengersendgame
  1         0               1
  2         0               0
  3         0               1
  4         0               1
  5         0               1
  6         0               1
  7         0               1
  8         1               1
  9         0               1
  10        0               1

Looking at the first 10 tweets, only one contains the term “avengers”, but almost all of them contain “avengersendgame”.
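
To check whether this pattern holds beyond the first 10 tweets, we could count over the full matrix how many tweets contain each term at least once. A sketch (output not shown here):

Hide / show the code
# Number of tweets in which each of the two terms appears at least once
colSums(endgame_matrix[, c("avengers", "avengersendgame")] > 0)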

4 Creating a Term-Document Matrix

A Term-Document Matrix (TDM) is simply the transposed version of the DTM. Why do we need a TDM? Because for many calculations it is easier to have the terms on the rows rather than the documents. We can again turn the TDM into a matrix with as.matrix() to make our analysis faster.

Hide / show the code
# Create a term-document matrix from the corpus
endgame_tdm <- TermDocumentMatrix(cleaned_corpus)

# Print endgame_tdm data
endgame_tdm
<<TermDocumentMatrix (terms: 7391, documents: 15000)>>
Non-/sparse entries: 132652/110732348
Sparsity           : 100%
Maximal term length: 34
Weighting          : term frequency (tf)

As you can see, the statistics are the same between the TDM and DTM!

Hide / show the code
# Convert to a matrix
endgame_matrix_tdm <- as.matrix(endgame_tdm)

# Print the dimensions of the matrix
dim(endgame_matrix_tdm)
[1]  7391 15000
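
As a quick sanity check (a sketch, not in the original output), a row of the TDM matrix should match the corresponding column of the DTM matrix:

Hide / show the code
# The counts for "avengersendgame" across the first 10 documents, from both matrices
endgame_matrix_tdm["avengersendgame", 1:10]
endgame_matrix[1:10, "avengersendgame"]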

5 Frequent Terms with tm Package

Using rowSums() on the TDM matrix, we can get the frequency of each term in the corpus. We can then sort the counts with sort(..., decreasing = TRUE), pick out the most frequent terms, and put them on a graph.

Hide / show the code
# Calculate the row sums
term_frequency <- rowSums(endgame_matrix_tdm)

# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency,
                       decreasing = TRUE)

# View the top 10 most common words
term_frequency[1:10]
 avengersendgame           marvel         avengers              man 
           13235             3126             2720             2149 
        premiere              ads       everywhere        helloboon 
            1591             1457             1456             1456 
httpstcoqlnfejsx   captainamerica 
            1456             1013 
Hide / show the code
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "grey", las = 2)
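
As an aside, tm also provides findFreqTerms(), which lists every term above a frequency threshold directly from the DTM/TDM without building the dense matrix (the terms come back in dictionary order, not sorted by count). A sketch:

Hide / show the code
# Terms that occur at least 1000 times in the corpus
findFreqTerms(endgame_tdm, lowfreq = 1000)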

6 Word Clouds

A nicer way to visualize the most common words is a word cloud. We can use the wordcloud package as below to get a more informative picture than a bar plot.

Hide / show the code
# Vector of terms
terms_vec = names(term_frequency)

# Create a word cloud from the term frequencies
wordcloud(terms_vec,term_frequency,max.words = 50, colors = 'black')
Warning in wordcloud(terms_vec, term_frequency, max.words = 50, colors =
"black"): avengersendgame could not be fit on page. It will not be plotted.

7 Improved Word Cloud

After the first word cloud, we can see that:

  • marvel, avengers, avengersendgame, marvelstudios, endgame and rt are repeated a lot, but they do not provide any insight.

  • Also, there are lots of http links in the tweets which need to be cleaned out.

Hide / show the code
# Custom function to filter out words containing "http"
filter_out_http <- function(x) {
  x <- unlist(strsplit(x, " "))  # Tokenize the text into words
  x <- x[!grepl("http", x)]  # Remove words containing "http"
  x <- paste(x, collapse = " ")  # Reconstruct the text
  return(x)
}

# Apply the custom function to your corpus
cleaned_corpus2 <- tm_map(cleaned_corpus, content_transformer(filter_out_http))

# Review a "cleaned" tweet
content(cleaned_corpus[[4]])
[1] "rt helloboon man avengersendgame ads everywhere httpstcoqlnfejsx"
Hide / show the code
content(cleaned_corpus2[[4]])
[1] "rt helloboon man avengersendgame ads everywhere"

Now that we can remove the words containing http, we can also enlarge our stop words list:

Hide / show the code
# Add to stopwords
stops <- c(stopwords(kind = 'en'), 
           'marvel', 'avengers', 'avengersendgame', 
           'marvelstudios', 'endgame', 'rt' , 
           "’re" ,'will','six', 'movie', 'just')

# Apply to a corpus
cleaned_corpus3 <- tm_map(cleaned_corpus2, removeWords, stops)

# Review a "cleaned" tweet again
content(cleaned_corpus3[[4]])
[1] " helloboon man  ads everywhere"

Now that we have removed the additional stop words, let’s take a look at the improved word cloud!

Hide / show the code
# Create a term-document matrix from the corpus
endgame_tdm3 <- TermDocumentMatrix(cleaned_corpus3)

# Convert to a matrix
endgame_matrix_tdm3 <- as.matrix(endgame_tdm3)

# Calculate the row sums
term_frequency3 <- rowSums(endgame_matrix_tdm3)

# Sort term_frequency in decreasing order
term_frequency3 <- sort(term_frequency3,
                       decreasing = TRUE)

# Get a terms vector
terms_vec <- names(term_frequency3)

# Create a word cloud from the term frequencies
wordcloud(terms_vec, term_frequency3, 
          max.words = 50, colors = "blue")
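
One last detail: word placement in wordcloud() is randomized, so the layout changes on every run. Setting a seed beforehand (a sketch; the seed value is arbitrary) makes the plot reproducible:

Hide / show the code
# Fix the random seed so the word cloud layout is the same on every run
set.seed(1234)
wordcloud(terms_vec, term_frequency3,
          max.words = 50, colors = "blue")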

Footnotes

  1. A volatile corpus is a temporary corpus that is stored only in RAM and discarded after use.↩︎

  2. A text corpus is a large and unstructured set of texts (nowadays usually electronically stored and processed) used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. For more information: https://guides.library.uq.edu.au/research-techniques/text-mining-analysis/language-corpora↩︎

  3. https://www.tidytextmining.com/dtm.html?q=document%20term#tidying-documenttermmatrix-objects↩︎