if (!require("pacman")) install.packages("pacman")
library(pacman)  # package loader

p_load(ggplot2,   # for better graphs
       dplyr,     # for data manipulation
       tm,        # for text mining
       wordcloud  # for word cloud visualization
)
2 Text Data Preparation for Analysis
2.1 Loading the Data and Isolating the Text
First, we need to load the data and turn the text content into a vector so we can process it later. For this study, I will use a Kaggle dataset for the Marvel movie “Avengers: Endgame”. The dataset contains a large number of tweets scraped from Twitter.
# Import text data from CSV
tweets <- read.csv("avengers_end_game_tweets.csv")

# View the structure of tweets
str(tweets)
To make things easier, I will isolate the text column from the dataset:
# Isolate the text from the tweets
end_game_tweets <- tweets$text
2.2 Creating a Source Vector and Corpus
To be able to work with our text, we first need to turn it into a source object. Sources can be created from:
a vector (VectorSource())
a data frame (DataframeSource())
Let’s first create a vector source using the tm package:
# Make a vector source
endgame_source <- VectorSource(end_game_tweets)
Now that we’ve converted our vector to a Source object, we pass it to another tm function, VCorpus(), to create our volatile1 corpus2.
The VCorpus object is structured as a nested list or a list of lists. Each index within the VCorpus contains a PlainTextDocument object, which is essentially a list that holds the actual text data (content) along with associated metadata (meta).
Let’s first create a volatile corpus from our source file:
# Make a volatile corpus from the source
endgame_corpus <- VCorpus(endgame_source)

# Print out the corpus
endgame_corpus
We have 15,000 documents in our corpus. If we want to see a specific element, say the 50th tweet, we can access it with the same double-bracket indexing we use for lists, because a VCorpus object behaves like a list.
# Print the 50th tweet in the corpus
endgame_corpus[[50]]
The 60 in the output represents the number of characters. If we want to see the tweet content itself, we can use the content() function.
# Print the contents of the 50th tweet in the corpus
content(endgame_corpus[[50]])
[1] "Steering clear of social media until I see #AvengersEndgame."
In this structure, each tweet is stored as a PlainTextDocument.
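We can confirm this nested structure by inspecting a single document: str() reveals the content/meta pair described above, and tm’s meta() function prints the metadata on its own.

# Inspect the structure of the 50th document
str(endgame_corpus[[50]])

# View its metadata
meta(endgame_corpus[[50]])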
2.3 Creating a Data Frame Source Instead of Vector
Using DataframeSource(), we can also create a source from a data frame instead of a vector.
But for that, the data frame must meet certain requirements:
Column 1 must be called doc_id and contain a unique string for each row
Column 2 must be called text with “UTF-8” encoding
Any other columns, 3+, are considered metadata and will be retained as such.
In our data, the first column is called X, so we need to rename it:
# Make a copy of the original data
tweets_df <- tweets

# Change the first column name
colnames(tweets_df)[1] <- "doc_id"

# Re-encode the text column as UTF-8
tweets_df$text <- iconv(tweets_df$text, to = "UTF-8", sub = "byte")

# Check the result
colnames(tweets_df)
Since our data is now ready, we can use it as a data frame source:
# Create a DataframeSource from the prepared data frame
df_source <- DataframeSource(tweets_df)

# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)

# Examine df_corpus
df_corpus
As you can see, we get the exact same result with a data frame. If we also need to work with metadata, this can be the better option.
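For example, the extra columns of the data frame should now be attached to each document as metadata. A quick check (assuming our CSV really does contain columns beyond doc_id and text) could look like this:

# View the metadata attached to the first document
meta(df_corpus[[1]])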
2.4 Commonly Used Cleaning Functions
The functions below are commonly used in text analysis:
tolower(): Makes all characters lowercase (This is a base-R function)
removePunctuation(): Removes all punctuation marks
removeNumbers(): Removes numbers
stripWhitespace(): Removes extra white space
# Create an example text
text <- "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."
Now we can try these functions:
# Make lowercase
tolower(text)
[1] "<b>imagination<b/> i'm enough of the artist to draw freely upon my imagination. imagination is more important than knowledge. knowledge is limited. imagination encircles the world. - (1929 / dr. albert einstein) - sentences worth 1000$s."
# Remove punctuation
removePunctuation(text)
[1] "bImaginationb Im enough of the artist to draw freely upon my imagination Imagination is more important than knowledge Knowledge is limited Imagination encircles the world 1929 Dr Albert Einstein Sentences worth 1000s"
# Remove numbers
removeNumbers(text)
[1] "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. - ( / Dr. Albert Einstein) - Sentences worth $s."
# Remove extra whitespace
stripWhitespace(text)
[1] "<b>Imagination<b/> I'm enough of the artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."
2.5 Stop Words
Some words are frequent but usually provide little information. These are called stop words, and it is usually best to simply remove them. Common English stop words include “I”, “the”, “to”, “a”, etc. The tm package’s English stop word list contains 174 words.
We can also extend our stop word list using the c() function.
Once we have a list of stop words, we can use the removeWords() function to remove them from our text.
# List the English stop words in the tm package
stopwords("en")

# Print the text without standard stop words
removeWords(text, stopwords("en"))
[1] "<b>Imagination<b/> I'm enough artist draw freely upon imagination. Imagination important knowledge. Knowledge limited. Imagination encircles world. - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."
# Add new words to the stop word list
added_stops <- c("I'm", "limited", stopwords("en"))

# Remove stop words from the text
removeWords(text, added_stops)
[1] "<b>Imagination<b/> enough artist draw freely upon imagination. Imagination important knowledge. Knowledge . Imagination encircles world. - (1929 / Dr. Albert Einstein) - Sentences worth 1000$s."
2.6 Preprocessing on Corpus
In the tm package, the tm_map() function applies a preprocessing function to every document in the corpus.
tm_map() needs two arguments:
a corpus
a cleaning function
If we need to use a function from outside tm, such as base R or qdap functions, we need to wrap it in content_transformer().
In text analysis, we usually need to apply the same cleaning steps over and over. Instead of repeating the lines, we can wrap them in a custom function such as clearthecorpus() below.
# Define our cleaning function
clearthecorpus <- function(corpus) {
  # Transform to lower case - a base R function, so we need content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Remove stop words, plus "may"
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "may"))
  # Clear extra whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  return(corpus)
}
Now we can use our cleaning function on our corpus:
# Apply the custom function
cleaned_corpus <- clearthecorpus(endgame_corpus)

# Print out a cleaned tweet
content(cleaned_corpus[[50]])
[1] "steering clear social media see avengersendgame"
# Print out the same tweet in its original form
tweets$text[50]
[1] "Steering clear of social media until I see #AvengersEndgame."
3 Creating a Document-Term Matrix
One of the most common structures that text mining packages work with is the document-term matrix (or DTM). This is a matrix where:
each row represents one document (such as a book or article),
each column represents one term, and
each value (typically) contains the number of appearances of that term in that document.3
The tm package stores a DTM as a “simple triplet matrix”. However, it is often easier to manipulate and examine the object after converting it to a regular matrix with as.matrix().
# Create the document-term matrix from the corpus
endgame_dtm <- DocumentTermMatrix(cleaned_corpus)

# Print out the endgame_dtm data
endgame_dtm
<<DocumentTermMatrix (documents: 15000, terms: 7391)>>
Non-/sparse entries: 132652/110732348
Sparsity : 100%
Maximal term length: 34
Weighting : term frequency (tf)
The DTM printout also gives us some statistics about our text data. We have 15,000 tweets but only 7,391 distinct terms, and only 132,652 of the 15,000 × 7,391 = 110,865,000 cells are non-zero, which is why the sparsity is reported as (a rounded) 100%.
# Convert to a matrix
endgame_matrix <- as.matrix(endgame_dtm)

# Print the dimensions of the matrix
dim(endgame_matrix)
[1] 15000 7391
# Review a portion of the matrix
endgame_matrix[1:10, c("avengers", "avengersendgame")]
Looking at the first 10 tweets, only one of them contains the term “avengers”, but almost all of them contain “avengersendgame”.
4 Creating a Term-Document Matrix
A term-document matrix (TDM) is just the transposed version of a DTM. Why do we need a TDM? Because for many calculations it is easier to have the terms on the rows rather than the documents. We can again convert the TDM to a regular matrix with as.matrix() to speed up our analysis.
# Create a term-document matrix from the corpus
endgame_tdm <- TermDocumentMatrix(cleaned_corpus)

# Print the endgame_tdm data
endgame_tdm
<<TermDocumentMatrix (terms: 7391, documents: 15000)>>
Non-/sparse entries: 132652/110732348
Sparsity : 100%
Maximal term length: 34
Weighting : term frequency (tf)
As you can see, the statistics are the same between the TDM and DTM!
# Convert to a matrix
endgame_matrix_tdm <- as.matrix(endgame_tdm)

# Print the dimensions of the matrix
dim(endgame_matrix_tdm)
[1] 7391 15000
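Since a TDM is just the transpose of a DTM, a quick sanity check is to compare the TDM matrix against the transposed DTM matrix from earlier. Assuming both were built from the same cleaned corpus, every entry should match:

# Sanity check: the TDM matrix should equal the transposed DTM matrix
all(endgame_matrix_tdm == t(endgame_matrix))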
5 Frequent Terms with the tm Package
Using rowSums(), we can now get the frequency of each term in the corpus. We can then sort the frequencies with sort(..., decreasing = TRUE), find the most frequent terms, and put them all in a graph.
# Calculate the row sums
term_frequency <- rowSums(endgame_matrix_tdm)

# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency, decreasing = TRUE)

# View the top 10 most common words
term_frequency[1:10]

# Plot a bar chart of the 10 most common words
barplot(term_frequency[1:10], col = "grey", las = 2)
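As an alternative to computing row sums ourselves, tm also provides findFreqTerms(), which returns all terms that reach a given frequency threshold (the threshold of 100 below is an arbitrary choice for illustration):

# Terms that appear at least 100 times across the corpus
findFreqTerms(endgame_tdm, lowfreq = 100)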
6 Word Clouds
A more appealing way to visualize the most common words is a word cloud. We can use the wordcloud package as below to get a visualization that reads better than a barplot.
# Vector of terms
terms_vec <- names(term_frequency)

# Create a word cloud for the values in term_frequency
wordcloud(terms_vec, term_frequency, max.words = 50, colors = "black")
Warning in wordcloud(terms_vec, term_frequency, max.words = 50, colors =
"black"): avengersendgame could not be fit on page. It will not be plotted.
7 Improved Word Cloud
After the first word cloud, we can see that:
marvel, avengers, avengersendgame, marvelstudios, endgame, and rt appear over and over but do not provide insight.
There are also lots of http links in the tweets that need to be cleaned.
# Custom function to filter out words containing "http"
filter_out_http <- function(x) {
  x <- unlist(strsplit(x, " "))  # Tokenize the text into words
  x <- x[!grepl("http", x)]      # Remove words containing "http"
  x <- paste(x, collapse = " ")  # Reconstruct the text
  return(x)
}

# Apply the custom function to our corpus
cleaned_corpus2 <- tm_map(cleaned_corpus, content_transformer(filter_out_http))

# Review a "cleaned" tweet before the http filter
content(cleaned_corpus[[4]])
[1] "rt helloboon man avengersendgame ads everywhere httpstcoqlnfejsx"
# Review the same tweet after the http filter
content(cleaned_corpus2[[4]])
[1] "rt helloboon man avengersendgame ads everywhere"
Now that we can remove the words containing http, we can also extend our stop word list:
# Add to the stop words
stops <- c(stopwords(kind = "en"), "marvel", "avengers", "avengersendgame",
           "marvelstudios", "endgame", "rt", "’re", "will", "six", "movie", "just")

# Apply to the corpus
cleaned_corpus3 <- tm_map(cleaned_corpus2, removeWords, stops)

# Review a "cleaned" tweet again
content(cleaned_corpus3[[4]])
[1] " helloboon man ads everywhere"
Now that we have removed the additional stop words, let’s take a look at the improved word cloud!
# Create a term-document matrix from the corpus
endgame_tdm3 <- TermDocumentMatrix(cleaned_corpus3)

# Convert to a matrix
endgame_matrix_tdm3 <- as.matrix(endgame_tdm3)

# Calculate the row sums
term_frequency3 <- rowSums(endgame_matrix_tdm3)

# Sort term_frequency3 in decreasing order
term_frequency3 <- sort(term_frequency3, decreasing = TRUE)

# Get a vector of terms
terms_vec <- names(term_frequency3)

# Create a word cloud for the values in term_frequency3
wordcloud(terms_vec, term_frequency3, max.words = 50, colors = "blue")
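For a more colorful result, we could pass a palette instead of a single color. This sketch assumes the RColorBrewer package (a dependency of wordcloud) is available:

# Color words by frequency using a Brewer palette
library(RColorBrewer)
wordcloud(terms_vec, term_frequency3, max.words = 50,
          colors = brewer.pal(8, "Dark2"))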
Footnotes
A volatile corpus is a temporary corpus that is stored only in RAM and is deleted after use.↩︎
A text corpus is a large and unstructured set of texts (nowadays usually electronically stored and processed) used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. For more information: https://guides.library.uq.edu.au/research-techniques/text-mining-analysis/language-corpora↩︎