A Prelude to Text Mining

Introduction

Text mining extracts valuable insights from textual data using computational tools. The texts can come from diverse sources such as blogs, social media posts, websites, books, emails, and articles. Through statistical techniques or predictive modeling, text mining uncovers hidden patterns and trends, thereby generating new, high-quality information. In this note, we conduct a preliminary analysis and visual exploration of three text datasets (corpora) sourced from Blogs, News, and Twitter (X), which have been downloaded to my local disk. For similar corpora, see [2]. The exploration relies on several R packages tailored for text analytics and visualization.

To begin, we install (if necessary) and load the required packages.

## install.packages("tm")  # text mining
## install.packages("SnowballC") # text stemming
## install.packages("wordcloud") # word-cloud generator 
## install.packages("RColorBrewer") # color palettes
## Load all
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library(lattice) #barchart
library(dplyr)
library(tidyr)

Obtaining and exploring the corpora

Before proceeding with the upload of the corpora, it may be beneficial to examine their size and other preliminary details. The function “file.info()” can be particularly useful for this purpose. For instance, here’s the preliminary information for the blogs corpus:

file.info("en_US.blogs.txt")

##                      size isdir mode               mtime               ctime
## en_US.blogs.txt 210160014 FALSE  666 2024-06-15 21:23:47 2014-07-22 10:13:06
##                               atime exe
## en_US.blogs.txt 2024-06-21 23:21:32  no

The quantity in the “size” column above is the size of the file in bytes. Let’s summarize the sizes of the three corpora in megabytes.

blogs<-file.info("en_US.blogs.txt")$size*10^-6   
news<-file.info("en_US.news.txt")$size*10^-6  
twitter<- file.info("en_US.twitter.txt")$size*10^-6

df<-round(data.frame(blogs, news, twitter), 0)
head(df) ## show summary of file sizes in MegaBytes

##   blogs news twitter
## 1   210  206     167
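The three separate calls can also be collapsed into one vectorized call, since the base R function file.size() accepts a vector of paths. The following is a minimal sketch; the helper name size_mb and the vector corpus_files are introduced here for illustration only, and the file names are assumed to be in the working directory.

```r
# Hypothetical helper: file sizes in megabytes (10^6 bytes) for a vector of paths
size_mb <- function(paths) {
  sizes <- file.size(paths) / 1e6   # file.size() returns bytes, NA for a missing file
  names(sizes) <- basename(paths)
  round(sizes)
}

# File names as used above; adjust the paths if the files live elsewhere
corpus_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
size_mb(corpus_files)
```

A missing file simply shows up as NA, which makes it easy to spot a wrong path before attempting a long read.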

The three corpora are large, but our system appears capable of handling them. We proceed to read these files into R using the ‘readLines’ function (see, for example, [1]). Checking “?readLines” at the R prompt shows that the “skipNul” argument lets us skip embedded nul characters in the input.

blogs<-readLines("en_US.blogs.txt", skipNul = TRUE)
news <-readLines("en_US.news.txt", skipNul = TRUE)
twitter<-readLines("en_US.twitter.txt", skipNul = TRUE)
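While experimenting, it can be convenient to read only the first few lines of a large file rather than the whole corpus. The sketch below uses the n argument of readLines() for that; the helper name read_head is hypothetical, and the files are assumed (not verified here) to be UTF-8 encoded.

```r
# Hypothetical helper: read only the first `n` lines of a text file
read_head <- function(path, n = 1000) {
  # encoding = "UTF-8" marks the strings as UTF-8 (an assumption about the files)
  readLines(path, n = n, skipNul = TRUE, encoding = "UTF-8")
}

# Small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
read_head(tmp, n = 2)
```

The same call with the real file name, e.g. read_head("en_US.blogs.txt"), would return a manageable sample for testing the cleaning steps.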

We can view these files in different ways, such as head(blogs), View(as.data.frame(blogs)), str(as.data.frame(blogs)), etc.

head(blogs,3)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

Displayed above are the first 3 lines of the blogs corpus in a column. Similarly, we can examine the other two corpora to gain an understanding of their internal structure.

Additionally, it would be useful to determine the number of text lines in each corpus. Let’s summarize this information in a table.

## How many lines of text does each data set have?
Blogs<-length(blogs); News<- length(news); Twitter<- length(twitter);
df<-data.frame(Blogs, News, Twitter) 
head(df)

##    Blogs  News Twitter
## 1 899288 77259 2360148

Further, we can use the function “nchar()” to get a vector with the number of characters in each line. Here are some examples of how to use it:

nchar(blogs)[3] # number of characters in the third line of the blogs corpus

## [1] 692

max(nchar(blogs))  # the length of the longest line in the blogs corpus

## [1] 40833

min(nchar(blogs))  # the length of the shortest line in the blogs data set

## [1] 1
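Since nchar() returns an ordinary numeric vector, the usual summary tools apply to it. The following sketch uses a made-up three-line vector (toy_lines, not taken from the corpora) to show how one might summarize line lengths and locate the longest and shortest lines.

```r
# Toy stand-in for a corpus read with readLines()
toy_lines <- c("short", "a somewhat longer line", "x")

len <- nchar(toy_lines)          # number of characters per line
summary(len)                     # minimum, quartiles, mean, maximum
which.max(len)                   # index of the longest line
toy_lines[len == min(len)]       # the shortest line(s)
```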

Noise removal, normalization

For the purpose of cleaning and further exploratory analysis, we convert each character vector into a “tm” corpus by passing it through “VectorSource()” and “Corpus()”.

blogs <- Corpus(VectorSource(blogs))
news <- Corpus(VectorSource(news))
twitter <- Corpus(VectorSource(twitter))
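To see what this conversion does, here is a small sketch on a made-up two-element vector (toy and toy_corpus are illustrative names). Each element of the vector becomes one document, which is why the number of documents reported later equals the number of lines read from each file.

```r
library(tm)

# Two made-up lines standing in for a corpus file
toy <- c("Text mining is fun.", "R makes text mining easier.")
toy_corpus <- Corpus(VectorSource(toy))

length(toy_corpus)              # one document per element of `toy`
as.character(toy_corpus[[1]])   # text of the first document
```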

We start cleaning by replacing “/”, “@”, “|”, etc., with space:

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
blogs <- tm_map(blogs, toSpace, "/")
news <- tm_map(news, toSpace, "/")
twitter <- tm_map(twitter, toSpace, "/")

blogs <- tm_map(blogs, toSpace, "@")
news <- tm_map(news, toSpace, "@")
twitter <- tm_map(twitter, toSpace, "@")

blogs <- tm_map(blogs, toSpace, "\\|")
news <- tm_map(news, toSpace, "\\|") 
twitter <- tm_map(twitter, toSpace, "\\|")

blogs <- tm_map(blogs, toSpace, "http[[:alnum:]]*")# Remove URLs
news <- tm_map(news, toSpace, "http[[:alnum:]]*")# Remove URLs
twitter <- tm_map(twitter, toSpace, "http[[:alnum:]]*")# Remove URLs

blogs <- tm_map(blogs, toSpace, "http\\S+\\s*") # Remove URLs
news <- tm_map(news, toSpace, "http\\S+\\s*") # Remove URLs
twitter <- tm_map(twitter, toSpace, "http\\S+\\s*") # Remove URLs

blogs <- tm_map(blogs, toSpace, "[[:cntrl:]]+") # Remove Controls 
news <- tm_map(news, toSpace, "[[:cntrl:]]+") # Remove Controls 
twitter <- tm_map(twitter, toSpace, "[[:cntrl:]]+") # Remove Controls 

blogs <- tm_map(blogs, toSpace, "#\\S+") # remove hashtags
news <- tm_map(news, toSpace, "#\\S+") # remove hashtags
twitter <- tm_map(twitter, toSpace, "#\\S+") # remove hashtags

blogs <- tm_map(blogs, toSpace, "@\\S+") # Remove twitter handles
news <- tm_map(news, toSpace, "@\\S+") # Remove twitter handles
twitter <- tm_map(twitter, toSpace, "@\\S+") # Remove twitter handles

blogs <- tm_map(blogs, toSpace, "’")
news <- tm_map(news, toSpace, "’")
twitter <- tm_map(twitter, toSpace, "’")
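One point worth keeping in mind is that these substitutions are applied in sequence, so their order matters. Because “/” and “@” are replaced first, the later URL and handle patterns may no longer see whole URLs or handles. The sketch below illustrates this with plain gsub() on a made-up string; it is a simplified illustration with only four of the patterns, not a rerun of the pipeline above.

```r
# Made-up example string (not taken from the corpora)
x <- "see http://example.com/page and thank @someone"

# Order used above: "/" and "@" are replaced first ...
step1 <- gsub("@", " ", gsub("/", " ", x))
# ... so the URL and handle patterns no longer match the full tokens
as_above <- gsub("@\\S+", " ", gsub("http\\S+\\s*", " ", step1))

# Alternative order: remove whole URLs and handles first, then stray symbols
step2 <- gsub("@\\S+", " ", gsub("http\\S+\\s*", " ", x))
reordered <- gsub("@", " ", gsub("/", " ", step2))

as_above    # still contains "example.com", "page" and "someone"
reordered   # URL and handle are gone
```

If fragments such as “com” or “www” show up among the frequent terms, moving the URL and handle patterns ahead of the single-character replacements is one possible remedy.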

More cleaning is needed: converting all words to lower case, removing numbers, stop words (such as “a”, “the”, etc.) and punctuation, stripping extra white space, and stemming words to their common root. All of these can be done with predefined functions in the “tm” package.

# Convert the text to lower case
blogs <- tm_map(blogs, content_transformer(tolower))
news <- tm_map(news, content_transformer(tolower))
twitter <- tm_map(twitter, content_transformer(tolower))

# Remove numbers
blogs <- tm_map(blogs, removeNumbers)
news <- tm_map(news, removeNumbers)
twitter <- tm_map(twitter, removeNumbers)

# Remove common English stopwords
blogs <- tm_map(blogs, removeWords, stopwords("english"))
news <- tm_map(news, removeWords, stopwords("english"))
twitter <- tm_map(twitter, removeWords, stopwords("english"))

# Remove your own stop words
# specify your stopwords as a character vector
blogs <- tm_map(blogs, removeWords, c("'", "s")) 
news <- tm_map(news, removeWords, c("'", "s")) 
twitter <- tm_map(twitter, removeWords, c("'", "s")) 

# Remove punctuation
blogs <- tm_map(blogs, removePunctuation)
news <- tm_map(news, removePunctuation)
twitter <- tm_map(twitter, removePunctuation)

# Eliminate extra white spaces
blogs <- tm_map(blogs, stripWhitespace)
news <- tm_map(news, stripWhitespace)
twitter <- tm_map(twitter, stripWhitespace)

# Text stemming
blogs <- tm_map(blogs, stemDocument)
news <- tm_map(news, stemDocument)
twitter <- tm_map(twitter, stemDocument)
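Since the same sequence of steps is applied to all three corpora, it could be wrapped in a single function. The sketch below is one way to do that; clean_corpus and its extra_stopwords argument are hypothetical names introduced here, and the steps follow the same order as the code above.

```r
library(tm)
library(SnowballC)

# Hypothetical helper bundling the normalization steps above, in the same order
clean_corpus <- function(corpus, extra_stopwords = c("'", "s")) {
  corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
  corpus <- tm_map(corpus, removeNumbers)                      # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # common stop words
  corpus <- tm_map(corpus, removeWords, extra_stopwords)       # custom stop words
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  tm_map(corpus, stemDocument)                                 # stemming comes last
}

# Made-up documents to try the helper on
toy <- Corpus(VectorSource(c("The 2 dogs were running!", "She runs and he ran.")))
cleaned <- clean_corpus(toy)
out <- sapply(seq_along(cleaned), function(i) as.character(cleaned[[i]]))
out
```

With such a helper, the three corpora could be processed as blogs <- clean_corpus(blogs), and so on, which keeps the steps identical across corpora.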

Build Term Document Matrix and Frequency Table of Words for the blogs corpus

A more general way to represent the frequency distribution of terms (single words or, more generally, a bag of words) is the Term Document Matrix (TDM). A TDM represents a corpus in matrix form: the columns correspond to documents, the rows correspond to the terms in those documents, and each cell holds the frequency of a term in a document. Its transpose is called the Document Term Matrix (DTM); see, e.g., [3].
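As a small illustration of this layout, the sketch below builds a TDM for two made-up documents and transposes it. The object names (docs_toy, tdm_toy, m) are illustrative only.

```r
library(tm)

# Two made-up documents
docs_toy <- Corpus(VectorSource(c("apple banana apple", "banana cherry")))
tdm_toy <- TermDocumentMatrix(docs_toy)

m <- as.matrix(tdm_toy)   # rows = terms, columns = documents
m
t(m)                      # transpose: rows = documents, columns = terms (DTM layout)
```

Here the cell for “apple” in the first document holds 2, because the word occurs twice in that document.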

## Build a term document matrix and inspect
tdm <- TermDocumentMatrix(blogs) # by default, this also drops terms shorter than 3 characters
inspect(tdm)

## <<TermDocumentMatrix (terms: 327400, documents: 899288)>>
## Non-/sparse entries: 16625643/294410265557
## Sparsity           : 100%
## Maximal term length: 373
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms  144333 296687 311665 476221 483415 493020 506059 517366 603795 694864
##   can       5      1      3     18      7      1      0     31      8      2
##   day       7      1      1      6     10      0      7     10      3      0
##   get       2      1      2      1      1      6      1      7      1      2
##   just      0      3      0      3      1      2      0      2      3      3
##   know      0      0      0     23      0      1      0      4      2      1
##   like      1      0      5      5      4      3      0      1      1      0
##   make      3      0      4      7      2      2      2     14      7     10
##   one       6      3      3     13      6      4      5     15      3      8
##   time      2      5      1     11     10      6      2      6      4      3
##   will     22      2     11     21     19      2      0     41      6      9

The size of “tdm” makes it hard to handle as an ordinary matrix. We can, however, reduce its dimension by removing infrequent terms, especially terms whose counts are zero in nearly all documents. The sparsity of “tdm” (i.e., the proportion of cells that are zero among all cells) is very high (it rounds to 100%), so let’s remove these low-frequency terms and then convert the result to a regular matrix for convenience. With a threshold of 0.99, “removeSparseTerms()” keeps only the terms whose sparsity is below 99%, that is, terms that appear in more than about 1% of the documents.

tdm <- removeSparseTerms(tdm, 0.99)
tdm<-as.matrix(tdm)
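The figures reported by inspect() for the blogs corpus can be checked with a little arithmetic, which also shows why the full matrix should not be converted with as.matrix(). The sketch below uses only numbers from that output; the 8 bytes per cell is an assumption about a dense numeric matrix, and the variable names are illustrative.

```r
# Numbers copied from the inspect() output for the blogs corpus
n_terms <- 327400
n_docs  <- 899288
nonzero <- 16625643
zero    <- 294410265557

total_cells <- n_terms * n_docs
total_cells == nonzero + zero        # the two counts add up to all cells

sparsity <- zero / total_cells       # proportion of zero cells
round(sparsity, 5)                   # very close to 1, hence the reported 100%

dense_gb <- total_cells * 8 / 1e9    # assumed 8 bytes per cell if stored densely
dense_gb

min_docs <- (1 - 0.99) * n_docs      # document count a term must exceed to survive 0.99
min_docs
```

Under these assumptions a dense version of the full matrix would need on the order of two thousand gigabytes, while the 0.99 threshold keeps only terms that occur in roughly 9,000 or more blog documents.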

Next we create a data frame with word counts sorted by decreasing frequency.

v <- sort(rowSums(tdm),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 15)

##        word   freq
## one     one 133489
## will   will 115666
## like   like 109667
## can     can 108629
## time   time 105736
## just   just  99667
## get     get  94374
## make   make  80528
## day     day  70597
## know   know  68951
## year   year  66809
## use     use  64245
## love   love  64150
## thing thing  61978
## work   work  61688
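As an alternative to converting to a dense matrix, term totals can be computed directly on the sparse representation. The sketch below uses row_sums() from the “slam” package (the sparse-matrix package that “tm” builds on) and findFreqTerms() from “tm” on made-up documents; it is an illustration of the idea, not the code that produced the table above.

```r
library(tm)
library(slam)

# Three made-up documents
docs_toy <- Corpus(VectorSource(c("apple banana apple",
                                  "banana cherry",
                                  "apple cherry cherry")))
tdm_toy <- TermDocumentMatrix(docs_toy)

# Term totals computed on the sparse matrix, without as.matrix()
v_toy <- sort(row_sums(tdm_toy), decreasing = TRUE)
d_toy <- data.frame(word = names(v_toy), freq = v_toy)
d_toy

# Terms whose total frequency is at least 3
findFreqTerms(tdm_toy, lowfreq = 3)
```

This route avoids the memory cost of a dense matrix, so it would in principle also work before any sparse terms are removed.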

We are now ready to visualize the corpora.

Visualize the blogs corpus

A bar chart and a word cloud for the blogs corpus:

## Generate bargraph of d
d$word <- factor(d$word, levels=unique(as.character(d$word))) # Controls order of bars
barchart(freq~word, data = d[1:10,],main="Top Ten Most frequent words of the blogs corpus",
         xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="blue")

#Generate word cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Similarly, we can adapt the previous code to visualize the remaining two corpora.
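Because the same steps recur for each corpus, they could be collected in a small function. The sketch below is one possible version; term_freq_table is a hypothetical name, and the example runs on made-up documents rather than the real corpora.

```r
library(tm)

# Hypothetical helper wrapping the steps used for the blogs corpus
term_freq_table <- function(corpus, sparse = 0.99) {
  tdm <- TermDocumentMatrix(corpus)
  tdm <- removeSparseTerms(tdm, sparse)                   # drop very sparse terms
  v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # total count per term
  data.frame(word = names(v), freq = v)
}

# Made-up documents to try the helper on
toy <- Corpus(VectorSource(c("apple banana apple", "banana cherry")))
d_toy <- term_freq_table(toy)
d_toy
```

The resulting data frame has the same shape as the frequency table built for the blogs corpus, so it can be passed to barchart() and wordcloud() in the same way.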

Processing and visualizing the news corpus

For the news corpus, we have:

## Build a term document matrix and frequency distribution for the news corpus
tdm_news <- TermDocumentMatrix(news) 
tdm_news <- removeSparseTerms(tdm_news, 0.99)
tdm_news<-as.matrix(tdm_news)
v_news <- sort(rowSums(tdm_news),decreasing=TRUE)
d_news <- data.frame(word = names(v_news),freq=v_news)
head(d_news, 15)

##        word  freq
## said   said 19169
## will   will  8698
## year   year  8464
## one     one  6673
## new     new  5332
## state state  5223
## time   time  5168
## say     say  4869
## get     get  4669
## like   like  4590
## can     can  4582
## also   also  4515
## two     two  4459
## first first  4154
## just   just  4132

A bar graph for the news corpus:

## Generate barchart of d_news
d_news$word <- factor(d_news$word, levels=unique(as.character(d_news$word))) # Controls order of bars
barchart(freq~word, data = d_news[1:10,],main="Top Ten Most frequent words of the news corpus",
         xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="red")

A word cloud for the news corpus:

#Generate word cloud
set.seed(1234)
wordcloud(words = d_news$word, freq = d_news$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Processing and visualizing the twitter corpus

For the twitter corpus, we have:

## Build a term document matrix and frequency distribution for the twitter corpus
tdm_twitter <- TermDocumentMatrix(twitter) 
tdm_twitter <- removeSparseTerms(tdm_twitter, 0.99)
tdm_twitter <- as.matrix(tdm_twitter)
v_twitter <- sort(rowSums(tdm_twitter),decreasing=TRUE)
d_twitter <- data.frame(word = names(v_twitter),freq=v_twitter)
head(d_twitter, 15)

##          word   freq
## just     just 149833
## get       get 146044
## thank   thank 130642
## like     like 129760
## love     love 123269
## day       day 109315
## good     good 101664
## will     will  95871
## can       can  90069
## one       one  86650
## time     time  85750
## know     know  85678
## now       now  82295
## follow follow  77934
## great   great  76589

A bar graph for the twitter corpus:

## Generate barchart of d_twitter
d_twitter$word <- factor(d_twitter$word, levels=unique(as.character(d_twitter$word))) # Controls order of bars
barchart(freq~word, data = d_twitter[1:10,],main="Top Ten Most frequent words of the twitter corpus",
         xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="purple")

A word cloud for the twitter corpus:

#Generate word cloud
set.seed(1234)
wordcloud(words = d_twitter$word, freq = d_twitter$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
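With the three frequency tables in hand, a simple first comparison is to check which top terms the corpora share. The sketch below types in the top 15 stems from the three tables above (the vector names are illustrative); with the objects from this note, the same vectors could be obtained with, e.g., as.character(head(d$word, 15)).

```r
# Top 15 stems copied from the three frequency tables above
top_blogs   <- c("one", "will", "like", "can", "time", "just", "get", "make",
                 "day", "know", "year", "use", "love", "thing", "work")
top_news    <- c("said", "will", "year", "one", "new", "state", "time", "say",
                 "get", "like", "can", "also", "two", "first", "just")
top_twitter <- c("just", "get", "thank", "like", "love", "day", "good", "will",
                 "can", "one", "time", "know", "now", "follow", "great")

# Stems that appear in all three top-15 lists
common <- Reduce(intersect, list(top_blogs, top_news, top_twitter))
common

# Stems in the news top 15 that appear in neither of the other two lists
news_only <- setdiff(top_news, union(top_blogs, top_twitter))
news_only
```

Seven stems (one, will, like, can, time, just, get) are shared by all three lists, while stems such as “said” and “state” appear only in the news list.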

Conclusion: In this note, we explored text mining methods applied to three distinct corpora sourced from Blogs, News, and Twitter. Text mining, a pivotal aspect of modern data science, uses computational tools to extract insights from unstructured textual data across diverse sources such as social media, websites, and articles. The initial step was data acquisition, where we assessed the size and structure of each corpus to make sure our system could handle such large datasets. The text then went through preprocessing steps, including noise removal, normalization, and transformation, to prepare it for deeper analysis. Term frequency analysis, represented here through Term Document Matrices (TDM), enabled the identification of key terms and patterns within each corpus. Visualization techniques such as bar graphs and word clouds were used to present the findings intuitively, making it easier to identify prevalent themes and trends across the different text sources. This work demonstrates the use of R for text analytics and underscores the utility of text mining in generating valuable insights from vast textual datasets across multiple domains.

Future Directions: Our roadmap involves developing a data product, a Shiny app, that hosts predictive text models. These models will be designed to work in the environments where they will be used, whether desktop computers or mobile devices. We will prioritize model accuracy and efficiency, evaluating performance through metrics such as perplexity and accuracy in predicting sequences of words. For deeper exploration of this area and other directions, interested readers can refer to references [4-6].

References

[1] Phil Spector, Reading Data into R available at Berkeley Stat Website.

[2] A site for English Corpora - English-Corpora.org.

[3] Heena Girdher, TDM (Term Document Matrix) and DTM (Document Term Matrix), Analytics Vidhya, July 30, 2021.

[4] Julia Silge & David Robinson, Text Mining with R: A Tidy Approach, 1st Edition, O’Reilly Media.

[5] Ted Kwartler, Text Mining in Practice with R, Wiley, 2017.

[6] Mong Shen Ng, People Analytics & Text Mining with R, Independently published, 2019.