Text mining involves extracting valuable insights from textual data using computational tools. The texts can come from diverse sources such as blogs, social media posts, websites, books, emails, and articles. Through statistical techniques or predictive modeling, text mining uncovers patterns and trends that are not apparent from reading the raw text. In this note, we conduct a preliminary analysis and visual exploration of three text datasets (corpora) sourced from Blogs, News, and Twitter (X), which have been downloaded to my local disk. For similar corpora, see [2]. We rely on several R packages tailored for text analytics and visualization.
To begin, we’ll install the necessary packages.
## install.packages("tm") # text mining
## install.packages("SnowballC") # text stemming
## install.packages("wordcloud") # word-cloud generator
## install.packages("RColorBrewer") # color palettes
## Load all
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library(lattice) #barchart
library(dplyr)
library(tidyr)
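As a minimal sketch, the check below lists which of the packages used in this note are not yet installed, so that only those need to be passed to install.packages(). The helper name missing_packages is hypothetical and introduced only for illustration.

```r
# Packages loaded in this note
needed <- c("tm", "SnowballC", "wordcloud", "RColorBrewer",
            "lattice", "dplyr", "tidyr")

# Hypothetical helper: return the subset of pkgs that cannot be found
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

The install call is left commented out so that running the snippet never changes the local library by accident.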
Before loading the corpora, it helps to examine their size and other preliminary details. The function “file.info()” is useful for this purpose. For instance, here is the preliminary information for the blogs corpus:
file.info("en_US.blogs.txt")
## size isdir mode mtime ctime
## en_US.blogs.txt 210160014 FALSE 666 2024-06-15 21:23:47 2014-07-22 10:13:06
## atime exe
## en_US.blogs.txt 2024-06-21 23:21:32 no
The quantity in the “size” column above is the size of the file in bytes. Let’s summarize the sizes of the three corpora in megabytes.
blogs<-file.info("en_US.blogs.txt")$size*10^-6
news<-file.info("en_US.news.txt")$size*10^-6
twitter<- file.info("en_US.twitter.txt")$size*10^-6
df<-round(data.frame(blogs, news, twitter), 0)
head(df) ## show summary of file sizes in MegaBytes
## blogs news twitter
## 1 210 206 167
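The same summary can be sketched as a small function that takes any vector of file paths. The name size_mb is hypothetical, and a megabyte is taken here to be 10^6 bytes, as in the code above (not 2^20 bytes).

```r
# Hypothetical helper: file sizes in megabytes (1 MB = 10^6 bytes), rounded
size_mb <- function(paths) {
  sizes <- file.size(paths) / 1e6
  names(sizes) <- basename(paths)
  round(sizes)
}

# Demonstration on a temporary file of exactly 3,000,000 bytes
demo_file <- tempfile(fileext = ".txt")
writeBin(raw(3e6), demo_file)
demo_size <- size_mb(demo_file)
print(demo_size)
```

For the corpora in this note, the call would be size_mb(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")), assuming the files sit in the working directory.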
The three corpora are large, but our system appears capable of handling them. We will read these files with the ‘readLines’ function (see, for example, [1]). Checking “?readLines” at the R prompt shows that the “skipNul” argument lets us skip embedded nul characters.
blogs<-readLines("en_US.blogs.txt", skipNul = TRUE)
news <-readLines("en_US.news.txt", skipNul = TRUE)
twitter<-readLines("en_US.twitter.txt", skipNul = TRUE)
We can view these files in different ways, such as head(blogs), View(as.data.frame(blogs)), str(as.data.frame(blogs)), etc.
head(blogs,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
Displayed above are the first 3 lines of the blogs corpus in a column. Similarly, we can examine the other two corpora to gain an understanding of their internal structure.
Additionally, it would be useful to determine the number of text lines in each corpus. Let’s summarize this information in a table.
## How many text lines does each data set have?
Blogs<-length(blogs); News<- length(news); Twitter<- length(twitter);
df<-data.frame(Blogs, News, Twitter)
head(df)
## Blogs News Twitter
## 1 899288 77259 2360148
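The news file is about as large on disk as the blogs file, yet it shows far fewer lines. One possible explanation, stated here as an assumption and not verified against these files, is that text-mode reading on Windows can stop early at an embedded control character such as Ctrl-Z (0x1A). A common workaround is to read through a connection opened in binary mode. The sketch below uses a hypothetical helper read_all_lines and a small temporary file; on non-Windows systems it only demonstrates that the binary-mode read returns every line.

```r
# Hypothetical helper: read every line through a binary-mode connection
read_all_lines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, skipNul = TRUE, warn = FALSE)
}

# Demonstration: a three-line file whose second line contains 0x1A
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line\n"), demo_path)
demo_lines <- read_all_lines(demo_path)
length(demo_lines)
```

The counts and frequencies reported in this note were produced with the plain readLines calls shown earlier, so they would change if the news file were re-read this way and turned out to be longer.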
Further, we can use the function “nchar()” to get a vector with the number of characters in each line. Here are examples of how to use it:
nchar(blogs)[3] # number of characters in the third line of the blogs corpus
## [1] 692
max(nchar(blogs)) # the length of the longest line in the blogs corpus
## [1] 40833
min(nchar(blogs)) # the length of the shortest line in the blogs data set
## [1] 1
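Since nchar() scans the whole vector on every call, it can be convenient to compute the lengths once and reuse them. The sketch below does this on a small made-up vector (toy_lines is illustrative only, not part of the corpora).

```r
# Toy stand-in for a corpus read with readLines()
toy_lines <- c("a", "hello world", "We love you Mr. Brown.")

# Compute the line lengths once, then reuse the result
line_lengths <- nchar(toy_lines)

range(line_lengths)      # shortest and longest line lengths
which.max(line_lengths)  # index of the longest line
summary(line_lengths)    # quartiles and mean of the line lengths
```

On the blogs corpus the same pattern would be line_lengths <- nchar(blogs), followed by the same three calls.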
For cleaning and further exploratory analysis, we wrap each character vector in a “tm” corpus object using a vector source.
blogs <- Corpus(VectorSource(blogs))
news <- Corpus(VectorSource(news))
twitter <- Corpus(VectorSource(twitter))
We start cleaning by replacing “/”, “@”, “|”, etc., with a space:
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
blogs <- tm_map(blogs, toSpace, "/")
news <- tm_map(news, toSpace, "/")
twitter <- tm_map(twitter, toSpace, "/")
blogs <- tm_map(blogs, toSpace, "@")
news <- tm_map(news, toSpace, "@")
twitter <- tm_map(twitter, toSpace, "@")
blogs <- tm_map(blogs, toSpace, "\\|")
news <- tm_map(news, toSpace, "\\|")
twitter <- tm_map(twitter, toSpace, "\\|")
blogs <- tm_map(blogs, toSpace, "http[[:alnum:]]*")# Remove URLs
news <- tm_map(news, toSpace, "http[[:alnum:]]*")# Remove URLs
twitter <- tm_map(twitter, toSpace, "http[[:alnum:]]*")# Remove URLs
blogs <- tm_map(blogs, toSpace, "http\\S+\\s*") # Remove URLs
news <- tm_map(news, toSpace, "http\\S+\\s*") # Remove URLs
twitter <- tm_map(twitter, toSpace, "http\\S+\\s*") # Remove URLs
blogs <- tm_map(blogs, toSpace, "[[:cntrl:]]+") # Remove Controls
news <- tm_map(news, toSpace, "[[:cntrl:]]+") # Remove Controls
twitter <- tm_map(twitter, toSpace, "[[:cntrl:]]+") # Remove Controls
blogs <- tm_map(blogs, toSpace, "#\\S+") # remove hashtags
news <- tm_map(news, toSpace, "#\\S+") # remove hashtags
twitter <- tm_map(twitter, toSpace, "#\\S+") # remove hashtags
blogs <- tm_map(blogs, toSpace, "@\\S+") # Remove twitter handles
news <- tm_map(news, toSpace, "@\\S+") # Remove twitter handles
twitter <- tm_map(twitter, toSpace, "@\\S+") # Remove twitter handles
blogs <- tm_map(blogs, toSpace, "’")
news <- tm_map(news, toSpace, "’")
twitter <- tm_map(twitter, toSpace, "’")
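The order of these substitutions matters. In the sequence above, “/” and “@” are replaced before the URL and handle patterns run, so those later patterns may no longer see a complete URL or handle. The base-R sketch below illustrates the effect on two made-up strings (example.com and some_user are placeholders); to_space mirrors the toSpace transformer without requiring “tm”.

```r
# Plain-function version of the toSpace idea
to_space <- function(x, pattern) gsub(pattern, " ", x)

url_text    <- "see http://example.com/page now"
handle_text <- "thanks @some_user"

# Slash first, URL pattern second: parts of the URL survive
slash_first <- to_space(to_space(url_text, "/"), "http\\S+\\s*")
# URL pattern first, slash second: the whole URL is removed
url_first   <- to_space(to_space(url_text, "http\\S+\\s*"), "/")

# "@" first, handle pattern second: the user name survives
at_first     <- to_space(to_space(handle_text, "@"), "@\\S+")
# Handle pattern first: the whole handle is removed
handle_first <- to_space(handle_text, "@\\S+")

print(c(slash_first, url_first, at_first, handle_first))
```

The results reported in this note come from the original order shown above; running the URL, hashtag, and handle patterns first would be one way to remove those tokens more completely, at the cost of slightly different counts.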
More cleaning is needed: converting all words to lower case, removing numbers, stop words (such as “a” and “the”), and punctuation, and stemming words that share a root to that root. We use predefined functions in the “tm” package for these steps.
# Convert the text to lower case
blogs <- tm_map(blogs, content_transformer(tolower))
news <- tm_map(news, content_transformer(tolower))
twitter <- tm_map(twitter, content_transformer(tolower))
# Remove numbers
blogs <- tm_map(blogs, removeNumbers)
news <- tm_map(news, removeNumbers)
twitter <- tm_map(twitter, removeNumbers)
# Remove english common stopwords
blogs <- tm_map(blogs, removeWords, stopwords("english"))
news <- tm_map(news, removeWords, stopwords("english"))
twitter <- tm_map(twitter, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
blogs <- tm_map(blogs, removeWords, c("'", "s"))
news <- tm_map(news, removeWords, c("'", "s"))
twitter <- tm_map(twitter, removeWords, c("'", "s"))
# Remove punctuations
blogs <- tm_map(blogs, removePunctuation)
news <- tm_map(news, removePunctuation)
twitter <- tm_map(twitter, removePunctuation)
# Eliminate extra white spaces
blogs <- tm_map(blogs, stripWhitespace)
news <- tm_map(news, stripWhitespace)
twitter <- tm_map(twitter, stripWhitespace)
# Text stemming
blogs <- tm_map(blogs, stemDocument)
news <- tm_map(news, stemDocument)
twitter <- tm_map(twitter, stemDocument)
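To see what these steps do to a single line, the sketch below applies rough base-R approximations of them to one made-up sentence. These are not the “tm” implementations: the three-word stop list is a tiny illustrative subset, stemming is omitted because it needs “SnowballC”, and the final trimws() call is an extra step added here for a tidy result.

```r
toy_text <- "The 3 Cats are running, and the dog's 2 toys!!"
toy_stopwords <- c("the", "are", "and")  # illustrative subset only

x <- tolower(toy_text)                               # lower case
x <- gsub("[[:digit:]]+", "", x)                     # remove numbers
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
x <- gsub(stop_pattern, "", x, perl = TRUE)          # remove stop words
x <- gsub("[[:punct:]]+", "", x)                     # remove punctuation
x <- gsub("[[:space:]]+", " ", x)                    # collapse white space
cleaned <- trimws(x)                                 # drop leading/trailing space
print(cleaned)
```

Stop words are removed before punctuation here, as in the code above, so contractions in the stop list can still match while the apostrophe is present.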
A more general way to obtain a frequency distribution of terms (single words or a bag of words) is the Term Document Matrix (TDM). A TDM represents a corpus in matrix form: columns correspond to documents, rows correspond to the terms in the documents, and each cell holds the frequency of a term in a document. Its transpose is called the Document Term Matrix (DTM); see, e.g., [3].
## Build a term document matrix and inspect
tdm <- TermDocumentMatrix(blogs) # by default this also drops terms shorter than 3 characters (see the wordLengths control option)
inspect(tdm)
## <<TermDocumentMatrix (terms: 327400, documents: 899288)>>
## Non-/sparse entries: 16625643/294410265557
## Sparsity : 100%
## Maximal term length: 373
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 144333 296687 311665 476221 483415 493020 506059 517366 603795 694864
## can 5 1 3 18 7 1 0 31 8 2
## day 7 1 1 6 10 0 7 10 3 0
## get 2 1 2 1 1 6 1 7 1 2
## just 0 3 0 3 1 2 0 2 3 3
## know 0 0 0 23 0 1 0 4 2 1
## like 1 0 5 5 4 3 0 1 1 0
## make 3 0 4 7 2 2 2 14 7 10
## one 6 3 3 13 6 4 5 15 3 8
## time 2 5 1 11 10 6 2 6 4 3
## will 22 2 11 21 19 2 0 41 6 9
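To make the row and column layout concrete, here is a minimal base-R sketch that builds a term document matrix for three made-up documents. It does not use “tm”, and unlike TermDocumentMatrix() it keeps terms of every length.

```r
# Three tiny made-up documents
toy_docs <- c(d1 = "text mining is fun",
              d2 = "mining text with r",
              d3 = "fun with text")

tokens <- strsplit(toy_docs, " ", fixed = TRUE)   # split each document into words
terms  <- sort(unique(unlist(tokens)))            # vocabulary, one row per term

# Count each term in each document: rows are terms, columns are documents
toy_tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(toy_tdm) <- terms

toy_tdm            # the term document matrix
t(toy_tdm)         # its transpose, the document term matrix
rowSums(toy_tdm)   # total frequency of each term across documents
```

The row sums are exactly the word counts used later to rank the most frequent terms.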
The size of “tdm” makes it hard to handle as an ordinary matrix. We can, however, reduce its dimension by removing infrequent terms, that is, terms whose rows are zero in almost every document. The sparsity of “tdm” (the proportion of cells that are zero) rounds to 100%, so let’s remove terms that are absent from roughly 99% or more of the documents and then convert the result to an ordinary matrix for convenience.
tdm <- removeSparseTerms(tdm, 0.99)
tdm<-as.matrix(tdm)
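The sparsity figure can be recomputed from the numbers printed by inspect(tdm). The sketch below uses only those printed values; the last line gives a rough reading of the 0.99 threshold as a minimum number of documents a term must appear in to be kept, which is an interpretation of the threshold and not output from “tm”.

```r
# Values copied from the inspect(tdm) output for the blogs corpus
n_terms <- 327400
n_docs  <- 899288
nonzero <- 16625643
zero    <- 294410265557

total_cells <- n_terms * n_docs
sparsity    <- zero / total_cells   # proportion of cells that are zero

total_cells == nonzero + zero       # the two counts cover every cell
round(100 * sparsity)               # rounds to 100, as reported
100 * sparsity                      # slightly below 100 before rounding

# Rough document-count threshold implied by a maximal sparsity of 0.99
min_docs <- n_docs * (1 - 0.99)
min_docs
```

So a term has to occur in roughly one percent of the blog lines, several thousand documents, to survive the filter, which is why the resulting matrix is small enough to convert with as.matrix().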
Next we create a data frame with word counts sorted by decreasing frequency.
v <- sort(rowSums(tdm),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 15)
## word freq
## one one 133489
## will will 115666
## like like 109667
## can can 108629
## time time 105736
## just just 99667
## get get 94374
## make make 80528
## day day 70597
## know know 68951
## year year 66809
## use use 64245
## love love 64150
## thing thing 61978
## work work 61688
We are now ready to visualize the corpora.
A bar chart for the blogs corpus:
## Generate bargraph of d
d$word <- factor(d$word, levels=unique(as.character(d$word))) # Controls order of bars
barchart(freq~word, data = d[1:10,],main="Top Ten Most frequent words of the blogs corpus",
xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="blue")
#Generate word cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Similarly, we can adapt the previous code to visualize the remaining two corpora.
For the news corpus, we have:
## Build a term document matrix and frequency distribution for the news corpus
tdm_news <- TermDocumentMatrix(news)
tdm_news <- removeSparseTerms(tdm_news, 0.99)
tdm_news<-as.matrix(tdm_news)
v_news <- sort(rowSums(tdm_news),decreasing=TRUE)
d_news <- data.frame(word = names(v_news),freq=v_news)
head(d_news, 15)
## word freq
## said said 19169
## will will 8698
## year year 8464
## one one 6673
## new new 5332
## state state 5223
## time time 5168
## say say 4869
## get get 4669
## like like 4590
## can can 4582
## also also 4515
## two two 4459
## first first 4154
## just just 4132
A bar graph for the news corpus:
## Generate barchart of d_news
d_news$word <- factor(d_news$word, levels=unique(as.character(d_news$word))) # Controls order of bars
barchart(freq~word, data = d_news[1:10,],main="Top Ten Most frequent words of the news corpus",
xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="red")
A word cloud for the news corpus
#Generate word cloud
set.seed(1234)
wordcloud(words = d_news$word, freq = d_news$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
For the twitter corpus, we have:
## Build a term document matrix and frequency distribution for the twitter corpus
tdm_twitter <- TermDocumentMatrix(twitter)
tdm_twitter <- removeSparseTerms(tdm_twitter, 0.99)
tdm_twitter <- as.matrix(tdm_twitter)
v_twitter <- sort(rowSums(tdm_twitter),decreasing=TRUE)
d_twitter <- data.frame(word = names(v_twitter),freq=v_twitter)
head(d_twitter, 15)
## word freq
## just just 149833
## get get 146044
## thank thank 130642
## like like 129760
## love love 123269
## day day 109315
## good good 101664
## will will 95871
## can can 90069
## one one 86650
## time time 85750
## know know 85678
## now now 82295
## follow follow 77934
## great great 76589
A bar graph for the twitter corpus:
## Generate barchart of d_twitter
d_twitter$word <- factor(d_twitter$word, levels=unique(as.character(d_twitter$word))) # Controls order of bars
barchart(freq~word, data = d_twitter[1:10,],main="Top Ten Most frequent words of the twitter corpus",
xlab = "Most Frequent Words", ylab = "Frequency of Words", col ="purple")
A word cloud for the twitter corpus
#Generate word cloud
set.seed(1234)
wordcloud(words = d_twitter$word, freq = d_twitter$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
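The three corpora go through identical steps between the reduced matrix and the plots, so the frequency-table part could be wrapped in one function. The sketch below is one possible way to do that in base R; top_terms is a hypothetical name, and the small matrix stands in for the output of as.matrix() on a reduced TDM.

```r
# Hypothetical helper: top-n terms of a term document matrix (rows = terms)
top_terms <- function(m, n = 10) {
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = unname(v), stringsAsFactors = FALSE)
  # Fix the factor levels so plots keep the frequency order
  d$word <- factor(d$word, levels = unique(d$word))
  head(d, n)
}

# Toy matrix with three terms and three documents
toy_m <- matrix(c(1, 0, 2,
                  3, 1, 0,
                  5, 2, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("a", "b", "c"), NULL))
toy_top <- top_terms(toy_m, n = 2)
print(toy_top)
```

With this helper, each bar chart call would take data = top_terms(tdm_news) or data = top_terms(tdm_twitter) in place of the manually built data frames.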
Conclusion: In this note, we explored text mining methods applied to three distinct corpora sourced from Blogs, News, and Twitter. Text mining, a pivotal aspect of modern data science, uses computational tools to extract insights from unstructured textual data across diverse sources such as social media, websites, and articles. We began with data acquisition, assessing the size and structure of each corpus to confirm that our system could handle them. We then preprocessed the text through noise removal, normalization, and transformation to prepare it for deeper analysis. Term frequency analysis, represented here through Term Document Matrices (TDM), allowed us to identify the most frequent terms in each corpus. Bar charts and word clouds presented these findings visually, making it easier to compare prevalent words across the different text sources. This work demonstrates the use of R for text analytics and the utility of text mining for drawing insights from large textual datasets.
Future Directions: Our roadmap involves developing a data product, a Shiny app, that hosts predictive text models. These models will be designed to work in the environments where they will be used, whether desktop computers or mobile devices. We will prioritize model accuracy and efficiency, evaluating performance through metrics such as perplexity and accuracy in predicting sequences of words. For deeper exploration of this area and other directions, interested readers can refer to references [4-6].
[1] Phil Spector, Reading Data into R, available at the Berkeley Statistics website.
[2] English-Corpora.org, a site for English corpora.
[3] Heena Girdher, TDM (Term Document Matrix) and DTM (Document Term Matrix), Analytics Vidhya, July 30, 2021.
[4] Julia Silge & David Robinson, Text Mining with R: A Tidy Approach, 1st Edition, O’Reilly Media.
[5] Ted Kwartler, Text Mining in Practice with R, Wiley, 2017.
[6] Mong Shen Ng, People Analytics & Text Mining with R, Independently published, 2019.