Objectives: The objectives of this analysis were 1) to prepare and explore a large corpus of English-language texts obtained from news sites, blogs, and Twitter; and 2) to delineate next steps for constructing a model that predicts the next word in a sentence.
Methods: Three large datasets in zip format were downloaded and unzipped. The datasets were three natural-language text files from news sites, blogs, and Twitter. The files were read in binary mode and descriptive summaries were computed. Because the files are large, a random subset containing 1% of the lines of each file was drawn to speed up processing. All non-ASCII characters were removed from the three subsets, which were then combined to form the dataset for analysis. The dataset was converted into a corpus, and the corpus was cleaned. The cleaning steps involved:
1. removing punctuation
2. removing numbers
3. converting all words to lower case
4. removing common English stopwords
5. removing extra whitespace
6. stemming the documents, and
7. removing profane words.
After cleaning, the corpus was converted into a tidy text format and tokenized into single words, bigrams, and trigrams. The results were summarized using plots and tables.
Results: The blogs, news, and Twitter datasets were 200.4, 196.3, and 159.4 MB, respectively. There were approximately 900K lines in the blogs file, ~1 million lines in the news file, and ~2.4 million lines in the Twitter file.
Based on the sample corpus, the most common word was “will”, occurring more than 3000 times in the sample. The top 5 words were “will”, “said”, “get”, “like”, and “just”. The most common bigrams (two words occurring together) were “right now”, “last year”, “look like”, “dont know”, and “feel like”, whereas the most common trigrams were “cant wait see”, “let us know”, “happi mother day”, “new york citi”, and “happi new year”. The n-grams are shown in their cleaned, stemmed form.
Further Steps: Using frequencies of single words, bigrams, trigrams, and quadgrams (four-word sequences), prepare a model that predicts the second, third, or fourth word of a sequence from the words that precede it. The frequencies with which a particular word follows a given word or set of words will be used to predict the next word. Some words are used infrequently, and a way of predicting them still needs to be developed. In addition, stopwords were removed in the current analysis; the prediction model, however, must also be able to predict the next word when it is a stopword such as “a” or “the”.
The details of the text mining analysis, along with the code and results, are presented below.
First, two file paths were defined: fp, the directory containing the blogs, news, and Twitter files, and fp2, the directory containing a list of common English-language profanity words.
Because some of the files contain special characters, the files were read in binary mode to avoid problems during reading. The code is given below.
conn <- file(paste(fp,"/","en_US.blogs.txt", sep=""), open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)
conn <- file(paste(fp,"/","en_US.news.txt", sep=""), open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)
conn <- file(paste(fp,"/","en_US.twitter.txt", sep=""), open = "rb")
twit <- readLines(conn, encoding = "UTF-8")
close(conn)
bad_words<-readLines(paste(fp2,"/","bad-words.txt", sep=""), encoding="UTF-8", skipNul = TRUE)
bad_words <- bad_words[2:1384]  # drop the first line; keep entries 2 through 1384
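The three binary-mode reads above repeat the same open, read, close pattern. As a minimal sketch, they could be wrapped in a small helper. The name read_binary_lines is hypothetical and does not appear in the original analysis; the temporary file is used only so that the example runs on its own.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary-mode connection, as done above for the three datasets.
read_binary_lines <- function(path) {
  conn <- file(path, open = "rb")
  on.exit(close(conn))  # close the connection even if readLines() fails
  readLines(conn, encoding = "UTF-8")
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_binary_lines(tmp)
unlink(tmp)

# With the report's own paths, the call would look like:
# blogs <- read_binary_lines(paste(fp, "/", "en_US.blogs.txt", sep = ""))
```

Using on.exit() keeps the connection from being left open if the read stops with an error partway through a large file.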
Once the datasets were read in, they were summarized by file size, number of lines, and number of words. The summary is presented in Table 1 below.
library(stringi)  # stri_count_words()
library(knitr)    # kable()
# File sizes in MB (bytes / 1024^2), rounded to one decimal place
blogs_size <- round(file.info(paste(fp, "/", "en_US.blogs.txt", sep = ""))$size / 1024^2, 1)
news_size <- round(file.info(paste(fp, "/", "en_US.news.txt", sep = ""))$size / 1024^2, 1)
twit_size <- round(file.info(paste(fp, "/", "en_US.twitter.txt", sep = ""))$size / 1024^2, 1)
# Total number of words in each source
blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twit_words <- sum(stri_count_words(twit))
# One row per source: file size, number of lines, number of words
data <- data.frame(name = c("blogs", "news", "twitter"),
                   file_size_MB = c(blogs_size, news_size, twit_size),
                   num_lines = c(length(blogs), length(news), length(twit)),
                   num_words = c(blogs_words, news_words, twit_words))
kable(data, caption = "Table 1- Summary of the natural language datasets", row.names = FALSE)
| name | file_size_MB | num_lines | num_words |
|---|---|---|---|
| blogs | 200.4 | 899288 | 37546246 |
| news | 196.3 | 1010242 | 34762395 |
| twitter | 159.4 | 2360148 | 30093369 |
As the datasets are large, in order to reduce the processing time for the exploratory analysis, only a subset of the data from the three sources was used in the analysis. The datasets were randomly sampled to get about 1% of the lines in each dataset. Following the sampling, all the non-ASCII characters were removed from the sampled datasets. Once the non-ASCII characters were removed the datasets were all joined together and converted to a corpus which was used in the subsequent analysis.
library(tm)  # VCorpus(), VectorSource(), tm_map()
set.seed(12345)  # make the random samples reproducible
# Sample 1% of the lines from each source
sample_blogs <- sample(blogs, length(blogs) * 0.01)
sample_news <- sample(news, length(news) * 0.01)
sample_twit <- sample(twit, length(twit) * 0.01)
# Remove non-ASCII characters (characters with no ASCII equivalent are dropped)
sample_blogs <- iconv(sample_blogs, "UTF-8", "ASCII", sub = "")
sample_news <- iconv(sample_news, "UTF-8", "ASCII", sub = "")
sample_twit <- iconv(sample_twit, "UTF-8", "ASCII", sub = "")
# Combine the three samples and create a corpus from them
sample_data <- c(sample_blogs, sample_news, sample_twit)
docs <- VCorpus(VectorSource(sample_data))
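The sampling and non-ASCII removal steps can be checked on a few lines before running them on the full files. The following is a small sketch using only base R. The toy_lines vector and the 50% sampling fraction are made up for illustration, and floor() is used here only to make the sample size an explicit whole number.

```r
set.seed(12345)

# Hypothetical example lines; \u escapes are used so the non-ASCII
# characters are unambiguous regardless of the script's encoding
toy_lines <- c("caf\u00e9 au lait", "plain ascii text", "na\u00efve idea",
               "price: 5\u20ac", "hello world")

# Draw a fraction of the lines; floor() makes the sample size a whole number
n_keep <- floor(length(toy_lines) * 0.5)  # 2.5 becomes 2
toy_sample <- sample(toy_lines, n_keep)

# Drop characters that cannot be represented in ASCII
toy_clean <- iconv(toy_lines, "UTF-8", "ASCII", sub = "")

# Verify that only ASCII code points (below 128) remain in every line
is_ascii <- vapply(toy_clean, function(s) all(utf8ToInt(s) < 128L), logical(1))
all(is_ascii)
```

Lines that are already pure ASCII, such as the second one, pass through iconv() unchanged.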
Once the corpus was generated, it was cleaned by removing punctuation, removing numbers, converting all characters to lower case, removing common English stopwords such as “the” and “a”, removing extra whitespace, stemming the documents (i.e., removing common word endings such as -s and -es), and removing profanity (using the bad-words list downloaded from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt). Finally, the documents were treated as plain text documents.
docs <- tm_map(docs, removePunctuation)                        # remove punctuation
docs <- tm_map(docs, removeNumbers)                            # remove digits
docs <- tm_map(docs, content_transformer(stri_trans_tolower))  # lower case (stringi)
docs <- tm_map(docs, removeWords, stopwords("english"))        # remove common English stopwords
docs <- tm_map(docs, stripWhitespace)                          # collapse extra whitespace
docs <- tm_map(docs, stemDocument)                             # strip common word endings such as -s, -es, -ly (needs SnowballC installed)
docs <- tm_map(docs, removeWords, bad_words)                   # remove profane words
docs <- tm_map(docs, PlainTextDocument)                        # treat the documents as plain text
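The order of the cleaning steps matters. Punctuation is removed before stopwords here, so a contraction such as "don't" becomes "dont" and no longer matches a stopword list that spells contractions with apostrophes, as the English list in tm does. This may be why n-grams such as “dont know” and “cant wait see” appear among the most frequent results. The base-R sketch below illustrates the effect. The four-word stopword list and the remove_words helper are assumptions made for the example; they are not the tm functions themselves.

```r
# Toy stopword list: a tiny stand-in (assumption) for tm::stopwords("english"),
# which writes contractions with apostrophes, e.g. "don't"
toy_stopwords <- c("i", "the", "don't", "can't")

# Hypothetical helper: drop whole tokens found in `words`, re-join with spaces
remove_words <- function(x, words) {
  tokens <- strsplit(x, "\\s+")[[1]]
  paste(tokens[!tokens %in% words], collapse = " ")
}

text <- "i don't know the answer"

# Order used in the report: punctuation first, then stopwords
punct_first <- remove_words(gsub("[[:punct:]]", "", text), toy_stopwords)

# Alternative order: stopwords first, then punctuation
stop_first <- gsub("[[:punct:]]", "", remove_words(text, toy_stopwords))

punct_first  # "dont know answer"
stop_first   # "know answer"
```

Whether contractions should survive depends on the goal. For next-word prediction, where stopwords need to be kept anyway, this distinction matters less than it does for the exploratory counts.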
Once a clean corpus was obtained, it was converted to a tidy format using the tidy function from the tidytext package. After tokenization, a tidy dataset has one token (one word, one bigram, one trigram, etc.) per row.
Using the tidy dataset, the corpus was summarized by the frequencies of the words it contains. In addition to single words, frequencies of bigrams (two words that follow each other) and trigrams (three words that follow each other) were also generated.
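library(tidytext)  # tidy(), unnest_tokens()
library(dplyr)     # %>%, count()
# Assumed step: the call that creates tidy_docs is not shown in the original
tidy_docs <- tidy(docs)
# Single words and their frequencies
token_docs <- tidy_docs %>% unnest_tokens(words, text)
freq_tidy <- token_docs %>% count(words, sort = TRUE)
# Bigrams and their frequencies
token2_docs <- tidy_docs %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
freq_tidy2 <- token2_docs %>% count(bigram, sort = TRUE)
# Trigrams and their frequencies
token3_docs <- tidy_docs %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
freq_tidy3 <- token3_docs %>% count(trigram, sort = TRUE)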
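To make the n-gram counting concrete, the following base-R sketch builds bigrams from a few stemmed lines and tabulates them. It is an illustration of the idea only, not the tidytext implementation. The make_ngrams helper and the toy lines are hypothetical.

```r
# Hypothetical helper: split one line into words and build its n-grams.
# N-grams never span two lines, since each line is treated as one document.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy lines in stemmed form (made up for the example)
toy_lines <- c("happi new year", "happi new year everyon", "new york citi")

# Collect bigrams over all lines and count them, most frequent first
bigrams <- unlist(lapply(toy_lines, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
bigram_freq
```

Here “happi new” and “new year” each occur twice and the other three bigrams once, which mirrors the kind of table that count(bigram, sort = TRUE) produces.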
The 20 most frequently used words in the sample corpus are presented in Figure 1, and a word cloud of the 100 most frequent words is presented in Figure 2.
[Figure 1: Most Frequent Words]
[Figure 2: Word Cloud Top 100]
As with single words, the 20 most commonly occurring bigrams and trigrams, along with their word clouds, are presented in Figures 3 to 6.
[Figure 3: Most Frequent Words (bigrams)]
[Figure 4: Bigram Cloud Top 100]
[Figure 5: Most Frequent Words (trigrams)]
[Figure 6: Trigram Cloud Top 100]
An n-gram model based on four-, three-, two-, or one-word sequences will be used to predict the most likely next word. The frequencies with which a particular word follows a given word or set of words will be used to make the prediction. Some words are used infrequently, and a way of predicting them still needs to be developed. In addition, stopwords were removed in the current analysis; the prediction model, however, must also be able to predict the next word when it is a stopword such as “a” or “the”. After the model is built, it will be tested for accuracy as well as processing speed. Accuracy and efficiency (speed) need to be balanced, particularly if the model is to run on platforms with limited memory, such as mobile devices. The final step will be to build a Shiny app for predicting the next word.
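As a rough illustration of the planned approach, the base-R sketch below predicts the next word from bigram frequencies and falls back to the most frequent single word when the preceding word has never been seen. It is a deliberately simplified assumption about how such a model could look, not the model that will be built. The toy corpus, the predict_next function, and the single-step fallback are all hypothetical, and stopwords are kept so that they can be predicted.

```r
# Hypothetical toy corpus (lower-cased, punctuation removed, stopwords kept)
toy_lines <- c("i cant wait to see you",
               "i cant wait for the weekend",
               "let us know what you think",
               "i cant believe it",
               "i think so")

tokens <- strsplit(toy_lines, "\\s+")

# Collect (previous word, next word) pairs within each line
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count how often each pair occurs
bigram_counts <- aggregate(n ~ prev + nxt, data = transform(pairs, n = 1L), FUN = sum)

# Single-word counts, used as the fallback for unseen words
unigram_counts <- sort(table(unlist(tokens)), decreasing = TRUE)

predict_next <- function(word) {
  cand <- bigram_counts[bigram_counts$prev == word, ]
  if (nrow(cand) == 0) {
    return(names(unigram_counts)[1])  # fall back to the most frequent word overall
  }
  as.character(cand$nxt[which.max(cand$n)])  # most frequent follower of `word`
}

predict_next("cant")   # "wait"
predict_next("zebra")  # "i" (unseen word, fallback)
```

A fuller version would start from the longest available context (three preceding words) and step down to shorter ones, which is where the trade-off between accuracy, memory, and speed mentioned above comes in.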