In this report, we analyze a large dataset of English text drawn from blogs, news articles and tweets. We start by summarizing the three files corresponding to each of these categories and by sampling them to form a smaller dataset that we use later for data exploration. We load the sample dataset, pre-process it and then group the data into unigrams, bigrams, trigrams and quadgrams. We then perform exploratory data analysis to track the frequency of the different n-grams. We display our findings in tables and graphs and draw conclusions that should help us build a good model for text prediction.
We place the three text files corresponding to English blogs, news and tweets in the “C:/texts” folder. For each of the three files, we compute the file size in bytes, the number of lines and the number of words. This information is then displayed in a summary table.
size1 <- file.size("C:/texts/en_US.blogs.txt")
size2 <- file.size("C:/texts/en_US.news.txt")
size3 <- file.size("C:/texts/en_US.twitter.txt")
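## skipNul = TRUE skips embedded nul characters, which raw text dumps may contain
blogs <- readLines("C:/texts/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("C:/texts/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("C:/texts/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)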
len1 <- length(blogs)
len2 <- length(news)
len3 <- length(tweets)
v1 <- sapply(gregexpr("\\W+", blogs), length)
v2 <- sapply(gregexpr("\\W+", news), length)
v3 <- sapply(gregexpr("\\W+", tweets), length)
word_count1 <- sum(v1) + len1
word_count2 <- sum(v2) + len2
word_count3 <- sum(v3) + len3
library(gridExtra)
row1 <- c(size1,size2,size3)
row2 <- c(len1,len2,len3)
row3 <- c(word_count1,word_count2,word_count3)
m <- rbind(row1,row2,row3)
rownames(m) <- c('Size in Bytes','# of Lines','# of words')
colnames(m) <- c("Blogs Document","News Document","Twitter Document")
grid.table(m)
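The separator-based count above is an approximation: a line that ends in punctuation contributes one extra separator, and a line with no separator at all is still counted as having one. The following is a minimal sketch of a whitespace-based alternative, assuming that a "word" is any run of non-whitespace characters. `count_words` is a hypothetical helper and is not part of the original analysis.

```r
# Hypothetical helper: count whitespace-delimited tokens in a character vector
count_words <- function(lines) {
  # Trim leading/trailing whitespace so that splitting yields no empty tokens
  tokens <- strsplit(trimws(lines), "\\s+")
  # lengths() gives the token count per line; empty lines yield zero tokens
  sum(lengths(tokens))
}

example_lines <- c("Hello world.", "Hello", "", "  two  words ")
count_words(example_lines)
```

Applied to the real data, this would be called as `count_words(blogs)`, and likewise for `news` and `tweets`.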
Because the data is too large to process in full, we sample 10% of the lines of each of the blogs, news and tweets files. We then combine the sampled lines and write them to a new text file that we place in the “C:/sample” folder.
sampleBlogs <- blogs[sample(1:length(blogs),floor(len1*0.1))]
sampleNews <- news[sample(1:length(news),floor(len2*0.1))]
sampleTwitter <- tweets[sample(1:length(tweets),floor(len3*0.1))]
sampleData <- c(sampleTwitter,sampleNews,sampleBlogs)
writeLines(sampleData, "C:/sample/Sample_en_data.txt")
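Because `sample` draws different lines on every run, the frequencies reported later can change between runs. The following is a minimal sketch of a reproducible variant of the sampling step. `sample_lines` and the seed value are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: draw a reproducible random fraction of the lines
sample_lines <- function(lines, fraction = 0.1, seed = 1234) {
  set.seed(seed)  # arbitrary seed, chosen only for illustration
  n <- floor(length(lines) * fraction)
  # sample.int draws n distinct line indices without replacement
  lines[sample.int(length(lines), n)]
}

toy <- paste("line", 1:50)
first_draw  <- sample_lines(toy)
second_draw <- sample_lines(toy)
identical(first_draw, second_draw)
```

Note that calling `set.seed` inside the helper resets the global random number generator, which is acceptable in a short script but may be undesirable in a larger program.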
We start by loading sampled data into R.
cname <- file.path("C:", "sample")
cname
## [1] "C:/sample"
dir(cname)
## [1] "Sample_en_data.txt"
library(tm)
## Loading required package: NLP
library(RWeka)
docs <- VCorpus(DirSource(cname)) ## loading docs into R; recent tm versions need a VCorpus for custom tokenizers
summary(docs)
## Length Class Mode
## Sample_en_data.txt 2 PlainTextDocument list
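We then pre-process this data with a sequence of operations: removing punctuation, special characters, numbers and extra white space, converting the text to lower case, removing English stop words and stemming the remaining words.
## replace separator characters with spaces first; removePunctuation would
## otherwise delete them and fuse neighbouring words (e.g. "and/or" -> "andor")
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
for(p in c("/", "@", "\\|", "#"))
{
docs <- tm_map(docs, toSpace, p)
}
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower)) ## keeps each document a PlainTextDocument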
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.2.3
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, PlainTextDocument)
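The order of the cleaning steps matters: stripping punctuation deletes characters such as "/" outright, so replacing them with spaces only has an effect if it happens first. The following is a minimal base-R sketch of the difference. It uses `gsub` with the `[[:punct:]]` class as a stand-in for `removePunctuation`, an assumption made to keep the example free of package dependencies.

```r
text <- "and/or @user #tag"

# Variant A: strip punctuation first, then try to replace separators
a <- gsub("[[:punct:]]+", "", text)
a <- gsub("[/@#|]", " ", a)        # no effect: the characters are already gone

# Variant B: replace separators with spaces first, then strip punctuation
b <- gsub("[/@#|]", " ", text)
b <- gsub("[[:punct:]]+", "", b)
b <- gsub("\\s+", " ", trimws(b))  # collapse the extra spaces

a
b
```

In variant A the words "and" and "or" are fused into a single token, which would distort the n-gram counts computed later.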
Once the sampled data is pre-processed, we group it into n-gram tokens. A 1-gram token, or unigram, is a single word. An n-gram token is a sequence of n consecutive words. We compute unigrams, bigrams, trigrams and quadgrams.
unigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
unigrams <- DocumentTermMatrix(docs, control = list(tokenize = unigramTokenizer))
BigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
bigrams <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
TrigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
trigrams <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
quadgramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 4, max = 4))
}
quadgrams <- DocumentTermMatrix(docs, control = list(tokenize = quadgramTokenizer))
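To make the definition of an n-gram concrete, the following sketch builds n-grams from a short vector of words using base R only. `make_ngrams` is a hypothetical helper written for illustration; it does not reproduce the internals of `NGramTokenizer`.

```r
# Hypothetical helper: build all n-grams from a vector of words
make_ngrams <- function(words, n) {
  # Fewer than n words: no n-gram can be formed
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Slide a window of n words over the vector and join each window with spaces
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("the", "quick", "brown", "fox")
make_ngrams(words, 1)
make_ngrams(words, 2)
make_ngrams(words, 3)
```

A sequence of k words yields k - n + 1 n-grams, so the four words above give four unigrams, three bigrams and two trigrams.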
We use the tokenized data structures that we created to assess the frequency of the different unigrams, bigrams, trigrams and quadgrams in the sampled data.
unigrams_matrix <- as.matrix(unigrams)
unigrams_freq <- sort(colSums(unigrams_matrix),decreasing = TRUE)
unigrams_freq_df <- data.frame(word = names(unigrams_freq), frequency = unigrams_freq)
bigrams_matrix <- as.matrix(bigrams)
bigrams_freq <- sort(colSums(bigrams_matrix),decreasing = TRUE)
bigrams_freq_df <- data.frame(word = names(bigrams_freq), frequency = bigrams_freq)
trigrams_matrix <- as.matrix(trigrams)
trigrams_freq <- sort(colSums(trigrams_matrix),decreasing = TRUE)
trigrams_freq_df <- data.frame(word = names(trigrams_freq), frequency = trigrams_freq)
quadgrams_matrix <- as.matrix(quadgrams)
quadgrams_freq <- sort(colSums(quadgrams_matrix),decreasing = TRUE)
quadgrams_freq_df <- data.frame(word = names(quadgrams_freq), frequency = quadgrams_freq)
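The column sums of a document-term matrix are simply the number of times each token occurs in the corpus. The following sketch shows the same counting and sorting step on a small, made-up vector of bigrams, using only base R. The token values are invented for illustration.

```r
# Made-up bigram tokens, for illustration only
tokens <- c("new york", "last year", "new york",
            "right now", "new york", "last year")

# Count occurrences of each token and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)

# Same layout as the frequency data frames above: one row per token
freq_df <- data.frame(word = names(freq),
                      frequency = as.integer(freq),
                      stringsAsFactors = FALSE)
freq_df
```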
We then visualize the frequency of the top unigrams, bigrams, trigrams and quadgrams in four bar charts, shown below in that order:
library(ggplot2) # visualizing frequency
## Warning: package 'ggplot2' was built under R version 3.2.5
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p1 <- ggplot(subset(unigrams_freq_df, frequency>7000), aes(word, frequency))
p1 <- p1 + geom_bar(stat="identity")
p1 <- p1 + ggtitle("Words with Frequency > 7000")
p1 <- p1 + theme(axis.text.x=element_text(angle=45, hjust=1))
p1
p2 <- ggplot(subset(bigrams_freq_df, frequency>500), aes(word, frequency))
p2 <- p2 + geom_bar(stat="identity")
p2 <- p2 + ggtitle("Bigrams with Frequency > 500")
p2 <- p2 + theme(axis.text.x=element_text(angle=45, hjust=1))
p2
p3 <- ggplot(subset(trigrams_freq_df, frequency>50), aes(word, frequency))
p3 <- p3 + geom_bar(stat="identity")
p3 <- p3 + ggtitle("Trigrams with Frequency > 50")
p3 <- p3 + theme(axis.text.x=element_text(angle=45, hjust=1))
p3
p4 <- ggplot(subset(quadgrams_freq_df, frequency>20), aes(word, frequency))
p4 <- p4 + geom_bar(stat="identity")
p4 <- p4 + ggtitle("Quadgrams with Frequency > 20")
p4 <- p4 + theme(axis.text.x=element_text(angle=45, hjust=1))
p4
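By default, ggplot2 orders the bars alphabetically by `word`, which makes it harder to see which n-grams are most frequent. The following is a minimal sketch of one way to order the bars by decreasing frequency, using `reorder` on a small data frame. The words and counts are made up for illustration, and the plot is drawn only if ggplot2 is installed.

```r
# Made-up frequencies, for illustration only
freq_df <- data.frame(word = c("said", "one", "will", "just"),
                      frequency = c(30, 29, 32, 31),
                      stringsAsFactors = FALSE)

# Order the factor levels by decreasing frequency; ggplot2 follows level order
freq_df$word <- reorder(freq_df$word, -freq_df$frequency)
levels(freq_df$word)

# Draw the chart only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(freq_df, aes(word, frequency)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
```

The same `reorder` call could be applied to `unigrams_freq_df$word` and the other frequency data frames before plotting.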
Next steps include building a prediction model and a Shiny app/UI for text prediction. In this context, a good prediction model would use the past few words to estimate the probability of each candidate next word. The Shiny UI would consist of a simple text box where the user enters a word or group of words, and the app outputs the next word as predicted by the model.
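As an illustration of what such a model could look like, the following sketch predicts the next word from bigram counts alone, estimating the probability of each candidate as its share of the observed continuations. This is one possible starting point under simplifying assumptions (a made-up toy corpus, no smoothing, one word of context), not the model that will actually be built. `predict_next` is a hypothetical helper.

```r
# Toy corpus, made up for illustration
corpus <- c("i love new york", "i love new music", "new york is big")

# Build a table of (previous word, next word) pairs from every sentence
pairs <- do.call(rbind, lapply(strsplit(corpus, " "), function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical helper: most likely next word and its estimated probability
predict_next <- function(word, pairs) {
  counts <- table(pairs$nxt[pairs$prev == word])
  if (length(counts) == 0) return(NULL)  # unseen word: no prediction
  probs <- counts / sum(counts)          # relative frequency of each continuation
  best <- which.max(probs)
  list(word = names(probs)[best], probability = as.numeric(probs[best]))
}

predict_next("new", pairs)
predict_next("love", pairs)
```

A fuller model would fall back to shorter contexts when a longer one has not been seen, which is where the trigram and quadgram frequencies computed above would come in.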