The three files provided, containing language data from blogs, news and tweets, are used to build the English language corpus.
The following steps are executed to prepare the data into a corpus.
A basic profile of each file, in terms of the number of lines and words, is collected.
The top words from each data source are identified. Sparse terms are removed so that only words occurring in at least 2.5% of the documents are considered for analysis.
Next, tri-grams are created from each data source and the top 20 tri-grams from each source are listed. For the purpose of presentation only tri-grams have been considered, but the same approach can be used to analyse n-grams of any length.
The analysis is run in a loop across all three data sources and the comparisons are presented below.
The analysis of the top 20 words and tri-grams from each source, shown below, reveals an interesting pattern: different words and phrases are used more commonly in different media such as blogs, news or Twitter.
# Loading required packages
library(tm)       # corpus cleaning and document-term matrix
library(RWeka)    # NGramTokenizer for tri-grams
library(ggplot2)  # plots
# Creating data frames to store the results
file_summary <- data.frame(FileName=character(), Lines=integer(), Words=integer())
file_top_words <- data.frame(FileName=character(), Word=character(), Count=integer())
file_top_3grams <- data.frame(FileName=character(), TriGram=character(), Count=integer())
# Setting file names
file_names <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
# Initializing profanity list
con <- file("profanity.txt", open = "r")
profanity <- readLines(con)
close(con)
# Looping over the three files
for (i in seq_along(file_names)) {
file_name <- file_names[i]
con <- file(file_name, open = "r")
all_lines <- readLines(con)
close(con)
# Counting lines and (approximate, space-delimited) word counts in the file
lines <- length(all_lines)
words <- sum(sapply(all_lines, function(x) length(unlist(gregexpr(" ", x))) + 1))
dl <- list(FileName = file_name, Lines = lines, Words = words)
file_summary <- rbind(file_summary, dl, stringsAsFactors = FALSE)
# Taking 1/20 random sample of lines
randomSample <- sample(x = 1:lines, size = lines / 20, replace = FALSE)
all_lines <- all_lines[randomSample]
# Data cleaning steps; content_transformer() keeps the corpus structure intact,
# and lower-casing (rather than upper-casing) matches the lowercase stopword list
corpus <- Corpus(VectorSource(all_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)
# Document-term matrix; terms absent from more than 97.5% of documents are dropped
dtm <- DocumentTermMatrix(corpus)
notSparse <- removeSparseTerms(dtm, 0.975)
finalWords <- as.data.frame(as.matrix(notSparse))
# Collecting the top 20 words
top_words <- colSums(finalWords)
top_words <- as.data.frame(top_words)
top_words <- cbind(Word = row.names(top_words), top_words)
names(top_words) <- c("Word", "Count")
top_words <- top_words[order(-top_words$Count), ]
rownames(top_words) <- 1:nrow(top_words)
top_words <- cbind(FileName = file_name, top_words)
file_top_words <- rbind(file_top_words, top_words[1:20, ])
# Collecting the top 20 tri-grams from the sampled lines
token_delim <- " \\t\\r\\n.!?,;\"()"
tritoken <- NGramTokenizer(all_lines, Weka_control(min = 3, max = 3, delimiters = token_delim))
tri <- as.data.frame(table(tritoken))
names(tri) <- c("TriGram", "Count")
tri <- tri[order(-tri$Count), ]
top_tri_words <- cbind(FileName = file_name, tri[1:20, ])
file_top_3grams <- rbind(file_top_3grams, top_tri_words)
}
print(file_summary)
## FileName Lines Words
## 1 en_US.blogs.txt 899288 37345400
## 2 en_US.news.txt 77259 2644241
## 3 en_US.twitter.txt 2360148 30374792
ggplot(file_top_words, aes(x = reorder(Word, Count), y = Count, fill = FileName)) +
geom_bar(stat = 'identity') +
coord_flip() + labs(x = 'Word', y = 'Count', title = "Top 20 words in each file") +
facet_wrap(~ FileName, scales = "free") +
theme(legend.position = "none")
ggplot(file_top_3grams, aes(x = reorder(TriGram, Count), y = Count, fill = FileName)) +
geom_bar(stat = 'identity') +
coord_flip() + labs(x = 'TriGram', y = 'Count', title = "Top 20 3-grams in each file") +
facet_wrap(~ FileName, scales = "free") +
theme(legend.position = "none")
The following approach will be followed in the next phases to complete the project.
Create 1- to 5-grams from the corpus for each data source and store the results with their frequencies.
Use a Markov chain to store the data for easy retrieval, such that for any given sequence of 1, 2, …, n words (for n < 5) the top 3 choices for the (n + 1)-th word are returned, in decreasing order of frequency as observed in the corpus. A sketch of this lookup is given below.
Create a Shiny app that runs the R code and suggests the next word based on the words and phrases entered by the user.
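To make the lookup step concrete, here is a minimal sketch, assuming the n-gram frequency tables from the first step have been collapsed into a single data frame ngram_freq with columns prefix (the first n − 1 words, space-separated), nextword and Count. The data frame and its column names are placeholders for illustration, not the final design.

# Sketch of a frequency-based next-word lookup (ngram_freq is a hypothetical table)
predict_next <- function(phrase, ngram_freq, top_n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  if (length(words) == 0) return(character(0))
  # Back off from the longest usable prefix (4 words, from 5-grams) to a single word
  for (k in min(4, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    matches <- ngram_freq[ngram_freq$prefix == prefix, ]
    if (nrow(matches) > 0) {
      matches <- matches[order(-matches$Count), ]
      return(as.character(head(matches$nextword, top_n)))
    }
  }
  character(0)  # no prediction available for this phrase
}
# Example call: predict_next("thanks for the", ngram_freq)

The Shiny front end could then be little more than a text input wired to this function; the sketch below uses standard shiny functions, with widget names chosen only for illustration.

library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    # ngram_freq is assumed to be loaded in the app's environment
    paste(predict_next(input$phrase, ngram_freq), collapse = ", ")
  })
}
shinyApp(ui = ui, server = server)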