Reading and Preparing the English Language Database

The three files provided, containing English text from blogs, news and tweets, are used to build the English language corpus.

The following steps prepare the data as a corpus.

  1. Each file is read and a 5% (1/20th) random sample of its lines is made into a corpus.
  2. The sentences are converted to upper case to make the case consistent.
  3. Punctuation, stopwords and extra whitespace are removed.
  4. Profane words are removed based on a fixed list of profane words.
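
The four steps above can be sketched in base R on a handful of in-memory lines. This is only an illustration under stated assumptions, not the pipeline used in this report: the helpers `clean_lines` and `sample_lines`, the toy sentences and the two short word lists are hypothetical stand-ins for the real files, `stopwords("english")` and `profanity.txt`.

```r
# Hypothetical stand-ins for the real stopword and profanity lists
stop_list    <- c("the", "a", "is", "and")
profane_list <- c("darn")

# Step 1: random sample of the lines (at least one line for tiny inputs)
sample_lines <- function(lines, fraction = 0.05) {
  n <- max(1, floor(length(lines) * fraction))
  lines[sample(seq_along(lines), size = n, replace = FALSE)]
}

# Steps 2-4 applied to a character vector of lines
clean_lines <- function(lines, stop_list, profane_list) {
  out <- toupper(lines)                          # step 2: consistent case
  out <- gsub("[[:punct:]]", "", out)            # step 3: punctuation
  drop <- toupper(c(stop_list, profane_list))    # upper-cased to match the text
  pattern <- paste0("\\b(", paste(drop, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", out, perl = TRUE)     # steps 3-4: stopwords, profanity
  out <- gsub("\\s+", " ", out)                  # step 3: collapse whitespace
  trimws(out)
}

toy <- c("The weather is nice, and warm!", "A darn good day.")
cleaned <- clean_lines(toy, stop_list, profane_list)
print(cleaned)
```

Note that the word lists are upper-cased before matching; a lower-case stopword list would not match text that has already been converted to upper case.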

Exploratory Analysis of the Language Database

A basic profile of each file, in terms of the number of lines and words, is collected.

The top words from each data source are identified. Sparse terms are removed first, so that only words occurring in at least 2.5% of the documents (each sampled line is one document) are considered for analysis.
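
To make the 2.5% document threshold concrete, the following base R sketch computes the share of documents containing each term and keeps the terms above a cut-off. The function `frequent_terms` and the toy documents are hypothetical, and the exact boundary handling may differ from `removeSparseTerms` in the `tm` package.

```r
# Keep terms whose document frequency is at least `min_share` of all documents
frequent_terms <- function(docs, min_share = 0.025) {
  tokens <- strsplit(tolower(docs), "\\s+")
  # Count each term once per document, then tally across documents
  doc_freq <- table(unlist(lapply(tokens, unique)))
  share <- as.vector(doc_freq) / length(docs)
  names(doc_freq)[share >= min_share]
}

# Toy example with a 50% threshold so the effect is visible on four documents
docs <- c("rain today", "rain tomorrow", "sun today", "rain again")
kept <- frequent_terms(docs, min_share = 0.5)
print(kept)
```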

Next, tri-grams are created from each data source and the top 20 tri-grams from each source are listed. For presentation purposes only tri-grams are considered, but the same approach can be used to analyse n-grams of other lengths.
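
As an illustration of how the same approach extends to other n-gram lengths, here is a small base R sketch that does not depend on RWeka. The helpers `make_ngrams` and `top_ngrams` and the toy lines are hypothetical; the sketch splits on whitespace only, whereas the analysis code uses a richer delimiter set.

```r
# Build all n-grams (as space-joined strings) from one line of text
make_ngrams <- function(line, n = 3) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over several lines and return the k most frequent ones
top_ngrams <- function(lines, n = 3, k = 20) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- table(grams)
  counts <- setNames(as.integer(counts), names(counts))
  head(sort(counts, decreasing = TRUE), k)
}

toy <- c("thanks for the follow", "thanks for the help", "one of the best")
print(top_ngrams(toy, n = 3, k = 3))
print(make_ngrams("one of the best", n = 2))
```

Changing `n` is the only adjustment needed to move from tri-grams to bi-grams or 4-grams.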

The analysis is run in a loop across all three data sources and the comparisons are presented below.

The top 20 words and tri-grams from each source, shown below, reveal interesting patterns in how different words and phrases are used more commonly in different media such as blogs, news or Twitter.

# Loading the required packages
library(tm)
library(RWeka)
library(ggplot2)

# Creating dataframes to store the results
file_summary <- data.frame(FileName=character(), Lines=integer(), Words=integer())
file_top_words <- data.frame(FileName=character(), Word=character(), Count=integer())
file_top_3grams <- data.frame(FileName=character(), TriGram=character(), Count=integer())

# Setting file names
file_names <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")

# Initializing profanity list
con <- file("profanity.txt", open = "r")
profanity <- readLines(con)
close(con)

# Looping over the three files
for (i in seq_along(file_names)) {
  
  file_name <- file_names[i]
  
  con <- file(file_name, open = "r")
  all_lines <- readLines(con)
  close(con)
  
  # Counting lines and words in the file
  # (words are approximated per line as the number of spaces plus one)
  lines <- length(all_lines)
  words <- sum(lengths(gregexpr(" ", all_lines, fixed = TRUE)) + 1)
  dl <- list(FileName = file_name, Lines = lines, Words = words)
  file_summary <- rbind(file_summary, dl, stringsAsFactors = FALSE)

  # Taking a 1/20 random sample of lines
  randomSample <- sample(x = seq_len(lines), size = floor(lines / 20), replace = FALSE)
  all_lines <- all_lines[randomSample]
  
  # Data cleaning steps
  # (the word lists are upper-cased so that they match the upper-cased text)
  corpus <- Corpus(VectorSource(all_lines))
  corpus <- tm_map(corpus, content_transformer(toupper))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, toupper(stopwords("english")))
  corpus <- tm_map(corpus, removeWords, toupper(profanity))
  corpus <- tm_map(corpus, stripWhitespace)

  # Document-term matrix, dropping terms absent from more than 97.5% of documents
  dtm <- DocumentTermMatrix(corpus)
  notSparse <- removeSparseTerms(dtm, 0.975)
  finalWords <- as.data.frame(as.matrix(notSparse))
  
  # Collecting top 20 words by total count
  word_counts <- sort(colSums(finalWords), decreasing = TRUE)
  top_words <- data.frame(FileName = rep(file_name, length(word_counts)),
                          Word = names(word_counts),
                          Count = as.integer(word_counts),
                          stringsAsFactors = FALSE)
  # head() avoids NA rows when fewer than 20 terms survive the sparsity filter
  file_top_words <- rbind(file_top_words, head(top_words, 20))

  # Collecting top 20 tri-grams from the sampled lines
  token_delim <- " \\t\\r\\n.!?,;\"()"
  tritoken <- NGramTokenizer(all_lines, Weka_control(min = 3, max = 3, delimiters = token_delim))
  tri <- as.data.frame(table(tritoken), stringsAsFactors = FALSE)
  names(tri) <- c("TriGram", "Count")
  tri <- tri[order(-tri$Count), ]
  top_tri_words <- cbind(FileName = file_name, head(tri, 20))

  file_top_3grams <- rbind(file_top_3grams, top_tri_words)
}

print(file_summary)
##            FileName   Lines    Words
## 1   en_US.blogs.txt  899288 37345400
## 2    en_US.news.txt   77259  2644241
## 3 en_US.twitter.txt 2360148 30374792

Plots of Top Words and Tri-grams

ggplot(file_top_words,aes(x=reorder(Word,Count),y=Count, fill = FileName)) +
  geom_bar(stat='identity') +
  coord_flip() + labs(x='Word',y='Count', title = "Top 20 words in each file") + 
  facet_wrap(~ FileName, scales = "free") +
  theme(legend.position="none")

ggplot(file_top_3grams,aes(x=reorder(TriGram,Count),y=Count, fill = FileName)) +
  geom_bar(stat='identity') +
  coord_flip() + labs(x='TriGram',y='Count', title = "Top 20 3-grams in each file") + 
  facet_wrap(~ FileName, scales = "free") +
  theme(legend.position="none")

Approach for the SwiftKey Application

The following approach will be followed in the next phases to complete the project.

  1. Create 1-grams to 5-grams from the corpus for each data source and store the results together with their frequencies.

  2. Use a Markov chain to store the data for easy retrieval, such that for any given sequence of n words (for n < 5) the top 3 choices for the (n + 1)th word are returned. The choices will be returned in decreasing order of frequency as observed in the corpus.

  3. Create a Shiny app that runs the R code and suggests the next word based on the words and phrases entered by the user.
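
The lookup described in step 2 could look roughly like the base R sketch below. It is a simple frequency table keyed by the preceding words, offered as one possible reading of the plan rather than the final design; the functions `build_model` and `predict_next` and the toy lines are hypothetical.

```r
# Build a lookup: prefix of 1 to max_prefix words -> counts of the following word
build_model <- function(lines, max_prefix = 4) {
  model <- list()
  for (line in lines) {
    words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
    words <- words[nzchar(words)]
    for (len in seq_len(max_prefix)) {
      if (length(words) <= len) next
      for (i in seq_len(length(words) - len)) {
        prefix <- paste(words[i:(i + len - 1)], collapse = " ")
        nxt <- words[i + len]
        counts <- model[[prefix]]
        if (is.null(counts)) counts <- integer(0)
        prev <- if (nxt %in% names(counts)) counts[[nxt]] else 0L
        counts[nxt] <- prev + 1L
        model[[prefix]] <- counts
      }
    }
  }
  model
}

# Return up to k next-word candidates, most frequent first
predict_next <- function(model, phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  counts <- model[[paste(words, collapse = " ")]]
  if (is.null(counts)) return(character(0))
  head(names(sort(counts, decreasing = TRUE)), k)
}

toy <- c("thanks for the follow", "thanks for the help",
         "thanks for the follow back", "one of the best")
model <- build_model(toy)
print(predict_next(model, "for the"))
```

This sketch returns nothing for an unseen phrase; how unseen phrases should be handled (for example by falling back to a shorter prefix) is left open here.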