Synopsis

The Capstone project of the Data Science course track is to create a Shiny app for text prediction on mobile phones, using the given corpus of blog, news and Twitter texts. This milestone report presents an exploratory data analysis of the corpus and describes the plan for the prediction app.

Given Corpus Data

The project data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped to obtain corpora in English, German and other languages. Only the English data, located in the en_US directory, is used in this project. That directory contains three text files, one each for the blogs, news and Twitter texts, as shown below.

list.files()
## [1] "Capstone_Milestone.html" "Capstone_Milestone.Rmd" 
## [3] "en_US.blogs.txt"         "en_US.news.txt"         
## [5] "en_US.twitter.txt"

Exploratory Data Analysis

First load the required R libraries.

library(NLP)
library(tm)
library(RWeka)
library(stringi)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

Load the blog, news and Twitter texts and compute the number of words, the number of lines and the file size of each.

file_blogs <- file("en_US.blogs.txt", "rb")
blogs_txt <- readLines(file_blogs, encoding="UTF-8", skipNul = TRUE)
close(file_blogs)
blogs_words <- sum(stri_count_words(blogs_txt))/10^6 # in millions
blogs_lines <- length(blogs_txt)/10^6 # in millions
blogs_size <- file.info("en_US.blogs.txt")$size/1024^2 # in MB

file_news <- file("en_US.news.txt", "rb")
news_txt <- readLines(file_news, encoding="UTF-8", skipNul = TRUE)
close(file_news)
news_words <- sum(stri_count_words(news_txt))/10^6 # in millions
news_lines <- length(news_txt)/10^6 # in millions
news_size <- file.info("en_US.news.txt")$size/1024^2 # in MB

file_twit <- file("en_US.twitter.txt", "rb")
twit_txt <- readLines(file_twit, encoding="UTF-8", skipNul = TRUE)
close(file_twit)
twit_words <- sum(stri_count_words(twit_txt))/10^6 # in millions
twit_lines <- length(twit_txt)/10^6 # in millions
twit_size <- file.info("en_US.twitter.txt")$size/1024^2 # in MB

Summarize the results in a table:

summary_table <- data.frame(File = c("Blogs", "News", "Twitter"),
                            Million_words = c(blogs_words, news_words, twit_words),
                            Million_lines = c(blogs_lines, news_lines, twit_lines),
                            MB = c(blogs_size, news_size, twit_size))
summary_table
##      File Million_words Million_lines       MB
## 1   Blogs      37.54625      0.899288 200.4242
## 2    News      34.76239      1.010242 196.2775
## 3 Twitter      30.09341      2.360148 159.3641
rm(blogs_txt, news_txt, twit_txt)

Data Pre-processing

Since the corpus is very large, a smaller sample is used for this preliminary study. The full corpus will be used in the final implementation.

blogs_samp <- readLines("en_US.blogs.txt", 10000)
news_samp <- readLines("en_US.news.txt", 10000)
twit_samp <- readLines("en_US.twitter.txt", 10000)
# Pool the first 10,000 lines of each source into one sample corpus
txt_samp <- c(blogs_samp, news_samp, twit_samp)
txt_corpus <- Corpus(VectorSource(txt_samp))
rm(blogs_samp, news_samp, twit_samp, txt_samp)

For the analysis below, the sample corpus is converted to lower case, and extra white space, punctuation marks and numbers are removed.

txt_corpus <- tm_map(txt_corpus, content_transformer(tolower))
txt_corpus <- tm_map(txt_corpus, stripWhitespace)
txt_corpus <- tm_map(txt_corpus, removePunctuation)
txt_corpus <- tm_map(txt_corpus, removeNumbers)

Constructing the n-gram models

This section demonstrates how to build the tri-gram, bi-gram and uni-gram models for the sample corpus and plots the most frequent terms of each. These models will be used for text prediction as outlined in the next section.

triGramTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triGramMat <- TermDocumentMatrix(txt_corpus, control = list(tokenize = triGramTok))
freqTerms <- findFreqTerms(triGramMat, lowfreq = 100)
termFrequency <- rowSums(as.matrix(triGramMat[freqTerms,]))
termFrequency <- data.frame(trigram=names(termFrequency), frequency=termFrequency)
g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Top Trigrams by Frequency")
print(g)

biGramTok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
biGramMat <- TermDocumentMatrix(txt_corpus, control = list(tokenize = biGramTok))
freqTerms <- findFreqTerms(biGramMat, lowfreq = 800)
termFrequency <- rowSums(as.matrix(biGramMat[freqTerms,]))
termFrequency <- data.frame(bigram=names(termFrequency), frequency=termFrequency)
g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top Bigrams by Frequency")
print(g)

uniGramTok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
uniGramMat <- TermDocumentMatrix(txt_corpus, control = list(tokenize = uniGramTok))
freqTerms <- findFreqTerms(uniGramMat, lowfreq = 3000)
termFrequency <- rowSums(as.matrix(uniGramMat[freqTerms,]))
termFrequency <- data.frame(unigram=names(termFrequency), frequency=termFrequency)
g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top Unigrams by Frequency")
print(g)
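
For prediction, these term-document matrices need to be collapsed into plain frequency tables keyed by their (n-1)-word prefix. The sketch below shows one possible way to do this for the trigram matrix; it is illustrative only, and assumes the slam package (a dependency of tm) for the sparse row sums. The column names prefix and next_word are naming choices made for this illustration.

library(slam)
# Total count of each trigram across the sample corpus (sparse-friendly sum)
tri_freq <- row_sums(triGramMat)
tri_df <- data.frame(trigram = names(tri_freq), frequency = tri_freq,
                     stringsAsFactors = FALSE)
# Split "w1 w2 w3" into a two-word prefix and the word to be predicted
parts <- strsplit(tri_df$trigram, " ", fixed = TRUE)
tri_df$prefix <- vapply(parts, function(p) paste(p[1:2], collapse = " "), character(1))
tri_df$next_word <- vapply(parts, function(p) p[3], character(1))
# Order by decreasing frequency so the first match for a prefix is the best guess
tri_df <- tri_df[order(-tri_df$frequency), c("prefix", "next_word", "frequency")]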

Plan for Text Prediction App

The current plan is to use the higher-order n-gram models (tri-gram or even quad-gram) to “advance-guess” the pattern of the text being entered, and to fall back on the bi-gram and uni-gram models for the “immediate guess”. The finer details of the algorithm are yet to be finalized. Eventually the algorithm will be packaged as a Shiny app with a simple text-entry interface.
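
As an illustration of this idea, the sketch below implements a simple back-off lookup. It assumes trigram and bigram frequency tables in the prefix/next_word form shown above (here called tri_df and bi_df), plus a unigram table uni_df sorted by decreasing frequency; these names, and the back-off order itself, are placeholders rather than the final algorithm.

# Illustrative back-off lookup; tri_df, bi_df and uni_df are assumed to be
# frequency tables sorted by decreasing frequency, with prefix and next_word columns
predict_next <- function(phrase, tri_df, bi_df, uni_df) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)
    # "Advance guess": match the last two words against the trigram table
    if (n >= 2) {
        hits <- tri_df[tri_df$prefix == paste(words[n - 1], words[n]), ]
        if (nrow(hits) > 0) return(hits$next_word[1])
    }
    # Back off: match the last word against the bigram table
    if (n >= 1) {
        hits <- bi_df[bi_df$prefix == words[n], ]
        if (nrow(hits) > 0) return(hits$next_word[1])
    }
    # Last resort: the most frequent unigram
    uni_df$next_word[1]
}

In the final app, such a lookup would be backed by frequency tables built from a much larger portion of the corpus.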