Introduction

The goal of the capstone project is to create a model that can predict text using data provided by SwiftKey.

This milestone report briefly describes the features of the training data, presents an exploratory analysis of the data provided, and summarizes how the data are prepared for building the prediction model.

Preparing the Data in RStudio

First, we set the working directory and load the data into RStudio.

setwd("~/R/Capstone/final/en_US")
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul =
## TRUE): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Then we briefly summarize the basic features of each file: file size, number of lines, total word count and mean words per line.

library(stringi)

blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546239       41.75107
## 2    news     196.2775     77259   2674536       34.61779
## 3 twitter     159.3641   2360148  30093413       12.75065

Data Cleaning and Exploratory Analysis

The raw dataset contains URLs, special symbols, numbers, excess whitespace and punctuation, which need to be cleaned out before we can analyze the data properly. Because the full dataset is very large, we randomly sample 1% of each file and clean the resulting sample.

library(tm)
## Loading required package: NLP
library(NLP)

set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))

corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
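
As a quick optional sanity check (not part of the original analysis), we can compare one sampled line with its cleaned counterpart to confirm the transformations behaved as expected:

data.sample[1]            # a raw sampled line
as.character(corpus[[1]]) # the same line after cleaning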

After cleaning the data, we can conduct some exploratory analysis. We want to find the most frequently occurring words and word combinations (unigrams, bigrams and trigrams) in the data; the following code performs exactly this function:

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7')
library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
options(mc.cores=1)

getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}

freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
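
Before plotting, we can optionally glance at the top of each frequency table to confirm the tokenization worked; for example:

head(freq1, 10)  # ten most frequent unigrams
head(freq2, 10)  # ten most frequent bigrams
head(freq3, 10)  # ten most frequent trigrams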

Now we plot these results:

First plot: the 30 most common unigrams:

makePlot(freq1, "30 Most Common Unigrams")

Second plot: the 30 most common bigrams:

makePlot(freq2, "30 Most Common Bigrams")

Third plot: the 30 most common trigrams:

makePlot(freq3, "30 Most Common Trigrams")

Next Steps

This milestone report concludes the exploratory analysis. The next step in the capstone project is to build the prediction model and create a Shiny app based on it. Specifically, we will use the n-gram frequency data frames to estimate the probability of a word given the words that precede it: the input string will be tokenized, matched against the n-gram tables, and the most probable continuation returned. A Shiny app built on this predictive model can then provide an autocomplete/autocorrect function.
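
As a rough illustration of the planned approach (a minimal sketch only, not the final implementation), a longest-match backoff lookup over the frequency tables built above might look like the function below. The name predictNextWord and the reuse of freq1, freq2 and freq3 are assumptions made here for illustration; the final model will likely be built from larger n-gram tables without stop-word removal.

# Illustrative sketch only: predict the next word by matching the end of the
# input against the trigram table, backing off to bigrams, then to the most
# frequent unigram. Assumes freq1/freq2/freq3 from this report (columns: word, freq).
predictNextWord <- function(input, freq1, freq2, freq3) {
  # Tokenize the input roughly the same way the corpus was cleaned
  tokens <- tolower(input)
  tokens <- gsub("[[:punct:][:digit:]]+", " ", tokens)
  tokens <- unlist(strsplit(trimws(tokens), "\\s+"))

  # Try the trigram table first: match on the last two words of the input
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- freq3[grepl(paste0("^", prefix, " "), freq3$word), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }

  # Back off to the bigram table: match on the last word only
  if (length(tokens) >= 1) {
    prefix <- tail(tokens, 1)
    hits <- freq2[grepl(paste0("^", prefix, " "), freq2$word), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }

  # Fall back to the single most frequent unigram
  as.character(freq1$word[1])
}

predictNextWord("happy new", freq1, freq2, freq3)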