The goal of the capstone project is to build a model that predicts the next word in a text, using data provided by SwiftKey.
This milestone report briefly describes the features of the training data, presents an exploratory analysis of the data provided, and summarizes how the data are prepared for building the prediction model.
First, we set the working directory and load the data into RStudio:
setwd("~/R/Capstone/final/en_US")
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul =
## TRUE): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Then we briefly summarize the quality and features of the dataset.
library(stringi)
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
data.frame(source = c("blogs", "news", "twitter"),
file.size.MB = c(blogs.size, news.size, twitter.size),
num.lines = c(length(blogs), length(news), length(twitter)),
num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 37546239 41.75107
## 2 news 196.2775 77259 2674536 34.61779
## 3 twitter 159.3641 2360148 30093413 12.75065
The raw dataset contains URLs, special symbols, numbers, excess whitespace, and punctuation, which we need to clean out before we can properly analyze the data. The full dataset is also very large, so we randomly sample 1% of each file and clean the resulting sample.
library(tm)
## Loading required package: NLP
library(NLP)
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
corpus <- VCorpus(VectorSource(data.sample))
# Helper transformer that replaces pattern matches with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # remove Twitter-style handles
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lower case
corpus <- tm_map(corpus, removeWords, stopwords("en"))             # remove English stop words
corpus <- tm_map(corpus, removePunctuation)                        # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                            # remove numbers
corpus <- tm_map(corpus, stripWhitespace)                          # collapse repeated whitespace
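As a quick, optional sanity check (not strictly part of the analysis), we can print a couple of cleaned documents to confirm that URLs, numbers, punctuation, and stop words have been removed; the indices below are arbitrary.
# Inspect a couple of cleaned documents; the indices are chosen arbitrarily
lapply(corpus[1:2], as.character)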
After cleaning the data, we can conduct some exploratory analysis. We want to find the most frequently occurring unigrams, bigrams, and trigrams in the data; the following code does exactly that:
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7')
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
options(mc.cores=1)
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
makePlot <- function(data, label) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("grey50"))
}
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
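Before plotting, it can help to glance at the top of each frequency table; the head() calls below are an illustrative check only.
# Peek at the five most frequent terms in each table (purely illustrative)
head(freq1, 5)
head(freq2, 5)
head(freq3, 5)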
Now we plot these frequency tables:
1st: The 30 most common Unigrams:
makePlot(freq1, "30 Most Common Unigrams")
2nd plot: the 30 most common Bigrams
makePlot(freq2, "30 Most Common Bigrams")
3rd plot: the 30 most common Trigrams
makePlot(freq3, "30 Most Common Trigrams")
This milestone report concludes the exploratory analysis. The next step in the capstone project is to build the prediction model and create a Shiny app based on it. Specifically, the input string will be tokenized, and the n-gram data frames will be used to estimate the probability of a word given the preceding words. A Shiny app built on this predictive model can then provide an autocomplete/autocorrect-style function.
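As a rough sketch of the planned approach (illustrative only, not the final model), a simple backoff-style lookup over the bigram and trigram tables built above might look like the following; the function name predictNextWord and the exact matching logic are assumptions made here for illustration.
# Illustrative sketch of a backoff-style next-word lookup using the
# trigram (freq3) and bigram (freq2) tables built above.
# Assumes plain-text input; the final model may add smoothing or weights.
predictNextWord <- function(input, freq2, freq3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) {
    # Try the trigram table first: match the last two words of the input
    prefix <- paste(words[n - 1], words[n])
    hits <- freq3[grepl(paste0("^", prefix, " "), freq3$word), ]
    if (nrow(hits) > 0) {
      # freq3 is sorted by frequency, so the first hit is the best guess
      return(sub(paste0("^", prefix, " "), "", hits$word[1]))
    }
  }
  # Back off to the bigram table: match only the last word of the input
  hits <- freq2[grepl(paste0("^", words[n], " "), freq2$word), ]
  if (nrow(hits) > 0) {
    return(sub(paste0("^", words[n], " "), "", hits$word[1]))
  }
  NA_character_
}
# Example usage (the result depends on the sampled data):
predictNextWord("happy new", freq2, freq3)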