This document briefly summarizes the major features of the data identified so far and outlines my plans for building the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager.
I downloaded the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data consist of text from three sources (news, blogs, and Twitter feeds) in several languages; only the English files are used in this report.
# Packages that may be useful
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# Download and unzip the data to the local disk if not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Read the data into R
setwd("./final/en_US")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Data summary
blogs.size   <- paste(round(file.info("en_US.blogs.txt")$size / 1024^2, 1), "MB")
news.size    <- paste(round(file.info("en_US.news.txt")$size / 1024^2, 1), "MB")
twitter.size <- paste(round(file.info("en_US.twitter.txt")$size / 1024^2, 1), "MB")
data.frame(source = c("blogs", "news", "twitter"),
           file_size = c(blogs.size, news.size, twitter.size),
           number_of_words = c(sum(stri_count_words(blogs)),
                               sum(stri_count_words(news)),
                               sum(stri_count_words(twitter))))
## source file_size number_of_words
## 1 blogs 200.4 MB 37546239
## 2 news 196.3 MB 34762395
## 3 twitter 159.4 MB 30093413
Since the dataset is huge, I sample 1% of it. I then remove URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and convert the text to lower case.
# Sample 1% of the data
set.seed(999)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
# Create the corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
tran_f <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, tran_f, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, tran_f, "@[^\\s]+")                      # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
# Helper function: tokenize the corpus into N-grams, count term frequencies,
# sort them, and plot the 10 most frequent terms
exploratory_function <- function(NgramX, titleName) {
  Ngram <- function(x) NGramTokenizer(x, Weka_control(min = NgramX, max = NgramX))
  Ngramtab <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = Ngram)), 0.9999)
  Ngramcorpus <- findFreqTerms(Ngramtab)
  Ngramcorpusnum <- rowSums(as.matrix(Ngramtab[Ngramcorpus, ]))
  Ngramcorpustab <- data.frame(Word = names(Ngramcorpusnum), frequency = Ngramcorpusnum)
  Ngramcorpussort <- Ngramcorpustab[order(-Ngramcorpustab$frequency), ]
  # Plot the 10 most frequent N-grams
  graph0 <- ggplot(Ngramcorpussort[1:10, ], aes(x = reorder(Word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = I("grey50")) +
    labs(title = titleName, x = "Top 10 Words", y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60))
  return(graph0)
}
#### Unigram
exploratory_function(1, "Unigram")
#### Bigram
exploratory_function(2, "Bigram")
#### Trigram
exploratory_function(3, "Trigram")
After this exploratory analysis, I will design the text prediction algorithm and deploy it as a Shiny app. The algorithm will use an N-gram model to estimate the empirical distribution of possible next words and suggest the candidate with the highest probability.
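As a rough illustration (not the final implementation), the sketch below assumes that bigram and trigram frequency tables, named bigram_freq and trigram_freq here with the same Word/frequency columns as Ngramcorpussort above, have already been built from the cleaned corpus, and it simply backs off from trigrams to bigrams when no longer match is found. The final app may use a different smoothing or back-off scheme.
# A minimal sketch of the prediction step, assuming precomputed frequency tables
# ("trigram_freq" and "bigram_freq" are hypothetical names used here)
library(stringi)
predict_next_word <- function(phrase, trigram_freq, bigram_freq) {
  # In the app, the input phrase would be cleaned the same way as the corpus
  words <- stri_extract_all_words(stri_trans_tolower(phrase))[[1]]
  n <- length(words)
  # First look for trigrams that start with the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_freq[startsWith(as.character(trigram_freq$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(stri_extract_last_words(as.character(hits$Word[which.max(hits$frequency)])))
  }
  # Back off to bigrams that start with the last word typed
  if (n >= 1) {
    hits <- bigram_freq[startsWith(as.character(bigram_freq$Word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0)
      return(stri_extract_last_words(as.character(hits$Word[which.max(hits$frequency)])))
  }
  NA_character_  # no match; the app could fall back to the most frequent unigram
}
For example, predict_next_word("happy new", trigram_freq, bigram_freq) would return the most frequent word following "happy new" in the sampled data, if such a trigram survived the cleaning.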