This document briefly summarizes the major features of the data identified so far and outlines my plans for building the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager.
I downloaded the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data consist of text from three sources (news, blogs, and Twitter feeds) in several languages; only the English files are used in this report.
# Packages that may be useful
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# Download and unzip the data to the local disk if not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Read the data into R
setwd("./final/en_US")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Data summary
blogs.size   <- paste(round(file.info("en_US.blogs.txt")$size / 1024^2, 1), "MB")
news.size    <- paste(round(file.info("en_US.news.txt")$size / 1024^2, 1), "MB")
twitter.size <- paste(round(file.info("en_US.twitter.txt")$size / 1024^2, 1), "MB")
data.frame(source = c("blogs", "news", "twitter"),
           file_size = c(blogs.size, news.size, twitter.size),
           number_of_words = c(sum(stri_count_words(blogs)),
                               sum(stri_count_words(news)),
                               sum(stri_count_words(twitter))))
## source file_size number_of_words
## 1 blogs 200.4 MB 37546239
## 2 news 196.3 MB 34762395
## 3 twitter 159.4 MB 30093413
Since the dataset is huge, I sample 1% of it. I then remove URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and convert the text to lower case.
# Sample 1% of the data
set.seed(999)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
# Create the corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
tran_f <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, tran_f, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, tran_f, "@[^\\s]+")                      # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
# Helper function: tokenize the corpus into N-grams, count term frequencies,
# sort them, and plot the 10 most frequent terms
exploratory_function <- function(NgramX, titleName) {
  Ngram <- function(x) NGramTokenizer(x, Weka_control(min = NgramX, max = NgramX))
  Ngramtab <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = Ngram)), 0.9999)
  Ngramcorpus <- findFreqTerms(Ngramtab)
  Ngramcorpusnum <- rowSums(as.matrix(Ngramtab[Ngramcorpus, ]))
  Ngramcorpustab <- data.frame(Word = names(Ngramcorpusnum), frequency = Ngramcorpusnum)
  Ngramcorpussort <- Ngramcorpustab[order(-Ngramcorpustab$frequency), ]
  # Plot the 10 most frequent N-grams
  graph0 <- ggplot(Ngramcorpussort[1:10, ], aes(x = reorder(Word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = I("grey50")) +
    labs(title = titleName, x = "Top 10 Words", y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60))
  return(graph0)
}
#### Unigram
exploratory_function(1, "Unigram")
#### Bigram
exploratory_function(2, "Bigram")
#### Trigram
exploratory_function(3, "Trigram")
After this exploratory analysis, I will design the text prediction algorithm and deploy it as a Shiny app. The algorithm will use an N-gram model to estimate the empirical distribution of possible next words and suggest the candidate with the highest probability.
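As a rough illustration (not the final implementation), the sketch below assumes that bigram and trigram frequency tables, named bigram_freq and trigram_freq here with the same Word/frequency columns as Ngramcorpussort above, have already been built from the cleaned corpus, and it simply backs off from trigrams to bigrams when no longer match is found. The final app may use a different smoothing or back-off scheme.
# A minimal sketch of the prediction step, assuming precomputed frequency tables
# ("trigram_freq" and "bigram_freq" are hypothetical names used here)
library(stringi)
predict_next_word <- function(phrase, trigram_freq, bigram_freq) {
  # In the app, the input phrase would be cleaned the same way as the corpus
  words <- stri_extract_all_words(stri_trans_tolower(phrase))[[1]]
  n <- length(words)
  # First look for trigrams that start with the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_freq[startsWith(as.character(trigram_freq$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(stri_extract_last_words(as.character(hits$Word[which.max(hits$frequency)])))
  }
  # Back off to bigrams that start with the last word typed
  if (n >= 1) {
    hits <- bigram_freq[startsWith(as.character(bigram_freq$Word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0)
      return(stri_extract_last_words(as.character(hits$Word[which.max(hits$frequency)])))
  }
  NA_character_  # no match; the app could fall back to the most frequent unigram
}
For example, predict_next_word("happy new", trigram_freq, bigram_freq) would return the most frequent word following "happy new" in the sampled data, if such a trigram survived the cleaning.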