Synopsis

This milestone report summarizes the work done so far for the SwiftKey capstone project. It demonstrates how the dataset was obtained, processed, and explored, and describes the next steps toward building an algorithm that predicts the text a user types into the Shiny app. We use tables and plots to illustrate important summaries from our exploratory data analysis. Finally, we outline our plan for the final product of this project.

Data Processing

We downloaded the data from the link provided on the Data Science Capstone course page: Capstone Dataset. After downloading, we load the data into R and perform data cleansing. Since our app will predict words in English, we start with the English-language datasets in the en_US folder.

# load the English blog and Twitter data
us_blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
us_twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")

# open the news file as a binary connection so readLines gets past
# embedded special characters instead of stopping early
con <- file("en_US.news.txt", open = "rb")
us_news <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)

Word Count and other Basic Statistics

For the en_US data sets:

Corpus     Number of Entries     Number of Words
Twitter    2,360,148             30,417,180
Blogs      899,288               37,592,702
News       1,010,242             34,626,812
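The report does not show the exact code used to produce these counts; the short sketch below is one way they could be computed from the vectors loaded above, using the stringi package (corpus_stats is just an illustrative name).

# sketch: reproduce the entry and word counts with stringi
library(stringi)

corpus_stats <- data.frame(
  Corpus  = c("Twitter", "Blogs", "News"),
  Entries = c(length(us_twitter), length(us_blogs), length(us_news)),
  Words   = c(sum(stri_count_words(us_twitter)),
              sum(stri_count_words(us_blogs)),
              sum(stri_count_words(us_news)))
)
corpus_stats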

Data Cleaning and Sampling

require(tm)
require(openNLP)
require(RWeka)

# read the profanity list (colon-separated file, one term per row)
profane <- read.csv("google_twunter_lol.csv", sep = ":", header = FALSE)
profanity <- as.character(profane[, 1])

# keep each line of a corpus with probability p
Sample <- function(data, p) {
  data[as.logical(rbinom(length(data), 1, p))]
}

# take a 10% sample of each corpus and keep only letters and spaces
sampleData <- c(Sample(us_blogs, 0.1), Sample(us_twitter, 0.1), Sample(us_news, 0.1))
sampleData <- gsub("[^a-zA-Z ]", "", sampleData)
#sampleData <- gsub(paste(profanity, collapse='|'), " ", sampleData)

# build a single-document corpus, normalize whitespace and case, and remove profanity
sample_corpus <- Corpus(VectorSource(list(sampleData)))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removeWords, profanity)
#tdm <- TermDocumentMatrix(sample_corpus)

# write the cleaned corpus to disk, then keep a plain character version for tokenization
writeCorpus(sample_corpus, filenames = "sampleCorpus.txt")
sample_corpus <- as.character(sample_corpus)
#sampleCorpus <- readLines("sampleCorpus.txt")

Tokenization and N-gram Analysis

We will use the tokenizer provided by a Capstone community TA (Tokenizer) to split our sample corpus into n-grams.
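The source of ngram_tokenizer.R is not reproduced in this report; the sketch below only illustrates the kind of closure it is assumed to return (a function of n that splits text on whitespace and pastes consecutive words together), so that the calls in the following sections are easier to read. The community TA's actual implementation may differ.

# illustrative sketch only -- the sourced ngram_tokenizer.R may differ
# ngram_tokenizer(n) returns a function that turns text into n-grams
ngram_tokenizer <- function(n) {
  function(x) {
    words <- unlist(strsplit(x, "\\s+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # slide a window of length n across the word vector and paste each window
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }
}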

Unigram EDA

setwd("C:/Users/Charlie/Desktop/Data Science/Capstone/Coursera-SwiftKey/final/en_US")
library(stringi)
library(wordcloud)
source("ngram_tokenizer.R")

# tokenize the sample corpus into unigrams and build a sorted frequency table
unigram_tokenizer <- ngram_tokenizer(1)
wordlist <- unigram_tokenizer(sample_corpus)
freqs <- table(unlist(wordlist))
df1 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df1 <- df1[order(-df1$freq), ]
head(df1, 20)

# word cloud of the most frequent unigrams
wordcloud(df1$word, df1$freq, scale = c(4, .8), min.freq = 2, max.words = 100,
          random.order = TRUE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
word freq
the 475652
to 274039
and 239173
a 239171
of 200312
in 165371
i 164050
for 109696
is 107258
that 103757
you 93450
it 90740
on 83129
with 71690
was 62538
my 59974
at 57425
be 54622
this 53900
have 52958

[Figure: word cloud of the 100 most frequent unigrams]

Bigram EDA

library(ggplot2)

# tokenize the sample into bigrams and build a sorted frequency table
bigram_tokenizer <- ngram_tokenizer(2)
wordlist <- bigram_tokenizer(sampleData)
freqs <- table(unlist(wordlist))
df2 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df2 <- df2[order(-df2$freq), ]
head(df2)

# bar chart of the 15 most frequent bigrams
ggplot(head(df2, 15), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "violetred") +
  geom_text(aes(label = freq), vjust = -0.5) +
  ggtitle("Bigram frequency") +
  ylab("Frequency") +
  xlab("Phrase")
word freq
of the 42346
in the 38786
to the 20785
for the 19124
on the 18558
to be 15893

[Figure: bar chart of the 15 most frequent bigrams]

Trigram EDA

library(ggplot2)

# tokenize the sample into trigrams and build a sorted frequency table
trigram_tokenizer <- ngram_tokenizer(3)
wordlist <- trigram_tokenizer(sampleData)
freqs <- table(unlist(wordlist))
df3 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df3 <- df3[order(-df3$freq), ]
head(df3)

# bar chart of the 15 most frequent trigrams
ggplot(head(df3, 15), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "lightblue2") +
  geom_text(aes(label = freq), vjust = -0.5) +
  ggtitle("Trigram frequency") +
  ylab("Frequency") +
  xlab("Phrase")

[Figure: bar chart of the 15 most frequent trigrams]

word freq
one of the 2890
a lot of 2777
to be a 1757
going to be 1699
Thanks for the 1589
the end of 1459

Next Steps

Based on the exploratory data analysis above, we will use the sample corpus we generated to build our prediction model and then apply it to the full corpus. Since we already have n-gram frequencies from tokenization, we will continue on that path and build an n-gram model that predicts the next word from the conditional probabilities of a Markov model over word sequences. We will use perplexity to measure the effect of different smoothing methods. Once the algorithm is accurate and scalable enough, we will build a Shiny application in which users enter text and see the predicted next word. A simpler example of such a Shiny app could be a WordCloud.
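As a first illustration of the idea, the sketch below shows how the frequency tables df1, df2, and df3 built above could drive a simple backoff-style next-word lookup. It is only a rough prototype under the assumption that those tables are in memory; it ignores smoothing, and predict_next_word is a hypothetical helper name, not the final model.

# rough prototype of a backoff next-word lookup over the n-gram frequency tables;
# the real model will add smoothing and be evaluated with perplexity
predict_next_word <- function(phrase, df3, df2, df1) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # try trigrams whose first two words match the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- df3[grepl(paste0("^", prefix, " "), tolower(df3$word)), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # back off to bigrams whose first word matches the last word typed
  hits <- df2[grepl(paste0("^", words[n], " "), tolower(df2$word)), ]
  if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  # otherwise fall back to the single most frequent unigram
  as.character(df1$word[1])
}

predict_next_word("one of", df3, df2, df1)   # returns "the", given the trigram counts above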