Synopsis

This milestone report summarizes the work done so far for the SwiftKey capstone project. It demonstrates how the dataset was obtained, processed, and explored, and describes the next steps toward building an algorithm that predicts the text a user types into the Shiny app. We use tables and plots to illustrate important summaries from our exploratory data analysis. Finally, we outline our plan for the final product of this project.

Data Processing

We downloaded the data from the link provided on the Data Science Capstone course page: Capstone Dataset. After downloading, we load the data into R and perform data cleansing. Since our app will predict words in English, we start with the English-language datasets in the en_US folder.

# load the English blog and Twitter data
us_blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
us_twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")

# open the news file as a binary connection so readLines gets past
# embedded special characters instead of stopping early
con <- file("en_US.news.txt", open = "rb")
us_news <- readLines(con, encoding = "UTF-8")
close(con)
rm(con)

Word Count and other Basic Statistics

For the en_US data sets:

Corpus     Number of Entries     Number of Words
Twitter    2,360,148             30,417,180
Blogs      899,288               37,592,702
News       1,010,242             34,626,812
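The report does not show the exact code used to produce these counts; the short sketch below is one way they could be computed from the vectors loaded above, using the stringi package (corpus_stats is just an illustrative name).

# sketch: reproduce the entry and word counts with stringi
library(stringi)

corpus_stats <- data.frame(
  Corpus  = c("Twitter", "Blogs", "News"),
  Entries = c(length(us_twitter), length(us_blogs), length(us_news)),
  Words   = c(sum(stri_count_words(us_twitter)),
              sum(stri_count_words(us_blogs)),
              sum(stri_count_words(us_news)))
)
corpus_stats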

Data Cleaning and Sampling

require(tm)
require(openNLP)
require(RWeka)

# read the profanity list (colon-separated file, one term per row)
profane <- read.csv("google_twunter_lol.csv", sep = ":", header = FALSE)
profanity <- as.character(profane[, 1])

# keep each line of a corpus with probability p
Sample <- function(data, p) {
  data[as.logical(rbinom(length(data), 1, p))]
}

# take a 10% sample of each corpus and keep only letters and spaces
sampleData <- c(Sample(us_blogs, 0.1), Sample(us_twitter, 0.1), Sample(us_news, 0.1))
sampleData <- gsub("[^a-zA-Z ]", "", sampleData)
#sampleData <- gsub(paste(profanity, collapse='|'), " ", sampleData)

# build a single-document corpus, normalize whitespace and case, and remove profanity
sample_corpus <- Corpus(VectorSource(list(sampleData)))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removeWords, profanity)
#tdm <- TermDocumentMatrix(sample_corpus)

# write the cleaned corpus to disk, then keep a plain character version for tokenization
writeCorpus(sample_corpus, filenames = "sampleCorpus.txt")
sample_corpus <- as.character(sample_corpus)
#sampleCorpus <- readLines("sampleCorpus.txt")

Tokenization and N-gram Analysis

We will use the tokenizer provided by a Capstone community TA (Tokenizer) to split our sample corpus into n-grams.
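The source of ngram_tokenizer.R is not reproduced in this report; the sketch below only illustrates the kind of closure it is assumed to return (a function of n that splits text on whitespace and pastes consecutive words together), so that the calls in the following sections are easier to read. The community TA's actual implementation may differ.

# illustrative sketch only -- the sourced ngram_tokenizer.R may differ
# ngram_tokenizer(n) returns a function that turns text into n-grams
ngram_tokenizer <- function(n) {
  function(x) {
    words <- unlist(strsplit(x, "\\s+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # slide a window of length n across the word vector and paste each window
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }
}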

Unigram EDA

setwd("C:/Users/Charlie/Desktop/Data Science/Capstone/Coursera-SwiftKey/final/en_US")
library(stringi)
library(wordcloud)
source("ngram_tokenizer.R")

# tokenize the sample corpus into unigrams and build a sorted frequency table
unigram_tokenizer <- ngram_tokenizer(1)
wordlist <- unigram_tokenizer(sample_corpus)
freqs <- table(unlist(wordlist))
df1 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df1 <- df1[order(-df1$freq), ]
head(df1, 20)

# word cloud of the most frequent unigrams
wordcloud(df1$word, df1$freq, scale = c(4, .8), min.freq = 2, max.words = 100,
          random.order = TRUE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
word freq
the 475652
to 274039
and 239173
a 239171
of 200312
in 165371
i 164050
for 109696
is 107258
that 103757
you 93450
it 90740
on 83129
with 71690
was 62538
my 59974
at 57425
be 54622
this 53900
have 52958

[Figure: word cloud of the 100 most frequent unigrams]

Bigram EDA

library(ggplot2)

# tokenize the sample into bigrams and build a sorted frequency table
bigram_tokenizer <- ngram_tokenizer(2)
wordlist <- bigram_tokenizer(sampleData)
freqs <- table(unlist(wordlist))
df2 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df2 <- df2[order(-df2$freq), ]
head(df2)

# bar chart of the 15 most frequent bigrams
ggplot(head(df2, 15), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "violetred") +
  geom_text(aes(label = freq), vjust = -0.5) +
  ggtitle("Bigram frequency") +
  ylab("Frequency") +
  xlab("Phrase")
word freq
of the 42346
in the 38786
to the 20785
for the 19124
on the 18558
to be 15893

[Figure: bar chart of the 15 most frequent bigrams]

Trigram EDA

library(ggplot2)

# tokenize the sample into trigrams and build a sorted frequency table
trigram_tokenizer <- ngram_tokenizer(3)
wordlist <- trigram_tokenizer(sampleData)
freqs <- table(unlist(wordlist))
df3 <- data.frame(word = names(freqs), freq = as.numeric(freqs))
df3 <- df3[order(-df3$freq), ]
head(df3)

# bar chart of the 15 most frequent trigrams
ggplot(head(df3, 15), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "lightblue2") +
  geom_text(aes(label = freq), vjust = -0.5) +
  ggtitle("Trigram frequency") +
  ylab("Frequency") +
  xlab("Phrase")

[Figure: bar chart of the 15 most frequent trigrams]

word freq
one of the 2890
a lot of 2777
to be a 1757
going to be 1699
Thanks for the 1589
the end of 1459

Next Steps

Based on the exploratory data analysis above, we will use the sample corpus we generated to build our prediction model and then apply it to the full corpus. Since we already have n-gram frequencies from tokenization, we will continue on that path and build an n-gram model that predicts the next word from the conditional probabilities of a Markov model over word sequences. We will use perplexity to measure the effect of different smoothing methods. Once the algorithm is accurate and scalable enough, we will build a Shiny application in which users enter text and see the predicted next word. A simpler example of such a Shiny app could be a WordCloud.
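As a first illustration of the idea, the sketch below shows how the frequency tables df1, df2, and df3 built above could drive a simple backoff-style next-word lookup. It is only a rough prototype under the assumption that those tables are in memory; it ignores smoothing, and predict_next_word is a hypothetical helper name, not the final model.

# rough prototype of a backoff next-word lookup over the n-gram frequency tables;
# the real model will add smoothing and be evaluated with perplexity
predict_next_word <- function(phrase, df3, df2, df1) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # try trigrams whose first two words match the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- df3[grepl(paste0("^", prefix, " "), tolower(df3$word)), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # back off to bigrams whose first word matches the last word typed
  hits <- df2[grepl(paste0("^", words[n], " "), tolower(df2$word)), ]
  if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  # otherwise fall back to the single most frequent unigram
  as.character(df1$word[1])
}

predict_next_word("one of", df3, df2, df1)   # returns "the", given the trigram counts above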