This milestone report summarizes the work completed so far for the SwiftKey capstone project. It describes how the dataset was obtained, processed, and explored, and it lays out the next steps toward building an algorithm that predicts the text a user types into the Shiny app. We use tables and plots to illustrate the key summaries from our exploratory data analysis, and we close with our plan for the final product.
We downloaded the data from the link provided on the Data Science Capstone course page: Capstone Dataset. After downloading, we load the data into R and perform data cleansing. Since our app will predict words in English, we start by loading the English-language datasets in the en_US folder.
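For reproducibility, a one-time download step along the following lines can fetch and unpack the archive (a sketch: the URL is the Coursera-hosted dataset linked from the course page, and the destination path is illustrative):
# one-time setup: download and unpack the Capstone dataset (sketch; paths are illustrative)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file)   # extracts final/en_US, final/de_DE, final/fi_FI, final/ru_RU
}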
#load in the english blog and twitter data
us_blogs <- readLines("en_US.blogs.txt",encoding="UTF-8")
us_twitter <- readLines("en_US.twitter.txt",encoding="UTF-8")
# load in the English news data; open the file in binary mode so that
# readLines does not stop early at an embedded control character
con <- file("en_US.news.txt", open="rb")
us_news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
Summary counts for the en_US data sets:
| Corpus | Number of Entries |
|---|---|
| Twitter | 2360148 |
| Blogs | 899288 |
| News | 1010242 |
| Corpus | Number of Words |
|---|---|
| Twitter | 30417180 |
| Blogs | 37592702 |
| News | 34626812 |
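The counts above can be reproduced from the loaded vectors with length() for the number of entries and stringi::stri_count_words() for the word totals (a sketch; word totals may vary slightly depending on the word-splitting rule used):
library(stringi)
corpora <- list(Twitter = us_twitter, Blogs = us_blogs, News = us_news)
# number of entries (lines) per corpus
sapply(corpora, length)
# number of words per corpus
sapply(corpora, function(x) sum(stri_count_words(x)))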
require(tm)
require(openNLP)
require(RWeka)
# read in a list of profane words to filter out of the corpus
profane <- read.csv("google_twunter_lol.csv",sep=":",header=FALSE)
profanity <- as.character(profane[,1])
# keep each line of a corpus with probability p
Sample <- function(data, p) {
    return(data[as.logical(rbinom(length(data),1,p))])
}
# take a 10% random sample from each corpus and strip non-alphabetic characters
sampleData <- c(Sample(us_blogs, 0.1),Sample(us_twitter, 0.1),Sample(us_news, 0.1))
sampleData <- gsub("[^a-zA-Z ]", "", sampleData)
#sampleData <- gsub(paste(profanity, collapse='|'), " ", sampleData)
sample_corpus <- Corpus(VectorSource(list(sampleData)))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removeWords, profanity)
#tdm <- TermDocumentMatrix(sample_corpus)
# write the cleaned corpus to disk, then keep a plain character version for tokenization
writeCorpus(sample_corpus, filenames="sampleCorpus.txt")
sample_corpus <- as.character(sample_corpus[[1]])
#SampleCorpus <- readLines("sampleCorpus.txt")
We will use the n-gram tokenizer shared by a Capstone Community TA (ngram_tokenizer.R) to tokenize our sample corpus.
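The sourced ngram_tokenizer.R is not reproduced in this report; purely as an illustration of what such a tokenizer factory does, a minimal stringi-based sketch (an assumption, not the TA's actual code) might look like this:
library(stringi)
# hypothetical sketch of an n-gram tokenizer factory; not the TA's implementation
ngram_tokenizer_sketch <- function(n = 1L) {
  function(x) {
    unlist(lapply(stri_split_regex(x, "\\s+", omit_empty = TRUE), function(words) {
      if (length(words) < n) return(character(0))
      # slide a window of length n over the words and paste each window into one n-gram
      starts <- seq_len(length(words) - n + 1L)
      vapply(starts, function(i) paste(words[i:(i + n - 1L)], collapse = " "), character(1))
    }), use.names = FALSE)
  }
}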
setwd("C:/Users/Charlie/Desktop/Data Science/Capstone/Coursera-SwiftKey/final/en_US")
library(stringi)
library(wordcloud)
source("ngram_tokenizer.R")
unigram_tokenizer <- ngram_tokenizer(1)
wordlist <- unigram_tokenizer(sample_corpus)
df1 <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(df1) <- c("word","freq")
df1 <- df1[with(df1, order(-df1$freq)),]
head(df1)
#m <- as.matrix(tdm)
#v <- sort(rowSums(m),decreasing=TRUE)
#d <- data.frame(word = names(v),freq=v)
wordcloud(df1$word,df1$freq, scale=c(4,.8),min.freq=2,max.words=100, random.order=T, rot.per=.15, colors=brewer.pal(8, "Dark2"))
| word | freq |
|---|---|
| the | 475652 |
| to | 274039 |
| and | 239173 |
| a | 239171 |
| of | 200312 |
| in | 165371 |
| i | 164050 |
| for | 109696 |
| is | 107258 |
| that | 103757 |
| you | 93450 |
| it | 90740 |
| on | 83129 |
| with | 71690 |
| was | 62538 |
| my | 59974 |
| at | 57425 |
| be | 54622 |
| this | 53900 |
| have | 52958 |
library(ggplot2)
bigram_tokenizer <- ngram_tokenizer(2)
wordlist <- bigram_tokenizer(sampleData)
df2 <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(df2) <- c("word","freq")
df2 <- df2[with(df2, order(-df2$freq)),]
head(df2)
ggplot(head(df2,15), aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="violetred") +
geom_text(aes(label=freq), vjust = -0.5) +
ggtitle("Bigrams frequency") +
ylab("Frequency") +
xlab("Phrase")
| word | freq |
|---|---|
| of the | 42346 |
| in the | 38786 |
| to the | 20785 |
| for the | 19124 |
| on the | 18558 |
| to be | 15893 |
library(ggplot2)
trigram_tokenizer <- ngram_tokenizer(3)
wordlist <- trigram_tokenizer(sampleData)
df3 <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(df3) <- c("word","freq")
df3 <- df3[with(df3, order(-df3$freq)),]
head(df3)
ggplot(head(df3,15), aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="Identity", fill="lightblue2") +
geom_text(aes(label=freq), vjust = -0.5) +
ggtitle("Trigram frequency") +
ylab("Frequency") +
xlab("Phrase")
| word | freq |
|---|---|
| one of the | 2890 |
| a lot of | 2777 |
| to be a | 1757 |
| going to be | 1699 |
| Thanks for the | 1589 |
| the end of | 1459 |
Based on the exploratory data analysis above, we will use the sample corpus we generated to build our prediction model and then apply it to the full corpus. Since we already have n-gram frequencies from tokenization, we will continue on that path and build an n-gram model that predicts the next word, using a Markov assumption on the conditional probabilities of word sequences. We will use perplexity on held-out text to measure the effect of different smoothing methods. Once the algorithm is sufficiently accurate and scalable, we will build a Shiny app that lets users enter text and see the predicted next word; a simpler companion output could be a word cloud like the one shown above.
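As a first, deliberately simple illustration of the planned approach, a back-off style predictor over the frequency tables built above (df1 = unigrams, df2 = bigrams, df3 = trigrams) might look like the sketch below; the final model will add proper smoothing and be evaluated on held-out data. The function name and the fixed back-off order are choices made for this sketch only.
# sketch of a back-off next-word predictor over the n-gram frequency tables
predict_next_word <- function(input, df1, df2, df3, k = 3) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  n <- length(tokens)
  # helper: rows whose n-gram starts with the given prefix followed by a space
  match_prefix <- function(df, prefix) {
    df[startsWith(tolower(as.character(df$word)), paste0(prefix, " ")), ]
  }
  # try trigrams first, conditioning on the last two words of the input
  if (n >= 2) {
    hits <- match_prefix(df3, paste(tokens[n - 1], tokens[n]))
    if (nrow(hits) > 0)
      return(head(sub(".* ", "", as.character(hits$word[order(-hits$freq)])), k))
  }
  # back off to bigrams, conditioning on the last word
  if (n >= 1) {
    hits <- match_prefix(df2, tokens[n])
    if (nrow(hits) > 0)
      return(head(sub(".* ", "", as.character(hits$word[order(-hits$freq)])), k))
  }
  # final fallback: the most frequent unigrams overall
  head(as.character(df1$word[order(-df1$freq)]), k)
}
# e.g. predict_next_word("one of", df1, df2, df3) should suggest "the" first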