The ultimate goal of this Data Science Capstone is to create a prediction algorithm that runs on mobile devices and suggests the next word as people type. The training data can be downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset used in this capstone consists of the text files in the en_US folder.
Before building the next-word prediction model, this milestone report is created for two purposes: (1) to present an exploratory analysis of the words in the corpus, and (2) to propose a prediction algorithm and a Shiny app.
In this report, the data are downloaded and sampled, the three en_US text files are tokenized, the most frequent words are explored, and a plan for the prediction algorithm and Shiny app is outlined.
library(NLP)
library(tm)
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
### download zipped file to local drive and unzip it
if (!file.exists("./capstone_dataset.zip")){
    url <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(url, "./capstone_dataset.zip", mode = "wb")
    unzip("capstone_dataset.zip")
}
# read the twitter text file
twittercon <- file("./final/en_US/en_US.twitter.txt")
twitterlines <- readLines(twittercon, encoding = "UTF-8", skipNul = TRUE)
nline_twitter <- length(twitterlines)
# read the news text file
newscon <- file("./final/en_US/en_US.news.txt")
newslines <- readLines(newscon, encoding = "UTF-8", skipNul = TRUE)
nline_news <- length(newslines)
# read the blogs text file
blogscon <- file("./final/en_US/en_US.blogs.txt")
blogslines <- readLines(blogscon, encoding = "UTF-8", skipNul = TRUE)
nline_blogs <- length(blogslines)
close(twittercon)
close(newscon)
close(blogscon)
# set seed and sampling ratio
set.seed(123)
samplingratio <- 0.002
# sample lines from the twitter data for training
twittersampleindex <- sample(1:nline_twitter, floor(samplingratio * nline_twitter), replace = FALSE)
twitter_sample <- twitterlines[twittersampleindex]
# sample lines from the news data for training
newsampleindex <- sample(1:nline_news, floor(samplingratio * nline_news), replace = FALSE)
news_sample <- newslines[newsampleindex]
# sample lines from the blogs data for training
blogsampleindex <- sample(1:nline_blogs, floor(samplingratio * nline_blogs), replace = FALSE)
blogs_sample <- blogslines[blogsampleindex]
The basic summary table for the en_US dataset is shown below.
en_US_dataset
## Number of lines Number of words
## Twitter 2360148 32793399
## News 1010242 36721087
## Blog 899288 39120549
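The code that builds `en_US_dataset` is not shown in this report. A minimal sketch, assuming the full line vectors read above and approximating word counts by splitting each line on whitespace, would be:
# count words in a character vector by splitting each line on whitespace
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))

# summary table of line counts and (approximate) word counts for the three files
en_US_dataset <- data.frame(
  "Number of lines" = c(nline_twitter, nline_news, nline_blogs),
  "Number of words" = c(count_words(twitterlines),
                        count_words(newslines),
                        count_words(blogslines)),
  row.names = c("Twitter", "News", "Blog"),
  check.names = FALSE
)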
To save computational time, I build a function for word tokenization of a text file. The input of this function is the sampled text data obtained above.
# Build a function for word tokenization
Tokenization_function <- function(textlines){
    text_sample <- gsub("[!;:,.?~#$%&()0-9]", " ", textlines)   # remove special symbols and digits
    # expand common contractions to their full forms
    text_sample <- gsub("'m", " am", text_sample)               # "I'm" -> "I am"
    text_sample <- gsub("'re", " are", text_sample)             # "you're" -> "you are"
    text_sample <- gsub("'ve", " have", text_sample)            # "I've" -> "I have"
    text_sample <- gsub("'d like", " would like", text_sample)  # "I'd like" -> "I would like"
    text_sample <- gsub("won't", "will not", text_sample)       # "won't" -> "will not"
    text_sample <- gsub("n't", " not", text_sample)             # "don't" -> "do not"
    text_sample <- gsub("'ll", " will", text_sample)            # "I'll" -> "I will"
    text_sample <- removeNumbers(text_sample)                   # remove numbers
    text_sample <- removePunctuation(text_sample)               # remove punctuation
    text_sample <- tolower(text_sample)                         # convert to lower case
    text_sample <- removeWords(text_sample, stopwords("en"))    # remove English stop words
    # create a corpus and compute the term-document matrix
    myCorpus <- Corpus(VectorSource(text_sample))
    myTextMatrix <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
    myToken <- sort(rowSums(as.matrix(myTextMatrix)), decreasing = TRUE)
    Token <- data.frame(Name = names(myToken), Count = myToken)
    return(Token)
}
For the Twitter text file, the tokens are retrieved below.
Twitter_Token <- Tokenization_function(twitter_sample)
The word cloud of the Twitter tokens is displayed below.
wordcloud(Twitter_Token$Name, Twitter_Token$Count, max.words = 100)
The barplot of the 10 most popular Twitter tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(Twitter_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from the Twitter Text File") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
For the News text file, the tokens are retrieved below.
News_Token <- Tokenization_function(news_sample)
The word cloud of the News tokens is displayed below.
wordcloud(News_Token$Name, News_Token$Count, max.words = 100)
The barplot of the 10 most popular News tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(News_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from News") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
For the Blogs text file, the tokens are retrieved below.
Blogs_Token <- Tokenization_function(blogs_sample)
The word cloud of the Blogs tokens is displayed below.
wordcloud(Blogs_Token$Name, Blogs_Token$Count, max.words = 100)
The barplot of the 10 most popular Blogs tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(Blogs_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from Blogs") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
The words-per-line ratios for the Twitter, News, and Blogs text files are 13.895, 36.349, and 43.502, respectively. Although the Twitter file has more than twice as many lines as the News and Blogs files, the total word counts of the three files are close to each other. The likely reason is that tweets are meant to express personal thoughts and feelings quickly, which keeps sentences very short. Blogs, in contrast, range from article-style posts to personal diaries, so their sentences are more elaborate than tweets. News text is written professionally, producing articles that are both well formed and concise.
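These ratios can be reproduced directly from the summary table; a minimal sketch, assuming the en_US_dataset object shown above, is:
# words-per-line ratio for each source, from the summary table above
round(en_US_dataset[, "Number of words"] / en_US_dataset[, "Number of lines"], 3)
# expected: Twitter ~13.895, News ~36.349, Blog ~43.502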
Four of the ten most popular tokens in the Twitter text ("good", "like", "love", and "thanks") show that tweets lean toward personal feelings. The tokens in the News and Blogs texts are more neutral, and their counts are similar to each other, except that the occurrence of the word "said" in the News text is exceptionally high, which is unsurprising for reported speech.
From the two observations above, we can see that different types of text have their own signatures, reflecting the purpose and attributes of each source.
Everyone has his or her own writing style and habits. The next-word prediction algorithm could perform better if each user had a customized training database. Based on the first few sentences a user types, I can classify the user's writing style and assign the most suitable training database, such as the Twitter or Blogs text.
For a given training database, I will test several algorithms that use bigram, trigram, and four-gram models for next-word prediction, and select the one with the highest accuracy on a test dataset. After choosing the best prediction algorithm, I will add the sentences typed by users to the training database, so that the typing Shiny app becomes truly customized for each user.
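As a concrete illustration of the planned n-gram approach (not the final implementation; build_ngrams and predict_next_word are hypothetical helper names), a minimal sketch of trigram-to-bigram backoff prediction could look like this:
# Sketch of n-gram next-word prediction with simple trigram-to-bigram backoff.
# build_ngrams() and predict_next_word() are illustrative helpers, not the final design.
build_ngrams <- function(lines, n) {
    words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < n) return(sort(table(character(0)), decreasing = TRUE))
    grams <- vapply(seq_len(length(words) - n + 1),
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    sort(table(grams), decreasing = TRUE)   # n-gram counts, most frequent first
}

predict_next_word <- function(phrase, bigrams, trigrams) {
    w <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
    w <- w[w != ""]
    if (length(w) == 0) return(NA_character_)
    # try the trigram context (last two words) first
    if (length(w) >= 2) {
        hits <- grep(paste0("^", w[length(w) - 1], " ", w[length(w)], " "),
                     names(trigrams), value = TRUE)
        if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    }
    # back off to the bigram context (last word)
    hits <- grep(paste0("^", w[length(w)], " "), names(bigrams), value = TRUE)
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    NA_character_                            # no match in either table
}

# example usage on the sampled twitter lines
bigrams  <- build_ngrams(twitter_sample, 2)
trigrams <- build_ngrams(twitter_sample, 3)
predict_next_word("thanks for the", bigrams, trigrams)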