author: angelayuan
date: Sunday, Jul 26, 2015
Nowadays, people spend a lot of time on their mobile devices across a whole range of activities, but typing on a mobile device can be a serious pain. The aim of the current project is to build a predictive text product that suggests the next word based on the user's input.
The aim of this report is to show that I have become familiar with the data and that I am on track to build my prediction algorithm. In this report, you will mainly see (1) exploratory analysis of the data sets; (2) basic summary statistics about the data sets; and (3) interesting findings I have amassed so far.
library(tm) # for text mining
library(SnowballC) # for stemming
library(RWeka) # for tokenization
library(wordcloud) # for wordcloud plotting
library(ggplot2) # for plotting
blogs <- readLines("./en_US/en_US.blogs.txt")
twitter <- readLines("./en_US/en_US.twitter.txt")
news <- readLines("./en_US/en_US.news.txt")
blogLines <- length(blogs)
twitterLines <- length(twitter)
newsLines <- length(news)
c(blogLines,twitterLines,newsLines)
## [1] 899288 2360148 77259
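Beyond line counts, a rough per-file word count also gives a sense of scale. This is a quick check of my own (the helper name wordCount is made up, not part of the original summary), splitting each line on whitespace with base R:
# rough word counts per file, splitting lines on whitespace
wordCount <- function(lines) sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
c(blogs = wordCount(blogs), twitter = wordCount(twitter), news = wordCount(news))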
We can see that the original data sets are fairly large, containing 899,288 lines of blog data, 2,360,148 lines of twitter data, and 77,259 lines of news data. Therefore I draw a random sample of roughly 2% of the lines from each file and save the samples for further use.
set.seed(123456)
index <- rbinom(n = length(blogs), size = 1, prob = 0.02)
blogs <- blogs[index==1]
write.csv(blogs, file = "./en_US/sample/blogs.sample.csv", row.names = FALSE)
set.seed(123456)
index <- rbinom(n = length(twitter), size = 1, prob = 0.02)
twitter <- twitter[index==1]
write.csv(twitter, file = "./en_US/sample/twitter.sample.csv", row.names = FALSE)
set.seed(123456)
index <- rbinom(n = length(news), size = 1, prob = 0.02)
news <- news[index==1]
write.csv(news, file = "./en_US/sample/news.sample.csv", row.names = FALSE)
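The three blocks above repeat the same steps, so the same logic could be wrapped in a small helper. This is only a sketch of that refactoring (the name sampleLines is mine, not part of the pipeline):
# draw a ~2% Bernoulli sample of lines and save it for later use
sampleLines <- function(lines, outfile, prob = 0.02, seed = 123456) {
  set.seed(seed)
  keep <- rbinom(n = length(lines), size = 1, prob = prob) == 1
  write.csv(lines[keep], file = outfile, row.names = FALSE)
  lines[keep]
}
blogs   <- sampleLines(blogs, "./en_US/sample/blogs.sample.csv")
twitter <- sampleLines(twitter, "./en_US/sample/twitter.sample.csv")
news    <- sampleLines(news, "./en_US/sample/news.sample.csv")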
Preprocessing of the text data mainly involves the following steps: transforming to lower case, removing punctuation, removing numbers, removing special characters, removing stop words, removing profanity, stemming, and stripping extra white space.
# combine three sample datasets into a corpus
corpus <- Corpus(DirSource("./en_US/sample/"), readerControl = list(language="en_US"))
corpus <- tm_map(corpus, PlainTextDocument)
# transform to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# removing punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# removing numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))
# removing special characters: [[:alnum:]] matches [0-9A-Za-z]
removespecial <- function(x) gsub("[^[:alnum:]]", " ", x)
corpus <- tm_map(corpus, content_transformer(removespecial))
# removing English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# removing profanity words
profanitylist <- readLines("./en_US/profanity.txt")
corpus <- tm_map(corpus, removeWords, profanitylist)
# stemming
corpus <- tm_map(corpus, stemDocument, language = "english")
# stripping extra white space
corpus <- tm_map(corpus, stripWhitespace)
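As a quick sanity check (my own addition, not part of the original write-up), one can peek at a few cleaned lines to confirm that case, punctuation, numbers, and stop words were handled as intended:
# inspect the first few cleaned lines of each document in the corpus
lapply(corpus, function(doc) head(content(doc), 3))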
First, perform 1-gram tokenization and organize the result into a data frame containing each unique word and its frequency.
# unigram
unigram <- NGramTokenizer(corpus, control = Weka_control(min = 1, max = 1))
unigram <- data.frame(table(unigram))
unigram <- unigram[order(unigram$Freq, decreasing = TRUE), ]
colnames(unigram) <- c("Word","Freq")
saveRDS(unigram, file = "./en_US/sample/unigram.Rda")
Second, perform 2-gram tokenization and organize the result into a data frame containing each unique word pair and its frequency.
bigram <- NGramTokenizer(corpus, control = Weka_control(min = 2, max = 2))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE), ]
colnames(bigram) <- c("Word","Freq")
saveRDS(bigram, file = "./en_US/sample/bigram.Rda")
Third, perform 3-gram tokenization and organize the result into a data frame containing each unique word triple and its frequency.
trigram <- NGramTokenizer(corpus, control = Weka_control(min = 3, max = 3))
trigram <- data.frame(table(trigram))
trigram <- trigram[order(trigram$Freq, decreasing = TRUE), ]
colnames(trigram) <- c("Word","Freq")
saveRDS(trigram, file = "./en_US/sample/trigram.Rda")
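The three tokenization blocks differ only in the n-gram order, so the same logic could be folded into one helper. This is only a sketch (the name buildNgram is mine, not part of the report's code):
# build a frequency-sorted n-gram table from the corpus
buildNgram <- function(corpus, n) {
  tokens <- NGramTokenizer(corpus, control = Weka_control(min = n, max = n))
  tab <- data.frame(table(tokens))
  tab <- tab[order(tab$Freq, decreasing = TRUE), ]
  colnames(tab) <- c("Word", "Freq")
  tab
}
unigram <- buildNgram(corpus, 1)
bigram  <- buildNgram(corpus, 2)
trigram <- buildNgram(corpus, 3)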
First, plot a word cloud to form a general impression of the word frequency distribution.
wordcld <- wordcloud(corpus, scale=c(4,0.5), max.words=200, random.order=FALSE,
rot.per=0.4, use.r.layout=FALSE, colors=brewer.pal(8,'Dark2'))
Second, let's look at the number of unique words and the top 20 words with the highest frequency.
length(unigram$Word)
## [1] 54641
par(mar = c(5,4,2,2), las = 2)
barplot( height = unigram$Freq[1:20], names.arg = unigram$Word[1:20], horiz = FALSE, col = heat.colors(20),
main = "Top 20 Unigram Word" , ylab = "Frequency")
Third, it is very useful to know how many unique words are needed in a frequency-sorted dictionary to cover a certain percentage of all word instances in the language. Therefore I calculate and plot the number of unique words versus their coverage of total words.
# cumulative coverage of the frequency-sorted words, as a percentage of all word instances
unigram$ratio <- cumsum(unigram$Freq) * 100 / sum(unigram$Freq)
unigram$number <- 1:length(unigram$Word)
g <- ggplot(unigram, aes(x = number, y = ratio))
g + geom_line() + labs(title = "Number of Unique Words vs. Their Coverage of Total Words") + labs(x = "Number of Unique Words", y = "Coverage of Total Words (%)")
From the figure above, we can see that only around 2,000 unique words cover 50% of all word instances, and about 20,000 unique words cover nearly 90%. This provides important information for further modeling and prediction.
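To read the exact cut-offs off the curve rather than eyeball them, the ratio column computed above can be queried directly (a small check added here):
# number of top-ranked unique words needed to reach 50% and 90% coverage
min(which(unigram$ratio >= 50))
min(which(unigram$ratio >= 90))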
Now, let's look at the number of unique word pairs and the top 20 word pairs with the highest frequency.
length(bigram$Word)
## [1] 568056
par(mar = c(8,4,2,2), las = 2)
barplot( height = bigram$Freq[1:20], names.arg = bigram$Word[1:20], horiz = FALSE, col = heat.colors(20),
main = "Top 20 Bigram Word" , ylab = "Frequency")
Finally, let's look at the number of unique word triples and the top 20 word triples with the highest frequency.
length(trigram$Word)
## [1] 752166
par(mar = c(10,4,2,2), las = 2)
barplot( height = trigram$Freq[1:20], names.arg = trigram$Word[1:20], horiz = FALSE, col = heat.colors(20),
main = "Top 20 Trigram Word" , ylab = "Frequency")