This is a milestone report for the Data Science Capstone offered by Johns Hopkins University and Coursera. The goal of the project is to build a model that predicts the next word given an input phrase. This report examines the three text datasets provided by SwiftKey and presents some basic exploratory analysis of them. An N-gram model is used to preprocess a sample of the datasets, and the exploratory analysis is based on the resulting 1-gram, 2-gram, and 3-gram models. Finally, a plan for building the predictive model is outlined.
The datasets for this project are provided by HC Corpora. They can be downloaded from this website: SwiftKey Dataset
The datasets used in this project are en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
setwd("C:/Users/Borye/Documents/")
## load en_US.blogs.txt
con_blogs <- file("./R/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
data_blogs <- readLines(con_blogs)
close(con_blogs)
## load en_US.news.txt
con_news <- file("./R/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
data_news <- readLines(con_news)
close(con_news)
## load en_US.twitter.txt
con_twitter <- file("./R/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
data_twitter <- readLines(con_twitter)
close(con_twitter)
The table below shows the size and number of lines of each of the three datasets.
| File Name | Size | Number of Lines |
|---|---|---|
| en_US.blogs | 248.5 Mb | 899288 |
| en_US.news | 19.2 Mb | 77259 |
| en_US.twitter | 301.4 Mb | 2360148 |
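The report does not show how the figures in the table were obtained. The following is a minimal sketch of one way to compute them, assuming file size on disk is the measure wanted. The helper name `summarise_file` is hypothetical, and the demonstration runs on a small temporary file rather than on the real corpus.

```r
# Hypothetical helper: report size (in MB) and line count for one text file.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(
    file    = basename(path),
    size_mb = round(file.size(path) / 1024^2, 1),
    n_lines = length(lines),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a small temporary file, not on the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
file_summary <- summarise_file(tmp)
print(file_summary)
unlink(tmp)
```

Applying the same helper to each of the three file paths and combining the results with `rbind` would produce one row per dataset.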
As the table above shows, all three datasets are large, so we sample the data before conducting further analysis. In this report, 5,000 randomly chosen lines from each dataset are used to represent the whole.
sample_blogs <- sample(1:length(data_blogs), 5000, replace = FALSE)
data_blogs_sample <- data_blogs[sample_blogs]
sample_news <- sample(1:length(data_news), 5000, replace = FALSE)
data_news_sample <- data_news[sample_news]
sample_twitter <- sample(1:length(data_twitter), 5000, replace = FALSE)
data_twitter_sample <- data_twitter[sample_twitter]
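Because the sampling is random, each run of the report draws different lines. A minimal sketch of a reproducible draw is given here, assuming a fixed seed is acceptable. The seed value and the `toy_lines` vector are arbitrary stand-ins, not part of the original analysis.

```r
set.seed(2015)  # arbitrary seed, chosen only for this illustration

# Stand-in for one of the data_* vectors.
toy_lines <- paste("line", 1:100)

# Draw 5 distinct lines; sampling the character vector directly
# avoids keeping a separate vector of indices.
toy_sample <- sample(toy_lines, size = 5, replace = FALSE)
print(toy_sample)
```

With the seed set before every draw, the sampled lines stay the same between runs, which makes the example output shown in a report stable.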
Now we can take a peek at these samples.
head(data_blogs_sample, 3)
## [1] "This is the result."
## [2] "Reading the symptoms of Swine Flu in that email, I now know I certainly had the ish. And for a couple days, it kicked my ass. Luckily the wife and baby did not get sick, and I healed up in time to where I could consume many litres of beer at Amsterdam and Oktoberfest with no affect other than nearly ruining my marriage."
## [3] "8:00"
head(data_news_sample, 3)
## [1] "This definition is a problem for many people. They belong to a religion but they don't really believe or practice everything that the religion teaches. There may be some beliefs or practices that they don't follow. For, example, a 1998 study of Presbyterians found that only two-thirds of its members attended Sunday worship services \"every week\" or \"nearly every week\" Similarly, almost a third of its members don't agree that \"Jesus will return to earth some day.\" Remember, this survey is among members of the church. We have earlier discussed the fact that evangelical Christians beat their wives and divorce their spouses about as often as their non-Christian neighbors do."
## [2] "He speaks of Watts and Compton as one, as the two districts butt against one another in South Central Los Angeles, with Compton the more southern neighborhood, dwarfing Watts in size."
## [3] "Groza Award winner Randy Bullock of Texas A&M was the All-American kicker."
head(data_twitter_sample, 3)
## [1] "40-60% off on all Sports Apparel @ www.allsportsshopping.com. Nike, UGG Boots, New Balance, Team Sports Equipment, Golf, Cycles, Skis"
## [2] "you are very welcome!"
## [3] "No smurf left behind"
The next step is to create a corpus and process it further. It is better to split every line into separate sentences first. Without this step, the last word of one sentence could be combined with the first word of the next, producing useless n-grams in the model.
In this work, the qdap package is used to detect and split sentences.
library(qdap)
data_sample <- c(data_blogs_sample, data_news_sample)
data_sample <- sent_detect(data_sample)
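To illustrate why splitting matters, here is a small base R sketch on a toy line. The regular-expression split is a naive stand-in and is not the algorithm qdap uses; the helper `bigrams` is hypothetical.

```r
# Toy input: one line holding two sentences.
line <- "I went home. Dogs bark loudly."

# Hypothetical helper: build bigrams from a single string.
bigrams <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Without splitting, "home dogs" appears even though it spans two sentences.
unsplit_bigrams <- bigrams(line)

# Naive split on whitespace that follows ., ! or ? (not qdap's algorithm).
sentences <- strsplit(line, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
split_bigrams <- unlist(lapply(sentences, bigrams))

print(setdiff(unsplit_bigrams, split_bigrams))
```

The only bigram that disappears after splitting is the one that crossed the sentence boundary, which is exactly the kind of phrase we do not want the model to learn.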
We have split the blogs and news data into single sentences but handled the Twitter data separately, because tweets contain a lot of colloquial language and nonstandard usage. For example, people often omit the period at the end of a tweet, and sent_detect relies on sentence-ending punctuation such as the period to decide where a complete sentence ends. For the Twitter data, therefore, each line is processed one at a time.
## Process each tweet separately and combine the resulting sentences
tw <- unlist(lapply(data_twitter_sample, sent_detect))
data_sample <- c(data_sample, tw)
The datasets contain many non-ASCII characters, such as emoji. In this report, all non-ASCII characters are removed.
## iconv() is vectorized, so the whole sample can be converted in one call;
## sub = "" drops every byte that has no ASCII equivalent
data_sample <- iconv(data_sample, "latin1", "ASCII", sub = "")
Now it is time to create the corpus. The corpus is the foundation of this project: every predictive model built later will be based on it. During corpus creation, feature selection is an important step that extracts the main features of the datasets and eliminates the useless parts.
For example, we can convert all letters to lower case and remove all punctuation. In text mining, stemming and stopword removal are normally very useful tools for cleaning text data. However, the goal of this project is prediction, and the word to be predicted is very likely to be a stopword such as the, of, or are. We also don’t want the predictive algorithm to output a stemmed word.
The preprocessing in this report is for a simple N-gram model, so stemming and stopword removal are not applied. The tm package is used for this step.
library(tm)
corpus_sample <- VCorpus(VectorSource(data_sample))
corpus_sample <- tm_map(corpus_sample, content_transformer(tolower))
corpus_sample <- tm_map(corpus_sample, stripWhitespace)
corpus_sample <- tm_map(corpus_sample, removePunctuation)
It is also appropriate to filter profanity out of the datasets. After all, we don’t want to end up predicting an F-word after an input phrase.
setwd("C:/Users/Borye/Documents/")
con_profanity <- file("./R/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master/en", "r")
profanity <- readLines(con_profanity)
close(con_profanity)
corpus_sample <- tm_map(corpus_sample, removeWords, profanity)
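The idea behind word removal can be sketched in base R with a word-boundary regular expression. This is only an illustration under the assumption that whole-word matching is what we want; it is not presented as the internals of `removeWords`, and the two-word list is a mild, hypothetical stand-in for the real profanity file.

```r
# Hypothetical stand-in word list; the report reads a much longer list from a file.
bad_words <- c("darn", "heck")

# \\b anchors the match to word boundaries so that longer words
# containing a listed word (such as "heckle") are left alone.
pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")

lines <- c("well darn it", "what the heck", "heckle is a different word")
filtered <- gsub(pattern, "", lines, perl = TRUE)

# Collapse the gaps left behind by removed words.
filtered <- trimws(gsub("\\s+", " ", filtered))
print(filtered)
```

Matching whole words only is the important design choice here: a plain substring match would damage innocent words that happen to contain a listed one.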
Now the corpus is tokenized for further analysis. In this report, 1-gram, 2-gram, and 3-gram models are shown. We use the RWeka package to do the trick.
library(RWeka)
## Convert corpus to dataframe
dataframe_sample <- data.frame(text = unlist(sapply(corpus_sample, '[', 'content')), stringsAsFactors = FALSE)
## Process NGram model
token_one <- NGramTokenizer(dataframe_sample, Weka_control(min = 1, max = 1))
token_two <- NGramTokenizer(dataframe_sample, Weka_control(min = 2, max = 2))
token_three <- NGramTokenizer(dataframe_sample, Weka_control(min = 3, max = 3))
word_one <- data.frame(table(token_one))
word_two <- data.frame(table(token_two))
word_three <- data.frame(table(token_three))
word_one_order <- word_one[order(word_one$Freq, decreasing = TRUE), ]
word_two_order <- word_two[order(word_two$Freq, decreasing = TRUE), ]
word_three_order <- word_three[order(word_three$Freq, decreasing = TRUE), ]
names(word_one_order) <- c("Token", "Freq")
names(word_two_order) <- c("Token", "Freq")
names(word_three_order) <- c("Token", "Freq")
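For readers without RWeka, the same tokenize, count, and sort pipeline can be sketched in base R. This is a simplified illustration on toy sentences, assuming whitespace-separated words and n-grams that never cross a sentence; the helper `make_ngrams` is hypothetical and is not how `NGramTokenizer` is implemented.

```r
# Toy sentences standing in for the cleaned corpus.
sentences <- c("the cat sat on the mat", "the cat ran", "on the mat")

# Hypothetical helper: all n-grams of one sentence as space-joined strings.
make_ngrams <- function(sentence, n) {
  words <- strsplit(sentence, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count bigrams across all sentences and sort by descending frequency.
bigram_tokens <- unlist(lapply(sentences, make_ngrams, n = 2))
bigram_counts <- as.data.frame(table(Token = bigram_tokens),
                               stringsAsFactors = FALSE)
bigram_counts <- bigram_counts[order(bigram_counts$Freq, decreasing = TRUE), ]
print(bigram_counts)
```

Changing `n = 2` to 1 or 3 gives the corresponding 1-gram or 3-gram table with the same two columns, Token and Freq.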
The first 6 rows of the 2-gram model are presented here as an example.
head(word_two_order)
## Token Freq
## 145186 of the 2085
## 105642 in the 1906
## 215988 to the 948
## 147718 on the 881
## 80908 for the 809
## 213858 to be 672
We can observe that this data frame contains two columns: the first is the token, and the second is its frequency, sorted in decreasing order.
In this section we use the tokens created above to perform some exploratory analysis. First, bar plots and word clouds of the 11 most frequent tokens in the 1-gram, 2-gram, and 3-gram models are presented.
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(grid)
## Reset the factor levels of "Token" in every N-gram model so that they follow the "Freq" order
word_one_order$Token <- factor(word_one_order$Token, levels = unique(as.character(word_one_order$Token)))
word_two_order$Token <- factor(word_two_order$Token, levels = unique(as.character(word_two_order$Token)))
word_three_order$Token <- factor(word_three_order$Token, levels = unique(as.character(word_three_order$Token)))
1-gram
ggplot(word_one_order[1:11, ], aes(Token, Freq, fill = Token)) +
  geom_bar(stat = "identity", width = .8) +
  geom_text(aes(label = Freq), vjust = -0.2, size = 3) +
  scale_fill_manual(values = brewer.pal(11, "BrBG")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
        axis.text.y = element_text(size = 7),
        plot.margin = unit(c(0.3, 0, 0, 0), "cm")) +
  labs(title = "Word Frequency for 1-gram Model", x = "1-gram tokens", y = "Frequency")
wordcloud(word_one_order[1:11, ]$Token, word_one_order[1:11, ]$Freq, scale = c(5,1), rot.per = .3, colors = brewer.pal(11, "BrBG"), ordered.colors = TRUE, random.order = TRUE)
2-gram
ggplot(word_two_order[1:11, ], aes(Token, Freq, fill = Token)) +
  geom_bar(stat = "identity", width = .8) +
  geom_text(aes(label = Freq), vjust = -0.2, size = 3) +
  scale_fill_manual(values = brewer.pal(11, "PiYG")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
        axis.text.y = element_text(size = 7),
        plot.margin = unit(c(0.3, 0, 0, 0), "cm")) +
  labs(title = "Word Frequency for 2-gram Model", x = "2-gram tokens", y = "Frequency")
wordcloud(word_two_order[1:11, ]$Token, word_two_order[1:11, ]$Freq, scale = c(4,1), rot.per = .3, colors = brewer.pal(11, "PiYG"), ordered.colors = TRUE, random.order = TRUE)
3-gram
ggplot(word_three_order[1:11, ], aes(Token, Freq, fill = Token)) +
  geom_bar(stat = "identity", width = .8) +
  geom_text(aes(label = Freq), vjust = -0.2, size = 3) +
  scale_fill_manual(values = brewer.pal(11, "RdYlGn")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
        axis.text.y = element_text(size = 7),
        plot.margin = unit(c(0.3, 0, 0, 0), "cm")) +
  labs(title = "Word Frequency for 3-gram Model", x = "3-gram tokens", y = "Frequency")
wordcloud(word_three_order[1:11, ]$Token, word_three_order[1:11, ]$Freq, scale = c(3,1), rot.per = .3, colors = brewer.pal(11, "RdYlGn"), ordered.colors = TRUE, random.order = TRUE)
We can consider an interesting question: how many unique words do we need in a frequency-sorted dictionary to cover 50% or 90% of all word instances in the English language?
To answer this question, we assume that the corpus we generated is representative of the English language as a whole. To examine this assumption, a histogram of the 1-gram token frequencies is presented.
library(poweRlaw)
hist(word_one_order$Freq, breaks = 10000, xlim = c(0, 2000), ylim = c(0, 200), main = "Histogram of Frequency in 1-gram tokens", xlab = "Frequency of tokens in 1-gram model", ylab = "Frequency")
par(new = TRUE)
## Generate power law curve
m = displ$new()
m$setXmin(1)
x = 1:80
m$setPars(2)
plot(x, dist_pdf(m, x), type="l", col = "red", xaxt = 'n', yaxt = 'n', ann = FALSE, lwd = 2)
In the figure above, each bar of the histogram shows how many 1-gram tokens (words) occur with a given frequency. The red line is a power-law curve. It is clear that the word frequencies in our corpus approximately follow a power law. Since power-law word frequencies are widely observed in natural language (the pattern known as Zipf's law), we assume that our corpus approximately represents the English language as a whole and use it to test the frequency pattern.
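A complementary way to check for power-law behaviour is a rank-frequency plot on log-log axes, where a power law shows up as a roughly straight line. The sketch below uses synthetic counts that follow an ideal 1/rank pattern, not the report's data, so the fitted slope only demonstrates the method. In practice `freq` would be replaced by the sorted 1-gram frequencies.

```r
# Synthetic counts following an ideal 1/rank pattern (not the report's data).
rank <- 1:200
freq <- round(10000 / rank)

# Under a power law, log(freq) is roughly linear in log(rank).
fit <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
print(slope)

# Visual check: the points should lie close to the fitted red line.
plot(log(rank), log(freq), xlab = "log(rank)", ylab = "log(frequency)")
abline(fit, col = "red", lwd = 2)
```

A slope near -1 in such a plot is what Zipf's law predicts; strong curvature away from the line would suggest that the power-law assumption does not hold well for the sample.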
## 50%
num_count <- 0
count <- 0
s <- sum(word_one_order$Freq)
for(i in 1:dim(word_one_order)[1]){
num_count = num_count + word_one_order$Freq[i]
count = count + 1
if(num_count >= s/2){
break
}
}
print(count); print(count / dim(word_one_order)[1])
## [1] 147
## [1] 0.00404435
print(dim(word_one_order)[1])
## [1] 36347
## 90%
num_count <- 0
count <- 0
s <- sum(word_one_order$Freq)
for(i in 1:dim(word_one_order)[1]){
num_count = num_count + word_one_order$Freq[i]
count = count + 1
if(num_count >= s * 0.9){
break
}
}
print(count); print(count / dim(word_one_order)[1])
## [1] 7551
## [1] 0.2077475
As the calculation shows, covering 50% of all 1-gram token instances requires only 147 of the 36347 unique tokens, or 0.404% of them. Covering 90% requires 7551 of the 36347 unique tokens, or 20.8% of them.
Based on this result, we can keep only the most frequent tokens in the corpus (for example, those covering 50% or 90% of all instances) to represent the bulk of the original corpus. This will save a lot of time and memory when the predictive model is deployed in a web app.
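The two loops above can also be written as one vectorized helper built on `cumsum`. This is a minimal sketch assuming the frequencies are already sorted in decreasing order; the function name `tokens_for_coverage` and the toy frequencies are hypothetical.

```r
# Hypothetical helper: number of top tokens needed to reach a coverage target.
# `freq` must already be sorted in decreasing order.
tokens_for_coverage <- function(freq, target) {
  which(cumsum(freq) >= target * sum(freq))[1]
}

# Toy frequencies, not the report's counts.
toy_freq <- c(40, 25, 15, 8, 6, 3, 2, 1)
n_half   <- tokens_for_coverage(toy_freq, 0.5)
n_ninety <- tokens_for_coverage(toy_freq, 0.9)
print(c(n_half, n_ninety))
```

Calling the helper on `word_one_order$Freq` with targets 0.5 and 0.9 uses the same stopping condition as the loops, and any other coverage level can be tested without copying the loop again.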
The goal of this project is to build a model that predicts the next word given an arbitrary phrase. To achieve this goal, the future plan is as follows: