This is the final (capstone) project of the Coursera Data Science specialization. In this exploratory analysis report we focus on building an application based on a predictive text model: the user will provide a word or a phrase and the application will try to predict the next word. This report analyzes the most frequent unigrams, bigrams, and trigrams. The model will be trained on three sources (blogs, Twitter, and news).
First, load all the libraries needed.
library(tm)
library(dplyr)
library(ggplot2)
library(pryr)
library(stringr)
library(RWeka)
Load the data using readLines().
us_twitter = readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
us_blog = readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us_news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
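Depending on the R version and the files themselves, readLines() may warn about an incomplete final line or embedded nulls. A hedged variant for that case, shown for the news file with the same path as above:

us_news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8",
                    skipNul = TRUE)  # skip embedded nulls if the file contains any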
Summarize the number of lines in each source.
length(us_twitter)
## [1] 2360148
length(us_blog)
## [1] 899288
length(us_news)
## [1] 1010242
The original files are very large: each of them is more than 200 MB, so converting them directly into a corpus is impractical. I tried to convert the smallest file (blogs) into a VCorpus and my laptop ran out of memory. Instead, I take a 1% sample of each source and combine the samples for the analysis.
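A quick way to check the on-disk and in-memory sizes before deciding on a sampling strategy (same paths and objects as above; object_size() comes from the pryr package loaded earlier):

file.size("./final/en_US/en_US.blogs.txt")   / 1024^2  # size in MB
file.size("./final/en_US/en_US.news.txt")    / 1024^2
file.size("./final/en_US/en_US.twitter.txt") / 1024^2
object_size(us_blog)   # in-memory size of one of the loaded character vectors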
Sample the content, combine the samples from the different sources, and then convert the resulting text vector into a corpus.
set.seed(12345)
s_blog = base::sample(us_blog, length(us_blog)*0.01)
s_news = base::sample(us_news, length(us_news)*0.01)
s_twitter = base::sample(us_twitter, length(us_twitter)*0.01)
data = c(s_blog, s_news, s_twitter)
#object_size(data)
newCorpus = Corpus(VectorSource(data))
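Optionally, the combined sample can be inspected and cached to disk so later sessions can skip reloading the full files (the file name sample_en_US.txt is just an example):

length(data)        # number of sampled lines
object_size(data)   # memory footprint of the sample (pryr)
writeLines(data, "sample_en_US.txt")  # save the sample for reuse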
Process the text in the corpus: convert all words to lower case, remove punctuation, remove numbers, strip extra whitespace, and stem the documents. Stopword removal is left commented out, since common words such as "the" and "of" are exactly what a next-word predictor needs to keep. Then the document-term matrix (DTM) is created.
newCorpus = tm_map(newCorpus, content_transformer(tolower))  # lower-case all text
newCorpus = tm_map(newCorpus, removePunctuation)             # drop punctuation
newCorpus = tm_map(newCorpus, removeNumbers)                 # drop digits
#newCorpus = tm_map(newCorpus, removeWords, stopwords("english"))  # optional: drop stopwords
newCorpus = tm_map(newCorpus, stemDocument)                  # stem each document
newCorpus = tm_map(newCorpus, stripWhitespace)               # collapse extra whitespace
newCorpus = tm_map(newCorpus, PlainTextDocument)             # ensure PlainTextDocument class
newDTM = DocumentTermMatrix(newCorpus)
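Note that the DTM is stored as a sparse matrix, and the as.matrix() call in the next step densifies it, which can be memory-hungry for larger samples. A sketch of an alternative that stays sparse, assuming the slam package (a dependency of tm) is installed:

dim(newDTM)                   # documents x terms
library(slam)
freq_sparse = col_sums(newDTM)  # term frequencies computed on the sparse representation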
The word frequencies are stored in the freq variable and arranged in decreasing order.
freq = colSums(as.matrix(newDTM))   # total count of each term across all documents
ord = order(-freq)                  # indices that sort terms by decreasing frequency
freq = freq[ord]
head(freq, 20)
## said will just one like can get time new good
## 3122 3091 2974 2687 2686 2503 2314 1950 1924 1806
## now day dont know love people back see also first
## 1804 1743 1714 1624 1564 1531 1425 1365 1331 1317
head(table(freq),20) ##shows frequency of frequencies
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12
## 32065 7663 3737 2322 1550 1202 929 795 619 548 442 363
## 13 14 15 16 17 18 19 20
## 346 331 274 250 225 187 194 185
tail(table(freq),20)
## freq
## 1317 1331 1365 1425 1531 1564 1624 1714 1743 1804 1806 1924 1950 2314 2503
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2686 2687 2974 3091 3122
## 1 1 1 1 1
We can see that 32065 terms appear only once, and many others appear very infrequently.
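If memory becomes a concern, those rare terms could be pruned before modeling; a minimal sketch using tm's removeSparseTerms() (the 0.999 threshold is an arbitrary example):

# Drop terms that appear in fewer than ~0.1% of the documents (example threshold)
smallDTM = removeSparseTerms(newDTM, 0.999)
dim(smallDTM)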
Now let's take a look at the term frequencies visually.

## Visualization
wf = data.frame(word = names(freq), freq = freq)
p = ggplot(wf[1:20,], aes(x = reorder(word,freq), y = freq)) +
geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
coord_flip() + xlab("word") + ggtitle("top 20 frequent words")
p
Make a word cloud of the most frequent terms.
library(wordcloud)
set.seed(132)
wordcloud(names(freq), freq, max.words = 100, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))
Calculate how many unique words are needed in a frequency-sorted dictionary to cover 50% (and 90%) of all word instances in the language.

## Analyzing
total_freq = sum(freq)   # total number of word instances in the sample
# Given a frequency vector sorted in decreasing order, return how many terms
# are needed to cover the requested share of all word instances
coverage = function (data, percentage) {
  target = sum(data) * percentage
  covered = 0
  count = 0
  while (covered < target) {
    count = count + 1
    covered = covered + data[count]
  }
  return(count)
}
fifty_cover = coverage(freq, 0.5)
ninety_cover = coverage(freq, 0.9)
fifty_cover
## [1] 1017
ninety_cover
## [1] 16187
It turns out that to cover 50% of all word instances, we need 1017 unique words; to cover 90%, we need 16187 words.
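The same counts can also be obtained without a loop, since freq is already sorted in decreasing order; a vectorized equivalent:

# First index at which the cumulative coverage reaches the target share
min(which(cumsum(freq) >= 0.5 * sum(freq)))   # should equal fifty_cover
min(which(cumsum(freq) >= 0.9 * sum(freq)))   # should equal ninety_cover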
After analyzing unigram frequencies, it is even more important to study 2-grams and 3-grams in the dataset, since the prediction model is based on multi-grams.
The tokenizers below use the ngrams() and words() functions from the NLP package (loaded automatically with tm) to split strings into n-grams. Two additional document-term matrices are created, for bigrams and trigrams (the unigram matrix newDTM already exists).
bi_tokenizer = function (x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tri_tokenizer = function (x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
bi_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = bi_tokenizer))
tri_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = tri_tokenizer))
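The RWeka package loaded at the top is not actually used by these tokenizers; for reference, an equivalent bigram tokenizer could also be written with RWeka's NGramTokenizer (tokenization details may differ slightly, and a working rJava setup is required):

bi_tokenizer_rweka = function (x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
#bi_matrix_rweka = DocumentTermMatrix(newCorpus, control = list(tokenize = bi_tokenizer_rweka))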
Find the 2-gram and 3-gram terms that appear at least 100 times.
bi_corpus = findFreqTerms(bi_matrix, lowfreq = 100)
tri_corpus = findFreqTerms(tri_matrix, lowfreq = 100)
head(bi_corpus)
## [1] "a big" "a bit" "a couple" "a day" "a few" "a good"
head(tri_corpus)
## [1] "a lot of" "as well as" "be able to" "end of the" "going to be"
## [6] "i dont know"
Calculate the frequencies of the terms that appear at least 100 times.
freq2 = colSums(as.matrix(bi_matrix[,bi_corpus]))
freq3 = colSums(as.matrix(tri_matrix[,tri_corpus]))
Now we can summarize the most frequent 2-gram and 3-gram terms.
ord2 = order(-freq2)
freq2 = freq2[ord2]
ord3 = order(-freq3)
freq3 = freq3[ord3]
head(freq2)
## of the in the to the on the for the to be
## 4201 4154 2178 1958 1917 1564
head(freq3)
## one of the a lot of thanks for the to be a going to be
## 323 302 224 169 162
## as well as
## 151
Let's visualize them with bar plots.
df2 = data.frame(word = names(freq2), freq = freq2 )
q2 = ggplot(df2[1:20,], aes(x = reorder( word, freq), y = freq)) +
geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
coord_flip() + xlab("word") + ggtitle("top 20 frequent 2-grams terms")
q2
df3 = data.frame(word = names(freq3), freq = freq3 )
q3 = ggplot(df3[1:20,], aes(x = reorder( word, freq), y = freq)) +
geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
coord_flip() + xlab("word") + ggtitle("top 20 frequent 3-grams terms")
q3
Draw word clouds for the 2-gram and 3-gram terms.
library(wordcloud)
set.seed(132)
wordcloud(names(freq2), freq2, max.words = 50, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))
wordcloud(names(freq3), freq3, max.words = 50, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))
This milestone report analyzes the frequencies of unigram, 2-gram, and 3-gram terms. The top 20 terms of each category are shown in bar plots, and word clouds were drawn. The next step is to build a predictive algorithm from these results. The algorithm will then be used in a Shiny app that suggests the most likely next word after a word or phrase is typed.
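As a preview of that algorithm, here is a minimal sketch of a trigram lookup built on the freq3 counts above (the predict_next name is hypothetical; a real model would need smoothing and backoff to bigrams and unigrams, since freq3 only contains trigrams seen at least 100 times):

predict_next = function(two_words) {
  # Trigrams that start with the given two-word prefix (freq3 is sorted by frequency)
  matches = freq3[startsWith(names(freq3), paste0(tolower(two_words), " "))]
  if (length(matches) == 0) return(NA_character_)
  # Third word of the most frequent matching trigram (stringr::word)
  word(names(matches)[1], 3)
}
predict_next("one of")   # should return "the" given the counts above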