Objective

This is the final project of the Coursera Data Science specialization. In this exploratory analysis report we lay the groundwork for an application built on a predictive text model: the user will provide a word or a phrase and the application will try to predict the next word. This report analyzes the most frequent unigrams, bigrams and trigrams. The model will be trained on three sources (blogs, Twitter, and news).

Load Data

First, load all the libraries needed.

library(tm)
library(dplyr)
library(ggplot2)
library(pryr)
library(stringr)
library(RWeka)

Load the data using readLines().

us_twitter = readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
us_blog = readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us_news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
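
Note: if readLines() warns about embedded nul characters or an incomplete final line on any of these files, the skipNul argument is one possible workaround; a hypothetical variant:

#us_news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)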

Summarize the number of lines in each source.

length(us_twitter)
## [1] 2360148
length(us_blog)
## [1] 899288
length(us_news)
## [1] 1010242
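
Line counts alone understate the size differences between the sources, so a rough word count is also worth a look. A quick sketch using stringr (loaded above), counting whitespace-separated tokens:

sum(str_count(us_blog, "\\S+"))     ## approximate word count of the blogs
sum(str_count(us_news, "\\S+"))     ## approximate word count of the news
sum(str_count(us_twitter, "\\S+"))  ## approximate word count of the tweets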

Processing data

The original files are large: each is more than 200 MB, which makes converting them directly into a Corpus impractical. I tried to convert the smallest of them, the blogs file, into a VCorpus and my laptop ran out of memory. So I take 1% of the content of each source and combine the samples for the analysis.

Sample content from each source, combine the samples, and then convert the text vector into a Corpus.

set.seed(12345)
s_blog = base::sample(us_blog,  length(us_blog)*0.01)
s_news = base::sample(us_news, length(us_news)*0.01)
s_twitter = base::sample(us_twitter, length(us_twitter)*0.01)
data = c(s_blog, s_news, s_twitter)
#object_size(data)
newCorpus = Corpus(VectorSource(data))
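
pryr is loaded for exactly this kind of sanity check; the commented object_size(data) call above can be used to confirm that the 1% sample fits comfortably in memory, for example:

length(data)        ## number of sampled lines
object_size(data)   ## memory footprint of the combined sample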

Process the text in the corpus: convert all words to lower case, remove punctuation, remove numbers, stem the documents, and strip extra whitespace (stopword removal is available but left commented out). Then the document-term matrix (DTM) is created.

newCorpus = tm_map(newCorpus, content_transformer(tolower))  ## lower-case all text
newCorpus = tm_map(newCorpus, removePunctuation)             ## remove punctuation
newCorpus = tm_map(newCorpus, removeNumbers)                 ## remove numbers
#newCorpus = tm_map(newCorpus, removeWords, stopwords("english"))  ## optional stopword removal
newCorpus = tm_map(newCorpus, stemDocument)                  ## stem words
newCorpus = tm_map(newCorpus, stripWhitespace)               ## collapse extra whitespace
newCorpus = tm_map(newCorpus, PlainTextDocument)             ## keep documents as plain text
newDTM = DocumentTermMatrix(newCorpus)
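
Before computing frequencies it is worth checking the size of the matrix; a quick sketch (removeSparseTerms is optional and only needed if memory becomes a problem):

dim(newDTM)   ## documents x terms
#newDTM = removeSparseTerms(newDTM, 0.999)   ## optional: drop very rare terms to save memory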

The word frequencies are stored in the freq variable and arranged in decreasing order.

freq = colSums(as.matrix(newDTM))
ord = order(-freq)
freq = freq[ord]
head(freq,20)
##   said   will   just    one   like    can    get   time    new   good 
##   3122   3091   2974   2687   2686   2503   2314   1950   1924   1806 
##    now    day   dont   know   love people   back    see   also  first 
##   1804   1743   1714   1624   1564   1531   1425   1365   1331   1317
head(table(freq),20) ##shows frequency of frequencies
## freq
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 32065  7663  3737  2322  1550  1202   929   795   619   548   442   363 
##    13    14    15    16    17    18    19    20 
##   346   331   274   250   225   187   194   185
tail(table(freq),20)
## freq
## 1317 1331 1365 1425 1531 1564 1624 1714 1743 1804 1806 1924 1950 2314 2503 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 2686 2687 2974 3091 3122 
##    1    1    1    1    1

We can see that 32065 terms appear only once, and many other terms appear very infrequently.
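
As a quick check on that observation, the singleton count (and other low-frequency cut-offs) can be read off directly:

sum(freq == 1)   ## terms that appear exactly once
sum(freq <= 5)   ## terms that appear five times or fewer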

Visualization

Now let's visualize the term frequencies.

wf = data.frame(word = names(freq), freq = freq)
p = ggplot(wf[1:20,], aes(x = reorder(word,freq), y = freq)) +
        geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
        coord_flip() + xlab("word") + ggtitle("top 20 frequent words")
p

Make a word cloud of the most frequent terms.

library(wordcloud)
set.seed(132)
wordcloud(names(freq), freq, max.words = 100, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))

Analyzing

Calculate how many unique words are needed in a frequency-sorted dictionary to cover 50% (and 90%) of all word instances in the language.

total_freq = sum(freq)
coverage = function (freq, percentage) {
     covered = 0
     count = 0
     while (covered < total_freq * percentage) {
             count = count + 1
             covered = sum(freq[1:count])
     }
     return(count)
}
fifty_cover = coverage(freq,0.5)
ninety_cover = coverage(freq, 0.9)
fifty_cover
## [1] 1017
ninety_cover
## [1] 16187

It turns out that to cover 50% of all word instances we need 1017 unique words, and to cover 90% we need 16187 words.
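
The same coverage counts can also be obtained without a loop by using cumsum() on the sorted frequencies; a minimal equivalent sketch:

cum_share = cumsum(freq) / sum(freq)
which(cum_share >= 0.5)[1]   ## should match fifty_cover
which(cum_share >= 0.9)[1]   ## should match ninety_cover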

Multi-grams analysis

After analyzing unigram frequencies, it is even more important to study the 2-grams and 3-grams in the dataset, since the prediction model will be based on multi-grams.

Custom tokenizers built on the ngrams() function (from the NLP package, loaded with tm) split the text into n-grams. Two additional document-term matrices are then created, one for bigrams and one for trigrams (the unigram matrix is newDTM from above).

bi_tokenizer = function (x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tri_tokenizer = function (x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
bi_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = bi_tokenizer))
tri_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = tri_tokenizer))
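
Since RWeka is loaded, an equivalent pair of tokenizers could also be built on its NGramTokenizer; a sketch of that alternative (same matrices, just a different tokenizer backend):

bi_tokenizer_weka = function (x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer_weka = function (x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#bi_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = bi_tokenizer_weka))
#tri_matrix = DocumentTermMatrix(newCorpus, control = list(tokenize = tri_tokenizer_weka))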

Analysis of 2-gram and 3-gram frequencies

Find the 2-gram and 3-gram terms that appear at least 100 times.

bi_corpus = findFreqTerms(bi_matrix, lowfreq = 100)
tri_corpus = findFreqTerms(tri_matrix, lowfreq = 100)
head(bi_corpus)
## [1] "a big"    "a bit"    "a couple" "a day"    "a few"    "a good"
head(tri_corpus)
## [1] "a lot of"    "as well as"  "be able to"  "end of the"  "going to be"
## [6] "i dont know"

Calculate the frequencies of the terms that appear at least 100 times.

freq2 = colSums(as.matrix(bi_matrix[,bi_corpus]))
freq3 = colSums(as.matrix(tri_matrix[,tri_corpus]))

Now we can summarize the most frequent 2-gram and 3-gram terms.

ord2 = order(-freq2)
freq2 = freq2[ord2]
ord3 = order(-freq3)
freq3 = freq3[ord3]
head(freq2)
##  of the  in the  to the  on the for the   to be 
##    4201    4154    2178    1958    1917    1564
head(freq3)
##     one of the       a lot of thanks for the        to be a    going to be 
##            323            302            224            169            162 
##     as well as 
##            151

Let's visualize them with bar plots.

df2 = data.frame(word = names(freq2), freq = freq2 )
q2 = ggplot(df2[1:20,], aes(x = reorder( word, freq), y = freq)) + 
        geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
        coord_flip() + xlab("word") + ggtitle("top 20 frequent 2-grams terms")
q2

df3 = data.frame(word = names(freq3), freq = freq3 )
q3 = ggplot(df3[1:20,], aes(x = reorder( word, freq), y = freq)) + 
        geom_bar(width = 0.5, stat = "identity", fill = "darkblue") +
        coord_flip() + xlab("word") + ggtitle("top 20 frequent 3-grams terms")
q3

Draw word clouds for the 2-gram and 3-gram terms.

library(wordcloud)
set.seed(132)
wordcloud(names(freq2), freq2, max.words = 50, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))

wordcloud(names(freq3), freq3, max.words = 50, scale = c(5, 0.1), colors = brewer.pal(6, "Dark2"))

Conclusion

This milestone report analyzes the frequencies of unigram, 2-gram and 3-gram terms. The 20 most frequent terms in each category are shown in bar plots, and word clouds were drawn for each. The next step is to build a predictive algorithm on top of these results. That algorithm will then power a Shiny app that suggests the most likely next word after a word or phrase is typed.
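
As a rough illustration of that next step (a sketch only, not the final algorithm), the sorted bigram table freq2 built above could already back a naive next-word lookup; the function name, the cut-off n, and the fallback rule below are placeholder choices:

predict_next = function (phrase, n = 3) {
     ## last word of the input phrase, lower-cased to match the processed corpus
     last_word = tail(strsplit(tolower(phrase), "\\s+")[[1]], 1)
     ## bigrams of the form "last_word <next>" (freq2 is already sorted by frequency)
     candidates = freq2[startsWith(names(freq2), paste0(last_word, " "))]
     ## crude fallback: most frequent unigrams when no bigram matches
     if (length(candidates) == 0) return(names(head(freq, n)))
     ## second word of the top matching bigrams
     sapply(strsplit(names(head(candidates, n)), " "), `[`, 2)
}
#predict_next("thanks for")   ## looks up bigrams beginning with "for"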