This report presents an exploratory data analysis of the words, tokens, and phrases in a set of US English texts. Three text files were used as the data source: a collection of tweets, a collection of blog posts, and a collection of news stories. The files were provided by the text prediction company SwiftKey and can be downloaded here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For some steps I used the tm package, and for others the tidytext package. To limit the volume of data for the exploratory analysis, I sampled 1% of the lines from each source so that my computer's memory could handle the data, and then combined the three samples into a single corpus. The sampled data was cleaned before tokenization; the main cleaning tasks were removing punctuation, numbers, stop words, and profanity.
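For completeness, the code in this report assumes the following packages have been loaded; this is a sketch of the likely setup chunk (not shown in the original output), and stringi and stopwords are loaded explicitly further down.
# Packages assumed by the code below
library(tm)            # corpus construction and cleaning
library(tidytext)      # n-gram tokenization
library(dplyr)         # data manipulation (count, mutate, %>%)
library(ggplot2)       # bar charts and coverage plots
library(gridExtra)     # arranging plots side by side
library(wordcloud)     # word clouds
library(RColorBrewer)  # color palettes (brewer.pal)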
Comparing the bar graphs of the most frequent unigrams, bigrams, and trigrams, I see no apparent relationship between the most frequent unigrams and the most frequent bigrams, but there does appear to be a subtle relationship between the most frequent bigrams and the most frequent trigrams. I could have investigated this further with more memory available on my computer.
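One way this relationship could be probed, sketched below using the docs_bigrams_count and docs_trigrams_count tables built later in this report, is to count how many of the most frequent trigrams begin with one of the most frequent bigrams.
# Sketch: do the top trigrams tend to start with one of the top bigrams?
# (assumes docs_bigrams_count and docs_trigrams_count from the n-gram sections below)
top_bigrams <- head(docs_bigrams_count$bigram, 100)
top_trigrams <- head(docs_trigrams_count$trigram, 100)
trigram_prefixes <- sapply(strsplit(top_trigrams, " "), function(w) paste(w[1], w[2]))
# Number of top-100 trigrams whose first two words form a top-100 bigram
sum(trigram_prefixes %in% top_bigrams)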
The exploratory analysis showed that approximately 1,100 unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the sampled US English text, and roughly 10,000 words to cover 90%. If we are interested in bigrams to predict which word pairs occur together, about 20,000 and 75,000 entries of a frequency-sorted bigram dictionary are needed to cover 50% and 90% of all bigram instances, respectively. Extending the analysis to trigrams, about 27,000 and 85,000 trigrams are needed to cover 50% and 90% of all trigram instances.
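These counts can be read off the cumulative-coverage plots below, or computed directly from a frequency-sorted table; coverage_count is a small helper of my own, shown here only as a sketch.
# Number of top-ranked entries needed to cover a fraction p of all instances,
# given a vector of counts already sorted in decreasing order
coverage_count <- function(freq, p) {
which(cumsum(freq) >= p * sum(freq))[1]
}
# Example, using the unigram frequency table d built below:
# coverage_count(d$freq, 0.5)
# coverage_count(d$freq, 0.9)
# Size of each raw file in megabytes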
blogs_size <- file.size(blogs_file) / (2^20)
news_size <- file.size(news_file) / (2^20)
twitter_size <- file.size(twitter_file) / (2^20)
blogs_size
## [1] 200.4242
news_size
## [1] 196.2775
twitter_size
## [1] 159.3641
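# Read each file as UTF-8; opening the connection in binary mode ("rb") avoids a premature end-of-file on Windows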
blogs <- readLines(file(blogs_file,"rb"), encoding="UTF-8")
news <- readLines(file(news_file,"rb"), encoding="UTF-8")
twitter <- readLines(file(twitter_file,"rb"), encoding="UTF-8")
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
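# Number of lines in each source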
no_of_lines_blogs <- length(blogs)
no_of_lines_news <- length(news)
no_of_lines_twitter <- length(twitter)
no_of_lines_blogs
## [1] 899288
no_of_lines_news
## [1] 1010242
no_of_lines_twitter
## [1] 2360148
# Approximate word counts per source, using stringi's LaTeX-style word statistics
require("stringi")
blogs_words_cnt <- stri_stats_latex(blogs)[4]
news_words_cnt <- stri_stats_latex(news)[4]
twitter_words_cnt <- stri_stats_latex(twitter)[4]
blogs_words_cnt
## Words
## 37570839
news_words_cnt
## Words
## 34494539
twitter_words_cnt
## Words
## 30451128
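# Total number of characters in each source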
blogs_char_cnt <- stri_stats_general(blogs)[3]
news_char_cnt <- stri_stats_general(news)[3]
twitter_char_cnt <- stri_stats_general(twitter)[3]
blogs_char_cnt
## Chars
## 206824382
news_char_cnt
## Chars
## 203223154
twitter_char_cnt
## Chars
## 162096031
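# Length (in characters) of the longest line in each source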
blogs_lgth <- max(nchar(blogs))
news_lgth <- max(nchar(news))
twitter_lgth <- max(nchar(twitter))
blogs_lgth
## [1] 40833
news_lgth
## [1] 11384
twitter_lgth
## [1] 140
To make processing the data more efficient, a random sample of 1% of the lines is taken from each source, written out for reference, and combined into a single corpus.
set.seed(1234)  # fixed seed (added) so the 1% sample is reproducible
# Randomly sample 1% of the lines from each source
blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.01))]
write.csv(blogs, file="./final/en_US/sample/blogs.csv", row.names=FALSE)
news <- news[sample(length(news), round(length(news) * 0.01))]
write.csv(news, file="./final/en_US/sample/news.csv", row.names=FALSE)
twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.01))]
write.csv(twitter, file = "./final/en_US/sample/twitter.csv", row.names=FALSE)
# Combine the three samples into a single corpus, one document per line
docs <- VCorpus(VectorSource(c(blogs, news, twitter)))
inspect(docs[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 602
require("stopwords")
## Loading required package: stopwords
## Warning: package 'stopwords' was built under R version 4.0.5
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Transform all text to lowercase
docs <- tm_map(docs, content_transformer(tolower))
# Remove stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove whitespaces
docs <- tm_map(docs, stripWhitespace)
# Remove profanity
if (!file.exists("./data/swearWords.txt"))
download.file(
url = "http://www.bannedwordlist.com/lists/swearWords.txt",
destfile = "./final/en_us/swearWords.txt",
method = "curl")
con <- file("./final/en_us/swearWords.txt")
profanity <- readLines(con, skipNul = TRUE)
## Warning in readLines(con, skipNul = TRUE):
## incomplete final line found on './final/en_us/swearWords.txt'
close(con)
docs <- tm_map(docs, removeWords, profanity)
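# Flatten the cleaned corpus into a data frame (one row per document) for tidytext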
docs_df <- data.frame(text=unlist(sapply(docs, `[`, "content")),
stringsAsFactors=F)
write.csv(docs_df,'./final/en_us/cleandata.csv', row.names=FALSE)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2672997 142.8 7524294 401.9 9405367 502.4
## Vcells 10279421 78.5 71581585 546.2 101548417 774.8
memory.limit(size=64000)
## [1] 64000
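# Build a term-document matrix and a frequency-sorted table of unigram counts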
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
## word freq
## said said 7091
## one one 5777
## just just 5143
## get get 4695
## like like 4694
## can can 4374
## new new 4002
## time time 3978
## now now 3636
## day day 3148
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Accent"), scale=c(3.5,0.25))
The word cloud above shows that the most common word in the cleaned dataset is ‘said’, followed by ‘one’, ‘just’, ‘get’, ‘like’, and so on. It is a pictorial representation of the frequency-sorted cleaned data.
g1 <- ggplot(data=d[1:10,], aes(x = word, y = freq, fill=word))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Words")
g3 <- g2 + geom_text(data = d[1:10,], aes(x = word, y = freq, label = freq), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4
summary(d)
## word freq
## Length:23570 Min. : 1.00
## Class :character 1st Qu.: 6.00
## Mode :character Median : 15.00
## Mean : 46.76
## 3rd Qu.: 32.00
## Max. :7091.00
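# Cumulative coverage: how many of the most frequent unigrams cover 50% and 90% of all word instances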
tot_ug <- sum(d$freq)
tot_ug_50 <- as.data.frame(d[1:1500,])
tot_ug_50 <- tot_ug_50 %>%
mutate(cumulative = cumsum(freq)) %>%
mutate(index = seq.int(1, 1500))
tot_ug_90 <- as.data.frame(d[1:20000,])
tot_ug_90 <- tot_ug_90 %>%
mutate(cumulative = cumsum(freq)) %>%
mutate(index = seq.int(1, 20000))
g1 <- ggplot(data=tot_ug_50, aes(x=index, y=cumulative))
g2 <- g1 + labs(x="Number of unique words", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_ug, col="blue")
g5 <- ggplot(data=tot_ug_90, aes(x=index, y=cumulative))
g6 <- g5 + labs(x="Number of unique words", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_ug, col="red")
grid.arrange(g4, g8, ncol = 2)
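# Tokenize the cleaned text into bigrams and count their frequencies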
docs_bigrams <- docs_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
docs_bigrams_count <- docs_bigrams %>%
count(bigram, sort = TRUE)
head(docs_bigrams_count, 10)
## bigram n
## 1 new york 453
## 2 last year 443
## 3 years ago 348
## 4 right now 342
## 5 last night 324
## 6 mister rogers 315
## 7 little boy 279
## 8 can get 242
## 9 make sure 242
## 10 high school 239
wordcloud(words=docs_bigrams_count$bigram, freq=docs_bigrams_count$n, min.freq = 100,
max.words=100,random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Accent"), scale=c(3.5,0.03))
g1 <- ggplot(data=docs_bigrams_count[1:10,], aes(x = bigram, y = n, fill=bigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Bigrams")
g3 <- g2 + geom_text(data = docs_bigrams_count[1:10,], aes(x = bigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4
summary(docs_bigrams_count)
## bigram n
## Length:142707 Min. : 1.000
## Class :character 1st Qu.: 1.000
## Mode :character Median : 4.000
## Mean : 7.779
## 3rd Qu.: 12.000
## Max. :453.000
tot_bg <- sum(docs_bigrams_count$n)
tot_bg_50 <- as.data.frame(docs_bigrams_count[1:30000,])
tot_bg_50 <- tot_bg_50 %>%
mutate(cumulativebg = cumsum(n)) %>%
mutate(index = seq.int(1, 30000))
tot_bg_90 <- as.data.frame(docs_bigrams_count[1:100000,])
tot_bg_90 <- tot_bg_90 %>%
mutate(cumulativebg = cumsum(n)) %>%
mutate(index = seq.int(1, 100000))
g1 <- ggplot(data=tot_bg_50, aes(x=index, y=cumulativebg))
g2 <- g1 + labs(x="Number of unique bigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_bg, col="blue")
g5 <- ggplot(data=tot_bg_90, aes(x=index, y=cumulativebg))
g6 <- g5 + labs(x="Number of unique bigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_bg, col="red")
grid.arrange(g4, g8, ncol = 2)
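# Tokenize the cleaned text into trigrams and count their frequencies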
docs_trigrams <- docs_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3)
docs_trigrams_count <- docs_trigrams %>%
count(trigram, sort = TRUE)
head(docs_trigrams_count, 10)
## trigram n
## 1 boy big sword 126
## 2 little boy big 126
## 3 new york city 117
## 4 gaston south carolina 110
## 5 south carolina attractions 110
## 6 love toast mom 92
## 7 id love tell 87
## 8 advertising people good 70
## 9 pu bef th 66
## 10 happy little boy 60
wordcloud(words=docs_trigrams_count$trigram, freq=docs_trigrams_count$n, min.freq = 45,
max.words=100,random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Accent"), scale=c(2.5,0.005))
g1 <- ggplot(data=docs_trigrams_count[1:10,], aes(x = trigram, y = n, fill=trigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Trigrams")
g3 <- g2 + geom_text(data = docs_trigrams_count[1:10,], aes(x = trigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4
summary(docs_trigrams_count)
## trigram n
## Length:189229 Min. : 1.000
## Class :character 1st Qu.: 1.000
## Mode :character Median : 2.000
## Mean : 5.742
## 3rd Qu.: 9.000
## Max. :126.000
tot_tg <- sum(docs_trigrams_count$n)
tot_tg_50 <- as.data.frame(docs_trigrams_count[1:35000,])
tot_tg_50 <- tot_tg_50 %>%
mutate(cumulativetg = cumsum(n)) %>%
mutate(index = seq.int(1, 35000))
tot_tg_90 <- as.data.frame(docs_trigrams_count[1:130000,])
tot_tg_90 <- tot_tg_90 %>%
mutate(cumulativetg = cumsum(n)) %>%
mutate(index = seq.int(1, 130000))
g1 <- ggplot(data=tot_tg_50, aes(x=index, y=cumulativetg))
g2 <- g1 + labs(x="Number of unique trigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_tg, col="blue")
g5 <- ggplot(data=tot_tg_90, aes(x=index, y=cumulativetg))
g6 <- g5 + labs(x="Number of unique trigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_tg, col="red")
grid.arrange(g4, g8, ncol = 2)
I will start by removing inefficiencies from the code by storing the intermediate objects in a SQL database rather than in R's memory, and by pushing the heavy data processing onto the SQL server. This should help the application run faster. I will then build a model based on a Markov chain. This makes sense because, in a Markov chain, the choice of the next word depends only on the preceding word (or, for higher-order chains, the preceding few words), and the chain can in theory be extended without limit. In other words, to predict the next word we are not constrained to a fixed 1-, 2-, 3-, 4-, or 5-gram window.
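As a first step toward that model, the bigram counts already computed in this report could back a simple first-order predictor. The sketch below is illustrative only: predict_next is a hypothetical helper name, and it assumes the dplyr and tidyr packages are available.
# Sketch of a first-order Markov (bigram) predictor built from docs_bigrams_count
library(tidyr)
bigram_table <- docs_bigrams_count %>%
separate(bigram, into = c("word1", "word2"), sep = " ")
# Given the previous word, return the k most frequent following words in the sample
predict_next <- function(prev_word, k = 3) {
bigram_table %>%
filter(word1 == prev_word) %>%
arrange(desc(n)) %>%
head(k) %>%
pull(word2)
}
predict_next("new")  # "york" should be the top candidate, since "new york" is the most frequent bigram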