Introduction

This report presents an exploratory data analysis of the words, tokens, and phrases in the text. Three text files were used as the data source: a collection of tweets, a collection of blog posts, and a collection of news stories. The files were provided by the text prediction company SwiftKey and can be found here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

For some steps I used the package tm, and for others the package tidytext. To limit the volume of data for the exploratory analysis, I sampled 1% of the lines from each source so that my computer's memory could handle the data, and then created a single corpus from the three samples. After sampling, the data was cleaned; the main cleaning tasks were to remove punctuation, numbers, stop words, and profanity, and to convert the text to lowercase.

When I compare the bar graphs of the most frequent unigrams, bigrams, and trigrams, I see no apparent relationship between the most frequent unigrams and the most frequent bigrams, but there does appear to be a subtle relationship between the most frequent bigrams and the most frequent trigrams. I could have investigated this further had more memory been available on my computer.

The exploratory analysis showed that approximately 1,100 unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in US English, and roughly 10,000 words to cover 90%. If we are interested in bigrams to predict which word pairs occur together, we need about 20,000 and 75,000 bigrams to cover 50% and 90%, respectively, of all bigram instances in a frequency-sorted dictionary of bigrams. Extending this analysis, about 27,000 and 85,000 trigrams are needed to cover 50% and 90% of all trigram instances.
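
These coverage figures can be computed directly from a frequency-sorted n-gram table by accumulating the counts until the desired share of all instances is reached. The sketch below shows one way to do this; coverage_count is a hypothetical helper name, and it assumes a vector of frequencies sorted in decreasing order, such as the freq column of the unigram table d built later in this report.

# Hypothetical helper: how many of the most frequent terms are needed
# to cover a share `pct` of all term instances?
coverage_count <- function(freqs, pct) {
  cum_share <- cumsum(freqs) / sum(freqs)  # cumulative share of all instances
  which(cum_share >= pct)[1]               # first rank reaching the target share
}

# Usage, once the frequency tables below have been built:
# coverage_count(d$freq, 0.50)                # unique words for 50% coverage
# coverage_count(docs_bigrams_count$n, 0.90)  # unique bigrams for 90% coverage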

Load Libraries
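
The chunk that loads the libraries is not echoed in the rendered report. A plausible set of calls, inferred from the functions used below rather than taken from the original chunk, would be:

library(tm)            # corpus construction and cleaning (VCorpus, tm_map, TermDocumentMatrix)
library(tidytext)      # n-gram tokenisation (unnest_tokens)
library(dplyr)         # data manipulation (%>%, count, mutate)
library(wordcloud)     # word clouds
library(RColorBrewer)  # colour palettes (brewer.pal)
library(ggplot2)       # bar plots and coverage curves
library(gridExtra)     # arranging plots side by side (grid.arrange)

The stringi and stopwords packages are loaded explicitly with require() later in the report.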

Download data
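
The download chunk is likewise not echoed. A minimal sketch, assuming the SwiftKey URL from the introduction and the standard layout of the zip; the *_file paths defined here are the ones used in the next section:

zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/ directory
}

# US English source files inside the archive
blogs_file   <- "./final/en_US/en_US.blogs.txt"
news_file    <- "./final/en_US/en_US.news.txt"
twitter_file <- "./final/en_US/en_US.twitter.txt"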

Explore data

File size in MB

blogs_size   <- file.size(blogs_file) / (2^20)
news_size    <- file.size(news_file) / (2^20)
twitter_size <- file.size(twitter_file) / (2^20)
blogs_size
## [1] 200.4242
news_size
## [1] 196.2775
twitter_size
## [1] 159.3641

Read the data files

blogs   <- readLines(file(blogs_file,"rb"), encoding="UTF-8")
news    <- readLines(file(news_file,"rb"), encoding="UTF-8")
twitter <- readLines(file(twitter_file,"rb"), encoding="UTF-8") 
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul

Number of lines per file

no_of_lines_blogs <- length(blogs)
no_of_lines_news <- length(news)
no_of_lines_twitter <- length(twitter)
no_of_lines_blogs
## [1] 899288
no_of_lines_news
## [1] 1010242
no_of_lines_twitter
## [1] 2360148

Number of words per file

require("stringi")
blogs_words_cnt <- stri_stats_latex(blogs)[4]
news_words_cnt <- stri_stats_latex(news)[4]
twitter_words_cnt <- stri_stats_latex(twitter)[4]
blogs_words_cnt
##    Words 
## 37570839
news_words_cnt
##    Words 
## 34494539
twitter_words_cnt
##    Words 
## 30451128

Number of characters per file

blogs_char_cnt <- stri_stats_general(blogs)[3]
news_char_cnt <- stri_stats_general(news)[3]
twitter_char_cnt <- stri_stats_general(twitter)[3]
blogs_char_cnt
##     Chars 
## 206824382
news_char_cnt
##     Chars 
## 203223154
twitter_char_cnt
##     Chars 
## 162096031

Longest line

blogs_lgth <- max(nchar(blogs))
news_lgth <- max(nchar(news))
twitter_lgth <- max(nchar(twitter))
blogs_lgth
## [1] 40833
news_lgth
## [1] 11384
twitter_lgth
## [1] 140

Sampling data

To make processing the data more efficient, a random sample of roughly 1% of the lines is taken from each source, written to new files, and combined into a single corpus.

# Draw a uniform random sample of 1% of the lines from each source
set.seed(1234)   # fix the seed so the sample is reproducible
blogs <- sample(blogs, round(length(blogs) * 0.01))
write.csv(blogs, file="./final/en_US/sample/blogs.csv", row.names=FALSE)

news <- sample(news, round(length(news) * 0.01))
write.csv(news, file="./final/en_US/sample/news.csv", row.names=FALSE)

twitter <- sample(twitter, round(length(twitter) * 0.01))
write.csv(twitter, file = "./final/en_US/sample/twitter.csv", row.names=FALSE)

# Combine the three samples into a single corpus (one document per sampled line)
docs <- VCorpus(VectorSource(c(blogs, news, twitter)))

inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 602

Cleaning data

require("stopwords")
## Loading required package: stopwords
## Warning: package 'stopwords' was built under R version 4.0.5
## 
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
## 
##     stopwords
# Remove punctuation
docs <- tm_map(docs, removePunctuation)

# Remove numbers
docs <- tm_map(docs, removeNumbers)

# Transform all alphabets to lowercase
docs <- tm_map(docs, content_transformer(tolower))

# Remove stop words
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove whitespaces
docs <- tm_map(docs, stripWhitespace)

# Remove profanity
if (!file.exists("./final/en_us/swearWords.txt"))
  download.file(
    url = "http://www.bannedwordlist.com/lists/swearWords.txt", 
    destfile = "./final/en_us/swearWords.txt", 
    method = "curl")

con <- file("./final/en_us/swearWords.txt")
profanity <- readLines(con, skipNul = TRUE)
## Warning in readLines(con, skipNul = TRUE): incomplete final line found on
## './final/en_us/swearWords.txt'
close(con)
docs <- tm_map(docs, removeWords, profanity)

# Flatten the cleaned corpus into a data frame so it can be tokenised with tidytext
docs_df <- data.frame(text=unlist(sapply(docs, `[`, "content")), 
                      stringsAsFactors=F)
write.csv(docs_df,'./final/en_us/cleandata.csv', row.names=FALSE)

Exploratory Data Analysis

Unigram Analysis

Build a term document matrix
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  2672997 142.8    7524294 401.9   9405367 502.4
## Vcells 10279421  78.5   71581585 546.2 101548417 774.8
memory.limit(size=64000)
## [1] 64000
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##      word freq
## said said 7091
## one   one 5777
## just just 5143
## get   get 4695
## like like 4694
## can   can 4374
## new   new 4002
## time time 3978
## now   now 3636
## day   day 3148
Generate the word cloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(3.5,0.25))

The word cloud above shows that the most common word in the clean dataset is ‘said’, followed by ‘one’, ‘just’, ‘get’, ‘like’, and so on. It is a pictorial representation of the sorted clean data.

Generate the Bar Plot
g1 <- ggplot(data=d[1:10,], aes(x = word, y = freq, fill=word))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Words")
g3 <- g2 + geom_text(data = d[1:10,], aes(x = word, y = freq, label = freq), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine unigram coverage
summary(d)
##      word                freq        
##  Length:23570       Min.   :   1.00  
##  Class :character   1st Qu.:   6.00  
##  Mode  :character   Median :  15.00  
##                     Mean   :  46.76  
##                     3rd Qu.:  32.00  
##                     Max.   :7091.00
tot_ug <- sum(d$freq)

tot_ug_50 <- as.data.frame(d[1:1500,])
tot_ug_50 <- tot_ug_50 %>%
  mutate(cumulative = cumsum(freq)) %>% 
  mutate(index = seq.int(1, 1500))

tot_ug_90 <- as.data.frame(d[1:20000,])
tot_ug_90 <- tot_ug_90 %>%
  mutate(cumulative = cumsum(freq)) %>% 
  mutate(index = seq.int(1, 20000))

g1 <- ggplot(data=tot_ug_50, aes(x=index, y=cumulative)) 
g2 <- g1 + labs(x="Number of unique words", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_ug, col="blue")

g5 <- ggplot(data=tot_ug_90, aes(x=index, y=cumulative)) 
g6 <- g5 + labs(x="Number of unique words", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_ug, col="red")

grid.arrange(g4, g8, ncol = 2)

Bigram Analysis

Build a bigram
docs_bigrams <- docs_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
docs_bigrams_count <- docs_bigrams %>%
  count(bigram, sort = TRUE)
head(docs_bigrams_count, 10)
##           bigram   n
## 1       new york 453
## 2      last year 443
## 3      years ago 348
## 4      right now 342
## 5     last night 324
## 6  mister rogers 315
## 7     little boy 279
## 8        can get 242
## 9      make sure 242
## 10   high school 239
Generate the Bigram word cloud
wordcloud(words=docs_bigrams_count$bigram, freq=docs_bigrams_count$n, min.freq = 100,
          max.words=100,random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(3.5,0.03))

Generate the Bigram Bar Plot
g1 <- ggplot(data=docs_bigrams_count[1:10,], aes(x = bigram, y = n, fill=bigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Bigrams")
g3 <- g2 + geom_text(data = docs_bigrams_count[1:10,], aes(x = bigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine bigram coverage
summary(docs_bigrams_count)
##     bigram                n          
##  Length:142707      Min.   :  1.000  
##  Class :character   1st Qu.:  1.000  
##  Mode  :character   Median :  4.000  
##                     Mean   :  7.779  
##                     3rd Qu.: 12.000  
##                     Max.   :453.000
tot_bg <- sum(docs_bigrams_count$n)

tot_bg_50 <- as.data.frame(docs_bigrams_count[1:30000,])
tot_bg_50 <- tot_bg_50 %>%
  mutate(cumulativebg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 30000))

tot_bg_90 <- as.data.frame(docs_bigrams_count[1:100000,])
tot_bg_90 <- tot_bg_90 %>%
  mutate(cumulativebg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 100000))

g1 <- ggplot(data=tot_bg_50, aes(x=index, y=cumulativebg)) 
g2 <- g1 + labs(x="Number of unique bigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_bg, col="blue")

g5 <- ggplot(data=tot_bg_90, aes(x=index, y=cumulativebg)) 
g6 <- g5 + labs(x="Number of unique bigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_bg, col="red")

grid.arrange(g4, g8, ncol = 2)

Trigram Analysis

Build a trigram
docs_trigrams <- docs_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
docs_trigrams_count <- docs_trigrams %>%
  count(trigram, sort = TRUE)
head(docs_trigrams_count, 10)
##                       trigram   n
## 1               boy big sword 126
## 2              little boy big 126
## 3               new york city 117
## 4       gaston south carolina 110
## 5  south carolina attractions 110
## 6              love toast mom  92
## 7                id love tell  87
## 8     advertising people good  70
## 9                   pu bef th  66
## 10           happy little boy  60
Generate the Trigram word cloud
wordcloud(words=docs_trigrams_count$trigram, freq=docs_trigrams_count$n, min.freq = 45,
          max.words=100,random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(2.5,0.005))

Generate the Trigram Bar Plot
g1 <- ggplot(data=docs_trigrams_count[1:10,], aes(x = trigram, y = n, fill=trigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Trigrams")
g3 <- g2 + geom_text(data = docs_trigrams_count[1:10,], aes(x = trigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine trigram coverage
summary(docs_trigrams_count)
##    trigram                n          
##  Length:189229      Min.   :  1.000  
##  Class :character   1st Qu.:  1.000  
##  Mode  :character   Median :  2.000  
##                     Mean   :  5.742  
##                     3rd Qu.:  9.000  
##                     Max.   :126.000
tot_tg <- sum(docs_trigrams_count$n)

tot_tg_50 <- as.data.frame(docs_trigrams_count[1:35000,])
tot_tg_50 <- tot_tg_50 %>%
  mutate(cumulativetg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 35000))

tot_tg_90 <- as.data.frame(docs_trigrams_count[1:130000,])
tot_tg_90 <- tot_tg_90 %>%
  mutate(cumulativetg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 130000))

g1 <- ggplot(data=tot_tg_50, aes(x=index, y=cumulativetg)) 
g2 <- g1 + labs(x="Number of unique trigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_tg, col="blue")

g5 <- ggplot(data=tot_tg_90, aes(x=index, y=cumulativetg)) 
g6 <- g5 + labs(x="Number of unique trigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_tg, col="red")

grid.arrange(g4, g8, ncol = 2)

Summary

I conducted an exploratory analysis of the supplied text data. The goal was to develop an intuition for, and an understanding of, the data that would be used to develop a model for predicting the next word in a given context. Some of the issues analysed and researched were:
  • What are the distributions of word frequencies?
  • What are the frequencies of 2-grams and 3-grams in the dataset?
  • How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  • Ideas for modelling

Regarding ideas for modelling: I will start by removing inefficiencies from the code by storing the various intermediate objects in a SQL database rather than in R's memory, and by pushing the heavy data processing onto the SQL server; this should help the application run faster. I will then build a model based on a Markov chain. This is a natural fit because, in a Markov chain, each choice of word depends only on the previous word, and the length of a chain can, in theory, be unlimited. In other words, to predict the next word we are not constrained to a fixed n-gram length (1-, 2-, 3-, 4-, or 5-grams). A minimal sketch of this idea, using the bigram counts computed above, is given below.
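
The helper below is only an illustration of that lookup, not the final model. predict_next_word is a hypothetical name, and the sketch assumes the docs_bigrams_count table from the bigram analysis is still in memory.

library(dplyr)
library(tidyr)

# Split each bigram ("word1 word2") into its two component words
bigram_model <- docs_bigrams_count %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Hypothetical helper: the most frequent continuations of `word`
predict_next_word <- function(word, model = bigram_model, top_n = 3) {
  model %>%
    filter(word1 == tolower(word)) %>%   # keep bigrams starting with the word
    arrange(desc(n)) %>%                 # most frequent continuations first
    head(top_n) %>%
    pull(word2)
}

predict_next_word("new")  # e.g. "york", given the bigram counts above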