Introduction

This report presents an exploratory data analysis of the words, tokens, and phrases in the text. Three text files were used as the data source: a collection of tweets, a collection of blog posts, and a collection of news stories. The files were provided by the text prediction company SwiftKey and can be found here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

For some steps I used the package tm, and for others the package tidytext. To limit the volume of data for the exploratory analysis, I sampled 1% of the lines from each source so that my computer's memory could handle the data, and then created a single corpus from the three samples. After sampling, the data was cleaned; the main cleaning tasks were to remove punctuation, numbers, stop words, and profanity, and to convert the text to lowercase.

When I compare the bar graphs of the most frequent unigrams, bigrams, and trigrams, I see no apparent relationship between the most frequent unigrams and the most frequent bigrams, but there does appear to be a subtle relationship between the most frequent bigrams and the most frequent trigrams. I could have investigated this further had more memory been available on my computer.

The exploratory analysis showed that approximately 1,100 unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in US English, and roughly 10,000 words to cover 90%. If we are interested in bigrams to predict which word pairs occur together, we need about 20,000 and 75,000 bigrams to cover 50% and 90%, respectively, of all bigram instances in a frequency-sorted dictionary of bigrams. Extending this analysis, about 27,000 and 85,000 trigrams are needed to cover 50% and 90% of all trigram instances.
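
These coverage figures can be computed directly from a frequency-sorted n-gram table by accumulating the counts until the desired share of all instances is reached. The sketch below shows one way to do this; coverage_count is a hypothetical helper name, and it assumes a vector of frequencies sorted in decreasing order, such as the freq column of the unigram table d built later in this report.

# Hypothetical helper: how many of the most frequent terms are needed
# to cover a share `pct` of all term instances?
coverage_count <- function(freqs, pct) {
  cum_share <- cumsum(freqs) / sum(freqs)  # cumulative share of all instances
  which(cum_share >= pct)[1]               # first rank reaching the target share
}

# Usage, once the frequency tables below have been built:
# coverage_count(d$freq, 0.50)                # unique words for 50% coverage
# coverage_count(docs_bigrams_count$n, 0.90)  # unique bigrams for 90% coverage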

Load Libraries
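
The chunk that loads the libraries is not echoed in the rendered report. A plausible set of calls, inferred from the functions used below rather than taken from the original chunk, would be:

library(tm)            # corpus construction and cleaning (VCorpus, tm_map, TermDocumentMatrix)
library(tidytext)      # n-gram tokenisation (unnest_tokens)
library(dplyr)         # data manipulation (%>%, count, mutate)
library(wordcloud)     # word clouds
library(RColorBrewer)  # colour palettes (brewer.pal)
library(ggplot2)       # bar plots and coverage curves
library(gridExtra)     # arranging plots side by side (grid.arrange)

The stringi and stopwords packages are loaded explicitly with require() later in the report.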

Download data
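
The download chunk is likewise not echoed. A minimal sketch, assuming the SwiftKey URL from the introduction and the standard layout of the zip; the *_file paths defined here are the ones used in the next section:

zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/ directory
}

# US English source files inside the archive
blogs_file   <- "./final/en_US/en_US.blogs.txt"
news_file    <- "./final/en_US/en_US.news.txt"
twitter_file <- "./final/en_US/en_US.twitter.txt"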

Explore data

File size in MB

blogs_size   <- file.size(blogs_file) / (2^20)
news_size    <- file.size(news_file) / (2^20)
twitter_size <- file.size(twitter_file) / (2^20)
blogs_size
## [1] 200.4242
news_size
## [1] 196.2775
twitter_size
## [1] 159.3641

Read the data files

blogs   <- readLines(file(blogs_file,"rb"), encoding="UTF-8")
news    <- readLines(file(news_file,"rb"), encoding="UTF-8")
twitter <- readLines(file(twitter_file,"rb"), encoding="UTF-8") 
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines(file(twitter_file, "rb"), encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul

Number of lines per file

no_of_lines_blogs <- length(blogs)
no_of_lines_news <- length(news)
no_of_lines_twitter <- length(twitter)
no_of_lines_blogs
## [1] 899288
no_of_lines_news
## [1] 1010242
no_of_lines_twitter
## [1] 2360148

Number of words per file

require("stringi")
blogs_words_cnt <- stri_stats_latex(blogs)[4]
news_words_cnt <- stri_stats_latex(news)[4]
twitter_words_cnt <- stri_stats_latex(twitter)[4]
blogs_words_cnt
##    Words 
## 37570839
news_words_cnt
##    Words 
## 34494539
twitter_words_cnt
##    Words 
## 30451128

Number of characters per file

blogs_char_cnt <- stri_stats_general(blogs)[3]
news_char_cnt <- stri_stats_general(news)[3]
twitter_char_cnt <- stri_stats_general(twitter)[3]
blogs_char_cnt
##     Chars 
## 206824382
news_char_cnt
##     Chars 
## 203223154
twitter_char_cnt
##     Chars 
## 162096031

Longest line

blogs_lgth <- max(nchar(blogs))
news_lgth <- max(nchar(news))
twitter_lgth <- max(nchar(twitter))
blogs_lgth
## [1] 40833
news_lgth
## [1] 11384
twitter_lgth
## [1] 140

Sampling data

To make processing the data more efficient, a random sample of roughly 1% of the lines is taken from each source, written to new files, and combined into a single corpus.

# Draw a uniform random sample of 1% of the lines from each source
set.seed(1234)   # fix the seed so the sample is reproducible
blogs <- sample(blogs, round(length(blogs) * 0.01))
write.csv(blogs, file="./final/en_US/sample/blogs.csv", row.names=FALSE)

news <- sample(news, round(length(news) * 0.01))
write.csv(news, file="./final/en_US/sample/news.csv", row.names=FALSE)

twitter <- sample(twitter, round(length(twitter) * 0.01))
write.csv(twitter, file = "./final/en_US/sample/twitter.csv", row.names=FALSE)

# Combine the three samples into a single corpus (one document per sampled line)
docs <- VCorpus(VectorSource(c(blogs, news, twitter)))

inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 602

Cleaning data

require("stopwords")
## Loading required package: stopwords
## Warning: package 'stopwords' was built under R version 4.0.5
## 
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
## 
##     stopwords
# Remove punctuation
docs <- tm_map(docs, removePunctuation)

# Remove numbers
docs <- tm_map(docs, removeNumbers)

# Transform all alphabets to lowercase
docs <- tm_map(docs, content_transformer(tolower))

# Remove stop words
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove whitespaces
docs <- tm_map(docs, stripWhitespace)

# Remove profanity
if (!file.exists("./final/en_us/swearWords.txt"))
  download.file(
    url = "http://www.bannedwordlist.com/lists/swearWords.txt", 
    destfile = "./final/en_us/swearWords.txt", 
    method = "curl")

con <- file("./final/en_us/swearWords.txt")
profanity <- readLines(con, skipNul = TRUE)
## Warning in readLines(con, skipNul = TRUE): incomplete final line found on
## './final/en_us/swearWords.txt'
close(con)
docs <- tm_map(docs, removeWords, profanity)

# Flatten the cleaned corpus into a data frame so it can be tokenised with tidytext
docs_df <- data.frame(text=unlist(sapply(docs, `[`, "content")), 
                      stringsAsFactors=F)
write.csv(docs_df,'./final/en_us/cleandata.csv', row.names=FALSE)

Exploratory Data Analysis

Unigram Analysis

Build a term document matrix
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  2672997 142.8    7524294 401.9   9405367 502.4
## Vcells 10279421  78.5   71581585 546.2 101548417 774.8
memory.limit(size=64000)
## [1] 64000
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##      word freq
## said said 7091
## one   one 5777
## just just 5143
## get   get 4695
## like like 4694
## can   can 4374
## new   new 4002
## time time 3978
## now   now 3636
## day   day 3148
Generate the word cloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(3.5,0.25))

The word cloud above shows that the most common word in the clean dataset is ‘said’, followed by ‘one’, ‘just’, ‘get’, ‘like’, and so on. It is a pictorial representation of the sorted clean data.

Generate the Bar Plot
g1 <- ggplot(data=d[1:10,], aes(x = word, y = freq, fill=word))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Words")
g3 <- g2 + geom_text(data = d[1:10,], aes(x = word, y = freq, label = freq), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine unigram coverage
summary(d)
##      word                freq        
##  Length:23570       Min.   :   1.00  
##  Class :character   1st Qu.:   6.00  
##  Mode  :character   Median :  15.00  
##                     Mean   :  46.76  
##                     3rd Qu.:  32.00  
##                     Max.   :7091.00
tot_ug <- sum(d$freq)

tot_ug_50 <- as.data.frame(d[1:1500,])
tot_ug_50 <- tot_ug_50 %>%
  mutate(cumulative = cumsum(freq)) %>% 
  mutate(index = seq.int(1, 1500))

tot_ug_90 <- as.data.frame(d[1:20000,])
tot_ug_90 <- tot_ug_90 %>%
  mutate(cumulative = cumsum(freq)) %>% 
  mutate(index = seq.int(1, 20000))

g1 <- ggplot(data=tot_ug_50, aes(x=index, y=cumulative)) 
g2 <- g1 + labs(x="Number of unique words", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_ug, col="blue")

g5 <- ggplot(data=tot_ug_90, aes(x=index, y=cumulative)) 
g6 <- g5 + labs(x="Number of unique words", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_ug, col="red")

grid.arrange(g4, g8, ncol = 2)

Bigram Analysis

Build a bigram
docs_bigrams <- docs_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
docs_bigrams_count <- docs_bigrams %>%
  count(bigram, sort = TRUE)
head(docs_bigrams_count, 10)
##           bigram   n
## 1       new york 453
## 2      last year 443
## 3      years ago 348
## 4      right now 342
## 5     last night 324
## 6  mister rogers 315
## 7     little boy 279
## 8        can get 242
## 9      make sure 242
## 10   high school 239
Generate the Bigram word cloud
wordcloud(words=docs_bigrams_count$bigram, freq=docs_bigrams_count$n, min.freq = 100,
          max.words=100,random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(3.5,0.03))

Generate the Bigram Bar Plot
g1 <- ggplot(data=docs_bigrams_count[1:10,], aes(x = bigram, y = n, fill=bigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Bigrams")
g3 <- g2 + geom_text(data = docs_bigrams_count[1:10,], aes(x = bigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine bigram coverage
summary(docs_bigrams_count)
##     bigram                n          
##  Length:142707      Min.   :  1.000  
##  Class :character   1st Qu.:  1.000  
##  Mode  :character   Median :  4.000  
##                     Mean   :  7.779  
##                     3rd Qu.: 12.000  
##                     Max.   :453.000
tot_bg <- sum(docs_bigrams_count$n)

tot_bg_50 <- as.data.frame(docs_bigrams_count[1:30000,])
tot_bg_50 <- tot_bg_50 %>%
  mutate(cumulativebg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 30000))

tot_bg_90 <- as.data.frame(docs_bigrams_count[1:100000,])
tot_bg_90 <- tot_bg_90 %>%
  mutate(cumulativebg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 100000))

g1 <- ggplot(data=tot_bg_50, aes(x=index, y=cumulativebg)) 
g2 <- g1 + labs(x="Number of unique bigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_bg, col="blue")

g5 <- ggplot(data=tot_bg_90, aes(x=index, y=cumulativebg)) 
g6 <- g5 + labs(x="Number of unique bigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_bg, col="red")

grid.arrange(g4, g8, ncol = 2)

Trigram Analysis

Build a trigram
docs_trigrams <- docs_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
docs_trigrams_count <- docs_trigrams %>%
  count(trigram, sort = TRUE)
head(docs_trigrams_count, 10)
##                       trigram   n
## 1               boy big sword 126
## 2              little boy big 126
## 3               new york city 117
## 4       gaston south carolina 110
## 5  south carolina attractions 110
## 6              love toast mom  92
## 7                id love tell  87
## 8     advertising people good  70
## 9                   pu bef th  66
## 10           happy little boy  60
Generate the Trigram word cloud
wordcloud(words=docs_trigrams_count$trigram, freq=docs_trigrams_count$n, min.freq = 45,
          max.words=100,random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"), scale=c(2.5,0.005))

Generate the Trigram Bar Plot
g1 <- ggplot(data=docs_trigrams_count[1:10,], aes(x = trigram, y = n, fill=trigram))
g2 <- g1 + geom_bar(stat="identity") + coord_flip() + ggtitle("Frequent Trigrams")
g3 <- g2 + geom_text(data = docs_trigrams_count[1:10,], aes(x = trigram, y = n, label = n), hjust=-1, position = "identity")
g4 <- g3 + theme(legend.position="none")
g4

Examine trigram coverage
summary(docs_trigrams_count)
##    trigram                n          
##  Length:189229      Min.   :  1.000  
##  Class :character   1st Qu.:  1.000  
##  Mode  :character   Median :  2.000  
##                     Mean   :  5.742  
##                     3rd Qu.:  9.000  
##                     Max.   :126.000
tot_tg <- sum(docs_trigrams_count$n)

tot_tg_50 <- as.data.frame(docs_trigrams_count[1:35000,])
tot_tg_50 <- tot_tg_50 %>%
  mutate(cumulativetg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 35000))

tot_tg_90 <- as.data.frame(docs_trigrams_count[1:130000,])
tot_tg_90 <- tot_tg_90 %>%
  mutate(cumulativetg = cumsum(n)) %>% 
  mutate(index = seq.int(1, 130000))

g1 <- ggplot(data=tot_tg_50, aes(x=index, y=cumulativetg)) 
g2 <- g1 + labs(x="Number of unique trigrams", y="Instances in text", title="50% coverage")
g3 <- g2 + geom_line(color = "green")
g4 <- g3 + geom_hline(yintercept=0.5*tot_tg, col="blue")

g5 <- ggplot(data=tot_tg_90, aes(x=index, y=cumulativetg)) 
g6 <- g5 + labs(x="Number of unique trigrams", y="Instances in text", title="90% coverage")
g7 <- g6 + geom_line(color = "green")
g8 <- g7 + geom_hline(yintercept=0.9*tot_tg, col="red")

grid.arrange(g4, g8, ncol = 2)

Summary

I conducted an exploratory analysis of the supplied text data. The goal was to develop an intuition for, and an understanding of, the data that would be used to develop a model for predicting the next word in a given context. Some of the issues analysed and researched were:
  • What are the distributions of word frequencies?
  • What are the frequencies of 2-grams and 3-grams in the dataset?
  • How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  • Ideas for modelling

Regarding ideas for modelling: I will start by removing inefficiencies from the code by storing the various intermediate objects in a SQL database rather than in R's memory, and by pushing the heavy data processing onto the SQL server; this should help the application run faster. I will then build a model based on a Markov chain. This is a natural fit because, in a Markov chain, each choice of word depends only on the previous word, and the length of a chain can, in theory, be unlimited. In other words, to predict the next word we are not constrained to a fixed n-gram length (1-, 2-, 3-, 4-, or 5-grams). A minimal sketch of this idea, using the bigram counts computed above, is given below.
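
The helper below is only an illustration of that lookup, not the final model. predict_next_word is a hypothetical name, and the sketch assumes the docs_bigrams_count table from the bigram analysis is still in memory.

library(dplyr)
library(tidyr)

# Split each bigram ("word1 word2") into its two component words
bigram_model <- docs_bigrams_count %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Hypothetical helper: the most frequent continuations of `word`
predict_next_word <- function(word, model = bigram_model, top_n = 3) {
  model %>%
    filter(word1 == tolower(word)) %>%   # keep bigrams starting with the word
    arrange(desc(n)) %>%                 # most frequent continuations first
    head(top_n) %>%
    pull(word2)
}

predict_next_word("new")  # e.g. "york", given the bigram counts above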