Summary

This report covers the Exploratory analysis of the data for the capstone project: a predictive text app using Shiny. The primary finding is that not all of the data will be needed to provide coverage. There is an exponential relationship between the amount of frequency coverage and the number of unique words.

Read Data into R

library("readr")
library("ggplot2")
library("dplyr")
library("quanteda")
library("ngram")

I will begin by reading the txt files into R, using readr’s read_lines(). This was the easiest function, with no errors, after trying several others.

blog <- read_lines("final/en_US/en_US.blogs.txt")
news <- read_lines("final/en_US/en_US.news.txt")
twitter <- read_lines("final/en_US/en_US.twitter.txt")

Below we can see the basic stats of these files.

stats <- data.frame(source=c("Blog","News","Twitter"),
           file_size = c(file.size("final/en_US/en_US.blogs.txt")/1024^2,
                        file.size("final/en_US/en_US.news.txt")/1024^2,
                        file.size("final/en_US/en_US.twitter.txt")/1024^2),
           word_count = sapply(list(blog, news, twitter), wordcount),
           line_count = c(length(blog), length(news), length(twitter)),
           char_count = c(sum(nchar(blog)), sum(nchar(news)), sum(nchar(twitter))))
names(stats) <- c("Source", "File Size (MB)", "Word Count", "Line Count", "Character Count")
stats
##    Source File Size (MB) Word Count Line Count Character Count
## 1    Blog       200.4242   37334131     899288       206824505
## 2    News       196.2775   34372530    1010242       203223159
## 3 Twitter       159.3641   30373543    2360148       162096031

I will then sample 10% of these data as a sufficiently large dataset to explore.

set.seed(123)
sample_blog <- blog[sample(length(blog), length(blog)*0.1)]
sample_news <- news[sample(length(news), length(news)*0.1)]
sample_twitter <- twitter[sample(length(twitter), length(twitter)*0.1)]

Using the quanteda package, I create a corpus for each dataset, and include a “source” value to indicate where that text came from. Then it is combined into a single corpus.

blog_corpus <- corpus(sample_blog)
docvars(blog_corpus, "Source") <- "blog"

news_corpus <- corpus(sample_news)
docvars(news_corpus, "Source") <- "news"

twitter_corpus <- corpus(sample_twitter)
docvars(twitter_corpus, "Source") <- "twitter"

combined_corpus <- blog_corpus + news_corpus + twitter_corpus

Finally, the datasets are all removed, except for the combined corpus, to help clear some RAM space.

rm(stats, blog, news, twitter, 
   sample_blog, sample_news, sample_twitter, 
   blog_corpus, news_corpus, twitter_corpus)

Clean Datasets

To clean the data, I create a token set. I have removed numbers, punctuation, and stopwords. I took the stem of each word (running & runs are both changed into “run” and analyzed together). and everything is changed to lowercase.

token <- tokens(combined_corpus, remove_numbers = TRUE, remove_punct = TRUE)
token <- tokens_select(token, stopwords("english"), selection = "remove")
token <- tokens_wordstem(token)
token <- tokens_tolower(token)

Finally, I created two Document Feature Matrices (a matrix of each unique word as a column, and each text entry as a row). The first is the basic DFM. The second is a DFM grouped by source (twitter/news/blog).

dfm <- dfm(token)
dfm_source <- dfm(token, groups = "Source")

Explore Data Structure

I will start with high level understanding of the data. The following code shows the overall structure of the text.

dfm
## Document-feature matrix of: 426,966 documents, 171,866 features (>99.99% sparse).
dfm_source
## Document-feature matrix of: 3 documents, 171,866 features (53.6% sparse).
dfm_source[,1:10]
## Document-feature matrix of: 3 documents, 10 features (13.3% sparse).
## 3 x 10 sparse Matrix of class "dfm"
##          features
## docs        bruschetta howev miss mark instead manag two-bit crostini huge
##   blog    1          4  1523  996  616    1035  1011       3        1  644
##   news    0          4   827  970  857     709  1735       1        4  390
##   twitter 0          2   115 3040  347     477   457       0        0  539

Below, we can see the top features of these data. The most common words by themselves, as well as grouped by source. A word cloud is also shown below (the more common a word, the larger it will appear.)

topfeatures(dfm)
##   one  said  just  like   get    go  time   can   day  year 
## 30791 30650 30496 30041 30034 26520 26125 24659 21947 21624
dfm_sort(dfm_source)[,1:10]
## Document-feature matrix of: 3 documents, 10 features (0.0% sparse).
## 3 x 10 sparse Matrix of class "dfm"
##          features
## docs        one  said  just  like   get    go  time  can   day  year
##   blog    13388  3656 10039 10972  9430  8137 10751 9789  7022  6660
##   news     8723 25202  5423  6000  6029  5522  6729 5942  4227 10725
##   twitter  8680  1792 15034 13069 14575 12861  8645 8928 10698  4239
textplot_wordcloud(dfm, color = rev(RColorBrewer::brewer.pal(5, "RdYlBu")))

Exploratory Analysis

There were several questions posed to consider with respect to this data. I will go over each of them below.

QUESTION 1: Some words are more frequent than others - what are the distributions of word frequencies?

A bar graph of the top frequencies is shown below.

data.frame("names" = names(topfeatures(dfm, n=40)), "count" = unname(topfeatures(dfm, n=40))) %>%
      ggplot(aes(x=reorder(names, -count), y=count)) + geom_col() +
      labs(x = "", y = "Freqeuncy", title = "Frequency of the Most Common Words") +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

QUESTION 2: What are the frequencies of 2-grams and 3-grams in the dataset?

N-grams are basically word phrases, so a 2-gram is two words back-to-back. The function below takes a value for n and plots a bargraph showing the frequencies of the top 40 “n”-grams.

ngram <- function(n=1){
      ngram_tok <- tokens_ngrams(token, n)
      ngram_dfm <- dfm(ngram_tok)
      ngram <- data.frame(names = colnames(ngram_dfm), frequency = colSums(ngram_dfm))
            ngram$names <- as.character(ngram$names)
            rownames(ngram) <- c()
            ngram <- arrange(ngram, desc(frequency))[1:40,]
      ggplot(ngram, aes(x=reorder(names, -frequency), y=frequency)) + geom_col() +
            labs(x = "", y = "Frequency", title = paste0("Frequency of ",n,"-gram Sets")) +
            theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
}

Then we can plot this data below. Note that I already have a “1-gram” plot above, showing the frequency of the most common single words.

ngram(2)

ngram(3)

QUESTION 3: How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Below is a function to calculate the number of unique words required

uniqueWords <- function(coverage) {
      data <- data.frame(names = colnames(dfm), frequency = colSums(dfm))
            data$names <- as.character(data$names)
            rownames(data) <- c()
      x <- 0
      for(i in 1:nrow(data)) {
            x <- x + data$frequency[i]
            if (x >= sum(data$frequency)*coverage){ return(i) }
      }
      return(i)
}

fifty_percent <- uniqueWords(0.5)
fifty_percentage <- round(fifty_percent/ncol(dfm)*100,1)
ninety_percent <- uniqueWords(0.9)
ninety_percentage <- round(ninety_percent/ncol(dfm)*100,1)

You need 1451 words (~0.8%) to cover 50%.
You need 17450 words (~10.2%) to cover 90%.

QUESTION 4: How do you evaluate how many of the words come from foreign languages?

It should not matter. If enough American-English speakers type “Buenos dias”, then it effectively becomes a part of american english speech, and should be used in predicting text. Just because words may be from another language should not preclude them from the analysis.

Also, modern languages all borrow words from foreign languages, and it is difficult if not impossible to sort through all of that.

However, we could use the UTF symbols - if a word contains a certain type of symbol that is not included in en_US, the word is likely not part of the American English language.

QUESTION 5: Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

This is an extension of question 3, in which we look at the frequency-based coverage. We can extend that discussion by looking at a graph of that data:

coverage <- c(seq(0, 0.9, 0.1),0.95, 0.99, 1)
unique_words <- c(1:length(coverage))
for (i in 1:length(coverage)){
      unique_words[i] <- uniqueWords(coverage[i])
}
ggplot(data.frame(coverage = coverage, unique_words = unique_words), aes(x=coverage, y=unique_words)) +
      geom_line() +
      labs(x = "Coverage Percentage", y = "Number of Unique Words Required", title = "Number of Unique Words vs. Coverage")

There is clearly an exponential increase in the number of unique words required. I will need to determine the proper amount of words to use to keep the app small, while still covering enough of the language, whether that’s 90%, 95% or some other value.

Key Understandings From This Analysis

The main take away this analysis has shown is that I need to be selective on the amount of data I utilize in my final app. The data can get to several gigs in size, and that won’t be acceptable to use. A smaller subset of data should be sufficient for a relatively large amount of coverage.

Modeling: Next Steps

The next step will involve creating a predictive model, and then creating a shiny app to deploy the model. This will include the results from this exploratory analysis (i.e. utilizing 90% of the word freqeuncy for a much smaller dataset).