This Milestone Report will cover some EDA (Exploratory Data Analysis) on the SwiftKey text data from three different sources (blogs, twitter, and news feeds). The purpose of this EDA is to become familiar enough with the text data so that NLP (Natural Language Processing) models can be created and used in a Shiny App that will predict the next word typed by a user.
The plot below shows a prototype NLP model built with a Markov chain (from the markovchain package), derived from text sampled and extracted from the blogs portion of the SwiftKey data sets. In the predictive ngram Shiny App that I will eventually develop, I will use a model similar to this one, except much larger and hopefully more efficient. This is meant to show the results of many hours of research and to demonstrate a first step toward the feasibility of my plan. Also note that, due to the heavy computation involved, I read, tidied, and filtered the data and created most of the plots ahead of time, saved them to RDS files, and then read those files from my R Markdown document.
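A minimal sketch of that precompute-and-load workflow looks like this (the directory and file names here are just placeholders, not the exact ones I used):

# --- run once in a preprocessing script ---
blogs_lines_df <- data.frame(TXT_LINE = readLines("final/en_US/en_US.blogs.txt",
                                                  skipNul = TRUE),
                             stringsAsFactors = FALSE)
saveRDS(blogs_lines_df, file = "rds/blogs_lines_df.rds")

# --- later, inside the R Markdown document ---
blogs_lines_df <- readRDS("rds/blogs_lines_df.rds")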
After examining the plot below, please proceed in order through my report's EDA on words, bigrams, trigrams, and fourgrams. I then close with some final remarks in the CONCLUSION section.
Notice that the top 100 trigrams (combinations of 3 words) contain only 106 unique words. When this kind of reduction occurs, the graph looks more “connected”, which means the NLP model will have more choices to make regarding what the predicted next word will be. But if the model is built correctly, then each transition path will have a precomputed probability.
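As a rough illustration of that reduction, the distinct words spanned by the top 100 trigrams can be checked with a few lines (this reuses the trigrams_counts_blogs dataframe built later in this report, so it is only a sketch; the plot itself derives its states from the transition matrix):

# rough check of how many distinct words the top 100 trigrams span
top_trigrams_chk <- head(trigrams_counts_blogs, 100)
unique_words_chk <- unique(c(top_trigrams_chk$word1,
                             top_trigrams_chk$word2,
                             top_trigrams_chk$word3))
length(unique_words_chk)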
I begin the EDA of the SwiftKey data by plotting the density of words per line for each of the three data sets. The horizontal axis shows the word count per line on a log scale, and the vertical axis shows the proportion of the data with that word count per line. Notice that the shapes of the density plots differ significantly: the bulk of the blog data spans a much wider range of word counts, the twitter data builds up and then ends abruptly, and the news data is the most consistent, with a somewhat tighter Gaussian-like curve. Because these data sets comprise millions of lines of text, I sampled only 10% of each data set by default for most of my work, and, as you will see in the fourgram section, only 3% from each text there.
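A sketch of how such a density plot can be produced with ggplot2 (blogs_lines_df appears later in this report; twitter_lines_df and news_lines_df are assumed analogues, and the actual plots were precomputed and saved to RDS):

library(dplyr)
library(ggplot2)
library(stringi)

# combine the three sources and count the words in each line
all_lines_df <- dplyr::bind_rows(
    blogs_lines_df   %>% dplyr::mutate(src = "blogs"),
    twitter_lines_df %>% dplyr::mutate(src = "twitter"),
    news_lines_df    %>% dplyr::mutate(src = "news")) %>%
  dplyr::mutate(words_per_line = stringi::stri_count_words(TXT_LINE))

# density of words per line, one curve per source, log scale on the x axis
ggplot(all_lines_df, aes(x = words_per_line, fill = src)) +
  geom_density(alpha = 0.4) +
  scale_x_log10() +
  labs(x = "Words per line (log scale)", y = "Density", fill = "Source")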
Using 100% of the data from each of the sources shown above, I cleaned the data by removing punctuation and all stop words. Here we can see the most common words from each source and the unique word count required to achieve a given coverage of each source text. To answer the question, “How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?”, the lower plot shows that across the three text sets the answer averages out to roughly 1,600 and 20,000 words, respectively, and that the number grows non-linearly as the required COVERAGE increases. Note that although I used 100% of the text here, I filtered out stop words, so my results may differ from those of other data scientists.
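The coverage numbers come from a cumulative sum over a frequency-sorted word table; here is a sketch of that calculation (words_tidy_blogs, a one-word-per-row dataframe from unnest_tokens(), is an assumed intermediate):

library(dplyr)
library(tidytext)

word_counts_blogs <- words_tidy_blogs %>%
  dplyr::anti_join(tidytext::stop_words, by = "word") %>%  # drop stop words
  dplyr::count(word, sort = TRUE) %>%                      # frequency-sorted dictionary
  dplyr::mutate(coverage = cumsum(n) / sum(n))             # cumulative coverage

# unique words needed to reach 50% and 90% coverage of all word instances
sum(word_counts_blogs$coverage < 0.5) + 1
sum(word_counts_blogs$coverage < 0.9) + 1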
In this section, I show the results of tokenizing each text into bigrams (combinations of two words). Each source text is sampled at 10%, which allows us to examine approximately 89 thousand, 236 thousand, and 101 thousand lines for blogs, twitter, and news sources, respectively.
As before, punctuation and all stop words were removed. Here we can see the most common bigrams from each source and the unique bigram count required to achieve a desired coverage of each sampled source text. Notice that the required COVERAGE curve is becoming more linear.
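The bigram step mirrors the trigram code shown in the next section; this is a sketch for the blogs sample (blogs_lines_df with a TXT_LINE column is the same dataframe used there):

library(dplyr)
library(tidyr)
library(tidytext)

bigrams_tidy_words_blogs <- blogs_lines_df %>%
  dplyr::sample_frac(0.1) %>%
  tidytext::unnest_tokens(output = bigram, input = TXT_LINE, token = "ngrams", n = 2) %>%
  tidyr::separate(col = bigram, into = c("word1", "word2"), sep = " ") %>%
  dplyr::filter(!word1 %in% tidytext::stop_words$word,   # remove all stop words,
                !word2 %in% tidytext::stop_words$word)   # as described above

bigrams_counts_blogs <- bigrams_tidy_words_blogs %>%
  dplyr::count(word1, word2, sort = TRUE)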
In this section, I show the results of tokenizing each text into trigrams (combinations of three words). Each source text is sampled at 10%, the same portions as the previous bigram data. Note that this takes much longer than tokenizing by single words or bigrams. However, I chose to stay with a 10% sample size instead of carrying out iterations of smaller sample sizes, in order to keep this script simple.
The main difference between these trigrams and the previous bigrams is that I used a form of feature engineering: I allowed a maximum of 2 stop words per trigram for both the blogs and twitter texts, while for the news text no stop words were allowed at all. This is because of the differing style of sentences in each data set. News is more formal, and I wanted to retain the quality of recording more place names and formal references, whereas both blogs and twitter are less formal. For this section I have provided some code snippets that show my process of filtering, sampling, and cleaning the data. I relied upon the tidytext package because it allowed me to unnest the tokens by trigram and operate on the dataframe at a very granular level.
# ============================================================================
# Filtering, Subsetting, and Cleaning the trigrams ...
# ============================================================================
trigrams_tidy_words_blogs <- blogs_lines_df %>%
  dplyr::filter( stringi::stri_count_words(TXT_LINE) >= 3 ) %>%
  dplyr::sample_frac(0.1) %>%
  tidytext::unnest_tokens(output = trigram, input = TXT_LINE, token = "ngrams", n = 3)

# split each trigram into its three component words
trigrams_tidy_words_blogs <- trigrams_tidy_words_blogs %>%
  tidyr::separate(col = trigram, into = c("word1", "word2", "word3"), sep = " ")

# create a column called stop_word_count holding the number of stop words in each trigram
word1_stop_word <- ifelse(trigrams_tidy_words_blogs$word1 %in% tidytext::stop_words$word, 1, 0)
word2_stop_word <- ifelse(trigrams_tidy_words_blogs$word2 %in% tidytext::stop_words$word, 1, 0)
word3_stop_word <- ifelse(trigrams_tidy_words_blogs$word3 %in% tidytext::stop_words$word, 1, 0)
trigrams_tidy_words_blogs$stop_word_count <- word1_stop_word + word2_stop_word + word3_stop_word

# filter out rows with more than 2 stop words so that our trigrams are higher quality
trigrams_tidy_words_blogs <- trigrams_tidy_words_blogs %>%
  dplyr::filter(stop_word_count <= 2)

# count each distinct trigram, tag the source, and convert counts to relative frequencies
trigrams_counts_blogs <- trigrams_tidy_words_blogs %>%
  dplyr::count(word1, word2, word3, sort = TRUE)
trigrams_counts_blogs <- trigrams_counts_blogs %>%
  dplyr::mutate(src = "blogs") %>%
  dplyr::mutate(freq = n / nrow(trigrams_tidy_words_blogs))
Lastly, I completed the EDA by tokenizing each text into fourgrams (combinations of four words). Each source text is sampled at only 3% because the processing time increases significantly.
The fourgrams were filtered in this manner: each of the blogs, twitter, and news text sources was allowed a maximum of only 2 stop words per fourgram, and no two adjacent words could be equal. This was done to capture the highest-quality phrases.
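A sketch of those two filtering rules (fourgrams_tidy_words_blogs with columns word1 through word4 is assumed to be built the same way as the trigrams above):

library(dplyr)
library(tidytext)

# count the stop words in each fourgram (logicals coerce to 0/1 when summed)
stop_word_count_4 <-
  (fourgrams_tidy_words_blogs$word1 %in% tidytext::stop_words$word) +
  (fourgrams_tidy_words_blogs$word2 %in% tidytext::stop_words$word) +
  (fourgrams_tidy_words_blogs$word3 %in% tidytext::stop_words$word) +
  (fourgrams_tidy_words_blogs$word4 %in% tidytext::stop_words$word)

# keep fourgrams with at most 2 stop words and no two equal adjacent words
fourgrams_tidy_words_blogs <- fourgrams_tidy_words_blogs %>%
  dplyr::mutate(stop_word_count = stop_word_count_4) %>%
  dplyr::filter(stop_word_count <= 2,
                word1 != word2, word2 != word3, word3 != word4)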
I will conclude this report by pointing out that NLP is computationally intensive and different types of text require slightly different filtering.
My strategy will be to preprocess as much data as possible, not treating all data sets equally. After preprocessing is done, I will combine the most common ngrams from the different sources into a larger unified set and then subset that down to fewer than a million distinct paths. The result can then be used to create a Markov chain of possible paths, with a probability associated with each transition from one word to the next word(s). The flaw in my data visualization is that I could not show the probability of each word (state) transition; even though it is in my data, I could not figure out how to label the paths.
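A sketch of that combine-and-normalize step for trigrams (trigrams_counts_twitter and trigrams_counts_news are assumed analogues of the blogs dataframe built above, and the one-million cutoff is illustrative):

library(dplyr)

# pool the per-source trigram counts into one unified table, then keep the most common paths
unified_trigrams <- dplyr::bind_rows(trigrams_counts_blogs,
                                     trigrams_counts_twitter,
                                     trigrams_counts_news) %>%
  dplyr::group_by(word1, word2, word3) %>%
  dplyr::summarise(n = sum(n), .groups = "drop") %>%
  dplyr::arrange(dplyr::desc(n)) %>%
  head(1000000)

# probability of word3 given the (word1, word2) state: counts normalized per state
transition_probs <- unified_trigrams %>%
  dplyr::group_by(word1, word2) %>%
  dplyr::mutate(prob = n / sum(n)) %>%
  dplyr::ungroup()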
I close the conclusion by using a larger subset of the trigram blog data, this time with 200 trigrams. Processing 200 trigrams took more than twice as long as processing 100 trigrams. The plot gets much busier than before, but it is fun to look at.
This is the code snippet that created the plot above. As you can see, I subsetted from a dataframe called trigrams_counts_blogs. The subset was then refactored into a transition matrix, which was used to create a markovchain object. The markovchain object was then converted into an igraph object, passed into ggraph, and finally plotted.
# create a test set of trigrams
trigram_test_set_L <- head(trigrams_counts_blogs, 200) %>%
  dplyr::select(c(word1, word2, word3, n))

# build the second-order transition matrix and extract its unique states
blogs_transition_matrix_L <- transition_matrix_second_order(trigram_test_set_L, word1, word2, word3, n)
unique_states_L <- get_unique_states_from_matrix(blogs_transition_matrix_L)

# create the markovchain object
mcWordStateTransitions_L <- new("markovchain",
                                states = unique_states_L,
                                transitionMatrix = blogs_transition_matrix_L,
                                name = "wordstates_L")

# convert the markovchain object to an igraph object
mcIgraph_L <- as(mcWordStateTransitions_L, "igraph")

# PLOT THE igraph USING GGRAPH
mcGraphLayout_L <- ggraph::create_layout(mcIgraph_L, layout = "fr")
circular_graph_plot_L <- ggraph::ggraph(mcGraphLayout_L, layout = 'fr') +
  ggraph::geom_edge_bend(alpha = 0.5, strength = 0.3,
                         arrow = arrow(length = unit(4, 'mm')),
                         end_cap = circle(2, 'mm')) +
  ggraph::geom_node_point(shape = 21, size = 6,
                          color = "black", fill = "cyan",
                          alpha = 0.5) +
  ggraph::geom_node_label(aes(label = paste0(name)),
                          color = "black", fill = "white",
                          alpha = 0.8, repel = TRUE) +
  labs(title = paste0("Markov Chain Word Transitions Used for Predicting Next Word"),
       subtitle = paste0("Example: based off en_US.blogs.txt data set, top ",
                         nrow(trigram_test_set_L), " Trigrams. ",
                         "This shows all possible paths in a subset of data."),
       caption = paste0("Unique Words(states): ", length(unique_states_L))) +
  theme_networkMap
circular_graph_plot_L