Introduction

This report presents progress on my model, which uses Natural Language Processing to predict the next word in a phrase. The model relies heavily on the tidytext package, a convenient and efficient R package for text mining that can create tokens and n-grams.

The steps below show the building of the model, including data preparation, exploratory analysis, and plans for developing a prediction algorithm and Shiny app.

Data Processing

First, I load the files containing texts from English-language blogs, news, and Twitter. I sample 20% of each source and combine the samples into a single training corpus.

# Load libraries
library(readr)
library(dplyr)
library(tidytext)
library(ggplot2)
library(tidyr)
library(igraph)
library(ggraph)

# Load stop words
data(stop_words)

# Get file names
enfileblogs <- "en_US.blogs.txt"
enfilenews <- "en_US.news.txt"
enfiletwitter <- "en_US.twitter.txt"

# Read files
enblogs <- read_lines(enfileblogs)
ennews <- read_lines(enfilenews)
entwitter <- read_lines(enfiletwitter)

# Take samples
samplesize = 0.2

set.seed(2243)
blogsample <- sample(enblogs, samplesize*length(enblogs))
newssample <- sample(ennews, samplesize*length(ennews))
twittersample <- sample(entwitter, samplesize*length(entwitter))

# Convert samples to data frames
blog_df <- data_frame(line = 1:length(blogsample), text=blogsample)
news_df <- data_frame(line = 1:length(newssample), text=newssample)
twitter_df <- data_frame(line = 1:length(twittersample), text=twittersample)

Line Counts

The training set has the following number of lines from each source:

  • Blog sample: 179857 lines
  • News sample: 202048 lines
  • Twitter sample: 472029 lines
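
These counts come directly from the sampled character vectors created above; a minimal check is:

# Count the number of lines in each sample
length(blogsample)
length(newssample)
length(twittersample)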

Tokenization and Frequency Counting

Next, I combine the data frames from the three sources into one, convert the combined training corpus to tokens, and count the most frequent tokens. Tokenization strips the corpus of all punctuation and converts all words to lowercase. Because stop words appear very frequently throughout the corpus (as they do in English generally), I also produce counts with the stop words filtered out. I then visualize the most frequent tokens, both with and without stop words.

# Convert data frames of samples to tokens
blogtoken <- blog_df %>% unnest_tokens(word, text)
newstoken <- news_df %>% unnest_tokens(word, text)
twittertoken <- twitter_df %>% unnest_tokens(word, text)

sample_df <- rbind(blog_df, news_df, twitter_df)
sampletoken <- sample_df %>% unnest_tokens(word, text)

# Count most common tokens / words
commontokens <- sampletoken %>% count(word, sort = TRUE)
commontokens_nostopwords <- commontokens %>% anti_join(stop_words)
## Joining, by = "word"
# Visualize most common tokens (with stop words)
commontokens %>% 
    filter(n > 200000) %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) +
    geom_col() +
    ggtitle("Most Common Tokens (Stop Words Included)") +
    xlab("Word") +
    ylab("Count") +
    coord_flip()

# Visualize most common tokens (without stop words)
commontokens_nostopwords %>% 
    filter(n > 15000) %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) +
    geom_col() +
    ggtitle("Most Common Tokens (Without Stop Words)") +
    xlab("Word") +
    ylab("Count") +
    coord_flip()

Analyzing term frequency

In addition to the raw number of times each word appears in the corpus, a perhaps more useful measure is term frequency (tf): the number of times a token appears divided by the total number of tokens in the text. My training set is only 20% of the original corpus, so the raw word counts are small compared to the whole corpus. Dividing each word count by the total number of words in the text gives a ratio that I can compare across different sample sizes.

# Add the term frequency (tf) to the token counts
totalwords <- commontokens_nostopwords %>%
    summarize(total=sum(n))
totalwords <- as.integer(totalwords)
totalwords <- rep(totalwords, dim(commontokens_nostopwords)[1])

commontokens_nostopwords <- cbind(commontokens_nostopwords, totalwords)
commontokens_nostopwords <- commontokens_nostopwords %>% mutate(tf=n/totalwords)
head(commontokens_nostopwords, 25)
##      word     n totalwords          tf
## 1    time 45293    8689727 0.005212247
## 2     day 35321    8689727 0.004064685
## 3    love 32605    8689727 0.003752132
## 4  people 32210    8689727 0.003706676
## 5       2 21399    8689727 0.002462563
## 6       3 20720    8689727 0.002384425
## 7       1 18725    8689727 0.002154843
## 8    life 18320    8689727 0.002108237
## 9      rt 17866    8689727 0.002055991
## 10   home 16777    8689727 0.001930671
## 11   week 15542    8689727 0.001788549
## 12  night 15500    8689727 0.001783715
## 13   game 14994    8689727 0.001725486
## 14 school 14952    8689727 0.001720652
## 15    lol 14450    8689727 0.001662883
## 16  happy 13600    8689727 0.001565066
## 17  world 13151    8689727 0.001513396
## 18      4 13030    8689727 0.001499472
## 19   feel 11725    8689727 0.001349294
## 20   city 11711    8689727 0.001347683
## 21 follow 11572    8689727 0.001331687
## 22     10 11553    8689727 0.001329501
## 23   hope 11464    8689727 0.001319259
## 24      5 11111    8689727 0.001278636
## 25    lot 10778    8689727 0.001240315
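
As an aside, the same term frequency column could be computed more concisely with dplyr alone; the sketch below is equivalent to the cbind approach above (the column names total and tf are my own choices):

# Equivalent, more concise computation of term frequency
commontokens %>%
    anti_join(stop_words, by = "word") %>%
    mutate(total = sum(n),
           tf = n / total)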

Generating bigrams

Now that I have created a list of tokens in the combined training set, I move on to create a data frame of bigrams (i.e., pairs of words that appear consecutively). Bigrams will be useful in the prediction algorithm: the term frequency of each bigram in the training set determines the list of possible second words and their probabilities given the first word. Because stop words (such as “the”, “an”, “that”, “of”) are much more frequent than other words, they would dominate the list of bigrams, so they are similarly excluded from this analysis.

# Generate bigrams
bigrams <- sample_df %>%
    unnest_tokens(bigram, text, token="ngrams", n = 2)

# Get the counts with stop words included
bigrams <- bigrams %>% count(bigram, sort=TRUE)

# Separate the bigram into two words
bigrams_separated <- bigrams %>% 
    separate(bigram, c("word1", "word2"), sep=" ")

# Remove stop words from the bigrams
bigrams_filtered <- bigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word)

# Find the total number of bigrams
totalbigrams <- bigrams_filtered %>%
    summarize(total=sum(n))
totalbigrams <- as.integer(totalbigrams)
totalbigrams <- rep(totalbigrams, dim(bigrams_filtered)[1])

# Find the term frequency
bigrams <- cbind(bigrams_filtered, totalbigrams)
bigrams <- bigrams %>% mutate(tf=n/totalbigrams)
head(bigrams)
##    word1     word2    n totalbigrams           tf
## 1     st     louis 1930      3426965 0.0005631805
## 2  happy  birthday 1802      3426965 0.0005258297
## 3      1         2 1701      3426965 0.0004963576
## 4    los   angeles 1385      3426965 0.0004041477
## 5 social     media 1256      3426965 0.0003665051
## 6    san francisco 1245      3426965 0.0003632952
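
The prediction algorithm will ultimately need the probability of a second word given the first, rather than the overall bigram frequency. A minimal sketch of that calculation from the filtered counts above (the object name bigram_prob is my own) is:

# Sketch: estimate P(word2 | word1) from the bigram counts
bigram_prob <- bigrams %>%
    group_by(word1) %>%
    mutate(p_word2_given_word1 = n / sum(n)) %>%
    ungroup() %>%
    arrange(word1, desc(p_word2_given_word1))

# For example, the most likely words to follow "happy" in the training sample
bigram_prob %>% filter(word1 == "happy") %>% head()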

Visualizing the bigrams

It might be interesting to visualize the relationships between adjacent words that form bigrams. Rather than focusing on the most common bigrams, my goal here is to understand how many different words follow a common word (excluding stop words) and how many of those words are in turn followed by other words.

Converting the bigrams into a graph (with words as nodes and edges connecting adjacent words) helps to visualize this relationship. The network makes it easy to see first words that are followed by many possible second words (which lowers the probability of each individual prediction) and first words with only a few possible second words.

# Generate an igraph object
# Filter out by the most common bigrams

bigram_graph <- bigrams %>%
    filter (n > 350) %>%
    graph_from_data_frame()

# Generate the network graph using ggraph
set.seed(3226)
arrow <- grid::arrow(type="closed", length=unit(.15, "inches"))
ggraph(bigram_graph, layout="fr") +
    geom_edge_link(aes(edge_alpha=n), show.legend = FALSE, arrow=arrow) +
    geom_node_point(color="blue", size=2) +
    geom_node_text(aes(label=name), vjust=1, hjust=1) +
    theme_void()

Generating trigrams

It might be useful to use a trigram model (i.e., three words that appear consecutively) in addition to bigrams. Predicting a word from the two preceding words rather than just one might be more accurate, as long as the context of the test set is similar to the training set. In the prediction model, I will try to use trigrams to predict the third word; only if a particular trigram does not appear in the training set will the model fall back to probabilities based on bigrams.

trigrams <- sample_df %>%
    unnest_tokens(trigram, text, token="ngrams", n=3) %>%
    separate(trigram, c("word1", "word2", "word3"), sep=" ") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word,
           !word3 %in% stop_words$word) %>%
    count(word1, word2, word3, sort=TRUE)

totaltrigrams <- trigrams %>%
    summarize(total=sum(n))
totaltrigrams <- as.integer(totaltrigrams)
totaltrigrams <- rep(totaltrigrams, dim(trigrams)[1])

# Find the term frequency
trigrams <- cbind(trigrams, totaltrigrams)
trigrams <- trigrams %>% mutate(tf=n/totaltrigrams)
head(trigrams)
##       word1    word2  word3   n totaltrigrams           tf
## 1     happy  mothers    day 386       1438321 0.0002683685
## 2     happy mother's    day 348       1438321 0.0002419488
## 3 president   barack  obama 297       1438321 0.0002064908
## 4     cinco       de   mayo 251       1438321 0.0001745090
## 5     world      war     ii 233       1438321 0.0001619944
## 6        st    louis county 216       1438321 0.0001501751

Findings

  • The training set should not be too large (otherwise processing becomes too slow). I had originally built the training set from 60% of the data and have reduced it to 20%. While the absolute token counts in the sample decreased significantly, the relative frequencies (and ranks) are similar. Nevertheless, the downsides of a smaller sample are a bias toward the more common terms (i.e., common words having a higher frequency in the sample than in the full corpus) and reduced representation of less common words (i.e., the sample not containing them at all, or containing them at a lower frequency than in the corpus).
  • The count and frequency of stop words significantly outweigh those of other words in the text, which may obscure the probability that any two particular words appear near each other in the corpus. For now, the model removes stop words. I will review how the list of stop words should be determined and whether it is appropriate to remove them from the model.
  • The bigram list shows the count and term frequency of each pair of words, and the network graph visualizes the strength of the relationships between them. Some pairs are especially frequent (for example, “happy birthday”, “st louis”, “1 2”), and for those the model should be more accurate.
  • The trigram list was extremely slow to generate. I will have to evaluate whether to use other methods to generate the trigrams.

Future plans

Combining the corpus, removing stop words, and creating tokens, bigrams, and trigrams (along with their counts and term frequencies) are the first steps toward a successful prediction model.

The term frequencies of trigrams and bigrams will determine the predicted word (via a back-off model). However, I will have to account for phrases that do not appear in the training set and for cases where the term frequency is low across many possible words.
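
To make the back-off idea concrete, here is a minimal sketch of the lookup, assuming the input phrase has already been cleaned and tokenized and using the trigrams and bigrams data frames built above (the function name predict_next_word and its arguments are my own):

# Sketch of a simple back-off lookup: try the trigram table first,
# then fall back to the bigram table if no matching trigram is found.
# Assumes trigrams has columns word1, word2, word3, n and
# bigrams has columns word1, word2, n, as created above.
predict_next_word <- function(w1, w2, trigrams, bigrams, n_best = 3) {
    # Candidate third words from trigrams starting with the two given words
    tri_match <- trigrams %>%
        filter(word1 == w1, word2 == w2) %>%
        arrange(desc(n)) %>%
        head(n_best)
    if (nrow(tri_match) > 0) {
        return(tri_match$word3)
    }
    # Back off: candidate second words from bigrams starting with the last word
    bi_match <- bigrams %>%
        filter(word1 == w2) %>%
        arrange(desc(n)) %>%
        head(n_best)
    bi_match$word2
}

# Example call: predict_next_word("happy", "mothers", trigrams, bigrams)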