This report describes progress on a model that uses Natural Language Processing to predict the next word in a phrase. The model relies heavily on the tidytext package, a convenient and efficient R library for text mining that can create tokens and n-grams.
The steps below show how the model is built, including data preparation, exploratory analysis, and plans for the prediction algorithm and Shiny app.
First, I load the files containing texts from English-language blogs, news, and Twitter. I combine the texts into one corpus and use 20% of the overall material as a training set.
# Load libraries
library(readr)
library(dplyr)
library(tidytext)
library(ggplot2)
library(tidyr)
library(igraph)
library(ggraph)
# Load stop words
data(stop_words)
# Get file names
enfileblogs <- "en_US.blogs.txt"
enfilenews <- "en_US.news.txt"
enfiletwitter <- "en_US.twitter.txt"
# Read files
enblogs <- read_lines(enfileblogs)
ennews <- read_lines(enfilenews)
entwitter <- read_lines(enfiletwitter)
# Take samples
samplesize = 0.2
set.seed(2243)
blogsample <- sample(enblogs, samplesize*length(enblogs))
newssample <- sample(ennews, samplesize*length(ennews))
twittersample <- sample(entwitter, samplesize*length(entwitter))
# Convert samples to data frames
blog_df <- data_frame(line = 1:length(blogsample), text=blogsample)
news_df <- data_frame(line = 1:length(newssample), text=newssample)
twitter_df <- data_frame(line = 1:length(twittersample), text=twittersample)
The training set has the following number of lines from each source: blog sample: 179,857; news sample: 202,048; Twitter sample: 472,029.
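For reference, these line counts come directly from the sampled vectors created above; a minimal check:
# Verify the number of lines in each sample
length(blogsample)
length(newssample)
length(twittersample)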
Next, I combine the data frames from the three sources into one, convert the training corpus to tokens, and count the most frequent tokens. Tokenization strips all punctuation and lowercases every word. Because stop words appear very frequently both in this corpus and in English generally, I filter them out. I then visualize the most frequent tokens, both with and without stop words.
# Convert data frames of samples to tokens
blogtoken <- blog_df %>% unnest_tokens(word, text)
newstoken <- news_df %>% unnest_tokens(word, text)
twittertoken <- twitter_df %>% unnest_tokens(word, text)
sample_df <- rbind(blog_df, news_df, twitter_df)
sampletoken <- sample_df %>% unnest_tokens(word, text)
# Count most common tokens / words
commontokens <- sampletoken %>% count(word, sort = TRUE)
commontokens_nostopwords <- commontokens %>% anti_join(stop_words)
## Joining, by = "word"
# Visualize most common tokens (with stop words)
commontokens %>%
filter(n > 200000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
ggtitle("Most Common Tokens (Stop Words Included)") +
xlab("Word") +
ylab("Count") +
coord_flip()
# Visualize most common tokens (without stop words)
commontokens_nostopwords %>%
filter(n > 15000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
ggtitle("Most Common Tokens (Without Stop Words)") +
xlab("Word") +
ylab("Count") +
coord_flip()
In addition to raw word counts, a more useful measure is term frequency (tf): the ratio of a token's count to the total number of words in the corpus. My training set is only 20% of the original corpus, so the raw counts are small compared to the full corpus. Dividing each word count by the total number of words yields a ratio that can be compared across different sample sizes.
# Add the term frequency (tf) to the token count
totalwords <- commontokens_nostopwords %>%
summarize(total=sum(n))
totalwords <- as.integer(totalwords)
totalwords <- rep(totalwords, dim(commontokens_nostopwords)[1])
commontokens_nostopwords <- cbind(commontokens_nostopwords, totalwords)
commontokens_nostopwords <- commontokens_nostopwords %>% mutate(tf=n/totalwords)
head(commontokens_nostopwords, 25)
## word n totalwords tf
## 1 time 45293 8689727 0.005212247
## 2 day 35321 8689727 0.004064685
## 3 love 32605 8689727 0.003752132
## 4 people 32210 8689727 0.003706676
## 5 2 21399 8689727 0.002462563
## 6 3 20720 8689727 0.002384425
## 7 1 18725 8689727 0.002154843
## 8 life 18320 8689727 0.002108237
## 9 rt 17866 8689727 0.002055991
## 10 home 16777 8689727 0.001930671
## 11 week 15542 8689727 0.001788549
## 12 night 15500 8689727 0.001783715
## 13 game 14994 8689727 0.001725486
## 14 school 14952 8689727 0.001720652
## 15 lol 14450 8689727 0.001662883
## 16 happy 13600 8689727 0.001565066
## 17 world 13151 8689727 0.001513396
## 18 4 13030 8689727 0.001499472
## 19 feel 11725 8689727 0.001349294
## 20 city 11711 8689727 0.001347683
## 21 follow 11572 8689727 0.001331687
## 22 10 11553 8689727 0.001329501
## 23 hope 11464 8689727 0.001319259
## 24 5 11111 8689727 0.001278636
## 25 lot 10778 8689727 0.001240315
Now that I have created a list of tokens in the combined training set, I move on to create a data frame of bigrams (i.e., pairs of words that appear consecutively). Bigrams will be useful in the prediction algorithm: the term frequency of each bigram in the training set determines the list of possible second words and their probabilities given the first word (a short sketch of this lookup follows the bigram counts below). Because stop words (such as "the", "an", "that", "of") are much more frequent than other words, they dominate the list of most common bigrams and are therefore excluded from this analysis as well.
# Generate bigrams
bigrams <- sample_df %>%
unnest_tokens(bigram, text, token="ngrams", n = 2)
# Get the counts with stop words included
bigrams <- bigrams %>% count(bigram, sort=TRUE)
# Separate the bigram into two words
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep=" ")
# Remove stop words from the bigrams
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# Find the total number of bigrams
totalbigrams <- bigrams_filtered %>%
summarize(total=sum(n))
totalbigrams <- as.integer(totalbigrams)
totalbigrams <- rep(totalbigrams, dim(bigrams_filtered)[1])
# Find the term frequency
bigrams <- cbind(bigrams_filtered, totalbigrams)
bigrams <- bigrams %>% mutate(tf=n/totalbigrams)
head(bigrams)
## word1 word2 n totalbigrams tf
## 1 st louis 1930 3426965 0.0005631805
## 2 happy birthday 1802 3426965 0.0005258297
## 3 1 2 1701 3426965 0.0004963576
## 4 los angeles 1385 3426965 0.0004041477
## 5 social media 1256 3426965 0.0003665051
## 6 san francisco 1245 3426965 0.0003632952
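To make the prediction idea concrete, here is a minimal sketch of how the bigram table could drive a next-word lookup. The function name predict_from_bigrams is a hypothetical helper of my own, not part of tidytext, and this is not the final algorithm:
# Sketch: rank candidate second words for a given first word
# using the bigram term frequencies computed above
predict_from_bigrams <- function(firstword, top = 5) {
  bigrams %>%
    filter(word1 == firstword) %>%
    arrange(desc(tf)) %>%
    head(top) %>%
    select(word2, n, tf)
}
# Example: the most likely words to follow "happy"
predict_from_bigrams("happy")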
It might be interesting to visualize the relationships between adjacent words that form bigrams. Rather than focusing on the most common bigrams, my goal here is to understand how many different words follow a common word (excluding stop words) and how many of those words are in turn followed by other words.
Converting the bigrams into a graph (with words as nodes and edges connecting adjacent words) helps visualize this relationship. The network makes it easy to spot first words that are followed by many possible second words (which lowers the probability of each individual prediction) and first words that are followed by only a few possible second words.
# Generate an igraph object
# Filter out by the most common bigrams
bigram_graph <- bigrams %>%
filter (n > 350) %>%
graph_from_data_frame()
# Generate the network graph using ggraph
set.seed(3226)
arrow <- grid::arrow(type="closed", length=unit(.15, "inches"))
ggraph(bigram_graph, layout="fr") +
geom_edge_link(aes(edge_alpha=n), show.legend = FALSE, arrow=arrow) +
geom_node_point(color="blue", size=2) +
geom_node_text(aes(label=name), vjust=1, hjust=1) +
theme_void()
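The same relationship can also be quantified without a plot: counting how many distinct second words follow each first word gives a rough sense of how many candidates the model must choose between. A minimal sketch reusing the bigrams data frame from above:
# For each first word, count how many distinct second words follow it
branching <- bigrams %>%
  group_by(word1) %>%
  summarize(candidates = n_distinct(word2), occurrences = sum(n)) %>%
  arrange(desc(candidates))
head(branching)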
It might also be useful to use a trigram model (i.e., three words that appear next to each other) in addition to bigrams. Predicting a word from the two preceding words, rather than just one, should be more accurate, as long as the context of the test set is similar to the training set. In the prediction model, I will first try to use trigrams to predict the third word; only if a particular trigram does not appear in the training set will the model fall back to probabilities based on bigrams (a sketch of this back-off follows the trigram counts below).
trigrams <- sample_df %>%
unnest_tokens(trigram, text, token="ngrams", n=3) %>%
separate(trigram, c("word1", "word2", "word3"), sep=" ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort=TRUE)
totaltrigrams <- trigrams %>%
summarize(total=sum(n))
totaltrigrams <- as.integer(totaltrigrams)
totaltrigrams <- rep(totaltrigrams, dim(trigrams)[1])
# Find the term frequency
trigrams <- cbind(trigrams, totaltrigrams)
trigrams <- trigrams %>% mutate(tf=n/totaltrigrams)
head(trigrams)
## word1 word2 word3 n totaltrigrams tf
## 1 happy mothers day 386 1438321 0.0002683685
## 2 happy mother's day 348 1438321 0.0002419488
## 3 president barack obama 297 1438321 0.0002064908
## 4 cinco de mayo 251 1438321 0.0001745090
## 5 world war ii 233 1438321 0.0001619944
## 6 st louis county 216 1438321 0.0001501751
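The back-off idea can be sketched as a small helper that consults the trigram table first and falls back to bigrams keyed on the last observed word. The function predict_next_word is hypothetical and only illustrative, not the final model:
# Sketch of a simple back-off: try trigrams first, then bigrams
predict_next_word <- function(w1, w2, top = 3) {
  hits <- trigrams %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(tf)) %>%
    head(top)
  if (nrow(hits) > 0) {
    return(select(hits, prediction = word3, tf))
  }
  # No matching trigram: back off to bigrams using only the last word
  bigrams %>%
    filter(word1 == w2) %>%
    arrange(desc(tf)) %>%
    head(top) %>%
    select(prediction = word2, tf)
}
# Example: predict the word after "happy mothers"
predict_next_word("happy", "mothers")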
Combining the corpus, removing stop words, and creating tokens, bigrams, and trigrams, along with their counts and term frequencies, is the first step toward a successful prediction model.
The term frequencies of trigrams and bigrams will determine the predicted word (using a back-off model). However, I will still have to account for phrases that do not appear in the training set and for cases where the term frequency is low across many possible words.
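As a placeholder for completely unseen phrases, one simple option (an assumption at this stage, not a final design choice) is a last-resort fallback to the most frequent individual words in the training set:
# Last-resort fallback when no trigram or bigram matches:
# suggest the most frequent single words from the training set
fallback_words <- commontokens_nostopwords %>%
  arrange(desc(tf)) %>%
  head(3) %>%
  pull(word)
fallback_words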