This project will create an application that predicts the next word given English-language input. The input corpus contains data files in several languages; in our case, we select the English-language files.
# load NLP libraries
library(RWeka)
library(stringi)
library(tm)
# load data and graph libraries
library(data.table)
library(rlang) #ggplot2 needs rlang
library(ggplot2)
blogs_file = "en_US.blogs.txt"
news_file = "en_US.news.txt"
twitter_file = "en_US.twitter.txt"
blogs <- readLines(blogs_file, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_file, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)
# size
size_blogs = file.info(blogs_file)$size / 10^6
size_news = file.info(news_file)$size / 10^6
size_twitter = file.info(twitter_file)$size / 10^6
# lines
length_blogs = length(blogs) / 10^6
length_news = length(news) / 10^6
length_twitter = length(twitter) / 10^6
# number of words
words_blogs = sum(stri_count_words(blogs)) / 10^6
words_news = sum(stri_count_words(news)) / 10^6
words_twitter = sum(stri_count_words(twitter)) / 10^6
b <- c(size_blogs, length_blogs, words_blogs)
n <- c(size_news, length_news, words_news)
t <- c(size_twitter, length_twitter, words_twitter)
df <- rbind(b, n, t)
colnames(df) <- c("Size (MB)", "Lines (millions)", "Words (millions)")
rownames(df) <- c("Blogs", "News", "Twitter")
df
##         Size (MB) Lines (millions) Words (millions)
## Blogs    210.1600         0.899288         37.54625
## News     205.8119         1.010242         34.76240
## Twitter  167.1053         2.360148         30.09341
As can be seen, the blogs, news, and Twitter files are quite large at approximately 210, 206, and 167 MB. They contain about 0.9, 1.0, and 2.4 million lines and roughly 38, 35, and 30 million words, respectively.
## select ASCII
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
We remove the non-ASCII characters from the data set.
During exploratory data analysis, we select a sample, clean it, and examine the data using the n-gram approach for prediction.
Given the large size of the input files, we select a very small sample, namely 2,000 lines from each file, and combine them into a single data set.
set.seed(456)
combo_data <- c(sample(blogs, 2000),
sample(news, 2000),
sample(twitter, 2000))
Given an input of a few words, we have to predict the next word. The n-gram approach is well suited to this task: it involves building tables of 1, 2, ..., N adjacent words and, given the user input, finding the most probable follow-up word. A simple but effective prediction scheme is the Katz backoff algorithm. If the user enters three or more words, the final three words are used to find the best matches in the quadgram table. If no three-word match is found, a match is attempted with the last two words in the trigram table, and so on, until only the last input word is used to find a match in the bigram table.
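To make the backoff idea concrete, here is a minimal sketch of that lookup in R. It is illustrative only: predict_next_word is a hypothetical name, the lookup tables quad_table, tri_table, and bi_table (each assumed to hold prefix, next_word, and Freq columns, built from frequency tables like those created below) are assumptions, and the sketch simply returns the most frequent continuation rather than applying full Katz discounting.
# Illustrative sketch of the backoff lookup (hypothetical table names; no Katz discounting)
predict_next_word <- function(input, quad_table, tri_table, bi_table) {
  # keep only the last three words of the lowercased input
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  if (length(words) >= 3) {
    hit <- quad_table[quad_table$prefix == paste(words, collapse = " "), ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$Freq)])
  }
  if (length(words) >= 2) {
    hit <- tri_table[tri_table$prefix == paste(tail(words, 2), collapse = " "), ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$Freq)])
  }
  hit <- bi_table[bi_table$prefix == tail(words, 1), ]
  if (nrow(hit) > 0) return(hit$next_word[which.max(hit$Freq)])
  NA_character_  # no match found at any n-gram level
}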
We next build a corpus from the combined sample above, convert it to lowercase, and remove punctuation, numbers, and extra white space.
### build a corpus
corpus <- VCorpus(VectorSource(combo_data))
# Convert to lowercase
corpus <- tm_map(corpus, tolower)
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
# convert to plain text
corpus <- tm_map(corpus, PlainTextDocument)
We next convert the text into tokens and set up n-grams.
df_corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = F)
unigrams <- NGramTokenizer(df_corpus, Weka_control(min=1, max=1))
bigrams <- NGramTokenizer(df_corpus, Weka_control(min=2, max=2))
trigrams <- NGramTokenizer(df_corpus, Weka_control(min=3, max=3))
quadgrams <- NGramTokenizer(df_corpus, Weka_control(min=4, max=4))
df_unigrams <- data.frame(table(unigrams))
df_bigrams <- data.frame(table(bigrams))
df_trigrams <- data.frame(table(trigrams))
df_quadgrams<- data.frame(table(quadgrams))
unigrams_top10 <- head(df_unigrams[order(df_unigrams$Freq, decreasing = T),],10)
bigrams_top10 <- head(df_bigrams[order(df_bigrams$Freq, decreasing = T),],10)
trigrams_top10 <- head(df_trigrams[order(df_trigrams$Freq, decreasing = T),],10)
quadgrams_top10 <- head(df_quadgrams[order(df_quadgrams$Freq, decreasing = T),],10)
Next, we examine the top 10 n-grams of each order in four plots.
barfill <- "gold1"
barlines <- "goldenrod2"
ggplot(unigrams_top10, aes(x=unigrams, y=Freq)) +
geom_bar(stat = "identity", colour = barlines, fill = barfill) +
geom_text(aes(label=Freq), vjust=0) +
theme(axis.text.x = element_text(angle = 35)) +
labs(x = "Unigrams", y = "Frequency") +
ggtitle("Frequency Histogram of Unigrams") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(bigrams_top10, aes(x=bigrams, y=Freq)) +
geom_bar(stat = "identity", colour = barlines, fill = barfill) +
geom_text(aes(label=Freq), vjust=0) +
theme(axis.text.x = element_text(angle = 35)) +
labs(x = "Bigrams", y = "Frequency") +
ggtitle("Frequency Histogram of Bigrams") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(trigrams_top10, aes(x=trigrams, y=Freq)) +
geom_bar(stat = "identity", colour = barlines, fill = barfill) +
geom_text(aes(label=Freq), vjust=0) +
theme(axis.text.x = element_text(angle = 35)) +
labs(x = "Trigrams", y = "Frequency") +
ggtitle("Frequency Histogram of Trigrams") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(quadgrams_top10, aes(x=quadgrams, y=Freq)) +
geom_bar(stat = "identity", colour = barlines, fill = barfill) +
geom_text(aes(label=Freq), vjust=0) +
theme(axis.text.x = element_text(angle = 35)) +
labs(x = "Quadgrams", y = "Frequency") +
ggtitle("Frequency Histogram of Quadgrams") +
theme(plot.title = element_text(hjust = 0.5))
The above plots show the most frequent n-grams in the selected sample. Now that the n-gram frequency tables have been created, the Katz backoff algorithm can be implemented on top of them.
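As a hedged illustration of one way this could be set up (the function, object, and column names here are ours, not a fixed design), the frequency tables df_bigrams, df_trigrams, and df_quadgrams can be split into a prefix (all but the last word) and a next_word column, giving the lookup tables assumed by the backoff sketch earlier:
# Illustrative only: reshape an n-gram frequency table into a prefix / next-word lookup
make_lookup <- function(df_ngrams) {
  dt <- as.data.table(df_ngrams)
  setnames(dt, 1, "ngram")
  dt[, ngram := as.character(ngram)]
  dt[, prefix := sub("\\s+\\S+$", "", ngram)]   # everything except the last word
  dt[, next_word := sub("^.*\\s+", "", ngram)]  # the last word
  setorder(dt, -Freq)
  dt[, .(prefix, next_word, Freq)]
}
bi_table   <- make_lookup(df_bigrams)
tri_table  <- make_lookup(df_trigrams)
quad_table <- make_lookup(df_quadgrams)
With these tables in place, a call such as predict_next_word("at the end of", quad_table, tri_table, bi_table) would return the most frequent continuation observed in the sample.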
In the next phase of the project, the prediction algorithm needs to be implemented. The sample size has to be increased to improve prediction accuracy, while keeping in mind the memory footprint of the final Shiny application. Trade-offs may have to be made between prediction accuracy and memory availability. Once the prediction algorithm is implemented, the Shiny App that uses it will be created.
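As a rough sketch of how the pieces could fit together (assuming a predict_next_word() function like the one outlined above; the UI layout is purely illustrative and will differ in the final app):
# Minimal illustrative Shiny skeleton
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next_word(input$phrase, quad_table, tri_table, bi_table)
  })
}
shinyApp(ui = ui, server = server)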