Introduction

This project will create an application that predicts the next word given English-language input. The input corpus contains data files in several languages; in our case, we select the English-language files.

# load NLP libraries
library(RWeka)
library(stringi)
library(tm)
# load data and graph libraries
library(data.table)
library(rlang)  #ggplot2 needs rlang
library(ggplot2) 

Input Files

blogs_file = "en_US.blogs.txt"
news_file  = "en_US.news.txt"
twitter_file = "en_US.twitter.txt"

Loading Data

blogs <- readLines(blogs_file, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_file, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)

Size of the files

#size
size_blogs = file.info(blogs_file)$size / 10^6
size_news = file.info(news_file)$size / 10^6
size_twitter = file.info(twitter_file)$size / 10^6
# lines
length_blogs = length(blogs) / 10^6
length_news = length(news) / 10^6
length_twitter = length(twitter) / 10^6

# number of words
words_blogs = sum(stri_count_words(blogs)) / 10^6
words_news = sum(stri_count_words(news)) / 10^6
words_twitter = sum(stri_count_words(twitter)) / 10^6

b <- c(size_blogs, length_blogs, words_blogs)
n <- c(size_news, length_news, words_news)
t <- c(size_twitter, length_twitter, words_twitter)
df <- rbind(b, n, t)
colnames(df) <- c("Size (MB)", "Lines (millions)", "Words (millions)")
rownames(df) <- c("Blogs", "News", "Twitter")
df
##         Size (MB) Lines (millions) Words (millions)
## Blogs    210.1600         0.899288         37.54625
## News     205.8119         1.010242         34.76240
## Twitter  167.1053         2.360148         30.09341

As can be seen, the Blogs, News, and Twitter files are quite large, at approximately 210, 206, and 167 MB. They contain 0.9, 1.0, and 2.4 million lines and roughly 38, 35, and 30 million words, respectively.

Clean Data

##  select ASCII
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

We removed the non-ASCII characters from the data set.

Exploratory Data Analysis

During exploratory data analysis, we select a sample, clean it, and examine the data using the n-gram approach that will be used for prediction.

Sampling Data

Given the large size of the input files, we select a small sample of 2000 lines from each file and combine them into a single data set.

set.seed(456)
combo_data <- c(sample(blogs, 2000),
                 sample(news, 2000),
                 sample(twitter, 2000))

Prediction Approach

Given an input of a few words, we need to predict the next word. The n-gram approach is well suited to this task: it builds tables of 1, 2, ..., N adjacent words, and for a given user input we look up the most probable follow-up word. A simple but effective prediction strategy is the Katz backoff algorithm. If the user enters three or more words, the final three words are used to find the best matches in the quadgram table. If no three-word match is found, a match is attempted with the last two words in the trigram table, and so on, until only the last input word is used to find a match in the bigram table. A minimal sketch of this lookup is shown below.
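The following sketch (not the final implementation) illustrates the backoff lookup, assuming the frequency tables df_quadgrams, df_trigrams, and df_bigrams built in the N-gram creation section below. The helper name predict_next_word is illustrative, and the sketch backs off on raw frequencies only; full Katz backoff additionally applies discounting and backoff weights.

# illustrative backoff lookup over the n-gram frequency tables built below
predict_next_word <- function(phrase, quad, tri, bi) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    # return the last word of the most frequent n-gram starting with the
    # last n input words, or NA if there is no match
    lookup <- function(tbl, n) {
        key <- paste(tail(words, n), collapse = " ")
        hits <- tbl[grepl(paste0("^", key, " "), tbl[, 1]), ]
        if (nrow(hits) == 0) return(NA_character_)
        best <- as.character(hits[which.max(hits$Freq), 1])
        tail(strsplit(best, " ")[[1]], 1)
    }
    if (length(words) >= 3) {
        res <- lookup(quad, 3)
        if (!is.na(res)) return(res)
    }
    if (length(words) >= 2) {
        res <- lookup(tri, 2)
        if (!is.na(res)) return(res)
    }
    lookup(bi, 1)
}
# example call, once the tables below exist:
# predict_next_word("thanks for the", df_quadgrams, df_trigrams, df_bigrams)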

Cleaning Data

We next build a corpus from the combined sample data set above. We convert the text to lowercase and remove punctuation, numbers, and extra white space.

### build a corpus
corpus <- VCorpus(VectorSource(combo_data))
# Convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
# convert to plain text
corpus <- tm_map(corpus, PlainTextDocument)

N-gram creation

We next tokenize the text and build n-gram frequency tables.

df_corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = F)

unigrams <- NGramTokenizer(df_corpus, Weka_control(min=1, max=1))
bigrams <- NGramTokenizer(df_corpus, Weka_control(min=2, max=2))
trigrams <- NGramTokenizer(df_corpus, Weka_control(min=3, max=3))
quadgrams <- NGramTokenizer(df_corpus, Weka_control(min=4, max=4))

df_unigrams <- data.frame(table(unigrams))
df_bigrams <- data.frame(table(bigrams))
df_trigrams <- data.frame(table(trigrams))
df_quadgrams<- data.frame(table(quadgrams))

unigrams_top10 <- head(df_unigrams[order(df_unigrams$Freq, decreasing = T),],10)
bigrams_top10 <- head(df_bigrams[order(df_bigrams$Freq, decreasing = T),],10)
trigrams_top10 <- head(df_trigrams[order(df_trigrams$Freq, decreasing = T),],10)
quadgrams_top10 <- head(df_quadgrams[order(df_quadgrams$Freq, decreasing = T),],10)

Next we examine the top 10 n-grams in four plots.

N-gram Plots

barfill <- "gold1"
barlines <- "goldenrod2"

ggplot(unigrams_top10, aes(x=unigrams, y=Freq)) + 
    geom_bar(stat = "identity", colour = barlines, fill = barfill) +
    geom_text(aes(label=Freq), vjust=0) +
    theme(axis.text.x = element_text(angle = 35)) +
    labs(x = "Unigrams", y = "Frequency") +
    ggtitle("Frequency Histogram of Unigrams") +
    theme(plot.title = element_text(hjust = 0.5)) 

ggplot(bigrams_top10, aes(x=bigrams, y=Freq)) + 
        geom_bar(stat = "identity", colour = barlines, fill = barfill) +
        geom_text(aes(label=Freq), vjust=0) +
        theme(axis.text.x = element_text(angle = 35)) +
        labs(x = "Bigrams", y = "Frequency") +
        ggtitle("Frequency Histogram of Bigrams") +
        theme(plot.title = element_text(hjust = 0.5)) 

ggplot(trigrams_top10, aes(x=trigrams, y=Freq)) + 
        geom_bar(stat = "identity", colour = barlines, fill = barfill) +
        geom_text(aes(label=Freq), vjust=0) +
        theme(axis.text.x = element_text(angle = 35)) +
        labs(x = "Trigrams", y = "Frequency") +
        ggtitle("Frequency Histogram of Trigrams") +
        theme(plot.title = element_text(hjust = 0.5)) 

ggplot(quadgrams_top10, aes(x=quadgrams, y=Freq)) + 
        geom_bar(stat = "identity", colour = barlines, fill = barfill) +
        geom_text(aes(label=Freq), vjust=0) +
        theme(axis.text.x = element_text(angle = 35)) +
        labs(x = "Quadgrams", y = "Frequency") +
        ggtitle("Frequency Histogram of Quadgrams") +
        theme(plot.title = element_text(hjust = 0.5)) 

The above plots show the most frequent n-grams in the selected sample. Now that the n-gram tables have been created, the Katz backoff algorithm can be implemented on top of them.

Next steps

In the next phase of the project, the prediction algorithm needs to be implemented. The sample size should be increased to improve prediction accuracy, while keeping in mind the memory footprint of the final Shiny application. Trade-offs may have to be made between prediction accuracy and available memory; one simple way to control memory usage is sketched below. Once the prediction algorithm is implemented, the Shiny app that uses it will be built.
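As a rough illustration of the accuracy versus memory trade-off, the snippet below prunes rare n-grams and saves the tables as compact .rds files that the Shiny app could load at startup. The frequency threshold and file names are illustrative assumptions, not final choices.

# drop n-grams seen fewer than min_freq times, then save compact tables
# (threshold and file names are illustrative assumptions)
min_freq <- 2
saveRDS(df_bigrams[df_bigrams$Freq >= min_freq, ], "bigrams.rds")
saveRDS(df_trigrams[df_trigrams$Freq >= min_freq, ], "trigrams.rds")
saveRDS(df_quadgrams[df_quadgrams$Freq >= min_freq, ], "quadgrams.rds")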