Introduction

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Figure 1: Natural Language Processing

In this project, we will create a Shiny Web App that predicts the next word given a sequence of words as input. To accomplish this, there are a series of steps we must take:

  1. Get the source for our training dataset
  2. Load the training dataset into our workspace
  3. Preprocess the training dataset
  4. Process the dataset into tokens and n-grams
  5. Build a predictor model
  6. Build a Shiny Web App

Getting and Loading the Data

The data set was provided by the Coursera Data Science Capstone. The file is a zip archive containing a directory called “final” with four subdirectories, one for each language (German, English, Finnish and Russian). For this project we will only be using the English dataset.

The tm package was used to load the text files into our workspace as a corpus. A corpus is a collection of texts, which is unstructured data. For our project we will need to convert the corpus from this unstructured format into a structured format in order to analyze the text and build our predictor model.
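As an illustration, below is a minimal sketch of how the English text files could be read into a tm corpus; the directory path is an assumption about where the unzipped files are stored.

library(tm)

# Read every .txt file in the English subdirectory into a volatile corpus
corpus <- VCorpus(DirSource("final/en_US", pattern = "\\.txt$", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
inspect(corpus)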

The text for our corpus is taken from Plain Text Document (.txt) files that contain text from 3 different sources:

  1. Blogs
  2. News
  3. Twitter

Since the texts come from web sources, they contain special characters and links that we will need to remove (“@”, “http://…”, “www…”). This will be done during the preprocessing steps.

Preprocessing

Preprocessing is an important part of Natural Language Processing. Full texts are preprocessed to improve computational performance and accuracy of text analysis techniques. Below is a diagram of the most common preprocessing techniques.

Figure 2: Common text preprocessing techniques

The preprocessing for our project consisted of lowercasing, removing symbols and numbers, profanity filtering and removing stopwords. The corpus was then tokenized and a Document-Term Matrix was created. The preprocessing was scripted; when sourced, the script creates RDS files for the objects that will be needed during the exploratory data analysis.
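Below is a condensed sketch of these preprocessing steps using tm; the URL/handle patterns and the profanity list path are assumptions, and the actual preprocessing script may differ in detail.

library(tm)

# Custom transformer: replace every match of a regex pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, toSpace, "(http|https|www)\\S+")  # links
corpus <- tm_map(corpus, toSpace, "@\\S+")                 # handles
corpus <- tm_map(corpus, content_transformer(tolower))     # lowercasing
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))     # stopword removal
profanityList <- readLines("Assets/profanity.txt")         # assumed path to a profanity word list
corpus <- tm_map(corpus, removeWords, profanityList)       # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# Document-Term Matrix and cached corpus for the exploratory analysis
dtm <- DocumentTermMatrix(corpus)
saveRDS(corpus, "Assets/RDS/preprocessedCorpus-tm.rds")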

Processing

The preprocessed corpus was loaded and split into three corpus objects for more efficient processing, one for each source (blogs, news, twitter). The corpora were then tokenized and n-grams were created (unigrams, bigrams and trigrams).

Figure 3: Tokenization and n-gram generation

The unigram, bigram and trigram tokens were then combined into document-feature matrices (DFMs), which will be used for the exploratory data analysis.
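A sketch of this step for the blogs sub-corpus using quanteda is shown below; the object names follow the splitting step described above, and the tokenization options are assumptions.

library(quanteda)

blogsTokens <- tokens(blogsCorpus, remove_punct = TRUE, remove_numbers = TRUE)

# n-grams joined with a space so that features read as phrases (e.g. "cant wait")
unigrams <- tokens_ngrams(blogsTokens, n = 1, concatenator = " ")
bigrams  <- tokens_ngrams(blogsTokens, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(blogsTokens, n = 3, concatenator = " ")

# Document-feature matrices used in the exploratory data analysis
unigrams_dfm <- dfm(unigrams)
bigrams_dfm  <- dfm(bigrams)
trigrams_dfm <- dfm(trigrams)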

Figure 4: Document-feature matrices (DFMs)

Exploratory Data Analysis

Once the data was loaded, preprocessed and processed into structured data, an exploratory data analysis was conducted.

It is important to get information about the text documents that will be part of our corpus. Below is a table with the main characteristics of every text document from the corpus.
Documents          Number of lines  Number of characters  File size (MB)
en_US.blogs.txt            899,288           209,260,725             200
en_US.news.txt           1,010,242            15,761,023             196
en_US.twitter.txt        2,360,148           164,744,972             159
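These figures could be reproduced with a short helper function; the file paths below are an assumption about where the raw files are stored.

# Summarise each source file: line count, character count and size in MB
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

summariseFile <- function(path) {
  txt <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(Documents  = basename(path),
             Lines      = length(txt),
             Characters = sum(nchar(txt)),
             SizeMB     = round(file.size(path) / 1024^2))
}

do.call(rbind, lapply(files, summariseFile))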

The code for preprocessing the corpus was saved as a separate script file; when sourced, it creates an RDS file with the preprocessed corpus, which is used for the processing in this report.

Since the total size of the corpus would be too big to process at once, it was split into three corpora, one for each document.

library(quanteda)

# Load the preprocessed tm corpus and convert it to a quanteda corpus
corpus <- readRDS("Assets/RDS/preprocessedCorpus-tm.rds")
corpus <- corpus(corpus)

# Split into one corpus per source and cache each split as an RDS file
blogsCorpus <- corpus(corpus[1])
newsCorpus <- corpus(corpus[2])
twitterCorpus <- corpus(corpus[3])
saveRDS(blogsCorpus, "Assets/RDS/blogsCorpus.rds")
saveRDS(newsCorpus, "Assets/RDS/newsCorpus.rds")
saveRDS(twitterCorpus, "Assets/RDS/twitterCorpus.rds")
rm(corpus, newsCorpus, twitterCorpus)

Each corpus was then processed and tokenized, and unigrams, bigrams and trigrams were extracted.

The top ten unigrams, bigrams and trigrams of the corpus are shown in the graphics below.

library(dplyr)
library(ggplot2)

# Top 10 unigrams by frequency, reshaped into a data frame for plotting
unigrams_dfm <- readRDS("Assets/RDS/dfm/unigrams_dfm.rds")
topUnigrams <- topfeatures(unigrams_dfm)
topUnigrams <- data.frame(topUnigrams)
Unigram <- rownames(topUnigrams)
topUnigrams <- transmute(topUnigrams, Unigram = Unigram, Count = topUnigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

g <- ggplot(data = topUnigrams, aes(x = reorder(factor(Unigram), -Count), 
                                    y = Count))
g + geom_col(aes(fill = Unigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Unigrams", x = "Unigram tokens") + 
  theme_minimal()

rm(Unigram, topUnigrams, unigrams_dfm)
bigrams_dfm <- readRDS("Assets/RDS/dfm/bigrams_dfm.rds")
# Remove a few contraction-based features (apostrophes were stripped in preprocessing)
# so they do not dominate the top bigrams
bigrams_dfm <- dfm_select(bigrams_dfm, pattern = c("cant wait", "dont know",
                                                   "im going", "dont"), 
                          selection = "remove")
topBigrams <- topfeatures(bigrams_dfm)
topBigrams <- data.frame(topBigrams)
Bigram <- rownames(topBigrams)
topBigrams <- transmute(topBigrams, Bigram = Bigram, Count = topBigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

h <- ggplot(data = topBigrams, aes(x = reorder(factor(Bigram), -Count), 
                                    y = Count))
h + geom_col(aes(fill = Bigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Bigrams", x = "Bigram tokens") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

rm(Bigram, topBigrams, bigrams_dfm)
trigrams_dfm <- readRDS("Assets/RDS/dfm/trigrams_dfm.rds")
# As with the bigrams, drop contraction-based features before ranking
trigrams_dfm <- dfm_select(trigrams_dfm, pattern = c("cant wait see",
                                                     "dont even know",
                                                     "feel like im", 
                                                     "im pretty sure", 
                                                     "im", "dont", "cant"), 
                           selection = "remove")
topTrigrams <- topfeatures(trigrams_dfm)
topTrigrams <- data.frame(topTrigrams)
Trigram <- rownames(topTrigrams)
topTrigrams <- transmute(topTrigrams, Trigram = Trigram, Count = topTrigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

i <- ggplot(data = topTrigrams, aes(x = reorder(factor(Trigram), -Count), 
                                    y = Count))
i + geom_col(aes(fill = Trigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Trigrams", x = "Trigram tokens") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

rm(Trigram, topTrigrams, trigrams_dfm)

Next Steps

Now that we have structured data with the features needed for modeling, the next step is to begin the modeling process. We plan to use Markov chains for the predictor model. The data for the model will probably need further processing to decrease memory use and to improve efficiency and prediction quality. This could be achieved by dimension reduction and by eliminating features with low variance.
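Below is a minimal sketch of the intended prediction logic; the frequency tables (trigramFreq, bigramFreq) and their column names are assumptions about how the n-gram counts will be stored, and the final model may differ.

library(dplyr)

# Assumed frequency tables built from the n-gram DFMs:
#   trigramFreq: columns word1, word2, word3, count
#   bigramFreq:  columns word1, word2, count
predictNextWord <- function(input, trigramFreq, bigramFreq) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)

  # Markov assumption: only the last two words matter; try the trigram table first
  if (length(words) == 2) {
    hit <- trigramFreq %>%
      filter(word1 == words[1], word2 == words[2]) %>%
      arrange(desc(count))
    if (nrow(hit) > 0) return(hit$word3[1])
  }

  # Back off to the bigram table using only the last word
  hit <- bigramFreq %>%
    filter(word1 == tail(words, 1)) %>%
    arrange(desc(count))
  if (nrow(hit) > 0) return(hit$word2[1])

  "the"  # fall back to a very common unigram
}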

The Shiny App will consist of a text input panel and a text output panel. Under the hood, the Shiny App will run the Markov model to find the best prediction for the next word, as illustrated in the image below.

Figure 5: Predictor model
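A skeleton of the planned app is sketched below; predictNextWord() is the hypothetical helper from the previous sketch, and the frequency tables are assumed to be loaded from RDS files when the app starts.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # trigramFreq and bigramFreq are assumed to be loaded with readRDS() at start-up
  output$prediction <- renderText({
    req(input$phrase)
    predictNextWord(input$phrase, trigramFreq, bigramFreq)
  })
}

shinyApp(ui = ui, server = server)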
