Introduction

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Figure 1: Natural Language Processing

In this project, we will create a Shiny Web App that predicts the next word given a sequence of words as input. To accomplish this, there are a series of steps we must take:

  1. Get the source for our training dataset
  2. Load the training dataset into our workspace
  3. Preprocess the training dataset
  4. Process the dataset into tokens and n-grams
  5. Build a predictor model
  6. Build a Shiny Web App

Getting and Loading the Data

The data set was provided by the Coursera Data Science Capstone. The file is a zip archive containing a directory called “final” with four subdirectories, one for each language (German, English, Finnish and Russian). For this project we will only be using the English dataset.

The tm package was used to load the text files into our workspace as a corpus. A corpus is a collection of texts, which is unstructured data. For our project we will need to convert the corpus from this unstructured format into a structured format in order to analyze the text and build our predictor model.
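As an illustration, below is a minimal sketch of how the English text files could be read into a tm corpus; the directory path is an assumption about where the unzipped files are stored.

library(tm)

# Read every .txt file in the English subdirectory into a volatile corpus
corpus <- VCorpus(DirSource("final/en_US", pattern = "\\.txt$", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
inspect(corpus)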

The text for our corpus is taken from Plain Text Document (.txt) files that contain text from 3 different sources:

  1. Blogs
  2. News
  3. Twitter

Since the texts come from web sources, they contain special characters and links that we will need to remove (“@”, “http://…”, “www…”). This will be done during the preprocessing steps.

Preprocessing

Preprocessing is an important part of Natural Language Processing. Full texts are preprocessed to improve computational performance and accuracy of text analysis techniques. Below is a diagram of the most common preprocessing techniques.

Figure 2: Common text preprocessing techniques

The preprocessing for our project consisted of lowercasing, removing symbols and numbers, profanity filtering and removing stopwords. The corpus was then tokenized and a Document-Term Matrix was created. The preprocessing was scripted; when sourced, the script creates RDS files for the objects that will be needed during the exploratory data analysis.
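Below is a condensed sketch of these preprocessing steps using tm; the URL/handle patterns and the profanity list path are assumptions, and the actual preprocessing script may differ in detail.

library(tm)

# Custom transformer: replace every match of a regex pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, toSpace, "(http|https|www)\\S+")  # links
corpus <- tm_map(corpus, toSpace, "@\\S+")                 # handles
corpus <- tm_map(corpus, content_transformer(tolower))     # lowercasing
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))     # stopword removal
profanityList <- readLines("Assets/profanity.txt")         # assumed path to a profanity word list
corpus <- tm_map(corpus, removeWords, profanityList)       # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# Document-Term Matrix and cached corpus for the exploratory analysis
dtm <- DocumentTermMatrix(corpus)
saveRDS(corpus, "Assets/RDS/preprocessedCorpus-tm.rds")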

Processing

The preprocessed corpus was loaded and split into three corpus objects for more efficient processing, one for each source (blogs, news, twitter). The corpora were then tokenized and n-grams were created (unigrams, bigrams and trigrams).

Figure 3: Tokenization and n-gram generation

The unigram, bigram and trigram tokens were then combined into document-feature matrices (DFMs), which will be used for the exploratory data analysis.
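A sketch of this step for the blogs sub-corpus using quanteda is shown below; the object names follow the splitting step described above, and the tokenization options are assumptions.

library(quanteda)

blogsTokens <- tokens(blogsCorpus, remove_punct = TRUE, remove_numbers = TRUE)

# n-grams joined with a space so that features read as phrases (e.g. "cant wait")
unigrams <- tokens_ngrams(blogsTokens, n = 1, concatenator = " ")
bigrams  <- tokens_ngrams(blogsTokens, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(blogsTokens, n = 3, concatenator = " ")

# Document-feature matrices used in the exploratory data analysis
unigrams_dfm <- dfm(unigrams)
bigrams_dfm  <- dfm(bigrams)
trigrams_dfm <- dfm(trigrams)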

Figure 4: Document-feature matrices (DFMs)

Exploratory Data Analysis

Once the data was loaded, preprocessed and processed into structured data, an exploratory data analysis was conducted.

It is important to get information about the text documents that will be part of our corpus. Below is a table with the main characteristics of every text document from the corpus.
Documents          Number of lines  Number of characters  File size (MB)
en_US.blogs.txt            899,288           209,260,725             200
en_US.news.txt           1,010,242            15,761,023             196
en_US.twitter.txt        2,360,148           164,744,972             159
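These figures could be reproduced with a short helper function; the file paths below are an assumption about where the raw files are stored.

# Summarise each source file: line count, character count and size in MB
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

summariseFile <- function(path) {
  txt <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(Documents  = basename(path),
             Lines      = length(txt),
             Characters = sum(nchar(txt)),
             SizeMB     = round(file.size(path) / 1024^2))
}

do.call(rbind, lapply(files, summariseFile))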

The code for preprocessing the corpus was saved as a separate script file; when sourced, it creates an RDS file with the preprocessed corpus, which is used for the processing in this report.

Since the total size of the corpus would be too big to process at once, it was split into three corpora, one for each document.

library(quanteda)

# Load the preprocessed tm corpus and convert it to a quanteda corpus
corpus <- readRDS("Assets/RDS/preprocessedCorpus-tm.rds")
corpus <- corpus(corpus)

# Split into one corpus per source and cache each split as an RDS file
blogsCorpus <- corpus(corpus[1])
newsCorpus <- corpus(corpus[2])
twitterCorpus <- corpus(corpus[3])
saveRDS(blogsCorpus, "Assets/RDS/blogsCorpus.rds")
saveRDS(newsCorpus, "Assets/RDS/newsCorpus.rds")
saveRDS(twitterCorpus, "Assets/RDS/twitterCorpus.rds")
rm(corpus, newsCorpus, twitterCorpus)

Each corpus was then processed and tokenized, and unigrams, bigrams and trigrams were extracted.

The top ten unigrams, bigrams and trigrams of the corpus are shown in the graphics below.

library(dplyr)
library(ggplot2)

# Top 10 unigrams by frequency, reshaped into a data frame for plotting
unigrams_dfm <- readRDS("Assets/RDS/dfm/unigrams_dfm.rds")
topUnigrams <- topfeatures(unigrams_dfm)
topUnigrams <- data.frame(topUnigrams)
Unigram <- rownames(topUnigrams)
topUnigrams <- transmute(topUnigrams, Unigram = Unigram, Count = topUnigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

g <- ggplot(data = topUnigrams, aes(x = reorder(factor(Unigram), -Count), 
                                    y = Count))
g + geom_col(aes(fill = Unigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Unigrams", x = "Unigram tokens") + 
  theme_minimal()

rm(Unigram, topUnigrams, unigrams_dfm)
bigrams_dfm <- readRDS("Assets/RDS/dfm/bigrams_dfm.rds")
# Remove a few contraction-based features (apostrophes were stripped in preprocessing)
# so they do not dominate the top bigrams
bigrams_dfm <- dfm_select(bigrams_dfm, pattern = c("cant wait", "dont know",
                                                   "im going", "dont"), 
                          selection = "remove")
topBigrams <- topfeatures(bigrams_dfm)
topBigrams <- data.frame(topBigrams)
Bigram <- rownames(topBigrams)
topBigrams <- transmute(topBigrams, Bigram = Bigram, Count = topBigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

h <- ggplot(data = topBigrams, aes(x = reorder(factor(Bigram), -Count), 
                                    y = Count))
h + geom_col(aes(fill = Bigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Bigrams", x = "Bigram tokens") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

rm(Bigram, topBigrams, bigrams_dfm)
trigrams_dfm <- readRDS("Assets/RDS/dfm/trigrams_dfm.rds")
# As with the bigrams, drop contraction-based features before ranking
trigrams_dfm <- dfm_select(trigrams_dfm, pattern = c("cant wait see",
                                                     "dont even know",
                                                     "feel like im", 
                                                     "im pretty sure", 
                                                     "im", "dont", "cant"), 
                           selection = "remove")
topTrigrams <- topfeatures(trigrams_dfm)
topTrigrams <- data.frame(topTrigrams)
Trigram <- rownames(topTrigrams)
topTrigrams <- transmute(topTrigrams, Trigram = Trigram, Count = topTrigrams)

cols <- c("#d00000", "#ffba08", "#229631", "#8fe388", "#1b998b", 
          "#3185fc", "#5d2e8c", "#196bde", "#ff7b9c", "#ff9b85")

i <- ggplot(data = topTrigrams, aes(x = reorder(factor(Trigram), -Count), 
                                    y = Count))
i + geom_col(aes(fill = Trigram)) + 
  scale_fill_manual(values = cols) +
  labs(title = "Top 10 Trigrams", x = "Trigram tokens") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

rm(Trigram, topTrigrams, trigrams_dfm)

Next Steps

Now that we have structured data with the features needed for modeling, the next step is to begin the modeling process. We plan to use Markov chains for the predictor model. The data for the model will probably need further processing to decrease memory use and to improve efficiency and prediction quality. This could be achieved by dimension reduction and by eliminating features with low variance.
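Below is a minimal sketch of the intended prediction logic; the frequency tables (trigramFreq, bigramFreq) and their column names are assumptions about how the n-gram counts will be stored, and the final model may differ.

library(dplyr)

# Assumed frequency tables built from the n-gram DFMs:
#   trigramFreq: columns word1, word2, word3, count
#   bigramFreq:  columns word1, word2, count
predictNextWord <- function(input, trigramFreq, bigramFreq) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)

  # Markov assumption: only the last two words matter; try the trigram table first
  if (length(words) == 2) {
    hit <- trigramFreq %>%
      filter(word1 == words[1], word2 == words[2]) %>%
      arrange(desc(count))
    if (nrow(hit) > 0) return(hit$word3[1])
  }

  # Back off to the bigram table using only the last word
  hit <- bigramFreq %>%
    filter(word1 == tail(words, 1)) %>%
    arrange(desc(count))
  if (nrow(hit) > 0) return(hit$word2[1])

  "the"  # fall back to a very common unigram
}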

The Shiny App will consist of a text input panel and a text output panel. Under the hood, the Shiny App will run the Markov model to find the best prediction for the next word, as illustrated in the image below.

Figure 5: Predictor model
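A skeleton of the planned app is sketched below; predictNextWord() is the hypothetical helper from the previous sketch, and the frequency tables are assumed to be loaded from RDS files when the app starts.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # trigramFreq and bigramFreq are assumed to be loaded with readRDS() at start-up
  output$prediction <- renderText({
    req(input$phrase)
    predictNextWord(input$phrase, trigramFreq, bigramFreq)
  })
}

shinyApp(ui = ui, server = server)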
