The goal of this project is simply to demonstrate that you’ve become comfortable working with the data and that you are on track to create your prediction algorithm.
Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm.
This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.
You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
Does the link lead to an HTML page describing the exploratory analysis of the training data set?
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
Has the data scientist made basic plots, such as histograms to illustrate features of the data?
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
This report is the first step in the capstone project for the Data Science Specialization from Johns Hopkins University. The project uses a corpus of blog, news, and Twitter text to predict the next word after a few letters or words are entered at a prompt. This kind of interface integrates well with mobile applications, where input options are limited.
Finally, we will use this report to briefly analyze the data and outline the plan for the predictive model behind the Shiny app.
# A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
# https://www.rdocumentation.org/packages/ggplot2/versions/3.3.0
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
# A framework for text mining applications within R.
# https://www.rdocumentation.org/packages/tm/versions/0.7-7
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# Basic classes and methods for Natural Language Processing.
# https://www.rdocumentation.org/packages/NLP/versions/0.2-0
library(NLP)
# An interface to the Apache OpenNLP tools (version 1.5.3). The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text written in Java
# https://www.rdocumentation.org/packages/openNLP/versions/0.2-7
library(openNLP)
## Warning: package 'openNLP' was built under R version 3.6.3
# Provides color schemes for maps (and other graphics) designed by Cynthia Brewer as described at http://colorbrewer2.org
# https://www.rdocumentation.org/packages/RColorBrewer/versions/1.1-2
# Amazing lib for color schemes: https://colorbrewer2.org/#type=sequential&scheme=Purples&n=3
library(RColorBrewer)
# Fast, correct, consistent, portable and convenient character string/text processing in every locale and any native encoding
# https://www.rdocumentation.org/packages/stringi/versions/1.4.6
library(stringi)
## Warning: package 'stringi' was built under R version 3.6.2
# Low-level interface to Java VM very much like .C/.Call and friends. Allows creation of objects, calling methods and accessing fields.
# https://www.rdocumentation.org/packages/rJava/versions/0.9-12
library(rJava)
## Warning: package 'rJava' was built under R version 3.6.3
# An R interface to Weka (Version 3.9.3). Weka is a collection of machine learning algorithms for data mining tasks written in Java, containing tools for data pre-processing, classification, regression, clustering, association rules, and visualization
# https://www.rdocumentation.org/packages/RWeka/versions/0.4-42
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(RWekajars) ## required by RWeka
# An R interface to the C 'libstemmer' library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary
# https://www.rdocumentation.org/packages/SnowballC/versions/0.7.0
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.6.3
# Automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse including frequency counts of sentence types, words, sentences, turns of talk, syllables and other assorted analysis tasks.
# https://www.rdocumentation.org/packages/qdap/versions/2.3.6
library(qdap)
## Warning: package 'qdap' was built under R version 3.6.3
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
##
## %+%
## Loading required package: qdapTools
## Registered S3 methods overwritten by 'qdap':
## method from
## t.DocumentTermMatrix tm
## t.TermDocumentMatrix tm
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following object is masked from 'package:base':
##
## Filter
The files used for this analysis include only the English-language files from the full corpora: US English blogs (en_US.blogs.txt), US English news (en_US.news.txt), and US English tweets from Twitter (en_US.twitter.txt).
The data will be loaded from the local project folder. The files are included in the GitHub repo.
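If the raw files are not already on disk, they can be downloaded first. Below is a minimal sketch, assuming the dataset URL from the Coursera capstone materials is still valid (verify before use):
# Sketch: fetch and unpack the English files if they are missing
# (URL taken from the Coursera capstone materials; confirm it is current)
if (!file.exists("en_US.blogs.txt")) {
  zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip",
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        junkpaths = TRUE)  # drop the folder structure so the files land in the project root
}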
# Open each file in binary mode, read all lines as UTF-8 while skipping
# embedded nulls, and close the connection when done
content_blogs <- file("en_US.blogs.txt",
                      open = "rb")
data_blogs <- readLines(content_blogs,
                        encoding = "UTF-8",
                        skipNul = TRUE)
close(content_blogs)
content_news <- file("en_US.news.txt",
                     open = "rb")
data_news <- readLines(content_news,
                       encoding = "UTF-8",
                       skipNul = TRUE)
close(content_news)
content_twitter <- file("en_US.twitter.txt",
                        open = "rb")
data_twitter <- readLines(content_twitter,
                          encoding = "UTF-8",
                          skipNul = TRUE)
close(content_twitter)
We’ll calculate the size of each file in the corpus, the number of lines and words per file, and the length of the longest line in each file.
blogs_file_size <- file.info("en_US.blogs.txt")$size / 1024 / 1024
blogs_lines <- length(data_blogs)
blogs_words <- sum(stri_count_words(data_blogs))
blogs_line_length <- max(nchar(data_blogs))
paste("Blog File size (MB) = ",blogs_file_size)
## [1] "Blog File size (MB) = 200.424207687378"
paste("Blog Lines = ", blogs_lines)
## [1] "Blog Lines = 899288"
paste("Blog Words = ", blogs_words)
## [1] "Blog Words = 37546239"
paste("Blog Longest line length = ", blogs_line_length)
## [1] "Blog Longest line length = 40833"
news_file_size <- file.info("en_US.news.txt")$size / 1024 / 1024
news_lines <- length(data_news)
news_words <- sum(stri_count_words(data_news))
news_line_length <- max(nchar(data_news))
paste("News File size (MB) = ", news_file_size)
## [1] "News File size (MB) = 196.277512550354"
paste("News Lines = ", news_lines)
## [1] "News Lines = 1010242"
paste("News Words = ", news_words)
## [1] "News Words = 34762395"
paste("News Longest line length = ", news_line_length)
## [1] "News Longest line length = 11384"
twits_file_size <- file.info("en_US.twitter.txt")$size / 1024 / 1024
twits_lines <- length(data_twitter)
twits_words <- sum(stri_count_words(data_twitter))
twits_line_length <- max(nchar(data_twitter))
paste("Twits File size (MB) = ", twits_file_size)
## [1] "Twits File size (MB) = 159.364068984985"
paste("Twits Lines = ", twits_lines)
## [1] "Twits Lines = 2360148"
paste("Twits Words = ", twits_words)
## [1] "Twits Words = 30093413"
paste("Twits Longest line length = ", twits_line_length)
## [1] "Twits Longest line length = 140"
data_summary <- data.frame(
file_names = c("Blogs","News","Twits"),
file_size = c(blogs_file_size, news_file_size, twits_file_size),
line_counts = c(blogs_lines, news_lines, twits_lines),
word_counts = c(blogs_words, news_words, twits_words),
max_line_length = c(blogs_line_length, news_line_length, twits_line_length)
)
data_summary
## file_names file_size line_counts word_counts max_line_length
## 1 Blogs 200.4242 899288 37546239 40833
## 2 News 196.2775 1010242 34762395 11384
## 3 Twits 159.3641 2360148 30093413 140
Data size >>> The blogs file is slightly larger than the news or Twitter files.
Lines >>> The Twitter file has the highest line count but the smallest file size, most likely because each tweet is limited to 140 characters.
Words >>> Blogs contain the most words, surprisingly surpassing the news. One would expect the news data to have more words, since news items are usually in-depth articles, whereas blog posts are often quick expressions of the blogger’s mind.
Line length >>> Blogs also contain the longest line, at 40,833 characters.
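Since the same four summaries are computed three times above, a small helper function could remove the repetition. A minimal sketch (summarize_corpus_file is a hypothetical helper, not used elsewhere in this report; it assumes the data vectors loaded above are still in memory):
# Sketch: compute all four summaries for one file in a single call
summarize_corpus_file <- function(path, lines) {
  data.frame(file_size_mb    = file.info(path)$size / 1024 / 1024,
             line_count      = length(lines),
             word_count      = sum(stri_count_words(lines)),
             max_line_length = max(nchar(lines)))
}
# Example: summarize_corpus_file("en_US.blogs.txt", data_blogs)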
In mobile apps, the amount of data that can be downloaded and stored on the device is limited, so we create a sample of the original data for the algorithm to draw inferences from.
set.seed(7765287)
size_per_sample <- 1000
# Draw 1,000 lines from each source (with replacement) and combine them
blogs_sample <- sample(data_blogs,
                       size = size_per_sample,
                       replace = TRUE)
news_sample <- sample(data_news,
                      size = size_per_sample,
                      replace = TRUE)
twits_sample <- sample(data_twitter,
                       size = size_per_sample,
                       replace = TRUE)
samples <- c(blogs_sample,
             news_sample,
             twits_sample)
paste("Length of samples = ", length(samples))
## [1] "Length of samples = 3000"
writeLines(samples, "samples.txt")
We will use a list of profanity or “bad words” in the English language, such as those banned by major search providers (Google, Yahoo, etc.).
More information can be found here: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
We will then clean the corpus in a series of steps until it is ready for tokenization.
# Read the sample created before and create an initial corpus
samples_file <- file("samples.txt")
samples_lines <- readLines(samples_file)
# Create the corpus based on the sample data
# NOTE: in the most recent version of tm we need to use VCorpus and not Corpus
samples_corpus <- VCorpus(
VectorSource(
samples_lines))
# More info here:
# https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
# Step 1 - convert the corpus to UTF-8
samples_corpus <- tm_map(samples_corpus,
content_transformer(
function(x) iconv(x,
to="UTF-8",
sub="byte")))
# Step 2 - convert the corpus to lowercase
samples_corpus <- tm_map(samples_corpus,
content_transformer(tolower))
# Step 3 - remove punctuation, including - , ; : " ', etc
samples_corpus <- tm_map(samples_corpus,
content_transformer(removePunctuation),
preserve_intra_word_dashes=TRUE)
# Step 4 - filter "bad words"
custom_bad_words <- readLines("custom-bad-words-nlp.txt")
samples_corpus <- tm_map(samples_corpus,
removeWords,
custom_bad_words)
# Step 5 - remove numbers
samples_corpus <- tm_map(samples_corpus,
content_transformer(removeNumbers))
# Step 6 - remove URLs
urls_to_remove <- function(x) gsub("http[[:alnum:]]*", "", x)
samples_corpus <- tm_map(samples_corpus,
content_transformer(urls_to_remove))
# Step 7 - remove stop words in English.
# Note: this might remove words that are needed to understand the content of the sentence
# More info here: https://en.wikipedia.org/wiki/Stop_words
samples_corpus <- tm_map(samples_corpus,
removeWords,
stopwords("english"))
# Step 8 - Remove white spaces
samples_corpus <- tm_map(samples_corpus,
stripWhitespace)
# Step 9 - the final step converts the corpus to plain text
samples_corpus <- tm_map(samples_corpus,
PlainTextDocument)
# Write the cleaned corpus to a CSV file; useful for debugging
write.csv(samples_corpus,
"samples_corpus.csv",
row.names=F)
# Create an R data object to easily reload the corpus
saveRDS(samples_corpus, file = "clean_corpus.RData")
# Print a few lines of the corpus to make sure it was cleaned correctly
for (i in 1:25){
print(samples_corpus[[i]]$content)
}
## [1] "aureolin"
## [1] "grace receive day"
## [1] "luckily someone tough work – pam downtown case contact barnes noble determined figure sand try get books onto ipads need – couple weeks away"
## [1] "yet christopher nolan needed loosen collar inception find humor playfulness can come odd group -developed personalities entering peoples dreams though tom hardys eames comes closest fun job film seems feel ashamed embrace caper weighted crazy wife story device dicaprio already tackled shutter island even end revolutionary road insight relationship merely supposed supply depth make movie serious capital s say final shot movie leaves question whether cobb still dream state reaction matter one way "
## [1] " breakfast room glass box extends beyond main volume capture views indian creek morning sun main vertical element core residence ’ diameter void houses partially suspended stair spirals ’ width ground level ’ width roof garden third level"
## [1] " even wonders nature right empty temperature-controlled abode caught spiders creepy-crawlies moved stuff now theyve returned reinforcementsand extended families awesome"
## [1] "whats moral story go ballet man dont get tricky worth risk wives smarter us"
## [1] "psi worried might sleep horrific brain injury though wear helmet always called nurse hotline apparently everything ok just ice neck take easy command boyfriend give gentle massages sounds good "
## [1] "even though surrounded sweet nothing lovers valentine’s never really moved remember celebrating mother best friend high school years february suddenly got way going home will buy mother artificial rose us will spend day ordinarily best friend’s case ’s tradition give card heart shape stuffs together remembered days gone ’s valentine"
## [1] " metal drunkards metal bros"
## [1] "april "
## [1] "now know need give reasons must support anna hazare beneath anna’s dignity beg make case support fighting abusive corrupt regime still prefer ostriches country nah country vast term even ’m ostrich fact may even ostrich city delhi let’s just keep ostriches city ’s quick recap"
## [1] "behold will make thee new sharp threshing instrument teeth thou shalt thresh mountains beat small shalt make hills chaff thou shalt fan wind shall carry away whirlwind shall scatter thou shalt rejoice lord shalt glory holy one "
## [1] "full-room shots nearly entire house except kids rooms bathrooms"
## [1] " nature consider equality something grasped"
## [1] " lit face many times number"
## [1] " important note one reasons democrats leftists avowed marxists easily steal away rights country “ people” grown largely ignorant rights begin many know rights know abstractly example many citizens know religious freedom yet aren’t familiar way first amendment worded don’t understand amendment doesn’t just recognize -given right religious freedom actually bars government interfering religious exercises words first amendment says “congress shall make law respecting establishment religion prohibiting free exercise thereof” ties hands government hands people yet ignorance ’ve allowed things reversed"
## [1] "janeane garofalo"
## [1] "one instant accident"
## [1] "oddly type judgement sometimes swings way meryl streep portraying toothy bouffant iron lady international screens renewed discussion britains now-frail ex-prime minister feel must mention margaret thatcher neighbour"
## [1] " remember elementary school use walk school knee-deep snow walk home snow even deeper uphill ways school knew teachers principals people basic values parents basically demanded respect effort adhering specific values values biblical one ever complained biblical values laws based promote hard work along compassion"
## [1] "secondly found wanting book end sign good story got last paragraph went back re-read last couple pages just remind ending im spoilers left feeling hope thats never bad thing"
## [1] " economy covered business media newspapers magazines cable channels tends get short-shrift mainstream media well may shortage reporters actually understand economic trends beyond obvious"
## [1] "endre beeing high sugar birthday party"
## [1] " sense loneliness"
file_clean_corpus <- readRDS("clean_corpus.RData")
# load the clean corpus into a data frame
clean_corpus <- data.frame(text =
unlist(
sapply(
file_clean_corpus,
`[`,
"content")),
stringsAsFactors = FALSE)
# clean_corpus  # uncomment to inspect the full data frame
We’re going to read all the words from the corpus and create tokens of one, two, and three words, called n-grams (unigrams, bigrams, and trigrams, respectively).
corpus_unigrams <- NGramTokenizer(clean_corpus,
Weka_control(min = 1,
max = 1,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_unigrams <- data.frame(table(corpus_unigrams))
corpus_unigrams <- corpus_unigrams[
order(
corpus_unigrams$Freq,
decreasing = TRUE),]
names(corpus_unigrams) <- c("Word", "Frequency")
corpus_unigrams$Word <- as.character(corpus_unigrams$Word)
write.csv(corpus_unigrams[
corpus_unigrams$Frequency > 1,
],
"corpus_unigrams.csv",
row.names=F)
corpus_unigrams <- read.csv("corpus_unigrams.csv",
stringsAsFactors = F)
saveRDS(corpus_unigrams, file = "corpus_unigrams.RData")
corpus_unigrams <- readRDS("corpus_unigrams.RData")
ggplt <- ggplot(data = corpus_unigrams[1:5,],
                aes(x = reorder(Word, Frequency),  # order bars by frequency
                    y = Frequency))
ggplt2 <- ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Unigram Frequency") +
  xlab("Word")
ggplt3 <- ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
ggplt3
head(corpus_unigrams)
## Word Frequency
## 1 said 300
## 2 one 276
## 3 will 255
## 4 just 218
## 5 like 212
## 6 can 187
corpus_bigrams <- NGramTokenizer(clean_corpus,
Weka_control(min = 2,
max = 2,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_bigrams <- data.frame(
table(
corpus_bigrams))
corpus_bigrams <- corpus_bigrams[
order(
corpus_bigrams$Freq,
decreasing = TRUE),]
names(corpus_bigrams) <- c("Words","Frequency")
corpus_bigrams$Words <- as.character(corpus_bigrams$Words)
head(corpus_bigrams)
## Words Frequency
## 21544 last year 21
## 26691 new york 19
## 45539 years ago 15
## 11261 dont know 12
## 21514 last night 12
## 26623 new jersey 12
bi_ggplt <- ggplot(data = corpus_bigrams[1:5,],
                   aes(x = reorder(Words, Frequency),  # order bars by frequency
                       y = Frequency))
bi_ggplt2 <- bi_ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Bigram Frequency") +
  xlab("Bigram")
bi_ggplt3 <- bi_ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
bi_ggplt3
# Split each bigram into its two component words
split_words <- strsplit(corpus_bigrams$Words, split = " ")
corpus_bigrams <- transform(corpus_bigrams,
                            word_one = sapply(split_words, "[[", 1),
                            word_two = sapply(split_words, "[[", 2))
corpus_bigrams <- data.frame(word_one = corpus_bigrams$word_one,
word_two = corpus_bigrams$word_two,
frequency = corpus_bigrams$Frequency,
stringsAsFactors = FALSE)
write.csv(corpus_bigrams[
corpus_bigrams$frequency > 1,],
"corpus_bigrams.csv",
row.names = F)
corpus_bigrams <- read.csv("corpus_bigrams.csv",
stringsAsFactors = F)
saveRDS(corpus_bigrams,"corpus_bigrams.RData")
head(corpus_bigrams)
## word_one word_two frequency
## 1 last year 21
## 2 new york 19
## 3 years ago 15
## 4 dont know 12
## 5 last night 12
## 6 new jersey 12
corpus_trigram <- NGramTokenizer(clean_corpus,
Weka_control(min = 3,
max = 3,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_trigram <- data.frame(
table(
corpus_trigram))
corpus_trigram <- corpus_trigram[
order(
corpus_trigram$Freq,
decreasing = TRUE),]
names(corpus_trigram) <- c("Words","Frequency")
corpus_trigram$Words <- as.character(corpus_trigram$Words)
head(corpus_trigram)
## Words Frequency
## 1324 aesthetic correction circuitry 5
## 15301 fort jackson south 5
## 19788 im pretty sure 5
## 20905 jackson south carolina 5
## 19443 human rights abuses 4
## 23008 let us know 4
tri_ggplt <- ggplot(data = corpus_trigram[1:5,],
                    aes(x = reorder(Words, Frequency),  # order bars by frequency
                        y = Frequency))
tri_ggplt2 <- tri_ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Trigram Frequency") +
  xlab("Trigram")
tri_ggplt3 <- tri_ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
tri_ggplt3
split_word_3 <- strsplit(corpus_trigram$Words,
split = " ")
# corpus sample with all trigrams split
head(split_word_3)
## [[1]]
## [1] "aesthetic" "correction" "circuitry"
##
## [[2]]
## [1] "fort" "jackson" "south"
##
## [[3]]
## [1] "im" "pretty" "sure"
##
## [[4]]
## [1] "jackson" "south" "carolina"
##
## [[5]]
## [1] "human" "rights" "abuses"
##
## [[6]]
## [1] "let" "us" "know"
corpus_trigram <- transform(corpus_trigram,
one = sapply(split_word_3,"[[",1),
two = sapply(split_word_3,"[[",2),
three = sapply(split_word_3,"[[",3))
# corpus after transformed with split words
head(corpus_trigram)
## Words Frequency one two three
## 1324 aesthetic correction circuitry 5 aesthetic correction circuitry
## 15301 fort jackson south 5 fort jackson south
## 19788 im pretty sure 5 im pretty sure
## 20905 jackson south carolina 5 jackson south carolina
## 19443 human rights abuses 4 human rights abuses
## 23008 let us know 4 let us know
corpus_trigram <- data.frame(word_one = corpus_trigram$one,
word_two = corpus_trigram$two,
word_three = corpus_trigram$three,
frequency = corpus_trigram$Frequency,
stringsAsFactors = FALSE)
# corpus after added to a new trigram corpus dataframe
head(corpus_trigram)
## word_one word_two word_three frequency
## 1 aesthetic correction circuitry 5
## 2 fort jackson south 5
## 3 im pretty sure 5
## 4 jackson south carolina 5
## 5 human rights abuses 4
## 6 let us know 4
write.csv(corpus_trigram[
corpus_trigram$frequency > 1,],
"corpus_trigram.csv",
row.names = F)
corpus_trigram <- read.csv("corpus_trigram.csv",
stringsAsFactors = F)
saveRDS(corpus_trigram,"corpus_trigram.RData")
# final corpus after saved an R object
head(corpus_trigram)
## word_one word_two word_three frequency
## 1 aesthetic correction circuitry 5
## 2 fort jackson south 5
## 3 im pretty sure 5
## 4 jackson south carolina 5
## 5 human rights abuses 4
## 6 let us know 4
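These n-gram tables are the building blocks for the prediction algorithm. As a preview, here is a minimal sketch of how they could drive next-word prediction with a simple frequency-based backoff (predict_next_word is a hypothetical helper; the final algorithm may differ):
# Sketch: look up the last two words typed in the trigram table and
# back off to the bigram table when no trigram matches
predict_next_word <- function(phrase, n = 3) {
  words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
  len <- length(words)
  # Try the trigram table first, matching on the last two words
  if (len >= 2) {
    hits <- corpus_trigram[corpus_trigram$word_one == words[len - 1] &
                           corpus_trigram$word_two == words[len], ]
    if (nrow(hits) > 0)
      return(head(hits[order(-hits$frequency), "word_three"], n))
  }
  # Back off to the bigram table, matching on the last word only
  hits <- corpus_bigrams[corpus_bigrams$word_one == words[len], ]
  if (nrow(hits) > 0)
    return(head(hits[order(-hits$frequency), "word_two"], n))
  character(0)  # no match in either table
}
# Example (results depend on the sampled corpus):
# predict_next_word("new york")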