assignment

datasciencegit30

11/30/2020

## SHINY APP

### Introduction

The objective of the Capstone project is to build an application that predicts the next word(s) from a phrase entered by the user. The Shiny application is built on a corpus of English text drawn from US blogs, news, and Twitter. Several R libraries are used to clean and process the text, after which the n-gram model is built.

Libraries used:

- tm
- stringi
- qdap
- quanteda


### Cleaning and Processing Text

The data sources are plain text documents with a total file size of about 600MB: the blog file is about 264MB, the news file about 202MB, and the Twitter file about 317MB.

Because the files are so large, each one is randomly sampled using a binomial distribution so that the computer can handle the file size and processing.
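A minimal sketch of binomial sampling, assuming the sample is drawn line by line: each line is kept independently with a fixed probability. The function name `sample_lines`, the `rate` and `seed` parameters, and the toy input are hypothetical and not taken from the original project.

```r
# Keep each line independently with probability `rate` (one Bernoulli draw per line).
# `sample_lines`, `rate`, and `seed` are illustrative names, not the project's own.
sample_lines <- function(lines, rate = 0.01, seed = 1234) {
  set.seed(seed)  # make the sample reproducible
  keep <- rbinom(n = length(lines), size = 1, prob = rate) == 1
  lines[keep]
}

# Toy stand-in for one corpus file; in practice `lines` would come from readLines().
toy_lines <- sprintf("line %d", 1:1000)
toy_sample <- sample_lines(toy_lines, rate = 0.1)
length(toy_sample)  # about 10% of the input; the exact count depends on the seed
```

Drawing one Bernoulli trial per line means the sample size is not fixed in advance, but the whole file never has to be shuffled, which keeps the step cheap for large inputs.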


These are the steps used to clean and process the data:

- Convert to Latin encoding to remove lines with non-English words
- Combine the samples into one file
- Remove URLs, numeric characters, punctuation, and stop words (e.g. a, the, an, etc.)
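The cleaning steps above can be sketched in base R. This is an illustration only: the original project presumably relies on the listed libraries for these steps, and the function name `clean_text`, the short `stop_words` vector, the lowercasing step, and the sample inputs are all assumptions made for this example.

```r
# Hypothetical, deliberately short stop word list for illustration only.
stop_words <- c("a", "an", "the", "of", "to", "and")

clean_text <- function(lines, stops = stop_words) {
  # Lines that cannot be represented in Latin-1 come back as NA and are dropped.
  converted <- iconv(lines, from = "UTF-8", to = "latin1")
  lines <- lines[!is.na(converted)]

  lines <- tolower(lines)                                           # assumption: normalise case
  lines <- gsub("(https?://|www\\.)\\S+", " ", lines, perl = TRUE)  # URLs
  lines <- gsub("[0-9]+", " ", lines)                               # numeric characters
  lines <- gsub("[[:punct:]]+", " ", lines)                         # punctuation

  # Remove stop words as whole words only.
  pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
  lines <- gsub(pattern, " ", lines, perl = TRUE)

  # Collapse the leftover whitespace and drop lines that ended up empty.
  lines <- trimws(gsub("[[:space:]]+", " ", lines))
  lines[nzchar(lines)]
}

# Combine the (toy) samples into one vector before cleaning.
blog_sample    <- c("Visit http://example.com for 3 tips!")
news_sample    <- c("The cat and the hat.")
twitter_sample <- c("\u4f60\u597d world")  # contains characters outside Latin-1
combined <- c(blog_sample, news_sample, twitter_sample)
cleaned <- clean_text(combined)
```

Replacing each removed token with a space, and collapsing whitespace only at the end, avoids accidentally gluing neighbouring words together.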


### Shiny Application

After the data is cleaned, it is broken up into unigrams, bigrams, trigrams, and quadgrams. A Shiny application is then built. The application first reads the unigrams, bigrams, trigrams, and quadgrams from CSV files. The user enters text into the text box, and the application first searches the quadgrams for a match; if no match is found, it falls back to the trigrams, and so on.
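The back-off lookup described above can be sketched as follows, leaving out the Shiny user interface. The n-gram tables are built in memory here rather than read from CSV, and the names `build_ngrams` and `predict_next`, the two-column table layout (`ngram`, `freq`), and the toy corpus are assumptions for illustration, not the project's actual code.

```r
# Count the n-grams of order `n` in a vector of cleaned, space-separated lines.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# `tables` is assumed to be a list of four tables: tables[[n]] holds the n-grams.
predict_next <- function(phrase, tables) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  for (n in 4:2) {                 # quadgrams first, then trigrams, then bigrams
    k <- n - 1                     # number of context words an n-gram needs
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    tab <- tables[[n]]
    hits <- tab[startsWith(tab$ngram, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(sub(".* ", "", best)) # keep only the last word of the n-gram
    }
  }
  # No match at any order: fall back to the most frequent unigram.
  tables[[1]]$ngram[which.max(tables[[1]]$freq)]
}

# Toy corpus standing in for the cleaned sample.
corpus <- c("i love new york city", "i love new york pizza",
            "we love new york city", "new york city never sleeps")
tables <- lapply(1:4, function(n) build_ngrams(corpus, n))

predict_next("i love new york", tables)  # matched at the quadgram level
predict_next("visit york", tables)       # backs off to the bigram level
```

In an application that loads precomputed tables, the `lapply` call would be replaced by reading each table with `read.csv()`; the lookup itself would stay the same.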


- R version 3.3.2 (2016-10-31)
- R version (short form): 3.3.2
- File version: 1.0.0
- Author Profile: [prashi]
- GitHub: Source Code

Additional session information: