The first task is to bring in the files, limiting the sample to 1% of the ingested lines for analysis, selected at random via a binomial sampling function. I opted to keep the cleaning simple, relying on out-of-the-box functions rather than customized ones. I also opted to keep stopwords, as they are central to predicting what users ought to see when presented with suggestions.
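Below is a minimal sketch of that ingest-and-sample step, assuming rbinom() is the binomial sampler referred to above and using a hypothetical file name; the actual ingest code may differ slightly.

set.seed(1234)                                     # assumed seed, for reproducible sampling

con   <- file("en_US.blogs.txt", "r")              # hypothetical path to the blogs feed
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

keep         <- rbinom(length(blogs), size = 1, prob = 0.01) == 1   # keep ~1% of the lines
sample_blogs <- blogs[keep]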
Observing the data via charts and simple word counts, we see that the “News” feed tends to have fewer unique words than the “Twitter” and “Blogs” feeds. We also see via the n-gram analysis, not surprisingly, that the “News” feed tends to have better constructed sentences and 3- or 4-gram combinations. I personally struggled with the tokenization process, as I tried to keep apostrophes so as to retain the likes of don’t, can’t, won’t, etc., which tend to be words we use in casual language. I ultimately chose a tokenization that removed punctuation, for simplicity.
## [1] "Number of unique words in twitter are:7885"
## [1] "Number of unique words in blogs are:11836"
## [1] "Number of unique words in news are:4091"
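A minimal sketch of how such a count can be produced for the blogs sample, using the out-of-the-box tm transformations mentioned above and deliberately leaving stopwords in (sample_blogs is the 1% sample from the earlier sketch):

library(tm)

corp <- VCorpus(VectorSource(sample_blogs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)            # also drops apostrophes (the compromise above)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)

tdm <- TermDocumentMatrix(corp)                    # one row per unique term
print(paste0("Number of unique words in blogs are:", nrow(tdm)))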
This modeling task is performed on only one of the data sets, namely blogs, so as not to duplicate the analysis.
We effectively constructed the simplest data model by feeding the n-gram results into a data frame sorted by highest occurrence, and returning the most frequent last term when prompted with one, two or three words as input.
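A minimal sketch of that lookup for the bigram case, assuming the counts sit in a data frame bigram_df with the term1, term2 and occurence columns shown in the output below (the tri- and quadgram cases follow the same pattern):

library(dplyr)

predict_from_bigram <- function(bigram_df, word) {
  bigram_df %>%
    filter(term1 == word) %>%        # keep rows whose first term matches the input
    arrange(desc(occurence)) %>%     # highest occurrence first
    slice(1) %>%                     # most frequent continuation
    select(term2)
}

# predict_from_bigram(bigram_df, "last")   # returns "night" in the blogs sample below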
Using object.size(), we can see how quickly the storage requirement grows. For example, for the blogs feed, using 1% of the feed, we get a tokenized storage need of anywhere between 9 and 12 MB for 2- to 4-grams.
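The sizes reported below are obtained along these lines (bigram_df being the assumed name of the bigram data frame):

print(paste0("size of dataframe for bigram is:", object.size(bigram_df)))
print(object.size(bigram_df), units = "MB")        # same figure, human-readable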
Clearly, storage will become an issue as we load more data, so we will need to consider, for example, keeping only a certain percentile of the data, or using Markov chains.
With respect to unseen n-grams, we propose to use the findAssocs() function, which finds terms correlated with an input term.
Here, we are using “last” as an example in the blogs sample.
## [1] "size of dataframe for bigram is:9132080"
## [1] "Net memory increase for bigram is:387364"
## # A tibble: 1 x 1
##   term2
##   <chr>
## 1 night
## # A tibble: 54 x 3
##    term1 term2    occurence
##    <chr> <chr>        <dbl>
##  1 last  night           66
##  2 last  year            60
##  3 last  post            20
##  4 last  day             16
##  5 last  week            15
##  6 last  few             13
##  7 last  and             11
##  8 last  saturday        11
##  9 last  several         10
## 10 last  in               8
## # ... with 44 more rows
Here, we are using “last night” as an example in the blogs sample.
## [1] "size of dataframe for trigram is:11062480"
## [1] "Net memory increase for trigram is:354976"
## # A tibble: 1 x 1
##   term3
##   <chr>
## 1 i
## # A tibble: 12 x 4
##    term1 term2 term3   occurence
##    <chr> <chr> <chr>       <dbl>
##  1 last  night i              20
##  2 last  night plus           13
##  3 last  night what            6
##  4 last  night just            5
##  5 last  night because         4
##  6 last  night there           4
##  7 last  night at              3
##  8 last  night was             3
##  9 last  night don             2
## 10 last  night it              2
## 11 last  night we              2
## 12 last  night dh              1
Here, we are using “i love to” as an example in the blogs sample.
## [1] "size of dataframe for quadgram is:11761848"
## [1] "Net memory increase for quadgram is:348831"
## # A tibble: 1 x 1
##   term4
##   <chr>
## 1 do
## # A tibble: 2 x 5
##   term1 term2 term3 term4 occurence
##   <chr> <chr> <chr> <chr>     <dbl>
## 1 i     love  to    do            4
## 2 i     love  to    shop          1
Using the word “last”, we try to find the highest correlated terms in the blogs text.
## $last
## numeric(0)
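The call behind this result is sketched below, assuming tdm is the term-document matrix built earlier and an assumed correlation cut-off of 0.5:

library(tm)

findAssocs(tdm, terms = "last", corlimit = 0.5)
# numeric(0) means no term reached the cut-off in the 1% sample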
Clearly, the balance between speed and memory will need to be considered and tuned during the prediction implementation. A few low-hanging fruits come to mind, such as applying removeSparseTerms(), which I have not used on the sample, or limiting storage to a certain percentile of the n-grams, as a trade-off against the number of possible predicted words.
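For reference, a minimal sketch of that sparsity trimming idea, with an assumed sparsity threshold of 0.99 (not a value used in this report):

library(tm)

tdm_small <- removeSparseTerms(tdm, sparse = 0.99)   # drop terms absent from more than 99% of documents
print(object.size(tdm_small), units = "MB")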
With regard to Markov chains, we plan to implement a trie-like data structure to predict the next character and word.
Implementation will tell how to trade off speed and memory. We plan to start by predicting the next word from the last three, as accuracy is likely to be best (i.e. using a quadgram); if no match is found, we will fall back to a trigram, and next to a bigram.
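A sketch of that back-off logic, assuming quadgram_df, trigram_df and bigram_df data frames with the term1..termN and occurence columns shown earlier:

library(dplyr)

predict_next <- function(words, quadgram_df, trigram_df, bigram_df) {
  n <- length(words)
  if (n >= 3) {                                      # try the quadgram first
    hit <- quadgram_df %>%
      filter(term1 == words[n - 2], term2 == words[n - 1], term3 == words[n]) %>%
      arrange(desc(occurence)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$term4)
  }
  if (n >= 2) {                                      # back off to the trigram
    hit <- trigram_df %>%
      filter(term1 == words[n - 1], term2 == words[n]) %>%
      arrange(desc(occurence)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$term3)
  }
  hit <- bigram_df %>%                               # last resort: the bigram
    filter(term1 == words[n]) %>%
    arrange(desc(occurence)) %>%
    slice(1)
  if (nrow(hit) > 0) return(hit$term2)
  NA_character_                                      # nothing found at any level
}

# predict_next(c("i", "love", "to"), quadgram_df, trigram_df, bigram_df)   # "do" in the sample above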
I suspect we will have to revisit the decision to suppress the apostrophes in contracted words, as a contracted word will turn up more often than the properly structured one (i.e. isn’t will be found more frequently than is not). In the absence of the apostrophe, the model will suggest an improper word, such as “isn”. To be revisited.
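If we do revisit it, a cleanup along these lines (an assumed replacement for removePunctuation that keeps straight ASCII apostrophes) would preserve the contractions:

library(tm)

# keep lower-case letters, spaces and apostrophes; replace everything else with a space
keep_apostrophes <- content_transformer(function(x) gsub("[^a-z' ]", " ", tolower(x)))

# corp <- tm_map(corp, keep_apostrophes)             # used in place of removePunctuation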
Finally, we are planning to also play with machine learning, to the extent that the technology and speed allow, whereby we would train on the current data set and predict based on inputs.