The first task is to bring in the files, limiting the sample to 1% of the ingested lines for analysis, selected at random via a binomial sampling function. I opted to keep the cleaning simple, relying on out-of-the-box functions rather than customized ones. I also opted to keep stopwords, as they are central to predicting what users ought to see when presented with suggestions.
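Below is a minimal sketch of that ingest-and-sample step, assuming rbinom() is the binomial sampler referred to above and using a hypothetical file name; the actual ingest code may differ slightly.

set.seed(1234)                                     # assumed seed, for reproducible sampling

con   <- file("en_US.blogs.txt", "r")              # hypothetical path to the blogs feed
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

keep         <- rbinom(length(blogs), size = 1, prob = 0.01) == 1   # keep ~1% of the lines
sample_blogs <- blogs[keep]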
Observing the data via charts and simple word counts, we see that the “News” feed tends to have fewer unique words than the “Twitter” and “Blogs” feeds. We also see via the n-gram analysis, not surprisingly, that the “News” feed tends to have better constructed sentences and 3- or 4-gram combinations. I personally struggled with the tokenization process, as I tried to keep apostrophes so as to retain the likes of don’t, can’t, won’t, etc., which tend to be words we use in casual language. I ultimately chose a tokenization that removed punctuation, for simplicity.
## [1] "Number of unique words in twitter are:7885"
## [1] "Number of unique words in blogs are:11836"
## [1] "Number of unique words in news are:4091"
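A minimal sketch of how such a count can be produced for the blogs sample, using the out-of-the-box tm transformations mentioned above and deliberately leaving stopwords in (sample_blogs is the 1% sample from the earlier sketch):

library(tm)

corp <- VCorpus(VectorSource(sample_blogs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)            # also drops apostrophes (the compromise above)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)

tdm <- TermDocumentMatrix(corp)                    # one row per unique term
print(paste0("Number of unique words in blogs are:", nrow(tdm)))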
This modeling task is performed on only one of the data sets, namely blogs, so as not to duplicate the analysis.
We effectively constructed the simplest data model by feeding the n-gram results into a data frame sorted by highest occurrence, and returning the most frequent last term when prompted with one, two or three words as input.
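A minimal sketch of that lookup for the bigram case, assuming the counts sit in a data frame bigram_df with the term1, term2 and occurence columns shown in the output below (the tri- and quadgram cases follow the same pattern):

library(dplyr)

predict_from_bigram <- function(bigram_df, word) {
  bigram_df %>%
    filter(term1 == word) %>%        # keep rows whose first term matches the input
    arrange(desc(occurence)) %>%     # highest occurrence first
    slice(1) %>%                     # most frequent continuation
    select(term2)
}

# predict_from_bigram(bigram_df, "last")   # returns "night" in the blogs sample below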
Using object.size(), we can see how quickly the storage requirement grows. For example, for the blogs feed, using 1% of the feed, we get a tokenized storage need of anywhere between 9 and 12 MB for 2- to 4-grams.
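The sizes reported below are obtained along these lines (bigram_df being the assumed name of the bigram data frame):

print(paste0("size of dataframe for bigram is:", object.size(bigram_df)))
print(object.size(bigram_df), units = "MB")        # same figure, human-readable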
Clearly, storage will become an issue as we load more data, so we will need to consider, for example, keeping only a certain percentile of the data, or using Markov chains.
With respect to unseen n-grams, we propose to use the findAssocs() function, which finds terms correlated with an input term.
Here, we are using “last” as an example in the blogs sample.
## [1] "size of dataframe for bigram is:9132080"
## [1] "Net memory increase for bigram is:387364"
## # A tibble: 1 x 1
##   term2
##   <chr>
## 1 night
## # A tibble: 54 x 3
##    term1 term2    occurence
##    <chr> <chr>        <dbl>
##  1 last  night           66
##  2 last  year            60
##  3 last  post            20
##  4 last  day             16
##  5 last  week            15
##  6 last  few             13
##  7 last  and             11
##  8 last  saturday        11
##  9 last  several         10
## 10 last  in               8
## # ... with 44 more rows
Here, we are using “last night” as an example in the blogs sample.
## [1] "size of dataframe for trigram is:11062480"
## [1] "Net memory increase for trigram is:354976"
## # A tibble: 1 x 1
##   term3
##   <chr>
## 1 i
## # A tibble: 12 x 4
##    term1 term2 term3   occurence
##    <chr> <chr> <chr>       <dbl>
##  1 last  night i              20
##  2 last  night plus           13
##  3 last  night what            6
##  4 last  night just            5
##  5 last  night because         4
##  6 last  night there           4
##  7 last  night at              3
##  8 last  night was             3
##  9 last  night don             2
## 10 last  night it              2
## 11 last  night we              2
## 12 last  night dh              1
Here, we are using “i love to” as an example in the blogs sample.
## [1] "size of dataframe for quadgram is:11761848"
## [1] "Net memory increase for quadgram is:348831"
## # A tibble: 1 x 1
##   term4
##   <chr>
## 1 do
## # A tibble: 2 x 5
##   term1 term2 term3 term4 occurence
##   <chr> <chr> <chr> <chr>     <dbl>
## 1 i     love  to    do            4
## 2 i     love  to    shop          1
Using the word “last”, we try to find the highest correlated terms in the blogs text.
## $last
## numeric(0)
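The call behind this result is sketched below, assuming tdm is the term-document matrix built earlier and an assumed correlation cut-off of 0.5:

library(tm)

findAssocs(tdm, terms = "last", corlimit = 0.5)
# numeric(0) means no term reached the cut-off in the 1% sample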
Clearly, the balance between speed and memory will need to be considered and tuned during the prediction implementation. A few low-hanging fruits come to mind, such as applying removeSparseTerms(), which I have not used on the sample, or limiting storage to a certain percentile of the n-grams, as a trade-off against the number of possible predicted words.
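For reference, a minimal sketch of that sparsity trimming idea, with an assumed sparsity threshold of 0.99 (not a value used in this report):

library(tm)

tdm_small <- removeSparseTerms(tdm, sparse = 0.99)   # drop terms absent from more than 99% of documents
print(object.size(tdm_small), units = "MB")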
With regard to Markov chains, we plan to implement a trie-like data structure to predict the next character and word.
Implementation will tell how to trade off speed and memory. We plan to start by predicting the next word from the last three, as accuracy is likely to be best (i.e. using a quadgram); if no match is found, we will fall back to a trigram, and next to a bigram.
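A sketch of that back-off logic, assuming quadgram_df, trigram_df and bigram_df data frames with the term1..termN and occurence columns shown earlier:

library(dplyr)

predict_next <- function(words, quadgram_df, trigram_df, bigram_df) {
  n <- length(words)
  if (n >= 3) {                                      # try the quadgram first
    hit <- quadgram_df %>%
      filter(term1 == words[n - 2], term2 == words[n - 1], term3 == words[n]) %>%
      arrange(desc(occurence)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$term4)
  }
  if (n >= 2) {                                      # back off to the trigram
    hit <- trigram_df %>%
      filter(term1 == words[n - 1], term2 == words[n]) %>%
      arrange(desc(occurence)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$term3)
  }
  hit <- bigram_df %>%                               # last resort: the bigram
    filter(term1 == words[n]) %>%
    arrange(desc(occurence)) %>%
    slice(1)
  if (nrow(hit) > 0) return(hit$term2)
  NA_character_                                      # nothing found at any level
}

# predict_next(c("i", "love", "to"), quadgram_df, trigram_df, bigram_df)   # "do" in the sample above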
I suspect we will have to revisit the decision to suppress the apostrophes in contracted words, as a contracted word will turn up more often than the properly structured one (i.e. isn’t will be found more frequently than is not). In the absence of the apostrophe, the model will suggest an improper word, such as “isn”. To be revisited.
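If we do revisit it, a cleanup along these lines (an assumed replacement for removePunctuation that keeps straight ASCII apostrophes) would preserve the contractions:

library(tm)

# keep lower-case letters, spaces and apostrophes; replace everything else with a space
keep_apostrophes <- content_transformer(function(x) gsub("[^a-z' ]", " ", tolower(x)))

# corp <- tm_map(corp, keep_apostrophes)             # used in place of removePunctuation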
Finally, we are planning to also play with machine learning, to the extent that the technology and speed allow, whereby we would train on the current data set and predict based on inputs.