This is a Milestone report for the Capstone Project of the Coursera Data Science Specialization.
Three sample text compilations were provided, each in English and roughly 200 MB in size; the sources were Blogs, News, and Twitter. The objectives were: first, to demonstrate some familiarity with the texts through statistics, plots, and the like; second, to set out ideas and goals for the App and its underlying algorithm.
Here are some brief numerical facts about each text file, obtained by downloading and reading the files, tokenizing them into words, sentences, and lines, forming selected n-grams, and then counting and computing ratios:
| Textfile | en_US.blogs | en_US.news | en_US.twitter |
|---|---|---|---|
| Bytes | 210,160,014 | 205,811,889 | 167,105,338 |
| Words | 36,764,537 | 32,897,504 | 28,995,825 |
| Distinct Words | 245,252 | 203,003 | 285,206 |
| Distinct 3-grams | 19,586,263 | 18,700,988 | 16,301,478 |
| Sentences | 2,083,684 | 1,830,494 | 2,360,148 |
| Lines (= texts) | 899,288 | 1,010,242 | 2,811,218 |
| Words/Distinct Words | 149.9 | 162.1 | 101.7 |
| Words/Sentence | 17.6 | 18.0 | 12.3 |
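For illustration, a minimal NLTK sketch along the following lines could produce the counts above (the file name and tokenizer choices here are assumptions, not necessarily the exact code used):

```python
# Sketch: basic corpus statistics with NLTK.
# Requires nltk.download("punkt") the first time, for the word and sentence tokenizers.
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams

with open("en_US.twitter.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

text = " ".join(lines)
words = word_tokenize(text.lower())
sentences = sent_tokenize(text)

print("Lines:", len(lines))
print("Words:", len(words))
print("Distinct words:", len(set(words)))
print("Distinct 3-grams:", len(set(ngrams(words, 3))))
print("Sentences:", len(sentences))
print("Words per sentence:", round(len(words) / len(sentences), 1))
print("Words per distinct word:", round(len(words) / len(set(words)), 1))
```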
One probably shouldn’t try to read too much into these numbers without more context, although it is not too surprising that Tweets have significantly fewer words per sentence.
The following plots show the most common words, and their frequencies, from each file. Note that I have left in the so-called “Stop Words”, as the input text and the prediction might include them.
(Plots: frequency distribution of the top 20 words in Blogs, News, and Twitter.)
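A minimal sketch of how such a plot can be produced with NLTK's FreqDist (the file name is an assumption and the actual plot styling differed):

```python
# Sketch: plot the 20 most common words in a file (stop words retained).
from nltk import FreqDist
from nltk.tokenize import word_tokenize

with open("en_US.blogs.txt", encoding="utf-8") as f:
    words = word_tokenize(f.read().lower())

fdist = FreqDist(words)
print(fdist.most_common(20))  # list of (word, count) pairs, most frequent first
fdist.plot(20)                # frequency plot of the top 20 words (uses matplotlib)
```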
The following goals are subject to future refinement.
The App should present an interactive window that accepts some input text and offers several predictions for the next word. One should be able to select one of these predictions, or type a different next word, and iterate the process. Perhaps the style of text would be indicated in advance, to reflect, say, the three styles of text in the samples.
A first pass at an algorithm has been completed. It applies the Backoff methodology to match input text against n-grams from the sample texts. It first preprocesses the text into words, lower-casing them and discarding punctuation such as commas and periods. It then pre-computes n-grams of various lengths from the texts, treating all but the last word of each n-gram as input text and the last word as the prediction. For each input text, the predictions are tabulated by decreasing frequency, and the first several are proposed. Backoff handles the case where the actual input text does not appear among the sample n-grams: the first word is discarded and the shortened input text is matched against the next shorter n-grams, continuing until a prediction can be made. (The process terminates, since an empty input falls back to predictions of the most common words.)
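A simplified sketch of this backoff lookup follows, assuming the pre-computed n-gram tables are stored as dictionaries mapping a context tuple to a frequency-sorted list of next words (the names and the data structure are illustrative assumptions, not the exact implementation):

```python
# Sketch of the backoff prediction step.
# ngram_tables[k] maps a k-word context tuple to candidate next words,
# sorted by decreasing frequency in the sample texts.
# ngram_tables[0][()] holds the overall most common words, so the loop always terminates.

def predict(words, ngram_tables, max_context=4, n_predictions=3):
    """Return up to n_predictions next-word candidates for a list of input words."""
    context = tuple(words[-max_context:])
    while context:
        candidates = ngram_tables.get(len(context), {}).get(context)
        if candidates:
            return candidates[:n_predictions]
        context = context[1:]  # back off: drop the first word and retry with a shorter context
    # Empty context: fall back to the most common words overall.
    return ngram_tables[0][()][:n_predictions]
```

For example, `predict("thanks for the".split(), ngram_tables)` would first look up the three-word context, then back off to "for the", then "the", and finally the unconditional most-common words.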
Several possible improvements come to mind. The current approach is space-intensive (well over a gigabyte for the three files together). One might replace words with integer codes, as sketched below. One might drop input strings that occur only rarely. And of course an entirely different algorithm may suggest itself. Further clarification of the course goals is needed before pursuing these too far.
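As an illustration of the integer-coding idea (a sketch only, not an implemented change), each distinct word could be assigned a small integer so the n-gram tables store tuples of ints rather than strings:

```python
# Sketch: replace words by integer codes to shrink the n-gram tables.
word_to_id = {}   # word -> integer code
id_to_word = []   # integer code -> word, for decoding predictions before display

def encode(word):
    """Return the integer code for a word, assigning a new one if needed."""
    if word not in word_to_id:
        word_to_id[word] = len(id_to_word)
        id_to_word.append(word)
    return word_to_id[word]

# An n-gram key such as ("happy", "new") becomes (encode("happy"), encode("new")),
# and a predicted code c is shown to the user as id_to_word[c].
```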
As there was a month between course 9 and the capstone, I studied Python and carried out these computations in it. The primary package was NLTK (Natural Language Toolkit). Locally, Python snippets can be evaluated inside an RStudio Markdown (.Rmd) document; it remains to be tested whether they can also be evaluated on the Shiny server.
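For reference, a Python chunk in an .Rmd document looks like the sketch below; locally, knitr's Python engine evaluates it (the reticulate package lets chunks share a single Python session). Whether the same works once deployed to the Shiny server is the open question.

````markdown
```{python}
# A Python chunk inside an R Markdown document, evaluated by knitr's Python engine.
import nltk
print(nltk.__version__)
```
````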