A zip archive was downloaded from the Coursera project assignment web page, and the files used in this report were extracted from it. These files contain English language text from three different sources: blogs, news articles, and Twitter.
The text files were read into an “R” environment and the following summary information was produced for each. As can be seen below, the maximum line length in the Twitter file matches the (old) limit of 140 characters.
Number of lines per source file:

| source_text | line_count |
|---|---|
| blogs | 899288 |
| news | 1010242 |
| twitter | 2360148 |
Longest line (in characters) per source file:

| source_text | longest_line |
|---|---|
| blogs | 40833 |
| news | 11384 |
| twitter | 140 |
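A minimal sketch of how these summaries could be produced in R is shown below; the file names (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and the use of `readLines()` are assumptions, not the exact code used for this report.

```r
library(dplyr)
library(tibble)

# Assumed file names for the three extracted source files
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

# For each file: read the lines, then record how many there are
# and the length of the longest one
summaries <- lapply(names(files), function(src) {
  lines <- readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE)
  tibble(source_text  = src,
         line_count   = length(lines),
         longest_line = max(nchar(lines)))
})

bind_rows(summaries)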
The text lines were split into individual words (also known as tokens), and from these the number of words per source was determined:
| source_text | word_count |
|---|---|
| blogs | 37546246 |
| news | 34762395 |
| twitter | 30093369 |
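A sketch of the tokenisation step is shown below. It assumes the lines have been gathered into a tibble `all_lines` with columns `source_text` and `text` (an assumed intermediate), and that the tidytext package is used for tokenisation.

```r
library(dplyr)
library(tidytext)

# `all_lines`: one row per line of text, columns `source_text` and `text` (assumed)
words <- all_lines %>%
  unnest_tokens(word, text)          # one row per word (token)

# Word counts per source
words %>%
  count(source_text, name = "word_count")
```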
We can then see which words were used most frequently per source text (the top 10 are shown).
| source_text | word | word_count |
|---|---|---|
| news | the | 1974366 |
| blogs | the | 1860156 |
| blogs | and | 1094401 |
| blogs | to | 1069440 |
| twitter | the | 937405 |
| news | to | 906145 |
| blogs | a | 900362 |
| news | and | 889511 |
| news | a | 878035 |
| blogs | of | 876799 |
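The per-source top-10 counts could be obtained from the same `words` tibble along these lines (a sketch, reusing the assumed tibble from the previous snippet):

```r
library(dplyr)

# Ten most frequent words within each source
words %>%
  count(source_text, word, name = "word_count") %>%
  group_by(source_text) %>%
  slice_max(word_count, n = 10) %>%
  ungroup() %>%
  arrange(desc(word_count))
```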
Over all three texts, the most frequently used words are:
| word | word_count |
|---|---|
| the | 4771927 |
| to | 2764230 |
| and | 2422450 |
| a | 2389755 |
| of | 2010936 |
| in | 1657973 |
| i | 1657335 |
| for | 1103087 |
| is | 1075727 |
| that | 1042522 |
It can be seen from the above that stop words (words such as “the” and “a” - see [https://en.wikipedia.org/wiki/Stop_words]) dominate the counts. These were removed, along with digits, to show the most frequent non-stop, non-numeric words. The top 10 words are shown.
| word | word_count |
|---|---|
| time | 224774 |
| day | 175983 |
| love | 161651 |
| people | 159280 |
| life | 91716 |
| rt | 89702 |
| home | 83247 |
| week | 78095 |
| night | 77360 |
| game | 74838 |
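One way to perform this filtering is sketched below, using tidytext’s built-in `stop_words` lexicon; the exact stop word list used for the table above is an assumption.

```r
library(dplyr)
library(tidytext)

data("stop_words")   # tidytext's built-in stop word lexicon (assumed here)

clean_words <- words %>%
  anti_join(stop_words, by = "word") %>%   # remove stop words
  filter(!grepl("^[0-9]+$", word))         # remove purely numeric tokens

# Ten most frequent remaining words across all three sources
clean_words %>%
  count(word, name = "word_count", sort = TRUE) %>%
  head(10)
```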
It is interesting that “time” and “day” are used so frequently.
The plot below shows the 10 most frequently used words per source text, excluding stop words and digits (0, 1, 2, etc.).
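A possible way to build such a plot with ggplot2 is sketched below, reusing the `clean_words` tibble from the previous snippet; the plotting choices are illustrative rather than the exact figure code.

```r
library(dplyr)
library(ggplot2)
library(tidytext)   # for reorder_within() / scale_x_reordered()

# Top 10 non-stop, non-numeric words per source
top_clean <- clean_words %>%
  count(source_text, word, name = "word_count") %>%
  group_by(source_text) %>%
  slice_max(word_count, n = 10) %>%
  ungroup()

# One panel per source, bars ordered within each panel
ggplot(top_clean,
       aes(x = reorder_within(word, word_count, source_text), y = word_count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source_text, scales = "free_y") +
  scale_x_reordered() +
  labs(x = "word", y = "word_count")
```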
The proposed next step is to split the source texts into bigram tokens held in a single data set. Bigrams [https://en.wikipedia.org/wiki/Bigram] are two adjacent words.
Bigrams were created from the entire texts - the 10 most frequent are:
| bigram | bigram_count |
|---|---|
| of the | 431130 |
| in the | 408595 |
| to the | 213669 |
| for the | 201206 |
| on the | 197419 |
| to be | 162723 |
| at the | 142545 |
| and the | 125852 |
| in a | 119416 |
| with the | 106231 |
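The bigram tokenisation could be done in the same tidytext style, again assuming the `all_lines` tibble from earlier:

```r
library(dplyr)
library(tidytext)

# Split each line into two-word (bigram) tokens
bigrams <- all_lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))    # lines with fewer than two words yield NA bigrams

# Count bigrams over all three texts combined
bigram_counts <- bigrams %>%
  count(bigram, name = "bigram_count", sort = TRUE)

head(bigram_counts, 10)
```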
Given a word that the user has typed in, the prediction algorithm would then use the frequency counts in the bigram data set to suggest the next possible words. It could limit the prediction to four suggested words.
For example, if the user types in “the”, the algorithm would look for the most frequent bigrams whose first word is “the” and return their second words.
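A minimal sketch of this lookup is shown below; the function name `predict_next` and its arguments are hypothetical, and in practice the bigram column would be split into its two words once, up front, rather than on every call.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: given the previous word, return up to `max_suggestions`
# candidate next words, ordered by bigram frequency
predict_next <- function(prev_word, bigram_counts, max_suggestions = 4) {
  bigram_counts %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    filter(word1 == prev_word) %>%
    arrange(desc(bigram_count)) %>%
    head(max_suggestions) %>%
    pull(word2)
}

# Example: suggest up to four words to follow "the"
predict_next("the", bigram_counts)
```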
Perhaps this could be made more interesting by using the following rules for creating the list of predicted words: