The goal of this project is to create a Shiny app that uses a prediction algorithm to suggest the next word based on the previous input. For the prediction of the next word we will develop an n-gram algorithm. To train the algorithm, we have sample texts from various sources (Twitter, blogs, news), provided by SwiftKey. The sample texts are available in several languages, but in this project we focus on English. In this milestone we load, analyze and normalize the provided data.
The following table shows a simple overview of the contents of the three files:
                  News       Blogs      Twitter
file_size_in_mb   196        200        159
line_count        2360148    899288     1010242
word_count        34762395   37546246   30093410
It can be seen that we have a lot of data, which is good for training an algorithm. However, we must also keep in mind that the algorithm should perform well and consume as little memory as possible.
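Such an overview can be produced with a few lines of R. The following is a minimal sketch, assuming the three SwiftKey files lie in the working directory under the usual names (the file names and the use of the stringi package are assumptions, not part of the original analysis):

library(stringi)

files <- c(News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt")

overview <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(file_size_in_mb = round(file.size(f) / 1024^2),  # size on disk in MB
    line_count      = length(lines),                 # number of text lines
    word_count      = sum(stri_count_words(lines)))  # total number of words
})
overview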
We now know that we have a lot of data, so we have to think about which parts we can omit without affecting the prediction. The goal is to predict words, so we can remove pure numbers without hesitation. We also do not want to predict offensive or vulgar words; a file from Google (https://code.google.com/archive/p/badwordslist/downloads) contains a list of words we can omit.
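A minimal sketch of these two normalizations with the "tm" package, assuming the bad-words list has been downloaded to a local file badwords.txt and that sample_lines holds the raw text lines of one source (both names are assumptions):

library(tm)

# assumed local copy of the Google bad-words list linked above
badwords <- readLines("badwords.txt", encoding = "UTF-8")

# drop tokens that consist only of digits, keep words like "2day" for later correction
clean  <- gsub("\\b[0-9]+\\b", " ", sample_lines, perl = TRUE)

corpus <- VCorpus(VectorSource(clean))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, badwords)   # drop offensive or vulgar words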
Stopwords can also be filtered out because they are so common that they do not improve the prediction algorithm. We use the stopwords contained in the R package "tm"; here is an extract:
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
However, we must remember to make these words available again later. For a first attempt, we leave them out.
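Removing the stopwords with "tm" could look like this (a sketch that reuses the corpus object from the previous step):

library(tm)

head(stopwords("english"), 20)   # the extract shown above

# remove the English stopwords from the corpus built earlier
corpus <- tm_map(corpus, removeWords, stopwords("english"))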
Next, we review words that also contain numbers. These could be times (e.g. "9am") or ordinal positions (e.g. "the 3rd one"). We could leave them out, but especially on Twitter we have to check whether abbreviated word creations occur that we should keep and correct (e.g. "2gether" -> "together").
                          News    Blogs   Twitter
word_with_numbers_count   23348   38324   99873
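Such counts can be obtained, for example, with a simple regular expression over the tokenized text. A sketch, assuming lines holds the raw text of one source:

library(stringi)

tokens <- unlist(stri_extract_all_words(lines))

# tokens that mix digits and letters, e.g. "9am" or "2day"
with_numbers <- tokens[grepl("[0-9]", tokens) & grepl("[A-Za-z]", tokens)]

length(with_numbers)                                      # word_with_numbers_count
head(sort(table(with_numbers), decreasing = TRUE), 10)    # most frequent "number words"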
As expected, there are a lot more words on Twitter that contain numbers. Here is a wordcloud for Twitter:
As expected, there are word creations that cannot simply be thrown away; "2day", for example, occurs 991 times, so we should correct these words (e.g. "2day" -> "today").
Here is a small list from the top 100 words with numbers on Twitter:
rank   word      freq   corrected
16     2day      991    today
21     2nite     819    tonight
45     2morrow   378    tomorrow
61     2night    284    tonight
82     2gether   101    together
83     4ever      99    forever
97     2moro      78    tomorrow
The top 100 words with numbers in "News" and "Blogs" do not contain such word creations.
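The corrections from the table can be applied with a small lookup table before the n-grams are built. A sketch (the helper name fix_number_words is my own, not part of the original analysis):

corrections <- c("2day"    = "today",
                 "2nite"   = "tonight",
                 "2morrow" = "tomorrow",
                 "2night"  = "tonight",
                 "2gether" = "together",
                 "4ever"   = "forever",
                 "2moro"   = "tomorrow")

fix_number_words <- function(text) {
  # replace each abbreviated form by its corrected word, whole words only
  for (w in names(corrections)) {
    text <- gsub(paste0("\\b", w, "\\b"), corrections[[w]], text,
                 ignore.case = TRUE, perl = TRUE)
  }
  text
}

fix_number_words("see you 2nite, 4ever yours")   # "see you tonight, forever yours"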
Obviously, there are also problems with typos. We could try to correct them, but for now we ignore typos.
As a last step, we check the word frequencies.
 0%   10%   20%   30%   40%   50%   60%   70%   80%   90%      100%
  1     1     1     1     1     1     2     3     5    20   4771927
This simple quantile report shows that 80% of the words occur 5 times or fewer. The line chart of the word frequencies shows how steeply they fall off. If we omit the top 1,000 words, we can see the slope slightly better:
Words that occur fewer than 5 times will hardly improve our prediction model but do affect its performance, so I will drop these words.
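A sketch of how the frequency cut-off can be applied, assuming tokens holds all words of the combined, normalized corpus (the variable name is an assumption):

freq <- table(tokens)

quantile(as.numeric(freq), probs = seq(0, 1, 0.1))   # the quantile report shown above

keep   <- names(freq)[freq >= 5]     # vocabulary kept for the prediction model
tokens <- tokens[tokens %in% keep]   # drop the rare words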
Based on this analysis, I will apply the following normalizations for the next step, the creation of n-grams and a prediction model:
- remove pure numbers
- remove offensive and vulgar words (bad-words list)
- remove stopwords (for a first attempt; they must be made available again later)
- correct abbreviated "number words" from Twitter (e.g. "2day" -> "today")
- ignore typos for now
- drop words that occur fewer than 5 times
The goal of this milestone was to analyze the data and to find out which normalizations we can apply.
Based on this analysis, the following steps are necessary to successfully complete this project:
- clean and normalize the full data set as described above
- build n-gram frequency tables from the normalized text
- develop the next-word prediction model, keeping performance and memory consumption in mind
- wrap the prediction model in a Shiny app
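As a small preview of the n-gram creation, a base-R sketch (the helper make_ngrams is my own; the actual implementation may use a dedicated package instead):

# tokens is assumed to be the normalized word vector from the previous steps
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  # slide a window of length n over the word vector and paste each window together
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

head(sort(table(make_ngrams(tokens, n = 3)), decreasing = TRUE))   # most frequent trigrams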