The goal of this project is to create a Shiny app that uses a prediction algorithm to suggest the next word based on the previous input. For the prediction of the next word we will develop an n-gram algorithm. To train the algorithm, we have sample texts from various sources (Twitter, blogs, news), provided by SwiftKey. The sample texts are available in several languages, but in this project we focus on English. In this milestone we load, analyze and normalize the provided data.
The following table shows a simple overview of the contents of the three files:
                  News       Blogs      Twitter
file_size_in_mb   196        200        159
line_count        2360148    899288     1010242
word_count        34762395   37546246   30093410
It can be seen that we have a lot of data, which is good for training an algorithm. However, we must also keep in mind that the algorithm should perform well and consume as little memory as possible.
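Such an overview can be produced with a few lines of R. The following is a minimal sketch, assuming the three SwiftKey files lie in the working directory under the usual names (the file names and the use of the stringi package are assumptions, not part of the original analysis):

library(stringi)

files <- c(News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt")

overview <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(file_size_in_mb = round(file.size(f) / 1024^2),  # size on disk in MB
    line_count      = length(lines),                 # number of text lines
    word_count      = sum(stri_count_words(lines)))  # total number of words
})
overview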
We now know that we have a lot of data, so we have to think about which parts we can omit without affecting the prediction. The goal is to predict words, so we can remove pure numbers without hesitation. We also do not want to predict offensive or vulgar words; a file from Google (https://code.google.com/archive/p/badwordslist/downloads) contains a list of words we can omit.
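A minimal sketch of these two normalizations with the "tm" package, assuming the bad-words list has been downloaded to a local file badwords.txt and that sample_lines holds the raw text lines of one source (both names are assumptions):

library(tm)

# assumed local copy of the Google bad-words list linked above
badwords <- readLines("badwords.txt", encoding = "UTF-8")

# drop tokens that consist only of digits, keep words like "2day" for later correction
clean  <- gsub("\\b[0-9]+\\b", " ", sample_lines, perl = TRUE)

corpus <- VCorpus(VectorSource(clean))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, badwords)   # drop offensive or vulgar words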
Stopwords can also be filtered out because they are so common that they do not improve the prediction algorithm. We use the stopwords contained in the R package "tm"; here is an extract:
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
However, we must remember to make these words available again later. For a first attempt, we leave them out.
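Removing the stopwords with "tm" could look like this (a sketch that reuses the corpus object from the previous step):

library(tm)

head(stopwords("english"), 20)   # the extract shown above

# remove the English stopwords from the corpus built earlier
corpus <- tm_map(corpus, removeWords, stopwords("english"))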
Next, we review words that also contain numbers. These could be times (e.g. "9am") or ordinal positions (e.g. "the 3rd one"). We could leave them out, but especially on Twitter we have to check whether abbreviated word creations occur that we should keep and correct (e.g. "2gether" -> "together").
                          News    Blogs   Twitter
word_with_numbers_count   23348   38324   99873
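Such counts can be obtained, for example, with a simple regular expression over the tokenized text. A sketch, assuming lines holds the raw text of one source:

library(stringi)

tokens <- unlist(stri_extract_all_words(lines))

# tokens that mix digits and letters, e.g. "9am" or "2day"
with_numbers <- tokens[grepl("[0-9]", tokens) & grepl("[A-Za-z]", tokens)]

length(with_numbers)                                      # word_with_numbers_count
head(sort(table(with_numbers), decreasing = TRUE), 10)    # most frequent "number words"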
As expected, there are a lot more words on Twitter that contain numbers. Here is a wordcloud for Twitter:
As expected, there are word creations that cannot simply be thrown away; "2day", for example, occurs 991 times, so we should correct these words (e.g. "2day" -> "today").
Here is a small list from the top 100 words with numbers on Twitter:
rank   word      freq   corrected
16     2day      991    today
21     2nite     819    tonight
45     2morrow   378    tomorrow
61     2night    284    tonight
82     2gether   101    together
83     4ever      99    forever
97     2moro      78    tomorrow
The top 100 words with numbers in "News" and "Blogs" do not contain such word creations.
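The corrections from the table can be applied with a small lookup table before the n-grams are built. A sketch (the helper name fix_number_words is my own, not part of the original analysis):

corrections <- c("2day"    = "today",
                 "2nite"   = "tonight",
                 "2morrow" = "tomorrow",
                 "2night"  = "tonight",
                 "2gether" = "together",
                 "4ever"   = "forever",
                 "2moro"   = "tomorrow")

fix_number_words <- function(text) {
  # replace each abbreviated form by its corrected word, whole words only
  for (w in names(corrections)) {
    text <- gsub(paste0("\\b", w, "\\b"), corrections[[w]], text,
                 ignore.case = TRUE, perl = TRUE)
  }
  text
}

fix_number_words("see you 2nite, 4ever yours")   # "see you tonight, forever yours"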
Obviously, there are also problems with typos. We could try to correct them, but for now we ignore typos.
As a last step, we check the word frequencies.
 0%   10%   20%   30%   40%   50%   60%   70%   80%   90%      100%
  1     1     1     1     1     1     2     3     5    20   4771927
This simple quantile report shows that 80% of the words occur 5 times or fewer. The line chart of the word frequencies shows how steeply they fall off. If we omit the top 1,000 words, we can see the slope slightly better:
Words that occur fewer than 5 times will hardly improve our prediction model but do affect its performance, so I will drop these words.
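A sketch of how the frequency cut-off can be applied, assuming tokens holds all words of the combined, normalized corpus (the variable name is an assumption):

freq <- table(tokens)

quantile(as.numeric(freq), probs = seq(0, 1, 0.1))   # the quantile report shown above

keep   <- names(freq)[freq >= 5]     # vocabulary kept for the prediction model
tokens <- tokens[tokens %in% keep]   # drop the rare words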
Based on this analysis, I will apply the following normalizations for the next step, the creation of n-grams and a prediction model:
- remove pure numbers
- remove offensive and vulgar words (bad-words list)
- remove stopwords (for a first attempt; they must be made available again later)
- correct abbreviated "number words" from Twitter (e.g. "2day" -> "today")
- ignore typos for now
- drop words that occur fewer than 5 times
The goal of this milestone was to analyze the data and to find out which normalizations we can apply.
Based on this analysis, the following steps are necessary to successfully complete this project:
- clean and normalize the full data set as described above
- build n-gram frequency tables from the normalized text
- develop the next-word prediction model, keeping performance and memory consumption in mind
- wrap the prediction model in a Shiny app
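As a small preview of the n-gram creation, a base-R sketch (the helper make_ngrams is my own; the actual implementation may use a dedicated package instead):

# tokens is assumed to be the normalized word vector from the previous steps
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  # slide a window of length n over the word vector and paste each window together
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

head(sort(table(make_ngrams(tokens, n = 3)), decreasing = TRUE))   # most frequent trigrams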