The goal of this project is to build an app that predicts the next word of an English sentence, given the preceding one, two, or three words. To achieve this, several text collections from news, blogs, and Twitter are analyzed. The most frequently used words, as well as frequently occurring combinations of two or three successive words, are extracted. We then check how much of a given text can be covered by these pairs and triples of successive words. This milestone report also sketches the further steps for creating a prediction model.
The data used to train the prediction model can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It consists of news articles, blog posts, and Twitter tweets.
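The raw data might be obtained and read along the following lines. This is a minimal sketch; the file paths inside the archive (`final/en_US/...`) are assumptions based on the standard layout of the zip file.

```r
# Sketch only: download the corpus and read the three English source files.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}

news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```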
The following table shows the size of the given data.
| source  | lines     | words      |
|---------|-----------|------------|
| news    | 1,010,242 | 34,372,530 |
| blogs   | 899,288   | 37,334,131 |
| twitter | 2,360,148 | 30,373,543 |
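The counts above could be produced roughly as follows, assuming the vectors `news`, `blogs`, and `twitter` from the previous sketch.

```r
library(stringi)

# Sketch: line and word counts per source.
data.frame(
  sources      = c("news", "blogs", "twitter"),
  number_lines = c(length(news), length(blogs), length(twitter)),
  number_words = c(sum(stri_count_words(news)),
                   sum(stri_count_words(blogs)),
                   sum(stri_count_words(twitter)))
)
```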
Since we do not want to count or predict profane words, we remove all articles and tweets that contain them. The list of profane words is downloaded from https://github.com/RobertJGabriel/Google-profanity-words.
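One way the filtering might look is sketched below: every line containing at least one word from the list is dropped. The local file name of the downloaded word list and the word-boundary regex approach are assumptions, not the exact implementation.

```r
# Sketch of the profanity filter; "profanity_words.txt" is an assumed local
# copy of the downloaded list.
profanity <- readLines("profanity_words.txt", encoding = "UTF-8")

# One regular expression with word boundaries, so that e.g. "class" is not flagged.
pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")

news    <- news[!grepl(pattern, news, ignore.case = TRUE)]
blogs   <- blogs[!grepl(pattern, blogs, ignore.case = TRUE)]
twitter <- twitter[!grepl(pattern, twitter, ignore.case = TRUE)]
```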
Processing the whole corpus takes too much time, so we reduce the data to a 10% sample. The three data sources are then combined into one data set for further analysis.
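The sampling and combination step might look like the following sketch; the seed and the use of `sample()` per source are assumptions for illustration.

```r
set.seed(1234)  # arbitrary seed for a reproducible sample

# Sketch: keep roughly 10% of the lines of each source and combine them
# into a single character vector for further analysis.
frac <- 0.1
data <- c(sample(news,    round(length(news)    * frac)),
          sample(blogs,   round(length(blogs)   * frac)),
          sample(twitter, round(length(twitter) * frac)))
```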
The data size is now as follows:
| number of lines | number of words |
|-----------------|-----------------|
| 358,383         | 7,284,402       |
To explore the data further, we separate the lines into unigrams, bigrams, and trigrams. The lists of unigrams, bigrams, and trigrams are sorted by frequency, and the ten most frequent of each are summarized in this report.
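The unigram counts can be built roughly as in the following sketch, assuming the combined sample `data` from above and the tidytext tokenizer.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Sketch: tokenize the sampled lines into single words (unigrams) and
# count them, sorted by frequency.
corpus <- tibble(line = seq_along(data), text = data)

unigrams <- corpus %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigrams
```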
The ten most frequently used words are listed below:
## # A tibble: 160,125 x 2
## word n
## <chr> <int>
## 1 the 327193
## 2 to 197224
## 3 a 167498
## 4 and 161749
## 5 of 133898
## 6 i 125304
## 7 in 114067
## 8 for 80672
## 9 you 76176
## 10 is 75971
## # … with 160,115 more rows
Analogous to the single words, we separate the data into bigrams and trigrams and list the ten most frequent ones below.
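A sketch of the bigram and trigram tokenization, following the same pattern as the unigrams above:

```r
# Sketch: tokenize into bigrams and trigrams and count them, sorted by
# frequency; short lines can yield NA n-grams, which are dropped.
bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)

head(bigrams, 10)
head(trigrams, 10)
```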
To get a better understanding of the word frequencies in the text, we analyse how many of the most frequent words from the sorted unigram list are necessary to cover a certain percentage of the whole text.
We can see that up to about 75% of the whole text can be covered by a relatively small number of frequently used words. Above 75%, the number of words required grows nearly exponentially.
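The coverage curve can be computed from the sorted unigram counts as in the sketch below, using the cumulative share of all word occurrences.

```r
# Sketch: cumulative coverage of the sorted unigram frequencies and the
# number of distinct words needed to reach a given coverage level.
coverage <- cumsum(unigrams$n) / sum(unigrams$n)

words_needed <- function(p) which(coverage >= p)[1]
words_needed(0.50)
words_needed(0.75)
words_needed(0.90)
```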
The code used for this analysis is available at https://github.com/725sora/Language_Prediction1/blob/master/language_prediction1.Rmd