The goal of this project is to build an app that predicts the next word of an English sentence, given the preceding one, two, or three words. To achieve this, several text collections from news, blogs, and Twitter are analyzed. The most frequently used words, as well as frequently occurring combinations of two or three successive words, are extracted. We then check how much of a given text can be covered by these pairs and triples of successive words. This milestone report also sketches the further steps for creating a prediction model.
The data used to train the prediction model can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It consists of news articles, blog posts, and Twitter tweets.
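The raw data might be obtained and read along the following lines. This is a minimal sketch; the file paths inside the archive (`final/en_US/...`) are assumptions based on the standard layout of the zip file.

```r
# Sketch only: download the corpus and read the three English source files.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}

news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```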
The following table shows the size of the given data.
| source  | lines     | words      |
|---------|-----------|------------|
| news    | 1,010,242 | 34,372,530 |
| blogs   | 899,288   | 37,334,131 |
| twitter | 2,360,148 | 30,373,543 |
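The counts above could be produced roughly as follows, assuming the vectors `news`, `blogs`, and `twitter` from the previous sketch.

```r
library(stringi)

# Sketch: line and word counts per source.
data.frame(
  sources      = c("news", "blogs", "twitter"),
  number_lines = c(length(news), length(blogs), length(twitter)),
  number_words = c(sum(stri_count_words(news)),
                   sum(stri_count_words(blogs)),
                   sum(stri_count_words(twitter)))
)
```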
Since we do not want to count or predict profane words, we remove all articles and tweets that contain them. The list of profane words is downloaded from https://github.com/RobertJGabriel/Google-profanity-words.
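One way the filtering might look is sketched below: every line containing at least one word from the list is dropped. The local file name of the downloaded word list and the word-boundary regex approach are assumptions, not the exact implementation.

```r
# Sketch of the profanity filter; "profanity_words.txt" is an assumed local
# copy of the downloaded list.
profanity <- readLines("profanity_words.txt", encoding = "UTF-8")

# One regular expression with word boundaries, so that e.g. "class" is not flagged.
pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")

news    <- news[!grepl(pattern, news, ignore.case = TRUE)]
blogs   <- blogs[!grepl(pattern, blogs, ignore.case = TRUE)]
twitter <- twitter[!grepl(pattern, twitter, ignore.case = TRUE)]
```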
Processing the whole corpus takes too much time, so we reduce the data to a 10% sample. The three data sources are then combined into one data set for further analysis.
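The sampling and combination step might look like the following sketch; the seed and the use of `sample()` per source are assumptions for illustration.

```r
set.seed(1234)  # arbitrary seed for a reproducible sample

# Sketch: keep roughly 10% of the lines of each source and combine them
# into a single character vector for further analysis.
frac <- 0.1
data <- c(sample(news,    round(length(news)    * frac)),
          sample(blogs,   round(length(blogs)   * frac)),
          sample(twitter, round(length(twitter) * frac)))
```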
The data size is now as follows:
| number of lines | number of words |
|-----------------|-----------------|
| 358,383         | 7,284,402       |
To explore the data further, we separate the lines into unigrams, bigrams, and trigrams. The lists of unigrams, bigrams, and trigrams are sorted by frequency, and the ten most frequent of each are summarized in this report.
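The unigram counts can be built roughly as in the following sketch, assuming the combined sample `data` from above and the tidytext tokenizer.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Sketch: tokenize the sampled lines into single words (unigrams) and
# count them, sorted by frequency.
corpus <- tibble(line = seq_along(data), text = data)

unigrams <- corpus %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigrams
```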
The ten most frequently used words are listed below:
## # A tibble: 160,125 x 2
## word n
## <chr> <int>
## 1 the 327193
## 2 to 197224
## 3 a 167498
## 4 and 161749
## 5 of 133898
## 6 i 125304
## 7 in 114067
## 8 for 80672
## 9 you 76176
## 10 is 75971
## # … with 160,115 more rows
Analogous to the single words, we separate the data into bigrams and trigrams and list the ten most frequent ones below.
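A sketch of the bigram and trigram tokenization, following the same pattern as the unigrams above:

```r
# Sketch: tokenize into bigrams and trigrams and count them, sorted by
# frequency; short lines can yield NA n-grams, which are dropped.
bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)

head(bigrams, 10)
head(trigrams, 10)
```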
To get a better understanding of the word frequencies in the text, we analyse how many of the most frequent words from the sorted unigram list are necessary to cover a certain percentage of the whole text.
We can see that up to about 75% of the whole text can be covered by a relatively small number of frequently used words. Above 75%, the number of words required grows nearly exponentially.
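The coverage curve can be computed from the sorted unigram counts as in the sketch below, using the cumulative share of all word occurrences.

```r
# Sketch: cumulative coverage of the sorted unigram frequencies and the
# number of distinct words needed to reach a given coverage level.
coverage <- cumsum(unigrams$n) / sum(unigrams$n)

words_needed <- function(p) which(coverage >= p)[1]
words_needed(0.50)
words_needed(0.75)
words_needed(0.90)
```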
The code used for this analysis is available at https://github.com/725sora/Language_Prediction1/blob/master/language_prediction1.Rmd