Introduction

This exploratory data analysis addresses two main questions: What are the basic characteristics of the available text data? And can the word sequences needed for word prediction be extracted from it?

This analysis has been written in a style that is accessible for non-data scientists. For this reason, most of the concrete R code is not shown in this report. However, all results and graphs have been created in a reproducible manner with R files that are available on request.

Basic information about the available data

This analysis is based on three data files, containing texts from US blogs, news sites and Twitter. The files contain roughly between 160 and 210 million characters, representing about 30 to 38 million words. The average word length, for all three files, is between 5 and 6 characters. As could be expected, Twitter entries are very short (about 13 words per line), while news and blog entries tend to be longer (34 and 42 words per line on average, respectively).

# Gather line, word, and character statistics for the three source files
file.data <- get.basic.file.information(c(blog.file, news.file, twitter.file), data.dir)
knitr::kable(file.data)
Filename             Lines     Words      Characters   Longest_Line   Avg_Words_per_Line   Avg_Characters_per_Word
en_US.blogs.txt      899288    37334114   208623081    40833          41.51519             5.588001
en_US.news.txt       1010242   34365936   205243643    11384          34.01753             5.972299
en_US.twitter.txt    2360148   30359804   166816544    173            12.86352             5.494651
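
The helper get.basic.file.information is defined in the accompanying R files. As a rough illustration only, comparable figures for a single file could be computed in base R along these lines (the word-splitting rule and the definition of the averages are assumptions, not the actual implementation):

basic.file.info <- function(file) {
  lines <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(lines, "\\s+"))       # split on whitespace
  words <- words[words != ""]
  data.frame(Filename                = basename(file),
             Lines                   = length(lines),
             Words                   = length(words),
             Characters              = sum(nchar(lines)),
             Longest_Line            = max(nchar(lines)),   # in characters
             Avg_Words_per_Line      = length(words) / length(lines),
             Avg_Characters_per_Word = sum(nchar(lines)) / length(words))
}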

Selecting representative training and test data

The available data files are very extensive. Using all of the data would require more computing resources than are available for this analysis. Therefore, a representative excerpt of altogether about 10,000 lines is taken from the three files and stored in a single training data set (“train.txt”). All three sources are represented with about the same number of words. This is achieved by randomly selecting lines from the blog, news, and Twitter files in the proportion 3:4:10. In the same way, a test data set (“test.txt”) containing about 5,000 lines is created for the later verification of the prediction model.

# Draw random excerpts in the proportion 3:4:10 and write the training and test files
create.data.excerpt(c(blog.file, news.file, twitter.file), data.dir, 
                    c(3, 4, 10), c(10000, 5000), c("en_US.train.txt", "en_US.test.txt"), file.data)
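
For illustration, the proportional sampling could be sketched in base R as follows (sample.excerpt is a hypothetical stand-in for create.data.excerpt; it assumes the file variables hold full paths and, unlike the actual helper, makes no effort to keep the training and test lines disjoint):

set.seed(42)  # make the random excerpt reproducible
sample.excerpt <- function(files, ratio, total.lines, out.file) {
  # number of lines to draw from each file, in the given proportion
  n.per.file <- round(total.lines * ratio / sum(ratio))
  excerpt <- unlist(mapply(function(f, n) {
    sample(readLines(f, encoding = "UTF-8", skipNul = TRUE), n)
  }, files, n.per.file, SIMPLIFY = FALSE))
  writeLines(sample(excerpt), out.file)  # shuffle and write the combined lines
}

sample.excerpt(c(blog.file, news.file, twitter.file),
               c(3, 4, 10), 10000, "en_US.train.txt")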

Decomposing the training file

The training file created above, with approximately 10,000 lines from all three data sources, is now decomposed into single words (monograms), word pairs (bigrams), and word triplets (trigrams).

# Decompose the training data into mono-, bi-, and trigram frequency tables
decompose.text.file(train.file, data.dir, monogram, bigram, trigram)
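
The actual decompose.text.file is part of the accompanying R files and stores its results in hash tables, whose counts are accessed with values() below. As a simplified sketch, the same decomposition could be expressed with plain count tables (the tokenization rule is an assumption):

count.ngrams <- function(lines, n) {
  grams <- lapply(strsplit(lines, "[^A-Za-z']+"), function(w) {
    w <- w[w != ""]                      # drop empty tokens
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),   # sliding window of n words
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  })
  sort(table(unlist(grams)), decreasing = TRUE)   # frequency table
}

lines    <- readLines(file.path(data.dir, train.file))
monogram <- count.ngrams(lines, 1)
bigram   <- count.ngrams(lines, 2)
trigram  <- count.ngrams(lines, 3)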

The training data contains 229109 words in total, but only 23319 unique words. More than half of the unique words are very rare (they occur only once), and more than three quarters are quite rare (they occur three times or less).

summary(values(monogram))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.000     1.000     1.000     9.825     3.000 10900.000
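
Using the hash tables from the actual analysis, these proportions, and the counts quoted below, can be read directly off the frequency values, for example:

freq <- values(monogram)
mean(freq == 1)    # share of unique words occurring exactly once (more than half)
mean(freq <= 3)    # share occurring three times or less (more than three quarters)
sum(freq > 100)    # words appearing more than a hundred times
sum(freq > 1000)   # words appearing a thousand times or more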

As the following graph shows, few words appear more often than 15 times.

hist(values(monogram), main = "Frequency of Words",
     xlab = "Occurrences per Word", xlim = c(0, 20), breaks = 10000)

On the other hand, there are words that appear more than a hundred times (244 words) or even a few thousand times (26 words). The following graph shows the 25 most common words.

par(las=2)
barplot(head(sort(values(monogram), decreasing = TRUE), n = 25), 
        main = "Most Frequent Words")

The training data contains a total of 229109 bigrams and 229109 trigrams, of which 117725 bigrams and 174715 trigrams are unique. More than three quarters of the unique bi- and trigrams are very rare, i.e. they occur only once.

summary(values(bigram))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.946    1.000 2140.000
summary(values(trigram))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.311    1.000 2140.000

On the other hand, there are bi- and trigrams that appear more than 10 times (1840 bigrams and 671 trigrams). The following graphs show the 25 most common bi- and trigrams.

par(las=2)
barplot(head(sort(values(bigram), decreasing = TRUE), n = 25), 
        main = "Most Frequent Bigrams")

barplot(head(sort(values(trigram), decreasing = TRUE), n = 25), 
        main = "Most Frequent Trigrams")

The most frequent trigrams are those that start a sentence; most phrases begin with “I”, “The”, or “And”. The following graph therefore lists the most common trigrams taken from the middle or end of a phrase.

par(las=2)
# trigram.new excludes the sentence-initial trigrams (see the sketch below)
barplot(head(sort(values(trigram.new), decreasing = TRUE), n = 25), 
        main = "Most Frequent Trigrams (from within a Phrase)")

These trigrams contain mainly the more common words, thus suggesting that the predictive model might work even if it disregards very rare words.

Summary

The exploratory data analysis shows that the needed word sequences can be extracted from the given data files. The data is very extensive, so only a representative sample has been used for the analysis. Nevertheless, the results are promising. Generally speaking, the texts contain many rarely used words and a limited number of extremely frequent words. The trend is the same for bi- and trigrams, though less pronounced. Using bi- and/or trigrams to predict words therefore shows promise. The challenge will be to develop a model that can cope with the given memory and time limits.

The basic idea for the word prediction model is to harness the power of the bi- and trigrams to predict the next word. Given a sequence of zero to two words, the most likely of the matching bi- and trigrams will be used; the idea generalizes to the beginning of phrases, where no words have been entered yet. The Shiny app that will be developed will allow the user to enter a sequence of words; after the user presses a button, a short list of likely next words will be shown.
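
A minimal sketch of this backoff lookup, assuming the sorted count tables from the decomposition sketch above (a real app would precompute a lookup index instead of scanning the n-gram names with grep, and would sanitize the user input):

predict.next <- function(phrase, k = 3) {
  w <- unlist(strsplit(phrase, "\\s+"))
  w <- w[w != ""]
  candidates <- character(0)
  if (length(w) >= 2) {                  # try the trigrams first
    prefix <- paste0("^", w[length(w) - 1], " ", w[length(w)], " ")
    candidates <- grep(prefix, names(trigram), value = TRUE)
  }
  if (length(candidates) == 0 && length(w) >= 1) {   # back off to bigrams
    prefix <- paste0("^", w[length(w)], " ")
    candidates <- grep(prefix, names(bigram), value = TRUE)
  }
  if (length(candidates) == 0)           # no context: suggest the most common words
    return(head(names(monogram), k))
  # the tables are sorted by frequency, so the first hits are the most likely;
  # return only the final (predicted) word of each matching n-gram
  unname(sapply(head(candidates, k),
                function(s) tail(strsplit(s, " ")[[1]], 1)))
}

predict.next("I want to")   # returns the k most likely next words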

Besides memory and computing time restrictions, several other factors will be considered in the development of the prediction model: dealing with inappropriate words, bad orthography, and special character sequences (e.g. abbreviations, words with numbers, smileys). The consequences of using a language other than English also have to be considered.