This exploratory data analysis deals with two main questions:
Can the given raw data be loaded and preprocessed as needed? Basic information about the raw data will be extracted and, subsequently, a representative subset will be created and used for further analysis. This subset will be dissected into the elements of further study: phrases, words, and short sequences of words.
Does the preprocessed data show properties that can be used to later build a word prediction model? To answer this question, the frequency distributions of words as well as of two- and three-word sequences (bi- and trigrams) will be analysed.
This analysis has been written in a style that is accessible to non data scientists. For this reason, most of the concrete R code is not shown in this report. However, all results and graphs have been created in a reproducible manner with R files that are available on request.
This analysis is based on three data files, containing texts from US blogs, news sites and Twitter. The files contain roughly between 160 and 210 million characters, representing about 30 to 38 million words. The average word length, for all three files, is between 5 and 6 characters. As could be expected, the average line length for Twitter entries is very short (about 13 words on average), while news and blog entries tend to be longer (average line length 34 and 42 words respectively).
file.data <- get.basic.file.information(c(blog.file, news.file, twitter.file), data.dir)
knitr::kable(file.data)
| Filename | Lines | Words | Characters | Longest line (characters) | Avg. words per line | Avg. characters per word |
|---|---:|---:|---:|---:|---:|---:|
| en_US.blogs.txt | 899288 | 37334114 | 208623081 | 40833 | 41.51519 | 5.588001 |
| en_US.news.txt | 1010242 | 34365936 | 205243643 | 11384 | 34.01753 | 5.972299 |
| en_US.twitter.txt | 2360148 | 30359804 | 166816544 | 173 | 12.86352 | 5.494651 |
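For illustration, the helper used above might be implemented along the following lines. This is only a sketch under the assumption that line, word, and character counts are gathered with base R; the actual implementation is not shown in this report.

get.basic.file.information <- function(file.names, data.dir) {
  rows <- lapply(file.names, function(file.name) {
    # Read the raw file and derive per-line word and character counts
    lines <- readLines(file.path(data.dir, file.name), skipNul = TRUE)
    words.per.line <- sapply(strsplit(lines, "\\s+"), length)
    chars.per.line <- nchar(lines)
    data.frame(Filename = file.name,
               Lines = length(lines),
               Words = sum(words.per.line),
               Characters = sum(chars.per.line),
               Longest_Line = max(chars.per.line),
               Avg_Words_per_Line = mean(words.per.line),
               Avg_Characters_per_Word = sum(chars.per.line) / sum(words.per.line))
  })
  do.call(rbind, rows)
}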
The available data files are very extensive. To use all the data would require more computing resources than are available for this analysis. Therefore, a representative excerpt of altogether about 10,000 lines is taken from all three files and stored in a single training data set (“train.txt”). All three sources will be represented with about the same number of words. This is achieved by randomly selecting lines from the blog, news, and Twitter files in the proportion 3:4:10. In the same way, a testing data set (“test.txt”) containing about 5,000 lines is created for the verification of the prediction model developed later.
create.data.excerpt(c(blog.file, news.file, twitter.file), data.dir,
                    c(3, 4, 10), c(10000, 5000),
                    c("en_US.train.txt", "en_US.test.txt"), file.data)
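A possible implementation of this sampling step could look as follows. This is a sketch, not the actual helper: the seed, the exact sampling, and the use of the file.data argument are assumptions.

create.data.excerpt <- function(file.names, data.dir, proportions,
                                excerpt.sizes, excerpt.names, file.data) {
  # file.data (per-file statistics) could be used to derive the proportions;
  # here they are passed in directly as c(3, 4, 10).
  set.seed(42)  # assumed seed, only to make the excerpts reproducible
  shares <- proportions / sum(proportions)
  for (i in seq_along(excerpt.names)) {
    excerpt <- unlist(lapply(seq_along(file.names), function(j) {
      lines <- readLines(file.path(data.dir, file.names[j]), skipNul = TRUE)
      # Draw each source's share of the excerpt at random
      sample(lines, round(excerpt.sizes[i] * shares[j]))
    }))
    # Shuffle the combined lines and write them to the excerpt file
    writeLines(sample(excerpt), file.path(data.dir, excerpt.names[i]))
  }
}

Note that the proportion 3:4:10 roughly cancels out the different average line lengths (about 42, 34, and 13 words for blogs, news, and Twitter), so each source contributes a similar number of words.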
The training file created in this way, containing approximately 10,000 lines from all three data sources, is now decomposed.
Every line is read and decomposed into phrases. Phrases are delimited mainly by punctuation marks such as “.”, “?”, “!”, “;”, “,”, “-” etc. This is done because a word sequence that spans from the end of one phrase to the beginning of the next probably carries no valuable information.
Every phrase is then transformed into lower case and decomposed into words, using regular expressions. A word is everything composed solely of letters or the apostrophe (so constructs such as “we’ll” or “don’t” can be treated as words).
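For illustration, a minimal version of this tokenization could look as follows (the exact regular expression used in the analysis may differ):

phrase <- tolower("Don't worry we'll manage")
regmatches(phrase, gregexpr("[a-z']+", phrase))[[1]]
## [1] "don't"  "worry"  "we'll"  "manage"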
All words on a list of defined inappropriate words are filtered out. At this time, no decision has been made about the extent of this list.
The remaining words of length two or greater (as well as the special cases “a” and “i”) are then added to three hashes, called “monogram”, “bigram”, and “trigram”. “monogram” counts the occurrences of single words, “bigram” the occurrences of two-word sequences, and “trigram” the occurrences of three-word sequences.
To make the bigrams and trigrams usable at the beginning of a phrase as well, phrases are prefixed with the special character “^” (once for bigrams, twice for trigrams).
# Decompose the training data into mono-, bi-, and trigram counts
decompose.text.file(train.file, data.dir, monogram, bigram, trigram)
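The decomposition itself is not shown in the report; a simplified sketch of how a single line might be processed, using the hash package for the three n-gram counters, is given below. Function names and details are assumptions.

library(hash)

# Increment the counter for one n-gram key
count.ngram <- function(h, key) {
  h[[key]] <- if (has.key(key, h)) h[[key]] + 1 else 1
}

decompose.line <- function(line, monogram, bigram, trigram) {
  # Split the line into phrases at the punctuation marks listed above
  phrases <- unlist(strsplit(tolower(line), "[.?!;,-]+"))
  for (phrase in phrases) {
    words <- regmatches(phrase, gregexpr("[a-z']+", phrase))[[1]]
    # Keep words of length two or more plus the special cases "a" and "i"
    # (the filtering of inappropriate words is omitted in this sketch)
    words <- words[nchar(words) > 1 | words %in% c("a", "i")]
    if (length(words) == 0) next
    padded <- c("^", "^", words)  # "^" marks the beginning of the phrase
    for (k in seq_along(words)) {
      count.ngram(monogram, words[k])
      count.ngram(bigram, paste(padded[k + 1], words[k]))
      count.ngram(trigram, paste(padded[k], padded[k + 1], words[k]))
    }
  }
}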
The training data contains 229109 words in total, but only 23319 unique words. More than half of the unique words are very rare (they occur only once), and more than three quarters are quite rare (they occur three times or less).
summary(values(monogram))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 9.825 3.000 10900.000
As the following graph shows, few words appear more often than 15 times.
hist(values(monogram), main = "Frequency of Words", xlab = "Words",
     xlim = c(0, 20), breaks = 10000)
On the other hand, there are words that appear more than a hundred times (244 words) or even a few thousand times (26 words). The following graph shows the 25 most common words.
par(las=2)
barplot(head(sort(values(monogram), decreasing = TRUE), n = 25),
main = "Most Frequent Words")
The training data contains a total of 229109 bigrams and 229109 trigrams (because of the “^” prefixes, every word contributes exactly one bigram and one trigram). There are 117725 unique bigrams and 174715 unique trigrams. More than three quarters of the unique bi- and trigrams are very rare, i.e. they occur only once.
summary(values(bigram))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.946 1.000 2140.000
summary(values(trigram))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.311 1.000 2140.000
On the other hand, there are bi- and trigrams that appear more than 10 times (1840 bigrams and 671 trigrams). The following graphs show the 25 most common bi- and trigrams.
par(las=2)
barplot(head(sort(values(bigram), decreasing = TRUE), n = 25),
main = "Most Frequent Bigrams")
barplot(head(sort(values(trigram), decreasing = TRUE), n = 25),
main = "Most Frequent Trigrams")
The trigrams that start a phrase are the most frequent ones. Most phrases start with “I”, “The”, or “And”. The following graph therefore lists only the trigrams that come from the middle or end of a phrase, stored in the hash “trigram.new”.
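The hash “trigram.new” is assumed to be derived from “trigram” by dropping all entries that contain the phrase-start marker “^”, for example:

v <- values(trigram)
inside <- !grepl("^", names(v), fixed = TRUE)  # keys without the "^" marker
trigram.new <- hash(names(v)[inside], v[inside])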
par(las=2)
barplot(head(sort(values(trigram.new), decreasing = TRUE), n = 25),
main = "Most Frequent Trigrams (from within a Phrase)")
These trigrams consist mainly of the more common words, suggesting that the predictive model might work even if it disregards very rare words.
The exploratory data analysis shows that the needed word sequences can be extracted from the given data files. The data is very extensive, so only a representative sample has been used for the analysis. Nevertheless, the results are promising. Generally speaking, the texts seem to contain many rarely used words and a limited number of extremely frequent words. The trend is the same for bi- and trigrams, though less pronounced. Using bi- and/or trigrams to predict words therefore shows promise. The challenge will be to develop a model that can cope with the given memory and time limits.
The basic idea for the word prediction model is to harness the power of the bi- and trigrams to predict the next word. For this purpose, given a sequence of zero to two words, the most likely of the fitting bi- and trigrams will be used. This idea can be generalized to the beginning of phrases. The Shiny app that will be developed will allow the user to enter a sequence of words. After the user presses a button, a short list of likely next words will be shown.
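A minimal sketch of this lookup, using the hashes built above, is shown below. The function name, the simple backoff order (trigram, then bigram, then single-word frequency), and all details are assumptions rather than the final model.

predict.next.word <- function(w1, w2, monogram, bigram, trigram, n = 3) {
  # Return the n most frequent n-grams that start with the given prefix
  top.matches <- function(h, prefix) {
    v <- values(h)
    hits <- v[startsWith(names(v), paste0(prefix, " "))]
    names(head(sort(hits, decreasing = TRUE), n))
  }
  # Try trigrams first, then back off to bigrams, then to plain frequency;
  # for the beginning of a phrase, w1 and w2 can be the marker "^"
  candidates <- top.matches(trigram, paste(w1, w2))
  if (length(candidates) == 0) candidates <- top.matches(bigram, w2)
  if (length(candidates) == 0)
    candidates <- names(head(sort(values(monogram), decreasing = TRUE), n))
  sub(".* ", "", candidates)  # keep only the predicted (last) word
}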
Besides memory and computing time restrictions, several other factors will be considered in the development of the prediction model: dealing with inappropriate words, bad orthography, and special character sequences (e.g. abbreviations, words with numbers, smileys). Also, the consequences of using a language other than English have to be considered.