Executive Summary

This report contains some basic exploratory data analysis for the Capstone Project of the Data Science Specialization by Johns Hopkins University on Coursera.

Our final goal is to create a predictive model that predicts the next word a user is going to type, displaying the words with the highest probability and allowing the user to choose the right word without fully typing it. If our prediction algorithm works well, this can actually save the user time when typing and make writing texts faster.

This report will show some of the pitfalls of the given data and how I plan to deal with them when creating the predictive algorithm and the web app.

Exploratory Data Analysis

The data set covers four languages (English, German, Finnish and Russian), and for each language there are three files, containing text from blogs, news articles and Twitter feeds.

For our Capstone Project we will only work with the English language, but our analysis could be applied to the other languages as well (with some adjustments).

First look at the data

For the data to load correctly, I need to open it in binary read mode (“rb”) and skip embedded NUL characters. This is unexpected, but a few lines in the data sets may be corrupt.

E.g.:

con <- file("final/en_US/en_US.blogs.txt", open = "rb")
all_lines <- readLines(con = con, skipNul = TRUE)
close(con = con)

Basic statistics about the data

##                file   lines total_characters
## 1   en_US.blogs.txt  899288        206824505
## 2    en_US.news.txt 1010242        203223159
## 3 en_US.twitter.txt 2360148        162096241

Not surprisingly, Twitter feeds are much shorter than both news articles and blog entries.
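
For reference, a minimal sketch of how these figures can be computed (the helper count_stats is my own naming; exact character counts may vary slightly with encoding handling):

count_stats <- function(path) {
  con <- file(path, open = "rb")           # binary mode, as above
  lines <- readLines(con, skipNul = TRUE)  # skip embedded NUL characters
  close(con)
  data.frame(file = basename(path),
             lines = length(lines),
             total_characters = sum(nchar(lines)))
}

files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, count_stats))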

Usage of characters

As we are dealing with the English language, my assumption is that we are dealing only with the characters a-z in both upper and lower case, as well as some punctuation.

To my surprise there were characters that are more common than the least common letter. Some of them can be explained easily (’ or -), as they may be part of a word. Others are part of punctuation (, . ? !). Even some digits (0, 1, 2, 3) occur more often than the least used letter, q. Some of this can be explained by “words” like 1st, 2nd or 3rd. Other than that I couldn’t find any meaningful and frequent occurrences of those digits.
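
These frequencies can be checked roughly as follows (a sketch, run here on a random sample of lines to keep memory usage manageable):

set.seed(123)
sample_lines <- sample(all_lines, 10000)            # sample to limit memory usage
chars <- unlist(strsplit(sample_lines, split = "")) # split lines into single characters
sort(table(chars), decreasing = TRUE)[1:30]         # 30 most frequent characters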

As I need to clean the data later on, for now I will consider the letters a-z, ’ and - as possible “letters” of a word, so we will be able to catch words like “I’ll” or “We’re”.
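
In practice this corresponds to a simple tokenizer along these lines (a sketch; extract_words is my own hypothetical helper, and straight apostrophes are assumed, so curly quotes may need normalizing first):

extract_words <- function(lines) {
  # lowercase first, then keep runs of a-z, apostrophe and hyphen
  unlist(regmatches(lines, gregexpr("[a-z'-]+", tolower(lines))))
}
extract_words("We're sure I'll re-type this!")
## [1] "we're"   "sure"    "i'll"    "re-type" "this"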

Coverage of text by number of different words

With relatively few words in our dictionary we can cover a large number of the sentences in our test data. However, the closer we get to full coverage, the more dramatically the size of the dictionary grows. For performance reasons it makes sense to trade perfect coverage for speed and low memory usage:
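
One common way to quantify this trade-off is via cumulative coverage of word occurrences (a sketch, assuming dict is the named vector of word counts used below):

freq <- sort(as.numeric(dict), decreasing = TRUE)  # word counts, largest first
coverage <- cumsum(freq) / sum(freq)               # cumulative share of all word occurrences
min(which(coverage >= 0.5))  # dictionary size needed for 50% coverage
min(which(coverage >= 0.9))  # dictionary size needed for 90% coverage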

Another aspect of human-written language is that there will be errors and inconsistencies. For example:

dict[which(tolower(names(dict)) == "you")]
## 
##   You   you   YOU   YOu   YoU   yOU   yoU 
## 22329 17099   739     7     4     3     1

There are a lot of different ways to write “you”, and all of them most likely mean the same word. So for our algorithm predicting the next word, we will only consider the lowercase variants of all words. If necessary, this can later be fixed by applying the correct spelling to the dictionary.
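
Merging the case variants is straightforward, e.g. by summing the counts over the lowercased names of dict (a sketch):

lower_dict <- tapply(as.numeric(dict), tolower(names(dict)), sum)
lower_dict["you"]
##   you 
## 40182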

Further to-dos to create a predictive model