Executive Summary

This report contains some basic exploratory data analysis for the Capstone Project of the Data Science Specialization by Johns Hopkins University on Coursera.

Our final goal is to create a predictive model that predicts the next word a user is going to type, displaying the words with the highest probability and allowing the user to choose the right word without fully typing it. If our prediction algorithm works well, this can actually save the user time when typing and make writing texts faster.

This report will show some of the pitfalls of the given data and how I plan to deal with them when creating the predictive algorithm and the web app.

Exploratory Data Analysis

The data set covers four languages (English, German, Finnish and Russian), and for each language there are three files, containing text from blogs, news articles and Twitter feeds.

For our Capstone Project we will only work with the English language, but our analysis could be applied to the other languages as well (with some adjustments).

First look at the data

For the data to load correctly, I need to open it in binary read mode (“rb”) and skip embedded NUL characters. This is unexpected, but a few lines in the data sets may be corrupt.

E.g.:

con <- file("final/en_US/en_US.blogs.txt", open = "rb")
all_lines <- readLines(con = con, skipNul = TRUE)
close(con = con)

Basic statistics about the data

##                file   lines total_characters
## 1   en_US.blogs.txt  899288        206824505
## 2    en_US.news.txt 1010242        203223159
## 3 en_US.twitter.txt 2360148        162096241

Not surprisingly, Twitter feeds are much shorter than both news articles and blog entries.
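
For reference, a minimal sketch of how these figures can be computed (the helper count_stats is my own naming; exact character counts may vary slightly with encoding handling):

count_stats <- function(path) {
  con <- file(path, open = "rb")           # binary mode, as above
  lines <- readLines(con, skipNul = TRUE)  # skip embedded NUL characters
  close(con)
  data.frame(file = basename(path),
             lines = length(lines),
             total_characters = sum(nchar(lines)))
}

files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, count_stats))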

Usage of characters

As we are dealing with the English language, my assumption is that we are dealing only with the characters a-z in both upper and lower case, as well as some punctuation.

To my surprise there were characters that are more common than the least common letter. Some of them can be explained easily (’ or -), as they may be part of a word. Others are part of punctuation (, . ? !). Even some digits (0, 1, 2, 3) occur more often than the least used letter, q. Some of this can be explained by “words” like 1st, 2nd or 3rd. Other than that I couldn’t find any meaningful and frequent occurrences of those digits.
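
These frequencies can be checked roughly as follows (a sketch, run here on a random sample of lines to keep memory usage manageable):

set.seed(123)
sample_lines <- sample(all_lines, 10000)            # sample to limit memory usage
chars <- unlist(strsplit(sample_lines, split = "")) # split lines into single characters
sort(table(chars), decreasing = TRUE)[1:30]         # 30 most frequent characters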

As I need to clean the data later on, for now I will consider the letters a-z, ’ and - as possible “letters” of a word, so we will be able to catch words like “I’ll” or “We’re”.
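
In practice this corresponds to a simple tokenizer along these lines (a sketch; extract_words is my own hypothetical helper, and straight apostrophes are assumed, so curly quotes may need normalizing first):

extract_words <- function(lines) {
  # lowercase first, then keep runs of a-z, apostrophe and hyphen
  unlist(regmatches(lines, gregexpr("[a-z'-]+", tolower(lines))))
}
extract_words("We're sure I'll re-type this!")
## [1] "we're"   "sure"    "i'll"    "re-type" "this"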

Coverage of text by number of different words

With relatively few words in our dictionary we can cover a large number of the sentences in our test data. However, the closer we get to full coverage, the more dramatically the size of the dictionary grows. For performance reasons it makes sense to trade perfect coverage for speed and low memory usage:
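
One common way to quantify this trade-off is via cumulative coverage of word occurrences (a sketch, assuming dict is the named vector of word counts used below):

freq <- sort(as.numeric(dict), decreasing = TRUE)  # word counts, largest first
coverage <- cumsum(freq) / sum(freq)               # cumulative share of all word occurrences
min(which(coverage >= 0.5))  # dictionary size needed for 50% coverage
min(which(coverage >= 0.9))  # dictionary size needed for 90% coverage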

Another aspect of human-written language is that there will be errors and inconsistencies. For example:

dict[which(tolower(names(dict)) == "you")]
## 
##   You   you   YOU   YOu   YoU   yOU   yoU 
## 22329 17099   739     7     4     3     1

There are a lot of different ways to write “you”, and all of them most likely mean the same word. So for our algorithm predicting the next word, we will only consider the lowercase variants of all words. If necessary, this can later be fixed by applying the correct spelling to the dictionary.
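
Merging the case variants is straightforward, e.g. by summing the counts over the lowercased names of dict (a sketch):

lower_dict <- tapply(as.numeric(dict), tolower(names(dict)), sum)
lower_dict["you"]
##   you 
## 40182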

Further to-dos to create a predictive model