Exploratory Data Analysis

This report shows that I have downloaded and successfully loaded the data, presents summary statistics for the three datasets, notes any interesting findings, and briefly outlines my plan for a prediction algorithm.

blogs <- readLines("/Users/Creed/Desktop/en_US.blogs.txt") # read blogs, one element per line
news <- readLines("/Users/Creed/Desktop/en_US.news.txt") # read news, one element per line
twitter <- readLines("/Users/Creed/Desktop/en_US.twitter.txt") # read tweets, one element per line
## Warning in readLines("/Users/Creed/Desktop/en_US.twitter.txt"): line
## 167155 appears to contain an embedded nul
## Warning in readLines("/Users/Creed/Desktop/en_US.twitter.txt"): line
## 268547 appears to contain an embedded nul
## Warning in readLines("/Users/Creed/Desktop/en_US.twitter.txt"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("/Users/Creed/Desktop/en_US.twitter.txt"): line
## 1759032 appears to contain an embedded nul
df.features <- data.frame(
    filename=c("blogs","news","twitter"),
    size.in.MB=c(round(object.size(blogs)/(1024*1024),1),round(object.size(news)/(1024*1024),1),round(object.size(twitter)/(1024*1024),1)), # in-memory size of each object, not the file size on disk
    TotalLines=c(length(blogs),length(news),length(twitter)),
    TotalCharacters=c(sum(nchar(blogs)),sum(nchar(news)),sum(nchar(twitter))),
    CharactersPerLine=c(round(mean(nchar(blogs)),2),round(mean(nchar(news)),2),round(mean(nchar(twitter)),2)),
    TotalWords=c(sum(stringi::stri_count_words(blogs)),sum(stringi::stri_count_words(news)),sum(stringi::stri_count_words(twitter)))
) # create data frame of summary statistics
df.features$wordsPerLine <- round(df.features$TotalWords/df.features$TotalLines,2) # average words per line
library(gridExtra) # require package
## Loading required package: grid
grid.table(df.features, gp=gpar(fontsize=8)) # draw the summary table
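
The embedded nul warnings reported for the twitter file matter because, by default, readLines() ends the affected line at the nul, so those four tweets are truncated. One possible way to avoid this is sketched below. It is not part of the original analysis: read_corpus is a hypothetical helper name, and the UTF-8 encoding is an assumption about the files.

```r
# Hypothetical helper (not in the original analysis): read a text file
# while skipping embedded nul bytes instead of truncating lines at them.
read_corpus <- function(path) {
    con <- file(path, open = "rb")  # binary mode: bytes are read as-is
    on.exit(close(con))             # always release the connection
    # skipNul = TRUE drops nul bytes; encoding = "UTF-8" is an assumption
    readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
```

Under these assumptions, the twitter file could be loaded with read_corpus("/Users/Creed/Desktop/en_US.twitter.txt"), and the same call would work for the blogs and news files.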

par(mfrow = c(1,3)) # place the three bar plots side by side
barplot(df.features$size.in.MB,names.arg=df.features$filename, xlab="in-memory size in MB") # bar plot of object sizes
barplot(df.features$TotalLines,names.arg=df.features$filename, xlab="line counts") # bar plot of line counts
barplot(df.features$wordsPerLine,names.arg=df.features$filename, xlab="words per line") # bar plot of average words per line

Prediction Algorithm

Pass in the last word typed (the word to predict from) and the word immediately before it (the antecedent).

Filter out any tweet, blog entry, or news item that doesn’t contain the last word typed.

Consolidate the remaining texts into a single corpus.

Try to predict the next word using 3-grams in which the 1st word is the antecedent and the 2nd word is the last word typed.

If no such 3-gram exists, back off to 2-grams in which the 1st word is the last word typed.
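
The steps above can be sketched in base R as follows. This is only a minimal illustration under several assumptions, not the final implementation. The function names (tokenize, most_frequent, predict_next), the simple lower-case tokenization, and the choice to return the most frequent candidate (ties broken alphabetically) are all hypothetical. The plan itself does not yet say how a winner is chosen among matching n-grams.

```r
# Hypothetical helper: split text into lower-case word tokens.
tokenize <- function(text) {
    words <- unlist(strsplit(tolower(text), "[^a-z']+"))
    words[nzchar(words)]  # drop empty strings left by leading separators
}

# Hypothetical helper: most frequent candidate, or NA if there is none.
most_frequent <- function(candidates) {
    candidates <- candidates[!is.na(candidates)]
    if (length(candidates) == 0) return(NA_character_)
    names(which.max(table(candidates)))  # ties go to the first name alphabetically
}

# Sketch of the back-off: try 3-grams first, then fall back to 2-grams.
predict_next <- function(texts, antecedent, last_word) {
    antecedent <- tolower(antecedent)
    last_word <- tolower(last_word)
    tokens_by_text <- lapply(texts, tokenize)
    # Keep only the texts that contain the last word typed.
    keep <- vapply(tokens_by_text, function(tok) last_word %in% tok, logical(1))
    # Consolidate into one token stream; NA marks the end of each text
    # so that no n-gram spans two different texts.
    corpus <- as.character(unlist(lapply(tokens_by_text[keep],
                                         function(tok) c(tok, NA_character_))))
    # 3-grams: (w1, w2, w3) at consecutive positions.
    w1 <- head(corpus, -2)
    w2 <- head(corpus[-1], -1)
    w3 <- tail(corpus, -2)
    guess <- most_frequent(w3[which(w1 == antecedent & w2 == last_word)])
    if (!is.na(guess)) return(guess)
    # 2-grams: (b1, b2) at consecutive positions.
    b1 <- head(corpus, -1)
    b2 <- tail(corpus, -1)
    most_frequent(b2[which(b1 == last_word)])
}

# Toy usage with made-up texts (not taken from the datasets).
toy <- c("the quick brown fox", "a quick brown dog",
         "a quick brown dog runs", "quick red car")
predict_next(toy, "quick", "brown")  # a 3-gram match exists
predict_next(toy, "slow", "quick")   # no 3-gram match, so 2-grams are used
```

Building the n-grams from shifted copies of one token vector keeps the sketch short, but on the full datasets a precomputed frequency table would likely be needed for acceptable speed.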