Introduction

The goal of this report is to show that I have become familiar with the data and that I am on track to create my prediction algorithm.

The motivation for this project is to:

- Demonstrate that I have downloaded the data and successfully loaded it.
- Create a basic report of summary statistics about the datasets.
- Report any interesting findings that I have amassed so far.
- Get feedback on my plans for creating a prediction algorithm and Shiny app.

First, we do the basic initialization:

### Setting the working directory
setwd("~/_R_Projects/Capstone")

### Load libraries
library(stringi)
library(scales)
library(ggplot2)
library(tm)
library(RWeka)

We check the file sizes of the source files:

### Filesizes
path<-"Coursera-SwiftKey/final/en_US/"
files<- paste(path,list.files(path),sep="")
filesizes<-cbind(files,paste(round(file.info(files)$size/1024^2,digits=2),"MB",sep=" "))
filesizes
##      files                                                        
## [1,] "Coursera-SwiftKey/final/en_US/en_US.blogs.txt"   "200.42 MB"
## [2,] "Coursera-SwiftKey/final/en_US/en_US.news.txt"    "196.28 MB"
## [3,] "Coursera-SwiftKey/final/en_US/en_US.twitter.txt" "159.36 MB"

Although the dataset is quite large, it fits in a standard computer’s memory, so this is not really “Big data”.

We load the data in:

### Importing the data: we will be working with the English corpus
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt")
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
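
On some platforms (notably Windows), `readLines()` on a text-mode connection can stop early when a file contains an embedded end-of-file control character, which silently truncates the data. Below is a minimal sketch of a more defensive loader, assuming the files are UTF-8 encoded. The helper name `read_corpus` is hypothetical and not part of the original code.

```r
# Hypothetical helper (not in the original report): read a text file through
# a binary-mode connection so that embedded control characters are not
# interpreted as end-of-file.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul drops embedded NUL bytes instead of truncating the line
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage, guarded so the sketch also runs where the data is not present
news_path <- "Coursera-SwiftKey/final/en_US/en_US.news.txt"
if (file.exists(news_path)) {
  news <- read_corpus(news_path)
  length(news)
}
```

Comparing `length(news)` from this loader with the result of a plain `readLines()` call is a quick way to see whether the file was read completely.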

First look

The hypothesis is that, because the sources differ, the character of the three datasets will differ as well. The first place to look for these differences is the data itself. Below, I present three sample lines from each dataset:

### Sample 3 lines of blogs:
blogs[1:3]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
### Sample 3 lines of news:
news[1:3]
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
### Sample 3 lines of tweets:
twitter[1:3]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

This immediately shows the different tone and language of the three sources. Whereas the blog posts vary considerably in length and language, the news items are generally more formal, and the Twitter posts are shorter and written in informal language.

Basic statistics

The summary statistics for each of the three files are as follows:

### Number of lines, characters and words in each of these three files
stats<-as.data.frame(rbind(stri_stats_general(blogs)[c(1,3)],stri_stats_general(news)[c(1,3)],stri_stats_general(twitter)[c(1,3)]))
stats$Words<-c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter)))
row.names(stats) <- c("blogs","news","twitter")
stats
##           Lines     Chars    Words
## blogs    899288 208361438 38256701
## news      77259  15683765  2697319
## twitter 2360148 162384825 30249390

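We see that, even though it has the smallest file size, the Twitter data contains the most lines. Note also that the news counts (77,259 lines and about 15.7 million characters) are far lower than a 196 MB file would suggest, so the news file may not have been read in completely.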
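
To make the comparison between sources more concrete, the totals can be turned into per-line averages. This is a small sketch that copies the numbers from the table above; the derived column names `WordsPerLine` and `CharsPerLine` are my own additions for illustration.

```r
# Values copied from the summary table in this report
stats <- data.frame(
  Lines = c(899288, 77259, 2360148),
  Chars = c(208361438, 15683765, 162384825),
  Words = c(38256701, 2697319, 30249390),
  row.names = c("blogs", "news", "twitter")
)

# Hypothetical derived columns: average words and characters per line
stats$WordsPerLine <- round(stats$Words / stats$Lines, 1)
stats$CharsPerLine <- round(stats$Chars / stats$Lines, 1)
stats
```

The averages show that a typical tweet is much shorter than a typical blog line, which is why the Twitter file has the most lines despite being the smallest on disk.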

Histogram data

It is also interesting to look at the distributions of the number of characters and words per line in the three loaded files. First, we check the character counts:

### Count of chars
b1<- data.frame("name"="blogs", "value"=nchar(blogs))
n1<-data.frame("name"="news", "value"=nchar(news))
t1<-data.frame("name"="twitter","value"=nchar(twitter))
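# Combine the three sources (named char_counts to avoid masking base::hist)
char_counts <- rbind(b1, n1, t1)
# Keep lines shorter than 1000 characters so the long tail does not dominate the plot
char_counts <- char_counts[char_counts$value < 1000, ]
# Use bare column names inside aes(); data$column can misalign with the facets
ggplot(char_counts, aes(x = value, fill = name)) + geom_histogram(binwidth = 10) + facet_grid(name ~ ., scales = "free_y")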

Of the three, the distribution of the blog data is closest to exponential. There are many very short blog posts, but the tail shows that they can also be very long (it is the longest tail of the three). The distribution of tweets, on the other hand, ends sharply at around 140 characters, since this is Twitter's length limit for posts; they simply cannot be longer. We can also see a spike at the right edge of the tweet distribution, which is caused by the limit: even when people have a lot to say, they try to squeeze it into the allowed length.
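
The length limit can also be checked numerically rather than only read off the plot. The sketch below uses the three sample tweets shown earlier as stand-in data; on the full dataset the same expressions would be applied to the `twitter` vector. The 10-character window used for "near the limit" is an arbitrary choice for illustration.

```r
# Stand-in data: the three sample tweets printed earlier in this report
twitter_sample <- c(
  "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.",
  "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.",
  "they've decided its more fun if I don't."
)

tweet_lengths <- nchar(twitter_sample)

# Longest tweet; should not exceed the 140-character limit
max_length <- max(tweet_lengths)

# Share of tweets within 10 characters of the limit (illustrative threshold)
near_limit_share <- mean(tweet_lengths >= 130)

max_length
near_limit_share
```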

For the word count distribution across the lines of the documents, the picture looks very similar:

### Count of words
b2<- data.frame("name"="blogs", "value"=stri_count_words(blogs))
n2<-data.frame("name"="news", "value"=stri_count_words(news))
t2<-data.frame("name"="twitter","value"=stri_count_words(twitter))
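# Combine the three sources (named word_counts to avoid masking base::hist)
word_counts <- rbind(b2, n2, t2)
# Keep lines with fewer than 100 words so the long tail does not dominate the plot
word_counts <- word_counts[word_counts$value < 100, ]
# Use bare column names inside aes(); data$column can misalign with the facets
ggplot(word_counts, aes(x = value, fill = name)) + geom_histogram(binwidth = 2) + facet_grid(name ~ ., scales = "free_y")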

Next steps

The goal is to create a prediction algorithm for the next best word and a data product that highlights the prediction algorithm and provides an interface that can be accessed by others. For the prediction algorithm, the data will first have to be cleaned so that there are no differences in letter capitalization and so that strange and invalid characters are removed. When the algorithm is ready, I will create a Shiny application that suggests the next best word based on a phrase already given. For performance reasons, I expect that some of the calculations (e.g. the tokenization and the building of the n-grams) will have to be performed in advance and the relevant data stored in a materialized form. Since the training dataset will not change significantly in the coming months, these precalculated results (e.g. the word frequencies) might be not only a relatively fast solution but also a necessity.
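
The cleaning, n-gram counting and precomputation steps described above can be sketched in base R as follows. This is only an illustration under simplifying assumptions (ASCII-only text, bigrams only, whitespace tokenization), not the final algorithm. All function names (`clean_text`, `line_bigrams`, `count_bigrams`, `predict_next`) are hypothetical.

```r
# Hypothetical helpers for illustration; none of these names appear in the report.

# Normalize text: drop non-ASCII characters, lower-case, keep letters and apostrophes
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)
  trimws(gsub(" +", " ", x))
}

# Return all adjacent word pairs of one cleaned line
line_bigrams <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Build a bigram frequency table, most frequent first
count_bigrams <- function(lines) {
  sort(table(unlist(lapply(clean_text(lines), line_bigrams))), decreasing = TRUE)
}

# Suggest the most frequent word following `word`, or NA if it was never seen
predict_next <- function(counts, word) {
  hits <- counts[startsWith(names(counts), paste0(clean_text(word), " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[1])
}

toy_lines <- c("We love you Mr. Brown.", "We love the kids!", "we LOVE you")
counts <- count_bigrams(toy_lines)

# Store the precomputed table so an app can load it instead of recomputing
cache_file <- tempfile(fileext = ".rds")
saveRDS(counts, cache_file)
cached <- readRDS(cache_file)

predict_next(cached, "love")
```

Saving the frequency table with `saveRDS()` is one simple way to materialize the results; a Shiny app could then call `readRDS()` once at startup and only perform lookups at request time.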

When the n-gram matrices are ready, I will start working on the prediction algorithm. So far, I am considering Bayesian networks. I wanted to have the n-grams ready for this report, but R keeps crashing while I build them (even though I am building the matrices from a 10% sample of the data).
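
For reference, a reproducible way to draw such a sample is sketched below. The helper name `sample_corpus`, the default seed and the toy corpus are assumptions for illustration. Fixing the seed makes it possible to rerun the n-gram step on exactly the same subset, and lowering `fraction` is an easy way to test whether memory use is what causes the crashes.

```r
# Hypothetical helper (not in the original report): draw a reproducible
# random sample of lines from a character vector.
sample_corpus <- function(lines, fraction = 0.1, seed = 42) {
  set.seed(seed)
  n <- floor(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Toy corpus standing in for blogs, news or twitter
toy_corpus <- paste("line", 1:1000)
subset_10pct <- sample_corpus(toy_corpus)
length(subset_10pct)
```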