Overview

This document runs through the data preparation, initial exploratory work, and preliminary prediction plan for the capstone project.

The first step was to download the complete data set and unzip it. The readme file for the data set is located here: http://www.corpora.heliohost.org/aboutcorpus.html

Below is a summary of the three files included in the data set. The Twitter file is substantially smaller than the other two, owing mostly to the shorter length of each tweet.

library(stringi)
library(knitr)
library(tm)
## Loading required package: NLP
library(RWeka)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
# CAN BE SKIPPED IF THE DATA FROM THE LINK BELOW HAS ALREADY BEEN DOWNLOADED TO THE WORKING DIRECTORY
#dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(dataURL,"dataDownload.zip")
#unzip("dataDownload.zip")

# Reading in data
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# size info
tweets.size = file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
blogs.size = file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size = file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2

# Summary Info
summary_gen <- sapply(list(blogs,news,tweets),stri_stats_general)
summary_words <- sapply(list(blogs,news,tweets),stri_count_words)

# Summary Consolidation
summary_output <- data.frame(
        File = c("blogs","news","tweets"), 
        Size_mb = c(blogs.size,news.size,tweets.size),
        t(rbind(summary_gen[c(1,3,4),],
                Words = sapply(summary_words,sum),
                WordAvg = sapply(summary_words,mean)))
        )

kable(summary_output,digits=2)
File     Size_mb     Lines       Chars   CharsNWhite      Words   WordAvg
blogs     200.42    899288   206824382     170389539   37546246     41.75
news      196.28     77259    15639408      13072698    2674536     34.62
tweets    159.36   2360148   162096241     134082806   30093410     12.75

Cleaning and Sampling

After performing an initial review of the data sets, and noticing the size of the files, the next step was to sample from the three data sets before continuing any further exploration. The sample portion was 1% of each of the three sources. In addition to sampling the data sets, we have also taken steps to clean the data. The items we have attempted to account for in the cleaning include:

* Eliminating emojis and other invalid characters
* Converting everything to lower case
* Removing stop words
* Removing punctuation
* Removing numbers
* Eliminating excess white space

# Data Sample
set.seed(21)

blogs.sample <- sample(blogs,length(blogs)*.01)
news.sample <- sample(news,length(news)*.01)
tweets.sample <- sample(tweets,length(tweets)*.01)

file.sample <- c(blogs.sample,
                 news.sample,
                 tweets.sample)

# Edit added to accommodate emojis or invalid characters that were causing errors
file.sample.final <- sapply(file.sample,function(row) iconv(row, "latin1", "ASCII", sub=""))

# removing the full data sets to save memory
rm(blogs,news,tweets)

# for resuming from this point if the sample file already exists
# writeLines(file.sample.final,"fileSample.txt")
# con <- file("fileSample.txt","r")
# file.sample <- readLines(con)

# Setting up the corpus
usCorpus <- VCorpus(VectorSource(file.sample.final))

# Various items to clean the data
usCorpus <- tm_map(usCorpus, content_transformer(tolower))
usCorpus <- tm_map(usCorpus, removeWords, stopwords("en"))
usCorpus <- tm_map(usCorpus, removePunctuation)
usCorpus <- tm_map(usCorpus, removeNumbers)
usCorpus <- tm_map(usCorpus, stripWhitespace)

# Ensure plain text for analysis
usCorpus <- tm_map(usCorpus, PlainTextDocument)

Tokenization and Exploration

To break the cleaned corpus into digestible chunks of information, we will explore different sets of n-grams (consecutive word chains of length n).

# tokenization functions
bigramToken <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
trigramToken <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))

# function to calculate frequencies
frequency <- function(tdm){
    freq_sort <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
    freq_output <- data.frame(word=names(freq_sort), count=freq_sort)
    return(freq_output)
}

# setting the sparseness variable to limit the size of the outputs
sparseness <- .9999

# Preparing final outputs
unigram <- frequency(removeSparseTerms(TermDocumentMatrix(usCorpus),sparseness))
bigram <- frequency(removeSparseTerms(TermDocumentMatrix(usCorpus,control=list(tokenize=bigramToken)),sparseness))
trigram <- frequency(removeSparseTerms(
  TermDocumentMatrix(usCorpus,control=list(tokenize=trigramToken)),sparseness)
  )

Below are charts with the top 25 terms from each of the following n-grams: 1) Unigram 2) Bigram 3) Trigram

The word combinations shown below are the ones that would receive the highest prediction probabilities in our eventual model, absent other information.
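The original plotting code is not reproduced here; a minimal sketch of how such charts could be drawn with ggplot2 from the unigram, bigram, and trigram tables built above is shown below (the plot_top25 helper is a hypothetical name, not the report's exact code).

# Sketch only: one possible way to chart the top 25 entries of each n-gram table
plot_top25 <- function(freq_df, title) {
    top25 <- head(freq_df, 25)                      # tables are already sorted by count
    ggplot(top25, aes(x = reorder(word, count), y = count)) +
        geom_col() +
        coord_flip() +
        labs(title = title, x = "n-gram", y = "count")
}

plot_top25(unigram, "Top 25 Unigrams")
plot_top25(bigram, "Top 25 Bigrams")
plot_top25(trigram, "Top 25 Trigrams")

# A word cloud of frequent unigrams could also be drawn with the wordcloud package:
# wordcloud(words = unigram$word, freq = unigram$count, max.words = 50,
#           colors = brewer.pal(8, "Dark2"))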

Plan for predictions

The Shiny app will leverage the n-gram data to predict the most likely next word when a user inputs a particular text string. The next steps are to expand and improve the n-grams, and to develop the logic that leverages the n-gram output, which will likely include revisiting the cleaning of the n-grams.
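As a rough illustration only (not the app's final code), the lookup could back off from trigrams to bigrams to unigrams along the lines sketched below; predictNext is a hypothetical helper built on the frequency tables created above.

# Sketch of a simple backoff lookup over the n-gram tables (illustrative only)
predictNext <- function(input, n = 3) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    len <- length(words)
    if (len == 0) return(head(as.character(unigram$word), n))

    # try trigrams whose first two words match the last two input words
    if (len >= 2) {
        prefix <- paste(words[len - 1], words[len])
        hits <- trigram[grep(paste0("^", prefix, " "), trigram$word), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
    }

    # back off to bigrams whose first word matches the last input word
    hits <- bigram[grep(paste0("^", words[len], " "), bigram$word), ]
    if (nrow(hits) > 0)
        return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))

    # final fallback: the most frequent unigrams overall
    head(as.character(unigram$word), n)
}

# example call: predictNext("thanks for the")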