Overview

This document runs through the data preparation, initial exploratory work, and preliminary prediction plan for the capstone project.

The first step was to download the complete data set and unzip it. The readme file for the data set is located here: http://www.corpora.heliohost.org/aboutcorpus.html

Below is a summary of the three files included in the data set. The Twitter file is substantially smaller than the other two, owing mostly to the shorter length of each tweet.

library(stringi)
library(knitr)
library(tm)
## Loading required package: NLP
library(RWeka)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
# CAN BE SKIPPED IF THE DATA FROM THE LINK BELOW HAS ALREADY BEEN DOWNLOADED TO THE WORKING DIRECTORY
#dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(dataURL,"dataDownload.zip")
#unzip("dataDownload.zip")

# Reading in data
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# size info
tweets.size = file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
blogs.size = file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size = file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2

# Summary Info
summary_gen <- sapply(list(blogs,news,tweets),stri_stats_general)
summary_words <- sapply(list(blogs,news,tweets),stri_count_words)

# Summary Consolidation
summary_output <- data.frame(
        File = c("blogs","news","tweets"), 
        Size_mb = c(blogs.size,news.size,tweets.size),
        t(rbind(summary_gen[c(1,3,4),],
                Words = sapply(summary_words,sum),
                WordAvg = sapply(summary_words,mean)))
        )

kable(summary_output,digits=2)
File     Size_mb     Lines       Chars   CharsNWhite      Words   WordAvg
blogs     200.42    899288   206824382     170389539   37546246     41.75
news      196.28     77259    15639408      13072698    2674536     34.62
tweets    159.36   2360148   162096241     134082806   30093410     12.75

Cleaning and Sampling

After performing an initial review of the data sets, and noticing the size of the files, the next step was to sample from the three data sets before continuing any further exploration. The sample portion was 1% of each of the three sources. In addition to sampling the data sets, we have also taken steps to clean the data. The items we have attempted to account for in the cleaning include:

* Eliminating emojis and other invalid characters
* Converting everything to lower case
* Removing stop words
* Removing punctuation
* Removing numbers
* Eliminating excess white space

# Data Sample
set.seed(21)

blogs.sample <- sample(blogs,length(blogs)*.01)
news.sample <- sample(news,length(news)*.01)
tweets.sample <- sample(tweets,length(tweets)*.01)

file.sample <- c(blogs.sample,
                 news.sample,
                 tweets.sample)

# Edit added to accommodate emojis or invalid characters that were causing errors
file.sample.final <- sapply(file.sample,function(row) iconv(row, "latin1", "ASCII", sub=""))

# removing the full data sets to save memory
rm(blogs,news,tweets)

# for resuming from this point if the sample file already exists
# writeLines(file.sample.final,"fileSample.txt")
# con <- file("fileSample.txt","r")
# file.sample <- readLines(con)

# Setting up the corpus
usCorpus <- VCorpus(VectorSource(file.sample.final))

# Various items to clean the data
usCorpus <- tm_map(usCorpus, content_transformer(tolower))
usCorpus <- tm_map(usCorpus, removeWords, stopwords("en"))
usCorpus <- tm_map(usCorpus, removePunctuation)
usCorpus <- tm_map(usCorpus, removeNumbers)
usCorpus <- tm_map(usCorpus, stripWhitespace)

# Ensure plain text for analysis
usCorpus <- tm_map(usCorpus, PlainTextDocument)

Tokenization and Exploration

To break the cleaned corpus into digestible chunks of information, we will explore different sets of n-grams (consecutive word chains of length n).

# tokenization functions
bigramToken <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
trigramToken <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))

# function to calculate frequencies
frequency <- function(tdm){
    freq_sort <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
    freq_output <- data.frame(word=names(freq_sort), count=freq_sort)
    return(freq_output)
}

# setting the sparseness variable to limit the size of the outputs
sparseness <- .9999

# Preparing final outputs
unigram <- frequency(removeSparseTerms(TermDocumentMatrix(usCorpus),sparseness))
bigram <- frequency(removeSparseTerms(TermDocumentMatrix(usCorpus,control=list(tokenize=bigramToken)),sparseness))
trigram <- frequency(removeSparseTerms(
  TermDocumentMatrix(usCorpus,control=list(tokenize=trigramToken)),sparseness)
  )

Below are charts with the top 25 terms from each of the following n-grams: 1) Unigram 2) Bigram 3) Trigram

The word combinations shown below are the ones that would receive the highest prediction probabilities in our eventual model, absent other information.
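The original plotting code is not reproduced here; a minimal sketch of how such charts could be drawn with ggplot2 from the unigram, bigram, and trigram tables built above is shown below (the plot_top25 helper is a hypothetical name, not the report's exact code).

# Sketch only: one possible way to chart the top 25 entries of each n-gram table
plot_top25 <- function(freq_df, title) {
    top25 <- head(freq_df, 25)                      # tables are already sorted by count
    ggplot(top25, aes(x = reorder(word, count), y = count)) +
        geom_col() +
        coord_flip() +
        labs(title = title, x = "n-gram", y = "count")
}

plot_top25(unigram, "Top 25 Unigrams")
plot_top25(bigram, "Top 25 Bigrams")
plot_top25(trigram, "Top 25 Trigrams")

# A word cloud of frequent unigrams could also be drawn with the wordcloud package:
# wordcloud(words = unigram$word, freq = unigram$count, max.words = 50,
#           colors = brewer.pal(8, "Dark2"))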

Plan for predictions

The Shiny app will leverage the n-gram data to predict the most likely next word when a user inputs a particular text string. The next steps are to expand and improve the n-grams, and to develop the logic that leverages the n-gram output, which will likely include revisiting the cleaning of the n-grams.
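As a rough illustration only (not the app's final code), the lookup could back off from trigrams to bigrams to unigrams along the lines sketched below; predictNext is a hypothetical helper built on the frequency tables created above.

# Sketch of a simple backoff lookup over the n-gram tables (illustrative only)
predictNext <- function(input, n = 3) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    len <- length(words)
    if (len == 0) return(head(as.character(unigram$word), n))

    # try trigrams whose first two words match the last two input words
    if (len >= 2) {
        prefix <- paste(words[len - 1], words[len])
        hits <- trigram[grep(paste0("^", prefix, " "), trigram$word), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
    }

    # back off to bigrams whose first word matches the last input word
    hits <- bigram[grep(paste0("^", words[len], " "), bigram$word), ]
    if (nrow(hits) > 0)
        return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))

    # final fallback: the most frequent unigrams overall
    head(as.character(unigram$word), n)
}

# example call: predictNext("thanks for the")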