The goal of this report is to describe my exploratory analysis of the capstone dataset and my goals for the eventual prediction algorithm and Shiny app. It explains only the major features of the data I have identified so far and briefly summarizes my plans, in a way that should be understandable to a non-data-scientist manager.

The motivation for this report is to:

1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

# allow up to 4 cores for parallelized operations
options(mc.cores = 4)

Checking file sizes (MB)

file.info("final/en_US/en_US.blogs.txt")$size / (1024*1024)
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size / (1024*1024)
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size / (1024*1024)
## [1] 159.3641

Reading in the three text files

# read in the three text files; skipNul avoids warnings from
# embedded nul characters in the Twitter file
blogs   <- readLines('./final/en_US/en_US.blogs.txt',   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines('./final/en_US/en_US.news.txt',    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines('./final/en_US/en_US.twitter.txt', encoding = "UTF-8", skipNul = TRUE)

Counting lines of text in each file

length(blogs)
## [1] 899288
length(news)
## [1] 1010242
length(twitter)
## [1] 2360148
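
Word counts round out the summary statistics. The sketch below assumes the stringi package, which is an extra dependency not used elsewhere in this report:

library(stringi)

# approximate total word count per source
sum(stri_count_words(blogs))
sum(stri_count_words(news))
sum(stri_count_words(twitter))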

Sampling

# ensure reproducibility
set.seed(111)

# sampling to reduce file size
sBlogs <- blogs[sample(1:length(blogs),5000)]
sNews <- news[sample(1:length(news),5000)]
sTwitter <- twitter[sample(1:length(twitter),5000)]

# combine data samples
sData <- c(sTwitter,sNews,sBlogs)

# save the combined data sample (creating the directory if needed)
dir.create("./final/sample", recursive = TRUE, showWarnings = FALSE)
writeLines(sData, "./final/sample/sData.txt")

# remove redundant variables
rm(twitter,news,blogs,sTwitter,sNews,sBlogs)

Reading in the reduced sample dataset

sData <- readLines("./final/sample/sData.txt", encoding="UTF-8")

Cleaning data

Cleaning with the tm package, in this order:

* Convert to lowercase
* Remove punctuation
* Remove numbers
* Remove English stop words
* Strip extra whitespace

library(tm)
## Loading required package: NLP
corpus <- VCorpus(VectorSource(sData))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)  # last, so gaps left by removals collapse
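
A quick spot check confirms the transformations took effect, for example:

# print the first cleaned document
writeLines(as.character(corpus[[1]]))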

Visualizing the sample dataset

library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(corpus, scale=c(3,0.5), min.freq=5, max.words=100, random.order=TRUE,
          rot.per=0.5, colors=brewer.pal(8, "Set1"), use.r.layout=FALSE)

Tokenization

Create unigram, bigram, and trigram tokenizations to explore the frequency of word occurrences (using the RWeka package).

library(RWeka)
# flatten the cleaned corpus into a plain character vector for the tokenizer
corpus_text <- unlist(sapply(corpus, '[', 'content'))

uniGramToken <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 1, max = 1))))
biGramToken  <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 2, max = 2))))
triGramToken <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 3, max = 3))))

# order by decreasing frequency
unigram <- uniGramToken[order(uniGramToken$Freq, decreasing = TRUE),]
bigram  <- biGramToken[order(biGramToken$Freq, decreasing = TRUE),]
trigram <- triGramToken[order(triGramToken$Freq, decreasing = TRUE),]

Exploratory Data Analysis

Graphing frequencies of top n-grams.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
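
The bar charts are produced along the following lines. This is a minimal sketch assuming the unigram, bigram, and trigram data frames built above (columns Var1 and Freq); plotTopNGrams is my own helper name:

plotTopNGrams <- function(df, title, n = 20) {
  top <- head(df, n)
  # re-level so bars appear in descending frequency order after coord_flip()
  top$Var1 <- factor(top$Var1, levels = rev(as.character(top$Var1)))
  ggplot(top, aes(x = Var1, y = Freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plotTopNGrams(unigram, "Top 20 unigrams")
plotTopNGrams(bigram,  "Top 20 bigrams")
plotTopNGrams(trigram, "Top 20 trigrams")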

Plans for the final project

Utilizing the ‘tm’ package for natural language processing, I will build bigram and trigram datasets to drive next-word prediction. In the app, the user will enter a two- or three-word phrase, and the prediction algorithm will suggest the next word from the n-gram tables; a rough sketch of the lookup appears below.
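
As an illustration only (not the final algorithm; predictNextWord is a hypothetical helper), a simple frequency-based backoff over the tables built above might look like:

predictNextWord <- function(phrase) {
  # keep the last two words of the (lower-cased) input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # trigram lookup: entries whose first two words match the input
  hits <- trigram[grepl(paste0("^", paste(words, collapse = " "), " "),
                        trigram$Var1), ]
  if (nrow(hits) == 0) {
    # back off to bigrams keyed on the last word only
    hits <- bigram[grepl(paste0("^", tail(words, 1), " "), bigram$Var1), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # the tables are sorted by frequency, so the first hit is the best guess;
  # return the final word of that n-gram
  tail(unlist(strsplit(as.character(hits$Var1[1]), " ")), 1)
}

predictNextWord("happy new")  # e.g., might suggest "year"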