The goal of this project is to present my exploratory analysis of the capstone data and my goals for the eventual prediction algorithm and Shiny app. This document explains only the major features of the data I have identified so far and briefly summarizes my plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager.
The motivation for this report is to: 1. Demonstrate that the data has been downloaded and successfully loaded. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings amassed so far. 4. Get feedback on my plans for creating a prediction algorithm and Shiny app.
# allow parallel processing on 4 cores
options(mc.cores=4)
file.info("final/en_US/en_US.blogs.txt")$size / (1024*1024)
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size / (1024*1024)
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size / (1024*1024)
## [1] 159.3641
# read in data from the three text files
blogs <- readLines('./final/en_US/en_US.blogs.txt', encoding = "UTF-8")
news <- readLines('./final/en_US/en_US.news.txt', encoding = "UTF-8")
# the Twitter file contains embedded nuls, so skipNul is needed
twitter <- readLines('./final/en_US/en_US.twitter.txt', encoding = "UTF-8", skipNul = TRUE)
summary(blogs)
# Length is 899288
summary(news)
# Length is 1010242
summary(twitter)
# Length is 2360148
# ensure reproducibility
set.seed(111)
# sampling to reduce file size
sBlogs <- blogs[sample(1:length(blogs),5000)]
sNews <- news[sample(1:length(news),5000)]
sTwitter <- twitter[sample(1:length(twitter),5000)]
# combine data samples
sData <- c(sTwitter,sNews,sBlogs)
# save the combined data sample
writeLines(sData, "./final/sample/sData.txt")
# remove redundant variables
rm(twitter,news,blogs,sTwitter,sNews,sBlogs)
# reload the saved sample for processing
sData <- readLines("./final/sample/sData.txt", encoding="UTF-8")
Using the tm library, the sample is cleaned with the following transformations (applied in this order):

* Convert to lowercase
* Remove punctuation
* Remove numbers
* Strip extra whitespace
* Remove English stop words
library(tm)
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.3
corpus <- VCorpus(VectorSource(sData))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
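As a quick sanity check (an optional step, not part of the original pipeline), the first cleaned document can be inspected to confirm the transformations were applied:

# inspect the first document in the cleaned corpus
as.character(corpus[[1]])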
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(corpus, scale=c(3,0.5), min.freq=5, max.words=100, random.order=TRUE,
rot.per=0.5, colors=brewer.pal(8, "Set1"), use.r.layout=FALSE)
Create unigram, bigram, and trigram word models to explore the frequency of word occurrences (using the RWeka package).
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.1.3
# flatten the cleaned corpus into a character vector
corpus_df <- data.frame(text = unlist(sapply(corpus, '[', 'content')), stringsAsFactors = FALSE)
# tokenize into n-grams and tabulate their frequencies
uniGramToken <- data.frame(table(NGramTokenizer(corpus_df$text, Weka_control(min = 1, max = 1))))
biGramToken <- data.frame(table(NGramTokenizer(corpus_df$text, Weka_control(min = 2, max = 2))))
triGramToken <- data.frame(table(NGramTokenizer(corpus_df$text, Weka_control(min = 3, max = 3))))
# order n-gram tables by decreasing frequency
unigram <- uniGramToken[order(uniGramToken$Freq, decreasing = TRUE),]
bigram <- biGramToken[order(biGramToken$Freq, decreasing = TRUE),]
trigram <- triGramToken[order(triGramToken$Freq, decreasing = TRUE),]
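The most frequent terms in each table can then be inspected directly (shown here as a quick check; the actual terms depend on the random sample drawn above):

# peek at the ten most frequent entries of each n-gram table
head(unigram, 10)
head(bigram, 10)
head(trigram, 10)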
Graphing frequencies of top n-grams.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
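A representative sketch of the plotting code for the bigram chart is shown below (the exact aesthetics are an assumption, and 'Var1' is the term column created by data.frame(table(...)); the unigram and trigram charts follow the same pattern):

# plot the 20 most frequent bigrams as a horizontal bar chart
topBigrams <- head(bigram, 20)
ggplot(topBigrams, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Top 20 Bigrams")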
Using the ‘tm’ and ‘RWeka’ packages for natural language processing, I will create bigram and trigram datasets that will be used for predicting the next word. In the app, the user will input a 2-3 word string, and the next word will be suggested based on the prediction algorithm built from the n-gram datasets.
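As a rough illustration of how such a lookup could work, here is a minimal sketch of a frequency-based back-off predictor (the function name predictWord and the simple back-off logic are assumptions for this example, not the final algorithm; it also assumes the input contains no regex metacharacters):

# look up the last two words in the trigram table; fall back to the
# bigram table, then to the single most frequent unigram
predictWord <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    # tables are already sorted by decreasing frequency, so the first match wins
    hits <- trigram[grep(paste0("^", words[n-1], " ", words[n], " "), trigram$Var1), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Var1[1])))
  }
  if (n >= 1) {
    hits <- bigram[grep(paste0("^", words[n], " "), bigram$Var1), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Var1[1])))
  }
  as.character(unigram$Var1[1])
}
predictWord("happy mothers")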