Data Science Milestone Project

The goal of this project is to show that we have become familiar with the data and are on track to create the prediction algorithm. This report, published on RPubs (http://rpubs.com/), explains our exploratory analysis and our goals for the eventual app and algorithm. The document is kept concise: it covers only the major features of the data identified so far and briefly summarizes the plans for the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. Tables and plots are used to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Set working directory

# wd_Drive and wd_Path are placeholders for the drive and folder holding the
# project files; set them to your own location before running this chunk.
working_directory <- paste0(wd_Drive, wd_Path)
setwd(working_directory)

Configure Environment

library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.2
library(ggplot2)
library(stringi)
library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(SnowballC)

Load the data into targeted objects

# Configure directory to read from 
dir <- paste0(getwd(),"/final/en_US/")

# Pull the twitter data
con <- file(paste0(dir,"en_US.twitter.txt"), open = "rb")
twitter_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Pull the news data
con <- file(paste0(dir,"en_US.news.txt"), open = "rb")
news_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Pull the blog data
con <- file(paste0(dir,"en_US.blogs.txt"), open = "rb")
blogs_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

Get some information on the data

megaBytes <- 1024 ^ 2

# Get file sizes
twitter_data.size <- file.info("final/en_US/en_US.twitter.txt")$size / megaBytes
blogs_data.size   <- file.info("final/en_US/en_US.blogs.txt")$size / megaBytes
news_data.size    <- file.info("final/en_US/en_US.news.txt")$size / megaBytes

# Get words in files
twitter_data.words <- stri_count_words(twitter_data)
blogs_data.words <- stri_count_words(blogs_data)
news_data.words <- stri_count_words(news_data)

# Summarize the information about the data sets into one data frame
dataInfo_DF <- data.frame(Area = c("twitter", "blogs", "news"),
                          file.size.MB = c(twitter_data.size, blogs_data.size, news_data.size),
                          num.lines = c(length(twitter_data), length(blogs_data), length(news_data)),
                          num.words = c(sum(twitter_data.words), sum(blogs_data.words), sum(news_data.words)),
                          mean.num.words = c(mean(twitter_data.words), mean(blogs_data.words), mean(news_data.words)))

dataInfo_DF
##      Area file.size.MB num.lines num.words mean.num.words
## 1 twitter     159.3641   2360148  30093410       12.75065
## 2   blogs     200.4242    899288  37546246       41.75108
## 3    news     196.2775   1010243  34762395       34.40993

Create a sample set of data

# Sample 1% of each source so the corpus is small enough to explore quickly
set.seed(41)
data.percentage.sample <- c(sample(twitter_data, length(twitter_data) * 0.01),
                            sample(blogs_data, length(blogs_data) * 0.01),
                            sample(news_data, length(news_data) * 0.01))

Cleaning the sample data

library(tm)

# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.percentage.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, toSpace,"[^[:graph:]]")
# Wrap tolower in content_transformer so the corpus keeps its document class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

Exploratory Analysis Functions

# RWeka's tokenizers can conflict with parallel processing in tm,
# so restrict the work to a single core
options(mc.cores = 1)

# Extract a sorted word-frequency table from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# Tokenizers for two- and three-word sequences
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Plot the 30 most frequent terms in a frequency table
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}

Get frequencies of the most common unigrams in the data sample

freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
makePlot(freq1, "30 Most Common Unigrams")

Get frequencies of the most common bigrams in the data sample

freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
makePlot(freq2, "30 Most Common Bigrams")

Get frequencies of the most common trigrams in the data sample

freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
makePlot(freq3, "30 Most Common Trigrams")

Next Steps for the Prediction Algorithm and Shiny App

This concludes our exploratory analysis. The next steps are to finalize our prediction algorithm and integrate it into a Shiny app.

Our prediction algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above. A simple and likely effective approach is a back-off strategy: first use the trigram model to predict the next word; if no matching trigram is found, fall back to the bigram model, and finally to the unigram model as a last resort.
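
As an illustration of this back-off idea, the sketch below builds a simple next-word lookup on top of the freq1, freq2 and freq3 tables created during the exploratory analysis. It is only a sketch: the helper names (clean_input, predict_next_word, n_suggestions) are placeholders, and the final algorithm would likely need smoothing and much larger n-gram tables than the sparse-trimmed 1% sample used above.

# Minimal sketch of the planned back-off lookup (illustrative only).
# It assumes the freq1, freq2 and freq3 tables built above, where the
# "word" column holds space-separated n-grams sorted by frequency.

# Lower-case the input, drop everything except letters and apostrophes,
# and split it into individual words
clean_input <- function(phrase) {
  phrase <- tolower(phrase)
  phrase <- gsub("[^a-z' ]", " ", phrase)
  unlist(strsplit(trimws(gsub("\\s+", " ", phrase)), " "))
}

predict_next_word <- function(phrase, n_suggestions = 3) {
  tokens <- clean_input(phrase)

  # 1. Try the trigram table: match on the last two words of the input
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- freq3[grepl(paste0("^", prefix, " "), freq3$word), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n_suggestions))
  }

  # 2. Back off to the bigram table: match on the last word only
  if (length(tokens) >= 1) {
    prefix <- tail(tokens, 1)
    hits <- freq2[grepl(paste0("^", prefix, " "), freq2$word), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n_suggestions))
  }

  # 3. Last resort: the most frequent unigrams overall
  head(as.character(freq1$word), n_suggestions)
}

# Example usage (the output depends on the sampled corpus):
# predict_next_word("thanks for the")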

The Shiny app's user interface will consist of a text input box where the user can enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word. We also plan to let the user configure how many suggestions the app returns.
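
To make the planned interface concrete, the following is a minimal Shiny skeleton, assuming a predict_next_word(phrase, n_suggestions) helper like the sketch above. The widget names (phrase, n_words) and the layout are placeholders rather than the final design.

# Minimal Shiny skeleton for the planned app (illustrative only).
# It assumes a predict_next_word(phrase, n_suggestions) function such as
# the sketch above; widget names and layout are placeholders.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a phrase:", value = ""),
      sliderInput("n_words", "Number of suggestions:", min = 1, max = 5, value = 3)
    ),
    mainPanel(
      h4("Suggested next word(s)"),
      verbatimTextOutput("suggestions")
    )
  )
)

server <- function(input, output) {
  output$suggestions <- renderText({
    # Wait until the user has typed something, then show the suggestions
    req(nchar(input$phrase) > 0)
    paste(predict_next_word(input$phrase, input$n_words), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)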