The goal of this project is to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
# Set the working directory (wd_Drive and wd_Path hold the user-specific drive and folder path)
working_directory <- paste(wd_Drive, wd_Path, sep = '')
setwd(working_directory)
library(RWeka)
library(ggplot2)
library(stringi)
library(tm)
library(SnowballC)
# Configure directory to read from
dir <- paste0(getwd(),"/final/en_US/")
# Pull the twitter data
con <- file(paste0(dir,"en_US.twitter.txt"), open = "rb")
twitter_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Pull the news data
con <- file(paste0(dir,"en_US.news.txt"), open = "rb")
news_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Pull the blog data
con <- file(paste0(dir,"en_US.blogs.txt"), open = "rb")
blogs_data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
megaBytes <- 1024 ^ 2
# Get file sizes
twitter_data.size <- file.info("final/en_US/en_US.twitter.txt")$size / megaBytes
blogs_data.size <- file.info("final/en_US/en_US.blogs.txt")$size / megaBytes
news_data.size <- file.info("final/en_US/en_US.news.txt")$size / megaBytes
# Get words in files
twitter_data.words <- stri_count_words(twitter_data)
blogs_data.words <- stri_count_words(blogs_data)
news_data.words <- stri_count_words(news_data)
# Summarize the information about the data sets into one area
dataInfo_DF <- data.frame(Area = c("twitter", "blogs", "news"),
                          file.size.MB = c(twitter_data.size, blogs_data.size, news_data.size),
                          num.lines = c(length(twitter_data), length(blogs_data), length(news_data)),
                          num.words = c(sum(twitter_data.words), sum(blogs_data.words), sum(news_data.words)),
                          mean.num.words = c(mean(twitter_data.words), mean(blogs_data.words), mean(news_data.words)))
dataInfo_DF
## Area file.size.MB num.lines num.words mean.num.words
## 1 twitter 159.3641 2360148 30093410 12.75065
## 2 blogs 200.4242 899288 37546246 41.75108
## 3 news 196.2775 1010243 34762395 34.40993
set.seed(41)
# Sample 1% of each source to keep the corpus small enough for exploration
data.percentage.sample <- c(sample(twitter_data, length(twitter_data) * 0.01),
                            sample(blogs_data, length(blogs_data) * 0.01),
                            sample(news_data, length(news_data) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.percentage.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # strip URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # strip Twitter handles
corpus <- tm_map(corpus, toSpace, "[^[:graph:]]")                  # replace non-printable characters with spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Restrict tm to a single core so the RWeka tokenizers run reliably
options(mc.cores = 1)
# Return a data frame of terms and their frequencies, sorted in descending order
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for two- and three-word n-grams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Plot the 30 most frequent terms as a bar chart
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
# Unigram, bigram, and trigram frequencies (sparse terms removed to keep the matrices manageable)
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
makePlot(freq1, "30 Most Common Unigrams")
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
makePlot(freq2, "30 Most Common Bigrams")
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
makePlot(freq3, "30 Most Common Trigrams")
This concludes our exploratory analysis. The next steps are to finalize our predictive algorithm and integrate it into a Shiny app.
Our predictive algorithm will use an n-gram model with a frequency lookup, similar to the exploratory analysis above. A promising approach is to use the trigram model to predict the next word; if no matching trigram can be found, the algorithm will back off to the bigram model, and then to the unigram model as a last resort.
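As a rough illustration of this backoff idea (not the final implementation), the sketch below shows one way the lookup could work against frequency tables shaped like freq1, freq2, and freq3 above, i.e. a word column holding the full n-gram and a freq column. The function and argument names are placeholders.
# Minimal sketch of a trigram -> bigram -> unigram backoff lookup.
# Assumes unigram_freq, bigram_freq, trigram_freq look like freq1/freq2/freq3 above.
predict_next_word <- function(phrase, unigram_freq, bigram_freq, trigram_freq, n = 3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  tokens <- tokens[tokens != ""]
  # Find n-grams whose leading words match 'prefix' and return their final words
  lookup <- function(freq_table, prefix) {
    matches <- freq_table[grepl(paste0("^", prefix, " "), freq_table$word), ]
    if (nrow(matches) == 0) return(character(0))
    candidates <- head(as.character(matches[order(-matches$freq), "word"]), n)
    sapply(strsplit(candidates, " "), tail, 1)
  }
  # Try trigrams on the last two words, then bigrams on the last word,
  # then fall back to the overall most frequent unigrams
  if (length(tokens) >= 2) {
    hit <- lookup(trigram_freq, paste(tail(tokens, 2), collapse = " "))
    if (length(hit) > 0) return(hit)
  }
  if (length(tokens) >= 1) {
    hit <- lookup(bigram_freq, tail(tokens, 1))
    if (length(hit) > 0) return(hit)
  }
  head(as.character(unigram_freq$word), n)
}
For example, predict_next_word("happy new", freq1, freq2, freq3) would first look for trigrams beginning with "happy new" before backing off to bigrams starting with "new".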
The user interface of the Shiny app will consist of a text input box where the user can enter a phrase. The app will then use our algorithm to suggest the most likely next word after a short delay. We also plan to let the user configure how many words the app should suggest.
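To make this concrete, here is a minimal Shiny skeleton along those lines. It assumes the hypothetical predict_next_word() helper sketched above and the freq1/freq2/freq3 tables are available; the widget names are placeholders rather than the final design.
library(shiny)

# Sketch of the planned app: a phrase input, a control for the number of
# suggestions, and a text area showing the predicted next words.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  sliderInput("n_words", "Number of suggestions:", min = 1, max = 5, value = 3),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    req(input$phrase)
    words <- predict_next_word(input$phrase, freq1, freq2, freq3, n = input$n_words)
    paste(words, collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)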