The goal of this project is simply to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm.
Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:

- demonstrate that the data have been downloaded and successfully loaded;
- report basic summary statistics and the major features identified in the data; and
- outline the plan for the prediction algorithm and the Shiny app.

The data provided for NLP (Natural Language Processing) consist of 3 “corpora” of English (en_US) text:

- Blogs (en_US.blogs.txt)
- News (en_US.news.txt)
- Twitter (en_US.twitter.txt)
## Load libraries and suppress messages for ease of reading report
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(LaF))
suppressMessages(library(quanteda))
suppressMessages(library(RColorBrewer))
suppressMessages(library(RWeka))
suppressMessages(library(slam))
suppressMessages(library(SnowballC))
suppressMessages(library(tau))
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
# Download and extract the data (skip the download if the archive already exists)
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destination_file <- "Coursera-SwiftKey.zip"
if (!file.exists(destination_file)) {
  download.file(source_file, destination_file)
}
# Unzip the archive into the working directory
unzip(destination_file)
# Load the en_US data sets
dataBlogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
dataNews <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
dataTwitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Convert to ASCII (non-convertible characters are replaced with their byte codes)
dataNews <- iconv(dataNews, 'UTF-8', 'ASCII', "byte")
dataBlogs <- iconv(dataBlogs, 'UTF-8', 'ASCII', "byte")
dataTwitter <- iconv(dataTwitter, 'UTF-8', 'ASCII', "byte")
Since these files are large (based on the time taken to read them), a quick summary of each file will help determine a sampling approach.
# Assess the size (bytes), line count and longest line of all 3 files - blogs, news and Twitter
dataBlogs.filesize <- file.size("./final/en_US/en_US.blogs.txt")
dataNews.filesize <- file.size("./final/en_US/en_US.news.txt")
dataTwitter.filesize <- file.size("./final/en_US/en_US.twitter.txt")
dataframe.blogs <- c(dataBlogs.filesize, length(dataBlogs), max(nchar(dataBlogs)))
dataframe.news <- c(dataNews.filesize, length(dataNews), max(nchar(dataNews)))
dataframe.twitter <- c(dataTwitter.filesize, length(dataTwitter), max(nchar(dataTwitter)))
info <- data.frame(rbind(dataframe.blogs, dataframe.news, dataframe.twitter))
names(info) <- c("File Size (bytes)", "Line Count", "Longest Line (chars)")
row.names(info) <- c("Blogs", "News", "Twitter")
# Showcase table
info
## File Size (bytes) Line Count Longest Line (chars)
## Blogs 210160014 899288 40844
## News 205811889 77259 5760
## Twitter 167105338 2360148 589
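The sizes above are in bytes; for readability they can be converted to megabytes using the info table just built:
# Convert the byte counts in the summary table to megabytes
round(info["File Size (bytes)"] / 1024^2, 1)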
Since working with such large data sets is memory intensive, I will use basic random sampling to reduce the amount of text to mine through. This sample will also be used for the final predictive analysis.
The sample size has been arbitrarily set at 5% of the lines in each file. Based on the prediction results, this could later be increased or decreased; the exploratory analysis below uses this initial value.
# Assess maximum number of characters in a line of the files
summary(nchar(dataBlogs))[6]
## Max.
## 40840
summary(nchar(dataNews))[6]
## Max.
## 5760
summary(nchar(dataTwitter))[6]
## Max.
## 589
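Note that sample_lines draws a random sample, so the results below vary between runs; assuming it uses R's random number generator, a seed can be set before sampling for reproducibility (1234 below is an arbitrary value):
# Arbitrary seed so the same 5% sample is drawn on every run
set.seed(1234)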
# Sample 5% of the lines in each file to keep the working data manageable
dataBlogs_sample_size <- round(.05 * length(dataBlogs), 0)
dataNews_sample_size <- round(.05 * length(dataNews), 0)
dataTwitter_sample_size <- round(.05 * length(dataTwitter), 0)
# Draw a random sample of approximately 5% of the lines from each file
dataBlogs_sample <- sample_lines("./final/en_US/en_US.blogs.txt", n = dataBlogs_sample_size, nlines = NULL)
dataNews_sample <- sample_lines("./final/en_US/en_US.news.txt", n = dataNews_sample_size , nlines = NULL)
dataTwitter_sample <- sample_lines("./final/en_US/en_US.twitter.txt", n = dataTwitter_sample_size, nlines = NULL)
# Build a document-feature matrix (quanteda) for each of the 3 samples and inspect document frequencies
dataBlogs_word_freq <- dfm(dataBlogs_sample, verbose = FALSE)
dataNews_word_freq <- dfm(dataNews_sample, verbose = FALSE)
dataTwitter_word_freq <- dfm(dataTwitter_sample, verbose = FALSE)
docfreq(dataBlogs_word_freq)[1:11]
## todd breathed deeply , as if
## 11 14 66 27136 7354 4331
## restraining himself from clobbering me
## 5 304 5911 1 4989
docfreq(dataNews_word_freq)[1:11]
## such students now must pay $ 15 for
## 79 46 149 32 47 256 39 1084
## each of their
## 66 1776 303
docfreq(dataTwitter_word_freq)[1:11]
## thanks for the rt / mention !
## 4392 17618 36410 4150 4043 216 36847
## i am answering questions
## 28458 1982 24 192
The function below will be used to clean the data, including stemming. Stop words are deliberately not removed: they provide much-needed context and sentence fluidity in natural language, so they are retained.
# CleanR: apply a standard tm cleaning pipeline to a corpus
CleanR <- function(corpus){
  tm_map(corpus, removeNumbers) %>%           # remove digits
    tm_map(removePunctuation) %>%             # remove punctuation
    tm_map(content_transformer(tolower)) %>%  # convert to lower case
    tm_map(stripWhitespace) %>%               # collapse extra whitespace
    tm_map(stemDocument)                      # stem words (SnowballC)
}
# Create n-gram tokenizer functions (unigram, bigram, trigram) via RWeka
unigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
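As a quick illustration on a made-up phrase (not drawn from the corpus), the bigram tokenizer splits text into overlapping two-word sequences:
# Illustrative only: what the RWeka bigram tokenizer returns for a toy phrase
bigram_token("thanks for the mention")
# expected: "thanks for" "for the" "the mention"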
# Combine the three samples into one cleaned corpus for the n-gram work
corpus <- CleanR(VCorpus(VectorSource(c(dataBlogs_sample, dataNews_sample, dataTwitter_sample))))
# Build a unigram TermDocumentMatrix via the RWeka tokenizer
options(stringsAsFactors = FALSE)
options(mc.cores = 1)
unigram <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_token))
unigram.good <- rollup(unigram, 2, na.rm=TRUE, FUN = sum)
# Sort with decreasing frequency
unigram.tf <- findFreqTerms(unigram.good, lowfreq = 3)
unigram.tf <- sort(rowSums(as.matrix(unigram.good[unigram.tf, ])), decreasing = TRUE)
unigram.tf <- data.frame(word = names(unigram.tf), frequency = unigram.tf)
head(unigram.tf, 10)
## word frequency
## the the 57668
## and and 31435
## you you 16836
## for for 15543
## that that 14065
## with with 9598
## this this 8601
## was was 8312
## have have 7843
## are are 7264
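Since ggplot2 is already loaded, the same table can also be visualised; a minimal sketch of a bar chart of the ten most frequent unigrams:
# Bar chart of the 10 most frequent unigrams in the sampled corpus
ggplot(head(unigram.tf, 10), aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 10 unigrams")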
# Bigram counts via tau::textcnt (blogs sample)
bi.gram.dataBlogs <- textcnt(dataBlogs_sample, n = 2, method = "string")
bi.gram.dataBlogs <- bi.gram.dataBlogs[order(bi.gram.dataBlogs, decreasing = TRUE)]
bi.gram.dataBlogs[1:3] # top three, 2-Word combinations
## of the in the to the
## 9213 7612 4262
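The same approach extends to trigrams (via the trigram_token function above or tau::textcnt with n = 3); a sketch for the blogs sample:
# Trigram counts for the blogs sample, mirroring the bigram step above
tri.gram.dataBlogs <- textcnt(dataBlogs_sample, n = 3, method = "string")
tri.gram.dataBlogs <- tri.gram.dataBlogs[order(tri.gram.dataBlogs, decreasing = TRUE)]
tri.gram.dataBlogs[1:3] # top three three-word combinations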
blogs_corpus <- VCorpus(VectorSource(dataBlogs_sample))
news_corpus <- VCorpus(VectorSource(dataNews_sample))
twitter_corpus <- VCorpus(VectorSource(dataTwitter_sample))
rm(dataBlogs_sample); rm(dataNews_sample); rm(dataTwitter_sample)
blogs_corpus <- CleanR(blogs_corpus)
news_corpus <- CleanR(news_corpus)
twitter_corpus <- CleanR(twitter_corpus)
# Word clouds of the 90 most frequent words in each cleaned sample
pal <- brewer.pal(8, "Accent")
wordcloud(blogs_corpus, max.words = 90, random.order = FALSE, colors = pal)
wordcloud(news_corpus, max.words = 90, random.order = FALSE, colors = pal)
wordcloud(twitter_corpus, max.words = 90, random.order = FALSE, colors = pal)
The section below briefly explains the approach for the prediction algorithm and for creating the Shiny app. At the time of writing this report, a profanity-filtering approach has not yet been decided.
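Purely as an illustration of one option (not a decision), profanity could be stripped with tm::removeWords during cleaning; profanity_words below is a placeholder, not a chosen word list:
# Illustrative only: profanity_words stands in for a real profanity list (source not yet chosen)
profanity_words <- c("badword1", "badword2")
blogs_corpus <- tm_map(blogs_corpus, removeWords, profanity_words)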
The next steps in the project are:

- Use the sampled corpus to build bigram and trigram frequency tables; these data frames will be used to predict the next word from the n-gram frequency table (a sketch follows this list).
- Return the top two candidate words from the frequency table. Only the last word of the input will be used for prediction, even if the input contains more than one word.
- Build a Shiny app that takes the user's input as a character string, uses the last word of that input, and returns the top two words most likely to come next.
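A minimal sketch of how that lookup could work, assuming a bigram frequency table shaped like bi.gram.dataBlogs above (predict_next_word is an illustrative name, not part of the final app):
# Illustrative sketch: return the top two completions for the last word of the input
predict_next_word <- function(input, bigram_freq) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_word <- tail(words, 1)
  # keep bigrams whose first word matches the last input word
  matches <- bigram_freq[grepl(paste0("^", last_word, " "), names(bigram_freq))]
  matches <- matches[order(matches, decreasing = TRUE)]
  # return the second word of the two most frequent matching bigrams
  sapply(strsplit(names(head(matches, 2)), " "), `[`, 2)
}
predict_next_word("thanks for the", bi.gram.dataBlogs)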