The goal of this project is simply to demonstrate that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that the data have been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.
The first step of the project is to load all the libraries needed to complete the tasks outlined in the introduction.
library(stringi)    # fast string handling and word counting
library(NLP)        # basic natural language processing infrastructure
library(openNLP)    # Apache OpenNLP tools
library(tm)         # text mining framework (corpora, term-document matrices)
library(rJava)      # Java bridge required by RWeka
library(RWeka)      # Weka tools, including the n-gram tokenizer
library(RWekajars)  # Weka jar files used by RWeka
library(SnowballC)  # Snowball stemmer for word stemming
library(qdap)       # additional text cleaning utilities
library(ggplot2)    # plotting
The data used in this project can be obtained from the URL below.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
output <- "C://Users//bebxtaxasta//Desktop//Coursera//Project"
# Download and unzip the data only if it is not already present
if (!file.exists(destFile)) {
  download.file(url, destFile, mode = "wb")  # binary mode so the zip is not corrupted on Windows
  unzip(destFile, exdir = output)
}
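As a quick sanity check, the extracted English files can be listed (assuming the archive unzipped into the final/en_US sub-folder used throughout the rest of this report):
# List the extracted English data files
list.files(file.path(output, "final", "en_US"))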
In this part I give a basic overview of the data file statistics.
The three files used in this project are:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
The following is a summary of each file's size in megabytes.
file.info(".//Coursera//Project//final//en_US//en_US.blogs.txt")$size / 1024^2 #size of Blogs file
## [1] 200.4242
file.info(".//Coursera//Project//final//en_US//en_US.news.txt")$size / 1024^2 #size of News file
## [1] 196.2775
file.info(".//Coursera//Project//final//en_US//en_US.twitter.txt")$size / 1024^2 #size of Twitter file
## [1] 159.3641
Here is the number of lines in each file.
blogs <- readLines(".//Coursera//Project//final//en_US//en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
length(blogs) #Number of rows in Blogs file
## [1] 899288
news <- readLines(".//Coursera//Project//final//en_US//en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
length(news) #Number of rows in News file
## [1] 77259
twitter <- readLines(".//Coursera//Project//final//en_US//en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
length(twitter) #Number of rows in Twitter file
## [1] 2360148
Below is a summary of the number of words in each file.
sum(stri_count_words(blogs)) # Number of words in Blogs file
## [1] 37546246
sum(stri_count_words(news)) # Number of words in News file
## [1] 2674536
sum(stri_count_words(twitter)) # Number of words in Twitter file
## [1] 30093410
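For readability, the same statistics can also be gathered into a single overview table. This is a small sketch that reuses the objects created above; fileSummary is just an illustrative name.
# Combine file size (MB), line count and word count into one overview table
path  <- ".//Coursera//Project//final//en_US"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
fileSummary <- data.frame(
  file    = files,
  size_MB = file.info(file.path(path, files))$size / 1024^2,
  lines   = c(length(blogs), length(news), length(twitter)),
  words   = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter)))
)
fileSummary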
Due to the sheer size of the data files, we will work with a sample of 1,000 lines from each file, giving a total sample of 3,000 lines.
set.seed(1000)
sTwitter <- sample(twitter, size = 1000, replace = TRUE)
sBlogs <- sample(blogs, size = 1000, replace = TRUE)
sNews <- sample(news, size = 1000, replace = TRUE)
sampleTotal <- c(sTwitter, sBlogs, sNews)
length(sampleTotal)
## [1] 3000
writeLines(sampleTotal, "./sample.txt")
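Since only the 3,000-line sample is needed from this point on, the full files can be removed to keep memory use manageable (an optional housekeeping step):
# The full corpora are no longer needed once the sample is saved
rm(blogs, news, twitter)
gc()  # prompt R to release the freed memory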
In this step a corpus was built from the 3,000-line sample and the following cleaning transformations were applied:
- conversion to lower case
- removal of punctuation
- removal of numbers
- stripping of extra whitespace
- removal of English stop words
- word stemming
build_corpus <- function(x) {
  corp <- VCorpus(VectorSource(x))
  corp <- tm_map(corp, content_transformer(tolower))       # convert to lower case
  corp <- tm_map(corp, removePunctuation)                  # remove punctuation
  corp <- tm_map(corp, removeNumbers)                      # remove numbers
  corp <- tm_map(corp, stripWhitespace)                    # collapse extra whitespace
  corp <- tm_map(corp, removeWords, stopwords("english"))  # remove English stop words
  corp <- tm_map(corp, stemDocument)                       # stem words to their roots
  corp
}
corpus <- build_corpus(sampleTotal)
rm(sampleTotal)
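To verify that the cleaning worked as intended, the first processed document can be inspected (a simple spot check; the text shown depends on the random sample):
# Spot check: content of the first cleaned document
as.character(corpus[[1]])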
In this section I explore the data, compute word frequencies, and plot the most common unigrams, bigrams and trigrams.
options(mc.cores = 1)  # use a single core so the RWeka tokenizers work reliably with tm

# Build a frequency table (word, freq) from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# Tokenizers for two- and three-word sequences
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Bar chart of the 30 most frequent terms in a frequency table
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("red"))
}
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
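Before plotting, a quick look at the top of the unigram table shows what these frequency data frames contain (the exact words and counts will vary with the sampled lines):
head(freq1, 10)  # ten most frequent unigrams in the sample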
makePlot(freq1, "Top 30 Most Common Unigrams")
makePlot(freq2, "Top 30 Most Common Bigrams")
makePlot(freq3, "Top 30 Most Common Trigrams")
This concludes the preliminary analysis of the data. Because the files are so large, analysing them in full requires a substantial amount of memory.
Here are the next steps for the final project of the course:
1. Build a word-prediction algorithm based on n-gram frequencies like the ones explored above (a rough sketch of the lookup idea is shown below).
2. Develop a Shiny app around the algorithm that takes a phrase as input and suggests the most likely next word.
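To illustrate the prediction idea, the trigram frequency table built above could be used to look up the most frequent continuation of a two-word input. This is only a minimal sketch under the current preprocessing (stop words removed, words stemmed); predictNext is a hypothetical helper name, not the final algorithm.
# Minimal sketch of a next-word lookup using the trigram frequency table.
# predictNext is a hypothetical helper, not the final prediction algorithm.
predictNext <- function(phrase, trigrams = freq3) {
  phrase <- tolower(phrase)
  # keep trigrams whose first two words match the input phrase
  hits <- trigrams[grepl(paste0("^", phrase, " "), trigrams$word), ]
  if (nrow(hits) == 0) return(NA_character_)
  # the table is already sorted by frequency, so the first match is the best guess
  strsplit(as.character(hits$word[1]), " ")[[1]][3]
}
predictNext("new york")  # example call; the result depends on the processed sample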