library(stringi)
library(tm)
library(slam)
library(ggplot2)
library(wordcloud)
library(RWeka)
library(reshape2)
library(R.utils)
The goal of this report is to show that we have become familiar with the data and are on track to build our prediction algorithm. The report is published on RPubs and explains our exploratory analysis and our goals for the eventual app and algorithm. It is deliberately concise: it covers only the major features of the data we have identified and briefly summarizes our plans for the prediction algorithm and Shiny app, in a way that a manager without a data science background can follow. Tables and plots illustrate important summaries of the data set. The motivation for this project is to:
All the data was downloaded from the link provided in Week 1 of the Data Science Capstone on Coursera. It consists of four sets of files, one per language (German, English, Finnish and Russian), each containing samples of blogs, news and tweets. This project uses only the English data set.
blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
# Check the length/line count of the files
summary(blogs)
## Length Class Mode
## 899288 character character
summary(news)
## Length Class Mode
## 77259 character character
summary(twitter)
## Length Class Mode
## 2360148 character character
# Check the word count of the files
blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))
print(blogs_words)
## [1] 37546246
print(news_words)
## [1] 2674536
print(twitter_words)
## [1] 30093410
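A line count that looks unexpectedly low can be a sign that a text-mode read stopped early at an embedded control character, which is known to happen on some platforms. The sketch below shows one common workaround, reading through a connection opened in binary mode. The helper name `read_lines_binary` is hypothetical and the temporary file is only there to make the example self-contained; whether the files above are affected is not established here.

```r
# Hypothetical helper (not part of the original analysis):
# read every line of a file through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

Comparing `length(read_lines_binary(path))` with the count from a plain `readLines(path)` call is a quick way to check whether a given file is read in full.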
Given the size of the files shown above, each data set is sampled for faster processing.
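set.seed(1234)  # arbitrary seed (an assumption, not in the original run) so the sample is reproducible
sampleTwitter <- sample(twitter, 75000)
sampleNews <- sample(news, 75000)
sampleBlogs <- sample(blogs, 75000)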
textSample <- c(sampleTwitter,sampleNews,sampleBlogs)
## Save sample
writeLines(textSample, "./Sample Data/BigTextSample.txt")
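Because `sample()` stops with an error when asked for more items than the vector holds, and because an unseeded sample changes on every run, the sampling step can be wrapped in a small helper. This is a minimal sketch on toy data; the name `sample_lines` and the seed value are assumptions made for illustration.

```r
# Hypothetical helper: draw up to n lines without replacement,
# reproducibly, and never ask for more lines than exist.
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)                          # arbitrary seed for reproducibility
  idx <- sample.int(length(x), min(n, length(x)))
  x[idx]
}

toy <- paste("line", 1:10)
s1 <- sample_lines(toy, 4)
s2 <- sample_lines(toy, 4)
identical(s1, s2)                 # TRUE: same seed, same sample
length(sample_lines(toy, 50))     # 10: capped at the number of lines available
```

The cap matters most for the news file, whose line count reported above is only slightly larger than the requested sample size.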
We clean the sample data with the ‘tm’ package: we convert the text to lower case, remove punctuation, numbers, URLs and profanity, and then stem the remaining words.
SampleCleanse <- VCorpus(DirSource("./Sample Data/", encoding = "UTF-8"), readerControl = list(reader = readPlain))
inspect(SampleCleanse)
ProfanityFilter <- read.table("./Profanitywords.txt", header=FALSE)
## Converting to lower case
SampleCleanse <- tm_map(SampleCleanse, content_transformer(tolower))
## Not removing the stop words right now as they may improve the predictive model. This will be tested when the algorithm is being built.
## Removing punctuation, numbers, URLs and profanity, then stemming the words:
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removePunctuation))
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removeNumbers))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
SampleCleanse <- tm_map(SampleCleanse, content_transformer(removeURL))
SampleCleanse <- tm_map(SampleCleanse, stripWhitespace)
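## removeWords expects a character vector, so pass the first column of the table (lower-cased to match the text)
SampleCleanse <- tm_map(SampleCleanse, removeWords, tolower(as.character(ProfanityFilter[[1]])))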
SampleCleanse <- tm_map(SampleCleanse, stemDocument)
SampleCleanse <- tm_map(SampleCleanse, stripWhitespace)
## Save the cleaned and final sample
saveRDS(SampleCleanse, file = "./FinalSample.RData")
##Load the final sample
FinalData <- readRDS(file = "./FinalSample.RData")
sampleTDM <- TermDocumentMatrix(FinalData)
saveRDS(sampleTDM, file = "./sampleTDM.RData")
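The order of the cleaning steps matters for the URL pattern used above: `"http[[:alnum:]]*"` only consumes letters and digits, so it removes a whole URL only after punctuation has been stripped. The sketch below illustrates this on a made-up string. It uses base R's `gsub("[[:punct:]]", ...)` as a stand-in for `removePunctuation` so that it runs without `tm`, and the example URL is purely illustrative.

```r
# Same pattern as the removeURL function defined in the cleaning code
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

raw <- "see http://example.com/page now"

# Applied to raw text, the match stops at the first ":" and leaves debris
removeURL(raw)

# After punctuation is stripped, the URL is a single alphanumeric token
# starting with "http", so the pattern removes it completely
stripped <- gsub("[[:punct:]]", "", raw)
removeURL(stripped)
```

The second call leaves a double space behind, which is why `stripWhitespace` is applied after the URL step.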
An n-gram model is a probabilistic language model that predicts the next item in a sequence using an (n-1)-order Markov model, that is, from the n-1 items that precede it.
n-gram models are widely used in statistical natural language processing.
Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability - with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
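To make the idea concrete, the following minimal sketch counts bigrams in a made-up sentence and estimates the probability of the next word as count(previous next) / count(previous). It uses base R only; the toy sentence and the helper name `p_next` are assumptions for illustration, and this is not the tokenizer used in the analysis.

```r
# Toy token sequence (made up for illustration)
tokens <- strsplit("the cat sat on the mat the cat ran", " ")[[1]]

# Bigrams are adjacent pairs of tokens
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

unigram_counts <- table(tokens)
bigram_counts  <- table(bigrams)

# Hypothetical helper: maximum-likelihood estimate of P(nxt | previous)
p_next <- function(previous, nxt) {
  pair <- paste(previous, nxt)
  if (!pair %in% names(bigram_counts)) return(0)   # unseen pair
  as.numeric(bigram_counts[pair]) / as.numeric(unigram_counts[previous])
}

p_next("the", "cat")   # 2 of the 3 occurrences of "the" are followed by "cat"
p_next("the", "dog")   # 0: this pair never occurs in the toy text
```

Unseen pairs get probability zero here, which is the main weakness a real model has to address, for example with smoothing or by backing off to shorter n-grams.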
# Tokenizers that split a document into 2-word and 3-word sequences
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
TrigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
smpdata <- readRDS("./FinalSample.RData")
# n-gram generation
unigram <- removeSparseTerms(TermDocumentMatrix(smpdata), 0.9999)
bigram <- removeSparseTerms(TermDocumentMatrix(smpdata, control = list(tokenize = BigramTokenizer)), 0.9999)
trigram <- removeSparseTerms(TermDocumentMatrix(smpdata, control = list(tokenize = TrigramTokenizer)), 0.9999)
To plot the n-grams, we first extract the frequency of each term from the term-document matrices.
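# Return a data frame of terms and their total frequencies, most frequent first
freq_ngm <- function(x) {
  # slam::row_sums works on the sparse matrix directly, avoiding a dense copy
  freq <- sort(slam::row_sums(x), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}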
unifreq <- freq_ngm(unigram)
bifreq <- freq_ngm(bigram)
trifreq <- freq_ngm(trigram)
# Bar chart of the 30 most frequent terms (fewer if the table is shorter)
plotfrequency <- function(data, title) {
  ggplot(head(data, 30), aes(reorder(word, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(x = "Words/Phrases", y = "Frequency") +
    ggtitle(title) +
    theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1))
}
plotfrequency(unifreq, "Top 30 Unigrams")
plotfrequency(bifreq, "Top 30 Bigrams")
plotfrequency(trifreq, "Top 30 Trigrams")
Loading and processing the data takes a long time because of its size, so the data had to be sampled to keep processing times manageable.
Although removing stop words is commonly advised, they have been kept here because they are an essential part of the language and may improve the prediction model later on.
The next step is to create a predictive algorithm and make it as efficient and accurate as possible.
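One possible shape for such an algorithm, offered only as a sketch and not necessarily the approach this project will adopt, is a simple backoff lookup: try the most frequent trigram that starts with the last two words typed, and fall back to bigrams if nothing matches. The toy tables below mimic the `word`/`freq` layout of the frequency data frames built earlier; their contents and the helper names `best_next` and `predict_next` are made up for illustration.

```r
# Toy frequency tables with the same columns as bifreq / trifreq
toy_bi  <- data.frame(word = c("thank you", "thank god", "you are"),
                      freq = c(50, 5, 30), stringsAsFactors = FALSE)
toy_tri <- data.frame(word = c("thank you for", "thank you so"),
                      freq = c(20, 8), stringsAsFactors = FALSE)

# Hypothetical helper: last word of the most frequent n-gram that
# starts with `prefix`, or NA if no n-gram matches
best_next <- function(prefix, tbl) {
  hits <- tbl[startsWith(tbl$word, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$word[which.max(hits$freq)]
  sub(".* ", "", top)   # greedy match drops everything up to the last space
}

# Hypothetical helper: trigram lookup first, then back off to bigrams
predict_next <- function(text) {
  w <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(w) >= 2) {
    guess <- best_next(paste(tail(w, 2), collapse = " "), toy_tri)
    if (!is.na(guess)) return(guess)
  }
  best_next(tail(w, 1), toy_bi)
}

predict_next("Thank you")   # "for": found among the trigrams
predict_next("so you")      # "are": no trigram matches, bigram fallback
```

Because the corpus in this report is stemmed, a real implementation would need to apply the same cleaning and stemming to the typed input before the lookup.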