The goal of this project is to explore the data and build a prediction algorithm. This report covers the exploratory analysis and the goals for the eventual app and algorithm. It describes only the major features of the data I have identified and briefly summarizes my plans for creating the prediction algorithm and Shiny app, in a way that should be understandable to a manager who is not a data scientist.
The motivation for this project is to:
library(RWeka)
library(RColorBrewer)
library(NLP);library(slam)
library(knitr); library(doParallel)
library(stringi); library(tm); library(ggplot2); library(wordcloud)
setwd("E:/shubby coursera/final/en_US")
blogs <- suppressWarnings(readLines(con <- file("./en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
twitter <- suppressWarnings(readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
news <- suppressWarnings(readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE))
close(con)
FileName <- c("blogs", "news", "twitter")
FileSizeMB <- c(200, 196, 159)
# number of lines read from each file
Lines <- sapply(list(blogs, news, twitter), length)
# length in characters of the longest line in each file
MaxLineChars <- sapply(list(blogs, news, twitter), function(x) max(nchar(x)))
data.frame(FileName, FileSizeMB, Lines, MaxLineChars)
  FileName FileSizeMB   Lines MaxLineChars
1    blogs        200  899288        40833
2     news        196   77259         5760
3  twitter        159 2360148          140
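The file sizes in the summary above are typed in by hand. As a minimal sketch, they could instead be read from disk with base R's `file.size()`. The helper name `file_size_mb` is hypothetical, and the snippet uses a temporary file rather than the real data files so that it runs anywhere.

```r
# Helper (hypothetical name) returning a file's size in megabytes
file_size_mb <- function(path) file.size(path) / 1024^2

# Demonstrated on a temporary file, not on the actual corpus files
tmp <- tempfile(fileext = ".txt")
writeBin(raw(2 * 1024^2), tmp)   # write exactly 2 MB of zero bytes
size_mb <- file_size_mb(tmp)
unlink(tmp)                      # clean up the temporary file
size_mb
```

Applied to the report, the same helper could be called on each of the three text files, for example with `sapply()` over a vector of paths.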
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
set.seed(519)
sample_data <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
corpus <- VCorpus(VectorSource(sample_data))
# content_transformer() keeps each document a valid tm document after tolower
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
plotNGram <- function(n) {
  # run single-threaded, a common workaround when using RWeka tokenizers with tm
  options(mc.cores = 1)
  # build an n-gram tokenizer and the term-document matrix
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tk))
  # sum counts across documents and keep the 25 most frequent n-grams
  ngram <- as.matrix(rollup(tdm, 2, na.rm = TRUE, FUN = sum))
  ngram <- data.frame(word = rownames(ngram), freq = ngram[, 1])
  ngram <- head(ngram[order(-ngram$freq), ], 25)
  # fix the factor level order so bars are drawn in descending frequency
  ngram$word <- factor(ngram$word, as.character(ngram$word))
  # bar plot of the n-gram frequencies
  ggplot(ngram, aes(x = word, y = freq)) + ggtitle("Frequency of Words") +
    geom_bar(stat = "identity", fill = "#ED9626", color = "#855415") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word(s)") +
    ylab("Frequency")
}
plotNGram(1)
plotNGram(2)
plotNGram(3)
plotNGram(4)
Common n-grams can be used to identify tokens.
N-grams with similar frequencies are likely to be parts of a longer n-gram (e.g. the 3-grams ‘boy big sword’ and ‘little boy big’ are both common, and so is the 4-gram ‘little boy big sword’).
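The overlap between n-grams of similar frequency can be illustrated with a small base R sketch. The three sentences in `docs` are invented for illustration and are not taken from the data set, and the helpers `ngrams` and `count_ngrams` are hypothetical names, not part of tm or RWeka.

```r
# Toy corpus (hypothetical, for illustration only)
docs <- c("the little boy big sword was heavy",
          "a little boy big sword story",
          "little boy big sword again")

# Split a lower-cased string on whitespace and paste every run of n words
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all documents
count_ngrams <- function(docs, n) {
  table(unlist(lapply(docs, ngrams, n = n)))
}

tri  <- count_ngrams(docs, 3)
quad <- count_ngrams(docs, 4)

# The two overlapping 3-grams and the 4-gram they form
tri[c("little boy big", "boy big sword")]
quad["little boy big sword"]
```

In this toy corpus the two 3-grams and the 4-gram that contains them all have the same count, which is the pattern described above: when overlapping n-grams share a frequency, they usually come from the same longer phrase.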
Differences among the three data sources include:
- Noise is more prominent in informal sources (e.g. blogs and Twitter), as the text is heavily influenced by personal style, which differs significantly from person to person.