The goal of this project is simply to show that you have gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you have downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Load required libraries for this report.
library(LaF)      # fast counting and sampling of lines from large text files
library(tm)       # text-mining framework: corpus handling, cleaning, TDMs
library(lexicon)  # word lists, including profanity lexicons
library(RWeka)    # n-gram tokenization
In order to create some basic summaries, we first need to get the list of files.
PATH <- "final/en_US"
fpaths <- list.files(PATH, full.names=TRUE)
We are going to use the Unix command wc, called from R via system2(). For each file, it prints the number of lines, the number of words, the number of characters, the file size in bytes, and the maximum line length, in that order.
header <- paste("", "Lines", "", "Words", "Chars", "", "Bytes",
                "", "Max", "", "File", sep="\t")
output <- system2("wc", c("-lwmcL", fpaths), stdout=TRUE)
cat(c(header, output), sep="\n")
## Lines Words Chars Bytes Max File
## 899288 37334117 207723793 209260726 40833 final/en_US/en_US.blogs.txt
## 1010242 34365936 204233401 204801647 11384 final/en_US/en_US.news.txt
## 2360148 30373559 164456396 164745190 173 final/en_US/en_US.twitter.txt
## 4269678 102073612 576413590 578807563 40833 total
And now the file sizes in a more human-readable format, using the Unix command du.
cat(system2("du", c("-h", fpaths), stdout=TRUE), sep="\n")
## 200M final/en_US/en_US.blogs.txt
## 196M final/en_US/en_US.news.txt
## 158M final/en_US/en_US.twitter.txt
Now we explore the data. Since the files are quite large, we randomly sample only 1% of the lines from each file.
LINES_PERCENT <- 0.01
SEED <- 123456
set.seed(SEED)
randomLines <- function(fpath, percent) {
    # Count the lines in the file, then sample the requested fraction of them
    nlines <- determine_nlines(fpath)
    lines <- sample_lines(fpath, round(nlines*percent), nlines)
    return(lines)
}
lines <- lapply(fpaths, randomLines, LINES_PERCENT)
corp <- VCorpus(VectorSource(lines))
With the sampled data loaded into a corpus, we perform some text cleaning, such as removing special characters, symbols, extra whitespace, and profanity. We opted to keep stopwords, since they are sensible next-word predictions.
# List of cleaning functions. tm_reduce applies them from the last element of
# the list to the first, so we reverse the list to run them in the order written.
funs <- rev(list(
    # Transliterate to ASCII, dropping characters that cannot be converted
    content_transformer(
        function(x) iconv(x, from="UTF-8", to="ASCII//TRANSLIT", sub="")
    ),
    # Convert to lower case
    content_transformer(tolower),
    # Remove everything that is not a lower-case letter or whitespace
    content_transformer(
        function(x) gsub("[^[:lower:][:space:]]", "", x)
    ),
    # Collapse runs of whitespace into single spaces
    stripWhitespace,
    # Remove profanity using the lexicon package's racist-terms list
    function(x) removeWords(x, profanity_racist)
))
corp <- tm_map(corp, FUN=tm_reduce, tmFuns=funs)
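As a quick, optional sanity check that the cleaning worked, we can peek at the beginning of the first cleaned document; the document index and the 80-character cut-off below are arbitrary choices for illustration.
# Show the first 80 characters of the first sampled line from the first file
substr(content(corp[[1]])[1], 1, 80)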
Finally, we will create some Term Document Matrices (TDMs) for n-grams, with n ranging from 1 to 3.
NGRAMS_MAX <- 3
ngramTDM <- function(n, corp) {
    # Tokenize into n-grams of a single size and build a term-document matrix
    tokenizer <- function(x) NGramTokenizer(x, Weka_control(min=n, max=n))
    tdm <- TermDocumentMatrix(corp, control=list(tokenize=tokenizer))
    return(tdm)
}
tdm <- lapply(1:NGRAMS_MAX, ngramTDM, corp)
With the TDMs we can now plot the most frequent n-grams. In the cumulative-frequency plots, the dashed grey lines mark the number of distinct n-grams needed to cover 50% and 90% of all n-gram instances; these counts are also printed below.
createPlots <- function(tdm) {
    par(mfrow=c(1,2))
    # All n-gram frequencies for the single document group, in decreasing order
    terms <- findMostFreqTerms(tdm, tdm$nrow, rep(1, tdm$ncol))$`1`
    ngram <- sapply(strsplit(names(terms[1]), " "), length)
    # Left panel: the 15 most frequent n-grams
    barplot(terms[1:15], las=2,
            col="lightblue",
            main=sprintf("Most frequent %d-grams", ngram),
            ylab="Frequency")
    # Right panel: cumulative coverage, with the 50% and 90% thresholds marked
    terms <- 100*cumsum(terms/sum(terms))
    th50 <- sum(terms <= 50)
    th90 <- sum(terms <= 90)
    plot(terms, type="l", col="red", lwd=2,
         main=sprintf("Cumulative frequency of %d-grams", ngram),
         ylab="Frequency (%)",
         xlab="Number of n-grams")
    abline(v=c(th50, th90),
           h=c(terms[th50], terms[th90]),
           col="lightgray", lwd=2, lty=2)
    return(list("50%"=th50, "90%"=th90))
}
lapply(tdm, createPlots)
## [[1]]
## [[1]]$`50%`
## [1] 319
##
## [[1]]$`90%`
## [1] 9960
##
##
## [[2]]
## [[2]]$`50%`
## [1] 33152
##
## [[2]]$`90%`
## [1] 345654
##
##
## [[3]]
## [[3]]$`50%`
## [1] 311931
##
## [[3]]$`90%`
## [1] 677155
These thresholds show, for instance, that roughly 320 distinct words already account for 50% of all word instances in the sample, while about 10,000 account for 90%. The idea for the prediction algorithm is to look up the most frequent n-gram that matches the user's last words, starting from the highest order and backing off to lower orders, and finally falling back to the single most frequent word when nothing matches.
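As a rough sketch of this idea, the snippet below builds sorted frequency tables from the TDMs above (using row_sums() from the slam package, which is installed together with tm) and looks up the last one or two words of the input. The names freqs and predictNext, and the details of the lookup, are only illustrative assumptions, not the final implementation.
# A minimal sketch of the planned back-off lookup (illustrative, not final)
library(slam)
freqs <- lapply(tdm, function(x) sort(row_sums(x), decreasing=TRUE))

predictNext <- function(input, freqs) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    # Try the highest-order n-grams first, backing off to lower orders
    for (n in length(freqs):2) {
        if (length(words) < n - 1) next
        prefix <- paste(tail(words, n - 1), collapse=" ")
        matches <- freqs[[n]][startsWith(names(freqs[[n]]), paste0(prefix, " "))]
        if (length(matches) > 0) {
            # The vectors are sorted, so the first match is the most frequent;
            # its last word is the prediction
            return(tail(unlist(strsplit(names(matches)[1], " ")), 1))
        }
    }
    # Final back-off: the single most frequent word overall
    names(freqs[[1]])[1]
}
# Example usage: predictNext("one of the", freqs)
A production version will likely need a more memory-efficient data structure than raw TDMs and some form of smoothing; addressing these points is part of the planned work for the Shiny app.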