Project

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Libraries

Load required libraries for this report.

library(LaF)      # fast random sampling of lines from large text files
library(tm)       # text mining framework: corpus, transformations, TDMs
library(lexicon)  # word lists, including profanity lists
library(RWeka)    # n-gram tokenizer

Summary

To create some basic summaries, we first need the list of files.

PATH <- "final/en_US"
fpaths <- list.files(PATH, full.names=TRUE)

We are going to use the Unix command wc. For each file, it prints the number of lines, the number of words, the number of characters, the file size in bytes, and the maximum line length, in that order.

# Column header roughly aligned with the wc output below
header <- paste("", "Lines", "", "Words", "Chars", "", "Bytes",
                "", "Max", "", "File", sep="\t")
output <- system2("wc", c("-lwmcL", fpaths), stdout=TRUE)
cat(c(header, output), sep="\n")
##  Lines       Words   Chars       Bytes       Max     File
##    899288  37334117 207723793 209260726     40833 final/en_US/en_US.blogs.txt
##   1010242  34365936 204233401 204801647     11384 final/en_US/en_US.news.txt
##   2360148  30373559 164456396 164745190       173 final/en_US/en_US.twitter.txt
##   4269678 102073612 576413590 578807563     40833 total

And now the file sizes in a more human-readable format, using the Unix command du.

cat(system2("du", c("-h", fpaths), stdout=TRUE), sep="\n")
## 200M final/en_US/en_US.blogs.txt
## 196M final/en_US/en_US.news.txt
## 158M final/en_US/en_US.twitter.txt

Data

Now we explore the data. Since the files are quite large, we randomly read only 1% of the lines from each file.

LINES_PERCENT <- 0.01
SEED <- 123456
set.seed(SEED)

# Randomly sample a given fraction of lines from a file
randomLines <- function(fpath, percent) {
  nlines <- determine_nlines(fpath)
  lines <- sample_lines(fpath, nlines*percent)
  return(lines)
}

lines <- lapply(fpaths, randomLines, LINES_PERCENT)
corp <- VCorpus(VectorSource(lines))

With the data loaded, we perform some text cleaning, such as removing special characters and symbols, stripping extra whitespace, and filtering out profanity. We opted to keep stopwords, since they are sensible words for the algorithm to predict.

# Cleaning steps, written in application order; tm_reduce applies its
# tmFuns from last to first, hence the rev()
funs <- rev(list(content_transformer(  # transliterate to plain ASCII
                  function (x) iconv(x, from="UTF-8", to="ASCII//TRANSLIT", sub="")
                ),
                content_transformer(tolower),  # lower-case everything
                content_transformer(           # keep only letters and spaces
                  function(x) gsub("[^[:lower:][:space:]]", "", x)
                ),
                stripWhitespace,               # collapse repeated whitespace
                function(x) removeWords(x, profanity_racist)))  # drop profanity

corp <- tm_map(corp, FUN=tm_reduce, tmFuns=funs)

Finally, we will create some Term Document Matrices (TDMs) for n-grams, with n ranging from 1 to 3.

NGRAMS_MAX <- 3

# Build a TDM whose terms are n-grams of length n
ngramTDM <- function(n, corp) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min=n, max=n))
  tdm <- TermDocumentMatrix(corp, control=list(tokenize=tokenizer))
  return(tdm)
}

tdm <- lapply(1:NGRAMS_MAX, ngramTDM, corp)

Plot

With the TDMs, we can now plot the most frequent n-grams. In the cumulative-frequency plots, the dashed grey lines mark the number of distinct n-grams needed to cover 50% and 90% of all n-gram occurrences; these counts are also printed below.

createPlots <- function(tdm) {
  par(mfrow=c(1,2)) 
  
  # All terms sorted by decreasing frequency, with every document in one group
  terms <- findMostFreqTerms(tdm, tdm$nrow, rep(1, tdm$ncol))$`1`
  # Infer n from the number of words in the most frequent term
  ngram <- sapply(strsplit(names(terms[1]), " "), length)

  barplot(terms[1:15], las=2,
          col="lightblue",
          main=sprintf("Most frequent %d-grams", ngram),
          ylab="Frequency"
  )

  # Cumulative coverage (%) and the number of n-grams needed for 50% / 90%
  terms <- 100*cumsum(terms/sum(terms))
  th50 <- sum(terms <= 50)
  th90 <- sum(terms <= 90)

  plot(terms, type="l", col="red", lwd=2,
       main=sprintf("Cumulative frequency of %d-grams", ngram),
       ylab="Frequency (%)",
       xlab="Number of n-grams")
  abline(v=c(th50, th90),
         h=c(terms[th50], terms[th90]),
         col="lightgray", lwd=2, lty=2)
  
  return(list("50%"=th50, "90%"=th90))
}

lapply(tdm, createPlots)

## [[1]]
## [[1]]$`50%`
## [1] 319
## 
## [[1]]$`90%`
## [1] 9960
## 
## 
## [[2]]
## [[2]]$`50%`
## [1] 33152
## 
## [[2]]$`90%`
## [1] 345654
## 
## 
## [[3]]
## [[3]]$`50%`
## [1] 311931
## 
## [[3]]$`90%`
## [1] 677155

Prediction

The idea for the prediction algorithm is to look up the words the user has just typed in the n-gram tables, starting with the highest-order n-grams and backing off to lower orders when no match is found, finally falling back to the single most frequent word. The most frequent matching continuation is returned as the prediction. A rough sketch of this lookup is shown below.
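
The following is only an illustrative sketch, not the final implementation. It assumes a hypothetical data layout: the n-gram counts from the TDMs reshaped into data frames with columns prefix (the first n-1 words), word (the continuation) and freq (the count), supplied as a list ordered from highest to lowest order, plus the overall most frequent word as a final fallback. The names predictNext, tables and topWord are placeholders.

# Minimal back-off sketch; assumes hypothetical data frames with columns
# `prefix` (first n-1 words), `word` (continuation) and `freq` (count)
predictNext <- function(input, tables, topWord) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]

  for (tab in tables) {
    # Order of this table = number of words in its prefixes + 1
    n <- length(strsplit(tab$prefix[1], " ")[[1]]) + 1
    if (length(words) < n - 1) next

    # Match the last n-1 typed words against the stored prefixes
    prefix <- paste(tail(words, n - 1), collapse=" ")
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0)
      return(hits$word[which.max(hits$freq)])  # most frequent continuation
  }

  topWord  # final back-off: the single most frequent word
}

# Toy example with made-up counts
trigrams <- data.frame(prefix=c("for the", "of the"),
                       word=c("follow", "best"),
                       freq=c(25, 40), stringsAsFactors=FALSE)
bigrams  <- data.frame(prefix=c("the", "for"),
                       word=c("first", "a"),
                       freq=c(120, 90), stringsAsFactors=FALSE)
predictNext("Thanks for the", list(trigrams, bigrams), "the")
# should return "follow"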