Capstone - Milestone Report

Motivations

I. Demonstrate that I have downloaded the data and have successfully loaded it in.
II. Create a basic report of summary statistics about the data sets.
III. Report any interesting findings that I have amassed so far.
IV. Get feedback on my plans for creating a prediction algorithm & Shiny app.

I.I Package Loading

I may not need every package listed here by the end of the project.

suppressPackageStartupMessages({
  library(tm)
  library(rJava)
  library(ngram)
  library(RWeka)
  library(knitr)
  library(tidytext)
  library(tidyverse)
  library(wordcloud)
  library(stringi)
  library(stringr)
  library(ggplot2)
  library(dplyr)
})

I.II Loading the Data

With the files downloaded, I recorded the size of each file in megabytes and read its lines into memory.

blogs_file   <- "./en_US/en_US.blogs.txt"
news_file    <- "./en_US/en_US.news.txt"
twitter_file <- "./en_US/en_US.twitter.txt"  
blogs_size   <- file.size(blogs_file) / (2^20)
news_size    <- file.size(news_file) / (2^20)
twitter_size <- file.size(twitter_file) / (2^20)
blogs   <- readLines(blogs_file, skipNul = TRUE)
news    <- readLines(news_file,  skipNul = TRUE)
twitter <- readLines(twitter_file, skipNul = TRUE)

II.I Summary of the Data

This summary shows a general idea of how the contents of this data are arranged.

blogs_lines   <- length(blogs)
news_lines    <- length(news)
twitter_lines <- length(twitter)
total_lines   <- blogs_lines + news_lines + twitter_lines

blogs_nchar   <- nchar(blogs)
news_nchar    <- nchar(news)
twitter_nchar <- nchar(twitter)
boxplot(blogs_nchar, news_nchar, twitter_nchar, log = "y",
        names = c("Blogs", "News", "Twitter"),
        ylab = "Number of Characters (log scale)",
        col = c("darkseagreen","wheat4","royalblue2"))

blogs_nchar_sum   <- sum(blogs_nchar)
news_nchar_sum    <- sum(news_nchar)
twitter_nchar_sum <- sum(twitter_nchar)

blogs_words <- wordcount(blogs, sep = " ")
news_words  <- wordcount(news,  sep = " ")
twitter_words <- wordcount(twitter, sep = " ")

summary1 <- data.frame(file_names = c("blogs", "news", "twitter"),
                       file_size  = c(blogs_size, news_size, twitter_size),
                       file_lines = c(blogs_lines, news_lines, twitter_lines),
                       number_of_characters = c(blogs_nchar_sum, news_nchar_sum, twitter_nchar_sum),
                       number_of_words = c(blogs_words, news_words, twitter_words))
summary1 <- summary1 %>% mutate(percent_of_characters = round(number_of_characters/sum(number_of_characters), 2))
summary1 <- summary1 %>% mutate(percent_of_lines = round(file_lines/sum(file_lines), 2))
summary1 <- summary1 %>% mutate(percent_of_words = round(number_of_words/sum(number_of_words), 2))
kable(summary1)

file_names	file_size	file_lines	number_of_characters	number_of_words	percent_of_characters	percent_of_lines	percent_of_words
blogs	200.4242	899288	206824505	37334131	0.36	0.21	0.37
news	196.2775	1010242	203223159	34372530	0.36	0.24	0.34
twitter	159.3641	2360148	162096241	30373583	0.28	0.55	0.30

III.I Creating a Data Sample

I chose to sample 9% of the lines from each data set. This percentage is a good number to lower if the run time ends up being too long.

blogs   <- data.frame(text = blogs)
news    <- data.frame(text = news)
twitter <- data.frame(text = twitter)

set.seed(1110)
sample_pct <- 0.09

# floor() makes the requested sample size an explicit whole number
blogs_sample <- blogs %>%
  sample_n(floor(nrow(blogs) * sample_pct))
news_sample <- news %>%
  sample_n(floor(nrow(news) * sample_pct))
twitter_sample <- twitter %>%
  sample_n(floor(nrow(twitter) * sample_pct))
full_sample <- c(blogs_sample,news_sample,twitter_sample)
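
The sampling step can also be sketched in base R on toy data. This is only an illustration: `sample_lines` and `toy` are hypothetical names that do not appear elsewhere in this report.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement.
sample_lines <- function(lines, pct) {
  n <- floor(length(lines) * pct)      # whole number of lines to keep
  lines[sample.int(length(lines), n)]  # random subset, no duplicates
}

set.seed(1110)
toy <- paste("line", 1:200)            # stand-in for a vector of text lines
toy_sample <- sample_lines(toy, 0.09)
length(toy_sample)                     # roughly 9% of 200
```

Because the seed is fixed, rerunning the block returns the same subset, which keeps later frequency counts reproducible.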

III.II Creating Functions for Cleaning

The first function applies all of the cleaning steps. All letters are converted to lowercase, and all numbers, punctuation, stop words, profanity, and extra white space are removed. The profanity list was taken from http://www.bannedwordlist.com.

data("stop_words")
swear_words <- read_delim("./en_US/swearWords.csv", delim = "\n", col_names = FALSE)

## Parsed with column specification:
## cols(
##   X1 = col_character()
## )

pp_corpus <- function(corpus){
    corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "/|@|\\|")
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    # Pass the word vectors themselves; a quoted name would only remove that literal string
    corpus <- tm_map(corpus, removeWords, stop_words$word)
    corpus <- tm_map(corpus, removeWords, swear_words$X1)
    corpus <- tm_map(corpus, stripWhitespace)
    return(corpus)
}
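
To show what these steps do to a single line of text, here is a minimal base-R sketch of the same sequence. It is not the tm pipeline itself: `clean_text`, `toy_stop`, and `toy_swear` are hypothetical stand-ins for the function and word lists above.

```r
# Hypothetical stand-ins for the stop-word and profanity lists.
toy_stop  <- c("the", "a", "is", "at")
toy_swear <- c("darn")

clean_text <- function(x, drop_words) {
  x <- gsub("/|@|\\|", " ", x)             # separators become spaces
  x <- tolower(x)                          # lowercase everything
  x <- gsub("[0-9]+", "", x)               # remove numbers
  x <- gsub("[[:punct:]]+", "", x)         # remove punctuation
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)   # remove unwanted whole words
  x <- gsub("\\s+", " ", x)                # collapse runs of white space
  trimws(x)
}

cleaned <- clean_text("Blogs/News: The cat is at 42 Main St., darn!",
                      c(toy_stop, toy_swear))
cleaned
# "blogs news cat main st"
```

Word removal runs after lowercasing, so lowercase word lists match regardless of the original capitalization.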

The second function counts how often each term occurs and tabulates the terms in descending order of frequency.

ffreq <- function(tdm){
    # Helper function to tabulate frequency
    freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
    ffreq <- data.frame(word=names(freq), freq=freq)
    return(ffreq)
}
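
For reference, the same kind of table can be built from a plain vector of tokens in base R. This is a toy sketch; `tokens` and `toy_freq` are hypothetical names.

```r
# Toy token vector standing in for the terms of a corpus.
tokens <- c("want", "to", "go", "to", "want", "to")

counts <- sort(table(tokens), decreasing = TRUE)   # most frequent first
toy_freq <- data.frame(word = names(counts),
                       freq = as.integer(counts),
                       stringsAsFactors = FALSE)
toy_freq
```

The result has the same two columns, word and freq, as the data frame returned by ffreq.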

These functions will be used to tokenize the sets into the bigram, trigram, and quadgram varieties.

bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgram <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
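
As an illustration of what an n-gram tokenizer produces, the following base-R sketch splits one cleaned string into word n-grams. It does not use RWeka; `make_ngrams` is a hypothetical helper written only for this example.

```r
# Hypothetical helper: build word n-grams from one cleaned string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)      # first word of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("i want to go home", 2)
# "i want" "want to" "to go" "go home"
make_ngrams("i want to go home", 4)
# "i want to go" "want to go home"
```

A string of w words yields w - n + 1 n-grams, so short lines contribute few or no quadgrams.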

III.III Cleaning

Applying the pre-processing function defined earlier.

full_sample <- VCorpus(VectorSource(full_sample))
full_sample <- pp_corpus(full_sample)

Creating Term Document Matrices.

words <- TermDocumentMatrix(full_sample)
bigrams <- TermDocumentMatrix(full_sample, control=list(tokenize=bigram))
trigrams <- TermDocumentMatrix(full_sample, control=list(tokenize=trigram))
quadgrams <- TermDocumentMatrix(full_sample, control=list(tokenize=quadgram))

Removing infrequent terms. (The sparsity thresholds used here are another place to trade accuracy for runtime.)

words <- removeSparseTerms(words, 0.99)
bigrams <- removeSparseTerms(bigrams, 0.999)
trigrams <- removeSparseTerms(trigrams, 0.999)
quadgrams <- removeSparseTerms(quadgrams, 0.999)
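
The sparsity threshold can be illustrated on a toy matrix. As I understand it, a term's sparsity is the share of documents in which it never occurs, and terms at or above the threshold are dropped. The sketch below mirrors that idea in base R; `keep_dense_terms` is a hypothetical helper, not the tm implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 2, 0, 3,
                0, 0, 0, 1,
                0, 1, 0, 0,
                2, 1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("want", "zebra", "quilt", "time"),
                              paste0("doc", 1:4)))

# Sparsity of a term = share of documents in which it never occurs.
term_sparsity <- rowMeans(tdm == 0)
term_sparsity
# want 0.25, zebra 0.75, quilt 0.75, time 0.00

# Hypothetical helper: keep only terms whose sparsity is below the threshold.
keep_dense_terms <- function(m, max_sparsity) {
  m[rowMeans(m == 0) < max_sparsity, , drop = FALSE]
}

rownames(keep_dense_terms(tdm, 0.5))
# "want" "time"
```

Under this reading, a threshold of 0.999 keeps a term only if it appears in more than 0.1% of the documents, so the number of documents in the corpus strongly affects what survives.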

Finding the Frequencies for each of the sets.

freq1 <- ffreq(words)
freq2 <- ffreq(bigrams)
freq3 <- ffreq(trigrams)
freq4 <- ffreq(quadgrams)
freq1_top25 <- top_n(freq1,25)

## Selecting by freq

freq2_top25 <- top_n(freq2,25)

## Selecting by freq

freq3_top25 <- top_n(freq3,25)

## Selecting by freq

freq4_top25 <- top_n(freq4,25)

## Selecting by freq

III.IV Exploratory Analysis

I will now plot the top 25 terms of each N-gram set to get a general idea of what to expect from a prediction model that uses them.

ggplot(freq1_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
    geom_bar(stat="identity") +
    coord_flip() +
    theme_minimal() +
    scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Top 25 Most Frequent Unigrams")

ggplot(freq2_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
    geom_bar(stat="identity") +
    coord_flip() +
    theme_minimal() +
    scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Top 25 Most Frequent Bigrams")

ggplot(freq3_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
    geom_bar(stat="identity") +
    coord_flip() +
    theme_minimal() +
    scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Top 25 Most Frequent Trigrams")

ggplot(freq4_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
    geom_bar(stat="identity") +
    coord_flip() +
    theme_minimal() +
    scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Top 25 Most Frequent Quadgrams")

IV. Prediction Strategies and Plans

This analysis has given me frequency-ranked tables of the unigrams, bigrams, trigrams, and quadgrams in the sample. Using these data frames, I will build a back-off model that first searches the quadgram table for phrases beginning with the last three words of the input and suggests the words that complete them. If no phrase is found, the trigram table is searched in the same manner using the last two words, and if that also fails, the bigram table is searched using the last word. The bigram table is a logical last resort in theory, although I do not expect it to be especially useful in practice.
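
The back-off idea described above can be sketched as follows. This is a rough illustration under stated assumptions, not the final model: the toy tables only share their names and two-column shape (word, freq) with the real frequency tables, and `lookup` and `predict_next` are hypothetical helpers.

```r
# Toy frequency tables in the same shape as the output of ffreq().
freq2 <- data.frame(word = c("want to", "want the", "want for", "to go"),
                    freq = c(50, 20, 10, 30), stringsAsFactors = FALSE)
freq3 <- data.frame(word = c("i want to", "want to go", "want to see"),
                    freq = c(25, 12, 8), stringsAsFactors = FALSE)
freq4 <- data.frame(word = c("i want to go", "i want to see"),
                    freq = c(6, 4), stringsAsFactors = FALSE)

# Hypothetical helper: find n-grams that begin with `context` and
# return their final word, most frequent first.
lookup <- function(table, context, k) {
  prefix <- paste0(paste(context, collapse = " "), " ")
  hits <- table[startsWith(table$word, prefix), ]
  hits <- hits[order(-hits$freq), ]
  head(sub(".* ", "", hits$word), k)
}

# Hypothetical helper: try quadgrams, then trigrams, then bigrams.
predict_next <- function(input, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  tables <- list(freq4, freq3, freq2)        # longest context first
  for (i in seq_along(tables)) {
    n_context <- 4 - i                       # 3, 2, then 1 context words
    if (length(words) < n_context) next      # input too short for this table
    result <- lookup(tables[[i]], tail(words, n_context), k)
    if (length(result) > 0) return(result)
  }
  character(0)                               # nothing matched at any level
}

predict_next("want")           # "to" "the" "for"
predict_next("I want to")      # "go" "see"
predict_next("they want to")   # "go" "see" (no quadgram match, trigrams used)
```

Scanning each table with startsWith is simple but linear in the table size; splitting each n-gram into a context column and a next-word column ahead of time would likely make lookups faster in the app.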

My app will have a simple user interface consisting only of instructions and an empty text box for input. The app will then use its model to find the most likely next words and display the top three in ranked order. It might display these as continuations of the input text. For example, if the input text is “want”, the top three results would each start with “want” followed by the model's prediction: “want to”, “want the”, “want for”.
Any suggestions the graders have on this idea would be much appreciated, as it still feels like a very rough concept.