Introduction

This report demonstrates familiarity with the supplied data and documents initial progress toward developing a word prediction algorithm. It presents my exploratory analysis of the data and explains my concept for an eventual word prediction app and its associated algorithm. The exploratory analysis concentrates on the major features of the data and provides graphical depictions of various summary quantities.

Reading and Initial Processing of Data Files

library(readr)

twittxt <- read_lines("en_US.twitter.txt")
newstxt <- read_lines("en_US.news.txt")
blogtxt <- read_lines("en_US.blogs.txt")
twittxt <- gsub("[^\x01-\x7F]", "", twittxt) # Remove non-ASCII characters (emoji, emoticon codes, etc.)
newstxt <- gsub("[^\x01-\x7F]", "", newstxt) # Remove non-ASCII characters (emoji, emoticon codes, etc.)
blogtxt <- gsub("[^\x01-\x7F]", "", blogtxt) # Remove non-ASCII characters (emoji, emoticon codes, etc.)

Counts of Lines and Words in Each Corpus

library(dplyr)
library(tibble)
library(tidytext)

# Count the lines in each corpus
linecounts <- c(length(twittxt), length(newstxt), length(blogtxt))
# Put each corpus into a tibble and convert the text to lower case
twittxt_df <- tibble(line = 1:linecounts[1], text = twittxt)
newstxt_df <- tibble(line = 1:linecounts[2], text = newstxt)
blogtxt_df <- tibble(line = 1:linecounts[3], text = blogtxt)
twittxt_df$text <- tolower(twittxt_df$text)
newstxt_df$text <- tolower(newstxt_df$text)
blogtxt_df$text <- tolower(blogtxt_df$text)
# Tokenize to one word per row and tally word frequencies
tidytwit <- twittxt_df %>% unnest_tokens(word, text)
tidynews <- newstxt_df %>% unnest_tokens(word, text)
tidyblog <- blogtxt_df %>% unnest_tokens(word, text)
twitWordCount <- tidytwit %>% count(word, sort = TRUE)
newsWordCount <- tidynews %>% count(word, sort = TRUE)
blogWordCount <- tidyblog %>% count(word, sort = TRUE)
# Total words in each corpus
wordcounts <- c(sum(twitWordCount$n), sum(newsWordCount$n), sum(blogWordCount$n))

The following plots display the line and word counts for each corpus.
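
A minimal sketch of how such bar charts could be generated with ggplot2 appears below, using the linecounts and wordcounts vectors computed above; the countsummary tibble and the plot labels are illustrative assumptions rather than the exact code behind the figures.

library(ggplot2)

countsummary <- tibble(corpus = c("twitter", "news", "blogs"),
                       lines = linecounts,
                       words = wordcounts)

ggplot(countsummary, aes(x = corpus, y = lines)) +
  geom_col() +
  labs(title = "Lines per Corpus", x = "Corpus", y = "Number of lines")

ggplot(countsummary, aes(x = corpus, y = words)) +
  geom_col() +
  labs(title = "Words per Corpus", x = "Corpus", y = "Number of words")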

Top Words in Each Corpus

This analysis uses corpora from which "stop words" have been removed. However, for the final word prediction algorithm I will likely leave the stop words in the corpus, given their importance to phrase and sentence structure.

The following plots display the 25 most frequent words in each corpus.
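
The sketch below illustrates one way the stop-word filtering and a top-25 bar chart could be produced for the twitter corpus, using the stop_words lexicon from tidytext; the same pattern applies to the news and blog corpora. This is an assumed reconstruction, not the exact plotting code used for the figures.

twit_top25 <- tidytwit %>%
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE) %>%
  slice_max(n, n = 25)

ggplot(twit_top25, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 25 Words in the Twitter Corpus (stop words removed)",
       x = "Word", y = "Count")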

Top N-grams in the Combined Corpus

Wordclouds and Histograms of Most Frequent Unigrams, Bigrams, and Trigrams

The three data sets were merged into one composite corpus, and a random 5% sample of the lines was taken for analysis. Unigrams, bigrams, and trigrams were extracted from this sample, and wordclouds of the top 100 and histograms of the top 25 of each were then generated.

library(quanteda)

# Combine the three corpora and draw a random 5% sample of the lines
alltext <- c(twittxt, newstxt, blogtxt)
sampfrac <- 0.05
alltextrand <- rbinom(length(alltext), 1, sampfrac)
alltext <- alltext[alltextrand == 1]
alltext_df <- tibble(line = 1:length(alltext), text = alltext)
alltext_df$text <- tolower(alltext_df$text)
alltext.corpus <- corpus(alltext_df)
# List of profane words to filter out of the tokens
profanity_vec <- read_lines("bad_words_list.txt")

# Tokenize once; the remove_twitter, remove_hyphens, and ngrams arguments of older
# quanteda versions have been removed, so hyphen splitting is requested with
# split_hyphens and the n-grams are built afterward with tokens_ngrams()
base.tokens <- tokens(alltext.corpus, what = "word",
                      remove_numbers = TRUE, remove_punct = TRUE,
                      remove_symbols = TRUE, remove_separators = TRUE,
                      remove_url = TRUE, split_hyphens = TRUE)
# Remove profanity at the word level so it cannot appear inside any n-gram
base.tokens <- tokens_remove(base.tokens, profanity_vec, padding = FALSE)

# Unigram, bigram, and trigram tokens and their document-feature matrices
n1grams.tokens <- tokens_ngrams(base.tokens, n = 1, concatenator = "_")
n1grams.dfm <- dfm(n1grams.tokens)

n2grams.tokens <- tokens_ngrams(base.tokens, n = 2, concatenator = "_")
n2grams.dfm <- dfm(n2grams.tokens)

n3grams.tokens <- tokens_ngrams(base.tokens, n = 3, concatenator = "_")
n3grams.dfm <- dfm(n3grams.tokens)

The following wordclouds display the 100 most frequent unigrams, bigrams and trigrams in the corpus.
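
One way to produce these wordclouds is with textplot_wordcloud() from the quanteda.textplots package, as sketched below; the color and scaling settings used for the actual figures are not reproduced here.

library(quanteda.textplots)

textplot_wordcloud(n1grams.dfm, max_words = 100)
textplot_wordcloud(n2grams.dfm, max_words = 100)
textplot_wordcloud(n3grams.dfm, max_words = 100)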

The following plots display the 25 most frequent unigrams, bigrams and trigrams in the corpus.
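
A possible approach to these plots uses topfeatures() from quanteda to pull the counts from each document-feature matrix, as sketched below for the bigrams; the unigram and trigram plots would follow the same pattern.

top2 <- topfeatures(n2grams.dfm, 25)
top2_df <- tibble(ngram = names(top2), count = as.numeric(top2))

ggplot(top2_df, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 25 Bigrams", x = "Bigram", y = "Count")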

Word Prediction Algorithm Concept

To build a word prediction algorithm, I plan to combine the twitter, blog, and news files into a single corpus, from which I will extract a random sample to develop N-grams. My approach to word prediction will employ four-grams or five-grams with "stupid backoff" to predict the next word given a four- or five-word prefix supplied by the user. I plan to implement the algorithm as a Shiny app that includes a window for user input of the prefix.
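
To make the concept concrete, the following is a minimal sketch of a "stupid backoff" lookup over precomputed N-gram frequency tables. The table format (tibbles with prefix, word, and n columns, ordered from the highest N-gram order down to bigrams), the helper name predict_next_word, and the 0.4 backoff weight are assumptions for illustration, not the final design.

library(dplyr)
library(stringr)

# ngram_tables: list of tibbles, one per n-gram order from highest (e.g. 5) down to 2,
# each with columns prefix (first n-1 words, space-separated), word, and n (count);
# the sketch assumes every table is non-empty
predict_next_word <- function(input, ngram_tables, lambda = 0.4) {
  words <- str_split(str_to_lower(str_squish(input)), " ")[[1]]
  candidates <- tibble(word = character(), score = numeric())
  weight <- 1
  for (tbl in ngram_tables) {
    prefix_len <- str_count(tbl$prefix[1], " ") + 1      # words in this table's prefix
    if (length(words) >= prefix_len) {
      pfx <- paste(tail(words, prefix_len), collapse = " ")
      hits <- tbl %>%
        filter(prefix == pfx) %>%
        mutate(score = weight * n / sum(n)) %>%          # relative frequency, discounted
        select(word, score)
      candidates <- bind_rows(candidates, hits)
    }
    weight <- weight * lambda                            # back off to the next lower order
  }
  candidates %>%
    group_by(word) %>%
    summarise(score = max(score), .groups = "drop") %>%
    arrange(desc(score)) %>%
    slice_head(n = 3)
}

A call such as predict_next_word("thanks for the", ngram_tables) would then return the three highest-scoring candidate next words; the Shiny app would wrap a lookup of this kind around the user's text input.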