This report demonstrates familiarity with the supplied data and records initial progress toward a word prediction algorithm. It presents my exploratory analysis of the data and explains my concept for an eventual word prediction app and its associated algorithm. The exploratory analysis concentrates on the major features of the data and includes graphical depictions of various summary quantities.
library(readr)     # read_lines()
library(dplyr)     # pipes, count(), anti_join()
library(tibble)    # tibble()
library(tidytext)  # unnest_tokens()
library(quanteda)  # corpus(), tokens(), dfm()

twittxt <- read_lines("en_US.twitter.txt")
newstxt <- read_lines("en_US.news.txt")
blogtxt <- read_lines("en_US.blogs.txt")
twittxt <- gsub("[^\x01-\x7F]", "", twittxt) # strip non-ASCII characters (emoji, curly quotes, etc.)
newstxt <- gsub("[^\x01-\x7F]", "", newstxt) # strip non-ASCII characters
blogtxt <- gsub("[^\x01-\x7F]", "", blogtxt) # strip non-ASCII characters
linecounts <- c(length(twittxt), length(newstxt), length(blogtxt))
twittxt_df <- tibble(line = 1:linecounts[1], text = twittxt)
newstxt_df <- tibble(line = 1:linecounts[2], text = newstxt)
blogtxt_df <- tibble(line = 1:linecounts[3], text = blogtxt)
twittxt_df$text <- tolower(twittxt_df$text)
newstxt_df$text <- tolower(newstxt_df$text)
blogtxt_df$text <- tolower(blogtxt_df$text)
tidytwit <- twittxt_df %>% unnest_tokens(word, text)
tidynews <- newstxt_df %>% unnest_tokens(word, text)
tidyblog <- blogtxt_df %>% unnest_tokens(word, text)
twitWordCount <- tidytwit %>% count(word, sort = TRUE)
newsWordCount <- tidynews %>% count(word, sort = TRUE)
blogWordCount <- tidyblog %>% count(word, sort = TRUE)
wordcounts <- c(sum(twitWordCount$n), sum(newsWordCount$n), sum(blogWordCount$n))
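The line and word totals computed above can be collected into a single summary table. This small snippet is illustrative only; `summary_df` is just a name chosen here, and the source order follows the vectors built above.
summary_df <- tibble(source = c("twitter", "news", "blogs"),  # same order as linecounts/wordcounts
                     lines = linecounts,
                     words = wordcounts)
summary_df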
This analysis employs corpora from which “stop words” have been removed. However, for the final word prediction algorithm I am likely to leave the stop words in the corpus, given their general importance to phrase and sentence structure.
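The stop-word filtering step itself is not shown above. As a minimal sketch, it can be done with an anti-join against tidytext's stop_words lexicon; the lexicon choice and the object names here are illustrative, shown for the twitter set only.
data(stop_words)  # tidytext's combined stop-word lexicon (illustrative choice)
tidytwit_nostop <- tidytwit %>% anti_join(stop_words, by = "word")
twitWordCount_nostop <- tidytwit_nostop %>% count(word, sort = TRUE)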
The three data sets were merged into one composite corpus, and a random sample of roughly 5% of its lines was drawn for analysis. Unigrams, bigrams, and trigrams were computed from this sample. Wordcloud figures of the top 100 and histograms of the top 25 of each were then generated.
alltext <- c(twittxt, newstxt, blogtxt)
set.seed(1234)  # make the random sample reproducible
sampfrac <- 0.05
# Keep each line independently with probability sampfrac (~5% of the corpus)
alltextrand <- rbinom(length(alltext), 1, sampfrac)
alltext <- alltext[alltextrand == 1]
alltext_df <- tibble(line = seq_along(alltext), text = alltext)
alltext_df$text <- tolower(alltext_df$text)
alltext.corpus <- corpus(alltext_df)
profanity_vec <- read_lines("bad_words_list.txt")  # word list used for profanity filtering
# Tokenize once, then form the higher-order n-grams from the same token set.
# Note: remove_twitter, remove_hyphens, and ngrams were tokens() arguments in
# quanteda v1; current quanteda uses split_hyphens and tokens_ngrams() instead.
base.tokens <- tokens(alltext.corpus, what = "word",
                      remove_numbers = TRUE, remove_punct = TRUE,
                      remove_symbols = TRUE, remove_separators = TRUE,
                      remove_url = TRUE, split_hyphens = TRUE,
                      include_docvars = FALSE,
                      verbose = quanteda_options("verbose"))
# Remove profanity before forming n-grams; padding = TRUE leaves placeholders
# so that no bigram or trigram is formed across a removed word
base.tokens <- tokens_remove(base.tokens, profanity_vec, padding = TRUE)

n1grams.tokens <- tokens_remove(base.tokens, "")  # drop the pads for unigrams
n1grams.dfm <- dfm(n1grams.tokens)
n2grams.tokens <- tokens_ngrams(base.tokens, n = 2, skip = 0, concatenator = "_")
n2grams.dfm <- dfm(n2grams.tokens)
n3grams.tokens <- tokens_ngrams(base.tokens, n = 3, skip = 0, concatenator = "_")
n3grams.dfm <- dfm(n3grams.tokens)
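The wordcloud and histogram figures described above are produced from these document-feature matrices. The sketch below shows one way to do this for the unigrams, using textstat_frequency() from quanteda.textstats together with the wordcloud and ggplot2 packages (the package choices are illustrative); bigrams and trigrams follow the same pattern.
library(quanteda.textstats)  # textstat_frequency()
library(wordcloud)
library(ggplot2)

freq1 <- textstat_frequency(n1grams.dfm)  # features ranked by frequency
wordcloud(words = freq1$feature[1:100],   # top-100 word cloud
          freq = freq1$frequency[1:100])
ggplot(head(freq1, 25),                   # top-25 frequency histogram
       aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "count")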
To build a word prediction algorithm, I plan to combine the twitter, blog, and news files into a single corpus, from which I will extract a random sample to develop n-grams. My approach to word prediction will employ four-grams or five-grams with “stupid backoff”: the model predicts the next word from the last three or four words the user has typed, backing off to lower-order n-grams whenever the higher-order context has not been observed. I plan to implement the algorithm as a Shiny app which will include a window for user input of the word prefix.
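As a sketch of the scoring rule, stupid backoff assigns a relative score rather than a probability: the observed n-gram ratio when the full context has been seen, and otherwise a fixed multiple of the score from the shortened context (0.4 in Brants et al., 2007). The function below is illustrative only, not the final app code; `score_backoff` and `counts` are hypothetical names, and the real app will use precomputed lookup tables rather than data-frame scans.
# Illustrative stupid-backoff scorer. `counts` is assumed to be a list where
# counts[[k]] is a data frame of k-gram strings (words joined by "_") in column
# `ngram` and their frequencies in column `n`, up to the highest order needed.
score_backoff <- function(prefix, word, counts, alpha = 0.4) {
  k <- length(prefix) + 1
  if (k == 1) {  # base case: unigram relative frequency
    n <- counts[[1]]$n[counts[[1]]$ngram == word]
    return(if (length(n) > 0) n / sum(counts[[1]]$n) else 0)
  }
  full <- paste(c(prefix, word), collapse = "_")
  ctx  <- paste(prefix, collapse = "_")
  n_full <- counts[[k]]$n[counts[[k]]$ngram == full]
  n_ctx  <- counts[[k - 1]]$n[counts[[k - 1]]$ngram == ctx]
  if (length(n_full) > 0 && length(n_ctx) > 0)
    return(n_full / n_ctx)  # full context observed: use the count ratio
  # otherwise back off to a shorter context, discounted by alpha
  alpha * score_backoff(prefix[-1], word, counts, alpha)
}
The app would then rank a candidate set of words by this score against the tail of the user's input, e.g. score_backoff(tail(user_words, 4), w, counts) for each candidate w.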