I. Demonstrate that I have downloaded the data and have successfully loaded it in.
II. Create a basic report of summary statistics about the data sets.
III. Report any interesting findings that I have amassed so far.
IV. Get feedback on my plans for creating a prediction algorithm & Shiny app.
I am not sure whether I will need all of these packages by the end of the project.
suppressPackageStartupMessages({
library(tm)
library(rJava)
library(ngram)
library(RWeka)
library(knitr)
library(tidytext)
library(tidyverse)
library(wordcloud)
library(stringi)
library(stringr)
library(ggplot2)
library(dplyr)
})
After downloading the files, I began reading in the data.
blogs_file <- "./en_US/en_US.blogs.txt"
news_file <- "./en_US/en_US.news.txt"
twitter_file <- "./en_US/en_US.twitter.txt"
# File sizes in MiB (bytes / 2^20)
blogs_size <- file.size(blogs_file) / (2^20)
news_size <- file.size(news_file) / (2^20)
twitter_size <- file.size(twitter_file) / (2^20)
blogs <- readLines(blogs_file, skipNul = TRUE)
news <- readLines(news_file, skipNul = TRUE)
twitter <- readLines(twitter_file, skipNul = TRUE)
The summary below gives a general picture of how the data are distributed across the three files.
blogs_lines <- length(blogs)
news_lines <- length(news)
twitter_lines <- length(twitter)
total_lines <- blogs_lines + news_lines + twitter_lines
blogs_nchar <- nchar(blogs)
news_nchar <- nchar(news)
twitter_nchar <- nchar(twitter)
boxplot(blogs_nchar, news_nchar, twitter_nchar, log = "y",
names = c("Blogs", "News", "Twitter"),
ylab = "Number of Characters (log scale)",
col = c("darkseagreen","wheat4","royalblue2"))
blogs_nchar_sum <- sum(blogs_nchar)
news_nchar_sum <- sum(news_nchar)
twitter_nchar_sum <- sum(twitter_nchar)
blogs_words <- wordcount(blogs, sep = " ")
news_words <- wordcount(news, sep = " ")
twitter_words <- wordcount(twitter, sep = " ")
summary1 <- data.frame(file_names = c("blogs", "news", "twitter"),
file_size = c(blogs_size, news_size, twitter_size),
file_lines = c(blogs_lines, news_lines, twitter_lines),
number_of_characters = c(blogs_nchar_sum, news_nchar_sum, twitter_nchar_sum),
number_of_words = c(blogs_words, news_words, twitter_words))
summary1 <- summary1 %>% mutate(percent_of_characters = round(number_of_characters/sum(number_of_characters), 2))
summary1 <- summary1 %>% mutate(percent_of_lines = round(file_lines/sum(file_lines), 2))
summary1 <- summary1 %>% mutate(percent_of_words = round(number_of_words/sum(number_of_words), 2))
kable(summary1)
| file_names | file_size | file_lines | number_of_characters | number_of_words | percent_of_characters | percent_of_lines | percent_of_words |
|---|---|---|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 206824505 | 37334131 | 0.36 | 0.21 | 0.37 |
| news | 196.2775 | 1010242 | 203223159 | 34372530 | 0.36 | 0.24 | 0.34 |
| twitter | 159.3641 | 2360148 | 162096241 | 30373583 | 0.28 | 0.55 | 0.30 |
I chose to sample 9% of the lines in each data set. This percentage will be a good number to lower if the run time ends up being too long.
blogs <- data.frame(text = blogs)
news <- data.frame(text = news)
twitter <- data.frame(text = twitter)
set.seed(1110)
sample_pct <- 0.09
blogs_sample <- blogs %>%
sample_n(., nrow(blogs)*sample_pct)
news_sample <- news %>%
sample_n(., nrow(news)*sample_pct)
twitter_sample <- twitter %>%
sample_n(., nrow(twitter)*sample_pct)
full_sample <- c(blogs_sample,news_sample,twitter_sample)
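As a sanity check on the sampling step, the sketch below draws the same fraction from a toy data frame using base R only. The helper name `sample_fraction` and the toy data are assumptions for illustration and are not part of the analysis above. It rounds the sample size down to a whole number explicitly, which the `nrow(x) * sample_pct` expression leaves implicit.

```r
# Hypothetical helper: draw a fixed fraction of rows from a data frame.
sample_fraction <- function(df, pct, seed = 1110) {
  set.seed(seed)                     # same seed gives the same rows
  n <- floor(nrow(df) * pct)         # whole number of rows to keep
  df[sample(nrow(df), n), , drop = FALSE]
}

# Toy data standing in for one of the text files.
toy <- data.frame(text = paste("line", 1:200), stringsAsFactors = FALSE)
toy_sample <- sample_fraction(toy, 0.09)
nrow(toy_sample)
```

Because the seed is set inside the helper, calling it twice with the same arguments returns the same rows, which makes the later n-gram counts reproducible.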
The first function bundles all of the cleaning steps that will be applied. All letters will be converted to lowercase, and all numbers, punctuation, stop words, profanity, and extra white space will be removed. The profanity list was taken from the following website: http://www.bannedwordlist.com.
data("stop_words")
swear_words <- read_delim("./en_US/swearWords.csv", delim = "\n", col_names = FALSE)
## Parsed with column specification:
## cols(
## X1 = col_character()
## )
pp_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "/|@|\\|")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Pass the actual word vectors, not the quoted object names
corpus <- tm_map(corpus, removeWords, stop_words$word)
corpus <- tm_map(corpus, removeWords, swear_words$X1)
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
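To make the effect of these cleaning steps easier to see, here is a minimal base-R sketch that applies roughly the same sequence to a single string. The function `clean_text` and its `drop_words` argument are hypothetical stand-ins for the `tm` pipeline, not the code used in this report, and the short word list is made up for the example.

```r
# Hypothetical base-R stand-in for the tm pipeline, for illustration only.
clean_text <- function(x, drop_words = character(0)) {
  x <- gsub("/|@|\\|", " ", x)              # separators become spaces
  x <- tolower(x)                            # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)           # remove numbers
  x <- gsub("[[:punct:]]+", "", x)           # remove punctuation
  if (length(drop_words) > 0) {
    pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)   # remove whole-word matches
  }
  x <- gsub("\\s+", " ", x)                  # collapse runs of whitespace
  trimws(x)                                  # extra step: trim the ends
}

clean_text("The 3 quick/brown foxes @ the   park!", drop_words = c("the", "a"))
```

The order matters: removing punctuation before removing words means contractions such as "don't" become "dont" and no longer match a stop-word list that contains the apostrophe.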
The second function computes the frequency of each term and tabulates the terms in descending order of frequency.
ffreq <- function(tdm){
# Helper function to tabulate frequency
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
ffreq <- data.frame(word=names(freq), freq=freq)
return(ffreq)
}
These functions will be used to tokenize the sets into the bigram, trigram, and quadgram varieties.
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgram <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
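For readers unfamiliar with n-gram tokenization, the following base-R sketch shows what a bigram or trigram tokenizer produces for one sentence. The helper `make_ngrams` is hypothetical and only approximates what `NGramTokenizer` does; it splits on whitespace and ignores everything else.

```r
# Hypothetical helper: build word n-grams from a single string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("i want to go home", 2)
make_ngrams("i want to go home", 3)
```

A sentence of five words yields four bigrams and three trigrams, so longer n-grams are always rarer in the sample than shorter ones.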
Applying the pre-processing function defined earlier to the sampled corpus.
full_sample <- VCorpus(VectorSource(full_sample))
full_sample <- pp_corpus(full_sample)
Creating Term Document Matrices.
words <- TermDocumentMatrix(full_sample)
bigrams <- TermDocumentMatrix(full_sample, control=list(tokenize=bigram))
trigrams <- TermDocumentMatrix(full_sample, control=list(tokenize=trigram))
quadgrams <- TermDocumentMatrix(full_sample, control=list(tokenize=quadgram))
Removing Infrequent Terms. (These thresholds are another place to trade accuracy for run time.)
words <- removeSparseTerms(words, 0.99)
bigrams <- removeSparseTerms(bigrams, 0.999)
trigrams <- removeSparseTerms(trigrams, 0.999)
quadgrams <- removeSparseTerms(quadgrams, 0.999)
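The sparsity threshold can be hard to picture, so here is a toy sketch of the rule on a small dense matrix. It assumes my reading of `removeSparseTerms` is right, namely that a term is kept only if it appears in more than `1 - sparse` of the documents. The helper `keep_terms` and the toy matrix are hypothetical.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 2, 1, 1,
                0, 0, 3, 0,
                1, 0, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("want", "rare", "home"), paste0("doc", 1:4)))

# Hypothetical helper mimicking the assumed rule: keep a term only if it
# appears in more than (1 - sparse) of the documents.
keep_terms <- function(m, sparse) {
  doc_share <- rowSums(m > 0) / ncol(m)   # share of documents with the term
  m[doc_share > (1 - sparse), , drop = FALSE]
}

rownames(keep_terms(tdm, 0.6))
```

Here "rare" appears in only one of four documents and is dropped, while "want" and "home" survive. Raising the threshold toward 1 keeps more terms at the cost of a larger matrix.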
Finding the Frequencies for each of the sets.
freq1 <- ffreq(words)
freq2 <- ffreq(bigrams)
freq3 <- ffreq(trigrams)
freq4 <- ffreq(quadgrams)
freq1_top25 <- top_n(freq1,25)
## Selecting by freq
freq2_top25 <- top_n(freq2,25)
## Selecting by freq
freq3_top25 <- top_n(freq3,25)
## Selecting by freq
freq4_top25 <- top_n(freq4,25)
## Selecting by freq
I will now plot the top 25 of each of the N-gram sets to get a general idea of what to expect from a prediction model that utilizes them.
ggplot(freq1_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
coord_flip() +
theme_minimal() +
scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 25 Most Frequent Unigrams")
ggplot(freq2_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
coord_flip() +
theme_minimal() +
scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 25 Most Frequent Bigrams")
ggplot(freq3_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
coord_flip() +
theme_minimal() +
scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 25 Most Frequent Trigrams")
ggplot(freq4_top25, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
coord_flip() +
theme_minimal() +
scale_fill_gradient(low="paleturquoise4", high="burlywood2") +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 25 Most Frequent Quadgrams")
This analysis has given me an ordered ranking of the most frequent unigrams, bigrams, trigrams, and quadgrams in the sample. Using these data frames, I will create a model that first looks at the quadgram frequency table to find the phrases most likely to follow the input text. If no matching phrase is found, the trigram frequency table will be used in the same manner. If the trigram table has no match either, the bigram table will be used; I do not expect the bigram table to be especially useful in practice, but it seems a logical fallback in theory.
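The back-off order described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tables `toy_freq2`, `toy_freq3`, and `toy_freq4` are made-up stand-ins with the same `word` and `freq` columns as the real frequency tables, and the helpers `lookup_ngram` and `predict_next` are hypothetical names, not a finished model.

```r
# Toy frequency tables in the same shape as freq2, freq3, freq4 (word, freq).
toy_freq2 <- data.frame(word = c("want to", "want the", "want for", "to go"),
                        freq = c(50, 20, 10, 30), stringsAsFactors = FALSE)
toy_freq3 <- data.frame(word = c("i want to", "i want a", "want to go"),
                        freq = c(25, 5, 12), stringsAsFactors = FALSE)
toy_freq4 <- data.frame(word = c("i want to go", "i want to be"),
                        freq = c(8, 6), stringsAsFactors = FALSE)

# Hypothetical helper: look up continuations of the last (n - 1) input words
# in one n-gram table and return the final words, most frequent first.
lookup_ngram <- function(words, tbl, n, k) {
  if (length(words) < n - 1) return(character(0))
  prefix <- paste(tail(words, n - 1), collapse = " ")
  hits <- tbl[startsWith(tbl$word, paste0(prefix, " ")), , drop = FALSE]
  hits <- hits[order(hits$freq, decreasing = TRUE), , drop = FALSE]
  head(sub(".* ", "", hits$word), k)      # keep only the last word
}

# Hypothetical back-off predictor: try quadgrams, then trigrams, then bigrams.
predict_next <- function(input, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  tables <- list(list(toy_freq4, 4), list(toy_freq3, 3), list(toy_freq2, 2))
  for (entry in tables) {
    preds <- lookup_ngram(words, entry[[1]], entry[[2]], k)
    if (length(preds) > 0) return(preds)  # stop at the first table that matches
  }
  character(0)                            # nothing matched at any level
}

predict_next("I want to")
predict_next("want")
```

With these toy tables, a three-word input is resolved at the quadgram level, while a one-word input falls all the way back to the bigram table. One design question this leaves open is what to return when no table matches; the sketch simply returns an empty vector.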
My app will have a simple user interface consisting of instructions and an empty text box for entry. The app will then use its model to find the most likely next words and display the top three in ranked order. The app might display these as continuations of the input text. For example, if the input text is “want”, the top three results would each start with “want” followed by the model's prediction: “want to”, “want the”, “want for”.
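The ranked continuation display described above might be produced by a small formatting helper like the one below, which a Shiny server function could call before rendering. The name `format_suggestions` and its arguments are assumptions for illustration; the predictions are passed in as a plain character vector.

```r
# Hypothetical helper: turn predicted next words into ranked continuations.
format_suggestions <- function(input, predictions, k = 3) {
  predictions <- head(predictions, k)               # keep at most k options
  if (length(predictions) == 0) return(character(0))
  paste0(seq_along(predictions), ". ", trimws(input), " ", predictions)
}

format_suggestions("want", c("to", "the", "for"))
```

Keeping the formatting separate from the prediction step would let the display change (for example, showing only the predicted word) without touching the model.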
Any suggestions the graders have on this idea would be much appreciated, as it still feels like a very rough concept.