library(dplyr)
library(stringi)
library(tm)
library(ggplot2)
library(wesanderson)

I had a terrible time trying to get RWeka working on my Mac. I still haven't found a solution, so I had to come up with a different way. If anyone knows a quick fix, let me know, because I spent all day scouring Google and Stack Overflow.

Anyway, this is the Week 2 Check In. In the following, I read and clean the data, and look at some of the most prominent words.

blogs <- readLines("~/Downloads/final 3/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/Downloads/final 3/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("~/Downloads/final 3/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

First Pass:

# words-per-line summary (min, mean, max) for each dataset
wordperline <- sapply(list(blogs, news, twitter),
                      function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wordperline) <- c('WPL_Min', 'WPL_Mean', 'WPL_Max')

# combine line counts, character counts, word counts, and the WPL summary into one table
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    wordperline
  ))
)
head(stats)
##   Dataset   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
## 1   blogs  899288 206824382 37570839       0    41.75    6726
## 2    news 1010242 203223154 34494539       1    34.41    1796
## 3 twitter 2360148 162096241 30451170       1    12.75      47

From this, we can see that the longest lines are in the blogs data (a maximum of 6,726 words per line), while Twitter is, unsurprisingly, the lowest at 47.

Now, we want to reduce these to something easier to analyze and visualize, so I'm going to strip out non-English (non-ASCII) characters and then sample a small percentage (1%) of each dataset, using a seed of 131.

# strip non-ASCII characters from each dataset
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")

# draw a 1% sample from each source and combine them
set.seed(131)
sample_data <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
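
Since everything below depends on this sample, it can help to write it out once so later sessions reuse exactly the same lines. The path below is only an example, not part of the original analysis.

# optional: save the sampled lines for reuse (example path only)
writeLines(sample_data, "~/Downloads/sample_data.txt")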

Clean Up Words

We need to clean the text up to make it easier to analyze. That means removing extra whitespace, lowercasing everything, and taking out punctuation and numbers. We also need to remove profanity.

# read the profanity list, lowercase it, and strip stray commas
profane <- read.csv(file = '~/Desktop/profanity.txt', stringsAsFactors = FALSE)
profane <- gsub(",", "", tolower(profane[, 1]))

corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))  # tolower isn't a tm transformation, so wrap it
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, profane)
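
Just as a sanity check (not part of the pipeline itself), we can print one cleaned document and make sure the transformations did what we expect:

writeLines(content(corpus[[1]]))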

Making N-Grams

Now, like I said, I had trouble using RWeka's tokenizers, so I made my own using the ngrams() and words() functions from the NLP package (which loads with tm). I'm sure RWeka's are faster and more robust, but either way, these functions create n-grams from my data.

UnigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = TRUE)

BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = TRUE)

TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = TRUE)
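
As a quick, purely illustrative check that these behave like a tokenizer should, we can run one of them on a toy document (this snippet is just for demonstration):

toy <- PlainTextDocument("the quick brown fox jumps over the lazy dog")
BigramTokenizer(toy)
## "the quick" "quick brown" "brown fox" ...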


uni_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

We can cut these down by keeping only the n-grams that appear at least 20 times.

uni_corpus <- findFreqTerms(uni_matrix, lowfreq = 20)
bi_corpus <- findFreqTerms(bi_matrix, lowfreq = 20)
tri_corpus <- findFreqTerms(tri_matrix, lowfreq = 20)

# unigram frequencies, sorted from most to least common
uni_corpus_freq <- rowSums(as.matrix(uni_matrix[uni_corpus, ]))
uni_corpus_freq <- data.frame(word = names(uni_corpus_freq), frequency = uni_corpus_freq)
uni_corpus_freq <- uni_corpus_freq[with(uni_corpus_freq, order(-frequency)), ]

# bigram frequencies
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus, ]))
bi_corpus_freq <- data.frame(word = names(bi_corpus_freq), frequency = bi_corpus_freq)
bi_corpus_freq <- bi_corpus_freq[with(bi_corpus_freq, order(-frequency)), ]

# trigram frequencies
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus, ]))
tri_corpus_freq <- data.frame(word = names(tri_corpus_freq), frequency = tri_corpus_freq)
tri_corpus_freq <- tri_corpus_freq[with(tri_corpus_freq, order(-frequency)), ]
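
A quick look at the top of each table confirms they are sorted before we plot them:

head(uni_corpus_freq, 3)
head(bi_corpus_freq, 3)
head(tri_corpus_freq, 3)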

Even this is too many to display, so for plotting purposes we'll keep only the 30 most frequent n-grams of each type.

trigram_ggplot <- tri_corpus_freq[1:30, ]
bigram_ggplot <- bi_corpus_freq[1:30, ]
unigram_ggplot <- uni_corpus_freq[1:30, ]

Frequency Histograms

And some visualizations of our data.

wp2 <- wesanderson::wes_palette("Royal1", 30, "continuous")

trigram_plot <-
  ggplot(trigram_ggplot, aes(reorder(word, frequency), frequency)) +
  geom_bar(stat = "identity", color="black", fill=wp2) +
  xlab("Trigrams") + ylab("Count") + ggtitle("Top Trigrams by Frequency") +
  theme(plot.title = element_text(lineheight=.7, face="bold")) +
  coord_flip()
trigram_plot

bigram_plot <-
  ggplot(bigram_ggplot, aes(reorder(word, frequency), frequency)) +
  geom_bar(stat = "identity", color="black", fill=wp2) +
  xlab("Bigrams") + ylab("Count") + ggtitle("Top Bigrams by Frequency") +
  theme(plot.title = element_text(lineheight=.7, face="bold")) +
  coord_flip()
bigram_plot

unigram_plot <-
  ggplot(unigram_ggplot, aes(reorder(word, frequency), frequency)) +
  geom_bar(stat = "identity", color="black", fill=wp2) +
  xlab("Unigrams") + ylab("Count") + ggtitle("Top Unigrams by Frequency") +
  theme(plot.title = element_text(lineheight=.7, face="bold")) +
  coord_flip()
unigram_plot
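
Since the three plots differ only in the data frame and the labels, a small helper could cut the repetition. This is just a sketch of that idea; the plots above do not depend on it.

plot_ngram <- function(df, label, title) {
  ggplot(df, aes(reorder(word, frequency), frequency)) +
    geom_bar(stat = "identity", color = "black",
             fill = wes_palette("Royal1", nrow(df), "continuous")) +
    xlab(label) + ylab("Count") + ggtitle(title) +
    theme(plot.title = element_text(lineheight = .7, face = "bold")) +
    coord_flip()
}
plot_ngram(trigram_ggplot, "Trigrams", "Top Trigrams by Frequency")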

Plans for creating a prediction algorithm and Shiny app

After this, my first plan is to figure out what the hell is going on with my Java setup and RWeka. After that, the obvious next steps are finalizing the prediction algorithm and deploying it as a Shiny application.

I want the application to learn the associations between n-grams and the order they appear in, build a model from those relationships, and then predict the next word from user input.
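
As a rough illustration of the direction I have in mind (not the final algorithm), a simple backoff-style lookup over the frequency tables built above could work like this: take the last one or two words of the user's input, look for matching trigrams, and fall back to bigrams (and then the most common unigram) when nothing matches. The predict_next function below is only a sketch under those assumptions.

# A minimal sketch of a backoff lookup using the n-gram tables above.
# Illustrative only: a real model needs smoothing and better handling of unseen n-grams.
predict_next <- function(input) {
  tokens <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(tokens)
  if (n == 0) return(as.character(uni_corpus_freq$word[1]))
  if (n >= 2) {
    # trigrams whose first two words match the last two input words
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- tri_corpus_freq[grepl(paste0("^", prefix, " "), tri_corpus_freq$word), ]
    if (nrow(hits) > 0) {
      return(sub(paste0("^", prefix, " "), "", hits$word[1]))  # most frequent continuation
    }
  }
  # back off to bigrams that start with the last input word
  prefix <- tokens[n]
  hits <- bi_corpus_freq[grepl(paste0("^", prefix, " "), bi_corpus_freq$word), ]
  if (nrow(hits) > 0) {
    return(sub(paste0("^", prefix, " "), "", hits$word[1]))
  }
  as.character(uni_corpus_freq$word[1])  # last resort: the most common word overall
}

predict_next("thanks for the")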