This milestone report aims to accomplish the following:
Demonstrate downloading and loading the data.
Provide a basic report of summary statistics about the data sets.
Report interesting findings amassed so far.
Outline plans for creating a prediction algorithm and Shiny app.
library(knitr)
library(tm)
## Loading required package: NLP
library(NLP)
library(SnowballC)
## Error: there is no package called 'SnowballC'
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.1.2
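The SnowballC error above just means that package was not installed on the machine that knit this report; installing it from CRAN beforehand would clear the error:
# One-time setup: install the missing package from CRAN
install.packages("SnowballC")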
After downloading the data, it is apparent that this data set is large.
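The sizes reported just below can be read directly off the raw files with base R's file.info(); a minimal sketch, assuming the same file paths used later in this report (MB here meaning 10^6 bytes):
# Report raw file sizes in MB (bytes / 10^6); paths assume the unzipped final/en_US folder
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
file.info(files)$size / 10^6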
Blog Text Data: 210.160014 MB
News Text Data: 205.811889 MB
Twitter Text Data: 167.105338 MB
To keep the application running quickly while still building the model on a robust set of training data, I used the following method to create representative sample sets of the large text files and combine them into one Corpus.
# Read the full text files into memory
blogs <- readLines("final/en_US/en_US.blogs.txt")
news <- readLines("final/en_US/en_US.news.txt")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE)
dir.create("final/en_US/Sample/", showWarnings = FALSE)
set.seed(2000)
# Keep roughly 0.5% of each file's lines (indices drawn with rbinom) and write each sample to CSV
blogs <- blogs[rbinom(length(blogs)*.005, length(blogs), .5)]
write.csv(blogs, file = "final/en_US/Sample/blogs.csv", row.names = FALSE)
news <- news[rbinom(length(news)*.005, length(news), .5)]
write.csv(news, file = "final/en_US/Sample/news.csv", row.names = FALSE)
twitter <- twitter[rbinom(length(twitter)*.005, length(twitter), .5)]
write.csv(twitter, file = "final/en_US/Sample/twitter.csv", row.names = FALSE)
# Combine the three sample files into a single tm Corpus
dat <- Corpus(DirSource("final/en_US/Sample"), readerControl = list(reader = readPlain, language = "en_US"))
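For comparison, a sample of roughly the same size could also be drawn as a simple random sample with base R's sample(); a minimal sketch with hypothetical object names (blogs_full, blogs_sample), not the code used above:
# Draw a ~0.5% simple random sample of the blog lines (illustration only)
blogs_full <- readLines("final/en_US/en_US.blogs.txt")
blogs_sample <- sample(blogs_full, size = round(length(blogs_full) * 0.005))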
Now we have a much more manageable dataset.
Blog Text Data in sample corpus: 2 lines of text
News Text Data in sample corpus: 2 lines of text
Twitter Text Data in sample corpus: 2 lines of text
Word counts for the documents in the Corpus follow:
# Build a term-document matrix; column sums give the total word count per document
dtm <- TermDocumentMatrix(dat)
dtm <- as.matrix(dtm)
colSums(dtm)
## blogs.csv news.csv twitter.csv
## 148252 142698 120183
# Remove extra whitespace
dat <- tm_map(dat, stripWhitespace)
# Transform all characters to lowercase
dat <- tm_map(dat, content_transformer(tolower))
# Remove numbers
dat <- tm_map(dat, removeNumbers)
# Remove punctuation
dat <- tm_map(dat, removePunctuation)
To remove profane words from the data, I found a list of 360 profane words on GitHub and forked the repo. Here is the link to the list: List of profane words
# Read the profanity list as a plain character vector (removeWords expects one)
swears <- readLines("profanity")
## Warning: cannot open file 'profanity': No such file or directory
## Error: cannot open the connection
dat <- tm_map(dat, removeWords, swears)
## Warning: all scheduled cores encountered errors in user code
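The warning and error above come from the profanity file not being on disk when this report was knit. One defensive option, assuming the word list is saved locally under the same name, is to guard the step with file.exists():
# Only attempt profanity removal if the word list is actually present
if (file.exists("profanity")) {
  swears <- readLines("profanity")
  dat <- tm_map(dat, removeWords, swears)
}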
To kick off the exploratory data analysis, I performed some n-gram analysis using the NGramTokenizer function in the RWeka package. In my next steps, I'd like to find a more efficient method for tokenization, as this package takes a lot of system resources (a rough base R alternative is sketched after the trigram plot below). Nonetheless, the results are interesting:
Top Bigrams:
# Tokenize the corpus into bigrams and tabulate their frequencies
Bigram <- NGramTokenizer(dat, Weka_control(min = 2, max = 2))
freq.Bigram <- data.frame(table(Bigram))
sort.Bigram <- freq.Bigram[order(freq.Bigram$Freq, decreasing = TRUE), ]
Bigram_top10 <- head(sort.Bigram, 10)
# Plot the ten most frequent bigrams
barplot(Bigram_top10$Freq, names.arg = Bigram_top10$Bigram, border = NA, las = 2, main = "Top 10 Most Frequent BiGrams", cex.main = 2)
Top Trigrams:
# Tokenize the corpus into trigrams and tabulate their frequencies
Trigram <- NGramTokenizer(dat, Weka_control(min = 3, max = 3))
freq.Trigram <- data.frame(table(Trigram))
sort.Trigram <- freq.Trigram[order(freq.Trigram$Freq, decreasing = TRUE), ]
Trigram_top10 <- head(sort.Trigram, 10)
# Plot the ten most frequent trigrams
barplot(Trigram_top10$Freq, names.arg = Trigram_top10$Trigram, border = NA, las = 2, main = "Top 10 Most Frequent TriGrams", cex.main = 2)
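Because NGramTokenizer is so resource-hungry, one lighter-weight direction I may explore is tokenizing with base R alone; a rough sketch of bigram counting with strsplit() and paste(), shown as an illustration only, not the code behind the plots above:
# Hedged sketch: count bigrams in a character vector of cleaned lines using only base R
count_bigrams <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))  # crude word split
  words <- words[words != ""]
  bigrams <- paste(head(words, -1), tail(words, -1))     # adjacent word pairs
  sort(table(bigrams), decreasing = TRUE)
}
This would trade RWeka's richer tokenization for speed and a much smaller memory footprint.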
Besides the n-grams, I've noted the following observations about the data that I will need to factor into my models:
Foreign-language text exists in the dataset and will need to be dealt with.
The Twitter data is (as expected) made up of short phrases; I expect it will yield different results than the blog and news data, which are much more context-rich.
I suspect it will be difficult for my model to handle very common terms such as those in the top bi- and trigrams, since many different words can follow them. Longer, more complex words may be easier to predict from the data, but I will need to watch for overfitting in those cases.
I still have work to do in cleaning the data in order to make an efficient app. I need to:
Identify a more efficient method for tokenization
Find a way to deal with foreign text in the data
Build several very small training/test sets so that I can efficiently run models and test them on new data
As far as modeling, I will explore several avenues for predicting the next word when my app is given 1-4 words. I hope to have users enter 1-4 words on the left side of my GUI and, on the right, have the 3 top suggested words automatically appear, perhaps displayed larger when the model is more confident in them.
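As a very rough first pass at that lookup, the bigram frequency table built above already supports a "top 3 next words" query; a minimal sketch, assuming a table shaped like sort.Bigram with columns Bigram and Freq (the function name predict_next is hypothetical):
# Hedged sketch: given the last word typed, return the n most frequent words
# that follow it in a bigram frequency table like sort.Bigram
predict_next <- function(word, bigram_freq, n = 3) {
  bigrams <- as.character(bigram_freq$Bigram)
  prefix <- paste0(tolower(word), " ")
  matches <- bigram_freq[substring(bigrams, 1, nchar(prefix)) == prefix, ]
  matches <- matches[order(matches$Freq, decreasing = TRUE), ]
  # the second token of each matching bigram is the candidate next word
  sapply(strsplit(as.character(head(matches$Bigram, n)), " "), `[`, 2)
}
Extending this idea to the trigram table would let the app use the last two words when they are available.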