First, I load the raw data into character vectors:
dir <- file.path("~", "Coursera", "Capstone", "final", "en_US")
news_data <- readLines(file.path(dir, "en_US.news.txt"))
blogs_data <- readLines(file.path(dir, "en_US.blogs.txt"))
twitter_data <- readLines(file.path(dir, "en_US.twitter.txt"))
Next, we look at how many lines and words are in each set of documents:
news_line_count <- length(news_data)
blogs_line_count <- length(blogs_data)
twitter_line_count <- length(twitter_data)
# Approximate word counts: count runs of non-word characters per line, plus one
news_word_count <- sum(sapply(gregexpr("\\W+", news_data), length) + 1)
blogs_word_count <- sum(sapply(gregexpr("\\W+", blogs_data), length) + 1)
twitter_word_count <- sum(sapply(gregexpr("\\W+", twitter_data), length) + 1)
word_counts <- c(twitter_word_count, news_word_count, blogs_word_count)
line_counts <- c(twitter_line_count, news_line_count, blogs_line_count)
m <- matrix(c(word_counts, line_counts), byrow=TRUE, ncol=3, nrow=2)
rownames(m) <- c("Word count", "Line count")
colnames(m) <- c("Twitter", "News", "Blogs")
m
##             Twitter     News    Blogs
## Word count 32793399 36721087 39120549
## Line count  2360148  1010242   899288
To analyze term frequencies efficiently, I take a random sample of each corpus:
sample_size <- 0.05
set.seed(1234)
news_subset <- sample(news_data, length(news_data) * sample_size)
blogs_subset <- sample(blogs_data, length(blogs_data) * sample_size)
twitter_subset <- sample(twitter_data, length(twitter_data) * sample_size)
combined_doc <- sample(c(news_subset, blogs_subset, twitter_subset))
I then tokenize the combined sample into unigrams with the quanteda package and build a document-feature matrix:
library(quanteda)
tokenized_doc <- tokenize(toLower(combined_doc),
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 1)
my_dfm <- dfm(tokenized_doc)
##
## ... indexing documents: 213,483 documents
## ... indexing features: 142,584 feature types
## ... created a 213483 x 142584 sparse dfm
## ... complete.
## Elapsed time: 3.193 seconds.
top_features <- topfeatures(my_dfm, 10)
top_features
##    the     to    and      a     of     in      i    for     is   that
## 238371 137627 120771 118170 100250  82450  82010  54459  53668  51858
barplot(top_features, horiz=TRUE)
Many of these are stop words. It would be interesting to see which words are most common after the stop words are removed:
tokenized_doc <- removeFeatures(tokenized_doc, stopwords("english"))
top_features <- topfeatures(dfm(tokenized_doc), 10)
top_features
##
## ... indexing documents: 213,483 documents
## ... indexing features: 142,410 feature types
## ... created a 213483 x 142410 sparse dfm
## ... complete.
## Elapsed time: 2.55 seconds.
##  will  said  just   one  like   can   get  time   new  good
## 15853 15405 15012 14643 13484 12274 11360 10568  9771  9001
barplot(top_features, horiz = TRUE)
Next, I repeat the analysis with 2-grams (bigrams):
tokenized_doc <- tokenize(toLower(combined_doc),
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 2)
top_features <- topfeatures(dfm(tokenized_doc))
##
## ... indexing documents: 213,483 documents
## ... indexing features: 1,576,113 feature types
## ... created a 213483 x 1576113 sparse dfm
## ... complete.
## Elapsed time: 6.097 seconds.
top_features
## of_the in_the to_the for_the on_the to_be at_the and_the
##  21816  20681  10553    9978   9957  8103   7120    6427
##  in_a with_the
##  6008     5376
barplot(top_features, horiz=TRUE)
Finally, I repeat the analysis with 3-grams (trigrams):
tokenized_doc <- tokenize(toLower(combined_doc),
                          removePunct = TRUE,
                          removeNumbers = TRUE,
                          removeTwitter = TRUE,
                          ngrams = 3)
top_features <- topfeatures(dfm(tokenized_doc))
##
## ... indexing documents: 213,483 documents
## ... indexing features: 3,358,881 feature types
## ... created a 213483 x 3358881 sparse dfm
## ... complete.
## Elapsed time: 7.982 seconds.
top_features
## one_of_the a_lot_of thanks_for_the to_be_a going_to_be
##       1772     1410           1273     905         866
## the_end_of out_of_the i_want_to some_of_the it_was_a
##        783        724       724         721      704
barplot(top_features, horiz = TRUE)
We are interested in the probable next word in a sequence of words. We will build a prediction model that takes an arbitrary sequence of tokens (n-grams) and returns the likely next word.
The modeling steps are, in general, as follows: tokenize the sampled corpus and build 1-, 2-, and 3-gram frequency tables; given an input phrase, look up the most frequent n-gram that begins with the final words of the input; and fall back to a lower-order table when no higher-order match is found.
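As a rough illustration of these steps, the sketch below shows the kind of lookup-and-back-off function the model could be built around. It assumes bigram_freq and trigram_freq are named count vectors derived from the 2- and 3-gram dfms above (for example via colSums()), with features in quanteda's "token1_token2" naming style; predict_next_word and the table names are placeholders rather than part of the analysis so far.
# A minimal sketch, assuming bigram_freq and trigram_freq are named numeric
# vectors of n-gram counts with "_"-joined feature names
predict_next_word <- function(input, bigram_freq, trigram_freq) {
  tokens <- unlist(strsplit(tolower(input), "\\s+"))
  # Try the trigram table first when at least two words of context are given
  if (length(tokens) >= 2) {
    prefix <- paste0(paste(tail(tokens, 2), collapse = "_"), "_")
    matches <- trigram_freq[startsWith(names(trigram_freq), prefix)]
    if (length(matches) > 0) {
      return(sub(prefix, "", names(which.max(matches)), fixed = TRUE))
    }
  }
  # Fall back to the bigram table keyed on the last word only
  prefix <- paste0(tail(tokens, 1), "_")
  matches <- bigram_freq[startsWith(names(bigram_freq), prefix)]
  if (length(matches) > 0) {
    return(sub(prefix, "", names(which.max(matches)), fixed = TRUE))
  }
  NA_character_
}
With those tables in place, a call such as predict_next_word("thanks for", bigram_freq, trigram_freq) would be expected to return "the", in line with the trigram counts above.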
The Shiny app will provide an interactive interface for testing the model developed in the steps above. It will take a string as the predictor variable and return a single response: the completed n-gram. For an input of two words, the most likely three-word sequence will be returned.
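To make the planned interface concrete, the following is a minimal Shiny sketch under the same assumptions; it relies on the hypothetical predict_next_word() helper and the placeholder bigram_freq and trigram_freq tables from the sketch above, and is only an outline of the final app.
library(shiny)

# A minimal sketch of the planned interface, assuming predict_next_word(),
# bigram_freq, and trigram_freq (placeholders from the sketch above) are
# available in the app's environment
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next_word(input$phrase, bigram_freq, trigram_freq)
  })
}

shinyApp(ui = ui, server = server)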