Instructions
The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
- Build a basic n-gram model: using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
- Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
After downloading the dataset from Coursera (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipping it, we compute the size of each English file in MB:
blogs.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
Loading the three files into the environment and counting the words on each line:
twitter=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.twitter.txt")
blogs=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.blogs.txt")
news=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.news.txt")
library(stringi)
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
How many lines does the Twitter file contain?
length(twitter)
## [1] 2360148
Which file contains the longest line, measured in characters?
max_twitter=max(nchar(twitter))
max_blogs=max(nchar(blogs))
max_news=max(nchar(news))
max_blogs
## [1] 40833
max_news
## [1] 11384
max_twitter
## [1] 140
If we divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do we get?
love_count=sum(grepl("love", twitter))
hate_count=sum(grepl("hate", twitter))
love_count/hate_count
## [1] 4.108592
The one tweet in the en_US twitter data set that matches the word “biostats” says what?
biostats=grep("biostats", twitter)
twitter[biostats]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
How many tweets contain the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”?
sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing", twitter))
## [1] 3
data.frame(source = c("blogs", "news", "twitter"),
file.size.MB = c(blogs.size, news.size, twitter.size),
num.lines = c(length(blogs), length(news), length(twitter)),
num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093369       12.75063
n-gram model (think Markov Chains).
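To make the Markov-chain analogy concrete: an n-gram model assumes the next word depends only on the previous n-1 words. For a trigram model with maximum-likelihood estimates, this can be written as below, where C(.) denotes a count in the training sample (notation introduced here for illustration, not taken from the report):

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})
  = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```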
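First, we convert the Twitter text from UTF-8 to ASCII. With sub = "byte", each character that has no ASCII equivalent (emoji, curly quotes, and so on) is replaced by its bytes in hexadecimal, written as <xx>:
cleanedtwitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "byte")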
Then, we draw a random sample of 1% of the lines from each of the three files:
require(tm)
## Loading required package: tm
## Loading required package: NLP
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
Now, we clean the sample:
- replace URLs and Twitter handles with spaces
- convert all characters to lower case
- remove English stop words
- remove punctuation
- remove numbers
- strip extra whitespace
- convert everything back to plain-text documents
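corpus <- VCorpus(VectorSource(data.sample))
# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove URLs; \\S+ stops at the first whitespace, so the rest of the line is kept
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+")
# Remove Twitter handles
corpus <- tm_map(corpus, toSpace, "@\\S+")
# Wrap base functions in content_transformer so the documents keep their class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)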
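To see what these transformations do to a single line, here is a minimal base-R sketch that roughly mirrors the intent of the steps above without tm. The helper name clean_line, the simplified patterns, and the sample tweet are assumptions made for illustration; stop-word removal is left out to keep the example dependency-free.

```r
# Hypothetical helper: apply the cleaning steps to one string using base R only.
clean_line <- function(x) {
  x <- gsub("(f|ht)tps?://\\S+", " ", x, perl = TRUE)  # URLs -> space
  x <- gsub("@\\S+", " ", x, perl = TRUE)              # Twitter handles -> space
  x <- tolower(x)                                      # lower case
  x <- gsub("[[:punct:]]+", "", x)                     # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)                     # drop numbers
  x <- gsub("\\s+", " ", x, perl = TRUE)               # collapse whitespace runs
  trimws(x)                                            # trim both ends
}

# Made-up example line
clean_line("Thanks @bob! Read https://example.com/post at 10 AM...")
## [1] "thanks read at am"
```

Note that the handle pattern also swallows punctuation attached to the handle (the "!" here), which is harmless because punctuation is removed anyway.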
We are now ready to perform exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words in the data. Here we list the most common unigrams, bigrams, and trigrams.
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
options(mc.cores=1)
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
makePlot <- function(data, label) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
}
# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
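For intuition about what the tokenizers and frequency tables compute, here is a small base-R sketch that builds bigrams from a toy sentence and counts them. The helper make_ngrams and the toy text are hypothetical; the report itself relies on RWeka's NGramTokenizer and tm's TermDocumentMatrix.

```r
# Hypothetical helper: slide a window of size n over a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input (made up), already lower-cased and stripped of punctuation
tokens  <- strsplit("i love data and i love r", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(tokens, 2)

# Count each bigram and sort by decreasing frequency, as getFreq does
freq <- sort(table(bigrams), decreasing = TRUE)
names(freq)[1]       # most common bigram: "i love"
as.integer(freq[1])  # its count: 2
```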
Here is a bar chart of the 30 most common unigrams in the data sample.
makePlot(freq1, "30 Most Common Unigrams")
Here is a bar chart of the 30 most common bigrams in the data sample.
makePlot(freq2, "30 Most Common Bigrams")
Here is a bar chart of the 30 most common trigrams in the data sample.
makePlot(freq3, "30 Most Common Trigrams")
This concludes our exploratory analysis. The next steps of this capstone project are to finalize our predictive algorithm and deploy it as a Shiny app.
Our predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above. One possible strategy is to use the trigram model to predict the next word. If no matching trigram is found, the algorithm backs off to the bigram model, and then to the unigram model if needed.
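The back-off strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final algorithm: the three frequency tables are tiny made-up data frames shaped like the output of getFreq (columns word and freq, sorted by decreasing frequency), and lookup and predict_next are hypothetical helper names.

```r
# Toy frequency tables sorted by decreasing frequency (made-up data).
tri <- data.frame(word = c("i love you", "i love r"), freq = c(5, 2),
                  stringsAsFactors = FALSE)
bi  <- data.frame(word = c("love you", "thank you", "you are"),
                  freq = c(9, 7, 4), stringsAsFactors = FALSE)
uni <- data.frame(word = c("the", "you", "love"), freq = c(50, 30, 20),
                  stringsAsFactors = FALSE)

# Return the last word of the most frequent n-gram that starts with `prefix`,
# or NA if no n-gram in the table starts with it.
lookup <- function(tbl, prefix) {
  hit <- tbl$word[startsWith(tbl$word, paste0(prefix, " "))]
  if (length(hit) == 0) return(NA_character_)
  sub(".* ", "", hit[1])
}

# Hypothetical back-off predictor: trigram, then bigram, then unigram.
predict_next <- function(phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  n <- length(words)
  if (n >= 2) {
    guess <- lookup(tri, paste(words[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- lookup(bi, words[n])
    if (!is.na(guess)) return(guess)
  }
  uni$word[1]  # fall back to the most frequent word overall
}

predict_next("I love")         # trigram match: "you"
predict_next("well thank")     # no trigram, bigram match: "you"
predict_next("something odd")  # no match at all, unigram fallback: "the"
```

A linear scan with startsWith is fine for toy tables; with the full n-gram tables a keyed lookup (for example a named vector or an indexed table) would likely be needed to keep response time acceptable.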
The user interface of the Shiny app will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word after a short delay. Our plan is also to allow the user to configure how many words our app should suggest.
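A rough sketch of such an interface is shown below, assuming the shiny package is installed. It is only a skeleton: suggest_words is a hypothetical stand-in that returns fixed words instead of real predictions, and the input names, the 500 ms pause, and the limit of 5 suggestions are arbitrary choices made for illustration.

```r
library(shiny)

# Hypothetical stand-in for the real predictor: returns up to n fixed words.
suggest_words <- function(phrase, n) {
  head(c("the", "to", "and", "a", "of"), n)
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  numericInput("n", "Number of suggestions:", value = 3, min = 1, max = 5),
  textOutput("suggestions")
)

server <- function(input, output, session) {
  # Wait until the user has paused typing for 500 ms before predicting.
  phrase <- debounce(reactive(input$phrase), 500)
  output$suggestions <- renderText({
    req(phrase())  # show nothing while the text box is empty
    paste(suggest_words(phrase(), input$n), collapse = ", ")
  })
}

# Launch the app only when run interactively.
if (interactive()) shinyApp(ui, server)
```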