Data Science Capstone: Milestone Report

Instructions

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

- Build a basic n-gram model: using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
- Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

Data:

The data set can be downloaded from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Basic statistics and operations:

blogs.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

Loading the three files into the environment and counting the words on each line:

twitter=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.twitter.txt")
blogs=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.blogs.txt")
news=readLines("/Users/chinettis/Desktop/BackToMEF/DataScience/CapstoneProject/final/en_US/en_US.news.txt")

library(stringi)
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

How many lines does the Twitter file contain?

length(twitter)

## [1] 2360148

Which file contains the longest line (in characters)?

max_twitter=max(nchar(twitter))
max_blogs=max(nchar(blogs))
max_news=max(nchar(news))
max_blogs

## [1] 40833

max_news

## [1] 11384

max_twitter

## [1] 140

If we divide the number of lines in which the word “love” (all lowercase) occurs by the number of lines in which the word “hate” (all lowercase) occurs, about what do we get?

love_count=sum(grepl("love", twitter))
hate_count=sum(grepl("hate", twitter))
love_count/hate_count

## [1] 4.108592
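One caveat about the counts above: grepl("love", ...) matches any line that contains the substring, so "lovely" and "gloves" are counted as well. The sketch below compares substring and whole-word matching on a few invented lines (not taken from the data set); the variable names are only for illustration.

```r
# Toy lines, invented for illustration (not taken from the data set).
toy <- c("i love this", "lovely day", "gloves are warm", "i hate rain")

# Substring match, as in the report: also counts "lovely" and "gloves".
substring.count <- sum(grepl("love", toy, fixed = TRUE))

# Whole-word match: \\b marks a word boundary (perl = TRUE for PCRE syntax).
word.count <- sum(grepl("\\blove\\b", toy, perl = TRUE))

print(c(substring = substring.count, whole.word = word.count))
```

Whether the substring or the whole-word count is the right one depends on the question being asked; the ratio reported above uses substring matching.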

The one tweet in the en_US twitter data set that matches the word “biostats” says what?

biostats=grep("biostats", twitter)
twitter[biostats]

## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing” ?

sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing", twitter))

## [1] 3

data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))

##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093369       12.75063

Data Cleaning

The prediction model will be an n-gram model (think Markov chains), so the raw text must be cleaned first.

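First, we deal with the non-ASCII characters in the Twitter data. Note that sub = "byte" does not delete them: each non-ASCII byte is replaced by its hex code in angle brackets (use sub = "" to drop them instead):

cleanedtwitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "byte")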
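The effect of the fourth argument of iconv (named sub) can be checked on a toy string before applying it to 2.3 million tweets. This is a minimal sketch; the string is invented, and the exact substitution text can vary with the platform's iconv implementation.

```r
x <- "caf\u00e9 na\u00efve"   # toy string with two non-ASCII characters

# sub = "byte": per ?iconv, each non-convertible byte is replaced by its
# hex code in angle brackets, so the result is longer than the input.
as.bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

# sub = "": non-convertible bytes are replaced by nothing, i.e. dropped.
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(as.bytes)
print(dropped)
```

Either way the result contains only ASCII characters; the difference is whether a trace of the original bytes is kept in the text.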

Then, we draw a random 1% sample of the lines from each of the three files:

require(tm)

## Loading required package: tm

## Loading required package: NLP

set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
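The sampling step can be tried on a small stand-in vector to confirm the sample size and that no line is drawn twice. A minimal sketch, assuming a made-up vector toy.lines in place of the real files:

```r
set.seed(679)                        # same seed as in the report
toy.lines <- paste("line", 1:1000)   # stand-in for one of the text files

# Sample 1% of the lines without replacement; floor() makes the
# truncation of a non-integer size explicit.
toy.sample <- sample(toy.lines, floor(length(toy.lines) * 0.01))

print(length(toy.sample))
```

Because sample() draws without replacement by default, every sampled line is a distinct line of the original file.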

Now, we proceed to:
- replace URLs and Twitter handles with spaces
- convert all characters to lower case
- remove English stop words
- remove all punctuation
- remove all numbers
- collapse extra whitespace
- force everything back to plain text documents

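corpus <- VCorpus(VectorSource(data.sample))
# Replace every match of `pattern` with a single space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# URLs: match up to the next whitespace instead of greedily to the last "."
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+")
# Twitter handles: [:space:] is used because \\s is not reliably treated as
# a whitespace class inside [] by R's default regex engine
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
# Wrap base functions in content_transformer() so documents keep their class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Make sure every document is a PlainTextDocument
corpus <- tm_map(corpus, PlainTextDocument)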
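To check what the cleaning steps do to a single line, roughly the same pipeline can be sketched in base R on a plain character vector. The helper name clean_text and the example sentence are assumptions made for illustration; stop word removal is left out, and POSIX character classes are used in the patterns, so this is not a drop-in replacement for the tm code.

```r
# Hypothetical helper mirroring the cleaning steps on a character vector.
clean_text <- function(x) {
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)  # URLs -> space
  x <- gsub("@[^[:space:]]+", " ", x)              # Twitter handles -> space
  x <- tolower(x)                                  # lower case
  x <- gsub("[[:punct:]]+", "", x)                 # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                 # remove numbers
  x <- gsub("[[:space:]]+", " ", x)                # collapse whitespace
  trimws(x)                                        # drop leading/trailing space
}

cleaned <- clean_text("Hey @bob, check http://example.com/a.html NOW!! 42 times")
print(cleaned)
```

Running individual lines through a helper like this makes it easier to spot a pattern that removes too much or too little before the whole corpus is processed.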

Exploratory Analysis

We are now ready to perform exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words in the data. Here we list the most common unigrams, bigrams, and trigrams.

library(RWeka)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

options(mc.cores=1)

# Sum each term's counts across documents and sort by decreasing frequency
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# Tokenizers that emit only 2-word and only 3-word sequences
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bar chart of the 30 most frequent terms (head() avoids NA rows if fewer exist)
makePlot <- function(data, label) {
  ggplot(head(data, 30), aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = "blue")
}

# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
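To make the content of these frequency tables concrete, the counting can be sketched in base R on two invented sentences. The helper ngrams below is hypothetical and is not how RWeka tokenizes internally; it only illustrates what "counting bigrams" means.

```r
# Hypothetical helper: all n-grams of one cleaned, space-separated string.
ngrams <- function(text, n) {
  words <- strsplit(text, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Two toy documents, invented for illustration.
toy <- c("the cat sat on the mat", "the cat ran")

# Count every bigram across the documents, most frequent first.
bigram.counts <- sort(table(unlist(lapply(toy, ngrams, n = 2))),
                      decreasing = TRUE)
print(bigram.counts)
```

Here "the cat" occurs twice and every other bigram once, which is the kind of ranking that freq2 holds for the real sample.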

Here is a bar chart of the 30 most common unigrams in the data sample.

makePlot(freq1, "30 Most Common Unigrams")

Here is a bar chart of the 30 most common bigrams in the data sample.

makePlot(freq2, "30 Most Common Bigrams")

Here is a bar chart of the 30 most common trigrams in the data sample.

makePlot(freq3, "30 Most Common Trigrams")

Next Steps For Prediction Algorithm And Shiny App

This concludes our exploratory analysis. The next steps of this capstone project are to finalize our predictive algorithm and to deploy it as a Shiny app.

Our predictive algorithm will use an n-gram model with frequency lookup, similar to our exploratory analysis above. One possible strategy is to use the trigram model to predict the next word: the last two words typed are matched against the trigram table. If no matching trigram is found, the algorithm backs off to the bigram model (matching only the last word), and then to the most frequent unigrams if needed.
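The back-off strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the tables uni, bi and tri are tiny invented stand-ins with the same columns as the getFreq() output (word, freq), the helpers lookup and predict_next are hypothetical names, and the back-off is a plain frequency lookup with no smoothing or discounting. The argument k stands for the number of suggestions the user would be able to configure.

```r
# Toy frequency tables shaped like getFreq() output; entries are invented.
uni <- data.frame(word = c("the", "day", "you"), freq = c(50, 20, 10),
                  stringsAsFactors = FALSE)
bi  <- data.frame(word = c("happy birthday", "happy new", "new year"),
                  freq = c(9, 7, 6), stringsAsFactors = FALSE)
tri <- data.frame(word = c("happy new year", "a happy new"), freq = c(5, 2),
                  stringsAsFactors = FALSE)

# Hypothetical helper: last word of the k most frequent n-grams in `tab`
# that start with `prefix`.
lookup <- function(tab, prefix, k) {
  hits <- tab[startsWith(tab$word, paste0(prefix, " ")), ]
  hits <- hits[order(-hits$freq), ]
  head(sub(".* ", "", hits$word), k)
}

# Try trigrams first, then bigrams, then fall back to the top unigrams.
predict_next <- function(phrase, k = 1) {
  words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  n <- length(words)
  if (n >= 2) {
    out <- lookup(tri, paste(words[(n - 1):n], collapse = " "), k)
    if (length(out) > 0) return(out)
  }
  if (n >= 1) {
    out <- lookup(bi, words[n], k)
    if (length(out) > 0) return(out)
  }
  head(uni$word[order(-uni$freq)], k)
}

print(predict_next("Happy new"))         # found in the trigram table
print(predict_next("so happy", k = 2))   # backs off to the bigram table
print(predict_next("zzz"))               # backs off to the unigram table
```

With the real tables, a linear scan with startsWith would likely be too slow for an interactive app; splitting each n-gram into a prefix column and a next-word column ahead of time, and indexing on the prefix, is one option to consider.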

The user interface of the Shiny app will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word after a short delay. Our plan is also to allow the user to configure how many words our app should suggest.