Instructions

The goal of this project is to build a simple model of the relationship between words.

We will start with simple models and move on to more sophisticated modeling techniques later.

To begin this project, we were given corpora in four different languages (DE, US, FI, RU). Each corpus contains samples from blogs, news, and tweets. We will focus on the English-language corpus.

Tasks to accomplish

  1. Build a basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

  2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
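
The two tasks can be illustrated with a minimal sketch in base R. It counts n-grams in a small token vector and, when the longest context has never been seen, backs off to a shorter one and finally to the most frequent single word. This is a plain count-based backoff chosen for illustration, with no smoothing or discounting; the function names make_ngrams and predict_next and the toy tokens are assumptions, not part of the project code.

```r
# Hypothetical helper: build all n-grams of size n from a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: predict the next word from up to (max_n - 1) previous
# words, backing off to shorter contexts when no matching n-gram was observed.
predict_next <- function(tokens, history, max_n = 3) {
  n <- min(max_n, length(history) + 1)
  while (n >= 2) {
    context <- paste(tail(history, n - 1), collapse = " ")
    grams <- make_ngrams(tokens, n)
    # Keep the n-grams whose first n - 1 words equal the context.
    hits <- grams[startsWith(grams, paste0(context, " "))]
    if (length(hits) > 0) {
      counts <- sort(table(hits), decreasing = TRUE)
      best <- names(counts)[1]
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
    n <- n - 1  # unseen context: back off to a shorter one
  }
  # Nothing matched at all: fall back to the most frequent single word.
  names(sort(table(tokens), decreasing = TRUE))[1]
}

# Toy data, made up for illustration only.
toy_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")
predict_next(toy_tokens, c("on", "the"))      # trigram context is observed
predict_next(toy_tokens, c("purple", "the"))  # unseen trigram, backs off to bigram
```

Recomputing the n-grams on every call keeps the sketch short; a real model would precompute the frequency tables once and look contexts up in them.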

Initial data summary

  1. Download from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
#load the libraries 
library(tm);
## Loading required package: NLP
library(stringi);
## Warning: package 'stringi' was built under R version 3.3.3
library(RWeka);
## Warning: package 'RWeka' was built under R version 3.3.3
library(ggplot2);
## Warning: package 'ggplot2' was built under R version 3.3.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

Length of the Blog file

con = file("en_US.blogs.txt", open = "r")
BlogLines = readLines(con)
BlogLength = length((BlogLines))
close(con)
BlogLength
## [1] 899288

Length of the News file

con = file("en_US.news.txt", open = "r")
newsLines = readLines(con)
## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
newsLength = length((newsLines))
close(con)
newsLength
## [1] 77259

Length of the Twitter file

con = file("en_US.twitter.txt", open = "r")
twitterLines = readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
tweetsLength = length((twitterLines))
close(con)
tweetsLength
## [1] 2360148
  1. Create a basic report of summary statistics about the data sets.
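
One way such a report might be assembled is sketched below in base R. The helper summarise_lines is hypothetical, and the word count is only an approximation obtained by splitting on whitespace. The demo input is made up; the commented call at the end shows how the helper could be applied to the BlogLines, newsLines and twitterLines vectors read above.

```r
# Hypothetical helper: one row of summary statistics for a character vector.
summarise_lines <- function(lines, name) {
  # Split on runs of whitespace to approximate the number of words per line.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    source = name,
    lines = length(lines),
    words = sum(words_per_line),
    max_chars = max(nchar(lines)),
    stringsAsFactors = FALSE
  )
}

# Made-up input for illustration.
demo_summary <- summarise_lines(c("hello world", "one two three", ""), "demo")
demo_summary

# For the real data, the rows could be stacked into a single table:
# rbind(summarise_lines(BlogLines, "blogs"),
#       summarise_lines(newsLines, "news"),
#       summarise_lines(twitterLines, "twitter"))
```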

Data Cleaning:

Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
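
A file-based tokenizer of the kind described could look like the following sketch, written in base R so that it does not depend on tm or RWeka. The function name tokenize_file and the token pattern are assumptions: words (with apostrophes), runs of digits, and single punctuation marks each become one token. Passing skipNul = TRUE to readLines skips embedded nul characters such as those reported for the Twitter file.

```r
# Hypothetical helper: read a text file and return a character vector of tokens.
tokenize_file <- function(path) {
  lines <- readLines(path, warn = FALSE, skipNul = TRUE, encoding = "UTF-8")
  text <- tolower(lines)
  # A token is a word (letters and apostrophes), a number, or one punctuation mark.
  pattern <- "[[:alpha:]']+|[[:digit:]]+|[[:punct:]]"
  unlist(regmatches(text, gregexpr(pattern, text)))
}

# Usage on a small temporary file with made-up content.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello, world! It costs 42 dollars.", "Don't stop"), tmp)
tokens <- tokenize_file(tmp)
tokens
```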

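#Convert to ASCII; with sub = "byte", non-ASCII bytes are replaced by their hex codes (e.g. <e2>) rather than dropped
cleanedTwitter <- iconv(twitterLines, from = 'UTF-8', to = 'ASCII', sub = "byte")

#Sample 1000 tweets
twitterSample <- sample(cleanedTwitter, 1000)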
doc.vec <- VectorSource(twitterSample)                      
doc.corpus <- Corpus(doc.vec)

#Convert to lower case (content_transformer keeps the result a valid tm document)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

#Remove all punctuation
doc.corpus<- tm_map(doc.corpus, removePunctuation)

#Remove all numbers 
doc.corpus<- tm_map(doc.corpus, removeNumbers)

#Collapse repeated whitespace into a single space
doc.corpus <- tm_map(doc.corpus, stripWhitespace)

Profanity filtering - removing profanity and other words you do not want to predict.

##Stop words
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
##Force everything back to plaintext document
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
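
The code above removes English stop words; it does not yet filter profanity. A minimal sketch of the profanity step in base R follows. The banned_words vector is a hypothetical placeholder (a real list would be loaded from a file), the helper name remove_banned is an assumption, and the pattern assumes the listed words contain no regular-expression metacharacters.

```r
# Hypothetical placeholder list; a real profanity list would be read from a file.
banned_words <- c("darn", "heck")

# Hypothetical helper: delete whole-word matches of the banned words.
remove_banned <- function(text, banned) {
  # \\b restricts matches to whole words, so "heck" does not alter "checker".
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", text, ignore.case = TRUE, perl = TRUE)
  # Collapse the gaps left behind by removed words.
  trimws(gsub("\\s+", " ", cleaned))
}

filtered <- remove_banned("Well heck that darn checker", banned_words)
filtered

# On the tm corpus, the same list could be applied like the stop word step:
# doc.corpus <- tm_map(doc.corpus, removeWords, banned_words)
```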

Exploratory Data Analysis

# n-gram Modeling

# bigram
bigram <- function(x) 
             NGramTokenizer(x, Weka_control(min = 2, max = 2))

# trigram
trigram <- function(x) 
             NGramTokenizer(x, Weka_control(min = 3, max = 3))

get_word_Freq <- function(tdm) {
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    return(data.frame(word = names(freq), freq = freq))
}

# n-grams analysis of sample data 
corpus_unigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus), 0.9999))
corpus_bigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = bigram)), 0.9999))
corpus_trigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = trigram)), 0.9999))

Plots - Histogram

Twitter word count plot:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
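
The plot itself is not reproduced in this text. As a rough sketch, a word count chart could be drawn from a frequency table shaped like the output of get_word_Freq (columns word and freq). The helper top_terms and the example counts below are made up for illustration, and the plotting step only runs if ggplot2 is installed.

```r
# Hypothetical helper: keep the n most frequent terms, ordered for plotting.
top_terms <- function(freq_df, n = 10) {
  top <- head(freq_df[order(-freq_df$freq), ], n)
  # Fix the factor levels so bars appear from most to least frequent.
  top$word <- factor(as.character(top$word), levels = as.character(top$word))
  top
}

# Made-up counts standing in for corpus_unigram.
example_freq <- data.frame(
  word = c("love", "day", "good", "time"),
  freq = c(12, 30, 21, 7),
  stringsAsFactors = FALSE
)
top <- top_terms(example_freq, n = 3)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top, ggplot2::aes(x = word, y = freq)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::labs(x = "Word", y = "Frequency", title = "Most frequent words")
  print(p)
}
```

With the real data, top_terms(corpus_unigram) would take the place of the made-up table.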

  1. Report any interesting findings that you amassed so far.

  2. Get feedback on your plans for creating a prediction algorithm and Shiny app.