Text Prediction-Milestone Report

Summary

The main goal of this project is to create a Shiny app that takes a word, phrase, or sentence as input and predicts the next word in the sequence. This report summarizes the training dataset, which can be downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). The files have been language-filtered but may still contain some foreign text. To compute word probabilities, we will count the words in the training corpus.

Data Acquisition and Cleaning

Loading the data and summarizing

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Loading the required packages

suppressWarnings(library(NLP))
suppressWarnings(library(openNLP))
suppressWarnings(library(tm))
suppressWarnings(library(ggplot2))

## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

suppressWarnings(library(wordcloud))

## Loading required package: RColorBrewer

suppressWarnings(library(RWeka))
suppressWarnings(library(qdap))

## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## 
## Attaching package: 'qdapRegex'
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%
## 
## Loading required package: qdapTools
## WARNING: Rtools is required to build R packages, but is not currently installed.
## 
## Please download and install Rtools 3.2 from http://cran.r-project.org/bin/windows/Rtools/ and then run find_rtools().
## 
## Attaching package: 'qdap'
## 
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## 
## The following object is masked from 'package:NLP':
## 
##     ngrams
## 
## The following object is masked from 'package:base':
## 
##     Filter

Reading the 3 text files and summarizing file details

blogs<-readLines("final/en_US/en_US.blogs.txt")
news<-readLines("final/en_US/en_US.news.txt")

## Warning in readLines("final/en_US/en_US.news.txt"): incomplete final line
## found on 'final/en_US/en_US.news.txt'

twitter<-readLines("final/en_US/en_US.twitter.txt")

## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears
## to contain an embedded nul

## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears
## to contain an embedded nul

## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears
## to contain an embedded nul

## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears
## to contain an embedded nul
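
The warnings above point to two read problems: a possibly truncated read of the news file and embedded nul bytes in the twitter file. A common workaround is to read through a binary-mode connection and skip nul bytes. The following is a minimal sketch under that assumption; `read_lines_safely` is a hypothetical helper name, not part of the original code.

```r
# Hypothetical helper (not in the original report): read a text file through a
# binary-mode connection so that control characters do not end the read early,
# and drop embedded nul bytes instead of warning about them.
read_lines_safely <- function(path, n = -1L) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage, guarded so the sketch also runs where the data set is absent.
news_path <- "final/en_US/en_US.news.txt"
if (file.exists(news_path)) {
  news <- read_lines_safely(news_path)
  print(length(news))
}
```

If the line count for the news file changes when read this way, the summary table should be recomputed from the new vectors.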

###Summary
size<-file.info(c("final/en_US/en_US.blogs.txt","final/en_US/en_US.news.txt","final/en_US/en_US.twitter.txt"))$size
lines<-c(length(blogs),length(news),length(twitter))
summary<-rbind(size,lines)
rownames(summary)<-c("File Size(in bytes)","No.of Lines")
colnames(summary)<-c("blogs","news","twitter")
summary

##                         blogs      news   twitter
## File Size(in bytes) 210160014 205811889 167105338
## No.of Lines            899288     77259   2360148

To keep the code fast for the remaining analysis, we work with a subset of the data: only the first 3000 lines of each file.

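# Read the first 3000 lines of each file; passing the path (rather than an
# open connection) lets readLines open and close the file itself
subBlogs <- readLines("final/en_US/en_US.blogs.txt", n = 3000)
subNews <- readLines("final/en_US/en_US.news.txt", n = 3000)
subTwitter <- readLines("final/en_US/en_US.twitter.txt", n = 3000)
# Combine the three samples into one character vector; c() keeps each line
# separate, whereas paste() would glue unrelated lines together element-wise
subdata <- c(subBlogs, subNews, subTwitter)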
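
Taking the first 3000 lines is fast, but the head of a file is not necessarily representative of the whole. As an alternative, here is a minimal sketch of a reproducible random sample. It assumes the full character vector is already in memory; `sample_lines` and the seed value are hypothetical choices, not part of the original analysis.

```r
# Hypothetical helper: draw a reproducible random sample of up to n lines.
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)
  x[sample.int(length(x), size = min(n, length(x)))]
}

# Toy input standing in for one of the full data sets (e.g. blogs).
toy <- paste("line", 1:10)
sample_lines(toy, 3)
```

Fixing the seed makes the subset, and therefore every downstream frequency table, repeatable across runs.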

The sent_detect function (from the qdap package) returns a character vector of sentences, split on end marks.

subdata <- sent_detect(subdata, language = "en", model = NULL)
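
To illustrate what splitting on end marks means, here is a rough base R approximation. It is only a sketch of the idea and is not how qdap implements sent_detect; `split_sentences` is a hypothetical name, and the end-mark set (., ?, !) is an assumption.

```r
# Hypothetical helper: split text into sentences after ., ? or !
# (a simplification; abbreviations such as "Dr." will be split too).
split_sentences <- function(x) {
  # Split on whitespace that directly follows an end mark.
  pieces <- unlist(strsplit(x, "(?<=[.?!])\\s+", perl = TRUE))
  pieces <- trimws(pieces)
  pieces[nzchar(pieces)]
}

split_sentences(c("I like tea. Do you? Yes!", "One line without end mark"))
```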

Tokenization and Profanity Filtering

We build a corpus from the character vector of sentences and apply several text transformations: converting the text to lower case and removing URLs, punctuation, numbers, and extra whitespace.

myCorpus <- Corpus(VectorSource(subdata))
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
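# remove URLs (everything from "http" up to the next whitespace, so that
# fragments such as "://example.com" are not left behind)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove punctuation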
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)

# Create a document-term matrix from the corpus
dtm <- DocumentTermMatrix(myCorpus)
# Total frequency of each term across all documents
freq <- colSums(as.matrix(dtm))
wf <- data.frame(word = names(freq), freq = freq)
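
The section title mentions profanity filtering, which would normally run before the document-term matrix is built. A minimal base R sketch of such a filter follows. The word list is a harmless hypothetical stand-in, since a real filter would load a curated list from a file, and `remove_words` is an illustrative name.

```r
# Hypothetical word list; replace with a curated profanity list.
bad_words <- c("darn", "heck")

# Remove whole-word matches (case-insensitive) and tidy the whitespace.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

remove_words("Well darn that was a heck of a game", bad_words)
```

Within the tm pipeline, the corresponding step would be `tm_map(myCorpus, removeWords, bad_words)`, placed alongside the other transformations.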

We use the RWeka package to extract unigrams, bigrams, and trigrams from the dataset and plot their frequencies.

### Creating Tokens
Unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
Bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

### Creating  Document Term Matrix
Uni.dtm <-DocumentTermMatrix(myCorpus, control = list(tokenize = Unigram))
Bi.dtm<- DocumentTermMatrix(myCorpus, control = list(tokenize = Bigram)) 
Tri.dtm <- DocumentTermMatrix(myCorpus, control = list(tokenize = Trigram))

## Creating data frames with the n-grams and their corresponding frequencies
Unifreq <- colSums(as.matrix(Uni.dtm))
Bifreq<-colSums(as.matrix(Bi.dtm))
Trifreq<-colSums(as.matrix(Tri.dtm))
Uniwf <- data.frame(word =names(Unifreq), freq=Unifreq)
Biwf <- data.frame(word =names(Bifreq), freq=Bifreq)
Triwf <- data.frame(word =names(Trifreq), freq=Trifreq)
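
To make the counting step concrete, here is a small base R sketch that produces the same kind of n-gram frequency table for a single toy string. It does not use RWeka or tm; `count_ngrams` is a hypothetical helper, and the input is assumed to be already cleaned and separated by whitespace.

```r
# Hypothetical helper: count the n-grams in one whitespace-separated string.
count_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Build each n-gram by joining n consecutive tokens.
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

toy <- "the cat sat on the mat the cat ran"
count_ngrams(toy, 2)
```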

Plots of the most frequent unigrams, bigrams, and trigrams occurring in the sample dataset.
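
The plotting code is not shown in this report. As a hedged sketch, a frequency plot can be produced from a data frame shaped like Uniwf, Biwf, or Triwf (columns word and freq) along the following lines. The toy data frame and `top_terms` are hypothetical, and base graphics are used here only to keep the sketch free of dependencies.

```r
# Hypothetical helper: keep the n most frequent rows of a word/freq data frame.
top_terms <- function(wf, n = 10) {
  wf <- wf[order(wf$freq, decreasing = TRUE), ]
  head(wf, n)
}

# Toy data standing in for Uniwf, Biwf, or Triwf.
toy_wf <- data.frame(
  word = c("the", "of the", "one of the", "cat"),
  freq = c(50, 20, 5, 12),
  stringsAsFactors = FALSE
)

top <- top_terms(toy_wf, 3)
barplot(top$freq, names.arg = top$word, las = 2,
        ylab = "Frequency", main = "Most frequent terms (toy data)")
```

The same `top` data frame could instead be passed to ggplot2, which is already loaded in this report.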

Further Steps

Modelling and prediction will include the use of:
1) Markov chain, where it is assumed that the probability of a word depends only on the previous word (the bigram case). We can generalize the bigram to the trigram and to the N-gram.

2) Back-off model, which handles unseen events: if a particular N-gram is not observed in the training corpus, it tells us how to assign probabilities to these unseen events.
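
The two ideas above can be sketched together on toy data. Under the bigram assumption, the estimated probability of a word w after a previous word p is count(p w) / count(p), so the most likely next word is the one with the highest bigram count. The sketch below uses a deliberately simplified back-off: when no bigram starts with the previous word, it falls back to the most frequent unigram. It is not a full Katz back-off with discounting, and `build_model` and `predict_next` are hypothetical names.

```r
# Hypothetical helper: unigram and bigram counts from a token vector.
build_model <- function(tokens) {
  list(
    unigrams = table(tokens),
    bigrams  = table(paste(head(tokens, -1), tail(tokens, -1)))
  )
}

# Hypothetical helper: predict the next word from the previous word.
predict_next <- function(model, prev) {
  prefix <- paste0(prev, " ")
  # Bigrams whose first word is `prev`.
  hits <- model$bigrams[startsWith(names(model$bigrams), prefix)]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(sub(prefix, "", best, fixed = TRUE))
  }
  # Back off: no bigram observed, so use the most frequent unigram.
  names(model$unigrams)[which.max(model$unigrams)]
}

tokens <- strsplit("the cat sat on the mat the cat ran", " ")[[1]]
model <- build_model(tokens)
predict_next(model, "the")    # seen context: uses bigram counts
predict_next(model, "zebra")  # unseen context: backs off to unigrams
```

A trigram version would first look up the last two words, then back off to this bigram step, and finally to the unigram step.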