The main goal of this project is to create a Shiny app that takes a word, phrase or sentence as input and predicts the next word in the sequence. This report summarizes the training dataset, which can be downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). The files have been language-filtered but may still contain some foreign text. To estimate word probabilities, we count how often words occur in the training corpus.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Loading the required packages
suppressWarnings(library(NLP))
suppressWarnings(library(openNLP))
suppressWarnings(library(tm))
suppressWarnings(library(ggplot2))
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
suppressWarnings(library(wordcloud))
## Loading required package: RColorBrewer
suppressWarnings(library(RWeka))
suppressWarnings(library(qdap))
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
##
## The following object is masked from 'package:ggplot2':
##
## %+%
##
## Loading required package: qdapTools
##
## Attaching package: 'qdap'
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
##
## The following object is masked from 'package:NLP':
##
## ngrams
##
## The following object is masked from 'package:base':
##
## Filter
Reading the 3 text files and summarizing file details
blogs<-readLines("final/en_US/en_US.blogs.txt")
news<-readLines("final/en_US/en_US.news.txt")
## Warning in readLines("final/en_US/en_US.news.txt"): incomplete final line
## found on 'final/en_US/en_US.news.txt'
twitter<-readLines("final/en_US/en_US.twitter.txt")
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears
## to contain an embedded nul
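The warnings above suggest that the files contain control characters that text-mode reading handles poorly. A minimal sketch of a more defensive reader follows, assuming (this report does not confirm it) that opening the connection in binary mode and passing skipNul = TRUE is acceptable for this data. read_text_file is a hypothetical helper name, and the small temporary file exists only to make the example self-contained.

```r
# Hypothetical helper: read a text file through a binary-mode connection
# and drop embedded nul bytes instead of warning about them.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demo: a temporary file containing an embedded nul (0x00)
# and a SUB control character (0x1a).
demo_path <- tempfile(fileext = ".txt")
demo_bytes <- c(charToRaw("first line\n"),
                charToRaw("sec"), as.raw(0x00), charToRaw("ond line\n"),
                charToRaw("third"), as.raw(0x1a), charToRaw(" line\n"))
writeBin(demo_bytes, demo_path)

demo_lines <- read_text_file(demo_path)
length(demo_lines)  # all three lines are read
demo_lines[2]       # the nul byte has been skipped
unlink(demo_path)
```

On Windows, an "incomplete final line" warning may indicate that reading stopped early at such a control character, so the line count of the news file is worth re-checking after switching to binary mode.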
### Summary
size<-file.info(c("final/en_US/en_US.blogs.txt","final/en_US/en_US.news.txt","final/en_US/en_US.twitter.txt"))$size
lines<-c(length(blogs),length(news),length(twitter))
summary<-rbind(size,lines)
rownames(summary)<-c("File Size(in bytes)","No.of Lines")
colnames(summary)<-c("blogs","news","twitter")
summary
## blogs news twitter
## File Size(in bytes) 210160014 205811889 167105338
## No.of Lines 899288 77259 2360148
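Line counts alone say little about how much text each file holds. A sketch of a per-source summary that also counts whitespace-separated words is shown here. text_stats is a hypothetical helper, and the short toy vectors stand in for the real blogs, news and twitter objects.

```r
# Hypothetical helper: basic size statistics for a character vector of lines.
text_stats <- function(x) {
  # Number of whitespace-separated tokens on each line
  words_per_line <- lengths(strsplit(trimws(x), "\\s+"))
  c(lines = length(x),
    words = sum(words_per_line),
    max_chars = max(nchar(x)))
}

# Toy stand-ins for the real data
toy_blogs <- c("This is a short blog post.", "Another entry here.")
toy_tweets <- c("hello world", "so tired today", "lol")

stats <- rbind(blogs = text_stats(toy_blogs),
               twitter = text_stats(toy_tweets))
stats
```

The same function could be applied to blogs, news and twitter to extend the summary table with word counts.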
To make the subsequent analysis run faster, we work with a subset of the data: only the first 3000 lines of each file.
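# Read the first 3000 lines of each file (same "final/en_US" directory as above)
subBlogs <- readLines("final/en_US/en_US.blogs.txt", n = 3000)
subNews <- readLines("final/en_US/en_US.news.txt", n = 3000)
subTwitter <- readLines("final/en_US/en_US.twitter.txt", n = 3000)
# Combine into one character vector; paste() would glue unrelated lines together element-wise
subdata <- c(subBlogs, subNews, subTwitter)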
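Taking the first 3000 lines is fast, but the start of a file is not necessarily representative of the whole. An alternative sketch, which is not what this report does, draws a reproducible random sample instead. sample_lines is a hypothetical helper, the seed value is arbitrary, and the approach assumes the full files are already in memory, as blogs, news and twitter are above.

```r
# Hypothetical helper: draw up to n lines at random, reproducibly.
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)
  x[sample.int(length(x), size = min(n, length(x)))]
}

# Toy stand-in for one of the full text vectors
toy_lines <- paste("line", 1:100)
picked <- sample_lines(toy_lines, 10)
length(picked)
```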
The sent_detect function from the qdap package splits the text at sentence end marks and returns a character vector of sentences.
subdata <- sent_detect(subdata, language = "en", model = NULL)
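To illustrate what splitting on end marks means, here is a rough base R approximation. It is not qdap's implementation and handles abbreviations and other edge cases naively; split_sentences is a hypothetical name.

```r
# Hypothetical helper: naive sentence splitter.
split_sentences <- function(x) {
  # Split at whitespace that directly follows ".", "?" or "!"
  pieces <- strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)
  trimws(unlist(pieces))
}

sentences <- split_sentences(c("I like R. Do you? Yes!", "One more line."))
sentences
```

Each input string may yield several sentences, so the result is usually longer than the input vector.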
We build a corpus, specifying a character vector as the source. We then apply several text transformations: converting the text to lower case and removing URLs, punctuation, numbers and extra whitespace.
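# VCorpus rather than Corpus: recent tm versions may build a SimpleCorpus here, which ignores custom tokenizers
myCorpus <- VCorpus(VectorSource(subdata))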
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove URLs (everything from "http" up to the next whitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# Create a document-term matrix from the corpus
dtm<-DocumentTermMatrix(myCorpus)
freq <- colSums(as.matrix(dtm))
wf <- data.frame(word =names(freq), freq=freq)
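The effect of the cleaning steps above can be checked on a single string. The sketch below mirrors them with base R calls rather than tm. clean_text is a hypothetical helper, the regular expressions are approximations of the tm transformations, and the URL pattern (everything from "http" to the next whitespace) is an assumption about the intended behaviour.

```r
# Hypothetical helper: base R approximation of the cleaning pipeline.
clean_text <- function(x) {
  x <- tolower(x)                        # lower case
  x <- gsub("http[^[:space:]]*", "", x)  # drop URLs
  x <- gsub("[[:punct:]]+", "", x)       # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)       # drop numbers
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # trim leading/trailing space
}

cleaned <- clean_text("Check THIS out: http://t.co/abc123 , 3 times!!")
cleaned
# "check this out times"
```

Removing the URL before the punctuation matters: otherwise the URL's letters would survive as a meaningless token.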
We use the RWeka package to extract unigrams, bigrams and trigrams from the dataset and plot the frequencies of these n-grams.
### Creating Tokens
Unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
Bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
### Creating Document Term Matrix
Uni.dtm <-DocumentTermMatrix(myCorpus, control = list(tokenize = Unigram))
Bi.dtm<- DocumentTermMatrix(myCorpus, control = list(tokenize = Bigram))
Tri.dtm <- DocumentTermMatrix(myCorpus, control = list(tokenize = Trigram))
### Creating data frames with the n-grams and their corresponding frequencies
Unifreq <- colSums(as.matrix(Uni.dtm))
Bifreq<-colSums(as.matrix(Bi.dtm))
Trifreq<-colSums(as.matrix(Tri.dtm))
Uniwf <- data.frame(word =names(Unifreq), freq=Unifreq)
Biwf <- data.frame(word =names(Bifreq), freq=Bifreq)
Triwf <- data.frame(word =names(Trifreq), freq=Trifreq)
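For readers without RWeka (which needs Java), the idea behind the tokenizers can be sketched in base R: slide a window of n words over each sentence and count the resulting strings. make_ngrams and ngram_freq are hypothetical helpers, and the two toy sentences stand in for the cleaned corpus.

```r
# Hypothetical helper: all n-grams of one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: n-gram frequency table, most frequent first.
ngram_freq <- function(sentences, n) {
  token_list <- strsplit(trimws(sentences), "\\s+")
  grams <- unlist(lapply(token_list, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(word = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy <- c("i love new york", "i love new ideas")
bi <- ngram_freq(toy, 2)
bi
```

Note that tm's DocumentTermMatrix by default keeps only terms of at least three characters (control option wordLengths = c(3, Inf)), so very short unigrams may be missing from Uni.dtm unless that option is changed.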
Plots of the most frequent unigrams, bigrams and trigrams occurring in the sample dataset.
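The plots themselves are not reproduced in this text. A sketch of how such a plot could be drawn from a word/frequency data frame like Uniwf follows, using base graphics to stay dependency-free (the report loads ggplot2, which would work equally well). plot_top_terms is a hypothetical helper and toy_wf is made-up illustration data, not a result from the corpus.

```r
# Hypothetical helper: bar plot of the n most frequent terms.
plot_top_terms <- function(wf, n = 10, main = "Most frequent terms") {
  top <- head(wf[order(-wf$freq), ], n)
  barplot(top$freq, names.arg = as.character(top$word), las = 2,
          main = main, ylab = "Frequency")
  invisible(top)  # return the plotted rows for inspection
}

# Made-up data for illustration only
toy_wf <- data.frame(word = c("the", "said", "will", "one", "just"),
                     freq = c(50, 12, 30, 25, 8),
                     stringsAsFactors = FALSE)
top3 <- plot_top_terms(toy_wf, n = 3, main = "Top 3 toy unigrams")
top3$word
```

Calling the helper once each on Uniwf, Biwf and Triwf would give one plot per n-gram order.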
Modelling and prediction will include the use of
1) A Markov chain, in which the probability of a word is assumed to depend only on the previous word (a bigram model). The bigram model can be generalized to the trigram and, more generally, to the N-gram, where a word depends on the previous N-1 words.
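Under the Markov assumption, the probability of the next word can be estimated from counts: P(w_i | w_{i-1}) is the count of the bigram (w_{i-1} w_i) divided by the number of bigrams that start with w_{i-1}. A minimal sketch of this maximum-likelihood bigram predictor in base R follows. The function names and toy sentences are illustrative assumptions, and no smoothing or back-off is applied, so a word never seen as a bigram start returns no prediction.

```r
# Hypothetical helper: estimate bigram probabilities from sentences.
build_bigram_model <- function(sentences) {
  token_list <- strsplit(trimws(tolower(sentences)), "\\s+")
  # One row per observed (previous word, next word) pair
  pairs <- do.call(rbind, lapply(token_list, function(tok) {
    if (length(tok) < 2) return(NULL)
    data.frame(prev = head(tok, -1), nxt = tail(tok, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- as.data.frame(table(prev = pairs$prev, nxt = pairs$nxt),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  # P(next | prev) = count(prev next) / count(bigrams starting with prev)
  counts$prob <- counts$Freq / ave(counts$Freq, counts$prev, FUN = sum)
  counts
}

# Hypothetical helper: most probable next word, or NA if prev is unseen.
predict_next <- function(model, word) {
  cand <- model[model$prev == tolower(word), ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$nxt[which.max(cand$prob)]
}

toy <- c("i love new york", "i love new ideas", "i like new york")
model <- build_bigram_model(toy)
predict_next(model, "new")  # "york" follows "new" in 2 of 3 cases
predict_next(model, "i")    # "love" follows "i" in 2 of 3 cases
```

A trigram version would condition on the previous two words in the same way, falling back to the bigram table when the two-word context has not been seen.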