The goal of this analysis is to show that I have become familiar with the data and that I am on track to build the prediction algorithm.
In this application the user will provide a word or a phrase and the application will try to predict the next word.
The training dataset was provided in Week 1, Task 0. I have downloaded the data from the provided location: Capstone Data.
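For completeness, the download step can be scripted as in the sketch below. This is illustrative only: the URL is the course-provided link as I recall it, and the check simply skips the download if the files are already in place.
# Sketch of the download step (assumes the standard Coursera SwiftKey link).
if (!dir.exists("Coursera-SwiftKey/final/en_US")) {
  zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  zipFile <- "Coursera-SwiftKey.zip"
  download.file(zipUrl, zipFile, mode = "wb")
  unzip(zipFile, exdir = "Coursera-SwiftKey")
}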
Now that the data has been downloaded, let's load the required libraries and then read the data into the environment.
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Now let's read the files.
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt",encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)
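Before summarizing the content, a quick look at the raw file sizes (in megabytes) gives a sense of scale. This small helper snippet is an addition for context and its output is not shown here.
# Raw file sizes in MB.
files <- file.path("Coursera-SwiftKey/final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
round(file.size(files) / 1024^2, 1)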
To build a better analytical model, let's drill down further into the data and compute basic summary statistics for each file: line counts, character counts, word counts, and words per line.
wordsPerLine <- sapply(list(blogs, news, twitter), function(x)
  summary(stri_count_words(x))[c('Min.', 'Max.', 'Mean')])
rownames(wordsPerLine) <- c('WPL_Min', 'WPL_Max', 'WPL_Mean')
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    wordsPerLine
  ))
)
head(stats)
##   Dataset   Lines     Chars    Words WPL_Min WPL_Max WPL_Mean
## 1   blogs  899288 206824382 37570839       0    6726 41.75108
## 2    news   77259  15639408  2651432       1    1123 34.61779
## 3 twitter 2360148 162096241 30451170       1      47 12.75065
blogs <- iconv(blogs, "latin1", "UTF-8", sub="")
news <- iconv(news, "latin1", "UTF-8", sub="")
twitter <- iconv(twitter, "latin1", "UTF-8", sub="")
set.seed(1000)
sample_data <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
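Since the full corpora are very large, only a 1% sample is used for exploration. As an optional step (not part of the original workflow), the sample size can be checked and the sample saved so later steps do not depend on re-sampling; the file name below is illustrative.
length(sample_data)                         # number of sampled lines
writeLines(sample_data, "sample_data.txt")  # optional: persist the sample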
A corpus is a large and structured set of texts used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language. Let's build a corpus from the sample and clean it: strip URLs, Twitter handles, and special characters, convert the text to lower case, and remove stop words, punctuation, numbers, and extra whitespace.
corpus <- VCorpus(VectorSource(sample_data))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/|@|\\|")                       # slashes, @ signs, pipes
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))             # keeps documents as PlainTextDocuments
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpusDf <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
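To verify that the cleaning worked as intended, a few cleaned documents can be spot-checked (output omitted here).
head(corpusDf$text, 3)  # a few cleaned lines from the sample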
findNGrams <- function(corp, grams) {
  ngram  <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                           delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram2 <- data.frame(table(ngram))
  # keep only the 100 most frequent n-grams
  ngram3 <- head(ngram2[order(ngram2$Freq, decreasing = TRUE), ], 100)
  colnames(ngram3) <- c("String", "Count")
  ngram3
}
twoGram   <- findNGrams(corpusDf$text, 2)
threeGram <- findNGrams(corpusDf$text, 3)
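Before plotting, the most frequent bigrams and trigrams can be inspected directly (output omitted here).
head(twoGram, 5)    # most frequent 2-grams
head(threeGram, 5)  # most frequent 3-grams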
par(mfrow=c(1,2))
barplot(twoGram[1:20,2], cex.names=0.5, names.arg=twoGram[1:20,1], col="red", main="2-Grams", las=2)
barplot(threeGram[1:20,2], cex.names=0.5, names.arg=threeGram[1:20,1], col="green", main="3-Grams", las=2)
In this report we have demonstrated that the files were read, sampled, cleaned, and summarized with simple n-gram frequencies. The next step will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above, and then to wrap it in a Shiny app that suggests the most likely next word after a phrase is typed.
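As a rough illustration of how such a frequency lookup might work, the sketch below uses the small (top 100 only) tables computed above. The helper name predictNext is hypothetical and this is not the final algorithm: it takes the last one or two words of the phrase, looks for the most frequent matching trigram, and backs off to bigrams when nothing matches.
# Rough sketch of a next-word lookup using the frequency tables above.
predictNext <- function(phrase, twoGram, threeGram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- threeGram[grepl(paste0("^", prefix, " "), threeGram$String), ]
    if (nrow(hits) > 0)
      return(sub(".* ", "", as.character(hits$String[1])))  # last word of top 3-gram
  }
  hits <- twoGram[grepl(paste0("^", words[n], " "), twoGram$String), ]
  if (nrow(hits) > 0)
    return(sub(".* ", "", as.character(hits$String[1])))    # last word of top 2-gram
  NA_character_
}
predictNext("happy new", twoGram, threeGram)  # might return "year" if that 3-gram is in the table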