The goal of this analysis is to show that I have become familiar with the data and that I am on track to build the prediction algorithm.
In this application the user will provide a word or a phrase and the application will try to predict the next word.
The training dataset was provided in Week 1, Task 0. I have downloaded the data from the provided location: Capstone Data.
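For completeness, the download step can be scripted as in the sketch below. This is illustrative only: the URL is the course-provided link as I recall it, and the check simply skips the download if the files are already in place.
# Sketch of the download step (assumes the standard Coursera SwiftKey link).
if (!dir.exists("Coursera-SwiftKey/final/en_US")) {
  zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  zipFile <- "Coursera-SwiftKey.zip"
  download.file(zipUrl, zipFile, mode = "wb")
  unzip(zipFile, exdir = "Coursera-SwiftKey")
}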
Now that the data has been downloaded, let's load the required libraries and then read the data into the environment.
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Now let's read the files.
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt",encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)
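Before summarizing the content, a quick look at the raw file sizes (in megabytes) gives a sense of scale. This small helper snippet is an addition for context and its output is not shown here.
# Raw file sizes in MB.
files <- file.path("Coursera-SwiftKey/final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
round(file.size(files) / 1024^2, 1)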
To build a better analytical model, let's drill down further into the data and compute basic summary statistics for each file: line counts, character counts, word counts, and words per line.
wordsPerLine <- sapply(list(blogs, news, twitter), function(x)
  summary(stri_count_words(x))[c('Min.', 'Max.', 'Mean')])
rownames(wordsPerLine) <- c('WPL_Min', 'WPL_Max', 'WPL_Mean')
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    wordsPerLine
  ))
)
head(stats)
##   Dataset   Lines     Chars    Words WPL_Min WPL_Max WPL_Mean
## 1   blogs  899288 206824382 37570839       0    6726 41.75108
## 2    news   77259  15639408  2651432       1    1123 34.61779
## 3 twitter 2360148 162096241 30451170       1      47 12.75065
blogs <- iconv(blogs, "latin1", "UTF-8", sub="")
news <- iconv(news, "latin1", "UTF-8", sub="")
twitter <- iconv(twitter, "latin1", "UTF-8", sub="")
set.seed(1000)
sample_data <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
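Since the full corpora are very large, only a 1% sample is used for exploration. As an optional step (not part of the original workflow), the sample size can be checked and the sample saved so later steps do not depend on re-sampling; the file name below is illustrative.
length(sample_data)                         # number of sampled lines
writeLines(sample_data, "sample_data.txt")  # optional: persist the sample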
A corpus is a large and structured set of texts used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language. Let's build a corpus from the sample and clean it: strip URLs, Twitter handles, and special characters, convert the text to lower case, and remove stop words, punctuation, numbers, and extra whitespace.
corpus <- VCorpus(VectorSource(sample_data))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/|@|\\|")                       # slashes, @ signs, pipes
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))             # keeps documents as PlainTextDocuments
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpusDf <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
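To verify that the cleaning worked as intended, a few cleaned documents can be spot-checked (output omitted here).
head(corpusDf$text, 3)  # a few cleaned lines from the sample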
findNGrams <- function(corp, grams) {
  ngram  <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                           delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram2 <- data.frame(table(ngram))
  # keep only the 100 most frequent n-grams
  ngram3 <- head(ngram2[order(ngram2$Freq, decreasing = TRUE), ], 100)
  colnames(ngram3) <- c("String", "Count")
  ngram3
}
twoGram   <- findNGrams(corpusDf$text, 2)
threeGram <- findNGrams(corpusDf$text, 3)
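Before plotting, the most frequent bigrams and trigrams can be inspected directly (output omitted here).
head(twoGram, 5)    # most frequent 2-grams
head(threeGram, 5)  # most frequent 3-grams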
par(mfrow=c(1,2))
barplot(twoGram[1:20,2], cex.names=0.5, names.arg=twoGram[1:20,1], col="red", main="2-Grams", las=2)
barplot(threeGram[1:20,2], cex.names=0.5, names.arg=threeGram[1:20,1], col="green", main="3-Grams", las=2)
In this report we have demonstrated that the files were read, sampled, cleaned, and summarized with simple n-gram frequencies. The next step will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above, and then to wrap it in a Shiny app that suggests the most likely next word after a phrase is typed.
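As a rough illustration of how such a frequency lookup might work, the sketch below uses the small (top 100 only) tables computed above. The helper name predictNext is hypothetical and this is not the final algorithm: it takes the last one or two words of the phrase, looks for the most frequent matching trigram, and backs off to bigrams when nothing matches.
# Rough sketch of a next-word lookup using the frequency tables above.
predictNext <- function(phrase, twoGram, threeGram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- threeGram[grepl(paste0("^", prefix, " "), threeGram$String), ]
    if (nrow(hits) > 0)
      return(sub(".* ", "", as.character(hits$String[1])))  # last word of top 3-gram
  }
  hits <- twoGram[grepl(paste0("^", words[n], " "), twoGram$String), ]
  if (nrow(hits) > 0)
    return(sub(".* ", "", as.character(hits$String[1])))    # last word of top 2-gram
  NA_character_
}
predictNext("happy new", twoGram, threeGram)  # might return "year" if that 3-gram is in the table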