The objective of this Milestone Report is to analyse text data with natural language processing techniques, examining the structure of a large corpus of text documents. The report explains the exploratory analysis and the goals for the eventual app and predictive algorithm, and presents the major features of the data along with summary statistics.

The data can be downloaded from the following link: "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

The Coursera-SwiftKey data folder contains text data in four languages: German (DE), English (US), Finnish (FI) and Russian (RU). Only the English (US) data will be used for this analysis.
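If the archive is fetched programmatically instead of by hand, a minimal sketch would be:

# Sketch: download and unpack the Coursera-SwiftKey archive (run once).
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}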

Environment Set-up:

First, we set the output language to English and set the working directory.
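A minimal sketch of this set-up (the working-directory path is a placeholder for your own location):

# Sketch: report R messages in English and work from the folder holding the en_US files.
Sys.setenv(LANGUAGE = "en")
setwd("path/to/en_US")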

Loading of Packages:

if (!require("NLP")){
  install.packages("NLP",dependencies = TRUE)
}
## Loading required package: NLP
if (!require("tm")){
  install.packages("tm",dependencies = TRUE)
}
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.4.4
if (!require("dplyr")){
  install.packages("dplyr",dependencies = TRUE)
}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("stringi")){
  install.packages("stringi",dependencies = TRUE)
}
## Loading required package: stringi
## Warning: package 'stringi' was built under R version 3.4.4
if (!require("RWeka")){
  install.packages("RWeka",dependencies = TRUE)
}
## Loading required package: RWeka
## Warning: package 'RWeka' was built under R version 3.4.4
if (!require("wordcloud")){
  install.packages("wordcloud",dependencies = TRUE)
}
## Loading required package: wordcloud
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
if (!require("knitr")){
  install.packages("knitr",dependencies = TRUE)
}
## Loading required package: knitr
if (!require("SnowballC")){
  install.packages("SnowballC",dependencies = TRUE)
}
## Loading required package: SnowballC
if (!require("RColorBrewer")){
  install.packages("RColorBrewer",dependencies = TRUE)
}

Loading the dataset:

The full dataset is very large, so the en_US folder was downloaded separately and saved to the local drive. Because the data files are so big, reading and processing them is computationally expensive.

USblogs <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
USnews <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
UStwitter <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

Summary statistics of the Dataset:

The kable() function from the knitr package is used to generate a simple summary table.

kable(data.frame(
  FileName = c("USblogs", "USnews", "UStwitter"),
  FileSize = sapply(list(USblogs, USnews, UStwitter),
                    function(x) format(object.size(x), "MB")),
  t(sapply(list(USblogs, USnews, UStwitter), stri_stats_general)),
  WordCount = sapply(list(USblogs, USnews, UStwitter), stri_stats_latex)[4, ]
))
FileName    FileSize   Lines     LinesNEmpty   Chars       CharsNWhite   WordCount
USblogs     248.5 Mb    899288       899288    206824382     170389539    37570839
USnews       19.2 Mb     77259        77259     15639408      13072698     2651432
UStwitter   301.4 Mb   2360148      2360148    162096241     134082806    30451170

Data Processing:

Corpus creation (using the tm library) and sampling of the Dataset:

From the summary statistics, we see that the dataset is very large, so we will work with a 10% sample of each file and then create a corpus from the sample.

USblogs_samp <- sample(USblogs,length(USblogs)*.1)
USnews_samp <- sample(USnews, length(USnews)*.1)
UStwitter_samp <- sample(UStwitter,length(UStwitter)*.1)
sample <- list(USblogs_samp,USnews_samp,UStwitter_samp)
sample_corp <- Corpus(VectorSource(sample))

Transformation of Data:

The sample corpus is transformed in the following ways: converting to lower case, stripping extra whitespace, removing punctuation, removing English stopwords, and removing numbers.

sample_corp <- tm_map(sample_corp, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(sample_corp, content_transformer(tolower)):
## transformation drops documents
sample_corp <- tm_map(sample_corp,stripWhitespace)
## Warning in tm_map.SimpleCorpus(sample_corp, stripWhitespace):
## transformation drops documents
sample_corp <- tm_map(sample_corp,removePunctuation)
## Warning in tm_map.SimpleCorpus(sample_corp, removePunctuation):
## transformation drops documents
sample_corp <- tm_map(sample_corp,PlainTextDocument)
## Warning in tm_map.SimpleCorpus(sample_corp, PlainTextDocument):
## transformation drops documents
sample_corp <- tm_map(sample_corp, removeWords, stopwords("english"))
sample_corp <- tm_map(sample_corp,removeNumbers)
sample_corp <- iconv(sample_corp, to = "utf-8", sub = "")
sample_corp <- Corpus(VectorSource(sample_corp))

Exploratory Analysis:

To understand the distribution of words and their relationships within the corpora, we tokenize the sample into N-Gram models. The frequencies of the N-Grams are stored in a Term Document Matrix (TDM). We use the following N-Gram models:
1. UniGram: a continuous sequence of single words.
2. BiGram: a continuous sequence of word pairs.
3. TriGram: a continuous sequence of three consecutive words.
Only the UniGram model is visualised below; a sketch of the BiGram and TriGram tokenizers follows this list.
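A minimal sketch of how the BiGram and TriGram tokenizers could be defined with RWeka, mirroring the UniGram tokenizer used in the next section (the object names tdm2 and tdm3 are illustrative):

# Sketch: bigram and trigram tokenizers and their term-document matrices.
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm2 <- TermDocumentMatrix(sample_corp, control = list(tokenize = BiGram))
tdm3 <- TermDocumentMatrix(sample_corp, control = list(tokenize = TriGram))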

Visualisation of UniGram Model:

UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max =1))
tdm1 <- TermDocumentMatrix(sample_corp,control = list(tokenize = UniGram))
wordMatrix1 <- as.data.frame(as.matrix(tdm1))
v1 <- sort(rowSums(wordMatrix1),decreasing = TRUE)
d1 <- data.frame(word = names(v1), freq = v1)
head(v1,20)
##   just   like   will    one    can    get   time   love   good    now 
##  25088  22464  21710  21522  19162  18814  16566  15155  14975  14525 
##    day   know    new   dont    see people   back  great  think   make 
##  14298  14226  13076  11875  11799  11366  11071  10664  10330   9931

Word cloud for UniGram Model:
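A minimal sketch of how the word cloud could be generated from the unigram frequency table d1 (the min.freq and max.words limits and the palette are illustrative choices):

# Sketch: word cloud of the most frequent unigrams.
set.seed(1234)
wordcloud(words = d1$word, freq = d1$freq, min.freq = 50,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))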

Barplot of word frequencies for UniGram Model:
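A minimal sketch of the bar plot for the 20 most frequent unigrams in d1 (the cut-off of 20 and the colour are illustrative):

# Sketch: bar plot of the top 20 unigram frequencies.
top20 <- d1[1:20, ]
barplot(top20$freq, names.arg = top20$word, las = 2,
        col = "steelblue", main = "Top 20 UniGrams", ylab = "Frequency")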

Conclusion:

There are a number of issues still to be addressed for the predictive model and Shiny application of the Coursera-SwiftKey project:
1. While building the model, the dataset needs to be split into a training set and a test set in a 60:40 ratio (see the sketch after this list).
2. Tokenization needs to be extended to the BiGram and TriGram models as well.
3. Building the actual predictive model will involve extensive use of tokenization and N-Grams.
4. Turning the predictive model into a Shiny application will be an interesting part of the project.
5. The Shiny app should help users without a data-science background to predict the next word.
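A minimal sketch of the 60:40 training/test split mentioned in point 1 (the object names all_text, train_idx, train_set and test_set are illustrative):

# Sketch: 60:40 split of the sampled text into training and test sets.
set.seed(1234)
all_text  <- c(USblogs_samp, USnews_samp, UStwitter_samp)
train_idx <- sample(seq_along(all_text), size = floor(0.6 * length(all_text)))
train_set <- all_text[train_idx]
test_set  <- all_text[-train_idx]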