The objective of this Milestone Report is to analyse text data with natural language processing techniques and to examine the structure of a large corpus of text documents. The report describes the exploratory analysis, outlines the goals for the eventual app and predictive algorithm, and presents the major features of the data along with summary statistics.
The data is downloaded from the following link: “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
The Coursera-SwiftKey data folder contains text data in four different languages: German (DE), English (US), Finnish (FI), and Russian (RU). The English (US) data will be used for this analysis.
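For reproducibility, the archive can also be fetched and extracted programmatically; a minimal sketch, assuming the standard Coursera-SwiftKey layout of a final/ folder with one sub-folder per locale:
# Download and extract the Coursera-SwiftKey archive (run once; the zip is large)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, zip_file, mode = "wb")
  unzip(zip_file)  # assumed to extract into ./final/en_US, ./final/de_DE, etc.
}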
First, we set the output language to English and set the working directory to the folder containing the data.
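A minimal sketch of this setup (the working-directory path is a placeholder for wherever the en_US files are stored):
Sys.setenv(LANGUAGE = "en")               # report R messages in English
setwd("./Coursera-SwiftKey/final/en_US")  # placeholder path to the en_US data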
if (!require("NLP")){
install.packages("NLP",dependencies = TRUE)
}
## Loading required package: NLP
if (!require("tm")){
install.packages("tm",dependencies = TRUE)
}
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.4.4
if (!require("dplyr")){
install.packages("dplyr",dependencies = TRUE)
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require("stringi")){
install.packages("stringi",dependencies = TRUE)
}
## Loading required package: stringi
## Warning: package 'stringi' was built under R version 3.4.4
if (!require("RWeka")){
install.packages("RWeka",dependencies = TRUE)
}
## Loading required package: RWeka
## Warning: package 'RWeka' was built under R version 3.4.4
if (!require("wordcloud")){
install.packages("wordcloud",dependencies = TRUE)
}
## Loading required package: wordcloud
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
if (!require("knitr")){
install.packages("knitr",dependencies = TRUE)
}
## Loading required package: knitr
if (!require("SnowballC")){
install.packages("SnowballC",dependencies = TRUE)
}
## Loading required package: SnowballC
if (!require("RColorBrewer")){
install.packages("RColorBrewer",dependencies = TRUE)
}
The full dataset is quite large to download, so we downloaded the en_US folder separately and saved it to a local drive. Because the data files are large, processing them is computationally expensive.
USblogs <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
USnews <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
UStwitter <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
The kable() function from the knitr package is used to generate a simple summary table.
kable(data.frame(
  FileName = c("USblogs", "USnews", "UStwitter"),
  FileSize = sapply(list(USblogs, USnews, UStwitter),
                    function(x) format(object.size(x), "MB")),
  t(rbind(sapply(list(USblogs, USnews, UStwitter), stri_stats_general),
          WordCount = sapply(list(USblogs, USnews, UStwitter), stri_stats_latex)[4, ]))
))
| FileName | FileSize | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|
| USblogs | 248.5 Mb | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| USnews | 19.2 Mb | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| UStwitter | 301.4 Mb | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
#Data Processing:
#Corpus creation (using tm library) and sampling of Dataset:
From the summary statistics, we realize that the dataset is very large, so we will proceed with a 10% sample of each file. Then we will create a corpus.
# Take a 10% random sample of each file to keep computation manageable
USblogs_samp   <- sample(USblogs, length(USblogs) * 0.1)
USnews_samp    <- sample(USnews, length(USnews) * 0.1)
UStwitter_samp <- sample(UStwitter, length(UStwitter) * 0.1)
# Combine the three samples and build a tm corpus
sample_data <- list(USblogs_samp, USnews_samp, UStwitter_samp)
sample_corp <- Corpus(VectorSource(sample_data))
The sampled dataset is then cleaned with the following transformations: converting to lowercase, stripping extra whitespace, removing punctuation, removing English stopwords, and removing numbers.
sample_corp <- tm_map(sample_corp, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(sample_corp, content_transformer(tolower)):
## transformation drops documents
sample_corp <- tm_map(sample_corp,stripWhitespace)
## Warning in tm_map.SimpleCorpus(sample_corp, stripWhitespace):
## transformation drops documents
sample_corp <- tm_map(sample_corp,removePunctuation)
## Warning in tm_map.SimpleCorpus(sample_corp, removePunctuation):
## transformation drops documents
sample_corp <- tm_map(sample_corp,PlainTextDocument)
## Warning in tm_map.SimpleCorpus(sample_corp, PlainTextDocument):
## transformation drops documents
sample_corp <- tm_map(sample_corp, removeWords, stopwords("english"))
sample_corp <- tm_map(sample_corp, removeNumbers)
# Drop any characters that cannot be represented in UTF-8, then rebuild the corpus
sample_corp <- iconv(sample_corp, to = "utf-8", sub = "")
sample_corp <- Corpus(VectorSource(sample_corp))
To understand the distribution of words and their relationships within the corpora, we tokenize the sample into N-gram models. Each N-gram model is stored as a Term-Document Matrix (TDM), which holds the frequency of every N-gram. We use the following N-gram models:
1. UniGram: continuous sequences of single words.
2. BiGram: continuous sequences of word pairs.
3. TriGram: continuous sequences of three words.
# Tokenizer that splits the corpus into single words (unigrams)
UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm1 <- TermDocumentMatrix(sample_corp, control = list(tokenize = UniGram))
wordMatrix1 <- as.data.frame(as.matrix(tdm1))
v1 <- sort(rowSums(wordMatrix1), decreasing = TRUE)  # total frequency of each unigram
d1 <- data.frame(word = names(v1), freq = v1)
head(v1, 20)
## just like will one can get time love good now
## 25088 22464 21710 21522 19162 18814 16566 15155 14975 14525
## day know new dont see people back great think make
## 14298 14226 13076 11875 11799 11366 11071 10664 10330 9931
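Since the wordcloud package is loaded above, the same unigram frequencies can also be visualised as a word cloud; a minimal sketch using the frequency table d1 (the seed, word limit, and colour palette are arbitrary choices):
# Word cloud of the 100 most frequent unigrams
set.seed(1234)  # arbitrary seed, only for a reproducible layout
wordcloud(words = d1$word, freq = d1$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))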
#Barplot for word frequencies: UniGram model
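As a minimal sketch, the 20 most frequent unigrams from the frequency table d1 can be plotted as follows (the number of words shown, the colour, and the labels are arbitrary choices):
# Barplot of the 20 most frequent unigrams
barplot(d1$freq[1:20], names.arg = as.character(d1$word[1:20]), las = 2,
        col = "steelblue", main = "Top 20 Unigrams", ylab = "Frequency")
The BiGram and TriGram tokenizers described above can be built with the same RWeka/tm pattern as the UniGram model; the sketch below mirrors that code (the object names tdm2, tdm3, v2, v3, d2, and d3 are assumptions, not part of the original analysis):
# Tokenizers for word pairs (bigrams) and word triples (trigrams)
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm2 <- TermDocumentMatrix(sample_corp, control = list(tokenize = BiGram))
tdm3 <- TermDocumentMatrix(sample_corp, control = list(tokenize = TriGram))
# Frequency tables sorted in decreasing order, as for the UniGram model
v2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
v3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
d3 <- data.frame(word = names(v3), freq = v3)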
#Conclusion:
There are a number of issues to be addressed in the predictive model and Shiny application for the Coursera-SwiftKey project:
1. While building the model, the dataset needs to be divided into a training set and a test set in a 60:40 ratio.
2. Tokenization needs to be addressed more intensively with BiGram and TriGram models as well.
3. Building the actual predictive models will involve extensive use of tokenization and N-grams.
4. Turning the final predictive model into a Shiny application will be an interesting part of the project.
5. The Shiny app will be helpful to non-technical users who want to predict the next word.
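As an illustration of point 1, a 60/40 split of one of the sampled files might look like the sketch below (the seed and the object names train_idx, blogs_train, and blogs_test are assumptions, not part of the current analysis):
# Split the sampled blog data into 60% training and 40% test sets
set.seed(123)  # arbitrary seed for reproducibility
train_idx   <- sample(seq_along(USblogs_samp), size = floor(0.6 * length(USblogs_samp)))
blogs_train <- USblogs_samp[train_idx]
blogs_test  <- USblogs_samp[-train_idx]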