The objective of this Milestone Report is to analyse text data with natural language processing techniques and to examine the structure of a large corpus of text documents. The report describes the exploratory analysis, outlines the goals for the eventual app and predictive algorithm, and presents the major features of the data along with summary statistics.
The data is downloaded from the following link: “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
The Coursera-SwiftKey data folder contains text data in four different languages: German (DE), English (US), Finnish (FI), and Russian (RU). The English (US) data will be used for this analysis.
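For reproducibility, the archive can also be fetched and extracted programmatically; a minimal sketch, assuming the standard Coursera-SwiftKey layout of a final/ folder with one sub-folder per locale:
# Download and extract the Coursera-SwiftKey archive (run once; the zip is large)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, zip_file, mode = "wb")
  unzip(zip_file)  # assumed to extract into ./final/en_US, ./final/de_DE, etc.
}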
First, we set the output language to English and set the working directory to the folder containing the data.
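A minimal sketch of this setup (the working-directory path is a placeholder for wherever the en_US files are stored):
Sys.setenv(LANGUAGE = "en")               # report R messages in English
setwd("./Coursera-SwiftKey/final/en_US")  # placeholder path to the en_US data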
if (!require("NLP")){
install.packages("NLP",dependencies = TRUE)
}
## Loading required package: NLP
if (!require("tm")){
install.packages("tm",dependencies = TRUE)
}
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.4.4
if (!require("dplyr")){
install.packages("dplyr",dependencies = TRUE)
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require("stringi")){
install.packages("stringi",dependencies = TRUE)
}
## Loading required package: stringi
## Warning: package 'stringi' was built under R version 3.4.4
if (!require("RWeka")){
install.packages("RWeka",dependencies = TRUE)
}
## Loading required package: RWeka
## Warning: package 'RWeka' was built under R version 3.4.4
if (!require("wordcloud")){
install.packages("wordcloud",dependencies = TRUE)
}
## Loading required package: wordcloud
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
if (!require("knitr")){
install.packages("knitr",dependencies = TRUE)
}
## Loading required package: knitr
if (!require("SnowballC")){
install.packages("SnowballC",dependencies = TRUE)
}
## Loading required package: SnowballC
if (!require("RColorBrewer")){
install.packages("RColorBrewer",dependencies = TRUE)
}
The full dataset is quite large to download, so we downloaded the en_US folder separately and saved it to a local drive. Because the data files are large, processing them is computationally expensive.
USblogs <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
USnews <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
UStwitter <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
The kable() function from the knitr package is used to generate a simple summary table.
kable(data.frame(
  FileName = c("USblogs", "USnews", "UStwitter"),
  FileSize = sapply(list(USblogs, USnews, UStwitter),
                    function(x) format(object.size(x), "MB")),
  t(rbind(sapply(list(USblogs, USnews, UStwitter), stri_stats_general),
          WordCount = sapply(list(USblogs, USnews, UStwitter), stri_stats_latex)[4, ]))
))
| FileName | FileSize | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|
| USblogs | 248.5 Mb | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| USnews | 19.2 Mb | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| UStwitter | 301.4 Mb | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
#Data Processing:
#Corpus creation (using tm library) and sampling of Dataset:
From the summary statistics, we realize that the dataset is very large, so we will proceed with a 10% sample of each file. Then we will create a corpus.
# Take a 10% random sample of each file to keep computation manageable
USblogs_samp   <- sample(USblogs, length(USblogs) * 0.1)
USnews_samp    <- sample(USnews, length(USnews) * 0.1)
UStwitter_samp <- sample(UStwitter, length(UStwitter) * 0.1)
# Combine the three samples and build a tm corpus
sample_data <- list(USblogs_samp, USnews_samp, UStwitter_samp)
sample_corp <- Corpus(VectorSource(sample_data))
The sampled dataset is then cleaned with the following transformations: converting to lowercase, stripping extra whitespace, removing punctuation, removing English stopwords, and removing numbers.
sample_corp <- tm_map(sample_corp, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(sample_corp, content_transformer(tolower)):
## transformation drops documents
sample_corp <- tm_map(sample_corp,stripWhitespace)
## Warning in tm_map.SimpleCorpus(sample_corp, stripWhitespace):
## transformation drops documents
sample_corp <- tm_map(sample_corp,removePunctuation)
## Warning in tm_map.SimpleCorpus(sample_corp, removePunctuation):
## transformation drops documents
sample_corp <- tm_map(sample_corp,PlainTextDocument)
## Warning in tm_map.SimpleCorpus(sample_corp, PlainTextDocument):
## transformation drops documents
sample_corp <- tm_map(sample_corp, removeWords, stopwords("english"))
sample_corp <- tm_map(sample_corp, removeNumbers)
# Drop any characters that cannot be represented in UTF-8, then rebuild the corpus
sample_corp <- iconv(sample_corp, to = "utf-8", sub = "")
sample_corp <- Corpus(VectorSource(sample_corp))
To understand the distribution of words and their relationships within the corpora, we tokenize the sample into N-gram models. Each N-gram model is stored as a Term-Document Matrix (TDM), which holds the frequency of every N-gram. We use the following N-gram models:
1. UniGram: continuous sequences of single words.
2. BiGram: continuous sequences of word pairs.
3. TriGram: continuous sequences of three words.
# Tokenizer that splits the corpus into single words (unigrams)
UniGram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm1 <- TermDocumentMatrix(sample_corp, control = list(tokenize = UniGram))
wordMatrix1 <- as.data.frame(as.matrix(tdm1))
v1 <- sort(rowSums(wordMatrix1), decreasing = TRUE)  # total frequency of each unigram
d1 <- data.frame(word = names(v1), freq = v1)
head(v1, 20)
## just like will one can get time love good now
## 25088 22464 21710 21522 19162 18814 16566 15155 14975 14525
## day know new dont see people back great think make
## 14298 14226 13076 11875 11799 11366 11071 10664 10330 9931
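Since the wordcloud package is loaded above, the same unigram frequencies can also be visualised as a word cloud; a minimal sketch using the frequency table d1 (the seed, word limit, and colour palette are arbitrary choices):
# Word cloud of the 100 most frequent unigrams
set.seed(1234)  # arbitrary seed, only for a reproducible layout
wordcloud(words = d1$word, freq = d1$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))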
#Barplot for word frequencies: UniGram model
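As a minimal sketch, the 20 most frequent unigrams from the frequency table d1 can be plotted as follows (the number of words shown, the colour, and the labels are arbitrary choices):
# Barplot of the 20 most frequent unigrams
barplot(d1$freq[1:20], names.arg = as.character(d1$word[1:20]), las = 2,
        col = "steelblue", main = "Top 20 Unigrams", ylab = "Frequency")
The BiGram and TriGram tokenizers described above can be built with the same RWeka/tm pattern as the UniGram model; the sketch below mirrors that code (the object names tdm2, tdm3, v2, v3, d2, and d3 are assumptions, not part of the original analysis):
# Tokenizers for word pairs (bigrams) and word triples (trigrams)
BiGram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm2 <- TermDocumentMatrix(sample_corp, control = list(tokenize = BiGram))
tdm3 <- TermDocumentMatrix(sample_corp, control = list(tokenize = TriGram))
# Frequency tables sorted in decreasing order, as for the UniGram model
v2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
v3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
d3 <- data.frame(word = names(v3), freq = v3)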
#Conclusion:
There are a number of issues to be addressed in the predictive model and Shiny application for the Coursera-SwiftKey project:
1. While building the model, the dataset needs to be divided into a training set and a test set in a 60:40 ratio.
2. Tokenization needs to be addressed more intensively with BiGram and TriGram models as well.
3. Building the actual predictive models will involve extensive use of tokenization and N-grams.
4. Turning the final predictive model into a Shiny application will be an interesting part of the project.
5. The Shiny app will be helpful to non-technical users who want to predict the next word.
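As an illustration of point 1, a 60/40 split of one of the sampled files might look like the sketch below (the seed and the object names train_idx, blogs_train, and blogs_test are assumptions, not part of the current analysis):
# Split the sampled blog data into 60% training and 40% test sets
set.seed(123)  # arbitrary seed for reproducibility
train_idx   <- sample(seq_along(USblogs_samp), size = floor(0.6 * length(USblogs_samp)))
blogs_train <- USblogs_samp[train_idx]
blogs_test  <- USblogs_samp[-train_idx]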