As part of the final Capstone Project we will follow the step-by-step approach given below:
1. Understanding the problem - business case
2. Data acquisition and cleaning - reading the data
3. Exploratory analysis - pre-processing - NLP corpus
4. Statistical modeling - n-gram models
5. Predictive modeling
6. Creative exploration - evaluating performance
7. Creating a data product - Shiny app
8. Creating a short slide deck pitching the product
This is an intermediate report that describes the data acquisition, cleaning and exploratory analysis. We will use the libraries listed below.
#List of Packages and Libraries
library(stringi)
## Warning: package 'stringi' was built under R version 3.6.2
library(tm)
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(htmlTable)
## Warning: package 'htmlTable' was built under R version 3.6.3
library(xtable)
## Warning: package 'xtable' was built under R version 3.6.1
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
library(NLP)
library(ngram)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(slam)
## Warning: package 'slam' was built under R version 3.6.2
library(data.table)
## Warning: package 'data.table' was built under R version 3.6.3
##
## Attaching package: 'data.table'
## The following object is masked from 'package:slam':
##
## rollup
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#Download the data from the link under Course Data Set
url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("Coursera-SwiftKey.zip")){
download.file(url, destfile="./Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
list.files("./final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
As shown above, the data is unzipped into the directory "final". It covers four languages: de_DE (German), en_US (English), fi_FI (Finnish) and ru_RU (Russian). Here we will use the English data en_US, which comes from three different sources:
1. Blogs - en_US.blogs.txt
2. News - en_US.news.txt
3. Twitter - en_US.twitter.txt
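The presence of these three files can be confirmed directly from R; a minimal check (the expected names follow from the three sources listed above):
list.files("./final/en_US") # expected: en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt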
We will use readLines to load the data into the R workspace and finally convert it to a data frame for further analysis.
Reading lines of en_US.blogs.txt
en_blogs<-file("./final/en_US/en_US.blogs.txt", "r")
Blogs_Lines<-readLines(en_blogs, encoding = "UTF-8", skipNul = TRUE) # Using UTF-8 coding
close(en_blogs)
stri_stats_general(Blogs_Lines)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
summary( nchar(Blogs_Lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
Reading lines of en_US.news.txt
The file is opened as a binary connection to avoid data loss.
en_news<- file("./final/en_US/en_US.news.txt", "rb")# Reading as binary file
News_Lines<- readLines(en_news, encoding="UTF-8", skipNul = TRUE)
close(en_news)
stri_stats_general(News_Lines)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
summary( nchar(News_Lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11384.0
Reading lines of en_US.twitter.txt
en_twitter<- file("./final/en_US/en_US.twitter.txt")
Twitter_Lines<-readLines(en_twitter, encoding="UTF-8", skipNul = TRUE)# skip Nulls
close(en_twitter)
stri_stats_general(Twitter_Lines)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
summary( nchar(Twitter_Lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
# Use stri_stats_general and stri_stats_latex to get lines, non-empty lines, words, characters, etc.
Blogs <- c(stri_stats_general(Blogs_Lines)[1], stri_stats_latex(Blogs_Lines)[4],
file.info("./final/en_US/en_US.blogs.txt")$size/(2^20))# Data Size in MB
News <- c(stri_stats_general(News_Lines)[1], stri_stats_latex(News_Lines)[4], file.info("./final/en_US/en_US.news.txt")$size/(2^20)) # Data size in MB
Twitter<- c(stri_stats_general(Twitter_Lines)[1], stri_stats_latex(Twitter_Lines)[4],
file.info("./final/en_US/en_US.twitter.txt")$size/(2^20)) # Data Size in MB
Total <- Blogs+News+Twitter
table <- as.data.frame(rbind(Blogs, News, Twitter, Total))# create a Table
rm(Blogs, News, Twitter, Total) # remove the intermediate summary vectors
colnames(table)[3] <- "Size in Mb"
table
## Lines Words Size in Mb
## Blogs 899288 37570839 200.4242
## News 1010242 34494539 196.2775
## Twitter 2360148 30451170 159.3641
## Total 4269678 102516548 556.0658
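For a nicer rendering in the knitted report, the same summary data frame could also be passed to kable from the knitr package loaded above; a minimal sketch:
kable(table, digits = 1, caption = "Lines, words and size (MB) of the three en_US files")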
In total there are about 4.3 million lines of text containing 102.5 million words. This data size is unmanageable for interactive work, so let's create a subsample for further processing - to develop a model, test the algorithms and deploy. A 1% subsample would be reasonable, but my desktop can only handle 0.1%, so all samples below are 0.1%.
# Take 0.1% of the data as the sample size
set.seed(0657)
Sample_001_Data <- c(sample(Blogs_Lines, length(Blogs_Lines) * 0.001),
sample(News_Lines, length(News_Lines) * 0.001),
sample(Twitter_Lines, length(Twitter_Lines) * 0.001))
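As a quick sanity check, the sample length should be roughly 0.1% of the 4.27 million lines; it matches the document count of the corpus created below:
length(Sample_001_Data) # 4269 lines in the run shown here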
# Create corpus and clean the data
corpus_001 <- VCorpus(VectorSource(Sample_001_Data))# Creating VCorpus
print(corpus_001)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4269
We have created a volatile corpus (VCorpus) loaded into memory with 4269 documents in total; each sampled line represents one document. A volatile corpus is not permanent - it exists only in the current R session - and internally the VCorpus is a nested list.
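As an illustration of the nested-list structure, individual documents can be accessed like list elements; a minimal sketch (the index 1 is arbitrary):
content(corpus_001[[1]]) # text of the first sampled document
meta(corpus_001[[1]]) # its metadata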
1. Cleaning the data
1.1 Remove unwanted punctuation, stopwords, numbers and extra whitespace, and convert the text to lowercase.
stopwords("en") # List of stop words in English
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
If required, we can add our own stopwords that are not in the standard list. For example:
Stopwords_plus <- c(stopwords('en'), "can", "will", "want", "just", "like", "shall","but")
1.2 Remove URLs, Twitter handles, whitespace, numbers and punctuation (periods, commas, hyphens, question marks, etc.).
Space_r <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # Transformer: replace a pattern with a space
corpus_001 <- tm_map(corpus_001, Space_r, "(f|ht)tp(s?)://(.*)[.][a-z]+") # Remove URLs
corpus_001 <- tm_map(corpus_001, Space_r, "@[^\\s]+") # Remove Twitter handles
corpus_001 <- tm_map(corpus_001, content_transformer(tolower)) # Convert to lower case, keeping the corpus structure
corpus_001 <- tm_map(corpus_001, removeWords, Stopwords_plus) # Remove stopwords
corpus_001 <- tm_map(corpus_001, removePunctuation) # Remove punctuation
corpus_001 <- tm_map(corpus_001, removeNumbers) # Remove numbers
corpus_001 <- tm_map(corpus_001, stripWhitespace) # Collapse repeated whitespace
print(corpus_001)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4269
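To verify that the cleaning steps behaved as intended, a few of the cleaned documents can be inspected; a minimal sketch (the indices are arbitrary):
inspect(corpus_001[1:2]) # print the first two cleaned documents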
#Data Exploratory Analysis: Visuals
Plot histograms of the most common single words (unigrams), word pairs (bigrams) and word triplets (trigrams), constructing a term-document matrix (TDM) for each.
# Compute word frequencies from a term-document matrix, sorted in decreasing order
f1 <- function(TDM) {
  freq <- sort(rowSums(as.matrix(TDM)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
Gram2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))#bigram
Gram3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)) #trigram
# Histogram function
Histogram_Plot <- function(data, label, color) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency of the Words", title= "Plot of Most Common Words v/s Frequency ") +
theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
geom_bar(stat = "identity", fill = I(color))
}
# Create the TDM from the corpus:
corpus_TDM <- TermDocumentMatrix(corpus_001)
# Print out the corpus_TDM summary
corpus_TDM
## <<TermDocumentMatrix (terms: 14912, documents: 4269)>>
## Non-/sparse entries: 50706/63608622
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
# Get frequencies of most common n-grams in data sample
freq1 <- f1(removeSparseTerms(corpus_TDM, 0.9999))
Histogram_Plot(freq1, "Thirty Most Common Words","blue")
"one", "said", "get", "time" and "day" are the most common unigrams. News, Twitter and blog writers all use these words frequently, while "much", "got" and "need" appear less often.
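The wordcloud package loaded at the start gives another view of the same unigram frequencies; a minimal sketch using the freq1 data frame (the 100-word limit and the palette are arbitrary choices):
wordcloud(words = freq1$word, freq = freq1$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))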
# Create the bigram term-document matrix from the corpus:
corpus_DTM_Gram2 <- TermDocumentMatrix(corpus_001, control = list(tokenize = Gram2))#bigram
corpus_DTM_Gram2 # Print out corpus_DTM_Gram2 data
## <<TermDocumentMatrix (terms: 49354, documents: 4269)>>
## Non-/sparse entries: 51354/210640872
## Sparsity : 100%
## Maximal term length: 52
## Weighting : term frequency (tf)
freq2 <- f1(removeSparseTerms(corpus_DTM_Gram2, 0.9999))
Histogram_Plot(freq2, "30 Most Common Bigrams","red")
"new york", "years ago", "right now", "last week" and "last year" are the most frequent bigrams. Place names and time expressions predominate among the bigrams.
# Create the trigram term-document matrix from the corpus:
corpus_DTM_Gram3 <- TermDocumentMatrix(corpus_001, control = list(tokenize = Gram3))
corpus_DTM_Gram3 # Print out corpus_DTM_Gram3 data
## <<TermDocumentMatrix (terms: 47487, documents: 4269)>>
## Non-/sparse entries: 47543/202674460
## Sparsity : 100%
## Maximal term length: 60
## Weighting : term frequency (tf)
freq3 <- f1(removeSparseTerms(corpus_DTM_Gram3, 0.9999))
Histogram_Plot(freq3, "30 Most Common Trigrams", "green")
"market live update", "happy new year", "live update nz" and "new york city" are the predominant trigrams; this trend is driven mainly by the news and Twitter sources.
The next steps are:
1. Develop a prediction algorithm and evaluate the model.
2. Optimize the use of computer resources.
3. Build a simple model for the relationship between words, covering both previously seen and unseen words (a minimal sketch of such a lookup follows below).
4. Create a Shiny app to emulate the SwiftKey keyboard.
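As a preview of the prediction step, the sketch below shows one possible way to turn the bigram frequencies already computed (freq2) into a simple next-word lookup with data.table, which is loaded above. The helper name predict_next and its behaviour are hypothetical illustrations, not the final algorithm, which will also need to back off to unigrams for unseen words.
# Hypothetical sketch: split each bigram into its two words and keep the counts
bigram_dt <- as.data.table(freq2)
bigram_dt[, c("w1", "w2") := tstrsplit(as.character(word), " ", fixed = TRUE)]

predict_next <- function(first_word, n = 3) {
  # Return the n most frequent words observed after 'first_word' in the sample
  candidates <- bigram_dt[w1 == tolower(first_word)][order(-freq)]
  head(candidates$w2, n)
}

predict_next("new") # e.g. may suggest "york" given the bigram counts above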
The steps in this report are iterative, and there is considerable room for improvement with further iterations and a growing knowledge base.