Introduction

As part of the final Capstone Project we will follow, step by step, the approach given below:
1. Understanding the problem - business case
2. Data acquisition and cleaning - reading the data
3. Exploratory analysis - pre-processing - NLP corpus
4. Statistical modeling - n-gram models
5. Predictive modeling
6. Creative exploration - evaluating performance
7. Creating a data product - Shiny app
8. Creating a short slide deck pitching the product

This is an intermediate report that describes the data acquisition, cleaning and exploratory analysis. We will use the libraries as given below.

# List of Packages and Libraries

library(stringi)
## Warning: package 'stringi' was built under R version 3.6.2
library(tm)
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
library(RColorBrewer)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(htmlTable)
## Warning: package 'htmlTable' was built under R version 3.6.3
library(xtable)
## Warning: package 'xtable' was built under R version 3.6.1
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(NLP)
library(ngram)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(slam)
## Warning: package 'slam' was built under R version 3.6.2
library(data.table)
## Warning: package 'data.table' was built under R version 3.6.3
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:slam':
## 
##     rollup
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

Downloading the Data

# Download the data from the link under Course Data Set
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "./Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
list.files("./final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

As shown above, the data is unzipped into the directory "final". It covers four languages: de_DE (German), en_US (English), fi_FI (Finnish) and ru_RU (Russian). Here we will use the English data in en_US, which contains text from three different sources (a quick check of these files follows the list):
1. Blogs - en_US.blogs.txt
2. News - en_US.news.txt
3. Twitter - en_US.twitter.txt
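For a quick sanity check, the three English files and their sizes on disk can be listed directly; a minimal sketch, assuming the directory layout produced by the unzip step above:

# List the English source files with their sizes in MB
en_files <- list.files("./final/en_US", full.names = TRUE)
data.frame(file = basename(en_files), size_MB = round(file.info(en_files)$size / 2^20, 1))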

Basic statistical data summary of these three files

We will use readLines to load the data into the R workspace and finally convert it to a data frame for further analysis.

Reading Lines of en_US.blogs.txt

en_blogs<-file("./final/en_US/en_US.blogs.txt", "r")
Blogs_Lines<-readLines(en_blogs, encoding = "UTF-8", skipNul = TRUE) # Using UTF-8 coding
close(en_blogs)
stri_stats_general(Blogs_Lines)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
summary( nchar(Blogs_Lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833

Reading Lines of en_US.news.txt
The news file is read through a binary connection to avoid data loss: in text mode, readLines can stop early at embedded control characters in this file.

en_news<- file("./final/en_US/en_US.news.txt", "rb")# Reading as binary file
News_Lines<- readLines(en_news, encoding="UTF-8", skipNul = TRUE)
close(en_news)
stri_stats_general(News_Lines)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
summary( nchar(News_Lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11384.0

Reading Lines of en_US.twitter.txt

en_twitter<- file("./final/en_US/en_US.twitter.txt")
Twitter_Lines<-readLines(en_twitter, encoding="UTF-8", skipNul = TRUE)# skip Nulls 
close(en_twitter)
stri_stats_general(Twitter_Lines)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806
summary( nchar(Twitter_Lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

Data size (MB), number of lines and number of words; construct a data frame

# Use stri_stats_general and stri_stats_latex to get the number of lines and words
Blogs   <- c(stri_stats_general(Blogs_Lines)[1], stri_stats_latex(Blogs_Lines)[4],
             file.info("./final/en_US/en_US.blogs.txt")$size / 2^20)   # size in MB
News    <- c(stri_stats_general(News_Lines)[1], stri_stats_latex(News_Lines)[4],
             file.info("./final/en_US/en_US.news.txt")$size / 2^20)    # size in MB
Twitter <- c(stri_stats_general(Twitter_Lines)[1], stri_stats_latex(Twitter_Lines)[4],
             file.info("./final/en_US/en_US.twitter.txt")$size / 2^20) # size in MB

Total <- Blogs + News + Twitter

table <- as.data.frame(rbind(Blogs, News, Twitter, Total)) # build the summary table
rm(Blogs, News, Twitter, Total)                            # remove the intermediate vectors
colnames(table)[3] <- "Size in Mb"
table
##           Lines     Words Size in Mb
## Blogs    899288  37570839   200.4242
## News    1010242  34494539   196.2775
## Twitter 2360148  30451170   159.3641
## Total   4269678 102516548   556.0658

In total there are 4.3 million lines of text containing 102.5 million words. This data size is unmanageable, so let's create a subsample to work on for further processing: developing a model, testing the algorithms and deploying. A 1% subsample would be reasonable, but my desktop can only handle 0.1%, so all samples below are 0.1%.

# Take .1% of the data as Sample size
set.seed(0657)
Sample_001_Data <- c(sample(Blogs_Lines, length(Blogs_Lines) * 0.001),
                 sample(News_Lines, length(News_Lines) * 0.001),
                 sample(Twitter_Lines, length(Twitter_Lines) * 0.001))

# Create corpus and clean the data
corpus_001 <- VCorpus(VectorSource(Sample_001_Data))# Creating VCorpus
print(corpus_001)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4269

We have created a virtual corpus (VCorpus) loaded into memory with 4269 documents in total; each sampled line is one document. A virtual corpus is not permanent (it exists only in the R session), and internally a VCorpus is a nested list.
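Since the VCorpus is a nested list, individual documents can be examined directly; a minimal sketch (document 1 is just an example index):

# Look at one document of the corpus: metadata plus content, then just the text
inspect(corpus_001[[1]])
as.character(corpus_001[[1]])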

Exploratory Data Analysis

1. Cleaning the data

1.1 Remove unwanted punctuation, stopwords, numbers and extra whitespace, and convert the text to lowercase.

stopwords("en") # List of stop words in english
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

If required, we can add our own stopwords that are not in the standard list. For example:

Stopwords_plus <- c(stopwords('en'), "can", "will", "want", "just", "like", "shall","but") 

1.2 Remove whitespace, numbers and punctuation (periods, commas, hyphens, question marks, etc.), as well as URLs and Twitter handles matched by the patterns below.

Space_r <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # replace a matched pattern with a space
corpus_001 <- tm_map(corpus_001, Space_r, "(f|ht)tp(s?)://(.*)[.][a-z]+") # remove URLs
corpus_001 <- tm_map(corpus_001, Space_r, "@[^\\s]+") # remove Twitter handles
corpus_001 <- tm_map(corpus_001, tolower) # convert to lower case
corpus_001 <- tm_map(corpus_001, removeWords, Stopwords_plus) # remove all stopwords
corpus_001 <- tm_map(corpus_001, removePunctuation) # remove punctuation
corpus_001 <- tm_map(corpus_001, removeNumbers) # remove numbers
corpus_001 <- tm_map(corpus_001, stripWhitespace) # collapse extra whitespace
corpus_001 <- tm_map(corpus_001, PlainTextDocument) # convert back to plain text documents
print(corpus_001)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4269

Data Exploratory Analysis: Visuals

Plot histograms of the most common single words (unigrams), word pairs (bigrams) and word triplets (trigrams), and construct a term-document matrix (TDM). First define helper functions for the frequency counts and for bigram/trigram tokenization.

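# Helper: compute sorted word frequencies from a term-document matrix and return them as a data frame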
f1 <- function(TDM) {
  freq <- sort(rowSums(as.matrix(TDM)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
Gram2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))#bigram
Gram3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)) #trigram

Visual plots: histograms of the 30 most frequent terms (unigrams, bigrams and trigrams) in the corpus.

# Histogram function
Histogram_Plot <- function(data, label, color) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency of the Words", title= "Plot of Most Common Words v/s Frequency ") +
    theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
    geom_bar(stat = "identity", fill = I(color))
}

# Create the TDM from the corpus: 
corpus_TDM <- TermDocumentMatrix(corpus_001)
# Print out the corpus_TDM summary
corpus_TDM
## <<TermDocumentMatrix (terms: 14912, documents: 4269)>>
## Non-/sparse entries: 50706/63608622
## Sparsity           : 100%
## Maximal term length: 38
## Weighting          : term frequency (tf)
# Get frequencies of most common n-grams in data sample
freq1 <- f1(removeSparseTerms(corpus_TDM, 0.9999))
Histogram_Plot(freq1, "Thirty Most Common Words","blue")

The words one, said, get, time and day are the most common unigrams. News, Twitter and even blog authors use these words most frequently, while much, got and need appear less often.
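Since the wordcloud and RColorBrewer packages are already loaded, the same unigram frequencies can also be shown as a word cloud; a minimal sketch using the freq1 table built above:

# Word cloud of the most frequent unigrams (same data as the histogram)
set.seed(1234) # reproducible layout
wordcloud(words = freq1$word, freq = freq1$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))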

# Create the bigram TDM from the corpus: 
corpus_DTM_Gram2 <- TermDocumentMatrix(corpus_001, control = list(tokenize = Gram2)) # bigram
corpus_DTM_Gram2 # Print out corpus_DTM_Gram2 data
## <<TermDocumentMatrix (terms: 49354, documents: 4269)>>
## Non-/sparse entries: 51354/210640872
## Sparsity           : 100%
## Maximal term length: 52
## Weighting          : term frequency (tf)
freq2 <- f1(removeSparseTerms(corpus_DTM_Gram2, 0.9999))
Histogram_Plot(freq2, "30 Most Common Bigrams","red")

New york, years ago, right now, last week and last year are the most frequent bigrams. Place names and time expressions predominate among the bigrams.

# Create the trigram TDM from the corpus: 
corpus_DTM_Gram3 <- TermDocumentMatrix(corpus_001, control = list(tokenize = Gram3))
corpus_DTM_Gram3 # Print out corpus_DTM_Gram3 data
## <<TermDocumentMatrix (terms: 47487, documents: 4269)>>
## Non-/sparse entries: 47543/202674460
## Sparsity           : 100%
## Maximal term length: 60
## Weighting          : term frequency (tf)
freq3 <- f1(removeSparseTerms(corpus_DTM_Gram3, 0.9999))
Histogram_Plot(freq3, "30 Most Common Trigrams", "green")

Market live update, happy new year, live update nz and new york city are the predominant trigrams; this pattern comes mainly from the news and Twitter sources.

Future Plans

1. Develop a prediction algorithm and evaluate the model
2. Optimize the use of computer resources
3. Build a simple model for the relationship between words - both previously seen and unseen words (a first sketch of the idea follows below)
4. Create a Shiny app to emulate the SwiftKey app
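As a preview of the prediction step, the bigram frequencies already stored in freq2 can drive a very simple next-word lookup. This is only a sketch of the idea, not the final model; the function name predict_next and the use of freq2 as the lookup table are illustrative assumptions:

# Sketch: given a word, return the most frequent following words from the bigram table freq2
predict_next <- function(word, bigram_freq = freq2, n = 3) {
  parts  <- strsplit(as.character(bigram_freq$word), " ") # split "w1 w2" bigrams
  first  <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  keep   <- first == tolower(word)                        # the corpus was lower-cased
  head(second[keep][order(bigram_freq$freq[keep], decreasing = TRUE)], n)
}
predict_next("new") # with the sample above, "york" should be among the top suggestions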

The steps in this report are iterative, and there is plenty of room for improvement with further iterations and a growing knowledge base.
