Executive Summary

This is the Week 2 Milestone Report for the Coursera Capstone project by Johns Hopkins University / SwiftKey.

The final goal of the Capstone Project is to create a Shiny app built on my own prediction algorithm: it takes a word or a phrase (multiple words) as input and outputs a prediction of the next word. In this capstone we apply data science to natural language processing (NLP) and build a predictive model based on n-grams, that is, sequences of n consecutive words. We assume that the word we are trying to predict depends on the word(s) that precede it.

The goal of this Milestone Report is to familiarize myself with the Capstone Data Set, do some basic text mining in R, and learn to apply the available tools in R.

Specifically, I’ll go through the following steps in this Milestone report:

  1. Data loading, basic Text Mining, and creating a statistics summary report.
  2. Sampling the files and building Corpus using subsets of 3 files.
  3. Text cleaning: tokenization, removing Stopwords, Stemming and Profanity filtering.
  4. Building n-gram models.
  5. Plotting the most frequent n-grams using bar plots and wordclouds.
  6. Conclusion: Summarizing the findings.
  7. Plans for creating a prediction algorithm and Shiny app.

Capstone Data Set.

In this report we’ll look at three corpora of US English text:

  1. internet blog posts
  2. internet news articles
  3. Twitter messages

We will not use the German, Finnish or Russian texts, as the results would be harder to read and interpret for readers who do not speak those languages.

1. Data loading, basic Text Mining, and creating a statistics summary report.

library(knitr)

inPath <- file.path("C:/", "Users","Andrey", "Desktop", "Coursera-Swiftkey", "final", "en_US")
outPath <- file.path("C:/", "Users","Andrey", "Desktop","Coursera-Swiftkey", "final", "en_US", "txt")

# inspect the data
list.files("./Coursera-SwiftKey/final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files(inPath)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
## [4] "txt"
blogPath <- paste(inPath, "/en_US.blogs.txt", sep = "")
twitterPath <- paste(inPath, "/en_US.twitter.txt", sep = "")
newsPath <- paste(inPath, "/en_US.news.txt", sep = "")

blogs <- readLines(blogPath, encoding="UTF-8")
twitter <- readLines(twitterPath, encoding="UTF-8")
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
news <- readLines(newsPath, encoding="UTF-8")
## Warning in readLines(newsPath, encoding = "UTF-8"): incomplete final
## line found on 'C://Users/Andrey/Desktop/Coursera-Swiftkey/final/en_US/
## en_US.news.txt'

The code below performs basic text mining: it computes word counts, line counts and file sizes for the three files and assembles them into a summary table.

# library for character string analysis
library(stringi)

wordsBlogs <- stri_count_words(blogs)
wordsNews <- stri_count_words(news)
wordsTwitter <- stri_count_words(twitter)
sizeBlogs <- file.info(blogPath)$size/1024^2
sizeNews <- file.info(newsPath)$size/1024^2
sizeTwitter <- file.info(twitterPath)$size/1024^2
# format the counts with thousands separators for display
lenBlog <- format(length(blogs), big.mark = ",")
lenNews <- format(length(news), big.mark = ",")
lenTwitter <- format(length(twitter), big.mark = ",")
sumWordBlog <- format(sum(wordsBlogs), big.mark = ",")
sumWordNews <- format(sum(wordsNews), big.mark = ",")
sumWordTwit <- format(sum(wordsTwitter), big.mark = ",")
summaryTable <- data.frame(filename = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
      file_size_MB = c(sizeBlogs, sizeNews, sizeTwitter),
      no_of_lines = c(lenBlog, lenNews, lenTwitter),
      no_of_words = c(sumWordBlog, sumWordNews, sumWordTwit),
      mean_no_of_words = c(mean(wordsBlogs), mean(wordsNews), mean(wordsTwitter)))

summaryTable
##            filename file_size_MB no_of_lines no_of_words mean_no_of_words
## 1   en_US.blogs.txt     200.4242     899,288  37,546,246         41.75108
## 2    en_US.news.txt     196.2775      77,259   2,674,536         34.61779
## 3 en_US.twitter.txt     159.3641   2,360,148  30,093,369         12.75063

As the summary table shows, the files are large: roughly 160 to 200 MB each, with tens of millions of words in the blogs and Twitter files. Processing the full data sets would require significant computing resources and time. To speed up analysis and modeling, we’ll use samples of the files.
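
One detail in the summary table deserves a check: en_US.news.txt is about the same size as en_US.blogs.txt and has a similar mean number of words per line, yet the table reports less than a tenth as many lines. This suggests that readLines may have stopped before the end of the news file. A possible cause, assumed here rather than verified, is a control character that ends text-mode reads on Windows. The sketch below reads a file through a binary connection and skips embedded nuls, as the sampling function in the next section does. The helper name readAllLines and the temporary demo file are hypothetical.

```r
# Hypothetical helper: read every line through a binary ("rb") connection,
# dropping embedded nul bytes instead of warning about them.
readAllLines <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small self-contained demonstration on a temporary file that contains
# a 0x1A control character, an embedded nul, and no final newline.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"),
           charToRaw("second"), as.raw(0x1a), charToRaw(" line\n"),
           charToRaw("third"), as.raw(0x00), charToRaw(" line")),
         tmp)

demoLines <- readAllLines(tmp)
length(demoLines)
```

If the line count for the news file changes when it is read this way, the summary table should be regenerated from the complete data.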

2. Sampling the files and building Corpus using subsets of 3 files.

For further work, we’ll create a subset of each file by randomly sampling 2% of its lines.

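# First we define a function that writes a random sample of a file's lines to a new file.
# n is the random seed; the sample size is fixed at 2% of the lines.

samplingText <- function(inFile, outFile, n){
  # read through a binary connection, skipping embedded nuls
  con1 <- file(inFile, "rb")
  content <- readLines(con1, encoding = "UTF-8", warn = TRUE, skipNul = TRUE)
  close(con1)
  numOfLines <- length(content)
  sampleSize <- ceiling(numOfLines * 0.02)

  # draw line indices without replacement, reproducibly
  set.seed(n)
  randomSamp <- sample(seq_len(numOfLines), sampleSize, replace = FALSE)
  sampleText <- content[randomSamp]

  # write the sampled lines to the output file
  con2 <- file(outFile, "w")
  writeLines(sampleText, con2)
  close(con2)
}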

# Generate random samples of the three files

samplingText(twitterPath, paste(outPath, "/twitter.txt", sep=""), 123)

samplingText(blogPath, paste(outPath, "/blogs.txt", sep=""), 123)

samplingText(newsPath, paste(outPath, "/news.txt", sep=""), 123)

Now we create the corpus from the three sample files.

# Basic text operations will be done using the tm package:

library(NLP) 
library(tm)

sampleCorpus <- Corpus(DirSource(outPath))

3. Text cleaning: Tokenization, removing Stopwords, Stemming and Profanity filtering

Below we clean the corpus so that only word tokens remain: numbers and punctuation are removed, extra whitespace is stripped, and the text is converted to lower case. Profanity and stopwords are then filtered out.

# Tokenization and cleaning of the corpus
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
# content_transformer lets a plain character function such as tolower work on tm documents
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))

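# Profanity filtering
# (the file is assumed to be plain text with one word per line, despite its .xls extension)
profanity <- read.csv("C:/Users/Andrey/Desktop/profanity1.xls", header=FALSE, stringsAsFactors=FALSE)
# removeWords expects a character vector, so pass the first column rather than the data frame
sampleCorpus <- tm_map(sampleCorpus, removeWords, as.character(profanity[[1]]))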

#Remove stopwords 
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("english")) 

For reference, the standard set of English stopwords is provided by the “tm” package for R.

Stopwords are very frequent words found in most texts; they usually add little meaning and can be omitted without losing the sense of a sentence.

A sample of the stopwords is provided below:

“a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, aren’t, around, as, ask, asked, asking, asks, at, away, ….. yes, yet, you, you’d, you’ll, young, younger, youngest, your, you’re, yours, yourself, yourselves, you’ve, z”
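
To make the effect of these cleaning steps concrete, here is a minimal base-R sketch that applies the same kinds of transformations to a single string. It is an illustration only, not what tm does internally; the helper name cleanText and the three-word stopword list are hypothetical.

```r
# Hypothetical helper: clean one string the way the corpus is cleaned above.
cleanText <- function(x, dropWords = character(0)) {
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- tolower(x)                    # convert to lower case
  if (length(dropWords) > 0) {
    # remove whole words only, using word boundaries
    pattern <- paste0("\\b(", paste(dropWords, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)
}

cleaned <- cleanText("I can't wait to see the 2 new movies!",
                     dropWords = c("i", "to", "the"))
cleaned
# [1] "cant wait see new movies"
```

This also shows why n-grams such as "cant wait see" appear later in the report: the apostrophe is dropped with the punctuation and the stopword between "wait" and "see" is removed.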

Stemming removes common word endings (e.g., “ing”, “es”, “s”) so that different forms of a word are reduced to a common stem. For instance, “example” and “examples” are both stemmed to “exampl”.

library(SnowballC)   
sampleCorpus <- tm_map(sampleCorpus, stemDocument)  

# This tells R to treat our preprocessed documents as text documents
sampleCorpus <- tm_map(sampleCorpus, PlainTextDocument) 

4. Building n-gram models.

I will use the RWeka package to create sets of uni-, bi- and tri-grams:

library(RWeka)

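nGramFunction <- function(corpusData, n){
  # tokenize into n-grams of exactly n words; keep terms of any length
  tdm <- TermDocumentMatrix(corpusData, control = list(
    tokenize = function(x) NGramTokenizer(x, RWeka::Weka_control(min = n, max = n)),
    wordLengths = c(1, Inf)))
  # total frequency of each n-gram across the documents, most frequent first
  frequency <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

  # share of all n-gram occurrences and rank (ties share an averaged rank)
  data.frame(word = names(frequency), freq = frequency,
             percentage_covered = frequency / sum(frequency),
             Rank = rank(-frequency))
}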

uniGram <- nGramFunction(sampleCorpus, 1)
head (uniGram)
##      word freq percentage_covered Rank
## will will 6138        0.005392702    1
## said said 6085        0.005346137    2
## just just 6080        0.005341744    3
## one   one 5446        0.004784727    4
## like like 5181        0.004551904    5
## can   can 4822        0.004236495    6
biGram <- nGramFunction(sampleCorpus, 2)
head(biGram)
##                    word freq percentage_covered Rank
## right now     right now  510       0.0004843286    1
## cant wait     cant wait  392       0.0003722683    2
## new york       new york  391       0.0003713186    3
## last year     last year  381       0.0003618220    4
## last night   last night  301       0.0002858489    5
## high school high school  283       0.0002687549    6
triGram <- nGramFunction(sampleCorpus, 3)
head(triGram)
##                                word freq percentage_covered Rank
## cant wait see         cant wait see   64       6.599408e-05  1.5
## happy mothers day happy mothers day   64       6.599408e-05  1.5
## let us know             let us know   55       5.671366e-05  3.0
## new york city         new york city   44       4.537093e-05  4.0
## cinco de mayo         cinco de mayo   36       3.712167e-05  5.0
## happy new year       happy new year   33       3.402820e-05  6.0
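
For readers without RWeka, the idea behind the n-gram tables above can be sketched in base R: slide a window of n words over the token sequence and count each combination. This is a simplified illustration on a single string, not the tokenizer used in this report; the function name countNgrams and the demo sentence are hypothetical.

```r
# Hypothetical helper: count n-grams in one whitespace-separated string.
countNgrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(integer(0))
  # one n-gram starts at every position that leaves room for n words
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

demoCounts <- countNgrams("cant wait see new york cant wait see you", 2)
demoCounts[["cant wait"]]
# [1] 2
```

Dividing each count by sum(demoCounts) gives the same kind of share as the percentage_covered column above.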

5. Plotting the most frequent n-grams using bar plots and wordclouds.

Visualization of most frequent uni-grams:

I will use the wordcloud package to create a word cloud visualization:

library(wordcloud)
## Loading required package: RColorBrewer
df1 <- uniGram
df1$word <- as.character(df1$word)

wordcloud(df1$word[1:30], df1$freq [1:30], scale=c(4, .1), colors=brewer.pal(8, "Dark2"))

# loading package for plotting
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
ggplot(df1[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + xlab("Uni-gram") + ggtitle("1-gram frequency") 

Visualization of most frequent bi-gram word combinations:

df2 <- biGram
df2$word <- as.character(df2$word)

wordcloud(df2$word[1:30], df2$freq [1:30], scale=c(3, .1), colors=brewer.pal(8, "Dark2"))

ggplot(df2[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "red") + coord_flip() + xlab("Bi-gram") + ggtitle("2-gram frequency") 

Visualization of most frequent tri-gram word combinations:

df3 <- triGram
df3$word <- as.character(df3$word)

wordcloud(df3$word[1:30], df3$freq [1:30], scale=c(3, .1), colors=brewer.pal(8, "Dark2"))

ggplot(df3[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "orange") + coord_flip() + xlab("Tri-gram") + ggtitle("3-gram frequency") 

6. Conclusion: Summarizing the findings.

Because the original text files are large, we worked with random samples containing 2% of the lines of each file.

The initial data mining helped us find the most frequent uni-grams, bi-grams and tri-grams.

Looking at the most frequent 2-grams and 3-grams, we conclude that they make sense and capture well the word combinations used in spoken language.

Only a few 3-grams probably do not make sense for the prediction model (e.g. “follow follow back”).

7. Plans for creating a prediction algorithm and Shiny app.

Given the significant size of the original files, it could be a challenge to use the full original data for training a prediction model, so random samples will most likely be used. The model should also be optimized for low memory use.

Although I removed stopwords during the initial text mining, I’ll keep them when training the model, as they may improve its performance.

The next step is to create the actual model that predicts the next word based on the previous word(s). The model will be deployed as a Shiny application to give users an easy interface to it.
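
As a rough idea of what such a model could look like, the sketch below predicts the next word by looking up the most frequent bi-gram that starts with the last word of the input. It is only one possible starting point, not the final algorithm; the function name predictNext is hypothetical and the counts in demoBigrams are invented for illustration. The table has the same word and freq columns as the n-gram data frames built above.

```r
# Hypothetical helper: return the most frequent continuation of the last input word.
predictNext <- function(ngramTable, phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  lastWord <- words[length(words)]
  prefix <- paste0(lastWord, " ")
  # keep only the bi-grams whose first word is the last word typed
  hits <- ngramTable[startsWith(ngramTable$word, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$word[which.max(hits$freq)]
  # strip the prefix to leave the predicted word
  sub(prefix, "", best, fixed = TRUE)
}

# Toy bi-gram table with made-up counts
demoBigrams <- data.frame(word = c("right now", "new york", "new year", "last night"),
                          freq = c(5, 4, 2, 3),
                          stringsAsFactors = FALSE)

predictNext(demoBigrams, "Happy new")
# [1] "york"
```

A fuller model would presumably also consult the tri-gram table first and fall back to shorter n-grams when no match is found.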