This is the Week 2 Milestone Report for the Coursera Capstone project by Johns Hopkins University / SwiftKey.
The final goal of the Capstone Project is to create a Shiny app that uses a prediction algorithm I have built: it takes a word or a phrase (multiple words) as input and outputs a prediction of the next word. In this capstone we apply data science to natural language processing (NLP) and build a predictive model based on n-grams, i.e. sequences of n consecutive words. We assume that the word we are trying to predict depends on the word(s) that precede it.
The goal of this Milestone Report is to familiarize myself with the Capstone data set, do some basic text mining in R, and learn to apply the available tools in R.
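To make the n-gram assumption concrete, here is a minimal sketch in base R. It is only an illustration on a made-up sentence, not the model built in this capstone; `toyText` and the helper `predictNext` are hypothetical names introduced for this example.

```r
# Toy text (illustrative only, not taken from the capstone data)
toyText <- "i love new york i love new cars i like new york"
tokens <- strsplit(toyText, " ", fixed = TRUE)[[1]]

# Bigrams as (previous word, next word) pairs
prevWord <- head(tokens, -1)
nextWord <- tail(tokens, -1)

# Hypothetical helper: the most frequent follower of `word` in the toy text
predictNext <- function(word) {
  followers <- table(nextWord[prevWord == word])
  if (length(followers) == 0) return(NA_character_)  # word never seen
  names(followers)[which.max(followers)]
}

predictNext("new")
predictNext("love")
```

With this toy text, "new" is followed twice by "york" and once by "cars", so the sketch predicts "york". A real model would also need a strategy for unseen words and longer contexts.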
Specifically, I’ll go through the following steps in this Milestone report:
In this report we’ll look at three corpora of US English text: blogs, news, and Twitter.
We’ll not use the German, Finnish, or Russian texts, as the results would be harder to read and understand for readers who do not speak those languages.
library(knitr)
inPath <- file.path("C:/", "Users","Andrey", "Desktop", "Coursera-Swiftkey", "final", "en_US")
outPath <- file.path("C:/", "Users","Andrey", "Desktop","Coursera-Swiftkey", "final", "en_US", "txt")
# inspect the data
list.files("./Coursera-SwiftKey/final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files(inPath)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## [4] "txt"
blogPath <- paste(inPath, "/en_US.blogs.txt", sep = "")
twitterPath <- paste(inPath, "/en_US.twitter.txt", sep = "")
newsPath <- paste(inPath, "/en_US.news.txt", sep = "")
blogs <- readLines(blogPath, encoding="UTF-8")
twitter <- readLines(twitterPath, encoding="UTF-8")
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 167155 appears
## to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 268547 appears
## to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines(twitterPath, encoding = "UTF-8"): line 1759032 appears
## to contain an embedded nul
news <- readLines(newsPath, encoding="UTF-8")
## Warning in readLines(newsPath, encoding = "UTF-8"): incomplete final line
## found on 'C://Users/Andrey/Desktop/Coursera-Swiftkey/final/en_US/
## en_US.news.txt'
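The warnings above point to embedded NUL bytes and a missing final newline. One way to read such files more robustly is through a binary connection with `skipNul = TRUE`. The sketch below is an assumption on my side rather than the code used for the summary in this report; `readLinesSafe` is a hypothetical helper, and the demonstration runs on a small temporary file instead of the capstone data.

```r
# Hypothetical helper (not part of the original analysis): read a text file
# through a binary connection and drop embedded NUL bytes
readLinesSafe <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demonstration: a temporary file with an embedded NUL
# and no newline after the last line
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"),
           charToRaw("sec"), as.raw(0), charToRaw("ond line\n"),
           charToRaw("last line without newline")), tmp)

demoLines <- readLinesSafe(tmp)
length(demoLines)
demoLines[2]
```

On Windows, binary mode also avoids text-mode translation of the file contents, which can otherwise affect how many lines are read.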
The code below performs basic text mining to summarize the three files (file sizes, line counts, and word counts) and then builds a summary table.
# library for character string analysis
library(stringi)
wordsBlogs <- stri_count_words(blogs)
wordsNews <- stri_count_words(news)
wordsTwitter <- stri_count_words(twitter)
sizeBlogs <- file.info(blogPath)$size/1024^2
sizeNews <- file.info(newsPath)$size/1024^2
sizeTwitter <- file.info(twitterPath)$size/1024^2
lenBlog <- format(length(blogs), big.mark = ',', small.mark = '.', n.small = 2)
lenNews <- format(length(news), big.mark = ',', small.mark = '.', n.small = 2)
lenTitter <- format(length(twitter), big.mark = ',', small.mark = '.', n.small = 2)
sumWordBlog <- format(sum(wordsBlogs), big.mark = ',', small.mark = '.', n.small = 2)
sumWordNews <- format(sum(wordsNews), big.mark = ',', small.mark = '.', n.small = 2)
sumWordTwit <- format(sum(wordsTwitter), big.mark = ',', small.mark = '.', n.small = 2)
summaryTable <- data.frame(filename = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
                           file_size_MB = c(sizeBlogs, sizeNews, sizeTwitter),
                           no_of_lines = c(lenBlog,lenNews,lenTitter),
                           no_of_words = c(sumWordBlog,sumWordNews,sumWordTwit),
                           mean_no_of_words = c(mean(wordsBlogs),mean(wordsNews),mean(wordsTwitter)))
summaryTable
##            filename file_size_MB no_of_lines no_of_words mean_no_of_words
## 1   en_US.blogs.txt     200.4242     899,288  37,546,246         41.75108
## 2    en_US.news.txt     196.2775      77,259   2,674,536         34.61779
## 3 en_US.twitter.txt     159.3641   2,360,148  30,093,369         12.75063
As the summary table shows, the files are large: each is between roughly 160 and 200 MB, and the blogs and Twitter files each contain over 30 million words. Processing the full data sets would require significant computing resources and time. To expedite analysis and modeling, we’ll use samples of the files.
For further work, we’ll create a subset of each file by randomly taking 2% of its lines.
# First we write a function which creates a random sample of a file
samplingText <- function(inFile, outFile, n){
  # read in binary mode and skip embedded NULs
  con1 <- file(inFile, "rb")
  content <- readLines(con1, encoding = "UTF-8", warn = TRUE, skipNul = TRUE)
  close(con1)
  numOfLines <- length(content)
  # sample size: 2% of the lines, rounded up
  sampleSize <- ceiling(numOfLines * 0.02)
  # n is the random seed, so the sample is reproducible
  set.seed(n)
  randomSamp <- sample(seq_len(numOfLines), sampleSize, replace = FALSE)
  sampleText <- content[randomSamp]
  # write the sampled lines to the output file
  con2 <- file(outFile, "w")
  writeLines(sampleText, con2)
  close(con2)
}
# we generate random Samples of 3 files
samplingText(twitterPath, paste(outPath, "/twitter.txt", sep=""), 123)
samplingText(blogPath, paste(outPath, "/blogs.txt", sep=""), 123)
samplingText(newsPath, paste(outPath, "/news.txt", sep=""), 123)
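As an aside, a sample of roughly 2% can also be drawn with a per-line "coin flip" using `rbinom`, which does not require fixing the sample size in advance. The sketch below is an alternative I did not use in this report; `sampleLinesBernoulli` is a hypothetical helper, and the input is a synthetic vector of lines rather than the capstone files.

```r
# Hypothetical alternative to samplingText(): keep each line independently
# with probability p (a Bernoulli draw per line)
sampleLinesBernoulli <- function(lines, p = 0.02, seed = 123) {
  set.seed(seed)  # fixed seed makes the sample reproducible
  keep <- rbinom(length(lines), size = 1, prob = p) == 1
  lines[keep]
}

# Synthetic input, for illustration only
demoText <- sprintf("line %d", 1:10000)
picked <- sampleLinesBernoulli(demoText)

# Share of lines kept: close to, but not exactly, 2%
length(picked) / length(demoText)
```

Unlike `sample()` with a fixed size, the number of lines kept varies slightly from seed to seed, and the original line order is preserved.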
Now we create the corpus from the three sampled files.
# Basic text operations will be done using the tm package :
library(NLP)
library(tm)
sampleCorpus <- Corpus(DirSource(outPath))
Below we clean the corpus so that the remaining tokens are plain words: numbers and punctuation are removed, repeated whitespace is collapsed, and all text is converted to lower case.
# Tokenization and cleaning of the corpus
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
# wrap base functions such as tolower in content_transformer() (tm >= 0.6)
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))
# Profanity filtering
profanity <- read.csv("C:/Users/Andrey/Desktop/profanity1.xls", header=FALSE, stringsAsFactors=FALSE)
# removeWords expects a character vector, so pass the first column (V1)
sampleCorpus <- tm_map(sampleCorpus, removeWords, profanity$V1)
#Remove stopwords
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("english"))
For reference, a standard set of English stopwords is provided by the “tm” package for R.
Stopwords are very frequent words found in most texts; they usually add little meaning and can be omitted without losing the sense of a sentence.
A sample of typical stopwords is provided below:
“a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, aren’t, around, as, ask, asked, asking, asks, at, away, ….. yes, yet, you, you’d, you’ll, young, younger, youngest, your, you’re, yours, yourself, yourselves, you’ve, z”
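To show what the cleaning steps do, here is a minimal sketch on a two-sentence toy corpus. The sentences and the names `toyCorpus` and `cleaned` are made up for illustration, and the order of steps is one reasonable choice, not necessarily the best one.

```r
library(tm)

# Toy corpus built from an in-memory vector (illustrative sentences only)
toyCorpus <- VCorpus(VectorSource(c("The 3 cats are sitting on the mat!",
                                    "I will see you in New York in 2 days.")))

# Same kind of cleaning as above, applied to the toy corpus
toyCorpus <- tm_map(toyCorpus, content_transformer(tolower))
toyCorpus <- tm_map(toyCorpus, removeNumbers)
toyCorpus <- tm_map(toyCorpus, removePunctuation)
toyCorpus <- tm_map(toyCorpus, removeWords, stopwords("english"))
toyCorpus <- tm_map(toyCorpus, stripWhitespace)

# Extract the cleaned text of each document as a character vector
cleaned <- unname(vapply(toyCorpus, as.character, character(1)))
cleaned
```

Note that the order matters: once punctuation is removed, a contraction such as “can’t” becomes “cant” and no longer matches the stopword list.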
Stemming removes common word endings (e.g., “ing”, “es”, “s”) so that related word forms are reduced to a common stem. For instance, “example” and “examples” are both stemmed to “exampl”.
library(SnowballC)
sampleCorpus <- tm_map(sampleCorpus, stemDocument)
# This tells R to treat our preprocessed documents as text documents
sampleCorpus <- tm_map(sampleCorpus, PlainTextDocument)
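The stemming example mentioned above can be checked on individual words with `wordStem` from the SnowballC package, which applies the Porter stemmer by default. This is a small illustration only; `stems` is a name introduced for this example.

```r
library(SnowballC)

# Stem the two word forms from the example in the text
stems <- wordStem(c("example", "examples"))
stems
```

Both forms map to the same stem, so they are counted as one term after stemming.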
I will use the RWeka package to create uni-, bi-, and tri-gram sets:
library(RWeka)
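nGramFunction <- function(corpusData, n){
  # tokenize the corpus into n-grams of exactly n words
  tdm <- TermDocumentMatrix(corpusData, control =
    list(tokenize = function(x)
           NGramTokenizer(x, RWeka::Weka_control(min = n, max = n)),
         wordLengths = c(1, Inf)))
  # total frequency of each n-gram across all documents
  frequency <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  # percentage_covered is a share of all n-gram occurrences (a fraction)
  df <- data.frame(word = names(frequency), freq = frequency,
                   percentage_covered = frequency/sum(frequency),
                   Rank = rank(-frequency))
  # return the table explicitly
  df
}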
uniGram <- nGramFunction(sampleCorpus, 1)
head (uniGram)
##      word freq percentage_covered Rank
## will will 6138        0.005392702    1
## said said 6085        0.005346137    2
## just just 6080        0.005341744    3
## one   one 5446        0.004784727    4
## like like 5181        0.004551904    5
## can   can 4822        0.004236495    6
biGram <- nGramFunction(sampleCorpus, 2)
head(biGram)
##                    word freq percentage_covered Rank
## right now     right now  510       0.0004843286    1
## cant wait     cant wait  392       0.0003722683    2
## new york       new york  391       0.0003713186    3
## last year     last year  381       0.0003618220    4
## last night   last night  301       0.0002858489    5
## high school high school  283       0.0002687549    6
triGram <- nGramFunction(sampleCorpus, 3)
head(triGram)
##                                word freq percentage_covered Rank
## cant wait see         cant wait see   64       6.599408e-05  1.5
## happy mothers day happy mothers day   64       6.599408e-05  1.5
## let us know             let us know   55       5.671366e-05  3.0
## new york city         new york city   44       4.537093e-05  4.0
## cinco de mayo         cinco de mayo   36       3.712167e-05  5.0
## happy new year       happy new year   33       3.402820e-05  6.0
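To show what the n-gram tables contain, here is a minimal base-R sketch that builds bi-grams from a toy token vector and computes the same kind of frequency, share, and rank columns. It does not use RWeka or the sampled corpus; `makeNgrams`, `toyTokens`, and `toyTable` are hypothetical names introduced for this example.

```r
# Hypothetical helper: all runs of n consecutive tokens, joined by spaces
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy tokens (illustrative only)
toyTokens <- c("happy", "new", "year", "happy", "new", "day")

# Count bi-grams and sort by decreasing frequency
tab <- table(makeNgrams(toyTokens, 2))
freq <- sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)

# Same columns as in the tables above; rank() averages tied ranks
toyTable <- data.frame(word = names(freq),
                       freq = freq,
                       percentage_covered = freq / sum(freq),
                       Rank = rank(-freq))
toyTable
```

Because `rank()` assigns tied values the average of their positions, n-grams with equal frequency share a fractional rank, which is why the two top tri-grams above both have rank 1.5.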
Visualization of most frequent uni-grams:
I will use the wordcloud package to create word cloud visualization:
library(wordcloud)
## Loading required package: RColorBrewer
df1 <- uniGram
df1$word <- as.character(df1$word)
wordcloud(df1$word[1:30], df1$freq [1:30], scale=c(4, .1), colors=brewer.pal(8, "Dark2"))
# loading package for plotting
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
ggplot(df1[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + xlab("Uni-gram") + ggtitle("1-gram frequency")
Visualization of most frequent bi-gram word combinations:
df2 <- biGram
df2$word <- as.character(df2$word)
wordcloud(df2$word[1:30], df2$freq [1:30], scale=c(3, .1), colors=brewer.pal(8, "Dark2"))
ggplot(df2[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "red") + coord_flip() + xlab("Bi-gram") + ggtitle("2-gram frequency")
Visualization of most frequent tri-gram word combinations:
df3 <- triGram
df3$word <- as.character(df3$word)
wordcloud(df3$word[1:30], df3$freq [1:30], scale=c(3, .1), colors=brewer.pal(8, "Dark2"))
ggplot(df3[1:20,], aes(x = reorder(word, freq), y = freq) ) + geom_bar(stat = "identity", fill = "orange") + coord_flip() + xlab("Tri-gram") + ggtitle("3-gram frequency")
Due to the large size of the original text files, we worked with a random 2% sample of each file.
The initial data mining helped us find the most frequent uni-grams, bi-grams, and tri-grams.
Looking at the most frequent 2-grams and 3-grams, we conclude that they make sense and capture word combinations common in everyday language.
A few 3-grams (e.g. “follow follow back”) are probably not useful for the prediction model.
Given the size of the original files, training a prediction model on the full data could be a challenge, so random samples will most likely be used. The model should also be optimized for low memory use.
Although I removed stopwords for this initial text mining, I’ll keep them when training the model, as they may improve its performance.
The next step is to create the actual model for predicting a word from the previous word(s). The model will be deployed as a Shiny application to give users an easy interface to it.