The goal of this milestone report is to explore the major features of the text data given to us for the Coursera (https://www.coursera.org/) Data Science Capstone through Johns Hopkins University (https://www.jhu.edu/). The project is sponsored by SwiftKey. The end goal is to create a text-prediction application with R's Shiny package that predicts words using a natural language processing model.
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
1.) Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
2.) Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
1.) Some words are more frequent than others - what are the distributions of word frequencies?
2.) What are the frequencies of 2-grams and 3-grams in the dataset?
3.) How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4.) How do you evaluate how many of the words come from foreign languages?
5.) Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
The following packages were already installed, so the installation code was removed from the final product: install.packages("tm"), install.packages("quanteda"), install.packages("dplyr"), install.packages("ggplot2"), install.packages("stringr"), install.packages("pander"), install.packages("readr"), install.packages("tables"), install.packages("wordcloud").
library(tm)
library(quanteda)
library(dplyr)
library(ggplot2)
library(stringr)
library(pander)
library(readr)
library(tables)
library(wordcloud)
The dataset used for this project is available at the following link: "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip". In this next step the file is downloaded into the current working directory and the files are read into R with the readLines function.
Sys.time()
## [1] "2018-06-22 13:58:27 EDT"
fileUrl <-
"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileUrl))) {
download.file(fileUrl, basename(fileUrl))
unzip(basename(fileUrl))
}
blogs <- readLines(con = "./final/en_US/en_US.blogs.txt", encoding ="UTF-8", skipNul = T)
news <- readLines(con = "./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = T)
tweets <- readLines(con = "./final/en_US/en_US.twitter.txt", encoding ="UTF-8", skipNul = T)
The dataset has been downloaded and read into the 3 objects above. The next step in the research is to create a training dataset, a development dataset, and a test dataset from each of the 3 files.
For this exercise, we will use 60% of the dataset for training, 20% for development, and 20% for testing. Based on recommendations I found online, I will explore the data with 20% of the training set, which means 12% (20% of 60%) of the overall dataset will be used for exploration.
## [1] "2018-06-22 13:59:07 EDT"
The data has now been written to local files for later use (the splitting code is not shown).
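Below is a minimal sketch of how such a 60/20/20 split, plus the 20% exploration sample, could be done. The splitText() helper and the output directory are hypothetical; the actual suppressed code may differ, and the file names simply mirror those in the summary table that follows.
# Hedged sketch of the 60/20/20 split plus the 20% exploration sample.
# splitText() and outDir are hypothetical.
set.seed(1234)
splitText <- function(lines, label, outDir = "./final/en_US") {
grp <- sample(c("train", "dev", "test"), length(lines), replace = TRUE,
prob = c(0.6, 0.2, 0.2))
writeLines(lines[grp == "train"], file.path(outDir, paste0("train_", label, ".txt")))
writeLines(lines[grp == "dev"],   file.path(outDir, paste0("dev_", label, ".txt")))
writeLines(lines[grp == "test"],  file.path(outDir, paste0("test_", label, ".txt")))
# 20% of the training lines become the "small" exploration file
trainLines <- lines[grp == "train"]
keep <- sample(length(trainLines), round(0.2 * length(trainLines)))
writeLines(trainLines[keep], file.path(outDir, paste0("small_", label, ".txt")))
}
# splitText(blogs, "blogs"); splitText(news, "news"); splitText(tweets, "tweets")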
In this section I went through a number of gyrations trying to document the files in the directory, count the lines, count the words, and compute the average words per line. Despite all of my efforts, the full Twitter file does not count the words completely: the split Twitter files all count accurately, but the full file reports fewer words than lines. I did nothing different for any of the 15 files, yet Twitter gave me trouble. To account for all of the words in the full Twitter file, I took the average from the split files and applied it to the total line count of the full file.
For counting lines I found a few lines of code that seemed to work. For counting words I used the stringr package with the "\\w+" pattern; this also seemed to work (except for the full Twitter file). In the end I built "outLine", a data frame that holds each file and its key metrics. I would like to find a more efficient way of doing this and will experiment with "for" or "while" loops and a function to simplify the code (see the sketch after the table below). As the current code is cumbersome and ugly, I suppressed it.
## [1] "2018-06-22 14:00:39 EDT"
## File.Name File.Size Mb Lines Words AVG.Words
## 1 en_US.blogs.txt 200.4 Mb 899288 38309620 42.6
## 2 en_US.news.txt 196.3 Mb 1010242 35624454 35.3
## 3 en_US.twitter.txt 159.4 Mb 2360148 31153954 13.2
## 4 train_blogs.txt 119.9 Mb 539572 23006709 42.6
## 5 train_news.txt 117.2 Mb 606145 21375609 35.3
## 6 train_tweets.txt 94.3 Mb 1416088 18604511 13.1
## 7 dev_blogs.txt 39.8 Mb 179858 7652569 42.5
## 8 test_blogs.txt 39.8 Mb 179858 7650342 42.5
## 9 test_news.txt 39.1 Mb 202049 7128322 35.3
## 10 dev_news.txt 39 Mb 202048 7120523 35.2
## 11 dev_tweets.txt 31.4 Mb 472030 6198871 13.1
## 12 test_tweets.txt 31.4 Mb 472030 6200162 13.1
## 13 small_blogs.txt 24 Mb 107914 4613989 42.8
## 14 small_news.txt 23.4 Mb 121229 4272627 35.2
## 15 small_tweets.txt 18.9 Mb 283217 3726016 13.2
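As a first pass at simplifying the suppressed code, here is a hedged sketch of a loop-based approach. The fileStats() helper and the myFiles vector are hypothetical; the word count reuses the same stringr "\\w+" pattern described above, and file sizes are reported in Mb.
# Hypothetical sketch: one helper applied to every file instead of repeated code.
fileStats <- function(path) {
txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
words <- sum(str_count(txt, "\\w+"))
data.frame(File.Name = basename(path),
File.Size.Mb = round(file.size(path) / 2^20, 1),
Lines = length(txt),
Words = words,
AVG.Words = round(words / length(txt), 1))
}
myFiles <- list.files("./final/en_US", pattern = "\\.txt$",
recursive = TRUE, full.names = TRUE)
# outLine <- do.call(rbind, lapply(myFiles, fileStats))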
In this next step I made some transformations to the Corpus in order to:
1.) Eliminate foreign letters / odd symbols by transforming the text to ASCII.
2.) Remove profanity, using the word list at "https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/" (downloaded and unzipped into my Capstone directory).
3.) Remove Twitter hashtag "#" words, the unicode "<>" tags, and URLs, as they are not words.
4.) Remove any punctuation that is not end-of-sentence punctuation or an apostrophe, in order to preserve the integrity of the context.
5.) Add an <EOS> tag to account for all end-of-sentence punctuation (". ? !").
6.) Provide additional clean-up where prior steps may have introduced some confusion.
7.) Account for the numbers in the text, as they are not words.
8.) Clean up any remaining whitespace.
Sys.time()
## [1] "2018-06-22 14:02:56 EDT"
myDir <- "./final/en_US/SMALL FILES/"
mySource <- "blogs"
myFile <- paste("small_", mySource, ".txt", sep = '')
fullFile <- paste(myDir, myFile, sep = '')
# file.info(fullFile)
corp <- VCorpus(DirSource(myDir))
myChars <- function(x, n = seq(x)) {
# x: a Corpus
# n: The elements of x for which characters will be returned
require(dplyr)    # for the pipe operator
require(stringr)  # for str_split
t <- character()
for(i in n) {
t <- c(t, x[[i]][[1]])
}
t %>%
str_split("") %>%
sapply(function(x) x[-1]) %>%
unlist %>%
unique %>%
sort(decreasing = FALSE)
}
chars <- myChars(corp)
#print(chars, quote = F)
I commented out the print function because it would print over 1,000 entries of characters that are to be eliminated. There is an excessive number of foreign characters and odd symbols that could impact the prediction, so these will be converted or deleted.
Sys.time()
## [1] "2018-06-22 14:06:27 EDT"
dat <- sapply(corp, function(row) iconv(row, "latin1", "ASCII", sub = ""))
corp <- VCorpus(VectorSource(dat))
rm(dat)
chars <- myChars(corp)
print(chars, quote = F)
## [1] \177 \037 \003 \031 _ - , ; : ! ? . '
## [15] " ( ) [ ] { } @ * / \\ & # %
## [29] ` ^ + < = > | ~ $ 0 1 2 3 4
## [43] 5 6 7 8 9 a A b B c C d D e
## [57] E f F g G h H i I j J k K l
## [71] L m M n N o O p P q Q r R s
## [85] S t T u U v V w W x X y Y z
## [99] Z
The number of characters has been reduced, and the remaining characters can now be transformed as outlined above.
Sys.time()
## [1] "2018-06-22 14:08:40 EDT"
swap <- content_transformer(function(x, from, to) gsub(from, to, x))
corp <- tm_map(corp, content_transformer(tolower))
# Remove any profanity
profanityWords <- readLines(con="full-list-of-bad-words-text-file_2018_03_26.txt", skipNul = T)
corp <- tm_map(corp, removeWords, profanityWords)
# Use a space to replace all foreign unicode character codes
corp <- tm_map(corp, swap, "<.*>", " ")
# Eliminate the twitter hashtags
corp <- tm_map(corp, swap, "#.*", " ")
# Remove all notation associated with website names
corp <- tm_map(corp, swap, "www\\..*", " ")
corp <- tm_map(corp, swap, ".*\\.com", " ")
# Use a space to replace all punctuation except EOS punctuation and apostrophe
corp <- tm_map(corp, swap, "[^[:alnum:][:space:]\'\\.\\?!]", " ")
# Any numbers with decimal places are deleted
corp <- tm_map(corp, swap, "[0-9]+\\.[0-9]+", "")
# Use only one instance of EOS punctuation
corp <- tm_map(corp, swap, "([\\.\\?!]){2,}", ". ")
# Use the <EOS> tag to replace other end of sentence punctuation
corp <- tm_map(corp, swap, "\\. |\\.$", " <EOS> ")
corp <- tm_map(corp, swap, "\\? |\\?$", " <EOS> ")
corp <- tm_map(corp, swap, "! |!$", " <EOS> ")
# Identify and replace any typos with EOS punctuation
corp <- tm_map(corp, swap, "[[:alnum:]]+\\?[[:alnum:]]+", " <EOS> ")
corp <- tm_map(corp, swap, "[[:alnum:]]+![[:alnum:]]+", " <EOS> ")
# Delete any extra punctuation
corp <- tm_map(corp, swap, "!", " ")
corp <- tm_map(corp, swap, "\\?", " ")
# Update various occurrences of u.s to US
corp <- tm_map(corp, swap, "u\\.s", "US")
corp <- tm_map(corp, swap, "\\.", "")
# Delete other unnecessary punctuation characters
corp <- tm_map(corp, swap, " 's", " ")
corp <- tm_map(corp, swap, " ' ", " ")
corp <- tm_map(corp, swap, "\\\\", " ")
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
smallDir <- paste(myDir, "data/corp", sep = "")
# Create the output directory only if it does not already exist
if(!dir.exists(smallDir)) {dir.create(smallDir, recursive = TRUE)}
writeCorpus(corp, smallDir, filenames = c("cleanSmallBlogs1", "cleanSmallNews1", "cleanSmallTweets1"))
At this point I have to figure out a few things:
1.) How can I do a better job of optimizing this repetitive code with a more suitable approach? (One idea is sketched just after this list.)
2.) I decided to report Sys.time() with each chunk in order to understand where the code is slow. I ended up setting cache=FALSE in order to avoid the memory thrashing that caused this to run forever.
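On the first point, one option is to keep the substitution patterns in a list and loop over them with the same swap transformer defined above. The rules list below is only illustrative; it is not the full set of substitutions used earlier.
# Sketch: collapse the repeated tm_map(corp, swap, ...) calls into a loop.
# Each rule is c(pattern, replacement); the list is illustrative, not complete.
rules <- list(
c("<.*>", " "),               # unicode tags
c("#.*", " "),                # twitter hashtags
c("www\\..*", " "),           # website names
c("[0-9]+\\.[0-9]+", ""),     # numbers with decimal places
c("([\\.\\?!]){2,}", ". ")    # repeated end-of-sentence punctuation
)
for (r in rules) {
corp <- tm_map(corp, swap, r[1], r[2])
}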
The next step in this process is to evaluate the most frequent sets of one, two, and three word pairings. First I will examine a sample of text from one of the documents.
Sys.time()
## [1] "2018-06-22 14:18:02 EDT"
rm(corp)
# Reload the corpus from the new file to ensure changes are set
myDir <- "./final/en_US/SMALL FILES/"
smallDir <- paste(myDir, "data/corp", sep = "")
smallDir
## [1] "./final/en_US/SMALL FILES/data/corp"
myChars <- function(x, n = seq(x)) {
# x: a Corpus
# n: The elements of x for which characters will be returned
require(dplyr)    # for the pipe operator
require(stringr)  # for str_split
t <- character()
for(i in n) {
t <- c(t, x[[i]][[1]])
}
t %>%
str_split("") %>%
sapply(function(x) x[-1]) %>%
unlist %>%
unique %>%
sort(decreasing = FALSE)
}
corp <- VCorpus(DirSource(smallDir))
print(myChars(corp), quote = F)
## [1] ' < > a b c d e E f g h i j k l m n o O p q r s S t u U v w x y z
print(strwrap(corp[[2]]$content[c(4,6)]), quote=F)
## [1] and she rarely fails to connect with an audience <EOS>
## [2] their relationship quickly deteriorated and winsor said a fight in
## [3] november brought gilbert police to the home <EOS>
I am still trying to understand why I had to recreate a function (myChars) and two constants (myDir and smallDir).
This demonstrates that the text now has far fewer distinct characters than the original document, with only apostrophes left as ordinary punctuation. This allows like words to be matched more easily.
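On the recreated function and constants noted above, one workaround (a sketch only; the .rds path is hypothetical) is to persist intermediate objects with saveRDS() and reload them with readRDS() rather than redefining them after clearing the workspace.
# Sketch: persist and reload intermediate objects rather than redefining them.
cacheFile <- "./final/en_US/myChars.rds"
if (!file.exists(cacheFile)) saveRDS(myChars, cacheFile)
myChars <- readRDS(cacheFile)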
My first attempt to count terms using the DocumentTermMatrix function again demonstrated that I am in dire need of an upgrade to my 10-year-old MacBook Pro. After searching a bit I found examples where the quanteda package provided the functionality I needed and worked within the constraints of my hardware.
I also found it useful to split the 1-, 2-, and 3-gram counts into separate chunks. By the time I get to the tri_frequency chunk, the code runs for over an hour, as the Sys.time() output shows.
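One possible memory saver I have not yet tried (a sketch only): quanteda's textstat_frequency() builds the frequency table directly from the sparse dfm, avoiding the dense as.matrix() conversion used in freq_df() below. In recent quanteda releases this function lives in the companion quanteda.textstats package.
# Sketch: an alternative to freq_df() that keeps the dfm sparse.
# textstat_frequency() returns feature / frequency / rank / docfreq columns;
# in quanteda >= 3 it is provided by the quanteda.textstats package.
freq_df_sparse <- function(x) {
fr <- textstat_frequency(x)
data.frame(n_gram = fr$feature, freq = fr$frequency, row.names = NULL)
}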
Sys.time()
## [1] "2018-06-22 14:22:37 EDT"
corp <- quanteda::corpus(corp)
# This function takes a document-feature matrix and returns a sorted N-gram frequency table
freq_df <- function(x){
fr <- sort(colSums(as.matrix(x)),decreasing = TRUE)
df <- data.frame(n_gram = names(fr), freq=fr, row.names = NULL)
return(df)
}
# Create Dataframes
uni <- dfm(tokens(corp, remove_symbols = TRUE), tolower = FALSE)
uni_freq <- freq_df(uni)
rm(uni)
uni_freq <- uni_freq[-grep("<", uni_freq$n_gram),]
uni_freq <- uni_freq[-grep("EOS", uni_freq$n_gram),]
uni_freq <- uni_freq[-grep(">", uni_freq$n_gram),]
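With the unigram frequencies in hand, question 3 from the task list (how many unique words are needed to cover 50% and 90% of all word instances) can be answered with a cumulative sum over the sorted frequency column; a short sketch:
# Coverage from the frequency-sorted unigram table built above.
coverage <- cumsum(uni_freq$freq) / sum(uni_freq$freq)
words_for_50 <- which(coverage >= 0.5)[1]
words_for_90 <- which(coverage >= 0.9)[1]
c(cover_50 = words_for_50, cover_90 = words_for_90)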
With the uni_frequency chunk complete, I move on to pairs of words.
Sys.time()
## [1] "2018-06-22 14:24:03 EDT"
biToks <- tokens_ngrams(tokens(corp, remove_symbols = TRUE), n = 2L)
bi <- dfm(biToks, tolower=FALSE)
rm(biToks)
bi_freq <- freq_df(bi)
rm(bi)
bi_freq <- bi_freq[-grep("<", bi_freq$n_gram),]
bi_freq <- bi_freq[-grep("EOS", bi_freq$n_gram),]
bi_freq <- bi_freq[-grep(">", bi_freq$n_gram),]
Again, these chunks are split so I can better track the progress of the code. On to the triplets of words.
Sys.time()
## [1] "2018-06-22 14:38:20 EDT"
triToks <- tokens_ngrams(tokens(corp, remove_symbols = TRUE), n = 3L)
tri <- dfm(triToks, tolower=FALSE)
rm(triToks)
tri_freq <- freq_df(tri)
rm(tri)
tri_freq <- tri_freq[-grep("<", tri_freq$n_gram),]
tri_freq <- tri_freq[-grep("EOS", tri_freq$n_gram),]
tri_freq <- tri_freq[-grep(">", tri_freq$n_gram),]
Based on the analysis above, we can now graph the most common 1-, 2-, and 3-word sequences across the corpus.
Sys.time()
## [1] "2018-06-22 15:19:13 EDT"
top40 <- function(df, title) {
df <- df[1:40,]
df$n_gram <- factor(df$n_gram, levels = df$n_gram[order(-df$freq)])
ggplot(df, aes(x = n_gram, y = freq)) +
geom_bar(stat = "identity", fill = "deepskyblue2", colour = "ivory3") +
labs(title = title, x="N-Gram", y="Count") +
theme(axis.text.x = element_text(angle=60, size=12, hjust = 1), axis.title = element_text(size=14, face="bold"), plot.title = element_text(size=16, face="bold"))
}
top40(uni_freq, "40 Most Common Unigrams")
top40(bi_freq, "40 Most Common Bigrams")
top40(tri_freq, "40 Most Common Trigrams")
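The wordcloud package loaded at the top has not been used yet; as a quick complement to the bar charts, here is a sketch of a unigram word cloud. The settings below are illustrative, not tuned.
# Sketch: word cloud of the most frequent unigrams (illustrative settings).
set.seed(42)
wordcloud(words = as.character(uni_freq$n_gram), freq = uni_freq$freq,
max.words = 100, random.order = FALSE)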
The quanteda package seems to provide the functionality I need and was able to run on my antique machine. I will continue to research other options to see if there is something I could do with the tm package to make it work more efficiently.
I need to do the following:
- Take a look at my pre-processing steps and determine if there is some way to optimize this code, as well as review any feedback on points I may have missed.
- Recreate the data frames of 1-, 2-, 3-, and if possible 4-grams using the larger training dataset, including word-relation frequencies.
- Look into options that will allow me to load part of the data at a time in order to work within the constraints of my current hardware.
- Test a variety of prediction algorithms, applying them to the development dataset to determine their accuracy, efficiency, and speed.
- Determine how to apply either Katz back-off or Kneser-Ney smoothing to deal with unknown words (a simplified sketch follows below).
- Accomplish the final goal by creating a Shiny app with a basic UI that provides word predictions as quickly and accurately as possible.
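As a first step toward the back-off item above, here is a much-simplified back-off lookup (no Katz discounting or Kneser-Ney smoothing yet). It assumes the n_gram columns keep the underscore-joined form produced by tokens_ngrams(), and that w1 and w2 contain no regex metacharacters; predictNext() is a hypothetical helper, not the final algorithm.
# Simplified back-off sketch: look up the last two words in the trigram table,
# fall back to the bigram table, then to the single most frequent unigram.
predictNext <- function(w1, w2, tri = tri_freq, bi = bi_freq, uni = uni_freq) {
tri_grams <- as.character(tri$n_gram)
bi_grams  <- as.character(bi$n_gram)
tri_pat <- paste0("^", w1, "_", w2, "_")
hit <- tri_grams[grepl(tri_pat, tri_grams)]
if (length(hit) > 0) return(sub(tri_pat, "", hit[1]))   # tables are sorted by freq
bi_pat <- paste0("^", w2, "_")
hit <- bi_grams[grepl(bi_pat, bi_grams)]
if (length(hit) > 0) return(sub(bi_pat, "", hit[1]))
as.character(uni$n_gram[1])
}
# predictNext("one", "of")   # hypothetical usage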