The Coursera capstone project will use Natural Language Processing (NLP) to create a model for predicting the next word or words as someone is typing. This report focuses on cleaning and exploring three types of social media text files: blogs, news articles, and Twitter feeds. This is an early stage in creating the predictive model, so the analysis will prepare the text files as corpora, remove offensive words, and then combine the words into 3-word n-grams (trigrams). Methods of analysis include word frequencies by media type.
The quanteda library was selected for processing the text files and analyzing the text into word frequencies and n-grams.
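For reference, the code in this report assumes the following libraries are loaded (a sketch; the exact set used in the original report may differ):

library(readtext)   # importing the raw text files
library(quanteda)   # tokens, document-feature matrices, n-grams
library(ggplot2)    # frequency bar charts
library(tibble)     # enframe() for converting named vectors to data frames
library(gridExtra)  # grid.arrange() for side-by-side plots
# Note: in quanteda version 3 and later, textstat_frequency() and
# textplot_wordcloud() live in the companion packages quanteda.textstats
# and quanteda.textplots.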
Three text files were provided by Coursera for the capstone project and have been split into sample files using operating system tools that randomly select records, so the data can be processed and trained on efficiently. To make the results reproducible, the sample files will be packaged with the code so the data remains the same during analysis. See the appendix for how the files were generated.
The sample files were generated by randomly selecting 5% of each original file.
Below is a summary of the original and sample media files. See the appendix for how the sizes and line counts were determined.
| Media Type | File Size (MB) | # of Lines | Sample Size (MB) | Sample # of Lines |
|---|---|---|---|---|
| Blogs | 210.16 | 899,288 | 10.58 | 44,964 |
| News | 205.81 | 77,259 | 10.29 | 77,259 |
| Twitter | 167.11 | 2,360,148 | 6.39 | 50,512 |
The readtext library was selected for importing the three media text files because of its ability to easily import different file formats as well as read the document-level metadata associated with the texts.
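The object mediaSamples is not defined in this report; a minimal sketch of one way to define it, assuming the sample files are laid out as in the appendix:

# Assumption: the three sample files live under ../final/en_US/ as in the appendix
mediaSamples <- "../final/en_US/*.sample.txt"

readtext() accepts wildcard patterns, so this single pattern pulls in all three sample files.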
media <- readtext(mediaSamples, cache = FALSE)
Each media text file will be stored as word tokens, which allows for easier transformation of the data, frequency counts, and creation of text corpora or document matrices used for word analysis and natural language processing.
Below, the media files are first converted to lower-case word tokens and then transformed to remove inappropriate words, punctuation, numbers, separators, symbols, and URLs. Note that lower-casing improves word frequency counts. Additional cleansing removes stop words and any words shorter than three characters. The raw token dataset keeps all words except inappropriate words, so it can be used to compare the clean dataset against the raw dataset. I anticipate the raw dataset will be needed in order to predict the next possible word or words as someone is typing.
# Lower-cased raw tokens, removing only the inappropriate words
# (cursewords.list is a character vector of words to exclude, defined elsewhere)
rawTokens <- tokens_tolower(quanteda::tokens(media$text))
rawTokens <- tokens_select(rawTokens, cursewords.list, selection = "remove", case_insensitive = TRUE)
# Clean tokens: drop numbers, punctuation, separators, symbols, URLs and Twitter characters
mediaTokens <- quanteda::tokens(rawTokens,
    remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE,
    remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE)
# Remove English stop words and any words shorter than three characters
mediaTokens <- tokens_select(mediaTokens, stopwords("english"), selection = "remove", case_insensitive = TRUE)
mediaTokens <- tokens_select(mediaTokens, selection = "keep", min_nchar = 3, case_insensitive = TRUE)
Below is a summary of the word tokens in both datasets (a sketch of how the counts can be computed follows the table).
Word Counts:
| Media Type | Raw Word Count | Word Count (after cleaning) |
|---|---|---|
| Blogs | 2,230,205 | 949,719 |
| News | 2,021,414 | 965,683 |
| Twitter | 1,406,283 | 609,702 |
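A minimal sketch of how these counts can be computed, using quanteda's ntoken() on the two token sets:

# Word counts per media document, before and after cleaning
ntoken(rawTokens)
ntoken(mediaTokens)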
According to Wikipedia, a term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix (DTM), rows correspond to documents in the collection and columns correspond to terms. The cleaned media tokens will be stored in a document-feature matrix (DFM) from the quanteda library, which makes transforming the data and processing the words easier.
Stemming removes suffixes to reduce inflectional forms and derivationally related forms of a word to a common base form. For example, toy, toys, and toy's all reduce to the root word toy.
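As a small illustration (separate from the analysis itself), quanteda exposes the stemmer directly:

library(quanteda)
# Each word is reduced to its stem, so inflected forms collapse to a common base
char_wordstem(c("toy", "toys", "working", "works"))

The per-media document-feature matrices below apply the same stemming when the DFM is created.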
# Build a stemmed document-feature matrix (DFM) for each media type
blogs <- mediaTokens$blogs
blogs.dfm <- quanteda::dfm(blogs, stem = TRUE)
news <- mediaTokens$news
news.dfm <- quanteda::dfm(news, stem = TRUE)
twitter <- mediaTokens$twitter
twitter.dfm <- quanteda::dfm(twitter, stem = TRUE)
The plot below shows the top 20 words found in each of the media datasets.
# Top 20 stemmed words per media type, converted to data frames and plotted as horizontal bar charts
blogs.top <- quanteda::topfeatures(blogs.dfm, n=20, scheme = "count")
blogs.df <- enframe(blogs.top)
names(blogs.df) <- c("words","frequency")
blogs.df$words <- reorder(blogs.df$words, blogs.df$frequency)
g.blogs.top <- ggplot(blogs.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Blog Words")
news.top <- quanteda::topfeatures(news.dfm, n=20, scheme = "count")
news.df <- enframe(news.top)
names(news.df) <- c("words","frequency")
news.df$words <- reorder(news.df$words, news.df$frequency)
g.news.top <- ggplot(news.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "News Words")
twitter.top <- quanteda::topfeatures(twitter.dfm, n=20, scheme = "count")
twitter.df <- enframe(twitter.top)
names(twitter.df) <- c("words","frequency")
twitter.df$words <- reorder(twitter.df$words, twitter.df$frequency)
g.twitter.top <- ggplot(twitter.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Twitter Words")
grid.arrange(g.blogs.top, g.news.top, g.twitter.top, nrow=1)
For visualization, a word cloud is used to plot the most frequent words (features) in the combined media dataset, with each word's size proportional to its frequency.
combined.dfm <- quanteda::dfm(mediaTokens, stem = TRUE)
quanteda::textplot_wordcloud(combined.dfm, min.freq = 60, random.order = FALSE,
    rot.per = .10, colors = RColorBrewer::brewer.pal(8, 'Dark2'),
    comparison = FALSE, max_words = 200)
textstat_frequency(combined.dfm, n=20)
## feature frequency rank docfreq group
## 1 said 14925 1 3 all
## 2 one 14160 2 3 all
## 3 just 13615 3 3 all
## 4 get 13287 4 3 all
## 5 like 13237 5 3 all
## 6 time 11804 6 3 all
## 7 can 11197 7 3 all
## 8 year 10000 8 3 all
## 9 day 9836 9 3 all
## 10 make 9380 10 3 all
## 11 new 8837 11 3 all
## 12 love 8583 12 3 all
## 13 know 8293 13 3 all
## 14 work 8111 14 3 all
## 15 good 8072 15 3 all
## 16 now 7982 16 3 all
## 17 say 7445 17 3 all
## 18 want 7326 18 3 all
## 19 peopl 7314 19 3 all
## 20 see 7056 20 3 all
The outcome of this project is to create a Shiny application that will predict the next word or set of words a user may type while composing a message or search term. Today, Google displays suggested words or completed phrases while you type an email message. How will the application perform this feat and make the right suggestions at least 80% of the time?
N-grams are sets of co-occurring words within a given window; when computing n-grams, you typically move forward one word at a time. N-grams are used to develop models for a variety of tasks such as spelling correction, word breaking, word prediction, and text summarization. We will explore different n-gram sizes such as unigrams (N=1), bigrams (N=2), trigrams (N=3), and n-grams greater than 3.
Example of n-grams for the sentence "Life is trying things to see if they work" (a code sketch follows the table):
| bigrams | trigrams | n-grams (N=4) |
|---|---|---|
| Life is | Life is trying | Life is trying things |
| is trying | is trying things | is trying things to |
| trying things | trying things to | trying things to see |
| things to | things to see | things to see if |
| to see | to see if | to see if they |
| see if | see if they | see if they work |
| if they | if they work | |
| they work | | |
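A minimal sketch of generating these n-grams with quanteda (the space concatenator is used here purely for readability; the analysis itself keeps the default "_"):

library(quanteda)
sentence <- tokens("Life is trying things to see if they work")
tokens_ngrams(sentence, n = 2, concatenator = " ")  # bigrams
tokens_ngrams(sentence, n = 3, concatenator = " ")  # trigrams
tokens_ngrams(sentence, n = 4, concatenator = " ")  # n-grams (N=4)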
This is where the fun happens: using different n-grams to predict the next word or set of words. At this point I am not sure whether a bigram, trigram, or larger n-gram is the best way to predict the next set of words. In this phase, analysis will have to be done across all potential n-gram sizes to determine which works best, or whether combining multiple n-grams produces the best outcome.
Relative frequency, which estimates the n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix (the preceding words), will be used to help predict the next word based on the different n-grams. But a word that is frequent in one media document may not be as frequent in another; for example, words used when tweeting may not be frequent or used at all in blogs. Therefore, the relative frequency will have to be based on n-grams per media type.
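A minimal sketch of this relative-frequency (maximum likelihood) estimate, built here from trigram and bigram counts using quanteda::tokens_ngrams on the cleaned tokens; this is illustrative only, not the final model:

library(quanteda)
# Count trigrams and bigrams in the cleaned tokens
tri.counts <- table(unlist(as.list(tokens_ngrams(mediaTokens, n = 3))))
bi.counts  <- table(unlist(as.list(tokens_ngrams(mediaTokens, n = 2))))
# For a trigram "w1_w2_w3" the prefix is the bigram "w1_w2";
# dividing the counts estimates P(w3 | w1, w2)
prefix   <- sub("_[^_]+$", "", names(tri.counts))
rel.freq <- setNames(as.numeric(tri.counts) / as.numeric(bi.counts[prefix]),
                     names(tri.counts))

To base this on media type, the same calculation would simply be repeated per media document rather than on the combined tokens.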
This is just a start at how to predict the next word. Further reading of other people's research will be used to help formulate the correct predictive model.
A Shiny application will be created which will allow a user to enter text while it predicts the next word. Below is a list of key features anticipated for the final product:
Because the text files were large and I was unsure of the processing power available to anyone pulling the code with the data, which could overwhelm their machine, I felt it was better to create sample files at the operating system level and provide them with the project. I may also create training files, but that will be determined at a later date.
Here is how the files were created, using Windows 10 and Unix/Linux-like commands to randomly generate sample files containing 5% of the data:
sampleSize=$(awk 'END{print int((NR==0)?0:(NR*0.05))}' en_US.twitter.txt)
shuf -n $sampleSize en_US.twitter.txt > somefile_
Below is the code used to generate the file sizes and number of lines. The file size was converted to MB for human readability.
# File sizes in MB
blogSize <- round(file.info(blogFile)["size"][,1] / 1000 / 1000, 2)
twitterSize <- round(file.info(twitterFile)["size"][,1] / 1000 / 1000, 2)
newsSize <- round(file.info(newsFile)["size"][,1] / 1000 / 1000, 2)
# Number of lines per original file
con <- file(blogFile, "r")
blogNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(twitterFile, "r")
twitterNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(newsFile, "r")
newsNbrOfLines <- NROW(readLines(con))
close(con)
# Samples
blogSampleFile <- "../final/en_US/blogs.sample.txt"
twitterSampleFile <- "../final/en_US/twitter.sample.txt"
newsSampleFile <- "../final/en_US/news.sample.txt"
blogSampleSize <- round(file.info(blogSampleFile)["size"][,1] / 1000 / 1000, 2)
twitterSampleSize <- round(file.info(twitterSampleFile)["size"][,1] / 1000 / 1000, 2)
newsSampleSize <- round(file.info(newsSampleFile)["size"][,1] / 1000 / 1000, 2)
# Number of lines per sample file
con <- file(blogSampleFile, "r")
blogSampleNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(twitterSampleFile, "r")
twitterSampleNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(newsSampleFile, "r")
newsSampleNbrOfLines <- NROW(readLines(con))
close(con)
The quanteda library will be used to create the n-grams. Below is an example of generating 3-grams (trigrams) from the cleaned tokens.
# Create trigrams from the cleaned tokens
# (newer quanteda versions use tokens_ngrams(mediaTokens, n = 3) instead)
mediaNgram3 <- tokens(mediaTokens, ngrams = 3)
head(mediaNgram3$blogs, 10)
## [1] "use_make_hard-boiled" "make_hard-boiled_eggs"
## [3] "hard-boiled_eggs_think" "eggs_think_pretty"
## [5] "think_pretty_genius" "pretty_genius_hehe"
## [7] "genius_hehe_donâ" "hehe_donâ_kick"
## [9] "donâ_kick_can" "kick_can_blond"