The Data Science Specialization, offered by Johns Hopkins University through Coursera, consists of nine courses and culminates in a tenth, the Capstone Project. The ultimate goal of the Capstone Project is to produce a Shiny web application that offers reasonable predictions of candidate words based on user input, similar to the products of SwiftKey, a leading provider of intuitive and personalized keyboard software. In doing so, the knowledge and techniques acquired in the previous nine courses are applied and demonstrated.
This milestone report documents the work done midway through the Capstone Project toward that goal. It describes the data used in the project, the data cleansing techniques applied, the summary statistics calculated on the cleansed data, and the suggested approaches for predictive algorithms, among other things.
The datasets used in this project come from a corpus called HC Corpora and are available here; what little information exists about the corpus can be found at this site.
The zip file takes approximately 30 minutes to download over a standard Wi-Fi connection. Once downloaded, it was extracted into a directory local to the Capstone project files. The zip file contains a top-level directory named final. Within final are four sub-directories housing datasets for each of these languages: German (de_DE), English (en_US), Finnish (fi_FI), and Russian (ru_RU).
The project will examine only the English datasets, so the other three were deleted. The following table summarizes some statistics about the English datasets:
| Filename | Lines | Words | Characters | Longest Line (chars) |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334117 | 208623081 | 40833 |
| en_US.news.txt | 1010242 | 34365936 | 205243643 | 11384 |
| en_US.twitter.txt | 2360148 | 30373559 | 166816544 | 173 |
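For reference, a rough sketch of how these per-file statistics could be gathered in base R follows; the directory path final/en_US and the whitespace-based word tokenization are assumptions, so small discrepancies with counts from tools such as wc are expected:

# Per-file line, word, character, and longest-line statistics (sketch)
files <- list.files("final/en_US", full.names = TRUE)
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(basename(f),
      "lines:", length(lines),
      "words:", sum(lengths(strsplit(lines, "\\s+"))),
      "chars:", sum(nchar(lines)),
      "longest line:", max(nchar(lines)), "\n")
}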
The environment and packages used to process the data after download and extraction are described next.
The datasets were housed and processed on a Dell Latitude E6530 laptop with 15.5 GiB of memory and an Intel® Core™ i7-3520M CPU @ 2.90GHz × 4 processor. The OS in use is a 64-bit version of Ubuntu 16.04.
This project makes extensive use of the text mining framework provided by the tm package for data import, cleansing, matrix manipulation, and correlative information about the data. In addition, the wordcloud package is used to visualize high-frequency words using color weighting. Finally, the RWeka package provides an R interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java, with tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
Several techniques were considered for loading the datasets. Ultimately, the methods provided by the tm package proved reliable and fast enough to load all three datasets into a corpus:
library(tm)   # text mining framework
corpora <- Corpus(DirSource(dataFiles))
Timings were taken, showing roughly 2 minutes to load all three files into memory:
| user (sec) | system (sec) | elapsed (sec) |
|---|---|---|
| 123.924 | 0.804 | 124.618 |
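A sketch of how the load and its timing might be captured; the dataFiles path and the encoding argument are assumptions:

# Time the corpus load; the directory path is an assumption
library(tm)
dataFiles <- "final/en_US"
load.time <- system.time(corpora <- Corpus(DirSource(dataFiles, encoding = "UTF-8")))
print(load.time)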
No corruption of the data was encountered, nor were any errors issued during the data import process. The following table shows a reduction in the number of characters from what was originally reported. The percentage of potential data loss is not significant enough to warrant further investigation, as the population remains sufficiently large for analysis.
| Corpus ID | Characters (after import) |
|---|---|
| en_US.blogs.txt | 206824505 |
| en_US.news.txt | 203223159 |
| en_US.twitter.txt | 162096031 |
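The post-import character counts above can be obtained from the loaded corpus itself; a minimal sketch, using the same $content accessor that appears later in the sampling step:

# Character count per document in the loaded corpus
sapply(seq_along(corpora), function(i) sum(nchar(corpora[[i]]$content)))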
The data is largely plain-text, with various forms of punctuation and digits interspersed. Although an effort was made to filter the text according to the source language, a few aberrant Unicode and other non-alphabetic characters still remain, as the following sections show.
A typical line, using the en_US.news.txt dataset, looks like this:
"The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
An example of an emoticon (or emoji) which appears in the en_US.twitter.txt dataset is shown below:
"I'm doing it!👦"
An example of Unicode text which can be found in the en_US.news.txt dataset is shown here:
"\u0093I was just trying to hit it hard someplace,\u0094 said Rizzo, who hit the pitch to the opposite field in left-center. \u0093I\u0092m just up there trying to make good contact.\u0094"
Before any meaningful data mining work can be done, the corpora needs to be cleansed a bit further. There are several standard steps performed in most data mining efforts; most of them are listed in the next sub-section.
In order to efficiently remove certain undesired patterns from the data, a content_transformer (i.e., a function which modifies the content of an R object) is assigned to an R variable. This content_transformer is defined with a simple function that replaces undesired patterns with a blank and is passed to the tm_map method of the tm package.
#
# Convert the specified pattern to blank, given a vector
#
toBlank <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
The standard steps for cleaning the corpora are self-explanatory and listed in the code segment below:
# Remove punctuation and digits
corpora <- tm_map(corpora, toBlank, "[[:punct:][:digit:]]")
# Transform to lower case (need to wrap in content_transformer)
corpora <- tm_map(corpora, content_transformer(tolower))
# Remove stopwords
corpora <- tm_map(corpora, removeWords, stopwords("english"))
# Remove whitespace
corpora <- tm_map(corpora, stripWhitespace)
# Trim any remaining leading and trailing whitespace
corpora <- tm_map(corpora, toBlank, "^[ \t]+|[ \t]+$")
# Remove lines consisting only of blanks
corpora <- tm_map(corpora, toBlank, "^[[:blank:]]*$")
# Now all data is in lowercase but there are some non-Latin characters to remove
corpora <- tm_map(corpora, toBlank, "[^a-z ]")
Finally, once this level of cleansing is done, words of an offensive nature (not listed here) are removed using a similar technique with tm_map and the toBlank function. The list of offensive words is by no means comprehensive but does seek to exclude the more common (and most offensive) words potentially contained in the corpora.
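A sketch of how this step might look; the file name badwords.txt and the word-boundary pattern are assumptions, not the actual list or pattern used:

# Remove profanity using the same toBlank transformer; the word list file is hypothetical
badwords <- readLines("badwords.txt", encoding = "UTF-8")
profanity.pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
corpora <- tm_map(corpora, toBlank, profanity.pattern)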
Note: Stemming was not performed on these corpora due to processing efficiency issues.
Now that the population has been sanitized, the EDA and statistical analysis can begin.
The first action is to take a reasonably sized sample from the population, keeping in mind that setting a seed in the take.sample function enforces reproducible results:
# Sample from the population
sample.size <- 5000
total.sample <- c(take.sample(corpora[[1]]$content, sample.size),
take.sample(corpora[[2]]$content, sample.size),
take.sample(corpora[[3]]$content, sample.size))
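The take.sample function is not part of any package; a minimal definition consistent with its use above might look like the following (the default seed value is an assumption):

# Draw a reproducible random sample of lines from a character vector
take.sample <- function(lines, size, seed = 1234) {
  set.seed(seed)                      # fixed seed => reproducible sample
  sample(lines, size, replace = FALSE)
}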
Next, construct a VCorpus (a volatile corpus, i.e., one kept entirely in memory, so that all changes affect only the corresponding R object) from the sample so that a DocumentTermMatrix can be built from it.
# Construct a VCorpus from the sampling
corpora.sample <- VCorpus(VectorSource(total.sample))
A DocumentTermMatrix is, as its name suggests, a matrix in which the rows are documents and the columns are the terms contained in those documents. It takes about 16 seconds to construct a DocumentTermMatrix from the sample:
# Construct a DTM from the sample
dtm <- DocumentTermMatrix(corpora.sample)
This DocumentTermMatrix has the following characteristics:
Number of documents: 15000
Number of terms: 34895
Non-/sparse entries: 219530/523205470
Sparsity : 100%
As with most text-mining efforts, the matrices tend to be sparse. Efforts to reduce the sparsity and compact the matrix produced little of value:
removeSparseTerms(dtm, 0.99994)
<<DocumentTermMatrix (documents: 15000, terms: 35179)>>
Non-/sparse entries: 219859/527465141
Sparsity : 100%
The tm package provides methods for determining which words fall within a bounded frequency range as well as the correlation of a given word with other words across the documents in the corpus. As an example of the first, determine which words occur at least 500 times but no more than 1000 times:
findFreqTerms(dtm, lowfreq = 500, highfreq = 1000)
[1] "also" "back" "day" "dont" "first" "get" "going" "good"
[9] "know" "last" "love" "make" "much" "new" "now" "people"
[17] "see" "time" "two" "well" "year" "years"
As an example of the second, one word from the previous list is selected and used as input to the findAssocs method. This method also accepts a correlation parameter which ranges in value from 0 to 1. With a value of Z = 1, only words that always occur together with the chosen term are reported; as the value moves closer to zero, progressively weaker associations are included. It is NOT an indicator of nearness, as the DTM is just a “bag of words”.
term <- "two"
Z <- 0.1
findAssocs(dtm, term, Z)
$two
agonizing commotionuntil culled fuzzy lilburn
0.12 0.12 0.12 0.12 0.12
mullinax prettier roosters suwanee techie
0.12 0.12 0.12 0.12 0.12
wyandotte bannister innings predictable
0.12 0.11 0.11 0.10
Since the columns of the matrix hold the frequency counts of each word across all documents, the sums of those counts can be obtained and ordered from highest to lowest:
# Calculate the frequency of each word
freq <- colSums(as.matrix(dtm))
# Create sort order (descending)
freq.order <- order(freq, decreasing = TRUE)
# Show the most frequent terms (head) and the least frequent terms (tail)
freq[head(freq.order)]
said will one just like can
1548 1355 1235 1096 1062 1033
freq[tail(freq.order)]
zuerlein zunino zurich zusi zweerman zydrunas
1 1 1 1 1 1
The classic “5-number summary” (plus the mean reported by R) shows that the data is heavily skewed to the right:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 6.749 3.000 1548.000
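These figures match the output of R's summary() function applied to the frequency vector computed above:

# Five-number summary (plus mean) of the per-term frequency counts
summary(freq)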
Standard deviation is the best measure of spread for an approximately normal distribution. That is not the case when a distribution contains extreme values or is skewed, as in the situation here, so neither the standard deviation nor the variance is calculated for this data.
Finally, a plot of the distribution of words that occur more than 400 times in the sample is shown below; it shows a definite skew to the right:
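A sketch of how such a plot of the frequent terms might be drawn; the choice of ggplot2, the 400-occurrence cutoff variable, and the bar-chart form are assumptions:

# Bar plot of terms appearing more than 400 times in the sample
library(ggplot2)
freq.df <- data.frame(term = names(freq), count = freq)
freq.df <- freq.df[freq.df$count > 400, ]
ggplot(freq.df, aes(x = reorder(term, -count), y = count)) +
  geom_col() +
  labs(x = "Term", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))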
A word cloud is a visual representation of text data, typically used to visualize free-form text as well as other text-based data sources such as website metadata. Below is a word cloud of the words in the sample having a minimum frequency of 150 occurrences.
Wordcloud of the DTM
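A sketch of the call that produces such a cloud, using the wordcloud package and the freq vector from above; the colour palette and ordering options are assumptions:

# Word cloud of terms with at least 150 occurrences
library(wordcloud)
library(RColorBrewer)
wordcloud(names(freq), freq, min.freq = 150,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)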
Further analysis of the sample data will be done using n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on. An example of the code (using the RWeka package method NGramTokenizer) that will be used to perform this analysis follows; it is unlikely that an n-gram higher than three will be used:
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
Using these tokenizers, the following results were obtained; the first listing shows the ten most frequent unigrams, the second the ten most frequent bigrams, and the third the ten most frequent trigrams:
the to and a of in i that is for
22002 11782 11123 10511 9489 7552 6781 4669 4501 4485
of_the in_the to_the on_the for_the to_be and_the at_the in_a
2021 1900 955 855 829 684 617 615 570
with_the
523
one_of_the a_lot_of as_well_as out_of_the i_want_to going_to_be the_end_of some_of_the
159 134 83 75 72 70 70 68
it_was_a be_able_to
66 58
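These frequency listings could, for example, be produced by passing a tokenizer to the TermDocumentMatrix constructor and summing the counts per term; the sketch below shows the bigram case and assumes the corpora.sample object created earlier:

# Bigram frequencies: tokenize the sample corpus and sum counts per bigram
bigram.tdm  <- TermDocumentMatrix(corpora.sample,
                                  control = list(tokenize = bigramTokenizer))
bigram.freq <- sort(rowSums(as.matrix(bigram.tdm)), decreasing = TRUE)
head(bigram.freq, 10)    # ten most frequent bigrams

The underscore-joined terms shown in the listings above suggest an additional formatting step (joining the tokens of each n-gram with "_") that is not reproduced in this sketch.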
The UI for testing the prediction algorithm will be coded using the Shiny API and hosted on the shinyapps.io website. No wireframes have been produced yet, but the initial design concept is a very simple UI with an input text field, an output field for the predicted next word, and some usage information.