This is the peer-graded milestone report for Week 2 of the Coursera Data Science Capstone course. The purpose of this document is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
This report will also act as a basis for the next task report, so it should be kept as simple and concise as possible. In line with the goals above, the report is structured into 5 sections.
# set the working directory
setwd("/Users/priyadamodharan/Desktop")
# download the dataset
destfile = "./Coursera-SwiftKey.zip"
if(!file.exists(destfile)){
url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
file <- basename(url)
download.file(url, file, method="curl")
# uncompress the dataset
unzip(file)
}
Now that the setup is done, the dataset should be stored in the working directory. It contains corpora in 4 different languages: German, English, Finnish, and Russian. For each language there are 3 different sources: blogs, news, and Twitter. Although we will eventually use all of the language corpora, let's explore just the English dataset for now.
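As a quick sanity check, we can list the extracted files (the paths below assume the default folder layout of the Coursera-SwiftKey archive):
# Optional: list the extracted files to verify the folder structure
list.files("final", recursive = TRUE)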
# Read the blogs and Twitter data from English dataset into R
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# load necessary library
library(knitr)
library(stringi)
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data_summary <- data.frame(data_source = c("blogs", "news", "twitter"),
file_size_MB = c(blogs.size, news.size, twitter.size),
line_counts = c(length(blogs), length(news), length(twitter)),
words_counts = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
num_of_words_per_line_mean = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
kable(data_summary, caption = "English Dataset Summary")
| data_source | file_size_MB | line_counts | words_counts | num_of_words_per_line_mean |
|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 37546239 | 41.75107 |
| news | 196.2775 | 1010242 | 34762395 | 34.40997 |
| twitter | 159.3641 | 2360148 | 30093413 | 12.75065 |
We have done the basic exploration of the English dataset. From the table above, we can observe that the three sources are roughly equal in size. The Twitter dataset has the smallest file size (~159 MB) but more than double the number of lines of the other two sources, and it also has the fewest words and the lowest mean number of words per line. This is expected because tweets are short texts (with a defined character limit), and we also expect more ‘garbage words’ that need to be cleaned from the Twitter data than from the blogs and news data. Before we do some exploratory data analysis, let’s clean the data in the next section.
These steps are important for getting an accurate model. Since the dataset is very large (up to ~200 MB and over 2 million lines per source), let's take only 1% of the data as a sample to speed up the analysis and to validate the data cleanup. Next, let's load the sample data into a Corpus, the primary data structure used by the tm text-mining package, which is essentially a collection of text documents.
# load necessary library
library(tm)
## Loading required package: NLP
# Sample the data
set.seed(679)
data_sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data_sample))
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
The result is a structure of type VCorpus (a ‘virtual corpus’, i.e. loaded into memory) with 42,695 documents (each line of text in the source becomes a document in the corpus). The corpus object is a nested list (a list of lists): at each index of the VCorpus there is a PlainTextDocument object, which is essentially a list containing the actual text data (content) together with some corresponding metadata (meta). Keeping this structure in mind makes it easier to conceptualize and inspect a VCorpus object.
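For example (a quick illustration of the nested structure; the exact text and metadata fields shown depend on the sampled line and the tm version), we can peek at the first document:
# Peek at the first document in the corpus: its text content and its metadata
corpus[[1]]$content
meta(corpus[[1]])
Now we can start the data cleaning process on this corpus.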
The tm package provides several functions to carry out these tasks, which are applied to the document collection as transformations via the tm_map() function. We will perform the following data cleaning tasks through tm_map():
- Converting the corpus documents to lower case
- Removing stopwords (extremely common words such as “and”, “or”, “not”, “in”, “is”, etc.); we’ll first look at the standard English stop words
# List standard English stop words
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
We can see that there are still more very common words (such as the modal verbs ‘can’ and ‘will’) that are not in the standard list. Let’s add a few more stopwords.
updatedStopwords <- c(stopwords('en'), "can", "will", "want", "just", "like")
- Removing punctuation marks (periods, commas, hyphens, etc.)
- Removing numbers
- Removing extra whitespace
# Replace URLs and Twitter handles with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
# Convert to lower case and remove stopwords, punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, updatedStopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Convert the documents back to PlainTextDocument objects
corpus <- tm_map(corpus, PlainTextDocument)
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
We could improve further by removing an unwanted list of words such as profanity/foul language, ‘garbage words’, or informal/slang terms (commonly used on Twitter with less formal meaning). However, this should be done with extra care since it might alter the context or meaning of the original sentences. Hence, we won’t do this for now in the data cleaning steps.
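For reference, here is a minimal sketch of how such filtering could be done with the same removeWords() transformation, assuming a hypothetical word list file "profanity_words.txt" (one term per line); we do not apply it in this report:
# Hypothetical sketch (not applied in this report): filter words from a custom list.
# "profanity_words.txt" is a placeholder file with one unwanted word per line.
profanity <- readLines("profanity_words.txt", encoding = "UTF-8", skipNul = TRUE)
corpus_filtered <- tm_map(corpus, removeWords, profanity)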
In this stage, after the data clean-up, we will plot histograms of the most common terms remaining in the data. To generate the various n-grams from the corpus, we can use the n-gram tokenizer functions from the RWeka library and then construct a term-document matrix for the different n-gram tokens. Then, to plot the histograms of frequency vs n-gram terms, we can use the ggplot2 library.
We will build a term-document matrix, which represents each term as a row and each document as a column, so summing across a row gives the total frequency of a term in the corpus (the transposed document-term matrix, with each document as a row, is useful when you want to compare documents or authors, or keep chronologically ordered data as a time series). We have also used sparse term exclusion with a 0.9999 sparsity threshold. Sparsity is the relative proportion of documents in which a term does not appear; with our threshold, only terms that are more sparse than 0.9999 will be removed.
# load necessary library
library(RWeka)
# use a single core (helps avoid rJava/parallel issues with RWeka on some systems)
options(mc.cores=1)
# Compute term frequencies from a term-document matrix, sorted in decreasing order
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
The plots below show histograms of the ten most frequent terms (unigrams, bigrams, and trigrams) in the corpus. Each plot uses a distinct color to make it easy to distinguish between the n-gram statistics. Let’s start with the unigrams.
#load necessary library
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# create plot histogram function
makeHistogramPlot <- function(data, label, color) {
ggplot(data[1:10,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I(color))
}
# Create the dtm from the corpus:
corpus_dtm <- TermDocumentMatrix(corpus)
# Print out corpus_dtm data
corpus_dtm
## <<TermDocumentMatrix (terms: 57748, documents: 42695)>>
## Non-/sparse entries: 495632/2465055228
## Sparsity : 100%
## Maximal term length: 85
## Weighting : term frequency (tf)
# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(corpus_dtm, 0.9999))
makeHistogramPlot(freq1, "10 Most Common Unigrams","Black")
We can observe that some of the most common unigrams are ‘said’, ‘now’, and ‘time’. This is interesting and somewhat expected because the data source is a collection of news, blogs, and tweets, which mostly report on what other people said about certain events and when they happened. Let’s now look at the bigrams (pairs of words that appear together).
# Create the dtm from the corpus:
corpus_dtm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
# Print out corpus_dtm_bigram data
corpus_dtm_bigram
## <<TermDocumentMatrix (terms: 429121, documents: 42695)>>
## Non-/sparse entries: 500118/18320820977
## Sparsity : 100%
## Maximal term length: 101
## Weighting : term frequency (tf)
freq2 <- getFreq(removeSparseTerms(corpus_dtm_bigram, 0.9999))
makeHistogramPlot(freq2, "10 Most Common Bigrams","Blue")
For the bigrams, we again observe that the most common terms are about timing, such as ‘last year’ and ‘right now’, and about places, such as ‘new york’ and ‘high school’. We also notice that several bigrams starting with ‘last’ appear at different rankings. Hence, when we design the text prediction, we can take a range of, say, the 10-20 most common bigrams and rank them, which would produce suggestion options listed in the order ‘last year’, ‘last night’, ‘last week’.
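As a rough illustration of that idea (using the freq2 bigram table computed above; the prefix ‘last’ is just an example), we can filter the bigram frequencies by a given first word and take the top-ranked continuations:
# Illustrative sketch: top-ranked bigram continuations for the word "last",
# which is the kind of lookup the prediction app could use for its suggestions
last_bigrams <- freq2[grepl("^last ", freq2$word), ]
head(last_bigrams[order(-last_bigrams$freq), ], 5)
Let’s now continue with the trigrams.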
# Create the dtm from the corpus:
corpus_dtm_trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
# Print out corpus_dtm_trigram data
corpus_dtm_trigram
## <<TermDocumentMatrix (terms: 458174, documents: 42695)>>
## Non-/sparse entries: 461682/19561277248
## Sparsity : 100%
## Maximal term length: 108
## Weighting : term frequency (tf)
freq3 <- getFreq(removeSparseTerms(corpus_dtm_trigram, 0.9999))
makeHistogramPlot(freq3, "10 Most Common Trigrams", "Yellow")
For the trigrams, we can see some congratulatory messages related to occasional events such as ‘mother’s day’ and ‘new year’, which we expect to appear a lot in conversational text like tweets. We can also see expressions of time, place, and certain subjects (e.g. ‘president barack obama’), which usually appear a lot in news or tweets when the subject is a popular topic.
Alternatively, the n-gram statistics can also be computed with the term_stats() function from the corpus package by specifying the number of n-grams to calculate (below are the trigram results, without sparse term removal or a document matrix).
library("corpus")
term_stats(corpus, ngrams = 3, types = TRUE)
## Warning in term_stats(corpus, ngrams = 3, types = TRUE): renaming entries with
## duplicate names
## term type1 type2 type3 count support
## 1 “ ' s “ ' s 64 63
## 2 happy mothers day happy mothers day 42 41
## 3 ' m sure ' m sure 25 24
## 4 let us know let us know 23 23
## 5 “ ' m “ ' m 23 22
## 6 new york city new york city 20 20
## 7 think ' s think ' s 21 18
## 8 ' m going ' m going 18 18
## 9 ' ve got ' ve got 18 18
## 10 happy new year happy new year 18 18
## 11 president barack obama president barack obama 18 18
## 12 said “ ' said “ ' 18 18
## 13 ” said “ ” said “ 18 18
## 14 – ' s – ' s 16 16
## 15 ' s going ' s going 15 15
## 16 two years ago two years ago 15 15
## 17 ' s ' ' s ' 15 14
## 18 ' s hard ' s hard 14 14
## 19 ' s one ' s one 14 14
## 20 ' ve never ' ve never 14 14
## ⋮ (462686 rows total)
We have finished examining the dataset and gathered some interesting findings from the exploratory analysis. Now we are ready to train and create our first predictive model. Machine learning is an iterative process: we preprocess the training data, then train and evaluate the model, and repeat these steps to obtain a better-performing model based on our evaluation metrics.
The Shiny app we are going to build on top of the trained predictive model will have functionality similar to the SwiftKey app. It will have a text field for the user to type into, and it will pop up 3-5 suggestions for what the next word might be; the user can choose one and it will be appended automatically to what they have already typed (saving a lot of typing time!). A rough sketch of the kind of prediction logic we have in mind is shown below.
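This is only a minimal frequency-based back-off sketch over the freq3, freq2, and freq1 tables built above; the function name suggestNextWords and the ranking scheme are placeholders, not the final design:
# Hypothetical sketch of a back-off next-word lookup using the n-gram tables above.
# It searches the trigram table for the last two typed words, falls back to the
# bigram table for the last word, and finally to the most frequent unigrams.
suggestNextWords <- function(input, n = 3) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 2)
  candidates <- character(0)
  if (length(words) == 2) {
    hits <- freq3[grepl(paste0("^", words[1], " ", words[2], " "), freq3$word), ]
    candidates <- sub(".* ", "", as.character(hits$word))
  }
  if (length(candidates) < n && length(words) >= 1) {
    last <- words[length(words)]
    hits <- freq2[grepl(paste0("^", last, " "), freq2$word), ]
    candidates <- c(candidates, sub(".* ", "", as.character(hits$word)))
  }
  candidates <- c(candidates, as.character(freq1$word))
  head(unique(candidates), n)
}
# Example usage:
suggestNextWords("happy mothers")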
Before we end this report, it is important to note that each of the steps (data cleaning, preprocessing, model training, and evaluation) is important and needs to be re-evaluated continuously to get a working and accurate ML model for our predictive text app. We look forward to the next report on the predictive model and Shiny app we are going to build!