This is a milestone report for the Week 2 peer-graded assignment of the Data Science Capstone course from Coursera. The objectives of this document are as follows:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

This report will also serve as a base for the next assignment report, hence it should be as clear and concise as possible. The content of this report is structured into five sections following the objectives above.
The training dataset that will be the basis for most of the capstone must be downloaded from the link below, not from external websites: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Let's set our working directory, specify the URL from which the data is to be downloaded, then download the data and uncompress it.
# set the working directory
setwd("/Users/Farshad.DESKTOP-MQL7796/Desktop/Hosein Abbasian/Coursera/JohnsHopkins/Capstone")
# download the dataset (only if it is not already present)
destfile <- "./Coursera-SwiftKey.zip"
if (!file.exists(destfile)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile, method = "curl")
  # uncompress the dataset
  unzip(destfile)
}
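As a quick optional sanity check that the download and extraction worked, we can list the extracted files; the archive unpacks into a final/ directory with one sub-directory per language.
# optional sanity check: confirm the archive was extracted into final/
list.files("final", recursive = TRUE)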
The dataset should by now be in the directory we set. It contains corpora in 4 different languages: German, English, Finnish and Russian. Each language has data from 3 different sources: blogs, news and Twitter. While we may use all the different language corpora, for now let's explore just the English dataset.
# Read the blogs, news and Twitter data from the English dataset into R
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul =
## TRUE): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
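The warning above suggests that en_US.news.txt was not read to the end (on some systems the read stops early because the file contains embedded control characters and no final newline). A possible workaround, not applied in the rest of this report, is to read the file through a binary connection:
# possible workaround (not used below): read the news file via a binary connection
# so readLines does not stop early at embedded control characters
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)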
# load necessary library
library(knitr)
library(stringi)
# Get file sizes (in MB)
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Count the words per line in each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data_summary <- data.frame(data_source = c("blogs", "news", "twitter"),
file_size_MB = c(blogs.size, news.size, twitter.size),
line_counts = c(length(blogs), length(news), length(twitter)),
words_counts = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
num_of_words_per_line_mean = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
kable(data_summary, caption = "English Dataset Summary")
| data_source | file_size_MB | line_counts | words_counts | num_of_words_per_line_mean |
|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 37546239 | 41.75107 |
| news | 196.2775 | 77259 | 2674536 | 34.61779 |
| twitter | 159.3641 | 2360148 | 30093413 | 12.75065 |
We have now done a basic exploration of the English dataset. From the table above, we can observe that the three sources are roughly comparable in file size. The Twitter data has the smallest file size (~159 MB) but far more lines than the other two sources, and by far the lowest mean number of words per line. This is expected, because tweets are short texts with a fixed character limit; we also expect more 'garbage words' that need cleaning in the Twitter data than in the blogs and news data. Note that the line and word counts for the news source are probably understated because of the incomplete read reported in the warning above.
Before we do some exploratory data analysis, let's clean the data in the next section.
This is an important step for getting a more accurate model. Because the datasets are quite big (up to 200 MB and over 2 million lines per source), to speed up the data exploration and to test the data cleaning we take just 1% of each dataset as a sample. First, let's load the sample data into a corpus (a collection of documents), which is the main data structure used by the tm package.
# load necessary libraries
library(stringi)   # character string analysis
library(ggplot2)   # plotting
library(RWeka)     # n-gram tokenization
library(tm)        # text mining
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# Sample the data
set.seed(679)
data_sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data_sample))
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 33365
The result is a structure of type VCorpus (a 'virtual corpus', i.e. loaded into memory) with 33,365 documents (each line of text in the source is loaded as a document in the corpus). The VCorpus object is a nested list, or list of lists. At each index of the VCorpus object there is a PlainTextDocument object, which is essentially a list containing the actual text data (content) as well as some corresponding metadata (meta); this structure helps to visualize and conceptualize a VCorpus object. Now we can start the data cleaning process on this corpus.
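For example (assuming the corpus built above), the text and metadata of a single document can be inspected like this:
# peek at the first document in the corpus: its text content and its metadata
content(corpus[[1]])
meta(corpus[[1]])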
The tm package provides several functions to carry out these tasks, which are applied to the document collection as transformations via the tm_map() function. We will perform several data cleaning tasks through tm_map() as follows:
# List standard English stop words
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
We can see that a few more very common words (including further modal verbs) could be added that are not in the standard list. Let's add a few more stopwords.
# extend the standard stopword list with a few more very common words
updatedStopwords <- c(stopwords('en'), "can", "will", "want", "just", "like")
# transformer that replaces a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # strip URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # strip @mentions
# use content_transformer() so the documents remain PlainTextDocument objects
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, updatedStopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 33365
We could improve further by removing unwanted lists of words such as profanity/foul language, 'garbage words' or informal/slang language (commonly used on Twitter with a less formal meaning). However, this should be done with extra care since it might alter the context or meaning of the original sentences. Hence, we won't do this for now in the data cleaning steps.
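For illustration only, a minimal sketch of such a profanity filter, assuming a plain-text word list (one word per line) at a hypothetical path "profanity_list.txt" that is not part of this dataset:
# hypothetical: the word-list file name is an assumption, not part of the dataset
profanity <- readLines("profanity_list.txt", encoding = "UTF-8", skipNul = TRUE)
corpus <- tm_map(corpus, removeWords, profanity)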
In this step, we will plot histograms of the most common words remaining in the data after cleaning. We will use the RWeka library's n-gram tokenizer functions to create different n-grams from the corpus, then construct term-document matrices for the various n-gram tokens. We will then use the ggplot2 library to plot histograms of term frequency versus n-gram.
The document-term matrix is used when you want each document represented as a row, which can be useful when comparing documents by author or preserving a chronological order; here we use its transpose, the term-document matrix, with one row per term. We also remove sparse terms with a sparsity threshold of 0.9999. Sparsity refers to the threshold of relative document frequency for a term, above which the term is removed; with our threshold, only terms that are more sparse than 0.9999 are dropped.
# load necessary libraries
library(RWeka)
library(rJava)
options(mc.cores = 1)
# helper: total frequency of each term in a term-document matrix, sorted decreasing
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers that split text into 2-word and 3-word sequences
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
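As a quick illustration of what these tokenizers return (the example sentence is arbitrary), each call yields every consecutive 2-word or 3-word sequence in the input:
# quick check of the tokenizers on an arbitrary sentence
bigram("this is a short example sentence")
trigram("this is a short example sentence")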
The plots below show histograms of the ten most frequent n-grams (unigrams, bigrams, and trigrams) in the corpus. Each plot uses a distinct color to make it easy to distinguish the n-gram statistics. Let's start with the unigrams.
#load necessary library
library(ggplot2)
# create histogram plot function
makeHistogramPlot <- function(data, label, color) {
ggplot(data[1:10,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I(color))
}
# Create the dtm from the corpus:
corpus_dtm <- TermDocumentMatrix(corpus)
# Print out corpus_dtm data
corpus_dtm
## <<TermDocumentMatrix (terms: 45574, documents: 33365)>>
## Non-/sparse entries: 332921/1520243589
## Sparsity : 100%
## Maximal term length: 85
## Weighting : term frequency (tf)
# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(corpus_dtm, 0.9999))
makeHistogramPlot(freq1, "10 Most Common Unigrams","red")
We can observe that some of the most common unigrams are 'said', 'now' and 'time'. This is interesting and plausible, because the data source is a collection of news, blogs and tweets, which mostly report on what other people said about certain events and when they happened. Let's now look at the bigrams (pairs of words that appear together).
# Create the dtm from the corpus:
corpus_dtm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
# Print out corpus_dtm_bigram data
corpus_dtm_bigram
## <<TermDocumentMatrix (terms: 290239, documents: 33365)>>
## Non-/sparse entries: 334146/9683490089
## Sparsity : 100%
## Maximal term length: 101
## Weighting : term frequency (tf)
freq2 <- getFreq(removeSparseTerms(corpus_dtm_bigram, 0.9999))
makeHistogramPlot(freq2, "10 Most Common Bigrams","green")
For the bigrams, we again observe that the most common ones are about timing, such as 'last year' and 'right now', and about places, such as 'new york' and 'high school'. Notice also that several bigrams starting with 'last' appear at different ranks. Hence, when we design the text prediction, we can take a range (e.g. the 10-20 most common n-grams), rank them, and present the suggestions in that order, e.g. 'last year', 'last night', 'last week'; a quick sketch of this idea follows below.
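For instance (assuming the freq2 table computed above), the top-ranked candidates that follow the word 'last' can be pulled out like this:
# candidate suggestions following the word "last", ranked by bigram frequency
head(freq2[grepl("^last ", as.character(freq2$word)), ], 5)
Now let's continue with the trigrams.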
# Create the dtm from the corpus:
corpus_dtm_trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
# Print out corpus_dtm_trigram data
corpus_dtm_trigram
## <<TermDocumentMatrix (terms: 301796, documents: 33365)>>
## Non-/sparse entries: 304007/10069119533
## Sparsity : 100%
## Maximal term length: 108
## Weighting : term frequency (tf)
freq3 <- getFreq(removeSparseTerms(corpus_dtm_trigram, 0.9999))
makeHistogramPlot(freq3, "10 Most Common Trigrams", "blue")
For the trigrams, we can see congratulatory phrases tied to special occasions, such as 'happy mothers day' and 'happy new year', which we expect to appear a lot in conversational text like tweets. We also see expressions of time, place and certain subjects (e.g. 'president barack obama') that typically appear often in news or tweets when the subject is a popular topic.
Alternatively, the n-gram statistics can also be computed with the term_stats() function from the corpus package by specifying the number of n-grams to be calculated (below are the trigram results, computed without sparse term removal or a term-document matrix).
library("corpus")
term_stats(corpus, ngrams = 3, types = TRUE)
## Warning in term_stats(corpus, ngrams = 3, types = TRUE): renaming entries with
## duplicate names
## term type1 type2 type3 count support
## 1 <U+3E64> <U+613C> <U+3E30> <U+3E64> <U+613C> <U+3E30> 40 40
## 2 <U+613C> <U+3E30> <U+623C> <U+613C> <U+3E30> <U+623C> 40 40
## 3 <U+653C> <U+3E64> <U+613C> <U+653C> <U+3E64> <U+613C> 40 40
## 4 <U+3E30> <U+623C> <U+3E64> <U+3E30> <U+623C> <U+3E64> 35 35
## 5 “ ' s “ ' s 33 33
## 6 happy mothers day happy mothers day 31 30
## 7 let us know let us know 22 22
## 8 ' m sure ' m sure 21 20
## 9 happy new year happy new year 19 19
## 10 ' m going ' m going 16 16
## 11 – ' s – ' s 16 16
## 12 think ' s think ' s 18 15
## 13 “ ' m “ ' m 14 14
## 14 ' ve never ' ve never 13 13
## 15 ” ' s ” ' s 13 13
## 16 ' s hard ' s hard 12 12
## 17 new york city new york city 12 12
## 18 ' s still ' s still 11 11
## 19 ' s ' ' s ' 11 10
## 20 ' s good ' s good 11 10
## <U+22EE>(305835 rows total)
We have finished examining the dataset and gathered some interesting findings from the exploratory analysis. Now we are ready to train and create our first predictive model. Machine learning is an iterative process in which we preprocess the training data, then train and evaluate the model, and repeat these steps to obtain a better-performing model based on our evaluation metrics.
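As a rough sketch of the planned prediction approach only (assumptions: the freq1, freq2 and freq3 frequency tables computed above; predictNextWord() is a hypothetical helper for illustration, not the final model), a simple back-off lookup could work like this:
# hypothetical back-off sketch: look up the last two words in the trigram table,
# back off to the bigram table, then to the most frequent unigrams
predictNextWord <- function(phrase, n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 2) {
    hits <- freq3[grepl(paste0("^", words[1], " ", words[2], " "), as.character(freq3$word)), ]
    if (nrow(hits) > 0) return(head(sub(".* ", "", as.character(hits$word)), n))
  }
  hits <- freq2[grepl(paste0("^", tail(words, 1), " "), as.character(freq2$word)), ]
  if (nrow(hits) > 0) return(head(sub(".* ", "", as.character(hits$word)), n))
  head(as.character(freq1$word), n)
}
predictNextWord("happy new")   # "year" should rank among the suggestions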
The Shiny app we are going to build on top of our trained predictive model will have similar functionality to the SwiftKey app. It will have a text field where the user can type their text, and it will pop up 3-5 suggestions for what the next word might be; the user can choose one and it will be appended automatically to what they have already typed (saving a lot of typing time!). A minimal sketch of such an interface follows below.
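For illustration, a minimal Shiny sketch of this interface (assumptions: the shiny package and the hypothetical predictNextWord() helper sketched above; the real app will differ):
library(shiny)
# minimal interface sketch: one text input plus a list of suggested next words
ui <- fluidPage(
  textInput("phrase", "Type your text:"),
  uiOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderUI({
    req(input$phrase)                                 # wait until the user has typed something
    candidates <- predictNextWord(input$phrase, 3)    # hypothetical helper sketched earlier
    tags$ul(lapply(candidates, tags$li))              # show suggestions as a simple list
  })
}
# shinyApp(ui, server)   # uncomment to launch the app locally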
Before we end this report, it is important to note that each step (data cleaning, preprocessing, model training and evaluation) needs to be re-evaluated continuously to get a genuinely working and accurate ML model for our predictive text app. We look forward to the next report on the predictive model and Shiny app we are going to build!