This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This Milestone Report describes the major features of the data and summarizes the exploratory data analysis. To get started, the Coursera SwiftKey dataset is downloaded. Finally, the plans for creating the predictive model(s) and a Shiny app as a data product are explained.
# Loading Libraries
library(tm)
## Loading required package: NLP
library(stringi)
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(SnowballC)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
DataUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
DataFile <- "Newdata/Coursera-SwiftKey.zip"
if (!file.exists('Newdata')) {
  dir.create('Newdata')
}
if (!file.exists("Newdata/final/en_US")) {
  tempFile <- tempfile()
  download.file(DataUrl, tempFile)
  unzip(tempFile, exdir = "Newdata")
  unlink(tempFile)
}
The Coursera SwiftKey dataset contains text data from three sources: blogs, news, and Twitter.
The text data are provided in four different languages. This project focuses only on the English corpora.
Here the number of lines, number of characters, and number of words are determined for each of the three datasets (blogs, news, and Twitter). The code also calculates the number of words per line (minimum, mean, and maximum).
#Loading Files and show summaries
blogsCon <- file(paste0("Newdata/final/en_US/en_US.blogs.txt"), "r")
blogs <- readLines(blogsCon, encoding="UTF-8", skipNul = TRUE)
close(blogsCon)
newsCon <- file(paste0("Newdata/final/en_US/en_US.news.txt"), "r")
news <- readLines(newsCon, encoding="UTF-8", skipNul = TRUE)
## Warning in readLines(newsCon, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'Newdata/final/en_US/en_US.news.txt'
close(newsCon)
twitterCon <- file(paste0("Newdata/final/en_US/en_US.twitter.txt"), "r")
twitter <- readLines(twitterCon, encoding="UTF-8", skipNul = TRUE)
close(twitterCon)
# Create stats of files
WPL <- sapply(list(blogs, news, twitter), function(x)
  summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WPL) <- c('WPL_Min', 'WPL_Mean', 'WPL_Max')
rawstats <- data.frame(
  File = c("blogs", "news", "twitter"),
  t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
          TotalWords = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ],
          WPL))
)
# Show stats in table
kable(rawstats) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| File | Lines | LinesNEmpty | Chars | CharsNWhite | TotalWords | WPL_Min | WPL_Mean | WPL_Max |
|---|---|---|---|---|---|---|---|---|
| blogs | 899288 | 899288 | 206824382 | 170389539 | 37570839 | 0 | 41.75109 | 6726 |
| news | 77259 | 77259 | 15639408 | 13072698 | 2651432 | 1 | 34.61779 | 1123 |
| twitter | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 | 1 | 12.75065 | 47 |
The data files are very large, so only 1% of each file is taken as a sample. Non-English characters are to be removed before the three samples are compiled into a single sample data set.
# Sampling the data
set.seed(123)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
saveRDS(data.sample, 'sample.rds')
# Remove the objects that are not used in the analysis.
rm(blogs, blogsCon, data.sample, news, newsCon, rawstats, twitter,
twitterCon, WPL)
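The removal of non-English characters mentioned before the sampling code does not appear in the code itself. A minimal sketch of one way to do it follows, assuming that "non-English" can be approximated as "non-ASCII" and using base R's `iconv`. The helper name `remove_non_ascii` and the example strings are hypothetical.

```r
# Hypothetical helper: drop every character that cannot be represented in ASCII.
# Assumption: "non-English" is approximated as "non-ASCII", so accented
# letters, emoji, and curly quotes are removed as well.
remove_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

example_lines <- c("I love caf\u00e9 au lait", "plain ascii text")
cleaned <- remove_non_ascii(example_lines)
print(cleaned)
```

In this report's pipeline, such a helper could be applied to `data.sample` before the call to `saveRDS`.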
After loading the sample RDS file (the stored data sample), the code creates a corpus and starts to analyse the data with the tm text mining library.
The code then cleans the sample: it converts the text to lowercase and removes punctuation, numbers, and English stop words.
Next it performs stemming, which reduces each word to its stem, the base form to which affixes can be attached. Once stemming is done, the code strips extra whitespace.
# Load the RDS file
data <- readRDS("sample.rds")
# Create a Corpus
docs <- VCorpus(VectorSource(data))
# Remove undesired data and stop words in English (a, as, at, so, etc.)
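# tolower is a base R function, so wrap it to keep the documents as tm text documents
docs <- tm_map(docs, content_transformer(tolower))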
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
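To make the effect of these transformations concrete, the sketch below applies roughly equivalent base R steps to a single string. It is an illustration only and not how `tm` implements them: `toy_stopwords` is a tiny hypothetical subset of a stop word list, `clean_text` is a hypothetical helper, and stemming is omitted because it needs the SnowballC package.

```r
# Illustration only: base R approximations of the tm cleaning steps.
# `toy_stopwords` is a small hypothetical subset, not tm's stopwords("english").
toy_stopwords <- c("the", "a", "at", "is", "to")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                                    # convert to lowercase
  x <- gsub("[[:punct:]]", "", x)                    # remove punctuation
  x <- gsub("[[:digit:]]", "", x)                    # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)             # remove whole-word stop words
  x <- gsub("\\s+", " ", x)                          # collapse runs of whitespace
  trimws(x)                                          # drop leading/trailing spaces
}

cleaned <- clean_text("The game is at 7, can't wait to see it!", toy_stopwords)
print(cleaned)
```

Note that removing punctuation first turns "can't" into "cant". If the stop word list contains the apostrophe form, it no longer matches, which would be consistent with terms such as `cant` and `dont` showing up in the frequency tables later in this report.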
In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
We next need to tokenize the clean corpus (i.e., break the text up into words and short phrases) and construct a set of n-grams. We will start with the following three n-grams:
Unigram - An n-gram matrix containing individual words
Bigram - An n-gram matrix containing two-word patterns
Trigram - An n-gram matrix containing three-word patterns
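As a small illustration of these three definitions, the sketch below builds word n-grams from one already-cleaned string using only base R. The helper `make_ngrams` is hypothetical and assumes words are separated by whitespace; it is not the tokenizer used in this report.

```r
# Hypothetical helper: build word n-grams from one string using base R.
# Assumes words are separated by whitespace; this is not RWeka's tokenizer.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))        # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)           # index of each n-gram's first word
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "cant wait see new york"
unigrams <- make_ngrams(sentence, 1)
bigrams  <- make_ngrams(sentence, 2)
trigrams <- make_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of five words yields five unigrams, four bigrams, and three trigrams, since each n-gram is a sliding window of n consecutive words.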
The RWeka package is used to build the n-gram tokenizers that create the unigrams, bigrams, and trigrams.
Then the code calculates the frequencies of the n-grams to see what they look like.
# Create Tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Create plain text format
docs <- tm_map(docs, PlainTextDocument)
In this section the code finds the most frequently occurring terms among the unigrams, bigrams, and trigrams.
# Create TermDocumentMatrix with Tokenizations and Remove Sparse Terms
freq1 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = unigram)), 0.9999)
freq2 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = bigram)), 0.9999)
freq3 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = trigram)), 0.9999)
# Create frequencies
FreqUni <- sort(rowSums(as.matrix(freq1)), decreasing=TRUE)
FreqBi <- sort(rowSums(as.matrix(freq2)), decreasing=TRUE)
FreqTri <- sort(rowSums(as.matrix(freq3)), decreasing=TRUE)
# Create Data Frames
dfUni <- data.frame(term=names(FreqUni), freq=FreqUni)
dfBi <- data.frame(term=names(FreqBi), freq=FreqBi)
dfTri <- data.frame(term=names(FreqTri), freq=FreqTri)
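The `0.9999` sparsity threshold passed to `removeSparseTerms` can be translated into a minimum document count. The worked arithmetic below is a sketch under two assumptions that are not stated in this report: that `sample()` truncates a fractional sample size, and that `removeSparseTerms` keeps a term only when it occurs in more than `n_docs * (1 - sparse)` documents. The line counts come from the summary table earlier in the report.

```r
# Worked arithmetic for the 0.9999 sparsity threshold.
# Assumption: removeSparseTerms() keeps a term only if it occurs in more than
# n_docs * (1 - sparse) documents.
lines_per_file <- c(blogs = 899288, news = 77259, twitter = 2360148)

# Assumption: sample() truncates a fractional size, so each 1% sample rounds down.
sample_sizes <- floor(lines_per_file * 0.01)
n_docs <- sum(sample_sizes)           # number of documents in the sampled corpus

sparse <- 0.9999
cutoff <- n_docs * (1 - sparse)       # a term must appear in more documents than this
min_docs <- floor(cutoff) + 1         # smallest whole document count above the cutoff

print(sample_sizes)
print(n_docs)
print(min_docs)
```

Under these assumptions the sampled corpus has 33365 documents, and a term survives the filter only if it appears in at least 4 of them.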
# Show head 10 of unigrams
kable(head(dfUni,10))%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | term | freq |
|---|---|---|
| get | get | 2540 |
| just | just | 2500 |
| like | like | 2396 |
| will | will | 2289 |
| one | one | 2277 |
| time | time | 1966 |
| love | love | 1942 |
| can | can | 1869 |
| day | day | 1759 |
| good | good | 1567 |
# Plot head 20 of unigrams
head(dfUni,20) %>%
ggplot(aes(reorder(term,-freq), freq, fill=freq)) +
geom_bar(stat = "identity") +
ggtitle("Unigrams") +
xlab("Unigram Words") + ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))
# Show head 10 of bigrams
kable(head(dfBi,10))%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | term | freq |
|---|---|---|
| right now | right now | 247 |
| look like | look like | 192 |
| cant wait | cant wait | 181 |
| last night | last night | 159 |
| look forward | look forward | 155 |
| feel like | feel like | 139 |
| dont know | dont know | 132 |
| im go | im go | 111 |
| thank follow | thank follow | 109 |
| new york | new york | 102 |
# Plot head 20 of bigrams
head(dfBi,20) %>%
ggplot(aes(reorder(term,-freq), freq,fill=freq)) +
geom_bar(stat = "identity") +
ggtitle("Bigrams") +
xlab("Bigrams Words") + ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))
# Show head 10 of trigrams
kable(head(dfTri,10))%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | term | freq |
|---|---|---|
| cant wait see | cant wait see | 42 |
| happi mother day | happi mother day | 26 |
| let us know | let us know | 23 |
| happi new year | happi new year | 18 |
| cinco de mayo | cinco de mayo | 13 |
| dont even know | dont even know | 13 |
| im pretti sure | im pretti sure | 13 |
| look forward see | look forward see | 13 |
| new york citi | new york citi | 13 |
| new york time | new york time | 13 |
# Plot head 20 of trigrams
head(dfTri,20) %>%
ggplot(aes(reorder(term,-freq), freq,fill=freq)) +
geom_bar(stat = "identity") +
ggtitle("Trigrams") +
xlab("Trigrams Words") + ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))
The development plan for this capstone project is to create predictive model(s) based on n-gram tokenization and deploy them as a data product. The next steps are: