Data Science Capstone: Milestone Report

Summary of project

This is the Milestone Report for the Coursera Data Science Capstone project. The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This Milestone Report describes the major features of the data, based on an exploratory data analysis, and summarizes the findings. To get started, the Coursera SwiftKey dataset was downloaded. Finally, the report explains the plans for creating the predictive model(s) and a Shiny app as the data product.

# Loading Libraries
library(tm)

## Loading required package: NLP

library(stringi)
library(RWeka)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(SnowballC)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

Download data and load files

DataUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
DataFile <- "Newdata/Coursera-SwiftKey.zip"
if (!file.exists('Newdata')) {
    dir.create('Newdata')
}
if (!file.exists("Newdata/final/en_US")) {
    tempFile <- tempfile()
    download.file(DataUrl, tempFile)
    unzip(tempFile, exdir = "Newdata")
    unlink(tempFile)
}

Summary of data

The Coursera SwiftKey dataset contains the following three sources of text data:

Blogs
News
Twitter

The text data are provided in four different languages. This project focuses only on the English corpora.

The number of lines, characters, and words is determined for each of the three datasets (blogs, news, and Twitter). The code also calculates the number of words per line (minimum, mean, and maximum).

#Loading Files and show summaries
blogsCon <- file("Newdata/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(blogsCon, encoding="UTF-8", skipNul = TRUE)
close(blogsCon)

newsCon <- file("Newdata/final/en_US/en_US.news.txt", "r")
news <- readLines(newsCon, encoding="UTF-8", skipNul = TRUE)

## Warning in readLines(newsCon, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'Newdata/final/en_US/en_US.news.txt'

close(newsCon)

twitterCon <- file("Newdata/final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(twitterCon, encoding="UTF-8", skipNul = TRUE)
close(twitterCon)

# Create stats of files
WPL <- sapply(list(blogs,news,twitter),function(x)
  summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
rawstats <- data.frame(
  File = c("blogs","news","twitter"), 
  t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
          TotalWords = sapply(list(blogs,news,twitter),stri_stats_latex)[4,],
          WPL))
)
# Show stats in table
kable(rawstats) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

File	Lines	LinesNEmpty	Chars	CharsNWhite	TotalWords	WPL_Min	WPL_Mean	WPL_Max
blogs	899288	899288	206824382	170389539	37570839	0	41.75109	6726
news	77259	77259	15639408	13072698	2651432	1	34.61779	1123
twitter	2360148	2360148	162096241	134082806	30451170	1	12.75065	47
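The warning about an incomplete final line for the news file may only mean that the file lacks a trailing newline, but it can also appear when a text-mode read stops early at an embedded control character, in which case the reported line count would be too low. A common precaution is to open the connection in binary mode. The sketch below is not the code used to produce the table above; the helper name read_text_file is hypothetical, and the temporary file stands in for the real data files.

```r
# Hypothetical helper: read a text file through a binary-mode connection
# so that embedded control characters do not end the read early.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demonstration on a temporary file (stand-in for en_US.news.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_file(tmp)
print(length(demo_lines))
unlink(tmp)
```

Comparing length() of the result against the line count from an independent tool is a simple way to check that a whole file was read.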

Data sampling

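The data files are very large, so only 1% of each file is drawn as a sample, and the three samples are combined into a single sample data set. Non-English characters should also be removed before the sample is compiled; the code below performs only the sampling step.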

# Sampling the data
set.seed(123)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
saveRDS(data.sample, 'sample.rds')

# Remove the objects that are no longer needed in the analysis.
rm(blogs, blogsCon, data.sample, news, newsCon, rawstats, twitter, 
   twitterCon, WPL)
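The sampling code above does not itself strip non-English characters. One simple approach, shown here as a sketch rather than as the author's method, is to drop every character that cannot be represented in ASCII using base R's iconv(). The helper name remove_non_ascii is hypothetical, and the input strings are made up for illustration. This also removes accented letters in otherwise English words.

```r
# Hypothetical helper: drop every character that has no ASCII equivalent.
# sub = "" replaces non-convertible bytes with nothing.
remove_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up example input
example_lines <- c("plain text", "caf\u00e9 au lait")
cleaned_lines <- remove_non_ascii(example_lines)
print(cleaned_lines)
```

In the pipeline, such a helper could be applied to the combined sample before it is saved with saveRDS().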

Data processing

After loading the stored data sample from the RDS file, the code creates a corpus and starts to analyse the data with the tm text mining library.

The code then cleans the sample: it converts the text to lowercase and removes punctuation, numbers, and English stop words.

Next it performs stemming, which reduces each word to its stem, the base form to which affixes can be attached (for example, "happy" becomes "happi"). Once stemming is done, the code collapses the extra white space.

# Load the RDS file
data <- readRDS("sample.rds")

# Create a Corpus
docs <- VCorpus(VectorSource(data))

# Remove undesired data and stop words in English (a, as, at, so, etc.)
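# content_transformer() keeps each document a PlainTextDocument
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))

# Reduce words to their stems (uses SnowballC)
docs <- tm_map(docs, stemDocument)

# Collapse the extra white space left by the removals above
docs <- tm_map(docs, stripWhitespace)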
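To make the effect of each cleaning step concrete, the same transformations can be run on a one-document toy corpus. This is a minimal sketch: the example sentence is made up, and tolower is wrapped in content_transformer(), the tm idiom for applying a base R function to a corpus. It assumes the tm and SnowballC packages are installed.

```r
library(tm)         # VCorpus, tm_map and the cleaning transformations
library(SnowballC)  # stemmer used by stemDocument

# One made-up sentence as a toy corpus
toy <- VCorpus(VectorSource("I can't wait to see 2 happy dogs!"))

toy <- tm_map(toy, content_transformer(tolower))        # lowercase
toy <- tm_map(toy, removePunctuation)                   # "can't" -> "cant"
toy <- tm_map(toy, removeNumbers)                       # drop "2"
toy <- tm_map(toy, removeWords, stopwords("english"))   # drop e.g. "to"
toy <- tm_map(toy, stemDocument)                        # "happy" -> "happi"
toy <- tm_map(toy, stripWhitespace)                     # collapse gaps

toy_text <- content(toy[[1]])
print(toy_text)
```

Because punctuation is removed before the stop words, contractions such as "can't" become "cant" and are no longer matched by the stop word list, which is why terms like "cant wait" survive into the n-gram tables.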

Build N-Grams

In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.

Next, the clean corpus needs to be tokenized (i.e., the text is broken up into words and short phrases) to construct a set of N-grams. We start with the following three N-grams:

Unigram - An n-gram matrix containing individual words

Bigram - An n-gram matrix containing two-word patterns

Trigram - An n-gram matrix containing three-word patterns

The RWeka package is used to build the N-gram tokenizers that create the unigrams, bigrams and trigrams.

Then the code calculates the frequencies of the N-grams to see what they look like.

# Create Tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Create plain text format
docs <- tm_map(docs, PlainTextDocument)
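To illustrate what an n-gram tokenizer produces, the following base R sketch builds word n-grams by sliding a window of n words over a sentence. It is only an illustration of the idea and not how RWeka implements NGramTokenizer; the function name make_ngrams is hypothetical.

```r
# Hypothetical helper: slide a window of n words over a word vector
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit("cant wait see new york", " ", fixed = TRUE)[[1]]

bigrams_demo  <- make_ngrams(words, 2)
trigrams_demo <- make_ngrams(words, 3)
print(bigrams_demo)   # "cant wait" "wait see" "see new" "new york"
print(trigrams_demo)  # "cant wait see" "wait see new" "see new york"
```

A sentence of k words yields k - n + 1 n-grams, so sentences shorter than n words contribute none.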

Exploratory Analysis

This section finds the most frequently occurring terms among the unigrams, bigrams and trigrams.

# Create TermDocumentMatrix with Tokenizations and Remove Sparse Terms
freq1 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = unigram)), 0.9999)
freq2 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = bigram)), 0.9999)
freq3 <- removeSparseTerms(TermDocumentMatrix(docs, control = list(tokenize = trigram)), 0.9999)

# Create frequencies 
FreqUni <- sort(rowSums(as.matrix(freq1)), decreasing=TRUE)
FreqBi <- sort(rowSums(as.matrix(freq2)), decreasing=TRUE)
FreqTri <- sort(rowSums(as.matrix(freq3)), decreasing=TRUE)

# Create Data Frames
dfUni <- data.frame(term=names(FreqUni), freq=FreqUni)   
dfBi <- data.frame(term=names(FreqBi), freq=FreqBi)   
dfTri <- data.frame(term=names(FreqTri), freq=FreqTri)

# Show head 10 of unigrams
kable(head(dfUni,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))

	term	freq
get	get	2540
just	just	2500
like	like	2396
will	will	2289
one	one	2277
time	time	1966
love	love	1942
can	can	1869
day	day	1759
good	good	1567

# Plot head 20 of unigrams
head(dfUni,20) %>% 
  ggplot(aes(reorder(term,-freq), freq, fill=freq)) +
  geom_bar(stat = "identity") +
  ggtitle("Unigrams") +
  xlab("Unigrams Words") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1))

# Show head 10 of bigrams
kable(head(dfBi,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))

	term	freq
right now	right now	247
look like	look like	192
cant wait	cant wait	181
last night	last night	159
look forward	look forward	155
feel like	feel like	139
dont know	dont know	132
im go	im go	111
thank follow	thank follow	109
new york	new york	102

# Plot head 20 of bigrams
head(dfBi,20) %>% 
  ggplot(aes(reorder(term,-freq), freq,fill=freq)) +
  geom_bar(stat = "identity") +
  ggtitle("Bigrams") +
  xlab("Bigrams Words") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1))

# Show head 10 of trigrams
kable(head(dfTri,10))%>%
  kable_styling(bootstrap_options = c("striped", "hover"))

	term	freq
cant wait see	cant wait see	42
happi mother day	happi mother day	26
let us know	let us know	23
happi new year	happi new year	18
cinco de mayo	cinco de mayo	13
dont even know	dont even know	13
im pretti sure	im pretti sure	13
look forward see	look forward see	13
new york citi	new york citi	13
new york time	new york time	13

# Plot head 20 of trigrams
head(dfTri,20) %>% 
  ggplot(aes(reorder(term,-freq), freq,fill=freq)) +
  geom_bar(stat = "identity") +
  ggtitle("Trigrams") +
  xlab("Trigrams Words") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1))

Development Plan

The development plan for this capstone project is to create predictive model(s) based on N-gram tokenization and to deploy them as a data product. The next steps are:

Establish the predictive model(s) using N-gram tokenization.
Optimize the code for faster processing.
Develop a data product, a Shiny App, that predicts the next word based on user input.
Create a slide deck to pitch the algorithm and the Shiny App.
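As a rough idea of how the first step might look, the sketch below predicts the next word by looking up the last two words of the input in a trigram table and falling back to the bigram table when there is no match. This is one possible approach, not the final model. The function name predict_next and the small tables are hypothetical; the terms and frequencies are copied from the tables above. The input is assumed to have been cleaned and stemmed in the same way as the corpus.

```r
# Toy frequency tables (terms and counts taken from the tables above)
tri_demo <- data.frame(
  term = c("cant wait see", "new york citi", "new york time"),
  freq = c(42, 13, 13),
  stringsAsFactors = FALSE
)
bi_demo <- data.frame(
  term = c("right now", "cant wait", "new york"),
  freq = c(247, 181, 102),
  stringsAsFactors = FALSE
)

# Return the last word of the most frequent term that starts with `prefix`
best_match <- function(prefix, table) {
  hits <- table[startsWith(table$term, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  sub(".* ", "", hits$term[which.max(hits$freq)])
}

# Hypothetical back-off predictor: trigrams first, then bigrams
predict_next <- function(input, tri, bi) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  n <- length(words)
  if (n >= 2) {
    guess <- best_match(paste(words[(n - 1):n], collapse = " "), tri)
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    return(best_match(words[n], bi))
  }
  NA_character_
}

print(predict_next("i cant wait", tri_demo, bi_demo))  # "see"
print(predict_next("right", tri_demo, bi_demo))        # "now"
```

When two candidates are tied, as with "new york citi" and "new york time", which.max() returns the first one, so a real model would need an explicit tie-breaking or smoothing rule.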