Milestone Report

1. Overview

This report presents an exploratory analysis and outlines the goals for the eventual app and prediction algorithm of the Capstone Project in the Coursera Data Science Specialization. R is the main tool used in the project.

Objectives:
- Download and load the SwiftKey data set
- Create a basic report of summary statistics about the data sets
- Report findings
- Get feedback on the plans for the prediction algorithm and Shiny app

2. Data preparation

The first step in this investigation is data preparation. The following code loads the required libraries.

library(stringi)
library(knitr)
library(tm)
library(NLP)
library(magrittr)
library(SnowballC)
library(rJava)
library(RWeka)
library(ggplot2)

The next step is to download the Coursera-SwiftKey.zip archive and extract the three English-language files that will be used in the rest of the analysis. The data is provided by SwiftKey via Coursera. The extracted files are as follows:
1. en_US.blogs.txt : Consists of text data from blog posts
2. en_US.news.txt : Consists of text data from online news articles
3. en_US.twitter.txt : Consists of text data from online tweets

## Set Coursera-SwiftKey.zip URL
dataSetURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

## Download the dataset if the archive is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(dataSetURL, destfile = "Coursera-SwiftKey.zip", method = "curl")
}

## Unzip the archive if the "final" directory does not exist yet
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")
}

## Read files (paths are relative to the working directory, where the archive was unzipped)
blogsAux <- file("final/en_US/en_US.blogs.txt", "rb")
blogsDS <- readLines(blogsAux, encoding = "UTF-8", skipNul = TRUE)
close(blogsAux)
newsAux <- file("final/en_US/en_US.news.txt", "rb")
newsDS <- readLines(newsAux, encoding = "UTF-8", skipNul = TRUE)
close(newsAux)
tweetsAux <- file("final/en_US/en_US.twitter.txt", "rb")
tweetsDS <- readLines(tweetsAux, encoding = "UTF-8", skipNul = TRUE)
close(tweetsAux)

To gain better insight into the contents of the files, a basic report of summary statistics about the data sets is created.

## Create matrix structure
dataSummary <- matrix(0, nrow = 3, ncol = 3, dimnames = list(c("Blogs", "News", "Twitter"),c("File size (MBs)", "Lines", "Words")))

## Fill "File size (MBs)" (paths are relative to the working directory)
dataSummary[1, 1] <- round(file.info("final/en_US/en_US.blogs.txt")$size / 1024^2, 2)
dataSummary[2, 1] <- round(file.info("final/en_US/en_US.news.txt")$size / 1024^2, 2)
dataSummary[3, 1] <- round(file.info("final/en_US/en_US.twitter.txt")$size / 1024^2, 2)

## Fill "Lines"
dataSummary[1, 2] <- length(blogsDS)
dataSummary[2, 2] <- length(newsDS)
dataSummary[3, 2] <- length(tweetsDS)

## Fill "Words"
dataSummary[1, 3] <- sum(stri_count_words(blogsDS))
dataSummary[2, 3] <- sum(stri_count_words(newsDS))
dataSummary[3, 3] <- sum(stri_count_words(tweetsDS))

kable(dataSummary)

          File size (MBs)     Lines      Words
Blogs              200.42    899288   37546239
News               196.28   1010242   34762395
Twitter            159.36   2360148   30093413
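
The table also allows a quick comparison of average line length across the three sources. The following is a minimal sketch that re-enters the counts from the table by hand; the `summaryCounts` data frame and the `wordsPerLine` column are illustrative names and are not part of the original code.

```r
## Counts copied by hand from the summary table
summaryCounts <- data.frame(
  source = c("Blogs", "News", "Twitter"),
  lines  = c(899288, 1010242, 2360148),
  words  = c(37546239, 34762395, 30093413)
)

## Average number of words per line for each source
summaryCounts$wordsPerLine <- round(summaryCounts$words / summaryCounts$lines, 2)
print(summaryCounts)
```

This works out to roughly 42 words per line for blogs, 34 for news and 13 for tweets, so the Twitter file contributes many short lines.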

Due to the size of the files, a 1% sample of each file is taken as a proof of concept for the exploratory analysis. The samples are then combined and used to generate the corpus.

## Set seed for reproducibility
set.seed(1234)

## Generate samples
blogsSample <- sample(blogsDS, length(blogsDS) * 0.01)
newsSample <- sample(newsDS, length(newsDS) * 0.01)
tweetsSample <- sample(tweetsDS, length(tweetsDS) * 0.01)

## Combine samples
combinedSamples <- c(blogsSample, newsSample, tweetsSample)

## Generate corpus
combinedSamples <- iconv(combinedSamples, 'UTF-8', 'ASCII')
corpus <- VCorpus(VectorSource(as.data.frame(combinedSamples, stringsAsFactors = FALSE)))
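
One side effect of the ASCII conversion is worth keeping in mind: when `iconv()` is called without a `sub` argument, any element it cannot convert is returned as `NA`. The following small illustration assumes a platform whose iconv implementation does not transliterate; the example strings and the `sub = ""` alternative are illustrative and are not part of the original pipeline.

```r
## Two example lines; the second contains a non-ASCII character (e with acute accent)
exampleLines <- c("plain text", "caf\u00e9")

## Without `sub`, a line that cannot be converted becomes NA
asciiDefault <- iconv(exampleLines, from = "UTF-8", to = "ASCII")

## With sub = "", non-convertible bytes are dropped and the rest of the line is kept
asciiCleaned <- iconv(exampleLines, from = "UTF-8", to = "ASCII", sub = "")

print(asciiDefault)
print(asciiCleaned)
```

Any `NA` entries that are later coerced to text and lowercased show up as the token "na", which may matter when interpreting the most frequent terms.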

Before the exploratory analysis, the data must be cleansed. For this new dataset, all characters are transformed to lowercase, and the most fundamental steps in natural language processing are performed: removing numbers, punctuation, stop words and extra white space, and stemming words. Pipes (%>%) are used to chain the cleansing steps, which keeps the code compact and readable.

## Data cleansing
corpus <- corpus %>%
  tm_map(tolower) %>% ## Transform to lowercase
  tm_map(removeNumbers) %>% ## Remove numbers
  tm_map(removePunctuation) %>% ## Remove punctuation
  tm_map(removeWords, stopwords(kind = "en")) %>% ## Remove stopwords
  tm_map(stripWhitespace) %>% ## Remove white spaces
  tm_map(stemDocument) %>% ## Stem words
  tm_map(PlainTextDocument) ## Convert to plain text
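
To make the effect of each cleansing step concrete, here is a minimal base R sketch that mimics the steps on a single sentence. It is not the tm implementation: `cleanText` is a hypothetical helper, the stop word list is a tiny illustrative one rather than `stopwords(kind = "en")`, and stemming is left out.

```r
## Hypothetical helper that mimics the cleansing steps on plain character vectors
cleanText <- function(text, stopWords = c("the", "a", "is", "in", "of")) {
  text <- tolower(text)                                  ## Transform to lowercase
  text <- gsub("[[:digit:]]+", "", text)                 ## Remove numbers
  text <- gsub("[[:punct:]]+", "", text)                 ## Remove punctuation
  stopPattern <- paste0("\\b(", paste(stopWords, collapse = "|"), ")\\b")
  text <- gsub(stopPattern, "", text, perl = TRUE)       ## Remove stop words
  text <- gsub("\\s+", " ", text)                        ## Collapse repeated white space
  trimws(text)                                           ## Trim leading and trailing spaces
}

cleanedExample <- cleanText("The 3 cats in the garden, chasing 2 birds!")
print(cleanedExample)
```

Only the words that survive every step remain, separated by single spaces.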

3. Exploratory analysis

In order to find the most used term, a unigram model is built.

## Tokenize the document passed in as x (not the whole corpus)
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigramMatrix <- DocumentTermMatrix(corpus, control = list(tokenize = unigramTokenizer))
unigramMatrix <- as.matrix(unigramMatrix)
unigramMatrix <- colSums(unigramMatrix)
unigramMatrix <- sort(unigramMatrix, decreasing = TRUE)
unigramDataFrame <- data.frame(word = names(unigramMatrix), frequency = unigramMatrix)

unigramPlot <- ggplot(unigramDataFrame[1:10, ], aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "Identity") +
  theme_classic() +
  labs(title = "Figure 1: Top 10 Unigrams", x = "Unigrams", y = "Frequency")
unigramPlot

As Figure 1 shows, the most frequent term is ‘get’ with 2499 occurrences, followed by ‘will’ with 2497 and ‘just’ with 2427.

To determine the most common pairs of consecutive words, a bigram model is built.

## Tokenize the document passed in as x (not the whole corpus)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigramMatrix <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
bigramMatrix <- as.matrix(bigramMatrix)
bigramMatrix <- colSums(bigramMatrix)
bigramMatrix <- sort(bigramMatrix, decreasing = TRUE)
bigramDataFrame <- data.frame(word = names(bigramMatrix), frequency = bigramMatrix)

bigramPlot <- ggplot(bigramDataFrame[1:10, ], aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "Identity") +
  theme_classic() +
  labs(title = "Figure 2: Top 10 Bigrams", x = "Bigrams", y = "Frequency")
bigramPlot

As Figure 2 shows, the most frequent bigram is ‘na na’ with 1008 occurrences, followed by ‘right now’ with 221 and ‘last year’ with 191.

Finally, the frequency of trigrams is explored with the following model.

## Tokenize the document passed in as x (not the whole corpus)
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigramMatrix <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))
trigramMatrix <- as.matrix(trigramMatrix)
trigramMatrix <- colSums(trigramMatrix)
trigramMatrix <- sort(trigramMatrix, decreasing = TRUE)
trigramDataFrame <- data.frame(word = names(trigramMatrix), frequency = trigramMatrix)

trigramPlot <- ggplot(trigramDataFrame[1:10, ], aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "Identity") +
  theme_classic() +
  labs(title = "Figure 3: Top 10 Trigrams", x = "Trigrams", y = "Frequency")
trigramPlot

As Figure 3 shows, the most frequent trigram is ‘na na na’ with 275 occurrences, followed by ‘can wait see’ with 48 and ‘happi mother day’ with 32. The terms appear in their stemmed form.
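
The counting behind the three models can also be sketched in base R, without RWeka or Java. The code below is a rough illustration, not the method used to produce the figures: `countNgrams` is a hypothetical helper and `toyLines` is made-up, already-cleaned input.

```r
## Hypothetical helper: count n-grams in a vector of already-cleaned lines
countNgrams <- function(lines, n) {
  ngrams <- unlist(lapply(strsplit(lines, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    ## Slide a window of width n over the tokens of one line
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

## Made-up input lines for illustration only
toyLines <- c("happi mother day", "happi mother day mom", "right now")
toyBigramCounts <- countNgrams(toyLines, 2)
print(toyBigramCounts)
```

Because n-grams are built per line, no n-gram spans two different lines, which is one way this sketch may differ from tokenizing a corpus stored as a single document.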

4. Findings

The work carried out so far led to the following findings.
1. Due to the file size, data subsetting must be performed.
2. Other NLP packages and techniques must be studied to enhance processing performance.
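3. Even after the fundamental NLP cleansing tasks, junk text remained, such as “na na” and “na na na”. These tokens are probably an artifact of the ASCII conversion: iconv() returns NA for lines it cannot convert, and those NA values become the text “na” once the corpus is lowercased.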
4. The n-gram frequency tables can serve as the basis for a prediction algorithm.

5. Prediction algorithm and Shiny app

For the prediction algorithm, a Markov chain approach will be used. In other words, given an entry, the algorithm will propose the next word with the highest probability of occurrence given the current state. The model's performance is sensitive to data quality, so the data should be properly cleaned to improve the results.
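
As a rough sketch of this idea, a first-order Markov chain can be approximated from bigram counts: the probability of a next word given the current word is its bigram count divided by the total count of bigrams starting with the current word. The `toyBigrams` table (with invented counts) and the `predictNext` helper below are hypothetical and only illustrate the principle; the final algorithm may differ.

```r
## Toy bigram counts (invented numbers); in practice these would come from the bigram frequency table
toyBigrams <- data.frame(
  first     = c("right", "right", "last", "last", "last"),
  second    = c("now", "away", "year", "night", "week"),
  frequency = c(5, 1, 4, 3, 2),
  stringsAsFactors = FALSE
)

## Hypothetical helper: rank candidate next words by P(second | first)
predictNext <- function(bigrams, currentWord, k = 3) {
  candidates <- bigrams[bigrams$first == currentWord, ]
  if (nrow(candidates) == 0) return(character(0))
  ## Maximum-likelihood estimate of the transition probability
  candidates$probability <- candidates$frequency / sum(candidates$frequency)
  candidates <- candidates[order(-candidates$probability), ]
  head(candidates$second, k)
}

topAfterLast <- predictNext(toyBigrams, "last", k = 2)
print(topAfterLast)
```

A word that never appears as the first element of a bigram returns no suggestion in this sketch, so a real model would need a fallback, for example the most frequent unigrams.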

The Shiny app will consist of an interface with a text box where the user can enter text. Each input will trigger the prediction model in the background and return a list of words that might complete the user’s text.