Overview

The goal of this report is simply to demonstrate that we have become familiar with the data and that we are on track to create the prediction algorithm.

Introduction

This is a milestone report in fulfillment of the Capstone Project for the Coursera Data Science Specialization offered by Johns Hopkins University. The report explores several aspects of the corpus data and presents visual summaries of the analysis performed on it.

The data consist of media "posts" and articles drawn from three different sources: blogs, news, and Twitter.

Assignment

Questions to consider, suggested by Coursera:

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Natural Language Processing Project

The objective of this project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed based on word frequency and context. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model.

Environment Setup

First, we load the packages needed for the exploratory data analysis.

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.width=10, fig.height=5)
options(width=120)

library(knitr)     # report generation
library(stringi)   # fast string statistics
library(NLP)       # NLP infrastructure required by tm
library(tm)        # text mining framework
library(rJava)     # Java interface required by RWeka
library(RWeka)     # n-gram tokenizers
library(ggplot2)   # plotting

Loading the Required Data

The dataset was downloaded from the Coursera Capstone project page (the Coursera-SwiftKey.zip archive; the download URL appears in the code below). The file is quite large, at over 500 MB, and the data are provided in four languages: German, English, Finnish, and Russian.

During this project we focus only on the English-language text files. Download, unzip, and load the training data.

# Download data files if necessary
if (!file.exists("./final/en_US")) {
    tempDownloadFile <- tempfile()
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", tempDownloadFile)
    unzip(tempDownloadFile, exdir = "./")
    unlink(tempDownloadFile)
    rm(tempDownloadFile)
}

# Load blogs 
con <- file("./final/en_US/en_US.blogs.txt", open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Load news 
con <- file("./final/en_US/en_US.news.txt", open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Load twitter 
con <- file("./final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)

Basic Summary of the Data

To get a sense of the dataset, have a look at some basic features of the data, such as the file size (in MiB), the number of lines in each corpus, the number of characters and words, and the length of the longest line. Below is a summary of the three datasets.

# File sizes (MiB)
file.sizes <- round(file.info(c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt", "./final/en_US/en_US.twitter.txt"))$size / 1048576)

# Number of lines
number.of.lines <- sapply(list(blogs, news, twitter), length)

# Number of characters
number.of.chars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)

# The longest line
longest.line <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), max)

# Number of words
number.of.words <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]

data.summary <- data.frame(
    c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    paste(file.sizes, "MiB"),
    number.of.lines,
    number.of.chars,
    number.of.words,
    longest.line
)
names(data.summary) <- c("Data file", "File size", "Number of lines", "Number of characters", "Number of words", "The longest line")

data.summary
Summary of the Raw Data

Data file          File size  Number of lines  Number of characters  Number of words  The longest line
en_US.blogs.txt    200 MiB    899288           206824505             37570839         40833
en_US.news.txt     196 MiB    77259            15639408              2651432          5760
en_US.twitter.txt  159 MiB    2360148          162096241             30451170         140

Data Preparation

Sampling

As the combined corpus is fairly large, a random sample of 4,000 lines is taken from each dataset.

# Sample size
sample.size <- 4000

# Set RNG seed
set.seed(333)

# Sample files
sample.blogs <- sample(blogs, sample.size, replace = FALSE)
sample.news <- sample(news, sample.size, replace = FALSE)
sample.twitter <- sample(twitter, sample.size, replace = FALSE)

# All sample
sample.data <- c(sample.blogs, sample.news, sample.twitter)

# Save concatenated sample file
sample.file <- "./final/en_US/sample.txt"
if (!file.exists(sample.file)) { writeLines(sample.data, sample.file) }

# File size (MiB)
file.size <- round(file.info(sample.file)$size / 1048576)
rm(sample.file)

# Number of lines
number.of.lines <- length(sample.data)

# Number of characters
number.of.chars <- sum(nchar(sample.data))

# Number of words
number.of.words <- sum(stri_count_words(sample.data))

# The longest line
longest.line <- max(nchar(sample.data))

data.summary <- data.frame(
    c("sample.txt"),
    paste(file.size, "MiB"),
    number.of.lines,
    number.of.chars,
    number.of.words,
    longest.line
)
names(data.summary) <- c("Data file", "File size", "Number of lines", "Number of characters", "Number of words", "The longest line")

data.summary
Summary of the Sample Data

Data file   File size  Number of lines  Number of characters  Number of words  The longest line
sample.txt  2 MiB      12000            2021224               359393           2689

# Clean up memory
rm(blogs, news, twitter, sample.blogs, sample.news, sample.twitter)
#gc()

Cleaning the Corpus

Next, build a cleaned corpus from the sample data. The cleaning removes URLs, punctuation, numbers, and extra whitespace, converts the text to lower case, and filters out profane words.

# Build a volatile corpus from the sample data
text.corpus <- VCorpus(VectorSource(sample.data))

# Helper that replaces all matches of a pattern with an empty string
replacement <- content_transformer(function(x, pattern) { gsub(pattern, "", x) })

# Remove URL addresses (ftp, ftps, http, https) before punctuation is stripped
text.corpus <- tm_map(text.corpus, replacement, "(f|ht)tp(s?)://(.*)[.][a-z]+")

# Translate characters from upper to lower case
text.corpus <- tm_map(text.corpus, content_transformer(tolower))

# Remove punctuation marks from the text
text.corpus <- tm_map(text.corpus, removePunctuation)

# Remove numbers from the text
text.corpus <- tm_map(text.corpus, removeNumbers)

# Strip extra whitespace from the text
text.corpus <- tm_map(text.corpus, stripWhitespace)

# Download profane word list
profane.file <- "./final/en_US/profane.txt"
if (!file.exists(profane.file)) {
  download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", profane.file)
}

# Remove profane words from the text 
profane.words <- readLines(profane.file)
text.corpus <- tm_map(text.corpus, removeWords, profane.words)

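To sanity-check the cleaning steps, we can take a quick look at a few of the cleaned documents (the exact output depends on which lines ended up in the sample):

# Inspect a few cleaned documents as a sanity check
inspect(text.corpus[1:3])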

N-gram Dictionary

The corpus sample will be tokenized to build a basic n-gram model. An n-gram model estimates the probability of a word occurring in a phrase based on the previous words in the phrase. In its simplest form, this probability is estimated by dividing the number of times the full n-gram occurs by the number of times its prefix (the n-gram without its last word) occurs.
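
As a concrete illustration with made-up counts (not values computed from this corpus): if the bigram "of the" occurred 120 times and the unigram "of" occurred 400 times, the estimated probability of "the" following "of" would be 120 / 400 = 0.3.

# Illustrative maximum-likelihood estimate using hypothetical counts
count.of.the <- 120   # hypothetical count of the bigram "of the"
count.of     <- 400   # hypothetical count of the unigram "of"
count.of.the / count.of   # estimated P(the | of) = 0.3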

While the word analysis performed in this document is helpful for initial exploration, we will need to construct a dictionary of bigrams, trigrams, and four-grams, collectively called n-grams. Bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases.

We will use the RWeka package to tokenize the sample data into unigrams, bigrams, trigrams, and four-grams and to tabulate their frequencies.

Tokenizing

# Flatten the cleaned corpus back into a character vector for tokenizing
corpus.text <- unlist(lapply(text.corpus, as.character), use.names = FALSE)

# Create tokenizers
uni.gram <- NGramTokenizer(corpus.text, Weka_control(min = 1, max = 1))
bi.grams <- NGramTokenizer(corpus.text, Weka_control(min = 2, max = 2))
tri.grams <- NGramTokenizer(corpus.text, Weka_control(min = 3, max = 3))
quad.grams <- NGramTokenizer(corpus.text, Weka_control(min = 4, max = 4))

Unigrams

# Convert to data frame
uni.gram <- data.frame(table(uni.gram))
names(uni.gram) <- c("Words", "Frequency")

# Sort by frequency
uni.gram <- uni.gram[order(uni.gram$Frequency, decreasing = TRUE),]

# Select top 10 words
uni.gram.top10 <- uni.gram[1:10,]

# Convert character to factor
uni.gram.top10$Words <- factor(uni.gram.top10$Words, levels = uni.gram.top10$Words[order(-uni.gram.top10$Frequency)])

# Plot top 10 1-grams
g <- ggplot(uni.gram.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#00AFBB")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 90),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 1-grams")
print(g)

Bigrams

# Convert to data frame
bi.grams <- data.frame(table(bi.grams))
names(bi.grams) <- c("Words", "Frequency")

# Sort by frequency
bi.grams <- bi.grams[order(bi.grams$Frequency, decreasing = TRUE),]

# Select top 10 words
bi.grams.top10 <- bi.grams[1:10,]

# Convert character to factor
bi.grams.top10$Words <- factor(bi.grams.top10$Words, levels = bi.grams.top10$Words[order(-bi.grams.top10$Frequency)])

# Plot top 10 2-grams
g <- ggplot(bi.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#52854C")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 90),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 2-grams")
print(g)

Trigrams

# Convert to data frame
tri.grams <- data.frame(table(tri.grams))
names(tri.grams) <- c("Words", "Frequency")

# Sort by frequency
tri.grams <- tri.grams[order(tri.grams$Frequency, decreasing = TRUE),]

# Select top 10 words
tri.grams.top10 <- tri.grams[1:10,]

# Convert character to factor
tri.grams.top10$Words <- factor(tri.grams.top10$Words, levels = tri.grams.top10$Words[order(-tri.grams.top10$Frequency)])

# Plot top 10 3-grams
g <- ggplot(tri.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#293352")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 90),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 3-grams")
print(g)

Quadgrams

# Convert to data frame
quad.grams <- data.frame(table(quad.grams))
names(quad.grams) <- c("Words", "Frequency")

# Sort by frequency
quad.grams <- quad.grams[order(quad.grams$Frequency, decreasing = TRUE),]

# Select top 10 words
quad.grams.top10 <- quad.grams[1:10,]

# Convert character to factor
quad.grams.top10$Words <- factor(quad.grams.top10$Words, levels = quad.grams.top10$Words[order(-quad.grams.top10$Frequency)])

# Plot top 10 4-grams
g <- ggplot(quad.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#CC79A7")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 90),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 4-grams")
print(g)

Conclusion and Next Steps

The goal is to create a predictive model that suggests the most probable next word given the user's input. The basic n-gram tables constructed above will serve as the foundation of the prediction algorithm. The plan is to find a good balance between sample size and prediction accuracy. The model will then be evaluated and deployed as a Shiny application; the next steps are to implement the predictive algorithm (a rough sketch of the lookup idea is shown below) and to build the UI of the Shiny app.
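
As a rough illustration of how the n-gram tables above could drive a prediction (a minimal sketch of a longest-match backoff lookup, not the final algorithm; the predict.next.word helper below is hypothetical and assumes the chunks above have been run):

# Hypothetical next-word lookup: try the longest matching n-gram first,
# then back off to shorter n-grams (sketch only, not the final model)
predict.next.word <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    tables <- list(quad.grams, tri.grams, bi.grams)   # frequency tables built above
    context <- c(3, 2, 1)                             # words of context each table uses
    for (i in seq_along(tables)) {
        n <- context[i]
        if (length(words) < n) next
        prefix <- paste(tail(words, n), collapse = " ")
        hits <- tables[[i]][grepl(paste0("^", prefix, " "), tables[[i]]$Words), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$Words[which.max(hits$Frequency)])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    # Fall back to the single most frequent unigram
    as.character(uni.gram$Words[1])
}

# Example usage
predict.next.word("thanks for the")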
