The goal of this milestone is to demonstrate familiarity with the data and to show that the project is on track toward building the prediction algorithm.
This milestone report is submitted in partial fulfillment of the Capstone Project of the Coursera Data Science Specialization offered by Johns Hopkins University. The report explores several aspects of the corpus data and presents visual representations of the analysis performed on it.
The data is made up of many media posts and articles from three sources:

- Twitter
- Blogs
- News
The objective of this project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed based on word frequency and context. Natural language processing techniques will be used to perform the analysis and build the predictive model.
This milestone report describes the major features of the training data based on our exploratory data analysis and summarizes our plans for creating the predictive model.
First, we will load in the relevant packages that will be necessary for the exploratory data analysis.
# Global knitr chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.width=10, fig.height=5)
options(width=120)
library(knitr)     # report generation
library(stringi)   # string statistics
library(NLP)       # base NLP infrastructure (required by tm)
library(tm)        # text mining / corpus handling
library(rJava)     # Java interface (required by RWeka)
library(RWeka)     # n-gram tokenization
library(ggplot2)   # plotting
The dataset was downloaded from the Coursera Capstone project page (the download URL is shown in the code below). The zipped file is quite large, at over 500 MB, and contains data in four languages: German, English, Finnish, and Russian.
For this project we focus only on the English (en_US) text files. Download, unzip, and load the training data.
# Download data files if necessary
if (!file.exists("./final/en_US")) {
tempDownloadFile <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", tempDownloadFile)
unzip(tempDownloadFile, exdir = "./")
unlink(tempDownloadFile)
rm(tempDownloadFile)
}
# Load blogs
con <- file("./final/en_US/en_US.blogs.txt", open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Load news
con <- file("./final/en_US/en_US.news.txt", open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Load twitter
con <- file("./final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
To get a sense of the dataset, we look at some basic features of each file: the file size (in MiB), the number of lines, the number of characters and words, and the length of the longest line. Below is a summary of the three datasets.
# File sizes (MiB)
file.sizes <- round(file.info(c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt", "./final/en_US/en_US.twitter.txt"))$size / 1048576)
# Number of lines
number.of.lines <- sapply(list(blogs, news, twitter), length)
# Number of characters
number.of.chars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# The longest line
longest.line <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), max)
# Number of words
number.of.words <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
data.summary <- data.frame(
c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
paste(file.sizes, "MiB"),
number.of.lines,
number.of.chars,
number.of.words,
longest.line
)
names(data.summary) <- c("Data file", "File size", "Number of lines", "Number of characters", "Number of words", "Longest line (chars)")
data.summary
| Data file | File size | Number of lines | Number of characters | Number of words | Longest line (chars) |
|---|---|---|---|---|---|
| en_US.blogs.txt | 200 MiB | 899288 | 206824505 | 37570839 | 40833 |
| en_US.news.txt | 196 MiB | 77259 | 15639408 | 2651432 | 5760 |
| en_US.twitter.txt | 159 MiB | 2360148 | 162096241 | 30451170 | 140 |
As the combined corpus is fairly large, a random sample of 4,000 lines is taken from each of the three datasets.
# Sample size (lines per dataset)
sample.size <- 4000
# Set RNG seed for reproducibility
set.seed(333)
# Sample each file without replacement
sample.blogs <- sample(blogs, sample.size, replace = FALSE)
sample.news <- sample(news, sample.size, replace = FALSE)
sample.twitter <- sample(twitter, sample.size, replace = FALSE)
# Combine the samples
sample.data <- c(sample.blogs, sample.news, sample.twitter)
# Save concatenated sample file
sample.file <- "./final/en_US/sample.txt"
if (!file.exists(sample.file)) { writeLines(sample.data, sample.file) }
# File size (MiB)
file.size <- round(file.info(sample.file)$size / 1048576)
rm(sample.file)
# Number of lines
number.of.lines <- length(sample.data)
# Number of characters
number.of.chars <- sum(nchar(sample.data))
# Number of words
number.of.words <- sum(stri_count_words(sample.data))
# The longest line
longest.line <- max(nchar(sample.data))
data.summary <- data.frame(
c("sample.txt"),
paste(file.size, "MiB"),
number.of.lines,
number.of.chars,
number.of.words,
longest.line
)
names(data.summary) <- c("Data file", "File size", "Number of lines", "Number of characters", "Number of words", "Longest line (chars)")
data.summary
| Data file | File size | Number of lines | Number of characters | Number of words | Longest line (chars) |
|---|---|---|---|---|---|
| sample.txt | 2 MiB | 12000 | 2021224 | 359393 | 2689 |
# Clean up memory
rm(blogs, news, twitter, sample.blogs, sample.news, sample.twitter)
#gc()
Create a cleaned corpus from the sample data. The cleaning converts the text to lower case and removes punctuation, numbers, URLs, extra whitespace, and finally profane words.
# Read sample data
text.corpus <- Corpus(VectorSource(sample.data))
# Convert text to lower case (wrapped in content_transformer so the documents stay valid tm documents)
text.corpus <- tm_map(text.corpus, content_transformer(tolower))
# Remove punctuation marks from the text
text.corpus <- tm_map(text.corpus, removePunctuation)
# Remove numbers from the text
text.corpus <- tm_map(text.corpus, removeNumbers)
# Content transformer that replaces all matches of a pattern with the empty string
replacement <- content_transformer(function(x, pattern) { gsub(pattern, "", x) })
# Remove URL addresses (ftp, ftps, http, https)
text.corpus <- tm_map(text.corpus, replacement, "(f|ht)tp(s?)://(.*)[.][a-z]+")
# Strip extra whitespace from the text
text.corpus <- tm_map(text.corpus, stripWhitespace)
# Download profane word list if necessary
profane.file <- "./final/en_US/profane.txt"
if (!file.exists(profane.file)) {
download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", profane.file)
}
# Remove profane words from the text
profane.words <- readLines(profane.file)
text.corpus <- tm_map(text.corpus, removeWords, profane.words)
rm(profane.file, profane.words)
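As a quick sanity check that the transformations behaved as intended, the first raw sample line can be compared with its cleaned counterpart (a minimal illustration; any document index would do).
# Compare a raw sample line with the cleaned version stored in the corpus
head(sample.data, 1)
as.character(text.corpus[[1]])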
The corpus sample will now be tokenized to build a basic n-gram model. An n-gram model estimates the probability of a word occurring in a phrase based on the words that precede it: the number of times the full phrase (the preceding words plus the candidate word) occurs is divided by the number of times the preceding words occur on their own.
While the word-level analysis above is helpful for initial exploration, the predictive model will need a dictionary of bigrams, trigrams, and 4-grams, collectively called n-grams. Bigrams are two-word phrases, trigrams are three-word phrases, and 4-grams are four-word phrases.
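For example, with a bigram model the maximum-likelihood estimate of the next word is

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}$$

so if the phrase "of the" appeared 120 times in the sample and the word "of" appeared 400 times, the model would estimate P(the | of) = 120/400 = 0.3 (illustrative counts, not taken from this corpus).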
We will use the RWeka package to tokenize the sample data and build frequency tables of unigrams, bigrams, trigrams, and 4-grams.
# Extract the cleaned text from the corpus as a character vector
corpus.text <- sapply(text.corpus, as.character)
# Tokenize into n-grams with RWeka
uni.gram <- NGramTokenizer(corpus.text, Weka_control(min = 1, max = 1))
bi.grams <- NGramTokenizer(corpus.text, Weka_control(min = 2, max = 2))
tri.grams <- NGramTokenizer(corpus.text, Weka_control(min = 3, max = 3))
quad.grams <- NGramTokenizer(corpus.text, Weka_control(min = 4, max = 4))
# Convert to data frame
uni.gram <- data.frame(table(uni.gram))
names(uni.gram) <- c("Words", "Frequency")
# Sort by frequency
uni.gram <- uni.gram[order(uni.gram$Frequency, decreasing = TRUE),]
# Select top 10 words
uni.gram.top10 <- uni.gram[1:10,]
# Convert character to factor
uni.gram.top10$Words <- factor(uni.gram.top10$Words, levels = uni.gram.top10$Words[order(-uni.gram.top10$Frequency)])
# Plot top 10 1-grams
g <- ggplot(uni.gram.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#00AFBB")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 90),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 1-grams")
print(g)
# Convert to data frame
bi.grams <- data.frame(table(bi.grams))
names(bi.grams) <- c("Words", "Frequency")
# Sort by frequency
bi.grams <- bi.grams[order(bi.grams$Frequency, decreasing = TRUE),]
# Select top 10 words
bi.grams.top10 <- bi.grams[1:10,]
# Convert character to factor
bi.grams.top10$Words <- factor(bi.grams.top10$Words, levels = bi.grams.top10$Words[order(-bi.grams.top10$Frequency)])
# Plot top 10 2-grams
g <- ggplot(bi.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#52854C")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 90),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 2-grams")
print(g)
# Convert to data frame
tri.grams <- data.frame(table(tri.grams))
names(tri.grams) <- c("Words", "Frequency")
# Sort by frequency
tri.grams <- tri.grams[order(tri.grams$Frequency, decreasing = TRUE),]
# Select top 10 words
tri.grams.top10 <- tri.grams[1:10,]
# Convert character to factor
tri.grams.top10$Words <- factor(tri.grams.top10$Words, levels = tri.grams.top10$Words[order(-tri.grams.top10$Frequency)])
# Plot top 10 3-grams
g <- ggplot(tri.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#293352")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 90),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 3-grams")
print(g)
# Convert to data frame
quad.grams <- data.frame(table(quad.grams))
names(quad.grams) <- c("Words", "Frequency")
# Sort by frequency
quad.grams <- quad.grams[order(quad.grams$Frequency, decreasing = TRUE),]
# Select top 10 words
quad.grams.top10 <- quad.grams[1:10,]
# Convert character to factor
quad.grams.top10$Words <- factor(quad.grams.top10$Words, levels = quad.grams.top10$Words[order(-quad.grams.top10$Frequency)])
# Plot top 10 4-grams
g <- ggplot(quad.grams.top10, aes(x = Words, y = Frequency))
g <- g + geom_bar(stat = "identity", fill = "#CC79A7")
g <- g + geom_text(aes(label = Frequency), vjust = -0.30, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 90),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Common 4-grams")
print(g)
The goal is to build a predictive model that suggests the most probable next word given the user's input, using the n-gram frequency tables built above as its basis. The plan is to find a good balance between sample size and prediction accuracy, evaluate the resulting model, and deploy it as a Shiny application. The next steps are to implement the prediction algorithm and to build the user interface of the Shiny app.
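As a rough sketch of how such a prediction function could use the frequency tables built above (predict.next.word is a hypothetical helper written for illustration, not the final implementation; it assumes the bi.grams, tri.grams and quad.grams data frames are still in memory):
# Hypothetical helper: predict the next word with a simple longest-match backoff
predict.next.word <- function(input, ngram.tables) {
# Use at most the last three words of the (lower-cased) input
words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
for (n in length(words):1) {
prefix <- paste(tail(words, n), collapse = " ")
candidates <- ngram.tables[[n]]   # element n holds the (n+1)-gram table
hits <- candidates[startsWith(as.character(candidates$Words), paste0(prefix, " ")), ]
if (nrow(hits) > 0) {
best <- as.character(hits$Words[which.max(hits$Frequency)])
return(tail(unlist(strsplit(best, " ")), 1))  # last word of the best match
}
}
as.character(uni.gram$Words[1])     # fall back to the most frequent unigram
}
# Example call using the tables built above
predict.next.word("thanks for the", list(bi.grams, tri.grams, quad.grams))
A full implementation would also apply the same cleaning steps to the user input and would replace raw counts with a smoothing or discounting scheme (for example Katz back-off or Kneser-Ney) before deployment to Shiny.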