This report is an interim status product for the tenth and final class (Capstone Project) of Coursera’s Johns Hopkins University Data Science Specialization. The Capstone Project calls for obtaining a corpus (a collection of text writings, speeches, etc.) called HC Corpora, which has been compiled from online sources and made available through Coursera. The English portion of this corpus will be used for initial processing; it contains three distinct text datasets: Blog records, News feed records, and Twitter feed records.
The overall objective of this project is to analyze the corpora provided and develop a predictive word model that suggests the next word in a sentence given one, two, or three starting words. This predictive model is to be incorporated into an online Shiny application that provides a user interface for its testing and evaluation.
Several weeks of work are planned to compile, enhance, and test the model and the final Shiny application. When completed, the Shiny application will be hosted on shinyapps.io.
The corpus for this project was downloaded from:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Although the downloaded archive contains files in English and three other languages, this project primarily uses the English data. The three raw unzipped English datasets are fairly large (blogs - 205 MB, news - 201 MB, twitter - 163 MB). Reading them gives the number of records in each: blogs - 899,288 records, news - 77,259 records, twitter - 2,360,148 records (see Appendix II - “read_files” and “explore_raw_data” for code).
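As a minimal sketch of the download step, the archive could be fetched and unpacked as follows. The helper names `fetch_corpus` and `file_size_mb` and the `Data` directory are hypothetical and are not part of the code used for this report.

```r
# Hypothetical helper: download and unzip the archive only when it is missing.
fetch_corpus <- function(url, dest_dir = "Data") {
  zip_path <- file.path(dest_dir, basename(url))
  if (!dir.exists(dest_dir)) dir.create(dest_dir, recursive = TRUE)
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)  # returns the extracted file paths
}

# Hypothetical helper: size on disk, in MB, for a vector of file paths
file_size_mb <- function(paths) round(file.info(paths)$size / 1024^2, 1)

# Usage (not run here because the archive is large):
# files <- fetch_corpus("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
# file_size_mb(grep("en_US", files, value = TRUE))
```

Guarding the download with `file.exists` means repeated runs of the report do not fetch the archive again.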
Each of the three datasets consists of a single column of type “Character”. Each record is a string of characters, usually making up a phrase or a sentence. The blog records vary in length from 1 to 40,835 characters, the news records from 1 to 5,760 characters, and the twitter records from 2 to 213 characters. In addition, a simple word parse shows that the blog dataset contains 38,154,238 words, the news dataset 2,693,898 words, and the twitter dataset 30,218,166 words.
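The summary statistics described here can be illustrated on a toy vector. This is a base R sketch under stated assumptions: `records` is made-up data, and `count_words` is a hypothetical whitespace-based counter, so its totals may differ slightly from the `stri_count_words` figures reported in this section.

```r
# Toy records standing in for lines read from one of the text files
records <- c("Hello world", "A", "Three short words")

n_records  <- length(records)          # number of records
char_range <- range(nchar(records))    # shortest and longest record

# Approximate word count: split each record on runs of whitespace
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))
n_words <- sum(count_words(records))

print(c(records = n_records, min_chars = char_range[1],
        max_chars = char_range[2], words = n_words))
```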
Creating a “next word” prediction program requires breaking the three datasets into simple “words” and analyzing the frequency of word groups (N-Grams of 1, 2, 3, and 4 words). For that initial analysis, each of the three raw text datasets must be “cleaned” to address numerous issues, including extraneous special characters, non-English text, and other problems. Because the datasets are too large to clean in one pass, a random 1% subset of each has been taken to facilitate data exploration and to expedite code development and testing (see Appendix II - “clean_data” for code). Follow-on project efforts will employ code that parses each dataset’s full record set.
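The sampling and cleaning steps can be sketched in base R as below. The functions `sample_records` and `clean_text` are hypothetical simplifications: the appendix code additionally transliterates accented characters before removing non-ASCII text, which this sketch does not do.

```r
set.seed(1234)  # fixed seed so the sample is reproducible

# Draw a random fraction of records without replacement
sample_records <- function(x, fraction = 0.01) {
  n <- max(1, round(length(x) * fraction))
  x[sample.int(length(x), n)]
}

# Lower-case, replace everything except letters and spaces, squeeze whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x, perl = TRUE)
  x <- gsub(" +", " ", x)
  trimws(x)
}

raw <- c("Hello, World!!", "It's 5pm :)")
print(clean_text(raw))

# One percent of 1,000 toy records
print(length(sample_records(as.character(1:1000), 0.01)))
```

Note that replacing the apostrophe with a space splits contractions ("it's" becomes "it s"), which is one of the cleaning issues a later pass would need to address.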
Once the datasets have been “cleaned”, the next step is to transform (tokenize) the data into tables of N-Grams (word groups), one table for each N-Gram size of interest (1-Gram, 2-Gram, 3-Gram, and 4-Gram) for each of the three cleaned datasets (see Appendix II - “tokenize_data” for code). Each of these tables contains one record for each time a word group occurs in the text.
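To make the tokenization step concrete, here is a base R sketch that builds N-Grams without the RWeka dependency. The functions `ngrams_one` and `ngrams` are hypothetical, and the sketch assumes that N-Grams never cross record boundaries.

```r
# Build all n-grams of size n from one cleaned record
ngrams_one <- function(text, n) {
  words <- strsplit(trimws(text), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Apply to a vector of records and flatten into one character vector
ngrams <- function(texts, n) as.character(unlist(lapply(texts, ngrams_one, n = n)))

cleaned <- c("the cat sat on the mat", "the cat ran")
print(ngrams(cleaned, 2))
# "the cat" "cat sat" "sat on" "on the" "the mat" "the cat" "cat ran"
```

A record with fewer than n words contributes no N-Grams of that size, so the second toy record yields no 4-Grams.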
Combining the three datasets’ tokenized N-Gram files produces one table per N-Gram size, which can then be summarized into a frequency table. Each frequency table contains a character field for the N-Gram (word group), its count (frequency of occurrence), and its overall percentage within that N-Gram size (see Appendix II - “frequency_data” for code).
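A frequency table of this shape can be sketched in base R as follows. The function `freq_table` is hypothetical, and it assumes `pct` is expressed on a 0 to 100 scale, whereas the appendix code stores the share as a fraction.

```r
# Turn a character vector of n-grams into a sorted frequency table
freq_table <- function(grams) {
  counts <- table(grams)
  out <- data.frame(ngram = names(counts),
                    count = as.integer(counts),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$count, out$ngram), ]   # most frequent first
  rownames(out) <- NULL
  out$pct <- 100 * out$count / sum(out$count)  # share of all occurrences
  out
}

grams <- c("the cat", "cat sat", "the cat", "the mat")
print(freq_table(grams))
```

In this toy input "the cat" occurs twice out of four N-Grams, so it heads the table with a share of 50.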
Further work on this project will process the complete files for each of the three raw datasets. This will allow analysis of the full corpus, and those results will be used for modeling, testing, and production of the completed prediction system. (Note: a 25% subsample of each N-Gram dataset will be set aside as a sample set for testing and validation.)
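One way the 25% hold-out could be drawn is sketched below. The function `split_holdout` is hypothetical, and the sketch assumes the split is made on a vector of items (records or N-Grams) by random index.

```r
set.seed(1234)  # fixed seed so the split is reproducible

# Split a vector into a training part and a held-out validation part
split_holdout <- function(x, holdout = 0.25) {
  idx <- sample.int(length(x), size = round(length(x) * holdout))
  list(train = x[setdiff(seq_along(x), idx)], test = x[idx])
}

records <- paste("record", 1:20)
parts <- split_holdout(records)
print(lengths(parts))
```

Using `setdiff` on the indices, rather than negative indexing, keeps the training part intact even when the hold-out happens to be empty.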
To get a feel for what the N-Gram data looks like, a plot of the 20 most frequent N-Grams shows both the text values (the N-Grams) and their relative percentages within that N-Gram sample population. Appendix I - N-Gram Plots contains percentage bar charts for N-Grams of 1, 2, 3, and 4 words (see Appendix II - “plot_frequecy_data” for code).
This 1% subset of the data contains 41,799 unique 1-Grams, 314,188 unique 2-Grams, 539,522 unique 3-Grams, and 593,109 unique 4-Grams. Although this is only 1% of the corpus, given the nature of language, it is unlikely that these numbers will increase in proportion to the increase in training data (i.e., 1% -> 75%).
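The expectation that unique counts grow more slowly than the sample can be illustrated with synthetic data. This sketch does not use the corpus: it assumes a made-up vocabulary of 5,000 words whose frequencies fall off as 1/rank, a Zipf-like shape often used as a rough model of word frequencies.

```r
set.seed(1234)

# Synthetic vocabulary with Zipf-like frequencies (assumption: p ~ 1/rank)
vocab <- paste0("w", 1:5000)
prob  <- (1 / seq_along(vocab)) / sum(1 / seq_along(vocab))

# Number of distinct words seen in a sample of n tokens
unique_in_sample <- function(n) {
  length(unique(sample(vocab, n, replace = TRUE, prob = prob)))
}

sizes   <- c(1000, 10000, 100000)
uniques <- vapply(sizes, unique_in_sample, integer(1))
print(data.frame(tokens = sizes, unique_words = uniques))
```

Each tenfold increase in tokens yields far less than a tenfold increase in distinct words, because most of the added tokens repeat words already seen.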
The preliminary analysis of the corpus for use in a predictive word application identifies several activities that must be completed to finish this project. These include:
Enhance Data Cleaning - The actual N-Gram text at each N-Gram level shows language usage issues that should be addressed (e.g., word concatenations, profanity, etc.).
Full Corpus Usage - The full corpus should be incorporated into the data for prediction modeling and validation.
Model Enhancement - The initial simplistic “most frequent” models need to be developed into more sophisticated models (e.g., Katz Back-off).
User Interface Development - This project requires a Shiny user interface for testing and evaluating the predictive word application. Development of the interface and its processing will proceed over the next few weeks.
Model Accuracy - The model development effort set aside 25% of the raw data for validation/testing. At each major stage of the predictive word model development, that validation dataset will be used to evaluate the model accuracy.
Enhance Performance - An additional requirement of this project is that the application run within 1 GB of memory and have a reasonable response time. These issues will be evaluated and reviewed during development.
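The “most frequent” starting point named under Model Enhancement can be sketched as a lookup that tries the longest context first and backs off to shorter ones. This is a simplified illustration on made-up training text, not Katz Back-off: it applies no discounting and computes no probabilities. The names `build_table` and `predict_next` are hypothetical.

```r
# Toy training text (hypothetical); real tables would come from the corpus
train <- c("the cat sat on the mat", "the cat sat on the sofa",
           "a dog sat on the mat")

# Count n-grams of size n and split each into a prefix and a final word
build_table <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, " +"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  counts <- table(grams)
  words <- strsplit(names(counts), " ")
  data.frame(
    prefix   = vapply(words, function(w) paste(w[-n], collapse = " "), character(1)),
    nextword = vapply(words, function(w) w[n], character(1)),
    count    = as.integer(counts),
    stringsAsFactors = FALSE)
}

tables <- lapply(1:4, function(n) build_table(train, n))

# Try the longest available context first, then back off to shorter ones
predict_next <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  for (k in rev(seq_len(min(length(words), 3)))) {  # k = number of context words
    prefix <- paste(tail(words, k), collapse = " ")
    tab <- tables[[k + 1]]
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)])
  }
  uni <- tables[[1]]
  uni$nextword[which.max(uni$count)]                # fallback: most frequent word
}

print(predict_next("sat on the", tables))   # matched in the 4-Gram table
print(predict_next("purple cat", tables))   # backs off to the 2-Gram table
print(object.size(tables), units = "Kb")    # rough memory footprint of the tables
```

The same `predict_next` call could be scored against the held-out 25% to track accuracy, and `object.size` gives a first look at whether the tables fit within the memory limit.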
# read_files: read each raw English dataset into a character vector
# Blog Data
con <- file("..\\Capstone_Project\\Data\\final\\en_US\\en_US.blogs.txt", "r")
blogData <- readLines(con, skipNul = TRUE)
close(con)
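# News Data
# Note: the connection is opened in text mode ("r"). On Windows, text mode can stop
# at an embedded end-of-file control character; binary mode ("rb") avoids that.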
con <- file("..\\Capstone_Project\\Data\\final\\en_US\\en_US.news.txt", "r")
newsData <- readLines(con, skipNul = TRUE)
close(con)
# Twitter Data
con <- file("..\\Capstone_Project\\Data\\final\\en_US\\en_US.twitter.txt", "r")
twitterData <- readLines(con, skipNul = TRUE)
close(con)
rm(con)
# explore_raw_data: class, record count, min/max record length, and word count per dataset
library(stringi)
# Blog Data
blogClass <- class(blogData)
blogLength <- length(blogData)
blogMinChar <- min(nchar(blogData))
blogMaxChar <- max(nchar(blogData))
blogNumWords <- sum(stri_count_words(blogData))
# News Data
newsClass <- class(newsData)
newsLength <- length(newsData)
newsMinChar <- min(nchar(newsData))
newsMaxChar <- max(nchar(newsData))
newsNumWords <- sum(stri_count_words(newsData))
# Twitter Data
twitterClass <- class(twitterData)
twitterLength <- length(twitterData)
twitterMinChar <- min(nchar(twitterData))
twitterMaxChar <- max(nchar(twitterData))
twitterNumWords <- sum(stri_count_words(twitterData))
# clean_data: subset each dataset to a random 1% sample, then clean the sample
# Subset to 1% of Data
set.seed(1234)
blogSampleSize <- round((length(blogData)/100), 0)
blogDataSample <- sample(blogData, blogSampleSize, replace=FALSE)
newsSampleSize <- round((length(newsData)/100), 0)
newsDataSample <- sample(newsData, newsSampleSize, replace=FALSE)
twitterSampleSize <- round((length(twitterData)/100), 0)
twitterDataSample <- sample(twitterData, twitterSampleSize, replace=FALSE)
# Clean blogData
blogClean <- stringi::stri_trans_general(blogDataSample, "latin-ascii")  # transliterate accented characters
blogClean <- iconv(blogClean, from="UTF-8", to="ASCII", sub = "")        # drop remaining non-ASCII text
blogClean <- tolower(blogClean)
blogClean <- stringr::str_replace_all(blogClean,"[^a-zA-Z\\s]", " ")     # keep letters and whitespace only
blogClean <- stringr::str_replace_all(blogClean, "[\\s]+", " ")          # collapse runs of whitespace
# Clean newsData
newsClean <- stringi::stri_trans_general(newsDataSample, "latin-ascii")
newsClean <- iconv(newsClean, from="UTF-8", to="ASCII", sub = "")
newsClean <- tolower(newsClean)
newsClean <- stringr::str_replace_all(newsClean,"[^a-zA-Z\\s]", " ")
newsClean <- stringr::str_replace_all(newsClean, "[\\s]+", " ")
# Clean twitterData
twitterClean <- stringi::stri_trans_general(twitterDataSample, "latin-ascii")
twitterClean <- iconv(twitterClean, from="UTF-8", to="ASCII", sub = "")
twitterClean <- tolower(twitterClean)
twitterClean <- stringr::str_replace_all(twitterClean,"[^a-zA-Z\\s]", " ")
twitterClean <- stringr::str_replace_all(twitterClean, "[\\s]+", " ")
# tokenize_data: build 1-, 2-, 3-, and 4-Gram vectors for each cleaned dataset
library(RWeka)
# Blog
blogNGram1 <- NGramTokenizer(blogClean, Weka_control(min = 1, max = 1))
blogNGram2 <- NGramTokenizer(blogClean, Weka_control(min = 2, max = 2))
blogNGram3 <- NGramTokenizer(blogClean, Weka_control(min = 3, max = 3))
blogNGram4 <- NGramTokenizer(blogClean, Weka_control(min = 4, max = 4))
# News
newsNGram1 <- NGramTokenizer(newsClean, Weka_control(min = 1, max = 1))
newsNGram2 <- NGramTokenizer(newsClean, Weka_control(min = 2, max = 2))
newsNGram3 <- NGramTokenizer(newsClean, Weka_control(min = 3, max = 3))
newsNGram4 <- NGramTokenizer(newsClean, Weka_control(min = 4, max = 4))
# Twitter
twitterNGram1 <- NGramTokenizer(twitterClean, Weka_control(min = 1, max = 1))
twitterNGram2 <- NGramTokenizer(twitterClean, Weka_control(min = 2, max = 2))
twitterNGram3 <- NGramTokenizer(twitterClean, Weka_control(min = 3, max = 3))
twitterNGram4 <- NGramTokenizer(twitterClean, Weka_control(min = 4, max = 4))
library(reshape2)
# frequency_data: combine the three tokenized datasets per N-Gram size and build
# frequency tables sorted by count (pct = share of all occurrences)
#NGram1
nGram1Combined <- c(blogNGram1, newsNGram1, twitterNGram1)
nGram1Freq <- table(nGram1Combined)
nGram1Freq <- melt(nGram1Freq)
names(nGram1Freq) <- c("nGram1", "count")
nGram1Freq <- nGram1Freq[order(-nGram1Freq$count),]
rownames(nGram1Freq) <- 1:nrow(nGram1Freq)
nGram1Freq$pct <- nGram1Freq$count / sum(nGram1Freq$count)
nGram1Num <- nrow(nGram1Freq)
#NGram2
nGram2Combined <- c(blogNGram2, newsNGram2, twitterNGram2)
nGram2Freq <- table(nGram2Combined)
nGram2Freq <- melt(nGram2Freq)
names(nGram2Freq) <- c("nGram2", "count")
nGram2Freq <- nGram2Freq[order(-nGram2Freq$count),]
rownames(nGram2Freq) <- 1:nrow(nGram2Freq)
nGram2Freq$pct <- nGram2Freq$count / sum(nGram2Freq$count)
nGram2Num <- nrow(nGram2Freq)
#NGram3
nGram3Combined <- c(blogNGram3, newsNGram3, twitterNGram3)
nGram3Freq <- table(nGram3Combined)
nGram3Freq <- melt(nGram3Freq)
names(nGram3Freq) <- c("nGram3", "count")
nGram3Freq <- nGram3Freq[order(-nGram3Freq$count),]
rownames(nGram3Freq) <- 1:nrow(nGram3Freq)
nGram3Freq$pct <- nGram3Freq$count / sum(nGram3Freq$count)
nGram3Num <- nrow(nGram3Freq)
#NGram4
nGram4Combined <- c(blogNGram4, newsNGram4, twitterNGram4)
nGram4Freq <- table(nGram4Combined)
nGram4Freq <- melt(nGram4Freq)
names(nGram4Freq) <- c("nGram4", "count")
nGram4Freq <- nGram4Freq[order(-nGram4Freq$count),]
rownames(nGram4Freq) <- 1:nrow(nGram4Freq)
nGram4Freq$pct <- nGram4Freq$count / sum(nGram4Freq$count)
nGram4Num <- nrow(nGram4Freq)
# Plots for Appendix I: bar charts of the 20 most frequent N-Grams of each size
library(ggplot2)
# Plot Pct NGram1
nGram1FreqSub <- nGram1Freq[1:20,]
ggplot(
data=nGram1FreqSub,
aes(reorder(nGram1, -count), pct)) +
labs(x = "NGram1", y = "Frequency Percent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
# Plot Pct NGram2
nGram2FreqSub <- nGram2Freq[1:20,]
ggplot(
data=nGram2FreqSub,
aes(reorder(nGram2, -count), pct)) +
labs(x = "NGram2", y = "Frequency Percent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
# Plot Pct NGram3
nGram3FreqSub <- nGram3Freq[1:20,]
ggplot(
data=nGram3FreqSub,
aes(reorder(nGram3, -count), pct)) +
labs(x = "NGram3", y = "Frequency Percent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))
# Plot Pct NGram4
nGram4FreqSub <- nGram4Freq[1:20,]
ggplot(
data=nGram4FreqSub,
aes(reorder(nGram4, -count), pct)) +
labs(x = "NGram4", y = "Frequency Percent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue"))