This milestone report is the Week 2 assignment of the Data Science Capstone course in the Data Science Specialization offered by Johns Hopkins University on Coursera.
The overall goal of the capstone is to develop an algorithm that predicts the most likely next word in a sequence of words. The purpose of this milestone report is to present some exploratory data analysis of the main features of the data, which will inform the eventual prediction algorithm and app.
To get a first impression of the data, we look at the size in megabytes, the number of lines, the number of characters, and the number of words for each of the three datasets (Blogs, News, and Twitter), as well as the minimum, maximum, and mean number of words per line (WPL).
# Libraries
library(knitr); library(dplyr); library(doParallel); library(tm); library(SnowballC)
library(stringi); library(ggplot2); library(wordcloud); library(kableExtra)
setwd("~/Coursera/8_Data_Science_Specialization/10 Data Science Capstone/final/en_US")
path1 <- "./en_US.blogs.txt"
path2 <- "./en_US.news.txt"
path3 <- "./en_US.twitter.txt"
# Read blogs data in binary mode
conn <- file(path1, open="rb")
blogs <- readLines(conn, encoding="UTF-8"); close(conn)
# Read news data in binary mode
conn <- file(path2, open="rb")
news <- readLines(conn, encoding="UTF-8"); close(conn)
# Read twitter data in binary mode
conn <- file(path3, open="rb")
twitter <- readLines(conn, encoding="UTF-8"); close(conn)
# Remove temporary variable
rm(conn)
# Summary info
WPL <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
stats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  "File Size" = sapply(list(blogs, news, twitter), function(x) { format(object.size(x), "MB") }),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    WPL
  ))
)
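The summary data frame is then rendered as the table below. A minimal sketch of the rendering call, assuming the stats object built above and the kableExtra package loaded earlier:
# Render the summary statistics table (sketch)
kable(stats) %>% kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)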
| FileName | File.Size | Lines | Chars | Words | WPL_Min | WPL_Mean | WPL_Max |
|---|---|---|---|---|---|---|---|
| en_US.blogs | 255.4 Mb | 899288 | 206824382 | 37570839 | 0 | 41.75107 | 6726 |
| en_US.news | 257.3 Mb | 1010242 | 203223154 | 34494539 | 1 | 34.40997 | 1796 |
| en_US.twitter | 319 Mb | 2360148 | 162096031 | 30451128 | 1 | 12.75063 | 47 |
The following clean-up steps are performed on a 1% random sample of the three datasets; the sampling itself is sketched first.
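A minimal sketch of the sampling step, assuming the sampleData name used below (the seed value is an arbitrary choice):
# Draw roughly a 1% random sample from each dataset (sketch)
set.seed(123)
sampleData <- c(sample(blogs, round(length(blogs) * 0.01)),
                sample(news, round(length(news) * 0.01)),
                sample(twitter, round(length(twitter) * 0.01)))
rm(blogs, news, twitter)  # free memory; the full datasets are no longer needed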
# Build Corpus
build_corpus <- function (x = sampleData) {
  sample_c <- VCorpus(VectorSource(x))                             # Create corpus dataset
  sample_c <- tm_map(sample_c, content_transformer(tolower))       # Convert to lowercase
  sample_c <- tm_map(sample_c, removePunctuation)                  # Eliminate punctuation
  sample_c <- tm_map(sample_c, removeNumbers)                      # Eliminate numbers
  sample_c <- tm_map(sample_c, stripWhitespace)                    # Strip whitespace
  sample_c <- tm_map(sample_c, removeWords, stopwords("english"))  # Eliminate English stop words
  sample_c <- tm_map(sample_c, stemDocument)                       # Stem the documents
  return(sample_c)  # content_transformer keeps plain text documents, so no extra conversion is needed
}
corpusData <- build_corpus(sampleData)
# Tokenize and Calculate Frequencies of N-Grams
library(RWeka)
getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
  # Create a term-document matrix tokenized on n-grams
  tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
  tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
  # Find the terms that occur at least 'lowfreq' times in the corpus
  top_terms <- findFreqTerms(tdm, lowfreq)
  top_terms_freq <- rowSums(as.matrix(tdm[top_terms, ]))
  top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
  # Return the terms sorted by decreasing frequency
  return(arrange(top_terms_freq, desc(frequency)))
}
tt.Data <- vector("list", 3)  # pre-allocate a list for the 1-, 2-, and 3-gram tables
for (i in 1:3) {
tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
We need to convert the dataset into a format that is most useful for Natural Language Processing (NLP). The format of choice is the N-gram. The N-gram representation of a text lists all N-tuples of words that appear in it. The simplest case is the unigram, which is based on individual words; the bigram is based on pairs of words, and the trigram on sequences of three words.
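As a quick illustration, the same RWeka tokenizer used above can be applied to a made-up sentence (a sketch, not part of the analysis):
# Bigram tokenization of a short example sentence
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
# yields the bigrams "the quick", "quick brown", "brown fox"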
The wordcloud package is used to illustrate what the corpus looks like in terms of word frequency. The word clouds for unigrams, bigrams, and trigrams are shown from left to right, respectively.
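A minimal sketch of the calls that produce these plots, assuming the tt.Data frequency tables built above (the panel layout, word limit, and scaling are illustrative choices):
# Word clouds for unigrams, bigrams, and trigrams, left to right (sketch)
par(mfrow = c(1, 3))
for (i in 1:3) {
  wordcloud(words = tt.Data[[i]]$word, freq = tt.Data[[i]]$frequency,
            max.words = 50, random.order = FALSE, scale = c(3, 0.5))
}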
After this exploratory analysis, we can continue with the goal of the capstone: building the predictive model(s) and, eventually, the data product.