This document serves as a Milestone Report for the Capstone course in Coursera’s Data Science Specialization. The ultimate goal of the Capstone project is to produce a prediction algorithm and Shiny app that will serve as a predictive text product. This exploratory analysis takes stock of the development process by showing that the data has been successfully loaded, summary statistics are being evaluated, interesting trends are being uncovered, and ideas are percolating for the next steps.
The data sets for this project are available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Below, the U.S. English Twitter, news, and blog data are read into R, and the necessary R packages are loaded.
library(readtext)
## Warning: package 'readtext' was built under R version 3.4.4
library(stringi)
## Warning: package 'stringi' was built under R version 3.4.4
library(tm)
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
twitter_text <- readLines("~/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news_text <- readLines("~/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs_text <- readLines("~/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
By taking a cursory look at the data, we can get a better feel for what we are working with and a clearer perspective on our ultimate goals. As requested, let’s compute some basic summaries: file sizes, line counts, word counts, and mean words per line.
twitter_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size /(1024 ^ 2)
news_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size /(1024 ^ 2)
blogs_size <- file.info("~/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size /(1024 ^ 2)
twitter_words <- stri_count_words(twitter_text)
news_words <- stri_count_words(news_text)
blogs_words <- stri_count_words(blogs_text)
data.frame(source = c("Twitter", "news", "blogs"),
file_size_MB = c(twitter_size, news_size, blogs_size),
num_lines = c(length(twitter_text), length(news_text), length(blogs_text)),
num_words = c(sum(twitter_words), sum(news_words), sum(blogs_words)),
mean_num_words = c(mean(twitter_words), mean(news_words), mean(blogs_words)))
## source file_size_MB num_lines num_words mean_num_words
## 1 Twitter 159.3641 2360148 30093410 12.75065
## 2 news 196.2775 1010242 34762395 34.40997
## 3 blogs 200.4242 899288 37546246 41.75108
For the purposes of this exploratory analysis, we will randomly sample 1% of the lines from each of the three sets. We will then apply the necessary transformations to eliminate non-words (URLs, Twitter handles, numbers, punctuation, and stop words) and to standardize the text as lower case.
set.seed(1234)
sample_text <- c(sample(twitter_text, length(twitter_text) * 0.01),
sample(news_text, length(news_text) * 0.01),
sample(blogs_text, length(blogs_text) * 0.01))
clean_sample_text <- VCorpus(VectorSource(sample_text))
rm(twitter_text, twitter_size, twitter_words, news_text, news_size, news_words, blogs_text, blogs_size, blogs_words)
now_a_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
clean_sample_text <- tm_map(clean_sample_text, now_a_space, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # drop URLs
clean_sample_text <- tm_map(clean_sample_text, now_a_space, "@[^\\s]+")                      # drop Twitter handles
clean_sample_text <- tm_map(clean_sample_text, content_transformer(tolower))  # lower case first so the stop word list matches
clean_sample_text <- tm_map(clean_sample_text, removeNumbers)
clean_sample_text <- tm_map(clean_sample_text, removeWords, stopwords("english"))
clean_sample_text <- tm_map(clean_sample_text, removePunctuation)
clean_sample_text <- tm_map(clean_sample_text, stripWhitespace)
# We almost forgot to remove the profanity. Let's do that now.
profanity_file <- file("http://www.bannedwordlist.com/lists/swearWords.txt", open = "rb")
profanity <- readLines(profanity_file, encoding = "UTF-8", warn=TRUE, skipNul=TRUE)
## Warning in readLines(profanity_file, encoding = "UTF-8", warn =
## TRUE, skipNul = TRUE): incomplete final line found on 'http://
## www.bannedwordlist.com/lists/swearWords.txt'
close(profanity_file)
rm(profanity_file)
clean_sample_text <- tm_map(clean_sample_text, removeWords, profanity)
Finally, “it’s time to make the doughnuts.” Let’s get a taste of what the text data holds by analyzing our 1% sample in terms of n-grams. It will be productive to look at the most frequent single words and two- and three-word combinations, which will help us as we prepare to develop our model.
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
freq_df <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}
unigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text), 0.9999)
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(clean_sample_text, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
freq_plot <- function(data, title) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x = "Words/Phrases", y = "Frequency") +
ggtitle(title) +
theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = "purple3")
}
freq_plot(unigram_freq, "Top 30 Unigrams")
freq_plot(bigram_freq, "Top 30 Bigrams")
freq_plot(trigram_freq, "Top 30 Trigrams")
There is a great amount of data here, and building and transforming the corpus is time-consuming. We will need to examine our data cleansing more thoroughly, and there may be speed benefits from packages we have not yet considered or gotten to work properly (RWeka, for example). In this rudimentary study, however, we have gained a much better feel for the data, and we are off to a good start. We may even consider input from four-grams. The big steps of training, testing, and validating the model still remain, and the Shiny app and presentation will be the culmination of these efforts; their design should begin as soon as the predictive model becomes viable.
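As a first, hedged idea for that model, here is a minimal sketch of a “stupid backoff” next-word lookup built directly on the frequency tables computed above. The helper name predict_next_word and the example phrase are hypothetical, the input is assumed to be cleaned the same way as the corpus, and the real model will still need smoothing, pruning, and a proper train/test/validation split.
predict_next_word <- function(phrase, n = 3) {
  # Keep the last two words of the lower-cased input phrase
  tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # First try trigrams whose first two words match the last two words of the input
  if (length(tokens) == 2) {
    hits <- trigram_freq[startsWith(as.character(trigram_freq$word),
                                    paste(tokens[1], tokens[2], "")), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 3), n))
  }
  # Back off to bigrams whose first word matches the last word of the input
  hits <- bigram_freq[startsWith(as.character(bigram_freq$word),
                                 paste(tail(tokens, 1), "")), ]
  if (nrow(hits) > 0)
    return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 2), n))
  # Last resort: the most frequent unigrams overall
  head(as.character(unigram_freq$word), n)
}
predict_next_word("happy mothers")
Because the frequency tables are already sorted in descending order of frequency, the first matches are the most likely continuations. Note that this exploratory sample had stop words stripped out, so the model behind the Shiny app will probably need to be trained on text with stop words retained.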