This is a milestone report for the Coursera Data Science Capstone Project. The overall goal of the project is to build a predictive text application: an application that takes a phrase (multiple words) as input and, after the user hits submit, predicts the next word. The goals of the current report are: (1) to download and clean the data, (2) to conduct an exploratory data analysis, and (3) to build \(n\)-grams as the foundation for a text-predictive model.
The following code downloads and unzips the data.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("Coursera-SwiftKey.zip")){
  download.file(url, destfile = "Coursera-SwiftKey.zip", method = "curl")
  unzip("Coursera-SwiftKey.zip")
}
We load the following packages that we are going to use for the remainder of this document.
library(tidyverse)
library(tidytext)
library(stringr)
library(stringi)
library(tm)
We load the data sets using read_lines from the readr package.
twitter <- read_lines("./final/en_US/en_US.twitter.txt")
blogs <- read_lines("./final/en_US/en_US.blogs.txt")
news <- read_lines("./final/en_US/en_US.news.txt")
The following code calculates the number of lines, characters, and words in each of the data sets.
words_per_source <- sapply(list(blogs, news, twitter),
                           function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(words_per_source) <- c('Min', 'Mean', 'Max')
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    words_per_source
  ))
)
stats
## Dataset Lines Chars Words Min Mean Max
## 1 blogs 899288 206824382 37570839 0 41.75 6726
## 2 news 1010242 203223154 34494539 1 34.41 1796
## 3 twitter 2360148 162096031 30451128 1 12.75 47
The table above shows that the Twitter data has the most lines and the blog data has the fewest, while the order is reversed for the number of words: the blog data contains the most words and the Twitter data the fewest. This is consistent with our expectation that tweets are easier to write than blog posts, but are also much shorter.
Since the data sets are large, we sample only around 3% of each. We will combine these samples to build our \(n\)-grams.
# Get samples from each document for modeling
set.seed(42)
data_sample <- c(sample(blogs, length(blogs) * 0.03),
                 sample(news, length(news) * 0.03),
                 sample(twitter, length(twitter) * 0.03))
The following combines the separate data sources into one VCorpus object, then cleans and tidies the resulting data set.
# combine the data into a VCorpus object
corpus <- VCorpus(VectorSource(data_sample), readerControl = list(reader=readPlain, language="en_US"))
# tidy the data using tidytext::tidy
corpus_td <- tidy(corpus)
# remove the URL links, remove the punctuation marks except for apostrophes,
# remove numerals, remove the dollar sign, remove the non-ASCII (hence non-English) words,
# convert the words to lower case
corpus_td <- corpus_td %>%
  mutate(
    text = str_replace_all(text, "(f|ht)tp(s?)://(.*)[.][a-z]+|[0-9]+|(?!')[[:punct:]]|\\$", "")
  ) %>%
  mutate(
    text = tolower(text)
  ) %>%
  mutate(
    text = iconv(text, from = "UTF-8", to = "ASCII", sub = "")
  )
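# Quick illustration of the cleaning above on a made-up example line
# (hypothetical input, not from the corpus): URLs, digits, dollar signs,
# and punctuation other than apostrophes are removed, the text is
# lower-cased, and non-ASCII characters are dropped.
"Visit https://example.com now! It's 100% fun at the caf\u00e9" %>%
  str_replace_all("(f|ht)tp(s?)://(.*)[.][a-z]+|[0-9]+|(?!')[[:punct:]]|\\$", "") %>%
  tolower() %>%
  iconv(from = "UTF-8", to = "ASCII", sub = "")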
# free the memory
rm(blogs)
rm(news)
rm(twitter)
rm(corpus)
# fix words that are written with repeated letters e.g. pleaseeee
# https://stackoverflow.com/questions/37198364/r-remove-words-with-3-or-more-repeating-letters-using-gsub
rm.repeatLetters <- function(x){
  xvec <- unlist(strsplit(x, " "))
  # flag words containing a letter repeated three or more times in a row
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  return(paste(xvec[!rmword], collapse = " "))
}
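# Hypothetical example, included only to show what the helper does:
# words with a letter repeated three or more times are dropped entirely.
rm.repeatLetters("please pleaseeee stay")   # should return "please stay"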
corpus_td$text <- sapply(corpus_td$text, rm.repeatLetters)
With the help of the tidytext package, we now build our \(n\)-grams.
unigram <- corpus_td %>%
  unnest_tokens(unigram, text)
bigram <- corpus_td %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
trigram <- corpus_td %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
quadgram <- corpus_td %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4)
pentagram <- corpus_td %>%
  unnest_tokens(pentagram, text, token = "ngrams", n = 5)
hexagram <- corpus_td %>%
  unnest_tokens(hexagram, text, token = "ngrams", n = 6)
The following function is used for plotting the most frequent \(n\)-grams in our sample data.
plot_freq <- function(data, text, n = 10){
  #data <- enquo(data)
  text <- enquo(text)
  data %>%
    count(!!text) %>%
    top_n(n, n) %>%
    mutate(text := fct_reorder(!!text, n)) %>%
    ggplot(., aes_(quo(text), quo(n))) +
    geom_bar(stat = "identity") +
    coord_flip()
}
plot_freq(unigram, unigram, n = 20)
plot_freq(bigram, bigram, n = 20)
plot_freq(trigram, trigram, n = 20)
plot_freq(quadgram, quadgram, n = 20)
plot_freq(pentagram, pentagram, n = 20)
plot_freq(hexagram, hexagram, n = 15)
In this report, we have conducted exploratory data analysis and built \(n\)-grams, which we will use as the foundation of our text-predictive model. The following tasks should now be done: