Introduction

This is a milestone report for the Coursera Data Science Capstone project. The overall goal of the project is to build a predictive text application: given a phrase (multiple words) as input, the application predicts the next word after the user hits submit. The goals of this report are to (1) download and clean the data, (2) conduct an exploratory data analysis, and (3) build \(n\)-grams as the foundation for a text-prediction model.


Load Data

The following code downloads and unzips the data.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", method = "curl")
  unzip("Coursera-SwiftKey.zip")
}
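
As a quick check that the download and extraction worked, we can list the English-language files (the path below assumes the default folder layout of the zip archive, which the read_lines() calls below rely on):

# the archive unpacks into ./final/<locale>/; we use the en_US files below
list.files("./final/en_US")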

We load the following packages that we are going to use for the remainder of this document.

library(tidyverse)
library(tidytext)
library(stringr)
library(stringi)
library(tm)

We load the three data sets using read_lines() from the readr package (loaded above as part of the tidyverse).

twitter <- read_lines("./final/en_US/en_US.twitter.txt")
blogs <- read_lines("./final/en_US/en_US.blogs.txt")
news <- read_lines("./final/en_US/en_US.news.txt")

Summary statistics

The following code calculates the number of lines, characters, and words in each of the data sets.

words_per_source <- sapply(list(blogs, news, twitter),
                           function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(words_per_source) <- c('Min', 'Mean', 'Max')
stats <- data.frame(
  Dataset = c("blogs", "news", "twitter"),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    words_per_source
  ))
)
stats
##   Dataset   Lines     Chars    Words Min  Mean  Max
## 1   blogs  899288 206824382 37570839   0 41.75 6726
## 2    news 1010242 203223154 34494539   1 34.41 1796
## 3 twitter 2360148 162096031 30451128   1 12.75   47

The table above shows that the Twitter data has the most lines and the blog data the fewest, while the ordering is reversed for word counts: the blogs contain the most words and the tweets the fewest. This is consistent with our expectations: tweets are easier to write and therefore more numerous, but they are much shorter than blog posts (about 13 words per line on average versus roughly 42).

Sample and clean the data

Since the data sets are large, we sample only about 3% of the lines from each source. We will combine these samples to build our \(n\)-grams.

# Get samples from each document for modeling
set.seed(42)
data_sample <- c(sample(blogs, length(blogs) * 0.03),
                 sample(news, length(news) * 0.03),
                 sample(twitter, length(twitter) * 0.03))
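
As a rough check, we can look at the size of the combined sample (the exact numbers depend on the 3% fractions above, so they are not reported here):

# number of sampled lines and approximate memory footprint of the sample
length(data_sample)
format(object.size(data_sample), units = "Mb")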

The following code combines the sampled data into one VCorpus object, then cleans and tidies the resulting data set.

# combine the data into a VCorpus object
corpus <- VCorpus(VectorSource(data_sample), readerControl = list(reader=readPlain, language="en_US"))
# tidy the data using tidytext::tidy
corpus_td <- tidy(corpus) 
# remove the URL links, remove the punctuation marks except for apostrophes,
# remove numerals, remove the dollar sign, remove the non-ASCII (hence non-English) words,
# convert the words to lower case
corpus_td <- corpus_td %>%
  mutate(
    text = str_replace_all(text, "(f|ht)tp(s?)://(.*)[.][a-z]+|[0-9]+|(?!')[[:punct:]]|\\$", "")
  ) %>%
  mutate(
    text = tolower(text)
  ) %>%
  mutate(
    text = iconv(text, from = "UTF-8", to = "ASCII", sub="")
  )
# free the memory
rm(blogs)
rm(news)
rm(twitter)
rm(corpus)
# remove words that contain a letter repeated three or more times, e.g. pleaseeee
# https://stackoverflow.com/questions/37198364/r-remove-words-with-3-or-more-repeating-letters-using-gsub
rm.repeatLetters <- function(x){
  xvec <- unlist(strsplit(x, " "))
  # flag words in which any character is repeated three or more times in a row
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  return(paste(xvec[!rmword], collapse = " "))
}

corpus_td$text <- sapply(corpus_td$text, rm.repeatLetters)
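
To illustrate the effect of these cleaning steps, here is the same pipeline applied to a made-up line (a toy example, not taken from the data):

# toy example: the URL, the digits, and all punctuation except the apostrophe
# are stripped, the non-ASCII character in "café" is dropped by iconv, and
# "pleaseeee" (a letter repeated three or more times) is removed entirely
example <- "Pleaseeee visit https://example.com: I've got 99 café ideas!!!"
example %>%
  str_replace_all("(f|ht)tp(s?)://(.*)[.][a-z]+|[0-9]+|(?!')[[:punct:]]|\\$", "") %>%
  tolower() %>%
  iconv(from = "UTF-8", to = "ASCII", sub = "") %>%
  rm.repeatLetters()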

With the help of unnest_tokens() from the tidytext package, we now build \(n\)-grams of length one (unigrams) through six (hexagrams).

unigram <- corpus_td %>%
  unnest_tokens(unigram, text)

bigram <- corpus_td %>%
  unnest_tokens(bigram, text, token = "ngrams", n=2)

trigram <- corpus_td %>%
  unnest_tokens(trigram, text, token = "ngrams", n=3)

quadgram <- corpus_td %>%
  unnest_tokens(quadgram, text, token = "ngrams", n=4)

pentagram <- corpus_td %>%
  unnest_tokens(pentagram, text, token = "ngrams", n=5)

hexagram <- corpus_td %>%
  unnest_tokens(hexagram, text, token = "ngrams", n=6)
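
Before plotting, we can peek at the raw counts to get a sense of what these tables look like (shown here for the bigrams; the exact counts depend on the sample):

# ten most frequent bigrams in the sample
bigram %>%
  count(bigram, sort = TRUE) %>%
  head(10)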

The following function is used for plotting the most frequent \(n\)-grams in our sample data.

plot_freq <- function(data, text, n = 10){
  # capture the bare column name so it can be used inside dplyr verbs
  text <- enquo(text)
  data %>%
    count(!!text) %>%
    # keep the n most frequent n-grams
    top_n(n, n) %>%
    # reorder the factor levels so the bars are sorted by frequency
    mutate(text := fct_reorder(!!text, n)) %>%
    ggplot(., aes_(quo(text), quo(n))) + 
    geom_bar(stat = "identity") +
    coord_flip()
}
plot_freq(unigram, unigram,  n=20)

plot_freq(bigram, bigram, n=20)

plot_freq(trigram, trigram, n=20)

plot_freq(quadgram, quadgram, n=20)

plot_freq(pentagram, pentagram, n=20)

plot_freq(hexagram, hexagram, n=15)

Next Steps

In this report, we have conducted an exploratory data analysis and built \(n\)-grams, which we will use as the foundation of our text-prediction model. The next steps are:

  • Study more about Natural Language Processing (NLP).
  • Attempt cleaning the whole data set by breaking the files down into smaller pieces, cleaning each piece, and merging the results.
  • Review machine learning algorithms and look for a strategy for building the text-prediction model; a first sketch of the lookup idea is shown below.
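
As a first, rough illustration of how the \(n\)-gram tables could feed the prediction model, the sketch below counts the trigrams, splits each one into a two-word prefix and the word that follows it, and looks up candidates for a hypothetical prefix ("thank you" is only an example query, not a result from our data):

# sketch: turn the trigram table into a "prefix -> next word" lookup
trigram_counts <- trigram %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("word1", "word2", "next_word"), sep = " ")

# candidate next words for the (hypothetical) prefix "thank you"
trigram_counts %>%
  filter(word1 == "thank", word2 == "you") %>%
  top_n(5, n)

One plausible strategy is to precompute such tables for every \(n\)-gram order and back off to a shorter prefix whenever a longer one has not been observed in the training data; this is one of the options we will evaluate.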