Based on the text dataset provided by SwiftKey (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip), this milestone report delivers solutions to the tasks described below.
This milestone report deals exclusively with the English corpus; however, the techniques implemented here will also work for the German, Finnish, and Russian corpora.
The purpose of this Milestone Report is to demonstrate the ability to mine and analyze text data to discover interesting patterns, extract useful knowledge, and support prediction. To start this process, you need to get your hands on the data!
## Datafile from coursera website
file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## If file does not exist, download it and unzip it
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(file, destfile="Coursera-SwiftKey.zip", method = "curl")
}
unzip("Coursera-SwiftKey.zip")
The next step in the process is to load the data. The three datasets (blogs, news, tweets) are each sampled at 33,333 lines per file, which should enable reasonably representative early exploratory analysis. Prior to analysis or modeling, the data require cleaning (removing punctuation, numbers, stopwords, profanity, etc.) and tokenization (separating strings into individual words and N-grams).
rm(list = ls())
## Load required packages
library(dplyr)        # mutate, sample_n, count, filter
library(tidyr)        # separate
library(stringr)      # str_detect
library(stringi)      # stri_detect_regex
library(tidytext)     # unnest_tokens, stop_words
library(ggplot2)      # bar charts
library(wordcloud)    # word clouds
library(RColorBrewer) # brewer.pal colour palettes
library(pryr)         # object_size
## Read in data
blogs <- readLines("../Two/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
news <- readLines("../Two/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
tweets <- readLines("../Two/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE) %>%
  as_tibble() %>%
  mutate(num_char = nchar(value))
## Sample 33,333 lines from each dataset (seed set so the sample is reproducible)
set.seed(1234)
blogs_samp <- sample_n(blogs, 33333)
news_samp <- sample_n(news, 33333)
tweets_samp <- sample_n(tweets, 33333)
textmine <- rbind(blogs_samp, news_samp, tweets_samp)
## Datafile from Carnegie Mellon University School of Computer Science website
file <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
## If file does not exist, download
if (!file.exists("../Two/final/en_US/bad-words.txt")) {
download.file(file, destfile="../Two/final/en_US/bad-words.txt", method = "auto")
}
## Remove bad words
curses <- readLines("../Two/final/en_US/bad-words.txt")
curses <- curses[-1]  # drop the empty first row
curses <- data.frame(word = curses, lexicon = "PROFANE")  # same column layout as tidytext's stop_words
## Tokenize and filter N-gram data
data("stop_words")
colnames(textmine)[1] <- "text"  # rename the sampled text column for unnest_tokens()
unigrams <- textmine %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%  # one-word tokens
  count(unigram, sort = TRUE) %>%                            # frequency of each token
  separate(unigram, c("word1"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                 # drop stopwords
         !word1 %in% curses$word,                     # drop profanity
         !str_detect(word1, "^\\d+"),                 # drop tokens starting with digits
         !str_detect(word1, "[:digit:]"),             # drop tokens containing digits
         !stri_detect_regex(word1, "^[:punct:]")) %>% # drop tokens starting with punctuation
  mutate(total = sum(n))                              # total token count after filtering
The above code was also applied to bi-, tri-, and quadgrams but is omitted/hidden here for readability.
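For reference, a minimal sketch of the bigram version is shown below; it assumes the same textmine, stop_words, and curses objects defined above, and the tri- and quadgram versions follow the same pattern.
bigrams <- textmine %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # two-word tokens
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,  # drop stopwords
         !word1 %in% curses$word,     !word2 %in% curses$word,      # drop profanity
         !str_detect(word1, "[:digit:]"), !str_detect(word2, "[:digit:]")) %>%  # drop digits
  mutate(total = sum(n))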
The first characterization of the data is a table describing the size (memory) and number of lines of each dataset (blogs, news, tweets, and textmine, the sampled data). Next, bar charts of the uni-, bi-, tri-, and quadgrams obtained from the sampled data are displayed.
## Tabulate data
Name <- c("blogs","news","tweets", "textmine")
Size_mb <- c( object_size(blogs), object_size(news), object_size(tweets), object_size(textmine))/1000000
Lines <- c( dim(blogs)[1], dim(news)[1], dim(tweets)[1], dim(textmine)[1])
rawtext_table <- data.frame(Name,Size_mb,Lines)
knitr::kable(rawtext_table)
Name | Size_mb | Lines |
---|---|---|
blogs | 264.1623 | 899288 |
news | 20.4213 | 77259 |
tweets | 325.4791 | 2360148 |
textmine | 23.1800 | 99999 |
## Visualize data
unigram <- within(unigrams, rm(total))  # drop the running total column
unigram %>%
  top_n(20, n) %>%                      # keep the 20 most frequent words
  mutate(word = reorder(word1, n)) %>%  # order bars by count
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "red", colour = "black") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Most frequent words in textmine corpus")
Again, the above code was applied to bi-, tri-, and quadgrams but is omitted/hidden here for readability.
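A sketch of the corresponding bigram chart, assuming a bigrams table built as outlined earlier (columns word1, word2, n, total):
bigram <- within(bigrams, rm(total))  # drop the running total column
bigram %>%
  top_n(20, n) %>%                    # keep the 20 most frequent bigrams
  mutate(pair = reorder(paste(word1, word2), n)) %>%
  ggplot(aes(pair, n)) +
  geom_bar(stat = "identity", fill = "red", colour = "black") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Most frequent bigrams in textmine corpus")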
visualizeWordcloud <- function(term, freq, title = "", min.freq = 50, max.words = 200){
  mypal <- brewer.pal(8, "Dark2")  # colour palette
  wordcloud(words = term,
            freq = freq,
            colors = mypal,
            scale = c(4, .1),
            rot.per = .15,
            min.freq = min.freq, max.words = max.words,
            random.order = FALSE)
}
#par(mfrow = c(1, 2))
visualizeWordcloud(term = unigram$word1, freq = unigram$n)
#visualizeWordcloud(term = paste(bigram$word1, bigram$word2), freq = bigram$n)
#par(mfrow = c(1, 2))
#visualizeWordcloud(term = paste(trigram$word1, trigram$word2, trigram$word3), freq = trigram$n)
#visualizeWordcloud(term = paste(quadgram$word1, quadgram$word2, quadgram$word3, quadgram$word4), freq = quadgram$n)
Finally, a wordcloud of the most frequent unigrams provides a more intuitive view of the same information.
The code and analysis above are the beginning of building a text-prediction App based on N-grams (currently 1, 2, 3, or 4 words). The model will work as follows: the App checks whether the inputted text matches a known N-gram (i.e., one previously learned from the textmine corpus) and then predicts the most appropriate (most frequent) next word, as sketched below.
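A minimal sketch of this lookup-with-backoff idea is shown below (an illustration only, not the final App). The function name predict_next is hypothetical, and the assumption that the bi-, tri-, and quadgram tables have columns word1…wordN plus a count n follows from the tokenization pipeline above.
predict_next <- function(input, unigrams, bigrams, trigrams, quadgrams) {
  ## Split the input into lower-case words and keep (up to) the last three
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  k <- length(words)
  ## Back off from the longest known N-gram to the shortest
  if (k >= 3) {
    hit <- quadgrams %>%
      filter(word1 == words[k - 2], word2 == words[k - 1], word3 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word4[1])
  }
  if (k >= 2) {
    hit <- trigrams %>%
      filter(word1 == words[k - 1], word2 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word3[1])
  }
  if (k >= 1) {
    hit <- bigrams %>%
      filter(word1 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word2[1])
  }
  ## Fall back to the single most frequent unigram
  unigrams$word1[which.max(unigrams$n)]
}
For example, predict_next("thanks for the", unigrams, bigrams, trigrams, quadgrams) would return the word most frequently observed after "thanks for the" in the sampled corpus.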
The following points will also need to be addressed for implementation.
Finally, the above analysis is based upon the removal of stopwords… The App should incorporate these, as they are actually the most common linking words used in language.
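A minimal sketch of what that change would look like, assuming the textmine and curses objects from above: the stop_words filter is simply dropped from the tokenization pipeline so the common linking words remain in the N-gram tables.
unigrams_full <- textmine %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%
  count(unigram, sort = TRUE) %>%
  filter(!unigram %in% curses$word,          # still remove profanity
         !str_detect(unigram, "[:digit:]"))  # and tokens containing digits, but keep stopwords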