Executive Summary

Based on the text dataset provided by SwiftKey (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip), this milestone report works through the tasks outlined below.

This milestone report deals exclusively with the English corpus; however, the techniques implemented here will also work for the German, Finnish, and Russian corpora.

Task 0: Understanding the Problem

The purpose of this Milestone Report is to demonstrate the ability to mine and analyze text data in order to discover interesting patterns, extract useful knowledge, and support prediction. To start this process, you need to get your hands on the data!

## Datafile from coursera website
file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

## If file does not exist, download it and unzip it
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(file, destfile="Coursera-SwiftKey.zip", method = "curl")
}
unzip("Coursera-SwiftKey.zip")

Task 1: Getting and Cleaning the Data

The next step in the process is to load the data. The three datasets (blogs, news, tweets) are each sampled at 33,333 lines per file, which should be enough for a reasonably representative early exploratory analysis. Prior to analysis or modeling, the data require cleaning (removing punctuation, numbers, stopwords, profanity, etc.) and tokenization (splitting strings into individual words and N-grams).

rm(list = ls())

## Load the packages used throughout this report
library(dplyr)        # piping, mutate, sample_n, count
library(tidyr)        # separate
library(stringr)      # str_detect
library(stringi)      # stri_detect_regex
library(tidytext)     # unnest_tokens, stop_words
library(ggplot2)      # bar charts
library(pryr)         # object_size
library(wordcloud)    # word clouds
library(RColorBrewer) # brewer.pal palettes

## Read in data
blogs <- readLines("../Two/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE) %>%
    as_tibble() %>%
    mutate(num_char = nchar(value))

news <- readLines("../Two/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn=FALSE) %>%
    as_tibble() %>%
    mutate(num_char = nchar(value))

tweets <- readLines("../Two/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE) %>%
    as_tibble() %>%
    mutate(num_char = nchar(value))

## Sample the data (seed set so the sample is reproducible)
set.seed(1234)
blogs_samp  <- sample_n(blogs, 33333)
news_samp   <- sample_n(news, 33333)
tweets_samp <- sample_n(tweets, 33333)
textmine <- rbind(blogs_samp, news_samp, tweets_samp)

## Datafile from Carnegie Mellon University School of Computer Science website
file <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"

## If file does not exist, download
if (!file.exists("../Two/final/en_US/bad-words.txt")) {
    download.file(file, destfile="../Two/final/en_US/bad-words.txt", method = "auto")
}

## Build the profanity list used to filter tokens
curses <- readLines("../Two/final/en_US/bad-words.txt")
curses <- curses[-1]  # drop the empty first row
curses <- data.frame(word = curses, lexicon = "PROFANE")  # same two-column layout as tidytext's stop_words

## Tokenize and filter N-gram data
data("stop_words")
textmine <- rename(textmine, text = value)  # unnest_tokens needs a text column
unigrams <- textmine %>%
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%
  count(unigram, sort = TRUE) %>%
  separate(unigram, c("word1"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word1 %in% curses$word,
         !str_detect(word1, "[:digit:]"),
         !stri_detect_regex(word1, "^[:punct:]")) %>%
  mutate(total = sum(n))

The above code was also implemented for bi-, tri-, and quadgrams but is omitted/hidden here for readability.
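For reference, the bigram variant only changes the token size and applies the same filters to each word of the pair; a minimal sketch (the hidden code may differ in detail) is:

bigrams <- textmine %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
         !word1 %in% curses$word, !word2 %in% curses$word,
         !str_detect(word1, "[:digit:]"), !str_detect(word2, "[:digit:]"),
         !stri_detect_regex(word1, "^[:punct:]"), !stri_detect_regex(word2, "^[:punct:]")) %>%
  mutate(total = sum(n))

The trigram and quadgram versions extend the same pattern with word3 and word4.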

Task 2: Exploratory Data Analysis

The first characterization of the data is a table describing the size (memory) and number of lines of each dataset (blogs, news, tweets, and textmine, the sampled data). Next, bar charts of the uni-, bi-, tri-, and quadgrams obtained from the sampled data are displayed.

## Tabulate data
Name <- c("blogs","news","tweets", "textmine")
Size_mb <- c( object_size(blogs), object_size(news), object_size(tweets), object_size(textmine))/1000000
Lines <- c( dim(blogs)[1], dim(news)[1], dim(tweets)[1], dim(textmine)[1])
rawtext_table <- data.frame(Name,Size_mb,Lines)
knitr::kable(rawtext_table)
Name        Size_mb     Lines
blogs      264.1623    899288
news        20.4213     77259
tweets     325.4791   2360148
textmine    23.1800     99999
## Visualize data
unigram <- select(unigrams, -total)  # drop the total column before plotting
unigram %>%
  top_n(20, n) %>%
  mutate(word = reorder(word1, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill="red", colour="black") +
  xlab(NULL) +
  coord_flip()+
  ggtitle("Most frequent words in textmine corpus")

Again, the above code was implemented for bi-, tri-, and quadgrams but is omitted/hidden here for readability.


## Helper: draw a word cloud from a vector of terms and their frequencies
visualizeWordcloud <- function(term, freq, title = "", min.freq = 50, max.words = 200){
    mypal <- brewer.pal(8,"Dark2")
    wordcloud(words = term,
          freq = freq, 
          colors = mypal, 
          scale=c(4,.1),
          rot.per=.15,
          min.freq = min.freq, max.words = max.words,
          random.order = FALSE)
}

#par(mfrow = c(1, 2))
visualizeWordcloud(term = unigram$word1, freq = unigram$n)

#visualizeWordcloud(term = bigram$word3, freq = bigram$n)
#par(mfrow = c(1, 2))
#visualizeWordcloud(term = trigram$word4, freq = trigram$n)
#visualizeWordcloud(term = quadgram$word5, freq = quadgram$n)

Finally, the word cloud above provides an intuitive, at-a-glance view of the most frequent words in the sampled corpus.

Task 3: Modeling

The code and analysis above are the beginning of building a text prediction App based on N-grams (currently 1, 2, 3, or 4 words). The model will work as follows: the App checks whether the end of the input text matches a known N-gram (i.e., one previously observed in the textmine corpus) and then predicts the most frequent word that followed it.
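A minimal sketch of that lookup, assuming a quadgram table with columns word1 through word4 and a count column n (built with the same tokenization pattern as above); the function name and table layout are illustrative, not the final App design:

## Take the last three words typed and return the most frequent fourth word
## observed in the (assumed) quadgram table. No backoff to shorter N-grams yet.
predict_word <- function(phrase, quadgrams) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)
  quadgrams %>%
    filter(word1 == words[1], word2 == words[2], word3 == words[3]) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%
    pull(word4)
}

## Example call: predict_word("thanks for the", quadgrams)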

The following points will also need to be addressed for implementation.

Finally, the above analysis is based on the removal of stopwords. The App should incorporate these, as they are actually the most common linking words used in language.
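For the prediction tables themselves, that simply means running the same tokenization without the stopword filter; a minimal sketch (column names illustrative, profanity still removed):

pred_bigrams <- textmine %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word1),
         !word1 %in% curses$word,   # profanity is still filtered out
         !word2 %in% curses$word) %>%
  count(word1, word2, sort = TRUE)  # stopwords retained for prediction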