Loading the data

I’ve loaded only a sample of the data: 1% of the total content for each English source (twitter, blogs, and news). The project data can be found at this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

con_news <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con_news)
close(con_news)
set.seed(1108)
# draw a uniform 1% sample of lines, without replacement
sub_news <- news[sample(length(news), round(length(news) * 0.01))]
writeLines(sub_news, con = "final/en_US/en_US.sub_news.txt")

con_blogs <- file("final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con_blogs)
close(con_blogs)
set.seed(1108)
sub_blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.01))]
writeLines(sub_blogs, con = "final/en_US/en_US.sub_blogs.txt")

con_twitter <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con_twitter)
close(con_twitter)
set.seed(1108)
sub_twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.01))]
writeLines(sub_twitter, con = "final/en_US/en_US.sub_twitter.txt")

# Inspecting lines reported to contain embedded nuls:
twitter[c(167155, 268547, 1274086, 1759032)]

Understanding the data

I’ve summarized the file size (in MB), line count, and word count of the data subsets we’ll be using. For comparison, I’ve included the statistics for the original datasets as well.

##      data  size_MB subset_size_MB   lines subset_lines subset_words
## 1    news 196.2775       1.967189 1010200        10102       350221
## 2   blogs 200.4242       2.011213  899200         8992       377871
## 3 twitter 159.3641       1.564456 2360100        23601       299397
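
As a sketch of how such a summary can be computed in base R (summarize_text is a hypothetical helper, not part of the original analysis):

summarize_text <- function(full_path, sub_path) {
        sub_lines <- readLines(sub_path, skipNul = TRUE)
        data.frame(
                size_MB        = file.size(full_path) / 1024^2,  # file size in MB
                subset_size_MB = file.size(sub_path) / 1024^2,
                lines          = length(readLines(full_path, skipNul = TRUE)),
                subset_lines   = length(sub_lines),
                # whitespace-delimited tokens in the subset
                subset_words   = sum(lengths(strsplit(sub_lines, "\\s+")))
        )
}

summarize_text("final/en_US/en_US.news.txt", "final/en_US/en_US.sub_news.txt")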

Cleaning the data

Removing profanity

The profanity word list can be found at this link: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

library(tm)
profanity <- readLines("Profanity/en")

## proof of concept for profanity removal
test_twitter <- removeWords(sub_twitter, profanity)
length(grep(" shit ", sub_twitter, value = TRUE))
## [1] 67
length(grep(" shit ", test_twitter, value = TRUE))
## [1] 0

Data cleaning function

clean_data <- function(data) {
        tcorpus <- VCorpus(VectorSource(data))  # one document per line
        tcorpus <- tm_map(tcorpus, content_transformer(tolower))
        tcorpus <- tm_map(tcorpus, removePunctuation)
        tcorpus <- tm_map(tcorpus, removeNumbers)
        tcorpus <- tm_map(tcorpus, removeWords, profanity)  # strip profanity
        tcorpus <- tm_map(tcorpus, stripWhitespace)         # collapse extra spaces
        return(tcorpus)
}

clean_news <- clean_data(sub_news)
clean_blogs <- clean_data(sub_blogs)
clean_twitter <- clean_data(sub_twitter)

Let’s compare the first line of each original subset with its cleaned version.

head(sub_news, 1)
## [1] "After Heagney's ruling, the police employees appealed, claiming that they were unaware of the suit and thus had no chance to weigh in."
clean_news[[1]]$content
## [1] "after heagneys ruling the police employees appealed claiming that they were unaware of the suit and thus had no chance to weigh in"
head(sub_blogs, 1)
## [1] "If you know champagne is French, you may be farther ahead than you realize. The rest is a simple matter of getting educated. Quickly. So, let's take you back in time to just before Thanksgiving 2011 -- like today, maybe. Sit up straight and pay attention."
clean_blogs[[1]]$content
## [1] "if you know champagne is french you may be farther ahead than you realize the rest is a simple matter of getting educated quickly so lets take you back in time to just before thanksgiving like today maybe sit up straight and pay attention"
head(sub_twitter, 1)
## [1] "Great news! RT : I wasn't expecting this so soon, but I was just named the Associate Dean for Research for UMD's iSchool."
clean_twitter[[1]]$content
## [1] "great news rt i wasnt expecting this so soon but i was just named the associate dean for research for umds ischool"

Exploratory analysis

Find term frequencies for n-grams and create plots
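
As a sketch of how n-gram frequencies can be tabulated from a cleaned corpus in base R (bigram_freq is an illustrative helper; for simplicity it lets word pairs cross document boundaries):

bigram_freq <- function(corpus) {
        # collapse the cleaned one-line documents back into a word vector
        text  <- vapply(seq_along(corpus), function(i) corpus[[i]]$content, character(1))
        words <- unlist(strsplit(text, "\\s+"))
        words <- words[nzchar(words)]
        # pair each word with its successor and tabulate
        sort(table(paste(head(words, -1), tail(words, -1))), decreasing = TRUE)
}

head(bigram_freq(clean_twitter), 5)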

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

##                   50%  90%
## All Words         245 4894
## Stopwords Removed 838 6161
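
A sketch of how these coverage figures can be derived from a frequency-sorted unigram table (coverage is an illustrative helper; for the second row, stopwords would be removed first, e.g. with tm's removeWords and stopwords("en")):

coverage <- function(corpus, targets = c(0.5, 0.9)) {
        # assumes each document holds a single string, as with VectorSource above
        text  <- vapply(seq_along(corpus), function(i) corpus[[i]]$content, character(1))
        words <- unlist(strsplit(text, "\\s+"))
        freq  <- sort(table(words[nzchar(words)]), decreasing = TRUE)
        cum_share <- cumsum(freq) / sum(freq)
        # first rank at which the cumulative share reaches each target
        sapply(targets, function(t) which(cum_share >= t)[1])
}

coverage(clean_news)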

Next steps

  1. Continue to develop the algorithm. Test accuracy and speed for different prediction models. Prediction will likely be based on stored trigrams at first, backing off to bigrams and unigrams as needed (see the sketch after this list).

  2. Decide on a methodology for the training dataset. Options include:
     a. combining the news, blogs, and twitter datasets;
     b. picking the dataset that delivers the most accurate predictions;
     c. allowing the user to pick the dataset and language based on their preferences.

  3. Identify the best method for suggesting words not covered by the model. One option is to suggest the most common unigram for the chosen dataset.

  4. Deploy the prediction model in an interactive Shiny app.
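
A rough sketch of the backoff lookup described in step 1, assuming precomputed frequency tables trigrams(w1, w2, w3, count), bigrams(w1, w2, count), and unigrams(w1, count); this table layout is an assumption for illustration, not a final design:

predict_word <- function(context, trigrams, bigrams, unigrams) {
        w <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)
        if (length(w) == 2) {
                # try trigrams keyed on the last two words of the context
                hit <- trigrams[trigrams$w1 == w[1] & trigrams$w2 == w[2], ]
                if (nrow(hit) > 0) return(hit$w3[which.max(hit$count)])
        }
        # back off to bigrams keyed on the last word
        hit <- bigrams[bigrams$w1 == tail(w, 1), ]
        if (nrow(hit) > 0) return(hit$w2[which.max(hit$count)])
        # last resort: the most common unigram
        unigrams$w1[which.max(unigrams$count)]
}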