I’ve loaded only a sample of the data: 1% of the lines from each English source (twitter, blogs, and news). The project data can be found at this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
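For reproducibility, here is a minimal sketch of fetching and unpacking the archive (assuming the zip extracts into the final/ directory used by the code below):
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # expected to unpack into final/en_US/, etc.
}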
con_news <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con_news)
close(con_news)
# Draw a 1% simple random sample of lines, without replacement
set.seed(1108)
sub_news <- news[sample(length(news), round(length(news) * 0.01))]
writeLines(sub_news, con = "final/en_US/en_US.sub_news.txt")
con_blogs <- file("final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con_blogs)
close(con_blogs)
set.seed(1108)
sub_blogs <- blogs[sample(length(blogs), round(length(blogs) * 0.01))]
writeLines(sub_blogs, con = "final/en_US/en_US.sub_blogs.txt")
con_twitter <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con_twitter, skipNul = TRUE)  # the twitter file contains embedded nuls
close(con_twitter)
set.seed(1108)
sub_twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.01))]
writeLines(sub_twitter, con = "final/en_US/en_US.sub_twitter.txt")
# Spot-check the lines that contained embedded nul characters:
twitter[c(167155, 268547, 1274086, 1759032)]
Below I’ve summarized the size (in MB), line count, and word count of the data subsets we’ll be using. For comparison, statistics for the original datasets are included as well.
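As a sketch of how these figures could be produced (summarize_source is an illustrative helper, not from any package; file.size reports bytes, converted to MB here, and the word counts use a simple whitespace split, so exact values may vary slightly):
summarize_source <- function(name, full_file, full_lines, sub_file, sub_lines) {
  data.frame(
    data           = name,
    size_MB        = file.size(full_file) / 1024^2,  # size on disk in MB
    subset_size_MB = file.size(sub_file) / 1024^2,
    lines          = length(full_lines),
    subset_lines   = length(sub_lines),
    subset_words   = sum(lengths(strsplit(sub_lines, "\\s+")))
  )
}
rbind(
  summarize_source("news", "final/en_US/en_US.news.txt", news,
                   "final/en_US/en_US.sub_news.txt", sub_news),
  summarize_source("blogs", "final/en_US/en_US.blogs.txt", blogs,
                   "final/en_US/en_US.sub_blogs.txt", sub_blogs),
  summarize_source("twitter", "final/en_US/en_US.twitter.txt", twitter,
                   "final/en_US/en_US.sub_twitter.txt", sub_twitter)
)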
##      data  size_MB subset_size_MB   lines subset_lines subset_words
## 1    news 196.2775       1.967189 1010200        10102       350221
## 2   blogs 200.4242       2.011213  899200         8992       377871
## 3 twitter 159.3641       1.564456 2360100        23601       299397
The profanity word list used for filtering can be found at this link: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
library(tm)
profanity <- readLines("Profanity/en")
# Proof of concept for profanity removal:
test_twitter <- removeWords(sub_twitter, profanity)
length(grep(" shit ", sub_twitter))
## [1] 67
length(grep(" shit ", test_twitter, value = TRUE))
## [1] 0
clean_data <- function(data) {
  tcorpus <- VCorpus(VectorSource(data))
  tcorpus <- tm_map(tcorpus, content_transformer(tolower))  # lowercase everything
  tcorpus <- tm_map(tcorpus, removePunctuation)
  tcorpus <- tm_map(tcorpus, removeNumbers)
  tcorpus <- tm_map(tcorpus, removeWords, profanity)        # strip profanity
  tcorpus <- tm_map(tcorpus, stripWhitespace)
  return(tcorpus)
}
clean_news <- clean_data(sub_news)
clean_blogs <- clean_data(sub_blogs)
clean_twitter <- clean_data(sub_twitter)
Let’s demonstrate the difference between the original data and the cleaned version.
head(sub_news, 1)
## [1] "After Heagney's ruling, the police employees appealed, claiming that they were unaware of the suit and thus had no chance to weigh in."
clean_news[[1]]$content
## [1] "after heagneys ruling the police employees appealed claiming that they were unaware of the suit and thus had no chance to weigh in"
head(sub_blogs, 1)
## [1] "If you know champagne is French, you may be farther ahead than you realize. The rest is a simple matter of getting educated. Quickly. So, let's take you back in time to just before Thanksgiving 2011 -- like today, maybe. Sit up straight and pay attention."
clean_blogs[[1]]$content
## [1] "if you know champagne is french you may be farther ahead than you realize the rest is a simple matter of getting educated quickly so lets take you back in time to just before thanksgiving like today maybe sit up straight and pay attention"
head(sub_twitter, 1)
## [1] "Great news! RT : I wasn't expecting this so soon, but I was just named the Associate Dean for Research for UMD's iSchool."
clean_twitter[[1]]$content
## [1] "great news rt i wasnt expecting this so soon but i was just named the associate dean for research for umds ischool"
To gauge dictionary coverage, I computed how many unique words are needed to cover 50% and 90% of all word instances in the sample, both with and without stopwords:
##                    50%  90%
## All Words          245 4894
## Stopwords Removed  838 6161
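These coverage points can be found by walking the sorted frequency table until the cumulative share of word instances crosses each threshold; as a sketch, assuming word_freqs from above and tm's stopwords("en") for the second row:
coverage <- function(freqs, threshold) {
  cum_share <- cumsum(freqs) / sum(freqs)
  which(cum_share >= threshold)[1]  # unique words needed to reach the threshold
}
coverage(word_freqs, 0.50)
coverage(word_freqs, 0.90)
no_stop <- word_freqs[!names(word_freqs) %in% stopwords("en")]
coverage(no_stop, 0.50)
coverage(no_stop, 0.90)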
Next steps:
1. Continue developing the algorithm, and test accuracy and speed for different prediction models. Prediction will likely be based on stored trigrams first, backing off to bigrams and unigrams as needed (see the sketch after this list).
2. Decide on a methodology for the training dataset. Options include:
   a. combining the news, blogs, and twitter datasets;
   b. picking the dataset that delivers the most accurate predictions;
   c. allowing the user to pick the dataset and language based on their preferences.
3. Identify the best method for suggesting words not covered by the model. One option is to suggest the most common unigram for a given dataset.
4. Deploy the prediction model in an interactive Shiny app.
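As a minimal sketch of the backoff idea in step 1 (building on the words vector and word_freqs table from the frequency sketch above; ngram_counts and predict_next are illustrative names, and the linear grep scan is for demonstration only, since a real model would use a keyed lookup table):
# Frequency tables of n-grams over the cleaned word stream
# (this sketch ignores sentence and document boundaries)
ngram_counts <- function(words, n) {
  idx <- seq_len(length(words) - n + 1)
  grams <- vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}
bigrams <- ngram_counts(words, 2)
trigrams <- ngram_counts(words, 3)

# Try the last two words against trigrams, then the last word against
# bigrams, then fall back to the most common unigram (step 3's option)
predict_next <- function(phrase) {
  tokens <- strsplit(tolower(phrase), "\\s+")[[1]]
  for (n in 2:1) {
    if (length(tokens) < n) next
    ctx <- paste(tail(tokens, n), collapse = " ")
    tab <- if (n == 2) trigrams else bigrams
    hits <- grep(paste0("^", ctx, " "), names(tab), value = TRUE)
    if (length(hits) > 0)
      return(tail(strsplit(hits[1], " ")[[1]], 1))  # tables are frequency-sorted
  }
  names(word_freqs)[1]
}
predict_next("thanks for the")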