The following report explains the exploratory analysis that was performed and the subsequent goals for a predictive text algorithm developed using the SwiftKey data.
The data used in this exploratory analysis has already been loaded and cleaned.
# File paths for the three English corpora
blogs <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-blogs.txt"
news <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-news.txt"
twitter <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-twitter.txt"

# Read each corpus into a character vector, one element per line
blogs_raw <- readLines(blogs)
news_raw <- readLines(news)
twitter_raw <- readLines(twitter)
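The cleaning step itself is not shown above; a minimal sketch of the kind of cleaning that could be applied (the helper name clean_text is illustrative, not the exact code used) looks like this:

# Hypothetical cleaning sketch: strip non-ASCII characters and collapse
# extra whitespace before tokenizing (illustrative, not the exact cleaning used)
clean_text <- function(text_vector) {
  cleaned <- iconv(text_vector, from = "UTF-8", to = "ASCII", sub = "")
  cleaned <- gsub("\\s+", " ", cleaned)
  trimws(cleaned)
}

blogs_raw <- clean_text(blogs_raw)
news_raw <- clean_text(news_raw)
twitter_raw <- clean_text(twitter_raw)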
After the data has been cleaned, it needs to be tokenized, i.e., split into individual tokens (words and punctuation marks).
tokenize_fast <- function(text_vector) {
  # Match either runs of word characters or single punctuation marks
  tokens_list <- regmatches(
    text_vector,
    gregexpr("\\w+|[[:punct:]]", text_vector, perl = TRUE)
  )
  # Flatten the per-line lists into a single token vector
  tokens <- unlist(tokens_list, use.names = FALSE)
  return(tokens)
}
blogs_tokens <- tokenize_fast(blogs_raw)
news_tokens <- tokenize_fast(news_raw)
twitter_tokens <- tokenize_fast(twitter_raw)
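As a quick illustration of what the tokenizer produces (the example sentence is made up, not drawn from the corpora), note that contractions are split at the apostrophe, which is why tokens such as "wasn" and "t" appear in the output below:

tokenize_fast("He wasn't home alone, apparently.")
# yields: "He" "wasn" "'" "t" "home" "alone" "," "apparently" "."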
Once the data has been tokenized, the next step is to remove any profanity from the tokens.
profanity_list <- c("fuck", "fucking", "shit", "bitch", "asshole", "bastard", "cunt", "damn",
                    "nigga", "nigger", "bullshit", "motherfucker", "motherfucking", "dumbass",
                    "fag", "faggot", "whore", "slut", "ass", "pussy", "moron", "retard",
                    "jackass", "piss", "goddamn", "hell", "cock", "penis", "vagina", "balls",
                    "tits", "cum", "jizz", "twat", "wanker", "blowjob", "porn", "sex", "horny",
                    "kinky", "coke", "weed", "marijuana", "heroin", "bollocks", "horseshit",
                    "batshit", "clit", "fucker")
remove_profanity <- function(tokens, profanity_words = profanity_list) {
  # Normalize case for comparison
  clean_tokens <- tokens[!tolower(tokens) %in% profanity_words]
  return(clean_tokens)
}
blogs_clean <- remove_profanity(blogs_tokens)
news_clean <- remove_profanity(news_tokens)
twitter_clean <- remove_profanity(twitter_tokens)
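A quick sanity check (a hedged sketch, not part of the original pipeline) can confirm that the filter worked; by construction each count should be zero:

# Number of profanity tokens remaining in each cleaned corpus (should all be 0)
sum(tolower(blogs_clean) %in% profanity_list)
sum(tolower(news_clean) %in% profanity_list)
sum(tolower(twitter_clean) %in% profanity_list)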
Now that all of the processing is done, the next step is exploratory analysis. Given the qualitative nature of the data, the analysis was limited to functions that can be run on character vectors.
summary(news_clean)
## Length Class Mode
## 42377795 character character
summary(twitter_clean)
## Length Class Mode
## 38832309 character character
summary(blogs_clean)
## Length Class Mode
## 44106423 character character
length(news_clean)
## [1] 42377795
length(twitter_clean)
## [1] 38832309
length(blogs_clean)
## [1] 44106423
head(twitter_clean, 20)
## [1] "How" "are" "you" "?" "Btw" "thanks" "for"
## [8] "the" "RT" "." "You" "gonna" "be" "in"
## [15] "DC" "anytime" "soon" "?" "Love" "to"
head(news_clean, 20)
## [1] "He" "wasn" "'" "t" "home"
## [6] "alone" "," "apparently" "." "The"
## [11] "St" "." "Louis" "plant" "had"
## [16] "to" "close" "." "It" "would"
head(blogs_clean, 20)
## [1] "In" "the" "years" "thereafter" ","
## [6] "most" "of" "the" "Oil" "fields"
## [11] "and" "platforms" "were" "named" "after"
## [16] "pagan" "gods" "." "We" "love"
After the summary statistics, bar plots were generated to show the 30 most common tokens in each group.
blogs_table <- table(blogs_clean)
top_blogs <- sort(blogs_table, decreasing = TRUE)[1:30]
barplot(top_blogs,
        las = 2,
        col = "deeppink3",
        main = "Bar Plot of Clean Blogs Tokens (Top 30)",
        ylab = "Frequency")

news_table <- table(news_clean)
top_news <- sort(news_table, decreasing = TRUE)[1:30]
barplot(top_news,
        las = 2,
        col = "darkseagreen4",
        main = "Bar Plot of Clean News Tokens (Top 30)",
        ylab = "Frequency")

twitter_table <- table(twitter_clean)
top_twitter <- sort(twitter_table, decreasing = TRUE)[1:30]
barplot(top_twitter,
        las = 2,
        col = "lightblue",
        main = "Bar Plot of Clean Twitter Tokens (Top 30)",
        ylab = "Frequency")
Comparing the results of the bar plots, the Twitter token group is the only group with "?", ":", "#", and the word "me" in its Top 30. The blogs token group is the only one with the word "this" in its Top 30. The words "he", "his", "from", and "said" are unique to the Top 30 for the news token group.
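These differences can also be checked programmatically; a short sketch using the names of the top-30 tables built above:

# Tokens that appear only in the Twitter Top 30
setdiff(names(top_twitter), union(names(top_blogs), names(top_news)))

# Tokens that appear only in the news Top 30
setdiff(names(top_news), union(names(top_blogs), names(top_twitter)))

# Tokens that appear only in the blogs Top 30
setdiff(names(top_blogs), union(names(top_news), names(top_twitter)))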
My plan for the predictive text algorithm and Shiny app is to create a model that suggests the most commonly used word(s) following the words or phrases already typed. This will be accomplished by generating lists of the most common words and phrases in the English language and determining which are most likely to be used while texting. The Shiny app will wrap this model in a simple interface that can be used from a web browser on any type of smartphone.
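A minimal sketch of one way this could work, assuming a simple bigram frequency lookup (the function names build_bigrams and predict_next are illustrative assumptions, not the final design):

# Build a bigram ("word pair") frequency table from a token vector
build_bigrams <- function(tokens) {
  n <- length(tokens)
  bigrams <- paste(tokens[-n], tokens[-1])
  sort(table(bigrams), decreasing = TRUE)
}

# Return the k most frequent words observed to follow the given word
predict_next <- function(word, bigram_table, k = 3) {
  candidates <- bigram_table[startsWith(names(bigram_table), paste0(word, " "))]
  sub(".* ", "", names(head(candidates, k)))
}

twitter_bigrams <- build_bigrams(twitter_clean)
predict_next("thanks", twitter_bigrams)

In practice, higher-order n-grams and a back-off scheme would likely be needed, but this illustrates the frequency-lookup idea behind the planned model.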