The following report explains the exploratory analysis that was performed and the subsequent goals for a predictive text algorithm developed using the SwiftKey data.
The data used in this exploratory analysis has already been loaded and cleaned.
# File paths for the three English corpora
blogs <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-blogs.txt"
news <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-news.txt"
twitter <- "~/Coursera Courses/Data Science Capstone/final/en_US/en_US-twitter.txt"

# Read each corpus into a character vector, one element per line
blogs_raw <- readLines(blogs)
news_raw <- readLines(news)
twitter_raw <- readLines(twitter)
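The cleaning step itself is not shown above; a minimal sketch of the kind of cleaning that could be applied (the helper name clean_text is illustrative, not the exact code used) looks like this:

# Hypothetical cleaning sketch: strip non-ASCII characters and collapse
# extra whitespace before tokenizing (illustrative, not the exact cleaning used)
clean_text <- function(text_vector) {
  cleaned <- iconv(text_vector, from = "UTF-8", to = "ASCII", sub = "")
  cleaned <- gsub("\\s+", " ", cleaned)
  trimws(cleaned)
}

blogs_raw <- clean_text(blogs_raw)
news_raw <- clean_text(news_raw)
twitter_raw <- clean_text(twitter_raw)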
After the data has been cleaned, it needs to be tokenized, i.e., split into individual tokens (words and punctuation marks).
tokenize_fast <- function(text_vector) {
  # Match either runs of word characters or single punctuation marks
  tokens_list <- regmatches(
    text_vector,
    gregexpr("\\w+|[[:punct:]]", text_vector, perl = TRUE)
  )
  # Flatten the per-line lists into a single token vector
  tokens <- unlist(tokens_list, use.names = FALSE)
  return(tokens)
}
blogs_tokens <- tokenize_fast(blogs_raw)
news_tokens <- tokenize_fast(news_raw)
twitter_tokens <- tokenize_fast(twitter_raw)
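As a quick illustration of what the tokenizer produces (the example sentence is made up, not drawn from the corpora), note that contractions are split at the apostrophe, which is why tokens such as "wasn" and "t" appear in the output below:

tokenize_fast("He wasn't home alone, apparently.")
# yields: "He" "wasn" "'" "t" "home" "alone" "," "apparently" "."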
Once the data has been tokenized, the next step is to remove any profanity from the tokens.
profanity_list <- c("fuck", "fucking", "shit", "bitch", "asshole", "bastard", "cunt", "damn",
                    "nigga", "nigger", "bullshit", "motherfucker", "motherfucking", "dumbass",
                    "fag", "faggot", "whore", "slut", "ass", "pussy", "moron", "retard",
                    "jackass", "piss", "goddamn", "hell", "cock", "penis", "vagina", "balls",
                    "tits", "cum", "jizz", "twat", "wanker", "blowjob", "porn", "sex", "horny",
                    "kinky", "coke", "weed", "marijuana", "heroin", "bollocks", "horseshit",
                    "batshit", "clit", "fucker")
remove_profanity <- function(tokens, profanity_words = profanity_list) {
  # Normalize case for comparison
  clean_tokens <- tokens[!tolower(tokens) %in% profanity_words]
  return(clean_tokens)
}
blogs_clean <- remove_profanity(blogs_tokens)
news_clean <- remove_profanity(news_tokens)
twitter_clean <- remove_profanity(twitter_tokens)
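A quick sanity check (a hedged sketch, not part of the original pipeline) can confirm that the filter worked; by construction each count should be zero:

# Number of profanity tokens remaining in each cleaned corpus (should all be 0)
sum(tolower(blogs_clean) %in% profanity_list)
sum(tolower(news_clean) %in% profanity_list)
sum(tolower(twitter_clean) %in% profanity_list)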
Now that all of the processing is done, the next step is exploratory analysis. Given the qualitative nature of the data, the analysis was limited to functions that can be run on character vectors.
summary(news_clean)
## Length Class Mode
## 42377795 character character
summary(twitter_clean)
## Length Class Mode
## 38832309 character character
summary(blogs_clean)
## Length Class Mode
## 44106423 character character
length(news_clean)
## [1] 42377795
length(twitter_clean)
## [1] 38832309
length(blogs_clean)
## [1] 44106423
head(twitter_clean, 20)
## [1] "How" "are" "you" "?" "Btw" "thanks" "for"
## [8] "the" "RT" "." "You" "gonna" "be" "in"
## [15] "DC" "anytime" "soon" "?" "Love" "to"
head(news_clean, 20)
## [1] "He" "wasn" "'" "t" "home"
## [6] "alone" "," "apparently" "." "The"
## [11] "St" "." "Louis" "plant" "had"
## [16] "to" "close" "." "It" "would"
head(blogs_clean, 20)
## [1] "In" "the" "years" "thereafter" ","
## [6] "most" "of" "the" "Oil" "fields"
## [11] "and" "platforms" "were" "named" "after"
## [16] "pagan" "gods" "." "We" "love"
After the summary statistics, bar plots were generated to show the 30 most common tokens in each group.
blogs_table <- table(blogs_clean)
top_blogs <- sort(blogs_table, decreasing = TRUE)[1:30]
barplot(top_blogs,
        las = 2,
        col = "deeppink3",
        main = "Bar Plot of Clean Blogs Tokens (Top 30)",
        ylab = "Frequency")

news_table <- table(news_clean)
top_news <- sort(news_table, decreasing = TRUE)[1:30]
barplot(top_news,
        las = 2,
        col = "darkseagreen4",
        main = "Bar Plot of Clean News Tokens (Top 30)",
        ylab = "Frequency")

twitter_table <- table(twitter_clean)
top_twitter <- sort(twitter_table, decreasing = TRUE)[1:30]
barplot(top_twitter,
        las = 2,
        col = "lightblue",
        main = "Bar Plot of Clean Twitter Tokens (Top 30)",
        ylab = "Frequency")
Comparing the results of the bar plots, the Twitter token group is the only group with "?", ":", "#", and the word "me" in its Top 30. The blogs token group is the only one with the word "this" in its Top 30. The words "he", "his", "from", and "said" are unique to the Top 30 for the news token group.
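These differences can also be checked programmatically; a short sketch using the names of the top-30 tables built above:

# Tokens that appear only in the Twitter Top 30
setdiff(names(top_twitter), union(names(top_blogs), names(top_news)))

# Tokens that appear only in the news Top 30
setdiff(names(top_news), union(names(top_blogs), names(top_twitter)))

# Tokens that appear only in the blogs Top 30
setdiff(names(top_blogs), union(names(top_news), names(top_twitter)))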
My plan for the predictive text algorithm and Shiny app is to create a model that suggests the most commonly used word(s) following the words or phrases already typed. This will be accomplished by generating lists of the most common words and phrases in the English language and determining which are most likely to be used while texting. The Shiny app will wrap this model in a simple interface that can be used from a web browser on any type of smartphone.
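A minimal sketch of one way this could work, assuming a simple bigram frequency lookup (the function names build_bigrams and predict_next are illustrative assumptions, not the final design):

# Build a bigram ("word pair") frequency table from a token vector
build_bigrams <- function(tokens) {
  n <- length(tokens)
  bigrams <- paste(tokens[-n], tokens[-1])
  sort(table(bigrams), decreasing = TRUE)
}

# Return the k most frequent words observed to follow the given word
predict_next <- function(word, bigram_table, k = 3) {
  candidates <- bigram_table[startsWith(names(bigram_table), paste0(word, " "))]
  sub(".* ", "", names(head(candidates, k)))
}

twitter_bigrams <- build_bigrams(twitter_clean)
predict_next("thanks", twitter_bigrams)

In practice, higher-order n-grams and a back-off scheme would likely be needed, but this illustrates the frequency-lookup idea behind the planned model.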