Summary

This is the Capstone project for Johns Hopkins University’s Data Science Specialization on coursera.org. The goal of this capstone project is to build a Shiny application that is capable of predicting the next word based on user text input.

This project was completed in three phases:

Downloading and cleaning the text data. Before downloading the text data, the script checks the current working directory to see whether the file already exists, to avoid re-downloading it. In this section I process the text to remove numbers, profanity, and white space.

Exploratory Analysis

Prediction model and Shiny App Creation

Preparing the Workspace

library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)
library(wordcloud)

Download, unzip, and read the data

# URL for data source
URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# If data set not downloaded already, fetch it
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(URL, destfile = "Coursera-SwiftKey.zip", method="auto")
}
# If data set not extracted already, extract it
if (!file.exists("data/final/en_US/en_US.blogs.txt")) {
  unzip("Coursera-SwiftKey.zip", exdir="/data")
}
blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("data/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
## Warning in readLines("data/final/en_US/en_US.news.txt", encoding =
## "UTF-8", : incomplete final line found on 'data/final/en_US/en_US.news.txt'
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
rm(URL)
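
The “incomplete final line” warning above is harmless here: the news file simply lacks a trailing newline. One way to avoid the warning, as a sketch, is to read the file through a binary connection:

con <- file("data/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)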

Text Analysis

Preparing the Data for Analysis

I will be using the tidytext package, which applies tidy data principles to text data. The tidytext package puts the text data into a tibble data frame. In order to conduct analysis using tidytext, the text must be in tidy text format, which the creators of tidytext, Julia Silge and David Robinson, define as follows:

“We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”
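
To make the one-token-per-row idea concrete, here is a toy example (a minimal sketch, separate from the project data):

library(dplyr)
library(tidytext)

toy_df <- tibble(text = c("The quick brown fox", "jumps over the lazy dog"))
# unnest_tokens() lowercases by default and returns one row per word
toy_df %>% unnest_tokens(word, text)
# Returns a 9-row tibble: "the", "quick", "brown", "fox", "jumps", ...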

# Remove numbers from the text to get accurate text analysis
news <- gsub('[0-9]+', '', news)
blogs <- gsub('[0-9]+', '', blogs)
twitter <- gsub('[0-9]+', '', twitter)
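
The summary above also mentions removing profanity and white space. A minimal sketch of that step, assuming a placeholder profanity vector (a real list would be loaded from a file):

# Hypothetical word list; substitute a real profanity lexicon
profanity <- c("badword1", "badword2")
pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
news <- gsub(pattern, "", news, ignore.case = TRUE)
blogs <- gsub(pattern, "", blogs, ignore.case = TRUE)
twitter <- gsub(pattern, "", twitter, ignore.case = TRUE)
# str_squish() from stringr collapses the leftover runs of white space
news <- str_squish(news)
blogs <- str_squish(blogs)
twitter <- str_squish(twitter)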

# Convert the text data into tibbles
news_df <- tibble(text = news)
blogs_df <- tibble(text = blogs)
twitter_df <- tibble(text = twitter)

# Tokenize the data into a tibble with one word per row
news_tidy <- news_df %>% unnest_tokens(word, text)
blogs_tidy <- blogs_df %>% unnest_tokens(word, text)
twitter_tidy <- twitter_df %>% unnest_tokens(word, text)

Word Frequencies

In order to get accurate word frequencies we must filter out stop words, common words like “the,” “of,” and “to” that are not useful for the analysis. To filter these words we use dplyr’s anti_join() function combined with tidytext’s built-in stop_words table. Before we do that, I will demonstrate the word frequencies without removing the stop words.

news_tidy %>% count(word, sort = TRUE)
## # A tibble: 79,459 x 2
##    word       n
##    <chr>  <int>
##  1 the   151717
##  2 to     69757
##  3 and    68605
##  4 a      67426
##  5 of     59315
##  6 in     51895
##  7 for    27166
##  8 that   26384
##  9 is     21973
## 10 on     20815
## # ... with 79,449 more rows

We see in the results above that the highest-frequency words are stop words, which should be removed to get more meaningful results.

Removing Stop Words

data("stop_words")
news_tidy <- news_tidy %>% anti_join(stop_words)
## Joining, by = "word"
news_tidy %>% count(word, sort = TRUE)
## # A tibble: 78,758 x 2
##    word        n
##    <chr>   <int>
##  1 time     4474
##  2 people   3673
##  3 city     2902
##  4 school   2702
##  5 percent  2635
##  6 game     2591
##  7 day      2477
##  8 home     2438
##  9 million  2377
## 10 county   2262
## # ... with 78,748 more rows
blogs_tidy <- blogs_tidy %>% anti_join(stop_words)
## Joining, by = "word"
blogs_tidy %>% count(word, sort = TRUE)
## # A tibble: 296,713 x 2
##    word       n
##    <chr>  <int>
##  1 time   90918
##  2 people 59576
##  3 day    52378
##  4 love   45233
##  5 life   41251
##  6 it’s   38657
##  7 world  29305
##  8 i’m    29189
##  9 don’t  28389
## 10 book   28151
## # ... with 296,703 more rows
twitter_tidy <- twitter_tidy %>% anti_join(stop_words)
## Joining, by = "word"
twitter_tidy %>% count(word, sort = TRUE)
## # A tibble: 344,708 x 2
##    word         n
##    <chr>    <int>
##  1 love    106738
##  2 day      92800
##  3 rt       89557
##  4 time     76806
##  5 lol      70137
##  6 people   52043
##  7 happy    49009
##  8 follow   48117
##  9 tonight  44701
## 10 night    41446
## # ... with 344,698 more rows

Visualize Text Data

Because I am using the tidytext package, I can easily pipe the results into the ggplot2 package for visualization.

news_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Words Frequency Count For The News Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

blogs_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 28000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Words Frequency Count For The Blogs Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

twitter_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 41000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Words Frequency Count For The Twitter Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Word Cloud For The Text Data

Again, because I am using the tidytext package to prepare my text data, I can easily pipe it into any visualization package.

news_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors=brewer.pal(8, "Spectral")))

blogs_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors=brewer.pal(8, "Dark2")))

twitter_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors=brewer.pal(4, "Set1")))
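
Looking ahead to the prediction model, the same unnest_tokens() call also produces n-grams. A minimal sketch of counting bigrams and looking up likely next words (assuming the news_df tibble from above; the helper function is illustrative, not the final model):

library(tidyr)  # for separate()

# Count bigrams in the news corpus and split them into word pairs
news_bigrams <- news_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Hypothetical helper: the three words most often seen after `w`
predict_next <- function(w) {
  news_bigrams %>%
    filter(word1 == w) %>%
    head(3) %>%
    pull(word2)
}

predict_next("happy")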