Introduction

In this project I demonstrate my understanding of the data set and provide reproducible steps to download, tidy, summarize, and visualize the text data.

The Data

The data set is the Coursera SwiftKey corpus, downloaded from the URL used in the download.file() step below.

  1. Setting up the workspace
# set the working directory (path specific to my machine)
setwd("~/Data-Science-Capstone-Week2")
# load the packages used for tidying, summarizing, and plotting the text
library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)
library(wordcloud)
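
If any of these packages are not already installed, a one-time install from CRAN can be run first (a minimal sketch listing the same packages loaded above):

# install the packages used in this report (only needed once)
install.packages(c("tidytext", "dplyr", "stringr", "ggplot2", "wordcloud"))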
  2. Downloading, extracting, and reading the text data
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip", quiet=TRUE)
unzip("Coursera-SwiftKey.zip")
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
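
Before cleaning, it helps to get a rough sense of the size of each text set. The sketch below uses base R on the objects created above (the actual counts are not shown here):

# number of lines and total characters in each text set
data.frame(
  set = c("blogs", "news", "twitter"),
  lines = c(length(blogs), length(news), length(twitter)),
  characters = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
)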
  3. Preparing the data for analysis

I will be using the tidytext package, which applies tidy data principles to work with text data and stores the text in a tibble (data frame) format. To conduct analysis with tidytext, the text must be in tidy text format, which the creators of tidytext, Julia Silge & David Robinson, define as follows:

“We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”

# Remove numbers from the text to get accurate text analysis
news <- gsub('[0-9]+', '', news)
blogs <- gsub('[0-9]+', '', blogs)
twitter <- gsub('[0-9]+', '', twitter)
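
A quick sanity check (a sketch, not part of the original pipeline) can confirm that no digits remain after the substitutions above:

# each call should return FALSE once the numbers have been removed
any(grepl("[0-9]", news))
any(grepl("[0-9]", blogs))
any(grepl("[0-9]", twitter))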

# convert the text data into a tibble (one line of text per row)
news_df <- tibble(text = news)
blogs_df <- tibble(text = blogs)
twitter_df <- tibble(text = twitter)

# now we tokenize the data into a tibble with one word per row
news_tidy <- news_df %>% unnest_tokens(word, text)
blogs_tidy <- blogs_df %>% unnest_tokens(word, text)
twitter_tidy <- twitter_df %>% unnest_tokens(word, text)

Analysis

I will perform a basic text analysis on the three text sets, including word frequencies and plots, using the tidytext and ggplot2 packages.

Word Frequencies

In order to get accurate word frequencies we must filter out stop words, which are words that are not useful for analysis, such as common words like “the,” “of,” “to,” and so on. To filter these words we use dplyr's anti_join() function combined with tidytext's built-in stop_words table. Before we do that, I will demonstrate the word frequencies without removing the stop words:

news_tidy %>% count(word, sort = TRUE)
## # A tibble: 79,459 x 2
##     word      n
##    <chr>  <int>
##  1   the 151717
##  2    to  69757
##  3   and  68605
##  4     a  67426
##  5    of  59315
##  6    in  51895
##  7   for  27166
##  8  that  26384
##  9    is  21973
## 10    on  20815
## # ... with 79,449 more rows

We see in the results above that the highest-frequency words are stop words, which should be removed to get more meaningful results.

data("stop_words")
news_tidy <- news_tidy %>% anti_join(stop_words)
news_tidy %>% count(word, sort = TRUE)
## # A tibble: 78,758 x 2
##       word     n
##      <chr> <int>
##  1    time  4474
##  2  people  3673
##  3    city  2902
##  4  school  2702
##  5 percent  2635
##  6    game  2591
##  7     day  2477
##  8    home  2438
##  9 million  2377
## 10  county  2262
## # ... with 78,748 more rows
blogs_tidy <- blogs_tidy %>% anti_join(stop_words)
blogs_tidy %>% count(word, sort = TRUE)
## # A tibble: 296,713 x 2
##      word     n
##     <chr> <int>
##  1   time 90918
##  2 people 59576
##  3    day 52378
##  4   love 45233
##  5   life 41251
##  6   it’s 38657
##  7  world 29305
##  8    i’m 29189
##  9  don’t 28389
## 10   book 28151
## # ... with 296,703 more rows
twitter_tidy <- twitter_tidy %>% anti_join(stop_words)
twitter_tidy %>% count(word, sort = TRUE)
## # A tibble: 344,708 x 2
##       word      n
##      <chr>  <int>
##  1    love 106738
##  2     day  92800
##  3      rt  89557
##  4    time  76806
##  5     lol  70137
##  6  people  52043
##  7   happy  49009
##  8  follow  48117
##  9 tonight  44701
## 10   night  41446
## # ... with 344,698 more rows

Visualizing the Words

Because I am using the tidytext package, I can easily pipe the results into the ggplot2 package for visualization.

news_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The News Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

blogs_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 28000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The Blogs Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

twitter_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 41000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The Twitter Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Let's do some more plots (I love plots); this time we will use the wordcloud package. Note that the stop words have already been removed from these tidy sets, so the anti_join() calls below are just a safeguard.

news_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

blogs_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

twitter_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
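
The wordcloud() layout is partly random, so to keep these figures reproducible a seed can be set before each cloud (a minimal sketch; the seed value is arbitrary):

# fix the random seed so the word cloud layout is the same on every run
set.seed(1234)
news_tidy %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))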

Summary

As we can see in the analysis above, cleaning and preparing the data is important for producing an accurate text analysis, with meaningful word counts and useful plots.

The Next Step

I believe my next step is to dive deeper into the text and further clean the text sets. I will also work on relationships between words, n-grams, and correlations to prepare the data for the final word prediction project.
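
As a preview of the n-gram step, unnest_tokens() can also tokenize into two-word sequences (bigrams) rather than single words. The sketch below reuses the news_df tibble created earlier; the output is not shown here:

# tokenize the news text into bigrams and count the most frequent ones
news_bigrams <- news_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
news_bigrams %>% count(bigram, sort = TRUE)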