In this project I will demonstrate my understanding of the data set and provide reproducible steps to download, tidy, summarize, and visualize the text data.
The data set is the Coursera SwiftKey corpus, downloaded from the URL used in the code below.
setwd("~/Data-Science-Capstone-Week2")
library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)
library(wordcloud)
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip", quiet=TRUE)
unzip("Coursera-SwiftKey.zip")
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
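Before tidying, a quick summary gives a sense of the scale of each file; this also covers the summarize step mentioned above. The following is a minimal sketch (corpus_summary is just an illustrative name, and the resulting counts depend on the downloaded files):

# Summarize each corpus: number of lines, approximate word count
# (runs of non-whitespace characters), and length of the longest line.
corpus_summary <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines = c(length(blogs), length(news), length(twitter)),
  words = c(sum(str_count(blogs, "\\S+")),
            sum(str_count(news, "\\S+")),
            sum(str_count(twitter, "\\S+"))),
  max_chars = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
corpus_summary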
“We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.” — Julia Silge and David Robinson, Text Mining with R
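To make the one-token-per-row idea concrete, here is a toy example on two made-up lines of text (toy_df is a hypothetical name; this is not part of the capstone data):

# A two-line "document" becomes a table with one word per row after tokenization
toy_df <- tibble(text = c("The quick brown fox", "jumps over the lazy dog"))
toy_df %>% unnest_tokens(word, text)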
# Remove numbers from the text to get a more accurate text analysis
news <- gsub('[0-9]+', '', news)
blogs <- gsub('[0-9]+', '', blogs)
twitter <- gsub('[0-9]+', '', twitter)
# Convert the text data into tibbles with one line of raw text per row
news_df <- tibble(text = news)
blogs_df <- tibble(text = blogs)
twitter_df <- tibble(text = twitter)
# Now we tokenize the data into a table with one word per row
news_tidy <- news_df %>% unnest_tokens(word, text)
blogs_tidy <- blogs_df %>% unnest_tokens(word, text)
twitter_tidy <- twitter_df %>% unnest_tokens(word, text)
I will perform a basic text analysis on the three text sets, including word frequencies and plots, using the tidytext and ggplot2 packages.
In order to get accurate word frequencies we must filter out stop words, which are common words that are not useful for analysis, such as “the,” “of,” “to,” and so on. To filter them out we use dplyr's anti_join() function combined with tidytext's built-in stop_words table. Before we do that, I will demonstrate the word frequencies without removing the stop words:
news_tidy %>% count(word, sort = TRUE)
## # A tibble: 79,459 x 2
## word n
## <chr> <int>
## 1 the 151717
## 2 to 69757
## 3 and 68605
## 4 a 67426
## 5 of 59315
## 6 in 51895
## 7 for 27166
## 8 that 26384
## 9 is 21973
## 10 on 20815
## # ... with 79,449 more rows
We see in the results above that the highest-frequency words are stop words, which should be removed to get more meaningful results.
data("stop_words")
news_tidy <- news_tidy %>% anti_join(stop_words)
news_tidy %>% count(word, sort = TRUE)
## # A tibble: 78,758 x 2
## word n
## <chr> <int>
## 1 time 4474
## 2 people 3673
## 3 city 2902
## 4 school 2702
## 5 percent 2635
## 6 game 2591
## 7 day 2477
## 8 home 2438
## 9 million 2377
## 10 county 2262
## # ... with 78,748 more rows
blogs_tidy <- blogs_tidy %>% anti_join(stop_words)
blogs_tidy %>% count(word, sort = TRUE)
## # A tibble: 296,713 x 2
## word n
## <chr> <int>
## 1 time 90918
## 2 people 59576
## 3 day 52378
## 4 love 45233
## 5 life 41251
## 6 its 38657
## 7 world 29305
## 8 im 29189
## 9 dont 28389
## 10 book 28151
## # ... with 296,703 more rows
twitter_tidy <- twitter_tidy %>% anti_join(stop_words)
twitter_tidy %>% count(word, sort = TRUE)
## # A tibble: 344,708 x 2
## word n
## <chr> <int>
## 1 love 106738
## 2 day 92800
## 3 rt 89557
## 4 time 76806
## 5 lol 70137
## 6 people 52043
## 7 happy 49009
## 8 follow 48117
## 9 tonight 44701
## 10 night 41446
## # ... with 344,698 more rows
Because I am using the tidytext package, I can easily pipe the results into ggplot2 for visualization.
news_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The News Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
blogs_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 28000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The Blogs Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
twitter_tidy %>%
  count(word, sort = TRUE) %>%
  filter(n > 41000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n)) +
  ggtitle("Word Frequency Count For The Twitter Text Set") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic")) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
Let's do some more plots (I love plots); this time we will be using the wordcloud package.
# Stop words were already removed above, so the anti_join() here is redundant but harmless
news_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

blogs_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

twitter_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
As we can see in the analysis above, cleaning and preparing the data is important for producing an accurate text analysis, good plots, and accurate word counts.
My next step is to dive deeper into the text and further clean the text sets. I will also work on relationships between words, n-grams, and correlations to prepare the data for the final word prediction project.
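As a small preview of that next step, here is a minimal sketch of bigram tokenization with tidytext, applied to the news text only for illustration (news_bigrams is just an illustrative name, and stop words are left in at this stage):

# Tokenize the news text into bigrams (pairs of consecutive words)
# and count the most common pairs as a starting point for an n-gram model
news_bigrams <- news_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
head(news_bigrams, 10)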