This report presents an exploratory data analysis for the Coursera Capstone Project, part of the Data Science Specialization from Johns Hopkins University. The Capstone Project’s goal is to produce an app that predicts the next word as the user types. In order to build the prediction algorithm, students are provided with text files in four different languages, of which I will use the English texts. These three files consist of data from blog, news and Twitter sources.
Since the data was provided by the course as a zip file, it was downloaded and stored in a local folder. Here are the data sets:
# to speed up the process, I use parallel processing
library(parallel)
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# twitter
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# file size
size<-file.info("./final/en_US/en_US.twitter.txt")
kb<-size$size/1024
mb<-kb/1024
mb
## [1] 159.3641
################### news
news<-file("./final/en_US/en_US.news.txt","r")
news_lines<-readLines(news)
close(news)
# file size
size<-file.info("./final/en_US/en_US.news.txt")
kb<-size$size/1024
mb<-kb/1024
mb
## [1] 196.2775
##################### blogs
blogs<-file("./final/en_US/en_US.blogs.txt","r")
blogs_lines<-readLines(blogs)
close(blogs)
# file size
size<-file.info("./final/en_US/en_US.blogs.txt")
kb<-size$size/1024
mb<-kb/1024
mb
## [1] 200.4242
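Since the same size calculation is repeated for each file, it could be wrapped in a small helper; a minimal sketch is shown below (the function name file_size_mb is my own and not part of the original script).
# helper: return a file's size in megabytes (suggested refactor, not in the original code)
file_size_mb <- function(path) {
  file.info(path)$size / 1024^2
}
file_size_mb("./final/en_US/en_US.blogs.txt")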
As seen above, each data set is quite large for text data: their sizes range from roughly 150 MB to 200 MB. In this step I check how the character and word distributions vary across the data sets. This is a very macro-level approach to understanding what kind of text we have. To do so, I first calculate the character counts per line and show them with a boxplot.
cnum.blogs<-nchar(blogs_lines)
cnum.news<-nchar(news_lines)
cnum.twitter<-nchar(twitter)
boxplot(cnum.blogs, cnum.news, cnum.twitter, log = "y",
names = c("blogs", "news", "twitter"),
ylab = "log(Number of Characters)", xlab = "File Name")
title("Comparing Distributions of Chracters per Line")
As seen from the box plot, blogs have the highest number of characters per line while, as expected, Twitter text has the lowest. This means that blog writers construct longer texts compared to news articles and Twitter posts. The next step is to calculate the word frequency by line.
library(tidyverse)
library(stringi) # for word counts
# twitter
twt_words <- stri_count_words(twitter)
p1 <- qplot(twt_words,
geom = "histogram",
main = "Twitter",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
# blogs
blogs_words <- stri_count_words(blogs_lines)
p2 <- qplot(blogs_words,
geom = "histogram",
main = "US Blogs",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
# News
news_words <- stri_count_words(news_lines)
p3 <- qplot(news_words,
geom = "histogram",
main = "US news",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
# designing plots
library(cowplot)
plot_grid(p1, p2, p3, nrow = 3)
Results show that, on average, each corpus has a relatively low number of words per line. Blogs tend to have more words per line, followed by news and then Twitter, which has the fewest. The lower number of words per line for the Twitter data is expected, given that a tweet is limited to a certain number of characters.
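To put approximate numbers on this comparison, the summary statistics of the words-per-line vectors computed above can be printed directly (a quick check added here for reference):
# summary statistics of words per line for each source (quick numeric check)
sapply(list(blogs = blogs_words, news = news_words, twitter = twt_words), summary)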
To understand the frequencies of words and word pairs, I build figures and tables that show how these frequencies vary in the data. However, since the full data set is too big, further exploration is done on a subset. A sample of ten percent of the lines was drawn from each of the three files for blogs, news and twitter. These samples are then combined into a single corpus and tokenized.
# convert the character vectors to tibbles (one line per row)
twitter <- enframe(twitter)
news_lines <- enframe(news_lines)
blogs_lines <- enframe(blogs_lines)
# sample size: 10% of each file
sample_size <- 0.1
blogs_sample <- blogs_lines %>%
sample_n(., nrow(blogs_lines)*sample_size)
news_sample <- news_lines %>%
sample_n(., nrow(news_lines)*sample_size)
twitter_sample <- twitter %>%
sample_n(., nrow(twitter)*sample_size)
# merge data sets
merged_df <- bind_rows(mutate(blogs_sample, source = "blogs"),
mutate(news_sample, source = "news"),
mutate(twitter_sample, source = "twitter"))
merged_df$source <- as.factor(merged_df$source)
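Note that sample_n() draws a random subset, so re-running the chunk gives a slightly different sample; calling set.seed() before the sampling would make it reproducible. The composition of the merged sample can be checked as follows (a small verification added here, not in the original script):
# number of sampled lines contributed by each source
merged_df %>%
  count(source)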
Above, I subset ten percent of each data set and then combined the samples into a single corpus. The most frequent words in the whole corpus are shown in the first graph, and then the variation of the word frequencies by source is shown.
################## word frequency ###########
library(tidytext)
library(stringr)
library(tm)
tidy_text <- merged_df %>%
unnest_tokens(word, value)%>%
mutate(word=removePunctuation(word)) %>%
mutate(word=str_squish(word)) %>%
mutate(word=removeNumbers(word)) %>%
filter(str_length(word)>2) %>%
anti_join(stop_words)
# most frequent words in the full corpus
tidy_text %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(20, n) %>%
ggplot(aes(word, n)) +
geom_col() +
coord_flip() +
labs(y = NULL,
x= " Most frequently used top 20 words")
# word frequencies by source
freq <- tidy_text %>%
count(source, word) %>%
group_by(source) %>%
mutate(proportion = n / sum(n)) %>%
spread(source, proportion) %>%
gather(source, proportion, `blogs`:`twitter`) %>%
arrange(desc(proportion), desc(n))
freq %>%
filter(proportion > 0.002) %>%
mutate(word = reorder(word, proportion)) %>%
ggplot(aes(word, proportion)) +
geom_col() +
xlab(NULL) +
coord_flip() +
facet_grid(~source, scales = "free")
For simplicity, I removed numbers, extra whitespace, stop words and punctuation from the text with the tm and tidytext packages. As seen from the graphs, there are wide differences between the document sources, and the most frequent words are more likely to come from the Twitter document.
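To see this in numbers rather than only in the plot, the top of the freq table (already sorted by proportion) lists the highest within-source word proportions together with the source they come from (a quick check added here):
# highest within-source word proportions; the source column shows where each occurs
freq %>%
  slice_head(n = 10)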
In the next step, I will examine the bigram and trigram results for the corpus.
Initially I will create the bigrams. Instead of pulling only the most frequent raw bigrams, I will extract more meaningful ones. For the purpose of this project, the full set of bigrams will probably be more useful, since we aim to predict word combinations for a keyboard; here, however, I want to show what kind of meaningful phrases are used in this corpus. To do so, I first create the bigrams, then split the word pairs into two columns, remove the stop words from those pairs and finally merge them again.
# bigram
bigrams <- merged_df %>%
unnest_tokens(bigram, value, token = "ngrams", n = 2)
# counting ngrams
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 2,741,671 × 2
## bigram n
## <chr> <int>
## 1 of the 43306
## 2 in the 40901
## 3 to the 21424
## 4 for the 20124
## 5 on the 19806
## 6 to be 16224
## 7 at the 14423
## 8 and the 12577
## 9 in a 12017
## 10 with the 10720
## # … with 2,741,661 more rows
# separating word pairs to remove stop words
library(tidyr)
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
drop_na(word1) %>%
drop_na(word2)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 1,138,766 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 st louis 987
## 2 happy birthday 903
## 3 1 2 851
## 4 san francisco 654
## 5 los angeles 653
## 6 social media 607
## 7 san diego 509
## 8 health care 497
## 9 30 p.m 454
## 10 ice cream 447
## # … with 1,138,756 more rows
# reunite word pairs
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 1,656,053 × 3
## name source bigram
## <int> <fct> <chr>
## 1 170077 blogs super stylish
## 2 170077 blogs stylish sis
## 3 170077 blogs charles tyrwhitt
## 4 170077 blogs sample sales
## 5 170077 blogs bargainous offers
## 6 471652 blogs cover looked
## 7 471652 blogs reading standing
## 8 471652 blogs feels weird
## 9 471652 blogs partner reads
## 10 471652 blogs reads rocking
## # … with 1,656,043 more rows
bigrams_united %>%
count(bigram) %>%
top_n(20, n) %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
In the final step, I will extract the trigrams from the corpus, following the same approach used above. The result will show us more meaningful sets of three words used together.
trigram <- merged_df %>%
unnest_tokens(trigram, value, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
drop_na(word1)%>%
drop_na(word2) %>%
drop_na(word3) %>%
count(word1, word2, word3, sort = TRUE)
# reunite trigram words
trigrams_united <- trigram %>%
unite(trigram, word1, word2, word3, sep = " ")
# generate plot
g <- ggplot(trigrams_united[1:20,], aes(reorder(trigram, -n), n))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = n ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Trigrams")
print(g)
The final step of the capstone project is to build a predictive algorithm and a Shiny app that provides the user interface.
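As a rough illustration of the direction this will take, the n-gram counts built above could already feed a simple frequency-based lookup. Below is a minimal sketch, assuming the bigram_counts table from the bigram step; the function predict_next is hypothetical and is not the final prediction algorithm.
# naive next-word lookup from the bigram counts built above (illustrative sketch only)
predict_next <- function(prev_word, counts = bigram_counts, top = 3) {
  counts %>%
    filter(word1 == tolower(prev_word)) %>%
    arrange(desc(n)) %>%
    head(top) %>%
    pull(word2)
}
predict_next("happy")  # "birthday" is expected among the top candidates, given the counts above
The real model will of course need the unfiltered n-grams (including stop words) plus some form of backoff or smoothing for unseen word combinations.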