Milestone Report for the Capstone Project

This report presents an exploratory data analysis for the Coursera Capstone Project, part of the Data Science Specialization from Johns Hopkins University. The Capstone Project's goal is to produce an app that predicts the next word as a user types. To build the prediction algorithm, students are provided with text in four different languages, of which I will use the English text. The English data consists of three files drawn from blog, news and Twitter sources.

Data

Since the data were provided by the course as a zip file, they were downloaded and extracted into a local folder.
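
For completeness, here is a sketch of the one-time download step (the URL below is the commonly used course link and the folder layout is an assumption; adjust the paths as needed):

# download and unzip the course data if it is not already present (run once)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!dir.exists("./final/en_US")) {
        download.file(zip_url, destfile = zip_file, mode = "wb")
        unzip(zip_file) # creates ./final/ with one subfolder per language
}

Here are the data sets: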

# to speed up the process, I use parallel processing
library(parallel)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention: leave 1 core for the OS
registerDoParallel(cluster)
# note: the workers should be released with stopCluster(cluster) once processing is done

# twitter (read via an explicit connection so it can be closed afterwards)
con <- file("./final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# file size
size <- file.info("./final/en_US/en_US.twitter.txt")
mb <- size$size / 1024^2 # bytes -> megabytes
mb
## [1] 159.3641
################### news
news<-file("./final/en_US/en_US.news.txt","r")
news_lines<-readLines(news)
close(news)

# file size
size <- file.info("./final/en_US/en_US.news.txt")
mb <- size$size / 1024^2 # bytes -> megabytes
mb
## [1] 196.2775
##################### blogs
blogs<-file("./final/en_US/en_US.blogs.txt","r")
blogs_lines<-readLines(blogs)
close(blogs)

# file size
size <- file.info("./final/en_US/en_US.blogs.txt")
mb <- size$size / 1024^2 # bytes -> megabytes
mb
## [1] 200.4242
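
For reference, a quick count of the number of lines read in from each file (a small check based on the character vectors created above):

# number of lines per source
data.frame(source = c("blogs", "news", "twitter"),
           lines = c(length(blogs_lines), length(news_lines), length(twitter)))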

Character and Word Frequency of the texts

As seen above, each data set is quite big for text data: the file sizes range from roughly 160 MB to 200 MB. In this step we check how the character and word distributions vary across the data sets. This is a very macro-level approach to understanding what kind of text we have. To do so, I first calculate the number of characters per line and show the distributions with a boxplot.

cnum.blogs<-nchar(blogs_lines)
cnum.news<-nchar(news_lines)
cnum.twitter<-nchar(twitter)

boxplot(cnum.blogs, cnum.news, cnum.twitter, log = "y",
        names = c("blogs", "news", "twitter"),
        ylab = "log(Number of Characters)", xlab = "File Name") 
title("Comparing Distributions of Chracters per Line")

As seen from the box plot, blogs have the highest typical number of characters per line while, as expected, the twitter text has the lowest. This means that blog writers construct longer lines of text than news articles and twitter posts do. The next step is to calculate the number of words per line.
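
A quick numeric check of this claim, using the character counts computed above:

# median characters per line for each source
sapply(list(blogs = cnum.blogs, news = cnum.news, twitter = cnum.twitter), median)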

library(tidyverse)
library(stringi) # for word counts

# twitter
twt_words <- stri_count_words(twitter)
p1 <- qplot(twt_words,
            geom = "histogram",
            main = "Twitter",
            xlab = "Words per Line",
            ylab = "Frequency",
            binwidth = 5)
# blogs
blogs_words <- stri_count_words(blogs_lines)
p2 <- qplot(blogs_words,
            geom = "histogram",
            main = "US Blogs",
            xlab = "Words per Line",
            ylab = "Frequency",
            binwidth = 5)
# News
news_words <- stri_count_words(news_lines)
p3 <- qplot(news_words,
            geom = "histogram",
            main = "US news",
            xlab = "Words per Line",
            ylab = "Frequency",
            binwidth = 5)

# designing plots 
library(cowplot)
plot_grid(p1, p2, p3, nrow = 3)

The results show that, on average, each corpus has a relatively low number of words per line. Blogs tend to have the most words per line, followed by news, and then twitter, which has the fewest. The lower number of words per line in the Twitter data is expected, given that a tweet is limited to a certain number of characters.
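
The corresponding summary numbers (a small sketch using the word-count vectors above):

# median and mean words per line for each source
sapply(list(blogs = blogs_words, news = news_words, twitter = twt_words),
       function(x) c(median = median(x), mean = round(mean(x), 1)))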

The most frequent words in the corpora

To understand the frequencies of words and word pairs, I build figures and tables that show how they vary across the data. However, since the full data set is too big, further exploration is done on a subset: a ten percent sample of lines is drawn from each of the blogs, news and twitter files. These samples are then combined into a single corpus and tokenized.

# convert the character vectors to tibbles
twitter <- enframe(twitter)
news_lines <- enframe(news_lines)
blogs_lines <- enframe(blogs_lines)
# sampling fraction (ten percent of the lines in each file)
sample_size <- 0.1
set.seed(2023) # fix the random seed for reproducibility (seed value is illustrative; none was set originally)

blogs_sample <- blogs_lines %>%
        sample_n(., nrow(blogs_lines)*sample_size)
news_sample <- news_lines %>%
        sample_n(., nrow(news_lines)*sample_size)
twitter_sample <- twitter %>%
        sample_n(., nrow(twitter)*sample_size)


# merge data sets

merged_df <- bind_rows(mutate(blogs_sample, source = "blogs"),
                         mutate(news_sample,  source = "news"),
                         mutate(twitter_sample, source = "twitter")) 
merged_df$source <- as.factor(merged_df$source)
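
As a quick sanity check on the merge, the number of sampled lines per source:

# sampled lines per source
merged_df %>% count(source)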

Word Frequency

I first subset ten percent of each data set and combine the samples into a single corpus. The most frequent words in the whole corpus are shown in the first graph, and then the variation of word frequencies by source is shown.

################## word frequency ###########
library(tidytext)
library(stringr)
library(tm)


tidy_text <- merged_df %>%
        unnest_tokens(word, value) %>%
        mutate(word = removePunctuation(word)) %>%
        mutate(word = str_squish(word)) %>%
        mutate(word = removeNumbers(word)) %>%
        filter(str_length(word) > 2) %>%
        anti_join(stop_words, by = "word")

# within total
tidy_text %>%
        count(word, sort = TRUE) %>%
        mutate(word = reorder(word, n)) %>%
        top_n(20, n) %>% 
        ggplot(aes(word, n)) +
        geom_col() +
        coord_flip() +
        labs(y = NULL,
             x = "Top 20 most frequent words")

# by groups
freq <- tidy_text %>%
        count(source, word) %>%
        group_by(source) %>%
        mutate(proportion = n / sum(n)) %>%
        spread(source, proportion) %>%
        gather(source, proportion, `blogs`:`twitter`) %>%
        arrange(desc(proportion), desc(n))


freq %>%
        filter(proportion > 0.002) %>% 
        mutate(word = reorder(word, proportion)) %>% 
        ggplot(aes(word, proportion)) +
        geom_col() + 
        xlab(NULL) + 
        coord_flip() +
        facet_grid(~source, scales = "free")

For simplicity, I removed numbers, extra whitespace, stop words and punctuation from the text using the tm and tidytext packages. As seen from the graphs, there are wide differences between the document sources; the most frequent words are more likely to come from the twitter documents.
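
One rough way to quantify these differences (a sketch that recomputes word proportions from tidy_text; the comparison itself is my addition) is to correlate the word proportions between each pair of sources:

# pairwise correlation of word proportions across sources
word_props <- tidy_text %>%
        count(source, word) %>%
        group_by(source) %>%
        mutate(proportion = n / sum(n)) %>%
        ungroup() %>%
        select(-n) %>%
        pivot_wider(names_from = source, values_from = proportion)

word_props %>%
        summarise(blogs_vs_news    = cor(blogs, news,    use = "complete.obs"),
                  blogs_vs_twitter = cor(blogs, twitter, use = "complete.obs"),
                  news_vs_twitter  = cor(news,  twitter, use = "complete.obs"))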

In the next step, I will check the bigram and trigram results of the corpus.

Ngrams

Bigram

Initially I will create the bigrams. Instead of pulling only the most frequent raw bigrams, I will extract the more meaningful ones. For the purpose of this project the raw bigram counts will probably be more useful, since we aim to predict word combinations for a keyboard, but here I want to show what kinds of meaningful phrases are used in this corpus. To do so, I first create the bigrams, then separate each word pair into two columns, remove the pairs that contain stop words, and finally unite them again.

# bigram
bigrams <- merged_df %>%
        unnest_tokens(bigram, value, token = "ngrams", n = 2)

# counting ngrams
bigrams %>%
        count(bigram, sort = TRUE)
## # A tibble: 2,741,671 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the   43306
##  2 in the   40901
##  3 to the   21424
##  4 for the  20124
##  5 on the   19806
##  6 to be    16224
##  7 at the   14423
##  8 and the  12577
##  9 in a     12017
## 10 with the 10720
## # … with 2,741,661 more rows
# separating word pairs to remove stop words
library(tidyr)


bigrams_separated <- bigrams %>%
        separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
        filter(!word1 %in% stop_words$word) %>%
        filter(!word2 %in% stop_words$word) %>%
        drop_na(word1) %>%
        drop_na(word2)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
        count(word1, word2, sort = TRUE)

bigram_counts
## # A tibble: 1,138,766 × 3
##    word1  word2         n
##    <chr>  <chr>     <int>
##  1 st     louis       987
##  2 happy  birthday    903
##  3 1      2           851
##  4 san    francisco   654
##  5 los    angeles     653
##  6 social media       607
##  7 san    diego       509
##  8 health care        497
##  9 30     p.m         454
## 10 ice    cream       447
## # … with 1,138,756 more rows
# reunite the word pairs
bigrams_united <- bigrams_filtered %>%
        unite(bigram, word1, word2, sep = " ")

bigrams_united
## # A tibble: 1,656,053 × 3
##      name source bigram           
##     <int> <fct>  <chr>            
##  1 170077 blogs  super stylish    
##  2 170077 blogs  stylish sis      
##  3 170077 blogs  charles tyrwhitt 
##  4 170077 blogs  sample sales     
##  5 170077 blogs  bargainous offers
##  6 471652 blogs  cover looked     
##  7 471652 blogs  reading standing 
##  8 471652 blogs  feels weird      
##  9 471652 blogs  partner reads    
## 10 471652 blogs  reads rocking    
## # … with 1,656,043 more rows
bigrams_united %>%
        count(bigram) %>% 
        top_n(20, n) %>%
        mutate(bigram = reorder(bigram, n)) %>%
        ggplot(aes(bigram, n)) +
        geom_col() +
        xlab(NULL) +
        coord_flip()

Trigram

In the final step, I extract the trigrams from the corpus, following the same approach used above. The result shows the more meaningful sets of three words that are used together.

trigram <- merged_df %>%
        unnest_tokens(trigram, value, token = "ngrams", n = 3) %>%
        separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
        filter(!word1 %in% stop_words$word,
               !word2 %in% stop_words$word,
               !word3 %in% stop_words$word) %>%
        drop_na(word1)%>%
        drop_na(word2) %>%
        drop_na(word3) %>%
        count(word1, word2, word3, sort = TRUE)

# reunite the trigram words
trigrams_united <- trigram %>%
        unite(trigram, word1, word2, word3, sep = " ")  


# generate plot
g <- ggplot(trigrams_united[1:20,], aes(reorder(trigram, -n), n))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = n ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Trigrams")
print(g)

Next step: creating the algorithm

The final step of the capstone project is to build a predictive text algorithm and to deploy it in a Shiny app that provides the user interface.
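
As a rough illustration of where this is heading, here is a minimal back-off style sketch built from the n-gram counts computed above. The function name and fallback logic are illustrative only, and a real model would be trained on n-grams that keep stop words (the tables above had them removed).

# minimal back-off sketch: try the trigram table, then the bigram table,
# then the most frequent single words (illustrative only)
predict_next_word <- function(phrase, top = 3) {
        words <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]

        # 1. look up the last two words in the trigram counts
        if (length(words) >= 2) {
                last_two <- tail(words, 2)
                hit <- trigram %>%
                        filter(word1 == last_two[1], word2 == last_two[2]) %>%
                        arrange(desc(n)) %>%
                        head(top)
                if (nrow(hit) > 0) return(hit$word3)
        }

        # 2. back off to the bigram counts using the last word
        hit <- bigram_counts %>%
                filter(word1 == tail(words, 1)) %>%
                arrange(desc(n)) %>%
                head(top)
        if (nrow(hit) > 0) return(hit$word2)

        # 3. final fallback: the most frequent single words in the tidy sample
        tidy_text %>%
                count(word, sort = TRUE) %>%
                head(top) %>%
                pull(word)
}

# example call
predict_next_word("happy")

In the app itself, the n-gram tables would be precomputed from the full, unfiltered corpus and the lookup wrapped in the Shiny interface.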