The main goal of this report is to acquire and examine the provided data set and begin to understand the scope of building a predictive text application.
To keep this report concise, I won't be showing all of the R code (i.e. echo = FALSE). You can find the full R Markdown on GitHub: (https://github.com/kerskine/coursera_data_science_capstone/blob/master/capstone.Rmd)
The course mentions using R's text mining package (tm), but the tidyverse community offers an alternative in tidytext. Tidytext relies on the tidy principle; applied in this analysis, that means one token per row. This makes it easy to apply other tools (dplyr, tidyr) when exploring and manipulating the data.
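As a toy illustration (made-up text, not from the course data), unnest_tokens() turns lines of text into one word per row:

library(dplyr)
library(tibble)
library(tidytext)

# Two made-up lines of text, stored one line per row
lines <- tibble(text = c("The quick brown fox", "jumps over the lazy dog"))

# unnest_tokens() splits each line into words, one token per row,
# which makes it easy to count, filter, and join with dplyr/tidyr
lines %>% unnest_tokens(word, text)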
The zip file contains a directory "final" which is 1.4 GB in size. In it are subdirectories for different languages (English, German, Finnish, and Russian), each of which has three text files: news, blogs, and twitter. For this report, I'll be using the English files.
Here is some basic information on the text files I'll be examining:
## # A tibble: 3 x 3
##   file     sizeKB lines_of_text
##   <chr>     <dbl>         <int>
## 1 twitter 167105.       2360148
## 2 blogs   210160.        899288
## 3 news    205812.       1010242
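The code that produced this table isn't shown in the report; a minimal sketch that would yield the same columns (the blogs and news paths are assumed to follow the same pattern as the twitter path used below, and countLines() comes from the R.utils package) is:

library(R.utils)   # for countLines()
library(tibble)

# Sketch: file size on disk (bytes / 1000, i.e. kilobytes) and line counts
files <- c(twitter = "./final/en_US/en_US.twitter.txt",
           blogs   = "./final/en_US/en_US.blogs.txt",
           news    = "./final/en_US/en_US.news.txt")

tibble(file          = names(files),
       sizeKB        = file.size(files) / 1000,
       lines_of_text = sapply(files, countLines))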
Next, we'll need to get each of the text files into a tibble so that tidytext can manipulate the data. I'll do this using a purpose-built function, texttidy(). It takes a text file and a sample size (5%) as inputs. The reason for taking only a sample is the limitations of my current analysis system; I'm making the assumption that the data is sufficiently randomized:
twitter <- texttidy("./final/en_US/en_US.twitter.txt", 0.05)
head(twitter, 3)
## # A tibble: 3 x 1
## text
## <chr>
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Lov…
## 2 When you meet someone special... you'll know. Your heart will beat more …
## 3 they've decided its more fun if I don't.
dim(twitter)
## [1] 118007 1
So we can see that 'twitter' is now a tibble containing the lines of text as its only variable, with 118,007 rows (roughly 5% of the 2,360,148 lines in the full file). The blogs and news text files are prepared using the same function.
With the text files now ready for analysis, we can look at the number of distinct words in each tibble. I'll do this with another purpose-built function called wordcount(). It uses the tidytext function unnest_tokens() to tease out each word, and then uses the stop_words data set to eliminate words like is, it, the, etc. (see appendix for source code):
wctwitter <- wordcount(twitter)
wctwitter
## # A tibble: 64,046 x 2
##    word        n
##    <chr>   <int>
##  1 love     5340
##  2 day      4638
##  3 rt       4423
##  4 time     3942
##  5 lol      3427
##  6 3        2739
##  7 people   2607
##  8 follow   2459
##  9 happy    2386
## 10 tonight  2260
## # … with 64,036 more rows
This is interesting: "love" is one of the top ten words! You also see the word "rt", which means "retweet". One of the challenges in predictive text will be determining whether "rt" is something to include in an algorithm; the abbreviation has been used less on Twitter since a number of UI changes were introduced.
What about the other two text files? Are there any words common to all three? Here's a graph of the top 20 words in each file:
We can see some numbers in the top 20 words of each text file. Further analysis is needed to see if they're being used in place of spelled-out words (e.g., the previous sentence uses "top 20 words" instead of "top twenty words").
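As a first step in that follow-up, a sketch like this (not run in the report) would pull the purely numeric tokens out of a word count for inspection:

library(dplyr)
library(stringr)

# Keep only tokens made up entirely of digits, e.g. "3" or "20"
numeric_tokens <- wctwitter %>%
        filter(str_detect(word, "^[0-9]+$")) %>%
        arrange(desc(n))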
A predictive text system not only needs to predict the word being typed, but also what the next word will be. Bigrams, occurrences of two words together, are an important variable in building a good prediction algorithm.
Let's look for bigrams in the Twitter text. I'll use a purpose-built function getngrams() that uses unnest_tokens() to find word combinations (see appendix for source code):
getngrams(twitter, 2)
## # A tibble: 47,929 x 2
##    ngram          n
##    <chr>      <int>
##  1 in the      4018
##  2 for the     3795
##  3 of the      2826
##  4 on the      2391
##  5 to be       2329
##  6 thanks for  2219
##  7 to the      2207
##  8 at the      1823
##  9 i love      1794
## 10 going to    1681
## # … with 47,919 more rows
The top bigrams mostly consist of "stop words", which getngrams() doesn't filter out (unlike wordcount()). At this time, I don't know whether filtering out stop words is important to the eventual prediction product; more thought will be required.
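If filtering does turn out to matter, one common tidytext approach (a sketch, not applied in this report) is to split each bigram into its two words, drop any bigram containing a stop word, and re-unite the rest:

library(dplyr)
library(tidyr)
library(tidytext)

# Sketch: remove bigrams where either word is a stop word
getngrams(twitter, 2) %>%
        separate(ngram, into = c("word1", "word2"), sep = " ") %>%
        filter(!word1 %in% stop_words$word,
               !word2 %in% stop_words$word) %>%
        unite(ngram, word1, word2, sep = " ")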
Let’s look at trigrams:
getngrams(twitter, 3)
## # A tibble: 24,628 x 2
##    ngram                  n
##    <chr>              <int>
##  1 thanks for the      1228
##  2 thank you for        458
##  3 looking forward to   450
##  4 i love you           419
##  5 for the follow       416
##  6 i want to            385
##  7 going to be          379
##  8 can't wait to        374
##  9 a lot of             322
## 10 i need to            306
## # … with 24,618 more rows
Here's a table showing, for each of the sampled text files, the number of lines, words, bigrams, and trigrams:
## # A tibble: 3 x 5
##   File     Lines  Words Bigrams Trigrams
##   <chr>    <int>  <int>   <int>    <int>
## 1 twitter 118007  64046   47929    24628
## 2 blogs    44964  67314   61850    33571
## 3 news     50512  69769   61373    26902
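The code that assembles this summary isn't in the appendix; a sketch consistent with the counts above (row counts of the sampled tibbles and of the wordcount()/getngrams() results) would be:

library(tibble)

# Sketch: one row per source, counting sampled lines, distinct words, bigrams, and trigrams
tibble(File     = c("twitter", "blogs", "news"),
       Lines    = sapply(list(twitter, blogs, news), nrow),
       Words    = sapply(list(wctwitter, wcblogs, wcnews), nrow),
       Bigrams  = sapply(list(twitter, blogs, news), function(x) nrow(getngrams(x, 2))),
       Trigrams = sapply(list(twitter, blogs, news), function(x) nrow(getngrams(x, 3))))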
texttidy <- function(filename, samplesize = 0.05) {
        # This function takes a text file and creates a tibble with one line of text per row
        count <- countLines(filename)                             # countLines() comes from the R.utils package
        ttout <- readLines(filename, n = count * samplesize) %>%  # Read in a sample of the text
                as_tibble()                                       # Turn the output into a tibble
        names(ttout) <- c("text")                                 # Name the variable 'text'
        # Done
        ttout
}
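# Note: wordcount() is referenced above but not reproduced here. A minimal sketch,
# consistent with how it's described in the report (unnest_tokens() to split lines
# into words, then anti_join() against tidytext's stop_words data set), would be:
wordcount <- function(ttout) {
        ttout %>%
                unnest_tokens(word, text) %>%           # one word per row
                anti_join(stop_words, by = "word") %>%  # drop stop words (is, it, the, ...)
                count(word, sort = TRUE)                # count and sort the remaining words
}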
wctwitter <- wordcount(twitter)
wcblogs <- wordcount(blogs)
wcnews <- wordcount(news)

totwords <- rbind(wctwitter %>% top_n(20) %>% mutate(source = "twitter"),
                  wcblogs %>% top_n(20) %>% mutate(source = "blogs"),
                  wcnews %>% top_n(20) %>% mutate(source = "news")) %>%
        group_by(source) %>%
        mutate(word = reorder(word, n))

g <- totwords %>% ggplot(aes(reorder(word, n), n, fill = source))
g + geom_col() + coord_flip() + xlab("Top 20 Words")
getngrams <- function(ttout, combo) {
        # This function takes texttidy() output, plus an integer that sets the ngram length
        # bigrams: combo = 2, trigrams: combo = 3
        ttout <- ttout %>%
                unnest_tokens(ngram, text, token = "ngrams", n = combo) %>%
                count(ngram, sort = TRUE) %>%
                # Keep only ngrams that occur more than three times
                filter(n > 3) %>%
                # Drop NA rows - these appear when looking for trigrams
                drop_na()
        ttout
}