Understanding the problem

I have been given a large corpus of text for use in building a predictive text algorithm. This algorithm will be used in a web application built using Shiny. The application will be similar to the predictive text you see as you type a text message on your phone. To do this, we will use the last few words typed to predict and display the words most likely to be typed next. The person typing will be able to select one of these words instead of typing it.

I began by collecting some resources to aid me in my project.

Text Mining in R This is a technical description of the tm package, with many examples pertaining to common text mining tasks.

CRAN page for tm The ‘Introduction to the tm package’ vignette is especially helpful and highly readable. It describes how to create a Corpus from a Source, transform it using the tm_map function, filter it using the tm_filter function, and manage its metadata (a minimal sketch of this workflow appears after this list of resources). There are also sections on common operations used in text analysis that I may add details on later.

A gentle introduction to text mining using R This blog post is a novice’s guide, and even covers installing R, but eventually does address some deeper aspects of using the tm and SnowballC packages.

CRAN Task View: Natural Language Processing This resource contains brief descriptions of and links to many NLP frameworks and resources available in R.

Stanford NLP Course Videos A link to a YouTube playlist for Stanford Professor Dan Jurafsky’s NLP Course, no longer on Coursera.

NLTK Book This is a textbook about text analysis in Python, but the exercises are good for any language.

Text Mining with R, A Tidy Approach Julia Silge and David Robinson’s new book on O’Reilly. In short, the tidyverse meets NLP.

Text Processing in R This is a brief tutorial.

Text Mining, The Tidy Way Julia Silge’s January 14, 2017 presentation at RStudio::Conf advocating the application of tidyverse to text mining tasks.

JHU Data Science Specialization - Capstone Course - Week 1 This gives a good overview of the capstone project ‘task 0’ - understanding the problem and ‘task 1’ - getting and cleaning the data.

JHU Data Science Specialization - Capstone Course - Week 2 This provides an overview of the capstone project ‘task 2 - exploratory data analysis’ and ‘task 3 - modeling’.
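
Since the tm workflow described above comes up repeatedly in these resources, here is a minimal sketch of it, using only functions documented in the tm vignette. The two documents are placeholders for illustration; this is not the approach I ultimately take below, where I use tidytext instead.

# Minimal tm workflow: build a Corpus from a Source, transform it with tm_map,
# and filter it with tm_filter (placeholder documents, for illustration only)
library(tm)
docs <- c("The quick brown fox jumps over the lazy dog.",
          "Text mining in R with the tm package.")
corp <- Corpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
tm_filter(corp, function(doc) any(grepl("mining", content(doc))))
meta(corp[[1]])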

Data acquisition and cleaning

I downloaded and extracted the data files into my project directory. For brevity's sake, I am not displaying this code, as there is nothing remarkable about it, but it is part of the source document.

More information on these corpora of text was located using the Wayback Machine and can be read here: About the Corpora

After considering the many options for reading, cleaning and tokenizing data for analysis, I decided to proceed using an approach as outlined by Hadley Wickham here.

The first step is to read and sample the data. The entire data set is too large for the resources I have available, so I am sampling 5% of the lines of text for my analysis and discarding the remainder, for the time being. I am showing the full code here so that this step can be reproduced if desired. We’ll read, sample, and calculate basic statistics on the three data sources provided.

# Load the packages used throughout this report
library(readr)     # read_lines
library(dplyr)     # data_frame, joins and counting
library(tidyr)     # spread
library(stringi)   # stri_* character cleaning functions
library(tidytext)  # unnest_tokens and the stop_words list
library(ggplot2)   # plots
library(psych)     # describe

# This function takes as input the filename of a text file representing a corpus,
# a percentage of lines to select for a sample, and a tag for the source, and
# returns a list including the size of the corpus, the sample size, and a
# data_frame of the selected lines

read_afile <- function(filename, sample_pct, tag) {
        if (sample_pct < 0 | sample_pct > 1) {
                stop("sample_pct must be between 0 and 1")
        }
        fullfile <- read_lines(filename, progress=FALSE)
        sampled <- data_frame(source = tag,
                              text = sample(x = fullfile, size = length(fullfile)*sample_pct))
        list(sample=sampled, sample_length=nrow(sampled), corpus_length=length(fullfile))
}

# Use the function above to take samples of the input files. The function returns the data and some statistics in a list. We will parse the list to extract the lines of text and these statistics.

samp_pct = 0.05
blogl <- read_afile("final/en_US/en_US.blogs.txt", sample_pct = samp_pct, tag = "b")
blogs <- blogl[["sample"]]
newsl <- read_afile("final/en_US/en_US.news.txt", sample_pct = samp_pct, tag = "n")
newss <- newsl[["sample"]]
tweetl <- read_afile("final/en_US/en_US.twitter.txt", sample_pct = samp_pct, tag = "t")
tweets <- tweetl[["sample"]]

# Begin to create a structure to contain some key statistics about the corpora.

text_stats <- data_frame(
                source=c("b","n","t"),
                all_lines=c(blogl[["corpus_length"]],newsl[["corpus_length"]],tweetl[["corpus_length"]]),
                sampled_lines=c(blogl[["sample_length"]],newsl[["sample_length"]],tweetl[["sample_length"]]))

Next, I joined these data together into a variable named “sample_text” and removed the temporary variables I used to build it to free up memory. I also added line numbers to the data for backtracking, if needed.

# Simply join all of the inputs, add a linenum so we don't lose track of where things came from, and re-order the columns.

sample_text <- union_all(blogs,newss) %>% 
               union_all(tweets) %>%
               group_by(source) %>%
               mutate(linenum = row_number()) %>%
               ungroup() %>%
               select(source, linenum, text)

# Clean up duplicate objects consuming memory

rm(list=c("blogl","newsl","tweetl",
          "blogs","newss","tweets"))

My first pass at cleaning the data is at the character level and includes four steps. I converted the multiple characters that appear as single quotes in the text to a single apostrophe, converted underscores to spaces, transliterated other non-ASCII characters to ASCII equivalents where they exist, and then removed any remaining non-ASCII characters.

# Normalize the various single-quote characters to a plain ASCII apostrophe
sample_text$text <- stri_replace_all(sample_text$text,"\u0027",regex="[\u0027\u2018\u2019]")
# Convert underscores to spaces
sample_text$text <- stri_replace_all(sample_text$text,"\u0020",regex="[\u005F]")
# Transliterate Latin characters with diacritics to their ASCII equivalents
sample_text$text <- stri_trans_general(sample_text$text,"latin-ascii")
# Replace any remaining non-ASCII characters with a space
sample_text$text <- stri_replace_all(sample_text$text, " ", regex = "[^\u0001-\u007F]")

There were several categories of words (stop words, profanity, and words containing numerals) that I wasn’t sure about removing from the sample text prior to exploration. I gathered information about these and decided to exclude many details for brevity. After some analysis, I decided on using the tidytext “stop words” list and LDNOOBW’s repo of “profane words”.
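
The code to load these two lists is short, so I sketch it here. The stop word list ships with tidytext; for the profanity list, the URL below is my best guess at the raw path of LDNOOBW's English file, so verify it against the repository before relying on it.

# Load the word lists used below; the LDNOOBW URL is an assumption and may
# need adjusting if the repository layout has changed
data("stop_words", package = "tidytext")
profane_words <- data_frame(
        word = read_lines(paste0("https://raw.githubusercontent.com/LDNOOBW/",
                                 "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/",
                                 "master/en")))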

There are 1149 stop words and 377 profane words in these lists. At this point I tokenized the text and added a column for the word number in the line. Ten rows are shown, corresponding to the first ten words in the first line of text in the ’b’log source.

# Tokenize the sample text to words

sample_words <- sample_text %>% 
                unnest_tokens(word,text) %>% 
                group_by(source, linenum) %>% 
                mutate(wordnum = row_number()) %>% 
                ungroup() %>%
                arrange(source, linenum, wordnum)
print(sample_words,n=10)
## # A tibble: 5,136,634 × 4
##    source linenum    word wordnum
##     <chr>   <int>   <chr>   <int>
## 1       b       1     how       1
## 2       b       1      to       2
## 3       b       1    walk       3
## 4       b       1      in       4
## 5       b       1     the       5
## 6       b       1   woods       6
## 7       b       1 without       7
## 8       b       1   being       8
## 9       b       1      in       9
## 10      b       1     the      10
## # ... with 5.137e+06 more rows

Exploratory analysis

To deepen my analysis, I calculated some additional statistics about the total word count, stop word count, profane word count, and numeric word count, by source. A table is shown below.

# Count occurrences of word categories that we may exclude for some analysis

total_words <- group_by(sample_words,source) %>% 
               count() %>% 
               rename(total_words=n)
numeric     <- filter(sample_words, grepl("[0-9]",word)) %>% 
               group_by(source) %>% 
               count() %>% 
               rename(numeric=n)
profanity   <- inner_join(sample_words, profane_words, by="word") %>% 
               group_by(source) %>% 
               count() %>% 
               rename(profanity=n)
stop_wds    <- inner_join(sample_words, stop_words, by="word") %>% 
               group_by(source) %>% 
               count() %>% 
               rename(stop_wds=n)

# Add them to our text_stats structure

text_stats <- text_stats %>% 
                 inner_join(total_words, by="source") %>% 
                 inner_join(numeric, by="source") %>% 
                 inner_join(profanity, by="source") %>%  
                 inner_join(stop_wds, by="source")

text_stats
## # A tibble: 3 × 7
##   source all_lines sampled_lines total_words numeric profanity stop_wds
##    <chr>     <int>         <int>       <int>   <int>     <int>    <int>
## 1      b    899288         44964     1895380   23502      1239  3117804
## 2      n   1010242         50512     1735435   46304       554  2450622
## 3      t   2360148        118007     1505819   28317      5284  2249485

I decided that for this project I would only remove numeric and profane words.

clean_words <- sample_words %>%
               anti_join(profane_words, by="word") %>%
               filter(!grepl("[0-9]",word)) 

word_count <- nrow(clean_words)

I was left with 5031434 words in the sample. Here is a chart showing the 30 most frequent words in each source of the sample.

top30 <- clean_words %>% 
        count(word, source) %>% 
        group_by(source) %>% 
        top_n(30,wt=n) %>% 
        arrange(source,desc(n))

facet_labels <- c(b="Blogs",n="News",t="Tweets")
g <- ggplot(data=top30, aes(x=word, y=n/100, fill=source))  
g <- g + geom_bar(stat="identity") 
g <- g + facet_grid(. ~ source, labeller = labeller(source=facet_labels))
g <- g + theme(legend.position="none")
g <- g + xlab(NULL) + ylab("Frequency (hundreds)")
g <- g + coord_flip()
g

Next I looked at word frequencies across the entire sample, regardless of source. Here are the top 30 words.

word_freq <- clean_words %>% 
             count(word, source) %>% 
             spread(source,n) %>%
             mutate(freq = rowSums(cbind(b, n, t), na.rm = TRUE),  # row-wise total across sources
                    pct = freq/word_count) %>%
             rename(blogs=b, news=n, tweets=t) %>%
             arrange(desc(freq)) %>%
             ungroup() %>%
             mutate(rank = row_number())

print(word_freq,n=30)
## # A tibble: 120,299 × 7
##     word blogs  news tweets   freq         pct  rank
##    <chr> <int> <int>  <int>  <int>       <dbl> <int>
## 1    the 93443 99096  47068 239607 0.047622010     1
## 2     to 53551 44690  39318 137559 0.027339919     2
## 3    and 55730 44316  21589 121635 0.024175017     3
## 4      a 45182 43819  30378 119379 0.023726635     4
## 5     of 44287 39031  18013 101331 0.020139586     5
## 6     in 30217 34102  19096  83415 0.016578773     6
## 7      i 39095  7884  36234  83213 0.016538625     7
## 8    for 18400 17976  19473  55849 0.011100016     8
## 9     is 21694 14268  17926  53888 0.010710267     9
## 10  that 23186 17379  11651  52216 0.010377956    10
## 11   you 15212  4858  27576  47646 0.009469666    11
## 12    it 20417 11102  14560  46079 0.009158224    12
## 13    on 13883 13671  13943  41497 0.008247549    13
## 14  with 14576 12668   8711  35955 0.007146074    14
## 15   was 13967 11464   6081  31512 0.006263026    15
## 16    my 13795  2051  14591  30437 0.006049369    16
## 17    at  8679 10651   9472  28802 0.005724412    17
## 18    be 10386  7741   9252  27379 0.005441590    18
## 19  this 12941  6023   8048  27012 0.005368648    19
## 20  have 11186  7172   8416  26774 0.005321346    20
## 21   are  9864  6878   8076  24818 0.004932590    21
## 22   but 10423  7576   6364  24363 0.004842158    22
## 23    as 11121  9334   3587  24042 0.004778359    23
## 24    he  7474 11210   2831  21515 0.004276117    24
## 25    we  9610  5045   6680  21335 0.004240342    25
## 26   not  8865  5812   6253  20930 0.004159848    26
## 27    so  8418  2860   8206  19484 0.003872455    27
## 28  from  7384  7629   4256  19269 0.003829723    28
## 29    me  7147  1365  10031  18543 0.003685430    29
## 30   all  7337  3579   6003  16919 0.003362660    30
## # ... with 1.203e+05 more rows

At this point, it is the word frequency that interests me. I looked at descriptive statistics for this variable. The data is positively skewed and has a high standard deviation and kurtosis. In layman’s terms, there is a big pileup of words with low frequencies, and a relatively small number of words accounts for a large portion of the sample. We’ll do more work on that as we start a statistical analysis.

Below is a histogram of word frequencies for all words in the sample that shows this pile-up.

summary(word_freq$freq)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00      1.00      1.00     41.82      5.00 239600.00
describe(word_freq$freq)
##    vars      n  mean      sd median trimmed mad min    max  range   skew
## X1    1 120299 41.82 1174.49      1    3.39   0   1 239607 239606 118.94
##    kurtosis   se
## X1 18876.18 3.39
g <- ggplot(word_freq, aes(pct)) 
g <- g + geom_histogram(bins=20, fill="blue", na.rm=TRUE) 
g <- g + xlim(NA,0.0009)
g

Interestingly, this “piled-up” distribution of words in a corpus of text is common and is predicted by Zipf’s Law. This law says that the frequency of any word in a corpus of words is inversely proportional to its rank. So, the most frequent word occurs about twice as often as the next most frequent word, three times as often as the third most frequent word, and so forth. A plot of term frequency vs. rank in our sample demonstrates this.

 g <- ggplot(word_freq, aes(x=rank, y=pct)) 
 g <- g + geom_line(size = 1.5, color="darkseagreen")
 g <- g + scale_x_log10() + scale_y_log10()
 g <- g + xlab("log10 Rank") + ylab("log10 Word Frequency")
 g
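
As a rough numeric check against the frequency table above, dividing the count of the top-ranked word by the counts of the words ranked second and third should give values near 2 and 3 if Zipf's Law held exactly:

# Ratio of the most frequent word's count to the next two most frequent;
# Zipf's Law predicts values near 2 and 3 (this sample gives roughly 1.7 and 2.0)
word_freq$freq[1] / word_freq$freq[2:3]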

As I move forward, I will complete a similar analysis for n-grams in the data set to see if they follow a similar pattern. The unnest_tokens function in tidytext has this capability, so the analysis itself is much the same and is omitted for brevity; the tokenization step is sketched below.
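
For reference, the bigram version of the tokenization step is only a small change to the unnest_tokens call; bigram_words is a placeholder name I will reuse in the modeling sketch further down.

# Tokenize the sample text to bigrams instead of single words and count them
bigram_words <- sample_text %>%
                unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                count(bigram, sort = TRUE)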

Statistical modeling

My plan for moving forward is to determine the smallest set of words and n-grams needed to cover a given percentage of the language by calculating a cumulative distribution function for word and n-gram frequency. This will allow me to remove many low-frequency words from the data and focus on the specific words and phrases that occur most frequently. Then, I want an efficient way to calculate the probability of a number of possible next words, across the remaining set of common words.
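
A first sketch of that coverage calculation, using the word_freq table built above (the 90% threshold is just an example):

# Cumulative distribution of word frequency: how many distinct words are
# needed to cover 90% of all word occurrences in the sample?
coverage <- word_freq %>%
            arrange(rank) %>%
            mutate(cum_pct = cumsum(pct))
which(coverage$cum_pct >= 0.9)[1]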

Predictive modeling

There are several models that can be used to effectively model the data. There has been a suggestion that Markov chains may provide an efficient way to store and use the data, and this is my default plan; however, I want to do some additional reading to see what new techniques may be available.
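
To make the Markov chain idea concrete, here is one possible shape for the lookup, assuming a bigram count table like the one sketched in the exploratory section. predict_next, w1 and w2 are placeholder names, not part of the analysis above, and this is only a sketch, not the final model.

# Hedged sketch of a first-order Markov (bigram) lookup: given the last word
# typed, return the k most frequent words that follow it in the sample
bigram_counts <- bigram_words %>%
                 separate(bigram, into = c("w1", "w2"), sep = " ")

predict_next <- function(last_word, k = 3) {
        top <- bigram_counts %>%
               filter(w1 == last_word) %>%
               arrange(desc(n)) %>%
               head(k)
        top$w2
}

predict_next("in")   # the k words most often typed after "in" in this sample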

Creative exploration

As a creative direction, I want the product to be intuitive, easy to use, responsive to input from the user, and interesting. I would also like to collect some usage statistics as it is used, to try to improve the model for a specific user.

Creating a data product

I will use Shiny to build the data product, as described above.

Creating a short slide deck pitching your product

I have a few ideas for a product pitch; these remain in development for now.