Summary

The goal of this project is to build a predictive text algorithm that could run on a phone. Given a sequence of words typed so far, the algorithm will try to predict the next word. A database with millions of text entries from news articles, blogs, and Twitter will be used to train the algorithm.

This report is an initial review of the training data, with a summary of its important features and ideas for further development.


Data Processing

These code chunks load the required R packages, download the data, and read it into R.

knitr::opts_chunk$set(echo = TRUE, comment=NA)

library(knitr)
library(readr)
library(tm)
library(corpus)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)

opts_knit$set(root.dir = 'H:\\Courses\\Coursera\\Capstone')
# download files
course.data.url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
course.file.name <- 'SwiftKey.zip'
download.file(course.data.url, course.file.name)

unzip(course.file.name)
# read data files
twitter <- read_lines('final\\en_US\\en_US.twitter.txt')
news <- read_lines('final\\en_US\\en_US.news.txt')
blogs <- read_lines('final\\en_US\\en_US.blogs.txt')

 

Basic Data Summary

The number of entries and the approximate number of words in each data source are reported below. The exact definition of a word can vary; here a word is any run of letters, digits, or underscores (the \\w+ pattern used below), separated by spaces or punctuation.
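
A small illustration of how that pattern behaves (not used in the analysis below): contractions and hyphenated words count as more than one word.

# illustration only: how the \\w+ pattern counts words
str_count("don't", "\\w+")            # 2 ("don" and "t")
str_count("state-of-the-art", "\\w+") # 4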

# number of blog entries
length(blogs)
[1] 899288
# approximate number of words in all blog entries
sum(str_count(blogs, "\\w+"))
[1] 38309620
# number of news entries
length(news)
[1] 1010242
# approximate number of words in all news entries
sum(str_count(news, "\\w+"))
[1] 35624454
# number of twitter entries
length(twitter)
[1] 2360148
# approximate number of words in all twitter entries
sum(str_count(twitter, "\\w+"))
[1] 31003501

 

More Data Processing

I’m going to start with a random sample of 10% of the data. That should be enough to summarize the major features while using far less time and memory. Later, when I develop the prediction algorithm, I will compare its performance on different fractions of the data to determine how much is enough.
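
To make that later comparison easier, the sampling and filtering steps below could be wrapped in a helper that takes the fraction as an argument. This is only a sketch (build_sample_corpus is an illustrative name and is not used in the rest of this report):

# sketch: rebuild the filtered corpus for an arbitrary sample fraction
build_sample_corpus <- function(frac, seed = 10) {
    set.seed(seed)
    sampled <- c(sample(blogs,   floor(length(blogs) * frac)),
                 sample(news,    floor(length(news) * frac)),
                 sample(twitter, floor(length(twitter) * frac)))
    corp <- as_corpus_frame(sampled)
    text_filter(corp) <- text_filter(drop_number = TRUE,
                                     drop_punct = TRUE,
                                     drop_symbol = TRUE)
    corp
}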

# create random sample with 10% of data
set.seed(10)
blog.sample <- sample(blogs, size = floor(length(blogs) * 0.1))
news.sample <- sample(news, size = floor(length(news) * 0.1))
twitter.sample <- sample(twitter, size = floor(length(twitter) * 0.1))
combined.sample <- c(blog.sample, news.sample, twitter.sample)

# create corpus
text.corpus <- as_corpus_frame(combined.sample)

# text.tm <- Corpus(VectorSource(combined.sample),
#               readerControl = list(language = "en_US"))


# define filter
text_filter(text.corpus) <- text_filter(drop_number = TRUE,
                                        drop_punct = TRUE,
                                        drop_symbol = TRUE)
# clean up large objects
rm(blog.sample, news.sample, twitter.sample, 
   blogs, news, twitter, 
   combined.sample)

Single Tokens

The distribution of single tokens (words) across the sampled entries is summarized below.
- The 20 most frequent tokens are the most common English words; no surprise.
- There are nearly 10 million tokens in total and about 195,000 unique tokens.
- However, the 139 most frequent tokens account for 50% of all occurrences, and 90% coverage requires only 7,668 tokens, about 4% of the unique tokens.
- About 30,000 tokens occur more than 10 times in this random sample.

# single token stats
text.stats <- term_stats(text.corpus)
text.stats <- text.stats %>% arrange(desc(count))

# top 20 single tokens
ggplot(data = text.stats[1:20, ], aes(x = reorder(term, -count), y = count)) +
    geom_col() +
    scale_y_continuous(labels = comma) +
    labs(x = "term", y = "count")

# number of tokens
token.count <- sum(text.stats$count)
token.count
[1] 9932894
# number of unique tokens
nrow(text.stats)
[1] 195181
# cumulative proportion of token occurrences covered, from most to least frequent
cum.prop <- cumsum(text.stats$count) / token.count

# number & proportion of tokens needed for 50% coverage
sum(cum.prop < .5)
[1] 139
sum(cum.prop < .5)/nrow(text.stats)
[1] 0.0007121595
# number & proportion of tokens needed for 90% coverage
sum(cum.prop < .9)
[1] 7668
sum(cum.prop < .9)/nrow(text.stats)
[1] 0.03928661
# number & proportion of tokens needed for 99% coverage
sum(cum.prop < .99)
[1] 95852
sum(cum.prop < .99)/nrow(text.stats)
[1] 0.4910929
# plot frequencies vs rank
plot(log10(text.stats$count[1:30000]), 
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 tokens")
grid()
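
The coverage sums above all follow the same pattern, so they could be wrapped in a small helper and reused for the n-gram tables below. This is just a sketch (coverage_count is an illustrative name; the figures reported in this document were computed directly as shown above):

# sketch: number of top-ranked terms needed to cover proportion p of all occurrences
# (assumes the counts are sorted in decreasing order, as in text.stats)
coverage_count <- function(counts, p) {
    sum(cumsum(counts) / sum(counts) < p)
}

# e.g. coverage_count(text.stats$count, 0.5) reproduces the 50% figure above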


Two-grams

Two-grams (bigrams) are ordered pairs of adjacent tokens. As expected, the total number of 2-grams is slightly lower than the total number of single tokens: an entry with n tokens contributes n-1 two-grams and n-2 three-grams. Below I report the 20 most common 2-grams, along with samples of moderately common and rare 2-grams.
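
As a quick sanity check on that bookkeeping, a single six-token sentence contains five 2-grams (a toy example, separate from the sampled corpus):

# toy check: a 6-token entry yields 5 two-grams in total
toy <- as_corpus_frame("the cat sat on the mat")
sum(term_stats(toy)$count)              # 6 tokens
sum(term_stats(toy, ngrams = 2)$count)  # 5 two-grams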

# 2-gram stats
text.stats.2gram <- term_stats(text.corpus, ngrams = 2)
text.stats.2gram <- text.stats.2gram %>% arrange(desc(count))

# top 20 2-grams
text.stats.2gram[1:20,]
       term count support
1    of the 43036   35400
2    in the 41296   35586
3    to the 21420   19605
4   for the 19936   18854
5    on the 19461   18061
6     to be 15950   14641
7    at the 14243   13364
8   and the 12476   11633
9      in a 11982   11324
10 with the 10519    9993
11     is a 10088    9511
12   it was  9573    8541
13    for a  9532    9143
14   i have  8788    7925
15 from the  8679    8158
16    i was  8660    7340
17    and i  8327    7693
18    it is  8118    7262
19   with a  8018    7610
20     of a  8007    7607
# sample of less frequent 2-grams
text.stats.2gram[10000:10020,]
               term count support
10000    in reality    90      89
10001     intend to    90      89
10002      it hurts    90      89
10003        job as    90      89
10004     just fine    90      89
10005      life has    90      89
10006      milk and    90      89
10007     museum of    90      89
10008   my facebook    90      89
10009  news release    90      89
10010    of someone    90      89
10011       on sept    90      89
10012      plan and    90      89
10013      right up    90      89
10014     rights to    90      89
10015        say in    90      89
10016 say something    90      89
10017   smells like    90      89
10018   stories and    90      89
10019      the hour    90      89
10020  the northern    90      89
# sample of rare 2-grams
text.stats.2gram[1000000:1000020,]
                      term count support
1000000        clinton ran     1       1
1000001     clinton reeves     1       1
1000002    clinton reunion     1       1
1000003    clinton running     1       1
1000004  clinton secretary     1       1
1000005  clinton separated     1       1
1000006    clinton shaking     1       1
1000007     clinton signed     1       1
1000008  clinton similarly     1       1
1000009       clinton sort     1       1
1000010  clinton steadfast     1       1
1000011    clinton support     1       1
1000012      clinton taken     1       1
1000013      clinton takes     1       1
1000014    clinton thanked     1       1
1000015       clinton then     1       1
1000016      clinton there     1       1
1000017       clinton they     1       1
1000018  clinton townships     1       1
1000019   clinton treasury     1       1
1000020 clinton trumpeting     1       1
# number of 2-grams
gram2.count <- sum(text.stats.2gram$count)
gram2.count
[1] 9506132
# number of unique 2-grams
nrow(text.stats.2gram)
[1] 2607799
# ratio of unique 2-grams to unique tokens
nrow(text.stats.2gram)/nrow(text.stats)
[1] 13.36093
# cumulative proportion of 2-gram occurrences covered, from most to least frequent
cum.prop.2gram <- cumsum(text.stats.2gram$count) / gram2.count

# number & proportion of 2-grams needed for 50% coverage
sum(cum.prop.2gram < .5)
[1] 38238
sum(cum.prop.2gram < .5)/nrow(text.stats.2gram)
[1] 0.01466294
# number & proportion of 2-grams needed for 90% coverage
sum(cum.prop.2gram < .9)
[1] 1657185
sum(cum.prop.2gram < .9)/nrow(text.stats.2gram)
[1] 0.6354727
# plot frequencies vs rank
plot(log10(text.stats.2gram$count[1:30000]), 
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 2-grams")
grid()
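
The coverage_count helper sketched in the single-token section would reproduce these figures in one line each, for example:

# same 2-gram coverage figures via the helper sketched earlier
coverage_count(text.stats.2gram$count, 0.5)
coverage_count(text.stats.2gram$count, 0.9)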


Three-grams

There are more than 6 million unique 3-grams in the sample, about 31 times the number of unique single tokens. However, only about 2,000 of them appear more than 100 times (a one-line check is sketched after the code below).

# 3-gram stats
text.stats.3gram <- term_stats(text.corpus, ngrams = 3)
text.stats.3gram <- text.stats.3gram %>% arrange(desc(count))

# top 20 3-grams
text.stats.3gram[1:20,]
                 term count support
1          one of the  3407    3318
2            a lot of  3012    2839
3      thanks for the  2382    2380
4             to be a  1786    1753
5         going to be  1688    1616
6           i want to  1534    1439
7          the end of  1454    1419
8          out of the  1417    1387
9            it was a  1400    1371
10         as well as  1386    1351
11        some of the  1369    1325
12         be able to  1302    1256
13        part of the  1242    1209
14           i have a  1220    1196
15          i have to  1158    1131
16       i don't know  1125    1094
17 looking forward to  1095    1089
18        the rest of  1080    1066
19     the first time  1042    1018
20        is going to  1040    1016
# sample of less frequent 3-grams
text.stats.3gram[10000:10020,]
                     term count support
10000          he is also    35      33
10001         i spent the    35      33
10002         is a matter    35      33
10003         no time for    35      33
10004   not talking about    35      33
10005         stir in the    35      33
10006      that's part of    35      33
10007       the store and    35      33
10008        to make room    35      33
10009      whether it was    35      33
10010     worry about the    35      33
10011         you wish to    35      33
10012         do not need    35      32
10013     the same person    35      32
10014       the time when    35      32
10015         to the site    35      32
10016          what he is    35      32
10017         you were to    35      32
10018        of the brain    35      31
10019         at the site    35      25
10020 a great opportunity    34      34
# sample of rare 3-grams
text.stats.3gram[1000000:1000020,]
                                               term count support
1000000                                   an i have     1       1
1000001                                   an i hope     1       1
1000002                                     an i in     1       1
1000003                                   an i just     1       1
1000004                                   an i need     1       1
1000005                                      an i o     1       1
1000006                                   an i penn     1       1
1000007                                    an i pod     1       1
1000008                                an i promise     1       1
1000009                                   an i said     1       1
1000010                                  an i still     1       1
1000011                                    an i the     1       1
1000012                                  an i voted     1       1
1000013                                   an i will     1       1
1000014                                  an i won't     1       1
1000015                                 an i'll put     1       1
1000016 an i'll-show-you-dadgum-shark-varmints mode     1       1
1000017                              an i'm missing     1       1
1000018                                   an i'm so     1       1
1000019                                an i'm sorry     1       1
1000020                              an i-beam from     1       1
# number of 3-grams
gram3.count <- sum(text.stats.3gram$count)
gram3.count
[1] 9082158
# number of unique 3-grams
nrow(text.stats.3gram)
[1] 6157033
# ratio of unique 3-grams to unique tokens
nrow(text.stats.3gram)/nrow(text.stats)
[1] 31.54525
# plot frequencies vs rank
plot(log10(text.stats.3gram$count[1:30000]), 
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 3-grams")
grid()
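
The claim above about heavily used 3-grams can be checked directly from the frequency table (the exact figure is not reported here):

# number of unique 3-grams that appear more than 100 times in the sample
sum(text.stats.3gram$count > 100)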

Final notes

The full database contains more than 100 million words, but a small fraction of it is likely enough to capture the most relevant features. For the final product, a time-saving solution might be to precompute the next-word prediction for the top 200 or so two-grams and store it in a table for fast lookup, processing the entire sample only for the less common two-grams.
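
A rough sketch of that idea, treating each 2-gram as a two-word prefix and using the 3-gram counts above to find its most frequent continuation (prefix.stats, prefix.table, and the lookup at the end are illustrative names; a real product would need a fallback for prefixes that are not in the table):

# sketch: most likely next word for each two-word prefix, from the 3-gram counts
prefix.stats <- data.frame(prefix = word(text.stats.3gram$term, 1, 2),
                           next.word = word(text.stats.3gram$term, 3),
                           count = text.stats.3gram$count,
                           stringsAsFactors = FALSE)

# keep the single most frequent continuation per prefix, then the 200 most
# frequent prefixes overall
prefix.table <- prefix.stats %>%
    group_by(prefix) %>%
    filter(count == max(count)) %>%
    slice(1) %>%
    ungroup() %>%
    arrange(desc(count)) %>%
    head(200)

# example lookup: most likely word after "one of" (NA if the prefix is not stored)
prefix.table$next.word[match("one of", prefix.table$prefix)]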