The goal of this project is to create a predictive text algorithm that could run on a phone. Given a series of words, the algorithm will try to predict the next word. A database of several million text strings from news articles, blogs, and Twitter posts will be used to train the algorithm.
This report is an initial review of the training data, with a summary of its important features and ideas for further development.
These code chunks load the required R packages, download the data, and read it into R.
knitr::opts_chunk$set(echo = TRUE, comment=NA)
library(knitr)
library(readr)
library(tm)
library(corpus)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
opts_knit$set(root.dir = 'H:\\Courses\\Coursera\\Capstone')
# download files
course.data.url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
course.file.name <- 'SwiftKey.zip'
download.file(course.data.url, course.file.name, mode = "wb")  # binary mode so the zip is not corrupted on Windows
unzip(course.file.name)
# read data files
twitter <- read_lines('final\\en_US\\en_US.twitter.txt')
news <- read_lines('final\\en_US\\en_US.news.txt')
blogs <- read_lines('final\\en_US\\en_US.blogs.txt')
The number of entries and the approximate number of words in each data source are reported below. The exact definition of a word can vary; here a word is counted as a run of alphanumeric characters separated by spaces or punctuation.
# number of blog entries
length(blogs)
[1] 899288
# approximate number of words in all blog entries
sum(str_count(blogs, pattern = "\\w+"))
[1] 38309620
# number of news entries
length(news)
[1] 1010242
# approximate number of words in all news entries
sum(str_count(news, pattern = "\\w+"))
[1] 35624454
# number of twitter entries
length(twitter)
[1] 2360148
# approximate number of words in all twitter entries
sum(str_count(twitter, pattern = "\\w+"))
[1] 31003501
I’m going to start with a random sample of 10% of the data. That should be enough to summarize the major features while taking much less time and memory. Later, when I develop the prediction algorithm, I will compare its performance with different fractions of the data to determine how much is enough.
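To make it easy to redo the analysis with other fractions later, a small helper along the following lines could be used; the function name and its defaults are illustrative assumptions, not code used in this report.
# hypothetical helper: draw a reproducible random sample of a given fraction
# of a character vector of text entries
sample.fraction <- function(entries, fraction, seed = 10) {
  set.seed(seed)
  sample(entries, size = floor(length(entries) * fraction))
}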
# create random sample with 10% of data
set.seed(10)
blog.sample <- sample(blogs, size = floor(length(blogs) * 0.1))
news.sample <- sample(news, size = floor(length(news) * 0.1))
twitter.sample <- sample(twitter, size = floor(length(twitter) * 0.1))
combined.sample <- c(blog.sample, news.sample, twitter.sample)
# create corpus
text.corpus <- as_corpus_frame(combined.sample)
# text.tm <- Corpus(VectorSource(combined.sample),
# readerControl = list(language = "en_US"))
# define filter
text_filter(text.corpus) <- text_filter(drop_number = TRUE,
drop_punct = TRUE,
drop_symbol = TRUE)
# clean up large objects
rm(blog.sample, news.sample, twitter.sample,
blogs, news, twitter,
combined.sample)
The distribution of single tokens (words) across the sampled entries is summarized below.
- The 20 most frequent words are the most common English words; no surprise.
- There are nearly 10 million tokens and 200,000 unique tokens.
- However, only 139 unique tokens account for 50% of all token occurrences; 90% coverage requires 7,668 unique tokens, or about 4% of them.
- About 30,000 tokens occur more than 10 times in this random sample.
# single token stats
text.stats <- term_stats(text.corpus)
text.stats <- text.stats %>% arrange(desc(count))
# top 20 single tokens
ggplot(data = text.stats[1:20,], aes(x = reorder(term, -count), y = count)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = comma) +
  labs(x = "term")
# number of tokens
token.count <- sum(text.stats$count)
token.count
[1] 9932894
# number of unique tokens
nrow(text.stats)
[1] 195181
# compute cumulative proportion of token counts (ordered most to least frequent)
cum.prop <- cumsum(text.stats$count) / token.count
# number & proportion of tokens needed for 50% coverage
sum(cum.prop < .5)
[1] 139
sum(cum.prop < .5)/nrow(text.stats)
[1] 0.0007121595
# number & proportion of tokens needed for 90% coverage
sum(cum.prop < .9)
[1] 7668
sum(cum.prop < .9)/nrow(text.stats)
[1] 0.03928661
# number & proportion of tokens needed for 99% coverage
sum(cum.prop < .99)
[1] 95852
sum(cum.prop < .99)/nrow(text.stats)
[1] 0.4910929
# plot frequencies vs rank
plot(log10(text.stats$count[1:30000]),
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 tokens")
grid()
2-grams are pairs of adjacent tokens. As expected, the total number of 2-grams is slightly lower than the total number of single tokens, since an entry with n tokens contains n-1 2-grams (and n-2 3-grams). Twenty 2-grams each from the most common, moderately common, and rare parts of the frequency distribution are shown below. Other notes:
- There are about 9.5 million 2-grams in total and 2.6 million unique 2-grams, roughly 13 times the number of unique single tokens.
- Coverage accumulates much more slowly than for single tokens: 50% coverage requires about 38,000 unique 2-grams (1.5%), and 90% coverage requires about 1.7 million (64%).
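As a quick illustration of the n-1 relationship (the sentence here is made up for the example, not drawn from the corpus), term_stats() can be applied to a single short entry:
# toy example: an entry with six tokens yields five 2-grams and four 3-grams
toy.entry <- "the cat sat on the mat"
sum(term_stats(toy.entry, ngrams = 2)$count)  # 5 = 6 - 1
sum(term_stats(toy.entry, ngrams = 3)$count)  # 4 = 6 - 2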
# 2-gram stats
text.stats.2gram <- term_stats(text.corpus, ngrams = 2)
text.stats.2gram <- text.stats.2gram %>% arrange(desc(count))
# top 20 2-grams
text.stats.2gram[1:20,]
term count support
1 of the 43036 35400
2 in the 41296 35586
3 to the 21420 19605
4 for the 19936 18854
5 on the 19461 18061
6 to be 15950 14641
7 at the 14243 13364
8 and the 12476 11633
9 in a 11982 11324
10 with the 10519 9993
11 is a 10088 9511
12 it was 9573 8541
13 for a 9532 9143
14 i have 8788 7925
15 from the 8679 8158
16 i was 8660 7340
17 and i 8327 7693
18 it is 8118 7262
19 with a 8018 7610
20 of a 8007 7607
# sample of less frequent 2-grams
text.stats.2gram[10000:10020,]
term count support
10000 in reality 90 89
10001 intend to 90 89
10002 it hurts 90 89
10003 job as 90 89
10004 just fine 90 89
10005 life has 90 89
10006 milk and 90 89
10007 museum of 90 89
10008 my facebook 90 89
10009 news release 90 89
10010 of someone 90 89
10011 on sept 90 89
10012 plan and 90 89
10013 right up 90 89
10014 rights to 90 89
10015 say in 90 89
10016 say something 90 89
10017 smells like 90 89
10018 stories and 90 89
10019 the hour 90 89
10020 the northern 90 89
# sample of rare 2-grams
text.stats.2gram[1000000:1000020,]
term count support
1000000 clinton ran 1 1
1000001 clinton reeves 1 1
1000002 clinton reunion 1 1
1000003 clinton running 1 1
1000004 clinton secretary 1 1
1000005 clinton separated 1 1
1000006 clinton shaking 1 1
1000007 clinton signed 1 1
1000008 clinton similarly 1 1
1000009 clinton sort 1 1
1000010 clinton steadfast 1 1
1000011 clinton support 1 1
1000012 clinton taken 1 1
1000013 clinton takes 1 1
1000014 clinton thanked 1 1
1000015 clinton then 1 1
1000016 clinton there 1 1
1000017 clinton they 1 1
1000018 clinton townships 1 1
1000019 clinton treasury 1 1
1000020 clinton trumpeting 1 1
# number of 2-grams
gram2.count <- sum(text.stats.2gram$count)
gram2.count
[1] 9506132
# number of unique 2-grams
nrow(text.stats.2gram)
[1] 2607799
# ratio of unique 2-grams to unique tokens
nrow(text.stats.2gram)/nrow(text.stats)
[1] 13.36093
# compute cumulative proportion of 2-gram counts (ordered most to least frequent)
cum.prop.2gram <- cumsum(text.stats.2gram$count) / gram2.count
# number & proportion of unique 2-grams needed for 50% coverage
sum(cum.prop.2gram < .5)
[1] 38238
sum(cum.prop.2gram < .5)/nrow(text.stats.2gram)
[1] 0.01466294
# number & proportion of unique 2-grams needed for 90% coverage
sum(cum.prop.2gram < .9)
[1] 1657185
sum(cum.prop.2gram < .9)/nrow(text.stats.2gram)
[1] 0.6354727
# plot frequencies vs rank
plot(log10(text.stats.2gram$count[1:30000]),
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 2-grams")
grid()
There are more than 6 million unique 3-grams in the sample, about 31 times the number of unique single tokens. However, only about 2,000 of them occur more than 100 times.
# 3-gram stats
text.stats.3gram <- term_stats(text.corpus, ngrams = 3)
text.stats.3gram <- text.stats.3gram %>% arrange(desc(count))
# top 20 3-grams
text.stats.3gram[1:20,]
term count support
1 one of the 3407 3318
2 a lot of 3012 2839
3 thanks for the 2382 2380
4 to be a 1786 1753
5 going to be 1688 1616
6 i want to 1534 1439
7 the end of 1454 1419
8 out of the 1417 1387
9 it was a 1400 1371
10 as well as 1386 1351
11 some of the 1369 1325
12 be able to 1302 1256
13 part of the 1242 1209
14 i have a 1220 1196
15 i have to 1158 1131
16 i don't know 1125 1094
17 looking forward to 1095 1089
18 the rest of 1080 1066
19 the first time 1042 1018
20 is going to 1040 1016
# sample of less frequent 3-grams
text.stats.3gram[10000:10020,]
term count support
10000 he is also 35 33
10001 i spent the 35 33
10002 is a matter 35 33
10003 no time for 35 33
10004 not talking about 35 33
10005 stir in the 35 33
10006 that's part of 35 33
10007 the store and 35 33
10008 to make room 35 33
10009 whether it was 35 33
10010 worry about the 35 33
10011 you wish to 35 33
10012 do not need 35 32
10013 the same person 35 32
10014 the time when 35 32
10015 to the site 35 32
10016 what he is 35 32
10017 you were to 35 32
10018 of the brain 35 31
10019 at the site 35 25
10020 a great opportunity 34 34
# sample of rare 3-grams
text.stats.3gram[1000000:1000020,]
term count support
1000000 an i have 1 1
1000001 an i hope 1 1
1000002 an i in 1 1
1000003 an i just 1 1
1000004 an i need 1 1
1000005 an i o 1 1
1000006 an i penn 1 1
1000007 an i pod 1 1
1000008 an i promise 1 1
1000009 an i said 1 1
1000010 an i still 1 1
1000011 an i the 1 1
1000012 an i voted 1 1
1000013 an i will 1 1
1000014 an i won't 1 1
1000015 an i'll put 1 1
1000016 an i'll-show-you-dadgum-shark-varmints mode 1 1
1000017 an i'm missing 1 1
1000018 an i'm so 1 1
1000019 an i'm sorry 1 1
1000020 an i-beam from 1 1
# number of 3-grams
gram3.count <- sum(text.stats.3gram$count)
gram3.count
[1] 9082158
# number of unique 3-grams
nrow(text.stats.3gram)
[1] 6157033
# ratio of unique 3-grams to unique tokens
nrow(text.stats.3gram)/nrow(text.stats)
[1] 31.54525
# plot frequencies vs rank
plot(log10(text.stats.3gram$count[1:30000]),
     type = "l",
     ylab = "Log(10) frequency",
     xlab = "Rank",
     main = "Frequency of top 30000 3-grams")
grid()
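One way to check the claim above that only about 2,000 unique 3-grams occur more than 100 times is to count them directly from the frequency table:
# number of unique 3-grams occurring more than 100 times in the sample
sum(text.stats.3gram$count > 100)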
There are millions of words in the database, but a small fraction of it is likely enough to capture the most relevant features. For the final product, a time-saving approach might be to precompute the most likely next word for the top 200 or so 2-grams and store the results in a table for fast lookup, processing the full n-gram tables only for the less common 2-grams.
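A rough sketch of that idea, using the 2-gram and 3-gram tables computed above, is shown below; the cutoff of 200 and the object names are illustrative assumptions rather than final design choices. Prefixes that are not in the table would fall back to the full n-gram tables built from the complete sample.
# hypothetical sketch: precompute the most likely next word for the top 200
# 2-grams, using the 3-gram counts (word() splits the n-grams on spaces)
top.2grams <- text.stats.2gram$term[1:200]
next.word.table <- text.stats.3gram %>%
  mutate(prefix = word(as.character(term), 1, 2),        # first two words of each 3-gram
         prediction = word(as.character(term), -1)) %>%  # last word is the candidate next word
  filter(prefix %in% top.2grams) %>%
  group_by(prefix) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(prefix, prediction, count)
# example lookup for a prefix that is certain to be in the table
next.word.table$prediction[next.word.table$prefix == "of the"]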