This is a progress update on building a next-word predictive model.
Input Data Exploration and Assessment
The input training data was downloaded from the Coursera SwiftKey zip file and extracted.
pwd
ls -lh final/en_US/
echo "Count lines"
wc -l final/en_US/en_US.blogs.txt
wc -l final/en_US/en_US.news.txt
wc -l final/en_US/en_US.twitter.txt
echo "Count words"
wc -w final/en_US/en_US.blogs.txt
wc -w final/en_US/en_US.news.txt
wc -w final/en_US/en_US.twitter.txt
## /Users/grudy/dev/datasciencecoursera/capstone
## total 1184280
## -rw-r--r-- 1 grudy staff 22M Dec 19 20:03 blogs-1k.txt
## -rw-r--r-- 1 grudy staff 200M Jul 22 2014 en_US.blogs.txt
## -rw-r--r-- 1 grudy staff 196M Jul 22 2014 en_US.news.txt
## -rw-r--r-- 1 grudy staff 159M Jul 22 2014 en_US.twitter.txt
## -rw-r--r-- 1 grudy staff 3.5K Dec 26 15:00 filtered_words.txt
## Count lines
## 899288 final/en_US/en_US.blogs.txt
## 1010242 final/en_US/en_US.news.txt
## 2360148 final/en_US/en_US.twitter.txt
## Count words
## 37334690 final/en_US/en_US.blogs.txt
## 34372720 final/en_US/en_US.news.txt
## 30374206 final/en_US/en_US.twitter.txt
The three training files range from 159 MB to 200 MB in size, from roughly 0.9M to 2.4M lines of text, and from 30M to 37M words.
Let's find the longest line in each of these files to get a sense of their structure.
library(quanteda)
setwd('~/dev/datasciencecoursera/capstone/final/en_US')
for (file in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
    longestLine <- 0
    con <- file(file, "r")
    # Read one line at a time and track the longest line seen so far
    while (length(l <- readLines(con, n = 1, warn = FALSE)) > 0) {
        longestLine <- max(nchar(l), longestLine)
    }
    print(c("File"=file, 'Longest Line Size'=longestLine))
    close(con)
}
## File Longest Line Size
## "en_US.blogs.txt" "40833"
## File Longest Line Size
## "en_US.news.txt" "11384"
## File Longest Line Size
## "en_US.twitter.txt" "140"
Using the quanteda library, we tokenize the text with its built-in options to remove numbers and punctuation. The words are lowercased, profanity is filtered out against a pre-built list of filtered words, and a hash is used to count word frequencies. Note that only the first line of each 1000-line chunk read is tokenized, so the counts below reflect roughly a 0.1% sample of each file rather than the full corpus.
library(quanteda)
library(hash)
setwd('~/dev/datasciencecoursera/capstone/final/en_US')

# Load the profanity list into a hash for fast lookup
con <- file("filtered_words.txt", "r")
filtered_lines <- readLines(con, warn=FALSE)
filtered <- hash(filtered_lines, 1:length(filtered_lines))
close(con)

filteredWords <- 0
wordHash <- hash()
for (file in c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")) {
    con <- file(file, "r")
    # Read the file in 1000-line chunks; only the first line of each chunk is
    # tokenized below (words[[1]]), i.e. roughly a 0.1% sample of each file
    while (length(l <- readLines(con, n = 1000, warn = FALSE)) > 0) {
        words <- tokenize(l, what="fasterword", removeTwitter = TRUE,
                          removePunct = TRUE, removeNumbers = TRUE)
        for (w in words[[1]]) {
            w <- toLower(w)
            if (!is.null(filtered[[w]])) {
                # Word is on the profanity list: count it but do not keep it
                filteredWords <- filteredWords + 1
            } else {
                # Increment this word's running frequency count
                f <- wordHash[[w]]
                if (is.null(f))
                    wordHash[[w]] <- 1
                else
                    wordHash[[w]] <- f + 1
            }
        }
    }
    print(c("File"=file, '# unique words'=length(wordHash), '# filtered words'=filteredWords))
    # Look at two words in particular as a sentinel (and to answer a quiz question)
    print(c("#Loves"=wordHash[["love"]], '#Hates'= wordHash[["hate"]], 'Love/Hate'=wordHash[["love"]]/wordHash[["hate"]]))
    close(con)
}
## File # unique words # filtered words
## "en_US.twitter.txt" "6301" "123"
## #Loves #Hates Love/Hate
## 108.0 20.0 5.4
## File # unique words # filtered words
## "en_US.blogs.txt" "10930" "158"
## #Loves #Hates Love/Hate
## 157.000000 22.000000 7.136364
## File # unique words # filtered words
## "en_US.news.txt" "15039" "166"
## #Loves #Hates Love/Hate
## 166.000000 22.000000 7.545455
Let's plot the ranked word frequencies to get a sense of the coverage provided by the top N words.
library(dplyr)
library(ggplot2)
# Collapse the word hash into a data frame sorted by descending frequency
l <- as.list(wordHash)
df <- data.frame(words=names(l), frequencies=unlist(l)) %>% arrange(desc(frequencies))
qplot(1:length(frequencies), log10(frequencies), data=df, main = "Ranked frequencies of observed words")
It's clear there is a very stark drop-off: the majority of words are singletons (occurring only once), and only 1075 words occur more than 10 times.
head(df, n=25)
## words frequencies
## 1 the 4765
## 2 to 2765
## 3 a 2429
## 4 and 2333
## 5 of 1976
## 6 in 1679
## 7 i 1624
## 8 for 1055
## 9 is 1052
## 10 that 1046
## 11 you 923
## 12 it 905
## 13 on 790
## 14 with 716
## 15 my 642
## 16 was 607
## 17 be 600
## 18 at 578
## 19 this 571
## 20 as 569
## 21 are 509
## 22 have 498
## 23 but 484
## 24 he 429
## 25 we 401
# How many words have frequency over 1000, 100, 10 and 1
nrow(subset(df, frequencies > 1000))
## [1] 10
nrow(subset(df, frequencies > 100))
## [1] 121
nrow(subset(df, frequencies > 10))
## [1] 1075
nrow(subset(df, frequencies > 1))
## [1] 6286
nrow(df)
## [1] 15039
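To quantify that coverage more directly, here is a short sketch (reusing the df data frame built above; the coverage name is purely illustrative) that computes the cumulative share of all counted word occurrences accounted for by the top-ranked words:
library(dplyr)
# Cumulative share of all counted word occurrences covered by the top-ranked words
coverage <- df %>%
    mutate(rank = row_number(),
           cumulative = cumsum(frequencies) / sum(frequencies))
# Smallest number of top-ranked words needed to cover 50% and 90% of occurrences
min(coverage$rank[coverage$cumulative >= 0.5])
min(coverage$rank[coverage$cumulative >= 0.9])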
Strategy for Model Building
To build a predictive model that chooses the next word for a given input phrase, the following strategy will be employed: tokenize the training text into n-grams and, for each observed (n-1)-word prefix, record the most frequently observed next word in an index.
These steps are repeated with 3-grams, 4-grams, etc. as resources allow. A 3-gram training pass will save which third word is predicted for an observed 2-gram, and so forth; a rough sketch of such a pass follows.
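The sketch below illustrates the counting step of a 3-gram training pass on a toy string; the sampleText and trigramCounts names are placeholders, not the final implementation.
library(quanteda)
library(hash)
# Count how often each (2-word prefix, third word) combination is observed
sampleText <- "the quick brown fox jumps over the quick brown dog"
words <- tolower(tokenize(sampleText, what="fasterword",
                          removePunct = TRUE, removeNumbers = TRUE)[[1]])
trigramCounts <- hash()
for (i in 1:(length(words) - 2)) {
    key <- paste(words[i], words[i + 1], words[i + 2])
    f <- trigramCounts[[key]]
    trigramCounts[[key]] <- if (is.null(f)) 1 else f + 1
}
# After the full training pass, keep only the most frequent third word for each
# 2-word prefix; that (prefix -> predicted word) table becomes the 3-gram index
as.list(trigramCounts)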
The algorithm for applying the model will be:
Use as many trailing words of the input as are provided and as the pre-built model supports, for example a three-word suffix that predicts the fourth word.
Look up the 3-word input suffix in the index. If it is present, use the computed 4th word as the prediction.
If the 3-word input suffix is not in the index, back off and look up the 2-word input suffix, then the 1-word input suffix.
If even the last input word is not in the model, or there is no user input at all, predict the word “the”, as it is the most common word.
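A minimal sketch of this backoff lookup, assuming the n-gram indexes are stored as hashes keyed by the space-joined prefix; predict1/predict2/predict3 and their contents are hypothetical placeholders.
library(hash)
# Hypothetical indexes: each maps a space-joined (n-1)-word prefix to its most frequent next word
predict3 <- hash(c("one of the"), c("most"))
predict2 <- hash(c("of the"), c("year"))
predict1 <- hash(c("the"), c("first"))
predictNextWord <- function(inputWords) {
    n <- length(inputWords)
    if (n > 0) {
        # Try the longest available suffix first, then back off to shorter suffixes
        for (k in min(3, n):1) {
            key <- paste(tail(inputWords, k), collapse = " ")
            index <- list(predict1, predict2, predict3)[[k]]
            w <- index[[key]]
            if (!is.null(w)) return(w)
        }
    }
    # Nothing matched (or no input at all): fall back on the most common word
    "the"
}
predictNextWord(c("one", "of", "the"))   # "most"
predictNextWord(character(0))            # "the"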
The performance of this model can be evaluated by extracting 4-word sequences from known text and predicting the 4th word from the 3-word suffix (or from the largest suffix for which an index has been built).
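A minimal sketch of that evaluation, assuming a predictNextWord function like the one above and a tokenized held-out word vector; heldOutWords is a hypothetical placeholder.
# Accuracy over sliding 4-word windows: predict the 4th word from the preceding 3
evaluateModel <- function(heldOutWords) {
    if (length(heldOutWords) < 4) return(NA)
    hits <- 0
    trials <- 0
    for (i in 1:(length(heldOutWords) - 3)) {
        prediction <- predictNextWord(heldOutWords[i:(i + 2)])
        hits <- hits + (prediction == heldOutWords[i + 3])
        trials <- trials + 1
    }
    hits / trials
}
evaluateModel(c("one", "of", "the", "most", "popular", "of", "the", "year"))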