Predictive text is a staple of modern applications, from composing an SMS message on a mobile phone to writing code in an IDE. When it works correctly, predictive text lets a user compose a message or enter code faster.
This project analyzes a set of three English-language text files with the goal of constructing a model that predicts the word the user will most likely want to enter next.
The general outline of the project will be as follows:
This report will focus on the first three steps in the project outline and will briefly discuss strategy for the last three steps.
Reference: Source Data
The project request stated that the data was provided by HC Corpora. Unfortunately, the website is no longer active and no copy is available through the Internet Archive Wayback Machine. The only information available is what was provided in the project request.
Three text files were provided. The text files were constructed by sampling publicly available news websites, blog websites and posts made on Twitter.
The data sources this project will be working with are limited to the English language, although it is possible non-English text will appear in the source files. It is reasonable to expect some common phrases from non-English languages to be interspersed into the provided documents.
Examples
| file | language | word | lines containing word |
|---|---|---|---|
| en_US.twitter.txt | French | bonjour | 13 |
| en_US.twitter.txt | Italian | ciao | 43 |
| en_US.blogs.txt | German | guten | 1 |
Source Code: Appendix A: Foreign Words
The provided text files are quite large, and the laptop I am using for the project doesn’t have enough physical RAM to hold all of the files in memory, which is a requirement when working with the R programming language. I therefore decided to simplify working with the data by splitting the files into smaller files and storing the interim results in an SQLite database.
The overall processing took the following steps:
First, I split the original files into smaller files using a Linux command and stored them in a separate directory. I then collected the names of the files that were created into a list for processing.
Source Code: Appendix A: List Files
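For readers without access to the Linux `split` command, the same chunking can be sketched in base R. This is only an illustration of the idea, not the method used in this report: `split_file()`, its `prefix` argument, and the demo file are hypothetical names introduced here.

```r
# split_file() is a hypothetical helper; the report itself used the
# Linux `split` command rather than R for this step.
split_file <- function(path, out_dir, lines_per_file = 25000, prefix = "part") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con <- file(path, open = "r")
  on.exit(close(con))
  written <- character(0)
  repeat {
    # read the next block of lines; an empty result means end of file
    block <- readLines(con, n = lines_per_file, warn = FALSE)
    if (length(block) == 0) break
    out_path <- file.path(out_dir,
                          sprintf("%s_%04d.txt", prefix, length(written) + 1))
    writeLines(block, out_path)
    written <- c(written, out_path)
  }
  written
}

# demo on a small temporary file: 10 lines, 4 lines per chunk
demo_file <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), demo_file)
parts <- split_file(demo_file, file.path(tempdir(), "split_demo"),
                    lines_per_file = 4)
print(basename(parts))
```

Because only one block of lines is held in memory at a time, the approach should scale to files that do not fit in RAM, at the cost of being slower than the native command.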
A key step when analyzing a corpus of text documents is breaking each document into n-grams.
An n-gram is a sequence of consecutive words, with the size indicating how many words are combined. An n-gram of size 1 is an individual word from a sentence, while an n-gram of size 2 is a pair of consecutive words in a sentence. For example, the sentence “Where is the car?” contains three n-grams of size 2: “Where is”, “is the”, and “the car”.
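To make the definition concrete, here is a minimal base R sketch that builds the n-grams of one sentence. `make_ngrams()` is a hypothetical helper written for this illustration; it is not part of any package used in the project, and it lower-cases the text and strips punctuation as a simplifying assumption.

```r
# make_ngrams() is a hypothetical helper, shown only to illustrate
# the definition of an n-gram.
make_ngrams <- function(sentence, n) {
  # lower-case, strip punctuation, and split on whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(sentence))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  # one n-gram starts at each position that leaves room for n words
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("Where is the car?", 2)
print(bigrams)
# [1] "where is" "is the"   "the car"
```

A sentence of k words yields k - n + 1 n-grams of size n, which is why the four-word example produces three n-grams of size 2.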
I used the quanteda package to process the text files. The package includes functions that help with parsing text. I used the following quanteda options while creating the n-grams.
Remove Stop Words. The package removes some of the most common words in the English language.
Additionally, I selected the options to remove numbers, punctuation marks, symbols, separators, Twitter symbols, hyphens, and web addresses from the text documents.
Source Code: Appendix A: Create N-Grams
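The effect of these cleaning options can be approximated with a few regular expressions in base R. The sketch below is a rough stand-in for what the package does, not its implementation: `clean_tokens()` and `demo_stopwords` are hypothetical names, the stop-word list is deliberately tiny, and stripping everything except `a-z` and apostrophes is a simplification that would also drop accented letters.

```r
# A tiny stop-word list for the demo; the real English list is far longer.
demo_stopwords <- c("the", "is", "a", "at", "to", "and")

# clean_tokens() is a hypothetical helper that mimics the cleaning options.
clean_tokens <- function(text, stop_words) {
  text <- tolower(text)
  # drop web addresses before punctuation is stripped
  text <- gsub("(https?://|www\\.)\\S+", " ", text)
  # replace numbers, punctuation, symbols (including @ and #) and hyphens
  text <- gsub("[^a-z' ]", " ", text)
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  # remove stop words and any empty strings
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

cleaned <- clean_tokens(
  "The car is at http://example.com, see @driver at 10:30!",
  demo_stopwords
)
print(cleaned)
# [1] "car"    "see"    "driver"
```

The order of the steps matters: removing web addresses first prevents fragments such as "http" and "com" from surviving as tokens once the punctuation is gone.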
I created a local SQLite database file on my laptop and used the DBI package to connect to the database and run the queries.
The queries take the n-gram data frame created in the previous step and store the results in the database.
First, any existing rows in the temporary table are deleted. Next, the entire contents of the data frame are written to the temporary table; this is faster than inserting directly into the main table. The results are then merged from the temporary table into the n-gram one table, updating the frequency count of each n-gram that already exists. Lastly, any n-grams that are in the temporary table but not yet in the n-gram one table are appended.
Source Code: Appendix A: Write to Database
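The update-then-append logic is easier to follow on a small example. The base R sketch below reproduces the semantics of the two queries on in-memory data frames; it is an illustration only, since the project performs these steps in SQL. `merge_counts()` and the toy counts are hypothetical, and the sketch assumes each n-gram appears at most once per chunk.

```r
# Running totals already stored (stands in for the n-gram one table).
totals <- data.frame(ngram = c("said", "just", "one"),
                     frequency = c(10, 7, 5),
                     stringsAsFactors = FALSE)
# Counts from the chunk just processed (stands in for the temporary table).
chunk <- data.frame(ngram = c("just", "love"),
                    frequency = c(3, 2),
                    stringsAsFactors = FALSE)

# merge_counts() is a hypothetical helper mirroring the two SQL steps.
merge_counts <- function(totals, chunk) {
  # step 1: add the chunk counts to n-grams that already exist
  hit <- match(totals$ngram, chunk$ngram)
  found <- !is.na(hit)
  totals$frequency[found] <- totals$frequency[found] +
    chunk$frequency[hit[found]]
  # step 2: append n-grams that are new in this chunk
  new_rows <- chunk[!(chunk$ngram %in% totals$ngram), , drop = FALSE]
  merged <- rbind(totals, new_rows)
  rownames(merged) <- NULL
  merged
}

merged <- merge_counts(totals, chunk)
print(merged)
```

Here “just” goes from 7 to 10, “love” is appended with a count of 2, and the other rows are untouched, which is the same outcome the merge and append queries produce in the database.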
Connect to the database and store the results in a data frame for processing.
library(DBI)
db_location <- "./dbases/nlp_dbase_all_1ngrams.sqlite"
# --------------------------------------------------------------------
# results
# --------------------------------------------------------------------
read_results_db <- function()
{
full_table_query <- "SELECT n.ngram ,n.frequency ,n.frequency_relative
FROM ngram_one_all AS n
;"
# connect database
sqldb <- dbConnect(RSQLite::SQLite(), db_location)
#get the table
sql_result <- dbGetQuery(sqldb, full_table_query)
#close database
dbDisconnect(sqldb)
return(sql_result)
}
# read the results from the database
df_ngram_one <- read_results_db()
Top 20 by ranking
Below is a list of the twenty most frequent words in the corpus, excluding the aforementioned stop words. The table shows the overall frequency and the relative frequency of each word.
It is interesting to note that the most common word was “said”.
# rank the results
library(dplyr)
# assign ranking to values (1 = most frequent)
df_ngram_one <- df_ngram_one %>% mutate(ranking = rank(-frequency))
# group the rankings
df_ngram_one <- df_ngram_one %>% mutate(rank_group = ntile(-frequency, 100))
# top 20 by frequency
head(df_ngram_one %>% arrange(desc(frequency)), 20)
## ngram frequency frequency_relative ranking rank_group
## 1 said 302147 0.005582246 1 1
## 2 just 296360 0.005475329 2 1
## 3 one 279613 0.005165924 3 1
## 4 like 261949 0.004839577 4 1
## 5 can 238595 0.004408106 5 1
## 6 get 220730 0.004078045 6 1
## 7 time 207876 0.003840564 7 1
## 8 new 189995 0.003510208 8 1
## 9 good 175344 0.003239527 9 1
## 10 now 174982 0.003232839 10 1
## 11 day 164503 0.003039236 11 1
## 12 know 158355 0.002925651 12 1
## 13 love 157296 0.002906085 13 1
## 14 people 154061 0.002846318 14 1
## 15 back 137639 0.002542917 15 1
## 16 go 136615 0.002523998 16 1
## 17 see 134978 0.002493754 17 1
## 18 first 130574 0.002412389 18 1
## 19 make 126315 0.002333703 19 1
## 20 also 125836 0.002324853 20 1
Cumulative Frequency
In the previous step I separated the words into 100 groups; i.e. one percent per group.
Next I calculated the cumulative frequency by each group.
# relative frequency per ranking group, with a running (cumulative) total
ngram_group <- df_ngram_one %>%
  group_by(rank_group) %>%
  summarise(count = n()
           ,freq_rel = sum(frequency_relative)
  ) %>%
  mutate(freq_c = cumsum(freq_rel))
Cumulative Frequency by Group
The plot below shows the cumulative relative frequency against the percentage of unique words. It shows that only 5% of the unique words account for almost 95% of all word occurrences in the corpus.
In other words, approximately 44,300 words represent approximately 95% of the words in the corpus text.
# plot results
library(ggplot2)
ggplot(ngram_group, aes(x = rank_group, y = freq_c)) + geom_point()
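The coverage figure quoted above comes from a cumulative sum over words sorted by frequency. A minimal base R sketch of that calculation follows. The word counts are invented for illustration and are not taken from the corpus, and `words_needed()` is a hypothetical helper.

```r
# Toy word counts, made up for this example (they sum to 100).
word_counts <- c(said = 50, just = 25, one = 10, like = 5,
                 can = 4, get = 3, time = 2, new = 1)

# words_needed() is a hypothetical helper: how many of the most frequent
# words are required to cover a given share of all word occurrences?
words_needed <- function(counts, coverage) {
  sorted <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(sorted) / sum(sorted)
  # index of the first word at which the running share reaches the target
  unname(which(cumulative >= coverage)[1])
}

n_words <- words_needed(word_counts, 0.95)
print(n_words)
# [1] 6
```

In this toy data the running shares are 0.50, 0.75, 0.85, 0.90, 0.94, and 0.97, so the sixth word is the first to reach 95%. Applied to the real table, the same calculation on `frequency` would give the number of words behind the 95% figure.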
One issue with working on such a large corpus is the memory and processing capability of my available computer. I decided to use an SQLite database to overcome the memory limits. On the negative side, this increases processing time. However, I can cache as much as possible in the database and won’t need to recalculate many interim results.
To minimize the total number of n-grams I create, I am going to separate out the top 44,300 words and use them to build the n-grams for the predictive model. Requiring every n-gram of size 2, 3, or 4 to include at least one of the top 44,300 words will help keep the final database tables small. I realize that doing so automatically builds in an error rate of about 5%, but in an effort to make the application as fast as possible it seems an acceptable trade-off.
# store the results back in sql table for future use
store_results_db <- function()
{
# connect database
sqldb <- dbConnect(RSQLite::SQLite(), db_location)
#get the table
dbWriteTable(sqldb, "ngrams_top", df_ngram_top, overwrite = TRUE, append = FALSE)
#close database
dbDisconnect(sqldb)
}
# keep only the top 5 rank groups (the most frequent 5% of unique words)
db_location <- "./dbases/nlp_dbase_2.sqlite"
df_ngram_top <- df_ngram_one %>%
filter(rank_group <= 5) %>%
mutate(frequency_rel_log = log10(frequency_relative))
store_results_db()
I will use the Katz back-off model to build the predictive model. Given a phrase, the model estimates the probability of each candidate next word. It can combine n-grams of different sizes, falling back to shorter n-grams when a longer one has not been seen.
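The back-off idea can be sketched in a few lines of base R. Note that this is a simplified lookup that only captures the fall-back behavior: it picks the most frequent continuation by raw count and omits the discounting and back-off weights that the full Katz model uses to produce proper probabilities. The tables and `predict_next()` are hypothetical and invented for illustration.

```r
# Toy n-gram tables, invented for illustration. Each row maps a context
# (the preceding words) to a candidate next word and its count.
trigrams <- data.frame(context = c("where is", "where is"),
                       word = c("the", "my"),
                       frequency = c(8, 3), stringsAsFactors = FALSE)
bigrams <- data.frame(context = c("is", "the", "the"),
                      word = c("the", "car", "dog"),
                      frequency = c(20, 6, 9), stringsAsFactors = FALSE)
unigrams <- data.frame(word = c("said", "just"),
                       frequency = c(30, 29), stringsAsFactors = FALSE)

# predict_next() is a hypothetical helper: simplified back-off, no discounting.
predict_next <- function(phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  tables <- list(trigrams, bigrams)  # longest context first
  context_sizes <- c(2, 1)           # words of context each table needs
  for (i in seq_along(tables)) {
    k <- context_sizes[i]
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    matches <- tables[[i]][tables[[i]]$context == context, ]
    if (nrow(matches) > 0) {
      return(matches$word[which.max(matches$frequency)])
    }
  }
  # nothing matched: fall back to the most frequent single word
  unigrams$word[which.max(unigrams$frequency)]
}

print(predict_next("Where is"))   # matched in the trigram table
print(predict_next("paint the"))  # backs off to the bigram table
print(predict_next("hello"))      # backs off to the unigram table
```

With these toy tables the three calls return "the", "dog", and "said" respectively. A full implementation would replace the raw counts with discounted probabilities so that scores from different n-gram sizes are comparable.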
The web-based application will use the Shiny framework. The main considerations are the size of the database and the responsiveness when searching for the predicted word. The maximum database size for hosting is 1GB, which will limit the database to n-grams of size 2, 3, and 4. Although table hashing will be used, the queries will always perform full table scans, which are O(n) operations, so a query will run for a relatively long time (> 0.1 seconds). The table sizes will be limited to a minimum of 2MM records. With a hashed index the maximum penalty is estimated to be < 0.3 seconds. The overall latency of the remote computer should mask the time penalty of the query, so the response should appear normal to the user.
Loading the English language files. The text files were stored locally, and I created a small class to hold the details of each of the three files. The get_file_details function stores some summary statistics about each text file.
The text file “en_US.twitter.txt” contains some null characters that will be removed before processing.
#English Files
file_name_en_blogs <- "./files_from_coursera/en_US/en_US.blogs.txt"
file_name_en_news <- "./files_from_coursera/en_US/en_US.news.txt"
file_name_en_twitter <- "./files_from_coursera/en_US/en_US.twitter.txt"
# create a class that will hold the file details
fileDetails <- setClass("fileDetails"
,slots = list(filename = "character"
,filesize = "numeric"
,countlines = "numeric"
,longestlinelength = "numeric"
))
# read the text file, and perform some summary statistics.
get_file_details <- function(file_name)
{
current_file <- new("fileDetails", filename=file_name)
current_file@filesize <- file.size(current_file@filename)
current_file@countlines <- R.utils::countLines(current_file@filename)
# read the whole file, dropping embedded null characters
curr_file_lines <- readLines(current_file@filename, skipNul = TRUE)
# length, in characters, of the longest line
current_file@longestlinelength <- max(nchar(curr_file_lines))
closeAllConnections()
return(current_file)
}
english_blogs <-get_file_details(file_name_en_blogs)
english_news <-get_file_details(file_name_en_news)
english_twitter <-get_file_details(file_name_en_twitter)
# english_blogs@countlines == 899,288
# english_news@countlines == 1,010,242
# english_twitter@countlines == 2,360,148
count_lines_with_word <- function(file_details, find_word)
{
curr_file_lines <- readLines(file_details@filename, skipNul = TRUE)
count_line <- sum(grepl(find_word, curr_file_lines))
closeAllConnections()
return(count_line)
}
count_lines_bonjour <-count_lines_with_word(english_twitter, "bonjour")
count_lines_ciao <-count_lines_with_word(english_twitter, "ciao")
count_lines_guten <-count_lines_with_word(english_blogs, "guten")
# used the linux command split
# ----------------------------------------------------------------------------
# split -l 25000 --additional-suffix=.txt en_US.news.txt ./split/news
# split -l 25000 --additional-suffix=.txt en_US.blogs.txt ./split/blogs
# split -l 25000 --additional-suffix=.txt en_US.twitter.txt ./split/twits
#get the list of text files to load
files_list <- list.files(path = "./files_from_coursera/en_US/split", pattern = "\\.txt$", full.names = TRUE)
db_location <- "./dbases/nlp_dbase_all_1ngrams.sqlite"
library(quanteda)
library(readtext)
# function creates the data from the provided text files
# use the quanteda and readtext libraries
build_ngrams <- function(filename)
{
#store the text file
text_file <- readtext(filename, docvarsfrom = "filenames")
current_corpus <- corpus(text_file)
rm(text_file)
#create tokens from the just loaded file
file_ngrams <- dfm(current_corpus
, remove = stopwords("english")
, ngrams = 1
, skip = 0
, remove_numbers = TRUE
, remove_punct = TRUE
, remove_symbols = TRUE
, remove_separators = TRUE
, remove_twitter = TRUE
, remove_hyphens = TRUE
, remove_url = TRUE
, verbose = TRUE)
rm(current_corpus)
return(file_ngrams)
}
# open each of the files and create a data frame of ngrams using
# the build_ngrams function
for(file_name in files_list[1:172])
{
message(c("processing: ",file_name))
file_ngrams <- build_ngrams(file_name)
# connect to database and save the results
# -- code listed below
}
library(DBI)
write_to_db <- function()
{
  # connect database
  sqldb <- dbConnect(RSQLite::SQLite(), db_location)
  # close the connection when the function exits, even if a query fails
  on.exit(dbDisconnect(sqldb))
  # clear the temporary table before writing
  dbExecute(sqldb, truncateqry)
  # query 1 -- write to temp table for speed
  dbWriteTable(sqldb, "ngrams_temp", file_ngrams, overwrite = FALSE, append = TRUE)
  # query 2 -- merge results from temp table into n-gram 1 table
  rows_updated <- dbExecute(sqldb, mquery)
  message(" ...... merge query updated rows: ", rows_updated)
  # query 3 -- append n-grams that are not yet in the n-gram 1 table
  rows_appended <- dbExecute(sqldb, aquery)
  message(" ...... append query updated rows: ", rows_appended)
}
# open each of the files and create a data frame of ngrams using
# the build_ngrams function
for(file_name in files_list[1:172])
{
# create the n-grams
# -- code listed above
# connect to database and save the results
write_to_db()
# remove the temp variable.
rm(file_ngrams)
}
# queries used when writing to the database
mquery <- "UPDATE ngram_one_all
SET frequency = frequency + (SELECT ngrams_temp.frequency
FROM ngrams_temp
WHERE ngrams_temp.ngram = ngram_one_all.ngram
)
WHERE exists (select *
FROM ngrams_temp
WHERE ngrams_temp.ngram = ngram_one_all.ngram
);"
aquery <- "INSERT INTO ngram_one_all (ngram, frequency)
select ngrams_temp.ngram, ngrams_temp.frequency
from ngrams_temp
LEFT OUTER JOIN ngram_one_all
ON ngrams_temp.ngram = ngram_one_all.ngram
WHERE ngram_one_all.ngram IS NULL
;"
truncateqry <- "DELETE FROM ngrams_temp;"
The following are stopwords (175) removed from the text documents.
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"