Data Analysis of a Text Corpus

Summary

Predictive text is a staple of many modern applications, whether the user is composing an SMS message on a mobile phone or writing code in an IDE. When it works correctly, predictive text lets a user compose a message or enter code faster.

This project analyzes a set of three English-language text files, with the goal of constructing a model that predicts the word the user is most likely to enter next.

Project Outline

The general outline of the project will be as follows:

Obtain source data
Processing of the text files
Summary statistics
Creating and storing n-grams in local database
Predictive model construction
Building a web based application

This report will focus on the first three steps in the project outline and will briefly discuss strategy for the last three steps.

Obtain Source Data

Reference: Source Data

The project request stated that the data was provided by HC Corpora. Unfortunately, the website is no longer active, and no copy is available through the Internet Archive Wayback Machine. The only information available is what was included in the project request.

Three text files were provided. The text files were constructed by sampling publicly available news websites, blog websites and posts made on Twitter.

Foreign Words

The data sources this project will be working with are limited to the English language, although it is possible non-English text will appear in the source files. It is reasonable to expect some common phrases from non-English languages to be interspersed into the provided documents.

Examples

file	language	word	occurrence
en_US.twitter.txt	French	bonjour	13
en_US.twitter.txt	Italian	ciao	43
en_US.blogs.txt	German	guten	1

Source Code: Appendix A: Foreign Words

Processing of the Text Files

Preparing files

The provided text files are quite large, and the laptop I am using for the project doesn’t have enough physical RAM to hold all of the files in memory, which R normally requires. I therefore decided to split the files into smaller files and store the interim results in an SQLite database.

Processing

The overall processing took the following steps:

Get the list of file names to process
- The original 3 files were split into 172 files, approximately 2MB in size
Create the n-grams of size 1
Insert the n-grams into the SQLite database
Read the results from the database
Summarize the results

List Files

First, I split the original files into smaller files, using a Linux command to split them and store the pieces in a separate directory. I then collected the names of the files that were created and stored them in a list for processing.
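The split itself was done with the Linux split command listed in Appendix A. As a rough illustration of the same chunking idea in R, the sketch below groups a vector of lines into chunks of a fixed size. The helper name split_into_chunks is hypothetical and is not part of the project code. The last two lines apply the same arithmetic to the line counts and the 25,000-line chunk size given in Appendix A.

```r
# Hypothetical helper (not from the project): group lines into chunks of
# at most chunk_size lines, mirroring what `split -l` does to a file.
split_into_chunks <- function(lines, chunk_size) {
    chunk_id <- ceiling(seq_along(lines) / chunk_size)
    unname(split(lines, chunk_id))
}

chunks <- split_into_chunks(c("a", "b", "c", "d", "e"), 2)
length(chunks)  # 3 chunks: (a, b), (c, d), (e)

# Number of chunk files implied by the line counts in Appendix A
line_counts <- c(blogs = 899288, news = 1010242, twitter = 2360148)
sum(ceiling(line_counts / 25000))  # 36 + 41 + 95 = 172
```

The result agrees with the 172 files reported in the processing steps.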

Source Code: Appendix A: List Files

Creating the n-grams of size 1.

A key step when analyzing a corpus of text documents is breaking the documents into n-grams.

An n-gram is a sequence of consecutive words, with the size indicating how many words are combined. An n-gram of size 1 is an individual word from a sentence, while an n-gram of size 2 is a pair of consecutive words. The sentence “Where is the car?” contains 3 n-grams of size 2: “Where is”, “is the”, and “the car”.
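The definition can be sketched in a few lines of base R. This is only an illustration of the idea, not the quanteda code used in the project; the function name make_ngrams is hypothetical, and it assumes words are separated by whitespace and that punctuation can simply be dropped.

```r
# Hypothetical helper: return all n-grams of size n from one sentence.
make_ngrams <- function(sentence, n) {
    # drop punctuation, then split on whitespace
    words <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # one n-gram starts at each position that leaves room for n words
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
}

make_ngrams("Where is the car?", 2)
# "Where is" "is the" "the car"
```

A sentence of k words yields k - n + 1 n-grams of size n, which is why the four-word example produces three n-grams of size 2.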

I used the quanteda package to process the text files. The package includes functions that help with parsing text, and I used the following options while creating the n-grams.

Remove stop words. The package removes a predefined list of the most common words in the English language (listed in Appendix B).

Additionally, I selected the options to remove numbers, punctuation marks, symbols, separators, Twitter symbols, hyphens, and web addresses from the text documents.
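To make the effect of these options concrete, here is a minimal base R approximation. It is not quanteda's implementation: the function name clean_tokens and the two-word stop list are assumptions for illustration, and the simple character filter also discards non-ASCII letters, which quanteda would not necessarily do.

```r
# Hypothetical helper approximating the cleaning options described above.
clean_tokens <- function(text, stop_words) {
    text <- tolower(text)
    # remove web addresses
    text <- gsub("https?://\\S+|www\\.\\S+", " ", text, perl = TRUE)
    # replace numbers, punctuation, symbols and hyphens with spaces
    # (apostrophes are kept so contractions stay intact)
    text <- gsub("[^a-z' ]", " ", text)
    tokens <- strsplit(text, " +")[[1]]
    tokens <- tokens[nzchar(tokens)]
    # drop stop words
    tokens[!tokens %in% stop_words]
}

# tiny illustrative stop list; the project uses stopwords("english")
clean_tokens("I saw 3 well-known cars at http://example.com today!",
             stop_words = c("i", "at"))
# "saw" "well" "known" "cars" "today"
```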

Source Code: Appendix A: Create N-Grams

Write to database

I created a local sqlite database file on my laptop. I used the DBI package to connect to the database and run the appropriate queries.

The queries will take the n-gram data frame created in the previous step and store the results in the database.

First, any existing rows in the temporary table are deleted. Next, the entire contents of the data frame are written to the temporary table; loading into a temporary table first is faster than inserting directly into the main table. Then the temporary table is merged into the n-gram one table, updating the frequency count of every n-gram that already exists there. Lastly, any n-grams that are in the temporary table but not yet in the n-gram one table are appended.
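The update-then-append logic can be illustrated without a database. The sketch below applies the same two steps to named numeric vectors held in memory; merge_counts is a hypothetical name, and the SQL that the project actually runs is listed in Appendix A.

```r
# Hypothetical in-memory analogue of the merge and append queries.
# totals and batch are named numeric vectors: names are n-grams,
# values are frequencies.
merge_counts <- function(totals, batch) {
    # "merge" step: add batch counts to n-grams already in totals
    shared <- intersect(names(batch), names(totals))
    totals[shared] <- totals[shared] + batch[shared]
    # "append" step: add n-grams seen for the first time
    new_ngrams <- setdiff(names(batch), names(totals))
    c(totals, batch[new_ngrams])
}

totals <- c(said = 2, one = 1)
batch  <- c(one = 3, car = 5)
merge_counts(totals, batch)
# said  one  car
#    2    4    5
```

Running this once per split file accumulates the counts for the whole corpus, which is what the temporary table does on disk.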

Source Code: Appendix A: Write to Database

Summary Statistics

Read the Results

Connect to the database and store the results in a data frame for processing.

library(DBI)
db_location <- "./dbases/nlp_dbase_all_1ngrams.sqlite"
# --------------------------------------------------------------------
# results
# --------------------------------------------------------------------

read_results_db <- function()
{
    full_table_query <- "SELECT n.ngram ,n.frequency ,n.frequency_relative
                         FROM ngram_one_all AS n
                        ;"
    # connect database
    sqldb <- dbConnect(RSQLite::SQLite(), db_location)

    #get the table
    sql_result <- dbGetQuery(sqldb, full_table_query)
    
    #close database
    dbDisconnect(sqldb)
    return(sql_result)
}
# read the results from the database
df_ngram_one <- read_results_db()

Top 20 by ranking

Below is a list of the twenty most frequent words in the corpus, excluding the stop words mentioned above. The table shows the overall frequency and the relative frequency of each word.

It was interesting to note the most common word was “said”.

# rank the results
library(dplyr)
# assign a ranking to each word (1 = most frequent)
df_ngram_one <- df_ngram_one %>% mutate(ranking = rank(-frequency))
# group the rankings
df_ngram_one <- df_ngram_one %>% mutate(rank_group = ntile(-frequency, 100))

# top 20 by frequency
head(df_ngram_one %>% arrange(desc(frequency)), 20)

##     ngram frequency frequency_relative ranking rank_group
## 1    said    302147        0.005582246       1          1
## 2    just    296360        0.005475329       2          1
## 3     one    279613        0.005165924       3          1
## 4    like    261949        0.004839577       4          1
## 5     can    238595        0.004408106       5          1
## 6     get    220730        0.004078045       6          1
## 7    time    207876        0.003840564       7          1
## 8     new    189995        0.003510208       8          1
## 9    good    175344        0.003239527       9          1
## 10    now    174982        0.003232839      10          1
## 11    day    164503        0.003039236      11          1
## 12   know    158355        0.002925651      12          1
## 13   love    157296        0.002906085      13          1
## 14 people    154061        0.002846318      14          1
## 15   back    137639        0.002542917      15          1
## 16     go    136615        0.002523998      16          1
## 17    see    134978        0.002493754      17          1
## 18  first    130574        0.002412389      18          1
## 19   make    126315        0.002333703      19          1
## 20   also    125836        0.002324853      20          1

Cumulative Frequency

In the previous step I separated the words into 100 groups, so each group holds one percent of the unique words.

Next I calculated the cumulative frequency by group.

# cumulative relative frequency by ranking group
ngram_group <- df_ngram_one %>% 
                group_by(rank_group) %>% 
                summarise(count = n()
                          ,freq_rel = sum(frequency_relative)
                          ) %>%
                arrange(rank_group)
# running total of the relative frequency across the ordered groups
ngram_group$freq_c <- cumsum(ngram_group$freq_rel)

Cumulative Frequency by Group

The plot below shows the cumulative frequency by percentage of unique words. Only 5% of the unique words account for almost 95% of all word occurrences in the corpus.

In other words, approximately 44,300 unique words represent approximately 95% of the words in the corpus text.
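The coverage calculation can be shown on a toy example. The frequencies below are made up for illustration and the function name words_needed is hypothetical; the sketch simply counts how many of the most frequent words are required to reach a target share of all word occurrences.

```r
# Hypothetical helper: number of top-ranked words needed to cover
# at least `target` (a proportion) of all word occurrences.
words_needed <- function(frequency, target) {
    sorted   <- sort(frequency, decreasing = TRUE)
    coverage <- cumsum(sorted) / sum(sorted)
    which(coverage >= target)[1]
}

toy_frequency <- c(50, 30, 10, 5, 5)  # made-up counts for five words
words_needed(toy_frequency, 0.90)     # 3: the top three words cover 90%
```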

# plot results
library(ggplot2)
ggplot(ngram_group, aes(x = rank_group, y = freq_c)) + geom_point()

Creating and Storing N-Grams in Local Database

One of the issues with working with such a large corpus is the memory and processing capacity of my computer. I decided to use an SQLite database to overcome the memory limits. The drawback is longer processing time. However, I can cache interim results in the database and avoid recalculating them.

To minimize the total number of n-grams I create, I am going to separate out the 44,300 most frequent words and use them to build the n-grams for the predictive model. Requiring every n-gram of size 2, 3, or 4 to include at least one of these top 44,300 words will help keep the final database tables small. I realize this leaves the roughly 5% of word occurrences outside that vocabulary poorly covered, which builds in an error rate of about 5%, but in an effort to make the application as fast as possible it seems an acceptable trade-off.
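A minimal sketch of that filter, assuming the n-grams are stored as space-separated strings. The name keep_ngrams and the sample data are hypothetical; the project's own implementation of this step is not shown in this report.

```r
# Hypothetical helper: keep only n-grams that contain at least one
# word from the top-word vocabulary.
keep_ngrams <- function(ngrams, vocabulary) {
    has_top_word <- vapply(strsplit(ngrams, " ", fixed = TRUE),
                           function(words) any(words %in% vocabulary),
                           logical(1))
    ngrams[has_top_word]
}

keep_ngrams(c("the car", "zyx qwv", "car zyx"), vocabulary = c("car"))
# "the car" "car zyx"
```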

# store the results back in sql table for future use
store_results_db <- function()
{
    # connect database
    sqldb <- dbConnect(RSQLite::SQLite(), db_location)
    
    # write the table
    dbWriteTable(sqldb, "ngrams_top", df_ngram_top, overwrite = TRUE, append = FALSE)
    
    #close database
    dbDisconnect(sqldb)
}
# keep only the top 5 rank groups (the top 5% of unique words)
db_location         <- "./dbases/nlp_dbase_2.sqlite"
df_ngram_top <- df_ngram_one %>% 
                filter(rank_group <= 5) %>%
                mutate(frequency_rel_log = log10(frequency_relative))

store_results_db()

Predictive Model Construction

I will use the Katz backoff model to build the predictive model. Given a phrase, the model estimates the probability of each candidate next word. It uses n-grams of different sizes: it starts with the longest available context and backs off to shorter n-grams when the longer one was not observed in the corpus.
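The backoff idea can be sketched as follows. This is a simplified illustration only: it picks the most frequent continuation and omits the discounting that a full Katz backoff model uses to redistribute probability mass. The function name predict_next, the table layout (columns context, next_word, frequency), and the sample counts are all assumptions.

```r
# Hypothetical sketch of backing off from longer to shorter contexts.
# tables[[k]] holds n-grams whose context is k words long.
predict_next <- function(phrase, tables) {
    words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
    k <- min(length(words), length(tables))
    while (k >= 1) {
        context <- paste(tail(words, k), collapse = " ")
        hits <- tables[[k]][tables[[k]]$context == context, ]
        if (nrow(hits) > 0) {
            return(hits$next_word[which.max(hits$frequency)])
        }
        k <- k - 1  # back off to a shorter context
    }
    NA_character_   # nothing matched at any size
}

# made-up counts for illustration
bigrams  <- data.frame(context = c("the", "the", "is"),
                       next_word = c("car", "dog", "the"),
                       frequency = c(5, 3, 4), stringsAsFactors = FALSE)
trigrams <- data.frame(context = "is the", next_word = "dog",
                       frequency = 2, stringsAsFactors = FALSE)
tables <- list(bigrams, trigrams)

predict_next("where is the", tables)  # "dog" (two-word context matched)
predict_next("see the", tables)       # "car" (backed off to one word)
```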

Building a Web Based Application

The web based application will use the Shiny framework. The main considerations are the size of the database and the responsiveness when searching for the predicted word. The maximum database size for hosting is 1GB, which will limit the database to n-grams of size 2, 3, and 4. Although table hashing will be used, the queries will still perform full table scans, which are O(n) operations, so a query will run for a relatively long period (> 0.1 seconds). The table sizes will be limited to 2MM records. With a hashed index, the maximum penalty is estimated to be < 0.3 seconds. The overall latency of the remote computer should mask the time penalty of the query, so the response should appear normal to the user.

Appendix A

Load Files

Loading the English language files. The text files were stored locally, and I created a small class so that all three files could be processed the same way. The get_file_details function stores some summary statistics about each of the text files.

The text file “en_US.twitter.txt” contains some null characters that will be removed before processing.

#English Files
file_name_en_blogs <-  "./files_from_coursera/en_US/en_US.blogs.txt"
file_name_en_news <-  "./files_from_coursera/en_US/en_US.news.txt"
file_name_en_twitter <-  "./files_from_coursera/en_US/en_US.twitter.txt"

# create a class that will hold the file details
fileDetails <- setClass("fileDetails"
                         ,slots = list(filename     = "character"
                                      ,filesize     = "numeric"
                                      ,countlines   = "numeric"
                                      ,longestlinelength = "numeric"
                         ))

# read the text file, and perform some summary statistics.
get_file_details <- function(file_name)
{
    current_file <- new("fileDetails", filename=file_name)
    
    current_file@filesize   <- file.size(current_file@filename)
    current_file@countlines <- R.utils::countLines(current_file@filename)
    # read the file line by line, dropping embedded null characters
    curr_file_lines <- readLines(current_file@filename, skipNul = TRUE)
    # length, in characters, of the longest line
    current_file@longestlinelength <- max(nchar(curr_file_lines))
    return(current_file)
}

english_blogs   <-get_file_details(file_name_en_blogs)
english_news    <-get_file_details(file_name_en_news)
english_twitter <-get_file_details(file_name_en_twitter)

# english_blogs@countlines == 899,288
# english_news@countlines == 1,010,242
# english_twitter@countlines == 2,360,148

Foreign words in text

count_lines_with_word <- function(file_details, find_word)
{
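    # skipNul drops the null characters found in en_US.twitter.txt
    curr_file_lines <- readLines(file_details@filename, skipNul = TRUE)

    # count the lines that contain the word (case-sensitive match)
    count_line <- sum(grepl(find_word, curr_file_lines))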
    return(count_line)
}

count_lines_bonjour <-count_lines_with_word(english_twitter, "bonjour")
count_lines_ciao <-count_lines_with_word(english_twitter, "ciao")
count_lines_guten <-count_lines_with_word(english_blogs, "guten")

List of files

# used the linux command split
# ----------------------------------------------------------------------------
# split -l 25000 --additional-suffix=.txt  en_US.news.txt ./split/news
# split -l 25000 --additional-suffix=.txt  en_US.blogs.txt ./split/blogs
# split -l 25000 --additional-suffix=.txt  en_US.twitter.txt ./split/twits


#get the list of text files to load
files_list  <- list.files(path = "./files_from_coursera/en_US/split", pattern = "\\.txt$", full.names = TRUE)
db_location <- "./dbases/nlp_dbase_all_1ngrams.sqlite"

Create n-grams

library(quanteda)
library(readtext)

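# function creates the n-gram counts from a provided text file
# uses the quanteda and readtext libraries
# note: the dfm() call reflects the quanteda version used for the project;
# argument names may differ in other releases
build_ngrams <- function(filename)
{
    #store the text file
    text_file <- readtext(filename, docvarsfrom = "filenames")
    current_corpus <- corpus(text_file)
    rm(text_file)
    #create the document-feature matrix from the just loaded file
    file_dfm <- dfm(current_corpus
                        , remove = stopwords("english")
                        , ngrams = 1
                        , skip   = 0
                        , remove_numbers = TRUE
                        , remove_punct   = TRUE
                        , remove_symbols = TRUE
                        , remove_separators = TRUE
                        , remove_twitter = TRUE
                        , remove_hyphens = TRUE
                        , remove_url = TRUE
                        , verbose = TRUE)

    rm(current_corpus)
    # assumed conversion step: collapse the matrix into a data frame with
    # the ngram and frequency columns used by the database queries
    ngram_counts <- colSums(file_dfm)
    file_ngrams <- data.frame(ngram = names(ngram_counts)
                             ,frequency = as.numeric(ngram_counts)
                             ,stringsAsFactors = FALSE)
    return(file_ngrams)
}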

# open each of the files and create a data frame of ngrams using 
# the build_ngrams function
for(file_name in files_list[1:172]) 
{
    message(c("processing: ",file_name))
    file_ngrams <- build_ngrams(file_name)
    # connect to database and save the results
        # -- code listed below
}

Write to Database

write_to_db <- function()
{
    # connect database
    sqldb <- dbConnect(RSQLite::SQLite(), db_location)
    
    # clear the temporary table before writing.
    query_result <- dbSendStatement(sqldb, truncateqry)
    dbHasCompleted(query_result)
    dbClearResult(query_result)
    
    # query 1 -- write to temp table for speed
    dbWriteTable(sqldb, "ngrams_temp", file_ngrams, overwrite = FALSE, append = TRUE)
    
    # query 2 -- merge results from temp table into n-gram 1 table
    query_result <- dbSendStatement(sqldb, mquery)
    dbHasCompleted(query_result)
    message(c("   ...... merge query updated rows: ", dbGetRowsAffected(query_result)))
    dbClearResult(query_result)
    
    #query 3 -- append query
    query_result <- dbSendStatement(sqldb, aquery)
    dbHasCompleted(query_result)
    message(c("   ...... append query updated rows: ", dbGetRowsAffected(query_result)))
    dbClearResult(query_result)
    
    #close database
    dbDisconnect(sqldb)
    
}

# open each of the files and create a data frame of ngrams using 
# the build_ngrams function
for(file_name in files_list[1:172]) 
{
    # create the n-grams
        # -- code listed above
    # connect to database and save the results
        write_to_db()
    # remove the temp variable.
        rm(file_ngrams)
}

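# queries used when writing to the database
# (these must be defined before write_to_db() is called)

# add the batch frequency to every n-gram already in the main table
mquery <- "UPDATE ngram_one_all
           SET    frequency = frequency + (SELECT ngrams_temp.frequency
                                           FROM   ngrams_temp
                                           WHERE  ngrams_temp.ngram = ngram_one_all.ngram)
           WHERE  EXISTS (SELECT *
                          FROM   ngrams_temp
                          WHERE  ngrams_temp.ngram = ngram_one_all.ngram);"

# append n-grams that are not yet in the main table; the columns are
# named explicitly because the table also holds frequency_relative
aquery <- "INSERT INTO ngram_one_all (ngram, frequency)
           SELECT  ngrams_temp.ngram, ngrams_temp.frequency
           FROM    ngrams_temp
           LEFT OUTER JOIN ngram_one_all
           ON      ngrams_temp.ngram = ngram_one_all.ngram
           WHERE   ngram_one_all.ngram IS NULL;"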

truncateqry <- "DELETE FROM ngrams_temp;"

Appendix B

Stopwords removed from text

The following 175 stopwords were removed from the text documents.

stopwords("english")

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"