Synopsis

The puspose of this project is to build a text prediction algorythm, similar to the Swiftkey app on mboile phones. The predicition algorythm is based on frequency analysis of words in text samples. For this milestone, the data is a corpus from HC Corpora (www.corpora.heliohost.org). This report presents the text cleaning process, the exploratory analysis and the frequency of words.

The approach to building the application

From the beginning I considered that my application should be able to process further text files in the future, to enrich its vocabulary and improve its word prediction. So as a basic structure, I decided that word frequencies, which the predicition is based on, should be global tables in the application; each text file should be cleaned and the words counted individually, and the results should be merged into the global word frequency tables. This optimizes the use of computing resources (fully processing only one file at a time) and lets me add a new text file any time.

The data set

My application downloads the text samples and a word list of profanities which later will be used to clean the sample text.

# Loads a text file
# Parameters: file name, number of lines to load: if -1, all lines are read
# Returns: the loaded text
loadTextFile <- function(fileName, numberOfLines= -1L, verbose=FALSE) {
    msg("PROCESSING file",fileName)
    if(verbose){
        msg(" File size:",round(file.info(fileName)[1]*1e-6,1),"MBytes")
    }
    inputText <- suppressWarnings(readLines(fileName, n=numberOfLines, encoding = "UTF-8"))
    if(verbose){
        msg(" Number of lines:",length(inputText))
        msg(" Longest line:",max(nchar(inputText)),"characters")
    }
    return(inputText)
} # end loadTextFile()
File Size (MB) # of lines Longest line (characters)
en_US.blogs.xt 210.2 899,288 40,833
en_US.news.txt 205.8 77,259 5,760
en_US.twitter.txt 167.1 2,360,148 140

Cleaning and processing the data

Several different approaches are available in R for NLP (natural language processing). I experimented with them, to find one which uses the least memory for processing, to be able to process the biggest amount of text sample. I decided to use the stylo package, which allowed me on my PC to use 80,000 lines from each of the 3 sample files, as opposed to 5,000 lines from each when I used term document matrices with the tm and RWeka packages.

The output from the function that cleans the text is a continuous text of lowercase words separated by single spaces, no numbers or punctuations. The content is changed during cleaning the following way:

# Goes through a set of cleaning steps, replacements and 
# removals to make the text processable
# Parameters: text to clean
# Returns: the cleaned text
cleanTextFile <- function(inputText) {
    inputText <- iconv(inputText, "latin1", "ASCII", sub="")
    msg("Replacing a pre-defined set of expressions")
    pb <- txtProgressBar(min=0, max=length(inputText), style=3)
    for(i in 1:length(inputText)) {
        setTxtProgressBar(pb, i) # show the progress
        inputText[i] <- tolower(inputText[i])
        for ( j in 1:nrow(expressionReplacementTable)) {
            inputText[i] <- gsub(expressionReplacementTable[j,1], 
                                 expressionReplacementTable[j,2], 
                                 inputText[i])
        }
        inputText[i] <- removeWords(inputText[i], profanities)
        inputText[i] <- stripWhitespace(inputText[i])
    }
    close(pb)
    return(inputText)
} # end cleanTextFile()

Counting word frequency

The cleaned text is processed and the frequencies of 1,2,3 and 4 words long expressions are added to the global word freqency tables. Using R data tables lets me use fast functions, for example to merge the data. A total of 240,000 text lines are processed (80,000 from each of the 3 sample files).

Due to the time consuming calculations, the frequency tables are saved, so they can be loaded from files instead of re-calculating them on every run.

# After loading and cleaning a text file:
# - does some further cleaning by removing words with 3 or more repeating 
#   identical characters
# - creates Uni- Bi- Tri- and Four-grams
# - adds the NGrams and their frequncy in the input text file to the 
#   global variables: word1Freq, word2Freq, word3Freq, word4Freq
# Parameters: text file name to process
# Returns: the cleaned text
processNGrams <- function(textFileName) {
    inputText <- loadTextFile(textFileName, 80000)
    inputText <- cleanTextFile(inputText)
    inputText <- txt.to.words(inputText, splitting.rule = "[ \n\t]")

    msg("Calculating Unigrams from",textFileName)
    t1Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 1)))
    names(t1Grm) <- c("Word","Frequency")

    msg("Adding this file's Unigrams to the global word1Freq table")
    tmp <- nrow(word1Freq)
    if (nrow(word1Freq)==0) {
        word1Freq <<- t1Grm
        names(word1Freq)[2] <<- "Freq"
    } else {
        word1Freq <<- merge(word1Freq, t1Grm, by="Word", all=TRUE)
        word1Freq[as.vector(is.na(word1Freq[,2])),2]<<-0
        word1Freq[as.vector(is.na(word1Freq[,3])),3]<<-0
        word1Freq[,2]<<-word1Freq[,2]+word1Freq[,3]
        # remove unnecessary 3rd column (creted by the table merge)
        word1Freq <<- word1Freq[,-3]
    }
    rm("t1Grm")
    msg("word1Freq table grew by",nrow(word1Freq)-tmp,"lines to",nrow(word1Freq))

    # Now generate n-grams
    msg("Calculating Bigrams from ",textFileName)
    t2Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 2)))
    msg("Adding this file's Bigrams to the global word2Freq table")
    names(t2Grm) <- c("Word","Frequency")
    tmp<-nrow(word2Freq)
    if (nrow(word2Freq) ==0 ) {
        word2Freq <<- t2Grm
        names(word2Freq)[2] <<- "Freq"
    } else {
        word2Freq <<- merge(word2Freq, t2Grm, by="Word", all=TRUE)
        word2Freq[as.vector(is.na(word2Freq[,2])),2]<<-0
        word2Freq[as.vector(is.na(word2Freq[,3])),3]<<-0
        word2Freq[,2]<<-word2Freq[,2]+word2Freq[,3]
        # remove unnecessary 3rd column (creted by the table merge)
        word2Freq <<- word2Freq[,-3]
    }
    rm("t2Grm")
    msg("word2Freq table grew by",nrow(word2Freq)-tmp,"lines to",nrow(word2Freq))
    
    msg("Calculating Trigrams from",textFileName)
    t3Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 3)))
    msg("Adding this file's Trigrams to the global word3Freq table")
    tmp<-nrow(word3Freq)
    names(t3Grm) <- c("Word","Frequency")
    if (nrow(word3Freq) ==0 ) {
        word3Freq <<- t3Grm
        names(word3Freq)[2] <<- "Freq"
    } else {
        word3Freq <<- merge(word3Freq, t3Grm, by="Word", all=TRUE)
        word3Freq[as.vector(is.na(word3Freq[,2])),2]<<-0
        word3Freq[as.vector(is.na(word3Freq[,3])),3]<<-0
        word3Freq[,2]<<-word3Freq[,2]+word3Freq[,3]
        # remove unnecessary 3rd column (creted by the table merge)
        word3Freq <<- word3Freq[,-3]
    }
    rm("t3Grm")
    msg("word3Freq table grew by",nrow(word3Freq)-tmp,"lines to",nrow(word3Freq))
    
    msg("Calculating Four-grams from",textFileName)
    t4Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 4)))
    msg("Adding this file's Four-grams to the global word4Freq table")
    tmp<-nrow(word4Freq)
    names(t4Grm) <- c("Word","Frequency")
    if (nrow(word4Freq) ==0 ) {
        word4Freq <<- t4Grm
        names(word4Freq)[2] <<- "Freq"
    } else {
        word4Freq <<- merge(word4Freq, t4Grm, by="Word", all=TRUE)
        word4Freq[as.vector(is.na(word4Freq[,2])),2]<<-0
        word4Freq[as.vector(is.na(word4Freq[,3])),3]<<-0
        word4Freq[,2]<<-word4Freq[,2]+word4Freq[,3]
        # remove unnecessary 3rd column (creted by the table merge)
        word4Freq <<- word4Freq[,-3]
    }
    rm("t4Grm")
    msg("word4Freq table grew by",nrow(word4Freq)-tmp,"lines to",nrow(word4Freq))
    
} # end processNGrams()

The following function calculates the number of words necessary to cover X% of the sample text.

# Sums the frequency of the first n elements in word1Freq
# until the desired percentage of the total frequency of all words has been reached.
# Parameters: the percentage of the processed text to cover
# Returns: how many words needed to cover 'pct' percentage of the processed text
languageCovered <- function(pct=0) {
    n<-0
    wordsToCover<-as.integer(sum(word1Freq[,Freq]) * pct)
    for (i in 1:nrow(word1Freq)) {
        if (n >= wordsToCover)break
        n <-n+word1Freq[i,2]
    }
    msg(i,"words cover",pct*100,"% of the text")
} # end languageCovered()

Word frequency analysis

The main program body is, I believe, self-explanatory.

downloadswiftKeyfiles()
downloadProfanities()

if (!file.exists("word1Freq.RData")) {
    profanities <- suppressWarnings(readLines(profFileName))
    msg("Loading",length(profanities),"profan expressions, to remove such words from the files.")
    processNGrams(newsFileName)
    processNGrams(blogFileName)
    processNGrams(twitFileName)
    rm("profanities")
    save(word1Freq,file="word1Freq.RData")
    save(word2Freq,file="word2Freq.RData")
    save(word3Freq,file="word3Freq.RData")
    save(word4Freq,file="word4Freq.RData")
} else {
    msg("Loading word frequency tables...")
    load("word1Freq.RData")
    load("word2Freq.RData")
    load("word3Freq.RData")
    load("word4Freq.RData")
}
## [1] Loading word frequency tables...
# Prepare tables
setkey(word1Freq, Word)
setkey(word2Freq, Word)
setkey(word3Freq, Word)
setkey(word4Freq, Word)
# Order tables by frequency
setorder(word1Freq, -Freq)
setorder(word2Freq, -Freq)
setorder(word3Freq, -Freq)
setorder(word4Freq, -Freq)
# draw the diagrams
wordFrequencyDiagrams()

The top 20 Uni- Bi– Tri and Four-grams are shown on the diagrams below. Stop words are left in the sample, because for the correct prediction of the next typed word they are needed.

languageCovered(0.5)
## [1] 115 words cover 50 % of the text
languageCovered(0.9)
## [1] 6618 words cover 90 % of the text

As expected, to cover a linearly increasing percentage of the sampled language, an exponentially increasing number of words are required.

After processing 240,000 lines from the 3 text files, the NGram tables have the following properties:

NGram unique phrases NGrams with frequency of 1
1 122,793 58,506 (48%)
2 1,929,108 1,446,300 (75%)
3 4,631,088 4,127,142 (89%)
4 6,041,284 5,814,870 (96%)

Other considerations

Plan for the shiny application

The shiny application will handle input and output. Next word predictions will need to be calculated after each entry of a character, so the application needs to be fast enough to provide 3 prediction choices, virtually immediately. With the addition of every new word, the application should try to give the best predicition based on the contxt, that is, whenever possible use the Four-grams. If a new word is typed for which there is no matching Four-gram, the backoff procedures should step back to Tri-grams, then Bi-grams and so on, until a match is found. For words for which there is not even a Uni-gram, the word itself should be its own estimate, and as soon as a new word is attached, that one should start further predicitions as a Uni-gram.