The puspose of this project is to build a text prediction algorythm, similar to the Swiftkey app on mboile phones. The predicition algorythm is based on frequency analysis of words in text samples. For this milestone, the data is a corpus from HC Corpora (www.corpora.heliohost.org). This report presents the text cleaning process, the exploratory analysis and the frequency of words.
From the beginning I considered that my application should be able to process further text files in the future, to enrich its vocabulary and improve its word prediction. So as a basic structure, I decided that word frequencies, which the predicition is based on, should be global tables in the application; each text file should be cleaned and the words counted individually, and the results should be merged into the global word frequency tables. This optimizes the use of computing resources (fully processing only one file at a time) and lets me add a new text file any time.
My application downloads the text samples and a word list of profanities which later will be used to clean the sample text.
# Loads a text file
# Parameters: file name, number of lines to load: if -1, all lines are read
# Returns: the loaded text
loadTextFile <- function(fileName, numberOfLines= -1L, verbose=FALSE) {
msg("PROCESSING file",fileName)
if(verbose){
msg(" File size:",round(file.info(fileName)[1]*1e-6,1),"MBytes")
}
inputText <- suppressWarnings(readLines(fileName, n=numberOfLines, encoding = "UTF-8"))
if(verbose){
msg(" Number of lines:",length(inputText))
msg(" Longest line:",max(nchar(inputText)),"characters")
}
return(inputText)
} # end loadTextFile()
| File | Size (MB) | # of lines | Longest line (characters) |
|---|---|---|---|
| en_US.blogs.xt | 210.2 | 899,288 | 40,833 |
| en_US.news.txt | 205.8 | 77,259 | 5,760 |
| en_US.twitter.txt | 167.1 | 2,360,148 | 140 |
Several different approaches are available in R for NLP (natural language processing). I experimented with them, to find one which uses the least memory for processing, to be able to process the biggest amount of text sample. I decided to use the stylo package, which allowed me on my PC to use 80,000 lines from each of the 3 sample files, as opposed to 5,000 lines from each when I used term document matrices with the tm and RWeka packages.
The output from the function that cleans the text is a continuous text of lowercase words separated by single spaces, no numbers or punctuations. The content is changed during cleaning the following way:
Expressions like “don’t” are replaced by “do not”. This reduces the number of words and increases the prediction accuracy for when you want to express “do not”.
There are ambiguous cases, for example “it’d” can be “it would” or “it had”. For such situations, “it’d” should be kept, except the “’” character causes problems for most string processing functions. Therefore “’” ir replaced with “_" (“it’d” changes to “it_d”).
Profanties are removed from the text, using the list of profan words from “https://www.cs.cmu.edu/~biglou/resources/bad-words.txt”.
URL, twitter account and retweet refrences, white spaces are removed
Words containing 3 or more consecutive identical characters (example: “arrrgh”) are removed, as these will have little use in normal text prediction. words that start or end with underscore (“_“) are removed.
# Goes through a set of cleaning steps, replacements and
# removals to make the text processable
# Parameters: text to clean
# Returns: the cleaned text
cleanTextFile <- function(inputText) {
inputText <- iconv(inputText, "latin1", "ASCII", sub="")
msg("Replacing a pre-defined set of expressions")
pb <- txtProgressBar(min=0, max=length(inputText), style=3)
for(i in 1:length(inputText)) {
setTxtProgressBar(pb, i) # show the progress
inputText[i] <- tolower(inputText[i])
for ( j in 1:nrow(expressionReplacementTable)) {
inputText[i] <- gsub(expressionReplacementTable[j,1],
expressionReplacementTable[j,2],
inputText[i])
}
inputText[i] <- removeWords(inputText[i], profanities)
inputText[i] <- stripWhitespace(inputText[i])
}
close(pb)
return(inputText)
} # end cleanTextFile()
The cleaned text is processed and the frequencies of 1,2,3 and 4 words long expressions are added to the global word freqency tables. Using R data tables lets me use fast functions, for example to merge the data. A total of 240,000 text lines are processed (80,000 from each of the 3 sample files).
Due to the time consuming calculations, the frequency tables are saved, so they can be loaded from files instead of re-calculating them on every run.
# After loading and cleaning a text file:
# - does some further cleaning by removing words with 3 or more repeating
# identical characters
# - creates Uni- Bi- Tri- and Four-grams
# - adds the NGrams and their frequncy in the input text file to the
# global variables: word1Freq, word2Freq, word3Freq, word4Freq
# Parameters: text file name to process
# Returns: the cleaned text
processNGrams <- function(textFileName) {
inputText <- loadTextFile(textFileName, 80000)
inputText <- cleanTextFile(inputText)
inputText <- txt.to.words(inputText, splitting.rule = "[ \n\t]")
msg("Calculating Unigrams from",textFileName)
t1Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 1)))
names(t1Grm) <- c("Word","Frequency")
msg("Adding this file's Unigrams to the global word1Freq table")
tmp <- nrow(word1Freq)
if (nrow(word1Freq)==0) {
word1Freq <<- t1Grm
names(word1Freq)[2] <<- "Freq"
} else {
word1Freq <<- merge(word1Freq, t1Grm, by="Word", all=TRUE)
word1Freq[as.vector(is.na(word1Freq[,2])),2]<<-0
word1Freq[as.vector(is.na(word1Freq[,3])),3]<<-0
word1Freq[,2]<<-word1Freq[,2]+word1Freq[,3]
# remove unnecessary 3rd column (creted by the table merge)
word1Freq <<- word1Freq[,-3]
}
rm("t1Grm")
msg("word1Freq table grew by",nrow(word1Freq)-tmp,"lines to",nrow(word1Freq))
# Now generate n-grams
msg("Calculating Bigrams from ",textFileName)
t2Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 2)))
msg("Adding this file's Bigrams to the global word2Freq table")
names(t2Grm) <- c("Word","Frequency")
tmp<-nrow(word2Freq)
if (nrow(word2Freq) ==0 ) {
word2Freq <<- t2Grm
names(word2Freq)[2] <<- "Freq"
} else {
word2Freq <<- merge(word2Freq, t2Grm, by="Word", all=TRUE)
word2Freq[as.vector(is.na(word2Freq[,2])),2]<<-0
word2Freq[as.vector(is.na(word2Freq[,3])),3]<<-0
word2Freq[,2]<<-word2Freq[,2]+word2Freq[,3]
# remove unnecessary 3rd column (creted by the table merge)
word2Freq <<- word2Freq[,-3]
}
rm("t2Grm")
msg("word2Freq table grew by",nrow(word2Freq)-tmp,"lines to",nrow(word2Freq))
msg("Calculating Trigrams from",textFileName)
t3Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 3)))
msg("Adding this file's Trigrams to the global word3Freq table")
tmp<-nrow(word3Freq)
names(t3Grm) <- c("Word","Frequency")
if (nrow(word3Freq) ==0 ) {
word3Freq <<- t3Grm
names(word3Freq)[2] <<- "Freq"
} else {
word3Freq <<- merge(word3Freq, t3Grm, by="Word", all=TRUE)
word3Freq[as.vector(is.na(word3Freq[,2])),2]<<-0
word3Freq[as.vector(is.na(word3Freq[,3])),3]<<-0
word3Freq[,2]<<-word3Freq[,2]+word3Freq[,3]
# remove unnecessary 3rd column (creted by the table merge)
word3Freq <<- word3Freq[,-3]
}
rm("t3Grm")
msg("word3Freq table grew by",nrow(word3Freq)-tmp,"lines to",nrow(word3Freq))
msg("Calculating Four-grams from",textFileName)
t4Grm<-as.data.table(table(make.ngrams(inputText, ngram.size = 4)))
msg("Adding this file's Four-grams to the global word4Freq table")
tmp<-nrow(word4Freq)
names(t4Grm) <- c("Word","Frequency")
if (nrow(word4Freq) ==0 ) {
word4Freq <<- t4Grm
names(word4Freq)[2] <<- "Freq"
} else {
word4Freq <<- merge(word4Freq, t4Grm, by="Word", all=TRUE)
word4Freq[as.vector(is.na(word4Freq[,2])),2]<<-0
word4Freq[as.vector(is.na(word4Freq[,3])),3]<<-0
word4Freq[,2]<<-word4Freq[,2]+word4Freq[,3]
# remove unnecessary 3rd column (creted by the table merge)
word4Freq <<- word4Freq[,-3]
}
rm("t4Grm")
msg("word4Freq table grew by",nrow(word4Freq)-tmp,"lines to",nrow(word4Freq))
} # end processNGrams()
The following function calculates the number of words necessary to cover X% of the sample text.
# Sums the frequency of the first n elements in word1Freq
# until the desired percentage of the total frequency of all words has been reached.
# Parameters: the percentage of the processed text to cover
# Returns: how many words needed to cover 'pct' percentage of the processed text
languageCovered <- function(pct=0) {
n<-0
wordsToCover<-as.integer(sum(word1Freq[,Freq]) * pct)
for (i in 1:nrow(word1Freq)) {
if (n >= wordsToCover)break
n <-n+word1Freq[i,2]
}
msg(i,"words cover",pct*100,"% of the text")
} # end languageCovered()
The main program body is, I believe, self-explanatory.
downloadswiftKeyfiles()
downloadProfanities()
if (!file.exists("word1Freq.RData")) {
profanities <- suppressWarnings(readLines(profFileName))
msg("Loading",length(profanities),"profan expressions, to remove such words from the files.")
processNGrams(newsFileName)
processNGrams(blogFileName)
processNGrams(twitFileName)
rm("profanities")
save(word1Freq,file="word1Freq.RData")
save(word2Freq,file="word2Freq.RData")
save(word3Freq,file="word3Freq.RData")
save(word4Freq,file="word4Freq.RData")
} else {
msg("Loading word frequency tables...")
load("word1Freq.RData")
load("word2Freq.RData")
load("word3Freq.RData")
load("word4Freq.RData")
}
## [1] Loading word frequency tables...
# Prepare tables
setkey(word1Freq, Word)
setkey(word2Freq, Word)
setkey(word3Freq, Word)
setkey(word4Freq, Word)
# Order tables by frequency
setorder(word1Freq, -Freq)
setorder(word2Freq, -Freq)
setorder(word3Freq, -Freq)
setorder(word4Freq, -Freq)
# draw the diagrams
wordFrequencyDiagrams()
The top 20 Uni- Bi– Tri and Four-grams are shown on the diagrams below. Stop words are left in the sample, because for the correct prediction of the next typed word they are needed.
What are the distributions of word frequencies? The distribution is an exponentially decreasing curve.
What are the frequencies of 2-grams and 3-grams in the dataset? These are shown on the wortd frequency diagrams.
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
languageCovered(0.5)
## [1] 115 words cover 50 % of the text
languageCovered(0.9)
## [1] 6618 words cover 90 % of the text
As expected, to cover a linearly increasing percentage of the sampled language, an exponentially increasing number of words are required.
After processing 240,000 lines from the 3 text files, the NGram tables have the following properties:
| NGram | unique phrases | NGrams with frequency of 1 |
|---|---|---|
| 1 | 122,793 | 58,506 (48%) |
| 2 | 1,929,108 | 1,446,300 (75%) |
| 3 | 4,631,088 | 4,127,142 (89%) |
| 4 | 6,041,284 | 5,814,870 (96%) |
How do you evaluate how many of the words come from foreign languages? This could be best achieved by removing all the words the basic form of which are not in an English dictionary. The advantage is that mistyped English words would be removed, too.
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases? Adding words that are not in the corpora would be possible by using a larger sample set, or generating ending for words which only have their basic form on the current data set (for example, add “sitting” to the vocabulary if it currently only contains “sit”). However this would increase the size. Using a smaller number of words would be possible by increasing the processing complexity in the prediction; removing from the dictionary all the different forms of the same basic word (e.g remove “sitting” and leave only “sit”) and then generate the proper ending during the prediction process (e.g. if the user types “addict”, generate the words “addicts”, “addicted”, “addicting”, “addiction”, etc.). This would, however, require high complexity in storing which words cna have what kind of endings and then generate a subset of those that have the highest probablity to be the full word.
How can you efficiently store an n-gram model (think Markov Chains)? The Markov Chains are a more generic solution to what I described above. Chucnks of words pointing to the highest probablity next chucnks could be stored. To be effective, it would require statistical analysis of a really large sample, especially to handle tri- or fourgrams besides single words. It would also help to customize the application’s vocabulary to the user, which would require to analyze what the user types and update the frequencies.
How can you use the knowledge about word frequencies to make your model smaller and more efficient? Tri- and four-grams with 1 occurance give the largest part of their data tables. These could be removed without having major negative impact on the prediction accuracy. Also, words that areused rarely could be removed from the application’s vocabulary, e.g. bi-grams with 2 repetitions of the same character. Knowing exactly what application the user is typing in would be useful, too. In Twitter, a significanlty different set of words is used then in an official document.
How many parameters do you need (i.e. how big is n in your n-gram model)? This may depend on practical considerations for the application. I want to be able to offer the 3 most likely next words for the user. Increasing n will cost me exponentially more memory and computing power, the app will be slow. With too small n my prediction will not be helped by the context. In my model the top 4 uni-grams cover about 8%, 10%, 13%, 15% and 17% of the sample text. It makes practical sense to use n grams where n is between 3 and 6. My decision is to use four-grams at most.
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ? Due to the exponential nature of the frequency diagrams, I would take the logarithm of the frequency to smooth out the probabilities. This would require that words with 0 probabilities have a minimum non-zero probability assigned.
How do you evaluate whether your model is any good? By testing. The application can be further developed to separate test and learning data, and check the accuracy.
How can you use backoff models to estimate the probability of unobserved n-grams? To have a fall back mechnaism in case an n-gram is unobserved, we could construct weighed probabilities probabilities n-1, n-2… grams. To put it simply, if there is an unobserved n-gram fall back to n-1 grams, and continue this until there is a observation or unigrams have been reached.
The shiny application will handle input and output. Next word predictions will need to be calculated after each entry of a character, so the application needs to be fast enough to provide 3 prediction choices, virtually immediately. With the addition of every new word, the application should try to give the best predicition based on the contxt, that is, whenever possible use the Four-grams. If a new word is typed for which there is no matching Four-gram, the backoff procedures should step back to Tri-grams, then Bi-grams and so on, until a match is found. For words for which there is not even a Uni-gram, the word itself should be its own estimate, and as soon as a new word is attached, that one should start further predicitions as a Uni-gram.