The SwiftKey Corpus consists of news articles, blogs, and tweets in English, Russian, German, and Finnish. The purpose of these corpora is to provide natural language data to aid in text prediction. This exploratory analysis will briefly examine these three genres of writing in English, and will also explore some of the German and Finnish data.
Preliminary exploratory analysis suggests that the SwiftKey corpus follows patterns predicted by Zipf’s Law which are common to natural language data. Overall, about 45% of word types occur only once in the data, and about 4.6% of the English word tokens are the word the.
Since Finnish and Russian use non-ASCII characters, we will need the Unicode package to handle them.
library(Unicode)
Next, we set our working directory and load all the data. This is demonstrated for the English corpora, but can be performed on the other languages as well.
setwd("~/Desktop/SwiftKey Corpus/en_US")
en.tweets <- readLines("en_US.twitter.txt")
en.news <- readLines("en_US.news.txt")
en.blogs <- readLines("en_US.blogs.txt")
My prediction algorithm will focus only on words, ignoring case, punctuation, and numbers. Although predicting capitalization, numbers, and punctuation (including emoji) is interesting, this level of complexity will be reserved for later versions of the prediction algorithm. Therefore, it is necessary to remove numbers and extraneous spaces and to normalize the data to lowercase. This process will be demonstrated on the English blog data; however, the same process will be performed on the Twitter and news corpora as well. Getting rid of punctuation will cause contractions like should’ve to become separate single words, as in should and ve.
en.blogs2 <- gsub("\\d","",en.blogs) #get rid of numbers
en.blogs3 <- gsub("\\W", " ", en.blogs2) #get rid of punctuation
en.blogs4 <- gsub("\\s+", " ", en.blogs3) #replace 2+ spaces with one
en.blogs5 <- tolower(en.blogs4) #make everything lowercase
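The four cleaning steps above can be wrapped in a single helper so that the same code runs on the news and Twitter data. This is a minimal sketch rather than the code used for this report: the name cleanLines is hypothetical, and the final trimws() call is an extra step (not in the original pipeline) that removes the leading or trailing space the substitutions can leave behind.

```r
# Hypothetical helper (not from the original analysis): applies the same
# four substitutions shown above to any character vector of lines.
cleanLines <- function(lines){
  lines <- gsub("\\d", "", lines)    # get rid of numbers
  lines <- gsub("\\W", " ", lines)   # replace punctuation with spaces
  lines <- gsub("\\s+", " ", lines)  # collapse runs of whitespace
  lines <- tolower(lines)            # make everything lowercase
  trimws(lines)                      # extra step: strip leading/trailing spaces
}

# Toy input: the contraction is split into two separate tokens
cleaned <- cleanLines("I should've called 2 times!")
cleaned
```

On the toy sentence, the contraction comes out as the two tokens should and ve, as described above.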
In the interest of memory management, we’ll save this cleaned data set and clear the R workspace:
saveRDS(en.blogs5, file="cleanEngBlogs.txt")
rm(list=ls(all=TRUE)) # clear data
clean.en.blogs <- readRDS("cleanEngBlogs.txt")
A unigram is a single word. Again, this process will be demonstrated on the English blog data; however, the same process was performed on all twelve corpora.
en.blogs6 <- strsplit(clean.en.blogs, "\\s") #split on spaces
en.blogs7 <- unlist(en.blogs6) #flatten the list into one vector of words
en.blogs.freq <- table(en.blogs7) #makes a frequency list of every word
sorted.en.blogs.freq <- sort(en.blogs.freq, decreasing=TRUE) #sort the frequency list
write.table(sorted.en.blogs.freq, "enBlogFreqs.csv", sep=",") #make a CSV file with the frequency list
Now that the unigram frequency data have been saved as a .csv file, we can return to this at a later point, without loading or manipulating the original data again. We can also clear the data and start with a fresh console.
rm(list=ls(all=TRUE)) # clear data
setwd("~/Desktop/SwiftKey Corpus/en_US")
en.blog.unigram <- read.csv("enBlogFreqs.csv",col.names=c("word","blogFreq"))
en.news.unigram <- read.csv("enNewsFreqs.csv",col.names=c("word","newsFreq"))
en.tweet.unigram <- read.csv("enTweetFreqs.csv",col.names=c("word","tweetFreq"))
It is important to note that I named the words column as a generic “word”, but the frequency column has a unique name for each corpus. This will be important later on when merging the data.
When predicting text, word order matters. We expect a native English speaker to frequently say the red house and almost never say house red the. Moreover, certain combinations of words (such as the White House) might be more common than others (such as the red house). An n-gram is a sequence of n consecutive words. The functions ngram2, ngram3, and ngram4 will help produce these frequency lists.
ngram2 <- function(corpus){
  ngrams <- vector()
  for(i in 1:length(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W+"))
    temp <- temp[temp != ""] #drop empty strings left by the split
    cat(i/length(corpus)*100, "%\n") #give progress
    if(length(temp) < 2) next #line too short to contain a 2-gram
    for(j in 1:(length(temp)-1)){ #each position that starts a full 2-gram
      ngram <- paste(temp[j], temp[j+1]) #join the words into one string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
ngram3 <- function(corpus){
  ngrams <- vector()
  for(i in 1:length(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W+"))
    temp <- temp[temp != ""] #drop empty strings left by the split
    cat(i/length(corpus)*100, "%\n") #give progress
    if(length(temp) < 3) next #line too short to contain a 3-gram
    for(j in 1:(length(temp)-2)){ #each position that starts a full 3-gram
      ngram <- paste(temp[j], temp[j+1], temp[j+2]) #join the words into one string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
ngram4 <- function(corpus){
  ngrams <- vector()
  for(i in 1:length(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W+"))
    temp <- temp[temp != ""] #drop empty strings left by the split
    cat(i/length(corpus)*100, "%\n") #give progress
    if(length(temp) < 4) next #line too short to contain a 4-gram
    for(j in 1:(length(temp)-3)){ #each position that starts a full 4-gram
      ngram <- paste(temp[j], temp[j+1], temp[j+2], temp[j+3]) #join the words into one string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
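This process was highly inefficient: growing the ngrams vector with append() inside a nested loop copies it repeatedly, and printing progress for every line adds further overhead.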
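One possible way to speed this up is to build all the n-grams of a line at once by pasting shifted copies of its word vector together. The sketch below is an assumption about how that could look, not the code used for this report; the name ngramFast and its arguments are hypothetical.

```r
# Hypothetical alternative to the loops above: build all n-grams of a line
# at once by pasting shifted copies of the word vector together.
ngramFast <- function(corpus, n){
  perLine <- lapply(corpus, function(line){
    words <- unlist(strsplit(line, "\\W+"))
    words <- words[words != ""]                  # drop empty strings
    if(length(words) < n) return(character(0))   # too short for an n-gram
    last <- length(words) - n + 1                # last valid start position
    # shifted[[k]] holds the k-th word of every n-gram in this line
    shifted <- lapply(1:n, function(k) words[k:(last + k - 1)])
    do.call(paste, shifted)                      # join position-wise with spaces
  })
  table(unlist(perLine))                         # count each distinct n-gram
}

# Toy corpus of two already-cleaned lines
toy <- c("the red house", "the white house is the white house")
toyBigrams <- ngramFast(toy, 2)
toyBigrams
```

Because n-grams are built within each line, no n-gram spans a line boundary in this sketch.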
The number of rows tells us how many unique words are in the corpus (word types), while the sum of the freq column tells us how many total words are in the corpus (word tokens).
blogTypes <- nrow(en.blog.unigram); blogTypes
## [1] 256889
blogTokens <- sum(en.blog.unigram$blogFreq); blogTokens
## [1] 37975178
In the English blogs corpus, there are 256,889 unique word types and a total of 37,975,178 word tokens. However, we cannot expect a typical word to occur blogTokens/blogTypes (about 148) times: that mean is inflated by a handful of extremely frequent words, because language data tends to follow a highly skewed distribution.
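To make the effect of the skew concrete, the sketch below first computes the mean number of tokens per type from the two counts reported above, and then compares the mean and median of a synthetic frequency list. The toy counts are invented for illustration and are not SwiftKey data.

```r
# Mean tokens per type implied by the counts reported above
meanPerType <- 37975178 / 256889
meanPerType

# Toy illustration (synthetic counts, not SwiftKey data): 1,000 word types
# whose frequency falls off as 1/rank
toyFreq <- ceiling(10000 / (1:1000))
mean(toyFreq)    # pulled upward by the few very frequent types
median(toyFreq)  # closer to what a "typical" type looks like
```

In a distribution this skewed, the median sits well below the mean, which is why the mean alone says little about a typical word.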
According to Zipf’s Law (http://en.wikipedia.org/wiki/Zipf%27s_law), word frequencies in natural language follow a power-law pattern: a word’s frequency is roughly inversely proportional to its frequency rank. In other words, the most frequent word (or n-gram) can be expected to be about twice as frequent as the second most frequent word (or n-gram), about three times as frequent as the third most frequent word (or n-gram), and so on. This means there is a small set of very highly frequent words and a large set of extremely infrequent words.
Moreover, the highly frequent words also overwhelmingly tend to be function words (pronouns, articles, prepositions, auxiliary verbs, and conjunctions) rather than content words (nouns, verbs, adverbs, and adjectives). Let’s look at the 10 most frequent words in the blogs corpus:
head(en.blog.unigram,10)
## word blogFreq
## 1 the 1860655
## 2 and 1094849
## 3 to 1069558
## 4 i 906685
## 5 a 903931
## 6 of 876836
## 7 in 598749
## 8 it 485362
## 9 that 484196
## 10 is 432763
en.blog.unigram$blogFreq[1]/blogTokens * 100
## [1] 4.899661
sum(en.blog.unigram$blogFreq[1:10])/blogTokens * 100
## [1] 22.94547
As expected, the 10 most frequent words in the blogs corpus are all function words: articles such as the and a, conjunctions such as and, prepositions such as to, of, in, pronouns such as i, it, and that, and the copula verb is. Moreover, these ten words make up 22.9% of all the tokens! Over a fifth of the blog corpus can be accounted for by just these ten words.
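The ten counts printed above can also be compared against an idealized Zipf curve, in which the word at rank r occurs f(1)/r times. The sketch below hard-codes the counts from the output above so that it runs on its own; the variable names top10 and zipfPred are chosen here for illustration. The 1/rank curve is only an approximation, so the observed counts should not be expected to match it exactly.

```r
# Counts copied from the head(en.blog.unigram, 10) output above
top10 <- c(the=1860655, and=1094849, to=1069558, i=906685, a=903931,
           of=876836, "in"=598749, it=485362, that=484196, is=432763)
blogTokens <- 37975178   # total tokens in the blogs corpus, from above

# Idealized Zipf prediction: the word at rank r occurs f(1)/r times
zipfPred <- top10[1] / seq_along(top10)

# Ratio of observed to predicted counts; values near 1 follow the ideal curve
round(top10 / zipfPred, 2)

# Share of all tokens covered by these ten words, in percent
top10Share <- sum(top10) / blogTokens * 100
top10Share
```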
Similar results were found in the other corpora. Among tweets, the most common words are the, i, to, a, you, and, for, it, in, and of, which account for 16.9% of the Twitter corpus. In the news corpus, the most common words are the, to, a, and, of, in, s (left over from ’s once punctuation is removed), that, for, and it, which account for 21.8% of the news corpus.
It is important to note the differences between the three corpora. Tweets, which tend to be more social, contain the pronouns I and you more frequently than the other corpora. Words such as of and that, which are common in relative clauses and other syntactically complex phrases, are more common in the news and blogs corpora. The distributions of these words also differ from one corpus to the next. Therefore, it is important to consider context (news, blog, or tweet) when predicting text.
The least frequent words, which occur only once in a corpus, are known as hapaxes (http://en.wikipedia.org/wiki/Hapax_legomenon), and may make up as much as half of all the word types in the corpus. For example, in a corpus with 1,000,000 unique word types, up to 500,000 may be hapaxes.
hapaxes <- en.blog.unigram$blogFreq==1; table(hapaxes)
## hapaxes
## FALSE TRUE
## 137910 118979
As we can see, 118,979 (46%) of the word types in the English blogs corpus occur only once. This, again, is in line with the heavily skewed distribution described by Zipf’s Law. Similar rates are found in the Twitter corpus (56% hapaxes) and news corpus (39% hapaxes).
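The hapax rate is simply the proportion of frequency values equal to 1. As a small sketch, the hypothetical helper hapaxShare below computes it for any frequency vector, and the blog percentage is re-derived from the counts in the table above.

```r
# Hypothetical helper: proportion of word types that occur exactly once
hapaxShare <- function(freq) mean(freq == 1)

# Check against the table above: 118,979 hapaxes out of
# 137,910 + 118,979 = 256,889 word types
blogHapaxShare <- 118979 / (137910 + 118979)
round(blogHapaxShare * 100)   # percentage of types that are hapaxes

# Toy frequency vector in which 3 of the 5 types are hapaxes
hapaxShare(c(10, 4, 1, 1, 1))
```

With the real data, the same helper could be applied directly to a frequency column such as en.blog.unigram$blogFreq.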
One popular method of visualizing word frequency data is through a word cloud, which can be done in R using library(wordcloud). This is demonstrated below for all three corpora:
From left to right: blogs, tweets, and news. While visually appealing, these plots do not demonstrate any statistical information. The Zipfian distribution is better demonstrated by plotting the (logged) frequencies of each word against their ranked orders:
Here we can see the curve predicted by Zipf’s Law. Again, we can see differences from one corpus to the next. While the is the most common word in all three, social words like thanks and question words like what are more common in the Twitter corpus, due to its more interactive nature. Third person pronouns like they are more common in the news and blogs corpora, because these genres may be more reportative.
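A rank-frequency plot of this kind can be drawn with base R graphics. The following is a minimal sketch, not the code that produced the figures in this report; the helper name plotRankFreq is hypothetical, and the counts are toy values.

```r
# Hypothetical helper: plot log10 frequency against frequency rank
plotRankFreq <- function(freq, ...){
  freq <- sort(freq, decreasing=TRUE)   # rank 1 = most frequent word
  plot(seq_along(freq), log10(freq), type="l",
       xlab="Rank", ylab="log10(frequency)", ...)
  invisible(freq)                       # return the sorted counts
}

# Toy counts only; with the real data the call might instead be
# plotRankFreq(en.blog.unigram$blogFreq, main="Blogs")
toyCounts <- c(50, 1, 3, 20, 1, 8, 2, 1)
sortedCounts <- plotRankFreq(toyCounts, main="Toy data")
```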
First, we’ll load our n-grams:
#blog2gram <- read.csv("enBlog2gramFreqs.csv",col.names=c("word","blog2gramFreq"))
#blog2gram <- blog2gram[ order(-blog2gram$blog2gramFreq), ]
#head(blog2gram)
Unigram frequencies can help overall prediction. For example, given no other information, a highly frequent word such as the is much more likely to occur than an infrequent word like botches. However, raw frequencies alone will not be helpful in prediction. Therefore, the relative frequencies of each word will be used as factors in a machine learning predictive algorithm.
The most basic type of prediction would only use the relative frequency of unigrams. To add more complexity, and predictive power, we can also add 2-grams, 3-grams, or other n-grams as factors. To add even more predictive power, we can consider the context. For example, someone writing a tweet may be more likely to type you than someone writing a news article.
Although context is important, having an overall idea of word frequencies is important as well. We can merge the data into one data set using base R’s merge() function:
English <- merge(en.tweet.unigram, merge(en.blog.unigram, en.news.unigram,all=T), all=T)
English[is.na(English)]=0 #makes NAs zero (common with hapaxes)
English$totalFreq <- English$tweetFreq+English$blogFreq+English$newsFreq
English <- English[ order(-English$totalFreq), ]
englishTypes <- nrow(English); englishTypes
## [1] 546645
englishTokens <- sum(English$totalFreq); englishTokens
## [1] 103489549
head(English, 10)
## word tweetFreq blogFreq newsFreq totalFreq
## 265280 the 937889 1860655 1974508 4773052
## 270832 to 788925 1069558 906168 2764651
## 9117 and 438732 1094849 889538 2423119
## 316 a 617132 903931 894434 2415497
## 123620 i 918750 906685 195398 2020833
## 189296 of 359743 876836 774512 2011091
## 127621 in 380762 598749 679106 1658617
## 132831 it 383721 485362 286537 1155620
## 264903 that 271083 484196 371795 1127074
## 93743 for 385479 363928 353915 1103322
There are over half a million unique word types and over 100 million word tokens in this corpus!
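To see why the shared word column and the corpus-specific frequency columns matter for the merge, here is a toy example with made-up counts. merge() joins on the column name the two tables have in common, and all=TRUE keeps words that appear in only one of them.

```r
# Toy frequency tables (made-up counts) sharing only the "word" column
toyTweets <- data.frame(word=c("the", "you"), tweetFreq=c(5, 3),
                        stringsAsFactors=FALSE)
toyBlogs  <- data.frame(word=c("the", "of"), blogFreq=c(7, 2),
                        stringsAsFactors=FALSE)

# all=TRUE keeps words found in either table (a full outer join)
both <- merge(toyTweets, toyBlogs, all=TRUE)
both[is.na(both)] <- 0   # a word missing from one corpus gets a count of 0
both
```

Had both frequency columns been given the same name, merge() would have treated them as a second join key instead of keeping them side by side.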
Raw numbers do not give an accurate picture of the data. Relative frequencies are more like probabilities, and can be used in a prediction model.
tweetTokens <- sum(English$tweetFreq)
blogTokens <- sum(English$blogFreq)
newsTokens <- sum(English$newsFreq)
English$tweetRelFreq <- English$tweetFreq/tweetTokens
English$blogRelFreq <- English$blogFreq/blogTokens
English$newsRelFreq <- English$newsFreq/newsTokens
English$totalRelFreq <- English$totalFreq/englishTokens
head(English)
## word tweetFreq blogFreq newsFreq totalFreq tweetRelFreq blogRelFreq
## 265280 the 937889 1860655 1974508 4773052 0.03053584 0.04899661
## 270832 to 788925 1069558 906168 2764651 0.02568586 0.02816466
## 9117 and 438732 1094849 889538 2423119 0.01428426 0.02883065
## 316 a 617132 903931 894434 2415497 0.02009262 0.02380321
## 123620 i 918750 906685 195398 2020833 0.02991271 0.02387573
## 189296 of 359743 876836 774512 2011091 0.01171253 0.02308971
## newsRelFreq totalRelFreq
## 265280 0.056738731 0.04612110
## 270832 0.026039308 0.02671430
## 9117 0.025561435 0.02341414
## 316 0.025702124 0.02334049
## 123620 0.005614885 0.01952693
## 189296 0.022256090 0.01943279
Future work will calculate relative frequencies for all the n-gram tables. All the n-gram frequencies will be merged with the unigram frequencies, such that every n-gram or unigram contained in a 4-gram will appear with it in a single row. For example (fake data):
example
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## colname "4gram" "freq" "3gram1" "freq" "3gram2" "freq" "2gram1" "freq"
## data "A B C D" "0.01" "A B C" "0.01" "B C D" "0.01" "A B" "0.01"
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## colname "2gram2" "freq" "2gram3" "freq" "1gram1" "freq" "1gram2" "freq"
## data "B C" "0.01" "C D" "0.01" "A" "0.01" "B" "0.01"
## [,17] [,18] [,19] [,20]
## colname "1gram3" "freq" "1gram4" "freq"
## data "C" "0.01" "D" "0.01"
Hapaxes or extremely infrequent ngrams may be discarded, as they take up nearly half the data set and provide relatively little predictive power at a large memory cost.
The ultimate goal of the prediction algorithm will be to predict “D” given the sequence “A B C” and using the associated frequency probabilities.
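As a rough illustration of that goal, the sketch below predicts the fourth word by simple lookup: it keeps the 4-grams that begin with the given three-word context and returns the final word of the most frequent one. This is a deliberately simplified stand-in, not the planned machine learning model; the helper name predictNext and the frequencies are hypothetical.

```r
# Hypothetical helper: given a named vector of 4-gram frequencies and a
# three-word context, return the most frequent fourth word (or NA).
predictNext <- function(fourgrams, context){
  prefix <- paste0(context, " ")
  # keep only the 4-grams that begin with the context
  hits <- fourgrams[startsWith(names(fourgrams), prefix)]
  if(length(hits) == 0) return(NA_character_)   # context never seen
  best <- names(hits)[which.max(hits)]          # highest-frequency match
  substring(best, nchar(prefix) + 1)            # drop the context, keep word 4
}

# Made-up relative frequencies, for illustration only
toyFourgrams <- c("a b c d"=0.010, "a b c e"=0.002, "b c d a"=0.005)
predictNext(toyFourgrams, "a b c")
```

A fuller model would fall back to 3-grams, 2-grams, and unigrams when the context has not been seen, which is where the merged table described above would come in.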
Test cases will be generated.
Just for fun, I used the same process as above to make word clouds for German and Finnish.
From left to right: blogs, tweets, and news.
Here we see some of the same trends from English: short function words like die and der “the”, ich “I”, and ist “is” stand out as the most common.
From left to right: blogs, tweets, and news.
Once again, we see small function words like ja “and”, on “there is”, and ei “not” which stand out as the most common in Finnish.
Getting Cyrillic fonts to work with wordcloud() would also be awesome for the Russian data! No luck yet; still exploring font packages!