Overview

This report explores a corpus of tweets, news items, and blogs. The data are provided as part of a capstone project in data science. The eventual goal is to use these texts to develop a predictive language model. Prior to developing any data-based model, it’s important to understand the characteristics of the data.

Reading and Cleaning the Data

Reading in each of the data files is straightforward with scan(). Using sep = "\n", each new-line-delimited entry in the file becomes one element of the resulting character vector. So tweetsL[1] shows the first tweet in the Twitter file, newsL[5] shows the fifth news item, and so on.

tweetsL <- scan("en_US.twitter.txt", character(0), sep = "\n")
newsL   <- scan("en_US.news.txt", character(0), sep = "\n")
blogsL  <- scan("en_US.blogs.txt", character(0), sep = "\n")
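
A quick spot check confirms that the lines were read as expected:

tweetsL[1]       # the first tweet in the Twitter file
newsL[5]         # the fifth news item
length(blogsL)   # the number of blog entries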

There are numerous cleaning operations that are typically performed on English text. Perhaps the most basic is to convert all characters to lower case. Although that can lose some information (e.g., distinguishing Peter the person from peter the verb, as in peter out), for the purposes of exploratory data analysis, it’s an acceptable tradeoff. To accomplish this, the tolower() function works well.

To look at unique words and word frequencies, it’s best to convert the data from its per-item organization to a vector of words, that is, to tokenize the text. The text first needs to be split into a list of words, then the list flattened into a vector, and then the blanks removed. I created a tokenizer function for this purpose and included the tolower() operation in it.

tokenizer <- function(fileName) {
    ## 1. Read the data, one document per line
    file.lines <- scan(fileName, character(0), sep = "\n")
    ## 2. Convert the text to lowercase
    file.lines.lower <- tolower(file.lines)
    ## 3. Split each line into words on non-word characters
    file.list <- strsplit(file.lines.lower, "\\W")
    ## 4. Flatten the list into a single character vector
    file.vector <- unlist(file.list)
    ## 5. Locate the entries that are not blanks
    ##    (runs of non-word characters leave empty strings behind)
    file.notBlanks <- which(file.vector != "")
    ## 6. Keep only the words and return them
    file.vector[file.notBlanks]
}
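
With the function in place, each corpus can be tokenized in one call. The tweetWords object is used in the filtering example below; the newsWords and blogWords names are my own:

tweetWords <- tokenizer("en_US.twitter.txt")
newsWords  <- tokenizer("en_US.news.txt")
blogWords  <- tokenizer("en_US.blogs.txt")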

After the texts were tokenized, I filtered out the profanity. Profanity is difficult to define due to sociopolitical issues, so I kept it basic. I used the words specified in George Carlin’s Seven Dirty Words (see http://en.wikipedia.org/wiki/Seven_dirty_words). Here is an example of the filtering:

goodWordLocations <- !(tweetWords %in% badWords)
tweetsGoodWords <- tweetWords[goodWordLocations]
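
Here badWords is simply a character vector holding the seven terms from the article linked above, and the same filter applies to the other corpora (the news and blog object names below are mine):

badWords <- c("shit", "piss", "fuck", "cunt",
              "cocksucker", "motherfucker", "tits")
newsGoodWords  <- newsWords[!(newsWords %in% badWords)]
blogsGoodWords <- blogWords[!(blogWords %in% badWords)]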

I also used textcat() to try to detect foreign-language text, but the results were less than satisfying: a large percentage of English texts were classified as some other language, such as Esperanto, Irish, or Scottish. I’ll need to investigate this area further.
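
The check looked roughly like this (a sketch assuming the textcat package is installed; the sample size is an arbitrary choice of mine):

library(textcat)
## Guess the language of a sample of tweets and tally the results.
langGuess <- textcat(tweetsL[1:1000])
head(sort(table(langGuess), decreasing = TRUE))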

Exploring the Data

Let’s take a look at some of the basic characteristics of the three bodies of text.

Text type    Total items    Total words    Unique words      Profane words
tweets         2,360,148     31,003,119    327,525 (1%)      31,262 (0.1%)
news           1,010,242     35,624,448    230,205 (0.6%)    12 (negligible)
blogs             99,288     38,308,421    271,867 (0.7%)    4,481 (0.01%)
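
Counts like these can be computed directly from the objects created above; a sketch for the tweets (news and blogs follow the same pattern):

totalItems  <- length(tweetsL)                 # entries read by scan()
totalWords  <- length(tweetWords)              # tokens from tokenizer()
uniqueWords <- length(unique(tweetWords))      # distinct tokens
profane     <- sum(tweetWords %in% badWords)   # tokens on the filter list
uniqueWords / totalWords                       # unique words as a share of all words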

Tweets are first in both unique words and profanity. One reason for the number of unique words in tweets is that people often disregard proper spelling in favor of shortened versions of a word, so there are many one-letter “words”: for example, “okay” becomes “k” while “are you” becomes “r u”. In addition, sequences of letters are used for purely expressive purposes. For example, one tweet described what it was like to listen to a lecture with a string of letters (such as “dlkjelkdljklkjldfjd”) to indicate that the lecturer was incomprehensible.

How should these meaningful, but not-a-word “utterances” be dealt with? I decided they should stay in the corpus because that’s the language of the tweeting community. I think this characteristic also means that it is not wise to combine tweets with news or blogs to get a bigger corpus. Tweets are almost a world unto themselves.

Next I looked at the relative frequencies of words. In particular, I plotted the 20 most frequent words in each corpus to see whether the frequencies dropped off according to Zipf’s law. The law states that word frequency and rank follow a power-law relationship; in simple terms, word frequency drops off sharply rather than declining along a smooth ramp. As you can see in these figures, the word frequencies in these texts show the characteristic sharp drop-off.

[Figures: top-20 word frequency plots for tweets, news, and blogs]
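
A sketch of how such a plot can be produced for the tweets (the other corpora are analogous); the plot title and layout here are my own choices:

tweetFreq <- sort(table(tweetsGoodWords), decreasing = TRUE)
barplot(tweetFreq[1:20], las = 2,
        main = "Top 20 words in tweets", ylab = "Frequency")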

The power-law relationship is an interesting one, and it led me to ask how many unique words are needed to cover 50% and 90% of all word instances in each corpus, and how much of each corpus the 20 most frequent words represent:

Text type    Words to cover 50%    Words to cover 90%    % represented by top 20
tweets                      114                 5,027                      26.8%
news                        188                 7,495                      27.7%
blogs                       104                 6,213                      30.1%
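
Figures like these can be computed from cumulative sums of the sorted word frequencies; a sketch, again for the tweets:

tweetFreq <- sort(table(tweetsGoodWords), decreasing = TRUE)
coverage  <- cumsum(tweetFreq) / sum(tweetFreq)
min(which(coverage >= 0.5))             # unique words needed to cover 50% of instances
min(which(coverage >= 0.9))             # unique words needed to cover 90% of instances
sum(tweetFreq[1:20]) / sum(tweetFreq)   # share of all instances in the top 20 words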

As you can see, a lot of the same words are used over and over again. Even though no single word is used that much, the heavy reuse of a relatively small set of words is quite interesting.

Next Steps: Designing a Model

The next step is to design and build a model. I plan to take a string as input and provide from one to three words as output; that is, to predict the next word for a given string, similar to what a keyboard prediction system does. To keep the problem in scope, the user will need to type the string and then press a button to initiate the prediction, so this will not be a “predict as you type” system. It will be a prototype to demonstrate the prediction model. Even though the prototype won’t operate in true real time, it should be near real time.

In designing the model, I need to take the following into consideration:

  1. The corpora are too large to process effectively on my personal computer. I will either need to use a subset of the data for the final model, or process the data in chunks and then combine the results (the sketch after this list starts from just such a sample).
  2. The corpora do not represent all possible English words. According to the Global Language Monitor, the number of words in the English language exceeds 1,025,000, so the texts I examined use roughly one-quarter to one-third of the possible words. There are techniques that account for these “known unknowns” by adjusting the probabilities of the words in the corpora, but that’s not going to help predict a specific word that isn’t in the corpora. I will need a fallback for when the model is stumped.
  3. Although each corpus has many unique words (hundreds of thousands), 90% of the word instances in each corpus can be covered by a few thousand words. Perhaps the rarely used words outside that 90% can be factored out to allow for a smaller, faster model.
  4. Single-word frequency (unigrams) isn’t a good predictor, but n-grams are. My model will use a combination of n-grams of different lengths, with matches on longer n-grams weighted more heavily than shorter ones, since exact matches for long n-grams are likely to represent idiomatic expressions (a rough back-off sketch follows this list).
  5. Humans use context to determine meaning. My model will not extract semantic meaning or perform sentiment analysis; it will simply look up n-gram matches. I recognize that such processing could improve the model, especially for complicated sentences (for example, compound conditional clauses), but given the limited time of the Capstone course, it is necessary to set clear boundaries.
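
As a starting point for items 1 and 4, the sketch below samples the tweets, builds bigram and trigram frequency tables, and backs off from trigrams to bigrams when looking up a prediction. The sampling rate, the paste()-based n-gram construction, and all object and function names here are my own illustration, not a finished design:

set.seed(1234)
## 1. Work from a 5% sample of the tweets to keep memory use manageable.
sampleLines <- tweetsL[rbinom(length(tweetsL), 1, 0.05) == 1]
sampleWords <- unlist(strsplit(tolower(sampleLines), "\\W"))
sampleWords <- sampleWords[sampleWords != ""]

## 2. Build n-gram frequency tables by pasting adjacent words together.
n <- length(sampleWords)
bigrams  <- paste(sampleWords[-n], sampleWords[-1])
trigrams <- paste(sampleWords[1:(n - 2)], sampleWords[2:(n - 1)], sampleWords[3:n])
bigramFreq  <- sort(table(bigrams),  decreasing = TRUE)
trigramFreq <- sort(table(trigrams), decreasing = TRUE)

## 3. Predict the next word: try trigrams first, then back off to bigrams.
predictNext <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\W"))
    words <- words[words != ""]
    hits  <- NULL
    if (length(words) >= 2) {
        last2 <- paste(tail(words, 2), collapse = " ")
        hits  <- trigramFreq[grep(paste0("^", last2, " "), names(trigramFreq))]
    }
    if (length(hits) == 0 && length(words) >= 1) {
        hits <- bigramFreq[grep(paste0("^", tail(words, 1), " "), names(bigramFreq))]
    }
    if (length(hits) == 0) return(NA)
    ## The tables are sorted by frequency, so the first match is the most
    ## common n-gram; its final word is the prediction.
    tail(strsplit(names(hits)[1], " ")[[1]], 1)
}

predictNext("thanks for the")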
