This report explores a corpus of tweets, news items, and blogs. The data are provided as part of a capstone project in data science. The eventual goal is to use these texts to develop a predictive language model. Prior to developing any data-based model, it’s important to understand the characteristics of the data.
Reading in each of the data files is straightforward: just use scan(). With sep = "\n", each line of the file becomes one element of a character vector, so tweetsL[1] shows the first tweet in the Twitter file, newsL[5] shows the fifth news item, and so on.
tweetsL <- scan("en_US.twitter.txt", character(0), sep = "\n")
newsL <- scan("en_US.news.txt", character(0), sep = "\n")
blogsL <- scan("en_US.blogs.txt", character(0), sep = "\n")
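As a quick sanity check, length() reports how many items were read from each file; something like the following (a minimal sketch) should match the item counts in the summary table later in this report.
length(tweetsL)   ## number of tweets read in
length(newsL)     ## number of news items
length(blogsL)    ## number of blog posts
tweetsL[1]        ## the first tweet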
There are numerous cleaning operations that are typically performed on English text. Perhaps the most basic is to convert all characters to lower case. Although that can lose some information (e.g., distinguishing Peter the person from peter the verb, as in peter out), for the purposes of exploratory data analysis, it’s an acceptable tradeoff. To accomplish this, the tolower() function works well.
To look at unique words and word frequencies, it’s best to convert the data from its item-organization to a vector of words. That is, tokenize the text. First the text needs to be turned into a list. Then the list into a vector, then the blanks removed. I created a tokenizer function for this purpose. I also included the tolower() operation as part of the tokenizer function.
tokenizer <- function(fileName) {
    ## 1. Read the data, one document per line
    file.lines <- scan(fileName, character(0), sep = "\n")
    ## 2. Change the text to lower case
    file.lines.lower <- tolower(file.lines)
    ## 3. Split each line into words on non-word characters, giving a list
    file.list <- strsplit(file.lines.lower, "\\W")
    ## 4. Flatten the list into a single character vector
    file.vector <- unlist(file.list)
    ## 5. Determine which items are not blanks
    file.notBlanks <- which(file.vector != "")
    ## 6. Keep only the words and return them explicitly
    fileWords <- file.vector[file.notBlanks]
    fileWords
}
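With the function in hand, each corpus can be tokenized in one call, along these lines (the variable names for the news and blog word vectors are just placeholders; tweetWords is the vector used in the profanity filtering below):
tweetWords <- tokenizer("en_US.twitter.txt")
newsWords  <- tokenizer("en_US.news.txt")    ## newsWords and blogWords are placeholder names
blogWords  <- tokenizer("en_US.blogs.txt")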
After the texts were tokenized, I filtered out the profanity. Profanity is difficult to define due to sociopolitical issues, so I kept it basic. I used the words specified in George Carlin’s Seven Dirty Words (see http://en.wikipedia.org/wiki/Seven_dirty_words). Here is an example of the filtering:
goodWordLocations <- !(tweetWords %in% badWords)
tweetsGoodWords <- tweetWords[goodWordLocations]
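Here badWords is a lower-case character vector containing the seven Carlin words; one simple way to build it (a sketch, with a hypothetical file name) is to keep the words in a local one-word-per-line file:
## badwords.txt is a hypothetical local file, one word per line
badWords <- tolower(scan("badwords.txt", character(0), sep = "\n"))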
I also used textcat() to try to detect foreign-language text, but the results were less than satisfying: a large percentage of clearly English texts were classified as some other language, such as Esperanto, Irish, or Scottish. I'll need to investigate this area further.
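For reference, the language-detection check looked roughly like this (classifying only a sample of tweets for speed; the sample size is arbitrary):
library(textcat)
## Classify a sample of tweets; many genuinely English tweets come back
## labelled as other languages, which is why the results were unsatisfying
tweetLangs <- textcat(head(tweetsL, 1000))
table(tweetLangs)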
Let’s take a look at some of the basic characteristics of the three bodies of text.
| Text type | Total items | Total words | Unique words (% of total words) | Profane words (% of total words) |
|---|---|---|---|---|
| tweets | 2360148 | 31003119 | 327525 (1%) | 31262 (0.1%) |
| news | 1010242 | 35624448 | 230205 (0.6%) | 12 (negligible) |
| blogs | 99288 | 38308421 | 271867 (0.7%) | 4481 (0.01%) |
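The table entries can be computed directly from the item and word vectors; for example, for tweets (a sketch, not necessarily the exact code used):
length(tweetsL)                ## total items
length(tweetWords)             ## total words
length(unique(tweetWords))     ## unique words
sum(tweetWords %in% badWords)  ## profane words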
Tweets rank first in both unique words and profanity. One reason for the high proportion of unique words in tweets is that people often disregard proper spelling in favor of shortened versions of a word. There are many one-letter "words": for example, "okay" becomes "k" and "are you" becomes "r u". In addition, sequences of letters are used for purely expressive purposes. For example, one tweet described what it was like to listen to a lecture with a string of letters (such as "dlkjelkdljklkjldfjd") to indicate the lecturer was incomprehensible.
How should these meaningful, but not-a-word “utterances” be dealt with? I decided they should stay in the corpus because that’s the language of the tweeting community. I think this characteristic also means that it is not wise to combine tweets with news or blogs to get a bigger corpus. Tweets are almost a world unto themselves.
Next I looked at the relative frequencies of words. In particular, I plotted the 20 most frequent words in each corpus to see whether the frequencies dropped off according to Zipf's law, which states that a word's frequency is roughly inversely proportional to its rank, i.e., frequency and rank follow a power-law relationship. In simple terms, word frequency drops off sharply rather than declining as a smooth ramp. As you can see in these figures, the word frequencies in these texts show the characteristic sharp drop-off.
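For one corpus, the frequency plot can be produced along these lines (a sketch; the plotting details are illustrative):
tweetFreq <- sort(table(tweetWords), decreasing = TRUE)
barplot(head(tweetFreq, 20), las = 2, cex.names = 0.7,
        main = "Top 20 words in tweets", ylab = "Frequency")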
The power-law relationship is an interesting one, and it led me to ask: how many unique words are needed to cover 50% and 90% of all the word occurrences in each corpus, and how much of the text do the top 20 words alone account for?
| Text type | Unique words covering 50% of text | Unique words covering 90% of text | % of text covered by top 20 words |
|---|---|---|---|
| tweets | 114 | 5027 | 26.8% |
| news | 188 | 7495 | 27.7% |
| blogs | 104 | 6213 | 30.1% |
As you can see, a relatively small set of words is used over and over: on the order of one or two hundred unique words covers half of each corpus, and a few thousand cover 90%. No single word dominates, but the heavy reuse of a small core vocabulary is an encouraging property for a prediction model.
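The coverage figures in the table can be computed from the sorted frequency table; a sketch for tweets (the same approach applies to news and blogs):
tweetFreq <- sort(table(tweetWords), decreasing = TRUE)
coverage  <- cumsum(as.numeric(tweetFreq)) / sum(tweetFreq)
which(coverage >= 0.5)[1]              ## unique words needed to cover 50% of the text
which(coverage >= 0.9)[1]              ## unique words needed to cover 90% of the text
sum(tweetFreq[1:20]) / sum(tweetFreq)  ## share of the text covered by the top 20 words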
The next step is to design and build a model. I plan to take a string as input and provide from one to three words as output. That is, predict the next word for a given string, similar to what a keyboard prediction system might do. To keep the problem in scope, the user will need to type the string and then press a button to initiate the prediction. So this will not be a "predict as you type" system; it will be a prototype to demonstrate the prediction model. Even though the prototype won't operate in true real time, it should be near real time.
In designing the model, I need to take the following into consideration:
.