I have spent the past few weeks loading and exploring the data we have been given. After looking through some of it manually, I decided that I would have to tokenize it to move further. I did this by running each of the data sets through the following function, which I built:

tokenizer <- function(data, number) {
    word <- data[sample(length(data), number)] #take a random sample of lines (avoids out-of-range indices)
    word <- gsub("[^A-Za-z0-9!\\?'#\\. -]", "", word) #what we're keeping
    word <- gsub("([\\.\\?!])", "~\\1~", word) #split at . ? !
    #word <- gsub("(')", "~\\1", word) #split before apostrophe
    word <- unlist(strsplit(word, "[ ~]")) #actually make the split
    word <- gsub("([A-Z][A-Z][A-Z][A-Z]+)", "\\L\\1", word, perl=TRUE) #lowercase runs of 4+ capitals
    dcount <- data.frame(table(word))
    dcount <- dcount[grep(".+", dcount$word), ] #remove blank entries
    #should we remove periods? dcount <- dcount[!grepl("\\.", dcount$word), ]
    dashes <- grepl("-", dcount$word) #entries containing a dash
    gooddashes <- grepl("[A-Za-z0-9]+-[A-Za-z0-9]+", dcount$word) #dashes inside compound words
    dcount <- dcount[!(dashes & !gooddashes), ] #drop entries with stray dashes (safe even when there are none)
    dcount <- dcount[!grepl("fuck|shit|damn|crap|bitch", dcount$word, ignore.case=T), ] #drop curse words
    dcount <- dcount[order(dcount$Freq, decreasing=T),]
    dcount
}

Through some exploratory analysis, I decided to remove any non-standard characters (other than #, for twitter purposes), any common curse words, and any dashes which were not in the middle of compound words, and to lowercase any string of 4 or more capital letters in a row.

After doing this, I had data sets with counts for each individual word in each of the original data sets.
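In rough outline, each table was produced and saved along these lines; the raw file name and sample size in this sketch are placeholders rather than the exact values used.

twitraw <- readLines("data/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE) #placeholder path
twitcounts <- tokenizer(twitraw, 500000) #sample 500,000 lines and count the words
write.csv(twitcounts, "pdata/twitwords.csv")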

setwd("C:/Users/Seth/Documents/Coursera/Capstone")
twitwords <- read.csv("pdata/twitwords.csv", stringsAsFactors=F)
newswords <- read.csv("pdata/newswords.csv", stringsAsFactors=F)
blogwords <- read.csv("pdata/blogwords.csv", stringsAsFactors=F)

Now we take a look at the top 20 words in each set.

head(twitwords[,2:3], 20)
##    allw    Freq
## 1    <> 3219668
## 2   the  629777
## 3    to  576197
## 4     I  465798
## 5     a  431251
## 6   you  360043
## 7   and  302849
## 8   for  279514
## 9    in  268219
## 10   of  263421
## 11   is  254708
## 12   on  197852
## 13   it  194206
## 14   my  187974
## 15 that  158505
## 16   me  140654
## 17   be  133732
## 18   at  129822
## 19 with  124500
## 20 your  118551
head(blogwords[,2:3], 20)
##    allw    Freq
## 1    <> 1865662
## 2   the 1250829
## 3    to  790398
## 4   and  769895
## 5    of  650967
## 6     a  646321
## 7     I  568598
## 8    in  414325
## 9  that  333202
## 10   is  319054
## 11  for  257532
## 12   it  248515
## 13  was  207017
## 14 with  206814
## 15   on  196890
## 16  you  195533
## 17   my  180896
## 18 have  160749
## 19   be  154232
## 20 this  152533
head(newswords[,2:3], 20)
##    allw    Freq
## 1    <> 1664147
## 2   the 1288164
## 3    to  669350
## 4   and  639458
## 5     a  630981
## 6    of  574375
## 7    in  471720
## 8   for  251984
## 9  that  246058
## 10   is  211059
## 11   on  191257
## 12  The  189758
## 13 said  186761
## 14 with  183806
## 15  was  171195
## 16   at  149181
## 17   it  129128
## 18   as  129033
## 19   he  125970
## 20    I  117755

These all look like sensible words to be near the top of the list, which leaves us wondering what the distribution of word frequencies looks like. To check this out, we do a bar plot of the top 100 words in each set.
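For reference, a bar plot of this kind can be drawn along these lines (shown for the twitter set; the styling choices are just one option):

top100 <- twitwords[1:100, ]
barplot(top100$Freq, names.arg=top100$allw, las=2, cex.names=0.5,
        main="Top 100 words in the twitter data", ylab="Count")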

You’ll notice that the first word flies off the chart, but that is because it is not in fact a word: it is the “<>” marker we have chosen to stand for whitespace or a word break. Therefore we can ignore it and look only at the rest of the words. You can see that after about the first 15 or 20 words there is a precipitous drop in count, and then the slope levels out again around 50 or so. This tells us that the top 20 or so words make up a very large portion of the corpus.

In twitter, for example, we can compare the total count for the top twenty entries with the total for everything below them:

sum(twitwords[1:20,3])
## [1] 8437241
sum(twitwords[21:nrow(twitwords),3])
## [1] 4701885

N-grams

Our next step will be constructing similar tables with the counts of various n-grams in the data sets. An n-gram is simply a sequence of n consecutive words, in order. For instance, “in the same order” is a 4-gram, while “the same order” is a 3-gram.
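As a quick concrete illustration with the stylo package used below:

library(stylo)
make.ngrams(c("in", "the", "same", "order"), 3) #yields "in the same" "the same order"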

The following function will be my template for creating these tables:

library(stylo)
## stylo version: 0.5.8.2
ngrammer <- function(data, number) {
    word <- data[sample(length(data), number)] #take a random sample of lines (avoids out-of-range indices)
    word <- gsub("[^A-Za-z0-9!\\?'#\\. -]", "", word) #what we're keeping
    word <- gsub("([\\.\\?!])", "~\\1~", word) #split at . ? !
    word <- gsub("(')", "~\\1", word) #split before apostrophe
    word <- gsub("([A-Z][A-Z][A-Z][A-Z]+)", "\\L\\1", word, perl=TRUE) #lowercase runs of 4+ capitals
    allone <- paste(word, collapse = ' ') #join with spaces so words from adjacent lines don't run together
    allw <- txt.to.words(allone, splitting.rule = "[ ~]", preserve.case=T)
    ngrams <- make.ngrams(allw, 3) #change the 3 to whatever n you want
    dcount <- data.frame(table(ngrams))
    dashes <- grepl("-", dcount$ngrams) #n-grams containing a dash
    gooddashes <- grepl("[A-Za-z0-9]+-[A-Za-z0-9]+", dcount$ngrams) #dashes inside compound words
    dcount <- dcount[!(dashes & !gooddashes), ] #drop n-grams with stray dashes
    dcount <- dcount[!grepl("[!\\?#\\.] ?[!\\?#\\.]+", dcount$ngrams), ] #drop runs of punctuation
    dcount <- dcount[order(dcount$Freq, decreasing=T),]
    dcount
}

I have begun to digest the data with this function, changing the n passed to make.ngrams() to suit whichever table I intend to make. This works well, but I have to break the data sets into chunks of roughly 500,000 entries to be able to feed them through the function. This is not a problem, though it does take a bit of time. The twitter, news, and blog data sets have 2,360,148, 1,010,242, and 899,288 entries respectively.
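In rough outline, the chunked processing looks something like this for the twitter 3-grams; the twitraw vector, the chunk size, and the re-aggregation step at the end are scaffolding for the sketch rather than the exact code used.

chunksize <- 500000
starts <- seq(1, length(twitraw), by=chunksize)
pieces <- lapply(starts, function(s) {
    chunk <- twitraw[s:min(s + chunksize - 1, length(twitraw))]
    ngrammer(chunk, length(chunk)) #count 3-grams in this chunk
})
twit3grams <- do.call(rbind, pieces)
#the same n-gram can show up in several chunks, so re-aggregate the counts
twit3grams <- aggregate(Freq ~ ngrams, data=twit3grams, FUN=sum)
twit3grams <- twit3grams[order(twit3grams$Freq, decreasing=T), ]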

Once I have finished this, I will begin testing predictive models. Basically, an n-gram predictive model keeps a reference table of common n-grams; given the last few words of a phrase, it looks up the n-grams that begin with those words and predicts the most likely next word. My plan is to explore what levels of accuracy I can achieve using different combinations of n-grams from the different data sets.
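To make that concrete, a minimal next-word lookup against a 3-gram table might look like the sketch below; the function name, table name, and return format are placeholders rather than the final design.

predictnext <- function(lasttwo, ngramtable, nguesses=3) {
    #keep only the 3-grams that start with the last two words typed
    hits <- ngramtable[grep(paste0("^", lasttwo, " "), ngramtable$ngrams), ]
    hits <- hits[order(hits$Freq, decreasing=T), ]
    #the prediction is the final word of each of the most frequent matches
    sub("^.* ", "", head(as.character(hits$ngrams), nguesses))
}
predictnext("thanks for", twit3grams) #e.g. might suggest "the"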

Since the language used on Twitter or in text messaging is very different from the language used in, say, the news, I am considering constructing some kind of filter which would test for certain words or n-grams (hashtags or abbreviations, for instance) that are common in the twitter data set but not in the news or blogs data sets. For example, so far I have noticed that “thanks for the” is one of the most common 3-grams in the twitter data. I assume this is because it is common on twitter to say “thanks for the retweet” or “thanks for the follow” when someone interacts with you. Then I will essentially build two different predictors: one based on common n-grams from the twitter set, and the other based on common n-grams from the news and blogs sets.
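As a very rough sketch of that kind of filter (the marker list and the threshold are placeholders I have not tested against the data):

istwittery <- function(text, threshold=0.05) {
    #score a snippet by the share of twitter-flavored tokens it contains
    tokens <- unlist(strsplit(tolower(text), " +"))
    markers <- grepl("^#|^@|^(lol|omg|rt|thx|dm)$", tokens) #placeholder marker set
    mean(markers) > threshold
}
istwittery("thanks for the follow! #blessed") #TRUE
istwittery("The city council approved the measure on Tuesday.") #FALSE

The idea would be to route an input to the twitter-based predictor when the score is high, and to the news/blogs-based predictor otherwise.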

Once I do this, I will begin taking random samples of text from the internet and running them through these different models to compare speed and accuracy with different numbers of n-grams and different kinds of language filters.

Wish me luck! Seth