I have spent the past few weeks loading the data we were given and exploring it. After looking through some of it manually, I decided I would have to tokenize it before going any further. I did this by running each of the data sets through the following function, which I built:
tokenizer <- function(data, number) {
    word <- data[sample(length(data), number)] #take 'number' lines at random
    word <- gsub("[^A-Za-z0-9!\\?'#\\. -]", "", word) #what we're keeping
    word <- gsub("([\\.\\?!])", "~\\1~", word) #split at . ? !
    #word <- gsub("(')", "~\\1", word) #split before apostrophe
    word <- unlist(strsplit(word, "[ ~]")) #actually make the split
    word <- gsub("([A-Z]{4,})", "\\L\\1", word, perl=TRUE) #lowercase runs of 4+ capitals
    dcount <- data.frame(table(word))
    dcount <- dcount[grep(".+", dcount$word), ] #remove blank entries
    #should we remove periods? dcount <- dcount[-grep("\\.", dcount$word), ]
    dashi <- grep("-", dcount$word) #all entries containing a dash
    gdashi <- grep("[A-Za-z0-9]+-[A-Za-z0-9]+", dcount$word) #good dashes (inside compound words)
    bdashi <- setdiff(dashi, gdashi) #bad dashes
    if (length(bdashi) > 0) dcount <- dcount[-bdashi, ] #remove bad dashes
    swears <- grep("fuck|shit|damn|crap|bitch", dcount$word, ignore.case=TRUE)
    if (length(swears) > 0) dcount <- dcount[-swears, ] #remove common curse words
    dcount <- dcount[order(dcount$Freq, decreasing=TRUE), ]
    dcount
}
Through some exploratory analysis, I decided to remove any non-standard characters (other than #, for Twitter purposes), any common curse words, and any dashes that were not in the middle of compound words, and to lowercase any run of four or more capital letters.
After doing this, I had data sets with counts for each individual word in each of the original data sets.
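For reference, each raw data set was fed through this function and the resulting counts saved to CSV, roughly like this (the raw file name and sample size below are placeholders, since the loading code isn't shown in this report):
#placeholder file name and sample size; adjust for each data set
twit_raw <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
twitwords <- tokenizer(twit_raw, 500000)
write.csv(twitwords, "pdata/twitwords.csv")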
setwd("C:/Users/Seth/Documents/Coursera/Capstone")
twitwords <- read.csv("pdata/twitwords.csv", stringsAsFactors=F)
newswords <- read.csv("pdata/newswords.csv", stringsAsFactors=F)
blogwords <- read.csv("pdata/blogwords.csv", stringsAsFactors=F)
Now we take a look at the top 20 words in each set.
head(twitwords[,2:3], 20)
## allw Freq
## 1 <> 3219668
## 2 the 629777
## 3 to 576197
## 4 I 465798
## 5 a 431251
## 6 you 360043
## 7 and 302849
## 8 for 279514
## 9 in 268219
## 10 of 263421
## 11 is 254708
## 12 on 197852
## 13 it 194206
## 14 my 187974
## 15 that 158505
## 16 me 140654
## 17 be 133732
## 18 at 129822
## 19 with 124500
## 20 your 118551
head(blogwords[,2:3], 20)
## allw Freq
## 1 <> 1865662
## 2 the 1250829
## 3 to 790398
## 4 and 769895
## 5 of 650967
## 6 a 646321
## 7 I 568598
## 8 in 414325
## 9 that 333202
## 10 is 319054
## 11 for 257532
## 12 it 248515
## 13 was 207017
## 14 with 206814
## 15 on 196890
## 16 you 195533
## 17 my 180896
## 18 have 160749
## 19 be 154232
## 20 this 152533
head(newswords[,2:3], 20)
## allw Freq
## 1 <> 1664147
## 2 the 1288164
## 3 to 669350
## 4 and 639458
## 5 a 630981
## 6 of 574375
## 7 in 471720
## 8 for 251984
## 9 that 246058
## 10 is 211059
## 11 on 191257
## 12 The 189758
## 13 said 186761
## 14 with 183806
## 15 was 171195
## 16 at 149181
## 17 it 129128
## 18 as 129033
## 19 he 125970
## 20 I 117755
These all look like sensible words to find near the top of the list. That leaves us wondering what the overall distribution of word counts looks like. To check, we make a bar plot of the top 100 words in each set.
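A minimal sketch of how such bar plots can be produced with base graphics (the exact plotting code isn't shown here, so the titles and sizing below are just illustrative):
#rough sketch of the three bar plots of the top 100 words
par(mfrow = c(3, 1), mar = c(5, 4, 2, 1))
barplot(twitwords$Freq[1:100], names.arg = twitwords$allw[1:100],
        las = 2, cex.names = 0.5, main = "Top 100 words: Twitter")
barplot(blogwords$Freq[1:100], names.arg = blogwords$allw[1:100],
        las = 2, cex.names = 0.5, main = "Top 100 words: Blogs")
barplot(newswords$Freq[1:100], names.arg = newswords$allw[1:100],
        las = 2, cex.names = 0.5, main = "Top 100 words: News")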
You’ll notice that the first bar flies off the chart, but that is because it is not actually a word: it is the “<>” token we chose to mark whitespace, i.e. a word break. We can therefore ignore it and look only at the rest of the words. After about the first 15 or 20 words there is a precipitous drop in count, and the slope levels out again around the 50th word or so. This tells us that the top 20 or so words make up a very large portion of each corpus.
In the Twitter set, for example, there are about twice as many instances of the top twenty words as there are of all the other words combined:
sum(twitwords[1:20,3])
## [1] 8437241
sum(twitwords[21:nrow(twitwords),3])
## [1] 4701885
Our next step will be constructing similar tables with the counts of various n-grams in the data sets. An n-gram is simply a sequence of n words in order: “in the same order” is a 4-gram, while “the same order” is a 3-gram.
The following function will be my template for creating these tables:
library(stylo)
## stylo version: 0.5.8.2
ngrammer <- function(data, number) {
    word <- data[sample(length(data), number)] #take 'number' lines at random
    word <- gsub("[^A-Za-z0-9!\\?'#\\. -]", "", word) #what we're keeping
    word <- gsub("([\\.\\?!])", "~\\1~", word) #split at . ? !
    word <- gsub("(')", "~\\1", word) #split before apostrophe
    word <- gsub("([A-Z]{4,})", "\\L\\1", word, perl=TRUE) #lowercase runs of 4+ capitals
    allone <- paste(word, collapse = ' ') #join lines with a space so they don't run together
    allw <- txt.to.words(allone, splitting.rule = "[ ~]", preserve.case=TRUE)
    ngrams <- make.ngrams(allw, ngram.size = 3) #change ngram.size to whatever n you want
    dcount <- data.frame(table(ngrams))
    dashi <- grep("-", dcount$ngrams) #all n-grams containing a dash
    gdashi <- grep("[A-Za-z0-9]+-[A-Za-z0-9]+", dcount$ngrams) #good dashes (inside compound words)
    bdashi <- setdiff(dashi, gdashi) #bad dashes
    if (length(bdashi) > 0) dcount <- dcount[-bdashi, ] #remove bad dashes
    junk <- grep("[!\\?#\\.] ?[!\\?#\\.]+", dcount$ngrams) #n-grams that are mostly punctuation
    if (length(junk) > 0) dcount <- dcount[-junk, ]
    dcount <- dcount[order(dcount$Freq, decreasing=TRUE), ]
    dcount
}
I have begun to digest the data with this function, changing the n passed to make.ngrams() to suit whichever table I intend to make. This works well, but I have to break the data sets into chunks of roughly 500,000 entries to be able to feed them through the function. This is not a problem, though it does take a bit of time. The Twitter, news, and blog data sets have 2,360,148, 1,010,242, and 899,288 entries respectively.
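In practice the chunking looks something like the sketch below (the chunk size, the twit_raw object from the placeholder loading code above, and the way the chunks are recombined are all just illustrative):
#rough sketch: run one data set through ngrammer() in chunks of ~500,000 lines
chunk_size <- 500000
starts <- seq(1, length(twit_raw), by = chunk_size)
chunks <- lapply(starts, function(s) {
    piece <- twit_raw[s:min(s + chunk_size - 1, length(twit_raw))]
    ngrammer(piece, length(piece)) #count 3-grams in this chunk
})
#stack the per-chunk tables and add up counts for 3-grams that appear in more than one chunk
all3grams <- do.call(rbind, chunks)
all3grams <- aggregate(Freq ~ ngrams, data = all3grams, FUN = sum)
all3grams <- all3grams[order(all3grams$Freq, decreasing = TRUE), ]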
Once I have finished this, I will begin testing predictive models. Basically, an n-gram predictive model uses a reference table of common n-grams, which it compares against the last few words of a phrase in order to predict the most likely next word. My plan is to explore what levels of accuracy I can achieve using different combinations of n-grams from the different data sets.
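As a very rough illustration of the idea (this is not the final model; it assumes a 3-gram count table like the one ngrammer() produces, with the words of each 3-gram separated by spaces, and the predict_next() helper is mine):
#toy next-word lookup from a 3-gram count table with columns 'ngrams' and 'Freq'
predict_next <- function(ngram_table, w1, w2) {
    #find the 3-grams whose first two words match the two words we were given
    hits <- ngram_table[grep(paste0("^", w1, " ", w2, " "), ngram_table$ngrams), ]
    if (nrow(hits) == 0) return(NA) #no match; a real model would back off to 2-grams
    best <- as.character(hits$ngrams[which.max(hits$Freq)])
    strsplit(best, " ")[[1]][3] #the third word of the most frequent matching 3-gram
}
#e.g. predict_next(all3grams, "thanks", "for") might well return "the"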
Since the language used on Twitter or in text messaging is very different from the language used in, say, the news, I am considering constructing some kind of filter which would test for certain words or n-grams (hashtags or abbreviations, for instance) that are common in the Twitter data set but not in the news or blogs data sets. For example, so far I have noticed that “thanks for the” is one of the most common 3-grams in the Twitter data. I assume this is because it is common on Twitter to say “thanks for the retweet” or “thanks for the follow” when someone interacts with you. I will then build two different predictors, one based on common n-grams from the Twitter set and the other based on common n-grams from the news and blogs sets.
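One possible shape for that filter is sketched below; the marker patterns and the threshold are made-up illustrations of the idea, not anything I have tested yet:
#toy "does this look like Twitter-style text?" check
#the marker patterns and threshold are illustrative guesses, not tuned values
looks_like_twitter <- function(text, markers = c("#\\w+", "\\blol\\b", "\\bomg\\b", "\\brt\\b"),
                               threshold = 1) {
    hits <- sum(sapply(markers, function(m) length(grep(m, text, ignore.case = TRUE))))
    hits >= threshold #TRUE -> use the Twitter predictor, FALSE -> use the news/blogs predictor
}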
Once I do this, I will begin taking random samples of text from the internet and running them through these different models to compare speed and accuracy with different numbers of n-grams and different kinds of language filters.
Wish me luck! Seth