This is the Milestone Report for the Coursera capstone course. We’ll be working with English text data from three sources: Twitter, blogs, and online news. The end goal is to create a predictive model that suggests the next word to a user as they type. This report covers just the first few steps along that path. We’ll begin by reading a random subset of the data into R and taking some very basic measurements of it. Our random sample should be reasonably representative of the full data set.
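The code that read and sampled the raw files isn’t shown in this version of the report; a minimal sketch of one way it could be done (the file paths, the 10% sampling rate, and the seed are assumptions for illustration, not details from the report) would be:
# Sketch: read each raw file and keep a random sample of its lines
set.seed(1234)                                  # assumed seed, for reproducibility
read_sample <- function(path, frac = 0.1) {     # assumed 10% sampling rate
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[as.logical(rbinom(length(lines), 1, frac))]
}
blogs <- read_sample("final/en_US/en_US.blogs.txt")   # assumed paths
news  <- read_sample("final/en_US/en_US.news.txt")
twit  <- read_sample("final/en_US/en_US.twitter.txt")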
sources <- c("News", "Blogs", "Twitter")
words <- c(sum(stri_count_words(news)), sum(stri_count_words(blogs)), sum(stri_count_words(twit)) )
lines <- c(length(news), length(blogs), length(twit))
wordByLine <- c(mean(stri_count_words(news)), mean(stri_count_words(blogs)), mean(stri_count_words(twit)))
sum_tab <- data.frame(Source = sources,
TotalLines = lines, TotalWords = words, WordsPerLine = wordByLine)
sum_tab
## Source TotalLines TotalWords WordsPerLine
## 1 News 101024 3483769 34.48457
## 2 Blogs 89928 3768689 41.90785
## 3 Twitter 236014 3006396 12.73821
Twitter has the largest number of lines but the fewest words per line. This makes sense to anyone familiar with the platform’s character limit per message. Blogs seem to lack brevity, which also doesn’t come as a surprise. Blogs have fewer total lines than the news data but more total words.
Let’s compare the distribution of words per Tweet with the distribution of words per news item.
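The plots themselves aren’t reproduced in this text version; a minimal sketch of how the two distributions could be drawn with base graphics (the bin counts are a guess) is:
# Side-by-side histograms of words per line for Twitter and news
par(mfrow = c(1, 2))
hist(stri_count_words(twit), breaks = 50, main = "Words per Tweet", xlab = "Words")
hist(stri_count_words(news), breaks = 100, main = "Words per news item", xlab = "Words")
par(mfrow = c(1, 1))
We see that Tweets, limited by their character count, rarely contain more than 40 words. The distribution for news items is far more spread out: most items contain 100 or fewer words, but there are outliers with over 1500 words.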
At this point I’ve decided to combine all three texts, clean them using regular expressions, and then look for patterns that occur across all three. There are many R functions and packages that perform similar tasks, and I’ll use some of them later, but I’ve also found there’s great benefit to learning by doing.
# combine the three samples into a single character vector
combi.sub <- c(blogs, news, twit)
# find lines containing chains of 3 or more hyphenated words, then replace their hyphens with spaces
hyph <- grep("([A-Za-z]+-[A-Za-z]+){3,}", combi.sub)
combi.sub[hyph] <- gsub("-", " ", combi.sub[hyph])
# split remaining hyphenated words with at least 4 letters on the left and 3 on the right
combi.sub <- gsub("([A-Za-z]{4,})-([A-Za-z]{3,})", "\\1 \\2", combi.sub)
# remove apostrophes (curly and straight)
combi.sub <- gsub("’|\\'", "", combi.sub)
# replace anything non-alphanumeric other than space, period and $, plus any run of 2+ periods, with a space
combi.sub <- gsub("[^[:alnum:] .$]|\\.\\.+", " ", combi.sub)
combi.sub <- gsub("\\.$", "", combi.sub)             # remove a period at the end of a line
combi.sub <- gsub("\\. ([A-Z])", " \\1", combi.sub)  # drop a sentence-ending period before a space and a capital letter
Now that our data is a little cleaner, we can look into how often word pairs and triplets occur. We call these bigrams and trigrams, respectively. Our purpose here is exploratory, but these co-occurring words will play an important role in building any predictive model. We’ll look at the top 10 most frequently occurring bigrams and trigrams in our data.
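The n-gram code isn’t shown in this version of the report; a minimal sketch of one way to tally bigrams and trigrams in base R (the count_ngrams helper is illustrative, not the exact code used, and it will be slow on the full sample) would be:
# Count all n-word sequences across the cleaned lines and sort by frequency
count_ngrams <- function(lines, n) {
  tokens_by_line <- strsplit(tolower(lines), "\\s+")
  ngrams <- unlist(lapply(tokens_by_line, function(tok) {
    tok <- tok[tok != ""]
    if (length(tok) < n) return(character(0))
    sapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}
head(count_ngrams(combi.sub, 2), 10)  # top 10 bigrams
head(count_ngrams(combi.sub, 3), 10)  # top 10 trigrams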
## of the in the to the for the on the to be at the and the
## 42974 41435 21809 20131 19821 16502 14409 12695
## in a with the
## 11815 10619
## one of the a lot of thanks for the to be a going to be
## 3446 3046 2353 1851 1808
## the end of out of the i want to as well as it was a
## 1465 1461 1438 1420 1392
We see many common phrase fragments and many occurrences of the definite article “the.” It’s interesting to note that 25.3% of all bigrams and just 11.3% of all trigrams appear more than once in the data.
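Those shares can be computed directly from the n-gram tallies; for example, using the count_ngrams sketch above:
mean(count_ngrams(combi.sub, 2) > 1)  # share of distinct bigrams appearing more than once
mean(count_ngrams(combi.sub, 3) > 1)  # share of distinct trigrams appearing more than once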
Let’s take a look at how many words are made up, completely or partially, of characters foreign to the English alphabet.
combi.token <- unlist(strsplit(combi.sub, " "))  # split the cleaned text into individual tokens
# iconv() swaps any character it can't convert to ASCII for the marker string "combi.token";
# grep() then flags tokens containing that marker, i.e. tokens with non-ASCII characters
foreign <- grep('combi.token', iconv(combi.token, 'latin1', 'ASCII', sub = 'combi.token'))
combi.token[head(foreign)]  # take a look at a few of the foreign words
## [1] "agéd" "bâri" "querétaro" "métis" "métis" "métis"
There are only 2916 foreign words in our data set, which contains millions of words. Their occurrence is so low that they shouldn’t play a major role in our future model. Additionally, many are familiar to English readers even with slightly different spellings: cafe vs. café, or jalapenos vs. jalapeños.
Let’s look at the most commonly occurring words and their coverage of the entire data set. We’ll view this two ways: first with a word cloud and then with a frequency table.
library(tm)    # VCorpus, tm_map, TermDocumentMatrix
library(slam)  # row_sums for sparse term-document matrices
corpus <- VCorpus(VectorSource(combi.sub))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)                        # build the term-document matrix
rowTotal <- row_sums(tdm)                                # total count of each term across all documents
sortRow <- rowTotal[order(rowTotal, decreasing = TRUE)]  # terms sorted by frequency
cat("The 15 most frequent terms")
head(sortRow, 15)
## The 15 most frequent terms
## the and for that you with was this have but
## 479178 243347 110433 104468 94185 71880 62968 54460 53073 48787
## are not from its all
## 48491 41204 38723 35804 34469
Just the top 15 words occur a total of 1521470 times. Let’s look into this further to answer two specific questions: how many unique words are needed to cover half of the corpus, and how many are needed to cover 90% of it?
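These coverage figures follow from the cumulative sum of the sorted frequencies; a minimal sketch using sortRow from above:
# Cumulative share of the corpus covered by the k most frequent words
coverage <- cumsum(sortRow) / sum(sortRow)
min(which(coverage >= 0.5))  # unique words needed for 50% coverage
min(which(coverage >= 0.9))  # unique words needed for 90% coverage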
We’d need 313 words for 50% coverage and 9375 words for 90% coverage. A few unique words, relative to the total word count, go a very long way.
We’re off to a good start. We have some familiarity with the data for this project. We know foreign words shouldn’t be a problem. We’ve got a good sense of word distribution. And we’ve picked up a lot of good tools to help us on this journey!