This is the Milestone Report for the Coursera capstone course. We’ll be working with English text data from three sources: Twitter, blogs, and online news. The end goal is to create a predictive model that suggests the next word to a user as they type. This report covers just the first few steps along that path. We’ll begin by reading a random subset of the data into R and taking some very basic measurements of it. Our random sample should be reasonably representative of the full data set.
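The code that read and sampled the raw files isn’t shown in this version of the report; a minimal sketch of one way it could be done (the file paths, the 10% sampling rate, and the seed are assumptions for illustration, not details from the report) would be:
# Sketch: read each raw file and keep a random sample of its lines
set.seed(1234)                                  # assumed seed, for reproducibility
read_sample <- function(path, frac = 0.1) {     # assumed 10% sampling rate
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[as.logical(rbinom(length(lines), 1, frac))]
}
blogs <- read_sample("final/en_US/en_US.blogs.txt")   # assumed paths
news  <- read_sample("final/en_US/en_US.news.txt")
twit  <- read_sample("final/en_US/en_US.twitter.txt")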
sources <- c("News", "Blogs", "Twitter")
words <- c(sum(stri_count_words(news)), sum(stri_count_words(blogs)), sum(stri_count_words(twit)) )
lines <- c(length(news), length(blogs), length(twit))
wordByLine <- c(mean(stri_count_words(news)), mean(stri_count_words(blogs)), mean(stri_count_words(twit)))
sum_tab <- data.frame(Source = sources,
TotalLines = lines, TotalWords = words, WordsPerLine = wordByLine)
sum_tab
## Source TotalLines TotalWords WordsPerLine
## 1 News 101024 3483769 34.48457
## 2 Blogs 89928 3768689 41.90785
## 3 Twitter 236014 3006396 12.73821
Twitter has the largest number of lines but the fewest words per line. This makes sense to anyone familiar with the platform’s character limit per message. Blogs seem to lack brevity, which also doesn’t come as a surprise. Blogs have fewer total lines than the news data but more total words.
Let’s compare the distribution of words per Tweet with the distribution of words per news item.
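The plots themselves aren’t reproduced in this text version; a minimal sketch of how the two distributions could be drawn with base graphics (the bin counts are a guess) is:
# Side-by-side histograms of words per line for Twitter and news
par(mfrow = c(1, 2))
hist(stri_count_words(twit), breaks = 50, main = "Words per Tweet", xlab = "Words")
hist(stri_count_words(news), breaks = 100, main = "Words per news item", xlab = "Words")
par(mfrow = c(1, 1))
We see that Tweets, limited by their character count, rarely contain more than 40 words. The distribution for news items is far more spread out: most items contain 100 or fewer words, but there are outliers with over 1500 words.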
At this point I’ve decided to combine all three texts, clean them using regular expressions, and then look for patterns that occur across all three. There are many R functions and packages that perform similar tasks, and I’ll use some of them later, but I’ve also found there’s great benefit to learning by doing.
# combine the three samples into a single character vector
combi.sub <- c(blogs, news, twit)
# find lines containing chains of 3 or more hyphenated words, then replace their hyphens with spaces
hyph <- grep("([A-Za-z]+-[A-Za-z]+){3,}", combi.sub)
combi.sub[hyph] <- gsub("-", " ", combi.sub[hyph])
# split remaining hyphenated words with at least 4 letters on the left and 3 on the right
combi.sub <- gsub("([A-Za-z]{4,})-([A-Za-z]{3,})", "\\1 \\2", combi.sub)
# remove apostrophes (curly and straight)
combi.sub <- gsub("’|\\'", "", combi.sub)
# replace anything non-alphanumeric other than space, period and $, plus any run of 2+ periods, with a space
combi.sub <- gsub("[^[:alnum:] .$]|\\.\\.+", " ", combi.sub)
combi.sub <- gsub("\\.$", "", combi.sub)             # remove a period at the end of a line
combi.sub <- gsub("\\. ([A-Z])", " \\1", combi.sub)  # drop a sentence-ending period before a space and a capital letter
Now that our data is a little cleaner, we can look into how often word pairs and triplets occur. We call these bigrams and trigrams, respectively. Our purpose here is exploratory, but these co-occurring words will play an important role in building any predictive model. We’ll look at the top 10 most frequently occurring bigrams and trigrams in our data.
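The n-gram code isn’t shown in this version of the report; a minimal sketch of one way to tally bigrams and trigrams in base R (the count_ngrams helper is illustrative, not the exact code used, and it will be slow on the full sample) would be:
# Count all n-word sequences across the cleaned lines and sort by frequency
count_ngrams <- function(lines, n) {
  tokens_by_line <- strsplit(tolower(lines), "\\s+")
  ngrams <- unlist(lapply(tokens_by_line, function(tok) {
    tok <- tok[tok != ""]
    if (length(tok) < n) return(character(0))
    sapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}
head(count_ngrams(combi.sub, 2), 10)  # top 10 bigrams
head(count_ngrams(combi.sub, 3), 10)  # top 10 trigrams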
## of the in the to the for the on the to be at the and the
## 42974 41435 21809 20131 19821 16502 14409 12695
## in a with the
## 11815 10619
## one of the a lot of thanks for the to be a going to be
## 3446 3046 2353 1851 1808
## the end of out of the i want to as well as it was a
## 1465 1461 1438 1420 1392
We see many common phrase fragments and many occurrences of the definite article “the.” It’s interesting to note that 25.3% of all bigrams and just 11.3% of all trigrams appear more than once in the data.
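Those shares can be computed directly from the n-gram tallies; for example, using the count_ngrams sketch above:
mean(count_ngrams(combi.sub, 2) > 1)  # share of distinct bigrams appearing more than once
mean(count_ngrams(combi.sub, 3) > 1)  # share of distinct trigrams appearing more than once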
Let’s take a look at how many words are made up, completely or partially, of characters foreign to the English alphabet.
combi.token <- unlist(strsplit(combi.sub, " "))  # split the cleaned text into individual tokens
# iconv() swaps any character it can't convert to ASCII for the marker string "combi.token";
# grep() then flags tokens containing that marker, i.e. tokens with non-ASCII characters
foreign <- grep('combi.token', iconv(combi.token, 'latin1', 'ASCII', sub = 'combi.token'))
combi.token[head(foreign)]  # take a look at a few of the foreign words
## [1] "agéd" "bâri" "querétaro" "métis" "métis" "métis"
There are only 2916 foreign words in our data set, which contains millions of words. Their occurrence is so low that they shouldn’t play a major role in our future model. Additionally, many are familiar to English readers even with slightly different spellings: cafe vs. café, or jalapenos vs. jalapeños.
Let’s look at the most commonly occurring words and their coverage of the entire data set. We’ll view this two ways: first with a word cloud and then with a frequency table.
library(tm)    # VCorpus, tm_map, TermDocumentMatrix
library(slam)  # row_sums for sparse term-document matrices
corpus <- VCorpus(VectorSource(combi.sub))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)                        # build the term-document matrix
rowTotal <- row_sums(tdm)                                # total count of each term across all documents
sortRow <- rowTotal[order(rowTotal, decreasing = TRUE)]  # terms sorted by frequency
cat("The 15 most frequent terms")
head(sortRow, 15)
## The 15 most frequent terms
## the and for that you with was this have but
## 479178 243347 110433 104468 94185 71880 62968 54460 53073 48787
## are not from its all
## 48491 41204 38723 35804 34469
Just the top 15 words occur a total of 1521470 times. Let’s look into this further to answer two specific questions: how many unique words are needed to cover half of the corpus, and how many are needed to cover 90% of it?
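These coverage figures follow from the cumulative sum of the sorted frequencies; a minimal sketch using sortRow from above:
# Cumulative share of the corpus covered by the k most frequent words
coverage <- cumsum(sortRow) / sum(sortRow)
min(which(coverage >= 0.5))  # unique words needed for 50% coverage
min(which(coverage >= 0.9))  # unique words needed for 90% coverage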
We’d need 313 words for 50% coverage and 9375 words for 90% coverage. A few unique words, relative to the total word count, go a very long way.
We’re off to a good start. We have some familiarity with the data for this project. We know foreign words shouldn’t be a problem. We’ve got a good sense of word distribution. And we’ve picked up a lot of good tools to help us on this journey!