Tasks to accomplish

In this RMarkdown document, I want to:

  • describe the exploratory analysis of the training data set. I’ll do this in the textual part preceding the code chunks.
  • provide basic summaries of the three files (e.g. word counts, line counts and basic data tables). See the table of contents for each section of interest!
  • use basic plots such as histograms to illustrate features of the data. Again, see the table of contents for easy links to exploratory plots.
  • and accomplish all of this in a brief, concise style, with a clean structure and without blathering on too much.

Preliminary Steps

First, we’ll grab some packages we want in our working environment.

library(tidyverse, quietly = TRUE)  # dplyr, ggplot2, tidyr, etc. for wrangling and plotting
library(tidytext, quietly = TRUE)   # unnest_tokens() and the stop_words data set
library(stringr, quietly = TRUE)    # string helpers (also attached as part of the tidyverse)
library(reshape2, quietly = TRUE)   # melt() for reshaping wide data to long

Obtaining the Data

We’ll load the Twitter data first.

Importantly, we'll read each line into its own row using readLines(), which makes it easy to count how many lines we have.

Let's also keep the line numbers, as they will prove handy for predictive text!

twitter <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.twitter.txt"))
twitter$tweet <- as.numeric(row.names(twitter))
head(twitter)
## # A tibble: 6 × 2
##                                                                          text
##                                                                         <chr>
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love t
## 2 When you meet someone special... you'll know. Your heart will beat more rap
## 3                                    they've decided its more fun if I don't.
## 4 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 M
## 5             Words from a complete stranger! Made my birthday even better :)
## 6 First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs G
## # ... with 1 more variables: tweet <dbl>

We’ll also load the blog texts.

blogs <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.blogs.txt"))
blogs$post <- as.numeric(row.names(blogs))
head(blogs)
## # A tibble: 6 × 2
##                                                                          text
##                                                                         <chr>
## 1 In the years thereafter, most of the Oil fields and platforms were named af
## 2                                                      We love you Mr. Brown.
## 3 Chad has been awesome with the kids and holding down the fort while I work 
## 4 so anyways, i am going to share some home decor inspiration that i have bee
## 5 With graduation season right around the corner, Nancy has whipped up a fun 
## 6                      If you have an alternative argument, let's hear it! :)
## # ... with 1 more variables: post <dbl>

Finally, we’ll get the news texts, in the same way:

news <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.news.txt"))
news$story <- as.numeric(row.names(news))
head(news)
## # A tibble: 6 × 2
##                                                                          text
##                                                                         <chr>
## 1                                           He wasn't home alone, apparently.
## 2 The St. Louis plant had to close. It would die of old age. Workers had been
## 3 WSU's plans quickly became a hot topic on local online sites. Though most p
## 4 The Alaimo Group of Mount Holly was up for a contract last fall to evaluate
## 5 And when it's often difficult to predict a law's impact, legislators should
## 6 There was a certain amount of scoffing going around a few years ago when th
## # ... with 1 more variables: story <dbl>

Determining the scope of the data

How many items, lines, and words do we have for each textual source?

Twitter Text

How many lines do we have?

nrow(twitter)
## [1] 2360148

2.36 million lines of tweet data! That’s a lot of text!

Now let’s analyze it at the word level. First we’ll break each tweet into tokens:

twitterWords <- twitter %>%
  unnest_tokens(word, text)

How many words total do we have in our corpus?

nrow(twitterWords)
## [1] 30093369

Wow, 30 million words! And how many of those words are unique?

uniqueTwitterWords <- twitterWords %>% select(word) %>% unique()
nrow(uniqueTwitterWords)
## [1] 370386

Over 370,000 unique words? Clearly some of those must be variant spellings, or we're capturing things like emojis that aren't real words. It's unlikely the real vocabulary is that large! Let's peek.

head(sort(uniqueTwitterWords$word))
## [1] "__"       "﹏﹏"     "___"      "____"     "﹏﹏﹏﹏" "_____"

Hmmm… there are a lot of "words" here that aren't really words. Let's toss out the "words" that don't contain at least one letter and see if the rest look better. I don't want to require that a word be made up entirely of letters, though, because tokens like H2O or H1N1 could be important for our analysis.

uniqueTwitterWords <- uniqueTwitterWords %>% filter(grepl("[[:alpha:]]", word))
head(sort(uniqueTwitterWords$word))
## [1] "_________________thud" "________________until" "__________all"        
## [4] "__________up"          "______he's"            "______hours"

Well, we’re closer, at least. It’s not perfect, but it’s good enough for now, as we’re just getting a very rough look. Moving forward, we might have to do some data cleaning and figure out what regular expression best identifies a “real” word! How many words do we have now, with our admittedly imperfect system?

nrow(uniqueTwitterWords)
## [1] 279049

Our total of unique words (a bit iffy, to be sure) is around 279K.
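
As a rough illustration of where that cleaning might go, here is one possible stricter filter. This is just a sketch, and the exact pattern is my own assumption rather than a settled choice: it keeps only tokens made of letters, optionally joined by an internal apostrophe or hyphen.

strictTwitterWords <- uniqueTwitterWords %>%
  filter(grepl("^[a-z]+(['’-][a-z]+)*$", word))  # e.g. keeps "don't" while dropping bare numbers and underscore runs
head(sort(strictTwitterWords$word))
nrow(strictTwitterWords)

Note that a pattern like this would also throw away tokens such as h1n1, so it's a trade-off to revisit during real data cleaning, not a final answer.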

Blog Posts

We’ll do something very similar for blog posts. Let’s find the line count:

nrow(blogs)
## [1] 899288

899K. Once again, tons of text!

What about words?

blogWords <- blogs %>%
  unnest_tokens(word, text)

How many words total do we have in our corpus?

nrow(blogWords)
## [1] 37546246

Wow, 37.5 million words! And how many of those are unique and plausibly real words, by our rough at-least-one-letter test?

uniqueBlogWords <- blogWords %>% 
  select(word) %>% 
  filter(grepl("[[:alpha:]]", word)) %>%
  unique()
nrow(uniqueBlogWords)
## [1] 237596

238K unique words is still a pretty high number. These could be names, spelling variants, or emojis – more analysis is necessary. What's interesting is that the count of unique words is in the same ballpark as Twitter's.

News

Just as with Twitter and blogs, we can easily count lines:

nrow(news)
## [1] 1010242

About a million lines.

And similarly, let's count words: both the overall word count and the number of unique words.

newsWords <- news %>%
  unnest_tokens(word, text)

How many words total do we have in our corpus?

nrow(newsWords)
## [1] 34762395

34.8 million words! And how many of those are unique and plausibly real words, again by the at-least-one-letter test?

uniqueNewsWords <- newsWords %>% 
  select(word) %>% 
  filter(grepl("[[:alpha:]]", word)) %>%
  unique()
nrow(uniqueNewsWords)
## [1] 200262

Around 200K. Again, in the ballpark of Twitter and blog posts, and arguably a bit more trustworthy (I’d be surprised to see misspellings, emojis, etc. in news articles). Maybe 200K is a realistic unique word count for a sufficiently large corpus?

Going a bit deeper

Word Frequency

Let’s see what the word frequency looks like in Twitter.

twitterWordFreq <- twitterWords %>%
  count(word, sort = TRUE) 

twitterWordFreq <- twitterWordFreq %>% mutate(freq=n/sum(n))

Let’s take a peek…

head(twitterWordFreq)
## # A tibble: 6 × 3
##    word      n       freq
##   <chr>  <int>      <dbl>
## 1   the 937405 0.03114989
## 2    to 788645 0.02620660
## 3     i 723447 0.02404008
## 4     a 611358 0.02031537
## 5   you 548089 0.01821295
## 6   and 438538 0.01457258

Since we used count(word, sort = TRUE), the table is already ordered by frequency; sorting explicitly confirms the most frequent words:

head(twitterWordFreq[order(twitterWordFreq$freq, decreasing=TRUE),])
## # A tibble: 6 × 3
##    word      n       freq
##   <chr>  <int>      <dbl>
## 1   the 937405 0.03114989
## 2    to 788645 0.02620660
## 3     i 723447 0.02404008
## 4     a 611358 0.02031537
## 5   you 548089 0.01821295
## 6   and 438538 0.01457258

Those are some low frequencies, even for frequent words. What does our frequency look like overall? I’ll multiply frequency by 100 to get each word’s percentage of the corpus.

summary(twitterWordFreq$freq*100)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000033 0.0000033 0.0000033 0.0002700 0.0000100 3.1150000

Looks like at least three quarters of our words each account for a hundred-thousandth of a percent of the corpus or less. A few words have high frequency, however, with the most frequent word ("the") coming in at about 3.1%. Let's remove common English stop words like "a", "to", and "the" and see how that affects the frequencies.

data("stop_words")

twitterWordFreqStopRemoved <- twitterWordFreq %>%
  anti_join(stop_words)

# recompute freq so it is relative to non-stop-word tokens only
twitterWordFreqStopRemoved <- twitterWordFreqStopRemoved %>% mutate(freq=n/sum(n)) %>% arrange(desc(n))

head(twitterWordFreqStopRemoved)
## # A tibble: 6 × 3
##    word      n        freq
##   <chr>  <int>       <dbl>
## 1  love 106721 0.008546924
## 2   day  91710 0.007344743
## 3    rt  89537 0.007170715
## 4  time  76794 0.006150172
## 5   lol  70133 0.005616715
## 6     3  54940 0.004399959

It’s interesting to see that close to 1% of non-stop-word usage belongs to the word “love” (0.85%), with terms like day (0.73%) and rt (short for “retweet”, 0.72%) not far behind.

Aside: Emojis and Emoticons

I suspect the “3” is actually part of the “heart” emoticon (<3). Let’s see if that bears out, by looking at the tweets that have a 3 in them.

tweetsWith3 <- twitterWords %>% filter(word == "3") %>% select(tweet)
twitter[head(tweetsWith3$tweet),]
## # A tibble: 6 × 2
##                                                                          text
##                                                                         <chr>
## 1                                                                   I will <3
## 2 RT Congratulations to the for advancing in the #stanleycup Playoffs! They d
## 3                                            Ghost Hunters makes me cry :( <3
## 4                 This Little Girl by Cady Groves..... I <3 it! Look it up :)
## 5 Don't forget.... our 5th annual #Cuts4aCause is TOMORROW from 9-3!!! Don't 
## 6                                                      Going to bed, Night <3
## # ... with 1 more variables: tweet <dbl>

Yep, it looks like the <3 is a frequent occurrence! What other symbols might we be missing? This definitely suggests further work in extracting emoticons (symbols built from typed characters, like <3 or :-D) and emojis (Unicode pictographic characters, like the ones on a phone keyboard), at least in the Twitter corpus.
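
To get a feel for how that extraction might work, here is a rough sketch. The emoticon list below is just a handful of guesses on my part, not an exhaustive inventory, and Unicode emoji would need separate handling.

emoticons <- c("<3", ":)", ":(", ":D", ";)", ":P")  # hypothetical starter list

# str_count() treats its pattern as a regex by default, so wrap each
# emoticon in fixed() to count it as a literal string.
emoticonCounts <- sapply(emoticons, function(e) sum(str_count(twitter$text, fixed(e))))
sort(emoticonCounts, decreasing = TRUE)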

Back to Word Frequency

Let's visualize word frequency in tweets, now that we've removed the relatively high-frequency outliers using the stop_words data set. Because frequency varies over several orders of magnitude, I'll log-transform the data.

hist(log(twitterWordFreqStopRemoved$freq))

I suppose it makes sense that there would be a lot of words with very low frequency (proper names, misspellings), and only a few words with high frequency. I wonder if the same is true for our other corpora, and what their high-frequency words (not counting stop words) are?

blogWordFreq <- blogWords %>%
  count(word, sort = TRUE) 
blogWordFreq <- blogWordFreq %>% 
  mutate(freq=n/sum(n)) %>%   # note: freq here is relative to ALL tokens, stop words included
  anti_join(stop_words) %>% 
  arrange(desc(n))
head(blogWordFreq)
## # A tibble: 6 × 3
##     word     n        freq
##    <chr> <int>       <dbl>
## 1   time 90918 0.002421494
## 2 people 59574 0.001586683
## 3    day 52372 0.001394866
## 4   love 45230 0.001204648
## 5   life 41251 0.001098672
## 6   it’s 38657 0.001029584

Time, love, and day make another appearance as frequently used words. Will that hold true for news articles? I doubt it!

First, let’s also check out the histogram of word frequency in blog posts.

hist(log(blogWordFreq$freq))

Yep, that squares with what we’ve seen before.

Let’s do the same thing with news posts:

newsWordFreq <- newsWords %>%
  count(word, sort = TRUE) 
newsWordFreq <- newsWordFreq %>% 
  mutate(freq=n/sum(n)) %>%   # note: freq here is relative to ALL tokens, stop words included
  anti_join(stop_words) %>% 
  arrange(desc(n))
head(newsWordFreq)
## # A tibble: 6 × 3
##     word     n        freq
##    <chr> <int>       <dbl>
## 1   time 57062 0.001641486
## 2 people 47666 0.001371194
## 3   city 37953 0.001091783
## 4      1 37292 0.001072768
## 5 school 35498 0.001021161
## 6   game 34949 0.001005368
hist(log(newsWordFreq$freq))

Interesting! "Time" appears, but "love" does not. Also, the drop-off in the histogram, while steep, is not as dramatic as in the other two corpora. Perhaps this indicates there are fewer "one-offs" in news and a broader, more even use of words.
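
As a quick, rough check of the one-off idea (a sketch only: the blog and news frequency tables already have stop words removed while the Twitter table does not, so the comparison is approximate), we can look at the share of distinct tokens that appear exactly once in each source:

# proportion of distinct tokens that occur exactly once ("one-offs") per corpus
c(twitter = mean(twitterWordFreq$n == 1),
  blogs   = mean(blogWordFreq$n == 1),
  news    = mean(newsWordFreq$n == 1))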

Top Word Comparison

Having seen some overlap in just the top six words, I'm curious about the top 30. I'm going to take the top 30 words (not counting stop words) from each corpus and see how they overlap and compare.

twitterTop30 <- as.data.frame(head(twitterWordFreqStopRemoved,30)) %>%
  transmute(word=word, twitterFreq = freq)
blogsTop30 <- as.data.frame(head(blogWordFreq,30)) %>%
  transmute(word=word, blogFreq = freq)
newsTop30 <- as.data.frame(head(newsWordFreq,30)) %>%
  transmute(word=word, newsFreq = freq)

Let’s combine these and check out the data.

top30 <- merge(twitterTop30, blogsTop30, by="word", all=TRUE)
top30 <- merge(top30, newsTop30, by="word", all=TRUE)
top30
##         word twitterFreq     blogFreq     newsFreq
## 1          1 0.002090100 0.0008231715 0.0010727684
## 2         10          NA           NA 0.0008539688
## 3          2 0.003645142 0.0007873224 0.0009143214
## 4          3 0.004399959 0.0005864767 0.0007822246
## 5          4 0.001981743 0.0004575158 0.0006411814
## 6          5          NA           NA 0.0006140256
## 7          6          NA           NA 0.0005974847
## 8    awesome 0.002000163           NA           NA
## 9        bit          NA 0.0005230083           NA
## 10      blog          NA 0.0005231948           NA
## 11      book          NA 0.0007496622           NA
## 12    center          NA           NA 0.0005859493
## 13      city          NA           NA 0.0010917832
## 14    county          NA           NA 0.0008679494
## 15       day 0.007344743 0.0013948665 0.0009176871
## 16      days          NA 0.0005369378           NA
## 17     don’t          NA 0.0007561075           NA
## 18    family          NA 0.0005328629           NA
## 19      feel 0.001969250 0.0006512236           NA
## 20    follow 0.003852487           NA           NA
## 21     found          NA 0.0005160836           NA
## 22       fun 0.001856328           NA           NA
## 23      game 0.002451852           NA 0.0010053680
## 24       god          NA 0.0005957453           NA
## 25      haha 0.002178596           NA           NA
## 26     happy 0.003924084           NA           NA
## 27       hey 0.002076005           NA           NA
## 28      home 0.001978300 0.0007442555 0.0008802903
## 29      hope 0.002847640           NA           NA
## 30     house          NA 0.0005053501 0.0005713933
## 31       i’m          NA 0.0007774146           NA
## 32        im 0.002456577           NA           NA
## 33 including          NA           NA 0.0005834178
## 34      it’s          NA 0.0010295836           NA
## 35      life 0.002717659 0.0010986718           NA
## 36       lol 0.005616715           NA           NA
## 37       lot          NA 0.0005739588           NA
## 38      love 0.008546924 0.0012046477           NA
## 39   million          NA           NA 0.0008892943
## 40   morning 0.002121254           NA           NA
## 41     night 0.003297086 0.0005097447           NA
## 42       p.m          NA           NA 0.0006713001
## 43    people 0.004167707 0.0015866833 0.0013711944
## 44   percent          NA           NA 0.0009979174
## 45      play          NA           NA 0.0006078407
## 46    police          NA           NA 0.0007794054
## 47      post          NA 0.0005134734           NA
## 48 president          NA           NA 0.0005713645
## 49    public          NA           NA 0.0006542702
## 50      read          NA 0.0005694311           NA
## 51        rt 0.007170715           NA           NA
## 52    school          NA 0.0004801812 0.0010211609
## 53    season          NA           NA 0.0008196789
## 54        st          NA           NA 0.0006357732
## 55     story          NA 0.0005120885           NA
## 56      team          NA           NA 0.0008328540
## 57      time 0.006150172 0.0024214937 0.0016414864
## 58  tomorrow 0.002284791           NA           NA
## 59   tonight 0.003579792           NA           NA
## 60   twitter 0.002453373           NA           NA
## 61       u.s          NA           NA 0.0006741193
## 62      wait 0.002180358           NA           NA
## 63      week 0.002332603 0.0007009223 0.0006516237
## 64   weekend 0.001804752           NA           NA
## 65     world          NA 0.0007805041           NA

Let’s make long data out of this, tidy-style, and then graph it using ggplot2.

top30Long <- melt(top30, id.vars = "word")

Can we graph this?

ggplot(data=top30Long, aes(word, value, color=variable)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

What did we learn from this? That Twitter has highly variable frequency across its popular words (the 30 most popular words in each corpus), while blog posts and news tend to be fairly even. One caveat: the Twitter frequencies were recomputed after removing stop words, while the blog and news frequencies are relative to all tokens including stop words, so the Twitter values sit higher across the board; even so, the spread within Twitter's top 30 is noticeably wider. This makes me think that the Twitter vocabulary is less rich and more concentrated, while the other two corpora have a more even distribution. This fits with the log-scaled histograms we already saw!
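
To put a rough number on that spread, here is a quick sketch using the coefficient of variation (sd divided by mean) of each source's top-30 frequencies. Because it is a ratio, it is not distorted by the different denominators noted above.

top30Long %>%
  group_by(variable) %>%
  summarise(cv = sd(value, na.rm = TRUE) / mean(value, na.rm = TRUE))  # higher = more uneven top 30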

Next Steps

The next things I think it makes sense to do, before moving on to actual word prediction, include:

  • Cleaning up emoticons and emojis, whether that means substitution (<3 = love) or removal
  • Removing one-offs (and maybe two-offs and three-offs), because their frequency is so low as to be pretty darn useless for prediction, I daresay. This will also make the data frames easier to work with in any system, regardless of RAM.
  • Creating three- to five-word sequences (n-grams) that I can use to model prediction (e.g. if I see the words "I love you so", I might often find that the next word is "much"); see the sketch after this list.
  • Determining whether prediction should happen on a per-corpus basis, or whether I can combine all the corpora into a single text predictor that works well on all three.
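
As a rough preview of the n-gram idea from the list above, here is a minimal sketch. The object and column names are placeholders of my own, and building 4-grams over the full Twitter corpus would be memory-hungry in practice, so a real version would likely work on a sample.

# Build 4-grams from the tweets, drop one-off sequences, and split off the
# last word so it can serve as the "predicted" word for the first three.
twitterNgrams <- twitter %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%
  count(ngram, sort = TRUE) %>%
  filter(n > 1) %>%
  separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ")

# Example lookup: which word most often follows "i love you"?
twitterNgrams %>%
  filter(w1 == "i", w2 == "love", w3 == "you") %>%
  head()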