In this RMarkdown document, I want to take a first exploratory look at the three English text sources that will eventually feed a predictive text model: Twitter, blogs, and news. How large is each corpus, how many unique words does it contain, and which words appear most often?
First, we’ll grab some packages we want in our working environment.
library(tidyverse, quietly = TRUE)
library(tidytext, quietly = TRUE)
library(stringr, quietly = TRUE)
library(reshape2, quietly = TRUE)
We’ll load the Twitter data first.
Importantly, we’ll read each line into its own row using readLines(), so we can easily count how many lines we have. Let’s also keep the line numbers, as they will prove handy for predictive text!
twitter <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.twitter.txt"))
twitter$tweet <- as.numeric(row.names(twitter))
head(twitter)
## # A tibble: 6 × 2
## text
## <chr>
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love t
## 2 When you meet someone special... you'll know. Your heart will beat more rap
## 3 they've decided its more fun if I don't.
## 4 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 M
## 5 Words from a complete stranger! Made my birthday even better :)
## 6 First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs G
## # ... with 1 more variables: tweet <dbl>
We’ll also load the blog texts.
blogs <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.blogs.txt"))
blogs$post <- as.numeric(row.names(blogs))
head(blogs)
## # A tibble: 6 × 2
## text
## <chr>
## 1 In the years thereafter, most of the Oil fields and platforms were named af
## 2 We love you Mr. Brown.
## 3 Chad has been awesome with the kids and holding down the fort while I work
## 4 so anyways, i am going to share some home decor inspiration that i have bee
## 5 With graduation season right around the corner, Nancy has whipped up a fun
## 6 If you have an alternative argument, let's hear it! :)
## # ... with 1 more variables: post <dbl>
Finally, we’ll get the news texts, in the same way:
news <- data_frame(text = readLines("/Users/paytonk/Downloads/final/en_US/en_US.news.txt"))
news$story <- as.numeric(row.names(news))
head(news)
## # A tibble: 6 × 2
## text
## <chr>
## 1 He wasn't home alone, apparently.
## 2 The St. Louis plant had to close. It would die of old age. Workers had been
## 3 WSU's plans quickly became a hot topic on local online sites. Though most p
## 4 The Alaimo Group of Mount Holly was up for a contract last fall to evaluate
## 5 And when it's often difficult to predict a law's impact, legislators should
## 6 There was a certain amount of scoffing going around a few years ago when th
## # ... with 1 more variables: story <dbl>
How many items, lines, and words do we have for each textual source? Let’s start with Twitter. How many lines do we have?
nrow(twitter)
## [1] 2360148
2.36 million lines of tweet data! That’s a lot of text!
Now let’s analyze it at the word level. First we’ll break each tweet into tokens:
twitterWords <- twitter %>%
unnest_tokens(word, text)
How many words total do we have in our corpus?
nrow(twitterWords)
## [1] 30093369
Wow, 30 million words! And how many of those words are unique?
uniqueTwitterWords <- twitterWords %>% select(word) %>% unique()
nrow(uniqueTwitterWords)
## [1] 370386
Over 370K unique words? Clearly some of those must be variant spellings, or we’re capturing things like emojis that aren’t real words. It’s unlikely the real vocabulary is that large! Let’s peek.
head(sort(uniqueTwitterWords$word))
## [1] "__" "﹏﹏" "___" "____" "﹏﹏﹏﹏" "_____"
Hmmm… there are a lot of “words” here that aren’t really words. Let’s toss out the “words” that don’t contain at least one letter, and see if the rest look better. I don’t want to toss out words just because they aren’t all letters, since we could have things like H2O or H1N1 that could be important for our analysis.
uniqueTwitterWords <- uniqueTwitterWords %>% filter(grepl("[[:alpha:]]", word))
head(sort(uniqueTwitterWords$word))
## [1] "_________________thud" "________________until" "__________all"
## [4] "__________up" "______he's" "______hours"
Well, we’re closer, at least. It’s not perfect, but it’s good enough for now, as we’re just getting a very rough look. Moving forward, we might have to do some data cleaning and figure out what regular expression best identifies a “real” word! How many words do we have now, with our admittedly imperfect system?
nrow(uniqueTwitterWords)
## [1] 279049
Our total unique word count (a bit iffy, to be sure) is around 279K.
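As a rough sketch of what that future cleaning might look like (hypothetical, and not applied to any of the counts in this report), we could strip leading and trailing underscores and then keep only tokens that contain at least one letter and consist solely of letters, digits, and apostrophes:
# Hypothetical cleaning sketch (not used for the counts in this report):
# strip leading/trailing underscores, then keep tokens that contain a letter
# and consist only of letters, digits, and apostrophes.
cleanedTwitterWords <- uniqueTwitterWords %>%
  mutate(word = str_replace_all(word, "^_+|_+$", "")) %>%
  filter(str_detect(word, "[[:alpha:]]"),
         str_detect(word, "^[[:alnum:]']+$"))
The exact rules (hyphens, tokens like H1N1, and so on) would need tuning against the corpus.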
We’ll do something very similar for blog posts. Let’s find the line count:
nrow(blogs)
## [1] 899288
899K. Once again, tons of text!
What about words?
blogWords <- blogs %>%
unnest_tokens(word, text)
How many words total do we have in our corpus?
nrow(blogWords)
## [1] 37546246
Wow, 37.5 million words! And how many of those words are unique, and are likely to be real words, because they contain at least one letter?
uniqueBlogWords <- blogWords %>%
select(word) %>%
filter(grepl("[[:alpha:]]", word)) %>%
unique()
nrow(uniqueBlogWords)
## [1] 237596
238K unique words is still a pretty high number. These could be names, spelling variants, or emoji; more analysis is necessary. What’s interesting is that the number of unique words is quite similar to Twitter’s.
Just as with Twitter and blogs, we can easily count lines:
nrow(news)
## [1] 1010242
About a million lines.
And similarly, let’s count words: both the overall word count and the number of unique words.
newsWords <- news %>%
unnest_tokens(word, text)
How many words total do we have in our corpus?
nrow(newsWords)
## [1] 34762395
34.8 million words! How many of those words are unique, and are likely to be real words, because they contain at least one letter?
uniqueNewsWords <- newsWords %>%
select(word) %>%
filter(grepl("[[:alpha:]]", word)) %>%
unique()
nrow(uniqueNewsWords)
## [1] 200262
Around 200K. Again, in the ballpark of Twitter and blog posts, and arguably a bit more trustworthy (I’d be surprised to see misspellings, emojis, etc. in news articles). Maybe 200K is a realistic unique word count for a sufficiently large corpus?
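As a quick, hypothetical follow-up (not run as part of this report), we could also check how much the three vocabularies overlap, for example:
# Hypothetical overlap check (illustration only): how many unique words
# appear in all three corpora?
sharedWords <- Reduce(intersect, list(uniqueTwitterWords$word,
                                      uniqueBlogWords$word,
                                      uniqueNewsWords$word))
length(sharedWords)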
Let’s see what the word frequency looks like in Twitter.
twitterWordFreq <- twitterWords %>%
count(word, sort = TRUE)
twitterWordFreq <- twitterWordFreq %>% mutate(freq=n/sum(n))
Let’s take a peek…
head(twitterWordFreq)
## # A tibble: 6 × 3
## word n freq
## <chr> <int> <dbl>
## 1 the 937405 0.03114989
## 2 to 788645 0.02620660
## 3 i 723447 0.02404008
## 4 a 611358 0.02031537
## 5 you 548089 0.01821295
## 6 and 438538 0.01457258
And let’s see what the most frequent words are:
head(twitterWordFreq[order(twitterWordFreq$freq, decreasing=TRUE),])
## # A tibble: 6 × 3
## word n freq
## <chr> <int> <dbl>
## 1 the 937405 0.03114989
## 2 to 788645 0.02620660
## 3 i 723447 0.02404008
## 4 a 611358 0.02031537
## 5 you 548089 0.01821295
## 6 and 438538 0.01457258
Those are some low frequencies, even for frequent words. What does our frequency look like overall? I’ll multiply frequency by 100 to get each word’s percentage of the corpus.
summary(twitterWordFreq$freq*100)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0000033 0.0000033 0.0000033 0.0002700 0.0000100 3.1150000
Looks like the vast majority of our words each account for well under a ten-thousandth of a percent of the corpus. We do have a few words with high frequency, however, with the most frequent word (“the”) coming in at about 3.1%. Let’s remove common English stop words like “a”, “to”, and “the” to see how that affects the frequencies.
data("stop_words")
twitterWordFreqStopRemoved <- twitterWordFreq %>%
anti_join(stop_words)
twitterWordFreqStopRemoved <- twitterWordFreqStopRemoved %>% mutate(freq=n/sum(n)) %>% arrange(desc(n))
head(twitterWordFreqStopRemoved)
## # A tibble: 6 × 3
## word n freq
## <chr> <int> <dbl>
## 1 love 106721 0.008546924
## 2 day 91710 0.007344743
## 3 rt 89537 0.007170715
## 4 time 76794 0.006150172
## 5 lol 70133 0.005616715
## 6 3 54940 0.004399959
It’s interesting to see that close to 1% of non-stop-word usage belongs to the word “love” (0.85%), with terms like day (0.73%) and rt (short for “retweet”, 0.72%) not far behind.
I suspect the “3” is actually part of the “heart” emoticon (<3). Let’s see if that bears out, by looking at the tweets that have a 3 in them.
tweetsWith3 <- twitterWords %>% filter(word == "3") %>% select(tweet)
twitter[head(tweetsWith3$tweet),]
## # A tibble: 6 × 2
## text
## <chr>
## 1 I will <3
## 2 RT Congratulations to the for advancing in the #stanleycup Playoffs! They d
## 3 Ghost Hunters makes me cry :( <3
## 4 This Little Girl by Cady Groves..... I <3 it! Look it up :)
## 5 Don't forget.... our 5th annual #Cuts4aCause is TOMORROW from 9-3!!! Don't
## 6 Going to bed, Night <3
## # ... with 1 more variables: tweet <dbl>
Yep, it looks like the <3 is a frequent occurrence! What other symbols might we be missing? This definitely suggests further work in extracting emoticons (symbols made by combining typed characters) and emojis (Unicode symbols, like the ones you might send from your phone), at least in the Twitter corpus.
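As a very rough sketch of that kind of check (the patterns below are hypothetical and far from exhaustive), we could count how many raw tweets contain a few common emoticons:
# Hypothetical emoticon check (illustrative patterns only, not exhaustive):
# count the raw tweets that contain a handful of common emoticons.
emoticonPattern <- "<3|:\\)|:\\(|:D|;\\)"
sum(str_detect(twitter$text, emoticonPattern))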
Let’s visualize word frequency in tweets, now that we’ve removed the relatively high frequency outliers by using the stopwords data set. Because we have a huge variation in frequency, I’ll log-transform the data.
hist(log(twitterWordFreqStopRemoved$freq))
I suppose it makes sense that there would be a lot of words with very low frequency (proper names, misspellings), and only a few words with high frequency. I wonder if the same is true for our other corpora, and what their high-frequency words (not counting stop words) are?
blogWordFreq <- blogWords %>%
count(word, sort = TRUE)
blogWordFreq <- blogWordFreq %>%
mutate(freq=n/sum(n)) %>%
anti_join(stop_words) %>%
arrange(desc(n))
head(blogWordFreq)
## # A tibble: 6 × 3
## word n freq
## <chr> <int> <dbl>
## 1 time 90918 0.002421494
## 2 people 59574 0.001586683
## 3 day 52372 0.001394866
## 4 love 45230 0.001204648
## 5 life 41251 0.001098672
## 6 it’s 38657 0.001029584
Time, love, and day make another appearance as frequently used words. Will that hold true for news articles? I doubt it!
Before that, though, let’s check out the histogram of word frequency in blog posts.
hist(log(blogWordFreq$freq))
Yep, that squares with what we’ve seen before.
Let’s do the same thing with news posts:
newsWordFreq <- newsWords %>%
count(word, sort = TRUE)
newsWordFreq <- newsWordFreq %>%
mutate(freq=n/sum(n)) %>%
anti_join(stop_words) %>%
arrange(desc(n))
head(newsWordFreq)
## # A tibble: 6 × 3
## word n freq
## <chr> <int> <dbl>
## 1 time 57062 0.001641486
## 2 people 47666 0.001371194
## 3 city 37953 0.001091783
## 4 1 37292 0.001072768
## 5 school 35498 0.001021161
## 6 game 34949 0.001005368
hist(log(newsWordFreq$freq))
Interesting! “Time” appears, but “love” does not. Also, the dropoff in the histogram, while steep, is not as dramatic as in the other two corpora. Perhaps this indicates there are fewer “one-offs” in news and that there’s a broader, more diverse use of words.
Having seen some overlap in just the top six words, I’m curious about the top 30. I’ll take the top 30 words (not counting stop words) from each corpus and see how they overlap and compare.
twitterTop30 <- as.data.frame(head(twitterWordFreqStopRemoved,30)) %>%
transmute(word=word, twitterFreq = freq)
blogsTop30 <- as.data.frame(head(blogWordFreq,30)) %>%
transmute(word=word, blogFreq = freq)
newsTop30 <- as.data.frame(head(newsWordFreq,30)) %>%
transmute(word=word, newsFreq = freq)
Let’s combine these and check out the data.
top30 <- merge(twitterTop30, blogsTop30, by="word", all=TRUE)
top30 <- merge(top30, newsTop30, by="word", all=TRUE)
top30
## word twitterFreq blogFreq newsFreq
## 1 1 0.002090100 0.0008231715 0.0010727684
## 2 10 NA NA 0.0008539688
## 3 2 0.003645142 0.0007873224 0.0009143214
## 4 3 0.004399959 0.0005864767 0.0007822246
## 5 4 0.001981743 0.0004575158 0.0006411814
## 6 5 NA NA 0.0006140256
## 7 6 NA NA 0.0005974847
## 8 awesome 0.002000163 NA NA
## 9 bit NA 0.0005230083 NA
## 10 blog NA 0.0005231948 NA
## 11 book NA 0.0007496622 NA
## 12 center NA NA 0.0005859493
## 13 city NA NA 0.0010917832
## 14 county NA NA 0.0008679494
## 15 day 0.007344743 0.0013948665 0.0009176871
## 16 days NA 0.0005369378 NA
## 17 don’t NA 0.0007561075 NA
## 18 family NA 0.0005328629 NA
## 19 feel 0.001969250 0.0006512236 NA
## 20 follow 0.003852487 NA NA
## 21 found NA 0.0005160836 NA
## 22 fun 0.001856328 NA NA
## 23 game 0.002451852 NA 0.0010053680
## 24 god NA 0.0005957453 NA
## 25 haha 0.002178596 NA NA
## 26 happy 0.003924084 NA NA
## 27 hey 0.002076005 NA NA
## 28 home 0.001978300 0.0007442555 0.0008802903
## 29 hope 0.002847640 NA NA
## 30 house NA 0.0005053501 0.0005713933
## 31 i’m NA 0.0007774146 NA
## 32 im 0.002456577 NA NA
## 33 including NA NA 0.0005834178
## 34 it’s NA 0.0010295836 NA
## 35 life 0.002717659 0.0010986718 NA
## 36 lol 0.005616715 NA NA
## 37 lot NA 0.0005739588 NA
## 38 love 0.008546924 0.0012046477 NA
## 39 million NA NA 0.0008892943
## 40 morning 0.002121254 NA NA
## 41 night 0.003297086 0.0005097447 NA
## 42 p.m NA NA 0.0006713001
## 43 people 0.004167707 0.0015866833 0.0013711944
## 44 percent NA NA 0.0009979174
## 45 play NA NA 0.0006078407
## 46 police NA NA 0.0007794054
## 47 post NA 0.0005134734 NA
## 48 president NA NA 0.0005713645
## 49 public NA NA 0.0006542702
## 50 read NA 0.0005694311 NA
## 51 rt 0.007170715 NA NA
## 52 school NA 0.0004801812 0.0010211609
## 53 season NA NA 0.0008196789
## 54 st NA NA 0.0006357732
## 55 story NA 0.0005120885 NA
## 56 team NA NA 0.0008328540
## 57 time 0.006150172 0.0024214937 0.0016414864
## 58 tomorrow 0.002284791 NA NA
## 59 tonight 0.003579792 NA NA
## 60 twitter 0.002453373 NA NA
## 61 u.s NA NA 0.0006741193
## 62 wait 0.002180358 NA NA
## 63 week 0.002332603 0.0007009223 0.0006516237
## 64 weekend 0.001804752 NA NA
## 65 world NA 0.0007805041 NA
Let’s make long data out of this, tidy-style, and then graph it using ggplot2.
top30Long <- melt(top30, id.vars='word')
Can we graph this?
ggplot(data=top30Long, aes(word, value, color=variable)) +
geom_point() +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
What did we learn from this? Twitter has extremely variable frequency across popular words (the 30 most popular words in each corpus), while blog posts and news tend to be fairly even. One caveat: the Twitter frequencies were recomputed after removing stop words, while the blog and news frequencies were computed over their full corpora, so part of the gap is a normalization artifact we should clean up before drawing firm conclusions. Still, this suggests that the Twitter vocabulary is less rich and more highly concentrated, while the other two corpora have a more even distribution. That fits with the log-scaled histograms we already saw!
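To make “variable versus even” a bit more concrete, a quick hypothetical follow-up (not run here) could summarize the spread of the top-30 frequencies in each corpus, for example with a coefficient of variation:
# Hypothetical spread summary (illustration only): coefficient of variation
# of the top-30 word frequencies in each corpus.
top30Long %>%
  group_by(variable) %>%
  summarise(cv = sd(value, na.rm = TRUE) / mean(value, na.rm = TRUE))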
The next things I think it makes sense to do, before moving forward into actually working on word prediction, include: cleaning the tokens with a better regular expression for identifying “real” words, extracting emoticons and emojis (especially in the Twitter corpus), and recomputing word frequencies consistently across the three corpora so they can be compared fairly.