We begin by reading in the text lines of each dataset.
blogs.lines <- readLines("en_US.blogs.txt")
news.lines <- readLines("en_US.news.txt")
twitter.lines <- readLines("en_US.twitter.txt")
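Note that readLines() will warn if a file lacks a final newline or contains embedded nul characters. A more defensive read (an optional variant, not used for the counts below) specifies the encoding and skips nuls:
# Optional defensive variant: read as UTF-8 and skip embedded nul bytes
blogs.lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)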
Next, we split each line into words, treating any run of non-word characters as a delimiter.
blogs.words <- unlist(strsplit(blogs.lines, "\\W+"))
news.words <- unlist(strsplit(news.lines, "\\W+"))
twitter.words <- unlist(strsplit(twitter.lines, "\\W+"))
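Note that this tokenization is case-sensitive, so "The" and "the" are tallied separately. A case-insensitive variant (optional, not applied here) would lower-case each line before splitting:
# Optional: fold case so that "The" and "the" are counted together
blogs.words <- unlist(strsplit(tolower(blogs.lines), "\\W+"))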
Then we find the number of words in each file.
length(blogs.words)
## [1] 38370723
length(news.words)
## [1] 35783083
length(twitter.words)
## [1] 31149374
We find the number of lines in each file.
length(blogs.lines)
## [1] 899288
length(news.lines)
## [1] 1010242
length(twitter.lines)
## [1] 2360148
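For easier side-by-side comparison, these counts can be gathered into a single summary data frame (a small convenience sketch, not part of the counts reported above):
# Collect the line and word counts for all three files in one table
data.frame(file  = c("blogs", "news", "twitter"),
           lines = c(length(blogs.lines), length(news.lines), length(twitter.lines)),
           words = c(length(blogs.words), length(news.words), length(twitter.words)))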
Then we create three data frames containing the top 10 most frequent words in each file. We start with the blogs dataset.
blogs.top <- sort(table(blogs.words), decreasing = TRUE)[1:10]
blogs.top <- as.data.frame(blogs.top)
blogs.top
## blogs.words Freq
## 1 the 1669717
## 2 to 1055460
## 3 and 1036035
## 4 I 889775
## 5 of 868442
## 6 a 864814
## 7 in 555918
## 8 that 459389
## 9 is 426404
## 10 it 382706
As we can see, “the” is the most frequent word with 1669717 occurrences, followed by “to” and “and” with 1055460 and 1036035 occurrences respectively. “it” is in 10th place with 382706 occurrences.
To illustrate this, we make a barplot of the top 10 most frequent words in the blogs dataset.
library(ggplot2)
ggplot(blogs.top, aes(x = blogs.words, y = Freq)) + geom_col()
Then we make a data frame for the news dataset frequencies.
news.top <- sort(table(news.words), decreasing = TRUE)[1:10]
news.top <- as.data.frame(news.top)
news.top
## news.words Freq
## 1 the 1720339
## 2 to 898055
## 3 and 857242
## 4 a 844320
## 5 of 771103
## 6 in 633109
## 7 s 418469
## 8 that 341487
## 9 for 337611
## 10 is 281764
As we can see, “the” is again the most frequent word with 1720339 occurrences, followed by “to” and “and” with 898055 and 857242 occurrences respectively, while “is” is in 10th place with 281764 occurrences. The stray token “s” in 7th place is an artifact of splitting on non-word characters, which breaks possessives and contractions such as “it's” into “it” and “s”.
To illustrate this, we make a barplot of the top 10 most frequent words in the news dataset.
ggplot(news.top, aes(x = news.words, y = Freq)) + geom_col()
Finally, we do the same with the Twitter dataset.
twitter.top <- sort(table(twitter.words), decreasing = TRUE)[1:10]
twitter.top <- as.data.frame(twitter.top)
twitter.top
## twitter.words Freq
## 1 the 842294
## 2 I 804209
## 3 to 770738
## 4 a 577916
## 5 you 522520
## 6 and 405729
## 7 for 373099
## 8 in 360565
## 9 of 351926
## 10 is 339361
As we can see, “the” is the most frequent word with 842294 occurrences, followed by “I” and “to” with 804209 and 770738 occurrences respectively. “is” is in 10th place with 339361 occurrences.
To illustrate this, we make a barplot of the top 10 most frequent words in the Twitter dataset.
ggplot(twitter.top, aes(x = twitter.words, y = Freq)) + geom_col()
Next, I plan to count n-gram frequencies and, for each n-gram, find the most frequent following word. I will then build a Shiny app that uses this information to predict the next word.
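As a preview of that step, here is a minimal sketch of bigram counting in base R, assuming the word vectors produced above. The head()/tail() pairing is an illustrative choice, not the final implementation, and it naively also forms bigrams across line boundaries:
# Pair each word with its successor to form bigrams, then tabulate the pairs
bigrams <- paste(head(blogs.words, -1), tail(blogs.words, -1))
sort(table(bigrams), decreasing = TRUE)[1:10]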