Loading the data

We begin by reading in the text lines of each dataset.

blogs.lines <- readLines("en_US.blogs.txt")
news.lines <- readLines("en_US.news.txt")
# skipNul = TRUE skips the embedded nul characters in the Twitter file,
# which otherwise trigger warnings from readLines()
twitter.lines <- readLines("en_US.twitter.txt", skipNul = TRUE)

Summaries of the data

We split our data into words by breaking each line on runs of non-word characters (the regular expression \W+).

# split each line on non-word characters and flatten into one vector of words
blogs.words <- unlist(strsplit(blogs.lines, "\\W+"))
news.words <- unlist(strsplit(news.lines, "\\W+"))
twitter.words <- unlist(strsplit(twitter.lines, "\\W+"))

Then we find the number of words in each file.

length(blogs.words)
## [1] 38370723
length(news.words)
## [1] 35783083
length(twitter.words)
## [1] 31149374

We find the number of lines in each file.

length(blogs.lines)
## [1] 899288
length(news.lines)
## [1] 1010242
length(twitter.lines)
## [1] 2360148
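
For reference, these counts can be collected into a single summary data frame (a small convenience step, not part of the original analysis; it only reuses the objects computed above):

# one row per file: the line and word counts computed above
data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(blogs.lines), length(news.lines), length(twitter.lines)),
  words = c(length(blogs.words), length(news.words), length(twitter.words))
)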

Then we create three data frames containing the top 10 most frequent words in each file. We start with the blogs dataset.

blogs.top <- sort(table(blogs.words), decreasing = TRUE)[1:10]
blogs.top <- as.data.frame(blogs.top)
blogs.top
##    blogs.words    Freq
## 1          the 1669717
## 2           to 1055460
## 3          and 1036035
## 4            I  889775
## 5           of  868442
## 6            a  864814
## 7           in  555918
## 8         that  459389
## 9           is  426404
## 10          it  382706

As we can see, “the” is the most frequent word with 1669717 occurrences, followed by “to” and “and” with 1055460 and 1036035 occurrences respectively. “it” takes 10th place with 382706 occurrences in total.

To illustrate this, we make a bar plot of the top 10 most frequent words in the blogs dataset.

library(ggplot2)
ggplot(blogs.top, aes(x = blogs.words, y = Freq)) + geom_col()

Then we make a data frame for the news dataset frequencies.

news.top <- sort(table(news.words), decreasing = TRUE)[1:10]
news.top <- as.data.frame(news.top)
news.top
##    news.words    Freq
## 1         the 1720339
## 2          to  898055
## 3         and  857242
## 4           a  844320
## 5          of  771103
## 6          in  633109
## 7           s  418469
## 8        that  341487
## 9         for  337611
## 10         is  281764

As we can see, “the” is again the most frequent word with 1720339 occurrences, followed by “to” and “and” with 898055 and 857242 occurrences respectively. “is” takes 10th place with 281764 occurrences in total. Note the stray token “s” in 7th place; it is an artifact of our splitting step, illustrated below.
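
Splitting on \W+ treats the apostrophe as a separator, so contractions and possessives such as “it's” are broken into “it” and a lone “s”. A minimal check (the sample sentence here is made up purely for illustration):

strsplit("it's the dog's toy", "\\W+")
## [[1]]
## [1] "it"  "s"   "the" "dog" "s"   "toy"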

To illustrate this, we make a bar plot of the top 10 most frequent words in the news dataset.

ggplot(news.top, aes(x = news.words, y = Freq)) + geom_col()

Finally, we do the same with the Twitter dataset.

twitter.top <- sort(table(twitter.words), decreasing = TRUE)[1:10]
twitter.top <- as.data.frame(twitter.top)
twitter.top
##    twitter.words   Freq
## 1            the 842294
## 2              I 804209
## 3             to 770738
## 4              a 577916
## 5            you 522520
## 6            and 405729
## 7            for 373099
## 8             in 360565
## 9             of 351926
## 10            is 339361

As we can see, “the” is the most frequent word with 842294 occurrences, followed by “I” and “to” with 804209 and 770738 occurrences respectively. “is” takes 10th place with 339361 occurrences in total.

To illustrate this, we make a bar plot of the top 10 most frequent words in the Twitter dataset.

ggplot(twitter.top, aes(x = twitter.words, y = Freq)) + geom_col()

My further plans

I plan to count n-gram frequencies as well and, for each n-gram, find the most frequent following words. I will then build a Shiny app that uses this information to predict the next word. A rough sketch of the bigram step appears below.
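
As a first sketch of the bigram idea, reusing the simple \W+ tokenisation from above (predict.next is a hypothetical helper name, not part of any package, and the final app will need a more careful tokeniser):

# sketch: count bigrams in the blogs data and look up the most
# frequent words that follow a given word (illustrative only)
words <- blogs.words[blogs.words != ""]   # drop empty tokens left by the split

# pair each word with its successor to form bigrams
bigrams <- paste(head(words, -1), tail(words, -1))
bigram.freq <- sort(table(bigrams), decreasing = TRUE)

# hypothetical helper: the n most frequent words following `word`
predict.next <- function(word, n = 3) {
  follows <- grep(paste0("^", word, " "), names(bigram.freq), value = TRUE)
  sub(paste0("^", word, " "), "", head(follows, n))
}

predict.next("the")

Because bigram.freq is sorted by frequency, the first matches returned by grep are already the most frequent continuations.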