Overview

This document contains the analysis for the Milestone Report of the Capstone Project of the Data Science course.

Input Data

First, let’s set up the project by loading the required libraries and importing all the data used in the analysis. We then sample the datasets to get faster results.

library(dplyr)
library(plotly)
library(tidytext)
library(tidyr)

Exploratory Analysis

Let’s start with some simple word counts, first removing the stop words from the files. This is the one for the blogs text file.

Word count

# word count for blogs
word.count.blogs <- txt.blogs.smp.df %>%
        unnest_tokens(word, text) %>% 
        # removes the stop words
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 
word.count.blogs

This is the word count for the news file.

# word count for news
word.count.news <- txt.news.smp.df %>%
        unnest_tokens(word, text) %>% 
        # removes the stop words
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 
word.count.news

Finally, the word count for the Twitter file.

# word count for twitter
word.count.twitter <- txt.twitter.smp.df %>%
        unnest_tokens(word, text) %>% 
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 
word.count.twitter

Line count

Let’s run a simple line count on the three files.

# Blogs data frame
print(paste("Blogs rows:", format(nrow(txt.blogs.df), decimal.mark=".", 
                                  big.mark=",", small.mark=".")))
[1] "Blogs rows: 899,288"
# News data frame
print(paste("News rows:", format(nrow(txt.news.df), decimal.mark=".", 
                                 big.mark=",", small.mark=".")))
[1] "News rows: 1,010,242"
# Twitter data frame
print(paste("Twitter rows:", format(nrow(txt.twitter.df), decimal.mark=".", 
                                    big.mark=",", small.mark=".")))
[1] "Twitter rows: 2,360,148"

Bigrams

Let’s now calculate the bigrams, that is, the pairs of words that appear next to each other in the text. We will start with the blogs text file.

blogs.bigram <- txt.blogs.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)
blogs.bigram

Then the news txt file.

news.bigram <- txt.news.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)
news.bigram

And finally the Twitter file.

twitter.bigram <- txt.twitter.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)
twitter.bigram

Charts

The following charts explore in more detail the data frames that will be used to build a prediction algorithm.

Line count

plot_ly(
        x = c("Blogs", "News", "Twitter"),
        y = c(nrow(txt.blogs.df), nrow(txt.news.df), nrow(txt.twitter.df)),
        name = "Line count",
        type = "bar"
) %>% 
        layout(title = "Line count",
               xaxis = list(title = "data frame"),
               yaxis = list(title = "lines"))

Word count

This chart shows the top 10 words found in the blogs text file.

plot_ly(
        arrange(word.count.blogs, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "Blogs word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
Selecting by n

This chart shows the top 10 words found in the news text file.

plot_ly(
        arrange(word.count.news, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "News word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
Selecting by n

This chart shows the top 10 words found in the Twitter text file.

plot_ly(
        arrange(word.count.twitter, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "Twitter word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
Selecting by n
---
title: "Milestone Report"
output:
  html_notebook:
    highlight: tango
    theme: cosmo
    toc: yes
    toc_float: yes
---

## Overview

This document contains the analysis for the Milestone Report of the Capstone Project of the Data Science course.

## Input Data

First, let's set up the project by loading the required libraries and importing all the data used in the analysis. We then sample the datasets to get faster results.

```{r loadData, cache=TRUE}
library(dplyr)
library(plotly)
library(tidytext)
library(tidyr)

## Stop-word list used by the filters below.
## ASSUMPTION: stop.words is not defined anywhere in this notebook;
## it is taken here to be tidytext's built-in stop_words lexicon.
stop.words <- tidytext::stop_words

## Raw data extraction
## connection to the blogs file
con <- file("data/final/en_US/en_US.blogs.txt", "r") 

## reads all the lines, skipping embedded NUL characters
txt_blogs <- readLines(con, skipNul = TRUE)

## Close the connection once done
close(con, type = "r")

## connection to the news file
con <- file("data/final/en_US/en_US.news.txt", "r") 

## reads all the lines, skipping embedded NUL characters
txt_news <- readLines(con, skipNul = TRUE)

## Close the connection once done
close(con, type = "r")

## connection to the Twitter file
con <- file("data/final/en_US/en_US.twitter.txt", "r") 

## reads all the lines, skipping embedded NUL characters
txt_twitter <- readLines(con, skipNul = TRUE)

## Close the connection once done
close(con, type = "r")

## Transforms each text vector into a data frame (one row per line)
## tibble() replaces the deprecated data_frame()
# blogs
txt.blogs.df <- tibble(line = seq_along(txt_blogs), text = txt_blogs)

# news
txt.news.df <- tibble(line = seq_along(txt_news), text = txt_news)

# twitter
txt.twitter.df <- tibble(line = seq_along(txt_twitter), text = txt_twitter)

## Creates a random sample of 1/30 (about 3.3%) of the rows of each data frame
# blogs
txt.blogs.smp.df <- txt.blogs.df[sample(nrow(txt.blogs.df), nrow(txt.blogs.df) / 30), ]

# news
txt.news.smp.df <- txt.news.df[sample(nrow(txt.news.df), nrow(txt.news.df) / 30), ]

# twitter
txt.twitter.smp.df <- txt.twitter.df[sample(nrow(txt.twitter.df), nrow(txt.twitter.df) / 30), ]
```
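
The read, convert and sample steps are repeated once per file, so they could be wrapped in two small helpers. The following is only a sketch under assumptions: `read_corpus()`, `sample_corpus()`, the seed value and the temporary toy file are hypothetical names and values introduced for illustration, not part of the original analysis. Fixing a seed makes the sample repeatable between runs.

```r
library(dplyr)

# Hypothetical helper: reads every line of a text file into a tibble
# with one row per line, closing the connection even if reading fails.
read_corpus <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  lines <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
  tibble(line = seq_along(lines), text = lines)
}

# Hypothetical helper: draws a reproducible random sample of the rows.
# The seed value is arbitrary and only serves repeatability.
sample_corpus <- function(df, fraction, seed = 1234) {
  set.seed(seed)
  slice_sample(df, prop = fraction)
}

# Toy usage on a temporary file, so the sketch runs without the real data
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("toy line number %d", 1:10), tmp)

toy_df  <- read_corpus(tmp)
toy_smp <- sample_corpus(toy_df, fraction = 0.5)  # the chunk above uses 1/30

nrow(toy_df)
nrow(toy_smp)
```

With the real files, each helper would be called once per source, for example `read_corpus("data/final/en_US/en_US.blogs.txt")` followed by `sample_corpus(..., fraction = 1 / 30)`.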


## Exploratory Analysis

Let's start with some simple word counts, first removing the stop words from the files. This is the one for the blogs text file.

### Word count

```{r wordCountBlogs, cache=TRUE}
# word count for blogs
word.count.blogs <- txt.blogs.smp.df %>%
        unnest_tokens(word, text) %>% 
        # removes the stop words
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 

word.count.blogs
```

This is the word count for the news file.

```{r wordCountNews, cache=TRUE}
# word count for news
word.count.news <- txt.news.smp.df %>%
        unnest_tokens(word, text) %>% 
        # removes the stop words
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 

word.count.news
```

Finally, the word count for the Twitter file.

```{r wordCountTwitter, cache=TRUE}
# word count for twitter
word.count.twitter <- txt.twitter.smp.df %>%
        unnest_tokens(word, text) %>% 
        filter(!(word %in% stop.words$word)) %>% 
        count(word, sort = TRUE) 

word.count.twitter
```
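
The three word-count pipelines differ only in their input data frame, so the same steps could be expressed once as a function. This is a minimal sketch, assuming the stop-word list is tidytext's built-in `stop_words`; the function name `count_words()` and the toy sentences are hypothetical. It uses `anti_join()` instead of `filter(!(word %in% ...))`, which should give the same result for a list with a `word` column.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: tokenizes the text column into words, drops the
# stop words, and returns the word frequencies sorted in descending order.
# ASSUMPTION: stop_list defaults to tidytext's built-in stop_words.
count_words <- function(df, stop_list = tidytext::stop_words) {
  df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_list, by = "word") %>%
    count(word, sort = TRUE)
}

# Toy data standing in for one of the sampled data frames
toy <- tibble(
  line = 1:2,
  text = c("The cat sat on the mat", "A cat and a dog")
)

toy_counts <- count_words(toy)
toy_counts
```

Applied to the report's data, the call would look like `count_words(txt.blogs.smp.df)`.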

### Line count

Let's run a simple line count on the three files.

```{r}
# Blogs data frame
print(paste("Blogs rows:", format(nrow(txt.blogs.df), decimal.mark=".", 
                                  big.mark=",", small.mark=".")))

# News data frame
print(paste("News rows:", format(nrow(txt.news.df), decimal.mark=".", 
                                 big.mark=",", small.mark=".")))

# Twitter data frame
print(paste("Twitter rows:", format(nrow(txt.twitter.df), decimal.mark=".", 
                                    big.mark=",", small.mark=".")))
```
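
The three `print()` calls could also be collapsed into one vectorized expression. The sketch below is illustrative only: the counts are hard-coded from the output reported in this document so that it runs without the data, and `line_counts` and `labels` are hypothetical names. `prettyNum()` formats each element separately, so `big.mark` is the only option needed for whole numbers.

```r
# Row counts copied from the output reported in this document;
# in the notebook they would come from nrow() on each data frame.
line_counts <- c(Blogs = 899288L, News = 1010242L, Twitter = 2360148L)

# One label per source, with a comma as the thousands separator
labels <- paste0(
  names(line_counts), " rows: ",
  prettyNum(line_counts, big.mark = ",")
)

print(labels)
```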

### Bigrams

Let's now calculate the bigrams, that is, the pairs of words that appear next to each other in the text. We will start with the blogs text file.

```{r blogsSampleBigram, cache=TRUE}
blogs.bigram <- txt.blogs.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)

blogs.bigram
```

Then the news txt file.

```{r newsSampleBigram, cache=TRUE}
news.bigram <- txt.news.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)

news.bigram
```

And finally the Twitter file.

```{r twitterSampleBigram, cache=TRUE}
twitter.bigram <- txt.twitter.smp.df %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        separate(bigram, c("word1", "word2"), sep = " ") %>% 
        filter(!word1 %in% stop.words$word) %>%
        filter(!word2 %in% stop.words$word) %>% 
        count(word1, word2, sort = TRUE)

twitter.bigram
```
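
To illustrate how bigram counts like these might later feed a prediction step, here is a rough sketch of a next-word lookup. It is not the report's algorithm: `count_bigrams()`, `next_word()` and the toy sentences are hypothetical, and stop words are deliberately kept in this sketch, on the assumption that a next-word predictor would also need to suggest common words.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical helper: counts every pair of consecutive words.
# Lines with fewer than two words yield no bigram and are dropped.
count_bigrams <- function(df) {
  df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram)) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE)
}

# Hypothetical helper: returns the most frequent follower of `previous`
# (an empty character vector if the word was never seen as word1).
next_word <- function(bigrams, previous) {
  bigrams %>%
    filter(word1 == previous) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%
    pull(word2)
}

# Toy data standing in for one of the sampled data frames
toy <- tibble(text = c("new york city", "new york times", "new jersey"))

toy_bigrams <- count_bigrams(toy)
toy_suggestion <- next_word(toy_bigrams, "new")
toy_suggestion
```

Picking the single most frequent follower is the simplest possible rule; a real predictor would likely need longer n-grams and a fallback for unseen words.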


## Charts

The following charts explore in more detail the data frames that will be used to build a prediction algorithm.

### Line count

```{r}
plot_ly(
        x = c("Blogs", "News", "Twitter"),
        y = c(nrow(txt.blogs.df), nrow(txt.news.df), nrow(txt.twitter.df)),
        name = "Line count",
        type = "bar"
) %>% 
        layout(title = "Line count",
               xaxis = list(title = "data frame"),
               yaxis = list(title = "lines"))

```

### Word count

This chart shows the top 10 words found in the blogs text file.

```{r blogsWordCount}
plot_ly(
        arrange(word.count.blogs, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "Blogs word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
```

This chart shows the top 10 words found in the news text file.

```{r newsWordCount}
plot_ly(
        arrange(word.count.news, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "News word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
```

This chart shows the top 10 words found in the Twitter text file.

```{r twitterWordCount}
plot_ly(
        arrange(word.count.twitter, desc(n)) %>% top_n(10),
        x = ~word,
        y = ~n,
        name = "Word count (top 10)",
        type = "bar"
) %>% 
        layout(title = "Twitter word count (top 10)",
               xaxis = list(title = "word"),
               yaxis = list(title = "frequency"))
```
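
Because the three chart chunks differ only in their data and title, a small plotting helper could build the title from the source name, so a title cannot drift out of sync with the data it describes. This is a sketch under assumptions: `top_words()`, `plot_top_words()` and the toy counts are hypothetical. It uses `slice_max()` with `with_ties = FALSE` to return exactly `k` rows, whereas `top_n()` keeps ties, and sets the axis option `categoryorder = "total descending"` to request bars ordered by frequency.

```r
library(dplyr)
library(plotly)

# Hypothetical helper: keeps the k most frequent words (exactly k rows)
top_words <- function(counts, k = 10) {
  counts %>%
    slice_max(n, n = k, with_ties = FALSE)
}

# Hypothetical helper: bar chart of the top k words, titled after the source
plot_top_words <- function(counts, source_name, k = 10) {
  plot_ly(top_words(counts, k), x = ~word, y = ~n, type = "bar") %>%
    layout(
      title = paste0(source_name, " word count (top ", k, ")"),
      xaxis = list(title = "word", categoryorder = "total descending"),
      yaxis = list(title = "frequency")
    )
}

# Toy counts standing in for word.count.blogs and the other count tables
toy_counts <- tibble(
  word = c("time", "day", "love", "people"),
  n    = c(40L, 25L, 30L, 10L)
)

toy_top  <- top_words(toy_counts, k = 3)
toy_plot <- plot_top_words(toy_counts, "Toy", k = 3)
toy_plot
```

With the report's objects, the calls would be `plot_top_words(word.count.blogs, "Blogs")`, `plot_top_words(word.count.news, "News")` and `plot_top_words(word.count.twitter, "Twitter")`.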
