OVERVIEW

This report covers the management and processing of the course data set. The data set contains three files with text from tweets, blogs, and news articles. Here, I show the process of cleaning the data and performing an exploratory analysis. I present several measures, such as the frequencies of unigrams, bigrams, and trigrams, and the number of unique words needed to cover a given percentage of all word instances.

LIBRARIES

library(tidytext)
library(tidyverse)
library(ggplot2)
library(hunspell)

DATA

Read the data.

con <- file("data/final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)

con <- file("data/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)

con <- file("data/final/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
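
readLines can stop early or emit warnings when a file contains embedded nul characters or an incomplete final line. A more defensive read is sketched below for the news file, using the standard skipNul and encoding arguments of readLines; it was not used for the results reported here.

# Sketch: defensive read in case the file contains embedded nuls
con <- file("data/final/en_US/en_US.news.txt", "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)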

Get the total number of lines per document.

paste(c("Twitter", "News", "Blogs"),
      c(length(twitter), length(news), length(blogs)), 
      sep = ": ")
## [1] "Twitter: 2360148" "News: 77259"      "Blogs: 899288"

The three documents differ in their number of lines, but there is also a difference in the length of the texts themselves. Blogs and news entries generally contain more characters than tweets, which is expected given Twitter's character limit per tweet.

summary(str_length(twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
summary(str_length(news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   111.0   186.0   202.4   270.0  5760.0
summary(str_length(blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833

I use a sample of 10 percent of the total data to perform the analysis.

sample_size <- 0.1
blogs <- sample(blogs, length(blogs) * sample_size , replace = FALSE)
news <- sample(news, length(news) * sample_size , replace = FALSE)
twitter <- sample(twitter, length(twitter) * sample_size , replace = FALSE)
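
Because sample() is random, each run of the report works on a different 10 percent subset. A minimal sketch for reproducibility (the seed value is arbitrary) is to fix the random seed before the three sample() calls:

set.seed(1234)  # arbitrary seed, makes the 10 percent sample reproducible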

First, I create a data set that contains the text of the three files.

df <- tibble(
  id = 1:(length(twitter) + length(blogs) + length(news)),
  text = c(twitter,blogs,news))

Now, I convert all the text to lowercase.

df$text <- tolower(df$text)

Also, I remove all numbers, since they are not relevant for the analysis.

df <- df %>%
  mutate(text = str_remove_all(text, "[0-9]+"))
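
Tweets in particular often contain URLs and user handles that add little to the n-gram counts. A possible extra cleaning step is sketched below; the regular expressions are illustrative and this step was not applied to the results shown in this report.

# Sketch: optionally strip URLs and Twitter handles before tokenizing
df <- df %>%
  mutate(text = str_remove_all(text, "https?://\\S+"),  # URLs
         text = str_remove_all(text, "@\\w+"))          # @handles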

UNIGRAM

Divide the text into tokens.

df_tokens <- df %>%
  unnest_tokens(word, text)

Remove stopwords.

df_tokens <- df_tokens %>%
  anti_join(stop_words, by = "word")

Finally, get the frequency of words.

count_words <- df_tokens %>%
  count(word, sort = TRUE)
ggplot(count_words[1:10,], aes(x = reorder(word,-n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Frequency of Words (top 10)",
       x = "Word",
       y = "Frequency (n)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

BIGRAMS

A bigram is a sequence of two adjacent elements from a string of tokens. I present the process to get the frequencies of these bigrams in the files.

Divide the text into bigrams.

df_bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

Remove stopwords.

df_bigrams  <- df_bigrams %>%
  filter(!str_detect(bigram,paste0("\\b(", paste(stop_words$word, collapse = "|"), ")\\b")))
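
Building one large regular expression out of every stop word works, but it can be slow on a big sample. A broadly equivalent sketch splits each bigram into its two words and filters on them directly (separate() and unite() come from tidyr, which is loaded with the tidyverse):

# Sketch: split the bigram and drop rows where either word is a stop word
df_bigrams <- df_bigrams %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")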

Finally, get the frequency of bigrams.

count_bigrams <- df_bigrams  %>%
  count(bigram, sort = TRUE)
ggplot(count_bigrams[1:10,], aes(x = reorder(bigram,-n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Frequency of Bigrams (top 10)",
       x = "Bigram",
       y = "Frequency (n)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

TRIGRAMS

A trigram, like a bigram, is a sequence of three adjacent elements from a string of tokens. I present the process to get the frequencies of these trigrams in the files.

Divide the text into trigrams.

df_trigrams <- df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

Remove stopwords.

df_trigrams  <- df_trigrams %>%
  filter(!str_detect(trigram,paste0("\\b(", paste(stop_words$word, collapse = "|"), ")\\b")))

Finally, get the frequency of trigrams.

count_trigrams <- df_trigrams  %>%
  count(trigram, sort = TRUE)
ggplot(count_trigrams[1:10,], aes(x = reorder(trigram,-n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Frequency of Trigrams (top 10)",
       x = "Trigram",
       y = "Frequency (n)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

FOREIGN LANGUAGE

In this section, I show a method to evaluate whether a token is English or not. The method checks each token against a dictionary. This approach is very simple, but it gives a good first approximation.

en_words <- dictionary("en_US")

df_tokens <- df_tokens %>%
  mutate(is_english = hunspell_check(word, dict = en_words))

head(df_tokens,5)
## # A tibble: 5 × 3
##      id word       is_english
##   <int> <chr>      <lgl>     
## 1     2 hear       TRUE      
## 2     2 graduate   TRUE      
## 3     2 louisville FALSE     
## 4     2 coffee     TRUE      
## 5     3 feel       TRUE
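
With the is_english flag in place, the share of tokens the dictionary does not recognize is simply the mean of the negated flag. Note that proper nouns such as "louisville" may fail the check partly because the text was lowercased. A small sketch:

# Sketch: share of tokens not recognized by the en_US dictionary
mean(!df_tokens$is_english)

# Optionally keep only recognized tokens before counting frequencies
df_tokens_en <- df_tokens %>%
  filter(is_english)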

WORD COVERAGE

In this section, I show how many unique words are needed to cover 50% (and 90%) of all word instances in the files.

cover <- count_words %>%
  mutate(total = sum(n),
         prop = n / total,
         cum_prop = cumsum(prop))
n50 <- which(cover$cum_prop >= 0.5)[1]
n90 <- which(cover$cum_prop >= 0.9)[1]

paste("Total of words that covers ", c("50 %: ", "90 %: "), c(n50,n90), sep = "")
## [1] "Total of words that covers 50 %: 1462" 
## [2] "Total of words that covers 90 %: 20130"