This report describes the management and processing of the course data set. The data set contains three files with text from tweets, blogs, and news articles. Here, I show the process to clean the data and perform an exploratory analysis of it. I present different measures, such as the frequencies of unigrams, bigrams, and trigrams, and the number of unique words needed to cover a given percentage of all word instances.
library(tidytext)
library(tidyverse)
library(ggplot2)
library(hunspell)
Read the data.
con <- file("data/final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)
con <- file("data/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)
con <- file("data/final/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
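As a quick check on the raw inputs (a sketch, not part of the original steps; it assumes the same file paths used above), the size of each file on disk can be reported in megabytes.
files <- c(twitter = "data/final/en_US/en_US.twitter.txt",
           blogs = "data/final/en_US/en_US.blogs.txt",
           news = "data/final/en_US/en_US.news.txt")
sapply(files, function(f) round(file.size(f) / 1024^2, 1))  # size in MB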
Get the total number of lines per document.
paste(c("Twitter", "News", "Blogs"),
c(length(twitter), length(news), length(blogs)),
sep = ": ")
## [1] "Twitter: 2360148" "News: 77259" "Blogs: 899288"
The three documents differ in their number of lines, but there is also a difference in the length of their texts. We can see that blog and news entries generally contain more characters than tweets, which is expected given Twitter's character limit per tweet.
summary(str_length(twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
summary(str_length(news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 111.0 186.0 202.4 270.0 5760.0
summary(str_length(blogs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
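The line counts and length summaries above can also be collected into a single overview table (a sketch; the values simply restate the objects computed so far).
tibble(source = c("Twitter", "News", "Blogs"),
       lines = c(length(twitter), length(news), length(blogs)),
       mean_chars = c(mean(str_length(twitter)),
                      mean(str_length(news)),
                      mean(str_length(blogs))),
       max_chars = c(max(str_length(twitter)),
                     max(str_length(news)),
                     max(str_length(blogs))))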
I use a sample of 10 percent of the total data to perform the analysis.
sample_size <- 0.1
blogs <- sample(blogs, floor(length(blogs) * sample_size), replace = FALSE)
news <- sample(news, floor(length(news) * sample_size), replace = FALSE)
twitter <- sample(twitter, floor(length(twitter) * sample_size), replace = FALSE)
First, I create a dataset that contains the text of the three files.
df <- tibble(
id = 1:(length(twitter) + length(blogs) + length(news)),
text = c(twitter,blogs,news))
Now, I convert all the text to lowercase.
df$text <- tolower(df$text)
Also, I remove all numbers, because they are not relevant for the analysis.
df <- df %>%
mutate(text = str_remove_all(text, "[0-9]+"))
Divide the text into tokens.
df_tokens <- df %>%
unnest_tokens(word, text)
Remove stopwords.
df_tokens <- df_tokens %>%
anti_join(stop_words, by = "word")
Finally, get the frequency of words.
count_words <- df_tokens %>%
count(word, sort = TRUE)
ggplot(count_words[1:10,], aes(x = reorder(word,-n), y = n)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Frequency of Words (top 10)",
x = "Word",
y = "Frequency (n)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
A bigram is a sequence of two elements from a string of tokens. Below, I present the process to get the frequencies of these bigrams in the files.
Divide the text into bigrams.
df_bigrams <- df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
Remove stopwords.
df_bigrams <- df_bigrams %>%
filter(!str_detect(bigram,paste0("\\b(", paste(stop_words$word, collapse = "|"), ")\\b")))
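The regular-expression filter above works, but it can be slow on a large corpus. A common tidytext alternative (a sketch, not part of the original pipeline) is to split each bigram into its two words with separate(), filter against stop_words directly, and rejoin with unite(); the same idea extends to trigrams with a third column.
df_bigrams_alt <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")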
Finally, get the frequency of bigrams.
count_bigrams <- df_bigrams %>%
count(bigram, sort = TRUE)
ggplot(count_bigrams[1:10,], aes(x = reorder(bigram,-n), y = n)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Frequency of Bigrams (top 10)",
x = "Bigram",
y = "Frequency (n)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
A trigram, like a bigram, is a sequence of three elements from a string of tokens. Below, I present the process to get the frequencies of these trigrams in the files.
Divide the text into trigrams.
df_trigrams <- df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3)
Remove stopwords.
df_trigrams <- df_trigrams %>%
filter(!str_detect(trigram,paste0("\\b(", paste(stop_words$word, collapse = "|"), ")\\b")))
Finally, get the frequency of trigrams.
count_trigrams <- df_trigrams %>%
count(trigram, sort = TRUE)
ggplot(count_trigrams[1:10,], aes(x = reorder(trigram,-n), y = n)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Frequency of Trigrams (top 10)",
x = "Trigram",
y = "Frequency (n)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
In this section, I show a method to evaluate whether a token is English or not. The method checks each token against a dictionary. This approach is very simple, but it gives a good first approximation.
en_words <- dictionary("en_US")
df_tokens <- df_tokens %>%
mutate(is_english = hunspell_check(word, dict = en_words))
head(df_tokens,5)
## # A tibble: 5 × 3
## id word is_english
## <int> <chr> <lgl>
## 1 2 hear TRUE
## 2 2 graduate TRUE
## 3 2 louisville FALSE
## 4 2 coffee TRUE
## 5 3 feel TRUE
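A natural follow-up (a sketch; dropping unrecognised tokens is my choice, not a step stated above) is to keep only the tokens the dictionary recognises and recount their frequencies, without overwriting the original count_words.
# Keep only tokens recognised by the en_US dictionary
df_tokens_en <- df_tokens %>%
  filter(is_english)

count_words_en <- df_tokens_en %>%
  count(word, sort = TRUE)
Note that this also discards valid proper nouns, such as louisville in the sample shown above.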
In this section, I show how many unique words are needed to cover 50% (and 90%) of all word instances in the files.
cover <- count_words %>%
mutate( total = sum(n),
prop = n / total,
cum_prop = cumsum(prop))
n50 <- which(cover$cum_prop >= 0.5)[1]
n90 <- which(cover$cum_prop >= 0.9)[1]
paste("Total of words that covers ", c("50 %: ", "90 %: "), c(n50,n90), sep = "")
## [1] "Total of words that covers 50 %: 1462"
## [2] "Total of words that covers 90 %: 20130"