This is my milestone report for the Capstone Project (Module 2). The final goal of the project is to create a text prediction algorithm and a Shiny app.
This report, however, covers only the initial exploratory steps: counting the lines in each source file, sampling and tokenizing the text, and examining word frequencies.
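The analysis relies on the dplyr, tidytext, and ggplot2 packages. A minimal sketch of the setup chunk:

# Loading the packages used throughout this report
library(dplyr)     # data manipulation, tibbles, and the %>% pipe
library(tidytext)  # unnest_tokens() for tokenization
library(ggplot2)   # word frequency plots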
# Counting number of lines on each text file
num_lines_blog <- length(count.fields(blogdata_path, sep = "\n"))
num_lines_twitter <- length(count.fields(twitterdata_path, sep = "\n"))
num_lines_news <- length(count.fields(newsdata_path, sep = "\n"))
# Creating a table using vectors
datasets <- c("blogs", "twitter", "news")
lines_count <- c(num_lines_blog, num_lines_twitter, num_lines_news)
table_linescount <- as.data.frame(lines_count, row.names = datasets)
# Table 1
table_linescount
## lines_count
## blogs 898821
## twitter 2329858
## news 2329858
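The tokenization below operates on blogdata_sample, twitterdata_sample, and newsdata_sample, which are created in a chunk not shown here. A minimal sketch of that step, assuming 1,000 randomly chosen lines per file (the sample size stated later in this report):

# Drawing a random sample of 1,000 lines from each file
# (sketch; the original sampling chunk is not shown in this report)
set.seed(123)  # for reproducibility
sample_lines <- function(path, n = 1000) {
  all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(all_lines, n)
}
blogdata_sample    <- sample_lines(blogdata_path)
twitterdata_sample <- sample_lines(twitterdata_path)
newsdata_sample    <- sample_lines(newsdata_path)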
# Tokenizing by splitting on whitespace
tokens_blogs <- unlist(strsplit(blogdata_sample, "\\s+"))
tokens_twitter <- unlist(strsplit(twitterdata_sample, "\\s+"))
tokens_news <- unlist(strsplit(newsdata_sample, "\\s+"))
# Converting each token vector to a tibble (one token per row)
blogdata_df <- tibble(line = seq_along(tokens_blogs), text = tokens_blogs)
twitterdata_df <- tibble(line = seq_along(tokens_twitter), text = tokens_twitter)
newsdata_df <- tibble(line = seq_along(tokens_news), text = tokens_news)
# Tokenizing each sample data set into individual words
tokenized_blog <- blogdata_df %>%
  unnest_tokens(word, text)
tokenized_twitter <- twitterdata_df %>%
  unnest_tokens(word, text)
tokenized_news <- newsdata_df %>%
  unnest_tokens(word, text)
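The three word-frequency tables printed below were produced by a counting step whose code does not appear in this report. Judging by the number of distinct words in each table (29,210; 30,924; and 25,843), they correspond to the blog, news, and twitter samples, in that order. A minimal sketch of that step, assuming the object names word_freq_blog, word_freq_news, and word_freq_twitter:

# Counting word frequencies in each tokenized sample
# (sketch; object names are assumed, the original chunk is not shown)
word_freq_blog    <- tokenized_blog %>% count(word, sort = TRUE)
word_freq_news    <- tokenized_news %>% count(word, sort = TRUE)
word_freq_twitter <- tokenized_twitter %>% count(word, sort = TRUE)
word_freq_blog
word_freq_news
word_freq_twitter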
## # A tibble: 29,210 × 2
## word n
## <chr> <int>
## 1 the 18560
## 2 and 10977
## 3 to 10773
## 4 a 8873
## 5 of 8706
## 6 i 7837
## 7 in 5968
## 8 that 4538
## 9 is 4185
## 10 it 4073
## # ℹ 29,200 more rows
## # A tibble: 30,924 × 2
## word n
## <chr> <int>
## 1 the 19660
## 2 to 8954
## 3 and 8908
## 4 a 8819
## 5 of 7727
## 6 in 6877
## 7 for 3604
## 8 that 3426
## 9 is 2954
## 10 on 2734
## # ℹ 30,914 more rows
## # A tibble: 25,843 × 2
## word n
## <chr> <int>
## 1 the 9262
## 2 to 8086
## 3 i 7190
## 4 a 5902
## 5 you 5516
## 6 and 4450
## 7 for 3880
## 8 in 3738
## 9 of 3677
## 10 is 3482
## # ℹ 25,833 more rows
# The number of words in the blog sample
nrow(tokenized_blog)
## [1] 375765
# The number of words in the news sample
nrow(tokenized_news)
## [1] 348564
# The number of words in the twitter sample
nrow(tokenized_twitter)
## [1] 302050
# The number of unique words in the blog sample
tokenized_blog %>%
distinct(word) %>%
nrow()
## [1] 29210
# The number of unique words in the news sample
tokenized_news %>%
distinct(word) %>%
nrow()
## [1] 30924
# The number of unique words in the twitter sample
tokenized_twitter %>%
distinct(word) %>%
nrow()
## [1] 25843
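The bar charts below use word_freq_blog_25, word_freq_twitter_25, and word_freq_news_25, which are not defined in the chunks shown here. A plausible sketch, assuming they simply hold the 25 most frequent words from the counts above:

# Keeping the 25 most frequent words per sample
# (sketch; these definitions are assumed, not shown in the original chunks)
word_freq_blog_25    <- word_freq_blog %>% slice_max(n, n = 25)
word_freq_twitter_25 <- word_freq_twitter %>% slice_max(n, n = 25)
word_freq_news_25    <- word_freq_news %>% slice_max(n, n = 25)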
# Plotting word frequency distributions
ggplot(word_freq_blog_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - Blog Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
ggplot(word_freq_twitter_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - Twitter Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
ggplot(word_freq_news_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - News Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
Sampling 1,000 lines from each text file yielded an average of about 342,000 words per sample (375,765 for blogs, 348,564 for news, and 302,050 for twitter).
In terms of unique words, the average across the three samples was about 28,700 distinct words.
The word frequency distributions raise the issue of stop words. Across the three data sets, the most frequent words were common function words such as “the”, “and”, “a”, and “of”, which have the highest frequencies but provide little analytical value.
An interesting related concept is word coverage.
Determining what percentage of all word instances in the corpus is accounted for by the most frequent words helps in understanding the diversity of the corpus and the potential coverage of the predictions.
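As a concrete illustration, a cumulative frequency table answers questions such as “how many distinct words are needed to cover 50% or 90% of all word instances?”. The sketch below builds on the word_freq_blog object assumed earlier:

# Cumulative coverage of word instances by the most frequent words
# (sketch based on the assumed word_freq_blog object)
coverage_blog <- word_freq_blog %>%
  mutate(rank = row_number(),
         coverage = cumsum(n) / sum(n))
# Number of distinct words needed to cover 50% and 90% of all word instances
min(which(coverage_blog$coverage >= 0.5))
min(which(coverage_blog$coverage >= 0.9))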
The next step is building n-gram models so that we can create a predictive text Shiny app. By measuring how often word X is followed by word Y, we can model the relationships between words.
With n-gram models, we can leverage observed word patterns (sequences of 2 or 3 consecutive words) and estimate the likelihood of a word following another based on frequency counts.
The plan is to calculate cumulative frequency distributions, compute n-gram probabilities, and then create a predictive function. This function should use the calculated probabilities to predict the next word(s) given some input text. I also want to calculate how much of the text is covered by words of different frequencies.
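A minimal sketch of that direction, using tidytext’s n-gram tokenizer on the raw blog sample; the object and function names here (bigrams_blog, predict_next) are illustrative only, not the final implementation:

library(tidyr)  # for separate(); assumed, as it is not loaded in the chunks above

# Building a bigram frequency table from the raw blog sample (sketch)
bigrams_blog <- tibble(text = blogdata_sample) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

# Toy next-word prediction: the k most likely followers of a given word,
# with probabilities estimated from relative bigram counts
predict_next <- function(w, bigram_counts, k = 3) {
  bigram_counts %>%
    filter(word1 == w) %>%
    mutate(prob = n / sum(n)) %>%
    slice_max(prob, n = k)
}
predict_next("in", bigrams_blog)

The same pattern extends to trigrams by setting n = 3 and separating the n-gram into three word columns.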