This is my milestone report for the Capstone Project (Module 2). The final goal of the project is to create a text prediction algorithm and a Shiny app.
This report, however, covers only the initial exploratory steps: counting the lines in each source file, sampling and tokenizing the text, and examining word frequencies.
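The analysis relies on the dplyr, tidytext, and ggplot2 packages. A minimal sketch of the setup chunk:

# Loading the packages used throughout this report
library(dplyr)     # data manipulation, tibbles, and the %>% pipe
library(tidytext)  # unnest_tokens() for tokenization
library(ggplot2)   # word frequency plots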
# Counting number of lines on each text file
num_lines_blog <- length(count.fields(blogdata_path, sep = "\n"))
num_lines_twitter <- length(count.fields(twitterdata_path, sep = "\n"))
num_lines_news <- length(count.fields(newsdata_path, sep = "\n"))
# Creating a table using vectors
datasets <- c("blogs", "twitter", "news")
lines_count <- c(num_lines_blog, num_lines_twitter, num_lines_news)
table_linescount <- as.data.frame(lines_count, row.names = datasets)
# Table 1
table_linescount
## lines_count
## blogs 898821
## twitter 2329858
## news 2329858
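The tokenization below operates on blogdata_sample, twitterdata_sample, and newsdata_sample, which are created in a chunk not shown here. A minimal sketch of that step, assuming 1,000 randomly chosen lines per file (the sample size stated later in this report):

# Drawing a random sample of 1,000 lines from each file
# (sketch; the original sampling chunk is not shown in this report)
set.seed(123)  # for reproducibility
sample_lines <- function(path, n = 1000) {
  all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(all_lines, n)
}
blogdata_sample    <- sample_lines(blogdata_path)
twitterdata_sample <- sample_lines(twitterdata_path)
newsdata_sample    <- sample_lines(newsdata_path)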
# Tokenizing by splitting on whitespace
tokens_blogs <- unlist(strsplit(blogdata_sample, "\\s+"))
tokens_twitter <- unlist(strsplit(twitterdata_sample, "\\s+"))
tokens_news <- unlist(strsplit(newsdata_sample, "\\s+"))
# Converting each token vector to a tibble (one token per row)
blogdata_df <- tibble(line = seq_along(tokens_blogs), text = tokens_blogs)
twitterdata_df <- tibble(line = seq_along(tokens_twitter), text = tokens_twitter)
newsdata_df <- tibble(line = seq_along(tokens_news), text = tokens_news)
# Tokenizing each sample data set into individual words
tokenized_blog <- blogdata_df %>%
  unnest_tokens(word, text)
tokenized_twitter <- twitterdata_df %>%
  unnest_tokens(word, text)
tokenized_news <- newsdata_df %>%
  unnest_tokens(word, text)
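The three word-frequency tables printed below were produced by a counting step whose code does not appear in this report. Judging by the number of distinct words in each table (29,210; 30,924; and 25,843), they correspond to the blog, news, and twitter samples, in that order. A minimal sketch of that step, assuming the object names word_freq_blog, word_freq_news, and word_freq_twitter:

# Counting word frequencies in each tokenized sample
# (sketch; object names are assumed, the original chunk is not shown)
word_freq_blog    <- tokenized_blog %>% count(word, sort = TRUE)
word_freq_news    <- tokenized_news %>% count(word, sort = TRUE)
word_freq_twitter <- tokenized_twitter %>% count(word, sort = TRUE)
word_freq_blog
word_freq_news
word_freq_twitter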
## # A tibble: 29,210 × 2
## word n
## <chr> <int>
## 1 the 18560
## 2 and 10977
## 3 to 10773
## 4 a 8873
## 5 of 8706
## 6 i 7837
## 7 in 5968
## 8 that 4538
## 9 is 4185
## 10 it 4073
## # ℹ 29,200 more rows
## # A tibble: 30,924 × 2
## word n
## <chr> <int>
## 1 the 19660
## 2 to 8954
## 3 and 8908
## 4 a 8819
## 5 of 7727
## 6 in 6877
## 7 for 3604
## 8 that 3426
## 9 is 2954
## 10 on 2734
## # ℹ 30,914 more rows
## # A tibble: 25,843 × 2
## word n
## <chr> <int>
## 1 the 9262
## 2 to 8086
## 3 i 7190
## 4 a 5902
## 5 you 5516
## 6 and 4450
## 7 for 3880
## 8 in 3738
## 9 of 3677
## 10 is 3482
## # ℹ 25,833 more rows
# The number of words in the blog sample
nrow(tokenized_blog)
## [1] 375765
# The number of words in the news sample
nrow(tokenized_news)
## [1] 348564
# The number of words in the twitter sample
nrow(tokenized_twitter)
## [1] 302050
# The number of unique words in the blog sample
tokenized_blog %>%
distinct(word) %>%
nrow()
## [1] 29210
# The number of unique words in the news sample
tokenized_news %>%
distinct(word) %>%
nrow()
## [1] 30924
# The number of unique words in the twitter sample
tokenized_twitter %>%
distinct(word) %>%
nrow()
## [1] 25843
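The bar charts below use word_freq_blog_25, word_freq_twitter_25, and word_freq_news_25, which are not defined in the chunks shown here. A plausible sketch, assuming they simply hold the 25 most frequent words from the counts above:

# Keeping the 25 most frequent words per sample
# (sketch; these definitions are assumed, not shown in the original chunks)
word_freq_blog_25    <- word_freq_blog %>% slice_max(n, n = 25)
word_freq_twitter_25 <- word_freq_twitter %>% slice_max(n, n = 25)
word_freq_news_25    <- word_freq_news %>% slice_max(n, n = 25)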
# Plotting word frequency distributions
ggplot(word_freq_blog_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - Blog Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
ggplot(word_freq_twitter_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - Twitter Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
ggplot(word_freq_news_25, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Word Frequency Distribution - News Sample",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
Sampling 1,000 lines from each text file yielded an average of about 342,000 words per sample (375,765 for blogs, 348,564 for news, and 302,050 for twitter).
In terms of unique words, the average across the three samples was about 28,700 distinct words.
The word frequency distributions raise the issue of stop words. Across the three data sets, the most frequent words were common function words such as “the”, “and”, “a”, and “of”, which have the highest frequencies but provide little analytical value.
An interesting related concept is word coverage.
Determining what percentage of all word instances in the corpus is accounted for by the most frequent words helps in understanding the diversity of the corpus and the potential coverage of the predictions.
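As a concrete illustration, a cumulative frequency table answers questions such as “how many distinct words are needed to cover 50% or 90% of all word instances?”. The sketch below builds on the word_freq_blog object assumed earlier:

# Cumulative coverage of word instances by the most frequent words
# (sketch based on the assumed word_freq_blog object)
coverage_blog <- word_freq_blog %>%
  mutate(rank = row_number(),
         coverage = cumsum(n) / sum(n))
# Number of distinct words needed to cover 50% and 90% of all word instances
min(which(coverage_blog$coverage >= 0.5))
min(which(coverage_blog$coverage >= 0.9))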
The next step is building n-gram models so that we can create a predictive text Shiny app. By measuring how often word X is followed by word Y, we can model the relationships between words.
With n-gram models, we can leverage observed word patterns (sequences of 2 or 3 consecutive words) and estimate the likelihood of a word following another based on frequency counts.
The plan is to calculate cumulative frequency distributions, compute n-gram probabilities, and then create a predictive function. This function should use the calculated probabilities to predict the next word(s) given some input text. I also want to calculate how much of the text is covered by words of different frequencies.
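A minimal sketch of that direction, using tidytext’s n-gram tokenizer on the raw blog sample; the object and function names here (bigrams_blog, predict_next) are illustrative only, not the final implementation:

library(tidyr)  # for separate(); assumed, as it is not loaded in the chunks above

# Building a bigram frequency table from the raw blog sample (sketch)
bigrams_blog <- tibble(text = blogdata_sample) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

# Toy next-word prediction: the k most likely followers of a given word,
# with probabilities estimated from relative bigram counts
predict_next <- function(w, bigram_counts, k = 3) {
  bigram_counts %>%
    filter(word1 == w) %>%
    mutate(prob = n / sum(n)) %>%
    slice_max(prob, n = k)
}
predict_next("in", bigrams_blog)

The same pattern extends to trigrams by setting n = 3 and separating the n-gram into three word columns.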