library(tokenizers)
library(tm)
library(dplyr)
blogData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
newsData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitterData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
# Size of the blogs file in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / (1024 ^ 2)
## [1] 200.4242
# Size of the news file in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / (1024 ^ 2)
## [1] 196.2775
# Size of the twitter file in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / (1024 ^ 2)
## [1] 159.3641
# Number of lines in Blogs
length(blogData)
## [1] 899288
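# Number of lines in News
# (This count looks low for a ~196 MB file; readLines() may have stopped early
# at an embedded control character, so treat it with caution.)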
length(newsData)
## [1] 77259
# Number of lines in Twitter
length(twitterData)
## [1] 2360148
set.seed(2024)
corpus <- c(sample(blogData, 1000), sample(newsData, 1000), sample(twitterData, 1000))
tokens <- unlist(tokenize_words(corpus))
# Counting the total number of tokens
length(tokens)
## [1] 91045
# Counting the number of unique tokens
length(unique(tokens))
## [1] 14249
# Loading a list of profane words
library(sweary)
profanity_list <- get_swearwords("en")
filtered_tokens <- tokens[!tokens %in% profanity_list]
word_frequency <- table(filtered_tokens)
sorted_word_frequency <- sort(word_frequency, decreasing = TRUE)
print(sorted_word_frequency[1:20])
## filtered_tokens
## the to and a of in i that is for it on you with was as
## 4547 2519 2320 2200 1970 1527 1280 978 921 906 780 714 713 627 577 524
## at this my have
## 516 505 485 458
# Bar chart of the top 10 unigrams
barplot(sorted_word_frequency[1:10], main = "Top 10 Unigram Word Frequencies", xlab = "Word", ylab = "Frequency", col = "lightblue")
# Tokenize into bigrams
bigrams <- tokenize_ngrams(corpus, n = 2)
# Compute frequency of each bigram
bigram_frequency <- table(unlist(bigrams))
sorted_bigram_frequency <- sort(bigram_frequency, decreasing = TRUE)
# Bar chart of the top 5 bigrams
barplot(sorted_bigram_frequency[1:5], main = "Top 5 Bigram Frequencies", xlab = "Bigram", ylab = "Frequency", col = "lightblue")
# Tokenize into trigrams
trigrams <- tokenize_ngrams(corpus, n = 3)
# Compute frequency of each trigram
trigram_frequency <- table(unlist(trigrams))
sorted_trigram_frequency <- sort(trigram_frequency, decreasing = TRUE)
# Bar chart of the top 5 trigrams
barplot(sorted_trigram_frequency[1:5], main = "Top 5 Trigram Frequencies", xlab = "Trigram", ylab = "Frequency", col = "lightblue")
The dataset is very large, so all of the analysis was carried out on a random sample of 1000 lines from each file. For unigrams, most of the top 10 words are stop words, such as “a”, “the”, “is” and “are”, which offer very little value for analysis. In applications such as information retrieval, stop words are therefore removed before analysis. Among bigrams, the 5 most frequent included “the”, as did 3 of the 5 most frequent trigrams.
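The stop-word removal mentioned above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the short `stop_words` vector, the helper name `content_word_frequency`, and the toy `example_tokens` are hypothetical and are not part of the analysis above.

```r
# Illustrative stop-word list (an assumption for this sketch, far from complete)
stop_words <- c("the", "to", "and", "a", "of", "in", "i", "that", "is", "for",
                "it", "on", "you", "with", "was", "as", "at", "this", "my",
                "have", "are")

# Hypothetical helper: drop stop words, then count and sort what remains
content_word_frequency <- function(tokens, stop_words) {
  kept <- tokens[!tokens %in% stop_words]
  sort(table(kept), decreasing = TRUE)
}

# Toy tokens standing in for the sampled corpus
example_tokens <- c("the", "cat", "sat", "on", "the", "mat",
                    "and", "the", "cat", "slept")
example_frequency <- content_word_frequency(example_tokens, stop_words)
print(example_frequency)
```

On the real data, the same helper could be applied to `filtered_tokens`. The `tm` package loaded at the top provides a fuller English list through `stopwords("en")`, which could be passed in place of the short vector used here.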