Exploratory Data Analysis and Modeling

Loading the libraries

library(tokenizers) 
library(tm)
library(dplyr)

Reading the text files

blogData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
newsData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitterData <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")

Checking basic information about each file, such as file size and number of lines

# size of blogs text in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / (1024 ^ 2)

## [1] 200.4242

# size of news text in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / (1024 ^ 2)

## [1] 196.2775

# size of twitter text in MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / (1024 ^ 2)

## [1] 159.3641

# Number of lines in Blogs
length(blogData)

## [1] 899288

# Number of lines in News
length(newsData)

## [1] 77259

# Number of lines in Twitter
length(twitterData)

## [1] 2360148
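The news file is close to the blogs file in size but shows far fewer lines, which may mean the read stopped early. On some platforms a text-mode read can end at an embedded control character. Assuming that is the cause here, one possible check is to read through a binary-mode connection. `read_lines_binary` is a hypothetical helper and not part of the original analysis; the toy file below stands in for the real data.

```r
# Hypothetical helper: read every line through a binary-mode connection,
# so an embedded control character does not end the read early.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = encoding, skipNul = TRUE, warn = FALSE)
}

# Toy file: three lines, the second containing a SUB (0x1A) byte
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x0a, 0x62, 0x1a, 0x63, 0x0a, 0x64, 0x0a)), tmp)

n_lines <- length(read_lines_binary(tmp))
print(n_lines)
```

If the count for the news file changes when it is read this way, the line counts and the sample drawn from `newsData` would need to be regenerated.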

Since the dataset is very large, we randomly sample 1000 lines from each file

set.seed(2024)
corpus <- c(sample(blogData, 1000), sample(newsData, 1000), sample(twitterData, 1000))

Now we convert the corpus we collected into tokens

tokens <- unlist(tokenize_words(corpus))

# Counting the total number of tokens
length(tokens)

## [1] 91045

# Counting the number of unique tokens
length(unique(tokens))

## [1] 14249

Performing profanity filtering on the tokens

# Loading a list of profane words
library(sweary)
profanity_list <- get_swearwords("en")

filtered_tokens <- tokens[!tokens %in% profanity_list]
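The filter above relies on `%in%`, which compares each token against a vector of words. A minimal sketch of the same step on toy data is shown here. It assumes the word list might arrive as a data frame rather than a character vector; the column name `word` and the toy entries are hypothetical stand-ins, so the structure of the real list should be checked before relying on this.

```r
# Toy tokens and a hypothetical word list (stand-ins for the real data)
toy_tokens <- c("this", "darn", "thing", "is", "darn", "slow")
toy_list <- data.frame(language = "en", word = c("darn", "heck"),
                       stringsAsFactors = FALSE)

# %in% needs a plain vector, so pull out the word column
# if the list is a data frame (column name is an assumption)
bad_words <- if (is.data.frame(toy_list)) toy_list$word else toy_list

# Keep only the tokens that are not in the word list
clean_tokens <- toy_tokens[!toy_tokens %in% bad_words]
print(clean_tokens)
```

Comparing `length(tokens)` with `length(filtered_tokens)` is a quick way to confirm that the filter actually removed something.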

Computing the unigram frequency in the list of tokens

word_frequency <- table(filtered_tokens)
sorted_word_frequency <- sort(word_frequency, decreasing = TRUE)
print(sorted_word_frequency[1:20])

## filtered_tokens
##  the   to  and    a   of   in    i that   is  for   it   on  you with  was   as 
## 4547 2519 2320 2200 1970 1527 1280  978  921  906  780  714  713  627  577  524 
##   at this   my have 
##  516  505  485  458
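Raw counts are easier to compare as shares of the whole sample. The sketch below, on a toy token vector rather than the real data, turns a sorted frequency table into cumulative shares and finds how many of the most frequent words are needed to cover half of all tokens. The 50% threshold and all variable names are illustrative choices.

```r
# Toy tokens standing in for filtered_tokens
toy_tokens <- c("the", "cat", "the", "dog", "the",
                "cat", "a", "bird", "the", "a")

# Sorted frequency table, most frequent word first
freq <- sort(table(toy_tokens), decreasing = TRUE)

# Cumulative share of all tokens covered by the top-k words
coverage <- cumsum(as.numeric(freq)) / sum(freq)
names(coverage) <- names(freq)

# Smallest number of top words covering at least half of the tokens
words_for_half <- unname(which(coverage >= 0.5)[1])
print(words_for_half)
```

Applied to `sorted_word_frequency`, the same calculation would show how much of the sample the handful of top words accounts for.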

Plotting a bar chart of the top 10 unigram frequencies

barplot(sorted_word_frequency[1:10], main = "Top 10 Unigram Word Frequencies", xlab = "Word", ylab = "Frequency", col="lightblue")

Computing the bigram frequency

# Tokenize into bigrams
bigrams <- tokenize_ngrams(corpus, n = 2)

# Compute frequency of each bigram
bigram_frequency <- table(unlist(bigrams))
sorted_bigram_frequency <- sort(bigram_frequency, decreasing = TRUE)

# Plotting a bar chart of the top 5 bigrams
barplot(sorted_bigram_frequency[1:5], main = "Top 5 Bigram Frequencies", xlab = "Bigram", ylab = "Frequency", col = "lightblue")

Computing the trigram frequency

# Tokenize into trigrams
trigrams <- tokenize_ngrams(corpus, n = 3)

# Compute frequency of each trigram
trigram_frequency <- table(unlist(trigrams))
sorted_trigram_frequency <- sort(trigram_frequency, decreasing = TRUE)

# Plotting a bar chart of the top 5 trigrams
barplot(sorted_trigram_frequency[1:5], main = "Top 5 Trigram Frequencies", xlab = "Trigram", ylab = "Frequency", col = "lightblue")
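To make the idea behind the bigram and trigram steps concrete, here is a minimal base R sketch of the sliding window that forms n-grams from a sequence of words. `make_ngrams` is a hypothetical helper written for illustration; it is not how `tokenize_ngrams` is implemented, and it skips the text cleaning that a real tokenizer performs.

```r
# Hypothetical helper: slide a window of n words across the vector
# and paste each window into a single string.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_words <- c("the", "cat", "sat", "on", "the", "mat")
toy_bigrams <- make_ngrams(toy_words, 2)
toy_trigrams <- make_ngrams(toy_words, 3)

print(toy_bigrams)
print(toy_trigrams)
```

A sequence of k words yields k - n + 1 n-grams, which is why the trigram table has fewer entries per line than the bigram table.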

Summary

The dataset is very large, so all of the analysis was carried out on a random sample of 1000 lines from each file. For unigrams, most of the top 10 words are stop words such as “a”, “the”, “is”, and “are”, which offer very little value for analysis; in applications such as information retrieval, stop words are removed before analysis for this reason. The top 5 most frequent bigrams included the word “the”, as did 3 of the top 5 trigrams.
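Stop word removal can be sketched with the same `%in%` pattern used for profanity filtering. The stop word list below is a small hypothetical one written out by hand so the example is self-contained; in a real run a fuller list such as `tm::stopwords("en")` could be used instead.

```r
# Toy tokens and a small hand-written stop word list (illustrative only)
toy_tokens <- c("the", "weather", "is", "nice", "and",
                "the", "sky", "is", "clear")
toy_stop_words <- c("a", "the", "is", "are", "and")

# Drop every token that appears in the stop word list
content_tokens <- toy_tokens[!toy_tokens %in% toy_stop_words]

# Frequency table of the remaining content words
print(sort(table(content_tokens), decreasing = TRUE))
```

Whether to remove stop words depends on the goal: they add little to a frequency summary, but a next-word prediction model may need them because they are among the words most often typed.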