Introduction

This analysis explores and visualizes word frequencies in a text dataset sourced from YouTube, using Apache Spark in R. Raw text often contains noise, such as special characters and stop words, which we address through preprocessing: removing special characters, tokenizing the text into words, and filtering out stop words. The goal of the analysis is to gain insight into the text by identifying the most frequently occurring words. The report also includes the code used, along with explanations of the techniques.

Methodology

First, the text file is loaded into Spark using spark_read_text. To remove unwanted characters (such as punctuation) from the text, the regexp_replace function is used. The next step is tokenization: the ft_tokenizer function splits the text in the line column into individual words and lowercases them. To remove common stop words like "and" and "the", which carry little meaningful information, the ft_stop_words_remover function is used. Finally, the explode function breaks each list of words into individual rows, allowing us to count each word's frequency.

# load packages
suppressPackageStartupMessages(library(sparklyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(RColorBrewer))

# Connect to Spark
sc <- spark_connect(master = "local", version = '2.4')

# Read the text file
text_df <- spark_read_text(sc, name = "text_file", path = "SHEILA.txt")

all_words <- text_df %>%
  # Replace punctuation and special characters with spaces
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
  filter(line != "")  # Remove rows where the text is empty

# Tokenizing the text: 
tokenized_df <- all_words %>%
  ft_tokenizer(
    input_col = "line", 
    output_col = "word_list")

# Remove common stop words
tokenized_df <- tokenized_df %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words"
  )
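
By default, ft_stop_words_remover uses Spark's built-in English stop word list. If domain-specific words also need to be filtered, the list can be extended through the stop_words argument. The sketch below is a minimal illustration; the extra words shown ("also", "however") are assumptions for demonstration, not terms taken from this dataset.

# Optional: extend the default English stop word list with custom entries
# (the added words here are illustrative placeholders)
custom_stops <- c(ml_default_stop_words(sc, language = "english"),
                  "also", "however")
tokenized_df_custom <- all_words %>%
  ft_tokenizer(input_col = "line", output_col = "word_list") %>%
  ft_stop_words_remover(
    input_col = "word_list",
    output_col = "wo_stop_words",
    stop_words = custom_stops
  )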

# Exploding the list of words into individual rows:
exploded_df <- tokenized_df %>%
  transmute(word = explode(wo_stop_words))  # One row per word

Analysis

To obtain the word counts, the group_by function groups the data by word and summarise(freq = n()) counts the occurrences of each. The collect() function then brings the word counts into R as a local data frame, which allows us to work with the results further.

# ------- Data Analysis ------------
# group word count and sort by frequency
word_counts <- exploded_df %>%
  filter(word != "") %>%  # Remove empty strings
  group_by(word) %>%
  summarise(freq = n()) %>%
  arrange(desc(freq))  # Sort words by frequency

word_counts_local <- collect(word_counts) # Collecting the results into a local R data frame
print(word_counts_local)
## # A tibble: 287 × 2
##    word         freq
##    <chr>       <dbl>
##  1 data           24
##  2 hadoop         23
##  3 big            15
##  4 manager         8
##  5 hardware        7
##  6 blocks          7
##  7 distributed     7
##  8 resource        7
##  9 hdfs            7
## 10 master          6
## # ℹ 277 more rows

Answers to Questions

  1. How many times is the word “Hadoop” counted?

The word Hadoop appears 23 times in the text file. Because ft_tokenizer lowercases all tokens, the word is stored as "hadoop" in the counts, so the filter below matches the lowercase form.

# Question 1: How many times is the word "Hadoop" counted?
# Note: ft_tokenizer lowercases tokens, so we match "hadoop"
hadoop_count <- word_counts_local %>%
  filter(word == "hadoop")
print(hadoop_count)
## # A tibble: 1 × 2
##   word    freq
##   <chr>  <dbl>
## 1 hadoop    23
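
As a safeguard, a case-insensitive comparison works regardless of how the tokens were cased; a minimal sketch using the word_counts_local data frame built above:

# Case-insensitive lookup for "hadoop" (robust to tokenizer casing)
hadoop_count_ci <- word_counts_local %>%
  filter(tolower(word) == "hadoop")
print(hadoop_count_ci)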
  2. Which is the most common word used in the file? How many times does it occur?

The word data is the most common word in the file, appearing 24 times.

# Most common word (first row after sorting by frequency)
most_common_word <- word_counts_local[1,]
print(most_common_word)
## # A tibble: 1 × 2
##   word   freq
##   <chr> <dbl>
## 1 data     24

  3. Which word occurs the fewest times? How many times does it occur?

The word watching appears only once in the text file. It is the last row after sorting by frequency, although it is tied with several other words that also occur just once (see the additional analysis below).

# The least frequent word (last row after sorting)
least_common_word <- word_counts_local[nrow(word_counts_local),]
print(least_common_word)
## # A tibble: 1 × 2
##   word      freq
##   <chr>    <dbl>
## 1 watching     1

Additional Analysis

Additional analysis revealed the length of the longest word in the text: "transformations" stands out as the longest word, with a length of 15 characters, and it appears only once. In terms of frequency, the top 10 most common words in the text are data (24), hadoop (23), big (15), manager (8), hardware (7), blocks (7), distributed (7), resource (7), hdfs (7), and master (6). These words make up the core themes of the text. Additionally, several words appear only once, including hello, start, journey, started, made, and manageable; these one-time words highlight specific concepts or points in the text.

# Find the longest word
longest_word <- word_counts_local %>%
  mutate(word_length = nchar(word)) %>%
  slice_max(word_length, n = 1)

# View the result
longest_word
# Top 10 most frequent words
head(word_counts_local, 10)

# Words that appear exactly once
words_with_freq_1 <- word_counts_local %>%
  filter(freq == 1)
head(words_with_freq_1)
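
To quantify how many words occur exactly once, the rows of words_with_freq_1 built above can simply be counted:

# Number of distinct words that appear exactly once
nrow(words_with_freq_1)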
# Generating a word cloud:
# par(mar = c(1, 1, 1, 1))  # Optional: reduce plot margins if the cloud is clipped
wordcloud(words = word_counts_local$word, freq = word_counts_local$freq, min.freq = 1, 
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
Figure: Word Cloud of Text File
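
Once the analysis is complete, it is good practice to release Spark's resources by closing the connection; a minimal sketch using the sc connection created earlier:

# Disconnect from the local Spark session
spark_disconnect(sc)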

Conclusion

In this analysis, we explored and visualized the frequency of words in a text dataset using Apache Spark in R. By preprocessing the text data (removing special characters and stop words, and tokenizing the text into words), we were able to focus on meaningful terms. The results indicate that the word data appeared 24 times, making it the most frequent term, while hadoop appeared 23 times. On the other hand, the word watching was among the least frequent, appearing only once.

References

Apache Spark Documentation. (2024). Apache Spark 2.4.0 documentation. Retrieved from https://spark.apache.org/docs/2.4.0/

edureka! (2017, April 25). Big Data Tutorial for Beginners | What is Big Data | Big Data tutorial | Hadoop training | Edureka [Video]. YouTube. https://www.youtube.com/watch?v=zez2Tv-bcXY

Text mining with Spark & sparklyr. (n.d.). sparklyr. Retrieved from https://spark.posit.co/guides/textmining.html