This analysis explores and visualizes word frequencies from a text dataset sourced from YouTube using Apache Spark in R. The raw text contains noise, such as special characters and stop words, which we address through preprocessing: removing special characters, tokenizing the text into words, and filtering out stop words. The goal of the analysis is to gain insight from the text data by identifying the most frequently occurring words. The report also includes the code and explanations for the techniques used.
First, load the text file into Spark from R using spark_read_text. To strip unwanted characters (such as punctuation) from the text, the regexp_replace function is used. The next step is to tokenize the text: the ft_tokenizer function splits the text in the line column into individual words. To remove common stop words like "and" and "the", which carry no meaningful information, the ft_stop_words_remover function is used. Finally, the explode function breaks each list of words into individual rows, allowing us to count every word's frequency.
# load packages
suppressPackageStartupMessages(library(sparklyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(RColorBrewer))
# Connect to Spark
sc <- spark_connect(master = "local", version = '2.4')
# Read the text file
text_df <- spark_read_text(sc, name = "text_file", path = "SHEILA.txt")
all_words <- text_df %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
filter(line != "") # Remove rows where the text is empty
# Tokenizing the text:
tokenized_df <- all_words %>%
ft_tokenizer(
input_col = "line",
output_col = "word_list")
# remove common stop words
tokenized_df <- tokenized_df %>%
ft_stop_words_remover(
input_col = "word_list",
output_col = "wo_stop_words"
)
# Exploding the list of words into individual rows:
exploded_df <- tokenized_df %>%
transmute(word = explode(wo_stop_words)) # Explode the list of words into one row per word
To compute the word counts, the group_by function groups the data by word and summarise(freq = n()) counts the occurrences of each one. The collect() function then brings the word counts into R as a local data frame, which allows us to work with the results further.
# ------- Data Analysis ------------
# group by word, count occurrences, and sort by frequency
word_counts <- exploded_df %>%
group_by(word) %>%
filter(word != "") %>% # Remove empty strings
summarise(freq = n()) %>%
arrange(desc(freq)) # Sorting words by frequency
word_counts_local <- collect(word_counts) # Collecting the results into a local R data frame
print(word_counts_local)
## # A tibble: 287 × 2
## word freq
## <chr> <dbl>
## 1 data 24
## 2 hadoop 23
## 3 big 15
## 4 manager 8
## 5 hardware 7
## 6 blocks 7
## 7 distributed 7
## 8 resource 7
## 9 hdfs 7
## 10 master 6
## # ℹ 277 more rows
The word "hadoop" appears 23 times in the text file. Note that ft_tokenizer lower-cases every token, so filtering on the capitalized form "Hadoop", as the code below does, returns an empty result.
# Question 1: How many times is the word "Hadoop" counted?
hadoop_count <- word_counts_local %>%
filter(word == "Hadoop")
print(hadoop_count)
## # A tibble: 0 × 2
## # ℹ 2 variables: word <chr>, freq <dbl>
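Because the tokens are stored in lower case, a corrected filter needs to match "hadoop" instead. A minimal sketch using the same word_counts_local data frame; based on the word_counts_local output above, it reports a frequency of 23.
# filter on the lowercase token produced by ft_tokenizer
hadoop_count <- word_counts_local %>%
  filter(word == "hadoop")
print(hadoop_count)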
The word data is the most common word in the file, appearing 24 times.
# most common word
most_common_word <- word_counts_local[1,]
print(most_common_word)
## # A tibble: 1 × 2
## word freq
## <chr> <dbl>
## 1 data 24
Question 3: Which word occurs the fewest times? How many times does the word occur?
The word watching appears only once in the text file, making it one of the least frequent words (several words are tied at a frequency of 1, as noted below).
# the least frequently occurring word (last row after sorting by descending frequency)
least_common_word <- word_counts_local[nrow(word_counts_local),]
print(least_common_word)
## # A tibble: 1 × 2
## word freq
## <chr> <dbl>
## 1 watching 1
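Because the sorted data frame is simply sliced at its last row, only one of the tied words is returned. A small sketch (using the same local word_counts_local data frame) that lists every word tied at the minimum frequency:
# all words that share the minimum frequency
least_common_words <- word_counts_local %>%
  filter(freq == min(freq))
print(least_common_words)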
Additional analysis looked at the length of the longest word in the text. The word "transformations" stands out as the longest, with a length of 15 characters, and it appears only once. In terms of frequency, the top 10 most common words in the text are data (24), hadoop (23), big (15), manager (8), hardware (7), blocks (7), distributed (7), resource (7), hdfs (7), and master (6); these words make up the core themes of the text. Several words appear only once, among them hello, start, journey, started, made, and manageable. These one-time words contribute distinctive meanings, highlighting specific concepts or points in the text.
# Find the longest word
longest_word <- word_counts_local %>%
mutate(word_length = nchar(word)) %>%
slice_max(word_length, n = 1)
# View the result
longest_word
# top 10 words
head(word_counts_local, 10)
# words with frequency = 1
words_with_freq_1 <- word_counts_local %>%
filter(freq == 1)
head(words_with_freq_1)
# Generating a word cloud:
#par(mar = c(1, 1, 1, 1))
wordcloud(words = word_counts_local$word, freq = word_counts_local$freq, min.freq = 1,
max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
Word Cloud of Text File
In this analysis, we explored and visualized the frequency of words in a text dataset using Apache Spark in R. By preprocessing the text data (removing special characters, filtering out stop words, and tokenizing the text into words), we were able to focus on meaningful words. The results indicate that the word data appeared 24 times, making it the most frequent term, while hadoop appeared 23 times. At the other end, the word watching appeared only once and is among the least frequent words.
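As a final housekeeping step, not shown in the code above and added here as a suggestion rather than part of the original workflow, the Spark connection opened with spark_connect can be closed once the results have been collected into R:
# close the Spark connection now that the results are held locally in R
spark_disconnect(sc)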
Apache Spark Documentation. (2024). Apache Spark 2.4.0 documentation. Retrieved from https://spark.apache.org/docs/2.4.0/
edureka! (2017, April 25). Big Data Tutorial for Beginners | What is Big Data | Big Data tutorial | Hadoop training | Edureka [Video]. YouTube. https://www.youtube.com/watch?v=zez2Tv-bcXY
Text mining with Spark & sparklyr – sparklyr. (n.d.). Retrieved from https://spark.posit.co/guides/textmining.html