Major Assignment 3

Will Georgia

2024-11-29

In this assignment, you will download, analyze, and visualize Reddit threads based on a keyword of your choice. Specifically, you will be performing the following steps:

1. Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.
Spirit Airlines is a company renowned for low-priced airfares as well as low-quality service, and its product generates a lot of opinions, especially on the Internet.

2. Search Reddit threads using a keyword of your choice.
- Specifying a subreddit for your search is optional.
- It is okay to combine data obtained by searching the keyword across multiple subreddits.
- You can choose any period, but ensure you gather a sufficient amount of data so that you can get meaningful results.

# load packages (install any that are missing with install.packages() first)
library(RedditExtractoR)
library(anytime)
library(magrittr)
library(httr)
library(tidytext)
library(tidyverse)
library(igraph)
library(ggraph)
library(wordcloud2)
library(textdata)
library(sf)
library(tmap)
library(here)
library(ggplot2)
library(knitr)
library(stringr)
library(stringi)
library(sentimentr)
library(syuzhet)
library(lubridate)
library(ggthemes)

# using keyword 'spirit airlines' over the last year.

# reviewing list of subreddits and then using a vector of 6 of those subreddits after initial search returned many irrelevant comments.

# subreddit_spirit_list <- RedditExtractoR::find_subreddits('spirit airlines')
# write.csv(subreddit_spirit_list,"~/GT MSUA/CP 8883/Intro to UA R Projects/subreddit_spirit_list.csv")

Sys.sleep(2)

spirit_subreddits <- c("aviation", "flying", "Flights", "AirRage", "spiritair", "peopleofspirit")

all_spirit_threads <- data.frame()

# search each subreddit in turn and stack the results into one data frame
for (subreddit in spirit_subreddits) {
  result <- find_thread_urls(keywords = "spirit airlines",
                             subreddit = subreddit,
                             period = 'year')

  all_spirit_threads <- rbind(all_spirit_threads, result)
}

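# assign the result back; without the assignment drop_na() has no lasting effect
all_spirit_threads <- all_spirit_threads %>% drop_na()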

rownames(all_spirit_threads) <- NULL
colnames(all_spirit_threads)
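# NOTE: these two lines overwrite title and text in place, so every later step
# (tokenizing, n-grams, sentiment scoring) sees only the first 50 characters
all_spirit_threads$title <- strtrim(all_spirit_threads$title, 50)
all_spirit_threads$text <- strtrim(all_spirit_threads$text, 50)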
kable(head(all_spirit_threads, 5))
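
Because the truncation above is applied to `all_spirit_threads` itself, the shortened text is what the later steps receive. One possible alternative, sketched here in base R on a made-up one-row data frame (`threads` and `threads_display` are hypothetical names, not objects from this assignment), is to truncate a copy that is used only for the preview table:

```r
# Toy stand-in for all_spirit_threads (hypothetical data, not from Reddit)
threads <- data.frame(
  title = "A fairly long thread title that runs well past fifty characters in total",
  text  = paste(rep("Spirit cancelled my flight and the rebooking took hours.", 3),
                collapse = " "),
  stringsAsFactors = FALSE
)

# Truncate a copy for display; the original keeps the full text
threads_display <- threads
threads_display$title <- strtrim(threads_display$title, 50)
threads_display$text  <- strtrim(threads_display$text, 50)

nchar(threads$text)          # full length, still available for tokenizing and scoring
nchar(threads_display$text)  # at most 50, intended for the preview table only
```

With this pattern, `kable(head(threads_display, 5))` would show the short preview while the full posts stay available for analysis. Results computed on full text would differ from the ones reported in this document.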

3. Clean your text data and then tokenize it.

# loading stop words from tidytext:
data("stop_words")
# regex matching URLs and the HTML entities &amp;, &lt; and &gt;
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

spirit_words_clean <- all_spirit_threads %>% 
  # dropping URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Word tokenization to begin the cleaning process
  unnest_tokens(word, text, token = "words") %>% 
  # using anti_join to filter out stop words
  anti_join(stop_words, by = "word") %>% 
  # keep tokens containing at least one letter (drops purely numeric tokens; mixed ones like "22nd" stay)
  filter(str_detect(word, "[a-z]"))
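
To see what `replace_reg` does and does not remove, the sketch below applies the same pattern to three made-up strings. It uses base R `gsub()` with `perl = TRUE` as a stand-in for `str_replace_all()`; the two regex engines differ, so this is an approximation that should agree on simple inputs like these. The example URLs are hypothetical.

```r
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

samples <- c(
  "See https://example.com/a.b for details",   # simple URL
  "Bags &amp; fees",                            # HTML entity
  "Link: https://example.com/some-page?id=1"    # URL with '-', '?' and '='
)

# perl = TRUE so that \d is honored inside the character class
cleaned <- gsub(replace_reg, "", samples, perl = TRUE)
print(cleaned)
```

The first two strings are cleaned as intended. In the third, the match stops at the hyphen because `-`, `?`, `=` and `_` are not in the character class, so a fragment of the URL is left behind and would be tokenized as ordinary words.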

4. Generate a word cloud that illustrates the frequency of words except your keyword.

# remove the keyword and several general words to generate a more meaningful analysis.
# note: the pattern is matched as a substring, so any word containing one of these
# strings (e.g. "flights") is removed as well
words_to_remove <- c("spirit", "airlines", "airline", "flight", "fly", "flying")
spirit_words_clean <- spirit_words_clean %>%
  filter(!str_detect(word, paste(words_to_remove, collapse = "|")))
# generate cloud using wordcloud2
spirit_words_clean %>%
  count(word, sort = TRUE) %>% 
  wordcloud2()
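
The filter above matches the removal list as an un-anchored pattern, which drops more than the six listed words. A minimal base R sketch of the difference between substring and exact matching follows; the `words` vector is made up for illustration, and `grepl()` stands in for `str_detect()`.

```r
words_to_remove <- c("spirit", "airlines", "airline", "flight", "fly", "flying")
words <- c("spirit", "spirited", "flights", "butterfly", "delay", "refund")

pattern <- paste(words_to_remove, collapse = "|")

# Substring match: any token containing one of the strings is dropped
kept_substring <- words[!grepl(pattern, words)]

# Exact match: only tokens identical to a listed word are dropped
kept_exact <- words[!words %in% words_to_remove]

kept_substring
kept_exact
```

Substring matching is convenient for catching plurals such as "flights", at the cost of also removing unrelated words that happen to contain "fly" or "spirit". Which behavior is preferable depends on the vocabulary in the data.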

5. Conduct a tri-gram analysis.
- Extract tri-grams from your text data.
- Remove tri-grams containing stop words or non-alphabetic terms.
- Present the frequency of tri-grams in a table.
- Discuss any noteworthy tri-grams you come across.
- If no meaningful tri-grams are found, you may analyze bi-grams as well. However, you still need to show results of the tri-grams.

words_spirit_ngram <- all_spirit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

#separate the words into 3 columns
words_spirit_ngram_trio <- words_spirit_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# drop tri-grams in which any of the three words is a stop word
words_spirit_ngram_trio_filtered <- words_spirit_ngram_trio %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & 
           !word3 %in% stop_words$word) %>%
  # keep tri-grams in which every word contains at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") &
           str_detect(word3, "[a-z]"))
# keep tri-grams in which all three words are ASCII-encoded
words_spirit_ngram_trio_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) &
           stri_enc_isascii(word3))
# Sort the new tri-gram (n=3) counts:
words_counts <- words_spirit_ngram_trio_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% knitr::kable()

word1	word2	word3	n
round	trip	flight	3
spirit	airlines	flight	3
time	flying	spirit	3
22nd	spirit	unexpectedly	1
active	duty	military	1
actor	frankie	muniz	1
airlines	ceo	sa	1
airlines	explore	restructuri	1
airlines	spirit	frontier	1
airlines	trading	card	1
blocked	jetblue	airways	1
booked	flight	tickets	1
booking	spirit	mco	1
bought	nonrefundable	ticket	1
browsers	incognito	modes	1
brutal	safety	report	1
claiming	active	duty	1
country	spirit	flight	1
crew	hires	class	1
cross	country	spirit	1

Tri-gram Analysis
The tri-gram analysis returns many generic and neutral word strings like “spirit airlines flight” and “round trip flight”. Due to the lack of meaningful results, a bi-gram analysis has been run and reviewed below.

# Perform bi-gram analysis as well:
words_spirit_ngram2 <- all_spirit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 2)

#separate the words into 2 columns
words_spirit_ngram_pair <- words_spirit_ngram2 %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# drop bi-grams in which either word is a stop word
words_spirit_ngram_pair_filtered <- words_spirit_ngram_pair %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>% 
  # keep bi-grams in which both words contain at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))
# keep bi-grams in which both words are ASCII-encoded
words_spirit_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))
# Sort the bi-gram:
words_counts2 <- words_spirit_ngram_pair_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(words_counts2,20) %>% knitr::kable()

word1	word2	n
spirit	airlines	25
flying	spirit	8
spirit	airline	6
round	trip	5
fly	spirit	4
time	flying	4
airlines	flight	3
flight	booked	3
flown	spirit	3
las	vegas	3
spirit	flight	3
trip	flight	3
booking	spirit	2
credit	card	2
flew	spirit	2
flight	attendant	2
hey	guys	2
jetblue	spirit	2
tl	dr	2
22nd	spirit	1

Bi-gram Analysis
The bi-gram analysis still doesn’t offer many revealing strings of words. The most frequent pair is the nondescript “spirit airlines”, and the others are also not very informative. A couple of pairs, such as “flight attendant” and “credit card”, may hint at complaint posts, but frankly, the n-gram analyses don’t seem to offer much insight.
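
The n-gram pipeline used in both analyses above (tokenize, build overlapping word windows, drop any window containing a stop word, count) can be sketched in base R as follows. `make_ngrams()` and `toy_stop_words` are hypothetical names introduced for this illustration, the two posts are made up, and the tokenizer is a simplification of what `unnest_tokens()` does.

```r
# Hypothetical helper: build word n-grams from a single string
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Tiny illustrative stop-word list; the analysis above uses tidytext's stop_words
toy_stop_words <- c("i", "a", "the", "on", "my", "was")

posts <- c("I booked a round trip flight on Spirit",
           "My round trip flight was delayed")

trigrams <- unlist(lapply(posts, make_ngrams, n = 3))

# Keep only n-grams with no stop word in any position
has_stop <- vapply(strsplit(trigrams, " "),
                   function(w) any(w %in% toy_stop_words),
                   logical(1))
trigram_counts <- sort(table(trigrams[!has_stop]), decreasing = TRUE)
trigram_counts
```

Of the ten tri-grams in these two sentences, only "round trip flight" survives the stop-word filter. This shows why short inputs tend to produce sparse n-gram tables in which most entries have a count of 1.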

6. Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.
- You are welcome to apply a deep learning-based model to enrich your analysis, but employing the dictionary method is imperative.

# initial sentiment analysis on all posts (sentiment() returns one row per sentence),
# sorted so that the 10 most positively scored rows can be displayed
sentiment_spirit <- sentiment(all_spirit_threads$text) %>%
  arrange(desc(sentiment)) 
                    
head(sentiment_spirit, 10) %>% 
  knitr::kable()

element_id	sentence_id	word_count	sentiment
254	1	2	0.8838835
67	1	11	0.6934761
245	1	8	0.6540738
56	1	10	0.6261310
116	1	9	0.5000000
207	1	9	0.5000000
246	1	11	0.4974937
106	1	7	0.4913538
247	1	10	0.4743416
287	1	8	0.4596194
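
To illustrate what "a dictionary method that accommodates negations" means, here is a deliberately simplified toy scorer in base R. It is not sentimentr's algorithm: the lexicon values, the negator list, and the `score_sentence()` helper are all hypothetical, and only the immediately preceding word is checked for negation. The one element borrowed from sentimentr's documented approach is dividing the summed polarity by the square root of the word count.

```r
# Toy lexicon and negator list (hypothetical values chosen for illustration)
toy_lexicon  <- c(love = 0.75, good = 0.5, terrible = -0.75, rude = -0.5)
toy_negators <- c("not", "never", "no")

score_sentence <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  total <- 0
  for (i in seq_along(words)) {
    polarity <- unname(toy_lexicon[words[i]])
    if (is.na(polarity)) next            # word is not in the lexicon
    # Flip the sign when the preceding word is a negator
    if (i > 1 && words[i - 1] %in% toy_negators) polarity <- -polarity
    total <- total + polarity
  }
  # Scale the summed polarity by the square root of the word count
  total / sqrt(length(words))
}

score_sentence("The service was good")
score_sentence("The service was not good")
```

The second call returns a negative score because "not" flips the polarity of "good". A plain word-count dictionary method would score both sentences as equally positive.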

7. Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

# setting seed and setting up code to randomly select 10 posts from the entire population
set.seed(1234)
spirit_senti_filtered <- all_spirit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text)
random_indices <- sample(nrow(spirit_senti_filtered), size = 10)
ten_spirit_samples <- spirit_senti_filtered[random_indices, ]
# score each sampled post; note that [1] keeps only the first sentence's score
ten_spirit_samples$sentiment_score <- sapply(ten_spirit_samples$text, function(text) {
  sentiment(text)$sentiment[1]
})
# selecting columns and sorting by sentiment for display
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
ten_spirit_samples_final <- ten_spirit_samples[, columns] %>%
  arrange(desc(sentiment_score))
ten_spirit_samples_final$title <- strtrim(ten_spirit_samples_final$title, 50)
ten_spirit_samples_final$text <- strtrim(ten_spirit_samples_final$text, 50)
head(ten_spirit_samples_final, 10) %>% 
  knitr::kable()

date_utc	title	text	subreddit	sentiment_score
2024-10-04	Scared of this potential bankruptcy =	Love spirit. IDC what people say about the service	spiritair	0.8838835
2024-02-22	Can I bring a backpack AND drawstring bag as perso	I have the honor of traveling with Spirit Airlines	Flights	0.5000000
2024-08-29	Spirit Employees paid to look for and defend negat	I find it funny that there are some people who ver	spiritair	0.2412091
2024-09-21	Pet question [Go comfy]	Im flying Spirit for the first time this week from	spiritair	0.1809068
2024-08-13	Traveling with a CPAP machine	Does Spirit Airlines let you carry on a CPAP machi	spiritair	0.1581139
2024-07-05	International flights with Stops and Baggage… do	I am planning a trip from Toronto, Canada to Tokyo	Flights	0.0790569
2024-07-21	Itinerary Cancelled but flight is still available?	Hello, I booked a reservation 4 days ago, for roun	spiritair	0.0000000
2024-07-23	Spirit Airlines made me pay to rebook a flight whe	So this happened a while back.

The highest score in the sample actually seems fair and accurate. The poster praised Spirit while expressing some concern about the airline’s potential bankruptcy, which the post title itself mentions. The algorithm likely homed in on the opening words “Love spirit” in calculating the score.

As for the other nine posts in the sample, only two appear to have been scored accurately. Sentimentr assigned positive scores to six posts that were clearly neutral or negative (most simply asked questions about the airline). Two posts scored 0.00: one was a question, which was scored accurately, while the other was a profanity-laced complaint, which clearly was not. Finally, one post received a negative score and truly was negative, as the poster was recounting a bad experience.

Sentimentr struggled with these posts. Many of them were long and rambling with poor spelling and grammar, and that may be one of the main reasons the program couldn’t produce accurate assessments. It shows that the program may be a little too simple for complex human thought and emotion!
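
One mechanical factor worth keeping in mind when judging these scores is that the scoring code above takes `sentiment(text)$sentiment[1]`, which is the score of the first sentence only. The sketch below uses a made-up table shaped like `sentiment()` output (the numbers are hypothetical, not results from this data) to show how a first-sentence score can differ from a per-post average.

```r
# Hypothetical per-sentence scores shaped like sentimentr::sentiment() output
sentence_scores <- data.frame(
  element_id  = c(1, 1, 1, 2, 2),
  sentence_id = c(1, 2, 3, 1, 2),
  sentiment   = c(0.5, -0.6, -0.4, 0.0, 0.3)
)

# Score of the first sentence only (what indexing with [1] keeps)
first_only <- sentence_scores$sentiment[sentence_scores$sentence_id == 1]

# Simple mean across all sentences of each post
post_mean <- tapply(sentence_scores$sentiment, sentence_scores$element_id, mean)

comparison <- data.frame(
  element_id = names(post_mean),
  first_only = first_only,
  post_mean  = as.vector(post_mean)
)
comparison
```

In this toy case, post 1 opens positively but is negative overall, so the two measures disagree in sign. In sentimentr itself, `sentiment_by()` returns a per-element aggregate (`ave_sentiment`) that could be used instead of the first-sentence score, although its default averaging is not a plain mean.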

8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

# re-run sentiment analysis code used for 10 samples on all text for plotting
spirit_sentiment_plots <- all_spirit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text) %>%
  drop_na()
# as in the 10-post sample, [1] keeps only the first sentence's score for each post
spirit_sentiment_plots$sentiment_score <-
  sapply(spirit_sentiment_plots$text, function(text) {
    sentiment(text)$sentiment[1]
  })
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
spirit_sentiment_plots <- spirit_sentiment_plots[, columns] %>%
  arrange(desc(sentiment_score))
spirit_sentiment_plots$title <- strtrim(spirit_sentiment_plots$title, 50)
spirit_sentiment_plots$text <- strtrim(spirit_sentiment_plots$text, 50)
# turn date_utc into Day of the Week for plotting
spirit_sentiment_plots$DoW <- wday(spirit_sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
spirit_sentiment_plots <- spirit_sentiment_plots %>% select(date_utc, DoW, title, text,
                                                            subreddit, sentiment_score)
# density plot to show distribution of scores
ggplot(spirit_sentiment_plots, aes(x = sentiment_score)) +
  geom_density(fill = "lightblue", alpha = 0.9) +
  labs(title = "Distribution of Sentiment", x = "Sentiment_score", y = "Density") +
  theme_dark()

#day-of-week bar graph
spirit_sentiment_plots %>% 
  ggplot(aes(x = DoW)) +
  geom_bar(fill = 'orange') +
  theme_classic()

# violin graph by day-of-week
ggplot(spirit_sentiment_plots, aes(x = DoW, y = sentiment_score)) +
  geom_violin(fill = "lightblue", color = "darkblue") +
  #labs(title = "Violin Plot by Group", x = "Group", y = "Value") +
  theme_dark()

Overall Sentiment Analysis
There is no dispute that Spirit Airlines generates a lot of complaints and negative feelings on the Internet. However, when looking at visualizations of the sentiment analysis performed herein, the Sentimentr algorithm indicates most of the posts are neutral or slightly positive in tone.

As discussed earlier, the algorithm struggled with long, poorly written posts. There were many such examples in the 344-post population that were given positive scores. Those inaccurate scores skewed both the density plot and the violin plot. Each plot should have shown a larger share of scores below zero.

On the other hand, both plots show a large concentration of relatively neutral posts. Skimming the entire population confirms that there are, in fact, many neutral posts posing questions about flying Spirit (one of which was in the 10-post sample above). The algorithm did seem to do well in assessing those as neutral, and the plots do a good job of representing that neutrality.

Day of the week was incorporated in two of the plots and showed some notable results. There were more posts made on Mondays and Thursdays than on other days of the week. Even taking into account the inaccurate scoring noted above, Thursday’s posts seemed to be among the most positive. That may be purely random, but there may be another explanation. Spirit caters to many leisure travelers, and many leisure trips are weekend trips that start on Thursday or Friday. The more positive Thursday posts may reflect some of that anticipation and excitement. Meanwhile, Monday’s posts may reflect customers needing to share their thoughts and opinions at the end of those weekend trips.
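
The day-of-week comparison can also be tabulated rather than read off the plots. The base R sketch below computes post counts and mean scores per weekday, using only the dates and scores of the first seven rows of the 10-post sample table shown earlier. Seven rows are far too few to support any conclusion; the point is only the aggregation pattern, and `sample_posts` and `dow_summary` are hypothetical names.

```r
# Dates and scores copied from the first seven rows of the 10-post sample table
sample_posts <- data.frame(
  date_utc = c("2024-10-04", "2024-02-22", "2024-08-29", "2024-09-21",
               "2024-08-13", "2024-07-05", "2024-07-21"),
  sentiment_score = c(0.8838835, 0.5000000, 0.2412091, 0.1809068,
                      0.1581139, 0.0790569, 0.0000000)
)

# POSIXlt$wday runs from 0 (Sunday) to 6 (Saturday) and is locale-independent
day_labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
day_index <- as.POSIXlt(as.Date(sample_posts$date_utc))$wday + 1
sample_posts$DoW <- factor(day_labels[day_index], levels = day_labels)

# Post count and mean score per weekday (NA where a day has no posts)
dow_summary <- data.frame(
  n_posts    = as.vector(table(sample_posts$DoW)),
  mean_score = as.vector(tapply(sample_posts$sentiment_score,
                                sample_posts$DoW, mean)),
  row.names  = day_labels
)
dow_summary
```

Applied to the full `spirit_sentiment_plots` data frame, a table like this would put numbers next to the bar and violin plots, which makes it easier to judge whether the Thursday and Monday patterns are large enough to be worth interpreting.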

The Sentimentr results, while not completely accurate, were interesting to analyze and review. The algorithm’s struggle to rate posts reflects the complexity of human thought and expression. It takes very powerful computer programs to even begin to capture true sentiment.