1. Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis: Examine how Reddit users discuss the quality of life and overall sentiment toward women serving in the United States Army from 2023 to 2025.

Personal Context (optional background on what this project means to me; skip if not of interest): I chose to focus on conversations about women in the Army because this topic sits close to my own lived experience. I started my journey at West Point and spent years forming my identity around service, leadership, and the idea of building a meaningful career in national security. Even after leaving the military environment in late 2022, the pull to return never really disappeared. Over the past few years, I’ve been trying to sort through my purpose, career interests, and long-term goals, especially as I’ve been encouraged to look into PhD programs. Although I enjoy academic work and have learned so much in this class, nothing feels more aligned with who I am than the idea of serving again as a Military Intelligence officer and bringing together my strengths in geospatial analysis, data analytics, and problem solving. Because this decision carries so much personal weight, examining how people publicly talk about women in the Army gives me a unique way to understand the broader environment I may be stepping back into.

Why the 2023–2025 Time Frame: The period from 2023 to 2025 is especially meaningful to analyze because it captures the landscape immediately after I left the military in November 2022 and reflects a moment of rapid cultural and political change. During this period, the military has been adapting to shifting national priorities, recruiting challenges, evolving views on gender integration, and rising global pressures. Public sentiment toward the military has also been influenced by transitions in political leadership, debates around defense strategy, and heightened geopolitical tensions. Looking at Reddit discussions over this specific window allows me to see whether attitudes toward women in uniform have shifted, stabilized, or become more polarized, and how conversations about quality of life, leadership, respect, and overall climate have evolved. This timeframe offers a clear view of how the public discourse has changed during the exact years I’ve been reevaluating my place in the military and imagining what returning as an MI officer could look like.

  1. Search Reddit threads using a keyword of your choice. Keyword: “women army”

Step 1: Load Packages
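
No code is displayed for this step in the knitted output. The sketch below is one way the step might look, not the code that was actually run: the package list is inferred from the functions called in later steps (with `find_thread_urls()` coming from RedditExtractoR), and the names `required_packages`, `is_installed`, and `missing_packages` are hypothetical.

```r
# Packages used in the later steps (assumed list, inferred from the function calls)
required_packages <- c("RedditExtractoR", "dplyr", "tidyr", "stringr", "tidytext",
                       "wordcloud2", "sentimentr", "lubridate", "ggplot2", "knitr")

# Check which packages are available, without installing anything
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Install before knitting: ", paste(missing_packages, collapse = ", "))
}

# Attach only the packages that are actually present
for (pkg in required_packages[is_installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

Checking before attaching means a missing package produces one readable message up front instead of an error halfway through knitting.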

Step 2: Search Reddit Threads

# Run in Console
# wmil_a <- find_thread_urls("women army",
#                            sort_by = "relevance",
#                            period = "all")
#
# wmil_b <- find_thread_urls("women",
#                            subreddit = "army",
#                            sort_by = "relevance",
#                            period = "all")
#
# wmil_c <- find_thread_urls("military",
#                            subreddit = "MilitaryWomen",
#                            sort_by = "relevance",
#                            period = "all")
#
# wmil_d <- find_thread_urls("female soldier",
#                            sort_by = "relevance",
#                            period = "all")

Filter by date range

# clean_dates <- function(df) {
#   df %>%
#     mutate(date_utc = as.Date(date_utc)) %>%
#     filter(date_utc >= as.Date("2023-01-01"),
#            date_utc <= as.Date("2025-03-31"))
# }
#
# wmil_a <- clean_dates(wmil_a)
# wmil_b <- clean_dates(wmil_b)
# wmil_c <- clean_dates(wmil_c)
# wmil_d <- clean_dates(wmil_d)
#
# womenmil_total <- bind_rows(wmil_a, wmil_b, wmil_c, wmil_d) %>% distinct()
#
# write.csv(womenmil_total, "womenmil_total.csv", row.names = FALSE)

# Retrieve Reddit threads using my keyword and subreddit combinations
wmil_a <- find_thread_urls("women army",
                           sort_by = "relevance",
                           period = "all")

wmil_b <- find_thread_urls("women",
                           subreddit = "army",
                           sort_by = "relevance",
                           period = "all")

wmil_c <- find_thread_urls("military",
                           subreddit = "MilitaryWomen",
                           sort_by = "relevance",
                           period = "all")

wmil_d <- find_thread_urls("female soldier",
                           sort_by = "relevance",
                           period = "all")

# Standalone code for combining results
womenmil_total <- bind_rows(wmil_a, wmil_b, wmil_c, wmil_d) %>%
  distinct()

# Save the Reddit data before knitting at the end
write.csv(womenmil_total, "womenmil_total.csv", row.names = FALSE)

#Purpose of this step 2: This approach allowed me to collect a diverse mix of Reddit posts related to women in the Army. Searching with both broad keywords and subreddit-specific queries helped ensure that I captured general discussions as well as more personal or detailed experiences that tend to appear in niche communities. After filtering to the 2023–2025 window and merging the results, I obtained a dataset that reflects recent online sentiment, cultural commentary, and personal perspectives about serving as a woman in the Army.
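
One thing worth double-checking here: the chunk that was evaluated above does not call the `clean_dates()` helper from the commented-out console code, and the `head(tokens_words)` output in Step 3 shows a post dated 2018-12-29, so the date window may not have been applied to the saved CSV. Below is a minimal sketch of the filter on made-up rows, written in base R so it runs without extra packages. `toy_threads` and `filter_date_window()` are hypothetical names; the window endpoints are taken from the commented-out code.

```r
# Toy stand-in for the search results; the real rows would come from find_thread_urls()
toy_threads <- data.frame(
  date_utc = c("2018-12-29", "2023-01-01", "2024-06-15", "2025-03-31", "2025-04-01"),
  title    = c("too early", "first day", "inside", "last day", "too late"),
  stringsAsFactors = FALSE
)

# Keep rows whose date falls inside the study window (both endpoints included)
filter_date_window <- function(df, start = "2023-01-01", end = "2025-03-31") {
  dates <- as.Date(df$date_utc)
  keep  <- !is.na(dates) & dates >= as.Date(start) & dates <= as.Date(end)
  df[keep, , drop = FALSE]
}

toy_filtered <- filter_date_window(toy_threads)
toy_filtered$title
# "first day" "inside" "last day"
```

Running `range(as.Date(womenmil_total$date_utc))` right after reading the CSV back in would be a quick way to confirm that the real data stay inside the window.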

Step 3: Clean and Tokenize Text Data

#Purpose of this step 3: To prepare the Reddit posts for text analysis, I first merged the datasets I collected and removed duplicate entries. Then I cleaned the text by stripping out symbols, HTML fragments, and irregular spacing. Since Reddit posts often contain URLs, usernames, or other non-language artifacts, cleaning helps ensure that the word-level and n-gram analyses reflect actual patterns in how people talk about women in the Army rather than noisy formatting. After cleaning, I created a tokenized dataset at the word level that breaks the text into individual units for later steps in the analysis.

womenmil_total <- read.csv("womenmil_total.csv", stringsAsFactors = FALSE)
library(stringr)
library(dplyr)

# Basic text cleaning for Reddit content
womenmil_clean <- womenmil_total %>%
  mutate(
    title = ifelse(is.na(title), "", title),
    text  = ifelse(is.na(text),  "", text),
    # Combine title + body
    full_text = paste(title, text, sep = ". "),
    # Remove URLs, odd symbols, and extra spaces
    full_text = str_replace_all(full_text, "http[^[:space:]]+", ""),
    full_text = str_replace_all(full_text, "[^[:alnum:][:space:]']", " "),
    full_text = str_squish(full_text)
  )

library(tidytext)

#Break the cleaned text into individual tokens 
tokens_words <- womenmil_clean %>%
  unnest_tokens(word, full_text, token = "words")

head(tokens_words)
##     date_utc  timestamp
## 1 2018-12-29 1546103127
## 2 2018-12-29 1546103127
## 3 2018-12-29 1546103127
## 4 2018-12-29 1546103127
## 5 2018-12-29 1546103127
## 6 2018-12-29 1546103127
##                                                                                                                                                                                                            title
## 1 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
## 2 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
## 3 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
## 4 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
## 5 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
## 6 'The Hello Girls' the women who ran the WWI switchboards - Six members of the U.S. Army Signal Corps preparing to ship off for France in 1918, where they and 217 other women served as switchboard operators.
##   text        subreddit comments
## 1      ColorizedHistory       41
## 2      ColorizedHistory       41
## 3      ColorizedHistory       41
## 4      ColorizedHistory       41
## 5      ColorizedHistory       41
## 6      ColorizedHistory       41
##                                                                                                    url
## 1 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
## 2 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
## 3 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
## 4 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
## 5 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
## 6 https://www.reddit.com/r/ColorizedHistory/comments/aamtwe/the_hello_girls_the_women_who_ran_the_wwi/
##    word
## 1   the
## 2 hello
## 3 girls
## 4   the
## 5 women
## 6   who

Step 4: Generate a word cloud

#Purpose of this step 4: After tokenizing the text, I removed common stop words and filtered out my main keywords so that the most prominent remaining words would reflect the broader themes people discuss when talking about women in the Army. Reddit conversations often include filler terms, URLs, or formatting fragments, so removing these helps highlight terms that actually carry meaning. A word cloud gives a quick visual impression of which topics appear most frequently across the collected posts.

library(tidytext)
library(dplyr)
library(stringr)

# Version 1: drop common stop words and any token containing one of my four core keywords
tokens_clean <- tokens_words %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "women|woman|female|army")) %>%
  filter(str_detect(word, "[a-z]"))   # keep tokens with at least one letter

# Word cloud, version 1: core keywords removed, related terms (e.g., "military", "soldier") still present
library(wordcloud2)

tokens_clean %>%
  count(word, sort = TRUE) %>%
  wordcloud2(size = 0.9, shuffle = FALSE)
# Version 2: after also removing related terms and filler words, the second word cloud focuses more on themes

library(tidytext)
library(dplyr)
library(stringr)

# Create a vector of terms to exclude: my search keywords, closely related words, and a few filler tokens
remove_terms <- c(
  "women", "woman", "female", "people", "day", "don",
  "army", "military", "soldier", "soldiers",
  "service", "serving", "serve",
  "male", "men", "man",
  "gt", "amp"    # Reddit/HTML artifacts
)

tokens_clean <- tokens_words %>%
  anti_join(stop_words, by = "word") %>%               
  filter(!word %in% remove_terms) %>%                  
  filter(str_detect(word, "^[a-z]+$"))                 

tokens_clean %>%
  count(word, sort = TRUE) %>%
  wordcloud2(size = 0.9, shuffle = FALSE)
#Results of the second word cloud: The cleaned word cloud reveals how broad and personal the conversations are when people discuss women in the Army without the distraction of the core keywords. The largest terms highlight everyday themes like “time,” “life,” “family,” “duty,” “job,” and “care,” suggesting that people often talk about military service in the context of work–life balance, family responsibilities, and the emotional toll of serving. Words such as “field,” “unit,” “join,” “leave,” and “post” point toward common questions about career decisions, training environments, and the realities of Army life. At the same time, terms like “feel,” “told,” “hard,” “kill,” “hurt,” and “shit” reflect frustration, stress, or negative personal experiences that come up frequently in online discussions. There is also a strong presence of terms related to relationships and identity (“wife,” “mom,” “girl,” “kids”), which reinforces how personal circumstances shape the military experience. Overall, the word cloud paints a picture of discussions centered around emotional realities, family pressures, difficult decisions, and the everyday challenges tied to military culture rather than purely operational or technical topics.

Step 5: Tri-Gram Analysis

#Purpose of this step 5: To understand recurring phrases and themes beyond individual words, I extracted tri-grams from the cleaned text. Tri-grams capture sequences of three consecutive words, which helps uncover patterns in how people describe their experiences, frustrations, or concerns. This is especially useful for a topic like women in the Army, where conversations often involve multi-word concepts related to leadership, safety, harassment, command climate, and day-to-day military life. After generating the tri-grams, I removed stop words and non-alphabetic tokens so that the remaining phrases would reflect meaningful and coherent expressions.
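
To make the idea concrete, here is a minimal base R sketch of tri-gram extraction on one made-up sentence. It is a simplified stand-in, not the tokenizer that tidytext uses; `make_trigrams()` and `toy_stop_words` are hypothetical names.

```r
# Build every run of three consecutive words from one piece of text
make_trigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 3) return(character(0))
  idx <- seq_len(length(words) - 2)
  paste(words[idx], words[idx + 1], words[idx + 2])
}

toy_stop_words <- c("the", "in")   # tiny illustrative stop-word list

toy_trigrams <- make_trigrams("The female combat soldiers train in the field")
length(toy_trigrams)
# 6

# Drop any tri-gram that contains a stop word in any position
has_stop_word <- vapply(strsplit(toy_trigrams, " "),
                        function(w) any(w %in% toy_stop_words),
                        logical(1))
kept_trigrams <- toy_trigrams[!has_stop_word]
kept_trigrams
# "female combat soldiers" "combat soldiers train"
```

Because the tri-grams are formed first and filtered afterward, every phrase that survives is made of three words that really were adjacent in the post.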

library(tidytext)
library(dplyr)
library(tidyr)     # provides separate(), used below
library(stringr)

# Generate tri-grams from full_text
trigrams <- womenmil_clean %>%
  unnest_tokens(trigram, full_text, token = "ngrams", n = 3)
# Split trigram into individual words
tri_sep <- trigrams %>%
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ")

# Remove stop words in ANY position
tri_clean <- tri_sep %>%
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word,
    !word3 %in% stop_words$word
  ) %>%
# Keep alphabetic words only
  filter(
    str_detect(word1, "^[a-z]+$"),
    str_detect(word2, "^[a-z]+$"),
    str_detect(word3, "^[a-z]+$")
  )

#Count tri-gram frequencies
trigram_counts <- tri_clean %>%
  count(word1, word2, word3, sort = TRUE)

head(trigram_counts, 20) %>%
  knitr::kable()
| word1       | word2         | word3     |  n |
|:------------|:--------------|:----------|---:|
| gt          | gt            | gt        | 65 |
| female      | israeli       | soldiers  | 11 |
| immediately | ban           | women     |  7 |
| movement    | takes         | power     |  7 |
| female      | israeli       | soldier   |  6 |
| accounts    | involving     | victims   |  5 |
| deadliest   | female        | sniper    |  5 |
| female      | idf           | soldiers  |  5 |
| operation   | distinguished | gentlemen |  5 |
| palestinian | detainees     | stripped  |  5 |
| palestinian | women         | testified |  5 |
| watch       | palestinian   | detainees |  5 |
| breaking    | rising        | republican |  4 |
| female      | combat        | soldiers  |  4 |
| folks       | dead          | germans   |  4 |
| israeli     | soldier       | posted    |  4 |
| kill        | women         | children  |  4 |
| lyudmila    | pavlichenko   | killed    |  4 |
| occupied    | west          | bank      |  4 |
| red         | haired        | woman     |  4 |
#Results from the tri-gram analysis: The tri-gram patterns show that conversations about women in the Army quickly expand beyond day-to-day military life and into broader political and global issues. Many of the most frequent tri-grams reference the Israeli–Palestinian conflict, like “female Israeli soldiers” and “Palestinian women testified.” This likely reflects that my broader keyword searches were not limited to Army-specific subreddits and pulled in international news threads, and it suggests that when gender comes up in military discussions, people often compare different countries and conflicts rather than focusing only on the U.S. context.

#At the same time, several tri-grams point toward heavier themes such as violence, combat roles, and victimization. Phrases like “accounts involving victims” and “kill women children” highlight how gender and conflict are often intertwined in these conversations. Even the less frequent tri-grams, like “female combat soldiers” or “red haired woman,” show a mix of personal stories and news-driven content. The presence of artifacts like “gt gt gt” also reminds me how messy Reddit text can be: “gt” is what is left of “&gt;”, the HTML-escaped “>” character that Reddit uses to mark quoted text, after my cleaning step strips the punctuation around it. Overall, the tri-grams make it clear that discussions about women in the Army tend to be emotionally charged and deeply connected to bigger debates about war, ethics, and gender.
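
A minimal sketch of where tokens like “gt” and “amp” can come from, and one possible way to avoid them. `clean_text()` mirrors the three cleaning steps from Step 3 using base R equivalents; `decode_entities()` is a hypothetical extra step that was not part of my pipeline, and it only handles three common HTML entities.

```r
raw_text <- "&gt; women can't serve &amp; lead? See https://example.com/post"

# Same three cleaning steps as Step 3, rewritten with base R
clean_text <- function(x) {
  x <- gsub("http[^[:space:]]+", "", x)           # drop URLs
  x <- gsub("[^[:alnum:][:space:]']", " ", x)     # punctuation becomes a space
  trimws(gsub("[[:space:]]+", " ", x))            # collapse repeated whitespace
}

# Hypothetical extra step: decode a few common HTML entities before cleaning.
# "&amp;" is decoded last so that text like "&amp;gt;" is not decoded twice.
decode_entities <- function(x) {
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)
}

clean_text(raw_text)
# "gt women can't serve amp lead See"

clean_text(decode_entities(raw_text))
# "women can't serve lead See"
```

Decoding first turns the entities back into ordinary punctuation, which the symbol-stripping step then removes, so no stray “gt” or “amp” tokens reach the tri-gram counts.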

Step 6: Sentiment Analysis on text using dictionary methods.

#Purpose of this step 6: To measure the emotional tone of the Reddit discussions, I used a dictionary-based sentiment model that accounts for negations. This is important because statements in military-related conversations often include phrases like “not supportive” or “didn’t feel safe,” which flip the polarity of sentiment. By combining the post titles and body text, splitting the content into sentences, and applying the negation-aware sentimentr dictionary, I obtained a sentiment score for each post. These scores help me understand whether conversations about women in the Army tend to lean more positive, negative, or neutral across the time period I collected.
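
To illustrate what “negation-aware” means, here is a deliberately simplified toy scorer in base R. It is not the algorithm that sentimentr implements: the lexicon, the negator list, the three-word look-back window, and the square-root length scaling are all assumptions chosen for illustration, and `score_sentence()` is a hypothetical name.

```r
# Toy lexicon and negator list (hypothetical, far smaller than a real dictionary)
toy_lexicon  <- c(supportive = 1, safe = 1, proud = 1, unsafe = -1, ignored = -1)
toy_negators <- c("not", "never", "didn't", "wasn't")

score_sentence <- function(sentence, window = 3) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)

  # Look up each word; words outside the lexicon score 0
  polarity <- unname(toy_lexicon[words])
  polarity[is.na(polarity)] <- 0

  # A word counts as negated if a negator appears within `window` words before it
  negated <- vapply(seq_along(words), function(i) {
    before <- words[seq_len(i - 1)]
    any(tail(before, window) %in% toy_negators)
  }, logical(1))
  polarity[negated] <- -polarity[negated]

  # Scale by sentence length so long sentences are not favored
  sum(polarity) / sqrt(length(words))
}

score_sentence("My unit was supportive")
# 0.5
score_sentence("My unit was not supportive")
# about -0.447
score_sentence("I didn't feel safe")
# -0.5
```

A plain word-count approach would score all three sentences as positive, which is the failure that a negation-aware method is meant to avoid.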

library(sentimentr)
library(dplyr)
library(stringr)

# Prepare text for sentiment analysis
sent_data <- womenmil_clean %>%
  mutate(
    title = ifelse(is.na(title), "", title),
    text  = ifelse(is.na(text),  "", text),
    combined_text = paste(title, text, sep = ". ")
  )

#Break text into sentences
sent_data <- sent_data %>%
  mutate(sentences = get_sentences(combined_text))

#Calculate negation-aware sentiment scores
sent_scores <- sentiment_by(sent_data$sentences)

# Add scores back into the dataset
sent_data$sentiment_score <- sent_scores$ave_sentiment
sent_data$word_count      <- sent_scores$word_count

Step 7: Display 10 sample texts alongside sentiment scores and evaluate credibility of sentiment analysis outcomes

examples <- bind_rows(
  sent_data %>% arrange(sentiment_score) %>% slice_head(n = 5) %>% mutate(type = "Most Negative"),
  sent_data %>% arrange(desc(sentiment_score)) %>% slice_head(n = 5) %>% mutate(type = "Most Positive")
)

examples %>%
  select(type, sentiment_score, combined_text) %>%
  knitr::kable()
| type | sentiment_score | combined_text |
|:-----|----------------:|:--------------|
| Most Negative | -1.0394023 | Fighting two enemies: Ukraines female soldiers decry harassment - Women in armed forces express anger at stigma and treatment by male colleagues and say complaints are being ignored. |
| Most Negative | -0.9644947 | Things I wish I would have known before MEPS + tips. |
| Most Negative | -0.7882408 | Fort Cavazos sergeant charged with attempted murder broke into barracks to rape, assault 5 women, court records show. |
| Most Negative | -0.6959705 | Soldier, 29, who grabbed two female colleagues’ bottoms to ‘lighten the mood’ during a parade but said it was ‘just banter’ is found guilty of sexual assault and jailed for two months. |
| Most Negative | -0.6948792 | We have to fight two enemies: Ukraines female soldiers decry stigma and harassment. |
| Most Positive | 0.8894585 | ‘True American hero’: Major in Minnesota Guard bestowed top military honor. |
| Most Positive | 0.8378771 | Female soldier begs drone operator but he doesnt care. |
| Most Positive | 0.6324555 | Ukrainians distribute humanitarian aid to civilians in Russia’s Kursk Oblast. |
| Most Positive | 0.5790078 | Any good affordable sports bra recommendations for military?. Im looking to buy some sports bras for basic or just military in general and Id like to hear any recommendations if you have any. |
| Most Positive | 0.5670939 | 17% od Ukrainian army are women. Absolutely bad ass.. |
#Results from the sample texts: Looking at the sample posts, the model’s sentiment scores generally make sense given the emotional tone of the text. Four of the five most negative entries involve serious topics (harassment, assault, misconduct, mistreatment, or wartime experiences), which naturally produce strongly negative scores. This is consistent with the real-world context of these discussions, especially since conversations about women in the Army often involve safety concerns, discrimination, or institutional failures. On the positive end, most posts focus on recognition, personal achievements, humanitarian work, or simple lifestyle questions, so it makes sense that the model rates them more positively. Two entries look like misfires, though: “Things I wish I would have known before MEPS + tips” reads as an advice post but received the second most negative score (-0.96), and “Female soldier begs drone operator but he doesnt care” describes a grim scene but received the second most positive score (0.84). More generally, the model reacts strongly to emotionally charged words (like “enemy,” “fight,” or “assault”), even when the overall message is more informational than emotional. This is a typical limitation of dictionary-based methods, but overall the model provides a reasonably credible picture of how people discuss women in the military.

Step 8: Intriguing insights derived from sentiment analysis with at least 3 plots

Plot 1: Monthly Average Sentiment of Reddit Posts About Women in the Army

library(lubridate)
library(ggplot2)
library(dplyr)

sent_time <- sent_data %>%
  filter(!is.na(sentiment_score),
         !is.na(date_utc)) %>%
  mutate(month = floor_date(as.Date(date_utc), "month")) %>%
  group_by(month) %>%
  summarise(
    avg_sentiment = mean(sentiment_score),
    count_posts = n(),
    .groups = "drop"
  )

ggplot(sent_time, aes(x = month, y = avg_sentiment)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Monthly Average Sentiment of Reddit Posts About Women in the Army",
    x = "Month",
    y = "Average Sentiment Score"
  ) +
  theme_minimal()

#Plot 1 Purpose: This plot shows how the emotional tone of discussions about women in the Army changes over time. Since my dataset spans 2023–2025, looking at the monthly averages helps reveal whether certain events, news cycles, or controversies correspond with noticeable spikes or dips in sentiment.

#Plot 1 Interpretation: Looking at the month-to-month pattern, the overall sentiment toward women in the Army fluctuates much more than I expected. Instead of gradually trending upward or downward, the emotional tone keeps shifting, which suggests that conversations online are being driven by whatever is happening in the news or inside the military community at that moment. The sharp negative dips appear to line up with high-profile stories about misconduct, harassment, violence, or legal cases, while the brief positive spikes seem tied to recognition posts, personal success stories, or general advice discussions. Even in the more recent months, sentiment remains pretty mixed, which reinforces how complex and emotionally charged this topic can be. Rather than settling into a stable narrative, people’s reactions keep shifting with new events, new frustrations, and new experiences being shared online.
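
One caution when reading monthly averages: a month with only one or two posts can produce a sharp spike or dip all by itself. The code above already computes `count_posts` but does not use it, so a possible follow-up is to flag thin months. A minimal base R sketch on made-up scores is below; `toy_scores`, `monthly`, and the `min_posts` threshold of 3 are hypothetical.

```r
toy_scores <- data.frame(
  date_utc = as.Date(c("2023-01-05", "2023-01-20", "2023-01-28",
                       "2023-02-10",
                       "2023-03-02", "2023-03-15", "2023-03-30")),
  sentiment_score = c(-0.2, 0.1, 0.4, -0.9, 0.0, 0.2, 0.1)
)

# Label each post with its calendar month
toy_scores$month <- format(toy_scores$date_utc, "%Y-%m")

# Average score and number of posts per month
monthly <- data.frame(month = sort(unique(toy_scores$month)),
                      stringsAsFactors = FALSE)
monthly$avg_sentiment <- as.numeric(
  tapply(toy_scores$sentiment_score, toy_scores$month, mean)[monthly$month]
)
monthly$count_posts <- as.integer(table(toy_scores$month)[monthly$month])

# Flag months with too few posts to trust the average (threshold is an assumption)
min_posts <- 3
monthly$reliable <- monthly$count_posts >= min_posts
monthly
```

In this toy data the February dip to -0.9 rests on a single post, which is exactly the kind of point that should be interpreted carefully on the line chart.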

Plot 2: Distribution of Sentiment Histogram

library(ggplot2)
# Check list of objects 
ls()
##  [1] "examples"           "installed_packages" "packages"          
##  [4] "remove_terms"       "sent_data"          "sent_scores"       
##  [7] "sent_time"          "tokens_clean"       "tokens_words"      
## [10] "tri_clean"          "tri_sep"            "trigram_counts"    
## [13] "trigrams"           "womenmil_clean"     "womenmil_total"
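# Create sentiment bins inside sent_data; the breaks span the observed range
# because some scores fall below -1 (the most negative post scores about -1.04)
score_range <- range(sent_data$sentiment_score, na.rm = TRUE)
sent_data$sent_bin <- cut(
  sent_data$sentiment_score,
  breaks = round(seq(floor(score_range[1] * 10) / 10,
                     ceiling(score_range[2] * 10) / 10,
                     by = 0.1), 1),
  include.lowest = TRUE
)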

ggplot(sent_data, aes(x = sentiment_score, fill = sent_bin)) +
  geom_histogram(
    binwidth = 0.1,
    boundary = 0,      # align bar edges with the sent_bin breaks used for the fill
    color = "white",
    alpha = 0.85
  ) +
  scale_fill_manual(
    values = colorRampPalette(
      c("red", "lightgray", "blue")
    )(length(unique(sent_data$sent_bin))),
    guide = "none"
  ) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Distribution of Sentiment Scores",
    x = "Sentiment Score",
    y = "Count of Posts"
  ) +
  theme_minimal(base_size = 13)

#Plot 2 Purpose: This distribution makes it much easier to see the balance of the conversation. Most posts cluster close to zero, which means a lot of the discussions land in a neutral or mixed emotional tone rather than being strongly positive or negative. The red shading symbolizes negative sentiment while the blue shows positive.


#Plot 2 Interpretation: The distribution shows that most Reddit posts cluster right around zero, which means that a large portion of the conversation about women in the Army falls into a neutral or mixed emotional tone. The red bins on the left highlight a noticeable spread of negative sentiment, which matches the frequent presence of posts discussing harassment, misconduct, discrimination, or frustration with leadership and policies. On the other side, the blue bins are much smaller, suggesting that strongly positive posts are less common and tend to revolve around recognition, supportive advice, or uplifting personal stories. Overall, the distribution makes it clear that while the topic generates some positive engagement, the heavier and more troubling experiences tend to shape the emotional landscape of the discussions.

Plot 3: Sentiment in Harassment vs. Non-Harassment Posts

sent_compare <- sent_data %>%
  mutate(
    harassment_flag = if_else(
      str_detect(tolower(combined_text),
                 "harass|assault|rape|misconduct"),
      "Harassment Mentioned",
      "No Harassment Mentioned"
    )
  )

ggplot(sent_compare, aes(x = harassment_flag, y = sentiment_score)) +
  geom_boxplot(width = 0.4, outlier.alpha = 0.3, fill = "red") +
  geom_jitter(width = 0.15, alpha = 0.25, color = "black") +
  labs(
    title = "Sentiment Scores by Harassment-Related Posts",
    x = "",
    y = "Sentiment Score"
  ) +
  theme_minimal(base_size = 13)

#Plot 3 Purpose: The goal of this graph is to compare how sentiment scores differ between Reddit posts that mention harassment or assault and those that do not. By placing both categories side-by-side, the plot helps reveal whether conversations involving harassment tend to be more negative and how they contrast with the broader discussion about women in the Army.

#Plot 3 Interpretation: Looking at the boxplots, the difference between the two groups is pretty clear. Posts that mention harassment cluster lower on the sentiment scale, showing that the tone of these discussions is consistently more negative. You can also see a few extreme low outliers, which likely reflect posts describing very troubling or traumatic incidents. In contrast, posts that do not mention harassment sit closer to neutral, with a wider spread that includes some mildly positive content. This group has much more variation overall, which makes sense because these posts range from career questions to daily life experiences to more neutral reporting. Even though neither category trends strongly positive, the separation between the two boxes shows that harassment-related posts really do pull the sentiment downward. It reinforces what you’d expect intuitively: when people talk about harassment, the emotional tone shifts, and the language becomes heavier, more frustrated, or more distressed. This plot makes that pattern easy to see at a glance.
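
One caveat about the keyword flag: the pattern matches substrings, so a short term like “rape” also matches inside unrelated words. A minimal sketch of the difference on made-up titles is below. `toy_posts` and `strict_pattern` are hypothetical, and the stricter pattern is only one possible choice (for example, it would not catch “rapist”).

```r
toy_posts <- c(
  "Sergeant charged with assault",
  "I scraped my knee during the ruck march",
  "Harassment complaints were ignored",
  "Best grapefruit snacks for the field?"
)

# Pattern used in the plot code above: matches anywhere inside a word
loose_pattern <- "harass|assault|rape|misconduct"

# Stricter variant: each term must start and end at a word boundary
strict_pattern <- "\\b(harass\\w*|assault\\w*|rape[sd]?|misconduct)\\b"

loose_flag  <- grepl(loose_pattern,  tolower(toy_posts))
strict_flag <- grepl(strict_pattern, tolower(toy_posts), perl = TRUE)

data.frame(post = toy_posts, loose_flag, strict_flag)
# loose_flag:  TRUE TRUE  TRUE TRUE
# strict_flag: TRUE FALSE TRUE FALSE
```

Reading through a sample of the flagged posts by hand would show whether false matches like these are common enough in the real data to change the boxplot.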

BONUS: Plot 4: Sentiment Scores for Posts Asking Whether Joining the Army is “Worth It” for Women

#double check column names
names(sent_scores)
## [1] "element_id"    "word_count"    "sd"            "ave_sentiment"
names(sent_data)
##  [1] "date_utc"        "timestamp"       "title"           "text"           
##  [5] "subreddit"       "comments"        "url"             "full_text"      
##  [9] "combined_text"   "sentences"       "sentiment_score" "word_count"     
## [13] "sent_bin"
# Use the real sentiment dataset
womenmil_sentiment <- sent_data

# Create a flag for posts discussing whether joining the Army is "worth it"
worth_it_terms <- c("worth it", "should i join", "should i enlist",
                    "thinking of joining", "is it worth", "join the army")

womenmil_sentiment$worth_it_flag <- ifelse(
  str_detect(tolower(womenmil_sentiment$combined_text),
             paste(worth_it_terms, collapse = "|")),
  "Decision Posts",
  "Other Posts"
)

# Plot sentiment comparison
library(ggplot2)

ggplot(womenmil_sentiment, aes(x = worth_it_flag, y = sentiment_score)) +
  geom_boxplot(fill = c("lightgreen", "darkred"), alpha = 0.8, outlier.alpha = 0.3) +
  geom_jitter(width = 0.15, alpha = 0.25, size = 1) +
  labs(
    title = "Sentiment Scores for Posts About Whether Joining the Army Is Worth It",
    x = "",
    y = "Sentiment Score"
  ) +
  theme_minimal(base_size = 14)

#Plot 4 Purpose: The goal of this plot is to compare the emotional tone of Reddit posts where people specifically ask whether joining the Army is “worth it” for women against the tone of all other posts about women in the Army. This lets me see whether career-decision questions carry a different sentiment than the broader conversation, and whether the way people talk about enlistment reflects more hesitation, caution, or negativity.

#Plot 4 Interpretation: What stands out right away is how much tighter and slightly more positive the “Decision Posts” group is. These posts hover just above neutral, with only a few dipping slightly negative. That makes sense because the people writing them are usually asking genuine questions, weighing options, and looking for advice, rather than venting or sharing negative experiences. It creates this narrower, calmer band of sentiment. The “Other Posts” category, on the other hand, is a lot more spread out. You can see everything from really negative outliers to scattered positive reactions. That broader range reflects the mix of topics people discuss: harassment, deployments, relationships, injuries, accomplishments, daily routines, and military news. There’s simply more emotional variety in those posts, so the boxplot widens and the points scatter more heavily. Overall, this comparison suggests that when women (or others) talk about whether joining the Army is worth it, the tone leans more neutral and reflective rather than openly negative. The more intense negativity appears in posts tied to specific experiences, events, or issues that are not in career-decision discussions themselves. This helps highlight how the “should I join?” conversation is its own distinct, thoughtful space within the larger online discourse.

Key Takeaways from this assignment’s plots and the course: Looking at all four plots together paints a pretty consistent picture of how people talk about women in the Army online. The month-to-month sentiment line shows that conversations stay close to neutral overall, but with noticeable dips during months where harassment, violence, or high-visibility news stories surface. The distribution plot reinforces that pattern by showing that most posts cluster near zero, with only a small share being strongly positive or negative. When the discussion specifically touches on harassment, the tone predictably shifts downward: the harassment boxplot makes that clear, and the limited spread suggests that these posts often share similarly negative experiences or frustrations.

In contrast, posts about whether joining the Army is “worth it” for women fall into a much narrower, slightly positive range. Those conversations feel more cautious and reflective rather than emotionally charged. They stand apart from the heavier themes in the general dataset. Across all plots, you can see the tension between day-to-day curiosity or career exploration and the harsher realities women describe in certain parts of the military environment. Taken together, the sentiment analysis suggests a community trying to be honest, helpful, and direct, but also one that doesn’t shy away from acknowledging systemic issues that shape women’s experiences in the Army.

Thank You!!! Professor, thank you so much for your teaching this semester. Your assignments pushed me to explore tools and methods I never thought I’d actually enjoy, and I learned far more about text analytics than I expected to. I really appreciated how flexible, supportive, and approachable you were throughout the course, especially since, as I hope this assignment shows, I’m navigating my own career uncertainties and trying to reconnect with the parts of data analysis that genuinely excite me (and that I hope will make me more competitive when I apply to become a Military Intelligence officer). This final project ended up being personally meaningful, and I’m grateful for the space you created for us to explore topics that matter to us. Thank you again for a great semester. Have a great winter break!