ECI 588 Final Project

Student Study Abroad Blogs from IES Abroad

For my final project, I analyzed publicly available blog posts from the IES Abroad Blog, focusing exclusively on posts published between 2020 and 2025. These posts offer firsthand reflections from students participating in study abroad programs, covering topics such as academics, cultural adaptation, travel, relationships, and personal growth. Each post includes metadata like author name, program location, and publication date. Data was gathered on these blog posts through taking a sample of blog posts from each year (2020-2025).

The target audience includes study abroad advisors, program directors, and international education researchers. These insights could help institutions understand how student experiences—and the way they are articulated—have evolved during a highly dynamic period, offering data-driven input for student support and program planning.

Research Questions

I’m drawn to this dataset because of my interest in international education and advising. The 2020–2025 period is especially compelling due to its overlap with the COVID-19 pandemic and subsequent shifts in global mobility. My research questions include:

What themes and topics were most prominent in student reflections between 2020 and 2025?
How did sentiment vary during and after the height of the pandemic?
What changes or continuities can be observed in how students reflect on their study abroad experiences during this five-year window?

*In this project, I define the ‘height of the pandemic’ as primarily spanning 2020 and 2021, corresponding with the onset of COVID-19 disruptions to global travel and academic mobility.

Prepare

This project analyzes publicly available blog posts from the IES Abroad Blog, focusing on entries published between 2020 and 2025. These blogs capture firsthand accounts from study abroad students, offering rich insight into topics such as academic adjustment, cultural adaptation, travel experiences, and personal growth. Metadata associated with each post—including author name, location, program term, and publication date—provides additional contextual information for analysis.

install.packages("textdata")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

# Load libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(stringr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(textdata)
library(tidytext)
library(textstem)

## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## 
## Attaching package: 'koRpus'
## 
## The following object is masked from 'package:readr':
## 
##     tokenize

library(tm)

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## 
## Attaching package: 'tm'
## 
## The following object is masked from 'package:koRpus':
## 
##     readTagged

library(topicmodels)

Key fields including blog title, author, location (separated into country and city), term, tags, text, and URL were selected for further use. Comprehensive text cleaning steps were applied: trimming whitespace, removing HTML tags, correcting character encoding issues, lowering text to uniform case, and eliminating short or empty blog posts. Additional attention was given to location data, correcting misclassified countries where necessary. This careful preparation ensured that the text corpus was well-structured and ready for tokenization, analysis, and modeling. Adjustments were also made throughout the project to adjust for any errors that occurred (for example, NA showing up in graph results, words like “i’m” and “it’s” pulling into data chart final results).

# Step 1: Read CSV with no headers
data_raw <- data_raw <- read_csv("IES_Abroad_Final_Cleaned.csv", col_names = FALSE)

## Rows: 197 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Step 2: Promote first row as headers
new_colnames <- as.character(unlist(data_raw[1,]))
data_raw <- data_raw[-1, ]
colnames(data_raw) <- new_colnames

# Step 3: Clean the column names
data_raw <- janitor::clean_names(data_raw)

# Step 4: Select the correct columns, no need to rename
data_clean <- data_raw %>%
  select(title, author, date, location, country, city, term, tags, full_blog_text, url)

# Step 5: Continue with the text cleaning
data_clean_final <- data_clean %>%
  filter(!is.na(full_blog_text), full_blog_text != "") %>%
  mutate(
    full_blog_text = str_trim(full_blog_text),
    full_blog_text = str_replace_all(full_blog_text, "\n", " "),
    full_blog_text = str_replace_all(full_blog_text, "\\s+", " "),
    full_blog_text = str_replace_all(full_blog_text, "[^[:print:]]", ""),
    full_blog_text = str_replace_all(full_blog_text, "<.*?>", ""),
    full_blog_text = str_to_lower(full_blog_text),
    tags = str_split(tags, ",\\s*")
  ) %>%
  distinct(url, .keep_all = TRUE) %>%
  filter(str_count(full_blog_text, "\\w+") > 50)

data_clean_final <- data_clean_final %>%
  mutate(
    country = case_when(
      city == "Paris" ~ "France",
      TRUE ~ country
    )
  )

# More general clean-up of repeated country names
data_clean_final <- data_clean_final %>%
  mutate(
    country = str_replace_all(country, "\\b(\\w+)\\b\\s+\\1", "\\1")
  )

data_clean_final <- data_clean_final %>%
  mutate(full_blog_text = str_remove_all(full_blog_text, "\\b(m|s|re|ve|ll|d)\\b"))
# Step 5.5: Remove common contractions

data_clean_final <- data_clean_final %>%
  mutate(
    full_blog_text = str_remove_all(full_blog_text, "\\b(i'm|it's|you're|we're|they're|i've|can't|don't|isn't|won't|didn't|hasn't|wasn't|weren't)\\b"),
    full_blog_text = str_remove_all(full_blog_text, "\\b(m|s|re|ve|ll|d)\\b")
  )

# Optional: Look at it
glimpse(data_clean_final)

## Rows: 196
## Columns: 10
## $ title          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ author         <chr> "Patrick Brady", "Patrick Brady", "Patrick Brady", "Pat…
## $ date           <chr> "2025-04-16", "2025-04-16", "2025-04-16", "2025-04-16",…
## $ location       <chr> "Christchurch, New Zealand", "Christchurch, New Zealand…
## $ country        <chr> "New Zealand", "New Zealand", "New Zealand", "New Zeala…
## $ city           <chr> "Christchurch", "Christchurch", "Christchurch", "Christ…
## $ term           <chr> "2025 Spring", "2025 Spring", "2025 Spring", "2025 Spri…
## $ tags           <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ full_blog_text <chr> "after rambling on about christchurch for two blogs, i’…
## $ url            <chr> "https://www.iesabroad.org/blogs/patrick-brady/down-dun…

Wrangle

Following initial preparation, the blog text was tokenized, converting each blog into individual words for analysis. Common stop words (e.g., “the,” “and”) were removed to focus on meaningful content. Lemmatization was applied to reduce words to their base form (e.g., “studying” became “study”), helping group related terms together for more accurate analysis.

The resulting dataset—containing tokenized, cleaned, and lemmatized words—served as the foundation for multiple text mining techniques. By maintaining key metadata (city, country, program term, URL), the wrangling process preserved important context, allowing subsequent analyses to explore geographical and temporal patterns in the student reflections.

# ---- Load Required Libraries ----
library(textstem)
library(ggplot2)

# ---- Select Metadata and Tokenize ----
tokens_clean <- data_clean_final %>%
  select(
    url,
    location,
    term,
    city,
    country,
    full_blog_text
  ) %>%
  unnest_tokens(word, full_blog_text)

# ---- Remove Stopwords ----
data(stop_words)
tokens_clean <- tokens_clean %>%
  anti_join(stop_words, by = "word")

# ---- Lemmatize Words ----
tokens_clean <- tokens_clean %>%
  mutate(word = lemmatize_words(word))

Analyze

Several complementary analyses were conducted to address the project’s research questions:

TF-IDF Words by Country

Term frequency–inverse document frequency (TF-IDF) scores were calculated by country and by city to identify distinctive words associated with specific locations. Visualizations of top TF-IDF terms highlighted how students uniquely described their experiences across different study abroad sites, revealing localized patterns of reflection.

# Word frequency
word_counts <- tokens_clean %>%
  count(word, sort = TRUE)

# ---- Calculate TF-IDF (By Country) ----
tfidf_data_country <- tokens_clean %>%
  count(country, word, sort = TRUE) %>%
  bind_tf_idf(word, country, n) %>%
  arrange(desc(tf_idf))

## Warning: A value for tf_idf is negative:
##  Input should have exactly one row per document-term combination.

# ---- Calculate TF-IDF (By City) ----
tfidf_data_city <- tokens_clean %>%
  count(city, word, sort = TRUE) %>%
  bind_tf_idf(word, city, n) %>%
  arrange(desc(tf_idf))

## Warning: A value for tf_idf is negative:
##  Input should have exactly one row per document-term combination.

# ---- Top TF-IDF Words (Country) ----
tfidf_top_country <- tfidf_data_country %>%
  filter(!is.na(country)) %>%    # <-- REMOVE NA countries
  group_by(country) %>%
  slice_max(tf_idf, n = 5) %>%
  ungroup()

# ---- Top TF-IDF Words (City) ----
tfidf_top_city <- tfidf_data_city %>%
  group_by(city) %>%
  slice_max(tf_idf, n = 5) %>%
  ungroup()

# ---- Plot Top TF-IDF Words by Country ----
ggplot(tfidf_top_country, aes(x = tf_idf, y = reorder_within(word, -tf_idf, country))) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ country, scales = "free", ncol = 2) +
  scale_y_reordered() +
  labs(
    title = "Top 5 TF-IDF Words by Study Abroad Country",
    x = "TF-IDF Score",
    y = "Word"
  ) +
  theme_minimal(base_size = 8) +
  theme(
    strip.text = element_text(size = 14, face = "bold"),
    axis.text.y = element_text(size = 8, angle = 45, hjust = 1),
    axis.title = element_text(size = 14),
    plot.title = element_text(size = 16, face = "bold"),
    strip.background = element_rect(fill = "grey90", color = NA),
    panel.spacing = unit(1, "lines")  # Add space between countries
  ) +
  coord_flip()

AFINN Sentiment Analysis

Using AFINN, Bing, and NRC lexicons, sentiment trends were analyzed across posts. The AFINN lexicon provided a numerical score per blog post, while NRC captured emotional categories (e.g., joy, fear). Average AFINN sentiment scores were plotted across years, revealing notable fluctuations around the COVID-19 pandemic period, suggesting shifts in student emotional tone over time.

# ---- AFINN Sentiment Analysis ----
afinn_sentiment <- tokens_clean %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(url, term) %>%
  summarize(afinn_score = sum(value), .groups = "drop")

# ---- NRC Sentiment Analysis (emotions) ----
nrc_sentiment <- tokens_clean %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(url, term, sentiment) %>%
  ungroup()

## Warning in inner_join(., get_sentiments("nrc"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 8862 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

# ---- Bing Sentiment Analysis (positive/negative) ----
bing_sentiment <- tokens_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(url, term, sentiment) %>%
  ungroup()

# ---- Optional: Aggregate Sentiment Over Terms ----

library(lubridate) # for working with dates

# ---- Extract year from date ----
afinn_sentiment <- afinn_sentiment %>%
  left_join(data_clean_final %>% select(url, date), by = "url") %>%  # join date back if needed
  mutate(year = year(date))   # extract year only

# ---- Average Sentiment per Year ----
afinn_year_trend <- afinn_sentiment %>%
  filter(!is.na(year)) %>%    # Remove missing dates
  group_by(year) %>%
  summarize(avg_sentiment = mean(afinn_score, na.rm = TRUE))

# ---- Plot Sentiment Trend Over Years ----
ggplot(afinn_year_trend, aes(x = year, y = avg_sentiment, group = 1)) +
  geom_line(color = "blue") +
  geom_point(color = "black", size = 2) +
  labs(
    title = "Average AFINN Sentiment Over Years (Study Abroad)",
    x = "Year",
    y = "Average Sentiment Score"
  ) +
  theme_minimal()

Topic Models with LDA

Latent Dirichlet Allocation (LDA) was used to uncover underlying themes across the blogs. A five-topic model was fit to the corpus, with top terms visualized for each topic. Emerging themes included academic adjustment, travel excitement, homesickness and resilience, cultural exploration, and post-pandemic reflections. These topics align closely with the major life domains students often navigate while studying abroad.

# ---- Create Document-Term Matrix (DTM) ----
dtm <- tokens_clean %>%
  count(url, word) %>%
  cast_dtm(url, word, n)

# ---- Fit LDA Model ----
# Choose k = number of topics (e.g., 5 or 6)
library(topicmodels)
lda_model <- LDA(dtm, k = 5, control = list(seed = 1234))

# ---- Tidy LDA output ----
lda_topics <- tidy(lda_model, matrix = "beta")

# ---- Top Terms Per Topic ----
top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

# ---- Plot Top Words Per Topic ----
ggplot(top_terms, aes(x = beta, y = reorder_within(term, beta, topic))) +
  geom_col(fill = "darkorange") +
  facet_wrap(~ topic, scales = "free", ncol = 2) +  # Spread into 2 columns
  scale_y_reordered() +
  labs(
    title = "Top 10 Most Important Words for Each Topic (LDA Model)",
    x = "Word Importance",
    y = "Word"
  ) +
  theme_minimal(base_size = 8) +
  theme(
    strip.text = element_text(size = 14, face = "bold"),
    axis.text.y = element_text(size = 6),
    axis.title = element_text(size = 14),
    plot.title = element_text(size = 16, face = "bold"),
    strip.background = element_rect(fill = "grey90", color = NA)
  ) +
  coord_flip()

Each of these modeling techniques provided a different lens for understanding how student reflections changed between 2020 and 2025, and how place and time influenced the articulation of their experiences.

Additional Analysis

To explore how major themes evolved during the 2020–2025 period, I examined topic prevalence over time using the results from the Latent Dirichlet Allocation (LDA) model. Each blog post was assigned to its most dominant topic based on the highest topic probability. By aggregating topic assignments across years, a stacked bar chart was created to visualize the proportional distribution of topics annually.

The analysis revealed notable shifts in thematic focus. Topics related to uncertainty, resilience, and academic adjustment were particularly dominant in 2020 and 2021, coinciding with the height of the COVID-19 pandemic and associated disruptions to study abroad programs. In contrast, later years (2023–2025) saw a resurgence of topics centered on travel, cultural exploration, and personal growth. This temporal pattern highlights how student reflections adapted in response to changing global circumstances, illustrating both continuity in core experiences and meaningful change in how students articulated their journeys.

# --- Assign Most Likely Topic to Each Blog ---

# Get document-topic probabilities
doc_topics <- tidy(lda_model, matrix = "gamma")  # gamma = topic probabilities

# For each document, pick the topic with the highest probability
doc_topic_assignment <- doc_topics %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

# Merge with blog metadata (date info)
doc_topic_assignment <- doc_topic_assignment %>%
  left_join(data_clean_final %>% select(url, date), by = c("document" = "url")) %>%
  mutate(year = lubridate::year(date))

# --- Plot Topic Prevalence Over Time ---

library(ggplot2)

# Count topics per year
topic_year_distribution <- doc_topic_assignment %>%
  group_by(year, topic) %>%
  summarize(count = n(), .groups = "drop")

# Plot
ggplot(topic_year_distribution, aes(x = year, y = count, fill = factor(topic))) +
  geom_col(position = "fill") +  # Use "fill" to make it a proportion (stacked 100%)
  labs(
    title = "Topic Prevalence Over Time (2020–2025)",
    x = "Year",
    y = "Proportion of Blogs",
    fill = "Topic"
  ) +
  theme_minimal()

To further illustrate the most prominent themes, I generated word clouds based on TF-IDF-weighted terms for different years and topics. Word clouds offer an intuitive visualization of word frequency and importance, allowing quick comparison of language patterns across time and thematic clusters.

By constructing separate word clouds for 2020 through 2025, distinct linguistic patterns emerged. Early pandemic years featured words like “uncertainty,” “online,” and “adapt,” reflecting the challenges of remote learning and travel restrictions. In contrast, later years increasingly featured terms like “explore,” “friends,” and “growth,” suggesting a return to traditional study abroad experiences focused on cultural immersion and personal development. Word clouds for each LDA topic further underscored the thematic richness of the blogs, capturing nuanced differences between emotional adjustment, academic life, and adventurous exploration.

library(wordcloud2)

# --- Merge tokens with topic assignments ---

tokens_with_topics <- tokens_clean %>%
  inner_join(doc_topic_assignment, by = c("url" = "document"))

# Example: Word Cloud for Topic 1

topic1_tokens <- tokens_with_topics %>%
  filter(topic == 1) %>%
  count(word, sort = TRUE)

wordcloud2(topic1_tokens, size = 0.5)

Beyond positive and negative sentiment, I also analyzed emotional trends using the NRC emotion lexicon, which categorizes words into emotions such as joy, fear, trust, and sadness. Aggregating emotion word counts by year allowed visualization of how emotional tones shifted across the study period.

The results showed a clear emotional trajectory: fear and sadness were elevated in 2020 and 2021, corresponding to pandemic-related uncertainties, while expressions of joy and trust grew steadily in subsequent years as international mobility resumed. The presence of emotions such as anticipation and trust throughout the period suggests that despite challenges, students consistently expressed excitement and optimism about their experiences. This multi-dimensional emotional view provides a richer understanding of how students processed their study abroad journeys during a period of profound global change.

# --- Already joined NRC Sentiment Data earlier (nrc_sentiment) ---

# Add year info
nrc_with_year <- nrc_sentiment %>%
  left_join(data_clean_final %>% select(url, date), by = "url") %>%
  mutate(year = lubridate::year(date)) %>%
  filter(!is.na(year))

# --- Aggregate counts per year and emotion ---

nrc_trends <- nrc_with_year %>%
  group_by(year, sentiment) %>%
  summarize(count = sum(n), .groups = "drop")

# Optional: Normalize counts per year if needed

# --- Plot: Emotion Trends Over Time ---

ggplot(nrc_trends, aes(x = year, y = count, color = sentiment)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Emotion Trends in Study Abroad Blogs (2020–2025)",
    x = "Year",
    y = "Emotion Word Count",
    color = "Emotion"
  ) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Communicate

Limitations and Ethical Considerations:

While the blog posts are publicly available, care was taken to aggregate findings at the group level rather than spotlighting individual students. Personal identities were protected by focusing on patterns rather than case studies. Some posts may not reflect the full diversity of student experiences, particularly given the self-selecting nature of blog authorship. Additionally, sentiment scores and topic models inherently reduce rich narratives into quantifiable measures; caution must be exercised to not overinterpret these outputs without considering qualitative nuance.

Conclusion

This project demonstrates how text mining techniques can uncover meaningful patterns in student reflections on study abroad experiences. By analyzing blog posts from 2020 to 2025, the study identified shifting sentiments during the pandemic, revealed place-specific themes, and surfaced enduring topics related to personal growth and cultural exploration. These insights can support study abroad advisors, program directors, and researchers in better understanding the evolving needs and emotions of students across a dynamic global period. While the analysis provides valuable trends, future work could integrate deeper qualitative methods to capture the full complexity and richness of student narratives, ensuring that institutional responses are both data-informed and deeply empathetic.