DATA 607 - Project #2

Author

Denise Atherley

Approach

For this project, I reviewed the entries in “Discussion 5A: Untidy Data” and chose a compelling one by Qingquan Li. He shared an untidy data set from Wikipedia that lists countries and their past and projected GDP. The data runs from estimates for the 1970s through the present day and includes projections out to 2030.

Tidying strategy:

For this project I work specifically with the data from 2020 to 2030. I took the raw data from the Wikipedia page and created a .csv file that preserves its original wide format. I then uploaded the .csv file to my GitHub repository so that I can reference it from RStudio.

A copy of this data can be found in this public repository: Countries GDP (github repository)

Analysis strategy:

I will use a combination of tidyr and dplyr to tidy and transform the data. Once the data is in long format, I will compare GDP growth trajectories over time and summarize each country’s average growth rate to see which economy grows fastest and which grows slowest. I will also use ggplot2 to visualize the GDP change over time for the top 5 and bottom 5 countries.

LLM support and enhancement

I expect some of the code needed to tidy and visualize the data to be challenging, so I will use Google Gemini to help expand and iterate on my code.

AI citation -

Google DeepMind. (2026). Gemini Pro [Large language model]. https://gemini.google.com. Accessed March 8, 2026.

Code Deliverable -

I load the tidyverse so that tidyr, dplyr, ggplot2, and the other packages I need are available.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load and Preview data

I examine the information by loading the data and then previewing the top 6 rows.

# Import dataset into R from github repository
gdp_data <- read.csv("https://raw.githubusercontent.com/meiqing39/607-Project-2/refs/heads/main/List%20of%20Countries%20by%20Past%20and%20Projected%20GDP.csv")

# Preview data to understand its structure
head(gdp_data)

   Country...territory   X2020   X2021   X2022   X2023   X2024   X2025   X2026
1          Afghanistan  20,136  14,278  14,501  17,248  18,080                
2              Albania  15,271  17,985  19,189  23,633  27,084  29,939  32,411
3              Algeria 164,774 185,850 225,652 248,087 269,140 288,013 284,984
4              Andorra   2,885   3,325   3,376   3,786   4,038   4,408   4,715
5               Angola  66,521  84,375 142,442 112,483 115,214 115,167 109,855
6  Antigua and Barbuda   1,412   1,602   1,867   2,006   2,208   2,340   2,456
    X2027   X2028   X2029   X2030
1                                
2  34,208  36,358  38,657  41,115
3 291,065 297,245 303,879 308,997
4   4,855   5,001   5,157   5,318
5 114,798 122,168 130,394 139,542
6   2,567   2,683   2,803   2,929

Data transformation

I will transform the data from wide format to long format and rename my variables to follow a more consistent naming convention. I will also normalize the variable structure so that the GDP values and years are numeric and the country names have no leading or trailing spaces. Several records are missing GDP values for some years (Afghanistan, for example, has no figures after 2024). These blank cells become NA when converted to numeric, so to keep them from affecting my analysis later on, I will drop rows with NA in the ‘GDP’ column.

tidy_gdp <- gdp_data %>%
  
  # 1. Rename variables to follow a consistent naming convention
  rename(Country = `Country...territory`) %>%
  
  # 2. Reshape from wide to tidy (long) format
  # The dataset has years (2020 to 2030) as separate columns (wide format).
  # I will use pivot_longer to melt all columns EXCEPT 'Country' into two columns:
  # 'Year' and 'GDP'.
  pivot_longer(
    cols = -Country,         
    names_to = "Year",       
    values_to = "GDP"        
  ) %>%
  
  # 3. Normalize variable structure
  mutate(
    GDP = as.numeric(str_remove_all(GDP, ",")),
    Year = as.numeric(str_remove_all(Year, "X")),
    Country = str_trim(Country)
  ) %>%
  
  # 4. Address missing values by dropping the rows where GDP is NA
  drop_na(GDP)

# View the final, tidy dataset
print(tidy_gdp)

# A tibble: 2,109 × 3
   Country      Year   GDP
   <chr>       <dbl> <dbl>
 1 Afghanistan  2020 20136
 2 Afghanistan  2021 14278
 3 Afghanistan  2022 14501
 4 Afghanistan  2023 17248
 5 Afghanistan  2024 18080
 6 Albania      2020 15271
 7 Albania      2021 17985
 8 Albania      2022 19189
 9 Albania      2023 23633
10 Albania      2024 27084
# ℹ 2,099 more rows
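
As an alternative sketch of the same reshaping step, `pivot_longer()` can strip the `X` prefix and convert the year while pivoting, and `readr::parse_number()` can handle the thousands separators. The toy data frame below is made up for illustration (the values are borrowed from the preview above) and is not the full dataset; this is one possible variant, not the code used in this project.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy wide-format data (illustrative only): padded names, comma separators, one blank cell
toy_wide <- tibble::tibble(
  Country = c(" Albania ", "Afghanistan"),
  X2024   = c("27,084", "18,080"),
  X2025   = c("29,939", "")
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -Country,
    names_to = "Year",
    names_prefix = "X",                        # drop the leading "X" from column names
    names_transform = list(Year = as.integer), # store the year as an integer
    values_to = "GDP"
  ) %>%
  mutate(
    GDP = parse_number(GDP),                   # "27,084" becomes 27084; "" becomes NA
    Country = trimws(Country)
  ) %>%
  drop_na(GDP)

print(toy_long)
```

Handling the prefix inside `pivot_longer()` keeps the year cleanup next to the reshape, at the cost of a slightly longer call.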

Analysis

To compare GDP growth trajectories and find which economy has the highest and which has the lowest average growth rate, I will use the Compound Annual Growth Rate (CAGR). I calculate it for each country from its earliest year to its latest available year.

# Calculate the Compound Annual Growth Rate (CAGR) for each country
gdp_summary <- tidy_gdp %>%
  group_by(Country) %>%
  # Filter to just the start and end years for each country
  filter(Year == min(Year) | Year == max(Year)) %>%
  arrange(Year) %>%
  summarize(
    Start_Year = first(Year),
    End_Year = last(Year),
    Start_GDP = first(GDP),
    End_GDP = last(GDP),
    Years = End_Year - Start_Year,
    # Formula for CAGR: (End Value / Start Value)^(1 / Years) - 1
    CAGR = ((End_GDP / Start_GDP) ^ (1 / Years)) - 1,
    .groups = "drop"
  ) %>%
  # Remove countries with only 1 year of data, where Years = 0 makes 1 / Years undefined
  filter(Years > 0)
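
To make the formula concrete, here is a minimal sketch that applies it to two countries using figures already shown in the preview and tidy output above. The helper name `cagr()` is hypothetical and is not part of the project code.

```r
# Hypothetical helper (illustration only): CAGR between a start and an end value
cagr <- function(start_value, end_value, years) {
  stopifnot(years > 0, start_value > 0)
  (end_value / start_value)^(1 / years) - 1
}

# Albania: 15,271 in 2020 and 41,115 in 2030 (a ten-year window)
albania_cagr <- cagr(15271, 41115, 2030 - 2020)

# Afghanistan: 20,136 in 2020 and 18,080 in 2024 (data stops after four years)
afghanistan_cagr <- cagr(20136, 18080, 2024 - 2020)

print(round(100 * c(Albania = albania_cagr, Afghanistan = afghanistan_cagr), 1))
```

Albania works out to roughly 10.4% per year and Afghanistan to roughly -2.7% per year. Because the exponent depends on the number of years, countries with shorter windows are annualized over fewer periods.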

I will now extract the top 5 countries with the highest growth and the bottom 5 countries with the lowest growth, combine them into one table for analysis and view them in a formatted summary table.

library(scales) # for formatting percents


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

# Extract the Top 5 (highest growth) and Bottom 5 (lowest growth) economies
top_5 <- gdp_summary %>% arrange(desc(CAGR)) %>% head(5)
bottom_5 <- gdp_summary %>% arrange(CAGR) %>% head(5)

# Combine into one table for analysis
extreme_gdp <- bind_rows(top_5, bottom_5)

# View the formatted summary table
extreme_gdp %>%
  select(Country, Start_Year, End_Year, Start_GDP, End_GDP, CAGR) %>%
  mutate(CAGR = percent(CAGR, accuracy = 0.1)) %>%
  print()

# A tibble: 10 × 6
   Country               Start_Year End_Year Start_GDP End_GDP CAGR  
   <chr>                      <dbl>    <dbl>     <dbl>   <dbl> <chr> 
 1 Guyana                      2020     2030      5471   40275 22.1% 
 2 Suriname                    2020     2030      2912   11543 14.8% 
 3 Uzbekistan                  2020     2030     66443  240767 13.7% 
 4 São Tomé and Príncipe       2020     2030       476    1618 13.0% 
 5 Kyrgyzstan                  2020     2030      8283   28155 13.0% 
 6 Zimbabwe                    2020     2030     39962    6368 -16.8%
 7 Palestine                   2020     2024     15532   13711 -3.1% 
 8 Nigeria                     2020     2030    598725  443602 -3.0% 
 9 Afghanistan                 2020     2024     20136   18080 -2.7% 
10 Japan                       2020     2030   5054069 5119885 0.1%
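
The same selection could also be sketched with dplyr's `slice_max()` and `slice_min()`, which pick the extreme rows without a separate `arrange()` and `head()`. The toy table below uses made-up countries and rates purely for illustration, and `with_ties = FALSE` is an assumption about how ties should be handled.

```r
library(dplyr)

# Toy growth-rate table (made-up values, illustration only)
toy_summary <- tibble::tibble(
  Country = c("A", "B", "C", "D", "E", "F"),
  CAGR    = c(0.22, 0.15, 0.01, -0.03, -0.17, 0.05)
)

# with_ties = FALSE guarantees exactly n rows even if two rates are equal
top_2    <- toy_summary %>% slice_max(CAGR, n = 2, with_ties = FALSE)
bottom_2 <- toy_summary %>% slice_min(CAGR, n = 2, with_ties = FALSE)

extremes <- bind_rows(top_2, bottom_2)
print(extremes)
```

By default both functions keep ties, so they can return more than `n` rows; setting `with_ties = FALSE` matches the fixed-size behavior of `head(5)`.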

Interpretation of Summary Table

The summary table exposes stark contrasts in global economic trajectories. Guyana has the highest average growth rate by a wide margin (~22% annually). This reflects an economic expansion driven largely by massive offshore oil discoveries that recently entered production (source: https://www.cfr.org/articles/how-guyanas-oil-boom-will-reshape-energy-security). Conversely, Zimbabwe shows the steepest projected contraction in this dataset (-16.8% annually), which may reflect droughts and floods that affected agricultural output as well as higher costs of fuel and food imports (source: https://www.afdb.org/en/countries/southern-africa/zimbabwe/zimbabwe-economic-outlook). Japan, an advanced economy, sits near the bottom with essentially flat growth (+0.1%), illustrating long-term stability rather than expansion. Note that Palestine and Afghanistan have data only through 2024, so their rates cover four years rather than ten and are not directly comparable with the others.

Visualizing the data

I will set up a bar chart that will compare the highest CAGR to the lowest CAGR.

# 1. Bar Chart of Highest and Lowest Growth Rates
ggplot(extreme_gdp, aes(x = reorder(Country, CAGR), y = CAGR, fill = CAGR > 0)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("TRUE" = "#2E8B57", "FALSE" = "#CD5C5C")) +
  labs(
    title = "Highest and Lowest Global Economies by Average Annual GDP Growth",
    subtitle = "Compound Annual Growth Rate (CAGR) from 2020 to latest projection",
    x = "Country",
    y = "Average Annual Growth Rate (CAGR)",
    fill = "Positive Growth"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Plotting the absolute GDP values of giant economies, like Japan, against tiny ones, like Guyana, on the same graph would make the graph unreadable. Instead, I will “index” the GDP by setting every country’s 2020 GDP to a baseline of “100”. This lets me compare their trajectories fairly over time.

# 2. Line Chart of Trajectories (Indexed to Start Year)
trajectory_data <- tidy_gdp %>%
  filter(Country %in% extreme_gdp$Country) %>%
  group_by(Country) %>%
  # Index GDP so the starting year for every country equals 100
  mutate(GDP_Index = (GDP / first(GDP)) * 100) %>%
  ungroup()

ggplot(trajectory_data, aes(x = Year, y = GDP_Index, color = Country)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "GDP Growth Trajectories (Indexed: Start Year = 100)",
    subtitle = "Comparing the 5 fastest and 5 slowest growing economies",
    x = "Year",
    y = "GDP Index (Base 100)",
    color = "Country"
  ) +
  theme_minimal()
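
The indexing step can be checked on a small example. The sketch below uses the 2020 and 2030 figures for Guyana and Japan from the summary table above, and it picks the base value by earliest year with `which.min(Year)` instead of `first(GDP)`. That change is an assumption on my part, made so the result does not depend on row order; the project code relies on the rows already being sorted by year.

```r
library(dplyr)

# Two countries, two years each, deliberately listed out of year order
toy_gdp <- tibble::tibble(
  Country = c("Guyana", "Guyana", "Japan", "Japan"),
  Year    = c(2030, 2020, 2020, 2030),
  GDP     = c(40275, 5471, 5054069, 5119885)
)

toy_index <- toy_gdp %>%
  group_by(Country) %>%
  # Base value = GDP in each country's earliest year, regardless of row order
  mutate(GDP_Index = GDP / GDP[which.min(Year)] * 100) %>%
  ungroup()

print(toy_index)
```

Guyana's 2030 index comes out at about 736 and Japan's at about 101, which puts both economies on one readable scale despite the size difference.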

Interpretation of Visualizations

The bar chart makes it immediately obvious how significantly the top 5 emerging economies outpace the bottom 5. Meanwhile, the indexed trajectory line chart highlights how far Guyana breaks from the pack. The bottom five countries finish near or below the baseline index of “100” and the other four fast growers end between roughly 340 and 400, while Guyana’s trajectory climbs far more steeply. By 2030, Guyana’s GDP is projected to be about 736% of its 2020 value (40,275 versus 5,471), visually confirming its status as the fastest-growing economy in this dataset. Conversely, Zimbabwe’s sharp downward slope reveals a severe projected decline, with 2030 GDP at roughly 16% of its 2020 value.