Project and Analysis Detail

Instructions

As you work with more data in R, you’ll begin to observe that most data requires preparation (dealing with NA values, data formatting, data transformation, etc) before it can be properly analyzed using the functions in R. This week’s challenge involves working with a messy data set, formatting it and using some functions from the dplyr library to summarize and group cases. Reference the dplyr “Data transformation cheat sheet”.

library(dplyr)

Download the a CSV file from the World Bank https://datacatalog.worldbank.org/ (select any data set you wish)

setwd("~/Desktop/mydata")
df <- read_csv("WDICSV.csv")

## Rows: 403256 Columns: 69
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Country Name, Country Code, Indicator Name, Indicator Code
## dbl (65): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df)

## # A tibble: 6 × 69
##   `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
##   <chr>           <chr>          <chr>            <chr>             <dbl>  <dbl>
## 1 Africa Eastern… AFE            Access to clean… EG.CFT.ACCS.ZS       NA     NA
## 2 Africa Eastern… AFE            Access to clean… EG.CFT.ACCS.RU.…     NA     NA
## 3 Africa Eastern… AFE            Access to clean… EG.CFT.ACCS.UR.…     NA     NA
## 4 Africa Eastern… AFE            Access to elect… EG.ELC.ACCS.ZS       NA     NA
## 5 Africa Eastern… AFE            Access to elect… EG.ELC.ACCS.RU.…     NA     NA
## 6 Africa Eastern… AFE            Access to elect… EG.ELC.ACCS.UR.…     NA     NA
## # ℹ 63 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
## #   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
## #   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
## #   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
## #   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
## #   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
## #   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

Initial Observations

Two things stand out:

The dataset has a lot of NAs. Handle them by eliminating OR filling values with column mean value.
The dataset is in “wide” format. In order to work with it, I should transform it to long format.

#change from wide to long format
#researching the gather function, I found that tidyr now recommends using pivot_longer in favor of gather
df_long <- pivot_longer(df, cols = "1960":"2024", names_to = "Year", values_to = "Value")
head(df_long)

## # A tibble: 6 × 6
##   `Country Name`    `Country Code` `Indicator Name` `Indicator Code` Year  Value
##   <chr>             <chr>          <chr>            <chr>            <chr> <dbl>
## 1 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1960     NA
## 2 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1961     NA
## 3 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1962     NA
## 4 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1963     NA
## 5 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1964     NA
## 6 Africa Eastern a… AFE            Access to clean… EG.CFT.ACCS.ZS   1965     NA

Pausing to review more about the dataset…

About the Dataset

The World Development Indicators (WDI) database, published by the World Bank, is a comprehensive collection of global development data, providing key economic, social, and environmental statistics. It includes almost 1,500 indicators covering more than 200 countries and territories, with data spanning several decades.

WDI serves as a vital resource for policymakers, researchers, businesses, and analysts seeking to understand global trends and make data-driven decisions. The database covers a wide range of topics, including economic growth, education, health, poverty, trade, energy, infrastructure, governance, and environmental sustainability.

The indicators are sourced from reputable national and international agencies, ensuring high-quality, consistent, and comparable data. Users can access the database through interactive online tools, API services, and downloadable datasets, facilitating detailed analysis and visualization.

WDI is also used for tracking progress on the Sustainable Development Goals (SDGs) and other global development initiatives. By providing accessible and reliable statistics, it helps to inform policy discussions and strategies globally.

Whether for academic research, policy planning, or economic analysis, the World Development Indicators database is an essential tool for understanding and addressing global development challenges.

Dataset Notes and Research

Growth values in the World Development Indicators (WDI) database appear to be relative to a country’s own historical values, not to other countries (eg. NOT compared to a global index). The growth figures show changes over time within a particular country (e.g., USA compared to itself over different years).

Tehnical Note

I was having trouble getting gather() to work. It did not seem to correctly collapse / transpose the year columns as desired.

Here was the code used (unsuccessfully) df_gather <- gather( df, #dataframe key=“Year”, #Name of new key column value=“Value”, #Name of new column for values 2000:2024) #columns to gather df_gather

Instructions (Cont’d)

*Explore the data and come up with a question to guide your analysis. Create a simple R Markdown document that formats the data. Use dplyr’s summarize, count, and group_by functions to produce a report that describes two key insights that answer your question. Then, use ggplot2’s functions to visualize one of your insights.

Filter for specific indicators: Your data frame contains many different indicators. You will likely want to filter for a specific indicator to analyze it.*

Organize the data

First, I’d like to just look at US data.

library(dplyr)
#practicing using piping instead of nesting
#put Country Code inside backticks for better handling (column names with spaces aren't interpreted well without them)
us_data <- filter(df_long, `Country Code` == "USA" )
us_data

## # A tibble: 98,540 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code` Year  Value
##    <chr>          <chr>          <chr>              <chr>            <chr> <dbl>
##  1 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1960     NA
##  2 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1961     NA
##  3 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1962     NA
##  4 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1963     NA
##  5 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1964     NA
##  6 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1965     NA
##  7 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1966     NA
##  8 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1967     NA
##  9 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1968     NA
## 10 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1969     NA
## # ℹ 98,530 more rows

Which Indicator has the largest value?

library(dplyr)
max_value <- max(us_data$Value, na.rm = TRUE)

max_row <- filter(us_data, Value == max_value)
max_row

## # A tibble: 1 × 6
##   `Country Name` `Country Code` `Indicator Name`  `Indicator Code` Year    Value
##   <chr>          <chr>          <chr>             <chr>            <chr>   <dbl>
## 1 United States  USA            Market capitaliz… CM.MKT.LCAP.CD   2024  6.22e13

market_cap_data <- filter(us_data, `Indicator Name` == "Market capitalization of listed domestic companies (current US$)")

# Find the minimum year with available data
min_year <- min(market_cap_data$Year, na.rm = TRUE)

min_year

## [1] "1960"

max_year <- max(market_cap_data$Year, na.rm = TRUE)

max_year

## [1] "2024"

First Insight

The result shows that “Market capitalization of listed domestic companies” is the largest Value in the US.

This indicator is in the dataset starting in 1960, and continues to be measured through the most recent year (2024).

I believe this indicates that US equities have a value, as of 2024, greater (in USD) than any other Indicator. This matches my “real world” understanding of US equities – the stock market is currently at or near an all time high.

Not satisfied with this answer, I’d like to find the next top Indicator for the US.

us_data2 <- us_data %>% filter(`Indicator Name` != "Market capitalization of listed domestic companies (current US$)")
us_data2

## # A tibble: 98,475 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code` Year  Value
##    <chr>          <chr>          <chr>              <chr>            <chr> <dbl>
##  1 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1960     NA
##  2 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1961     NA
##  3 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1962     NA
##  4 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1963     NA
##  5 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1964     NA
##  6 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1965     NA
##  7 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1966     NA
##  8 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1967     NA
##  9 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1968     NA
## 10 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1969     NA
## # ℹ 98,465 more rows

max_value2 <- max(us_data2$Value, na.rm = TRUE)
max_value2

## [1] 4.906326e+13

max_row2 <- filter(us_data2, Value == max_value2)
max_row2

## # A tibble: 1 × 6
##   `Country Name` `Country Code` `Indicator Name`  `Indicator Code` Year    Value
##   <chr>          <chr>          <chr>             <chr>            <chr>   <dbl>
## 1 United States  USA            Stocks traded, t… CM.MKT.TRAD.CD   2018  4.91e13

market_cap_data2 <- filter(us_data2, `Indicator Name` == "Stocks traded, total value (current US$)")
market_cap_data2

## # A tibble: 65 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code` Year  Value
##    <chr>          <chr>          <chr>              <chr>            <chr> <dbl>
##  1 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1960     NA
##  2 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1961     NA
##  3 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1962     NA
##  4 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1963     NA
##  5 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1964     NA
##  6 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1965     NA
##  7 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1966     NA
##  8 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1967     NA
##  9 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1968     NA
## 10 United States  USA            Stocks traded, to… CM.MKT.TRAD.CD   1969     NA
## # ℹ 55 more rows

# Find the minimum year with available data
min_year2 <- min(market_cap_data2$Year)

min_year2

## [1] "1960"

max_year2 <- max(market_cap_data2$Year)

max_year2

## [1] "2024"

The Indicator with the second-highest Value is “Stocks traded, total value (current US$).”

This indicates the value of stocks that are bought and sold (as opposed to total Market Cap of all US based equities).

What is the third largest indicator?

us_data3 <- us_data %>% filter(`Indicator Name` != "Market capitalization of listed domestic companies (current US$)" & `Indicator Name` != "Stocks traded, total value (current US$)", na.rm=TRUE)
us_data3

## # A tibble: 98,410 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code` Year  Value
##    <chr>          <chr>          <chr>              <chr>            <chr> <dbl>
##  1 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1960     NA
##  2 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1961     NA
##  3 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1962     NA
##  4 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1963     NA
##  5 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1964     NA
##  6 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1965     NA
##  7 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1966     NA
##  8 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1967     NA
##  9 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1968     NA
## 10 United States  USA            Access to clean f… EG.CFT.ACCS.ZS   1969     NA
## # ℹ 98,400 more rows

max_value3 <- max(us_data3$Value, na.rm = TRUE)
max_value3

## [1] 3.181105e+13

max_row3 <- filter(us_data3, Value == max_value3)
max_row3

## # A tibble: 1 × 6
##   `Country Name` `Country Code` `Indicator Name`  `Indicator Code` Year    Value
##   <chr>          <chr>          <chr>             <chr>            <chr>   <dbl>
## 1 United States  USA            Central governme… GC.DOD.TOTL.CN   2023  3.18e13

market_cap_data3 <- filter(us_data3, `Indicator Name` == "Central government debt, total (current LCU)")
market_cap_data3

## # A tibble: 65 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code` Year  Value
##    <chr>          <chr>          <chr>              <chr>            <chr> <dbl>
##  1 United States  USA            Central governmen… GC.DOD.TOTL.CN   1960     NA
##  2 United States  USA            Central governmen… GC.DOD.TOTL.CN   1961     NA
##  3 United States  USA            Central governmen… GC.DOD.TOTL.CN   1962     NA
##  4 United States  USA            Central governmen… GC.DOD.TOTL.CN   1963     NA
##  5 United States  USA            Central governmen… GC.DOD.TOTL.CN   1964     NA
##  6 United States  USA            Central governmen… GC.DOD.TOTL.CN   1965     NA
##  7 United States  USA            Central governmen… GC.DOD.TOTL.CN   1966     NA
##  8 United States  USA            Central governmen… GC.DOD.TOTL.CN   1967     NA
##  9 United States  USA            Central governmen… GC.DOD.TOTL.CN   1968     NA
## 10 United States  USA            Central governmen… GC.DOD.TOTL.CN   1969     NA
## # ℹ 55 more rows

# Find the minimum year with available data
min_year3 <- min(market_cap_data3$Year, na.rm = TRUE)

min_year3

## [1] "1960"

max_year3 <- max(market_cap_data3$Year, na.rm = TRUE)

max_year3

## [1] "2024"

US Central Government debt is the third largest Indicator by USD. This also matches my real world understanding the US debt continues to grow and is at an all time high.

Due to time constraints, I will cut my analysis here.

Plot the data

I will plot the growth over time of the three Indicators I looked at.

library(dplyr)
library(ggplot2)
library(scales)

plot_data <- us_data %>%
  filter(`Indicator Name` %in% c(
    "Market capitalization of listed domestic companies (current US$)",
    "Stocks traded, total value (current US$)",
    "Central government debt, total (current LCU)"
  ), na.rm=TRUE)

plot_data <- plot_data %>%
  mutate(Year = as.numeric(as.character(Year))) %>%
  filter(!is.na(Year))

plot_data

## # A tibble: 195 × 6
##    `Country Name` `Country Code` `Indicator Name`   `Indicator Code`  Year Value
##    <chr>          <chr>          <chr>              <chr>            <dbl> <dbl>
##  1 United States  USA            Central governmen… GC.DOD.TOTL.CN    1960    NA
##  2 United States  USA            Central governmen… GC.DOD.TOTL.CN    1961    NA
##  3 United States  USA            Central governmen… GC.DOD.TOTL.CN    1962    NA
##  4 United States  USA            Central governmen… GC.DOD.TOTL.CN    1963    NA
##  5 United States  USA            Central governmen… GC.DOD.TOTL.CN    1964    NA
##  6 United States  USA            Central governmen… GC.DOD.TOTL.CN    1965    NA
##  7 United States  USA            Central governmen… GC.DOD.TOTL.CN    1966    NA
##  8 United States  USA            Central governmen… GC.DOD.TOTL.CN    1967    NA
##  9 United States  USA            Central governmen… GC.DOD.TOTL.CN    1968    NA
## 10 United States  USA            Central governmen… GC.DOD.TOTL.CN    1969    NA
## # ℹ 185 more rows

ggplot(plot_data, aes(x = Year, y = Value, color = `Indicator Name`, group = `Indicator Name`)) +
  geom_line(size = 1.2) +
  geom_point() +
  scale_x_continuous(breaks = seq(1960, 2024, by = 10)) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Trends of Key Economic Indicators Over Time (USA)",
    x = "Year",
    y = "Value",
    color = "Indicator"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    legend.position = "bottom"
  )

## Warning: Removed 60 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 60 rows containing missing values or values outside the scale range
## (`geom_point()`).

Legend

Indicators by line color:

Red: Central government debt, total (current LCU)

Blue: Stocks traded, total value (current US$)

Green: Market capitalization of listed domestic companies (current US$)

Evaluating top US Economic Indicators over Time

Drew Wolin

2025-09-20

Summary

Insights