Indicators by line color, full name:
As you work with more data in R, you’ll begin to observe that most data requires preparation (dealing with NA values, data formatting, data transformation, etc) before it can be properly analyzed using the functions in R. This week’s challenge involves working with a messy data set, formatting it and using some functions from the dplyr library to summarize and group cases. Reference the dplyr “Data transformation cheat sheet”.
library(dplyr)
Download the a CSV file from the World Bank https://datacatalog.worldbank.org/ (select any data set you wish)
setwd("~/Desktop/mydata")
df <- read_csv("WDICSV.csv")
## Rows: 403256 Columns: 69
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Country Name, Country Code, Indicator Name, Indicator Code
## dbl (65): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 69
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Africa Eastern… AFE Access to clean… EG.CFT.ACCS.ZS NA NA
## 2 Africa Eastern… AFE Access to clean… EG.CFT.ACCS.RU.… NA NA
## 3 Africa Eastern… AFE Access to clean… EG.CFT.ACCS.UR.… NA NA
## 4 Africa Eastern… AFE Access to elect… EG.ELC.ACCS.ZS NA NA
## 5 Africa Eastern… AFE Access to elect… EG.ELC.ACCS.RU.… NA NA
## 6 Africa Eastern… AFE Access to elect… EG.ELC.ACCS.UR.… NA NA
## # ℹ 63 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
## # `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
## # `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
## # `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
## # `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
## # `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
## # `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
Two things stand out:
The dataset has a lot of NAs. Handle them by eliminating OR filling values with column mean value.
The dataset is in “wide” format. In order to work with it, I should transform it to long format.
#change from wide to long format
#researching the gather function, I found that tidyr now recommends using pivot_longer in favor of gather
df_long <- pivot_longer(df, cols = "1960":"2024", names_to = "Year", values_to = "Value")
head(df_long)
## # A tibble: 6 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1960 NA
## 2 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1961 NA
## 3 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1962 NA
## 4 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1963 NA
## 5 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1964 NA
## 6 Africa Eastern a… AFE Access to clean… EG.CFT.ACCS.ZS 1965 NA
Pausing to review more about the dataset…
The World Development Indicators (WDI) database, published by the World Bank, is a comprehensive collection of global development data, providing key economic, social, and environmental statistics. It includes almost 1,500 indicators covering more than 200 countries and territories, with data spanning several decades.
WDI serves as a vital resource for policymakers, researchers, businesses, and analysts seeking to understand global trends and make data-driven decisions. The database covers a wide range of topics, including economic growth, education, health, poverty, trade, energy, infrastructure, governance, and environmental sustainability.
The indicators are sourced from reputable national and international agencies, ensuring high-quality, consistent, and comparable data. Users can access the database through interactive online tools, API services, and downloadable datasets, facilitating detailed analysis and visualization.
WDI is also used for tracking progress on the Sustainable Development Goals (SDGs) and other global development initiatives. By providing accessible and reliable statistics, it helps to inform policy discussions and strategies globally.
Whether for academic research, policy planning, or economic analysis, the World Development Indicators database is an essential tool for understanding and addressing global development challenges.
Growth values in the World Development Indicators (WDI) database appear to be relative to a country’s own historical values, not to other countries (eg. NOT compared to a global index). The growth figures show changes over time within a particular country (e.g., USA compared to itself over different years).
I was having trouble getting gather() to work. It did not seem to correctly collapse / transpose the year columns as desired.
Here was the code used (unsuccessfully) df_gather <- gather( df,
#dataframe key=“Year”, #Name of new key column value=“Value”, #Name of
new column for values 2000:2024) #columns to
gather df_gather
*Explore the data and come up with a question to guide your analysis. Create a simple R Markdown document that formats the data. Use dplyr’s summarize, count, and group_by functions to produce a report that describes two key insights that answer your question. Then, use ggplot2’s functions to visualize one of your insights.
Filter for specific indicators: Your data frame contains many different indicators. You will likely want to filter for a specific indicator to analyze it.*
First, I’d like to just look at US data.
library(dplyr)
#practicing using piping instead of nesting
#put Country Code inside backticks for better handling (column names with spaces aren't interpreted well without them)
us_data <- filter(df_long, `Country Code` == "USA" )
us_data
## # A tibble: 98,540 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Access to clean f… EG.CFT.ACCS.ZS 1960 NA
## 2 United States USA Access to clean f… EG.CFT.ACCS.ZS 1961 NA
## 3 United States USA Access to clean f… EG.CFT.ACCS.ZS 1962 NA
## 4 United States USA Access to clean f… EG.CFT.ACCS.ZS 1963 NA
## 5 United States USA Access to clean f… EG.CFT.ACCS.ZS 1964 NA
## 6 United States USA Access to clean f… EG.CFT.ACCS.ZS 1965 NA
## 7 United States USA Access to clean f… EG.CFT.ACCS.ZS 1966 NA
## 8 United States USA Access to clean f… EG.CFT.ACCS.ZS 1967 NA
## 9 United States USA Access to clean f… EG.CFT.ACCS.ZS 1968 NA
## 10 United States USA Access to clean f… EG.CFT.ACCS.ZS 1969 NA
## # ℹ 98,530 more rows
Which Indicator has the largest value?
library(dplyr)
max_value <- max(us_data$Value, na.rm = TRUE)
max_row <- filter(us_data, Value == max_value)
max_row
## # A tibble: 1 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Market capitaliz… CM.MKT.LCAP.CD 2024 6.22e13
market_cap_data <- filter(us_data, `Indicator Name` == "Market capitalization of listed domestic companies (current US$)")
# Find the minimum year with available data
min_year <- min(market_cap_data$Year, na.rm = TRUE)
min_year
## [1] "1960"
max_year <- max(market_cap_data$Year, na.rm = TRUE)
max_year
## [1] "2024"
The result shows that “Market capitalization of listed domestic companies” is the largest Value in the US.
This indicator is in the dataset starting in 1960, and continues to be measured through the most recent year (2024).
I believe this indicates that US equities have a value, as of 2024, greater (in USD) than any other Indicator. This matches my “real world” understanding of US equities – the stock market is currently at or near an all time high.
Not satisfied with this answer, I’d like to find the next top Indicator for the US.
us_data2 <- us_data %>% filter(`Indicator Name` != "Market capitalization of listed domestic companies (current US$)")
us_data2
## # A tibble: 98,475 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Access to clean f… EG.CFT.ACCS.ZS 1960 NA
## 2 United States USA Access to clean f… EG.CFT.ACCS.ZS 1961 NA
## 3 United States USA Access to clean f… EG.CFT.ACCS.ZS 1962 NA
## 4 United States USA Access to clean f… EG.CFT.ACCS.ZS 1963 NA
## 5 United States USA Access to clean f… EG.CFT.ACCS.ZS 1964 NA
## 6 United States USA Access to clean f… EG.CFT.ACCS.ZS 1965 NA
## 7 United States USA Access to clean f… EG.CFT.ACCS.ZS 1966 NA
## 8 United States USA Access to clean f… EG.CFT.ACCS.ZS 1967 NA
## 9 United States USA Access to clean f… EG.CFT.ACCS.ZS 1968 NA
## 10 United States USA Access to clean f… EG.CFT.ACCS.ZS 1969 NA
## # ℹ 98,465 more rows
max_value2 <- max(us_data2$Value, na.rm = TRUE)
max_value2
## [1] 4.906326e+13
max_row2 <- filter(us_data2, Value == max_value2)
max_row2
## # A tibble: 1 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Stocks traded, t… CM.MKT.TRAD.CD 2018 4.91e13
market_cap_data2 <- filter(us_data2, `Indicator Name` == "Stocks traded, total value (current US$)")
market_cap_data2
## # A tibble: 65 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1960 NA
## 2 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1961 NA
## 3 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1962 NA
## 4 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1963 NA
## 5 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1964 NA
## 6 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1965 NA
## 7 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1966 NA
## 8 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1967 NA
## 9 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1968 NA
## 10 United States USA Stocks traded, to… CM.MKT.TRAD.CD 1969 NA
## # ℹ 55 more rows
# Find the minimum year with available data
min_year2 <- min(market_cap_data2$Year)
min_year2
## [1] "1960"
max_year2 <- max(market_cap_data2$Year)
max_year2
## [1] "2024"
The Indicator with the second-highest Value is “Stocks traded, total value (current US$).”
This indicates the value of stocks that are bought and sold (as opposed to total Market Cap of all US based equities).
What is the third largest indicator?
us_data3 <- us_data %>% filter(`Indicator Name` != "Market capitalization of listed domestic companies (current US$)" & `Indicator Name` != "Stocks traded, total value (current US$)", na.rm=TRUE)
us_data3
## # A tibble: 98,410 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Access to clean f… EG.CFT.ACCS.ZS 1960 NA
## 2 United States USA Access to clean f… EG.CFT.ACCS.ZS 1961 NA
## 3 United States USA Access to clean f… EG.CFT.ACCS.ZS 1962 NA
## 4 United States USA Access to clean f… EG.CFT.ACCS.ZS 1963 NA
## 5 United States USA Access to clean f… EG.CFT.ACCS.ZS 1964 NA
## 6 United States USA Access to clean f… EG.CFT.ACCS.ZS 1965 NA
## 7 United States USA Access to clean f… EG.CFT.ACCS.ZS 1966 NA
## 8 United States USA Access to clean f… EG.CFT.ACCS.ZS 1967 NA
## 9 United States USA Access to clean f… EG.CFT.ACCS.ZS 1968 NA
## 10 United States USA Access to clean f… EG.CFT.ACCS.ZS 1969 NA
## # ℹ 98,400 more rows
max_value3 <- max(us_data3$Value, na.rm = TRUE)
max_value3
## [1] 3.181105e+13
max_row3 <- filter(us_data3, Value == max_value3)
max_row3
## # A tibble: 1 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Central governme… GC.DOD.TOTL.CN 2023 3.18e13
market_cap_data3 <- filter(us_data3, `Indicator Name` == "Central government debt, total (current LCU)")
market_cap_data3
## # A tibble: 65 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 United States USA Central governmen… GC.DOD.TOTL.CN 1960 NA
## 2 United States USA Central governmen… GC.DOD.TOTL.CN 1961 NA
## 3 United States USA Central governmen… GC.DOD.TOTL.CN 1962 NA
## 4 United States USA Central governmen… GC.DOD.TOTL.CN 1963 NA
## 5 United States USA Central governmen… GC.DOD.TOTL.CN 1964 NA
## 6 United States USA Central governmen… GC.DOD.TOTL.CN 1965 NA
## 7 United States USA Central governmen… GC.DOD.TOTL.CN 1966 NA
## 8 United States USA Central governmen… GC.DOD.TOTL.CN 1967 NA
## 9 United States USA Central governmen… GC.DOD.TOTL.CN 1968 NA
## 10 United States USA Central governmen… GC.DOD.TOTL.CN 1969 NA
## # ℹ 55 more rows
# Find the minimum year with available data
min_year3 <- min(market_cap_data3$Year, na.rm = TRUE)
min_year3
## [1] "1960"
max_year3 <- max(market_cap_data3$Year, na.rm = TRUE)
max_year3
## [1] "2024"
US Central Government debt is the third largest Indicator by USD. This also matches my real world understanding the US debt continues to grow and is at an all time high.
Due to time constraints, I will cut my analysis here.
I will plot the growth over time of the three Indicators I looked at.
library(dplyr)
library(ggplot2)
library(scales)
plot_data <- us_data %>%
filter(`Indicator Name` %in% c(
"Market capitalization of listed domestic companies (current US$)",
"Stocks traded, total value (current US$)",
"Central government debt, total (current LCU)"
), na.rm=TRUE)
plot_data <- plot_data %>%
mutate(Year = as.numeric(as.character(Year))) %>%
filter(!is.na(Year))
plot_data
## # A tibble: 195 × 6
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` Year Value
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 United States USA Central governmen… GC.DOD.TOTL.CN 1960 NA
## 2 United States USA Central governmen… GC.DOD.TOTL.CN 1961 NA
## 3 United States USA Central governmen… GC.DOD.TOTL.CN 1962 NA
## 4 United States USA Central governmen… GC.DOD.TOTL.CN 1963 NA
## 5 United States USA Central governmen… GC.DOD.TOTL.CN 1964 NA
## 6 United States USA Central governmen… GC.DOD.TOTL.CN 1965 NA
## 7 United States USA Central governmen… GC.DOD.TOTL.CN 1966 NA
## 8 United States USA Central governmen… GC.DOD.TOTL.CN 1967 NA
## 9 United States USA Central governmen… GC.DOD.TOTL.CN 1968 NA
## 10 United States USA Central governmen… GC.DOD.TOTL.CN 1969 NA
## # ℹ 185 more rows
ggplot(plot_data, aes(x = Year, y = Value, color = `Indicator Name`, group = `Indicator Name`)) +
geom_line(size = 1.2) +
geom_point() +
scale_x_continuous(breaks = seq(1960, 2024, by = 10)) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Trends of Key Economic Indicators Over Time (USA)",
x = "Year",
y = "Value",
color = "Indicator"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
legend.position = "bottom"
)
## Warning: Removed 60 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 60 rows containing missing values or values outside the scale range
## (`geom_point()`).
Indicators by line color:
Red: Central government debt, total (current LCU)
Blue: Stocks traded, total value (current US$)
Green: Market capitalization of listed domestic companies (current US$)
Since 2008, the Value from Stocks traded (blue) has been basically flat.
Since 2012, Market capitalization of listed domestic companies (green) has skyrocketed.
Combing insights from the blue and green lines, we can conclude that investors are buying and holding positions, compared to the period prior to 2008, which there was far more trading activity.
We also see Central government debt has climbed steadily since the start (red) and is at an all time high.
We can also see notable changes to slope (debt) around 2000, again in 2008, and again around 2019. Each time, the debt climbs more-sharply than prior periods.
Publish your R Markdown document to RPub’s (https://rpubs.com/) and share your link in this discussion form (Title your post the name with the question you used to guide your exploration) Then, take a moment to read the posts of your classmates AND respond to one of your classmates with one thing you found interesting about the insights highlighted by your classmate.