library(tidyverse)
library(ggplot2)
library(DT)

Overview

The objective of this project is to load, tidy, transform, and analyze a (potentially wide) dataset. The approach below starts off with loading World Bank GDP percent change data by country from a CSV. The initial data cleanup includes removing rows with all NAs for the GDP percent change columns, dropping unneeded columns, and renaming columns. Using the pivot_longer function the data is transformed from having columns for each year to having one year column and the associated values in the gdp_perc_change column. Then, data exploration by year is conducted, followed by data exploration by country. In both year and country analyses data is also subsetted to look specifically at the 21st century. The implications of the findings are discussed in the conclusion.

Loading the data

# Load GDP by country csv from Github URL
#df_gdp_raw <- read.csv(url("https://raw.githubusercontent.com/mattlucich/data-transformation/gdp/gdp-growth.csv"))

# Load GDP by country csv from Github repo (if you prefer)
df_gdp_raw <- read.csv("gdp-growth.csv")

# Preview the data
glimpse(df_gdp_raw)
## Rows: 264
## Columns: 64
## $ Country.Name   <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
## $ Country.Code   <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG"…
## $ Indicator.Name <chr> "GDP growth (annual %)", "GDP growth (annual %)", "GDP…
## $ Indicator.Code <chr> "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP…
## $ X1960          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ X1961          <dbl> NA, NA, NA, NA, NA, NA, NA, 5.4, NA, NA, NA, 2.5, 5.5,…
## $ X1962          <dbl> NA, NA, NA, NA, NA, NA, NA, -0.9, NA, NA, NA, 1.3, 2.6…
## $ X1963          <dbl> NA, NA, NA, NA, NA, NA, NA, -5.3, NA, NA, NA, 6.2, 4.1…
## $ X1964          <dbl> NA, NA, NA, NA, NA, NA, NA, 10.1, NA, NA, NA, 7.0, 6.1…
## $ X1965          <dbl> NA, NA, NA, NA, NA, NA, NA, 10.6, NA, NA, NA, 6.0, 3.5…
## $ X1966          <dbl> NA, NA, NA, NA, NA, NA, NA, -0.7, NA, NA, NA, 2.4, 5.6…
## $ X1967          <dbl> NA, NA, NA, NA, NA, NA, NA, 3.2, NA, NA, NA, 6.3, 3.0,…
## $ X1968          <dbl> NA, NA, NA, NA, NA, NA, NA, 4.8, NA, NA, NA, 5.1, 4.5,…
## $ X1969          <dbl> NA, NA, NA, NA, NA, NA, NA, 9.7, NA, NA, NA, 7.0, 6.3,…
## $ X1970          <dbl> NA, NA, NA, NA, NA, NA, NA, 3.0, NA, NA, NA, 7.2, 6.3,…
## $ X1971          <dbl> NA, NA, NA, NA, 4.6, NA, NA, 5.7, NA, NA, NA, 4.0, 5.1…
## $ X1972          <dbl> NA, NA, NA, NA, 8.1, NA, NA, 1.6, NA, NA, NA, 3.9, 6.2…
## $ X1973          <dbl> NA, NA, NA, NA, 7.8, NA, NA, 2.8, NA, NA, NA, 2.6, 4.9…
## $ X1974          <dbl> NA, NA, NA, NA, 5.6, NA, NA, 5.5, NA, NA, NA, 4.1, 3.9…
## $ X1975          <dbl> NA, NA, NA, NA, 0.5, NA, NA, 0.0, NA, NA, NA, 1.3, -0.…
## $ X1976          <dbl> NA, NA, NA, NA, 3.3, 15.7, 16.5, -2.0, NA, NA, NA, 2.6…
## $ X1977          <dbl> NA, NA, NA, NA, 2.8, 8.2, 21.4, 6.9, NA, NA, NA, 3.6, …
## $ X1978          <dbl> NA, NA, NA, NA, 1.5, -0.7, -1.6, -4.5, NA, NA, 4.4, 0.…
## $ X1979          <dbl> NA, NA, NA, NA, 0.0, 11.7, 20.9, 10.2, NA, NA, 8.1, 4.…
## $ X1980          <dbl> NA, NA, NA, NA, 2.2, 9.1, 23.9, 1.5, NA, NA, 8.2, 3.0,…
## $ X1981          <dbl> NA, NA, -4.4, 5.7, -0.1, 3.0, 4.7, -5.2, NA, NA, 3.8, …
## $ X1982          <dbl> NA, NA, 0.0, 2.9, 1.2, -9.1, -6.7, -0.7, NA, NA, -0.1,…
## $ X1983          <dbl> NA, NA, 4.2, 1.1, 1.8, -6.9, -4.7, 4.3, NA, NA, 5.4, -…
## $ X1984          <dbl> NA, NA, 6.0, -1.3, 1.8, 1.3, 4.0, 1.6, NA, NA, 10.2, 4…
## $ X1985          <dbl> NA, NA, 3.5, 1.8, 2.3, -2.4, -3.6, -5.2, NA, NA, 7.6, …
## $ X1986          <dbl> NA, NA, 2.9, 5.6, 3.3, 4.7, -15.0, 6.2, NA, NA, 11.5, …
## $ X1987          <dbl> 16.1, NA, 4.1, -0.8, 5.5, -0.5, 3.4, 2.7, NA, NA, 6.6,…
## $ X1988          <dbl> 18.6, NA, 6.1, -1.4, 5.1, 5.7, -2.6, -1.1, NA, NA, 5.2…
## $ X1989          <dbl> 12.1, NA, 0.0, 9.8, 4.8, 2.4, 12.3, -7.2, NA, NA, 5.3,…
## $ X1990          <dbl> 4.0, NA, -3.5, -9.6, 3.8, 13.1, 18.3, -2.5, NA, NA, 3.…
## $ X1991          <dbl> 8.0, NA, 1.0, -28.0, 2.5, 1.5, 0.9, 9.1, -11.7, NA, 2.…
## $ X1992          <dbl> 5.9, NA, -5.8, -7.2, 0.9, 4.9, 3.3, 7.9, -41.8, NA, 1.…
## $ X1993          <dbl> 7.3, NA, -24.0, 9.6, -1.0, 3.3, 1.3, 8.2, -8.8, NA, 5.…
## $ X1994          <dbl> 8.2, NA, 1.3, 8.3, 2.4, 3.2, 6.9, 5.8, 5.4, NA, 6.7, 4…
## $ X1995          <dbl> 2.5, NA, 15.0, 13.3, 2.8, 2.7, 6.7, -2.8, 6.9, NA, -4.…
## $ X1996          <dbl> 1.2, NA, 13.5, 9.1, 4.6, 4.6, 5.8, 5.5, 5.9, NA, 6.6, …
## $ X1997          <dbl> 7.0, NA, 7.3, -10.9, 9.1, 4.2, 8.2, 8.1, 3.3, NA, 5.5,…
## $ X1998          <dbl> 2.0, NA, 4.7, 8.8, 3.2, 5.3, 0.3, 3.9, 7.3, NA, 4.7, 4…
## $ X1999          <dbl> 1.2, NA, 2.2, 12.9, 4.1, 1.8, 2.9, -3.4, 3.3, NA, 3.7,…
## $ X2000          <dbl> 7.6, NA, 3.1, 7.0, 3.5, 5.5, 10.9, -0.8, 5.9, NA, 6.2,…
## $ X2001          <dbl> -3.0, NA, 4.2, 8.3, 8.1, 1.6, 1.4, -4.4, 9.6, NA, -4.5…
## $ X2002          <dbl> -3.3, NA, 13.7, 4.5, 4.5, 0.6, 2.4, -10.9, 13.2, NA, 1…
## $ X2003          <dbl> 2.0, 8.8, 3.0, 5.5, 8.7, 5.3, 8.8, 8.8, 14.0, 0.8, 6.1…
## $ X2004          <dbl> 7.9, 1.4, 11.0, 5.5, 8.1, 9.4, 9.6, 9.0, 10.5, 0.5, 5.…
## $ X2005          <dbl> 1.2, 11.2, 15.0, 5.5, 5.4, 5.7, 4.9, 8.9, 13.9, -0.4, …
## $ X2006          <dbl> 1.1, 5.4, 11.5, 5.9, 4.8, 6.5, 9.8, 8.0, 13.2, -4.2, 1…
## $ X2007          <dbl> 1.8, 13.8, 14.0, 6.0, 1.6, 4.5, 3.2, 9.0, 13.7, 2.0, 9…
## $ X2008          <dbl> -0.1, 3.9, 11.2, 7.5, -5.6, 5.8, 3.2, 4.1, 6.9, -2.6, …
## $ X2009          <dbl> -10.5, 21.4, 0.9, 3.4, -5.3, 0.4, -5.2, -5.9, -14.1, -…
## $ X2010          <dbl> -3.7, 14.4, 4.4, 3.7, -2.0, 4.7, 1.6, 10.1, 2.2, 0.4, …
## $ X2011          <dbl> 3.4, 0.4, 3.5, 2.5, 0.0, 3.7, 6.9, 6.0, 4.7, 0.3, -2.0…
## $ X2012          <dbl> -1.4, 12.8, 8.5, 1.4, -5.0, 6.7, 4.5, -1.0, 7.2, -4.4,…
## $ X2013          <dbl> 4.2, 5.6, 5.0, 1.0, -3.5, 3.3, 5.1, 2.4, 3.3, -2.8, -0…
## $ X2014          <dbl> 0.3, 2.7, 4.8, 1.8, 2.5, 2.5, 4.3, -2.5, 3.6, 0.9, 3.8…
## $ X2015          <dbl> 5.7, 1.5, 0.9, 2.2, 1.4, 3.2, 5.1, 2.7, 3.2, 1.4, 3.8,…
## $ X2016          <dbl> 2.1, 2.3, -2.6, 3.3, 3.7, 3.5, 3.1, -2.1, 0.2, -2.8, 5…
## $ X2017          <dbl> 2.0, 2.6, -0.1, 3.8, 0.3, 0.8, 2.4, 2.8, 7.5, -5.9, 3.…
## $ X2018          <dbl> NA, 1.2, -2.0, 4.1, 1.6, 2.1, 1.2, -2.6, 5.2, 2.2, 7.4…
## $ X2019          <dbl> NA, 3.9, -0.6, 2.2, 1.8, 1.5, 1.7, -2.1, 7.6, NA, 4.7,…

Cleaning the data

# Filter out rows with only NAs for GDP values
df_gdp_raw <- df_gdp_raw[rowSums(is.na(select(df_gdp_raw, X1960:X2019))) 
                         != ncol(select(df_gdp_raw, X1960:X2019)),]

# Drop unneeded columns
df_gdp_raw <- subset(df_gdp_raw, select = -c(Indicator.Code, Indicator.Name))

# Rename columns
df_gdp_raw <- df_gdp_raw %>% 
            rename(
              country_name = "Country.Name",
              country_code = "Country.Code"
              )

Transforming the data

# Gather year columns into one column "year" and their values to "gdp_perc_change"
df_gdp_long <- df_gdp_raw %>% 
                        pivot_longer(
                          cols = X1960:X2019, 
                          names_to = "year", 
                          values_to = "gdp_perc_change",
                          values_drop_na = TRUE
                        )

# Remove the extraneous "X" in the year values
df_gdp_long <- df_gdp_long %>% mutate(year = sub("X", "", year))

# Preview the long format
datatable(df_gdp_long, caption = "GDP by Year (long format)")

GDP % change by year analysis

# Group by year and sum of GDP % change
df_year_change <- df_gdp_long %>% group_by(year) %>% summarise(total = sum(as.numeric(gdp_perc_change)))

# Scatterplot: Any trend of sum of GDP % change over the years?
ggplot(df_year_change, aes(x=year, y=total)) + 
  scale_x_discrete(breaks=c(1970, 1980, 1990, 2000, 2010)) +
  geom_point(color='turquoise3') + scale_color_brewer(palette="Accent") + 
  theme_minimal() + ggtitle("Cross Country Sum of GDP % Change by Year") + 
  ylab("Sum of Countries GDP % Changes") + xlab("Year")

# Largest change years
df_largest_growth <- df_year_change %>% arrange(desc(total))
df_largest_growth_top_ten <- slice(df_largest_growth, n=1:10)

datatable(df_largest_growth_top_ten, caption = "10 largest GDP % change years")
# Histogram: largest GDP % change years
ggplot(df_largest_growth_top_ten, aes(x=year, weights=total)) + 
  geom_bar(fill="green4") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total, y=total), vjust=1.5, color="white", size=2.2) +
  ggtitle("Top 10 GDP % Change Years") + 
  ylab("Sum of Countries GDP % Changes") + xlab("Year")

# Lowest change years
df_lowest_growth <- df_year_change %>% arrange(total)
df_lowest_growth_top_ten <- slice(df_lowest_growth, n=1:10)

datatable(df_lowest_growth_top_ten, caption = "10 lowest GDP % change years across countries")
# Histogram: lowest GDP % change years (red does not indicate negative values)
ggplot(df_lowest_growth_top_ten, aes(x=year, weights=total)) + 
  geom_bar(fill="red") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total, y=total), vjust=1.5, color="white", size=2.5) +
  ggtitle("Bottom 10 GDP % Change Years") + 
  ylab("Sum of Countries GDP % Changes") + xlab("Year")

# Filter for only the 21st century
df_21_cent_change <- df_year_change %>% filter(year >= 2001)

datatable(df_21_cent_change, caption = "Sum of GDP % change by year for 21st century")
# Histogram: sum of GDP % change by year for 21st century
ggplot(df_21_cent_change, aes(x=year, weights=total)) + 
  geom_bar(fill="turquoise3") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total, y=total), vjust=1.5, color="white", size=2.2) +
  ggtitle("21st Century GDP % Change") + 
  ylab("Sum of Countries GDP % Changes") + xlab("Year")

GDP % change by country analysis

# Group by country and sum total GDP % change
df_country_change <- df_gdp_long %>% 
                      group_by(country_name) %>% 
                      summarise(total = sum(as.numeric(gdp_perc_change)))

# Largest change countries
df_largest_growth_countries <- df_country_change %>% arrange(desc(total))
df_largest_growth_countries_top_ten <- slice(df_largest_growth_countries, n=1:10)

datatable(df_largest_growth_countries_top_ten, caption = "Largest GDP % change countries")
# Histogram: largest GDP % change countries
ggplot(df_largest_growth_countries_top_ten, aes(x=reorder(country_name, -total), weights=total)) + 
  scale_x_discrete(label = function(country_name) stringr::str_trunc(country_name, 23)) +
  geom_bar(fill="green4") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total, y=total), vjust=1.5, color="white", size=2.2) +
  ggtitle("Top 10 GDP % Change Countries") + ylab("Sum of GDP % Changes") + xlab("Country")

# Lowest change countries
df_lowest_growth_countries <- df_country_change %>% arrange(total)
df_lowest_growth_countries_top_ten <- slice(df_lowest_growth_countries, n=1:10)

datatable(df_lowest_growth_countries_top_ten, caption = "10 lowest GDP % change countries")
# Histogram: lowest GDP % change countries
ggplot(df_lowest_growth_countries_top_ten, aes(x=reorder(country_name, total), weights=total)) + 
  geom_bar(fill="red") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total, y=total), vjust=1.5, color="black", size=2.2) +
  ggtitle("Bottom 10 GDP % Change Countries") + ylab("Sum of GDP % Changes") + xlab("Country")

# Group by year and country, sum total GDP % change
df_country_year_change <- df_gdp_long %>% 
                            group_by(country_name, year) %>% 
                            summarise(total = sum(as.numeric(gdp_perc_change)))

# Filter for only the 21st century
df_21_cent_country_year_change <- df_country_year_change %>% filter(year >= 2001)

# Group by country and sum total GDP % change
df_21_cent_country_year_change <- df_21_cent_country_year_change %>% 
                      group_by(country_name) %>% 
                      summarise(total_gdp = sum(as.numeric(total)))

df_21_cent_large_growth <- df_21_cent_country_year_change %>% arrange(desc(total_gdp))
df_21_cent_large_growth_top_ten <- slice(df_21_cent_large_growth, n=1:10)

datatable(df_21_cent_large_growth_top_ten, caption = "10 largest GDP % change countries 21st Century")
# Histogram: sum of GDP % change by country for 21st century
ggplot(df_21_cent_large_growth_top_ten, aes(x=reorder(country_name, -total_gdp), weights=total_gdp)) + 
  scale_x_discrete(label = function(country_name) stringr::str_trunc(country_name, 23)) +
  geom_bar(fill="turquoise3") + theme(axis.text.x = element_text(angle = 90)) +
  geom_text(aes(label=total_gdp, y=total_gdp), vjust=1.5, color="white", size=2.2) +
  ggtitle("Top 10 21st Century GDP % Change Countries") + 
  ylab("Sum of Countries GDP % Changes") + xlab("Country")

Conclusion

The analysis sections are a start to an interesting look at what countries are seeing the largest and lowest GDP growth over time as well as what years saw the largest and lowest GDP changes overall. We cannot infer much by only looking at percent change figures, which unfortunately was the only metric included in this dataset. However, we do have a better idea of what countries to research if we were interested in hypothesizing causes for positive and negative GDP changes. Additionally, we can zero in on what years to research to find factors for maximizing global GDP growth.

References

The World Bank (2020) "GDP growth (annual %)"


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.