GDP

Author

Guibril Ramde

Approach

In this project, I selected the Worldwide GDP dataset from Discussion 5A. The dataset is not initially structured in a tidy format because the GDP values for different years are stored across multiple columns rather than in a single variable column. Therefore, the first step of this analysis will be to clean and restructure the dataset using tidy data principles.

Using tools from the tidyverse package, I will transform the dataset into a tidy format where each variable has its own column, each observation corresponds to a single row, and each type of observational unit forms a table. This process will involve reshaping the data so that the year values become a single column and the GDP values become another column.

After the dataset has been tidied, I will perform an exploratory analysis to examine economic patterns across countries. In particular, I will analyze how GDP trends evolve over time for different countries and investigate possible relationships with factors such as population growth and birth rates. I will also explore how these economic indicators have changed over the past decade and identify notable global or regional patterns.

Finally, I will use visualization tools such as ggplot2 to present the results and highlight key insights from the data.

library(dplyr)

Warning: package 'dplyr' was built under R version 4.5.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)

Warning: package 'tidyr' was built under R version 4.5.2

library(readr)

Warning: package 'readr' was built under R version 4.5.2

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ purrr     1.2.1
✔ ggplot2   4.0.2     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.3.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

set.seed(123)

load_data <- "https://raw.githubusercontent.com/japhet125/Project2-Data-Science/refs/heads/main/GDP.csv"
get_all_data <- read_csv(load_data)

Rows: 266 Columns: 65
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country, Country Code
dbl (63): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

get_all_data

# A tibble: 266 × 65
   Country  `Country Code`   `1960`   `1961`   `1962`   `1963`   `1964`   `1965`
   <chr>    <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Aruba    ABW            NA       NA       NA       NA       NA       NA      
 2 Africa … AFE             2.11e10  2.16e10  2.35e10  2.80e10  2.59e10  2.95e10
 3 Afghani… AFG             5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8  1.01e 9
 4 Africa … AFW             1.04e10  1.12e10  1.20e10  1.27e10  1.39e10  1.49e10
 5 Angola   AGO            NA       NA       NA       NA       NA       NA      
 6 Albania  ALB            NA       NA       NA       NA       NA       NA      
 7 Andorra  AND            NA       NA       NA       NA       NA       NA      
 8 Arab Wo… ARB            NA       NA       NA       NA       NA       NA      
 9 United … ARE            NA       NA       NA       NA       NA       NA      
10 Argenti… ARG            NA       NA        2.45e10  1.83e10  2.56e10  2.83e10
# ℹ 256 more rows
# ℹ 57 more variables: `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, …

colnames(get_all_data)

 [1] "Country"      "Country Code" "1960"         "1961"         "1962"        
 [6] "1963"         "1964"         "1965"         "1966"         "1967"        
[11] "1968"         "1969"         "1970"         "1971"         "1972"        
[16] "1973"         "1974"         "1975"         "1976"         "1977"        
[21] "1978"         "1979"         "1980"         "1981"         "1982"        
[26] "1983"         "1984"         "1985"         "1986"         "1987"        
[31] "1988"         "1989"         "1990"         "1991"         "1992"        
[36] "1993"         "1994"         "1995"         "1996"         "1997"        
[41] "1998"         "1999"         "2000"         "2001"         "2002"        
[46] "2003"         "2004"         "2005"         "2006"         "2007"        
[51] "2008"         "2009"         "2010"         "2011"         "2012"        
[56] "2013"         "2014"         "2015"         "2016"         "2017"        
[61] "2018"         "2019"         "2020"         "2021"         "2022"

Data Source

The dataset originates from the World Bank and was obtained from Kaggle.
It contains GDP statistics (GDP in US dollars) for countries and regional aggregates
around the world from 1960 to 2022.

The dataset is stored as GDP.csv and is read above from a GitHub repository URL.

Showing the raw structure

glimpse(get_all_data)

Rows: 266
Columns: 65
$ Country        <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ `Country Code` <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ `1960`         <dbl> NA, 21125015452, 537777811, 10447637853, NA, NA, NA, NA…
$ `1961`         <dbl> NA, 21616228139, 548888896, 11173212080, NA, NA, NA, NA…
$ `1962`         <dbl> NA, 23506279900, 546666678, 11990534018, NA, NA, NA, NA…
$ `1963`         <dbl> NA, 28048360188, 751111191, 12727688165, NA, NA, NA, NA…
$ `1964`         <dbl> NA, 25920665260, 800000044, 13898109284, NA, NA, NA, NA…
$ `1965`         <dbl> NA, 29472103270, 1006666638, 14929792388, NA, NA, NA, N…
$ `1966`         <dbl> NA, 32014368121, 1399999967, 15910837742, NA, NA, NA, N…
$ `1967`         <dbl> NA, 33269509510, 1673333418, 14510579889, NA, NA, NA, N…
$ `1968`         <dbl> NA, 36327785495, 1373333367, 14968235782, NA, NA, NA, 3…
$ `1969`         <dbl> NA, 41638967621, 1408888922, 16979315745, NA, NA, NA, 3…
$ `1970`         <dbl> NA, 44629891649, 1748886596, 23596163865, NA, NA, 78617…
$ `1971`         <dbl> NA, 49173371529, 1831108971, 20936358634, NA, NA, 89406…
$ `1972`         <dbl> NA, 53123459912, 1595555476, 25386169423, NA, NA, 11341…
$ `1973`         <dbl> NA, 69482723444, 1733333264, 31975594565, NA, NA, 15084…
$ `1974`         <dbl> NA, 85380645042, 2155555498, 44416677335, NA, NA, 18655…
$ `1975`         <dbl> NA, 90835426418, 2366666616, 51667190242, NA, NA, 22011…
$ `1976`         <dbl> NA, 90212747243, 2555555567, 62351622300, NA, NA, 22728…
$ `1977`         <dbl> NA, 102240575583, 2953333418, 65595122956, NA, NA, 2539…
$ `1978`         <dbl> NA, 116084638702, 3300000109, 71496496574, NA, NA, 3080…
$ `1979`         <dbl> NA, 134256827127, 3697940410, 88948338390, NA, NA, 4115…
$ `1980`         <dbl> NA, 171217790781, 3641723322, 112439126385, 5930503401,…
$ `1981`         <dbl> NA, 175859256874, 3478787909, 211338060015, 5550483036,…
$ `1982`         <dbl> NA, 168095657215, NA, 187448724920, 5550483036, NA, 375…
$ `1983`         <dbl> NA, 175564912386, NA, 138384182007, 5784341596, NA, 327…
$ `1984`         <dbl> NA, 160646748724, NA, 114516348921, 6131475065, 1857338…
$ `1985`         <dbl> NA, 136759437910, NA, 116776995133, 7554065410, 1897050…
$ `1986`         <dbl> 405586592, 153050335916, NA, 107886511309, 7072536109, …
$ `1987`         <dbl> 487709497, 186658478814, NA, 110728825942, 8084412414, …
$ `1988`         <dbl> 596648045, 204765985926, NA, 109438851254, 8769836769, …
$ `1989`         <dbl> 695530726, 218241607366, NA, 102254998563, 10201780977,…
$ `1990`         <dbl> 764804469, 254062093242, NA, 122387353859, 11229515599,…
$ `1991`         <dbl> 872067039, 276856728336, NA, 118039698016, 12704558517,…
$ `1992`         <dbl> 958659218, 246088124936, NA, 118893094122, 15114352005,…
$ `1993`         <dbl> 1083240223, 242926405780, NA, 99272180411, 11051939102,…
$ `1994`         <dbl> 1245810056, 239610677917, NA, 86636400266, 3390500000, …
$ `1995`         <dbl> 1320670391, 270327154575, NA, 108690885030, 5561222222,…
$ `1996`         <dbl> 1379888268, 269490833465, NA, 126287285163, 7526963964,…
$ `1997`         <dbl> 1531843575, 283446224788, NA, 127602388366, 7648377413,…
$ `1998`         <dbl> 1665363128, 266652333831, NA, 130678128885, 6506229607,…
$ `1999`         <dbl> 1722905028, 263024788890, NA, 138085971820, 6152922943,…
$ `2000`         <dbl> 1873184358, 284759318603, NA, 140945759314, 9129594819,…
$ `2001`         <dbl> 1896648045, 259643121973, NA, 148529518712, 8936079253,…
$ `2002`         <dbl> 1962011173, 266529432166, 3854235264, 177201164643, 152…
$ `2003`         <dbl> 2044134078, 354176768091, 4539496563, 205214466071, 178…
$ `2004`         <dbl> 2.254749e+09, 4.404818e+11, 5.220825e+09, 2.542648e+11,…
$ `2005`         <dbl> 2.359777e+09, 5.139416e+11, 6.226199e+09, 3.108896e+11,…
$ `2006`         <dbl> 2.469832e+09, 5.775869e+11, 6.971383e+09, 3.969210e+11,…
$ `2007`         <dbl> 2.677654e+09, 6.628680e+11, 9.715765e+09, 4.654855e+11,…
$ `2008`         <dbl> 2.843017e+09, 7.105362e+11, 1.024977e+10, 5.677912e+11,…
$ `2009`         <dbl> 2.553631e+09, 7.219012e+11, 1.215484e+10, 5.083627e+11,…
$ `2010`         <dbl> 2.453631e+09, 8.635195e+11, 1.563384e+10, 5.985216e+11,…
$ `2011`         <dbl> 2.637989e+09, 9.678246e+11, 1.819041e+10, 6.820159e+11,…
$ `2012`         <dbl> 2.615084e+09, 9.753548e+11, 2.020357e+10, 7.375895e+11,…
$ `2013`         <dbl> 2.727933e+09, 9.859871e+11, 2.056449e+10, 8.339481e+11,…
$ `2014`         <dbl> 2.791061e+09, 1.006526e+12, 2.055058e+10, 8.943225e+11,…
$ `2015`         <dbl> 2.963128e+09, 9.273485e+11, 1.999814e+10, 7.686447e+11,…
$ `2016`         <dbl> 2.983799e+09, 8.851764e+11, 1.801955e+10, 6.913634e+11,…
$ `2017`         <dbl> 3.092179e+09, 1.021043e+12, 1.889635e+10, 6.848988e+11,…
$ `2018`         <dbl> 3.276188e+09, 1.007196e+12, 1.841886e+10, 7.670257e+11,…
$ `2019`         <dbl> 3.395794e+09, 1.000834e+12, 1.890450e+10, 8.225384e+11,…
$ `2020`         <dbl> 2.610039e+09, 9.275933e+11, 2.014345e+10, 7.864600e+11,…
$ `2021`         <dbl> 3.126019e+09, 1.081998e+12, 1.458314e+10, 8.444597e+11,…
$ `2022`         <dbl> NA, 1.169484e+12, NA, 8.778633e+11, 1.067136e+11, 1.888…

dim(get_all_data)

[1] 266  65

Interpretation of the data:

The dataset contains 266 rows of GDP observations from 1960 to 2022, one row per country or aggregate region (for example, "Africa Eastern and Southern"). Each year is represented as a separate column, which creates a wide data structure that is not tidy and is awkward to group, filter, or plot by year.

We will use ggplot2 to visualize and analyze the dataset.

3.2 Data Import and Tidying

clean_data <- get_all_data %>%
  pivot_longer(
    cols = `1960`:`2022`,
    names_to = "year",
    values_to = "gdp_Amount"
  )
clean_data

# A tibble: 16,758 × 4
   Country `Country Code` year  gdp_Amount
   <chr>   <chr>          <chr>      <dbl>
 1 Aruba   ABW            1960          NA
 2 Aruba   ABW            1961          NA
 3 Aruba   ABW            1962          NA
 4 Aruba   ABW            1963          NA
 5 Aruba   ABW            1964          NA
 6 Aruba   ABW            1965          NA
 7 Aruba   ABW            1966          NA
 8 Aruba   ABW            1967          NA
 9 Aruba   ABW            1968          NA
10 Aruba   ABW            1969          NA
# ℹ 16,748 more rows
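The year column above is stored as character (<chr>) because its values come from column names. The following is a minimal sketch, assuming a small made-up wide table (toy_wide and its values are hypothetical, not taken from GDP.csv), of how pivot_longer() can convert the year to an integer while reshaping, using its names_transform argument:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data in the same wide layout as the GDP file
toy_wide <- tibble(
  Country = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  `1960` = c(1.0, NA),
  `1961` = c(1.5, 2.0),
  `1962` = c(2.0, 2.5)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = `1960`:`1962`,
    names_to = "year",
    values_to = "gdp_Amount",
    names_transform = list(year = as.integer)  # store year as integer, not character
  )

# Each input row yields one output row per year column
nrow(toy_long) == nrow(toy_wide) * 3
```

The same arithmetic explains the size of the tidied dataset: 266 rows times 63 year columns gives 16,758 rows. With an integer year, later steps such as filter(year >= 2013) compare numbers rather than strings.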

Normalizing variable structure and renaming variables

tidy_clean_data <- clean_data %>%
  rename(
    country = Country,
    country_code = `Country Code`
  ) %>%
  drop_na(gdp_Amount) %>%
  filter(!country_code %in% c(
    "WLD","HIC","LIC","LMC","UMC",
    "EUU","EMU","ARB","NAC","EAS",
    "ECS","LCN","MEA","SAS","SSF"
  ))

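Handling missing data: rows with a missing GDP value carry no measurable information, so they are removed. The drop_na() call above has already removed them, so this filter acts only as a safeguard.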

tidy_clean_data <- tidy_clean_data %>%
  filter(!is.na(gdp_Amount))
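The exclusion list above removes 15 aggregate codes, but other grouped rows remain in the data (for example, AFE, "Africa Eastern and Southern", appears in the raw table). The following is a rough sketch of extending the exclusion list. It assumes that every code in extra_aggregates is a World Bank aggregate group rather than a country; that classification is an assumption and should be checked against the World Bank country metadata. The toy tibble is hypothetical and only stands in for tidy_clean_data.

```r
library(dplyr)
library(tibble)

# Codes already excluded by the original filter
original_aggregates <- c(
  "WLD", "HIC", "LIC", "LMC", "UMC",
  "EUU", "EMU", "ARB", "NAC", "EAS",
  "ECS", "LCN", "MEA", "SAS", "SSF"
)

# ASSUMPTION: these additional codes are aggregates, not individual countries
extra_aggregates <- c(
  "AFE", "AFW", "OED", "PST", "IBT", "LMY",
  "MIC", "IBD", "LTE", "EAP", "TEA"
)

# Hypothetical rows standing in for tidy_clean_data
toy <- tibble(
  country_code = c("USA", "OED", "AFE", "WLD", "AFG"),
  gdp_Amount   = c(5, 9, 3, 20, 1)
)

# Keep only rows whose code is in neither list
countries_only <- toy %>%
  filter(!country_code %in% c(original_aggregates, extra_aggregates))

countries_only$country_code
```

Keeping the two vectors separate makes it easy to see which exclusions come from the original analysis and which are added assumptions.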

3.3 Analysis

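Summary table of average GDP per country code from 1960 to 2022, sorted from highest to lowest. Several of the top codes (for example, OED for OECD members) are aggregate groups that are not in the exclusion list above, not individual countries.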

tidy_clean_data %>%
  group_by(country_code) %>%
  summarise(avg_gdp_amount = mean(gdp_Amount, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_gdp_amount))

# A tibble: 247 × 2
   country_code avg_gdp_amount
   <chr>                 <dbl>
 1 OED                 2.28e13
 2 PST                 2.17e13
 3 IBT                 9.62e12
 4 LMY                 9.10e12
 5 MIC                 8.92e12
 6 IBD                 8.86e12
 7 USA                 8.21e12
 8 LTE                 6.11e12
 9 EAP                 3.85e12
10 TEA                 3.84e12
# ℹ 237 more rows

 # head(10)

Visualization

tidy_clean_data %>%
  filter(nchar(country_code) == 3) %>%
  group_by(country) %>%
  summarise(avg_gdp_amount = mean(gdp_Amount, na.rm = TRUE)) %>%
  slice_max(avg_gdp_amount, n =10) %>%
  ggplot(aes(x = reorder(country, avg_gdp_amount), y = avg_gdp_amount)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 10 Countries and Regions With Highest GDP",
    x = "Country and Regions",
    y = "Average GDP Amount"
  ) +
  theme_minimal()

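The chart shows the ten countries and regional aggregates with the highest average GDP. Because several World Bank aggregate groups remain in the data after filtering, most of the bars represent groupings (for example, OECD members) rather than individual countries. In the summary table above, the United States is the only individual country in the top ten.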

Summary table of global GDP trends over the last 10 years

tidy_clean_data %>%
  filter(year >= 2013) %>%
  group_by(year) %>%
  summarise(avg_gdp_amount = mean(gdp_Amount, na.rm = TRUE), .groups = "drop") %>%
  head(20)

# A tibble: 10 × 2
   year  avg_gdp_amount
   <chr>          <dbl>
 1 2013         1.55e12
 2 2014         1.59e12
 3 2015         1.51e12
 4 2016         1.53e12
 5 2017         1.64e12
 6 2018         1.74e12
 7 2019         1.78e12
 8 2020         1.73e12
 9 2021         2.04e12
10 2022         2.31e12

Visualize with x and y labels

tidy_clean_data %>%
  mutate(year = as.numeric(year)) %>%
  filter(year >= 2013) %>%
  group_by(year) %>%
  summarise(avg_gdp_amount = mean(gdp_Amount, na.rm = TRUE)) %>%
  drop_na(avg_gdp_amount) %>%
  ggplot(aes(x = year, y = avg_gdp_amount)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point() +
  labs(
    title = "Average Global GDP In The Last 10 Years",
    x = "Year",
    y = "Average GDP (USD)"
  ) +
  theme_minimal()

The last decade provides insight into recent world GDP trends. The average GDP across the rows in the dataset rose from about 1.55e12 USD in 2013 to about 2.31e12 USD in 2022, with dips in 2015 and 2020. This rise may reflect growth in large economies such as China and the United States, along with technological change and industrialization.
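To quantify the trend, the yearly averages can be turned into percent changes. The following is a minimal sketch that copies the rounded values from the summary table above into a small tibble (yearly_avg, yearly_change, and total_change are names introduced here for illustration) and uses dplyr's lag() to compare each year with the previous one:

```r
library(dplyr)
library(tibble)

# Yearly averages copied from the summary table above (rounded to 3 significant figures)
yearly_avg <- tibble(
  year = 2013:2022,
  avg_gdp_amount = c(1.55e12, 1.59e12, 1.51e12, 1.53e12, 1.64e12,
                     1.74e12, 1.78e12, 1.73e12, 2.04e12, 2.31e12)
)

yearly_change <- yearly_avg %>%
  arrange(year) %>%
  mutate(
    # Percent change relative to the previous year (NA for the first year)
    pct_change = 100 * (avg_gdp_amount - lag(avg_gdp_amount)) / lag(avg_gdp_amount)
  )

# Total percent change from the first to the last year
total_change <- 100 *
  (last(yearly_avg$avg_gdp_amount) - first(yearly_avg$avg_gdp_amount)) /
  first(yearly_avg$avg_gdp_amount)

yearly_change
total_change
```

With these rounded inputs, the year-over-year change is negative in 2015 and 2020, and the total increase from 2013 to 2022 is roughly 49 percent. Because the inputs are rounded, the results are approximate.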

GDP trends over time for randomly selected countries

set.seed(123)

random_countries <- tidy_clean_data %>%
  distinct(country) %>%
  slice_sample(n = 10)

tidy_clean_data %>%
  filter(country %in% random_countries$country) %>%
  ggplot(aes(x = as.numeric(year), y = gdp_Amount, color = country)) +
  geom_line() +
  labs(
    title = "GDP Trends for Randomly Selected Countries",
    x = "Year",
    y = "GDP (USD)"
  ) +
  theme_minimal()
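Because GDP can differ by several orders of magnitude between countries, smaller economies may be flattened near zero on a linear axis. The following is a sketch, using made-up values (toy_gdp is hypothetical and not drawn from the dataset), of the same kind of line chart with a logarithmic y axis via scale_y_log10():

```r
library(ggplot2)
library(tibble)

# Hypothetical values: one large and one small economy
toy_gdp <- tibble(
  country = rep(c("Large", "Small"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 2),
  gdp_Amount = c(1e12, 2e12, 4e12, 1e9, 3e9, 5e9)
)

p <- ggplot(toy_gdp, aes(x = year, y = gdp_Amount, color = country)) +
  geom_line() +
  scale_y_log10() +  # equal vertical distances now mean equal ratios
  labs(
    title = "GDP Trends on a Log Scale (Toy Data)",
    x = "Year",
    y = "GDP (USD, log scale)"
  ) +
  theme_minimal()

p
```

On a log scale, lines with similar slopes indicate similar growth rates, which can make countries of very different sizes easier to compare. Whether this suits the randomly selected countries depends on how far apart their GDP values are.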

Comparing five countries to see their economic growth over time

tidy_clean_data %>%
  mutate(year = as.numeric(year)) %>%
  filter(country %in% c("United States", "China", "Russian Federation", "Brazil", "India"), !is.na(gdp_Amount)) %>%
  ggplot(aes(x = year, y = gdp_Amount, color = country)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "GDP Trends Comparison Between Selected Countries",
    x = "Year",
    y = "GDP Amount"
  ) +
  theme_minimal()

This visualization compares GDP trends among five major countries. While most of them show a gradual increase in GDP over time, the size of the increase differs across countries. The United States and China show the steepest increases, while India, Brazil, and the Russian Federation show smaller increases over the decades.

Conclusion

The dataset originates from the World Bank and was obtained through Kaggle. It provides GDP statistics for countries worldwide from 1960 to 2022.

Before tidying, the dataset was stored in a wide format where each year was represented as a separate column. After applying tidy data principles, the dataset was transformed into a long format where each observation represents a country-year pair.

The analysis shows that global GDP has generally increased over time. This trend may be influenced by factors such as economic development, urbanization, increased education, and technology.

However, some regions, particularly parts of Africa, still exhibit relatively low GDP. Countries such as Niger and Burkina Faso continue to experience lower GDP compared to developed countries.