Code
library(tidyverse)
library(janitor)
library(dplyr)
library(gt)The objective of this Project Two Data Transformations exercise lies in preparing different datasets (three, specifically) for analysis work to be conducted further downstream. Moreover, these three datasets must be drawn from within the confines of the DATA 607 Week 5A Discussion thread, wherein the class’ students were tasked with identifying untidy datasets and suggesting possible analytical endeavors that may be performed once the datasets have been tidied. In choosing three of these provided datasets, a key requirement of the project sis that only distinct wide-format untidy datasets must be selected.
Once these three datasets are selected, imported and cleaned within RStudio, the specific analyses requested within the original discussion posts will then be executed, including features such as summary tables, visualizations, and brief explanatory narratives where applicable.
The potential analyses to be executed for this crude birth rate dataset are:
comparing the trends in birth rates over time across different regions/countries,
visualizing birth rate trends over time, and
examining the declines and increases in birth rates to identify any widely shared population changes (and then attempting to associate such changes with relevant real-world health and economic events).
The prospective analyses to be conducted using this GDP by Country dataset (as set out by Sinem herself) are:
comparing GDP growth between countries (potentially using a time series graph),
calculating the average GDP across the included years, and
determining the top-performing countries in terms of economic output.
For each of the three selected datasets, I will first obtain a raw CSV version of the data that preserves its original wide formatting, and then commit it to my personal GitHub repository. Each dataset will then be imported into RStudio via the read_csv() function, after which a reproducible tidying pipeline will be applied using tidyr and dplyr. The main transformation will involve reshaping the wide year-based columns into a tidy long format using pivot_longer(), followed by general standardization and cleaning such as consistent variable naming and well-document handling of missing or inconsistent values. Finally, I will complete the required Discussion 5A analyses for each dataset using only the tidied version, supported by summary tables, visualizations, and brief narrative interpretations.
One expected challenge involves missing values within the datasets, especially within the earlier examined years. In this, whether rows are retained (with missing values kept as NAs) or filtered out specific analyses will be noted and documented. Additionally, because these selected datasets span a wide range of countries and years, the resulting visualizations may become cluttered if every if plotted at once. As such, there may be cases where only a subset of countries (for instance, the top or bottom performers based on a calculated metric) will be focused on for clarity, while still maintaining a reproducible workflow.
Within the confines of this specific step, the dataset corresponding to the first crude birth rate analyses was downloaded from its discovered location (Kaggle), and pushed to a personal GitHub repository. The file itself contains all information from the original crude birth rates dataset and its wide format has been preserved. The dataset will be imported into RStudio for tidying, and further analyses, in the subsequent step.
Now that the raw wide format CSV file has been committed and made accessible, the next step is to import it into RStudio and apply a reproducible tidying pipeline. Since the year values are currently spread across multiple columns, the principal transformation will involve reshaping the dataset from wide to long format through the use of pivot_longer(). Thereafter, the variable names will be standardized, relevant identifier columns retained, and the value column converted into a numeric format suitable for analysis.
library(tidyverse)
library(janitor)
library(dplyr)
library(gt)birth_rate_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20Two%20Assignment/crude_birth_rate_dataset.csv"
birth_rate_raw <- read_csv(birth_rate_url, show_col_types = FALSE)
glimpse(birth_rate_raw)Rows: 270
Columns: 67
$ `Data Source` <chr> NA, "Last Updated Date", NA, "Country N…
$ `World Development Indicators` <chr> NA, "29/06/2023", NA, "Country Code", "…
$ ...3 <chr> NA, NA, NA, "Indicator Name", "Birth ra…
$ ...4 <chr> NA, NA, NA, "Indicator Code", "SP.DYN.C…
$ ...5 <dbl> NA, NA, NA, 1960.00000, 33.88300, 47.43…
$ ...6 <dbl> NA, NA, NA, 1961.00000, 32.83100, 47.53…
$ ...7 <dbl> NA, NA, NA, 1962.00000, 31.64900, 47.59…
$ ...8 <dbl> NA, NA, NA, 1963.00000, 30.41600, 47.63…
$ ...9 <dbl> NA, NA, NA, 1964.00000, 29.14700, 47.64…
$ ...10 <dbl> NA, NA, NA, 1965.00000, 27.88900, 47.66…
$ ...11 <dbl> NA, NA, NA, 1966.00000, 26.66300, 47.69…
$ ...12 <dbl> NA, NA, NA, 1967.00000, 25.50300, 47.69…
$ ...13 <dbl> NA, NA, NA, 1968.00000, 24.59200, 47.69…
$ ...14 <dbl> NA, NA, NA, 1969.00000, 23.73500, 47.72…
$ ...15 <dbl> NA, NA, NA, 1970.00000, 22.97400, 47.67…
$ ...16 <dbl> NA, NA, NA, 1971.00000, 22.31300, 47.64…
$ ...17 <dbl> NA, NA, NA, 1972.00000, 21.76600, 47.47…
$ ...18 <dbl> NA, NA, NA, 1973.00000, 21.49200, 47.22…
$ ...19 <dbl> NA, NA, NA, 1974.00000, 21.38200, 47.07…
$ ...20 <dbl> NA, NA, NA, 1975.00000, 21.39300, 46.95…
$ ...21 <dbl> NA, NA, NA, 1976.00000, 21.48500, 46.79…
$ ...22 <dbl> NA, NA, NA, 1977.00000, 21.73900, 46.63…
$ ...23 <dbl> NA, NA, NA, 1978.00000, 21.92000, 46.51…
$ ...24 <dbl> NA, NA, NA, 1979.00000, 21.99300, 46.47…
$ ...25 <dbl> NA, NA, NA, 1980.00000, 21.93100, 46.33…
$ ...26 <dbl> NA, NA, NA, 1981.00000, 21.73000, 46.23…
$ ...27 <dbl> NA, NA, NA, 1982.00000, 21.47600, 46.15…
$ ...28 <dbl> NA, NA, NA, 1983.00000, 21.38500, 46.13…
$ ...29 <dbl> NA, NA, NA, 1984.00000, 21.18300, 46.13…
$ ...30 <dbl> NA, NA, NA, 1985.00000, 20.91900, 46.14…
$ ...31 <dbl> NA, NA, NA, 1986.00000, 20.61700, 46.06…
$ ...32 <dbl> NA, NA, NA, 1987.00000, 20.26900, 45.83…
$ ...33 <dbl> NA, NA, NA, 1988.00000, 19.82100, 45.33…
$ ...34 <dbl> NA, NA, NA, 1989.00000, 19.18400, 44.81…
$ ...35 <dbl> NA, NA, NA, 1990.00000, 18.66200, 44.23…
$ ...36 <dbl> NA, NA, NA, 1991.00000, 17.72200, 43.84…
$ ...37 <dbl> NA, NA, NA, 1992.00000, 16.44300, 43.34…
$ ...38 <dbl> NA, NA, NA, 1993.00000, 16.12600, 42.96…
$ ...39 <dbl> NA, NA, NA, 1994.00000, 15.43100, 42.53…
$ ...40 <dbl> NA, NA, NA, 1995.00000, 15.99100, 42.48…
$ ...41 <dbl> NA, NA, NA, 1996.00000, 16.15300, 42.13…
$ ...42 <dbl> NA, NA, NA, 1997.00000, 16.38800, 41.57…
$ ...43 <dbl> NA, NA, NA, 1998.00000, 15.07800, 41.12…
$ ...44 <dbl> NA, NA, NA, 1999.00000, 14.36100, 40.89…
$ ...45 <dbl> NA, NA, NA, 2000.00000, 14.42700, 40.52…
$ ...46 <dbl> NA, NA, NA, 2001.00000, 13.73900, 40.34…
$ ...47 <dbl> NA, NA, NA, 2002.00000, 12.99200, 40.04…
$ ...48 <dbl> NA, NA, NA, 2003.00000, 12.62100, 39.75…
$ ...49 <dbl> NA, NA, NA, 2004.00000, 11.92100, 39.57…
$ ...50 <dbl> NA, NA, NA, 2005.00000, 12.34800, 39.40…
$ ...51 <dbl> NA, NA, NA, 2006.00000, 13.05500, 39.23…
$ ...52 <dbl> NA, NA, NA, 2007.00000, 12.96200, 39.00…
$ ...53 <dbl> NA, NA, NA, 2008.00000, 12.74800, 38.85…
$ ...54 <dbl> NA, NA, NA, 2009.00000, 12.35000, 38.36…
$ ...55 <dbl> NA, NA, NA, 2010.00000, 12.19300, 37.94…
$ ...56 <dbl> NA, NA, NA, 2011.00000, 12.24600, 37.48…
$ ...57 <dbl> NA, NA, NA, 2012.00000, 12.72300, 36.92…
$ ...58 <dbl> NA, NA, NA, 2013.00000, 13.31600, 36.44…
$ ...59 <dbl> NA, NA, NA, 2014.00000, 13.53300, 36.02…
$ ...60 <dbl> NA, NA, NA, 2015.00000, 12.42800, 35.61…
$ ...61 <dbl> NA, NA, NA, 2016.00000, 12.30000, 35.18…
$ ...62 <dbl> NA, NA, NA, 2017.00000, 11.53000, 34.89…
$ ...63 <dbl> NA, NA, NA, 2018.00000, 9.88100, 34.611…
$ ...64 <dbl> NA, NA, NA, 2019.00000, 9.13800, 34.341…
$ ...65 <dbl> NA, NA, NA, 2020.00000, 8.10200, 33.916…
$ ...66 <dbl> NA, NA, NA, 2021.00000, 7.19300, 33.546…
$ ...67 <dbl> NA, NA, NA, 2022, NA, NA, NA, NA, NA, N…
head(birth_rate_raw)# A tibble: 6 × 67
`Data Source` World Development In…¹ ...3 ...4 ...5 ...6 ...7 ...8
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 <NA> <NA> <NA> <NA> NA NA NA NA
2 Last Updated D… 29/06/2023 <NA> <NA> NA NA NA NA
3 <NA> <NA> <NA> <NA> NA NA NA NA
4 Country Name Country Code Indi… Indi… 1960 1961 1962 1963
5 Aruba ABW Birt… SP.D… 33.9 32.8 31.6 30.4
6 Africa Eastern… AFE Birt… SP.D… 47.4 47.5 47.6 47.6
# ℹ abbreviated name: ¹`World Development Indicators`
# ℹ 59 more variables: ...9 <dbl>, ...10 <dbl>, ...11 <dbl>, ...12 <dbl>,
# ...13 <dbl>, ...14 <dbl>, ...15 <dbl>, ...16 <dbl>, ...17 <dbl>,
# ...18 <dbl>, ...19 <dbl>, ...20 <dbl>, ...21 <dbl>, ...22 <dbl>,
# ...23 <dbl>, ...24 <dbl>, ...25 <dbl>, ...26 <dbl>, ...27 <dbl>,
# ...28 <dbl>, ...29 <dbl>, ...30 <dbl>, ...31 <dbl>, ...32 <dbl>,
# ...33 <dbl>, ...34 <dbl>, ...35 <dbl>, ...36 <dbl>, ...37 <dbl>, …
Now that we have the crude birth rates dataset imported, we can now conduct some feature engineering, beginning with the removal of the first three metadata rows of the CSV file.
birth_rate_cleaned <- birth_rate_raw %>%
slice(-1,-2,-3) #remove the first three rowsOur next step is take the new first row (which contains strings such as Country Name, Country Code, etc.) and make these the column/variable names instead.
colnames(birth_rate_cleaned) <- as.character(birth_rate_cleaned[1,])
birth_rate_cleaned <- birth_rate_cleaned %>%
slice(-1) #remove the first row that was used to created the headers
head(birth_rate_cleaned)# A tibble: 6 × 67
`Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Aruba ABW Birth rate, cru… SP.DYN.CBRT.IN 33.9 32.8
2 Africa Eastern… AFE Birth rate, cru… SP.DYN.CBRT.IN 47.4 47.5
3 Afghanistan AFG Birth rate, cru… SP.DYN.CBRT.IN 50.3 50.4
4 Africa Western… AFW Birth rate, cru… SP.DYN.CBRT.IN 47.3 47.4
5 Angola AGO Birth rate, cru… SP.DYN.CBRT.IN 51.0 51.3
6 Albania ALB Birth rate, cru… SP.DYN.CBRT.IN 41.1 40.3
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
# `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
And now we can clean our column names for best practices.
birth_rate_cleaned <- birth_rate_cleaned %>%
clean_names()
glimpse(birth_rate_cleaned)Rows: 266
Columns: 67
$ country_name <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ country_code <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ indicator_name <chr> "Birth rate, crude (per 1,000 people)", "Birth rate, cr…
$ indicator_code <chr> "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN", "…
$ x1960 <dbl> 33.88300, 47.43855, 50.34000, 47.32548, 51.02600, 41.06…
$ x1961 <dbl> 32.83100, 47.53055, 50.44300, 47.42105, 51.28200, 40.33…
$ x1962 <dbl> 31.64900, 47.59756, 50.57000, 47.52922, 51.31600, 39.21…
$ x1963 <dbl> 30.41600, 47.63614, 50.70300, 47.53103, 51.32300, 38.14…
$ x1964 <dbl> 29.14700, 47.64548, 50.83100, 47.51192, 51.28200, 36.82…
$ x1965 <dbl> 27.88900, 47.66766, 50.87200, 47.46857, 51.28200, 35.36…
$ x1966 <dbl> 26.66300, 47.69789, 50.98600, 47.44364, 51.29500, 34.35…
$ x1967 <dbl> 25.50300, 47.69133, 51.08100, 47.42593, 51.31400, 33.89…
$ x1968 <dbl> 24.59200, 47.69102, 51.14800, 47.42235, 51.34800, 33.51…
$ x1969 <dbl> 23.73500, 47.72112, 51.19500, 47.41269, 51.35300, 33.07…
$ x1970 <dbl> 22.97400, 47.67313, 51.12200, 47.41411, 51.26700, 32.20…
$ x1971 <dbl> 22.31300, 47.64967, 51.16300, 47.52970, 50.69800, 31.74…
$ x1972 <dbl> 21.76600, 47.47074, 51.10900, 47.57899, 50.47400, 31.23…
$ x1973 <dbl> 21.49200, 47.22113, 51.11400, 47.63283, 50.46700, 30.69…
$ x1974 <dbl> 21.38200, 47.07547, 51.13500, 47.81713, 50.47200, 30.29…
$ x1975 <dbl> 21.39300, 46.95020, 51.01800, 47.91150, 50.46900, 29.98…
$ x1976 <dbl> 21.48500, 46.79183, 50.93500, 47.86907, 50.51400, 29.35…
$ x1977 <dbl> 21.73900, 46.63214, 50.92100, 47.96894, 50.52300, 28.58…
$ x1978 <dbl> 21.92000, 46.51202, 50.81600, 48.03727, 50.61600, 27.88…
$ x1979 <dbl> 21.99300, 46.47196, 50.73700, 47.93830, 50.73200, 27.20…
$ x1980 <dbl> 21.93100, 46.33961, 50.48200, 47.77071, 50.89200, 26.76…
$ x1981 <dbl> 21.73000, 46.23755, 50.26400, 47.51406, 51.10900, 26.65…
$ x1982 <dbl> 21.47600, 46.15826, 50.13800, 47.25192, 51.30700, 26.79…
$ x1983 <dbl> 21.38500, 46.13473, 50.13900, 47.11112, 51.61000, 26.72…
$ x1984 <dbl> 21.18300, 46.13520, 50.23500, 46.70656, 51.93500, 26.62…
$ x1985 <dbl> 20.91900, 46.14379, 50.55300, 46.20665, 52.13600, 26.29…
$ x1986 <dbl> 20.61700, 46.06749, 50.72800, 45.72924, 52.19000, 25.93…
$ x1987 <dbl> 20.26900, 45.83058, 50.84500, 45.34627, 52.14600, 25.68…
$ x1988 <dbl> 19.82100, 45.33736, 50.98000, 45.00171, 51.97300, 25.36…
$ x1989 <dbl> 19.18400, 44.81301, 51.16200, 44.92848, 51.69900, 24.86…
$ x1990 <dbl> 18.66200, 44.23072, 51.42300, 44.67619, 51.34400, 24.41…
$ x1991 <dbl> 17.72200, 43.84232, 51.78800, 44.47423, 50.92600, 23.72…
$ x1992 <dbl> 16.44300, 43.34168, 51.94800, 44.30932, 50.37400, 23.14…
$ x1993 <dbl> 16.12600, 42.96601, 52.03800, 44.16810, 49.89300, 22.47…
$ x1994 <dbl> 15.43100, 42.53329, 52.17400, 43.94269, 49.55000, 22.24…
$ x1995 <dbl> 15.99100, 42.48572, 52.07300, 43.73024, 49.18500, 21.80…
$ x1996 <dbl> 16.15300, 42.13563, 51.87300, 43.49103, 48.86000, 20.98…
$ x1997 <dbl> 16.38800, 41.57346, 51.40000, 43.21922, 48.41200, 19.81…
$ x1998 <dbl> 15.07800, 41.12879, 50.88000, 43.02697, 48.00900, 18.80…
$ x1999 <dbl> 14.36100, 40.89482, 50.35100, 43.17424, 47.77300, 17.98…
$ x2000 <dbl> 14.42700, 40.52824, 49.66400, 43.19955, 47.64700, 17.07…
$ x2001 <dbl> 13.73900, 40.34121, 48.97900, 43.07550, 47.57400, 16.12…
$ x2002 <dbl> 12.99200, 40.04732, 48.20100, 42.92712, 47.44800, 15.20…
$ x2003 <dbl> 12.62100, 39.75014, 47.35000, 42.74688, 47.22600, 14.66…
$ x2004 <dbl> 11.92100, 39.57589, 46.33000, 42.50272, 47.09900, 13.97…
$ x2005 <dbl> 12.34800, 39.40739, 45.26300, 42.42154, 46.94400, 13.23…
$ x2006 <dbl> 13.05500, 39.23711, 44.72100, 42.19330, 46.64300, 12.53…
$ x2007 <dbl> 12.96200, 39.00052, 43.85800, 41.94301, 46.29000, 12.07…
$ x2008 <dbl> 12.74800, 38.85169, 41.50600, 41.75479, 45.88900, 11.86…
$ x2009 <dbl> 12.35000, 38.36494, 41.15700, 41.50376, 45.49500, 11.84…
$ x2010 <dbl> 12.19300, 37.94026, 40.60200, 41.21963, 44.97000, 11.92…
$ x2011 <dbl> 12.24600, 37.48399, 39.85500, 40.89424, 44.36400, 12.03…
$ x2012 <dbl> 12.72300, 36.92130, 40.00900, 40.41643, 43.86000, 12.22…
$ x2013 <dbl> 13.31600, 36.44714, 39.60100, 39.85651, 43.28200, 12.27…
$ x2014 <dbl> 13.53300, 36.02832, 39.10500, 39.33535, 42.67600, 12.15…
$ x2015 <dbl> 12.42800, 35.61331, 38.80300, 38.85921, 42.02000, 11.71…
$ x2016 <dbl> 12.30000, 35.18902, 37.93600, 38.39310, 41.37700, 11.28…
$ x2017 <dbl> 11.53000, 34.89254, 37.34200, 37.88166, 40.81000, 10.87…
$ x2018 <dbl> 9.88100, 34.61102, 36.92700, 37.44709, 40.23600, 10.517…
$ x2019 <dbl> 9.13800, 34.34145, 36.46600, 37.02783, 39.72500, 10.343…
$ x2020 <dbl> 8.10200, 33.91675, 36.05100, 36.61573, 39.27100, 10.285…
$ x2021 <dbl> 7.19300, 33.54627, 35.84200, 36.23703, 38.80900, 10.240…
$ x2022 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
head(birth_rate_cleaned)# A tibble: 6 × 67
country_name country_code indicator_name indicator_code x1960 x1961 x1962
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Aruba ABW Birth rate, c… SP.DYN.CBRT.IN 33.9 32.8 31.6
2 Africa Eastern a… AFE Birth rate, c… SP.DYN.CBRT.IN 47.4 47.5 47.6
3 Afghanistan AFG Birth rate, c… SP.DYN.CBRT.IN 50.3 50.4 50.6
4 Africa Western a… AFW Birth rate, c… SP.DYN.CBRT.IN 47.3 47.4 47.5
5 Angola AGO Birth rate, c… SP.DYN.CBRT.IN 51.0 51.3 51.3
6 Albania ALB Birth rate, c… SP.DYN.CBRT.IN 41.1 40.3 39.2
# ℹ 60 more variables: x1963 <dbl>, x1964 <dbl>, x1965 <dbl>, x1966 <dbl>,
# x1967 <dbl>, x1968 <dbl>, x1969 <dbl>, x1970 <dbl>, x1971 <dbl>,
# x1972 <dbl>, x1973 <dbl>, x1974 <dbl>, x1975 <dbl>, x1976 <dbl>,
# x1977 <dbl>, x1978 <dbl>, x1979 <dbl>, x1980 <dbl>, x1981 <dbl>,
# x1982 <dbl>, x1983 <dbl>, x1984 <dbl>, x1985 <dbl>, x1986 <dbl>,
# x1987 <dbl>, x1988 <dbl>, x1989 <dbl>, x1990 <dbl>, x1991 <dbl>,
# x1992 <dbl>, x1993 <dbl>, x1994 <dbl>, x1995 <dbl>, x1996 <dbl>, …
The next step will now comprise executing the pivot_longer() functionality, converting dataset from its wide format into a long one.
birth_rate_tidy <- birth_rate_cleaned %>%
pivot_longer(
cols = -c(country_name, country_code, indicator_name, indicator_code),
names_to = "year",
values_to = "birth_rate"
) %>%
mutate(
year = str_extract(year, "\\d{4}") %>% as.integer(),
birth_rate = as.numeric(birth_rate)
)
glimpse(birth_rate_tidy)Rows: 16,758
Columns: 6
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ indicator_name <chr> "Birth rate, crude (per 1,000 people)", "Birth rate, cr…
$ indicator_code <chr> "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN", "…
$ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1…
$ birth_rate <dbl> 33.883, 32.831, 31.649, 30.416, 29.147, 27.889, 26.663,…
head(birth_rate_tidy)# A tibble: 6 × 6
country_name country_code indicator_name indicator_code year birth_rate
<chr> <chr> <chr> <chr> <int> <dbl>
1 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1960 33.9
2 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1961 32.8
3 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1962 31.6
4 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1963 30.4
5 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1964 29.1
6 Aruba ABW Birth rate, crude (… SP.DYN.CBRT.IN 1965 27.9
At this stage, the dataset has now been converted into a tidy format, with one row representing one country-year observation. Any missing birth rate values will be retained as NAs (a function of pivot_longer() ) for now, and will subsequently only be filtered out where necessary for a given analysis, so as not to compromise the original informational structure of the dataset.
We will then proceed onto the next step for this first crude birth rates dataset.
The first analysis will involve examining broad birth rate trends over time across all countries.
One prospective starting point is to commence by calculating the average birth rate across all included countries for each year.
birth_rate_summary <- birth_rate_tidy %>%
group_by(year) %>%
summarise(
average_birth_rate = mean(birth_rate, na.rm = TRUE),
.groups = "drop"
)
birth_rate_summary %>%
gt() %>%
tab_header(
title = "Average Birth Rate by Year",
subtitle = "Summary of annual birth rates"
) %>%
cols_label(
year = "Year",
average_birth_rate = "Average Birth Rate"
) %>%
fmt_number(
columns = average_birth_rate,
decimals = 2
)| Average Birth Rate by Year | |
| Summary of annual birth rates | |
| Year | Average Birth Rate |
|---|---|
| 1960 | 38.10 |
| 1961 | 37.72 |
| 1962 | 37.83 |
| 1963 | 37.71 |
| 1964 | 37.18 |
| 1965 | 36.66 |
| 1966 | 36.08 |
| 1967 | 35.73 |
| 1968 | 35.37 |
| 1969 | 34.99 |
| 1970 | 34.55 |
| 1971 | 34.30 |
| 1972 | 33.95 |
| 1973 | 33.47 |
| 1974 | 33.18 |
| 1975 | 32.84 |
| 1976 | 32.49 |
| 1977 | 32.23 |
| 1978 | 32.05 |
| 1979 | 32.00 |
| 1980 | 31.91 |
| 1981 | 31.82 |
| 1982 | 31.68 |
| 1983 | 31.46 |
| 1984 | 31.14 |
| 1985 | 30.90 |
| 1986 | 30.55 |
| 1987 | 30.20 |
| 1988 | 29.86 |
| 1989 | 29.33 |
| 1990 | 28.97 |
| 1991 | 28.49 |
| 1992 | 27.99 |
| 1993 | 27.44 |
| 1994 | 26.86 |
| 1995 | 26.36 |
| 1996 | 25.89 |
| 1997 | 25.47 |
| 1998 | 24.98 |
| 1999 | 24.61 |
| 2000 | 24.27 |
| 2001 | 23.88 |
| 2002 | 23.54 |
| 2003 | 23.28 |
| 2004 | 23.07 |
| 2005 | 22.88 |
| 2006 | 22.73 |
| 2007 | 22.66 |
| 2008 | 22.63 |
| 2009 | 22.47 |
| 2010 | 22.24 |
| 2011 | 22.08 |
| 2012 | 21.86 |
| 2013 | 21.54 |
| 2014 | 21.30 |
| 2015 | 20.89 |
| 2016 | 20.50 |
| 2017 | 20.11 |
| 2018 | 19.66 |
| 2019 | 19.32 |
| 2020 | 18.99 |
| 2021 | 18.77 |
| 2022 | NaN |
The summary table exhibits a broad overview of how crude birth rates have changed over time on average across all of the included countries. We can take the above a step further and attempt to construct a visualization from these insights.
ggplot(data = birth_rate_summary, aes(x = year, y = average_birth_rate)) +
geom_line() +
labs(
title = "Average Crude Birth Over Time",
x = "Year",
y = "Average Crude Birth Rate"
) +
theme(plot.title = element_text(hjust = 0.5))In acknowledging the afore-presented plot of average crude birth over time, we can notice, perhaps somewhat alarmingly, that on a holistic average, worldwide crude birth rates have evidently decreased rather noticeably within the time frame of the utilized data (from 1960 to 2021). This is evidenced by the negative gradient of the plotted line of values.
In addition to the overall trend, it may also be beneficial to compare selected countries in terms of how their birth rates have changed over time. For lucidity, the following analysis shall be limited to a chosen number (5) of countries.
selected_countries <- c("Guyana", "United States", "India", "Brazil", "Nigeria")
birth_rate_selected <- birth_rate_tidy %>%
filter(country_name %in% selected_countries)
ggplot(birth_rate_selected, aes(x = year, y = birth_rate, color = country_name)) +
geom_line() +
labs(
title = "Crude Birth Rate for Selected Countries",
x = "Year",
y = "Crude Birth Rate",
color = "Country"
) +
theme(plot.title = element_text(hjust = 0.5))This visual comparison above allows us to easier interpret and compare the numeric findings from the tidied birth rate dataframe. In this, we can see that of the five examined countries, the United States observed the lowest crude birth rate figures throughout the entirety of the year range. Inversely, Nigeria, throughout all of the years, maintained the highest birth rates. The remaining three countries (Guyana, India, and Brazil) commenced at almost the same level as Nigeria’s had, but in the course of time, these rates fell to almost the same level as that of the USA. On a whole however, all five countries do follow the general declining trend, which was initially observed in the average plot of crude birth rates (across all of the countries within the initial dataset).
From the analyses conducted upon the tidied crude birth rate dataset, a clear downward trend in birth rates over time can be observed. The computed yearly averages indicate that, on a global scale, crude birth rates have steadily declined from the early 1960s through to the most recent years of available data. This pattern is also reflected within the country-level comparison, where each of the five examined countries demonstrates a similar long-term decline, albeit at different magnitudes. Nigeria consistently maintains the highest crude birth rates throughout the observed time frame, while the United States exhibits the lowest overall values. Meanwhile, countries such as Guyana, India, and Brazil begin the period with relatively high birth rates but gradually converge toward lower levels over time. Collectively, these findings suggest that declining fertility trends are not isolated to specific countries, but instead represent a broader demographic shift that has occurred globally over the past several decades.
At this stage, the CSV file pertaining to the second GDP by Country dataset was first downloaded from its discerned location (Kaggle, which was indicated by classmate Sinem Kilicdere), and pushed to a personal GitHub repository. The file preserved the original data structure, wherein each year is represented as a separate column, and will now be imported into RStudio for tidying and subsequent analysis.
Now that the raw wide format GDP dataset has been committed and made accessible, the next step is to import it into RStudio and apply a reproducible tidying pipeline. Since the year values are already spread across several columns, the principal transformation here will again involve reshaping the dataset from a wide to a long format using pivot_longer(). Thereafter, the variable names will be standardized and the GDP figures converted into a numeric format suitable for downstream analysis.
gdp_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20Two%20Assignment/country_gdp_dataset.csv"
gdp_raw <- read_csv(gdp_url, show_col_types = FALSE)
glimpse(gdp_raw)Rows: 266
Columns: 65
$ Country <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ `Country Code` <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ `1960` <dbl> NA, 21125015452, 537777811, 10447637853, NA, NA, NA, NA…
$ `1961` <dbl> NA, 21616228139, 548888896, 11173212080, NA, NA, NA, NA…
$ `1962` <dbl> NA, 23506279900, 546666678, 11990534018, NA, NA, NA, NA…
$ `1963` <dbl> NA, 28048360188, 751111191, 12727688165, NA, NA, NA, NA…
$ `1964` <dbl> NA, 25920665260, 800000044, 13898109284, NA, NA, NA, NA…
$ `1965` <dbl> NA, 29472103270, 1006666638, 14929792388, NA, NA, NA, N…
$ `1966` <dbl> NA, 32014368121, 1399999967, 15910837742, NA, NA, NA, N…
$ `1967` <dbl> NA, 33269509510, 1673333418, 14510579889, NA, NA, NA, N…
$ `1968` <dbl> NA, 36327785495, 1373333367, 14968235782, NA, NA, NA, 3…
$ `1969` <dbl> NA, 41638967621, 1408888922, 16979315745, NA, NA, NA, 3…
$ `1970` <dbl> NA, 44629891649, 1748886596, 23596163865, NA, NA, 78617…
$ `1971` <dbl> NA, 49173371529, 1831108971, 20936358634, NA, NA, 89406…
$ `1972` <dbl> NA, 53123459912, 1595555476, 25386169423, NA, NA, 11341…
$ `1973` <dbl> NA, 69482723444, 1733333264, 31975594565, NA, NA, 15084…
$ `1974` <dbl> NA, 85380645042, 2155555498, 44416677335, NA, NA, 18655…
$ `1975` <dbl> NA, 90835426418, 2366666616, 51667190242, NA, NA, 22011…
$ `1976` <dbl> NA, 90212747243, 2555555567, 62351622300, NA, NA, 22728…
$ `1977` <dbl> NA, 102241000000, 2953333418, 65595122956, NA, NA, 2539…
$ `1978` <dbl> NA, 116085000000, 3300000109, 71496496574, NA, NA, 3080…
$ `1979` <dbl> NA, 134257000000, 3697940410, 88948338390, NA, NA, 4115…
$ `1980` <dbl> NA, 171218000000, 3641723322, 112439000000, 5930503401,…
$ `1981` <dbl> NA, 175859000000, 3478787909, 211338000000, 5550483036,…
$ `1982` <dbl> NA, 168096000000, NA, 187449000000, 5550483036, NA, 375…
$ `1983` <dbl> NA, 175565000000, NA, 138384000000, 5784341596, NA, 327…
$ `1984` <dbl> NA, 160647000000, NA, 114516000000, 6131475065, 1857338…
$ `1985` <dbl> NA, 136759000000, NA, 116777000000, 7554065410, 1897050…
$ `1986` <dbl> 405586592, 153050000000, NA, 107887000000, 7072536109, …
$ `1987` <dbl> 487709497, 186658000000, NA, 110729000000, 8084412414, …
$ `1988` <dbl> 596648045, 204766000000, NA, 109439000000, 8769836769, …
$ `1989` <dbl> 695530726, 218242000000, NA, 102255000000, 10201780977,…
$ `1990` <dbl> 764804469, 254062000000, NA, 122387000000, 11229515599,…
$ `1991` <dbl> 872067039, 276857000000, NA, 118040000000, 12704558517,…
$ `1992` <dbl> 958659218, 246088000000, NA, 118893000000, 15114352005,…
$ `1993` <dbl> 1083240223, 242926000000, NA, 99272180411, 11051939102,…
$ `1994` <dbl> 1245810056, 239611000000, NA, 86636400266, 3390500000, …
$ `1995` <dbl> 1320670391, 270327000000, NA, 108691000000, 5561222222,…
$ `1996` <dbl> 1379888268, 269491000000, NA, 126287000000, 7526963964,…
$ `1997` <dbl> 1531843575, 283446000000, NA, 127602000000, 7648377413,…
$ `1998` <dbl> 1665363128, 266652000000, NA, 130678000000, 6506229607,…
$ `1999` <dbl> 1722905028, 263025000000, NA, 138086000000, 6152922943,…
$ `2000` <dbl> 1873184358, 284759000000, NA, 140946000000, 9129594819,…
$ `2001` <dbl> 1896648045, 259643000000, NA, 148530000000, 8936079253,…
$ `2002` <dbl> 1962011173, 266529000000, 3854235264, 177201000000, 152…
$ `2003` <dbl> 2044134078, 354177000000, 4539496563, 205214000000, 178…
$ `2004` <dbl> 2.254749e+09, 4.404820e+11, 5.220825e+09, 2.542650e+11,…
$ `2005` <dbl> 2.359777e+09, 5.139420e+11, 6.226199e+09, 3.108900e+11,…
$ `2006` <dbl> 2.469832e+09, 5.775870e+11, 6.971383e+09, 3.969210e+11,…
$ `2007` <dbl> 2.677654e+09, 6.628680e+11, 9.715765e+09, 4.654860e+11,…
$ `2008` <dbl> 2.843017e+09, 7.105360e+11, 1.024977e+10, 5.677910e+11,…
$ `2009` <dbl> 2.553631e+09, 7.219010e+11, 1.215484e+10, 5.083630e+11,…
$ `2010` <dbl> 2.453631e+09, 8.635200e+11, 1.563384e+10, 5.985220e+11,…
$ `2011` <dbl> 2.637989e+09, 9.678250e+11, 1.819041e+10, 6.820160e+11,…
$ `2012` <dbl> 2.615084e+09, 9.753550e+11, 2.020357e+10, 7.375890e+11,…
$ `2013` <dbl> 2.727933e+09, 9.859870e+11, 2.056449e+10, 8.339480e+11,…
$ `2014` <dbl> 2.791061e+09, 1.006530e+12, 2.055058e+10, 8.943220e+11,…
$ `2015` <dbl> 2.963128e+09, 9.273480e+11, 1.999814e+10, 7.686450e+11,…
$ `2016` <dbl> 2.983799e+09, 8.851760e+11, 1.801955e+10, 6.913630e+11,…
$ `2017` <dbl> 3.092179e+09, 1.021040e+12, 1.889635e+10, 6.848990e+11,…
$ `2018` <dbl> 3.276188e+09, 1.007200e+12, 1.841886e+10, 7.670260e+11,…
$ `2019` <dbl> 3.395794e+09, 1.000830e+12, 1.890450e+10, 8.225380e+11,…
$ `2020` <dbl> 2.610039e+09, 9.275930e+11, 2.014345e+10, 7.864600e+11,…
$ `2021` <dbl> 3.126019e+09, 1.082000e+12, 1.458314e+10, 8.444600e+11,…
$ `2022` <dbl> NA, 1.169480e+12, NA, 8.778630e+11, 1.067140e+11, 1.888…
head(gdp_raw)# A tibble: 6 × 65
Country `Country Code` `1960` `1961` `1962` `1963` `1964` `1965`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aruba ABW NA NA NA NA NA NA
2 Africa E… AFE 2.11e10 2.16e10 2.35e10 2.80e10 2.59e10 2.95e10
3 Afghanis… AFG 5.38e 8 5.49e 8 5.47e 8 7.51e 8 8.00e 8 1.01e 9
4 Africa W… AFW 1.04e10 1.12e10 1.20e10 1.27e10 1.39e10 1.49e10
5 Angola AGO NA NA NA NA NA NA
6 Albania ALB NA NA NA NA NA NA
# ℹ 57 more variables: `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
# `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
# `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
# `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
# `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
# `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
# `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …
As can be observed, the dataset is already structured with the first row containing the relevant variable names, so no metadata-row removal is required in this case. The next step is to clean the column names and convert the dataset into a long format.
gdp_tidy <- gdp_raw %>%
clean_names() %>%
pivot_longer(
cols = -c(country, country_code),
names_to = "year",
values_to = "gdp"
) %>%
mutate(
year = str_extract(year, "\\d{4}") %>% as.integer(),
gdp = as.numeric(gdp)
)
glimpse(gdp_tidy)Rows: 16,758
Columns: 4
$ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "…
$ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
head(gdp_tidy)# A tibble: 6 × 4
country country_code year gdp
<chr> <chr> <int> <dbl>
1 Aruba ABW 1960 NA
2 Aruba ABW 1961 NA
3 Aruba ABW 1962 NA
4 Aruba ABW 1963 NA
5 Aruba ABW 1964 NA
6 Aruba ABW 1965 NA
At this stage, the GDP dataset has now been transformed into a tidy format, with one row representing one country-year observation. Missing GDP values will be retained as NAs for now and only filtered out where necessary for specific analyses, so as not to prematurely reduce the informational content of the original data.
We may now proceed on to the analytical step for this second dataset.
The first analysis to be conducted upon the tidied GDP dataset will involve examining broad GDP trends over time across all included countries.
gdp_summary <- gdp_tidy %>%
group_by(year) %>%
summarise(
average_gdp = mean(gdp, na.rm = TRUE),
.groups = "drop"
)
gdp_summary %>%
gt() %>%
tab_header(
title = "Average GDP by Year",
subtitle = "Summary of annual GDP values (USD billions)"
) %>%
cols_label(
year = "Year",
average_gdp = "Average GDP (Billion USD)"
) %>%
fmt_number(
columns = average_gdp,
decimals = 2,
scale_by = 1e-9
)| Average GDP by Year | |
| Summary of annual GDP values (USD billions) | |
| Year | Average GDP (Billion USD) |
|---|---|
| 1960 | 71.27 |
| 1961 | 72.75 |
| 1962 | 76.40 |
| 1963 | 82.36 |
| 1964 | 90.48 |
| 1965 | 91.83 |
| 1966 | 102.17 |
| 1967 | 105.98 |
| 1968 | 111.11 |
| 1969 | 122.87 |
| 1970 | 132.99 |
| 1971 | 144.00 |
| 1972 | 165.88 |
| 1973 | 204.76 |
| 1974 | 238.06 |
| 1975 | 262.66 |
| 1976 | 282.67 |
| 1977 | 313.95 |
| 1978 | 370.46 |
| 1979 | 428.40 |
| 1980 | 453.33 |
| 1981 | 463.51 |
| 1982 | 454.57 |
| 1983 | 454.35 |
| 1984 | 466.44 |
| 1985 | 482.14 |
| 1986 | 558.57 |
| 1987 | 619.86 |
| 1988 | 681.47 |
| 1989 | 710.55 |
| 1990 | 755.93 |
| 1991 | 778.78 |
| 1992 | 826.52 |
| 1993 | 831.35 |
| 1994 | 885.75 |
| 1995 | 953.09 |
| 1996 | 982.13 |
| 1997 | 981.06 |
| 1998 | 970.39 |
| 1999 | 993.18 |
| 2000 | 1,009.81 |
| 2001 | 998.93 |
| 2002 | 1,018.48 |
| 2003 | 1,149.29 |
| 2004 | 1,306.61 |
| 2005 | 1,431.44 |
| 2006 | 1,565.92 |
| 2007 | 1,794.23 |
| 2008 | 1,997.00 |
| 2009 | 1,895.74 |
| 2010 | 2,097.25 |
| 2011 | 2,322.63 |
| 2012 | 2,402.47 |
| 2013 | 2,478.71 |
| 2014 | 2,540.06 |
| 2015 | 2,401.52 |
| 2016 | 2,435.92 |
| 2017 | 2,611.09 |
| 2018 | 2,773.99 |
| 2019 | 2,826.37 |
| 2020 | 2,739.71 |
| 2021 | 3,211.35 |
| 2022 | 3,599.79 |
The above table provides a visual representation of how GDP has generally changed over time across all included countries. In this, we can observe substantial increases, from an average of 71.27 Billion USD in 1960 to approximately 3,599.79 Billion USD in 2022.
In addition to the overall trend, it may also be useful to compare GDP growth across a selected number of countries. For clarity, the following comparison will be limited to six countries.
selected_gdp_countries <- c("United States", "Guyana", "China", "India", "Brazil", "Nigeria")
gdp_selected <- gdp_tidy %>%
filter(country %in% selected_gdp_countries)
ggplot(gdp_selected, aes(x = year, y = gdp, color = country)) +
geom_line() +
scale_y_log10() +
labs(
title = "GDP Trends for Selected Countries",
x = "Year",
y = "GDP",
color = "Country"
) +
theme(plot.title = element_text(hjust = 0.5))The visual comparison above allows us to more easily interpret and compare the GDP growth trends of the selected countries over time. In general, all six countries exhibit an overall upward trend in GDP, though at very different magnitudes. China demonstrates the most pronounced acceleration in growth, especially from the early 2000s onward. India and Brazil also show sustained long-term growth, Nigeria displays a generally rising but somewhat more variable pattern, and Guyana, though remaining the smallest economy among the selected countries, still shows noticeable expansion in the later years.
It should also be noted that the y-axis has been transformed to a logarithmic (base-10) scale, meaning that the plotted values no longer represent equal absolute increases, but rather proportional ones; that is, equal vertical distances now correspond to multiplicative changes in GDP, which makes it easier to compare the growth trajectories of both larger and smaller economies within the same visualization.
The final analytical endeavor for this second dataset, and the last of classmate Sinem’s question to be addressed is that of determining the top-performing countries in terms of economic output.
library(countrycode)
valid_codes <- unique(countrycode::codelist$iso3c)
top_gdp_countries <- gdp_tidy %>%
filter(country_code %in% valid_codes) %>%
filter(year == max(year, na.rm = TRUE)) %>%
drop_na(gdp) %>%
arrange(desc(gdp)) %>%
slice_head(n = 10) %>%
mutate(gdp_billions_usd = gdp / 1e9)
top_gdp_countries %>%
select(country, country_code, year, gdp_billions_usd) %>%
gt() %>%
tab_header(
title = "Top-Performing Countries by GDP",
subtitle = "Highest GDP values in the most recent year available"
) %>%
cols_label(
country = "Country",
country_code = "Country Code",
year = "Year",
gdp_billions_usd = "GDP (Billions USD)"
) %>%
fmt_number(
columns = gdp_billions_usd,
decimals = 2
)| Top-Performing Countries by GDP | |||
| Highest GDP values in the most recent year available | |||
| Country | Country Code | Year | GDP (Billions USD) |
|---|---|---|---|
| United States | USA | 2022 | 25,462.70 |
| China | CHN | 2022 | 17,963.20 |
| Japan | JPN | 2022 | 4,231.14 |
| Germany | DEU | 2022 | 4,072.19 |
| India | IND | 2022 | 3,385.09 |
| United Kingdom | GBR | 2022 | 3,070.67 |
| France | FRA | 2022 | 2,782.91 |
| Russian Federation | RUS | 2022 | 2,240.42 |
| Canada | CAN | 2022 | 2,139.84 |
| Italy | ITA | 2022 | 2,010.43 |
The above summary table highlights the countries with the highest levels of economic output in the most recent year available within the dataset. From this ranking, it is evident that the United States and China occupy the top two positions by a substantial margin, reflecting the scale of their respective economies. The remaining countries (such as Japan, Germany, India, the United Kingdom, etc.) also demonstrate significant economic output, though at levels notably below the two leading economies. Overall, the table illustrates how a relatively small number of countries account for a large share of global GDP, emphasizing the concentration of economic activity among the world’s largest economies.
From the analyses conducted on the tidied GDP dataset, a clear long-term expansion in economic output can be observed across the examined period. The average GDP values show a steady increase from the early decades of the dataset to the most recent year available, indicating substantial global economic growth over time. The country-level comparison further illustrates that while many countries experience upward economic trajectories, the magnitude of growth differs considerably between them. Larger economies such as the United States and China dominate the global GDP figures, while other countries exhibit notable but comparatively smaller levels of economic output. Overall, these results suggest that although economic growth has occurred broadly across countries, global economic activity remains concentrated among a relatively small number of the world’s largest economies.
At this stage, the renewable energy consumption dataset was obtained from the World Bank source originally identified in classmate Ciara Bonnett-Jones’ discussion post, and then committed to my personal GitHub repository in its original wide format. The file preserves the dataset’s original structure, wherein the year values are spread across multiple columns, and will now be imported into RStudio for tidying and subsequent analysis.
Now that the raw wide format renewable energy dataset has been committed and made accessible, the next step is to import it into RStudio and apply a reproducible tidying pipeline. As with the crude birth rate dataset, this file contains metadata rows before the true variable names begin, so these rows must first be removed. Thereafter, the dataset can be reshaped from wide to long format using pivot_longer(), with the year values converted into a tidy variable and the renewable energy figures converted into numeric form for analysis.
renewable_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20Two%20Assignment/renewable_energy_consumption_dataset.csv"
renewable_raw <- read_csv(renewable_url, col_names = FALSE, show_col_types = FALSE)
glimpse(renewable_raw)Rows: 271
Columns: 70
$ X1 <chr> "Data Source", NA, "Last Updated Date", NA, "Country Name", "Aruba…
$ X2 <chr> "World Development Indicators", NA, "24/02/2026", NA, "Country Cod…
$ X3 <chr> NA, NA, NA, NA, "Indicator Name", "Renewable energy consumption (%…
$ X4 <chr> NA, NA, NA, NA, "Indicator Code", "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.Z…
$ X5 <dbl> NA, NA, NA, NA, 1960, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X6 <dbl> NA, NA, NA, NA, 1961, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X7 <dbl> NA, NA, NA, NA, 1962, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X8 <dbl> NA, NA, NA, NA, 1963, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X9 <dbl> NA, NA, NA, NA, 1964, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X10 <dbl> NA, NA, NA, NA, 1965, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X11 <dbl> NA, NA, NA, NA, 1966, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X12 <dbl> NA, NA, NA, NA, 1967, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X13 <dbl> NA, NA, NA, NA, 1968, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X14 <dbl> NA, NA, NA, NA, 1969, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X15 <dbl> NA, NA, NA, NA, 1970, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X16 <dbl> NA, NA, NA, NA, 1971, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X17 <dbl> NA, NA, NA, NA, 1972, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X18 <dbl> NA, NA, NA, NA, 1973, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X19 <dbl> NA, NA, NA, NA, 1974, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X20 <dbl> NA, NA, NA, NA, 1975, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X21 <dbl> NA, NA, NA, NA, 1976, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X22 <dbl> NA, NA, NA, NA, 1977, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X23 <dbl> NA, NA, NA, NA, 1978, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X24 <dbl> NA, NA, NA, NA, 1979, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X25 <dbl> NA, NA, NA, NA, 1980, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X26 <dbl> NA, NA, NA, NA, 1981, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X27 <dbl> NA, NA, NA, NA, 1982, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X28 <dbl> NA, NA, NA, NA, 1983, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X29 <dbl> NA, NA, NA, NA, 1984, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X30 <dbl> NA, NA, NA, NA, 1985, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X31 <dbl> NA, NA, NA, NA, 1986, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X32 <dbl> NA, NA, NA, NA, 1987, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X33 <dbl> NA, NA, NA, NA, 1988, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X34 <dbl> NA, NA, NA, NA, 1989, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X35 <dbl> NA, NA, NA, NA, 1990.000000, 0.300000, 60.842404, 23.000000, 85.89…
$ X36 <dbl> NA, NA, NA, NA, 1991.000000, 0.200000, 62.133295, 23.700000, 85.26…
$ X37 <dbl> NA, NA, NA, NA, 1992.000000, 0.200000, 63.940789, 27.400000, 83.81…
$ X38 <dbl> NA, NA, NA, NA, 1993.000000, 0.200000, 64.651965, 28.500000, 85.10…
$ X39 <dbl> NA, NA, NA, NA, 1994.000000, 0.200000, 65.145067, 30.100000, 86.55…
$ X40 <dbl> NA, NA, NA, NA, 1995.000000, 0.200000, 64.721535, 31.800000, 86.14…
$ X41 <dbl> NA, NA, NA, NA, 1996.000000, 0.200000, 63.929272, 33.900000, 85.11…
$ X42 <dbl> NA, NA, NA, NA, 1997.000000, 0.200000, 63.241293, 36.400000, 83.91…
$ X43 <dbl> NA, NA, NA, NA, 1998.000000, 0.200000, 63.962096, 38.000000, 84.63…
$ X44 <dbl> NA, NA, NA, NA, 1999.000000, 0.200000, 65.058401, 43.400000, 84.00…
$ X45 <dbl> NA, NA, NA, NA, 2000.000000, 0.200000, 65.385867, 45.000000, 83.10…
$ X46 <dbl> NA, NA, NA, NA, 2001.000000, 0.200000, 65.817363, 45.600000, 81.93…
$ X47 <dbl> NA, NA, NA, NA, 2002.000000, 0.200000, 65.099281, 37.800000, 81.73…
$ X48 <dbl> NA, NA, NA, NA, 2003.000000, 0.200000, 64.615124, 36.700000, 80.71…
$ X49 <dbl> NA, NA, NA, NA, 2004.000000, 0.200000, 63.678035, 44.200000, 80.79…
$ X50 <dbl> NA, NA, NA, NA, 2005.000000, 0.200000, 63.109098, 33.900000, 80.93…
$ X51 <dbl> NA, NA, NA, NA, 2006.000000, 0.200000, 62.738124, 31.900000, 81.87…
$ X52 <dbl> NA, NA, NA, NA, 2007.000000, 0.200000, 61.505169, 28.800000, 82.33…
$ X53 <dbl> NA, NA, NA, NA, 2008.000000, 0.200000, 61.908027, 21.200000, 81.21…
$ X54 <dbl> NA, NA, NA, NA, 2009.000000, 0.300000, 61.304601, 16.500000, 82.39…
$ X55 <dbl> NA, NA, NA, NA, 2010.000000, 5.500000, 62.876387, 15.200000, 80.68…
$ X56 <dbl> NA, NA, NA, NA, 2011.000000, 5.700000, 62.805279, 12.600000, 79.28…
$ X57 <dbl> NA, NA, NA, NA, 2012.000000, 6.900000, 61.409964, 15.400000, 78.48…
$ X58 <dbl> NA, NA, NA, NA, 2013.000000, 6.900000, 60.801799, 16.900000, 76.70…
$ X59 <dbl> NA, NA, NA, NA, 2014.000000, 6.900000, 61.235882, 19.100000, 75.74…
$ X60 <dbl> NA, NA, NA, NA, 2015.000000, 6.700000, 61.670278, 17.700000, 76.18…
$ X61 <dbl> NA, NA, NA, NA, 2016.000000, 7.000000, 61.822884, 20.200000, 75.79…
$ X62 <dbl> NA, NA, NA, NA, 2017.000000, 6.800000, 61.426950, 19.500000, 76.10…
$ X63 <dbl> NA, NA, NA, NA, 2018.000000, 8.600000, 61.587530, 18.300000, 75.47…
$ X64 <dbl> NA, NA, NA, NA, 2019.000000, 8.300000, 62.690710, 18.900000, 75.07…
$ X65 <dbl> NA, NA, NA, NA, 2020.000000, 9.100000, 65.782380, 18.200000, 75.84…
$ X66 <dbl> NA, NA, NA, NA, 2021.0, 8.8, NA, 20.0, NA, 52.9, 41.9, 18.4, NA, 1…
$ X67 <dbl> NA, NA, NA, NA, 2022.0, 8.8, NA, 20.0, NA, NA, NA, 18.7, NA, NA, N…
$ X68 <dbl> NA, NA, NA, NA, 2023, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X69 <dbl> NA, NA, NA, NA, 2024, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ X70 <dbl> NA, NA, NA, NA, 2025, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
head(renewable_raw)# A tibble: 6 × 70
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Data … Worl… <NA> <NA> NA NA NA NA NA NA NA NA NA
2 <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA
3 Last … 24/0… <NA> <NA> NA NA NA NA NA NA NA NA NA
4 <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA
5 Count… Coun… Indi… Indi… 1960 1961 1962 1963 1964 1965 1966 1967 1968
6 Aruba ABW Rene… EG.F… NA NA NA NA NA NA NA NA NA
# ℹ 57 more variables: X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>,
# X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>,
# X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>,
# X31 <dbl>, X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>,
# X37 <dbl>, X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>,
# X43 <dbl>, X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
# X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, …
Upon import, the dataset contains metadata rows above the true headers. These first four rows must therefore be removed, and the new first row must be promoted to the variable names.
renewable_cleaned <- renewable_raw %>%
slice(-1, -2, -3, -4)
colnames(renewable_cleaned) <- as.character(renewable_cleaned[1,])
renewable_cleaned <- renewable_cleaned %>%
slice(-1)
head(renewable_cleaned)# A tibble: 6 × 70
`Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Aruba ABW Renewable energ… EG.FEC.RNEW.ZS NA NA
2 Africa Eastern… AFE Renewable energ… EG.FEC.RNEW.ZS NA NA
3 Afghanistan AFG Renewable energ… EG.FEC.RNEW.ZS NA NA
4 Africa Western… AFW Renewable energ… EG.FEC.RNEW.ZS NA NA
5 Angola AGO Renewable energ… EG.FEC.RNEW.ZS NA NA
6 Albania ALB Renewable energ… EG.FEC.RNEW.ZS NA NA
# ℹ 64 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
# `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
The cleaned dataset can now have its column names standardized prior to reshaping.
renewable_cleaned <- renewable_cleaned %>%
clean_names()
glimpse(renewable_cleaned)Rows: 266
Columns: 70
$ country_name <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ country_code <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ indicator_name <chr> "Renewable energy consumption (% of total final energy …
$ indicator_code <chr> "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS", "…
$ x1960 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1961 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1962 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1963 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1964 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1965 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1966 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1967 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1968 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1969 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1970 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1971 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1972 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1973 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1974 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1975 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1976 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1977 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1978 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1979 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1980 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1981 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1982 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1983 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1984 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1985 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1986 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1987 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1988 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1989 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x1990 <dbl> 0.300000, 60.842404, 23.000000, 85.895442, 72.300000, 2…
$ x1991 <dbl> 0.200000, 62.133295, 23.700000, 85.262074, 71.900000, 3…
$ x1992 <dbl> 0.200000, 63.940789, 27.400000, 83.818324, 72.700000, 4…
$ x1993 <dbl> 0.200000, 64.651965, 28.500000, 85.103232, 71.300000, 5…
$ x1994 <dbl> 0.200000, 65.145067, 30.100000, 86.557685, 72.200000, 5…
$ x1995 <dbl> 0.200000, 64.721535, 31.800000, 86.148094, 73.800000, 5…
$ x1996 <dbl> 0.200000, 63.929272, 33.900000, 85.118648, 72.900000, 5…
$ x1997 <dbl> 0.200000, 63.241293, 36.400000, 83.911059, 73.400000, 5…
$ x1998 <dbl> 0.200000, 63.962096, 38.000000, 84.634537, 76.400000, 4…
$ x1999 <dbl> 0.200000, 65.058401, 43.400000, 84.005039, 72.800000, 4…
$ x2000 <dbl> 0.200000, 65.385867, 45.000000, 83.101512, 73.800000, 4…
$ x2001 <dbl> 0.200000, 65.817363, 45.600000, 81.939243, 72.800000, 3…
$ x2002 <dbl> 0.200000, 65.099281, 37.800000, 81.735528, 70.800000, 3…
$ x2003 <dbl> 0.200000, 64.615124, 36.700000, 80.710450, 65.600000, 3…
$ x2004 <dbl> 0.200000, 63.678035, 44.200000, 80.790537, 62.300000, 3…
$ x2005 <dbl> 0.200000, 63.109098, 33.900000, 80.934034, 68.900000, 3…
$ x2006 <dbl> 0.200000, 62.738124, 31.900000, 81.873310, 62.600000, 3…
$ x2007 <dbl> 0.200000, 61.505169, 28.800000, 82.330015, 58.900000, 3…
$ x2008 <dbl> 0.200000, 61.908027, 21.200000, 81.211170, 55.400000, 3…
$ x2009 <dbl> 0.300000, 61.304601, 16.500000, 82.390213, 53.000000, 3…
$ x2010 <dbl> 5.500000, 62.876387, 15.200000, 80.687806, 53.100000, 3…
$ x2011 <dbl> 5.700000, 62.805279, 12.600000, 79.282287, 51.700000, 3…
$ x2012 <dbl> 6.900000, 61.409964, 15.400000, 78.481230, 49.000000, 3…
$ x2013 <dbl> 6.900000, 60.801799, 16.900000, 76.701499, 46.800000, 4…
$ x2014 <dbl> 6.900000, 61.235882, 19.100000, 75.741009, 47.500000, 3…
$ x2015 <dbl> 6.700000, 61.670278, 17.700000, 76.181671, 47.100000, 3…
$ x2016 <dbl> 7.000000, 61.822884, 20.200000, 75.796840, 48.100000, 3…
$ x2017 <dbl> 6.800000, 61.426950, 19.500000, 76.101681, 52.500000, 3…
$ x2018 <dbl> 8.600000, 61.587530, 18.300000, 75.476578, 52.500000, 3…
$ x2019 <dbl> 8.300000, 62.690710, 18.900000, 75.078788, 51.000000, 4…
$ x2020 <dbl> 9.100000, 65.782380, 18.200000, 75.846980, 60.100000, 4…
$ x2021 <dbl> 8.8, NA, 20.0, NA, 52.9, 41.9, 18.4, NA, 1.0, 9.2, 9.1,…
$ x2022 <dbl> 8.8, NA, 20.0, NA, NA, NA, 18.7, NA, NA, NA, NA, 0.4, 0…
$ x2023 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x2024 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x2025 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
head(renewable_cleaned)# A tibble: 6 × 70
country_name country_code indicator_name indicator_code x1960 x1961 x1962
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Aruba ABW Renewable ene… EG.FEC.RNEW.ZS NA NA NA
2 Africa Eastern a… AFE Renewable ene… EG.FEC.RNEW.ZS NA NA NA
3 Afghanistan AFG Renewable ene… EG.FEC.RNEW.ZS NA NA NA
4 Africa Western a… AFW Renewable ene… EG.FEC.RNEW.ZS NA NA NA
5 Angola AGO Renewable ene… EG.FEC.RNEW.ZS NA NA NA
6 Albania ALB Renewable ene… EG.FEC.RNEW.ZS NA NA NA
# ℹ 63 more variables: x1963 <dbl>, x1964 <dbl>, x1965 <dbl>, x1966 <dbl>,
# x1967 <dbl>, x1968 <dbl>, x1969 <dbl>, x1970 <dbl>, x1971 <dbl>,
# x1972 <dbl>, x1973 <dbl>, x1974 <dbl>, x1975 <dbl>, x1976 <dbl>,
# x1977 <dbl>, x1978 <dbl>, x1979 <dbl>, x1980 <dbl>, x1981 <dbl>,
# x1982 <dbl>, x1983 <dbl>, x1984 <dbl>, x1985 <dbl>, x1986 <dbl>,
# x1987 <dbl>, x1988 <dbl>, x1989 <dbl>, x1990 <dbl>, x1991 <dbl>,
# x1992 <dbl>, x1993 <dbl>, x1994 <dbl>, x1995 <dbl>, x1996 <dbl>, …
Now, the dataset may be converted from wide to long format.
renewable_tidy <- renewable_cleaned %>%
pivot_longer(
cols = -c(country_name, country_code, indicator_name, indicator_code),
names_to = "year",
values_to = "renewable_energy_pct"
) %>%
mutate(
year = str_extract(year, "\\d{4}") %>% as.integer(),
renewable_energy_pct = as.numeric(renewable_energy_pct)
)
glimpse(renewable_tidy)Rows: 17,556
Columns: 6
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aru…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", …
$ indicator_name <chr> "Renewable energy consumption (% of total final e…
$ indicator_code <chr> "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.…
$ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1…
$ renewable_energy_pct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
head(renewable_tidy)# A tibble: 6 × 6
country_name country_code indicator_name indicator_code year
<chr> <chr> <chr> <chr> <int>
1 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1960
2 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1961
3 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1962
4 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1963
5 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1964
6 Aruba ABW Renewable energy consumption (… EG.FEC.RNEW.ZS 1965
# ℹ 1 more variable: renewable_energy_pct <dbl>
At this section within the Dataset Three segment, the dataset has now been converted into a tidy format, with one row representing one country/region per year observation. Missing renewable energy values will be retained as NAs for now and only filtered out where necessary for the specific analysis.
The analysis requested for this dataset centers on comparing the growth rate in renewable energy usage across different geographic regions over a ten-year period. As such, the first step is to isolate the regional groupings from the larger dataset and restrict the dataset to a recent ten-year window.
selected_regions <- c(
"Africa Eastern and Southern",
"East Asia & Pacific",
"Europe & Central Asia",
"Latin America & Caribbean",
"North America",
"South Asia"
)
renewable_regions <- renewable_tidy %>%
filter(country_name %in% selected_regions) %>%
filter(year >= 2013, year <= 2022) %>%
rename(
region = country_name,
region_code = country_code
)
renewable_regions# A tibble: 60 × 6
region region_code indicator_name indicator_code year renewable_energy_pct
<chr> <chr> <chr> <chr> <int> <dbl>
1 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2013 60.8
2 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2014 61.2
3 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2015 61.7
4 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2016 61.8
5 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2017 61.4
6 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2018 61.6
7 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2019 62.7
8 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2020 65.8
9 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2021 NA
10 Africa … AFE Renewable ene… EG.FEC.RNEW.ZS 2022 NA
# ℹ 50 more rows
The following visualization can then be constructed in order to compare the growth trajectories of the six selected geographic regions.
ggplot(renewable_regions, aes(x = year, y = renewable_energy_pct, color = region)) +
geom_line(linewidth = 1) +
geom_point(size = 1.3) +
scale_y_continuous(limits = c(9, 67)) +
labs(
title = "Renewable Energy Usage Across Selected Regions\n(2013–2022)",
x = "Year",
y = "Renewable Energy Consumption (%)",
color = "Region"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 16),
legend.position = "bottom",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)The above visualization illustrates the changes in renewable energy consumption across the selected geographical regions over the 2013 to 2022 period. In general, most regions exhibit a gradual upward trend, suggesting an increasing share of renewable energy within overall energy consumption. Africa Eastern and Southern consistently maintains the highest proportion of renewable energy usage, while Latin America and the Caribbean also appears to show steady growth over the period. Europe and Central Asia, East Asia and the Pacific, and North America remain comparatively lower, though each demonstrates modest increases over time. South Asia, while experiencing a slight decline in the earlier years, appears to stabilize and rise again toward the end of the observed period. Overall, the figure highlights a gradual but observable shift toward greater renewable energy adoption across multiple regions.
From the analysis conducted upon this tidied renewable energy dataset, it can be observed that the selected geographic regions generally experienced either stable or gradually increasing levels of renewable energy consumption over the examined ten-year period. While the magnitude of usage differs notably between regions, the overall pattern suggests a broad movement toward greater reliance on renewable energy sources. Africa Eastern and Southern remained the highest throughout the period, while regions such as Latin America and the Caribbean also showed visible growth. Collectively, these findings suggest that the transition toward renewable energy is occurring across multiple parts of the world, though at uneven rates and from different starting points.
Across the three datasets analyzed in this project, the results illustrate how different types of global indicators (for instance, demographic, economic, and environmental) can be explored through a consistent data tidying and analytical workflow in R. By transforming each dataset into a tidy format and applying simple exploratory analyses and visualizations, it becomes possible to observe meaningful patterns such as changes in birth rates, long-term economic growth across countries, and even gradual increases in renewable energy adoption across regions. Overall, the exercise demonstrates how reproducible data preparation and visualization techniques can help reveal broader global trends from publicly available datasets.