This section walks through each step of the wrangling process. By the end of the section, we will have a single dataframe, where each row represents a single country with its five relevant indicators.
Loading Packages
First, we need to load in the appropriate packages. For wrangling, we only need tidyverse and countrycode. The tidyverse package is a large and diverse implementation containing libraries to help with reading, wrangling, and visualizing data. The countrycode package helps convert between different country identifiers, which is necessary for adequately merging our data in the final wrangling step.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(countrycode)
Warning: package 'countrycode' was built under R version 4.4.1
Tidying Data
The next step is reading in the four data sets and tidying the data to merge the data frames properly. Generally, the wrangling step of any analysis is the most involved, and this analysis is no exception:
CO2pc
The first CSV, entitled “CO2pc.csv,” is from Our World in Data and contains CO2 emissions per capita numbers for countries from as early as the 1700s. The file has 26,600 rows.
co2pc <-read_csv("Datasets/CO2pc.csv")
Rows: 26600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Entity, Code
dbl (2): Year, Annual CO₂ emissions (per capita)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First, there are more years than we need for our analysis; we want to base our models on 2022, so we should filter out the extraneous years. Once we filter out the unwanted years, the Year column becomes unnecessary, so we should remove it, too.
In addition, the current column names are verbose and vague. For one, we should change the Entity column to country; OWiD initially used the term Entity because there were more than just countries in the data set, but we will remove those observations. As specified by the meta data, the Code column uses iso3c formatting, so we should also make that explicit by renaming the column. Lastly, Annual CO₂ emissions (per capita) is an unnecessarily long column name.
Since this data set contains more than just countries, we need an efficient mechanism to remove these non-country observations. Luckily, since OWiD uses iso3c codes, most of these sub-regions and super-regions have NA values for their iso3c values, meaning we can simply call drop_na() to remove all of these unwanted observations.
co2pc_clean <- co2pc |>filter(Year ==2022) |># filter for 2022select(!Year) |># remove year columnrename( # rename columns"country"= Entity,"iso3c"= Code,"co2pc"=3# using index instead of name for convenience ) |>drop_na() # remove non-country level observationshead(co2pc_clean)
# A tibble: 6 × 3
country iso3c co2pc
<chr> <chr> <dbl>
1 Afghanistan AFG 0.295
2 Albania ALB 1.74
3 Algeria DZA 3.93
4 Andorra AND 4.62
5 Angola AGO 0.452
6 Anguilla AIA 8.75
LPI and GDPpc
The second CSV, entitled “LPI_GDPpc.csv,” is from the World Bank and contains countries’ 2022 LPI index ratings and GDP per capita numbers. The file has 537 rows.
lpi_gdppc <-read_csv("Datasets/LPI_GDPpc.csv")
Rows: 537 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2022 [YR2022]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When we call head() on lpi_gdppc, we see many issues:
head(lpi_gdppc)
# A tibble: 6 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2022 [YR2022]`
<chr> <chr> <chr> <chr> <chr>
1 Argentina ARG GDP per capita (c… NY.GDP.PCAP.… 13650.60462945…
2 Argentina ARG Logistics perform… LP.LPI.OVRL.… 2.8
3 Australia AUS GDP per capita (c… NY.GDP.PCAP.… 65077.676668821
4 Australia AUS Logistics perform… LP.LPI.OVRL.… 3.7
5 Brazil BRA GDP per capita (c… NY.GDP.PCAP.… 9065.497333954…
6 Brazil BRA Logistics perform… LP.LPI.OVRL.… 3.2
Perhaps the biggest issue is that each row represents a single country metric. For instance, Argentina has a row for its 2022 LPI and a row for its 2022 GDP per capita. We need to pivot our columns so that each country is represented in a row with both 2022 metrics. However, several adjustments should be made before pivoting to streamline the process.
First, we can remove the unnecessary column Series Name. Based on the head of the data set, we also see that the 2022 [YR2022] column, which contains the LPI and GDP per capita metrics, was interpreted as having character values instead of numeric values; we can resolve this problem by manually converting the column to numeric with as.numeric(). Again, converting to numeric now will streamline the subsequent wrangling steps.
A more subtle issue is displayed in the following code chunk:
head(lpi_gdppc |>group_by(`Country Name`) |># group by country namesummarize(count =n()) |># show totalsarrange(-count)) # sort in descending order
# A tibble: 6 × 2
`Country Name` count
<chr> <int>
1 <NA> 3
2 Afghanistan 2
3 Africa Eastern and Southern 2
4 Africa Western and Central 2
5 Albania 2
6 Algeria 2
Most countries appear only to have one or two rows associated with them, which is what we expect. In these cases, pivot_wider() will work unambiguously and as intended. However, three rows are missing Country Name and Country Code. Missing values are generally problematic, but, in this case, having three rows associated with a single ID column will cause ambiguous pivoting; instead of numbers in our pivoted data frame columns, we will have list-objects. We can remove all rows containing missing values before pivoting to avoid this. This will remove these problematic rows and other rows explicitly missing LPI or GDP per capita values—observations we will not include in our final analysis anyway.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `2022 [YR2022] = as.numeric(`2022 [YR2022]`)`.
Caused by warning:
! NAs introduced by coercion
head(lpi_gdppc_pre)
# A tibble: 6 × 4
`Country Name` `Country Code` `Series Code` `2022 [YR2022]`
<chr> <chr> <chr> <dbl>
1 Argentina ARG NY.GDP.PCAP.CD 13651.
2 Argentina ARG LP.LPI.OVRL.XQ 2.8
3 Australia AUS NY.GDP.PCAP.CD 65078.
4 Australia AUS LP.LPI.OVRL.XQ 3.7
5 Brazil BRA NY.GDP.PCAP.CD 9065.
6 Brazil BRA LP.LPI.OVRL.XQ 3.2
With these issues resolved, we can now use pivot_wider() to condense each country into a single observation. Additionally, we will want to rename all the column names since they are lengthy or vague. We can also call drop_na(), removing any observations with missing metric values. This may seem like a repetitive step since we called drop_na() earlier, but it is not. To illustrate why, consider the following code chunk:
head(lpi_gdppc_pre |>group_by(`Country Name`) |># group by country namesummarize(count =n()) |># show totalsarrange(count)) # sort in ascending order
In its current construction, lpi_gdppc_pre contains countries with only one row of observations, which means they are missing data on LPI or GDP per capita. While they are missing data, there are no actual NA values in these columns, so a pre-pivot drop_na() will not remove these countries; these missing values are fittingly known as implicit missing values. However, when we run pivot_wider(), the consolidated rows for these countries will contain NA values. Thus, we remove these missing values by calling drop_na() after pivoting.
lpi_gdppc_pivot <- lpi_gdppc_pre |>pivot_wider( # pivot on `Series Code` to widen rowsnames_from =`Series Code`,values_from =`2022 [YR2022]` ) |>rename( # rename for simplicitycountry =1, # referencing using indicescode =2,gdppc =3,lpi =4 ) |>drop_na() # remove countries with implicit NA valueshead(lpi_gdppc_pivot)
# A tibble: 6 × 4
country code gdppc lpi
<chr> <chr> <dbl> <dbl>
1 Argentina ARG 13651. 2.8
2 Australia AUS 65078. 3.7
3 Brazil BRA 9065. 3.2
4 China CHN 12663. 3.7
5 France FRA 40886. 3.9
6 Germany DEU 48718. 4.1
The final step is imperative for merging the data. Recall that the source is World Bank, and they use a custom country identifier that differs from iso3c. Since we want to merge the data sets using iso3c codes, we must convert the current World Bank codes to iso3c. Luckily, the countrycodes can convert World Bank country codes to iso3c. After doing so, we can remove the original code column and relocate the iso3c column for viewing convenience. Again, we can call drop_na(). This time, it will remove observations that do not have an iso3c code but did have a custom World Bank code, which implies that they are super or sub-regions of interest for the World Bank but not countries.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(sourcevar = code, origin = "wb",
destination = "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: AFE, AFW, ARB, CEB, CSS, EAP, EAR, EAS, ECA, ECS, EMU, EUU, FCS, HIC, HPC, IBD, IBT, IDA, IDB, IDX, LAC, LCN, LDC, LIC, LMC, LMY, LTE, MEA, MIC, MNA, NAC, OED, OSS, PRE, PSS, PST, SAS, SSA, SSF, SST, TEA, TEC, TLA, TMN, TSA, TSS, UMC, WLD
head(lpi_gdppc_clean)
# A tibble: 6 × 4
country iso3c gdppc lpi
<chr> <chr> <dbl> <dbl>
1 Argentina ARG 13651. 2.8
2 Australia AUS 65078. 3.7
3 Brazil BRA 9065. 3.2
4 China CHN 12663. 3.7
5 France FRA 40886. 3.9
6 Germany DEU 48718. 4.1
NCI
The third CSV, entitled “NCI.csv”, is from BioDB and contains the 2022 National Conservation Index (NCI) ratings for 180 countries. The file has 180 rows.
nci <-read_csv("Datasets/NCI.csv")
Rows: 180 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, Change
dbl (2): Rank, Score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When we call head() on nci, we notice several things:
head(nci)
# A tibble: 6 × 4
Rank Country Score Change
<dbl> <chr> <dbl> <chr>
1 1 Luxembourg 70.8 N/A
2 2 Estonia 70.5 N/A
3 3 Denmark 69 N/A
4 4 Finland 66.9 N/A
5 5 United Kingdom 66.6 N/A
6 6 Zimbabwe 65.9 N/A
The nci data frame is already in solid shape compared to the previous data sets; we need to make only minor adjustments. First, the Rank and Change columns are unnecessary, so we can remove them. We can also rename the remaining columns to match the other data sets. As always, we will remove missing data rows using drop_na().
# A tibble: 6 × 2
country nci
<chr> <dbl>
1 Luxembourg 70.8
2 Estonia 70.5
3 Denmark 69
4 Finland 66.9
5 United Kingdom 66.6
6 Zimbabwe 65.9
Currently, nci_pre does not contain an iso3c column, which is crucial for merging. We can create a column containing iso3c columns by using countrycode again. In this case, the nci_pre data set does not include a custom country identifier, so we will rely on countrycode’s ability to convert English representations of country names to iso3c. We will then shuffle our columns for viewing convenience and remove rows that contain missing values. In this case, a row with a missing value implies that the entity does not have an iso3c code, meaning it is not a country and should be removed.
# A tibble: 6 × 3
country iso3c nci
<chr> <chr> <dbl>
1 Luxembourg LUX 70.8
2 Estonia EST 70.5
3 Denmark DNK 69
4 Finland FIN 66.9
5 United Kingdom GBR 66.6
6 Zimbabwe ZWE 65.9
Percent Renewable
The fourth CSV, entitled “perRenewable.csv”, is from the International Renewable Energy Agency (IRENA) and contains countries’ total renewable energy production as a percentage of their total production in 2022. The file has 233 rows.
Rows: 233 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Region/country/area, Indicator, Renewable energy share of electrici...
dbl (1): Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(percent_renewable)
Rows: 233
Columns: 4
$ `Region/country/area` <chr> "W…
$ Indicator <chr> "R…
$ Year <dbl> 20…
$ `Renewable energy share of electricity capacity and generation (%)` <chr> "2…
When we call head() on percent_renewable, we notice several things:
head(percent_renewable)
# A tibble: 6 × 4
`Region/country/area` Indicator Year Renewable energy sha…¹
<chr> <chr> <dbl> <chr>
1 World RE share of el… 2022 29.07
2 Africa RE share of el… 2022 22.78
3 Asia RE share of el… 2022 26.15
4 Central America and the Caribbean RE share of el… 2022 38.33
5 Eurasia RE share of el… 2022 23.48
6 Europe RE share of el… 2022 40.47
# ℹ abbreviated name:
# ¹`Renewable energy share of electricity capacity and generation (%)`
First, we notice that the Indicator and Year columns are unnecessary. The remaining columns have verbose or vague names, so we will rename those. Also, the column that contains the renewable energy metrics should have been interpreted as a column containing numbers, but it was not. Thus, we will use as.numeric() to convert the column to numeric values manually. Finally, we will use drop_na() to remove any observations that have missing values.
percent_renewable_pre <- percent_renewable |>select(c(1, 4)) |># select desired columns using indicesrename( # rename columns to match other data frames"country"=1,"percent_renewable"=2 ) |>mutate( # convert percent_renewable to numericpercent_renewable =as.numeric(percent_renewable) ) |>drop_na() # remove observations with missing values
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `percent_renewable = as.numeric(percent_renewable)`.
Caused by warning:
! NAs introduced by coercion
Currently, percent_renewable_pre does not contain an iso3c column, which will be needed for merging. We can create a column containing iso3c columns by using countrycode again. In this case, the percent_renewable_pre data set does not contain a custom country identifier, so we will rely on countrycode’s ability to convert English representations of country names to iso3c. We will then shuffle our columns for viewing convenience and remove rows that contain missing values. In this case, a row with a missing value implies that the entity does not have an iso3c code, meaning it is not a country and should be removed.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(sourcevar = country, origin =
"country.name", destination = "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: Africa, Asia, Central America and the Caribbean, Eurasia, Europe, Kosovo, Middle East, North America, Oceania, Saint Martin, South America, World
head(percent_renewable_clean)
# A tibble: 6 × 3
country iso3c percent_renewable
<chr> <chr> <dbl>
1 Afghanistan AFG 78.0
2 Albania ALB 100
3 Algeria DZA 0.95
4 American Samoa ASM 3.05
5 Andorra AND 94.1
6 Angola AGO 76.3
Merging Data
With the four data frames tidied, we now need to merge them. To do so, we will chain together several inner_join() calls. Generally, inner_join() is the strictest joining method, as it only keeps observations in both data frames. While theoretically ideal, this joining strategy could be too ambitious for wild data. In our case, however, the data frames contain enough data to make this a feasible option.
merged_data <- co2pc_clean |>inner_join( # IJ 1: co2pc and lpi_gdppc lpi_gdppc_clean,by ="iso3c" ) |>select(!country.y) |># remove redundant country columninner_join( # IJ 2: (co2pc and lpi_gdppc) and nci_clean nci_clean,by ="iso3c" ) |>select(!country) |># remove redundant country columninner_join( # IJ 3: (co2pc and lpi_gdppc and nci) and percent_renewable percent_renewable_clean,by ="iso3c", ) |>select(!country) |># remove redundant country columnrename( # reanme remaining country columncountry ="country.x" )head(merged_data)
Our final data frame merged_data contains 131 countries. To begin with, lpi_gdppc constituted the bottleneck since there were only about 134 countries with published LPI and GDP per capita values. So, across the inner_join() chain, we preserved roughly 98% of the possible countries.
Exploring Distributions
We have merged our data, but we should inspect the distributions of each variable before proceeding with the analysis. Moreover, linear regression models make complex assumptions about the response and explanatory variables. If these assumptions are not met, then the model’s insights are likely invalid.
The specific assumptions are beyond the scope of this course, but variables that possess normal distributions generally (but not necessarily) satisfy most of these conditions. Accordingly, we should check if each of the variables is normally distributed. If they are not, we can alter the distribution by log transforming the variable. Sometimes, this may make the distribution more skewed, so we should still use the original variable. Other times, the transformation may make the variable appear normally distributed, which is desired.
CO2pc
Let us first look at the distribution for co2pc:
merged_data |>ggplot(aes(x = co2pc)) +geom_histogram(bins =12, fill ="olivedrab4") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country CO2 Emissions per Capita, 2022",x ="Count",y ="CO2 Emissions per Capita (ton/person)",caption ="Source: Our World in Data" ) +theme(text =element_text(family ="Luminari")) # cleaner font
Clearly, the distribution is strongly skewed right. We can try to correct this skew by log-transforming the gdppc variable:
merged_data |>ggplot(aes(x =log(co2pc))) +geom_histogram(bins =12, fill ="olivedrab4") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country Log(CO2 Emissions per Capita), 2022",x ="Count",y ="Log(CO2 Emissions per Capita (ton/person))",caption ="Source: Our World in Data" ) +theme(text =element_text(family ="Luminari")) # cleaner font
While this distribution still possesses a slight left skew, it is certainly less skewed than the original co2pc distribution. Thus, we will use log(co2pc) for all of our analyses.
LPI
Next, let us look at the distribution for lpi:
merged_data |>ggplot(aes(x = lpi)) +geom_histogram(bins =12, fill ="plum") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country LPI, 2022",x ="Logistics Performance Index (LPI)",y ="Count",caption ="Source: World Bank" ) +theme(text =element_text(family ="Luminari")) # cleaner font
Clearly, the distribution has a right skew. We can try to correct this skew by log-transforming the lpi variable:
merged_data |>ggplot(aes(x =log(lpi))) +geom_histogram(bins =12, fill ="plum") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country Log(LPI), 2022",x ="Log(Logistics Performance Index (LPI))",y ="Count",caption ="Source: World Bank" ) +theme(text =element_text(family ="Luminari")) # cleaner font
This distribution appears more balanced, but not by much. A more significant issue is that one could reasonably claim that the distribution now appears bimodal, with one center at roughly 0.9 and another at about 1.25; it looks as if there are two separate distributions superimposed on top of each other. While it is hard to make relative judgments, we would argue that this distribution appears less normal than the original one—especially if we consider it bimodal—so we will use untransformed lpi in our analyses.
GDPpc
Next, let us look at the distribution for gdppc:
merged_data |>ggplot(aes(x = gdppc)) +geom_histogram(bins =12, fill ="firebrick4") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country GDP per Capita, 2022",x ="GDP per Capita ($/Person)",y ="Count",caption ="Source: World Bank" ) +theme(text =element_text(family ="Luminari")) # cleaner font
Clearly, the distribution has a sizable right skew. We can try to correct this skew by log-transforming the gdppc variable:
merged_data |>ggplot(aes(x =log(gdppc))) +geom_histogram(bins =12, fill ="firebrick4") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country Log(GDP per Capita), 2022",x ="Log(GDP per Capita ($/Person))",y ="Count",caption ="Source: World Bank" ) +theme(text =element_text(family ="Luminari")) # cleaner font
While this distribution still possesses a slight left skew, it is certainly less skewed than the original gdppc distribution and appears somewhat normal looking. Thus, we will use log(gdppc) for all of our analyses.
NCI
Next, let us look at the distribution for nci:
merged_data |>ggplot(aes(x = nci)) +geom_histogram(bins =12, fill ="goldenrod3") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country NCI, 2022",x ="National Conservation Index (NCI)",y ="Count",caption ="Source: BioDB" ) +theme(text =element_text(family ="Luminari")) # cleaner font
The nci distribution above can be considered normal, especially for wild data. Accordingly, we do not need to adjust and will use un-transformed nci in our analyses.
Percent Renewable
Finally, let us look at the distribution for percent_renewable:
merged_data |>ggplot(aes(x = percent_renewable)) +geom_histogram(bins =12, fill ="slateblue3") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country Renewable Energy, 2022",x ="Renewable Energy (% of Total Energy)",y ="Count",caption ="Source: IRENA" ) +theme(text =element_text(family ="Luminari")) # cleaner font
Clearly, the distribution has a sizable right skew. We can try to correct this skew by log-transforming the percent_renewable variable:
merged_data |>ggplot(aes(x =log(percent_renewable))) +geom_histogram(bins =12, fill ="slateblue3") +# 12 binslabs( # axis and title labelstitle ="Distribution of Country Renewable Energy, 2022",x ="Log(Renewable Energy (% of Total Energy))",y ="Count",caption ="Source: IRENA" ) +theme(text =element_text(family ="Luminari")) # cleaner font
The log-transformed distribution of percent_renewable undoubtedly has a worse skew than the original distribution. Accordingly, we will use the un-transformed percent_renewable in our analyses.