Exploratory/Wrangling

Working with the Data

This section walks through each step of the wrangling process. By the end of the section, we will have a single dataframe, where each row represents a single country with its five relevant indicators.

Loading Packages

First, we need to load in the appropriate packages. For wrangling, we only need tidyverse and countrycode. The tidyverse package is a large and diverse implementation containing libraries to help with reading, wrangling, and visualizing data. The countrycode package helps convert between different country identifiers, which is necessary for adequately merging our data in the final wrangling step.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(countrycode)
Warning: package 'countrycode' was built under R version 4.4.1

Tidying Data

The next step is reading in the four data sets and tidying the data to merge the data frames properly. Generally, the wrangling step of any analysis is the most involved, and this analysis is no exception:

CO2pc

The first CSV, entitled “CO2pc.csv,” is from Our World in Data and contains CO2 emissions per capita numbers for countries from as early as the 1700s. The file has 26,600 rows.

co2pc <- read_csv("Datasets/CO2pc.csv")
Rows: 26600 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Entity, Code
dbl (2): Year, Annual CO₂ emissions (per capita)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(co2pc)
Rows: 26,600
Columns: 4
$ Entity                              <chr> "Afghanistan", "Afghanistan", "Afg…
$ Code                                <chr> "AFG", "AFG", "AFG", "AFG", "AFG",…
$ Year                                <dbl> 1949, 1950, 1951, 1952, 1953, 1954…
$ `Annual CO₂ emissions (per capita)` <dbl> 0.001992146, 0.011265611, 0.012097…

When we call head() on co2pc, we notice several issues:

head(co2pc)
# A tibble: 6 × 4
  Entity      Code   Year `Annual CO₂ emissions (per capita)`
  <chr>       <chr> <dbl>                               <dbl>
1 Afghanistan AFG    1949                             0.00199
2 Afghanistan AFG    1950                             0.0113 
3 Afghanistan AFG    1951                             0.0121 
4 Afghanistan AFG    1952                             0.0119 
5 Afghanistan AFG    1953                             0.0137 
6 Afghanistan AFG    1954                             0.0135 

First, there are more years than we need for our analysis; we want to base our models on 2022, so we should filter out the extraneous years. Once we filter out the unwanted years, the Year column becomes unnecessary, so we should remove it, too.

In addition, the current column names are verbose and vague. For one, we should change the Entity column to country; OWiD initially used the term Entity because there were more than just countries in the data set, but we will remove those observations. As specified by the meta data, the Code column uses iso3c formatting, so we should also make that explicit by renaming the column. Lastly, Annual CO₂ emissions (per capita) is an unnecessarily long column name.

Since this data set contains more than just countries, we need an efficient mechanism to remove these non-country observations. Luckily, since OWiD uses iso3c codes, most of these sub-regions and super-regions have NA values for their iso3c values, meaning we can simply call drop_na() to remove all of these unwanted observations.

co2pc_clean <- co2pc |>
  filter(Year == 2022) |> # filter for 2022
  select(!Year) |> # remove year column
  rename( # rename columns
    "country" = Entity,
    "iso3c" = Code,
    "co2pc" = 3  # using index instead of name for convenience
  ) |>
  drop_na() # remove non-country level observations

head(co2pc_clean)
# A tibble: 6 × 3
  country     iso3c co2pc
  <chr>       <chr> <dbl>
1 Afghanistan AFG   0.295
2 Albania     ALB   1.74 
3 Algeria     DZA   3.93 
4 Andorra     AND   4.62 
5 Angola      AGO   0.452
6 Anguilla    AIA   8.75 

LPI and GDPpc

The second CSV, entitled “LPI_GDPpc.csv,” is from the World Bank and contains countries’ 2022 LPI index ratings and GDP per capita numbers. The file has 537 rows.

lpi_gdppc <- read_csv("Datasets/LPI_GDPpc.csv")
Rows: 537 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2022 [YR2022]

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(lpi_gdppc)
Rows: 537
Columns: 5
$ `Country Name`  <chr> "Argentina", "Argentina", "Australia", "Australia", "B…
$ `Country Code`  <chr> "ARG", "ARG", "AUS", "AUS", "BRA", "BRA", "CHN", "CHN"…
$ `Series Name`   <chr> "GDP per capita (current US$)", "Logistics performance…
$ `Series Code`   <chr> "NY.GDP.PCAP.CD", "LP.LPI.OVRL.XQ", "NY.GDP.PCAP.CD", …
$ `2022 [YR2022]` <chr> "13650.6046294524", "2.8", "65077.676668821", "3.7", "…

When we call head() on lpi_gdppc, we see many issues:

head(lpi_gdppc)
# A tibble: 6 × 5
  `Country Name` `Country Code` `Series Name`      `Series Code` `2022 [YR2022]`
  <chr>          <chr>          <chr>              <chr>         <chr>          
1 Argentina      ARG            GDP per capita (c… NY.GDP.PCAP.… 13650.60462945…
2 Argentina      ARG            Logistics perform… LP.LPI.OVRL.… 2.8            
3 Australia      AUS            GDP per capita (c… NY.GDP.PCAP.… 65077.676668821
4 Australia      AUS            Logistics perform… LP.LPI.OVRL.… 3.7            
5 Brazil         BRA            GDP per capita (c… NY.GDP.PCAP.… 9065.497333954…
6 Brazil         BRA            Logistics perform… LP.LPI.OVRL.… 3.2            

Perhaps the biggest issue is that each row represents a single country metric. For instance, Argentina has a row for its 2022 LPI and a row for its 2022 GDP per capita. We need to pivot our columns so that each country is represented in a row with both 2022 metrics. However, several adjustments should be made before pivoting to streamline the process.

First, we can remove the unnecessary column Series Name. Based on the head of the data set, we also see that the 2022 [YR2022] column, which contains the LPI and GDP per capita metrics, was interpreted as having character values instead of numeric values; we can resolve this problem by manually converting the column to numeric with as.numeric(). Again, converting to numeric now will streamline the subsequent wrangling steps.

A more subtle issue is displayed in the following code chunk:

head(lpi_gdppc |>
  group_by(`Country Name`) |> # group by country name
  summarize(count = n()) |> # show totals
  arrange(-count)) # sort in descending order
# A tibble: 6 × 2
  `Country Name`              count
  <chr>                       <int>
1 <NA>                            3
2 Afghanistan                     2
3 Africa Eastern and Southern     2
4 Africa Western and Central      2
5 Albania                         2
6 Algeria                         2

Most countries appear only to have one or two rows associated with them, which is what we expect. In these cases, pivot_wider() will work unambiguously and as intended. However, three rows are missing Country Name and Country Code. Missing values are generally problematic, but, in this case, having three rows associated with a single ID column will cause ambiguous pivoting; instead of numbers in our pivoted data frame columns, we will have list-objects. We can remove all rows containing missing values before pivoting to avoid this. This will remove these problematic rows and other rows explicitly missing LPI or GDP per capita values—observations we will not include in our final analysis anyway.

lpi_gdppc_pre <- lpi_gdppc |>
  select(!`Series Name`) |> # remove unwanted column
  mutate(
    `2022 [YR2022]` = as.numeric(`2022 [YR2022]`) # convert metric column to numeric
  ) |>
  drop_na() # remove missing rows to prevent ambiguous pivoting
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `2022 [YR2022] = as.numeric(`2022 [YR2022]`)`.
Caused by warning:
! NAs introduced by coercion
head(lpi_gdppc_pre)
# A tibble: 6 × 4
  `Country Name` `Country Code` `Series Code`  `2022 [YR2022]`
  <chr>          <chr>          <chr>                    <dbl>
1 Argentina      ARG            NY.GDP.PCAP.CD         13651. 
2 Argentina      ARG            LP.LPI.OVRL.XQ             2.8
3 Australia      AUS            NY.GDP.PCAP.CD         65078. 
4 Australia      AUS            LP.LPI.OVRL.XQ             3.7
5 Brazil         BRA            NY.GDP.PCAP.CD          9065. 
6 Brazil         BRA            LP.LPI.OVRL.XQ             3.2

With these issues resolved, we can now use pivot_wider() to condense each country into a single observation. Additionally, we will want to rename all the column names since they are lengthy or vague. We can also call drop_na(), removing any observations with missing metric values. This may seem like a repetitive step since we called drop_na() earlier, but it is not. To illustrate why, consider the following code chunk:

head(lpi_gdppc_pre |>
  group_by(`Country Name`) |> # group by country name
  summarize(count = n()) |> # show totals
  arrange(count)) # sort in ascending order
# A tibble: 6 × 2
  `Country Name` count
  <chr>          <int>
1 American Samoa     1
2 Andorra            1
3 Aruba              1
4 Azerbaijan         1
5 Barbados           1
6 Belize             1

In its current construction, lpi_gdppc_pre contains countries with only one row of observations, which means they are missing data on LPI or GDP per capita. While they are missing data, there are no actual NA values in these columns, so a pre-pivot drop_na() will not remove these countries; these missing values are fittingly known as implicit missing values. However, when we run pivot_wider(), the consolidated rows for these countries will contain NA values. Thus, we remove these missing values by calling drop_na() after pivoting.

lpi_gdppc_pivot <- lpi_gdppc_pre |>
  pivot_wider( # pivot on `Series Code` to widen rows
    names_from = `Series Code`,
    values_from = `2022 [YR2022]`
  ) |>
  rename( # rename for simplicity
    country = 1, # referencing using indices
    code = 2,
    gdppc = 3,
    lpi = 4
  ) |>
  drop_na() # remove countries with implicit NA values

head(lpi_gdppc_pivot)
# A tibble: 6 × 4
  country   code   gdppc   lpi
  <chr>     <chr>  <dbl> <dbl>
1 Argentina ARG   13651.   2.8
2 Australia AUS   65078.   3.7
3 Brazil    BRA    9065.   3.2
4 China     CHN   12663.   3.7
5 France    FRA   40886.   3.9
6 Germany   DEU   48718.   4.1

The final step is imperative for merging the data. Recall that the source is World Bank, and they use a custom country identifier that differs from iso3c. Since we want to merge the data sets using iso3c codes, we must convert the current World Bank codes to iso3c. Luckily, the countrycodes can convert World Bank country codes to iso3c. After doing so, we can remove the original code column and relocate the iso3c column for viewing convenience. Again, we can call drop_na(). This time, it will remove observations that do not have an iso3c code but did have a custom World Bank code, which implies that they are super or sub-regions of interest for the World Bank but not countries.

lpi_gdppc_clean <- lpi_gdppc_pivot |>
  mutate( # create iso3c column by converting from wb code to iso3c
    iso3c = countrycode(
        sourcevar = code, 
        origin = "wb",         
        destination = "iso3c")
  ) |>
  select(!code) |> # remove `code` column
  relocate(iso3c, .before = gdppc) |> # shuffle columns
  drop_na() # remove non-countries
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(sourcevar = code, origin = "wb",
  destination = "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: AFE, AFW, ARB, CEB, CSS, EAP, EAR, EAS, ECA, ECS, EMU, EUU, FCS, HIC, HPC, IBD, IBT, IDA, IDB, IDX, LAC, LCN, LDC, LIC, LMC, LMY, LTE, MEA, MIC, MNA, NAC, OED, OSS, PRE, PSS, PST, SAS, SSA, SSF, SST, TEA, TEC, TLA, TMN, TSA, TSS, UMC, WLD
head(lpi_gdppc_clean)
# A tibble: 6 × 4
  country   iso3c  gdppc   lpi
  <chr>     <chr>  <dbl> <dbl>
1 Argentina ARG   13651.   2.8
2 Australia AUS   65078.   3.7
3 Brazil    BRA    9065.   3.2
4 China     CHN   12663.   3.7
5 France    FRA   40886.   3.9
6 Germany   DEU   48718.   4.1

NCI

The third CSV, entitled “NCI.csv”, is from BioDB and contains the 2022 National Conservation Index (NCI) ratings for 180 countries. The file has 180 rows.

nci <- read_csv("Datasets/NCI.csv")
Rows: 180 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, Change
dbl (2): Rank, Score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(nci)
Rows: 180
Columns: 4
$ Rank    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ Country <chr> "Luxembourg", "Estonia", "Denmark", "Finland", "United Kingdom…
$ Score   <dbl> 70.80, 70.50, 69.00, 66.90, 66.60, 65.90, 65.80, 64.70, 64.60,…
$ Change  <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",…

When we call head() on nci, we notice several things:

head(nci)
# A tibble: 6 × 4
   Rank Country        Score Change
  <dbl> <chr>          <dbl> <chr> 
1     1 Luxembourg      70.8 N/A   
2     2 Estonia         70.5 N/A   
3     3 Denmark         69   N/A   
4     4 Finland         66.9 N/A   
5     5 United Kingdom  66.6 N/A   
6     6 Zimbabwe        65.9 N/A   

The nci data frame is already in solid shape compared to the previous data sets; we need to make only minor adjustments. First, the Rank and Change columns are unnecessary, so we can remove them. We can also rename the remaining columns to match the other data sets. As always, we will remove missing data rows using drop_na().

nci_pre <- nci |>
  select(Country, Score) |> # Remove `Rank` and `Change`
  rename( # rename for continuity
    "country" = Country,
    "nci" = Score
  ) |>
  drop_na() # remove missing rows

head(nci_pre)
# A tibble: 6 × 2
  country          nci
  <chr>          <dbl>
1 Luxembourg      70.8
2 Estonia         70.5
3 Denmark         69  
4 Finland         66.9
5 United Kingdom  66.6
6 Zimbabwe        65.9

Currently, nci_pre does not contain an iso3c column, which is crucial for merging. We can create a column containing iso3c columns by using countrycode again. In this case, the nci_pre data set does not include a custom country identifier, so we will rely on countrycode’s ability to convert English representations of country names to iso3c. We will then shuffle our columns for viewing convenience and remove rows that contain missing values. In this case, a row with a missing value implies that the entity does not have an iso3c code, meaning it is not a country and should be removed.

nci_clean <- nci_pre |>
  mutate( # create iso3c column by converting English names to iso3c
    iso3c = countrycode(
        sourcevar = country, 
        origin = "country.name",         
        destination = "iso3c")
  ) |>
  relocate(iso3c, .before = nci) |> # shuffle columns
  drop_na() # remove non-country observations

head(nci_clean)
# A tibble: 6 × 3
  country        iso3c   nci
  <chr>          <chr> <dbl>
1 Luxembourg     LUX    70.8
2 Estonia        EST    70.5
3 Denmark        DNK    69  
4 Finland        FIN    66.9
5 United Kingdom GBR    66.6
6 Zimbabwe       ZWE    65.9

Percent Renewable

The fourth CSV, entitled “perRenewable.csv”, is from the International Renewable Energy Agency (IRENA) and contains countries’ total renewable energy production as a percentage of their total production in 2022. The file has 233 rows.

percent_renewable <- read_csv("Datasets/perRenewable.csv", skip = 2) # skip header
Rows: 233 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Region/country/area, Indicator, Renewable energy share of electrici...
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(percent_renewable)
Rows: 233
Columns: 4
$ `Region/country/area`                                               <chr> "W…
$ Indicator                                                           <chr> "R…
$ Year                                                                <dbl> 20…
$ `Renewable energy share of electricity capacity and generation (%)` <chr> "2…

When we call head() on percent_renewable, we notice several things:

head(percent_renewable)
# A tibble: 6 × 4
  `Region/country/area`             Indicator        Year Renewable energy sha…¹
  <chr>                             <chr>           <dbl> <chr>                 
1 World                             RE share of el…  2022 29.07                 
2 Africa                            RE share of el…  2022 22.78                 
3 Asia                              RE share of el…  2022 26.15                 
4 Central America and the Caribbean RE share of el…  2022 38.33                 
5 Eurasia                           RE share of el…  2022 23.48                 
6 Europe                            RE share of el…  2022 40.47                 
# ℹ abbreviated name:
#   ¹​`Renewable energy share of electricity capacity and generation (%)`

First, we notice that the Indicator and Year columns are unnecessary. The remaining columns have verbose or vague names, so we will rename those. Also, the column that contains the renewable energy metrics should have been interpreted as a column containing numbers, but it was not. Thus, we will use as.numeric() to convert the column to numeric values manually. Finally, we will use drop_na() to remove any observations that have missing values.

percent_renewable_pre <- percent_renewable |>
  select(c(1, 4)) |> # select desired columns using indices
  rename( # rename columns to match other data frames
    "country" = 1,
    "percent_renewable" = 2
  ) |>
  mutate( # convert percent_renewable to numeric
    percent_renewable = as.numeric(percent_renewable)
  ) |>
  drop_na() # remove observations with missing values
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `percent_renewable = as.numeric(percent_renewable)`.
Caused by warning:
! NAs introduced by coercion

Currently, percent_renewable_pre does not contain an iso3c column, which will be needed for merging. We can create a column containing iso3c columns by using countrycode again. In this case, the percent_renewable_pre data set does not contain a custom country identifier, so we will rely on countrycode’s ability to convert English representations of country names to iso3c. We will then shuffle our columns for viewing convenience and remove rows that contain missing values. In this case, a row with a missing value implies that the entity does not have an iso3c code, meaning it is not a country and should be removed.

percent_renewable_clean <- percent_renewable_pre |>
  mutate( # create iso3c column by converting English names to iso3c
    iso3c = countrycode(
        sourcevar = country, 
        origin = "country.name",         
        destination = "iso3c")
  ) |>
  relocate(iso3c, .before = percent_renewable) |> # shuffle columns
  drop_na() # remove non-country observations
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(sourcevar = country, origin =
  "country.name", destination = "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: Africa, Asia, Central America and the Caribbean, Eurasia, Europe, Kosovo, Middle East, North America, Oceania, Saint Martin, South America, World
head(percent_renewable_clean)
# A tibble: 6 × 3
  country        iso3c percent_renewable
  <chr>          <chr>             <dbl>
1 Afghanistan    AFG               78.0 
2 Albania        ALB              100   
3 Algeria        DZA                0.95
4 American Samoa ASM                3.05
5 Andorra        AND               94.1 
6 Angola         AGO               76.3 

Merging Data

With the four data frames tidied, we now need to merge them. To do so, we will chain together several inner_join() calls. Generally, inner_join() is the strictest joining method, as it only keeps observations in both data frames. While theoretically ideal, this joining strategy could be too ambitious for wild data. In our case, however, the data frames contain enough data to make this a feasible option.

merged_data <- co2pc_clean |> 
  inner_join( # IJ 1: co2pc and lpi_gdppc
    lpi_gdppc_clean,
    by = "iso3c"
  ) |>
  select(!country.y) |> # remove redundant country column
  inner_join( # IJ 2: (co2pc and lpi_gdppc) and nci_clean
    nci_clean,
    by = "iso3c"
  ) |>
  select(!country) |> # remove redundant country column
  inner_join( # IJ 3: (co2pc and lpi_gdppc and nci) and percent_renewable
    percent_renewable_clean,
    by = "iso3c",
  ) |>
  select(!country) |> # remove redundant country column
  rename( # reanme remaining country column
    country = "country.x"
  )
  
head(merged_data)
# A tibble: 6 × 7
  country             iso3c co2pc  gdppc   lpi   nci percent_renewable
  <chr>               <chr> <dbl>  <dbl> <dbl> <dbl>             <dbl>
1 Afghanistan         AFG   0.295   353.   1.9  48.8             78.0 
2 Albania             ALB   1.74   6810.   2.5  57.4            100   
3 Algeria             DZA   3.93   5023.   2.5  54.7              0.95
4 Angola              AGO   0.452  2933.   2.1  56.6             76.3 
5 Antigua and Barbuda ATG   6.42  19920.   2.9  55.5              6.51
6 Argentina           ARG   4.24  13651.   2.8  54.3             29.2 

Our final data frame merged_data contains 131 countries. To begin with, lpi_gdppc constituted the bottleneck since there were only about 134 countries with published LPI and GDP per capita values. So, across the inner_join() chain, we preserved roughly 98% of the possible countries.

Exploring Distributions

We have merged our data, but we should inspect the distributions of each variable before proceeding with the analysis. Moreover, linear regression models make complex assumptions about the response and explanatory variables. If these assumptions are not met, then the model’s insights are likely invalid.

The specific assumptions are beyond the scope of this course, but variables that possess normal distributions generally (but not necessarily) satisfy most of these conditions. Accordingly, we should check if each of the variables is normally distributed. If they are not, we can alter the distribution by log transforming the variable. Sometimes, this may make the distribution more skewed, so we should still use the original variable. Other times, the transformation may make the variable appear normally distributed, which is desired.

CO2pc

Let us first look at the distribution for co2pc:

merged_data |>
  ggplot(aes(x = co2pc)) +
  geom_histogram(bins = 12, fill = "olivedrab4") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country CO2 Emissions per Capita, 2022",
    x = "Count",
    y = "CO2 Emissions per Capita (ton/person)",
    caption = "Source: Our World in Data"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

Clearly, the distribution is strongly skewed right. We can try to correct this skew by log-transforming the gdppc variable:

merged_data |>
  ggplot(aes(x = log(co2pc))) +
  geom_histogram(bins = 12, fill = "olivedrab4") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country Log(CO2 Emissions per Capita), 2022",
    x = "Count",
    y = "Log(CO2 Emissions per Capita (ton/person))",
    caption = "Source: Our World in Data"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

While this distribution still possesses a slight left skew, it is certainly less skewed than the original co2pc distribution. Thus, we will use log(co2pc) for all of our analyses.

LPI

Next, let us look at the distribution for lpi:

merged_data |>
  ggplot(aes(x = lpi)) +
  geom_histogram(bins = 12, fill = "plum") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country LPI, 2022",
    x = "Logistics Performance Index (LPI)",
    y = "Count",
    caption = "Source: World Bank"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

Clearly, the distribution has a right skew. We can try to correct this skew by log-transforming the lpi variable:

merged_data |>
  ggplot(aes(x = log(lpi))) +
  geom_histogram(bins = 12, fill = "plum") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country Log(LPI), 2022",
    x = "Log(Logistics Performance Index (LPI))",
    y = "Count",
    caption = "Source: World Bank"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

This distribution appears more balanced, but not by much. A more significant issue is that one could reasonably claim that the distribution now appears bimodal, with one center at roughly 0.9 and another at about 1.25; it looks as if there are two separate distributions superimposed on top of each other. While it is hard to make relative judgments, we would argue that this distribution appears less normal than the original one—especially if we consider it bimodal—so we will use untransformed lpi in our analyses.

GDPpc

Next, let us look at the distribution for gdppc:

merged_data |>
  ggplot(aes(x = gdppc)) +
  geom_histogram(bins = 12, fill = "firebrick4") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country GDP per Capita, 2022",
    x = "GDP per Capita ($/Person)",
    y = "Count",
    caption = "Source: World Bank"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

Clearly, the distribution has a sizable right skew. We can try to correct this skew by log-transforming the gdppc variable:

merged_data |>
  ggplot(aes(x = log(gdppc))) +
  geom_histogram(bins = 12, fill = "firebrick4") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country Log(GDP per Capita), 2022",
    x = "Log(GDP per Capita ($/Person))",
    y = "Count",
    caption = "Source: World Bank"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

While this distribution still possesses a slight left skew, it is certainly less skewed than the original gdppc distribution and appears somewhat normal looking. Thus, we will use log(gdppc) for all of our analyses.

NCI

Next, let us look at the distribution for nci:

merged_data |>
  ggplot(aes(x = nci)) +
  geom_histogram(bins = 12, fill = "goldenrod3") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country NCI, 2022",
    x = "National Conservation Index (NCI)",
    y = "Count",
    caption = "Source: BioDB"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

The nci distribution above can be considered normal, especially for wild data. Accordingly, we do not need to adjust and will use un-transformed nci in our analyses.

Percent Renewable

Finally, let us look at the distribution for percent_renewable:

merged_data |>
  ggplot(aes(x = percent_renewable)) +
  geom_histogram(bins = 12, fill = "slateblue3") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country Renewable Energy, 2022",
    x = "Renewable Energy (% of Total Energy)",
    y = "Count",
    caption = "Source: IRENA"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

Clearly, the distribution has a sizable right skew. We can try to correct this skew by log-transforming the percent_renewable variable:

merged_data |>
  ggplot(aes(x = log(percent_renewable))) +
  geom_histogram(bins = 12, fill = "slateblue3") + # 12 bins
  labs( # axis and title labels
    title = "Distribution of Country Renewable Energy, 2022",
    x = "Log(Renewable Energy (% of Total Energy))",
    y = "Count",
    caption = "Source: IRENA"
  ) +
  theme(text = element_text(family = "Luminari")) # cleaner font

The log-transformed distribution of percent_renewable undoubtedly has a worse skew than the original distribution. Accordingly, we will use the un-transformed percent_renewable in our analyses.