Figure 1.1: Photo by Anna Shvets from Pexels
Climate change is a long-term change in the average weather patterns that have come to define Earth’s local, regional and global climates. These changes have a broad range of observed effects that are synonymous with the term.
Changes observed in Earth’s climate since the early 20th century are primarily driven by human activities, particularly fossil fuel burning, which increases heat-trapping greenhouse gas levels in Earth’s atmosphere, raising Earth’s average surface temperature. These human-produced temperature increases are commonly referred to as global warming. Natural processes can also contribute to climate change, including internal variability (e.g., cyclical ocean patterns like El Niño, La Niña and the Pacific Decadal Oscillation) and external forcings (e.g., volcanic activity, changes in the Sun’s energy output, variations in Earth’s orbit).
Scientists use observations from the ground, air and space, along with theoretical models, to monitor and study past, present and future climate change. Climate data records provide evidence of key climate change indicators, such as global land and ocean temperature increases; rising sea levels; ice loss at Earth’s poles and in mountain glaciers; frequency and severity changes in extreme weather such as hurricanes, heatwaves, wildfires, droughts, floods and precipitation; and cloud and vegetation cover changes, to name but a few. The records used here fall into two categories: ecological footprint (ecological consumption) and biocapacity (ecological production). In this report, we will mainly analyze these two categories to determine the ecological impact recorded for each country in 2017.
Figure 2.1: Photo by Deva Darshan from Pexels
The ecological footprint measures the ecological assets that a given population requires to produce the natural resources it consumes (including plant-based food and fiber products, livestock and fish products, timber and other forest products, space for urban infrastructure) and to absorb its waste, especially carbon emissions.
A nation’s biocapacity represents the productivity of its ecological assets, including cropland, grazing land, forest land, fishing grounds, and built-up land. These areas, especially if left unharvested, can also absorb much of the waste we generate, especially our carbon emissions.
Both the ecological footprint and biocapacity are expressed in global hectares — globally comparable, standardized hectares with world average productivity.
If a population’s ecological footprint exceeds the region’s biocapacity, that region runs an ecological deficit. Its demand for the goods and services that its land and seas can provide — fruits and vegetables, meat, fish, wood, cotton for clothing, and carbon dioxide absorption — exceeds what the region’s ecosystems can renew. A region in ecological deficit meets demand by importing, liquidating its own ecological assets (such as overfishing), and/or emitting carbon dioxide into the atmosphere. If a region’s biocapacity exceeds its ecological footprint, it has an ecological reserve.
There were ~ 12.2 billion hectares of biologically productive land and water on Earth in 2019. Dividing by the number of people alive in that year (7.7 billion) gives 1.6 global hectares per person. This area also needs to accommodate the wild species that compete for the same biological material and spaces as humans.
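The per-person figure above is a simple division, and it can be reproduced in a couple of lines of R. This is only a sketch of the arithmetic; the variable names are my own.

```r
# Figures quoted in the text above (2019 estimates).
biocapacity_ha <- 12.2e9   # biologically productive hectares on Earth
population     <- 7.7e9    # people alive in that year

# Biocapacity available per person, in global hectares.
gha_per_person <- biocapacity_ha / population
round(gha_per_person, 1)
```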
biological capacity or biocapacity: The capacity of ecosystems to regenerate what people demand from those surfaces. Life, including human life, competes for space. The biocapacity of a particular surface represents its ability to regenerate what people demand. Biocapacity is therefore the ecosystems’ capacity to produce biological materials used by people and to absorb waste material generated by humans, under current management schemes and extraction technologies. Biocapacity can change from year to year due to climate, management, and also what portions are considered useful inputs to the human economy. In the National Footprint and Biocapacity Accounts, the biocapacity of an area is calculated by multiplying the actual physical area by the yield factor and the appropriate equivalence factor. Biocapacity is usually expressed in global hectares.
ecological deficit / reserve OR biocapacity deficit / reserve: The difference between the biocapacity and Ecological Footprint of a region or country. An ecological deficit occurs when the Footprint of a population exceeds the biocapacity of the area available to that population. Conversely, an ecological reserve exists when the biocapacity of a region exceeds its population’s Footprint. If there is a regional or national ecological deficit, it means that the region is importing biocapacity through trade or liquidating regional ecological assets, or emitting wastes into a global commons such as the atmosphere. In contrast to the national scale, the global ecological deficit cannot be compensated for through trade, and is therefore equal to overshoot by definition.
Ecological Footprint: A measure of how much area of biologically productive land and water an individual, population or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. The Ecological Footprint is usually measured in global hectares. Because trade is global, an individual or country’s Footprint includes land or sea from all over the world. Without further specification, Ecological Footprint generally refers to the Ecological Footprint of consumption. Ecological Footprint is often referred to in short form as Footprint. “Ecological Footprint” and “Footprint” are proper nouns and thus should always be capitalized.
global hectare: Global hectares are the accounting unit for the Ecological Footprint and Biocapacity accounts. These productivity-weighted biologically productive hectares allow researchers to report both the biocapacity of the earth or a region and the demand on biocapacity (the Ecological Footprint). A global hectare is a biologically productive hectare with world average biological productivity for a given year. Global hectares are needed because different land types have different productivities. A global hectare of, for example, cropland, would occupy a smaller physical area than the much less biologically productive pasture land, as more pasture would be needed to provide the same biocapacity as one hectare of cropland. Because world productivity varies slightly from year to year, the value of a global hectare may change slightly from year to year.
land or area type: The Earth’s approximately 12.2 billion hectares of biologically productive land and water areas are categorized into five types. The five area types for biocapacity that support the 6 Footprint demand types are: cropland, grazing land, forest land, fishing grounds, and built-up land.
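The glossary says that biocapacity is the physical area multiplied by a yield factor and an equivalence factor. Here is a minimal sketch of that calculation. The function name `calc_biocapacity` and every number in the table are made-up illustrations, not published factors.

```r
# Sketch of the accounting identity described in the glossary:
# biocapacity (gha) = physical area (ha) * yield factor * equivalence factor.
calc_biocapacity <- function(area_ha, yield_factor, equivalence_factor) {
  area_ha * yield_factor * equivalence_factor
}

# All numeric inputs below are hypothetical, for illustration only.
land <- data.frame(
  type               = c("cropland", "grazing land"),
  area_ha            = c(1000, 1000),  # hypothetical physical areas
  yield_factor       = c(1.2, 0.9),    # hypothetical: local vs world-average yield
  equivalence_factor = c(2.5, 0.5)     # hypothetical: land type vs average productivity
)

land$biocapacity_gha <- calc_biocapacity(
  land$area_ha, land$yield_factor, land$equivalence_factor
)
land
```

With these invented factors, the same physical area yields far more global hectares as cropland than as grazing land, which is the point the global hectare definition makes.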
First and foremost, let’s load the libraries that we are going to work with.
I will use at least dplyr, tidyr, and glue to make pre-processing the data easier, plus ggplot2 and plotly for visualizing it.
library(dplyr)
library(glue)
library(tidyr)
library(ggplot2)
library(plotly)
Now, let’s read the dataset. It is made available as a CSV file, so I will use the read.csv function to load it, with footprint as the object name. As a habit, I will look at the first 6 rows using the head function to get a feel for the data.
footprint <- read.csv("countries.csv")
head(footprint)
## Country Region Population..millions. HDI
## 1 Afghanistan Middle East/Central Asia 29.82 0.46
## 2 Albania Northern/Eastern Europe 3.16 0.73
## 3 Algeria Africa 38.48 0.73
## 4 Angola Africa 20.82 0.52
## 5 Antigua and Barbuda Latin America 0.09 0.78
## 6 Argentina Latin America 41.09 0.83
## GDP.per.Capita Cropland.Footprint Grazing.Footprint Forest.Footprint
## 1 $614.66 0.30 0.20 0.08
## 2 $4,534.37 0.78 0.22 0.25
## 3 $5,430.57 0.60 0.16 0.17
## 4 $4,665.91 0.33 0.15 0.12
## 5 $13,205.10 NA NA NA
## 6 $13,540.00 0.78 0.79 0.29
## Carbon.Footprint Fish.Footprint Total.Ecological.Footprint Cropland
## 1 0.18 0.00 0.79 0.24
## 2 0.87 0.02 2.21 0.55
## 3 1.14 0.01 2.12 0.24
## 4 0.20 0.09 0.93 0.20
## 5 NA NA 5.38 NA
## 6 1.08 0.10 3.14 2.64
## Grazing.Land Forest.Land Fishing.Water Urban.Land Total.Biocapacity
## 1 0.20 0.02 0.00 0.04 0.50
## 2 0.21 0.29 0.07 0.06 1.18
## 3 0.27 0.03 0.01 0.03 0.59
## 4 1.42 0.64 0.26 0.04 2.55
## 5 NA NA NA NA 0.94
## 6 1.86 0.66 1.67 0.10 6.92
## Biocapacity.Deficit.or.Reserve Earths.Required Countries.Required
## 1 -0.30 0.46 1.60
## 2 -1.03 1.27 1.87
## 3 -1.53 1.22 3.61
## 4 1.61 0.54 0.37
## 5 -4.44 3.11 5.70
## 6 3.78 1.82 0.45
## Data.Quality
## 1 6
## 2 6
## 3 5
## 4 6
## 5 2
## 6 6
Looking at row 5 (Antigua and Barbuda), we can see that the footprint and biocapacity component values are missing, though the totals are still usable. I will keep this in mind, as such rows may need to be excluded when we compare those metrics.
Let’s analyze the data structure of the dataset and make sure that the data types are properly used. I prefer dplyr’s glimpse function compared to base-R’s str but either one will do the trick:
glimpse(footprint)
## Rows: 188
## Columns: 21
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "A~
## $ Region <chr> "Middle East/Central Asia", "Northern/E~
## $ Population..millions. <dbl> 29.82, 3.16, 38.48, 20.82, 0.09, 41.09,~
## $ HDI <dbl> 0.46, 0.73, 0.73, 0.52, 0.78, 0.83, 0.7~
## $ GDP.per.Capita <chr> "$614.66", "$4,534.37", "$5,430.57", "$~
## $ Cropland.Footprint <dbl> 0.30, 0.78, 0.60, 0.33, NA, 0.78, 0.74,~
## $ Grazing.Footprint <dbl> 0.20, 0.22, 0.16, 0.15, NA, 0.79, 0.18,~
## $ Forest.Footprint <dbl> 0.08, 0.25, 0.17, 0.12, NA, 0.29, 0.34,~
## $ Carbon.Footprint <dbl> 0.18, 0.87, 1.14, 0.20, NA, 1.08, 0.89,~
## $ Fish.Footprint <dbl> 0.00, 0.02, 0.01, 0.09, NA, 0.10, 0.01,~
## $ Total.Ecological.Footprint <dbl> 0.79, 2.21, 2.12, 0.93, 5.38, 3.14, 2.2~
## $ Cropland <dbl> 0.24, 0.55, 0.24, 0.20, NA, 2.64, 0.44,~
## $ Grazing.Land <dbl> 0.20, 0.21, 0.27, 1.42, NA, 1.86, 0.26,~
## $ Forest.Land <dbl> 0.02, 0.29, 0.03, 0.64, NA, 0.66, 0.10,~
## $ Fishing.Water <dbl> 0.00, 0.07, 0.01, 0.26, NA, 1.67, 0.02,~
## $ Urban.Land <dbl> 0.04, 0.06, 0.03, 0.04, NA, 0.10, 0.07,~
## $ Total.Biocapacity <dbl> 0.50, 1.18, 0.59, 2.55, 0.94, 6.92, 0.8~
## $ Biocapacity.Deficit.or.Reserve <dbl> -0.30, -1.03, -1.53, 1.61, -4.44, 3.78,~
## $ Earths.Required <dbl> 0.46, 1.27, 1.22, 0.54, 3.11, 1.82, 1.2~
## $ Countries.Required <dbl> 1.60, 1.87, 3.61, 0.37, 5.70, 0.45, 2.5~
## $ Data.Quality <chr> "6", "6", "5", "6", "2", "6", "3B", "2"~
summary(footprint)
## Country Region Population..millions. HDI
## Length:188 Length:188 Min. : 0.000 Min. :0.3400
## Class :character Class :character 1st Qu.: 2.038 1st Qu.:0.5575
## Mode :character Mode :character Median : 7.970 Median :0.7200
## Mean : 37.342 Mean :0.6864
## 3rd Qu.: 24.870 3rd Qu.:0.8025
## Max. :1408.040 Max. :0.9400
## NA's :16
## GDP.per.Capita Cropland.Footprint Grazing.Footprint Forest.Footprint
## Length:188 Min. :0.0700 Min. :0.0000 Min. :0.0100
## Class :character 1st Qu.:0.3500 1st Qu.:0.0800 1st Qu.:0.1700
## Mode :character Median :0.5200 Median :0.1800 Median :0.2600
## Mean :0.5782 Mean :0.2632 Mean :0.3738
## 3rd Qu.:0.7000 3rd Qu.:0.3200 3rd Qu.:0.4600
## Max. :2.6800 Max. :3.4700 Max. :3.0300
## NA's :15 NA's :15 NA's :15
## Carbon.Footprint Fish.Footprint Total.Ecological.Footprint Cropland
## Min. : 0.000 Min. :0.0000 Min. : 0.420 Min. :0.0000
## 1st Qu.: 0.420 1st Qu.:0.0200 1st Qu.: 1.482 1st Qu.:0.1800
## Median : 1.140 Median :0.0700 Median : 2.740 Median :0.3500
## Mean : 1.805 Mean :0.1225 Mean : 3.318 Mean :0.5319
## 3rd Qu.: 2.600 3rd Qu.:0.1500 3rd Qu.: 4.640 3rd Qu.:0.5900
## Max. :12.650 Max. :0.8200 Max. :15.820 Max. :5.4200
## NA's :15 NA's :15 NA's :15
## Grazing.Land Forest.Land Fishing.Water Urban.Land
## Min. :0.0000 Min. : 0.000 Min. : 0.0000 Min. :0.00000
## 1st Qu.:0.0300 1st Qu.: 0.060 1st Qu.: 0.0300 1st Qu.:0.03000
## Median :0.1200 Median : 0.340 Median : 0.1100 Median :0.05000
## Mean :0.4566 Mean : 2.459 Mean : 0.5951 Mean :0.06711
## 3rd Qu.:0.3400 3rd Qu.: 1.170 3rd Qu.: 0.3700 3rd Qu.:0.09000
## Max. :8.2300 Max. :95.160 Max. :16.0700 Max. :0.27000
## NA's :15 NA's :15 NA's :15 NA's :15
## Total.Biocapacity Biocapacity.Deficit.or.Reserve Earths.Required
## Min. : 0.050 Min. :-14.1400 Min. :0.240
## 1st Qu.: 0.675 1st Qu.: -1.9350 1st Qu.:0.855
## Median : 1.310 Median : -0.7300 Median :1.580
## Mean : 4.020 Mean : 0.7021 Mean :1.916
## 3rd Qu.: 2.815 3rd Qu.: 0.2125 3rd Qu.:2.678
## Max. :111.350 Max. :109.0100 Max. :9.140
##
## Countries.Required Data.Quality
## Min. : 0.0200 Length:188
## 1st Qu.: 0.9425 Class :character
## Median : 1.7050 Mode :character
## Mean : 4.0374
## 3rd Qu.: 2.8475
## Max. :159.4700
##
A few points that I would like to highlight or improve in this data:
Region and Data.Quality should be converted into factors, since those columns seem to be categorical, with many repeating values.
GDP.per.Capita would be more useful as a number. Therefore I need to remove the dollar sign and the thousands separator before converting it.
Data.Quality seems to classify the data by completeness. I assume so based on, again, the 5th row (Antigua and Barbuda), which has a Data.Quality of 2. Perhaps a lower Data.Quality value means less complete data? I could use this to filter out rows with missing values, but I need to verify this first.
Total.Ecological.Footprint seems to be a simple sum of the 5 columns ending in Footprint just before it. I need to verify this to avoid redundancy.
Total.Biocapacity seems to be a simple sum of the 5 columns just before it, whose names resemble the assumed Footprint components from the previous point.
Biocapacity.Deficit.or.Reserve seems to be Total.Biocapacity minus Total.Ecological.Footprint. I need to verify that as well.
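The last assumption can be spot-checked right away on the six rows printed by head above. The sketch below copies those values by hand. The short column names and the rounding tolerance are my own assumptions, and six rows are of course not proof for the whole dataset.

```r
# First six rows of the dataset, copied from the head() output above.
sample_rows <- data.frame(
  Country   = c("Afghanistan", "Albania", "Algeria", "Angola",
                "Antigua and Barbuda", "Argentina"),
  Footprint = c(0.79, 2.21, 2.12, 0.93, 5.38, 3.14),     # Total.Ecological.Footprint
  Biocap    = c(0.50, 1.18, 0.59, 2.55, 0.94, 6.92),     # Total.Biocapacity
  Reported  = c(-0.30, -1.03, -1.53, 1.61, -4.44, 3.78)  # Biocapacity.Deficit.or.Reserve
)

# Recompute the balance and compare it with the reported column.
sample_rows$Recomputed <- sample_rows$Biocap - sample_rows$Footprint
sample_rows$Abs.Diff   <- abs(sample_rows$Reported - sample_rows$Recomputed)

# Allow for rounding of the two-decimal values (the tolerance is an assumption).
tolerance <- 0.015
all(sample_rows$Abs.Diff <= tolerance)
```

A negative balance means an ecological deficit and a positive one means a reserve, matching the glossary definition.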
Let’s check whether the Region and Data.Quality columns have few enough unique values to be converted into factors.
unique(footprint$Region)
## [1] "Middle East/Central Asia" "Northern/Eastern Europe"
## [3] "Africa" "Latin America"
## [5] "Asia-Pacific" "European Union"
## [7] "North America"
unique(footprint$Data.Quality)
## [1] "6" "5" "2" "3B" "3L" "3T" "4"
Both of them have only 7 unique values across the dataset’s 188 rows. Therefore, I can safely convert both columns into factors.
footprint <- footprint %>%
mutate(
Region = as.factor(Region),
Data.Quality = as.factor(Data.Quality)
)
glimpse(footprint$Region)
## Factor w/ 7 levels "Africa","Asia-Pacific",..: 5 7 1 1 4 4 5 4 2 3 ...
glimpse(footprint$Data.Quality)
## Factor w/ 7 levels "2","3B","3L",..: 7 7 6 7 1 7 2 1 6 6 ...
To confirm, I have checked them using the glimpse function as shown above.
To convert GDP.per.Capita into a usable numeric column, we can use the gsub function to remove every dollar sign and comma (the thousands separator) from it.
Then, I will make sure the result is correct by checking the first 6 values and the data type using the head and glimpse functions, respectively.
footprint <- footprint %>%
mutate(
GDP.per.Capita = as.numeric(gsub("[\\$,]","",GDP.per.Capita))
)
head(footprint$GDP.per.Capita)
## [1] 614.66 4534.37 5430.57 4665.91 13205.10 13540.00
glimpse(footprint$GDP.per.Capita)
## num [1:188] 615 4534 5431 4666 13205 ...
Data.Quality = Data Completeness?
Does the Data.Quality column reflect the completeness of the whole row?
Since we know that some components are missing (NA) or possibly empty, we need to count the missing values within each row and compare that count to the Data.Quality value.
Let’s use a combination of the rowSums and is.na functions to count the missing values in each row, then sort in descending order by Total.NA.Count and Data.Quality.
footprint %>%
mutate(Total.NA.Count = rowSums(is.na(footprint) | footprint == "")) %>%
arrange(desc(Total.NA.Count), desc(Data.Quality)) %>%
head(10)
## Country Region Population..millions. HDI
## 1 British Virgin Islands Latin America 0.03 NA
## 2 Wallis and Futuna Islands Asia-Pacific 0.01 NA
## 3 Aruba Latin America 0.10 NA
## 4 Montserrat Latin America 0.00 NA
## 5 Nauru Asia-Pacific 0.01 NA
## 6 Bermuda North America 0.06 NA
## 7 Norway Northern/Eastern Europe 4.99 0.94
## 8 Cabo Verde Africa 0.49 0.64
## 9 Cambodia Asia-Pacific 14.86 0.55
## 10 Estonia European Union 1.29 0.85
## GDP.per.Capita Cropland.Footprint Grazing.Footprint Forest.Footprint
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 70626.30 NA NA NA
## 7 100172.00 NA NA NA
## 8 3801.45 NA NA NA
## 9 877.64 NA NA NA
## 10 17304.40 NA NA NA
## Carbon.Footprint Fish.Footprint Total.Ecological.Footprint Cropland
## 1 NA NA 2.86 NA
## 2 NA NA 2.07 NA
## 3 NA NA 11.88 NA
## 4 NA NA 7.78 NA
## 5 NA NA 2.94 NA
## 6 NA NA 5.77 NA
## 7 NA NA 4.98 NA
## 8 NA NA 2.52 NA
## 9 NA NA 1.21 NA
## 10 NA NA 6.86 NA
## Grazing.Land Forest.Land Fishing.Water Urban.Land Total.Biocapacity
## 1 NA NA NA NA 2.05
## 2 NA NA NA NA 1.51
## 3 NA NA NA NA 0.57
## 4 NA NA NA NA 1.36
## 5 NA NA NA NA 0.19
## 6 NA NA NA NA 0.13
## 7 NA NA NA NA 8.18
## 8 NA NA NA NA 0.62
## 9 NA NA NA NA 1.09
## 10 NA NA NA NA 10.53
## Biocapacity.Deficit.or.Reserve Earths.Required Countries.Required
## 1 -0.81 1.65 1.40
## 2 -0.56 1.19 1.37
## 3 -11.31 6.86 20.69
## 4 -6.42 4.49 5.71
## 5 -2.76 1.70 15.83
## 6 -5.64 3.33 44.05
## 7 3.19 2.88 0.61
## 8 -1.90 1.46 4.06
## 9 -0.11 0.70 1.11
## 10 3.67 3.96 0.65
## Data.Quality Total.NA.Count
## 1 3T 12
## 2 3T 12
## 3 2 12
## 4 2 12
## 5 2 12
## 6 3T 11
## 7 4 10
## 8 3T 10
## 9 3T 10
## 10 3T 10
It seems we cannot rely on Data.Quality as a filter for data completeness, since some rows with a Data.Quality of ‘3T’ or ‘4’ have 10 or more missing values, while other rows with similar grades are quite complete. Another thing to notice is that 10 of the missing values come from the 5 ‘suspected’ component columns before each of Total.Ecological.Footprint and Total.Biocapacity, while the other 2 come from HDI and GDP.per.Capita. That gives a total of up to 12 missing values per row.
Rather than relying on the unclear Data.Quality column, I will focus on the completeness of the data directly by creating 3 more columns that count the missing values overall, in the components, and in the other columns.
footprint <- footprint %>%
mutate(Total.NA.Count = rowSums(is.na(footprint) | footprint == ""),
Components.NA.Count = rowSums(is.na(footprint[,c(6:10,12:16)]) | footprint[,c(6:10,12:16)] == "")) %>%
mutate(Other.NA.Count = Total.NA.Count - Components.NA.Count)
head(footprint)
## Country Region Population..millions. HDI
## 1 Afghanistan Middle East/Central Asia 29.82 0.46
## 2 Albania Northern/Eastern Europe 3.16 0.73
## 3 Algeria Africa 38.48 0.73
## 4 Angola Africa 20.82 0.52
## 5 Antigua and Barbuda Latin America 0.09 0.78
## 6 Argentina Latin America 41.09 0.83
## GDP.per.Capita Cropland.Footprint Grazing.Footprint Forest.Footprint
## 1 614.66 0.30 0.20 0.08
## 2 4534.37 0.78 0.22 0.25
## 3 5430.57 0.60 0.16 0.17
## 4 4665.91 0.33 0.15 0.12
## 5 13205.10 NA NA NA
## 6 13540.00 0.78 0.79 0.29
## Carbon.Footprint Fish.Footprint Total.Ecological.Footprint Cropland
## 1 0.18 0.00 0.79 0.24
## 2 0.87 0.02 2.21 0.55
## 3 1.14 0.01 2.12 0.24
## 4 0.20 0.09 0.93 0.20
## 5 NA NA 5.38 NA
## 6 1.08 0.10 3.14 2.64
## Grazing.Land Forest.Land Fishing.Water Urban.Land Total.Biocapacity
## 1 0.20 0.02 0.00 0.04 0.50
## 2 0.21 0.29 0.07 0.06 1.18
## 3 0.27 0.03 0.01 0.03 0.59
## 4 1.42 0.64 0.26 0.04 2.55
## 5 NA NA NA NA 0.94
## 6 1.86 0.66 1.67 0.10 6.92
## Biocapacity.Deficit.or.Reserve Earths.Required Countries.Required
## 1 -0.30 0.46 1.60
## 2 -1.03 1.27 1.87
## 3 -1.53 1.22 3.61
## 4 1.61 0.54 0.37
## 5 -4.44 3.11 5.70
## 6 3.78 1.82 0.45
## Data.Quality Total.NA.Count Components.NA.Count Other.NA.Count
## 1 6 0 0 0
## 2 6 0 0 0
## 3 5 0 0 0
## 4 6 0 0 0
## 5 2 10 10 0
## 6 6 0 0 0
With that done, we can move on to verify whether the columns follow the assumptions below, since the dataset provider does not give us enough metadata about them:
Total.Ecological.Footprint is the sum of the 5 columns prior to it.
Total.Biocapacity is the sum of the 5 columns prior to it.
Earths.Required is the ratio of the country’s footprint to the population-weighted average biocapacity of all countries (or some other coefficient).
Countries.Required is the ratio of the country’s footprint to its own biocapacity (or some other coefficient).
We need to filter out the rows (countries) with missing component values first.
footprint %>%
filter(Components.NA.Count == 0) %>%
mutate(Test.Total.Footprint = Cropland.Footprint + Grazing.Footprint + Forest.Footprint + Carbon.Footprint + Fish.Footprint,
Test.Total.Biocapacity = Cropland + Grazing.Land + Forest.Land + Fishing.Water + Urban.Land,
# Test.Earths.Required = round(Total.Ecological.Footprint / median(footprint$Total.Biocapacity),2),
Test.Earths.Required = round(Total.Ecological.Footprint / (sum(Total.Biocapacity * Population..millions.)/sum(Population..millions.)),2),
Test.Countries.Required = round(Total.Ecological.Footprint / Total.Biocapacity,2)) %>%
mutate(Diff.Total.Footprint = abs(Total.Ecological.Footprint-Test.Total.Footprint)/Total.Ecological.Footprint,
Diff.Total.Biocapacity = abs(Total.Biocapacity-Test.Total.Biocapacity)/Total.Biocapacity,
Diff.Earths.Required = abs(Earths.Required-Test.Earths.Required)/Earths.Required,
Diff.Countries.Required = abs(Countries.Required-Test.Countries.Required)/Countries.Required) %>%
summarise(mean_fp = mean(Diff.Total.Footprint),
mean_bc = mean(Diff.Total.Biocapacity),
mean_er = mean(Diff.Earths.Required),
mean_cr = mean(Diff.Countries.Required)
)
## mean_fp mean_bc mean_er mean_cr
## 1 0.0287381 0.006181684 0.01716617 0.00335664
Since the mean relative differences between the reported columns and the assumed formulas are all below 3%, I think we can reasonably conclude that the assumed formulas are approximately correct.
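The Footprint total shows the largest mean difference of the four, so it is worth looking at the gap row by row. The sketch below uses the five complete rows from the head output, copied by hand. On these rows the reported total is always slightly larger than the sum of the 5 listed components, and the gap is close to the Urban.Land value. One possible reading, which is my assumption and not something the dataset documents, is that the total includes a sixth, built-up land component that has no Footprint column of its own.

```r
# Complete rows from the head() output above
# (Antigua and Barbuda is omitted because its components are NA).
rows <- data.frame(
  Country                    = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina"),
  Cropland.Footprint         = c(0.30, 0.78, 0.60, 0.33, 0.78),
  Grazing.Footprint          = c(0.20, 0.22, 0.16, 0.15, 0.79),
  Forest.Footprint           = c(0.08, 0.25, 0.17, 0.12, 0.29),
  Carbon.Footprint           = c(0.18, 0.87, 1.14, 0.20, 1.08),
  Fish.Footprint             = c(0.00, 0.02, 0.01, 0.09, 0.10),
  Total.Ecological.Footprint = c(0.79, 2.21, 2.12, 0.93, 3.14),
  Urban.Land                 = c(0.04, 0.06, 0.03, 0.04, 0.10)
)

component_cols <- c("Cropland.Footprint", "Grazing.Footprint", "Forest.Footprint",
                    "Carbon.Footprint", "Fish.Footprint")

# Sum of the listed components and the gap to the reported total.
rows$Component.Sum <- rowSums(rows[, component_cols])
rows$Gap           <- rows$Total.Ecological.Footprint - rows$Component.Sum

rows[, c("Country", "Component.Sum", "Total.Ecological.Footprint", "Gap", "Urban.Land")]
```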
hist(footprint$Total.Ecological.Footprint, xlab = "Footprint in Global Hectare (gha)", main = "Spread of Ecological Footprint")
summary(footprint$Total.Ecological.Footprint)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.420 1.482 2.740 3.318 4.640 15.820
It seems that the Ecological Footprint is not normally distributed. From the histogram, we can see that it is strongly right-skewed (positively skewed): most countries fall within 0-5 gha, with a long tail of higher values. This is good news for the world compared with having most countries on the heavier side.
From the summary statistics, we can see that although the center of the data, represented by the median and mean, lies around 2-3 gha, the maximum value stands far apart at 15.82 gha.
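The direction of the skew can also be checked numerically instead of by eye. Below is a minimal sketch of a moment-based skewness measure, run on a toy vector. The helper `sample_skewness` is my own and the toy values are hypothetical. On the real data it could be applied to footprint$Total.Ecological.Footprint, where the summary above already shows a mean (3.318) above the median (2.740), as expected for a right-skewed distribution.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy right-skewed vector (hypothetical values, not taken from the dataset).
toy <- c(1, 1, 2, 2, 3, 10)

sample_skewness(toy)      # positive for right-skewed data
mean(toy) > median(toy)   # the mean is pulled toward the long tail
```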
Let’s see which countries consume the most of the world’s ecological resources.
footprint_plot1 <- footprint %>%
arrange(desc(Total.Ecological.Footprint)) %>%
head(10) %>%
mutate(label = glue("{Country}
Region = {Region}
Footprint = {Total.Ecological.Footprint} gha
Population = {Population..millions.} mil")) %>%
ggplot(mapping = aes(x=Total.Ecological.Footprint, y=reorder(Country, Total.Ecological.Footprint), text=label, fill = Total.Ecological.Footprint)) +
scale_fill_gradient(low = "black", high = "red")+
geom_col() +
labs(x="Footprint in Global Hectare (gha)",
y="Country Name",
title = "Top 10 Footprint Contributor",
fill = "Footprint Value (gha)")
ggplotly(footprint_plot1, tooltip="label")
I assumed the top spots would go to the countries best known for their technological growth and/or land area, such as the USA, China, India, and Japan. Although the USA did make the Top 10, other, less expected countries leave an even larger ecological footprint, namely Luxembourg, Aruba, Qatar, and Australia. Singapore, our neighboring country, apparently made quite a “contribution” too, which puts it in the Top 10. Region-wise, the list is also quite spread out. I would have expected Europe or North America to have more countries in the Top 10, but apparently not.
hist(footprint$Total.Biocapacity, xlab = "Biocapacity in Global Hectare per Person (gha/person)", main = "Spread of Ecological Biocapacity")
summary(footprint$Total.Biocapacity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.050 0.675 1.310 4.020 2.815 111.350
Apparently the biocapacity of most countries sits heavily on the lower side, within 0-10 gha. We can see a small number of outliers in the histogram, and the summary statistics show that the highest value lies far away at more than 100 gha, while the median is only 1.31 gha. If we want to keep living on this Earth, we should try to grow this number rather than the ecological footprint, so that we do not consume more resources than nature can regenerate.
footprint_plot2 <- footprint %>%
arrange(desc(Total.Biocapacity)) %>%
head(10) %>%
mutate(label = glue("{Country}
Region = {Region}
Biocapacity = {Total.Biocapacity} gha
Population = {Population..millions.} mil")) %>%
ggplot(mapping = aes(x=Total.Biocapacity, y=reorder(Country, Total.Biocapacity), text = label, fill = Total.Biocapacity)) +
geom_col() +
labs(x="Biocapacity in Global Hectare (gha)",
y="Country Name",
title = "Top 10 Biocapacity Producer",
fill = "Biocapacity Value (gha)")
ggplotly(footprint_plot2, tooltip="label")
Most countries on this list are not very well known, and many are in Latin America or Africa. This may be because these areas are less populated: with less development, fewer natural resources are sacrificed. We can also note that Australia and Canada, which were also in the Top 10 ecological footprint contributors, appear on this list too, so we can expect those countries to be fairly balanced.
Footprint vs Biocapacity
The dataset includes a column that tells us whether each country runs an ecological deficit or a reserve, calculated by simply subtracting each country’s footprint from its biocapacity. Let’s see the top 10 and the worst 10 countries in this regard.
footprint_plot3 <- footprint %>%
arrange(desc(Biocapacity.Deficit.or.Reserve)) %>%
head(10) %>%
mutate(label = glue("{Country}
Region = {Region}
Biocapacity-Footprint = {Biocapacity.Deficit.or.Reserve} gha
Population = {Population..millions.} mil")) %>%
ggplot(mapping = aes(x=Biocapacity.Deficit.or.Reserve, y=reorder(Country, Biocapacity.Deficit.or.Reserve), text = label, fill = Biocapacity.Deficit.or.Reserve)) +
geom_col() +
labs(x="Deficit/Reserve in Global Hectare per Person(gha)",
y="Country Name",
title = "Top 10 Global Ecological Preserver",
fill = "Ecological Reserve (gha)")
ggplotly(footprint_plot3, tooltip="label")
Most countries on this list also appear in the top 10 biocapacity producers. We hope this is due to green initiatives, but once again it could also be due to a small population, which means little development is happening in the country. It could be helpful to learn which actions in the top 5 countries lead to their higher biocapacity and lower footprint.
A few notable countries here are Canada and Bolivia: both have rather large populations compared with the rest of this list, yet they still help to preserve the Earth.
footprint_plot4 <- footprint %>%
arrange(Biocapacity.Deficit.or.Reserve) %>%
head(10) %>%
mutate(label = glue("{Country}
Region = {Region}
Biocapacity-Footprint = {Biocapacity.Deficit.or.Reserve} gha/person
Population = {Population..millions.} mil")) %>%
ggplot(mapping = aes(x=Biocapacity.Deficit.or.Reserve, y=reorder(Country, Biocapacity.Deficit.or.Reserve), text = label, fill = Biocapacity.Deficit.or.Reserve)) +
geom_col() +
scale_fill_gradient(low = "red", high = "black")+
labs(x="Deficit/Reserve in Global Hectare per Person(gha/person)",
y="Country Name",
title = "Worst 10 Global Ecological Preserver",
fill = "Ecological Deficit (gha)")
ggplotly(footprint_plot4, tooltip="label")
Continuing the insight from before, this list is very similar to the Top 10 Footprint Contributors. The regions are quite spread out. These countries need more green initiatives and should hold back development that sacrifices the environment.
footprint_plot5 <- footprint %>%
ggplot(mapping = aes(y=Region, x=Total.Ecological.Footprint)) +
geom_boxplot(fill = "red") +
labs(title = "Ecological Footprint in Regions",
x = "Footprint (gha)",
y = "Region Name")
footprint_plot5
footprint_plot6 <- footprint %>%
ggplot(mapping = aes(y=Region, x=Total.Biocapacity)) +
geom_boxplot(fill = "cyan")
footprint_plot6
footprint_region <- footprint %>%
filter(Other.NA.Count==0) %>%
group_by(Region) %>%
summarise(Mean.HDI = mean(HDI),
Sum.Pop.Mil = sum(Population..millions.),
Sum.Footprint = -1*sum(Total.Ecological.Footprint),
Sum.Biocap = sum(Total.Biocapacity),
Sum.Diff = sum(Biocapacity.Deficit.or.Reserve)
) %>%
mutate(label = glue("{Region}
Ecology Deficit/Reserve = {Sum.Diff} gha
Total Footprint = {Sum.Footprint} gha
Total Biocapacity = {Sum.Biocap} gha
Average HDI = {round(Mean.HDI,2)}
Total Population = {Sum.Pop.Mil} Mil
"))
footprint_region %>% select(-label)
## # A tibble: 7 x 6
## Region Mean.HDI Sum.Pop.Mil Sum.Footprint Sum.Biocap Sum.Diff
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 0.516 1004. -80.7 114. 33.7
## 2 Asia-Pacific 0.688 3855. -83.5 82.0 -1.56
## 3 European Union 0.865 504. -142. 94.9 -47.5
## 4 Latin America 0.720 604. -100. 256. 156.
## 5 Middle East/Central As~ 0.732 384. -91.7 21.8 -70.0
## 6 North America 0.91 352. -16.4 19.8 3.37
## 7 Northern/Eastern Europe 0.788 238. -45.2 34.6 -10.6
footprint_region2 <- footprint_region %>%
pivot_longer(cols = c("Sum.Footprint", "Sum.Biocap"))
footprint_region2[footprint_region2$name =="Sum.Biocap", ]$name <- "Biocapacity"
footprint_region2[footprint_region2$name =="Sum.Footprint", ]$name <- "Footprint"
footprint_region_plot <- footprint_region2 %>%
ggplot(mapping = aes(x=value, y=reorder(Region,Sum.Diff), fill=name, text=label))+
geom_bar(position = "stack", stat = "identity")+
labs(title = "Regions Sorted by Ecological Contribution",
y="Region Name",
x="Global Hectare (gha)",
fill = "Type of Contribution")
ggplotly(footprint_region_plot, tooltip="label")
As we have seen from the Top 10 and Worst 10 in the previous explorations, Latin America and Africa are the top 2 regions with a positive contribution towards the Earth’s ecological reserve. North America is only slightly on the positive side, while the other 4 regions are on the negative side, running a deficit. The Middle East/Central Asia and the European Union are the worst 2 regions comparatively.
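One caveat about the regional sums: if the Footprint and Biocapacity columns are per-person values (as the gha/person axis labels used earlier suggest), then adding them up gives every country the same weight regardless of its population. A population-weighted alternative could look like the sketch below. It uses only the six rows from the head output, copied by hand, so the numbers it prints are an illustration of the method and not a regional result. The per-person interpretation and the short column names are my assumptions.

```r
# Six rows copied from the head() output above, for illustration only.
countries <- data.frame(
  Region    = c("Middle East/Central Asia", "Northern/Eastern Europe", "Africa",
                "Africa", "Latin America", "Latin America"),
  Pop.Mil   = c(29.82, 3.16, 38.48, 20.82, 0.09, 41.09),
  Footprint = c(0.79, 2.21, 2.12, 0.93, 5.38, 3.14),  # assumed to be gha per person
  Biocap    = c(0.50, 1.18, 0.59, 2.55, 0.94, 6.92)   # assumed to be gha per person
)

# Convert per-person values to country totals (million gha).
countries$Footprint.Total <- countries$Footprint * countries$Pop.Mil
countries$Biocap.Total    <- countries$Biocap * countries$Pop.Mil

# Sum population and totals within each region.
region_totals <- aggregate(
  cbind(Pop.Mil, Footprint.Total, Biocap.Total) ~ Region,
  data = countries, FUN = sum
)

# Population-weighted per-person balance (negative = deficit, positive = reserve).
region_totals$Balance.Per.Person <-
  (region_totals$Biocap.Total - region_totals$Footprint.Total) / region_totals$Pop.Mil
region_totals
```

Weighting this way lets a populous country influence its region’s balance more than a very small one, which may change the ranking of regions compared with the unweighted sums.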
plot(footprint$Total.Ecological.Footprint, footprint$Total.Biocapacity, main = "Footprint vs Biocapacity in Each Country", xlab = "Footprint (gha)", ylab = "Biocapacity (gha)")
abline(lm(footprint$Total.Biocapacity~footprint$Total.Ecological.Footprint), col="red")
cor(footprint$Total.Ecological.Footprint, footprint$Total.Biocapacity)
## [1] 0.06658034
The trend line is almost flat, which matches the correlation value of 0.07: positive, but very weak. Therefore we can say that these two metrics show almost no linear correlation. Another insight is that the biocapacity values are concentrated at the low end relative to the footprint, as the points cluster close to the X-axis in the plot.
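Since both metrics are heavily skewed with a few extreme values, the default Pearson correlation may be dominated by those outliers. A rank-based Spearman correlation is one way to cross-check, using the method argument of cor. The sketch below runs on the six rows from the head output only, copied by hand, so the printed values illustrate the call and are not a finding about the full dataset.

```r
# Six rows copied from the head() output above; far too few for a real conclusion.
footprint_gha <- c(0.79, 2.21, 2.12, 0.93, 5.38, 3.14)
biocap_gha    <- c(0.50, 1.18, 0.59, 2.55, 0.94, 6.92)

# Pearson measures linear association and is sensitive to outliers.
pearson <- cor(footprint_gha, biocap_gha, method = "pearson")

# Spearman works on ranks, so extreme values carry less weight.
spearman <- cor(footprint_gha, biocap_gha, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the full data, the same call would be cor(footprint$Total.Ecological.Footprint, footprint$Total.Biocapacity, method = "spearman").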
footprint_pophdi_notna <- footprint %>%
filter(Other.NA.Count == 0)
plot(footprint_pophdi_notna$Total.Ecological.Footprint, footprint_pophdi_notna$Population..millions., main = "Footprint vs Population in Each Country", xlab = "Footprint (gha)", ylab = "Population (million)")
abline(lm(footprint_pophdi_notna$Population..millions. ~ footprint_pophdi_notna$Total.Ecological.Footprint), col="red")
cor(footprint_pophdi_notna$Total.Ecological.Footprint, footprint_pophdi_notna$Population..millions.)
## [1] -0.05402079
Although it is negative, the correlation between Population and the Footprint of a country is also very weak, as shown by the almost flat line and a correlation value close to 0. Looking at the spread, most countries have a rather small population, as the data points cluster close to the X-axis.
plot(footprint_pophdi_notna$Total.Biocapacity, footprint_pophdi_notna$Population..millions., main = "Biocapacity vs Population in Each Country", xlab = "Biocapacity (gha)", ylab = "Population (mil)")
abline(lm(footprint_pophdi_notna$Population..millions. ~ footprint_pophdi_notna$Total.Biocapacity), col="blue")
cor(footprint_pophdi_notna$Total.Biocapacity, footprint_pophdi_notna$Population..millions.)
## [1] -0.05743381
As with the Footprint, we can see a negative and very weak correlation between Biocapacity and Population.
plot(footprint_pophdi_notna$Total.Ecological.Footprint, footprint_pophdi_notna$HDI, main = "Footprint vs Human Development Index (HDI) in Each Country", xlab = "Footprint (gha)", ylab = "HDI Index (0.0-1.0)")
abline(lm(footprint_pophdi_notna$HDI ~ footprint_pophdi_notna$Total.Ecological.Footprint), col="red")
cor(footprint_pophdi_notna$Total.Ecological.Footprint, footprint_pophdi_notna$HDI)
## [1] 0.7388287
I expected this pair to have a better chance of a strong positive correlation, since a higher HDI can be read as more extensive development for the people, which in turn produces a larger Ecological Footprint, and apparently it is so. We can see a positive and fairly strong correlation between the HDI and the Ecological Footprint of a country, with a correlation value of 0.7388. A linear trend line might not be the most suitable for this relationship, since we can imagine a curve representing the data better, but it is still quite a good fit. Therefore we can say that a higher HDI tends to go together with a higher Footprint, and vice versa, although correlation alone does not tell us which one drives the other.
As for the spread, most countries seem to fall within the lower range of Footprint, around 0-5 gha.
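The idea that a curve might describe the HDI relationship better than a straight line can be tested by fitting HDI against the logarithm of the Footprint and comparing the fit. The sketch below does this on the six rows from the head output only, copied by hand. The choice of a log curve is my assumption, and the comparison would need to be repeated on the full dataset before drawing any conclusion.

```r
# Six rows copied from the head() output above; an illustration, not a result.
footprint_gha <- c(0.79, 2.21, 2.12, 0.93, 5.38, 3.14)
hdi           <- c(0.46, 0.73, 0.73, 0.52, 0.78, 0.83)

# Straight-line fit, as drawn in the scatter plot.
fit_linear <- lm(hdi ~ footprint_gha)

# Curved alternative: HDI as a linear function of log(Footprint).
fit_log <- lm(hdi ~ log(footprint_gha))

# Share of variance explained by each model.
r2_linear <- summary(fit_linear)$r.squared
r2_log    <- summary(fit_log)$r.squared
c(linear = r2_linear, log = r2_log)
```

A log curve rises quickly at low Footprint values and flattens at high ones, which matches the intuition that HDI cannot keep growing past 1.0 however large the Footprint becomes.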
plot(footprint_pophdi_notna$Total.Biocapacity, footprint_pophdi_notna$HDI, main = "Biocapacity vs Human Development Index (HDI) in Each Country", xlab = "Biocapacity (gha)", ylab = "HDI Index (0.0-1.0)")
abline(lm(footprint_pophdi_notna$HDI ~ footprint_pophdi_notna$Total.Biocapacity), col="blue")
cor(footprint_pophdi_notna$Total.Biocapacity, footprint_pophdi_notna$HDI)
## [1] 0.07693505
From the scatter plot, we can see a positive yet very weak correlation between HDI and Biocapacity. Most of the data is once again concentrated in the lower range of Biocapacity, as the points cluster close to the Y-axis.
Hi! My name is Calvin, I am from Jakarta, Indonesia. I am looking forward to be a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor’s Degrees, and I love playing with numbers and data. I am doing this to enhance my Data Science portfolio (constructive criticism is very much welcomed!), also as part of Learn-By-Building assignment at Algoritma Data Science School.
You can reach me at my LinkedIn for more discussion. Thank you!