Initialization:

Let's load the tidyverse library and read in the CSV file for the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df_main <- read.csv("climate_change_dataset.csv")
head(df_main)

Task #1

1.) Let's first talk about two columns in my dataset that were unclear to me until I viewed the documentation. The two that I had trouble with were Renewable Energy (%) and Forest Area (%). These two sound very abstract when you think about it. What does forest area percentage mean? What is the percent based on? So many variables go into renewable energy, how can we distill all of that into a single number? Once I read the documentation it made a lot more sense. The documentation states that renewable energy is the “percentage of total energy consumption in a country that comes from renewable energy sources (solar, wind, hydro, etc.)”. It further clarifies that the metric is important for tracking the progress made toward sustainable energy and reduced emissions. This at the very least ties the variable to a tighter set of attributes rather than leaving it vague. I think they encoded the data this way because the set of sources (solar, wind, hydro) is not fixed and we could in theory keep adding more items; for example, geothermal energy could be in there too. Having the list trail off with “etc.” acknowledges that many different technologies contribute to the overall percentage without listing out potentially hundreds of sources. We can encapsulate all of that in a single percentage.
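
As a minimal sketch of how such a percentage could be derived, suppose we had source-level consumption numbers. The sources tibble below and its energy_twh values are made up for illustration and are not part of this dataset:

#hypothetical sketch: deriving a renewable-energy percentage from source-level data
sources <- tibble(
  source     = c("solar", "wind", "hydro", "coal", "gas"),
  energy_twh = c(50, 80, 120, 400, 350)   #made-up consumption figures
)
sources |>
  summarise(
    renewable_pct = 100 * sum(energy_twh[source %in% c("solar", "wind", "hydro")]) /
      sum(energy_twh)
  )
#returns 25, i.e. 25% of total consumption comes from renewables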

Similarly, Forest Area (%) can also be investigated in the documentation. When we do, we learn that it is the “percentage of the total land area of a country covered by forests”. This makes sense, and now we know concretely what area the percentage is derived from. It is also stated that “Forest cover is a critical indicator of biodiversity and carbon sequestration”. This means that if we wanted to do some deeper analysis, we could pull the carbon emissions variable alongside this one and use the two together for a more comprehensive analysis, as in the sketch below.
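
As a minimal sketch of that kind of combined analysis, we could plot Forest Area (%) against CO2 emissions, using the dotted column names that read.csv produces for this file:

#sketch: relate forest cover to emissions in one plot
df_main |>
  ggplot(aes(x = Forest.Area...., y = CO2.Emissions..Tons.Capita.)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Forest Area vs CO2 Emissions",
    x = "Forest Area (%)",
    y = "CO2 Emissions (Tons per Capita)"
  ) +
  theme_minimal()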

Without reading the documentation I could have failed to understand the full scope of what these variables encompass and drawn inadequate conclusions from the data.

Task #2

2.) My dataset is pretty well documented, but there is one thing that I didn't fully understand. The Extreme Weather Events variable is defined as “the number of extreme weather events recorded in each country, such as hurricanes, floods, wildfires, and droughts”. The part that I don't fully understand (or would improve if I were collecting the data) is how we classify whether something is an actual extreme weather event. If a hurricane grazed me and all I got was some light rain and slightly heavy wind, does that count as my area being affected by extreme weather? Or is damage to the area the metric used to determine whether a weather event qualifies as “extreme”? These might not necessarily be things that confuse me so much as details the documentation leaves open to interpretation, as the sketch below makes concrete.
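
To make the ambiguity concrete, here is a hypothetical sketch of two plausible classification rules. The events table and its columns (wind_kmh, damage_usd) are invented for illustration and do not exist in this dataset:

#hypothetical sketch: two different rules for labeling an event "extreme"
events <- tibble(
  event      = c("grazing hurricane", "rural flood", "urban storm"),
  wind_kmh   = c(125, 40, 90),           #made-up intensities
  damage_usd = c(5e4, 2e5, 8e6)          #made-up damage figures
)
events |>
  mutate(
    extreme_by_intensity = wind_kmh >= 119,    #hurricane-force wind threshold
    extreme_by_damage    = damage_usd >= 1e6   #arbitrary damage threshold
  )
#the two rules disagree on which events count, which is exactly the
#ambiguity the documentation leaves open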

Task #3

3.) In this section we will pick the Extreme Weather Events variable as the one to build our visualizations around. We will then see whether the visualizations can highlight some of the issues with this variable that we mentioned in part 2.).

#DEBUG: confirm the column names that read.csv produced
names(df_main)
##  [1] "Year"                        "Country"                    
##  [3] "Avg.Temperature...C."        "CO2.Emissions..Tons.Capita."
##  [5] "Sea.Level.Rise..mm."         "Rainfall..mm."              
##  [7] "Population"                  "Renewable.Energy...."       
##  [9] "Extreme.Weather.Events"      "Forest.Area...."

Visualization 1

#produce the bar graph
df_main |>
  group_by(Country) |>
  summarise(avg_events = mean(Extreme.Weather.Events, na.rm = TRUE)) |>
  arrange(desc(avg_events)) |>
  ggplot(aes(x = reorder(Country, avg_events), y = avg_events, fill = Country)) +
  geom_col() +
  labs(
    title = "Avg Extreme Weather Events by Country",
    x = "Country",
    y = "Avg Number of Extreme Weather Events"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Insight: ^^^Above we have a bar graph that plots the average number of extreme weather events recorded in the dataset for each country. We can see that there is visually a difference between, say, France and Mexico in the number of events, but because there is no concrete explanation of what metrics are used to decide whether something classifies as an extreme weather event, there is no deep insight we can gain from an analysis perspective by comparing the two countries. We can only take away a surface-level understanding. For example, we cannot say “The climate near France is more hostile than near Mexico” because, again, what if France has a higher cost of rebuilding and repairing after weather events, so its counted events appear higher than in Mexico, where perhaps Mexico can absorb more severe weather events without incurring as much cost?

Geography within a country also matters. If a country is mostly desert or rainforest and we do not define what an extreme weather event is, or we base our counts on the cost of damages or something similar, then an event could hit a remote part of the country where little damage occurs and we would end up not counting it as an extreme event. On the flip side, a relatively small-scale weather event could hit a densely populated area, lead to significant loss of life and property damage, and end up classified as an extreme weather event. The point is, without a concrete basis for what the variable is calculated from, we cannot draw deep conclusions from it.

Visualization 2

#produce the boxplot of the distribution
df_main |>
  ggplot(aes(x = Country, y = Extreme.Weather.Events, fill = Country)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Distro of Extreme Weather Events by Country",
    x = "Country",
    y = "Extreme Weather Events"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Insight: ^^^The above visualization shows the distribution and variation of the reported extreme weather events for each country. We can see that, relatively speaking, most countries' medians tend to hover in the 5-10 event interval, but the variation within each country is quite high. This variation may indicate that some bias or other noise is entering the data through the way certain countries report it, or through the quality of the data collected, since not all countries will be able to provide high-quality data. This further reinforces our previous conclusion that we must be careful when drawing conclusions from this variable: we are not given an explanation of how it is derived, and thus we have no context with which to explain away this variation.
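
One way to put a rough number on that spread is to compute per-country summary statistics alongside the medians:

#quantify the per-country spread behind the boxplots
df_main |>
  group_by(Country) |>
  summarise(
    median_events = median(Extreme.Weather.Events, na.rm = TRUE),
    iqr_events    = IQR(Extreme.Weather.Events, na.rm = TRUE),
    sd_events     = sd(Extreme.Weather.Events, na.rm = TRUE)
  ) |>
  arrange(desc(sd_events))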

To reduce the negative consequences of conclusions drawn from this data, I would amend the documentation so that it gives a clear and concrete explanation of how a number like Extreme Weather Events is calculated, and flags possible areas of caution, so that anyone using my dataset is aware of the risks involved in drawing conclusions from that variable. If I were the data collector, I would ensure that my variable is as primitive as possible (meaning I would not combine variables together and give the result a name that misleadingly injects an inference, as this adds bias and abstraction and removes granularity) and then fully explain how I collected the data for that variable, as sketched below.
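
As a hypothetical sketch of what a more primitive layout could look like, the dataset could ship an event-level log (every column and value below is invented for illustration) and let the aggregate count be reproduced, and audited, from it:

#hypothetical sketch of a primitive, event-level record
event_log <- tibble(
  country    = c("France", "Mexico", "Mexico"),
  date       = as.Date(c("2020-08-14", "2020-09-02", "2020-11-20")),
  event_type = c("flood", "hurricane", "drought"),
  damage_usd = c(3.2e6, 8.9e7, 1.1e6)    #made-up values
)
#an aggregate like Extreme.Weather.Events then falls out of the raw log
#under whatever stated definition of "extreme" the documentation chooses
event_log |>
  filter(damage_usd >= 1e6) |>    #example definition: a damage threshold
  count(country, name = "extreme_weather_events")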

Task #4:

4.) We will pick Renewable Energy (%) and Forest Area (%) and treat their binned percentage ranges as our categories. We will check for physically missing values, implicitly missing values, and any empty groups.

df_main |>
  summarise(
    missing_renewable = sum(is.na(Renewable.Energy....)),
    missing_forest_area = sum(is.na(Forest.Area....))
  )

Insight: ^^^ As we can see, the two selected variables have no physically missing values. We figured this out by counting the NA entries for the two variables, i.e., slots where data should be present but is not. Both counts came back 0, meaning all slots in the dataset are populated. There is nothing explicitly missing.

df_main |>
  mutate(
    cat_renewable = cut(Renewable.Energy....,
                             breaks = seq(0,100,10),
                             include.lowest = TRUE),
    cat_forest_area = cut(Forest.Area....,
                          breaks = seq(0,100,10),
                          include.lowest = TRUE)
  ) |>
  count(cat_renewable, cat_forest_area) |>
  tidyr::complete(cat_renewable, cat_forest_area)

Insight: ^^^ Above we check for implicitly missing values. We can find them by grouping the values into percentage ranges; when we do, we see that some intervals are missing from the dataset entirely. For example, no country falls in the 0% to 10% forest area range, yet from looking at a map we know there are places in the world with virtually no forest at all. This may indicate an underlying bias or a limited scope in the dataset, suggesting it does not capture the full range of environmental conditions one would expect from a variable named Forest Area (%).
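
As a quick sanity check on that claim, we can look directly at the smallest observed values for both variables:

#confirm the lower edge of the observed ranges
df_main |>
  summarise(
    min_forest    = min(Forest.Area...., na.rm = TRUE),
    min_renewable = min(Renewable.Energy...., na.rm = TRUE)
  )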

df_main |>
  mutate(cat_renewable = cut(Renewable.Energy...., breaks = seq(0,100,10))) |>
  count(cat_renewable)
df_main |>
  mutate(cat_forest_area = cut(Forest.Area...., breaks = seq(0,100,10))) |>
  count(cat_forest_area)

Insights: ^^^ We can see that there are no empty groups for the percentage ranges. There are, however, varying numbers of observations per group, indicating that the dataset may have uneven coverage. This is another signal that can be used to assess the health of the dataset and to judge whether a conclusion drawn for some hypothesis holds up or is skewed by some hidden property of the data.

Task #5

5.) For this task we will select rainfall amount as our continuous variable and figure out what an outlier for it would represent.

df_main |>
  summarise(avg_rainfall = mean(Rainfall..mm., na.rm = TRUE))
df_main |>
  group_by(Country) |>
  summarise(avg_rainfall_country = mean(Rainfall..mm., na.rm = TRUE))
#get quartiles
Q1 <- quantile(df_main$Rainfall..mm., 0.25, na.rm = TRUE)
Q3 <- quantile(df_main$Rainfall..mm., 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1

#define outlier bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

#extract the outliers
rainfall_outliers <- df_main |>
  filter(Rainfall..mm. < lower_bound | Rainfall..mm. > upper_bound)
#return the output
rainfall_outliers

Insight: ^^^ We check to see whether there are any outliers for rainfall. Before this last step we also retrieved the average rainfall for the entire dataset and for every country, just to see what the data looks like and to visually confirm the kinds of values we are looking for. We can see from the output that no value falls outside the 1.5 × IQR bounds we specified. It is important to note, however, that outliers are more complicated than just setting a range and calling it a day. We could define a different test for outliers, and it could flag different observations. It is very dataset dependent, and we also need to interpret what an outlier means, because a value that is an outlier in one column might not be an outlier in another.
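
As one example of a more dataset-dependent test, we could apply the same 1.5 × IQR rule within each country instead of globally, since rainfall that is typical worldwide may still be unusual for a particular country:

#per-country variant of the IQR outlier test
df_main |>
  group_by(Country) |>
  mutate(
    q1  = quantile(Rainfall..mm., 0.25, na.rm = TRUE),
    q3  = quantile(Rainfall..mm., 0.75, na.rm = TRUE),
    iqr = q3 - q1
  ) |>
  filter(Rainfall..mm. < q1 - 1.5 * iqr | Rainfall..mm. > q3 + 1.5 * iqr) |>
  ungroup()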