This is the week 3 data dive R Markdown notebook. Last week I accidentally used a standard Markdown file, not realizing that the notebook version is a strictly better version of it, so I am using an R Notebook for this assignment.

We begin by loading the CSV into a data frame, which we will call df_main.

#load the tidyverse library (this is very important)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#load the lubridate library (already attached as part of tidyverse 2.0, but loaded explicitly here)
library(lubridate)
#load our dataset. We call this the main data frame, kind of like int main(){} in C
df_main <- read.csv("climate_change_dataset.csv")

PART 1: We start by breaking the main dataset into 3 smaller data frames, each built with the group_by() function. The first data frame will be for CO2: we compute the average CO2 emissions for each country while also counting the rows of data that exist for that country. Seeing how many observations were recorded for each country lets us gauge the reliability of that country's emission data.

#create the first of 3 group_by data frames. This one will be co2 emissions relative to country
df_co2 <- df_main |>
  group_by(Country) |>
  summarise(
    #we calculate the average co2 while ignoring any missing values
    avg_co2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    #this line will count the number of rows that exist for each country
    record_count = n()
  ) |>
  #this will calculate the probability of selecting a row from this group if we were to select randomly
  mutate(probability = record_count / sum(record_count))

#we can assign a tag to the group or groups with the lowest probability (aka the smallest sample size). This can help us detect anomalies
df_co2 <- df_co2 |>
  mutate(Key = ifelse(probability == min(probability), "Lowest Probability", "Normal Probability"))


#visualization 
df_co2 |>
  ggplot(aes(x = reorder(Country, avg_co2), y = avg_co2, fill = Key)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("Lowest Probability" = "red", "Normal Probability" = "blue")) +
  labs(
    title = "Average CO2 Emissions by Country",
    subtitle = "Red bars indicate groups with the lowest probability",
    x = "Country",
    y = "Average CO2 Emissions (Tons/Capita)"
  )

Insights: It is obvious that if we only collect one instance of data for anything, its integrity would be in question. In other words, as the sample size for some metric increases, the precision of that metric increases, which in turn allows us to draw more confident conclusions from the data. In the graph above we have plotted the average CO2 emissions for each country. Of all these countries, Mexico has the fewest data points, and thus the least precision; it is also the country least likely to be selected at random. This matters because if we were to randomly sample rows from this dataset, we are not guaranteed a uniform representation of each country, since the number of data points per country is not uniform (or close to uniform). Our conclusions may then carry a bias we did not realize we had because of this outlier. It also gives us reason to question whether the emission values in the dataset accurately reflect the true emissions in Mexico, given that fewer data points were collected for it than for all the other countries.
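To make this concern concrete, here is a minimal sketch (reusing the df_co2 frame built above) that quantifies how far each country's record count deviates from a perfectly uniform split. The chi-squared goodness-of-fit check is my own addition for illustration, not part of the required analysis.

#sketch: how non-uniform is the sampling across countries?
#under perfect uniformity, every country would hold 1/n of the rows
uniform_share <- 1 / nrow(df_co2)

df_co2 |>
  mutate(deviation = probability - uniform_share) |>
  arrange(deviation) |>
  select(Country, record_count, probability, deviation)

#chi-squared goodness-of-fit test against the uniform distribution;
#a small p-value would suggest the per-country counts are not uniform
chisq.test(df_co2$record_count)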

This data frame maps rainfall levels against the percentage of forest area. This one was a bit difficult to set up correctly, so I had to leave out countries and years (I was thinking about how to incorporate them too; a sketch of one way to do that follows the insight below), but we can still gain insight from it.

#here we will make one for rainfall and forest area
df_rain <- df_main |>
  #we can split rainfall into 3 levels, low, medium, and high
  mutate(rain_lvl = cut_interval(Rainfall..mm., n = 3, labels = c("Low", "Medium", "High"))) |>
  group_by(rain_lvl) |>
  summarise(
    #here we can get the average forest coverage %
    avg_forest = mean(Forest.Area...., na.rm = TRUE),
    #also we can get the number of data points collected in each rainfall bucket
    count = n()
  )

#visualization
df_rain |>
  ggplot(aes(x = rain_lvl, y = avg_forest)) +
  geom_point(size = 4, color = "darkgreen") +
  geom_segment(aes(x = rain_lvl, xend = rain_lvl, y = 0, yend = avg_forest)) +
  labs(title = "Forest Area Coverage by Rainfall Level", y = "Avg Forest Area %")

Insight: Here the rainfall levels sit in 3 separate buckets, with the average forest area percentage on the Y axis. Mostly, what we see makes sense: as rainfall increases, the average forest area increases, and the highest rainfall bucket has the largest forest area. However, notice something interesting. Even though the low bucket is supposed to have the least rain (and we would think less rain = less forest area), the medium rainfall level has slightly less forest area than the low level. This is very interesting and goes to show why you should never draw a conclusion from a single diagram or chart. If I presented this to you without showing you the actual dataset, you could not tell what data is missing here and might walk away with the wrong conclusion. As for why this anomaly occurs (again, we expect more rain to lead to larger forest areas), there could be a few reasons. First, maybe the data is unevenly sampled from different parts of the world. The data could have limited scope, failing to account for more complex geographic effects such as the rain shadow effect. We also have no data on how long each forest area has persisted: a region could see an explosion of growth in flora followed by a massive decline due to increasingly amplified weather/climate anomalies or human intervention, and the data could have been collected during either the growth phase or the withering phase.
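One way to probe whether uneven geographic sampling explains the dip is to fold Country back into the grouping, which the setup above left out. This is just a sketch of that idea, not part of the graded analysis:

#sketch: break the rainfall buckets down by country to see whether
#a handful of countries dominate the medium bucket
df_main |>
  mutate(rain_lvl = cut_interval(Rainfall..mm., n = 3, labels = c("Low", "Medium", "High"))) |>
  group_by(rain_lvl, Country) |>
  summarise(
    avg_forest = mean(Forest.Area...., na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(rain_lvl, desc(n))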

For this last data frame we simply split the dataset into buckets where each bucket is a year (basically we group_by the Year attribute). From that we find the total number of extreme weather events that took place each year, sum them up, and plot them.

#we will now create the final data frame for extreme weather effects
df_extreme <- df_main |>
  #group data by year
  group_by(Year) |>
  summarise(
    #sum up all the extreme weather events that happened in each year
    total_extreme_events = sum(Extreme.Weather.Events, na.rm = TRUE),
    #count how many observations we have for specified year
    obs_count = n()
  )


#visualizations
df_extreme |>
  ggplot(aes(x = Year, y = total_extreme_events)) +
  geom_line(color = "orange", linewidth = 1) +
  labs(title = "Total Extreme Weather Events per Year")

Insight: Here the graph plots the total number of extreme weather events over the course of many years. In this bigger-picture view we may not want to look at things country by country, but rather at the entire globe in general. This is important for when we want to zoom out and take a bird's-eye view of a dataset. We can see that the year 2000 was one of the largest on record for the number of extreme weather events globally, and that 2009 was the lowest year. Although we could say the general trend since 2000 is that the number of extreme weather events is going down year to year, this alone cannot give us enough evidence to draw an overarching conclusion about things like the state of global warming or climate change. For one, we do not know the objective definition of an extreme weather event. It is possible that the count is decreasing because our evaluation of what constitutes an extreme weather event has become more refined and nuanced. It is also possible that our infrastructure, prediction, and safety protocols have improved as a species, so that we avoid massive loss of life and infrastructure when an extreme weather event does take place, and our definition of extreme weather has been made harsher to compensate. The point is that our data is still limited. We could bring in another dataset and do a unified analysis of both to get a clearer understanding of what is going on here. If nothing else, it would let us draw a conclusion more confidently.
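To put a rough number on the "going down year to year" claim, here is a quick sketch that fits a straight line through the yearly totals. The linear model is my own addition for illustration; a negative slope would be consistent with a declining trend, nothing more:

#sketch: fit a simple linear trend to the yearly totals
#the slope estimates the average change in total events per year
trend <- lm(total_extreme_events ~ Year, data = df_extreme)
summary(trend)$coefficients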

Hypothesis: If a country has fewer renewable sources, then it will have fewer data points in the dataset and thus be rarer than other countries.

So this example hypothesis basically says that as the amount of renewable resources for a country decreases, the amount of data collected for that country also decreases relative to others. Data collection has to bring in raw numbers before we can start drawing information and conclusions from them. If a country is already not investing heavily in renewable sources, perhaps it is not a political or scientific priority for it to collect the kind of data that countries more aware of their ecological footprint would. This would mean the number of data points and categories is reduced for that country.
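A first-pass check of this hypothesis could correlate each country's average renewable share with its record count. This sketch reuses the column names from above and is only a starting point, not a test of causation:

#sketch: does a lower renewable share go with fewer records per country?
df_hyp <- df_main |>
  group_by(Country) |>
  summarise(
    avg_renew = mean(Renewable.Energy...., na.rm = TRUE),
    record_count = n()
  )

#a positive correlation would be consistent with the hypothesis
cor(df_hyp$avg_renew, df_hyp$record_count, use = "complete.obs")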

PART 2: For this section we will combine the CO2 emissions variable with the renewable resources variable. From this we will derive a new variable that does not yet exist: “Green Status”.

#here we combine the co2 and renewable variables to form a new variable that does not yet exist
df_green <- df_main |>
  mutate(
    #bin co2 into high and low lvls using the median as the split point
    co2_lvl = ifelse(CO2.Emissions..Tons.Capita. > median(CO2.Emissions..Tons.Capita., na.rm = TRUE), "High CO2", "Low CO2"),
    
    #do the same for renewables, bin into high and low
    renew_lvl = ifelse(Renewable.Energy.... > median(Renewable.Energy...., na.rm = TRUE), "High Renew", "Low Renew")
  ) |>
  #combine them into a single "green status" category
  mutate(
    green_status = paste(co2_lvl, "&", renew_lvl)
  )

#build the data frame of all unique combinations
df_combinations <- df_green |>
  group_by(co2_lvl, renew_lvl) |>
  summarise(
    count = n(),
    avg_forest = mean(Forest.Area...., na.rm = TRUE),
    .groups = "drop"
  )

#find the missing or least common combination
#this can identify if for example there are no countries with low co2 AND low renewables
full_grid <- df_combinations |>
  complete(co2_lvl, renew_lvl, fill = list(count = 0))
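As a usage note, full_grid is not used again below, but it is what would reveal an absent combination; a one-line sketch:

#sketch: any combination with count == 0 never occurs in the data
full_grid |> filter(count == 0)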

#visualization heatmap to show the "Green Status" distribution
ggplot(df_combinations, aes(x = co2_lvl, y = renew_lvl, fill = count)) +
  geom_tile(color = "white", linewidth = 1) +
  #adds the counts inside the boxes
  geom_text(aes(label = count), color = "white", size = 6) +
  scale_fill_gradient(low = "gray80", high = "darkgreen") +
  labs(
    title = "Distribution of Countries by Green Status",
    subtitle = "Heatmap of combined CO2 and Renewable Energy levels",
    x = "CO2 Emission Category (high/low)",
    y = "Renewable Energy Category (high/low)"
  ) +
  theme_minimal()

Insights: We used the CO2 emissions and renewable energy metrics to derive a new green status variable. Although fully understanding how "green" a country is involves more complexity than just its renewable energy share and CO2 emissions, this new variable encapsulates the idea pretty well. After combining the two metrics, we can use a heat map to visualize our bins. This shows a comparison between how much renewable energy a country has and its CO2 emissions, which in turn gives us a rough estimate of its "green status". We can see that the least common combination of emissions and renewables is high renewables AND high CO2 (the square marked 243). This means the number of observations in which a country has BOTH high renewables AND high CO2 is smaller than for any other combination. When we think about it, this makes sense. If a country has a high amount of renewable energy, it will burn less fossil fuel, coal, natural gas, etc., because it does not need energy from those sources, which in turn reduces its CO2 emissions. Thus we can estimate the "green status" of a country using this combined metric. Obviously, in the real world we would need many more variables, such as any carbon capture technology programs the country runs or commitments it has made to reduce CO2 emissions, but within our dataset itself this can be a pretty good metric.
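Rather than reading the least common combination off the heatmap, we could also pull it out of df_combinations directly; a one-line sketch using dplyr's slice_min:

#sketch: programmatically find the rarest green-status combination
df_combinations |> slice_min(count, n = 1)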