Assignment 3B Approach

Author

Michael Mayne

Pre-Coding Approach

For Assignment 3B, I intend to use an LLM to find a viable dataset with a time dimension. I asked Google Gemini for assistance, and it recommended the Mauna Loa Observatory data measuring daily CO2 levels. I plan to start with an overview of the data, then summarize it by yearly average and then by weekly average. I would like to test the method in both R and SQL, but I will prioritize R because it has given me the most consistent results with CSV files.

Gathering Data & Correction

The data I am using comes from the Mauna Loa Observatory CO2 record. With the ongoing concern about climate, this observatory has consistently recorded the current CO2 levels in the atmosphere every week. My original download was not properly prepared, which condensed all of the observations together. I had to make some initial changes: removing the added blurb at the start of the downloaded file and switching from daily to weekly values, since the record starts more than 50 years ago!

My goal is essentially to see which year had the highest overall CO2 and what the highest recorded CO2 level was for each year. This will be carried out using a type of “window function”: creating a column with the information for each group and passing that information forward to every row in the group.
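To make the “window function” idea concrete, here is a minimal sketch on a small made-up table (the values and the names `toy`, `toy_window`, and `toy_summary` are illustrative assumptions, not part of the Mauna Loa data). It contrasts `mutate()`, which keeps every row and repeats the group value, with `summarise()`, which collapses each group to a single row.

```r
library(dplyr)

# Toy data with made-up values, not the Mauna Loa record
toy <- tibble(
  year    = c(2000, 2000, 2001, 2001),
  average = c(370.1, 371.5, 372.0, 373.4)
)

# Window-style: every row is kept and the group maximum is repeated on each row
toy_window <- toy %>%
  group_by(year) %>%
  mutate(yearly_max = max(average)) %>%
  ungroup()

# Aggregate-style: each group collapses to one row
toy_summary <- toy %>%
  group_by(year) %>%
  summarise(yearly_max = max(average))
```

The window-style result still has four rows, while the aggregate-style result has only two, one per year.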

Original Data Source: https://gml.noaa.gov/ccgg/trends/data.html

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
CO2_Measure <- read.csv('https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/main/co2_weekly_mlo.csv')
glimpse(CO2_Measure)
Rows: 2,700
Columns: 9
$ year                <int> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 19…
$ month               <int> 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9,…
$ day                 <int> 19, 26, 2, 9, 16, 23, 30, 7, 14, 21, 28, 4, 11, 18…
$ decimal             <dbl> 1974.380, 1974.399, 1974.418, 1974.437, 1974.456, …
$ average             <dbl> 333.37, 332.95, 332.35, 332.20, 332.37, 331.73, 33…
$ ndays               <int> 5, 6, 5, 7, 7, 5, 6, 6, 5, 7, 4, 5, 5, 6, 7, 5, 4,…
$ X1.year.ago         <dbl> -999.99, -999.99, -999.99, -999.99, -999.99, -999.…
$ X10.years.ago       <dbl> -999.99, -999.99, -999.99, -999.99, -999.99, -999.…
$ increase.since.1800 <dbl> 50.40, 50.06, 49.60, 49.65, 50.06, 49.72, 50.02, 5…

The “1 year ago” and “10 years ago” columns both contain useful information but are not needed at this time. The same goes for the decimal column, which indicates how far into the year each recording falls; it is valid but not in a format that is useful right now. So those columns will be removed.

CO2measure_Clean <- CO2_Measure %>%
  select(-X1.year.ago, -X10.years.ago, -decimal)

head(CO2measure_Clean)
  year month day average ndays increase.since.1800
1 1974     5  19  333.37     5               50.40
2 1974     5  26  332.95     6               50.06
3 1974     6   2  332.35     5               49.60
4 1974     6   9  332.20     7               49.65
5 1974     6  16  332.37     7               50.06
6 1974     6  23  331.73     5               49.72
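Since the decimal column was dropped for not being in a convenient format, one possible alternative is to build a proper date from the remaining year, month, and day columns. The sketch below assumes `lubridate::make_date()` and uses a small stand-in table copied from the first rows shown above; the names `co2_sample` and `date` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Stand-in for the first rows of CO2measure_Clean (values copied from the head() output)
co2_sample <- tibble(
  year    = c(1974L, 1974L, 1974L),
  month   = c(5L, 5L, 6L),
  day     = c(19L, 26L, 2L),
  average = c(333.37, 332.95, 332.35)
)

# Combine the three integer columns into a single Date column
co2_sample <- co2_sample %>%
  mutate(date = make_date(year, month, day))
```

A Date column like this can be sorted, filtered, and used directly as a plot axis.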

Grouping a Window Function of Data

This result was produced by first grouping the data by year and then pulling the maximum of the weekly average CO2 for each year. The same was then done for the maximum increase over the 1800 estimate.

CO2measure_Clean %>%
  group_by(year) %>% 
  mutate(
    Yearly_Max_CO2 = max(average, na.rm = TRUE),
    Year_Max_Increase = max(increase.since.1800, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(Yearly_Max_CO2))
# A tibble: 2,700 × 8
    year month   day average ndays increase.since.1800 Yearly_Max_CO2
   <int> <int> <int>   <dbl> <int>               <dbl>          <dbl>
 1  2025     1     5    426.     7                146.           431.
 2  2025     1    12    427.     6                146.           431.
 3  2025     1    19    427.     7                146.           431.
 4  2025     1    26    427.     6                146.           431.
 5  2025     2     2    427.     7                146.           431.
 6  2025     2     9    427.     5                146.           431.
 7  2025     2    16    428.     5                147.           431.
 8  2025     2    23    427.     7                146.           431.
 9  2025     3     2    428.     7                147.           431.
10  2025     3     9    429.     5                147.           431.
# ℹ 2,690 more rows
# ℹ 1 more variable: Year_Max_Increase <dbl>
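Because the window-style output repeats the yearly maximum on all 2,700 rows, it can be easier to read a one-row-per-year table when asking which year ranks highest. The following is a rough sketch on made-up values (the table `toy` and the names `yearly` and `top_year` are assumptions for illustration); the same pipeline could presumably be pointed at `CO2measure_Clean`.

```r
library(dplyr)

# Made-up weekly values for three years, not the real record
toy <- tibble(
  year    = c(2023, 2023, 2024, 2024, 2025, 2025),
  average = c(420.1, 424.3, 422.8, 427.0, 426.2, 430.5)
)

# Collapse to one row per year, then sort from highest to lowest maximum
yearly <- toy %>%
  group_by(year) %>%
  summarise(Yearly_Max_CO2 = max(average, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Yearly_Max_CO2))

# Keep only the year with the highest maximum
top_year <- yearly %>%
  slice_max(Yearly_Max_CO2, n = 1)
```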

Conclusions

From the output above we can conclude that 2025 has the highest yearly maximum in the data: its highest weekly CO2 average, 430.86 ppm, is the highest since the beginning of the record. This is a concerning trend, because there is very little variance and no significant period in which the yearly values dropped. I would also like to plot this, because a table alone is not nearly enough to show how dramatic the increase in CO2 has been.
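As a starting point for that plot, here is a minimal ggplot2 sketch of yearly maximum CO2 against year. The table `yearly_toy` holds made-up values purely so the example runs on its own; in practice it would be replaced by a one-row-per-year summary of the real data, and the labels are only suggestions.

```r
library(dplyr)
library(ggplot2)

# Made-up yearly maxima standing in for a real per-year summary
yearly_toy <- tibble(
  year           = c(1975, 1985, 1995, 2005, 2015, 2025),
  Yearly_Max_CO2 = c(334, 349, 364, 383, 404, 431)
)

# Line plus points so each year's value stays visible
p <- ggplot(yearly_toy, aes(x = year, y = Yearly_Max_CO2)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Yearly maximum CO2 (illustrative values)",
    x = "Year",
    y = "CO2 (ppm)"
  )
```

Printing `p` draws the chart.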