There are two main objectives for this work:
Generate a graph showing a periodical oscillation superposed with a slower global evolution of the CO2
Caracterize the oscillation and propose a model of the global trend. We will also give an estimation of the CO2 concentration.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We first need to download the weekly measures from the Scripps CO2 Program Website:
# Downloading the data
csv_url <- "https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/weekly/weekly_in_situ_co2_mlo.csv"
df <- read.csv(csv_url, sep = ",", comment.char = "\"", blank.lines.skip=TRUE, strip.white=TRUE, skip=44, header=FALSE)We are skipping 44 lines because the csv file contains a header of 44 lines of comments.
The columns in the file are not named, so we call them date and CO2:
Let us take a quick look at the data and see the type of each column:
## date CO2
## "integer" "double"
We see that the type of the column date is not Date but integer.
We will thus convert it to a type Date.
## date CO2
## "Date" "numeric"
We will now check that there is no NA values in the data set.
# Checking if there are NAs in the data
na_lines <- apply(df, 1, function(x) any(is.na(x)))
number_of_na_lines <- nrow(df[na_lines,])
number_of_na_lines## [1] 0
There are 0 NA in the dataset.
Let us remove the potential lines containing NA:
Let us discover the dataset and simply plot the evolution of the CO2 concentration from 1958 to present days:
p <- ggplot(data = df, aes(x = date, y = CO2)) +
theme_linedraw() +
geom_point(size = 0.1) +
xlab("Date") + ylab("CO2 Concentation (in ppm)") +
ggtitle("Evolution of the CO2 Concentration from 1958 to present days")
pThe previous graph is zoomed in the data, but let us take a step back and look with the y axis starting at 0:
We can see that there seems to be a periodical oscillation but also the global trend is increasing.
Instead of using pure dates, we will convert each date into the number of days since the first measurement of the data set
We can try to fit a line for the CO2 concentration:
##
## Call:
## lm(formula = CO2 ~ day, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6800 -3.0919 -0.6282 2.6112 12.6771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.060e+02 1.552e-01 1971.1 <2e-16 ***
## day 4.317e-03 1.184e-05 364.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.285 on 3150 degrees of freedom
## Multiple R-squared: 0.9769, Adjusted R-squared: 0.9769
## F-statistic: 1.33e+05 on 1 and 3150 DF, p-value: < 2.2e-16
We can plot the estimation of this model:
# Compute the predictions
df$prediction_line <- predict(reg_line, df)
# Plot the data and the predictions
ggplot(data = df, aes(x = date, y = CO2)) +
theme_linedraw() +
geom_point(size = 0.1) +
geom_line(aes(x = date, y = prediction_line), colour = "blue") +
xlab("Date") + ylab("CO2 (in ppm)") +
ggtitle("CO2 Variations and its Linear Estimation")We can see that even if the \(R^2\) value in the summary was close to 1, a line does not seem to be a good fit for this data.
Indeed, this model…
… is underestimating from year 1958 to 1970
… is overestimating from yaer 1970 to 2005
… is underestimating from year 2005 to the present days
As we saw previously, a line is not a good fit to model the data.
Let us try with a polynom of degree 2: \[ CO2(day) = a \times day^2 + b \times day + c \]
A polynom of degree 2 seems adapted to this situation because the evolution of the CO2 concentration looks slighlty quadratic.
We can use the function poly to quickly generate of polynom.
##
## Call:
## lm(formula = CO2 ~ poly(day, 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3360 -1.7984 0.1008 1.7953 5.7645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.553e+02 4.003e-02 8876.69 <2e-16 ***
## poly(day, 2)1 1.563e+03 2.247e+00 695.33 <2e-16 ***
## poly(day, 2)2 2.048e+02 2.247e+00 91.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.247 on 3149 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.9936
## F-statistic: 2.459e+05 on 2 and 3149 DF, p-value: < 2.2e-16
We can now plot the data and the predictions:
df$prediction_quad <- predict(reg_quad, df)
ggplot(data = df, aes(x = date, y = CO2)) +
theme_linedraw() +
geom_point(size = 0.1) +
geom_line(aes(x = date, y = prediction_quad), colour = "blue") +
xlab("Date") + ylab("CO2 (in ppm)") +
ggtitle("CO2 Variations and its Quadratic Estimation")The quadratic model fits way more the data.
As our regression takes the number of days from 1958-03-29, we have to substract this date to have a valid entry.
We will do a prediction of the CO2 concentration evolution from the current date to the January 1st 2025.
# January, 1st 2025
date_2025 <- as.Date("2025-01-01")
day_2025 <- as.integer(date_2025 - date_first_measure)
# Current date
date_today <- Sys.Date()
day_today <- as.integer(date_today - date_first_measure)# Creating points to estimate
data_prediction <- data.frame(day = seq(day_today, day_2025, 7))
data_prediction$date <- as.Date(date_first_measure + data_prediction$day)
# Predicting the CO2 Concentration
data_prediction$CO2 <- predict(reg_quad, data_prediction)
ggplot() +
theme_linedraw() +
geom_point(data = df, aes(x = date, y = CO2), size = 0.1) +
geom_line(data = data_prediction, aes(x = date, y = CO2), color = "red") +
xlab("Date") + ylab("CO2 (in ppm)") +
ggtitle("CO2 Variations to the Present Day and its Estimation until 2025")We can say that if the CO2 concentration in the air continues to follow this quadratic grow, the concentration of CO2 in the air in 2025 will be around 425 ppm.
We will add a column in our data frame to indicate the which day of the year this is January 1st being day 0 and December 31st being 365. This will help us compare each year.
df$year <- as.integer(format(df$date, "%Y"))
df$day_of_year <- as.integer(df$date - as.Date(paste(df$year, "01", "01", sep = "-")))We also want to look at the variations of CO2 in the year during the year. So we will substract the value of the first measure of the year for every measure of this year.
df$CO2_year <- 0
df$percentage_increase_year <- 0
for (y in df$year) {
first_CO2_value_of_the_year <- df[df$year == y,]$CO2[1]
df[df$year == y,]$CO2_year <- df[df$year == y,]$CO2 - first_CO2_value_of_the_year
df[df$year == y,]$percentage_increase_year <- df[df$year == y,]$CO2 * 100 / first_CO2_value_of_the_year - 100
}We can plot the yearly evolution of some years:
ggplot(data = subset(df, year == 1960 | year == 1997 | year == 2019),
aes(x = day_of_year, y = percentage_increase_year, color = factor(year))) +
theme_linedraw() +
geom_line() +
scale_y_continuous(limits = c(-2, 2)) +
xlab("Day of the Year") + ylab("Percentage of CO2 Concentration increase") + labs(color = "Year") +
ggtitle("CO2 Evolution for the Years 1960, 1997 and 2019")We see that the concentration of CO2 is oscillating during the year with a maximal value reached around day 150 (month of May) and a minimal value around day 260 (month of September).
Let us try to fit a sinusoidal curve for the oscillations over one year. We decide to add a polynomial expression to represent the global evolution over the year.
As we saw previously, the global trend of the CO2 concentration over the year is quadratic on the year. We thus add a polynomial term of degree 2 to represent this evolution.
reg_sin <- lm(data = df, percentage_increase_year ~ sin(2 * pi * day_of_year / 366) +
cos(2 * pi * day_of_year / 366) +
poly(day_of_year, 2) +
poly(year, 2))
summary(reg_sin)##
## Call:
## lm(formula = percentage_increase_year ~ sin(2 * pi * day_of_year/366) +
## cos(2 * pi * day_of_year/366) + poly(day_of_year, 2) + poly(year,
## 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85001 -0.12197 0.00322 0.12922 0.70977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.281501 0.003545 79.415 < 2e-16 ***
## sin(2 * pi * day_of_year/366) 0.878982 0.008024 109.543 < 2e-16 ***
## cos(2 * pi * day_of_year/366) -0.836039 0.018281 -45.732 < 2e-16 ***
## poly(day_of_year, 2)1 12.903902 0.318263 40.545 < 2e-16 ***
## poly(day_of_year, 2)2 23.584939 0.726334 32.471 < 2e-16 ***
## poly(year, 2)1 0.374419 0.198977 1.882 0.06 .
## poly(year, 2)2 -1.232686 0.198987 -6.195 6.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.199 on 3145 degrees of freedom
## Multiple R-squared: 0.8695, Adjusted R-squared: 0.8693
## F-statistic: 3493 on 6 and 3145 DF, p-value: < 2.2e-16
df$prediction_sin <- predict(reg_sin, df)
ggplot(data = df[df$year == 1997,]) +
theme_linedraw() +
geom_point(aes(x = day_of_year, y = percentage_increase_year)) +
geom_line(aes(x = day_of_year, y = prediction_sin)) +
scale_y_continuous(limits = c(-2, 2)) +
xlab("Days of the Year") + ylab("CO2 Variation (in ppm)") +
ggtitle("CO2 Variations and its Sin Prediction for the Year 1997 (in ppm)")Let us try with a polynomial regression to see if we can get a better model.
We have to chose a degree greater than 2 to get such an oscillation. We decided to use a polynom of degree 5.
Once again we add a polynom of degree 2 to represent the global quadratic evolution of the CO2 concentration.
degree <- 5
reg_poly <- lm(data = df, percentage_increase_year ~ poly(day_of_year, degree) + poly(year, 2))
summary(reg_poly)##
## Call:
## lm(formula = percentage_increase_year ~ poly(day_of_year, degree) +
## poly(year, 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91086 -0.12546 0.00198 0.13186 0.63915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.278688 0.003424 81.400 < 2e-16 ***
## poly(day_of_year, degree)1 -14.258638 0.192243 -74.170 < 2e-16 ***
## poly(day_of_year, degree)2 -8.712652 0.192215 -45.328 < 2e-16 ***
## poly(day_of_year, degree)3 20.708919 0.192219 107.736 < 2e-16 ***
## poly(day_of_year, degree)4 8.849626 0.192218 46.040 < 2e-16 ***
## poly(day_of_year, degree)5 -7.069087 0.192220 -36.776 < 2e-16 ***
## poly(year, 2)1 0.398499 0.192231 2.073 0.0383 *
## poly(year, 2)2 -1.245293 0.192239 -6.478 1.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1922 on 3144 degrees of freedom
## Multiple R-squared: 0.8783, Adjusted R-squared: 0.878
## F-statistic: 3241 on 7 and 3144 DF, p-value: < 2.2e-16
df$prediction_poly <- predict(reg_poly, df)
ggplot(data = df[df$year == 2019,]) +
theme_linedraw() +
geom_point(aes(x = day_of_year, y = percentage_increase_year)) +
geom_line(aes(x = day_of_year, y = prediction_poly)) +
scale_y_continuous(limits = c(-2, 2)) +
xlab("Days of the Year") + ylab("CO2 Variation (in ppm)") +
ggtitle("CO2 Variations and its Polynimial Prediction for the Year 1997 (in ppm)")We showed that the global evolution of the CO2 concentration was increasing overtime. We also showed that there was a periodical oscillation during a time span of one year.
You can find the source code here https://github.com/GuilloteauQ/SMPE.