1 Objectives

There are two main objectives for this work:

  1. Generate a graph showing a periodical oscillation superposed with a slower global evolution of the CO2

  2. Caracterize the oscillation and propose a model of the global trend. We will also give an estimation of the CO2 concentration.

2 Importing the necessary Libraries

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

3 Downloading the data

We first need to download the weekly measures from the Scripps CO2 Program Website:

We are skipping 44 lines because the csv file contains a header of 44 lines of comments.

The columns in the file are not named, so we call them date and CO2:

Let us take a quick look at the data and see the type of each column:

##      date       CO2 
## "integer"  "double"

We see that the type of the column date is not Date but integer.

We will thus convert it to a type Date.

##      date       CO2 
##    "Date" "numeric"

We will now check that there is no NA values in the data set.

## [1] 0

There are 0 NA in the dataset.

Let us remove the potential lines containing NA:

4 Discovering the Dataset

Let us discover the dataset and simply plot the evolution of the CO2 concentration from 1958 to present days:

The previous graph is zoomed in the data, but let us take a step back and look with the y axis starting at 0:

We can see that there seems to be a periodical oscillation but also the global trend is increasing.

5 Regression for global trend

5.1 Setting up the Data

Instead of using pure dates, we will convert each date into the number of days since the first measurement of the data set

5.2 Linear Regression

We can try to fit a line for the CO2 concentration:

## 
## Call:
## lm(formula = CO2 ~ day, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6800 -3.0919 -0.6282  2.6112 12.6771 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.060e+02  1.552e-01  1971.1   <2e-16 ***
## day         4.317e-03  1.184e-05   364.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.285 on 3150 degrees of freedom
## Multiple R-squared:  0.9769, Adjusted R-squared:  0.9769 
## F-statistic: 1.33e+05 on 1 and 3150 DF,  p-value: < 2.2e-16

We can plot the estimation of this model:

We can see that even if the \(R^2\) value in the summary was close to 1, a line does not seem to be a good fit for this data.

Indeed, this model…

  • … is underestimating from year 1958 to 1970

  • … is overestimating from yaer 1970 to 2005

  • … is underestimating from year 2005 to the present days

5.3 Quadratic Regression

As we saw previously, a line is not a good fit to model the data.

Let us try with a polynom of degree 2: \[ CO2(day) = a \times day^2 + b \times day + c \]

A polynom of degree 2 seems adapted to this situation because the evolution of the CO2 concentration looks slighlty quadratic.

We can use the function poly to quickly generate of polynom.

## 
## Call:
## lm(formula = CO2 ~ poly(day, 2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3360 -1.7984  0.1008  1.7953  5.7645 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.553e+02  4.003e-02 8876.69   <2e-16 ***
## poly(day, 2)1 1.563e+03  2.247e+00  695.33   <2e-16 ***
## poly(day, 2)2 2.048e+02  2.247e+00   91.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.247 on 3149 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.9936 
## F-statistic: 2.459e+05 on 2 and 3149 DF,  p-value: < 2.2e-16

We can now plot the data and the predictions:

The quadratic model fits way more the data.

6 Yearly Evolution

We will add a column in our data frame to indicate the which day of the year this is January 1st being day 0 and December 31st being 365. This will help us compare each year.

We also want to look at the variations of CO2 in the year during the year. So we will substract the value of the first measure of the year for every measure of this year.

We can plot the yearly evolution of some years:

We see that the concentration of CO2 is oscillating during the year with a maximal value reached around day 150 (month of May) and a minimal value around day 260 (month of September).

6.1 Sinusoidal Regression

Let us try to fit a sinusoidal curve for the oscillations over one year. We decide to add a polynomial expression to represent the global evolution over the year.

As we saw previously, the global trend of the CO2 concentration over the year is quadratic on the year. We thus add a polynomial term of degree 2 to represent this evolution.

## 
## Call:
## lm(formula = percentage_increase_year ~ sin(2 * pi * day_of_year/366) + 
##     cos(2 * pi * day_of_year/366) + poly(day_of_year, 2) + poly(year, 
##     2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85001 -0.12197  0.00322  0.12922  0.70977 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.281501   0.003545  79.415  < 2e-16 ***
## sin(2 * pi * day_of_year/366)  0.878982   0.008024 109.543  < 2e-16 ***
## cos(2 * pi * day_of_year/366) -0.836039   0.018281 -45.732  < 2e-16 ***
## poly(day_of_year, 2)1         12.903902   0.318263  40.545  < 2e-16 ***
## poly(day_of_year, 2)2         23.584939   0.726334  32.471  < 2e-16 ***
## poly(year, 2)1                 0.374419   0.198977   1.882     0.06 .  
## poly(year, 2)2                -1.232686   0.198987  -6.195 6.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.199 on 3145 degrees of freedom
## Multiple R-squared:  0.8695, Adjusted R-squared:  0.8693 
## F-statistic:  3493 on 6 and 3145 DF,  p-value: < 2.2e-16

6.2 Polynomial Regression

Let us try with a polynomial regression to see if we can get a better model.

We have to chose a degree greater than 2 to get such an oscillation. We decided to use a polynom of degree 5.

Once again we add a polynom of degree 2 to represent the global quadratic evolution of the CO2 concentration.

## 
## Call:
## lm(formula = percentage_increase_year ~ poly(day_of_year, degree) + 
##     poly(year, 2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91086 -0.12546  0.00198  0.13186  0.63915 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.278688   0.003424  81.400  < 2e-16 ***
## poly(day_of_year, degree)1 -14.258638   0.192243 -74.170  < 2e-16 ***
## poly(day_of_year, degree)2  -8.712652   0.192215 -45.328  < 2e-16 ***
## poly(day_of_year, degree)3  20.708919   0.192219 107.736  < 2e-16 ***
## poly(day_of_year, degree)4   8.849626   0.192218  46.040  < 2e-16 ***
## poly(day_of_year, degree)5  -7.069087   0.192220 -36.776  < 2e-16 ***
## poly(year, 2)1               0.398499   0.192231   2.073   0.0383 *  
## poly(year, 2)2              -1.245293   0.192239  -6.478 1.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1922 on 3144 degrees of freedom
## Multiple R-squared:  0.8783, Adjusted R-squared:  0.878 
## F-statistic:  3241 on 7 and 3144 DF,  p-value: < 2.2e-16

7 Conclusion

We showed that the global evolution of the CO2 concentration was increasing overtime. We also showed that there was a periodical oscillation during a time span of one year.

You can find the source code here https://github.com/GuilloteauQ/SMPE.