Analytics Edge: Unit 2 - Climate Change

Climate Change

Background Information on the Dataset

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:

Year: the observation year.
Month: the observation month.
Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.
- CO2, N2O and CH4 are expressed in ppmv (parts per million by volume – i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere)
- CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).
Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Godard Institute for Space Studies at NASA.
TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.

Creating Our First Model

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

# Load the dataset
climate = read.csv("climate_change.csv")
# Subset the dataset into training and testing sets
train = subset(climate, Year <= 2006)
test = subset(climate, Year > 2006)
# Create Linear Regression model
climatelm = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)

What is the R² value?

# Output the summary
summary(climatelm)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
##     TSI + Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) -124.5942604   19.8867999  -6.265 1.43e-09 ***
## MEI            0.0642053    0.0064702   9.923  < 2e-16 ***
## CO2            0.0064574    0.0022846   2.826  0.00505 ** 
## CH4            0.0001240    0.0005158   0.240  0.81015    
## N2O           -0.0165280    0.0085649  -1.930  0.05467 .  
## CFC.11        -0.0066305    0.0016260  -4.078 5.96e-05 ***
## CFC.12         0.0038081    0.0010135   3.757  0.00021 ***
## TSI            0.0931411    0.0147549   6.313 1.10e-09 ***
## Aerosols      -1.5376132    0.2132523  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16

The Multiple R-squared value is 0.7509.

Which variables are significant in the model? We will consider a variable signficant only if the p-value is below 0.05.

# Output the summary 
summary(climatelm)

MEI, CO2, CFC.11, CFC.12, TSI, and Aerosols are all significant.

Understanding the Model

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the simplest correct explanation for this contradiction?

All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set. The linear correlation of N2O and CFC.11 with other variables in the data set is quite large.

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)?

# Produce correlation on the training set
z = cor(train)
kable(z)

	Year	Month	MEI	CO2	CH4	N2O	CFC.11	CFC.12	TSI	Aerosols	Temp
Year	1.0000000	-0.0279420	-0.0369877	0.9827494	0.9156595	0.9938452	0.5691064	0.8970117	0.1703020	-0.3452467	0.7867971
Month	-0.0279420	1.0000000	0.0008847	-0.1067325	0.0185687	0.0136315	-0.0131112	0.0006751	-0.0346062	0.0148895	-0.0998567
MEI	-0.0369877	0.0008847	1.0000000	-0.0411472	-0.0334193	-0.0508198	0.0690004	0.0082855	-0.1544919	0.3402378	0.1724708
CO2	0.9827494	-0.1067325	-0.0411472	1.0000000	0.8772796	0.9767198	0.5140597	0.8526896	0.1774289	-0.3561548	0.7885292
CH4	0.9156595	0.0185687	-0.0334193	0.8772796	1.0000000	0.8998386	0.7799040	0.9636162	0.2455284	-0.2678092	0.7032550
N2O	0.9938452	0.0136315	-0.0508198	0.9767198	0.8998386	1.0000000	0.5224773	0.8679308	0.1997567	-0.3370546	0.7786389
CFC.11	0.5691064	-0.0131112	0.0690004	0.5140597	0.7799040	0.5224773	1.0000000	0.8689852	0.2720460	-0.0439212	0.4077103
CFC.12	0.8970117	0.0006751	0.0082855	0.8526896	0.9636162	0.8679308	0.8689852	1.0000000	0.2553028	-0.2251312	0.6875575
TSI	0.1703020	-0.0346062	-0.1544919	0.1774289	0.2455284	0.1997567	0.2720460	0.2553028	1.0000000	0.0521165	0.2433827
Aerosols	-0.3452467	0.0148895	0.3402378	-0.3561548	-0.2678092	-0.3370546	-0.0439212	-0.2251312	0.0521165	1.0000000	-0.3849137
Temp	0.7867971	-0.0998567	0.1724708	0.7885292	0.7032550	0.7786389	0.4077103	0.6875575	0.2433827	-0.3849137	1.0000000

CO2, CH4, and CFC.12

Which of the following independent variables is CFC.11 highly correlated with?

# Produce correlation on the training set
cor(train)

CH4 and CFC.12

Simplifying the Model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables.

# Create linear regression model
LinReg = lm(Temp ~ MEI + N2O + TSI + Aerosols, data=train)

What is the coefficient of N2O in this reduced model?

# Output the summary
summary(LinReg)
## 
## Call:
## lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27916 -0.05975 -0.00595  0.05672  0.34195 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -116.226858   20.223028  -5.747 2.37e-08 ***
## MEI            0.064186    0.006652   9.649  < 2e-16 ***
## N2O            0.025320    0.001311  19.307  < 2e-16 ***
## TSI            0.079490    0.014875   5.344 1.89e-07 ***
## Aerosols      -1.701737    0.217996  -7.806 1.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7222 
## F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16

0.02532

What is the R²?

# Output the summary
summary(LinReg)

We have observed that, for this problem, when we remove many variables the sign of N2O flips. The model has not lost a lot of explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. this type of behavior is typical when building a model where many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.

Automatically Building the Model

We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.

The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called “climateLM”, you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)

# Create step model
StepModel = step(climatelm)
## Start:  AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - CFC.12    1   0.11874 2.4318 -1335.9
## - CFC.11    1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - CFC.12    1   0.13023 2.4437 -1336.5
## - CFC.11    1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9

What is the R² produced by the step function?

# Output the summary
summary(StepModel)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
##     Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25770 -0.05994 -0.00104  0.05588  0.32203 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) -124.5151778   19.8501125  -6.273 1.37e-09 ***
## MEI            0.0640678    0.0064339   9.958  < 2e-16 ***
## CO2            0.0064015    0.0022689   2.821 0.005129 ** 
## N2O           -0.0160211    0.0082873  -1.933 0.054234 .  
## CFC.11        -0.0066094    0.0016208  -4.078 5.95e-05 ***
## CFC.12         0.0038676    0.0009812   3.942 0.000103 ***
## TSI            0.0931155    0.0147293   6.322 1.04e-09 ***
## Aerosols      -1.5402058    0.2126158  -7.244 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7445 
## F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16

0.7508. It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly.

Which of the following variable(s) were eliminated from the full model by the step function?

# Output the summary
summary(StepModel)

CH4

Testing on Unseen Data

We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.

# Make predictions using the stepmodel on the new test
tempPredict = predict(StepModel, newdata = test)

What is the testing set R²

# Calculate the R^2 
SSE = sum((tempPredict - test$Temp)^2)

SST = sum( (mean(train$Temp) - test$Temp)^2)

R2 = 1 - SSE/SST

R2
## [1] 0.6286051

0.6286051.

Analytics Edge: Unit 2 - Climate Change

Sulman Khan

October 26, 2018

Climate Change

Background Information on the Dataset

Creating Our First Model

What is the R² value?

Which variables are significant in the model? We will consider a variable signficant only if the p-value is below 0.05.

Understanding the Model

Which of the following is the simplest correct explanation for this contradiction?

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)?

Which of the following independent variables is CFC.11 highly correlated with?

Simplifying the Model

What is the coefficient of N2O in this reduced model?

What is the R²?

Automatically Building the Model

What is the R² produced by the step function?

Which of the following variable(s) were eliminated from the full model by the step function?

Testing on Unseen Data

What is the testing set R²

Analytics Edge: Unit 2 - Climate Change

Sulman Khan

October 26, 2018

Climate Change

Background Information on the Dataset

Creating Our First Model

What is the R2 value?

Which variables are significant in the model? We will consider a variable signficant only if the p-value is below 0.05.

Understanding the Model

Which of the following is the simplest correct explanation for this contradiction?

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)?

Which of the following independent variables is CFC.11 highly correlated with?

Simplifying the Model

What is the coefficient of N2O in this reduced model?

What is the R2?

Automatically Building the Model

What is the R2 produced by the step function?

Which of the following variable(s) were eliminated from the full model by the step function?

Testing on Unseen Data

What is the testing set R2

What is the R² value?

What is the R²?

What is the R² produced by the step function?

What is the testing set R²