edX assignment link: http://bit.ly/2KE2g00

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:

CO2, N2O and CH4 are expressed in ppmv (parts per million by volume; i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere). CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).


Section 1 - Creating Our First Model

1.1

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.
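As a quick sanity check on the split, the sketch below rebuilds only the calendar of the dataset (one row per month from May 1983 to December 2008) and applies the same subset conditions. The `calendar` data frame is a hypothetical stand-in, not the real climate_change.csv; its row counts should agree with the 308 and 284 observations that str() reports in this write-up.

```r
# Hypothetical stand-in for climate_change.csv: Year and Month columns only,
# one row per month from May 1983 to December 2008.
dates <- seq(as.Date("1983-05-01"), as.Date("2008-12-01"), by = "month")
calendar <- data.frame(
  Year  = as.integer(format(dates, "%Y")),
  Month = as.integer(format(dates, "%m"))
)

# Same conditions as the split used in this assignment.
train_rows <- subset(calendar, Year <= 2006)
test_rows  <- subset(calendar, Year > 2006)

print(c(all = nrow(calendar), train = nrow(train_rows), test = nrow(test_rows)))
```

The testing set is small (two years of monthly observations), so any metric computed on it will be fairly noisy.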

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

Enter the model R² (the “Multiple R-squared” value):

climate_change = read.csv("D:/buiness_analytics/unit2/data/climate_change.csv")
str(climate_change)
'data.frame':   308 obs. of  11 variables:
 $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
 $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
 $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
 $ CO2     : num  346 346 344 342 340 ...
 $ CH4     : num  1639 1634 1633 1631 1648 ...
 $ N2O     : num  304 304 304 304 304 ...
 $ CFC.11  : num  191 192 193 194 194 ...
 $ CFC.12  : num  350 352 354 356 357 ...
 $ TSI     : num  1366 1366 1366 1366 1366 ...
 $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
 $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
train_data = subset(climate_change , Year <= 2006)
test_data = subset(climate_change, Year > 2006)
model1 = lm(Temp~MEI+CO2+CH4+N2O+CFC.11+CFC.12+TSI+Aerosols,data = train_data)
summary(model1)

Call:
lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
    TSI + Aerosols, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25888 -0.05913 -0.00082  0.05649  0.32433 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
CH4          1.240e-04  5.158e-04   0.240  0.81015    
N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09171 on 275 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7436 
F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16
test_result = predict(model1, newdata = test_data)
summary(test_result)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3139  0.3437  0.3762  0.3835  0.4238  0.4686 
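The "Multiple R-squared" value can also be read from the summary object rather than from the printed table. The sketch below is only an illustration on the built-in mtcars data (a stand-in, not the climate data): it pulls the value from summary() and recomputes it from its definition, 1 - SSE/SST.

```r
# Stand-in model on a built-in dataset (mtcars), not the climate data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# R-squared as stored in the summary object.
r2_reported <- fit_summary$r.squared

# The same quantity from its definition: 1 - SSE/SST on the training data.
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - sse / sst

print(c(reported = r2_reported, manual = r2_manual))
```

With the model from this section, the analogous call would be summary(model1)$r.squared.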

1.2

Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05. (Select all that apply.)

  • MEI
  • CO2
  • CH4
  • N2O
  • CFC.11
  • CFC.12
  • TSI
  • Aerosols
#MEI          6.421e-02  6.470e-03   9.923  < 2e-16 *** -> significant
#CO2          6.457e-03  2.285e-03   2.826  0.00505 **  -> significant
#CH4          1.240e-04  5.158e-04   0.240  0.81015     -> not significant
#N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .   -> not significant (p > 0.05)
#CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 *** -> significant
#CFC.12       3.808e-03  1.014e-03   3.757  0.00021 *** -> significant
#TSI          9.314e-02  1.475e-02   6.313 1.10e-09 *** -> significant
#Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 *** -> significant
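Reading the stars off the table works, but the same p < 0.05 filter can be applied in code. The helper below, `significant_terms`, is a hypothetical name introduced for this sketch, and the demonstration uses the built-in mtcars data rather than the climate data.

```r
# Return the names of the predictors whose p-value is below alpha.
significant_terms <- function(fit, alpha = 0.05) {
  coef_table <- coef(summary(fit))   # matrix with a "Pr(>|t|)" column
  p_values <- coef_table[, "Pr(>|t|)"]
  # Drop the intercept: only the predictors are of interest here.
  p_values <- p_values[names(p_values) != "(Intercept)"]
  names(p_values)[p_values < alpha]
}

# Stand-in model on mtcars, for illustration only.
fit <- lm(mpg ~ wt + hp, data = mtcars)
sig <- significant_terms(fit)
print(sig)
```

With the model from section 1.1, the analogous call would be significant_terms(model1).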

Section 2 - Understanding the Model

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

2.1

Which of the following is the simplest correct explanation for this contradiction?

  • Climate scientists are wrong that N2O and CFC-11 are greenhouse gases - this regression analysis constitutes part of a disproof.

  • There is not enough data, so the regression coefficients being estimated are not accurate.

  • All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.

str(train_data)
'data.frame':   284 obs. of  11 variables:
 $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
 $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
 $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
 $ CO2     : num  346 346 344 342 340 ...
 $ CH4     : num  1639 1634 1633 1631 1648 ...
 $ N2O     : num  304 304 304 304 304 ...
 $ CFC.11  : num  191 192 193 194 194 ...
 $ CFC.12  : num  350 352 354 356 357 ...
 $ TSI     : num  1366 1366 1366 1366 1366 ...
 $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
 $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
cor(train_data[3:11])
                  MEI         CO2        CH4         N2O      CFC.11       CFC.12
MEI       1.000000000 -0.04114717 -0.0334193 -0.05081978  0.06900044  0.008285544
CO2      -0.041147165  1.00000000  0.8772796  0.97671982  0.51405975  0.852689627
CH4      -0.033419301  0.87727963  1.0000000  0.89983864  0.77990402  0.963616248
N2O      -0.050819775  0.97671982  0.8998386  1.00000000  0.52247732  0.867930776
CFC.11    0.069000439  0.51405975  0.7799040  0.52247732  1.00000000  0.868985183
CFC.12    0.008285544  0.85268963  0.9636162  0.86793078  0.86898518  1.000000000
TSI      -0.154491923  0.17742893  0.2455284  0.19975668  0.27204596  0.255302814
Aerosols  0.340237787 -0.35615480 -0.2678092 -0.33705457 -0.04392120 -0.225131244
Temp      0.172470751  0.78852921  0.7032550  0.77863893  0.40771029  0.687557548
                 TSI    Aerosols       Temp
MEI      -0.15449192  0.34023779  0.1724708
CO2       0.17742893 -0.35615480  0.7885292
CH4       0.24552844 -0.26780919  0.7032550
N2O       0.19975668 -0.33705457  0.7786389
CFC.11    0.27204596 -0.04392120  0.4077103
CFC.12    0.25530281 -0.22513124  0.6875575
TSI       1.00000000  0.05211651  0.2433827
Aerosols  0.05211651  1.00000000 -0.3849137
Temp      0.24338269 -0.38491375  1.0000000
#All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.
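The effect behind this answer can be reproduced on simulated data. In the sketch below (all variables are made up for illustration, and the seed is arbitrary), x2 is nearly a copy of x1 and both have a positive true effect on y. Fitting both together inflates the standard error of the x1 coefficient, which is why individual coefficients, and even their signs, become unreliable when predictors are highly correlated. The variance inflation factor is computed by hand as 1 / (1 - R²) from regressing x1 on x2.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
y  <- x1 + x2 + rnorm(n)        # both true effects are positive

fit_both <- lm(y ~ x1 + x2)     # correlated predictors together
fit_one  <- lm(y ~ x1)          # x1 alone

# Standard error of the x1 coefficient in each model.
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
se_one  <- coef(summary(fit_one))["x1", "Std. Error"]

# Variance inflation factor for x1, computed by hand.
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

print(c(cor = cor(x1, x2), se_both = se_both, se_one = se_one, vif = vif_x1))
```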

2.2

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.

  • MEI
  • CO2
  • CH4
  • CFC.11
  • CFC.12
  • Aerosols
  • TSI
#Answer = CO2, CH4, CFC.12

Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.

  • MEI
  • CO2
  • CH4
  • N2O
  • CFC.12
  • Aerosols
  • TSI
#Answer = CH4, CFC.12
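Both answers can be checked by applying the 0.7 threshold in code instead of scanning the matrix by eye. The sketch below copies the N2O and CFC.11 correlations from the matrix printed in section 2.1; `highly_correlated` is a hypothetical helper name introduced here.

```r
# Correlations with N2O and with CFC.11, copied from the printed matrix
# (independent variables only; Temp is left out).
n2o_cor <- c(MEI = -0.05081978, CO2 = 0.97671982, CH4 = 0.89983864,
             CFC.11 = 0.52247732, CFC.12 = 0.86793078,
             TSI = 0.19975668, Aerosols = -0.33705457)
cfc11_cor <- c(MEI = 0.06900044, CO2 = 0.51405975, CH4 = 0.77990402,
               N2O = 0.52247732, CFC.12 = 0.86898518,
               TSI = 0.27204596, Aerosols = -0.04392120)

# Names of the variables whose absolute correlation exceeds the threshold.
highly_correlated <- function(cors, threshold = 0.7) {
  names(cors)[abs(cors) > threshold]
}

print(highly_correlated(n2o_cor))    # "CO2" "CH4" "CFC.12"
print(highly_correlated(cfc11_cor))  # "CH4" "CFC.12"
```

Starting from the data rather than typed-in numbers, the N2O row could be taken as cor(train_data[3:11])["N2O", ], after dropping the N2O and Temp entries.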

Section 3 - Simplifying the Model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Enter the coefficient of N2O in this reduced model:

model_n2o = lm(Temp~MEI+TSI+Aerosols+N2O, data = train_data)
summary(model_n2o)

Call:
lm(formula = Temp ~ MEI + TSI + Aerosols + N2O, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27916 -0.05975 -0.00595  0.05672  0.34195 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.162e+02  2.022e+01  -5.747 2.37e-08 ***
MEI          6.419e-02  6.652e-03   9.649  < 2e-16 ***
TSI          7.949e-02  1.487e-02   5.344 1.89e-07 ***
Aerosols    -1.702e+00  2.180e-01  -7.806 1.19e-13 ***
N2O          2.532e-02  1.311e-03  19.307  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09547 on 279 degrees of freedom
Multiple R-squared:  0.7261,    Adjusted R-squared:  0.7222 
F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16
#Answer = 2.532e-02

(How does this compare to the coefficient in the previous model with all of the variables?)

Enter the model R²:

#Answer = 0.7261
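To compare the reduced model with the full one on equal terms, it helps to look at adjusted R², which penalizes the number of predictors. The sketch below recomputes it from the standard formula using the rounded R² values printed in the two summaries, so the results should match the reported adjusted values (0.7436 and 0.7222) only up to rounding. `adjusted_r2` is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_train <- 284  # observations in train_data, per str(train_data)

adj_full    <- adjusted_r2(0.7509, n_train, p = 8)  # full model
adj_reduced <- adjusted_r2(0.7261, n_train, p = 4)  # MEI, TSI, Aerosols, N2O

print(round(c(full = adj_full, reduced = adj_reduced), 3))
```

Dropping four of the eight predictors costs only about two points of adjusted R², while the N2O coefficient changes sign from -1.653e-02 to 2.532e-02.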

Section 4 - Automatically Building the Model

We have many variables in this problem, and as we have seen above, dropping some from the model costs little in model quality (R² fell only from 0.7509 to 0.7261 when four variables were removed). R provides a function, step, that automates the procedure of trying different combinations of variables to find a good compromise between model simplicity and R². This trade-off is formalized by the Akaike information criterion (AIC), which can be informally thought of as the quality of the model with a penalty for the number of variables in the model.
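For linear models, step() scores each candidate with the quantity n * log(RSS / n) + 2 * edf (this is what extractAIC returns for an lm fit; it equals the usual AIC up to an additive constant). The sketch below plugs in the RSS values that step() reports in the output later in this section, so the results should agree with the printed AIC column up to rounding. `step_aic` is a hypothetical helper name.

```r
# AIC as used by step() for lm fits, up to an additive constant:
#   n * log(RSS / n) + 2 * edf
# where edf counts the estimated coefficients, including the intercept.
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n_train <- 284  # observations in train_data

aic_full   <- step_aic(rss = 2.3130, n = n_train, edf = 9)  # 8 predictors + intercept
aic_no_ch4 <- step_aic(rss = 2.3135, n = n_train, edf = 8)  # CH4 dropped

print(round(c(full = aic_full, without_CH4 = aic_no_ch4), 2))
```

Removing CH4 barely changes the RSS but saves one parameter, so the penalized score improves; that is why step() drops it.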

In its simplest form, the step function takes one argument: the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model. (HINT: If your initial full model was called “climateLM”, you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)
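A minimal sketch of this workflow, using the built-in mtcars data as a stand-in for the climate data: fit a full model, pass it to step(), and keep the result in a variable. The trace = 0 argument only silences the step-by-step table.

```r
# Stand-in full model on the built-in mtcars data (not the climate data).
full_model <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the printed search; with no scope argument,
# step() performs backward elimination starting from the full model.
reduced_model <- step(full_model, trace = 0)

print(formula(reduced_model))
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```

The returned object is an ordinary lm fit, so summary() and predict() work on it as they do on the original model.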

4.1

Enter the R² value of the model produced by the step function:

step_model = step(model1)
Start:  AIC=-1348.16
Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq    RSS     AIC
- CH4       1   0.00049 2.3135 -1350.1
<none>                  2.3130 -1348.2
- N2O       1   0.03132 2.3443 -1346.3
- CO2       1   0.06719 2.3802 -1342.0
- CFC.12    1   0.11874 2.4318 -1335.9
- CFC.11    1   0.13986 2.4529 -1333.5
- TSI       1   0.33516 2.6482 -1311.7
- Aerosols  1   0.43727 2.7503 -1301.0
- MEI       1   0.82823 3.1412 -1263.2

Step:  AIC=-1350.1
Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq    RSS     AIC
<none>                  2.3135 -1350.1
- N2O       1   0.03133 2.3448 -1348.3
- CO2       1   0.06672 2.3802 -1344.0
- CFC.12    1   0.13023 2.4437 -1336.5
- CFC.11    1   0.13938 2.4529 -1335.5
- TSI       1   0.33500 2.6485 -1313.7
- Aerosols  1   0.43987 2.7534 -1302.7
- MEI       1   0.83118 3.1447 -1264.9
summary(step_model)

Call:
lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
    Aerosols, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25770 -0.05994 -0.00104  0.05588  0.32203 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.245e+02  1.985e+01  -6.273 1.37e-09 ***
MEI          6.407e-02  6.434e-03   9.958  < 2e-16 ***
CO2          6.402e-03  2.269e-03   2.821 0.005129 ** 
N2O         -1.602e-02  8.287e-03  -1.933 0.054234 .  
CFC.11      -6.609e-03  1.621e-03  -4.078 5.95e-05 ***
CFC.12       3.868e-03  9.812e-04   3.942 0.000103 ***
TSI          9.312e-02  1.473e-02   6.322 1.04e-09 ***
Aerosols    -1.540e+00  2.126e-01  -7.244 4.36e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09155 on 276 degrees of freedom
Multiple R-squared:  0.7508,    Adjusted R-squared:  0.7445 
F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16
#Answer = 0.7508

4.2

Which of the following variable(s) were eliminated from the full model by the step function? Select all that apply.

  • MEI
  • CO2
  • CH4
  • N2O
  • CFC.11
  • CFC.12
  • TSI
  • Aerosols
#lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = train_data)
#Answer = CH4

It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R² significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).

Section 5 - Testing on Unseen Data

We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.

5.1

Enter the testing set R²:

test_model = predict(step_model, newdata = test_data)
summary(test_model)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3142  0.3418  0.3771  0.3832  0.4245  0.4678 
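# Sum of squared prediction errors on the testing set
SSE = sum((test_model - test_data$Temp)^2)
# Baseline: always predict the mean of the training set temperatures
SST = sum((mean(train_data$Temp) - test_data$Temp)^2)
RSquare = 1 - SSE/SST
RSquare
[1] 0.6286051
#Answer = 0.6286051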
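The same calculation can be wrapped in a small function so that it is easy to reuse for other models. This is only a sketch: `test_r2` is a hypothetical helper name, and the demonstration splits the built-in mtcars data by row position purely for illustration.

```r
# Out-of-sample R-squared: compares the model's squared error on the test set
# with that of a baseline that always predicts the training-set mean.
test_r2 <- function(model, train, test, response) {
  predictions <- predict(model, newdata = test)
  sse <- sum((test[[response]] - predictions)^2)
  sst <- sum((test[[response]] - mean(train[[response]]))^2)
  1 - sse / sst
}

# Hypothetical split of the built-in mtcars data, only for illustration.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- lm(mpg ~ wt + hp, data = train)

r2_value <- test_r2(fit, train, test, "mpg")
print(r2_value)
```

With the objects in this write-up, the analogous call would be test_r2(step_model, train_data, test_data, "Temp"). The baseline uses the training mean because the test-set mean would not be known at prediction time; as a result, a testing set R² can be negative when the model does worse than that baseline.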
