Estimating the IPAT model for CO2 emissions - is urbanisation an explanatory or confounding variable?

In this paper you will estimate the IPAT (Impact = Population x Affluence x Technolgoy) model (McNicoll, 2014) with a sampled dataset from the World Bank’s World Development Indicators (WDI) for 2015 (the latest year for which the renewable energy variable is available). You will decide whether or not to apply logistic transformation to the data and then you have to add urbanisation URBAN as an explanatory factor to the model. Based on your results and diagnostics you have to discuss whether urbanisation could be a relevant explanatory variable or a confounding variable (see for example Wu et al, 2016).

Finally you will use the model you have build to project the CO2 emissions towards 2030.

Equation 1: The IPAT model. \[ log\:EMISSIONS_{i} = a_0 + \beta_1*\; log\:POPULATION_{i} + \beta_2*\; log\:GDPCAP_{i} + \beta_3*\;log\:INDUSTRY_{i} + \beta_4*\;log\:RENEW_i + \epsilon_{i} \]

Description of the data

We downloaded the dataset in R using the WDI package and stored it in the datasets EMIT.Rda / EMIT.sav that are appended to the exam paper. You can also download the data yourself if you work in R and use the code pasted here. Note that you can look up the metadata for the variables in the WDI by copying the variable symbols or tickers (e.g. SP.POP.TOTL etc.) into a Google search window.

library(WDI)

EMIT <- data.frame(WDI(country="all", 
                       indicator=c("EN.ATM.CO2E.KT", 
                                   "SP.POP.TOTL", 
                                   "NY.GDP.PCAP.PP.KD",
                                   "NV.IND.TOTL.ZS", 
                                   "EG.FEC.RNEW.ZS", 
                                   "SP.URB.TOTL.IN.ZS"), 
                       start=2015, end=2015, extra=TRUE))

EMIT <- subset(EMIT, region != "Aggregates")

colnames(EMIT)[4] <- "EMISSIONS"
colnames(EMIT)[5] <- "POPULATION"
colnames(EMIT)[6] <- "GDPCAP"
colnames(EMIT)[7] <- "INDUSTRY"
colnames(EMIT)[8] <- "RENEW"
colnames(EMIT)[9] <- "URBAN"

EMIT$logEMISSIONS <- log(EMIT$EMISSIONS)
EMIT$logPOPULATION <- log(EMIT$POPULATION)
EMIT$logGDPCAP <- log(EMIT$GDPCAP)
EMIT$logINDUSTRY <- log(EMIT$INDUSTRY)
EMIT$logRENEW <- log(EMIT$RENEW+1)
EMIT$logURBAN <- log(EMIT$URBAN)

save(EMIT, file = "EMIT.rda")
library(haven)
write_sav(EMIT, "EMIT.sav")   

The dataset contains the following variables in normal and log transformed values (apart from basic country descriptors):

## First we get an overview of the dataset by generating the basic descriptive statistics..
summary(EMIT)
    iso2c             country               year        EMISSIONS       
 Length:217         Length:217         Min.   :2015   Min.   :      11  
 Class :character   Class :character   1st Qu.:2015   1st Qu.:    1889  
 Mode  :character   Mode  :character   Median :2015   Median :    9149  
                                       Mean   :2015   Mean   :  162301  
                                       3rd Qu.:2015   3rd Qu.:   57317  
                                       Max.   :2015   Max.   :10145005  
                                                      NA's   :13        
   POPULATION            GDPCAP            INDUSTRY          RENEW       
 Min.   :1.110e+04   Min.   :   825.2   Min.   : 2.073   Min.   : 0.000  
 1st Qu.:7.575e+05   1st Qu.:  4670.4   1st Qu.:17.119   1st Qu.: 4.759  
 Median :6.372e+06   Median : 12605.1   Median :23.660   Median :19.281  
 Mean   :3.385e+07   Mean   : 20830.8   Mean   :24.386   Mean   :29.069  
 3rd Qu.:2.324e+07   3rd Qu.: 28767.6   3rd Qu.:30.013   3rd Qu.:46.476  
 Max.   :1.371e+09   Max.   :120294.9   Max.   :61.362   Max.   :95.818  
 NA's   :1           NA's   :24         NA's   :21       NA's   :4       
     URBAN           iso3c              region            capital         
 Min.   : 12.08   Length:217         Length:217         Length:217        
 1st Qu.: 41.16   Class :character   Class :character   Class :character  
 Median : 60.78   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 60.02                                                           
 3rd Qu.: 79.52                                                           
 Max.   :100.00                                                           
 NA's   :3                                                                
  longitude           latitude            income            lending         
 Length:217         Length:217         Length:217         Length:217        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  logEMISSIONS    logPOPULATION      logGDPCAP       logINDUSTRY    
 Min.   : 2.398   Min.   : 9.315   Min.   : 6.716   Min.   :0.7291  
 1st Qu.: 7.542   1st Qu.:13.538   1st Qu.: 8.449   1st Qu.:2.8402  
 Median : 9.121   Median :15.667   Median : 9.442   Median :3.1638  
 Mean   : 9.185   Mean   :15.172   Mean   : 9.376   Mean   :3.0786  
 3rd Qu.:10.956   3rd Qu.:16.962   3rd Qu.:10.267   3rd Qu.:3.4016  
 Max.   :16.132   Max.   :21.039   Max.   :11.698   Max.   :4.1168  
 NA's   :13       NA's   :1        NA's   :24       NA's   :21      
    logRENEW        logURBAN    
 Min.   :0.000   Min.   :2.491  
 1st Qu.:1.751   1st Qu.:3.717  
 Median :3.010   Median :4.107  
 Mean   :2.724   Mean   :3.992  
 3rd Qu.:3.860   3rd Qu.:4.376  
 Max.   :4.573   Max.   :4.605  
 NA's   :4       NA's   :3      

Question 1.

Estimate the IPAT model with the dataset EMIT - with and without logarithmic transformation. Also draw a scatterplot matrix for all the variables in the dataset. Discuss whether logarithmic transformation is appropriate for this particular model?

## Let us try to make the scatterplot matrices first:
pairs(~EMISSIONS+POPULATION+GDPCAP+INDUSTRY+RENEW+URBAN,data=EMIT, 
   main="Simple Scatterplot before Transf")

pairs(~logEMISSIONS+logPOPULATION+logGDPCAP+logINDUSTRY+logRENEW+logURBAN, data=EMIT, main="Simple Scatterplot after Tranformation")

## Only the log transformed variables meet the assumption of linearity. Therefore we need the log transformed model.
## Next we estimate the IPAT model with and without logarithmic transformation
Fit_normal <- lm(EMISSIONS ~ POPULATION + GDPCAP + INDUSTRY + RENEW, data=EMIT)
Fit_log <- lm(logEMISSIONS ~ logPOPULATION + logGDPCAP + logINDUSTRY + logRENEW, data=EMIT)

library(stargazer)
stargazer::stargazer(Fit_normal, Fit_log, title = "Table R1: IPAT model with and without log transformation ", dep.var.labels=c("Emissions in kiloton per year"),  omit.stat=c("f", "ser"), align=TRUE, no.space=TRUE, type="text")

Table R1: IPAT model with and without log transformation
========================================================
                         Dependent variable:            
              ------------------------------------------
              Emissions in kiloton per year logEMISSIONS
                           (1)                  (2)     
--------------------------------------------------------
POPULATION              0.005***                        
                        (0.0003)                        
GDPCAP                   3.218*                         
                         (1.908)                        
INDUSTRY                1,103.429                       
                       (3,370.049)                      
RENEW                  -2,370.018                       
                       (1,452.651)                      
logPOPULATION                                 1.005***  
                                              (0.020)   
logGDPCAP                                     0.918***  
                                              (0.038)   
logINDUSTRY                                   0.343***  
                                              (0.081)   
logRENEW                                     -0.315***  
                                              (0.034)   
Constant               -28,077.580           -14.983*** 
                      (113,751.700)           (0.461)   
--------------------------------------------------------
Observations               184                  184     
R2                        0.678                0.962    
Adjusted R2               0.671                0.961    
========================================================
Note:                        *p<0.1; **p<0.05; ***p<0.01
## The multiple regression analysis also confirms that the model fit significantly improves after the log tranformation - for example R2 increases with nearly 30%. All the variables are significant in Column 2 in Table R1. We can interpret the coefficient estimates as follows: a one percentage increase in x leads to a B increase in y. For example, a 1% increase in population increases CO~2~ emissions with 1.005% etc. 

Question 2.

Now include also the variable URBAN into your model. Discuss the results and investigate whether there could be a problem of multicollinearity or a source of heteroscedasticity by including this variable into the model? For example, conduct the diagnostics for the model with and without the URBAN variable included.

## Next we estimate the IPAT model with and without logarithmic transformation
Fit_log <- lm(logEMISSIONS ~ logPOPULATION + logGDPCAP + logINDUSTRY + logRENEW, data=EMIT)
Fit_log_urban <- lm(logEMISSIONS ~ logPOPULATION + logGDPCAP + logINDUSTRY + logRENEW + logURBAN, data=EMIT)

library(stargazer)
stargazer::stargazer(Fit_log, Fit_log_urban, title = "Table R2: IPAT model with and without URBAN", dep.var.labels=c("Emissions in kiloton per year"),  omit.stat=c("f", "ser"), align=TRUE, no.space=TRUE, type="text")

Table R2: IPAT model with and without URBAN
============================================
                   Dependent variable:      
              ------------------------------
              Emissions in kiloton per year 
                    (1)            (2)      
--------------------------------------------
logPOPULATION    1.005***        1.004***   
                  (0.020)        (0.020)    
logGDPCAP        0.918***        0.871***   
                  (0.038)        (0.048)    
logINDUSTRY      0.343***        0.342***   
                  (0.081)        (0.080)    
logRENEW         -0.315***      -0.311***   
                  (0.034)        (0.034)    
logURBAN                          0.170     
                                 (0.106)    
Constant        -14.983***      -15.226***  
                  (0.461)        (0.476)    
--------------------------------------------
Observations        184            183      
R2                 0.962          0.963     
Adjusted R2        0.961          0.962     
============================================
Note:            *p<0.1; **p<0.05; ***p<0.01
## The URBAN variable is not significant and therefore appears not to be relevant and hence also not a confounder.
library(lmtest)
plot(Fit_normal, 1)

plot(Fit_normal, 2)

## The first plot is the equivalent to testing for heteroscedasticity - we can see that the assumptions are violated especially because of the outliers. The second plot (QQ plot) visualises the normality assumption. Also here we can see the main problem in the untranformed data are the large outliers. They are the same in the two plots: observations 46 (China), 97 (India) and 222 (United States). All very large countries.
library(lmtest)
plot(Fit_log, 1)

plot(Fit_log, 2)

## The model assumptions are better met in the log tranformed model. The outliers are now observations 39 (Congo), 51 (Curacao) and 139 (Niger). All small countries.
## Now let us check if there is any change to the residual plots when including the URBAN variable:
plot(Fit_log_urban, 1)

plot(Fit_log_urban, 2)

## There is no change at all.

Question 3.

Use the column ‘region’ in the dataset to create a categorical variable that you call ECA and that takes a value of 1 when the country is in the region ‘Europe & Central Asia’ and 0 otherwise. Now include the ECA dummy into the IPAT model (again with and without the URBAN variable). Interpret the result for the ECA dummy.

EMIT$ECA <- ifelse(EMIT$region=="Europe & Central Asia", 1, 0)

Fit_log_ECA <- lm(logEMISSIONS ~ logPOPULATION + logGDPCAP + logINDUSTRY + logRENEW + ECA, data=EMIT)
Fit_log_urban_ECA <- lm(logEMISSIONS ~ logPOPULATION + logGDPCAP + logINDUSTRY + logRENEW + logURBAN + ECA, data=EMIT)

library(stargazer)
stargazer::stargazer(Fit_log_ECA, Fit_log_urban_ECA, title = "Table R2: IPAT model with ECA dummy", dep.var.labels=c("Emissions in kiloton per year"),  omit.stat=c("f", "ser"), align=TRUE, no.space=TRUE, type="text")

Table R2: IPAT model with ECA dummy
============================================
                   Dependent variable:      
              ------------------------------
              Emissions in kiloton per year 
                    (1)            (2)      
--------------------------------------------
logPOPULATION    1.003***        1.002***   
                  (0.019)        (0.019)    
logGDPCAP        0.861***        0.812***   
                  (0.042)        (0.052)    
logINDUSTRY      0.343***        0.344***   
                  (0.080)        (0.079)    
logRENEW         -0.341***      -0.335***   
                  (0.034)        (0.034)    
logURBAN                          0.187*    
                                 (0.104)    
ECA              0.269***        0.255***   
                  (0.092)        (0.092)    
Constant        -14.420***      -14.708***  
                  (0.491)        (0.503)    
--------------------------------------------
Observations        184            183      
R2                 0.964          0.965     
Adjusted R2        0.963          0.964     
============================================
Note:            *p<0.1; **p<0.05; ***p<0.01
## According to the results in Table R2 the ECA region has higher emissions levels that the rest of the world. Controlling for the ECA region makes the URBAN variable significant. The explanatory power of the affluence variable (e.g. GDPCAP) is somewhat reduced by including these two variables into the model. The other explanatory factors are constant. There could be a potentially confounding relationship between GDPCAP and URBAN since they are strongly collinear across time. But this result is not born out in a cross section analysis because the countries are at so different levels of development and hence at different stages of their urbanisation process.

## Again checking residual diagnostics for the last equation - because we will use it for prediction, so need to check that it does not violate BLUE assumptions..
## Out of curiousity let us also check if it changes the residual plots to include the URBAN variable -
plot(Fit_log_urban_ECA, 1)

plot(Fit_log_urban_ECA, 2)

## Again there is hardly any change to the residual plots.
## As a backdrop to the new result - e.g. URBAN is significant when we include ECA, let us check the correlation matrix
COR <- EMIT[c(4:9, 23)]
cor(COR, use='complete.obs')
             EMISSIONS   POPULATION      GDPCAP    INDUSTRY        RENEW
EMISSIONS   1.00000000  0.811769761  0.06409496  0.12857604 -0.121391719
POPULATION  0.81176976  1.000000000 -0.06479765  0.12104810 -0.007958916
GDPCAP      0.06409496 -0.064797654  1.00000000  0.10818056 -0.448830703
INDUSTRY    0.12857604  0.121048099  0.10818056  1.00000000 -0.092151094
RENEW      -0.12139172 -0.007958916 -0.44883070 -0.09215109  1.000000000
URBAN       0.06385599 -0.064099942  0.66789862  0.11741882 -0.534818392
ECA        -0.03847362 -0.086567201  0.30535015  0.03076116 -0.177002516
                 URBAN         ECA
EMISSIONS   0.06385599 -0.03847362
POPULATION -0.06409994 -0.08656720
GDPCAP      0.66789862  0.30535015
INDUSTRY    0.11741882  0.03076116
RENEW      -0.53481839 -0.17700252
URBAN       1.00000000  0.19200709
ECA         0.19200709  1.00000000
## A confounder would also be revealed by a very high correlation coefficient. But multicollinearity does not appear to be a problem - for example the Pearson correlation coefficient between URBAN and GDPCAP is the highest among the independent variables, but not too high - e.g. it is 0.67 according to the correlation matrix below. Note also that there is a negative correlation on average between GDPCAP and RENEW!!

Question 4.

Which of the avenues for cutting CO2 emissions levels (e.g. percentage reduction in population, percentage reduction in affluence, percentage reduction of industry or percentage switch towards renewables) will be relatively more or less effective according to your results?

## We do not need standardised coefficient estimates in this case, because the variables are log transformed and therefore on comparable scales (e.g. percentages) that are also much easier to interpret than standardised coefficients. The largest contribution to reduction would come from 1% reduction in population, then GDPCAP, then INDUSTRY, then RENEW and finally URBAN - e.g. in the order that the independent variables are presented in the model.

Question 5.

Use the model to make predictions for CO2 emissions for European and non-European countries (e.g. by filling in Table 1) under the following scenarios (assuming the current population growth rate continues at 1.05% per year and that INDUSTRY and URBAN remains constant until 2030):

Scenario 1 (baseline): GDP per capita increases with 1.5% per year until 2030, the switch to renewables takes pace with around 2% per year until 2030.

Scenario 2 (low growth, green transition): GDP per capita increases with 0.5% per year until 2030, the switch to renewables goes up to 5% per year until 2030.

Scenario 3 (high growth, modest green transition): GDP per capita increases with 3% per year until 2030, the switch to renewables continues slowly at around 0.5% per year until 2030.

Hint - Use the formulae for compound annual growth rate (e.g. (1+g)^15) to project the averages (from your basic descriptive statistics tables). Insert the projections into your preferred prediction equation under Question 3.

Also, discuss what the implicit assumption of these predictions is?

## These calculations could be made manually and/or in a spreadsheet as well. 


## First we can create a new dataset based on the EMIT dataset - but storing the mean values (I did it separately for ECA and non-ECA as it makes more sense for the projections) - and because we need to insert the mean values to make the predictions:
EMIT_ECA_1 <- EMIT[which(EMIT$ECA==1), ]
EMIT_num_ECA_1 <- EMIT_ECA_1[c(4:9)]
EMIT_ECA_0 <- EMIT[which(EMIT$ECA==0), ]
EMIT_num_ECA_0 <- EMIT_ECA_0[c(4:9)]

EMIT_p_ECA_1 <- data.frame(t(colMeans(EMIT_num_ECA_1, na.rm=TRUE)))
EMIT_p_ECA_0 <- data.frame(t(colMeans(EMIT_num_ECA_0, na.rm=TRUE)))


## Scenario 1 calculations for ECA:
EMIT_p_s1 <- EMIT_p_ECA_1

EMIT_p_s1$logPOPULATION <- log(EMIT_p_s1$POPULATION*(1.0105)^15)
EMIT_p_s1$logGDPCAP <- log(EMIT_p_s1$GDPCAP*(1.015)^15)
EMIT_p_s1$logINDUSTRY <- log(EMIT_p_s1$INDUSTRY)
EMIT_p_s1$logRENEW <- log(EMIT_p_s1$RENEW*(1.02)^15)
EMIT_p_s1$logURBAN <- log(EMIT_p_s1$URBAN)
EMIT_p_s1$ECA <- 1
S1_ECA_1 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s1)

## Scenario 1 calculations for non-ECA:
EMIT_p_s1 <- EMIT_p_ECA_0

EMIT_p_s1$logPOPULATION <- log(EMIT_p_s1$POPULATION*(1.0105)^15)
EMIT_p_s1$logGDPCAP <- log(EMIT_p_s1$GDPCAP*(1.015)^15)
EMIT_p_s1$logINDUSTRY <- log(EMIT_p_s1$INDUSTRY)
EMIT_p_s1$logRENEW <- log(EMIT_p_s1$RENEW*(1.02)^15)
EMIT_p_s1$logURBAN <- log(EMIT_p_s1$URBAN)
EMIT_p_s1$ECA <- 0
S1_ECA_0 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s1)

exp(S1_ECA_1)
       1 
116702.8 
exp(S1_ECA_0)
       1 
118612.8 
## Scenario 2 calculations for ECA:
EMIT_p_s2 <- EMIT_p_ECA_1

EMIT_p_s2$logPOPULATION <- log(EMIT_p_s2$POPULATION*(1.0105)^15)
EMIT_p_s2$logGDPCAP <- log(EMIT_p_s2$GDPCAP*(1.005)^15)
EMIT_p_s2$logINDUSTRY <- log(EMIT_p_s2$INDUSTRY)
EMIT_p_s2$logRENEW <- log(EMIT_p_s2$RENEW*(1.05)^15)
EMIT_p_s2$logURBAN <- log(EMIT_p_s2$URBAN)

EMIT_p_s2$ECA <- 1
S2_ECA_1 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s2)


## Scenario 2 calculations for non-ECA:
EMIT_p_s2 <- EMIT_p_ECA_0

EMIT_p_s2$logPOPULATION <- log(EMIT_p_s2$POPULATION*(1.0105)^15)
EMIT_p_s2$logGDPCAP <- log(EMIT_p_s2$GDPCAP*(1.005)^15)
EMIT_p_s2$logINDUSTRY <- log(EMIT_p_s2$INDUSTRY)
EMIT_p_s2$logRENEW <- log(EMIT_p_s2$RENEW*(1.05)^15)
EMIT_p_s2$logURBAN <- log(EMIT_p_s2$URBAN)

EMIT_p_s2$ECA <- 0
S2_ECA_0 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s2)

exp(S2_ECA_1)
       1 
89414.57 
exp(S2_ECA_0)
       1 
90877.94 
## Scenario 3 calculations for ECA:
EMIT_p_s3 <- EMIT_p_ECA_1

EMIT_p_s3$logPOPULATION <- log(EMIT_p_s3$POPULATION*(1.0105)^15)
EMIT_p_s3$logGDPCAP <- log(EMIT_p_s3$GDPCAP*(1.03)^15)
EMIT_p_s3$logINDUSTRY <- log(EMIT_p_s3$INDUSTRY)
EMIT_p_s3$logRENEW <- log(EMIT_p_s3$RENEW*(1.005)^15)
EMIT_p_s3$logURBAN <- log(EMIT_p_s3$URBAN)

EMIT_p_s3$ECA <- 1
S3_ECA_1 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s3)

## Scenario 3 calculations for non-ECA:
EMIT_p_s3 <- EMIT_p_ECA_0

EMIT_p_s3$logPOPULATION <- log(EMIT_p_s3$POPULATION*(1.0105)^15)
EMIT_p_s3$logGDPCAP <- log(EMIT_p_s3$GDPCAP*(1.03)^15)
EMIT_p_s3$logINDUSTRY <- log(EMIT_p_s3$INDUSTRY)
EMIT_p_s3$logRENEW <- log(EMIT_p_s3$RENEW*(1.005)^15)
EMIT_p_s3$logURBAN <- log(EMIT_p_s3$URBAN)

EMIT_p_s3$ECA <- 0
S3_ECA_0 <- predict(Fit_log_urban_ECA, newdata=EMIT_p_s3)

exp(S3_ECA_1)
       1 
150315.9 
exp(S3_ECA_0)
     1 
152776 
## Finally does it also make sense to calculate the average for the ECA and non-ECA countries in 2015 - to benchmark the projections and also to help check that the projections made are realistic:
library(dplyr)
Emissions <- EMIT %>% 
    group_by(ECA) %>%
    summarise( n = n(), meanEmissions =mean(EMISSIONS, na.rm=TRUE)) 
Emissions
# A tibble: 2 x 3
    ECA     n meanEmissions
  <dbl> <int>         <dbl>
1     0   159       180981.
2     1    58       107696.
## These last numbers we can use to make relative the projections in the table in terms of seeing what each scenario implies for reaching a certain target - here relative to 2015 (even though often the targets are set relative to for example 1990)

Question 6.

Make a plot, graph or figure that you think best sums up or illustrates your findings in this paper. Write a one paragraph accompanying conclusion to your paper.

References

McNicoll, G. (2014). Population and sustainability. In Handbook of sustainable development. Edward Elgar Publishing.

Wu, Y., Shen, J., Zhang, X., Skitmore, M., & Lu, W. (2016). The impact of urbanization on carbon emissions in developing countries: a Chinese study based on the U-Kaya method. Journal of Cleaner Production, 135, 589-603.