Reminders about submitting your assignment

Please make sure that you first rename and save this file as “LASTNAME-pset3.Rmd” before you do any work. Once you are finished with your problem set and have knitted your file, please remember the following steps to properly upload your assignment to Canvas:

In the Files pane of RStudio, click the box beside all the files you want to download (i.e., the .Rmd and .html files).
Hit the gear icon and select export from the dropdown menu.
Hit download in the dialog box.
Find the zip folder in your downloads (NOT through your browser) and unzip the folder so you can access the individual .html and .Rmd files.
Go to the Canvas submission portal and submit the .Rmd and .html files contained within the unzipped folder.

As a reminder, you need to submit both your .html and .Rmd files for full credit. Any other file extension, such as .htm or .mhtml, cannot be read and graded. Also, please remember that you need to submit both of your individual files, NOT the zip folder. If you submitted the zip folder, please follow the steps above to correctly submit your problem set.

Submit your HTML output and .Rmd file to Canvas by the deadline – 11:55pm on November 12 – and remember to abide by the course’s generative AI policies.

Question 1. What drives greenhouse gas emissions?

One of the most important questions in environmental politics research is about the relationship between greenhouse gas emissions and economic development. In this problem set, we will study the determinants of emissions as a function of economic development, energy usage, and the carbon intensity of energy demand. To do so, we will use data from the Our World in Data project for countries in 2021.

Variable Name	Description
`country`	Country - Geographic location.
`iso_code`	ISO code - ISO 3166-1 alpha-3 three-letter country codes.
`greenhouse_gas_emissions`	Emissions from electricity generation - Measured in megatonnes of CO2 equivalents (Mt CO2e).
`primary_energy_consumption`	Primary energy consumption - Measured in terawatt-hours (TWh).
`electricity_demand`	Electricity demand - Measured in terawatt-hours (TWh).
`carbon_intensity_elec`	Carbon intensity of electricity generation - Greenhouse gases emitted per unit of generated electricity, measured in grams of CO2 equivalents per kilowatt-hour (gCO2e / kWh).
`gdp_cap`	Gross domestic product (GDP) per capita, measured in US dollars.

Load the dataset (electricity_data.RData) into your environment. What name does R give this dataset? What are the dataset’s dimensions?

load("electricity_data.RData")
dim(energy)

## [1] 165   7

R gives this dataset the name “energy.” The dimensions of the dataset are 165 rows x 7 columns.
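One way to confirm the object name without checking the Environment pane is to capture the return value of load(), which is a character vector of the restored object names. The sketch below is self-contained and uses a hypothetical two-row stand-in (`energy_demo`) saved to a temporary file, not the real electricity_data.RData.

```r
# Hypothetical stand-in for electricity_data.RData so the sketch runs on its own
energy_demo <- data.frame(country = c("A", "B"), gdp_cap = c(1000, 2000))
tmp <- tempfile(fileext = ".RData")
save(energy_demo, file = tmp)
rm(energy_demo)

# load() invisibly returns the names of the objects it restores
loaded_names <- load(tmp)
print(loaded_names)

# Look the object up by name and report rows x columns
dims <- dim(get(loaded_names[1]))
print(dims)
```

With the real file, the same pattern would presumably return "energy" and the dimensions reported above.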

Calculate the mean, median, and standard deviation for greenhouse_gas_emissions, gdp_cap, and carbon_intensity_elec. Substantively interpret the mean value for each of these variables (i.e., explain what the number means, using appropriate units). For each variable, compare the mean and median. What does this comparison tell you about the direction of skewness (if any) for each variable? Do NOT create a distribution plot to answer this question, just rely on the summary statistics.

mean(energy$greenhouse_gas_emissions)

## [1] 83.508

median(energy$greenhouse_gas_emissions)

## [1] 5.73

sd(energy$greenhouse_gas_emissions)

## [1] 430.0907

mean(energy$gdp_cap)

## [1] 18805.94

median(energy$gdp_cap)

## [1] 12639.18

sd(energy$gdp_cap)

## [1] 19750.05

mean(energy$carbon_intensity_elec)

## [1] 416.7446

median(energy$carbon_intensity_elec)

## [1] 447.059

sd(energy$carbon_intensity_elec)

## [1] 245.8638

For greenhouse gas emissions, the mean is 83.508, the median is 5.73, and the standard deviation is 430.0907. For GDP per capita, the mean is 18805.94, the median is 12639.18, and the standard deviation is 19750.05. For carbon intensity of electricity generation, the mean is 416.7446, the median is 447.059, and the standard deviation is 245.8638. The mean value for greenhouse gas emissions is 83.508 Mt CO2e. This means the average country in the dataset emitted about 83.5 megatonnes of CO2 equivalents from electricity generation in 2021. The mean value of GDP per capita is 18805.94 US dollars. This means that, on average, a country's total economic output divided by its population comes to about 18,806 US dollars per person. The mean value for carbon intensity of electricity generation is 416.7446 gCO2e/kWh. This means that, in the average country, generating one kilowatt-hour of electricity emits almost 417 grams of CO2 equivalents.

For greenhouse gas emissions, the mean (83.508) is much larger than the median (5.73). This tells us that the variable has a strong positive (right) skew: a few very high-emitting countries act as outliers that pull the mean up, while most countries emit far less than the mean. For GDP per capita, the mean (18805.94) is higher than the median (12639.18). This tells us that there is a positive (right) skew: a smaller group of wealthy countries raises the mean above the typical country, so most countries have a GDP per capita below the mean. For carbon intensity of electricity, the mean (416.74) is slightly lower than the median (447.06). This tells us that there is a slight negative (left) skew: a subset of countries with relatively clean electricity systems (low carbon intensity) pulls the mean down, so most countries have a carbon intensity above the mean.
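The mean-versus-median comparison can be tabulated in a few lines. This is a minimal sketch that hard-codes the summary statistics reported above instead of reading the dataset; the column names `gap` and `skew` are illustrative, and the comparison is a rule of thumb, not a formal skewness test.

```r
# Means and medians copied from the output above
stats <- data.frame(
  variable = c("greenhouse_gas_emissions", "gdp_cap", "carbon_intensity_elec"),
  mean     = c(83.508, 18805.94, 416.7446),
  median   = c(5.73, 12639.18, 447.059)
)

# Mean above median suggests right skew; mean below median suggests left skew
stats$gap  <- stats$mean - stats$median
stats$skew <- ifelse(stats$gap > 0, "right (positive)",
                     ifelse(stats$gap < 0, "left (negative)", "none"))
print(stats)
```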

We will start by testing the hypothesis that richer countries have higher emissions. For this, regress greenhouse_gas_emissions on gdp_cap. Show the summary() for a bivariate regression model and interpret the results. Be sure to address all of the following:

What are your independent and dependent variables?
Report your estimates for $\beta_0$ and $\beta_1$ to 3 decimal places, and explain what each estimate is communicating using its specific value and appropriate units.
For $\beta_1$, report the estimate’s associated p-value. Also specify the null and alternate hypotheses you are testing using appropriate notation. Finally, evaluate the p-value with reference to the standard significance threshold $\alpha$ and provide a conclusion regarding your hypotheses.
Report the model’s $R^2$ value to 3 decimal places and provide an interpretation – what is this specific value specifying about your model?
Use your regression model results to predict the difference in emissions between two hypothetical countries with GDP per capita of $20,000 and $50,000. You should use R as a calculator for this, and you must show the calculation in your code chunk using your specific $\hat{\beta_0}$ and $\hat{\beta_1}$ values.

model <- lm(greenhouse_gas_emissions ~ gdp_cap, data = energy)
plot(energy$gdp_cap, energy$greenhouse_gas_emissions,
  main = "GDP per Capita vs Greenhouse Gas Emissions",
  xlab = "Gross Domestic Product Per Capita in Dollars",
  ylab = "Greenhouse Gas Emissions in Megatonnes"
  )
abline(model,
       col = "lavender",
       lwd = 3)

summary(model)

## 
## Call:
## lm(formula = greenhouse_gas_emissions ~ gdp_cap, data = energy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -248.3  -73.8  -56.6  -51.3 5028.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.121367  46.305702   1.126    0.262
## gdp_cap      0.001669   0.001701   0.981    0.328
## 
## Residual standard error: 430.1 on 163 degrees of freedom
## Multiple R-squared:  0.005874,   Adjusted R-squared:  -0.0002252 
## F-statistic: 0.9631 on 1 and 163 DF,  p-value: 0.3279

The independent variable is GDP per capita and the dependent variable is greenhouse gas emissions. The estimates are $\hat{\beta_0}$ = 52.121 and $\hat{\beta_1}$ = 0.002 (0.001669 before rounding). The intercept means a country with a GDP per capita of zero is predicted to emit 52.121 Mt CO2e, and the slope means each additional US dollar of GDP per capita is associated with about 0.002 Mt CO2e more emissions. From summary(model) we can see that the p-value for $\hat{\beta_1}$ is 0.328. Our null hypothesis is that there is no linear association between GDP per capita and emissions, or in other words:

$H_0$: $\beta_1 = 0$

Our alternative hypothesis is that there is an association between GDP per capita and emissions, or in other words:

$H_a$: $\beta_1 \neq 0$

We use the standard threshold $\alpha$ = 0.05. Comparing our p-value with $\alpha$, we observe that 0.328 > 0.05, so we fail to reject the null hypothesis. Failing to reject the null does not let us accept the alternative: although the estimated slope is slightly positive ($\hat{\beta_1}$ = 0.002), it is not statistically distinguishable from zero at the 5% level. Thus, this model gives no reliable evidence that richer countries have higher emissions. Our model’s $R^2$ value is 0.006 (the adjusted $R^2$ is -0.0002), which is extremely small. This means GDP per capita explains only about 0.6% of the variation in emissions, so the model fits the data poorly.
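As a cross-check on the reported p-value, the t statistic and two-sided p-value can be recomputed from the estimate and standard error. This is a sketch that copies the rounded numbers from the summary(model) output, so it can differ from R's own figures in the last digit; the variable names are illustrative.

```r
# Values copied from the summary(model) output above
estimate  <- 0.001669
std_error <- 0.001701
df_resid  <- 163
alpha     <- 0.05

t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with residual degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(t_stat, 3))
print(round(p_value, 3))

# TRUE would mean rejecting the null at the 5% level
reject_null <- p_value < alpha
print(reject_null)
```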

beta0 <- 52.121367
beta1 <- 0.001669

gdp20k <- 20000
gdp50k <- 50000

pred1 <- beta0 + beta1 * gdp20k
pred2 <- beta0 + beta1 * gdp50k
print(pred1)

## [1] 85.50137

print(pred2)

## [1] 135.5714

pred2 - pred1

## [1] 50.07

From our calculations, the model predicts that a country with a GDP per capita of $50,000 would emit approximately 50.07 Mt CO2e more greenhouse gases than a country with a GDP per capita of $20,000.
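Because the intercept appears in both predictions, it cancels in the difference, so the same answer follows from the slope alone. A minimal sketch, using the slope from summary(model); `gdp_gap` and `pred_diff` are illustrative names.

```r
beta1   <- 0.001669        # slope from summary(model), Mt CO2e per US dollar
gdp_gap <- 50000 - 20000   # difference in GDP per capita, in US dollars

# Predicted difference in emissions: the intercept drops out
pred_diff <- beta1 * gdp_gap
print(pred_diff)
```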

Can $\hat{\beta_1}$ from part (c) be interpreted as the causal effect of GDP per capita on GHG emissions? Explain why or why not. Then, propose one specific variable that is NOT in your dataset that could act as a confounder. Specifically, explain how this variable would affect the variables in your model and what direction of bias it might create (i.e., would your $\hat{\beta_1}$ be overestimated or underestimated?). Explain your reasoning.

No, it cannot because of omitted variable bias (OVB). If some other variable affects GDP per capita and greenhouse gas emissions, and that variable is not included in the regression, the OLS model mixes the effect of the GDP per capita with the effect of the omitted variable. One variable that could act as a confounder is population. Total population would cause a country to emit more greenhouse gases due to more energy use needed to keep the country running. Additionally, having a larger population can decrease a country’s GDP per capita if it grows faster than the total GDP. We can use the OVB formula:

$\text{Bias}(\hat{\beta_1}) = \gamma \cdot \frac{\text{Cov}(X_1, X_2)}{\text{Var}(X_1)}$

Where $X_1$ is GDP per capita, $X_2$ is population, and $\gamma$ is the effect of population on emissions. We can assume $\gamma$ > 0, as a larger population tends to increase overall greenhouse gas emissions. We can also assume $Cov(X_1, X_2)$ < 0 if population and GDP per capita are negatively correlated. Because $\gamma$ is positive and the covariance is negative, the bias is negative, and the regression omitting population would underestimate the true effect of GDP per capita on total emissions.
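The direction of the bias can be illustrated with simulated data. This is a sketch under made-up assumptions, not an analysis of the energy dataset: `x1` stands in for GDP per capita, `x2` for the omitted variable, and the coefficients (1, 2, and -0.5) are arbitrary choices that make $\gamma$ > 0 and the covariance negative.

```r
set.seed(15)  # arbitrary seed for reproducibility
n  <- 5000
x2 <- rnorm(n)                    # omitted variable (stand-in for population)
x1 <- -0.5 * x2 + rnorm(n)        # regressor, negatively correlated with x2
y  <- 1 * x1 + 2 * x2 + rnorm(n)  # true effect of x1 is 1; gamma = 2 > 0

long_fit  <- lm(y ~ x1 + x2)  # includes the confounder
short_fit <- lm(y ~ x1)       # omits the confounder

b1_long  <- unname(coef(long_fit)["x1"])
gamma    <- unname(coef(long_fit)["x2"])
b1_short <- unname(coef(short_fit)["x1"])

# In-sample OVB identity: short slope = long slope + gamma * Cov / Var
bias <- gamma * cov(x1, x2) / var(x1)
print(c(long = b1_long, short = b1_short, bias = bias))
```

In this setup the short regression's slope falls below the long regression's slope, matching the negative sign that the formula predicts.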

Now we will use multivariate regression to begin addressing some of the possible confounders. In other words, we will build a more complex model.

Next let’s evaluate whether the previous finding is robust to accounting for the energy usage of a country, measured with primary_energy_consumption, and the carbon intensity of electricity demand, measured with carbon_intensity_elec. The latter is an indication of whether a country’s electricity is generated with more carbon-intensive fuels like coal versus non-carbon-intensive sources like solar, wind, nuclear, and hydro.

Run a multivariate OLS model that regresses emissions on country wealth, energy usage, and carbon-intensity of energy usage. Produce a summary table for your model results and interpret these results. Be sure to address all of the following:

Report all coefficient estimates to 3 decimal places and explain what each estimate is communicating using its specific value and appropriate units.
For all independent variables, report the estimate’s associated p-value. Evaluate the p-value with reference to the standard significance threshold $\alpha$ and provide a conclusion regarding your hypotheses.
Report the model’s $R^2$ value to 3 decimal places and provide an interpretation – what does this specific value tell you about your model? How does this compare to the $R^2$ from your bivariate model in part (c)? What does this comparison tell you?
Compare your results to the bivariate model from part (c). Report $\hat{\beta_1}$ for gdp_cap from both models and calculate the difference (use R to calculate this and show your work). What does this change tell you about the relationship between GDP, energy consumption, carbon intensity, and emissions?

model2 <- lm(greenhouse_gas_emissions~gdp_cap+primary_energy_consumption+carbon_intensity_elec, data = energy)
summary(model2)

## 
## Call:
## lm(formula = greenhouse_gas_emissions ~ gdp_cap + primary_energy_consumption + 
##     carbon_intensity_elec, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -933.84   -9.74    5.19   20.62  628.87 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -20.773273  17.619639  -1.179  0.24014    
## gdp_cap                     -0.001273   0.000411  -3.097  0.00231 ** 
## primary_energy_consumption   0.102376   0.001989  51.481  < 2e-16 ***
## carbon_intensity_elec        0.062084   0.032792   1.893  0.06012 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.9 on 161 degrees of freedom
## Multiple R-squared:  0.9438, Adjusted R-squared:  0.9427 
## F-statistic:   901 on 3 and 161 DF,  p-value: < 2.2e-16

The primary energy consumption variable has a p-value < 2e-16, which is far below $\alpha$ = 0.05 and indicates strong evidence against the null hypothesis of no association. Its coefficient is 0.102, which means that, holding GDP per capita and carbon intensity constant, each additional 1 TWh of primary energy consumption is associated with 0.102 Mt CO2e more emissions. Countries that use more energy emit more greenhouse gases, regardless of income or how clean their electricity is.

The GDP per capita variable has a p-value of 0.002. Once we control for energy use and carbon intensity, the coefficient on GDP per capita becomes negative. Its value is -0.001 (-0.001273 before rounding), which means that, holding energy use and carbon intensity constant, a one US dollar increase in GDP per capita is associated with a decrease of 0.001 Mt CO2e in greenhouse gas emissions. Because p = 0.002 < $\alpha$ = 0.05, we reject the null hypothesis of no association; the estimate is statistically significant at the 5% level. This could mean wealthier countries use cleaner technologies or are more energy efficient. It also suggests that GDP per capita is not mechanically linked to total emissions once we control for actual energy use.

The carbon intensity of electricity variable has a coefficient of 0.062, which means that, holding GDP per capita and energy use constant, a one gCO2e / kWh increase in the carbon intensity of electricity generation is associated with a 0.062 Mt CO2e increase in greenhouse gas emissions. Its p-value is 0.060, which is just above the conventional $\alpha$ = 0.05 threshold. Thus we fail to reject the null of no effect at $\alpha$ = 0.05, although the estimate is marginal and would count as suggestive evidence at the 10% level.

The $R^2$ value of our model is 0.944 (adjusted $R^2$ = 0.943), which means the model explains 94.4% of the variance in greenhouse gas emissions. Compared to the bivariate model in part (c), whose $R^2$ was 0.006, this model is a much better fit for the data. The comparison indicates that most of the variation in emissions in this dataset is captured by energy usage and the carbon intensity of electricity, not by GDP per capita alone.
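The prompt also asks for the change in the gdp_cap coefficient between the two models. A minimal sketch of that arithmetic, using the estimates and $R^2$ values copied from the two summary() outputs; the variable names are illustrative.

```r
# gdp_cap slopes copied from the two summary() outputs
b1_bivariate    <- 0.001669
b1_multivariate <- -0.001273

# Change in the slope after adding the two controls
b1_change <- b1_multivariate - b1_bivariate
print(b1_change)

# Multiple R-squared from each model
r2_bivariate    <- 0.005874
r2_multivariate <- 0.9438
r2_gain <- r2_multivariate - r2_bivariate
print(round(r2_gain, 3))
```

The slope falls by roughly 0.003 Mt CO2e per dollar and changes sign, which is consistent with the bivariate slope having absorbed part of the association between energy use and emissions.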

Can the coefficient estimates for primary_energy_consumption and carbon_intensity_elec from part (e) be interpreted as causal effects? Why or why not? Address each variable separately in your answer.

They cannot, as too many factors remain unaccounted for to assume these relationships are causal. With primary energy consumption, a country may choose to raise its consumption for reasons that the model does not capture. With carbon intensity of electricity generation, causality may run in reverse: countries with high emissions may adopt stricter policies that lower carbon intensity. Additionally, no country was randomly assigned its level of carbon intensity or primary energy consumption. To interpret coefficients causally, treatment needs to be as good as random, so these variables are endogenous.

Then, propose one additional variable that is NOT in your dataset that could still act as a confounder even after controlling for energy consumption and carbon intensity. Explain how this variable would affect the variables in your model and what direction of bias it might create on your estimate for gdp_cap (i.e., would your $\hat{\beta_1}$ estimate be overestimated or underestimated?). Explain your reasoning.

Environmental policy could act as a confounder for this dataset. Countries with stricter environmental regulations tend to have lower greenhouse-gas emissions. Having more environmental policies is also positively correlated with GDP per capita as wealthier countries can afford to maintain stricter environmental regulations. Thus, a higher GDP per capita means more environmental regulations. Using our OVB formula:

$\text{Bias}(\hat{\beta_1}) = \gamma \cdot \frac{\text{Cov}(X_1, X_2)}{\text{Var}(X_1)}$

Where $X_1$ = GDP per capita, $X_2$ = environmental policy, and $\gamma$ is the effect of environmental policy on emissions. Since environmental policy is positively correlated with GDP per capita ($Cov(X_1, X_2)$ > 0) and negatively related to emissions ($\gamma$ < 0; the more environmental policies in place, the less countries emit), the bias is negative (downward), so our $\hat{\beta_1}$ for gdp_cap would be underestimated, that is, more negative than the true effect.
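The sign logic behind the bias formula can be written as a tiny helper. `ovb_direction` is a hypothetical function introduced here for illustration; it only multiplies the two signs, and in a model that already has other controls the relevant covariance is the one that remains after accounting for those controls, so treat it as a heuristic.

```r
# Hypothetical helper: direction of omitted variable bias from two signs
ovb_direction <- function(gamma_sign, cov_sign) {
  s <- sign(gamma_sign) * sign(cov_sign)
  if (s > 0) {
    "upward (overestimate)"
  } else if (s < 0) {
    "downward (underestimate)"
  } else {
    "none"
  }
}

# Population example from part (d): gamma > 0, Cov < 0
pop_case <- ovb_direction(+1, -1)

# Environmental policy example: gamma < 0, Cov > 0
policy_case <- ovb_direction(-1, +1)

print(c(population = pop_case, policy = policy_case))
```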

Question 2. Substantive Interpretation

For Question 1, write a brief summary (300 words or less) of what you learned from analyzing the predictors of GHG emissions. Your summary should:

Report and interpret at least one coefficient from your multivariate model (part (e)), including its value and statistical significance.
Discuss whether the magnitude of this coefficient is meaningful in a practical sense (i.e., is a one-unit change in this variable associated with a large or small change in emissions? Does this effect size matter for policy purposes?). Explain your reasoning.
Discuss how adding multiple variables to your model changed your understanding of what predicts GHG emissions compared to the bivariate model in part (c).

Write in paragraph form using complete sentences.

Primary energy consumption has a coefficient of 0.102 (p < 2e-16, statistically significant at the 5% level), which means that each additional 1 TWh of primary energy consumption is associated with 0.102 Mt CO2e more emissions, holding GDP per capita and carbon intensity constant. I would argue that this is a small change: global emissions in 2024 were 37.8 gigatons of $CO_2$, and 1 gigaton equals 1000 megatonnes, so a one-unit change in this variable is a tiny fraction of global emissions. As a result, I do not think a one-unit change will affect policy much. Prior to adding multiple variables, GDP per capita had a slight positive but statistically insignificant association with greenhouse gas emissions. However, the bivariate model was not a good fit for the data, as its $R^2$ value was only 0.006. Once we accounted for other variables like primary energy consumption and the carbon intensity of electricity, it became clear that GDP per capita itself is not the primary driver of emissions. Rather, GDP per capita is only part of the story: how much energy a country uses and how carbon-intensive its electricity system is also play key roles in determining a country’s emissions. By including these variables, the answer to “Do richer countries emit more $CO_2$?” shifts from a simple “Yes because they have more money” to “Yes, because of money, energy consumption levels, and the cleanliness of a country’s energy system.”
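The practical-magnitude argument can be checked with a short calculation. This is a sketch that takes the 37.8 Gt figure quoted in the text at face value (it refers to global $CO_2$, while the coefficient is in CO2 equivalents from electricity generation, so the comparison is only approximate); the variable names are illustrative.

```r
coef_mt_per_twh <- 0.1024       # Mt CO2e per TWh, from summary(model2)
global_gt <- 37.8               # Gt CO2 in 2024, the figure quoted in the text
global_mt <- global_gt * 1000   # 1 Gt = 1000 Mt

# One extra TWh as a share of the quoted global total
share_of_global <- coef_mt_per_twh / global_mt
print(signif(share_of_global, 3))

# Same coefficient in grams per kWh: 1 Mt = 1e12 g, 1 TWh = 1e9 kWh
g_per_kwh <- coef_mt_per_twh * 1e12 / 1e9
print(g_per_kwh)
```

Expressed per kWh, the coefficient is in the same units as carbon_intensity_elec, which may make its size easier to judge than a comparison against the global total.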

Question 3

Did you collaborate with anyone on this problem set? If so, list them here. ANSWER: I attended Longjiao’s Office Hours for help on this problem set. For Question 2, I had to look up how much carbon was emitted globally.

Question 4

Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!

Problem Set 3

Alexei Kim, PS 15, UCSB
