Submit your HTML output and .Rmd file to Canvas by the deadline.
It is best to work with small amounts of code at a time: get some code working, copy it into the .Rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file still knits properly. Do not proceed to answer more questions until you get the first bit working. If you knit every time you write some new code, you’ll know where the error is (in the last thing you did!). This will save you huge headaches.
Although the questions break up each task for you into parts, remember that you might need to put a bunch of code together into a single chunk to make it work. For example, if you create a density plot in one part of a question, and want to add the mean value to it as a line in another part, you need these two commands to follow one another in the same chunk of code.
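A minimal sketch of the density-plot case described above, using simulated values rather than the problem-set data (the vector `incomes_sim` is hypothetical). Both commands sit in one chunk because `abline()` draws on the plot that is currently open:

```r
# Simulated data for illustration only; not the problem-set data.
set.seed(1)
incomes_sim <- rexp(200, rate = 1 / 10)

# Draw the density plot first ...
plot(density(incomes_sim), main = "Density of simulated incomes")

# ... then add the mean as a vertical line to that same plot.
mean_income <- mean(incomes_sim)
abline(v = mean_income, col = "red", lty = 2)
```

If the two commands were split across chunks, `abline()` would likely fail because no plot is open when it runs.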
Some tips: Start early, work with friends in the class, use the discussion forum, watch section videos, go to office hours, read the textbook – do all these things and you’ll succeed! Good luck.
For this problem set, we will further explore a relationship we discussed in lecture: income and carbon pollution. To do so, we will use data from “Our World in Data”.
1.1 Set your working directory and load the data.
```{r}
getwd()
```

```{r}
setwd("/home/jovyan")
```

```{r}
load("~/directory/co2_gdppc.RData")
```

```{r}
load("~/directory/climate_change_year.RData")
```
We will deal with two variables: `co2percapita` and
`gdppc`. The variable `co2percapita` measures the emissions
of carbon dioxide per person from fossil fuels and industry, measured in
tons of carbon. The variable `gdppc` measures the GDP per
capita of each country-year in 1000 US Dollars to facilitate
cross-country comparisons.
1.2 Produce a scatter plot with `gdppc` on the horizontal
axis and `co2percapita` on the vertical axis and use the
`abline` command to put the linear regression line on the
plot. Given how you have set up your plot, which is your independent
variable and which is your dependent variable? 5 points
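```{r}
df1 <- co2_gdppc
# Regress CO2 per capita (dependent) on GDP per capita (independent)
model <- lm(co2percapita ~ gdppc, data = df1)
# x-axis: gdppc; y-axis: co2percapita
plot(df1$gdppc, df1$co2percapita,
     xlab = "gdppc", ylab = "co2percapita", main = "Relation")
abline(model, col = "red")
```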
Dependent variable: CO2 per capita.

Independent variable: GDP per capita.
Bonus question: explain why the plot above might not be informative to describe the relationship between these two variables. Up to 10 additional points
The plot above might not be informative about the relationship between these two variables because it does not demonstrate that changes in GDP per capita cause changes in CO₂ emissions (causation), nor does it account for confounding variables. Additionally, because the data points are densely concentrated in one area of the plot, the pattern among most observations is hard to see: variation within the cluster is visually compressed, outliers are harder to detect, and a small number of extreme values can have a large influence on the fitted line.
1.3 Estimate and report both the covariance and the correlation of
gdppc and co2percapita. Write the meaning of
what these results tell you, using the definitions for these two
variables stated above. 5 points.
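```{r}
cov(df1$gdppc, df1$co2percapita)
```

```{r}
cor(df1$gdppc, df1$co2percapita)
```

The covariance is 77.305. Its positive sign indicates a positive relationship between CO2 per capita and GDP per capita: as GDP per capita increases, CO2 emissions per capita tend to increase. The magnitude depends on the units of both variables, so by itself it does not show how strong the relationship is.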
The correlation is 0.56, which indicates a moderate positive relationship between CO2 per capita and GDP per capita. This suggests that higher GDP per capita tends to go with higher emissions per person. However, because the relationship is only moderate, this pattern does not hold for every country-year.
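To illustrate why the covariance is harder to interpret than the correlation, here is a small sketch on simulated data (not the problem-set data; all variable names below are hypothetical). Expressing GDP per capita in dollars instead of thousands of dollars changes the covariance but leaves the correlation untouched:

```r
# Simulated data for illustration only; not the problem-set data.
set.seed(42)
gdp_thousands <- runif(100, min = 1, max = 60)   # GDP per capita in 1000 USD
co2_sim <- 0.5 + 0.3 * gdp_thousands + rnorm(100, sd = 3)

# The same variable expressed in a different unit (USD).
gdp_dollars <- gdp_thousands * 1000

cov_thousands <- cov(gdp_thousands, co2_sim)
cov_dollars   <- cov(gdp_dollars, co2_sim)   # 1000 times cov_thousands
cor_thousands <- cor(gdp_thousands, co2_sim)
cor_dollars   <- cor(gdp_dollars, co2_sim)   # same as cor_thousands
```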
1.4 The model below describes a linear regression of
`co2percapita` on `gdppc`. Explain what each of
the terms below mean using the terms we studied in lecture. 5
points.
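The equation itself does not appear in this text. Judging from the five terms the answer explains, it is presumably the standard bivariate regression model (a reconstruction, not copied from the original handout):

\[co2percapita_i = \beta_0 + \beta_1 \, gdppc_i + \epsilon_i\]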
\(co2percapita_i\) is the dependent variable. It denotes CO2 emissions per capita for observation \(i\), measured in tons of carbon.

\(\beta_{0}\) is the intercept. It represents the expected value of CO2 per capita when GDP per capita is zero.

\(\beta_{1}\) is the slope coefficient. It represents the expected change in CO2 emissions per capita associated with a one-unit change in GDP per capita (here, 1000 US dollars).

\(gdppc_i\) is the independent variable. It represents annual GDP per capita for observation \(i\), measured in 1000 US dollars.

\(\epsilon_i\) is the error term. It captures factors other than GDP per capita that affect CO2 emissions per capita, as well as noise in the data. Its estimated counterpart, the residual, is the difference between actual and predicted CO2 emissions per capita.
You do not have to estimate the model yet so this is not in a code chunk.
1.5 Explain how we will estimate the best values of \(\beta_0\) and \(\beta_1\). In what sense is the line that we choose (by choosing \(\beta_0\) and \(\beta_1\)) the “best-fitting” line? 5 points.
We estimate \(\beta_0\) and \(\beta_1\) using Ordinary Least Squares (OLS). For any candidate line, the residual for each observation is the difference between actual and predicted CO2 per capita. OLS chooses the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of these squared residuals. The resulting line is “best-fitting” in the sense that no other line produces a smaller sum of squared residuals.
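As a rough check of this idea, the sketch below uses simulated data (not the problem-set data; `x_sim`, `y_sim`, and `ssr` are hypothetical names) to compute the one-predictor OLS estimates by hand, compare them with `lm()`, and confirm that nearby lines have a larger sum of squared residuals:

```r
# Simulated data for illustration only; not the problem-set data.
set.seed(123)
x_sim <- runif(50, min = 0, max = 40)
y_sim <- 1 + 0.3 * x_sim + rnorm(50, sd = 2)

# Closed-form OLS estimates for a single predictor.
b1_hat <- cov(x_sim, y_sim) / var(x_sim)
b0_hat <- mean(y_sim) - b1_hat * mean(x_sim)

# Sum of squared residuals for any candidate intercept and slope.
ssr <- function(b0, b1) sum((y_sim - (b0 + b1 * x_sim))^2)

# lm() should return the same two numbers.
fit_sim <- lm(y_sim ~ x_sim)
coef(fit_sim)

# Moving away from the OLS line increases the sum of squared residuals.
ssr(b0_hat, b1_hat) < ssr(b0_hat, b1_hat + 0.05)
ssr(b0_hat, b1_hat) < ssr(b0_hat + 0.5, b1_hat)
```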
1.6 Now use linear regression to regress co2percapita on
gdppc using the lm() function in
R - make sure you save the model as an object. Show the
result using the summary() command. Interpret the meaning
of the coefficient estimates (both the intercept and the coefficients on
gdppc). 15 points.
```{r}
y <- df1$co2percapita
x <- df1$gdppc
model1 <- lm(y ~ x)
summary(model1)
```
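\(\hat{\beta}_0\): When GDP per capita is 0, predicted CO2 emissions per capita are 0.758 tons.

\(\hat{\beta}_1\): A one-unit increase in GDP per capita (1000 US dollars) is associated with an increase of 0.302 tons in CO2 emissions per capita, on average.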
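To make the interpretation concrete, the two estimates can be plugged into the fitted line for a given income level. The sketch below uses the coefficient values reported above and a hypothetical country-year with GDP per capita of 10 (that is, 10,000 US dollars), giving 0.758 + 0.302 × 10 = 3.778 tons:

```r
# Coefficient estimates as reported in the answer above.
b0_reported <- 0.758   # intercept, tons of CO2 per capita
b1_reported <- 0.302   # slope, tons per 1000 US dollars of GDP per capita

# Hypothetical country-year with GDP per capita of 10 (10,000 US dollars).
gdppc_example <- 10
predicted_co2 <- b0_reported + b1_reported * gdppc_example
predicted_co2
```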
**Bonus question**: Consider the p-values reported on the table in your interpretation, if you want to read ahead and figure out what these mean. Up to 10 additional points.
1.7 Sometimes we need to transform a variable to make it more suitable to analysis by regression. For example, with income-related variables like `gdppc`, we usually need to take their log first before using regression. Create two new variables that are equal to: (1) the log of `gdppc` and (2) the log of `co2percapita`. 5 points
```{r}
logx <- log(x)
logy <- log(y)
```
p-value: < 2.2e-16. This p-value is extremely low, which suggests that the relationship between GDP per capita and CO2 per capita is statistically significant. In other words, we can reject the null hypothesis that the coefficient is zero: an association this strong would be very unlikely to arise by chance alone. A small p-value does not rule out confounding, however, so it says nothing about whether the relationship between GDP per capita and CO₂ emissions per capita is causal.
1.8 Now remake a scatter plot like before but with the log of `gdppc` on the horizontal axis and the log of `co2percapita` on the vertical axis. Add the regression line using `abline()`. 5 points
```{r}
# Regress log CO2 per capita on log GDP per capita
model2 <- lm(logy ~ logx)
# x-axis: log of gdppc; y-axis: log of co2percapita
plot(logx, logy,
     xlab = "loggdppc",
     ylab = "logco2percapita",
     main = "Relation")
abline(model2, col = "red")
```
p-value: < 2.2e-16
1.9 Regardless of what you actually got in the above analyses, suppose that we find a positive and statistically significant coefficient in these regressions. Can we causally say that “being wealthier (higher GDP per capita) leads to having more carbon pollution per capita?” based on the data that we analyzed? Why or why not? Follow the instructions in class for how to address such causal questions, including pointing out potential confounders, non-comparability, and proposing the ideal research design. 5 points
We cannot causally say that being wealthier leads to having more carbon pollution per capita because correlation does not prove causation; we cannot say definitively that GDP per capita will cause a change in CO₂ emissions. Additionally, there may be confounding variables that affect GDP per capita and CO₂ emissions.
Begin by loading a new dataset
(`climate_change_year.RData`) into R. This dataset shows the
average sea and surface temperature anomaly for the entire world for
every year since 1851.
```{r}
load("climate_change_year.RData")
```

2.1 Make a scatterplot of temperature over time. Add a trend line. What does it
tell us about how the climate is changing over time? 10 points
```{r}
df2 <- climate_change_year
# Regress the temperature anomaly on year to get the trend line
model3 <- lm(mean_anomaly ~ year, data = df2)
plot(df2$year, df2$mean_anomaly,
     xlab = "year", ylab = "temperature", main = "Relation")
abline(model3, col = "red")
```
2.2 Next, subset the data into groups by decade. Start with the seventies, and then create subsets for the eighties, nineties, noughts, teens, and twenties (up to 2022). What is the mean temperature for each decade? What is the trend over time in the mean? 10 points
```{r}
# Select rows by the value of year, not by row position
subset_1970s <- df2[df2$year >= 1970 & df2$year <= 1979, ]
```
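One possible way to get all the decade means at once is sketched below. It runs on a simulated stand-in for the dataset (the data frame `climate_sim` and its values are made up; only the column names `year` and `mean_anomaly` are taken from the code above), so the numbers it prints are not the answer to the question:

```r
# Simulated stand-in for climate_change_year; the values are made up.
set.seed(7)
years <- 1851:2022
climate_sim <- data.frame(
  year = years,
  mean_anomaly = 0.008 * (years - 1851) - 0.4 + rnorm(length(years), sd = 0.1)
)

# Keep 1970 onward and label each year with the decade it belongs to.
recent <- climate_sim[climate_sim$year >= 1970, ]
recent$decade <- floor(recent$year / 10) * 10

# Mean anomaly per decade (1970, 1980, ..., 2020).
decade_means <- tapply(recent$mean_anomaly, recent$decade, mean)
decade_means
```

Computing a decade label and grouping on it avoids writing six nearly identical subsetting lines, although separate subsets, as the question asks for, work equally well.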
3.1 How is correlation different from covariance? Provide a formula that can turn \(cov(x,y)\) into \(cor(x,y)\). Explain how correlation relates to covariance. Also explain what the correlation means and the possible values it can take. 10 points
Correlation is calculated by dividing the covariance of the two variables by the product of their standard deviations. This addresses the main limitation of covariance, which depends on the units of the variables and so has no standardized scale. As a result, correlation values range between -1 and 1, where -1 indicates a perfect negative linear relationship and 1 indicates a perfect positive linear relationship. A correlation of 0 signifies that the two variables have no linear relationship (they may still be related in a non-linear way). Therefore, the closer the correlation is to -1 or 1, the stronger the linear relationship between the variables, while values closer to 0 indicate a weaker or no linear relationship.
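Written out, the relationship described above is the following, where \(\sigma_x\) and \(\sigma_y\) are notation introduced here for the standard deviations of \(x\) and \(y\):

\[cor(x,y) = \frac{cov(x,y)}{\sigma_x \, \sigma_y}\]

Because the covariance can never exceed \(\sigma_x \sigma_y\) in absolute value, the ratio always lies between -1 and 1.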
3.2 What is a random variable? In your own words, provide a definition. What are its key components? 10 points
A random variable is a variable whose numerical value is determined by chance. For instance, the rank of a card drawn from a shuffled deck is a random variable: each card has an equal probability of being drawn, so the value we observe depends on chance. Its key components are the set of values it can take and the probability attached to each value.
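As an illustration of the card example, the sketch below treats the rank of a drawn card as a random variable taking the values 1 to 13, each with probability 1/13 (coding ace as 1 and king as 13 is an assumption made here). Simulated draws show the value varying by chance:

```r
# Possible values of the random variable: card ranks 1 (ace) to 13 (king).
possible_values <- 1:13

# Each rank has 4 cards out of 52, so every rank has probability 1/13.
probabilities <- rep(4 / 52, 13)

# Simulate 1000 independent draws (with the card returned each time).
set.seed(2024)
draws <- sample(possible_values, size = 1000, replace = TRUE, prob = probabilities)

# Observed share of each rank; each should be roughly 1/13.
table(draws) / length(draws)
```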
3.3 Explain succinctly what the regression is doing to estimate a relationship between two variables. Add a specific political science example. 10 points
Regression fits a line through the data that describes how the average value of an outcome variable changes as an explanatory variable changes. This lets us estimate the direction and size of the relationship and predict the outcome for given values of the explanatory variable. For instance, in political science, regressing educational attainment on GDP per capita estimates how much higher educational attainment tends to be in wealthier places, which can inform policies aimed at improving educational access in regions with low GDP per capita.