Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have a week to figure it out, and this is a key part of the learning process.
If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within R Studio, you can open R Studio first and then use the File menu to open up pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.
It will be easiest to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.
The text, output and graphics in this section are provided as an example whenever you create a new R markdown (.rmd) file in R Studio. It’s a good quick introduction so I replicate it here with minor modification. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file that gets outputted.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce an HTML output, which is what you need to upload for this class).
When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the plot. In
general, you will not include the echo = FALSE, because we
will want to see your code.
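As a rough sketch of what the body of such a plot chunk might contain, the line below draws a scatterplot from R's built-in pressure dataset. The choice of dataset is an assumption made for illustration. In the .rmd file, this code would sit inside a chunk whose header includes echo = FALSE, so that only the plot appears in the knitted HTML.

```r
# Assumption: the built-in `pressure` dataset stands in for whatever data you plot.
# Calling plot() on a two-column data frame puts the first column on the x-axis
# and the second column on the y-axis.
plot(pressure)
```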
Please submit your problem set via Canvas. Submit both the .rmd file, and the HTML file it creates. This assignment is due by 11:55am next Wednesday (October 15). Late problem sets will receive a penalty for each day of delay. Please list any students you collaborated with.
Please disclose whether you employed ChatGPT to complete this assignment, and how you used it in the space at the end of the problem set.
Here is an example problem, with an example solution.
Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.
Showing your code and the result, execute the code
getwd(). Describe what this command does. You may want to
execute the command directly in the console first (ask your TA if you
don't know how to run a command–this is essential) to see what it does,
but be sure to write it into your .rmd file so that it runs when you
click knit.
Solution: Question 0
getwd()
## [1] "/home/jovyan"
This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user which directory is set as the working directory. This is the directory where output will be saved, or where R will look first when searching for a file, for example a dataset. You need to always set your working directory first so R knows where to pull the data from.
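To make the idea of a working directory concrete, here is a minimal sketch. It temporarily switches to a scratch folder and checks whether a file can be found there by its bare name. The file name example.RData is hypothetical, and tempdir() is used only so the sketch runs on any machine.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Switch to a scratch folder; tempdir() is used only so this sketch runs anywhere.
setwd(tempdir())

# A bare file name is now looked up inside the scratch folder.
# "example.RData" is a hypothetical file name.
found <- file.exists("example.RData")
print(found)

# Restore the original working directory.
setwd(old_dir)
```

If file.exists() returns FALSE for a file you expected to find, the working directory is usually the first thing to check.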
Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.
Okay, your turn to answer the remaining questions!
Part 1. Theory. Short answer questions (1 point each; 5 points total)
Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes. You may NOT use generative AI to answer these questions.
Q1. A researcher observes that countries with democratic systems have weaker climate policies, compared to countries with non-democratic systems. She decides to publish a research article that says having this form of government causes countries to become less climate friendly. Would you like to be a co-author on this paper? Why or why not? (100 words max)
ANSWER: I would not want to be a co-author on this paper because the researcher has only made an observation. No experiment was conducted to test her theory. While there seems to be a correlation present, she has failed to prove causation. She would need to use a randomized experiment to establish exogeneity. There could be an endogenous reason for this correlation; for example, lobbying and interest groups in a democratic system can sway policy through campaign financing.
Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)
ANSWER: Only through randomization in experiments can we truly discover causation. An independent variable is exogenous if changes in it are unrelated to factors in the error term. To be exogenous is to be unmuddied by associations between X and other factors that affect Y. If the treatment in an experiment is randomized, then it is far more likely to be exogenous.
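A small simulation can illustrate this point. The sketch below is not from the textbook; all names and numbers (the true effect of 2, the hidden factor u, the sample size) are assumptions chosen for illustration. It compares a setting where treatment is related to a hidden factor in the error term with one where treatment is assigned by coin flip.

```r
set.seed(1)
n <- 10000
true_effect <- 2

# u is an unobserved factor that also affects the outcome (it lives in the error term).
u <- rnorm(n)

# Observational world: units with high u are more likely to take the treatment,
# so treatment is correlated with the error term (endogenous).
x_obs <- as.numeric(u + rnorm(n) > 0)
y_obs <- true_effect * x_obs + 3 * u + rnorm(n)

# Experimental world: a coin flip decides treatment, so it is unrelated to u (exogenous).
x_rand <- rbinom(n, size = 1, prob = 0.5)
y_rand <- true_effect * x_rand + 3 * u + rnorm(n)

# Compare simple differences in means with the true effect of 2.
est_obs  <- mean(y_obs[x_obs == 1])  - mean(y_obs[x_obs == 0])
est_rand <- mean(y_rand[x_rand == 1]) - mean(y_rand[x_rand == 0])
print(c(observational = est_obs, randomized = est_rand))
```

Under these assumptions, the randomized estimate should land near the true effect, while the observational estimate is pulled upward because treated units also tend to have higher u.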
Q3. What do we refer to when we talk about the “internal validity” and “external validity” of experiments? (100 words max)
ANSWER: When we refer to internal validity, we want to ensure that the experiment is well designed and that there are no biases or confounding factors. It asks whether the results can be attributed to a causal relationship. If so, then we can say that the effects we have found are internally valid. In comparison, external validity concerns whether the findings apply to the general population, rather than only to the select few that were tested.
Q4. You decide to run an experiment to see whether working in groups helps students learn. You randomly assign half of the class to form study groups to work on their problem sets, and to visit office hours every week. For the other half of the class, you assign them to work individually. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who worked individually. (120 words max)
4a. What could you call each group?
ANSWER: The first group is the treatment group. The second is the control group.
4b. What is your independent variable and what is your dependent variable?
ANSWER: The independent variable is whether a student is assigned to a study group with weekly office hours. The dependent variable is the test score (amount learned).
4c. Given this set up, list some factors you are controlling for.
ANSWER: Some factors I am controlling for are the random assignment of students to the treatment vs. the control group (being in the study group + office hours or not), being in the same class, and receiving the same final test.
4d. Can you say that working in groups caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.
ANSWER: You cannot say that working in groups caused the students to do better because there is an internal validity failure. The control group is not also visiting office hours every week; therefore, the results could have nothing to do with study groups and more to do with attending office hours. This is a confounding variable that causes endogeneity. There are also other possible confounding factors, like the amount of sleep each student got before the exam.
4e. Can you say that this finding would also apply in other types of courses (for example, writing-intensive classes)? Why or why not? Explain using the technical terms in the textbook.
ANSWER: Working in groups is not always going to lead to higher test scores, because each subject is different and requires different things from its students (reading heavy, homework heavy, writing heavy, practice problems). In a writing-intensive class, group work could be counterproductive, as writing is a personal task that evaluates each student's knowledge of the material or ability to apply their own ideas to a topic. A possible confounder in this study could also be that one student participates more than others, and everyone else copies that one person’s ideas. Therefore, working in groups is not an externally valid strategy for improving students' work in any class.
Q5. Imagine you are looking at whether people with higher levels of education have higher incomes. List some of the factors that could lead to endogeneity. (50 words max)
ANSWER: Some possible factors could be that those who receive higher education are typically from middle or upper class families that have money and/or connections, different work ethics/professional goals, health issues, or the current social climate of the country.
Part 2. Data Analysis (1/2 point each; 5 points total)
The improvement in human rights in the second half of the 20th century is one of the most relevant global development trends. What is the relationship between economic development and human rights?
We will use a dataset from “Our World in Data (OWID)” (Saloni Dattani and Fiona Spooner and Hannah Ritchie and Max Roser, 2023) to explore this question.
Download the dataset, humanrights.RData, which you’ll
find on the course website and also on the online RStudio platform. You
may want to put it in your working directory to make it easy to find
(use getwd() to see what your current working directory is;
you can use the Session menu in RStudio or the setwd()
command to change your working directory.)
Here is a brief description of the variables:
country: name of the country
humanrights: Human Rights Index
gdppc: GDP per capita
As will often be the case when using R, you will need to use the
$ operator to access these variables within the object.
Specifically, once you have loaded humanrights.RData, the result will be
available in the data frame owid. To get at the variable
country, for example, you would use
owid$country. Remember, the end of each chapter in the
textbook includes R code that can be helpful. We also posted R resources
on Canvas.
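As a small sketch of the $ operator, the example below builds a made-up data frame and pulls out individual columns. The object toy_owid and its values are hypothetical; only the column names mirror the ones in owid.

```r
# Hypothetical miniature data frame; the values are invented for illustration.
toy_owid <- data.frame(
  country = c("A", "B", "C"),
  gdppc   = c(1000, 5000, 20000),
  stringsAsFactors = FALSE
)

# $ extracts one column from the data frame as a plain vector.
toy_owid$country

# The extracted vector can be passed straight to other functions.
toy_mean <- mean(toy_owid$gdppc)
print(toy_mean)
```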
Q1. Load the data into R. The data are stored as an Rdata file, so
you can use the load() function to load it.
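Here is a minimal sketch of how load() behaves, using a temporary file so it runs without the course data. The object name toy_data and the file name toy_example.RData are hypothetical. One useful detail is that load() returns the names of the objects it restored, which helps when you are not sure what an .RData file contains.

```r
# Hypothetical stand-in for the course data, saved to a temporary .RData file.
toy_data <- data.frame(country = c("A", "B"), gdppc = c(1000, 20000))
path <- file.path(tempdir(), "toy_example.RData")
save(toy_data, file = path)
rm(toy_data)

# load() restores objects under their saved names and returns those names.
loaded_names <- load(path)
print(loaded_names)

# The restored object is available again under its original name.
str(toy_data)
```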
setwd("/home/jovyan")
load("humanrights.RData")
summary(owid)
##    country           humanrights         gdppc
##  Length:160         Min.   :0.0240   Min.   :  596.4
##  Class :character   1st Qu.:0.4545   1st Qu.: 4228.7
##  Mode  :character   Median :0.7885   Median :13025.0
##                     Mean   :0.6737   Mean   :19112.7
##                     3rd Qu.:0.8965   3rd Qu.:29107.0
##                     Max.   :0.9640   Max.   :88366.2
nrow(owid)
## [1] 160
ncol(owid)
## [1] 3
summary(owid$country)
##    Length     Class      Mode
##       160 character character
mean(owid$gdppc)
## [1] 19112.69
min(owid$gdppc)
## [1] 596.393
max(owid$gdppc)
## [1] 88366.22
mean(owid$humanrights)
## [1] 0.6737188
model <- lm(owid$humanrights ~ owid$gdppc,
data = owid)
plot(owid$gdppc, owid$humanrights,
xlab = "GDP Per Capita",
ylab = "Human Rights Index",
main = "Human Rights Index based on GDP per capita")
abline(model, col="red")
summary(model)
##
## Call:
## lm(formula = owid$humanrights ~ owid$gdppc, data = owid)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6792 -0.1834  0.1095  0.1908  0.3067
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.807e-01  2.875e-02  20.197  < 2e-16 ***
## owid$gdppc  4.868e-06  1.081e-06   4.503  1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2529 on 158 degrees of freedom
## Multiple R-squared: 0.1137, Adjusted R-squared: 0.1081
## F-statistic: 20.27 on 1 and 158 DF, p-value: 1.298e-05
Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? How many variables are there?
ANSWER: There are 160 observations (nrow) and 3 variables (ncol).
Q3. How many countries are covered in this data set?
ANSWER: 160 countries are covered in this data set.
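The number of rows equals the number of countries only if each country appears once. One way to check this, sketched here on a made-up data frame (the object panel and its values are hypothetical), is to count distinct values with unique().

```r
# Hypothetical data in which one country appears twice (e.g. two different years).
panel <- data.frame(
  country = c("A", "A", "B", "C"),
  gdppc   = c(1000, 1100, 5000, 20000)
)

n_rows      <- nrow(panel)                    # number of observations
n_countries <- length(unique(panel$country))  # number of distinct countries
any_repeats <- anyDuplicated(panel$country) > 0
print(c(rows = n_rows, countries = n_countries))
```

Applied to the course data, the analogous call would be length(unique(owid$country)); if it matches nrow(owid), every row is a different country.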
Q4. Calculate the average of gdppc, which is average
income per capita for each country, across all points in the sample. Do
you think this average is large or small? What is the minimum and the
maximum gdppc? Make sure to explain what each of these
values is communicating.
ANSWER: The average income per capita across countries is 19112.69. This seems pretty small to me, especially when you take into account the number of billionaires and millionaires in the world, not to mention the middle and upper classes. The minimum income is 596.393, which is extremely low and can only mean extreme poverty and near starvation, and the maximum income is 88366.22, which is a huge disparity.
Q5. Calculate the average of the human rights index across all points in the sample. What does this tell you about the prevalence of human rights protections? Make sure to explain what this value is communicating.
ANSWER: The average of the human rights index is 0.6737188. This tells me that these countries are suffering and that there are human rights violations in dozens and dozens of countries across the world. This tells me we have to do a better job at wealth distribution and providing livable wages. Providing jobs and affordable housing, while funding more programs for those with the lowest incomes should be our current top priority globally and domestically.
Q6. Produce a simple scatterplot with average income (GDP per capita) on the horizontal axis and the human rights protections (Human Rights Index) on the vertical axis.
Q7. Make the plot again, but this time add a trend line (also known
as a line of best fit or a regression line) using the
abline() command.
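As a sketch of this pattern, the example below fits a line with lm() and passes the fitted model to abline(). It uses R's built-in cars dataset as a stand-in, which is an assumption made so the code runs without the course data. Writing the formula with bare column names plus the data argument keeps the coefficient labels short.

```r
# Assumption: the built-in `cars` dataset (speed, dist) stands in for the course data.
fit <- lm(dist ~ speed, data = cars)

plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance",
     main = "Scatterplot with a fitted line")

# abline() reads the intercept and slope from the fitted model and draws the line.
abline(fit, col = "red")

coef(fit)
```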
Q8. What does this line tell you about development and human rights?
ANSWER: This line shows me that the higher the GDP per capita a country has, the higher its human rights index score is. This suggests that the more people earn, the more money circulates in that country’s economy, resulting in more funds available for investment. This, in turn, enables the government to better support its citizens, which increases their human rights index score.
Q9. What could you call this relationship? Why?
ANSWER: This relationship is a positive correlation, because as one score increases, so does the other, as you can see looking at the red line.
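The strength of a correlation can also be put into a number. In a regression with a single predictor, R-squared equals the squared correlation coefficient, so the Multiple R-squared of 0.1137 reported in the regression output implies a correlation of roughly its square root. The sketch below does that arithmetic and then checks the same identity on R's built-in cars dataset, which is used only as an illustrative stand-in.

```r
# In a regression with one predictor, R-squared equals the squared correlation.
# The value 0.1137 is the Multiple R-squared from the regression output.
r_squared <- 0.1137
r_from_output <- sqrt(r_squared)  # positive root, because the slope is positive
print(round(r_from_output, 2))

# The same identity checked on the built-in `cars` data (an illustrative stand-in).
r_cars  <- cor(cars$speed, cars$dist)
r2_cars <- summary(lm(dist ~ speed, data = cars))$r.squared
print(all.equal(r_cars^2, r2_cars))
```

With the course data loaded, cor(owid$gdppc, owid$humanrights) would give the correlation directly.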
Q10. Did you collaborate with anyone on this problem set? If so, list them here.
ANSWER: No
Q11. Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!
ANSWER: No