Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have a week to figure it out, and this is a key part of the learning process.
If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within R Studio, you can open R Studio first and then use the File menu to open pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.
It will be easiest for you to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.
The text, output and graphics in this section are provided as an example whenever you create a new R Markdown (.rmd) file in R Studio. It’s a good quick introduction, so I replicate it here with minor modifications. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file that it produces.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce an HTML output, which is what you need to upload for this class).
When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. In general, you will not include echo = FALSE, because we will want to see your code.
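As a rough illustration of what sits behind a hidden plot, the sketch below assumes the plot is drawn from R's built-in `pressure` dataset, as in the default R Studio template. The actual plotting code is not shown in this document, so both the dataset and the chunk header in the comment are assumptions.

```r
# Assumed chunk header in the .rmd file (it does not appear in the HTML):
#   {r, echo = FALSE}
# With echo = FALSE, only the plot is shown in the knitted output,
# not the code below.
# `pressure` is a built-in R dataset, used here only as an assumption.
plot(pressure)
```

Removing `echo = FALSE` from the chunk header would make the knitted HTML show the `plot()` call as well as the plot.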
Please submit your problem set via Canvas. Submit both the .rmd file, and the HTML file it creates. This assignment is due by 11:55am next Wednesday (October 15). Late problem sets will receive a penalty for each day of delay. Please list any students you collaborated with.
Please disclose whether you employed ChatGPT to complete this assignment, and how you used it in the space at the end of the problem set.
Here is an example problem, with an example solution.
Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.
Showing your code and the result, execute the code getwd(). Describe what this command does. You may want to execute the command directly in the console first (ask your TA if you don’t know how to run a command; this is essential) to see what it does, but be sure to write it into your .rmd file so that it runs when you click knit.
Solution: Question 0
setwd("/home/jovyan")
getwd()
## [1] "/home/jovyan"
This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user what directory is set as the working directory. This is the directory where output will be saved, or where R will look first when searching for a file, for example a dataset. You need to always set your working directory first so R knows where to pull the data from.
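A minimal sketch of how you might check what R can see in the working directory before reading a file. The file name `my_data.csv` is hypothetical and used only for illustration.

```r
# Store and print the current working directory.
wd <- getwd()
print(wd)

# List the first few files R can see in that directory.
head(list.files(wd))

# Check whether a (hypothetical) file is present before trying to read it.
# This returns TRUE or FALSE and never raises an error.
file.exists(file.path(wd, "my_data.csv"))
```

If `file.exists()` returns FALSE, either move the file into the working directory or change the working directory with `setwd()`.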
Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.
Okay, your turn to answer the remaining questions!
Part 1. Theory. Short answer questions (1 point each; 5 points total)
Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes. You may NOT use generative AI to answer these questions.
Q1. A researcher observes that countries with democratic systems have weaker climate policies, compared to countries with non-democratic systems. She decides to publish a research article that says having this form of government causes countries to become less climate friendly. Would you like to be a co-author on this paper? Why or why not? (100 words max)
ANSWER: I would not like to be a co-author on this paper. One of the core principles of political science research, and research as a whole, is to not assume causation from correlation. While the researcher’s data might suggest a correlation for democratic countries to have weaker climate policies, this may be due to or otherwise affected by various other factors besides type of government, and these must be acknowledged in the paper, rather than rushing to conclusions of causation.
Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)
ANSWER: Exogeneity means that the independent variable is not associated with factors captured in the error term. In an experiment, randomization (random assignment) is used to create the treatment and control groups. Because assignment is random, it is unrelated to the factors in the error term, which makes the treatment exogenous.
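The idea can be illustrated with a small simulation. This is a sketch on made-up data, not part of the assignment: `motivation` stands in for an unobserved factor in the error term, and the self-selection rule is an assumption chosen for illustration.

```r
set.seed(42)
n <- 10000

# An unobserved factor that would end up in the error term.
motivation <- rnorm(n)

# Random assignment: a coin flip that ignores motivation.
treated_random <- rbinom(n, 1, 0.5)

# Self-selection (assumed rule): more motivated people opt in more often.
treated_self <- rbinom(n, 1, plogis(2 * motivation))

# Correlation between treatment and the unobserved factor.
cor_random <- cor(treated_random, motivation)  # close to zero
cor_self <- cor(treated_self, motivation)      # clearly positive

print(c(random = cor_random, self_selected = cor_self))
```

Under random assignment the treatment is essentially uncorrelated with the unobserved factor, which is what exogeneity requires. Under self-selection it is not.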
Q3. What do we refer to when we talk about the “internal validity” and “external validity” of experiments? (100 words max)
ANSWER: Internal validity is present in an experiment when the experiment is well designed and free from bias or confounders. External validity occurs when an experiment’s findings can be generalized to other populations, situations, or cases. In other words, the findings of an experiment with external validity could be applied in a context different from the one in which it was originally run. For example, if researchers ran an experiment on the effect of a new depression pharmaceutical, but excluded test subjects who drink excessively or take other medication, then the findings of the study will likely not be applicable to a larger population.
Q4. You decide to run an experiment to see whether working in groups helps students learn. You randomly assign half of the class to form study groups to work on their problem sets, and to visit office hours every week. For the other half of the class, you assign them to work individually. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who worked individually. (120 words max)
4a. What could you call each group?
ANSWER: The students who work individually would be the control group, and the students who work together would be the treatment group, since the effect of group work is what the experiment is testing, and group work is the “treatment.”
4b. What is your independent variable and what is your dependent variable?
ANSWER: The independent variable is whether students work together or alone, and the dependent variable is the test scores.
4c. Given this set up, list some factors you are controlling for.
ANSWER: Some controlled factors are the class itself and the professor, which are the same for both groups. The test itself is also the same, and the class assignments are too.
4d. Can you say that working in groups caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.
ANSWER: No. While students who worked in groups did score higher, this cannot be asserted as causation. A correlation is all that could be asserted here. There are other factors that could affect the test scores of students in both groups. Those working in groups could also be individually motivated to work harder, or perhaps those chosen happened to have some prior knowledge of the material before joining the class. There could be confounding factors, that is, factors related to both the independent and dependent variable that affect test scores. Endogeneity could account for the different test scores; correlation cannot be confused with causation here because there are many factors that could be included in the error term.
4e. Can you say that this finding would also apply in other types of courses (for example, writing-intensive classes)? Why or why not? Explain using the technical terms in the textbook.
ANSWER: It would be difficult to apply this finding to other types of courses; the external validity of the experiment is not guaranteed. Writing-intensive classes may have very different subject material than the class studied in the questions above, and this may utilize different skills and different parts of the brain, which may elicit a different response to group or individual work.
Q5. Imagine you are looking at whether people with higher levels of education have higher incomes. List some of the factors that could lead to endogeneity. (50 words max)
ANSWER: Networking connections, higher generational wealth, and greater professional experience are factors that can lead to endogeneity. These factors are confounders, because they could be associated with both variables; they’re common in highly educated individuals, but exist beyond education itself, and are correlated with higher incomes.
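A small simulation can make the endogeneity problem concrete. The sketch below uses entirely made-up data; the variable `wealth` and all coefficients are assumptions chosen for illustration, with the true effect of education on income set to 1.

```r
set.seed(7)
n <- 5000

# A confounder that raises both education and income (assumed).
wealth <- rnorm(n)
education <- 12 + 2 * wealth + rnorm(n)

# True effect of education on income is 1 by construction.
income <- 20 + 1 * education + 5 * wealth + rnorm(n)

# Leaving wealth in the error term biases the education coefficient.
naive <- coef(lm(income ~ education))[["education"]]

# Controlling for wealth recovers a value close to the true effect.
controlled <- coef(lm(income ~ education + wealth))[["education"]]

print(c(naive = naive, controlled = controlled))
```

The naive estimate is well above 1 because education picks up part of the effect of wealth, which is the pattern endogeneity produces.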
Part 2. Data Analysis (1/2 point each; 5 points total)
The improvement in human rights in the second half of the 20th century is one of the most relevant global development trends. What is the relationship between economic development and human rights?
We will use a dataset from “Our World in Data (OWID)” (Saloni Dattani and Fiona Spooner and Hannah Ritchie and Max Roser, 2023) to explore this question.
Download the dataset, humanrights.RData, which you’ll find on the course website and also on the online RStudio platform. You may want to put it in your working directory to make it easy to find (use getwd() to see what your current working directory is; you can use the Session menu in RStudio or the setwd() command to change your working directory).
Here is a brief description of the variables:
country: Country name
gdppc: GDP per capita
humanrights: Human Rights Index
As will often be the case when using R, you will need to use the $ operator to access these variables within the object. Specifically, once you have loaded humanrights.RData, the result will be available in the object owid. To get at the variable country, for example, you would use owid$country. Remember, the end of each chapter in the textbook includes R code that can be helpful. We also posted R resources on Canvas.
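A minimal sketch of the `$` operator on a small made-up data frame. The object `toy` and its values are hypothetical; only the column names mirror those used in this problem set.

```r
# A small made-up data frame; the values are illustrative only.
toy <- data.frame(
  country = c("A", "B", "C"),
  gdppc = c(1000, 5000, 20000),
  humanrights = c(0.2, 0.5, 0.9),
  stringsAsFactors = FALSE
)

# List the variables available in the object.
names(toy)

# `$` extracts one column as a vector.
toy$country

# `$` is equivalent to double-bracket extraction by name.
identical(toy$gdppc, toy[["gdppc"]])
```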
Q1. Load the data into R. The data are stored as an RData file, so you can use the load() function to load it.
load("humanrights.RData")
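To see what `load()` does, here is a self-contained sketch that saves a made-up object to a temporary file and loads it back. The object `toy` and the file name `toy.RData` are hypothetical.

```r
# Create a small object and save it to a temporary .RData file.
toy <- data.frame(x = 1:3)
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)

# Remove the object so we can see load() bring it back.
rm(toy)

# load() restores objects under their original names and returns
# those names invisibly, so capturing the result tells you what arrived.
loaded <- load(path)
print(loaded)

# The object is available again under its original name.
exists("toy")
```

Capturing the return value of `load()` is a handy way to find out which objects a file contains.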
Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? How many variables are there?
nrow(owid)
## [1] 160
ncol(owid)
## [1] 3
ANSWER: There are 160 observations, or rows, and there are 3 columns, or variables.
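As a side note, `dim()` reports rows and columns in one call, and `str()` also shows each variable's type. The sketch below uses a made-up data frame named `toy`, not the course dataset.

```r
# A small made-up data frame for illustration.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdppc = c(1000, 5000, 20000, 40000),
  humanrights = c(0.2, 0.5, 0.9, 0.8),
  stringsAsFactors = FALSE
)

# Number of rows first, then number of columns.
dim(toy)

# Compact overview: dimensions, variable names, types, first values.
str(toy)
```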
Q3. How many countries are covered in this data set?
nrow(owid)
## [1] 160
ANSWER: 160 countries are covered in this data set.
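Note that nrow() counts rows, which equals the number of countries only if each country appears exactly once. A more direct check counts the distinct values of the country variable. The sketch below uses a made-up data frame with a hypothetical year column to show how the two counts can differ.

```r
# Made-up data in which country "A" appears twice.
toy <- data.frame(
  country = c("A", "A", "B", "C"),
  year = c(2000, 2010, 2010, 2010),
  stringsAsFactors = FALSE
)

# Number of rows (observations).
n_rows <- nrow(toy)

# Number of distinct countries.
n_countries <- length(unique(toy$country))

print(c(rows = n_rows, countries = n_countries))
```

If the two numbers agree for your data, each row corresponds to a different country.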
Q4. Calculate the average of gdppc, which is average income per capita for each country, across all points in the sample. Do you think this average is large or small? What are the minimum and the maximum gdppc? Make sure to explain what each of these values is communicating.
mean(owid$gdppc)
## [1] 19112.69
min(owid$gdppc)
## [1] 596.393
max(owid$gdppc)
## [1] 88366.22
ANSWER: 19112.69 is the average GDP per capita, which indicates the average earnings per person in a given country. The minimum GDP per capita among the countries in the data is 596.393, whereas the maximum is 88,366.22; this is a very large range. Taking the minimum and maximum into account, the mean seems low compared with the much higher maximum of approximately 88,366. Since the mean sits well below the midpoint of the range, a large share of the data must be concentrated on the lower side, which pulls the average down.
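One way to check whether values are concentrated at the low end is to compare the mean with the median. The sketch below uses a made-up vector called `income`, not the course data.

```r
# Made-up incomes: mostly low values plus one very high value.
income <- c(600, 900, 1500, 3000, 5000, 8000, 88000)

# The mean is pulled upward by the single large value.
avg <- mean(income)

# The median is the middle value and is not affected by the extreme.
med <- median(income)

print(c(mean = avg, median = med))

# summary() reports the minimum, quartiles, mean, and maximum together.
summary(income)
```

A mean well above the median suggests a right-skewed distribution, with most observations at the lower end and a few very large ones.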
Q5. Calculate the average of the human rights index across all points in the sample. What does this tell you about the prevalence of human rights protections? Make sure to explain what this value is communicating.
mean(owid$humanrights)
## [1] 0.6737188
ANSWER: The average human rights index across the countries represented in the sample is 0.6737188, or approximately 0.67. The human rights index is on a scale that ranges from 0 to 1, with 1 being the most advanced in terms of human rights and 0 indicating horrendous human rights policies and conditions. As such, the average score in this dataset indicates a decent level of human rights protection, although there is still much room to improve. An average of 0.67 indicates that human rights conditions are, overall, more good than bad, but the high scores of some countries, like Belgium, with an approximate score of 0.9, are “dragged down” by the poorer scores of countries like Afghanistan, with an approximate score of 0.06. This wide spread results in an average that is not as high as the highest scores, but also certainly not as low as the lowest.
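To find which observations sit at the extremes, `which.min()` and `which.max()` return the position of the smallest and largest value. The sketch below uses a made-up data frame with hypothetical country labels and scores.

```r
# Made-up scores; the country labels are placeholders.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  humanrights = c(0.90, 0.06, 0.70, 0.75),
  stringsAsFactors = FALSE
)

# Average score across all rows.
avg <- mean(toy$humanrights)

# Look up the country at the position of the lowest and highest score.
lowest <- toy$country[which.min(toy$humanrights)]
highest <- toy$country[which.max(toy$humanrights)]

# Smallest and largest score in one call.
range(toy$humanrights)

print(c(lowest = lowest, highest = highest))
```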
Q6. Produce a simple scatterplot with average income (GDP per capita) on the horizontal axis and the human rights protections (Human Rights Index) on the vertical axis.
# Scatterplot: GDP per capita (x-axis) against the Human Rights Index (y-axis)
plot(owid$gdppc, owid$humanrights,
     xlab = "GDP per capita",
     ylab = "Human Rights Index",
     main = "Human Rights Index vs. GDP per capita")
Q7. Make the plot again, but this time add a trend line (also known as a line of best fit or a regression line) using the abline() command.
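plot(owid$gdppc, owid$humanrights,
     xlab = "GDP per capita",
     ylab = "Human Rights Index",
     main = "Human Rights Index vs. GDP per capita")
# Fit a linear regression of the human rights index on GDP per capita
model <- lm(humanrights ~ gdppc, data = owid)
# Overlay the fitted regression line on the scatterplot
abline(model, col = "red")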
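The direction of the trend line can also be read from the fitted model itself. The sketch below uses made-up vectors `x` and `y`, not the course data, and shows how to pull the intercept and slope out of an `lm()` fit.

```r
# Made-up data with an upward trend.
x <- c(1, 2, 3, 4, 5)
y <- c(0.2, 0.3, 0.5, 0.6, 0.8)

# Fit a simple linear regression of y on x.
fit <- lm(y ~ x)

# coef() returns the intercept and the slope of the fitted line.
coef(fit)

# A positive slope means the line rises from left to right.
slope <- coef(fit)[["x"]]
slope > 0

# abline() draws exactly this line on top of an existing plot.
plot(x, y)
abline(fit, col = "red")
```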
Q8. What does this line tell you about development and human rights?
ANSWER: When GDP per capita, and thus development, increases, the human rights index typically increases too. This signifies a positive correlation between GDP per capita and human rights.
Q9. What could you call this relationship? Why?
ANSWER: This would be a positive correlation. As the independent variable increases, so does the dependent variable. It is not a weak correlation, but it is not a particularly strong one either, since the positive sloping line is not extremely steep.
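One caveat when judging strength from the plot: the steepness of a fitted line depends on the units of the variables, whereas the strength of a correlation is usually summarized by the correlation coefficient, which cor() computes. The sketch below uses simulated data (not the owid data), and all numbers in it are assumptions for illustration.

```r
set.seed(1)

# Simulated data with a positive relationship.
x <- runif(100, 0, 100)
y <- 0.005 * x + rnorm(100, sd = 0.1)

# Correlation coefficient: between -1 and 1, unit-free.
r <- cor(x, y)

# Slope of the regression line in the original units of x.
slope_raw <- coef(lm(y ~ x))[["x"]]

# Re-express x in thousands: the slope changes by a factor of 1000 ...
x_thousands <- x / 1000
slope_rescaled <- coef(lm(y ~ x_thousands))[["x_thousands"]]

# ... but the correlation stays exactly the same.
r_rescaled <- cor(x_thousands, y)

print(c(r = r, r_rescaled = r_rescaled))
print(c(slope_raw = slope_raw, slope_rescaled = slope_rescaled))
```

For the problem set data, `cor(owid$gdppc, owid$humanrights)` would give the corresponding measure of strength.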
Q10. Did you collaborate with anyone on this problem set? If so, list them here.
ANSWER: I collaborated briefly with Alyssa Buchanan.
Q11. Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!
ANSWER: I did not use any generative AI tools.