Introduction

Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have a week to figure them out, and this is a key part of the learning process.

If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within R Studio, you can open R Studio first and then use the File menu to open pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.

It will be easiest to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.

Before you start: The R Markdown Introduction

The text, output and graphics in this section are provided as an example whenever you create a new R Markdown (.rmd) file in R Studio. It’s a good quick introduction, so I replicate it here with minor modifications. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file it produces.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce an HTML output, which is what you need to upload for this class.)

When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. In general, you will not include the echo = FALSE, because we will want to see your code.
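For reference, the plot in the default R Markdown template is drawn from R's built-in pressure dataset. The sketch below shows only the chunk body; in the .rmd file it would sit inside a chunk whose header sets echo = FALSE. The chunk label mentioned in the comment is an assumption for illustration.

```r
# In the .rmd file, this line would live inside a chunk opened with a header
# such as {r pressure, echo=FALSE}; the label "pressure" is only an example.
plot(pressure)  # built-in dataset: vapor pressure of mercury vs. temperature
```

With echo = FALSE the knitted HTML shows the figure but not the plot() call that produced it.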

Submission Instructions

Please submit your problem set via Canvas. Submit both the .rmd file and the HTML file it creates. This assignment is due by 11:55pm next Wednesday (October 15). Late problem sets will receive a penalty for each day of delay. Please list any students you collaborated with.

Please disclose whether you employed ChatGPT to complete this assignment, and how you used it in the space at the end of the problem set.

Example Problem

Here is an example problem, with an example solution.

Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.

Showing your code and the result, execute the code getwd(). Describe what this command does. You may want to execute the command directly in the console first (ask your TA if you don’t know how to run a command; this is essential) to see what it does, but be sure to write it into your .rmd file so that it runs when you click knit.

Solution (Question 0): getwd() returns “C:/Users/kimst/OneDrive/Documents/School/POLS 15/RStudio Directory”

This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user what directory is set as the working directory. This is the directory where output will be saved, or where R will look first when searching for a file, for example a dataset. You need to always set your working directory first so R knows where to pull the data from.


Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.

Okay, your turn to answer the remaining questions!

Part 1. Theory. Short answer questions (1 point each; 5 points total)

Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes. You may NOT use generative AI to answer these questions.

Q1. A researcher observes that countries with democratic systems have weaker climate policies, compared to countries with non-democratic systems. She decides to publish a research article that says having this form of government causes countries to become less climate friendly. Would you like to be a co-author on this paper? Why or why not? (100 words max)

ANSWER: I would not want to be a co-author on that paper because her argument makes a causal claim without direct evidence to support it. She has only observed a correlation. The relationship between democratic systems and weak climate policies could instead be driven by other factors that the researcher fails to acknowledge.

Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)

ANSWER: In an experiment, the researcher randomly assigns which group receives the treatment (the independent variable) and which group does not (the control group). Because assignment is random, the treatment is not correlated with any other factors that affect the dependent variable, which makes it exogenous.
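The idea that randomization breaks the link between treatment and other factors can be checked with a small simulation. This is a sketch under invented assumptions: the variable names (ability, treat_random, treat_selected) and all numbers are hypothetical and do not come from the textbook.

```r
set.seed(1)   # make the simulation reproducible
n <- 10000    # hypothetical number of subjects

# An unobserved trait that would also affect the outcome (hypothetical).
ability <- rnorm(n)

# Random assignment: a coin flip that ignores ability entirely.
treat_random <- rbinom(n, size = 1, prob = 0.5)

# Self-selection: higher-ability subjects are more likely to opt in.
treat_selected <- as.integer(ability + rnorm(n) > 0)

cor_random   <- cor(treat_random, ability)    # should be close to 0
cor_selected <- cor(treat_selected, ability)  # should be clearly positive

print(round(c(random = cor_random, selected = cor_selected), 3))
```

Under random assignment the correlation with the unobserved trait is near zero, while under self-selection it is not, which is the sense in which randomization makes the treatment exogenous.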

Q3. What do we refer to when we talk about the “internal validity” and “external validity” of experiments? (100 words max)

ANSWER: According to the text, “Internal validity refers to whether the inference is biased; external validity refers to whether an inference applies more generally” (36). In other words, internal validity refers to how trustworthy the experiment’s causal inference is (did the randomization eliminate bias well enough that we can trust that X caused Y?). External validity refers to how well the finding generalizes (if the experiment were repeated in other populations, would the results still hold true?).

Q4. You decide to run an experiment to see whether working in groups helps students learn. You randomly assign half of the class to form study groups to work on their problem sets, and to visit office hours every week. For the other half of the class, you assign them to work individually. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who worked individually. (120 words max)

4a. What could you call each group?

ANSWER: The students assigned to study groups are the treatment group, and the students assigned to work individually are the control group. I would define X (the independent variable) as the study arrangement, since we manipulate whether the students work alone or in groups. I would define Y (the dependent variable) as the students’ test results. This way, I manipulate X to see how much Y changes between studying alone and studying in groups.

4b. What is your independent variable and what is your dependent variable?

ANSWER: The independent variable is the study method (studying in groups or studying alone), and the dependent variable is the outcome of each group’s studying efforts, measured by the end-of-semester test.

4c. Given this set up, list some factors you are controlling for.

ANSWER: We will control for various factors, including how much office hours helped the students, the number of hours the students studied, GPA prior to the start of the experiment, and how many study resources each student had access to (like tutors).

4d. Can you say that working in groups caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.

ANSWER: Yes. Unlike the co-author scenario in Q1, this experiment supports its causal claim with evidence. The random assignment created exogenous variation in the independent variable. This randomization ensures that, on average, both groups are equal in all unmeasured characteristics.

4e. Can you say that this finding would also apply in other types of courses (for example, writing-intensive classes)? Why or why not? Explain using the technical terms in the textbook.

ANSWER: No. Although our experiment has solid internal validity, that alone gives it no claim to external validity. According to the text, “Even with internal validity…an experiment may not be externally valid, because the causal relationship between the treatment and outcome could differ in other contexts” (36). Additionally, the error term could differ across subjects (e.g., collaboration in writing courses may not be as effective for studying as it is in math classes).

Q5. Imagine you are looking at whether people with higher levels of education have higher incomes. List some of the factors that could lead to endogeneity. (50 words max)

ANSWER: Some factors could be innate intelligence, family resources, motivations and economic factors.
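To see why a factor like innate ability creates endogeneity, consider a simulated example in which ability raises both education and income. This is only a sketch: the data-generating numbers (a true education effect of 2, an ability effect of 3) are invented for illustration.

```r
set.seed(2)
n <- 10000

ability   <- rnorm(n)                                # unobserved factor (hypothetical)
education <- ability + rnorm(n)                      # ability raises education
income    <- 2 * education + 3 * ability + rnorm(n)  # true effect of education is 2

# Leaving ability in the error term makes education endogenous.
naive      <- unname(coef(lm(income ~ education))["education"])
# Including ability removes that source of bias in this simulated setup.
controlled <- unname(coef(lm(income ~ education + ability))["education"])

print(round(c(naive = naive, controlled = controlled), 2))
```

The naive estimate overstates the effect of education because it absorbs part of the effect of ability, while the estimate that includes ability lands near the true value of 2.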


Part 2. Data Analysis (1/2 point each; 5 points total)

The improvement in human rights in the second half of the 20th century is one of the most relevant global development trends. What is the relationship between economic development and human rights?

We will use a dataset from “Our World in Data (OWID)” (Saloni Dattani, Fiona Spooner, Hannah Ritchie, and Max Roser, 2023) to explore this question.

Download the dataset, humanrights.RData, which you’ll find on the course website and also on the online RStudio platform. You may want to put it in your working directory to make it easy to find (use getwd() to see what your current working directory is; you can use the Session menu in RStudio or the setwd() command to change your working directory).
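A minimal sketch of checking and changing the working directory is shown below. The temporary directory stands in for whatever folder actually holds humanrights.RData; that substitution is an assumption made so the example runs on any machine.

```r
old_wd <- getwd()   # remember where we started
print(old_wd)

# tempdir() is a placeholder for the folder containing humanrights.RData.
setwd(tempdir())
new_wd <- getwd()
print(new_wd)

setwd(old_wd)       # restore the original working directory
```

On your own machine you would pass the path of your course folder to setwd() instead of tempdir().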

Here is a brief description of the variables:

As will often be the case when using R, you will need to use the $ operator to access these variables within the object. Specifically, once you have loaded humanrights.RData, the data will be available in an object called owid. To get at the variable country, for example, you would use owid$country. Remember, the end of each chapter in the textbook includes R code that can be helpful. We also posted R resources on Canvas.
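The $ operator can be tried on a small made-up data frame before working with the real data. The column names below follow the ones used in this problem set, but the object name toy, the countries, and the numbers are invented.

```r
# A stand-in data frame; all values are hypothetical.
toy <- data.frame(
  country     = c("A", "B", "C"),
  gdppc       = c(1000, 20000, 45000),
  humanrights = c(0.3, 0.7, 0.9),
  stringsAsFactors = FALSE
)

countries <- toy$country   # pull one column out as a vector
print(countries)
print(toy$gdppc[2])        # second element of the gdppc column
```

The same pattern applies to the real object: owid$country returns the country column as a vector.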

Q1. Load the data into R. The data are stored as an Rdata file, so you can use the load() function to load it.
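As a rough illustration of how load() behaves, the sketch below saves a tiny invented data frame to a temporary .RData file and loads it back. The file name humanrights_toy.RData and its contents are assumptions; with the real file in your working directory, the call would simply be load("humanrights.RData").

```r
# Build and save a stand-in file so the example runs anywhere.
owid <- data.frame(country = c("A", "B"), gdppc = c(1000, 20000),
                   humanrights = c(0.3, 0.7), stringsAsFactors = FALSE)
rdata_path <- file.path(tempdir(), "humanrights_toy.RData")
save(owid, file = rdata_path)
rm(owid)                          # pretend we are starting a fresh session

loaded_names <- load(rdata_path)  # load() returns the names of restored objects
print(loaded_names)
print(exists("owid"))
```

Because load() restores objects under the names they were saved with, you do not assign its result to owid yourself.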

Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? How many variables are there?

ANSWER: The dimensions are 160x3. There are 160 observations and 3 variables.

Q3. How many countries are covered in this data set?

ANSWER: 160 countries are covered in the data set.
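The answers to Q2 and Q3 can be computed along the following lines. This sketch uses a four-row invented data frame named toy rather than the real owid object, so the printed numbers will differ from the answers above.

```r
toy <- data.frame(
  country     = c("A", "B", "C", "D"),
  gdppc       = c(1000, 5000, 20000, 45000),
  humanrights = c(0.3, 0.5, 0.7, 0.9),
  stringsAsFactors = FALSE
)

print(dim(toy))                             # rows first, then columns
n_obs       <- nrow(toy)                    # number of observations
n_vars      <- ncol(toy)                    # number of variables
n_countries <- length(unique(toy$country))  # number of distinct country names
print(c(obs = n_obs, vars = n_vars, countries = n_countries))
```

Counting unique country names rather than rows guards against a dataset in which a country appears more than once.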

Q4. Calculate the average of gdppc, which is average income per capita for each country, across all points in the sample. Do you think this average is large or small? What is the minimum and the maximum gdppc? Make sure to explain what each of these values are communicating.

ANSWER: Using mean(owid$gdppc), the average of gdppc is 19,112.69. This is small, as it means the average income per person across the countries in the sample is only around $19,000, which is nowhere near enough to support a decent lifestyle.
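Q4 also asks for the minimum and maximum, which can be obtained with min() and max(). The sketch below uses an invented vector; whether the real gdppc column contains missing values is not stated, so na.rm = TRUE is included only as a precaution.

```r
gdppc <- c(1000, 5000, NA, 20000, 45000)  # invented values, one missing entry

avg     <- mean(gdppc, na.rm = TRUE)  # average, ignoring the missing value
lowest  <- min(gdppc, na.rm = TRUE)   # smallest observed value
highest <- max(gdppc, na.rm = TRUE)   # largest observed value

print(c(mean = avg, min = lowest, max = highest))
print(range(gdppc, na.rm = TRUE))     # min and max in a single call
```

On the real data the equivalent calls would be min(owid$gdppc) and max(owid$gdppc).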

Q5. Calculate the average of the human rights index across all points in the sample. What does this tell you about the prevalence of human rights protections? Make sure to explain what this value is communicating.

ANSWER: Using mean(owid$humanrights), the average of the human rights index is 0.6737188. This tells us that the average country in the dataset has fairly strong human rights protections. The scale runs from 0 to 1, so an average of around 0.67 indicates that human rights protections in the countries in the dataset are somewhat strong.

Q6. Produce a simple scatterplot with average income (GDP per capita) on the horizontal axis and the human rights protections (Human Rights Index) on the vertical axis.

plot(owid$gdppc, owid$humanrights, main = "Human Rights Index vs GDPPC", xlab = "GDP per Capita", ylab = "Human Rights Index", pch = 20,
     col = "purple")
See attached Graph

Q7. Make the plot again, but this time add a trend line (also known as a line of best fit or a regression line) using the abline() command.

plot(owid$gdppc, owid$humanrights, main = "Human Rights Index vs GDPPC", xlab = "GDP per Capita", ylab = "Human Rights Index", pch = 20,
     col = "purple")
abline(lm(humanrights ~ gdppc, data = owid), col = "red")

See attached Graph

Q8. What does this line tell you about development and human rights?

ANSWER: It has a positive slope which reveals that a higher GDPPC is associated with stronger human rights policies.
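The sign of the slope can also be read directly from the fitted model rather than from the picture. The sketch below fits the same kind of regression on an invented four-row data frame named toy, so the number it prints says nothing about the real data.

```r
toy <- data.frame(
  gdppc       = c(1000, 5000, 20000, 45000),  # invented values
  humanrights = c(0.3, 0.5, 0.7, 0.9)
)

fit   <- lm(humanrights ~ gdppc, data = toy)  # same formula as the trend line
slope <- unname(coef(fit)["gdppc"])           # change in index per extra dollar
print(slope)
```

A positive value of slope corresponds to an upward-sloping trend line.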

Q9. What could you call this relationship? Why?

ANSWER: This would be called a positive correlation. As the text states, “Two variables are correlated (‘co-related’) if they move together. A positive correlation means that high values of one variable are associated with high values of the other; a negative correlation indicates that high values of one variable are associated with low values of the other.” (17)
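The direction of a correlation can be checked numerically with cor(). This sketch uses two invented vectors; flipping the sign of one of them shows what a negative correlation looks like.

```r
gdppc       <- c(1000, 5000, 20000, 45000)  # invented values
humanrights <- c(0.3, 0.5, 0.7, 0.9)

r_pos <- cor(gdppc, humanrights)    # variables move together
r_neg <- cor(gdppc, -humanrights)   # one variable reversed

print(round(c(positive = r_pos, negative = r_neg), 3))
```

The correlation coefficient always lies between -1 and 1, and its sign matches the direction of the trend line.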

Q10. Did you collaborate with anyone on this problem set? If so, list them here.

ANSWER: I worked alone on the problem set.

Q11. Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!

ANSWER: No, I did not.