Introduction

Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have a week to figure it out, and this is a key part of the learning process.

If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within R Studio, you can open R Studio first and then use the File menu to open up pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.

It will be easiest for you to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.

Before you start: The R Markdown Introduction

The text, output and graphics in this section are provided as an example whenever you create a new R Markdown (.rmd) file in R Studio. It’s a good quick introduction, so I replicate it here with minor modification. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file that it produces.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce an HTML output, which is what you need to upload for this class).

When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. In general, you will not include the echo = FALSE, because we will want to see your code.

Submission Instructions

Please submit your problem set via Canvas. Submit both the .rmd file and the HTML file it creates. This assignment is due by 11:59 PM on Wednesday, January 24, 2024. Late problem sets will receive a penalty for each day of delay. Please list any students you collaborated with.

Example Problem

Here is an example problem, with an example solution.

Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.

Showing your code and the result, execute the code getwd(). Describe what this command does. You may want to execute the command directly in the console first (ask your TA if you don’t know how to run a command–this is essential) to see what it does, but be sure to write it into your .rmd file so that it runs when you click knit.

Solution: Question 0

getwd()

## [1] "/home/jovyan/UCSB Pols 15/Problem Set 1"

This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user what directory is set as the working directory. This is the directory where output will be saved, and where R will look first when searching for a file, for example a dataset. Always set your working directory first so R knows where to pull the data from.
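A minimal sketch of how the working directory behaves, using only base R. The directory returned by tempdir() and the file name example.txt are stand-ins for illustration; in practice you would point setwd() at the folder that holds your problem set files.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# Switch to a scratch directory (tempdir() is only a stand-in here).
setwd(tempdir())

# A file written with a relative path lands in the working directory.
writeLines("hello", "example.txt")
found_in_wd <- file.exists(file.path(getwd(), "example.txt"))
print(found_in_wd)

# Clean up and return to the original working directory.
file.remove("example.txt")
setwd(old_wd)
```

Restoring the original directory at the end keeps later code, which may rely on relative paths, working as before.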

Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.

Okay, your turn to answer the remaining questions!

Part 1. Theory. Short answer questions (1 point each; 5 points total)

Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes.

Q1. A researcher observes that countries with parliamentary systems have higher levels of income. She decides to publish a research article that says having this form of government causes countries to become richer. Would you like to be a co-author on this paper? Why or why not? (100 words max)

No, I would not like to be a co-author on this paper. The researcher has observed a correlation, but too many other variables are unaccounted for to conclude that this form of government causes higher income. The relationship is not exogenous.

Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)

By randomizing who is in the control group and who is in the treated group, the experimenters create an experiment in which correlation can be read as causation. These results are much more valid statistically and scientifically than those from endogenous comparisons.
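One way to see this is with a small simulation. The sketch below is illustrative only: the variable names, the sample size, and the assumed true effect of 2 are all made up, and the seed is arbitrary.

```r
set.seed(15)  # arbitrary seed so the simulation is reproducible

n <- 10000
ability <- rnorm(n)  # unobserved trait that also affects the outcome

# Random assignment: half the units get treatment (1), half control (0).
treated <- sample(rep(c(0, 1), each = n / 2))

# Hypothetical outcome: a true treatment effect of 2, plus ability and noise.
outcome <- 2 * treated + ability + rnorm(n)

# Randomization leaves treatment (nearly) uncorrelated with ability.
balance <- cor(treated, ability)

# So the simple difference in means lands close to the assumed effect.
estimate <- mean(outcome[treated == 1]) - mean(outcome[treated == 0])

print(round(c(balance = balance, estimate = estimate), 2))
```

Because assignment does not depend on ability, the two groups differ systematically only in treatment, which is what exogeneity means here.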

Q3. What do we refer to when we talk about the “internal validity” and “external validity” of experiments? (100 words max)

Internal validity refers to the design of the experiment and whether or not it has confounders or biases, while external validity is the experiment’s ability to be applied outside of the study. They both work to show how trustworthy the experiment’s findings are.

Q4. You decide to run an experiment to see whether working in groups helps students learn. You randomly assign half of the class to form study groups to work on their problem sets. For the other half of the class, you assign them to work individually. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who worked individually. (120 words max)

4a. What could you call each group?

The students assigned to study groups would be the treated group, while the students who worked individually would be the control group.

4b. What is your independent variable and what is your dependent variable?

The independent variable is working in groups and the dependent variable is the score on the test.

4c. Given this set up, list some factors you are controlling for.

Instruction of the students, the problem sets they work on, the test they receive at the end of the semester, etc.

4d. Can you say that working in groups caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.

No, you cannot. Due to the endogeneity of the experiment, we do not know whether the correlation between working in groups and doing better amounts to causation. Students could be at different academic levels, and individuals might be going through things that could affect their scores.

4e. Can you say that this finding would also apply in other types of courses (for example, writing-intensive classes)? Why or why not? Explain using the technical terms in the textbook.

I believe that it could possibly be applied; however, the findings lack both the internal and external validity needed to say for sure. Applying this method to another class would, though, give the experiment more external validity.

Q5. Imagine you are looking at the relationship between income and level of education. List some of the factors that could lead to endogeneity. (50 words max)

Generational wealth, location, degree type, youth poverty, age, race, gender.
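A simulated sketch of how one such factor can create endogeneity. Everything here is hypothetical: family_wealth stands in for any omitted factor, the coefficients are invented, and the assumed true effect of education on income is 1.

```r
set.seed(15)  # arbitrary seed for reproducibility

n <- 10000
family_wealth <- rnorm(n)                    # hypothetical omitted factor
education <- 0.8 * family_wealth + rnorm(n)  # wealth raises schooling

# Assumed true effect of education is 1; wealth also raises income directly.
income <- 1 * education + 1.5 * family_wealth + rnorm(n)

# Leaving wealth out pushes its influence into the education coefficient.
naive <- unname(coef(lm(income ~ education))["education"])

# Including it brings the estimate back toward the assumed effect of 1.
adjusted <- unname(coef(lm(income ~ education + family_wealth))["education"])

print(round(c(naive = naive, adjusted = adjusted), 2))
```

The naive estimate overstates the effect because education is correlated with a factor hidden in the error term.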

Part 2. Data Analysis (1/2 point each; 5 points total)

The decrease in infant mortality in the second half of the 20th century is one of the most relevant global development trends. What is the relationship between income and health outcomes?

We will use a dataset from “Our World in Data” (Saloni Dattani and Fiona Spooner and Hannah Ritchie and Max Roser, 2023) to explore this question.

Download the dataset, hdi_health.RData, which you’ll find on the course website. You may want to put it in your working directory to make it easy to find (use getwd() to see what your current working directory is; you can use the Session menu in R Studio or the setwd() command to change your working directory).

Here is a brief description of the variables:

Entity - name of the country or territory
avg_hdi - average Human Development Index, a measure that includes income, education, and health outcomes
avg_infant_mortality - average infant mortality rate, defined as the “estimated share of newborns who die before they are one year old” (Dattani, et al., 2023)

As will often be the case when using R, you will need to use the $ operator to access these variables within the object. Specifically, once you have loaded hdi_health.RData, the data will be available in the object hdi_health. To get at the variable Entity, for example, you would use hdi_health$Entity. Remember, the end of each chapter in the textbook includes R code that can be helpful. We also posted R resources on Gauchospace.
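For a quick illustration of the $ operator that does not need the course data, the sketch below uses cars, a small data frame that ships with base R. The name speeds is only an example.

```r
# cars is a small data frame that ships with base R.
speeds <- cars$speed   # pull out one column as a plain numeric vector

head(speeds)           # first few values
length(speeds)         # one value per row of the data frame
mean(cars$speed)       # functions can be applied directly to the column
```

The same pattern applies to any data frame: the object name, then $, then the column name.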

Q1. Load the data into R. The data are stored as an Rdata file, so you can use the load() function to load it.

load("hdi_health.Rdata")
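To see what load() does without relying on the course file, here is a self-contained round trip. The object toy_data and the temporary file path are hypothetical stand-ins.

```r
# A hypothetical stand-in object; the real file already contains hdi_health.
toy_data <- data.frame(id = 1:3, value = c(0.2, 0.5, 0.9))

# Save it to a temporary .RData file, then remove it from the workspace.
path <- tempfile(fileext = ".RData")
save(toy_data, file = path)
rm(toy_data)

# load() restores the object under its original name and
# returns the names of everything it loaded.
loaded_names <- load(path)
print(loaded_names)
print(exists("toy_data"))
```

Unlike functions that read a CSV, load() does not need its result assigned to a name; the saved objects reappear under the names they had when saved.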

Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? How many variables are there?

nrow(hdi_health)

## [1] 175

ncol(hdi_health)

## [1] 3

There are 175 observations and 3 variables.
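A few related base R functions give the same information in different forms. They are shown here on the built-in cars data frame rather than the course data.

```r
dim(cars)    # rows and columns together, as c(rows, columns)
nrow(cars)   # number of observations
ncol(cars)   # number of variables
str(cars)    # compact overview: dimensions, column names, and types
```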

Q3. How many countries and territories are covered in this data set?

175
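The number of rows equals the number of countries and territories only if no name appears twice. A minimal sketch of how that could be checked, using a made-up vector in place of hdi_health$Entity:

```r
# Hypothetical names with one duplicate, standing in for hdi_health$Entity.
entities <- c("Chile", "Kenya", "Japan", "Kenya")

length(entities)                         # total number of values
n_distinct <- length(unique(entities))   # number of distinct names
n_distinct
anyDuplicated(entities) > 0              # TRUE when at least one name repeats
```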

Q4. Calculate the average of avg_hdi, which is the mean Human Development Index for each country, across all points in the sample. Do you think this average is large or small? What is the minimum and the maximum avg_hdi?

summary(hdi_health)

##     Entity             avg_hdi       avg_infant_mortality
##  Length:175         Min.   :0.3097   Min.   : 0.5179     
##  Class :character   1st Qu.:0.5192   1st Qu.: 2.2448     
##  Mode  :character   Median :0.6967   Median : 4.4149     
##                     Mean   :0.6644   Mean   : 5.0629     
##                     3rd Qu.:0.7935   3rd Qu.: 7.6625     
##                     Max.   :0.9214   Max.   :14.1199

The mean is .6644. This average sits a little above the center of the range, about .05 above the midpoint of the minimum and maximum. The minimum is .3097 and the maximum is .9214.
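The individual statistics can also be computed one at a time instead of read off summary(). The sketch below uses the dist column of the built-in cars data; the names avg, lo, hi, and midpoint are illustrative.

```r
avg <- mean(cars$dist)        # arithmetic mean of one column
lo  <- min(cars$dist)
hi  <- max(cars$dist)
midpoint <- (lo + hi) / 2     # halfway between the minimum and maximum

# Comparing the mean with the midpoint shows where it sits within the range.
print(c(mean = avg, min = lo, max = hi, midpoint = midpoint))

# na.rm = TRUE is needed when a column contains missing values.
mean(c(1, NA, 3), na.rm = TRUE)
```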

Q5. Calculate the average infant mortality rate across all points in the sample. What does this tell you about the prevalence of infant mortality?

The mean is 5.0629. This shows that while the average infant mortality rate is high compared to the countries in the first quartile, around the world it can be much higher.

Q6. Produce a simple scatterplot with Human Development Index on the horizontal axis and the average Infant Mortality Rate on the vertical axis.

plot(hdi_health$avg_hdi, hdi_health$avg_infant_mortality,
     xlab = "Average Human Development Index",
     ylab = "Average Infant Mortality Rate", 
     main = "HDI plotted along with Avg Infant mortality rate")

Q7. Make the plot again, but this time add a trend line (also known as a line of best fit or a regression line) using the abline() command.

plot(hdi_health$avg_hdi, hdi_health$avg_infant_mortality,
     xlab = "Average Human Development Index",
     ylab = "Average Infant Mortality Rate")

model <- lm(avg_infant_mortality ~ avg_hdi, data = hdi_health)

abline(model, col = "red")
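The same fit-then-overlay pattern can be tried on the built-in cars data, with column names in the formula and the data frame passed through the data argument. This is a sketch on different data, not the answer for hdi_health; the name fit and the axis labels are illustrative.

```r
# When run non-interactively, draw to a null device so no plot file is written.
if (!interactive()) pdf(NULL)

# Fit the line first: outcome on the left of ~, predictor on the right.
fit <- lm(dist ~ speed, data = cars)

# Draw the scatterplot, then overlay the fitted line.
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")
abline(fit, col = "red")

# The sign of the slope gives the direction of the relationship.
coef(fit)
```

abline() adds to an existing plot, so plot() has to be called first.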

Q8. What does this line tell you about development and health outcomes?

This line shows the correlation between a country’s average infant mortality rate and its HDI. The correlation is negative: the higher the HDI, the lower the infant mortality rate.

Q9. What could you call this relationship?

I would call this relationship a negative correlation.
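The direction and strength of a relationship like this can be summarized with cor(). The sketch below uses the built-in mtcars data, where car weight (wt) and fuel economy (mpg) also move in opposite directions; it is an illustration on different data, not a result for hdi_health.

```r
# mtcars ships with base R: heavier cars tend to get fewer miles per gallon.
r <- cor(mtcars$wt, mtcars$mpg)
r   # between -1 and 1; negative means one falls as the other rises

# The sign of the correlation matches the sign of the fitted slope.
slope <- unname(coef(lm(mpg ~ wt, data = mtcars))["wt"])
sign(r) == sign(slope)
```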

Problem Set 1 (Due January 24, 2024)

Prof. Cesar B. Martinez-Alvarez, PS 15, UCSB

Winter 2024
