# Problem Set 02 - Do Women Promote Different Policies than Men? (continuation)

Due: Feb 20

(Based on DSS Materials and on Raghabendra Chattopadhyay and Esther Duflo. 2004. “Women as Policy Makers: Evidence from a Randomized Policy Experiment in India.” Econometrica, 72 (5): 1409–43.)

We will estimate the average causal effect of having a female politician on two policy outcomes. For this purpose, we will analyze data from an experiment conducted in India, where villages were randomly assigned to have a female council head. The dataset we will use is in a file called “india.csv”. The table below shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.

| Variable | Description |
|---|---|
| village | village identifier (“Gram Panchayat number _ village number”) |
| female | whether the village was assigned a female politician: 1=yes, 0=no |
| water | number of new (or repaired) drinking water facilities in the village since random assignment |
| irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |

In this problem set, we will practice loading data, making sense of it, and understanding the basics of causal inference. We will also learn how to use R Markdown.
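As a minimal sketch of the loading step: assuming “india.csv” sits in the working directory, the dataset could presumably be read with `india <- read.csv("india.csv")`. To stay self-contained, the example below instead writes a tiny made-up file with the same four column names to a temporary location and reads it back. The values and the names `toy_path` and `india_toy` are hypothetical.

```r
# Hypothetical toy rows with the same column names as india.csv (values are made up)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "village,female,water,irrigation",
  "GP1_1,1,10,0",
  "GP1_2,1,0,5",
  "GP2_1,0,2,2",
  "GP2_2,0,31,4"
), toy_path)

# read.csv() returns a data frame: one row per village, one column per variable
india_toy <- read.csv(toy_path)

dim(india_toy)    # number of rows (villages) and columns (variables)
head(india_toy)   # first few observations
str(india_toy)    # type of each variable
```

Checking `dim()`, `head()`, and `str()` right after loading is a quick way to confirm that the unit of observation and the variable types match the table above.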


1. Considering that the dataset we are analyzing comes from a randomized experiment, what can we compute to estimate the average causal effect of having a female politician on the number of new (or repaired) drinking water facilities? Please provide the name of the estimator. (1 point)

We can estimate the average causal effect of having a female politician on the number of new (or repaired) drinking water facilities with the difference-in-means estimator.
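In symbols, this estimator can be written as follows. The bar notation and the subscript labels are notational assumptions, not taken from the problem set itself:

```latex
\widehat{\text{ATE}} = \overline{Y}_{\text{female} = 1} - \overline{Y}_{\text{female} = 0}
```

Here $\overline{Y}_{\text{female} = 1}$ is the average outcome among villages assigned a female politician, and $\overline{Y}_{\text{female} = 0}$ is the average among villages assigned a male politician. Because assignment was random, the two groups are comparable on average, so the difference in their means estimates the average treatment effect.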


2. In this dataset, what is the average number of new (or repaired) drinking water facilities in villages with a female politician? Please answer with a full sentence. (1 point)

mean(india$water[india$female == 1])
## [1] 23.99074

On average, villages with a female politician had about 23.99 new (or repaired) drinking water facilities.

(Hint: we use [] to subset a variable; inside the square brackets, we specify the selection criterion. For example, we can use the relational operator == to set a logical test; only the observations for which the logical test is true will be extracted.)
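The subsetting described in the hint can be traced step by step on a small example. The vectors below are made up for illustration and are not the india.csv data:

```r
# Hypothetical toy vectors (not the india.csv data)
female <- c(1, 0, 1, 0, 1)
water  <- c(20, 10, 30, 14, 25)

female == 1               # logical test: TRUE FALSE TRUE FALSE TRUE
water[female == 1]        # keeps only the TRUE positions: 20 30 25
mean(water[female == 1])  # 25
mean(water[female == 0])  # 12
```

The logical vector inside the brackets must have the same length as the vector being subset, which is why both variables come from the same data frame in the answers.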


3. What is the average number of new (or repaired) drinking water facilities in villages with a male politician? Please answer with a full sentence. (1 point)

mean(india$water[india$female == 0])
## [1] 14.73832

On average, villages with a male politician had about 14.74 new (or repaired) drinking water facilities.


4. What is the estimated average causal effect of having a female politician on the number of new (or repaired) drinking water facilities? (2 points)

mean(india$water[india$female == 1]) - mean(india$water[india$female == 0])
## [1] 9.252423

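We estimate that having a female politician increased the number of new (or repaired) drinking water facilities by about 9.25 facilities, on average.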
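As a quick arithmetic check, the estimate should equal the difference between the two group means reported in questions 2 and 3. The sketch below confirms this and wraps the same calculation in a small helper. The function name `diff_in_means` and the toy data frame are hypothetical:

```r
# Group means reported earlier in this problem set
mean_female <- 23.99074
mean_male   <- 14.73832
mean_female - mean_male   # about 9.25, matching the estimate up to rounding

# The same estimator as a small helper (hypothetical name), shown on made-up data
diff_in_means <- function(outcome, treated) {
  mean(outcome[treated == 1]) - mean(outcome[treated == 0])
}

toy <- data.frame(female = c(1, 1, 0, 0), water = c(12, 8, 3, 5))
diff_in_means(toy$water, toy$female)   # 10 - 4 = 6
```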


5. Create a visualization of the distribution of the variable water.

  1. Does this variable look bell-shaped distributed? (0.5 points)
  2. Approximately how many villages in this experiment had about 250 new (or repaired) drinking water facilities since the randomization of politicians? (0.5 points)

library(ggplot2)
ggplot(india, aes(x = water)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

a. I made a histogram, and the distribution does not look bell-shaped. It is right-skewed: most villages have a small number of facilities, and the counts fall off as the number of facilities increases. b. Approximately 2 villages had about 250 (±25) new (or repaired) drinking water facilities.

(Hint: the histogram of a variable is the visual representation of its distribution. The function in R to create a histogram is geom_histogram(), added to a ggplot() call. The only required arguments are the variable and the dataset.)
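A histogram is a count of observations per bin, so a question like "how many villages had about 250 facilities" can also be answered numerically. The sketch below uses base R's `cut()` and `table()` rather than ggplot2 so that the counts appear as numbers. The values in `water_toy` and the bin width of 50 are made up for illustration:

```r
# Hypothetical toy values (not the india.csv data)
water_toy <- c(0, 5, 10, 12, 30, 40, 55, 240, 260, 340)

# Assign each value to a 50-wide bin: [0,50), [50,100), ..., [300,350)
bins <- cut(water_toy, breaks = seq(0, 350, by = 50), right = FALSE)
counts <- table(bins)
print(counts)   # most observations fall in the first bin: a right-skewed shape

# Number of observations within 25 of 250
n_near_250 <- sum(abs(water_toy - 250) <= 25)
n_near_250      # 2
```

In ggplot2, the bin width is controlled with the `binwidth` argument of `geom_histogram()`, which is what the `stat_bin()` message suggests choosing explicitly.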


6. Create a visualization of the relationship between water and irrigation.

  1. Does the linear relationship between these two variables look positive or negative? A positive/negative answer will suffice. (0.5 points)
  2. Does the relationship between these two variables look strongly linear? A yes/no answer will suffice. (0.5 points)

ggplot(india, aes(x = water, y = irrigation)) + geom_point() +
  geom_smooth(formula = 'y ~ x', method = 'lm', se = FALSE)

a. The relationship between these two variables looks positive. b. No.

(Hint: a scatter plot is the graphical representation of the relationship between two variables. The function in R to create a scatter plot with the fitting line is:

ggplot(data = dataset, aes(x = var_x, y = var_y)) + 
  geom_point() + 
  geom_smooth(formula = 'y ~ x', method = 'lm', se = F)

It requires three arguments: (1) the name of the object in which you saved the dataset; (2) the code identifying the variable to be plotted along the x-axis, and (3) the code identifying the variable to be plotted along the y-axis.)
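The fitting line drawn by `geom_smooth(method = 'lm')` is a least-squares line, and the sign of its slope is what the "positive or negative" question refers to. A minimal sketch with base R's `lm()`, using made-up vectors `x` and `y` rather than the india.csv variables:

```r
# Hypothetical toy data (not the india.csv values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Fit the least-squares line y = intercept + slope * x
fit <- lm(y ~ x)
coef(fit)                        # intercept and slope of the fitted line

slope <- unname(coef(fit)["x"])
slope > 0                        # TRUE here: the line rises from left to right
```

A positive slope indicates a positive linear relationship. How tightly the points cluster around the line is a separate matter, which the correlation coefficient summarizes.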


7. Compute the correlation between water and irrigation.

  1. Are you surprised by the sign of the correlation? Provide your reason. (0.5 points)
  2. And are you surprised by the absolute value of the correlation? Provide your reason. (0.5 points)

cor(india$water, india$irrigation)
## [1] 0.4073307

The correlation is 0.4073307. a. No, I am not surprised by the positive sign. It makes sense that politicians who prioritize building or repairing drinking water facilities would also prioritize other public infrastructure, such as irrigation. b. Yes, I am surprised by the absolute value. I expected it to be closer to 1, since both variables measure public infrastructure.

(Hint: the function in R to compute a correlation coefficient is cor(). It requires two arguments (separated by a comma) and in no particular order: the code identifying each of the two variables.)
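To illustrate why the order of the two arguments does not matter and what the sign means, the sketch below compares `cor()` with the Pearson formula computed by hand. The vectors are made up for illustration:

```r
# Hypothetical toy data (not the india.csv values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

r <- cor(x, y)   # Pearson correlation (the default method), about 0.82

# The same quantity from its definition: co-variation over the product of spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r, r_manual)          # TRUE: cor() matches the formula
all.equal(cor(x, y), cor(y, x)) # TRUE: the order of the arguments is irrelevant
cor(x, -y)                      # flipping one variable flips the sign
```

The coefficient always lies between -1 and 1. The sign gives the direction of the linear relationship and the absolute value gives its strength, so a value near 0.4 indicates a positive but only moderately linear relationship.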


8. If we wanted to use the sample of villages in this dataset to infer the characteristics of all villages in India, we would have to make sure that the sample is _____________ of the population of all villages. (Please provide the missing word). (1 point)

Representative


9. What would have been the best way of selecting the villages for the sample to ensure that the statement above was true? (1 point)

Randomly selecting the villages from the population of all villages in India, so that every village has the same chance of being included in the sample. Any outliers can then be checked to ensure that the sample is truly representative of the population.
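Random selection from a population can be sketched with `sample()`. The population of 1,000 village identifiers, the sample size of 50, and the seed below are all hypothetical choices made for illustration:

```r
set.seed(123)  # arbitrary seed so the draw can be reproduced

# Hypothetical sampling frame: one identifier per village in the population
population_ids <- paste0("village_", 1:1000)

# Simple random sample without replacement (the default for sample())
sample_ids <- sample(population_ids, size = 50)

length(sample_ids)         # 50 villages selected
anyDuplicated(sample_ids)  # 0: no village is drawn twice
```

Because every village in the frame has the same chance of being drawn, the sample resembles the population on average, which is what makes it representative.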