CSSS 321: Problem Set #2
Joey Chi (AC)
No Collaborator

Question 1
In this dataset, what is the average number of new (or repaired) drinking water facilities in villages with a female politician? Please answer with a full sentence. (1 point)
Answer 1
To answer this, I subset the data to villages where female = 1, meaning the village was assigned a female politician. The female column is read as numeric. The average number of new or repaired drinking water facilities in villages with a female politician is approximately 23.99.

Question 2
What is the average number of new (or repaired) drinking water facilities in villages with a male politician? Please answer with a full sentence. (1 point)
Answer 2
On average, villages with a male politician had 14.74 new (or repaired) drinking water facilities since random assignment.

Question 3
What is the estimated average causal effect of having a female politician on the number of new (or repaired) drinking water facilities? Please provide a full substantive answer (make sure to include the assumption, why the assumption is reasonable, the treatment, the outcome, as well as the direction, size, and unit of measurement of the average treatment effect) (4 points)
Answer 3
First, the average causal effect is estimated as the mean number of water facilities in villages with female politicians minus the mean number of water facilities in villages with male politicians (the difference-in-means estimator). The treatment is whether the village was assigned a female politician (female = 1). The outcome is the number of new or repaired drinking water facilities. The unit of measurement is drinking water facilities per village (numeric).
From Questions 1 and 2, we have: (1) Average number of water facilities in female-led villages: 23.99, and (2) Average number of water facilities in male-led villages: 14.74. Now, we compute the ATE. 23.99 - 14.74 = 9.25. So, ATE = 9.25.
The estimated average causal effect of having a female politician on the number of new (or repaired) drinking water facilities is 9.25 additional water facilities per village. This means that villages led by female politicians had 9.25 more drinking water facilities on average than villages led by male politicians.
We assume that villages assigned a female politician are comparable to villages assigned a male politician, so that differences in the number of water facilities can be attributed to the politician’s gender rather than to other variables such as village characteristics. This assumption is reasonable because female politicians were assigned at random, and randomization makes the treatment and control groups comparable on average, which removes confounding.
Since the treatment (having a female politician) increases the outcome (number of new or repaired drinking water facilities), the direction of the effect is positive, meaning that villages led by female politicians had more drinking water facilities compared to male-led villages. The size of the effect is 9.25 facilities per village.
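The difference-in-means calculation above can be sketched in Python. This is only a minimal illustration, not the code used for this problem set: the rows are made-up toy values rather than the actual village data, and the names difference_in_means and toy_rows are hypothetical. Only the column names female and water come from the dataset described in the questions.

```python
from statistics import mean

def difference_in_means(rows, treatment="female", outcome="water"):
    """Return (treated mean, control mean, difference) for a binary treatment."""
    treated = [r[outcome] for r in rows if r[treatment] == 1]
    control = [r[outcome] for r in rows if r[treatment] == 0]
    treated_mean = mean(treated)
    control_mean = mean(control)
    return treated_mean, control_mean, treated_mean - control_mean

# Toy rows for illustration only (NOT the real village data).
toy_rows = [
    {"female": 1, "water": 30},
    {"female": 1, "water": 20},
    {"female": 0, "water": 10},
    {"female": 0, "water": 16},
]
treated_mean, control_mean, ate = difference_in_means(toy_rows)

# The same subtraction applied to the means reported in Answers 1 and 2.
reported_ate = round(23.99 - 14.74, 2)
```

With the real data, the two group means would be the values from Answers 1 and 2, and the last line reproduces the 9.25 figure from them.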

Question 4
Create a visualization of the distribution of the variable water. Approximately how many villages in this experiment had about 250 new (or repaired) drinking water facilities since the randomization of politicians? (1 point)
Answer 4
To visualize the distribution of the water variable, I used a histogram showing the distribution of water across all villages, and then checked how many villages had around 250 new or repaired drinking water facilities. According to the output, no villages had a number of drinking water facilities in the range of 245 to 255. The distribution is highly right skewed: most values are concentrated at the lower end, and only a few villages had extremely high values, which form a long tail to the right. So, the histogram shows that most villages had a small number of drinking water facilities, and no villages had approximately 250 new or repaired drinking water facilities since the randomization of politicians.
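The two steps described in this answer, binning the values for a histogram and counting the villages that fall between 245 and 255, can be sketched as follows. This is a hedged illustration using only the Python standard library: toy_water is made-up data, not the actual water column, and histogram_counts and count_in_range are hypothetical helper names.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Map each bin's lower edge to the number of values in [edge, edge + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def count_in_range(values, low, high):
    """Count how many values fall in the closed interval [low, high]."""
    return sum(1 for v in values if low <= v <= high)

# Toy values for illustration only (NOT the real water column).
toy_water = [0, 2, 3, 5, 5, 8, 10, 12, 20, 40, 150, 340]

bins = histogram_counts(toy_water, bin_width=50)
near_250 = count_in_range(toy_water, 245, 255)

# Crude text histogram: one '#' per village in each bin.
for edge, n in bins.items():
    print(f"{edge:>3}+ {'#' * n}")
```

In this toy data almost every value lands in the lowest bin and nothing falls near 250, which mirrors the pattern described in the answer without reproducing the real counts.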

Question 5
Create a visualization of the relationship between water and irrigation. Does the linear relationship between these two variables look positive or negative? A positive/negative answer will suffice. (1 point)
Answer 5
In this case, we need to analyze the relationship between water (number of new or repaired drinking water facilities) and irrigation (number of new or repaired irrigation facilities). The best way to visualize this relationship is to create a scatter plot. If the points trend upward, then the relationship is positive. If the points trend downward, the relationship is negative. According to the graph, the linear regression line slopes upward, so the relationship is positive. This means that villages with more drinking water facilities also tend to have more irrigation facilities.
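The direction of a fitted line can also be checked numerically by computing the least-squares slope and looking at its sign. The sketch below is an assumption-laden illustration: the two lists are toy values, not the real water and irrigation columns, and ols_slope is a hypothetical helper name.

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

# Toy values for illustration only (NOT the real columns).
toy_water = [1, 2, 3, 4, 5]
toy_irrigation = [2, 1, 4, 3, 5]

slope = ols_slope(toy_water, toy_irrigation)
direction = "positive" if slope > 0 else "negative" if slope < 0 else "flat"
```

A slope above zero corresponds to points that trend upward in the scatter plot, and a slope below zero to points that trend downward.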

Question 6
Compute the correlation between water and irrigation. (.5 points)
a. Are you surprised by the sign of the correlation? Provide your reason. (.25 points)
b. And are you surprised by the absolute value of the correlation? Provide your reason. (.25 points)
Answer 6
The correlation between water and irrigation is approximately 0.407, which indicates a moderate positive linear relationship: villages with more drinking water facilities tend to have more irrigation facilities. I am not surprised by the sign of the correlation because the scatter plot in Question 5 already showed that the relationship between water and irrigation is positive. I am not surprised by the absolute value of the correlation either, because although villages might develop water and irrigation facilities together, other factors also influence each variable, so the points do not have to lie close to a straight line and the correlation does not have to be near 1.
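For reference, the Pearson correlation coefficient can be computed by hand as the covariance term divided by the product of the two spread terms. This is a minimal sketch on toy values, not the real columns, so it does not reproduce the 0.407 figure. The name pearson_r is hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy values for illustration only (NOT the real columns).
toy_water = [1, 2, 3, 4, 5]
toy_irrigation = [3, 1, 4, 2, 5]

r = pearson_r(toy_water, toy_irrigation)
```

The sign of r matches the direction of the relationship, and its absolute value, which is always between 0 and 1, reflects how tightly the points cluster around a line.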

Question 7
If we wanted to use the sample of villages in this dataset to infer the characteristics of all villages in India, we would have to make sure that the sample is representative of the population. What would have been the best way of selecting the villages for the sample to ensure that the statement above was true? (1 point)
Answer 7
To ensure that the sample of villages in this dataset is representative of all villages in India, the best approach would be to use stratified random sampling. This method involves dividing villages into different strata based on key characteristics such as geographic region, urban vs. rural classification, economic development, and political representation. Then, villages would be randomly selected from each stratum in proportion to their actual distribution in the population. This ensures that all types of villages are adequately represented, leading to more reliable and generalizable conclusions.
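Proportional stratified random sampling, as described above, can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy villages, the region field, the 10% sampling fraction, and the function name stratified_sample are all hypothetical and do not come from the dataset.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum (proportional allocation)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample

# Toy population for illustration only: 60 "north" and 40 "south" villages.
toy_villages = [
    {"id": i, "region": "north" if i < 60 else "south"} for i in range(100)
]
sample = stratified_sample(toy_villages, lambda v: v["region"], fraction=0.1)
```

Because the same fraction is drawn from each stratum, the sample keeps the 60/40 regional split of the toy population (6 north and 4 south villages), which is the sense in which each type of village is represented in proportion to its share of the population.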