CSSS 321: Problem Set #2
Joey Chi (AC)
No Collaborator
Question 1
In this dataset, what is the average number of new (or repaired)
drinking water facilities in villages with a female politician? Please
answer with a full sentence. (1 point)
Answer 1
In this case, I need data for villages where female = 1, meaning the
village was assigned a female politician. The female column is read as
numeric. The average number of new or repaired drinking water facilities
in villages with a female politician is approximately 23.99.
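The subsetting step described above can be sketched in Python as follows. This is only a minimal illustration: the rows are made-up toy values, not the experiment's data, and the helper name mean_water is hypothetical. Only the column names female and water come from the problem set.

```python
from statistics import mean

# Hypothetical toy rows for illustration only; NOT the real dataset.
villages = [
    {"female": 1, "water": 30},
    {"female": 0, "water": 10},
    {"female": 1, "water": 20},
    {"female": 0, "water": 16},
]

def mean_water(rows, female):
    """Average of water over the rows whose female flag equals the given value."""
    return mean(r["water"] for r in rows if r["female"] == female)

mean_female = mean_water(villages, 1)  # villages assigned a female politician
mean_male = mean_water(villages, 0)    # villages assigned a male politician
print(mean_female, mean_male)
```

Applied to the real data, the same filter-then-average logic would give the group means reported in Answers 1 and 2.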
Question 2
What is the average number of new (or repaired) drinking water
facilities in villages with a male politician? Please answer with a full
sentence. (1 point)
Answer 2
On average, villages with a male politician had 14.74 new (or repaired)
drinking water facilities since random assignment.
Question 3
What is the estimated average causal effect of having a female
politician on the number of new (or repaired) drinking water facilities?
Please provide a full substantive answer (make sure to include the
assumption, why the assumption is reasonable, the treatment, the
outcome, as well as the direction, size, and unit of measurement of the
average treatment effect) (4 points)
Answer 3
First, the average causal effect is estimated as the mean number of water
facilities in villages with female politicians minus the mean number of
water facilities in villages with male politicians. The treatment is
whether the village was assigned a female politician (female = 1). The
outcome is the number of new or repaired drinking water facilities. The
unit of measurement is the number of drinking water facilities per
village (numeric).
From Questions 1 and 2, we have: (1) Average number of water facilities
in female-led villages: 23.99, and (2) Average number of water
facilities in male-led villages: 14.74. Now, we compute the ATE. 23.99 -
14.74 = 9.25. So, ATE = 9.25.
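The subtraction above can be checked with a short Python sketch of the difference-in-means estimator. The function name difference_in_means is a hypothetical helper. The two inputs are the group means reported in Answers 1 and 2.

```python
def difference_in_means(treated_mean, control_mean):
    """Difference-in-means estimate: treated group mean minus control group mean."""
    return treated_mean - control_mean

# Group means as reported in Answers 1 and 2.
mean_female = 23.99
mean_male = 14.74

# Round to two decimals to avoid floating-point noise in the last digits.
ate = round(difference_in_means(mean_female, mean_male), 2)
print(ate)
```

A positive result means the treated group (female politician) has the higher average.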
The estimated average causal effect of having a female politician on the
number of new (or repaired) drinking water facilities is 9.25 additional
water facilities per village. This means that villages led by female
politicians had 9.25 more drinking water facilities on average than
villages led by male politicians.
We assume that the random assignment of female politicians ensures that
the comparison is unbiased, meaning that differences in the number of
water facilities can be attributed to the politician’s gender rather
than other variables (village characteristics). This assumption is
reasonable because randomization should eliminate confounding variables,
making the treatment and control groups comparable.
Since the treatment (having a female politician) increases the outcome
(number of new or repaired drinking water facilities), the direction of
the effect is positive, meaning that villages led by female politicians
had more drinking water facilities compared to male-led villages. The
size of the effect is 9.25 facilities per village.
Question 4
Create a visualization of the distribution of the variable water.
Approximately how many villages in this experiment had about 250 new (or
repaired) drinking water facilities since the randomization of
politicians? (1 point)
Answer 4
To visualize the distribution of the water variable, I will use a
histogram to show the distribution of water across all villages,
identifying how many villages had around 250 new or repaired drinking
water facilities. According to the output, there were no villages that
had a number of drinking water facilities in the range of 245 to 255.
The distribution is highly right skewed: most values are concentrated at
the lower end, and only a few villages had extremely high values. So, the
histogram shows that most villages had a small number of drinking water
facilities, and no villages had approximately 250 new or repaired
drinking water facilities since the randomization of politicians.
Question 5
Create a visualization of the relationship between water and irrigation.
Does the linear relationship between these two variables look positive
or negative? A positive/negative answer will suffice. (1 point)
Answer 5
In this case, we need to analyze the relationship between water (number
of new or repaired drinking water facilities) and irrigation (number of
new or repaired irrigation facilities). The best way to visualize this
relationship is to create a scatter plot. If the points trend upward,
then the relationship is positive. If the points trend downward, the
relationship is negative. According to the graph, the linear regression
line slopes upward, so the relationship is positive. This means that
villages with more drinking water facilities also tend to have more
irrigation facilities.
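The sign of the fitted line can also be checked numerically. The sketch below computes the least-squares slope by hand in Python. The paired values are hypothetical toy numbers, not the real data, and ols_slope is an illustrative helper name.

```python
# Hypothetical toy pairs (water, irrigation); NOT the real data.
water = [2, 5, 9, 14, 20]
irrigation = [1, 2, 2, 5, 6]

def ols_slope(x, y):
    """Least-squares slope of y on x: sum of cross deviations over sum of squared x deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

slope = ols_slope(water, irrigation)
direction = "positive" if slope > 0 else "negative" if slope < 0 else "flat"
print(direction)
```

An upward-sloping line in the scatter plot corresponds to a slope greater than zero here.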
Question 6
Compute the correlation between water and irrigation. (.5 points)
a. Are you surprised by the sign of the correlation? Provide your
reason. (.25 points)
b. And are you surprised by the absolute value of the correlation?
Provide your reason. (.25 points)
Answer 6
The correlation is approximately 0.407, which indicates a moderate
positive linear relationship between water and irrigation: villages with
more drinking water facilities tend to have more irrigation facilities.
I am not surprised by the sign of the correlation because the scatter
plot in Question 5 already showed that the relationship between water and
irrigation is positive. I am not surprised by the absolute value of the
correlation because although villages might develop water and irrigation
facilities together, it doesn’t have to be a perfect positive
relationship due to other influencing factors.
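For reference, the Pearson correlation can be computed by hand as in the Python sketch below. The input pairs are hypothetical toy values, so the printed number will not match the 0.407 obtained from the real data. The helper name pearson_r is an assumption.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy pairs (water, irrigation); NOT the real data.
water = [2, 5, 9, 14, 20]
irrigation = [1, 2, 2, 5, 6]

r = pearson_r(water, irrigation)
print(round(r, 3))
```

The result always lies between -1 and 1. The sign gives the direction of the linear relationship, and the absolute value gives its strength.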
Question 7
If we wanted to use the sample of villages in this dataset to infer the
characteristics of all villages in India, we would have to make sure
that the sample is representative of the population. What would have
been the best way of selecting the villages for the sample to ensure
that the statement above was true? (1 point)
Answer 7
To ensure that the sample of villages in this dataset is representative
of all villages in India, the best approach would be to use stratified
random sampling. This method involves dividing villages into different
strata based on key characteristics such as geographic region, urban
vs. rural classification, economic development, and political
representation. Then, villages would be randomly selected from each
stratum in proportion to their actual distribution in the population.
This ensures that all types of villages are adequately represented,
leading to more reliable and generalizable conclusions.
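The procedure described above can be sketched in Python as proportional stratified sampling. Everything in the example is hypothetical: the population of 100 villages, the two region labels, the 10 percent sampling fraction, and the function name stratified_sample are all assumptions chosen for illustration.

```python
import random

def stratified_sample(units, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum (proportional allocation)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = {}
    for u in units:
        strata.setdefault(stratum_of(u), []).append(u)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = round(len(members) * fraction)     # sample size for this stratum
        sample.extend(rng.sample(members, k))  # simple random draw within the stratum
    return sample

# Hypothetical population: 60 villages in region "A" and 40 in region "B".
villages = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]

sample = stratified_sample(villages, lambda v: v[0], fraction=0.1)
counts = {region: sum(1 for v in sample if v[0] == region) for region in ("A", "B")}
print(counts)
```

Because each stratum contributes the same fraction of its members, the sample keeps the 60 to 40 split of the population.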