cook
This activity uses the simulation as before, where the treatment is an educational intervention designed to improve the achievement of a group of students. A key difference this time is that, after experimenting with the simulation a bit more, you will export the data from a single run of the simulation to R to estimate the treatment effect using linear regression. You will see how linear regression can be used to estimate the same treatment effect you calculated by straightforward subtraction when comparing the means of the treated and non-treated groups, as well as a treatment effect that adjusts for self-selection.
The link below will open a new webpage containing the simulator.
The simulation is comprised of a group of students who have two
characteristics, achievement level and a socioeconomic status; and a
program (the treatment) intended to improve their performance. The
achievement of the student is represented by their color. A darker blue
represents a higher level of achievement. The SES of the student is
represented by their size. A larger size represents a higher level of
SES. Both achievement and SES are normally distributed across the
population of students, but the exact values are randomly chosen each
time you press the setupbutton.
If a student participates in the treatment, then their achievement will increase by an amount given by the treatment-effect slider plus or minus a little bit of random error. If the student does not participate, for now we will assume that their achievement remains entirely unchanged.
How students are divided into treatment and comparison groups depends
on the treatment-assignment slider. If random assignment is
specified, then the treatment group is entirely chosen at random. If
self-selection is specified, then whether a student is in the treatment
group or not depends on their socioeconomic status (SES) – the higher
their SES, the more likely it is that they will be in the treatment
group.
| Model Parameter | Description |
|---|---|
| nbr-people (slider) | the total number of students in your target population |
| treatment-effect (slider) | the amount that the treatment increases achievement; it’s the actual effect (in real life you don’t know this – you only see the estimated effect) |
| treatment-assignment (drop down box) | lets you specify whether people can self-select into the treatment group, or if the treatment group is randomly assigned |
| achievement-ses-correlation (slider) | the association between the initial achievement of students (at time zero) and their socioeconomic status |
| treatment-ses-correlation (slider) | the association between the propensity of students to select themselves into the treatment group and their socioeconomic status (only matters when “self-selection” is chosen in the treatment-assignment drop down box) |
In order for your changes to take effect, be sure to press setup after making all your changes.
setup buttontreat buttonN slider to the number of trials you want to
runrepeat-N-times buttonTo remind yourself how the simulation worked, run a few trials
starting with the settings below. Then try different combinations of
values for achievement-ses-correlation and
treatment-ses-correlation and observe what happens to the
estimated effect when one is high and the other low; both are high; and
both low (in comparison to the actual effect you set on using the
achievement-ses-correlation slider).
nbr-people = 200
treatment-effect = 15
treatment-assignment = “self-selection”
achievement-ses-correlation = 0.7
treatment-ses-correlation = 0.6
How does the the estimated treatment effect (e.g., the value in the
estimated-effect monitor) compare the actual effect (e.g.,
15, which you set using the treatment-effect slider) when
both achievement-ses-correlation and
treatment-ses-correlation are both high? Both low? Which
settting – both high or both low – results in more bias. Why?
When both correlations are set to high (0.9), the estimated treatment effect was 27.21, which is significantly higher than the actual effect of 15. In contrast, when both correlations were set to low (0.1), the estimated effect was 16.34, which is much closer to the truth.
The setting with both high correlations results in far more bias. This occurs because socioeconomic status acts as a strong confounding variable. High SES students already have higher baseline achievements, and they are also much more likely to self-select into the treatment group. Therefore, the treated group naturally performs better than the control group even without the treatment, artificially inflating the apparent success of the program.
Redo your favorite trials in Question 1.1 with
treatment-assignment = “random”. What happens to the
estimated effect? Why?
When I reran the high correlation configuration (0.9 for both) but changed the assignment to “random”, the estimated effect dropped to 15.66. This is incredibly close to the actual designed effect of 15.
This happens because random assignment completely breaks the link between a student’s socioeconomic status and their likelihood to receive the treatment. Even though SES still strongly influences baseline achievement, the random assignment ensures that both high and low SES students are evenly distributed between the treatment and control groups. As a result, the two groups are comparable from the start, and the confounding bias is eliminated.
So far we have had the luxury of being able to experiment with many different “worlds,” where both the achievement-ses-correlation and the treatment-ses-correlation are entirely in our control. This has allowed us to investigate how different assumptions about those associations impact the estimate the effects of a treatment when we simply compare the mean of the groups who received it to those who did not. In reality, however, the achievement-ses-correlation is something we cannot change; if we are lucky, the best we can do is break the treatment-ses-correlation by conducting a social experiment. Moreover, it is highly unlikely that we will be able to run many repetitions of the same experiment. For this question we will limit ourselves to one “world” where we get to investigate the outcomes of students under experimental and non-experimental conditions.
Set the parameters of the model to the following:
nbr-people = 500
treatment-effect = 15
treatment-assignment = “self-selection”
achievement-ses-correlation = 0.7
treatment-ses-correlation = 0.6
Run this simulated world one time by pressing the setup
button once, and then pressing the “treat” button once. Make a note of
the estimated effect for this run of the world and keep this screen open
on your computer. You may need to come back to it for reference.
The next step is to press the export-data button one
time. Depending on what browser you are using, that will either download
file called “student-data.csv” directly to your downloads directory, or
prompt you to save it. Either way, make sure you place that file in your
R Studio current working directory and rename the file
“M3-student-data.csv”.
The file has 500 rows of data, with each row containing the following information about a single student in the simulation:
| Variable | Description |
|---|---|
| currach | student’s current achievement after hitting the treat button |
| initach | student’s initial achievement before hitting the treat button |
| ses | student’s socioeconomic status |
| treat | a dummy variable indicated whether or not the student received the treatment |
Go look at that file to understand its structure. Note that each variable/column of data is seperated by a comma (hence the “comma seperated values (csv)” extension on the file). You can then import the csv file into R and take a look at the data using the code below:
# load these libraries first, if you haven't already
library(tidyverse)
library(stargazer)
library(ggplot2)
library(GGally)
# read in data
studentData <- read.csv("M3-student-data.csv")
# view data
head(studentData)| currach | initach | ses | treat |
|---|---|---|---|
| 58.48576 | 36.52548 | 81.11380 | 1 |
| 68.41335 | 56.46040 | 93.21819 | 1 |
| 67.84852 | 49.79241 | 74.57385 | 1 |
| 78.01536 | 56.84960 | 88.05072 | 1 |
| 67.25615 | 55.95618 | 70.50045 | 1 |
| 36.51433 | 36.51433 | 64.44605 | 0 |
## Rows: 500
## Columns: 4
## $ currach <dbl> 58.48576, 68.41335, 67.84852, 78.01536, 67.25615, 36.51433, 83…
## $ initach <dbl> 36.52548, 56.46040, 49.79241, 56.84960, 55.95618, 36.51433, 62…
## $ ses <dbl> 81.11380, 93.21819, 74.57385, 88.05072, 70.50045, 64.44605, 81…
## $ treat <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
Note that your output will look different here and for all the questions because you are using your own data from your own run of the simulation. The output should be somewhat similar, but it will not be the same.
Look at the relationships between the achievement and ses variables
using the ggpairs function from the GGally library. This
function creates a plot that gives you three things at once: the
histogram of each variable, the scatter plot between all pairs of
variables, and the correlation between all pairs of variables. The
command below sends all the data from studentData to the
the select function; the select function keeps
only the desired columns and sends them to the ggpairs
function for final processing. Note that the “pipe” operator
(%>%) simply sends the output of the function on the
left to the function on the right. It is a very handy way to send data
through several functions.
Do these relationships make sense given what you know about how the data
were created? Why or why not?
Yes, the relationships observed in the plots make perfect sense based
on the simulation parameters we set. Because we set the
achievement-ses-correlation to 0.7, we would immediately
expect to see a strong positive correlation in the scatter plots
naturally linking initial achievement and SES. Furthermore, since we
chose a self-selection model with a
treatment-ses-correlation of 0.6, we know that students
with higher SES are more heavily represented in the treatment group. As
a consequence, the treated students will generally have a higher initial
starting point, pushing their current achievement even further ahead
after the treatment takes effect.
Examine how the achievement distributions of the treated and control
groups differ. You can do this using the ggplot function
which allows you to create multiple “layers” of graphics to display at
the same time. The first part of the command tells it what data to use
and defines any “aesthetic mapppings” you want to use for all the layers
that follow. For example, the first part of the command below (before
the +) says “map the currach column to the
x-axis for all the layers that follow.” The second part of the command
inherits that mapping and adds the geom_histogram layer,
which says “use the data and mappings you inherit to create a histogram
with black lines around the bars.”
To plot the treated and untreated distributions at the same time, all
you need to do is add an aesthetic mapping that involves a categorical
variable. Common mappings include the shape of the things
you are plotting, the color of the lines of the things you
are plotting; and the color you use to fill the things your
are plotting. To illustrate, the first command below maps the different
levels of the factor treat to different fill
colors for each histogram (keeping the color of the bar lines
black).
[Note: The as.factor() function is used around the
treat variable in the ggplot code so that ggplot knows to
treat it as a categorical variable rather than a continous number. This
is important because categorical variables are treated differently in
visualizations, particularly in how they are grouped, colored, and
labeled in plots.]
ggplot(studentData, aes(x = currach, fill = as.factor(treat))) +
geom_histogram(color = "black") +
labs(fill = "treat") # Changes the legend titleAlternatively, we could have visualized the two histograms separately
by mapping the different levels of the factor treat to a
different line color for each histogram (and make
fill each histogram with black). I think this makes it
harder to visualize the differences between the two histograms, but
decide for yourself:
ggplot(studentData, aes(x = currach, color = as.factor(treat))) +
geom_histogram(fill = "black") +
labs(fill = "treat") # Changes the legend titleTo put exact numbers on the differences between these two
distribution, we can tabulate the data by whether or not the students
received the treatment. You can accomplish this by using the
group_by function to group the data by the different levels
of the treat factor, and then pipe that data to the
summarize function to apply some calculation to each group.
In the command below, we tell the summarize function to
calculate the mean of currach for each group, and call the
temporary variable it creates to store that information
meanCurrent. The na.rm=TRUE option is a good
one to include by default so that you can still calculate the mean if
there are missing values.
studentData %>%
select(treat, currach) %>%
group_by(treat) %>%
summarize(meanCurrent = mean(currach, na.rm = TRUE))| treat | meanCurrent |
|---|---|
| 0 | 46.84449 |
| 1 | 69.62802 |
Use this information to calculate the treatment effect (i.e., the difference between the mean of treated students and the mean of the non-treated students). How does it compare to the Estimated Effect in the simulation calculated by taking the difference between the mean of the treated and untreated groups? How does it compare to the true treatment effect used in the simulation that produced the data?
By subtracting the untreated group’s mean currach
(46.84) from the treated group’s mean (69.63), the calculated treatment
effect is precisely 22.79.
This matches the Estimated Effect of 22.784 that was displayed on the simulation screen before we exported the data, proving that our manual calculation is correct. However, this calculated effect is much higher than the true treatment effect of 15 used to program the simulation. Just like in our initial manual trial, the presence of self-selection bias based on SES has artificially amplified the apparent effectiveness of the program.
Run a linear regression that predicts current achievement on the basis of whether or not they received the treatment. Look at the estimated coefficient of the treatment variable. What it is telling you? How does it compare to the treatment effect you just calculated by hand above? How does it compare to the actual treatment effect used in the simulation to generate this data?
##
## Call:
## lm(formula = currach ~ treat, data = studentData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.219 -6.999 -0.445 7.004 31.692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.8445 0.6718 69.73 <2e-16 ***
## treat 22.7835 0.9407 24.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.52 on 498 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5399
## F-statistic: 586.6 on 1 and 498 DF, p-value: < 2.2e-16
The estimated coefficient for the treat variable is
22.78. This coefficient tells us the simple difference in mean current
achievement between the students who received the treatment and those
who did not, without considering any other outside variables.
This regression coefficient is exactly identical to the treatment effect we calculated by hand in the previous step. Similar to our previous conclusions, this simple regression overestimates the actual simulation effect of 15 because it fails to account for the confounding influence of socioeconomic baseline differences between the groups.
Run a linear regression that predicts current achievement on the basis of whether or not they received the treatment as well as the ses of the student. How has the estimated coefficient of the treatment variable changed? Why?
##
## Call:
## lm(formula = currach ~ treat + ses, data = studentData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.6771 -5.1869 0.6702 5.6071 27.6692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04995 2.73112 -0.018 0.985
## treat 15.23232 0.85719 17.770 <2e-16 ***
## ses 0.71741 0.04099 17.502 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.28 on 497 degrees of freedom
## Multiple R-squared: 0.7159, Adjusted R-squared: 0.7148
## F-statistic: 626.2 on 2 and 497 DF, p-value: < 2.2e-16
The estimated coefficient for the treat variable has
dropped significantly to 15.23.
This change occurs because the newly updated regression model now
controls for the student’s socioeconomic status. By including
ses in the model, we are holding the confounding variable
constant. The revised coefficient of 15.23 now specifically represents
the isolated impact of the treatment program on achievement, making it a
far more accurate reflection of the true simulated effect of 15.
Substitute initach for ses and run the
regression again. How close is the coefficient on treat to
the actual effect? Why?
##
## Call:
## lm(formula = currach ~ treat + initach, data = studentData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7450 -0.4497 0.0444 0.6548 12.4668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.40778 0.90088 -1.563 0.119
## treat 14.68064 0.37867 38.769 <2e-16 ***
## initach 1.03005 0.01848 55.746 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.909 on 497 degrees of freedom
## Multiple R-squared: 0.9367, Adjusted R-squared: 0.9364
## F-statistic: 3677 on 2 and 497 DF, p-value: < 2.2e-16
The coefficient for the treat variable is 14.68, which
is remarkably close to the actual programmed effect of 15.
This happens because initach serves as a near-perfect
control variable that represents the exact baseline capability of the
student right before the treatment was applied. By adjusting for their
initial starting score, we completely strip away all baseline
inequalities between the treated and untreated individuals. The
remaining difference captured by the treatment coefficient essentially
becomes the pure, isolated value added mechanically by the
intervention.