M3 Lab 3 Submission

cook

Overview

This activity uses the simulation as before, where the treatment is an educational intervention designed to improve the achievement of a group of students. A key difference this time is that, after experimenting with the simulation a bit more, you will export the data from a single run of the simulation to R to estimate the treatment effect using linear regression. You will see how linear regression can be used to estimate the same treatment effect you calculated by straightforward subtraction when comparing the means of the treated and non-treated groups, as well as a treatment effect that adjusts for self-selection.

Simulator

The link below will open a new webpage containing the simulator.

Click here to access simulator

How it Works

The simulation is comprised of a group of students who have two characteristics, achievement level and a socioeconomic status; and a program (the treatment) intended to improve their performance. The achievement of the student is represented by their color. A darker blue represents a higher level of achievement. The SES of the student is represented by their size. A larger size represents a higher level of SES. Both achievement and SES are normally distributed across the population of students, but the exact values are randomly chosen each time you press the setupbutton.

If a student participates in the treatment, then their achievement will increase by an amount given by the treatment-effect slider plus or minus a little bit of random error. If the student does not participate, for now we will assume that their achievement remains entirely unchanged.

How students are divided into treatment and comparison groups depends on the treatment-assignment slider. If random assignment is specified, then the treatment group is entirely chosen at random. If self-selection is specified, then whether a student is in the treatment group or not depends on their socioeconomic status (SES) – the higher their SES, the more likely it is that they will be in the treatment group.

Parameters You Can Change

Model Parameter	Description
nbr-people (slider)	the total number of students in your target population
treatment-effect (slider)	the amount that the treatment increases achievement; it’s the actual effect (in real life you don’t know this – you only see the estimated effect)
treatment-assignment (drop down box)	lets you specify whether people can self-select into the treatment group, or if the treatment group is randomly assigned
achievement-ses-correlation (slider)	the association between the initial achievement of students (at time zero) and their socioeconomic status
treatment-ses-correlation (slider)	the association between the propensity of students to select themselves into the treatment group and their socioeconomic status (only matters when “self-selection” is chosen in the treatment-assignment drop down box)

In order for your changes to take effect, be sure to press setup after making all your changes.

Running One Trial

Set your parameters
Press setup button
Press treat button

Running Multiple Trials in a Batch

Set your parameters
Set the N slider to the number of trials you want to run
Press the repeat-N-times button

Questions

Q1

To remind yourself how the simulation worked, run a few trials starting with the settings below. Then try different combinations of values for achievement-ses-correlation and treatment-ses-correlation and observe what happens to the estimated effect when one is high and the other low; both are high; and both low (in comparison to the actual effect you set on using the achievement-ses-correlation slider).

nbr-people = 200
treatment-effect = 15
treatment-assignment = “self-selection”
achievement-ses-correlation = 0.7
treatment-ses-correlation = 0.6

Q1.1

How does the the estimated treatment effect (e.g., the value in the estimated-effect monitor) compare the actual effect (e.g., 15, which you set using the treatment-effect slider) when both achievement-ses-correlation and treatment-ses-correlation are both high? Both low? Which settting – both high or both low – results in more bias. Why?

When both correlations are set to high (0.9), the estimated treatment effect was 27.21, which is significantly higher than the actual effect of 15. In contrast, when both correlations were set to low (0.1), the estimated effect was 16.34, which is much closer to the truth.

The setting with both high correlations results in far more bias. This occurs because socioeconomic status acts as a strong confounding variable. High SES students already have higher baseline achievements, and they are also much more likely to self-select into the treatment group. Therefore, the treated group naturally performs better than the control group even without the treatment, artificially inflating the apparent success of the program.

Q1.2

Redo your favorite trials in Question 1.1 with treatment-assignment = “random”. What happens to the estimated effect? Why?

When I reran the high correlation configuration (0.9 for both) but changed the assignment to “random”, the estimated effect dropped to 15.66. This is incredibly close to the actual designed effect of 15.

This happens because random assignment completely breaks the link between a student’s socioeconomic status and their likelihood to receive the treatment. Even though SES still strongly influences baseline achievement, the random assignment ensures that both high and low SES students are evenly distributed between the treatment and control groups. As a result, the two groups are comparable from the start, and the confounding bias is eliminated.

Q2

So far we have had the luxury of being able to experiment with many different “worlds,” where both the achievement-ses-correlation and the treatment-ses-correlation are entirely in our control. This has allowed us to investigate how different assumptions about those associations impact the estimate the effects of a treatment when we simply compare the mean of the groups who received it to those who did not. In reality, however, the achievement-ses-correlation is something we cannot change; if we are lucky, the best we can do is break the treatment-ses-correlation by conducting a social experiment. Moreover, it is highly unlikely that we will be able to run many repetitions of the same experiment. For this question we will limit ourselves to one “world” where we get to investigate the outcomes of students under experimental and non-experimental conditions.

Set the parameters of the model to the following:
nbr-people = 500
treatment-effect = 15
treatment-assignment = “self-selection”
achievement-ses-correlation = 0.7
treatment-ses-correlation = 0.6

Run this simulated world one time by pressing the setup button once, and then pressing the “treat” button once. Make a note of the estimated effect for this run of the world and keep this screen open on your computer. You may need to come back to it for reference.

The next step is to press the export-data button one time. Depending on what browser you are using, that will either download file called “student-data.csv” directly to your downloads directory, or prompt you to save it. Either way, make sure you place that file in your R Studio current working directory and rename the file “M3-student-data.csv”.

The file has 500 rows of data, with each row containing the following information about a single student in the simulation:

Variable	Description
currach	student’s current achievement after hitting the treat button
initach	student’s initial achievement before hitting the treat button
ses	student’s socioeconomic status
treat	a dummy variable indicated whether or not the student received the treatment

Go look at that file to understand its structure. Note that each variable/column of data is seperated by a comma (hence the “comma seperated values (csv)” extension on the file). You can then import the csv file into R and take a look at the data using the code below:

# load these libraries first, if you haven't already
library(tidyverse)
library(stargazer)
library(ggplot2)
library(GGally)

# read in data
studentData <- read.csv("M3-student-data.csv")

# view data
head(studentData)

currach	initach	ses	treat
58.48576	36.52548	81.11380	1
68.41335	56.46040	93.21819	1
67.84852	49.79241	74.57385	1
78.01536	56.84960	88.05072	1
67.25615	55.95618	70.50045	1
36.51433	36.51433	64.44605	0

glimpse(studentData)

## Rows: 500
## Columns: 4
## $ currach <dbl> 58.48576, 68.41335, 67.84852, 78.01536, 67.25615, 36.51433, 83…
## $ initach <dbl> 36.52548, 56.46040, 49.79241, 56.84960, 55.95618, 36.51433, 62…
## $ ses     <dbl> 81.11380, 93.21819, 74.57385, 88.05072, 70.50045, 64.44605, 81…
## $ treat   <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…

Note that your output will look different here and for all the questions because you are using your own data from your own run of the simulation. The output should be somewhat similar, but it will not be the same.

Q2.1

Look at the relationships between the achievement and ses variables using the ggpairs function from the GGally library. This function creates a plot that gives you three things at once: the histogram of each variable, the scatter plot between all pairs of variables, and the correlation between all pairs of variables. The command below sends all the data from studentData to the the select function; the select function keeps only the desired columns and sends them to the ggpairs function for final processing. Note that the “pipe” operator (%>%) simply sends the output of the function on the left to the function on the right. It is a very handy way to send data through several functions.

studentData %>%
  select(currach, initach, ses) %>%
  ggpairs()

Do these relationships make sense given what you know about how the data were created? Why or why not?

Yes, the relationships observed in the plots make perfect sense based on the simulation parameters we set. Because we set the achievement-ses-correlation to 0.7, we would immediately expect to see a strong positive correlation in the scatter plots naturally linking initial achievement and SES. Furthermore, since we chose a self-selection model with a treatment-ses-correlation of 0.6, we know that students with higher SES are more heavily represented in the treatment group. As a consequence, the treated students will generally have a higher initial starting point, pushing their current achievement even further ahead after the treatment takes effect.

Q2.2

Examine how the achievement distributions of the treated and control groups differ. You can do this using the ggplot function which allows you to create multiple “layers” of graphics to display at the same time. The first part of the command tells it what data to use and defines any “aesthetic mapppings” you want to use for all the layers that follow. For example, the first part of the command below (before the +) says “map the currach column to the x-axis for all the layers that follow.” The second part of the command inherits that mapping and adds the geom_histogram layer, which says “use the data and mappings you inherit to create a histogram with black lines around the bars.”

ggplot(studentData, aes(x = currach)) +
  geom_histogram(color = "black")

To plot the treated and untreated distributions at the same time, all you need to do is add an aesthetic mapping that involves a categorical variable. Common mappings include the shape of the things you are plotting, the color of the lines of the things you are plotting; and the color you use to fill the things your are plotting. To illustrate, the first command below maps the different levels of the factor treat to different fill colors for each histogram (keeping the color of the bar lines black).

[Note: The as.factor() function is used around the treat variable in the ggplot code so that ggplot knows to treat it as a categorical variable rather than a continous number. This is important because categorical variables are treated differently in visualizations, particularly in how they are grouped, colored, and labeled in plots.]

ggplot(studentData, aes(x = currach, fill = as.factor(treat))) +
  geom_histogram(color = "black") +
  labs(fill = "treat") # Changes the legend title

Alternatively, we could have visualized the two histograms separately by mapping the different levels of the factor treat to a different line color for each histogram (and make fill each histogram with black). I think this makes it harder to visualize the differences between the two histograms, but decide for yourself:

ggplot(studentData, aes(x = currach, color = as.factor(treat))) +
  geom_histogram(fill = "black") +
  labs(fill = "treat") # Changes the legend title

To put exact numbers on the differences between these two distribution, we can tabulate the data by whether or not the students received the treatment. You can accomplish this by using the group_by function to group the data by the different levels of the treat factor, and then pipe that data to the summarize function to apply some calculation to each group. In the command below, we tell the summarize function to calculate the mean of currach for each group, and call the temporary variable it creates to store that information meanCurrent. The na.rm=TRUE option is a good one to include by default so that you can still calculate the mean if there are missing values.

studentData %>%
  select(treat, currach) %>%
  group_by(treat) %>%
  summarize(meanCurrent = mean(currach, na.rm = TRUE))

treat	meanCurrent
0	46.84449
1	69.62802

Use this information to calculate the treatment effect (i.e., the difference between the mean of treated students and the mean of the non-treated students). How does it compare to the Estimated Effect in the simulation calculated by taking the difference between the mean of the treated and untreated groups? How does it compare to the true treatment effect used in the simulation that produced the data?

By subtracting the untreated group’s mean currach (46.84) from the treated group’s mean (69.63), the calculated treatment effect is precisely 22.79.

This matches the Estimated Effect of 22.784 that was displayed on the simulation screen before we exported the data, proving that our manual calculation is correct. However, this calculated effect is much higher than the true treatment effect of 15 used to program the simulation. Just like in our initial manual trial, the presence of self-selection bias based on SES has artificially amplified the apparent effectiveness of the program.

Q2.3

Run a linear regression that predicts current achievement on the basis of whether or not they received the treatment. Look at the estimated coefficient of the treatment variable. What it is telling you? How does it compare to the treatment effect you just calculated by hand above? How does it compare to the actual treatment effect used in the simulation to generate this data?

model1 <- lm(currach ~ treat, data = studentData)
summary(model1)

## 
## Call:
## lm(formula = currach ~ treat, data = studentData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.219  -6.999  -0.445   7.004  31.692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.8445     0.6718   69.73   <2e-16 ***
## treat        22.7835     0.9407   24.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.52 on 498 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5399 
## F-statistic: 586.6 on 1 and 498 DF,  p-value: < 2.2e-16

The estimated coefficient for the treat variable is 22.78. This coefficient tells us the simple difference in mean current achievement between the students who received the treatment and those who did not, without considering any other outside variables.

This regression coefficient is exactly identical to the treatment effect we calculated by hand in the previous step. Similar to our previous conclusions, this simple regression overestimates the actual simulation effect of 15 because it fails to account for the confounding influence of socioeconomic baseline differences between the groups.

Q2.4

Run a linear regression that predicts current achievement on the basis of whether or not they received the treatment as well as the ses of the student. How has the estimated coefficient of the treatment variable changed? Why?

model2 <- lm(currach ~ treat + ses, data = studentData)
summary(model2)

## 
## Call:
## lm(formula = currach ~ treat + ses, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6771  -5.1869   0.6702   5.6071  27.6692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.04995    2.73112  -0.018    0.985    
## treat       15.23232    0.85719  17.770   <2e-16 ***
## ses          0.71741    0.04099  17.502   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.28 on 497 degrees of freedom
## Multiple R-squared:  0.7159, Adjusted R-squared:  0.7148 
## F-statistic: 626.2 on 2 and 497 DF,  p-value: < 2.2e-16

The estimated coefficient for the treat variable has dropped significantly to 15.23.

This change occurs because the newly updated regression model now controls for the student’s socioeconomic status. By including ses in the model, we are holding the confounding variable constant. The revised coefficient of 15.23 now specifically represents the isolated impact of the treatment program on achievement, making it a far more accurate reflection of the true simulated effect of 15.

Substitute initach for ses and run the regression again. How close is the coefficient on treat to the actual effect? Why?

model3 <- lm(currach ~ treat + initach, data = studentData)
summary(model3)

## 
## Call:
## lm(formula = currach ~ treat + initach, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7450  -0.4497   0.0444   0.6548  12.4668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.40778    0.90088  -1.563    0.119    
## treat       14.68064    0.37867  38.769   <2e-16 ***
## initach      1.03005    0.01848  55.746   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.909 on 497 degrees of freedom
## Multiple R-squared:  0.9367, Adjusted R-squared:  0.9364 
## F-statistic:  3677 on 2 and 497 DF,  p-value: < 2.2e-16

The coefficient for the treat variable is 14.68, which is remarkably close to the actual programmed effect of 15.

This happens because initach serves as a near-perfect control variable that represents the exact baseline capability of the student right before the treatment was applied. By adjusting for their initial starting score, we completely strip away all baseline inequalities between the treated and untreated individuals. The remaining difference captured by the treatment coefficient essentially becomes the pure, isolated value added mechanically by the intervention.