library(pacman)
p_load(wooldridge, tidyverse)

Exercise 1

Let math10 denote the percentage of students at a Michigan high school receiving a passing score on a standardized math test (see also Example 4.2 in Wooldridge). We are interested in estimating the effect of per-student spending on math performance. A simple model is:

\(math10 = \beta_{0} + \beta_{1} log(expend) + \beta_{2} log(enroll) + \beta_{3} poverty + u\)

i. The variable lnchprg is the percentage of students eligible for the federally funded school lunch program. Why is this a sensible proxy variable for poverty?

If we could measure a student's poverty directly through household income, we would not need a proxy. However, since eligibility for the lunch program is determined by family income, lnchprg is a sensible proxy for poverty. Eligibility could also depend on other criteria, such as whether both parents work all day (which does not necessarily imply poverty), so how good a proxy it is depends on the rules used to award the benefit.

ii. The following table contains OLS estimates, with and without lnchprg as an explanatory variable.

a. Explain why the effect of expenditures on math10 is lower in column (2) than in column (1). Is the effect in column (2) still statistically greater than zero?

The effect of expend is overestimated in column (1) because of omitted variable bias. Including the poverty proxy reduces that bias, so the coefficient falls in column (2). The effect is still statistically greater than zero at least at the 90% level.
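The direction of the bias can be sketched with the usual omitted-variable-bias formula, where \(\delta_{1}\) (our notation, not from the textbook table) is the slope from an auxiliary regression of lnchprg on log(expend), holding log(enroll) fixed:

\[\tilde{\beta}_{1} = \beta_{1} + \beta_{3}\delta_{1}\]

Since poverty lowers pass rates (\(\beta_{3} < 0\)) and higher-spending schools tend to have fewer lunch-eligible students (\(\delta_{1} < 0\)), the term \(\beta_{3}\delta_{1}\) is positive, so the column (1) estimate of \(\beta_{1}\) is biased upward.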

b. Explain why there is a change in sign for log(enroll) between the model in column (1) and the model in column (2).

Not controlling for poverty biases the enroll estimate upward: once we control for poverty, we see a negative partial effect of school size on the math pass rate. One possible mechanism is that the lunch program keeps more students in school and in the math course, which crowds the class and pulls the class average down. In short, the sign flips because adding the proxy "unbundles" the poverty effect from enroll.

c. It appears that pass rates are lower at larger schools, other factors being equal. Explain.

Larger schools mean more students per teacher, hence less individualized instruction, which lowers pass rates (ceteris paribus).

d. Interpret the coefficient on lnchprg in column (2).

Using lnchprg as the proxy for poverty, a one-percentage-point increase in the share of students eligible for the lunch program is associated with a decrease of 0.324 percentage points in math10, holding the other factors fixed.

e. What do you make of the substantial increase in R2 from column (1) to column (2)?

It makes sense: poverty turns out to matter a great deal, so the model's explanatory power improves substantially.

iii. Suppose you conduct this same regression substituting the logarithm of expend and enroll with their levels and squares. How can you test what specification is a better prediction of math10? Describe the steps formally.

We can run a joint significance test based on a general model that nests the levels and squares specified in the question:

\(math = \beta_{0} + \beta_{1}log(expend) + \beta_{2} log(enroll) + \beta_{3} poverty + \beta_{4} expend + \beta_{5} expend^2 + \beta_{6} enroll + \beta_{7} enroll^2 + u\)

So that, with an F test:

\(H_{0}: \beta_{1} = \beta_{2} = 0\), using \(F_{2,\,428-7-1}\)

If \(F_{stat} < F_{2,\,428-7-1}\), we fail to reject \(H_{0}\) and conclude that it is better not to include the logarithms of expend and enroll.
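As a sketch of how this could be run in R (assuming the meap93 data from the wooldridge package, with lnchprg standing in for the poverty proxy), compare the restricted model (levels and squares only) with the general model that adds the logs:

data("meap93")

# General (unrestricted) model: logs plus levels and squares
general <- lm(math10 ~ log(expend) + log(enroll) + lnchprg +
                expend + I(expend^2) + enroll + I(enroll^2), data = meap93)

# Restricted model: drop the two log terms (H0: their coefficients are zero)
restricted <- lm(math10 ~ lnchprg + expend + I(expend^2) +
                   enroll + I(enroll^2), data = meap93)

# F test of the exclusion restriction
anova(restricted, general)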

Exercise 2

The following equation explains weekly hours of television viewing by a child in terms of the child’s age, mother’s education, father’s education, and number of siblings:

\(tvhours* = \beta_{0} + \beta_{1} age + \beta_{2}age^2 + \beta_{3}motheduc + \beta_{4} fatheduc + \beta_{5} sibs + u\)

We are worried that tvhours* is measured with error in our survey. Let tvhours denote the reported hours of television viewing per week.

a. Do you think the CEV assumptions are likely to hold? Explain.

Given that \(tvhours = tvhours^{*} + \epsilon\):

\(cov(tvhours^{*}, \epsilon) = 0\) means that the measurement error is unrelated to the unobserved true value.

I do not think the assumptions hold, because tvhours is self-reported: the measurement error may depend on age (teenagers may understate how much television they watch to stay in their parents' good graces and be allowed to watch more), which would undermine the unbiasedness of our OLS estimates. In addition, people who watch a lot of television probably underestimate the number of hours they watch.

b. What do the classical errors-in-variables (CEV) assumptions require in this application?

We need the measurement error \(\epsilon\) to have mean zero and to be uncorrelated with all of the explanatory variables.

c. Show formally the bias that the measurement error in tvhours* generates.

Let the reported variable be \(tvhours_{i} = tvhours^{*}_{i} + v_{i}\), and let the true model be

\[tvhours^{*}_{i} = \alpha + \beta X_{i} + \epsilon_{i}\]

where \(X_{i}\) collects the explanatory variables. If we instead regress the mismeasured \(tvhours_{i}\) on \(X_{i}\):

\[tvhours_{i} = \alpha + \beta X_{i} + \epsilon_{i} + v_{i}\]

\[\therefore \hat \beta = \frac{cov(tvhours_{i}, X_{i})}{var(X_{i})} = \frac{cov(tvhours^{*}_{i} + v_{i}, X_{i})}{var(X_{i})} = \frac{cov(\alpha + \beta X_{i} + \epsilon_{i} + v_{i}, X_{i})}{var(X_{i})}\]

\[= \frac{cov(\alpha, X_{i})}{var(X_{i})} + \beta \frac{cov(X_{i}, X_{i})}{var(X_{i})} + \frac{cov(\epsilon_{i}, X_{i})}{var(X_{i})} + \frac{cov(v_{i}, X_{i})}{var(X_{i})} = \beta \frac{var(X_{i})}{var(X_{i})} = \beta\]

because the covariance between a random variable and the constant \(\alpha\) is zero, and under CEV \(cov(\epsilon_{i}, X_{i}) = cov(v_{i}, X_{i}) = 0\). So classical measurement error in the dependent variable does not bias \(\hat \beta\). However, the composite error \(\epsilon_{i} + v_{i}\) has a larger variance, which inflates the standard errors and reduces our statistical power; it is this extra noise that creates problems if the CEV assumptions do not hold.
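A small simulation sketch (hypothetical numbers, not the actual survey) illustrates the point: adding classical error to the dependent variable leaves the slope roughly unchanged but widens its standard error.

set.seed(123)
n <- 1000
x       <- rnorm(n, mean = 12, sd = 2)        # e.g. mother's education (made up)
tv_true <- 20 - 0.5 * x + rnorm(n, sd = 2)    # "true" weekly tv hours
tv_rep  <- tv_true + rnorm(n, sd = 4)         # reported hours with CEV-type noise

summary(lm(tv_true ~ x))$coefficients["x", ]  # slope close to -0.5, small SE
summary(lm(tv_rep ~ x))$coefficients["x", ]   # similar slope, larger SE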

Exercise 3

You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. JTRAIN2 is the outcome of a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves largely determine whether they participate in job training (self-select). The data sets cover the same time period.

data("jtrain2")
data("jtrain3")

a. In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in JTRAIN3? Why do you think there is such a big difference?

# In JTRAIN2:
jtrain2 %>% 
  count(train==1)
##   train == 1   n
## 1      FALSE 260
## 2       TRUE 185
# In JTRAIN3:
jtrain3 %>%
  count(train==1)
##   train == 1    n
## 1      FALSE 2490
## 2       TRUE  185

Therefore:

JTRAIN2: train = 1 for 41.6% of the men and train = 0 for 58.4%.

JTRAIN3: train = 1 for 6.9% of the men and train = 0 for 93.1%.

The men in JTRAIN2, who had low incomes, were selected to receive job training, so they are not a representative group of the whole population. The JTRAIN3 sample is (close to) a random sample of men working in 1978, and we would expect only a small fraction of them to have received job training.
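As a quick cross-check, since train is a 0/1 dummy its sample mean gives the fraction trained directly:

# Share of trained men in each sample (train is a 0/1 indicator)
mean(jtrain2$train)   # about 0.416 (185/445)
mean(jtrain3$train)   # about 0.069 (185/2675)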

b. Using JTRAIN2, run a simple regression of re78 on train. What is the estimated effect of participating in job training on real earnings?

reg <- lm(re78 ~ train, data =jtrain2)
summary(reg)
## 
## Call:
## lm(formula = re78 ~ train, data = jtrain2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.349 -4.555 -1.829  2.917 53.959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5548     0.4080  11.162  < 2e-16 ***
## train         1.7943     0.6329   2.835  0.00479 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.58 on 443 degrees of freedom
## Multiple R-squared:  0.01782,    Adjusted R-squared:  0.01561 
## F-statistic: 8.039 on 1 and 443 DF,  p-value: 0.004788

The estimated effect is that participating in job training raises 1978 real earnings by about 1.79 units on average, i.e. roughly $1,794, since re78 is measured in thousands of dollars.

c. Now add as controls to the regression in part (ii) the variables re74, re75, educ, age, black, and hisp. Does the estimated effect of job training on re78 change much? why? (Hint: Remember that these are experimental data.)

reg2 <- lm(re78 ~ train + re74 + re75 + educ + age + black + hisp, data = jtrain2)
summary(reg2)
## 
## Call:
## lm(formula = re78 ~ train + re74 + re75 + educ + age + black + 
##     hisp, data = jtrain2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.890 -4.424 -1.661  3.012 54.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.67407    2.42272   0.278  0.78097   
## train        1.68005    0.63086   2.663  0.00803 **
## re74         0.08331    0.07653   1.089  0.27694   
## re75         0.04677    0.13068   0.358  0.72062   
## educ         0.40360    0.17485   2.308  0.02145 * 
## age          0.05435    0.04382   1.240  0.21560   
## black       -2.18007    1.15550  -1.887  0.05987 . 
## hisp         0.14356    1.54092   0.093  0.92582   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.499 on 437 degrees of freedom
## Multiple R-squared:  0.05476,    Adjusted R-squared:  0.03962 
## F-statistic: 3.617 on 7 and 437 DF,  p-value: 0.0008396

The coefficient on train changes by about 0.11 (from 1.79 to 1.68). This makes sense: the controls absorb a little of the variation, but because these are experimental data and training was randomly assigned, train is essentially uncorrelated with the added regressors, so the estimate barely moves and remains statistically significant.

d. Do the regressions in parts (b) and (c) using the data in JTRAIN3, reporting only the estimated coefficients on train, along with their t statistics. What is the effect now of controlling for the extra factors, and why?

# For the regression in part (b)
reg3 <- lm(re78 ~ train, data = jtrain3)
stats <- data.frame(Efecto = summary(reg3)$coefficients[2],
                    t_value = summary(reg3)$coefficients[2,3])
print(stats)
##      Efecto   t_value
## 1 -15.20478 -13.16871
# For the regression in part (c)
reg4 <- lm(re78 ~ train + re74 + re75 + educ + age + black + hisp, data = jtrain3)
stats_1 <- data.frame(Efecto = summary(reg4)$coefficients[2],
                      t_value = summary(reg4)$coefficients[2,3])
print(stats_1)
##      Efecto   t_value
## 1 0.2132254 0.2498584

The picture is very different in JTRAIN3. The simple regression gives a large, highly significant negative coefficient (about -15.2, t ≈ -13.2), because in the observational data men self-select into training and those who participate have much lower earnings to begin with. Once we control for re74, re75, educ, age, black, and hisp, most of that selection is absorbed and the train coefficient becomes small and statistically insignificant (about 0.21, t ≈ 0.25). So in JTRAIN3 the measured effect of job training depends heavily on what we control for.

e. Define avgre = (re74 + re75)/2. Find the sample averages, standard deviations, and minimum and maximum values in the two data sets. Are these data sets representative of the same populations in 1978?

# JTRAIN2
jtrain2 <- jtrain2 %>%
  mutate(avgre = (re74 + re75)/2)

stats_3 <- data.frame(mean = mean(jtrain2$avgre),
                      sd = sd(jtrain2$avgre), 
                      min = min(jtrain2$avgre),
                      max = max(jtrain2$avgre))
print(stats_3)
##       mean       sd min      max
## 1 1.739702 3.900095   0 24.37645
# JTRAIN3
jtrain3 <- jtrain3 %>%
  mutate(avgre = (re74 + re75)/2)

stats_4 <- data.frame(mean = mean(jtrain3$avgre), 
                      sd = sd(jtrain3$avgre), 
                      min = min(jtrain3$avgre),
                      max = max(jtrain3$avgre))
print(stats_4)
##       mean       sd min     max
## 1 18.04045 13.29345   0 146.901

They are clearly not the same population: the mean and standard deviation of avgre are much higher in JTRAIN3 than in JTRAIN2.

f. Almost 96% of men in the data set JTRAIN2 have avgre less than $10,000. Using only these men, run the regression re78 on train, re74, re75, educ, age, black, hisp and report the training estimate and its t statistic. Run the same regression for JTRAIN3, using only men with avgre < 10. For the subsample of low-income men, how do the estimated training effects compare across the experimental and nonexperimental data sets?

# For JTRAIN2:

reg5 <- lm(re78 ~ train + re74 + re75 + educ + age + black + hisp, 
           data = filter(jtrain2, jtrain2$avgre<10))

stats_5 <- data.frame(Efecto = summary(reg5)$coefficients[2],
                      t_value = summary(reg5)$coefficients[2,3])

print(stats_5)
##     Efecto  t_value
## 1 1.583033 2.503033
# For JTRAIN3:

reg6 <- lm(re78 ~ train + re74 + re75 + educ + age + black + hisp, 
           data = filter(jtrain3, jtrain3$avgre<10))

stats_6 <- data.frame(Efecto = summary(reg6)$coefficients[2],
                      t_value = summary(reg6)$coefficients[2,3])

print(stats_6)
##     Efecto  t_value
## 1 1.844452 2.065204

For low-income men, the estimated effect of train does not differ much between JTRAIN2 and JTRAIN3: among men with low prior earnings, the payoff to job training is fairly consistent across the experimental and nonexperimental samples.

g. Now use each data set to run the simple regression re78 on train, but only for men who were unemployed in 1974 and 1975. How do the training estimates compare now?

# For JTRAIN2:

reg7 <- lm(re78 ~ train, 
           data = filter(jtrain2,jtrain2$unem74==1 & jtrain2$unem75==1))

stats_7 <- data.frame(Efecto = summary(reg7)$coefficients[2],
                      t_value = summary(reg7)$coefficients[2,3])

print(stats_7)
##     Efecto  t_value
## 1 1.842065 2.672738
# For JTRAIN3:

reg8 <- lm(re78 ~ train, 
           data = filter(jtrain3, jtrain3$unem74==1 & jtrain3$unem75==1))

stats_8 <- data.frame(Efecto = summary(reg8)$coefficients[2],
                      t_value = summary(reg8)$coefficients[2,3])

print(stats_8)
##     Efecto  t_value
## 1 3.803299 4.303465

Now the estimates do differ: when we restrict attention to men who were unemployed in both 1974 and 1975, the estimated effect of job training is not consistent across JTRAIN2 and JTRAIN3.

h. Using your findings from the previous regressions, discuss the potential importance of having comparable populations underlying comparisons of experimental and nonexperimental estimates.

With comparable samples, roughly the same population, men with average real earnings below $10,000 in 1974-1975, we obtain positive training-effect estimates whether we use JTRAIN2 or JTRAIN3. Using all of JTRAIN3 can be misleading because it includes many men for whom training would never be relevant.

With only experimental data it can be difficult to know in which part of the population to look for an effect. But for those who were unemployed in the two years before the program, the estimated effect of training is positive.

Exercise 4

Does attending a summer school improve test scores? This fictitious setting is as follows:

• In the summer break between year 5 and year 6, (roughly corresponding to age 10) there is an optional summer school.

• The summer school could be focusing on the school curriculum, or it could be focused on skills that lead to improved schooling outcomes (for example “grit” as in Alan et al (2019)).

• The summer school is free, but enrollment requires active involvement by parents.

• We are interested in whether participation in summer school improves child outcomes.

We have a dataset to study the research question, including:

Information about person id, school id, an indicator variable that takes the value of 1 if the individual participated in the summer school, information about gender, parental income and parental schooling, and test scores in year 5 (before the treatment) and year 6. The dataset also contains information about whether the individual received a reminder letter.

a. Load summercamp.dta into R (the data is in the class folder) and store it in an object called summercamp. Then:

library(tidyr)
library(haven)
## Warning: package 'haven' was built under R version 4.3.2
ruta_archivo <- "C:/Users/ricar/OneDrive/Ricardo C/CIDE/Quinto semestre/Econometria I/Laboratorio/Actividad 6/summercamp.dta"

summercamp <- read_dta(ruta_archivo)

# make data tidy (make long)
school_data <- summercamp %>%
       pivot_longer(
         cols = starts_with("test_year"),
         names_to = "year",
         names_prefix = "test_year_",
         names_transform = list(year = as.integer),
         values_to = "test_score",
       )

b. Use skim() to get an overview of the data.

library("skimr")
## Warning: package 'skimr' was built under R version 4.3.2
# Use skim() to skim the data
skim(school_data)
Data summary
Name school_data
Number of rows 6982
Number of columns 9
_______________________
Column type frequency:
character 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
parental_schooling 0 1 2 2 0 13 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
person_id 0 1 1746.00 1007.84 1.00 873.25 1746.00 2618.75 3491.00 ▇▇▇▇▇
school_id 0 1 15.66 8.67 1.00 8.00 15.00 23.00 30.00 ▇▇▇▇▇
summercamp 0 1 0.46 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
female 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
parental_lincome 0 1 14.56 0.69 12.67 14.11 14.52 14.95 19.45 ▂▇▁▁▁
letter 0 1 0.25 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
year 0 1 5.50 0.50 5.00 5.00 5.50 6.00 6.00 ▇▁▁▁▇
test_score 11 1 2.39 0.71 -0.27 1.92 2.35 2.86 4.98 ▁▃▇▃▁

c. Correlate missing values for parental_schooling with parental income (hint: create a dummy for missing values). Is there evidence that missing values are not random?

# Create a dummy variable for missing values in parental_schooling
summercamp$parental_schooling_missing <- ifelse(is.na(summercamp$parental_schooling), 1, 0)

# Correlate parental_schooling_missing with parental_income
correlation <- cor(summercamp$parental_schooling_missing, summercamp$parental_lincome, use = "complete.obs")
## Warning in cor(summercamp$parental_schooling_missing,
## summercamp$parental_lincome, : the standard deviation is zero
# Print the correlation coefficient
print(correlation)
## [1] NA

The warning “the standard deviation is zero” means that one of the two variables passed to cor() has no variability. It is not parental_lincome (the skim above shows a standard deviation of about 0.69); it is the missing-value dummy: parental_schooling is stored as a character variable whose missing entries appear as the literal string "NA", so is.na() never flags them and the dummy is identically zero.

A correlation coefficient is undefined when one of the variables is constant, because its standard deviation appears in the denominator of the correlation formula, which is why the result is NA.
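A sketch of how the intended check could be run (under the assumption, suggested by the skim output, that missing parental_schooling is coded as the string "NA"):

# Rebuild the missing-value dummy treating the string "NA" (and any true NA)
# as missing, then correlate it with parental income
tmp <- summercamp %>%
  mutate(schooling_missing = as.integer(parental_schooling == "NA" |
                                          is.na(parental_schooling)))
cor(tmp$schooling_missing, tmp$parental_lincome)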

d. Assume the values are “missing at random”. Hence drop the NA rows.

If you assume that the missing values are at random and want to drop the rows with missing values, you can use the na.omit() function in R. This function removes any rows in a data frame that contain missing values. Here’s how you can implement it:

# Assuming your dataset is named 'summercamp'

# Drop rows with missing values
summercamp_no_missing <- na.omit(summercamp)

# Now 'summercamp_no_missing' contains the dataset without rows with missing values

After executing this code, summercamp_no_missing will be a new dataset that does not include rows with missing values. Keep in mind that this approach assumes missing values are missing completely at random (MCAR), meaning that the probability of missingness is unrelated to the observed or unobserved data. If the missing values are not completely at random, dropping them may introduce bias into your analysis. Note also that because parental_schooling stores its missing entries as the string "NA" rather than as true NA values, na.omit() will not drop those rows unless the column is recoded first (as in the sketch in part (c)); here it only removes rows with genuine NAs, such as the missing test scores.

e. Why do we want to run the code below?

# Standardize test score
# Group analysisdata by year
analysisdata<-group_by(summercamp)

# Create a new variable with mutate
analysisdata<-mutate(analysisdata, test_score=(test_year_5-mean(test_year_5))/sd(test_year_5))
# show mean of test_score
print(paste("Mean of test score:",mean(analysisdata$test_year_5)))
## [1] "Mean of test score: NA"
#show sd of test_score
print(paste("SD of test score:",sd(analysisdata$test_year_5)))
## [1] "SD of test score: NA"

The code is meant to standardize the test scores, i.e. to convert them to z-scores. Standardization is a common practice because it puts variables on a common scale, which makes them easier to compare and interpret.

Here's a breakdown of each step in the code:

  1. Group the data: group_by(summercamp) is intended to prepare the data for within-group operations (the comment says "by year"), but as written no grouping variable is supplied, so the standardization below is computed over the full sample.

  2. Standardize test scores: mutate(analysisdata, test_score = (test_year_5 - mean(test_year_5)) / sd(test_year_5)) creates a new variable test_score equal to the z-score of the year 5 test, (x - mean(x)) / sd(x), which measures how many standard deviations an observation lies from the mean.

  3. Print the mean and SD: the print statements report the raw test_year_5 and return NA because test_year_5 has a few missing values and mean()/sd() are called without na.rm = TRUE. Had they printed the standardized test_score (computed with na.rm = TRUE), the values would be 0 and 1 by construction.

Standardizing the scores makes them comparable across groups or years, especially if the scales or distributions of the test differ, and it lets us interpret treatment effects in standard-deviation units rather than raw test points.
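For reference, a sketch of the within-year standardization described above, run on the long school_data from part (a) and handling the missing scores explicitly (the na.rm = TRUE calls are our addition):

# Standardize test_score within each year, then check mean ~0 and sd ~1 per year
school_data %>%
  group_by(year) %>%
  mutate(test_score_std = (test_score - mean(test_score, na.rm = TRUE)) /
                            sd(test_score, na.rm = TRUE)) %>%
  summarise(mean_std = mean(test_score_std, na.rm = TRUE),
            sd_std   = sd(test_score_std, na.rm = TRUE))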

f. Create a bar chart of pre-summer school test scores (in SD) by summer school attendance. Is there evidence of selection bias?

# Load patchwork 
# library("patchwork")
# Create raw chart element
rawchart <- ggplot(analysisdata %>% filter()) +
          theme_classic()
p2 <- rawchart +
    geom_bar(aes(x = as.factor(summercamp), y = test_score),
        stat = "summary", fun = "mean") +
    labs(y = "Test Score Year 5", x = "Attended Summer School")

To assess evidence of selection bias, compare the pre-treatment (year 5) test scores of those who attended summer school with those who did not: a noticeably higher bar for attendees would suggest positive selection into the program. (Note that, as run here, the chart below fails because test_score ended up entirely missing after the standardization step in part (e); the balancing table in part (h) provides related numerical evidence on pre-treatment differences.)

# Load the necessary packages
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("patchwork")

library(ggplot2)
library(dplyr)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.2
# Assuming 'analysisdata' is prepared with standardized test scores and relevant columns

# Create a bar chart of pre-summer school test scores (in SD) and summer school attendance
bar_chart <- ggplot(analysisdata %>% filter(), aes(x = as.factor(summercamp), y = test_score)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(y = "Test Score Year 5 (Standardized)", x = "Attended Summer School") +
  theme_classic()

# Display the bar chart
bar_chart
## Warning: Removed 3491 rows containing non-finite values (`stat_summary()`).
## Warning in max(f): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_summary()`
## Caused by error in `seq_len()`:
## ! argument must be coercible to non-negative integer

g. Denote, formally (i.e. using expected values and potential outcomes notation), how the selection bias arises in the case of a naive comparison between those who attend summer school and those who do not.
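A sketch of the standard decomposition, writing \(Y_{i}\) for the test score and \(D_{i} = 1\) for summer school attendance, with potential outcomes \(Y_{1i}, Y_{0i}\):

\[E[Y_{i} \mid D_{i} = 1] - E[Y_{i} \mid D_{i} = 0] = \underbrace{E[Y_{1i} - Y_{0i} \mid D_{i} = 1]}_{\text{ATT}} + \underbrace{E[Y_{0i} \mid D_{i} = 1] - E[Y_{0i} \mid D_{i} = 0]}_{\text{selection bias}}\]

The naive comparison equals the average effect on attendees plus the selection-bias term, which is nonzero whenever children who would have scored differently even without the program (for example, those with richer or more educated parents) are more likely to enroll.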

h. What can we conclude regarding selection bias after the table generated by:

# Load libraries
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.3.2
library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
# Filter and modify data
testdata<-filter(analysisdata)
testdata<-ungroup(testdata)
testdata<-mutate(testdata,Treated=ifelse(summercamp==1,"Summer Camp","No Summer Camp"))
testdata<-select(testdata,female,parental_schooling,parental_lincome,test_score,Treated)
testdata<-rename(testdata,`Female`=female,
          `Parental schooling (years)`=parental_schooling,
          `Parental income (log)`=parental_lincome,
          `Test Score`=test_score)
     
# Table with balancing test
datasummary_balance(~Treated,
                    data = testdata,
                    title = "Balance of pre-treatment variables",
                    notes = "Notes: Goya, Goya, Universidad!",
                    fmt= '%.5f',
                    dinm_statistic = "p.value")
## Warning: These variables were omitted because they are entirely missing: Test
## Score.
Balance of pre-treatment variables

                             No Summer Camp (N=1889)      Summer Camp (N=1602)
                             Mean        Std. Dev.        Mean        Std. Dev.     Diff. in Means   p
Female                       0.50715     0.50008          0.52497     0.49953       0.01782          0.29384
Parental income (log)        14.35118    0.51438          14.81235    0.78823       0.46117          0.00000

Parental schooling (years)   N           Pct.             N           Pct.
  10                         364         19.3             176         11.0
  11                         1194        63.2             732         45.7
  12                         263         13.9             379         23.7
  13                         47          2.5              186         11.6
  14                         15          0.8              65          4.1
  15                         1           0.1              33          2.1
  16                         1           0.1              18          1.1
  17                         1           0.1              3           0.2
  18                         0           0.0              4           0.2
  20                         0           0.0              2           0.1
  22                         0           0.0              1           0.1
  23                         0           0.0              1           0.1
  NA                         3           0.2              2           0.1

Notes: Goya, Goya, Universidad!

Based on the information provided in the table, let’s interpret the results regarding selection bias:

  1. Female (Binary Variable):
    • The p-value associated with the test for the difference in proportions of females between the treated and control groups is 0.29384.
    • The non-significant p-value suggests that there is no strong evidence of a significant difference in the proportion of females between the treated and control groups. This is a positive indication of balance on the variable “Female.”
  2. Parental Income (Logarithmic Variable):
    • The p-value associated with the test for the difference in means of parental income (log) between the treated and control groups is 0.00000.
    • The significant p-value suggests that there is a significant difference in mean parental income (log) between the treated and control groups. This may indicate potential imbalance on this variable.
  3. Parental Schooling (Categorical Variable):
    • The table reports the distribution of parental schooling years for each group rather than a single p-value.
    • The distribution for the Summer Camp group is visibly shifted toward more years of schooling (for example, 23.7% vs. 13.9% with 12 years and 11.6% vs. 2.5% with 13 years), which points to imbalance on this variable as well.

Conclusions: - The variable “Female” appears to be balanced between the treated and control groups, as indicated by the non-significant p-value. - The variable “Parental Income” shows a significant difference in means between the groups, indicating imbalance on this variable. - The distribution of “Parental Schooling” is likewise shifted toward more educated parents in the treated group, reinforcing the evidence of positive selection into summer school.

Considerations: - While balancing tests on observed characteristics are informative, they cannot address selection bias resulting from unobserved characteristics. - If there is significant imbalance on observed variables or concerns about unobserved characteristics, researchers might need to employ additional techniques such as matching, weighting, or instrumental variables to mitigate selection bias.

In summary, the balancing tests show balance on “Female” but clear imbalance on “Parental Income” and a schooling distribution tilted toward more educated parents among attendees, so a naive comparison of attendees and non-attendees is likely to suffer from selection bias.

i. Reproduce the same table for students receiving a reminder letter and those who do not. What can you conclude about the letter assignment: does it appear to be “as good as random”?

To reproduce a similar table for students receiving a reminder letter and those who do not, and to assess whether the assignment of the letter appears to be as good as random, you can modify the code accordingly. Here’s an example:

# Assuming your dataset is named 'summercamp'

# Load libraries
library(modelsummary)
library(estimatr)

# Filter and modify data for the letter assignment
letter_data <- summercamp %>%
  filter(!is.na(letter)) %>%
  ungroup() %>%
  mutate(Letter_Assigned = ifelse(letter == 1, "Received Letter", "No Letter"))

# Select relevant variables
letter_data <- select(letter_data, female, parental_schooling, parental_lincome, test_year_5, Letter_Assigned)

# Rename columns for clarity
letter_data <- rename(letter_data,
                      Female = female,
                      `Parental schooling (years)` = parental_schooling,
                      `Parental income (log)` = parental_lincome,
                      `Test Score` = test_year_5)

# Table with balancing test for letter assignment
datasummary_balance(~Letter_Assigned,
                    data = letter_data,
                    title = "Balance of pre-treatment variables for Letter Assignment",
                    notes = "Notes: Goya, Goya, Universidad!",
                    fmt = '%.5f',
                    dinm_statistic = "p.value")
Balance of pre-treatment variables for Letter Assignment

                             No Letter (N=2629)           Received Letter (N=862)
                             Mean        Std. Dev.        Mean        Std. Dev.     Diff. in Means   p
Female                       0.51350     0.49991          0.52088     0.49985       0.00738          0.70691
Parental income (log)        14.56264    0.70343          14.56333    0.66280       0.00069          0.97914
Test Score                   2.26331     0.68468          2.24548     0.68486       -0.01783         0.50741

Parental schooling (years)   N           Pct.             N           Pct.
  10                         417         15.9             123         14.3
  11                         1441        54.8             485         56.3
  12                         487         18.5             155         18.0
  13                         162         6.2              71          8.2
  14                         60          2.3              20          2.3
  15                         29          1.1              5           0.6
  16                         16          0.6              3           0.3
  17                         4           0.2              0           0.0
  18                         4           0.2              0           0.0
  20                         2           0.1              0           0.0
  22                         1           0.0              0           0.0
  23                         1           0.0              0           0.0
  NA                         5           0.2              0           0.0

Notes: Goya, Goya, Universidad!

This code filters the data to include only individuals for whom the reminder letter information is available, and it then performs balancing tests for the variables between those who received the letter and those who did not.

Interpretation is analogous to part (h), but here none of the differences is statistically significant: the p-values are 0.707 for Female, 0.979 for parental income, and 0.507 for the year 5 test score, and the parental schooling distributions are very similar across the two groups.

Because all of these p-values are high and the observed characteristics are well balanced, the letter assignment does appear to be “as good as random” with respect to observables. Low p-values would instead have suggested that the assignment was related to background characteristics, pointing to problems with the randomization or to unobserved factors influencing who received the letter.

j. Run an OLS estimation relating letter and summer camp attendance, including a set of sensible controls.

# Assuming your dataset is named 'summercamp'

# Specify the OLS model with controls
ols_model <- lm(summercamp ~ letter + female + parental_schooling + parental_lincome + test_year_5, data = summercamp)

# Display the summary of the regression results
summary(ols_model)
## 
## Call:
## lm(formula = summercamp ~ letter + female + parental_schooling + 
##     parental_lincome + test_year_5, data = summercamp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0476 -0.3027 -0.1600  0.3387  0.9826 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.510869   0.266010  -9.439  < 2e-16 ***
## letter                0.444647   0.016768  26.518  < 2e-16 ***
## female                0.010078   0.014467   0.697 0.486082    
## parental_schooling11 -0.081209   0.023737  -3.421 0.000631 ***
## parental_schooling12  0.031332   0.032488   0.964 0.334900    
## parental_schooling13  0.112763   0.044643   2.526 0.011585 *  
## parental_schooling14  0.077468   0.062571   1.238 0.215770    
## parental_schooling15  0.174233   0.088173   1.976 0.048230 *  
## parental_schooling16  0.066502   0.112892   0.589 0.555851    
## parental_schooling17 -0.121293   0.221720  -0.547 0.584376    
## parental_schooling18  0.079581   0.223129   0.357 0.721368    
## parental_schooling20 -0.114954   0.312599  -0.368 0.713092    
## parental_schooling22 -0.240853   0.436220  -0.552 0.580891    
## parental_schooling23 -0.334939   0.438564  -0.764 0.445087    
## parental_schoolingNA  0.022051   0.191953   0.115 0.908551    
## parental_lincome      0.198561   0.019967   9.945  < 2e-16 ***
## test_year_5          -0.003927   0.013998  -0.281 0.779094    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.426 on 3468 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.2729, Adjusted R-squared:  0.2696 
## F-statistic: 81.35 on 16 and 3468 DF,  p-value: < 2.2e-16

k. Run a regression without controls that allows you to satisfy the Gauss-Markov assumptions, with standardized test scores after summer camp (i.e. year 6) as the dependent variable. What is the interpretation of this effect? Is it an ATE or an ATT?

# Assuming your dataset is named 'summercamp'

# Specify the regression model without controls
regression_model <- lm(test_year_5 ~ letter, data = summercamp)

# Display the summary of the regression results
summary(regression_model)
## 
## Call:
## lm(formula = test_year_5 ~ letter, data = summercamp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.53293 -0.45889  0.00017  0.45737  2.44417 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.26331    0.01337 169.322   <2e-16 ***
## letter      -0.01783    0.02689  -0.663    0.507    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6847 on 3483 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.0001262,  Adjusted R-squared:  -0.0001608 
## F-statistic: 0.4397 on 1 and 3483 DF,  p-value: 0.5073
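For reference, a minimal sketch of the regression the prompt literally describes (year 6 standardized scores regressed on summer camp attendance), built from the long school_data of part (a); the na.rm handling is our assumption:

# Standardize within year, keep the post-treatment year (year 6),
# and regress the standardized score on summer camp attendance alone
year6 <- school_data %>%
  group_by(year) %>%
  mutate(test_score_std = (test_score - mean(test_score, na.rm = TRUE)) /
                            sd(test_score, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(year == 6)

summary(lm(test_score_std ~ summercamp, data = year6))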

l. Run a regression with good controls that allows you to satisfy the Gauss-Markov assumptions, with standardized test scores after summer camp (i.e. year 6) as the dependent variable. Is the result for the estimator of interest different from (k)?

# Assuming your dataset is named 'summercamp'

# Specify the regression model with good controls
regression_model_controls <- lm(test_year_5 ~ letter + female + parental_schooling + parental_lincome, data = summercamp)

# Display the summary of the regression results
summary(regression_model_controls)
## 
## Call:
## lm(formula = test_year_5 ~ letter + female + parental_schooling + 
##     parental_lincome, data = summercamp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1422 -0.3344  0.0110  0.3413  2.2440 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.292261   0.297936 -24.476  < 2e-16 ***
## letter               -0.026582   0.020332  -1.307 0.191164    
## female                0.012060   0.017546   0.687 0.491923    
## parental_schooling11  0.125866   0.028710   4.384  1.2e-05 ***
## parental_schooling12  0.151731   0.039319   3.859 0.000116 ***
## parental_schooling13  0.127148   0.054104   2.350 0.018825 *  
## parental_schooling14  0.004033   0.075891   0.053 0.957625    
## parental_schooling15 -0.048035   0.106940  -0.449 0.653331    
## parental_schooling16 -0.385699   0.136768  -2.820 0.004828 ** 
## parental_schooling17 -0.252976   0.268886  -0.941 0.346858    
## parental_schooling18 -0.514614   0.270487  -1.903 0.057183 .  
## parental_schooling20 -1.101781   0.378683  -2.910 0.003643 ** 
## parental_schooling22 -0.756479   0.528926  -1.430 0.152745    
## parental_schooling23 -1.208899   0.531528  -2.274 0.023004 *  
## parental_schoolingNA  0.140278   0.232804   0.603 0.546843    
## parental_lincome      0.648890   0.021566  30.089  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5166 on 3469 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.4331, Adjusted R-squared:  0.4306 
## F-statistic: 176.6 on 15 and 3469 DF,  p-value: < 2.2e-16

Exercise 5

Imagine you own a Snickers store. An employee suggests giving “little gifts” (i.e. key rings) to clients to make them return to the store. You have 200 stores; an economist thus creates an experiment to evaluate the effect of the “little gifts” on revenues. The data of this little experiment is given by:

# library(pacman)
p_load(tidyverse, glue)


#Clear the environment
rm(list = ls())

# experimental data at the store level dataset
set.seed(22)
n_stores <- 200
true_gift_effect <- 100
noise <- 50
data_downstream <- tibble(store_id = 1:n_stores) %>% 
  mutate(  #mutate allows to create new_columns in data frame.
    # treatment (random in this case)
    gives_gift=rbinom(n_stores, 1, prob =  0.5),
    # return rate increased by 20% if given gifts
    return_rate=rnorm(n_stores, mean = 0.5, sd=0.1) + gives_gift*0.1,
    # outcome (influenced by return rate)
    # gifts impact revenue through return rate
    revenue= 50 + true_gift_effect*10*return_rate + rnorm(n_stores, mean=0, sd=noise)
  )

# plot to visualize the relationship
data_downstream %>% 
  mutate(treatment=ifelse(gives_gift==1, "gift", "no gift")) %>% 
  ggplot(aes(return_rate, revenue, color=treatment)) +
  geom_point() +
  labs(title=glue(" ")) + geom_rug()

a. Discuss the data depicted in the plot and draw a DAG of the relationship between gifts, return rates, and revenues.

# With ggdag:
library(ggdag)
## Warning: package 'ggdag' was built under R version 4.3.2
## 
## Attaching package: 'ggdag'
## The following object is masked from 'package:stats':
## 
##     filter
theme_set(theme_dag())

dagify(Revenue ~ Returns, Returns ~ Gifts) %>%
  ggdag(node_size = 23, text_size = 3) + theme_dag_blank()

Based on the plot above, the data show that when gifts are given the return rate increases, which in turn increases revenue.

b. Run a regression of gifts on return rates. Does the gift make customers return to the store?

reg <- lm(return_rate ~ gives_gift, data = data_downstream)
summary(reg)
## 
## Call:
## lm(formula = return_rate ~ gives_gift, data = data_downstream)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34220 -0.06444 -0.00332  0.06678  0.34309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.475498   0.009847  48.287  < 2e-16 ***
## gives_gift  0.112314   0.014762   7.608  1.1e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1037 on 198 degrees of freedom
## Multiple R-squared:  0.2262, Adjusted R-squared:  0.2223 
## F-statistic: 57.89 on 1 and 198 DF,  p-value: 1.101e-12

Giving a gift increases the return rate by 0.112, i.e. about 11.2 percentage points. The gift dummy explains about 22.6% of the variation in the return rate (R²).

c. What is the regression a well-trained economist would run after the experiment to know the effect of the gifts?

reg1 <- lm(revenue ~ return_rate + gives_gift + return_rate*gives_gift, data = data_downstream) 
summary(reg1)
## 
## Call:
## lm(formula = revenue ~ return_rate + gives_gift + return_rate * 
##     gives_gift, data = data_downstream)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.54  -32.72    0.02   30.15  132.32 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               40.83      20.19   2.022   0.0445 *  
## return_rate             1025.57      41.46  24.736   <2e-16 ***
## gives_gift                57.92      35.18   1.646   0.1013    
## return_rate:gives_gift  -102.65      63.66  -1.613   0.1084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.93 on 196 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8647 
## F-statistic: 424.9 on 3 and 196 DF,  p-value: < 2.2e-16
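Given part (d) below, one reading of “well-trained” is to leave return_rate out entirely, since it is itself an outcome of the gift (a mediator); a minimal sketch under that reading:

# return_rate is post-treatment, so the experimental comparison of interest
# regresses revenue on the randomized treatment alone
reg_gift <- lm(revenue ~ gives_gift, data = data_downstream)
summary(reg_gift)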

d. Describe formally, i.e. using potential outcomes notation, why controlling for “return rates” in the experimental regression creates a bias.

Let \(Gift_{i} \in \{0,1\}\) indicate whether store \(i\) gives gifts. We are interested in how this affects revenue, \(Revenue_{i}\). Define the potential outcomes:

\(Revenue_{0i}\) = revenue of store \(i\) if it does not give gifts; \(Revenue_{1i}\) = revenue of store \(i\) if it gives gifts.

The observed outcome is

\[Revenue_{i} = Revenue_{0i} + (Revenue_{1i} - Revenue_{0i})\,Gift_{i}\]

so that a naive comparison of means decomposes as

\[E[Revenue_{i} \mid Gift_{i} = 1] - E[Revenue_{i} \mid Gift_{i} = 0] = \underbrace{E[Revenue_{1i} - Revenue_{0i} \mid Gift_{i} = 1]}_{\text{average effect on treated stores}} + \underbrace{E[Revenue_{0i} \mid Gift_{i} = 1] - E[Revenue_{0i} \mid Gift_{i} = 0]}_{\text{selection bias}}\]

That is, the observed difference in average revenue equals the average causal effect for the stores that gave gifts plus a selection-bias term. Because gifts were randomly assigned in this experiment, the selection term is zero and the simple comparison identifies the causal effect. Conditioning on return_rate undoes this: return_rate is itself an outcome of the gift, so comparing treated and untreated stores with the same return rate means comparing groups for which \(E[Revenue_{0i} \mid Gift_{i} = 1, return] \neq E[Revenue_{0i} \mid Gift_{i} = 0, return]\); the selection-bias term reappears and the regression absorbs the very channel through which gifts raise revenue.