Exercise 1

Let math10 denote the percentage of students at a Michigan high school receiving a passing score on a standardized math test (see also Example 4.2 in Wooldridge). We are interested in estimating the effect of per-student spending on math performance. A simple model is

\[math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 \log(enroll) + \beta_3\, poverty + u,\]

where poverty is the percentage of students living in poverty.

  a. The variable lnchprg is the percentage of students eligible for the federally funded school lunch program. Why is this a sensible proxy variable for poverty?

Such a variable is sensible because students in the program have had to show evidence of need in order to participate, that is, signs of poverty. However, these signals may rest on self-reported information that is false, given the incentive to get into the program.

  b. The following table contains OLS estimates, with and without lnchprg as an explanatory variable.

  c. Explain why the effect of expenditures on math10 is lower in column (2) than in column (1). Is the effect in column (2) still statistically greater than zero?

The effect of expend on math10 is larger in column (1) because that regression does not control for other relevant factors, such as poverty. Once we control for poverty, the estimated effect of expenditures turns out to be smaller: column (1) suffered from omitted variable bias. Even so, the effect in column (2) is still statistically greater than zero at the 95% confidence level.
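
One way to see this formally is the omitted variable bias formula. Letting \(\hat{\beta}_1\) be the coefficient on \(\log(expend)\) in column (2) and \(\tilde{\beta}_1\) its counterpart in column (1),

\[\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_3\,\tilde{\delta}_1,\]

where \(\hat{\beta}_3\) is the (negative, \(-0.324\)) coefficient on lnchprg and \(\tilde{\delta}_1\) is the slope from regressing lnchprg on \(\log(expend)\) (and \(\log(enroll)\)). Since higher-spending districts tend to have less poverty, \(\tilde{\delta}_1 < 0\), so \(\hat{\beta}_3\,\tilde{\delta}_1 > 0\) and column (1) overstates the spending effect.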

  d. Explain why there is a change in sign for log(enroll) between the model in column (1) and in column (2).

Apparently, larger school enrollments really do imply lower math results once we control for poverty. Without that control, the relationship between enrollment and math results appears positive. Omitting the poverty control introduced a strong bias.

  e. It appears that pass rates are lower at larger schools, other factors being equal. Explain.

Probably because the more students a school has, the less attention each one receives (or the more freedom each one has, being one among many). Hence the effort devoted to each student's performance on the tests is smaller.

  f. Interpret the coefficient on lnchprg in column (2).

A one-percentage-point increase in students eligible for the lunch program (i.e., slightly more poverty) lowers the math pass rate by 0.324 percentage points.

  g. What do you make of the substantial increase in R2 from column (1) to column (2)?

Adding a proxy for poverty helps explain much more of the variation in math test results.

  h. Suppose you conduct this same regression substituting the logarithm of expend and enroll with their levels and squares. How can you test which specification is a better predictor of math10? Describe the steps formally.

The steps are: estimate a new encompassing regression that, in addition to the variables of this regression, also includes expend and enroll in levels together with their squares. Then carry out a joint hypothesis test (an F test) of whether these added variables are significant. If they are, that is a sign of a specification error, and it points to the levels-and-squares form as the better model. (The two specifications are non-nested, so this comprehensive-model approach, or alternatively the Davidson-MacKinnon test, is the formal way to compare them.)
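
A minimal sketch of these steps in R, assuming the MEAP93 data are available as meap93 in the wooldridge package (the same package used in Exercise 3 below):

library(wooldridge)  # provides the meap93 data
# restricted model (logs) and encompassing model (logs plus levels and squares)
m_logs   <- lm(math10 ~ log(expend) + log(enroll) + lnchprg, data = meap93)
m_encomp <- lm(math10 ~ log(expend) + log(enroll) + lnchprg +
                 expend + I(expend^2) + enroll + I(enroll^2),
               data = meap93)
# joint F test on the added levels and squares; rejection signals misspecification
anova(m_logs, m_encomp)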

Exercise 2

The following equation explains weekly hours of television viewing by a child in terms of the child’s age, mother’s education, father’s education, and number of siblings:

\[tvhours^{*} = \beta_0 + \beta_1\, age + \beta_2\, motheduc + \beta_3\, fatheduc + \beta_4\, sibs + u\]

We are worried that tvhours* is measured with error in our survey. Let tvhours denote the reported hours of television viewing per week.

  a. Do you think the CEV assumptions are likely to hold? Explain.

Probably the measurement error is not random: there are likely systematic errors in the reported TV hours. Either because the older the children are, the less attention they get, so the number of hours they spend watching TV is simply not known; or because respondents may spend more or less time with their children and therefore have more or less information about what their children do during the day.

  b. What do the classical errors-in-variables (CEV) assumptions require in this application?

They require, of course, a random measurement error. They further require that this error be uncorrelated with the explanatory variables; specifically, that the error in reported TV hours not depend on age, mother's or father's education, or the number of siblings. Most likely this does not hold here.

  c. Show formally the bias that the measurement error in tvhours* generates.

Writing \(tvhours = tvhours^{*} + e_0\), the estimable equation becomes \(tvhours = \beta_0 + \beta_1\, age + \cdots + (u + e_0)\), so

\[plim(\hat{\beta}_i)=\beta_i+\frac{Cov(x_i,e_0)}{Var(x_i)},\]

where \(Cov(x_i,e_0)\neq 0\) and \(e_0\) arises from the measurement error in the dependent variable tvhours*. (Under the CEV assumptions \(Cov(x_i,e_0)=0\), and OLS would remain unbiased, only less precise.)

Exercise 3

You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. JTRAIN2 is the outcome of a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves largely determine whether they participate in job training (self-select). The data sets cover the same time period.

  a. In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in JTRAIN3? Why do you think there is such a big difference?
# fraction of each sample that received job training
library(pacman)
p_load(wooldridge,tidyverse)
frac_train2<-jtrain2 %>%
  summarise(frac=sum(train==1)/n())

frac_train3<-jtrain3 %>%
  summarise(frac=sum(train==1)/n())
            
print(frac_train2)
##        frac
## 1 0.4157303
print(frac_train3)
##         frac
## 1 0.06915888

The difference is that in the experimental data set (JTRAIN2) there is no self-selection, so the share of men receiving the treatment is controlled by the experimenters. In the data set with self-selection (JTRAIN3), the vast majority decide not to participate, and that share is not controlled.

  b. Using JTRAIN2, run a simple regression of re78 on train. What is the estimated effect of participating in job training on real earnings?
reg_6b1<-lm(re78 ~ train, data = jtrain2)
summary(reg_6b1)
## 
## Call:
## lm(formula = re78 ~ train, data = jtrain2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.349 -4.555 -1.829  2.917 53.959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5548     0.4080  11.162  < 2e-16 ***
## train         1.7943     0.6329   2.835  0.00479 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.58 on 443 degrees of freedom
## Multiple R-squared:  0.01782,    Adjusted R-squared:  0.01561 
## F-statistic: 8.039 on 1 and 443 DF,  p-value: 0.004788

On average, participation in job training raises 1978 real earnings by about $1,794 (the coefficient is 1.794, with earnings measured in thousands of dollars).

  c. Now add as controls to the regression in part (b) the variables re74, re75, educ, age, black, and hisp. Does the estimated effect of job training on re78 change much? Why? (Hint: Remember that these are experimental data.)
reg_6c<-lm(re78 ~ train+re74+re75+educ+age+black+hisp, data = jtrain2)
summary(reg_6c)
## 
## Call:
## lm(formula = re78 ~ train + re74 + re75 + educ + age + black + 
##     hisp, data = jtrain2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.890 -4.424 -1.661  3.012 54.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.67407    2.42272   0.278  0.78097   
## train        1.68005    0.63086   2.663  0.00803 **
## re74         0.08331    0.07653   1.089  0.27694   
## re75         0.04677    0.13068   0.358  0.72062   
## educ         0.40360    0.17485   2.308  0.02145 * 
## age          0.05435    0.04382   1.240  0.21560   
## black       -2.18007    1.15550  -1.887  0.05987 . 
## hisp         0.14356    1.54092   0.093  0.92582   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.499 on 437 degrees of freedom
## Multiple R-squared:  0.05476,    Adjusted R-squared:  0.03962 
## F-statistic: 3.617 on 7 and 437 DF,  p-value: 0.0008396

Yes, although not by much: the estimate falls from 1.79 to 1.68. Because the data are experimental, train is essentially uncorrelated with the added controls, so including them barely affects the estimated training effect.
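
A quick way to verify this orthogonality is a covariate balance check, sketched below: regress train on the controls and look at the overall F statistic (with successful randomization, the controls should be jointly insignificant).

balance_check <- lm(train ~ re74 + re75 + educ + age + black + hisp, data = jtrain2)
summary(balance_check)  # overall F test assesses joint significance of the controls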

  d. Do the regressions in parts (b) and (c) using the data in JTRAIN3, reporting only the estimated coefficients on train, along with their t statistics. What is the effect now of controlling for the extra factors, and why?
reg_6d1<-lm(re78 ~ train, data = jtrain3)
summary(reg_6d1)
## 
## Call:
## lm(formula = re78 ~ train, data = jtrain3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.554  -9.732  -0.866   7.705  99.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.5539     0.3036   70.98   <2e-16 ***
## train       -15.2048     1.1546  -13.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.15 on 2673 degrees of freedom
## Multiple R-squared:  0.06092,    Adjusted R-squared:  0.06057 
## F-statistic: 173.4 on 1 and 2673 DF,  p-value: < 2.2e-16
reg_6d2<-lm(re78 ~ train+re74+re75+educ+age+black+hisp, data = jtrain3)
summary(reg_6d2)
## 
## Call:
## lm(formula = re78 ~ train + re74 + re75 + educ + age + black + 
##     hisp, data = jtrain3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.246  -4.355  -0.465   3.770 110.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.64755    1.30093   1.266 0.205465    
## train        0.21323    0.85339   0.250 0.802716    
## re74         0.28098    0.02790  10.071  < 2e-16 ***
## re75         0.56929    0.02757  20.648  < 2e-16 ***
## educ         0.52006    0.07522   6.914 5.89e-12 ***
## age         -0.07507    0.02047  -3.667 0.000251 ***
## black       -0.64771    0.49193  -1.317 0.188056    
## hisp         2.20261    1.09279   2.016 0.043944 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.08 on 2667 degrees of freedom
## Multiple R-squared:  0.5856, Adjusted R-squared:  0.5845 
## F-statistic: 538.4 on 7 and 2667 DF,  p-value: < 2.2e-16

Using JTRAIN3, the simple regression gives a coefficient on train of -15.20 (t = -13.17). Once we control for the other variables, the effect becomes much smaller (0.21, t = 0.25) and is no longer statistically significant. The big difference is that here the estimates are biased by self-selection: the men who chose to enter training were disproportionately low earners.

  e. Define avgre = (re74 + re75)/2. Find the sample averages, standard deviations, and minimum and maximum values in the two data sets. Are these data sets representative of the same populations in 1978?
jtrain2_me<- jtrain2 %>%
  mutate(avgre=(re74+re75)/2)

jtrain3_me<-jtrain3 %>%
  mutate(avgre=(re74+re75)/2)

summary(jtrain2_me$avgre)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.740   1.492  24.376
summary(jtrain3_me$avgre)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.829  16.873  18.040  25.257 146.901
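
Since summary() does not report standard deviations, a short complement for the full set of requested statistics (output omitted):

jtrain2_me %>% summarise(mean = mean(avgre), sd = sd(avgre), min = min(avgre), max = max(avgre))
jtrain3_me %>% summarise(mean = mean(avgre), sd = sd(avgre), min = min(avgre), max = max(avgre))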

These are not at all samples representative of the same population: mean avgre is about $1,740 in JTRAIN2 versus about $18,040 in JTRAIN3, and the maximum values also differ enormously.

  f. Almost 96% of men in the data set JTRAIN2 have avgre less than $10,000. Using only these men, run the regression of re78 on train, re74, re75, educ, age, black, hisp and report the training estimate and its t statistic. Run the same regression for JTRAIN3, using only men with avgre < 10. For the subsample of low-income men, how do the estimated training effects compare across the experimental and nonexperimental data sets?
est_3f1<-lm(re78 ~ train+re74+re75+educ+age+black+hisp,data=subset(jtrain2_me,avgre<10))

est_3f2<-lm(re78 ~ train+re74+re75+educ+age+black+hisp,data=subset(jtrain3_me,avgre<10))

summary(est_3f1)$coefficients
##                Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)  1.73690700 2.44601497  0.7100966 0.47803922
## train        1.58303346 0.63244599  2.5030334 0.01269309
## re74        -0.11675896 0.12378300 -0.9432552 0.34609393
## re75         0.17320750 0.18879153  0.9174538 0.35943269
## educ         0.35820920 0.17590623  2.0363645 0.04234208
## age          0.04399659 0.04387515  1.0027679 0.31655162
## black       -2.38353435 1.16779043 -2.0410634 0.04187062
## hisp        -0.36940412 1.55104886 -0.2381641 0.81187027
summary(est_3f2)$coefficients
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  3.44800519 2.14135822  1.6101954 1.077721e-01
## train        1.84445204 0.89310880  2.0652042 3.924360e-02
## re74         0.31311314 0.06919345  4.5251849 7.004816e-06
## re75         0.77434945 0.07556649 10.2472600 3.651289e-23
## educ         0.32831064 0.11034426  2.9753305 3.019871e-03
## age         -0.08315209 0.03068175 -2.7101486 6.877528e-03
## black       -1.97330709 0.72071558 -2.7379831 6.326911e-03
## hisp        -1.10072192 1.43183503 -0.7687491 4.422820e-01

The effects differ: in the experiment, the effect of train on re78 (1.58, t = 2.50) is smaller than in the self-selected sample (1.84, t = 2.07). The nonexperimental data still overstate the effect, though far less dramatically than in the full sample, since restricting attention to comparable low-income men removes most of the selection bias.

  g. Now use each data set to run the simple regression of re78 on train, but only for men who were unemployed in 1974 and 1975. How do the training estimates compare now?
est_3g1<-lm(re78 ~ train, data=subset(jtrain2_me, unem74==1&unem75==1))

est_3g2<-lm(re78 ~ train, data=subset(jtrain3_me, unem74==1&unem75==1))
            
summary(est_3g1)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 4.112421  0.4300141 9.563455 6.434443e-19
## train       1.842065  0.6892051 2.672738 7.968044e-03
summary(est_3g2)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 2.151186  0.5604937 3.838021 1.546510e-04
## train       3.803299  0.8837758 4.303465 2.356454e-05

In the experimental data set, the estimate changes little whether or not the controls are included (1.84 here versus 1.68 with controls), which indicates good randomization. Under self-selection, omitting the controls biases the estimated effect of train very strongly (3.80 here versus 0.21 with controls in part (d)).

  h. Using your findings from the previous regressions, discuss the potential importance of having comparable populations underlying comparisons of experimental and nonexperimental estimates.

The results underscore the importance of randomization. It yields more reliable estimates and implies a lower risk of violating MLR.4 (the zero conditional mean assumption). With self-selection, the key sample statistics linking sample and population are badly distorted, the estimates are far more biased, and a violation of MLR.4 is much more likely.

Exercise 4

Does attending a summer school improve test scores? This fictitious setting is as follows:

  • In the summer break between year 5 and year 6 (roughly corresponding to age 10), there is an optional summer school.

  • The summer school could focus on the school curriculum, or on skills that lead to improved schooling outcomes (for example “grit” as in Alan et al. (2019)).

  • The summer school is free, but enrollment requires active involvement by parents.

  • We are interested in whether participation in summer school improves child outcomes.

We have a dataset to study the research question, including:

Information about person id, school id, an indicator variable that takes the value of 1 if the individual participated in the summer school, information about gender, parental income and parental schooling, and test scores in year 5 (before the treatment) and year 6. The dataset also contains information about whether the individual received a reminder letter.

  a. Load summercamp.dta into R (the data are in the class folder) and store them in an object called summercamp. Then:
library(haven)
summercamp <- read_dta("~/Downloads/summercamp.dta")
# make data tidy (make long)
school_data<-summercamp%>%
       pivot_longer(
         cols = starts_with("test_year"),
         names_to = "year",
         names_prefix = "test_year_",
         names_transform = list(year = as.integer),
         values_to = "test_score",
       )
# Load skimr
library(skimr)
# Use skim() to skim the data
skim(school_data)
Data summary
Name school_data
Number of rows 6982
Number of columns 9
_______________________
Column type frequency:
character 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
parental_schooling 0 1 2 2 0 13 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
person_id 0 1 1746.00 1007.84 1.00 873.25 1746.00 2618.75 3491.00 ▇▇▇▇▇
school_id 0 1 15.66 8.67 1.00 8.00 15.00 23.00 30.00 ▇▇▇▇▇
summercamp 0 1 0.46 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
female 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
parental_lincome 0 1 14.56 0.69 12.67 14.11 14.52 14.95 19.45 ▂▇▁▁▁
letter 0 1 0.25 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
year 0 1 5.50 0.50 5.00 5.00 5.50 6.00 6.00 ▇▁▁▁▇
test_score 11 1 2.39 0.71 -0.27 1.92 2.35 2.86 4.98 ▁▃▇▃▁
  b. Correlate missing values for parental_schooling with parental income (hint: create a dummy for missing values). Is there evidence that the missing values are not random?
# convert parental_schooling to numeric
school_data$parental_schooling<-as.numeric(school_data$parental_schooling)
## Warning: NAs introduced by coercion
# create a dummy for missing values
school_data$missing_parental_schooling <- ifelse(is.na(school_data$parental_schooling), 1, 0)

# correlation test between missingness and parental income
cor.test(school_data$missing_parental_schooling, school_data$parental_lincome)
## 
##  Pearson's product-moment correlation
## 
## data:  school_data$missing_parental_schooling and school_data$parental_lincome
## t = 0.051596, df = 6980, p-value = 0.9589
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02283972  0.02407419
## sample estimates:
##          cor 
## 0.0006175735

The result is ρ ≈ 0.0006 (p-value 0.96), which indicates essentially no correlation between the two variables: there is no evidence that the missing values are non-random, at least with respect to parental income.

  c. Assume all missing values are “missing at random”. Hence drop the NA rows.
# drop rows with missing parental_schooling
school_data <- school_data[!is.na(school_data$parental_schooling), ]
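
An equivalent, more idiomatic alternative using tidyr::drop_na() from the tidyverse loaded above:

# same result with tidyr
school_data <- school_data %>% drop_na(parental_schooling)
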
  d. Why do we want to run the code below?
# Standardize test score
# Group analysisdata by year
analysisdata<-group_by(school_data,year)
# Create a new variable with mutate
analysisdata<-mutate(analysisdata, test_score=(test_score-mean(test_score))/sd(test_score))
# show mean of test_score
print(paste("Mean of test score:",mean(analysisdata$test_score)))
## [1] "Mean of test score: NA"
#show sd of test_score
print(paste("SD of test score:",sd(analysisdata$test_score)))
## [1] "SD of test score: NA"

It standardizes the test score within each year, which makes effects comparable across populations and lets us examine the distributions in standard-deviation units.
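
Note that the printed mean and SD are NA only because test_score still contains missing values, so mean(test_score) inside mutate() is NA and the standardized column becomes all NA. A sketch of the fix, restarting from school_data and adding na.rm = TRUE (the checks should then return approximately 0 and 1):

analysisdata <- group_by(school_data, year)
analysisdata <- mutate(analysisdata,
                       test_score = (test_score - mean(test_score, na.rm = TRUE)) /
                         sd(test_score, na.rm = TRUE))
print(paste("Mean of test score:", mean(analysisdata$test_score, na.rm = TRUE)))
print(paste("SD of test score:", sd(analysisdata$test_score, na.rm = TRUE)))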

Exercise 5

  a. Imagine you own a Snickers store. An employee suggests giving “little gifts” (e.g., key rings) to clients to make them return to the store. You have 200 stores; an economist therefore designs an experiment to evaluate the effect of the “little gifts” on revenues.
library(pacman)
p_load(tidyverse, glue)

# experimental dataset at the store level
set.seed(22)
n_stores <- 200
true_gift_effect <- 100
noise <- 50
data_downstream <- tibble(store_id = 1:n_stores) %>% 
  mutate(  # mutate creates new columns in the data frame
    # treatment (random in this case)
    gives_gift=rbinom(n_stores, 1, prob =  0.5),
    # return rate increased by 20% if given gifts
    return_rate=rnorm(n_stores, mean = 0.5, sd=0.1) + gives_gift*0.1,
    # outcome (influenced by return rate)
    # gifts impact revenue through the return rate
    revenue= 50 + true_gift_effect*10*return_rate + rnorm(n_stores, mean=0, sd=noise)
  )

# plot to visualize the relationship
data_downstream %>% 
  mutate(treatment=ifelse(gives_gift==1, "gift", "no gift")) %>% 
  ggplot(aes(return_rate, revenue, color=treatment)) +
  geom_point() +
  labs(title=glue(" ")) + geom_rug()

  b. Discuss the data depicted in the plot and draw a DAG of the relationship between gifts, return rates, and revenues.

Apparently, giving gifts raises the return rate and, through it, Snickers earns more revenue. The corresponding DAG is Gifts → Returns → Revenue (drawn below), so the return rate sits on the causal path from gifts to revenue.

# With ggdag:
library(ggdag)
## 
## Attaching package: 'ggdag'
## The following object is masked from 'package:stats':
## 
##     filter
theme_set(theme_dag())

dagify(Revenue ~ Returns, Returns ~ Gifts) %>%
  ggdag(node_size = 23, text_size = 3) + theme_dag_blank()

  c. Run a regression of return rates on gifts. Does the gift make customers return to the store?
est_5b<-lm(return_rate ~ gives_gift,data=data_downstream)
summary(est_5b)
## 
## Call:
## lm(formula = return_rate ~ gives_gift, data = data_downstream)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34220 -0.06444 -0.00332  0.06678  0.34309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.475498   0.009847  48.287  < 2e-16 ***
## gives_gift  0.112314   0.014762   7.608  1.1e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1037 on 198 degrees of freedom
## Multiple R-squared:  0.2262, Adjusted R-squared:  0.2223 
## F-statistic: 57.89 on 1 and 198 DF,  p-value: 1.101e-12

Yes: on average, giving gifts raises the return rate by about 0.11, and the effect is highly significant (t = 7.61).

  d. What is the regression a well-trained economist would run after the experiment to know the effect of the gifts?
est_5c <- lm(revenue ~ return_rate + gives_gift + return_rate*gives_gift, data = data_downstream) 

summary(est_5c)
## 
## Call:
## lm(formula = revenue ~ return_rate + gives_gift + return_rate * 
##     gives_gift, data = data_downstream)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.54  -32.72    0.02   30.15  132.32 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               40.83      20.19   2.022   0.0445 *  
## return_rate             1025.57      41.46  24.736   <2e-16 ***
## gives_gift                57.92      35.18   1.646   0.1013    
## return_rate:gives_gift  -102.65      63.66  -1.613   0.1084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.93 on 196 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8647 
## F-statistic: 424.9 on 3 and 196 DF,  p-value: < 2.2e-16
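
Once return_rate is included, gives_gift is no longer statistically significant, which is consistent with the DAG above: the gift affects revenue only through the return rate. Since gives_gift was randomized, the total effect of the gift can also be read off the simple regression (a sketch; output omitted):

est_5d <- lm(revenue ~ gives_gift, data = data_downstream)  # total effect of the gift
summary(est_5d)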