library(pacman)
p_load(data.table, fixest, magrittr, wooldridge, multcomp, stargazer, lmtest, modelsummar, sandwich)

Ejercicio 1

Consider a linear model to explain monthly beer consumption:

\[beer = \beta_{0} + \beta_{1}inc + \beta_{2}price + \beta_{3}educ + \beta_{4}female + u \] \[ E(u | inc, price, educ, female) = 0\] \[ Var(u | inc, price, educ, female) = \sigma^2 inc^2\]

Write the transformed equation that has a homoscedastic error term. Show this is homoscedastic. \[ var(u|X) = \sigma^2 inc^2\] \[ E(u^2 | X)\] \[E\left( \frac{u^2}{inc^2} \right)\]

\[\frac{1}{u^2} E(u^2 | X)\] \[\frac{\sigma^2 {inc^2}}{inc^2}\] \[\sigma^2\]

Ejercicio 2

Using the data in GPA3, the following equation was estimated for the fall and second semester students:

\[\hat{trmgpa} = -2.12 + .900cumgpa + .0014tothrs + .0018 sat - .0039 hsperc + .351 female - .157 season\]

Here, trmgpa is term GPA, crsgpa is a weighted average of overall GPA in courses taken, cumgpa is GPA prior to the current semester, tothrs is total credit hours prior to the semester, sat is SAT score, hsperc is graduating percentile in high school class, female is a gender dummy, and season is a dummy variable equal to unity if the student’s sport is in season during the fall. The usual and heteroskedasticity-robust standard errors are reported in parentheses and brackets, respectively.

Do the variables crsgpa, cumgpa, and tothrs have the expected estimated effects? Which of these variables are statistically significant at the 5% level? Does it matter which standard errors are used?

Errores estándar t1,t2,t3:

t1 = (.9-1)/.175
  
t2 = (.193-1)/.64
  
t3 = (.0014-1)/.0012

cat(t1,t2,t3)

## -0.5714286 -1.260937 -832.1667

Errores robustos t1,t2,t3:

t1 = (.9-1)/.166
  
t2 = (.193-1)/.074
  
t3 = (.0014-1)/.0012

cat(t1,t2,t3)

## -0.6024096 -10.90541 -832.1667

\(|t| > |1.96|\) Por tanto no importa que errores uses.

Why does the hypothesis H0: bcrsgpa = 1 make sense? Test this hypothesis against the two-sided alternative at the 5% level, using both standard errors. Describe your conclusions. \(H_{0}: crspa = 1\)

t = (.9-1)/.175

cat(t)

## -0.5714286

Por tanto no se rechaza con error estandar \(H_{0}:\)

t = (.9-1)/.166

cat(t)

## -0.6024096

Por tanto no se rechaza con error robusto \(H_{0}:\)

Test whether there is an in-season effect on term GPA, using both standard errors. Does the significance level at which the null can be rejected depend on the standard error used?

Con el error estandar:

t = -.157/.098

cat(t)

## -1.602041

Con el error robusto:

t = -.157/.80

cat(t)

## -0.19625

Ejercicio 3

Consider a model at the employee level, where the unobserved variable \(f_i\) is a “firm effect” to each employee at a given firm i. The error term vi,e is specific to employee e at firm i. The composite error is \(u_i\),e = \(f_i\) + \(v_{i,e}\), such as in equation (8.28) in Wooldridge 7ed.

Assume that \(var(f_{i})=\sigma^2, var(v_{i,e}) = \sigma^2_{v}\); call this \(\sigma^2\).

Por tanto, es homocedastica.

Now suppose that for \(e \neq g, v_{i,e}\) and \(v_{i,g}\) are uncorrelated. Show that \(cov(u_{i,e}, u_{i,g}) = \sigma^2_{f}\).

To show that the covariance between \(u_{i,e}\) and \(u_{i,g}\) is \(\sigma^2_f\), we’ll start by calculating the covariance of these composite errors.

The composite error terms are defined as follows:

\(u_{i,e} = f_i + v_{i,e}\) \(u_{i,g} = f_i + v_{i,g}\)

Now, let’s calculate the covariance:

\(cov(u_{i,e}, u_{i,g}) = cov(f_i, f_i) + cov(f_i, v_{i,g}) + cov(v_{i,g}, f_i) + cov(v_{i,e},v_{i,g})\)

Given that \(v_{i,e}\) and \(v_{i,g}\) are uncorrelated (\(\text{cov}(v_{i,e}, v_{i,g}) = 0\) for \(e \neq g\)), and that the variance of \(f_i\) is \(\sigma^2\), we can simplify the expression:

\[cov(u_{i,e}, u_{i,g}) = var(f_i) + cov(v_{i,e},v_{i,g}) \]

\[ = \sigma^2 + 0\]

\[ = \sigma^2 \]

So, we’ve shown that the covariance between \(u_{i,e}\) and \(u_{i,g}\) is indeed \(\sigma^2\), which is the same as \(\sigma^2_f\).

Let \(\bar{u}_{i} = m_{i}^{-1} \sum{u_{i,e}}\) be the average oof the composite errors wihtin a firm. Show that \(var(\bar{u}_{i}) = \sigma^2_{f} + \frac{\sigma^2_{v}}{m_{i}}\)

To show that the variance of \(\bar{u}_i\) is \(\sigma^2_f + \frac{\sigma^2_v}{m_i}\), where \(\bar{u}_i\) is the average of the composite errors within a firm, we’ll calculate the variance of \(\bar{u}_i\):

\[var(\hat{u_{i}}) = var \left( \frac{1}{m} \sum{u_{i,e}} \right)\]

\[= \frac{1}{m^2} var\left( \sum{u_{i,e}} \right)\]

\[= \frac{1}{m^2} \sum\sum{cov(u_{i,e}, u{i,g})}\]

Given that \(u_{i,e}\) and \(u_{i,g}\) are independent for \(e \neq g\), we can simplify the expression:

\[var(\hat{u_i}) = \frac{1}{m^2_i} \sum\sum{cov(\sigma^2_f + \frac{\sigma^2_v}{m_i})}\]

\[= \frac{1}{m^2_i} \sum\sum{\sigma^2_f + \frac{1}{m^2_i} \sum\sum \frac{\sigma^2_v}{m_i}}\]

\[= \frac{1}{m^2_i} m_i^2 \sigma^2_f + \frac{1}{m^2_i} \frac{1}{m_i} \sigma^2_v\]

\[= \sigma^2_f + \frac{\sigma^2_v}{m_i}\]

So, we’ve shown that the variance of \(\bar{u}_i\) is indeed \(\sigma^2_f + \frac{\sigma^2_v}{m_i}\).

Discuss the relevance of part iii. for WLS estimation using data averaged at the firm level, where the weight used for observation \(i\) is the usual firm size.

Part iii. is relevant for Weighted Least Squares (WLS) estimation when using data averaged at the firm level, especially when the weight used for observation \(i\) is the usual firm size. Let’s discuss the relevance of part iii. in this context:

Firm-Level Averaging: In some cases, you may be interested in estimating regression models using aggregated data at the firm level. For example, you might have firm-level data, and you want to estimate a regression model where each firm contributes one observation, and the weight assigned to each firm is based on firm size (e.g., total sales, number of employees, market capitalization, etc.).
Incorporating Heteroscedasticity: When using aggregated firm-level data, you may still encounter heteroscedasticity, where the variance of the error term within each firm may vary. Part iii. deals with calculating the variance of the composite errors when using firm-level data.
Variance Components: The variance of the composite errors, as calculated in part iii., consists of two components:
- \(\sigma^2_f\): The variance of the “firm effect” \(f_i\). This represents the variation between firms and captures the firm-level heterogeneity. It is part of the total variance.
- \(\frac{\sigma^2_v}{m_i}\): The variance of the individual employee-specific errors \((v_{i,e})\) within a firm. This component reflects the variation within each firm.
Weighted Least Squares (WLS): When conducting WLS estimation using firm-level data, it is important to account for the heteroscedasticity present within each firm. You can assign weights to each firm’s observation based on the estimated variances. These weights are used to give more importance to firms with less within-firm variation, which is captured by \(frac{\sigma^2_v}{m_i}\). By incorporating these weights in the estimation, you can obtain more efficient and robust parameter estimates.
Relevance: Part iii. is relevant because it helps you understand how the variance components contribute to the overall variance when using firm-level data. It highlights that the overall variance of the composite errors is a combination of the variation between firms \(\sigma^2_f\) and the variation within each firm \(\frac{\sigma^2_v}{m_i}\). By acknowledging this structure, you can apply appropriate weights to mitigate the impact of within-firm variation when estimating regression models with firm-level data, improving the precision of parameter estimates.

Ejercicio 4

Use the data set wage1 for this exercise in R studio (pegue su script abajo).

Use OLS to estimate the model relating Wage with education, experience, and tenure.

data(wage1)

lm <- lm(wage ~ educ + exper + tenure, data = wage1)
summary(lm)

## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16

Compute in “manually” in R-studio the Breusch-Pagan Test, including the F-stat and LM.

residuals2 <- (lm$residuals)^2
lm2 <- lm(residuals2 ~ educ + exper + tenure, data = wage1)
summary(lm2)

## 
## Call:
## lm(formula = residuals2 ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.175  -8.019  -3.528  -0.133 202.389 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.70645    5.46116  -2.876  0.00419 ** 
## educ          1.57052    0.38420   4.088 5.04e-05 ***
## exper         0.11046    0.09033   1.223  0.22191    
## tenure        0.69315    0.16215   4.275 2.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.11 on 522 degrees of freedom
## Multiple R-squared:  0.08193,    Adjusted R-squared:  0.07665 
## F-statistic: 15.53 on 3 and 522 DF,  p-value: 1.09e-09

#La Rˆ2 es 0.08193, por lo que entonces sacamos el valor F
F <- ((0.08193)/(3)) / ((1-0.08193)/(526-3-1))

#Comparamos con el F de tablas: F-Stat_(526-3-1) = 5.13 (creo es este)
F > 5.13

## [1] TRUE

Por lo tanto, podemos rechazar H0 de que tenemos homocedasticidad. Verificamos aún así con el estadístico del multiplicador de Lagrange:

LM <- (526)*(0.08193)
print(LM)

## [1] 43.09518

valorP <- pchisq(q=LM, df=3, lower.tail=FALSE)
LM > valorP

## [1] TRUE

Show that, what you get in b. equals: (where est_0 is the object where you save your first regression)

# En caso de no tener instalado: install.packages("lmtest") Yo ya lo incluyo en los paquetes del inicio:

bptest(lm)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm
## BP = 43.096, df = 3, p-value = 2.349e-09

Nuestro valor en b) coincide con el de la función bptest

Compute manually the alternative White test. In the regression of U-hat2 on the predicted values and the squared predicted values for wage coming from a.

fittedvalues <- (predict(lm))
fittedvalues2 <- (predict(lm))^2

lm3 <- lm(residuals2 ~ fittedvalues + fittedvalues2, data = wage1)
summary(lm3)

## 
## Call:
## lm(formula = residuals2 ~ fittedvalues + fittedvalues2, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.955  -6.179  -3.146  -0.818 204.427 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.3576     5.6290   1.307 0.191754    
## fittedvalues   -2.8614     1.7642  -1.622 0.105423    
## fittedvalues2   0.4868     0.1352   3.602 0.000346 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.88 on 523 degrees of freedom
## Multiple R-squared:  0.09843,    Adjusted R-squared:  0.09499 
## F-statistic: 28.55 on 2 and 523 DF,  p-value: 1.705e-12

#Por lo tanto:
LM2 <- (526)*(0.09843)
print(LM2)

## [1] 51.77418

df <- 2
alpha <- 0.05
critical_value2 <- qchisq(1 - alpha, df)

LM2 > critical_value2

## [1] TRUE

Show that your results are the same using lmtest in R studio.

Me sale error con lmtest: Usaré un paquete adicional llamado whitestrap

#install.packages("whitestrap")
library(whitestrap)

## 
## Please cite as:

## Lopez, J. (2020), White's test and Bootstrapped White's test under the methodology of Jeong, J., Lee, K. (1999) package version 0.0.1

white_test(lm)

## White's test results
## 
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 51.78
## P-value: 0

Coinciden

Use solo educación como predictor e interprete beta-hat ¿es heteroscedástica la relación? Interprete el coeficiente de educ.

lm4 <- lm(wage ~ educ, data = wage1)
summary(lm4)

## 
## Call:
## lm(formula = wage ~ educ, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3396 -2.1501 -0.9674  1.1921 16.6085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.90485    0.68497  -1.321    0.187    
## educ         0.54136    0.05325  10.167   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
## F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16

Dado que tenemos level-level, nuestra implica que si aumenta la educación en una unidad, el salario aumentará 0.54136 veces

Checamos Heterocedasticidad:

bptest(lm4)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm4
## BP = 15.306, df = 1, p-value = 9.144e-05

P-value indica que podemos rechazar H0 de homocedasticidad

Luego suponga que \(Var(U_{i} | X_{k}) = \sigma^2 educ_{i}\) Obtenga los estimadores de \(\beta_{k}\) con WLS ¿se soluciona el problema de heteroscedasticidad encontrado en inciso f. al 95% de confianza? Interprete el coeficiente de educ.

wt <- sqrt(wage1$educ)

#Transformamos variables y agregamos al dataset:
wage1$educ_wt <- wage1$educ/wt
wage1$wage_wt <- wage1$wage/wt

#Comparamos regresión original de f) y transformada:
lm <- lm(wage ~ educ, data = wage1)
lm_wt <- lm(wage_wt ~ educ_wt, data = wage1)

Una vez hechos los weights, checamos:

bptest(lm)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm
## BP = 15.306, df = 1, p-value = 9.144e-05

bptest(lm_wt)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_wt
## BP = 4.141, df = 1, p-value = 0.04186

En ambas, se rechaza la H0: sin embargo, con los pesos reducimos bastante el umbral de rechzazo, en contraste a la original.

Verify that the fitted values from d. are all positive and obtain the weighted least squares after estimating \(h_{i}\) (i.e. FGLS)

sum(lm3$fitted.values <= 0)

## [1] 0

FGLS con residuales al cuadrado de d):

h_gorro <- lm(log(residuals2) ~ educ + exper + tenure, data = wage1)
h <- exp(h_gorro$fitted.values)
wt_h <- 1/h

# Estimamos WLS con nuestra h:
lm_fgls <- lm(wage ~ educ + exper + tenure, data = wage1, weights = wt_h)
summary(lm_fgls)

## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1, weights = wt_h)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5426 -1.3984 -0.4874  0.9856 12.5386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.59554    0.41407   1.438  0.15096    
## educ         0.29689    0.03161   9.392  < 2e-16 ***
## exper        0.03186    0.00930   3.426  0.00066 ***
## tenure       0.15022    0.02435   6.169 1.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.147 on 522 degrees of freedom
## Multiple R-squared:  0.2101, Adjusted R-squared:  0.2055 
## F-statistic: 46.27 on 3 and 522 DF,  p-value: < 2.2e-16

Conduct a white test for the FGLS estimation and report your results.

white_test(lm_fgls)

## White's test results
## 
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 51.36
## P-value: 0

Dado que tenemos que estimar , nuestro estimado FGLS no está insesgado. Sin embargo, el estimador es consistente y es asintóticamente más eficiente que OLS. Podemos observar como nuestros estimadores bajo FGLS son más significativos que nuestrs en OLS.

Estimate FGLS with heteroskedastic-robust white-Huber SEs. Report your results and discuss why you would like to estimate this? Are the SE too different from those obtained in h. why?

lm9 <- feols(wage ~ educ + exper + tenure, data = wage1, weights = wt_h)

coeftest(lm9,
         vcov=vcovHC(lm9, type = "HC0"))

## 
## z test of coefficients:
## 
##              Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) 0.5955445  0.6159974  0.9668 0.3336455    
## educ        0.2968899  0.0461869  6.4280 1.293e-10 ***
## exper       0.0318644  0.0091421  3.4855 0.0004913 ***
## tenure      0.1502191  0.0279000  5.3842 7.277e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Los SE heterocedastico-robustos de White-Huber son más pequeños que los resultantes de FGLS con SE clásicos. No cambian tanto: queremos estimarlos debido a que los errores estándar consistentes con la heteroscedasticidad se utilizan para permitir el ajuste de un modelo que contiene residuos heterocedásticos

Ejercicio 5

Ejecute el siguiente Código en R Studio:

library(car)

## Loading required package: carData

# generate heteroskedastic data 
X <- 1:1000 # crea un vector que continee valores del 1 al 1000
Y <- rnorm(n = 1000, mean = X, sd = 0.6 * X) # crea un vector Y de 1000 observaciones que siguen una distribucion normal. la media es igual a X y la desv est a 69% del valor de X

# plot the data
plot(x = X, y = Y, 
     pch = 19, 
     col = "steelblue", 
     cex = 0.8)

# grafico de dispersion 

# add the regression line to the plot
# abline(reg, 
     #  col = "darkred", 
      # lwd = 1.5)
# agrega una linea de regresion a los datos dispersos, se ajusta a los datos generados en los pasos anteriores

t <- c()
t.rob <- c()
# crea dos vectores vacios para almacenar resultados


# loop sampling and estimation
for (i in 1:10000) { # inicia un bucle para realizar muestreo y estimacion repetidos 10000 veces
  
  # sample data
  X <- 1:1000
  Y <- rnorm(n = 1000, mean = X, sd = 0.6 * X)
  
  # en cada iteracion del bucle se generan nuevos datos heterocedasticos 
  
  # estimate regression model
  reg <- lm(Y ~ X)
  
  # se ajusta un modelo de regresion lineal a los nuevos datos 
  
  # homoskedasdicity-only significance test
  t[i] <- linearHypothesis(reg, "X = 1")$'Pr(>F)'[2] < 0.05
  
  # Se realiza una prueba de significancia para verificar si el coeficiente de regresión de X es igual a 1 (prueba de homocedasticidad) 
  # La prueba verifica si el p-valor asociado con la restricción "X = 1" es menor que 0.05 y el resultado se almacena en el vector t.
  
  # robust significance test
  t.rob[i] <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05
  # Se realiza una prueba robusta para evaluar la misma restricción, pero con ajuste para la heterocedasticidad utilizando white.adjust = "hc1". El resultado se almacena en el vector t.rob.
  
}

# compute the fraction of false rejections
# round(cbind(t = mean(t), t.rob = mean(t.rob)), 3)

# Calcula el promedio de los vectores t y t.rob, y redondea el resultado a 3 decimales. Esto proporciona la fracción de rechazos falsos en las pruebas de hipótesis de homocedasticidad y las pruebas robustas.

Describa brevemente qué es lo que está realizando el código

Lo puse en el codigo en comentario.

Cambie n a 100 ¿cuál es el resultado en la fracción de rechazos falsos aun con SE robustos? ¿Qué puede concluir sobre los errores estándar robustos?

  library(car)
# generate heteroskedastic data 
X <- 1:100
Y <- rnorm(n = 100, mean = X, sd = 0.6 * X)

t <- c()
t.rob <- c()

# loop sampling and estimation
for (i in 1:10000) {
  # sample data
  X <- 1:100
  Y <- rnorm(n = 100, mean = X, sd = 0.6 * X)
  
  # estimate regression model
  reg <- lm(Y ~ X)
  
  # homoskedasdicity-only significance test
  t[i] <- linearHypothesis(reg, "X = 1")$'Pr(>F)'[2] < 0.05
  
  # robust significance test
  t.rob[i] <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05
}

# compute the fraction of false rejections
#result <- round(cbind(t = mean(t), t.rob = mean(t.rob)), 3)
#result

En ambos casos, la fracción de rechazos falsos en la prueba de homocedasticidad (t) es relativamente baja, lo que indica que esta prueba tiende a controlar adecuadamente los errores de tipo I.

En el caso con n igual a 100, la fracción de rechazos falsos en la prueba robusta (t.rob) es un poco más alta que en el caso con n igual a 1000. Esto sugiere que la prueba robusta es menos propensa a rechazar incorrectamente la hipótesis nula en comparación con la prueba de homocedasticidad, lo cual es una característica deseable.

Cambie SD a 1.6 ¿esto qué significa en términos de ajuste del modelo? ¿Qué efecto tiene en la fracción de rechazos falsos aun con SE robustos?

# generate heteroskedastic data with SD = 1.6
X <- 1:1000
Y <- rnorm(n = 1000, mean = X, sd = 1.6 * X)

t <- c()
t.rob <- c()

# loop sampling and estimation
for (i in 1:10000) {
  # sample data
  X <- 1:1000
  Y <- rnorm(n = 1000, mean = X, sd = 1.6 * X)
  
  # estimate regression model
  reg <- lm(Y ~ X)
  
  # homoskedasdicity-only significance test
  t[i] <- linearHypothesis(reg, "X = 1")$'Pr(>F)'[2] < 0.05
  
  # robust significance test
  t.rob[i] <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05
}

# compute the fraction of false rejections
#result <- round(cbind(t = mean(t), t.rob = mean(t.rob)), 3)
#result

Mayor SD: El aumento en la desviación estándar (SD) de los errores indica una mayor heteroscedasticidad en los datos. En este contexto, la heteroscedasticidad se ha incrementado en comparación con el caso anterior con una SD más baja de 0.6 * X. Esto significa que la variabilidad de los errores en función de la variable independiente X es aún más pronunciada.

Impacto en el ajuste del modelo: La mayor heteroscedasticidad en los datos afectará negativamente el ajuste del modelo de regresión lineal. El modelo lineal no podrá capturar adecuadamente la relación entre X e Y debido a la fuerte variabilidad en los errores. Los coeficientes estimados del modelo pueden ser menos precisos y menos confiables.

En cuanto a la fracción de rechazos falsos:

Con una SD de 1.6, la fracción de rechazos falsos en la prueba de homocedasticidad (t) aumentó ligeramente en comparación con el caso anterior (con SD de 0.6). Esto se debe a que el aumento de la heteroscedasticidad puede hacer que la prueba de homocedasticidad sea más propensa a rechazar incorrectamente la hipótesis nula.

Sin embargo, la prueba robusta (t.rob) sigue siendo más efectiva en el control de los errores de tipo I en este contexto de mayor heteroscedasticidad. Aunque la fracción de rechazos falsos aumentó con la SD de 1.6, sigue siendo más baja que la fracción de rechazos falsos en la prueba de homocedasticidad.

Describa formalmente el procedimiento para obtener WLS con estos datos.

La estimación de mínimos cuadrados ponderados (WLS, por sus siglas en inglés, Weighted Least Squares) es una técnica utilizada en análisis de regresión cuando se sospecha que los errores no tienen varianzas constantes, es decir, en presencia de heteroscedasticidad. La idea detrás de WLS es asignar pesos a cada observación en función de su varianza. El procedimiento formal para obtener WLS con estos datos se describe a continuación:

Especificación del modelo:
- Define el modelo de regresión lineal apropiado para tus datos. En este caso, el modelo general puede ser expresado como:
  
  \(Y_i = \beta_0 + \beta_1 \cdot X_i + \varepsilon_i\)
Donde:
- \(Y_i\) es la variable dependiente.
- \(X_i\) es la variable independiente.
- \(varepsilon_i\) son los errores, que se asume que tienen varianzas no constantes.
Estimación de la varianza de los errores:
- Calcula la varianza de los errores \(\varepsilon_i\) para cada observación. En este caso, como la varianza es proporcional a \(X_i\), podrías estimarla como \(\sigma_i^2 = \alpha \cdot X_i\), donde \(\alpha\) es un parámetro desconocido.
Especificación de los pesos:
- Calcula los pesos \(w_i\) para cada observación \(i\) en función de las estimaciones de la varianza \(\sigma_i^2\). Un enfoque común es asignar pesos inversamente proporcionales a la varianza. En este caso, puedes usar \(w_i = 1 / \sigma_i^2\).
Estimación de WLS:
- Utiliza una función de estimación de WLS para ajustar el modelo de regresión lineal ponderado. En R, puedes utilizar la función lm junto con el argumento weights para aplicar WLS. Por ejemplo:

# wls_model <- lm(Y ~ X, weights = w_i)

Donde: - \(wls_model\) es el modelo de regresión lineal ponderado obtenido mediante WLS. - \(Y\) es la variable dependiente. - \(X\) es la variable independiente. - \(w_i\) son los pesos calculados en el paso anterior.

Análisis de resultados:
- Examina los resultados del modelo de WLS para interpretar los coeficientes de regresión y realizar pruebas de hipótesis. También puedes evaluar la bondad de ajuste y otros estadísticos de interés para entender cómo el modelo se ajusta a los datos.

El procedimiento formal anterior describe cómo aplicar WLS en el contexto de datos con heteroscedasticidad. Al asignar pesos adecuados a las observaciones en función de la varianza, WLS ayuda a obtener estimaciones más precisas y confiables de los coeficientes de regresión en presencia de heteroscedasticidad.

Ejercicio 6

\(arr86 = \beta_{0} + \beta_{1}pcnv + \beta_{2}tottime + \beta_{4}ptime86 + \beta_{5}qemp86 + u\)

We estimate a linear probability model for whether a young man was arrested during 1986:

Using the data in CRIME1, estimate this model by OLS and verify that all fitted values are strictly between zero and one. What are the smallest and largest fitted values?

data("crime1")
lm10 <- lm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1)

Checamos que fitted values están entre 0 y 1:

length(which(lm10$fitted.values>0,lm10$fitted.values<1))

## [1] 2725

Checamos mayor y menor fitted values:

min(lm10$fitted.values)

## [1] 0.08373395

max(lm10$fitted.values)

## [1] 0.9868572

Estimate the equation by weighted least squares, as discussed in Section 8-5.

lm11 <- lm(log(lm10$fitted.values, base = exp(1)) ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1)

exp12 <- exp(lm11$fitted.values)
weights <- 1/exp12
lm12 <- lm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1, weights = weights)

summary(lm12)

## 
## Call:
## lm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86, 
##     data = crime1, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8320 -0.6823 -0.5477  0.3873 15.4894 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.750823   0.036963  20.313  < 2e-16 ***
## pcnv        -0.180890   0.033111  -5.463 5.10e-08 ***
## avgsen      -0.002865   0.013088  -0.219    0.827    
## tottime      0.007274   0.010480   0.694    0.488    
## ptime86     -0.046157   0.006537  -7.061 2.09e-12 ***
## qemp86      -0.114281   0.010444 -10.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.267 on 2719 degrees of freedom
## Multiple R-squared:  0.05656,    Adjusted R-squared:  0.05483 
## F-statistic:  32.6 on 5 and 2719 DF,  p-value: < 2.2e-16

Use the FGLS estimates to determine whether \(avgsen\) and \(tottime\) are jointly significant at the 5% level.

#Modelo No-Restringido:
lm12 <- lm(narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1,
           weights = weights)

#Modelo Restringido:
lm13 <- lm(narr86 ~ pcnv + ptime86 + qemp86, data = crime1, weights = weights)

F_Stat <- ((summary(lm12)$r.squared-summary(lm13)$r.squared)/2) /
  ((1-summary(lm12)$r.squared)/(2725-5-1))
#F_5,2725-5-1 = 2.21
F_Stat > 2.21

## [1] FALSE

Por lo tanto, no son significativas en conjunto al 5%

Actividad 5: Heterocedasticidad

Ricardo Cortés

2023-10-15