Preparation

# Load library ------------------------------------------------------------

pacman::p_load("tidyverse", "foreign", "janitor", "estimatr", "modelr", "Rcpp", "huxtable", "strucchange", "qpcR")


# Import data ---------------------------------------------------------------

ces <- read.dta("~/R/Musashi_University/2021_second_semester/econometrics/Data/CES2013.dta")

eawe <- read.dta("~/R/Musashi_University/2021_second_semester/econometrics/Data/EAWE21.dta")


df_ces <- janitor::clean_names(ces) %>% 
  as_tibble()

df_eawe <- janitor::clean_names(eawe) %>% 
  as_tibble()


# Preparation -------------------------------------------------------------

# model list
mod <- list() 

# visualization function
lm_visualize <- function(x, y, data) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(data = data, aes(!!x, !!y)) +
    geom_point(alpha = 0.25) +
    geom_smooth(method = "lm") +
    labs(x = rlang::as_label(x),
         y = rlang::as_label(y))
}

# define chow test
chow_test <- function(rss_p, rss_1, rss_2, k, n_1, n_2) {
  value <- ((rss_p - (rss_1 + rss_2)) / k) / ((rss_1 + rss_2) / (n_1 + n_2 - 2*k))
  # 5% critical value of F(k, n_1 + n_2 - 2k)
  f_stat <- qf(0.95, k, n_1 + n_2 - 2*k)
  text <- sprintf("chow test: %.3f\nf_stat:\t%.3f\n", value, f_stat)
  cat(text)
  if (value > f_stat) {
    print("reject")
  } else {
    print("not reject")
  }
}
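
For reference, this is the standard Chow statistic, where \(RSS_P\) comes from the pooled regression and \(RSS_1\), \(RSS_2\) from the two sub-sample regressions:

\[ F = \frac{(RSS_P - (RSS_1 + RSS_2)) / k}{(RSS_1 + RSS_2) / (n_1 + n_2 - 2k)} \sim F(k,\, n_1 + n_2 - 2k) \]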

A4.1

Is expenditure on your category per capita related to total expenditure per capita?

# tidy
df_ces <- df_ces %>% 
  mutate(lgcatpc = log(fdho),
         lgexppc = log(exp))


mod4_1 <- lm_robust(lgcatpc ~ lgexppc, data = df_ces)

# show the result
summary(mod4_1)
## 
## Call:
## lm_robust(formula = lgcatpc ~ lgexppc, data = df_ces)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)   0.7009   0.086260   8.126 5.29e-16   0.5319   0.8700 6332
## lgexppc       0.6658   0.009914  67.158 0.00e+00   0.6464   0.6852 6332
## 
## Multiple R-squared:  0.4271 ,    Adjusted R-squared:  0.427 
## F-statistic:  4510 on 1 and 6332 DF,  p-value: < 2.2e-16

From summary(mod4_1), when total household expenditure per capita (lgexppc) increases by 1 percent, expenditure on food per capita (lgcatpc) tends to increase by about 0.67 percent. As R-squared is roughly 0.43, about 43 percent of the overall variation in food expenditure per capita is explained by total household expenditure per capita. Since the t-statistic on lgexppc is 67.158, the null hypothesis that \(\beta_2\) is zero is rejected at the 1 percent significance level. The F-statistic also points to a relationship between the two variables, as its p-value is well below 0.01.
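
As a quick check of the elasticity interpretation, here is a minimal sketch using the fitted coefficient (coef() works on lm_robust objects; the numbers come from the output above):

# elasticity of food expenditure with respect to total expenditure
b2 <- coef(mod4_1)["lgexppc"]  # about 0.666

# implied change in food expenditure for a 10 percent rise in total expenditure
(1.10^b2 - 1) * 100  # roughly 6.6 percent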

# visualization
lm_visualize(lgexppc, lgcatpc, data = df_ces)

A4.7

Give an interpretation of the regression output.

# tidy
df_eawe <- df_eawe %>% 
  mutate(lgearn = log(earnings),
         sasvabc = s * asvabc)

# modeling
mod4_7 <- lm_robust(lgearn ~ s + exp + asvabc + sasvabc, data = df_eawe)

# show the result
summary(mod4_7)
## 
## Call:
## lm_robust(formula = lgearn ~ s + exp + asvabc + sasvabc, data = df_eawe)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper  DF
## (Intercept)  1.38675   0.222370   6.236 9.617e-10  0.949848  1.82366 495
## s            0.07642   0.012114   6.309 6.247e-10  0.052623  0.10023 495
## exp          0.04005   0.010654   3.759 1.909e-04  0.019117  0.06098 495
## asvabc      -0.20963   0.138428  -1.514 1.306e-01 -0.481611  0.06235 495
## sasvabc      0.01887   0.008982   2.101 3.617e-02  0.001221  0.03652 495
## 
## Multiple R-squared:  0.1549 ,    Adjusted R-squared:  0.1481 
## F-statistic: 23.81 on 4 and 495 DF,  p-value: < 2.2e-16

From the above regression table, all independent variables except asvabc are statistically significant at the 5 percent level. Based on the F-statistic, since the p-value is below 0.05, there is some relationship between the dependent variable and the independent variables.

# visualization
lm_visualize(asvabc, lgearn, data = df_eawe)

From the above visualization, there is a positive relationship between lgearn and asvabc, even though when asvabc enters together with s, exp, and sasvabc, the coefficient of asvabc becomes negative. This is an artifact of extrapolation. Written in mathematical terms, the model is as follows.

\[ \ln Earn = \beta_1 + \beta_2 S + \beta_3 Exp + (\beta_4 + \beta_5 S) Score + u \] Here \(\beta_4\) on its own is the effect of Score for someone with \(S = 0\), and years of schooling are essentially never zero in the modern world, so the negative \(\hat{\beta}_4\) has no direct interpretation: it extrapolates far outside the observed range of S. The marginal effect of Score is \(\beta_4 + \beta_5 S\), which with \(\hat{\beta}_4 = -0.210\) and \(\hat{\beta}_5 = 0.019\) turns positive once S exceeds about 11 years, that is, for virtually the whole sample.
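
To make this concrete, a minimal sketch evaluating the marginal effect of asvabc at the average years of schooling (na.rm = TRUE is a guard in case s has missing values):

# marginal effect of asvabc, beta_4 + beta_5 * s, at the sample mean of s
b4 <- coef(mod4_7)["asvabc"]
b5 <- coef(mod4_7)["sasvabc"]
unname(b4 + b5 * mean(df_eawe$s, na.rm = TRUE))  # positive at typical schooling levels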

A5.8

**Does going to college have an effect on household expenditure?**

# tidy
df_ces <- df_ces %>% 
  mutate(college = if_else(refeduc <= 13, 0, 1),
         lgsize = log(size)
         )

# modeling
mod <- list()
mod$mod5_8_1 <- lm(lgcatpc ~ lgexppc + lgsize, data = df_ces %>% filter(college == 1))
mod$mod5_8_2 <- lm(lgcatpc ~ lgexppc + lgsize, data = df_ces %>% filter(college == 0))
mod$mod5_8_3 <- lm(lgcatpc ~ lgexppc + lgsize, data = df_ces)


# show results
huxreg("college_yes" = mod[[1]], "college_no" = mod[[2]], "whole sample" = mod[[3]])

                college_yes    college_no    whole sample
(Intercept)       0.923 ***     0.916 ***     1.158 ***
                 (0.134)       (0.111)       (0.082)
lgexppc           0.603 ***     0.621 ***     0.584 ***
                 (0.015)       (0.014)       (0.010)
lgsize            0.328 ***     0.313 ***     0.334 ***
                 (0.020)       (0.016)       (0.013)
N              2671          3663          6334
R2                0.489         0.485         0.483
logLik        -2202.586     -3073.123     -5318.209
AIC            4413.171      6154.245     10644.418
*** p < 0.001; ** p < 0.01; * p < 0.05.
The purpose of the Chow test is to decide whether you should divide the whole sample into sub-samples or not. Typical cases are:

  • example 1: time-series data divided into two sub-samples, before and after a financial crisis
  • example 2: cross-sectional data divided into two sub-samples based on a dummy variable

The choices are:

  • run two regressions, one for each sub-sample, or
  • run one regression using the whole sample.

In this case, we perform the Chow test based on college, a dummy variable. \(H_0\) is that there is no structural break in the data set, while \(H_1\) is that there is a structural break, meaning we should include the dummy variable (and its interactions) in our regression model. Using the formula above, the code is:

# residual sums of squares from the pooled and sub-sample regressions
RSS_p <- RSS(mod$mod5_8_3)
RSS_1 <- RSS(mod$mod5_8_1)
RSS_2 <- RSS(mod$mod5_8_2)
k <- 3       # parameters per regression: intercept, lgexppc, lgsize
n_1 <- 2671  # observations with college == 1
n_2 <- 3663  # observations with college == 0


chow_test(RSS_p, RSS_1, RSS_2, 
          k = k,
          n_1 = n_1, 
          n_2 = n_2)
## chow test: 28.284
## f_stat:  2.606
## [1] "reject"

Since 28.28 is well above the 5 percent critical value of 2.606 from \(F(3,\, 6328)\), we can reject the null hypothesis that the coefficients are the same in the two sub-samples, meaning we should include the dummy variable in our model.
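
The same test can be run without the hand-rolled function: the Chow test is equivalent to an F test of the pooled model against a fully interacted one. A minimal sketch (the interacted formula is my own construction, not part of the original):

# pooled model vs. model in which every coefficient may differ by college
mod_pooled <- lm(lgcatpc ~ lgexppc + lgsize, data = df_ces)
mod_interacted <- lm(lgcatpc ~ (lgexppc + lgsize) * college, data = df_ces)

# the F statistic of this comparison is the Chow statistic
anova(mod_pooled, mod_interacted)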

A5.10

You are given the following data on 2,800 respondents in the National Longitudinal Survey of Youth 1979– with jobs in 2011:

  • hourly earnings in the respondent's main job at the time of the 2011 interview
  • educational attainment (highest grade completed)
  • mother's and father's educational attainment
  • ASVABC score
  • sex
  • ethnicity: black, Hispanic, or white (that is, neither black nor Hispanic)
  • whether the main job in 2011 was in the government sector or the private sector.

As a policy analyst, you are asked to investigate whether there is evidence of earnings discrimination, positive or negative, by sex or ethnicity in (1) the government sector, and (2) the private sector. Explain how you would do this, giving a mathematical representation of your regression specification(s). You are also asked to investigate whether the incidence of earnings discrimination, if any, is significantly different in the two sectors. Explain how you would do this, giving a mathematical representation of your regression specification(s). In particular, discuss whether a Chow test would be useful for this purpose.

(a)

The fitted model would be: \[ \ln Earn = \beta_1 + \beta_2 S + \beta_3 Score + \beta_4 Male + \beta_5 Ethblack + \beta_6 Ethhisp + u \] estimated separately for the government sector and the private sector; the signs and significance of \(\beta_4\), \(\beta_5\), and \(\beta_6\) indicate earnings discrimination, positive or negative, by sex or ethnicity in each sector.

(b)

The model would be specified as:

\[ \ln Earn = \beta_1 + \beta_2 S + \beta_3 Score + \beta_4 Male + \beta_5 Ethblack + \beta_6 Ethhisp \\ + \delta_1 Gov + \delta_2 GovS + \delta_3 GovScore + \delta_4 GovMale + \delta_5 GovBlack + \delta_6 GovHisp + u \]

# modeling
mod <- list()

mod$mod5_10 <- lm(lgearn ~ (s + asvabc + male + ethblack + ethhisp) * catgov,
                  data = df_eawe)

mod$mod5_10_p <- lm(lgearn ~ s + asvabc + male + ethblack + ethhisp,
                data = df_eawe)

mod$mod5_10_1 <- lm(lgearn ~ s + asvabc + male + ethblack + ethhisp,
                data = df_eawe %>% filter(catgov == 1))

mod$mod5_10_0 <- lm(lgearn ~ s + asvabc + male + ethblack + ethhisp,
                data = df_eawe %>% filter(catgov == 0))

# show the result
huxreg("with dummy" = mod[[1]], 
       "whole sample" = mod[[2]], 
       "Gov_yes" = mod[[3]], 
       "Gov_No" = mod[[4]])
                  with dummy    whole sample    Gov_yes      Gov_No
(Intercept)        1.925 ***     1.841 ***      1.577 ***    1.925 ***
                  (0.163)       (0.151)        (0.425)      (0.165)
s                  0.053 ***     0.060 ***      0.083 **     0.053 ***
                  (0.011)       (0.010)        (0.026)      (0.011)
asvabc             0.070 *       0.059          0.001        0.070 *
                  (0.034)       (0.032)        (0.077)      (0.035)
male               0.198 ***     0.178 ***      0.071        0.198 ***
                  (0.050)       (0.046)        (0.119)      (0.051)
ethblack          -0.115        -0.071          0.058       -0.115
                  (0.077)       (0.070)        (0.160)      (0.078)
ethhisp           -0.082        -0.065          0.125       -0.082
                  (0.077)       (0.073)        (0.226)      (0.078)
catgov            -0.348
                  (0.492)
s:catgov           0.030
                  (0.030)
asvabc:catgov     -0.069
                  (0.090)
male:catgov       -0.127
                  (0.140)
ethblack:catgov    0.173
                  (0.191)
ethhisp:catgov     0.207
                  (0.258)
N                500           500             71           429
R2                 0.159         0.148          0.161        0.142
logLik          -369.503      -372.681        -43.937     -324.487
AIC              765.006       759.361        101.873      662.975
*** p < 0.001; ** p < 0.01; * p < 0.05.
# chow test
RSS_p <- RSS(mod$mod5_10_p)
RSS_1 <- RSS(mod$mod5_10_1)
RSS_2 <- RSS(mod$mod5_10_0)
k <- mod[[2]]$rank            # number of estimated parameters (6)
n_1 <- nrow(mod[[3]]$model)   # government-sector observations
n_2 <- nrow(mod[[4]]$model)   # private-sector observations
            
chow_test(RSS_p, RSS_1, RSS_2, 
          k = k,
          n_1 = n_1, 
          n_2 = n_2)
## chow test: 1.040
## f_stat:  2.117
## [1] "not reject"

Since we cannot reject the null hypothesis that there is no structural break in the data set, we should not divide the data set into two parts, meaning there is no need for the dummy variable Gov (or its interactions) in this case. Note, however, that the Chow test checks all coefficients jointly, so even a significant result would not tell us whether the sex and ethnicity coefficients in particular differ across sectors. To answer that, t tests (or a joint F test) on the interaction terms \(\delta_4\), \(\delta_5\), and \(\delta_6\) are more informative.
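
A minimal sketch of that joint test, comparing the fully interacted model against a restricted one that drops only the sex and ethnicity interactions (the restricted formula is my own construction):

# restricted model: sector interactions for s and asvabc only
mod_restricted <- lm(lgearn ~ (s + asvabc) * catgov + male + ethblack + ethhisp,
                     data = df_eawe)

# joint F test of male:catgov, ethblack:catgov, and ethhisp:catgov
anova(mod_restricted, mod$mod5_10)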