First, we will merge the two datasets.
library(foreign)          # read.dta()
library(dplyr)            # left_join() and the %>% pipe
library(sampleSelection)  # selection()
market <- read.dta("C:/Users/dorgo/Documents/R/econ_c/Fake_market_research.dta")
salamtac <- read.dta("C:/Users/dorgo/Documents/R/econ_c/Fake_Salamtac_customer.dta")
joined <- left_join(market, salamtac, by = "ID")
Next we will create indicator variables for Salamtac’s customers and for IDs 1 through 799:
joined$customer <- (joined$ID %in% salamtac$ID) %>% as.numeric()
joined$treated <- (joined$ID %in% 1:799) %>% as.numeric()
I expect the sign of \(\rho\) to be negative: big companies are likely to have their own data departments, so Salamtac’s customers will tend to be smaller firms.
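A quick descriptive check of that conjecture (a sketch only; it compares group means rather than running a formal test):
# Hypothetical sanity check: are Salamtac's customers smaller on average?
aggregate(cbind(employees, sales) ~ customer, data = joined, FUN = mean)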
Now we will make the naive assumption that \(\rho = 0\) (i.e., no selection bias). If in fact \(\rho \neq 0\), OLS on the observed sample is biased, because conditioning on being observed shifts the error mean:
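\[
E[y_i \mid x_i, \text{observed}] = x_i'\beta + E[\varepsilon_i \mid \text{observed}] \neq x_i'\beta \quad \text{when } \rho \neq 0 .
\]
With that caveat, we estimate the naive regression: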
naive <- lm(formula = y_1 ~ employees + sales + export, data = joined)
summary(naive)
##
## Call:
## lm(formula = y_1 ~ employees + sales + export, data = joined)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -363.35  -69.87   -0.47   66.34  346.97
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1526.93186   13.79380 110.697  < 2e-16 ***
## employees     -0.03328    0.07271  -0.458 0.647351
## sales          0.20444    0.06168   3.314 0.000973 ***
## export       943.30839    9.16478 102.928  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.1 on 610 degrees of freedom
## (1168 observations deleted due to missingness)
## Multiple R-squared: 0.9576, Adjusted R-squared: 0.9573
## F-statistic: 4587 on 3 and 610 DF, p-value: < 2.2e-16
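Next we plug the sample means into the fitted equation to price the average exporting firm (export = 1):
\[
\hat{y} = 1526.93 - 0.0333\,\overline{employees} + 0.2044\,\overline{sales} + 943.31 \approx 2526.89 .
\]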
# Coefficient vector and the evaluation point: an average firm (mean employees,
# mean sales) that exports (export = 1), with a leading 1 for the intercept
naive_coef <- naive$coefficients %>% as.matrix()
values <- c(1, mean(joined$employees), mean(joined$sales), 1) %>%
  as.matrix()
naive_expenditure <- crossprod(as.numeric(naive_coef), values)
So the outcome is 2526.8902047: the predicted expenditure on data services of an average exporting firm.
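As a quick cross-check (a sketch only; avg_exporter is a hypothetical helper data frame, and predict() should reproduce the matrix computation up to rounding):
avg_exporter <- data.frame(employees = mean(joined$employees),
                           sales = mean(joined$sales),
                           export = 1)
predict(naive, newdata = avg_exporter)  # ~2526.89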
We now run the Heckit manually: fit a probit for the probability of being Salamtac’s customer, compute the inverse Mills ratio \(\lambda\), and rerun the outcome regression with \(\lambda\) as an additional regressor. For reference, the two-step model is:
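\[
s_i^* = z_i'\gamma + u_i, \qquad y_i \text{ is observed only if } s_i^* > 0,
\]
\[
E[y_i \mid x_i, s_i^* > 0] = x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)} .
\]
The probit output can be seen here: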
myprobit <- glm(customer ~ employees + sales + export + treated,
                family = binomial(link = "probit"),
                data = joined)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(myprobit)
##
## Call:
## glm(formula = customer ~ employees + sales + export + treated,
## family = binomial(link = "probit"), data = joined)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.92004  -0.08555  -0.00004   0.00921   2.68745
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  5.815986   0.384313  15.133   <2e-16 ***
## employees   -0.017331   0.001386 -12.501   <2e-16 ***
## sales       -0.020256   0.001427 -14.192   <2e-16 ***
## export      -0.171658   0.143642  -1.195    0.232
## treated      5.022228   0.290494  17.289   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2295.26 on 1781 degrees of freedom
## Residual deviance: 532.09 on 1777 degrees of freedom
## AIC: 542.09
##
## Number of Fisher Scoring iterations: 9
# Linear predictor from the probit (for all rows), then the inverse Mills ratio
delta_x <- predict(myprobit, newdata = joined)   # default type = "link"
standard_norm_density <- dnorm(delta_x)
standard_norm_cumulative <- pnorm(delta_x)
lambda <- standard_norm_density / standard_norm_cumulative
manual_heckman <- lm(formula = y_1 ~ employees + sales + export + lambda, data = joined)
summary(manual_heckman)
##
## Call:
## lm(formula = y_1 ~ employees + sales + export + lambda, data = joined)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -372.61  -62.93   -1.10   62.26  333.85
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1507.54814   13.38152 112.659  < 2e-16 ***
## employees      0.09653    0.07126   1.355    0.176
## sales          0.28805    0.05977   4.819 1.82e-06 ***
## export       945.82691    8.74474 108.160  < 2e-16 ***
## lambda       -78.99118   10.03940  -7.868 1.65e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95.49 on 609 degrees of freedom
## (1168 observations deleted due to missingness)
## Multiple R-squared: 0.9615, Adjusted R-squared: 0.9612
## F-statistic: 3799 on 4 and 609 DF, p-value: < 2.2e-16
Now we will use the pre-defined selection() function from the sampleSelection package to check our manual estimation:
general_formula <- y_1 ~ employees + sales + export
probit_formula <- customer ~ employees + sales + export + treated
heckman <- selection(selection = probit_formula, outcome = general_formula,
                     data = joined, method = "2step")
summary(heckman)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 1782 observations (1168 censored and 614 observed)
## 12 free parameters (df = 1771)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.815987   0.384803  15.114   <2e-16 ***
## employees   -0.017331   0.001391 -12.456   <2e-16 ***
## sales       -0.020256   0.001430 -14.161   <2e-16 ***
## export      -0.171658   0.142519  -1.204    0.229
## treated      5.022228   0.290279  17.301   <2e-16 ***
## Outcome equation:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.508e+03  1.372e+01 109.849  < 2e-16 ***
## employees   9.653e-02  7.210e-02   1.339    0.181
## sales       2.880e-01  6.161e-02   4.675 3.16e-06 ***
## export      9.458e+02  9.008e+00 104.997  < 2e-16 ***
## Multiple R-Squared:0.9615, Adjusted R-Squared:0.9612
## Error terms:
##               Estimate Std. Error t value Pr(>|t|)
## invMillsRatio -78.9912     9.1053  -8.675   <2e-16 ***
## sigma         101.5684         NA      NA       NA
## rho            -0.7777         NA      NA       NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
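Note how the error-terms block ties the two approaches together: the coefficient on the inverse Mills ratio estimates \(\rho\sigma\), so
\[
\hat{\rho} = \frac{-78.9912}{101.5684} \approx -0.778,
\]
which is exactly the rho reported above. (As an aside, method = "2step" is not the only option: selection() can also fit the same model by full maximum likelihood with method = "ml".)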
As we can see below, the standard errors differ between the manual estimation and the pre-defined function. The reason is that our manual second stage treats \(\lambda\) as if it were data, ignoring the fact that it is itself estimated in the probit step; the correct standard errors must account for that first-stage estimation error as well.
heckman_data <- summary(heckman)
manual_data <- summary(manual_heckman)
heckman_se <- heckman_data$estimate[,"Std. Error"]
manual_se <- manual_data$coefficients[,"Std. Error"]
heckman_se %>% print()
##  (Intercept)    employees        sales       export       treated
##  0.384803205  0.001391445  0.001430353  0.142518878   0.290278854
##  (Intercept)    employees        sales       export invMillsRatio
## 13.723787635  0.072099151  0.061614101  9.008101486   9.105260253
##        sigma          rho
##           NA           NA
manual_se %>% print()
## (Intercept)   employees       sales      export      lambda
## 13.38151581  0.07126458  0.05977039  8.74473542 10.03939823
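One way to judge which standard errors are closer to the truth (a minimal sketch, not part of the original analysis: B = 200 is an arbitrary number of replications, and the resampling assumes independent observations) is to bootstrap the whole two-step procedure:
set.seed(1)
B <- 200
boot_lambda <- numeric(B)
for (b in 1:B) {
  bs <- joined[sample(nrow(joined), replace = TRUE), ]  # resample firms
  pr <- glm(customer ~ employees + sales + export + treated,
            family = binomial(link = "probit"), data = bs)
  xb <- predict(pr, newdata = bs)              # probit linear predictor
  lam <- dnorm(xb) / pnorm(xb)                 # inverse Mills ratio
  fit <- lm(y_1 ~ employees + sales + export + lam, data = bs)
  boot_lambda[b] <- coef(fit)[["lam"]]
}
sd(boot_lambda)  # bootstrap S.E. of the coefficient on lambda
This approximates the sampling variability of the two-step estimate without relying on either analytic formula.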
As we can see above, \(\hat{\rho}\) is negative and the coefficient on \(\lambda\) is significant, so we can reject the null hypothesis of no selection bias (\(\rho = 0\)).
To conclude, we will look at the expenditure on data services of an average firm (in terms of number of employees and sales) that exports:
mean(joined$employees)
## [1] 195.092
mean(joined$sales)
## [1] 308.8456
heck_df <- heckman_data$estimate %>% as.data.frame()
heck_coef <- heck_df[6:9, 1]   # rows 6-9 hold the outcome-equation coefficients
smart_expenditure <- crossprod(as.numeric(heck_coef), values)
smart_expenditure - naive_expenditure
##          [,1]
## [1,] 34.28106
The Heckit-implied expenditure of this firm is 2561.1712655, and the difference between the naive estimate and the Heckit estimate is 34.2810609. The intuition is as stated above: firms that aren’t Salamtac’s customers tend to be bigger, possibly because they have their own data departments, so the naive regression underestimates the average expenditure on data services.