A

First, we will merge the two datasets.

library(foreign)          # read.dta() for Stata files
library(dplyr)            # left_join() and the %>% pipe
library(magrittr)         # the %<>% assignment pipe
library(sampleSelection)  # selection() for the Heckit below

market <- read.dta("C:/Users/dorgo/Documents/R/econ_c/Fake_market_research.dta")
salamtac <- read.dta("C:/Users/dorgo/Documents/R/econ_c/Fake_Salamtac_customer.dta")
joined <- left_join(market, salamtac, by = "ID")

B+C

Next, we will create indicator variables for Salamtac’s customers and for the treated firms (IDs below 800):

joined$customer <- as.numeric(joined$ID %in% salamtac$ID)
joined$treated <- as.numeric(joined$ID %in% 1:799)
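
As a quick sanity check (hypothetical, just to confirm the construction), the two indicators can be cross-tabulated:

# cross-tabulate the two indicators to confirm the construction
table(customer = joined$customer, treated = joined$treated)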

D

I expect the sign of \(\rho\) to be negative: big companies are likely to have their own data departments, so Salamtac’s customers will tend to be smaller firms.
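
For reference, this is the standard sample-selection model, in which \(\rho\) is the correlation between the outcome and selection disturbances:

\[ y_i = x_i'\beta + u_i, \qquad s_i = \mathbf{1}\{z_i'\gamma + v_i > 0\}, \qquad \operatorname{corr}(u_i, v_i) = \rho, \]

where the expenditure \(y_i\) is observed only for firms with \(s_i = 1\) (Salamtac’s customers).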

E

Now we will make the naive assumption that \(\rho = 0\) (i.e., no selection bias) and estimate a naive OLS regression.

naive <- lm(formula = y_1 ~ employees + sales + export, data = joined)
summary(naive)
## 
## Call:
## lm(formula = y_1 ~ employees + sales + export, data = joined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -363.35  -69.87   -0.47   66.34  346.97 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1526.93186   13.79380 110.697  < 2e-16 ***
## employees     -0.03328    0.07271  -0.458 0.647351    
## sales          0.20444    0.06168   3.314 0.000973 ***
## export       943.30839    9.16478 102.928  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.1 on 610 degrees of freedom
##   (1168 observations deleted due to missingness)
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9573 
## F-statistic:  4587 on 3 and 610 DF,  p-value: < 2.2e-16
coef <- naive$coefficients
# evaluation point: intercept, mean employees, mean sales, export = 1
values <- c(1, mean(joined$employees), mean(joined$sales), 1)
naive_expenditure <- crossprod(coef, values)
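
This is simply the fitted value at the sample means for an exporting firm:

\[ \widehat{y} = \hat\beta_0 + \hat\beta_1\,\overline{\text{employees}} + \hat\beta_2\,\overline{\text{sales}} + \hat\beta_3 \cdot 1. \]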

So the outcome is 2526.8902047: the naive estimate of the expenditure on data services of an exporting firm of average size.

F

We now carry out the Heckit manually: run a probit that models the probability of being a Salamtac customer, compute \(\lambda\) from its fitted values, and run the outcome regression with \(\lambda\) as an additional regressor. The output is shown here:
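
Here \(\lambda\) is the inverse Mills ratio evaluated at the probit’s linear predictor,

\[ \lambda_i = \frac{\phi(z_i'\hat\gamma)}{\Phi(z_i'\hat\gamma)}, \]

where \(\phi\) and \(\Phi\) are the standard normal density and CDF (dnorm and pnorm in the code below).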

myprobit <- glm(customer ~ employees + sales + export + treated,
                family = binomial(link = "probit"), 
                data = joined)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(myprobit)
## 
## Call:
## glm(formula = customer ~ employees + sales + export + treated, 
##     family = binomial(link = "probit"), data = joined)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.92004  -0.08555  -0.00004   0.00921   2.68745  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.815986   0.384313  15.133   <2e-16 ***
## employees   -0.017331   0.001386 -12.501   <2e-16 ***
## sales       -0.020256   0.001427 -14.192   <2e-16 ***
## export      -0.171658   0.143642  -1.195    0.232    
## treated      5.022228   0.290494  17.289   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2295.26  on 1781  degrees of freedom
## Residual deviance:  532.09  on 1777  degrees of freedom
## AIC: 542.09
## 
## Number of Fisher Scoring iterations: 9
delta_x <- predict.glm(myprobit, newdata = joined)  # linear predictor z'gamma ("link" is the default type)
standard_norm_density <- dnorm(delta_x)
standard_norm_cumulative <- pnorm(delta_x)
lambda <- standard_norm_density/standard_norm_cumulative  # inverse Mills ratio
manual_heckman <- lm(formula = y_1 ~ employees + sales + export + lambda, data = joined)
summary(manual_heckman)
## 
## Call:
## lm(formula = y_1 ~ employees + sales + export + lambda, data = joined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -372.61  -62.93   -1.10   62.26  333.85 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1507.54814   13.38152 112.659  < 2e-16 ***
## employees      0.09653    0.07126   1.355    0.176    
## sales          0.28805    0.05977   4.819 1.82e-06 ***
## export       945.82691    8.74474 108.160  < 2e-16 ***
## lambda       -78.99118   10.03940  -7.868 1.65e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 95.49 on 609 degrees of freedom
##   (1168 observations deleted due to missingness)
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9612 
## F-statistic:  3799 on 4 and 609 DF,  p-value: < 2.2e-16

G

Now we will use the ready-made selection() function from the sampleSelection package to check our manual estimation:

general_formula <- y_1 ~ employees + sales + export
probit_formula <- customer ~ employees + sales + export + treated
heckman <- selection(selection = probit_formula, outcome = general_formula,
                     data = joined, method = "2step")
summary(heckman)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 1782 observations (1168 censored and 614 observed)
## 12 free parameters (df = 1771)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.815987   0.384803  15.114   <2e-16 ***
## employees   -0.017331   0.001391 -12.456   <2e-16 ***
## sales       -0.020256   0.001430 -14.161   <2e-16 ***
## export      -0.171658   0.142519  -1.204    0.229    
## treated      5.022228   0.290279  17.301   <2e-16 ***
## Outcome equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.508e+03  1.372e+01 109.849  < 2e-16 ***
## employees   9.653e-02  7.210e-02   1.339    0.181    
## sales       2.880e-01  6.161e-02   4.675 3.16e-06 ***
## export      9.458e+02  9.008e+00 104.997  < 2e-16 ***
## Multiple R-Squared:0.9615,   Adjusted R-Squared:0.9612
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)    
## invMillsRatio -78.9912     9.1053  -8.675   <2e-16 ***
## sigma         101.5684         NA      NA       NA    
## rho            -0.7777         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
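
As a robustness check, the same Tobit-2 model could also be estimated by full maximum likelihood; a sketch (not run here), using the method = "ml" option of selection():

heckman_ml <- selection(selection = probit_formula, outcome = general_formula,
                        data = joined, method = "ml")
summary(heckman_ml)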

H

As we can see below, the manual estimation and the ready-made function give different standard errors. The reason is that our manual second-stage OLS treats \(\lambda\) as if it were observed data and ignores the sampling error of the first-stage probit estimates; to get the correct standard errors we have to account for both stages.

heckman_data <- summary(heckman)
manual_data <- summary(manual_heckman)
heckman_se <- heckman_data$estimate[,"Std. Error"]
manual_se <- manual_data$coefficients[,"Std. Error"]
heckman_se %>% print()
##   (Intercept)     employees         sales        export       treated 
##   0.384803205   0.001391445   0.001430353   0.142518878   0.290278854 
##   (Intercept)     employees         sales        export invMillsRatio 
##  13.723787635   0.072099151   0.061614101   9.008101486   9.105260253 
##         sigma           rho 
##            NA            NA
manual_se %>% print()
## (Intercept)   employees       sales      export      lambda 
## 13.38151581  0.07126458  0.05977039  8.74473542 10.03939823
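
One simple way to obtain valid standard errors for the manual two-step procedure is to bootstrap both stages together. A minimal sketch, assuming the joined data and column names from above (B = 500 is an arbitrary choice):

set.seed(1)
B <- 500
boot_coefs <- replicate(B, {
  d <- joined[sample(nrow(joined), replace = TRUE), ]
  # stage 1: probit and the inverse Mills ratio
  pr <- glm(customer ~ employees + sales + export + treated,
            family = binomial(link = "probit"), data = d)
  xb <- predict(pr, newdata = d)
  d$lambda <- dnorm(xb)/pnorm(xb)
  # stage 2: outcome regression with lambda as a regressor
  coef(lm(y_1 ~ employees + sales + export + lambda, data = d))
})
apply(boot_coefs, 1, sd)  # bootstrap standard errors that reflect both stages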

I

As we can see above, \(\hat\rho\) is negative and the coefficient on \(\lambda\) is significant, so we can reject the null hypothesis of no selection bias (\(\rho = 0\)).
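
The reported \(\hat\rho\) follows directly from the two-step output: the coefficient on \(\lambda\) estimates \(\rho\sigma\), so

\[ \hat\rho = \frac{\hat\beta_\lambda}{\hat\sigma} = \frac{-78.9912}{101.5684} \approx -0.7777, \]

which matches the rho row of the selection() summary.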

J

To conclude, we will look at the expenditure on data services of an average firm (in terms of number of employees and sales) that exports:

mean(joined$employees)
## [1] 195.092
mean(joined$sales)
## [1] 308.8456
heck_df <- heckman_data$estimate %>% as.data.frame()
heck_coef <- heck_df[6:9, 1]  # rows 6-9 hold the outcome-equation coefficients
smart_expenditure <- crossprod(as.numeric(heck_coef), values)
smart_expenditure-naive_expenditure
##          [,1]
## [1,] 34.28106
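
Note that the code uses only the four outcome-equation coefficients: with the selection-corrected \(\hat\beta\), the population-level prediction for characteristics \(x\) is simply \(\hat{E}[y \mid x] = x'\hat\beta\); the term \(\rho\sigma\lambda(z'\gamma)\) enters only the mean conditional on being selected, \(E[y \mid x, s = 1]\).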

The expenditure of this firm is 2561.1712655, and the difference between the naive estimate and the Heckit estimate is 34.2810609. The intuition is as stated above: firms that are not Salamtac’s customers tend to be bigger, perhaps because they have their own data departments, so the naive regression, which is estimated on the smaller selected firms only, underestimates the average expenditure on data services.