ECNM_Discussion_5

Clean data

rm(list = ls())      # Remove all objects from your environment
gc()                 # Run garbage collection to free unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  581592 31.1    1326975 70.9         NA   669431 35.8
Vcells 1068585  8.2    8388608 64.0      16384  1852345 14.2
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

Part 1) What is bias of an estimator?

The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter being estimated (for example a mean or a median). An estimator is unbiased when its expected value equals the true parameter value.

  • Measures how far the average of the estimates from repeated samples is from the actual parameter.

  • Negative bias means you’re consistently underestimating the true value (on average, your estimates are smaller than they should be).

    • The expected value of the estimator is less than the true value of the parameter.
  • Positive bias means you’re consistently overestimating the true value (on average, your estimates are larger than they should be).

    • The expected value of the estimator is greater than the true value of the parameter.
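
The definition above can be illustrated with a small simulation. This is a minimal sketch on made-up data, not part of the cigarette analysis: it compares the usual sample variance with a version that divides by n instead of n - 1, which is a standard example of a negatively biased estimator. The sample size, true variance, and number of repetitions are arbitrary choices.

```r
# Simulated illustration of bias (all settings are assumptions for the sketch)
set.seed(1)
n        <- 5       # small sample size makes the bias easy to see
true_var <- 4       # true parameter value
reps     <- 20000   # number of repeated samples

# Each column is one sample of size n
samples <- replicate(reps, rnorm(n, mean = 0, sd = sqrt(true_var)))

unbiased_est <- apply(samples, 2, var)        # divides by n - 1
biased_est   <- unbiased_est * (n - 1) / n    # divides by n

# Bias = average of the estimates minus the true value
bias_unbiased <- mean(unbiased_est) - true_var
bias_biased   <- mean(biased_est) - true_var

print(c(bias_unbiased = bias_unbiased, bias_biased = bias_biased))
```

The first number should be close to zero, while the second should be close to -true_var / n, a negative bias.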

Part 2) Omitted Variable Bias (OVB)

What is OVB?

  • Occurs when an important explanatory variable is left out of the model, leading to biased estimates of the coefficients on the variables that remain in the model. The bias arises when the omitted variable both affects the dependent variable and is correlated with one or more of the included independent variables.

Does it go away as you increase the sample size?

  • No, it will not go away as you increase the sample size. A larger sample improves the precision of the estimates, but it does not remove the systematic error caused by omitting the relevant variable.
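
A quick way to see this is to simulate data where the true slope is known. The sketch below uses made-up variables (x, z, y) and arbitrary coefficients, none of which come from the cigarette data. The true slope on x is 2, but z is left out of the regression.

```r
# Simulated illustration: omitted variable bias does not shrink with n
set.seed(42)

simulate_short_slope <- function(n) {
  z <- rnorm(n)                        # omitted variable
  x <- 0.8 * z + rnorm(n)              # included regressor, correlated with z
  y <- 1 + 2 * x + 3 * z + rnorm(n)    # true slope on x is 2
  coef(lm(y ~ x))[["x"]]               # slope from the model that omits z
}

sizes     <- c(100, 1000, 10000, 100000)
estimates <- sapply(sizes, simulate_short_slope)

print(data.frame(n = sizes, slope_without_z = estimates))
```

As n grows the estimates become more stable, but they settle near 2 + 3 * 0.8 / 1.64 (about 3.46) rather than the true value of 2.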

Does it go away as you add more variables?

  • Adding more variables indiscriminately won’t necessarily fix the bias unless the added variables are the ones that were previously omitted and are relevant to the model.
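
The same kind of simulation can show why the choice of added variable matters. Again this is a sketch on made-up data: w is a hypothetical irrelevant variable, z is the variable that was actually omitted, and the true slope on x is 2.

```r
# Simulated illustration: only adding the omitted variable removes the bias
set.seed(7)
n <- 5000
z <- rnorm(n)                        # the relevant omitted variable
w <- rnorm(n)                        # an irrelevant variable
x <- 0.8 * z + rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)    # true slope on x is 2

slope_short  <- coef(lm(y ~ x))[["x"]]       # z omitted
slope_with_w <- coef(lm(y ~ x + w))[["x"]]   # irrelevant control added
slope_with_z <- coef(lm(y ~ x + z))[["x"]]   # omitted variable added

print(c(short = slope_short, with_w = slope_with_w, with_z = slope_with_z))
```

Adding w leaves the slope on x essentially unchanged, while adding z brings it back close to 2.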

Part 3) Applying Omitted Variable Bias on a data set

Section 3.1 - Bringing in the data and creating new variables

library("AER")
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("CigarettesSW")

cig <- CigarettesSW

# price in the data already includes taxes, so subtract taxs to isolate the pre-tax price
cig$price_ex_taxes <- cig$price - cig$taxs

# per pack metrics: scale price and tax by packs (packs consumed per capita)
cig$price_per_pack <- cig$price_ex_taxes / cig$packs

cig$tax_per_pack <- cig$taxs / cig$packs

# preview the first rows
head(cig)
  state year   cpi population    packs    income  tax     price     taxs
1    AL 1985 1.076    3973000 116.4863  46014968 32.5 102.18167 33.34834
2    AR 1985 1.076    2327000 128.5346  26210736 37.0 101.47500 37.00000
3    AZ 1985 1.076    3184000 104.5226  43956936 31.0 108.57875 36.17042
4    CA 1985 1.076   26444000 100.3630 447102816 26.0 107.83734 32.10400
5    CO 1985 1.076    3209000 112.9635  49466672 31.0  94.26666 31.00000
6    CT 1985 1.076    3201000 109.2784  60063368 42.0 128.02499 51.48333
  price_ex_taxes price_per_pack tax_per_pack
1       68.83334      0.5909137    0.2862855
2       64.47500      0.5016159    0.2878603
3       72.40833      0.6927528    0.3460535
4       75.73334      0.7545940    0.3198787
5       63.26666      0.5600627    0.2744248
6       76.54166      0.7004284    0.4711211
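
As a sanity check on the derived columns, the first printed row (AL, 1985) can be recomputed by hand. The sketch below copies the three input values from the output above, so it runs without loading the data set.

```r
# Values copied from the first row (AL, 1985) of the head(cig) output
price <- 102.18167   # price, taxes included
taxs  <- 33.34834    # taxes
packs <- 116.4863    # packs consumed per capita

price_ex_taxes <- price - taxs
price_per_pack <- price_ex_taxes / packs
tax_per_pack   <- taxs / packs

print(c(price_ex_taxes = price_ex_taxes,
        price_per_pack = price_per_pack,
        tax_per_pack   = tax_per_pack))
```

The three results should agree with the printed row up to rounding of the displayed inputs.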

Linear Model

This is a panel data set on cigarette consumption for the 48 continental US states in 1985 and 1995 (96 observations).

\(Cigarette \ pack \ consumption = \beta_0 + \beta_1 * income_i + \beta_2 * price \ per \ pack_i + \beta_3 * taxes \ per \ pack_i + \epsilon_i\)

\(Y_i\) = Number of packs consumed per capita

\(\beta_0\) = Intercept (constant) term

\(\beta_1\) = Slope coefficient representing the change in pack consumption per unit change in income

\(X_1\) = State personal income

\(\beta_2\) = Slope coefficient representing the change in pack consumption per unit change in price per pack

\(X_2\) = Average price per pack during fiscal year (excluding taxes)

\(\beta_3\) = Slope coefficient representing the change in pack consumption per unit change in taxes per pack

\(X_3\) = Average excise taxes for fiscal year per pack, including sales tax.

\(\epsilon_i\) = Error term representing unexplained variation

cig_lm <- lm(packs ~ income + price_per_pack + tax_per_pack, 
               data = cig)

Section 3.2 - Omitting a variable (tax)

\(Cigarette \ pack \ consumption = \beta_0 + \beta_1 * income_i + \beta_2 * price \ per \ pack_i + \epsilon_i\)

ovb_cig_lm <- lm(packs ~ income + price_per_pack, 
               data = cig)

Section 3.3 - Understanding the conditions

Omitted variable bias has 2 conditions:

  1. The omitted variable (tax per pack) must be a determinant of, and so correlated with, the dependent variable (packs)
  2. The omitted variable (tax per pack) must be correlated with the key independent variable (price per pack)

Based on the graph below, tax per pack has a strong correlation with both the dependent variable (packs, about -0.8) and the key independent variable (price per pack, about 0.92), satisfying both conditions.

library(ggcorrplot)
Loading required package: ggplot2
# keep only the numeric columns
Cig_numeric <- cig[, sapply(cig, is.numeric)]

# Correlation matrix
Cig_cor_matrix <- cor(Cig_numeric, 
                  use = "complete.obs")

# p-values for each pairwise correlation
p.mat <- ggcorrplot::cor_pmat(Cig_numeric)

# Correlation graph
ggcorrplot(corr = Cig_cor_matrix, 
  method = "square", 
  type = "full", 
  title = "Correlation Plot", 
  colors = c("red", "white", "green"), 
  lab = TRUE, 
  lab_size = 2, 
  p.mat = p.mat, 
  insig = "pch", 
  pch = 4, 
  hc.order = TRUE, 
  tl.cex = 8, 
  tl.col = "black", 
  digits = 2)

Section 3.4 - Bias direction

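Given that we know:

  1. The correlation between price per pack (A) and tax per pack (B) is positive

  2. The effect of tax (B) on pack consumption (Y) is negative

we can predict that the bias in the price coefficient is likely to be negative, because a positive correlation multiplied by a negative effect gives a negative product.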
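
The sign rule can be written out as a short derivation. This is a sketch: the auxiliary regression of tax on the included variables, and its coefficients \(\delta_0, \delta_1, \delta_2\) and error \(u_i\), are notation introduced here for illustration. Price and tax stand for price per pack and tax per pack.

```latex
\begin{align*}
\text{Full model:}\quad
  packs_i &= \beta_0 + \beta_1\, income_i + \beta_2\, price_i + \beta_3\, tax_i + \epsilon_i \\
\text{Auxiliary regression (assumed):}\quad
  tax_i &= \delta_0 + \delta_1\, income_i + \delta_2\, price_i + u_i \\
\text{Substituting for } tax_i\text{:}\quad
  packs_i &= (\beta_0 + \beta_3\delta_0) + (\beta_1 + \beta_3\delta_1)\, income_i
             + (\beta_2 + \beta_3\delta_2)\, price_i + (\epsilon_i + \beta_3 u_i) \\
\text{Bias in the price coefficient:}\quad
  &(\beta_2 + \beta_3\delta_2) - \beta_2 = \beta_3\,\delta_2
\end{align*}
```

With \(\beta_3 < 0\) (tax lowers consumption) and \(\delta_2 > 0\) (price and tax move together), the product \(\beta_3\delta_2\) is negative, which is the predicted direction of the bias.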

Section 3.5 - Using stargazer to confirm Bias

Full model -> cig_lm

Short model -> ovb_cig_lm

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
# side by side
stargazer(cig_lm, ovb_cig_lm, 
          type = "text")

==================================================================
                                 Dependent variable:              
                    ----------------------------------------------
                                        packs                     
                             (1)                     (2)          
------------------------------------------------------------------
income                      0.000                  -0.000         
                           (0.000)                 (0.000)        
                                                                  
price_per_pack            -34.218***             -44.707***       
                           (7.689)                 (3.372)        
                                                                  
tax_per_pack               -18.064                                
                           (11.919)                               
                                                                  
Constant                  150.979***             152.410***       
                           (3.445)                 (3.337)        
                                                                  
------------------------------------------------------------------
Observations                  96                     96           
R2                          0.700                   0.692         
Adjusted R2                 0.690                   0.685         
Residual Std. Error    14.409 (df = 92)       14.509 (df = 93)    
F Statistic         71.422*** (df = 3; 92) 104.526*** (df = 2; 93)
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01

Section 3.6 - Intuition on the results

As the table above shows, the price coefficient is more negative when tax is omitted (-44.707 in the short model versus -34.218 in the full model), meaning the effect of price on cigarette consumption appears stronger in the short model.

This indicates negative bias when tax is omitted from the model.

  • When tax is not included, the price coefficient is biased downward, overestimating the negative effect of price on cigarette consumption.
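
The size of the bias can be read directly off the regression table. The sketch below only does arithmetic on the three reported coefficients. The name implied_delta is hypothetical: it stands for the price coefficient you would get from regressing tax_per_pack on income and price_per_pack, which is not reported in this document and is backed out here rather than estimated.

```r
# Coefficients copied from the stargazer table
beta_price_full  <- -34.218   # price_per_pack, model (1) with tax
beta_price_short <- -44.707   # price_per_pack, model (2) without tax
beta_tax_full    <- -18.064   # tax_per_pack, model (1)

# Bias in the short model's price coefficient
bias <- beta_price_short - beta_price_full

# For OLS, bias = beta_tax_full * delta, so the implied delta is:
implied_delta <- bias / beta_tax_full

print(c(bias = bias, implied_delta = implied_delta))
```

The bias is negative and the implied delta is positive, which matches the prediction: a negative tax effect combined with a positive price-tax relationship pushes the price coefficient downward.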

In this case tax is an important variable for consumption. Cigarette companies have often tried to lower prices to offset taxes, but the full model suggests that taxes still have a negative association with overall pack consumption (the tax coefficient is -18.064, although it is not statistically significant at the 10% level). Failing to control for tax misattributes some of its effect to price, because the two are positively correlated, and this biases the price coefficient.