Discussion - OVB

Author

Gina Occhipinti

I. Biased Estimators

The bias of an estimator is the difference between the expected value of the estimator (of a population mean, standard deviation, regression coefficient, etc.) and the true value of the population parameter. An estimator is biased when this difference is not zero. Bias can result from omitting a variable from a regression model that affects the Y variable and is also correlated with an X variable. Biased estimators make our models inaccurate because they systematically misjudge the true value of the population parameter.
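
As a sketch in symbols (the notation here is an assumption for illustration and does not appear elsewhere in this post), write θ for the population value and \hat{θ} for its estimator:

\[ Bias(\hat{θ}) = E[\hat{θ}] - θ \]

For the simplest omitted-variable case, assume the true model is \(Y_i = β_0 + β_1 X_i + β_2 Z_i + u_i\) and that Z is left out of the regression. The slope from the short regression of Y on X alone then satisfies:

\[ \begin{align} E[\hat{β}_1^{short}] &= β_1 + β_2 \cdot δ_1 \\ Bias(\hat{β}_1^{short}) &= β_2 \cdot δ_1 \end{align} \]

where \(δ_1\) is the slope from regressing Z on X. The bias is zero only if Z does not affect Y (\(β_2 = 0\)) or Z is unrelated to X (\(δ_1 = 0\)), and its sign is the product of the signs of \(β_2\) and \(δ_1\).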

II. Reducing Omitted Variable Bias

Omitted variable bias does not go away when we increase the sample size. Say, for example, we have a model predicting test scores with an input variable for student-teacher ratio. There is an omitted variable (percent of English learners) that affects both of these variables and that we are not accounting for. If we just increase the number of observations, this missing variable will still influence both our X and Y variables. Adding the omitted variable (percent of English learners) to the model, and possibly other variables such as school funding or race, reduces the bias and helps better explain test scores.
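
The point about sample size can be illustrated with a small simulation. This is a minimal sketch on made-up data, not real school data: the function name `fit_both`, the coefficients (2 and -3), and the 0.8 link between `x` and `z` are all assumptions chosen for illustration. With these values the short-model slope should settle near 2 + (-3)(0.8 / 1.64) ≈ 0.54 rather than the true value of 2, no matter how large n gets.

```r
set.seed(42)

# Fit the short model (z omitted) and the long model (z included)
# on n simulated rows and return the estimated slope on x from each.
fit_both <- function(n) {
  z <- rnorm(n)                      # omitted variable (think: % English learners)
  x <- 0.8 * z + rnorm(n)            # regressor correlated with z (think: student-teacher ratio)
  y <- 1 + 2 * x - 3 * z + rnorm(n)  # assumed truth: the slope on x is 2
  c(short = coef(lm(y ~ x))[["x"]],
    long  = coef(lm(y ~ x + z))[["x"]])
}

results <- rbind(n_100 = fit_both(100), n_100000 = fit_both(100000))
round(results, 3)
```

In the `short` column the estimate stays far from 2 even with 100,000 observations, while the `long` column, which includes the omitted variable, recovers a slope close to 2.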

III. OVB Dataset Example

  1. The data I will use for this example is the swiss dataset. It contains 47 rows of data with 6 columns or variables, each of which is in percent. Switzerland, in 1888, was entering a period known as the demographic transition; i.e., its fertility was beginning to fall from the high level typical of underdeveloped countries. The data collected are for 47 French-speaking “provinces” at about 1888, including:
  • Fertility - common standardized fertility measure

  • Agriculture - % of males involved in agriculture as occupation

  • Examination - % draftees receiving highest mark on army examination

  • Education - % education beyond primary school for draftees

  • Catholic - % ‘catholic’ (as opposed to ‘protestant’).

  • Infant.Mortality - live births who live less than 1 year.

?swiss
swiss
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
Veveyse           87.1        64.5          14         6    98.61
Aigle             64.1        62.0          21        12     8.52
Aubonne           66.9        67.5          14         7     2.27
Avenches          68.9        60.7          19        12     4.43
Cossonay          61.7        69.3          22         5     2.82
Echallens         68.3        72.6          18         2    24.20
Grandson          71.7        34.0          17         8     3.30
Lausanne          55.7        19.4          26        28    12.11
La Vallee         54.3        15.2          31        20     2.15
Lavaux            65.1        73.0          19         9     2.84
Morges            65.5        59.8          22        10     5.23
Moudon            65.0        55.1          14         3     4.52
Nyone             56.6        50.9          22        12    15.14
Orbe              57.4        54.1          20         6     4.20
Oron              72.5        71.2          12         1     2.40
Payerne           74.2        58.1          14         8     5.23
Paysd'enhaut      72.0        63.5           6         3     2.56
Rolle             60.5        60.8          16        10     7.72
Vevey             58.3        26.8          25        19    18.46
Yverdon           65.4        49.5          15         8     6.10
Conthey           75.5        85.9           3         2    99.71
Entremont         69.3        84.9           7         6    99.68
Herens            77.3        89.7           5         2   100.00
Martigwy          70.5        78.2          12         6    98.96
Monthey           79.4        64.9           7         3    98.22
St Maurice        65.0        75.9           9         9    99.06
Sierre            92.2        84.6           3         3    99.46
Sion              79.3        63.1          13        13    96.83
Boudry            70.4        38.4          26        12     5.62
La Chauxdfnd      65.7         7.7          29        11    13.79
Le Locle          72.7        16.7          22        13    11.22
Neuchatel         64.4        17.6          35        32    16.92
Val de Ruz        77.6        37.6          15         7     4.97
ValdeTravers      67.6        18.7          25         7     8.65
V. De Geneve      35.0         1.2          37        53    42.34
Rive Droite       44.7        46.6          16        29    50.43
Rive Gauche       42.8        27.7          22        29    58.33
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4
Veveyse                  24.5
Aigle                    16.5
Aubonne                  19.1
Avenches                 22.7
Cossonay                 18.7
Echallens                21.2
Grandson                 20.0
Lausanne                 20.2
La Vallee                10.8
Lavaux                   20.0
Morges                   18.0
Moudon                   22.4
Nyone                    16.7
Orbe                     15.3
Oron                     21.0
Payerne                  23.8
Paysd'enhaut             18.0
Rolle                    16.3
Vevey                    20.9
Yverdon                  22.5
Conthey                  15.1
Entremont                19.8
Herens                   18.3
Martigwy                 19.4
Monthey                  20.2
St Maurice               17.8
Sierre                   16.3
Sion                     18.1
Boudry                   20.3
La Chauxdfnd             20.5
Le Locle                 18.9
Neuchatel                23.0
Val de Ruz               20.0
ValdeTravers             19.5
V. De Geneve             18.0
Rive Droite              18.2
Rive Gauche              19.3
library("tidyverse")
library("stargazer")

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
glimpse(swiss)
Rows: 47
Columns: 6
$ Fertility        <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4,…
$ Agriculture      <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8, 53.3,…
$ Examination      <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, 14, 19, …
$ Education        <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12, 5, 2, …
$ Catholic         <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16,…
$ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24.9, 21.0,…
stargazer(swiss, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("Fertility", "Agriculture", "Examination", 
                               "Education", "Catholic", "Infant.Mortality")
          )

Summary Statistics
==================================================
Statistic        N   Mean  St. Dev.  Min     Max  
--------------------------------------------------
Fertility        47 70.143  12.492  35.000 92.500 
Agriculture      47 50.660  22.711  1.200  89.700 
Examination      47 16.489  7.978     3      37   
Education        47 10.979  9.615     1      53   
Catholic         47 41.144  41.705  2.150  100.000
Infant.Mortality 47 19.943  2.913   10.800 26.600 
--------------------------------------------------

Full Model: Our estimating equation with this dataset is:

\[ \begin{align} Fertility_i &= β_0 + β_1 \cdot Agriculture_i \\ &+ β_2 \cdot Examination_i \\ &+ β_3 \cdot Education_i \\ &+ β_4 \cdot Catholic_i \\ &+ β_5 \cdot Infant \ Mortality_i \\ &+ u_i \end{align} \]

Our dependent variable (Y) is Fertility, and our independent variables (X) are Agriculture, Examination, Education, Catholic, and Infant Mortality.

full_model <- lm(data = swiss, 
                 formula = Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)

summary(full_model)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education + 
    Catholic + Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2743  -5.2617   0.5032   4.1198  15.3213 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared:  0.7067,    Adjusted R-squared:  0.671 
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
# Extract each slope by name rather than by position
agriculture_coeff_full_model <- coef(full_model)["Agriculture"]
examination_coeff_full_model <- coef(full_model)["Examination"]
education_coeff_full_model <- coef(full_model)["Education"]
catholic_coeff_full_model <- coef(full_model)["Catholic"]
infant.mortality_coeff_full_model <- coef(full_model)["Infant.Mortality"]
  1. Short Model: In the short model, we omit the Education independent variable. Our estimating equation with this dataset is:

\[ \begin{align} Fertility_i &= β_0 + β_1 \cdot Agriculture_i \\ &+ β_2 \cdot Examination_i \\ &+ β_3 \cdot Catholic_i \\ &+ β_4 \cdot Infant \ Mortality_i \\ &+ u_i \end{align} \]

short_model <- lm(data = swiss, 
                  formula = Fertility ~ Agriculture + Examination + Catholic + Infant.Mortality)

summary(short_model)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Catholic + 
    Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.9194  -3.5530  -0.6489   6.5956  14.1767 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      59.60267   13.04246   4.570 4.25e-05 ***
Agriculture      -0.04759    0.08032  -0.593 0.556688    
Examination      -0.96805    0.25284  -3.829 0.000423 ***
Catholic          0.02611    0.03843   0.679 0.500551    
Infant.Mortality  1.39597    0.46259   3.018 0.004315 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.82 on 42 degrees of freedom
Multiple R-squared:  0.5448,    Adjusted R-squared:  0.5014 
F-statistic: 12.57 on 4 and 42 DF,  p-value: 8.272e-07
  1. Check OVB Conditions

    To check for OVB conditions, we need to see whether the dataset meets two criteria: 1) the omitted variable is correlated with at least one of our other independent variables, and 2) the omitted variable is a determinant of our dependent variable, Fertility.

    For the first criterion, Education is significantly correlated (at the .05 level) with Agriculture and Examination, but not with Infant Mortality or Catholic.

cor(swiss[,c(2,3,4,5,6)]) # correlation coefficient check 
                 Agriculture Examination   Education   Catholic
Agriculture       1.00000000  -0.6865422 -0.63952252  0.4010951
Examination      -0.68654221   1.0000000  0.69841530 -0.5727418
Education        -0.63952252   0.6984153  1.00000000 -0.1538589
Catholic          0.40109505  -0.5727418 -0.15385892  1.0000000
Infant.Mortality -0.06085861  -0.1140216 -0.09932185  0.1754959
                 Infant.Mortality
Agriculture           -0.06085861
Examination           -0.11402160
Education             -0.09932185
Catholic               0.17549591
Infant.Mortality       1.00000000
?cor.test
cor.test(swiss$Agriculture, swiss$Education) # statistically sig.

    Pearson's product-moment correlation

data:  swiss$Agriculture and swiss$Education
t = -5.5804, df = 45, p-value = 1.305e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7829085 -0.4316231
sample estimates:
       cor 
-0.6395225 
cor.test(swiss$Examination, swiss$Education) # statistically sig.

    Pearson's product-moment correlation

data:  swiss$Examination and swiss$Education
t = 6.5463, df = 45, p-value = 4.811e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5144218 0.8209342
sample estimates:
      cor 
0.6984153 
cor.test(swiss$Infant.Mortality, swiss$Education) # not statistically sig.

    Pearson's product-moment correlation

data:  swiss$Infant.Mortality and swiss$Education
t = -0.66958, df = 45, p-value = 0.5065
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3757709  0.1933600
sample estimates:
        cor 
-0.09932185 
cor.test(swiss$Catholic, swiss$Education) # not statistically sig.

    Pearson's product-moment correlation

data:  swiss$Catholic and swiss$Education
t = -1.0446, df = 45, p-value = 0.3018
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4223643  0.1394701
sample estimates:
       cor 
-0.1538589 

For our second test, we check whether the omitted variable is correlated with our dependent variable (Y), Fertility.

From this test, we see a statistically significant negative correlation between the omitted variable and Y.

cor(swiss[,c(1,4)]) # correlation coefficient check - yes negative correlation
           Fertility  Education
Fertility  1.0000000 -0.6637889
Education -0.6637889  1.0000000
?cor.test
cor.test(swiss$Fertility, swiss$Education) # statistically sig.

    Pearson's product-moment correlation

data:  swiss$Fertility and swiss$Education
t = -5.9536, df = 45, p-value = 3.659e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7987075 -0.4653206
sample estimates:
       cor 
-0.6637889 
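
The individual `cor.test()` calls above can also be collected into one table. This is a minimal sketch: the names `others`, `cor_table`, and `significant_05` are introduced here for illustration, and the .05 cutoff is the same one used in the text.

```r
# Every variable except the omitted one (Education)
others <- setdiff(names(swiss), "Education")

# Run cor.test() of each variable against Education and keep r and the p-value
cor_table <- do.call(rbind, lapply(others, function(v) {
  test <- cor.test(swiss[[v]], swiss$Education)
  data.frame(variable = v,
             r = unname(test$estimate),
             p_value = test$p.value)
}))

# Flag correlations that are significant at the .05 level
cor_table$significant_05 <- cor_table$p_value < 0.05
cor_table
```

Reading down the `significant_05` column gives both OVB checks at once: the Fertility row is the check against Y, and the remaining rows are the checks against the other X variables.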
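  1. The bias in this multivariate model is mixed: omitting Education pushes some coefficients up and others down. Education is negatively correlated with Fertility (r = -0.66) and has a negative coefficient in the full model. It is negatively correlated with Agriculture (r = -0.64), and a negative effect on Y combined with a negative correlation with X gives a positive (upward) bias on Agriculture. It is positively correlated with Examination (r = 0.70), which gives a negative (downward) bias on Examination. Its correlations with Catholic and Infant Mortality are negative but weak and not statistically significant, so the expected bias on those coefficients is less clear. With several regressors, pairwise correlations are only a rough guide to the direction of the bias.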

  2. stargazer(full_model, short_model, 
              type = "text")
    
    =================================================================
                                     Dependent variable:             
                        ---------------------------------------------
                                          Fertility                  
                                 (1)                    (2)          
    -----------------------------------------------------------------
    Agriculture                -0.172**                -0.048        
                               (0.070)                (0.080)        
    
    Examination                 -0.258               -0.968***       
                               (0.254)                (0.253)        
    
    Education                 -0.871***                              
                               (0.183)                               
    
    Catholic                   0.104***                0.026         
                               (0.035)                (0.038)        
    
    Infant.Mortality           1.077***               1.396***       
                               (0.382)                (0.463)        
    
    Constant                  66.915***              59.603***       
                               (10.706)               (13.042)       
    
    -----------------------------------------------------------------
    Observations                  47                     47          
    R2                          0.707                  0.545         
    Adjusted R2                 0.671                  0.501         
    Residual Std. Error    7.165 (df = 41)        8.820 (df = 42)    
    F Statistic         19.761*** (df = 5; 41) 12.565*** (df = 4; 42)
    =================================================================
    Note:                                 *p<0.1; **p<0.05; ***p<0.01
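  3. In each case, the sign of the coefficient stayed the same between the two models, but the magnitudes changed. For example, the Agriculture coefficient is -0.172 in the full model and -0.048 in the short model, so the short model's estimate is higher (closer to zero), which indicates an upward bias. Because Education is negatively related to both Agriculture and Fertility, omitting it causes an upward bias on Agriculture. The largest change is in Examination, whose coefficient moves from -0.258 to -0.968.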
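
The change in each coefficient can be checked against the omitted variable bias formula. The following is a sketch under one assumption not made explicit in the post: that the relevant link between Education and the included regressors is measured by an auxiliary regression of Education on those regressors (the names `aux_model`, `included`, `observed_shift`, and `implied_shift` are introduced here for illustration). For OLS, the shift in each short-model coefficient equals the Education coefficient from the full model times the matching auxiliary slope.

```r
full_model  <- lm(Fertility ~ Agriculture + Examination + Education +
                    Catholic + Infant.Mortality, data = swiss)
short_model <- lm(Fertility ~ Agriculture + Examination +
                    Catholic + Infant.Mortality, data = swiss)

# Auxiliary regression: the omitted variable on the included regressors
aux_model <- lm(Education ~ Agriculture + Examination +
                  Catholic + Infant.Mortality, data = swiss)

included <- c("Agriculture", "Examination", "Catholic", "Infant.Mortality")

# Observed change in each coefficient when Education is dropped
observed_shift <- coef(short_model)[included] - coef(full_model)[included]

# Change implied by the OVB formula: (Education coefficient) x (auxiliary slope)
implied_shift <- coef(full_model)[["Education"]] * coef(aux_model)[included]

round(cbind(full = coef(full_model)[included],
            short = coef(short_model)[included],
            observed_shift, implied_shift), 3)
```

The `observed_shift` and `implied_shift` columns should agree up to floating-point error. From the two model summaries above, the shifts are about +0.125 for Agriculture, -0.710 for Examination, -0.078 for Catholic, and +0.319 for Infant.Mortality, so the direction of the bias differs by coefficient.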