Part I

What is the bias of an estimator?

Suppose that \(x\) is the observed data and \(\theta\) is the true value of an unknown parameter of interest, which is estimated by the function \(T(x)\); the estimate produced by this function is denoted \(\hat{\theta}\). The bias of the estimator is the difference between the expected value of \(\hat{\theta}\) and \(\theta\), written as:

\[ Bias(\hat{\theta})=E_{\theta}(\hat{\theta})-\theta \]

An estimator is deemed unbiased if this difference is zero for all values of \(\theta\). Bias measures how systematically an estimator deviates from the truth: if an estimator is unbiased, its expected value equals the true value in the population.
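
A quick simulation makes this concrete. The sketch below (illustrative only, not part of the examples that follow) compares the variance estimator that divides by \(n\) with the usual one that divides by \(n-1\):

set.seed(42)
sigma2 <- 4       # true variance
n      <- 10      # sample size
reps   <- 100000  # number of simulated samples

est_n <- replicate(reps, {
  x <- rnorm(n, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n  # dividing by n: the biased (MLE) estimator
})
est_nm1 <- replicate(reps, {
  x <- rnorm(n, sd = sqrt(sigma2))
  var(x)                    # var() divides by n - 1: unbiased
})

mean(est_n) - sigma2    # about -sigma2 / n = -0.4 (biased downward)
mean(est_nm1) - sigma2  # about 0 (unbiased)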

Part II

In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?

Whether adding a variable reduces omitted variable bias depends on how that variable relates to the dependent variable and to the key independent variable.

A variable should be included if it is related to both the key independent variable and the dependent variable, as this reduces the bias. Note, however, that including a variable that is strongly correlated with the existing regressors can introduce multicollinearity.

When it comes to sample size, an increase will not change the bias: omitted variable bias is not a small-sample problem, so the estimator remains biased (and inconsistent) however large the sample grows, as the simulation sketch below illustrates.
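
In the sketch below, the data-generating process is assumed purely for illustration: the coefficient from a regression that omits the confounder \(z\) stays biased no matter how large \(n\) becomes.

set.seed(123)
ovb_at_n <- function(n) {
  z <- rnorm(n)             # omitted variable
  x <- 0.5 * z + rnorm(n)   # x is correlated with z
  y <- x + 2 * z + rnorm(n) # true effect of x is 1; z also drives y
  coef(lm(y ~ x))["x"] - 1  # bias of the regression omitting z
}
sapply(c(100, 10000, 1000000), ovb_at_n)

Rather than vanishing, the bias settles near its population value, \( \beta_z \cdot Cov(x,z) / Var(x) = 2 \times 0.5 / 1.25 = 0.8 \).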

Part III

Example 1

The first data set is cross-sectional data from the AER package, consisting of 1,319 observations on 12 variables. It contains the credit history for a sample of people who applied for a credit card.

card - Factor. Was the application for a credit card accepted?

reports - Number of major derogatory reports.

age - Age in years plus twelfths of a year.

income - Yearly income (in USD 10,000).

share - Ratio of monthly credit card expenditure to yearly income.

expenditure - Average monthly credit card expenditure.

owner - Factor. Does the individual own their home?

selfemp - Factor. Is the individual self-employed?

dependents - Number of dependents.

months - Months living at current address.

majorcards - Number of major credit cards held.

active - Number of active credit accounts.

# loading the required packages and the data
library(AER)        # contains the CreditCard data set
library(stargazer)  # summary and regression tables
data("CreditCard")
df1 <- CreditCard
stargazer(df1, type = "text", 
          title = "Summary Statistics for Credit Card Data")
## 
## Summary Statistics for Credit Card Data
## ===================================================
## Statistic     N    Mean   St. Dev.  Min      Max   
## ---------------------------------------------------
## reports     1,319  0.456   1.345     0       14    
## age         1,319 33.213   10.143  0.167   83.500  
## income      1,319  3.365   1.694   0.210   13.500  
## share       1,319  0.069   0.095   0.0001   0.906  
## expenditure 1,319 185.057 272.219  0.000  3,099.505
## dependents  1,319  0.994   1.248     0        6    
## months      1,319 55.268   66.272    0       540   
## majorcards  1,319  0.817   0.387     0        1    
## active      1,319  6.997   6.306     0       46    
## ---------------------------------------------------

Full Model Regression

\[ expenditure = \beta_0 +\beta_1income +\beta_2age+\mu \]

The key independent variable of interest is income.

full_model <- lm(data = df1, 
                 formula = expenditure ~ income + age)

summary(full_model)
## 
## Call:
## lm(formula = expenditure ~ income + age, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -544.26 -137.46  -65.26   58.94 2503.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.089     25.548   3.683  0.00024 ***
## income        49.626      4.479  11.080  < 2e-16 ***
## age           -2.289      0.748  -3.061  0.00225 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260.5 on 1316 degrees of freedom
## Multiple R-squared:  0.08553,    Adjusted R-squared:  0.08414 
## F-statistic: 61.54 on 2 and 1316 DF,  p-value: < 2.2e-16

Model with Omitted Variable

\[ expenditure = \beta_0 +\beta_1income +\mu \]

Age is the omitted variable in this model.

ov_model <- lm(data = df1, 
               formula = expenditure ~ income)

summary(ov_model)
## 
## Call:
## lm(formula = expenditure ~ income, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -529.95 -139.19  -70.94   69.25 2501.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.03      16.01   2.063   0.0393 *  
## income         45.17       4.25  10.630   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261.3 on 1317 degrees of freedom
## Multiple R-squared:  0.07902,    Adjusted R-squared:  0.07832 
## F-statistic:   113 on 1 and 1317 DF,  p-value: < 2.2e-16

Check for OVB

# corstars function
# x : a matrix containing the data
# method : correlation method; "pearson" or "spearman" is supported
# removeTriangle : remove the "upper" or "lower" triangle
# result : if "html" or "latex",
#   the results will be displayed in html or latex format
corstars <- function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
                     result=c("none", "html", "latex")){
    # compute the correlation matrix
    require(Hmisc)   # provides rcorr()
    require(xtable)  # needed for the html/latex output options
    x <- as.matrix(x)
    correlation_matrix<-rcorr(x, type=method[1])
    R <- correlation_matrix$r # matrix of correlation coefficients
    p <- correlation_matrix$P # matrix of p-values
    
    ## define the stars marking significance levels; spacing is important.
    mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "**  ", ifelse(p < .05, "*   ", "    "))))
    
    ## truncate the correlation matrix to two decimals
    R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
    
    ## build a new matrix that includes the correlations with their appropriate stars
    Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
    diag(Rnew) <- paste(diag(R), " ", sep="")
    rownames(Rnew) <- colnames(x)
    colnames(Rnew) <- paste(colnames(x), "", sep="")
    
    ## remove upper triangle of correlation matrix
    if(removeTriangle[1]=="upper"){
      Rnew <- as.matrix(Rnew)
      Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove lower triangle of correlation matrix
    else if(removeTriangle[1]=="lower"){
      Rnew <- as.matrix(Rnew)
      Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove the last column and return the correlation matrix
    Rnew <- cbind(Rnew[1:(length(Rnew) - 1)])
    if (result[1]=="none") return(Rnew)
    else{
      if(result[1]=="html") print(xtable(Rnew), type="html")
      else print(xtable(Rnew), type="latex") 
    }
} 
# compute the correlations among age, income, and expenditure
corstars(df1[,c(3, 4, 6)])
##                   age    income
## age                            
## income       0.32****          
## expenditure  0.01      0.28****

Conditions for OVB

  • There is correlation between X and the omitted variable.

  • The omitted variable influences the Y variable.
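
These two conditions come directly from the omitted variable bias formula. Suppose the true model is \( y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \mu \) but the short regression \( y = \beta_0 + \beta_1x_1 + \mu \) is estimated instead. Then

\[ E(\hat{\beta}_1^{short}) = \beta_1 + \beta_2\delta_1 \]

where \(\delta_1\) is the slope from regressing the omitted variable \(x_2\) on \(x_1\). The sign of the bias is the product of the sign of \(\beta_2\) (the omitted variable's effect on Y) and the sign of \(\delta_1\) (its association with X).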

Applying these conditions to this example:

  • OVB exists because age and income have a positive correlation of 0.32 (statistically significant, as p is less than 0.05). Based on the full regression model, age also has a negative, statistically significant effect (-2.289) on expenditure (p is less than 0.05). Since one sign is positive and the other negative, the direction of the OVB is negative; a numerical check follows below.
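
This can be checked numerically with the models already estimated. The following is a small sketch (the names `aux` and `bias_hat` are introduced here for illustration); by OLS algebra, the implied bias equals the gap between the two income coefficients exactly.

# auxiliary regression of the omitted variable (age) on income
aux <- lm(age ~ income, data = df1)

# implied bias: beta_age from the full model times the auxiliary slope
bias_hat <- coef(full_model)["age"] * coef(aux)["income"]
bias_hat

# gap between the two income estimates: 45.175 - 49.626 = -4.451
coef(ov_model)["income"] - coef(full_model)["income"]

Both lines print the same negative value, confirming the direction of the bias.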

Comparison between the Models

stargazer(full_model, ov_model, 
          type = "text",
          title = "Regression Results", 
          align = TRUE)
## 
## Regression Results
## ======================================================================
##                                    Dependent variable:                
##                     --------------------------------------------------
##                                        expenditure                    
##                               (1)                       (2)           
## ----------------------------------------------------------------------
## income                     49.626***                 45.175***        
##                             (4.479)                   (4.250)         
##                                                                       
## age                        -2.289***                                  
##                             (0.748)                                   
##                                                                       
## Constant                   94.089***                 33.027**         
##                             (25.548)                 (16.010)         
##                                                                       
## ----------------------------------------------------------------------
## Observations                 1,319                     1,319          
## R2                           0.086                     0.079          
## Adjusted R2                  0.084                     0.078          
## Residual Std. Error   260.515 (df = 1316)       261.341 (df = 1317)   
## F Statistic         61.542*** (df = 2; 1316) 112.998*** (df = 1; 1317)
## ======================================================================
## Note:                                      *p<0.1; **p<0.05; ***p<0.01

Negative Bias - \(\beta_1\) is underestimated

Comparing the two models, we can see that the key coefficient estimate from the `ov_model` (45.175) is smaller than the full-model estimate (49.626); this shortfall is the negative omitted variable bias.

Example 2

The second data set is cross-sectional data on school districts in Massachusetts. It contains 220 observations on 16 variables.

district - district code

municipality - municipality name

expreg - expenditures per pupil (regular)

expspecial - expenditures per pupil (special needs)

expbil - expenditures per pupil (bilingual)

expocc - expenditures per pupil (occupational)

exptot - expenditures per pupil (total)

scratio - students per computer

special - percent of special education students

lunch - percent qualified for reduced-price lunch

stratio - student-teacher ratio

income - per capita income

score4 - 4th grade score (math + English + science)

score8 - 8th grade score (math + English + science)

salary - average teacher salary

english - percent of English learners

# loading dataset 
data("MASchools")
df2 <- MASchools
stargazer(df2, type = "text")
## 
## ========================================================
## Statistic   N    Mean     St. Dev.     Min       Max    
## --------------------------------------------------------
## expreg     220 4,605.464  880.252     2,905     8,759   
## expspecial 220 8,900.727 3,511.696  3,832.230 53,569.240
## expbil     220 3,037.309 20,259.260     0      295,140  
## expocc     220 1,104.209 2,732.449      0       15,088  
## exptot     220 5,370.250  977.040     3,465     9,868   
## scratio    211   8.107     2.836      2.300     18.400  
## special    220  15.968     3.538      8.100     34.300  
## lunch      220  15.316     15.060     0.400     76.200  
## stratio    220  17.344     2.277     11.400     27.000  
## income     220  18.747     5.808      9.686     46.855  
## score4     220  709.827    15.126      658       740    
## score8     180  698.411    21.053      641       747    
## salary     195  35.993     3.191     24.965     44.494  
## english    220   1.118     2.901      0.000     24.494  
## --------------------------------------------------------

Full Regression Model

\[ score8 = \beta_0 +\beta_1income +\beta_2stratio+\mu \]

The key independent variable for this model is income, while the dependent variable is the 8th-grade score for each school district.

full_model1 <- lm(data = df2, 
                  formula = score8 ~ income + stratio)

summary(full_model1)
## 
## Call:
## lm(formula = score8 ~ income + stratio, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.077  -7.564   0.996   8.220  43.161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 666.1443     9.4608  70.411   <2e-16 ***
## income        2.7901     0.1806  15.446   <2e-16 ***
## stratio      -1.1588     0.4596  -2.521   0.0126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 177 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.6167, Adjusted R-squared:  0.6124 
## F-statistic: 142.4 on 2 and 177 DF,  p-value: < 2.2e-16

Model with Omitted Variable

\[ score8 = \beta_0 +\beta_1income +\mu \]

stratio is the omitted variable in this example. The student-teacher ratio is a highly relevant variable, as a smaller ratio could facilitate better learning in students.

ov_model1 <- lm(data = df2, 
                formula = score8 ~ income)

summary(ov_model1)
## 
## Call:
## lm(formula = score8 ~ income, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.209  -7.277   0.640   8.317  43.922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  643.894      3.461  186.05   <2e-16 ***
## income         2.909      0.177   16.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.3 on 178 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.603,  Adjusted R-squared:  0.6007 
## F-statistic: 270.3 on 1 and 178 DF,  p-value: < 2.2e-16

Check for OVB

# compute correlations and significance using corstars
corstars(df2[,c(11, 12, 14)])
##           stratio    income
## stratio                    
## income  -0.16*             
## score8  -0.32****  0.78****

Applying the conditions for OVB to this example:

  • The omitted variable, stratio, has a negative correlation of -0.16 with income (statistically significant, as p is less than 0.05).

  • From the full regression model, we can also see that stratio has a negative, statistically significant effect (-1.159) on score8 (p is less than 0.05).

  • Since both signs are negative, their product is positive, so the direction of the bias is positive; a numerical check follows below.
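
The auxiliary-regression check from Example 1 applies here as well. Because score8 has missing values, the auxiliary regression must be run on the same 180-district estimation sample for the OLS identity to hold exactly (`df2c` and `aux2` are names introduced for this sketch).

# restrict to complete cases so all regressions use the same sample
df2c <- df2[complete.cases(df2[, c("score8", "income", "stratio")]), ]

# auxiliary regression of the omitted variable (stratio) on income
aux2 <- lm(stratio ~ income, data = df2c)

# implied bias: beta_stratio from the full model times the auxiliary slope
coef(full_model1)["stratio"] * coef(aux2)["income"]

# gap between the two income estimates: 2.909 - 2.790 = 0.119
coef(ov_model1)["income"] - coef(full_model1)["income"]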

Comparison between the Models

stargazer(full_model1, ov_model1, 
          type = "text",
          title = "Regression Results", 
          align = TRUE)
## 
## Regression Results
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                          score8                      
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## income                      2.790***                 2.909***        
##                             (0.181)                  (0.177)         
##                                                                      
## stratio                     -1.159**                                 
##                             (0.460)                                  
##                                                                      
## Constant                   666.144***               643.894***       
##                             (9.461)                  (3.461)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  180                      180           
## R2                           0.617                    0.603          
## Adjusted R2                  0.612                    0.601          
## Residual Std. Error    13.107 (df = 177)        13.303 (df = 178)    
## F Statistic         142.404*** (df = 2; 177) 270.316*** (df = 1; 178)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Positive Bias - \(\beta_1\) is overestimated

By comparing the key coefficient estimates of these two models, we can see that positive bias is indeed present: the model with the omitted variable yields the larger estimate (2.909 versus 2.790).


The omission of a variable that is correlated with both the key independent variable and the dependent variable also leaves more variation in the error term, so the model fits the data less well. The direction of the bias caused by the omitted variable, and its effect on \(\beta_1\), is determined by (1) the correlation between the omitted variable and the dependent variable and (2) the correlation between the key independent variable and the omitted variable.

Here is an easy way to remember the effect of the omitted variable on \(\beta_1\):

If the signs in conditions (1) and (2) are the same (e.g., both positive), then \(\beta_1\) will be overestimated (positive bias).

Conversely, if the signs in conditions (1) and (2) are opposing (e.g., one positive and the other negative), then \(\beta_1\) will be underestimated (negative bias). The simulation sketch below demonstrates both cases.
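
This rule can be confirmed with a short simulation, sketched here under an assumed data-generating process (`sim_bias` is a name introduced for this sketch):

set.seed(1)
n <- 100000

sim_bias <- function(delta, beta2) {
  # delta : association between x and the omitted variable z -- condition (2)
  # beta2 : effect of z on the outcome y -- condition (1)
  x <- rnorm(n)
  z <- delta * x + rnorm(n)
  y <- x + beta2 * z + rnorm(n)  # true coefficient on x is 1
  coef(lm(y ~ x))["x"] - 1       # bias of the regression that omits z
}

sim_bias(delta =  0.5, beta2 = 2)  # same signs: positive bias (about +1)
sim_bias(delta = -0.5, beta2 = 2)  # opposing signs: negative bias (about -1)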