What is the bias of an estimator?
Suppose that \(x\) is the observed data and \(\theta\) is the true value of an unknown parameter of interest, which is estimated by a function \(T(x)\); the value produced by this function is denoted \(\hat{\theta}\). The bias of an estimator is then defined as the difference between the expected value of \(\hat{\theta}\) and \(\theta\):
\[ Bias(\hat{\theta})=E_{\theta}(\hat{\theta})-\theta \]
An estimator is unbiased if this difference is zero for all values of \(\theta\). The bias of an estimator is one way to gauge how accurate it is: if an estimator is unbiased, its expected value equals the true value of the parameter in the population.
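To make this concrete, here is a minimal simulation sketch (not part of the original analysis): the variance estimator that divides by \(n\) is biased downward, while the usual sample variance dividing by \(n-1\) is unbiased.
set.seed(42)
n <- 10
sims <- replicate(100000, {
x <- rnorm(n, mean = 0, sd = 2) # true variance is 4
c(biased = sum((x - mean(x))^2) / n, # MLE, divides by n
unbiased = var(x)) # sample variance, divides by n - 1
})
rowMeans(sims) # biased averages about 3.6; unbiased about 4
The average of the biased estimator settles at \((n-1)/n \times 4 = 3.6\) no matter how many simulation draws we take, which is exactly what a nonzero bias means.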
In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?
Whether including an additional variable reduces omitted variable bias depends on how that variable relates to the dependent variable and to the key independent variable. The variable should be included if it is related to both, as this reduces the bias. It is important to note, however, that doing so can introduce multicollinearity (a quick check for this appears after the first full model below).
When it comes to sample size, an increase will not change the bias: omitted variable bias comes from the model specification, not from sampling variability, so the estimate converges to the wrong value no matter how large the sample. The simulation below illustrates this.
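A small simulation with hypothetical data (the variable names and coefficients here are illustrative, not from the data sets used later) shows both points: adding the omitted variable removes the bias, while increasing the sample size does not.
set.seed(1)
ovb_demo <- function(n) {
z <- rnorm(n) # the omitted variable
x <- 0.5 * z + rnorm(n) # x is correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n) # true beta_1 = 2
c(short = coef(lm(y ~ x))["x"], # omits z: biased
full = coef(lm(y ~ x + z))["x"]) # includes z: unbiased
}
ovb_demo(100) # short estimate near 3.2, full near 2
ovb_demo(100000) # short estimate still near 3.2; more data does not help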
The first data set is cross-sectional data from the AER package, consisting of 1,319 observations on 12 variables. The data contain the credit history of a sample of people who applied for a credit card.
card - Factor. Was the application for a credit card accepted?
reports - Number of major derogatory reports.
age - Age in years plus twelfths of a year.
income - Yearly income (in USD 10,000).
share - Ratio of monthly credit card expenditure to yearly income.
expenditure - Average monthly credit card expenditure.
owner - Factor. Does the individual own their home?
selfemp - Factor. Is the individual self-employed?
dependents - Number of dependents.
months - Months living at current address.
majorcards - Number of major credit cards held.
active - Number of active credit accounts.
# loading the required packages and the data
library(AER) # provides the CreditCard data set
library(stargazer) # summary and regression tables
data("CreditCard")
df1 <- CreditCard
stargazer(df1, type = "text",
          title = "Summary Statistics for Credit Card Data")
##
## Summary Statistics for Credit Card Data
## ===================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------
## reports 1,319 0.456 1.345 0 14
## age 1,319 33.213 10.143 0.167 83.500
## income 1,319 3.365 1.694 0.210 13.500
## share 1,319 0.069 0.095 0.0001 0.906
## expenditure 1,319 185.057 272.219 0.000 3,099.505
## dependents 1,319 0.994 1.248 0 6
## months 1,319 55.268 66.272 0 540
## majorcards 1,319 0.817 0.387 0 1
## active 1,319 6.997 6.306 0 46
## ---------------------------------------------------
\[ expenditure = \beta_0 +\beta_1income +\beta_2age+\mu \]
The key independent variable of interest is income.
full_model <- lm(data = df1,
formula = expenditure ~ income + age)
summary(full_model)
##
## Call:
## lm(formula = expenditure ~ income + age, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -544.26 -137.46 -65.26 58.94 2503.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.089 25.548 3.683 0.00024 ***
## income 49.626 4.479 11.080 < 2e-16 ***
## age -2.289 0.748 -3.061 0.00225 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.5 on 1316 degrees of freedom
## Multiple R-squared: 0.08553, Adjusted R-squared: 0.08414
## F-statistic: 61.54 on 2 and 1316 DF, p-value: < 2.2e-16
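As noted earlier, adding a related variable can introduce multicollinearity. A quick check with variance inflation factors, using the car package (an assumption on our part; the original analysis does not load it):
library(car) # assumed available; provides vif()
vif(full_model) # both VIFs are close to 1 here, so collinearity is mild
With a correlation of 0.32 between age and income, the VIFs are roughly \(1/(1-0.32^2) \approx 1.1\), far from the conventional trouble threshold of 10.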
\[ expenditure = \beta_0 +\beta_1income +\mu \]
Age is the omitted variable in this model.
ov_model <- lm(data = df1,
formula = expenditure ~ income)
summary(ov_model)
##
## Call:
## lm(formula = expenditure ~ income, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.95 -139.19 -70.94 69.25 2501.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.03 16.01 2.063 0.0393 *
## income 45.17 4.25 10.630 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 261.3 on 1317 degrees of freedom
## Multiple R-squared: 0.07902, Adjusted R-squared: 0.07832
## F-statistic: 113 on 1 and 1317 DF, p-value: < 2.2e-16
# corstars function
# x : a matrix or data frame containing the data
# method : correlation method; "pearson" or "spearman" is supported
# removeTriangle : remove the upper or lower triangle
# result : if "html" or "latex", the result is printed in that format
corstars <- function(x, method = c("pearson", "spearman"),
                     removeTriangle = c("upper", "lower"),
                     result = c("none", "html", "latex")) {
  # compute the correlation matrix
  require(Hmisc)
  require(xtable)
  x <- as.matrix(x)
  correlation_matrix <- rcorr(x, type = method[1])
  R <- correlation_matrix$r # matrix of correlation coefficients
  p <- correlation_matrix$P # matrix of p-values
  ## define notation for the significance levels; the spacing is important
  mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "** ", ifelse(p < .05, "* ", "    "))))
  ## truncate the correlation matrix to two decimals
  R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[, -1]
  ## build a new matrix that pairs each correlation with its appropriate stars
  Rnew <- matrix(paste(R, mystars, sep = ""), ncol = ncol(x))
  diag(Rnew) <- paste(diag(R), " ", sep = "")
  rownames(Rnew) <- colnames(x)
  colnames(Rnew) <- paste(colnames(x), "", sep = "")
  ## remove the upper triangle of the correlation matrix
  if (removeTriangle[1] == "upper") {
    Rnew <- as.matrix(Rnew)
    Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
    Rnew <- as.data.frame(Rnew)
  }
  ## remove the lower triangle of the correlation matrix
  else if (removeTriangle[1] == "lower") {
    Rnew <- as.matrix(Rnew)
    Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
    Rnew <- as.data.frame(Rnew)
  }
  ## drop the last (empty) column and return the correlation matrix
  Rnew <- Rnew[1:(ncol(Rnew) - 1)]
  if (result[1] == "none") return(Rnew)
  else {
    if (result[1] == "html") print(xtable(Rnew), type = "html")
    else print(xtable(Rnew), type = "latex")
  }
}
# compute the correlations among age, income, and expenditure (columns 3, 4, 6)
corstars(df1[,c(3, 4, 6)])
## age income
## age
## income 0.32****
## expenditure 0.01 0.28****
Conditions for OVB
The omitted variable is correlated with the key independent variable \(X\).
The omitted variable influences the dependent variable \(Y\).
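These two conditions map directly onto the standard omitted-variable-bias formula. If the full model is \( y = \beta_0 + \beta_1 x + \beta_2 z + \mu \) and \(z\) is omitted, the short-regression estimator satisfies
\[ E(\tilde{\beta}_1) = \beta_1 + \beta_2\delta_1 \]
where \(\delta_1\) is the slope from the auxiliary regression of the omitted variable \(z\) on \(x\). The sign of the bias \(\beta_2\delta_1\) is therefore the product of the signs in conditions (1) and (2).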
Applying these conditions to this example:
age and income have a positive correlation of 0.32 (statistically significant, as p is less than 0.05). Based on the full regression model, age also has a negative effect (-2.289) on expenditure, which is statistically significant since the p-value is less than 0.05. Therefore, the direction of the OVB is negative.
stargazer(full_model, ov_model,
          type = "text",
          title = "Regression Results",
          align = TRUE)
##
## Regression Results
## ======================================================================
## Dependent variable:
## --------------------------------------------------
## expenditure
## (1) (2)
## ----------------------------------------------------------------------
## income 49.626*** 45.175***
## (4.479) (4.250)
##
## age -2.289***
## (0.748)
##
## Constant 94.089*** 33.027**
## (25.548) (16.010)
##
## ----------------------------------------------------------------------
## Observations 1,319 1,319
## R2 0.086 0.079
## Adjusted R2 0.084 0.078
## Residual Std. Error 260.515 (df = 1316) 261.341 (df = 1317)
## F Statistic 61.542*** (df = 2; 1316) 112.998*** (df = 1; 1317)
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Negative Bias - \(\beta_1\) is underestimated
Comparing the two models, the estimated key coefficient in the `ov_model` (45.175) is smaller than in the full model (49.626), which is the result of negative bias.
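We can check this against the bias formula directly (a quick sketch built from the two models above): the gap between the two income coefficients should equal \(\beta_2\delta_1\), with \(\delta_1\) taken from regressing the omitted variable on the key regressor.
# auxiliary regression of the omitted variable (age) on income
aux <- lm(age ~ income, data = df1)
delta1 <- coef(aux)["income"]
beta2 <- coef(full_model)["age"]
beta2 * delta1 # implied bias, about -4.45
coef(ov_model)["income"] - coef(full_model)["income"] # observed gap; matches exactly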
The second data set is cross-sectional data on school districts in Massachusetts. It contains 220 observations on 16 variables.
district - district code
municipality - municipality name
expreg - expenditures per pupil (regular)
expspecial - expenditures per pupil (special needs)
expbil - expenditures per pupil (bilingual)
expocc - expenditures per pupil (occupational)
exptot - expenditures per pupil (total)
scratio - students per computer
special - percent of special education students
lunch - percent qualified for reduced-price lunch
stratio - student-teacher ratio
income - per capita income
score4 - 4th grade score (math + English + science)
score8 - 8th grade score (math + English + science)
salary - average teacher salary
english - percent of English learners
# loading dataset
data("MASchools")
df2 <- MASchools
stargazer(df2, type = "text")
##
## ========================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------
## expreg 220 4,605.464 880.252 2,905 8,759
## expspecial 220 8,900.727 3,511.696 3,832.230 53,569.240
## expbil 220 3,037.309 20,259.260 0 295,140
## expocc 220 1,104.209 2,732.449 0 15,088
## exptot 220 5,370.250 977.040 3,465 9,868
## scratio 211 8.107 2.836 2.300 18.400
## special 220 15.968 3.538 8.100 34.300
## lunch 220 15.316 15.060 0.400 76.200
## stratio 220 17.344 2.277 11.400 27.000
## income 220 18.747 5.808 9.686 46.855
## score4 220 709.827 15.126 658 740
## score8 180 698.411 21.053 641 747
## salary 195 35.993 3.191 24.965 44.494
## english 220 1.118 2.901 0.000 24.494
## --------------------------------------------------------
\[ score8 = \beta_0 +\beta_1income +\beta_2stratio+\mu \]
The key independent variable for this model is income, while the dependent variable is the 8th-grade test score for each school district.
full_model1 <- lm(data = df2,
formula = score8 ~ income + stratio)
summary(full_model1)
##
## Call:
## lm(formula = score8 ~ income + stratio, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.077 -7.564 0.996 8.220 43.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 666.1443 9.4608 70.411 <2e-16 ***
## income 2.7901 0.1806 15.446 <2e-16 ***
## stratio -1.1588 0.4596 -2.521 0.0126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.11 on 177 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.6167, Adjusted R-squared: 0.6124
## F-statistic: 142.4 on 2 and 177 DF, p-value: < 2.2e-16
\[ score8 = \beta_0 +\beta_1income +\mu \]
stratio is the omitted variable in this example. The
student-teacher ratio is a highly relevant variable as a smaller ratio
could potentially facilitate better learning in students.
ov_model1 <- lm(data = df2,
formula = score8 ~ income)
summary(ov_model1)
##
## Call:
## lm(formula = score8 ~ income, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.209 -7.277 0.640 8.317 43.922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 643.894 3.461 186.05 <2e-16 ***
## income 2.909 0.177 16.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.3 on 178 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.603, Adjusted R-squared: 0.6007
## F-statistic: 270.3 on 1 and 178 DF, p-value: < 2.2e-16
# compute correlations and significance using corstars
corstars(df2[,c(11, 12, 14)])
## stratio income
## stratio
## income -0.16*
## score8 -0.32**** 0.78****
Applying the conditions for OVB to this example:
The omitted variable, stratio, has a negative correlation of -0.16 with income (statistically significant, as p is less than 0.05).
From the full regression model, we can also see that stratio has a negative effect (-1.159) on score8 (statistically significant, as p is less than 0.05).
Since the two signs agree (both negative), the direction of the bias is positive.
stargazer(full_model1, ov_model1,
type = "text",
title = "Regression Results",
align = TRUE)
##
## Regression Results
## =====================================================================
## Dependent variable:
## -------------------------------------------------
## score8
## (1) (2)
## ---------------------------------------------------------------------
## income 2.790*** 2.909***
## (0.181) (0.177)
##
## stratio -1.159**
## (0.460)
##
## Constant 666.144*** 643.894***
## (9.461) (3.461)
##
## ---------------------------------------------------------------------
## Observations 180 180
## R2 0.617 0.603
## Adjusted R2 0.612 0.601
## Residual Std. Error 13.107 (df = 177) 13.303 (df = 178)
## F Statistic 142.404*** (df = 2; 177) 270.316*** (df = 1; 178)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Positive Bias - \(\beta_1\) is overestimated
Comparing the key coefficient estimates of the two models, positive bias is indeed present: the model that omits stratio produces a larger estimate of the income coefficient (2.909 versus 2.790).
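The same formula-based check works here, with one caveat: score8 has missing values, so the auxiliary regression must be run on the same 180 complete cases that both models used. A sketch under that assumption:
# restrict to districts with complete data for the three variables
dfc <- df2[complete.cases(df2[, c("score8", "income", "stratio")]), ]
aux1 <- lm(stratio ~ income, data = dfc)
delta1 <- coef(aux1)["income"]
beta2 <- coef(full_model1)["stratio"]
beta2 * delta1 # implied bias, about +0.12
coef(ov_model1)["income"] - coef(full_model1)["income"] # observed gap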
Omitting a variable that is correlated with both the key independent variable and the dependent variable leaves that variable's influence in the error term, so the error term is larger and the model fits less well. The direction of the bias in \(\beta_1\) is determined by (1) the relationship between the omitted variable and the dependent variable and (2) the correlation between the key independent variable and the omitted variable.
Here is an easy way to remember the effect of the omitted variable on \(\beta_1\):
If the signs in conditions (1) and (2) are the same (e.g., both positive), then \(\beta_1\) will be overestimated (positive bias).
Conversely, if the signs in conditions (1) and (2) are opposite (e.g., one positive and the other negative), then \(\beta_1\) will be underestimated (negative bias).
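As a closing sketch (simulated, hypothetical data), all four sign combinations can be verified directly: the short-regression estimate of a true \(\beta_1 = 2\) lands above 2 when the signs agree and below 2 when they oppose.
set.seed(7)
sign_demo <- function(b2, d1, n = 50000) {
z <- rnorm(n) # omitted variable
x <- d1 * z + rnorm(n) # sign of cor(x, z) set by d1
y <- 2 * x + b2 * z + rnorm(n) # true beta_1 = 2; z's effect on y set by b2
coef(lm(y ~ x))["x"] # short regression omitting z
}
sign_demo(b2 = 3, d1 = 0.5) # both positive: estimate above 2
sign_demo(b2 = -3, d1 = -0.5) # both negative: estimate above 2
sign_demo(b2 = 3, d1 = -0.5) # opposite signs: estimate below 2
sign_demo(b2 = -3, d1 = 0.5) # opposite signs: estimate below 2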