What is the bias of an estimator?
Omitted variable bias can arise when a relevant factor is left out of the linear regression function. If the missing factor is correlated with both an included regressor in our model and the dependent variable (Y) that we are trying to estimate, the coefficient on that regressor picks up part of the missing factor's effect. Here is an example:
\[Wage = \beta_0 + \beta_1Experience + \beta_2EducLevel + \epsilon \]
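If EducLevel is omitted and Wage is regressed on Experience alone, the OLS estimator no longer centers on \(\beta_1\). A standard result (written here for this two-regressor example) is
\[ E[\tilde{\beta}_1] = \beta_1 + \beta_2\delta_1, \qquad \delta_1 = \frac{Cov(Experience, EducLevel)}{Var(Experience)} \]
where \(\tilde{\beta}_1\) is the estimator from the short regression and \(\delta_1\) is the slope from regressing the omitted EducLevel on the included Experience. The bias term \(\beta_2\delta_1\) is nonzero exactly when the omitted variable is a determinant of Y (\(\beta_2 \neq 0\)) and correlates with the included regressor (\(\delta_1 \neq 0\)); these are the two conditions checked below. Note that the sample size does not appear in this formula.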
Will the bias go away if we increase the sample size or add more variables?
# Load the package that provides read_excel()
library(readxl)

# Load data 1
income <- read_excel("/Users/pin.lyu/Desktop/2008_Income/cps_2008.xlsx")
# Load data 2 (built-in R dataset)
swiss <- swiss
Data description
Variables:
wage = earnings per hour, in dollars
educ = years of education
age = years of age
exper = years of work experience
female = gender, = 1 if female
black = race, = 1 if black
white = race, = 1 if white
married = marital status, = 1 if married
union = union member, = 1 if yes
northeast = region, = 1 if northeast
midwest = region, = 1 if midwest
south = region, = 1 if south
west = region, = 1 if west
fulltime = work status, = 1 if full-time employee
metro = living environment, = 1 if living in a city
Data source: Dr. Kang Sun Lee, Louisiana Department of Health and Human Services.
library(corrplot)
## corrplot 0.92 loaded
corr <- cor(income)
# Present the correlation matrix as a graph
corrplot(corr,
         method = 'square',
         order = 'FPC',
         type = 'lower',
         diag = FALSE)
Full linear regression
\[ Wage = \beta_0 + \beta_1exper + \beta_2educ + \beta_3fulltime + \beta_4age + \epsilon \]
The key independent variable that I want to study here is working experience: how does a person's work experience affect their income? Normally, as an individual gains more working experience, that individual's wage would be expected to increase.
model_A <- lm(data = income, wage ~
                exper +
                educ +
                fulltime +
                age)
summary(model_A)
##
## Call:
## lm(formula = wage ~ exper + educ + fulltime + age, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.491 -3.274 -1.020 2.164 61.817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.0778 3.3721 -4.471 7.95e-06 ***
## exper -0.7472 0.5580 -1.339 0.181
## educ 0.3696 0.5582 0.662 0.508
## fulltime 1.6297 0.2437 6.689 2.51e-11 ***
## age 0.8644 0.5576 1.550 0.121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.386 on 4728 degrees of freedom
## Multiple R-squared: 0.2493, Adjusted R-squared: 0.2487
## F-statistic: 392.5 on 4 and 4728 DF, p-value: < 2.2e-16
Shorter version
\[ Wage = \beta_0 + \beta_1exper + \beta_2educ + \beta_3fulltime + \epsilon \]
model_B <- lm(data = income, wage ~
                exper +
                educ +
                fulltime)
summary(model_B)
##
## Call:
## lm(formula = wage ~ exper + educ + fulltime, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.495 -3.275 -1.018 2.167 61.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.913117 0.522929 -18.957 < 2e-16 ***
## exper 0.117819 0.006979 16.883 < 2e-16 ***
## educ 1.233413 0.033647 36.657 < 2e-16 ***
## fulltime 1.644950 0.243493 6.756 1.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.387 on 4729 degrees of freedom
## Multiple R-squared: 0.2489, Adjusted R-squared: 0.2484
## F-statistic: 522.4 on 3 and 4729 DF, p-value: < 2.2e-16
Comparison
Difference between the two linear regression functions
# Compare the two versions of regression results
library(stargazer)
stargazer(model_A, model_B,
          type = "text",
          covariate.labels = c("experience", "Education", "fulltime", "age", "β0"))
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## wage
## (1) (2)
## -----------------------------------------------------------------------
## experience -0.747 0.118***
## (0.558) (0.007)
##
## Education 0.370 1.233***
## (0.558) (0.034)
##
## fulltime 1.630*** 1.645***
## (0.244) (0.243)
##
## age 0.864
## (0.558)
##
## β0 -15.078*** -9.913***
## (3.372) (0.523)
##
## -----------------------------------------------------------------------
## Observations 4,733 4,733
## R2 0.249 0.249
## Adjusted R2 0.249 0.248
## Residual Std. Error 5.386 (df = 4728) 5.387 (df = 4729)
## F Statistic 392.512*** (df = 4; 4728) 522.393*** (df = 3; 4729)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
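The sign flip in the experience coefficient (-0.747 in model (1) versus 0.118 in model (2)) is exactly what the omitted variable algebra predicts. As a minimal sketch using the model objects fitted above, the short-model coefficient can be reproduced from the long model plus a bias term built from an auxiliary regression of the omitted age on the included regressors:
# Reproduce the short-model exper coefficient from the long model:
# bias = (long-model age coefficient) x (exper slope from regressing age
# on the short model's regressors)
aux  <- lm(age ~ exper + educ + fulltime, data = income)
bias <- coef(model_A)["age"] * coef(aux)["exper"]
coef(model_A)["exper"] + bias  # matches coef(model_B)["exper"], about 0.118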
Two conditions that an omitted variable needs to meet:
Condition 1: the omitted variable must correlate with one or more of the independent variables in the regression function.
cor(income[,c(1:4)])
## wage educ age exper
## wage 1.0000000 0.43867643 0.24761954 0.1544979
## educ 0.4386764 1.00000000 0.05897888 -0.1479991
## age 0.2476195 0.05897888 1.00000000 0.9784605
## exper 0.1544979 -0.14799908 0.97846047 1.0000000
Condition 2: The omitted variable is a determinant of the dependent variable Y.
# Test how statistically significant the omitted variable "age" is for the dependent variable "wage"
cor_test_A <- cor.test(income$age, income$wage)
cor_test_A
##
## Pearson's product-moment correlation
##
## data: income$age and income$wage
## t = 17.579, df = 4731, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2206859 0.2741758
## sample estimates:
## cor
## 0.2476195
Direction of Bias
Intuition behind how an omitted variable creates bias
The omitted variable, say \(X_2\), creates a confounding effect on the relationship between \(X_1\) and \(Y\), and this confounding effect biases the estimated coefficient of \(X_1\) away from its true value. The direction of the bias depends on the signs of the omitted variable's relationships with the included variable and with the dependent variable: if the two signs agree the bias is positive, and if they differ the bias is negative. In the income example, age relates positively to both exper (correlation 0.978) and wage (correlation 0.248), so omitting age biases the exper coefficient upward, consistent with the jump from -0.747 to 0.118 in the comparison above.
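A minimal simulation sketch (with made-up data, not from either dataset) makes the sign rule concrete: \(x_2\) is built to satisfy both conditions with positive signs, so the short regression overstates the coefficient on \(x_1\), and increasing the sample size does not make the bias go away.
# Simulation with known true coefficients (hypothetical data)
set.seed(42)
n  <- 100000                          # a large sample does not remove the bias
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)             # Condition 1: x2 correlates with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # Condition 2: x2 is a determinant of y
coef(lm(y ~ x1 + x2))["x1"]  # long regression: close to the true value 2
coef(lm(y ~ x1))["x1"]       # short regression: close to 2 + 3 * 0.8 = 4.4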
Data description
Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in about 1888.
Variables:
Fertility = common standardized fertility measure
Agriculture = % of males involved in agriculture as occupation
Examination = % of draftees receiving the highest mark on the army examination
Education = % of draftees with education beyond primary school
Catholic = % 'catholic' (as opposed to 'protestant')
Infant.Mortality = % of live births who live less than 1 year
# View data type of each column
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# Build the correlation matrix
corr <- cor(swiss)
# Present the correlation matrix as a graph
corrplot(corr,
         method = 'square',
         order = 'FPC',
         type = 'lower',
         diag = FALSE)
Full linear regression
\[ Fertility = \beta_0 + \beta_1Education + \beta_2Agriculture + \beta_3Catholic + \beta_4Examination + \epsilon \]
model_A <- lm(data = swiss, Fertility ~
                Education +
                Agriculture +
                Catholic +
                Examination)
summary(model_A)
##
## Call:
## lm(formula = Fertility ~ Education + Agriculture + Catholic +
## Examination, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7813 -6.3308 0.8113 5.7205 15.5569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.05542 6.94881 13.104 < 2e-16 ***
## Education -0.96161 0.19455 -4.943 1.28e-05 ***
## Agriculture -0.22065 0.07360 -2.998 0.00455 **
## Catholic 0.12442 0.03727 3.339 0.00177 **
## Examination -0.26058 0.27411 -0.951 0.34722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.736 on 42 degrees of freedom
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.6164
## F-statistic: 19.48 on 4 and 42 DF, p-value: 3.95e-09
Shorter version
\[ Fertility = \beta_0 + \beta_1Education + \beta_2Agriculture + \beta_3Catholic + \epsilon \]
model_B <- lm(data = swiss, Fertility ~
                Education +
                Agriculture +
                Catholic)
summary(model_B)
##
## Call:
## lm(formula = Fertility ~ Education + Agriculture + Catholic,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.178 -6.548 1.379 5.822 14.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.22502 4.73472 18.211 < 2e-16 ***
## Education -1.07215 0.15580 -6.881 1.91e-08 ***
## Agriculture -0.20304 0.07115 -2.854 0.00662 **
## Catholic 0.14520 0.03015 4.817 1.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.728 on 43 degrees of freedom
## Multiple R-squared: 0.6423, Adjusted R-squared: 0.6173
## F-statistic: 25.73 on 3 and 43 DF, p-value: 1.089e-09
Comparison
# Compare the two versions of regression results
stargazer(model_A, model_B,
          type = "text",
          covariate.labels = c("Education", "Agriculture", "Catholic", "Examination", "β0"))
##
## =================================================================
## Dependent variable:
## ---------------------------------------------
## Fertility
## (1) (2)
## -----------------------------------------------------------------
## Education -0.962*** -1.072***
## (0.195) (0.156)
##
## Agriculture -0.221*** -0.203***
## (0.074) (0.071)
##
## Catholic 0.124*** 0.145***
## (0.037) (0.030)
##
## Examination -0.261
## (0.274)
##
## β0 91.055*** 86.225***
## (6.949) (4.735)
##
## -----------------------------------------------------------------
## Observations 47 47
## R2 0.650 0.642
## Adjusted R2 0.616 0.617
## Residual Std. Error 7.736 (df = 42) 7.728 (df = 43)
## F Statistic 19.482*** (df = 4; 42) 25.732*** (df = 3; 43)
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Based on the summary chart, the coefficients on Education, Agriculture, and Catholic are statistically significant in both models (p-values less than 0.01), while the coefficient on Examination in model (1) is not (p = 0.347). Dropping Examination nonetheless shifts the Education coefficient from -0.962 to -1.072.
Two conditions that an omitted variable needs to meet:
Condition 1: the omitted variable must correlate with one or more of the independent variables in the regression function.
cor(swiss[,c(1:4)])
## Fertility Agriculture Examination Education
## Fertility 1.0000000 0.3530792 -0.6458827 -0.6637889
## Agriculture 0.3530792 1.0000000 -0.6865422 -0.6395225
## Examination -0.6458827 -0.6865422 1.0000000 0.6984153
## Education -0.6637889 -0.6395225 0.6984153 1.0000000
Condition 2: The omitted variable is a determinant of the dependent variable Y.
# Test how statistically significant the omitted variable "Examination" is for the dependent variable "Fertility"
cor_test_A <- cor.test(swiss$Fertility, swiss$Examination)
cor_test_A
##
## Pearson's product-moment correlation
##
## data: swiss$Fertility and swiss$Examination
## t = -5.6753, df = 45, p-value = 9.45e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7870674 -0.4403995
## sample estimates:
## cor
## -0.6458827
Direction of Bias
“Examination” is positively correlated with “Education” (correlation 0.698) and negatively correlated with “Fertility” (correlation -0.646). Because the omitted variable relates to the included variable and to the dependent variable with opposite signs, we can deduce that omitting it creates a negative bias: the Education coefficient moves from -0.962 in the full model to -1.072 in the shorter one. The sketch below verifies this numerically.
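A minimal sketch, using the model_A and model_B objects fitted in this section, reproduces the short-model Education coefficient from the long model via the same omitted-variable decomposition as before:
# bias = (long-model Examination coefficient) x (Education slope from
# regressing the omitted Examination on the short model's regressors)
aux  <- lm(Examination ~ Education + Agriculture + Catholic, data = swiss)
bias <- coef(model_A)["Examination"] * coef(aux)["Education"]
coef(model_A)["Education"] + bias  # matches coef(model_B)["Education"], about -1.072
The Examination coefficient is negative (-0.261) and the auxiliary slope on Education is positive, so the product, and hence the bias, is negative.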
Intuition: individuals who excel in academics, demonstrating intelligence and dedication, are more likely to extend their educational journey, which in turn increases their chances of achieving higher scores in military entrance examinations. At the same time, those with higher levels of education tend to have fewer children. In summary, individuals who invest more time in their education tend to perform better in military entrance exams while also potentially having fewer children.