What is the bias of an estimator?
Omitted variable bias can arise when a relevant factor is left out of the linear regression function. If the missing factor is correlated with both an included regressor in our model and the dependent variable (Y) that we are trying to estimate, the coefficient on that regressor picks up part of the missing factor's effect. Here is an example:
\[Wage = \beta_0 + \beta_1Experience + \beta_2EducLevel + \epsilon \]
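If EducLevel is omitted and Wage is regressed on Experience alone, the OLS estimator no longer centers on \(\beta_1\). A standard result (written here for this two-regressor example) is
\[ E[\tilde{\beta}_1] = \beta_1 + \beta_2\delta_1, \qquad \delta_1 = \frac{Cov(Experience, EducLevel)}{Var(Experience)} \]
where \(\tilde{\beta}_1\) is the estimator from the short regression and \(\delta_1\) is the slope from regressing the omitted EducLevel on the included Experience. The bias term \(\beta_2\delta_1\) is nonzero exactly when the omitted variable is a determinant of Y (\(\beta_2 \neq 0\)) and correlates with the included regressor (\(\delta_1 \neq 0\)); these are the two conditions checked below. Note that the sample size does not appear in this formula.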
Will the bias go away if we increase the sample size or add more variables?
# Load the package that provides read_excel()
library(readxl)

# Load data 1
income <- read_excel("/Users/pin.lyu/Desktop/2008_Income/cps_2008.xlsx")
# Load data 2 (built-in R dataset)
swiss <- swiss
Data description
Variables:
wage = earnings per hour, in dollars
educ = years of education
age = years of age
exper = years of work experience
female = gender, = 1 if female
black = race, = 1 if black
white = race, = 1 if white
married = marital status, = 1 if married
union = union member, = 1 if yes
northeast = region, = 1 if northeast
midwest = region, = 1 if midwest
south = region, = 1 if south
west = region, = 1 if west
fulltime = work status, = 1 if full-time employee
metro = living environment, = 1 if living in a city
Data source: Dr. Kang Sun Lee, Louisiana Department of Health and Human Services.
library(corrplot)
## corrplot 0.92 loaded
corr <- cor(income)
# Present the correlation matrix as a graph
corrplot(corr,
         method = 'square',
         order = 'FPC',
         type = 'lower',
         diag = FALSE)
Full linear regression
\[ Wage = \beta_0 + \beta_1exper + \beta_2educ + \beta_3fulltime + \beta_4age + \epsilon \]
The key independent variable that I want to study here is working experience: how does a person's work experience affect their income? Normally, as an individual gains more working experience, that individual's wage would be expected to increase.
model_A <- lm(data = income, wage ~
                exper +
                educ +
                fulltime +
                age)
summary(model_A)
##
## Call:
## lm(formula = wage ~ exper + educ + fulltime + age, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.491 -3.274 -1.020 2.164 61.817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.0778 3.3721 -4.471 7.95e-06 ***
## exper -0.7472 0.5580 -1.339 0.181
## educ 0.3696 0.5582 0.662 0.508
## fulltime 1.6297 0.2437 6.689 2.51e-11 ***
## age 0.8644 0.5576 1.550 0.121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.386 on 4728 degrees of freedom
## Multiple R-squared: 0.2493, Adjusted R-squared: 0.2487
## F-statistic: 392.5 on 4 and 4728 DF, p-value: < 2.2e-16
Shorter version
\[ Wage = \beta_0 + \beta_1exper + \beta_2educ + \beta_3fulltime + \epsilon \]
model_B <- lm(data = income, wage ~
                exper +
                educ +
                fulltime)
summary(model_B)
##
## Call:
## lm(formula = wage ~ exper + educ + fulltime, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.495 -3.275 -1.018 2.167 61.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.913117 0.522929 -18.957 < 2e-16 ***
## exper 0.117819 0.006979 16.883 < 2e-16 ***
## educ 1.233413 0.033647 36.657 < 2e-16 ***
## fulltime 1.644950 0.243493 6.756 1.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.387 on 4729 degrees of freedom
## Multiple R-squared: 0.2489, Adjusted R-squared: 0.2484
## F-statistic: 522.4 on 3 and 4729 DF, p-value: < 2.2e-16
Comparison
Difference between the two linear regression functions
# Compare the two versions of regression results
library(stargazer)
stargazer(model_A, model_B,
          type = "text",
          covariate.labels = c("experience", "Education", "fulltime", "age", "β0"))
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## wage
## (1) (2)
## -----------------------------------------------------------------------
## experience -0.747 0.118***
## (0.558) (0.007)
##
## Education 0.370 1.233***
## (0.558) (0.034)
##
## fulltime 1.630*** 1.645***
## (0.244) (0.243)
##
## age 0.864
## (0.558)
##
## β0 -15.078*** -9.913***
## (3.372) (0.523)
##
## -----------------------------------------------------------------------
## Observations 4,733 4,733
## R2 0.249 0.249
## Adjusted R2 0.249 0.248
## Residual Std. Error 5.386 (df = 4728) 5.387 (df = 4729)
## F Statistic 392.512*** (df = 4; 4728) 522.393*** (df = 3; 4729)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
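The sign flip in the experience coefficient (-0.747 in model (1) versus 0.118 in model (2)) is exactly what the omitted variable algebra predicts. As a minimal sketch using the model objects fitted above, the short-model coefficient can be reproduced from the long model plus a bias term built from an auxiliary regression of the omitted age on the included regressors:
# Reproduce the short-model exper coefficient from the long model:
# bias = (long-model age coefficient) x (exper slope from regressing age
# on the short model's regressors)
aux  <- lm(age ~ exper + educ + fulltime, data = income)
bias <- coef(model_A)["age"] * coef(aux)["exper"]
coef(model_A)["exper"] + bias  # matches coef(model_B)["exper"], about 0.118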
Two conditions that an omitted variable needs to meet:
Condition 1: the omitted variable must correlate with one or more of the independent variables in the regression function.
cor(income[,c(1:4)])
## wage educ age exper
## wage 1.0000000 0.43867643 0.24761954 0.1544979
## educ 0.4386764 1.00000000 0.05897888 -0.1479991
## age 0.2476195 0.05897888 1.00000000 0.9784605
## exper 0.1544979 -0.14799908 0.97846047 1.0000000
Condition 2: The omitted variable is a determinant of the dependent variable Y.
# Test how statistically significant the omitted variable "age" is for the dependent variable "wage"
cor_test_A <- cor.test(income$age, income$wage)
cor_test_A
##
## Pearson's product-moment correlation
##
## data: income$age and income$wage
## t = 17.579, df = 4731, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2206859 0.2741758
## sample estimates:
## cor
## 0.2476195
Direction of Bias
Intuition behind how an omitted variable creates bias
The omitted variable, say \(X_2\), creates a confounding effect on the relationship between \(X_1\) and \(Y\), and this confounding effect biases the estimated coefficient of \(X_1\) away from its true value. The direction of the bias depends on the signs of the omitted variable's relationships with the included variable and with the dependent variable: if the two signs agree the bias is positive, and if they differ the bias is negative. In the income example, age relates positively to both exper (correlation 0.978) and wage (correlation 0.248), so omitting age biases the exper coefficient upward, consistent with the jump from -0.747 to 0.118 in the comparison above.
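A minimal simulation sketch (with made-up data, not from either dataset) makes the sign rule concrete: \(x_2\) is built to satisfy both conditions with positive signs, so the short regression overstates the coefficient on \(x_1\), and increasing the sample size does not make the bias go away.
# Simulation with known true coefficients (hypothetical data)
set.seed(42)
n  <- 100000                          # a large sample does not remove the bias
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)             # Condition 1: x2 correlates with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # Condition 2: x2 is a determinant of y
coef(lm(y ~ x1 + x2))["x1"]  # long regression: close to the true value 2
coef(lm(y ~ x1))["x1"]       # short regression: close to 2 + 3 * 0.8 = 4.4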
Data description
Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in about 1888.
Variables:
Fertility = common standardized fertility measure
Agriculture = % of males involved in agriculture as occupation
Examination = % of draftees receiving the highest mark on the army examination
Education = % of draftees with education beyond primary school
Catholic = % 'catholic' (as opposed to 'protestant')
Infant.Mortality = % of live births who live less than 1 year
# View data type of each column
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# Build the correlation matrix
corr <- cor(swiss)
# Present the correlation matrix as a graph
corrplot(corr,
         method = 'square',
         order = 'FPC',
         type = 'lower',
         diag = FALSE)
Full linear regression
\[ Fertility = \beta_0 + \beta_1Education + \beta_2Agriculture + \beta_3Catholic + \beta_4Examination + \epsilon \]
model_A <- lm(data = swiss, Fertility ~
                Education +
                Agriculture +
                Catholic +
                Examination)
summary(model_A)
##
## Call:
## lm(formula = Fertility ~ Education + Agriculture + Catholic +
## Examination, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7813 -6.3308 0.8113 5.7205 15.5569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.05542 6.94881 13.104 < 2e-16 ***
## Education -0.96161 0.19455 -4.943 1.28e-05 ***
## Agriculture -0.22065 0.07360 -2.998 0.00455 **
## Catholic 0.12442 0.03727 3.339 0.00177 **
## Examination -0.26058 0.27411 -0.951 0.34722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.736 on 42 degrees of freedom
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.6164
## F-statistic: 19.48 on 4 and 42 DF, p-value: 3.95e-09
Shorter version
\[ Fertility = \beta_0 + \beta_1Education + \beta_2Agriculture + \beta_3Catholic + \epsilon \]
model_B <- lm(data = swiss, Fertility ~
                Education +
                Agriculture +
                Catholic)
summary(model_B)
##
## Call:
## lm(formula = Fertility ~ Education + Agriculture + Catholic,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.178 -6.548 1.379 5.822 14.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.22502 4.73472 18.211 < 2e-16 ***
## Education -1.07215 0.15580 -6.881 1.91e-08 ***
## Agriculture -0.20304 0.07115 -2.854 0.00662 **
## Catholic 0.14520 0.03015 4.817 1.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.728 on 43 degrees of freedom
## Multiple R-squared: 0.6423, Adjusted R-squared: 0.6173
## F-statistic: 25.73 on 3 and 43 DF, p-value: 1.089e-09
Comparison
# Compare the two versions of regression results
stargazer(model_A, model_B,
          type = "text",
          covariate.labels = c("Education", "Agriculture", "Catholic", "Examination", "β0"))
##
## =================================================================
## Dependent variable:
## ---------------------------------------------
## Fertility
## (1) (2)
## -----------------------------------------------------------------
## Education -0.962*** -1.072***
## (0.195) (0.156)
##
## Agriculture -0.221*** -0.203***
## (0.074) (0.071)
##
## Catholic 0.124*** 0.145***
## (0.037) (0.030)
##
## Examination -0.261
## (0.274)
##
## β0 91.055*** 86.225***
## (6.949) (4.735)
##
## -----------------------------------------------------------------
## Observations 47 47
## R2 0.650 0.642
## Adjusted R2 0.616 0.617
## Residual Std. Error 7.736 (df = 42) 7.728 (df = 43)
## F Statistic 19.482*** (df = 4; 42) 25.732*** (df = 3; 43)
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Based on the summary chart, the coefficients on Education, Agriculture, and Catholic are statistically significant in both models (p-values less than 0.01), while the coefficient on Examination in model (1) is not (p = 0.347). Dropping Examination nonetheless shifts the Education coefficient from -0.962 to -1.072.
Two conditions that an omitted variable needs to meet:
Condition 1: the omitted variable must correlate with one or more of the independent variables in the regression function.
cor(swiss[,c(1:4)])
## Fertility Agriculture Examination Education
## Fertility 1.0000000 0.3530792 -0.6458827 -0.6637889
## Agriculture 0.3530792 1.0000000 -0.6865422 -0.6395225
## Examination -0.6458827 -0.6865422 1.0000000 0.6984153
## Education -0.6637889 -0.6395225 0.6984153 1.0000000
Condition 2: The omitted variable is a determinant of the dependent variable Y.
# Test how statistically significant the omitted variable "Examination" is for the dependent variable "Fertility"
cor_test_A <- cor.test(swiss$Fertility, swiss$Examination)
cor_test_A
##
## Pearson's product-moment correlation
##
## data: swiss$Fertility and swiss$Examination
## t = -5.6753, df = 45, p-value = 9.45e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7870674 -0.4403995
## sample estimates:
## cor
## -0.6458827
Direction of Bias
“Examination” is positively correlated with “Education” (correlation 0.698) and negatively correlated with “Fertility” (correlation -0.646). Because the omitted variable relates to the included variable and to the dependent variable with opposite signs, we can deduce that omitting it creates a negative bias: the Education coefficient moves from -0.962 in the full model to -1.072 in the shorter one. The sketch below verifies this numerically.
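A minimal sketch, using the model_A and model_B objects fitted in this section, reproduces the short-model Education coefficient from the long model via the same omitted-variable decomposition as before:
# bias = (long-model Examination coefficient) x (Education slope from
# regressing the omitted Examination on the short model's regressors)
aux  <- lm(Examination ~ Education + Agriculture + Catholic, data = swiss)
bias <- coef(model_A)["Examination"] * coef(aux)["Education"]
coef(model_A)["Education"] + bias  # matches coef(model_B)["Education"], about -1.072
The Examination coefficient is negative (-0.261) and the auxiliary slope on Education is positive, so the product, and hence the bias, is negative.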
Intuition: individuals who excel in academics, demonstrating intelligence and dedication, are more likely to extend their educational journey, which in turn increases their chances of achieving higher scores in military entrance examinations. At the same time, those with higher levels of education tend to have fewer children. In summary, individuals who invest more time in their education tend to perform better in military entrance exams while also potentially having fewer children.