Part 1: What is the bias of an estimator? You can read online blogs or even your textbooks to answer this question.
The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter it aims to estimate. In simpler terms, it reflects how much, on average, the estimator deviates from the actual parameter. An estimator is unbiased if its expected value equals the true parameter, meaning it accurately represents the parameter without systematic error. If the estimator consistently overestimates or underestimates the parameter, it is considered biased.
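In symbols, for an estimator $\hat{\theta}$ of a parameter $\theta$:

$$
\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta,
$$

and the estimator is unbiased exactly when $\operatorname{Bias}(\hat{\theta}) = 0$.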
Part 2: In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?
No, omitted variable bias (OVB) will not go away by increasing the sample size or adding more variables. OVB arises when a relevant variable that affects both the dependent and independent variables is left out of the model. This creates a systematic bias in the estimated coefficients. Increasing the sample size may reduce the variance but will not address the bias caused by the missing variable. Similarly, adding unrelated variables will not eliminate OVB; the bias remains until the omitted variable is included in the model or accounted for by other means, such as using instrumental variables.
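A minimal simulation sketch of this point (the data-generating process and coefficients here are illustrative assumptions, not part of the assignment): the short regression stays biased even when the sample grows 100-fold.

```r
set.seed(42)
for (n in c(100, 10000)) {
  z <- rnorm(n)                   # relevant variable that will be omitted
  x <- 0.5 * z + rnorm(n)         # key regressor, correlated with z
  y <- 1 * x + 1 * z + rnorm(n)   # true effect of x on y is 1
  b_short <- coef(lm(y ~ x))["x"] # short model that omits z
  cat("n =", n, "-> estimated slope:", round(b_short, 3), "(true slope: 1)\n")
}
```

At both sample sizes the estimate settles near 1.4, not 1: more data only tightens the estimate around the wrong value.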
Part 3: Give me 1 distinct example of OVB (on a different dataset).
I use the “Boston” dataset (from the MASS package) here, and tested the p-values in a previous discussion.
library(MASS)  # provides the Boston dataset
model <- lm(medv ~ ., data = Boston)
# Get the summary of the model to check p-values
model_summary <- summary(model)
# Extract coefficients and p-values from the model summary
coefficients_df <- as.data.frame(model_summary$coefficients)
# Extract p-values
p_values <- coefficients_df[, "Pr(>|t|)"]
# Display all p-values
print("P-values of all factors:")
print(p_values)
# Compute variance inflation factors (requires the car package)
library(car)
vif_values <- vif(model)
# Filter out high VIF values (> 10); these variables may have multicollinearity issues
high_vif <- vif_values[vif_values > 10]
print(high_vif)
named numeric(0)
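The empty result (named numeric(0)) means no predictor has a VIF above 10, so severe multicollinearity does not appear to be a concern here.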
# Plot residuals to check for homoscedasticity (constant variance) and normality
plot(model)  # shows 4 diagnostic plots: residuals vs. fitted, QQ-plot, etc.
# Extract residuals
residuals <- residuals(model)
# Perform Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(residuals)
print(shapiro_test)  # a p-value > 0.05 would indicate normally distributed residuals
Shapiro-Wilk normality test
data: residuals
W = 0.90131, p-value < 2.2e-16
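Since the p-value is far below 0.05, the residuals deviate significantly from normality.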
1. You will choose a dataset, describe the variables in it, and give us the full / correct model (be sure to write out the estimating equation in R Markdown, and pay attention to the subscripts as well). Tell us what is your key independent variable that you are interested in studying.
The Boston dataset contains data on housing values in suburbs of Boston and includes 14 variables:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
I'm interested in studying lstat.
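The full-model estimating equation (written to match the full model fit in part 5 below) is:

$$
\begin{aligned}
medv_i = {} & \beta_0 + \beta_1\, crim_i + \beta_2\, zn_i + \beta_3\, chas_i + \beta_4\, nox_i + \beta_5\, rm_i + \beta_6\, dis_i \\
& + \beta_7\, rad_i + \beta_8\, tax_i + \beta_9\, ptratio_i + \beta_{10}\, black_i + \beta_{11}\, lstat_i + \varepsilon_i,
\end{aligned}
$$

where $i$ indexes census tracts and $\beta_{11}$ on $lstat_i$ is the coefficient of interest.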
2. Now, suppose you were running the short / incorrect model where you omitted a variable “by mistake”. Write the estimating equation out as well.
In the real world, you rarely omit a variable by mistake; more often, you simply do not have data on it. Otherwise, of course, you would run the full model.
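The short-model estimating equation, with rm omitted, is:

$$
\begin{aligned}
medv_i = {} & \alpha_0 + \alpha_1\, crim_i + \alpha_2\, zn_i + \alpha_3\, chas_i + \alpha_4\, nox_i + \alpha_5\, dis_i \\
& + \alpha_6\, rad_i + \alpha_7\, tax_i + \alpha_8\, ptratio_i + \alpha_9\, black_i + \alpha_{10}\, lstat_i + u_i.
\end{aligned}
$$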
3. From the OVB formula, tell us whether the omitted variable will cause bias or not, i.e., are the two conditions for OVB met or not?
1) Be sure to list out the 2 conditions for OVB explicitly and translate it for your example.
2) You can run the correlation functions in R to show if the two conditions are met -
-See if the correlation between omitted variable and y is statistically significant.
-See if the correlation between omitted variable and key x is statistically significant.
OVB occurs if two conditions are met: 1. the omitted variable is correlated with the dependent variable; 2. the omitted variable is correlated with one or more of the included independent variables. We need to check whether rm is correlated with medv and with included variables like lstat.
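For reference, the textbook OVB formula in the simple case: if the true model is $y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$ and we omit $z$, the short regression of $y$ on $x$ alone satisfies

$$
E[\hat{\beta}_1^{\,short}] = \beta_1 + \beta_2\, \delta_1,
$$

where $\delta_1$ is the slope from regressing $z$ on $x$. The bias term $\beta_2 \delta_1$ is nonzero exactly when both conditions above hold.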
# Correlation between omitted variable 'rm' and dependent variable 'medv'
cor_rm_medv <- cor(Boston$rm, Boston$medv)
cat("Correlation between rm and medv:", cor_rm_medv, "\n")
Correlation between rm and medv: 0.6953599
# Correlation between omitted variable 'rm' and key independent variable 'lstat'
cor_rm_lstat <- cor(Boston$rm, Boston$lstat)
cat("Correlation between rm and lstat:", cor_rm_lstat, "\n")
Correlation between rm and lstat: -0.6138083
Both correlations are clearly different from 0 (with n = 506, correlations of this magnitude are highly statistically significant), so the two conditions for OVB are met: omitting rm will bias the estimated effect of lstat on medv.
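To test significance formally, as the question suggests, one can run cor.test() on each pair (a sketch using the same variables as above):

```r
# Condition 1: omitted variable vs. the dependent variable
cor.test(Boston$rm, Boston$medv)
# Condition 2: omitted variable vs. the key independent variable
cor.test(Boston$rm, Boston$lstat)
```

Both tests return p-values far below 0.05 given correlations of this size at n = 506.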
4. Furthermore, OVB will be in what direction (positive/negative bias)? Which case/cell in the 2-by-2 matrix that lists the 2 OVB conditions?
Correlation between rm and medv: 0.6953599. The omitted variable rm is a determinant of medv (strong positive correlation). Correlation between rm and lstat: -0.6138083. rm is strongly negatively correlated with the key variable lstat, so omitting it will bias the estimated effect of lstat on medv.
Since rm has a positive relationship with medv and a negative correlation with lstat, the bias term (a positive effect times a negative delta) is negative: the short-model estimate for lstat will be more negative than the true effect.
In the 2-by-2 matrix of OVB cases, this example falls in the cell:

- Correlation of omitted variable with y: positive
- Correlation of omitted variable with key x: negative
- Resulting bias: negative
5. Show the two regressions side by side (you can use stargazer command) and confirm the bias is in the direction OVB formula predicted.
# Full model (with rm)
full_model <- lm(medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
# Short model (without rm)
short_model <- lm(medv ~ crim + zn + chas + nox + dis + rad + tax + ptratio + black + lstat, data = Boston)
# Show both models side by side (requires the stargazer package)
library(stargazer)
stargazer(full_model, short_model, type = "text", title = "Comparison of Full and Short Models")
The coefficient for lstat in the short model is more negative (-0.752) than in the full model (-0.523), confirming the negative bias due to the omission of rm. The R-squared value drops from 0.741 in the full model to 0.695 in the short model, indicating a worse fit. All coefficients are statistically significant at the 1% level, and the F-statistic is significant in both models, showing that at least one predictor is related to medv. The omission of rm leads to an exaggerated negative effect of lstat on medv, highlighting how failing to include relevant variables distorts estimates. Thus we confirm the presence of negative bias, in the direction the OVB formula predicted.
6. Try to provide some intuition to why does OVB formula work / bias your results in the example in a certain direction.
The OVB formula works because it captures the interplay between the omitted variable and its relationships with both the dependent and the included independent variables. When rm is dropped, lstat partially absorbs rm's effect: since rm raises medv and moves inversely with lstat, the short-model coefficient on lstat picks up part of rm's positive effect with a negative sign. The estimated effect of lstat on medv therefore comes out more negative than it actually is, which is why including all relevant variables matters for obtaining unbiased estimates.
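This intuition can be checked numerically with the multivariate version of the OVB formula, reusing full_model and short_model from part 5: the bias on lstat equals the full-model coefficient on rm times the lstat slope from an auxiliary regression of rm on the short model's regressors.

```r
# Auxiliary regression: omitted variable rm on the short model's regressors
aux_model <- lm(rm ~ crim + zn + chas + nox + dis + rad + tax + ptratio + black + lstat,
                data = Boston)
# Bias predicted by the OVB formula vs. bias actually observed
bias_predicted <- coef(full_model)["rm"] * coef(aux_model)["lstat"]
bias_observed  <- coef(short_model)["lstat"] - coef(full_model)["lstat"]
cat("Predicted bias:", bias_predicted, "\n")
cat("Observed bias: ", bias_observed, "\n")  # the two match exactly for OLS
```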
ADVANCED BONUS QUESTION (for deeper understanding of OVB) -
1. Try to add/exclude a variable in your multivariate regression that does not impact y (is uncorrelated with y) but is correlated with the key x variable, and show that your point estimate will not change (significantly).
We’ll add the variable age, which represents the proportion of owner-occupied units built prior to 1940, to our model.
# Check correlation between age and lstat
cor_age_lstat <- cor(Boston$age, Boston$lstat)
cat("Correlation between age and lstat:", cor_age_lstat, "\n")
Correlation between age and lstat: 0.6023385
# Check correlation between age and medv
cor_age_medv <- cor(Boston$age, Boston$medv)
cat("Correlation between age and medv:", cor_age_medv, "\n")
Correlation between age and medv: -0.3769546
# Full model without age
model_without_age <- lm(medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
summary(model_without_age)
# Same model with age added, for comparison
model_with_age <- lm(medv ~ crim + zn + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = Boston)
summary(model_with_age)
In the regression analysis, we added the variable age. Its raw correlation with medv (-0.38) is not exactly zero, but conditional on the other regressors its effect on medv is negligible, while it is clearly correlated with lstat (0.60). The coefficient for lstat remained essentially unchanged at -0.523 in both models (with and without age), indicating that including age does not significantly change the point estimate for lstat. This confirms that adding a variable with (almost) no independent effect on the dependent variable does not affect the estimates of the other variables.
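A quick side-by-side check of the two point estimates, using the models fit above:

```r
cat("lstat coefficient without age:", coef(model_without_age)["lstat"], "\n")
cat("lstat coefficient with age:   ", coef(model_with_age)["lstat"], "\n")
```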
2. Try to add/exclude a variable in your multivariate regression that impacts y (is correlated with y) but is not correlated with the key x variable, and show that your point estimate will not change (significantly).
# Check correlation between nox and lstat
cor_nox_lstat <- cor(Boston$nox, Boston$lstat)
cat("Correlation between nox and lstat:", cor_nox_lstat, "\n")
Correlation between nox and lstat: 0.5908789
# Full model without nox
model_without_nox <- lm(medv ~ crim + zn + chas + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
summary(model_without_nox)
In this scenario, we added the variable nox, which impacts medv but is moderately correlated with lstat (0.5909), so it does not satisfy the "uncorrelated with the key x" condition perfectly. The coefficient for lstat changed from -0.5707 (without nox) to -0.5230 (with nox). The point estimate moved, but not substantially, indicating that adding a variable correlated with the dependent variable but not too strongly with lstat has only a minor effect on the point estimate of lstat.
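The same side-by-side check for nox, reusing full_model (which includes nox) from part 5:

```r
cat("lstat coefficient with nox:   ", coef(full_model)["lstat"], "\n")
cat("lstat coefficient without nox:", coef(model_without_nox)["lstat"], "\n")
```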