I. The bias of an estimator represents the difference between the mean value of the estimator and the population parameter. In other words, the bias measures how much, on average, the estimator deviates from the true parameter. Ideally that value is zero, implying that the estimator is completely unbiased.
II.Both increasing the sample size and adding variables can lessen, but not necessarily eliminate, the bias of estimator. Per the law of large numbers, increasing sample size should bring the expected value of the sample closer to the population mean. Adding relevant independent variables can cause the model to more effectively explain the variation in the dependent variable. However, adding variables that are not relevant can cause overfitting. This is when variables do not explain the underlying patterns the model is intended to explain.
My two datasets to identify OVB are USMacroG and Fatalities from the AER library. USMacroG is a time series dataset with a variety of US macroeconomic concepts at a quarterly frequency from 1950-2000. Variables include GDP and its components, unemployment, inflation and interest rates, among others. I decided to model unemployment using the other concepts. For the full list of concepts, visit the AER package website: https://cran.r-project.org/web/packages/AER/AER.pdf
#First dataset - macro US data
data("USMacroG")
USMacroG <- data.frame(USMacroG)
USMacroG <- na.omit(USMacroG)
stargazer(USMacroG,
type = "text",
title = "Summary Stats")
##
## Summary Stats
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## gdp 203 4,577.188 2,108.934 1,658.800 9,303.900
## consumption 203 3,008.995 1,456.900 1,075.900 6,341.100
## invest 203 654.530 391.371 197.700 1,801.600
## government 203 1,000.149 310.972 359.600 1,582.800
## dpi 203 3,352.094 1,577.521 1,178.100 6,634.900
## cpi 203 226.586 148.955 71.400 521.100
## m1 203 455.615 359.800 111.750 1,152.100
## tbill 203 5.250 2.837 0.810 15.090
## unemp 203 5.671 1.578 2.600 10.700
## population 203 214.287 35.517 150.260 281.422
## inflation 203 3.939 3.399 -2.530 16.864
## interest 203 1.311 2.890 -11.216 10.626
## -------------------------------------------------------
#stack the data to enable two graphs
USMacroG_stacked <- USMacroG %>%
pivot_longer(cols = -unemp, names_to = "Var", values_to = "Explanatory")
#create combined graph of scatterplots
ggplot(USMacroG_stacked, aes(x = Explanatory, y = unemp)) +
geom_point() +
facet_wrap(~Var, scales = "free_x")
ggplot(USMacroG_stacked, aes(x = Explanatory)) +
geom_histogram(bins = 30) +
facet_wrap(~Var, scales = "free_x")
#create combined scatterplot
ggplot(USMacroG_stacked, aes(x = Explanatory, y = unemp, color = Var)) +
geom_point()+
labs(
x= "Unemployment"
)
#Look at correlations between concepts. Will be important for OVB identification.
Macro_cor <- cor(USMacroG)
corrplot(Macro_cor, type = "upper", order = "hclust",
col=brewer.pal(n=5, name="RdYlBu"))
Now that I’ve looked at the data in more detail, I’ll create two models. The first is a version that contains OVB. The second has less bias due to added variables, fitting the criteria of OVB.
#starting with gov spending, population, interest rate.
reg_unemp <- lm(data = USMacroG, formula = unemp ~ government + population + interest)
summary(reg_unemp)
##
## Call:
## lm(formula = unemp ~ government + population + interest, data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9136 -0.8183 -0.0899 0.7424 2.9216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.548879 1.248994 -7.645 8.69e-13 ***
## government -0.014794 0.001368 -10.811 < 2e-16 ***
## population 0.139313 0.011922 11.686 < 2e-16 ***
## interest 0.124133 0.030320 4.094 6.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.165 on 199 degrees of freedom
## Multiple R-squared: 0.4633, Adjusted R-squared: 0.4552
## F-statistic: 57.26 on 3 and 199 DF, p-value: < 2.2e-16
#omitted variable is investment and consumption, other components of GDP that correlate with unemployment and the independent variables
#Testing significance of correlation between potential new X values and Y
cor.test(USMacroG$unemp,USMacroG$consumption)
##
## Pearson's product-moment correlation
##
## data: USMacroG$unemp and USMacroG$consumption
## t = 3.4779, df = 201, p-value = 0.0006193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1039509 0.3640168
## sample estimates:
## cor
## 0.2382502
cor.test(USMacroG$unemp,USMacroG$invest)
##
## Pearson's product-moment correlation
##
## data: USMacroG$unemp and USMacroG$invest
## t = 1.4331, df = 201, p-value = 0.1534
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03765784 0.23502825
## sample estimates:
## cor
## 0.1005736
#Testing significance of correlation between potential new X values and other independent variables
cor.test(USMacroG$population,USMacroG$consumption)
##
## Pearson's product-moment correlation
##
## data: USMacroG$population and USMacroG$consumption
## t = 70.14, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9739281 0.9849397
## sample estimates:
## cor
## 0.9801771
cor.test(USMacroG$population,USMacroG$invest)
##
## Pearson's product-moment correlation
##
## data: USMacroG$population and USMacroG$invest
## t = 35.974, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9091224 0.9467670
## sample estimates:
## cor
## 0.9303562
#All correlations fall within the 95% confidence interval, indicating they are statistically significant.
reg_unemp_v2 <- lm( data = USMacroG, formula = unemp ~ government + population + interest + log(invest) + consumption)
summary(reg_unemp_v2)
##
## Call:
## lm(formula = unemp ~ government + population + interest + log(invest) +
## consumption, data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82144 -0.64982 -0.01044 0.52995 1.82602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8892788 2.1961837 2.682 0.00795 **
## government -0.0136722 0.0010990 -12.441 < 2e-16 ***
## population 0.2803886 0.0126039 22.246 < 2e-16 ***
## interest 0.0454003 0.0216972 2.092 0.03768 *
## log(invest) -7.0199274 0.5703873 -12.307 < 2e-16 ***
## consumption -0.0007960 0.0002403 -3.313 0.00110 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8052 on 197 degrees of freedom
## Multiple R-squared: 0.7462, Adjusted R-squared: 0.7398
## F-statistic: 115.8 on 5 and 197 DF, p-value: < 2.2e-16
stargazer(reg_unemp_v2, reg_unemp,
type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## unemp
## (1) (2)
## --------------------------------------------------------------------
## government -0.014*** -0.015***
## (0.001) (0.001)
##
## population 0.280*** 0.139***
## (0.013) (0.012)
##
## interest 0.045** 0.124***
## (0.022) (0.030)
##
## log(invest) -7.020***
## (0.570)
##
## consumption -0.001***
## (0.0002)
##
## Constant 5.889*** -9.549***
## (2.196) (1.249)
##
## --------------------------------------------------------------------
## Observations 203 203
## R2 0.746 0.463
## Adjusted R2 0.740 0.455
## Residual Std. Error 0.805 (df = 197) 1.165 (df = 199)
## F Statistic 115.837*** (df = 5; 197) 57.263*** (df = 3; 199)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The variables investment and consumption correlate with both the independent variables and the dependent variables, which means they fit the criteria for an omitted variable. The correlations are all positive, and their effects on unemployment are negative, meaning the bias is negative. Adding these variables makes sense intuitively. Unemployment and GDP are indicators closely tied to one another, so adding components of GDP to the model should help to explain changes in GDP. The final formula is now:
\[unemp = \beta0 + government\beta1 + population\beta2 + interest\beta3 + invest\beta4 + consumption\beta5 +\epsilon\] With coefficients: \[unemp = \beta0 + -0.016*government + 0.131*population + 0.125*interest + -0.011*invest + 0.003*consumption +\epsilon\]
The next dataset, Fatalities, contains panel data on annual state-level traffic fatalities in the years 1982-1988. Among the variables are income, spirits consumption, population, fatalities, miles driven, and others. For a full list, visit: https://cran.r-project.org/web/packages/AER/AER.pdf
I am attempting to model alcohol-related traffic fatalities (afatal). First, I’ll clean and analyze the data.
data("Fatalities")
Fatalities <- data.frame(Fatalities)
stargazer(Fatalities,
type = "text",
title = "Summary Stats")
##
## Summary Stats
## =======================================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------------------------
## spirits 336 1.754 0.684 0.790 4.900
## unemp 336 7.347 2.533 2.400 18.000
## income 336 13,880.180 2,253.046 9,513.762 22,193.460
## emppop 336 60.806 4.722 42.993 71.269
## beertax 336 0.513 0.478 0.043 2.721
## baptist 336 7.157 9.763 0.000 30.356
## mormon 336 2.802 9.665 0.100 65.916
## drinkage 336 20.456 0.899 18.000 21.000
## dry 336 4.267 9.501 0.000 45.792
## youngdrivers 336 0.186 0.025 0.073 0.282
## miles 336 7,890.754 1,475.659 4,576.346 26,148.270
## fatal 336 928.664 934.051 79 5,504
## nfatal 336 182.583 188.431 13 1,049
## sfatal 336 109.949 108.540 8 603
## fatal1517 336 62.610 55.729 3 318
## nfatal1517 336 12.262 12.253 0 76
## fatal1820 336 106.661 104.224 7 601
## nfatal1820 336 33.527 33.238 0 196
## fatal2124 336 126.872 131.789 12 770
## nfatal2124 336 41.378 42.930 1 249
## afatal 336 293.333 303.581 24.600 2,094.900
## pop 336 4,930,272.000 5,073,704.000 478,999.700 28,314,028.000
## pop1517 336 230,815.500 229,896.300 21,000.020 1,172,000.000
## pop1820 336 249,090.400 249,345.600 20,999.960 1,321,004.000
## pop2124 336 336,389.900 345,304.400 30,000.160 1,892,998.000
## milestot 336 37,101.490 37,454.370 3,993.000 241,575.000
## unempus 336 7.529 1.479 5.500 9.700
## emppopus 336 59.971 1.585 57.800 62.300
## gsp 336 0.025 0.043 -0.124 0.142
## -----------------------------------------------------------------------
#delete non-numeric columns
Fatalities <- select(Fatalities,-c(state, year, breath, jail, service))
#create some interaction terms.
Fatalities <- Fatalities %>% mutate(afatalrate = (afatal*pop),# fatalities and population
milesper = (milestot/pop),# miles per person
spiritsper= spirits*pop) #spirits consumption and population
#delete na values
Fatalities <- na.omit(Fatalities)
#stack for additional
Fatalities_stacked <- Fatalities %>%
pivot_longer(cols = -afatalrate, names_to = "Var", values_to = "Explanatory")
ggplot(Fatalities_stacked, aes(x = Explanatory, y = afatalrate)) +
geom_point() +
facet_wrap(~Var, scales = "free_x")
ggplot(Fatalities_stacked, aes(x = Explanatory)) +
geom_histogram(bins = 30) +
facet_wrap(~Var, scales = "free_x")
ggplot(Fatalities_stacked, aes(x = Explanatory, y = afatalrate, color = Var)) +
geom_point() +
labs(
x= "Alcohol-related traffic fatalities"
)
Fatalities_cor <- cor(Fatalities)
corrplot(Fatalities_cor, type = "upper", order = "hclust",
col=brewer.pal(n=5, name="RdYlBu"))
Afatal appears skewed, so I’ll use a log version in the formula. In fact, all variables I will use are skewed, so this will be a log-log model. Again, the first model will have OVB and the second attempts to correct for it.
afatal_reg <- lm(formula = log(afatal) ~ log(spiritsper), data = Fatalities)
summary(afatal_reg)
##
## Call:
## lm(formula = log(afatal) ~ log(spiritsper), data = Fatalities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2986 -0.4091 0.0654 0.3521 1.2580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10995 0.43360 -16.4 <2e-16 ***
## log(spiritsper) 0.80018 0.02798 28.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5063 on 334 degrees of freedom
## Multiple R-squared: 0.71, Adjusted R-squared: 0.7092
## F-statistic: 817.8 on 1 and 334 DF, p-value: < 2.2e-16
#% of fatal accidents with drivers aged 21-24 correlates with alcohol related deaths and spirits*population. Adding that may decrease bias
cor.test(Fatalities$afatal,Fatalities$nfatal2124)
##
## Pearson's product-moment correlation
##
## data: Fatalities$afatal and Fatalities$nfatal2124
## t = 52.329, df = 334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9311407 0.9546441
## sample estimates:
## cor
## 0.9440794
cor.test(Fatalities$spiritsper,Fatalities$nfatal2124)
##
## Pearson's product-moment correlation
##
## data: Fatalities$spiritsper and Fatalities$nfatal2124
## t = 34.598, df = 334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8584425 0.9055459
## sample estimates:
## cor
## 0.8842224
#correlation falls within the 95% confidence interval, indicating these correlations are statistically significant.
afatal_reg_v2 <- lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124), data = Fatalities)
summary(afatal_reg_v2)
##
## Call:
## lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124),
## data = Fatalities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20738 -0.25027 0.00372 0.23656 1.33002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.35363 0.57322 2.361 0.0188 *
## log(spiritsper) 0.08726 0.04522 1.930 0.0545 .
## log(nfatal2124) 0.78435 0.04454 17.612 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3648 on 333 degrees of freedom
## Multiple R-squared: 0.8499, Adjusted R-squared: 0.849
## F-statistic: 942.5 on 2 and 333 DF, p-value: < 2.2e-16
The variables afatal, spiritsper, and number of fatal accidents are all correlated, as you can see in the correlation plot, so there is reason to think that any bias present could be lessened by adding fatal accidents among people aged 21-24 years. The additional variable, young fatalities, has a positive coefficient, meaning the bias is positive. The intuition makes some sense to me. Younger people are less experienced drivers and are newer to the dangers of alcohol, which could make them more likely to be involved in alcohol related fatalities.
stargazer(afatal_reg_v2, afatal_reg,
type = "text")
##
## =====================================================================
## Dependent variable:
## -------------------------------------------------
## log(afatal)
## (1) (2)
## ---------------------------------------------------------------------
## log(spiritsper) 0.087* 0.800***
## (0.045) (0.028)
##
## log(nfatal2124) 0.784***
## (0.045)
##
## Constant 1.354** -7.110***
## (0.573) (0.434)
##
## ---------------------------------------------------------------------
## Observations 336 336
## R2 0.850 0.710
## Adjusted R2 0.849 0.709
## Residual Std. Error 0.365 (df = 333) 0.506 (df = 334)
## F Statistic 942.502*** (df = 2; 333) 817.816*** (df = 1; 334)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The final formula is: \[log(afatal) = \beta0 + log(spiritsper)\beta1 + log(nfatal2124)\beta2 + \epsilon\] An example of a variable correlated with Y but not X is Baptist. Let’s look at the correlation graphic once more and see what happens when it’s added to the model.
corrplot(Fatalities_cor, type = "upper", order = "hclust",
col=brewer.pal(n=5, name="RdYlBu"))
afatal_reg_v3 <- lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124) + baptist, data = Fatalities)
stargazer(afatal_reg_v3, afatal_reg,
type = "text")
##
## =====================================================================
## Dependent variable:
## -------------------------------------------------
## log(afatal)
## (1) (2)
## ---------------------------------------------------------------------
## log(spiritsper) 0.206*** 0.800***
## (0.041) (0.028)
##
## log(nfatal2124) 0.633***
## (0.041)
##
## baptist 0.020***
## (0.002)
##
## Constant -0.136 -7.110***
## (0.517) (0.434)
##
## ---------------------------------------------------------------------
## Observations 336 336
## R2 0.887 0.710
## Adjusted R2 0.886 0.709
## Residual Std. Error 0.316 (df = 332) 0.506 (df = 334)
## F Statistic 871.894*** (df = 3; 332) 817.816*** (df = 1; 334)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The coefficient associated with the baptist variable is indeed very small.