Bias in linear models

R Markdown

I. The bias of an estimator represents the difference between the mean value of the estimator and the population parameter. In other words, the bias measures how much, on average, the estimator deviates from the true parameter. Ideally that value is zero, implying that the estimator is completely unbiased.

II.Both increasing the sample size and adding variables can lessen, but not necessarily eliminate, the bias of estimator. Per the law of large numbers, increasing sample size should bring the expected value of the sample closer to the population mean. Adding relevant independent variables can cause the model to more effectively explain the variation in the dependent variable. However, adding variables that are not relevant can cause overfitting. This is when variables do not explain the underlying patterns the model is intended to explain.

My two datasets to identify OVB are USMacroG and Fatalities from the AER library. USMacroG is a time series dataset with a variety of US macroeconomic concepts at a quarterly frequency from 1950-2000. Variables include GDP and its components, unemployment, inflation and interest rates, among others. I decided to model unemployment using the other concepts. For the full list of concepts, visit the AER package website: https://cran.r-project.org/web/packages/AER/AER.pdf

#First dataset - macro US data
data("USMacroG")
USMacroG <- data.frame(USMacroG)
USMacroG <- na.omit(USMacroG)

stargazer(USMacroG, 
          type = "text",
          title = "Summary Stats")

## 
## Summary Stats
## =======================================================
## Statistic    N    Mean    St. Dev.     Min       Max   
## -------------------------------------------------------
## gdp         203 4,577.188 2,108.934 1,658.800 9,303.900
## consumption 203 3,008.995 1,456.900 1,075.900 6,341.100
## invest      203  654.530   391.371   197.700  1,801.600
## government  203 1,000.149  310.972   359.600  1,582.800
## dpi         203 3,352.094 1,577.521 1,178.100 6,634.900
## cpi         203  226.586   148.955   71.400    521.100 
## m1          203  455.615   359.800   111.750  1,152.100
## tbill       203   5.250     2.837     0.810    15.090  
## unemp       203   5.671     1.578     2.600    10.700  
## population  203  214.287   35.517    150.260   281.422 
## inflation   203   3.939     3.399    -2.530    16.864  
## interest    203   1.311     2.890    -11.216   10.626  
## -------------------------------------------------------

#stack the data to enable two graphs
USMacroG_stacked <- USMacroG %>%
  pivot_longer(cols = -unemp, names_to = "Var", values_to = "Explanatory")
#create combined graph of scatterplots
ggplot(USMacroG_stacked, aes(x = Explanatory, y = unemp)) + 
  geom_point() + 
  facet_wrap(~Var, scales = "free_x")

ggplot(USMacroG_stacked, aes(x = Explanatory)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~Var, scales = "free_x")

#create combined scatterplot
ggplot(USMacroG_stacked, aes(x = Explanatory, y = unemp, color = Var)) +
  geom_point()+
  labs(
    x= "Unemployment"
  )

#Look at correlations between concepts. Will be important for OVB identification.
Macro_cor <- cor(USMacroG)
corrplot(Macro_cor, type = "upper", order = "hclust",
         col=brewer.pal(n=5, name="RdYlBu"))

Now that I’ve looked at the data in more detail, I’ll create two models. The first is a version that contains OVB. The second has less bias due to added variables, fitting the criteria of OVB.

#starting with gov spending, population, interest rate.
reg_unemp <- lm(data = USMacroG, formula = unemp ~  government + population + interest)
summary(reg_unemp)

## 
## Call:
## lm(formula = unemp ~ government + population + interest, data = USMacroG)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9136 -0.8183 -0.0899  0.7424  2.9216 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.548879   1.248994  -7.645 8.69e-13 ***
## government  -0.014794   0.001368 -10.811  < 2e-16 ***
## population   0.139313   0.011922  11.686  < 2e-16 ***
## interest     0.124133   0.030320   4.094 6.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.165 on 199 degrees of freedom
## Multiple R-squared:  0.4633, Adjusted R-squared:  0.4552 
## F-statistic: 57.26 on 3 and 199 DF,  p-value: < 2.2e-16

#omitted variable is investment and consumption, other components of GDP that correlate with unemployment and the independent variables
#Testing significance of correlation between potential new X values and Y
cor.test(USMacroG$unemp,USMacroG$consumption)

## 
##  Pearson's product-moment correlation
## 
## data:  USMacroG$unemp and USMacroG$consumption
## t = 3.4779, df = 201, p-value = 0.0006193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1039509 0.3640168
## sample estimates:
##       cor 
## 0.2382502

cor.test(USMacroG$unemp,USMacroG$invest)

## 
##  Pearson's product-moment correlation
## 
## data:  USMacroG$unemp and USMacroG$invest
## t = 1.4331, df = 201, p-value = 0.1534
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03765784  0.23502825
## sample estimates:
##       cor 
## 0.1005736

#Testing significance of correlation between potential new X values and other independent variables
cor.test(USMacroG$population,USMacroG$consumption)

## 
##  Pearson's product-moment correlation
## 
## data:  USMacroG$population and USMacroG$consumption
## t = 70.14, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9739281 0.9849397
## sample estimates:
##       cor 
## 0.9801771

cor.test(USMacroG$population,USMacroG$invest)

## 
##  Pearson's product-moment correlation
## 
## data:  USMacroG$population and USMacroG$invest
## t = 35.974, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9091224 0.9467670
## sample estimates:
##       cor 
## 0.9303562

#All correlations fall within the 95% confidence interval, indicating they are statistically significant.

reg_unemp_v2 <- lm( data = USMacroG, formula = unemp ~  government + population + interest + log(invest) + consumption)
summary(reg_unemp_v2)

## 
## Call:
## lm(formula = unemp ~ government + population + interest + log(invest) + 
##     consumption, data = USMacroG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82144 -0.64982 -0.01044  0.52995  1.82602 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.8892788  2.1961837   2.682  0.00795 ** 
## government  -0.0136722  0.0010990 -12.441  < 2e-16 ***
## population   0.2803886  0.0126039  22.246  < 2e-16 ***
## interest     0.0454003  0.0216972   2.092  0.03768 *  
## log(invest) -7.0199274  0.5703873 -12.307  < 2e-16 ***
## consumption -0.0007960  0.0002403  -3.313  0.00110 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8052 on 197 degrees of freedom
## Multiple R-squared:  0.7462, Adjusted R-squared:  0.7398 
## F-statistic: 115.8 on 5 and 197 DF,  p-value: < 2.2e-16

stargazer(reg_unemp_v2, reg_unemp,
          type = "text")

## 
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                          unemp                      
##                               (1)                      (2)          
## --------------------------------------------------------------------
## government                 -0.014***                -0.015***       
##                             (0.001)                  (0.001)        
##                                                                     
## population                  0.280***                0.139***        
##                             (0.013)                  (0.012)        
##                                                                     
## interest                    0.045**                 0.124***        
##                             (0.022)                  (0.030)        
##                                                                     
## log(invest)                -7.020***                                
##                             (0.570)                                 
##                                                                     
## consumption                -0.001***                                
##                             (0.0002)                                
##                                                                     
## Constant                    5.889***                -9.549***       
##                             (2.196)                  (1.249)        
##                                                                     
## --------------------------------------------------------------------
## Observations                  203                      203          
## R2                           0.746                    0.463         
## Adjusted R2                  0.740                    0.455         
## Residual Std. Error     0.805 (df = 197)        1.165 (df = 199)    
## F Statistic         115.837*** (df = 5; 197) 57.263*** (df = 3; 199)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

The variables investment and consumption correlate with both the independent variables and the dependent variables, which means they fit the criteria for an omitted variable. The correlations are all positive, and their effects on unemployment are negative, meaning the bias is negative. Adding these variables makes sense intuitively. Unemployment and GDP are indicators closely tied to one another, so adding components of GDP to the model should help to explain changes in GDP. The final formula is now:

\[unemp = \beta0 + government\beta1 + population\beta2 + interest\beta3 + invest\beta4 + consumption\beta5 +\epsilon\] With coefficients: \[unemp = \beta0 + -0.016*government + 0.131*population + 0.125*interest + -0.011*invest + 0.003*consumption +\epsilon\]

The next dataset, Fatalities, contains panel data on annual state-level traffic fatalities in the years 1982-1988. Among the variables are income, spirits consumption, population, fatalities, miles driven, and others. For a full list, visit: https://cran.r-project.org/web/packages/AER/AER.pdf

I am attempting to model alcohol-related traffic fatalities (afatal). First, I’ll clean and analyze the data.

data("Fatalities")
Fatalities <- data.frame(Fatalities)

stargazer(Fatalities, 
          type = "text",
          title = "Summary Stats")

## 
## Summary Stats
## =======================================================================
## Statistic     N      Mean        St. Dev.        Min          Max      
## -----------------------------------------------------------------------
## spirits      336     1.754         0.684        0.790        4.900     
## unemp        336     7.347         2.533        2.400        18.000    
## income       336  13,880.180     2,253.046    9,513.762    22,193.460  
## emppop       336    60.806         4.722       42.993        71.269    
## beertax      336     0.513         0.478        0.043        2.721     
## baptist      336     7.157         9.763        0.000        30.356    
## mormon       336     2.802         9.665        0.100        65.916    
## drinkage     336    20.456         0.899       18.000        21.000    
## dry          336     4.267         9.501        0.000        45.792    
## youngdrivers 336     0.186         0.025        0.073        0.282     
## miles        336   7,890.754     1,475.659    4,576.346    26,148.270  
## fatal        336    928.664       934.051        79          5,504     
## nfatal       336    182.583       188.431        13          1,049     
## sfatal       336    109.949       108.540         8           603      
## fatal1517    336    62.610        55.729          3           318      
## nfatal1517   336    12.262        12.253          0            76      
## fatal1820    336    106.661       104.224         7           601      
## nfatal1820   336    33.527        33.238          0           196      
## fatal2124    336    126.872       131.789        12           770      
## nfatal2124   336    41.378        42.930          1           249      
## afatal       336    293.333       303.581      24.600      2,094.900   
## pop          336 4,930,272.000 5,073,704.000 478,999.700 28,314,028.000
## pop1517      336  230,815.500   229,896.300  21,000.020  1,172,000.000 
## pop1820      336  249,090.400   249,345.600  20,999.960  1,321,004.000 
## pop2124      336  336,389.900   345,304.400  30,000.160  1,892,998.000 
## milestot     336  37,101.490    37,454.370    3,993.000   241,575.000  
## unempus      336     7.529         1.479        5.500        9.700     
## emppopus     336    59.971         1.585       57.800        62.300    
## gsp          336     0.025         0.043       -0.124        0.142     
## -----------------------------------------------------------------------

#delete non-numeric columns
Fatalities <- select(Fatalities,-c(state, year, breath, jail, service))

#create some interaction terms.
Fatalities <- Fatalities %>% mutate(afatalrate = (afatal*pop),# fatalities and population
                                    milesper = (milestot/pop),# miles per person
                                    spiritsper= spirits*pop) #spirits consumption and population
#delete na values
Fatalities <- na.omit(Fatalities)
#stack for additional 
Fatalities_stacked <- Fatalities %>%
  pivot_longer(cols = -afatalrate, names_to = "Var", values_to = "Explanatory")

ggplot(Fatalities_stacked, aes(x = Explanatory, y = afatalrate)) + 
  geom_point() + 
  facet_wrap(~Var, scales = "free_x")

ggplot(Fatalities_stacked, aes(x = Explanatory)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~Var, scales = "free_x")

ggplot(Fatalities_stacked, aes(x = Explanatory, y = afatalrate, color = Var)) +
  geom_point() +
  labs(
    x= "Alcohol-related traffic fatalities"
  )

Fatalities_cor <- cor(Fatalities)

corrplot(Fatalities_cor, type = "upper", order = "hclust",
         col=brewer.pal(n=5, name="RdYlBu"))

Afatal appears skewed, so I’ll use a log version in the formula. In fact, all variables I will use are skewed, so this will be a log-log model. Again, the first model will have OVB and the second attempts to correct for it.

afatal_reg <- lm(formula = log(afatal) ~ log(spiritsper), data = Fatalities)
summary(afatal_reg)

## 
## Call:
## lm(formula = log(afatal) ~ log(spiritsper), data = Fatalities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2986 -0.4091  0.0654  0.3521  1.2580 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.10995    0.43360   -16.4   <2e-16 ***
## log(spiritsper)  0.80018    0.02798    28.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5063 on 334 degrees of freedom
## Multiple R-squared:   0.71,  Adjusted R-squared:  0.7092 
## F-statistic: 817.8 on 1 and 334 DF,  p-value: < 2.2e-16

#% of fatal accidents with drivers aged 21-24 correlates with alcohol related deaths and spirits*population. Adding that may decrease bias
cor.test(Fatalities$afatal,Fatalities$nfatal2124)

## 
##  Pearson's product-moment correlation
## 
## data:  Fatalities$afatal and Fatalities$nfatal2124
## t = 52.329, df = 334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9311407 0.9546441
## sample estimates:
##       cor 
## 0.9440794

cor.test(Fatalities$spiritsper,Fatalities$nfatal2124)

## 
##  Pearson's product-moment correlation
## 
## data:  Fatalities$spiritsper and Fatalities$nfatal2124
## t = 34.598, df = 334, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8584425 0.9055459
## sample estimates:
##       cor 
## 0.8842224

#correlation falls within the 95% confidence interval, indicating these correlations are statistically significant.

afatal_reg_v2 <- lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124), data = Fatalities)
summary(afatal_reg_v2)

## 
## Call:
## lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124), 
##     data = Fatalities)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20738 -0.25027  0.00372  0.23656  1.33002 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.35363    0.57322   2.361   0.0188 *  
## log(spiritsper)  0.08726    0.04522   1.930   0.0545 .  
## log(nfatal2124)  0.78435    0.04454  17.612   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3648 on 333 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.849 
## F-statistic: 942.5 on 2 and 333 DF,  p-value: < 2.2e-16

The variables afatal, spiritsper, and number of fatal accidents are all correlated, as you can see in the correlation plot, so there is reason to think that any bias present could be lessened by adding fatal accidents among people aged 21-24 years. The additional variable, young fatalities, has a positive coefficient, meaning the bias is positive. The intuition makes some sense to me. Younger people are less experienced drivers and are newer to the dangers of alcohol, which could make them more likely to be involved in alcohol related fatalities.

stargazer(afatal_reg_v2, afatal_reg,
          type = "text")

## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                        log(afatal)                   
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## log(spiritsper)              0.087*                  0.800***        
##                             (0.045)                  (0.028)         
##                                                                      
## log(nfatal2124)             0.784***                                 
##                             (0.045)                                  
##                                                                      
## Constant                    1.354**                 -7.110***        
##                             (0.573)                  (0.434)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  336                      336           
## R2                           0.850                    0.710          
## Adjusted R2                  0.849                    0.709          
## Residual Std. Error     0.365 (df = 333)         0.506 (df = 334)    
## F Statistic         942.502*** (df = 2; 333) 817.816*** (df = 1; 334)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

The final formula is: \[log(afatal) = \beta0 + log(spiritsper)\beta1 + log(nfatal2124)\beta2 + \epsilon\] An example of a variable correlated with Y but not X is Baptist. Let’s look at the correlation graphic once more and see what happens when it’s added to the model.

corrplot(Fatalities_cor, type = "upper", order = "hclust",
         col=brewer.pal(n=5, name="RdYlBu"))

afatal_reg_v3 <- lm(formula = log(afatal) ~ log(spiritsper) + log(nfatal2124) + baptist, data = Fatalities)
stargazer(afatal_reg_v3, afatal_reg,
          type = "text")

## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                        log(afatal)                   
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## log(spiritsper)             0.206***                 0.800***        
##                             (0.041)                  (0.028)         
##                                                                      
## log(nfatal2124)             0.633***                                 
##                             (0.041)                                  
##                                                                      
## baptist                     0.020***                                 
##                             (0.002)                                  
##                                                                      
## Constant                     -0.136                 -7.110***        
##                             (0.517)                  (0.434)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  336                      336           
## R2                           0.887                    0.710          
## Adjusted R2                  0.886                    0.709          
## Residual Std. Error     0.316 (df = 332)         0.506 (df = 334)    
## F Statistic         871.894*** (df = 3; 332) 817.816*** (df = 1; 334)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

The coefficient associated with the baptist variable is indeed very small.

Bias in linear models

Daniel McLaughlin

10/2/2023

R Markdown