OVB

# Clear the workspace
  rm(list = ls()) # Clear all files from your environment
  gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 522012 27.9    1161270 62.1   660385 35.3
## Vcells 951178  7.3    8388608 64.0  1769879 13.6

cat("\f")

graphics.off()

What is Bias of an estimator?

The bias of an estimator measures whether or not in expectation, the estimator will be equal to the true parameter Let ˆθ be an estimator for θ. The bias of ˆθ as an estimator for θ is

                                   Bias(ˆθ, θ) = E[ˆθ]− θ

• Bias(ˆθ, θ) = 0, or equivalently E[ˆθ] = θ, then we say ˆθ is an unbiased estimator of ˆθ. • Bias(ˆθ, θ) > 0, then ˆθ typically overestimates θ. • Bias(ˆθ, θ) < 0, then ˆθ typically underestimates θ.

In terms of omitted variable bias, will the bias go away if we increase the same size or add more variables?

The omitted variable bias occurs when a relevant independent variable is excluded from a regression model. In such cases, the estimated coefficients of the included variables may be biased and inconsistent.

Increasing the sample size alone does not necessarily eliminate the omitted variable bias. While a larger sample size can improve the precision of the coefficient estimates, it does not address the fundamental issue of an omitted variable. If a relevant variable is omitted from the model, the bias will persist regardless of sample size.

The variable should be included if the variable is related to the key independent and dependent variable as this reduces bias. It is also important to note that a consequence of doing so is that it causes multicollinearity.

To effectively address the omitted variable bias, it is crucial to identify and include relevant variables that are correlated with both the dependent variable and the omitted variable. This can be achieved through careful theoretical reasoning, subject-matter expertise, or empirical evidence.

1) You will choose a dataset, describe the variables in it, and give us the full / correct model (be sure to write out the estimating equation in R markdownLinks to an external site., and pay attention to the subscripts as well). Tell us what is your key independent variable that you are interested in studying

data("airquality")

Description Daily air quality measurements in New York, May to September 1973.

Usage airquality

Format A data frame with 153 observations on 6 variables.

[,1] Ozone numeric Ozone (ppb) [,2] Solar.R numeric Solar R (lang) [,3] Wind numeric Wind (mph) [,4] Temp numeric Temperature (degrees F) [,5] Month numeric Month (1–12) [,6] Day numeric Day of month (1–31)

Details

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.

“Ozone,” which stands for the concentration of ozone in the air, is the primary independent variable of interest for researching air quality. I’m interested in learning how ozone levels are impacted by additional variables like temperature, wind speed, and sun radiation.

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

There is some NA values in the dataset. So, I am going to compute and impute the Na values in each column.

# Counting the number of NA values in each column in the dataset
na_count <- colSums(is.na(airquality))
print(na_count)

##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

# Load the dplyr package
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Example: Impute NA values with mean using dplyr
imputed <- airquality %>%
  mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .))

# Count NA values after imputation
na_count <- colSums(is.na(imputed))
print(na_count)

##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

After imputation there is no missing values in the dataset.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

glimpse(imputed)

## Rows: 153
## Columns: 6
## $ Ozone   <dbl> 41.00000, 36.00000, 12.00000, 18.00000, 42.12931, 28.00000, 23…
## $ Solar.R <dbl> 190.0000, 118.0000, 149.0000, 313.0000, 185.9315, 185.9315, 29…
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

library("stargazer")

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(imputed, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("Ozone", "Solar Radiation", "Wind","Temperature", "Month", "Day")
          )

## 
## Summary Statistics
## ==================================================
## Statistic        N   Mean   St. Dev.  Min    Max  
## --------------------------------------------------
## Ozone           153 42.129   28.693  1.000 168.000
## Solar Radiation 153 185.932  87.960  7.000 334.000
## Wind            153  9.958   3.523   1.700 20.700 
## Temperature     153 77.882   9.465    56     97   
## Month           153  6.993   1.417     5      9   
## Day             153 15.804   8.865     1     31   
## --------------------------------------------------

hist(imputed$Ozone, 
     xlab = "Ozone Concentration",
     main = "Histogram"
     )

There is right skew in Ozone variable, so I will use log of ozone.

hist(log(imputed$Ozone), 
     xlab = "Ozone Concentration",
     main = "Histogram"
     )

2) Now, suppose you were running the short / incorrect model where you omitted a variable “by mistake”. Write the estimating equation out as well

Ozone = β0 + β1SolarRadiationi + β2Temperaturei + ϵi

full_model <- lm(imputed, 
               formula = log(Ozone) ~ Solar.R + Temp
               )

summary(full_model)

## 
## Call:
## lm(formula = log(Ozone) ~ Solar.R + Temp, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2510 -0.3992  0.0291  0.3580  1.4606 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.4920003  0.3817795  -1.289 0.199485    
## Solar.R      0.0020962  0.0005427   3.863 0.000166 ***
## Temp         0.0462067  0.0050433   9.162 3.58e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5679 on 150 degrees of freedom
## Multiple R-squared:  0.4568, Adjusted R-squared:  0.4496 
## F-statistic: 63.07 on 2 and 150 DF,  p-value: < 2.2e-16

Temp_coeff_full_model <- full_model$coefficients[3]
Temp_coeff_full_model

##       Temp 
## 0.04620666

Solar_coeff_full_model <- full_model$coefficients[2]
Solar_coeff_full_model

##     Solar.R 
## 0.002096237

short_model <- lm(imputed, 
               formula = log(Ozone) ~ Solar.R
               )

summary(short_model)

## 
## Call:
## lm(formula = log(Ozone) ~ Solar.R, data = imputed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89115 -0.41920  0.05357  0.51796  1.45040 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8639398  0.1339875  21.375  < 2e-16 ***
## Solar.R     0.0034018  0.0006518   5.219 5.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7068 on 151 degrees of freedom
## Multiple R-squared:  0.1528, Adjusted R-squared:  0.1472 
## F-statistic: 27.24 on 1 and 151 DF,  p-value: 5.851e-07

solar_coeff_short_model <- short_model$coefficients[2]
solar_coeff_short_model

##     Solar.R 
## 0.003401789

3) From the OVB formula, tell us whether the omitted variable will cause bias or not i.e. are the two conditions for OVB met or not?

For omitted variable bias to occur, two conditions must be fulfilled:

X is correlated with the omitted variable.

The omitted variable is a determinant of the dependent variable Y.

In this dataset,

Solar.r variable correlated with omitted variable temperature.
Temperature variable has a effect on dependent variable ozone.

airquality_cor <- round(cor(imputed[, c("Ozone", "Solar.R", "Temp")]), 3)
airquality_cor

##         Ozone Solar.R  Temp
## Ozone   1.000   0.303 0.609
## Solar.R 0.303   1.000 0.263
## Temp    0.609   0.263 1.000

We can see there is positive correlation(0.263) between temperature and solar.r variable.

cor_independent1 <- cor.test(imputed$Solar.R, imputed$Temp)
cor_independent1

## 
##  Pearson's product-moment correlation
## 
## data:  imputed$Solar.R and imputed$Temp
## t = 3.3438, df = 151, p-value = 0.001042
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1084074 0.4043982
## sample estimates:
##       cor 
## 0.2625689

imputed$lnOzone <- log(imputed$Ozone) 

corr_matrix <- round(cor(imputed[, c("lnOzone", "Solar.R", "Temp")]), 3)
corr_matrix

##         lnOzone Solar.R  Temp
## lnOzone   1.000   0.391 0.635
## Solar.R   0.391   1.000 0.263
## Temp      0.635   0.263 1.000

We can see there is not much difference between dependent and dependent with log corr_matrix values. So, log doesn’t change the relationship.

temp_coeff_full_regression <- full_model$coefficients[3]
temp_coeff_full_regression

##       Temp 
## 0.04620666

From the full regression, the effect of “temp” on “ozone” is positive 0.046 and the ommited variable also exist.

cor_condition2 <- cor.test(imputed$lnOzone, imputed$Temp)
cor_condition2

## 
##  Pearson's product-moment correlation
## 
## data:  imputed$lnOzone and imputed$Temp
## t = 10.091, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5292653 0.7207408
## sample estimates:
##       cor 
## 0.6346442

4) Furthermore, OVB will be in what direction (positive/negative bias) ? Which case/cell in the 2 by 2 matrix that lists the 2 OVB conditions?

As per the above outputs values there is correlation between the independent variable and ommitted variable. Also there is a positive effect on dependent variable and ommitted variable. So the OVB will be in positive bias direction.

5) Show the two regressions side by side (you can use stargazer command) and confirm the bias is in the direction OVB formula predicted

stargazer(full_model, short_model, 
          type = "text",
          covariate.labels = c("Solar.R", "Temp", "Constant")
          )

## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                       log(Ozone)                   
##                               (1)                     (2)          
## -------------------------------------------------------------------
## Solar.R                    0.002***                0.003***        
##                             (0.001)                 (0.001)        
##                                                                    
## Temp                       0.046***                                
##                             (0.005)                                
##                                                                    
## Constant                    -0.492                 2.864***        
##                             (0.382)                 (0.134)        
##                                                                    
## -------------------------------------------------------------------
## Observations                  153                     153          
## R2                           0.457                   0.153         
## Adjusted R2                  0.450                   0.147         
## Residual Std. Error    0.568 (df = 150)        0.707 (df = 151)    
## F Statistic         63.071*** (df = 2; 150) 27.239*** (df = 1; 151)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

I compared two regressions side by side using stargazer. In Model 1, the coefficient for “Solar.R” is 0.002, and in Model 2, it is 0.003. Both coefficients are statistically significant at the 0.01 level, indicating a positive relationship between “Solar.R” and the dependent variable. In Model 2, there is no coefficient reported for “Temp”, because it is ommitted varibale. the significance levels: *p<0.1, **p<0.05, p<0.01. These indicate the level of statistical significance of the coefficients.

Solar_coeff_full_model < solar_coeff_short_model

## Solar.R 
##    TRUE

Also, it is confirmed which is positive bias.

6 Try to provide some intuition to why does OVB formula work / bias your results in the example in a certain direction

Due to the positive correlation between the independent variable “solarRadiation” and the omitted variable “temperature,” “Temperature” tends to increase as “SolarRadiation” rises. In this case, chemical reactions involving ozone precursors can result in higher ozone levels at higher temperatures. The model gave the effect of temperature to “Solar Radiation” in error when we ran the short regression without the “Temperature” variable, which resulted in a positive bias.

OVB happened as a result of my ommited a crucial variable (temperature), which has a direct positive impact on the dependent variable (ozone) and is correlated with the independent variable (solar radiation). This resulted in a positive bias as the impact of “Solar Radiation” on “Ozone” was overestimated.

Try to add/exclude a variable in your multivariate regression that does not impact y (is uncorrelated with y) but is correlated with the key x variable, and show that your point estimate will not change (significantly)

model <- lm(Ozone ~ Temp + Solar.R, data = airquality)
summary(model)

## 
## Call:
## lm(formula = Ozone ~ Temp + Solar.R, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.610 -15.976  -2.928  12.371 115.555 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145.70316   18.44672  -7.899 2.53e-12 ***
## Temp           2.27847    0.24600   9.262 2.22e-15 ***
## Solar.R        0.05711    0.02572   2.221   0.0285 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 108 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.5103, Adjusted R-squared:  0.5012 
## F-statistic: 56.28 on 2 and 108 DF,  p-value: < 2.2e-16

# Add the unrelated variable "Wind" to the regression model
model_with_unrelated <- lm(Ozone ~ Temp + Wind, data = airquality)
summary(model_with_unrelated)

## 
## Call:
## lm(formula = Ozone ~ Temp + Wind, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.251 -13.695  -2.856  11.390 100.367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -71.0332    23.5780  -3.013   0.0032 ** 
## Temp          1.8402     0.2500   7.362 3.15e-11 ***
## Wind         -3.0555     0.6633  -4.607 1.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.85 on 113 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5611 
## F-statistic:  74.5 on 2 and 113 DF,  p-value: < 2.2e-16

Based on the values x variable temp is more correlated than y variable wind. The model temp 2.27847 is greater than unrelated model temp 1.8402.

OVB

Ganesh Kumar

2024-07-17