I. What is the bias of an estimator?

The bias of an estimator is the systematic difference between the estimator’s expected value (mean) and the true value of the parameter being estimated. It measures whether, on average, the estimator tends to overestimate or underestimate the true value.

Mathematically, the bias of an estimator \(\hat{\theta}\) for a parameter \(\theta\) is defined as: \[ Bias(\hat{\theta}) = E_{\theta}(\hat{\theta}) - \theta \]

where, \(Bias(\hat{\theta})\) is the bias of the estimator \(\hat{\theta}\).

\(E_{\theta}(\hat{\theta})\) is the expected value of the estimator \(\hat{\theta}\).

\(\theta\) is the true (population) value of the parameter being estimated.

The bias can be positive (overestimating the parameter), negative (underestimating the parameter), or zero (unbiased) depending on whether \(E_{\theta}(\hat{\theta})\) is greater than, less than, or equal to \(\theta\), respectively.

If the estimator is unbiased, then: \[ Bias(\hat{\theta}) = E_{\theta}(\hat{\theta}) - \theta = 0 \]
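
As a standard illustration (a textbook example, not part of the datasets used below), take an i.i.d. sample \(X_1, \dots, X_n\) with variance \(\sigma^2\) and estimate \(\sigma^2\) with the divisor-\(n\) sample variance \(\hat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\). Its expected value and bias are: \[ E(\hat{\sigma}^2_n) = \frac{n-1}{n}\sigma^2, \qquad Bias(\hat{\sigma}^2_n) = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n} \]

The bias is negative, so this estimator underestimates \(\sigma^2\) on average; dividing by \(n-1\) instead gives an unbiased estimator.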

II. In terms of omitted variable bias (OVB), will the bias go away

if we increase the sample size?

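No. Increasing the sample size does not remove OVB. The bias arises because the omitted variable is correlated with an included independent variable and also affects the dependent variable, and neither relationship weakens as more observations are collected. A larger sample makes the estimate more precise, but it converges to the wrong value; the bias remains as long as the omitted variable is not controlled for.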
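
A small simulation can illustrate this. The sketch below uses made-up data: the variables x, z, y, the coefficients 1, 2, 3 and 0.5, and the helper short_slope() are all assumptions chosen for illustration and do not come from the datasets used later.

```r
# Illustrative simulation (all numbers are assumptions, not from the datasets below).
# True model: y = 1 + 2*x + 3*z + e, where z is correlated with x (z = 0.5*x + noise).
set.seed(123)

short_slope <- function(n) {
  x <- rnorm(n)
  z <- 0.5 * x + rnorm(n)            # omitted variable, correlated with x
  y <- 1 + 2 * x + 3 * z + rnorm(n)
  unname(coef(lm(y ~ x))["x"])       # slope from the short regression (z omitted)
}

# In theory the short-regression slope converges to 2 + 3 * 0.5 = 3.5,
# not to the true value 2, however large the sample becomes.
slopes <- sapply(c(100, 10000, 100000), short_slope)
slopes
```

The estimates should tighten around 3.5 as n grows, which is the sense in which more data gives a more precise estimate of the wrong quantity.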

if we add more variables?

Adding relevant variables to the regression model can reduce or remove OVB. Including the omitted variable itself removes the bias it causes, and including variables that are correlated with it can capture part of its effect. Simply adding more variables, however, does not guarantee that OVB is eliminated: the result depends on how strongly the added variables are correlated with the omitted variable. If the omitted variable’s influence is not adequately captured by the added variables, some bias will remain.

III. Give me 2 distinct examples of OVB (on different datasets).

Dataset - 1: (mtcars dataset)

1) You will choose a dataset, describe the variables in it, and give us the full / correct model. Tell us what is your key independent variable that you are interested in studying.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(mtcars)
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

This dataset contains information on various car models from the 1970s. This is cross-sectional data because it captures information about various car models at a single point in time.

The mtcars dataset gives the specifications of various car models, each represented as a row, and includes characteristics such as fuel efficiency, engine specifications, and more. It’s suitable for exploring relationships between these features.

There are 11 variables in the dataset.

mpg : Miles per gallon (fuel efficiency) achieved by the car model.

cyl : Number of cylinders in the engine.

disp : Displacement (in cubic inches) of the engine.

hp : Horsepower (hp) rating of the engine.

drat : Rear axle ratio, which affects the car’s performance.

wt : Weight of the car (in 1000 lbs).

qsec : Quarter-mile time in seconds, measuring acceleration and performance.

vs : Engine type, where 0 represents a V-shaped engine and 1 represents a straight engine.

am : Transmission type, where 0 indicates an automatic transmission, and 1 indicates a manual transmission.

gear : Number of forward gears.

carb : Number of carburetors.

Among these 11 variables, mpg is the dependent variable and the remaining ten are candidate independent variables. Two of the ten (vs and am) are categorical and are already encoded as binary dummy variables (0 and 1) that represent their two categories.

library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(mtcars, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("MPG", "Cylinders", "Displacement","Horsepower", "Rear-axle ratio", "Weight", "Quarter-mile time", "Engine", "Transmission", "Gears", "Carburetors")
          )
## 
## Summary Statistics
## ====================================================
## Statistic         N   Mean   St. Dev.  Min     Max  
## ----------------------------------------------------
## MPG               32 20.091   6.027   10.400 33.900 
## Cylinders         32  6.188   1.786     4       8   
## Displacement      32 230.722 123.939  71.100 472.000
## Horsepower        32 146.688  68.563    52     335  
## Rear-axle ratio   32  3.597   0.535   2.760   4.930 
## Weight            32  3.217   0.978   1.513   5.424 
## Quarter-mile time 32 17.849   1.787   14.500 22.900 
## Engine            32  0.438   0.504     0       1   
## Transmission      32  0.406   0.499     0       1   
## Gears             32  3.688   0.738     3       5   
## Carburetors       32  2.812   1.615     1       8   
## ----------------------------------------------------
hist(mtcars$mpg,
     xlab = "Miles per gallon",
     main = "Histogram")

We’ll use the log of mpg because the distribution of the mpg variable is skewed.

hist(log(mtcars$mpg),
     xlab = "Log Miles per gallon",
     main = "Histogram")

2) Now, suppose you were running the short / incorrect model where you omitted a variable “by mistake”. Write the estimating equation out as well.

Full regression:

\(\log(MPG_i)\) = \(\beta_0\) + \(\beta_1Horsepower_i\) + \(\beta_2Weight_i\) + \(\epsilon_i\)

full_regression <- lm(data    = mtcars, 
                      formula = log(mpg) ~ hp + wt
                      )

summary(full_regression)
## 
## Call:
## lm(formula = log(mpg) ~ hp + wt, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18744 -0.07540 -0.02440  0.06244  0.28562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.8291030  0.0686807  55.752  < 2e-16 ***
## hp          -0.0015435  0.0003879  -3.979 0.000423 ***
## wt          -0.2005368  0.0271810  -7.378 3.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1114 on 29 degrees of freedom
## Multiple R-squared:  0.8691, Adjusted R-squared:   0.86 
## F-statistic: 96.23 on 2 and 29 DF,  p-value: 1.577e-13

Short regression:

\(\log(MPG_i)\) = \(\beta_0\) + \(\beta_1Horsepower_i\) + \(\epsilon_i\)

short_regression <- lm(data    = mtcars, 
                      formula = log(mpg) ~ hp
                      )

summary(short_regression)
## 
## Call:
## lm(formula = log(mpg) ~ hp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41577 -0.06583 -0.01737  0.09827  0.39621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.4604669  0.0785838  44.035  < 2e-16 ***
## hp          -0.0034287  0.0004867  -7.045 7.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1858 on 30 degrees of freedom
## Multiple R-squared:  0.6233, Adjusted R-squared:  0.6107 
## F-statistic: 49.63 on 1 and 30 DF,  p-value: 7.853e-08

3) From the OVB formula, tell us whether the omitted variable will cause bias or not i.e. are the two conditions for OVB met or not?

For OVB to occur, two conditions must be fulfilled:

  • X is correlated with the omitted variable.

  • The omitted variable is a determinant of the dependent variable Y.

In my dataset, for OVB to occur, it must meet the following two conditions:

  • Horsepower variable should be correlated with the omitted variable weight.

  • The omitted variable weight has a direct or indirect effect on the dependent variable mpg.

Let’s check these two conditions:

  1. Condition - 1:

There is a positive correlation (0.659) between the omitted variable “weight” and the independent variable “horsepower”.

mtcars_cor <- round(cor(mtcars[, c("mpg", "hp", "wt")]), 3)
mtcars_cor
##        mpg     hp     wt
## mpg  1.000 -0.776 -0.868
## hp  -0.776  1.000  0.659
## wt  -0.868  0.659  1.000
cor_independent <- cor.test(mtcars$hp, mtcars$wt)
cor_independent
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$hp and mtcars$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4025113 0.8192573
## sample estimates:
##       cor 
## 0.6587479

We can also see that the correlation is statistically significant, with a p-value of 4.146e-05.

mtcars$lnmpg <- log(mtcars$mpg) 

corr_mat <- round(cor(mtcars[, c("lnmpg", "hp", "wt")]), 3)
corr_mat
##        lnmpg     hp     wt
## lnmpg  1.000 -0.789 -0.893
## hp    -0.789  1.000  0.659
## wt    -0.893  0.659  1.000

The correlations change only slightly when the dependent variable is log-transformed (for example, -0.776 versus -0.789 for hp), so taking the log does not materially change the relationships.

  2. Condition - 2:

In the full regression model, the coefficient on “weight” is negative (-0.2) and statistically significant, so weight is a determinant of (log) mpg. Since both conditions are met, omitted variable bias will exist.

weight_coeff_full_regression <- full_regression$coefficients[3]
weight_coeff_full_regression
##         wt 
## -0.2005368
cor_condition_2 <- cor.test(mtcars$lnmpg, mtcars$wt)
cor_condition_2
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$lnmpg and mtcars$wt
## t = -10.872, df = 30, p-value = 6.31e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9468891 -0.7905477
## sample estimates:
##        cor 
## -0.8930611

4) Furthermore, OVB will be in what direction (positive/negative bias) ?  Which case/cell in the 2 by 2 matrix that lists the 2 OVB conditions?

Since there is a positive correlation between the independent variable and the omitted variable, and the omitted variable has a negative effect on the dependent variable, the direction of the OVB is negative, i.e., there is a negative bias. This is the positive-correlation, negative-effect cell of the 2 by 2 matrix.
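
The sign can also be read off the OVB formula. As a sketch in notation assumed here (it is not used elsewhere in this report), write \(\hat{\beta}_1\) and \(\hat{\beta}_2\) for the full-model coefficients on Horsepower and Weight, \(\tilde{\beta}_1\) for the short-model coefficient on Horsepower, and \(\hat{\delta}_1\) for the slope from regressing Weight on Horsepower. For OLS estimates: \[ \tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \hat{\delta}_1 \]

Here \(\hat{\beta}_2 < 0\) (weight lowers mpg) and \(\hat{\delta}_1 > 0\) (hp and wt are positively correlated), so the bias term \(\hat{\beta}_2 \hat{\delta}_1\) is negative.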

5) Show the two regressions side by side (you can use stargazer command) and confirm the bias is in the direction OVB formula predicted. 

stargazer(full_regression, short_regression, 
          type = "text",
          covariate.labels = c("hp", "wt", "Constant")
          )
## 
## =================================================================
##                                  Dependent variable:             
##                     ---------------------------------------------
##                                       log(mpg)                   
##                              (1)                    (2)          
## -----------------------------------------------------------------
## hp                        -0.002***              -0.003***       
##                            (0.0004)               (0.0005)       
##                                                                  
## wt                        -0.201***                              
##                            (0.027)                               
##                                                                  
## Constant                   3.829***               3.460***       
##                            (0.069)                (0.079)        
##                                                                  
## -----------------------------------------------------------------
## Observations                  32                     32          
## R2                          0.869                  0.623         
## Adjusted R2                 0.860                  0.611         
## Residual Std. Error    0.111 (df = 29)        0.186 (df = 30)    
## F Statistic         96.232*** (df = 2; 29) 49.632*** (df = 1; 30)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01
hp_coeff_full_regression  <- full_regression$coefficients[2]
hp_coeff_short_regression <- short_regression$coefficients[2]
hp_coeff_full_regression  > hp_coeff_short_regression
##   hp 
## TRUE

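Therefore, it is confirmed that there is a negative bias: the coefficient on hp in the short regression (-0.0034) is more negative than in the full regression (-0.0015).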
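
The size of the gap can be checked against the OVB formula as well. The sketch below refits the two models and adds an auxiliary regression of wt on hp; the object names (full, short, aux, predicted_bias, observed_bias) are introduced here for illustration and are not part of the original analysis.

```r
# Sketch: check that (short hp coefficient) - (full hp coefficient)
# equals (full wt coefficient) * (slope of wt on hp) in mtcars.
data(mtcars)

full  <- lm(log(mpg) ~ hp + wt, data = mtcars)   # full model
short <- lm(log(mpg) ~ hp,      data = mtcars)   # wt omitted
aux   <- lm(wt ~ hp,            data = mtcars)   # auxiliary regression

# Bias predicted by the OVB formula: effect of wt times how wt moves with hp
predicted_bias <- unname(coef(full)["wt"] * coef(aux)["hp"])

# Bias actually observed between the two regressions
observed_bias  <- unname(coef(short)["hp"] - coef(full)["hp"])

c(predicted = predicted_bias, observed = observed_bias)
all.equal(predicted_bias, observed_bias)
```

For OLS the two quantities agree up to floating-point error, so the comparison of coefficients above and the sign argument from the two conditions describe the same bias.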

6) Try to provide some intuition to why does OVB formula work / bias your results in the example in a certain direction.

The positive correlation between the omitted variable “Weight” and the independent variable “Horsepower” means that when “Horsepower” increases, “Weight” tends to increase. This says that more powerful cars are often heavier. In the full regression model, I found that the effect of “Weight” on “MPG” is negative (-0.2). This means that, as the weight of a car increases, its miles per gallon (MPG) tends to decrease. Heavier cars are less fuel-efficient.

When I omitted “Weight” from the short regression, the model attributed some of the impact of “Weight” to “Horsepower.” The short regression mistakenly treats part of the negative effect of weight on MPG as if it were due to “Horsepower” alone. As a result, the coefficient for “Horsepower” becomes more negative, because “Horsepower” is credited with some of the effect of “Weight”.

Dataset - 2: (airquality dataset)

1) You will choose a dataset, describe the variables in it, and give us the full / correct model. Tell us what is your key independent variable that you are interested in studying.

library(zoo)
## Warning: package 'zoo' was built under R version 4.2.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
data("airquality")
glimpse(airquality)
## Rows: 153
## Columns: 6
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
# Counting the number of NA values in each column in the dataset
na_count <- colSums(is.na(airquality))
print(na_count)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
# Imputing the mean value of the specific column to the NA's
df_imputed <- na.aggregate(airquality, FUN = mean)
na_count1  <- colSums(is.na(df_imputed))
print(na_count1)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0
glimpse(df_imputed)
## Rows: 153
## Columns: 6
## $ Ozone   <dbl> 41.00000, 36.00000, 12.00000, 18.00000, 42.12931, 28.00000, 23…
## $ Solar.R <dbl> 190.0000, 118.0000, 149.0000, 313.0000, 185.9315, 185.9315, 29…
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp    <dbl> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

The airquality dataset in R contains information related to air quality measurements in New York City. It includes the following variables:

Ozone: The concentration of ozone in parts per billion (ppb).

Solar.R: The amount of solar radiation in calories per square centimeter.

Wind: The wind speed in miles per hour.

Temp: The temperature in degrees Fahrenheit.

Month: The month of the year (ranging from 5 to 9, corresponding to May to September).

Day: The day of the month.

The outcome of interest for studying air quality is “Ozone,” as it represents the concentration of ozone in the air. I want to understand how other factors such as solar radiation, wind speed, and temperature impact ozone levels; the key independent variable is solar radiation (“Solar.R”).

stargazer(airquality, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("Ozone", "Solar Radiation", "Wind","Temperature", "Month", "Day"
                               )
          )
## 
## Summary Statistics
## =================================================
## Statistic        N   Mean   St. Dev.  Min   Max  
## -------------------------------------------------
## Ozone           116 42.129   32.988    1    168  
## Solar Radiation 146 185.932  90.058    7    334  
## Wind            153  9.958   3.523   1.700 20.700
## Temperature     153 77.882   9.465    56     97  
## Month           153  6.993   1.417     5     9   
## Day             153 15.804   8.865     1     31  
## -------------------------------------------------
hist(airquality$Ozone,
     xlab = "Ozone concentration in ppb",
     main = "Histogram")

We’ll use the log of ozone because the distribution of the ozone variable is skewed.

hist(log(airquality$Ozone),
     xlab = "Log Ozone concentration in ppb",
     main = "Histogram")

2) Now, suppose you were running the short / incorrect model where you omitted a variable “by mistake”. Write the estimating equation out as well.

Full regression:

\(\log(Ozone_i)\) = \(\beta_0\) + \(\beta_1SolarRadiation_i\) + \(\beta_2Temperature_i\) + \(\epsilon_i\)

full_reg <- lm(data    = df_imputed, 
               formula = log(Ozone) ~ Solar.R + Temp
               )

summary(full_reg)
## 
## Call:
## lm(formula = log(Ozone) ~ Solar.R + Temp, data = df_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2510 -0.3992  0.0291  0.3580  1.4606 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.4920003  0.3817795  -1.289 0.199485    
## Solar.R      0.0020962  0.0005427   3.863 0.000166 ***
## Temp         0.0462067  0.0050433   9.162 3.58e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5679 on 150 degrees of freedom
## Multiple R-squared:  0.4568, Adjusted R-squared:  0.4496 
## F-statistic: 63.07 on 2 and 150 DF,  p-value: < 2.2e-16

Short regression:

\(\log(Ozone_i)\) = \(\beta_0\) + \(\beta_1SolarRadiation_i\) + \(\epsilon_i\)

short_reg <- lm(data    = df_imputed, 
                formula = log(Ozone) ~ Solar.R
                )

summary(short_reg)
## 
## Call:
## lm(formula = log(Ozone) ~ Solar.R, data = df_imputed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89115 -0.41920  0.05357  0.51796  1.45040 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8639398  0.1339875  21.375  < 2e-16 ***
## Solar.R     0.0034018  0.0006518   5.219 5.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7068 on 151 degrees of freedom
## Multiple R-squared:  0.1528, Adjusted R-squared:  0.1472 
## F-statistic: 27.24 on 1 and 151 DF,  p-value: 5.851e-07

3) From the OVB formula, tell us whether the omitted variable will cause bias or not i.e. are the two conditions for OVB met or not?

For OVB to occur, two conditions must be fulfilled:

  • X is correlated with the omitted variable.

  • The omitted variable is a determinant of the dependent variable Y.

In my dataset, for OVB to occur, it must meet the following two conditions:

  • Solar.R variable should be correlated with the omitted variable temperature.

  • The omitted variable temperature has a direct or indirect effect on the dependent variable ozone.

Let’s check these two conditions:

  1. Condition - 1:

There is a positive correlation (0.263) between the omitted variable “temperature” and the independent variable “Solar Radiation”.

airquality_cor <- round(cor(df_imputed[, c("Ozone", "Solar.R", "Temp")]), 3)
airquality_cor
##         Ozone Solar.R  Temp
## Ozone   1.000   0.303 0.609
## Solar.R 0.303   1.000 0.263
## Temp    0.609   0.263 1.000
cor_independent1 <- cor.test(df_imputed$Solar.R, df_imputed$Temp)
cor_independent1
## 
##  Pearson's product-moment correlation
## 
## data:  df_imputed$Solar.R and df_imputed$Temp
## t = 3.3438, df = 151, p-value = 0.001042
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1084074 0.4043982
## sample estimates:
##       cor 
## 0.2625689

We can also see that the correlation is statistically significant, with a p-value of 0.001.

df_imputed$lnOzone <- log(df_imputed$Ozone) 

corr_matrix <- round(cor(df_imputed[, c("lnOzone", "Solar.R", "Temp")]), 3)
corr_matrix
##         lnOzone Solar.R  Temp
## lnOzone   1.000   0.391 0.635
## Solar.R   0.391   1.000 0.263
## Temp      0.635   0.263 1.000

The correlations change only modestly when the dependent variable is log-transformed (0.303 versus 0.391 for Solar.R, and 0.609 versus 0.635 for Temp), and their signs are unchanged, so taking the log does not materially change the relationships.

  2. Condition - 2:

In the full regression model, the coefficient on “temp” is positive (0.046) and statistically significant, so temperature is a determinant of (log) ozone. Since both conditions are met, omitted variable bias will exist.

temp_coeff_full_regression <- full_reg$coefficients[3]
temp_coeff_full_regression
##       Temp 
## 0.04620666
cor_condition2 <- cor.test(df_imputed$lnOzone, df_imputed$Temp)
cor_condition2
## 
##  Pearson's product-moment correlation
## 
## data:  df_imputed$lnOzone and df_imputed$Temp
## t = 10.091, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5292653 0.7207408
## sample estimates:
##       cor 
## 0.6346442

4) Furthermore, OVB will be in what direction (positive/negative bias) ?  Which case/cell in the 2 by 2 matrix that lists the 2 OVB conditions?

Since there is a positive correlation between the independent variable and the omitted variable, and the omitted variable has a positive effect on the dependent variable, the direction of the OVB is positive, i.e., there is a positive bias. This is the positive-correlation, positive-effect cell of the 2 by 2 matrix.

5) Show the two regressions side by side (you can use stargazer command) and confirm the bias is in the direction OVB formula predicted.

stargazer(full_reg, short_reg, 
          type = "text",
          covariate.labels = c("Solar.R", "Temp", "Constant")
          )
## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                       log(Ozone)                   
##                               (1)                     (2)          
## -------------------------------------------------------------------
## Solar.R                    0.002***                0.003***        
##                             (0.001)                 (0.001)        
##                                                                    
## Temp                       0.046***                                
##                             (0.005)                                
##                                                                    
## Constant                    -0.492                 2.864***        
##                             (0.382)                 (0.134)        
##                                                                    
## -------------------------------------------------------------------
## Observations                  153                     153          
## R2                           0.457                   0.153         
## Adjusted R2                  0.450                   0.147         
## Residual Std. Error    0.568 (df = 150)        0.707 (df = 151)    
## F Statistic         63.071*** (df = 2; 150) 27.239*** (df = 1; 151)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
SolarR_coeff_full_regression  <- full_reg$coefficients[2]
SolarR_coeff_short_regression <- short_reg$coefficients[2]
SolarR_coeff_full_regression  < SolarR_coeff_short_regression
## Solar.R 
##    TRUE

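Therefore, it is confirmed that there is a positive bias: the coefficient on Solar.R in the short regression (0.0034) is larger than in the full regression (0.0021).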
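
The size of this gap can also be compared with the OVB formula. The sketch below is self-contained, so it rebuilds the mean-imputed data in base R; this imputation step is assumed to mirror the na.aggregate(airquality, FUN = mean) call used earlier, and the object names (df, full, short, aux, predicted_bias, observed_bias) are introduced here for illustration.

```r
# Sketch: OVB decomposition on airquality with NAs replaced by column means.
# The base-R imputation is assumed to match na.aggregate(airquality, FUN = mean).
data(airquality)
df <- as.data.frame(lapply(airquality, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

full  <- lm(log(Ozone) ~ Solar.R + Temp, data = df)   # full model
short <- lm(log(Ozone) ~ Solar.R,        data = df)   # Temp omitted
aux   <- lm(Temp ~ Solar.R,              data = df)   # auxiliary regression

# Bias predicted by the formula: effect of Temp times how Temp moves with Solar.R
predicted_bias <- unname(coef(full)["Temp"] * coef(aux)["Solar.R"])

# Bias actually observed between the two regressions
observed_bias  <- unname(coef(short)["Solar.R"] - coef(full)["Solar.R"])

c(predicted = predicted_bias, observed = observed_bias)
```

Both quantities should be positive and equal up to floating-point error, in line with the positive bias found above.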

6) Try to provide some intuition to why does OVB formula work / bias your results in the example in a certain direction.

The positive correlation between the omitted variable “Temperature” and the independent variable “Solar Radiation” means that when “Solar Radiation” increases, “Temperature” tends to increase. In this context, higher temperatures can lead to higher ozone levels due to chemical reactions involving ozone precursors. When we omitted the “Temperature” variable and ran the short regression, the model incorrectly attributed part of the effect of temperature to “Solar Radiation,” resulting in a positive bias.

OVB occurred because I omitted an important variable (“Temperature”) that is both correlated with the independent variable “Solar Radiation” and had a direct positive effect on the dependent variable “Ozone.” This led to an overestimation of the effect of “Solar Radiation” on “Ozone,” resulting in a positive bias.