Regression Discussion - OVB

Author

Langley Burke

Regression Discussion 5

Part 1

The bias of an estimator is the difference between the expected value of the estimator and the true value of beta. It measures how far off the estimator is from the true parameter value on average.

Part 2

No, the bias will not go away as we increase the sample size, since the sample size does not appear in the formula for the omitted variable bias. The bias depends on the effect of the omitted variable on Y and on its relationship with the included variables, so simply having more data does not address the underlying issue of omitted variables.

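Yes, adding more variables will decrease the omitted variable bias, provided the added variables are the ones that were omitted: determinants of Y that are correlated with the included variables. Adding variables that are unrelated to Y does not reduce the bias.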
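The first answer can be illustrated with a minimal simulation sketch in base R. Everything here is an assumption made for illustration and is not part of the assignment data: the variable names (x1, x2), the coefficients, and the sample sizes. Under these assumptions the true slope on x1 is 2, while the short regression converges to 2 + 3 × 0.5 / 1.25 = 3.2, so the estimate should settle near 3.2 rather than 2 as n grows.

```r
# Simulated illustration; all names and parameter values are hypothetical.
set.seed(1)

simulate_short_slope <- function(n) {
  x2 <- rnorm(n)                          # variable that will be omitted
  x1 <- 0.5 * x2 + rnorm(n)               # included regressor, correlated with x2
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true slope on x1 is 2
  unname(coef(lm(y ~ x1))["x1"])          # slope from the regression that omits x2
}

sizes <- c(100, 10000, 100000)
short_slopes <- sapply(sizes, simulate_short_slope)
print(data.frame(n = sizes, short_slope = short_slopes))
```

A larger sample makes the estimate more precise, but it becomes precise around the biased value, not around the true one.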

Part 3

data()
library(ggplot2)
df <- diamonds
head(df)  
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The purpose of this regression is to determine whether carat affects the price of a diamond.

Describe Data

The original data set “diamonds” contains the variables price, carat, cut, color, clarity, x, y, z, depth and table.

price : Price in US dollars

carat : Weight of the diamond

cut : Quality of the cut (ordered factor variable)

color : Color of the diamond (ordered factor variable)

clarity : A measurement of how clear the diamond is (ordered factor variable)

x : Length in mm

y : Width in mm

z : Depth in mm

depth : Total depth percentage, z / mean(x, y)

table : Width of top of the diamond relative to widest point

For this regression I take price, carat and z. I save them to a new data frame df_lr with the columns price, carat and depth, where depth holds z (the depth in mm) and not the depth percentage column of the original data.

“Correct” (Full Model)

\[ Price_i = \beta_0 + \beta_1*Carat_i + \beta_2* Depth_i + \epsilon_i \]

carat <- df$carat

depth <- df$z

price<-df$price

df_lr <- data.frame(price, carat, depth)

head(df_lr)
  price carat depth
1   326  0.23  2.43
2   326  0.21  2.31
3   327  0.23  2.31
4   334  0.29  2.63
5   335  0.31  2.75
6   336  0.24  2.48
model <- lm(price ~ carat + depth, data = df_lr)

summary(model)

Call:
lm(formula = price ~ carat + depth, data = df_lr)

Residuals:
     Min       1Q   Median       3Q      Max 
-21323.9   -704.8    -12.8    423.4  31214.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   340.66      75.62   4.505 6.65e-06 ***
carat        9288.40      46.10 201.480  < 2e-16 ***
depth       -1079.32      30.97 -34.856  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1531 on 53937 degrees of freedom
Multiple R-squared:  0.8526,    Adjusted R-squared:  0.8526 
F-statistic: 1.561e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

The “correct” model states that if the weight of a diamond increases by one carat, the price increases by about $9,288 on average, holding depth constant. If depth increases by 1 mm, the price decreases by about $1,079 on average, holding carat constant.

Dropping Carat from Data Frame

df_ovb <- data.frame(price, depth)

head(df_ovb)
  price depth
1   326  2.43
2   326  2.31
3   327  2.31
4   334  2.63
5   335  2.75
6   336  2.48

\[ Price_i = \beta_0+\beta_1* Depth_i + \epsilon_i \]

model_ovb <- lm(price ~ depth, data = df_ovb)

summary(model_ovb)

Call:
lm(formula = price ~ depth, data = df_ovb)

Residuals:
    Min      1Q  Median      3Q     Max 
-139561   -1235    -240     825   32085 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -13296.57      44.64  -297.9   <2e-16 ***
depth         4868.79      12.37   393.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2027 on 53938 degrees of freedom
Multiple R-squared:  0.7418,    Adjusted R-squared:  0.7417 
F-statistic: 1.549e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

This model (omitted variable) shows that as depth increases by 1 mm the price increases by about $4,869 on average.

The R squared value is higher for the first model (0.853 versus 0.742) because more of the variance in price is explained when carat is included. In both models the coefficient estimates are statistically significant at alpha = 0.001.

The 2 conditions for OVB are

  1. X is correlated with the omitted variable - for this regression the condition holds: the carat of a diamond is correlated with its depth.
  2. The omitted variable is a determinant of Y - this also holds for this regression: carat is a determining factor for price.

Proof Through Correlation Values

# Finding the correlations with the carat variable
correlation_carat_price <- cor(df_lr$carat, df$price)
print(correlation_carat_price)
[1] 0.9215913
# Note: df$depth is the total depth percentage, not df_lr$depth (z in mm)
correlation_depth_carat <- cor(df_lr$carat, df$depth)
print(correlation_depth_carat)
[1] 0.02822431
cor_test_carat_price <- cor.test(df_lr$carat, df$price)
print(cor_test_carat_price)

    Pearson's product-moment correlation

data:  df_lr$carat and df$price
t = 551.41, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9203098 0.9228530
sample estimates:
      cor 
0.9215913 
cor_test_depth_carat <- cor.test(df_lr$carat, df$depth)
print(cor_test_depth_carat)

    Pearson's product-moment correlation

data:  df_lr$carat and df$depth
t = 6.5576, df = 53938, p-value = 5.518e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.01978996 0.03665465
sample estimates:
       cor 
0.02822431 

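Since the correlation between carat and depth is low here, a test was done to see whether it differs from zero. The p-value is less than 0.05, so it is concluded that the omitted variable is correlated with the X variable. Note that these calls use df$depth, the total depth percentage from the original data, whereas the regressor in the models is df_lr$depth (z in mm), so the first condition should really be checked against df_lr$depth. Carat is also strongly correlated with price (0.92), which is consistent with it being a determinant of price. The test was done for carat and price simply to confirm.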
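A small helper can keep this check tied to the columns actually used in the models. The sketch below is only an illustration: the function name ovb_correlations and the toy data frame are hypothetical, and the toy numbers are made up so the block runs on its own.

```r
# Hypothetical helper: correlations of the omitted variable with the included
# regressor and with the outcome, taken from one data frame so the columns
# cannot be mixed up between data frames.
ovb_correlations <- function(data, outcome, included, omitted) {
  c(
    omitted_included = cor(data[[omitted]], data[[included]]),
    omitted_outcome  = cor(data[[omitted]], data[[outcome]])
  )
}

# Made-up toy data, only to show the call; not taken from the diamonds data.
toy <- data.frame(
  price = c(300, 450, 800, 1500, 2100),
  carat = c(0.2, 0.3, 0.5, 0.8, 1.0),
  depth = c(2.3, 2.6, 3.1, 3.6, 3.9)
)
toy_cors <- ovb_correlations(toy, outcome = "price", included = "depth", omitted = "carat")
print(toy_cors)
```

With the data frame from this report, the call would be ovb_correlations(df_lr, outcome = "price", included = "depth", omitted = "carat"), which uses z in mm as the depth variable.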

It is possible to get the direction of OVB from the signs of the statistically significant correlation values.

  1. The correlation between carat and price = 0.92
  2. The correlation between carat and depth = 0.03 (computed with df$depth, the total depth percentage)

Both of these values are positive, so based on the two-by-two matrix for determining the direction, the bias is positive. This means that removing carat from the equation pushes the parameter value for depth upward, so it is an overestimate: the depth coefficient moves from -1,079 in the full model to 4,869 in the omitted variable model and changes sign. This can lead to the belief that the depth of the diamond has a larger, positive impact on price than it actually does.
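The size of the bias can also be read off the two fitted models. As a sketch, write \(\tilde{\beta}_{Depth}\) for the depth coefficient in the omitted variable model and \(\hat{\delta}_1\) for the slope from regressing Carat on Depth; both symbols are notation introduced here, not part of the original output. For OLS fitted on the same sample the two models are linked by

\[ \tilde{\beta}_{Depth} = \hat{\beta}_2 + \hat{\beta}_1 \hat{\delta}_1 \]

Plugging in the reported coefficients gives

\[ 4868.79 = -1079.32 + 9288.40\,\hat{\delta}_1 \quad \Rightarrow \quad \hat{\delta}_1 = \frac{4868.79 + 1079.32}{9288.40} \approx 0.64 \]

\[ \hat{\beta}_1 \hat{\delta}_1 \approx 9288.40 \times 0.64 \approx 5948 > 0 \]

The value 0.64 is implied by the reported coefficients and was not estimated separately here. The bias term is positive because carat raises price and carat and depth (z) move together.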

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(model, model_ovb, 
          type = "text",
          covariate.labels = c("Carat", "Depth", "Constant")
          )

=================================================================================
                                         Dependent variable:                     
                    -------------------------------------------------------------
                                                price                            
                                 (1)                            (2)              
---------------------------------------------------------------------------------
Carat                        9,288.403***                                        
                               (46.101)                                          
                                                                                 
Depth                       -1,079.325***                   4,868.795***         
                               (30.966)                       (12.370)           
                                                                                 
Constant                      340.657***                   -13,296.570***        
                               (75.618)                       (44.636)           
                                                                                 
---------------------------------------------------------------------------------
Observations                    53,940                         53,940            
R2                              0.853                          0.742             
Adjusted R2                     0.853                          0.742             
Residual Std. Error     1,531.425 (df = 53937)         2,027.382 (df = 53938)    
F Statistic         156,054.400*** (df = 2; 53937) 154,922.100*** (df = 1; 53938)
=================================================================================
Note:                                                 *p<0.1; **p<0.05; ***p<0.01

Intuition:

When a variable is removed from a model, the regression treats the variables that remain as the only explanatory variables. It therefore estimates their influence on y (price) as though no other variable contributes, so the included variables “make up” for the influence of the omitted variables they are correlated with. This leads to biased estimates, which is why examining the signs of these relationships through a two-by-two matrix helps in discovering the direction of the bias. The magnitude of the bias depends on how strong those relationships are.
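The “making up” can be seen directly in a small simulation. The sketch below uses made-up data in base R. The names included and omitted and all coefficients are assumptions chosen so that the included variable has a negative true effect but picks up a positive one once the omitted variable is dropped.

```r
# Simulated illustration; all names and parameter values are hypothetical.
set.seed(42)
n <- 5000
omitted  <- rnorm(n)
included <- 0.6 * omitted + rnorm(n)               # correlated with the omitted variable
y <- 1 - 1 * included + 4 * omitted + rnorm(n)     # true effect of included is -1

long  <- coef(lm(y ~ included + omitted))          # full model
short <- coef(lm(y ~ included))                    # model with the variable omitted
aux   <- coef(lm(omitted ~ included))              # how the omitted variable tracks the included one

# Short slope = long slope + (effect of omitted variable) * (auxiliary slope)
implied_short <- long["included"] + long["omitted"] * aux["included"]
print(c(long = unname(long["included"]),
        short = unname(short["included"]),
        implied = unname(implied_short)))
```

The short-regression slope absorbs the omitted variable's effect in proportion to how closely the two variables move together, which is the same mechanism that turns the depth coefficient positive when carat is dropped.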