data()
Regression Discussion - OVB
Regression Discussion 5
Part 1
The bias of an estimator is the difference between the expected value of the estimator and the true value of beta. It measures how far, on average, the estimator is from the true parameter value.
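Written as a formula, with beta-hat for the estimator and beta for the true parameter (the notation is assumed here to match the regression equations used later), the definition can be sketched as:

\[ Bias(\hat{\beta}) = E[\hat{\beta}] - \beta \]

An estimator is unbiased when this difference is zero.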
Part 2
No, the bias will not go away as we increase the sample size, since the sample size does not appear in the formula for the bias of the beta estimator. Simply having more data does not address the underlying issue of omitted variables.
Yes, adding the relevant omitted variables will decrease the omitted variable bias, since fewer determinants of the outcome that are correlated with the included regressors are left out.
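Both answers can be checked with a small simulation. The sketch below uses made-up data and parameter values (the names `x1`, `x2`, `simulate_slopes` and all coefficients are assumptions for illustration, not part of the diamonds analysis). The true slope on `x1` is 2. With these settings the short regression should settle near 2 + 3 * 0.5 / 1.25 = 3.2 however large the sample gets, while the full regression should stay near 2.

```r
# Simulated illustration of omitted variable bias (all values are made up).
set.seed(1)

simulate_slopes <- function(n) {
  x2 <- rnorm(n)                          # variable that will be omitted
  x1 <- 0.5 * x2 + rnorm(n)               # included regressor, correlated with x2
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true slope on x1 is 2
  c(short = unname(coef(lm(y ~ x1))["x1"]),        # x2 omitted
    full  = unname(coef(lm(y ~ x1 + x2))["x1"]))   # x2 included
}

sizes <- c(1000L, 10000L, 100000L)
results <- t(sapply(sizes, simulate_slopes))
rownames(results) <- paste0("n=", sizes)
print(round(results, 3))
```

The `short` column does not move toward 2 as the sample size grows, whereas the `full` column is close to 2 at every sample size.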
Part 3
library(ggplot2)
df <- diamonds
head(df)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
The purpose of this regression is to determine whether carat affects the price of a diamond.
Describe Data
The original data set “diamonds” contains the variables price, carat, cut, color, clarity, x, y, z, depth and table.
price : Price in US dollars
carat : Weight of the diamond
cut : Quality of the cut (factor variable)
color : Color grade of the diamond (factor variable)
clarity : A measurement of how clear the diamond is (factor variable)
x : Length in mm
y : Width in mm
z : Depth in mm
depth : Total depth percentage
table : Width of top of the diamond relative to widest point
For this regression I take price, carat and z and save them to a new data frame with the columns price, carat and depth. Here depth holds z (depth in mm), not the data set's own depth column (total depth percentage).
“Correct” (Full Model)
\[ Price_i = \beta_0 + \beta_1*Carat_i + \beta_2* Depth_i + \epsilon_i \]
carat <- df$carat
depth <- df$z
price <- df$price
df_lr <- data.frame(price, carat, depth)
head(df_lr)
price carat depth
1 326 0.23 2.43
2 326 0.21 2.31
3 327 0.23 2.31
4 334 0.29 2.63
5 335 0.31 2.75
6 336 0.24 2.48
model <- lm(price ~ carat + depth, data = df_lr)
summary(model)
Call:
lm(formula = price ~ carat + depth, data = df_lr)
Residuals:
     Min       1Q   Median       3Q      Max
-21323.9   -704.8    -12.8    423.4  31214.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   340.66      75.62   4.505 6.65e-06 ***
carat        9288.40      46.10 201.480  < 2e-16 ***
depth       -1079.32      30.97 -34.856  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1531 on 53937 degrees of freedom
Multiple R-squared: 0.8526, Adjusted R-squared: 0.8526
F-statistic: 1.561e+05 on 2 and 53937 DF, p-value: < 2.2e-16
The “correct” model states that if the weight of a diamond increases by one carat, the predicted price increases by about $9,288, holding depth constant. If depth increases by 1 mm, the predicted price decreases by about $1,079, holding carat constant.
Dropping Carat from Data Frame
df_ovb <- data.frame(price, depth)
head(df_ovb)
price depth
1 326 2.43
2 326 2.31
3 327 2.31
4 334 2.63
5 335 2.75
6 336 2.48
\[ Price_i = \beta_0+\beta_1* Depth_i + \epsilon_i \]
model_ovb <- lm(price ~ depth, data = df_ovb)
summary(model_ovb)
Call:
lm(formula = price ~ depth, data = df_ovb)
Residuals:
    Min      1Q  Median      3Q     Max
-139561   -1235    -240     825   32085

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -13296.57      44.64  -297.9   <2e-16 ***
depth         4868.79      12.37   393.6   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2027 on 53938 degrees of freedom
Multiple R-squared: 0.7418, Adjusted R-squared: 0.7417
F-statistic: 1.549e+05 on 1 and 53938 DF, p-value: < 2.2e-16
This model (with carat omitted) shows that as depth increases by 1 mm, the predicted price increases by about $4,869.
The R-squared value is higher for the first model (0.853 versus 0.742) because carat explains additional variance in price. In both models the depth coefficient is statistically significant at alpha = 0.001.
The two conditions for OVB are:
- X is correlated with the omitted variable - in this regression the condition holds, since the carat of a diamond is correlated with depth, the regressor that stays in the model
- The omitted variable is a determinant of Y - this also holds for this regression, since carat is a determining factor for price.
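These two conditions map onto the two factors of the standard omitted variable bias formula. As a sketch, write \tilde{\beta}_1 for the depth coefficient in the model without carat and \delta_1 for the slope from an auxiliary regression of carat on depth (both symbols are notation introduced here, not part of the original write-up). \hat{\beta}_1 and \hat{\beta}_2 are the carat and depth estimates from the full model:

\[ Carat_i = \delta_0 + \delta_1 * Depth_i + u_i \]
\[ \tilde{\beta}_1 = \hat{\beta}_2 + \hat{\beta}_1 * \hat{\delta}_1 \]

The bias term \hat{\beta}_1 * \hat{\delta}_1 is zero if carat does not determine price (\hat{\beta}_1 = 0) or if carat is unrelated to depth (\hat{\delta}_1 = 0). Plugging in the reported estimates gives 4868.79 = -1079.32 + 9288.40 * \hat{\delta}_1, which implies \hat{\delta}_1 of roughly 0.64.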
Proof Through Correlation Values
# Finding the correlations to the carat variable
correlation_carat_price <- cor(df_lr$carat, df$price)
print(correlation_carat_price)
[1] 0.9215913
# Note: df$depth is the data set's total depth percentage, not df_lr$depth (z in mm)
correlation_depth_carat <- cor(df_lr$carat, df$depth)
print(correlation_depth_carat)
[1] 0.02822431
cor_test_carat_price <- cor.test(df_lr$carat, df$price)
print(cor_test_carat_price)
Pearson's product-moment correlation
data: df_lr$carat and df$price
t = 551.41, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9203098 0.9228530
sample estimates:
cor
0.9215913
cor_test_depth_carat <- cor.test(df_lr$carat, df$depth)
print(cor_test_depth_carat)
Pearson's product-moment correlation
data: df_lr$carat and df$depth
t = 6.5576, df = 53938, p-value = 5.518e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.01978996 0.03665465
sample estimates:
cor
0.02822431
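Since the correlation between carat and depth computed here is low (0.028), a correlation test was run to check whether it differs from zero. The p-value (5.518e-11) is below 0.05, so the correlation is statistically significant, and it is concluded that the omitted variable is correlated with the X variable and is a determinant of price. The same test was run for carat and price simply to confirm. One caveat: the code above uses df$depth, the data set's total depth percentage, whereas the depth regressor in both models is z (df_lr$depth), so the correlation that matters for the bias is the one between carat and z.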
It is possible to get the direction of OVB from the signs of the statistically significant correlation values.
- The correlation between carat and price = 0.92
- The correlation between carat and depth = 0.03
Carat has a positive relationship with price and is positively correlated with depth, so by the two-by-two sign table the bias is positive. Removing carat from the equation pushes the depth coefficient upward: it moves from -1,079 in the full model to 4,869, an overestimate that even flips the sign. This can lead to the belief that the depth of the diamond has more of an impact on price than it truly does.
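The size of the bias, not only its direction, can be recovered by regressing the omitted variable on the kept one. The sketch below is a hypothetical helper (the function `ovb_decompose` and its argument names are assumptions introduced here, not part of the original analysis). It splits the short-regression slope into the full-model slope plus a bias term and checks the split on simulated data, so it runs with base R only.

```r
# Hypothetical helper: decompose the short-regression slope into
# full-model slope + (omitted variable's coefficient * auxiliary slope).
ovb_decompose <- function(y, kept, omitted) {
  full  <- coef(lm(y ~ kept + omitted))   # model with both regressors
  short <- coef(lm(y ~ kept))             # model with the variable omitted
  aux   <- coef(lm(omitted ~ kept))       # omitted variable regressed on kept one
  bias  <- unname(full["omitted"] * aux["kept"])
  c(full_slope  = unname(full["kept"]),
    bias        = bias,
    short_slope = unname(short["kept"]))
}

# Simulated check (made-up values) so the block needs no extra packages.
set.seed(42)
kept    <- rnorm(500)
omitted <- 0.6 * kept + rnorm(500)
y       <- 1 - 1 * kept + 4 * omitted + rnorm(500)
parts   <- ovb_decompose(y, kept, omitted)
print(parts)

# With the data frame from this discussion the call would be:
# ovb_decompose(df_lr$price, kept = df_lr$depth, omitted = df_lr$carat)
```

In the output, `full_slope + bias` equals `short_slope`, which is the in-sample identity behind the sign rule. Applied to `df_lr`, the same call would show how much of the 4,869 depth coefficient is carat's effect being attributed to depth.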
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(model, model_ovb,
type = "text",
covariate.labels = c("Carat", "Depth", "Constant")
)
=================================================================================
                                         Dependent variable:
                    -------------------------------------------------------------
                                                price
                                 (1)                            (2)
---------------------------------------------------------------------------------
Carat                        9,288.403***
                               (46.101)
Depth                       -1,079.325***                   4,868.795***
                               (30.966)                       (12.370)
Constant                      340.657***                   -13,296.570***
                               (75.618)                       (44.636)
---------------------------------------------------------------------------------
Observations                    53,940                         53,940
R2                              0.853                          0.742
Adjusted R2                     0.853                          0.742
Residual Std. Error     1,531.425 (df = 53937)         2,027.382 (df = 53938)
F Statistic         156,054.400*** (df = 2; 53937) 154,922.100*** (df = 1; 53938)
=================================================================================
Note:                                                 *p<0.1; **p<0.05; ***p<0.01
Intuition:
When a variable is removed from a model, the regression treats the remaining regressors as the only explanatory variables. It therefore estimates their influence on y (price) as though no other variable contributes. The included variables then “make up” for the omitted one: any part of the omitted variable's effect that moves together with an included regressor is attributed to that regressor. This leads to biased estimates, which is why examining the sign relationships through a two-by-two matrix helps in determining the direction of the bias. The magnitude depends on how large the omitted variable's coefficient is and how strongly it is related to the included regressor.