in class exercise 4

Main Model Variables:

log(psoda): Log of soda prices (dependent variable)

prpblck: Proportion of Black residents in the zip code

log(income): Log of median income in the zip code

prppov: Proportion of population in poverty

log(hseval): Log of median housing value (added later)

Load required libraries

library(wooldridge) 
library(car)

## Loading required package: carData

data("discrim")

(i) First regression model

model1 <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(model1)

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08

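This model is log-log in income and log-level in prpblck and prppov: the dependent variable and income are logged, while the two proportions enter in levels.

The log transformation on psoda and income helps handle non-linear relationships
The coefficient on log(income) is an elasticity: a 1% higher median income is associated with about 0.137% higher soda prices. The coefficients on prpblck and prppov, multiplied by 100, give the approximate percentage change in price for a one-unit change in the proportion.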
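
As a rough check on how a log-level coefficient reads, the sketch below converts the reported prpblck coefficient from model1 into an approximate and an exact percentage change in soda prices. The 0.20 increase in prpblck is an illustrative value assumed here, not something taken from the exercise.

```r
# Reported prpblck coefficient from model1 (log-level relationship)
b_prpblck <- 0.07281
# Hypothetical increase in the proportion of Black residents (illustrative only)
delta <- 0.20

# Approximation: 100 * beta * delta
approx_pct <- 100 * b_prpblck * delta
# Exact change implied by the log specification: 100 * (exp(beta * delta) - 1)
exact_pct <- 100 * (exp(b_prpblck * delta) - 1)

round(c(approximate = approx_pct, exact = exact_pct), 3)
```

For a coefficient this small the two numbers are nearly identical; the gap grows as beta * delta gets larger.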

(ii) Correlation analysis

cor.test(log(discrim$income), discrim$prppov)

## 
##  Pearson's product-moment correlation
## 
## data:  log(discrim$income) and discrim$prppov
## t = -31.04, df = 407, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8650980 -0.8071224
## sample estimates:
##       cor 
## -0.838467

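Tests the linear relationship between log income and the poverty rate. The correlation is about -0.84 and strongly significant (p < 2.2e-16).

A correlation this strong signals potential multicollinearity, yet both variables are individually significant in model1 (p < 0.01 for each).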
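
The t statistic reported by cor.test can be reproduced from the correlation and its degrees of freedom alone. This is a minimal sketch using the printed values rather than the data; the variable names are chosen here for illustration.

```r
# Values copied from the cor.test output
r <- -0.838467
df <- 407

# t statistic for H0: correlation = 0, which is r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

round(t_stat, 2)
```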

(iii) Extended regression model with hseval

model2 <- lm(log(psoda) ~ prpblck + log(income) + prppov + log(hseval), data = discrim)
summary(model2)

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov + log(hseval), 
##     data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30652 -0.04380  0.00701  0.04332  0.35272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.84151    0.29243  -2.878 0.004224 ** 
## prpblck      0.09755    0.02926   3.334 0.000937 ***
## log(income) -0.05299    0.03753  -1.412 0.158706    
## prppov       0.05212    0.13450   0.388 0.698571    
## log(hseval)  0.12131    0.01768   6.860 2.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07702 on 396 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1839, Adjusted R-squared:  0.1757 
## F-statistic: 22.31 on 4 and 396 DF,  p-value: < 2.2e-16

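Adds log housing value as a control variable. Its coefficient is 0.121 (p < 0.001): a 1% higher median housing value is associated with about 0.12% higher soda prices.

Housing value helps account for general cost-of-living differences. Once it is included, log(income) and prppov are no longer individually significant, while the prpblck coefficient rises from 0.073 to 0.098.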

(iv) Joint significance test for income and prppov

model3 <- lm(log(psoda) ~ prpblck + log(hseval), data = discrim)
anova(model3, model2)

## Analysis of Variance Table
## 
## Model 1: log(psoda) ~ prpblck + log(hseval)
## Model 2: log(psoda) ~ prpblck + log(income) + prppov + log(hseval)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    398 2.3911                              
## 2    396 2.3493  2  0.041797 3.5227 0.03045 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

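Tests whether log(income) and prppov jointly improve the model once prpblck and log(hseval) are included. The F statistic is 3.52 with p = 0.030, so the two variables are jointly significant at the 5% level even though neither is individually significant in model2.

A joint test is useful when regressors are highly correlated, because multicollinearity inflates the individual standard errors and can hide a joint effect.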
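
The F statistic in the anova table can be rebuilt from the residual sums of squares. The sketch below uses the rounded values printed above, so it should match the table only up to rounding; the variable names are assumptions made for this illustration.

```r
# Values copied from the anova(model3, model2) output
ssr_diff <- 0.041797  # RSS(restricted) - RSS(unrestricted)
rss_ur <- 2.3493      # RSS of the unrestricted model (model2)
q <- 2                # number of restrictions: log(income) and prppov
df_ur <- 396          # residual degrees of freedom of model2

# F = (reduction in RSS per restriction) / (unrestricted RSS per residual df)
f_stat <- (ssr_diff / q) / (rss_ur / df_ur)
p_value <- pf(f_stat, q, df_ur, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```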

vif(model2)

##     prpblck log(income)      prppov log(hseval) 
##    1.951820    7.724903    5.627505    3.193821
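
Each variance inflation factor equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on the others. The following sketch inverts the reported VIFs to show how much of each regressor is explained by the rest; it uses only the printed values, and the object names are chosen here for illustration.

```r
# VIFs copied from the vif(model2) output
vif_reported <- c(
  prpblck       = 1.951820,
  `log(income)` = 7.724903,
  prppov        = 5.627505,
  `log(hseval)` = 3.193821
)

# Implied R-squared of each auxiliary regression: R_j^2 = 1 - 1 / VIF_j
aux_r2 <- 1 - 1 / vif_reported

round(aux_r2, 3)
```

By this measure, log(income) is the regressor most closely tied to the others, which is consistent with its strong correlation with prppov.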

par(mfrow=c(2,2))
plot(model2)

(v) The most reliable regression is model2, which includes log(hseval):

log(psoda) ~ prpblck + log(income) + prppov + log(hseval)

Here’s why this model is most reliable:

Completeness of Controls:
- It includes housing values (hseval), which control for general cost of living and neighborhood characteristics
- This helps isolate the true effect of racial composition (prpblck) by accounting for economic differences between areas
Reduced Omitted Variable Bias:
- Without housing values, we might incorrectly attribute price differences to racial composition when they’re actually due to overall neighborhood cost structures
- Housing values serve as a proxy for unobserved neighborhood characteristics
Better Model Specification:
- Including both economic variables (income, poverty) and housing values provides a more complete picture of neighborhood characteristics
- This helps distinguish between racial discrimination and legitimate cost-based price differences
More Conservative Test:
- The prpblck coefficient remains positive and significant after controlling for all these factors (0.098, p < 0.001), which provides stronger evidence of potential discrimination
- A model with these controls is less likely to produce false positives in identifying discrimination
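
To gauge how firmly the racial-composition effect survives the controls, a 95% confidence interval for the prpblck coefficient in model2 can be sketched from the reported estimate and standard error. This uses the printed values only and assumes the usual homoskedastic OLS standard errors.

```r
# Values copied from the summary(model2) output
est <- 0.09755      # prpblck estimate
se <- 0.02926       # prpblck standard error
df_resid <- 396     # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se

round(ci, 4)
```

With the data loaded, confint(model2, "prpblck") should give the same interval up to rounding. The interval excludes zero, which matches the small p-value reported for prpblck.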

Oyuntuya_112035144

2024-10-24