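library(tidyverse)   # attaches dplyr, so a separate library(dplyr) call is not needed
library(wooldridge)  # provides the discrim dataset
library(car)         # provides linearHypothesis() for the joint F-test
data("discrim")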

C9: Dataset DISCRIM

(i) Estimate model using OLS

ols <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(ols)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08

The two-sided p-value for prpblck is 0.0181. Since 0.0181 < 0.05, \(\hat{\beta}_1\) is statistically different from 0 at the 5% level. Since 0.0181 > 0.01, it is not statistically different from 0 at the 1% level.
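The decision at each significance level can be checked directly from the reported t statistic. The sketch below uses only base R and the values printed in the summary output above (t = 2.373 on 397 degrees of freedom); the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(ols) output above
t_prpblck <- 2.373   # reported t value for prpblck
df_resid  <- 397     # residual degrees of freedom

# Two-sided p-value from the t distribution
p_two_sided <- 2 * pt(-abs(t_prpblck), df = df_resid)

# Two-sided critical values at the 5% and 1% levels
crit_5 <- qt(0.975, df = df_resid)
crit_1 <- qt(0.995, df = df_resid)

# Reject H0: beta_1 = 0 when |t| exceeds the critical value
reject_5 <- abs(t_prpblck) > crit_5
reject_1 <- abs(t_prpblck) > crit_1

print(c(p_value = p_two_sided, crit_5 = crit_5, crit_1 = crit_1))
print(c(reject_at_5 = reject_5, reject_at_1 = reject_1))
```

The t statistic lies between the two critical values, which is why the null is rejected at the 5% level but not at the 1% level.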

(ii) Correlation and two-sided p-values

log_income <- log(discrim$income)
cor(log_income, discrim$prppov, use = "complete.obs")
## [1] -0.838467

The correlation between log(income) and prppov is about -0.84. Both variables are statistically significant in the model at the 1% level, with two-sided p-values of 4.8e-07 and 0.0044, respectively.

(iii) Add another variable

ols1 <- lm(log(psoda) ~ prpblck + log(income) + prppov + log(hseval), data = discrim)
summary(ols1)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov + log(hseval), 
##     data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30652 -0.04380  0.00701  0.04332  0.35272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.84151    0.29243  -2.878 0.004224 ** 
## prpblck      0.09755    0.02926   3.334 0.000937 ***
## log(income) -0.05299    0.03753  -1.412 0.158706    
## prppov       0.05212    0.13450   0.388 0.698571    
## log(hseval)  0.12131    0.01768   6.860 2.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07702 on 396 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1839, Adjusted R-squared:  0.1757 
## F-statistic: 22.31 on 4 and 396 DF,  p-value: < 2.2e-16

Coefficient interpretation (the dependent variable is log(psoda), so effects are approximately proportional changes in price):
- prpblck is a proportion, so a one-unit change means going from 0 to 1. A 0.10 increase in the proportion of Black residents is associated with an increase in the price of a medium soda of about 0.98% (100 × 0.09755 × 0.10), holding all other variables constant.
- The coefficient on log(income) is an elasticity: a 1% increase in median family income is associated with a decrease in the price of about 0.053%, holding all other variables constant. This estimate is not statistically different from zero.
- A 0.10 increase in the proportion of residents in poverty (prppov) is associated with an increase in the price of about 0.52% (100 × 0.05212 × 0.10), holding all other variables constant. This estimate is not statistically different from zero.
- The coefficient on log(hseval) is also an elasticity: a 1% increase in median housing value is associated with an increase in the price of about 0.121%, holding all other variables constant.
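Because the dependent variable is log(psoda), the coefficients can be converted into percentage changes in the soda price. The following is a minimal sketch using the estimates printed above; the chosen step sizes (a 0.10 change in prpblck and a 10% change in hseval) are illustrative assumptions, not part of the original exercise.

```r
# Estimates copied from the summary(ols1) output above
b_prpblck <- 0.09755   # coefficient on prpblck (a proportion between 0 and 1)
b_hseval  <- 0.12131   # coefficient on log(hseval) (an elasticity)

# Log-level term: exact percent change in psoda for a 0.10 increase in prpblck
pct_prpblck_exact  <- 100 * (exp(b_prpblck * 0.10) - 1)
# Usual approximation: 100 * coefficient * change in x
pct_prpblck_approx <- 100 * b_prpblck * 0.10

# Log-log term: exact percent change in psoda for a 10% increase in hseval
pct_hseval_exact  <- 100 * (1.10^b_hseval - 1)
# Usual approximation: coefficient * percent change in x
pct_hseval_approx <- b_hseval * 10

print(round(c(prpblck_exact = pct_prpblck_exact,
              prpblck_approx = pct_prpblck_approx,
              hseval_exact = pct_hseval_exact,
              hseval_approx = pct_hseval_approx), 3))
```

For changes this small the exact and approximate figures are nearly identical, so the simple "coefficient times change" reading is adequate here.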

The two-sided p-value for \(H_0: \beta_{\log(\text{hseval})}=0\) is 2.67e-11, so log(hseval) is statistically significant at any conventional level.

(iv) Jointly significant

In the new model, neither log(income) nor prppov is individually significant at the 5% level (p = 0.1587 and p = 0.6986, respectively). Individual insignificance alone does not show that both can be dropped, so we test them jointly.

joint_test <- linearHypothesis(ols1, c("log(income) = 0", "prppov = 0"))
print(joint_test)
## 
## Linear hypothesis test:
## log(income) = 0
## prppov = 0
## 
## Model 1: restricted model
## Model 2: log(psoda) ~ prpblck + log(income) + prppov + log(hseval)
## 
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    398 2.3911                              
## 2    396 2.3493  2  0.041797 3.5227 0.03045 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (0.03045) is less than the common significance level of 0.05, we reject the null hypothesis that the coefficients of log(income) and prppov are jointly equal to zero. This means that these two variables are jointly significant in explaining the variability in the dependent variable, log(psoda).
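The F statistic reported by linearHypothesis() can be reproduced by hand from the restricted and unrestricted residual sums of squares. This sketch uses base R and the rounded values printed in the test table above, so it matches the reported statistic only up to rounding; the variable names are illustrative.

```r
# Values copied from the linearHypothesis() output above
rss_r  <- 2.3911   # RSS of the restricted model (log(income) and prppov dropped)
rss_ur <- 2.3493   # RSS of the unrestricted model (ols1)
q      <- 2        # number of restrictions being tested
df_ur  <- 396      # residual degrees of freedom of the unrestricted model

# F = [(RSS_r - RSS_ur) / q] / [RSS_ur / df_ur]
f_stat <- ((rss_r - rss_ur) / q) / (rss_ur / df_ur)

# Upper-tail p-value from the F(q, df_ur) distribution
p_val <- pf(f_stat, df1 = q, df2 = df_ur, lower.tail = FALSE)

print(c(F = f_stat, p_value = p_val))
```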

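Joint significance means that together the two variables help explain log(psoda), but it does not mean that each has a precisely estimated individual effect. log(income) and prppov are highly correlated with each other (about -0.84, from part (ii)) and plausibly with log(hseval) as well. This multicollinearity inflates their individual standard errors, which is why they can be individually insignificant yet jointly significant.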

(v) Model choice

The model from part (iii), which adds log(hseval), is more reliable for determining whether the racial makeup of a zip code influences local fast-food prices because:

  • It has a higher R-squared (0.1839 versus 0.08696), so it explains more of the variance in soda prices.
  • It retains the significance of prpblck (p = 0.0009, with the coefficient rising from 0.073 to 0.098) while controlling for housing value, which is itself a highly significant predictor.
  • The overall F-statistic is larger (22.31 versus 12.6), and because the model holds income, poverty, and housing value fixed, the prpblck estimate is less likely to reflect omitted neighborhood characteristics.
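The gain in R-squared from adding log(hseval) can be cross-checked against the t statistic reported for that variable, since for a single added regressor the F statistic equals the squared t statistic. This is a rough sketch using the rounded R-squared values from the two summaries above (both models use the same 401 observations), so agreement is only approximate; the variable names are illustrative.

```r
# Values copied from the two summary() outputs above
r2_small <- 0.08696   # R-squared without log(hseval)
r2_large <- 0.1839    # R-squared with log(hseval)
df_large <- 396       # residual degrees of freedom of the larger model
q        <- 1         # one added regressor

# R-squared form of the F statistic for nested models on the same sample
f_added <- ((r2_large - r2_small) / q) / ((1 - r2_large) / df_large)

# With one restriction, sqrt(F) should be close to |t| for log(hseval)
t_implied  <- sqrt(f_added)
t_reported <- 6.860

print(c(F = f_added, t_implied = t_implied, t_reported = t_reported))
```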