library(dplyr)
library(tidyverse)
library(wooldridge)
library(car)
data("discrim")
# Data set: DISCRIM (loaded above as the data frame `discrim`)
ols <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(ols)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
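From the p-values of the estimated coefficients, we can see that \(\hat{\beta}_1\) (the coefficient on prpblck) is statistically different from 0 at the 5% level, since its two-sided p-value is 0.0181 < 0.05. It is not statistically different from 0 at the 1% level, since 0.0181 > 0.01.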
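The comparison against the two significance levels can be reproduced by hand. The sketch below is only a check: it copies the estimate, standard error, and residual degrees of freedom from the `summary(ols)` output and recomputes the two-sided p-value with base R's `pt()`. The variable names are illustrative.

```r
# Values copied from the summary(ols) output above.
estimate <- 0.07281   # coefficient on prpblck
std_err  <- 0.03068   # its standard error
df_resid <- 397       # residual degrees of freedom

# t statistic and two-sided p-value for H0: beta_1 = 0.
t_stat  <- estimate / std_err
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Decision at each significance level.
reject_5pct <- p_value < 0.05
reject_1pct <- p_value < 0.01

cat("t =", round(t_stat, 3), " p =", round(p_value, 4), "\n")
cat("Reject at 5%:", reject_5pct, " Reject at 1%:", reject_1pct, "\n")
```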
log_income <- log(discrim$income)
cor(log_income, discrim$prppov, use = "complete.obs")
## [1] -0.838467
The correlation between log(income) and prppov is -0.84. Despite this strong negative correlation, both variables are statistically significant in the model at the 1% level, with two-sided p-values of 4.8e-07 and 0.0044, respectively.
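One rough way to gauge how much this correlation matters is a variance inflation factor. As a sketch, the pairwise correlation alone gives a lower bound of 1 / (1 - r^2) on the VIF of either variable, assuming the correlation is computed on about the same sample as the regression (here it was computed on pairwise complete observations, so treat the number as approximate). If the fitted `ols` object and the `car` package are available, `car::vif()` reports the full values.

```r
# Correlation between log(income) and prppov, copied from the output above.
r <- -0.838467

# Lower bound on the VIF implied by the pairwise correlation alone.
vif_lower_bound <- 1 / (1 - r^2)
print(round(vif_lower_bound, 2))

# With the fitted model in the workspace, car::vif() gives the full VIFs.
if (exists("ols") && requireNamespace("car", quietly = TRUE)) {
  print(car::vif(ols))
}
```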
ols1 <- lm(log(psoda) ~ prpblck + log(income) + prppov + log(hseval), data = discrim)
summary(ols1)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov + log(hseval),
## data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30652 -0.04380 0.00701 0.04332 0.35272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.84151 0.29243 -2.878 0.004224 **
## prpblck 0.09755 0.02926 3.334 0.000937 ***
## log(income) -0.05299 0.03753 -1.412 0.158706
## prppov 0.05212 0.13450 0.388 0.698571
## log(hseval) 0.12131 0.01768 6.860 2.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07702 on 396 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.1839, Adjusted R-squared: 0.1757
## F-statistic: 22.31 on 4 and 396 DF, p-value: < 2.2e-16
Coefficient interpretation:
- For every one-unit increase in the proportion of Black individuals in
the area (prpblck), that is, moving from 0 to 1, the expected log price
of medium soda increases by approximately 0.09755, holding all other
variables constant. A 10 percentage point increase (0.10) therefore
raises the price by roughly 0.98%.
- For a 1% increase in median family income, the expected price of
medium soda decreases by about 0.053% (an elasticity of -0.05299),
holding all other variables constant.
- For every one-unit increase in the proportion of individuals in
poverty (prppov), the expected log price of medium soda increases by
about 0.05212, holding all other variables constant. A 10 percentage
point increase therefore raises the price by roughly 0.52%.
- For a 1% increase in median housing value, the expected price of
medium soda increases by approximately 0.121% (an elasticity of
0.12131), holding all other variables constant.
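The percentage effects implied by these coefficients can be worked out directly. The sketch below copies the estimates from the `summary(ols1)` output. `pct_change` is a hypothetical helper written for this illustration, not part of any package. It uses the exact formula 100 * (exp(beta * delta) - 1) for the log-level terms and treats the log-log coefficients as elasticities.

```r
# Coefficients copied from the summary(ols1) output above.
b_prpblck <- 0.09755
b_income  <- -0.05299
b_prppov  <- 0.05212
b_hseval  <- 0.12131

# Hypothetical helper: exact percent change in psoda when a
# level regressor (a proportion here) changes by `delta`.
pct_change <- function(beta, delta) 100 * (exp(beta * delta) - 1)

# Effect of a 10 percentage point increase in each proportion.
pct_prpblck_10pp <- pct_change(b_prpblck, 0.10)
pct_prppov_10pp  <- pct_change(b_prppov, 0.10)

# Log-log terms: percent change in psoda for a 1% increase in the regressor.
pct_income_1pct <- 100 * (1.01^b_income - 1)
pct_hseval_1pct <- 100 * (1.01^b_hseval - 1)

print(round(c(prpblck_10pp = pct_prpblck_10pp,
              prppov_10pp  = pct_prppov_10pp,
              income_1pct  = pct_income_1pct,
              hseval_1pct  = pct_hseval_1pct), 3))
```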
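The two-sided p-value for \(H_0: \beta_{\log(\text{hseval})}=0\) is 2.67e-11, so the coefficient on log(hseval) is statistically significant at any conventional level.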
In the new model, log(income) and prppov are no longer individually statistically significant at the 0.05 level (p = 0.1587 and p = 0.6986, respectively), so neither has a clearly detectable partial effect once log(hseval) is controlled for.
joint_test <- linearHypothesis(ols1, c("log(income) = 0", "prppov = 0"))
print(joint_test)
##
## Linear hypothesis test:
## log(income) = 0
## prppov = 0
##
## Model 1: restricted model
## Model 2: log(psoda) ~ prpblck + log(income) + prppov + log(hseval)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 398 2.3911
## 2 396 2.3493 2 0.041797 3.5227 0.03045 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (0.03045) is less than the common significance level of 0.05, we reject the null hypothesis that the coefficients of log(income) and prppov are jointly equal to zero. This means that these two variables are jointly significant in explaining the variability in the dependent variable, log(psoda).
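The fact that the variables are jointly significant means that together they explain variability in the dependent variable (log(psoda)), but it does not necessarily mean that each variable has a strong individual effect. Because log(income) and prppov are highly correlated (about -0.84), their individual standard errors are inflated, so each can appear insignificant on its own even though the pair matters.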
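The joint F statistic can also be rebuilt from the residual sums of squares in the `linearHypothesis()` table. This is a sketch using the rounded values printed above, so the result agrees with the reported F = 3.5227 only up to rounding. The variable names are illustrative.

```r
# Values copied from the linearHypothesis() output above.
rss_restricted   <- 2.3911  # model without log(income) and prppov
rss_unrestricted <- 2.3493  # full model (ols1)
q        <- 2               # number of restrictions tested
df_resid <- 396             # residual df of the unrestricted model

# F = [(RSS_r - RSS_ur) / q] / [RSS_ur / df_resid]
f_stat  <- ((rss_restricted - rss_unrestricted) / q) /
  (rss_unrestricted / df_resid)
p_value <- pf(f_stat, df1 = q, df2 = df_resid, lower.tail = FALSE)

cat("F =", round(f_stat, 3), " p =", round(p_value, 4), "\n")
```

With these inputs the null is rejected at the 5% level but not at the 1% level, so the joint significance is real but not overwhelming.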
Model 2 is more reliable for determining whether the racial makeup of a zip code influences local fast-food prices because: