(i) Finding the averages and standard deviations of prpblck and income:

# Load the wooldridge package
library(wooldridge)

# Load the DISCRIM dataset
data("discrim")

# prpblck and income contain missing values, so drop them with na.rm = TRUE
avg_prpblck <- mean(discrim$prpblck, na.rm = TRUE)
sd_prpblck <- sd(discrim$prpblck, na.rm = TRUE)

avg_income <- mean(discrim$income, na.rm = TRUE)
sd_income <- sd(discrim$income, na.rm = TRUE)

cat("Average of prpblck:", avg_prpblck, "\n")
cat("Standard deviation of prpblck:", sd_prpblck, "\n")
cat("Average of income:", avg_income, "\n")
cat("Standard deviation of income:", sd_income, "\n")

The code calculates the mean and standard deviation of prpblck (the proportion of the population that is Black) and income (median income in the ZIP code). Both variables contain missing values, so mean() and sd() return NA unless na.rm = TRUE is passed.
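mean() and sd() return NA whenever their input contains a missing value. The following is a minimal sketch of that behavior on a made-up vector; the values in x are illustrative and are not taken from discrim.

```r
# Toy vector with one missing value (illustrative, not from discrim)
x <- c(0.10, NA, 0.30, 0.20)

# Count the missing values
n_missing <- sum(is.na(x))

# Without na.rm, a single NA propagates to the result
mean_raw <- mean(x)

# With na.rm = TRUE, missing values are dropped before computing
mean_clean <- mean(x, na.rm = TRUE)
sd_clean <- sd(x, na.rm = TRUE)

cat("Missing values:", n_missing, "\n")
cat("Mean without na.rm:", mean_raw, "\n")
cat("Mean with na.rm = TRUE:", mean_clean, "\n")
cat("SD with na.rm = TRUE:", sd_clean, "\n")
```

The same check, sum(is.na(...)), can be run on discrim$prpblck and discrim$income to see how many observations are affected.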

(ii) Estimating the model:

model_ii <- lm(psoda ~ prpblck + income, data = discrim)
summary(model_ii)
## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06
coef_prpblck <- coef(model_ii)[2]
cat("Coefficient on prpblck:", coef_prpblck, "\n")
## Coefficient on prpblck: 0.1149882
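
One way to judge whether the prpblck effect is economically large is to scale the coefficient by a plausible change in the regressor. The sketch below hard-codes the estimates printed in the summary above. The 0.10 change in prpblck and the 10,000 change in income are illustrative assumptions, not values given in the exercise.

```r
# Estimates copied from the summary(model_ii) output above
b_prpblck <- 0.1149882
b_income  <- 1.603e-06

# Illustrative changes in the regressors (assumed for this example)
delta_prpblck <- 0.10    # 10 percentage point rise in the Black population share
delta_income  <- 10000   # rise of 10,000 in median income

# In a level-level model, predicted change in y = coefficient * change in x
effect_prpblck <- b_prpblck * delta_prpblck
effect_income  <- b_income * delta_income

cat("Change in psoda for a 0.10 rise in prpblck:", round(effect_prpblck, 4), "\n")
cat("Change in psoda for a 10,000 rise in income:", round(effect_income, 4), "\n")
```

Holding income fixed, a 10 percentage point increase in prpblck is associated with a predicted increase in psoda of about 0.0115 in the units of psoda, which is roughly one cent if psoda is measured in dollars.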

(v) Adding prppov to the regression:

model_v <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(model_v)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08

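The regression from part (iv) is extended by adding prppov (the proportion of the population in poverty) to see how its inclusion affects the coefficient on prpblck. With prppov in the model, the coefficient on prpblck is 0.073 and is statistically significant at the 5% level (p = 0.018).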
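
Because the dependent variable is log(psoda), the prpblck coefficient is read as an approximate proportional change in price. The following sketch uses the estimate printed above. The 0.20 change in prpblck is an illustrative assumption.

```r
# Estimate copied from the summary(model_v) output above
b_prpblck <- 0.07281

# Illustrative change in prpblck (assumed for this example)
delta <- 0.20

# Log-level approximation: %change in y is about 100 * coefficient * change in x
approx_pct <- 100 * b_prpblck * delta

# Exact percentage change implied by the log specification
exact_pct <- 100 * (exp(b_prpblck * delta) - 1)

cat("Approximate % change in psoda:", round(approx_pct, 2), "\n")
cat("Exact % change in psoda:", round(exact_pct, 2), "\n")
```

For a change this small the approximation and the exact figure nearly coincide, so the simpler approximation is adequate here.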

(vi) Correlation between log(income) and prppov:

# Both variables contain missing values, so use only rows where both are observed
correlation <- cor(log(discrim$income), discrim$prppov, use = "complete.obs")
cat("Correlation between log(income) and prppov:", correlation, "\n")

The correlation between log(income) and prppov is computed to check for potential multicollinearity between these two regressors. Without use = "complete.obs", cor() returns NA because both variables contain missing values.

(vii) Evaluating multicollinearity:

cat("Since the correlation is ", round(correlation, 3), ", multicollinearity might be a concern depending on how high it is.\n", sep = "")

A strong correlation between log(income) and prppov, in either direction, indicates multicollinearity. Multicollinearity does not bias the OLS estimates, but it inflates their standard errors, so the individual coefficients are estimated less precisely. In the part (v) regression, both log(income) and prppov are nonetheless statistically significant at the 1% level.
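
The link between correlation and imprecise estimates can be made concrete with the variance inflation factor (VIF). The sketch below assumes a model with only two regressors, where the VIF equals 1 / (1 - r^2). The part (v) regression has three regressors, so its VIFs would come from regressing each regressor on all the others. The helper vif_two_regressors and the r values are hypothetical and are not estimates from discrim.

```r
# VIF for a model with exactly two regressors correlated at r (simplifying assumption)
vif_two_regressors <- function(r) {
  stopifnot(abs(r) < 1)
  1 / (1 - r^2)
}

# Hypothetical correlations, chosen only for illustration
r_values <- c(0.3, 0.6, 0.8, 0.9)
vifs <- sapply(r_values, vif_two_regressors)

print(data.frame(r = r_values, vif = round(vifs, 2)))
```

The sign of r does not matter because it enters squared. The VIF rises slowly at moderate correlations and steeply as the absolute correlation approaches 1.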