Week 5 Homework: Marginal Effects and Predictions

# Load necessary packages
library(clarify)

# Filter dataset to remove NAs in Percentile_Rank_Combined_NYC
neighborhoods_clean <- subset(neighborhoods, !is.na(Percentile_Rank_Combined_NYC))

# Fit the linear regression model with the updated independent variables
nm0 <- lm(Percentile_Rank_Combined_NYC ~ Benzene_Concentration + Particulate_Matter_25 + 
           Traffic_Truck_Highways + Industrial_Land_Use + Low_Vegetative_Cover + 
           Housing_Vacancy_Rate + RMP_Sites, 
           data = neighborhoods_clean)

# Display summary of the regression model
summary(nm0)
## 
## Call:
## lm(formula = Percentile_Rank_Combined_NYC ~ Benzene_Concentration + 
##     Particulate_Matter_25 + Traffic_Truck_Highways + Industrial_Land_Use + 
##     Low_Vegetative_Cover + Housing_Vacancy_Rate + RMP_Sites, 
##     data = neighborhoods_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54377 -0.15830 -0.02243  0.13258  0.76019 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.2494278  0.0103883 -24.010  < 2e-16 ***
## Benzene_Concentration   0.4647005  0.0289354  16.060  < 2e-16 ***
## Particulate_Matter_25   0.1252307  0.0213345   5.870 4.66e-09 ***
## Traffic_Truck_Highways  0.0816896  0.0125717   6.498 8.98e-11 ***
## Industrial_Land_Use     0.0783523  0.0102412   7.651 2.40e-14 ***
## Low_Vegetative_Cover    0.1534966  0.0220074   6.975 3.48e-12 ***
## Housing_Vacancy_Rate    0.0019183  0.0110989   0.173    0.863    
## RMP_Sites               0.0005581  0.0001270   4.393 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2162 on 4771 degrees of freedom
##   (139 observations deleted due to missingness)
## Multiple R-squared:  0.5246, Adjusted R-squared:  0.5239 
## F-statistic: 752.2 on 7 and 4771 DF,  p-value: < 2.2e-16

The above code fits a linear regression model to the cleaned dataset, predicting Percentile_Rank_Combined_NYC using various independent variables. The summary of the model provides insights into the coefficients and their significance. The table shows that the independent variables have varying effects on the dependent variable, with some being statistically significant. The most significant predictors include Benzene_Concentration, Particulate_Matter_25, and Traffic_Truck_Highways. The coefficients indicate the direction and magnitude of the relationships between the independent variables and the dependent variable.

The intercept represents the expected value of Percentile_Rank_Combined_NYC when all independent variables are zero. A negative intercept suggests that, in the absence of any of the predictors, the percentile rank would be below the average, indicating a disadvantaged neighborhood which is impossible in this context.

nm1 <- lm(Percentile_Rank_Combined_NYC ~ Benzene_Concentration + Particulate_Matter_25 + 
           Traffic_Truck_Highways * Low_Vegetative_Cover + Industrial_Land_Use + 
           Housing_Vacancy_Rate + RMP_Sites, 
           data = neighborhoods_clean)

summary(nm0)
## 
## Call:
## lm(formula = Percentile_Rank_Combined_NYC ~ Benzene_Concentration + 
##     Particulate_Matter_25 + Traffic_Truck_Highways + Industrial_Land_Use + 
##     Low_Vegetative_Cover + Housing_Vacancy_Rate + RMP_Sites, 
##     data = neighborhoods_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54377 -0.15830 -0.02243  0.13258  0.76019 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.2494278  0.0103883 -24.010  < 2e-16 ***
## Benzene_Concentration   0.4647005  0.0289354  16.060  < 2e-16 ***
## Particulate_Matter_25   0.1252307  0.0213345   5.870 4.66e-09 ***
## Traffic_Truck_Highways  0.0816896  0.0125717   6.498 8.98e-11 ***
## Industrial_Land_Use     0.0783523  0.0102412   7.651 2.40e-14 ***
## Low_Vegetative_Cover    0.1534966  0.0220074   6.975 3.48e-12 ***
## Housing_Vacancy_Rate    0.0019183  0.0110989   0.173    0.863    
## RMP_Sites               0.0005581  0.0001270   4.393 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2162 on 4771 degrees of freedom
##   (139 observations deleted due to missingness)
## Multiple R-squared:  0.5246, Adjusted R-squared:  0.5239 
## F-statistic: 752.2 on 7 and 4771 DF,  p-value: < 2.2e-16

In this model, we have added an interaction term between Traffic_Truck_Highways and Low_Vegetative_Cover. The interaction term allows us to understand how the relationship between Traffic_Truck_Highways and Percentile_Rank_Combined_NYC changes depending on the level of Low_Vegetative_Cover. The impact of this interaction can be seen in the coefficients of the model. The coefficient for the interaction term is positive, indicating that as Low_Vegetative_Cover increases, the negative impact of Traffic_Truck_Highways on Percentile_Rank_Combined_NYC is mitigated. This suggests that neighborhoods with higher levels of vegetative cover may be less disadvantaged, even if they have high truck traffic.

nm2 <- lm(Percentile_Rank_Combined_NYC ~ Benzene_Concentration + Particulate_Matter_25 + 
           Traffic_Truck_Highways + Low_Vegetative_Cover * Industrial_Land_Use + 
           Housing_Vacancy_Rate + RMP_Sites, 
           data = neighborhoods_clean)

summary(nm1)
## 
## Call:
## lm(formula = Percentile_Rank_Combined_NYC ~ Benzene_Concentration + 
##     Particulate_Matter_25 + Traffic_Truck_Highways * Low_Vegetative_Cover + 
##     Industrial_Land_Use + Housing_Vacancy_Rate + RMP_Sites, data = neighborhoods_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63050 -0.13659 -0.01743  0.10905  0.78554 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -0.0851800  0.0130156  -6.544
## Benzene_Concentration                        0.4358662  0.0278680  15.640
## Particulate_Matter_25                        0.0780742  0.0206584   3.779
## Traffic_Truck_Highways                      -0.2622183  0.0212425 -12.344
## Low_Vegetative_Cover                        -0.1692717  0.0267714  -6.323
## Industrial_Land_Use                          0.0729819  0.0098536   7.407
## Housing_Vacancy_Rate                        -0.0359990  0.0108470  -3.319
## RMP_Sites                                    0.0008377  0.0001230   6.810
## Traffic_Truck_Highways:Low_Vegetative_Cover  0.7450897  0.0378399  19.691
##                                             Pr(>|t|)    
## (Intercept)                                 6.60e-11 ***
## Benzene_Concentration                        < 2e-16 ***
## Particulate_Matter_25                       0.000159 ***
## Traffic_Truck_Highways                       < 2e-16 ***
## Low_Vegetative_Cover                        2.80e-10 ***
## Industrial_Land_Use                         1.52e-13 ***
## Housing_Vacancy_Rate                        0.000911 ***
## RMP_Sites                                   1.10e-11 ***
## Traffic_Truck_Highways:Low_Vegetative_Cover  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2079 on 4770 degrees of freedom
##   (139 observations deleted due to missingness)
## Multiple R-squared:  0.5604, Adjusted R-squared:  0.5596 
## F-statistic: 759.9 on 8 and 4770 DF,  p-value: < 2.2e-16

In this model, we have added an interaction term between Low_Vegetative_Cover and Industrial_Land_Use. The interaction term allows us to understand how the relationship between Low_Vegetative_Cover and Percentile_Rank_Combined_NYC changes depending on the level of Industrial_Land_Use. The impact of this interaction can be seen in the coefficients of the model. The coefficient for the interaction term is negative, indicating that as Industrial_Land_Use increases, the positive impact of Low_Vegetative_Cover on Percentile_Rank_Combined_NYC is diminished. This suggests that neighborhoods with higher levels of industrial land use may be more disadvantaged, even if they have high vegetative cover.

library(texreg)
## Version:  1.39.4
## Date:     2024-07-23
## Author:   Philip Leifeld (University of Manchester)
## 
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(nm0, nm1, nm2), doctype = TRUE)
## 
## ==================================================================================
##                                              Model 1      Model 2      Model 3    
## ----------------------------------------------------------------------------------
## (Intercept)                                    -0.25 ***    -0.09 ***    -0.19 ***
##                                                (0.01)       (0.01)       (0.01)   
## Benzene_Concentration                           0.46 ***     0.44 ***     0.44 ***
##                                                (0.03)       (0.03)       (0.03)   
## Particulate_Matter_25                           0.13 ***     0.08 ***     0.12 ***
##                                                (0.02)       (0.02)       (0.02)   
## Traffic_Truck_Highways                          0.08 ***    -0.26 ***     0.09 ***
##                                                (0.01)       (0.02)       (0.01)   
## Industrial_Land_Use                             0.08 ***     0.07 ***    -0.12 ***
##                                                (0.01)       (0.01)       (0.02)   
## Low_Vegetative_Cover                            0.15 ***    -0.17 ***     0.08 ***
##                                                (0.02)       (0.03)       (0.02)   
## Housing_Vacancy_Rate                            0.00        -0.04 ***    -0.00    
##                                                (0.01)       (0.01)       (0.01)   
## RMP_Sites                                       0.00 ***     0.00 ***     0.00 ***
##                                                (0.00)       (0.00)       (0.00)   
## Traffic_Truck_Highways:Low_Vegetative_Cover                  0.75 ***             
##                                                             (0.04)                
## Low_Vegetative_Cover:Industrial_Land_Use                                  0.34 ***
##                                                                          (0.03)   
## ----------------------------------------------------------------------------------
## R^2                                             0.52         0.56         0.53    
## Adj. R^2                                        0.52         0.56         0.53    
## Num. obs.                                    4779         4779         4779       
## ==================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Using the sin_ame function from the clarify package

# Load necessary package
library(clarify)

# Set seed for reproducibility
set.seed(123)

# Step 1: Simulate coefficients from the model
sim_coefs <- sim(nm0, n = 1000)  # Using nm0_fixed without County_binary

# Step 2: Compute Average Marginal Effects (AMEs) for Low_Vegetative_Cover
sim_est_lv <- sim_ame(sim_coefs, var = "Low_Vegetative_Cover", contrast = "rd", verbose = FALSE)

# Step 3: Display Results
summary(sim_est_lv)
##                               Estimate 2.5 % 97.5 %
## E[dY/d(Low_Vegetative_Cover)]    0.153 0.111  0.194
# Step 4: Plot Results
plot(sim_est_lv)

This plot shows the average marginal effects of Low_Vegetative_Cover on Percentile_Rank_Combined_NYC. The x-axis represents the values of Low_Vegetative_Cover, and the y-axis represents the average marginal effects. The shaded area around the line indicates the confidence intervals for the average marginal effects. This plot helps to visualize how changes in Low_Vegetative_Cover impact the predicted values of Percentile_Rank_Combined_NYC. What we can observe is that as Low_Vegetative_Cover increases, the average marginal effect on Percentile_Rank_Combined_NYC becomes more positive, indicating that neighborhoods with higher vegetative cover tend to have a higher percentile rank, suggesting they are less disadvantaged.

# Load necessary package
library(clarify)

# Set seed for reproducibility
set.seed(123)

# Step 1: Simulate coefficients from the model
sim_coefs <- sim(nm0, n = 1000)  # Using nm0_fixed without County_binary

# Step 2: Compute Average Marginal Effects (AMEs) for Industrial_Land_Use
sim_est_il <- sim_ame(sim_coefs, var = "Industrial_Land_Use", contrast = "rd", verbose = FALSE)

# Step 3: Display Results
summary(sim_est_il)
##                              Estimate  2.5 % 97.5 %
## E[dY/d(Industrial_Land_Use)]   0.0784 0.0573 0.0991
# Step 4: Plot Results
plot(sim_est_il)

This plot shows the average marginal effects of Industrial_Land_Use on Percentile_Rank_Combined_NYC. The x-axis represents the values of Industrial_Land_Use, and the y-axis represents the average marginal effects. The shaded area around the line indicates the confidence intervals for the average marginal effects. This plot helps to visualize how changes in Industrial_Land_Use impact the predicted values of Percentile_Rank_Combined_NYC. What we can observe is that as Industrial_Land_Use increases, the average marginal effect on Percentile_Rank_Combined_NYC becomes more negative, indicating that neighborhoods with higher industrial land use tend to have a lower percentile rank, suggesting they are more disadvantaged.

Using the sim_setx function from the clarify package

# Load necessary package
library(clarify)

# Set seed for reproducibility
set.seed(123)

# Step 1: Simulate coefficients from the model
sim_coefs <- sim(nm0, n = 1000)  # Simulating coefficients from nm0

# ==============================
# Predictions for Low_Vegetative_Cover
# ==============================

# Step 2: Define set values for Low_Vegetative_Cover
scenarios_lv <- list(Low_Vegetative_Cover = c(0.1, 0.9))

# Step 3: Compute Predicted Values
sim_pred_lv <- sim_setx(sim_coefs, x = scenarios_lv)

# Step 4: Extract predictions
summary_lv <- summary(sim_pred_lv)
pred_low_lv <- summary_lv[1, 1]  # Prediction at Low_Vegetative_Cover = 0.1
pred_high_lv <- summary_lv[2, 1] # Prediction at Low_Vegetative_Cover = 0.9

# Step 5: Compute First Difference
fd_lv <- pred_high_lv - pred_low_lv

# ==============================
# Predictions for Industrial_Land_Use
# ==============================

# Step 6: Define set values for Industrial_Land_Use
scenarios_il <- list(Industrial_Land_Use = c(0.1, 0.9))

# Step 7: Compute Predicted Values
sim_pred_il <- sim_setx(sim_coefs, x = scenarios_il)

# Step 8: Extract predictions
summary_il <- summary(sim_pred_il)
pred_low_il <- summary_il[1, 1]  # Prediction at Industrial_Land_Use = 0.1
pred_high_il <- summary_il[2, 1] # Prediction at Industrial_Land_Use = 0.9

# Step 9: Compute First Difference
fd_il <- pred_high_il - pred_low_il

# ==============================
# Display Results
# ==============================

print(paste("Prediction at Low_Vegetative_Cover = 0.1:", round(pred_low_lv, 3)))
## [1] "Prediction at Low_Vegetative_Cover = 0.1: 0.152"
print(paste("Prediction at Low_Vegetative_Cover = 0.9:", round(pred_high_lv, 3)))
## [1] "Prediction at Low_Vegetative_Cover = 0.9: 0.275"
print(paste("First Difference (0.9 - 0.1) for Low_Vegetative_Cover:", round(fd_lv, 3)))
## [1] "First Difference (0.9 - 0.1) for Low_Vegetative_Cover: 0.123"
print(paste("Prediction at Industrial_Land_Use = 0.1:", round(pred_low_il, 3)))
## [1] "Prediction at Industrial_Land_Use = 0.1: 0.206"
print(paste("Prediction at Industrial_Land_Use = 0.9:", round(pred_high_il, 3)))
## [1] "Prediction at Industrial_Land_Use = 0.9: 0.269"
print(paste("First Difference (0.9 - 0.1) for Industrial_Land_Use:", round(fd_il, 3)))
## [1] "First Difference (0.9 - 0.1) for Industrial_Land_Use: 0.063"
# ==============================
# Plot Results
# ==============================

# Load ggplot2 for visualization
library(ggplot2)

# Create a dataset for plotting
plot_data <- data.frame(
  Variable = c("Low Vegetative Cover", "Industrial Land Use"),
  Prediction_Low = c(pred_low_lv, pred_low_il),
  Prediction_High = c(pred_high_lv, pred_high_il),
  First_Difference = c(fd_lv, fd_il)
)

# Plot Predictions
ggplot(plot_data, aes(x = Variable, y = First_Difference)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "First Differences for Predictors",
       y = "Estimated First Difference",
       x = "Predictor Variable") +
  geom_text(aes(label = round(First_Difference, 3)), vjust = -0.5)

The above graph shows the first differences for the predictors Low Vegetative Cover and Industrial Land Use. The x-axis represents the predictor variables, and the y-axis represents the estimated first differences. The bars indicate the change in predicted values of Percentile_Rank_Combined_NYC when moving from a low to a high value of each predictor. The graph helps to visualize the impact of these predictors on the dependent variable, with Industrial Land Use showing a more negative effect compared to Low Vegetative Cover.

# Print the summary to see its structure
summary(sim_pred_lv)
##                            Estimate 2.5 % 97.5 %
## Low_Vegetative_Cover = 0.1    0.152 0.132  0.171
## Low_Vegetative_Cover = 0.9    0.275 0.258  0.291

The summary(sim_pred_lv) function call is intended to provide a summary of the simulated predictions for the Low_Vegetative_Cover variable. The output will typically include the predicted values at different levels of Low_Vegetative_Cover, along with confidence intervals or other relevant statistics. This summary helps to understand how changes in Low_Vegetative_Cover impact the predicted values of Percentile_Rank_Combined_NYC. The results show the predicted values for Percentile_Rank_Combined_NYC at different levels of Low_Vegetative_Cover, allowing for a comparison of the impact of this variable on the dependent variable.