Data 712 Homework 3

For this course, I would like to continue my analysis of the 2023 New York City Disadvantaged Neighborhoods data set, which contains information on the neighborhoods in New York City that are considered disadvantaged. The data was collected by New York’s Climate Justice Working Group (NYCJWG). The data set is public and can be found at the following link: https://data.ny.gov/Energy-Environment/Final-Disadvantaged-Communities-DAC-2023/2e6c-s6fp/about_data

rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 527825 28.2    1177878   63   660385 35.3
## Vcells 960138  7.4    8388608   64  1770057 13.6
# Set working directory
directory <- "C:/Users/mikem/iCloudDrive/Spring 2025/DATA 712/Data sets"
setwd(directory)

# Set seed for reproducibility
set.seed(123)

neighborhoods <- read.csv("2023_NY_Disadvantaged Neighborhoods.csv", stringsAsFactors = FALSE)

There are several questions I would like to answer by analyzing this data set. These questions are part of a larger analysis I am working on that examines the correlation between disadvantaged neighborhoods and gun violence in New York City. I would like to examine several variables within the 2023 New York City Disadvantaged Neighborhoods data set that I believe will give me a deeper understanding of the factors that contribute to the high rates of gun violence in these neighborhoods. These variables are:

I am fairly familiar with the NYPD data set on gun violence, but the 2023 New York City Disadvantaged Neighborhoods data set is new to me. I would like to work with this data set throughout the course to become better versed in its variables and how they can help me answer my research question.

Pivoting Data Exercise

I will now pivot the selected columns to wide format with pivot_wider(). I will then reshape the same columns to long format with pivot_longer() and compare the dimensions of the two results to check that no values were dropped.

# Load necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Check column names to ensure correct selection
colnames(neighborhoods)
##  [1] "the_geom"                           "GEOID"                             
##  [3] "DAC_Designation"                    "REDC"                              
##  [5] "County"                             "City_Town"                         
##  [7] "NYC_Region"                         "Urban_Rural"                       
##  [9] "Tribal_Designation"                 "Household_Low_Count_Flag"          
## [11] "Population_Count"                   "Household_Count"                   
## [13] "Percentile_Rank_Combined_Statewide" "Percentile_Rank_Combined_NYC"      
## [15] "Percentile_Rank_Combined_ROS"       "Combined_Score"                    
## [17] "Burden_Score_Percentile"            "Vulnerability_Score_Percentile"    
## [19] "Burden_Score"                       "Vulnerability_Score"               
## [21] "Benzene_Concentration"              "Particulate_Matter_25"             
## [23] "Traffic_Truck_Highways"             "Traffic_Number_Vehicles"           
## [25] "Wastewater_Discharge"               "Housing_Vacancy_Rate"              
## [27] "Industrial_Land_Use"                "Landfills"                         
## [29] "Oil_Storage"                        "Municipal_Waste_Combustors"        
## [31] "Power_Generation_Facilities"        "RMP_Sites"                         
## [33] "Remediation_Sites"                  "Scrap_Metal_Processing"            
## [35] "Agricultural_Land_Use"              "Coastal_Flooding_Storm_Risk"       
## [37] "Days_Above_90_Degrees_2050"         "Drive_Time_Healthcare"             
## [39] "Inland_Flooding_Risk"               "Low_Vegetative_Cover"              
## [41] "Asian_Percent"                      "Black_African_American_Percent"    
## [43] "Redlining_Updated"                  "Latino_Percent"                    
## [45] "English_Proficiency"                "Native_Indigenous"                 
## [47] "LMI_80_AMI"                         "LMI_Poverty_Federal"               
## [49] "Population_No_College"              "Household_Single_Parent"           
## [51] "Unemployment_Rate"                  "Asthma_ED_Rate"                    
## [53] "COPD_ED_Rate"                       "Households_Disabled"               
## [55] "Low_Birth_Weight"                   "MI_Hospitalization_Rate"           
## [57] "Health_Insurance_Rate"              "Age_Over_65"                       
## [59] "Premature_Deaths"                   "Internet_Access"                   
## [61] "Home_Energy_Affordability"          "Homes_Built_Before_1960"           
## [63] "Mobile_Homes"                       "Rent_Percent_Income"               
## [65] "Renter_Percent"
# Select relevant columns
neighborhoods_selected <- neighborhoods %>%
  select(GEOID, Percentile_Rank_Combined_NYC, Unemployment_Rate)

# Convert from LONG to WIDE format
wide_neighborhoods <- neighborhoods_selected %>%
  pivot_wider(names_from = GEOID, values_from = c(Percentile_Rank_Combined_NYC, Unemployment_Rate))

# Print wide format data
print("Wide Format Data:")
## [1] "Wide Format Data:"
print(wide_neighborhoods, digits = 10)
## # A tibble: 1 × 9,836
##   Percentile_Rank_Combined_NYC_3…¹ Percentile_Rank_Comb…² Percentile_Rank_Comb…³
##                              <dbl>                  <dbl>                  <dbl>
## 1                            0.386                  0.112                  0.634
## # ℹ abbreviated names: ¹​Percentile_Rank_Combined_NYC_36081044800,
## #   ²​Percentile_Rank_Combined_NYC_36081045800,
## #   ³​Percentile_Rank_Combined_NYC_36081046200
## # ℹ 9,833 more variables: Percentile_Rank_Combined_NYC_36081046300 <dbl>,
## #   Percentile_Rank_Combined_NYC_36081046400 <dbl>,
## #   Percentile_Rank_Combined_NYC_36081046500 <dbl>,
## #   Percentile_Rank_Combined_NYC_36081046600 <dbl>, …
# Pivot the selected columns to LONG format
# (note: this starts from neighborhoods_selected, not from wide_neighborhoods)
long_neighborhoods <- neighborhoods_selected %>%
  pivot_longer(cols = c(Percentile_Rank_Combined_NYC, Unemployment_Rate), 
               names_to = "Variable", 
               values_to = "Value")

# Print converted long format data
print("Converted Back to Long Format Data:")
## [1] "Converted Back to Long Format Data:"
print(long_neighborhoods, digits = 10)
## # A tibble: 9,836 × 3
##          GEOID Variable                      Value
##          <dbl> <chr>                         <dbl>
##  1 36081044800 Percentile_Rank_Combined_NYC 0.386 
##  2 36081044800 Unemployment_Rate            0.787 
##  3 36081045800 Percentile_Rank_Combined_NYC 0.112 
##  4 36081045800 Unemployment_Rate            0.711 
##  5 36081046200 Percentile_Rank_Combined_NYC 0.634 
##  6 36081046200 Unemployment_Rate            0.577 
##  7 36081046300 Percentile_Rank_Combined_NYC 0.575 
##  8 36081046300 Unemployment_Rate            0.0616
##  9 36081046400 Percentile_Rank_Combined_NYC 0.173 
## 10 36081046400 Unemployment_Rate            0.876 
## # ℹ 9,826 more rows

As seen above, I started by selecting the relevant columns from the data set: GEOID, Percentile_Rank_Combined_NYC, and Unemployment_Rate. I chose these two variables because I believe they will be important for answering my research question. I then used pivot_wider() to spread the data into wide format, which produced a single row with 9,836 columns (one column per GEOID for each of the two variables). Next, I used pivot_longer() on the selected columns to produce the long format, which has 9,836 rows (two rows per GEOID). Both results are consistent with the 4,918 rows in the data set (4,918 × 2 = 9,836), which suggests that no values were dropped in either direction. Because the long table was built from neighborhoods_selected rather than from wide_neighborhoods, this is a consistency check on the dimensions rather than a strict round trip. I used print() to inspect both results.
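
A strict round trip can be checked by reshaping a table and then reshaping the result back. The following is a minimal sketch on a toy table; the object names toy, toy_long, and toy_back and all values in them are hypothetical stand-ins, not rows from the data set.

```r
library(tidyr)

# Toy stand-in for neighborhoods_selected (hypothetical values, one row per GEOID)
toy <- tibble::tibble(
  GEOID = c("36081000100", "36081000200", "36081000300"),
  Percentile_Rank_Combined_NYC = c(0.10, 0.50, 0.90),
  Unemployment_Rate = c(0.20, 0.40, 0.60)
)

# One row per (GEOID, variable) pair
toy_long <- toy %>%
  pivot_longer(cols = -GEOID, names_to = "Variable", values_to = "Value")

# Rebuild one column per variable from the long table
toy_back <- toy_long %>%
  pivot_wider(names_from = Variable, values_from = Value)

# Round-trip check: the rebuilt table should match the original
stopifnot(isTRUE(all.equal(as.data.frame(toy), as.data.frame(toy_back))))
```

If the two tables differ, stopifnot() raises an error, which makes the check easy to leave in a script.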

Part 3

I will now use the glimpse() function to get a better understanding of the data set. It reports the number of observations and variables, the data type of each variable, and the first few values of each column, which gives a quick overview of what the data set contains.

dplyr::glimpse(neighborhoods)
## Rows: 4,918
## Columns: 65
## $ the_geom                           <chr> "MULTIPOLYGON (((-73.80645699999998…
## $ GEOID                              <dbl> 36081044800, 36081045800, 360810462…
## $ DAC_Designation                    <chr> "Not Designated as DAC", "Not Desig…
## $ REDC                               <chr> "New York City", "New York City", "…
## $ County                             <chr> "Queens", "Queens", "Queens", "Quee…
## $ City_Town                          <chr> "New York city", "New York city", "…
## $ NYC_Region                         <chr> "NYC", "NYC", "NYC", "NYC", "NYC", …
## $ Urban_Rural                        <chr> "urban", "urban", "urban", "urban",…
## $ Tribal_Designation                 <chr> "No", "No", "No", "No", "No", "No",…
## $ Household_Low_Count_Flag           <chr> "No", "No", "No", "No", "No", "No",…
## $ Population_Count                   <int> 2809, 2348, 6907, 3836, 1999, 4172,…
## $ Household_Count                    <int> 730, 693, 2079, 1036, 528, 972, 135…
## $ Percentile_Rank_Combined_Statewide <dbl> 0.602427, 0.341076, 0.772128, 0.733…
## $ Percentile_Rank_Combined_NYC       <dbl> 0.386342, 0.112225, 0.634193, 0.574…
## $ Percentile_Rank_Combined_ROS       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Combined_Score                     <dbl> 85.47589, 72.21788, 95.02802, 92.60…
## $ Burden_Score_Percentile            <dbl> 0.168528, 0.230976, 0.298594, 0.691…
## $ Vulnerability_Score_Percentile     <dbl> 0.761875, 0.444444, 0.858339, 0.706…
## $ Burden_Score                       <dbl> 25.22720, 26.83213, 28.36868, 36.03…
## $ Vulnerability_Score                <dbl> 60.24869, 45.38575, 66.65935, 56.57…
## $ Benzene_Concentration              <dbl> 0.6373717, 0.6301848, 0.6215606, 0.…
## $ Particulate_Matter_25              <dbl> 0.5609153, 0.5501958, 0.5460730, 0.…
## $ Traffic_Truck_Highways             <dbl> 0.4612870, 0.6496425, 0.6512768, 0.…
## $ Traffic_Number_Vehicles            <dbl> 0.6512262, 0.8700482, 0.7283588, 0.…
## $ Wastewater_Discharge               <dbl> 0.0000000, 0.0000000, 0.0000000, 0.…
## $ Housing_Vacancy_Rate               <dbl> 0.5493945, 0.2806458, 0.4310601, 0.…
## $ Industrial_Land_Use                <dbl> 0.00000000, 0.00000000, 0.00000000,…
## $ Landfills                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Oil_Storage                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Municipal_Waste_Combustors         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Power_Generation_Facilities        <dbl> 0.0000000, 0.0000000, 0.0000000, 0.…
## $ RMP_Sites                          <dbl> 37.76036, 34.60507, 37.05919, 30.45…
## $ Remediation_Sites                  <dbl> 0.000000, 0.000000, 0.000000, 0.000…
## $ Scrap_Metal_Processing             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Agricultural_Land_Use              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Coastal_Flooding_Storm_Risk        <dbl> 0.0000000, 0.0000000, 0.0000000, 0.…
## $ Days_Above_90_Degrees_2050         <dbl> 0.4478827, 0.6254072, 0.4393322, 0.…
## $ Drive_Time_Healthcare              <dbl> 0.15778730, 0.18105736, 0.20820576,…
## $ Inland_Flooding_Risk               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Low_Vegetative_Cover               <dbl> 0.7261146, 0.5844498, 0.9611245, 0.…
## $ Asian_Percent                      <dbl> 0.9846598, 0.9886616, 0.9588706, 0.…
## $ Black_African_American_Percent     <dbl> 0.34791038, 0.63981043, 0.67923309,…
## $ Redlining_Updated                  <dbl> 0.6235643, 0.1645054, 0.6254168, 0.…
## $ Latino_Percent                     <dbl> 0.7746568, 0.4162619, 0.8306230, 0.…
## $ English_Proficiency                <dbl> 0.7830882, 0.8034314, 0.8093137, 0.…
## $ Native_Indigenous                  <dbl> 0.82578300, 0.77964206, 0.55844519,…
## $ LMI_80_AMI                         <dbl> 69.84094, 54.24864, 86.41691, 88.94…
## $ LMI_Poverty_Federal                <dbl> 0.8505338, 0.2302700, 0.4249529, 0.…
## $ Population_No_College              <dbl> 0.4161088, 0.1251046, 0.5951883, 0.…
## $ Household_Single_Parent            <dbl> 0.44869043, 0.00000000, 0.67883212,…
## $ Unemployment_Rate                  <dbl> 0.78675079, 0.71104101, 0.57707676,…
## $ Asthma_ED_Rate                     <dbl> 0.6573471, 0.5778852, 0.8023965, 0.…
## $ COPD_ED_Rate                       <dbl> 0.33634644, 0.26382174, 0.50515030,…
## $ Households_Disabled                <dbl> 0.007949791, 0.024686192, 0.3102510…
## $ Low_Birth_Weight                   <dbl> 0.8142648, 0.8228404, 0.9138256, 0.…
## $ MI_Hospitalization_Rate            <dbl> 0.64028626, 0.38602399, 0.90191539,…
## $ Health_Insurance_Rate              <dbl> 0.9080605, 0.6389589, 0.8969353, 0.…
## $ Age_Over_65                        <dbl> 0.0428960, 0.2456581, 0.4036409, 0.…
## $ Premature_Deaths                   <dbl> 0.4929577, 0.1816271, 0.8183729, 0.…
## $ Internet_Access                    <dbl> 0.4153005, 0.2244641, 0.7391761, 0.…
## $ Home_Energy_Affordability          <dbl> 0.9403516, 0.5849728, 0.8252407, 0.…
## $ Homes_Built_Before_1960            <dbl> 0.5855649, 0.5736402, 0.2581590, 0.…
## $ Mobile_Homes                       <dbl> 0.00000000, 0.00000000, 0.00000000,…
## $ Rent_Percent_Income                <dbl> 0.7762712, 0.8531780, 0.9406780, 0.…
## $ Renter_Percent                     <dbl> 0.6494975, 0.6224874, 0.7952261, 0.…
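
glimpse() shows types and leading values but not how many values are missing in each column. The following is a small sketch of how missingness could be counted, using a toy data frame with hypothetical values in place of neighborhoods.

```r
# Toy data frame standing in for neighborhoods (hypothetical values)
toy <- data.frame(
  Unemployment_Rate = c(0.2, NA, 0.6, 0.8),
  Age_Over_65       = c(0.1, 0.3, NA, NA),
  County            = c("Queens", "Queens", "Kings", "Bronx")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in the columns a model would use
n_complete <- sum(complete.cases(toy[, c("Unemployment_Rate", "Age_Over_65")]))
print(n_complete)
```
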
library(maxLik)
## Warning: package 'maxLik' was built under R version 4.3.3
## Loading required package: miscTools
## Warning: package 'miscTools' was built under R version 4.3.3
## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
# Define the log-likelihood function for OLS
ols.lf <- function(param) {
  beta <- param[-1]  # Extract beta coefficients
  sigma <- param[1]   # Extract standard deviation (sigma)
  
  y <- as.vector(neighborhoods$Age_Over_65)  # Dependent variable
  x <- cbind(1, as.vector(neighborhoods$Unemployment_Rate))  # Independent variable with intercept
  
  mu <- x %*% beta  # Compute predicted values
  
  # Compute log-likelihood
  log_likelihood <- sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  
  return(log_likelihood)
}

# Perform Maximum Likelihood Estimation (MLE)
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1))

# Display the results
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 0 iterations
## Return code 100: Initial value out of range.
## --------------------------------------------
# Fit a standard OLS model using lm()
# (note: the response here is Percentile_Rank_Combined_NYC, while ols.lf above uses Age_Over_65)
summary(lm(Percentile_Rank_Combined_NYC ~ Unemployment_Rate, data = neighborhoods))
## 
## Call:
## lm(formula = Percentile_Rank_Combined_NYC ~ Unemployment_Rate, 
##     data = neighborhoods)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39498 -0.21928 -0.09539  0.20164  0.90632 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.045056   0.008508   5.296 1.24e-07 ***
## Unemployment_Rate 0.349925   0.014773  23.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2964 on 4778 degrees of freedom
##   (138 observations deleted due to missingness)
## Multiple R-squared:  0.1051, Adjusted R-squared:  0.1049 
## F-statistic: 561.1 on 1 and 4778 DF,  p-value: < 2.2e-16
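
One way to make a maxLik() fit of this kind more robust is to drop incomplete rows before building y and x, and to estimate log(sigma) so that the standard deviation stays positive during optimisation. The sketch below uses simulated data, not the DAC file; the names ols_loglik, fit_mle, and fit_ols, the starting values, and the choice of method = "BFGS" are illustrative assumptions.

```r
library(maxLik)

# Simulated stand-in data (hypothetical; not the DAC file)
set.seed(123)
n <- 500
x <- runif(n)
y <- 0.05 + 0.35 * x + rnorm(n, sd = 0.3)
y[sample(n, 20)] <- NA  # inject missing values on purpose

# Keep only rows where both variables are observed
ok <- complete.cases(x, y)
X <- cbind(1, x[ok])
y_ok <- y[ok]

# Normal log-likelihood; sigma is estimated on the log scale so it stays positive
ols_loglik <- function(param) {
  sigma <- exp(param[1])
  beta  <- param[-1]
  mu    <- as.vector(X %*% beta)
  sum(dnorm(y_ok, mean = mu, sd = sigma, log = TRUE))
}

fit_mle <- maxLik(logLik = ols_loglik,
                  start  = c(log_sigma = 0, intercept = 0, slope = 0),
                  method = "BFGS")
fit_ols <- lm(y_ok ~ X[, 2])

# The MLE coefficients should be close to the lm() estimates
print(coef(fit_mle)[c("intercept", "slope")])
print(coef(fit_ols))
```

For the two fits to be comparable on the real data, the log-likelihood and the lm() formula would need to use the same dependent variable.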

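Analysis of the Regression Results

The maxLik() run above returned code 100 ("Initial value out of range") after 0 iterations, so it produced no estimates. A likely cause is missing values: the lm() output reports 138 observations deleted due to missingness, and an NA in y or x makes the log-likelihood NA at the starting values. The results interpreted below therefore come from the lm() fit, whose dependent variable is Percentile_Rank_Combined_NYC.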

Intercept (0.045056, p < 0.001) For a neighborhood with an Unemployment_Rate of 0, the predicted Percentile_Rank_Combined_NYC is about 0.045. While this is meaningful mathematically, a value of exactly 0 is at best at the edge of the data, so the intercept is better read as a baseline than as a typical neighborhood.

Unemployment Rate Coefficient (0.349925, p < 0.001) A one-unit increase in Unemployment_Rate is associated with an increase of about 0.35 in the predicted Percentile_Rank_Combined_NYC. Because the values of both variables shown by glimpse() lie between 0 and 1, it is easier to read this as follows: a 0.01 increase in Unemployment_Rate is associated with an increase of about 0.0035 in the percentile rank. The relationship is positive, meaning that neighborhoods with higher unemployment tend to have a higher combined percentile rank.
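
The fitted line from the lm() output can be evaluated directly to see what the two coefficients imply. This is a small arithmetic sketch; b0 and b1 are copied from the coefficient table above, and the chosen x values are arbitrary illustration points.

```r
# Coefficients copied from the summary(lm(...)) output
b0 <- 0.045056  # (Intercept)
b1 <- 0.349925  # Unemployment_Rate

# Predicted Percentile_Rank_Combined_NYC at a few Unemployment_Rate values
x_vals <- c(0, 0.5, 1)
pred <- b0 + b1 * x_vals
print(data.frame(Unemployment_Rate = x_vals, predicted = pred))

# Change in the prediction for a 0.01 increase in Unemployment_Rate
step_effect <- b1 * 0.01
print(step_effect)
```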

R-Squared: 0.1051 The R-squared value of 0.1051 indicates that the model explains 10.51% of the variance in Percentile_Rank_Combined_NYC. Unemployment rate on its own therefore leaves most of the variation unexplained, which suggests that other variables not included in the model also influence the percentile rank.

F-statistic: 561.1 (p < 2.2e-16) The F-statistic of 561.1 with a p-value less than 2.2e-16 indicates that the model is statistically significant: the unemployment rate is a significant predictor of the percentile rank. Statistical significance is not the same as a good fit, however. With 4,780 observations used, even a modest relationship is detected with high confidence, and the low R-squared shows that the predictive power of the model is limited.
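
A low R-squared and a very small p-value can occur together when the sample is large. The sketch below illustrates this with simulated data; the slope, noise level, and sample size are hypothetical values chosen to be of the same order as those in the lm() output, not estimates from the data set.

```r
# Simulated data: a modest linear signal plus substantial noise (hypothetical)
set.seed(123)
n <- 4780
x <- runif(n)
y <- 0.05 + 0.35 * x + rnorm(n, sd = 0.3)

fit <- summary(lm(y ~ x))
r2 <- fit$r.squared
p_slope <- fit$coefficients["x", "Pr(>|t|)"]

# R-squared stays small even though the slope is highly significant
print(c(r_squared = r2, p_value = p_slope))
```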

Conclusion

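  • There is a positive, statistically significant association between neighborhood unemployment rate and Percentile_Rank_Combined_NYC.
  • Unemployment rate alone does not fully explain the percentile rank; other factors (housing costs, healthcare, amenities) should be explored.
  • The model explains only 10.51% of the variance in the dependent variable, suggesting that other variables are needed to improve the model’s predictive power.
  • The maximum likelihood run did not produce estimates (return code 100), so it needs to be rerun with the same dependent variable as the lm() model before the two approaches can be compared.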