Project 2

Author

R Josue

Is New York City safe for pedestrians?

NYC car crash. <https://www.segalandlax.com/nyc-car-crash-guide/>

Introduction / Goal:

For my DATA 110 final project, I am using the Motor Vehicle Collisions - Crashes dataset from NYC Open Data. The data source is the City of New York / NYC Open Data, and the crash information comes from police-reported motor vehicle collision records from the NYPD. Each row represents one motor vehicle crash in New York City.

The variables I will use include borough, vehicle type code 1, number of pedestrians injured, number of persons injured, number of persons killed, crash hour, and vehicle count. Borough and vehicle type code 1 are categorical variables. Number of pedestrians injured and number of persons injured are quantitative variables. Crash hour and vehicle count are new variables that I create from the crash time and vehicle type code columns.

My main research questions are:

Which borough has the highest number of pedestrian injuries?

Which vehicle types are involved in crashes with the highest number of persons injured?

The data was collected from police collision reports. Police officers complete reports for motor vehicle collisions, and the crash information is published through NYC Open Data. I chose this topic because traffic safety affects many people every day, especially pedestrians and drivers in large cities like New York City.

Data Source:

New York City Police Department (NYPD). Motor Vehicle Collisions – Crashes. NYC Open Data.

https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95

# Load the required libraries. I chose tidyverse because it helps clean, organize, and visualize the data. The lubridate library helps when working with dates and times; tidyverse 2.0.0 already attaches it, so loading it again is optional.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

# Load the dataset. This chunk imports the dataset from a local CSV file.

crashes <- read_csv(
  "/Users/josue/Library/Mobile Documents/com~apple~CloudDocs/Downloads/Data110_Summer/Motor_Vehicle_Collisions_-_Crashes.csv"
)

Rows: 2269187 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time  (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
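
The import message above suggests specifying the column types. A minimal sketch of that idea, assuming a tiny inline CSV (`sample_csv`, made up for illustration) in place of the real file: `CRASH DATE` is parsed as a date at import time, so it no longer arrives as a character column.

```r
library(readr)

# Hypothetical three-row stand-in for the real CSV file.
sample_csv <- I("CRASH DATE,CRASH TIME,BOROUGH
09/11/2021,2:39,
11/01/2023,1:29,BROOKLYN
04/26/2023,13:30,QUEENS
")

crashes_sample <- read_csv(
  sample_csv,
  col_types = cols(
    `CRASH DATE` = col_date(format = "%m/%d/%Y"),  # month/day/year, as in the dataset
    `CRASH TIME` = col_time(),
    BOROUGH = col_character()
  )
)

crashes_sample
```

With the date parsed on import, a later `mdy()` call on that column would not be needed. The rest of this report keeps the original approach.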

# Explore the dataset. This helps me examine the variables and understand the structure of the dataset. The functions below show me the first few rows, variable names, and data types.

head(crashes)

# A tibble: 6 × 29
  `CRASH DATE` `CRASH TIME` BOROUGH  `ZIP CODE` LATITUDE LONGITUDE LOCATION     
  <chr>        <time>       <chr>         <dbl>    <dbl>     <dbl> <chr>        
1 09/11/2021   02:39        <NA>             NA     NA        NA   <NA>         
2 03/26/2022   11:45        <NA>             NA     NA        NA   <NA>         
3 11/01/2023   01:29        BROOKLYN      11230     40.6     -74.0 (40.62179, -…
4 06/29/2022   06:55        <NA>             NA     NA        NA   <NA>         
5 09/21/2022   13:21        <NA>             NA     NA        NA   <NA>         
6 04/26/2023   13:30        <NA>             NA     NA        NA   <NA>         
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
#   `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
#   `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
#   `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
#   `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
#   `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>,
#   `CONTRIBUTING FACTOR VEHICLE 2` <chr>, …

names(crashes)

 [1] "CRASH DATE"                    "CRASH TIME"                   
 [3] "BOROUGH"                       "ZIP CODE"                     
 [5] "LATITUDE"                      "LONGITUDE"                    
 [7] "LOCATION"                      "ON STREET NAME"               
 [9] "CROSS STREET NAME"             "OFF STREET NAME"              
[11] "NUMBER OF PERSONS INJURED"     "NUMBER OF PERSONS KILLED"     
[13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED" 
[15] "NUMBER OF CYCLIST INJURED"     "NUMBER OF CYCLIST KILLED"     
[17] "NUMBER OF MOTORIST INJURED"    "NUMBER OF MOTORIST KILLED"    
[19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
[21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
[23] "CONTRIBUTING FACTOR VEHICLE 5" "COLLISION_ID"                 
[25] "VEHICLE TYPE CODE 1"           "VEHICLE TYPE CODE 2"          
[27] "VEHICLE TYPE CODE 3"           "VEHICLE TYPE CODE 4"          
[29] "VEHICLE TYPE CODE 5"

glimpse(crashes)

Rows: 2,269,187
Columns: 29
$ `CRASH DATE`                    <chr> "09/11/2021", "03/26/2022", "11/01/202…
$ `CRASH TIME`                    <time> 02:39:00, 11:45:00, 01:29:00, 06:55:0…
$ BOROUGH                         <chr> NA, NA, "BROOKLYN", NA, NA, NA, NA, NA…
$ `ZIP CODE`                      <dbl> NA, NA, 11230, NA, NA, NA, NA, NA, NA,…
$ LATITUDE                        <dbl> NA, NA, 40.62179, NA, NA, NA, NA, NA, …
$ LONGITUDE                       <dbl> NA, NA, -73.97002, NA, NA, NA, NA, NA,…
$ LOCATION                        <chr> NA, NA, "(40.62179, -73.970024)", NA, …
$ `ON STREET NAME`                <chr> "WHITESTONE EXPRESSWAY", "QUEENSBORO B…
$ `CROSS STREET NAME`             <chr> "20 AVENUE", NA, "AVENUE K", NA, NA, N…
$ `OFF STREET NAME`               <chr> NA, NA, NA, NA, NA, NA, NA, NA, "61   …
$ `NUMBER OF PERSONS INJURED`     <dbl> 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF PERSONS KILLED`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF PEDESTRIANS INJURED` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF PEDESTRIANS KILLED`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF CYCLIST INJURED`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF CYCLIST KILLED`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF MOTORIST INJURED`    <dbl> 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF MOTORIST KILLED`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `CONTRIBUTING FACTOR VEHICLE 1` <chr> "Aggressive Driving/Road Rage", "Pavem…
$ `CONTRIBUTING FACTOR VEHICLE 2` <chr> "Unspecified", NA, "Unspecified", "Uns…
$ `CONTRIBUTING FACTOR VEHICLE 3` <chr> NA, NA, "Unspecified", NA, NA, NA, NA,…
$ `CONTRIBUTING FACTOR VEHICLE 4` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `CONTRIBUTING FACTOR VEHICLE 5` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ COLLISION_ID                    <dbl> 4455765, 4513547, 4675373, 4541903, 45…
$ `VEHICLE TYPE CODE 1`           <chr> "Sedan", "Sedan", "Moped", "Sedan", "S…
$ `VEHICLE TYPE CODE 2`           <chr> "Sedan", NA, "Sedan", "Pick-up Truck",…
$ `VEHICLE TYPE CODE 3`           <chr> NA, NA, "Sedan", NA, NA, NA, NA, NA, N…
$ `VEHICLE TYPE CODE 4`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `VEHICLE TYPE CODE 5`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

summary(crashes$`NUMBER OF PERSONS INJURED`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
 0.0000  0.0000  0.0000  0.3333  0.0000 43.0000      18

summary(crashes$`NUMBER OF PEDESTRIANS INJURED`)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00000  0.00000  0.00000  0.06073  0.00000 27.00000

# This chunk selects the variables I need.
crashes_small <- crashes %>%
  select(
    `CRASH DATE`,
    `CRASH TIME`,
    BOROUGH,
    `NUMBER OF PERSONS INJURED`,
    `NUMBER OF PEDESTRIANS INJURED`,
    `NUMBER OF CYCLIST INJURED`,
    `NUMBER OF MOTORIST INJURED`,
    `VEHICLE TYPE CODE 1`,
    `VEHICLE TYPE CODE 2`
  )

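# Create new variables using mutate. This chunk creates crash date, crash year, crash hour, and vehicle count. Vehicle count is 1 when only the first vehicle type is recorded and 2 when a second vehicle type is also recorded, so a value of 2 means two or more vehicles.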

crashes_clean <- crashes_small %>%
  mutate(
    crash_date = mdy(`CRASH DATE`),
    crash_year = year(crash_date),
    crash_hour = as.numeric(substr(`CRASH TIME`, 1, 2)),
    vehicle_count = ifelse(is.na(`VEHICLE TYPE CODE 2`), 1, 2)
  )
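
The vehicle count above only distinguishes one vehicle from two or more. As a possible alternative, here is a small sketch that counts the non-missing entries across all five vehicle type columns. The toy rows mirror the first three crashes shown by `glimpse()`. It assumes vehicle type codes 3 to 5 are also kept in the selection step, which this report does not do.

```r
# Column names as they appear in the dataset.
type_cols <- paste("VEHICLE TYPE CODE", 1:5)

# Toy data for illustration only (three crashes).
toy <- data.frame(
  c("Sedan", "Sedan", "Moped"),
  c("Sedan", NA, "Sedan"),
  c(NA, NA, "Sedan"),
  c(NA_character_, NA, NA),
  c(NA_character_, NA, NA)
)
names(toy) <- type_cols

# Count how many vehicle type columns are filled in for each crash.
toy$vehicle_count <- unname(rowSums(!is.na(toy[type_cols])))

toy$vehicle_count
```

A count built this way can reach 5, so the regression coefficient for vehicle count would be read as the change per additional recorded vehicle.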

# Filter the data. This chunk keeps only crashes from 2024 and removes rows with missing values in the variables I use.

crashes_2024 <- crashes_clean %>%
  filter(crash_year == 2024) %>%
  filter(!is.na(BOROUGH)) %>%
  filter(!is.na(`VEHICLE TYPE CODE 1`)) %>%
  filter(!is.na(`NUMBER OF PERSONS INJURED`)) %>%
  filter(!is.na(`NUMBER OF PEDESTRIANS INJURED`)) %>%
  filter(!is.na(crash_hour))

Research Question 1

Which borough has the highest number of pedestrian injuries?

# Summarize pedestrian injuries by borough in 2024. This chunk groups the data by borough and calculates the total pedestrian injuries. It also shows the total crashes in each borough.

borough_pedestrians <- crashes_2024 %>%
  group_by(BOROUGH) %>%
  summarize(
    total_pedestrian_injuries = sum(`NUMBER OF PEDESTRIANS INJURED`),
    total_crashes = n()
  ) %>%
  arrange(desc(total_pedestrian_injuries))

borough_pedestrians

# A tibble: 5 × 3
  BOROUGH       total_pedestrian_injuries total_crashes
  <chr>                             <dbl>         <int>
1 BROOKLYN                           2281         22361
2 QUEENS                             1742         17502
3 MANHATTAN                          1463         11664
4 BRONX                              1052          9751
5 STATEN ISLAND                       214          2660
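
Totals favor boroughs with more crashes, so a per-crash rate can be a useful complement. The sketch below recomputes it from the totals in the table above; the column name `injuries_per_100_crashes` is my own choice for illustration.

```r
library(dplyr)

# Totals copied from the borough_pedestrians table above.
borough_totals <- data.frame(
  BOROUGH = c("BROOKLYN", "QUEENS", "MANHATTAN", "BRONX", "STATEN ISLAND"),
  total_pedestrian_injuries = c(2281, 1742, 1463, 1052, 214),
  total_crashes = c(22361, 17502, 11664, 9751, 2660)
)

# Pedestrian injuries per 100 reported crashes, highest first.
borough_rates <- borough_totals %>%
  mutate(
    injuries_per_100_crashes = 100 * total_pedestrian_injuries / total_crashes
  ) %>%
  arrange(desc(injuries_per_100_crashes))

borough_rates
```

On these totals, Manhattan ranks first per crash (about 12.5 pedestrian injuries per 100 crashes) even though Brooklyn has the most pedestrian injuries in total.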

# Visualization 1. This graph compares the total pedestrian injuries by borough. The size of the point shows the number of crashes.

borough_pedestrians %>%
  ggplot(aes(
    x = BOROUGH,
    y = total_pedestrian_injuries,
    color = BOROUGH,
    size = total_crashes
  )) +
  geom_point() +
  scale_color_manual(values = c(
    "BRONX" = "red",
    "BROOKLYN" = "orange",
    "MANHATTAN" = "blue",
    "QUEENS" = "green",
    "STATEN ISLAND" = "purple"
  )) +
  labs(
    title = "Total Pedestrian Injuries by Borough",
    subtitle = "NYC Motor Vehicle Collisions in 2024",
    x = "Borough",
    y = "Total Number of Pedestrians Injured",
    color = "Borough",
    size = "Number of Crashes",
    caption = "Source: NYC Open Data / NYPD Motor Vehicle Collisions"
  ) +
  theme_bw()

Visualization 1

Description: This graph compares the total number of pedestrian injuries in each borough. Larger points represent boroughs with more reported crashes.

Multiple Linear Regression

# This chunk creates a multiple linear regression model. The response variable is the number of persons injured, and the predictor variables are the crash hour, borough, and number of vehicles involved. This allows us to examine whether these factors are associated with the number of people injured in a crash.

injury_model <- lm(
  `NUMBER OF PERSONS INJURED` ~
    crash_hour +
    vehicle_count +
    BOROUGH,
  data = crashes_2024
)

# Display the regression results
summary(injury_model)


Call:
lm(formula = `NUMBER OF PERSONS INJURED` ~ crash_hour + vehicle_count + 
    BOROUGH, data = crashes_2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7426 -0.5689 -0.4208  0.4456 16.3799 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.2085382  0.0153286  13.605  < 2e-16 ***
crash_hour            0.0090269  0.0005158  17.500  < 2e-16 ***
vehicle_count         0.1632002  0.0067340  24.235  < 2e-16 ***
BOROUGHBROOKLYN      -0.0231617  0.0098356  -2.355   0.0185 *  
BOROUGHMANHATTAN     -0.1159149  0.0111212 -10.423  < 2e-16 ***
BOROUGHQUEENS        -0.0406556  0.0102427  -3.969 7.22e-05 ***
BOROUGHSTATEN ISLAND -0.1104947  0.0177284  -6.233 4.61e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8104 on 63931 degrees of freedom
Multiple R-squared:  0.01589,   Adjusted R-squared:  0.0158 
F-statistic: 172.1 on 6 and 63931 DF,  p-value: < 2.2e-16

Regression Equation

Number of Persons Injured = β0 + β1(Crash Hour) + β2(Vehicle Count) + β3(Brooklyn) + β4(Manhattan) + β5(Queens) + β6(Staten Island) + ε

Number of Persons Injured = Intercept + crash hour + vehicle count + borough + error. Borough is categorical, so R fits it as four indicator variables and uses the Bronx as the reference level. The p-values tell whether each variable is statistically significant. The adjusted R-squared value tells how much variation in injuries is explained by the model.
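
For reference, the fitted equation can be written out with the estimates from the `summary(injury_model)` output, rounded to four decimals. The indicator symbols D are my own notation: each equals 1 for a crash in that borough and 0 otherwise, and the Bronx is the omitted reference level.

```latex
\[
\begin{aligned}
\widehat{\text{Persons Injured}} ={}& 0.2085 + 0.0090\,\text{CrashHour} + 0.1632\,\text{VehicleCount} \\
& - 0.0232\,D_{\text{Brooklyn}} - 0.1159\,D_{\text{Manhattan}}
  - 0.0407\,D_{\text{Queens}} - 0.1105\,D_{\text{Staten Island}}
\end{aligned}
\]

% Worked example: a crash at hour 17 in Manhattan with vehicle count 2
\[
0.2085 + 0.0090269(17) + 0.1632(2) - 0.1159 \approx 0.57
\]
```

So the model predicts a little over half a person injured, on average, for a 5 p.m. crash in Manhattan with a second vehicle recorded.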

Regression Interpretation

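The regression model examines whether crash hour, vehicle count, and borough are associated with the number of persons injured. The coefficient estimates show the direction and size of each relationship, holding the other predictors constant: each later hour of the day is associated with about 0.009 more persons injured, and a crash with a second vehicle recorded is associated with about 0.16 more persons injured than a single-vehicle crash. The borough coefficients are all negative, which means that Brooklyn, Manhattan, Queens, and Staten Island each have slightly fewer expected injuries per crash than the Bronx, the reference borough. All of the p-values are below 0.05, so each predictor is statistically significant. However, the adjusted R² is only 0.0158, so the model explains about 1.6% of the variation in the number of persons injured.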
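
The quantities discussed above can also be pulled out of a model summary directly instead of being read off the printed output. This is a minimal sketch that uses the built-in `mtcars` data as a stand-in, since the crash data is not bundled here. The names `demo_model`, `coef_table`, and `adj_r2` are illustrative.

```r
# mtcars is a built-in dataset used only as a stand-in for crashes_2024.
demo_model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
demo_summary <- summary(demo_model)

# Matrix of estimates, standard errors, t values, and p-values.
coef_table <- coef(demo_summary)

# Adjusted R-squared as a single number.
adj_r2 <- demo_summary$adj.r.squared

coef_table[, c("Estimate", "Pr(>|t|)")]
round(adj_r2, 3)
```

Applied to `injury_model`, the same two calls would return the coefficient table and the adjusted R² shown in the regression output.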

Research Question 2

Which vehicle types are involved in crashes with the highest total number of persons injured?

# Summarize injuries by vehicle type. This chunk groups the data by vehicle type, calculates the total injuries and total crashes, and keeps the ten vehicle types with the most injuries.

vehicle_injuries <- crashes_2024 %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  summarize(
    total_injuries = sum(`NUMBER OF PERSONS INJURED`),
    total_crashes = n()
  ) %>%
  arrange(desc(total_injuries)) %>%
  slice(1:10)
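
The summary above ranks vehicle types by total injuries, which tends to favor the most common vehicle types. A per-crash average gives a complementary view. The sketch below uses made-up toy rows (only the column names come from the dataset) to show how the two rankings can differ.

```r
library(dplyr)

# Hypothetical toy rows for illustration; not real crash records.
toy_crashes <- data.frame(
  `VEHICLE TYPE CODE 1` = c("Sedan", "Sedan", "Sedan", "Sedan", "Sedan", "Sedan",
                            "Moped", "Moped"),
  `NUMBER OF PERSONS INJURED` = c(0, 1, 0, 1, 1, 1, 1, 2),
  check.names = FALSE
)

vehicle_averages <- toy_crashes %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  summarize(
    total_injuries = sum(`NUMBER OF PERSONS INJURED`),
    total_crashes = n(),
    mean_injuries = mean(`NUMBER OF PERSONS INJURED`)
  ) %>%
  arrange(desc(mean_injuries))

vehicle_averages
```

In the toy data, sedans have more injuries in total, but mopeds have more injuries per crash. On the real data, averages for rare vehicle types rest on very few crashes, so it would be reasonable to keep only types above some minimum crash count before ranking.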

Visualization 2

# Visualization 2. This graph shows the ten vehicle types (vehicle type code 1) with the highest total number of persons injured in 2024.

vehicle_injuries %>%
  ggplot(aes(
    reorder(`VEHICLE TYPE CODE 1`, total_injuries),
    total_injuries,
    fill = `VEHICLE TYPE CODE 1`
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Total Number of Persons Injured by Vehicle Type",
    subtitle = "Ten Vehicle Types with the Highest Total Injuries in 2024",
    x = "Vehicle Type",
    y = "Total Persons Injured",
    fill = "Vehicle Type",
    caption = "Source: NYC Open Data / NYPD Motor Vehicle Collisions"
  ) +
  theme_bw()

Visualization 2 Explanation

This graph compares the total number of people injured by vehicle type. Vehicle types with longer bars have more people injured.

Outside Research

New York City has a traffic safety program called Vision Zero. The goal of Vision Zero is to reduce traffic deaths and serious injuries. This connects to my project because this dataset helps show patterns in crash injuries by borough and vehicle type.

Final Reflection

The first visualization shows which borough has the highest total number of pedestrian injuries; in 2024 that was Brooklyn, with 2,281. The second visualization shows which vehicle types have the highest total number of persons injured. One interesting pattern is that the number of injuries per crash is usually small (the mean across the full dataset is about 0.33 persons injured per crash), but some boroughs and vehicle types still have much higher injury totals.