Project 6

Author

Andrew George

In this project you will apply all the techniques we have studied so far, including mapping.

The following file is the Montgomery County file of traffic crashes by driver. We will be working with this data for the rest of the semester. The point of this project is for you, using the tools at your disposal, to perform an exploratory data analysis of the data and report your findings in a qmd document. This is not intended to be an in-depth analysis; it is simply an exercise in looking at and thinking about the data. In particular, you should be thinking about the questions you might ask and answer. Steps for part 1: Create a new Quarto project in a separate folder on your system. Download the data set below into your project files; you should make a data subdirectory in the project directory to hold this data. Be sure to understand all the data fields, how they are related, and what they can tell you. Write your conclusions and use whatever means you have to substantiate them.

Loading everything in

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(leaflet)

Warning: package 'leaflet' was built under R version 4.3.3

setwd("C:/Users/andre/Downloads/data 101")
moco_crashes <- read_csv("Crash_Reporting_-_Drivers_Data_20240407.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 172105 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (38): Report Number, Agency Name, ACRS Report Type, Crash Date/Time, Rou...
dbl  (5): Local Case Number, Speed Limit, Vehicle Year, Latitude, Longitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
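The parsing warning above is worth a look before trusting the imported columns. Here is a minimal sketch of how `problems()` reports such issues. It uses a small inline CSV that is made up for illustration (it is not the crash file) and contains one value that cannot be parsed as a number.

```r
library(readr)

# Hypothetical inline CSV: the second data row has text in a numeric column
toy_csv <- I("speed_limit,vehicle_year\n35,2015\nunknown,2018\n")

toy <- suppressWarnings(
  read_csv(toy_csv, col_types = cols(.default = col_double()))
)

# problems() lists each value that did not match the expected column type
parse_issues <- problems(toy)
parse_issues

# The unparseable value is read in as NA
toy$speed_limit
```

Running `problems(moco_crashes)` in the same way should show which rows and columns triggered the warning in the real file.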

head(moco_crashes)

# A tibble: 6 × 43
  `Report Number` `Local Case Number` `Agency Name`           `ACRS Report Type`
  <chr>                         <dbl> <chr>                   <chr>             
1 MCP3040003N               190026050 Montgomery County Poli… Property Damage C…
2 EJ78850038                230034791 Gaithersburg Police De… Property Damage C…
3 MCP2009002G               230034583 Montgomery County Poli… Property Damage C…
4 MCP3201004C               230035036 Montgomery County Poli… Property Damage C…
5 MCP23290028               230035152 Montgomery County Poli… Property Damage C…
6 MCP295200DV               230032956 Montgomery County Poli… Property Damage C…
# ℹ 39 more variables: `Crash Date/Time` <chr>, `Route Type` <chr>,
#   `Road Name` <chr>, `Cross-Street Type` <chr>, `Cross-Street Name` <chr>,
#   `Off-Road Description` <chr>, Municipality <chr>,
#   `Related Non-Motorist` <chr>, `Collision Type` <chr>, Weather <chr>,
#   `Surface Condition` <chr>, Light <chr>, `Traffic Control` <chr>,
#   `Driver Substance Abuse` <chr>, `Non-Motorist Substance Abuse` <chr>,
#   `Person ID` <chr>, `Driver At Fault` <chr>, `Injury Severity` <chr>, …

Cleaning the data

names(moco_crashes) <- tolower(names(moco_crashes))
names(moco_crashes) <- gsub(" ","_",names(moco_crashes))
names(moco_crashes) <- gsub("-","_",names(moco_crashes)) 
head(moco_crashes)

# A tibble: 6 × 43
  report_number local_case_number agency_name acrs_report_type `crash_date/time`
  <chr>                     <dbl> <chr>       <chr>            <chr>            
1 MCP3040003N           190026050 Montgomery… Property Damage… 05/31/2019 03:00…
2 EJ78850038            230034791 Gaithersbu… Property Damage… 07/21/2023 05:59…
3 MCP2009002G           230034583 Montgomery… Property Damage… 07/20/2023 03:10…
4 MCP3201004C           230035036 Montgomery… Property Damage… 07/23/2023 12:10…
5 MCP23290028           230035152 Montgomery… Property Damage… 07/24/2023 06:10…
6 MCP295200DV           230032956 Montgomery… Property Damage… 07/11/2023 07:40…
# ℹ 38 more variables: route_type <chr>, road_name <chr>,
#   cross_street_type <chr>, cross_street_name <chr>,
#   off_road_description <chr>, municipality <chr>, related_non_motorist <chr>,
#   collision_type <chr>, weather <chr>, surface_condition <chr>, light <chr>,
#   traffic_control <chr>, driver_substance_abuse <chr>,
#   non_motorist_substance_abuse <chr>, person_id <chr>, driver_at_fault <chr>,
#   injury_severity <chr>, circumstance <chr>, driver_distracted_by <chr>, …
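The output above shows that `crash_date/time` still needs backticks, because the slash survives the three substitutions. As a sketch of one alternative (not the approach used in this project), a single regular expression can replace every run of non-alphanumeric characters at once. The three names below are copied from the original column names.

```r
# A few of the original column names from the data set
raw_names <- c("Report Number", "Crash Date/Time", "Cross-Street Name")

# Lowercase, then collapse every run of non-alphanumeric characters to "_"
clean_names <- gsub("[^a-z0-9]+", "_", tolower(raw_names))
clean_names
```

The same call on `names(moco_crashes)` would give names that can be used without backticks.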

Speculate on how you might use this data and about who might be interested in it.

Most obviously, this dataset should tell us where concentrations of crashes exist and which roads have the most crashes. It would be useful in determining whether weather conditions play a role in crashes, in addition to road conditions and visibility. It could help determine whether the extent of injury can be predicted from the type of crash. It would allow us to see which vehicle types, makes, or models are most often involved in crashes. It would also be useful in seeing where cyclists and pedestrians are often involved in crashes.

The most obvious entity that would appreciate this data set is the MOCO department of transportation, because it offers insight into where the most accident-prone areas are and how crashes there happen. More police presence might then be assigned to those areas to keep the roadways safe. Another group that might be interested is the MOCO park service, particularly for crashes involving animals. Areas with many animal-related crashes could be of concern to the park service, perhaps for managing the deer population, since deer can cause crashes. Tesla could also be interested: they would want to know whether any of their driverless vehicles are getting into crashes and which factors are involved when they do. Finally, city management officials could find useful information on where pedestrians are most at risk of being struck by a vehicle.

Exploratory Data Analysis

names(moco_crashes)

 [1] "report_number"                  "local_case_number"             
 [3] "agency_name"                    "acrs_report_type"              
 [5] "crash_date/time"                "route_type"                    
 [7] "road_name"                      "cross_street_type"             
 [9] "cross_street_name"              "off_road_description"          
[11] "municipality"                   "related_non_motorist"          
[13] "collision_type"                 "weather"                       
[15] "surface_condition"              "light"                         
[17] "traffic_control"                "driver_substance_abuse"        
[19] "non_motorist_substance_abuse"   "person_id"                     
[21] "driver_at_fault"                "injury_severity"               
[23] "circumstance"                   "driver_distracted_by"          
[25] "drivers_license_state"          "vehicle_id"                    
[27] "vehicle_damage_extent"          "vehicle_first_impact_location" 
[29] "vehicle_second_impact_location" "vehicle_body_type"             
[31] "vehicle_movement"               "vehicle_continuing_dir"        
[33] "vehicle_going_dir"              "speed_limit"                   
[35] "driverless_vehicle"             "parked_vehicle"                
[37] "vehicle_year"                   "vehicle_make"                  
[39] "vehicle_model"                  "equipment_problems"            
[41] "latitude"                       "longitude"                     
[43] "location"

moco_crashes |>
  group_by(weather, vehicle_damage_extent) |>
  summarize(num = n()) |>
  filter(vehicle_damage_extent == "DESTROYED")

`summarise()` has grouped output by 'weather'. You can override using the
`.groups` argument.

# A tibble: 13 × 3
# Groups:   weather [13]
   weather                  vehicle_damage_extent   num
   <chr>                    <chr>                 <int>
 1 BLOWING SAND, SOIL, DIRT DESTROYED                 2
 2 BLOWING SNOW             DESTROYED                 6
 3 CLEAR                    DESTROYED              5241
 4 CLOUDY                   DESTROYED               709
 5 FOGGY                    DESTROYED                54
 6 N/A                      DESTROYED               562
 7 OTHER                    DESTROYED                17
 8 RAINING                  DESTROYED               929
 9 SEVERE WINDS             DESTROYED                 7
10 SLEET                    DESTROYED                 6
11 SNOW                     DESTROYED                43
12 UNKNOWN                  DESTROYED                15
13 WINTRY MIX               DESTROYED                19
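Raw counts like these mostly reflect how common each weather condition is, so clear weather dominates. A sketch of how the share of destroyed vehicles within each weather condition could be computed is below. The toy data and the "FUNCTIONAL" damage label are made up for illustration and are not taken from the crash file.

```r
library(dplyr)

# Toy data for illustration only (not the real crash counts)
toy_crashes <- tibble(
  weather = c(rep("CLEAR", 8), rep("SNOW", 2)),
  vehicle_damage_extent = c(
    rep("DESTROYED", 2), rep("FUNCTIONAL", 6),  # CLEAR: 2 of 8 destroyed
    "DESTROYED", "FUNCTIONAL"                   # SNOW: 1 of 2 destroyed
  )
)

# Share of destroyed vehicles within each weather condition
destroyed_share <- toy_crashes |>
  group_by(weather) |>
  summarize(
    total = n(),
    destroyed = sum(vehicle_damage_extent == "DESTROYED"),
    share_destroyed = destroyed / total
  )
destroyed_share
```

Applied to `moco_crashes`, the `share_destroyed` column would make the weather conditions directly comparable.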

What about fatal crashes?

ggplot(moco_crashes, aes(x = speed_limit, color = acrs_report_type)) +
  geom_density()

This plot shows that most fatal crashes seem to be centered on roads with a 40 mph speed limit. Injury crashes and property damage crashes, on the other hand, are closely aligned with each other, with a mode at a 35 mph speed limit.
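The modes read off the density plot can be checked with a count table. The sketch below finds the most common speed limit for each report type. It uses toy data made up for illustration, so the numbers are not results from the crash file.

```r
library(dplyr)

# Toy data for illustration only
toy_crashes <- tibble(
  acrs_report_type = c(rep("Fatal Crash", 3), rep("Injury Crash", 4)),
  speed_limit = c(40, 40, 35, 35, 35, 35, 30)
)

# Count crashes per report type and speed limit, then keep the most common limit
modal_speed <- toy_crashes |>
  count(acrs_report_type, speed_limit) |>
  group_by(acrs_report_type) |>
  slice_max(n, n = 1) |>
  ungroup()
modal_speed
```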

Removing NAs

moco_fatal <- moco_crashes |>
  filter(acrs_report_type == "Fatal Crash") 
# Printed only: this filtered result is not assigned, so moco_fatal itself keeps any NA rows
moco_fatal |>
  filter(!is.na(acrs_report_type) & !is.na(related_non_motorist))

# A tibble: 130 × 43
   report_number local_case_number agency_name              acrs_report_type
   <chr>                     <dbl> <chr>                    <chr>           
 1 MCP2563001M           230022301 Montgomery County Police Fatal Crash     
 2 MCP2348006J           230031638 Montgomery County Police Fatal Crash     
 3 MCP2001001Z           230014394 Montgomery County Police Fatal Crash     
 4 MCP2348006F           230012796 Montgomery County Police Fatal Crash     
 5 MCP1227001X           230024315 Montgomery County Police Fatal Crash     
 6 MCP1227001Z           230052338 Montgomery County Police Fatal Crash     
 7 MCP1227001Z           230052338 Montgomery County Police Fatal Crash     
 8 MCP1227001Z           230052338 Montgomery County Police Fatal Crash     
 9 MCP31730032           230054553 Montgomery County Police Fatal Crash     
10 MCP21810054           230032948 Montgomery County Police Fatal Crash     
# ℹ 120 more rows
# ℹ 39 more variables: `crash_date/time` <chr>, route_type <chr>,
#   road_name <chr>, cross_street_type <chr>, cross_street_name <chr>,
#   off_road_description <chr>, municipality <chr>, related_non_motorist <chr>,
#   collision_type <chr>, weather <chr>, surface_condition <chr>, light <chr>,
#   traffic_control <chr>, driver_substance_abuse <chr>,
#   non_motorist_substance_abuse <chr>, person_id <chr>, …
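The earlier weather summary lists "N/A" as its own category, which suggests that at least some missing values are stored as literal text that `is.na()` will not catch. A minimal sketch of converting such strings to real `NA` values, using toy data made up for illustration:

```r
library(dplyr)

# Toy data for illustration only: missing values stored as the text "N/A"
toy_crashes <- tibble(
  weather = c("CLEAR", "N/A", "RAINING"),
  related_non_motorist = c("PEDESTRIAN", NA, "N/A")
)

# is.na() does not see the "N/A" strings
sum(is.na(toy_crashes$weather))

# Convert "N/A" to real NA in every character column
toy_clean <- toy_crashes |>
  mutate(across(where(is.character), ~ na_if(.x, "N/A")))

sum(is.na(toy_clean$weather))
```

Whether other columns in the crash file use the same "N/A" convention is an assumption that would need checking column by column.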

Let's take a look at how often pedestrians are involved in fatal crashes

ggplot(moco_fatal, aes(x = speed_limit, color = related_non_motorist)) +
    geom_density()

This graph shows that pedestrians are involved in deadly crashes to a similar extent as other related non-motorists, except bicyclists, who seem to have more involvement. However, the distribution across speed limits seems to differ for each type of non-motorist. Pedestrians in particular seem to be involved in deadly crashes most often when the speed limit is 35 to 40 mph.

Leaflet to see concentrations of deadly pedestrian crashes

fatal_pedestrian <- moco_fatal |>
  filter(related_non_motorist == "PEDESTRIAN")
tooltip <- paste0(
      "<b>Driver at Fault?: </b>", fatal_pedestrian$driver_at_fault, "<br>",
      "<b>Traffic Controls: </b>", fatal_pedestrian$traffic_control, "<br>",
      "<b>Alcohol?: </b>", fatal_pedestrian$driver_substance_abuse, "<br>"
    )

leaflet() |>
  addProviderTiles("Esri.WorldStreetMap") |> 
  setView(lng = -77.1, lat = 39.1, zoom = 10.2) |>
  addCircles(
    data = fatal_pedestrian,
    radius = 200,
    fillColor = "red",
    fillOpacity = 0.5,
    popup = tooltip)

Assuming "longitude" and "latitude" are longitude and latitude, respectively
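The message above appears because leaflet inferred the coordinate columns from their names. A sketch of naming them explicitly, and of dropping rows without coordinates first, is shown below. The points are toy coordinates made up for illustration, not real crash locations.

```r
library(leaflet)

# Toy coordinates for illustration only (not real crash locations)
toy_points <- data.frame(
  latitude = c(39.05, 39.10, NA),
  longitude = c(-77.10, -77.15, -77.20),
  driver_at_fault = c("Yes", "No", "Unknown")
)

# Drop rows without coordinates so every remaining row can be drawn
has_coords <- !is.na(toy_points$latitude) & !is.na(toy_points$longitude)
toy_points <- toy_points[has_coords, ]

crash_map <- leaflet(toy_points) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    lng = ~longitude,  # explicit mapping, so leaflet does not have to guess
    lat = ~latitude,
    radius = 200,
    fillColor = "red",
    fillOpacity = 0.5,
    popup = ~paste0("<b>Driver at Fault?: </b>", driver_at_fault)
  )
crash_map
```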

Two big hot spots I see are on Veirs Mill Road and Rockville Pike. Those roads have speed limits of around 35 to 45 mph, which matches well with the previous visualization. Other clusters include University Blvd East, clusters near Adelphi, and a couple of clusters on Georgia Ave. When clicking around, most of these crashes don't seem to be associated with alcohol or other substance abuse. Furthermore, most of the crashes seem not to be the fault of the driver, which I find interesting.

Blinding Headlights

blinding_lights <- moco_crashes |>
  filter(collision_type == "HEAD ON" & light == "DARK LIGHTS ON") |>
  filter(parked_vehicle == "No")

Plot

ggplot(blinding_lights) +
  geom_bar(aes(x = weather, fill = driver_at_fault), position = "dodge") +
  coord_flip() + 
  theme_classic() +
  labs(title = "Are Headlights Blinding us?",
       subtitle = "Head On crashes at night with lights on",
       x = "Weather Condition",
       y = "Number", 
       fill = "Driver at fault?",
       caption = "https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Drivers-Data/mmzv-x632/about_data")

I can infer that drivers who were not at fault may not have been able to see, perhaps because they were blinded by the other vehicle's headlights in the head-on crash. However, across all weather conditions except blowing snow, drivers are still more likely to be at fault than not. This suggests that headlights may not actually be blinding people significantly.

ggplot(blinding_lights) +
  geom_bar(aes(x = driver_at_fault))

Creating proportion for chi squared goodness of fit test

Let's see whether 'Yes' is the largest group here simply because it is also the largest group in the full data set, or whether something is actually different in this subset.

moco_crashes |>
  group_by(driver_at_fault) |>
  summarize(number = n()) |>
  mutate(prop = number / sum(number))  # sum(number) is 172105, the number of observations in the data set

# A tibble: 3 × 3
  driver_at_fault number   prop
  <chr>            <int>  <dbl>
1 No               74939 0.435 
2 Unknown           4701 0.0273
3 Yes              92465 0.537

Getting observed values

blinding_lights |>
  group_by(driver_at_fault) |>
  summarize(observed = n())

# A tibble: 3 × 2
  driver_at_fault observed
  <chr>              <int>
1 No                   396
2 Unknown               24
3 Yes                  580

Expected values

## no (1000 is the number of observations in the subset)
.4354 * 1000

[1] 435.4

## unknown
.02731 * 1000

[1] 27.31

##yes
.5372 * 1000

[1] 537.2

The results of the test show we have significant evidence at α = .03 to suggest that at least one proportion of drivers at fault in the blinding lights subset is different from the proportions in the main data set.
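The call used for the test is not shown above. One way to run it, offered as a sketch and not necessarily the call behind the reported result, is base R's `chisq.test()`, with the observed counts and the full data set counts taken from the tables above.

```r
# Observed counts in the head-on, dark-with-lights-on subset
observed <- c(No = 396, Unknown = 24, Yes = 580)

# Null proportions, computed from the full data set counts
full_counts <- c(No = 74939, Unknown = 4701, Yes = 92465)
null_props <- full_counts / sum(full_counts)

# Chi-squared goodness of fit test of the subset against the full-data proportions
gof <- chisq.test(x = observed, p = null_props)
gof$expected  # expected counts under the null hypothesis
gof           # test statistic, degrees of freedom, and p-value
```

With these inputs the test has 2 degrees of freedom and a p-value of roughly 0.025, which is consistent with significance at .03. Using the exact counts for the null proportions avoids the small rounding error in the hand-typed proportions, which do not sum exactly to 1.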

Discussion:

The results of this project point to a few patterns worth following up, and there is a lot more I could explore given more time. With regard to the fatal crashes, the data could be helpful in reducing pedestrian deaths. City officials could use it to see where new speed cameras might be needed, or where added crosswalks would let people cross the road safely. For the blinding lights question, my analysis suggests that headlights are perhaps not that bad and that drivers are still responsible for most of those head-on crashes at night. Despite that, the data could still be useful: it might be a good idea to refine how we teach people to drive at night. Overall, a number of other factors, such as road condition and weather, could also be responsible for pedestrian deaths and for head-on crashes at night.