Final Project

Members

Amy Nguyen: - M13925168

Alyssa Hughes: - M13320281

Emily Byrne: - M13283492

Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggrepel)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
ggmap::register_google(key = "AIzaSyB2dXu4azkdzKjFg8Tskk5NFPurFc2__r0")

Dataset

Cincinnati Traffic Crash Reports

source: https://data.cincinnati-oh.gov/safety/Traffic-Crash-Reports-CPD-/rvmt-pkmq

crashes = read_csv("Traffic_Crash_Reports__CPD_.csv")
## Rows: 329772 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): ADDRESS_X, COMMUNITY_COUNCIL_NEIGHBORHOOD, CPD_NEIGHBORHOOD, SNA_N...
## dbl  (7): LATITUDE_X, LONGITUDE_X, AGE, CRASHSEVERITYID, LOCALREPORTNO, ROAD...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(crashes)
## # A tibble: 6 × 28
##   ADDRESS_X      LATITU…¹ LONGIT…²   AGE COMMU…³ CPD_N…⁴ SNA_N…⁵ CRASH…⁶ CRASH…⁷
##   <chr>             <dbl>    <dbl> <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 39XX EDWARDS … 39.2     -8.44e+1    51 OAKLEY  OAKLEY  OAKLEY  04/13/… <NA>   
## 2 39XX EDWARDS … 39.2     -8.44e+1    NA OAKLEY  OAKLEY  OAKLEY  04/13/… <NA>   
## 3 17XX HOLLOWAY… 39.1     -8.45e+1    37 EVANST… EVANST… EVANST… 04/12/… <NA>   
## 4 4XX E MARTIN … 39.1     -8.45e+1    30 CORRYV… CORRYV… CORRYV… 04/12/… <NA>   
## 5 4XX E MARTIN … 39.1     -8.45e+1    NA CORRYV… CORRYV… CORRYV… 04/12/… <NA>   
## 6 XX W MITCHELL  -0.00083 -7.7 e-4    NA N/A     N/A     N/A     04/12/… <NA>   
## # … with 19 more variables: CRASHSEVERITY <chr>, CRASHSEVERITYID <dbl>,
## #   DATECRASHREPORTED <chr>, DAYOFWEEK <chr>, GENDER <chr>, INJURIES <chr>,
## #   INSTANCEID <chr>, LIGHTCONDITIONSPRIMARY <chr>, LOCALREPORTNO <dbl>,
## #   MANNEROFCRASH <chr>, ROADCONDITIONSPRIMARY <chr>, ROADCONTOUR <chr>,
## #   ROADSURFACE <chr>, ROADCLASS <dbl>, ROADCLASSDESC <chr>, UNITTYPE <chr>,
## #   TYPEOFPERSON <chr>, WEATHER <chr>, ZIP <dbl>, and abbreviated variable
## #   names ¹​LATITUDE_X, ²​LONGITUDE_X, ³​COMMUNITY_COUNCIL_NEIGHBORHOOD, …

Description of Dataset

Columns

Columns meanings (28 total):

  • ADRESS_X = address where crash occurred

  • LATITUDE_X = latitudinal location where the crash occurred

  • LONGITUDE_X = longitudinal location where the crash occurred

  • AGE = age of the driver who caused the crash

  • COMMUNITY_COUNCIL_NEIGHBORHOOD = neighborhood where cash occurred in terms of community council

  • CPD_NEIGHBORHOOD = neighborhood where crash occurred in terms of CPD

  • SNA_NEIGHBORHOOD = neighborhood where crash occurred

  • CRASHDATE = date of crash

  • CRASHLOCATION = location of crash

  • CRASHSERVERITY = severity of crash

  • CRASHSEVERITYID = severity of crash ID number

  • DATECRASHREPORTED = date when crash was reported

  • DAYOFWEEK = day of week when crash occurred

  • GENDER = gender of driver who caused crash

  • INJURIES = injuries that occurred in crash

  • INSTANCEID = ID number of crash

  • LIGHTCONDITIONSPRIMARY = time of day/light when crash occurred

  • LOCALREPORTNUMBER = report number of crash

  • MANNEROFCRASH = type of crash

  • ROADCONDITIONSPRIMARY = condition of road when crash occurred

  • ROADCONTOUR = contour of road where crash occurred

  • ROADSURFACE = surface of road when crash occurred

  • ROADCLASS = type of road crash occurred on ID number

  • ROADCLASSDESC = type of road crash occurred on

  • UNITTYPE = type of car that caused crash

  • TYPEOFPERSON = driver or passenger at fault for crash

  • WEATHER = weather when crash occurred

  • ZIP = zip code where crash occurred

Rows

Each row represents a unique car crash incident in Cincinnati, Ohio.

There are 326,752 rows/observations in this data set.

Dimensions and Missing Values

Dimensions

dim(crashes)
## [1] 329772     28

Total Missing Values

sum(is.na(crashes))
## [1] 474296
colSums(is.na(crashes))
##                      ADDRESS_X                     LATITUDE_X 
##                             17                             26 
##                    LONGITUDE_X                            AGE 
##                             28                          40760 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               CPD_NEIGHBORHOOD 
##                              0                              0 
##               SNA_NEIGHBORHOOD                      CRASHDATE 
##                              0                             19 
##                  CRASHLOCATION                  CRASHSEVERITY 
##                         135754                              7 
##                CRASHSEVERITYID              DATECRASHREPORTED 
##                              7                             21 
##                      DAYOFWEEK                         GENDER 
##                             17                          37959 
##                       INJURIES                     INSTANCEID 
##                            303                              0 
##         LIGHTCONDITIONSPRIMARY                  LOCALREPORTNO 
##                             18                              0 
##                  MANNEROFCRASH          ROADCONDITIONSPRIMARY 
##                             18                             18 
##                    ROADCONTOUR                    ROADSURFACE 
##                             18                             18 
##                      ROADCLASS                  ROADCLASSDESC 
##                         125865                         125865 
##                       UNITTYPE                   TYPEOFPERSON 
##                            299                            297 
##                        WEATHER                            ZIP 
##                             18                           6944

Visualizations

Emily’s Questions

Question 1

What is the distribution of level 1 severity (fatal) car crashes in Cincinnati?

# get roadmap of Cincinnati area 
cincinnati = get_map(location = "Cincinnati",
                     zoom = 12,
                     source = "google",
                     maptype = "roadmap")
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=Cincinnati&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&key=xxx>
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=Cincinnati&key=xxx>
#create scatter plot and heat map of crashes on top of Cincinnati map 
ggmap(cincinnati) +
  geom_point(data = crashes %>% 
               filter(CRASHSEVERITY == "1 - FATAL"),
             aes(x = LONGITUDE_X, y = LATITUDE_X)) +
  geom_density2d(data = crashes %>% 
               filter(CRASHSEVERITY == "1 - FATAL"),
             aes(x = LONGITUDE_X, y = LATITUDE_X),
             color = "purple") +
  geom_density2d_filled(data = crashes %>% 
               filter(CRASHSEVERITY == "1 - FATAL"),
             aes(x = LONGITUDE_X, y = LATITUDE_X),
             alpha = 0.4) +
  labs(title = "Distribution of Level 1 Severity Crashes in Cincinnati",
       x = "Longitude", 
       y = "Latitude",
       fill = "Number of Crashes")

I chose this question because I was interested to see if there were certain roads that were more dangerous than others in the Cincinnati area. The results of this plot show that most of the fatal car crashes occur in Downtown Cincinnati, specifically on the roads East Central Parkway and East 5th and 6th Street. There are also record of some new Riverbend concert venue and Ault Park in Mt. Lookout.

Question 2

Which day of the week has experienced the most crashes when the light condition was considered dark?

# convert columns to factors 
crashes$LIGHTCONDITIONSPRIMARY <- factor(crashes$LIGHTCONDITIONSPRIMARY)
crashes$DAYOFWEEK <- factor(crashes$DAYOFWEEK, levels = c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"))

# prepare data set for plot 
crashes_filtered <- crashes %>% 
  filter(LIGHTCONDITIONSPRIMARY %in% c("3 - DARK - LIGHTED ROADWAY", "4 - DARK – ROADWAY NOT LIGHTED", "5 - DARK – UNKNOWN ROADWAY LIGHTING")) %>% 
  group_by(DAYOFWEEK, LIGHTCONDITIONSPRIMARY) %>% 
  mutate(crash_count = n()) %>% 
  summarize(avg_crashes = mean(crash_count))
## `summarise()` has grouped output by 'DAYOFWEEK'. You can override using the
## `.groups` argument.
# plot the average number of crashes per day of week 
ggplot(crashes_filtered, aes(x = LIGHTCONDITIONSPRIMARY, y = avg_crashes, fill = LIGHTCONDITIONSPRIMARY)) +
  geom_bar(stat = "identity", color = "black") +
  facet_wrap(~ DAYOFWEEK, nrow = 1, scales = "free_x") +
  scale_x_discrete(labels = c("3 - DARK - LIGHTED ROADWAY" = "3", "4 - DARK – ROADWAY NOT LIGHTED" = "4", "5 - DARK – UNKNOWN ROADWAY LIGHTING" = "5")) +
  labs(title = "Number of Crashes at Night by Day of Week",
       x = "Light Conditions",
       y = "Average Number of Crashes",
       fill = "Light Condition")

I chose this question because I was interested to see if there were certain days of the week that experience more car crashes at night than others. The results of this plot show that the majority of nighttime car accidents happen over the weekend on Saturdays, Sundays, and Fridays, respecitvely. They also mostly occur on lighted roadways.

Alyssa’s Questions

Question 1

What neighborhoods on the east side have the most car crashes?

# Get the unique neighborhood values
unique_neighborhoods <- unique(crashes$CPD_NEIGHBORHOOD)

# Print the unique neighborhood values
print(unique_neighborhoods)
##  [1] "OAKLEY"                     "EVANSTON"                  
##  [3] "CORRYVILLE"                 "N/A"                       
##  [5] "AVONDALE"                   "SPRING GROVE VILLAGE"      
##  [7] "COLLEGE  HILL"              "BONDHILL"                  
##  [9] "MOUNT  AUBURN"              "WESTWOOD"                  
## [11] "PLEASANT RIDGE"             "EAST PRICE HILL"           
## [13] "WEST  END"                  "C. B. D. / RIVERFRONT"     
## [15] "QUEENSGATE"                 "WEST PRICE HILL"           
## [17] "MOUNT AIRY"                 "CLIFTON/UNIVERSITY HEIGHTS"
## [19] "NORTH AVONDALE"             "CARTHAGE"                  
## [21] "CLIFTON"                    "LOWER PRICE  HILL"         
## [23] "SOUTH  FAIRMOUNT"           "MADISONVILLE"              
## [25] "CAMP  WASHINGTON"           "WALNUT HILLS"              
## [27] "PADDOCK  HILLS"             "MOUNT  ADAMS"              
## [29] "OVER-THE-RHINE"             "MT.  WASHINGTON"           
## [31] "COLUMBIA / TUSCULUM"        "HYDE PARK"                 
## [33] "NORTHSIDE"                  "ROSELAWN"                  
## [35] "FAIRVIEW"                   "SAYLER  PARK"              
## [37] "WINTON HILLS"               "SEDAMSVILLE"               
## [39] "MT.  LOOKOUT"               "EAST  END"                 
## [41] "HARTWELL"                   "ENGLISH  WOODS"            
## [43] "EAST  WESTWOOD"             "LINWOOD"                   
## [45] "PENDLETON"                  "RIVERSIDE"                 
## [47] "MILLVALE"                   "EAST WALNUT HILLS"         
## [49] "O'BRYONVILLE"               "S.. CUMMINSVILLE"          
## [51] "KENNEDY  HEIGHTS"           "CALIFORNIA"                
## [53] "NORTH FAIRMOUNT"            "FAY APARTMENTS"
east_neighborhoods <- crashes %>%
  filter(CPD_NEIGHBORHOOD %in% c("OAKLEY", "EVANSTON", "CORRYVILLE", "AVONDALE", 
                                 "SPRING GROVE VILLAGE", "COLLEGE HILL", "MOUNT AUBURN", 
                                 "PLEASANT RIDGE", "EAST PRICE HILL", "MT. WASHINGTON", 
                                 "COLUMBIA / TUSCULUM", "HYDE PARK", "LINWOOD", 
                                 "EAST WALNUT HILLS", "O'BRYONVILLE", "KENNEDY HEIGHTS", 
                                 "CALIFORNIA"))

# View the filtered data
head(east_neighborhoods)
## # A tibble: 6 × 28
##   ADDRES…¹ LATIT…² LONGI…³   AGE COMMU…⁴ CPD_N…⁵ SNA_N…⁶ CRASH…⁷ CRASH…⁸ CRASH…⁹
##   <chr>      <dbl>   <dbl> <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 39XX ED…    39.2   -84.4    51 OAKLEY  OAKLEY  OAKLEY  04/13/… <NA>    5 - PR…
## 2 39XX ED…    39.2   -84.4    NA OAKLEY  OAKLEY  OAKLEY  04/13/… <NA>    5 - PR…
## 3 17XX HO…    39.1   -84.5    37 EVANST… EVANST… EVANST… 04/12/… <NA>    5 - PR…
## 4 4XX E M…    39.1   -84.5    30 CORRYV… CORRYV… CORRYV… 04/12/… <NA>    4 - IN…
## 5 4XX E M…    39.1   -84.5    NA CORRYV… CORRYV… CORRYV… 04/12/… <NA>    4 - IN…
## 6 27XX RE…    39.1   -84.5    22 AVONDA… AVONDA… AVONDA… 04/12/… <NA>    5 - PR…
## # … with 18 more variables: CRASHSEVERITYID <dbl>, DATECRASHREPORTED <chr>,
## #   DAYOFWEEK <fct>, GENDER <chr>, INJURIES <chr>, INSTANCEID <chr>,
## #   LIGHTCONDITIONSPRIMARY <fct>, LOCALREPORTNO <dbl>, MANNEROFCRASH <chr>,
## #   ROADCONDITIONSPRIMARY <chr>, ROADCONTOUR <chr>, ROADSURFACE <chr>,
## #   ROADCLASS <dbl>, ROADCLASSDESC <chr>, UNITTYPE <chr>, TYPEOFPERSON <chr>,
## #   WEATHER <chr>, ZIP <dbl>, and abbreviated variable names ¹​ADDRESS_X,
## #   ²​LATITUDE_X, ³​LONGITUDE_X, ⁴​COMMUNITY_COUNCIL_NEIGHBORHOOD, …
east_crash_counts <- east_neighborhoods %>%
  group_by(CPD_NEIGHBORHOOD) %>%
  summarize(count = n())

# Plot the count of car crashes in each neighborhood
ggplot(east_crash_counts, aes(x = fct_reorder(CPD_NEIGHBORHOOD, -count), y = count, fill = CPD_NEIGHBORHOOD)) +
  geom_bar(stat = "identity") +
  labs(x = "Neighborhood", y = "Count of Car Crashes",
       title = "Count of Car Crashes in East Side Neighborhoods") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  scale_fill_viridis_d(option = "plasma") +
  guides(fill = FALSE)

I asked this question as I spend most of my time in Cincinnati on the east side. I was curious to see if there were big differences between the number of crashes in each neighborhood. I first researched which of the neighborhoods in the data set were on the east side and then filtered for those neighborhoods. Using the viridis plasma color palette, I then modeled each neighborhood’s crash count in a bar plot. As you can see, Avondale has the highest number of crashes and O’Bryonville the least. This was important to me as many of the neighborhoods listed (particularly Avondale, Corryville, Oakley and Hyde Park) are places surrounding where I live as well as places I visit often.

Question 2

Who is responsible for the car accidents in Clifton?

clifton_crashes <- crashes %>%
  filter(CPD_NEIGHBORHOOD == "CLIFTON") %>%
  count(TYPEOFPERSON)

crashes$TYPEOFPERSON <- ifelse(crashes$TYPEOFPERSON == "", "U - UNKNOWN", crashes$TYPEOFPERSON)

# calculate total count of crashes in Clifton
total <- sum(clifton_crashes$n)

# create donut chart
ggplot(clifton_crashes, aes(x = "", y = n, fill = TYPEOFPERSON)) +
  geom_bar(stat = "identity", color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of People Involved in Crashes in Clifton",
       fill = "Type of Person") +
  theme_void() +
  theme(legend.position = "right") +
  scale_fill_manual(values = c("#7e03a8", "#cc4778", "#f89540", "#f0f921"), 
                    labels = c("Driver", "Pedestrian", "Occupant", "Unknown")) +
  geom_text_repel(aes(label = paste0(round((n/total*100), 1), "%")), 
                  position = position_stack(vjust = 0.5), 
                  size = 4)

For this question, I filtered for the crashes in Clifton, where I live. Since I live near the University of Cincinnati’s campus, I see a number of the crazy drivers in Clifton. My hunch was that occupants in the car would be a large proportion of the responsible parties in crashes in Clifton. I did a pie chart as this was the easiest way to show proportions. As seen on the chart, drivers are by far the most responsible for the crashes in Clifton (90.7%). The immediate next highest value is occupants (8.4%) followed by pedestrians (0.9%) with no unknown values in the data. I continued with the viridis color palette to demonstrate the different proportions. While this visualization did not prove my hunch, it gave me who to look out for when I’m out in Clifton.

Amy’s Questions

Question 1

What is the number of crashes during the different conditions of road?

# Create a new column indicating the road condition
crashes$ROADCONDITIONSPRIMARY <- ifelse(crashes$ROADCONDITIONSPRIMARY %in% c("01 - DRY", "02 - WET", "04 - ICE", "03 - SNOW"), crashes$ROADCONDITIONSPRIMARY, "OTHER")

# Count the number of crashes in each road condition
crash_counts <- crashes %>% 
  group_by(ROADCONDITIONSPRIMARY) %>% 
  summarize(num_crashes = n())

# Create a pie chart
crash_counts$percent_crashes <- (crash_counts$num_crashes / sum(crash_counts$num_crashes)) * 100

# Create a pie chart
ggplot(crash_counts, aes(x = "", y = num_crashes, fill = ROADCONDITIONSPRIMARY)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  labs(title = "Number of Crashes by Road Condition") +
  theme_void()

To find out if road conditions affect the number of crashes, I generated a pie graph to show the percentage of crashes that happened during each road conditions. From the chart, we can see that over 75% of crashes happened when the road is dry, over 20% happened when the road is wet, and around 5% of crashes happened when the road is under other conditions such as snow, ice, etc.

Question 2

What is the number of crashes of male vs female, and the percentage of fatal crashes on their total crashes?

# Calculate the number of crashes by gender
crash_counts <- crashes %>% 
  group_by(GENDER) %>% 
  summarize(num_crashes = n())

# Calculate the number and percentage of fatal crashes by gender
fatal_counts <- crashes %>% 
  filter(GENDER %in% c("M - MALE", "F - FEMALE"), INJURIES %in% c("1 - FATAL", "5 - FATAL", "2 - SUSPECTED SERIOUS INJURY")) %>% 
  group_by(GENDER) %>% 
  summarize(num_fatal = n()) %>% 
  mutate(percent_fatal = round(num_fatal / sum(crash_counts$num_crashes) * 100, 2))

# Merge the crash counts and fatal counts by gender
crash_summary <- merge(crash_counts, fatal_counts, by = "GENDER")

# Create a bar plot of crashes by gender with percentage of fatal crashes
ggplot(crash_summary, aes(x = GENDER, y = num_crashes, fill = GENDER)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percent_fatal, 2), "% (Fatal)")), 
            position = position_dodge(width = 1), vjust = -0.5, size = 3,
            fontface = "bold") +
  labs(title = "Number of Crashes by Gender", y = "Number of Crashes",
       fill = "Gender") +
  scale_fill_manual(values = c("pink", "blue")) +
  theme_minimal()

To find out whether male or female causes more crashes, I generated a histogram showing the total number of crashes caused by each gender. In addition, I included the percentage of fatal crashes on total crashes by each gender. From the histogram, we can see that male caused significantly more crashes than female, with 0.1% more fatal crashes. Therefore, we can generally conclude that males drive more recklessly than female, and tend to make more fatal crashes.