Source Data

The following analysis uses data comprised of traffic accidents in the city of Tempe, Arizona. They constitute 28,470 observations and 32 variables. You can find more details in the dataset documentation at the Tempe Open Data Portal.


Preview

We can explore the first six observations in the following table using function head().

head(dat)



Preprocessing

All data preprocessing are available in Appendix A: Data Preprocessing.



Part I: Summaries

The following provides brief exploratory summaries on key data points.


1. Mondays Accidents

Question: How many accidents happened on Mondays?

Note: This solution has been completed for you.


colnames(dat)
##  [1] "Incidentid"          "DateTime"            "Year"               
##  [4] "StreetName"          "CrossStreet"         "Distance"           
##  [7] "JunctionRelation"    "Totalinjuries"       "Totalfatalities"    
## [10] "Injuryseverity"      "Collisionmanner"     "Lightcondition"     
## [13] "Weather"             "SurfaceCondition"    "Unittype_One"       
## [16] "Age_Drv1"            "Gender_Drv1"         "Traveldirection_One"
## [19] "Unitaction_One"      "Violation1_Drv1"     "AlcoholUse_Drv1"    
## [22] "DrugUse_Drv1"        "Unittype_Two"        "Age_Drv2"           
## [25] "Gender_Drv2"         "Traveldirection_Two" "Unitaction_Two"     
## [28] "Violation1_Drv2"     "AlcoholUse_Drv2"     "DrugUse_Drv2"       
## [31] "Latitude"            "Longitude"           "hour"               
## [34] "month"               "day"                 "day365"             
## [37] "week"                "hour12"              "age"
unique(dat$day)
## [1] Tue Thu Mon Fri Sun Sat Wed
## Levels: Mon Tue Wed Thu Fri Sat Sun
sum(dat$day == "Mon", 
    na.rm = TRUE)
## [1] 4094


Answer: 4,094 accidents occurred on Mondays.


2. Monday Accidents (%)

Question: What proportion of accidents each week occur on Monday?


# Code

# Calculate the proportion of accidents occurring on Monday
monday_proportion <- dat %>%
  summarise(proportion = mean(day == "Mon", na.rm = TRUE))

# Print the result
monday_proportion


Answer: [14.38]% of all accidents occur on Mondays.


3. Harmful Monday Accidents (%)

Question: v?

Note: “Harm” is defined as any accident that causes at least one injury or fatality, a.k.a. a “casualty”.


# Code
colnames(dat)
##  [1] "Incidentid"          "DateTime"            "Year"               
##  [4] "StreetName"          "CrossStreet"         "Distance"           
##  [7] "JunctionRelation"    "Totalinjuries"       "Totalfatalities"    
## [10] "Injuryseverity"      "Collisionmanner"     "Lightcondition"     
## [13] "Weather"             "SurfaceCondition"    "Unittype_One"       
## [16] "Age_Drv1"            "Gender_Drv1"         "Traveldirection_One"
## [19] "Unitaction_One"      "Violation1_Drv1"     "AlcoholUse_Drv1"    
## [22] "DrugUse_Drv1"        "Unittype_Two"        "Age_Drv2"           
## [25] "Gender_Drv2"         "Traveldirection_Two" "Unitaction_Two"     
## [28] "Violation1_Drv2"     "AlcoholUse_Drv2"     "DrugUse_Drv2"       
## [31] "Latitude"            "Longitude"           "hour"               
## [34] "month"               "day"                 "day365"             
## [37] "week"                "hour12"              "age"
MonPropHarm <- dat %>% 
  filter(day == "Mon") %>%  # Filter only Monday accidents
  summarise(Propharm = mean((Totalfatalities + Totalinjuries) > 0, na.rm = TRUE))

MonPropHarm


Answer: [29.85]% of all Monday accidents have at least one casualty.


4. Most Common Accidents

Question: What is the most common type of accident (Collisionmanner) that occurs on Mondays?


dat %>%
  filter(day == "Mon") %>%
  count(Collisionmanner) %>%
  arrange(desc(n))


Answer: [Rear End] collisions are the most common accident types.


5. Differences in Accidents

Question: Are there differences in the proportion of accidents that result in harm each day of the week?


# Code

# Calculate the proportion of harmful accidents for each day of the week
harm.prop.by.day <- dat %>%
  group_by(day) %>%  # Group by day of the week
  summarise(Propharm = mean((Totalfatalities + Totalinjuries) > 0, na.rm = TRUE)) %>%
  arrange(desc(Propharm))  # Optional: Sort by highest proportion

# Print the result
harm.prop.by.day


Answer: Significantly more accidents occur [Thr and Tue].


6. Accident Stats by Weekday

Instructions: Create a table that reports the following for each day of the week.

  • Total accidents
  • Total injuries
  • Total fatalities
  • Proportion of harmful accidents (i.e. casualties)


dat %>%
  group_by(day) %>%
  mutate(harm_acc = Totalinjuries > 0 | Totalfatalities > 0) %>%
  summarize(n = n(),
            injuries = sum(Totalinjuries),
            fatalities = sum(Totalfatalities),
            harm.rate = mean(harm_acc))



Part II: Age Groups

The following provides summaries of accidents by age groups.


1. Accidents by Hour & Age

Instructions: Create a table of counts of accidents by time of day (hour12) and age of driver (age).

Question: Which age group has the largest number of accidents at 7AM?


# Code



# Create a table of accident counts by time of day and age of driver
accident_counts <- dat %>%
  count(hour12, age) %>%  # Count occurrences for each combination of hour12 and age
  pivot_wider(names_from = age, values_from = n, values_fill = 0)  # Convert to wide format

# Print the table
accident_counts


Answer: Drivers of [18-25] have the greatest number of accidents from 7:00 to 7:59 AM.


2. Accidents by Hour & Age (%)

Instructions: Create a new table of time of day and age group that reports the proportion of accidents at “7 AM” (hour12) for each age group (age). The proportions within each age group should sum to one.



Objective: The table should contain the following columns and initial values:

age hour12 n n.age n.hour p p.age p.hour
Age 16-18 7 AM 77 1458 1606 0.05 0.05 0.05
Age 18-25 7 AM 408 8796 1606 0.25 0.05 0.25
Age 25-35 7 AM 371 5456 1606 0.23 0.07 0.23



Solution: The solution has been partially provided. Use the hints provided to create the table.

 dat %>% 
   group_by(hour12, age) %>% 
   summarize(n = n()) %>%      # Partial solution
   group_by(age) %>%           # Requires two new variables in 'mutate()'
   mutate()                    # Lastly, requires function 'filter()
accident_proportions <- dat %>% 
  group_by(hour12, age) %>% 
  summarize(n = n(), .groups = "drop") %>%  
  group_by(age) %>% 
  mutate(n.age = sum(n)) %>%  
  ungroup() %>%
  filter(hour12 == " 7 AM") %>%  # Keep only 7 AM now
  mutate(n.hour = sum(n),  # Compute total accidents at 7 AM across all age groups
         p = n / n.hour,   
         p.age = n / n.age,  
         p.hour = n / n.hour)

# Display the table
accident_proportions



Part III: Rates of Harm

The following reports the accidents, casualties, proportion, and average casualties per harmful accident.


1. Accidents by Hour

Instructions: Visualize total accidents by time of day (hour).

Note: This solution has been completed for you.


dat %>%
  group_by(hour) %>%
  summarize(n = n()) %>%
  plot(type = "b",
       bty = "n",
       pch = 19,
       cex = 2,
       xlab = "Hour",
       ylab = "Total Number of Accidents",
       main = "Total Number of Accidents by Time of Day")


2. Total Casualties by Hour

Instructions: Visualize total injuries and fatalities by time of day (hour).


# Code


#  total injuries and fatalities by hour
injury_fatality_by_hour <- dat %>%
  group_by(hour) %>%
  summarize(
    total_harm = sum(Totalinjuries + Totalfatalities, na.rm = TRUE)  # Sum injuries and fatalities together
  )

# Plot total harm (injuries + fatalities) by hour
plot(injury_fatality_by_hour$hour, injury_fatality_by_hour$total_harm,
     type = "b",          # Line and point plot
     bty = "n",           # No box around the plot
     pch = 19,            # Solid circle points
     cex = 1.5,           # Increase point size
     col = "purple",      # Use a single color for combined harm
     xlab = "Hour",
     ylab = "Total Injuries + Fatalities",
     main = "Total Harm (Injuries + Fatalities) by Time of Day")

# gridlines for better readability
grid()


3. Accidents with Casualties (%)

Instructions: Visualize the proportion of harmful accidents out of all accidents by time of day (hour).


# Code

library(dplyr)

# the proportion of harmful accidents (injuries or fatalities > 0) by hour
harmful_accidents_by_hour <- dat %>%
  group_by(hour) %>%
  summarize(
    total_accidents = n(),  # Total accidents per hour
    harmful_accidents = sum((Totalinjuries > 0 | Totalfatalities > 0), na.rm = TRUE),  # Harmful accidents per hour
    prop_harmful = harmful_accidents / total_accidents  # Proportion of harmful accidents
  )

# Plot the proportion of harmful accidents by time of day
plot(harmful_accidents_by_hour$hour, harmful_accidents_by_hour$prop_harmful,
     type = "b",          # Line and point plot
     bty = "n",           # No box around the plot
     pch = 19,            # Solid circle points
     cex = 1.5,           # Increase point size
     col = "red",         # Use red for harmful accidents proportion
     xlab = "Hour",
     ylab = "Proportion of Harmful Accidents",
     main = "Proportion of Harmful Accidents by Time of Day")

# gridlines for better readability
grid()


4. Avg. Harm by Harmful Accident

Instructions: Visualize average injuries or fatalities per harmful collision.


# Code

library(dplyr)

# Compute the average (injuries + fatalities) per harmful collision by hour
avg_harmful_by_hour <- dat %>%
  filter(Totalinjuries > 0 | Totalfatalities > 0) %>%  # Keep only harmful accidents
  group_by(hour) %>%
  summarize(
    avg_harm = mean(Totalinjuries + Totalfatalities, na.rm = TRUE)  # Average total harm (injuries + fatalities)
  )

# Plot average harm per harmful collision
plot(avg_harmful_by_hour$hour, avg_harmful_by_hour$avg_harm,
     type = "b",          # Line and point plot
     bty = "n",           # No box around the plot
     pch = 19,            # Solid circle points
     cex = 1.5,           # Increase point size
     col = "purple",      # Use a single color for combined harm
     xlab = "Hour",
     ylab = "Average Harm per Harmful Collision",
     main = "Avg Injuries + Fatalities per Harmful Collision")

# gridlines for better readability
grid()


Appendix

The following code is used for preprocessing tasks in the above solutions.


A: Data Preprocessing

These expressions transform, format, and create myriad new variables used in this report.


date.vec <- strptime(dat$DateTime, 
                     format = "%m/%d/%y %H:%M")       # Create date fields

dat$hour   <- format(date.vec, format = "%H")
dat$month  <- format(date.vec, format = "%b")
dat$day    <- format(date.vec, format = "%a")
dat$day365 <- format(date.vec, format = "%j")
dat$week   <- format(date.vec, format = "%V")

dat$day <- factor(dat$day, 
                  levels = c("Mon",
                             "Tue",
                             "Wed",
                             "Thu",
                             "Fri",
                             "Sat",
                             "Sun"))                  # Order weekdays

dat$hour12 <- format(date.vec, 
                     format="%l %p")                  # Create 12-hour format

time.levels <- c("12 AM", paste(1:11, "AM"), 
                 "12 PM", paste(1:11, "PM"))

dat$hour12 <- factor(dat$hour12, 
                     levels = time.levels)            # Order time intervals

age.labels <- paste0("Age ", 
                     c(16,18,25,35,45,55,65,75), "-", 
                     c(18,25,35,45,55,65,75,100) )

dat$age <- cut(dat$Age_Drv1, 
               breaks = c(16,18,25,
                          35,45,55,
                          65,75,100), 
               labels = age.labels)                   # Discretize age ranges