The following analysis uses data comprised of traffic accidents in the city of Tempe, Arizona. They constitute 28,470 observations and 32 variables. You can find more details in the dataset documentation at the Tempe Open Data Portal.
All data preprocessing are available in Appendix A: Data Preprocessing.
The following provides brief exploratory summaries on key data points.
Question: How many accidents happened on Mondays?
Note: This solution has been completed for you.
## [1] "Incidentid" "DateTime" "Year"
## [4] "StreetName" "CrossStreet" "Distance"
## [7] "JunctionRelation" "Totalinjuries" "Totalfatalities"
## [10] "Injuryseverity" "Collisionmanner" "Lightcondition"
## [13] "Weather" "SurfaceCondition" "Unittype_One"
## [16] "Age_Drv1" "Gender_Drv1" "Traveldirection_One"
## [19] "Unitaction_One" "Violation1_Drv1" "AlcoholUse_Drv1"
## [22] "DrugUse_Drv1" "Unittype_Two" "Age_Drv2"
## [25] "Gender_Drv2" "Traveldirection_Two" "Unitaction_Two"
## [28] "Violation1_Drv2" "AlcoholUse_Drv2" "DrugUse_Drv2"
## [31] "Latitude" "Longitude" "hour"
## [34] "month" "day" "day365"
## [37] "week" "hour12" "age"
## [1] Tue Thu Mon Fri Sun Sat Wed
## Levels: Mon Tue Wed Thu Fri Sat Sun
## [1] 4094
Answer: 4,094 accidents occurred on Mondays.
Question: What proportion of accidents each week occur on Monday?
# Code
# Calculate the proportion of accidents occurring on Monday
monday_proportion <- dat %>%
summarise(proportion = mean(day == "Mon", na.rm = TRUE))
# Print the result
monday_proportionAnswer: [14.38]% of all accidents occur on Mondays.
Question: v?
Note: “Harm” is defined as any accident that causes at least one injury or fatality, a.k.a. a “casualty”.
## [1] "Incidentid" "DateTime" "Year"
## [4] "StreetName" "CrossStreet" "Distance"
## [7] "JunctionRelation" "Totalinjuries" "Totalfatalities"
## [10] "Injuryseverity" "Collisionmanner" "Lightcondition"
## [13] "Weather" "SurfaceCondition" "Unittype_One"
## [16] "Age_Drv1" "Gender_Drv1" "Traveldirection_One"
## [19] "Unitaction_One" "Violation1_Drv1" "AlcoholUse_Drv1"
## [22] "DrugUse_Drv1" "Unittype_Two" "Age_Drv2"
## [25] "Gender_Drv2" "Traveldirection_Two" "Unitaction_Two"
## [28] "Violation1_Drv2" "AlcoholUse_Drv2" "DrugUse_Drv2"
## [31] "Latitude" "Longitude" "hour"
## [34] "month" "day" "day365"
## [37] "week" "hour12" "age"
MonPropHarm <- dat %>%
filter(day == "Mon") %>% # Filter only Monday accidents
summarise(Propharm = mean((Totalfatalities + Totalinjuries) > 0, na.rm = TRUE))
MonPropHarmAnswer: [29.85]% of all Monday accidents have at least one casualty.
Question: What is the most common type of accident (Collisionmanner) that occurs on Mondays?
Answer: [Rear End] collisions are the most common accident types.
Question: Are there differences in the proportion of accidents that result in harm each day of the week?
# Code
# Calculate the proportion of harmful accidents for each day of the week
harm.prop.by.day <- dat %>%
group_by(day) %>% # Group by day of the week
summarise(Propharm = mean((Totalfatalities + Totalinjuries) > 0, na.rm = TRUE)) %>%
arrange(desc(Propharm)) # Optional: Sort by highest proportion
# Print the result
harm.prop.by.dayAnswer: Significantly more accidents occur [Thr and Tue].
Instructions: Create a table that reports the following for each day of the week.
dat %>%
group_by(day) %>%
mutate(harm_acc = Totalinjuries > 0 | Totalfatalities > 0) %>%
summarize(n = n(),
injuries = sum(Totalinjuries),
fatalities = sum(Totalfatalities),
harm.rate = mean(harm_acc))
The following provides summaries of accidents by age groups.
Instructions: Create a table of counts of accidents
by time of day (hour12) and age of driver
(age).
Question: Which age group has the largest number of accidents at 7AM?
# Code
# Create a table of accident counts by time of day and age of driver
accident_counts <- dat %>%
count(hour12, age) %>% # Count occurrences for each combination of hour12 and age
pivot_wider(names_from = age, values_from = n, values_fill = 0) # Convert to wide format
# Print the table
accident_countsAnswer: Drivers of [18-25] have the greatest number of accidents from 7:00 to 7:59 AM.
Instructions: Create a new table of time of day and
age group that reports the proportion of accidents at “7 AM”
(hour12) for each age group (age). The
proportions within each age group should sum to one.
Objective: The table should contain the following columns and initial values:
| age | hour12 | n | n.age | n.hour | p | p.age | p.hour |
|---|---|---|---|---|---|---|---|
| Age 16-18 | 7 AM | 77 | 1458 | 1606 | 0.05 | 0.05 | 0.05 |
| Age 18-25 | 7 AM | 408 | 8796 | 1606 | 0.25 | 0.05 | 0.25 |
| Age 25-35 | 7 AM | 371 | 5456 | 1606 | 0.23 | 0.07 | 0.23 |
Solution: The solution has been partially provided. Use the hints provided to create the table.
dat %>%
group_by(hour12, age) %>%
summarize(n = n()) %>% # Partial solution
group_by(age) %>% # Requires two new variables in 'mutate()'
mutate() # Lastly, requires function 'filter()accident_proportions <- dat %>%
group_by(hour12, age) %>%
summarize(n = n(), .groups = "drop") %>%
group_by(age) %>%
mutate(n.age = sum(n)) %>%
ungroup() %>%
filter(hour12 == " 7 AM") %>% # Keep only 7 AM now
mutate(n.hour = sum(n), # Compute total accidents at 7 AM across all age groups
p = n / n.hour,
p.age = n / n.age,
p.hour = n / n.hour)
# Display the table
accident_proportions
The following reports the accidents, casualties, proportion, and average casualties per harmful accident.
Instructions: Visualize total accidents by time of
day (hour).
Note: This solution has been completed for you.
dat %>%
group_by(hour) %>%
summarize(n = n()) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlab = "Hour",
ylab = "Total Number of Accidents",
main = "Total Number of Accidents by Time of Day")Instructions: Visualize total injuries and
fatalities by time of day (hour).
# Code
# total injuries and fatalities by hour
injury_fatality_by_hour <- dat %>%
group_by(hour) %>%
summarize(
total_harm = sum(Totalinjuries + Totalfatalities, na.rm = TRUE) # Sum injuries and fatalities together
)
# Plot total harm (injuries + fatalities) by hour
plot(injury_fatality_by_hour$hour, injury_fatality_by_hour$total_harm,
type = "b", # Line and point plot
bty = "n", # No box around the plot
pch = 19, # Solid circle points
cex = 1.5, # Increase point size
col = "purple", # Use a single color for combined harm
xlab = "Hour",
ylab = "Total Injuries + Fatalities",
main = "Total Harm (Injuries + Fatalities) by Time of Day")
# gridlines for better readability
grid()Instructions: Visualize the proportion of harmful
accidents out of all accidents by time of day (hour).
# Code
library(dplyr)
# the proportion of harmful accidents (injuries or fatalities > 0) by hour
harmful_accidents_by_hour <- dat %>%
group_by(hour) %>%
summarize(
total_accidents = n(), # Total accidents per hour
harmful_accidents = sum((Totalinjuries > 0 | Totalfatalities > 0), na.rm = TRUE), # Harmful accidents per hour
prop_harmful = harmful_accidents / total_accidents # Proportion of harmful accidents
)
# Plot the proportion of harmful accidents by time of day
plot(harmful_accidents_by_hour$hour, harmful_accidents_by_hour$prop_harmful,
type = "b", # Line and point plot
bty = "n", # No box around the plot
pch = 19, # Solid circle points
cex = 1.5, # Increase point size
col = "red", # Use red for harmful accidents proportion
xlab = "Hour",
ylab = "Proportion of Harmful Accidents",
main = "Proportion of Harmful Accidents by Time of Day")
# gridlines for better readability
grid()Instructions: Visualize average injuries or fatalities per harmful collision.
# Code
library(dplyr)
# Compute the average (injuries + fatalities) per harmful collision by hour
avg_harmful_by_hour <- dat %>%
filter(Totalinjuries > 0 | Totalfatalities > 0) %>% # Keep only harmful accidents
group_by(hour) %>%
summarize(
avg_harm = mean(Totalinjuries + Totalfatalities, na.rm = TRUE) # Average total harm (injuries + fatalities)
)
# Plot average harm per harmful collision
plot(avg_harmful_by_hour$hour, avg_harmful_by_hour$avg_harm,
type = "b", # Line and point plot
bty = "n", # No box around the plot
pch = 19, # Solid circle points
cex = 1.5, # Increase point size
col = "purple", # Use a single color for combined harm
xlab = "Hour",
ylab = "Average Harm per Harmful Collision",
main = "Avg Injuries + Fatalities per Harmful Collision")
# gridlines for better readability
grid()The following code is used for preprocessing tasks in the above solutions.
These expressions transform, format, and create myriad new variables used in this report.
date.vec <- strptime(dat$DateTime,
format = "%m/%d/%y %H:%M") # Create date fields
dat$hour <- format(date.vec, format = "%H")
dat$month <- format(date.vec, format = "%b")
dat$day <- format(date.vec, format = "%a")
dat$day365 <- format(date.vec, format = "%j")
dat$week <- format(date.vec, format = "%V")
dat$day <- factor(dat$day,
levels = c("Mon",
"Tue",
"Wed",
"Thu",
"Fri",
"Sat",
"Sun")) # Order weekdays
dat$hour12 <- format(date.vec,
format="%l %p") # Create 12-hour format
time.levels <- c("12 AM", paste(1:11, "AM"),
"12 PM", paste(1:11, "PM"))
dat$hour12 <- factor(dat$hour12,
levels = time.levels) # Order time intervals
age.labels <- paste0("Age ",
c(16,18,25,35,45,55,65,75), "-",
c(18,25,35,45,55,65,75,100) )
dat$age <- cut(dat$Age_Drv1,
breaks = c(16,18,25,
35,45,55,
65,75,100),
labels = age.labels) # Discretize age ranges