The following analysis uses data comprised of traffic accidents in the city of Tempe, Arizona. They constitute 28,470 observations and 32 variables. You can find more details in the dataset documentation at the Tempe Open Data Portal.
All data preprocessing are available in Appendix A: Data Preprocessing.
The following provides brief exploratory summaries on key data points.
Question: How many accidents happened on Mondays?
Note: This solution has been completed for you.
## [1] 4094
Answer: 4,094 accidents occurred on Mondays.
Question: What proportion of accidents each week occur on Monday?
# Code
monday_weekly_props <- dat %>%
group_by(week) %>%
summarise(
total_accidents = n(),
monday_accidents = sum(day == "Mon", na.rm = TRUE),
prop_monday = monday_accidents / total_accidents
) %>%
arrange(as.integer(week))
head(monday_weekly_props) %>% pander()| week | total_accidents | monday_accidents | prop_monday |
|---|---|---|---|
| 01 | 460 | 57 | 0.1239 |
| 02 | 624 | 106 | 0.1699 |
| 03 | 686 | 71 | 0.1035 |
| 04 | 628 | 71 | 0.1131 |
| 05 | 651 | 96 | 0.1475 |
| 06 | 678 | 96 | 0.1416 |
monday_weekly_props <- dat %>%
group_by(week) %>%
summarise(
prop_monday = mean(day == "Mon", na.rm = TRUE)
) %>%
arrange(as.integer(week))
head(monday_weekly_props) %>% pander()| week | prop_monday |
|---|---|
| 01 | 0.1239 |
| 02 | 0.1699 |
| 03 | 0.1035 |
| 04 | 0.1131 |
| 05 | 0.1475 |
| 06 | 0.1416 |
# Finding the average per week
dat %>%
group_by(week) %>%
summarise(prop_monday = mean(day == "Mon", na.rm = TRUE)) %>%
summarise(avg_weekly_prop = mean(prop_monday, na.rm = TRUE)) %>%
pander()| avg_weekly_prop |
|---|
| 0.1444 |
Answer: 0.14% of all accidents occur on Mondays.
Question: What proportion of accidents on Mondays result in harm?
Note: “Harm” is defined as any accident that causes at least one injury or fatality, a.k.a. a “casualty”.
## [1] "Incidentid" "DateTime" "Year"
## [4] "StreetName" "CrossStreet" "Distance"
## [7] "JunctionRelation" "Totalinjuries" "Totalfatalities"
## [10] "Injuryseverity" "Collisionmanner" "Lightcondition"
## [13] "Weather" "SurfaceCondition" "Unittype_One"
## [16] "Age_Drv1" "Gender_Drv1" "Traveldirection_One"
## [19] "Unitaction_One" "Violation1_Drv1" "AlcoholUse_Drv1"
## [22] "DrugUse_Drv1" "Unittype_Two" "Age_Drv2"
## [25] "Gender_Drv2" "Traveldirection_Two" "Unitaction_Two"
## [28] "Violation1_Drv2" "AlcoholUse_Drv2" "DrugUse_Drv2"
## [31] "Latitude" "Longitude" "hour"
## [34] "month" "day" "day365"
## [37] "week" "hour12" "age"
dat %>%
filter(day == "Mon") %>%
summarise(
prop_harm = mean(
(Totalinjuries > 0) | (Totalfatalities > 0),
na.rm = TRUE
)
) %>%
pander()| prop_harm |
|---|
| 0.2985 |
Answer: 0.3% of all Monday accidents have at least one casualty.
Question: What is the most common type of accident (Collisionmanner) that occurs on Mondays?
Answer: Rear End collisions are the most common accident types.
Question: Are there differences in the proportion of accidents that result in harm each day of the week?
# Code
harm_by_day <- dat %>%
group_by(day) %>%
summarise(
prop_harm = mean(
(Totalinjuries > 0) | (Totalfatalities > 0),
na.rm = TRUE
)
) %>%
arrange(desc(prop_harm))
harm_by_day %>% pander()| day | prop_harm |
|---|---|
| Thu | 0.327 |
| Tue | 0.3228 |
| Wed | 0.3148 |
| Mon | 0.2985 |
| Fri | 0.2958 |
| Sun | 0.2685 |
| Sat | 0.2645 |
Answer: Significantly more accidents occur on Thursday.
Instructions: Create a table that reports the following for each day of the week.
dat %>%
group_by(day) %>%
mutate(harm_acc = Totalinjuries > 0 | Totalfatalities > 0) %>%
summarize(n = n(),
injuries = sum(Totalinjuries),
fatalities = sum(Totalfatalities),
harm.rate = mean(harm_acc))
The following provides summaries of accidents by age groups.
Instructions: Create a table of counts of accidents
by time of day (hour12) and age of driver
(age).
Question: Which age group has the largest number of accidents at 7AM?
| hour12 | age | n |
|---|---|---|
| 12 AM | Age 16-18 | 10 |
| 12 AM | Age 18-25 | 109 |
| 12 AM | Age 25-35 | 90 |
| 12 AM | Age 35-45 | 16 |
| 12 AM | Age 45-55 | 11 |
| 12 AM | Age 55-65 | 4 |
| 12 AM | Age 65-75 | 1 |
| 12 AM | NA | 48 |
| 1 AM | Age 16-18 | 84 |
| 1 AM | Age 18-25 | 587 |
| 1 AM | Age 25-35 | 324 |
| 1 AM | Age 35-45 | 190 |
| 1 AM | Age 45-55 | 184 |
| 1 AM | Age 55-65 | 153 |
| 1 AM | Age 65-75 | 85 |
| 1 AM | Age 75-100 | 43 |
| 1 AM | NA | 182 |
| 2 AM | Age 16-18 | 5 |
| 2 AM | Age 18-25 | 154 |
| 2 AM | Age 25-35 | 82 |
| 2 AM | Age 35-45 | 28 |
| 2 AM | Age 45-55 | 9 |
| 2 AM | Age 55-65 | 5 |
| 2 AM | Age 65-75 | 1 |
| 2 AM | NA | 78 |
| 3 AM | Age 16-18 | 109 |
| 3 AM | Age 18-25 | 585 |
| 3 AM | Age 25-35 | 349 |
| 3 AM | Age 35-45 | 206 |
| 3 AM | Age 45-55 | 187 |
| 3 AM | Age 55-65 | 138 |
| 3 AM | Age 65-75 | 85 |
| 3 AM | Age 75-100 | 50 |
| 3 AM | NA | 219 |
| 4 AM | Age 16-18 | 3 |
| 4 AM | Age 18-25 | 80 |
| 4 AM | Age 25-35 | 35 |
| 4 AM | Age 35-45 | 15 |
| 4 AM | Age 45-55 | 5 |
| 4 AM | Age 55-65 | 3 |
| 4 AM | NA | 49 |
| 5 AM | Age 16-18 | 134 |
| 5 AM | Age 18-25 | 674 |
| 5 AM | Age 25-35 | 419 |
| 5 AM | Age 35-45 | 270 |
| 5 AM | Age 45-55 | 228 |
| 5 AM | Age 55-65 | 155 |
| 5 AM | Age 65-75 | 92 |
| 5 AM | Age 75-100 | 55 |
| 5 AM | NA | 274 |
| 6 AM | Age 16-18 | 4 |
| 6 AM | Age 18-25 | 43 |
| 6 AM | Age 25-35 | 24 |
| 6 AM | Age 35-45 | 11 |
| 6 AM | Age 45-55 | 16 |
| 6 AM | Age 55-65 | 4 |
| 6 AM | Age 65-75 | 3 |
| 6 AM | Age 75-100 | 1 |
| 6 AM | NA | 19 |
| 7 AM | Age 16-18 | 157 |
| 7 AM | Age 18-25 | 835 |
| 7 AM | Age 25-35 | 563 |
| 7 AM | Age 35-45 | 314 |
| 7 AM | Age 45-55 | 280 |
| 7 AM | Age 55-65 | 184 |
| 7 AM | Age 65-75 | 83 |
| 7 AM | Age 75-100 | 46 |
| 7 AM | NA | 296 |
| 8 AM | Age 16-18 | 4 |
| 8 AM | Age 18-25 | 98 |
| 8 AM | Age 25-35 | 46 |
| 8 AM | Age 35-45 | 39 |
| 8 AM | Age 45-55 | 40 |
| 8 AM | Age 55-65 | 26 |
| 8 AM | Age 65-75 | 8 |
| 8 AM | Age 75-100 | 5 |
| 8 AM | NA | 34 |
| 9 AM | Age 16-18 | 152 |
| 9 AM | Age 18-25 | 989 |
| 9 AM | Age 25-35 | 582 |
| 9 AM | Age 35-45 | 356 |
| 9 AM | Age 45-55 | 262 |
| 9 AM | Age 55-65 | 168 |
| 9 AM | Age 65-75 | 68 |
| 9 AM | Age 75-100 | 41 |
| 9 AM | NA | 320 |
| 10 AM | Age 16-18 | 24 |
| 10 AM | Age 18-25 | 164 |
| 10 AM | Age 25-35 | 122 |
| 10 AM | Age 35-45 | 90 |
| 10 AM | Age 45-55 | 79 |
| 10 AM | Age 55-65 | 57 |
| 10 AM | Age 65-75 | 21 |
| 10 AM | Age 75-100 | 7 |
| 10 AM | NA | 83 |
| 11 AM | Age 16-18 | 106 |
| 11 AM | Age 18-25 | 635 |
| 11 AM | Age 25-35 | 381 |
| 11 AM | Age 35-45 | 226 |
| 11 AM | Age 45-55 | 174 |
| 11 AM | Age 55-65 | 119 |
| 11 AM | Age 65-75 | 59 |
| 11 AM | Age 75-100 | 41 |
| 11 AM | NA | 233 |
| 12 PM | Age 16-18 | 77 |
| 12 PM | Age 18-25 | 408 |
| 12 PM | Age 25-35 | 371 |
| 12 PM | Age 35-45 | 243 |
| 12 PM | Age 45-55 | 175 |
| 12 PM | Age 55-65 | 116 |
| 12 PM | Age 65-75 | 39 |
| 12 PM | Age 75-100 | 17 |
| 12 PM | NA | 160 |
| 1 PM | Age 16-18 | 74 |
| 1 PM | Age 18-25 | 376 |
| 1 PM | Age 25-35 | 213 |
| 1 PM | Age 35-45 | 126 |
| 1 PM | Age 45-55 | 101 |
| 1 PM | Age 55-65 | 62 |
| 1 PM | Age 65-75 | 37 |
| 1 PM | Age 75-100 | 13 |
| 1 PM | NA | 183 |
| 2 PM | Age 16-18 | 72 |
| 2 PM | Age 18-25 | 419 |
| 2 PM | Age 25-35 | 329 |
| 2 PM | Age 35-45 | 205 |
| 2 PM | Age 45-55 | 149 |
| 2 PM | Age 55-65 | 107 |
| 2 PM | Age 65-75 | 56 |
| 2 PM | Age 75-100 | 31 |
| 2 PM | NA | 151 |
| 3 PM | Age 16-18 | 48 |
| 3 PM | Age 18-25 | 338 |
| 3 PM | Age 25-35 | 170 |
| 3 PM | Age 35-45 | 91 |
| 3 PM | Age 45-55 | 71 |
| 3 PM | Age 55-65 | 44 |
| 3 PM | Age 65-75 | 20 |
| 3 PM | Age 75-100 | 12 |
| 3 PM | NA | 137 |
| 4 PM | Age 16-18 | 42 |
| 4 PM | Age 18-25 | 276 |
| 4 PM | Age 25-35 | 200 |
| 4 PM | Age 35-45 | 136 |
| 4 PM | Age 45-55 | 111 |
| 4 PM | Age 55-65 | 94 |
| 4 PM | Age 65-75 | 59 |
| 4 PM | Age 75-100 | 38 |
| 4 PM | NA | 110 |
| 5 PM | Age 16-18 | 58 |
| 5 PM | Age 18-25 | 295 |
| 5 PM | Age 25-35 | 135 |
| 5 PM | Age 35-45 | 70 |
| 5 PM | Age 45-55 | 52 |
| 5 PM | Age 55-65 | 51 |
| 5 PM | Age 65-75 | 13 |
| 5 PM | Age 75-100 | 6 |
| 5 PM | NA | 135 |
| 6 PM | Age 16-18 | 36 |
| 6 PM | Age 18-25 | 298 |
| 6 PM | Age 25-35 | 207 |
| 6 PM | Age 35-45 | 115 |
| 6 PM | Age 45-55 | 108 |
| 6 PM | Age 55-65 | 95 |
| 6 PM | Age 65-75 | 59 |
| 6 PM | Age 75-100 | 46 |
| 6 PM | NA | 118 |
| 7 PM | Age 16-18 | 54 |
| 7 PM | Age 18-25 | 241 |
| 7 PM | Age 25-35 | 100 |
| 7 PM | Age 35-45 | 46 |
| 7 PM | Age 45-55 | 38 |
| 7 PM | Age 55-65 | 19 |
| 7 PM | Age 65-75 | 13 |
| 7 PM | Age 75-100 | 2 |
| 7 PM | NA | 101 |
| 8 PM | Age 16-18 | 66 |
| 8 PM | Age 18-25 | 387 |
| 8 PM | Age 25-35 | 249 |
| 8 PM | Age 35-45 | 144 |
| 8 PM | Age 45-55 | 154 |
| 8 PM | Age 55-65 | 121 |
| 8 PM | Age 65-75 | 81 |
| 8 PM | Age 75-100 | 56 |
| 8 PM | NA | 157 |
| 9 PM | Age 16-18 | 38 |
| 9 PM | Age 18-25 | 169 |
| 9 PM | Age 25-35 | 88 |
| 9 PM | Age 35-45 | 40 |
| 9 PM | Age 45-55 | 35 |
| 9 PM | Age 55-65 | 15 |
| 9 PM | Age 65-75 | 8 |
| 9 PM | Age 75-100 | 3 |
| 9 PM | NA | 85 |
| 10 PM | Age 16-18 | 24 |
| 10 PM | Age 18-25 | 129 |
| 10 PM | Age 25-35 | 76 |
| 10 PM | Age 35-45 | 43 |
| 10 PM | Age 45-55 | 11 |
| 10 PM | Age 55-65 | 9 |
| 10 PM | Age 65-75 | 2 |
| 10 PM | NA | 94 |
| 11 PM | Age 16-18 | 77 |
| 11 PM | Age 18-25 | 507 |
| 11 PM | Age 25-35 | 301 |
| 11 PM | Age 35-45 | 230 |
| 11 PM | Age 45-55 | 199 |
| 11 PM | Age 55-65 | 129 |
| 11 PM | Age 65-75 | 77 |
| 11 PM | Age 75-100 | 57 |
| 11 PM | NA | 147 |
# Filter age and hour
dat %>%
filter(hour12 == "7 AM") %>%
count(age) %>%
arrange(desc(n)) %>%
pander()| age | n |
|---|---|
| Age 18-25 | 835 |
| Age 25-35 | 563 |
| Age 35-45 | 314 |
| NA | 296 |
| Age 45-55 | 280 |
| Age 55-65 | 184 |
| Age 16-18 | 157 |
| Age 65-75 | 83 |
| Age 75-100 | 46 |
Answer: Drivers of Age 18-25 have the greatest number of accidents from 7:00 to 7:59 AM.
Instructions: Create a new table of time of day and
age group that reports the proportion of accidents at “7 AM”
(hour12) for each age group (age). The
proportions within each age group should sum to one.
Objective: The table should contain the following columns and initial values:
| age | hour12 | n | n.age | n.hour | p | p.age | p.hour |
|---|---|---|---|---|---|---|---|
| Age 16-18 | 7 AM | 77 | 1458 | 1606 | 0.05 | 0.05 | 0.05 |
| Age 18-25 | 7 AM | 408 | 8796 | 1606 | 0.25 | 0.05 | 0.25 |
| Age 25-35 | 7 AM | 371 | 5456 | 1606 | 0.23 | 0.07 | 0.23 |
Solution: The solution has been partially provided. Use the hints provided to create the table.
dat %>%
group_by(hour12, age) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(age) %>%
mutate(
n.age = sum(n),
n.hour = sum(n[hour12 == "7 AM"]),
p = n / sum(n),
p.age = n / n.age,
p.hour = n / sum(n[hour12 == "7 AM"])
) %>%
ungroup() %>%
filter(hour12 == "7 AM") %>%
arrange(desc(n)) %>%
pander() | hour12 | age | n | n.age | n.hour | p | p.age | p.hour |
|---|---|---|---|---|---|---|---|
| 7 AM | Age 18-25 | 835 | 8796 | 835 | 0.09493 | 0.09493 | 1 |
| 7 AM | Age 25-35 | 563 | 5456 | 563 | 0.1032 | 0.1032 | 1 |
| 7 AM | Age 35-45 | 314 | 3250 | 314 | 0.09662 | 0.09662 | 1 |
| 7 AM | NA | 296 | 3413 | 296 | 0.08673 | 0.08673 | 1 |
| 7 AM | Age 45-55 | 280 | 2679 | 280 | 0.1045 | 0.1045 | 1 |
| 7 AM | Age 55-65 | 184 | 1878 | 184 | 0.09798 | 0.09798 | 1 |
| 7 AM | Age 16-18 | 157 | 1458 | 157 | 0.1077 | 0.1077 | 1 |
| 7 AM | Age 65-75 | 83 | 970 | 83 | 0.08557 | 0.08557 | 1 |
| 7 AM | Age 75-100 | 46 | 570 | 46 | 0.0807 | 0.0807 | 1 |
The following reports the accidents, casualties, proportion, and average casualties per harmful accident.
Instructions: Visualize total accidents by time of
day (hour).
Note: This solution has been completed for you.
dat %>%
group_by(hour) %>%
summarize(n = n()) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlab = "Hour",
ylab = "Total Number of Accidents",
main = "Total Number of Accidents by Time of Day")Instructions: Visualize total injuries and
fatalities by time of day (hour).
# Code
dat %>%
group_by(hour) %>%
summarise(
total_casualties = sum(Totalinjuries + Totalfatalities, na.rm = TRUE)
) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlab = "Hour",
ylab = "Total Casualties",
main = "Total Casualties by Time of Day")Instructions: Visualize the proportion of harmful
accidents out of all accidents by time of day (hour).
# Code
harm_by_hour <- dat %>%
group_by(hour) %>%
summarise(
total_accidents = n(),
harmful_accidents = sum(
(Totalinjuries > 0) | (Totalfatalities > 0),
na.rm = TRUE
)
) %>%
mutate(prop_harm = harmful_accidents / total_accidents)
plot(harm_by_hour$hour,
harm_by_hour$prop_harm,
type = "b",
bty = "n",
pch = 19,
cex = 2,
xlab = "Hour",
ylab = "Proportion of Harmful Accidents",
main = "Proportion of Harmful Accidents by Time of Day")Instructions: Visualize average injuries or fatalities per harmful collision.
# Code
harmful_avg <- dat %>%
filter((Totalinjuries > 0) | (Totalfatalities > 0)) %>%
group_by(hour) %>%
summarise(
avg_harm = mean(Totalinjuries + Totalfatalities, na.rm = TRUE)
)
plot(harmful_avg$hour,
harmful_avg$avg_harm,
type = "b",
bty = "n",
pch = 19,
cex = 2,
xlab = "Hour",
ylab = "Average Injuries/Fatalities per Harmful Accident",
main = "Average Harm per Harmful Accident by Time of Day")The following code is used for preprocessing tasks in the above solutions.
These expressions transform, format, and create myriad new variables used in this report.
date.vec <- strptime(dat$DateTime,
format = "%m/%d/%y %H:%M") # Create date fields
dat$hour <- format(date.vec, format = "%H")
dat$month <- format(date.vec, format = "%b")
dat$day <- format(date.vec, format = "%a")
dat$day365 <- format(date.vec, format = "%j")
dat$week <- format(date.vec, format = "%V")
dat$day <- factor(dat$day,
levels = c("Mon",
"Tue",
"Wed",
"Thu",
"Fri",
"Sat",
"Sun")) # Order weekdays
dat$hour12 <- format(date.vec,
format="%l %p") # Create 12-hour format
time.levels <- c("12 AM", paste(1:11, "AM"),
"12 PM", paste(1:11, "PM"))
dat$hour12 <- factor(dat$hour12,
levels = time.levels) # Order time intervals
age.labels <- paste0("Age ",
c(16,18,25,35,45,55,65,75), "-",
c(18,25,35,45,55,65,75,100) )
dat$age <- cut(dat$Age_Drv1,
breaks = c(16,18,25,
35,45,55,
65,75,100),
labels = age.labels) # Discretize age ranges