Source Data

The following analysis uses data comprised of traffic accidents in the city of Tempe, Arizona. They constitute 28,470 observations and 32 variables. You can find more details in the dataset documentation at the Tempe Open Data Portal.


Preview

We can explore the first six observations in the following table using function head().

head(dat)



Preprocessing

All data preprocessing are available in Appendix A: Data Preprocessing.



Part I: Summaries

The following provides brief exploratory summaries on key data points.


1. Mondays Accidents

Question: How many accidents happened on Mondays?

Note: This solution has been completed for you.


sum(dat$day == "Mon", 
    na.rm = TRUE)
## [1] 4094


Answer: 4,094 accidents occurred on Mondays.


2. Monday Accidents (%)

Question: What proportion of accidents each week occur on Monday?


# Code
monday_weekly_props <- dat %>%
  group_by(week) %>%
  summarise(
    total_accidents  = n(),
    monday_accidents = sum(day == "Mon", na.rm = TRUE),
    prop_monday      = monday_accidents / total_accidents
  ) %>%
  arrange(as.integer(week))
head(monday_weekly_props) %>% pander()
week total_accidents monday_accidents prop_monday
01 460 57 0.1239
02 624 106 0.1699
03 686 71 0.1035
04 628 71 0.1131
05 651 96 0.1475
06 678 96 0.1416
monday_weekly_props <- dat %>%
  group_by(week) %>%
  summarise(
    prop_monday = mean(day == "Mon", na.rm = TRUE)
  ) %>%
  arrange(as.integer(week))

head(monday_weekly_props) %>% pander()
week prop_monday
01 0.1239
02 0.1699
03 0.1035
04 0.1131
05 0.1475
06 0.1416
# Finding the average per week
dat %>%
  group_by(week) %>%
  summarise(prop_monday = mean(day == "Mon", na.rm = TRUE)) %>%
  summarise(avg_weekly_prop = mean(prop_monday, na.rm = TRUE)) %>%
  pander()
avg_weekly_prop
0.1444


Answer: 0.14% of all accidents occur on Mondays.


3. Harmful Monday Accidents (%)

Question: What proportion of accidents on Mondays result in harm?

Note: “Harm” is defined as any accident that causes at least one injury or fatality, a.k.a. a “casualty”.


# Code
names(dat)
##  [1] "Incidentid"          "DateTime"            "Year"               
##  [4] "StreetName"          "CrossStreet"         "Distance"           
##  [7] "JunctionRelation"    "Totalinjuries"       "Totalfatalities"    
## [10] "Injuryseverity"      "Collisionmanner"     "Lightcondition"     
## [13] "Weather"             "SurfaceCondition"    "Unittype_One"       
## [16] "Age_Drv1"            "Gender_Drv1"         "Traveldirection_One"
## [19] "Unitaction_One"      "Violation1_Drv1"     "AlcoholUse_Drv1"    
## [22] "DrugUse_Drv1"        "Unittype_Two"        "Age_Drv2"           
## [25] "Gender_Drv2"         "Traveldirection_Two" "Unitaction_Two"     
## [28] "Violation1_Drv2"     "AlcoholUse_Drv2"     "DrugUse_Drv2"       
## [31] "Latitude"            "Longitude"           "hour"               
## [34] "month"               "day"                 "day365"             
## [37] "week"                "hour12"              "age"
 dat %>%
   filter(day == "Mon") %>%
   summarise(
     prop_harm = mean(
       (Totalinjuries > 0) | (Totalfatalities > 0),
       na.rm = TRUE
     )
   ) %>%
   pander()
prop_harm
0.2985


Answer: 0.3% of all Monday accidents have at least one casualty.


4. Most Common Accidents

Question: What is the most common type of accident (Collisionmanner) that occurs on Mondays?


dat %>%
  filter(day == "Mon") %>%
  count(Collisionmanner) %>%
  arrange(desc(n))


Answer: Rear End collisions are the most common accident types.


5. Differences in Accidents

Question: Are there differences in the proportion of accidents that result in harm each day of the week?


# Code
harm_by_day <- dat %>%
  group_by(day) %>%
  summarise(
    prop_harm = mean(
      (Totalinjuries > 0) | (Totalfatalities > 0),
      na.rm = TRUE
    )
  ) %>%
  arrange(desc(prop_harm))

harm_by_day %>% pander()
day prop_harm
Thu 0.327
Tue 0.3228
Wed 0.3148
Mon 0.2985
Fri 0.2958
Sun 0.2685
Sat 0.2645


Answer: Significantly more accidents occur on Thursday.


6. Accident Stats by Weekday

Instructions: Create a table that reports the following for each day of the week.

  • Total accidents
  • Total injuries
  • Total fatalities
  • Proportion of harmful accidents (i.e. casualties)


dat %>%
  group_by(day) %>%
  mutate(harm_acc = Totalinjuries > 0 | Totalfatalities > 0) %>%
  summarize(n = n(),
            injuries = sum(Totalinjuries),
            fatalities = sum(Totalfatalities),
            harm.rate = mean(harm_acc))



Part II: Age Groups

The following provides summaries of accidents by age groups.


1. Accidents by Hour & Age

Instructions: Create a table of counts of accidents by time of day (hour12) and age of driver (age).

Question: Which age group has the largest number of accidents at 7AM?


# Table of Counts
dat %>%
  count(hour12, age) %>%
  pander()
hour12 age n
12 AM Age 16-18 10
12 AM Age 18-25 109
12 AM Age 25-35 90
12 AM Age 35-45 16
12 AM Age 45-55 11
12 AM Age 55-65 4
12 AM Age 65-75 1
12 AM NA 48
1 AM Age 16-18 84
1 AM Age 18-25 587
1 AM Age 25-35 324
1 AM Age 35-45 190
1 AM Age 45-55 184
1 AM Age 55-65 153
1 AM Age 65-75 85
1 AM Age 75-100 43
1 AM NA 182
2 AM Age 16-18 5
2 AM Age 18-25 154
2 AM Age 25-35 82
2 AM Age 35-45 28
2 AM Age 45-55 9
2 AM Age 55-65 5
2 AM Age 65-75 1
2 AM NA 78
3 AM Age 16-18 109
3 AM Age 18-25 585
3 AM Age 25-35 349
3 AM Age 35-45 206
3 AM Age 45-55 187
3 AM Age 55-65 138
3 AM Age 65-75 85
3 AM Age 75-100 50
3 AM NA 219
4 AM Age 16-18 3
4 AM Age 18-25 80
4 AM Age 25-35 35
4 AM Age 35-45 15
4 AM Age 45-55 5
4 AM Age 55-65 3
4 AM NA 49
5 AM Age 16-18 134
5 AM Age 18-25 674
5 AM Age 25-35 419
5 AM Age 35-45 270
5 AM Age 45-55 228
5 AM Age 55-65 155
5 AM Age 65-75 92
5 AM Age 75-100 55
5 AM NA 274
6 AM Age 16-18 4
6 AM Age 18-25 43
6 AM Age 25-35 24
6 AM Age 35-45 11
6 AM Age 45-55 16
6 AM Age 55-65 4
6 AM Age 65-75 3
6 AM Age 75-100 1
6 AM NA 19
7 AM Age 16-18 157
7 AM Age 18-25 835
7 AM Age 25-35 563
7 AM Age 35-45 314
7 AM Age 45-55 280
7 AM Age 55-65 184
7 AM Age 65-75 83
7 AM Age 75-100 46
7 AM NA 296
8 AM Age 16-18 4
8 AM Age 18-25 98
8 AM Age 25-35 46
8 AM Age 35-45 39
8 AM Age 45-55 40
8 AM Age 55-65 26
8 AM Age 65-75 8
8 AM Age 75-100 5
8 AM NA 34
9 AM Age 16-18 152
9 AM Age 18-25 989
9 AM Age 25-35 582
9 AM Age 35-45 356
9 AM Age 45-55 262
9 AM Age 55-65 168
9 AM Age 65-75 68
9 AM Age 75-100 41
9 AM NA 320
10 AM Age 16-18 24
10 AM Age 18-25 164
10 AM Age 25-35 122
10 AM Age 35-45 90
10 AM Age 45-55 79
10 AM Age 55-65 57
10 AM Age 65-75 21
10 AM Age 75-100 7
10 AM NA 83
11 AM Age 16-18 106
11 AM Age 18-25 635
11 AM Age 25-35 381
11 AM Age 35-45 226
11 AM Age 45-55 174
11 AM Age 55-65 119
11 AM Age 65-75 59
11 AM Age 75-100 41
11 AM NA 233
12 PM Age 16-18 77
12 PM Age 18-25 408
12 PM Age 25-35 371
12 PM Age 35-45 243
12 PM Age 45-55 175
12 PM Age 55-65 116
12 PM Age 65-75 39
12 PM Age 75-100 17
12 PM NA 160
1 PM Age 16-18 74
1 PM Age 18-25 376
1 PM Age 25-35 213
1 PM Age 35-45 126
1 PM Age 45-55 101
1 PM Age 55-65 62
1 PM Age 65-75 37
1 PM Age 75-100 13
1 PM NA 183
2 PM Age 16-18 72
2 PM Age 18-25 419
2 PM Age 25-35 329
2 PM Age 35-45 205
2 PM Age 45-55 149
2 PM Age 55-65 107
2 PM Age 65-75 56
2 PM Age 75-100 31
2 PM NA 151
3 PM Age 16-18 48
3 PM Age 18-25 338
3 PM Age 25-35 170
3 PM Age 35-45 91
3 PM Age 45-55 71
3 PM Age 55-65 44
3 PM Age 65-75 20
3 PM Age 75-100 12
3 PM NA 137
4 PM Age 16-18 42
4 PM Age 18-25 276
4 PM Age 25-35 200
4 PM Age 35-45 136
4 PM Age 45-55 111
4 PM Age 55-65 94
4 PM Age 65-75 59
4 PM Age 75-100 38
4 PM NA 110
5 PM Age 16-18 58
5 PM Age 18-25 295
5 PM Age 25-35 135
5 PM Age 35-45 70
5 PM Age 45-55 52
5 PM Age 55-65 51
5 PM Age 65-75 13
5 PM Age 75-100 6
5 PM NA 135
6 PM Age 16-18 36
6 PM Age 18-25 298
6 PM Age 25-35 207
6 PM Age 35-45 115
6 PM Age 45-55 108
6 PM Age 55-65 95
6 PM Age 65-75 59
6 PM Age 75-100 46
6 PM NA 118
7 PM Age 16-18 54
7 PM Age 18-25 241
7 PM Age 25-35 100
7 PM Age 35-45 46
7 PM Age 45-55 38
7 PM Age 55-65 19
7 PM Age 65-75 13
7 PM Age 75-100 2
7 PM NA 101
8 PM Age 16-18 66
8 PM Age 18-25 387
8 PM Age 25-35 249
8 PM Age 35-45 144
8 PM Age 45-55 154
8 PM Age 55-65 121
8 PM Age 65-75 81
8 PM Age 75-100 56
8 PM NA 157
9 PM Age 16-18 38
9 PM Age 18-25 169
9 PM Age 25-35 88
9 PM Age 35-45 40
9 PM Age 45-55 35
9 PM Age 55-65 15
9 PM Age 65-75 8
9 PM Age 75-100 3
9 PM NA 85
10 PM Age 16-18 24
10 PM Age 18-25 129
10 PM Age 25-35 76
10 PM Age 35-45 43
10 PM Age 45-55 11
10 PM Age 55-65 9
10 PM Age 65-75 2
10 PM NA 94
11 PM Age 16-18 77
11 PM Age 18-25 507
11 PM Age 25-35 301
11 PM Age 35-45 230
11 PM Age 45-55 199
11 PM Age 55-65 129
11 PM Age 65-75 77
11 PM Age 75-100 57
11 PM NA 147
# Filter age and hour

dat %>%
  filter(hour12 == "7 AM") %>%
  count(age) %>%
  arrange(desc(n)) %>%
  pander()
age n
Age 18-25 835
Age 25-35 563
Age 35-45 314
NA 296
Age 45-55 280
Age 55-65 184
Age 16-18 157
Age 65-75 83
Age 75-100 46


Answer: Drivers of Age 18-25 have the greatest number of accidents from 7:00 to 7:59 AM.


2. Accidents by Hour & Age (%)

Instructions: Create a new table of time of day and age group that reports the proportion of accidents at “7 AM” (hour12) for each age group (age). The proportions within each age group should sum to one.



Objective: The table should contain the following columns and initial values:

age hour12 n n.age n.hour p p.age p.hour
Age 16-18 7 AM 77 1458 1606 0.05 0.05 0.05
Age 18-25 7 AM 408 8796 1606 0.25 0.05 0.25
Age 25-35 7 AM 371 5456 1606 0.23 0.07 0.23



Solution: The solution has been partially provided. Use the hints provided to create the table.

dat %>% 
  group_by(hour12, age) %>% 
  summarise(n = n(), .groups = "drop") %>%     
  group_by(age) %>% 
  mutate(
    n.age  = sum(n),
    n.hour = sum(n[hour12 == "7 AM"]),
    p      = n / sum(n),
    p.age  = n / n.age,
    p.hour = n / sum(n[hour12 == "7 AM"])   
  ) %>% 
  ungroup() %>% 
  filter(hour12 == "7 AM") %>%   
  arrange(desc(n)) %>%            
  pander()                        
hour12 age n n.age n.hour p p.age p.hour
7 AM Age 18-25 835 8796 835 0.09493 0.09493 1
7 AM Age 25-35 563 5456 563 0.1032 0.1032 1
7 AM Age 35-45 314 3250 314 0.09662 0.09662 1
7 AM NA 296 3413 296 0.08673 0.08673 1
7 AM Age 45-55 280 2679 280 0.1045 0.1045 1
7 AM Age 55-65 184 1878 184 0.09798 0.09798 1
7 AM Age 16-18 157 1458 157 0.1077 0.1077 1
7 AM Age 65-75 83 970 83 0.08557 0.08557 1
7 AM Age 75-100 46 570 46 0.0807 0.0807 1



Part III: Rates of Harm

The following reports the accidents, casualties, proportion, and average casualties per harmful accident.


1. Accidents by Hour

Instructions: Visualize total accidents by time of day (hour).

Note: This solution has been completed for you.


dat %>%
  group_by(hour) %>%
  summarize(n = n()) %>%
  plot(type = "b",
       bty = "n",
       pch = 19,
       cex = 2,
       xlab = "Hour",
       ylab = "Total Number of Accidents",
       main = "Total Number of Accidents by Time of Day")


2. Total Casualties by Hour

Instructions: Visualize total injuries and fatalities by time of day (hour).


# Code
dat %>%
  group_by(hour) %>%
  summarise(
    total_casualties = sum(Totalinjuries + Totalfatalities, na.rm = TRUE)
  ) %>%
  plot(type = "b",
       bty = "n",
       pch = 19,
       cex = 2,
       xlab = "Hour",
       ylab = "Total Casualties",
       main = "Total Casualties by Time of Day")


3. Accidents with Casualties (%)

Instructions: Visualize the proportion of harmful accidents out of all accidents by time of day (hour).


# Code
harm_by_hour <- dat %>%
  group_by(hour) %>%
  summarise(
    total_accidents = n(),
    harmful_accidents = sum(
      (Totalinjuries > 0) | (Totalfatalities > 0),
      na.rm = TRUE
    )
  ) %>%
  mutate(prop_harm = harmful_accidents / total_accidents)

plot(harm_by_hour$hour,
     harm_by_hour$prop_harm,
     type = "b",
     bty = "n",
     pch = 19,
     cex = 2,
     xlab = "Hour",
     ylab = "Proportion of Harmful Accidents",
     main = "Proportion of Harmful Accidents by Time of Day")


4. Avg. Harm by Harmful Accident

Instructions: Visualize average injuries or fatalities per harmful collision.


# Code
harmful_avg <- dat %>%
  filter((Totalinjuries > 0) | (Totalfatalities > 0)) %>%
  group_by(hour) %>%
  summarise(
    avg_harm = mean(Totalinjuries + Totalfatalities, na.rm = TRUE)
  )
plot(harmful_avg$hour,
     harmful_avg$avg_harm,
     type = "b",
     bty = "n",
     pch = 19,
     cex = 2,
     xlab = "Hour",
     ylab = "Average Injuries/Fatalities per Harmful Accident",
     main = "Average Harm per Harmful Accident by Time of Day")


Appendix

The following code is used for preprocessing tasks in the above solutions.


A: Data Preprocessing

These expressions transform, format, and create myriad new variables used in this report.


date.vec <- strptime(dat$DateTime, 
                     format = "%m/%d/%y %H:%M")       # Create date fields

dat$hour   <- format(date.vec, format = "%H")
dat$month  <- format(date.vec, format = "%b")
dat$day    <- format(date.vec, format = "%a")
dat$day365 <- format(date.vec, format = "%j")
dat$week   <- format(date.vec, format = "%V")

dat$day <- factor(dat$day, 
                  levels = c("Mon",
                             "Tue",
                             "Wed",
                             "Thu",
                             "Fri",
                             "Sat",
                             "Sun"))                  # Order weekdays

dat$hour12 <- format(date.vec, 
                     format="%l %p")                  # Create 12-hour format

time.levels <- c("12 AM", paste(1:11, "AM"), 
                 "12 PM", paste(1:11, "PM"))

dat$hour12 <- factor(dat$hour12, 
                     levels = time.levels)            # Order time intervals

age.labels <- paste0("Age ", 
                     c(16,18,25,35,45,55,65,75), "-", 
                     c(18,25,35,45,55,65,75,100) )

dat$age <- cut(dat$Age_Drv1, 
               breaks = c(16,18,25,
                          35,45,55,
                          65,75,100), 
               labels = age.labels)                   # Discretize age ranges