The following analysis uses data comprised of traffic accidents in the city of Tempe, Arizona. They constitute 28,470 observations and 32 variables. You can find more details in the data set documentation at the Tempe Open Data Portal.
We can explore the first six observations in the following table
using function head().
head(dat)
All data pre-processing are available in Appendix A: Data Preprocessing.
The following provides brief exploratory summaries on key data points.
Question: How many accidents happened on Mondays?
Note: This solution has been completed for you.
sum(dat$day == "Mon",
na.rm = TRUE)## [1] 4094
Answer: 4,094 accidents occurred on Mondays.
Question: What proportion of accidents each week occur on Monday??
table(dat$day) %>% pander()| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 4094 | 4656 | 4711 | 4814 | 5006 | 3044 | 2145 |
mean(dat$day =="Mon", na.rm=TRUE)## [1] 0.1438005
Answer: 14.38% of all accidents occur on Mondays.
Question: What proportion of accidents on Mondays result in harm??
Note: “Harm” is defined as any accident that causes at least one injury or fatality, a.k.a. a “casualty”.
#head(dat, n=20)
dat %>%
group_by(day) %>%
summarise(accidents=n(), injuries = sum(Totalinjuries), fatalities = sum(Totalfatalities)) %>% pander()| day | accidents | injuries | fatalities |
|---|---|---|---|
| Mon | 4094 | 1644 | 13 |
| Tue | 4656 | 2056 | 8 |
| Wed | 4711 | 2144 | 9 |
| Thu | 4814 | 2204 | 10 |
| Fri | 5006 | 2103 | 12 |
| Sat | 3044 | 1192 | 11 |
| Sun | 2145 | 866 | 6 |
prop <- sum(dat$day == "Mon" & (dat$Totalinjuries > 0 | dat$Totalfatalities > 0))
prop/sum(dat$day == "Mon")## [1] 0.2984856
Answer: 29.85% of all Monday accidents have at least one casualty.
Question: What is the most common accident type on Mondays?
dat %>%
filter(day == "Mon") %>%
count(Collisionmanner) %>%
arrange(desc(n))Answer: Rear End collisions are the most common accident types.
Question: Are there differences in the proportion of accidents that result in harm each day of the week?
dat %>%
group_by(day) %>%
mutate(harm.accidents = Totalinjuries > 0 | Totalfatalities > 0) %>%
summarise(accidents=n(),
injuries = sum(Totalinjuries),
fatalities = sum(Totalfatalities),
prop.harm = round(mean(harm.accidents), 2)) %>%
pander()| day | accidents | injuries | fatalities | prop.harm |
|---|---|---|---|---|
| Mon | 4094 | 1644 | 13 | 0.3 |
| Tue | 4656 | 2056 | 8 | 0.32 |
| Wed | 4711 | 2144 | 9 | 0.31 |
| Thu | 4814 | 2204 | 10 | 0.33 |
| Fri | 5006 | 2103 | 12 | 0.3 |
| Sat | 3044 | 1192 | 11 | 0.26 |
| Sun | 2145 | 866 | 6 | 0.27 |
Answer: Significantly more accidents occur during the week (specifically Tuesday, Wednesday, and Thursday)..
Instructions: Create a table that reports the following for each day of the week.
dat %>%
group_by(day) %>%
mutate(harm_acc = Totalinjuries > 0 | Totalfatalities > 0) %>%
summarize(n = n(),
injuries = sum(Totalinjuries),
fatalities = sum(Totalfatalities),
harm.rate = round(mean(harm_acc),2)) %>%
pander()| day | n | injuries | fatalities | harm.rate |
|---|---|---|---|---|
| Mon | 4094 | 1644 | 13 | 0.3 |
| Tue | 4656 | 2056 | 8 | 0.32 |
| Wed | 4711 | 2144 | 9 | 0.31 |
| Thu | 4814 | 2204 | 10 | 0.33 |
| Fri | 5006 | 2103 | 12 | 0.3 |
| Sat | 3044 | 1192 | 11 | 0.26 |
| Sun | 2145 | 866 | 6 | 0.27 |
The following provides summaries of accidents by age groups.
Instructions: Create a table of counts of accidents
by time of day (hour12) and age of driver
(age).
Question: Which age group has the largest number of accidents at 7 AM?
table(dat$hour, dat$age)##
## Age 16-18 Age 18-25 Age 25-35 Age 35-45 Age 45-55 Age 55-65 Age 65-75
## 00 24 129 76 43 11 9 2
## 01 10 109 90 16 11 4 1
## 02 5 154 82 28 9 5 1
## 03 3 80 35 15 5 3 0
## 04 4 43 24 11 16 4 3
## 05 4 98 46 39 40 26 8
## 06 24 164 122 90 79 57 21
## 07 77 408 371 243 175 116 39
## 08 72 419 329 205 149 107 56
## 09 42 276 200 136 111 94 59
## 10 36 298 207 115 108 95 59
## 11 66 387 249 144 154 121 81
## 12 77 507 301 230 199 129 77
## 13 84 587 324 190 184 153 85
## 14 109 585 349 206 187 138 85
## 15 134 674 419 270 228 155 92
## 16 157 835 563 314 280 184 83
## 17 152 989 582 356 262 168 68
## 18 106 635 381 226 174 119 59
## 19 74 376 213 126 101 62 37
## 20 48 338 170 91 71 44 20
## 21 58 295 135 70 52 51 13
## 22 54 241 100 46 38 19 13
## 23 38 169 88 40 35 15 8
##
## Age 75-100
## 00 0
## 01 0
## 02 0
## 03 0
## 04 1
## 05 5
## 06 7
## 07 17
## 08 31
## 09 38
## 10 46
## 11 56
## 12 57
## 13 43
## 14 50
## 15 55
## 16 46
## 17 41
## 18 41
## 19 13
## 20 12
## 21 6
## 22 2
## 23 3
dat %>%
filter(dat$hour =="07") %>%
count(age, hour) %>%
pander()| age | hour | n |
|---|---|---|
| Age 16-18 | 07 | 77 |
| Age 18-25 | 07 | 408 |
| Age 25-35 | 07 | 371 |
| Age 35-45 | 07 | 243 |
| Age 45-55 | 07 | 175 |
| Age 55-65 | 07 | 116 |
| Age 65-75 | 07 | 39 |
| Age 75-100 | 07 | 17 |
| NA | 07 | 160 |
Answer: Drivers of Age 18-25 have the greatest number of accidents from 7:00 to 7:59 AM.
Instructions: Create a new table of time of day and
age group that reports the proportion of accidents at “7 AM”
(hour12) for each age group (age). The
proportions within each age group should sum to one.
Objective: The table should contain the following columns and initial values:
| age | hour12 | n | n.age | n.hour | p | p.age | p.hour |
|---|---|---|---|---|---|---|---|
| Age 16-18 | 7 AM | 77 | 1458 | 1606 | 0.05 | 0.05 | 0.05 |
| Age 18-25 | 7 AM | 408 | 8796 | 1606 | 0.25 | 0.05 | 0.25 |
| Age 25-35 | 7 AM | 371 | 5456 | 1606 | 0.23 | 0.07 | 0.23 |
Solution: The solution has been partially provided. Use the hints provided to create the table.
dat %>%
count(age, hour) %>%
group_by(age)%>%
mutate(n.age=sum(n), n.hour = sum(dat$hour == "07"))%>%
mutate(p = round(n/n.hour, 2), p.age = round(n/n.age, 2), p.hour = round(n/n.hour,2)) %>%
filter(hour == "07") %>%
pander()| age | hour | n | n.age | n.hour | p | p.age | p.hour |
|---|---|---|---|---|---|---|---|
| Age 16-18 | 07 | 77 | 1458 | 1606 | 0.05 | 0.05 | 0.05 |
| Age 18-25 | 07 | 408 | 8796 | 1606 | 0.25 | 0.05 | 0.25 |
| Age 25-35 | 07 | 371 | 5456 | 1606 | 0.23 | 0.07 | 0.23 |
| Age 35-45 | 07 | 243 | 3250 | 1606 | 0.15 | 0.07 | 0.15 |
| Age 45-55 | 07 | 175 | 2679 | 1606 | 0.11 | 0.07 | 0.11 |
| Age 55-65 | 07 | 116 | 1878 | 1606 | 0.07 | 0.06 | 0.07 |
| Age 65-75 | 07 | 39 | 970 | 1606 | 0.02 | 0.04 | 0.02 |
| Age 75-100 | 07 | 17 | 570 | 1606 | 0.01 | 0.03 | 0.01 |
| NA | 07 | 160 | 3413 | 1606 | 0.1 | 0.05 | 0.1 |
The following reports the accidents, casualties, proportion, and average casualties per harmful accident.
Instructions: Visualize total accidents by time of
day (hour).
Note: This solution has been completed for you.
dat %>%
group_by(hour) %>%
summarize(n = n()) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlim = c(0,25),
xlab = "Hour",
ylab = "Total Accidents",
main = "Total Accidents by Time of Day")Instructions: Visualize total injuries and
fatalities by time of day (hour).
dat %>%
group_by(hour) %>%
summarize(n = sum(Totalfatalities | Totalinjuries)) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlim = c(0,25),
xlab = "Hour",
ylab = "Total Number of Passengers Hurt or Killed",
main = "Total Injuries or Fatalities by Hour of the Day")Instructions: Visualize the proportion of harmful
collisions our of all collisions by time of day (hour).
dat %>%
group_by(hour) %>%
mutate(harm.accidents = Totalinjuries > 0 | Totalfatalities > 0) %>%
summarise(prop.harm = mean(harm.accidents)) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
ylim = c(0.25, 0.35),
xlim = c(0,25),
xlab = "Hour",
ylab = "Proportion of Accidents Resulting in Harm",
main = "Proportion of Crashes that Result in Injuries or Fatalities")Instructions: Visualize average injuries or fatalities per harmful collision.
#summary(dat$casualty)
dat%>%
group_by(hour) %>%
mutate(n = sum(Totalinjuries > 0 | Totalfatalities > 0),
Injuries = sum(Totalinjuries),
Fatalities = sum(Totalfatalities),
Casualties = (Injuries+Fatalities)) %>%
summarise(aveinj = Casualties/n) %>%
plot(type = "b",
bty = "n",
pch = 19,
cex = 2,
xlim = c(0,25),
xlab = "Hour",
ylab = "AVe Number of Passengers Hurt",
main = "Average Injuries or Fatalities Per Harmful Crash")The following code is used for preprocessing tasks in the above solutions.
These expressions transform, format, and create myriad new variables used in this report.
date.vec <- strptime(dat$DateTime,
format = "%m/%d/%y %H:%M") # Create date fields
dat$hour <- format(date.vec, format = "%H")
dat$month <- format(date.vec, format = "%b")
dat$day <- format(date.vec, format = "%a")
dat$day365 <- format(date.vec, format = "%j")
dat$week <- format(date.vec, format = "%V")
dat$day <- factor(dat$day,
levels = c("Mon",
"Tue",
"Wed",
"Thu",
"Fri",
"Sat",
"Sun")) # Order weekdays
dat$hour12 <- format(date.vec,
format="%l %p") # Create 12-hour format
time.levels <- c("12 AM", paste(1:11, "AM"),
"12 PM", paste(1:11, "PM"))
dat$hour12 <- factor(dat$hour12,
levels = time.levels) # Order time intervals
age.labels <- paste0("Age ",
c(16,18,25,35,45,55,65,75), "-",
c(18,25,35,45,55,65,75,100) )
dat$age <- cut(dat$Age_Drv1,
breaks = c(16,18,25,
35,45,55,
65,75,100),
labels = age.labels) # Discretize age ranges
Use the following instructions to submit your assignment, which may vary depending on your course’s platform.
When you have completed your assignment, click the “Knit” button to
render your .RMD file into a .HTML report.
Perform the following depending on your course’s platform:
.RMD and
.HTML files to the appropriate link.RMD and .HTML files in a .ZIP
file and upload to the appropriate link.HTML files are preferred but not allowed by all
platforms.
Remember to ensure the following before submitting your assignment.
head()See Google’s R Style Guide for examples of common conventions.
.RMD files are knit into .HTML and other
formats procedural, or line-by-line.
install.packages() or
setwd() are bound to cause errors in knittinglibrary() in a previous chunkIf All Else Fails: If you cannot determine and fix
the errors in a code chunk that’s preventing you from knitting your
document, add eval = FALSE inside the brackets of
{r} at the beginning of a chunk to ensure that R does not
attempt to evaluate it, that is: {r eval = FALSE}. This
will prevent an erroneous chunk of code from halting the knitting
process.