Repo link: https://github.com/baharhm/Homework10
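library(dplyr)    # provides %>%, group_by(), tally(), filter(), left_join()
library(ggplot2)  # provides ggplot(), geom_bar(), geom_point()
acc <- read.csv("https://raw.githubusercontent.com/yumouqiu/DS202-Spring2023/main/Practice/data/fars2016/accident.csv")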
names(acc)
## [1] "STATE" "ST_CASE" "VE_TOTAL" "VE_FORMS" "PVH_INVL"
## [6] "PEDS" "PERNOTMVIT" "PERMVIT" "PERSONS" "COUNTY"
## [11] "CITY" "DAY" "MONTH" "YEAR" "DAY_WEEK"
## [16] "HOUR" "MINUTE" "NHS" "RUR_URB" "FUNC_SYS"
## [21] "RD_OWNER" "ROUTE" "TWAY_ID" "TWAY_ID2" "MILEPT"
## [26] "LATITUDE" "LONGITUD" "SP_JUR" "HARM_EV" "MAN_COLL"
## [31] "RELJCT1" "RELJCT2" "TYP_INT" "WRK_ZONE" "REL_ROAD"
## [36] "LGT_COND" "WEATHER1" "WEATHER2" "WEATHER" "SCH_BUS"
## [41] "RAIL" "NOT_HOUR" "NOT_MIN" "ARR_HOUR" "ARR_MIN"
## [46] "HOSP_HR" "HOSP_MN" "CF1" "CF2" "CF3"
## [51] "FATALS" "DRUNK_DR"
Part one: Accident data
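Are there some days of the week where more accidents happen than on others (use variable DAY_WEEK)? Yes, more accidents happen on the weekend. Under the FARS coding (1 = Sunday through 7 = Saturday), Saturday has the most accidents (6104), followed by Friday (5352) and Sunday (5303).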
acc %>% group_by(DAY_WEEK) %>% tally()
## # A tibble: 7 × 2
## DAY_WEEK n
## <int> <int>
## 1 1 5303
## 2 2 4501
## 3 3 4129
## 4 4 4388
## 5 5 4662
## 6 6 5352
## 7 7 6104
acc %>% ggplot(aes(x=DAY_WEEK)) +
geom_bar() +
ggtitle("Accidents by day_week") +
ylab("accident count")
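The numeric DAY_WEEK codes are easier to read with day names attached. A minimal sketch, assuming the FARS coding 1 = Sunday through 7 = Saturday (worth checking against the FARS manual for the data year); day.counts and day.labels are names made up for this illustration, and the counts are copied from the tally above:

```r
# Counts copied from the tally output above (DAY_WEEK 1..7)
day.counts <- data.frame(
  DAY_WEEK = 1:7,
  n = c(5303, 4501, 4129, 4388, 4662, 5352, 6104)
)

# Assumed FARS coding: 1 = Sunday, ..., 7 = Saturday
day.labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day.counts$DAY_NAME <- factor(day.counts$DAY_WEEK, levels = 1:7, labels = day.labels)

# Order the days from most to fewest accidents
day.counts[order(-day.counts$n), c("DAY_NAME", "n")]
```

Using a labeled factor like DAY_NAME as the x aesthetic in the bar chart would label the bars by day instead of by code.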
What time of the day do accidents happen (use variable HOUR)?
Most accidents happen between 5 and 8 pm, with the peak at 6 pm.
acc.hour = acc %>% filter(HOUR <= 24) %>% group_by(HOUR) %>% tally()
acc.hour %>%
ggplot(aes(x=HOUR)) +
geom_bar(aes(weight = n)) +
ggtitle("Accidents by hour") + ylab("accident count")
What is the number of accidents with at least one drunk driver (use variable DRUNK_DR)? There are 8720 accidents with at least one drunk driver.
filter(acc, DRUNK_DR > 0) %>% summarize(n=n())
## n
## 1 8720
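To put that count in context, it can be divided by the total number of accidents. A small sketch using only numbers printed in this document (the total is taken as the sum of the DAY_WEEK tally from Part one); the variable names are illustrative:

```r
# Total accidents = sum of the DAY_WEEK tally shown in Part one
total.acc <- sum(c(5303, 4501, 4129, 4388, 4662, 5352, 6104))

# Accidents with at least one drunk driver, from the output above
drunk.acc <- 8720

# Share of accidents involving at least one drunk driver, in percent
drunk.share <- drunk.acc / total.acc
round(100 * drunk.share, 1)
```

This works out to roughly a quarter of all accidents. On the full data, and assuming DRUNK_DR has no missing values, mean(acc$DRUNK_DR > 0) should give the same share directly.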
Part two: Connecting data
Connect to the person table. Identify drivers (PER_TYP == 1, see the FARS manual) and subset on them.
person.table = read.csv("https://raw.githubusercontent.com/yumouqiu/DS202-Spring2023/main/Practice/data/fars2016/person.csv")
person.table = person.table %>% filter(PER_TYP == 1)
Join accident and driver table (work out which variable(s) to use)
person.table = person.table %>% filter(SEX == 1 | SEX == 2)
acc.person = left_join(acc, person.table, by = c("ST_CASE", "HOUR", "MINUTE", "DAY", "MONTH"))
## Warning in left_join(acc, person.table, by = c("ST_CASE", "HOUR", "MINUTE", : Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 9 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
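The warning is expected here: an accident can involve several drivers, so one row of acc can match several rows of person.table, and the joined table then has one row per driver rather than one per accident. A toy sketch of that effect (toy.acc and toy.drivers are made-up tables for illustration, not FARS data):

```r
library(dplyr)

# Made-up data: two accidents, the first one involving two drivers
toy.acc <- data.frame(ST_CASE = c(1, 2), DAY_WEEK = c(7, 3))
toy.drivers <- data.frame(ST_CASE = c(1, 1, 2), SEX = c(1, 2, 1))

# Accident 1 matches two driver rows, so it appears twice after the join
toy.joined <- left_join(toy.acc, toy.drivers, by = "ST_CASE")
nrow(toy.joined)

# Counting distinct cases recovers the number of accidents
n_distinct(toy.joined$ST_CASE)
```

Tallies computed on the joined table therefore count drivers involved in accidents, which is worth keeping in mind when reading the results.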
Tally the number of accidents by day of the week (DAY_WEEK), hour of the day (HOUR) and gender (SEX). Visualize the results!
acc.person.tally = acc.person %>% group_by(DAY_WEEK, HOUR, SEX) %>%
summarise(n = n())
## `summarise()` has grouped output by 'DAY_WEEK', 'HOUR'. You can override using
## the `.groups` argument.
acc.person.tally = na.omit(acc.person.tally)
acc.person.tally = acc.person.tally %>% filter(HOUR < 25)
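# SEX is an integer code, so convert it to a factor for a discrete color legend
acc.person.tally %>% ggplot(aes(x = HOUR, y = n, color = factor(SEX))) +
geom_point() +
facet_wrap(~DAY_WEEK) +
labs(color = "SEX")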