Repo link: https://github.com/baharhm/Homework10

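library(dplyr)    # %>%, group_by(), tally(), filter(), left_join()
library(ggplot2)  # ggplot(), geom_bar(), geom_point(), facet_wrap()
acc <- read.csv("https://raw.githubusercontent.com/yumouqiu/DS202-Spring2023/main/Practice/data/fars2016/accident.csv")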
names(acc)
##  [1] "STATE"      "ST_CASE"    "VE_TOTAL"   "VE_FORMS"   "PVH_INVL"  
##  [6] "PEDS"       "PERNOTMVIT" "PERMVIT"    "PERSONS"    "COUNTY"    
## [11] "CITY"       "DAY"        "MONTH"      "YEAR"       "DAY_WEEK"  
## [16] "HOUR"       "MINUTE"     "NHS"        "RUR_URB"    "FUNC_SYS"  
## [21] "RD_OWNER"   "ROUTE"      "TWAY_ID"    "TWAY_ID2"   "MILEPT"    
## [26] "LATITUDE"   "LONGITUD"   "SP_JUR"     "HARM_EV"    "MAN_COLL"  
## [31] "RELJCT1"    "RELJCT2"    "TYP_INT"    "WRK_ZONE"   "REL_ROAD"  
## [36] "LGT_COND"   "WEATHER1"   "WEATHER2"   "WEATHER"    "SCH_BUS"   
## [41] "RAIL"       "NOT_HOUR"   "NOT_MIN"    "ARR_HOUR"   "ARR_MIN"   
## [46] "HOSP_HR"    "HOSP_MN"    "CF1"        "CF2"        "CF3"       
## [51] "FATALS"     "DRUNK_DR"

Part one: Accident data

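Are there some days of the week where more accidents happen than on others (use variable DAY_WEEK)? Yes, accidents peak around the weekend. Saturday (DAY_WEEK = 7) has the most with 6104, followed by Friday (6) with 5352 and Sunday (1) with 5303. Tuesday (3) has the fewest with 4129.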

acc %>% group_by(DAY_WEEK) %>% tally()
## # A tibble: 7 × 2
##   DAY_WEEK     n
##      <int> <int>
## 1        1  5303
## 2        2  4501
## 3        3  4129
## 4        4  4388
## 5        5  4662
## 6        6  5352
## 7        7  6104
acc %>% ggplot(aes(x=DAY_WEEK)) +
  geom_bar() +
  ggtitle("Accidents by day_week") +
  ylab("accident count")
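
The numeric DAY_WEEK codes are easier to read with day names attached. The following is a minimal sketch in base R that re-enters the counts from the tally above by hand and labels them, assuming the FARS coding runs from 1 = Sunday to 7 = Saturday. The names `day.counts` and `day.names` are introduced here for illustration only.

```r
# Counts copied from the tally above, indexed by DAY_WEEK code 1..7.
day.counts <- data.frame(
  DAY_WEEK = 1:7,
  n = c(5303, 4501, 4129, 4388, 4662, 5352, 6104)
)

# Assumption: FARS codes 1 = Sunday through 7 = Saturday.
day.names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
day.counts$day <- factor(day.counts$DAY_WEEK, levels = 1:7, labels = day.names)

# Order the days from most to fewest accidents.
day.counts[order(-day.counts$n), c("day", "n")]
```

The same factor could be passed to `aes(x = ...)` so that the bar chart shows day names on the x axis instead of codes.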

What time of the day do accidents happen (use variable HOUR)?

Most accidents happen between 5 pm and 8 pm, with the peak at 6 pm (HOUR = 18).

# Keep only rows with a usable clock hour (FARS codes an unknown hour as 99)
acc.hour = acc %>% filter(HOUR <= 24) %>% group_by(HOUR) %>% tally()
acc.hour %>%
  ggplot(aes(x=HOUR)) +
  geom_bar(aes(weight = n)) +
  ggtitle("Accidents by hour") + ylab("accident count")
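
The peak hour can also be checked numerically instead of being read off the bar chart. The sketch below uses a small made-up vector of hours (not the FARS data) to show the idea: drop the unknown code, count per hour, and take the hour with the largest count. The vector `hours` and the use of 99 as the unknown code are assumptions for illustration.

```r
# Hypothetical toy data: a handful of HOUR values, with 99 standing in
# for the "unknown" code (an assumption about the FARS coding).
hours <- c(18, 18, 17, 2, 18, 99, 17, 23, 0, 99)

# Keep only valid clock hours 0..23.
valid <- hours[hours >= 0 & hours <= 23]

# Count per hour and pick the hour with the largest count.
hour.counts <- table(valid)
peak.hour <- as.integer(names(hour.counts)[which.max(hour.counts)])
peak.hour
```

Applied to the real data, the same steps would run on `acc$HOUR` in place of the toy vector.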

What is the number of accidents with at least one drunk driver (use variable DRUNK_DR)? There are 8720 accidents with at least one drunk driver.

filter(acc, DRUNK_DR > 0) %>% summarize(n=n())
##      n
## 1 8720
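
To put that count in context, it can be expressed as a share of all accidents. This is a rough sketch that reuses numbers already printed above; it assumes the day-of-week tally covers every row of `acc`, so that its sum equals the total number of accidents. The names `drunk`, `total` and `share` are illustrative.

```r
# Accidents with at least one drunk driver, from the output above.
drunk <- 8720

# Assumption: the DAY_WEEK tally covers all rows, so its sum is the total.
total <- sum(c(5303, 4501, 4129, 4388, 4662, 5352, 6104))

# Share of accidents involving at least one drunk driver, in percent.
share <- drunk / total
round(100 * share, 1)
```

With the data loaded, `mean(acc$DRUNK_DR > 0)` would give the same share directly.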

Part two: Connecting data

Connect to the person table. Identify drivers (PER_TYP == 1, see the FARS manual) and subset on them.

person.table = read.csv("https://raw.githubusercontent.com/yumouqiu/DS202-Spring2023/main/Practice/data/fars2016/person.csv")
person.table = person.table %>% filter(PER_TYP == 1)

Join accident and driver table (work out which variable(s) to use)

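# Keep only drivers with a recorded sex (1 = male, 2 = female)
person.table = person.table %>% filter(SEX == 1 | SEX == 2)
# One accident can involve several drivers, so each accident row may match
# more than one driver row; the extra keys avoid duplicated date/time columns
acc.person = left_join(acc, person.table, by = c("ST_CASE", "HOUR", "MINUTE", "DAY", "MONTH"))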
## Warning in left_join(acc, person.table, by = c("ST_CASE", "HOUR", "MINUTE", : Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 9 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
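
The warning reflects that an accident can involve more than one driver, so a single accident row matches several driver rows. The sketch below illustrates this on two small made-up tables. It uses base R `merge()` with `all.x = TRUE` in place of `left_join()` so that it runs without extra packages, and it assumes ST_CASE alone identifies an accident in the toy data. The tables `acc.toy` and `drivers.toy` are hypothetical.

```r
# Hypothetical toy tables: two accidents, three drivers.
acc.toy <- data.frame(ST_CASE = c(1, 2), DAY_WEEK = c(7, 3))
drivers.toy <- data.frame(ST_CASE = c(1, 1, 2), SEX = c(1, 2, 1))

# Left join on the case identifier: every accident row is kept and
# repeated once per matching driver.
joined <- merge(acc.toy, drivers.toy, by = "ST_CASE", all.x = TRUE)
joined

# Case 1 has two drivers, so the join yields three rows from two accidents.
nrow(joined)
```

Because rows are repeated per driver, a tally on the joined table counts drivers involved in accidents, which is what a breakdown by SEX needs.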

Tally the number of accidents by day of the week (DAY_WEEK), hour of the day (HOUR) and gender (SEX). Visualize the results!

acc.person.tally = acc.person %>% group_by(DAY_WEEK, HOUR, SEX) %>% 
  summarise(n = n())
## `summarise()` has grouped output by 'DAY_WEEK', 'HOUR'. You can override using
## the `.groups` argument.
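# Drop groups with missing values (accidents without a matched driver)
acc.person.tally = na.omit(acc.person.tally)
acc.person.tally = acc.person.tally %>% filter(HOUR < 25)
# SEX is an integer code; convert it to a factor so the colour scale is discrete
acc.person.tally %>% ggplot(aes(x = HOUR, y = n, color = factor(SEX))) +
  geom_point() +
  facet_wrap(~DAY_WEEK) +
  labs(color = "SEX")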