Osama Alfawzan (3905369), Pinjerla Jyothika (3941583), Jerusha Lee (3935717)
Last updated: 16/10/2022
Car crashes occur constantly around the world, resulting in injuries and fatalities. Road traffic accidents are the tenth greatest cause of death, accounting for 2.2% of all deaths globally. Cars have become safer over time, yet crashes remain a major cause of harm. In this assignment, we look at data for car crashes in South Australia. We hypothesize that deaths caused by car accidents are more likely during daytime peak hours, when traffic is heavy.
There were 1,123 fatal car accidents in Australia in 2021. This represents a 2.6% increase from 2020. Over the course of the decade, the number of fatalities dropped from about 1,300 to about 1,100 annually. To help keep this number in decline, we look at historical car crash data from 2019 that includes road deaths. We test the hypothesis that deaths caused by car accidents are more likely during peak hours, when traffic is heavy. We hope this analysis can help allocate resources for Emergency Medical Services (EMS) across times of day. We use statistical graphs and tests to address this hypothesis.
The data used in this assignment was collected from the Australian open government data website. The Road Crash Data contains details of reported road crashes and casualties in South Australia, with 12,964 observations and 33 variables, both quantitative and qualitative. A list of all variables and the metadata can be found on the Australian open government data website.
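# load the packages used by the chunks in this report
library(dplyr)
library(ggplot2)
library(knitr)
# reading data
crash_2019 <- read.csv("2019_Crash.csv")
table1 <- head(crash_2019[, c(1, 3, 7, 11:14, 23)], n = 5)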
knitr::kable(table1)

| REPORT_ID | Suburb | Total.Cas | Year | Month | Day | Time | DayNight |
|---|---|---|---|---|---|---|---|
| 2019-1-8/07/2020 | HAMPSTEAD GARDENS | 0 | 2019 | June | Wednesday | 11:15 am | Daylight |
| 2019-2-8/07/2020 | DRY CREEK | 0 | 2019 | January | Tuesday | 12:49 am | Night |
| 2019-3-8/07/2020 | MILE END | 1 | 2019 | January | Tuesday | 12:00 am | Night |
| 2019-4-8/07/2020 | PARALOWIE | 1 | 2019 | January | Tuesday | 12:05 am | Night |
| 2019-5-8/07/2020 | MOUNT BARKER | 0 | 2019 | January | Tuesday | 05:15 am | Night |
We will be working with a few variables from this data set, mainly:
- Total.Cas: Total number of casualties as a result of a road crash
- Month: Month crash occurred
- DayNight: Time of day the crash occurred (a factor with two levels: Day, Night)

This data set contains character and numeric variables. Our main focus
is on the variable DayNight, which is stored as a "character" type
variable. For this analysis its type should be factor, so we convert it
from character to factor. The levels of this factor will be Day and
Night, identifying the time of day of the car crash. The numeric
variable Total.Cas will be used as the total casualties for a car
crash. These two variables are central to our analysis because they let
us compare the total number of casualties for car crashes by Day or
Night.
data.frame(variable = names(crash_2019),
           class = sapply(crash_2019, typeof),
           first_values = sapply(crash_2019, function(x) paste0(head(x), collapse = ", ")),
           row.names = NULL) %>%
  kable()

| variable | class | first_values |
|---|---|---|
| REPORT_ID | character | 2019-1-8/07/2020, 2019-2-8/07/2020, 2019-3-8/07/2020, 2019-4-8/07/2020, 2019-5-8/07/2020, 2019-6-8/07/2020 |
| Stats.Area | character | 2 Metropolitan, 2 Metropolitan, 2 Metropolitan, 2 Metropolitan, 2 Metropolitan, 2 Metropolitan |
| Suburb | character | HAMPSTEAD GARDENS, DRY CREEK, MILE END, PARALOWIE, MOUNT BARKER, TORRENSVILLE |
| Postcode | integer | 5086, 5094, 5031, 5108, 5251, 5031 |
| LGA.Name | character | CITY OF PORT ADELAIDE ENFIELD, CITY OF SALISBURY, CITY OF WEST TORRENS, CITY OF SALISBURY, DC MT.BARKER. , CITY OF WEST TORRENS |
| Total.Units | integer | 2, 2, 2, 2, 2, 2 |
| Total.Cas | integer | 0, 0, 1, 1, 0, 1 |
| Total.Fats | integer | 0, 0, 0, 0, 0, 0 |
| Total.SI | integer | 0, 0, 0, 1, 0, 0 |
| Total.MI | integer | 0, 0, 1, 0, 0, 1 |
| Year | integer | 2019, 2019, 2019, 2019, 2019, 2019 |
| Month | character | June, January, January, January, January, January |
| Day | character | Wednesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday |
| Time | character | 11:15 am, 12:49 am, 12:00 am, 12:05 am, 05:15 am, 07:00 am |
| Area.Speed | integer | 60, 90, 60, 50, 110, 50 |
| Position.Type | character | Cross Road, Divided Road, Divided Road, Not Divided, Divided Road, Divided Road |
| Horizontal.Align | character | Straight road, Straight road, Straight road, CURVED, VIEW OPEN, Straight road, Straight road |
| Vertical.Align | character | Level, Level, Level, Level, Slope, Level |
| Other.Feat | character | Not Applicable, Not Applicable, Not Applicable, Not Applicable, Not Applicable, Not Applicable |
| Road.Surface | character | Sealed, Sealed, Sealed, Sealed, Sealed, Sealed |
| Moisture.Cond | character | Dry, Dry, Dry, Dry, Dry, Dry |
| Weather.Cond | character | Not Raining, Not Raining, Not Raining, Not Raining, Not Raining, Not Raining |
| DayNight | character | Daylight, Night, Night, Night, Night, Daylight |
| Crash.Type | character | Right Angle, Rear End, Hit Pedestrian, Hit Fixed Object, Hit Animal, Hit Fixed Object |
| Unit.Resp | integer | 1, 2, 1, 1, 2, 1 |
| Entity.Code | character | Driver Rider, Driver Rider, Driver Rider, Driver Rider, Animal, Driver Rider |
| CSEF.Severity | character | 1: PDO, 1: PDO, 2: MI, 3: SI, 1: PDO, 2: MI |
| Traffic.Ctrls | character | Give Way Sign, No Control, No Control, No Control, No Control, No Control |
| DUI.Involved | character | , , , , , |
| Drugs.Involved | character | , , , , , |
| ACCLOC_X | double | 1331810.03, 1328376.2, 1325819.68, 1328320.6, 1353279.99, 1324652.75 |
| ACCLOC_Y | double | 1676603.26, 1682942.63, 1670994.26, 1690237.08, 1655645.15, 1672027.64 |
| UNIQUE_LOC | double | 13318101676603, 13283761682943, 13258201670994, 13283211690237, 13532801655645, 13246531672028 |
# sort(unique(...)) fixes the level order ("Daylight", "Night") regardless of row order
crash_2019$DayNight <- factor(crash_2019$DayNight, levels = sort(unique(crash_2019$DayNight)), labels = c("Day", "Night"), ordered = TRUE)
crash_2019$Month <- factor(crash_2019$Month, levels = month.name) # get the ordered months of year
# assumes the raw severity codes sort in order of severity ("1: PDO", "2: MI", "3: SI", ...)
crash_2019$CSEF.Severity <- factor(crash_2019$CSEF.Severity, levels = sort(unique(crash_2019$CSEF.Severity)), labels = c("Property Damage Only", "Minor Injury", "Serious Injury", "Fatal Injury"))

## [1] "Day" "Night"
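The chunks that follow refer to columns named Deaths and TimesOfDay, which are not created in the code shown in this report. A minimal sketch of one way such columns could be derived, assuming (the source does not confirm this) that TimesOfDay is the recoded DayNight factor and Deaths is Total.Cas under a new name. The data frame `toy` is a hypothetical stand-in for crash_2019.

```r
# Hypothetical stand-in for crash_2019 (the real data comes from 2019_Crash.csv)
toy <- data.frame(
  Total.Cas = c(0, 0, 1, 1, 0),
  DayNight  = c("Daylight", "Night", "Night", "Night", "Daylight")
)

# Assumption: TimesOfDay is DayNight recoded to the two levels Day and Night
toy$TimesOfDay <- factor(toy$DayNight,
                         levels = c("Daylight", "Night"),
                         labels = c("Day", "Night"))

# Assumption: Deaths is the casualty count Total.Cas under another name
toy$Deaths <- toy$Total.Cas

levels(toy$TimesOfDay)
```

If Deaths is indeed Total.Cas, it counts all casualties per crash rather than fatalities only (those are in Total.Fats), which is worth keeping in mind when reading the figures.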
After inspecting the variables, we found no missing data and no values
that we treat as outliers. The boxplot of Deaths by TimesOfDay below
shows a few values above the upper fence, but we do not consider them
outliers: they are genuine observations, not values recorded by mistake
or produced by a data processing error, so they stay in the analysis.
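For reference, the upper fence of a boxplot is Q3 + 1.5 × IQR. A minimal sketch of that rule in base R, using a made-up vector `deaths_toy` (not the real data) chosen so that Q1 = 0 and Q3 = 1, the quartiles reported for both groups in the summary statistics of this report:

```r
# Hypothetical per-crash counts, for illustration only
deaths_toy <- c(0, 0, 0, 1, 1, 0, 2, 9, 0, 1)

q1 <- quantile(deaths_toy, 0.25, names = FALSE)
q3 <- quantile(deaths_toy, 0.75, names = FALSE)

# Upper fence: values above it are drawn as individual points on a boxplot
upper_fence <- q3 + 1.5 * (q3 - q1)

# Observations flagged by the rule
beyond_fence <- deaths_toy[deaths_toy > upper_fence]
```

With Q1 = 0 and Q3 = 1 the fence sits at 2.5, so any crash with three or more counted in Deaths appears as a point above the whisker even though it is a valid record.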
The bar chart below compares total Deaths during daytime and night
time in 2019. Overall, more road deaths occur during daytime, which is
the first piece of evidence for our claim: the daytime total is more
than triple the night-time total.
ggplot(crash_2019, aes(x=TimesOfDay, y=Deaths)) +
  geom_bar(stat="identity", fill="steelblue", width = 0.3)+
  theme_minimal()+
  ggtitle("The number of deaths during Daytime and Night time")

In February and March 2019, daytime road deaths are higher than in any other month, with more than 450 deaths per month. Night-time deaths, however, peak in May and June of the same year, with more than 150 deaths per month.
# total deaths per month, split by time of day
death_by_month <- crash_2019 %>% group_by(Month, TimesOfDay) %>% summarise(Deaths = sum(Deaths))
ggplot(death_by_month, aes(x=Month, y=Deaths, group = TimesOfDay, fill = TimesOfDay)) +
  geom_bar(position="dodge", stat="identity")+
  theme_minimal()+
  ggtitle("The total number of deaths across months of the year during day and night")

# summary statistics of Deaths for each time of day
crash_2019 %>% group_by(TimesOfDay) %>% summarise(Min = min(Deaths, na.rm = TRUE),
                                                  Q1 = quantile(Deaths, probs = .25, na.rm = TRUE),
                                                  Median = median(Deaths, na.rm = TRUE),
                                                  IQR = IQR(Deaths, na.rm = TRUE),
                                                  Q3 = quantile(Deaths, probs = .75, na.rm = TRUE),
                                                  Max = max(Deaths, na.rm = TRUE),
                                                  Mean = mean(Deaths, na.rm = TRUE),
                                                  SD = sd(Deaths, na.rm = TRUE),
                                                  N = n(),
                                                  Missing = sum(is.na(Deaths))) -> table2
knitr::kable(table2)

| TimesOfDay | Min | Q1 | Median | IQR | Q3 | Max | Mean | SD | N | Missing |
|---|---|---|---|---|---|---|---|---|---|---|
| Day | 0 | 0 | 0 | 1 | 1 | 9 | 0.4719033 | 0.7361479 | 9930 | 0 |
| Night | 0 | 0 | 0 | 1 | 1 | 6 | 0.4433092 | 0.7201012 | 3034 | 0 |
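
Because the mean is the total divided by N, the group totals shown in the bar chart can be recovered from the Mean and N columns of this table. A small worked computation using only those reported values; the variable names are illustrative:

```r
# Values copied from the summary table
mean_day <- 0.4719033; n_day <- 9930
mean_night <- 0.4433092; n_night <- 3034

# total = mean * N
total_day <- mean_day * n_day        # about 4686
total_night <- mean_night * n_night  # about 1345

ratio_totals <- total_day / total_night  # about 3.48
ratio_crashes <- n_day / n_night         # about 3.27
```

The daytime total is roughly 3.5 times the night-time total, while there are roughly 3.3 times as many daytime crashes. Most of the gap between the totals therefore reflects the larger number of daytime crashes rather than a higher mean per crash, which is why the comparison of means that follows is needed.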
To test our hypothesis, we first need to compare the variances of
Deaths by Day and Night. We use Levene's test and compare its p-value
with the standard 0.05 significance level. Levene's test has the
following statistical hypotheses:
\[H_0: \sigma_1^2 = \sigma_2^2\]
\[H_A: \sigma_1^2 \ne \sigma_2^2\]
The p-value for Levene's test of equal variance for car crash
Deaths by TimesOfDay is not statistically significant, with p = 0.06.
Since p > .05, we fail to reject \(H_0\). In other words, we have no
evidence that the variances of the two samples differ, so we proceed
under the assumption of equal variances.
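The call that produced this result is not shown in the report. As a rough sketch of what Levene's test computes, the median-centred variant (the Brown-Forsythe form) is a one-way ANOVA on the absolute deviations of each observation from its group median. The function name `levene_sketch` and the toy data are hypothetical, and whether the report used this variant is an assumption.

```r
# Median-centred Levene test in base R (illustrative helper, not from the report)
levene_sketch <- function(y, g) {
  g <- factor(g)
  # absolute deviation of each observation from its own group median
  z <- abs(y - ave(y, g, FUN = median))
  # one-way ANOVA on the deviations
  fit <- anova(lm(z ~ g))
  c(F = fit[["F value"]][1], p = fit[["Pr(>F)"]][1])
}

# Made-up counts for two groups of eight crashes each
deaths_toy <- c(0, 0, 1, 0, 2, 1, 0, 3, 0, 1, 0, 0, 1, 0, 2, 0)
group_toy <- rep(c("Day", "Night"), each = 8)

res <- levene_sketch(deaths_toy, group_toy)
res
```

One consequence under that assumption: both group medians are 0 in the summary table, so the absolute deviations equal the raw Deaths values and this variant effectively compares the group means.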
##
## Two Sample t-test
##
## data: Deaths by TimesOfDay
## t = 1.882, df = 12962, p-value = 0.05985
## alternative hypothesis: true difference in means between group Day and group Night is not equal to 0
## 95 percent confidence interval:
## -0.001186797 0.058375118
## sample estimates:
## mean in group Day mean in group Night
## 0.4719033 0.4433092
The two-sample t-test has the following statistical hypotheses: \[H_0: \mu_1 - \mu_2 = 0\] \[H_A: \mu_1 - \mu_2 \ne 0\]
where \(\mu_1\) and \(\mu_2\) are the mean Deaths for the Day and
Night groups respectively. The difference between mean Deaths at Day
and at Night estimated by the sample was 0.4719033 - 0.4433092
= 0.0285941, so the sample mean for the Day group is greater than the
sample mean for the Night group. However, the test gives p = 0.05985,
which is above the 0.05 significance level, and the 95% confidence
interval (-0.0012, 0.0584) includes 0, so we fail to reject \(H_0\).
The sample difference points in the direction of our assumption that
deaths caused by car crashes are more likely during the daytime rush
hours, but it is not statistically significant at the 5% level.
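
The reported t statistic, p-value and confidence interval can be reproduced from the summary statistics alone. A sketch assuming the pooled-variance form of the test, which is consistent with the "Two Sample t-test" heading and df = 12962 in the output; the variable names are illustrative, and the original call was presumably something like `t.test(Deaths ~ TimesOfDay, data = crash_2019, var.equal = TRUE)`.

```r
# Values copied from the summary table
mean_day <- 0.4719033; sd_day <- 0.7361479; n_day <- 9930
mean_night <- 0.4433092; sd_night <- 0.7201012; n_night <- 3034

df <- n_day + n_night - 2

# pooled variance and standard error of the difference in means
sp2 <- ((n_day - 1) * sd_day^2 + (n_night - 1) * sd_night^2) / df
se <- sqrt(sp2 * (1 / n_day + 1 / n_night))

diff_means <- mean_day - mean_night
t_stat <- diff_means / se                  # about 1.882
p_value <- 2 * pt(-abs(t_stat), df)        # two-sided p-value, about 0.0598
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se

# decision at the 5% level
reject_h0 <- p_value < 0.05
```

Here `reject_h0` is FALSE, matching the reading of the output: the interval `ci` runs from just below 0 to about 0.058.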