Kashish Kohli-s3794337
Mohammad Nasir Uddin-s3713833
Zawar Shah-s3733102
Last updated: 27 October, 2019
Road deaths in Australia is a big concern for the government and the people here. Sometimes due to vehicle manufacturer’s fault or sometimes due to the driver’s carelessness, people have suffered a lot. The accident type could single vehicle, multiple vehicles, or pedestrian. Everytime a road accident happens, the chances of death are 50%. That’s why the name: Fatal Crashes.
Although, the road death rate has fallen to its lowest in four years due to some strict policies, but still they are in 1000s.[1]
This dataset provides deaths information due to road accident across all states and territories of Australia from 1989 until 2018 that comes from Australian Government statistical data source.
This dataset includes state or territory names of Australia, time period (day of week, month, year), crash type (single or multiple vehicle involvement), speed limit, seasons like Christmas or Easter period.
Through this experiment, we aim to investigate different reasons for the road deaths.
We have analysed the distribution of fatality rate in terms of period (year, month or day-week), type of the crash, speed limit.
Finally, we performed various hypothesis testing like one sample t-test, chi-square goodness of fit and association test and simple linear regression to obtain accurate results.
This data is collected from https://www.kaggle.com/yvien90/australia-road-deaths-from-1989-to-aug-2018
Data set comes from Australian Government website, showing the list of road accidents across all states from 1989 until August 2018.
For our use, the years considered for this experiment are 2015, 2016 & 2017 and the 10 columns considered in the dataset are:
Easter_Period
Here, we have loaded the data in the dataframe. For the simplicity of this experiment, the columns which were not required were removed prior to loading of the dataset.
As we are only interested in the analysis for years 2015, 2016 and 2017, we have filtered the Years column.
In the next step, we have factored the variables in the dataset.
sampled_data <- filtered_data %>% mutate(
Month = factor(Month,
levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),
ordered = TRUE),
State = factor(State, levels = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT")),
Dayweek = factor(Dayweek, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
ordered = TRUE),
Crash_Type = factor(Crash_Type, levels = c("Single vehicle", "Multiple vehicle", "Pedestrian")),
Christmas_Period = factor(Christmas_Period, levels = c("Yes", "No")),
Easter_Period = factor(Easter_Period, levels = c("Yes", "No")),
Year = factor(Year, levels = c(2015, 2016, 2017), ordered = TRUE))Now, removing the NA values
## [1] 99
## [1] 99
As there is only one column for NAs, we have removed it from the dataset.
This barplot depicts the number of fatalities in different states. Here, we can see that the highest number of fatalities are recorded in NSW.
This barplot depicts the number of fatalities by crash type. Here, we can see that the highest number of accidents are recorded for Multiple vehicle crash type and the least involve Pedestrian crash type.
In the next barplots, we can see the accidents distribution by Month, DayWeek and Year. For Month, we can see that the highest number is recorded in December whereas lowest in February. For DayWeek, we can see that highest number of accidents are recorded on Saturday. For Year, 2015 recorded the highest number.
par(mfrow=c(1, 4))
barplot(prop.table(table(sampled_data$Month)), col = c("skyblue", "lightgreen", "yellow"), ylab = "Proportion", xlab = "Months")
barplot(prop.table(table(sampled_data$Dayweek)), col = c("skyblue", "lightgreen", "yellow", "red"), ylab = "Proportion", xlab = "Days of week")
barplot(prop.table(table(sampled_data$Year)), col = c("skyblue", "lightgreen", "yellow", "red"), ylab = "Proportion", xlab = "Years")
hist(sampled_data$Speed_Limit, main = "Histogram for frequency of accidents happening in speed limits", xlab = "speed limits", ylab = "Frequency", ylim = c(0, 500), col = c("skyblue", "lightgreen", "yellow"))From the histogram, it could be seen that the highest number of accidents are recorded in the areas where speed limits are 90-100.
From the boxplot, the speed limit distribution can be seen.
The speed limit summary can be seen throught the textual method below. The mean can be noted as: 80.518666, median:80, standard deviation: 21.52277.
stats_summary_speed_limit <- sampled_data %>%
summarise(
Min = min(Speed_Limit),
Q1 = quantile(Speed_Limit)[2],
Median = median(Speed_Limit),
Mean = mean(Speed_Limit),
Q3 = quantile(Speed_Limit)[4],
Max = max(Speed_Limit),
SD = sd(Speed_Limit),
IQR = IQR(Speed_Limit),
Total = n(),
Missing = sum(is.na(Speed_Limit))
) -> table2
knitr::kable(table2)| Min | Q1 | Median | Mean | Q3 | Max | SD | IQR | Total | Missing |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 60 | 80 | 80.51866 | 100 | 130 | 21.52277 | 40 | 1340 | 0 |
The overall summary of the variables can be seen below:
Using the summary function we can check the number of accidents roughly according to variables
| Crash ID | State | Month | Year | Dayweek | Time | Crash_Type | Speed_Limit | Christmas_Period | Easter_Period | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :20151001 | NSW :1032 | December :151 | 2015:516 | Sunday :210 | Length:1340 | Single vehicle :545 | Min. : 10.00 | Yes: 56 | Yes: 12 | |
| 1st Qu.:20152018 | VIC : 198 | August :143 | 2016:458 | Monday :169 | Class1:hms | Multiple vehicle:566 | 1st Qu.: 60.00 | No :1284 | No :1328 | |
| Median :20161154 | NT : 61 | October :136 | 2017:366 | Tuesday :172 | Class2:difftime | Pedestrian :229 | Median : 80.00 | NA | NA | |
| Mean :20160674 | ACT : 25 | November :132 | NA | Wednesday:184 | Mode :numeric | NA | Mean : 80.52 | NA | NA | |
| 3rd Qu.:20171031 | TAS : 24 | July :118 | NA | Thursday :186 | NA | NA | 3rd Qu.:100.00 | NA | NA | |
| Max. :20172144 | QLD : 0 | September:108 | NA | Friday :180 | NA | NA | Max. :130.00 | NA | NA | |
| NA | (Other): 0 | (Other) :552 | NA | Saturday :239 | NA | NA | NA | NA | NA |
One-sample test: Test the average speed limit where fatalities happened
Two sample t-test: Test of average speed limits during christmas period and easter period.
Chi-square goodness of fit test: Test the ratio of fatalities during 3 years.
Chi-square test of association: Test the relationship between whether day of week has association with crash type.
Simple linear regression and correlation: Test the increase in Single vehicle crash frequency as time flows.
NOTE: The significance level in every test is 0.05.
The mean speed limit where fatalities happened is 80. \[H_0: \mu_1 = 80 \] The mean speed limit where fatalities happened is not 80. \[H_A: \mu_1 \ne 80\]
##
## One Sample t-test
##
## data: sampled_data$Speed_Limit
## t = 0.88213, df = 1339, p-value = 0.3779
## alternative hypothesis: true mean is not equal to 80
## 95 percent confidence interval:
## 79.36524 81.67207
## sample estimates:
## mean of x
## 80.51866
p-value > 0.05 then not statistically significant
we fail to reject the null hypothesis
H0: The number of fatalities are equal each year
HA: The number of fatalities are not equal each year
##
## 2015 2016 2017
## 0.3850746 0.3417910 0.2731343
##
## Chi-squared test for given probabilities
##
## data: table(sampled_data$Year)
## X-squared = 25.618, df = 2, p-value = 2.736e-06
##
## 2015 2016 2017
## 516 458 366
## 2015 2016 2017
## 446.6667 446.6667 446.6667
p-value < 0.05, statistically significant
the fatality rate is not same for the three years.
H0: There is no association of crash type dependent on weekday (independence) HA: There is an association of crash type dependent on weekday (dependence)
table(sampled_data$Crash_Type, sampled_data$Dayweek) %>% prop.table() -> table_proportion
knitr::kable(table_proportion)| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| Single vehicle | 0.0858209 | 0.0507463 | 0.0440299 | 0.0447761 | 0.0544776 | 0.0485075 | 0.0783582 |
| Multiple vehicle | 0.0507463 | 0.0537313 | 0.0597015 | 0.0686567 | 0.0582090 | 0.0597015 | 0.0716418 |
| Pedestrian | 0.0201493 | 0.0216418 | 0.0246269 | 0.0238806 | 0.0261194 | 0.0261194 | 0.0283582 |
##
## Pearson's Chi-squared test
##
## data: table(sampled_data$Crash_Type, sampled_data$Dayweek)
## X-squared = 28.973, df = 12, p-value = 0.003976
##
## Sunday Monday Tuesday Wednesday Thursday Friday
## Single vehicle 115 68 59 60 73 65
## Multiple vehicle 68 72 80 92 78 80
## Pedestrian 27 29 33 32 35 35
##
## Saturday
## Single vehicle 105
## Multiple vehicle 96
## Pedestrian 38
##
## Sunday Monday Tuesday Wednesday Thursday Friday
## Single vehicle 85.41045 68.73507 69.95522 74.83582 75.64925 73.20896
## Multiple vehicle 88.70149 71.38358 72.65075 77.71940 78.56418 76.02985
## Pedestrian 35.88806 28.88134 29.39403 31.44478 31.78657 30.76119
##
## Saturday
## Single vehicle 97.20522
## Multiple vehicle 100.95075
## Pedestrian 40.84403
p-value < 0.05, statistically significant
We can say that there is some association between the crash type and dayweek.
H0: The data do not fit the regression model
HA: The data fit the regression model
acc_sample <- sampled_data %>% filter(Year == 2017, Crash_Type == "Pedestrian")
month_table <- table(acc_sample$Month)
as.vector(month_table)## [1] 4 2 4 3 7 5 9 4 5 3 5 5
df <- data.frame(month = c(1:12), Pedestrian = month_table)
names(df)[3] <- paste("Pedestrian")
maxmodel <- lm(Pedestrian ~ month, data = df)
maxmodel %>% summary()##
## Call:
## lm(formula = Pedestrian ~ month, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1072 -0.9796 -0.2296 0.1183 4.2704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8485 1.1742 3.277 0.00832 **
## month 0.1259 0.1595 0.789 0.44844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.908 on 10 degrees of freedom
## Multiple R-squared: 0.0586, Adjusted R-squared: -0.03554
## F-statistic: 0.6224 on 1 and 10 DF, p-value: 0.4484
p-value > 0.05, not statistically significant
we fail to reject null hypothesis
-[1] The Guardianhttps://www.theguardian.com/australia-news/2019/jan/17/australias-road-deaths-at-lowest-level-in-four-years
-[2] Australian Government,Department of Infrastructure, Transport, Cities and Regional Development https://www.bitre.gov.au/statistics/safety/