Introduction

Road deaths in Australia is a big concern for the government and the people here. Sometimes due to vehicle manufacturer’s fault or sometimes due to the driver’s carelessness, people have suffered a lot. The accident type could single vehicle, multiple vehicles, or pedestrian. Everytime a road accident happens, the chances of death are 50%. That’s why the name: Fatal Crashes.

Although, the road death rate has fallen to its lowest in four years due to some strict policies, but still they are in 1000s.[1]

This dataset provides deaths information due to road accident across all states and territories of Australia from 1989 until 2018 that comes from Australian Government statistical data source.
This dataset includes state or territory names of Australia, time period (day of week, month, year), crash type (single or multiple vehicle involvement), speed limit, seasons like Christmas or Easter period.

Problem Statement

Through this experiment, we aim to investigate different reasons for the road deaths.
We have analysed the distribution of fatality rate in terms of period (year, month or day-week), type of the crash, speed limit.
Finally, we performed various hypothesis testing like one sample t-test, chi-square goodness of fit and association test and simple linear regression to obtain accurate results.

Data

This data is collected from https://www.kaggle.com/yvien90/australia-road-deaths-from-1989-to-aug-2018
Data set comes from Australian Government website, showing the list of road accidents across all states from 1989 until August 2018.
For our use, the years considered for this experiment are 2015, 2016 & 2017 and the 10 columns considered in the dataset are:
Crash ID,
State,
Month,
Year,
Dayweek,
Time,
Crash_Type,
Speed_Limit,
Christmas_Period &
Easter_Period

Data Cont.

Here, we have loaded the data in the dataframe. For the simplicity of this experiment, the columns which were not required were removed prior to loading of the dataset.

dataframe <- read_csv("Fatal Crashes.csv", col_types = cols(Time = col_time(format = "%H:%M")))

As we are only interested in the analysis for years 2015, 2016 and 2017, we have filtered the Years column.

filtered_data <- dataframe %>% filter(Year %in% c(2015, 2016, 2017))

In the next step, we have factored the variables in the dataset.

sampled_data <- filtered_data %>% mutate(
  Month = factor(Month,
                 levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),
                 ordered = TRUE),
  State = factor(State, levels = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT")),
  Dayweek = factor(Dayweek, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
                   ordered = TRUE),
  Crash_Type = factor(Crash_Type, levels = c("Single vehicle", "Multiple vehicle", "Pedestrian")),
  Christmas_Period = factor(Christmas_Period, levels = c("Yes", "No")),
  Easter_Period = factor(Easter_Period, levels = c("Yes", "No")),
  Year = factor(Year, levels = c(2015, 2016, 2017), ordered = TRUE))

Data Cont.

Now, removing the NA values

which(is.na(sampled_data$Christmas_Period))

## [1] 99

which(is.na(sampled_data$Easter_Period))

## [1] 99

As there is only one column for NAs, we have removed it from the dataset.

sampled_data <- filter(sampled_data,  !is.na(sampled_data$Christmas_Period))
sampled_data <- filter(sampled_data,  !is.na(sampled_data$Easter_Period))

Descriptive Statistics and Visualisation

This barplot depicts the number of fatalities in different states. Here, we can see that the highest number of fatalities are recorded in NSW.

barplot(prop.table(table(sampled_data$State)), main = "Proportion distribution of death in states", xlab = "State/Territory", ylab="Proportion", col = c("skyblue", "lightgreen", "yellow"), ylim=c(0, 0.8))

Descriptive Statistics Cont.

This barplot depicts the number of fatalities by crash type. Here, we can see that the highest number of accidents are recorded for Multiple vehicle crash type and the least involve Pedestrian crash type.

barplot(prop.table(table(sampled_data$Crash_Type)), main = "Proportion distribution of death in crash type", xlab = "Crash type", ylab="Proportion", col = c("skyblue", "lightgreen", "yellow"))

Descriptive Statistics Cont.

In the next barplots, we can see the accidents distribution by Month, DayWeek and Year. For Month, we can see that the highest number is recorded in December whereas lowest in February. For DayWeek, we can see that highest number of accidents are recorded on Saturday. For Year, 2015 recorded the highest number.

par(mfrow=c(1, 4))
barplot(prop.table(table(sampled_data$Month)), col = c("skyblue", "lightgreen", "yellow"), ylab = "Proportion", xlab = "Months")
barplot(prop.table(table(sampled_data$Dayweek)), col = c("skyblue", "lightgreen", "yellow", "red"), ylab = "Proportion", xlab = "Days of week")
barplot(prop.table(table(sampled_data$Year)), col = c("skyblue", "lightgreen", "yellow", "red"), ylab = "Proportion", xlab = "Years")
hist(sampled_data$Speed_Limit, main = "Histogram for frequency of accidents happening in speed limits", xlab = "speed limits", ylab = "Frequency", ylim = c(0, 500), col = c("skyblue", "lightgreen", "yellow"))

From the histogram, it could be seen that the highest number of accidents are recorded in the areas where speed limits are 90-100.

Descriptive Statistics Cont.

From the boxplot, the speed limit distribution can be seen.

boxplot(sampled_data$Speed_Limit, ylab = "Speed", col = "red", main = "Boxplot for speed limit", ylim = c(0, 150))

Descriptive Statistics Cont.

The speed limit summary can be seen throught the textual method below. The mean can be noted as: 80.518666, median:80, standard deviation: 21.52277.

stats_summary_speed_limit <- sampled_data %>%
  summarise(
    Min = min(Speed_Limit),
    Q1 = quantile(Speed_Limit)[2],
    Median = median(Speed_Limit),
    Mean = mean(Speed_Limit),
    Q3 = quantile(Speed_Limit)[4],
    Max = max(Speed_Limit),
    SD = sd(Speed_Limit),
    IQR = IQR(Speed_Limit),
    Total = n(),
    Missing = sum(is.na(Speed_Limit))
  ) -> table2
knitr::kable(table2)

Min	Q1	Median	Mean	Q3	Max	SD	IQR	Total	Missing
10	60	80	80.51866	100	130	21.52277	40	1340	0

Descriptive Statistics Cont.

The overall summary of the variables can be seen below:

Using the summary function we can check the number of accidents roughly according to variables

table <- sampled_data %>% summary()

knitr::kable(table)

Crash ID	State	Month	Year	Dayweek	Time	Crash_Type	Speed_Limit	Christmas_Period	Easter_Period
Min. :20151001	NSW :1032	December :151	2015:516	Sunday :210	Length:1340	Single vehicle :545	Min. : 10.00	Yes: 56	Yes: 12
1st Qu.:20152018	VIC : 198	August :143	2016:458	Monday :169	Class1:hms	Multiple vehicle:566	1st Qu.: 60.00	No :1284	No :1328
Median :20161154	NT : 61	October :136	2017:366	Tuesday :172	Class2:difftime	Pedestrian :229	Median : 80.00	NA	NA
Mean :20160674	ACT : 25	November :132	NA	Wednesday:184	Mode :numeric	NA	Mean : 80.52	NA	NA
3rd Qu.:20171031	TAS : 24	July :118	NA	Thursday :186	NA	NA	3rd Qu.:100.00	NA	NA
Max. :20172144	QLD : 0	September:108	NA	Friday :180	NA	NA	Max. :130.00	NA	NA
NA	(Other): 0	(Other) :552	NA	Saturday :239	NA	NA	NA	NA	NA

Hypothesis Testing

One-sample test: Test the average speed limit where fatalities happened
Two sample t-test: Test of average speed limits during christmas period and easter period.
Chi-square goodness of fit test: Test the ratio of fatalities during 3 years.
Chi-square test of association: Test the relationship between whether day of week has association with crash type.
Simple linear regression and correlation: Test the increase in Single vehicle crash frequency as time flows.

NOTE: The significance level in every test is 0.05.

One sample t-test

The mean speed limit where fatalities happened is 80. \[H_0: \mu_1 = 80 \] The mean speed limit where fatalities happened is not 80. \[H_A: \mu_1 \ne 80\]

t.test(sampled_data$Speed_Limit, mu = 80, alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  sampled_data$Speed_Limit
## t = 0.88213, df = 1339, p-value = 0.3779
## alternative hypothesis: true mean is not equal to 80
## 95 percent confidence interval:
##  79.36524 81.67207
## sample estimates:
## mean of x 
##  80.51866

p-value > 0.05 then not statistically significant
we fail to reject the null hypothesis

Hypthesis Testing Cont.

Chi-square goodness of fit test

H0: The number of fatalities are equal each year
HA: The number of fatalities are not equal each year

table(sampled_data$Year) %>% prop.table()

## 
##      2015      2016      2017 
## 0.3850746 0.3417910 0.2731343

pop_prop <- c(1/3, 1/3, 1/3)
chi1 <- chisq.test(table(sampled_data$Year), p = pop_prop)
chi1

## 
##  Chi-squared test for given probabilities
## 
## data:  table(sampled_data$Year)
## X-squared = 25.618, df = 2, p-value = 2.736e-06

chi1$observed

## 
## 2015 2016 2017 
##  516  458  366

chi1$expected

##     2015     2016     2017 
## 446.6667 446.6667 446.6667

p-value < 0.05, statistically significant
the fatality rate is not same for the three years.

Hypthesis Testing Cont.

Chi-square test of association

H0: There is no association of crash type dependent on weekday (independence) HA: There is an association of crash type dependent on weekday (dependence)

table(sampled_data$Crash_Type, sampled_data$Dayweek) %>% prop.table() -> table_proportion
knitr::kable(table_proportion)

	Sunday	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday
Single vehicle	0.0858209	0.0507463	0.0440299	0.0447761	0.0544776	0.0485075	0.0783582
Multiple vehicle	0.0507463	0.0537313	0.0597015	0.0686567	0.0582090	0.0597015	0.0716418
Pedestrian	0.0201493	0.0216418	0.0246269	0.0238806	0.0261194	0.0261194	0.0283582

chi2 <- chisq.test(table(sampled_data$Crash_Type, sampled_data$Dayweek))
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  table(sampled_data$Crash_Type, sampled_data$Dayweek)
## X-squared = 28.973, df = 12, p-value = 0.003976

chi2$observed

##                   
##                    Sunday Monday Tuesday Wednesday Thursday Friday
##   Single vehicle      115     68      59        60       73     65
##   Multiple vehicle     68     72      80        92       78     80
##   Pedestrian           27     29      33        32       35     35
##                   
##                    Saturday
##   Single vehicle        105
##   Multiple vehicle       96
##   Pedestrian             38

chi2$expected

##                   
##                      Sunday   Monday  Tuesday Wednesday Thursday   Friday
##   Single vehicle   85.41045 68.73507 69.95522  74.83582 75.64925 73.20896
##   Multiple vehicle 88.70149 71.38358 72.65075  77.71940 78.56418 76.02985
##   Pedestrian       35.88806 28.88134 29.39403  31.44478 31.78657 30.76119
##                   
##                     Saturday
##   Single vehicle    97.20522
##   Multiple vehicle 100.95075
##   Pedestrian        40.84403

p-value < 0.05, statistically significant
We can say that there is some association between the crash type and dayweek.

Hypthesis Testing Cont.

Simple Linear Regression - Pedestrian type increase in 2017

H0: The data do not fit the regression model
HA: The data fit the regression model

acc_sample <- sampled_data %>% filter(Year == 2017, Crash_Type == "Pedestrian")
month_table <- table(acc_sample$Month)

as.vector(month_table)

##  [1] 4 2 4 3 7 5 9 4 5 3 5 5

df <- data.frame(month = c(1:12), Pedestrian = month_table)
names(df)[3] <- paste("Pedestrian")

maxmodel <- lm(Pedestrian ~ month, data = df)
maxmodel %>% summary()

## 
## Call:
## lm(formula = Pedestrian ~ month, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1072 -0.9796 -0.2296  0.1183  4.2704 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.8485     1.1742   3.277  0.00832 **
## month         0.1259     0.1595   0.789  0.44844   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.908 on 10 degrees of freedom
## Multiple R-squared:  0.0586, Adjusted R-squared:  -0.03554 
## F-statistic: 0.6224 on 1 and 10 DF,  p-value: 0.4484

p-value > 0.05, not statistically significant
we fail to reject null hypothesis

Discussion

We can analyze different variables about happened road deaths in Australia from 2015 to 2017.
We focused on the Speed_Limit and observed that the mean speed limit where the accidents happened was nearly equal to 80(our assumption).
We observed that there were more fatalities in the Christmas period as compared to Easter period.(through descriptive)
We would have done the two sample test, but we didn’t because there was limited numerical data.
Chi Square goodness of fit test concluded that the numbers of fatalities per year are more or less equal.
Chi Square test of association concluded that crash type is associated with weekday.
Although, it is very difficult to show Linear Regression with our dataset, still we observed that data do not fit the linear regression model.
We are confident that we have tried to cover almost every category and we have the adequate sample size.

MATH1324 Assignment 3

Fatal Crashes

RPubs link information

Introduction

Problem Statement

Data

Data Cont.

Data Cont.

Descriptive Statistics and Visualisation

Descriptive Statistics Cont.

Descriptive Statistics Cont.

Descriptive Statistics Cont.

Descriptive Statistics Cont.

Descriptive Statistics Cont.

Hypothesis Testing

One sample t-test

Hypthesis Testing Cont.

Chi-square goodness of fit test

Hypthesis Testing Cont.

Chi-square test of association

Hypthesis Testing Cont.

Simple Linear Regression - Pedestrian type increase in 2017

Discussion

References