The historical record of the past three years shows that the truck crash rate on 2-Lane roads in the (hypothetical) city of Zreeha is 0.5 crashes per million vehicle-kilometers, with a standard deviation of 0.1. After the completion of a compulsory refresher driving course for truck drivers, crash rates were recorded at 50 random sites on 2-Lane roads to estimate the current statistics. Given the crash rate data in the following section, estimate whether the driving course made any difference in terms of crash rates.
We have stored the data in a data frame (df), which is sort of like a spreadsheet.
df <- c(0.3, 0.91, 0.69, 0.57, 0.28, 0.86, 0.68, 0.36, 0.83, 0.88,
0.76, 0.85, 1.05, 0.7, 0.83, 0.76, 0.51, 0.6, 0.8, 0.71, 1.24,
1.09, 0.23, 0.67, 0.97, 0.97, 0.37, 0.31, 0.76, 0.63, 0.91, 0.97,
0.4, 0.59, 1.05, 0.57, 0.12, 0.69, 0.86, 0.7, 0.6, 0.69, 0.51,
0.72, 1.27, 0.82, 0.43, 0.86, 0.32, 0.83)
df <- data.frame(df)
# Name the column "CR" (crash rate)
colnames(df) <- "CR"
If we get this data in a text/CSV/Excel file, we can read it into R using the readr or readxl packages, or the default read.table function.
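For example, if the crash rates were stored in a CSV file with a single CR column (crash_rates.csv is a hypothetical file name used only for illustration), it could be read in like this:
# Hypothetical file name; in this post the data above was entered directly in R
library(readr)
df <- read_csv("crash_rates.csv")
# or, with base R: df <- read.table("crash_rates.txt", header = TRUE)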
The following shows the distribution of crash rates in the sample:
# Loading the plotting library
library(ggplot2)
gg1 <- ggplot(data = df) +
geom_histogram(mapping = aes(x = CR), fill = "skyblue") +
# geom_density(mapping = aes(x = CR), color = "blue", size = 1) +
theme_bw()
# Make the histogram interactive with plotly
library(plotly)
ggplotly(gg1)
The distribution looks nearly normal.
From the given data we obtained the following information:
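In particular, the sample mean can be computed directly from the data frame:
mean(df$CR)        # sample mean crash rate at the 50 sites
## [1] 0.7016
mean(df$CR) - 0.5  # difference from the historical mean of 0.5
## [1] 0.2016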
Looking at the difference of 0.2016 between the sample mean (0.7016) and the historical population mean (0.5), one might argue that the higher sample mean implies an increase in truck crash rates on 2-Lane roads even after the training of drivers.
However, a sample mean is just the mean of a subset of the population, so it might not be a true representation of the population mean. The observed difference might simply be due to sampling error (the difference between the population and sample means that arises by chance).
Therefore, we will formulate our null hypothesis as:
The current population mean of the crash rate of truck drivers on 2-Lane roads is 0.5, i.e.
\(H_0\): \(\mu\) = 0.5
and the alternative hypothesis would be:
\(H_A\): \(\mu\) \(\neq\) 0.5
We will test this hypothesis at the 95% confidence level. To understand the theory behind hypothesis testing, imagine taking infinitely many random samples (each of size 50) of crash rates on 2-Lane roads in the city of Zreeha and computing the mean of each sample. Plotting a histogram or density of these means gives the 'Sampling Distribution of Sample Means'.
According to the Central Limit Theorem (CLT), this distribution of sample means is approximately normal, its mean is equal to the mean of the population distribution, and its standard deviation is the standard error \(\sigma/\sqrt{n}\). Here we are hypothesizing that the mean of the current population is 0.5, the same as the historical population of crash rates, i.e. there is no difference in the crash rates of drivers even after the new training, and the observed difference of 0.2016 is simply due to chance.
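To see what such a sampling distribution looks like, here is a minimal simulation sketch (assuming, purely for illustration, that individual crash rates follow a normal distribution with the historical mean of 0.5 and standard deviation of 0.1):
# Illustration only: simulate the sampling distribution of sample means
set.seed(1)
sample.means <- replicate(10000, mean(rnorm(50, mean = 0.5, sd = 0.1)))
mean(sample.means)   # close to the population mean of 0.5
sd(sample.means)     # close to the standard error 0.1/sqrt(50) = 0.0141
hist(sample.means, main = "Sampling distribution of sample means")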
At the 5% significance level (we are tolerating only a 5% chance of rejecting the null hypothesis when it is in fact true), the critical Z value can be computed as:
Z.critical <- qnorm(1-(0.05/2))
Z.critical
## [1] 1.959964
Any value more extreme than \(\pm\)1.96 is considered so unlikely to occur by chance under the null hypothesis that the null hypothesis will be rejected.
To see where our sample mean falls, we compute the test statistic Z:
\[Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\]
Z <- (mean(df$CR) - 0.5)/(0.1/sqrt(50))
Z
## [1] 14.25527
As the test statistic of 14.2553 is far higher than the critical value of +1.96, we reject the null hypothesis that the mean crash rate of trucks on 2-Lane roads in Zreeha is the same as its 3-year historical average. However, the two-tailed test alone does not tell us whether the crash rates have increased or decreased. To confirm that the new results are worse, we have to do a one-tailed test on the population mean, i.e. \(H_A\): \(\mu\) > 0.5.
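A minimal sketch of that one-tailed check, reusing the test statistic computed above, would be:
# One-tailed test at the same 5% level: all of the rejection region
# sits in the upper tail
Z.critical.one <- qnorm(0.95)
Z.critical.one
## [1] 1.644854
Z > Z.critical.one   # TRUE: the one-tailed test also rejects, suggesting an increase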
Another way to reach the same conclusion is to compute the p-value: the probability of obtaining the observed outcome, or one more extreme, given that the null hypothesis is true.
If the null hypothesis is true, then the probability of getting the current data (or something more extreme) can be computed as:
p.lower <- pnorm(-14.25)       # lower-tail probability
p.upper <- 1 - pnorm(14.25)    # upper-tail probability (rounds to 0 in floating point)
pval <- p.lower + p.upper
pval
## [1] 2.241406e-46
Because the p-value of \(2.24 \times 10^{-46}\) is far smaller than the significance level of 0.05, we reject the null hypothesis.
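As a quick cross-check, the same two-sided p-value can be obtained in one line from the exact test statistic; it remains vanishingly small compared with 0.05.
# Two-sided p-value using the exact Z computed earlier
2 * pnorm(-abs(Z))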