Analysis of Road Casualties dataset

This assessment is based on the “Road Casualties in Great Britain 1969-84” inbuilt R data set, which is called “Seatbelts”.

The ‘Seatbelts’ data set in R is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seatbelt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA’s). The data set starts in January 1969 and observations run until December 1984.

The Seatbelts data set needs to be called up and formatted for simpler viewing.

data(Seatbelts)
Seatbelts <- data.frame(Year=floor(time(Seatbelts)),
Month=factor(cycle(Seatbelts),
labels=month.abb), Seatbelts)

With any data set the obvious start is investigating the structure.

str(Seatbelts)

## 'data.frame':    192 obs. of  10 variables:
##  $ Year         : Time-Series  from 1969 to 1985: 1969 1969 1969 1969 1969 ...
##  $ Month        : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ DriversKilled: num  107 97 102 87 119 106 110 106 107 134 ...
##  $ drivers      : num  1687 1508 1507 1385 1632 ...
##  $ front        : num  867 825 806 814 991 ...
##  $ rear         : num  269 265 319 407 454 427 522 536 405 437 ...
##  $ kms          : num  9059 7685 9963 10955 11823 ...
##  $ PetrolPrice  : num  0.103 0.102 0.102 0.101 0.101 ...
##  $ VanKilled    : num  12 6 12 8 10 13 11 6 10 16 ...
##  $ law          : num  0 0 0 0 0 0 0 0 0 0 ...

There are 192 observations of ten variables in a numerical class. To gain a better understanding the first 6 records can be looked at.

head(Seatbelts)

##   Year Month DriversKilled drivers front rear   kms PetrolPrice VanKilled
## 1 1969   Jan           107    1687   867  269  9059   0.1029718        12
## 2 1969   Feb            97    1508   825  265  7685   0.1023630         6
## 3 1969   Mar           102    1507   806  319  9963   0.1020625        12
## 4 1969   Apr            87    1385   814  407 10955   0.1008733         8
## 5 1969   May           119    1632   991  454 11823   0.1010197        10
## 6 1969   Jun           106    1511   945  427 12391   0.1005812        13
##   law
## 1   0
## 2   0
## 3   0
## 4   0
## 5   0
## 6   0

As the main investigation of the data set will concentrate on before and after seatbelt legislation was introduced, it is convenient to split the data set into two- one of before the legislation (labelled as law = 0), and one after (Labelled as law = 1).

As the dplyr package makes manipulating numerical datasets more intuitive it was used.

Law0Sub <-filter(Seatbelts,law ==0)
Law1Sub <- filter(Seatbelts,law!=0)

Then a summary of each subset was given:

summary(Law0Sub)

##       Year          Month    DriversKilled      drivers    
##  Min.   :1969   Jan    :15   Min.   : 79.0   Min.   :1309  
##  1st Qu.:1972   Feb    :14   1st Qu.:108.0   1st Qu.:1511  
##  Median :1976   Mar    :14   Median :121.0   Median :1653  
##  Mean   :1976   Apr    :14   Mean   :125.9   Mean   :1718  
##  3rd Qu.:1979   May    :14   3rd Qu.:140.0   3rd Qu.:1926  
##  Max.   :1983   Jun    :14   Max.   :198.0   Max.   :2654  
##                 (Other):84                                 
##      front             rear            kms         PetrolPrice     
##  Min.   : 567.0   Min.   :224.0   Min.   : 7685   Min.   :0.08118  
##  1st Qu.: 767.0   1st Qu.:344.0   1st Qu.:12387   1st Qu.:0.09078  
##  Median : 860.0   Median :401.0   Median :14455   Median :0.10273  
##  Mean   : 873.5   Mean   :400.3   Mean   :14463   Mean   :0.10187  
##  3rd Qu.: 986.0   3rd Qu.:454.0   3rd Qu.:16585   3rd Qu.:0.11132  
##  Max.   :1299.0   Max.   :646.0   Max.   :21040   Max.   :0.13303  
##                                                                    
##    VanKilled           law   
##  Min.   : 2.000   Min.   :0  
##  1st Qu.: 7.000   1st Qu.:0  
##  Median :10.000   Median :0  
##  Mean   : 9.586   Mean   :0  
##  3rd Qu.:13.000   3rd Qu.:0  
##  Max.   :17.000   Max.   :0  
##

summary(Law1Sub)

##       Year          Month    DriversKilled      drivers    
##  Min.   :1983   Feb    : 2   Min.   : 60.0   Min.   :1057  
##  1st Qu.:1983   Mar    : 2   1st Qu.: 85.0   1st Qu.:1171  
##  Median :1984   Apr    : 2   Median : 92.0   Median :1282  
##  Mean   :1984   May    : 2   Mean   :100.3   Mean   :1322  
##  3rd Qu.:1984   Jun    : 2   3rd Qu.:119.0   3rd Qu.:1464  
##  Max.   :1984   Jul    : 2   Max.   :154.0   Max.   :1763  
##                 (Other):11                                 
##      front            rear            kms         PetrolPrice    
##  Min.   :426.0   Min.   :296.0   Min.   :15511   Min.   :0.1131  
##  1st Qu.:516.0   1st Qu.:347.0   1st Qu.:17971   1st Qu.:0.1148  
##  Median :585.0   Median :408.0   Median :19162   Median :0.1161  
##  Mean   :571.0   Mean   :407.7   Mean   :18890   Mean   :0.1165  
##  3rd Qu.:629.5   3rd Qu.:471.5   3rd Qu.:19952   3rd Qu.:0.1180  
##  Max.   :721.0   Max.   :521.0   Max.   :21626   Max.   :0.1201  
##                                                                  
##    VanKilled          law   
##  Min.   :2.000   Min.   :1  
##  1st Qu.:3.500   1st Qu.:1  
##  Median :5.000   Median :1  
##  Mean   :5.174   Mean   :1  
##  3rd Qu.:7.000   3rd Qu.:1  
##  Max.   :8.000   Max.   :1  
##

The standard deviation of the drivers killed in each subset was calculated:

sd (Law0Sub$DriversKilled)

## [1] 24.26088

sd(Law1Sub$DriversKilled)

## [1] 22.2286

It can be seen from the summary statistics that there are less drivers killed after the legislation was passed, but this was explored further in a boxplot.

GGPlot2 is a package that provides superior graphics than the base package in R, and operates by using “layers” to build on.

Bluebox <-ggplot(Seatbelts, aes(x=factor(law), y =DriversKilled)) +geom_boxplot(fill = "skyblue") +theme_grey()+ylab ("Monthly Driver Mortality")+xlab("Before and after law introduced")
Bluebox

From the boxplot the values for before and after legislation are easily identified.

The data of the total killed before and legislation compared statistically. The data was checked for normality:

shapiro.test(Seatbelts$DriversKilled)

## 
##  Shapiro-Wilk normality test
## 
## data:  Seatbelts$DriversKilled
## W = 0.97557, p-value = 0.001938

As the p value was less than 0.05 the null hypothesis was rejected, the data is not normally distributed. This means that a non-parametric test should be used to compare the two values:

wilcox.test (Seatbelts$DriversKilled ~ Seatbelts$law)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Seatbelts$DriversKilled by Seatbelts$law
## W = 3058, p-value = 8.343e-06
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon Sum Rank Test p value was less than 0.05, so the medians of the drivers killed do differ significantly.

Testing of Proportions

The proportion of drivers and passengers killed or seriously injured before and after legislation was tested. The totals of each column were produced and put into a matrix, which was then put into a table, and proportions were calculated.

totaldriv0 <- sum(Law0Sub$Drivers) 
totalfront0<- sum (Law0Sub$front)
totalrear0 <- sum (Law0Sub$rear)
totaldriv1 <-sum(Law1Sub$Drivers)
totalfront1 <-sum (Law1Sub$front)
totalrear1 <-sum (Law1Sub$rear)

tablesums <- matrix(c(290300, 30399,147614,13132,67654,9378), ncol = 3)
tabprop <- prop.table (tablesums,margin = 1)
colnames(tabprop) <- c("Drivers", "Front", "Back")
rownames (tabprop) <-c ("Before", "After")

knitr::kable(tabprop, caption = "Proportion of Drivers and Passengers Killed")

Proportion of Drivers and Passengers Killed
	Drivers	Front	Back
Before	0.5742056	0.2919765	0.1338178
After	0.5745525	0.2481997	0.1772477

From the table it can be seen that the proportions drivers killed were only slightly different, but the front passengers were less, although the back were proportionally more.

Correlation and Regression

To calculation correlation and regression between two factors, the subset value of before the law was passed were used,so the data wasn’t skewed by the different factors.

The ggplot package was used to explore any correlation between the amount of drivers killed, and the price of petrol, and a regression line was modelled:

Priceplot <- ggplot (Law0Sub, aes(x = PetrolPrice, y= DriversKilled))+geom_point(size = 2, shape = 18, col="darkgreen") +stat_smooth (method = lm)+ xlab("Petrol Cost") + ylab (" Monthly Driver Mortalities")

Priceplot

The regression linear model and was calculated:

model0 <-lm(DriversKilled ~ PetrolPrice, data= Law0Sub)
model0

## 
## Call:
## lm(formula = DriversKilled ~ PetrolPrice, data = Law0Sub)
## 
## Coefficients:
## (Intercept)  PetrolPrice  
##       190.5       -634.7

A summary of the model was produced for accuracy:

summary(model0)

## 
## Call:
## lm(formula = DriversKilled ~ PetrolPrice, data = Law0Sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.804 -17.664  -4.083  14.058  62.712 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   190.53      15.33  12.425  < 2e-16 ***
## PetrolPrice  -634.69     149.51  -4.245 3.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.12 on 167 degrees of freedom
## Multiple R-squared:  0.0974, Adjusted R-squared:  0.092 
## F-statistic: 18.02 on 1 and 167 DF,  p-value: 3.616e-05

Then the correlation value:

cor.test (Law0Sub$DriversKilled, Law0Sub$PetrolPrice, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  Law0Sub$DriversKilled and Law0Sub$PetrolPrice
## S = 1042200, p-value = 9.537e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2956175

The rho value of -0.295 suggests a weak negative relationship between the two variables, as the price of petrol increases, then the number of driver’s killed decreases.

Then the correlation was studied between drivers killed and the distance they traveled, and a scatterplot using ggplot2 was used again.

distplot <-ggplot(Seatbelts, aes(x = kms, y= DriversKilled))+geom_point(size = 2, shape = 18, col="darkgreen") +stat_smooth (method = lm)+ xlab("Kms Travelled") + ylab (" Monthly Driver Mortalities")
distplot

The regression linear model and was calculated:

model1 <-lm(DriversKilled ~ kms, data= Law0Sub)
model1

## 
## Call:
## lm(formula = DriversKilled ~ kms, data = Law0Sub)
## 
## Coefficients:
## (Intercept)          kms  
##  151.091397    -0.001744

A summary of the model was completed:

summary(model1)

## 
## Call:
## lm(formula = DriversKilled ~ kms, data = Law0Sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.117 -17.551  -3.379  15.450  67.878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.511e+02  1.017e+01   14.85   <2e-16 ***
## kms         -1.744e-03  6.919e-04   -2.52   0.0127 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.88 on 167 degrees of freedom
## Multiple R-squared:  0.03664,    Adjusted R-squared:  0.03087 
## F-statistic: 6.352 on 1 and 167 DF,  p-value: 0.01266

The value of The Spearmans correlation value was calculated:

cor.test (Law0Sub$DriversKilled, Law0Sub$kms, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  Law0Sub$DriversKilled and Law0Sub$kms
## S = 925520, p-value = 0.05077
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1505207

The rho value of -0.150 again suggested a weak negative relationship, as the distance traveled increases,the number of driver deaths decreases.

Analysis of Road Casualties dataset

Vikki Grist

November 2016

Testing of Proportions

Correlation and Regression