This assessment is based on the “Road Casualties in Great Britain 1969-84” inbuilt R data set, which is called “Seatbelts”.
The ‘Seatbelts’ data set in R is a multiple time-series data set that was commissioned by the Department of Transport in 1984 to measure differences in deaths before and after front seatbelt legislation was introduced on 31st January 1983. It provides monthly total numerical data on a number of incidents including those related to death and injury in Road Traffic Accidents (RTA’s). The data set starts in January 1969 and observations run until December 1984.
The Seatbelts data set needs to be called up and formatted for simpler viewing.
data(Seatbelts)
Seatbelts <- data.frame(Year=floor(time(Seatbelts)),
Month=factor(cycle(Seatbelts),
labels=month.abb), Seatbelts)
With any data set the obvious start is investigating the structure.
str(Seatbelts)
## 'data.frame': 192 obs. of 10 variables:
## $ Year : Time-Series from 1969 to 1985: 1969 1969 1969 1969 1969 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ DriversKilled: num 107 97 102 87 119 106 110 106 107 134 ...
## $ drivers : num 1687 1508 1507 1385 1632 ...
## $ front : num 867 825 806 814 991 ...
## $ rear : num 269 265 319 407 454 427 522 536 405 437 ...
## $ kms : num 9059 7685 9963 10955 11823 ...
## $ PetrolPrice : num 0.103 0.102 0.102 0.101 0.101 ...
## $ VanKilled : num 12 6 12 8 10 13 11 6 10 16 ...
## $ law : num 0 0 0 0 0 0 0 0 0 0 ...
There are 192 observations of ten variables in a numerical class. To gain a better understanding the first 6 records can be looked at.
head(Seatbelts)
## Year Month DriversKilled drivers front rear kms PetrolPrice VanKilled
## 1 1969 Jan 107 1687 867 269 9059 0.1029718 12
## 2 1969 Feb 97 1508 825 265 7685 0.1023630 6
## 3 1969 Mar 102 1507 806 319 9963 0.1020625 12
## 4 1969 Apr 87 1385 814 407 10955 0.1008733 8
## 5 1969 May 119 1632 991 454 11823 0.1010197 10
## 6 1969 Jun 106 1511 945 427 12391 0.1005812 13
## law
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
As the main investigation of the data set will concentrate on before and after seatbelt legislation was introduced, it is convenient to split the data set into two- one of before the legislation (labelled as law = 0), and one after (Labelled as law = 1).
As the dplyr package makes manipulating numerical datasets more intuitive it was used.
Law0Sub <-filter(Seatbelts,law ==0)
Law1Sub <- filter(Seatbelts,law!=0)
Then a summary of each subset was given:
summary(Law0Sub)
## Year Month DriversKilled drivers
## Min. :1969 Jan :15 Min. : 79.0 Min. :1309
## 1st Qu.:1972 Feb :14 1st Qu.:108.0 1st Qu.:1511
## Median :1976 Mar :14 Median :121.0 Median :1653
## Mean :1976 Apr :14 Mean :125.9 Mean :1718
## 3rd Qu.:1979 May :14 3rd Qu.:140.0 3rd Qu.:1926
## Max. :1983 Jun :14 Max. :198.0 Max. :2654
## (Other):84
## front rear kms PetrolPrice
## Min. : 567.0 Min. :224.0 Min. : 7685 Min. :0.08118
## 1st Qu.: 767.0 1st Qu.:344.0 1st Qu.:12387 1st Qu.:0.09078
## Median : 860.0 Median :401.0 Median :14455 Median :0.10273
## Mean : 873.5 Mean :400.3 Mean :14463 Mean :0.10187
## 3rd Qu.: 986.0 3rd Qu.:454.0 3rd Qu.:16585 3rd Qu.:0.11132
## Max. :1299.0 Max. :646.0 Max. :21040 Max. :0.13303
##
## VanKilled law
## Min. : 2.000 Min. :0
## 1st Qu.: 7.000 1st Qu.:0
## Median :10.000 Median :0
## Mean : 9.586 Mean :0
## 3rd Qu.:13.000 3rd Qu.:0
## Max. :17.000 Max. :0
##
summary(Law1Sub)
## Year Month DriversKilled drivers
## Min. :1983 Feb : 2 Min. : 60.0 Min. :1057
## 1st Qu.:1983 Mar : 2 1st Qu.: 85.0 1st Qu.:1171
## Median :1984 Apr : 2 Median : 92.0 Median :1282
## Mean :1984 May : 2 Mean :100.3 Mean :1322
## 3rd Qu.:1984 Jun : 2 3rd Qu.:119.0 3rd Qu.:1464
## Max. :1984 Jul : 2 Max. :154.0 Max. :1763
## (Other):11
## front rear kms PetrolPrice
## Min. :426.0 Min. :296.0 Min. :15511 Min. :0.1131
## 1st Qu.:516.0 1st Qu.:347.0 1st Qu.:17971 1st Qu.:0.1148
## Median :585.0 Median :408.0 Median :19162 Median :0.1161
## Mean :571.0 Mean :407.7 Mean :18890 Mean :0.1165
## 3rd Qu.:629.5 3rd Qu.:471.5 3rd Qu.:19952 3rd Qu.:0.1180
## Max. :721.0 Max. :521.0 Max. :21626 Max. :0.1201
##
## VanKilled law
## Min. :2.000 Min. :1
## 1st Qu.:3.500 1st Qu.:1
## Median :5.000 Median :1
## Mean :5.174 Mean :1
## 3rd Qu.:7.000 3rd Qu.:1
## Max. :8.000 Max. :1
##
The standard deviation of the drivers killed in each subset was calculated:
sd (Law0Sub$DriversKilled)
## [1] 24.26088
sd(Law1Sub$DriversKilled)
## [1] 22.2286
It can be seen from the summary statistics that there are less drivers killed after the legislation was passed, but this was explored further in a boxplot.
GGPlot2 is a package that provides superior graphics than the base package in R, and operates by using “layers” to build on.
Bluebox <-ggplot(Seatbelts, aes(x=factor(law), y =DriversKilled)) +geom_boxplot(fill = "skyblue") +theme_grey()+ylab ("Monthly Driver Mortality")+xlab("Before and after law introduced")
Bluebox
From the boxplot the values for before and after legislation are easily identified.
The data of the total killed before and legislation compared statistically. The data was checked for normality:
shapiro.test(Seatbelts$DriversKilled)
##
## Shapiro-Wilk normality test
##
## data: Seatbelts$DriversKilled
## W = 0.97557, p-value = 0.001938
As the p value was less than 0.05 the null hypothesis was rejected, the data is not normally distributed. This means that a non-parametric test should be used to compare the two values:
wilcox.test (Seatbelts$DriversKilled ~ Seatbelts$law)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Seatbelts$DriversKilled by Seatbelts$law
## W = 3058, p-value = 8.343e-06
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon Sum Rank Test p value was less than 0.05, so the medians of the drivers killed do differ significantly.
The proportion of drivers and passengers killed or seriously injured before and after legislation was tested. The totals of each column were produced and put into a matrix, which was then put into a table, and proportions were calculated.
totaldriv0 <- sum(Law0Sub$Drivers)
totalfront0<- sum (Law0Sub$front)
totalrear0 <- sum (Law0Sub$rear)
totaldriv1 <-sum(Law1Sub$Drivers)
totalfront1 <-sum (Law1Sub$front)
totalrear1 <-sum (Law1Sub$rear)
tablesums <- matrix(c(290300, 30399,147614,13132,67654,9378), ncol = 3)
tabprop <- prop.table (tablesums,margin = 1)
colnames(tabprop) <- c("Drivers", "Front", "Back")
rownames (tabprop) <-c ("Before", "After")
knitr::kable(tabprop, caption = "Proportion of Drivers and Passengers Killed")
| Drivers | Front | Back | |
|---|---|---|---|
| Before | 0.5742056 | 0.2919765 | 0.1338178 |
| After | 0.5745525 | 0.2481997 | 0.1772477 |
From the table it can be seen that the proportions drivers killed were only slightly different, but the front passengers were less, although the back were proportionally more.
To calculation correlation and regression between two factors, the subset value of before the law was passed were used,so the data wasn’t skewed by the different factors.
The ggplot package was used to explore any correlation between the amount of drivers killed, and the price of petrol, and a regression line was modelled:
Priceplot <- ggplot (Law0Sub, aes(x = PetrolPrice, y= DriversKilled))+geom_point(size = 2, shape = 18, col="darkgreen") +stat_smooth (method = lm)+ xlab("Petrol Cost") + ylab (" Monthly Driver Mortalities")
Priceplot
The regression linear model and was calculated:
model0 <-lm(DriversKilled ~ PetrolPrice, data= Law0Sub)
model0
##
## Call:
## lm(formula = DriversKilled ~ PetrolPrice, data = Law0Sub)
##
## Coefficients:
## (Intercept) PetrolPrice
## 190.5 -634.7
A summary of the model was produced for accuracy:
summary(model0)
##
## Call:
## lm(formula = DriversKilled ~ PetrolPrice, data = Law0Sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.804 -17.664 -4.083 14.058 62.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.53 15.33 12.425 < 2e-16 ***
## PetrolPrice -634.69 149.51 -4.245 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.12 on 167 degrees of freedom
## Multiple R-squared: 0.0974, Adjusted R-squared: 0.092
## F-statistic: 18.02 on 1 and 167 DF, p-value: 3.616e-05
Then the correlation value:
cor.test (Law0Sub$DriversKilled, Law0Sub$PetrolPrice, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: Law0Sub$DriversKilled and Law0Sub$PetrolPrice
## S = 1042200, p-value = 9.537e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2956175
The rho value of -0.295 suggests a weak negative relationship between the two variables, as the price of petrol increases, then the number of driver’s killed decreases.
Then the correlation was studied between drivers killed and the distance they traveled, and a scatterplot using ggplot2 was used again.
distplot <-ggplot(Seatbelts, aes(x = kms, y= DriversKilled))+geom_point(size = 2, shape = 18, col="darkgreen") +stat_smooth (method = lm)+ xlab("Kms Travelled") + ylab (" Monthly Driver Mortalities")
distplot
The regression linear model and was calculated:
model1 <-lm(DriversKilled ~ kms, data= Law0Sub)
model1
##
## Call:
## lm(formula = DriversKilled ~ kms, data = Law0Sub)
##
## Coefficients:
## (Intercept) kms
## 151.091397 -0.001744
A summary of the model was completed:
summary(model1)
##
## Call:
## lm(formula = DriversKilled ~ kms, data = Law0Sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.117 -17.551 -3.379 15.450 67.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.511e+02 1.017e+01 14.85 <2e-16 ***
## kms -1.744e-03 6.919e-04 -2.52 0.0127 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.88 on 167 degrees of freedom
## Multiple R-squared: 0.03664, Adjusted R-squared: 0.03087
## F-statistic: 6.352 on 1 and 167 DF, p-value: 0.01266
The value of The Spearmans correlation value was calculated:
cor.test (Law0Sub$DriversKilled, Law0Sub$kms, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: Law0Sub$DriversKilled and Law0Sub$kms
## S = 925520, p-value = 0.05077
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1505207
The rho value of -0.150 again suggested a weak negative relationship, as the distance traveled increases,the number of driver deaths decreases.