Case Study: Vehicle Accidents in the City of Barcelona

Data Analysis and Visualization in R (IN2339): A Computer Science Class at the Technical University of Munich

Authors: Mara Geisler, Anna Lenk, Matthias Watzl, Viviane Weinstabl (23/01/2022)

Motivation

In 2019 there were a total of 10,027 accidents in the city of Barcelona, resulting in 11,864 victims, of which 222 were seriously injured or killed. The numbers in 2016 were quite similar: 10,140 accidents; 12,115 victims; 220 serious injuries or deaths. As laid out by the city council’s Local Road Safety Plan 2019-2022, the goal is to strengthen road safety and reduce road deaths and serious injuries by 20% and 16%, respectively. Measures to achieve this include: introducing a speed limit of 30 km/h on secondary roads; campaigns to decrease cases of DUI, speeding and traffic offences; and increasing the number of speed radars.

By investigating preventable factors related to accident occurrence, we aim to complement the city council’s analysis and plan of action. More precisely, we investigate 1) the relationship between the hour of day and the number of accidents linked to alcohol & drugs, 2) vehicle color and time of day, 3) geo-spatial factors for accident density.

Data Preparation

The necessary data preparation steps are omitted in the compiled PDF file. First, we load the 2016-2019 CSV data files from the online platform Open Data Barcelona. The data sets are: 1) Accidents managed by the local police (incl. # of injuries by severity level); 2) Accidents and their external causes (e.g. alcohol, speeding), if any; 3) Accidents incl. people (driver, pedestrian, passenger) involved, their description and degree of injury; and 4) Accidents incl. vehicles involved and their specifications.

Additionally, we use data on the daily distribution of travel (number of trips) per hour, collected in the Enquesta de Mobilitat en Dia Feiner for the Barcelona Metropolitan Integrated Mobility System (SIMMB).

We translate column headers to English, clean the data (e.g. converting character columns to numeric, adjusting date-based columns, fixing some data errors), and merge the data for accident causes onto the base data table (dt_acc) with all unique accidents by the primary key “File_Nr”.
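
As a minimal sketch of the merge step, the example below attaches causes to a base table by the key File_Nr using base R. The toy tables and their values are invented for illustration; only the key name comes from the text, and the actual preparation code may differ.

```r
# Toy stand-ins for the accident and cause tables (all values are invented)
dt_acc_toy <- data.frame(File_Nr = c("A1", "A2", "A3"),
                         Hour = c(8, 23, 17))
dt_causes_toy <- data.frame(File_Nr = "A2",
                            External_Cause = "Alcohol")

# Left join: keep every accident and attach a cause where one exists
merged <- merge(dt_acc_toy, dt_causes_toy, by = "File_Nr", all.x = TRUE)

# The key must stay unique after the merge, otherwise accidents are double-counted
stopifnot(!anyDuplicated(merged$File_Nr))
print(merged)
```

Accidents without a recorded cause keep an NA in External_Cause, so they remain in the base table for all later counts.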

Data Analysis

Overview of Accidents for the Average Day from 2016-2019

We plotted dt_acc by different temporal groupings (month, day, and hour of day) and saw a major difference in the occurrence of accidents by hour of day (Fig. 1.1). This can be explained in large part by the higher frequency of trips at those hours (Fig. 1.2), which shows similar peaks around 8-9 AM and 5-7 PM.

Interestingly, the fatality rate (total deaths divided by total accidents at each hour) and the severe injury rate (deaths plus severe injuries, divided by total accidents) are higher at night than during the day. To test the significance of these observations, we compute the correlations between number of accidents, number of trips and injury rates by hour of day. Since the data are not normally distributed (in Table 1, the Shapiro-Wilk p-values are below 0.05 for every factor except “Hour”), we use the Spearman rank correlation with its associated significance test (Fig. 1.3).

Table 1: P-Values for Shapiro-Wilk Normality Test
Factor	p.value
Hour	0.4159
Nr_Accidents	0.0147
Fatality_Rt	0.0000
Severe_Rt	0.0362
Avg_Trips	0.0055
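
A normality check of this kind can be sketched as follows, applying shapiro.test from base R to each column of a table. The toy data frame is invented (only the column Hour mirrors the real table), so the resulting p-values are not those of Table 1.

```r
set.seed(1)
# Invented toy data: the hours of a day and a right-skewed variable
toy <- data.frame(Hour = 0:23,
                  Skewed = rexp(24))

# Shapiro-Wilk test per column; a small p-value speaks against normality
p_values <- sapply(toy, function(col) shapiro.test(col)$p.value)
result <- data.frame(Factor = names(p_values),
                     p.value = round(unname(p_values), 4))
print(result)
```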

# hourly accident counts, fatality rate and severe injury rate for 2019
dt_heatmap <- dt_acc[Year==2019, .(Nr_Accidents = .N, Fatality_Rt = sum(Nr_Deaths)/.N, 
                       Severe_Rt = sum(Nr_Deaths+Nr_Injuries_Severe)/.N),  by=Hour]
dt_heatmap <- merge(dt_heatmap, trips_data, by="Hour")
# correlations excluded if insignificant (p-test), ordered hierarchically;
# the p-values use the same Spearman method as the coefficients
ggcorrplot(round(cor(dt_heatmap, method="spearman"),2), 
           p.mat = cor_pmat(dt_heatmap, method="spearman", exact=FALSE), 
           ggtheme = ggplot2::theme_gray, hc.order=TRUE, lab=TRUE, type = "lower",
           outline.col = "white", colors = colors1) + ggtitle(title1) + theme1
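
To illustrate what the rank correlation measures, the sketch below computes a Spearman coefficient on invented values and confirms that it equals the Pearson correlation of the ranks. It uses only base R; the vectors x and y are hypothetical and unrelated to the accident data.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(30, 25, 27, 12, 10, 3)   # invented values, mostly decreasing in x

rho <- cor(x, y, method = "spearman")
rho_ranks <- cor(rank(x), rank(y))           # Pearson correlation of the ranks
test <- cor.test(x, y, method = "spearman")  # significance test for rho

# Both routes give the same coefficient
stopifnot(isTRUE(all.equal(rho, rho_ranks)))
print(c(rho = rho, p.value = test$p.value))
```

Because only the ordering of the values matters, the coefficient does not rely on the data being normally distributed.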

Claim 1: The Average Number of Substance-Related Accidents is Higher During the Night

In particular, the negative rank correlations in Fig. 1.3 between the number of accidents and both the fatality rate and the serious injury rate attracted our attention. One explanation is that the fatality and serious injury rates increase at night while the number of accidents decreases, as seen in Fig. 1.1. We therefore look into the indirect factors that make night-time driving less safe.

Table 2 displays the distribution of accidents linked to an external cause and serves as a first indication. It shows that the largest portion of accidents in Barcelona with an external cause are related to the use of substances (alcohol, drugs or medicine).

Table 2: Accidents Linked to an External Cause 2016-19
External_Cause	Nr_Accidents	Share
Alcoholism, Drugs or Medicine	1744	74.2%
Road in poor condition	293	12.5%
Speeding	251	10.7%
Other	63	2.7%
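
The shares in Table 2 follow directly from the counts. The short sketch below recomputes them in base R from the four values in the table; the variable names are our own.

```r
# Counts taken from Table 2
counts <- c("Alcoholism, Drugs or Medicine" = 1744,
            "Road in poor condition" = 293,
            "Speeding" = 251,
            "Other" = 63)

# Share of each cause among all accidents with an external cause, in percent
shares <- round(100 * counts / sum(counts), 1)
print(data.frame(Nr_Accidents = counts, Share = shares))
```

The rounded shares add up to 100.1% rather than 100% because each value is rounded separately.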

As most people tend to consume more substances on weekends than on weekdays, we look further into the average number of substance-related accidents on weekdays versus weekends. The line graph in Fig. 2.1 displays the average number of substance-related accidents per hour in 2016-2019. It shows that the hourly average is higher on weekends than on weekdays. On weekends, substance-related accidents peak at 7 AM. On weekdays, however, they peak at 11 PM and decrease thereafter.
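
One way to obtain such hourly averages is sketched below in base R: accidents are counted per day type and hour, then divided by the number of days of that type. The toy records are invented, and for simplicity the days are counted from the dates present in the toy data, which is an assumption rather than the method used for Fig. 2.1.

```r
# Invented toy records of substance-related accidents
acc <- data.frame(
  Date = as.Date(c("2019-03-01", "2019-03-02", "2019-03-02",
                   "2019-03-03", "2019-03-04")),
  Hour = c(23, 7, 7, 7, 23)
)

# wday: 0 = Sunday, 6 = Saturday (independent of the system locale)
wday <- as.POSIXlt(acc$Date)$wday
acc$Day_Type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
acc$n <- 1

# Accidents per day type and hour
counts <- aggregate(n ~ Day_Type + Hour, data = acc, FUN = sum)

# Number of distinct days per day type, then the average per day
n_days <- tapply(acc$Date, acc$Day_Type, function(d) length(unique(d)))
counts$Avg <- counts$n / as.numeric(n_days[counts$Day_Type])
print(counts)
```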

We know that the number of traffic accidents decreased sharply during the Covid-19 pandemic in 2020 and 2021; however, the number of drunk-driving accidents increased. This underlines the urgent need for more frequent traffic checks during the night hours from 11 PM to 7 AM to combat driving under the influence of substances.

Claim 2: The Proportion of Black Vehicles Involved in Accidents is Higher at Night than During Daytime

With regard to vehicle color, our assumption is that certain colors are less visible at night and are thus more often involved in accidents during darkness than during daylight. In this context, we focus on black vehicles and compare the proportion of black vehicles involved in accidents at nighttime with the proportion during the daytime, using the 2016 to 2019 data on vehicles involved in accidents.

First, we determine the total number of vehicles involved in accidents per nighttime (22:00-5:59) as well as per daytime (6:00-21:59) shifts for each date. Second, we determine for each nighttime and daytime shift per date the number of black vehicles involved in an accident. Thus, for each date, we obtain the relative proportion (share) of black vehicles for the nighttime and the daytime.
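
The two steps above can be sketched in base R as follows. The toy records and the English color labels are invented, and assigning the hours 0-5 to the calendar date they fall on is an assumption; the text does not say how nights spanning midnight were dated.

```r
# Invented toy records of vehicles involved in accidents
veh <- data.frame(
  Date  = as.Date(c("2019-05-01", "2019-05-01", "2019-05-01",
                    "2019-05-01", "2019-05-02")),
  Hour  = c(23, 2, 9, 14, 10),
  Color = c("Black", "White", "Black", "Red", "Grey")
)

# Night shift: 22:00-5:59, day shift: 6:00-21:59 (as defined in the text)
veh$Shift <- ifelse(veh$Hour >= 22 | veh$Hour < 6, "Night", "Day")
veh$Is_Black <- veh$Color == "Black"

# Share of black vehicles per date and shift (mean of a logical = proportion)
shares <- aggregate(Is_Black ~ Date + Shift, data = veh, FUN = mean)
names(shares)[names(shares) == "Is_Black"] <- "Share"
print(shares)
```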

It can be seen that the distributions for the day and night shifts differ strongly (see Fig. 3.1). While the day shift shows a clear cluster at just under 25%, the distribution for the night shift is much broader, which also results in a higher median. To validate our assumption, we test the null hypothesis: “The distributions of the two populations Night and Day are equal.” Although the two populations are independent, we cannot assume a Gaussian distribution. Therefore, we perform the Wilcoxon rank-sum test, which yields a p-value < 2.2e-16. We reject the null hypothesis; together with the higher median, this indicates that the share for the night shift is significantly higher than the share for the day shift.

To support our claim, we need to take into account a factor mentioned at the beginning of our analysis: the number of trips has a substantial impact on the number of accidents. At night, far fewer (black) vehicles are on the road, so they are less likely to be involved in an accident for this reason alone, which may act as a confounding factor. In fact, we observe many shifts with no or very few accidents, often with no black vehicles involved (see Fig. 3.2). Numerically, about 360 night shifts and about 100 day shifts have no black vehicle involved in an accident. Keeping these shifts in the populations leads to many 0% shares, which in turn have a large impact on the comparison between nighttime and daytime. To account for this, we filter out shifts without black vehicles, so that we only compare shifts that include at least one accident involving a black vehicle. Given the rather high number of shifts without black vehicles, this step is necessary to control for this confounding factor.
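
The filtering step can be sketched as follows, again in base R. The shares are randomly generated toy values with a block of 0% night shifts added by hand, so the resulting test statistics say nothing about the real data; the point is only to show the filter and the repeated test.

```r
set.seed(42)
# Invented toy shares: 50 day shifts, 50 night shifts (20 of them at 0%)
shares <- data.frame(
  Shift = rep(c("Day", "Night"), each = 50),
  Share = c(runif(50, 0.15, 0.35),
            rep(0, 20), runif(30, 0.20, 0.60))
)

# Keep only shifts with at least one black vehicle involved
nonzero <- subset(shares, Share > 0)

# Wilcoxon rank-sum test before and after filtering
# (exact = FALSE because the 0% shares are tied)
res_all <- wilcox.test(Share ~ Shift, data = shares, exact = FALSE)
res_nonzero <- wilcox.test(Share ~ Shift, data = nonzero, exact = FALSE)

print(table(nonzero$Shift))
print(c(all = res_all$p.value, nonzero = res_nonzero$p.value))
```

Comparing the two results shows how strongly the 0% shares drive the comparison in a given data set.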

#violin plotting day and night comparison
l <- ggplot(dt_plot_Vehicle_color, aes(x=Shift, y=Share, fill=Shift)) + 
  geom_violin() + geom_boxplot(width=0.1) + ggtitle(title1) + theme1

r <- ggplot(dt_plot_Vehicle_colorCONF, aes(x=Share)) + ggtitle(title2) +
  geom_histogram(aes(color=Shift,fill=Shift),position="identity",bins=30,alpha=0.4) + 
  scale_color_manual(values=color) + scale_fill_manual(values = color) + theme1

# Wilcoxon rank-sum test (non-parametric alternative to the two-sample t-test)
wilcox.test(Share~Shift, data = dt_plot_Vehicle_color)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Share by Shift
## W = 497307, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Claim 3: The Number of Accidents Decreases with Distance from the City Center

Another important aspect to consider is the geo-location of accidents. Our spatial analysis leads us to claim that the number of accidents is higher the closer one gets to the city center. The heatmap below (Fig. 4.1) shows a greater concentration of accidents around the district of Eixample. The concentration is especially high in the area between the Gran Via de les Corts Catalanes and the Avinguda Diagonal, which run through the districts of Eixample in the west and Sant Martí in the northeast.

We calculate the distance from each accident location to the center point of Eixample. In buckets of 100 m, we count the accidents occurring within the ring represented by each distance and divide by an area in km^2 (the code computes pi*Dist_km^2, the area of the circle with that radius). Plotting this information, we find that there is indeed a significant trend. Having checked the Q-Q plots for both distance and number of accidents, we again decide against the Pearson method, which assumes normally distributed data. We find a Spearman rank correlation of -0.99 with a p-value that clearly passes the significance threshold. Within 2 km of the city center, the number of accidents per square kilometer is much higher.

coords.data <- select(dt_acc,c(Latitude,Longitude))
# pedestrians ("Vianant") who were severely injured ("greu") or killed ("Mort")
coords.data.ped <- dt_acc_people %>% 
  filter(Person_Type=="Vianant" & grepl('greu|Mort', Victim_Description)) 
center <- c(2.164889,41.392171) # central point in Eixample District (longitude, latitude)

# distance from city center in m, rounded up to 100 m buckets
dist_data <- dt_acc[,Dist:=plyr::round_any(distHaversine(center, cbind(Longitude,Latitude)),100,
                    f=ceiling)] %>% group_by(Year, Dist) %>%  summarise(Accidents = n())
# Area = pi * Dist_km^2, i.e. the full circle of radius Dist_km, in km^2
dist_data <- merge(dist_data, data.table(Dist_km = unique(dist_data$Dist)/1000) %>% 
                  mutate(Area = pi*(Dist_km)^2, Dist=Dist_km*1000),by="Dist",all.x=TRUE) %>% 
                  mutate(Accidents_per_Area = Accidents/Area)

coords.map <- get_stamenmap(Barcelona, zoom = 12, maptype = "terrain")
coords.map <- ggmap(coords.map, extent="device", legend="topleft") + 
  stat_density2d(aes(x=Longitude, y=Latitude, fill=..level..,alpha =..level..), 
                 data=coords.data, geom="polygon") + 
  geom_point(data=coords.data.ped, aes(x=Longitude, y=Latitude, 
                shape="Pedestrians Severely \nInjured or Killed"), alpha=0.5) +
  geom_point(aes(x=center[1],y=center[2], shape="Center of Eixample"), size=2, fill="purple")

trend <- ggplot(dist_data, aes(x=Dist, y=Accidents_per_Area)) + geom_point() + theme1 +
  labs(title =title2, y="Nr. Accidents per KM2", x="Distance (m)*", caption=caption) +
  geom_label(data=an2, aes(x=x, y=y, label=label), size=3, color='black', hjust="left")
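
As a small check of the geometry behind the density step, the sketch below compares the full-circle area used in the code above (pi * r^2) with the area of the 100 m ring ending at the same radius. The variable names are our own, and which denominator is appropriate depends on whether the counts are per ring or cumulative; this is an illustration, not a re-run of the analysis.

```r
dist_m <- seq(100, 500, by = 100)   # outer radius of each 100 m bucket, in m
r_km <- dist_m / 1000

disc_area <- pi * r_km^2                      # full circle of radius r, in km^2
ring_area <- pi * (r_km^2 - (r_km - 0.1)^2)   # ring between r - 100 m and r

print(data.frame(Dist = dist_m, disc_area, ring_area))

# The rings tile the circle: their areas add up to the largest disc area
stopifnot(isTRUE(all.equal(sum(ring_area), disc_area[length(disc_area)])))
```

Ring areas grow linearly with the radius while disc areas grow quadratically, so the choice of denominator affects how steeply the density falls with distance.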

Conclusion

From the research in this case study, we would propose the following:

Based on Claim 1, we suggest more alcohol checks, especially between 23:00 and 7:00 on weekends, when the number of substance-related accidents is highest.
Considering Claim 2, we advise new vehicle buyers not to select black or dark-colored vehicles, as these have a higher chance of being involved in an accident, especially at night. In fact, it may even keep insurance costs lower!
Lastly, driving private vehicles in the city center should be avoided where possible, in favour of public transport. Since accident density is highest within 2 km of the center of Eixample, a 30 km/h speed limit could be a promising preventive measure here.
For future research, especially post Covid-19, it would be interesting to analyze the impact of the new measures on serious injury and fatality rates, especially related to the reduction in speed limit from 50 to 30 km/h.