Data Analysis and Visualization in R (IN2339): A Computer Science Class at the Technical University of Munich
Authors: Mara Geisler, Anna Lenk, Matthias Watzl, Viviane Weinstabl (23/01/2022)
In 2019 there were a total of 10,027 accidents in the city of Barcelona, resulting in 11,864 victims, of which 222 were seriously injured or killed. The numbers in 2016 were quite similar: 10,140 accidents; 12,115 victims; 220 serious injuries or deaths. As laid out by the city council’s Local Road Safety Plan 2019-2022, the goal is to strengthen road safety and reduce road deaths and serious injuries by 20% and 16%, respectively. Measures to achieve this include: introducing a speed limit of 30 km/h on secondary roads; campaigns to decrease cases of DUI, speeding and traffic offences; and increasing the number of speed radars.
By investigating preventable factors related to accident occurrence, we aim to complement the city council’s analysis and plan of action. More precisely, we investigate 1) the relationship between the hour of day and the number of accidents linked to alcohol & drugs, 2) vehicle color and time of day, 3) geo-spatial factors for accident density.
The necessary data preparation steps are omitted in the compiled PDF file. First, we load the 2016-2019 CSV data files from the online platform Open Data Barcelona. The data sets are: 1) accidents managed by the local police (incl. number of injuries by severity level); 2) accidents and their external causes (e.g. alcohol, speeding), if any; 3) accidents incl. the people involved (driver, pedestrian, passenger), their description and degree of injury; and 4) accidents incl. the vehicles involved and their specifications.
Additionally, we use data on the daily distribution of travel (number of trips) per hour, collected in the Enquesta de Mobilitat en Dia Feiner for the Barcelona Metropolitan Integrated Mobility System (SIMMB).
We translate column headers to English, clean the data (e.g. converting character columns to numeric, adjusting date-based columns, fixing some data errors), and merge the data for accident causes onto the base data table (dt_acc) with all unique accidents by the primary key “File_Nr”.
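The cleaning and merging steps can be sketched as follows. This is a minimal illustration in base R, not the authors' omitted preparation code: the raw column names, the file numbers, and the cause labels are made up for the example, and only `File_Nr`, `Hour`, `Nr_Deaths` and `dt_acc` are names taken from the text.

```r
# Hypothetical miniature of the raw accident file (headers are illustrative).
raw_acc <- data.frame(
  Numero_expedient = c("2019S000001", "2019S000002", "2019S000003"),
  Hora_dia         = c("8", "17", "23"),
  Numero_morts     = c("0", "0", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical causes file: not every accident has a recorded cause.
raw_causes <- data.frame(
  Numero_expedient = c("2019S000001", "2019S000003"),
  Descripcio_causa = c("Alcohol", "Speeding"),
  stringsAsFactors = FALSE
)

# Translate headers to English via a lookup vector.
header_map <- c(Numero_expedient = "File_Nr", Hora_dia = "Hour",
                Numero_morts = "Nr_Deaths", Descripcio_causa = "Cause")
translate_headers <- function(df) {
  names(df) <- unname(header_map[names(df)])
  df
}
dt_acc    <- translate_headers(raw_acc)
dt_causes <- translate_headers(raw_causes)

# Convert character columns that hold numbers to numeric.
dt_acc$Hour      <- as.numeric(dt_acc$Hour)
dt_acc$Nr_Deaths <- as.numeric(dt_acc$Nr_Deaths)

# Left join: keep every accident, attach a cause where one was recorded.
dt_acc <- merge(dt_acc, dt_causes, by = "File_Nr", all.x = TRUE)

# The primary key must stay unique after the merge.
stopifnot(!anyDuplicated(dt_acc$File_Nr))
print(dt_acc)
```

A left join (`all.x = TRUE`) is used so that accidents without a recorded cause remain in the base table with a missing `Cause`.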
We plotted dt_acc by different temporal groupings (month, day, and hour of day) and saw a major difference in the occurrence of accidents by hour of the day (Fig. 1.1). This can be explained in large part by the higher number of trips in those hours (Fig. 1.2), which show similar peaks around 8-9 AM and 5-7 PM.
Interestingly, the fatality rate (total deaths divided by total accidents at each hour) and the severe injury rate (deaths plus severe injuries, divided by total accidents at each hour) are higher during the night than during the day. To test the significance of these observations, we compute the correlations between number of accidents, travel and injury rates by hour of day. Since the data are not normally distributed (see Table 1, where all p-values are below 0.05 except for “Hour”), we use the Spearman rank correlation and the corresponding significance test (Fig. 1.3).
| Factor | p.value |
|---|---|
| Hour | 0.4159 |
| Nr_Accidents | 0.0147 |
| Fatality_Rt | 0.0000 |
| Severe_Rt | 0.0362 |
| Avg_Trips | 0.0055 |
# hourly accident counts and injury rates for 2019
dt_heatmap <- dt_acc[Year==2019, .(Nr_Accidents = .N, Fatality_Rt = sum(Nr_Deaths)/.N,
                                   Severe_Rt = sum(Nr_Deaths+Nr_Injuries_Severe)/.N), by=Hour]
dt_heatmap <- merge(dt_heatmap, trips_data, by="Hour")
# correlations excluded if insignificant (p-test), ordered hierarchically;
# method="spearman" is passed on to cor.test() so that the p-values match the Spearman coefficients
ggcorrplot(round(cor(dt_heatmap, method="spearman"),2),
           p.mat = cor_pmat(dt_heatmap, method="spearman"),
           ggtheme = ggplot2::theme_gray, hc.order=TRUE, lab=TRUE, type = "lower",
           outline.col = "white", colors = colors1) + ggtitle(title1) + theme1
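The normality check behind Table 1 and the subsequent rank correlation can be sketched as below. The source does not say which normality test produced Table 1, so the Shapiro-Wilk test (`shapiro.test`) is an assumption here, and the hourly values are synthetic stand-ins rather than the Barcelona data.

```r
# Synthetic hourly profile with two rush-hour peaks (not the real data).
hour         <- 0:23
avg_trips    <- round(1000 * (exp(-(hour - 8)^2 / 8) + exp(-(hour - 18)^2 / 8)) + 50)
nr_accidents <- round(0.4 * avg_trips + 10 * sin(hour))
hourly <- data.frame(Hour = hour, Avg_Trips = avg_trips, Nr_Accidents = nr_accidents)

# Normality p-value per column (Shapiro-Wilk is an assumed choice of test).
normality <- data.frame(
  Factor  = names(hourly),
  p.value = sapply(hourly, function(x) shapiro.test(x)$p.value),
  row.names = NULL
)
print(normality)

# Spearman rank correlation with its p-value;
# exact = FALSE uses the asymptotic p-value, which also handles ties.
res <- cor.test(hourly$Nr_Accidents, hourly$Avg_Trips,
                method = "spearman", exact = FALSE)
rho <- unname(res$estimate)
print(c(rho = rho, p.value = res$p.value))
```

A small p-value in the normality table speaks against a Gaussian distribution for that column, which is the argument for preferring a rank-based correlation over Pearson.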
With regard to vehicle color, our assumption is that certain colors are less visible at night and are therefore involved in accidents more often during darkness than during daylight. We focus on the color black and compare the proportion of black vehicles involved in accidents at nighttime with the proportion during the daytime, using the 2016 to 2019 data on vehicles involved in accidents.
First, we determine the total number of vehicles involved in accidents per nighttime (22:00-5:59) as well as per daytime (6:00-21:59) shifts for each date. Second, we determine for each nighttime and daytime shift per date the number of black vehicles involved in an accident. Thus, for each date, we obtain the relative proportion (share) of black vehicles for the nighttime and the daytime.
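The two counting steps can be sketched on a toy table as follows. The vehicle records are invented for illustration, and the sketch assumes that the early-morning hours 0:00-5:59 are assigned to the night shift of the same calendar date, which the text implies but does not spell out.

```r
# Hypothetical vehicle records: one row per vehicle involved in an accident.
vehicles <- data.frame(
  Date  = as.Date(c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
                    "2019-03-02", "2019-03-02")),
  Hour  = c(23, 9, 14, 2, 7, 22),
  Color = c("Black", "White", "Black", "Grey", "Red", "Black"),
  stringsAsFactors = FALSE
)

# Night covers 22:00-5:59, day covers 6:00-21:59.
vehicles$Shift    <- ifelse(vehicles$Hour >= 22 | vehicles$Hour < 6, "Night", "Day")
vehicles$Is_Black <- vehicles$Color == "Black"

# The mean of a logical vector is the share of TRUE values,
# i.e. black vehicles divided by all vehicles per date and shift.
shares <- aggregate(Is_Black ~ Date + Shift, data = vehicles, FUN = mean)
names(shares)[names(shares) == "Is_Black"] <- "Share"
print(shares)
```

Each row of `shares` corresponds to one shift on one date, which is the unit compared in the day-versus-night analysis.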
The distributions of the day and night shifts differ strongly (see Fig. 3.1). While the day shift shows a clear cluster at just under 25%, the distribution of the night shift is much broader and also has a higher median. To validate our assumption, we test the null hypothesis: “The distributions of the two populations Night and Day are equal.” Although the two populations are independent, we cannot assume a Gaussian distribution. Therefore, we perform the Wilcoxon rank-sum test, which results in a p-value < 2.2e-16. We reject the null hypothesis; together with the higher median, this indicates that the share for the night shift is significantly higher than the one for the day shift.
To support our claim, we need to take into account one factor mentioned at the beginning of our analysis: the number of trips has a substantial impact on the number of accidents. At night, far fewer vehicles (black or otherwise) are on the road, so they are less likely to be in an accident for this reason alone; this may be a confounding factor. In fact, we observe many shifts with no or very few accidents, often without any black vehicle involved (see Fig. 3.2): about 360 night shifts and about 100 day shifts have no black vehicle involved in an accident. Keeping these shifts in the populations leads to many 0% shares, which in turn have a large impact on the comparison between nighttime and daytime. To account for this, we filter out shifts without black vehicles, so that we only compare shifts that include at least one accident involving a black vehicle. Given the rather high number of shifts without black vehicles, we consider this step necessary to control for the confounding factor.
# violin plot comparing the day and night shifts
l <- ggplot(dt_plot_Vehicle_color, aes(x=Shift, y=Share, fill=Shift)) +
  geom_violin() + geom_boxplot(width=0.1) + ggtitle(title1) + theme1
# overlaid histograms of the share per shift
r <- ggplot(dt_plot_Vehicle_colorCONF, aes(x=Share)) + ggtitle(title2) +
  geom_histogram(aes(color=Shift,fill=Shift),position="identity",bins=30,alpha=0.4) +
  scale_color_manual(values=color) + scale_fill_manual(values = color) + theme1
# Wilcoxon rank-sum test (non-parametric alternative to the t-test)
wilcox.test(Share~Shift, data = dt_plot_Vehicle_color)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Share by Shift
## W = 497307, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
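The effect of the zero-share shifts on the comparison can be explored with a sketch like the one below. It uses synthetic shares drawn from beta distributions (an arbitrary choice for illustration, not a model of the Barcelona data) and forces a number of shifts to 0%, more often at night, before running the test with and without them.

```r
set.seed(42)

# Synthetic shares per shift; the distributions are illustrative assumptions.
shares <- data.frame(
  Shift = rep(c("Day", "Night"), each = 200),
  Share = c(rbeta(200, 25, 75), rbeta(200, 6, 14))
)

# Mimic shifts without any black vehicle: set some shares to zero,
# more frequently in the night shift.
zero_idx <- c(sample(1:200, 20), sample(201:400, 70))
shares$Share[zero_idx] <- 0

# Test on all shifts, including the 0% shares.
test_all <- wilcox.test(Share ~ Shift, data = shares)

# Test after filtering out shifts without black vehicles.
filtered      <- shares[shares$Share > 0, ]
test_filtered <- wilcox.test(Share ~ Shift, data = filtered)

# Medians per shift in the filtered data, for the direction of the difference.
medians <- aggregate(Share ~ Shift, data = filtered, FUN = median)

print(c(p_all = test_all$p.value, p_filtered = test_filtered$p.value))
print(medians)
```

Comparing the two p-values and the per-shift medians shows how strongly the zero-share shifts drive the result in a given data set.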
Another important aspect to consider is the geo-location of accidents. Our spatial analysis leads us to claim that the number of accidents is higher the closer one gets to the city center. The heatmap below (Fig. 4.1) shows a greater concentration of accidents around the district of Eixample. The concentration is especially high in the area between the Gran Via de les Corts Catalanes and the Avinguda Diagonal, which run through the districts of Eixample in the west and Sant Martí in the northeast.
We calculate the distance from each accident location to the center point of Eixample. In buckets of 100 m distance, we count the accidents occurring within the ring defined by each bucket and divide this count by the area of the ring in km^2. Plotting this information, we find that there is indeed a significant trend. Having checked the Q-Q plots for both distance and number of accidents, we again decide against the Pearson method, which assumes a normal distribution. We find a Spearman rank correlation of -0.99 with a p-value that passes the significance threshold. Within 2 km of the city center, the number of accidents per square kilometer is much higher.
coords.data <- select(dt_acc,c(Latitude,Longitude))
coords.data.ped <- dt_acc_people %>%
filter(Person_Type=="Vianant" & grepl('greu|Mort', Victim_Description))
center <- c(2.164889,41.392171) # central point in Eixample District
# distance from city center, rounded up to the next 100 m bucket
dist_data <- dt_acc[,Dist:=plyr::round_any(distHaversine(center, cbind(Longitude,Latitude)),100,
                                           f=ceiling)] %>% group_by(Year, Dist) %>% summarise(Accidents = n())
# note: Area below is the full disc of radius Dist_km; the area of a ring of
# width 0.1 km would be pi*(Dist_km^2 - (Dist_km-0.1)^2)
dist_data <- merge(dist_data, data.table(Dist_km = unique(dist_data$Dist)/1000) %>%
                     mutate(Area = pi*(Dist_km)^2, Dist=Dist_km*1000),by="Dist",all.x=TRUE) %>%
  mutate(Accidents_per_Area = Accidents/Area)
coords.map <- get_stamenmap(Barcelona, zoom = 12, maptype = "terrain")
coords.map <- ggmap(coords.map, extent="device", legend="topleft") +
stat_density2d(aes(x=Longitude, y=Latitude, fill=..level..,alpha =..level..),
data=coords.data, geom="polygon") +
geom_point(data=coords.data.ped, aes(x=Longitude, y=Latitude,
shape="Pedestrians Severely \nInjured or Killed"), alpha=0.5) +
geom_point(aes(x=center[1],y=center[2], shape="Center of Eixample"), size=2, fill="purple")
trend <- ggplot(dist_data, aes(x=Dist, y=Accidents_per_Area)) + geom_point() + theme1 +
labs(title =title2, y="Nr. Accidents per KM2", x="Distance (m)*", caption=caption) +
  geom_label(data=an2, aes(x=x, y=y, label=label), size=3, color='black', hjust="left")
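The ring-density calculation described in the text can be sketched in base R as follows. The accident coordinates are synthetic, `haversine_m` is a hypothetical stand-in for `geosphere::distHaversine`, and the area is computed for the ring (annulus) of each 100 m bucket as the text describes, whereas the code above divides by the area of the full disc.

```r
# Haversine great-circle distance in metres; a base-R stand-in for
# geosphere::distHaversine with the same default Earth radius (6378137 m).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

center <- c(2.164889, 41.392171)  # lon, lat of the central point used in the source

set.seed(7)
# Synthetic accident coordinates scattered around the center (not real data).
n   <- 5000
acc <- data.frame(Longitude = center[1] + rnorm(n, sd = 0.02),
                  Latitude  = center[2] + rnorm(n, sd = 0.015))

dist_m <- haversine_m(center[1], center[2], acc$Longitude, acc$Latitude)
# Round up to the next 100 m, so bucket 300 holds distances in (200, 300].
acc$Dist <- ceiling(dist_m / 100) * 100

# Count accidents per distance bucket.
rings <- aggregate(list(Accidents = rep(1, n)), by = list(Dist = acc$Dist), FUN = sum)

# Area of the annulus between the inner and outer radius, in km^2.
outer_km   <- rings$Dist / 1000
inner_km   <- outer_km - 0.1
rings$Area <- pi * (outer_km^2 - inner_km^2)
rings$Accidents_per_Area <- rings$Accidents / rings$Area

# Rank correlation between distance and accident density.
res <- cor.test(rings$Dist, rings$Accidents_per_Area,
                method = "spearman", exact = FALSE)
print(c(rho = unname(res$estimate), p.value = res$p.value))
```

Dividing by the ring area keeps the density comparable across buckets, because outer rings cover more ground than inner ones of the same width.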
From the research in this case study, we would propose the following: