Analysis of traffic accidents in Warsaw 2024

Introduction

The main aim of this project is to examine the spatial pattern of traffic accidents in Warsaw in 2024. In particular, the analysis focuses on whether accidents are distributed randomly along the road network or whether they show clear spatial concentration in specific parts of the city. Since previous studies have shown that network-based methods are more appropriate for traffic accident analysis than ordinary planar methods, the project treats accidents as events occurring on a linear road network rather than as points distributed freely across the city area (Yamada & Thill, 2004; Xie & Yan, 2008).

The analysis begins with data preparation, including coordinate cleaning, transformation to a projected coordinate system, filtering observations to the Warsaw boundary, and matching accident locations to the road network. After this preprocessing stage, the accident data are represented as an unmarked point pattern on a linear network. This makes it possible to analyse accident locations in relation to the road system itself.

The empirical part includes several complementary methods. First, the overall accident intensity is calculated as the number of accidents per kilometre of road. Then, hotspot visualisation is used to identify road sections and areas with higher accident concentration. Nearest-neighbour distances are analysed to describe how close accidents tend to occur to one another. A quadrat test is applied to assess whether accident counts are consistent with a homogeneous random distribution on the network. Finally, Poisson point process models are used to compare a constant-intensity model with spatial trend models, allowing us to evaluate whether accident intensity varies across Warsaw.

In fact, the project does not intend to explain the mechanism of forming traffic accidents hotspots, but rather to detect overall pattern.

Related literature

When analysing traffic accidents, one important point has to be made clear: accidents cannot occur anywhere in planar space. They are constrained by the road system, so the road network should be treated as the actual space in which accident events occur.

Yamada and Thill (2004) compared planar and network K-functions for analysing traffic accident patterns. Their main point is that planar K-functions are problematic for accidents because they treat the study area as a normal two-dimensional space, while accidents are constrained to roads. Using 1997 traffic accident data from the Buffalo, New York area, they showed that planar K-functions can falsely detect clustering because they compare accidents to randomness over the whole area instead of the road network. The network K-function performed better because it used network distance and treated roads as the actual space where accidents occur. Their results support the use of linear network methods for accident analysis (Yamada & Thill, 2004).

Xie and Yan (2008) developed a network-based kernel density estimation method for analysing traffic accidents. Their main argument is that accidents should not be analysed only with ordinary planar KDE, because accidents occur on roads, not across the whole two-dimensional space. They divided the road network into small equal-length linear units called lixels and estimated accident density along the network. Using traffic accident data from Bowling Green, Kentucky, they showed that network KDE gives a more realistic picture of accident hotspots than planar KDE, which can overestimate density by spreading it outside the road system. They also found that bandwidth strongly affects the detected hotspot pattern: smaller bandwidths show local concentrations, while larger bandwidths reveal broader clusters. This supports the use of linear network methods in road accident analysis (Xie & Yan, 2008).

Bíl et al. (2013) focused on identifying hazardous road locations using kernel density estimation, but they argued that ordinary KDE is not enough because it mainly gives a visual hotspot map without telling whether the detected clusters are statistically meaningful. To solve this, they added Monte Carlo significance testing, cluster strength, and cluster stability measures. Using traffic accident data from the Southern Moravia Region in the Czech Republic, they identified hazardous locations and ranked them according to their importance. This article is useful for our project because it supports the idea that accident hotspots should be interpreted carefully and, where possible, supported by statistical testing rather than only visual inspection (Bíl et al., 2013).

Ang et al. (2012) provide a more methodological contribution to the analysis of events on linear networks. They explain that ordinary planar point-pattern methods are not appropriate when events occur on roads or other networks. They also show that even network K-functions can be difficult to interpret because their values depend on the geometry of the network itself. To address this, they propose a geometrically corrected network K-function, as well as related tools such as the pair correlation function and an inhomogeneous version of the K-function. This article is useful for our analysis because it gives theoretical support for treating accidents as events on a linear network and for distinguishing clustering from spatial inhomogeneity (Ang et al., 2012).

Data loading and preparation

As always, we need to start with loading all necessary packages.

# Define required packages
packages <- c("sf", "dplyr", "stringr", "spatstat.geom", "spatstat.explore","spatstat.linnet")

for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  
  library(pkg, character.only = TRUE)
}

Then, we need to read the data for accidents, filter the Warsaw only data and explore the columns storing the coordinates.

data <- read.csv("Mazowieckie_accidents.csv")
data_wawa <- data[!is.na(data$POWIAT) & data$POWIAT == "POWIAT WARSZAWA", ]
head(data_wawa[, c("WSP_GPS_X", "WSP_GPS_Y")])

##    WSP_GPS_X WSP_GPS_Y
## 1  21*15'060 52*12'250
## 3  21*02'168 52*21'365
## 7  20*58'488 52*16'126
## 15 21.069849 52.230339
## 16 21.070175  52.23019
## 21 20.903177 52.239088

The actual coordinates are stored in mixed format. Part of them are preseneted in Degrees, Minutes, Seconds format, whereas the second part is stored in decimal degrees. Therefore, we need transforming function to properly convert coordinates to unified format, which is decimal degrees.

convert_wsp_coord <- function(x) {
  x <- str_replace_all(as.character(x), ",", ".")
  dms <- str_match(x, "^(\\d+)\\*(\\d+)'(\\d+)$")
  from_dms <- as.numeric(dms[, 2]) +
    as.numeric(dms[, 3]) / 60 +
    (as.numeric(dms[, 4]) / 10) / 3600
  from_decimal <- suppressWarnings(as.numeric(x))
  ifelse(str_detect(x, "\\*"), from_dms, from_decimal)
}

data_wawa <- data_wawa %>%
  mutate(
    lon = convert_wsp_coord(WSP_GPS_X),
    lat = convert_wsp_coord(WSP_GPS_Y)
  ) %>% select(lon, lat)

head(data_wawa)

##         lon      lat
## 1  21.25167 52.20694
## 3  21.03800 52.36014
## 7  20.98022 52.27017
## 15 21.06985 52.23034
## 16 21.07017 52.23019
## 21 20.90318 52.23909

At this point, the data can be transformed to “sf” format.

# remowe NA values before transforming to sf

n_before <- nrow(data_wawa)
data_wawa <- na.omit(data_wawa)
n_after <- nrow(data_wawa)

cat("Rows removed due to missing coordinates:", n_before - n_after, "\n")

## Rows removed due to missing coordinates: 0

data_wawa_sf <- st_as_sf(
  data_wawa,
  coords = c("lon", "lat"),
  crs = 4326,
  remove = FALSE
)

data_wawa_3857 <- st_transform(data_wawa_sf, 3857)

The full Warsaw road network might be too dense to analyse. That is why we decided to leave only selected vehicle-accessible road categories.

## Reading layer `powiaty' from data source 
##   `C:\Users\yana1\Desktop\Point and line\powiaty.shp' using driver `ESRI Shapefile'
## Simple feature collection with 380 features and 36 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 171677.5 ymin: 133223.4 xmax: 861895.8 ymax: 775019.1
## Projected CRS: ETRF2000-PL / CS92

## Reading layer `roads' from data source 
##   `C:\Users\yana1\Desktop\Point and line\roads.shp' using driver `ESRI Shapefile'
## Simple feature collection with 475385 features and 7 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 20.6118 ymin: 52.02014 xmax: 21.47108 ymax: 52.48307
## Geodetic CRS:  WGS 84

It is possible that not all the accidents reported to happen in Warsaw actually occured in the city. That is why, it is essential to filter out all coordinates outside of Warsaw.

data_wawa_3857_inside <- st_filter(
  data_wawa_3857,
  wawa_3857,
  .predicate = st_within
)

Then, let us clip the roads to Warsaw.

At this point, we can combine all the components and plot the accidents occured in Warsaw.

plot(st_geometry(wawa_3857), border = "black")
plot(st_geometry(roads_wawa_3857), add = TRUE, col = "grey")
plot(st_geometry(data_wawa_3857_inside), add = TRUE, pch = 16, cex = 0.25)

In order not to overcomplicate the calculations, it was decided to exclude the observations occurred too far from the chosen roads.

# Check how far accidents are from the nearest road

nearest_road_id <- st_nearest_feature(data_wawa_3857_inside, roads_wawa_3857)

dist_to_road_m <- as.numeric(st_distance(
  data_wawa_3857_inside,
  roads_wawa_3857[nearest_road_id, ],
  by_element = TRUE
))

summary(dist_to_road_m)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 5.800e-04 2.346e+00 6.051e+00 1.519e+01 1.304e+01 1.000e+03

data.frame(
  threshold_m = c(5, 10, 20, 50, 100),
  n_farther = c(
    sum(dist_to_road_m > 5),
    sum(dist_to_road_m > 10),
    sum(dist_to_road_m > 20),
    sum(dist_to_road_m > 50),
    sum(dist_to_road_m > 100)
  ),
  percent_farther = round(c(
    mean(dist_to_road_m > 5),
    mean(dist_to_road_m > 10),
    mean(dist_to_road_m > 20),
    mean(dist_to_road_m > 50),
    mean(dist_to_road_m > 100)
  ) * 100, 2)
)

##   threshold_m n_farther percent_farther
## 1           5     13456           56.01
## 2          10      7974           33.19
## 3          20      3775           15.71
## 4          50      1473            6.13
## 5         100       576            2.40

data_wawa_3857_inside$dist_to_road_m <- dist_to_road_m

data_wawa_3857_inside %>%
  arrange(desc(dist_to_road_m)) %>%
  select(dist_to_road_m) %>%
  head(20)

## Simple feature collection with 20 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 2327969 ymin: 6823434 xmax: 2366402 ymax: 6858159
## Projected CRS: WGS 84 / Pseudo-Mercator
## First 10 features:
##    dist_to_road_m                geometry
## 1       1000.2945 POINT (2366402 6827142)
## 2        767.2900 POINT (2361309 6836177)
## 3        576.7778 POINT (2366269 6827595)
## 4        574.3104 POINT (2346405 6854304)
## 5        568.0472 POINT (2332240 6854128)
## 6        542.5972 POINT (2327969 6853875)
## 7        522.5301 POINT (2346522 6854203)
## 8        510.6766 POINT (2330128 6844213)
## 9        484.5309 POINT (2341742 6823434)
## 10       475.5283 POINT (2330184 6844204)

To avoid uncertain snapping, only accidents located within 5 metres of the selected road network were retained. This creates a stricter high-confidence subset of accidents that can be reliably assigned to the analysed road network.

# delete too far accidents
data_wawa_3857_near_roads <- data_wawa_3857_inside[dist_to_road_m <= 5, ]

# check how many remain
nrow(data_wawa_3857_inside)

## [1] 24023

nrow(data_wawa_3857_near_roads)

## [1] 10567

Now, we can snap remaining accidents to the nearest road.

# Snap accidents to the nearest road

nearest_road_id <- st_nearest_feature(data_wawa_3857_near_roads, roads_wawa_3857)

nearest_lines <- st_nearest_points(
  st_geometry(data_wawa_3857_near_roads),
  st_geometry(roads_wawa_3857[nearest_road_id, ]),
  pairwise = TRUE
)

snapped_geom <- st_sfc(lapply(nearest_lines, function(x) {
  coords <- st_coordinates(x)
  st_point(coords[2, 1:2])
}), crs = st_crs(data_wawa_3857_near_roads))

snap_dist_m <- as.numeric(st_distance(
  st_geometry(data_wawa_3857_near_roads),
  snapped_geom,
  by_element = TRUE
))

accidents_snapped_3857 <- data_wawa_3857_near_roads
st_geometry(accidents_snapped_3857) <- snapped_geom
accidents_snapped_3857$snap_dist_m <- snap_dist_m

summary(accidents_snapped_3857$snap_dist_m)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000578 0.881970 1.994407 2.152331 3.308611 4.999569

nrow(accidents_snapped_3857)

## [1] 10567

plot(st_geometry(roads_wawa_3857), col = "grey", lwd = 0.3)
plot(st_geometry(accidents_snapped_3857), add = TRUE, pch = 16, cex = 0.2)

And finally create lpp object.

# Create linnet road network and lpp accident pattern

coords <- as.data.frame(st_coordinates(roads_wawa_3857))
line_cols <- grep("^L", names(coords), value = TRUE)
coords$line_id <- interaction(coords[, line_cols, drop = FALSE], drop = TRUE)

segs <- coords %>%
  group_by(line_id) %>%
  mutate(x0 = X, y0 = Y, x1 = lead(X), y1 = lead(Y)) %>%
  ungroup() %>%
  filter(!is.na(x1), !is.na(y1), x0 != x1 | y0 != y1)

bb <- st_bbox(wawa_3857)
win <- owin(xrange = c(bb["xmin"], bb["xmax"]), yrange = c(bb["ymin"], bb["ymax"]))

roads_psp <- psp(segs$x0, segs$y0, segs$x1, segs$y1, window = win, check = FALSE)
road_net <- as.linnet(roads_psp, sparse = TRUE)
unitname(road_net) <- c("metre", "metres")

xy <- st_coordinates(accidents_snapped_3857)
accidents_lpp <- lpp(data.frame(x = xy[, 1], y = xy[, 2]), road_net)
unitname(accidents_lpp) <- c("metre", "metres")

accidents_lpp

## Point pattern on linear network
## 10567 points
## Linear network with 228000 vertices and 282717 lines
## Enclosing window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] 
## metres

summary(accidents_lpp)

## Point pattern on linear network
## 10567 points
## Linear network with 228000 vertices and 282717 lines
## Total length 14831810 metres
## Average intensity 0.0007124552 points per metre
## Unmarked
## Enclosing window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] 
## metres

cat("Number of accidents:", npoints(accidents_lpp), "\n")

## Number of accidents: 10567

cat("Road network length, km:", volume(road_net) / 1000, "\n")

## Road network length, km: 14831.81

cat("Accidents per km:", npoints(accidents_lpp) / (volume(road_net) / 1000), "\n")

## Accidents per km: 0.7124552

plot(road_net, col = "grey", main = "Accidents on Warsaw road network")
points(accidents_lpp, pch = 16, cex = 0.2)

Analysis

Intensity and hot spots

As initial step of the analysis, we calculated the intensity of accidents per one kilometer of the roads in Warsaw (not the whole road network, but the main types filtered earlier).

n_accidents <- npoints(accidents_lpp)
road_length_m <- volume(domain(accidents_lpp))
road_length_km <- road_length_m / 1000

intensity_per_m <- n_accidents / road_length_m
intensity_per_km <- n_accidents / road_length_km

intensity_summary <- data.frame(
  n_accidents = n_accidents,
  road_length_km = round(road_length_km, 2),
  accidents_per_km = round(intensity_per_km, 4),
  accidents_per_100_km = round(intensity_per_km * 100, 2)
)

intensity_summary

##   n_accidents road_length_km accidents_per_km accidents_per_100_km
## 1       10567       14831.81           0.7125                71.25

As we can observe, the the number of accidents is quite big. Visualizing hot spots with network kernel density can bring us better view on the distribution of the accidents.Because the exact network-distance kernel density estimator was computationally infeasible, hotspot visualisation was based on a Euclidean kernel density estimate evaluated on the linear network. Therefore, the map should be interpreted as an exploratory hotspot visualisation rather than a full shortest-path network KDE.

sigma_m <- 150

accident_density <- density(
  accidents_lpp,
  sigma = sigma_m,
  distance = "euclidean",
  eps = 150
)

plot(
  accident_density,
  main = paste("Fast kernel density of accidents, sigma =", sigma_m, "m")
)

points(accidents_lpp, pch = 16, cex = 0.12)

Nearest-neighbor distance analysis shows how far each accident is from the closest other accident along the road network.

nnd_m <- nndist(accidents_lpp)
nnd_finite <- nnd_m[is.finite(nnd_m)]

summary(nnd_finite)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.27   51.50  108.68  135.48 2856.76

cat("Number of infinite nearest-neighbour distances:", sum(!is.finite(nnd_m)), "\n")

## Number of infinite nearest-neighbour distances: 6

nnd_summary <- data.frame(
  min_m = round(min(nnd_finite, na.rm = TRUE), 2),
  q25_m = round(quantile(nnd_finite, 0.25, na.rm = TRUE), 2),
  median_m = round(median(nnd_finite, na.rm = TRUE), 2),
  mean_m = round(mean(nnd_finite, na.rm = TRUE), 2),
  q75_m = round(quantile(nnd_finite, 0.75, na.rm = TRUE), 2),
  max_m = round(max(nnd_finite, na.rm = TRUE), 2),
  zero_distances = sum(nnd_finite == 0, na.rm = TRUE)
)

nnd_summary

##     min_m q25_m median_m mean_m  q75_m   max_m zero_distances
## 25%     0 16.27     51.5 108.68 135.48 2856.76            157

hist(
  nnd_finite,
  breaks = 50,
  main = "Nearest-neighbour distances between accidents",
  xlab = "Distance to nearest accident along road network, metres",
  ylab = "Number of accidents"
)

data.frame(
  distance_m = c(10, 25, 50, 100, 250, 500, 1000),
  n_accidents_with_neighbour_within = c(
    sum(nnd_finite <= 10, na.rm = TRUE),
    sum(nnd_finite <= 25, na.rm = TRUE),
    sum(nnd_finite <= 50, na.rm = TRUE),
    sum(nnd_finite <= 100, na.rm = TRUE),
    sum(nnd_finite <= 250, na.rm = TRUE),
    sum(nnd_finite <= 500, na.rm = TRUE),
    sum(nnd_finite <= 1000, na.rm = TRUE)
  ),
  percent_accidents = round(c(
    mean(nnd_finite <= 10, na.rm = TRUE),
    mean(nnd_finite <= 25, na.rm = TRUE),
    mean(nnd_finite <= 50, na.rm = TRUE),
    mean(nnd_finite <= 100, na.rm = TRUE),
    mean(nnd_finite <= 250, na.rm = TRUE),
    mean(nnd_finite <= 500, na.rm = TRUE),
    mean(nnd_finite <= 1000, na.rm = TRUE)
  ) * 100, 2)
)

##   distance_m n_accidents_with_neighbour_within percent_accidents
## 1         10                              1886             17.86
## 2         25                              3492             33.07
## 3         50                              5185             49.10
## 4        100                              7106             67.29
## 5        250                              9289             87.96
## 6        500                             10237             96.93
## 7       1000                             10520             99.61

The finite nearest-neighbour distances show that many accidents have another accident relatively close along the road network. This suggests local concentration of accident events, although it should be treated as descriptive evidence rather than a formal clustering test.

Quadrat test

quad_test_10 <- quadrat.test(accidents_lpp, nx = 10)

## Warning: Some expected counts are small; chi^2 approximation may be inaccurate

quad_test_10

## 
##  Chi-squared test of CSR on a network using quadrat counts
## 
## data:  accidents_lpp
## X2 = 6828, df = 99, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: Tessellation on a linear network
## 100 tiles

# in-build spatstata plot is non-inpretable, so it is better to use gg

# Clean quadrat/intensity figure

if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

roads_plot <- if (exists("roads_for_lpp")) roads_for_lpp else roads_wawa_3857
accidents_plot <- accidents_snapped_3857

boundary <- st_union(wawa_3857)

grid_sf <- st_make_grid(boundary, n = c(10, 10), square = TRUE) %>%
  st_sf(tile_id = seq_along(.), geometry = .) %>%
  st_intersection(boundary)

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

grid_sf$n_accidents <- lengths(st_intersects(grid_sf, accidents_plot))

grid_sf$road_length_m <- sapply(seq_len(nrow(grid_sf)), function(i) {
  x <- suppressWarnings(st_intersection(roads_plot, grid_sf[i, ]))
  if (nrow(x) == 0) return(0)
  sum(as.numeric(st_length(x)), na.rm = TRUE)
})

grid_sf$road_length_km <- grid_sf$road_length_m / 1000
grid_sf$accidents_per_km <- ifelse(grid_sf$road_length_km > 0, grid_sf$n_accidents / grid_sf$road_length_km, NA)

ggplot() +
  geom_sf(data = grid_sf, aes(fill = accidents_per_km), color = "white", linewidth = 0.25) +
  geom_sf(data = roads_plot, color = "grey30", linewidth = 0.15, alpha = 0.45) +
  scale_fill_gradient(low = "grey95", high = "red3", na.value = "white", name = "Accidents\nper km") +
  coord_sf(datum = NA) +
  labs(
    title = "Traffic accident intensity by network quadrat",
    subtitle = "Accident counts standardized by road length in each tile",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    legend.position = "right",
    plot.title = element_text(face = "bold")
  )

The quadrat test was used to assess whether accident counts are consistent with a homogeneous random distribution on the road network. The result strongly rejected the null hypothesis of CSR on the network, χ²(99) = 6828, p < 0.001. This indicates that accidents are not uniformly distributed along the analysed road system. Instead, accident intensity varies substantially between different parts of Warsaw. The quadrat intensity map supports this result visually, showing higher accident intensity per kilometer of road in the central part of the city and lower intensity in many peripheral areas.

This result should be interpreted as evidence of spatial inhomogeneity in accident intensity rather than direct proof of local distance-based clustering, because the quadrat test compares counts between spatial tiles rather than distances between neighbouring accidents.

Poisson model

After describing the general accident intensity and visualizing local hot spots, the next step is to model the accident pattern more formally. Since the data are treated as an unmarked point pattern on a road network, the aim is not to explain different types of accidents, but to model where accidents are more or less likely to occur along the network.

For this purpose, Poisson point process models were fitted using the linear network point pattern framework. The simplest model assumes that accident intensity is constant on every part of the road network. In other words, each meter of road is assumed to have the same expected accident risk. This homogeneous model is useful as a baseline, but it is probably too simple for urban accident data, because road structure, traffic density, and centrality vary strongly across the city.

Therefore, the homogeneous model was compared with spatial trend models, where accident intensity is allowed to change with location. The purpose of this comparison is to check whether the accident pattern can be reasonably described by a constant-intensity process, or whether there is evidence of large-scale spatial inhomogeneity in accident occurrence.

model_hom <- lppm(accidents_lpp ~ 1)

model_xy <- lppm(accidents_lpp ~ x + y)

model_quad <- lppm(accidents_lpp ~ x + y + I(x^2) + I(y^2) + I(x * y))

summary(model_hom)

## Point process model on linear network
## Fitted to point pattern dataset 'accidents_lpp'
## Internal fit: Point process model
## Fitted to data: linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE)
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.quad(Q = linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE), 
##     trend = ~1, interaction = NULL, data = NULL, method = "mpl", 
##     forcefit = TRUE)
## Edge correction: "border"
##  [border correction distance r = 0 ]
## --------------------------------------------------------------------------------
## Quadrature scheme (Berman-Turner) = data + dummy + weights
## 
## Data pattern:
## Planar point pattern:  10567 points
## Average intensity 4.61e-06 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## 
## Dummy quadrature points:
##      Equally spaced along each segment at spacing eps = 14830 metres
##      Original parameter nd = 1000
## 
## Planar point pattern:  565434 points
## Average intensity 0.000246 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## Quadrature weights:
##      Counting weights based on segment length
## All weights:
##  range: [0.000607, 2360] total: 14800000
## Weights on data points:
##  range: [0.245, 1080]    total: 203000
## Weights on dummy points:
##  range: [0.000607, 2360] total: 14600000
## --------------------------------------------------------------------------------
## FITTED :
## 
## Stationary Poisson process
## 
## ---- Intensity: ----
## 
## 
## Uniform intensity:
## [1] 0.0007124552
## 
##              Estimate        S.E.  CI95.lo   CI95.hi Ztest      Zval
## (Intercept) -7.246793 0.009728013 -7.26586 -7.227727   *** -744.9408
## 
## ----------- gory details -----
## 
## Fitted regular parameters (theta):
## (Intercept) 
##   -7.246793 
## 
## Fitted exp(theta):
##  (Intercept) 
## 0.0007124552 
## 
## Original data: accidents_lpp
## Domain: Linear network with 228000 vertices and 282717 lines
## Total length 14831810 metres
## Maximum vertex degree: 6
## [Sparse matrix representation]
## Network is disconnected:  165 connected components
## Numerical tolerance: 1.213884e-06 metres
## Enclosing window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] 
## metres

summary(model_xy)

## Point process model on linear network
## Fitted to point pattern dataset 'accidents_lpp'
## Internal fit: Error in solve.default(M) : 
##   system is computationally singular: reciprocal condition number = 2.80263e-20

## Warning: Cannot compute variance: Fisher information matrix is singular

## Error in solve.default(M) : 
##   system is computationally singular: reciprocal condition number = 2.80263e-20

## Warning: Cannot compute variance: Fisher information matrix is singular

## Point process model
## Fitted to data: linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE)
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.quad(Q = linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE), 
##     trend = ~x + y, interaction = NULL, data = NULL, method = "mpl", 
##     forcefit = TRUE)
## Edge correction: "border"
##  [border correction distance r = 0 ]
## --------------------------------------------------------------------------------
## Quadrature scheme (Berman-Turner) = data + dummy + weights
## 
## Data pattern:
## Planar point pattern:  10567 points
## Average intensity 4.61e-06 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## 
## Dummy quadrature points:
##      Equally spaced along each segment at spacing eps = 14830 metres
##      Original parameter nd = 1000
## 
## Planar point pattern:  565434 points
## Average intensity 0.000246 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## Quadrature weights:
##      Counting weights based on segment length
## All weights:
##  range: [0.000607, 2360] total: 14800000
## Weights on data points:
##  range: [0.245, 1080]    total: 203000
## Weights on dummy points:
##  range: [0.000607, 2360] total: 14600000
## --------------------------------------------------------------------------------
## FITTED :
## 
## Nonstationary Poisson process
## 
## ---- Intensity: ----
## 
## Log intensity: ~x + y
## 
## Fitted trend coefficients:
##   (Intercept)             x             y 
##  2.714156e+01 -3.048448e-05  5.400834e-06 
## 
## ----------- gory details -----
## 
## Fitted regular parameters (theta):
##   (Intercept)             x             y 
##  2.714156e+01 -3.048448e-05  5.400834e-06 
## 
## Fitted exp(theta):
##  (Intercept)            x            y 
## 6.129573e+11 9.999695e-01 1.000005e+00 
## 
## Original data: accidents_lpp
## Domain: Linear network with 228000 vertices and 282717 lines
## Total length 14831810 metres
## Maximum vertex degree: 6
## [Sparse matrix representation]
## Network is disconnected:  165 connected components
## Numerical tolerance: 1.213884e-06 metres
## Enclosing window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] 
## metres

summary(model_quad)

## Point process model on linear network
## Fitted to point pattern dataset 'accidents_lpp'
## Internal fit: Error in solve.default(M) : 
##   system is computationally singular: reciprocal condition number = 5.22235e-40

## Warning: Cannot compute variance: Fisher information matrix is singular

## Error in solve.default(M) : 
##   system is computationally singular: reciprocal condition number = 5.22235e-40

## Warning: Cannot compute variance: Fisher information matrix is singular

## Point process model
## Fitted to data: linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE)
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.quad(Q = linequad(accidents_lpp, eps = NULL, nd = 1000, random = FALSE), 
##     trend = ~x + y + I(x^2) + I(y^2) + I(x * y), interaction = NULL, 
##     data = NULL, method = "mpl", forcefit = TRUE)
## Edge correction: "border"
##  [border correction distance r = 0 ]
## --------------------------------------------------------------------------------
## Quadrature scheme (Berman-Turner) = data + dummy + weights
## 
## Data pattern:
## Planar point pattern:  10567 points
## Average intensity 4.61e-06 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## 
## Dummy quadrature points:
##      Equally spaced along each segment at spacing eps = 14830 metres
##      Original parameter nd = 1000
## 
## Planar point pattern:  565434 points
## Average intensity 0.000246 points per square metre
## Window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] metres
##                     (46690 x 49130 metres)
## Window area = 2294130000 square metres
## Unit of length: 1 metre
## Quadrature weights:
##      Counting weights based on segment length
## All weights:
##  range: [0.000607, 2360] total: 14800000
## Weights on data points:
##  range: [0.245, 1080]    total: 203000
## Weights on dummy points:
##  range: [0.000607, 2360] total: 14600000
## --------------------------------------------------------------------------------
## FITTED :
## 
## Nonstationary Poisson process
## 
## ---- Intensity: ----
## 
## Log intensity: ~x + y + I(x^2) + I(y^2) + I(x * y)
## 
## Fitted trend coefficients:
##   (Intercept)             x             y        I(x^2)        I(y^2) 
## -2.548676e+05  2.749759e-02  6.510358e-02 -4.963022e-09 -4.650227e-09 
##      I(x * y) 
## -6.280656e-10 
## 
## ----------- gory details -----
## 
## Fitted regular parameters (theta):
##   (Intercept)             x             y        I(x^2)        I(y^2) 
## -2.548676e+05  2.749759e-02  6.510358e-02 -4.963022e-09 -4.650227e-09 
##      I(x * y) 
## -6.280656e-10 
## 
## Fitted exp(theta):
## (Intercept)           x           y      I(x^2)      I(y^2)    I(x * y) 
##    0.000000    1.027879    1.067270    1.000000    1.000000    1.000000 
## 
## Original data: accidents_lpp
## Domain: Linear network with 228000 vertices and 282717 lines
## Total length 14831810 metres
## Maximum vertex degree: 6
## [Sparse matrix representation]
## Network is disconnected:  165 connected components
## Numerical tolerance: 1.213884e-06 metres
## Enclosing window: rectangle = [2321199.3, 2367893.7] x [6817837, 6866968] 
## metres

AIC(model_hom, model_xy, model_quad)

##            df      AIC
## model_hom   1 174289.7
## model_xy    3 173356.5
## model_quad  6 169991.0

plot(predict(model_quad), main = "Fitted accident intensity: quadratic spatial trend")
points(accidents_lpp, pch = 16, cex = 0.12)

The homogeneous model assumed constant accident intensity along all road segments, whereas the spatial trend models allowed intensity to vary with location. Model comparison using AIC showed a clear improvement when spatial trend terms were included. The homogeneous model had AIC = 174289.7, the linear trend model had AIC = 173356.5, and the quadratic trend model had AIC = 169991.0. Therefore, the quadratic spatial trend model provided the best fit. This indicates that traffic accident intensity is strongly spatially inhomogeneous and cannot be described by a constant-intensity process on the road network.

Although the quadratic model had the lowest AIC, suggesting strong large-scale spatial inhomogeneity, the coefficients themselves were not interpreted. The model produced singularity warnings, most likely because the raw projected coordinates are very large and quadratic terms were included. Therefore, the model was used mainly as an exploratory comparison against the homogeneous model and as a visualisation of broad spatial trend.

The fitted intensity surface shows higher expected accident intensity in the central part of the analysed road network and lower intensity in peripheral areas. This pattern is consistent with the expected structure of urban traffic exposure, where central and highly connected road sections generate more accident opportunities.

Conclusion

This project analysed traffic accidents in Warsaw in 2024 using a linear network point pattern approach. The main assumption was that accidents should not be treated as events occurring freely across the whole city area, but as events constrained to the road network. Therefore, after cleaning the coordinate data, filtering observations to the Warsaw boundary, and matching accident locations to the road system, the accidents were represented as an unmarked point pattern on a linear network.

The analysis showed that accidents were not evenly distributed along Warsaw roads. The overall accident intensity provided a useful baseline, but the following analyses indicated that this average value hides strong spatial variation. Some parts of the network had visibly higher accident concentration, especially in more central and densely connected areas.

The hotspot visualisation suggested the existence of road sections and zones with increased accident intensity. The nearest-neighbour distance analysis also showed that many accidents occurred close to other accidents along the road network, which points to local concentration of events. However, this should not be interpreted too quickly as pure clustering, because part of this pattern may result from broader spatial inhomogeneity, such as differences in road density, traffic volume, and urban structure.

The quadrat test confirmed that accident counts vary strongly between different parts of the network. This means that the assumption of a homogeneous random distribution is not appropriate for the analysed data. In other words, accident risk is not constant across Warsaw roads, but changes substantially from one part of the city to another.

The Poisson point process models led to a similar conclusion. The homogeneous model, which assumes constant accident intensity across the whole network, was too simple. Spatial trend models fitted the data better, suggesting that large-scale spatial variation plays an important role in the distribution of accidents. The model results should be treated mainly as evidence of inhomogeneity rather than as a precise explanation of individual accident locations.

There were also some practical limitations. Some network-based methods, such as the full linear K-function and exact network kernel density estimation, were computationally difficult to apply because of the size and complexity of the Warsaw road network. Moreover, the analysis did not include additional explanatory variables such as traffic volume, speed limits, road width, intersection structure, or accident severity. Including such factors would make it possible to explain the observed spatial differences in more detail.

Overall, the results suggest that the analysed subset of Warsaw traffic accidents was not randomly distributed along the selected road network. The pattern is characterized mainly by strong spatial inhomogeneity, with higher accident intensity in selected parts of the city and local concentrations along particular sections of the road system. This confirms that the linear network approach is appropriate for this type of data, because it reflects the actual spatial structure within which traffic accidents occur.

References

Xie, Z., & Yan, J. (2008). Kernel density estimation of traffic accidents in a network space. Computers, Environment and Urban Systems, 32(5), 396–406. https://doi.org/10.1016/j.compenvurbsys.2008.05.001
Yamada, I., & Thill, J.-C. (2004). Comparison of planar and network K-functions in traffic accident analysis. Journal of Transport Geography, 12(2), 149–158. https://doi.org/10.1016/j.jtrangeo.2003.10.006
Bíl, M., Andrášik, R., & Janoška, Z. (2013). Identification of hazardous road locations of traffic accidents by means of kernel density estimation and cluster significance evaluation. Accident Analysis & Prevention, 55, 265–273. https://doi.org/10.1016/j.aap.2013.03.003
Ang, Q. W., Baddeley, A., & Nair, G. (2012). Geometrically corrected second order analysis of events on a linear network, with applications to ecology and criminology. Scandinavian Journal of Statistics, 39(4), 591–617. https://doi.org/10.1111/j.1467-9469.2011.00752.x