The Historische Databank Nederlandse Gemeenten (HDNG) contains historical information about Dutch municipalities, including demographic, economic, and administrative characteristics. The database stores information in long format, meaning that each row corresponds to a single observation of a variable for a municipality in a specific year. As a result, variables such as population and municipal area are not stored in separate columns but appear as values in a column describing the variable type.
One variable that can be derived from the HDNG is population density, defined as the number of inhabitants per unit of municipal area. Population density is an important indicator for understanding spatial patterns of settlement and urbanization. Differences in density may reflect the presence of urban centers, rural municipalities, or sparsely populated areas. In this report, municipal population density is constructed for all Dutch municipalities in a single year. Because the HDNG does not always directly contain a population density variable, it is necessary to calculate it using population and area information stored in the dataset.The area variable is used as provided in the HDNG dataset. Because population density is analyzed comparatively across municipalities rather than interpreted in absolute units, the specific area unit does not affect the conclusions.Two research questions guide the analysis:
What is the distribution of municipal population density in the selected year?
Is the required information available for all municipalities, or are there missing observations?
Answering these questions requires identifying the relevant variables, determining whether the necessary information is available for the selected year, constructing the density variable, and evaluating the completeness of the resulting dataset.
Firstly, the needed packages were loaded (data.table for fast manipulation and ggplot2 for plotting). Then we read the HDNG file from disk into a data.table.
library(data.table)
library(ggplot2)
PATH <- "C:/Users/ntbml/Desktop/R-Working Directory/HDNG_v4.txt"
HDNG <- fread(PATH, sep = ",", quote = "\"", header = TRUE)
First, it was determined in which years population and area information were available simultaneously but population density itself was not directly recorded. These are the years for which population density can be constructed manually.
This was done by identifying the years in which the variables Bevolking (population) and Oppervlakte (area) appear while the Bevolkingsdichtheid (population density) is not present.
years_construct <- with(
HDNG,
sort(setdiff(
intersect(unique(year[description == "Bevolking"]),
unique(year[description == "Oppervlakte"])),
unique(year[description == "Bevolkingsdichtheid"])
))
)
cat("Years where density can be constructed (Bevolking+Oppervlakte exist, Bevolkingsdichtheid absent):\n")
print(years_construct)
YEAR <- 1889
This step showed that 1859 and 1889 are years where population and area data are available, but density must be calculated. The year 1889 was selected for the analysis.
Before constructing the density variable, it was necessary to determine whether information is available for all municipalities in the selected year.The number of municipalities in the entire dataset was compared with the number of municipalities that have any observation in 1889. The difference indicates municipalities that are missing observations in the selected year.
all_munis <- unique(HDNG$amco[!is.na(HDNG$amco)])
munis_year <- unique(HDNG$amco[HDNG$year == YEAR & !is.na(HDNG$amco)])
missing_munis <- setdiff(all_munis, munis_year)
availability <- data.table(
Year = YEAR,
Total_municipalities_in_HDNG = length(all_munis),
Municipalities_present_in_year = length(munis_year),
Municipalities_missing_in_year = length(missing_munis)
)
cat("\nAvailability (presence in year):\n")
print(availability)
This analysis showed that:
1224 municipalities exist in the full dataset
1123 municipalities have observations in 1889
101 municipalities are missing in that year
Municipalities that are missing in 1889 may not necessarily represent missing data errors. In many cases, municipalities may have been created after the selected year or may have ceased to exist before that year due to administrative changes such as mergers or boundary changes. To understand the source of missing observations, the first and last year in which each missing municipality appears in the dataset was calculated. Based on this information, municipalities were classified into three categories:
municipalities that disappear before the selected year
municipalities that appear after the selected year
municipalities that exist both before and after but lack data in that year
time_span_missing <- HDNG[
amco %in% missing_munis,
.(first_year = min(year), last_year = max(year), n_years_observed = uniqueN(year)),
by = .(amco, name)
]
time_span_missing[, existed_around_year := first_year < YEAR & last_year > YEAR]
missing_categories <- time_span_missing[
,
.N,
by = .(
category = fifelse(first_year > YEAR, "Appears after year",
fifelse(last_year < YEAR, "Disappears before year",
"Exists before and after (missing in year only)"))
)
][order(-N)]
cat("\nMissing municipality categories:\n")
print(missing_categories)
The results showed that 101 municipalities were missing in 1889 (1224 total in the dataset vs. 1123 present in 1889). Of these 101 missing municipalities, 90 were classified as “Disappears before year” and 12 were classified as “Appears after year.” Only 1 municipality was classified as “Exists before and after (missing in year only).” This pattern indicates that the missing municipalities in 1889 were explained almost entirely by administrative change over time (municipalities that ceased to exist before 1889 or were created after 1889).
Before analyzing the resulting density variable, potential data issues were explored. These include:
municipalities with missing area values
municipalities with zero population
municipalities with zero density
diagnostics <- data.table(
YEAR = YEAR,
n_rows_merged = nrow(table_1889),
n_na_area = sum(is.na(table_1889$Oppervlakte)),
n_na_density = sum(is.na(table_1889$dichtheid)),
n_zero_population = sum(table_1889$Bevolking == 0, na.rm = TRUE),
n_zero_area = sum(table_1889$Oppervlakte == 0, na.rm = TRUE),
n_zero_density = sum(table_1889$dichtheid == 0, na.rm = TRUE)
)
cat("\nDiagnostics (merged table):\n")
print(diagnostics)
cat("\nRows with NA area (=> NA density):\n")
print(table_1889[is.na(Oppervlakte)])
cat("\nRows with density == 0:\n")
print(table_1889[!is.na(dichtheid) & dichtheid == 0])
Because the final visualization uses a logarithmic scale, municipalities with missing or zero values cannot be included. Therefore, rows containing missing population, missing area, zero population, zero area, or zero density were removed.
table_1889_clean <- table_1889[
!is.na(Bevolking) & !is.na(Oppervlakte) & !is.na(dichtheid) &
Bevolking > 0 & Oppervlakte > 0 & dichtheid > 0
]
removed_munis <- table_1889[
is.na(Bevolking) | is.na(Oppervlakte) | is.na(dichtheid) |
Bevolking == 0 | Oppervlakte == 0 | dichtheid == 0,
unique(amco)
]
diagnostic_overlap <- data.table(
removed_municipalities = length(removed_munis),
missing_in_year = length(missing_munis),
overlap = length(intersect(removed_munis, missing_munis))
)
cat("\nDiagnostic overlap (removed vs missing-in-year):\n")
print(diagnostic_overlap)
cat("\nRows kept for analysis (Bevolking>0, Oppervlakte>0, dichtheid>0):\n")
print(data.table(
n_kept = nrow(table_1889_clean),
n_dropped = nrow(table_1889) - nrow(table_1889_clean)
))
An additional diagnostic check was conducted to determine whether the municipalities removed during cleaning correspond to the municipalities that were entirely missing in 1889. The overlap was 0, meaning that the municipalities removed due to invalid values (zero or missing population/area/density) are not the same as the municipalities that do not appear in the year 1889 at all. This indicates that the cleaning step mainly removes municipalities that are present in 1889 but have incomplete or inconsistent records, rather than municipalities that did not exist in that year.
To answer the first research question, descriptive statistics were calculated for the constructed density variable. These include the minimum, quartiles, median, mean, and maximum values.
stats <- c(
Min = min(table_1889_clean$dichtheid, na.rm = TRUE),
Q1 = unname(quantile(table_1889_clean$dichtheid, 0.25, na.rm = TRUE)),
Median = median(table_1889_clean$dichtheid, na.rm = TRUE),
Mean = mean(table_1889_clean$dichtheid, na.rm = TRUE),
Q3 = unname(quantile(table_1889_clean$dichtheid, 0.75, na.rm = TRUE)),
Max = max(table_1889_clean$dichtheid, na.rm = TRUE)
)
summary_table <- data.table(Statistic = names(stats), Value = as.numeric(stats))
cat("\nSummary of constructed density (CLEAN):\n")
print(summary_table)
| Statistic | Value |
|---|---|
| Min | 0.1371047 |
| Q1 | 0.6642146 |
| Median | 0.9496403 |
| Mean | 3.2505710 |
| Q3 | 1.5119251 |
| Max | 271.1187500 |
The descriptive statistics show that most municipalities have relatively low population densities, while a small number have much higher values. In the cleaned dataset, the median population density is 0.95, while the mean density is 3.25.
The first and third quartiles are 0.66 and 1.51, meaning that half of the municipalities have densities within this range. However, the maximum observed density is 271.12, which is far higher than the median.
The fact that the mean (3.25) is larger than the median (0.95) indicates that the distribution is right-skewed. This suggests that while most municipalities are rural or moderately populated, a small number of municipalities exhibit much higher population densities, likely corresponding to urban centers.
A boxplot was used to visualize the distribution of municipal population density. Boxplots are useful for summarizing the distribution of a variable because they display key descriptive statistics in a compact form. The central line in the box represents the median, the box itself represents the interquartile range (the middle 50% of observations), and the whiskers extend to the remaining range of the data. This type of visualization is particularly suitable for population density because the variable is typically highly skewed. A boxplot makes it easy to identify the spread of the data and to observe whether a small number of municipalities have substantially higher densities than the majority. Using a logarithmic scale further improves the visualization by making differences between low-density municipalities easier to observe while still displaying higher-density observations.
mean_density <- mean(table_1889_clean$dichtheid, na.rm = TRUE)
p_box <- ggplot(table_1889_clean, aes(x = "", y = dichtheid)) +
geom_boxplot(outlier.shape = NA) +
geom_point(aes(x = "", y = mean_density), size = 3) +
scale_y_log10(
breaks = c(
0.1, 0.2, 0.3, 0.5,
1, 1.5, 2, 3, 5,
10, 15, 20
)
) +
coord_cartesian(ylim = c(0.1, 20)) +
theme_minimal() +
labs(
title = "Population Density per Municipality (1889)",
x = NULL,
y = "Population Density (log scale)"
)
print(p_box)
Distribution of municipal population density in 1889
The boxplot provides a visual representation of the distribution of municipal population density in 1889. The median density is indicated by the horizontal line inside the box, while the box itself represents the interquartile range, containing the middle 50% of municipalities. Most municipalities fall within a relatively narrow range of population densities, but the upper whisker extends much further than the lower one on the logarithmic scale. This indicates that a small number of municipalities have substantially higher population densities than the majority. The black dot representing the mean lies above the median, which further confirms that the distribution is right-skewed. Overall, the figure confirms the pattern observed in the summary statistics: most municipalities in 1889 had relatively low population densities, while a small number of urban municipalities had much higher densities.
This report examined the distribution of municipal population density in the Netherlands in 1889 using the Historische Databank Nederlandse Gemeenten (HDNG). Because the dataset stores information in long format and does not always directly include a density variable, population density was constructed by combining population and municipal area information. The analysis shows that the distribution of population density is highly skewed. Most municipalities have relatively low densities, reflecting the predominantly rural character of the Netherlands in the late nineteenth century. A smaller number of municipalities exhibit much higher densities, which likely correspond to early urban centers.
The analysis of data availability shows that the required information is available for the large majority of municipalities in the selected year. Most municipalities missing in 1889 are explained by administrative changes over time, such as municipalities that disappeared before 1889 or were created afterward. Only a small number of municipalities appear to be missing due to incomplete data. Finally, several observations with missing or zero values were identified and removed before calculating population density. This ensured that the final analysis only included municipalities for which both population and area were available.
library(data.table)
library(ggplot2)
PATH <- "C:/Users/ntbml/Desktop/R-Working Directory/HDNG_v4.txt"
HDNG <- fread(PATH, sep = ",", quote = "\"", header = TRUE)
years_construct <- with(
HDNG,
sort(setdiff(
intersect(unique(year[description == "Bevolking"]),
unique(year[description == "Oppervlakte"])),
unique(year[description == "Bevolkingsdichtheid"])
))
)
YEAR <- 1889
all_munis <- unique(HDNG$amco[!is.na(HDNG$amco)])
munis_year <- unique(HDNG$amco[HDNG$year == YEAR & !is.na(HDNG$amco)])
missing_munis <- setdiff(all_munis, munis_year)
availability <- data.table(
Year = YEAR,
Total_municipalities_in_HDNG = length(all_munis),
Municipalities_present_in_year = length(munis_year),
Municipalities_missing_in_year = length(missing_munis)
)
time_span_missing <- HDNG[
amco %in% missing_munis,
.(first_year = min(year), last_year = max(year), n_years_observed = uniqueN(year)),
by = .(amco, name)
]
time_span_missing[, existed_around_year := first_year < YEAR & last_year > YEAR]
missing_categories <- time_span_missing[
,
.N,
by = .(
category = fifelse(first_year > YEAR, "Appears after year",
fifelse(last_year < YEAR, "Disappears before year",
"Exists before and after (missing in year only)"))
)
][order(-N)]
pop <- HDNG[
!is.na(amco) &
description == "Bevolking" &
year == YEAR &
sex %in% c("M", "V"),
.(Bevolking = sum(value, na.rm = TRUE)),
by = .(amco, name)
]
area <- HDNG[
!is.na(amco) &
description == "Oppervlakte" &
year == YEAR,
.(Oppervlakte = value),
by = .(amco, name)
]
table_1889 <- merge(pop, area, by = c("amco", "name"))
table_1889[, dichtheid := Bevolking / Oppervlakte]
table_1889_clean <- table_1889[
!is.na(Bevolking) & !is.na(Oppervlakte) & !is.na(dichtheid) &
Bevolking > 0 & Oppervlakte > 0 & dichtheid > 0
]
stats <- c(
Min = min(table_1889_clean$dichtheid, na.rm = TRUE),
Q1 = unname(quantile(table_1889_clean$dichtheid, 0.25, na.rm = TRUE)),
Median = median(table_1889_clean$dichtheid, na.rm = TRUE),
Mean = mean(table_1889_clean$dichtheid, na.rm = TRUE),
Q3 = unname(quantile(table_1889_clean$dichtheid, 0.75, na.rm = TRUE)),
Max = max(table_1889_clean$dichtheid, na.rm = TRUE)
)
summary_table <- data.table(Statistic = names(stats), Value = as.numeric(stats))
mean_density <- mean(table_1889_clean$dichtheid, na.rm = TRUE)
p_box <- ggplot(table_1889_clean, aes(x = "", y = dichtheid)) +
geom_boxplot(outlier.shape = NA) +
annotate("point", x = 1, y = mean_density, size = 3) +
scale_y_log10(
breaks = c(
0.1, 0.2, 0.3, 0.5,
1, 1.5, 2, 3, 5,
10, 15, 20
)
) +
coord_cartesian(ylim = c(0.1, 20)) +
theme_minimal() +
labs(
title = "Population Density per Municipality (1889)",
x = NULL,
y = "Population Density (log scale)"
)
p_box