# Synopsis
This report analyzes sequence `7_approach_underground_station_7.1` of the Open Sensor Data for Rail 2023 (OSDaR23) dataset. The goal is to demonstrate a complete data stewardship and calibration analysis pipeline in R. Starting from the raw JSON annotation file and the plain-text calibration file, the data were parsed, cleaned, and transformed into tidy structures suitable for validation and visualization. The analysis cross-checks the sensor streams named in the annotation and calibration metadata to identify inconsistencies, and highlights key aspects of multimodal calibration and annotation metadata. Five figures illustrate data structure, annotation patterns, and sensor alignment. The project follows the FAIR principles (Findable, Accessible, Interoperable, Reusable) and is reproducible: all parsing, preprocessing, and visualization steps run inside this R Markdown file, so every figure and table is generated directly from the raw inputs with no external processing. The outputs are cleaned datasets, quality checks, and an interpretive analysis for stewardship purposes. The report concludes with insights on metadata completeness and recommendations for improving calibration consistency.
# Data Processing
This section describes how the raw files were imported and processed inside R. No data preparation was done outside of this R Markdown document. The code reads both the OpenLABEL annotation file (.json) and the calibration file (.txt), extracts metadata, and prepares tidy datasets for analysis.
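## Step 1: Load Raw Data
```{r}
# Packages used throughout: jsonlite (JSON I/O), tidyverse (stringr, purrr,
# dplyr, tibble, readr, ggplot2), and fs (directory creation)
library(jsonlite)
library(tidyverse)
library(fs)
# Define raw file paths
label_path <- "7_approach_underground_station_7.1/7_approach_underground_station_7.1_labels.json"
calib_path <- "7_approach_underground_station_7.1/calibration.txt"
# Load OpenLABEL annotation file as nested lists (no simplification)
osdar_json <- fromJSON(label_path, simplifyVector = FALSE)
# Load calibration file (warn = FALSE tolerates an incomplete final line)
calib_lines <- readLines(calib_path, warn = FALSE)
```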
## Step 2: Parse Calibration Data
The calibration file contains extrinsic parameters for each camera or sensor. Each block is separated by lines of hash marks (###). We extract the folder name, 3D position, and quaternion rotation for each sensor.
```{r}
# Identify separator lines; the appended sentinel marks the end of the final block
block_indices <- which(str_detect(calib_lines, "^#{3,}"))
block_indices <- c(block_indices, length(calib_lines) + 1)
# Parse one block into a one-row tibble (NA where a field is absent)
parse_calib_block <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  folder <- str_match(txt, "data_folder:\\s*(\\S+)")[, 2]
  pos_str <- str_match(txt, "position:\\s*\\[([^\\]]+)\\]")[, 2]
  rot_str <- str_match(txt, "rotation_quaternion.*:\\s*\\[([^\\]]+)\\]")[, 2]
  pos <- if (!is.na(pos_str)) as.numeric(str_split(pos_str, ",")[[1]]) else rep(NA_real_, 3)
  rot <- if (!is.na(rot_str)) as.numeric(str_split(rot_str, ",")[[1]]) else rep(NA_real_, 4)
  # Quaternion components are assumed to be stored in (w, x, y, z) order
  tibble(
    folder, pos_x = pos[1], pos_y = pos[2], pos_z = pos[3],
    q_w = rot[1], q_x = rot[2], q_y = rot[3], q_z = rot[4]
  )
}
# Apply parser to all blocks
calib_df <- map_dfr(seq_len(length(block_indices) - 1), function(i) {
  start <- block_indices[i] + 1
  end <- block_indices[i + 1] - 1
  # Adjacent separator lines give an empty block; start:end would count down
  if (start > end) return(NULL)
  row <- parse_calib_block(calib_lines[start:end])
  # Skip blocks that carry no data_folder entry (e.g. header-only blocks)
  if (is.na(row$folder)) NULL else row
}) %>% distinct()
head(calib_df)
```
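As an illustrative quality check on the parsed rotations, a valid rotation quaternion should have unit norm. The sketch below runs on made-up values in a stand-in data frame (`calib_toy`); the helper `quat_norm()` and the tolerance `tol` are assumptions introduced here for illustration and are not part of the original pipeline.

```r
# Hypothetical helper: Euclidean norm of each quaternion (vectorized over rows)
quat_norm <- function(q_w, q_x, q_y, q_z) {
  sqrt(q_w^2 + q_x^2 + q_y^2 + q_z^2)
}

# Toy rows standing in for calib_df (values are invented for illustration)
calib_toy <- data.frame(
  folder = c("sensor_a", "sensor_b", "sensor_c"),
  q_w = c(1, 0.7071068, 0.9),
  q_x = c(0, 0, 0.3),
  q_y = c(0, 0, 0.3),
  q_z = c(0, 0.7071068, 0.3)
)

calib_toy$q_norm <- with(calib_toy, quat_norm(q_w, q_x, q_y, q_z))

# Flag rows whose norm deviates from 1 by more than an assumed tolerance
tol <- 1e-3
calib_toy$unit_ok <- abs(calib_toy$q_norm - 1) < tol

print(calib_toy[, c("folder", "q_norm", "unit_ok")])
```

Because the norm does not depend on component order, this check works whether the file stores quaternions as (w, x, y, z) or (x, y, z, w). The same two lines could be applied to `calib_df` after dropping rows with missing rotations.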
## Step 3: Parse Annotation Data
The JSON annotation file follows the ASAM OpenLABEL format. We extract object types, annotation types, and frame intervals for each object.
```{r}
openlabel <- osdar_json$openlabel
objects <- openlabel$objects
# Collect one row per (object, data pointer, frame interval)
ann_rows <- list()
for (id in names(objects)) {
  obj <- objects[[id]]
  obj_type <- obj$type %||% NA_character_
  odp <- obj$object_data_pointers
  # Skip objects that carry no data pointers
  if (length(odp) == 0) next
  for (pname in names(odp)) {
    # Pointer names are split on "__": first part = stream, second = annotation type
    parts <- str_split(pname, "__")[[1]]
    stream <- parts[1]
    ann_type <- parts[2]
    frames <- odp[[pname]]$frame_intervals
    for (fi in frames) {
      ann_rows[[length(ann_rows) + 1]] <- tibble(
        obj_id = id, obj_type = obj_type, stream = stream, ann_type = ann_type,
        frame_start = fi$frame_start, frame_end = fi$frame_end
      )
    }
  }
}
ann_df <- bind_rows(ann_rows)
head(ann_df)
```
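The loop above relies on the data pointer names encoding the stream and the annotation type separated by a double underscore. A minimal base R sketch of that split follows; the function `split_pointer()` and the example name `camera_front__bbox__person` are hypothetical and only illustrate the naming pattern the parser assumes.

```r
# Hypothetical helper: split a pointer name on "__" and keep the first two parts
split_pointer <- function(pname) {
  parts <- strsplit(pname, "__", fixed = TRUE)[[1]]
  list(
    stream = parts[1],
    # Names without a "__" separator yield no annotation type
    ann_type = if (length(parts) >= 2) parts[2] else NA_character_
  )
}

# Invented pointer names, for illustration only
res <- split_pointer("camera_front__bbox__person")
res_bad <- split_pointer("no_separator")

str(res)
str(res_bad)
```

Single underscores inside a part (as in `camera_front`) are left intact because only the double underscore acts as the delimiter. A name with no delimiter produces a missing annotation type, which is worth counting before plotting.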
## Step 4: Write Cleaned Data and Validation Summary
```{r}
# Create output directories (dir_create is a no-op if they already exist)
output_dir <- "cleaned_calibrated_data"
fig_dir <- file.path(output_dir, "figures")
dir_create(output_dir)
dir_create(fig_dir)
# Write the tidy calibration and annotation tables
write_csv(calib_df, file.path(output_dir, "cleaned_calibration.csv"))
write_csv(ann_df, file.path(output_dir, "annotation_intervals.csv"))
# Streams that appear in annotations but have no calibration entry
missing_folders <- setdiff(unique(ann_df$stream), unique(calib_df$folder))
# Calibrated folders that never appear in annotations
missing_calib <- setdiff(unique(calib_df$folder), unique(ann_df$stream))
validation_summary <- list(
  missing_streams_in_calibration = missing_folders,
  missing_calibration_folders_in_annotations = missing_calib,
  calibration_rows = nrow(calib_df),
  annotation_rows = nrow(ann_df)
)
write_json(validation_summary, file.path(output_dir, "metadata_validation_summary.json"), pretty = TRUE)
```
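The two `setdiff()` calls are directional, which is easy to misread. The toy example below, using invented stream names, shows what each direction reports; the variable names `uncalibrated_streams` and `unannotated_folders` are introduced here for illustration only.

```r
# Invented stream and folder names, for illustration only
ann_streams <- c("cam_a", "cam_b", "lidar")
calib_folders <- c("cam_a", "lidar", "radar")

# In annotations but absent from calibration: annotated data with no pose
uncalibrated_streams <- setdiff(ann_streams, calib_folders)

# In calibration but absent from annotations: calibrated sensors never labeled
unannotated_folders <- setdiff(calib_folders, ann_streams)

message("Streams without calibration: ", paste(uncalibrated_streams, collapse = ", "))
message("Calibrated folders without annotations: ", paste(unannotated_folders, collapse = ", "))
```

The first set is usually the more serious finding for stewardship, since annotations in an uncalibrated stream cannot be related to the other sensors.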
# Results
This section presents the findings derived from the cleaned calibration and annotation data. Five figures summarize the results.
## Figure 1: Object Counts by Class
```{r}
obj_counts <- ann_df %>% count(obj_type, name = "n") %>% arrange(desc(n))
ggplot(obj_counts, aes(x = reorder(obj_type, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Figure 1: Object Counts by Class", x = "Object Type", y = "Count") +
  theme_minimal()
```
Interpretation: This figure shows how many annotation records exist for
each object type. Each row of `ann_df` is one frame interval of one data
pointer, so an object annotated in several streams is counted once per
stream and interval rather than once overall. Dominant classes reflect
the dataset’s focus and guide quality assurance priorities.
## Figure 2: Annotations per Sensor Stream
```{r}
stream_counts <- ann_df %>% count(stream, name = "n")
ggplot(stream_counts, aes(x = reorder(stream, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Figure 2: Annotation Distribution per Sensor Stream",
       x = "Sensor Stream", y = "Number of Annotations") +
  theme_minimal()
```
Interpretation: This distribution shows which sensors contribute most to
annotations. Sparse streams may indicate sensors that are missing from or
underused in labeling.
## Figure 3: Distribution of Annotation Interval Lengths
```{r}
# Intervals are inclusive, hence the + 1
intervals <- ann_df %>% mutate(length = frame_end - frame_start + 1)
ggplot(intervals, aes(x = length)) +
  geom_histogram(bins = 30, fill = "seagreen", color = "white") +
  labs(title = "Figure 3: Distribution of Annotation Interval Lengths",
       x = "Interval Length (Frames)", y = "Frequency") +
  theme_minimal()
```
Interpretation: Short intervals indicate per-frame annotations, while
longer ones capture persistent tracked objects.
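Before reading the histogram, it can help to confirm that every interval is well formed and to quantify the single-frame share. The following is a minimal sketch on invented rows (`ann_toy`), assuming the same inclusive length definition as above; none of these values come from the dataset.

```r
# Toy intervals standing in for ann_df (values are invented for illustration)
ann_toy <- data.frame(
  obj_id = c("a", "a", "b", "c"),
  frame_start = c(0, 10, 5, 8),
  frame_end = c(9, 12, 5, 3)
)

# Inclusive interval length: frames 0 through 9 span 10 frames
ann_toy$length <- ann_toy$frame_end - ann_toy$frame_start + 1

# A length below 1 means frame_end precedes frame_start
invalid <- ann_toy[ann_toy$length < 1, ]

# Share of well-formed intervals that cover exactly one frame
valid <- ann_toy[ann_toy$length >= 1, ]
single_frame_share <- mean(valid$length == 1)

print(invalid)
print(single_frame_share)
```

Malformed rows would otherwise land in the leftmost histogram bin and be mistaken for short intervals.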
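## Figure 4: Annotation Types per Object Class
```{r}
type_class <- ann_df %>% count(obj_type, ann_type)
# Order classes by their total count; the default reorder() summary is the
# mean, which would misorder classes that have several annotation types
ggplot(type_class, aes(x = reorder(obj_type, n, FUN = sum), y = n, fill = ann_type)) +
  geom_col() +
  coord_flip() +
  labs(title = "Figure 4: Annotation Types by Object Class",
       x = "Object Class", y = "Count", fill = "Annotation Type") +
  theme_minimal()
```
Interpretation: Displays the diversity of annotation geometry types per
object class. Mixed types within a class can arise legitimately when
different sensors use different geometries, so they should be checked
against the labeling guidelines before being treated as labeling
inconsistency.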
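To list the classes with mixed geometry types without reading them off the chart, the number of distinct annotation types per class can be tabulated. This is a sketch on an invented data frame (`toy`); the class and type names are placeholders and not taken from the dataset.

```r
# Toy annotation rows (class and type names are invented for illustration)
toy <- data.frame(
  obj_type = c("person", "person", "train", "train", "train"),
  ann_type = c("bbox", "cuboid", "bbox", "bbox", "bbox")
)

# Number of distinct annotation types observed for each object class
n_types <- tapply(toy$ann_type, toy$obj_type, function(x) length(unique(x)))

# Classes annotated with more than one geometry type
mixed_classes <- names(n_types)[n_types > 1]

print(n_types)
print(mixed_classes)
```

Grouping additionally by stream would show whether the mix comes from different sensors or occurs within a single stream, which is the case more likely to need review.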
## Figure 5: Camera Sensor Positions (2D Projection)
```{r}
ggplot(calib_df, aes(x = pos_x, y = pos_y, label = folder)) +
  geom_point(size = 3, color = "dodgerblue") +
  geom_text(nudge_y = 0.05, size = 3) +
  labs(title = "Figure 5: Camera Sensor Positions (X vs Y Projection)",
       x = "Position X (m)", y = "Position Y (m)") +
  theme_minimal()
```
Interpretation: The 2D projection of sensor positions is a plausibility
check on the calibration. It ignores height (z) and orientation, so it
cannot confirm alignment by itself, but positions far from the rest of
the sensor group may indicate miscalibration or a parsing error.
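One way to make "outlier" concrete is to flag sensors whose planar distance from the median position exceeds a robust threshold. The sketch below uses invented coordinates (`pos`) and an assumed rule of median plus three times the median absolute deviation; both the rule and the factor 3 are illustrative choices, not part of the original analysis.

```r
# Toy sensor positions in metres (values are invented for illustration)
pos <- data.frame(
  folder = c("s1", "s2", "s3", "s4", "s5"),
  pos_x = c(0.0, 0.1, -0.1, 0.05, 5.0),
  pos_y = c(0.0, 0.2, -0.2, 0.1, 4.0)
)

# Robust centre of the sensor group in the X-Y plane
cx <- median(pos$pos_x)
cy <- median(pos$pos_y)

# Planar distance of each sensor from that centre
pos$dist <- sqrt((pos$pos_x - cx)^2 + (pos$pos_y - cy)^2)

# Assumed threshold: median distance plus 3 scaled MADs
thr <- median(pos$dist) + 3 * mad(pos$dist)
pos$outlier <- pos$dist > thr

print(pos[, c("folder", "dist", "outlier")])
```

A flagged sensor is a candidate for manual review only: sensors can be mounted far apart by design, so the result should be compared with the documented sensor layout.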