# Synopsis

This report analyzes sequence 7_approach_underground_station_7.1 of the Open Sensor Data for Rail 2023 (OSDaR23) dataset. The goal is to demonstrate a complete data stewardship and calibration analysis pipeline in R. Starting from the raw OpenLABEL annotation file (JSON) and the calibration file (text), the data are parsed, cleaned, and transformed into tidy tables suitable for validation and visualization. The analysis cross-checks the sensors named in the calibration file against the sensor streams referenced by the annotations and highlights key aspects of multimodal calibration and annotation metadata. Five figures illustrate data structure, annotation patterns, and sensor placement. The project follows the FAIR principles (Findable, Accessible, Interoperable, Reusable), and all parsing, preprocessing, and visualization steps run inside this R Markdown file, so every figure and table is generated directly from the raw inputs. The outputs are cleaned datasets, quality checks, and interpretive analysis for stewardship purposes. The report concludes with insights on metadata completeness and recommendations for improving calibration consistency.

# Data Processing

This section describes how the raw files are imported and processed in R. No data preparation was done outside this R Markdown document. The code reads the OpenLABEL annotation file (.json) and the calibration file (.txt), extracts their metadata, and builds tidy datasets for analysis.

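## Step 1: Load Raw Data

# Load the packages used throughout this report
library(jsonlite)   # fromJSON(), write_json()
library(tidyverse)  # stringr, purrr, tibble, dplyr, readr, ggplot2
library(fs)         # dir_create()

# Define raw file paths
label_path <- "7_approach_underground_station_7.1/7_approach_underground_station_7.1_labels.json"
calib_path <- "7_approach_underground_station_7.1/calibration.txt"

# Load OpenLABEL annotation file as nested lists
osdar_json <- fromJSON(label_path, simplifyVector = FALSE)

# Load calibration file (warn = FALSE suppresses the incomplete-final-line warning)
calib_lines <- readLines(calib_path, warn = FALSE)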

## Step 2: Parse Calibration Data

The calibration file contains extrinsic parameters for each sensor. Blocks are separated by lines of three or more hash marks (###). For each block we extract the data folder name, the 3D position, and the rotation quaternion, whose components are read in the order w, x, y, z.

# Identify separator lines; each block runs from one separator to the next
block_indices <- which(str_detect(calib_lines, "^#{3,}"))
block_indices <- c(block_indices, length(calib_lines) + 1)

# Parse one block into a one-row tibble (NA where a field is absent)
parse_calib_block <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  folder <- str_match(txt, "data_folder:\\s*(\\S+)")[, 2]
  pos_str <- str_match(txt, "position:\\s*\\[([^\\]]+)\\]")[, 2]
  rot_str <- str_match(txt, "rotation_quaternion.*:\\s*\\[([^\\]]+)\\]")[, 2]
  pos <- if (!is.na(pos_str)) as.numeric(str_split(pos_str, ",")[[1]]) else rep(NA_real_, 3)
  rot <- if (!is.na(rot_str)) as.numeric(str_split(rot_str, ",")[[1]]) else rep(NA_real_, 4)
  tibble(
    folder, pos_x = pos[1], pos_y = pos[2], pos_z = pos[3],
    q_w = rot[1], q_x = rot[2], q_y = rot[3], q_z = rot[4]
  )
}

# Apply the parser to every block, skipping empty spans between
# consecutive separator lines (where start:end would count backwards)
calib_df <- map_dfr(seq_len(length(block_indices) - 1), function(i) {
  start <- block_indices[i] + 1
  end <- block_indices[i + 1] - 1
  if (start > end) return(NULL)
  parse_calib_block(calib_lines[start:end])
}) %>%
  filter(!is.na(folder)) %>%  # drop blocks that name no data folder
  distinct()

head(calib_df)
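
To make the expected input concrete, the following is a minimal base-R sketch of the same extraction applied to a single block. The sample lines are hypothetical: the field layout and the `(w, x, y, z)` label are assumptions for illustration, not copied from the OSDaR23 calibration file, and the helpers `first_capture()` and `to_numbers()` are invented names.

```r
# Hypothetical calibration block: the field layout below is an assumption
# for illustration, not copied from the OSDaR23 calibration file.
sample_block <- c(
  "data_folder: rgb_center",
  "position: [1.5, 0.0, 2.1]",
  "rotation_quaternion (w, x, y, z): [1.0, 0.0, 0.0, 0.0]"
)

# Return capture group 1 of `pattern` in `txt`, or NA if there is no match.
first_capture <- function(txt, pattern) {
  m <- regmatches(txt, regexec(pattern, txt, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Convert "a, b, c" to a numeric vector, or n NAs when the field is missing.
to_numbers <- function(s, n) {
  if (is.na(s)) rep(NA_real_, n) else as.numeric(strsplit(s, ",")[[1]])
}

txt <- paste(sample_block, collapse = "\n")
folder <- first_capture(txt, "data_folder:\\s*(\\S+)")
pos <- to_numbers(first_capture(txt, "position:\\s*\\[([^\\]]+)\\]"), 3)
rot <- to_numbers(first_capture(txt, "rotation_quaternion[^\\[]*\\[([^\\]]+)\\]"), 4)

# One row per sensor, with the same column names as calib_df.
sample_row <- data.frame(
  folder = folder,
  pos_x = pos[1], pos_y = pos[2], pos_z = pos[3],
  q_w = rot[1], q_x = rot[2], q_y = rot[3], q_z = rot[4]
)
print(sample_row)
```

A block that lacks one of the fields yields `NA` in the corresponding columns instead of an error, which keeps incomplete calibration entries visible in later checks.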

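## Step 3: Parse Annotation Data

The JSON annotation file follows the ASAM OpenLABEL format. For each object we extract its type and, from each entry in `object_data_pointers`, the frame intervals. The pointer name is split on `__`: the first part is read as the sensor stream and the second as the annotation type.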

openlabel <- osdar_json$openlabel
objects <- openlabel$objects

# Collect one row per (object, data pointer, frame interval)
ann_rows <- list()
for (id in names(objects)) {
  obj <- objects[[id]]
  obj_type <- obj$type %||% NA_character_
  odp <- obj$object_data_pointers
  if (length(odp) == 0) next  # object has no annotation pointers
  for (pname in names(odp)) {
    # Pointer names are split on "__": first part = stream, second = annotation type
    parts <- str_split(pname, "__")[[1]]
    stream <- parts[1]
    ann_type <- parts[2]
    frames <- odp[[pname]]$frame_intervals
    for (fi in frames) {
      ann_rows[[length(ann_rows) + 1]] <- tibble(
        obj_id = id, obj_type = obj_type, stream = stream, ann_type = ann_type,
        frame_start = fi$frame_start, frame_end = fi$frame_end
      )
    }
  }
}
ann_df <- bind_rows(ann_rows)
head(ann_df)
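
The nested structure that this loop walks can be hard to picture, so here is a minimal base-R sketch on a hand-built miniature of the `objects` list. All ids, pointer names, and frame numbers are hypothetical, and the three-part pointer name (`stream__type__class`) is an assumption; the traversal only relies on the first two parts.

```r
# Hypothetical miniature of the `objects` list; ids, pointer names, and
# frame numbers are invented for illustration.
toy_objects <- list(
  "obj-1" = list(
    type = "person",
    object_data_pointers = list(
      "rgb_center__bbox__person" = list(
        frame_intervals = list(
          list(frame_start = 0, frame_end = 4),
          list(frame_start = 9, frame_end = 9)
        )
      ),
      "lidar__cuboid__person" = list(
        frame_intervals = list(list(frame_start = 0, frame_end = 9))
      )
    )
  )
)

# Flatten to one row per (object, pointer, frame interval).
rows <- list()
for (id in names(toy_objects)) {
  pointers <- toy_objects[[id]]$object_data_pointers
  for (pname in names(pointers)) {
    parts <- strsplit(pname, "__", fixed = TRUE)[[1]]
    for (fi in pointers[[pname]]$frame_intervals) {
      rows[[length(rows) + 1]] <- data.frame(
        obj_id = id, obj_type = toy_objects[[id]]$type,
        stream = parts[1], ann_type = parts[2],
        frame_start = fi$frame_start, frame_end = fi$frame_end
      )
    }
  }
}
toy_df <- do.call(rbind, rows)
print(toy_df)
```

One object therefore contributes several rows: one for each frame interval of each pointer. This matters when the rows are later counted per class.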

## Step 4: Write Cleaned Data and Validation Summary

# Create output folders for the cleaned tables and figures
output_dir <- "cleaned_calibrated_data"
fig_dir <- file.path(output_dir, "figures")
dir_create(output_dir)
dir_create(fig_dir)

# Save the tidy calibration and annotation tables
write_csv(calib_df, file.path(output_dir, "cleaned_calibration.csv"))
write_csv(ann_df, file.path(output_dir, "annotation_intervals.csv"))

# Streams referenced by annotations that have no calibration entry
missing_folders <- setdiff(unique(ann_df$stream), unique(calib_df$folder))
# Calibrated sensors that never appear in the annotations
missing_calib <- setdiff(unique(calib_df$folder), unique(ann_df$stream))

validation_summary <- list(
  missing_streams_in_calibration = missing_folders,
  missing_calibration_folders_in_annotations = missing_calib,
  calibration_rows = nrow(calib_df),
  annotation_rows = nrow(ann_df)
)

write_json(validation_summary, file.path(output_dir, "metadata_validation_summary.json"), pretty = TRUE)
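
Because `setdiff()` is not symmetric, the two checks answer different questions. A minimal sketch with invented names (none of them taken from the dataset) shows the direction of each comparison:

```r
# Hypothetical stream and folder names, invented for illustration.
annotated_streams  <- c("rgb_center", "lidar", "radar")
calibrated_folders <- c("rgb_center", "lidar", "ir_left")

# Streams that carry annotations but have no calibration entry.
uncalibrated <- setdiff(annotated_streams, calibrated_folders)

# Calibrated sensors that never appear in the annotations.
unannotated <- setdiff(calibrated_folders, annotated_streams)

print(uncalibrated)
print(unannotated)
```

Here `uncalibrated` is `"radar"` and `unannotated` is `"ir_left"`. An empty result in both directions means the two metadata sources name exactly the same sensors.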

# Results

This section presents the findings derived from the cleaned calibration and annotation data. Five figures summarize the results.

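## Figure 1: Object Counts by Class

obj_counts <- ann_df %>% count(obj_type, name="n") %>% arrange(desc(n))
ggplot(obj_counts, aes(x=reorder(obj_type,n), y=n)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Figure 1: Object Counts by Class", x="Object Type", y="Annotation Intervals") +
  theme_minimal()

Interpretation: This figure shows how many annotation intervals each object type has. Because `ann_df` holds one row per object, data pointer, and frame interval, an object annotated in several streams is counted more than once, so the bars measure annotation volume rather than the number of distinct objects. Dominant classes reflect the dataset's focus and guide quality assurance priorities.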

## Figure 2: Annotations per Sensor Stream

stream_counts <- ann_df %>% count(stream, name="n")
ggplot(stream_counts, aes(x=reorder(stream,n), y=n)) +
  geom_col(fill="darkorange") +
  coord_flip() +
  labs(title="Figure 2: Annotation Distribution per Sensor Stream",
       x="Sensor Stream", y="Number of Annotations") +
  theme_minimal()

Interpretation: This distribution shows which sensor streams contribute most to the annotations. Sparsely annotated streams may indicate sensors that are missing from, or underused in, the labeling.

## Figure 3: Distribution of Annotation Interval Lengths

intervals <- ann_df %>% mutate(length = frame_end - frame_start + 1)
ggplot(intervals, aes(x=length)) +
  geom_histogram(bins=30, fill="seagreen", color="white") +
  labs(title="Figure 3: Distribution of Annotation Interval Lengths",
       x="Interval Length (Frames)", y="Frequency") +
  theme_minimal()

Interpretation: Short intervals, down to a length of one frame, indicate per-frame annotations, while longer ones capture objects that are tracked persistently across frames.
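
The `+ 1` in the length calculation makes each interval inclusive of both its first and last frame. A minimal sketch with invented frame numbers, using only base R:

```r
# Hypothetical frame intervals, invented for illustration.
toy_intervals <- data.frame(
  frame_start = c(0, 9, 0),
  frame_end   = c(4, 9, 9)
)

# Inclusive length: an interval that starts and ends on frame 9 covers 1 frame.
toy_intervals$length <- toy_intervals$frame_end - toy_intervals$frame_start + 1
print(toy_intervals$length)

# Share of intervals that cover exactly one frame.
single_frame_share <- mean(toy_intervals$length == 1)
print(single_frame_share)
```

The three intervals have lengths 5, 1, and 10, so one in three is a single-frame annotation. Without the `+ 1`, that interval would be reported with length zero.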

## Figure 4: Annotation Types per Object Class

type_class <- ann_df %>% count(obj_type, ann_type)
# Order classes by their total count so the stacked bars are sorted by height
ggplot(type_class, aes(x=reorder(obj_type, n, FUN=sum), y=n, fill=ann_type)) +
  geom_col() + coord_flip() +
  labs(title="Figure 4: Annotation Types by Object Class",
       x="Object Class", y="Count", fill="Annotation Type") +
  theme_minimal()

Interpretation: This figure displays the diversity of annotation geometry types per object class. Mixed types for a class can arise when the class is labeled in several sensor modalities, so they are a prompt to review the labeling rather than proof of inconsistency.
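
One way to probe whether mixed types reflect inconsistency is to count annotation types per class within each sensor stream. The sketch below is an illustrative follow-up check, not part of the original analysis; it uses base R and invented rows, with column names that mirror `ann_df`.

```r
# Hypothetical annotation rows, invented for illustration.
toy_ann <- data.frame(
  obj_type = c("person", "person", "person", "train"),
  stream   = c("rgb_center", "lidar", "rgb_center", "lidar"),
  ann_type = c("bbox", "cuboid", "poly2d", "cuboid")
)

# Count distinct annotation types per (class, stream) pair.
types_per_pair <- aggregate(
  ann_type ~ obj_type + stream, data = toy_ann,
  FUN = function(x) length(unique(x))
)
names(types_per_pair)[3] <- "n_types"

# Pairs with more than one geometry type within a single stream.
mixed_within_stream <- types_per_pair[types_per_pair$n_types > 1, ]
print(mixed_within_stream)
```

In this toy data only `person` in `rgb_center` carries two geometry types. Such pairs are candidates for review, since a labeling scheme may use several geometry types in one stream by design.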

## Figure 5: Sensor Positions (2D Projection)

ggplot(calib_df, aes(x=pos_x, y=pos_y, label=folder)) +
  geom_point(size=3, color="dodgerblue") +
  geom_text(nudge_y=0.05, size=3) +
  labs(title="Figure 5: Sensor Positions (X vs Y Projection)",
       x="Position X (m)", y="Position Y (m)") +
  theme_minimal()

Interpretation: The 2D (X vs Y) projection of the sensor positions in `calib_df` gives a quick plausibility check of the mounting geometry; points far from the main cluster may indicate a parsing error or miscalibration. The plot ignores the Z coordinate and the rotation quaternion, so it cannot confirm sensor alignment on its own.
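
The position plot does not use the rotation columns. A complementary check, offered here as an illustrative sketch and not as part of the original analysis, is that each rotation quaternion has unit norm, since a quaternion that represents a rotation must satisfy w² + x² + y² + z² = 1. The sensor names, values, and tolerance below are invented assumptions; the column names mirror `calib_df`.

```r
# Hypothetical quaternions in (w, x, y, z) order; values invented for illustration.
quats <- data.frame(
  folder = c("sensor_a", "sensor_b"),
  q_w = c(1, 0.7), q_x = c(0, 0.7), q_y = c(0, 0), q_z = c(0, 0.2)
)

# Euclidean norm of each quaternion.
quats$norm <- sqrt(quats$q_w^2 + quats$q_x^2 + quats$q_y^2 + quats$q_z^2)

# Assumed tolerance for rounding in the stored values.
tol <- 1e-3
quats$unit_ok <- abs(quats$norm - 1) < tol
print(quats[, c("folder", "norm", "unit_ok")])
```

`sensor_a` passes, while `sensor_b` has a norm of about 1.01 and is flagged. A flagged row may point to a truncated value or a parsing problem and is worth inspecting against the raw calibration block.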