Preparing GSE55763 Beta Value Datasets for Modeling

This script processes the GSE55763 methylation dataset to create a tidy dataset with samples (GSM IDs) as rows, CpG probes as columns, and associated metadata including sex and age. The resulting dataset is ready for downstream modeling and analysis.

Load metadata

metadata_GSE55763 <- read.csv("GSE55763_metadata_clean.csv", header = TRUE, stringsAsFactors = FALSE)

# Display the first 3 rows
head(metadata_GSE55763, 3)

# Check dimensions (rows = samples, columns = metadata fields)
dim(metadata_GSE55763)

## [1] 2664    5

The array_to_GSM_mapping_clean.csv file maps each array ID to its corresponding GSM sample ID.

array_to_GSM_mapping <- read.csv("array_to_GSM_mapping_clean.csv", header = TRUE, stringsAsFactors = FALSE)
head(array_to_GSM_mapping)

# Check dimensions (one row per array, two columns: Array_ID and GSM_ID)
dim(array_to_GSM_mapping)

## [1] 2664    2

The GSE55763_top500_all_samples.csv file contains DNA methylation beta values for the 500 most variable CpG sites. Unlike previous subsets, it covers all available samples: the table has 500 rows (one per CpG probe) and 5,423 columns, one of which is the ID_REF probe identifier.

betas_sub500 <- read.csv("GSE55763_top500_all_samples.csv", header = TRUE, stringsAsFactors = FALSE)
head(betas_sub500)

dim(betas_sub500)

## [1]  500 5423

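Reshape the filtered beta table (betas_sub500_filtered) from wide to long format: every column except ID_REF is turned into two columns, Array_ID (the former column names) and beta (the values). An inner join on Array_ID then adds the matching GSM_ID and drops any array that has no mapping.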

## [1] 2664
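The reshaping code uses an object named betas_sub500_filtered whose definition does not appear in this report. A minimal sketch of one way it could be built, assuming the filter keeps ID_REF plus only those columns whose names occur in the Array_ID column of the mapping. The toy data frames, array IDs, and the names betas_toy, mapping_toy, and keep_cols are hypothetical stand-ins, not the author's code:

```r
# Toy stand-ins for betas_sub500 and array_to_GSM_mapping (hypothetical values)
betas_toy <- data.frame(
  ID_REF = c("cg0001", "cg0002"),
  A1 = c(0.10, 0.80),
  A2 = c(0.20, 0.70),
  A3 = c(0.30, 0.60),          # a column with no GSM mapping
  stringsAsFactors = FALSE
)
mapping_toy <- data.frame(
  Array_ID = c("A1", "A2"),
  GSM_ID = c("GSM1", "GSM2"),
  stringsAsFactors = FALSE
)

# Assumed filter: keep ID_REF and the sample columns that have a GSM mapping
keep_cols <- intersect(colnames(betas_toy), mapping_toy$Array_ID)
length(keep_cols)  # number of mapped sample columns

betas_toy_filtered <- betas_toy[, c("ID_REF", keep_cols), drop = FALSE]
colnames(betas_toy_filtered)
```

One point to keep in mind with this approach: read.csv with its default check.names = TRUE prefixes column names that start with a digit with "X", so the Array_ID values in the mapping have to match the column names as they were actually read.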

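library(dplyr)   # %>%, inner_join(), select(), filter()
library(tidyr)   # pivot_longer(), pivot_wider()

betas_long <- betas_sub500_filtered %>%
  pivot_longer(
    cols = -ID_REF,          # every sample column
    names_to = "Array_ID",
    values_to = "beta"
  ) %>%
  inner_join(array_to_GSM_mapping, by = "Array_ID")  # add GSM_ID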

Pivot back to wide format with one row per GSM_ID and one column per CpG probe

betas_wide <- betas_long %>%
  select(GSM_ID, ID_REF, beta) %>%
  pivot_wider(
    id_cols = GSM_ID,
    names_from = ID_REF,
    values_from = beta
  )

dim(betas_wide)

## [1] 2664  501

head(betas_wide[,1:10])
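The reshape, join, and pivot steps above amount to transposing the beta table while renaming arrays to GSM IDs. A small sketch on toy data makes the change of shape easy to verify by eye. All values and the names betas_toy, mapping_toy, long_toy, and wide_toy are hypothetical:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy inputs: 2 probes x 2 arrays
betas_toy <- data.frame(
  ID_REF = c("cg0001", "cg0002"),
  A1 = c(0.10, 0.80),
  A2 = c(0.20, 0.70),
  stringsAsFactors = FALSE
)
mapping_toy <- data.frame(
  Array_ID = c("A1", "A2"),
  GSM_ID = c("GSM1", "GSM2"),
  stringsAsFactors = FALSE
)

# Wide (probes x arrays) -> long, then attach GSM_ID
long_toy <- betas_toy %>%
  pivot_longer(cols = -ID_REF, names_to = "Array_ID", values_to = "beta") %>%
  inner_join(mapping_toy, by = "Array_ID")

# Long -> wide (samples x probes)
wide_toy <- long_toy %>%
  select(GSM_ID, ID_REF, beta) %>%
  pivot_wider(id_cols = GSM_ID, names_from = ID_REF, values_from = beta)

dim(wide_toy)  # 2 samples; 1 ID column + 2 probe columns
```

The same arithmetic explains the real output: 2664 mapped samples as rows, and 500 probe columns plus the GSM_ID column, giving 501 columns.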

The beta values were first reshaped from wide to long format and annotated with GSM IDs, then pivoted back to wide format so that each sample (GSM_ID) is a row and each CpG probe is a column. This wide, sample-by-probe table is then merged with the sample metadata using the GSM ID as the key, combining methylation measurements with the relevant annotations. After the dimensions are verified and the first rows inspected, the final merged dataset is saved as a CSV file for downstream analyses or machine learning applications.

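# Equivalent form of the pivot above: once Array_ID is dropped,
# GSM_ID is the only remaining identifier column
betas_wide <- betas_long %>%
  select(-Array_ID) %>%
  pivot_wider(
    names_from = ID_REF,
    values_from = beta
  )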

Merge with metadata

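# Keep only metadata rows for samples present in the beta table
# (column X holds the GSM IDs)
metadata_filtered <- metadata_GSE55763 %>%
  filter(X %in% betas_wide$GSM_ID)

# Join metadata with beta values, then drop the tissue and dataset columns
merged_data <- metadata_filtered %>%
  inner_join(betas_wide, by = c("X" = "GSM_ID")) %>%
  select(-tissue, -dataset)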

dim(merged_data)

## [1] 2664  503

head(merged_data,3)
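Beyond checking dimensions, a few integrity checks on the merged table can catch silent problems such as duplicated sample IDs or out-of-range beta values (beta values lie between 0 and 1). The sketch below is illustrative only: check_merged is a hypothetical helper, and it assumes that probe columns can be recognized by a "cg" name prefix, which may not hold for every probe on the array:

```r
# Hypothetical helper: basic integrity checks on a samples x features table
check_merged <- function(df, id_col, probe_prefix = "cg") {
  # Assumption: probe columns are those whose names start with probe_prefix
  probe_cols <- grep(paste0("^", probe_prefix), colnames(df), value = TRUE)
  betas <- as.matrix(df[, probe_cols, drop = FALSE])
  list(
    unique_ids = !anyDuplicated(df[[id_col]]),            # no repeated sample IDs
    n_probes   = length(probe_cols),                      # probe columns found
    n_missing  = sum(is.na(betas)),                       # missing beta values
    in_range   = all(betas >= 0 & betas <= 1, na.rm = TRUE)
  )
}

# Toy table standing in for merged_data (hypothetical values)
merged_toy <- data.frame(
  X = c("GSM1", "GSM2"),
  sex = c("female", "male"),
  age = c(40, 55),
  cg0001 = c(0.10, 0.20),
  cg0002 = c(0.80, NA),
  stringsAsFactors = FALSE
)

res <- check_merged(merged_toy, id_col = "X")
str(res)
```

On the real data the call would be check_merged(merged_data, "X"), with the same caveat about the prefix assumption.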

Save the dataset in CSV format

write.csv(merged_data, "GSE55763_merged.csv", row.names = FALSE)
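As an optional check, the saved file can be read back to confirm that the shape and column names survive the round trip. This is a sketch on a hypothetical toy table written to a temporary file, not part of the original script:

```r
# Hypothetical toy table standing in for merged_data
merged_toy <- data.frame(
  X = c("GSM1", "GSM2"),
  sex = c("female", "male"),
  age = c(40, 55),
  cg0001 = c(0.10, 0.20),
  stringsAsFactors = FALSE
)

# Write to a temporary file so no real output is overwritten
out_file <- tempfile(fileext = ".csv")
write.csv(merged_toy, out_file, row.names = FALSE)

# Read the file back and compare shape and column names
merged_back <- read.csv(out_file, header = TRUE, stringsAsFactors = FALSE)
same_shape <- identical(dim(merged_back), dim(merged_toy))
same_names <- identical(colnames(merged_back), colnames(merged_toy))
```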


Magalí Eisik