This script processes the GSE55763 methylation dataset to create a tidy dataset with samples (GSM IDs) as rows, CpG probes as columns, and associated metadata including sex and age. The resulting dataset is ready for downstream modeling and analysis.
Load metadata
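library(dplyr)  # %>%, select(), filter(), inner_join()
library(tidyr)  # pivot_longer(), pivot_wider()
metadata_GSE55763 <- read.csv("GSE55763_metadata_clean.csv", header = TRUE, stringsAsFactors = FALSE)
# Display first 3 rows
head(metadata_GSE55763, 3)
# Check dimensions (samples x metadata columns)
dim(metadata_GSE55763)
## [1] 2664 5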
This dataset contains the mapping between array IDs and their corresponding GSM sample IDs.
array_to_GSM_mapping <- read.csv("array_to_GSM_mapping_clean.csv", header = TRUE, stringsAsFactors = FALSE)
head(array_to_GSM_mapping)
# Check dimensions (arrays x mapping columns)
dim(array_to_GSM_mapping)
## [1] 2664 2
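Because the mapping table is later used as a join key, it can help to confirm that it is one-to-one. The following is a minimal sketch on toy data, not part of the original script; `mapping_toy` and its values are hypothetical stand-ins for `array_to_GSM_mapping`.

```r
# Toy stand-in for array_to_GSM_mapping (hypothetical values)
mapping_toy <- data.frame(
  Array_ID = c("A1", "A2", "A3"),
  GSM_ID   = c("GSM1", "GSM2", "GSM3"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first duplicate
dup_arrays <- anyDuplicated(mapping_toy$Array_ID)
dup_gsm    <- anyDuplicated(mapping_toy$GSM_ID)

# Stop early if either column contains repeated IDs
stopifnot(dup_arrays == 0, dup_gsm == 0)
```

A repeated Array_ID would multiply rows in a join on that column, and a repeated GSM_ID would leave more than one beta value per sample and probe when reshaping to wide format.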
The GSE55763_top500_all_samples.csv file contains DNA methylation beta values for the 500 most variable CpG sites. Unlike previous subsets, it covers all available samples rather than a selection of them. The table has 500 rows (one per CpG probe) and 5,423 columns, one of which is the probe identifier ID_REF, so it provides a comprehensive set for downstream analysis and modeling.
betas_sub500 <- read.csv("GSE55763_top500_all_samples.csv", header = TRUE, stringsAsFactors = FALSE)
head(betas_sub500)
dim(betas_sub500)
## [1] 500 5423
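Reshape the filtered beta table (betas_sub500_filtered) from wide to long format: every column except ID_REF is gathered into two columns, Array_ID (the former column names) and beta (the values). The GSM_ID of each array is then added by joining with array_to_GSM_mapping.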
## [1] 2664
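The pipeline that follows reads `betas_sub500_filtered`, whose definition does not appear in this section. One plausible way to build such an object, offered as a minimal sketch and not as the original code, is to keep ID_REF plus the array columns that are listed in the mapping table. All names ending in `_toy` and the toy values are hypothetical.

```r
# Toy stand-ins for betas_sub500 and array_to_GSM_mapping (hypothetical values)
betas_toy <- data.frame(
  ID_REF = c("cg0001", "cg0002"),
  A1 = c(0.10, 0.80),
  A2 = c(0.20, 0.70),
  A3 = c(0.30, 0.60),
  stringsAsFactors = FALSE
)
mapping_toy <- data.frame(
  Array_ID = c("A1", "A3"),
  GSM_ID   = c("GSM1", "GSM3"),
  stringsAsFactors = FALSE
)

# Array columns that also appear in the mapping table
common_arrays <- intersect(colnames(betas_toy), mapping_toy$Array_ID)
length(common_arrays)

# Keep the probe ID column plus the mapped array columns only
betas_toy_filtered <- betas_toy[, c("ID_REF", common_arrays), drop = FALSE]
dim(betas_toy_filtered)
```

One thing to watch with real data: `read.csv()` with its default `check.names = TRUE` prepends an `X` to column names that start with a digit, so the Array_ID values in the mapping must use the same form as the column names for the intersection to find them.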
betas_long <- betas_sub500_filtered %>%
  pivot_longer(
    cols = -ID_REF,
    names_to = "Array_ID",
    values_to = "beta"
  ) %>%
  inner_join(array_to_GSM_mapping, by = "Array_ID") # add GSM_ID
Reshape back to wide format (GSM IDs as rows, CpG probes as columns)
betas_wide <- betas_long %>%
  select(GSM_ID, ID_REF, beta) %>%
  pivot_wider(
    id_cols = GSM_ID,
    names_from = ID_REF,
    values_from = beta
  )
dim(betas_wide)
## [1] 2664 501
head(betas_wide[,1:10])
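The reshape from probes-as-rows to samples-as-rows can be followed end to end on a tiny table. This is a sketch on made-up data that mirrors the calls used above; the `_toy` objects and their values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy beta table: probes as rows, arrays as columns (hypothetical values)
betas_toy <- data.frame(
  ID_REF = c("cg0001", "cg0002"),
  A1 = c(0.10, 0.80),
  A2 = c(0.20, 0.70),
  stringsAsFactors = FALSE
)
mapping_toy <- data.frame(
  Array_ID = c("A1", "A2"),
  GSM_ID   = c("GSM1", "GSM2"),
  stringsAsFactors = FALSE
)

# Wide -> long: one row per (probe, array) pair, then attach the GSM ID
long_toy <- betas_toy %>%
  pivot_longer(cols = -ID_REF, names_to = "Array_ID", values_to = "beta") %>%
  inner_join(mapping_toy, by = "Array_ID")

# Long -> wide: one row per sample, one column per probe
wide_toy <- long_toy %>%
  select(GSM_ID, ID_REF, beta) %>%
  pivot_wider(id_cols = GSM_ID, names_from = ID_REF, values_from = beta)

dim(long_toy)  # 2 probes x 2 arrays = 4 rows
dim(wide_toy)  # 2 samples, GSM_ID plus 2 probe columns
```

The same arithmetic explains the real output: 2664 samples as rows, and 500 probe columns plus GSM_ID giving 501 columns.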
The beta values were reshaped from long back to wide format, making each sample (GSM_ID) a row and each CpG probe a column, which prepares the data for analysis. The wide-format beta data was then merged with the sample metadata using the GSM ID as the key (column X in the metadata, GSM_ID in the beta table), combining methylation measurements with the sample annotations. After verifying the dimensions and inspecting the first rows to ensure correctness, the final merged dataset was saved as a CSV file for downstream analyses or machine learning applications.
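# Equivalent to the reshape above: once Array_ID is dropped,
# GSM_ID is the only remaining identifier column
betas_wide <- betas_long %>%
  select(-Array_ID) %>%
  pivot_wider(
    names_from = ID_REF,
    values_from = beta
  )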
Merge with metadata
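# Keep only metadata rows whose GSM ID (column X) has methylation data
metadata_filtered <- metadata_GSE55763 %>%
  filter(X %in% betas_wide$GSM_ID)
# Join metadata and beta values on the GSM ID
merged_data <- metadata_filtered %>%
  inner_join(betas_wide, by = c("X" = "GSM_ID"))
# Drop metadata columns not needed downstream
merged_data <- merged_data %>%
  select(-tissue, -dataset)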
dim(merged_data)
## [1] 2664 503
head(merged_data,3)
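Beyond checking `dim()`, a few explicit assertions can confirm that the merge neither dropped nor duplicated samples. This is a minimal sketch on toy data, not part of the original script; the `_toy` objects and the `sex` and `age` column names are assumptions for illustration.

```r
library(dplyr)

# Toy metadata and wide beta table (hypothetical values and column names)
metadata_toy <- data.frame(
  X   = c("GSM1", "GSM2", "GSM3"),
  sex = c("F", "M", "F"),
  age = c(41, 57, 63),
  stringsAsFactors = FALSE
)
wide_toy <- data.frame(
  GSM_ID = c("GSM1", "GSM2"),
  cg0001 = c(0.10, 0.20),
  cg0002 = c(0.80, 0.70),
  stringsAsFactors = FALSE
)

merged_toy <- inner_join(metadata_toy, wide_toy, by = c("X" = "GSM_ID"))

# Samples with beta values but no metadata row (should be none)
unmatched <- anti_join(wide_toy, metadata_toy, by = c("GSM_ID" = "X"))

# Keys must be unique, every sample must survive the join, and the
# column count must be metadata + betas minus the shared key column
stopifnot(anyDuplicated(metadata_toy$X) == 0)
stopifnot(nrow(unmatched) == 0)
stopifnot(nrow(merged_toy) == nrow(wide_toy))
stopifnot(ncol(merged_toy) == ncol(metadata_toy) + ncol(wide_toy) - 1)
```

Applied to the real objects, the same column arithmetic gives 5 + 501 - 1 = 505 columns after the join, and 503 once tissue and dataset are removed, which matches the reported dimensions.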
Save the merged dataset in CSV format
write.csv(merged_data, "GSE55763_merged.csv", row.names = FALSE)