Goal: Load the UCSC humanHprdP2P interactions and map query/target transcript IDs to gene symbols using kgXref with two left-joins.
library(tibble)
# Load Data into R
HprdP2P = read.table("humanHprdP2P", sep = "\t", as.is = TRUE, comment = "", header = TRUE)
kgx = read.table("kgXref", sep = "\t", as.is = TRUE, comment = "", header = TRUE, quote = "")
as_tibble(kgx[1:5, 1:8])
## # A tibble: 5 × 8
## X.kgID mRNA spID spDisplayID geneSymbol refseq protAcc description
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ENST00000000233… NM_0… P840… ARF5_HUMAN ARF5 NM_00… NM_001… ADP ribosy…
## 2 ENST00000000412… NM_0… P206… MPRD_HUMAN M6PR NM_00… NM_002… mannose-6-…
## 3 ENST00000000442… NM_0… Q569… Q569H8_HUM… ESRRA NM_00… NM_004… estrogen r…
## 4 ENST00000001008… NM_0… Q027… FKBP4_HUMAN FKBP4 NM_00… NM_002… FKBP proly…
## 5 ENST00000001146… NM_0… Q9NR… CP26B_HUMAN CYP26B1 NM_01… NM_019… cytochrome…
as_tibble(head(HprdP2P, 10))
## # A tibble: 10 × 3
## X.query target distance
## <chr> <chr> <int>
## 1 ENST00000406381.6 ENST00000406381.6 1
## 2 ENST00000406381.6 ENST00000541774.5 1
## 3 ENST00000406381.6 ENST00000584601.5 1
## 4 ENST00000406381.6 ENST00000269571.10 1
## 5 ENST00000541774.5 ENST00000406381.6 1
## 6 ENST00000541774.5 ENST00000541774.5 1
## 7 ENST00000541774.5 ENST00000584601.5 1
## 8 ENST00000541774.5 ENST00000269571.10 1
## 9 ENST00000584601.5 ENST00000406381.6 1
## 10 ENST00000584601.5 ENST00000541774.5 1
# Normalization
if ("X.query" %in% names(HprdP2P)) names(HprdP2P)[names(HprdP2P) == "X.query"] <- "query"
if ("X.kgID" %in% names(kgx)) names(kgx)[names(kgx) == "X.kgID"] <- "kgID"
cols <- intersect(c("kgID","geneSymbol","description"), names(kgx))
# Merge Data (two left-joins)
m <- merge(HprdP2P, kgx[, cols], by.x = "query", by.y = "kgID", all.x = TRUE)
m <- merge(m, kgx[, cols], by.x = "target", by.y = "kgID", all.x = TRUE,
suffixes = c("_query", "_target"))
# Sanity Check
as_tibble(head(m, 10))
## # A tibble: 10 × 7
## target query distance geneSymbol_query description_query geneSymbol_target
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 ENST0000… ENST… 2 ARRB2 arrestin beta 2,… ARF5
## 2 ENST0000… ENST… 2 KIF13A kinesin family m… ARF5
## 3 ENST0000… ENST… 2 RAB11A RAB11A, member R… ARF5
## 4 ENST0000… ENST… 2 DTNBP1 dystrobrevin bin… ARF5
## 5 ENST0000… ENST… 2 ARF1 ADP ribosylation… ARF5
## 6 ENST0000… ENST… 2 RAB11FIP4 RAB11 family int… ARF5
## 7 ENST0000… ENST… 1 PIP5K1A phosphatidylinos… ARF5
## 8 ENST0000… ENST… 2 ITGB3BP integrin subunit… ARF5
## 9 ENST0000… ENST… 1 ARFIP2 Plays a role in … ARF5
## 10 ENST0000… ENST… 2 BUB1B Essential compon… ARF5
## # ℹ 1 more variable: description_target <chr>
as_tibble(head(HprdP2P, 10))
## # A tibble: 10 × 3
## query target distance
## <chr> <chr> <int>
## 1 ENST00000406381.6 ENST00000406381.6 1
## 2 ENST00000406381.6 ENST00000541774.5 1
## 3 ENST00000406381.6 ENST00000584601.5 1
## 4 ENST00000406381.6 ENST00000269571.10 1
## 5 ENST00000541774.5 ENST00000406381.6 1
## 6 ENST00000541774.5 ENST00000541774.5 1
## 7 ENST00000541774.5 ENST00000584601.5 1
## 8 ENST00000541774.5 ENST00000269571.10 1
## 9 ENST00000584601.5 ENST00000406381.6 1
## 10 ENST00000584601.5 ENST00000541774.5 1
as_tibble(head(kgx, 10))
## # A tibble: 10 × 10
## kgID mRNA spID spDisplayID geneSymbol refseq protAcc description rfamAcc
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 ENST00… NM_0… P840… ARF5_HUMAN ARF5 NM_00… NM_001… ADP ribosy… NA
## 2 ENST00… NM_0… P206… MPRD_HUMAN M6PR NM_00… NM_002… mannose-6-… NA
## 3 ENST00… NM_0… Q569… Q569H8_HUM… ESRRA NM_00… NM_004… estrogen r… NA
## 4 ENST00… NM_0… Q027… FKBP4_HUMAN FKBP4 NM_00… NM_002… FKBP proly… NA
## 5 ENST00… NM_0… Q9NR… CP26B_HUMAN CYP26B1 NM_01… NM_019… cytochrome… NA
## 6 ENST00… NR_1… Q7L5… NDUF7_HUMAN NDUFAF7 NR_14… NR_146… NADH:ubiqu… NA
## 7 ENST00… NM_0… Q9BT… FUCO2_HUMAN FUCA2 NM_03… NM_032… alpha-L-fu… NA
## 8 ENST00… NM_0… Q9H9… DBND1_HUMAN DBNDD1 NM_00… NM_001… dysbindin … NA
## 9 ENST00… NM_0… O147… HS3S1_HUMAN HS3ST1 NM_00… NM_005… heparan su… NA
## 10 ENST00… NM_0… Q132… SEM3F_HUMAN SEMA3F NM_00… NM_004… semaphorin… NA
## # ℹ 1 more variable: tRnaName <lgl>
Description: For Question 1, I load two UCSC tables: humanHprdP2P and kgXref. I look at the first few rows to make sure the columns look right. I fix column names so the keys match. From there, I keep only the needed lookup fields. Then I do two left joins: first match query to kgID to add the query gene symbol, then match target to kgID to add the target gene symbol. Left joins keep every interaction row; if an ID does not map, the new symbol is NA. I print the top rows and check the row count to confirm nothing unexpected happened.
Ways to merge in R: Option 1: Base merge() (built-in): inner join (only matches), left join (keep all from the left), full join (keep all from both). Option 2: dplyr joins (left_join, inner_join, full_join): very readable pipelines, needs the package. Option 3: data.table joins: fastest on big data, different syntax. Option 4: match() / named vector: simplest for one-to-one lookups (add a column by key), can not duplicate rows and preserves order. Always try to use unique keys in the lookup and, after a join, check the row count and a few example mappings.
Goal: Download one GEO dataset and map ENTREZ → SYMBOL/GENENAME from org.Hs.eg.db (without duplicating rows!).
library(GEOquery)
library(AnnotationDbi)
library(org.Hs.eg.db)
#Load GSE Data
gse_list <- getGEO("GSE68849", GSEMatrix = TRUE)
eset <- gse_list[[1]]
#Pull matrices
expr <- Biobase::exprs(eset)
feat <- Biobase::fData(eset)
#Sanity Check
dim(expr)
## [1] 47321 10
nrow(feat)
## [1] 47321
#Load ENTREZID
cand_entrez <- c("ENTREZ_GENE_ID","Entrez_Gene_ID","ENTREZID","GeneID","Gene.ID","Gene ID","entrez_gene_id")
ent_col <- intersect(cand_entrez, colnames(feat))[1]
if (!is.na(ent_col)) {
ENTREZID <- as.character(feat[[ent_col]])
# keep first token if cells have "12345;67890"
ENTREZID <- sub("^\\s*([^;|/, ]+).*", "\\1", ENTREZID)
ENTREZID[ENTREZID == ""] <- NA
} else {
ENTREZID <- rep(NA_character_, nrow(feat))
}
#Clean ENTREZ IDs
ENTREZID <- sub("^\\s*([^;|/, ]+).*", "\\1", ENTREZID)
ENTREZID[ENTREZID == ""] <- NA
#Map annotations
ok <- !is.na(ENTREZID)
SYMBOL <- rep(NA_character_, length(ENTREZID))
GENENAME <- rep(NA_character_, length(ENTREZID))
SYMBOL[ok] <- AnnotationDbi::mapIds(org.Hs.eg.db,
keys = ENTREZID[ok],
keytype = "ENTREZID",
column = "SYMBOL",
multiVals = "first")
GENENAME[ok] <- AnnotationDbi::mapIds(org.Hs.eg.db,
keys = ENTREZID[ok],
keytype = "ENTREZID",
column = "GENENAME",
multiVals = "first")
#Build a row-aligned annotation table
annot <- data.frame(
ENTREZID = ENTREZID,
SYMBOL = SYMBOL,
GENENAME = GENENAME,
row.names = rownames(expr),
stringsAsFactors = FALSE
)
#Print Dimensions
cat("expr dims: ", paste(dim(expr), collapse = " x "), "\n", sep = "")
## expr dims: 47321 x 10
cat("annot rows: ", nrow(annot), "\n", sep = "")
## annot rows: 47321
#Sanity Check
head(annot, 10)
## ENTREZID SYMBOL
## ILMN_1343291 1915 EEF1A1
## ILMN_1343295 2597 GAPDH
## ILMN_1651199 643334 <NA>
## ILMN_1651209 9906 SLC35E2A
## ILMN_1651210 56940 DUSP22
## ILMN_1651221 642820 <NA>
## ILMN_1651228 6234 RPS28
## ILMN_1651229 9670 IPO13
## ILMN_1651230 360226 PRSS41
## ILMN_1651232 653113 FAM86FP
## GENENAME
## ILMN_1343291 eukaryotic translation elongation factor 1 alpha 1
## ILMN_1343295 glyceraldehyde-3-phosphate dehydrogenase
## ILMN_1651199 <NA>
## ILMN_1651209 solute carrier family 35 member E2A (pseudogene)
## ILMN_1651210 dual specificity phosphatase 22
## ILMN_1651221 <NA>
## ILMN_1651228 ribosomal protein S28
## ILMN_1651229 importin 13
## ILMN_1651230 serine protease 41
## ILMN_1651232 family with sequence similarity 86, member A pseudogene
Goal: Create a SummarizedExperiment with assay = expr, rowData = annot (from Q2), and colData = sample metadata from GEO.
library(SummarizedExperiment)
#Normalization
meta <- Biobase::pData(eset)
stopifnot(nrow(annot) == nrow(expr))
annot <- annot[rownames(expr), , drop = FALSE]
stopifnot(nrow(meta) >= ncol(expr))
meta <- meta[colnames(expr), , drop = FALSE]
#SummarizedExperiment
se <- SummarizedExperiment::SummarizedExperiment(
assays = list(expr = expr),
rowData = S4Vectors::DataFrame(annot),
colData = S4Vectors::DataFrame(meta)
)
#Print Data
rowData(se)[1:6, ]
## DataFrame with 6 rows and 3 columns
## ENTREZID SYMBOL GENENAME
## <character> <character> <character>
## ILMN_1343291 1915 EEF1A1 eukaryotic translati..
## ILMN_1343295 2597 GAPDH glyceraldehyde-3-pho..
## ILMN_1651199 643334 NA NA
## ILMN_1651209 9906 SLC35E2A solute carrier famil..
## ILMN_1651210 56940 DUSP22 dual specificity pho..
## ILMN_1651221 642820 NA NA
colData(se)[, 1:min(6, ncol(colData(se)))]
## DataFrame with 10 rows and 6 columns
## title geo_accession status
## <character> <character> <character>
## GSM1684095 Donor 1 - No virus c.. GSM1684095 Public on Feb 01 2016
## GSM1684096 Donor 1 - Influenza .. GSM1684096 Public on Feb 01 2016
## GSM1684097 Donor 2 - No virus c.. GSM1684097 Public on Feb 01 2016
## GSM1684098 Donor 2 - Influenza .. GSM1684098 Public on Feb 01 2016
## GSM1684099 Donor 3 - No virus c.. GSM1684099 Public on Feb 01 2016
## GSM1684100 Donor 3 - Influenza .. GSM1684100 Public on Feb 01 2016
## GSM1684101 Donor 4 - No virus c.. GSM1684101 Public on Feb 01 2016
## GSM1684102 Donor 4 - Influenza .. GSM1684102 Public on Feb 01 2016
## GSM1684103 Donor 5 - No virus c.. GSM1684103 Public on Feb 01 2016
## GSM1684104 Donor 5 - Influenza .. GSM1684104 Public on Feb 01 2016
## submission_date last_update_date type
## <character> <character> <character>
## GSM1684095 May 13 2015 Feb 01 2016 RNA
## GSM1684096 May 13 2015 Feb 01 2016 RNA
## GSM1684097 May 13 2015 Feb 01 2016 RNA
## GSM1684098 May 13 2015 Feb 01 2016 RNA
## GSM1684099 May 13 2015 Feb 01 2016 RNA
## GSM1684100 May 13 2015 Feb 01 2016 RNA
## GSM1684101 May 13 2015 Feb 01 2016 RNA
## GSM1684102 May 13 2015 Feb 01 2016 RNA
## GSM1684103 May 13 2015 Feb 01 2016 RNA
## GSM1684104 May 13 2015 Feb 01 2016 RNA
assays(se)$expr[1:6, 1:6]
## GSM1684095 GSM1684096 GSM1684097 GSM1684098 GSM1684099
## ILMN_1343291 23599.32000 22303.32000 25980.17000 24776.12000 24776.12000
## ILMN_1343295 6405.33500 7732.65200 6553.47900 5296.74600 7699.21900
## ILMN_1651199 95.19509 106.23970 97.54575 94.69421 106.28950
## ILMN_1651209 95.65199 97.22574 105.75440 100.54720 119.32600
## ILMN_1651210 101.82980 97.09778 125.64750 102.29230 99.58566
## ILMN_1651221 111.78650 104.71960 109.89840 102.23130 102.76570
## GSM1684100
## ILMN_1343291 26775.3400
## ILMN_1343295 6430.1560
## ILMN_1651199 108.2523
## ILMN_1651209 109.0526
## ILMN_1651210 119.7902
## ILMN_1651221 101.0684
Goal: Print simple column-wise and row-wise stats, plus two concise plots.
#Pull matrix
expr <- assays(se)$expr
expr_log <- log2(expr + 1)
#Column Statistics
col_mean <- colMeans(expr, na.rm = TRUE)
col_median <- apply(expr, 2, median, na.rm = TRUE)
col_sd <- apply(expr, 2, sd, na.rm = TRUE)
col_stats <- data.frame(
sample = colnames(expr),
mean = round(col_mean, 2),
median = round(col_median, 2),
sd = round(col_sd, 2),
row.names = NULL
)
col_stats # prints a table of sample summaries
## sample mean median sd
## 1 GSM1684095 346.25 110.04 1339.12
## 2 GSM1684096 346.23 110.02 1339.12
## 3 GSM1684097 346.25 110.03 1339.12
## 4 GSM1684098 346.24 110.04 1339.12
## 5 GSM1684099 346.25 110.04 1339.12
## 6 GSM1684100 346.25 110.06 1339.12
## 7 GSM1684101 346.27 110.08 1339.12
## 8 GSM1684102 346.27 110.09 1339.12
## 9 GSM1684103 346.26 110.06 1339.12
## 10 GSM1684104 346.27 110.07 1339.12
#Row Statistics
row_mean <- rowMeans(expr, na.rm = TRUE)
row_sd <- apply(expr, 1, sd, na.rm = TRUE)
summary(row_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 76.83 100.23 109.64 346.25 140.90 25054.54
summary(row_sd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.534 7.443 10.140 75.674 20.420 11184.610
#Visuals
#Boxplots per sample
op <- par(mar = c(9,4,2,1))
boxplot(expr_log,
outline = FALSE, las = 2,
main = "Per-sample expression (log2 scale)",
ylab = "log2(expression + 1)")
par(op)
#Sample–sample correlation heatmap
cors <- cor(expr_log, use = "pairwise.complete.obs")
heatmap(cors, symm = TRUE,
main = "Sample correlation (log2 data)")
Description: For Question 4, the code first pulls the expression matrix from the SummarizedExperiment and creates a log2(+1) version so the values are on a stable, comparable scale. It then calculates simple column-wise summaries—mean, median, and standard deviation—for each sample and prints them to a table to check if any samples look different. Next, it computes row-wise summaries—the mean and standard deviation for each feature across samples—and prints compact summaries so you can see the overall spread of gene/probe values. Finally, it makes two quick visual checks: a log2 boxplot that compares the per-sample distributions and a correlation heatmap of the samples.
The per-sample summary table (means/medians/SDs) is similar across all samples, suggesting no obvious outliers. The log2 boxplot shows comparable medians and spreads, which supports good, consistent preprocessing. The sample–sample correlation heatmap shows high correlations, indicating the samples agree well overall. Row-wise summaries show a typical pattern for expression data.