RBIF 112 Homework 1

Question 1 — Map HPRD interactions to gene symbols

Goal: Load the UCSC humanHprdP2P interactions and map query/target transcript IDs to gene symbols using kgXref with two left-joins.

library(tibble)

# Load Data into R
HprdP2P = read.table("humanHprdP2P", sep = "\t", as.is = TRUE, comment = "", header = TRUE)
kgx     = read.table("kgXref",      sep = "\t", as.is = TRUE, comment = "", header = TRUE, quote = "")

as_tibble(kgx[1:5, 1:8])

## # A tibble: 5 × 8
##   X.kgID           mRNA  spID  spDisplayID geneSymbol refseq protAcc description
##   <chr>            <chr> <chr> <chr>       <chr>      <chr>  <chr>   <chr>      
## 1 ENST00000000233… NM_0… P840… ARF5_HUMAN  ARF5       NM_00… NM_001… ADP ribosy…
## 2 ENST00000000412… NM_0… P206… MPRD_HUMAN  M6PR       NM_00… NM_002… mannose-6-…
## 3 ENST00000000442… NM_0… Q569… Q569H8_HUM… ESRRA      NM_00… NM_004… estrogen r…
## 4 ENST00000001008… NM_0… Q027… FKBP4_HUMAN FKBP4      NM_00… NM_002… FKBP proly…
## 5 ENST00000001146… NM_0… Q9NR… CP26B_HUMAN CYP26B1    NM_01… NM_019… cytochrome…

as_tibble(head(HprdP2P, 10))

## # A tibble: 10 × 3
##    X.query           target             distance
##    <chr>             <chr>                 <int>
##  1 ENST00000406381.6 ENST00000406381.6         1
##  2 ENST00000406381.6 ENST00000541774.5         1
##  3 ENST00000406381.6 ENST00000584601.5         1
##  4 ENST00000406381.6 ENST00000269571.10        1
##  5 ENST00000541774.5 ENST00000406381.6         1
##  6 ENST00000541774.5 ENST00000541774.5         1
##  7 ENST00000541774.5 ENST00000584601.5         1
##  8 ENST00000541774.5 ENST00000269571.10        1
##  9 ENST00000584601.5 ENST00000406381.6         1
## 10 ENST00000584601.5 ENST00000541774.5         1

# Normalization
if ("X.query" %in% names(HprdP2P)) names(HprdP2P)[names(HprdP2P) == "X.query"] <- "query"
if ("X.kgID"  %in% names(kgx))     names(kgx)[names(kgx) == "X.kgID"]         <- "kgID"
cols <- intersect(c("kgID","geneSymbol","description"), names(kgx))

# Merge Data (two left-joins)
m <- merge(HprdP2P, kgx[, cols], by.x = "query",  by.y = "kgID", all.x = TRUE)
m <- merge(m,        kgx[, cols], by.x = "target", by.y = "kgID", all.x = TRUE,
           suffixes = c("_query", "_target"))

# Sanity Check
as_tibble(head(m, 10))

## # A tibble: 10 × 7
##    target    query distance geneSymbol_query description_query geneSymbol_target
##    <chr>     <chr>    <int> <chr>            <chr>             <chr>            
##  1 ENST0000… ENST…        2 ARRB2            arrestin beta 2,… ARF5             
##  2 ENST0000… ENST…        2 KIF13A           kinesin family m… ARF5             
##  3 ENST0000… ENST…        2 RAB11A           RAB11A, member R… ARF5             
##  4 ENST0000… ENST…        2 DTNBP1           dystrobrevin bin… ARF5             
##  5 ENST0000… ENST…        2 ARF1             ADP ribosylation… ARF5             
##  6 ENST0000… ENST…        2 RAB11FIP4        RAB11 family int… ARF5             
##  7 ENST0000… ENST…        1 PIP5K1A          phosphatidylinos… ARF5             
##  8 ENST0000… ENST…        2 ITGB3BP          integrin subunit… ARF5             
##  9 ENST0000… ENST…        1 ARFIP2           Plays a role in … ARF5             
## 10 ENST0000… ENST…        2 BUB1B            Essential compon… ARF5             
## # ℹ 1 more variable: description_target <chr>

as_tibble(head(HprdP2P, 10))

## # A tibble: 10 × 3
##    query             target             distance
##    <chr>             <chr>                 <int>
##  1 ENST00000406381.6 ENST00000406381.6         1
##  2 ENST00000406381.6 ENST00000541774.5         1
##  3 ENST00000406381.6 ENST00000584601.5         1
##  4 ENST00000406381.6 ENST00000269571.10        1
##  5 ENST00000541774.5 ENST00000406381.6         1
##  6 ENST00000541774.5 ENST00000541774.5         1
##  7 ENST00000541774.5 ENST00000584601.5         1
##  8 ENST00000541774.5 ENST00000269571.10        1
##  9 ENST00000584601.5 ENST00000406381.6         1
## 10 ENST00000584601.5 ENST00000541774.5         1

as_tibble(head(kgx, 10))

## # A tibble: 10 × 10
##    kgID    mRNA  spID  spDisplayID geneSymbol refseq protAcc description rfamAcc
##    <chr>   <chr> <chr> <chr>       <chr>      <chr>  <chr>   <chr>       <lgl>  
##  1 ENST00… NM_0… P840… ARF5_HUMAN  ARF5       NM_00… NM_001… ADP ribosy… NA     
##  2 ENST00… NM_0… P206… MPRD_HUMAN  M6PR       NM_00… NM_002… mannose-6-… NA     
##  3 ENST00… NM_0… Q569… Q569H8_HUM… ESRRA      NM_00… NM_004… estrogen r… NA     
##  4 ENST00… NM_0… Q027… FKBP4_HUMAN FKBP4      NM_00… NM_002… FKBP proly… NA     
##  5 ENST00… NM_0… Q9NR… CP26B_HUMAN CYP26B1    NM_01… NM_019… cytochrome… NA     
##  6 ENST00… NR_1… Q7L5… NDUF7_HUMAN NDUFAF7    NR_14… NR_146… NADH:ubiqu… NA     
##  7 ENST00… NM_0… Q9BT… FUCO2_HUMAN FUCA2      NM_03… NM_032… alpha-L-fu… NA     
##  8 ENST00… NM_0… Q9H9… DBND1_HUMAN DBNDD1     NM_00… NM_001… dysbindin … NA     
##  9 ENST00… NM_0… O147… HS3S1_HUMAN HS3ST1     NM_00… NM_005… heparan su… NA     
## 10 ENST00… NM_0… Q132… SEM3F_HUMAN SEMA3F     NM_00… NM_004… semaphorin… NA     
## # ℹ 1 more variable: tRnaName <lgl>

Description: For Question 1, I load two UCSC tables: humanHprdP2P and kgXref. I look at the first few rows to make sure the columns look right. I fix column names so the keys match. From there, I keep only the needed lookup fields. Then I do two left joins: first match query to kgID to add the query gene symbol, then match target to kgID to add the target gene symbol. Left joins keep every interaction row; if an ID does not map, the new symbol is NA. I print the top rows and check the row count to confirm nothing unexpected happened.

Ways to merge in R: Option 1: Base merge() (built-in): inner join (only matches), left join (keep all from the left), full join (keep all from both). Option 2: dplyr joins (left_join, inner_join, full_join): very readable pipelines, needs the package. Option 3: data.table joins: fastest on big data, different syntax. Option 4: match() / named vector: simplest for one-to-one lookups (add a column by key), can not duplicate rows and preserves order. Always try to use unique keys in the lookup and, after a join, check the row count and a few example mappings.

Question 2 — Download GEO data and map ENTREZ annotations

Goal: Download one GEO dataset and map ENTREZ → SYMBOL/GENENAME from org.Hs.eg.db (without duplicating rows!).

library(GEOquery)
library(AnnotationDbi)
library(org.Hs.eg.db)

#Load GSE Data
gse_list <- getGEO("GSE68849", GSEMatrix = TRUE)
eset <- gse_list[[1]]  

#Pull matrices
expr <- Biobase::exprs(eset)
feat <- Biobase::fData(eset)

#Sanity Check
dim(expr)

## [1] 47321    10

nrow(feat)

## [1] 47321

#Load ENTREZID
cand_entrez <- c("ENTREZ_GENE_ID","Entrez_Gene_ID","ENTREZID","GeneID","Gene.ID","Gene ID","entrez_gene_id")
ent_col <- intersect(cand_entrez, colnames(feat))[1]

if (!is.na(ent_col)) {
  ENTREZID <- as.character(feat[[ent_col]])
  # keep first token if cells have "12345;67890"
  ENTREZID <- sub("^\\s*([^;|/, ]+).*", "\\1", ENTREZID)
  ENTREZID[ENTREZID == ""] <- NA
} else {
  ENTREZID <- rep(NA_character_, nrow(feat))
}

#Clean ENTREZ IDs
ENTREZID <- sub("^\\s*([^;|/, ]+).*", "\\1", ENTREZID)
ENTREZID[ENTREZID == ""] <- NA

#Map annotations 
ok <- !is.na(ENTREZID)

SYMBOL   <- rep(NA_character_, length(ENTREZID))
GENENAME <- rep(NA_character_, length(ENTREZID))

SYMBOL[ok] <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                    keys     = ENTREZID[ok],
                                    keytype  = "ENTREZID",
                                    column   = "SYMBOL",
                                    multiVals = "first")

GENENAME[ok] <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                      keys     = ENTREZID[ok],
                                      keytype  = "ENTREZID",
                                      column   = "GENENAME",
                                      multiVals = "first")

#Build a row-aligned annotation table
annot <- data.frame(
  ENTREZID = ENTREZID,
  SYMBOL   = SYMBOL,
  GENENAME = GENENAME,
  row.names = rownames(expr),
  stringsAsFactors = FALSE
)

#Print Dimensions
cat("expr dims:   ", paste(dim(expr), collapse = " x "), "\n", sep = "")

## expr dims:   47321 x 10

cat("annot rows:  ", nrow(annot), "\n", sep = "")

## annot rows:  47321

#Sanity Check
head(annot, 10)

##              ENTREZID   SYMBOL
## ILMN_1343291     1915   EEF1A1
## ILMN_1343295     2597    GAPDH
## ILMN_1651199   643334     <NA>
## ILMN_1651209     9906 SLC35E2A
## ILMN_1651210    56940   DUSP22
## ILMN_1651221   642820     <NA>
## ILMN_1651228     6234    RPS28
## ILMN_1651229     9670    IPO13
## ILMN_1651230   360226   PRSS41
## ILMN_1651232   653113  FAM86FP
##                                                             GENENAME
## ILMN_1343291      eukaryotic translation elongation factor 1 alpha 1
## ILMN_1343295                glyceraldehyde-3-phosphate dehydrogenase
## ILMN_1651199                                                    <NA>
## ILMN_1651209        solute carrier family 35 member E2A (pseudogene)
## ILMN_1651210                         dual specificity phosphatase 22
## ILMN_1651221                                                    <NA>
## ILMN_1651228                                   ribosomal protein S28
## ILMN_1651229                                             importin 13
## ILMN_1651230                                      serine protease 41
## ILMN_1651232 family with sequence similarity 86, member A pseudogene

Question 3 — Build a SummarizedExperiment

Goal: Create a SummarizedExperiment with assay = expr, rowData = annot (from Q2), and colData = sample metadata from GEO.

library(SummarizedExperiment)

#Normalization
meta <- Biobase::pData(eset) 

stopifnot(nrow(annot) == nrow(expr))
annot <- annot[rownames(expr), , drop = FALSE]

stopifnot(nrow(meta) >= ncol(expr))
meta <- meta[colnames(expr), , drop = FALSE]  

#SummarizedExperiment
se <- SummarizedExperiment::SummarizedExperiment(
  assays  = list(expr = expr),
  rowData = S4Vectors::DataFrame(annot),
  colData = S4Vectors::DataFrame(meta)
)

#Print Data
rowData(se)[1:6, ]

## DataFrame with 6 rows and 3 columns
##                 ENTREZID      SYMBOL               GENENAME
##              <character> <character>            <character>
## ILMN_1343291        1915      EEF1A1 eukaryotic translati..
## ILMN_1343295        2597       GAPDH glyceraldehyde-3-pho..
## ILMN_1651199      643334          NA                     NA
## ILMN_1651209        9906    SLC35E2A solute carrier famil..
## ILMN_1651210       56940      DUSP22 dual specificity pho..
## ILMN_1651221      642820          NA                     NA

colData(se)[, 1:min(6, ncol(colData(se)))]

## DataFrame with 10 rows and 6 columns
##                             title geo_accession                status
##                       <character>   <character>           <character>
## GSM1684095 Donor 1 - No virus c..    GSM1684095 Public on Feb 01 2016
## GSM1684096 Donor 1 - Influenza ..    GSM1684096 Public on Feb 01 2016
## GSM1684097 Donor 2 - No virus c..    GSM1684097 Public on Feb 01 2016
## GSM1684098 Donor 2 - Influenza ..    GSM1684098 Public on Feb 01 2016
## GSM1684099 Donor 3 - No virus c..    GSM1684099 Public on Feb 01 2016
## GSM1684100 Donor 3 - Influenza ..    GSM1684100 Public on Feb 01 2016
## GSM1684101 Donor 4 - No virus c..    GSM1684101 Public on Feb 01 2016
## GSM1684102 Donor 4 - Influenza ..    GSM1684102 Public on Feb 01 2016
## GSM1684103 Donor 5 - No virus c..    GSM1684103 Public on Feb 01 2016
## GSM1684104 Donor 5 - Influenza ..    GSM1684104 Public on Feb 01 2016
##            submission_date last_update_date        type
##                <character>      <character> <character>
## GSM1684095     May 13 2015      Feb 01 2016         RNA
## GSM1684096     May 13 2015      Feb 01 2016         RNA
## GSM1684097     May 13 2015      Feb 01 2016         RNA
## GSM1684098     May 13 2015      Feb 01 2016         RNA
## GSM1684099     May 13 2015      Feb 01 2016         RNA
## GSM1684100     May 13 2015      Feb 01 2016         RNA
## GSM1684101     May 13 2015      Feb 01 2016         RNA
## GSM1684102     May 13 2015      Feb 01 2016         RNA
## GSM1684103     May 13 2015      Feb 01 2016         RNA
## GSM1684104     May 13 2015      Feb 01 2016         RNA

assays(se)$expr[1:6, 1:6]

##               GSM1684095  GSM1684096  GSM1684097  GSM1684098  GSM1684099
## ILMN_1343291 23599.32000 22303.32000 25980.17000 24776.12000 24776.12000
## ILMN_1343295  6405.33500  7732.65200  6553.47900  5296.74600  7699.21900
## ILMN_1651199    95.19509   106.23970    97.54575    94.69421   106.28950
## ILMN_1651209    95.65199    97.22574   105.75440   100.54720   119.32600
## ILMN_1651210   101.82980    97.09778   125.64750   102.29230    99.58566
## ILMN_1651221   111.78650   104.71960   109.89840   102.23130   102.76570
##              GSM1684100
## ILMN_1343291 26775.3400
## ILMN_1343295  6430.1560
## ILMN_1651199   108.2523
## ILMN_1651209   109.0526
## ILMN_1651210   119.7902
## ILMN_1651221   101.0684

Question 4 — Row/column summaries + visuals

Goal: Print simple column-wise and row-wise stats, plus two concise plots.

#Pull matrix
expr <- assays(se)$expr
expr_log <- log2(expr + 1)

#Column Statistics 
col_mean   <- colMeans(expr, na.rm = TRUE)
col_median <- apply(expr, 2, median, na.rm = TRUE)
col_sd     <- apply(expr, 2, sd,     na.rm = TRUE)

col_stats <- data.frame(
  sample = colnames(expr),
  mean   = round(col_mean,   2),
  median = round(col_median, 2),
  sd     = round(col_sd,     2),
  row.names = NULL
)
col_stats   # prints a table of sample summaries

##        sample   mean median      sd
## 1  GSM1684095 346.25 110.04 1339.12
## 2  GSM1684096 346.23 110.02 1339.12
## 3  GSM1684097 346.25 110.03 1339.12
## 4  GSM1684098 346.24 110.04 1339.12
## 5  GSM1684099 346.25 110.04 1339.12
## 6  GSM1684100 346.25 110.06 1339.12
## 7  GSM1684101 346.27 110.08 1339.12
## 8  GSM1684102 346.27 110.09 1339.12
## 9  GSM1684103 346.26 110.06 1339.12
## 10 GSM1684104 346.27 110.07 1339.12

#Row Statistics 
row_mean <- rowMeans(expr, na.rm = TRUE)
row_sd   <- apply(expr, 1, sd, na.rm = TRUE)

summary(row_mean)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    76.83   100.23   109.64   346.25   140.90 25054.54

summary(row_sd)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.534     7.443    10.140    75.674    20.420 11184.610

#Visuals 
#Boxplots per sample 
op <- par(mar = c(9,4,2,1))  
boxplot(expr_log,
        outline = FALSE, las = 2,
        main = "Per-sample expression (log2 scale)",
        ylab = "log2(expression + 1)")

par(op)

#Sample–sample correlation heatmap
cors <- cor(expr_log, use = "pairwise.complete.obs")
heatmap(cors, symm = TRUE,
        main = "Sample correlation (log2 data)")

Description: For Question 4, the code first pulls the expression matrix from the SummarizedExperiment and creates a log2(+1) version so the values are on a stable, comparable scale. It then calculates simple column-wise summaries—mean, median, and standard deviation—for each sample and prints them to a table to check if any samples look different. Next, it computes row-wise summaries—the mean and standard deviation for each feature across samples—and prints compact summaries so you can see the overall spread of gene/probe values. Finally, it makes two quick visual checks: a log2 boxplot that compares the per-sample distributions and a correlation heatmap of the samples.

The per-sample summary table (means/medians/SDs) is similar across all samples, suggesting no obvious outliers. The log2 boxplot shows comparable medians and spreads, which supports good, consistent preprocessing. The sample–sample correlation heatmap shows high correlations, indicating the samples agree well overall. Row-wise summaries show a typical pattern for expression data.

RBIF 112 Homework 1

Megan Dorval

2025-09-01

Question 1 — Map HPRD interactions to gene symbols

Question 2 — Download GEO data and map ENTREZ annotations

Question 3 — Build a SummarizedExperiment

Question 4 — Row/column summaries + visuals