Gene Expression Analysis in Alzheimer’s Disease Brain Tissue

library(ggplot2)  # ggplot(), geom_histogram(), geom_col(), geom_smooth()
library(dplyr)    # filter()
geneExp = read.table("ENCFF166SFX.tsv", header = TRUE)

str(geneExp)

## 'data.frame':    59526 obs. of  17 variables:
##  $ gene_id                               : chr  "10904" "12954" "12956" "12958" ...
##  $ transcript_id.s.                      : chr  "10904" "12954" "12956" "12958" ...
##  $ length                                : num  93 94 72 82 73 72 74 82 73 73 ...
##  $ effective_length                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ expected_count                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TPM                                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FPKM                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ posterior_mean_count                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ posterior_standard_deviation_of_count : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pme_TPM                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pme_FPKM                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TPM_ci_lower_bound                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TPM_ci_upper_bound                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TPM_coefficient_of_quartile_variation : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FPKM_ci_lower_bound                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FPKM_ci_upper_bound                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FPKM_coefficient_of_quartile_variation: num  0 0 0 0 0 0 0 0 0 0 ...

Plot 1: Gene Length Distribution Analysis

Understanding the distribution of gene lengths helps assess data quality and identify potential length-related biases in expression measurements.

Initial exploration of gene length distribution

ggplot(geneExp,
       aes(x = length)) +
  geom_histogram(binwidth = 50,
                 color = "darkblue",
                 alpha = 0.8) +
  labs(title = "Distribution of Gene Lengths in Brain Tissue",
       subtitle = "Initial exploration showing full range of gene lengths",
       x = "Gene Length (base pairs)",
       y = "Number of Genes") +
  theme_minimal()

Filter to focus on typical gene length range

normLength = geneExp |>
  filter(length < 10000)
ggplot(normLength,
       aes(x = length)) +
  geom_histogram(binwidth = 200,
                 fill = "lightblue",
                 color = "darkblue",
                 alpha = 0.8) +
  labs(title = "Refined Gene Length Distribution",
       subtitle = "Focusing on genes under 10,000 bp reveals typical transcript lengths",
       x = "Gene Length (base pairs)",
       y = "Number of Genes") +
  theme_minimal()

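Biological Interpretation: Most genes in this dataset are shorter than 2,500 bp, and shorter genes are more common than longer ones, giving the right-skewed distribution expected for gene lengths. The table includes every annotated gene rather than only protein-coding ones, so the very short entries (under 100 bp, like the first rows shown above) are likely small non-coding RNAs.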
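
The share of genes below a given length can be checked numerically instead of being read off the histogram. The following is a minimal sketch: `frac_shorter` and the `toy` data frame are hypothetical names with made-up values, used only for illustration and not part of the original analysis.

```r
# Hypothetical helper: proportion of genes whose length is below a cutoff.
frac_shorter <- function(lengths, cutoff = 2500) {
  mean(lengths < cutoff, na.rm = TRUE)
}

# Made-up lengths standing in for geneExp$length
toy <- data.frame(length = c(72, 93, 1500, 2400, 3000, 12000))

frac_shorter(toy$length)           # share of genes under 2,500 bp
quantile(toy$length, c(0.5, 0.9))  # median and 90th percentile of length
```

With the real table, `frac_shorter(geneExp$length)` would give the proportion that the interpretation above describes as "most".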

Plot 2: Highly Expressed Genes in Aging Brain

Identifying the most highly expressed genes provides insight into the dominant biological processes in aging brain tissue.

Identify and visualize highly expressed genes

# Identify and visualize highly expressed genes
hiEx = geneExp |>
  filter(FPKM > 5000)

ggplot(hiEx, aes(x = reorder(gene_id, FPKM), y = FPKM)) +
  geom_col(aes(fill = FPKM), show.legend = TRUE) +
  scale_fill_gradient2(low = "green", mid = "yellow", high = "red", 
                       midpoint = median(hiEx$FPKM),
                       name = "FPKM\nExpression") +
  labs(title = "Top Expressed Genes in 90-Year-Old Brain Tissue",
       subtitle = "Genes with FPKM > 5000 reveal dominant cellular processes",
       x = "Gene ID (ordered by expression level)",
       y = "FPKM (Fragments Per Kilobase of transcript per Million mapped reads)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Biological Interpretation: The highly expressed genes likely represent housekeeping functions required for neuronal survival and brain metabolism. In aging brain tissue, these may include genes involved in energy metabolism, protein synthesis, and general cellular maintenance, all of which support neuronal function in the face of age-related decline.
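
A fixed cutoff such as FPKM > 5000 returns a different number of genes for every sample. One possible alternative is to rank genes and keep a fixed number of them. The sketch below assumes a data frame with `gene_id` and `FPKM` columns; `top_expressed` and the `toy` values are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: return the n rows with the largest FPKM values.
top_expressed <- function(df, n = 3) {
  ranked <- df[order(df$FPKM, decreasing = TRUE), ]
  head(ranked, n)
}

# Made-up expression values standing in for geneExp
toy <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4", "g5"),
  FPKM    = c(12, 8400, 0, 5600, 930),
  stringsAsFactors = FALSE
)

top_expressed(toy, n = 2)$gene_id  # IDs of the two most highly expressed genes
```

The result can be passed to the same `ggplot()` call used for `hiEx`, which keeps the number of bars constant across samples.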

Plot 3: Correlation Between Expression Metrics

Comparing different expression measurements helps validate the data and reveals technical aspects of gene expression quantification.

Calculate correlation between FPKM and TPM

# Fit TPM as a linear function of FPKM and keep the coefficients
reg = lm(formula = TPM ~ FPKM,
         data = normLength)
coeff = coefficients(reg)
int1 = coeff[1]
slo = coeff[2]

# Pearson correlation coefficient, shown in the plot subtitle
cor_coef = round(cor(normLength$FPKM,
                     normLength$TPM,
                     use = "complete.obs"),
                 3)
ggplot(normLength,
       aes(x = FPKM,
           y = TPM)) +
  geom_point(alpha = 0.9, color = "darkgray",
             size = 0.8) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm",
              color = "orange",
              linewidth = 0.5,
              se = TRUE,
              alpha = 0.1) +
  labs(title = "Correlation Between FPKM and TPM Expression Measures",
       subtitle = paste("Pearson r =", cor_coef),
       x = "FPKM",
       y = "TPM") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'
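
Under the standard definitions, TPM within a single sample is FPKM rescaled so that all values sum to one million. The two metrics are therefore expected to be exactly proportional, and a correlation near 1 is a consistency check, not an independent finding. The sketch below illustrates this on made-up values; `fpkm_to_tpm` is a hypothetical helper, not a function from the original analysis.

```r
# Convert FPKM to TPM within a single sample:
# TPM_i = FPKM_i / sum(FPKM) * 1e6
fpkm_to_tpm <- function(fpkm) {
  fpkm / sum(fpkm, na.rm = TRUE) * 1e6
}

toy_fpkm <- c(0, 10, 40, 150)  # made-up FPKM values
toy_tpm  <- fpkm_to_tpm(toy_fpkm)

sum(toy_tpm)            # TPM values sum to one million
cor(toy_fpkm, toy_tpm)  # proportional values give a correlation of 1
```

The same reasoning suggests that the fitted intercept (`int1`) should be close to 0 and the slope (`slo`) close to `1e6 / sum(geneExp$FPKM)`, assuming the file follows these definitions.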

Examine relationship between expected count and FPKM

# Fit expected count as a linear function of FPKM and keep the coefficients
reg = lm(formula = expected_count ~ FPKM,
         data = normLength)
coeff = coefficients(reg)
int1 = coeff[1]
slo = coeff[2]

# Pearson correlation coefficient, shown in the plot subtitle
cor_coef2 = round(cor(normLength$FPKM,
                      normLength$expected_count,
                      use = "complete.obs"), 3)
ggplot(normLength,
       aes(x = FPKM,
           y = expected_count)) +
  geom_point(alpha = 0.9,
             color = "darkgray",
             size = 0.8) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm",
              color = "orange",
              linewidth = 0.5,
              se = TRUE,
              alpha = 0.1) +
  labs(title = "FPKM vs Expected Count Relationship",
       subtitle = paste("Pearson r =", cor_coef2),
       x = "FPKM",
       y = "Expected Count") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Biological Interpretation: A roughly linear relationship between FPKM and expected count is visible, although most points cluster near zero, which may reflect unexpressed genes or low-count noise. Agreement between the metrics indicates that the quantification is internally consistent, which supports treating observed expression differences as biological variation rather than technical artifacts. This matters when studying subtle age-related changes in brain tissue.
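
Part of the scatter in this plot follows from how FPKM is defined: the count is divided by the gene's effective length, so two genes with the same count but different lengths land at different FPKM values. The sketch below uses the standard FPKM formula, which is an assumption about how the values in the file were derived. `counts_to_fpkm` and the `toy` values are hypothetical.

```r
# Standard FPKM definition (assumed here, not confirmed for this file):
# FPKM_i = count_i * 1e9 / (effective_length_i * total_count)
counts_to_fpkm <- function(count, eff_len) {
  count * 1e9 / (eff_len * sum(count))
}

# Made-up counts and effective lengths; genes 1 and 2 share the same count
toy <- data.frame(
  expected_count   = c(100, 100, 800),
  effective_length = c(500, 2000, 2000)
)
toy$FPKM <- counts_to_fpkm(toy$expected_count, toy$effective_length)

toy  # gene 1 is 4x shorter than gene 2, so its FPKM is 4x higher
```

Genes with an effective length of 0, which appear in the first rows of the table, would need to be excluded or handled separately before applying a formula like this.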

ENCODE Expression Data

Collin McNeil

2025-09-15