Exploratory Data Analysis of Gene Expression

#Introduction In this project, I will analyze RNA-seq data from the brain of a 90-year-old woman in the RUSH Alzheimer’s study. The dataset includes information on gene length, FPKM (Fragments Per Kilobase of transcript per Million mapped reads), and TPM (Transcripts Per Million). I will create 3 exploratory plots that give biologically relevant insights into gene expression in this tissue.

#Reading and Inspecting the Data

myGex <- read.table("ENCFF166SFX.tsv", header = TRUE)

glimpse(myGex)

## Rows: 59,526
## Columns: 17
## $ gene_id                                <chr> "10904", "12954", "12956", "129…
## $ transcript_id.s.                       <chr> "10904", "12954", "12956", "129…
## $ length                                 <dbl> 93, 94, 72, 82, 73, 72, 74, 82,…
## $ effective_length                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ expected_count                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ posterior_mean_count                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ posterior_standard_deviation_of_count  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pme_TPM                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pme_FPKM                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_ci_lower_bound                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_ci_upper_bound                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_coefficient_of_quartile_variation  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_ci_lower_bound                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_ci_upper_bound                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_coefficient_of_quartile_variation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

summary(myGex)

##    gene_id          transcript_id.s.       length       effective_length
##  Length:59526       Length:59526       Min.   :     8   Min.   :     0  
##  Class :character   Class :character   1st Qu.:   387   1st Qu.:   287  
##  Mode  :character   Mode  :character   Median :   808   Median :   708  
##                                        Mean   :  1339   Mean   :  1240  
##                                        3rd Qu.:  1823   3rd Qu.:  1723  
##                                        Max.   :205012   Max.   :204912  
##  expected_count           TPM                 FPKM          
##  Min.   :      0.0   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:      0.0   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :      0.0   Median :     0.00   Median :     0.00  
##  Mean   :    339.3   Mean   :    16.80   Mean   :    19.55  
##  3rd Qu.:     46.0   3rd Qu.:     1.52   3rd Qu.:     1.77  
##  Max.   :1251092.4   Max.   :267430.21   Max.   :311258.58  
##  posterior_mean_count posterior_standard_deviation_of_count    pme_TPM         
##  Min.   :      0.0    Min.   :    0.000                     Min.   :     0.00  
##  1st Qu.:      0.0    1st Qu.:    0.000                     1st Qu.:     0.09  
##  Median :      0.0    Median :    0.000                     Median :     0.30  
##  Mean   :    339.3    Mean   :    1.965                     Mean   :    16.80  
##  3rd Qu.:     46.0    3rd Qu.:    0.000                     3rd Qu.:     3.00  
##  Max.   :1068373.8    Max.   :22576.980                     Max.   :217493.25  
##     pme_FPKM         TPM_ci_lower_bound  TPM_ci_upper_bound 
##  Min.   :     0.00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:     0.11   1st Qu.:0.000e+00   1st Qu.:2.530e-01  
##  Median :     0.36   Median :3.000e-03   Median :6.790e-01  
##  Mean   :    20.22   Mean   :1.502e+01   Mean   :1.923e+01  
##  3rd Qu.:     3.61   3rd Qu.:1.072e+00   3rd Qu.:4.465e+00  
##  Max.   :261833.14   Max.   :2.169e+05   Max.   :2.181e+05  
##  TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound FPKM_ci_upper_bound
##  Min.   :0.0000                        Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:0.0759                        1st Qu.:0.000e+00   1st Qu.:3.040e-01  
##  Median :0.3873                        Median :4.000e-03   Median :8.180e-01  
##  Mean   :0.3677                        Mean   :1.808e+01   Mean   :2.316e+01  
##  3rd Qu.:0.6553                        3rd Qu.:1.292e+00   3rd Qu.:5.370e+00  
##  Max.   :0.7098                        Max.   :2.612e+05   Max.   :2.625e+05  
##  FPKM_coefficient_of_quartile_variation
##  Min.   :0.00000                       
##  1st Qu.:0.07591                       
##  Median :0.38733                       
##  Mean   :0.36768                       
##  3rd Qu.:0.65535                       
##  Max.   :0.70981

Before analysis, it’s important to confirm the dataset structure and check for missing values:

colSums(is.na(myGex))

##                                gene_id                       transcript_id.s. 
##                                      0                                      0 
##                                 length                       effective_length 
##                                      0                                      0 
##                         expected_count                                    TPM 
##                                      0                                      0 
##                                   FPKM                   posterior_mean_count 
##                                      0                                      0 
##  posterior_standard_deviation_of_count                                pme_TPM 
##                                      0                                      0 
##                               pme_FPKM                     TPM_ci_lower_bound 
##                                      0                                      0 
##                     TPM_ci_upper_bound  TPM_coefficient_of_quartile_variation 
##                                      0                                      0 
##                    FPKM_ci_lower_bound                    FPKM_ci_upper_bound 
##                                      0                                      0 
## FPKM_coefficient_of_quartile_variation 
##                                      0

#Plot 1: Distribution of Gene Lengths Gene length matters because longer genes may be more likely to accumulate sequencing reads, and expression normalization methods (like TPM) correct for this bias.

ggplot(myGex, aes(x = length)) +
  geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Gene Lengths",
       x = "Gene length (bp)",
       y = "Number of genes") +
  theme_minimal()

Most of the genes fall below 20,000 base pairs, but there are some extremely long genes. This skew can affect quantification and explains why normalization is necessary.

#Plot 2: Distribution of Gene Expression Gene expression levels are usually highly skewed — a few genes dominate RNA abundance, while most are expressed at low levels. I will use a log scale to clarity.

ggplot(myGex, aes(x = TPM)) +
  geom_histogram(bins = 100, fill = "darkseagreen", color = "black") +
  scale_x_log10() +
  labs(title = "Distribution of Gene Expression (TPM)",
       x = "Log10 TPM",
       y = "Number of genes") +
  theme_minimal()

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning: Removed 31437 rows containing non-finite outside the scale range
## (`stat_bin()`).

Most genes in the brain are expressed at very low levels, while a smaller subset of genes are highly expressed — often those related to neuronal function and energy metabolism.

#Plot 3: Reltionship Between FPKM and TPM FPKM and TPM are two common normalization methods. Since TPM accounts for both sequencing depth and gene length, it is considered more comparable across samples.

ggplot(myGex, aes(x = FPKM, y = TPM)) +
  geom_point(alpha = 0.3, size = 1) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = "lm", color = "violet", se = FALSE) +
  labs(title = "Relationship Between FPKM and TPM",
       x = "FPKM",
       y = "TPM") +
  theme_minimal()

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 31437 rows containing non-finite outside the scale range
## (`stat_smooth()`).

There is a strong positive correlation which shows that both metrics capture similar trends, but TPM is adjusted for gene length, making it a better choice when comparing expression across genes or samples.

#Conclusion Through these plots I have learned that gene lengths vary widely, with some very long transcripts. Most genes are expressed at low levels, while a few dominate RNA content. TPM and FPKM are highly correlated, but TPM is biologically more interpretable. This exploratory analysis highlights how RNA-seq normalization and distribution patterns shape our understanding of gene expression in aging brain tissue.

Exploratory Data Analysis of Gene Expression

Davie Slocum

2025-09-15