#Introduction In this project, I will analyze RNA-seq data from the brain of a 90-year-old woman in the RUSH Alzheimer’s study. The dataset includes information on gene length, FPKM (Fragments Per Kilobase of transcript per Million mapped reads), and TPM (Transcripts Per Million). I will create 3 exploratory plots that give biologically relevant insights into gene expression in this tissue.
#Reading and Inspecting the Data
myGex <- read.table("ENCFF166SFX.tsv", header = TRUE)
glimpse(myGex)
## Rows: 59,526
## Columns: 17
## $ gene_id <chr> "10904", "12954", "12956", "129…
## $ transcript_id.s. <chr> "10904", "12954", "12956", "129…
## $ length <dbl> 93, 94, 72, 82, 73, 72, 74, 82,…
## $ effective_length <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ expected_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ posterior_mean_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ posterior_standard_deviation_of_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pme_TPM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pme_FPKM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_ci_lower_bound <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_ci_upper_bound <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TPM_coefficient_of_quartile_variation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_ci_lower_bound <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_ci_upper_bound <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FPKM_coefficient_of_quartile_variation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
summary(myGex)
## gene_id transcript_id.s. length effective_length
## Length:59526 Length:59526 Min. : 8 Min. : 0
## Class :character Class :character 1st Qu.: 387 1st Qu.: 287
## Mode :character Mode :character Median : 808 Median : 708
## Mean : 1339 Mean : 1240
## 3rd Qu.: 1823 3rd Qu.: 1723
## Max. :205012 Max. :204912
## expected_count TPM FPKM
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.0 Median : 0.00 Median : 0.00
## Mean : 339.3 Mean : 16.80 Mean : 19.55
## 3rd Qu.: 46.0 3rd Qu.: 1.52 3rd Qu.: 1.77
## Max. :1251092.4 Max. :267430.21 Max. :311258.58
## posterior_mean_count posterior_standard_deviation_of_count pme_TPM
## Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.09
## Median : 0.0 Median : 0.000 Median : 0.30
## Mean : 339.3 Mean : 1.965 Mean : 16.80
## 3rd Qu.: 46.0 3rd Qu.: 0.000 3rd Qu.: 3.00
## Max. :1068373.8 Max. :22576.980 Max. :217493.25
## pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound
## Min. : 0.00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.11 1st Qu.:0.000e+00 1st Qu.:2.530e-01
## Median : 0.36 Median :3.000e-03 Median :6.790e-01
## Mean : 20.22 Mean :1.502e+01 Mean :1.923e+01
## 3rd Qu.: 3.61 3rd Qu.:1.072e+00 3rd Qu.:4.465e+00
## Max. :261833.14 Max. :2.169e+05 Max. :2.181e+05
## TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound FPKM_ci_upper_bound
## Min. :0.0000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:0.0759 1st Qu.:0.000e+00 1st Qu.:3.040e-01
## Median :0.3873 Median :4.000e-03 Median :8.180e-01
## Mean :0.3677 Mean :1.808e+01 Mean :2.316e+01
## 3rd Qu.:0.6553 3rd Qu.:1.292e+00 3rd Qu.:5.370e+00
## Max. :0.7098 Max. :2.612e+05 Max. :2.625e+05
## FPKM_coefficient_of_quartile_variation
## Min. :0.00000
## 1st Qu.:0.07591
## Median :0.38733
## Mean :0.36768
## 3rd Qu.:0.65535
## Max. :0.70981
Before analysis, it’s important to confirm the dataset structure and check for missing values:
colSums(is.na(myGex))
## gene_id transcript_id.s.
## 0 0
## length effective_length
## 0 0
## expected_count TPM
## 0 0
## FPKM posterior_mean_count
## 0 0
## posterior_standard_deviation_of_count pme_TPM
## 0 0
## pme_FPKM TPM_ci_lower_bound
## 0 0
## TPM_ci_upper_bound TPM_coefficient_of_quartile_variation
## 0 0
## FPKM_ci_lower_bound FPKM_ci_upper_bound
## 0 0
## FPKM_coefficient_of_quartile_variation
## 0
#Plot 1: Distribution of Gene Lengths Gene length matters because longer genes may be more likely to accumulate sequencing reads, and expression normalization methods (like TPM) correct for this bias.
ggplot(myGex, aes(x = length)) +
geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
labs(title = "Distribution of Gene Lengths",
x = "Gene length (bp)",
y = "Number of genes") +
theme_minimal()
Most of the genes fall below 20,000 base pairs, but there are some
extremely long genes. This skew can affect quantification and explains
why normalization is necessary.
#Plot 2: Distribution of Gene Expression Gene expression levels are usually highly skewed — a few genes dominate RNA abundance, while most are expressed at low levels. I will use a log scale to clarity.
ggplot(myGex, aes(x = TPM)) +
geom_histogram(bins = 100, fill = "darkseagreen", color = "black") +
scale_x_log10() +
labs(title = "Distribution of Gene Expression (TPM)",
x = "Log10 TPM",
y = "Number of genes") +
theme_minimal()
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 31437 rows containing non-finite outside the scale range
## (`stat_bin()`).
Most genes in the brain are expressed at very low levels, while a
smaller subset of genes are highly expressed — often those related to
neuronal function and energy metabolism.
#Plot 3: Reltionship Between FPKM and TPM FPKM and TPM are two common normalization methods. Since TPM accounts for both sequencing depth and gene length, it is considered more comparable across samples.
ggplot(myGex, aes(x = FPKM, y = TPM)) +
geom_point(alpha = 0.3, size = 1) +
scale_x_log10() +
scale_y_log10() +
geom_smooth(method = "lm", color = "violet", se = FALSE) +
labs(title = "Relationship Between FPKM and TPM",
x = "FPKM",
y = "TPM") +
theme_minimal()
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31437 rows containing non-finite outside the scale range
## (`stat_smooth()`).
There is a strong positive correlation which shows that both metrics
capture similar trends, but TPM is adjusted for gene length, making it a
better choice when comparing expression across genes or samples.
#Conclusion Through these plots I have learned that gene lengths vary widely, with some very long transcripts. Most genes are expressed at low levels, while a few dominate RNA content. TPM and FPKM are highly correlated, but TPM is biologically more interpretable. This exploratory analysis highlights how RNA-seq normalization and distribution patterns shape our understanding of gene expression in aging brain tissue.