Encode Expression Data Visualization

r setup for markdown file

loading tidyverse library (installed in console)

library('tidyverse')

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

reading in tsv file from Encode as a table

myGex <-read.table('ENCFF166SFX.tsv', header=TRUE)

initial ggplot to start visualizing data, starting with gene length

ggplot(myGex, aes(x=length))+
  geom_histogram(binwidth = 50)

filtering the data from myGex to look at genes with lengths longer than 10,000, then visualizing the data again to have a more zoomed in view.

normalLength <-myGex |>
  filter(length <10000)
ggplot(normalLength, aes(x=length))+
  geom_histogram(binwidth = 1000)

finding highly expressed genes

hiEx<-myGex |>
  filter(FPKM>5000)
ggplot(hiEx, aes(x=gene_id, y=FPKM, fill=FPKM))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1))

looking at the correspondence of TPM and FPKM data

reg<-lm(formula = TPM ~FPKM,
  data=normalLength)
coeff<-coefficients(reg)
int1<-coeff[1]
slo<-coeff[2]
ggplot(normalLength, aes(x=FPKM, y=TPM))+
  geom_point(position = "jitter", alpha=0.5)+
  geom_abline(intercept = int1, slope = slo, color='skyblue', linewidth=2, alpha=0.1)+
  theme_bw()

checking expected counts vs FPKM to look for any outliers

reg<-lm(formula = expected_count~FPKM,
data=normalLength)
coeff<-coefficients(reg)
int1<-coeff[1]
slo<-coeff[2]
ggplot(normalLength, aes(x=FPKM, y=expected_count))+
  geom_point(position = 'jitter', alpha= 0.5)+
  geom_abline(intercept = int1, slope=slo, color= "skyblue", linewidth=3, alpha=0.1)+
theme_bw()

exploring the outliers by zooming in on FMPK and count data

hiCount<- myGex |>
  filter(expected_count >200000, FPKM<50000)
ggplot(hiCount, aes(x=FPKM, y=expected_count))+
  geom_point(aes(color=gene_id))+
  theme_bw()

Further exploring the data and creating 3 unique plots

Number 1: I wanted to first look at the relationship between TPM and length from the highly expressed genes so I made a box plot and violin plot. the boxplot allowed us to see the number of outliers and the violin plot shows us the more concentrated sizes of TPM and lengths. From the graph we can see the TPM is concentrated under 1x10^5th (100,000) with lengths concentrated between ~650-1100. There are two outliers on this plot above 200,000 TPM with lengths of ~875. This means that there are two genes that have very high TPM compared to the other genes, these two outliers are also nearly identical in length and do fall under the size that is most concentrated as seen in the violin plot.

Tutorial: create a ggplot by telling R studio which data set we want to use, decide which type of plot (box plot, violin plot, etc), label the axes with column names from the data set (length, TPM, etc), then pick a color to fill in the plot (otherwise it is just all white and less appealing). I also removed the legend title because we don’t need one for these images as it is all just one color.

ggplot(data = hiEx) +
 geom_violin (aes(x = length, y = TPM,
 fill=("Dark2")))+
theme(legend.position="none")

ggplot(data = hiEx) +
 geom_boxplot (aes(x = length, y = TPM,
 fill=("Dark2")))+
theme(legend.position="none")

Number 2: I want to dive deeper and identify the two outlier from the TPM vs length plot above. From this plot we see a very zoomed in plot of the two outleirs with high TPM counts and we see their lengths. They were not identical lengths like the boxplot made it seem, they are 1 off from one another in terms of length, and just over 3,000 off in terms of TPM.

Tutorial: I am making a new data set that is just the data set I used for the above plot, but filtered to only show hits with TPM over 100,000 since there are only two instances of that happening. I then take the new filtered data set and again plot TPM over length, this time the colors are representative of their gene IDs.

filteredhiex<- hiEx |>
  filter(TPM >100000)
ggplot(filteredhiex, aes(x=length, y=TPM))+
  geom_point(aes(color=gene_id))

Number 3: I want to compare the two highly expressed genes I found by looking at TPM over length with the two genes found in the earlier plots when looking at expected count over FPKM data. To keep data normalized, I will use TPM values. We now have a plot showing TPM over expected counts, with a line of best fit- and we see all four points are along the best fit line of TPM and expected count. these points were all outliers in their initial plots but when looked at through this assesment of TPM and expected counts, they fall in line with the expected values.

A good next step could be RNA seq analysis between samples of cells from this woman with Alzheimer’s and another similarly aged woman who does not have the same diagnosis, to observe differentially expressed genes.

Tutorial: First I want to combine data sets so that all four points are in one location. These two data sets are originally from the same data set so I know their properties and format are exactly the same, I can bind the two data sets without worry of missing values or needing to specify columns. Now that they are all in one place, I want to look at their TPM and expected_counts to see how all four compare to one another and compare to the expected values. The first chunk is the bind, the second chunk is labeling the coefficients with their respective values to create the slope, and the third chunk is creating the plot with all four genes colored by their IDs and plotted by TPM value over expected count.

highexpress <-rbind(hiCount, filteredhiex)

reg2<-lm(formula = TPM~expected_count,
data=highexpress)
coeff<-coefficients(reg2)
int2<-coeff[1]
slo2<-coeff[2]

ggplot(highexpress, aes(x=expected_count, y=TPM))+
 geom_point(aes(color=gene_id))+
  geom_abline(intercept = int2, slope=slo2, color= "darkblue", linewidth=3, alpha=0.1)+
theme_bw()

Encode Expression Data Visualization

Lenna Wolffe

2024-09-20