ENCODE Expression Data

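#Reading in the ENCODE gene expression data.
library(dplyr)    # provides filter()
library(ggplot2)  # provides ggplot() and the geom_*() layers
myGex <- read.table("ENCFF166SFX.tsv", header = TRUE, sep = "\t")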

#Plotting the distribution of gene lengths, restricted to genes shorter than 10,000 bp.
normal_lengths <- myGex |>
  filter(length < 10000)
ggplot(normal_lengths, aes(x = length)) +
  geom_histogram(binwidth = 500, color = "black", fill = "blue")

The histogram shows that, of the almost 60,000 genes observed, the majority are between 0 and 1,000 base pairs in length.
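
One way to back up this reading of the histogram with a number is to compute the share of genes below a length cutoff directly. The following is a minimal sketch, assuming a data frame with a numeric `length` column like `myGex`; the helper `share_below` and the `toy` data frame are hypothetical, and the toy lengths are made up for illustration.

```r
# Hypothetical helper: fraction of genes shorter than `cutoff` base pairs.
share_below <- function(df, cutoff = 1000) {
  mean(df$length < cutoff)
}

# Toy stand-in for myGex (made-up lengths, not ENCODE values).
toy <- data.frame(
  gene_id = paste0("gene", 1:5),
  length  = c(300, 750, 980, 2400, 15000),
  stringsAsFactors = FALSE
)

# Three of the five toy genes are under 1,000 bp.
toy_share <- share_below(toy)
print(toy_share)
```

On the real data, `share_below(myGex)` would give the proportion over all genes, including those above the 10,000 bp plotting cutoff.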

#Plotting the highest expressed genes by FPKM (fragments per kilobase per million mapped reads).
hiEX <- myGex |>
  filter(FPKM > 10000)
ggplot(hiEX, aes(x = gene_id, y = FPKM, fill = FPKM)) +
  geom_col() +
  scale_fill_gradient(low = "skyblue", high = "navy") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1))

The plot shows the six most highly expressed genes, those with an FPKM above 10,000. Of the six, two genes stand out with FPKM values far higher than the other four.
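
To put a number on how far the top genes stand out, the filtered genes can be ranked by FPKM and compared with the lowest gene that passed the filter. This is a minimal sketch, assuming a data frame with `gene_id` and `FPKM` columns like `hiEX`; the `toy_hi` data frame, its gene names, and its values are made up for illustration and are not ENCODE results.

```r
# Toy stand-in for hiEX (made-up FPKM values, not ENCODE results).
toy_hi <- data.frame(
  gene_id = c("geneA", "geneB", "geneC", "geneD"),
  FPKM    = c(12000, 80000, 15000, 60000),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest FPKM.
ranked <- toy_hi[order(toy_hi$FPKM, decreasing = TRUE), ]

# Fold difference relative to the lowest gene that passed the filter.
ranked$fold_vs_min <- ranked$FPKM / min(ranked$FPKM)
print(ranked)
```

Applied to `hiEX`, the `fold_vs_min` column would show how many times higher each gene's FPKM is than the weakest of the six.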

#Plotting the highest expressed genes by FPKM vs TPM (transcripts per million).
ggplot(hiEX, aes(x = FPKM, y = TPM)) +
  geom_point(aes(colour = gene_id))

The plot shows the same six genes, now with FPKM plotted against TPM. The two genes that stood out as outliers before once more show far higher expression values than the other four genes. Based on the data, we can assume that these two genes play the largest role in Alzheimer’s and are a potential avenue for further research and study.
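
The agreement between the FPKM and TPM rankings is expected if, as is standard for these units, both columns are computed from the same sample: TPM is then FPKM rescaled so that the values sum to one million. The following is a minimal sketch of that relationship, using a hypothetical helper `fpkm_to_tpm` and made-up values rather than the ENCODE data.

```r
# Hypothetical helper: convert FPKM to TPM within one sample by
# rescaling the values so that they sum to one million.
fpkm_to_tpm <- function(fpkm) {
  fpkm / sum(fpkm) * 1e6
}

# Made-up FPKM values for illustration only.
toy_fpkm <- c(geneA = 10, geneB = 30, geneC = 60)
toy_tpm  <- fpkm_to_tpm(toy_fpkm)
print(toy_tpm)

# The ratio TPM / FPKM is the same for every gene, so the points in an
# FPKM vs TPM scatter plot fall on a straight line through the origin.
print(toy_tpm / toy_fpkm)
```

Under this assumption, a gene that is an outlier in FPKM is necessarily an outlier in TPM as well, so the second plot confirms the first rather than providing independent evidence.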

John Beliveau

2024-09-10