Homework-03

PROBLEM 1

[u1014248@malibu homework-03]$ curl https://raw.githubusercontent.com/quinlan-lab/sllobs-biostats/master/data/lecture-03/airway_scaledcounts.subset.euro.fixed.tsv > genecounts.tsv

PROBLEM 2

[u1014248@malibu homework-03]$ grep -v “NA” genecounts.tsv | grep -v “0,0” > genecounts.clean.tsv

INTERLUDE

(base) abbieireland@AbbiesNwMacbook ~ % scp u1014248@malibu.genetics.utah.edu:/home/u1014248/homework-03/genecounts.clean.tsv /Users/abbieireland/Desktop/COMPHW3

PROBLEM 3

After creating an Rscript titled, HW3 R.R, setting working directory to where I saved the genecounts.clean.tsv file pulled from the malibu server.

dir<-setwd("~/Desktop/COMPHW3/")
getwd()

## [1] "/Users/abbieireland/Desktop/COMPHW3"

list.files(dir)

## [1] "genecounts.clean.tsv" "HW3 R.R"              "u1014248.HW3.html"   
## [4] "u1014248.HW3.Rmd"

PROBLEM 4

Now, reading in the genecounts.clean.tsv file as a data.frame called gc.

gc<-read.table("genecounts.clean.tsv", header = TRUE, dec=",")
head(gc)

##           ensgene control_1 treated_1 control_2 treated_2
## 1 ENSG00000000419       467       523       616       371
## 2 ENSG00000000457       347       258       364       237
## 3 ENSG00000000460        96        81        73        66
## 4 ENSG00000000971      3413      3916      6000      4308
## 5 ENSG00000001167       426       295       531       178
## 6 ENSG00000001460       191       182       287       137

PROBLEM 5

Determining how many genes are in the data.frame, using either length of the ensgene column or the dimension of the whole data.frame.

length(gc$ensgene)

## [1] 11851

dim(gc)

## [1] 11851     5

ANSWER: 11851 genes in this data.frame

PROBLEM 6

Isolating the row of the data.frame, gc, that includes the max cDNA count for control_2.

newgc<-gc[gc$control_2 %in% max(gc$control_2),]
newgc

##              ensgene control_1 treated_1 control_2 treated_2
## 2568 ENSG00000115414    251407    217546    510107    304818

ANSWER: The gene with the max cDNA counts in control_2 is ENSG00000115414

PROBLEM 7

Counting the total cDNA counts for each condition in the data frame, gc.

sum_test<-data.frame(control_1=sum(gc$control_1), control_2=sum(gc$control_2), treated_1=sum(gc$treated_1), treated_2=sum(gc$treated_2))
sum_test

##   control_1 control_2 treated_1 treated_2
## 1  15138099  18375142  13766158  11054846

PROBLEM 8

Adding new columns to data.frame representing the average of the control and treated samples.

gc$avg_ctrl<- (gc$control_2 + gc$control_1)/2
gc$avg_txd<-  (gc$treated_2 + gc$treated_1)/2
head(gc)

##           ensgene control_1 treated_1 control_2 treated_2 avg_ctrl avg_txd
## 1 ENSG00000000419       467       523       616       371    541.5   447.0
## 2 ENSG00000000457       347       258       364       237    355.5   247.5
## 3 ENSG00000000460        96        81        73        66     84.5    73.5
## 4 ENSG00000000971      3413      3916      6000      4308   4706.5  4112.0
## 5 ENSG00000001167       426       295       531       178    478.5   236.5
## 6 ENSG00000001460       191       182       287       137    239.0   159.5

Now defining x and y and plotting…

x<-gc$avg_ctrl
y<-gc$avg_txd
plot(x,y,main="HW3 : Problem 8",xlab="Average Control Counts", ylab="Average Treated Counts")
abline(0,1, col="darkturquoise")

Done.

PROBLEM 9

The scatter plot reveals a slope < the x=y slope (abline), which means there are greater control counts than treated counts per gene (as a trend). This is conistent with the total cDNA between the control samples being more abundant than the treated samples shown in Problem 7.

PROBLEM 10

First, computing the mean cDNA counts for all four conditions and the log2 ratio of treated/control and adding them as new columns to the data.frame.

rownames(gc)<-gc$ensgene
gc$ensgene<-NULL
?rowMeans
gc$avg_total<- rowMeans(gc[1:4]) # or could've not changed genes to rownames and just did gc[2:5]

gc$log2fc<-  log2(y/x)
head(gc)

##                 control_1 treated_1 control_2 treated_2 avg_ctrl avg_txd
## ENSG00000000419       467       523       616       371    541.5   447.0
## ENSG00000000457       347       258       364       237    355.5   247.5
## ENSG00000000460        96        81        73        66     84.5    73.5
## ENSG00000000971      3413      3916      6000      4308   4706.5  4112.0
## ENSG00000001167       426       295       531       178    478.5   236.5
## ENSG00000001460       191       182       287       137    239.0   159.5
##                 avg_total     log2fc
## ENSG00000000419    494.25 -0.2766865
## ENSG00000000457    301.50 -0.5224210
## ENSG00000000460     79.00 -0.2012071
## ENSG00000000971   4409.25 -0.1948143
## ENSG00000001167    357.50 -1.0166787
## ENSG00000001460    199.25 -0.5834542

Now, defining x and y and plotting with defined range for x axis..

x_ax<-gc$avg_total
y_ax<-gc$log2fc

plot(x_ax,y_ax,main="HW3 : Problem 10",xlab="Average Counts/Gene (all conditions)", ylab="Log2FC(txd/ctrl)", xlim=c(0,10000))

Done! :)