[u1014248@malibu homework-03]$ curl https://raw.githubusercontent.com/quinlan-lab/sllobs-biostats/master/data/lecture-03/airway_scaledcounts.subset.euro.fixed.tsv > genecounts.tsv
[u1014248@malibu homework-03]$ grep -v “NA” genecounts.tsv | grep -v “0,0” > genecounts.clean.tsv
(base) abbieireland@AbbiesNwMacbook ~ % scp u1014248@malibu.genetics.utah.edu:/home/u1014248/homework-03/genecounts.clean.tsv /Users/abbieireland/Desktop/COMPHW3
After creating an Rscript titled, HW3 R.R, setting working directory to where I saved the genecounts.clean.tsv file pulled from the malibu server.
dir<-setwd("~/Desktop/COMPHW3/")
getwd()
## [1] "/Users/abbieireland/Desktop/COMPHW3"
list.files(dir)
## [1] "genecounts.clean.tsv" "HW3 R.R" "u1014248.HW3.html"
## [4] "u1014248.HW3.Rmd"
Now, reading in the genecounts.clean.tsv file as a data.frame called gc.
gc<-read.table("genecounts.clean.tsv", header = TRUE, dec=",")
head(gc)
## ensgene control_1 treated_1 control_2 treated_2
## 1 ENSG00000000419 467 523 616 371
## 2 ENSG00000000457 347 258 364 237
## 3 ENSG00000000460 96 81 73 66
## 4 ENSG00000000971 3413 3916 6000 4308
## 5 ENSG00000001167 426 295 531 178
## 6 ENSG00000001460 191 182 287 137
Determining how many genes are in the data.frame, using either length of the ensgene column or the dimension of the whole data.frame.
length(gc$ensgene)
## [1] 11851
dim(gc)
## [1] 11851 5
ANSWER: 11851 genes in this data.frame
Isolating the row of the data.frame, gc, that includes the max cDNA count for control_2.
newgc<-gc[gc$control_2 %in% max(gc$control_2),]
newgc
## ensgene control_1 treated_1 control_2 treated_2
## 2568 ENSG00000115414 251407 217546 510107 304818
ANSWER: The gene with the max cDNA counts in control_2 is ENSG00000115414
Counting the total cDNA counts for each condition in the data frame, gc.
sum_test<-data.frame(control_1=sum(gc$control_1), control_2=sum(gc$control_2), treated_1=sum(gc$treated_1), treated_2=sum(gc$treated_2))
sum_test
## control_1 control_2 treated_1 treated_2
## 1 15138099 18375142 13766158 11054846
Adding new columns to data.frame representing the average of the control and treated samples.
gc$avg_ctrl<- (gc$control_2 + gc$control_1)/2
gc$avg_txd<- (gc$treated_2 + gc$treated_1)/2
head(gc)
## ensgene control_1 treated_1 control_2 treated_2 avg_ctrl avg_txd
## 1 ENSG00000000419 467 523 616 371 541.5 447.0
## 2 ENSG00000000457 347 258 364 237 355.5 247.5
## 3 ENSG00000000460 96 81 73 66 84.5 73.5
## 4 ENSG00000000971 3413 3916 6000 4308 4706.5 4112.0
## 5 ENSG00000001167 426 295 531 178 478.5 236.5
## 6 ENSG00000001460 191 182 287 137 239.0 159.5
Now defining x and y and plotting…
x<-gc$avg_ctrl
y<-gc$avg_txd
plot(x,y,main="HW3 : Problem 8",xlab="Average Control Counts", ylab="Average Treated Counts")
abline(0,1, col="darkturquoise")
Done.
The scatter plot reveals a slope < the x=y slope (abline), which means there are greater control counts than treated counts per gene (as a trend). This is conistent with the total cDNA between the control samples being more abundant than the treated samples shown in Problem 7.
First, computing the mean cDNA counts for all four conditions and the log2 ratio of treated/control and adding them as new columns to the data.frame.
rownames(gc)<-gc$ensgene
gc$ensgene<-NULL
?rowMeans
gc$avg_total<- rowMeans(gc[1:4]) # or could've not changed genes to rownames and just did gc[2:5]
gc$log2fc<- log2(y/x)
head(gc)
## control_1 treated_1 control_2 treated_2 avg_ctrl avg_txd
## ENSG00000000419 467 523 616 371 541.5 447.0
## ENSG00000000457 347 258 364 237 355.5 247.5
## ENSG00000000460 96 81 73 66 84.5 73.5
## ENSG00000000971 3413 3916 6000 4308 4706.5 4112.0
## ENSG00000001167 426 295 531 178 478.5 236.5
## ENSG00000001460 191 182 287 137 239.0 159.5
## avg_total log2fc
## ENSG00000000419 494.25 -0.2766865
## ENSG00000000457 301.50 -0.5224210
## ENSG00000000460 79.00 -0.2012071
## ENSG00000000971 4409.25 -0.1948143
## ENSG00000001167 357.50 -1.0166787
## ENSG00000001460 199.25 -0.5834542
Now, defining x and y and plotting with defined range for x axis..
x_ax<-gc$avg_total
y_ax<-gc$log2fc
plot(x_ax,y_ax,main="HW3 : Problem 10",xlab="Average Counts/Gene (all conditions)", ylab="Log2FC(txd/ctrl)", xlim=c(0,10000))
Done! :)