This is an R Markdown document. The data for this analysis were collected on 23 January 2017. I have a pool of paired-end 100 bp reads from Cedrela species. These reads were obtained via hybridization capture, targeted enrichment, and short-read sequencing on the Illumina HiSeq 3000.
Here I used SPAdes de novo assembly on the reads of one individual: Cedrela odorata number 300 from Peru (coded ced132). The contigs shown here exclude chloroplast and mitochondrial sequences. Additionally, I retained only contigs that contain the hybridization probe sequence across at least 66% of the probe length (66 bp).
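The retention rule described above can be sketched in R as follows. This is only an illustration under stated assumptions: the `hits` table, its column names, and its values are hypothetical and do not come from the actual pipeline, which is not shown in this document.

```r
# Hypothetical alignment summary: one row per contig, with the probe length
# and the number of probe bases matched by the contig (made-up values).
hits <- data.frame(
  contig      = c("c1", "c2", "c3"),
  probe_len   = c(100, 100, 100),
  aligned_len = c(95, 70, 40)
)

# Minimum fraction of the probe that must be covered (66% in the text)
min_frac <- 0.66

# Fraction of each probe covered by its contig
hits$frac <- hits$aligned_len / hits$probe_len

# Keep only contigs that meet the threshold
kept <- hits[hits$frac >= min_frac, ]
kept$contig
```

With these made-up values, `c3` is dropped because only 40% of its probe is covered.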
library(ggplot2)
dist <- read.csv("ced132_reference_v1_stats.csv", header = TRUE)
head(dist)
##    NODE Length Coverage     ID
## 1 17529    156  64.1194 462336
## 2 17512    190 132.8020 462302
## 3 17507    218 166.4340 462292
## 4 17505    226 121.7880 462288
## 5 17499    256  30.1018 462276
## 6 17496    264  47.9657 462270
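The four columns look like the fields of SPAdes contig headers. As a minimal sketch, assuming the headers follow the pattern `NODE_<node>_length_<length>_cov_<coverage>_ID_<id>` (an assumption, since the step that produced the CSV is not shown here), a table of this shape could be built like this. The `headers` vector is illustrative and reuses the first two rows printed above.

```r
# Illustrative contig headers, assuming the naming pattern
# NODE_<node>_length_<length>_cov_<coverage>_ID_<id>
headers <- c(
  "NODE_17529_length_156_cov_64.1194_ID_462336",
  "NODE_17512_length_190_cov_132.802_ID_462302"
)

# Split each header on underscores; fields 2, 4, 6 and 8 hold the values
parts <- do.call(rbind, strsplit(headers, "_", fixed = TRUE))

# Convert the character fields into typed columns
parsed <- data.frame(
  NODE     = as.integer(parts[, 2]),
  Length   = as.integer(parts[, 4]),
  Coverage = as.numeric(parts[, 6]),
  ID       = as.integer(parts[, 8])
)
parsed
```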
# Histogram of contig lengths; sorting is unnecessary because
# geom_histogram() bins the values itself
ced132_len.plot <- ggplot(dist, aes(x = Length)) +
  theme_bw() +
  geom_histogram(binwidth = 30) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0, size = 10),
    axis.title = element_text(size = 8)) +
  labs(x = "Contig Length", y = "Frequency") +
  ggtitle("Length distribution CEOD 300 Peru (ced132)")
ced132_len.plot
summary(dist)
##       NODE           Length          Coverage              ID
##  Min.   :    6   Min.   : 156.0   Min.   :  0.7375   Min.   :427290
##  1st Qu.: 2696   1st Qu.: 779.0   1st Qu.: 17.0118   1st Qu.:432709
##  Median : 5540   Median : 914.0   Median : 27.7427   Median :438426
##  Mean   : 6050   Mean   : 982.5   Mean   : 32.1336   Mean   :439426
##  3rd Qu.: 8894   3rd Qu.:1079.0   3rd Qu.: 42.2763   3rd Qu.:445132
##  Max.   :17529   Max.   :4053.0   Max.   :356.4380   Max.   :462336
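NODE and ID are identifiers, so their quartiles carry little meaning. One option is to restrict the summary to Length and Coverage. The sketch below re-enters the six rows from the `head()` output so that it runs on its own; the `contigs` name is introduced here for illustration, and with the full table the same call would use the data frame read from the CSV.

```r
# First six rows from the head() output, re-entered so the block runs alone
contigs <- data.frame(
  NODE     = c(17529, 17512, 17507, 17505, 17499, 17496),
  Length   = c(156, 190, 218, 226, 256, 264),
  Coverage = c(64.1194, 132.8020, 166.4340, 121.7880, 30.1018, 47.9657),
  ID       = c(462336, 462302, 462292, 462288, 462276, 462270)
)

# Summarise only the columns that are measurements
len_cov_summary <- summary(contigs[, c("Length", "Coverage")])
len_cov_summary
```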
Save the plot (the call is commented out; uncomment it to write the file)
# ggsave("ced132_reference_v1_lendist.jpg", plot = ced132_len.plot, width = 3.5, height = 3)