Details

This is an R Markdown document. The data for this analysis were collected on 23 January 2017. I have a pool of 100 bp paired-end sequences from Cedrela species. These sequences were obtained via hybridization capture, targeted enrichment, and short-read sequencing on the Illumina HiSeq 3000.

Here I used SPAdes for de novo assembly of the reads from one individual: Cedrela odorata number 300 from Peru (coded ced132). The contigs shown are those that remain after chloroplast and mitochondrial sequences were removed. Additionally, I retained only contigs that contained the hybridization probe sequence for at least 66% of the probe length (66 bp).
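
The probe-coverage filter described above can be sketched in R as follows. This is only an illustration, not the pipeline actually used: the `hits` table, its column names, the contig labels, and the 100 bp probe length are all assumptions made for the example.

```r
# Hypothetical probe-to-contig alignment summary.
# All names and values here are illustrative assumptions, not real data.
hits <- data.frame(
  contig       = c("contig_A", "contig_B", "contig_C"),
  probe_length = c(100, 100, 100),  # assumed probe length in bp
  aligned_bp   = c(98, 70, 40)      # probe bases matched within the contig
)

min_fraction <- 0.66  # threshold stated in the text

# Fraction of each probe that is covered by its contig alignment
hits$probe_fraction <- hits$aligned_bp / hits$probe_length

# Keep only contigs that cover at least 66% of the probe
retained <- hits[hits$probe_fraction >= min_fraction, ]
retained
```

With these made-up values, `contig_C` covers only 40% of its probe and is dropped, while the other two contigs are kept.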

library(ggplot2)
dist <- read.csv("ced132_reference_v1_stats.csv", header = TRUE)
head(dist)
##    NODE Length Coverage     ID
## 1 17529    156  64.1194 462336
## 2 17512    190 132.8020 462302
## 3 17507    218 166.4340 462292
## 4 17505    226 121.7880 462288
## 5 17499    256  30.1018 462276
## 6 17496    264  47.9657 462270
ced132_len.plot<-ggplot(dist, aes(x=Length))+
  theme_bw()+
  geom_histogram(binwidth = 30)+
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0,size = 10),
    axis.title=element_text(size=8))+
  labs(x = "Contig Length", y = "Frequency")+
  ggtitle("Length distribution CEOD 300 Peru (ced132)")
ced132_len.plot

summary(dist)
##       NODE           Length          Coverage              ID        
##  Min.   :    6   Min.   : 156.0   Min.   :  0.7375   Min.   :427290  
##  1st Qu.: 2696   1st Qu.: 779.0   1st Qu.: 17.0118   1st Qu.:432709  
##  Median : 5540   Median : 914.0   Median : 27.7427   Median :438426  
##  Mean   : 6050   Mean   : 982.5   Mean   : 32.1336   Mean   :439426  
##  3rd Qu.: 8894   3rd Qu.:1079.0   3rd Qu.: 42.2763   3rd Qu.:445132  
##  Max.   :17529   Max.   :4053.0   Max.   :356.4380   Max.   :462336

Save plot

#ggsave("ced132_reference_v1_lendist.jpg",plot=ced132_len.plot, width=3.5, height=3)