Description

Please answer the following questions as part of the narrative in your project report. For each question, make sure that the corresponding code is included along with your results.

What is the average amount of missing data per individual?

library(dartR)
Loading required package: ggplot2
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: dartR.data
**** Welcome to dartR.data [Version 1.0.8 ] ****

Registered S3 method overwritten by 'pegas':
  method      from
  print.amova ade4
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Registered S3 method overwritten by 'genetics':
  method      from 
  [.haplotype pegas
**** Welcome to dartR [Version 2.9.7 ] ****

Attaching package: ‘dartR’

The following objects are masked from ‘package:dartR.data’:

    bandicoot.gl, possums.gl, testset.gl, testset.gs

Setting the working directory

print(genlight_obj)
 /// GENLIGHT OBJECT /////////

 // 77 genotypes,  5,308 binary SNPs, size: 2.2 Mb
 226944 (55.53 %) missing data

 // Basic content
   @gen: list of 77 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 // Optional content
   @ind.names:  77 individual labels
   @loc.names:  5308 locus labels
   @loc.all:  5308 alleles
   @chromosome: factor storing chromosomes of the SNPs
   @position: integer storing positions of the SNPs
   @pop: population of each individual (group size range: 77-77)
   @other: a list containing: loc.metrics  loc.metrics.flags  verbose  history  ind.metrics 

Questions:

Q1:

How many individuals are in the dataset? There are 77 individuals after filtering (see the shell command below). The average missing data per individual in the matrix is 0.5553 (55.53%). At least 25 individuals are candidates for exclusion under a threshold of 50% missing data per individual.

vcftools --vcf temp_filtered.vcf --minDP 10 --max-meanDP 40 --max-missing 0.2 --min-alleles 2 --max-alleles 2 --remove-indels --recode --out final_filtered_variants
print(id_indiv_discard)
AC1-AC16_0509_023.q10.sorted.bam 
                              25 
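
The missing-data figures above can be reproduced with a few lines of base R. This is a minimal sketch on a hypothetical toy matrix, not the code used for the report; with a genlight object, a matrix of this shape could be obtained with `as.matrix(genlight_obj)`. Because every individual is scored at the same number of SNPs, the mean of the per-individual missing fractions equals the overall missing fraction printed for the genlight object.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs,
# entries = count of the alternate allele (0, 1, 2), NA = missing call.
geno <- matrix(c( 0,  1, NA,  2,
                 NA, NA,  1,  0,
                  2, NA, NA, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ind1", "ind2", "ind3"),
                               paste0("snp", 1:4)))

# Fraction of missing calls for each individual (row).
miss_per_ind <- rowMeans(is.na(geno))

# Average missing data per individual.
mean_missing <- mean(miss_per_ind)

# Individuals above a 50% missing-data threshold. which() returns the row
# positions, labelled with the individual names.
to_discard <- which(miss_per_ind > 0.5)

# Arithmetic check on the figures printed for the genlight object:
# 226944 missing calls out of 77 individuals x 5308 SNPs.
reported_fraction <- 226944 / (77 * 5308)
print(round(reported_fraction, 4))
```

Note that `which()` reports positions rather than counts; `length(to_discard)` gives the number of individuals that exceed the threshold.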

Q2:

How did you choose to filter your data, and how many SNPs do you retain after filtering? The data were filtered with a threshold of 80%, meaning that individuals with at least 80% of their data are retained and the others are excluded. After filtering, 333 SNPs are retained.

Why did you choose to use the filters that you did with these data? The filter threshold was selected to exclude low-quality SNPs, since substantial missing data can mislead the results and their interpretation.

What are we trying to mitigate by using these filters? The 80% filter threshold removes any individual that does not have at least 80% of its data, leaving a high-quality set of individuals. The monomorphic-loci filter removes SNPs with no variation, which is useful for focusing on the loci that contribute to genetic diversity.

print(paste("# of SNPs retained after filters:", num_snps_retained))
[1] "# of SNPs retained after filters: 333"
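
The two filters described above (an 80% call-rate threshold on individuals, then removal of monomorphic SNPs) can be sketched in base R as follows. The matrix and the helper `is_monomorphic()` are hypothetical and for illustration only; this is not the filtering code that produced the 333 SNPs.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs.
geno <- matrix(c(0,  1,  2,  0,  1,
                 0,  1, NA,  0, NA,
                 0, NA, NA, NA, NA,
                 0,  2,  2,  0,  0,
                 0,  0,  1,  0,  1),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("ind", 1:5), paste0("snp", 1:5)))

# Step 1: keep individuals with at least 80% of their genotypes called.
call_rate_ind <- rowMeans(!is.na(geno))
keep_ind <- call_rate_ind >= 0.8
geno_ind <- geno[keep_ind, , drop = FALSE]

# Step 2: drop monomorphic SNPs, i.e. loci where every called genotype is
# homozygous for the same allele (or where nothing was called at all).
is_monomorphic <- function(g) {
  g <- g[!is.na(g)]
  length(g) == 0 || all(g == 0) || all(g == 2)
}
mono <- apply(geno_ind, 2, is_monomorphic)
geno_filtered <- geno_ind[, !mono, drop = FALSE]

num_snps_retained <- ncol(geno_filtered)
print(paste("# of SNPs retained after filters:", num_snps_retained))
```

The order matters: a locus can become monomorphic only after low-quality individuals are removed, so the monomorphic filter is applied last. Note also that the `--max-missing` option in the vcftools command shown earlier filters sites rather than individuals, so the two thresholds are not interchangeable.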

Q3:

Are there any individuals that seem to have elevated heterozygosity in the sampled population? Please provide a plot that demonstrates this. At least three individuals in the data frame have elevated heterozygosity.

Why might an individual with elevated heterozygosity be concerning? Unusually elevated heterozygosity might be due to contamination or external genetic influences, and it could bias the outcomes if not accounted for.

#1. We calculate the heterozygosity (het); the report below is averaged across loci for each population
het_data <- gl.report.heterozygosity(genlight_obj)
Starting gl.report.heterozygosity 
  Processing genlight object with SNP data
  Calculating Observed Heterozygosities, averaged across 
                    loci, for each population
  Calculating Expected Heterozygosities
Completed: gl.report.heterozygosity 
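
Since the report above is averaged per population, a per-individual view is needed to spot individuals with elevated heterozygosity. The following is a minimal base R sketch on hypothetical data: heterozygosity is taken as the fraction of called genotypes that are heterozygous (coded 1), and "elevated" is defined here, as an assumption, by the boxplot rule (above the third quartile plus 1.5 times the IQR). The dartR documentation also describes a `method = "ind"` option for `gl.report.heterozygosity`; check that against the installed version.

```r
# Toy genotype matrix (hypothetical data): 6 individuals x 10 SNPs,
# 0/2 = homozygous, 1 = heterozygous.
geno <- matrix(0, nrow = 6, ncol = 10,
               dimnames = list(paste0("ind", 1:6), paste0("snp", 1:10)))
geno[1, 1]   <- 1
geno[2, 1:2] <- 1
geno[3, 3]   <- 1
geno[4, 4:5] <- 1
geno[5, 6]   <- 1
geno[6, 1:8] <- 1   # an individual with many heterozygous calls

# Observed heterozygosity per individual: share of called loci that are het.
het_ind <- rowMeans(geno == 1, na.rm = TRUE)

# Flag individuals above the upper boxplot fence (assumed outlier rule).
upper_fence <- quantile(het_ind, 0.75) + 1.5 * IQR(het_ind)
elevated <- het_ind[het_ind > upper_fence]
print(elevated)

# A simple plot of the same values could be drawn with:
# barplot(het_ind, las = 2, ylab = "Observed heterozygosity")
```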

Q4:

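How diverse is the sampled population? Please provide an estimate of genetic diversity and a brief explanation of the measure of diversity that you chose to use. The average expected heterozygosity is 0.137. Expected heterozygosity is the probability that two alleles drawn at random from the population at a given locus are different.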
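
As an illustration of how an average expected heterozygosity is obtained, the sketch below computes it per locus as 2pq from allele frequencies and then averages across loci. The data are hypothetical, and this is the uncorrected estimator; the value reported by dartR may include a sample-size correction, so the two need not match exactly.

```r
# Toy genotype matrix (hypothetical data): 4 individuals x 3 SNPs,
# entries = count of the alternate allele.
geno <- matrix(c(0, 0, 0,
                 1, 0, 0,
                 1, 0, 1,
                 2, 0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Alternate-allele frequency per locus (diploid: divide mean count by 2).
p <- colMeans(geno, na.rm = TRUE) / 2

# Expected heterozygosity per locus under Hardy-Weinberg: 2pq.
he_locus <- 2 * p * (1 - p)

# Observed heterozygosity per locus: share of heterozygous genotypes.
ho_locus <- colMeans(geno == 1, na.rm = TRUE)

mean_he <- mean(he_locus)
mean_ho <- mean(ho_locus)
print(c(He = mean_he, Ho = mean_ho))
```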

Would you make a management recommendation based solely on this diversity estimate? Expected and observed heterozygosity alone might not be enough to make a management decision, but they are a good baseline for deciding what kind of data to collect next.

Why or why not? If not, what further information would you want before making a recommendation? We should know more about the population size, the structure of the population and environmental factors. Those threats and pressures would define the

ggplot(diversity_df, aes(x = Heterozygosity)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Genetic Diversity (Obs. Het)",
       x = "Heterozygosity",
       y = "Frequency")
Error in `geom_histogram()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'Heterozygosity' not found
Backtrace:
  1. base (local) `<fn>`(x)
  2. ggplot2:::print.ggplot(x)
  4. ggplot2:::ggplot_build.ggplot(x)
  5. ggplot2:::by_layer(...)
 12. ggplot2 (local) f(l = layers[[i]], d = data[[i]])
 13. l$compute_aesthetics(d, plot)
 14. ggplot2 (local) compute_aesthetics(..., self = self)
 15. base::lapply(aesthetics, eval_tidy, data = data, env = env)
 16. rlang (local) FUN(X[[i]], ...)
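
The error above means that `diversity_df` has no column named `Heterozygosity`. The actual column names are not shown in this report, so the sketch below uses a hypothetical data frame with a column called `Ho` and a hypothetical helper, `pick_column()`, that fails with a readable message listing the available columns. Base R `hist()` is used so the example runs without extra packages.

```r
# Hypothetical stand-in for diversity_df; the real column names may differ.
diversity_df <- data.frame(ind = paste0("ind", 1:5),
                           Ho  = c(0.11, 0.12, 0.16, 0.13, 0.41))

# Return the first candidate column that exists, or stop with the list of
# columns that are actually available.
pick_column <- function(df, candidates) {
  found <- candidates[candidates %in% names(df)]
  if (length(found) == 0) {
    stop("None of the columns ", paste(candidates, collapse = ", "),
         " found; available columns: ", paste(names(df), collapse = ", "))
  }
  found[1]
}

het_col <- pick_column(diversity_df, c("Heterozygosity", "Ho"))

# Bin the values without drawing (plot = FALSE) using the same 0.05 bin width.
bins <- hist(diversity_df[[het_col]],
             breaks = seq(0, 0.5, by = 0.05), plot = FALSE)
print(bins$counts)

# With ggplot2, the chosen column can be passed as aes(x = .data[[het_col]]).
```

Running `names(diversity_df)` before plotting is the quickest way to see which column holds the heterozygosity values.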

Q5:

Are the majority of loci in this population in Hardy-Weinberg equilibrium (HWE) overall? The majority of the loci tested depart from HWE. This could reflect an issue in the data or the choice of 0.05 as the p-value cutoff in the filter. The filter was applied to the “Prob” column, which holds the p-value.

If not, how do you interpret its departure from HWE? What could this mean about the population or sampling scheme? If the departure is a real feature of the dataset, the population may be highly inbred and perhaps declining in size. Environmental structure may also contribute to the lack of heterozygosity. On the other hand, non-random sampling could influence the outcome even if the bioinformatic processing is done carefully. Both are possible explanations for the departure from HWE.

print(paste("loci distal from HWE:", num_hwe_loci))
[1] "loci distal from HWE: 1153"
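
To make the p-value cutoff concrete, here is a minimal sketch of a per-locus HWE test using a chi-square goodness-of-fit statistic on genotype counts. The function `hwe_chisq()` and the counts are hypothetical, and the chi-square test is swapped in for illustration; dartR's own HWE report may use a different (for example, exact) test. With many loci tested at 0.05, some departures are expected by chance alone, so a multiple-testing adjustment is shown as well.

```r
# Chi-square HWE test for one biallelic, polymorphic locus, given counts of
# the three genotypes. Returns the statistic and its p-value (1 d.f.).
hwe_chisq <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  p <- (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
  q <- 1 - p
  observed <- c(n_AA, n_Aa, n_aa)
  expected <- n * c(p^2, 2 * p * q, q^2)
  stat <- sum((observed - expected)^2 / expected)
  c(chisq = stat, p_value = pchisq(stat, df = 1, lower.tail = FALSE))
}

in_hwe  <- hwe_chisq(25, 50, 25)   # matches HWE proportions exactly
deficit <- hwe_chisq(40, 20, 40)   # strong heterozygote deficit

p_values <- c(in_hwe[["p_value"]], deficit[["p_value"]])

# Loci departing from HWE at 0.05, before and after a Benjamini-Hochberg
# adjustment for multiple testing.
num_departing     <- sum(p_values < 0.05)
num_departing_adj <- sum(p.adjust(p_values, method = "BH") < 0.05)
print(c(raw = num_departing, adjusted = num_departing_adj))
```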

Q6:

Please provide a PCA plot visualizing the genetic relatedness among individuals. What stands out to you in this visualization? We have three clear clusters: two are tightly packed, while the third shows some spread. The two tightly packed clusters suggest genetic similarity within each, perhaps because the individuals come from the same ecosystem. It could also suggest shared ancestry.

There are a couple of outliers that might carry some interesting genetic diversity, or that might be separated by physical barriers with few opportunities for migration.
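
The PCA code is not included in this report. As a hedged sketch of one common approach, the example below runs `prcomp()` on a hypothetical genotype matrix after replacing missing calls with the per-SNP mean (mean imputation is an assumption here, and with over half the calls missing it can itself shape the clusters).

```r
# Toy genotype matrix (hypothetical data): two groups of three individuals
# that differ at the first four SNPs; one missing call.
geno <- rbind(
  a1 = c(0, 0,  0, 0, 1, 0),
  a2 = c(0, 0,  0, 1, 0, 0),
  a3 = c(0, 0, NA, 0, 0, 1),
  b1 = c(2, 2,  2, 2, 1, 0),
  b2 = c(2, 2,  2, 1, 0, 0),
  b3 = c(2, 2,  2, 2, 0, 1)
)
colnames(geno) <- paste0("snp", 1:6)

# Replace each missing call with the mean genotype of its SNP.
col_means <- colMeans(geno, na.rm = TRUE)
missing <- which(is.na(geno), arr.ind = TRUE)
geno[missing] <- col_means[missing[, "col"]]

# PCA on centered (unscaled) genotypes.
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# Share of variance explained by each component, and the first two scores.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
scores <- pca$x[, 1:2]
print(round(var_explained[1:2], 3))

# The scores could be plotted with:
# plot(scores, xlab = "PC1", ylab = "PC2"); text(scores, rownames(scores), pos = 3)
```

Reporting the variance explained alongside the plot helps the reader judge how much weight the visible clusters deserve.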

Q7:

What additional question(s) do you have about this dataset that would aid in your interpretation of these data? Here are a few that come to mind (they might be incorrect or out of scope):

  1. Is there evidence of population structure within the dataset? A k-means cluster assignment test suggests population structuring with the available data.

  2. How does inbreeding affect the genetic diversity observed in this population? This analysis could help to explain the clustering of individuals presented in the answer to the previous question.
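
One simple way to start on the inbreeding question is the inbreeding coefficient F = 1 - Ho/He, which is positive when there are fewer heterozygotes than expected under HWE. The sketch below uses the expected heterozygosity reported under Q4 together with a hypothetical observed value, since the observed heterozygosity is not stated in this report; `inbreeding_f()` is an illustrative helper, not a dartR function.

```r
# Inbreeding coefficient from observed and expected heterozygosity.
inbreeding_f <- function(ho, he) {
  stopifnot(he > 0)
  1 - ho / he
}

he_reported     <- 0.137  # average expected heterozygosity reported under Q4
ho_hypothetical <- 0.10   # hypothetical value, NOT taken from the report

f_est <- inbreeding_f(ho_hypothetical, he_reported)
print(round(f_est, 3))
```

A positive F on its own does not separate inbreeding from hidden population structure (the Wahlund effect), which is why the two questions above are best examined together.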

