Description
Please answer the following questions as part of the narrative in
your project report. For each question, make sure that the corresponding
code is included along with your results.
What is the average amount of missing data per individual?
library(dartR)
**** Welcome to dartR.data [Version 1.0.8 ] ****
**** Welcome to dartR [Version 2.9.7 ] ****
Setting the work directory
print(genlight_obj)
/// GENLIGHT OBJECT /////////
// 77 genotypes, 5,308 binary SNPs, size: 2.2 Mb
226944 (55.53 %) missing data
// Basic content
@gen: list of 77 SNPbin
@ploidy: ploidy of each individual (range: 2-2)
// Optional content
@ind.names: 77 individual labels
@loc.names: 5308 locus labels
@loc.all: 5308 alleles
@chromosome: factor storing chromosomes of the SNPs
@position: integer storing positions of the SNPs
@pop: population of each individual (group size range: 77-77)
@other: a list containing: loc.metrics loc.metrics.flags verbose history ind.metrics
Questions:
Q1:
How many individuals are in the dataset? There are
77 individuals after filtering (see the vcftools command line below).
The average missing data per individual in the matrix is 0.5553 (55.53%).
At least 25 individuals are candidates for exclusion under a threshold of
50% missing data per individual.
vcftools --vcf temp_filtered.vcf --minDP 10 --max-meanDP 40 --max-missing 0.2 --min-alleles 2 --max-alleles 2 --remove-indels --recode --out final_filtered_variants
print(id_indiv_discard)
AC1-AC16_0509_023.q10.sorted.bam
25
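The missing-data figures above can be illustrated with a minimal base R sketch. It runs on a small made-up genotype matrix, not on the report's data; the names `geno`, `miss_per_ind`, `mean_miss` and `discard` are hypothetical. With a genlight object, the same matrix could presumably be obtained with `as.matrix(genlight_obj)`.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs.
# Genotypes are coded 0/1/2 (copies of the alternate allele); NA = missing call.
geno <- matrix(
  c(0,  1,  NA, 2,
    NA, NA, 1,  0,
    2,  NA, NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), paste0("snp", 1:4))
)

# Proportion of missing calls for each individual (row).
miss_per_ind <- rowMeans(is.na(geno))

# Average missingness across individuals.
mean_miss <- mean(miss_per_ind)

# Individuals with more than 50% missing data are candidates for exclusion.
discard <- names(miss_per_ind)[miss_per_ind > 0.5]

print(miss_per_ind)
print(mean_miss)
print(discard)
```

Because every individual is scored at the same number of SNPs, the mean of the per-individual proportions equals the overall proportion of missing cells, which is why the 55.53% printed for the genlight object can be read as the average per individual.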
Q2:
How did you choose to filter your data, and how many SNPs do
you retain after filtering? The data were filtered with a threshold of
80%, meaning that individuals with at least 80% of their data are retained
and the others are excluded. After filtering, 333 SNPs are retained.
Why did you choose to use the filters that you did with these
data? The filter threshold was selected to exclude low-quality
SNPs, where substantial missing data can mislead the results and their
interpretation.
What are we trying to mitigate by using these
filters? The 80% filter threshold removes any individual that
does not have at least 80% of its data, leaving only high-quality
individuals. The monomorphic filter removes SNPs that show no variation,
so that the analysis focuses only on the loci that contribute to genetic
diversity.
print(paste("# of SNPs retained after filters:", num_snps_retained))
[1] "# of SNPs retained after filters: 333"
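The two filtering steps described above can be sketched in base R on a made-up matrix. This is an illustration under stated assumptions, not the code used for the report: the matrix and the names `geno`, `min_call_rate`, `kept` and `filtered` are hypothetical. In dartR the equivalent steps would presumably go through `gl.filter.callrate()` and `gl.filter.monomorphs()`.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs.
geno <- matrix(
  c(0, 0,  1,  2,  0,
    0, 1,  1,  2,  0,
    0, NA, NA, NA, 0,
    0, 2,  0,  1,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:5))
)

min_call_rate <- 0.8  # keep individuals with at least 80% of genotypes called

# Step 1: call rate per individual = share of non-missing genotypes.
call_rate_ind <- rowMeans(!is.na(geno))
kept <- geno[call_rate_ind >= min_call_rate, , drop = FALSE]

# Step 2: drop SNPs that are monomorphic among the retained individuals,
# i.e. SNPs with a single distinct genotype once missing calls are ignored.
is_poly <- apply(kept, 2, function(g) length(unique(g[!is.na(g)])) > 1)
filtered <- kept[, is_poly, drop = FALSE]

num_snps_retained <- ncol(filtered)
print(rownames(filtered))
print(num_snps_retained)
```

Running the monomorphic filter after the individual filter matters: a SNP can become monomorphic only once the low-quality individuals carrying its rare allele have been removed.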
Q3:
Are there any individuals that seem to have elevated
heterozygosity in the sampled population? Please provide a plot that
demonstrates this. At least three individuals have elevated
heterozygosity in the data frame.
Why might an individual with elevated heterozygosity be
concerning? Unusually elevated heterozygosity might be due to
contamination or external genetic influences, and it can bias the outcomes
if not accounted for.
# 1. Calculate heterozygosity (het); as the log below shows, this call averages it across loci for each population
het_data <- gl.report.heterozygosity(genlight_obj)
Starting gl.report.heterozygosity
Processing genlight object with SNP data
Calculating Observed Heterozygosities, averaged across
loci, for each population
Calculating Expected Heterozygosities
Completed: gl.report.heterozygosity
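To spot individuals with elevated heterozygosity, one option is to compute observed heterozygosity per individual and flag values above an outlier fence. The sketch below does this in base R on made-up genotypes; the matrix, the names `ho_ind` and `elevated`, and the choice of Tukey's rule (third quartile plus 1.5 times the interquartile range) are assumptions for illustration, not the method used in the report.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs.
# 0 and 2 are homozygotes, 1 is a heterozygote, NA is a missing call.
geno <- matrix(
  c(0, 1, 0, 0, 2, 0,  0,  1,
    0, 0, 1, 0, 2, 0,  NA, 0,
    1, 1, 1, 1, 1, 0,  1,  1,
    0, 0, 0, 1, 2, NA, 0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:8))
)

# Observed heterozygosity per individual: share of called genotypes coded 1.
ho_ind <- rowMeans(geno == 1, na.rm = TRUE)

# Tukey's upper fence: third quartile + 1.5 * interquartile range.
q <- quantile(ho_ind, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Individuals above the fence are flagged as having elevated heterozygosity.
elevated <- names(ho_ind)[ho_ind > upper_fence]

print(round(ho_ind, 3))
print(elevated)
```

A quick visual check could then be made with `barplot(ho_ind)`, where flagged individuals stand out as the tallest bars.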

Q4:
How diverse is the sampled population? Please provide an
estimate of genetic diversity and a brief explanation of the measure of
diversity that you chose to use. The average expected
heterozygosity is 0.137.
Would you make a management recommendation based solely on
this diversity estimate? The expected and observed genetic
diversity might not be enough to make a management decision, but they are
a good baseline for deciding what kind of data to collect.
Why or why not? If not, what further information would you
want before making a recommendation? We should know more about
the population size, the structure of the population and environmental
factors. Those threats and pressures would define the
# Assumption: diversity_df is the data frame returned by
# gl.report.heterozygosity(), in which observed heterozygosity is stored in
# the column "Ho". The original call mapped x to a column named
# "Heterozygosity", which diversity_df does not contain, so ggplot stopped
# with "object 'Heterozygosity' not found".
ggplot(diversity_df, aes(x = Ho)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Genetic Diversity (Obs. Het)",
       x = "Observed heterozygosity",
       y = "Frequency")
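As a brief illustration of the diversity measure used in Q4: for a biallelic SNP with allele frequencies p and q = 1 - p, expected heterozygosity is 2pq, the probability that two alleles drawn at random from the population differ. The base R sketch below computes it per locus on made-up genotypes and averages across loci; the matrix and the names `p`, `he_locus`, `ho_locus` and `mean_he` are hypothetical, and the values are unrelated to the 0.137 reported above.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = SNPs.
# Genotypes count copies of the alternate allele (0/1/2); NA = missing call.
geno <- matrix(
  c(0, 0,  0,
    1, 0,  0,
    1, 1,  0,
    2, NA, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3))
)

# Alternate-allele frequency per SNP: mean genotype divided by ploidy (2).
p <- colMeans(geno, na.rm = TRUE) / 2

# Expected heterozygosity per SNP under Hardy-Weinberg proportions: 2pq.
he_locus <- 2 * p * (1 - p)

# Observed heterozygosity per SNP: share of called genotypes that are 1.
ho_locus <- colMeans(geno == 1, na.rm = TRUE)

# Population-level estimate: average expected heterozygosity across SNPs.
mean_he <- mean(he_locus)

print(round(he_locus, 3))
print(round(ho_locus, 3))
print(round(mean_he, 3))
```

Note that monomorphic SNPs (such as `snp3` here) contribute zero and pull the average down, so the estimate depends on whether the monomorphic filter was applied first.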

Q5:
Are the majority of loci in this population in Hardy-Weinberg
equilibrium (HWE) overall? The majority of the loci tested
depart from HWE. This could reflect an issue in the data or result from
filtering at a p-value threshold of 0.05. The filter was applied to the
column “Prob”, which holds the p-value.
If not, how do you interpret its departure from HWE? What
could this mean about the population or sampling scheme? If it
is a real feature of the dataset, we can consider the population to
be highly inbred and perhaps declining in size. Environmental
structure may also contribute to the lack of heterozygosity. On
the other hand, the sampling scheme could influence the outcome even if
the bioinformatic processing is done carefully. Both are possible
explanations for the departure from HWE.
print(paste("loci distal from HWE:", num_hwe_loci))
[1] "loci distal from HWE: 1153"
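To make the HWE test concrete, here is a minimal base R sketch of a chi-square goodness-of-fit test for a single biallelic locus. It is an illustration only: the function name `hwe_chisq` and the genotype counts are hypothetical, and the report's “Prob” column may come from a different test (for example an exact test) than the chi-square used here.

```r
# Chi-square test of Hardy-Weinberg proportions for one biallelic locus.
# n_aa, n_ab, n_bb are the observed counts of the three genotypes.
# Monomorphic loci must be removed first (an expected count would be zero).
hwe_chisq <- function(n_aa, n_ab, n_bb) {
  n <- n_aa + n_ab + n_bb
  p <- (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
  q <- 1 - p                         # frequency of allele b
  expected <- c(p^2, 2 * p * q, q^2) * n
  observed <- c(n_aa, n_ab, n_bb)
  chi2 <- sum((observed - expected)^2 / expected)
  # 3 genotype classes - 1 - 1 estimated allele frequency = 1 degree of freedom.
  p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)
  c(chi2 = chi2, p_value = p_value)
}

# A locus exactly at Hardy-Weinberg proportions (hypothetical counts).
res_in_hwe <- hwe_chisq(25, 50, 25)

# A locus with a strong heterozygote deficit (hypothetical counts).
res_deficit <- hwe_chisq(45, 10, 45)

print(res_in_hwe)
print(res_deficit)
```

With thousands of loci tested at a threshold of 0.05, some loci are expected to fail by chance alone, so a correction for multiple testing is worth considering before counting departures.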
Q6:
Please provide a PCA plot visualizing the genetic relatedness
among individuals. What stands out to you in this
visualization? We have three clear clusters: two of the three are
tightly packed, while the third shows some spread. The two tighter clusters
suggest genetic similarity, and perhaps their individuals come from the same
ecosystem. They could also suggest shared ancestry.
There are a couple of outliers that might carry some interesting
genetic diversity or might be isolated by physical barriers with few
migration opportunities.
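For readers who want to see how such a PCA can be produced, the base R sketch below simulates two groups with different allele frequencies, fills missing calls with the per-SNP mean, and runs `prcomp()`. Everything in it is an assumption for illustration: the data are simulated, the names `sim_group`, `geno`, `scores` and `var_explained` are hypothetical, and the report's own plot was presumably produced with dartR rather than with this code.

```r
set.seed(1)

# Simulate genotypes (0/1/2) for n individuals given per-SNP allele frequencies.
sim_group <- function(n, freq) {
  t(replicate(n, rbinom(length(freq), size = 2, prob = freq)))
}

n_snps <- 50
geno <- rbind(
  sim_group(10, rep(0.1, n_snps)),  # hypothetical group A
  sim_group(10, rep(0.9, n_snps))   # hypothetical group B
)
rownames(geno) <- c(paste0("A", 1:10), paste0("B", 1:10))
colnames(geno) <- paste0("snp", seq_len(n_snps))

# Introduce 5% missing calls at random.
geno[sample(length(geno), 50)] <- NA

# PCA cannot handle NA: replace each missing call with that SNP's mean genotype.
col_means <- colMeans(geno, na.rm = TRUE)
na_idx <- which(is.na(geno), arr.ind = TRUE)
geno[na_idx] <- col_means[na_idx[, "col"]]

# Principal component analysis on the centred genotype matrix.
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(var_explained[1:2], 3))
print(round(scores[c("A1", "B1"), ], 2))
```

Mean imputation pulls individuals with many missing calls toward the origin of the plot, so apparent outliers or loose clusters should be checked against per-individual missingness before they are interpreted biologically.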

Q7:
What additional question(s) do you have about this dataset that would
aid in your interpretation of these data? Here are a few that come
to mind (they might be incorrect or out of scope):
1. Is there evidence of population structure within the dataset? A k-means
cluster assignment test suggests population structuring with the available
data.
2. How does inbreeding affect the genetic diversity observed in this
population? This analysis could help to understand the clustering of the
individuals presented in the answer to the previous question.
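One simple way to approach the inbreeding question is the inbreeding coefficient F = 1 - Ho/He, which is near zero when observed heterozygosity matches Hardy-Weinberg expectations and approaches one as heterozygotes become rarer than expected. The base R sketch below computes it from genotype counts at a single locus; the function name `f_from_counts` and the counts are hypothetical and are not taken from the report's data.

```r
# Inbreeding coefficient F = 1 - Ho / He for one biallelic locus,
# computed from the observed counts of the three genotypes.
f_from_counts <- function(n_aa, n_ab, n_bb) {
  n <- n_aa + n_ab + n_bb
  p <- (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
  ho <- n_ab / n                    # observed heterozygosity
  he <- 2 * p * (1 - p)             # expected heterozygosity (2pq)
  1 - ho / he
}

# Hypothetical locus at Hardy-Weinberg proportions: F is 0.
f_random <- f_from_counts(25, 50, 25)

# Hypothetical locus with a strong heterozygote deficit: F is 0.8.
f_deficit <- f_from_counts(45, 10, 45)

print(f_random)
print(f_deficit)
```

A positive F across many loci is consistent with inbreeding, but it also arises when distinct subpopulations are pooled (the Wahlund effect), which ties this question back to the population structure question above.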

