Description

Please answer the following questions as part of the narrative in your project report. For each question, make sure that the corresponding code is included along with your results.

What is the average amount of missing data per individual?

library(dartR)
Loading required package: ggplot2
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: dartR.data
**** Welcome to dartR.data [Version 1.0.8 ] ****

Registered S3 method overwritten by 'pegas':
  method      from
  print.amova ade4
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Registered S3 method overwritten by 'genetics':
  method      from 
  [.haplotype pegas
**** Welcome to dartR [Version 2.9.7 ] ****

Be aware that owing to CRAN requirements and compatibility reasons not all functions of the package may run after the basic installation, as some packages could still be missing. Hence for a most enjoyable experience we recommend to run the function 
gl.install.vanilla.dartR()
This installs all missing and required packages for your version of dartR. In case something fails during installation please refer to this tutorial: https://github.com/green-striped-gecko/dartR/wiki/Installation-tutorial.

For information how to cite dartR, please use:
citation('dartR')
Global verbosity is set to: 2


**** Have fun using dartR! ****

Attaching package: ‘dartR’

The following objects are masked from ‘package:dartR.data’:

    bandicoot.gl, possums.gl, testset.gl, testset.gs

Setting the work directory

print(genlight_obj)
 /// GENLIGHT OBJECT /////////

 // 77 genotypes,  5,308 binary SNPs, size: 2.2 Mb
 226944 (55.53 %) missing data

 // Basic content
   @gen: list of 77 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 // Optional content
   @ind.names:  77 individual labels
   @loc.names:  5308 locus labels
   @loc.all:  5308 alleles
   @chromosome: factor storing chromosomes of the SNPs
   @position: integer storing positions of the SNPs
   @pop: population of each individual (group size range: 77-77)
   @other: a list containing: loc.metrics  loc.metrics.flags  verbose  history  ind.metrics 

Questions:

Q1:

How many individuals are in the dataset? There are 77 individuals after filtering them (see the {sh} below for the command line) The average missing data per individual in the matrix is 0.5553 At the very least 25 individual are candidate for exclusion under 50% of missing data per individuals.

vcftools --vcf temp_filtered.vcf --minDP 10 --max-meanDP 40 --max-missing 0.2 --min-alleles 2 --max-alleles 2 --remove-indels --recode --out final_filtered_variants
print(id_indiv_discard)
AC1-AC16_0509_023.q10.sorted.bam 
                              25 

Q2:

How did you choose to filter your data, and how many SNPs do you retain after filtering? Filtered data with a threshold of 80%, meaning individual with 80% of data are retain and others are excluded. After filters, 333 SNPs are retained.

Why did you choose to use the filters that you did with these data? The filter threshold was selected to prevent low-quality SNPs where significant missing data can mislead the results and interpretations.

What are we trying to mitigate by using these filters? The 80% filter threshold remove any individual that does not have 80% of data, creating high quality of individuals data. The monomorph filter seems to remove SNPs with no variants, which could be ideal to understand only those individuals that contribute to genetic diversity.

print(paste("# of SNPs retained aftered filteres:", num_snps_retained))
[1] "# of SNPs retained aftered filteres: 333"

Q3

Are there any individuals that seem to have elevated heterozygosity in the sampled population? Please provide a plot that demonstrates this. At least three individuals have elevated heterozygosity on the data frame.

Why might an individual with elevated heterozygosity be concerning? Unusual elevated heterozygosity might be due to contaminants, external genetic influences and/or biases on the outcomes if ot accounted for.

#1. We calculate the heterogeneity (het) per individuals
het_data <- gl.report.heterozygosity(genlight_obj)
Starting gl.report.heterozygosity 
  Processing genlight object with SNP data
  Calculating Observed Heterozygosities, averaged across 
                    loci, for each population
  Calculating Expected Heterozygosities
Completed: gl.report.heterozygosity 

Q4

How diverse is the sampled population? Please provide an estimate of genetic diversity and a brief explanation of the measure of diversity that you chose to use. The expected heterozygosity average is 0.137.

Would you make a management recommendation based solely on this diversity estimate? The expected and observed genetic diversity might not be enough to make management decision, but it is a good base line to make decision on what kind of data to collect.

Why or why not? If not, what further information would you want before making a recommendation? We should know more about the population size, the structure of the population and environmental factors. Those threats and pressures would define the

ggplot(diversity_df, aes(x = Heterozygosity)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Genetic Diversity (Obv. Het)",
       x = "Heterozygosity",
       y = "Frequency")
Error in `geom_histogram()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'Heterozygosity' not found
Backtrace:
  1. base (local) `<fn>`(x)
  2. ggplot2:::print.ggplot(x)
  4. ggplot2:::ggplot_build.ggplot(x)
  5. ggplot2:::by_layer(...)
 12. ggplot2 (local) f(l = layers[[i]], d = data[[i]])
 13. l$compute_aesthetics(d, plot)
 14. ggplot2 (local) compute_aesthetics(..., self = self)
 15. base::lapply(aesthetics, eval_tidy, data = data, env = env)
 16. rlang (local) FUN(X[[i]], ...)

Q5:

Are the majority of loci in this population in Hardy-Weinberg equilibrium (HWE) overall? The majority of the loci tested are distal of the HWE. It could be a issue in the data or from the filter function of p-value 0.05. The filter function was set with colunm “Prob” for p-value.

If not, how do you interpret its departure from HWE? What could this mean about the population or sampling scheme? If it is actually an outcome of the dataset, we can consider the population to be highly inbreeding and perhaps the population size decline. The environmental structure factors in for the lack of heterozygosity. On the other hand, sampling randomization could influence in the outcome if the bioinformatics are process carefully. Both are possibilities for the distal from HWE.

print(paste("loci distal from HWE:", num_hwe_loci))
[1] "loci distal from HWE: 1153"

Q6:

Please provide a PCA plot visualizing the genetic relatedness among individuals. What stands out to you in this visualization? We have three clear clusters, two of the three more packed, while the third has some deviation. The two more packed suggested genetic similarity and perhaps it comes from the same ecosystem. Also, it coud suggest shared ancestry.

There are a couple of outlines that might have some interesting genetic diversity or fragmented by phisical varries with little migration opportunities.

Q7:

What additional question(s) do you have about this dataset that would aid in your interpretation of these data? Yes: here are three that come to mind: (they might be incorrect or out of the scoepe) 1. Is there evidence of population structure within the dataset? k-mean cluster assigment test suggests population structuring with the available data.

  1. How does inbreeding affect the genetic diversity observed in this population? This analysis could help to understand the clustering of the individuals that has been presented on previous question answer.

---
title: "Porject 1"
subtitle: "Bioinformatics for Coservation"
author: "Nkwi Flores"
output: html_notebook
---

# Description
Please answer the following questions as part of the narrative in your project report. For each question, make sure that the corresponding code is included along with your results.


What is the average amount of missing data per individual?
```{r}
# Install necessary packages (if not already installed)
library(tidyverse)
library(dartR)
library(adegenet)

```
Setting the work directory
```{r}
setwd("~/Library/CloudStorage/OneDrive-UniversityofArizona/Documents/sandbox/bioinformatics/data/")

#Load the vcf file "final_filtered_variants.recode.vcf"
#We use the tool genlight_obj
vcf_file <- "/Users/nkf/Library/CloudStorage/OneDrive-UniversityofArizona/Documents/sandbox/bioinformatics/data/final_filtered_variants.recode.vcf"
genlight_obj <- gl.read.vcf(vcf_file)

print(genlight_obj)

```

# Questions:
### Q1:
**How many individuals are in the dataset?**
There are 77 individuals after filtering them (see the {sh} below for the command line)
The average missing data per individual in the matrix is 0.5553
At the very least 25 individual are candidate for exclusion under 50% of missing data per individuals.
```{sh}
vcftools --vcf temp_filtered.vcf --minDP 10 --max-meanDP 40 --max-missing 0.2 --min-alleles 2 --max-alleles 2 --remove-indels --recode --out final_filtered_variants
```

```{r, label = Q1}
#1. We count the number of inidviduals
num_indiv <- nInd(genlight_obj)
print(paste("number of individuals", num_indiv))

#2. We extracted the genotype as a matrix because some functions not working for this version of R.
genotype_matrix <- as.matrix(genlight_obj)
#3. We try to calculate the missing data per individual
missing_data_individual <- apply(genotype_matrix, 1, function(x) mean(is.na(x)))
#4. We estimated the average missing data on the matrix
avg_missing_data <-mean(missing_data_individual)
print(paste("average missing data per individual in the matrix:", round(avg_missing_data, 4)))

#5. we try to id individuals to discard due to missing data. We set the threshold low at 50% missing data 
threshold <- 0.5
id_indiv_discard <- which(missing_data_individual > threshold)
print("individual with more the % missing data")
print(id_indiv_discard)

```
```{r}

```

### Q2:
**How did you choose to filter your data, and how many SNPs do you retain after filtering?**
Filtered data with a threshold of 80%, meaning individual with 80% of data are retain and others are excluded.
After filters, 333 SNPs are retained.

**Why did you choose to use the filters that you did with these data?**
The filter threshold was selected to prevent low-quality SNPs where significant missing data can mislead the results and interpretations.

**What are we trying to mitigate by using these filters?**
The 80% filter threshold remove any individual that does not have 80% of data, creating high quality of individuals data. The monomorph filter seems to remove SNPs with no variants, which could be ideal to understand only those individuals that contribute to genetic diversity.

```{r, labe = Q2}
#1. retain individuals with enough data. set threshold at 80%.
genlight_filtered <- gl.filter.callrate(genlight_obj, threshold = 0.8)
genlight_filtered <- gl.filter.monomorphs(genlight_filtered)
#2. Count retain SNPs
num_snps_retained <- nLoc(genlight_filtered)
print(paste("# of SNPs retained aftered filteres:", num_snps_retained))
```
### Q3
**Are there any individuals that seem to have elevated heterozygosity in the sampled population? Please provide a plot that demonstrates this.**
At least three individuals have elevated heterozygosity on the data frame.

**Why might an individual with elevated heterozygosity be concerning?**
Unusual elevated heterozygosity might be due to contaminants, external genetic influences and/or biases on the outcomes if ot accounted for.

```{r, label = Q3}
#1. We calculate the heterogeneity (het) per individuals
het_data <- gl.report.heterozygosity(genlight_obj)

#2. we convert this het_data to data frame 
het_df <- as.data.frame(het_data)
colnames(het_df) <- c("Individual", "Heterozygosity")

#3. Plot the het_df (heterozygosity data frame)
ggplot(het_df, aes(x = Individual, y = Heterozygosity)) +
  geom_bar(stat = "identity") +
  labs(title = "Heterozygosity per Individual",
       x = "Individual",
       y = "Heterozygosity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#4. tag the individual with unusual heterozygosity
mean_het <- mean(het_df$Heterozygosity)
sd_het <- sd(het_df$Heterozygosity)
threshold <- mean_het + 2 * sd_het #sets the threshold to id the elevated heterozygosity per individual

colnames(het_df)

#5. id individual with above threshold heterozygosity
elevated_het <- het_df %>%
  filter(Heterozygosity > threshold)
print("individual with elevated heterozygosy")
print(elevated_het)
```
### Q4
**How diverse is the sampled population? Please provide an estimate of genetic diversity and a brief explanation of the measure of diversity that you chose to use.**
The expected heterozygosity average is 0.137. 

**Would you make a management recommendation based solely on this diversity estimate?**
The expected and observed genetic diversity might not be enough to make management decision, but it is a good base line to make decision on what kind of data to collect.

**Why or why not? If not, what further information would you want before making a recommendation?**
We should know more about the population size, the structure of the population and environmental factors. Those threats and pressures would define the 

```{r, label = Q4}
#1. Genetic diversity calculation. Ho = observed heterozygosity, which is the proportion of measure genetic diversity.
genetic_diversity <- gl.basic.stats(genlight_obj)$Ho
#print(genetic_diversity)

#2. Alternative expected het calculation. He = expected heterozygosity is what is expected under the Hardy-Weinberg.
expected_het <- mean(gl.basic.stats(genlight_obj)$Hs, na.rm = TRUE)
print(paste("expected heterozysosity (het):", round(expected_het, 4)))

#3. try to visualize the data as data frame.
diversity_df <- data.frame(Heterozygosity = genetic_diversity)
ggplot(diversity_df, aes(x = Heterozygosity)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Genetic Diversity (Obv. Het)",
       x = "Heterozygosity",
       y = "Frequency")
```
## Q5:
**Are the majority of loci in this population in Hardy-Weinberg equilibrium (HWE) overall?**
The majority of the loci tested are distal of the HWE. It could be a issue in the data or from the filter function of p-value 0.05. The filter function was set with colunm "Prob" for p-value.

**If not, how do you interpret its departure from HWE? What could this mean about the population or sampling scheme?**
If it is actually an outcome of the dataset, we can consider the population to be highly inbreeding and perhaps the population size decline. The environmental structure factors in for the lack of heterozygosity. 
On the other hand, sampling randomization could influence in the outcome if the bioinformatics are process carefully. Both are possibilities for the distal from HWE.
```{r, label = Q5}
#1. test for the hardy-Weinberg equilibrium. It is a result for each loci in the population with a default p-value of 0.05. 
hwe_results <- gl.report.hwe(genlight_obj)

#2. we count for the number of loci distal from hardy-weighbridge 
hwe_deviation <- hwe_results %>%
  filter(Prob < 0.05)

#3. Counting the number of HWE per loci that are distal 
num_total_loci <- nrow(hwe_results)
num_hwe_loci <- nrow(hwe_deviation)
num_in_hwe <- num_total_loci - num_hwe_loci

print(paste("total loci tested:", num_total_loci))
print(paste("loci in HWE:", num_in_hwe))
print(paste("loci distal from HWE:", num_hwe_loci))
```
## Q6:
**Please provide a PCA plot visualizing the genetic relatedness among individuals. What stands out to you in this visualization?**
We have three clear clusters, two of the three more packed, while the third has some deviation. The two more packed suggested genetic similarity and perhaps it comes from the same ecosystem. Also, it coud suggest shared ancestry.

There are a couple of outlines that might have some interesting genetic diversity or fragmented by phisical varries with little migration opportunities. 

```{r, label = Q6}
#loading additional library 
library(ggplot2)

#1. Principal coordinates analysis (PCA)
# PCA summary the genetic variation among individuals
pca_result <- gl.pcoa(genlight_obj)

#2. plotting PCA from the extract and prepared data 
pca_df <- as.data.frame(pca_result$scores) #coverse the PCA scores to data frames with each row for each individual
colnames(pca_df) <- c("PC1", "PC2") #rename columns for easy plotting (Principal Component)
pca_df$Individual <- rownames(pca_df) #added individual identifiers 

#3. plotting the genetic relatedness among individuals
ggplot(pca_df, aes(x = PC1, y = PC2, label = Individual)) +
  geom_point(color = "blue", size = 3) +
  #geom_text(vjust = -0.5, hjust = 0.5, size = 3) +
  labs(title = "PCA of genetic relatedness among individuals",
       x = "PC1",
       y = "PC2")
```
## Q7:
What additional question(s) do you have about this dataset that would aid in your interpretation of these data?
Yes: here are three that come to mind: (they might be incorrect or out of the scoepe)
1.  Is there evidence of population structure within the dataset?
k-mean cluster assigment test suggests population structuring with the available data.

2. How does inbreeding affect the genetic diversity observed in this population?
This analysis could help to understand the clustering of the individuals that has been presented on previous question answer.
```{r, label = Q7}
#1. Use the pca_df from question 6. The k-mean clustered based on k = 2
kmean_result <- kmeans(pca_df[, c("PC1", "PC2")], centers = 2)

# cluster added to date frame
pca_df$Cluster <- as.factor(kmean_result$cluster)

# plot the pca_df assigned to cluster
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-mean cluster from PCA",
       x = "PC1",
       y = "PC2") +
  theme_minimal()

#2. inbreeding coefficient (Fis)
obs_het <- gl.basic.stats(genlight_obj)$Ho
expt_het <- gl.basic.stats(genlight_obj)$Hs
fis_values <- 1 - (obs_het / expt_het)

#replace NaN values (divided by zero) with NA for clarity 
fis_values[is.nan(fis_values)] <- NA
#length(fis_values)
#head(fis_values)
#names(fis_values) <- names(obs_het)
#names(fis_values) <- names(expt_het)
# create the date frame for analysis 
fis_values <- as.vector(fis_values)
names(fis_values) <- names(obs_het)
fis_df <- data.frame(Locus = names(fis_values), Fis = fis_values)
head(fis_df)
```


