EF_assignment2_2024.knit

Genomics of Human Populations Assignment 2

Task 1: Visual evaluation of fit to Hardy-Weinberg

Q1.1 For Q1.1a to Q1.1d, consider a biallelic SNP with minor allele frequency of 0.45.

Q1.1a What is the frequency of the “major” allele at this locus? (1 point)

The frequency of the major allele is 0.55

Q1.1b What is the expected frequency of the homozygous genotype for the minor allele (assuming Hardy-Weinberg equilibrium) (1 point)?

It would be q^2 = 0.45^2 = 0.2025, or about 0.203.

Q1.1c What is the expected frequency of homozygotes for the “major” allele? (1 point)

The expected frequency would be p^2 = 0.55^2 = 0.3025, or about 0.303.

Q1.1d What is the expected frequency of heterozygotes at this locus? (1 point)

The expected frequency would be 2pq = 2 x 0.55 x 0.45 = 0.495.
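The four answers to Q1.1 can be checked with a short calculation. This is a minimal sketch in Python rather than the R used for the assignment; the function name `hwe_genotype_frequencies` is an illustrative choice, not part of the assignment code.

```python
def hwe_genotype_frequencies(q):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    q is the minor allele frequency; p = 1 - q is the major allele frequency.
    Returns (minor homozygote, heterozygote, major homozygote).
    """
    p = 1.0 - q
    return q * q, 2.0 * p * q, p * p


maf = 0.45  # minor allele frequency given in Q1.1
hom_minor, het, hom_major = hwe_genotype_frequencies(maf)

print(f"major allele frequency: {1.0 - maf:.2f}")
print(f"minor homozygote (q^2): {hom_minor:.4f}")
print(f"major homozygote (p^2): {hom_major:.4f}")
print(f"heterozygote (2pq):     {het:.4f}")
```

The three genotype frequencies sum to 1, which is a quick sanity check on the arithmetic.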

Q1.2 Now using the plot you created above to assist you, identify the color of the genotypes associated with the (a) heterozygotes, (b) homozygote for minor allele, and (c) homozygote for major allele (3 points)

heterozygote color: (p12) green.
homozygote for minor allele color: (p11) orange
homozygote for major allele color: (p22) blue.

Q1.3 Do the observed genotypic frequencies (i.e., points) roughly follow HWE expectations (black lines) in the HGDP data? (2 point)

The p11 and p22 points roughly follow HWE expectations. The p12 points in the HGDP data show more spread around the expected curve, with both excess and deficiency relative to the expected values.

Q1.4. Based on the observed genotype frequencies, do you think that the samples included in this analysis are derived from two or more highly structured populations (3 points)? Why or why not?

Because the data show both excess and deficiency relative to the expected values (the HWE lines), this suggests that 1) there is complex subpopulation structure and 2) evolutionary forces are acting on the human genomes. A natural extension of this idea is that admixture could be one of the forces at play. How the data were collected should also be considered: sampling error could lead to unexpected rises or declines in observed versus expected heterozygosity.
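One common way to put a number on heterozygote excess or deficiency is the fixation index F = 1 - H_obs / H_exp, where F > 0 indicates a deficit of heterozygotes and F < 0 an excess. The sketch below is illustrative only: the genotype counts are hypothetical and are not taken from the HGDP data, and the function name is an assumption made for this example.

```python
def fixation_index(n_hom_a, n_het, n_hom_b):
    """Return F = 1 - H_obs / H_exp from genotype counts at one biallelic SNP."""
    n = n_hom_a + n_het + n_hom_b
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_hom_a + n_het) / (2 * n)  # frequency of allele A
    h_exp = 2 * p * (1 - p)              # expected heterozygosity under HWE
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F is undefined")
    h_obs = n_het / n                    # observed heterozygosity
    return 1 - h_obs / h_exp


# Hypothetical counts (AA, Aa, aa), chosen only to show both directions.
f_deficit = fixation_index(40, 20, 40)  # fewer heterozygotes than expected
f_excess = fixation_index(10, 80, 10)   # more heterozygotes than expected

print(f"deficit example: F = {f_deficit:.2f}")
print(f"excess example:  F = {f_excess:.2f}")
```

Pooling samples from differentiated subpopulations is expected to push F above zero, so the sign of F across many SNPs is one way to follow up on the visual impression from the plot.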

Task 2: ADMIXTURE analysis of European data from the HGDP

Q2.1 Include the K=8 diagram in your report (1 point).

[Figure: ADMIXTURE diagram for K=8]

Q2.2 Which populations have individuals with a Russian ancestry component? (1 point)

HGDP00879
HGDP00880
HGDP00881
HGDP00882
HGDP00883
HGDP00884
HGDP00885
HGDP00886
HGDP00887
HGDP00888
HGDP00889
HGDP00890
HGDP00891
HGDP00892
HGDP00893
HGDP00894
HGDP00895
HGDP00896
HGDP00897
HGDP00898
HGDP00899
HGDP00900
HGDP00901
HGDP00902
HGDP00903

Q2.3. Are the Tuscan and North_Italian individuals completely distinguished as being from distinct populations? In other words, if you didn’t know the population origin of these samples, could you assign them confidently to one or the other population? Explain. (2 points)

No. Looking at the data blind to the population labels does not reveal distinct patterns. Across X1 to X8, with the exception of X4, the individuals have nearly identical values or only slight deviations from the most frequent values (for example, X1 is about 0.00001 in all individuals). X4 has no clear signature. The individuals that might help in drawing a distinction, although they give no further explanation of where they came from, are HGDP01167, HGDP01169, HGDP01153, and HGDP01155.

[Table: populations file]

Q2.4. Which two named populations seem to have the most internal population structure? (i.e., consist of individuals with two distinct ancestries that may represent two or more unrecognized populations) Explain. (2 points)

French and Orcadian, because they have more ancestral components than the other populations. I believe each shows at least 5 different colors on the ADMIXTURE graph, which corresponds to having “more” internal subpopulation structure.
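Counting colors by eye can be made explicit by counting, for each individual, how many ancestry components exceed a small cutoff. The sketch below assumes ADMIXTURE-style rows of ancestry fractions that sum to 1; the sample names, the fractions, and the 0.05 cutoff are all hypothetical and are not values from the assignment's K=8 output.

```python
def count_components(fractions, threshold=0.05):
    """Count ancestry components whose fraction is at least `threshold`."""
    return sum(1 for frac in fractions if frac >= threshold)


# Hypothetical ancestry fractions for K=8 (columns X1..X8), one row per sample.
q_rows = {
    "sample_A": [0.50, 0.30, 0.10, 0.06, 0.04, 0.00, 0.00, 0.00],
    "sample_B": [0.98, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
}

component_counts = {name: count_components(row) for name, row in q_rows.items()}

for name, n_components in component_counts.items():
    print(f"{name}: {n_components} component(s) at or above the cutoff")
```

The count depends on the cutoff, so a population with many small components can look more or less "mixed" depending on where the threshold is set.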

Q2.5b. How does the ancestry diagram differ from K=8? Please comment on which if any of the groups defined a priori by their ethnic/geographic origin that were split in K=8 are not split at K=6. (2 points).

Revisiting the previous question, the K=6 diagram is easier to read and makes it easier to rule out populations that lack internal subpopulation structure (e.g., French_Basque).
French_Basque, Sardinian, and Russian are the populations that, at K=6, lose the tracts of ancestry assigned to X7 or X8 at K=8. As a result, they appear more homogeneous in the K=6 diagram than in the K=8 diagram. The rest of the populations still retain their distinct signatures, so internal subpopulation structure remains noticeable.

Q2.5c The French appear to have mixed ancestry at both K=6 and K=8. Which sources of ancestry appear to be present at both K=6 and K=8? (3 points)

X1 and X2 appear to be present at both values of K and do not change position in the graph of genome fraction per sample. Other sources of ancestry are also present at both values; their tracts and fractions change, but some relative positions do not. These include X5 (at K=8 it appears as one segment of color, whereas at K=6 it is observed in more samples once K is reduced) and X6, which retains some similar tracts in both ADMIXTURE graphs.

Task 3: Principal Component Analysis (PCA) of population genomic data

Q3.1a How many missing data values were imputed? Hint: Check the output (in red) written to your console in the second PCA run. (1 point)

29222 missing values imputed

Q3.1b What is the percentage of variation explained by each of the first five PC axes? (2 points).

Each PC axis captures part of the variation in the data. PC1 explains the largest share (10.47532%), and each following axis explains a smaller amount: 4.348854% (PC2), 1.55274% (PC3), 1.477004% (PC4), and 1.191729% (PC5). The eigenvalues relay the proportion of total variance contained in each PC, and they too show an overall decrease after the first reported eigenvalue of 26065.09078.
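The link between an eigenvalue and its percentage can be written out directly. The sketch below assumes that the reported percentage for a PC is its eigenvalue divided by the total variance; that is the usual definition, but it is an assumption about how the PCA function used in the assignment computes it. Only the first eigenvalue and the PC1 percentage come from the answer above; the function and variable names are illustrative.

```python
def percent_explained(eigenvalue, total_variance):
    """Percentage of total variance carried by one principal component."""
    return 100.0 * eigenvalue / total_variance


first_eigenvalue = 26065.09078  # reported above for PC1
pc1_percent = 10.47532          # reported above for PC1

# Total variance implied by those two numbers (assumption: percent = eigenvalue / total).
implied_total = first_eigenvalue / (pc1_percent / 100.0)

print(f"implied total variance: {implied_total:.1f}")
print(f"PC1 check: {percent_explained(first_eigenvalue, implied_total):.5f}%")
```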

Q3.1c After reviewing the variation explained in the first 10 PC axes, which axes seem to capture the majority of the variation before the remaining axes begin to plateau? (note: you could make a “scree plot” with these values in a barplot with PC1 variation in explained as the leftmost bar, PC2 as the second bar etc. if it helps visualization) (1 point)

PC1 captures the largest share of the variation and is the most informative axis. Each axis that follows contains less information and adds less to the cumulative variance, but it still helps describe the spread of the data and gives a fuller picture of where and how the data came from. After PC1, the percentage of variation explained drops to smaller amounts and plateaus toward the end.
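The drop-off can be tabulated from the five percentages reported in Q3.1b. This is a minimal sketch in Python rather than R; it covers only PC1 to PC5 because those are the values given in this report, and it prints a rough text version of a scree plot instead of drawing one.

```python
# Percent of variation explained by PC1..PC5, as reported in Q3.1b.
percent_by_pc = [10.47532, 4.348854, 1.55274, 1.477004, 1.191729]

# Drop from each PC to the next one.
drops = [a - b for a, b in zip(percent_by_pc, percent_by_pc[1:])]

# Running total of variation explained.
cumulative = []
running = 0.0
for pct in percent_by_pc:
    running += pct
    cumulative.append(running)

# Text scree plot: one '#' per half percentage point.
for i, pct in enumerate(percent_by_pc, start=1):
    bar = "#" * round(pct * 2)
    print(f"PC{i}: {pct:9.5f}%  cumulative {cumulative[i - 1]:8.5f}%  {bar}")

for i, drop in enumerate(drops, start=1):
    print(f"drop PC{i} -> PC{i + 1}: {drop:.5f}")
```

A small drop between neighboring axes is what a plateau looks like in these numbers.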

[Figure: PCA plot]

Q3.2a Report the PCA plot in your assignment (2 points)

Q3.2b Which populations are MOST differentiated on PC1? Does this make sense in terms of geography? (2 points)

CHB and YRI. CHB is from East Asia and YRI is from Africa, so the two populations are geographically isolated from each other. Genetic differentiation could therefore be expected, given their separate histories, geographic isolation, and other forces.

Q3.2c Which populations are not differentiated by PC1? (2 point)

GIH and CEU: they are very close together (CEU was sampled in Utah and GIH in Texas, which are not far apart). MKK (Kenya) also shows little to no differentiation on PC1.

Q3.2d Are the populations not differentiated on PC1 separated on PC2? (2 point)

CEU/MKK/GIH.