Genomics of Human Populations Assignment 2
Task 1: Visual evaluation of fit to Hardy-Weinberg
Q1.1 For Q1.1a to Q1.1d, consider a biallelic SNP with minor allele
frequency of 0.45.
Q1.1a What is the frequency of the “major” allele at this locus? (1
point)
- The frequency of the major allele is 0.55
Q1.1b What is the expected frequency of the homozygous genotype for
the minor allele (assuming Hardy-Weinberg equilibrium)? (1 point)
- The expected frequency would be 0.2025
Q1.1c What is the expected frequency of homozygotes for the “major”
allele? (1 point)
- The expected frequency would be 0.3025
Q1.1d What is the expected frequency of heterozygotes at this locus?
(1 point)
- The expected frequency would be 0.495
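The three expected frequencies in Q1.1 follow from the Hardy-Weinberg
proportions p^2, 2pq, and q^2. A minimal sketch of that arithmetic, where
the function name `hwe_expected` and the dictionary keys are illustrative
choices and not part of the assignment:

```python
def hwe_expected(maf):
    """Expected genotype frequencies under Hardy-Weinberg for a biallelic SNP."""
    q = maf        # minor allele frequency
    p = 1.0 - q    # major allele frequency
    return {
        "major_hom": p * p,      # homozygous for the major allele
        "het": 2 * p * q,        # heterozygous
        "minor_hom": q * q,      # homozygous for the minor allele
    }


freqs = hwe_expected(0.45)
for genotype, value in freqs.items():
    print(f"{genotype}: {value:.4f}")
```

For a minor allele frequency of 0.45 this gives 0.3025, 0.4950, and
0.2025, which sum to 1.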
Q1.2 Now using the plot you created above to assist you, identify the
color of the genotypes associated with the (a) heterozygotes, (b) homozygote
for minor allele, and (c) homozygote for major allele (3 points)
- heterozygote color (p12): green
- homozygote for minor allele color (p11): orange
- homozygote for major allele color (p22): blue
Q1.3 Do the observed genotypic frequencies (i.e., points) roughly
follow HWE expectations (black lines) in the HGDP data? (2 points)
- The p11 and p22 points roughly follow the HWE expectation
lines. The p12 points show a wider spread around the expected line,
with both excess and deficiency of observed values relative to
expectation.
Q1.4. Based on the observed genotype frequencies, do you think that
the samples included in this analysis are derived from two or more
highly structured populations (3 points)? Why or why not?
- Because the observed data show both excess and deficiency
relative to the expected values (the lines in the HWE graph), this
suggests that 1) there is complex subpopulation structure and 2)
evolutionary forces are acting on the human genomes. A natural
extension of this idea is that admixture could be one of those forces.
How the data were collected should also be considered: sampling error
could produce unexpected rises or declines in observed versus expected
heterozygosity.
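One way to probe this reasoning is the Wahlund effect: when
subpopulations that are each in HWE but differ in allele frequency are
pooled, the pooled sample shows fewer heterozygotes than HWE predicts
from the pooled allele frequency. The sketch below uses made-up allele
frequencies (0.1 and 0.9, hypothetical values that are not taken from
the HGDP data), and the function name is illustrative:

```python
def pooled_heterozygosity(subpops):
    """subpops: list of (weight, p) pairs, one per subpopulation.

    Each subpopulation is assumed to be in HWE on its own.
    Returns (observed, expected) heterozygosity of the pooled sample.
    """
    total = sum(w for w, _ in subpops)
    # Observed: weighted average of the within-subpopulation 2pq values.
    h_obs = sum(w * 2 * p * (1 - p) for w, p in subpops) / total
    # Expected: 2pq computed from the pooled allele frequency.
    p_bar = sum(w * p for w, p in subpops) / total
    h_exp = 2 * p_bar * (1 - p_bar)
    return h_obs, h_exp


# Two equally sized, strongly differentiated subpopulations (hypothetical).
h_obs, h_exp = pooled_heterozygosity([(0.5, 0.1), (0.5, 0.9)])
f_stat = 1 - h_obs / h_exp   # positive value = heterozygote deficit
print(h_obs, h_exp, f_stat)
```

With these inputs the pooled sample has observed heterozygosity 0.18
against an expectation of 0.50. Strong structure of this kind would
therefore show up mainly as a heterozygote deficit at differentiated
SNPs.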
Task 2: ADMIXTURE analysis of European data from the HGDP
Q2.1 Include the K=8 diagram in your
report (1 point).
Q2.2 Which populations have individuals with a Russian ancestry
component? (1 point)
- HGDP00879
- HGDP00880
- HGDP00881
- HGDP00882
- HGDP00883
- HGDP00884
- HGDP00885
- HGDP00886
- HGDP00887
- HGDP00888
- HGDP00889
- HGDP00890
- HGDP00891
- HGDP00892
- HGDP00893
- HGDP00894
- HGDP00895
- HGDP00896
- HGDP00897
- HGDP00898
- HGDP00899
- HGDP00900
- HGDP00901
- HGDP00902
- HGDP00903
Q2.3. Are the Tuscan and North_Italian individuals completely
distinguished as being from distinct populations? In other words, if you
didn’t know the population origin of these samples, could you assign
them confidently to one or the other population? Explain. (2 points)
- No. Looking at the data blind, the two populations do not show
distinct patterns. Across components X1 to X8, with the exception of
X4, the values are nearly identical or deviate only slightly between
individuals (for example, X1 is about 0.00001 in all individuals).
X4 has no clear signature. The individuals that might help in drawing
a distinction, although they do not by themselves explain which
population a sample came from, are HGDP01167, HGDP01169, HGDP01153,
and HGDP01155.
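A comparison like this can be made more concrete by averaging the
ancestry fractions per population and looking at the largest
per-component difference. The sketch below uses hypothetical K=3
fractions (not the HGDP values from the ADMIXTURE run), and
`mean_profile` is an illustrative helper name:

```python
def mean_profile(q_rows):
    """Column-wise mean of ancestry fractions over a list of individuals."""
    n = len(q_rows)
    return [sum(col) / n for col in zip(*q_rows)]


# Hypothetical ancestry fractions, one row per individual; each row sums to 1.
tuscan = [[0.70, 0.25, 0.05], [0.68, 0.27, 0.05]]
north_italian = [[0.69, 0.26, 0.05], [0.71, 0.24, 0.05]]

diff = [
    abs(a - b)
    for a, b in zip(mean_profile(tuscan), mean_profile(north_italian))
]
print(f"largest difference in mean ancestry: {max(diff):.3f}")
```

When the largest difference is small compared with the variation among
individuals within each population, the ancestry profile alone is
unlikely to assign a sample confidently to one population or the other.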
[Figure: table_populations_file]
Q2.4. Which two named populations seem to have the most internal
population structure? (i.e., consist of individuals with two distinct
ancestries that may represent two or more unrecognized populations)
Explain. (2 points)
- French and Orcadian, because they have more ancestral
components than the other populations. Each shows, I believe, at
least 5 different colors on the ADMIXTURE graph, which corresponds to
“more” internal subpopulation structure.
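Counting colors is one heuristic. A complementary check, sketched here
under the assumption that internal structure shows up as large
differences between individuals of the same population, is the spread
of each ancestry component within a population. The fractions below
are hypothetical K=2 values, not the HGDP results, and
`component_spread` is an illustrative name:

```python
from statistics import pstdev


def component_spread(q_rows):
    """Standard deviation of each ancestry component across individuals."""
    return [pstdev(col) for col in zip(*q_rows)]


# Hypothetical ancestry fractions, one row per individual.
homogeneous = [[0.90, 0.10], [0.88, 0.12], [0.91, 0.09]]
structured = [[0.90, 0.10], [0.15, 0.85], [0.88, 0.12], [0.10, 0.90]]

print(max(component_spread(homogeneous)))
print(max(component_spread(structured)))
```

A population made up of two groups with distinct ancestries gives a
much larger spread than one whose individuals all share similar
fractions, even when both show the same number of colors.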
Q2.5b. How does the ancestry diagram differ from K=8? Please comment
on which if any of the groups defined a priori by their
ethnic/geographic origin that were split in K=8 are not split at K=6. (2
points).
- The K=6 diagram is easier to read and, revisiting the previous
question, makes it easier to rule out populations that lack internal
subpopulation structure (e.g., French_Basque). French_Basque,
Sardinian, and Russian are the populations that, at K=6, lose the
tracts of ancestry attributed to X7 or X8. As a result they appear
more homogeneous in the K=6 diagram than at K=8. The rest of the
populations still retain their distinct ancestry signatures, so
internal subpopulation structure remains noticeable.
Q2.5c The French appear to have mixed ancestry at both K=6 and K=8.
Which sources of ancestry appear to be present at both K=6 and K=8? (3
points)
- X1 and X2 appear to be present at both values of K and keep the
same position in the plot of genome fraction per sample. Other sources
of ancestry are also shared: their tracts and fractions change, but
their relative positions do not. These include X5 (a single segment of
color at K=8, but observed in more samples at K=6, when the number of
clusters is reduced) and X6, which retains some similar tracts in both
ADMIXTURE graphs.
Task 3: Principal Component Analysis (PCA) of population genomic
data
Q3.1a How many missing data values were imputed? Hint: Check the
output (in red) written to your console in the second PCA run. (1
point)
- 29222 missing values imputed
Q3.1b What is the percentage of variation explained by each of the
first five PC axes? (2 points).
- Each PC axis captures part of the variation in the data. PC1
explains the largest share (10.47532%), and each following axis
explains a smaller amount: 4.348854% (PC2), 1.55274% (PC3), 1.477004%
(PC4), and 1.191729% (PC5). The eigenvalues reflect the variance
contained in each PC, and they show the same decreasing trend after
the first reported eigenvalue of 26065.09078.
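The link between eigenvalues and percentage of variation explained can
be sketched as follows. The eigenvalues here are hypothetical round
numbers chosen for readability, not the values from the PCA run, and
`percent_variance` is an illustrative name. The total has to be taken
over all eigenvalues, not only the first five:

```python
def percent_variance(eigenvalues):
    """Percentage of total variance explained by each principal component."""
    total = sum(eigenvalues)   # sum over ALL axes, not just the reported ones
    return [100.0 * ev / total for ev in eigenvalues]


# Hypothetical eigenvalues for a five-axis example.
eig = [5.0, 2.0, 1.0, 1.0, 1.0]
pct = percent_variance(eig)
for i, value in enumerate(pct, start=1):
    print(f"PC{i}: {value:.1f}%")
```

In this toy example PC1 explains 50% of the variance, PC2 20%, and the
remaining three axes 10% each.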
Q3.1c After reviewing the variation explained in the first 10 PC
axes, which axes seem to capture the majority of the variation before
the remaining axes begin to plateau? (note: you could make a “scree
plot” with these values in a barplot with PC1 variation in explained as
the leftmost bar, PC2 as the second bar etc. if it helps visualization)
(1 point)
- PC1 contains the most information, since it explains the
largest share of the total variation. Every axis that follows contains
less information, so less cumulative variance is added, but these axes
still help describe the spread of the data and give a fuller picture
of its structure. The percentage of variation drops sharply after PC1
(from about 10.5% to 4.3%), drops again after PC2 (to about 1.6%), and
then plateaus.
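The plateau can be checked numerically from the five percentages
reported in Q3.1b. This sketch computes the drop between consecutive
axes and the running total, and prints a crude text version of a scree
plot. The variable names are illustrative:

```python
# Percentages reported for PC1..PC5 in the answer to Q3.1b.
pct = [10.47532, 4.348854, 1.55274, 1.477004, 1.191729]

# Drop in variance explained between consecutive axes.
drops = [a - b for a, b in zip(pct, pct[1:])]

# Running total of variance explained.
cumulative = []
running = 0.0
for value in pct:
    running += value
    cumulative.append(running)

for i, value in enumerate(pct, start=1):
    bar = "#" * round(value * 4)   # text scree plot, about 4 characters per percent
    print(f"PC{i:<2} {value:8.4f}%  {bar}")

print("drops:", [round(d, 4) for d in drops])
print("cumulative:", [round(c, 4) for c in cumulative])
```

On these values the two largest drops come after PC1 and after PC2,
and the first five axes together explain about 19% of the variation.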
[Figure: PCA_plot]
Q3.2a Report the PCA plot in your assignment (2 points)
Q3.2b Which populations are MOST differentiated on PC1? Does this
make sense in terms of geography? (2 points)
- CHB and YRI. CHB is from East Asia and YRI is from Africa, so
this makes sense geographically. The two populations are separated by
geographic isolation and other forces and do not share a converging
history, so genetic differentiation between them is expected.
Q3.2c Which populations are not differentiated by PC1? (2 points)
- GIH and CEU. They are very close together on PC1 (GIH was
sampled in Texas and CEU in Utah, which are not far apart). MKK
(Kenya) also shows little to no differentiation on PC1.
Q3.2d Are the populations not differentiated on PC1 separated on PC2?
(2 points)