Definitions

Let’s define some terms:

rsid_num is a unique label (“rs” followed by a number) used by researchers and databases to identify a specific SNP (Single Nucleotide Polymorphism). It stands for Reference SNP cluster ID and is the naming convention used for most SNPs. Before we can understand what a SNP is we must understand what a chromosome is.

What is a chromosome?

Chromosomes are thread-like structures located inside the nucleus of animal and plant cells. Each chromosome is made of protein and a single molecule of deoxyribonucleic acid (DNA). Passed from parents to offspring, DNA contains the specific instructions that make each type of living creature unique.

How many chromosomes do we have?

Humans have 23 pairs of chromosomes, for a total of 46 chromosomes.

How are chromosomes inherited?

In humans and most other complex organisms, one copy of each chromosome is inherited from the female parent and the other from the male parent. This explains why children inherit some of their traits from their mother and others from their father.

The pattern of inheritance is different for the small circular chromosome found in mitochondria. Only egg cells - and not sperm cells - keep their mitochondria during fertilization. So, mitochondrial DNA is always inherited from the female parent. In humans, a few conditions, including some forms of hearing impairment and diabetes, have been associated with DNA found in the mitochondria.

Do males and females differ?

Yes, they differ in a pair of chromosomes known as the sex chromosomes. Females have two X chromosomes in their cells, while males have one X and one Y chromosome.

Inheriting too many or not enough copies of sex chromosomes can lead to serious problems. For example, females who have extra copies of the X chromosome are usually taller than average and some have mental retardation. Males with more than one X chromosome have Klinefelter syndrome, which is a condition characterized by tall stature and, often, impaired fertility. Another syndrome caused by imbalance in the number of sex chromosomes is Turner syndrome. Women with Turner have one X chromosome only. They are very short, usually do not undergo puberty and some may have kidney or heart problems.

So, what are Single Nucleotide Polymorphisms (SNPs)?

A nucleotide is the basic building block of nucleic acids. … A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block, called a nucleotide. For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA.

SNPs occur normally throughout a person’s DNA. They occur almost once in every 1,000 nucleotides on average, which means there are roughly 4 to 5 million SNPs in a person’s genome. These variations may be unique or occur in many individuals; scientists have found more than 100 million SNPs in populations around the world. Most commonly, these variations are found in the DNA between genes. They can act as biological markers, helping scientists locate genes that are associated with disease. When SNPs occur within a gene or in a regulatory region near a gene, they may play a more direct role in disease by affecting the gene’s function.

Most SNPs have no effect on health or development. Some of these genetic differences, however, have proven to be very important in the study of human health. Researchers have found SNPs that may help predict an individual’s response to certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases. SNPs can also be used to track the inheritance of disease genes within families.

position is the location of the SNP in your DNA.

genotype is how both of your chromosomes read. If you had a SNP that read A/A on chromosome 1 this would mean that on both copies of chromosome 1 in that position there is an A. If you see – that means there is a deletion in that spot.

You can search rsid numbers of interest at https://www.ncbi.nlm.nih.gov/snp/ for more information

Now we have definitions out of the way and have our data and our reference set, now let’s merge the two data frames together.

#Joining Data 
join_data <- inner_join(genome_raw, 
                        updated_gwas_data, 
                        by = c("rsid_num" = "SNPS")
                        )

There are 44808 observations of 46 variables. However many of these will not have any risk or mutations. When we use the word risk here, we are not necessarily using risk in a bad way instead we are saying you have risk for displaying a certain phenotype (the outward expression of your genotype). This could be positive such as intelligence or height!

Why did this go down?

23 and Me has SNPs that may not be in the GWAS database we are using. We are using gwascat which is a R package that serves as an interface to the National Human Genome Research Institute’s (NHGRI) database of genome-wide association studies. At NHGRI, they believe that advances in genomics research are transforming our understanding of human health and disease. Building on their leadership role in the initial sequencing of the human genome, they collaborate with the scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives. At NHGRI, they are empowering and expanding the field of genomics.

#Cleaning up data and separating out genotype into separate fields
join_data$risk_allele_clean <- str_sub(join_data$STRONGEST.SNP.RISK.ALLELE, 
                                       -1
                                       )
join_data$my_allele_1 <- str_sub(join_data$genotype, 1, 1)
join_data$my_allele_2 <- str_sub(join_data$genotype, 2, 2)
join_data$have_risk_allele_count <- if_else(
  join_data$my_allele_1 == join_data$risk_allele_clean, 1, 0) + 
  if_else(
    join_data$my_allele_2 == join_data$risk_allele_clean, 1, 0
    )

Let’s filter out SNPs you do not have the risk allele(s) for and look at their disease traits and mapped traits.

#Selecting Variables of Interest and Filtering Data
risk_data<- select(join_data,
       have_risk_allele_count,
       rsid_num,
       my_genotype = genotype,
       risk_allele = risk_allele_clean,
       DISEASE.TRAIT,
       MAPPED_TRAIT,
       RISK.ALLELE.FREQUENCY,
       REPORTED.GENE.S.) %>%
filter(have_risk_allele_count>0)

There is many SNPS that carry a risk allele! Don’t worry your phenotype is based not solely on your genomes but a host of other factors such as lifestyle and environmental factors. Still, this is important to know so let’s look at some plots and tables now.

#Count of Disease Traits
trait_count_mapped <- risk_data %>%
  group_by(MAPPED_TRAIT) %>%
  summarise(risk_count = n()) 

#Count of Disease Traits for Risk Allele = 2
trait_count_mapped2 <- risk_data %>%
  group_by(MAPPED_TRAIT, have_risk_allele_count) %>%
  filter(have_risk_allele_count==2) %>%
  summarise(risk_count_2 = n()) %>%
 subset(select = -c(have_risk_allele_count))

#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_mapped3<- 
  merge(trait_count_mapped, trait_count_mapped2, 
        by = c("MAPPED_TRAIT", "MAPPED_TRAIT"),all=TRUE) 

trait_count_mapped3[is.na(trait_count_mapped3)] <- 0

#Risk count for Allele_1 equation
risk_1 <- (trait_count_mapped3$risk_count - trait_count_mapped3$risk_count_2)

#Adding risk_1 column in
trait_count_mapped4 <- select(trait_count_mapped3,
  MAPPED_TRAIT,
  risk_count,
  risk_count_2
)%>%
  mutate(risk_1
         )
#Adding Overall Risk Column and Equation
overall_risk <- (trait_count_mapped3$risk_count_2 / ((trait_count_mapped3$risk_count_2 + trait_count_mapped4$risk_1*2)))+(trait_count_mapped4$risk_1/((trait_count_mapped3$risk_count_2+trait_count_mapped4$risk_1*2)))

overall_risk_count <- select(trait_count_mapped4,
       MAPPED_TRAIT,
       risk_count,
       risk_1,
       risk_count_2,
       ) %>%
  mutate(overall_risk*100) 

overall_risk_count[is.na(overall_risk_count)] <- 0


#Get Count of Disease Traits by DISEASE.TRAIT
trait_count_disease <- risk_data %>%
  group_by(DISEASE.TRAIT)%>%
  summarise(risk_count= n())

#Count of Disease Traits for Risk Allele = 2
trait_count_disease2 <- risk_data %>%
  group_by(DISEASE.TRAIT, have_risk_allele_count) %>%
  filter(have_risk_allele_count==2) %>%
  summarise(risk_count_2 = n()) %>%
 subset(select = -c(have_risk_allele_count))

#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_disease3<- 
  merge(trait_count_disease, trait_count_disease2, 
        by = c("DISEASE.TRAIT", "DISEASE.TRAIT"),all=TRUE) 

trait_count_disease3[is.na(trait_count_disease3)] <- 0

#Risk count for Allele_1 equation
risk_d_1 <- (trait_count_disease3$risk_count - trait_count_disease3$risk_count_2)

#Adding risk_1 column in
trait_count_disease4 <- select(trait_count_disease3,
  DISEASE.TRAIT,
  risk_count,
  risk_count_2
)%>%
  mutate(risk_d_1
         )
#Adding Overall Risk Column and Equation
overall_risk_d <- (trait_count_disease3$risk_count_2 / ((trait_count_disease3$risk_count_2 + trait_count_disease4$risk_d_1*2)))+(trait_count_disease4$risk_d_1/((trait_count_disease3$risk_count_2+trait_count_disease4$risk_d_1*2)))

overall_risk_count_d <- select(trait_count_disease4,
       DISEASE.TRAIT,
       risk_count,
       risk_1= "risk_d_1",
       risk_count_2,
       ) %>%
  mutate(overall_risk_d*100) 

overall_risk_count[is.na(overall_risk_count)] <- 0

When grouped by MAPPED_TRAIT instead of DISEASE.TRAIT we can get rid of redundant entries. However, we can lose some of the story there. I want to take a second and define some terms I am using here:

overall_risk means based what proportion of 1 risk allele mutations to 2 risk allele mutations you have for that trait, the equation is discussed later on. We are finding what is the percentage of risk you genetically carry of 1 risk allele to 2 risk alleles versus the potential you could have had. (Meaning the the sum of 1 risk alleles you have times two plus the sum of 2 risk alleles you have, why multiply by 2? We multiply by two because you have the potential to have this SNP mutation on two chromosomes)

risk_1 means total number of 1 risk alleles you have in a certain trait.

risk_count_2 means total number of 2 risk alleles you have in a certain trait.

risk_count is the total number of risk alleles you have in a certain trait.

#Table of Disease Traits with over 100 risk alleles for MAPPED_TRAIT
table1 <-overall_risk_count %>%
  filter(risk_count > 99) %>%
  arrange(risk_count) %>%
  kable () %>%
  kable_paper("hover", full_width=F)
table1

MAPPED_TRAIT	risk_count	risk_1	risk_count_2	overall_risk * 100
ulcerative colitis	100	53	47	65.35948
smoking behavior, BMI-adjusted waist circumference	105	64	41	62.13018
glomerular filtration rate	107	72	35	59.77654
asthma	111	48	63	69.81132
urate measurement	112	73	39	60.54054
BMI-adjusted waist circumference	114	68	46	62.63736
age at menarche	115	62	53	64.97175
mathematical ability	117	63	54	65.00000
pulse pressure measurement	122	67	55	64.55026
coronary artery disease	126	61	65	67.37968
colorectal cancer	128	88	40	59.25926
bone density	134	57	77	70.15707
Crohn’s disease	134	68	66	66.33663
self reported educational attainment	154	87	67	63.90041
C-reactive protein measurement	160	87	73	64.77733
heel bone mineral density	164	60	104	73.21429
blood metabolite measurement	177	100	77	63.89892
physical activity measurement, body mass index	179	68	111	72.46964
waist-hip ratio	187	92	95	67.02509
BMI-adjusted waist circumference, physical activity measurement	188	117	71	61.63934
diastolic blood pressure	196	107	89	64.68647
prostate carcinoma	197	119	78	62.34177
systolic blood pressure	215	129	86	62.50000
rheumatoid arthritis	218	114	104	65.66265
smoking behavior, body mass index	228	107	121	68.05970
schizophrenia	265	127	138	67.60204
serum IgG glycosylation measurement	272	160	112	62.96296
breast carcinoma	285	156	129	64.62585
BMI-adjusted waist-hip ratio	409	204	205	66.72104
total cholesterol measurement	479	254	225	65.34789
type II diabetes mellitus	498	333	165	59.92780
body height	510	317	193	61.66868
low density lipoprotein cholesterol measurement	517	281	236	64.78697
triglyceride measurement	559	328	231	63.02142
high density lipoprotein cholesterol measurement	671	396	275	62.88660
body mass index	869	445	424	66.13394
blood protein measurement	1055	884	171	54.40949

Very informative, we found overall-risk by taking \[overall \ risk = (risk \ count_2/((risk \ count_1* 2) + risk \ count_2)) + (risk \ count_1/ ((risk \ count_1* 2) + risk \ count_2))\], this is giving the proportion of double alleles (carrying a double risk allele) plus proportion of single alleles (carrying only 1 copy of risk allele) to the total risk alleles you could have gotten.

Let’s take a moment to discuss the importance of knowing this information but not being coming a hypochondriac. This data is showing your overall risk for a disease GENETICALLY, however, your phenotypes (expression of genetic traits) is influences by more than just your genetics. Lifestyle choices such as smoking, drinking, working out, eating healthy and environmental factors such as do you live in a city or country side, all play a role in what will be expressed in your life time. Even, the bacteria in your gut play a role in this! So, while this information is truly useful to know that you have the GENETIC POTENTIAL to express this outward traits and can be discussed with your doctor for preventative screenings or measurements, don’t have a heart attack when you still all the data.

#Table of Disease Traits with over 100 risk alleles for DISEASE.TRAIT
table2 <-overall_risk_count_d %>%
  filter(risk_count > 99) %>%
  arrange(risk_count) %>%
  kable () %>%
  kable_paper("hover", full_width=F)
table2

DISEASE.TRAIT	risk_count	risk_1	risk_count_2	overall_risk_d * 100
Asthma	100	45	55	68.96552
LDL cholesterol	100	74	26	57.47126
Ulcerative colitis	100	53	47	65.35948
Body mass index (joint analysis main effects and smoking interaction)	101	48	53	67.78523
BMI (adjusted for smoking behaviour)	104	49	55	67.97386
Urate levels	114	73	41	60.96257
Menarche (age at onset)	115	62	53	64.97175
Pulse pressure	115	63	52	64.60674
Waist-to-hip ratio adjusted for BMI	119	61	58	66.11111
Coronary artery disease	121	62	59	66.12022
HDL cholesterol	127	97	30	56.69643
Colorectal cancer	128	88	40	59.25926
C-reactive protein levels	132	66	66	66.66667
Crohn’s disease	134	68	66	66.33663
LDL cholesterol levels	135	75	60	64.28571
Blood metabolite levels	144	77	67	65.15837
Triglyceride levels	152	83	69	64.68085
Heel bone mineral density	163	59	104	73.42342
Waist circumference adjusted for body mass index	169	105	64	61.67883
HDL cholesterol levels	182	96	86	65.46763
Waist-hip ratio	186	91	95	67.14801
Prostate cancer	188	114	74	62.25166
Diastolic blood pressure	189	106	83	64.06780
Rheumatoid arthritis	201	106	95	65.47231
Systolic blood pressure	205	126	79	61.93353
Waist-to-hip ratio adjusted for body mass index	215	100	115	68.25397
IgG glycosylation	233	123	110	65.44944
Schizophrenia	265	127	138	67.60204
Low density lipoprotein cholesterol levels	270	121	149	69.05371
Breast cancer	278	154	124	64.35185
High density lipoprotein cholesterol levels	348	191	157	64.56401
Triglycerides	381	222	159	63.18408
Total cholesterol levels	399	191	208	67.62712
Type 2 diabetes	470	319	151	59.56907
Height	503	313	190	61.64216
Body mass index	846	412	434	67.24960
Blood protein levels	1126	950	176	54.23892

See this gives a rounder picture of what is going on. Breast cancer is one to keep in mind, since I carry a significant risk I need to take this into mind. Possibly early screenings?

#Summarize proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <-  filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid_num)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait)) %>%
 ungroup()

# Count the studies per trait in the database

trait_study_count <- filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
 group_by(DISEASE.TRAIT) %>%
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there were more than 2 studies and you have risk alleles for more than 80% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() + 
 labs(title = "Traits I have more than half of the risk\nalleles studied where > 1 studies involved", 
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
 theme(legend.position="bottom")

If the bar is gray we do not have a risk allele frequency for that category to calculate currently. This is where research comes into play! Allele Frequency Definition An allele at a particular locus may also confer some fitness effect for an individual carrying that allele, on which natural selection acts. Beneficial alleles tend to increase in frequency, while deleterious alleles tend to decrease in frequency in the population.

write.table(join_data, file="~/Downloads/All_SNP_Results.txt", sep = "\t"  )
write.table(overall_risk_count, file="~/Downloads/MAPPED_OVERALL_RISK.txt", sep = "\t"  )
write.table(overall_risk_count_d, file="~/Downloads/DISEASE_OVERALL_RISK.txt", sep = "\t"  )
write.table(risk_data, file="~/Downloads/RISK_SNPs.txt", sep = "\t"  )

23 and Me Raw Data Risk Analysis in R Tutorial

Hayley Knight Lecker

9/4/2020

Load Packages

Load Data

Definitions