Load Packages

library(readr)
library(ggplot2)
library(dplyr)
library(gwascat)
library(stringr)
library(kableExtra)
library(tidystringdist)
library(tidyverse)

Load Data

#Loading in Raw 23 and Me Data from csv file
genome_raw <- read_csv("~/Downloads/genome_Hayley.csv", 
                    col_types = cols(
                      chromosome = col_character(), 
                      position = col_integer()))


head(genome_raw)
## # A tibble: 6 x 4
##   rsid_num    chromosome position genotype
##   <chr>       <chr>         <int> <chr>   
## 1 rs548049170 1             69869 TT      
## 2 rs13328684  1             74792 --      
## 3 rs9283150   1            565508 --      
## 4 i713426     1            726912 AA      
## 5 rs116587930 1            727841 GG      
## 6 rs3131972   1            752721 GG
#Updated GWAS Reference Data; Take a while to run depending on internet speed
updated_gwas_data <- as.data.frame(makeCurrentGwascat())

Definitions

Let’s define some terms:

rsid_num is a unique label (“rs” followed by a number) used by researchers and databases to identify a specific SNP (Single Nucleotide Polymorphism). It stands for Reference SNP cluster ID and is the naming convention used for most SNPs. Before we can understand what a SNP is we must understand what a chromosome is.

What is a chromosome?

Chromosomes are thread-like structures located inside the nucleus of animal and plant cells. Each chromosome is made of protein and a single molecule of deoxyribonucleic acid (DNA). Passed from parents to offspring, DNA contains the specific instructions that make each type of living creature unique.

How many chromosomes do we have?

Humans have 23 pairs of chromosomes, for a total of 46 chromosomes.

How are chromosomes inherited?

In humans and most other complex organisms, one copy of each chromosome is inherited from the female parent and the other from the male parent. This explains why children inherit some of their traits from their mother and others from their father.

The pattern of inheritance is different for the small circular chromosome found in mitochondria. Only egg cells - and not sperm cells - keep their mitochondria during fertilization. So, mitochondrial DNA is always inherited from the female parent. In humans, a few conditions, including some forms of hearing impairment and diabetes, have been associated with DNA found in the mitochondria.

Do males and females differ?

Yes, they differ in a pair of chromosomes known as the sex chromosomes. Females have two X chromosomes in their cells, while males have one X and one Y chromosome.

Inheriting too many or not enough copies of sex chromosomes can lead to serious problems. For example, females who have extra copies of the X chromosome are usually taller than average and some have mental retardation. Males with more than one X chromosome have Klinefelter syndrome, which is a condition characterized by tall stature and, often, impaired fertility. Another syndrome caused by imbalance in the number of sex chromosomes is Turner syndrome. Women with Turner have one X chromosome only. They are very short, usually do not undergo puberty and some may have kidney or heart problems.

So, what are Single Nucleotide Polymorphisms (SNPs)?

A nucleotide is the basic building block of nucleic acids. … A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block, called a nucleotide. For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA.

SNPs occur normally throughout a person’s DNA. They occur almost once in every 1,000 nucleotides on average, which means there are roughly 4 to 5 million SNPs in a person’s genome. These variations may be unique or occur in many individuals; scientists have found more than 100 million SNPs in populations around the world. Most commonly, these variations are found in the DNA between genes. They can act as biological markers, helping scientists locate genes that are associated with disease. When SNPs occur within a gene or in a regulatory region near a gene, they may play a more direct role in disease by affecting the gene’s function.

Most SNPs have no effect on health or development. Some of these genetic differences, however, have proven to be very important in the study of human health. Researchers have found SNPs that may help predict an individual’s response to certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases. SNPs can also be used to track the inheritance of disease genes within families.

position is the location of the SNP in your DNA.

genotype is how both of your chromosomes read. If you had a SNP that read A/A on chromosome 1 this would mean that on both copies of chromosome 1 in that position there is an A. If you see – that means there is a deletion in that spot.

You can search rsid numbers of interest at https://www.ncbi.nlm.nih.gov/snp/ for more information

Now we have definitions out of the way and have our data and our reference set, now let’s merge the two data frames together.

#Joining Data 
join_data <- inner_join(genome_raw, 
                        updated_gwas_data, 
                        by = c("rsid_num" = "SNPS")
                        )

There are 44808 observations of 46 variables. However many of these will not have any risk or mutations. When we use the word risk here, we are not necessarily using risk in a bad way instead we are saying you have risk for displaying a certain phenotype (the outward expression of your genotype). This could be positive such as intelligence or height!

Why did this go down?

23 and Me has SNPs that may not be in the GWAS database we are using. We are using gwascat which is a R package that serves as an interface to the National Human Genome Research Institute’s (NHGRI) database of genome-wide association studies. At NHGRI, they believe that advances in genomics research are transforming our understanding of human health and disease. Building on their leadership role in the initial sequencing of the human genome, they collaborate with the scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives. At NHGRI, they are empowering and expanding the field of genomics.

#Cleaning up data and separating out genotype into separate fields
join_data$risk_allele_clean <- str_sub(join_data$STRONGEST.SNP.RISK.ALLELE, 
                                       -1
                                       )
join_data$my_allele_1 <- str_sub(join_data$genotype, 1, 1)
join_data$my_allele_2 <- str_sub(join_data$genotype, 2, 2)
join_data$have_risk_allele_count <- if_else(
  join_data$my_allele_1 == join_data$risk_allele_clean, 1, 0) + 
  if_else(
    join_data$my_allele_2 == join_data$risk_allele_clean, 1, 0
    )

Let’s filter out SNPs you do not have the risk allele(s) for and look at their disease traits and mapped traits.

#Selecting Variables of Interest and Filtering Data
risk_data<- select(join_data,
       have_risk_allele_count,
       rsid_num,
       my_genotype = genotype,
       risk_allele = risk_allele_clean,
       DISEASE.TRAIT,
       MAPPED_TRAIT,
       RISK.ALLELE.FREQUENCY,
       REPORTED.GENE.S.) %>%
filter(have_risk_allele_count>0)

There is many SNPS that carry a risk allele! Don’t worry your phenotype is based not solely on your genomes but a host of other factors such as lifestyle and environmental factors. Still, this is important to know so let’s look at some plots and tables now.

#Count of Disease Traits
trait_count_mapped <- risk_data %>%
  group_by(MAPPED_TRAIT) %>%
  summarise(risk_count = n()) 

#Count of Disease Traits for Risk Allele = 2
trait_count_mapped2 <- risk_data %>%
  group_by(MAPPED_TRAIT, have_risk_allele_count) %>%
  filter(have_risk_allele_count==2) %>%
  summarise(risk_count_2 = n()) %>%
 subset(select = -c(have_risk_allele_count))

#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_mapped3<- 
  merge(trait_count_mapped, trait_count_mapped2, 
        by = c("MAPPED_TRAIT", "MAPPED_TRAIT"),all=TRUE) 

trait_count_mapped3[is.na(trait_count_mapped3)] <- 0

#Risk count for Allele_1 equation
risk_1 <- (trait_count_mapped3$risk_count - trait_count_mapped3$risk_count_2)

#Adding risk_1 column in
trait_count_mapped4 <- select(trait_count_mapped3,
  MAPPED_TRAIT,
  risk_count,
  risk_count_2
)%>%
  mutate(risk_1
         )
#Adding Overall Risk Column and Equation
overall_risk <- (trait_count_mapped3$risk_count_2 / ((trait_count_mapped3$risk_count_2 + trait_count_mapped4$risk_1*2)))+(trait_count_mapped4$risk_1/((trait_count_mapped3$risk_count_2+trait_count_mapped4$risk_1*2)))

overall_risk_count <- select(trait_count_mapped4,
       MAPPED_TRAIT,
       risk_count,
       risk_1,
       risk_count_2,
       ) %>%
  mutate(overall_risk*100) 

overall_risk_count[is.na(overall_risk_count)] <- 0


#Get Count of Disease Traits by DISEASE.TRAIT
trait_count_disease <- risk_data %>%
  group_by(DISEASE.TRAIT)%>%
  summarise(risk_count= n())

#Count of Disease Traits for Risk Allele = 2
trait_count_disease2 <- risk_data %>%
  group_by(DISEASE.TRAIT, have_risk_allele_count) %>%
  filter(have_risk_allele_count==2) %>%
  summarise(risk_count_2 = n()) %>%
 subset(select = -c(have_risk_allele_count))

#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_disease3<- 
  merge(trait_count_disease, trait_count_disease2, 
        by = c("DISEASE.TRAIT", "DISEASE.TRAIT"),all=TRUE) 

trait_count_disease3[is.na(trait_count_disease3)] <- 0

#Risk count for Allele_1 equation
risk_d_1 <- (trait_count_disease3$risk_count - trait_count_disease3$risk_count_2)

#Adding risk_1 column in
trait_count_disease4 <- select(trait_count_disease3,
  DISEASE.TRAIT,
  risk_count,
  risk_count_2
)%>%
  mutate(risk_d_1
         )
#Adding Overall Risk Column and Equation
overall_risk_d <- (trait_count_disease3$risk_count_2 / ((trait_count_disease3$risk_count_2 + trait_count_disease4$risk_d_1*2)))+(trait_count_disease4$risk_d_1/((trait_count_disease3$risk_count_2+trait_count_disease4$risk_d_1*2)))

overall_risk_count_d <- select(trait_count_disease4,
       DISEASE.TRAIT,
       risk_count,
       risk_1= "risk_d_1",
       risk_count_2,
       ) %>%
  mutate(overall_risk_d*100) 

overall_risk_count[is.na(overall_risk_count)] <- 0

When grouped by MAPPED_TRAIT instead of DISEASE.TRAIT we can get rid of redundant entries. However, we can lose some of the story there. I want to take a second and define some terms I am using here:

overall_risk means based what proportion of 1 risk allele mutations to 2 risk allele mutations you have for that trait, the equation is discussed later on. We are finding what is the percentage of risk you genetically carry of 1 risk allele to 2 risk alleles versus the potential you could have had. (Meaning the the sum of 1 risk alleles you have times two plus the sum of 2 risk alleles you have, why multiply by 2? We multiply by two because you have the potential to have this SNP mutation on two chromosomes)

risk_1 means total number of 1 risk alleles you have in a certain trait.

risk_count_2 means total number of 2 risk alleles you have in a certain trait.

risk_count is the total number of risk alleles you have in a certain trait.

#Table of Disease Traits with over 100 risk alleles for MAPPED_TRAIT
table1 <-overall_risk_count %>%
  filter(risk_count > 99) %>%
  arrange(risk_count) %>%
  kable () %>%
  kable_paper("hover", full_width=F)
table1
MAPPED_TRAIT risk_count risk_1 risk_count_2 overall_risk * 100
ulcerative colitis 100 53 47 65.35948
smoking behavior, BMI-adjusted waist circumference 105 64 41 62.13018
glomerular filtration rate 107 72 35 59.77654
asthma 111 48 63 69.81132
urate measurement 112 73 39 60.54054
BMI-adjusted waist circumference 114 68 46 62.63736
age at menarche 115 62 53 64.97175
mathematical ability 117 63 54 65.00000
pulse pressure measurement 122 67 55 64.55026
coronary artery disease 126 61 65 67.37968
colorectal cancer 128 88 40 59.25926
bone density 134 57 77 70.15707
Crohn’s disease 134 68 66 66.33663
self reported educational attainment 154 87 67 63.90041
C-reactive protein measurement 160 87 73 64.77733
heel bone mineral density 164 60 104 73.21429
blood metabolite measurement 177 100 77 63.89892
physical activity measurement, body mass index 179 68 111 72.46964
waist-hip ratio 187 92 95 67.02509
BMI-adjusted waist circumference, physical activity measurement 188 117 71 61.63934
diastolic blood pressure 196 107 89 64.68647
prostate carcinoma 197 119 78 62.34177
systolic blood pressure 215 129 86 62.50000
rheumatoid arthritis 218 114 104 65.66265
smoking behavior, body mass index 228 107 121 68.05970
schizophrenia 265 127 138 67.60204
serum IgG glycosylation measurement 272 160 112 62.96296
breast carcinoma 285 156 129 64.62585
BMI-adjusted waist-hip ratio 409 204 205 66.72104
total cholesterol measurement 479 254 225 65.34789
type II diabetes mellitus 498 333 165 59.92780
body height 510 317 193 61.66868
low density lipoprotein cholesterol measurement 517 281 236 64.78697
triglyceride measurement 559 328 231 63.02142
high density lipoprotein cholesterol measurement 671 396 275 62.88660
body mass index 869 445 424 66.13394
blood protein measurement 1055 884 171 54.40949

Very informative, we found overall-risk by taking \[overall \ risk = (risk \ count_2/((risk \ count_1* 2) + risk \ count_2)) + (risk \ count_1/ ((risk \ count_1* 2) + risk \ count_2))\], this is giving the proportion of double alleles (carrying a double risk allele) plus proportion of single alleles (carrying only 1 copy of risk allele) to the total risk alleles you could have gotten.

Let’s take a moment to discuss the importance of knowing this information but not being coming a hypochondriac. This data is showing your overall risk for a disease GENETICALLY, however, your phenotypes (expression of genetic traits) is influences by more than just your genetics. Lifestyle choices such as smoking, drinking, working out, eating healthy and environmental factors such as do you live in a city or country side, all play a role in what will be expressed in your life time. Even, the bacteria in your gut play a role in this! So, while this information is truly useful to know that you have the GENETIC POTENTIAL to express this outward traits and can be discussed with your doctor for preventative screenings or measurements, don’t have a heart attack when you still all the data.

#Table of Disease Traits with over 100 risk alleles for DISEASE.TRAIT
table2 <-overall_risk_count_d %>%
  filter(risk_count > 99) %>%
  arrange(risk_count) %>%
  kable () %>%
  kable_paper("hover", full_width=F)
table2
DISEASE.TRAIT risk_count risk_1 risk_count_2 overall_risk_d * 100
Asthma 100 45 55 68.96552
LDL cholesterol 100 74 26 57.47126
Ulcerative colitis 100 53 47 65.35948
Body mass index (joint analysis main effects and smoking interaction) 101 48 53 67.78523
BMI (adjusted for smoking behaviour) 104 49 55 67.97386
Urate levels 114 73 41 60.96257
Menarche (age at onset) 115 62 53 64.97175
Pulse pressure 115 63 52 64.60674
Waist-to-hip ratio adjusted for BMI 119 61 58 66.11111
Coronary artery disease 121 62 59 66.12022
HDL cholesterol 127 97 30 56.69643
Colorectal cancer 128 88 40 59.25926
C-reactive protein levels 132 66 66 66.66667
Crohn’s disease 134 68 66 66.33663
LDL cholesterol levels 135 75 60 64.28571
Blood metabolite levels 144 77 67 65.15837
Triglyceride levels 152 83 69 64.68085
Heel bone mineral density 163 59 104 73.42342
Waist circumference adjusted for body mass index 169 105 64 61.67883
HDL cholesterol levels 182 96 86 65.46763
Waist-hip ratio 186 91 95 67.14801
Prostate cancer 188 114 74 62.25166
Diastolic blood pressure 189 106 83 64.06780
Rheumatoid arthritis 201 106 95 65.47231
Systolic blood pressure 205 126 79 61.93353
Waist-to-hip ratio adjusted for body mass index 215 100 115 68.25397
IgG glycosylation 233 123 110 65.44944
Schizophrenia 265 127 138 67.60204
Low density lipoprotein cholesterol levels 270 121 149 69.05371
Breast cancer 278 154 124 64.35185
High density lipoprotein cholesterol levels 348 191 157 64.56401
Triglycerides 381 222 159 63.18408
Total cholesterol levels 399 191 208 67.62712
Type 2 diabetes 470 319 151 59.56907
Height 503 313 190 61.64216
Body mass index 846 412 434 67.24960
Blood protein levels 1126 950 176 54.23892

See this gives a rounder picture of what is going on. Breast cancer is one to keep in mind, since I carry a significant risk I need to take this into mind. Possibly early screenings?

#Summarize proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <-  filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid_num)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait)) %>%
 ungroup()

# Count the studies per trait in the database

trait_study_count <- filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
 group_by(DISEASE.TRAIT) %>%
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there were more than 2 studies and you have risk alleles for more than 80% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() + 
 labs(title = "Traits I have more than half of the risk\nalleles studied where > 1 studies involved", 
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
 theme(legend.position="bottom")

If the bar is gray we do not have a risk allele frequency for that category to calculate currently. This is where research comes into play! Allele Frequency Definition An allele at a particular locus may also confer some fitness effect for an individual carrying that allele, on which natural selection acts. Beneficial alleles tend to increase in frequency, while deleterious alleles tend to decrease in frequency in the population.

write.table(join_data, file="~/Downloads/All_SNP_Results.txt", sep = "\t"  )
write.table(overall_risk_count, file="~/Downloads/MAPPED_OVERALL_RISK.txt", sep = "\t"  )
write.table(overall_risk_count_d, file="~/Downloads/DISEASE_OVERALL_RISK.txt", sep = "\t"  )
write.table(risk_data, file="~/Downloads/RISK_SNPs.txt", sep = "\t"  )