#Loading in Raw 23 and Me Data from csv file
genome_raw <- read_csv("~/Downloads/genome_Hayley.csv",
col_types = cols(
chromosome = col_character(),
position = col_integer()))
head(genome_raw)## # A tibble: 6 x 4
## rsid_num chromosome position genotype
## <chr> <chr> <int> <chr>
## 1 rs548049170 1 69869 TT
## 2 rs13328684 1 74792 --
## 3 rs9283150 1 565508 --
## 4 i713426 1 726912 AA
## 5 rs116587930 1 727841 GG
## 6 rs3131972 1 752721 GG
Let’s define some terms:
rsid_num is a unique label (“rs” followed by a number) used by researchers and databases to identify a specific SNP (Single Nucleotide Polymorphism). It stands for Reference SNP cluster ID and is the naming convention used for most SNPs. Before we can understand what a SNP is we must understand what a chromosome is.
What is a chromosome?
Chromosomes are thread-like structures located inside the nucleus of animal and plant cells. Each chromosome is made of protein and a single molecule of deoxyribonucleic acid (DNA). Passed from parents to offspring, DNA contains the specific instructions that make each type of living creature unique.
How many chromosomes do we have?
Humans have 23 pairs of chromosomes, for a total of 46 chromosomes.
How are chromosomes inherited?
In humans and most other complex organisms, one copy of each chromosome is inherited from the female parent and the other from the male parent. This explains why children inherit some of their traits from their mother and others from their father.
The pattern of inheritance is different for the small circular chromosome found in mitochondria. Only egg cells - and not sperm cells - keep their mitochondria during fertilization. So, mitochondrial DNA is always inherited from the female parent. In humans, a few conditions, including some forms of hearing impairment and diabetes, have been associated with DNA found in the mitochondria.
Do males and females differ?
Yes, they differ in a pair of chromosomes known as the sex chromosomes. Females have two X chromosomes in their cells, while males have one X and one Y chromosome.
Inheriting too many or not enough copies of sex chromosomes can lead to serious problems. For example, females who have extra copies of the X chromosome are usually taller than average and some have mental retardation. Males with more than one X chromosome have Klinefelter syndrome, which is a condition characterized by tall stature and, often, impaired fertility. Another syndrome caused by imbalance in the number of sex chromosomes is Turner syndrome. Women with Turner have one X chromosome only. They are very short, usually do not undergo puberty and some may have kidney or heart problems.
So, what are Single Nucleotide Polymorphisms (SNPs)?
A nucleotide is the basic building block of nucleic acids. … A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block, called a nucleotide. For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA.
SNPs occur normally throughout a person’s DNA. They occur almost once in every 1,000 nucleotides on average, which means there are roughly 4 to 5 million SNPs in a person’s genome. These variations may be unique or occur in many individuals; scientists have found more than 100 million SNPs in populations around the world. Most commonly, these variations are found in the DNA between genes. They can act as biological markers, helping scientists locate genes that are associated with disease. When SNPs occur within a gene or in a regulatory region near a gene, they may play a more direct role in disease by affecting the gene’s function.
Most SNPs have no effect on health or development. Some of these genetic differences, however, have proven to be very important in the study of human health. Researchers have found SNPs that may help predict an individual’s response to certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases. SNPs can also be used to track the inheritance of disease genes within families.
position is the location of the SNP in your DNA.
genotype is how both of your chromosomes read. If you had a SNP that read A/A on chromosome 1 this would mean that on both copies of chromosome 1 in that position there is an A. If you see – that means there is a deletion in that spot.
You can search rsid numbers of interest at https://www.ncbi.nlm.nih.gov/snp/ for more information
Now we have definitions out of the way and have our data and our reference set, now let’s merge the two data frames together.
There are 44808 observations of 46 variables. However many of these will not have any risk or mutations. When we use the word risk here, we are not necessarily using risk in a bad way instead we are saying you have risk for displaying a certain phenotype (the outward expression of your genotype). This could be positive such as intelligence or height!
Why did this go down?
23 and Me has SNPs that may not be in the GWAS database we are using. We are using gwascat which is a R package that serves as an interface to the National Human Genome Research Institute’s (NHGRI) database of genome-wide association studies. At NHGRI, they believe that advances in genomics research are transforming our understanding of human health and disease. Building on their leadership role in the initial sequencing of the human genome, they collaborate with the scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives. At NHGRI, they are empowering and expanding the field of genomics.
#Cleaning up data and separating out genotype into separate fields
join_data$risk_allele_clean <- str_sub(join_data$STRONGEST.SNP.RISK.ALLELE,
-1
)
join_data$my_allele_1 <- str_sub(join_data$genotype, 1, 1)
join_data$my_allele_2 <- str_sub(join_data$genotype, 2, 2)
join_data$have_risk_allele_count <- if_else(
join_data$my_allele_1 == join_data$risk_allele_clean, 1, 0) +
if_else(
join_data$my_allele_2 == join_data$risk_allele_clean, 1, 0
)Let’s filter out SNPs you do not have the risk allele(s) for and look at their disease traits and mapped traits.
#Selecting Variables of Interest and Filtering Data
risk_data<- select(join_data,
have_risk_allele_count,
rsid_num,
my_genotype = genotype,
risk_allele = risk_allele_clean,
DISEASE.TRAIT,
MAPPED_TRAIT,
RISK.ALLELE.FREQUENCY,
REPORTED.GENE.S.) %>%
filter(have_risk_allele_count>0)There is many SNPS that carry a risk allele! Don’t worry your phenotype is based not solely on your genomes but a host of other factors such as lifestyle and environmental factors. Still, this is important to know so let’s look at some plots and tables now.
#Count of Disease Traits
trait_count_mapped <- risk_data %>%
group_by(MAPPED_TRAIT) %>%
summarise(risk_count = n())
#Count of Disease Traits for Risk Allele = 2
trait_count_mapped2 <- risk_data %>%
group_by(MAPPED_TRAIT, have_risk_allele_count) %>%
filter(have_risk_allele_count==2) %>%
summarise(risk_count_2 = n()) %>%
subset(select = -c(have_risk_allele_count))
#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_mapped3<-
merge(trait_count_mapped, trait_count_mapped2,
by = c("MAPPED_TRAIT", "MAPPED_TRAIT"),all=TRUE)
trait_count_mapped3[is.na(trait_count_mapped3)] <- 0
#Risk count for Allele_1 equation
risk_1 <- (trait_count_mapped3$risk_count - trait_count_mapped3$risk_count_2)
#Adding risk_1 column in
trait_count_mapped4 <- select(trait_count_mapped3,
MAPPED_TRAIT,
risk_count,
risk_count_2
)%>%
mutate(risk_1
)
#Adding Overall Risk Column and Equation
overall_risk <- (trait_count_mapped3$risk_count_2 / ((trait_count_mapped3$risk_count_2 + trait_count_mapped4$risk_1*2)))+(trait_count_mapped4$risk_1/((trait_count_mapped3$risk_count_2+trait_count_mapped4$risk_1*2)))
overall_risk_count <- select(trait_count_mapped4,
MAPPED_TRAIT,
risk_count,
risk_1,
risk_count_2,
) %>%
mutate(overall_risk*100)
overall_risk_count[is.na(overall_risk_count)] <- 0
#Get Count of Disease Traits by DISEASE.TRAIT
trait_count_disease <- risk_data %>%
group_by(DISEASE.TRAIT)%>%
summarise(risk_count= n())
#Count of Disease Traits for Risk Allele = 2
trait_count_disease2 <- risk_data %>%
group_by(DISEASE.TRAIT, have_risk_allele_count) %>%
filter(have_risk_allele_count==2) %>%
summarise(risk_count_2 = n()) %>%
subset(select = -c(have_risk_allele_count))
#Mutating Risk Counts of Allele =1 and allele = 2
trait_count_disease3<-
merge(trait_count_disease, trait_count_disease2,
by = c("DISEASE.TRAIT", "DISEASE.TRAIT"),all=TRUE)
trait_count_disease3[is.na(trait_count_disease3)] <- 0
#Risk count for Allele_1 equation
risk_d_1 <- (trait_count_disease3$risk_count - trait_count_disease3$risk_count_2)
#Adding risk_1 column in
trait_count_disease4 <- select(trait_count_disease3,
DISEASE.TRAIT,
risk_count,
risk_count_2
)%>%
mutate(risk_d_1
)
#Adding Overall Risk Column and Equation
overall_risk_d <- (trait_count_disease3$risk_count_2 / ((trait_count_disease3$risk_count_2 + trait_count_disease4$risk_d_1*2)))+(trait_count_disease4$risk_d_1/((trait_count_disease3$risk_count_2+trait_count_disease4$risk_d_1*2)))
overall_risk_count_d <- select(trait_count_disease4,
DISEASE.TRAIT,
risk_count,
risk_1= "risk_d_1",
risk_count_2,
) %>%
mutate(overall_risk_d*100)
overall_risk_count[is.na(overall_risk_count)] <- 0When grouped by MAPPED_TRAIT instead of DISEASE.TRAIT we can get rid of redundant entries. However, we can lose some of the story there. I want to take a second and define some terms I am using here:
overall_risk means based what proportion of 1 risk allele mutations to 2 risk allele mutations you have for that trait, the equation is discussed later on. We are finding what is the percentage of risk you genetically carry of 1 risk allele to 2 risk alleles versus the potential you could have had. (Meaning the the sum of 1 risk alleles you have times two plus the sum of 2 risk alleles you have, why multiply by 2? We multiply by two because you have the potential to have this SNP mutation on two chromosomes)
risk_1 means total number of 1 risk alleles you have in a certain trait.
risk_count_2 means total number of 2 risk alleles you have in a certain trait.
risk_count is the total number of risk alleles you have in a certain trait.
#Table of Disease Traits with over 100 risk alleles for MAPPED_TRAIT
table1 <-overall_risk_count %>%
filter(risk_count > 99) %>%
arrange(risk_count) %>%
kable () %>%
kable_paper("hover", full_width=F)
table1| MAPPED_TRAIT | risk_count | risk_1 | risk_count_2 | overall_risk * 100 |
|---|---|---|---|---|
| ulcerative colitis | 100 | 53 | 47 | 65.35948 |
| smoking behavior, BMI-adjusted waist circumference | 105 | 64 | 41 | 62.13018 |
| glomerular filtration rate | 107 | 72 | 35 | 59.77654 |
| asthma | 111 | 48 | 63 | 69.81132 |
| urate measurement | 112 | 73 | 39 | 60.54054 |
| BMI-adjusted waist circumference | 114 | 68 | 46 | 62.63736 |
| age at menarche | 115 | 62 | 53 | 64.97175 |
| mathematical ability | 117 | 63 | 54 | 65.00000 |
| pulse pressure measurement | 122 | 67 | 55 | 64.55026 |
| coronary artery disease | 126 | 61 | 65 | 67.37968 |
| colorectal cancer | 128 | 88 | 40 | 59.25926 |
| bone density | 134 | 57 | 77 | 70.15707 |
| Crohn’s disease | 134 | 68 | 66 | 66.33663 |
| self reported educational attainment | 154 | 87 | 67 | 63.90041 |
| C-reactive protein measurement | 160 | 87 | 73 | 64.77733 |
| heel bone mineral density | 164 | 60 | 104 | 73.21429 |
| blood metabolite measurement | 177 | 100 | 77 | 63.89892 |
| physical activity measurement, body mass index | 179 | 68 | 111 | 72.46964 |
| waist-hip ratio | 187 | 92 | 95 | 67.02509 |
| BMI-adjusted waist circumference, physical activity measurement | 188 | 117 | 71 | 61.63934 |
| diastolic blood pressure | 196 | 107 | 89 | 64.68647 |
| prostate carcinoma | 197 | 119 | 78 | 62.34177 |
| systolic blood pressure | 215 | 129 | 86 | 62.50000 |
| rheumatoid arthritis | 218 | 114 | 104 | 65.66265 |
| smoking behavior, body mass index | 228 | 107 | 121 | 68.05970 |
| schizophrenia | 265 | 127 | 138 | 67.60204 |
| serum IgG glycosylation measurement | 272 | 160 | 112 | 62.96296 |
| breast carcinoma | 285 | 156 | 129 | 64.62585 |
| BMI-adjusted waist-hip ratio | 409 | 204 | 205 | 66.72104 |
| total cholesterol measurement | 479 | 254 | 225 | 65.34789 |
| type II diabetes mellitus | 498 | 333 | 165 | 59.92780 |
| body height | 510 | 317 | 193 | 61.66868 |
| low density lipoprotein cholesterol measurement | 517 | 281 | 236 | 64.78697 |
| triglyceride measurement | 559 | 328 | 231 | 63.02142 |
| high density lipoprotein cholesterol measurement | 671 | 396 | 275 | 62.88660 |
| body mass index | 869 | 445 | 424 | 66.13394 |
| blood protein measurement | 1055 | 884 | 171 | 54.40949 |
Very informative, we found overall-risk by taking \[overall \ risk = (risk \ count_2/((risk \ count_1* 2) + risk \ count_2)) + (risk \ count_1/ ((risk \ count_1* 2) + risk \ count_2))\], this is giving the proportion of double alleles (carrying a double risk allele) plus proportion of single alleles (carrying only 1 copy of risk allele) to the total risk alleles you could have gotten.
Let’s take a moment to discuss the importance of knowing this information but not being coming a hypochondriac. This data is showing your overall risk for a disease GENETICALLY, however, your phenotypes (expression of genetic traits) is influences by more than just your genetics. Lifestyle choices such as smoking, drinking, working out, eating healthy and environmental factors such as do you live in a city or country side, all play a role in what will be expressed in your life time. Even, the bacteria in your gut play a role in this! So, while this information is truly useful to know that you have the GENETIC POTENTIAL to express this outward traits and can be discussed with your doctor for preventative screenings or measurements, don’t have a heart attack when you still all the data.
#Table of Disease Traits with over 100 risk alleles for DISEASE.TRAIT
table2 <-overall_risk_count_d %>%
filter(risk_count > 99) %>%
arrange(risk_count) %>%
kable () %>%
kable_paper("hover", full_width=F)
table2| DISEASE.TRAIT | risk_count | risk_1 | risk_count_2 | overall_risk_d * 100 |
|---|---|---|---|---|
| Asthma | 100 | 45 | 55 | 68.96552 |
| LDL cholesterol | 100 | 74 | 26 | 57.47126 |
| Ulcerative colitis | 100 | 53 | 47 | 65.35948 |
| Body mass index (joint analysis main effects and smoking interaction) | 101 | 48 | 53 | 67.78523 |
| BMI (adjusted for smoking behaviour) | 104 | 49 | 55 | 67.97386 |
| Urate levels | 114 | 73 | 41 | 60.96257 |
| Menarche (age at onset) | 115 | 62 | 53 | 64.97175 |
| Pulse pressure | 115 | 63 | 52 | 64.60674 |
| Waist-to-hip ratio adjusted for BMI | 119 | 61 | 58 | 66.11111 |
| Coronary artery disease | 121 | 62 | 59 | 66.12022 |
| HDL cholesterol | 127 | 97 | 30 | 56.69643 |
| Colorectal cancer | 128 | 88 | 40 | 59.25926 |
| C-reactive protein levels | 132 | 66 | 66 | 66.66667 |
| Crohn’s disease | 134 | 68 | 66 | 66.33663 |
| LDL cholesterol levels | 135 | 75 | 60 | 64.28571 |
| Blood metabolite levels | 144 | 77 | 67 | 65.15837 |
| Triglyceride levels | 152 | 83 | 69 | 64.68085 |
| Heel bone mineral density | 163 | 59 | 104 | 73.42342 |
| Waist circumference adjusted for body mass index | 169 | 105 | 64 | 61.67883 |
| HDL cholesterol levels | 182 | 96 | 86 | 65.46763 |
| Waist-hip ratio | 186 | 91 | 95 | 67.14801 |
| Prostate cancer | 188 | 114 | 74 | 62.25166 |
| Diastolic blood pressure | 189 | 106 | 83 | 64.06780 |
| Rheumatoid arthritis | 201 | 106 | 95 | 65.47231 |
| Systolic blood pressure | 205 | 126 | 79 | 61.93353 |
| Waist-to-hip ratio adjusted for body mass index | 215 | 100 | 115 | 68.25397 |
| IgG glycosylation | 233 | 123 | 110 | 65.44944 |
| Schizophrenia | 265 | 127 | 138 | 67.60204 |
| Low density lipoprotein cholesterol levels | 270 | 121 | 149 | 69.05371 |
| Breast cancer | 278 | 154 | 124 | 64.35185 |
| High density lipoprotein cholesterol levels | 348 | 191 | 157 | 64.56401 |
| Triglycerides | 381 | 222 | 159 | 63.18408 |
| Total cholesterol levels | 399 | 191 | 208 | 67.62712 |
| Type 2 diabetes | 470 | 319 | 151 | 59.56907 |
| Height | 503 | 313 | 190 | 61.64216 |
| Body mass index | 846 | 412 | 434 | 67.24960 |
| Blood protein levels | 1126 | 950 | 176 | 54.23892 |
See this gives a rounder picture of what is going on. Breast cancer is one to keep in mind, since I carry a significant risk I need to take this into mind. Possibly early screenings?
#Summarize proportion of SNPs for a given trait where you have a risk allele
trait_snp_proportion <- filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
summarise(count_of_snps = n_distinct(rsid_num)) %>%
mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
filter(you_have_risk_allele == 1) %>%
arrange(desc(proportion_of_snps_for_trait)) %>%
ungroup()
# Count the studies per trait in the database
trait_study_count <- filter(join_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
group_by(DISEASE.TRAIT) %>%
summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))
# Merge the above together
trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")
# Plot the traits where there were more than 2 studies and you have risk alleles for more than 80% of the SNPs studied
ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(title = "Traits I have more than half of the risk\nalleles studied where > 1 studies involved",
y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
theme(legend.position="bottom") If the bar is gray we do not have a risk allele frequency for that category to calculate currently. This is where research comes into play! Allele Frequency Definition An allele at a particular locus may also confer some fitness effect for an individual carrying that allele, on which natural selection acts. Beneficial alleles tend to increase in frequency, while deleterious alleles tend to decrease in frequency in the population.
write.table(join_data, file="~/Downloads/All_SNP_Results.txt", sep = "\t" )
write.table(overall_risk_count, file="~/Downloads/MAPPED_OVERALL_RISK.txt", sep = "\t" )
write.table(overall_risk_count_d, file="~/Downloads/DISEASE_OVERALL_RISK.txt", sep = "\t" )
write.table(risk_data, file="~/Downloads/RISK_SNPs.txt", sep = "\t" )