This code compiles summary information about the gene ACKR1 (Atypical chemokine receptor 1). The protein encoded by the ACKR1 gene is a glycosylated membrane protein and a non-specific receptor for several chemokines. The protein is the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi. Polymorphisms in this gene are the basis of the Duffy blood group system
It also generates alignments and a phylogeneitc tree to indicating the evolutionary relationship betweeen the human version of the gene and its homologs in other species.
Key information use to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/2532 * Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/48067
Other resources consulted includes
Other interesting resources and online tools include:
# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library(pander)
library(ggplot2)
# Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
library(drawProteins)
## Biostrings
library(Biostrings)
library(HGNChelper)
Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The the Neanderthal genome database was searched and did yield sequence information.
A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine if it occurred outside of vertebreates. The gene does not appear in non-vertebrates and so a second search was conducted to exclude mammals.
ACKR1_table<-c("NP_002027.2", "Q16570","4NUU","Homo sapiens" , "Human", "ACKR1",
"XP_016785546.1","A0A2J8K5T9", "NA", "Pan troglodytes" , "Chimpanzee","ACKR1",
"NP_034175.2", "Q9QUI6","NA","Mus musculus", "Mouse" ,"ACKR1",
"XP_002728079.3","NA","NA","Rattus norvegicus", "Rat", "ACKR1",
"NP_477500.1","O16868","NA","Drosophila melanogaster","Fruit fly", "fy",
"NP_001015634.1","Q9GLX0","NA","Bos taurus", "Cattle", "ACKR1",
"XP_001117200.1","NA", "NA", "Macaca mulatta" , "Rheus monkey","ACKR1",
"XP_005641007.1","NA", "NA", "Canis lupus" , "Dog","ACKR1",
"XP_003821027.1","A0A2R9AUQ0", "NA", "Pan paniscus" , "Bonobo","ACKR1",
"ENSP00000352341","Q16570", "4NUU", "Homo neanderthalensis" , "Neaderthal","ACKR1")
RefSeq<-c("NP_002027.2","XP_016785546.1", "NP_034175.2", "XP_002728079.3","NP_477500.1", "NP_001015634.1", "XP_001117200.1", "XP_005641007.1", "XP_003821027.1", "ENSP00000352341" )
UniProt<- c("Q16570","A0A2J8K5T9", "Q9QUI6", "NA", "O16868", "Q9GLX0", "NA", "NA","A0A2R9AUQ0", "Q16570")
PDB<-c("4NUU", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "4NUU")
Scientific_Name<-c("Homo sapiens","Pan troglodytes", "Mus musculus", "Rattus norvegicus", "Drosophila melanogaster", "Bos taurus","Macaca mulatta", "Canis lupus", "Pan paniscus", "Homo neanderthalensis")
Common_Name<-c("Human","Chimpanzee", "Mouse", "Rat", "Fruit fly","Cattle", "Rheus monkey", "Dog", "Bonobo", "Neaderthal")
Gene_Name<- c("ACKR1", "ACKR1", "ACKR1", "ACKR1", "fy", "ACKR1", "ACKR1", "ACKR1", "ACKR1", "ACKR1" )
ACKR1.df<- data.frame(RefSeq=RefSeq,
UniProt=UniProt,
PDB=PDB,
Scientific_Name=Scientific_Name,
Common_Name=Common_Name,
Gene_Name=Gene_Name)
pander::pander(ACKR1.df)
| RefSeq | UniProt | PDB | Scientific_Name | Common_Name |
|---|---|---|---|---|
| NP_002027.2 | Q16570 | 4NUU | Homo sapiens | Human |
| XP_016785546.1 | A0A2J8K5T9 | NA | Pan troglodytes | Chimpanzee |
| NP_034175.2 | Q9QUI6 | NA | Mus musculus | Mouse |
| XP_002728079.3 | NA | NA | Rattus norvegicus | Rat |
| NP_477500.1 | O16868 | NA | Drosophila melanogaster | Fruit fly |
| NP_001015634.1 | Q9GLX0 | NA | Bos taurus | Cattle |
| XP_001117200.1 | NA | NA | Macaca mulatta | Rheus monkey |
| XP_005641007.1 | NA | NA | Canis lupus | Dog |
| XP_003821027.1 | A0A2R9AUQ0 | NA | Pan paniscus | Bonobo |
| ENSP00000352341 | Q16570 | 4NUU | Homo neanderthalensis | Neaderthal |
| Gene_Name |
|---|
| ACKR1 |
| ACKR1 |
| ACKR1 |
| ACKR1 |
| fy |
| ACKR1 |
| ACKR1 |
| ACKR1 |
| ACKR1 |
| ACKR1 |
All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.
ACKR1_human<- rentrez::entrez_fetch(id = "NP_002027.2",
db = "protein",
rettype="fasta")
ACKR1_chimpanzee<- rentrez::entrez_fetch(id = "XP_016785546.1",
db = "protein",
rettype="fasta")
ACKR1_mouse<- rentrez::entrez_fetch(id = "NP_034175.2",
db = "protein",
rettype="fasta")
ACKR1_rat<- rentrez::entrez_fetch(id = "XP_002728079.3",
db = "protein",
rettype="fasta")
ACKR1_fruitfly<- rentrez::entrez_fetch(id = "NP_477500.1",
db = "protein",
rettype="fasta")
ACKR1_cattle<- rentrez::entrez_fetch(id = "NP_001015634.1",
db = "protein",
rettype="fasta")
ACKR1_rheusmonkey<- rentrez::entrez_fetch(id = "XP_001117200.1",
db = "protein",
rettype="fasta")
ACKR1_dog<- rentrez::entrez_fetch(id = "XP_005641007.1",
db = "protein",
rettype="fasta")
ACKR1_bonobo<- rentrez::entrez_fetch(id = "XP_003821027.1",
db = "protein",
rettype="fasta")
#ACKR1_neanderthal<- rentrez::entrez_fetch(id = "ENSP00000352341", #Won't work
#db = "protein",
#rettype="fasta")
Clean and set up sequence as vector.
ACKR1_human_vector<- fasta_cleaner(ACKR1_human)
ACKR1_chimpanzee_vector<- fasta_cleaner(ACKR1_chimpanzee)
ACKR1_mouse_vector<- fasta_cleaner(ACKR1_mouse)
ACKR1_rat_vector<- fasta_cleaner(ACKR1_rat)
ACKR1_fruitfly_vector<- fasta_cleaner(ACKR1_fruitfly)
ACKR1_cattle_vector<- fasta_cleaner(ACKR1_cattle)
ACKR1_rheusmonkey_vector<- fasta_cleaner(ACKR1_rheusmonkey)
ACKR1_dog_vector<- fasta_cleaner(ACKR1_dog)
ACKR1_bonobo_vector<- fasta_cleaner(ACKR1_bonobo)
#ACKR1_neanderthal_vector<- fasta_cleaner(ACKR1_neanderthal)
Confirm set up.
str(ACKR1_human_vector)
## chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
str(ACKR1_chimpanzee_vector)
## chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
str(ACKR1_mouse_vector)
## chr [1:334] "M" "G" "N" "C" "L" "Y" "P" "V" "E" "N" "L" "S" "L" "D" "K" ...
str(ACKR1_rat_vector)
## chr [1:331] "M" "G" "N" "C" "L" "H" "P" "V" "E" "R" "F" "S" "V" "D" "N" ...
str(ACKR1_fruitfly_vector)
## chr [1:416] "M" "S" "I" "Y" "L" "L" "C" "L" "T" "T" "N" "G" "G" "L" "P" ...
str(ACKR1_cattle_vector)
## chr [1:330] "M" "G" "N" "C" "L" "Y" "P" "V" "A" "D" "D" "N" "S" "T" "K" ...
str(ACKR1_rheusmonkey_vector)
## chr [1:335] "M" "G" "N" "C" "L" "H" "P" "A" "E" "L" "S" "P" "S" "T" "Q" ...
str(ACKR1_dog_vector)
## chr [1:433] "M" "G" "N" "C" "L" "Q" "Q" "Q" "A" "A" "D" "E" "A" "S" "T" ...
str(ACKR1_bonobo_vector)
## chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
# str(ACKR1_neanderthal_vector)
Q16570_json <- drawProteins::get_features("Q16570 ")
## [1] "Download has worked"
is(Q16570_json)
## [1] "list" "vector" "list_OR_List" "vector_OR_Vector"
## [5] "vector_OR_factor"
Then the raw data from the webpage is converted to a dataframe
my_prot_df <- drawProteins::feature_to_dataframe(Q16570_json)
is(my_prot_df)
## [1] "data.frame" "list" "oldClass" "vector"
## [5] "list_OR_List" "vector_OR_Vector" "vector_OR_factor"
my_prot_df[,-2]
## type begin end length accession entryName taxid order
## featuresTemp CHAIN 1 336 335 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.1 TOPO_DOM 1 63 62 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.2 TRANSMEM 64 84 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.3 TOPO_DOM 85 95 10 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.4 TRANSMEM 96 116 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.5 TOPO_DOM 117 129 12 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.6 TRANSMEM 130 153 23 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.7 TOPO_DOM 154 166 12 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.8 TRANSMEM 167 187 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.9 TOPO_DOM 188 207 19 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.10 TRANSMEM 208 228 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.11 TOPO_DOM 229 244 15 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.12 TRANSMEM 245 265 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.13 TOPO_DOM 266 287 21 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.14 TRANSMEM 288 308 20 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.15 TOPO_DOM 309 336 27 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.16 CARBOHYD 16 16 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.17 CARBOHYD 33 33 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.18 DISULFID 51 276 225 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.19 DISULFID 129 195 66 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.20 VAR_SEQ 1 7 6 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.21 VARIANT 42 42 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.22 VARIANT 89 89 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.23 VARIANT 100 100 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.24 VARIANT 203 203 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.25 VARIANT 326 326 0 Q16570 ACKR1_HUMAN 9606 1
## featuresTemp.26 HELIX 22 28 6 Q16570 ACKR1_HUMAN 9606 1
Protein has no domains, motif, but has helical folds
my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)
my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas
##Draw dotplot ### Prepare data
par(mfrow = c(2,2),
mar = c(0,0,2,1))
# plot 1: ACKR1_human - Defaults
dotPlot(ACKR1_human_vector, ACKR1_human_vector,
wsize = 1,
nmatch = 1,
main = "ACKR1_human Defaults")
# plot - size = 10, nmatch = 1
dotPlot(ACKR1_human_vector, ACKR1_human_vector,
wsize = 10,
nmatch = 1,
main = " ACKR1_human- size = 10, nmatch = 1")
# plot 3: ACKR1_human - size = 10, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector,
wsize = 10,
nmatch = 5,
main = "ACKR1_human - size = 10, nmatch = 5")
# plot 4: size = 20, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector,
wsize = 10,
nmatch = 5,
main = "ACKR1_human - size = 20, nmatch = 5")
# reset par() - run this or other plots will be small!
par(mfrow = c(1,1),
mar = c(4,4,4,4))
dotPlot(ACKR1_human_vector, ACKR1_human_vector,
wsize = 1,
nmatch = 1,
main = "ACKR1_human Defaults")
## Protein properties compiled from databases
Pfam<- c("Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively.")
UniProt<- c("UniProt sub-cellular locations:Early Endosome and recycling endosome")
Alphafold<- c("Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90)")
pp.df<-data.frame (Pfam=Pfam,
UniProt=UniProt,
Alphafold=Alphafold)
pander::pander(pp.df)
| Pfam | UniProt |
|---|---|
| Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively. | UniProt sub-cellular locations:Early Endosome and recycling endosome |
| Alphafold |
|---|
| Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90) |
# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# check against chou's total
sum(beta) == 2776
## [1] TRUE
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
# Calculate proportions for each of the four protein fold types
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b<- a.div.b/sum(a.div.b)
#Create a dataframe
#dataframe
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
#row labels
row.names(aa.prop)<-aa.1.1
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 0.08331 |
| R | 53 | 67 | 78 | 0.03369 |
| N | 97 | 139 | 120 | 0.04223 |
| D | 163 | 121 | 111 | 0.05631 |
| C | 22 | 75 | 74 | 0.01454 |
| Q | 67 | 122 | 74 | 0.02631 |
| E | 134 | 86 | 86 | 0.05931 |
| G | 197 | 297 | 171 | 0.08701 |
| H | 111 | 49 | 33 | 0.02469 |
| I | 91 | 120 | 93 | 0.05516 |
| L | 221 | 177 | 110 | 0.07824 |
| K | 249 | 115 | 112 | 0.07408 |
| M | 48 | 16 | 25 | 0.021 |
| F | 123 | 85 | 52 | 0.03646 |
| P | 82 | 127 | 71 | 0.04339 |
| S | 122 | 341 | 126 | 0.07547 |
| T | 119 | 253 | 117 | 0.05493 |
| W | 33 | 44 | 30 | 0.01662 |
| Y | 63 | 110 | 108 | 0.03 |
| V | 167 | 229 | 123 | 0.08724 |
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
Functions to calculate similarities Two custom functions are needed: one to calculate correlates between two columns of our table, and one to calculate correlation similarities.
}
# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)}
#cor(chou_cor$result,chou_cosine$my.cosine)
For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.
ACKR1_vector<-c(ACKR1_bonobo_vector, ACKR1_cattle_vector, ACKR1_chimpanzee_vector, ACKR1_dog_vector, ACKR1_fruitfly_vector, ACKR1_human_vector, ACKR1_mouse_vector, ACKR1_rat_vector, ACKR1_rheusmonkey_vector)
ACKR1_vector_ss <- Biostrings::AAStringSet(ACKR1_vector)
ACKR1_matrixmatrix <- matrix(ACKR1_vector)
#Cleaning / setting up an MSA
#msa produces a species MSA objects
#ACKR1_align <- msa(ACKR1_vector_ss,
#method = "ClustalW")
#class(ACKR1_align)