Introduction

This code compiles summary information about the gene ACKR1 (Atypical chemokine receptor 1). The protein encoded by the ACKR1 gene is a glycosylated membrane protein and a non-specific receptor for several chemokines. The protein is the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi. Polymorphisms in this gene are the basis of the Duffy blood group system

It also generates alignments and a phylogeneitc tree to indicating the evolutionary relationship betweeen the human version of the gene and its homologs in other species.

Resources / References

Key information use to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/2532 * Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/48067

Other resources consulted includes

Other interesting resources and online tools include:

Preparation

# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(pander)
library(ggplot2) 

# Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:base':
## 
##     strsplit
library(drawProteins)
## Biostrings
library(Biostrings)
library(HGNChelper)

Accession numbers

Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The the Neanderthal genome database was searched and did yield sequence information.

A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine if it occurred outside of vertebreates. The gene does not appear in non-vertebrates and so a second search was conducted to exclude mammals.

Accession Table

ACKR1_table<-c("NP_002027.2",   "Q16570","4NUU","Homo sapiens" ,    "Human",      "ACKR1",
              "XP_016785546.1","A0A2J8K5T9",    "NA", "Pan troglodytes" , "Chimpanzee","ACKR1",
              "NP_034175.2",   "Q9QUI6","NA","Mus musculus",      "Mouse"    ,"ACKR1",
              "XP_002728079.3","NA","NA","Rattus norvegicus", "Rat",      "ACKR1",
              "NP_477500.1","O16868","NA","Drosophila melanogaster","Fruit fly",     "fy",
              "NP_001015634.1","Q9GLX0","NA","Bos taurus",       "Cattle",     "ACKR1", 
              "XP_001117200.1","NA",    "NA", "Macaca mulatta" , "Rheus monkey","ACKR1", 
              "XP_005641007.1","NA",    "NA", "Canis lupus" , "Dog","ACKR1", 
              "XP_003821027.1","A0A2R9AUQ0",    "NA", "Pan paniscus" , "Bonobo","ACKR1", 
              "ENSP00000352341","Q16570",    "4NUU", "Homo neanderthalensis" , "Neaderthal","ACKR1")
              
RefSeq<-c("NP_002027.2","XP_016785546.1", "NP_034175.2", "XP_002728079.3","NP_477500.1", "NP_001015634.1", "XP_001117200.1", "XP_005641007.1", "XP_003821027.1", "ENSP00000352341" )
UniProt<- c("Q16570","A0A2J8K5T9", "Q9QUI6", "NA", "O16868", "Q9GLX0", "NA", "NA","A0A2R9AUQ0", "Q16570")
PDB<-c("4NUU", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "4NUU")
Scientific_Name<-c("Homo sapiens","Pan troglodytes", "Mus musculus", "Rattus norvegicus", "Drosophila melanogaster", "Bos taurus","Macaca mulatta", "Canis lupus",  "Pan paniscus", "Homo neanderthalensis")
Common_Name<-c("Human","Chimpanzee", "Mouse", "Rat", "Fruit fly","Cattle", "Rheus monkey", "Dog", "Bonobo", "Neaderthal")
Gene_Name<- c("ACKR1", "ACKR1", "ACKR1", "ACKR1", "fy", "ACKR1", "ACKR1", "ACKR1", "ACKR1", "ACKR1" )
ACKR1.df<- data.frame(RefSeq=RefSeq,
                      UniProt=UniProt,
                      PDB=PDB,
                      Scientific_Name=Scientific_Name,
                      Common_Name=Common_Name,
                      Gene_Name=Gene_Name)
pander::pander(ACKR1.df)
Table continues below
RefSeq UniProt PDB Scientific_Name Common_Name
NP_002027.2 Q16570 4NUU Homo sapiens Human
XP_016785546.1 A0A2J8K5T9 NA Pan troglodytes Chimpanzee
NP_034175.2 Q9QUI6 NA Mus musculus Mouse
XP_002728079.3 NA NA Rattus norvegicus Rat
NP_477500.1 O16868 NA Drosophila melanogaster Fruit fly
NP_001015634.1 Q9GLX0 NA Bos taurus Cattle
XP_001117200.1 NA NA Macaca mulatta Rheus monkey
XP_005641007.1 NA NA Canis lupus Dog
XP_003821027.1 A0A2R9AUQ0 NA Pan paniscus Bonobo
ENSP00000352341 Q16570 4NUU Homo neanderthalensis Neaderthal
Gene_Name
ACKR1
ACKR1
ACKR1
ACKR1
fy
ACKR1
ACKR1
ACKR1
ACKR1
ACKR1

Data prepartation

Download sequences

All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.

ACKR1_human<- rentrez::entrez_fetch(id = "NP_002027.2",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_chimpanzee<- rentrez::entrez_fetch(id = "XP_016785546.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_mouse<- rentrez::entrez_fetch(id = "NP_034175.2",
                                      db = "protein", 
                                      rettype="fasta") 
ACKR1_rat<- rentrez::entrez_fetch(id = "XP_002728079.3",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_fruitfly<- rentrez::entrez_fetch(id = "NP_477500.1",
                                      db = "protein", 
                                      rettype="fasta")   
ACKR1_cattle<- rentrez::entrez_fetch(id = "NP_001015634.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_rheusmonkey<- rentrez::entrez_fetch(id = "XP_001117200.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_dog<- rentrez::entrez_fetch(id = "XP_005641007.1",
                                      db = "protein", 
                                      rettype="fasta") 
ACKR1_bonobo<- rentrez::entrez_fetch(id = "XP_003821027.1",
                                      db = "protein", 
                                      rettype="fasta") 
#ACKR1_neanderthal<- rentrez::entrez_fetch(id = "ENSP00000352341",   #Won't work 
                                      #db = "protein", 
                                      #rettype="fasta")                                     

Clean and set up sequence as vector.

ACKR1_human_vector<- fasta_cleaner(ACKR1_human)
ACKR1_chimpanzee_vector<- fasta_cleaner(ACKR1_chimpanzee)
ACKR1_mouse_vector<- fasta_cleaner(ACKR1_mouse)
ACKR1_rat_vector<- fasta_cleaner(ACKR1_rat)
ACKR1_fruitfly_vector<- fasta_cleaner(ACKR1_fruitfly)
ACKR1_cattle_vector<- fasta_cleaner(ACKR1_cattle)
ACKR1_rheusmonkey_vector<- fasta_cleaner(ACKR1_rheusmonkey)
ACKR1_dog_vector<- fasta_cleaner(ACKR1_dog)
ACKR1_bonobo_vector<- fasta_cleaner(ACKR1_bonobo)
#ACKR1_neanderthal_vector<- fasta_cleaner(ACKR1_neanderthal)

Confirm set up.

str(ACKR1_human_vector)
##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
str(ACKR1_chimpanzee_vector)
##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
str(ACKR1_mouse_vector)
##  chr [1:334] "M" "G" "N" "C" "L" "Y" "P" "V" "E" "N" "L" "S" "L" "D" "K" ...
str(ACKR1_rat_vector)
##  chr [1:331] "M" "G" "N" "C" "L" "H" "P" "V" "E" "R" "F" "S" "V" "D" "N" ...
str(ACKR1_fruitfly_vector)
##  chr [1:416] "M" "S" "I" "Y" "L" "L" "C" "L" "T" "T" "N" "G" "G" "L" "P" ...
str(ACKR1_cattle_vector)
##  chr [1:330] "M" "G" "N" "C" "L" "Y" "P" "V" "A" "D" "D" "N" "S" "T" "K" ...
str(ACKR1_rheusmonkey_vector)
##  chr [1:335] "M" "G" "N" "C" "L" "H" "P" "A" "E" "L" "S" "P" "S" "T" "Q" ...
str(ACKR1_dog_vector)
##  chr [1:433] "M" "G" "N" "C" "L" "Q" "Q" "Q" "A" "A" "D" "E" "A" "S" "T" ...
str(ACKR1_bonobo_vector)
##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...
# str(ACKR1_neanderthal_vector)

General Protein information

Protein diagram

Building a protein diagram

Q16570_json  <- drawProteins::get_features("Q16570 ")
## [1] "Download has worked"
is(Q16570_json)
## [1] "list"             "vector"           "list_OR_List"     "vector_OR_Vector"
## [5] "vector_OR_factor"

Then the raw data from the webpage is converted to a dataframe

my_prot_df <- drawProteins::feature_to_dataframe(Q16570_json)
is(my_prot_df)
## [1] "data.frame"       "list"             "oldClass"         "vector"          
## [5] "list_OR_List"     "vector_OR_Vector" "vector_OR_factor"
my_prot_df[,-2]
##                     type begin end length accession   entryName taxid order
## featuresTemp       CHAIN     1 336    335    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.1  TOPO_DOM     1  63     62    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.2  TRANSMEM    64  84     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.3  TOPO_DOM    85  95     10    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.4  TRANSMEM    96 116     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.5  TOPO_DOM   117 129     12    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.6  TRANSMEM   130 153     23    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.7  TOPO_DOM   154 166     12    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.8  TRANSMEM   167 187     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.9  TOPO_DOM   188 207     19    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.10 TRANSMEM   208 228     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.11 TOPO_DOM   229 244     15    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.12 TRANSMEM   245 265     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.13 TOPO_DOM   266 287     21    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.14 TRANSMEM   288 308     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.15 TOPO_DOM   309 336     27    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.16 CARBOHYD    16  16      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.17 CARBOHYD    33  33      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.18 DISULFID    51 276    225    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.19 DISULFID   129 195     66    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.20  VAR_SEQ     1   7      6    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.21  VARIANT    42  42      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.22  VARIANT    89  89      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.23  VARIANT   100 100      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.24  VARIANT   203 203      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.25  VARIANT   326 326      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.26    HELIX    22  28      6    Q16570 ACKR1_HUMAN  9606     1

Domains Present

Protein has no domains, motif, but has helical folds

my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)

my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas

##Draw dotplot ### Prepare data

par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1: ACKR1_human - Defaults
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "ACKR1_human Defaults")

# plot - size = 10, nmatch = 1
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = " ACKR1_human- size = 10, nmatch = 1")

# plot 3: ACKR1_human - size = 10, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10, 
        nmatch = 5, 
        main = "ACKR1_human - size = 10, nmatch = 5")

# plot 4: size = 20, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10,
        nmatch = 5,
        main = "ACKR1_human - size = 20, nmatch = 5")

# reset par() - run this or other plots will be small!
par(mfrow = c(1,1), 
    mar = c(4,4,4,4))
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "ACKR1_human Defaults")

## Protein properties compiled from databases

Pfam<- c("Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively.")
UniProt<- c("UniProt sub-cellular locations:Early Endosome and recycling endosome")
Alphafold<- c("Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90)")
pp.df<-data.frame (Pfam=Pfam,
                  UniProt=UniProt,
                  Alphafold=Alphafold)
pander::pander(pp.df)
Table continues below
Pfam UniProt
Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively. UniProt sub-cellular locations:Early Endosome and recycling endosome
Alphafold
Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90)

Predict Protein Folf

Chou and Zhang Data

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)

# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)

# check against chou's total
sum(beta) == 2776
## [1] TRUE
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
# Calculate proportions for each of the four protein fold types

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b<- a.div.b/sum(a.div.b)

#Create a dataframe

#dataframe
aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
#row labels
row.names(aa.prop)<-aa.1.1
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
aa.1.1 alpha beta a.plus.b a.div.b
A 285 203 175 0.08331
R 53 67 78 0.03369
N 97 139 120 0.04223
D 163 121 111 0.05631
C 22 75 74 0.01454
Q 67 122 74 0.02631
E 134 86 86 0.05931
G 197 297 171 0.08701
H 111 49 33 0.02469
I 91 120 93 0.05516
L 221 177 110 0.07824
K 249 115 112 0.07408
M 48 16 25 0.021
F 123 85 52 0.03646
P 82 127 71 0.04339
S 122 341 126 0.07547
T 119 253 117 0.05493
W 33 44 30 0.01662
Y 63 110 108 0.03
V 167 229 123 0.08724
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

Functions to calculate similarities Two custom functions are needed: one to calculate correlates between two columns of our table, and one to calculate correlation similarities.

}

# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)}
  
#cor(chou_cor$result,chou_cosine$my.cosine)

Percent Identity Comparisons (PID)

Multiple sequence alignment

MSA data preparation

For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.

ACKR1_vector<-c(ACKR1_bonobo_vector, ACKR1_cattle_vector, ACKR1_chimpanzee_vector, ACKR1_dog_vector, ACKR1_fruitfly_vector, ACKR1_human_vector, ACKR1_mouse_vector, ACKR1_rat_vector, ACKR1_rheusmonkey_vector)
ACKR1_vector_ss <- Biostrings::AAStringSet(ACKR1_vector)

Building Multiple Sequence Alignment (MSA)

ACKR1_matrixmatrix <- matrix(ACKR1_vector)

#Cleaning / setting up an MSA
#msa produces a species MSA objects

#ACKR1_align <- msa(ACKR1_vector_ss,
                     #method = "ClustalW")
#class(ACKR1_align)